
Re: [Xen-devel] [early RFC] ARM PCI Passthrough design document



Hi Roger,

On 06/01/17 15:12, Roger Pau Monné wrote:
On Thu, Dec 29, 2016 at 02:04:15PM +0000, Julien Grall wrote:
Hi all,

The document below is an early version of a design
proposal for PCI Passthrough in Xen. It aims to
describe from a high-level perspective the interaction
with the different subsystems and how guests will be able
to discover and access PCI.

I am aware that a similar design has been posted recently
by Cavium (see [1]), however the approach to expose PCI
to guests is different. We have requests to run unmodified
baremetal OSes on Xen; such a guest would directly
access the devices and no PV drivers would be used.

That's why this design is based on emulating a root controller.
This also has the advantage of keeping the VM interface as close
as possible to baremetal, allowing the guest to use firmware tables
to discover the devices.

Currently on ARM, Xen does not have any knowledge about PCI devices.
This means that the IOMMU and the interrupt controller (such as the ITS),
which require device-specific configuration, will not work with PCI even
with DOM0.

The PCI Passthrough work could be divided into 2 phases:
        * Phase 1: Register all PCI devices in Xen => will allow
                   the ITS and SMMU to be used with PCI in Xen
        * Phase 2: Assign devices to guests

This document aims to describe the 2 phases, but for now only phase
1 is fully described.

I have sent the design document to start gathering feedback on
phase 1.

Thanks, this approach looks quite similar to what I have in mind for PVHv2
DomU/Dom0 pci-passthrough.

Cheers,

[1] https://lists.xen.org/archives/html/xen-devel/2016-12/msg00224.html

========================
% PCI pass-through support on ARM
% Julien Grall <julien.grall@xxxxxxxxxx>
% Draft A

# Preface

This document aims to describe the components required to enable PCI
passthrough on ARM.

This is an early draft and some questions are still unanswered; where this is
the case the text will contain XXX.

# Introduction

PCI passthrough allows giving control of physical PCI devices to guests. This
means that the guest will have full and direct access to the PCI device.

ARM supports one kind of guest, which exploits hardware virtualization
support as much as possible. The guest will rely on PV drivers only
for I/O (e.g. block, network); interrupts will come through the virtualized
interrupt controller. This means that there are no big changes required
within the kernel.

Consequently, it would be possible to replace the PV drivers by assigning real
devices to the guest for I/O access. Xen on ARM would therefore be able to
run an unmodified operating system.

To achieve this goal, it looks more sensible to go towards emulating the
host bridge (we will go into more details later). A guest would be able
to take advantage of the firmware tables, obviating the need for a
Xen-specific driver.

Thus in this document we follow the emulated host bridge approach.

# PCI terminology

Each PCI device under a host bridge is uniquely identified by its Requester ID
(AKA RID). A Requester ID is a triplet of Bus number, Device number, and
Function.

When the platform has multiple host bridges, the software can add a fourth
number called the Segment to differentiate host bridges. A PCI device is
then uniquely identified by segment:bus:device:function (AKA SBDF).

From my reading of the above sentence, this implies that the segment is an
arbitrary number chosen by the OS? Isn't this picked from the MCFG ACPI table?

The number is chosen by the software. In the case of ACPI, it is "hardcoded" in the MCFG table, but for Device Tree this number could be chosen by the OS unless the property "linux,pci-domain" is present.


So given a specific SBDF, it would be possible to find the host bridge and the
RID associated with a PCI device.
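
For illustration, here is a minimal sketch (not taken from the Xen tree; the
type and helper names are made up for this example) of how a RID and an SBDF
are conventionally packed:

#include <stdint.h>

/* Conventional PCI packing: bus is 8 bits, device 5 bits, function 3 bits,
 * so a RID fits in 16 bits and an SBDF in 32 bits once the 16-bit segment
 * is prepended. */
typedef uint16_t pci_rid_t;   /* bus[15:8] | device[7:3] | function[2:0] */
typedef uint32_t pci_sbdf_t;  /* segment[31:16] | RID[15:0] */

static inline pci_rid_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7);
}

static inline pci_sbdf_t make_sbdf(uint16_t segment, pci_rid_t rid)
{
    return ((uint32_t)segment << 16) | rid;
}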

# Interaction of the PCI subsystem with other subsystems

In order to have a PCI device fully working, Xen will need to configure
other subsystems subsytems such as the SMMU and the Interrupt Controller.
                   ^ duplicated.

The interaction expected between the PCI subsystem and the other is:
                                                         ^ this seems quite
                                                         confusing, what's "the
                                                         other"?

By "other" I meant "IOMMU and Interrupt Controller". Would the wording "and the other subsystems" be better?

    * Add a device
    * Remove a device
    * Assign a device to a guest
    * Deassign a device from a guest

XXX: Detail the interaction when assigning/deassigning devices

Assigning a device will probably entail setting up some direct MMIO mappings
(BARs and ROMs) plus a bunch of traps in order to perform emulation of accesses
to the PCI config space (or those can be set up when a new bridge is registered
with Xen).

I am planning to detail the root complex emulation in a separate section. I sent the design document before writing it.

In brief, I would expect the registration of a new bridge to set up the traps to emulate access to the PCI configuration space. On ARM, the first approach will rely on the OS to set up the BARs and ROMs, so they will be mapped by the PCI configuration space emulation.

The reason for relying on the OS to set up the BARs/ROMs is to reduce the work to do for a first version. Otherwise we would have to add code in the toolstack to decide where to place the BARs/ROMs. I don't think it is a lot of work, but it is not that important because it does not require a stable ABI (this is an interaction between the hypervisor and the toolstack). Furthermore, Linux (at least on ARM) assigns the BARs at setup time. From my understanding, this is the expected behavior with both DT (the DT has a property to skip the scan) and ACPI.


The following subsections will briefly describe the interaction from a
higher-level perspective. Implementation details (callbacks, structures...)
are out of scope.

## SMMU

The SMMU will be used to isolate the PCI device when accessing memory
(for instance DMA and MSI doorbells). Often the SMMU will be configured using
a StreamID (SID) that can be deduced from the RID with the help of the firmware
tables (see below).

Whilst in theory all the memory transactions issued by a PCI device should
go through the SMMU, on certain platforms some of the memory transactions may
not reach the SMMU because they are interpreted by the host bridge. For
instance this could happen if the MSI doorbell is built into the PCI host

I would elaborate on what an MSI doorbell is.

I can add an explanation in the glossary.


bridge. See [6] for more details.

XXX: I think this could be solved by using the host memory layout when
creating a guest with PCI devices => Detail it.

I'm not really sure I follow here, but if this write to the MSI doorbell
doesn't go through the SMMU, and instead is handled by the bridge, isn't there
a chance that a guest might be able to write anywhere in physical memory?

The problem is more subtle. On some platforms the MSI doorbell is built into the host bridge. Some of those host bridges will intercept any access to this doorbell coming from the PCI devices and interpret it directly rather than going through the SMMU.

This means that the physical address of the MSI doorbell will always be interpreted. Even if the guest is using an intermediate address, it will be treated as a physical address because the SMMU has been bypassed.

Furthermore, some platforms may have other sets of addresses not going through the SMMU (such as P2P traffic). So we have to prevent mapping anything in those regions.


Or does this only happen when a guest writes to an MSI doorbell that's trapped by
the bridge and not forwarded anywhere else?

See above.


## Interrupt controller

PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On ARM,
legacy interrupts will be mapped to SPIs. MSI and MSI-X will be
mapped to either SPIs or LPIs.

Whilst SPIs can be programmed using an interrupt number, LPIs can be
identified via a pair (DeviceID, EventID) when configure through the ITS.
                                                          ^d


The DeviceID is a unique identifier for each MSI-capable device that can
be deduced from the RID with the help of the firmware tables (see below).

XXX: Figure out if something is necessary for GICv2m

# Information available in the firmware tables

## ACPI

### Host bridges

The static table MCFG (see 4.2 in [1]) will describe the host bridges available
at boot and supporting ECAM. Unfortunately there are platforms out there
(see [2]) that re-use MCFG to describe host bridge that are not fully ECAM
                                                    ^s

compatible.
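
To illustrate what "fully ECAM compatible" means, below is a minimal sketch of
the standard ECAM offset computation (each function gets a 4KB configuration
window); the helper name is made up for this example:

#include <stdint.h>

/* Standard ECAM layout: the offset of a configuration register is fully
 * determined by the bus/device/function numbers and the register offset. */
static inline uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn,
                                   uint16_t reg)
{
    return ((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
           ((uint64_t)fn << 12) | reg;
}

/* For example, the vendor ID of device 0000:00:01.0 on a host bridge whose
 * configuration space is mapped at cfg_base would be read at:
 *     cfg_base + ecam_offset(0, 1, 0, 0x00)
 * A host bridge that is not fully ECAM compatible breaks this simple scheme
 * (for instance by restricting the allowed access sizes), hence the quirks
 * described below. */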

This means that Xen needs to account for possible quirks in the host bridge.
The Linux community is working on a patch series (see [2] and [3]) where
quirks will be detected with the following fields (an illustrative matching
sketch is given after the list):
    * OEM ID
    * OEM Table ID
    * OEM Revision
    * PCI Segment (from _SEG)
    * PCI bus number range (from _CRS, wildcard allowed)
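
As an illustration only, a quirk table keyed on those fields could look like
the following sketch (structure and helper names are hypothetical, not the
actual Linux or Xen code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct mcfg_quirk {
    char oem_id[6];          /* OEM ID, fixed-width as in the ACPI header */
    char oem_table_id[8];    /* OEM Table ID, fixed-width */
    uint32_t oem_revision;   /* OEM Revision */
    uint16_t segment;        /* PCI segment, from _SEG */
    uint8_t bus_start;       /* bus range from _CRS; 0-255 acts as wildcard */
    uint8_t bus_end;
};

static bool mcfg_quirk_match(const struct mcfg_quirk *q,
                             const char *oem_id, const char *oem_table_id,
                             uint32_t oem_revision,
                             uint16_t segment, uint8_t bus)
{
    return memcmp(q->oem_id, oem_id, 6) == 0 &&
           memcmp(q->oem_table_id, oem_table_id, 8) == 0 &&
           q->oem_revision == oem_revision &&
           q->segment == segment &&
           bus >= q->bus_start && bus <= q->bus_end;
}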

So the segment and bus number range need to be fetched from ACPI objects? Is that
because the information in the MCFG is lacking/wrong?

All the host bridges will be described in ASL. Only the ones available at boot will be described in the MCFG. So it looks more sensible to rely on the ASL from Linux's POV.



Based on what Linux is currently doing, there are two kinds of quirks:
    * Accesses to the configuration space of certain sizes are not allowed
    * A specific driver is necessary for driving the host bridge

Hm, so what are the issues that make these bridges need specific drivers?

This might be quite problematic if you also have to emulate this broken
behavior inside of Xen (because Dom0 is using a specific driver).

I am not expecting to emulate the configuration space access for DOM0. I know you mentioned that it would be necessary to hide PCI devices used by Xen (such as the UART) from DOM0, or to configure MSIs. But on ARM, the UART is integrated in the SoC and MSIs will be configured through the interrupt controller.


The former is straightforward to solve; the latter will require more thought.
Instantiation of a specific driver for the host controller can be easily done
if Xen has the information to detect it. However, those drivers may require
resources described in ASL (see [4] for instance).

XXX: Need more investigation to know whether the missing information should
be passed by DOM0 or hardcoded in the driver.

... or poke the ThunderX guys with a pointy stick until they get their act
together.

I would love to do that, but the platform is already out. So I am afraid we have to deal with it.

Although I am hoping *fingers crossed* that future platforms will be fully ECAM compliant.


### Finding the StreamID and DeviceID

The static table IORT (see [5]) will provide information that will help to
deduce the StreamID and DeviceID from a given RID.

## Device Tree

### Host bridges

Each Device Tree node associated with a host bridge will have at least the
following properties (see bindings in [8]):
    - device_type: will always be "pci".
    - compatible: a string indicating which driver to instantiate

The node may also contain optional properties such as:
    - linux,pci-domain: assigns a fixed segment number
    - bus-range: indicates the range of bus numbers supported

When the property linux,pci-domain is not present, the operating system would
have to allocate the segment number for each host bridge. Because the
algorithm to allocate the segment is not specified, it is necessary for
DOM0 and Xen to agree on the number before any PCI device is added.

Since this is all static, can't Xen just assign segment and bus-ranges for
bridges that lack them? (also why it's "linux,pci-domain", instead of just
"pci-domain"?)

I am not the one who decided the name of those properties. This is from the existing binding in Linux (I thought it was obvious with the link [8] to the binding).

Usually any property added by the Linux community (i.e. not part of the Open Firmware standards) will be prefixed with "linux,". So I would rather avoid re-using it.

The lack of bus-ranges is not an issue because it has been formalized in the binding: "If absent, defaults to <0 255> (i.e all buses)".


### Finding the StreamID and DeviceID

### StreamID

The first existing binding for the SMMU (see [9]) didn't have a way to describe
the relationship between the RID and the StreamID; it was assumed that
StreamID == RequesterID.
This binding has now been deprecated in favor of a generic binding (see [10])
which uses the property "iommu-map" to describe the relationship between
a RID, the associated IOMMU and the StreamID.

### DeviceID

The relationship between the RID and the DeviceID can be found using the
property "msi-map" (see [11]).

# Discovering PCI devices

Whilst PCI devices are currently available in DOM0, the hypervisor does not
have any knowledge of them. The first step of supporting PCI passthrough is
to make Xen aware of the PCI devices.

Xen will require access to the PCI configuration space to retrieve information
for the PCI devices or access it on behalf of the guest via the emulated

I know this is not the intention, but the above sentence makes it look like
Xen is using an emulated host bridge IMHO (although I'm not a native speaker
anyway, so I can be wrong).

How about "Xen will require access to the host PCI configuration space..."?


host bridge.

## Discovering and registering host bridges

Neither ACPI nor Device Tree provides enough information to fully
instantiate a host bridge driver. In the case of ACPI, some data may come
from ASL, whilst for Device Tree the segment number is not available.

For device-tree can't you just add a pci-domain to each bridge device on the DT
if none is specified?

The "linux,pci-domain" is a Linux specific property. We've been avoided to re-use linux specific property recently (see the case of xen,uefi-*). So we would have to introduce a new one.

For ACPI I understand that it's harder. Maybe ARM can somehow assure that MCFG
tables completely describe the system, so that you don't need this anymore.

This is not ARM-specific, it comes from the spec. The PCI spec specifies that the MCFG will only describe host bridges available at boot. The rest will be in ASL.


So Xen needs to rely on DOM0 to discover the host bridges and notify Xen
with all the relevant information. This will be done via a new hypercall
PHYSDEVOP_pci_host_bridge_add. The layout of the structure will be:

struct physdev_pci_host_bridge_add
{
    /* IN */
    uint16_t seg;
    /* Range of buses supported by the host bridge */
    uint8_t  bus_start;
    uint8_t  bus_nr;
    uint32_t res0;  /* Padding */
    /* Information about the configuration space region */
    uint64_t cfg_base;
    uint64_t cfg_size;
};

Why do you need the cfg_size attribute? Isn't it always going to be 4096 bytes
in size?

The cfg_size is here to help us match the corresponding node in the device tree. The cfg_size may differ depending on how the hardware has implemented access to the configuration space.

But to be fair, I think we can deal without this property. For ACPI, the size will vary depending on the number of buses handled and can be deduced. For DT, the base address and bus range should be enough to find the associated node.


If that field is removed you could use the PHYSDEVOP_pci_mmcfg_reserved
hypercall.

DOM0 will issue the hypercall PHYSDEVOP_pci_host_bridge_add for each host
bridge available on the platform. When Xen receives the hypercall, the
driver associated with the host bridge will be instantiated.
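
As a purely illustrative sketch of the expected flow (assuming the structure
above, a Linux-style HYPERVISOR_physdev_op() wrapper and the proposed
PHYSDEVOP_pci_host_bridge_add command are available; the ECAM base/size values
are placeholders, not from a real platform):

static int register_host_bridge_with_xen(void)
{
    struct physdev_pci_host_bridge_add add = {
        .seg       = 0,
        .bus_start = 0,
        .bus_nr    = 255,             /* bus range handled by the bridge */
        .cfg_base  = 0x40000000UL,    /* placeholder ECAM base address */
        .cfg_size  = 256UL << 20,     /* placeholder: 1MB per bus, 256 buses */
    };

    /* On success Xen instantiates the driver matching this host bridge. */
    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_host_bridge_add, &add);
}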

XXX: Shall we limit DOM0's access to the configuration space from that
moment?

Most definitely yes, you should instantiate an emulated bridge over the real
one, in order to proxy Dom0 accesses to the PCI configuration space. You for
example don't want Dom0 moving the position of the BARs of PCI devices without
Xen being aware (and properly changing the second stage translation).

The problem is that on ARM we don't have a single way to access the configuration space. So we would need different emulators in Xen, which I don't like unless there is a strong reason to do it.

We could prevent DOM0 from modifying the position of the BARs after setup. I also remember you mentioned MSI configuration; on ARM this is done via the interrupt controller.


## Discovering and registering PCI devices

Similarly to x86, PCI devices will be discovered by DOM0 and registered
using the hypercalls PHYSDEVOP_pci_add_device or PHYSDEVOP_manage_pci_add_ext.

Why do you need this? If you have access to the bridges you can scan them from
Xen and discover the devices AFAICT.

I am a bit confused. Are you saying that you plan to ditch them for PVH? If so, why are they called by Linux today?


By default all the PCI devices will be assigned to DOM0. So Xen would have
to configure the SMMU and Interrupt Controller to allow DOM0 to use the PCI
devices. As mentioned earlier, those subsystems will require the StreamID
and DeviceID. Both can be deduced from the RID.

XXX: How to hide PCI devices from DOM0?

By adding the ACPI namespace of the device to the STAO and blocking Dom0
access to this device in the emulated bridge that Dom0 will have access to
(returning 0xFFFF when Dom0 tries to read the vendor ID from the PCI header).

Sorry, I was not clear here. By hiding, I meant DOM0 not instantiating a driver (similar to xen-pciback.hide). We still want DOM0 to access the PCI config space in order to reset the device. Unless you plan to import all the reset quirks into Xen?

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

