Xen project Mailing List

The document below is an RFC version of a design proposal for PCI Passthrough in Xen on ARM. It aims to describe, from a high-level perspective, the interaction with the different subsystems and how guests will be able to discover and access PCI devices.

Currently on ARM, Xen does not have any knowledge about PCI devices. This means that IOMMUs and interrupt controllers (such as ITS) requiring specific configuration will not work with PCI, even with DOM0.

The PCI Passthrough work could be divided into 2 phases:

* Phase 1: Register all PCI devices in Xen => will allow the use of ITS and SMMU with PCI in Xen
* Phase 2: Assign devices to guests

This document aims to describe the 2 phases, but for now only phase 1 is fully described.

I think I was able to gather all of the feedback and come up with a solution that will satisfy all the parties. The design document has changed quite a lot compared to the early draft sent a few months ago. The major changes are:

* Provide more details on how PCI works on ARM and the interactions with the MSI controller and IOMMU
* Provide details on the existing host bridge implementations
* Give more explanation and justification of the approach chosen
* Describe the hypercalls used and how they should be called

Feedback is welcome.

Cheers,

--------------------------------------------------------------------------------

% PCI pass-through support on ARM
% Julien Grall <julien.grall@xxxxxxxxxx>
% Draft B

# Preface

This document aims to describe the components required to enable PCI pass-through on ARM.

This is an early draft and some questions are still unanswered. When this is the case, the text will contain XXX.

# Introduction

PCI pass-through allows the guest to receive full control of physical PCI devices. This means the guest will have full and direct access to the PCI device.

ARM supports a kind of guest that exploits the virtualization support in hardware as much as possible.
The guest will rely on PV drivers only for I/O (e.g. block, network), and interrupts will come through the virtualized interrupt controller; therefore no big changes are required within the kernel.

As a consequence, it would be possible to replace PV drivers by assigning real devices to the guest for I/O access. Xen on ARM would therefore be able to run unmodified operating systems.

To achieve this goal, it looks more sensible to go towards emulating the host bridge (there will be more details later). A guest would be able to take advantage of the firmware tables, obviating the need for a specific driver for Xen.

Thus, in this document we follow the emulated host bridge approach.

# PCI terminologies

Each PCI device under a host bridge is uniquely identified by its Requester ID (AKA RID). A Requester ID is a triplet of Bus number, Device number, and Function.

When the platform has multiple host bridges, the software can add a fourth number called Segment (sometimes called Domain) to differentiate host bridges. A PCI device will then be uniquely identified by segment:bus:device:function (AKA SBDF).

So given a specific SBDF, it would be possible to find the host bridge and the RID associated with a PCI device. The pair (host bridge, RID) will often be used to find the relevant information for configuring the different subsystems (e.g. IOMMU, MSI controller). For convenience, the rest of the document will use SBDF to refer to the pair (host bridge, RID).

# PCI host bridge

A PCI host bridge enables data transfer between a host processor and PCI bus based devices. The bridge is used to access the configuration space of each PCI device and, on some platforms, may also act as an MSI controller.

## Initialization of the PCI host bridge

Whilst it would be expected that the bootloader takes care of initializing the PCI host bridge, on some platforms it is done in the Operating System.

This may include enabling/configuring the clocks that could be shared among multiple devices.
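To make the SBDF notation from "PCI terminologies" concrete, here is a minimal sketch of one possible packing into a 32-bit value. The type `sbdf_t` and the helper names are hypothetical and only for illustration; the field widths (8-bit bus, 5-bit device, 3-bit function, 16-bit segment) follow the usual PCI conventions.

```c
#include <stdint.h>

/*
 * Hypothetical packed SBDF (illustration only):
 *   segment [31:16] | bus [15:8] | device [7:3] | function [2:0]
 * The low 16 bits are the Requester ID (RID).
 */
typedef uint32_t sbdf_t;

static inline sbdf_t sbdf_make(uint16_t seg, uint8_t bus,
                               uint8_t dev, uint8_t fn)
{
    return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
           ((uint32_t)(dev & 0x1f) << 3) | (uint32_t)(fn & 0x7);
}

/* The segment selects the host bridge. */
static inline uint16_t sbdf_segment(sbdf_t s) { return (uint16_t)(s >> 16); }

/* The RID identifies the device under that host bridge. */
static inline uint16_t sbdf_rid(sbdf_t s) { return (uint16_t)(s & 0xffff); }

static inline uint8_t sbdf_bus(sbdf_t s) { return (s >> 8) & 0xff; }
static inline uint8_t sbdf_device(sbdf_t s) { return (s >> 3) & 0x1f; }
static inline uint8_t sbdf_function(sbdf_t s) { return s & 0x7; }
```

With such a packing, splitting an SBDF into the pair (host bridge, RID) is just a matter of taking the upper and lower 16 bits.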
## Accessing PCI configuration space

Accessing the PCI configuration space can be divided into 2 categories:

* Indirect access, where the configuration spaces are multiplexed. An example would be the legacy method on x86 (e.g. 0xcf8 and 0xcfc). On ARM a similar method is used by the PCIe RCar root complex (see [12]).
* ECAM access, where each configuration space has its own address space.

Whilst ECAM is a standard, some PCI host bridges will require specific fiddling when accessing the registers (see thunder-ecam [13]).

In most cases, accessing all the PCI configuration spaces under a given PCI host will be done the same way (i.e. either indirect access or ECAM access). However, there are a few cases, dependent on the PCI devices accessed, which will use different methods (see thunder-pem [14]).

## Generic host bridge

For the purpose of this document, the term "generic host bridge" will be used to describe any host bridge that is ECAM-compliant and whose initialization, if required, has already been done by the firmware/bootloader.

# Interaction of the PCI subsystem with other subsystems

In order to have a PCI device fully working, Xen will need to configure other subsystems such as the IOMMU and the Interrupt Controller.

The interactions expected between the PCI subsystem and the other subsystems are:

* Add a device
* Remove a device
* Assign a device to a guest
* Deassign a device from a guest

XXX: Detail the interaction when assigning/deassigning device

In the following subsections, the interactions will be briefly described from a higher level perspective. However, implementation details such as callbacks, structures, etc. are beyond the scope of this document.

## IOMMU

The IOMMU will be used to isolate the PCI device when accessing the memory (e.g. DMA and MSI doorbells). Often the IOMMU will be configured using a MasterID (aka StreamID for ARM SMMU) that can be deduced from the SBDF with the help of the firmware tables (see below).
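As a rough illustration of what "deducing" a MasterID from the RID can look like, the sketch below performs a masked range lookup. It is modelled on the Device Tree "iommu-map" and "iommu-map-mask" properties discussed in the firmware tables section; the structure and function names are hypothetical, and the reference to the target IOMMU that a real map entry carries is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One hypothetical map entry: RIDs in [rid_base, rid_base + length)
 * translate to out_base + (rid - rid_base). */
struct rid_map_entry {
    uint32_t rid_base;
    uint32_t out_base;
    uint32_t length;
};

/*
 * Translate a RID into a MasterID using a table of ranges.
 * 'mask' is applied to the RID before the lookup.
 * Returns false when no entry covers the RID.
 */
static bool rid_translate(const struct rid_map_entry *map, size_t n,
                          uint32_t mask, uint16_t rid, uint32_t *out)
{
    uint32_t masked = rid & mask;

    for (size_t i = 0; i < n; i++) {
        if (masked >= map[i].rid_base &&
            masked - map[i].rid_base < map[i].length) {
            *out = masked - map[i].rid_base + map[i].out_base;
            return true;
        }
    }
    return false;
}
```

The same kind of lookup would apply when deducing the data needed by the MSI controller (such as a DeviceID), only with a different table.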
Whilst in theory all the memory transactions issued by a PCI device should go through the IOMMU, on certain platforms some of the memory transactions may not reach the IOMMU because they are interpreted by the host bridge. For instance, this could happen if the MSI doorbell is built into the PCI host bridge or for P2P traffic. See [6] for more details.

XXX: I think this could be solved by using direct mapping (e.g. GFN == MFN); this would mean the guest memory layout would be similar to the host one when PCI devices are passed through => Detail it.

## Interrupt controller

PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their payload to a doorbell belonging to an MSI controller.

### Existing MSI controllers

In this section some of the existing controllers and their interaction with the devices will be briefly described. More details can be found in the respective specifications of each MSI controller.

MSIs can be distinguished by some combination of:

* the Doorbell
    It is the MMIO address written to. Devices may be configured by software to write to arbitrary doorbells which they can address. An MSI controller may feature a number of doorbells.
* the Payload
    Devices may be configured to write an arbitrary payload chosen by software. MSI controllers may have restrictions on the permitted payload. Xen will have to sanitize the payload unless it is known to be always safe.
* Sideband information accompanying the write
    Typically this is neither configurable nor probeable, and depends on the path taken through the memory system (i.e. it is a property of the combination of MSI controller and device rather than a property of either in isolation).

### GICv3/GICv4 ITS

The Interrupt Translation Service (ITS) is an MSI controller designed by ARM and integrated in the GICv3/GICv4 interrupt controller. For the specification see [GICV3].
Each MSI/MSI-X will be mapped to a new type of interrupt called LPI. This interrupt will be configured by the software using a pair (DeviceID, EventID).

A platform may have multiple ITS blocks (e.g. one per NUMA node), each of them belonging to an ITS group.

The DeviceID is a unique identifier within an ITS group for each MSI-capable device that can be deduced from the RID with the help of the firmware tables (see below).

The EventID is a unique identifier to distinguish the different events sent by a device.

The MSI payload will only contain the EventID, as the DeviceID will be added afterwards by the hardware in a way that will prevent any tampering.

The [SBSA] appendix I describes the set of rules for the integration of the ITS that any compliant platform should follow. Some of the rules explain the security implications of misbehaving devices. They ensure that a guest will never be able to trigger an MSI on behalf of another guest.

XXX: The security implication is described in the [SBSA] but I haven't found any similar wording in the GICv3 specification. It is unclear to me if non-SBSA compliant platforms (e.g. embedded) will follow those rules.

### GICv2m

The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique interrupts. The specification can be found in the [SBSA] appendix E.

Depending on the platform, the GICv2m will provide one or multiple instances of register frames. Each frame is composed of a doorbell and is associated with a set of SPIs that can be discovered by reading the register MSI_TYPER.

On an MSI write, the payload will contain the SPI ID to generate. Note that on some platforms the MSI payload may contain an offset from the base SPI rather than the SPI itself.

The frame will only generate an SPI if the written value corresponds to an SPI allocated to the frame. Each VM should have exclusive access to the frame to ensure isolation and prevent a guest OS from triggering an MSI on behalf of another guest OS.
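The frame behaviour described above can be sketched as a simple range check, which is also the kind of check Xen would apply when sanitizing a payload. The MSI_TYPER field layout used here (base SPI in bits [25:16], number of SPIs in bits [9:0]) is an assumption taken from the Linux GICv2m driver and should be checked against [SBSA] appendix E; the structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MSI_TYPER layout (as used by the Linux GICv2m driver). */
#define V2M_TYPER_BASE_SHIFT 16
#define V2M_TYPER_BASE_MASK  0x3ffu
#define V2M_TYPER_NUM_MASK   0x3ffu

/* Hypothetical description of the SPIs owned by one frame. */
struct v2m_frame {
    uint32_t spi_base; /* first SPI allocated to the frame */
    uint32_t nr_spis;  /* number of SPIs allocated to the frame */
};

/* Decode the SPI range of a frame from the value read in MSI_TYPER. */
static struct v2m_frame v2m_decode_typer(uint32_t typer)
{
    struct v2m_frame f;

    f.spi_base = (typer >> V2M_TYPER_BASE_SHIFT) & V2M_TYPER_BASE_MASK;
    f.nr_spis = typer & V2M_TYPER_NUM_MASK;
    return f;
}

/* A payload is acceptable only if it names an SPI owned by the frame.
 * This assumes the payload carries the SPI ID itself, not an offset. */
static bool v2m_payload_allowed(const struct v2m_frame *f, uint32_t payload)
{
    return payload >= f->spi_base && payload - f->spi_base < f->nr_spis;
}
```

On platforms where the payload is an offset from the base SPI, the check would instead compare the payload directly against the number of SPIs.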
XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding, it is still unclear how we should proceed on Xen, as GICv2m should be safe as long as the frame is only accessed by one guest.

### Other MSI controllers

Servers compliant with SBSA level 1 and higher will have to use either ITS or GICv2m. However, these are by no means the only MSI controllers available. A hardware vendor may decide to use a custom MSI controller, which can be integrated in the PCI host bridge.

Whether it will be possible to write an MSI securely will depend on the MSI controller implementation.

XXX: I am happy to give a brief explanation of more MSI controllers (such as Xilinx and Renesas) if people think it is necessary.

This design document does not pertain to a specific MSI controller and will try to be as agnostic as possible. When possible, it will give insight into how to integrate the MSI controller.

# Information available in the firmware tables

## ACPI

### Host bridges

The static table MCFG (see 4.2 in [1]) will describe the host bridges available at boot and supporting ECAM. Unfortunately, there are platforms out there (see [2]) that re-use MCFG to describe host bridges that are not fully ECAM compatible.

This means that Xen needs to account for possible quirks in the host bridge. The Linux community is working on a patch series for this, see [2] and [3], where quirks will be detected with:

* OEM ID
* OEM Table ID
* OEM Revision
* PCI Segment
* PCI bus number range (wildcard allowed)

Based on what Linux is currently doing, there are two kinds of quirks:

* Accesses to the configuration space of certain sizes are not allowed
* A specific driver is necessary for driving the host bridge

The former is straightforward to solve but the latter will require more thought. Instantiation of a specific driver for the host controller can be easily done if Xen has the information to detect it. However, those drivers may require resources described in ASL (see [4] for instance).
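As a sketch of how the detection keys listed above could be used, the following looks up a quirk entry from the MCFG header fields and the host bridge's segment and bus range. The structure, the function name and the idea of attaching a driver name to each entry are all hypothetical; only the list of matching keys comes from the text above, and the 6-byte and 8-byte sizes are those of the ACPI table header OEM fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical quirk table entry. */
struct mcfg_quirk {
    char oem_id[7];        /* 6 characters + NUL */
    char oem_table_id[9];  /* 8 characters + NUL */
    uint32_t oem_revision;
    uint16_t segment;
    uint8_t bus_start;     /* 0x00-0xff acts as a wildcard range */
    uint8_t bus_end;
    const char *driver;    /* hypothetical name of the driver to use */
};

/*
 * Return the first quirk matching the MCFG header fields and fully
 * covering the bus range of the host bridge, or NULL if none applies.
 */
static const struct mcfg_quirk *
mcfg_find_quirk(const struct mcfg_quirk *tbl, size_t n,
                const char *oem_id, const char *oem_table_id,
                uint32_t oem_revision, uint16_t segment,
                uint8_t bus_start, uint8_t bus_end)
{
    for (size_t i = 0; i < n; i++) {
        const struct mcfg_quirk *q = &tbl[i];

        if (strncmp(q->oem_id, oem_id, 6) == 0 &&
            strncmp(q->oem_table_id, oem_table_id, 8) == 0 &&
            q->oem_revision == oem_revision &&
            q->segment == segment &&
            bus_start >= q->bus_start && bus_end <= q->bus_end)
            return q;
    }
    return NULL;
}
```

A host bridge for which no entry is found would be treated as a generic, ECAM-compliant host bridge.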
The number of platforms requiring a specific PCI host bridge driver is currently limited. Whilst it is not possible to predict the future, upcoming platforms are expected to have fully ECAM compliant PCI host bridges. Therefore, given that Xen does not have any ASL parser, the approach suggested is to hardcode the missing values. This could be revisited in the future if necessary.

### Finding information to configure IOMMU and MSI controller

The static table [IORT] will provide information that will help to deduce data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI controller from a given SBDF.

## Finding which NUMA node a PCI device belongs to

On NUMA systems, the NUMA node associated with a PCI device can be found using the _PXM method of the host bridge (?).

XXX: I am not entirely sure where the _PXM will be (i.e. host bridge vs PCI device).

## Device Tree

### Host bridges

Each Device Tree node associated with a host bridge will have at least the following properties (see bindings in [8]):

- device_type: will always be "pci".
- compatible: a string indicating which driver to instantiate

The node may also contain optional properties such as:

- linux,pci-domain: assign a fixed segment number
- bus-range: indicate the range of bus numbers supported

When the property linux,pci-domain is not present, the operating system would have to allocate the segment number for each host bridge.

### Finding information to configure IOMMU and MSI controller

### Configuring the IOMMU

The Device Tree provides a generic IOMMU binding (see [10]) which uses the properties "iommu-map" and "iommu-map-mask" to describe the relationship between RID and MasterID.

These properties will be present in the host bridge Device Tree node. From a given SBDF, it will be possible to find the corresponding MasterID.

Note that the ARM SMMU also has a legacy binding (see [9]), but it does not have a way to describe the relationship between RID and StreamID.
Instead, it assumes that StreamID == RID. This binding has now been deprecated in favor of the generic IOMMU binding.

### Configuring the MSI controller

The relationship between the RID and the data required to configure the MSI controller (such as DeviceID) can be found using the property "msi-map" (see [11]).

This property will be present in the host bridge Device Tree node. From a given SBDF, it will be possible to find the corresponding DeviceID.

## Finding which NUMA node a PCI device belongs to

On NUMA systems, the NUMA node associated with a PCI device can be found using the property "numa-node-id" (see [15]) present in the host bridge Device Tree node.

# Discovering PCI devices

Whilst PCI devices are currently available in the hardware domain, the hypervisor does not have any knowledge of them. The first step of supporting PCI pass-through is to make Xen aware of the PCI devices.

Xen will require access to the PCI configuration space to retrieve information for the PCI devices or access it on behalf of the guest via the emulated host bridge.

This means that Xen should be in charge of controlling the host bridge. However, for some host controllers, this may be difficult to implement in Xen because of dependencies on other components (e.g. clocks, see more details in the "PCI host bridge" section).

For this reason, the approach chosen in this document is to let the hardware domain discover the host bridges, scan the PCI devices and then report everything to Xen. This does not rule out the possibility of doing everything without the help of the hardware domain in the future.

## Who is in charge of the host bridge?

There are numerous implementations of host bridges on ARM. Some of them require a specific driver as they cannot be driven by a generic host bridge driver. Porting those drivers may be complex due to dependencies on other components.

This could be seen as a signal to leave the host bridge drivers in the hardware domain.
Because Xen would need to access the configuration space, all the accesses would have to be forwarded to the hardware domain, which in turn would access the hardware.

In this design document, we are considering that the host bridge driver can be ported to Xen. In the case where this is not possible, an interface to forward configuration space accesses would need to be defined. The interface details are out of scope.

## Discovering and registering host bridge

The approach taken in the document will require communication between Xen and the hardware domain. In this case, they would need to agree on the segment number associated with a host bridge. However, this number is not available in the Device Tree case.

The hardware domain will register new host bridges using the existing hypercall PHYSDEVOP_pci_mmcfg_reserved:

    #define XEN_PCI_MMCFG_RESERVED 1

    struct physdev_pci_mmcfg_reserved {
        /* IN */
        uint64_t address;
        uint16_t segment;
        /* Range of bus supported by the host bridge */
        uint8_t start_bus;
        uint8_t end_bus;

        uint32_t flags;
    };

Some of the host bridges may not have a separate configuration address space region described in the firmware tables. To simplify the registration, the field 'address' should contain the base address of one of the regions described in the firmware tables.

* For ACPI, it would be the base address specified in the MCFG or in the _CBA method.
* For Device Tree, this would be any base address of a region specified in the "reg" property.

The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.

It is expected that this hypercall is called before any PCI device is registered with Xen.

When the hardware domain is in charge of the host bridge, this hypercall will be used to tell Xen about the existence of a host bridge in order to find the associated information for configuring the MSI controller and the IOMMU.

## Discovering and registering PCI devices

The hardware domain will scan the host bridge to find the list of PCI devices available and then report it to Xen using the existing hypercall PHYSDEVOP_pci_device_add:

    #define XEN_PCI_DEV_EXTFN   0x1
    #define XEN_PCI_DEV_VIRTFN  0x2
    #define XEN_PCI_DEV_PXM     0x4

    struct physdev_pci_device_add {
        /* IN */
        uint16_t seg;
        uint8_t bus;
        uint8_t devfn;
        uint32_t flags;
        struct {
            uint8_t bus;
            uint8_t devfn;
        } physfn;
        /*
         * Optional parameters array.
         * First element ([0]) is PXM domain associated with the device (if
         * XEN_PCI_DEV_PXM is set)
         */
        uint32_t optarr[0];
    };

When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the NUMA node ID associated with the device:

* For ACPI, it would be the value returned by the method _PXM
* For Device Tree, this would be the value found in the property "numa-node-id".

For more details see the section "Finding which NUMA node a PCI device belongs to" in "ACPI" and "Device Tree".

XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN will work. AFAICT, the former is used when the bus supports ARI and the only usage is in the x86 IOMMU code. For the latter, this is related to IOV but I am not sure what devfn and physfn.devfn will correspond to.

Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However, they are a subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested to leave them unimplemented on ARM.

## Removing PCI devices

The hardware domain will be in charge of notifying Xen that a device has been removed, using the existing hypercall PHYSDEVOP_pci_device_remove:

    struct physdev_pci_device {
        /* IN */
        uint16_t seg;
        uint8_t bus;
        uint8_t devfn;
    };

Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove) to remove PCI devices. However, it does not allow passing a segment number. Therefore, it is suggested to leave it unimplemented on ARM.
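The ordering constraint between the two registration steps (host bridges first, then devices) can be pictured with a small bookkeeping sketch on the Xen side. Everything below is hypothetical and only illustrates the expected behaviour: the function names, the fixed-size table and the plain -1 error value are not taken from Xen.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BRIDGES 8 /* arbitrary limit for the sketch */

/* Hypothetical record kept for each registered host bridge. */
struct bridge {
    uint64_t address;
    uint16_t segment;
    uint8_t start_bus;
    uint8_t end_bus;
};

static struct bridge bridges[MAX_BRIDGES];
static size_t nr_bridges;

/* What handling a host bridge registration could boil down to. */
static int bridge_register(uint64_t address, uint16_t segment,
                           uint8_t start_bus, uint8_t end_bus)
{
    if (start_bus > end_bus || nr_bridges == MAX_BRIDGES)
        return -1;

    bridges[nr_bridges++] =
        (struct bridge){ address, segment, start_bus, end_bus };
    return 0;
}

/* Find the host bridge covering a given segment and bus, if any. */
static const struct bridge *bridge_find(uint16_t segment, uint8_t bus)
{
    for (size_t i = 0; i < nr_bridges; i++) {
        if (bridges[i].segment == segment &&
            bus >= bridges[i].start_bus && bus <= bridges[i].end_bus)
            return &bridges[i];
    }
    return NULL;
}

/* A device can only be added once its host bridge is known. */
static int device_add(uint16_t segment, uint8_t bus, uint8_t devfn)
{
    (void)devfn; /* per-device state is out of scope for the sketch */
    return bridge_find(segment, bus) ? 0 : -1;
}
```

With this model, a device reported before its host bridge is rejected, which is why the hardware domain is expected to register every host bridge before scanning and reporting the devices behind it.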

# Glossary

ECAM: Enhanced Configuration Access Mechanism

SBDF: Segment Bus Device Function. The segment is a software concept.

MSI: Message Signaled Interrupt

MSI doorbell: MMIO address written to by a device to generate an MSI

SPI: Shared Peripheral Interrupt

LPI: Locality-specific Peripheral Interrupt

ITS: Interrupt Translation Service

# Specifications

[SBSA] ARM-DEN-0029 v3.0
[GICV3] IHI0069C
[IORT] DEN0049B

# Bibliography

[1] PCI firmware specification, rev 3.2
[2] https://www.spinics.net/lists/linux-pci/msg56715.html
[3] https://www.spinics.net/lists/linux-pci/msg56723.html
[4] https://www.spinics.net/lists/linux-pci/msg56728.html
[6] https://www.spinics.net/lists/kvm/msg140116.html
[7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
[8] Documentation/devicetree/bindings/pci
[9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
[10] Documentation/devicetree/bindings/pci/pci-iommu.txt
[11] Documentation/devicetree/bindings/pci/pci-msi.txt
[12] drivers/pci/host/pcie-rcar.c
[13] drivers/pci/host/pci-thunder-ecam.c
[14] drivers/pci/host/pci-thunder-pem.c
[15] Documentation/devicetree/bindings/numa.txt
