
Re: [Xen-devel] Xen virtual IOMMU high level design doc



On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> Hi All:
>      The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look; your comments are much appreciated.
> This design doesn't cover the changes needed when the root port is moved
> into the hypervisor. We may design that part later.

Hi,

I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen,
so guests will essentially create Intel IOMMU style page tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor ops for QEMU's Xen dummy IOMMU queries would not really be used.
Do I understand this correctly?

Has a platform-agnostic PV-IOMMU been considered to support 2-stage
translation (i.e. VFIO in the guest)? Perhaps that would hurt map/unmap
performance too much?

Best regards,
Edgar




> 
> 
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
>       1.1 Enable more than 255 vcpus
>       1.2 Support VFIO-based user space driver
>       1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>       2.1 2nd level translation overview
>       2.2 Interrupt remapping overview
> 3. Xen hypervisor
>       3.1 New vIOMMU hypercall interface
>       3.2 2nd level translation
>       3.3 Interrupt remapping
>       3.4 1st level translation
>       3.5 Implementation consideration
> 4. Qemu
>       4.1 Qemu vIOMMU framework
>       4.2 Dummy xen-vIOMMU driver
>       4.3 Q35 vs. i440x
>       4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a single
> VM to meet parallel computing requirements. Supporting more than 255
> vcpus requires an interrupt remapping capability on the vIOMMU so that
> interrupts can be delivered to vcpus with APIC IDs above 255; otherwise a
> Linux guest fails to boot with more than 255 vcpus.
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> This relies on the 2nd level translation capability (IOVA->GPA) of the
> vIOMMU. The pIOMMU 2nd level becomes a shadowing structure of the
> vIOMMU to isolate DMA requests initiated by the user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> This relies on the 1st level translation table capability (GVA->GPA) of
> the vIOMMU. The pIOMMU needs to enable both 1st level and 2nd level
> translation in nested mode (GVA->GPA->HPA) for the passthrough device.
> IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM
> feature). In the future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> ================================================================================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons
>       1) Avoid round trips between Qemu and the Xen hypervisor
>       2) Ease of integration with the rest of the hypervisor
>       3) HVMlite/PVH doesn't use Qemu
> * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercall to
> create/destroy the vIOMMU in the hypervisor and to handle virtual PCI
> devices' 2nd level translation.
> 
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does the translation in
> Qemu via the new hypercall.
> 
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
> into the physical IOMMU.
> 
> The following diagram shows the 2nd level translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview
> Interrupts from virtual devices and physical devices will be delivered
> to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU will remap these
> interrupts during this process.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                               +-------------------+
>                               |   PCI Device      |
>                               +-------------------+
> 
> 
> 
> 
> 
> 3 Xen hypervisor
> ==========================================================================
> 
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall parameter.
> 
> typedef enum {
>       IOMMU_NONE = 0,
>       IOMMU_RO   = 1,
>       IOMMU_WO   = 2,
>       IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> struct xen_sysctl_viommu_op {
>       u32 cmd;
>       u32 domid;
>       union {
>               struct {
>                       u32 capabilities;
>               } query_capabilities;
>               struct {
>                       u32 capabilities;
>                       u64 base_address;
>               } create_iommu;
>               struct {
>                       u8  bus;
>                       u8  devfn;
>                       u64 iova;
>                       u64 translated_addr;
>                       u64 addr_mask; /* Translation page size */
>                       IOMMUAccessFlags permission;
>               } dma_translation_for_vpdev;
>       };
> };
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability            0
> #define XEN_SYSCTL_viommu_create                      1
> #define XEN_SYSCTL_viommu_destroy                     2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev   3
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation   (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation   (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping     (1 << 2)
> 
> 
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>       Get vIOMMU capabilities (1st/2nd level translation and interrupt
> remapping).
> 
> - XEN_SYSCTL_viommu_create
>      Create the vIOMMU in the Xen hypervisor, with dom_id, capabilities and
> register base address as parameters.
> 
> - XEN_SYSCTL_viommu_destroy
>      Destroy the vIOMMU in the Xen hypervisor, with dom_id as parameter.
> 
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>      Translate an IOVA to a GPA for the specified virtual PCI device, with
> dom id, the PCI device's bdf and the IOVA as parameters; the hypervisor
> returns the translated GPA, address mask and access permission.
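> 
> As a rough illustration (not a final implementation), the hypervisor-side
> dispatch for these subops might look like the sketch below. The helpers
> viommu_create(), viommu_destroy() and viommu_translate_vpdev() are
> placeholders for the real vIOMMU code, and the error codes are only
> illustrative.
> 
> static int do_viommu_op(struct xen_sysctl_viommu_op *op)
> {
>     struct domain *d = get_domain_by_id(op->domid);
>     int rc;
> 
>     if ( d == NULL )
>         return -ESRCH;
> 
>     switch ( op->cmd )
>     {
>     case XEN_SYSCTL_viommu_query_capability:
>         /* Report what this build of the vIOMMU can emulate. */
>         op->query_capabilities.capabilities =
>             XEN_VIOMMU_CAPABILITY_2nd_level_translation |
>             XEN_VIOMMU_CAPABILITY_interrupt_remapping;
>         rc = 0;
>         break;
> 
>     case XEN_SYSCTL_viommu_create:
>         rc = viommu_create(d, op->create_iommu.base_address,
>                            op->create_iommu.capabilities);
>         break;
> 
>     case XEN_SYSCTL_viommu_destroy:
>         rc = viommu_destroy(d);
>         break;
> 
>     case XEN_SYSCTL_viommu_dma_translation_for_vpdev:
>         /* Fills in translated_addr, addr_mask and permission in place. */
>         rc = viommu_translate_vpdev(d, &op->dma_translation_for_vpdev);
>         break;
> 
>     default:
>         rc = -EOPNOTSUPP;
>         break;
>     }
> 
>     put_domain(d);
>     return rc;
> }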
> 
> 
> 3.2 2nd level translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates the IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
> 
> 2) For physical PCI device
> DMA operations go through the physical IOMMU directly, so an IO page table
> for IOVA->HPA must be loaded into the physical IOMMU. When the guest
> updates the Second-level Page-table Pointer field, it provides an IO page
> table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level translation
> table, translating GPA->HPA, and write the shadow page table (IOVA->HPA)
> pointer into the Second-level Page-table Pointer of the physical IOMMU's
> context entry.
> 
> Currently all PCI devices in the same hvm domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support the vIOMMU's
> 2nd level translation, the IOMMU driver needs to support multiple address
> spaces per device entry: use the existing IO page table (GPA->HPA) by
> default and switch to the shadow IO page table (IOVA->HPA) when the 2nd
> level translation function is enabled. These changes will not affect the
> current P2M logic.
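> 
> To make the shadowing step more concrete, the sketch below walks a guest
> second-level (IOVA->GPA) table and builds the IOVA->HPA shadow. The
> helpers gpa_to_hpa(), map_guest_page() and alloc_shadow_table() are
> hypothetical, a 4-level 4KiB-page layout is assumed, and large pages and
> invalidation handling are omitted.
> 
> #include <stdint.h>
> 
> #define SL_ENTRIES        512             /* entries per 4KiB table level */
> #define SL_PRESENT(e)     ((e) & 0x3ULL)   /* read/write permission bits */
> #define SL_ADDR(e)        ((e) & ~0xfffULL)
> 
> typedef uint64_t sl_entry_t;
> 
> extern uint64_t gpa_to_hpa(uint64_t gpa);             /* p2m lookup */
> extern sl_entry_t *map_guest_page(uint64_t gpa);      /* map guest table page */
> extern sl_entry_t *alloc_shadow_table(uint64_t *hpa); /* allocate shadow level */
> 
> /* Shadow one level of the guest table; 'level' counts down from 4 to 1. */
> static void shadow_sl_table(uint64_t guest_table_gpa, sl_entry_t *shadow,
>                             int level)
> {
>     sl_entry_t *guest = map_guest_page(guest_table_gpa);
> 
>     for ( int i = 0; i < SL_ENTRIES; i++ )
>     {
>         sl_entry_t ge = guest[i];
> 
>         if ( !SL_PRESENT(ge) )
>             continue;
> 
>         if ( level == 1 )
>         {
>             /* Leaf entry: replace the guest-physical page address with
>              * the host-physical one and keep the permission bits. */
>             shadow[i] = gpa_to_hpa(SL_ADDR(ge)) | (ge & 0xfffULL);
>         }
>         else
>         {
>             uint64_t next_hpa;
>             sl_entry_t *next = alloc_shadow_table(&next_hpa);
> 
>             /* Link the shadow level and recurse into the guest level. */
>             shadow[i] = next_hpa | (ge & 0xfffULL);
>             shadow_sl_table(SL_ADDR(ge), next, level - 1);
>         }
>     }
> }
> 
> In practice the shadow would be built or updated incrementally on guest
> IOTLB/context-cache invalidations rather than walked eagerly.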
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to the vLAPIC from the vIOAPIC and vMSI. Interrupt remapping hooks need to
> be added in vmsi_deliver() and ioapic_deliver() to find the target vLAPIC
> according to the interrupt remapping table. The diagram in section 2.2
> above shows the logic.
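> 
> A rough sketch of such a hook is below. struct virt_irte, its fields, and
> the msi_*/virt_irte_*/vlapic_* helpers are illustrative names rather than
> existing Xen functions; the real code would parse the VT-d remappable MSI
> format and the guest-programmed interrupt remapping table.
> 
> #include <stdint.h>
> #include <stdbool.h>
> 
> struct domain;
> 
> struct virt_irte {
>     bool     present;
>     uint8_t  vector;
>     uint8_t  delivery_mode;
>     uint32_t dest_id;        /* target (v)APIC ID */
>     bool     dest_logical;
> };
> 
> extern bool msi_is_remappable(uint64_t msi_addr);   /* remappable format? */
> extern unsigned int msi_remap_index(uint64_t msi_addr, uint32_t msi_data);
> extern int  virt_irte_fetch(struct domain *d, unsigned int idx,
>                             struct virt_irte *irte);
> extern void vlapic_deliver_to(struct domain *d, uint32_t dest_id,
>                               bool logical, uint8_t delivery_mode,
>                               uint8_t vector);
> 
> static int viommu_remap_and_deliver(struct domain *d,
>                                     uint64_t msi_addr, uint32_t msi_data)
> {
>     struct virt_irte irte;
>     unsigned int idx;
> 
>     if ( !msi_is_remappable(msi_addr) )
>         return -1;   /* caller falls back to the existing direct path */
> 
>     idx = msi_remap_index(msi_addr, msi_data);
>     if ( virt_irte_fetch(d, idx, &irte) || !irte.present )
>         return -1;   /* blocked: no valid entry in the remapping table */
> 
>     /* Deliver according to the remapped vector/destination. */
>     vlapic_deliver_to(d, irte.dest_id, irte.dest_logical,
>                       irte.delivery_mode, irte.vector);
>     return 0;
> }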
> 
> 
> 3.4 1st level translation
> When nested translation is enabled, any address generated by first-level
> translation is used as the input address for nesting with second-level
> translation. The physical IOMMU needs to enable both 1st level and 2nd
> level translation in nested translation mode (GVA->GPA->HPA) for the
> passthrough device.
> 
> The VT-d context entry points to the guest's 1st level translation table,
> which will be nest-translated by the 2nd level translation table and so
> can be linked directly into the context entry of the physical IOMMU.
> 
> To enable 1st level translation in the VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) The GPA root of the guest's 1st level translation table is written to
> the context entry of the physical IOMMU.
> 
> All handling is in the hypervisor; no interaction with Qemu is needed.
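> 
> An illustrative sketch of step 2) is below. struct ctx_entry_fields is
> deliberately symbolic and does not match the real VT-d (extended) context
> entry layout; it only shows which pointers end up where. domain_p2m_root()
> and iommu_write_ctx_entry() are hypothetical helpers.
> 
> #include <stdint.h>
> #include <stdbool.h>
> 
> struct domain;
> 
> struct ctx_entry_fields {
>     uint64_t flpt_ptr;    /* first-level page table pointer */
>     uint64_t slpt_ptr;    /* second-level page table pointer */
>     bool     nested;      /* nested translation enabled */
>     bool     present;
> };
> 
> extern uint64_t domain_p2m_root(struct domain *d);   /* existing GPA->HPA table */
> extern void iommu_write_ctx_entry(uint8_t bus, uint8_t devfn,
>                                   const struct ctx_entry_fields *f);
> 
> static void viommu_enable_first_level(struct domain *d, uint8_t bus,
>                                       uint8_t devfn, uint64_t guest_fl_root_gpa)
> {
>     struct ctx_entry_fields f = {
>         /* The guest hands us a GPA; in nested mode the hardware walks the
>          * first-level table through the second-level translation, so the
>          * guest table root can be linked by its GPA directly. */
>         .flpt_ptr = guest_fl_root_gpa,
>         .slpt_ptr = domain_p2m_root(d),
>         .nested   = true,
>         .present  = true,
>     };
> 
>     iommu_write_ctx_entry(bus, devfn, &f);
> }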
> 
> 
> 3.5 Implementation consideration
> The Linux Intel IOMMU driver will fail to load without 2nd level
> translation support, even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation needs to be
> enabled before the other functions.
> 
> 
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d or
> AMD IOMMU) and report it in the guest ACPI tables. On the Xen side, a
> dummy xen-vIOMMU wrapper is required to connect to the actual vIOMMU in
> Xen, especially for the 2nd level translation of virtual PCI devices,
> because the emulation of virtual PCI devices lives in Qemu. Qemu's vIOMMU
> framework provides a callback to handle the 2nd level translation when
> DMA operations of virtual PCI devices happen.
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via the hypercall.
> 
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with the
> DRHD register base address and the desired capabilities as parameters.
> Destroy the vIOMMU when the VM is shut down.
> 
> 3) Virtual PCI device's 2nd level translation
> Qemu already provides a DMA translation hook which is called when DMA
> translation for a virtual PCI device happens. The dummy xen-vIOMMU passes
> the device bdf and IOVA to the Xen hypervisor via the new iommu hypercall
> and gets back the translated GPA.
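> 
> A sketch of that hook for the dummy xen-vIOMMU is below, assuming Qemu's
> MemoryRegionIOMMUOps-style translate callback. XenVIOMMUState and
> xc_viommu_dma_translate() (a wrapper around the
> XEN_SYSCTL_viommu_dma_translation_for_vpdev subop) are hypothetical.
> 
> static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *iommu, hwaddr addr,
>                                           bool is_write)
> {
>     XenVIOMMUState *s = container_of(iommu, XenVIOMMUState, iommu_mr);
>     uint64_t gpa, mask;
>     uint32_t perm;
>     IOMMUTLBEntry ret = {
>         .target_as = &address_space_memory,
>         .iova = addr,
>         .translated_addr = 0,
>         .addr_mask = ~(hwaddr)0,
>         .perm = IOMMU_NONE,          /* default: no mapping (fault) */
>     };
> 
>     /* Ask Xen to walk the guest-programmed IOVA->GPA table for this vBDF. */
>     if (xc_viommu_dma_translate(s->xc, s->domid, s->bus, s->devfn,
>                                 addr, &gpa, &mask, &perm) < 0) {
>         return ret;
>     }
> 
>     ret.translated_addr = gpa & ~mask;
>     ret.addr_mask = mask;
>     ret.perm = perm;
>     return ret;
> }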
> 
> 
> 4.3 Q35 vs i440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, which
> would force us to implement Q35 emulation first.
> 
> We consulted Linux/Windows IOMMU driver experts and learned that these
> drivers don't make such an assumption. So we may skip the Q35
> implementation and emulate the vIOMMU on the i440x chipset. KVM already
> has vIOMMU support with virtual PCI device DMA translation and interrupt
> remapping. We are using KVM to experiment with adding a vIOMMU on the
> i440x and testing Linux/Windows guests. We will report back when we have
> some results.
> 
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building the ACPI tables for the guest OS, and
> the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to
> know whether the vIOMMU is enabled and what its capabilities are in order
> to prepare the ACPI DMAR table for the guest OS.
> 
> There are three ways to do that:
> 1) Extend struct hvm_info_table and add variables to it to pass vIOMMU
> information to hvmloader. But this requires a new xc interface to use
> struct hvm_info_table in Qemu.
> 
> 2) Pass vIOMMU information to hvmloader via Xenstore.
> 
> 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design (4.3.1
> Building Guest ACPI Tables,
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> The third option seems cleanest: hvmloader doesn't need to deal with any
> vIOMMU specifics and just passes the DMAR table through to the guest OS.
> All vIOMMU specific work is done in the dummy xen-vIOMMU driver.
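> 
> As a very rough illustration of option 3), a minimal DMAR table (header
> plus one DRHD structure claiming all PCI devices) could be assembled in
> Qemu along the lines of the sketch below. The struct and helper names are
> made up, the checksum and OEM fields are left to the normal ACPI build
> path, and the 48-bit address width and INCLUDE_PCI_ALL choice are
> assumptions for illustration only.
> 
> #include <stdint.h>
> #include <stddef.h>
> #include <string.h>
> 
> #pragma pack(push, 1)
> struct acpi_table_header {
>     char     signature[4];        /* "DMAR" */
>     uint32_t length;
>     uint8_t  revision;
>     uint8_t  checksum;
>     char     oem_id[6];
>     char     oem_table_id[8];
>     uint32_t oem_revision;
>     uint32_t creator_id;
>     uint32_t creator_revision;
> };
> 
> struct acpi_dmar {
>     struct acpi_table_header hdr;
>     uint8_t  host_addr_width;     /* maximum DMA address width - 1 */
>     uint8_t  flags;               /* bit 0: interrupt remapping supported */
>     uint8_t  reserved[10];
> };
> 
> struct acpi_dmar_drhd {
>     uint16_t type;                /* 0 = DRHD */
>     uint16_t length;
>     uint8_t  flags;               /* bit 0: INCLUDE_PCI_ALL */
>     uint8_t  reserved;
>     uint16_t segment;
>     uint64_t register_base;       /* vIOMMU register base (from create_iommu) */
> };
> #pragma pack(pop)
> 
> static size_t build_dmar(void *buf, uint64_t viommu_base, int intr_remap)
> {
>     struct acpi_dmar *dmar = buf;
>     struct acpi_dmar_drhd *drhd = (void *)(dmar + 1);
> 
>     memset(buf, 0, sizeof(*dmar) + sizeof(*drhd));
>     memcpy(dmar->hdr.signature, "DMAR", 4);
>     dmar->hdr.revision = 1;
>     dmar->hdr.length = sizeof(*dmar) + sizeof(*drhd);
>     dmar->host_addr_width = 47;          /* assume 48-bit guest addressing */
>     dmar->flags = intr_remap ? 1 : 0;
> 
>     drhd->type = 0;                      /* DRHD */
>     drhd->length = sizeof(*drhd);
>     drhd->flags = 1;                     /* INCLUDE_PCI_ALL */
>     drhd->segment = 0;
>     drhd->register_base = viommu_base;
> 
>     return dmar->hdr.length;
> }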
> 
> 
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

