
Re: [Xen-devel] Discussion about virtual iommu support for Xen guest



On Fri, 3 Jun 2016, Andrew Cooper wrote:
> On 03/06/16 12:17, Tian, Kevin wrote:
> >> Very sorry for the delay.
> >>
> >> There are multiple interacting issues here.  On the one side, it would
> >> be useful if we could have a central point of coordination on
> >> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> >> would you mind organising that?
> >>
> >> For the qemu/xen interaction, the current state is woeful and a tangled
> >> mess.  I wish to ensure that we don't make any development decisions
> >> which makes the situation worse.
> >>
> >> In your case, the two motivations are quite different, so I would
> >> recommend dealing with them independently.
> >>
> >> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> >> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> >> can't be programmed to generate x2apic interrupts?  In principle, if you
> >> don't have an IOAPIC, are there any other issues to be considered?  What
> >> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> >> deliver xapic interrupts?
> > The key is the APIC ID. There was no modification to the existing PCI MSI
> > and IOAPIC formats with the introduction of x2apic. PCI MSI/IOAPIC can
> > only send interrupt messages containing an 8-bit APIC ID, which cannot
> > address >255 cpus. Interrupt remapping supports 32-bit APIC IDs, so it is
> > necessary for enabling >255 cpus with x2apic mode.
> 
> Thanks for clarifying.
> 
> >
> > If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
> > IOAPIC cannot deliver interrupts to all cpus in the system if #cpus > 255.
> 
> Ok.  So not ideal (and we certainly want to address it), but this isn't
> a complete show stopper for a guest.
> 
> >> On the other side of things, what is IGD passthrough going to look like
> >> in Skylake?  Is there any device-model interaction required (i.e. the
> >> opregion), or will it work as a completely standalone device?  What are
> >> your plans with the interaction of virtual graphics and shared virtual
> >> memory?
> >>
> > The plan is to use a so-called universal pass-through driver in the guest
> > which only accesses standard PCI resources (without the opregion, PCH/MCH,
> > etc.).
> 
> This is fantastic news.
> 
> >
> > ----
> > Here is a brief of potential usages relying on vIOMMU:
> >
> > a) enable >255 vcpus on Xeon Phi, the initial purpose of this thread.
> > It requires the interrupt remapping capability to be present on the vIOMMU;
> >
> > b) support guest SVM (Shared Virtual Memory), which relies on the
> > 1st level translation table capability (GVA->GPA) of the vIOMMU. The
> > pIOMMU needs to enable both 1st level and 2nd level translation in
> > nested mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
> > is the main usage today (to support the OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too;
> >
> > c) support VFIO-based user space drivers (e.g. DPDK) in the guest,
> > which rely on the 2nd level translation capability (IOVA->GPA) of the
> > vIOMMU. The pIOMMU 2nd level becomes a shadow of the vIOMMU 2nd level,
> > with GPA replaced by HPA (i.e. IOVA->HPA); see the sketch just below;
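
To make the shadowing in c) concrete, a rough sketch of the composition
involved; the helpers are toy stand-ins, and the real code would walk VT-d
page-table structures rather than call flat lookup functions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t iova_t, gpa_t, hpa_t;

    /* Toy stand-in: the vIOMMU 2nd level maps IOVA->GPA (guest-controlled). */
    static bool viommu_iova_to_gpa(iova_t iova, gpa_t *gpa)
    {
        *gpa = iova;                /* pretend the guest set up identity IOVAs */
        return true;
    }

    /* Toy stand-in: the P2M maps GPA->HPA (Xen-controlled). */
    static bool p2m_gpa_to_hpa(gpa_t gpa, hpa_t *hpa)
    {
        *hpa = gpa + 0x40000000ULL; /* pretend guest RAM sits at a fixed offset */
        return true;
    }

    /*
     * A shadow pIOMMU 2nd-level entry is the composition of the two
     * stages: IOVA -> GPA -> HPA.  It has to be refreshed whenever either
     * stage changes, which is why Xen needs visibility into both.
     */
    static bool shadow_iova_to_hpa(iova_t iova, hpa_t *hpa)
    {
        gpa_t gpa;

        return viommu_iova_to_gpa(iova, &gpa) && p2m_gpa_to_hpa(gpa, hpa);
    }
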
> 
> All of these look like interesting things to do.  I know there is a lot
> of interest for b).
> 
> As a quick aside, does Xen currently boot on a Phi?  The last time I looked
> at the Phi manual, I expected Xen to crash on boot because of MCXSR
> differences from more common x86 hardware.
> 
> >
> > ----
> > And below are my thoughts on the viability of implementing vIOMMU in Qemu:
> >
> > a) enable >255 vcpus:
> >
> >     o Enable Q35 in Qemu-Xen;
> >     o Add interrupt remapping in Qemu vIOMMU;
> >     o Virtual interrupt injection in the hypervisor needs to know the
> > virtual interrupt remapping (IR) structure, since IR sits behind the
> > vIOAPIC/vMSI; this requires new hypervisor interfaces, as Andrew pointed
> > out:
> >             * either for the hypervisor to query the IR state from Qemu,
> > which is not good;
> >             * or for Qemu to register the IR info with the hypervisor,
> > which means partial IR knowledge implemented in the hypervisor (then why
> > not put the whole IR emulation in Xen?)
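
If Qemu were to register the IR info with Xen, the interface might look
roughly like the following; the structure names and layout are entirely
hypothetical, just to show how much IR knowledge would end up in the
hypervisor:

    #include <stdint.h>

    /*
     * Hypothetical: one decoded remapping entry pushed from the Qemu
     * vIOMMU to Xen, so that virtual interrupt injection can resolve
     * vIOAPIC RTEs / vMSI messages that point into the remapping table.
     */
    struct viommu_ir_entry {
        uint32_t index;          /* IRTE index */
        uint32_t dest_id;        /* 32-bit destination APIC ID (x2apic) */
        uint8_t  vector;
        uint8_t  delivery_mode;
        uint8_t  trigger_mode;
        uint8_t  masked;
    };

    /* Hypothetical hypercall payload registering a batch of entries. */
    struct viommu_set_irte {
        uint32_t domid;
        uint32_t nr_entries;
        uint64_t entries_gaddr;  /* guest address of viommu_ir_entry[] */
    };
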
> >
> > b) support SVM
> >
> >     o Enable Q35 in Qemu-Xen;
> >     o Add 1st level translation capability in Qemu vIOMMU;
> >     o The VT-d context entry points to the guest 1st level translation
> > table, which is nest-translated through the 2nd level translation table,
> > so the vIOMMU structure can be linked directly. This means:
> >             * the Xen IOMMU driver enables nested mode;
> >             * a new hypercall is introduced so the Qemu vIOMMU can
> > register the GPA root of the guest 1st level translation table, which is
> > then written to the context entry in the pIOMMU;
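
The new hypercall in b) could be as small as registering the guest's
1st-level table root per device; a hypothetical shape for it (names and
layout invented for illustration):

    #include <stdint.h>

    /*
     * Hypothetical: the Qemu vIOMMU hands Xen the GPA of the guest
     * 1st-level (GVA->GPA) table root for a given device; Xen writes it
     * into the pIOMMU context entry with nested mode enabled, so the
     * hardware walks GVA -> (guest 1st level) -> GPA -> (2nd level) -> HPA.
     */
    struct viommu_set_fl_root {
        uint32_t domid;
        uint16_t sbdf;           /* PCI segment/bus/device/function */
        uint16_t flags;          /* e.g. enable/disable nested translation */
        uint64_t fl_root_gpa;    /* GPA of the guest 1st-level table root */
    };
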
> >
> > c) support VFIO-based user space driver
> >
> >     o Enable Q35 in Qemu-Xen;
> >     o Leverage existing 2nd level translation implementation in Qemu 
> > vIOMMU;
> >     o Change the Xen IOMMU code to support (IOVA->HPA) translation,
> > which means decoupling the current logic from the P2M layer (which only
> > handles GPA->HPA);
> >     o As with any shadowing approach, the Xen IOMMU driver needs to
> > know both the (IOVA->GPA) and (GPA->HPA) info to update the (IOVA->HPA)
> > mapping whenever either one changes. So a new interface is required
> > for the Qemu vIOMMU to propagate (IOVA->GPA) info into the Xen
> > hypervisor, where it may need to be further cached.
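
For c), the new interface is essentially a map/unmap notification channel
from the Qemu vIOMMU to Xen; a hypothetical shape for it (names invented
for illustration):

    #include <stdint.h>

    #define VIOMMU_DMA_MAP    0
    #define VIOMMU_DMA_UNMAP  1

    /*
     * Hypothetical: the Qemu vIOMMU notifies Xen of guest (IOVA->GPA)
     * changes; Xen composes them with its own (GPA->HPA) P2M to update
     * the shadow (IOVA->HPA) mapping in the pIOMMU 2nd level, and must
     * also re-shadow when the P2M side changes.
     */
    struct viommu_dma_op {
        uint32_t domid;
        uint16_t sbdf;           /* which device's IOVA space */
        uint16_t op;             /* VIOMMU_DMA_MAP or VIOMMU_DMA_UNMAP */
        uint64_t iova;
        uint64_t gpa;            /* ignored for unmap */
        uint64_t size;           /* bytes, page-aligned */
    };
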
> >
> > ----
> >
> > After writing down the above details, it looks clear that putting the
> > vIOMMU in Qemu is not a clean design for a) and c). For b) the hypervisor
> > change is not that hacky, but b) alone does not seem a strong enough
> > reason to pursue the Qemu path. It seems we may have to go with a
> > hypervisor-based approach...
> >
> > Anyway, I'll stop here. With the above background, let's see whether
> > others have a better thought on how to accelerate the time-to-market of
> > those usages in Xen. Xen was once a leading hypervisor for many new
> > features, but recently it has not been keeping up. If the above usages
> > can be enabled decoupled from the HVMlite/virtual_root_port effort, then
> > we can have a staged plan to move faster (first for HVM, later for
> > HVMLite). :-)
> 
> I dislike that we are in this situation, but I am glad to see that I am not
> the only one who thinks that the current situation is unsustainable.
> 
> The problem is that things were hacked up in the past on the assumption
> that qemu could deal with everything like this.  Later, performance sucked
> sufficiently that bits of qemu were moved back up into the hypervisor,
> which is why the vIOAPIC is currently located there.  The result is a
> completely tangled rat's nest.
> 
> 
> Xen has 3 common uses for qemu, which are:
> 1) Emulation of legacy devices
> 2) PCI Passthrough
> 3) PV backends
> 
> 3 isn't really relevant here.  For 1, we are basically just using Qemu
> to provide an LPC implementation (with some populated slots for
> disk/network devices).
> 
> I think it would be far cleaner to re-engineer the current Xen/qemu
> interaction to more closely resemble real hardware, including
> considering having multiple vIOAPICs/vIOMMUs/etc when architecturally
> appropriate.  I expect that it would be a far cleaner interface to use
> and extend.  I also realise that this isn't a simple task I am
> suggesting, but I don't see any other viable way out.
> 
> Another issue in the mix is support for multiple device emulators, in
> which case Xen is already performing first-level redirection of MMIO
> requests.
> 
> For HVMLite, there is specifically no qemu, and we need something which
> can function when we want PCI Passthrough to work.  I am quite confident
> that the correct solution here is to have a basic host bridge/root port
> implementation in Xen (we already have 80% of this), at which
> point we don't need any qemu interaction for PCI Passthrough at all, even
> for HVM guests.
> 
> From this perspective, it would make sense to have emulators map IOVAs,
> not GPAs.  We already have mapcache_invalidate infrastructure to flush
> mappings as they are changed by the guest.
> 
> 
> For the HVMLite side of things, my key concern is not to try and do any
> development which we realistically expect to have to undo/change.  As
> you said yourself, we are struggling to keep up, and really aren't
> helping ourselves by doing lots of work and subsequently redoing it
> when it doesn't work; PVH is the most obvious recent example here.
> 
> If others agree, I think that it is well worth making some concrete
> plans for improvements in this area for Xen 4.8.  I think the only
> viable way forward is to try and get out of the current hole we are in.
> 
> Thoughts?  (especially Stefano/Anthony)

Going back to the beginning of the discussion, whether we should enable
Q35 in QEMU or not is a distraction: of course we should enable it, but
even with Q35 in QEMU, it might not be a good idea to place the vIOMMU
emulation there.

I agree with Andrew that the current model is flawed: the boundary
between Xen and QEMU emulation is not clear enough. In addition, using
QEMU on Xen introduces latency and security issues (the work to run QEMU
as non-root and with unprivileged interfaces is not complete yet).

I think of QEMU as a provider of complex, high level emulators, such as
the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
need to be fast.

For core x86 components such as the vIOMMU, for performance and ease of
integration with the rest of the hypervisor, it seems to me that Xen
is the right place to implement them. As a comparison, I would
certainly argue in favor of implementing the vSMMU in the hypervisor on ARM.


However, the issue is the PCI root complex, which today is in QEMU. I
don't think it is a particularly bad fit there, although I can also see
the benefit of moving it to the hypervisor. It is only relevant here if it
causes problems for implementing the vIOMMU in Xen.

From a software engineering perspective, it would be nice to keep the
two projects (implementing the vIOMMU and moving the PCI root complex to
Xen) separate, especially given that the PCI root complex one currently
has neither an owner nor a timeline. I don't think it is fair to ask
Tianyu or Kevin to move the PCI root complex from QEMU to Xen in order
to enable the vIOMMU on Xen systems.

If a vIOMMU in Xen and the root complex in QEMU cannot be made to work
together, then we are at an impasse. I cannot see any good way forward
unless somebody volunteers to start working on the PCI root complex
project soon to provide Kevin and Tianyu with a branch to base their
work upon.
