
Re: [RFC XEN PATCH v3 1/5] docs/designs: Add a design document for PV-IOMMU



Hello Alejandro, thanks for the reply!

On 11/07/2024 at 20:26, Alejandro Vallejo wrote:
> Disclaimer: I haven't looked at the code yet.
>
> On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
>> Some operating systems want to use IOMMU to implement various features (e.g
>> VFIO) or DMA protection.
>> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>>
>> Signed-off-by Teddy Astie <teddy.astie@xxxxxxxxxx>
>> ---
>>   docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 105 insertions(+)
>>   create mode 100644 docs/designs/pv-iommu.md
>>
>> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
>> new file mode 100644
>> index 0000000000..c01062a3ad
>> --- /dev/null
>> +++ b/docs/designs/pv-iommu.md
>> @@ -0,0 +1,105 @@
>> +# IOMMU paravirtualization for Dom0
>> +
>> +Status: Experimental
>> +
>> +# Background
>> +
>> +By default, Xen only uses the IOMMU for itself, either to make device adress
>> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
>> +from doing DMA outside it's expected memory regions including the hypervisor
>> +(x86 PV).
>
> "By default...": Do you mean "currently"?
>

Yes, that's what I mean by "default" here.

>> +
>> +[1] VFIO - "Virtual Function I/O" - 
>> https://www.kernel.org/doc/html/latest/driver-api/vfio.html
>> +
>> +# Design
>> +
>> +The operating system may want to have access to various IOMMU features such 
>> as
>> +context management and DMA remapping. We can create a new hypercall that 
>> allows
>> +the guest to have access to a new paravirtualized IOMMU interface.
>> +
>> +This feature is only meant to be available for the Dom0, as DomU have some
>> +emulated devices that can't be managed on Xen side and are not hardware, we
>> +can't rely on the hardware IOMMU to enforce DMA remapping.
>
> Is that the reason though? While it's true we can't mix emulated and real
> devices under the same emulated PCI bus covered by an IOMMU, nothing prevents 
> us
> from stating "the IOMMU(s) configured via PV-IOMMU cover from busN to busM".
>
> AFAIK, that already happens on systems with several IOMMUs, where they might
> affect partially disjoint devices. But I admit I'm no expert on this.
>
I am not an expert on how emulated devices are exposed, but the guest
will definitely need a way to know whether a device is real hardware
or not.

I think the situation will differ depending on whether we do PV or
HVM, though. In PV there are no emulated devices AFAIK, so no
identification is needed. In the HVM case there are, which we should
consider.

There is also the question of how PV-IOMMU would interact with a
possible future emulated IOMMU (e.g. VT-d emulation in QEMU) that is
allowed to act on real devices, for instance by relying on the new
IOMMU infrastructure.

> I can definitely see a lot of interesting use cases for a PV-IOMMU interface
> exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
> allow them to use the IOMMU without resorting to 2-stage translation, which 
> has
> terrible IOTLB miss costs.
>

Makes sense; it could be useful e.g. for storage domains with SPDK
support. Do note that two-stage IOMMU translation is only supported by
very recent hardware (e.g. 4th-generation Xeon Scalable).

>> +
>> +This interface is exposed under the `iommu_op` hypercall.
>> +
>> +In addition, Xen domains are modified in order to allow existence of several
>> +IOMMU context including a default one that implement default behavior (e.g
>> +hardware assisted paging) and can't be modified by guest. DomU cannot have
>> +contexts, and therefore act as if they only have the default domain.
>> +
>> +Each IOMMU context within a Xen domain is identified using a domain-specific
>> +context number that is used in the Xen IOMMU subsystem and the hypercall
>> +interface.
>> +
>> +The number of IOMMU context a domain can use is predetermined at domain 
>> creation
>> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.
>
> nit: I think it's more typical within Xen to see "nr" rather than "nb"
>

Yes.

>> +
>> +# IOMMU operations
>> +
>> +## Alloc context
>> +
>> +Create a new IOMMU context for the guest and return the context number to 
>> the
>> +guest.
>> +Fail if the IOMMU context limit of the guest is reached.
>
> or -ENOMEM, I guess.
>
> I'm guessing from this dom0 takes care of the contexts for guests? Or are 
> these
> contexts for use within dom0 exclusively?
>

Each domain has a set of IOMMU contexts that can be allocated and
freed (up to a fixed limit set at domain creation).
If no context is available (i.e. the context number limit is hit), I
chose -ENOSPC as the error return value (-ENOMEM is reserved for lack
of memory, which can also happen).
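
To illustrate, the allocation logic is roughly the following (a
minimal, userspace-style sketch; pviommu_domain, ctx_alloc and the
other names are made up for illustration, not the actual interface of
this series):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct pviommu_ctx {
    bool used;                /* slot currently allocated? */
    void *pgtable;            /* root of this context's page tables */
};

struct pviommu_domain {
    uint16_t nr_ctx;          /* fixed limit chosen at domain creation */
    struct pviommu_ctx *ctx;  /* nr_ctx slots, slot 0 = default context */
};

/* Returns a context number > 0 on success, -ENOSPC when the per-domain
 * limit is reached, -ENOMEM when backing memory cannot be allocated. */
static int ctx_alloc(struct pviommu_domain *d)
{
    for (uint16_t i = 1; i < d->nr_ctx; i++) {
        if (d->ctx[i].used)
            continue;

        d->ctx[i].pgtable = calloc(1, 4096);
        if (!d->ctx[i].pgtable)
            return -ENOMEM;   /* lack of memory */

        d->ctx[i].used = true;
        return i;
    }

    return -ENOSPC;           /* context number limit reached */
}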

>> +
>> +A flag can be specified to create a identity mapping.
>> +
>> +## Free context
>> +
>> +Destroy a IOMMU context created previously.
>> +It is not possible to free the default context.
>> +
>> +Reattach context devices to default context if specified by the guest.
>> +
>> +Fail if there is a device in the context and reattach-to-default flag is not
>> +specified.
>> +
>> +## Reattach device
>> +
>> +Reattach a device to another IOMMU context (including the default one).
>> +The target IOMMU context number must be valid and the context allocated.
>> +
>> +The guest needs to specify a PCI SBDF of a device he has access to.
>> +
>> +## Map/unmap page
>> +
>> +Map/unmap a page on a context.
>> +The guest needs to specify a gfn and target dfn to map.
>
> And an "order", I hope; to enable superpages and hugepages without having to
> find out after the fact that the mappings are in fact mergeable and the leaf 
> PTs
> can go away.
>

In my implementation, I added a "nr_page" parameter to specify how
many pages can be mapped at once (and superpages can be derived from
it). As you suppose, it can be useful to optimize the map operation by
mapping superpages directly.
The biggest problem is that the superpage mapping we would like is
only valid if the corresponding page of the domain is itself a
superpage (because the mapped region also needs to be contiguous in
actual physical memory, not just from the guest's point of view).
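
Something along these lines is what I have in mind for deriving
superpages from "nr_page" (a rough sketch; gfn_to_mfn(), map_one() and
the constants are placeholders, not the real Xen helpers):

#include <stdbool.h>
#include <stdint.h>

#define SUPERPAGE_ORDER  9                       /* 2MiB on x86 */
#define SUPERPAGE_PAGES  (1UL << SUPERPAGE_ORDER)

/* Placeholders for the real translation/mapping primitives. */
uint64_t gfn_to_mfn(uint64_t gfn);
int map_one(uint64_t dfn, uint64_t mfn, unsigned int order);

/* True if the SUPERPAGE_PAGES gfns starting at gfn are backed by
 * superpage-aligned, machine-contiguous memory. */
static bool backs_superpage(uint64_t gfn)
{
    uint64_t mfn0 = gfn_to_mfn(gfn);

    if (mfn0 & (SUPERPAGE_PAGES - 1))
        return false;            /* not aligned in machine memory */

    for (uint64_t i = 1; i < SUPERPAGE_PAGES; i++)
        if (gfn_to_mfn(gfn + i) != mfn0 + i)
            return false;        /* not machine-contiguous */

    return true;
}

static int map_range(uint64_t dfn, uint64_t gfn, uint64_t nr_page)
{
    while (nr_page) {
        /* Use a 2MiB mapping only when dfn and gfn are aligned, enough
         * pages remain, and the guest range really is contiguous in
         * machine memory; otherwise fall back to 4KiB mappings. */
        if (!((dfn | gfn) & (SUPERPAGE_PAGES - 1)) &&
            nr_page >= SUPERPAGE_PAGES && backs_superpage(gfn)) {
            int rc = map_one(dfn, gfn_to_mfn(gfn), SUPERPAGE_ORDER);
            if (rc)
                return rc;
            dfn += SUPERPAGE_PAGES;
            gfn += SUPERPAGE_PAGES;
            nr_page -= SUPERPAGE_PAGES;
        } else {
            int rc = map_one(dfn, gfn_to_mfn(gfn), 0);
            if (rc)
                return rc;
            dfn++;
            gfn++;
            nr_page--;
        }
    }
    return 0;
}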

>> +
>> +## Hardware without IOMMU support
>> +
>> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
>> +able to make contexts. However, some operating system may critically fail in
>> +case they are able to make a new IOMMU context. Which is supposed to happen
>> +if no IOMMU hardware is available.
>> +
>> +The hypercall interface needs a interface to advertise the ability to create
>> +and manage IOMMU contexts including the amount of context the guest is able
>> +to use. Using these informations, the Dom0 may decide whether to use or not
>> +the PV-IOMMU interface.
>
> We could just return -ENOTSUPP when there's no IOMMU, then encapsulate a 
> random
> lookup with pv_iommu_is_present() and return true or false depending on rc.
>

-ENOTSUPP makes sense. Another way I report the absence of PV-IOMMU
support is to advertise limits that mean "no operation is actually
possible" (e.g. max_ctx_no = 0).

>> +
>> +## Page pool for contexts
>> +
>> +In order to prevent unexpected starving on the hypervisor memory with a
>> +buggy Dom0. We can preallocate the pages the contexts will use and make
>> +map/unmap use these pages instead of allocating them dynamically.
>> +
>
> That seems dangerous should we need to shatter a superpage asynchronously 
> (i.e:
> due to HW misbehaving and requiring it) and have no more pages in the pool.
>

Superpage shattering is actually recoverable: if you fail to allocate
the new leaves, you just keep the superpage entry, act as if nothing
happened, and report -ENOMEM. Nothing changed from the hardware's
point of view.

The superpage entry is only turned into a regular one once the leaves
are actually valid. A similar story applies when collapsing leaves
into a superpage (you can only free the leaves once the hardware no
longer uses them, e.g. after a relevant iotlb_flush).
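
In pseudo-C, the ordering is roughly the following (placeholders only,
not the actual page-table code; populate_leaves(), write_atomic() and
iotlb_flush() stand in for the real helpers):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t pte_t;          /* simplified PTE, low bit = present */
#define PTES_PER_TABLE 512

/* Placeholders for the real page-table helpers. */
void populate_leaves(pte_t *leaves, pte_t superpage_entry);
void write_atomic(pte_t *entry, pte_t new_value);
void iotlb_flush(void);

/* Shatter a superpage entry into a leaf table. On allocation failure
 * the superpage entry is left untouched, so the hardware never sees an
 * intermediate state and the caller just gets -ENOMEM. */
static int shatter_superpage(pte_t *entry)
{
    pte_t *leaves = calloc(PTES_PER_TABLE, sizeof(pte_t));

    if (!leaves)
        return -ENOMEM;          /* recoverable: old mapping still valid */

    /* 1. Build the fully valid leaf table while the superpage entry is
     *    still the one the hardware walks. */
    populate_leaves(leaves, *entry);

    /* 2. Only then replace the superpage entry with a pointer to the
     *    new table, and flush stale IOTLB entries. */
    write_atomic(entry, (pte_t)(uintptr_t)leaves | 1);
    iotlb_flush();

    return 0;
}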

> Cheers,
> Alejandro

Teddy


Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




 

