
Re: [RFC XEN PATCH v3 1/5] docs/designs: Add a design document for PV-IOMMU



Hello Alejandro, thanks for the reply!

On 11/07/2024 at 20:26, Alejandro Vallejo wrote:
> Disclaimer: I haven't looked at the code yet.
>
> On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
>> Some operating systems want to use IOMMU to implement various features (e.g
>> VFIO) or DMA protection.
>> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>>
>> Signed-off-by Teddy Astie <teddy.astie@xxxxxxxxxx>
>> ---
>>   docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 105 insertions(+)
>>   create mode 100644 docs/designs/pv-iommu.md
>>
>> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
>> new file mode 100644
>> index 0000000000..c01062a3ad
>> --- /dev/null
>> +++ b/docs/designs/pv-iommu.md
>> @@ -0,0 +1,105 @@
>> +# IOMMU paravirtualization for Dom0
>> +
>> +Status: Experimental
>> +
>> +# Background
>> +
>> +By default, Xen only uses the IOMMU for itself, either to make device adress
>> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
>> +from doing DMA outside it's expected memory regions including the hypervisor
>> +(x86 PV).
>
> "By default...": Do you mean "currently"?
>

Yes, that's what I mean by "default" here.

>> +
>> +[1] VFIO - "Virtual Function I/O" - 
>> https://www.kernel.org/doc/html/latest/driver-api/vfio.html
>> +
>> +# Design
>> +
>> +The operating system may want to have access to various IOMMU features such 
>> as
>> +context management and DMA remapping. We can create a new hypercall that 
>> allows
>> +the guest to have access to a new paravirtualized IOMMU interface.
>> +
>> +This feature is only meant to be available for the Dom0, as DomU have some
>> +emulated devices that can't be managed on Xen side and are not hardware, we
>> +can't rely on the hardware IOMMU to enforce DMA remapping.
>
> Is that the reason though? While it's true we can't mix emulated and real
> devices under the same emulated PCI bus covered by an IOMMU, nothing prevents 
> us
> from stating "the IOMMU(s) configured via PV-IOMMU cover from busN to busM".
>
> AFAIK, that already happens on systems with several IOMMUs, where they might
> affect partially disjoint devices. But I admit I'm no expert on this.
>
I am not an expert on how emulated devices are exposed, but the guest
will definitely need a way to know whether a device is real hardware
or not.

I think the situation will differ depending on whether we do PV or
HVM, though. In PV there are no emulated devices AFAIK, so no
identification is needed. In the HVM case there are, which we should
consider.

There is also the question of how PV-IOMMU would interact with a
possible future emulated IOMMU (e.g. VT-d emulation in QEMU) that is
allowed to act on real devices, for instance by relying on the new
IOMMU infrastructure.

> I can definitely see a lot of interesting use cases for a PV-IOMMU interface
> exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
> allow them to use the IOMMU without resorting to 2-stage translation, which 
> has
> terrible IOTLB miss costs.
>

Makes sense; it could be useful e.g. for storage domains with SPDK
support. Do note that two-stage IOMMU translation is only supported by
very recent hardware (e.g. 4th-generation Xeon Scalable).

>> +
>> +This interface is exposed under the `iommu_op` hypercall.
>> +
>> +In addition, Xen domains are modified in order to allow existence of several
>> +IOMMU context including a default one that implement default behavior (e.g
>> +hardware assisted paging) and can't be modified by guest. DomU cannot have
>> +contexts, and therefore act as if they only have the default domain.
>> +
>> +Each IOMMU context within a Xen domain is identified using a domain-specific
>> +context number that is used in the Xen IOMMU subsystem and the hypercall
>> +interface.
>> +
>> +The number of IOMMU context a domain can use is predetermined at domain 
>> creation
>> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.
>
> nit: I think it's more typical within Xen to see "nr" rather than "nb"
>

Yes.

>> +
>> +# IOMMU operations
>> +
>> +## Alloc context
>> +
>> +Create a new IOMMU context for the guest and return the context number to 
>> the
>> +guest.
>> +Fail if the IOMMU context limit of the guest is reached.
>
> or -ENOMEM, I guess.
>
> I'm guessing from this dom0 takes care of the contexts for guests? Or are 
> these
> contexts for use within dom0 exclusively?
>

Each domain has a set of IOMMU contexts that can be allocated and
freed (up to a fixed limit set at domain creation).
If no context is available (i.e. the context number limit is hit), I
chose -ENOSPC as the error return value (-ENOMEM is reserved for lack
of memory, which can also happen).
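
To illustrate, the allocation logic is roughly the following (a
minimal, userspace-style sketch; pviommu_domain, ctx_alloc and the
other names are made up for illustration, not the actual interface of
this series):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct pviommu_ctx {
    bool used;                /* slot currently allocated? */
    void *pgtable;            /* root of this context's page tables */
};

struct pviommu_domain {
    uint16_t nr_ctx;          /* fixed limit chosen at domain creation */
    struct pviommu_ctx *ctx;  /* nr_ctx slots, slot 0 = default context */
};

/* Returns a context number > 0 on success, -ENOSPC when the per-domain
 * limit is reached, -ENOMEM when backing memory cannot be allocated. */
static int ctx_alloc(struct pviommu_domain *d)
{
    for (uint16_t i = 1; i < d->nr_ctx; i++) {
        if (d->ctx[i].used)
            continue;

        d->ctx[i].pgtable = calloc(1, 4096);
        if (!d->ctx[i].pgtable)
            return -ENOMEM;   /* lack of memory */

        d->ctx[i].used = true;
        return i;
    }

    return -ENOSPC;           /* context number limit reached */
}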

>> +
>> +A flag can be specified to create a identity mapping.
>> +
>> +## Free context
>> +
>> +Destroy a IOMMU context created previously.
>> +It is not possible to free the default context.
>> +
>> +Reattach context devices to default context if specified by the guest.
>> +
>> +Fail if there is a device in the context and reattach-to-default flag is not
>> +specified.
>> +
>> +## Reattach device
>> +
>> +Reattach a device to another IOMMU context (including the default one).
>> +The target IOMMU context number must be valid and the context allocated.
>> +
>> +The guest needs to specify a PCI SBDF of a device he has access to.
>> +
>> +## Map/unmap page
>> +
>> +Map/unmap a page on a context.
>> +The guest needs to specify a gfn and target dfn to map.
>
> And an "order", I hope; to enable superpages and hugepages without having to
> find out after the fact that the mappings are in fact mergeable and the leaf 
> PTs
> can go away.
>

In my implementation, I added a "nr_page" parameter to specify how
many pages can be mapped at once (and superpages can be derived from
it). As you suppose, it can be useful to optimize the map operation by
mapping superpages directly.
The biggest problem is that the superpage mapping we would like is
only valid if the corresponding page of the domain is itself a
superpage (because the mapped region also needs to be contiguous in
actual physical memory, not just from the guest's point of view).
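
Something along these lines is what I have in mind for deriving
superpages from "nr_page" (a rough sketch; gfn_to_mfn(), map_one() and
the constants are placeholders, not the real Xen helpers):

#include <stdbool.h>
#include <stdint.h>

#define SUPERPAGE_ORDER  9                       /* 2MiB on x86 */
#define SUPERPAGE_PAGES  (1UL << SUPERPAGE_ORDER)

/* Placeholders for the real translation/mapping primitives. */
uint64_t gfn_to_mfn(uint64_t gfn);
int map_one(uint64_t dfn, uint64_t mfn, unsigned int order);

/* True if the SUPERPAGE_PAGES gfns starting at gfn are backed by
 * superpage-aligned, machine-contiguous memory. */
static bool backs_superpage(uint64_t gfn)
{
    uint64_t mfn0 = gfn_to_mfn(gfn);

    if (mfn0 & (SUPERPAGE_PAGES - 1))
        return false;            /* not aligned in machine memory */

    for (uint64_t i = 1; i < SUPERPAGE_PAGES; i++)
        if (gfn_to_mfn(gfn + i) != mfn0 + i)
            return false;        /* not machine-contiguous */

    return true;
}

static int map_range(uint64_t dfn, uint64_t gfn, uint64_t nr_page)
{
    while (nr_page) {
        /* Use a 2MiB mapping only when dfn and gfn are aligned, enough
         * pages remain, and the guest range really is contiguous in
         * machine memory; otherwise fall back to 4KiB mappings. */
        if (!((dfn | gfn) & (SUPERPAGE_PAGES - 1)) &&
            nr_page >= SUPERPAGE_PAGES && backs_superpage(gfn)) {
            int rc = map_one(dfn, gfn_to_mfn(gfn), SUPERPAGE_ORDER);
            if (rc)
                return rc;
            dfn += SUPERPAGE_PAGES;
            gfn += SUPERPAGE_PAGES;
            nr_page -= SUPERPAGE_PAGES;
        } else {
            int rc = map_one(dfn, gfn_to_mfn(gfn), 0);
            if (rc)
                return rc;
            dfn++;
            gfn++;
            nr_page--;
        }
    }
    return 0;
}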

>> +
>> +## Hardware without IOMMU support
>> +
>> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
>> +able to make contexts. However, some operating system may critically fail in
>> +case they are able to make a new IOMMU context. Which is supposed to happen
>> +if no IOMMU hardware is available.
>> +
>> +The hypercall interface needs a interface to advertise the ability to create
>> +and manage IOMMU contexts including the amount of context the guest is able
>> +to use. Using these informations, the Dom0 may decide whether to use or not
>> +the PV-IOMMU interface.
>
> We could just return -ENOTSUPP when there's no IOMMU, then encapsulate a 
> random
> lookup with pv_iommu_is_present() and return true or false depending on rc.
>

-ENOTSUPP makes sense. Another way I report the absence of PV-IOMMU
support is to advertise limits that mean "no operation is actually
possible" (e.g. max_ctx_no = 0).

>> +
>> +## Page pool for contexts
>> +
>> +In order to prevent unexpected starving on the hypervisor memory with a
>> +buggy Dom0. We can preallocate the pages the contexts will use and make
>> +map/unmap use these pages instead of allocating them dynamically.
>> +
>
> That seems dangerous should we need to shatter a superpage asynchronously 
> (i.e:
> due to HW misbehaving and requiring it) and have no more pages in the pool.
>

Superpage shattering is actually recoverable: if you fail to allocate
the new leaves, you just keep the superpage entry, act as if nothing
happened, and report -ENOMEM. Nothing changed from the hardware's
point of view.

The superpage entry is only turned into a regular one once the leaves
are actually valid. A similar story applies when collapsing leaves
into a superpage (you can only free the leaves once the hardware no
longer uses them, e.g. after a relevant iotlb_flush).
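
In pseudo-C, the ordering is roughly the following (placeholders only,
not the actual page-table code; populate_leaves(), write_atomic() and
iotlb_flush() stand in for the real helpers):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t pte_t;          /* simplified PTE, low bit = present */
#define PTES_PER_TABLE 512

/* Placeholders for the real page-table helpers. */
void populate_leaves(pte_t *leaves, pte_t superpage_entry);
void write_atomic(pte_t *entry, pte_t new_value);
void iotlb_flush(void);

/* Shatter a superpage entry into a leaf table. On allocation failure
 * the superpage entry is left untouched, so the hardware never sees an
 * intermediate state and the caller just gets -ENOMEM. */
static int shatter_superpage(pte_t *entry)
{
    pte_t *leaves = calloc(PTES_PER_TABLE, sizeof(pte_t));

    if (!leaves)
        return -ENOMEM;          /* recoverable: old mapping still valid */

    /* 1. Build the fully valid leaf table while the superpage entry is
     *    still the one the hardware walks. */
    populate_leaves(leaves, *entry);

    /* 2. Only then replace the superpage entry with a pointer to the
     *    new table, and flush stale IOTLB entries. */
    write_atomic(entry, (pte_t)(uintptr_t)leaves | 1);
    iotlb_flush();

    return 0;
}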

> Cheers,
> Alejandro

Teddy


Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




 

