
Re: [Xen-devel] (v2) Design proposal for RMRR fix

On Fri, Dec 26, 2014 at 11:23 AM, Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> (please note some proposal is different from last sent version after more
> discussions. But I tried to summarize previous discussions and explained why
> we choose a different way. Sorry if I may miss some opens/conclusions
> discussed in past months. Please help point it out which is very appreciated. 
> :-)

Kevin, thanks for this document.  A few questions / comments below:

> For proper functioning of these legacy reserved memory usages, when
> system software enables DMA remapping, the translation structures for
> the respective devices are expected to be set up to provide identity
> mapping for the specified reserved memory regions with read and write
> permissions. The system software is also responsible for ensuring
> that any input addresses used for device accesses to OS-visible memory
> do not overlap with the reserved system memory address ranges.

Just to be clear: "identity mapping" here means that gpfn == mfn, in
both the p2m and IOMMU.  (I suppose it might mean vfn == gpfn as well,
but that wouldn't really concern us, as the guest deals with virtual

> However current RMRR implementation in Xen only partially achieves a)
> and completely misses b), which cause some issues:
> --
> [Issue-1] Identity mapping is not setup in shared ept case, so a device
> with RMRR may not function correctly if assigned to a VM.
> This was the original problem we found when assigning IGD on BDW
> platform, which triggered the whole long discussion in past months
> --
> [Issue-2] Being lacking of goal-b), existing device assignment with
> RMRR works only when reserved regions happen to not conflicting with
> other valid allocations in the guest physical address space. This could
> lead to unpredicted failures in various deployments, due to non-detected
> conflictions caused by platform difference and VM configuration
> difference.
> One example is about USB controller assignment. It's already identified
> as a problem on some platforms, that USB reserved regions conflict with
> guest BIOS region. However, being the fact that host BIOS only touches
> those reserved regions for legacy keyboard emulation at early Dom0 boot
> phase, a trick is added in Xen to bypass RMRR handling for usb
> controllers.
> --
> [Issue-3] devices may share same reserved regions, however
> there is no logic to handle this in Xen. Assigning such devices to
> different VMs could lead to secure concern

So to summarize:

When assigning a device to a guest, the device's associated RMRRs must
be identity mapped in the p2m and IOMMU.

At the moment, we don't have a reliable way to reclaim a particular
gpfn space from a guest once it's been used for other purposes (e.g.,
guest RAM or other MMIO ranges).

So, we need to make sure at guest creation time that we reserve any
RMRR ranges for devices we may wish to assign, and make sure that the
RMRR in gpfn space is empty.

For statically-assigned devices, we know at guest creation time which
RMRRs may be required.  But if we want to dynamically add devices, we
must figure out ahead of time which devices we *might* add, and
reserve the RMRRs at boot time.

As a separate problem, two different devices may share the same RMRR,
meaning that if we assign these devices to two different VMs, the RMRR
may be mapped into the gpfn space of two different VMs.  This may well
be a security issue, so we need to handle it carefully.

> 3. High Level Design
> =====================================================================
> To achieve aforementioned two goals, major enhancements are required
> cross Xen hypervisor, libxc, and hvmloader, to address the gap in
> goal-b), i.e. handling possible conflictions in gfn space. Fixing
> goal-a) is straightforward.
> 3.1 Guidelines
> ----
> There are several guidelines considered in the design:
> --
> [Guideline-1] No regression in a VM w/o statically-assigned devices
>   If a VM isn't configured with assigned devices at creation, new
> confliction detection logic shouldn't block the VM boot progress
> (either skipped, or just throw warning)
> --
> [Guideline-2] No regression on devices which do not have RMRR reported
>   If a VM is assigned with a device which doesn't have RMRR reported,
> either statically-assigned or dynamically-assigned, new confliction
> detection logic shouldn't fail the assignment request for this device.
> --
> [Guideline-3] New interface should be kept as common as possible
>   New interface will be introduced to expose reserved regions to the
> user space. Though RMRR is a VT-d specific terminology, the interface
> design should be generic enough, i.e. to support a function which
> allows hypervisor to force reserving one or more gfn ranges.
> --
> [Guideline-4] Keep changes simple
>   RMRR reserved regions should be avoided or limited by platform
> designers, per VT-d specification. Per our observations, there are
> only a few reported examples (USB, IGD) on real platforms. So we need
> to balance the code complexity and usage limitation. If one limitation
> is only in niche scenarios, we'd like to vote no-support to simplify
> changes for now.

This is an excellent set of principles -- thanks.

> 3.2 Confliction detection
> ----
> Confliction must be detected in several places as far as gfn is
> concerned (how to handle confliction is discussed in 3.3)
> 1) libxc domain builder
>   Here coarse-grained gfn layout is created, including two contiguous
> guest RAM trunks (lowmem and/or highmem) and mmio holes (VGA, PCI),
> which are passed to hvmloader for later fine-grained manipulation. Guest
> RAM trunks are populated with valid translation setup in underlying p2m
> layer. Device reserved regions must be detected in that layout.
> 2) Xen hypervisor device assignment
>   Device assignment can happen either at VM creation time (after domain
> builder), or anytime thru hotplug after VM is booted. Regardless of
> how userspace handles confliction, Xen hypervisor will always do the
> last-conservative detection when setting up identity mapping:
>         * gfn space unoccupied:
>                 -> insert identity mapping; no confliction
>         * gfn space already occupied with identity mapping:
>                 -> do nothing; no confliction
>         * gfn space already occupied with other mapping:
>                 -> confliction detected
> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflictions when conducting
> those operations. If there's no confliction, hvmloader will reserve
> those regions in guest e820 to let guest OS aware.

I think this can be summarized a bit more clearly by what each bit of
code needs to actually do:

1. libxc
 - RMRR areas must be left unpopulated in gfn space at boot time.

2. Xen
 - When a device with RMRRs is assigned, Xen must make an
identity-mapping of the appropriate RMRR ranges.

3. hvmloader
 - hvmloader must report, in the e820 map, the RMRRs of all devices
which may ever be assigned to the guest
 - when placing devices in MMIO space, hvmloader must avoid placing
MMIO devices over RMRR regions which are / may be assigned to a guest.
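
To make the Xen side (item 2) concrete, here is a rough sketch in C of
the three cases from the quoted list: unoccupied, already
identity-mapped, and occupied by something else.  The flat-array p2m,
INVALID_MFN and rmrr_identity_map() are all made up for illustration;
this is not Xen's actual p2m interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_MFN UINT64_MAX  /* toy marker for "gfn not populated" */

/*
 * Toy p2m: p2m[gfn] holds the mfn backing that gfn, or INVALID_MFN.
 * Hypothetical model only; Xen's real p2m is not a flat array.
 *
 * Returns 0 on success, -1 if the range is out of bounds or any gfn in
 * it is already occupied by a non-identity mapping.
 */
static int rmrr_identity_map(uint64_t *p2m, size_t p2m_len,
                             uint64_t start_gfn, uint64_t nr)
{
    uint64_t gfn;

    assert(p2m != NULL);
    if (start_gfn > p2m_len || nr > p2m_len - start_gfn)
        return -1;                      /* range outside the toy p2m */

    /* Pass 1: detect conflicts before changing anything. */
    for (gfn = start_gfn; gfn < start_gfn + nr; gfn++)
        if (p2m[gfn] != INVALID_MFN && p2m[gfn] != gfn)
            return -1;                  /* occupied by another mapping */

    /* Pass 2: fill unoccupied entries; existing 1:1 entries are unchanged. */
    for (gfn = start_gfn; gfn < start_gfn + nr; gfn++)
        p2m[gfn] = gfn;

    return 0;
}
```

Checking the whole range before touching it means a conflict leaves
the p2m exactly as it was, so a failed assignment needs no unwinding.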

One component I think may be missing here -- qemu-traditional is very
tolerant with regard to the gpfn space; but qemu-upstream expects to
know the layout of guest gpfn space, and may crash if its idea of gpfn
space doesn't match Xen's idea.  Unfortunately, however, there is not
a very close link between these two at the moment; IIUC at the moment
this is limited to the domain builder telling qemu how big the lowmem
PCI hole will be.  Any solution which marks GPFN space as "non-memory"
needs to make sure this is communicated to qemu-upstream as well.

> 3.3 Policies
> ----
> An intuitive thought is to fail immediately upon a confliction, however
> it is not flexible regarding to different requirments:
> a) it's not appropriate to fail libxc domain builder just because such
> confliction. We still want the guest to boot even w/o assigned device;
> b) whether to fail in hvmloader has several dependencies. If it's
> to check for hotplug preparation, warning is also an acceptable option
> since assignment may not happen at all. Or if it's a USB controller
> but user doesn't care about legacy keyboard emulation, it's also OK to
> move forward upon a confliction;
> c) in Xen hypervisor it is reasonable to fail upon confliction, where
> device is actually assigned. But due to the same requirement on USB
> controller, sometimes we might want it succeed just w/ warnings.
> Regarding to the complexity of addressing all above flexibilities (user
> preferences, per-device), which requires inventing quite some parameters
> passed among different components, and regarding to the fact that
> failures would be rare (except some USB) with proactive avoidance
> in userspace, we'd like to propose below simplified policy following
> [Guideline-4]:
> - 'warn' conflictions in user space (libxc and hvmloader)
> - a boot option to specify 'fail' or 'warn' confliction in Xen device
> assignment path, default to 'fail' (user can set to 'warn' for USB case)
> Such policy provides a relaxed user space policy w/ hypervisor to do
> final judge. It has a unique merit to simplify later interface design
> and hotplug support, w/o breaking [Guideline-1/2] even when all possible
> reserved regions are exposed.
>     ******agreement is first required on above policy******

So the important part of policy is what the user experience is.  I
think we can assume that all device assignment will happen through
libxl; so from a user interface perspective we mainly want to be
thinking about the xl / libxl interface.

How the various sub-components react if something unexpected happens
is then just a matter of robust system design.

So first of all, I think RMRR reservations should be specified at
domain creation time.  If a user tries to assign a device with RMRRs
to a VM that has not reserved those ranges at creation time, the
assignment should fail.

The main place this checking should happen is in the toolstack
(libxl).  The toolstack can then give a sensible error message to the
user, which may include things they can do to fix the problem.
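
As a rough sketch of that toolstack check (in C; struct gfn_range and
both function names are hypothetical, not existing libxl code): at
assignment time, every RMRR of the device must lie inside a range the
domain reserved at creation, otherwise the assignment is refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive gfn range [start, end]; a hypothetical type for illustration. */
struct gfn_range {
    uint64_t start;
    uint64_t end;
};

/*
 * True if 'r' lies entirely inside one of the 'n' reserved ranges.
 * Simplification: adjacent reserved ranges are not merged first.
 */
static bool range_is_reserved(const struct gfn_range *reserved, size_t n,
                              struct gfn_range r)
{
    size_t i;

    assert(r.start <= r.end);
    for (i = 0; i < n; i++)
        if (r.start >= reserved[i].start && r.end <= reserved[i].end)
            return true;
    return false;
}

/*
 * Returns the index of the first device RMRR which the domain did not
 * reserve at creation, or -1 if every RMRR is covered and the
 * assignment may proceed.
 */
static int first_unreserved_rmrr(const struct gfn_range *reserved,
                                 size_t nr_reserved,
                                 const struct gfn_range *rmrr,
                                 size_t nr_rmrr)
{
    size_t i;

    for (i = 0; i < nr_rmrr; i++)
        if (!range_is_reserved(reserved, nr_reserved, rmrr[i]))
            return (int)i;
    return -1;
}
```

Returning the index of the offending RMRR, rather than a bare failure,
is what lets the toolstack tell the user which range would have had to
be reserved at creation.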

In the case of statically-assigned devices, the toolstack can look at
the RMRRs required and make sure to reserve them at domain creation time.

For dynamically-assigned devices, I think there should be an option to
make the guest's memory layout mirror the host: this would include the
PCI hole and all RMRR ranges.  This would be off by default.

We could imagine a way of specifying "I may want to assign this pool
of devices to this VM", or to manually specify RMRR ranges which
should be reserved, but I think that's a bit more advanced than we
really need right now.

> 3.5 New interface: expose reserved region information

It's not clear to me who this new interface is being exposed to.

It seems to me what we want is for the toolstack to figure out, at
guest creation time, what RMRRs should be reserved for this VM, and
probably put that information in xenstore somewhere, where it's
available to hvmloader.  I assume the RMRR information is already
available through sysfs in dom0?

One question: where are these RMRRs typically located in memory?  Are
they normally up in the MMIO region?  Or can they occur anywhere (even
in really low areas, say, under 1GiB)?

If RMRRs almost always happen up above 2G, for example, then a simple
solution that wouldn't require too much work would be to make sure
that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
enough to include all RMRRs.  That would satisfy the libxc and qemu
requirements.

If we then store specific RMRRs we want included in xenstore,
hvmloader can put them in the e820 map, and that would satisfy the
hvmloader requirement.
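
For illustration, the hvmloader step might look roughly like the C
sketch below.  The struct layout and e820_add_rmrrs() are assumptions
of mine rather than hvmloader's actual code, and I'm assuming the
ranges arrive as byte base/size pairs; only the type value 2 for
"reserved" is the standard e820 one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM       1
#define E820_RESERVED  2

/* Simplified e820 entry; layout here is for illustration only. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Append one E820_RESERVED entry per RMRR to 'map', which currently
 * holds 'nr' entries and has room for 'max'.  Bases and sizes are in
 * bytes.  Returns the new entry count, or -1 if the table is too
 * small.  Sorting and overlap checks against existing entries are
 * deliberately left out of this sketch.
 */
static int e820_add_rmrrs(struct e820entry *map, size_t nr, size_t max,
                          const uint64_t *rmrr_base,
                          const uint64_t *rmrr_size, size_t nr_rmrr)
{
    size_t i;

    assert(nr <= max);
    if (nr_rmrr > max - nr)
        return -1;

    for (i = 0; i < nr_rmrr; i++) {
        map[nr + i].addr = rmrr_base[i];
        map[nr + i].size = rmrr_size[i];
        map[nr + i].type = E820_RESERVED;
    }
    return (int)(nr + nr_rmrr);
}
```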

Then when we assign the device, those ranges will be already unused in
the p2m, and (if I understand correctly) Xen will already map the RMRR
ranges 1-1 upon device assignment.

What do you think?

If making the RMRRs fit inside the guest MMIO hole is not practical
(for example, if the ranges occur very low in memory), then we'll have
to come up with a way to specify, both to libxc and to qemu, where
these holes in memory are.
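
The "does it fit in the MMIO hole" test could be sketched along these
lines (C; the function name, the idea of a configurable lowest allowed
hole start, and the assumption that the hole runs up to 4GiB are mine,
not something libxc does today).  It returns the hole start needed to
cover all the RMRRs, or 0 if some RMRR lies too low for the hole to be
a practical answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GiB(x) ((uint64_t)(x) << 30)

/*
 * The low MMIO hole is modelled as [start, 4GiB).  Starting from
 * 'default_start', lower the hole start until every RMRR (byte
 * base/size pairs) falls inside it.
 *
 * Returns the resulting hole start, or 0 if any RMRR begins below
 * 'min_start' or runs past 4GiB, i.e. the hole cannot cover it.
 * Alignment of the hole start is left out of this sketch.
 */
static uint64_t mmio_hole_start_for_rmrrs(uint64_t default_start,
                                          uint64_t min_start,
                                          const uint64_t *base,
                                          const uint64_t *size, size_t n)
{
    uint64_t start = default_start;
    size_t i;

    assert(default_start <= GiB(4));
    for (i = 0; i < n; i++) {
        uint64_t end = base[i] + size[i];   /* exclusive end */

        if (end < base[i] || base[i] < min_start || end > GiB(4))
            return 0;
        if (base[i] < start)
            start = base[i];
    }
    return start;
}
```

A return of 0 is the case where we'd need the more general mechanism
for describing holes to libxc and qemu.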

> 3.8 Xen: Handle devices sharing reserved regions
> ----
> Per VT-d spec, it's possible to have two devices sharing same reserved
> region. Though we didn't see such example in reality, hypervisor needs
> to detect and handle such scenario, otherwise vulnerability may exist
> if two devices are assigned to different VMs (so a malicious VM may
> program its assigned device to clobber the shared region to malform
> another VM's device)
> Ideally all devices sharing reserved regions should be assigned to a
> single VM. However achieving this goal can't be done sole in hypervisor
> w/o reworking current device assignment interface. Assignment is managed
> by toolstack, which requires exposing group sharing information to
> userspace and then extends toolstack to manage assignment in bundle.
> Given the problem only in ideal space, we propose to not support such
> scenario, i.e. having hypervisor to fail the assignment, if the target
> device happens to share some reserved regions with another device,
> following [Guideline-4] to keep things simple.

I think denying it by default, first in the toolstack and as a
fall-back in the hypervisor, is a good idea.

It shouldn't be too difficult, however, to add an option to override
this.  We have a lot of individual users who use Xen for device
pass-through; such advanced users should be allowed to "shoot
themselves in the foot" if they want to.
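
A minimal sketch of that deny-by-default check with an override, in C.
struct rmrr_user, check_shared_rmrr() and the allow_shared flag are
hypothetical names; how the set of in-use RMRRs is tracked is left
open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One RMRR already in use: inclusive gfn range plus the owning domain. */
struct rmrr_user {
    uint64_t start;
    uint64_t end;
    int domid;
};

/*
 * Decide whether a device whose RMRR is [start, end] may be assigned
 * to 'domid'.  Returns -1 if the range overlaps an RMRR used by a
 * device in a different domain and the override is not set, else 0.
 */
static int check_shared_rmrr(const struct rmrr_user *users, size_t n,
                             uint64_t start, uint64_t end,
                             int domid, bool allow_shared)
{
    size_t i;

    assert(start <= end);
    for (i = 0; i < n; i++) {
        bool overlap = start <= users[i].end && users[i].start <= end;

        if (overlap && users[i].domid != domid && !allow_shared)
            return -1;
    }
    return 0;
}
```

The same predicate could run first in the toolstack (for a readable
error) and again in the hypervisor as the fall-back.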


