
Re: [Xen-devel] (v2) Design proposal for RMRR fix

>>> On 26.12.14 at 12:23, <kevin.tian@xxxxxxxxx> wrote:
> [Issue-2] Lacking goal-b), existing device assignment with RMRR
> works only when the reserved regions happen not to conflict with
> other valid allocations in the guest physical address space. This can
> lead to unpredictable failures in various deployments, due to
> undetected conflicts caused by platform and VM configuration
> differences.
> One example is USB controller assignment. It has already been
> identified as a problem on some platforms that USB reserved regions
> conflict with the guest BIOS region. However, given that the host
> BIOS only touches those reserved regions for legacy keyboard
> emulation in the early Dom0 boot phase, a trick was added to Xen to
> bypass RMRR handling for USB controllers.

s/trick/hack/ - after all, doing this is not safe. Plus if these regions
really were needed only for early boot legacy keyboard emulation,
they wouldn't need expressing as RMRR afaict, or if that really was
a requirement a suitable flag should be added to tell the OS that
once a proper driver is in place for the device, the RMRR won't be
needed anymore. In any event - the hack needs to go away.

> [Issue-3] Devices may share the same reserved regions, however
> there is no logic to handle this in Xen. Assigning such devices to
> different VMs could lead to security concerns

s/could lead to/is a/

> [Guideline-3] The new interface should be kept as common as possible
>   A new interface will be introduced to expose reserved regions to
> user space. Though RMRR is VT-d specific terminology, the interface
> design should be generic enough, i.e. support a function which
> allows the hypervisor to force reserving one or more gfn ranges.

s/hypervisor/user space/ ? Or else I don't see the connection between
the new interface and the enforcement of the reserved ranges.
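For illustration only (all names here are invented; this is not an existing Xen interface), such a generic, VT-d-agnostic record of a forced-reserved gfn range and a basic sanity check on it might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, VT-d-agnostic description of a forced-reserved gfn
 * range, as a generic interface might hand it to user space. RMRR
 * would be just one producer of such records. */
struct reserved_gfn_range {
    uint64_t start_gfn;   /* first reserved guest frame number */
    uint64_t nr_gfns;     /* number of reserved frames */
};

/* Reject empty or wrapping ranges before acting on them. */
static bool reserved_range_valid(const struct reserved_gfn_range *r)
{
    return r->nr_gfns != 0 &&
           r->start_gfn + r->nr_gfns > r->start_gfn; /* no overflow */
}
```

Keeping the record free of any VT-d terminology is what would allow non-RMRR users of the same mechanism later.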

> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflicts when performing
> those operations. If there is no conflict, hvmloader will reserve
> those regions in the guest e820 to make the guest OS aware of them.

Ideally, rather than detecting conflicts, hvmloader would just
consume what libxc set up. Obviously that would require awareness
in libxc of things it currently doesn't care about (like fitting PCI BARs
into the MMIO hole, enlarging it as necessary). I admit that this may
end up being difficult to implement. Another alternative would be to
have libxc only populate a limited part of RAM (for hvmloader to be
loadable), and have hvmloader do the bulk of the populating.

>>>>3.3 Policies
> ----
> An intuitive approach is to fail immediately upon a conflict, however
> that is not flexible given the differing requirements:
> a) it's not appropriate to fail the libxc domain builder just because
> of such a conflict. We still want the guest to boot even w/o the
> assigned device;

I don't think that's right (and I believe this was discussed before):
When device assignment fails, VM creation should fail too. It is the
responsibility of the host admin in that case to remove some or all
of the to be assigned devices from the guest config.

> b) whether to fail in hvmloader has several dependencies. If the
> check is for hotplug preparation, a warning is also an acceptable
> option since assignment may not happen at all. Or if it's a USB
> controller but the user doesn't care about legacy keyboard emulation,
> it's also OK to move forward upon a conflict;

Again assuming that RMRRs for USB devices are _only_ used for
legacy keyboard emulation, which may or may not be true.

> c) in the Xen hypervisor it is reasonable to fail upon conflict, as
> that is where the device is actually assigned. But due to the same
> requirement on USB controllers, sometimes we might want it to succeed
> just w/ warnings.

But only when asked to do so by the host admin.

> Given the complexity of addressing all of the above flexibility (user
> preferences, per-device differences), which would require inventing
> quite a few parameters passed among different components, and given
> that failures would be rare (except for some USB cases) with
> proactive avoidance in user space, we'd like to propose the
> simplified policy below, following [Guideline-4]:
> - 'warn' on conflicts in user space (libxc and hvmloader)
> - a boot option to specify 'fail' or 'warn' on conflicts in the Xen
> device assignment path, defaulting to 'fail' (the user can set it to
> 'warn' for the USB case)

I think someone else (Tim?) already said this: Such a "warn" option
would be unlikely to be desirable as a global one, affecting all
devices, but should rather be a flag settable on particular devices.
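As a purely hypothetical illustration of such a per-device flag (the option name and syntax are invented here, not an existing xl feature), the guest config might mark individual devices as tolerating RMRR conflicts:

```
# Hypothetical per-device policy: fail assignment on RMRR conflict by
# default, but only warn for the USB controller.
pci = [ '0000:00:1b.0',                   # audio: default 'fail'
        '0000:00:1d.0,rdm_policy=warn' ]  # USB: tolerate conflicts
```

This keeps the relaxation scoped to exactly the device the admin has judged safe, instead of weakening the check for the whole host.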

>>>>3.5 New interface: expose reserved region information
> ----
> As explained in [Guideline-3], we'd like to keep this interface
> general enough to serve as a common interface for the hypervisor to
> force reserving gfn ranges, for various reasons (RMRR is one client
> of this feature).
> One design question was discussed back and forth: whether the
> interface should return regions reported for all devices in the
> platform (report-all), or selectively return only regions belonging
> to assigned devices (report-sel). report-sel can be built on top of
> report-all, with extra work to help the hypervisor generate filtered
> regions (e.g. introduce a new interface, or make device assignment
> happen before the domain builder).
> We propose report-all as the simple solution (different from the last
> version sent, which used report-sel), based on the following facts:
>   - the 'warn' policy in user space makes report-all harmless
>   - 'report-all' still means only a few entries in reality:
>     * RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification;
>     * RMRR reserved regions are only a few on real platforms, per our
> observations so far;

Few yes, but in the IGD example you gave the region is quite large,
and it would be fairly odd to have all guests have a strange, large
hole in their address spaces. Furthermore remember that these
holes vary from machine to machine, so a migratable guest would
needlessly end up having a hole potentially not helping subsequent
hotplug at all.

> In this way, there are two situations in which the libxc domain
> builder may request reserved region information w/ the same
> interface:
> a) if there are any statically-assigned devices, and/or
> b) if a new parameter is specified, asking for hotplug preparation
>       ('rdm_check' or 'prepare_hotplug'?)
> The 1st invocation of this interface saves all reported reserved
> regions in the domain structure, and later invocations (e.g. from
> hvmloader) get the saved content.

Why would the reserved regions need attaching to the domain
structure? The combination of (to be) assigned devices and
global RMRR list always allow reproducing the intended set of
regions without any extra storage.
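To make that point concrete, here's a minimal sketch (types and names invented for illustration, not actual Xen code) of recomputing a domain's reserved regions from the global RMRR table plus its device list, with no per-domain caching:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global RMRR table entry: a gfn range tied to the
 * device (segment/bus/dev/fn) it belongs to. */
struct rmrr_entry {
    uint64_t base_gfn;
    uint64_t nr_gfns;
    uint16_t sbdf;       /* owning device identifier */
};

/*
 * Collect the reserved regions relevant to a domain by intersecting
 * the global table with the domain's (to be) assigned devices.
 * Returns the number of matching entries; pointers to up to max_out
 * of them are stored in out[].
 */
static size_t domain_reserved_regions(const struct rmrr_entry *tbl,
                                      size_t nr_entries,
                                      const uint16_t *devs, size_t nr_devs,
                                      const struct rmrr_entry **out,
                                      size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < nr_entries; i++)
        for (size_t j = 0; j < nr_devs; j++)
            if (tbl[i].sbdf == devs[j]) {
                if (found < max_out)
                    out[found] = &tbl[i];
                found++;
                break;
            }

    return found;
}
```

Since the global table is static for the lifetime of the host, any caller can rerun this intersection whenever it needs the region set.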

>>>>3.6 Libxc/hvmloader: detect and avoid conflicts
> ----
> libxc needs to detect reserved region conflicts with:
>       - guest RAM
>       - the monolithic PCI MMIO hole
> hvmloader needs to detect reserved region conflicts with:
>       - guest RAM
>       - PCI MMIO allocation
>       - memory allocation
>       - some e820 entries like the ACPI Opregion, etc.

- BIOS and alike
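All of these checks reduce to the same interval-intersection test; a minimal sketch (helper name invented) of how libxc or hvmloader might test a candidate allocation against one reserved range:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A candidate allocation [alloc_base, alloc_base + alloc_len)
 * conflicts with a reserved region [rsvd_base, rsvd_base + rsvd_len)
 * iff the two half-open intervals intersect. Illustrative helper,
 * not actual Xen code.
 */
static bool region_conflicts(uint64_t rsvd_base, uint64_t rsvd_len,
                             uint64_t alloc_base, uint64_t alloc_len)
{
    return rsvd_base < alloc_base + alloc_len &&
           alloc_base < rsvd_base + rsvd_len;
}
```

Each consumer would simply loop this test over its own allocation list and the reported reserved regions.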

> There are several other options discussed so far:
> a) Duplicate the same relocation algorithm within the libxc domain
> builder (when populating the physmap) and hvmloader (when creating
> the e820)
>   - Pros:
>       * no interface/structure change
>       * hvmloader needs to handle reserved regions anyway
>   - Cons:
>       * duplication is not good
> b) pass sparse information through Xenstore
>   (no concrete idea yet; input needed from toolstack maintainers)
> c) utilize the XENMEM_{set,}_memory_map pair of hypercalls, with
> libxc setting and hvmloader getting. An extension is required to
> allow HVM guests to invoke them.
>   - Pros:
>       * centralized ownership in libxc; flexible for extension
>   - Cons:
>       * limits entries to E820MAX (should be fine)
>       * hvmloader e820 construction may become more complex, given
> two predefined tables (reserved_regions, memory_map)

d) Move down the lowmem RAM/MMIO boundary so that a single,
contiguous chunk of lowmem results, with all other RAM moving up
beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
considered here, and I think we can reasonably safely assume that
no RMRRs will ever report ranges above 1Mb but below the host
lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
that the lowmem chunk will always be reasonably big).
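A minimal sketch of that boundary computation (names and the simplified RMRR representation are invented for illustration): lower the guest lowmem boundary to sit below every RMRR above 1Mb, leaving lowmem one contiguous chunk.

```c
#include <stddef.h>
#include <stdint.h>

#define MB(n) ((uint64_t)(n) << 20)

/*
 * Illustrative helper for option d): pick the guest lowmem RAM/MMIO
 * boundary as the host boundary, lowered below any RMRR base that
 * lies above 1Mb. RMRRs below 1Mb are deliberately ignored, and
 * RMRRs at or above the host boundary don't constrain lowmem.
 * RAM displaced by the lowered boundary would move above 4Gb.
 */
static uint64_t lowmem_boundary(uint64_t host_boundary,
                                const uint64_t *rmrr_bases, size_t n)
{
    uint64_t boundary = host_boundary;

    for (size_t i = 0; i < n; i++)
        if (rmrr_bases[i] >= MB(1) && rmrr_bases[i] < boundary)
            boundary = rmrr_bases[i];

    return boundary;
}
```

The underlying assumption, as stated above, is that no RMRR ever sits between 1Mb and the host lowmem boundary in a way that would make the resulting chunk unreasonably small.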

> 4. Plan
> =====================================================================
> We're seeking an incremental way to split the above tasks into 2
> stages, where each stage moves forward a step w/o causing
> regressions. Doing so can benefit people who want to use device
> assignment early, and also helps new developers ramp up, toward a
> final sane solution.
> 4.1 Stage-1: hypervisor hardening
> ----
>   [Tasks]
>       1) Set up RMRR identity mappings in the p2m layer with conflict
> detection
>       2) add a boot option for the fail/warn policy
>       3) remove the USB hack
>       4) detect and fail device assignment w/ shared reserved regions
>   [Enhancements]
>       * fix [Issue-1] and [Issue-3]

According to what you wrote earlier, [Issue-3] is not intended to be
fixed, but instead devices sharing the same RMRR(s) are to be
declared unassignable.

>       * partially fix [Issue-2], with limitations:
>               - w/o user space relocation there's a larger chance of
> seeing conflicts.
>               - w/o reservations in the guest e820, the guest OS may
> allocate a reserved pfn when re-enumerating PCI resources
>   [Regressions]
>       * devices which could be assigned successfully before may fail
> now due to conflict detection. However it's not a regression per se,
> and the user can change the policy to 'warn' if required.

Avoiding such a (perceived) regression would seem to be possible by
intermixing hypervisor and libxc/hvmloader adjustments.

