[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] (v2) Design proposal for RMRR fix

Ping, in case this mail was buried during the long holiday. :-)

> From: Tian, Kevin
> Sent: Friday, December 26, 2014 7:23 PM
> (Please note that some proposals differ from the last version sent,
> following further discussion. I have tried to summarize the earlier
> discussions and explain why we chose a different approach. Apologies if
> I have missed any open questions or conclusions from the past months;
> please point them out, which would be much appreciated. :-)
> ----
> TOC:
>       1. What's RMRR
>       2. RMRR status in Xen
>       3. High Level Design
>               3.1 Guidelines
>               3.2 Conflict detection
>               3.3 Policies
>               3.4 Xen: setup RMRR identity mapping
>               3.5 New interface: expose reserved region information
>               3.6 Libxc/hvmloader: detect and avoid conflicts
>               3.7 Hvmloader: reserve 'reserved regions' in guest E820
>               3.8 Xen: Handle devices sharing reserved regions
>       4. Plan
>               4.1 Stage-1: hypervisor hardening
>               4.2 Stage-2: libxc/hvmloader hardening
> 1. What's RMRR?
> =====================================================================
> RMRR is an acronym for Reserved Memory Region Reporting; it is expected
> to be used for legacy uses (such as USB, UMA graphics, etc.) that
> require reserved memory.
> (From the VT-d spec:)
> ----
> Reserved system memory regions are typically allocated by BIOS at boot
> time and reported to OS as reserved address ranges in the system memory
> map. Requests to these reserved regions may either occur as a result of
> operations performed by the system software driver (for example in the
> case of DMA from unified memory access (UMA) graphics controllers to
> graphics reserved memory) or may be initiated by non system software
> (for example in case of DMA performed by a USB controller under BIOS
> SMM control for legacy keyboard emulation).
> For proper functioning of these legacy reserved memory usages, when
> system software enables DMA remapping, the translation structures for
> the respective devices are expected to be set up to provide identity
> mapping for the specified reserved memory regions with read and write
> permissions. The system software is also responsible for ensuring
> that any input addresses used for device accesses to OS-visible memory
> do not overlap with the reserved system memory address ranges.
> BIOS may report each such reserved memory region through the RMRR
> structures, along with the devices that requires access to the
> specified reserved memory region. Reserved memory ranges that are
> either not DMA targets, or memory ranges that may be target of BIOS
> initiated DMA only during pre-boot phase (such as from a boot disk
> drive) must not be included in the reserved memory region reporting.
> The base address of each RMRR region must be 4KB aligned and the size
> must be an integer multiple of 4KB. If there are no RMRR structures,
> the system software concludes that the platform does not have any
> reserved memory ranges that are DMA targets.
> Platform designers should avoid or limit use of reserved memory regions
> since these require system software to create holes in the DMA virtual
> address range available to system software and its drivers.
> ----
> Below is one example from a BDW machine:
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address
> ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address
> af7fffff
> Here the first reserved region is for the USB controller, and the
> second belongs to the IGD.
> 2. RMRR status in Xen
> =====================================================================
> There are two main design goals, according to the VT-d spec:
> a) Set up identity mappings for reserved regions in the IOMMU page table
> b) Ensure reserved regions do not conflict with OS-visible memory
> (in a VM, OS-visible memory means guest physical memory; more strictly,
> there must also be no conflict with other types of allocations in the
> guest physical address space, such as PCI MMIO, ACPI, etc.)
> However, the current RMRR implementation in Xen only partially achieves
> a) and completely misses b), which causes several issues:
> --
> [Issue-1] The identity mapping is not set up in the shared-EPT case, so
> a device with an RMRR may not function correctly when assigned to a VM.
> This was the original problem we found when assigning the IGD on a BDW
> platform, which triggered the whole long discussion of the past months.
> --
> [Issue-2] Lacking goal b), existing device assignment with RMRR works
> only when the reserved regions happen not to conflict with other valid
> allocations in the guest physical address space. This can lead to
> unpredictable failures in various deployments, due to undetected
> conflicts caused by differences between platforms and VM configurations.
> One example concerns USB controller assignment. It has already been
> identified as a problem on some platforms that USB reserved regions
> conflict with the guest BIOS region. However, since the host BIOS only
> touches those reserved regions for legacy keyboard emulation in the
> early Dom0 boot phase, a trick was added to Xen to bypass RMRR handling
> for USB controllers.
> --
> [Issue-3] Devices may share the same reserved region, but there is no
> logic to handle this in Xen. Assigning such devices to different VMs
> could lead to a security concern.
> 3. High Level Design
> =====================================================================
> To achieve the two aforementioned goals, major enhancements are
> required across the Xen hypervisor, libxc, and hvmloader to address the
> gap in goal b), i.e. handling possible conflicts in gfn space. Fixing
> goal a) is straightforward.
> >>>3.1 Guidelines
> ----
> There are several guidelines considered in the design:
> --
> [Guideline-1] No regression in a VM w/o statically-assigned devices
>   If a VM isn't configured with assigned devices at creation, the new
> conflict detection logic shouldn't block the VM boot process (it should
> either be skipped, or just emit a warning).
> --
> [Guideline-2] No regression for devices which have no RMRR reported
>   If a VM is assigned a device which doesn't have an RMRR reported,
> whether statically or dynamically assigned, the new conflict detection
> logic shouldn't fail the assignment request for that device.
> --
> [Guideline-3] The new interface should be kept as generic as possible
>   A new interface will be introduced to expose reserved regions to user
> space. Though RMRR is VT-d-specific terminology, the interface design
> should be generic enough, i.e. it should support a function which
> allows the hypervisor to force-reserve one or more gfn ranges.
> --
> [Guideline-4] Keep changes simple
>   RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification. Per our observations, there are
> only a few reported examples (USB, IGD) on real platforms. So we need
> to balance code complexity against usage limitations. If a limitation
> only affects niche scenarios, we'd rather not support it, to keep the
> changes simple for now.
> >>>3.2 Conflict detection
> ----
> Conflicts must be detected in several places as far as the gfn space is
> concerned (how to handle a conflict is discussed in 3.3):
> 1) libxc domain builder
>   Here the coarse-grained gfn layout is created, including two
> contiguous guest RAM trunks (lowmem and/or highmem) and MMIO holes
> (VGA, PCI), which are passed to hvmloader for later fine-grained
> manipulation. The guest RAM trunks are populated with valid
> translations set up in the underlying p2m layer. Device reserved
> regions must be checked against that layout.
> 2) Xen hypervisor device assignment
>   Device assignment can happen either at VM creation time (after the
> domain builder), or at any time through hotplug after the VM has
> booted. Regardless of how user space handles conflicts, the Xen
> hypervisor will always perform a last, conservative check when setting
> up the identity mapping:
>       * gfn space unoccupied:
>               -> insert the identity mapping; no conflict
>       * gfn space already occupied by an identity mapping:
>               -> do nothing; no conflict
>       * gfn space already occupied by another mapping:
>               -> conflict detected
> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflicts when performing those
> operations. If there is no conflict, hvmloader reserves those regions
> in the guest e820 so the guest OS is aware of them.
> >>>3.3 Policies
> ----
> An intuitive approach is to fail immediately upon a conflict; however,
> that is inflexible with respect to differing requirements:
> a) it's not appropriate to fail the libxc domain builder just because
> of such a conflict; we still want the guest to boot even without the
> assigned device;
> b) whether to fail in hvmloader has several dependencies. If the check
> is for hotplug preparation, a warning is also an acceptable option,
> since the assignment may never happen. Or, if it's a USB controller but
> the user doesn't care about legacy keyboard emulation, it's also fine
> to continue despite a conflict;
> c) in the Xen hypervisor it is reasonable to fail upon conflict, since
> that is where the device is actually assigned. But due to the same
> requirement for USB controllers, sometimes we might want the assignment
> to succeed with just a warning.
> Given the complexity of addressing all the above flexibility (user
> preferences, per-device policy), which would require inventing quite a
> few parameters passed among the different components, and given that
> failures should be rare (except for some USB cases) with proactive
> avoidance in user space, we'd like to propose the simplified policy
> below, following [Guideline-4]:
> - 'warn' on conflicts in user space (libxc and hvmloader)
> - a boot option to choose 'fail' or 'warn' on conflict in the Xen
> device assignment path, defaulting to 'fail' (the user can set it to
> 'warn' for the USB case)
> Such a policy provides a relaxed user-space policy with the hypervisor
> as the final judge. It has the unique merit of simplifying the later
> interface design and hotplug support, without breaking [Guideline-1/2]
> even when all possible reserved regions are exposed.
>     ******agreement is first required on the above policy******
> >>>3.4 Xen: setup RMRR identity mapping
> ----
> Regardless of whether user space has detected a conflict, the Xen
> hypervisor always needs to detect conflicts itself when setting up
> identity mappings for reserved gfn regions, following the policy
> defined above.
> Identity mappings should really be handled in the general p2m layer, so
> that the same r/w permissions apply equally to the CPU and DMA access
> paths, regardless of whether EPT is shared with the IOMMU underneath.
> This matches the behavior on bare metal: although reserved regions are
> marked as E820_RESERVED, that is just a hint to system software, which
> can still read the data back because those bits physically exist. So in
> the virtualization case we don't need to treat CPU accesses to RMRR
> reserved regions specially (similar to other reserved regions like
> ACPI NVS).
> >>>3.5 New interface: expose reserved region information
> ----
> As explained in [Guideline-3], we'd like to keep this interface general
> enough to serve as a common way for the hypervisor to force-reserve gfn
> ranges for various reasons (RMRR being one client of this feature).
> One design question was discussed back and forth: whether the interface
> should return the regions reported for all devices in the platform
> (report-all), or selectively return only the regions belonging to
> assigned devices (report-sel). report-sel can be built on top of
> report-all, with extra work to help the hypervisor generate the
> filtered regions (e.g. introducing a new interface, or making device
> assignment happen before the domain builder).
> We propose report-all as the simple solution (different from the last
> version sent, which used report-sel), based on the following facts:
>   - the 'warn' policy in user space makes report-all harmless
>   - 'report-all' still means only a few entries in reality:
>     * RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification;
>     * there are only a few RMRR reserved regions on real platforms, per
> our observations so far;
>   - the OS needs to handle all the reserved regions on bare metal anyway
>   - it is hotplug friendly
>   - report-all can be extended to report-sel later if really required
> With this approach, there are two situations in which the libxc domain
> builder may query reserved region information through the same
> interface:
> a) if there are any statically-assigned devices, and/or
> b) if a new parameter is specified, asking for hotplug preparation
>       ('rdm_check' or 'prepare_hotplug'?)
> The first invocation of this interface saves all reported reserved
> regions in the domain structure, and later invocations (e.g. from
> hvmloader) retrieve the saved content.
> If a VM is configured without assigned devices, this interface is not
> invoked, so there is no impact and [Guideline-1] is upheld.
> If a VM is configured with assigned devices which don't have reserved
> regions, this interface is invoked. In some cases a warning may be
> emitted due to a conflict caused by other, non-assigned devices, but it
> is informational only and there is no impact on the assigned devices,
> so [Guideline-2] is upheld.
> >>>3.6 Libxc/hvmloader: detect and avoid conflicts
> ----
> libxc needs to detect reserved-region conflicts with:
>       - guest RAM
>       - the monolithic PCI MMIO hole
> hvmloader needs to detect reserved-region conflicts with:
>       - guest RAM
>       - PCI MMIO allocations
>       - memory allocations
>       - some e820 entries, such as the ACPI OpRegion, etc.
> When a conflict is detected, libxc/hvmloader first try to relocate the
> conflicting gfn resources to avoid it; a warning is emitted when such
> relocation fails. The relocation policy is straightforward for most
> resources; however, there remains a major design trade-off for guest
> RAM, regarding the handoff between libxc and hvmloader...
> In the current implementation, guest RAM is contiguous in gfn space,
> with at most two trunks: lowmem (<4G) and highmem (>4G), which are
> passed to hvmloader through hvm_info. If we now relocate guest RAM to
> avoid conflicts with reserved regions, sparse memory trunks are
> created, and introducing such a sparse structure into hvm_info is not
> considered an extensible approach.
> There are several other options discussed so far:
> a) Duplicate the same relocation algorithm in the libxc domain builder
> (when populating the physmap) and in hvmloader (when creating the e820)
>   - Pros:
>       * no interface/structure change
>       * hvmloader still needs to handle reserved regions anyway
>   - Cons:
>       * duplication is not good
> b) Pass the sparse information through xenstore
>   (no concrete idea yet; input from the toolstack maintainers is needed)
> c) Use the XENMEM_{set,}_memory_map pair of hypercalls, with libxc
> setting the map and hvmloader getting it. An extension is required to
> allow HVM guests to invoke them.
>   - Pros:
>       * centralized ownership in libxc; flexible for extension
>   - Cons:
>       * limits the number of entries to E820MAX (should be fine)
>       * hvmloader's e820 construction may become more complex, given
> two predefined tables (reserved_regions, memory_map)
> ********Inputs are required to find a good option here*********
> >>>3.7 hvmloader: reserve 'reserved regions' in guest E820
> ----
> If no conflict is detected, hvmloader needs to mark those reserved
> regions as E820_RESERVED in the guest E820 table, so that the guest OS
> is aware of them (and thus avoids problematic actions, e.g. when
> re-allocating PCI MMIO).
> >>>3.8 Xen: Handle devices sharing reserved regions
> ----
> Per the VT-d spec, it's possible for two devices to share the same
> reserved region. Though we haven't seen such an example in reality, the
> hypervisor needs to detect and handle this scenario; otherwise a
> vulnerability may exist when the two devices are assigned to different
> VMs (a malicious VM could program its assigned device to clobber the
> shared region and thereby corrupt the other VM's device).
> Ideally, all devices sharing a reserved region should be assigned to a
> single VM. However, this goal can't be achieved in the hypervisor alone
> without reworking the current device assignment interface: assignment
> is managed by the toolstack, so it would require exposing group-sharing
> information to user space and extending the toolstack to manage
> assignments as a bundle. Given that the problem is so far only
> theoretical, we propose not to support this scenario, i.e. the
> hypervisor fails the assignment if the target device happens to share a
> reserved region with another device, following [Guideline-4] to keep
> things simple.
> 4. Plan
> =====================================================================
> We're seeking an incremental approach, splitting the above tasks into
> two stages, with each stage moving forward a step without causing
> regressions. This benefits people who want to use device assignment
> early, and also helps new developers ramp up, on the way toward a final
> sane solution.
> 4.1 Stage-1: hypervisor hardening
> ----
>   [Tasks]
>       1) Set up RMRR identity mappings in the p2m layer, with conflict
> detection
>       2) Add a boot option for the fail/warn policy
>       3) Remove the USB hack
>       4) Detect and fail device assignment w/ shared reserved regions
>   [Enhancements]
>       * fixes [Issue-1] and [Issue-3]
>       * partially fixes [Issue-2], with limitations:
>               - without user-space relocation there is a larger chance
> of seeing conflicts
>               - without reservations in the guest e820, the guest OS
> may allocate a reserved pfn when re-enumerating PCI resources
>   [Regressions]
>       * devices which could previously be assigned successfully may now
> fail due to conflict detection. However, this is not a regression per
> se, and the user can change the policy to 'warn' if required.
> 4.2 Stage-2: libxc/hvmloader hardening
> ----
>   [Tasks]
>       5) Introduce a new interface to expose reserved region information
>       6) Detect and avoid reserved-region conflicts in libxc
>       7) Pass the libxc guest RAM layout to hvmloader
>       8) Detect and avoid reserved-region conflicts in hvmloader
>       9) Reserve 'reserved regions' in the guest E820 in hvmloader
>   [Enhancements]
>       * completely fixes [Issue-2]
>   [Regression]
>       * n/a
> Thanks,
> Kevin
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
