Re: [Xen-devel] (v2) Design proposal for RMRR fix
Ping in case this mail is hidden after the long holiday. :-)

> From: Tian, Kevin
> Sent: Friday, December 26, 2014 7:23 PM
>
> (please note some proposals differ from the last sent version after
> more discussion, but I have tried to summarize the previous
> discussions and explain why we chose a different way. Sorry if I miss
> some opens/conclusions discussed in past months. Please help point
> them out, which is very appreciated. :-)
>
> ----
> TOC:
> 1. What's RMRR?
> 2. RMRR status in Xen
> 3. High Level Design
>   3.1 Guidelines
>   3.2 Conflict detection
>   3.3 Policies
>   3.4 Xen: set up RMRR identity mapping
>   3.5 New interface: expose reserved region information
>   3.6 Libxc/hvmloader: detect and avoid conflicts
>   3.7 Hvmloader: reserve 'reserved regions' in guest E820
>   3.8 Xen: handle devices sharing reserved regions
> 4. Plan
>   4.1 Stage-1: hypervisor hardening
>   4.2 Stage-2: libxc/hvmloader hardening
>
> 1. What's RMRR?
> =====================================================================
>
> RMRR is an acronym for Reserved Memory Region Reporting, expected to
> be used for legacy usages (such as USB, UMA Graphics, etc.) requiring
> reserved memory.
>
> (From the VT-d spec)
> ----
> Reserved system memory regions are typically allocated by BIOS at boot
> time and reported to the OS as reserved address ranges in the system
> memory map. Requests to these reserved regions may either occur as a
> result of operations performed by the system software driver (for
> example in the case of DMA from unified memory access (UMA) graphics
> controllers to graphics reserved memory) or may be initiated by non
> system software (for example in case of DMA performed by a USB
> controller under BIOS SMM control for legacy keyboard emulation).
>
> For proper functioning of these legacy reserved memory usages, when
> system software enables DMA remapping, the translation structures for
> the respective devices are expected to be set up to provide identity
> mapping for the specified reserved memory regions with read and write
> permissions. The system software is also responsible for ensuring
> that any input addresses used for device accesses to OS-visible memory
> do not overlap with the reserved system memory address ranges.
>
> BIOS may report each such reserved memory region through the RMRR
> structures, along with the devices that require access to the
> specified reserved memory region. Reserved memory ranges that are
> either not DMA targets, or memory ranges that may be target of BIOS
> initiated DMA only during pre-boot phase (such as from a boot disk
> drive) must not be included in the reserved memory region reporting.
> The base address of each RMRR region must be 4KB aligned and the size
> must be an integer multiple of 4KB. If there are no RMRR structures,
> the system software concludes that the platform does not have any
> reserved memory ranges that are DMA targets.
>
> Platform designers should avoid or limit use of reserved memory
> regions since these require system software to create holes in the
> DMA virtual address range available to system software and its
> drivers.
> ----
>
> Below is one example from a BDW machine:
>
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
>
> Here the 1st reserved region is for the USB controller, with the 2nd
> one belonging to the IGD.
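>
> As a quick illustration of the alignment rules quoted above, a
> minimal standalone check might look like the sketch below (this is
> not Xen's actual DMAR parser; the struct and names are made up for
> illustration):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     #define RMRR_PAGE_SIZE 4096ULL
>
>     /* one reserved range as reported by an ACPI RMRR structure */
>     struct rmrr_range {
>         uint64_t base;      /* inclusive, e.g. 0xab80a000 */
>         uint64_t end;       /* inclusive, e.g. 0xab81dfff */
>     };
>
>     /* base must be 4KB aligned; size must be a 4KB multiple */
>     static bool rmrr_range_valid(const struct rmrr_range *r)
>     {
>         uint64_t size = r->end - r->base + 1;
>
>         return r->end > r->base &&
>                (r->base % RMRR_PAGE_SIZE) == 0 &&
>                (size % RMRR_PAGE_SIZE) == 0;
>     }
>
> Both example regions above pass this check (the first spans 0x14000
> bytes, the second 0x2800000 bytes).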
>
> 2. RMRR status in Xen
> =====================================================================
>
> There are two main design goals according to the VT-d spec:
>
> a) Set up identity mappings for reserved regions in the IOMMU page
> table
> b) Ensure reserved regions do not conflict with OS-visible memory
> (OS-visible memory in a VM means guest physical memory, and more
> strictly it also means no conflict with other types of allocations
> in the guest physical address space, such as PCI MMIO, ACPI, etc.)
>
> However the current RMRR implementation in Xen only partially
> achieves a) and completely misses b), which causes some issues:
>
> --
> [Issue-1] Identity mapping is not set up in the shared EPT case, so a
> device with RMRR may not function correctly if assigned to a VM.
>
> This was the original problem we found when assigning the IGD on a
> BDW platform, which triggered the whole long discussion in past
> months.
>
> --
> [Issue-2] Lacking goal b), existing device assignment with RMRR works
> only when reserved regions happen not to conflict with other valid
> allocations in the guest physical address space. This can lead to
> unpredictable failures in various deployments, due to undetected
> conflicts caused by platform and VM configuration differences.
>
> One example is USB controller assignment. It has already been
> identified as a problem on some platforms that USB reserved regions
> conflict with the guest BIOS region. However, given that the host
> BIOS only touches those reserved regions for legacy keyboard
> emulation at the early Dom0 boot phase, a trick was added in Xen to
> bypass RMRR handling for USB controllers.
>
> --
> [Issue-3] Devices may share the same reserved regions, but there is
> no logic to handle this in Xen. Assigning such devices to different
> VMs could lead to security concerns.
>
>
>
> 3. High Level Design
> =====================================================================
>
> To achieve the aforementioned two goals, major enhancements are
> required across the Xen hypervisor, libxc, and hvmloader, to address
> the gap in goal b), i.e. handling possible conflicts in gfn space.
> Fixing goal a) is straightforward.
>
> >>>3.1 Guidelines
> ----
> There are several guidelines considered in the design:
>
> --
> [Guideline-1] No regression in a VM w/o statically-assigned devices
>
> If a VM isn't configured with assigned devices at creation, the new
> conflict detection logic shouldn't block the VM boot progress (it is
> either skipped, or just throws warnings).
>
> --
> [Guideline-2] No regression on devices which do not have RMRR
> reported
>
> If a VM is assigned a device which doesn't have RMRR reported,
> whether statically-assigned or dynamically-assigned, the new conflict
> detection logic shouldn't fail the assignment request for this
> device.
>
> --
> [Guideline-3] New interface should be kept as common as possible
>
> A new interface will be introduced to expose reserved regions to user
> space. Though RMRR is a VT-d specific term, the interface design
> should be generic enough, i.e. it should support a function which
> allows the hypervisor to force reserving one or more gfn ranges.
>
> --
> [Guideline-4] Keep changes simple
>
> RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification. Per our observations, there
> are only a few reported examples (USB, IGD) on real platforms. So we
> need to balance code complexity against usage limitations. If a
> limitation only affects niche scenarios, we'd rather vote no-support
> to simplify the changes for now.
>
> >>>3.2 Conflict detection
> ----
> Conflicts must be detected in several places as far as gfns are
> concerned (how to handle a conflict is discussed in 3.3):
>
> 1) libxc domain builder
> Here the coarse-grained gfn layout is created, including two
> contiguous guest RAM trunks (lowmem and/or highmem) and MMIO holes
> (VGA, PCI), which are passed to hvmloader for later fine-grained
> manipulation. Guest RAM trunks are populated with valid translations
> set up in the underlying p2m layer. Device reserved regions must be
> detected in that layout.
>
> 2) Xen hypervisor device assignment
> Device assignment can happen either at VM creation time (after the
> domain builder), or anytime through hotplug after the VM has booted.
> Regardless of how user space handles conflicts, the Xen hypervisor
> will always perform this final, conservative detection when setting
> up identity mappings (a sketch of this check follows at the end of
> this section):
>   * gfn space unoccupied:
>     -> insert identity mapping; no conflict
>   * gfn space already occupied with identity mapping:
>     -> do nothing; no conflict
>   * gfn space already occupied with other mapping:
>     -> conflict detected
>
> 3) hvmloader
> Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflicts when conducting
> those operations. If there's no conflict, hvmloader will reserve
> those regions in the guest e820 to make the guest OS aware.
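>
> In pseudo-C, the three-way check in 2) boils down to something like
> the sketch below (p2m_lookup_mfn() is an illustrative stand-in for
> whatever query the p2m layer actually provides; all names here are
> made up):
>
>     #define INVALID_MFN (~0UL)
>
>     struct domain;                       /* opaque here */
>
>     /* illustrative stand-in for a real p2m query */
>     unsigned long p2m_lookup_mfn(struct domain *d, unsigned long gfn);
>
>     enum rmrr_check {
>         RMRR_DO_MAP,    /* unoccupied: insert identity mapping */
>         RMRR_NOP,       /* already identity-mapped: nothing to do */
>         RMRR_CONFLICT,  /* occupied by another mapping: conflict */
>     };
>
>     static enum rmrr_check check_rmrr_gfn(struct domain *d,
>                                           unsigned long gfn)
>     {
>         unsigned long mfn = p2m_lookup_mfn(d, gfn);
>
>         if ( mfn == INVALID_MFN )
>             return RMRR_DO_MAP;
>         if ( mfn == gfn )                /* identity: gfn == mfn */
>             return RMRR_NOP;
>         return RMRR_CONFLICT;
>     }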
>
> >>>3.3 Policies
> ----
> An intuitive thought is to fail immediately upon a conflict, but that
> is not flexible with regard to different requirements:
>
> a) it's not appropriate to fail the libxc domain builder just because
> of such a conflict. We still want the guest to boot even w/o the
> assigned device;
>
> b) whether to fail in hvmloader has several dependencies. If the
> check is done in preparation for hotplug, a warning is also an
> acceptable option since the assignment may not happen at all. Or if
> it's a USB controller but the user doesn't care about legacy keyboard
> emulation, it's also OK to move forward upon a conflict;
>
> c) in the Xen hypervisor it is reasonable to fail upon conflict,
> since that is where the device is actually assigned. But due to the
> same consideration for USB controllers, sometimes we might want it to
> succeed with just warnings.
>
> Given the complexity of addressing all the above flexibilities (user
> preferences, per-device policy), which would require inventing quite
> a few parameters passed among different components, and given that
> failures would be rare (except for some USB cases) with proactive
> avoidance in user space, we'd like to propose the simplified policy
> below, following [Guideline-4]:
>
> - 'warn' about conflicts in user space (libxc and hvmloader)
> - a boot option to specify 'fail' or 'warn' upon conflict in the Xen
> device assignment path, defaulting to 'fail' (the user can set it to
> 'warn' for the USB case)
>
> Such a policy provides a relaxed user space policy with the
> hypervisor as the final judge. It has the unique merit of simplifying
> the later interface design and hotplug support, w/o breaking
> [Guideline-1/2] even when all possible reserved regions are exposed.
>
> ******agreement is first required on above policy******
>
> >>>3.4 Xen: set up RMRR identity mapping
> ----
> Regardless of whether user space has detected a conflict, the Xen
> hypervisor always needs to detect conflicts itself when setting up
> identity mappings for reserved gfn regions, following the policy
> defined above.
>
> Identity mappings should really be handled in the general p2m layer,
> so the same r/w permissions apply equally to the CPU/DMA access
> paths, regardless of whether EPT is shared with the IOMMU underneath.
>
> This matches the behavior on bare metal: although reserved regions
> are marked as E820_RESERVED, that is just a hint to system software,
> which can still read data back because those bits physically exist.
> So in the virtualization case we don't need to treat CPU accesses to
> RMRR reserved regions specially (similar to other reserved regions
> like ACPI NVS).
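>
> Combining the 3.2 check with the proposed boot option, the
> hypervisor-side setup could look roughly like below (again just a
> sketch with illustrative helper names, not actual Xen code):
>
>     /* proposed boot option: false = 'fail' (default), true = 'warn' */
>     static bool rmrr_conflict_warn_only;
>
>     /* illustrative stand-in for inserting a p2m identity entry */
>     int set_identity_p2m_entry(struct domain *d, unsigned long gfn);
>
>     static int setup_rmrr_identity_mapping(struct domain *d,
>                                            unsigned long base_pfn,
>                                            unsigned long end_pfn)
>     {
>         unsigned long pfn;
>
>         for ( pfn = base_pfn; pfn <= end_pfn; pfn++ )
>         {
>             switch ( check_rmrr_gfn(d, pfn) ) /* from the 3.2 sketch */
>             {
>             case RMRR_DO_MAP:
>                 set_identity_p2m_entry(d, pfn);
>                 break;
>             case RMRR_NOP:
>                 break;                        /* already identity */
>             case RMRR_CONFLICT:
>                 printk("pfn %lx conflicts with reserved region\n", pfn);
>                 if ( !rmrr_conflict_warn_only )
>                     return -EBUSY;            /* 'fail' policy */
>                 break;                        /* 'warn' policy */
>             }
>         }
>         return 0;
>     }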
>
> >>>3.5 New interface: expose reserved region information
> ----
> As explained in [Guideline-3], we'd like to keep this interface
> general enough: a common interface for the hypervisor to force
> reserving gfn ranges, for various reasons (RMRR is one client of this
> feature).
>
> One design open was discussed back and forth: whether the interface
> should return regions reported for all devices in the platform
> (report-all), or selectively return only regions belonging to
> assigned devices (report-sel). report-sel can be built on top of
> report-all, with extra work to help the hypervisor generate filtered
> regions (e.g. introduce a new interface, or make device assignment
> happen before the domain builder).
>
> We propose report-all as the simple solution (different from the last
> sent version, which used report-sel), based on the following facts:
>
> - the 'warn' policy in user space makes report-all harmless
> - 'report-all' still means only a few entries in reality:
>   * RMRR reserved regions should be avoided or limited by platform
>     designers, per the VT-d specification;
>   * RMRR reserved regions are only a few on real platforms, per our
>     observations so far;
> - the OS needs to handle all the reserved regions on bare metal
>   anyway;
> - it is hotplug friendly;
> - report-all can be extended to report-sel if really required.
>
> In this way, there are two situations in which the libxc domain
> builder may query reserved region information through the same
> interface:
>
> a) if there are any statically-assigned devices, and/or
> b) if a new parameter is specified, asking for hotplug preparation
> ('rdm_check' or 'prepare_hotplug'?)
>
> The 1st invocation of this interface saves all reported reserved
> regions under the domain structure, and later invocations (e.g. from
> hvmloader) get the saved content.
>
> If a VM is configured w/o assigned devices, this interface is not
> invoked, so there's no impact and [Guideline-1] is enforced.
>
> If a VM is configured w/ assigned devices which don't have reserved
> regions, this interface is invoked. In some cases warnings may be
> thrown out due to conflicts caused by other, non-assigned devices,
> but they are just informational and there is no impact on the
> assigned devices, so [Guideline-2] is enforced.
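>
> To make the proposal concrete, the interface could be shaped as a
> memory-op style hypercall along the lines below (everything here is
> hypothetical, for illustration only; the real names and encoding are
> up for discussion):
>
>     /* one reserved region, in frame numbers */
>     struct xen_reserved_region {
>         uint64_t start_pfn;
>         uint64_t nr_pages;
>     };
>
>     struct xen_reserved_region_map {
>         domid_t  domid;       /* in: which domain to query */
>         uint32_t nr_entries;  /* in: buffer size in entries;
>                                  out: number of regions reported */
>         XEN_GUEST_HANDLE(xen_reserved_region) buffer;
>     };
>
>     /* e.g. HYPERVISOR_memory_op(XENMEM_reserved_region_map, &map),
>      * called first by libxc (result cached under the domain
>      * structure) and later by hvmloader to read the cached copy. */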
>
> >>>3.6 Libxc/hvmloader: detect and avoid conflicts
> ----
> libxc needs to detect reserved region conflicts with:
> - guest RAM
> - the monolithic PCI MMIO hole
>
> hvmloader needs to detect reserved region conflicts with:
> - guest RAM
> - PCI MMIO allocation
> - memory allocation
> - some e820 entries like the ACPI Opregion, etc.
>
> When a conflict is detected, libxc/hvmloader first try to relocate
> the conflicting gfn resources to avoid the conflict. A warning is
> thrown out when such relocation fails. The relocation policy is
> straightforward for most resources, but there remains a major design
> tradeoff for guest RAM, regarding the handoff between libxc and
> hvmloader...
>
> In the current implementation, guest RAM is contiguous in gfn space,
> w/ at most two trunks: lowmem (<4G) and highmem (>4G), which are
> passed to hvmloader through hvm_info. Now, relocating guest RAM to
> avoid conflicts with reserved regions creates sparse memory trunks,
> and adding such a sparse structure to hvm_info is not considered an
> extensible approach.
>
> There are several other options discussed so far:
>
> a) Duplicate the same relocation algorithm in the libxc domain
> builder (when populating the physmap) and hvmloader (when creating
> the e820)
> - Pros:
>   * no interface/structure change
>   * hvmloader still needs to handle reserved regions anyway
> - Cons:
>   * duplication is not good
>
> b) Pass the sparse information through Xenstore
> (no concrete idea yet; input from toolstack maintainers is needed)
>
> c) Utilize the XENMEM_{set,}_memory_map pair of hypercalls, with
> libxc doing the set and hvmloader the get. An extension is required
> to allow an HVM guest to invoke them.
> - Pros:
>   * centralized ownership in libxc; flexible for extension
> - Cons:
>   * entries limited to E820MAX (should be fine)
>   * hvmloader e820 construction may become more complex, given
>     two predefined tables (reserved_regions, memory_map)
>
> ********Inputs are required to find a good option here*********
>
> >>>3.7 Hvmloader: reserve 'reserved regions' in guest E820
> ----
> If no conflict is detected, hvmloader needs to mark those reserved
> regions as E820_RESERVED in the guest E820 table, so the guest OS is
> aware of them (and thus does not perform problematic actions, e.g.
> when re-allocating PCI MMIO).
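>
> The hvmloader side of this is mechanically simple; a sketch of adding
> one such entry is below (hvmloader really builds the whole map in one
> place, so the helper and its use here are illustrative only):
>
>     #define E820_RESERVED 2   /* standard BIOS e820 type */
>
>     struct e820entry {
>         uint64_t addr;
>         uint64_t size;
>         uint32_t type;
>     } __attribute__((packed));
>
>     /* append one reserved region [base, end] to the table being built */
>     static unsigned int e820_add_reserved(struct e820entry *e820,
>                                           unsigned int nr,
>                                           uint64_t base, uint64_t end)
>     {
>         e820[nr].addr = base;
>         e820[nr].size = end - base + 1;
>         e820[nr].type = E820_RESERVED;
>         return nr + 1;
>     }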
>
> >>>3.8 Xen: handle devices sharing reserved regions
> ----
> Per the VT-d spec, it's possible for two devices to share the same
> reserved region. Though we haven't seen such an example in reality,
> the hypervisor needs to detect and handle this scenario; otherwise a
> vulnerability may exist if the two devices are assigned to different
> VMs (a malicious VM could program its assigned device to clobber the
> shared region and thereby corrupt another VM's device).
>
> Ideally all devices sharing reserved regions should be assigned to a
> single VM. However this can't be achieved solely in the hypervisor
> w/o reworking the current device assignment interface. Assignment is
> managed by the toolstack, so this would require exposing the sharing
> information to user space and then extending the toolstack to manage
> assignment in bundles.
>
> Given that the problem is so far only theoretical, we propose to not
> support this scenario, i.e. the hypervisor fails the assignment if
> the target device happens to share reserved regions with another
> device, following [Guideline-4] to keep things simple.
>
>
>
> 4. Plan
> =====================================================================
>
> We're seeking an incremental way to split the above tasks into 2
> stages, where each stage moves us a step forward w/o causing
> regressions. Doing so benefits people who want to use device
> assignment early, and also helps new developers ramp up toward a
> final sane solution.
>
> 4.1 Stage-1: hypervisor hardening
> ----
> [Tasks]
>   1) Set up RMRR identity mappings in the p2m layer with conflict
>      detection
>   2) Add a boot option for the fail/warn policy
>   3) Remove the USB hack
>   4) Detect and fail device assignment w/ shared reserved regions
>
> [Enhancements]
>   * fix [Issue-1] and [Issue-3]
>   * partially fix [Issue-2], with limitations:
>     - w/o user space relocation there's a larger chance of seeing
>       conflicts
>     - w/o reservations in the guest e820, the guest OS may allocate
>       a reserved pfn when re-enumerating PCI resources
>
> [Regressions]
>   * devices which could be assigned successfully before may fail now
>     due to conflict detection. However it's not a regression per se,
>     and the user can change the policy to 'warn' if required.
>
> 4.2 Stage-2: libxc/hvmloader hardening
> ----
> [Tasks]
>   5) Introduce a new interface to expose reserved region information
>   6) Detect and avoid reserved region conflicts in libxc
>   7) Pass the libxc guest RAM layout to hvmloader
>   8) Detect and avoid reserved region conflicts in hvmloader
>   9) Reserve 'reserved regions' in the guest E820 in hvmloader
>
> [Enhancements]
>   * completely fix [Issue-2]
>
> [Regressions]
>   * n/a
>
> Thanks,
> Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel