Xen project Mailing List

[Xen-devel] Xen x86 host memory limit issues

To: Xen-devel List <xen-devel@xxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 24 Aug 2015 11:36:50 +0100

Cc: Elena Ufimtseva <elena.ufimtseva@xxxxxxxxxx>, Juergen Gross <JGross@xxxxxxxx>, Tim Deegan <tim@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Mon, 24 Aug 2015 10:37:20 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

(Following up from a discussion at the Seattle Summit).

While the theoretical Xen x86 host memory limit is 16TB (or 123TB with CONFIG_BIGMEM), Xen doesn't actually function correctly if host RAM exceeds the addressable range in the directmap region, which is at the 5TB boundary (or 3.5TB with CONFIG_BIGMEM).

The ultimate bug is that alloc_xenheap_pages() returns virtual addresses which exceed HYPERVISOR_VIRT_END. Because of the way the idle pagetables and monitor pagetables extend the directmap region, these pointers are safe to use in those contexts. However, in the context of a 64bit PV guest, these virtual addresses belong to the guest kernel.

In my repro case (6TB box, 8 numa nodes), it was particularly easy to trigger the issue from a 64bit dom0 with `xenpm get-cpuidle-states all` or `echo c > /proc/sysrq-trigger`, both of which went and accessed per-cpu data allocated higher than HYPERVISOR_VIRT_END and unmapped in the dom0 kernel pagetables. (On Broadwell hardware, I would expect SMAP violations, as the guest kernel pages are user pages).

For XenServer, I used the following gross hack to work around the problem:

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);
 
 /*

which causes Xen to ignore any RAM above the 5TB boundary. (We used a similar trick with the 1TB limit for 32bit toolstack domains and migration).

The infrastructure around xenheap_max_mfn() is supposed to cause all xenheap page allocations to fall within the Xen direct mapped region, but experimentally it doesn't work correctly.
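To make the 5TB figure concrete, here is a minimal sketch of the direct map arithmetic. It is not Xen code: the two layout constants are assumed values for a non-BIGMEM x86-64 build, and the helpers directmap_limit(), directmap_vaddr() and maddr_in_directmap() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ASSUMED layout constants for a non-BIGMEM x86-64 build; check the
 * memory layout comment in your own tree before relying on them.
 */
#define DIRECTMAP_VIRT_START 0xffff830000000000ULL
#define HYPERVISOR_VIRT_END  0xffff880000000000ULL

/* Convenience macro for this sketch only: x terabytes in bytes. */
#define TB(x) ((uint64_t)(x) << 40)

/* Hypothetical helper: bytes of machine memory the direct map can cover. */
static inline uint64_t directmap_limit(void)
{
    return HYPERVISOR_VIRT_END - DIRECTMAP_VIRT_START;
}

/* Hypothetical helper: direct map virtual address for a machine address. */
static inline uint64_t directmap_vaddr(uint64_t maddr)
{
    return DIRECTMAP_VIRT_START + maddr;
}

/*
 * Hypothetical helper: true if the direct map address for maddr still
 * lies below HYPERVISOR_VIRT_END, i.e. inside Xen's own virtual range.
 */
static inline bool maddr_in_directmap(uint64_t maddr)
{
    return maddr < directmap_limit();
}
```

Under these assumptions directmap_limit() is exactly 5TB, so on a 6TB host any xenheap page at or above the 5TB machine address maps to a virtual address at or beyond HYPERVISOR_VIRT_END, which is the range a 64bit PV guest kernel owns.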
In all cases I have seen, the bad xenheap allocations have been from calls which contain numa information in the memflags, which leads me to suspect it is an interaction issue between the numa hinting information and xenheap_bits. At a guess, alloc_heap_pages() doesn't correctly override the numa hint when both a numa hint and a zone limit are provided, but I have not investigated this yet.

Fixing that bug will be a useful step, as it will allow Xen to function with host RAM above the direct map limit, but it is still not an optimal solution, as it prevents getting numa-local xenheap memory.

Long term, it would be optimal to segment the direct map region by numa node, so there are equal quantities of xenheap memory available from each numa node. This also has an added security benefit, as it makes ret2dir exploits harder: the direct map target address is no longer a static calculation from the point of view of the attacker.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
