
[Xen-devel] Xen x86 host memory limit issues

(Following up from a discussion at the Seattle Summit).

While the theoretical Xen x86 host memory limit is 16TB (or 123TB with
CONFIG_BIGMEM), Xen doesn't actually function correctly if the host RAM
exceeds the addressable range of the directmap region, which ends at the
5TB boundary (or 3.5TB with CONFIG_BIGMEM).

The ultimate bug is that alloc_xenheap_pages() returns virtual addresses
which exceed HYPERVISOR_VIRT_END. 

Because of the way the idle pagetables and monitor pagetables extend the
directmap region, these pointers are safe to use in hypervisor context.
However, in the context of a 64bit PV guest, these virtual addresses
belong to the guest.

In my repro case (6TB box, 8 numa nodes), it was particularly easy to
trigger the issue from a 64bit dom0 with `xenpm get-cpuidle-states all`
or `echo c > /proc/sysrq-trigger`, both of which went and accessed
per-cpu data allocated higher than HYPERVISOR_VIRT_END and unmapped in
the dom0 kernel pagetables.  (On Broadwell hardware, I would expect SMAP
violations, as the guest kernel pages are user pages.)

For XenServer, I used the following gross hack to work around the problem:

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);


This causes Xen to ignore any RAM above the 5TB boundary.  (We used a
similar trick with the 1TB limit for 32bit toolstack domains and migration.)

The infrastructure around xenheap_max_mfn() is supposed to cause all
xenheap page allocations to fall within the Xen direct-mapped region,
but experimentally it doesn't work correctly.

In all cases I have seen, the bad xenheap allocations have been from
calls which contain numa information in the memflags, which leads me to
suspect an interaction issue between the numa hinting information and
xenheap_bits.  At a guess, I suspect alloc_heap_pages() doesn't correctly
override the numa hint when both a numa hint and a zone limit are
provided, but I have not investigated this yet.

Fixing that bug will be a useful step, as it will allow Xen to function
with host RAM above the direct map limit, but it is still not an optimal
solution, as it prevents getting numa-local xenheap memory.

Longterm, it would be optimal to segment the direct map region by numa
node, so that there are equal quantities of xenheap memory available from
each numa node.  This also has an added security benefit: it makes ret2dir
exploits harder, as the direct map target address is no longer a static
calculation from the point of view of the attacker.

