
[Xen-devel] vNUMA and automatic numa balancing


During work on the vNUMA enabling patchset and PV domain test runs,
I found out that once NUMA is enabled for a PV guest, occasional oopses appear.
This was caused by a missing set_pmd_at function in pv_mmu_ops.
The behaviour does not appear if automatic NUMA balancing is turned off for
the PV kernel, since set_pmd_at is then never used.

I have added a set_pmd_at which sets the correct flags for automatic NUMA
balancing to work. Background on automatic NUMA balancing is here:
http://lwn.net/Articles/528881/.
After set_pmd_at was added, an issue with the rss reference count appeared.
The problem was related to exit_mmap not releasing vma areas correctly when
they contained pages with _PAGE_PRESENT cleared and _PAGE_NUMA set that were
never actually migrated.
See http://pastebin.com/eFP5zc62
The test that shows this is to execute the following command:
dd if=/dev/xvda1 of=/tee bs=4096 count=1000056 &

Depending on the configuration and the command, the rss count is not set
correctly in xen_exit_mmap for different MM counters.
In the example above you see the rss count for MM_ANONPAGES,
but in other cases it can be MM_FILEPAGES.
It is also possible that this rss count for the mm is wrong for some other reason.

Another bit was added when I forced the PV kernel to substitute the _PAGE_NUMA
bit right before issuing mmu_update. Why?
_PAGE_NUMA = _PAGE_PROTNONE = 0x100 in Linux, but 0x100 is _PAGE_GLOBAL in
Xen, where it is used for user mappings and is allowed to be set on ptes.
So instead of 0x100 (_PAGE_NUMA) I decided to hand over to Xen the unused
bit _PAGE_AVAIL2 = 0x800.
All mmu pvops were updated to translate that bit, so Xen sees the 0x800 bit
set for NUMA pte/pmd entries.
I also had to make a brutal hack (as a proof of concept) for the
pmd_numa/pte_numa checks, since this new flag is not set back to the
Linux _PAGE_NUMA after the page fault trap in Xen.

Initially the plan was to flip the bit back in Xen's page fault trap, but I
was unable to reliably identify where exactly in the page fault handler this
check should go and how to handle it.
I am not sure whether this is the correct way, so all suggestions are welcome.

Ok, with all of that in place, Linux takes page faults on NUMA pages thanks
to the corrected pmd_numa check (do_page_fault calls do_numa_page) and
launches page migration once the accumulated number of page faults on
_PAGE_NUMA pages exceeds whatever threshold applies. At that point I see
this recursive fault when running a vNUMA PV domain:

I see that on the stack in the first oops there are two values which look
like page fault error codes:

[    2.275054]  ffffffff810f5756 ffffffff81639ad0 0000000000000010

The last one would mean absence of the page, and the first means _PAGE_WRITE
is set and _PAGE_PRESENT is not (i.e. the page is resident but not
accessible)?

Does anybody have any pointers in case I am missing something? It looks like
the second exception (page not present) is also not handled properly in this
case.

I welcome any comments and questions, and will provide additional details if
some parts of this are unclear.

A couple of questions about the Xen page fault trap:

a) In spurious_page_fault/__page_fault_type in traps.c a page walk is
performed and the page table entries are compared against the required
flags, which include _PAGE_PRESENT for all levels of page entries from l4
to l1. Does the interpretation of this flag for l4 differ from the one for
l2 or l1? Or does it have the same meaning, i.e. that the next level of page
table entries should be checked as the one that caused the page fault?

b) In the spurious_page_fault routine the page fault is not supposed to be
fixed, as I understand, but only to have its type determined (real, smep,
...). If spurious_page_fault detects a real fault, it might then be fixed
transparently later or returned to the guest handler as is. Is this correct?

Thank you!

Xen-devel mailing list


