This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [PATCH] turn off writable page tables

To: Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH] turn off writable page tables
From: Zachary Amsden <zach@xxxxxxxxxx>
Date: Fri, 28 Jul 2006 14:36:41 -0700
Cc: Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx>, Gerd Hoffmann <kraxel@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, Andrew Theurer <habanero@xxxxxxxxxx>, Virtualization Mailing List <virtualization@xxxxxxxxxxxxxx>
Delivery-date: Fri, 28 Jul 2006 14:37:13 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <66edb0fb6e0664d309b10701168c495e@xxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <A95E2296287EAD4EB592B5DEEFCE0E9D57229D@xxxxxxxxxxxxxxxxxxxxxxxxxxx> <66edb0fb6e0664d309b10701168c495e@xxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Thunderbird (X11/20060516)
Keir Fraser wrote:

On 28 Jul 2006, at 16:51, Ian Pratt wrote:

So, in summary, we know writable page tables are not broken, they just
don't help on typical workloads because the number of PTE updates per page is so low.
However, they do hurt SMP guest performance.  If we are not seeing a
benefit today, should we turn it off?  Should we make it a compile
option, with the default off?

I wouldn't mind seeing wrpt removed altogether, or at least emulation
made the compile time default for the moment. There's bound to be some
workload that bites us in the future which is why batching updates on
the fork path mightn't be a bad thing if it can be done without too much
gratuitous hacking of linux core code.

My only fear is that batched wrpt has some guest-visible effects. For example, the guest has to be able to cope with seeing page directory entries with the present bit cleared. Also, on SMP, it has to be able to cope with spurious page faults anywhere in its address space (e.g., faults on an unhooked page table which some other VCPU has rehooked by the time the Xen page fault handler runs, hence the fault is bounced back to the guest even though there is no work to be done). If we turn off batched wrpt then guests will not be tested against it, and we are likely to hit problems if we ever want to turn it back on again -- we'll find that some guests are not able to correctly handle the weird side effects.

On the other hand, perhaps we can find a neater more explicit alternative to batched wrpt in future.

This is a very nice win for shadow page tables on SMP. Basically, we use the lazy state information to defer all the MMU hypercalls into a single flush, which happens when leaving lazy MMU mode.

At the PT level, this can be done without gratuitous hacking of Linux core code. However, this cannot be extended safely to also encompass the set of the parent page directory entry for SMP. It is a little unclear exactly how this would work under a direct page table hypervisor - would you still take the faults, or would you re-type and reprotect the pages first? In the fork case, there can be two page tables being updated because of COW, but re-typing both pages changes the crossover point for when the batching will be a win. But if the same hooks can be used for direct mode, it makes sense to figure that out now, so we don't have to add 4 different sets of hooks to Linux (UP / SMP want slightly different batching models, as might also shadow / direct).

The PDE p-bit going missing is still a problem, and Linux can be changed to deal with that - but it is messy.

One remaining issue for deferring direct page table updates is the read hazard potential. I believe there is only one read hazard in the Linux mm code that has the potential to be exposed here - the explicit, rather than implicit batching makes it quite a bit easier to reason about that.

Implement lazy MMU update hooks which are SMP safe for both direct and
shadow page tables.  The idea is that PTE updates and page invalidations
while in lazy mode can be batched into a single hypercall.  We use this
in VMI for shadow page table synchronization, and it is a win.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified.  This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of.  Doing this under the protection of the locks guarantees
the synchronization is correct, and also means that spurious faults which
are generated during this window by remote CPUs are properly handled, as
the page fault handler must re-check the PTE under protection of the same
lock.
Signed-off-by: Zachary Amsden <zach@xxxxxxxxxx>

Index: linux-2.6.18-rc2/include/asm-generic/pgtable.h
--- linux-2.6.18-rc2.orig/include/asm-generic/pgtable.h 2006-07-28 
14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/include/asm-generic/pgtable.h      2006-07-28 
14:18:03.000000000 -0700
@@ -163,6 +163,11 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)        (pte)
+#define arch_enter_lazy_mmu_mode()     do {} while (0)
+#define arch_leave_lazy_mmu_mode()     do {} while (0)
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
Index: linux-2.6.18-rc2/mm/memory.c
--- linux-2.6.18-rc2.orig/mm/memory.c   2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/memory.c        2006-07-28 14:18:44.000000000 -0700
@@ -506,6 +506,7 @@ again:
        src_ptl = pte_lockptr(src_mm, src_pmd);
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+       arch_enter_lazy_mmu_mode();
        do {
                /*
                 * We are holding two locks at this point - either of them
@@ -525,6 +526,7 @@ again:
                copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
                progress += 8;
        } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_nested(src_pte - 1);
@@ -627,6 +629,7 @@ static unsigned long zap_pte_range(struc
        int anon_rss = 0;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                pte_t ptent = *pte;
                if (pte_none(ptent)) {
@@ -692,6 +695,7 @@ static unsigned long zap_pte_range(struc
                pte_clear_full(mm, addr, pte, tlb->fullmm);
        } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+       arch_leave_lazy_mmu_mode();
        add_mm_rss(mm, file_rss, anon_rss);
        pte_unmap_unlock(pte - 1, ptl);
@@ -1108,6 +1112,7 @@ static int zeromap_pte_range(struct mm_s
        pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
+       arch_enter_lazy_mmu_mode();
        do {
                struct page *page = ZERO_PAGE(addr);
                pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
@@ -1117,6 +1122,7 @@ static int zeromap_pte_range(struct mm_s
                set_pte_at(mm, addr, pte, zero_pte);
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
@@ -1269,11 +1275,13 @@ static int remap_pte_range(struct mm_str
        pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
+       arch_enter_lazy_mmu_mode();
        do {
                set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
Index: linux-2.6.18-rc2/mm/mprotect.c
--- linux-2.6.18-rc2.orig/mm/mprotect.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mprotect.c      2006-07-28 14:17:25.000000000 -0700
@@ -33,6 +33,7 @@ static void change_pte_range(struct mm_s
        spinlock_t *ptl;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                oldpte = *pte;
                if (pte_present(oldpte)) {
@@ -62,6 +63,7 @@ static void change_pte_range(struct mm_s
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
Index: linux-2.6.18-rc2/mm/mremap.c
--- linux-2.6.18-rc2.orig/mm/mremap.c   2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mremap.c        2006-07-28 14:17:25.000000000 -0700
@@ -99,6 +99,7 @@ static void move_ptes(struct vm_area_str
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+       arch_enter_lazy_mmu_mode();
        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
                if (pte_none(*old_pte))
@@ -108,6 +109,7 @@ static void move_ptes(struct vm_area_str
                pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
                set_pte_at(mm, new_addr, new_pte, pte);
+       arch_leave_lazy_mmu_mode();
        if (new_ptl != old_ptl)
Index: linux-2.6.18-rc2/mm/msync.c
--- linux-2.6.18-rc2.orig/mm/msync.c    2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/msync.c 2006-07-28 14:17:25.000000000 -0700
@@ -30,6 +30,7 @@ static unsigned long msync_pte_range(str
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                struct page *page;
@@ -51,6 +52,7 @@ again:
                        ret += set_page_dirty(page);
                progress += 3;
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        if (addr != end)