This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [PATCH] turn off writable page tables

To: Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH] turn off writable page tables
From: Zachary Amsden <zach@xxxxxxxxxx>
Date: Fri, 28 Jul 2006 14:36:41 -0700
Cc: Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx>, Gerd Hoffmann <kraxel@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, Andrew Theurer <habanero@xxxxxxxxxx>, Virtualization Mailing List <virtualization@xxxxxxxxxxxxxx>
Delivery-date: Fri, 28 Jul 2006 14:37:13 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <66edb0fb6e0664d309b10701168c495e@xxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <A95E2296287EAD4EB592B5DEEFCE0E9D57229D@xxxxxxxxxxxxxxxxxxxxxxxxxxx> <66edb0fb6e0664d309b10701168c495e@xxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Thunderbird (X11/20060516)
Keir Fraser wrote:

On 28 Jul 2006, at 16:51, Ian Pratt wrote:

So, in summary, we know writable page tables are not broken, they just
don't help on typical workloads because the number of PTE updates per page is so low.
However, they do hurt SMP guest performance.  If we are not seeing a
benefit today, should we turn it off?  Should we make it a compile
option, with the default off?

I wouldn't mind seeing wrpt removed altogether, or at least emulation
made the compile time default for the moment. There's bound to be some
workload that bites us in the future which is why batching updates on
the fork path mightn't be a bad thing if it can be done without too much
gratuitous hacking of linux core code.

My only fear is that batched wrpt has some guest-visible effects. For example, the guest has to be able to cope with seeing page directory entries with the present bit cleared. Also, on SMP, it has to be able to cope with spurious page faults anywhere in its address space (e.g., faults on an unhooked page table which some other VCPU has rehooked by the time the Xen page fault handler runs, hence the fault is bounced back to the guest even though there is no work to be done). If we turn off batched wrpt then guests will not be tested against it, and we are likely to hit problems if we ever want to turn it back on again -- we'll find that some guests are not able to correctly handle the weird side effects.

On the other hand, perhaps we can find a neater more explicit alternative to batched wrpt in future.

This is a very nice win for shadow page tables on SMP. Basically, we use the lazy state information to defer all the MMU hypercalls into a single flush, which happens when leaving lazy MMU mode.

At the PT level, this can be done without gratuitous hacking of Linux core code. However, this cannot be extended safely to also encompass the set of the parent page directory entry for SMP. It is a little unclear exactly how this would work under a direct page table hypervisor - would you still take the faults, or would you re-type and reprotect the pages first? In the fork case, there can be two page tables being updated because of COW, but re-typing both pages changes the crossover point for when the batching will be a win. But if the same hooks can be used for direct mode, it makes sense to figure that out now, so we don't have to add 4 different sets of hooks to Linux (UP / SMP want slightly different batching models, as might also shadow / direct).

The PDE p-bit going missing is still a problem, and Linux can be changed to deal with that - but it is messy.

One remaining issue for deferring direct page table updates is the read hazard potential. I believe there is only one read hazard in the Linux mm code that has the potential to be exposed here - the explicit, rather than implicit batching makes it quite a bit easier to reason about that.

Implement lazy MMU update hooks which are SMP safe for both direct and
shadow page tables.  The idea is that PTE updates and page invalidations
while in lazy mode can be batched into a single hypercall.  We use this
in VMI for shadow page table synchronization, and it is a win.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified.  This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of.  Doing this under the protection of the locks guarantees
the synchronization is correct, and also means that spurious faults which
are generated during this window by remote CPUs are properly handled, as
the page fault handler must re-check the PTE under protection of the same
lock.
Signed-off-by: Zachary Amsden <zach@xxxxxxxxxx>

Index: linux-2.6.18-rc2/include/asm-generic/pgtable.h
--- linux-2.6.18-rc2.orig/include/asm-generic/pgtable.h 2006-07-28 
14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/include/asm-generic/pgtable.h      2006-07-28 
14:18:03.000000000 -0700
@@ -163,6 +163,11 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)        (pte)
+#define arch_enter_lazy_mmu_mode()     do {} while (0)
+#define arch_leave_lazy_mmu_mode()     do {} while (0)
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
Index: linux-2.6.18-rc2/mm/memory.c
--- linux-2.6.18-rc2.orig/mm/memory.c   2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/memory.c        2006-07-28 14:18:44.000000000 -0700
@@ -506,6 +506,7 @@ again:
        src_ptl = pte_lockptr(src_mm, src_pmd);
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+       arch_enter_lazy_mmu_mode();
        do {
                /*
                 * We are holding two locks at this point - either of them
@@ -525,6 +526,7 @@ again:
                copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
                progress += 8;
        } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_nested(src_pte - 1);
@@ -627,6 +629,7 @@ static unsigned long zap_pte_range(struc
        int anon_rss = 0;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                pte_t ptent = *pte;
                if (pte_none(ptent)) {
@@ -692,6 +695,7 @@ static unsigned long zap_pte_range(struc
                pte_clear_full(mm, addr, pte, tlb->fullmm);
        } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+       arch_leave_lazy_mmu_mode();
        add_mm_rss(mm, file_rss, anon_rss);
        pte_unmap_unlock(pte - 1, ptl);
@@ -1108,6 +1112,7 @@ static int zeromap_pte_range(struct mm_s
        pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
+       arch_enter_lazy_mmu_mode();
        do {
                struct page *page = ZERO_PAGE(addr);
                pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
@@ -1117,6 +1122,7 @@ static int zeromap_pte_range(struct mm_s
                set_pte_at(mm, addr, pte, zero_pte);
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
@@ -1269,11 +1275,13 @@ static int remap_pte_range(struct mm_str
        pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
+       arch_enter_lazy_mmu_mode();
        do {
                set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
Index: linux-2.6.18-rc2/mm/mprotect.c
--- linux-2.6.18-rc2.orig/mm/mprotect.c 2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mprotect.c      2006-07-28 14:17:25.000000000 -0700
@@ -33,6 +33,7 @@ static void change_pte_range(struct mm_s
        spinlock_t *ptl;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                oldpte = *pte;
                if (pte_present(oldpte)) {
@@ -62,6 +63,7 @@ static void change_pte_range(struct mm_s
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
Index: linux-2.6.18-rc2/mm/mremap.c
--- linux-2.6.18-rc2.orig/mm/mremap.c   2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mremap.c        2006-07-28 14:17:25.000000000 -0700
@@ -99,6 +99,7 @@ static void move_ptes(struct vm_area_str
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+       arch_enter_lazy_mmu_mode();
        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
                if (pte_none(*old_pte))
@@ -108,6 +109,7 @@ static void move_ptes(struct vm_area_str
                pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
                set_pte_at(mm, new_addr, new_pte, pte);
+       arch_leave_lazy_mmu_mode();
        if (new_ptl != old_ptl)
Index: linux-2.6.18-rc2/mm/msync.c
--- linux-2.6.18-rc2.orig/mm/msync.c    2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/msync.c 2006-07-28 14:17:25.000000000 -0700
@@ -30,6 +30,7 @@ static unsigned long msync_pte_range(str
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       arch_enter_lazy_mmu_mode();
        do {
                struct page *page;
@@ -51,6 +52,7 @@ again:
                        ret += set_page_dirty(page);
                progress += 3;
        } while (pte++, addr += PAGE_SIZE, addr != end);
+       arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(pte - 1, ptl);
        if (addr != end)