RE: [Xen-devel] [RFC] New shadow paging code

To: "Tim Deegan" <Tim.Deegan@xxxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] [RFC] New shadow paging code
From: "Kurtz, Ryan M." <Ryan.Kurtz@xxxxxxxxxx>
Date: Fri, 14 Jul 2006 13:58:22 -0400
Delivery-date: Fri, 14 Jul 2006 10:58:55 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcanYSulABw3myJWRzmdqCMTAG4b/wADbkLw
Thread-topic: [Xen-devel] [RFC] New shadow paging code

I noticed the addition of the unsigned long *mb parameter in
xc_shadow_control().  If you have time, could you say a word or two
about what this parameter is used for?


-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Tim Deegan
Sent: Friday, July 14, 2006 11:39 AM
To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [RFC] New shadow paging code

We (Michael Fetterman, George Dunlap and I) have been working over the
last while on a full replacement for Xen's shadow pagetable support. 

This mail contains some design notes, below; a patch against
xen-unstable, giving a snapshot of the current state of the new shadow
code, is at http://www.cl.cam.ac.uk/~tjd21/shadow2.patch

Comments on both are welcome, although the code is not finished -- in
particular there are both some optimizations and some tidying-up that
need to be done.




The new shadow code (dubbed 'shadow2') is designed as a replacement for
the current shadow code.  It's been designed from the ground up to
support the following capabilities:
 * Work for both paravirtualized and HVM guests.  Our focus is on
Windows under HVM, since Linux guests can use paravirtual mechanisms for
faster memory management.
 * Xen may be running in 2-, 3-, or 4-level paging mode.  While booting,
guests may be in direct-access mode (no paging), or any paging level
less than or equal to Xen's current paging level.  This means that we
must support 2-on-2, 2-on-3, 2-on-4, 3-on-3, 3-on-4, and 4-on-4 paging
modes.
 * While bringing up secondary vcpus in an SMP system, the vcpus may all
be in different paging modes.  We must support these simultaneously.
 * Logdirty mode for live migration.
 * We must work with paravirtualized drivers for HVM domains.
 * We must work for guest superpages.

With this in mind, we have made several design choices:
* Do away with the "out-of-sync" mechanism to begin with.  After a page
is promoted, emulate all writes to it until it is demoted again.  This
makes the logic a lot simpler, and also reduces the overhead of demand
paging, which is one of the most common Windows modes.  (See below for
more information on demand paging.)
* In the case of a size mismatch between guest pagetable entries and
host pagetable entries (i.e., 2-on-3 or 2-on-4, where guest pagetable
entries are 32 bits and host pagetable entries are 64 bits), a single
guest page may need to be shadowed by multiple shadow pages.  In this
case, we always shadow the entire guest pagetable, rather than shadowing
only part at a time.  We also keep the multiple backing shadow
pagetables physically contiguous in memory using a "buddy" allocator.
This allows us to use only one mfn value to designate the entire group
of mfns.
* We allocate a fixed amount of shadow memory at domain creation. This
is shared by all vcpus.  When we need more shadow pages, we begin to
unshadow pages to free up more memory in approximately an LRU fashion.
* We keep the p2m maps for HVM domains in a pagetable format, so that we
can use them as the pagetables for HVM guests in paging-disabled mode.
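
As a rough illustration of the size-mismatch point above, here is a
minimal sketch in C of how many contiguous shadow pages one guest
pagetable page needs, and how a single base mfn can stand for the whole
group.  The helper names and the 4096-byte page size are assumptions
made for this example, not taken from the patch, and the sketch only
covers the simple case of one shadow entry per guest entry.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical helper: number of shadow pages needed to hold one shadow
 * entry for every entry in a single guest pagetable page. */
static unsigned int shadow_pages_for_guest_table(size_t guest_entry_bytes,
                                                 size_t host_entry_bytes)
{
    assert(guest_entry_bytes != 0 && host_entry_bytes != 0);
    size_t guest_entries = PAGE_SIZE / guest_entry_bytes;
    size_t shadow_bytes = guest_entries * host_entry_bytes;
    return (unsigned int)((shadow_bytes + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Buddy-allocator order: smallest k such that 2^k >= pages. */
static unsigned int shadow_order(unsigned int pages)
{
    unsigned int order = 0;
    while ((1u << order) < pages)
        order++;
    return order;
}

/* Given the first mfn of a physically contiguous group, find the mfn
 * that holds the shadow of guest entry number guest_index. */
static unsigned long shadow_mfn_for_entry(unsigned long base_mfn,
                                          unsigned int guest_index,
                                          size_t host_entry_bytes)
{
    assert(host_entry_bytes != 0);
    unsigned int entries_per_shadow_page =
        (unsigned int)(PAGE_SIZE / host_entry_bytes);
    return base_mfn + guest_index / entries_per_shadow_page;
}
```

With 4-byte guest entries and 8-byte host entries this gives two shadow
pages (an order-1 allocation), and guest entries 512 and above land in
the second page of the group.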

So far, we have had several successes.  Demand-paging accesses have been
sped up by doing emulated writes rather than using the out-of-sync
mechanism.  The out-of-sync mechanism requires three page faults, two of
which entail relatively expensive shadow operations: marking a page out of
sync, and bringing it back into sync.  In the case of HVM guests, the
faults also cause three expensive vmexit/vmenter cycles.  Our emulated
writes require only two page faults, and each fault is less expensive.

Also, the overhead of many individual shadow operations is lower in the
newer code than in the old code.

We have a number of potential optimizations in mind for the near future:

* Removing writable mappings.  As with the old code, when a guest pfn is
promoted to be a pagetable, we need to find and remove all writable
mappings to it, so that we can detect changes.  Following the "start
simple, then optimize" principle, our current code does a brute-force
search through the shadows.  Our tests indicate that when a page is
promoted to a pagetable, it generally has exactly one writable mapping
outstanding. This is true both for Windows and for Linux.  We plan to
use this fact to keep a back-pointer to the last writable shadow pte of
a page in the page_info struct of a page.  The few exceptions to the
rule can still be handled using brute-force search.

* Fast-pathing some faults.  By storing the guest present / writable
flags in some of the spare bits of the guest pagetable, we can fast-path
certain operations, such as propagating a fault to the guest or updating
guest dirty and accessed bits, without needing to map the guest
pagetables.  This should speed up some common faults, as well as reduce
cache footprint.

* Batch updates.  There are times when guests do batch updates to
pagetables.  At these times, it makes sense to give the guest write
access to the pagetables.  At first this can be done simply by
unshadowing the page entirely.  In the future, we can explore whether a
"mark out of sync" mechanism would speed things up.  We may be able to
have a more extreme
optimization for Linux fork(): when we detect Linux doing a fork(), we
can unshadow the entire user portion of the guest address space, to save
having to detect a "batch update" and unshadow each guest pagetable.

* Full emulation of shadow page accesses.  Currently, we allow read-only
access to guest pagetables.  This requires us to emulate the dirty and
accessed bits of the guest pagetables, in turn requiring us to take page
faults.  But how many of these dirty/accessed bits are actually read?
It may be more efficient, in certain circumstances, to emulate reads to
guest page tables as well as writes, taking the dirty and accessed bits
from the shadow pagetables.

* Teardown heuristics.  If we can determine when a guest is destroying a
process, we can unshadow the whole address space at once.  Failure to
detect when a process is being torn down will cause unnecessary overhead:
if the guest pagetables of the destroyed process are recycled as data
pages, all writes to the pages will be emulated (in a rather expensive
manner) until the page is unshadowed.  Even if the guest pagetables are
re-used for new process pagetables, constructing the address space will
be faster if unshadowed.
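
To make the "removing writable mappings" idea above more concrete, here
is a minimal sketch in C of a back-pointer fast path with a brute-force
fallback.  Everything in it is an assumption for illustration: the
struct, the writable-mapping counter, the flat array standing in for
"all shadow ptes", and the pte bit layout are hypothetical and are not
the patch's actual page_info layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_RW        (1ull << 1)   /* assumed position of the writable bit */
#define PTE_MFN_SHIFT 12            /* assumed position of the frame number */

/* Hypothetical stand-in for the relevant part of a page_info struct. */
struct page_sketch {
    unsigned long mfn;
    unsigned int writable_count;    /* outstanding writable shadow ptes */
    uint64_t *last_writable_pte;    /* back-pointer hint, may be NULL */
};

static int pte_maps_writable(uint64_t pte, unsigned long mfn)
{
    return (pte & PTE_RW) && (pte >> PTE_MFN_SHIFT) == mfn;
}

/* Called whenever a writable shadow pte for this page is installed. */
static void record_writable_mapping(struct page_sketch *pg, uint64_t *pte)
{
    pg->writable_count++;
    pg->last_writable_pte = pte;
}

/* Remove all writable mappings of pg.  Returns 1 if the back-pointer
 * alone was enough, 0 if the brute-force search had to run. */
static int remove_writable_mappings(struct page_sketch *pg,
                                    uint64_t *shadow_ptes, size_t n)
{
    uint64_t *hint = pg->last_writable_pte;

    pg->last_writable_pte = NULL;
    if (hint != NULL && pte_maps_writable(*hint, pg->mfn)) {
        assert(pg->writable_count > 0);
        *hint &= ~PTE_RW;
        pg->writable_count--;
    }
    if (pg->writable_count == 0)
        return 1;                   /* common case: exactly one mapping */

    /* Rare case: fall back to scanning every shadow pte. */
    for (size_t i = 0; i < n && pg->writable_count > 0; i++) {
        if (pte_maps_writable(shadow_ptes[i], pg->mfn)) {
            shadow_ptes[i] &= ~PTE_RW;
            pg->writable_count--;
        }
    }
    return 0;
}
```

If the observation about a single outstanding writable mapping holds,
almost every promotion would take the early return and never touch the
scan loop.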

Code Structure

Our code must deal differently with all the different combinations of
shadow modes.  However, we expect that once a guest reaches its target
paging mode, it will stay in that mode for a long time; and the host
will never change its paging mode.  Rather than having a whole string of
ifs in the code based on the current guest and host paging modes, we
compile different code to deal with each pair of modes (2-on-2, 2-on-3,
2-on-4, 3-on-3, 3-on-4, 4-on-4).  (Direct mode is implemented as a
special case of m-on-m, where m is the host's current paging level.)
While increasing the size of the hypervisor overall, this should greatly
reduce both the cache footprint of the shadow code and the number of
pipeline flushes from mispredicted branches.

To keep from having to maintain duplicate logic across 6 different bits
of code, we use a single source code file, and compiler directives to
specify mode-specific code.  This file is shadow2.c, and is built once
for each appropriate combination.  The compiler is set to redefine the
functions
for n-on-m mode.
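
The single-source approach can be sketched roughly as follows.  This is
an illustration only: in the scheme described above the file would be
compiled once per mode pair, whereas here a macro instantiates each
variant so that the example fits in one translation unit.  The macro
names and the example function are hypothetical; only the
__shadow_[m]_guest_[n] suffix follows the naming convention mentioned
in this mail.

```c
#include <assert.h>

/* Paste the mode pair into the function name, e.g.
 * shadow_l1_pages__shadow_3_guest_2. */
#define SHADOW2_NAME(fn, shadow, guest) fn##__shadow_##shadow##_guest_##guest

/* One "template" body, specialised per mode pair.  The levels are
 * compile-time constants, so the comparison folds away and each variant
 * carries no run-time mode checks. */
#define DEFINE_SHADOW_L1_PAGES(shadow, guest)                               \
    static unsigned int SHADOW2_NAME(shadow_l1_pages, shadow, guest)(void)  \
    {                                                                       \
        static_assert((guest) <= (shadow),                                  \
                      "guest levels must not exceed shadow levels");        \
        /* 32-bit guest entries under 64-bit shadow entries need two */     \
        /* shadow pages per guest L1 page; otherwise one is enough.  */     \
        return ((guest) == 2 && (shadow) >= 3) ? 2u : 1u;                   \
    }

DEFINE_SHADOW_L1_PAGES(2, 2)
DEFINE_SHADOW_L1_PAGES(3, 2)
DEFINE_SHADOW_L1_PAGES(4, 2)
DEFINE_SHADOW_L1_PAGES(3, 3)
DEFINE_SHADOW_L1_PAGES(4, 3)
DEFINE_SHADOW_L1_PAGES(4, 4)
```

Each of the six instantiations yields a separately named function, so a
caller can bind to the right one once when the mode is chosen rather
than testing the mode on every call.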

At the end of shadow2.c is a structure containing function pointers for
each of the mode-specific functions; this is called shadow2_entry (and
is expanded by preprocessor directives using the __shadow_[m]_guest_[n]
naming convention).  When a guest vcpu is put into a particular shadow
mode, an element of the vcpu struct is pointed to the appropriate
shadow2_entry struct.  To call the appropriate function, one generally
calls shadow2_[function_name](v, [args]), which is generally implemented
after the following template:

[rettype] shadow2_[function_name](v, [args])
{
        return v->arch.shadow2->[function_name](v, [args]);
}
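
A filled-in instance of the template above might look like the
following sketch.  The struct members, the page_fault entry point, the
stand-in handlers and shadow2_set_mode() are all hypothetical names
chosen for this example; only the v->arch.shadow2 indirection and the
wrapper shape come from the description in this mail.

```c
#include <assert.h>
#include <stddef.h>

struct vcpu;

/* Hypothetical per-mode table of entry points (one illustrative hook). */
struct shadow2_entry {
    unsigned int guest_levels;
    unsigned int shadow_levels;
    int (*page_fault)(struct vcpu *v, unsigned long va);
};

struct arch_vcpu { const struct shadow2_entry *shadow2; };
struct vcpu { struct arch_vcpu arch; };

/* Stand-in mode-specific handlers: they only report the guest level so
 * that the dispatch is observable.  Real handlers would do shadow work. */
static int page_fault__shadow_3_guest_2(struct vcpu *v, unsigned long va)
{
    (void)v; (void)va;
    return 2;
}

static int page_fault__shadow_3_guest_3(struct vcpu *v, unsigned long va)
{
    (void)v; (void)va;
    return 3;
}

static const struct shadow2_entry shadow2_entry__shadow_3_guest_2 = {
    2, 3, page_fault__shadow_3_guest_2
};
static const struct shadow2_entry shadow2_entry__shadow_3_guest_3 = {
    3, 3, page_fault__shadow_3_guest_3
};

/* Point the vcpu at the table for its current paging mode. */
static void shadow2_set_mode(struct vcpu *v, const struct shadow2_entry *e)
{
    v->arch.shadow2 = e;
}

/* Mode-independent wrapper, following the template. */
static int shadow2_page_fault(struct vcpu *v, unsigned long va)
{
    assert(v->arch.shadow2 != NULL);
    return v->arch.shadow2->page_fault(v, va);
}
```

Callers only ever see shadow2_page_fault(); switching a vcpu from 2-on-3
to 3-on-3 is a single pointer update in shadow2_set_mode().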

Xen-devel mailing list