
[Xen-devel] [PATCH RFC v2 14/23] libxc/migration: implement the sender side of postcopy live migration



From: Joshua Otto <jtotto@xxxxxxxxxxxx>

Add a new 'postcopy' phase to the live migration algorithm, during which
unmigrated domain memory is paged over the network on demand _after_ the
guest has been resumed at the destination.

To do so:
- Add a new precopy policy option, XGS_POLICY_POSTCOPY, that policies
  can use to request a transition to the postcopy live migration phase
  rather than a stop-and-copy of the remaining dirty pages (a minimal
  example policy is sketched after this list).
- Add support to xc_domain_save() for this policy option by breaking out
  of the precopy loop early, transmitting the final set of dirty pfns
  and all remaining domain state (including higher-layer state) except
  memory, and entering a postcopy loop during which the remaining page
  data is pushed in the background.  Remote requests for specific pages
  in response to faults in the domain are serviced with priority in this
  loop.
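
A minimal policy making use of the new option might look like the
following sketch; the iteration and dirty_count fields are assumed from
the precopy_stats definition introduced earlier in this series, and the
thresholds are arbitrary:

    static int example_precopy_policy(struct precopy_stats stats, void *data)
    {
        /* Few enough dirty pages left?  Just stop-and-copy them. */
        if ( stats.dirty_count >= 0 && stats.dirty_count < 50 )
            return XGS_POLICY_STOP_AND_COPY;

        /* Still not converging after several iterations - go postcopy. */
        if ( stats.iteration >= 5 )
            return XGS_POLICY_POSTCOPY;

        return XGS_POLICY_CONTINUE_PRECOPY;
    }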

The new save callbacks required for this migration phase are stubbed in
libxl for now, to be replaced in a subsequent patch that adds libxl
support for postcopy.  Receiver-side support for this phase follows
immediately in the next patch.

Signed-off-by: Joshua Otto <jtotto@xxxxxxxxxxxx>
---
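With this patch, the sender-side stream for a postcopy migration runs as
follows: a POSTCOPY_BEGIN record is written ahead of the usual
end-of-checkpoint records; POSTCOPY_PFNS_BEGIN and one or more
POSTCOPY_PFNS records advertise the pfns whose contents are still
outstanding; a POSTCOPY_TRANSITION record hands the stream to the higher
layer so it can send the records needed to resume the guest; and the
postcopy loop then pushes POSTCOPY_PAGE_DATA records, prioritizing pfns
named in POSTCOPY_FAULT requests read back from the receiver, until the
receiver signals POSTCOPY_COMPLETE.
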
 tools/libxc/include/xenguest.h     |  84 ++++---
 tools/libxc/xc_sr_common.h         |   8 +-
 tools/libxc/xc_sr_save.c           | 488 ++++++++++++++++++++++++++++++++++---
 tools/libxc/xc_sr_save_x86_hvm.c   |  13 +
 tools/libxc/xg_save_restore.h      |  16 +-
 tools/libxl/libxl_dom_save.c       |  11 +-
 tools/libxl/libxl_save_msgs_gen.pl |   6 +-
 7 files changed, 558 insertions(+), 68 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 215abd0..a662273 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -56,41 +56,59 @@ struct save_callbacks {
 #define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
 #define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
                                         * remaining dirty pages. */
+#define XGS_POLICY_POSTCOPY         2  /* Suspend the guest and transition into
+                                        * the postcopy phase of the migration. */
     int (*precopy_policy)(struct precopy_stats stats, void *data);
 
-    /* Called after the guest's dirty pages have been
-     *  copied into an output buffer.
-     * Callback function resumes the guest & the device model,
-     *  returns to xc_domain_save.
-     * xc_domain_save then flushes the output buffer, while the
-     *  guest continues to run.
-     */
-    int (*aftercopy)(void* data);
-
-    /* Called after the memory checkpoint has been flushed
-     * out into the network. Typical actions performed in this
-     * callback include:
-     *   (a) send the saved device model state (for HVM guests),
-     *   (b) wait for checkpoint ack
-     *   (c) release the network output buffer pertaining to the acked checkpoint.
-     *   (c) sleep for the checkpoint interval.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint */
-    int (*checkpoint)(void* data);
-
-    /*
-     * Called after the checkpoint callback.
-     *
-     * returns:
-     * 0: terminate checkpointing gracefully
-     * 1: take another checkpoint
-     */
-    int (*wait_checkpoint)(void* data);
-
-    /* Enable qemu-dm logging dirty pages to xen */
-    int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+    /* Checkpointing and postcopy live migration are mutually exclusive. */
+    union {
+        struct {
+            /*
+             * Called during a live migration's transition to the postcopy phase
+             * to yield control of the stream back to a higher layer so it can
+             * transmit records needed for resumption of the guest at the
+             * destination (e.g. device model state, xenstore context)
+             */
+            int (*postcopy_transition)(void *data);
+        };
+
+        struct {
+            /* Called after the guest's dirty pages have been
+             *  copied into an output buffer.
+             * Callback function resumes the guest & the device model,
+             *  returns to xc_domain_save.
+             * xc_domain_save then flushes the output buffer, while the
+             *  guest continues to run.
+             */
+            int (*aftercopy)(void* data);
+
+            /* Called after the memory checkpoint has been flushed
+             * out into the network. Typical actions performed in this
+             * callback include:
+             *   (a) send the saved device model state (for HVM guests),
+             *   (b) wait for checkpoint ack
+             *   (c) release the network output buffer pertaining to the acked
+             *       checkpoint.
+             *   (c) sleep for the checkpoint interval.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint */
+            int (*checkpoint)(void* data);
+
+            /*
+             * Called after the checkpoint callback.
+             *
+             * returns:
+             * 0: terminate checkpointing gracefully
+             * 1: take another checkpoint
+             */
+            int (*wait_checkpoint)(void* data);
+
+            /* Enable qemu-dm logging dirty pages to xen */
+            int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+        };
+    };
 
     /* to be provided as the last argument to each callback function */
     void* data;
diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index ce72e0d..244c536 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -202,20 +202,24 @@ struct xc_sr_context
 
             enum {
                 XC_SAVE_PHASE_PRECOPY,
-                XC_SAVE_PHASE_STOP_AND_COPY
+                XC_SAVE_PHASE_STOP_AND_COPY,
+                XC_SAVE_PHASE_POSTCOPY
             } phase;
 
             struct precopy_stats stats;
             int policy_decision;
 
             enum {
-                XC_SR_SAVE_BATCH_PRECOPY_PAGE
+                XC_SR_SAVE_BATCH_PRECOPY_PAGE,
+                XC_SR_SAVE_BATCH_POSTCOPY_PFN,
+                XC_SR_SAVE_BATCH_POSTCOPY_PAGE
             } batch_type;
             xen_pfn_t *batch_pfns;
             unsigned nr_batch_pfns;
             unsigned long *deferred_pages;
             unsigned long nr_deferred_pages;
             xc_hypercall_buffer_t dirty_bitmap_hbuf;
+            unsigned long nr_final_dirty_pages;
         } save;
 
         struct /* Restore data. */
diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 9f077a3..81b4755 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -3,21 +3,28 @@
 
 #include "xc_sr_common.h"
 
-#define MAX_BATCH_SIZE MAX_PRECOPY_BATCH_SIZE
+#define MAX_BATCH_SIZE \
+    max(max(MAX_PRECOPY_BATCH_SIZE, MAX_PFN_BATCH_SIZE), MAX_POSTCOPY_BATCH_SIZE)
 
 static const unsigned int batch_sizes[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = MAX_PRECOPY_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = MAX_PFN_BATCH_SIZE,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = MAX_POSTCOPY_BATCH_SIZE
 };
 
 static const bool batch_includes_contents[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE] = true
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = true,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = false,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = true
 };
 
 static const uint32_t batch_rec_types[] =
 {
-    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA
+    [XC_SR_SAVE_BATCH_PRECOPY_PAGE]  = REC_TYPE_PAGE_DATA,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PFN]  = REC_TYPE_POSTCOPY_PFNS,
+    [XC_SR_SAVE_BATCH_POSTCOPY_PAGE] = REC_TYPE_POSTCOPY_PAGE_DATA
 };
 
 /*
@@ -84,6 +91,38 @@ static int write_checkpoint_record(struct xc_sr_context *ctx)
 }
 
 /*
+ * Writes a POSTCOPY_BEGIN record into the stream.
+ */
+static int write_postcopy_begin_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_begin = { REC_TYPE_POSTCOPY_BEGIN, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_begin);
+}
+
+/*
+ * Writes a POSTCOPY_PFNS_BEGIN record into the stream.
+ */
+static int write_postcopy_pfns_begin_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_pfns_begin =
+        { REC_TYPE_POSTCOPY_PFNS_BEGIN, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_pfns_begin);
+}
+
+/*
+ * Writes a POSTCOPY_TRANSITION record into the stream.
+ */
+static int write_postcopy_transition_record(struct xc_sr_context *ctx)
+{
+    struct xc_sr_record postcopy_transition =
+        { REC_TYPE_POSTCOPY_TRANSITION, 0, NULL };
+
+    return write_record(ctx, ctx->fd, &postcopy_transition);
+}
+
+/*
  * This function:
  * - maps each pfn in the current batch to its gfn
  * - gets the type of each pfn in the batch.
@@ -388,6 +427,125 @@ static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
 }
 
 /*
+ * This function:
+ * - flushes the current batch of postcopy pfns into the migration stream
+ * - clears the dirty bits of all pfns with no migrateable backing data
+ * - counts the number of pfns that _do_ have migrateable backing data, adding
+ *   it to nr_final_dirty_pages
+ */
+static int flush_postcopy_pfns_batch(struct xc_sr_context *ctx)
+{
+    int rc = 0;
+    xc_interface *xch = ctx->xch;
+    xen_pfn_t *pfns = ctx->save.batch_pfns, *gfns = NULL, *types = NULL;
+    unsigned int i, nr_pfns = ctx->save.nr_batch_pfns;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PFN);
+
+    if ( batch_empty(ctx) )
+        goto out;
+
+    gfns = malloc(nr_pfns * sizeof(*gfns));
+    types = malloc(nr_pfns * sizeof(*types));
+
+    if ( !gfns || !types )
+    {
+        ERROR("Unable to allocate arrays for a batch of %u pages",
+              nr_pfns);
+        rc = -1;
+        goto out;
+    }
+
+    rc = get_batch_info(ctx, gfns, types);
+    if ( rc )
+        goto out;
+
+    /*
+     * Consider any pages not backed by a physical page of data to have been
+     * 'cleaned' at this point - there's no sense wasting room in a subsequent
+     * postcopy batch to duplicate the type information.
+     */
+    for ( i = 0; i < nr_pfns; ++i )
+    {
+        switch ( types[i] )
+        {
+        case XEN_DOMCTL_PFINFO_BROKEN:
+        case XEN_DOMCTL_PFINFO_XALLOC:
+        case XEN_DOMCTL_PFINFO_XTAB:
+            clear_bit(pfns[i], dirty_bitmap);
+            continue;
+        }
+
+        ++ctx->save.nr_final_dirty_pages;
+    }
+
+    rc = write_batch(ctx, gfns, types);
+    if ( !rc )
+    {
+        VALGRIND_MAKE_MEM_UNDEFINED(ctx->save.batch_pfns,
+                                    MAX_BATCH_SIZE *
+                                    sizeof(*ctx->save.batch_pfns));
+    }
+
+ out:
+    free(gfns);
+    free(types);
+
+    return rc;
+}
+
+/*
+ * This function:
+ * - writes a POSTCOPY_PFNS_BEGIN record into the stream
+ * - writes 0 or more POSTCOPY_PFNS records specifying the subset of domain
+ *   memory that must be migrated during the upcoming postcopy phase of the
+ *   migration
+ * - counts the number of pfns in this subset, storing it in
+ *   nr_final_dirty_pages
+ */
+static int send_postcopy_pfns(struct xc_sr_context *ctx)
+{
+    xen_pfn_t p;
+    int rc;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    /*
+     * The true nr_final_dirty_pages is iteratively computed by
+     * flush_postcopy_pfns_batch(), which counts only pages actually backed by
+     * data we need to migrate.
+     */
+    ctx->save.nr_final_dirty_pages = 0;
+
+    rc = write_postcopy_pfns_begin_record(ctx);
+    if ( rc )
+        return rc;
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PFN;
+    for ( p = 0; p < ctx->save.p2m_size; ++p )
+    {
+        if ( !test_bit(p, dirty_bitmap) )
+            continue;
+
+        if ( batch_full(ctx) )
+        {
+            rc = flush_postcopy_pfns_batch(ctx);
+            if ( rc )
+                return rc;
+        }
+
+        add_to_batch(ctx, p);
+    }
+
+    return flush_postcopy_pfns_batch(ctx);
+}
+
+/*
  * Pause/suspend the domain, and refresh ctx->dominfo if required.
  */
 static int suspend_domain(struct xc_sr_context *ctx)
@@ -716,20 +874,19 @@ static int colo_merge_secondary_dirty_bitmap(struct xc_sr_context *ctx)
 }
 
 /*
- * Suspend the domain and send dirty memory.
- * This is the last iteration of the live migration and the
- * heart of the checkpointed stream.
+ * Suspend the domain and determine the final set of dirty pages.
  */
-static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+static int suspend_and_check_dirty(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
     xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
-    char *progress_str = NULL;
     int rc;
     DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
                                     &ctx->save.dirty_bitmap_hbuf);
 
-    ctx->save.phase = XC_SAVE_PHASE_STOP_AND_COPY;
+    ctx->save.phase = (ctx->save.policy_decision == XGS_POLICY_POSTCOPY)
+        ? XC_SAVE_PHASE_POSTCOPY
+        : XC_SAVE_PHASE_STOP_AND_COPY;
 
     rc = suspend_domain(ctx);
     if ( rc )
@@ -746,16 +903,6 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         goto out;
     }
 
-    if ( ctx->save.live )
-    {
-        rc = update_progress_string(ctx, &progress_str,
-                                    ctx->save.stats.iteration);
-        if ( rc )
-            goto out;
-    }
-    else
-        xc_set_progress_prefix(xch, "Checkpointed save");
-
     bitmap_or(dirty_bitmap, ctx->save.deferred_pages, ctx->save.p2m_size);
 
     if ( !ctx->save.live && ctx->save.checkpointed == XC_MIG_STREAM_COLO )
@@ -768,19 +915,37 @@ static int suspend_and_send_dirty(struct xc_sr_context *ctx)
         }
     }
 
-    rc = send_dirty_pages(ctx, stats.dirty_count + ctx->save.nr_deferred_pages);
-    if ( rc )
-        goto out;
+    if ( !ctx->save.live || ctx->save.policy_decision != XGS_POLICY_POSTCOPY )
+    {
+        /*
+         * If we aren't transitioning to a postcopy live migration, then rather
+         * than explicitly counting the number of final dirty pages, simply
+         * (somewhat crudely) estimate it as this sum to save time.  If we _are_
+         * about to begin postcopy then we don't bother, since our count must in
+         * that case be exact and we'll work it out later on.
+         */
+        ctx->save.nr_final_dirty_pages =
+            stats.dirty_count + ctx->save.nr_deferred_pages;
+    }
 
     bitmap_clear(ctx->save.deferred_pages, ctx->save.p2m_size);
     ctx->save.nr_deferred_pages = 0;
 
  out:
-    xc_set_progress_prefix(xch, NULL);
-    free(progress_str);
     return rc;
 }
 
+static int suspend_and_send_dirty(struct xc_sr_context *ctx)
+{
+    int rc;
+
+    rc = suspend_and_check_dirty(ctx);
+    if ( rc )
+        return rc;
+
+    return send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages);
+}
+
 static int verify_frames(struct xc_sr_context *ctx)
 {
     xc_interface *xch = ctx->xch;
@@ -821,11 +986,13 @@ static int verify_frames(struct xc_sr_context *ctx)
 }
 
 /*
- * Send all domain memory.  This is the heart of the live migration loop.
+ * Send all domain memory, modulo postcopy pages.  This is the heart of the 
live
+ * migration loop.
  */
 static int send_domain_memory_live(struct xc_sr_context *ctx)
 {
     int rc;
+    xc_interface *xch = ctx->xch;
 
     rc = enable_logdirty(ctx);
     if ( rc )
@@ -835,10 +1002,19 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     if ( rc )
         goto out;
 
-    rc = suspend_and_send_dirty(ctx);
+    rc = suspend_and_check_dirty(ctx);
     if ( rc )
         goto out;
 
+    if ( ctx->save.policy_decision == XGS_POLICY_STOP_AND_COPY )
+    {
+        xc_set_progress_prefix(xch, "Final precopy iteration");
+        rc = send_dirty_pages(ctx, ctx->save.nr_final_dirty_pages);
+        xc_set_progress_prefix(xch, NULL);
+        if ( rc )
+            goto out;
+    }
+
     if ( ctx->save.debug && ctx->save.checkpointed != XC_MIG_STREAM_NONE )
     {
         rc = verify_frames(ctx);
@@ -850,12 +1026,223 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
     return rc;
 }
 
+static int handle_postcopy_faults(struct xc_sr_context *ctx,
+                                  struct xc_sr_record *rec,
+                                  /* OUT */ unsigned long *nr_new_fault_pfns,
+                                  /* OUT */ xen_pfn_t *last_fault_pfn)
+{
+    int rc;
+    unsigned int i;
+    xc_interface *xch = ctx->xch;
+    struct xc_sr_rec_pages_header *fault_pages = rec->data;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    assert(nr_new_fault_pfns);
+    *nr_new_fault_pfns = 0;
+
+    rc = validate_pages_record(ctx, rec, REC_TYPE_POSTCOPY_FAULT);
+    if ( rc )
+        return rc;
+
+    DBGPRINTF("Handling a batch of %"PRIu32" faults!", fault_pages->count);
+
+    assert(ctx->save.batch_type == XC_SR_SAVE_BATCH_POSTCOPY_PAGE);
+    for ( i = 0; i < fault_pages->count; ++i )
+    {
+        if ( test_and_clear_bit(fault_pages->pfn[i], dirty_bitmap) )
+        {
+            if ( batch_full(ctx) )
+            {
+                rc = flush_batch(ctx);
+                if ( rc )
+                    return rc;
+            }
+
+            add_to_batch(ctx, fault_pages->pfn[i]);
+            ++(*nr_new_fault_pfns);
+        }
+    }
+
+    /* _Don't_ flush yet - fill out the rest of the batch. */
+
+    assert(fault_pages->count);
+    *last_fault_pfn = fault_pages->pfn[fault_pages->count - 1];
+    return 0;
+}
+
+/*
+ * Now that the guest has resumed at the destination, send all of the remaining
+ * dirty pages.  Periodically check for pages needed by the destination to make
+ * progress.
+ */
+static int postcopy_domain_memory(struct xc_sr_context *ctx)
+{
+    int rc;
+    xc_interface *xch = ctx->xch;
+    int recv_fd = ctx->save.recv_fd;
+    int old_flags;
+    struct xc_sr_read_record_context rrctx;
+    struct xc_sr_record rec = { 0, 0, NULL };
+    unsigned long nr_new_fault_pfns;
+    unsigned long pages_remaining = ctx->save.nr_final_dirty_pages;
+    xen_pfn_t last_fault_pfn, p;
+    bool received_postcopy_complete = false;
+
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
+    read_record_init(&rrctx, ctx);
+
+    /*
+     * First, configure the receive stream as non-blocking so we can
+     * periodically poll it for fault requests.
+     */
+    old_flags = fcntl(recv_fd, F_GETFL);
+    if ( old_flags == -1 )
+    {
+        rc = old_flags;
+        goto err;
+    }
+
+    assert(!(old_flags & O_NONBLOCK));
+
+    rc = fcntl(recv_fd, F_SETFL, old_flags | O_NONBLOCK);
+    if ( rc == -1 )
+    {
+        goto err;
+    }
+
+    xc_set_progress_prefix(xch, "Postcopy phase");
+
+    assert(batch_empty(ctx));
+    ctx->save.batch_type = XC_SR_SAVE_BATCH_POSTCOPY_PAGE;
+
+    p = 0;
+    while ( pages_remaining )
+    {
+        /*
+         * Between (small) batches, poll the receive stream for new
+         * POSTCOPY_FAULT messages.
+         */
+        for ( ; ; )
+        {
+            rc = try_read_record(&rrctx, recv_fd, &rec);
+            if ( rc )
+            {
+                if ( (errno == EAGAIN) || (errno == EWOULDBLOCK) )
+                {
+                    break;
+                }
+
+                goto err;
+            }
+            else
+            {
+                /*
+                 * Tear down and re-initialize the read record context for the
+                 * next request record.
+                 */
+                read_record_destroy(&rrctx);
+                read_record_init(&rrctx, ctx);
+
+                if ( rec.type == REC_TYPE_POSTCOPY_COMPLETE )
+                {
+                    /*
+                     * The restore side may ultimately not need all of the pages
+                     * we think it does - for example, the guest may release
+                     * some outstanding pages.  If this occurs, we'll receive
+                     * this record before we'd otherwise expect to.
+                     */
+                    received_postcopy_complete = true;
+                    goto done;
+                }
+
+                rc = handle_postcopy_faults(ctx, &rec, &nr_new_fault_pfns,
+                                            &last_fault_pfn);
+                if ( rc )
+                    goto err;
+
+                free(rec.data);
+                rec.data = NULL;
+
+                assert(pages_remaining >= nr_new_fault_pfns);
+                pages_remaining -= nr_new_fault_pfns;
+
+                /*
+                 * To take advantage of any locality present in the postcopy
+                 * faults, continue the background copy process from the newest
+                 * page in the fault batch.
+                 */
+                p = (last_fault_pfn + 1) % ctx->save.p2m_size;
+            }
+        }
+
+        /*
+         * Now that we've serviced all of the POSTCOPY_FAULT requests we know
+         * about for now, fill out the current batch with background pages.
+         */
+        for ( ;
+              pages_remaining && !batch_full(ctx);
+              p = (p + 1) % ctx->save.p2m_size )
+        {
+            if ( test_and_clear_bit(p, dirty_bitmap) )
+            {
+                add_to_batch(ctx, p);
+                --pages_remaining;
+            }
+        }
+
+        rc = flush_batch(ctx);
+        if ( rc )
+            goto err;
+
+        xc_report_progress_step(
+            xch, ctx->save.nr_final_dirty_pages - pages_remaining,
+            ctx->save.nr_final_dirty_pages);
+    }
+
+ done:
+    /* Revert the receive stream to the (blocking) state we found it in. */
+    rc = fcntl(recv_fd, F_SETFL, old_flags);
+    if ( rc == -1 )
+        goto err;
+
+    if ( !received_postcopy_complete )
+    {
+        /*
+         * Flush any outstanding POSTCOPY_FAULT requests from the migration
+         * stream by reading until a POSTCOPY_COMPLETE is received.
+         */
+        do
+        {
+            rc = read_record(ctx, recv_fd, &rec);
+            if ( rc )
+                goto err;
+        } while ( rec.type != REC_TYPE_POSTCOPY_COMPLETE );
+    }
+
+ err:
+    xc_set_progress_prefix(xch, NULL);
+    free(rec.data);
+    read_record_destroy(&rrctx);
+    return rc;
+}
+
 /*
  * Checkpointed save.
  */
 static int send_domain_memory_checkpointed(struct xc_sr_context *ctx)
 {
-    return suspend_and_send_dirty(ctx);
+    int rc;
+    xc_interface *xch = ctx->xch;
+
+    xc_set_progress_prefix(xch, "Checkpointed save");
+    rc = suspend_and_send_dirty(ctx);
+    xc_set_progress_prefix(xch, NULL);
+
+    return rc;
 }
 
 /*
@@ -987,11 +1374,54 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             goto err;
         }
 
+        /*
+         * End-of-checkpoint records are handled differently in the case of
+         * postcopy migration, so we need to alert the destination before
+         * sending them.
+         */
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            rc = write_postcopy_begin_record(ctx);
+            if ( rc )
+                goto err;
+        }
+
         rc = ctx->save.ops.end_of_checkpoint(ctx);
         if ( rc )
             goto err;
 
-        if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
+        if ( ctx->save.live &&
+             ctx->save.policy_decision == XGS_POLICY_POSTCOPY )
+        {
+            xc_report_progress_single(xch, "Beginning postcopy transition");
+
+            rc = send_postcopy_pfns(ctx);
+            if ( rc )
+                goto err;
+
+            rc = write_postcopy_transition_record(ctx);
+            if ( rc )
+                goto err;
+
+            /*
+             * Yield control to libxl to finish the transition.  Note that this
+             * callback returns _non-zero_ upon success.
+             */
+            rc = ctx->save.callbacks->postcopy_transition(
+                ctx->save.callbacks->data);
+            if ( !rc )
+            {
+                rc = -1;
+                goto err;
+            }
+
+            /* When libxl is done, we can begin the postcopy loop. */
+            rc = postcopy_domain_memory(ctx);
+            if ( rc )
+                goto err;
+        }
+        else if ( ctx->save.checkpointed != XC_MIG_STREAM_NONE )
         {
             /*
              * We have now completed the initial live portion of the checkpoint
diff --git a/tools/libxc/xc_sr_save_x86_hvm.c b/tools/libxc/xc_sr_save_x86_hvm.c
index 54ddbfe..b12f0dd 100644
--- a/tools/libxc/xc_sr_save_x86_hvm.c
+++ b/tools/libxc/xc_sr_save_x86_hvm.c
@@ -92,6 +92,9 @@ static int write_hvm_params(struct xc_sr_context *ctx)
     unsigned int i;
     int rc;
 
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
+                                    &ctx->save.dirty_bitmap_hbuf);
+
     for ( i = 0; i < ARRAY_SIZE(params); i++ )
     {
         uint32_t index = params[i];
@@ -106,6 +109,16 @@ static int write_hvm_params(struct xc_sr_context *ctx)
 
         if ( value != 0 )
         {
+            if ( ctx->save.live &&
+                 ctx->save.policy_decision == XGS_POLICY_POSTCOPY &&
+                 ( index == HVM_PARAM_CONSOLE_PFN ||
+                   index == HVM_PARAM_STORE_PFN ||
+                   index == HVM_PARAM_IOREQ_PFN ||
+                   index == HVM_PARAM_BUFIOREQ_PFN ||
+                   index == HVM_PARAM_PAGING_RING_PFN ) &&
+                 test_and_clear_bit(value, dirty_bitmap) )
+                --ctx->save.nr_final_dirty_pages;
+
             entries[hdr.count].index = index;
             entries[hdr.count].value = value;
             hdr.count++;
diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h
index 40debf6..9f5b223 100644
--- a/tools/libxc/xg_save_restore.h
+++ b/tools/libxc/xg_save_restore.h
@@ -24,7 +24,21 @@
 ** We process save/restore/migrate in batches of pages; the below
 ** determines how many pages we (at maximum) deal with in each batch.
 */
-#define MAX_PRECOPY_BATCH_SIZE 1024   /* up to 1024 pages (4MB) at a time */
+#define MAX_PRECOPY_BATCH_SIZE ((size_t)1024U)   /* up to 1024 pages (4MB) */
+
+/*
+** We process the migration postcopy transition in batches of pfns to ensure
+** that we stay within the record size bound.  Because these records contain
+** only pfns (and _not_ their contents), we can accommodate many more of them
+** in a batch.
+*/
+#define MAX_PFN_BATCH_SIZE ((4U << 20) / sizeof(uint64_t)) /* up to 512k pfns */
+
+/*
+** The postcopy background copy uses a smaller batch size to ensure it can
+** quickly respond to remote faults.
+*/
+#define MAX_POSTCOPY_BATCH_SIZE ((size_t)64U)
 
 /* When pinning page tables at the end of restore, we also use batching. */
 #define MAX_PIN_BATCH  1024
diff --git a/tools/libxl/libxl_dom_save.c b/tools/libxl/libxl_dom_save.c
index b65135d..eb1271e 100644
--- a/tools/libxl/libxl_dom_save.c
+++ b/tools/libxl/libxl_dom_save.c
@@ -350,6 +350,12 @@ static int libxl__save_live_migration_precopy_policy(
     return XGS_POLICY_CONTINUE_PRECOPY;
 }
 
+static void libxl__save_live_migration_postcopy_transition_callback(void *user)
+{
+    /* XXX we're not yet ready to deal with this */
+    assert(0);
+}
+
 /*----- main code for saving, in order of execution -----*/
 
 void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
@@ -409,8 +415,11 @@ void libxl__domain_save(libxl__egc *egc, libxl__domain_save_state *dss)
         goto out;
     }
 
-    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE)
+    if (dss->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_NONE) {
         callbacks->suspend = libxl__domain_suspend_callback;
+        callbacks->postcopy_transition =
+            libxl__save_live_migration_postcopy_transition_callback;
+    }
 
     callbacks->precopy_policy = libxl__save_live_migration_precopy_policy;
     callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 50c97b4..5647b97 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -33,7 +33,8 @@ our @msgs = (
                                               'xen_pfn_t', 'console_gfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
-    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ]
+    [ 10, 'scxW',   "precopy_policy", ['struct precopy_stats', 'stats'] ],
+    [ 11, 'scxA',   "postcopy_transition", [] ]
 );
 
 #----------------------------------------
@@ -225,6 +226,7 @@ foreach my $sr (qw(save restore)) {
 
     f_decl("${setcallbacks}_${sr}", 'helper', 'void',
            "(struct ${sr}_callbacks *cbs, unsigned cbflags)");
+    f_more("${setcallbacks}_${sr}", "    memset(cbs, 0, sizeof(*cbs));\n");
 
     f_more("${receiveds}_${sr}",
            <<END_ALWAYS.($debug ? <<END_DEBUG : '').<<END_ALWAYS);
@@ -335,7 +337,7 @@ END_ALWAYS
         my $c_v = "(1u<<$msgnum)";
         my $c_cb = "cbs->$name";
         $f_more_sr->("    if ($c_cb) cbflags |= $c_v;\n", $enumcallbacks);
-        $f_more_sr->("    $c_cb = (cbflags & $c_v) ? ${encode}_${name} : 0;\n",
+        $f_more_sr->("    if (cbflags & $c_v) $c_cb = ${encode}_${name};\n",
                      $setcallbacks);
     }
     $f_more_sr->("        return 1;\n    }\n\n");
-- 
2.7.4

