
Re: [Xen-devel] [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen



On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
Commit 00a4b65f8534c9e6521eab2e6ce796ae36037774 Sep 7 2010
  "libxc: provide notification of final checkpoint to restore end"
broke migration from any version of Xen using tools from prior to that commit

Older tools have no idea about XC_SAVE_ID_LAST_CHECKPOINT, causing the newer
tools' xc_domain_restore() to start reading the qemu save record, as
ctx->last_checkpoint is 0.

The failure looks like:
  xc: error: Max batch size exceeded (1970103633). Giving up.
where 1970103633 = 0x756d6551 = *(uint32_t*)"Qemu"
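
For reference, the arithmetic above can be checked with a minimal standalone
C snippet (illustrative only, not part of the patch):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The four ASCII bytes of the qemu save record signature. */
        const char magic[4] = { 'Q', 'e', 'm', 'u' };
        uint32_t batch;

        /* memcpy() avoids the strict-aliasing trouble of *(uint32_t*)"Qemu". */
        memcpy(&batch, magic, sizeof(batch));

        /* Prints 1970103633 (0x756d6551) on a little-endian host. */
        printf("%u (0x%08x)\n", (unsigned)batch, (unsigned)batch);
        return 0;
    }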

Sadly, the simple fix of just setting ctx->last_checkpoint = 1 will cause a
functional regression in the opposite direction for anyone relying on the
current behaviour of save_callbacks->checkpoint().

The only safe fix is to rely on the toolstack to provide this information.

Passing 0 results in unchanged behaviour, while passing nonzero means "the
other end of the migration stream does not know about
XC_SAVE_ID_LAST_CHECKPOINT but is performing a normal migrate"

Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

---
Changes since v1:
 * Rename "last_checkpoint_unaware" to "checkpointed_stream" to be more
   generic yet still accurate as to the meaning and fix for the issue
---
 tools/libxc/xc_domain_restore.c |    3 ++-
 tools/libxc/xc_nomigrate.c      |    2 +-
 tools/libxc/xenguest.h          |    3 ++-
 tools/libxl/libxl_save_helper.c |    2 +-
 tools/xcutils/xc_restore.c      |    2 +-
 5 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 63d36cd..e50b185 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1401,7 +1401,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       domid_t store_domid, unsigned int console_evtchn,
                       unsigned long *console_mfn, domid_t console_domid,
                       unsigned int hvm, unsigned int pae, int superpages,
-                      int no_incr_generationid,
+                      int no_incr_generationid, int checkpointed_stream,
                       unsigned long *vm_generationid_addr,
                       struct restore_callbacks *callbacks)
 {
@@ -1472,6 +1472,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,

     ctx->superpages = superpages;
     ctx->hvm = hvm;
+    ctx->last_checkpoint = !!checkpointed_stream;


Shouldn't it be a single unary NOT?
i.e., ctx->last_checkpoint = !checkpointed_stream;

Otherwise, I am a bit confused. There is a memset(ctx, 0) right before this,
and you seem to be passing checkpointed_stream = 0 from both xc_restore and
libxl, which leaves ctx->last_checkpoint at 0 anyway.
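
For what it's worth, a tiny standalone snippet (illustrative only; only the
variable name is taken from the patch) spells out what the two candidate
expressions evaluate to:

    #include <stdio.h>

    int main(void)
    {
        int values[] = { 0, 1 };   /* the two cases the commit message distinguishes */

        for (unsigned i = 0; i < 2; i++) {
            int checkpointed_stream = values[i];
            printf("checkpointed_stream=%d: !!v=%d (patch as posted), !v=%d (suggested)\n",
                   checkpointed_stream,
                   !!checkpointed_stream,   /* 0 -> 0, nonzero -> 1 */
                   !checkpointed_stream);   /* 0 -> 1, nonzero -> 0 */
        }
        return 0;
    }

So with both current callers passing 0, the patch as posted leaves
last_checkpoint at the memset value of 0.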

Also, after getting pages from the last iteration, the code goes like this:
 if (ctx->last_checkpoint)
    goto finish;
 get pagebuf or goto finish on error
 get tailbuf or goto finish on error
 goto loadpages;

finish:
   ...

and your setting of ctx->last_checkpoint = 0 basically means that you are
banking on the far end (running an older version of the tools) closing the
socket, causing "get pagebuf" to fail and the restore to fall through to
finish (sketched below). IIRC, that was the behaviour before
XC_SAVE_ID_LAST_CHECKPOINT was introduced by Ian Campbell, precisely to get
rid of this benign error message:

"This allows the restore end to finish up without waiting for a
spurious timeout on the receive fd and thereby avoids unnecessary
error logging in the case of a successful migration or restore."

Further,
"In the normal migration or restore case the first checkpoint is always
the last. For a rolling checkpoint (such as Remus) the notification is
currently unused."
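
To make that fall-through concrete, here is a toy model of the loop (the
names and the fake reader are invented for illustration; only the
control-flow shape mirrors the pseudo-code earlier in this mail):

    #include <stdbool.h>
    #include <stdio.h>

    static int batches_left;

    /* Stand-in for reading the next page batch; fails once the sender
     * (old tools, no XC_SAVE_ID_LAST_CHECKPOINT) has closed the socket. */
    static bool get_pagebuf(void)
    {
        if (batches_left-- > 0)
            return true;
        fprintf(stderr, "benign error: read failed, far end closed the socket\n");
        return false;
    }

    static void restore(int last_checkpoint)
    {
        for (;;) {
            /* ... pages of the current iteration are loaded here ... */
            if (last_checkpoint)
                break;                  /* the clean "goto finish" */
            if (!get_pagebuf())
                break;                  /* also reaches finish, but via an error */
            /* ... tailbuf read, then loop round for the next checkpoint ... */
        }
        printf("finish: building domain (last_checkpoint=%d)\n", last_checkpoint);
    }

    int main(void)
    {
        batches_left = 1; restore(0);   /* relies on EOF, logs the spurious error */
        batches_left = 1; restore(1);   /* LAST_CHECKPOINT-aware path: no error */
        return 0;
    }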



 

