[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen



On Mon, 2013-07-22 at 18:52 +0100, Andrew Cooper wrote:
> On 18/07/13 19:47, Shriram Rajagopalan wrote:
> 
> > On Thu, Jul 18, 2013 at 12:27 PM, Andrew Cooper
> > <andrew.cooper3@xxxxxxxxxx> wrote:
> >         On 18/07/13 17:20, Shriram Rajagopalan wrote:
> >         
> >         > On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper
> >         > <andrew.cooper3@xxxxxxxxxx> wrote: 
> >         > and you setting ctx->last_checkpoint = 0 basically means
> >         > that you are
> >         > banking on the far end (with older version of tools) to
> >         > close the socket, causing
> >         > "get pagebuf" to fail and subsequently land on finish.
> >         > IIRC, this was the behavior before
> >         > XC_SAVE_ID_LAST_CHECKPOINT was introduced
> >         > by Ian Campbell, to get rid of this benign error message.
> >         > 
> >         
> >         
> >         That might have been 'fine' for PV guests, but it cause
> >         active breakage for HVM domains where the qemu save record
> >         immediately follows in the migration stream.
> >         
> >         
> > 
> > 
> > Just to clarify.. the code flows like this, iirc.
> > 
> > 
> > loadpages:
> >  while (1)
> >      if !completed
> >         get pagebufs
> >         if 0 pages sent, break
> >      endif
> >      apply batch (pagebufs)
> >  endwhile
> > 
> > 
> >  if !completed
> >    get tailbuf [[this is where the QEMU record would be obtained]]
> >    completed = 1
> >  endif
> > 
> > 
> >  if last_checkpoint
> >    goto finish
> >  endif
> > 
> > 
> >  get pagebuf or goto finish on error ---> this is where old code
> > used to exit
> >  get tailbuf
> > goto loadpages
> > finish:
> >    apply tailbuf [tailbuf obtained inside the 'if !completed' block]
> >    do the rest of the restore
> 
> (Logically joining the two divergent threads, as this is the answer to
> both)
> 
> This has nothing to do with the buffering mode, and that is not where
> the Qemu record would be obtained from.
> 
> As the code currently stands, if a XC_SAVE_ID_LAST_CHECKPOINT chunk is
> not seen in the stream, we complete the loadpages: section, including
> the magic pages (TSS, console info etc).
> 
> We then read the buffer_tail() on line 171, set ctx->completed on line
> 1725, but fail the ctx->last_checkpoint check on line 1758.
> 
> What we should do is pass the last_checkpoint test, and goto finish
> which then calls dump_qemu().  What actually happens is a call to
> pagebuf_get() on line 1766 which raises an error because of finding a
> Qemu save record rather than more pages.
> 
> So this is very much a bug directly introduced by 00a4b65f85, and can
> only be fixed with knowledge from the higher levels of the toolstack.
> 
> As for the wording of the parameter, I still prefer the original
> "last_checkpoint_unaware" over "checkpointed_stream" as it is more
> accurate.
> 
> Any migration stream started from a version of the tools after c/s
> 00a4b65f85 will work, whether it is checkpointed or not (as the
> XC_SAVE_ID_LAST_CHECKPOINT chunk will be sent in the correct place).
> Any migration started from a version of the tools before c/s
> 00a4b65f85 will fail because it has no idea that the receiving end is
> expecting it to insert XC_SAVE_ID_LAST_CHECKPOINT chunk.  The fault
> here is with the receiving end expecting to find a
> XC_SAVE_ID_LAST_CHECKPOINT chunk.
> 
> The only fix is for newer toolstack to be aware that the migration
> stream is from an older toolstack, and to set
> last_checkpoint_unaware=1, which will set ctx->last_checkpoint to 1,
> allowing the receiving side of the migration to correctly read the
> migration stream.

The toolstack cannot in general know that (i.e. how does xl know?)

It does know if it is doing checkpointing or not though.

There are four cases:

Regular migrate from pre-00a4b65f85 to post-00a4b65f85. Receiver sets
last_checkpoint = 1 at start of day. Doesn't care that sender never
sends XC_SAVE_ID_LAST_CHECKPOINT.

Regular migrate from post-00a4b65f85 to post-00a4b65f85. Receiver sets
last_checkpoint = 1 at start of day. Doesn't care that it sees
XC_SAVE_ID_LAST_CHECKPOINT again, just ends up (pointlessly) setting
last_checkpoint = 1.

Checkpointed migrate from post-00a4b65f85 to post-00a4b65f85. Receiver
doesn't set last_checkpoint = 1 at start of day.
XC_SAVE_ID_LAST_CHECKPOINT is sent at the right time and both sides
agree etc. Things work.

Checkpointed migrate from pre-00a4b65f85 to post-00a4b65f85. We don't
support this, because checkpoints should only be supported between like
versions of Xen (N->N+1 case doesn't apply).

Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.