
Re: [Xen-devel] [PATCH v3 00/18] libxl: domain save/restore: run in a separate process



On Wed, 2012-06-13 at 11:22 +0100, Ian Campbell wrote:
> On Wed, 2012-06-13 at 09:59 +0100, Ian Campbell wrote:
> > On Fri, 2012-06-08 at 18:34 +0100, Ian Jackson wrote:
> > > This is v3 of my series to asyncify save/restore, rebased to current
> > > tip, retested, and with all comments addressed.
> > 
> > There are quite a lot of combinations which need testing here (PV, HVM,
> > HVM w/ stub dm, old vs new qemu, etc); which of those have you tried?
> > 
> > I tried a simple localhost migrate of a PV guest and:
> >         # xl -vvv migrate d32-1 localhost
> >         migration target: Ready to receive domain.
> >         Saving to migration stream new xl format (info 0x0/0x0/3541)
> >         libxl: debug: libxl.c:722:libxl_domain_suspend: ao 0x8069720: create: how=(nil) callback=(nil) poller=0x80696c8
> >         Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/3541)
> >          Savefile contains xl domain config
> >         libxl: debug: libxl_dom.c:969:libxl__toolstack_save: domain=2 toolstack data size=8
> >         libxl: debug: libxl.c:745:libxl_domain_suspend: ao 0x8069720: inprogress: poller=0x80696c8, flags=i
> >         libxl-save-helper: debug: starting save: Success
> >         xc: detail: Had 0 unexplained entries in p2m table
> >         xc: Saving memory: iter 0 (last sent 0 skipped 0): 0/131072    0%
> >         
> > at which point it appears to just stop.
> > 
> >         # strace -p 2872 # /usr/lib/xen/bin/libxl-save-helper --save-domain 8 2 0 0 1 0 0 12 8 72
> >         Process 2872 attached - interrupt to quit
> >         write(8, 0xb5d31000, 1974272^C <unfinished ...>
> >         Process 2872 detached
> >         # strace -p 2866 # /usr/lib/xen/bin/libxl-save-helper --restore-domain 0 3 1 0 2 0 0 1 0 0 0
> 
> The first zero here is restore_fd, I think. But I read in the comment in
> the helper:
>         > + * The helper talks on stdin and stdout, in binary in machine
>         > + * endianness.  The helper speaks first, and only when it has a
>         > + * callback to make.  It writes a 16-bit number being the message
>         > + * length, and then the message body.
> 
> So restore_fd == stdin => running two protocols over the same fd?
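
To spell out what that comment describes: each callback message from the
helper is a native-endian uint16_t length followed by that many bytes of
body. Purely to illustrate the framing (this is not code from the series,
and the names are made up), the read side would be roughly:

    /* Illustrative sketch only: read one helper callback message,
     * framed as a native-endian 16-bit length plus message body. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int read_exact(int fd, void *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t r = read(fd, (char *)buf + got, len - got);
            if (r <= 0) return -1;          /* error or unexpected EOF */
            got += r;
        }
        return 0;
    }

    /* Returns a malloc'd body of *lenp bytes, or NULL on error. */
    static void *read_helper_message(int fd, uint16_t *lenp)
    {
        uint16_t len;
        void *body;

        if (read_exact(fd, &len, sizeof(len))) return NULL;
        body = malloc(len ? len : 1);
        if (!body || read_exact(fd, body, len)) { free(body); return NULL; }
        *lenp = len;
        return body;
    }

so anything else arriving interleaved on the same fd, e.g. the raw
migration stream, can't coexist with that framing.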

Oh, right, migrate-receive takes the migration fd on stdin, doesn't it,
so that's where it comes from. I still suspect it is wrong. Might need
to dup the input onto a safe fd?
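
Something along these lines ought to do (entirely untested sketch, the
function name is made up): duplicate the migration stream onto the first
free fd >= 3 before handing it to the save/restore machinery, leaving
fd 0 alone:

    #include <fcntl.h>
    #include <stdio.h>

    static int move_fd_above_stdio(int fd)
    {
        /* Duplicate onto the lowest free fd >= 3; unlike dup(), this can
         * never hand back 0/1/2, and the copy is not close-on-exec, so it
         * survives into any exec'd helper. */
        int newfd = fcntl(fd, F_DUPFD, 3);
        if (newfd < 0)
            perror("fcntl(F_DUPFD)");
        return newfd;
    }

i.e. pass move_fd_above_stdio(0) as restore_fd instead of 0 itself;
whether the original fd 0 should then be closed or pointed at /dev/null
is a separate question.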

BTW, since I've been ctrl-c'ing "xl migrate" a bunch I noticed that we
seem to leak an "xl migrate-receive" and the restore side helper
process. Probably pre-existing but I thought it worth mentioning.

> 
> >         Process 2866 attached - interrupt to quit
> >         read(0, ^C <unfinished ...>
> >         # strace -p 4070 # xl -vvv migrate d32-1 localhost
> >         Process 4070 attached - interrupt to quit
> >         restart_syscall(<... resuming interrupted call ...>
> >         # strace -p 4074 # xl migrate-receive
> >         Process 4074 attached - interrupt to quit
> >         restart_syscall(<... resuming interrupted call ...>
> > 
> > So the saver seems to be blocked writing to fd 8, which is argv[1] == io_fd.
> > 
> > Also FWIW:
> >         # xl list
> >         Name                                        ID   Mem VCPUs  State   Time(s)
> >         Domain-0                                     0   511     4     r-----      24.5
> >         d32-1                                        2   128     4     -b----       0.4
> >         d32-1--incoming                              3     0     0     --p---       0.0
> > 
> > /var/log/xen/xl-d32-1.log is just "Waiting for domain d32-1 (domid 9) to
> > die [pid 4045]" (nb: this was a newer attempt than the ones above, to be
> > sure I was looking at the right log, so the domids don't match; 9 ==
> > d32-1, not the incoming one). There is no xl log for the incoming domain.
> > 
> > Also it'd be worth pinging/CCing Shriram next time to get him to sanity
> > test the Remus cases too.
> > 
> > I'm in the middle of reviewing #5/19 (the meat), I'll keep going
> > although I doubt I'll spot the cause of this...
> > 
> > Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

