
Re: [Xen-devel] [PATCH v5 00/21] libxl: domain save/restore: run in a separate process



On Wed, Jun 27, 2012 at 9:46 AM, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:
Shriram Rajagopalan writes ("Re: [PATCH v5 00/21] libxl: domain save/restore: run in a separate process"):
> Ian,
>  The code segfaults. Here are the system details and error traces from gdb.

Thanks.

> My setup:
>
> dom0 : ubuntu 64bit, 2.6.32-39 (pvops kernel),
>            running latest xen-4.2-unstable (built from your repo)
>            tools stack also built from your repo (which I hope has all the latest patches).
>
> domU: ubuntu 32bit PV, xenolinux kernel (2.6.32.2 - Novell SUSE version)
>            with suspend event channel support
>
> As a sanity check, I tested xl remus with latest tip from xen-unstable
> mercurial repo, c/s: 25496:e08cf97e76f0
>
> Blackhole replication (to /dev/null) and localhost replication worked as expected
> and the guest recovered properly without any issues.

Thanks for the test runes.  That didn't work entirely properly for
me, even with the xen-unstable baseline.

I did this
  xl -vvvv remus -b -i 100 debian.guest.osstest dummy >remus.log 2>&1 &
The result was that the guest's networking broke.  The guest shows up
in xl list as
  debian.guest.osstest                      7   512     1     ---ss-       5.2
and is still responsive on its pv console.  

This is normal. You are suspending every 100ms, so when you see ---ss-,
you just happened to run "xl list" right when the guest was suspended. :)

Run xl top and you will see the guest's state oscillate between --b-- and --s--
depending on the checkpoint interval, or just run xl list multiple times.
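
For example, something like this quick loop (guest name taken from your rune above;
the 50ms sleep is just a sampling interval I picked) should show the state flipping
back and forth:

  # sample the guest's state every 50ms for about a second; the state
  # column should alternate between blocked (b) and suspended (s)
  for i in $(seq 1 20); do
      xl list debian.guest.osstest | tail -n 1
      sleep 0.05
  done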

 
After I killed the remus
process, the guest's networking was still broken.

 
That is strange.  xl remus has literally no networking support on the remus
front, so it shouldn't affect anything in the guest. In fact, I repeated your test
on my box, where the guest was continuously pinging a host. Pings continued
to work, and so did ssh.

 
At the start, the guest prints this on its console:
 [   36.017241] WARNING: g.e. still in use!
 [   36.021056] WARNING: g.e. still in use!
 [   36.024740] WARNING: g.e. still in use!
 [   36.024763] WARNING: g.e. still in use!

If I try the rune with "localhost" I would have expected, surely, to
see a domain with the incoming migration?  But I don't.  I tried
killing the `xl remus' process and the guest became wedged.


With "-b" option the second argument (localhost|dummy) is ignored. Did you
try the command without the -b option, i.e.
xl remus -vvv -e domU localhost 

But I was partially able to reproduce some of your test results without your
patches (i.e. on the xen-unstable baseline). See the end of this mail for more details.


However, when I apply my series, I can indeed produce an assertion
failure:

 xc: detail: All memory is saved
 xc: error: Could not get domain info (3 = No such process): Internal error
 libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 3077579968: No such process
 xl: libxl_event.c:1426: libxl__ao_inprogress_gc: Assertion `ao->magic == 0xA0FACE00ul' failed.

So I have indeed made matters worse.


> Blackhole replication:
> ================
> xl error:
> ----------
> xc: error: Could not get domain info (3 = No such process): Internal error
> libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 4154075147: No such process
> libxl: error: libxl_dom.c:1184:libxl__domain_save_device_model: unable to open qemu save file ?8b: No such file or directory

I don't see that at all.

NB that PV guests may have a qemu for certain disk backends, or
consoles, depending on the configuration.  Can you show me your domain
config?  Mine is below.


Ah, that explains the qemu-related calls.
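
If it helps, here is a quick way I would check whether a device model actually got
attached to the PV guest (the xenstore path below is from memory, so treat it as an
assumption):

  # replace 7 with the guest's domid, e.g. from "xl domid xltest2"
  xenstore-ls /local/domain/0/device-model/7 2>/dev/null \
      || echo "no device-model node for this domain"
  # and/or just look for a qemu process serving the guest
  ps aux | grep '[q]emu'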

My guest config (from tests on a 32-bit PV domU with suspend event channel support):

kernel = "/home/kernels/vmlinuz-2.6.32.2-xenu"
memory = 1024
name = "xltest2"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:01,bridge=eth0' ]
disk = [ 'phy:/dev/drbd1,xvda1,w']
hostname= "rshriram-vm3"
root = "/dev/xvda1 ro"
extra = "console=xvc0 3"
on_reboot   = 'destroy'
on_crash    = 'coredump-destroy'

NB: This guest kernel has suspend-event-channel support,
which I believe is available in all SUSE kernels. If you would
like to use mine, I can share the source tarball (2.6.32.2 version + kernel config).
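
For reference, a quick way to check whether a running guest actually advertises the
suspend event channel (the xenstore node name below is from memory, so double-check it):

  # the guest writes this node if it supports the fast suspend protocol;
  # xenstore-read simply fails if the node is absent
  xenstore-read /local/domain/$(xl domid xltest2)/device/suspend/event-channel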


> I also ran xl in GDB to get a stack trace and hopefully some useful debug info.
> gdb traces: http://pastebin.com/7zFwFjW4

I get a different crash - see above.

> Localhost replication: Partial success, but xl still segfaults
>  dmesg shows
>  [ 1399.254849] xl[4716]: segfault at 0 ip 00007f979483a417 sp 00007fffe06043e0 error 6 in libxenlight.so.2.0.0[7f9794807000+4d000]

I see exactly the same thing with `localhost' instead of `dummy'.  And
I see no incoming domain.

I will investigate the crash I see.  In the meantime can you try to
help me see why it doesn't work for me even with the baseline?



I also tested with a 64-bit 3.3.0 PV kernel (without suspend-event-channel support).

guest config:
kernel = "/home/kernels/vmlinuz-3.3.0-rc1-xenu"
memory = 1024
name = "xl-ubuntu-pv64"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:03, bridge=eth0' ]
disk = [ 'phy:/dev/vgdrbd/ubuntu-pv64,xvda1,w' ]
hostname= "rshriram-vm1"
root = "/dev/xvda1 ro"
extra = "console=hvc0 3"

With the xen-unstable baseline:
Test 1. Blackhole replication
 command: nohup xl remus -vvv -e -b -i 100 xl-ubuntu-pv64 dummy >blackhole.log 2>&1 &
 result: works (networking included)
debug output:
libxl: debug: libxl_dom.c:687:libxl__domain_suspend_common_callback: issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:691:libxl__domain_suspend_common_callback: wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:738:libxl__domain_suspend_common_callback: guest acknowledged suspend request
libxl: debug: libxl_dom.c:742:libxl__domain_suspend_common_callback: wait for the guest to suspend
libxl: debug: libxl_dom.c:754:libxl__domain_suspend_common_callback: guest has suspended

 caveat: killing remus doesn't do a proper cleanup, i.e. if you kill it while the domain is
             suspended, it leaves the domain in the suspended state (where libxl waits for the guest to suspend).
             It's a pain. In the xend/python version, I added a SIGUSR1 handler, so that one could do
             pkill -USR1 -f remus and gracefully exit remus without wedging the domU (see the sketch below).

             * I do not know if adding signal handlers is frowned upon in xl land :)
               If there is some protocol in place to handle such things, I would be happy to send
               a patch that ensures that the guest is "resumed" while doing blackhole replication.
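
For comparison, the graceful teardown with the xend/python remus looked roughly
like this (typed from memory, so take it as a sketch):

  # ask remus to resume the guest and exit cleanly
  pkill -USR1 -f remus
  # the guest should show up as blocked/running again, not stuck suspended
  xm list domU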

Test 2. Localhost replication w/ failover by destroying primary VM
 command: nohup xl remus -vvv -b -i 100 xl-ubuntu-pv64 localhost >blackhole.log 2>&1 &
 result: works (networking included)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

