
Re: [Xen-devel] [PATCH v5 00/21] libxl: domain save/restore: run in a separate process



On Wed, Jun 27, 2012 at 9:46 AM, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:
Shriram Rajagopalan writes ("Re: [PATCH v5 00/21] libxl: domain save/restore: run in a separate process"):
> Ian,
>  The code segfaults. Here are the system details and error traces from gdb.

Thanks.

> My setup:
>
> dom0 : ubuntu 64bit, 2.6.32-39 (pvops kernel),
>            running latest xen-4.2-unstable (built from your repo)
>            tools stack also built from your repo (which I hope has all the latest patches).
>
> domU: ubuntu 32bit PV, xenolinux kernel (2.6.32.2 - Novell SUSE version)
>            with suspend event channel support
>
> As a sanity check, I tested xl remus with latest tip from xen-unstable
> mercurial repo, c/s: 25496:e08cf97e76f0
>
> Blackhole replication (to /dev/null) and localhost replication worked as expected
> and the guest recovered properly without any issues.

Thanks for the test runes.  That didn't work entirely properly for
me, even with the xen-unstable baseline.

I did this
  xl -vvvv remus -b -i 100 debian.guest.osstest dummy >remus.log 2>&1 &
The result was that the guest's networking broke.  The guest shows up
in xl list as
  debian.guest.osstest                      7   512     1     ---ss-       5.2
and is still responsive on its pv console.  

This is normal. You are suspending every 100ms, so when you see ---ss-,
you just happened to run "xl list" right when the guest was suspended. :)

Run xl top and you will see the guest's state oscillate between --b-- and --s--
depending on the checkpoint interval, or just run xl list multiple times.
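
For example, something like this quick loop (guest name taken from your rune above;
the 50ms sleep is just a sampling interval I picked) should show the state flipping
back and forth:

  # sample the guest's state every 50ms for about a second; the state
  # column should alternate between blocked (b) and suspended (s)
  for i in $(seq 1 20); do
      xl list debian.guest.osstest | tail -n 1
      sleep 0.05
  done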

 
After I killed the remus
process, the guest's networking was still broken.

 
That is strange.  xl remus has literally no networking support on the remus
front, so it shouldn't affect anything in the guest. In fact, I repeated your test
on my box, where the guest was continuously pinging a host. Pings continued
to work, and so did ssh.

 
At the start, the guest prints this on its console:
 [   36.017241] WARNING: g.e. still in use!
 [   36.021056] WARNING: g.e. still in use!
 [   36.024740] WARNING: g.e. still in use!
 [   36.024763] WARNING: g.e. still in use!

If I try the rune with "localhost" I would have expected, surely, to
see a domain with the incoming migration?  But I don't.  I tried
killing the `xl remus' process and the guest became wedged.


With "-b" option the second argument (localhost|dummy) is ignored. Did you
try the command without the -b option, i.e.
xl remus -vvv -e domU localhost 

But I was partially able to reproduce some of your test results without your
patches (i.e. on the xen-unstable baseline). See the end of this mail for more details.


However, when I apply my series, I can indeed produce an assertion
failure:

 xc: detail: All memory is saved
 xc: error: Could not get domain info (3 = No such process): Internal error
 libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 3077579968: No such process
 xl: libxl_event.c:1426: libxl__ao_inprogress_gc: Assertion `ao->magic == 0xA0FACE00ul' failed.

So I have indeed made matters worse.


> Blackhole replication:
> ================
> xl error:
> ----------
> xc: error: Could not get domain info (3 = No such process): Internal error
> libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 4154075147: No such process
> libxl: error: libxl_dom.c:1184:libxl__domain_save_device_model: unable to open qemu save file ?8b: No such file or directory

I don't see that at all.

NB that PV guests may have a qemu for certain disk backends, or
consoles, depending on the configuration.  Can you show me your domain
config?  Mine is below.


Ah, that explains the qemu-related calls.
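
If it helps, here is a quick way I would check whether a device model actually got
attached to the PV guest (the xenstore path below is from memory, so treat it as an
assumption):

  # replace 7 with the guest's domid, e.g. from "xl domid xltest2"
  xenstore-ls /local/domain/0/device-model/7 2>/dev/null \
      || echo "no device-model node for this domain"
  # and/or just look for a qemu process serving the guest
  ps aux | grep '[q]emu'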

My guest config (from tests on a 32-bit PV domU with suspend event channel support):

kernel = "/home/kernels/vmlinuz-2.6.32.2-xenu"
memory = 1024
name = "xltest2"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:01,bridge=eth0' ]
disk = [ 'phy:/dev/drbd1,xvda1,w']
hostname= "rshriram-vm3"
root = "/dev/xvda1 ro"
extra = "console=xvc0 3"
on_reboot   = 'destroy'
on_crash    = 'coredump-destroy'

NB: This guest kernel has suspend-event-channel support,
which I believe is available in all SUSE kernels. If you would
like to use mine, I can share the source tarball (2.6.32.2 version + kernel config).
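
For reference, a quick way to check whether a running guest actually advertises the
suspend event channel (the xenstore node name below is from memory, so double-check it):

  # the guest writes this node if it supports the fast suspend protocol;
  # xenstore-read simply fails if the node is absent
  xenstore-read /local/domain/$(xl domid xltest2)/device/suspend/event-channel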


> I also ran xl in GDB to get a stack trace and hopefully some useful debug info.
> gdb traces: http://pastebin.com/7zFwFjW4

I get a different crash - see above.

> Localhost replication: Partial success, but xl still segfaults
>  dmesg shows
>  [ 1399.254849] xl[4716]: segfault at 0 ip 00007f979483a417 sp 00007fffe06043e0 error 6 in libxenlight.so.2.0.0[7f9794807000+4d000]

I see exactly the same thing with `localhost' instead of `dummy'.  And
I see no incoming domain.

I will investigate the crash I see.  In the meantime can you try to
help me see why it doesn't work for me even with the baseline?



I also tested with a 64-bit 3.3.0 PV kernel (without suspend-event-channel support).

guest config:
kernel = "/home/kernels/vmlinuz-3.3.0-rc1-xenu"
memory = 1024
name = "xl-ubuntu-pv64"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:03, bridge=eth0' ]
disk = [ 'phy:/dev/vgdrbd/ubuntu-pv64,xvda1,w' ]
hostname= "rshriram-vm1"
root = "/dev/xvda1 ro"
extra = "console=hvc0 3"

With the xen-unstable baseline:
Test 1. Blackhole replication
 command: nohup xl remus -vvv -e -b -i 100 xl-ubuntu-pv64 dummy >blackhole.log 2>&1 &
 result: works (networking included)
debug output:
libxl: debug: libxl_dom.c:687:libxl__domain_suspend_common_callback: issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:691:libxl__domain_suspend_common_callback: wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:738:libxl__domain_suspend_common_callback: guest acknowledged suspend request
libxl: debug: libxl_dom.c:742:libxl__domain_suspend_common_callback: wait for the guest to suspend
libxl: debug: libxl_dom.c:754:libxl__domain_suspend_common_callback: guest has suspended

 caveat: killing remus doesn't do a proper cleanup, i.e. if you kill it while the domain is
             suspended, it leaves the domain in the suspended state (where libxl waits for the guest to suspend).
             It's a pain. In the xend/python version, I added a SIGUSR1 handler, so that one could do
             pkill -USR1 -f remus and gracefully exit remus without wedging the domU (see the sketch below).

             * I do not know if adding signal handlers is frowned upon in xl land :)
               If there is some protocol in place to handle such things, I would be happy to send
               a patch that ensures that the guest is "resumed" while doing blackhole replication.
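
For comparison, the graceful teardown with the xend/python remus looked roughly
like this (typed from memory, so take it as a sketch):

  # ask remus to resume the guest and exit cleanly
  pkill -USR1 -f remus
  # the guest should show up as blocked/running again, not stuck suspended
  xm list domU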

Test 2. Localhost replication w/ failover by destroying primary VM
 command: nohup xl remus -vvv -b -i 100 xl-ubuntu-pv64 localhost >blackhole.log 2>&1 &
 result: works (networking included)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

