
Re: [Xen-devel] Questions about GPLPV stability tests



I have finally got some results (the tests are still running). Let me
first summarize what tests were running.

In total I had 4 machines. All systems had at least Xen 4.1.2 + Linux
2.6.32.48 (and one had a Xen 4.1.3 snapshot). I divided the systems
into 2 groups and each group ran a different test. There are 2
differences between the groups:
1) one group used blktap and the other did not. The main reason for
keeping the blktap systems in, even though blktap adds noise, is that
it keeps some of the tests close to what they previously were.
2) the non-blktap systems used a different CPU pinning configuration.
Previously Dom0 used CPUs 0-3 and the DomU 4-7, but due to
hyperthreading both ended up sharing physical cores (0 and 4 are
sibling threads of the same core).
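
For reference, here is a quick sketch (my own, not part of the test
setup) that prints which logical CPUs sysfs considers hyperthread
siblings. Inside Dom0 the numbering is of virtual CPUs, so something
like 'xl info -n' remains the authoritative view of the physical
topology; the sketch is mostly useful on a bare-metal boot of the same
box.

/* sibling_check.c - print hyperthread siblings per logical CPU.
 * Minimal sketch, sysfs only; adjust NR_CPUS for the machine. */
#include <stdio.h>

#define NR_CPUS 8

int main(void)
{
    char path[128], buf[64];
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (!f)
            break;  /* CPU not present or topology not exported */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d shares a core with: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}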

Now to the initial results.
- so far each machine has been up for 10 days.
- one machine failed early on due to a PCI device assignment issue. I
suspect that at some point, due to a race condition, ownership of the
PCI device wasn't transferred back correctly from the DomU to Dom0 on
VM shutdown. Both xm and xl fail with 'Error: failed to assign
device 03:00.0: maybe it has already been assigned to other domain, or
maybe it doesn't exist.' From a quick glance at the logs it looks like
the 'unmap bdf' never happened on shutdown of a previous VM. I'm not
sure why, but my guess is a toolstack bug.
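
A quick way to see whether the device ever made it back is to check
what 0000:03:00.0 is bound to in Dom0 after the DomU shutdown (with
pciback-based assignment it should end up on pciback again, or be left
unbound). A minimal sketch, plain sysfs, nothing Xen-specific:

/* pci_owner.c - print which driver currently owns 0000:03:00.0 */
#include <stdio.h>
#include <libgen.h>
#include <unistd.h>

int main(void)
{
    const char *link = "/sys/bus/pci/devices/0000:03:00.0/driver";
    char target[256];
    ssize_t len = readlink(link, target, sizeof(target) - 1);

    if (len < 0) {
        printf("0000:03:00.0 is not bound to any driver\n");
        return 0;
    }
    target[len] = '\0';
    printf("0000:03:00.0 is bound to: %s\n", basename(target));
    return 0;
}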

- One of the non-blktap machines had a kernel Oops. It happened on VM
shutdown and I wouldn't be surprised if it was similar to the crashes
we wanted to debug. The system is still in a usable state, though that
may be due to the use of Linux 2.6.32.48 or the different CPU pinning
configuration. Some of the serial output:

Thu Dec 22 23:17:38 2011 - 1232   0: 87    113   365   blkif-backend
     525325sec

Thu Dec 22 23:30:07 2011 - 1259   0: 12    6     222   xenbus
     526065sec
Thu Dec 22 23:30:07 2011 - 1250   0: 62    42    1461  ahci
     526065sec
Thu Dec 22 23:30:07 2011 - 1249   0: 37    28    75    eth0-rx-0
     526065sec
Thu Dec 22 23:30:07 2011 - 1248   0: 71    305   933   eth0-tx-0
     526065sec
Thu Dec 22 23:30:07 2011 - 1243   0: 6     3     134
evtchn:xenstored     526065sec
Thu Dec 22 23:30:07 2011 - 1241   0: 6     3     87582
evtchn:xenstored     526065sec
Thu Dec 22 23:30:07 2011 - 1240   0: 256   261   54162 evtchn:qemu-dm
     526065sec
Thu Dec 22 23:30:07 2011 - 1239   0: 244   251   7219  evtchn:qemu-dm
     526065sec
Thu Dec 22 23:30:07 2011 - 1238   0: 289   273   5931  evtchn:qemu-dm
     526065sec
Thu Dec 22 23:30:07 2011 - 1237   0: 248   245   4279  evtchn:qemu-dm
     526065sec
Thu Dec 22 23:30:07 2011 - 1236   0: 12    7     315   blkif-backend
     526065sec
Thu Dec 22 23:30:07 2011 - 1234   0: 7     4     43    veth1
     526065sec
Thu Dec 22 23:30:07 2011 - 1232 CPU0 going backwards (-3304)!
Thu Dec 22 23:30:07 2011 -   19 CPU0 going backwards (-212)!
Thu Dec 22 23:30:07 2011 -   18   0: 35    17    35    ehci_hcd:usb1,
uhci_hcd:usb8 526065sec
Thu Dec 22 23:30:07 2011 -   16 CPU0 going backwards (-176)!
Thu Dec 22 23:30:07 2011 -   12 CPU0 going backwards (-4)!
Thu Dec 22 23:30:07 2011 -    4 CPU0 going backwards (-1)!
Thu Dec 22 23:30:12 2011 - 1250   0: 384   213   1461  ahci
     526070sec
Thu Dec 22 23:30:12 2011 - 1249   0: 14    21    75    eth0-rx-0
     526070sec
Thu Dec 22 23:30:12 2011 - 1240   0: 260   260   54162 evtchn:qemu-dm
     526070sec
Thu Dec 22 23:30:12 2011 - 1239   0: 279   265   7219  evtchn:qemu-dm
     526070sec
Thu Dec 22 23:30:13 2011 - 1238   0: 271   272   5931  evtchn:qemu-dm
     526070sec
Thu Dec 22 23:30:13 2011 - 1237   0: 261   253   4279  evtchn:qemu-dm
     526070sec
Thu Dec 22 23:30:13 2011 - 1236   0: 145   76    315   blkif-backend
     526070sec
Thu Dec 22 23:30:18 2011 - 1250   0: 372   293   1461  ahci
     526075sec
Thu Dec 22 23:30:18 2011 - 1249   0: 26    24    75    eth0-rx-0
     526075sec
Thu Dec 22 23:30:18 2011 - 1248   0: 16    86    933   eth0-tx-0
     526075sec
Thu Dec 22 23:30:18 2011 - 1240   0: 234   247   54162 evtchn:qemu-dm
     526075sec
Thu Dec 22 23:30:18 2011 - 1239   0: 234   249   7219  evtchn:qemu-dm
     526075sec
Thu Dec 22 23:30:18 2011 - 1238   0: 241   256   5931  evtchn:qemu-dm
     526075sec
Thu Dec 22 23:30:18 2011 - 1237   0: 224   239   4279  evtchn:qemu-dm
     526075sec
Thu Dec 22 23:30:18 2011 - 1236   0: 100   88    315   blkif-backend
     526075sec
Thu Dec 22 23:30:23 2011 - 1249   0: 16    20    75    eth0-rx-0
     526080sec
Thu Dec 22 23:30:23 2011 - 1248   0: 7     46    933   eth0-tx-0
     526080sec
Thu Dec 22 23:30:23 2011 - 1240   0: 8     128   54162 evtchn:qemu-dm
     526080sec
Thu Dec 22 23:30:23 2011 - 1239   0: 16    133   7219  evtchn:qemu-dm
     526080sec
Thu Dec 22 23:30:23 2011 - 1238   0: 28    142   5931  evtchn:qemu-dm
     526080sec
Thu Dec 22 23:30:23 2011 - 1233 CPU0 going backwards (-12648)!
Thu Dec 22 23:30:23 2011 -   19 CPU0 going backwards (-212)!
Thu Dec 22 23:30:23 2011 -   18   0: 35    17    42    ehci_hcd:usb1,
uhci_hcd:usb8 526080sec
Thu Dec 22 23:30:23 2011 -   16 CPU0 going backwards (-176)!
Thu Dec 22 23:30:23 2011 -   12 CPU0 going backwards (-4)!
Thu Dec 22 23:30:23 2011 -    4 CPU0 going backwards (-1)!
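
(For context: the lines above come from the interrupt monitoring tool,
and the "CPU0 going backwards" messages mean a per-IRQ counter
appeared to decrease between two samples. Roughly, a monitor like that
can be built by polling /proc/interrupts; the sketch below is my own
approximation, not Konrad's read_interrupts.c, and its output format
is different.)

/* irqmon.c - poll /proc/interrupts every 5s, print per-IRQ deltas and
 * warn when a counter appears to go backwards.  Rough sketch: it sums
 * the counts over all CPUs instead of tracking them per CPU. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_IRQ 1536

int main(void)
{
    static unsigned long long prev[MAX_IRQ];
    char line[1024];

    for (;;) {
        FILE *f = fopen("/proc/interrupts", "r");

        if (!f) {
            perror("/proc/interrupts");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            char *p = line, *end;
            unsigned long long total = 0, n;
            long irq = strtol(p, &end, 10);

            /* skip the CPU header and the NMI/LOC/... summary lines */
            if (end == p || *end != ':' || irq < 0 || irq >= MAX_IRQ)
                continue;
            for (p = end + 1; ; p = end) {
                n = strtoull(p, &end, 10);
                if (end == p)
                    break;          /* reached the chip/handler name */
                total += n;
            }
            if (total < prev[irq])
                printf("IRQ %ld going backwards (%lld)!\n",
                       irq, (long long)(total - prev[irq]));
            else if (total > prev[irq])
                printf("IRQ %ld: +%llu (total %llu)\n",
                       irq, total - prev[irq], total);
            prev[irq] = total;
        }
        fclose(f);
        sleep(5);
    }
}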

[533804.785589] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[533804.793913] IP: [<ffffffff814bdba9>] _spin_lock+0x5/0x15
[533804.799556] PGD 0
[533804.801896] Oops: 0002 [#1] SMP
[533804.805448] last sysfs file:
/sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/class
[533804.813736] CPU 0
[533804.816066] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[533804.822914] Pid: 21632, comm: qemu-dm Not tainted 2.6.32.48 #1 X8STi
[533804.829595] RIP: e030:[<ffffffff814bdba9>]  [<ffffffff814bdba9>]
_spin_lock+0x5/0x15
[533804.837873] RSP: e02b:ffff88005f187c50  EFLAGS: 00010202
[533804.843513] RAX: 0000000000000100 RBX: ffff88007d6923c0 RCX:
0000000000000003
[533804.851192] RDX: ffff88007deb3180 RSI: ffff88007d6923c0 RDI:
0000000000000008
[533804.858871] RBP: ffff88007e260128 R08: 0000000000000000 R09:
0000000000000000
[533804.866552] R10: ffff88007121eb40 R11: ffffffff811b862b R12:
ffff88007fac1e40
[533804.874541] R13: ffff88007e3c3340 R14: ffff88007e3c3340 R15:
ffff88007fdfbc00
[533804.882243] FS:  00007f5cdcefe6f0(0000) GS:ffff880028038000(0000)
knlGS:0000000000000000
[533804.890874] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[533804.896948] CR2: 0000000000000008 CR3: 0000000001001000 CR4:
0000000000002660
[533804.904627] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[533804.912306] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[533804.919985] Process qemu-dm (pid: 21632, threadinfo
ffff88005f186000, task ffff880074f4e270)
[533804.928971] Stack:
[533804.931312]  ffffffff810a9ad0 ffff88007deb3180 ffff88007e260100
ffff88007e260100
[533804.938762] <0> ffffffff8124222d 0000000000000008 0000000000000008
ffff88007deb3180
[533804.946900] <0> ffffffff810b245e ffff88007fac1e40 ffff88007deb3180
0000000000000000
[533804.955465] Call Trace:
[533804.958244]  [<ffffffff810a9ad0>] ? mmu_notifier_unregister+0x27/0xa5
[533804.965019]  [<ffffffff8124222d>] ? gntdev_release+0xc3/0xd1
[533804.971007]  [<ffffffff810b245e>] ? __fput+0x100/0x1af
[533804.976477]  [<ffffffff810af998>] ? filp_close+0x5b/0x62
[533804.982119]  [<ffffffff81042989>] ? put_files_struct+0x64/0xc1
[533804.988280]  [<ffffffff810441f0>] ? do_exit+0x209/0x68d
[533804.993836]  [<ffffffff810441f8>] ? do_exit+0x211/0x68d
[533804.999390]  [<ffffffff810446e9>] ? do_group_exit+0x75/0x9c
[533805.005294]  [<ffffffff8104cf1d>] ? get_signal_to_deliver+0x30d/0x338
[533805.012063]  [<ffffffff81010f7a>] ? do_notify_resume+0x8a/0x6b1
[533805.018310]  [<ffffffff810bdb3a>] ? poll_select_copy_remaining+0xd0/0xf3
[533805.025339]  [<ffffffff81011c8e>] ? int_signal+0x12/0x17
[533805.030980] Code: 00 00 00 01 74 05 e8 67 1c d2 ff 48 89 d0 5e c3
ff 14 25 20 2d 6a 81 f0 81 2f 00 00 00 01 74 05 e8 4d 1c d2 ff c3 b8
00 01 00 00 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c3 f0 81 2f
00 00
[533805.050454] RIP  [<ffffffff814bdba9>] _spin_lock+0x5/0x15
[533805.056182]  RSP <ffff88005f187c50>
[533805.059997] CR2: 0000000000000008
[533805.063638] ---[ end trace 85ee7cbec9ce72ac ]---
[533805.068584] Fixing recursive fault but reboot is needed!

If I had to guess, some of the gnttab code in qemu-dm triggered a bug
in the gntdev code? The backtrace is qemu-dm exiting: do_exit() ->
__fput() -> gntdev_release() -> mmu_notifier_unregister(), which then
takes a spinlock through a NULL pointer (the faulting address is 0x8,
i.e. a field at offset 8 of a NULL structure). I have some experience
with gnttab/gntdev, but I don't know its internals very well.
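
In case someone wants to poke at that path without running the full
test: the cleanup in question should also be reachable from a trivial
userspace program that maps a grant through /dev/xen/gntdev and simply
exits without unmapping, so the kernel has to clean up in
gntdev_release(). A minimal sketch, assuming the usual gntdev ioctl
interface (struct ioctl_gntdev_map_grant_ref etc.; depending on the
tree the header is <xen/gntdev.h> or the copy shipped with the Xen
tools), with placeholder domid/grant ref values:

/* gntdev_exit.c - map one foreign grant via /dev/xen/gntdev and exit
 * without cleaning up, so the release path from the Oops above runs.
 * DOMID/REF are placeholders; they must name a grant actually offered
 * to this domain or the map ioctl will simply fail. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <xen/gntdev.h>   /* or the gntdev header from the Xen tools */

#define DOMID 1     /* placeholder: domain offering the grant */
#define REF   8     /* placeholder: grant reference in that domain */

int main(void)
{
    struct ioctl_gntdev_map_grant_ref map;
    void *addr;
    int fd = open("/dev/xen/gntdev", O_RDWR);

    if (fd < 0) {
        perror("/dev/xen/gntdev");
        return 1;
    }

    memset(&map, 0, sizeof(map));
    map.count = 1;
    map.refs[0].domid = DOMID;
    map.refs[0].ref = REF;
    if (ioctl(fd, IOCTL_GNTDEV_MAP_GRANT_REF, &map) < 0) {
        perror("IOCTL_GNTDEV_MAP_GRANT_REF");
        return 1;
    }

    addr = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, map.index);
    if (addr == MAP_FAILED)
        perror("mmap");

    /* No munmap, no unmap ioctl, no close: let process exit do it, so
     * gntdev_release() (see the backtrace) has real work to do. */
    return 0;
}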

Roderick

On Fri, Dec 16, 2011 at 2:25 AM, Roderick Colenbrander
<thunderbird2k@xxxxxxxxx> wrote:
> Thanks. On Tuesday I set up your monitoring tool on the already
> running tests. Some of the systems were on the verge of showing
> results, but we just had a power outage and I have to restart the
> tests. It will probably take until the Holidays before there are
> any results :(
>
> Roderick
>
> On Tue, Dec 13, 2011 at 9:20 PM, Konrad Rzeszutek Wilk
> <konrad@xxxxxxxxxx> wrote:
>>> > Does this feel like blktap issues? Or is it more likely that something
>>> > else happened which causes blktap and the other things to fail?
>>>
>>> It looks like the interrupts stopped. Perhaps you can run the C code
>>> (attached) on the serial console and see _what_ (or perhaps _when_)
>>> the interrupts stop (and also verify that the timeouts and 'frozen'
>>> happen due to no interrupts being received).
>>
>> The C code is http://darnok.org/xen/read_interrupts.c
>>
>>>
>>> There are a couple of bugs that were discovered in the interrupt code
>>> (that had been there since 2.6.27!) that went to the stable tree - so
>>> they should have been backported to the 2.6.32.x tree - but I honestly don't
>>> recall the names. I can look them up tomorrow.
>>
>> xen: x86_32: do not enable iterrupts when returning from exception in
>> interrupt context (git commit d198d499148a0c64a41b3aba9e7dd43772832b91)
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

