
Re: [Xen-devel] Questions about GPLPV stability tests



On Sun, Dec 11, 2011 at 6:58 PM, Roderick Colenbrander
<thunderbird2k@xxxxxxxxx> wrote:
> On Sun, Dec 11, 2011 at 4:52 AM, Pasi Kärkkäinen <pasik@xxxxxx> wrote:
>> On Fri, Dec 09, 2011 at 02:02:31PM -0800, Roderick Colenbrander wrote:
>>> One interesting observation: this morning another of my stress
>>> machines showed the problem. I never had the problem on this machine
>>> before, and it didn't have any of the new logging / software updates yet.
>>>
>>> The system was in the same state as the other machine I reported
>>> about before. I tried SysRq and other things. While I was about to
>>> reboot the system, a login prompt appeared again on the VGA console. I
>>> don't know whether any of the things I did triggered it or not; anyway,
>>> it means Linux is not really dead. I tried logging in, but I don't even
>>> see characters appearing. The system feels very, very, very slow.
>>>
>>
>> Hmm.. I wonder if this is the same issue I'm sometimes seeing on my laptop..
>> suddenly it starts becoming slower, and after a while it's *very* slow..
>> not dead, but unusably slow..
>>
>> I haven't had time to (try to) debug it..
>>
>> -- Pasi
>>
>
> I'm seeing slowness issues on our systems as well: some code starts
> running really, really slowly, our local TCP 'heartbeat'-like mechanism
> from Dom0 to a DomU times out, and code which should execute quickly
> becomes orders of magnitude slower for no obvious reason. Typically we
> see evidence of this in the logging.
>
> I felt there was a connection between this slowness and the
> 'unresponsive Dom0', but I haven't been able to confirm it. Normally
> we see weird things in our logs, but on the unresponsive systems I
> haven't seen anything strange in the logs yet. Most likely the logs
> weren't synced to disk yet.
>
> After more investigation it seems that in my case all issues happen
> after the DomU is up and we somehow 'start' using it. During
> startup of our own software, both DomU and Dom0 would at least see a
> spike in cpu/disk activity before things settle down a bit. My feeling
> is (based on some logs) that this is when Dom0 sometimes becomes
> unresponsive.
>
> I'm still running tests on a number of machines, but it takes ages to
> get results.
>
> Roderick

Today, yet another of my machines went down. To explain the setup: I'm
running tests across a couple of similar machines. On some of the
machines I updated to the latest 2.6.32.x and Xen 4.1.x (and I upgrade
every machine which goes down). The machine which just went down
wasn't on the latest versions yet, but it did have some extra logging
enabled in 'dmesg'.

This box still had Xen 4.1.2 and 2.6.32.37. I'm not sure how useful
the logging is yet, but it may be of some use. The log output below is
mostly from the serial console. I also looked at some log files on
disk. Some files give clues, but some important log files didn't get
synced to disk before things went bad. The last timestamp I was able
to find was '16:13:55', at which point some of our own software wrote
a line to its logs.
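
As an aside, to avoid losing the last log lines next time, one option is
to make syslog flush every message to disk. This is just a generic
sysklogd/rsyslog-style example, not our actual config: omitting the '-'
prefix in front of the file name makes syslogd sync after every write
(at some performance cost), e.g. in /etc/syslog.conf:

    # no leading '-' before the path => fsync after each message
    kern.*          /var/log/kern.log
    *.info          /var/log/messages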

The few log files which did get synced to disk show that the VM started
around 16:13:02, which matches when the PCIe device was handed over to
the VM:

(XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1501: d0:PCIe: unmap bdf = 3:0.0
(XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1374: d1109:PCIe: map bdf = 3:0.0
(XEN) [2011-12-10 16:13:03] [VT-D]io.c:304: d1109: bind: m_gsi=55
g_gsi=32 device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:328: d1109: unbind: m_gsi=55
g_gsi=32 device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:386: d1109 unmap: m_irq=55
device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:304: d1109: bind: m_gsi=16
g_gsi=32 device=4 intx=0
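
(For context, the map/unmap lines above are what you would expect when
the guest config passes the device through. In an xm/xl-style config
that is just something like the line below; the BDF 03:00.0 is the one
from the log, the rest of the config fragment is hypothetical.)

    # hypothetical domU config fragment
    pci = [ '03:00.0' ]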


We are using the tapdisk driver for disk 'xvdc', which I guess is what
shows up as 'tdc' here. It is a copy-on-write VHD disk which uses a raw
disk image as its backing store. Apparently there are all sorts of read
errors on it; I'm not sure how bad those read errors really are.
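
(For reference, the xvdc disk is attached through blktap2. In an
xm-style config that looks roughly like the line below, depending on the
toolstack version; the path here is made up, only the vhd/xvdc part
reflects our setup.)

    # hypothetical disk line -- blktap2 VHD overlay on a raw backing image
    disk = [ 'tap2:tapdisk:vhd:/path/to/xvdc-overlay.vhd,xvdc,w' ]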

[905276.510737] blkback: ring-ref 15658, event-channel 13, protocol 1
(unspecified, assuming native)
[905276.512128] blkback: ring-ref 15658, event-channel 13, protocol 1
(unspecified, assuming native)
[905277.113215] block tdc: sector-size: 512 capacity: 118785
[905277.215872] blkback: ring-ref 15775, event-channel 14, protocol 1
(unspecified, assuming native)
[905680.196850] end_request: I/O error, dev tdc, sector 45689
[905680.197293] end_request: I/O error, dev tdc, sector 45689
[905680.197624] end_request: I/O error, dev tdc, sector 45688
[905680.197993] end_request: I/O error, dev tdc, sector 45696

More blktap issues roughly two hours later (the dmesg timestamps jump
from about 905680 to 912393 seconds). The real problem must have
happened much earlier, then.

[912393.552881] blkback: ring-ref 15810, event-channel 13, protocol 1
(unspecified, assuming native)
[912394.101667] block tdc: sector-size: 512 capacity: 118785
[912394.384715] blkback: ring-ref 15660, event-channel 14, protocol 1
(unspecified, assuming native)
[912402.433492] BUG: unable to handle kernel NULL pointer dereference
at 00000000000002f0
[912402.434077] IP: [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
[912402.434384] PGD 384bb067 PUD 3a55f067 PMD 0
[912402.434752] Oops: 0000 [#1] SMP
[912402.435096] last sysfs file: /sys/devices/virtual/block/tdc/removable
[912402.435363] CPU 2
[912402.435653] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[912402.436092] Pid: 16994, comm: tapdisk2 Not tainted 2.6.32.37 #1 X8STi
[912402.436595] RIP: e030:[<ffffffff81249ac2>]  [<ffffffff81249ac2>]
blktap_device_end_request+0x4e/0x68
[912402.437117] RSP: e02b:ffff8800616f9d08  EFLAGS: 00010046
[912402.437380] RAX: 0000000000000000 RBX: ffff880014ef29b0 RCX:
0000000000000018
[912402.437884] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff88007fcebd58
[912402.438368] RBP: 0000000000000000 R08: ffffffff816a9488 R09:
ffff8800123af040
[912402.438855] R10: 00000001365d23c7 R11: ffffffff810861e1 R12:
ffff88007bea4e70
[912402.439358] R13: ffff88007b8aa800 R14: 0000000000000000 R15:
ffff880014ef29b0
[912402.439848] FS:  00007fde3bbe2730(0000) GS:ffff880028070000(0000)
knlGS:0000000000000000
[912402.440341] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[912402.440621] CR2: 00000000000002f0 CR3: 000000001f721000 CR4:
0000000000002660
[912402.441102] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[912402.441583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[912402.442083] Process tapdisk2 (pid: 16994, threadinfo
ffff8800616f8000, task ffff88007bab4ec0)
[912402.442569] Stack:
[912402.442825]  ffffea000185d038 0000000000000001 ffff88007bab4ec0
0000000000000000
[912402.443199] <0> ffff88007b8aa800 ffffffff81248c3b 00000000000001ff
0000000000000000
[912402.443871] <0> 00033e583f482159 000000000000e030 ffff88006f6da480
0000039f000003a0
[912402.444793] Call Trace:
[912402.445054]  [<ffffffff81248c3b>] ? blktap_ring_ioctl+0x137/0x228
[912402.445326]  [<ffffffff810bea63>] ? core_sys_select+0x1ee/0x21e
[912402.445599]  [<ffffffff810dc64f>] ? really_put_req+0x62/0x8f
[912402.445871]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
[912402.446159]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
[912402.446434]  [<ffffffff810dc4e4>] ? aio_read_evt+0x26/0xe4
[912402.446704]  [<ffffffff810dd71a>] ? sys_io_getevents+0x10e/0x356
[912402.446984]  [<ffffffff810bcc47>] ? vfs_ioctl+0x56/0x6c
[912402.447255]  [<ffffffff810bd133>] ? do_vfs_ioctl+0x460/0x49e
[912402.447543]  [<ffffffff81059699>] ? getnstimeofday+0x55/0xaf
[912402.447814]  [<ffffffff810bd1c2>] ? sys_ioctl+0x51/0x70
[912402.448083]  [<ffffffff81011a02>] ? system_call_fastpath+0x16/0x1b
[912402.448351] Code: e8 a5 f5 ff ff 49 8b 44 24 40 48 8b b8 f0 02 00
00 e8 a5 37 27 00 41 8b 54 24 60 44 89 f6 4c 89 e7 e8 85 13 f8 ff 49
8b 44 24 40 <48> 8b 80 f0 02 00 00 fe 00 ff 14 25 28 0d 6a 81 5b 5b 41
5c 41
[912402.451398] RIP  [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
[912402.451733]  RSP <ffff8800616f9d08>
[912402.451997] CR2: 00000000000002f0
[912402.452265] ---[ end trace f20c5f7e976fe78b ]---
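
For what it's worth, the faulting instruction in the Code: line
(<48> 8b 80 f0 02 00 00, i.e. mov rax,[rax+0x2f0]) together with RAX=0
and CR2=0x2f0 looks like a read of a struct member at offset 0x2f0
through a NULL pointer. Purely as an illustration of the pattern (this
is NOT the actual blktap2 source), code of this shape would fault in
exactly that way:

    /* Hypothetical sketch only -- not blktap code. With 'req' == NULL,
     * reading a member that happens to live at offset 0x2f0 faults with
     * CR2 = 0x2f0, matching the oops above. */
    struct fake_request {
            char  pad[0x2f0];
            void *queue;            /* member at offset 0x2f0 */
    };

    static void end_request_like(struct fake_request *req)
    {
            void *q = req->queue;   /* compiles to mov rax,[rax+0x2f0] */
            (void)q;
    }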

Half a minute later I get tons of these:
[912432.458625] ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
[912432.458893] ata6.00: failed command: READ FPDMA QUEUED
[912432.459172] ata6.00: cmd 60/08:00:f2:13:e3/00:00:0d:00:00/40 tag 0
ncq 4096 in

And tons of these:
[912478.690536] sd 5:0:0:0: [sda] Unhandled error code
[912478.690798] sd 5:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET
driverbyte=DRIVER_OK
[912478.691306] sd 5:0:0:0: [sda] CDB: Read(10): 28 00 00 e6 95 05 00 00 68 00

I wonder whether we are having real SATA issues, or whether this is
somehow caused by the 'real' problem, which would be tapdisk.
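
(One way to separate the two afterwards might be to check the drive's
SMART data once the box is back up, e.g. with smartmontools; the device
name is just an example:

    smartctl -a /dev/sda

If the drive itself logged errors there, the SATA side is probably real
rather than collateral damage.)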

Finally I also get timeouts in the network code. I think these are
just a consequence of the bad things already happening.
[912737.470673] WARNING: at net/sched/sch_generic.c:261
dev_watchdog+0xd2/0x16e()
[912737.471170] Hardware name: X8STi
[912737.471426] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[912737.471686] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[912737.472106] Pid: 0, comm: swapper Tainted: G      D    2.6.32.37 #1
[912737.472586] Call Trace:
[912737.472838]  <IRQ>  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.473130]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.473391]  [<ffffffff81040fb1>] ? warn_slowpath_common+0x77/0xa3
[912737.473668]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
[912737.473928]  [<ffffffff81041039 > ] ? warn_slowpk
_dev_program_evt+ 0xe3/0x18c
[90/0x60
[912737.474988]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
[912737.480706]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
[912737.480965]  [<ffffffff813c8d15>] ? netif_tx_lock+0x3d/0x68
[912737.481222]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
[912737.481481]  [<ffffffff813b83a0>] ? netdev_drivername+0x3b/0x40
[912737.481741]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.482015]  [<ffffffff81049614>] ? run_timer_softirq+0x16c/0x1d9
[912737.482274]  [<ffffffff81078fb8>] ? handle_percpu_irq+0x4e/0x6a
[912737.482534]  [<ffffffff81045bc8>] ? __do_softirq+0x91/0x116
[912737.482794]  [<ffffffff81012b6c>] ? call_softirq+0x1c/0x30
[912737.483052]  [<ffffffff810140e7>] ? do_softirq+0x3f/0x7c
[912737.483324]  [<ffffffff81045a64>] ? irq_exit+0x36/0x76
[912737.483584]  [<ffffffff8123b6ad>] ? xen_evtchn_do_upcall+0x33/0x42
[912737.483843]  [<ffffffff81012bbe>] ? xen_do_hypervisor_callback+0x1e/0x30
[912737.484101]  <EOI>  [<ffffffff8100eedf>] ? xen_restore_fl_direct_end+0x0/0x1
[912737.484394]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
[912737.484670]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
[912737.484930]  [<ffffffff8100e8e3>] ? xen_safe_halt+0xc/0x15
[912737.485188]  [<ffffffff81017f07>] ? default_idle+0x21/0x3d
[912737.485447]  [<ffffffff81010df1>] ? cpu_idle+0x59/0x91
[912737.485704] ---[ end trace f20c5f7e976fe78c ]---


Does this feel like a blktap issue? Or is it more likely that something
else happened which caused blktap and the other things to fail?

Roderick

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

