
Re: [Xen-devel] Questions about GPLPV stability tests



On Sun, Dec 11, 2011 at 6:58 PM, Roderick Colenbrander
<thunderbird2k@xxxxxxxxx> wrote:
> On Sun, Dec 11, 2011 at 4:52 AM, Pasi Kärkkäinen <pasik@xxxxxx> wrote:
>> On Fri, Dec 09, 2011 at 02:02:31PM -0800, Roderick Colenbrander wrote:
>>> One interesting observation: this morning another of my stress
>>> machines showed the problem. I never had the problem on this machine
>>> before, and it didn't have any of the new logging / software updates yet.
>>>
>>> The system was in the same state as the other machine I reported
>>> about before. I tried SysRq and other things. While I was about to
>>> reboot the system, a login prompt appeared again on the VGA console. I
>>> don't know whether any of the things I did triggered it or not; anyway,
>>> it means Linux is not really dead. I tried logging in, but I don't even
>>> see characters appearing. The system feels very, very, very slow.
>>>
>>
>> Hmm.. I wonder if this is the same issue I'm sometimes seeing on my laptop..
>> suddenly it starts becoming slower, and after a while it's *very* slow..
>> not dead, but unusably slow..
>>
>> I haven't had time to (try to) debug it..
>>
>> -- Pasi
>>
>
> I'm seeing slowness issues on our systems as well: some code starts
> running really, really slowly, our local TCP 'heartbeat'-like mechanism
> from Dom0 to a DomU times out, and code which should execute quickly
> becomes orders of magnitude slower for no obvious reason. Typically we
> see evidence of this in the logging.
>
> I felt there was a connection between this slowness and the
> 'unresponsive Dom0', but I haven't been able to confirm it. Normally
> we see weird things in our logs, but on the unresponsive systems I
> haven't seen anything strange in the logs yet. Most likely the logs
> weren't synced to disk yet.
>
> After more investigation it seems that in my case all issues happen
> after the DomU is up and we somehow 'start' using it. During
> startup of our own software, both DomU and Dom0 would at least see a
> spike in cpu/disk activity before things settle down a bit. My feeling
> is (based on some logs) that this is when Dom0 sometimes becomes
> unresponsive.
>
> I'm still running tests on a number of machines, but it takes ages to
> get results.
>
> Roderick

Today, yet another of my machines went down. To explain the setup: I'm
running tests across a couple of similar machines. On some of the
machines I updated to the latest 2.6.32.x and Xen 4.1.x (and I upgrade
every machine which goes down). The machine which just went down
wasn't on the latest versions yet, but it did have some extra logging
enabled in 'dmesg'.

This box still had Xen 4.1.2 and 2.6.32.37. I'm not sure how useful
the logging is yet, but it may be of some use. The log output below is
mostly from the serial console. I also looked at some log files on
disk. Some files give clues, but some important log files didn't get
synced to disk before things went bad. The last timestamp I was able
to find was '16:13:55', at which point some of our own software wrote
a line to its logs.
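
As an aside, to avoid losing the last log lines next time, one option is
to make syslog flush every message to disk. This is just a generic
sysklogd/rsyslog-style example, not our actual config: omitting the '-'
prefix in front of the file name makes syslogd sync after every write
(at some performance cost), e.g. in /etc/syslog.conf:

    # no leading '-' before the path => fsync after each message
    kern.*          /var/log/kern.log
    *.info          /var/log/messages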

The few log files which did get synced to disk show that the VM started
around 16:13:02, which matches when the PCIe device was handed over to
the VM:

(XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1501: d0:PCIe: unmap bdf = 3:0.0
(XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1374: d1109:PCIe: map bdf = 3:0.0
(XEN) [2011-12-10 16:13:03] [VT-D]io.c:304: d1109: bind: m_gsi=55
g_gsi=32 device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:328: d1109: unbind: m_gsi=55
g_gsi=32 device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:386: d1109 unmap: m_irq=55
device=4 intx=0
(XEN) [2011-12-10 16:13:23] [VT-D]io.c:304: d1109: bind: m_gsi=16
g_gsi=32 device=4 intx=0
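
(For context, the map/unmap lines above are what you would expect when
the guest config passes the device through. In an xm/xl-style config
that is just something like the line below; the BDF 03:00.0 is the one
from the log, the rest of the config fragment is hypothetical.)

    # hypothetical domU config fragment
    pci = [ '03:00.0' ]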


We are using the tapdisk driver for disk 'xvdc', which I guess is what
shows up as 'tdc' here. It is a copy-on-write VHD disk which uses a raw
disk image as its backing store. Apparently there are all sorts of read
errors on it; I'm not sure how bad those read errors really are.
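
(For reference, the xvdc disk is attached through blktap2. In an
xm-style config that looks roughly like the line below, depending on the
toolstack version; the path here is made up, only the vhd/xvdc part
reflects our setup.)

    # hypothetical disk line -- blktap2 VHD overlay on a raw backing image
    disk = [ 'tap2:tapdisk:vhd:/path/to/xvdc-overlay.vhd,xvdc,w' ]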

[905276.510737] blkback: ring-ref 15658, event-channel 13, protocol 1
(unspecified, assuming native)
[905276.512128] blkback: ring-ref 15658, event-channel 13, protocol 1
(unspecified, assuming native)
[905277.113215] block tdc: sector-size: 512 capacity: 118785
[905277.215872] blkback: ring-ref 15775, event-channel 14, protocol 1
(unspecified, assuming native)
[905680.196850] end_request: I/O error, dev tdc, sector 45689
[905680.197293] end_request: I/O error, dev tdc, sector 45689
[905680.197624] end_request: I/O error, dev tdc, sector 45688
[905680.197993] end_request: I/O error, dev tdc, sector 45696

More blktap issues roughly two hours later (the dmesg timestamps jump
from about 905680 to 912393 seconds). The real problem must have
happened much earlier, then.

[912393.552881] blkback: ring-ref 15810, event-channel 13, protocol 1
(unspecified, assuming native)
[912394.101667] block tdc: sector-size: 512 capacity: 118785
[912394.384715] blkback: ring-ref 15660, event-channel 14, protocol 1
(unspecified, assuming native)
[912402.433492] BUG: unable to handle kernel NULL pointer dereference
at 00000000000002f0
[912402.434077] IP: [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
[912402.434384] PGD 384bb067 PUD 3a55f067 PMD 0
[912402.434752] Oops: 0000 [#1] SMP
[912402.435096] last sysfs file: /sys/devices/virtual/block/tdc/removable
[912402.435363] CPU 2
[912402.435653] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[912402.436092] Pid: 16994, comm: tapdisk2 Not tainted 2.6.32.37 #1 X8STi
[912402.436595] RIP: e030:[<ffffffff81249ac2>]  [<ffffffff81249ac2>]
blktap_device_end_request+0x4e/0x68
[912402.437117] RSP: e02b:ffff8800616f9d08  EFLAGS: 00010046
[912402.437380] RAX: 0000000000000000 RBX: ffff880014ef29b0 RCX:
0000000000000018
[912402.437884] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff88007fcebd58
[912402.438368] RBP: 0000000000000000 R08: ffffffff816a9488 R09:
ffff8800123af040
[912402.438855] R10: 00000001365d23c7 R11: ffffffff810861e1 R12:
ffff88007bea4e70
[912402.439358] R13: ffff88007b8aa800 R14: 0000000000000000 R15:
ffff880014ef29b0
[912402.439848] FS:  00007fde3bbe2730(0000) GS:ffff880028070000(0000)
knlGS:0000000000000000
[912402.440341] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[912402.440621] CR2: 00000000000002f0 CR3: 000000001f721000 CR4:
0000000000002660
[912402.441102] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[912402.441583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[912402.442083] Process tapdisk2 (pid: 16994, threadinfo
ffff8800616f8000, task ffff88007bab4ec0)
[912402.442569] Stack:
[912402.442825]  ffffea000185d038 0000000000000001 ffff88007bab4ec0
0000000000000000
[912402.443199] <0> ffff88007b8aa800 ffffffff81248c3b 00000000000001ff
0000000000000000
[912402.443871] <0> 00033e583f482159 000000000000e030 ffff88006f6da480
0000039f000003a0
[912402.444793] Call Trace:
[912402.445054]  [<ffffffff81248c3b>] ? blktap_ring_ioctl+0x137/0x228
[912402.445326]  [<ffffffff810bea63>] ? core_sys_select+0x1ee/0x21e
[912402.445599]  [<ffffffff810dc64f>] ? really_put_req+0x62/0x8f
[912402.445871]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
[912402.446159]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
[912402.446434]  [<ffffffff810dc4e4>] ? aio_read_evt+0x26/0xe4
[912402.446704]  [<ffffffff810dd71a>] ? sys_io_getevents+0x10e/0x356
[912402.446984]  [<ffffffff810bcc47>] ? vfs_ioctl+0x56/0x6c
[912402.447255]  [<ffffffff810bd133>] ? do_vfs_ioctl+0x460/0x49e
[912402.447543]  [<ffffffff81059699>] ? getnstimeofday+0x55/0xaf
[912402.447814]  [<ffffffff810bd1c2>] ? sys_ioctl+0x51/0x70
[912402.448083]  [<ffffffff81011a02>] ? system_call_fastpath+0x16/0x1b
[912402.448351] Code: e8 a5 f5 ff ff 49 8b 44 24 40 48 8b b8 f0 02 00
00 e8 a5 37 27 00 41 8b 54 24 60 44 89 f6 4c 89 e7 e8 85 13 f8 ff 49
8b 44 24 40 <48> 8b 80 f0 02 00 00 fe 00 ff 14 25 28 0d 6a 81 5b 5b 41
5c 41
[912402.451398] RIP  [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
[912402.451733]  RSP <ffff8800616f9d08>
[912402.451997] CR2: 00000000000002f0
[912402.452265] ---[ end trace f20c5f7e976fe78b ]---
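
For what it's worth, the faulting instruction in the Code: line
(<48> 8b 80 f0 02 00 00, i.e. mov rax,[rax+0x2f0]) together with RAX=0
and CR2=0x2f0 looks like a read of a struct member at offset 0x2f0
through a NULL pointer. Purely as an illustration of the pattern (this
is NOT the actual blktap2 source), code of this shape would fault in
exactly that way:

    /* Hypothetical sketch only -- not blktap code. With 'req' == NULL,
     * reading a member that happens to live at offset 0x2f0 faults with
     * CR2 = 0x2f0, matching the oops above. */
    struct fake_request {
            char  pad[0x2f0];
            void *queue;            /* member at offset 0x2f0 */
    };

    static void end_request_like(struct fake_request *req)
    {
            void *q = req->queue;   /* compiles to mov rax,[rax+0x2f0] */
            (void)q;
    }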

Half a minute later I get tons of these:
[912432.458625] ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
[912432.458893] ata6.00: failed command: READ FPDMA QUEUED
[912432.459172] ata6.00: cmd 60/08:00:f2:13:e3/00:00:0d:00:00/40 tag 0
ncq 4096 in

And tons of these:
[912478.690536] sd 5:0:0:0: [sda] Unhandled error code
[912478.690798] sd 5:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET
driverbyte=DRIVER_OK
[912478.691306] sd 5:0:0:0: [sda] CDB: Read(10): 28 00 00 e6 95 05 00 00 68 00

I wonder whether we are having real SATA issues, or whether this is
somehow caused by the 'real' problem, which would be tapdisk.
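
(One way to separate the two afterwards might be to check the drive's
SMART data once the box is back up, e.g. with smartmontools; the device
name is just an example:

    smartctl -a /dev/sda

If the drive itself logged errors there, the SATA side is probably real
rather than collateral damage.)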

Finally I also get timeouts in the network code. I think these are
just a consequence of the bad things already happening.
[912737.470673] WARNING: at net/sched/sch_generic.c:261
dev_watchdog+0xd2/0x16e()
[912737.471170] Hardware name: X8STi
[912737.471426] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[912737.471686] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[912737.472106] Pid: 0, comm: swapper Tainted: G      D    2.6.32.37 #1
[912737.472586] Call Trace:
[912737.472838]  <IRQ>  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.473130]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.473391]  [<ffffffff81040fb1>] ? warn_slowpath_common+0x77/0xa3
[912737.473668]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
[912737.473928]  [<ffffffff81041039 > ] ? warn_slowpk
_dev_program_evt+ 0xe3/0x18c
[90/0x60
[912737.474988]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
[912737.480706]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
[912737.480965]  [<ffffffff813c8d15>] ? netif_tx_lock+0x3d/0x68
[912737.481222]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
[912737.481481]  [<ffffffff813b83a0>] ? netdev_drivername+0x3b/0x40
[912737.481741]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
[912737.482015]  [<ffffffff81049614>] ? run_timer_softirq+0x16c/0x1d9
[912737.482274]  [<ffffffff81078fb8>] ? handle_percpu_irq+0x4e/0x6a
[912737.482534]  [<ffffffff81045bc8>] ? __do_softirq+0x91/0x116
[912737.482794]  [<ffffffff81012b6c>] ? call_softirq+0x1c/0x30
[912737.483052]  [<ffffffff810140e7>] ? do_softirq+0x3f/0x7c
[912737.483324]  [<ffffffff81045a64>] ? irq_exit+0x36/0x76
[912737.483584]  [<ffffffff8123b6ad>] ? xen_evtchn_do_upcall+0x33/0x42
[912737.483843]  [<ffffffff81012bbe>] ? xen_do_hypervisor_callback+0x1e/0x30
[912737.484101]  <EOI>  [<ffffffff8100eedf>] ? xen_restore_fl_direct_end+0x0/0x1
[912737.484394]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
[912737.484670]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
[912737.484930]  [<ffffffff8100e8e3>] ? xen_safe_halt+0xc/0x15
[912737.485188]  [<ffffffff81017f07>] ? default_idle+0x21/0x3d
[912737.485447]  [<ffffffff81010df1>] ? cpu_idle+0x59/0x91
[912737.485704] ---[ end trace f20c5f7e976fe78c ]---


Does this feel like a blktap issue? Or is it more likely that something
else happened which caused blktap and the other things to fail?

Roderick

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

