WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
xen-users

To: Todd Deshane <todd.deshane@xxxxxxx>
Subject: Re: [Xen-users] Dom0 Locked up for 4 hours "BUG: soft lockup - CPU#3 stuck for 61s!"
From: Javier Frias <jfrias@xxxxxxxxx>
Date: Tue, 29 Mar 2011 09:35:25 -0400
Cc: xen-users@xxxxxxxxxxxxxxxxxxx
I never saw this reply, sorry for the delay. Answers inline (I'm still
seeing the issue).

On Sat, Feb 26, 2011 at 2:50 PM, Todd Deshane <todd.deshane@xxxxxxx> wrote:
> On Sat, Feb 26, 2011 at 12:11 AM, Javier Frias <jfrias@xxxxxxxxx> wrote:
>> I posted a bug about this, but figured I'd ask the mailing list to see
>> if someone had seen this.
>> Bugzilla: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1746
>>
>> Basically, I had a dom0, after 57 days of non issues, lock up for 4
>> hours, completely unresponsive, and then recovered. The domU's were
>> unaffected except for the fact that I could not shut them down. (
>> since dom0 was unresponsive ). Although I was able to gain access via
>> xapi/xencenter, and I atleast had some access ( console, status, etc,
>> all worked via xapi).
>>
>
> Could you clarify this explanation a bit. What access was not
> available for 4 hours?
>

The dom0 was so loaded that ssh and any other services running on it
(snmp, for one) were simply unavailable. It was swapping and thoroughly
overloaded. I think this was due to the high I/O being done by one of
the guests, since I was able to log in to the host as one of the
events happened, and saw this in top:

Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.4%us,  0.0%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.8%st
Mem:    771328k total,   747572k used,    23756k free,   139952k buffers
Swap:   524280k total,     5440k used,   518840k free,   342188k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
15715 root      20   0  3796 2388 1868 S 9293.8  0.3 686893:40 tapdisk2
24367 root      20   0  4128 2720 1896 S 8004.8  0.4 553094:39 tapdisk2
 3133 root      20   0  3928 2520 1868 S 5264.2  0.3 695773:20 tapdisk2
26586 root      20   0  4924 3516 1868 S 1370.3  0.5 450796:40 tapdisk2
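
In case it helps anyone comparing notes, here is a rough Python sketch of
how the process rows above could be parsed and sanity-checked. The function
names (parse_top_rows, implausible) are made up for illustration, and the
4-vcpu figure assumes the default dom0 allocation. Since top reports at
most 100% per CPU, anything above 100 x vcpus cannot be real usage, so
readings like these probably reflect skewed accounting rather than actual
load.

```python
# Sketch only: parse pasted `top` process rows and flag impossible %CPU values.
# The column layout is assumed to match the snapshot in this message.
TOP_ROWS = """\
15715 root      20   0  3796 2388 1868 S 9293.8  0.3 686893:40 tapdisk2
24367 root      20   0  4128 2720 1896 S 8004.8  0.4 553094:39 tapdisk2
3133 root      20   0  3928 2520 1868 S 5264.2  0.3 695773:20 tapdisk2
26586 root      20   0  4924 3516 1868 S 1370.3  0.5 450796:40 tapdisk2
"""


def parse_top_rows(text):
    """Return one dict per process row, sorted by %CPU (highest first)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A process row has at least 12 columns and starts with a numeric PID.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        rows.append({
            "pid": int(fields[0]),
            "cpu_pct": float(fields[8]),
            "mem_pct": float(fields[9]),
            "command": " ".join(fields[11:]),
        })
    return sorted(rows, key=lambda r: r["cpu_pct"], reverse=True)


def implausible(rows, vcpus):
    """Rows whose %CPU exceeds what `vcpus` CPUs could possibly deliver."""
    limit = 100.0 * vcpus
    return [r for r in rows if r["cpu_pct"] > limit]


if __name__ == "__main__":
    rows = parse_top_rows(TOP_ROWS)
    for r in implausible(rows, vcpus=4):  # 4 vcpus: assumed dom0 default
        print(r["pid"], r["command"], r["cpu_pct"])
```

Every tapdisk2 row in the snapshot is over the 400% ceiling for 4 vcpus,
so I would not read those numbers as literal CPU consumption.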


Everywhere I read, people say tapdisk2 is far more CPU-intensive than
any other driver. Is there a way to use raw LVM in XCP? In our case, I
think that would be the best choice, since we have a beefy storage subsystem.


> You say you could access via xapi/xencenter was this after the 4 hours
> or during?
>
Oddly enough, during. That was puzzling, since every other service was
affected by the high load and swapping going on in the host. Things
like shutting down a host did not work, though; it seemed that only
read-only operations (like verifying VM running state and params)
worked via XenCenter or by hitting the API directly.

> Did you happen to look at the guest performance during those times?
> Was one of the guest doing a lot of disk I/O? Could you give some more
> information as to how the guests access their virtual disks (local,
> NFS, iSCSI, etc.) and any other information about your setup that
> could give us hints as to what might have caused this.

Yes, absolutely. Two VMs on the host that locked up have what would
be considered high I/O characteristics (one does lots of small-file
I/O, and the other just has large files being appended to).

My hardware looks like the following (I use no shared storage):

Dell R710
72GB RAM
2 x X5650 @ 2.67GHz (12 physical cores, 12 additional threads)
6 x 600GB 15K disks in RAID 10
Dell H700 RAID controller (512MB version)

So the hardware should handle the I/O being done by the VMs with no problem.

The dom0 has the default CPU and RAM allocation (768MB and 4 vcpus).

Any help is greatly appreciated.

Also, here's the kernel log from a new VM as it went nuts (seems related):

===dmesg====

[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
[6954775.046776] Modules linked in: xenfs lp parport
[6954775.046784] CPU 2
[6954775.046786] Modules linked in: xenfs lp parport
[6954775.046793]
[6954775.046796] Pid: 20139, comm: apache2 Tainted: G      D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954775.046802] RIP: e030:[<ffffffff812526a5>]  [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954775.046811] RSP: e02b:ffff8800fb0fbcf8  EFLAGS: 00000246
[6954775.046815] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff8800fb0fbfd8
[6954775.046820] RDX: 0000000000000000 RSI: ffff8800eeb744a0 RDI: 00000000ffffffff
[6954775.046825] RBP: ffff8800fb0fbf68 R08: 0000000000000000 R09: 0000000000000000
[6954775.046830] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954775.046835] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954775.046843] FS:  00007f3943fd2740(0000) GS:ffff880003e76000(0000) knlGS:0000000000000000
[6954775.046848] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954775.046852] CR2: 00007f9da4c3b000 CR3: 00000000fa0b3000 CR4: 0000000000002660
[6954775.046857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954775.046863] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954775.046868] Process apache2 (pid: 20139, threadinfo ffff8800fb0fa000, task ffff8800fa9496e0)
[6954775.046874] Stack:
[6954775.046876]  ffff8800ffc39400 ffff8800fb0fbf28 ffff8800fa9496e0 ffffffff81a514a8
[6954775.046884] <0> ffff8800fa4ec060 0000000000000000 00000001810072d2 ffff8800fb0fbd48
[6954775.046893] <0> ffff8800fae5ee50 ffff8800fb0fbd48 ffff1000ffff0000 ffff8800fa9402e0
[6954775.046904] Call Trace:
[6954775.046909]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046916]  [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.046921]  [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.046927]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046932]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046938]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046943]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046949]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046954]  [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.046959]  [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.046965]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.046970]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954775.046974] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954775.047036] Call Trace:
[6954775.047040]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047045]  [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.047050]  [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.047055]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047061]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047066]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047071]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047077]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047082]  [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.047087]  [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.047092]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.047097]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
[6954777.197949] Modules linked in: xenfs lp parport
[6954777.197959] CPU 3
[6954777.197961] Modules linked in: xenfs lp parport
[6954777.197969]
[6954777.197973] Pid: 20145, comm: apache2 Tainted: G      D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954777.197979] RIP: e030:[<ffffffff812526a5>]  [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954777.197993] RSP: e02b:ffff880048ed3cf8  EFLAGS: 00000246
[6954777.197997] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff880048ed3fd8
[6954777.198002] RDX: 0000000000000000 RSI: ffff8800032d16e0 RDI: 00000000ffffffff
[6954777.198007] RBP: ffff880048ed3f68 R08: 0000000000000000 R09: 0000000000000000
[6954777.198012] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954777.198017] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954777.198027] FS:  00007f3943fd2740(0000) GS:ffff880003e94000(0000) knlGS:0000000000000000
[6954777.198032] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954777.198036] CR2: 00007f393dc39030 CR3: 00000000faf56000 CR4: 0000000000002660
[6954777.198042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954777.198047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954777.198052] Process apache2 (pid: 20145, threadinfo ffff880048ed2000, task ffff8800fb16c4a0)
[6954777.198057] Stack:
[6954777.198060]  0000000000000293 ffff880048ed3f28 ffff8800fb16c4a0 ffffffff81a514a8
[6954777.198068] <0> ffff8800fa4ec8a0 0000000000000000 0000000148ed3dd8 ffff880048ed3d48
[6954777.198077] <0> ffff8800fae5ee50 ffff880048ed3d48 ffff1000ffff0000 ffff8800fb2d4480
[6954777.198088] Call Trace:
[6954777.198097]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198104]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198109]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198115]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198123]  [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198129]  [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198137]  [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198142]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198148]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.198152] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954777.198218] Call Trace:
[6954777.198223]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198229]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198234]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198239]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198245]  [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198251]  [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198256]  [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198261]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198267]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
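
For anyone wanting to check their own logs for the same pattern, below is
a minimal Python sketch that pulls the soft lockup events out of dmesg text
and counts them per process name. The helper names (find_lockups,
lockups_per_command) are hypothetical, and the regular expression only
assumes the "BUG: soft lockup" line format shown in the log above; other
kernel versions may word it differently.

```python
# Sketch only: extract "BUG: soft lockup" events from dmesg text.
# The line format is assumed to match the log in this message.
import re
from collections import Counter

LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

SAMPLE = """\
[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
[6954775.046776] Modules linked in: xenfs lp parport
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
"""


def find_lockups(log_text):
    """Return a list of dicts, one per soft lockup line, in log order."""
    events = []
    for m in LOCKUP_RE.finditer(log_text):
        events.append({
            "timestamp": float(m.group("ts")),  # seconds since boot
            "cpu": int(m.group("cpu")),
            "stuck_secs": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        })
    return events


def lockups_per_command(events):
    """Count lockup events by process name."""
    return Counter(e["comm"] for e in events)


if __name__ == "__main__":
    events = find_lockups(SAMPLE)
    for e in events:
        print(e["cpu"], e["comm"], e["pid"], e["stuck_secs"])
    print(lockups_per_command(events))
```

Feeding it the full output of dmesg instead of SAMPLE should work the
same way, as long as the message format matches.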
