[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]


  • To: "Jeremy Fitzhardinge" <jeremy@xxxxxxxx>
  • From: "James Harper" <james.harper@xxxxxxxxxxxxxxxx>
  • Date: Fri, 3 Sep 2010 10:35:34 +1000
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 02 Sep 2010 17:36:26 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: ActK/q882zvhES+FSZ6IH+PETbgLzAAAAuyw
  • Thread-topic: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]

> 
>  On 09/02/2010 05:20 PM, James Harper wrote:
> > I've had "BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]"
happen
> > twice now and the server fairly quickly (over the course of 5-10
> > minutes) becomes unusable after that happens. The call trace appears
to
> > be:
> >
> > Call Trace:
> >  [<ffffffff8100e8e9>] ? xen_force_evtchn_callback+0x9/0xa
> >  [<ffffffff8100ef72>] ? check_events+0x12/0x20
> >  [<ffffffff8100ef19>] ? xen_irq_enable_direct_end+0x0/0x7
> >  [<ffffffff8102fcc1>] ? do_page_fault+0x9e/0x27b
> >  [<ffffffff812b39a5>] ? page_fault+0x25/0x30
> >  [<ffffffff8116f28d>] ? __put_user_4+0x1d/0x30
> >  [<ffffffff81045265>] ? schedule_tail+0x92/0x96
> >  [<ffffffff81011983>] ? ret_from_fork+0x13/0x80
> >
> > That was under 2.6.32.17-g69a73fa. I've since upgraded to
> > 2.6.32.18-ge6b9b2c but the changelogs don't show anything obviously
> > related to the crash (but as usual there are a lot of them so maybe
I've
> > missed something?)
> >
> > Is this a known problem under 2.6.32.17-g69a73fa that has since been
> > fixed?
> 
> Hard to know; nothing springs to mind right now.  What else is going
on
> at the time?  What's the full softlockup message?  Is there any more
> context?

I just noticed that it coincided pretty much exactly when a backup runs.
I take a snapshot in Dom0 then block-attach it to the linux DomU so it
can be backed up from DomU which makes restores easier etc, rather than
doing the backup in Dom0. I hadn't noticed before because it's the
Windows DomU that people complain about first, not the linux DomU.
Block-attach/detach would explain why udevd is involved too which makes
more sense. The block-attach happens every hour on the hour. The full
cut&paste of the first hang is:

Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
1442, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
CPU#1 stuck for 61s! [udevd:4865]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULO
G xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc n
f_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback
blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_c
ore i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd id
e_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULO
G xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc n
f_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback
blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_c
ore i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd id
e_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
hypercall_page+0x22a/0x1001
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
e02b:ffff8800114d5dd0  EFLAGS: 00000246
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
RBX: 0000000000000001 RCX: ffffffff8100922a
keechsvr2:/usr/local/src/linux-2.6-xen# head -100 /var/log/kern.log
Sep  3 07:00:44 keechsvr2 kernel: [228694.767691] blkback: ring-ref 311,
event-channel 13, protocol 1 (x86_64-abi)
Sep  3 08:00:42 keechsvr2 kernel: [232293.072248] blkback: ring-ref
1334, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
1442, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
CPU#1 stuck for 61s! [udevd:4865]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
bnx2 ehci_hcd ide_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
bnx2 ehci_hcd ide_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
hypercall_page+0x22a/0x1001
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
e02b:ffff8800114d5dd0  EFLAGS: 00000246
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
RBX: 0000000000000001 RCX: ffffffff8100922a
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RDX: ffffffff814293d0
RSI: 0000000000000000 RDI: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RBP: 00007fbc628a9800
R08: 0000000000000000 R09: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R10: 0000000000000002
R11: 0000000000000246 R12: 0000000000000002
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R13: ffff8800114d5e88
R14: ffff880017f9cfc0 R15: ffff880017eed500
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] FS:
00007fbc628a9770(0000) GS:ffff880002de8000(0000) knlGS:0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CS:  e033 DS: 0000 ES:
0000 CR0: 000000008005003b
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CR2: 00007fbc628a9800
CR3: 000000001e2d7000 CR4: 0000000000002660
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Call Trace:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100e8e9>]
? xen_force_evtchn_callback+0x9/0xa
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef72>]
? check_events+0x12/0x20
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef19>]
? xen_irq_enable_direct_end+0x0/0x7
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8102fcc1>]
? do_page_fault+0x9e/0x27b
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff812b39a5>]
? page_fault+0x25/0x30
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8116f28d>]
? __put_user_4+0x1d/0x30
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81045265>]
? schedule_tail+0x92/0x96
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81011983>]
? ret_from_fork+0x13/0x80

There's another hang at 09:03:09 which I can post if you'd find it
useful.

If the block-attach happened at 09:00:42 and the crash happened at
09:02:13 then I assume that whatever udevd started doing started at
09:01:12 (eg 61 seconds earlier) which means that it's more likely to be
the block-detach at the end of the backup causing the hang not the
block-attach. The backup runs pretty quick.

I'm inclined to turn off the snapshot backup if this isn't a known and
resolved problem...

Thanks

James


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.