xen-devel

[Xen-devel] Re: blocking Xen 3.X production use: soft lockup bugs

To: Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
Subject: [Xen-devel] Re: blocking Xen 3.X production use: soft lockup bugs
From: Steve Traugott <stevegt@xxxxxxxxxxxxx>
Date: Wed, 2 Aug 2006 15:48:03 -0700
Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 02 Aug 2006 15:48:49 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <20060802205449.GA17411@xxxxxxxxxxxxx>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
References: <20060802205449.GA17411@xxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.9i
Here are some examples of the sort of soft lockups I'm seeing -- I
can't say right now if they've all been showing the same stack trace,
but I'll keep an eye on that from now on.  I know they haven't all
been on the same CPU.  Anything else anyone needs, just let me know --
and I'd like to reaffirm my earlier offer of access to one of these
machines.  

I'm also starting to think a XenSource wiki page "how to
report/workaround soft lockups" might be in order; I suspect many of
the bug reports (including my own) haven't been detailed enough to
differentiate between the various things that can cause soft lockups.

This was on an IBM x330.

Steve

n4h34:~# xm create -c /etc/xen/auto/build2.t7a.org
Using config file "/etc/xen/auto/build2.t7a.org".
Started domain build2.t7a.org
Linux version 2.6.16.13-xen (root@n4h33) (gcc version 3.3.5 (Debian 1:3.3.5-12)) #2 SMP Sun Jun 11 14:25:16 PDT 2006
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000008000000 (usable)
0MB HIGHMEM available.
136MB LOWMEM available.
ACPI in unprivileged domain disabled
IRQ lockup detection disabled
Built 1 zonelists
Kernel command line:  root=/dev/sda1 2
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 16384 bytes)
Xen reported: 1130.113 MHz processor.
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB disabled
vmalloc area: c9000000-fb7fe000, maxmem 33ffe000
Memory: 114612k/139264k available (3368k kernel code, 16308k reserved, 1033k data, 196k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 2261.96 BogoMIPS (lpj=11309833)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Checking 'hlt' instruction... OK.
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 9535k freed
Grant table initialized
NET: Registered protocol family 16
Brought up 1 CPUs
PCI: setting up Xen PCI frontend stub
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
xen_mem: Initialising balloon driver.
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: System does not support PCI
PCI: System does not support PCI
IA-32 Microcode Update Driver: v1.14-xen <tigran@xxxxxxxxxxx>
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
JFS: nTxBlock = 1024, nTxLock = 8192
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
PNP: No PS/2 controller found. Probing ports directly.
i8042.c: No controller found.
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Xen virtual console successfully installed as tty1
Event-channel device installed.
blkif_init: reqs=64, pages=704, mmap_vstart=0xc7400000
netfront: Initialising virtual ethernet driver.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
Registering block device major 8
ide-floppy driver 0.99.newide
Fusion MPT base driver 3.03.07
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.07
Fusion MPT misc device (ioctl) driver 3.03.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
usbmon: debugfs is not available
usbcore: registered new driver libusual
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 8192 bind 8192)
TCP reno registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 8
NET: Registered protocol family 20
Using IPI No-Shortcut mode
Freeing unused kernel memory: 196k freed
Loading, please wait...
Begin: Loading essential drivers... ...
tg3: no version for "struct_module" found: kernel tainted.
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@xxxxxxxxxxxxx> and others
Intel(R) PRO/1000 Network Driver - version 6.3.9-k4
Copyright (c) 1999-2005 Intel Corporation.
Done.
Begin: Running /scripts/init-premount ...
FATAL: Error inserting fan (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/fan.ko): No such device
FATAL: Error inserting thermal (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/thermal.ko): No such device
Done.
Begin: Mounting root file system... ...
Begin: Running /scripts/local-top ...
Done.
Begin: Running /scripts/local-premount ...
Done.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Begin: Running /scripts/log-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
mount: Mounting /sys on /root/sys failed: No such file or directory
INIT: version 2.85 booting
Activating swap.
Checking root file system...
fsck 1.39 (29-May-2006)
/dev/sda1: clean, 21526/917504 files, 245920/1835007 blocks
EXT3 FS on sda1, internal journal
System time was Wed Aug  2 22:17:34 UTC 2006.
Setting the System Clock using the Hardware Clock as reference...
System Clock set. System local time is now Wed Aug  2 22:17:37 UTC 2006.
Loading device-mapper support.
Checking all file systems...
fsck 1.39 (29-May-2006)
Setting kernel variables..
Mounting local filesystems...
Adding 524280k swap on /swap00.  Priority:-1 extents:134 across:533176k
Cleaning /tmp /var/run /var/lock.
Running 0dns-down to make sure resolv.conf is ok...done.
Cleaning: /etc/network/ifstate.
Setting up IP spoofing protection: rp_filter.
Configuring network interfaces...done.
Loading the saved-state of the serial devices...
/dev/ttyS0: No such file or directory
/dev/ttyS0: No such file or directory
/dev/ttyS1: No such file or directory
/dev/ttyS1: No such file or directory
Not setting System Clock
Initializing random number generator...done.
Recovering nvi editor sessions... done.
INIT: Entering runlevel: 2
Starting isconf daemonRunning isconf updateisconf: info: build2.t7a.org is on guest-1 branch
isconf: info: may reboot...
isconf: info: checking for updates
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.911958506882
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.999292957677
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.239902520967
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm:               isconf
EIP: 0073:[<080c9763>] CPU: 0
EIP is at 0x80c9763
 ESP: 007b:bfcc962c EFLAGS: 00200282    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000001 EBX: 0000003a ECX: bfcc9624 EDX: 00000000
ESI: 08137cb4 EDI: 00000001 EBP: bfcc9638 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.34:65028/t7a.org/block/ff1/ff1276f7811aeeade18d54a6c3578261ff36ecbb-4fb47b36cda57ae95af56372f03bb2ca-1?challenge=0.265409462016
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm:               isconf
EIP: 0073:[<080af84d>] CPU: 0
EIP is at 0x80af84d
 ESP: 007b:bfcc96d0 EFLAGS: 00200246    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000001 EBX: 082031fe ECX: 082031fe EDX: b7af1f8c
ESI: 00000000 EDI: 082030ec EBP: bfcc9838 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/c0e/c0e10bc50572deb89da6e9d96ac5971a39fddc65-fc3558eaffc90497248f97f9b0e3a924-1?challenge=0.130730726051
isconf: info: updated /etc/ca-certificates.conf
isconf: info: running ['update-ca-certificates']
Updating certificates in /etc/ssl/certs....done.
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 1, comm:                 init
EIP: 0061:[<c0322fe1>] CPU: 0
EIP is at netif_poll+0x101/0x810
 EFLAGS: 00000216    Tainted: GF      (2.6.16.13-xen #2)
EAX: 00000037 EBX: c0945180 ECX: 0001134e EDX: c0945000
ESI: c0f48280 EDI: c0f499e8 EBP: c09451c0 DS: 007b ES: 007b
CR0: 8005003b CR2: b7d579e0 CR3: 0057e000 CR4: 00000640
 [<c03d891a>] net_rx_action+0xea/0x230
 [<c0124cb5>] __do_softirq+0xf5/0x120
 [<c0124d75>] do_softirq+0x95/0xa0
 [<c0106c0f>] do_IRQ+0x1f/0x30
 [<c0312f58>] evtchn_do_upcall+0xa8/0xf0
 [<c0105178>] hypervisor_callback+0x2c/0x34
 [<c02c2081>] __copy_user_intel+0x31/0xb0
 [<c02c2220>] __copy_to_user_ll+0x70/0x80
 [<c02c22f2>] copy_to_user+0x42/0x60
 [<c0171068>] cp_new_stat64+0xf8/0x110
 [<c01710b7>] sys_stat64+0x37/0x40
 [<c0104fb5>] syscall_call+0x7/0xb
isconf: warning: clierr:  Connection reset by peer
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
No configuration file was found for slapd at /etc/ldap/slapd.conf.
If you have moved the slapd configuration file please modify
/etc/default/slapd to reflect this.  If you chose to not
configure slapd during installation then you need to do so
prior to attempting to start slapd.
An example slapd.conf is in /usr/share/slapd
Starting Heimdal KDC: heimdal-kdc.
Starting Heimdal password server: kpasswdd.
Starting internet superserver: inetd.
Starting PCMCIA services: module directory /lib/modules/2.6.16.13-xen/pcmcia not found.
Starting OpenBSD Secure Shell server: sshd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.

Debian GNU/Linux testing/unstable build2.t7a.org tty1

build2.t7a.org login:
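
To check whether the lockup reports in a console capture like the one above
share a CPU, process, or code location, a short script can tally them.  What
follows is only a minimal sketch in Python, not part of any Xen tooling: the
names parse_lockups and summarize are made up for illustration, and the
regular expressions assume exactly the report layout shown above.  It also
reads the privilege ring from the low two bits of the CS selector printed on
the "EIP:" line, which separates user-space hits (ring 3, as in the two
isconf reports) from guest-kernel hits (ring 1 here, as in the netif_poll
report).

```python
import re
from collections import Counter

# Patterns assume the report layout seen in the console log above.
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
COMM_RE = re.compile(r"Pid:\s*(\d+),\s*comm:\s*(\S+)")
EIP_RE = re.compile(r"EIP:\s*([0-9a-fA-F]{4}):\[<([0-9a-fA-F]+)>\]")
SYMBOL_RE = re.compile(r"EIP is at (\S+)")


def parse_lockups(text):
    """Return one dict per soft lockup report found in a console log."""
    reports = []
    current = None
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            # A new report starts; later lines fill in its details.
            current = {"cpu": int(m.group(1))}
            reports.append(current)
            continue
        if current is None:
            continue  # nothing to attach details to yet
        m = COMM_RE.search(line)
        if m and "pid" not in current:
            current["pid"] = int(m.group(1))
            current["comm"] = m.group(2)
            continue
        m = EIP_RE.search(line)
        if m and "cs" not in current:
            cs = int(m.group(1), 16)
            current["cs"] = cs
            current["ring"] = cs & 3  # low two selector bits = privilege level
            current["eip"] = int(m.group(2), 16)
            continue
        m = SYMBOL_RE.search(line)
        if m and "location" not in current:
            # Either a raw address (user space) or symbol+offset (kernel).
            current["location"] = m.group(1)
    return reports


def summarize(reports):
    """Count reports by (cpu, comm, ring, location) so repeats stand out."""
    return Counter(
        (r.get("cpu"), r.get("comm"), r.get("ring"), r.get("location"))
        for r in reports
    )
```

Run over a saved copy of the capture above (for example
summarize(parse_lockups(open("console.log").read())), where console.log is a
hypothetical file name), this should list three reports, all on CPU#0: two in
isconf at different user-space addresses and one in init inside netif_poll.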

On Wed, Aug 02, 2006 at 01:54:49PM -0700, Steve Traugott wrote:
> Hi All,
> 
> I hate to say it, but it's starting to look like soft lockup bug(s)
> are turning into a serious roadblock for general production use of Xen
> 3.X, on a wide range of hardware.  I've been using Xen since the 1.0
> days, and I have to say that this is the most serious showstopper bug
> I've ever hit -- it usually manifests itself during the first
> significant network and/or disk I/O after starting a second or third
> domU on the same box, and is the only bug I've ever hit that has
> caused permanent damage -- it tends to corrupt guest filesystems.  In
> my case it's stopped a deployment dead in its tracks, and our only
> options at this point are to go back to Xen 2.X or (horrors) to native
> Linux kernels.
> 
> The problem (or something that looks identical) is described in
> several tickets, status currently NEW or REOPENED, no clear
> resolution:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705
> 
> In our own shop, we consistently hit soft lockups while running on
> both IBM x330's and older Netengines (similar to an IBM 4000R).  We've
> found no workaround.  We're on xen-3.0-testing, changeset 9732, kernel
> 2.6.16.13.  On April 6th, Keir posted a note saying this was fixed as
> of a blkif_schedule() fix, which we already have because that was way
> back in changeset 9587...
> http://lists.xensource.com/archives/html/xen-devel/2006-04/msg00121.html.
> 
> The most recent devel list traffic I've found which covers this is
> July 7th:
> http://lists.xensource.com/archives/html/xen-users/2006-07/msg00134.html
> ...this message referred back to Keir's comment as describing a fix,
> but it doesn't look true; while Keir's 9587 checkin may have fixed a
> soft lockup problem, there appear to be more out there, or else
> there's been a regression.
> 
> Do we have any consensus that this bug is fixed at all in
> xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
> lockups in testing *not* hitting them any more on the same hardware?
> If so, what changeset are you on now?
> 
> If anyone needs any more information, just let me know.  As usual, if
> anyone wants login and console server access to one of these boxes to
> chase this down, I'm more than happy to provide that.
> 
> Thanks, 
> 
> Steve
> -- 
> Stephen G. Traugott  (KG6HDQ)
> UNIX/Linux Infrastructure Architect, TerraLuna LLC
> stevegt@xxxxxxxxxxxxx 
> http://www.stevegt.com -- http://Infrastructures.Org

-- 
Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx 
http://www.stevegt.com -- http://Infrastructures.Org
