

[Xen-devel] blocking Xen 3.X production use: soft lockup bugs

To: Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
Subject: [Xen-devel] blocking Xen 3.X production use: soft lockup bugs
From: Steve Traugott <stevegt@xxxxxxxxxxxxx>
Date: Wed, 2 Aug 2006 13:54:49 -0700
Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Wed, 02 Aug 2006 13:55:32 -0700
User-agent: Mutt/1.5.9i

Hi All,

I hate to say it, but it's starting to look like soft lockup bug(s)
are turning into a serious roadblock for general production use of Xen
3.X, on a wide range of hardware.  I've been using Xen since the 1.0
days, and I have to say that this is the most serious showstopper bug
I've ever hit -- it usually manifests itself during the first
significant network and/or disk I/O after starting a second or third
domU on the same box, and is the only bug I've ever hit that has
caused permanent damage -- it tends to corrupt guest filesystems.  In
my case it's stopped a deployment dead in its tracks, and our only
options at this point are to go back to Xen 2.X or (horrors) to native
Linux kernels.

The problem (or something that looks identical) is described in
several tickets, status currently NEW or REOPENED, no clear

In our own shop, we consistently hit soft lockups while running on
both IBM x330's and older Netengines (similar to an IBM 4000R).  We've
found no workaround.  We're on xen-3.0-testing, changeset 9732, kernel  On April 6th, Keir posted a note saying this was fixed as
of a blkif_schedule() fix, which we already have because that was way
back in changeset 9587...

The most recent devel list traffic I've found which covers this is
July 7th:
...this message referred back to Keir's comment as describing a fix,
but that doesn't appear to be true; while Keir's 9587 checkin may have
fixed a soft lockup problem, there appear to be more out there, or else
there's been a regression.

Do we have any consensus that this bug is fixed at all in
xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
lockups in testing *not* hitting them any more on the same hardware?
If so, what changeset are you on now?

If anyone needs any more information, just let me know.  As usual, if
anyone wants login and console server access to one of these boxes to
chase this down, I'm more than happy to provide that.


Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
http://www.stevegt.com -- http://Infrastructures.Org
