xen-devel

[Xen-devel] ext3 directory corruption under Xen

To: linux-kernel@xxxxxxxxxxxxxxx
Subject: [Xen-devel] ext3 directory corruption under Xen
From: "Christopher S. Aker" <caker@xxxxxxxxxxxx>
Date: Mon, 23 Jun 2008 12:15:33 -0400
Cc: xen devel <xen-devel@xxxxxxxxxxxxxxxxxxx>

We've been seeing a rash of ext3 directory corruption occurring under Xen. All but one of the reports have been with filesystems formatted with a 1024-byte block size. We have one report, potentially of the same bug, on a filesystem with a 4096-byte block size (either way, it was some type of corruption in that case). In all cases, the filesystems were mounted with ext3's default journaling mode. No quotas or other non-default ext3 mount options were in use.

It's happened on a number of different hosts, all with the same hardware and software configuration (64-bit Xen 3.2, 32-bit PAE dom0, 32-bit PAE domUs, LVM backend with 3ware hardware RAID-1). Some of those hosts were previously running non-virtualized Linux and UML, using the identical guest images, and never experienced this problem under that configuration.

This has occurred under both 2.6.18-xenbits and the more recent pv_ops-based kernels (2.6.24, 2.6.25), which I presume all use the same blkfront driver code.

The common workloads in the reports seem to be active maildirs and rsync.

The initial errors reported back are all from fs/ext3/dir.c, in ext3_check_dir_entry(). Most commonly hit is the "rec_len % 4 != 0" check. We've seen other checks trigger, but my assumption is that those happen after more stuff gets whacked out.
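
For readers who want to inspect a suspect directory block offline, here is a minimal sketch of the kind of per-entry validation that ext3_check_dir_entry() performs. It is an assumption-laden illustration, not the kernel code: the check order and messages follow the 2.6-era fs/ext3/dir.c as best recalled, the function names (check_dir_block, make_entry, min_rec_len) are hypothetical, and the "truncated entry header" check is an addition that the kernel function does not have. It assumes the standard little-endian on-disk entry layout (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file_type, then the name).

```python
import struct

# On-disk directory entry header: inode, rec_len, name_len, file_type.
DIRENT_HEADER = struct.Struct("<IHBB")


def min_rec_len(name_len):
    """Smallest legal rec_len: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_block(block, inodes_count):
    """Walk one directory block.

    Returns (offset, message) for the first bad entry, or None if every
    entry passes. inodes_count is the filesystem's total inode count.
    """
    offset = 0
    while offset < len(block):
        if offset + DIRENT_HEADER.size > len(block):
            # Not a kernel check; added so the sketch never reads past the end.
            return offset, "truncated entry header"
        inode, rec_len, name_len, _ftype = DIRENT_HEADER.unpack_from(block, offset)
        if rec_len < min_rec_len(1):
            return offset, "rec_len is smaller than minimal"
        if rec_len % 4 != 0:
            return offset, "rec_len % 4 != 0"
        if rec_len < min_rec_len(name_len):
            return offset, "rec_len is too small for name_len"
        if offset + rec_len > len(block):
            return offset, "directory entry across blocks"
        if inode > inodes_count:
            return offset, "inode out of bounds"
        offset += rec_len
    return None


def make_entry(inode, rec_len, name, file_type=2):
    """Build one raw entry, zero-padded to rec_len (hypothetical test helper)."""
    raw = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type) + name
    return raw.ljust(rec_len, b"\0")


# A 1024-byte block holding "." and "..", then the same block with the
# first rec_len overwritten by a value that is not a multiple of 4.
block = bytearray(make_entry(2, 12, b".") + make_entry(2, 1012, b".."))
print(check_dir_block(bytes(block), inodes_count=1000))  # None
struct.pack_into("<H", block, 4, 13)
print(check_dir_block(bytes(block), inodes_count=1000))  # (0, 'rec_len % 4 != 0')
```

The walk stops at the first failure because a bad rec_len makes every later offset meaningless, which is consistent with the observation that other checks tend to fire only once more of the block has been damaged.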

Eventually the fs will go read-only. In extreme cases, the fs is chewed through enough that data is lost.

It's tricky to track down the trigger because you can only detect the corruption after it's happened. Our attempts to reproduce this using various filesystem thrashing scripts haven't yielded a reliable way to trigger it. We have, however, managed to trigger it twice -- in two weeks :( .
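
To make "filesystem thrashing" concrete, the following is a hypothetical sketch of a maildir-style churn loop. It is not one of the scripts referred to above, and there is no claim that it reproduces the corruption. The function name churn_maildir, the file naming scheme, the message sizes, and the 50% delete rate are all assumptions chosen for illustration. It follows the usual maildir pattern of writing into tmp/, renaming into new/, then renaming into cur/, so a small set of directories sees constant entry creation and removal.

```python
import os
import random
import tempfile


def churn_maildir(root, iterations, seed=0):
    """Simulate maildir delivery and cleanup under root.

    Returns the number of messages left in cur/ at the end.
    """
    rng = random.Random(seed)
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    live = []
    for i in range(iterations):
        # The counter prefix keeps names unique; the rest is arbitrary.
        name = "%d.%06d.host" % (i, rng.randrange(10**6))
        tmp_path = os.path.join(root, "tmp", name)
        with open(tmp_path, "wb") as f:
            f.write(os.urandom(rng.randrange(1, 4096)))
            f.flush()
            os.fsync(f.fileno())  # push the data to the block device
        new_path = os.path.join(root, "new", name)
        os.rename(tmp_path, new_path)
        cur_path = os.path.join(root, "cur", name + ":2,S")
        os.rename(new_path, cur_path)
        live.append(cur_path)
        # Delete a random stored message about half the time, so directory
        # entries are repeatedly added and removed.
        if rng.random() < 0.5:
            os.unlink(live.pop(rng.randrange(len(live))))
    return len(live)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(churn_maildir(scratch, iterations=1000))
```

In practice the root would be a directory on the ext3 filesystem under test inside a domU, run with a much larger iteration count and followed by an fsck of the unmounted filesystem, since the corruption is only visible after the fact.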

My hope is that this triggers an "a-hah" from someone in LKML or Xen land who has experience with this code, or that this is a known issue and a fix already exists.

We're scared.  Please help.

Thanks,
-Chris

