2010/6/19 Miles Fidelman <mfidelman@xxxxxxxxxxxxxxxx>:
> Hi Folks,
> I'm experiencing a very odd, daily, high-load situation - that seems to
> localize in my disk stack. I direct this to the xen-users, linux-raid
> and linux-ha lists as I expect there's a pretty high degree of
> experience on these lists with complicated disk driver stacks.
> I recently virtualized a production system, and have been slowly
> wringing out the bugs that have shown up. This seems to be the last
> one, and it's a doozie.
> Basic setup: Two identical machines except for the DomUs they're running.
> Two machines, slightly older Pentium 4 processors, 4GB RAM each (the max),
> 2 CPUs each, 4 SATA Drives each.
> Debian Lenny Install for Dom0 and DomUs (2.6.26-2-xen-686)
> Disk setup on each:
> - 4 partitions on each drive
> - 3 RAID-1s set up across the 4 drives (4 drives in each - yes it's
> silly, but easy) - for Dom0 /boot / swap
> - 1 RAID-6 set up across the 4 drives - set up as a LVM PV - underlies
> all my DomUs
> note: all the RAIDs are set up with internal metadata and a 131072-byte
> (128KB) chunk size - per advice here - works like a charm
> - pairs of LVs - / and swap per VM
> - each LV is linked with its counterpart on the other machine, using DRBD
> - LVs are specified as drbd: devices in DomU .cfg files
> - LVs are mounted with noatime option inside production DomU - makes a
> big difference
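For anyone trying to picture the wiring, the stack described above looks roughly like this (a sketch only - the resource, volume-group, host names and addresses below are made up for illustration; substitute your own):

```
# /etc/drbd.d/vm1-root.res -- one DRBD resource per LV pair
resource vm1-root {
  protocol C;
  on server-a {
    device    /dev/drbd1;
    disk      /dev/vg0/vm1-root;   # LV carved from the RAID-6 PV
    address   192.168.1.1:7789;
    meta-disk internal;
  }
  on server-b {
    device    /dev/drbd1;
    disk      /dev/vg0/vm1-root;
    address   192.168.1.2:7789;
    meta-disk internal;
  }
}

# and in the DomU's /etc/xen/vm1.cfg, the "drbd:" device spec
# handled by DRBD's block-drbd helper script:
# disk = [ 'drbd:vm1-root,xvda1,w', 'drbd:vm1-swap,xvda2,w' ]
```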
> A few DomUs - currently started and stopped either via links in
> /etc/xen/auto or manually - I've temporarily turned off heartbeat and
> pacemaker until I get the underlying stuff stable.
> now to the problem:
> for several days in a row, at 2:05am, iowait on the production DomU went
> from averaging 10% or so to 100% (I've been running vmstat 1 in a window
> and watching the iowait column)
> the past two days, this has happened at 2:26am instead of 2:05
> rebooting the VM fixes the problem, though it has occurred again within
> 20 minutes of the reboot, and then another reboot fixes the problem
> until 2am the next day
> killing a bunch of processes also fixed things, but at that point so
> little was running that I just rebooted the DomU - unfortunately, one
> night it looked like lwresd was eating up resources, the next night it
> was something else.
> ok... so I'm thinking there's a cron job that's doing something that eats
> up all my i/o - I traced a couple of other issues back to cron jobs - I
> can't seem to find either a cron job that runs around this time, or
> anything in my logs
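One way to sweep for the trigger is to grep every crontab for jobs scheduled in the 02:xx hour - worth running in Dom0 on both servers and inside the DomU. A sketch (paths are the Debian defaults; the awk is deliberately simple and won't catch hour lists like "2,14"):

```shell
#!/bin/sh
# List every non-comment crontab entry scheduled during the 02:00 hour.
# crontab fields: minute hour day-of-month month day-of-week [user] command
for f in /etc/crontab /etc/cron.d/* /var/spool/cron/crontabs/*; do
    [ -f "$f" ] || continue
    awk -v f="$f" '!/^#/ && $2 == "2" { print f ": " $0 }' "$f"
done
# Note: jobs in /etc/cron.daily carry no times of their own -- Debian fires
# them from a single line in /etc/crontab (or via anacron), so check that
# line's hour too.
```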
> so, now I set up a bunch of things to watch what's going on - copies of
> atop running in Dom0 on both servers, and in the production DomU (note:
> I caught a couple more bugs by running top in a window, and seeing
> what was frozen in the window after the machine crashed)
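A cheap way to capture that 2am window without being awake for it is a timestamped sampling loop in Dom0 (the log path and interval here are arbitrary; iostat comes from Debian's sysstat package):

```shell
#!/bin/sh
# Append a timestamped per-disk utilisation sample every 10 seconds.
# With "iostat -x 1 2" the first report is the since-boot average;
# the second is the live 1-second sample -- that's the one to read.
LOG=/var/log/iowatch.log
while :; do
    date "+%F %T" >> "$LOG"
    iostat -x 1 2 >> "$LOG" 2>&1
    sleep 10
done
```

Start it under nohup before bed; next morning, grep the log for the %util spike on sdb and read off the exact minute it began.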
> ok - so I'm up at 2am for the 4th day in a row (along with a couple of
> proposals I'm writing during the day, and a couple of fires with my
> kids' computers - I've discovered that Mozy is perhaps the world's worst
> backup service - it's impossible to restore things) - anyway.... 2:26
> rolls around, the iowait goes to 100%, and I start looking using ps, and
> iostat, and lsof and such to try to locate whatever process is locking
> up my DomU, when I notice:
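On 2.6 kernels there's also a blunt instrument for tying I/O to processes that ps/lsof won't show you: vm.block_dump logs every block read/write with the originating process name to the kernel log. A sketch - run it only briefly as root, it's noisy, and quiet syslog/klogd first or their own writes feed back into the log:

```shell
#!/bin/sh
# Per-process block I/O tracing on 2.6.x (run as root, for a short window).
echo 1 > /proc/sys/vm/block_dump
sleep 30
echo 0 > /proc/sys/vm/block_dump
# Lines look like "lwresd(1234): READ block 567 on sdb1"; tally the
# busiest originators:
dmesg | awk '/(READ|WRITE|dirtied)/ { print $1 }' \
    | sort | uniq -c | sort -rn | head
```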
> --- on one server, atop is showing one drive (/dev/sdb) maxing out at
> 98% busy - sort of suggestive of a drive failure, and something that
> would certainly ripple through all the layers of RAID, LVM, DRBD to slow
> down everything on top of it (which is everything)
> Now this is pretty weird - given the way my system is set up, I'd expect
> a dying disk to show up as very high iowaits, but....
> - it's a relatively new drive
> - I've been running smartd, and smartctl doesn't yield any results
> suggesting a drive problem
> - the problem goes away when I reboot the DomU
> One more symptom: I migrated the DomU to my other server, and there's
> still a correlation between seeing the 98% busy on /dev/sdb, and seeing
> iowait of 100% on the DomU - even though we're now talking a disk on one
> machine dragging down a VM on the other machine. (Presumably it's
> impacting DRBD replication.)
> - on the one hand, the 98% busy on /dev/sdb is rippling up through md,
> lvm, drbd, dom0 - and causing 100% iowait in the production DomU - which
> is to be expected in a raided, drbd'd environment - a low level delay
> ripples all the way up
> - on the other hand, it's only affecting the one DomU, and it's not
> affecting the Dom0 on that machine
> - there seems to be something going on at 2:25am, give or take a few,
> that kicks everything into the high iowait state (but I can't find a job
> running at that time - though I guess someone could be hitting me with
> some spam that's kicking amavisd or clam into a high-resource mode)
> All of which leads to two questions:
> - if it's a disk going bad, why does this manifest nightly, at roughly
> the same time, and affect only one DomU?
> - if it's something in the DomU, by what mechanism is that rippling all
> the way down to a component of a raid array, hidden below several
> layers of stuff that's supposed to isolate virtual volumes from hardware?
> The only thought that occurs to me is that perhaps there's a bad record
> or block on that one drive, that only gets exercised when one particular
> process runs.
> - is that a possibility?
> - if yes, why isn't drbd or md or something catching it and fixing it
> (or adding the block to the bad block table)?
> - any suggestions on diagnostic or drive rebuilding steps to take next?
> (including things I can do before staying up until 2am tomorrow!)
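On the bad-block theory: yes, it's possible - md won't notice a latent unreadable sector until something actually reads it, and SMART often stays clean until a read fails. You can force a full surface pass yourself. A hedged sketch, assuming the RAID-6 is /dev/md3 and the suspect member is /dev/sdb (substitute your real device names):

```shell
#!/bin/sh
# 1. SMART: look for reallocated/pending sectors on the suspect member,
#    then kick off a long (surface) self-test and re-check with -a later.
smartctl -a /dev/sdb | grep -Ei 'realloc|pending|uncorrect'
smartctl -t long /dev/sdb

# 2. Make md read every sector of the array; an unreadable block gets
#    rewritten from parity, which usually forces the drive to reallocate
#    it -- the self-healing that otherwise never runs unless scheduled.
echo check > /sys/block/md3/md/sync_action
cat /proc/mdstat                         # watch progress here
cat /sys/block/md3/md/mismatch_cnt      # nonzero => inconsistencies found

# 3. Brute-force sequential read of the raw disk to surface bad blocks
#    (read-only, but it will load the disk heavily -- run off-hours).
dd if=/dev/sdb of=/dev/null bs=1M conv=noerror
```

Also worth checking whether Debian's mdadm package installed its scheduled checkarray cron job (look in /etc/cron.d/) and when it fires - a scheduled array check is exactly the kind of job that can peg one member disk at a fixed time of night.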
> If it weren't hitting me, I'd be intrigued by this one. Unfortunately,
> it IS hitting me, and I'm getting tireder and crankier by the minute,
> hour, and day. And it's now 4:26am. Sigh...
> Thanks very much for any ideas or suggestions.
> Off to bed....
> Miles Fidelman
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra
> Linux-HA mailing list
> See also: http://linux-ha.org/ReportingProblems
Xen-users mailing list