[Xen-users] Re: very odd iowait problem

To: Miles Fidelman <mfidelman@xxxxxxxxxxxxxxxx>
Subject: [Xen-users] Re: very odd iowait problem
From: Bill Davidsen <davidsen@xxxxxxx>
Date: Sun, 04 Jul 2010 20:02:56 -0400
Cc: linux-raid@xxxxxxxxxxxxxxx, General Linux-HA mailing list <linux-ha@xxxxxxxxxxxxxxxxxx>, "xen-users@xxxxxxxxxxxxxxxxxxx" <xen-users@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <4C1C7F2F.3040604@xxxxxxxxxxxxxxxx>
Organization: TMR Associates Inc, Schenectady NY
References: <4C1C7F2F.3040604@xxxxxxxxxxxxxxxx>
Miles Fidelman wrote:
Hi Folks,

I'm experiencing a very odd, daily, high-load situation that seems to be localized in my disk stack. I'm directing this to the xen-users, linux-raid, and linux-ha lists, as I expect there's a pretty high degree of experience on these lists with complicated disk driver stacks.

I recently virtualized a production system, and have been slowly wringing out the bugs that have shown up. This seems to be the last one, and it's a doozie.

Basic setup: Two identical machines except for the DomUs they're running.

Two machines with slightly older Pentium 4 processors, 4GB RAM each (the max), 2 CPUs each, and 4 SATA drives each.
Debian Lenny Install for Dom0 and DomUs (2.6.26-2-xen-686)

Disk setup on each:
- 4 partitions on each drive
- 3 RAID-1s set up across the 4 drives (4 drives in each - yes it's silly, but easy) - for Dom0 /boot, /, and swap
- 1 RAID-6 set up across the 4 drives - set up as an LVM PV - underlies all my DomUs (note: all the RAIDs are set up with internal metadata and a chunk size of 131072KB - per advice here - works like a charm)
- pairs of LVs - / and swap per VM
- each LV is linked with its counterpart on the other machine, using DRBD
- the LVs are specified as drbd: devices in the DomU .cfg files
- the LVs are mounted with the noatime option inside the production DomU - makes a big difference
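
For anyone reconstructing this layering, a minimal sketch of the md/LVM side - device names, array numbers, and volume names are invented for illustration, and the DRBD and Xen .cfg wiring is omitted:

    # one of the RAID-1s for Dom0 (e.g. /boot), then the RAID-6 under LVM
    mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1
    mdadm --create /dev/md3 --level=6 --raid-devices=4 --chunk=128 \
        /dev/sd[abcd]4    # --chunk is in KB; 128 assumes the quoted "131072" was bytes
    # the RAID-6 becomes an LVM PV; per-DomU LVs are carved from it
    pvcreate /dev/md3
    vgcreate vg_domu /dev/md3
    lvcreate -L 20G -n prod-root vg_domu    # sizes invented
    lvcreate -L 2G -n prod-swap vg_domu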

A few DomUs - currently started and stopped either via links in /etc/xen/auto or manually - I've temporarily turned off heartbeat and pacemaker until I get the underlying stuff stable.

------
now to the problem:

for several days in a row, at 2:05am, iowait on the production DomU went from averaging 10% or so to 100% (I've been running vmstat 1 in a window and watching the iowait column)
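
For reference, the watching command - the rightmost cpu column is the one in question:

    # one-second samples; the "wa" cpu column is the iowait percentage
    vmstat 1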

the past two days, this has happened at 2:26am instead of 2:05

rebooting the VM fixes the problem, though it has occurred again within 20 minutes of the reboot, and then another reboot fixes the problem until 2am the next day

killing a bunch of processes also fixed things, but at that point so little was running that I just rebooted the DomU - unfortunately, one night it looked like lwresd was eating up resources, the next night it was something else.

------
ok... so I'm thinking there's a cron job that's doing something that eats up all my i/o - I've traced a couple of other issues back to cron jobs - but I can't seem to find either a cron job that runs around this time or anything in my logs
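
One way to rule cron out more systematically than spot-checking is to enumerate every crontab on the box - a sketch, using the stock Debian locations:

    # system-wide jobs
    cat /etc/crontab; ls -l /etc/cron.d /etc/cron.daily /etc/cron.weekly
    # every user's personal crontab, root included
    for u in $(cut -d: -f1 /etc/passwd); do
        echo "== $u =="; crontab -l -u "$u" 2>/dev/null
    done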

so, now I set up a bunch of things to watch what's going on - copies of atop running in Dom0 on both servers, and in the production DomU (note: I caught a couple more bugs by running top in a window and seeing what was frozen in the window after the machine crashed)
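
Worth noting: atop can also log raw samples to a file for later replay, which beats watching it live (path and interval here are arbitrary):

    # one sample every 10 seconds; 2880 samples covers 8 hours overnight
    atop -w /var/log/atop_overnight.raw 10 2880
    # replay the next morning, stepping through samples with 't'
    atop -r /var/log/atop_overnight.raw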

ok - so I'm up at 2am for the 4th day in a row (along with a couple of proposals I'm writing during the day, and a couple of fires with my kids' computers - I've discovered that Mozy is perhaps the world's worst backup service - it's impossible to restore things) - anyway.... 2:26 rolls around, the iowait goes to 100%, and I start looking with ps, iostat, lsof, and such to try to locate whatever process is locking up my DomU, when I notice:

--- on one server, atop is showing one drive (/dev/sdb) maxing out at 98% busy - sort of suggestive of a drive failure, and something that would certainly ripple through all the layers of RAID, LVM, DRBD to slow down everything on top of it (which is everything)
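
As a cross-check on atop's %busy figure, iostat from the sysstat package reports the same thing as %util:

    # extended per-device stats every 5 seconds; watch the %util and await columns
    iostat -dx sdb 5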

Now this is pretty weird - given the way my system is set up, I'd expect a dying disk to show up as very high iowaits, but....
- it's a relatively new drive
- I've been running smartd, and smartctl doesn't yield any results suggesting a drive problem (a self-test sketch follows below)
- the problem goes away when I reboot the DomU
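
A caveat on the smartd point: periodic attribute polling doesn't sweep the platters, but a long self-test does - a sketch, assuming smartmontools is installed:

    smartctl -a /dev/sdb            # full attribute and error-log dump
    smartctl -t long /dev/sdb       # offline surface scan (takes hours)
    smartctl -l selftest /dev/sdb   # check the result once it finishes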

One more symptom: I migrated the DomU to my other server, and there's still a correlation between seeing the 98% busy on /dev/sdb and seeing iowait of 100% on the DomU - even though we're now talking about a disk on one machine dragging down a VM on the other machine. (Presumably it's impacting DRBD replication.)
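
If DRBD replication is the coupling between the two boxes, its state should show it during the event - resource names are whatever the .res files define:

    cat /proc/drbd                           # connection state, sync and pending counters
    drbdadm cstate all; drbdadm dstate all   # per-resource connection and disk state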

So....
- on the one hand, the 98% busy on /dev/sdb is rippling up through md, LVM, DRBD, and Dom0, causing 100% iowait in the production DomU - which is to be expected in a RAIDed, DRBD'd environment, where a low-level delay ripples all the way up
- on the other hand, it's only affecting the one DomU, and it's not affecting the Dom0 on that machine
- there seems to be something going on at 2:25am, give or take a few minutes, that kicks everything into the high-iowait state (but I can't find a job running at that time - though I guess someone could be hitting me with some spam that's kicking amavisd or clam into a high-resource mode)

All of which leads to two questions:
- if it's a disk going bad, why does this manifest nightly, at roughly the same time, and affect only one DomU?
- if it's something in the DomU, by what mechanism is that rippling all the way down to a component of a RAID array, hidden below several layers of stuff that's supposed to isolate virtual volumes from hardware?

The only thought that occurs to me is that perhaps there's a bad record or block on that one drive, one that only gets exercised when one particular process runs.
- is that a possibility?
- if yes, why isn't drbd or md or something catching it and fixing it (or adding the block to the bad-block table)?
- any suggestions on diagnostic or drive-rebuilding steps to take next? (including things I can do before staying up until 2am tomorrow!)
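
On the bad-block theory: md only notices a marginal sector when something actually reads it, so one option is to force a full scrub of the array (or a read-only surface scan of the raw drive) rather than waiting for 2am - a sketch, with the RAID-6's md device name assumed:

    # make md read every stripe; unreadable sectors get rewritten from redundancy
    echo check > /sys/block/md3/md/sync_action
    cat /proc/mdstat                        # progress
    cat /sys/block/md3/md/mismatch_cnt      # nonzero = inconsistencies found
    # or a non-destructive read-only test of the bare drive
    badblocks -sv /dev/sdb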

If it weren't hitting me, I'd be intrigued by this one. Unfortunately, it IS hitting me, and I'm getting tireder and crankier by the minute, hour, and day. And it's now 4:26am. Sigh...

Thanks very much for any ideas or suggestions.

Get some sleep, for one.

I would install and enable process accounting, turn it on at midnight and let it run until morning (unless you feel like staying up to reboot). That's at a low enough level that I would expect it to have information as to what's running, at least.
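
On Debian that amounts to roughly the following (package name and log path per the stock acct package):

    apt-get install acct                 # accounting starts on install
    accton /var/log/account/pacct        # or toggle it by hand
    # the next morning:
    lastcomm | less                      # every command run, per user, with timestamps
    sa -u | less                         # summaries by user and command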

--
Bill Davidsen <davidsen@xxxxxxx>
 "We can't solve today's problems by using the same thinking we
  used in creating them." - Einstein


