
Re: [Xen-devel] OOM problems



On Wed, 2010-11-17 at 22:29 -0500, John Weekes wrote:
> Daniel:
> 
>  > Which branch/revision does latest pvops mean?
> 
> stable-2.6.32, using the latest pull as of today. (I also tried 
> next-2.6.37, but it wouldn't boot for me.)
> > Would you be willing to try and reproduce that again with the XCP blktap
> > (userspace, not kernel) sources? Just to further isolate the problem.
> > Those see a lot of testing. I certainly can't remember a single fix to
> > the aio layer in ages. But I'm never sure about other stuff potentially
> > broken in userland.
> 
> I'll have to give it a try. Normal blktap still isn't working with
> pv_ops, though, so I hope this is a drop-in replacement for blktap2.

I think it should work fine, or I wouldn't ask. If not, let me know.

> In my last bit of troubleshooting, I took O_DIRECT out of the open call 
> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates 
> that this might have eliminated the problem with corruption. I'm testing 
> further now, but could there be an issue with alignment (since the 
> kernel is apparently very strict about it with direct I/O)? 

Nope. The kernel is indeed strict about alignment with direct I/O, but
those buffers are 4k-aligned all over the place. You'd see syslog yelling
quite miserably in a case like that. Keeping an eye on syslog (the daemon
and kern facilities) is generally a good idea, btw.
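
For reference, a minimal sketch of the O_DIRECT contract (not the actual
block-aio.c code, just an illustration with made-up sizes), to show what
the alignment requirement looks like in practice:

    /* O_DIRECT wants the buffer, the file offset and the length
     * block-aligned; posix_memalign() takes care of the buffer side.
     * A misaligned buffer typically gets you EINVAL straight away. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            char *buf;
            int fd;

            if (argc < 2)
                    return 1;

            fd = open(argv[1], O_RDONLY | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* 4k-aligned, 4k-sized buffer read from a 4k-aligned offset */
            if (posix_memalign((void **)&buf, 4096, 4096))
                    return 1;
            if (pread(fd, buf, 4096, 0) < 0)
                    perror("pread");

            free(buf);
            close(fd);
            return 0;
    }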

> (Removing
> this flag also brings the page cache back into use, of course.)

I/O-wise it's not much different from the file: path, meaning it should
have carried you straight back into the OOM realm.

> > If dio is definitely not what you feel you need, let's get back to your
> > original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
> > aggressive, to say the least.
> 
> When I switched to aio, I reduced the vcpus to 2 (I needed to do this 
> with dom0_max_vcpus, rather than through xend-config.sxp -- the latter 
> wouldn't always boot). I haven't separately tried cached I/O with 
> reduced CPUs yet, except in the lab; and unfortunately I still can't get 
> the problem to happen in the lab, no matter what I try.

Just reducing the CPU count alone sounds like something worth trying even
on a production box, if the current state of things already tends to take
the system down. Also, the dirty_ratio sysctl should be pretty safe to
tweak at runtime (quick sketch below).
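
If it helps, vm.dirty_ratio is an ordinary /proc knob, so changing it at
runtime is just a write to /proc/sys/vm/dirty_ratio (the same thing
"sysctl -w vm.dirty_ratio=2" does from a shell). Purely as an
illustration, with the value picked only as an example:

    #include <stdio.h>

    int main(void)
    {
            /* takes effect immediately, no reboot needed */
            FILE *f = fopen("/proc/sys/vm/dirty_ratio", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            fprintf(f, "2\n");
            return fclose(f) ? 1 : 0;
    }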

> > If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
> > There must be a tradeoff which doesn't imply scribbling the better half
> > of 1.5GB main memory.
> 
> The default for dirty_ratio is 20. I tried halving that to 10, but it 
> didn't help. 

Still too much. That value is a percentage, per task. Try 2: with 1.5G
that still leaves a decent 30M write cache per writer, and it should block
all 24 disks' worth of writers after some 700M outstanding, worst case. Or
so I think...
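
(Spelling out the arithmetic behind those numbers, assuming the 2% really
does apply per writing task:

    2% of 1.5G        ~=  30M of dirty data per writer
    24 writers x 30M  ~= 720M outstanding before everything blocks

which is roughly the 700M worst case above.)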

> I could try lower, but I like the thought of keeping this 
> in user space, if possible, so I've been pursuing the blktap2 path most 
> aggressively.

Okay. I'm sending you a tbz to try.

Daniel

> Ian:
> 
> >  That's disturbing. It might be worth trying to drop the number of VCPUs in 
> > dom0 to 1 and then try to repro.
> >  BTW: for production use I'd currently be strongly inclined to use the XCP 
> > 2.6.32 kernel.
> 
> Interesting, ok.
> 
> -John



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

