On Wed, 2010-11-17 at 17:02 -0500, John Weekes wrote:
> There is certainly a trade-off, and historically, we've had problems
> with stability under Xen, so crashes are definitely a concern.
> Implementation in tapdisk would be great.
> I found today that tapdisk2 (at least on the latest 4.0-testing/unstable
> and latest pv_ops) is causing data corruption for Windows guests; I can
> see this by copying a few thousand files to another folder inside the
> guest, totalling a bit more than a GB, then running "fc" to check for
> differences (I tried with and without GPLPV). That's obviously a huge
> deal in production (and an even bigger deal than crashes), so in the
> short term, I may have to switch back to the uglier, crashier file:
> setup. I've been trying to find a workaround for the corruption all day
> without much luck.
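For anyone trying to reproduce this, the copy-then-"fc" check described above can be scripted so that a few thousand files are compared in one pass. What follows is only a sketch: the function names are made up for illustration, and it compares SHA-256 digests instead of doing the byte-by-byte diff that fc performs.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src_root, dst_root):
    """Return sorted relative paths that are missing or differ under dst_root."""
    bad = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            # A file counts as corrupted if the copy is absent or its digest differs.
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                bad.append(rel)
    return sorted(bad)
```

An empty list from compare_trees(original, copy) means no differences were seen in that run. It does not prove the absence of corruption, since data still sitting in the guest's cache may be read back correctly.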
Which branch/revision does latest pvops mean?
Would you be willing to try and reproduce that again with the XCP blktap
(userspace, not kernel) sources? Just to further isolate the problem.
Those see a lot of testing. I certainly can't recall a single fix
to the aio layer in ages. But I'm never sure about other stuff
potentially broken in userland.
If dio is definitely not what you feel you need, let's get back to your
original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
aggressive, to say the least.
If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
There must be a tradeoff which doesn't imply scribbling the better half
of 1.5GB main memory.
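To get a feel for the numbers, the sketch below estimates how many bytes of dirty page cache a given ratio allows. It rests on assumptions: the kernel applies the ratio to "dirtyable" memory, not to total RAM, so using the full 1.5GB overestimates the limit, and the function names here are invented for illustration. Only the /proc/sys/vm/dirty_ratio and dirty_background_ratio files are real.

```python
def dirty_threshold_bytes(dirtyable_bytes, ratio_percent):
    """Approximate dirty-page limit for a percentage-style vm.dirty_* setting."""
    return dirtyable_bytes * ratio_percent // 100


def read_vm_sysctl(name):
    """Read an integer value from /proc/sys/vm/<name> (Linux only)."""
    with open("/proc/sys/vm/" + name) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    # Assumption: treat all of a 1.5GB dom0 as dirtyable (an upper bound).
    dom0_bytes = 1536 * 1024 * 1024
    for name in ("dirty_background_ratio", "dirty_ratio"):
        try:
            ratio = read_vm_sysctl(name)
        except (OSError, ValueError):
            print(name + ": not readable on this system")
            continue
        limit_mib = dirty_threshold_bytes(dom0_bytes, ratio) / (1024 * 1024)
        print("%s=%d -> at most about %.0f MiB dirty" % (name, ratio, limit_mib))
```

A ratio that reads as 0 usually means the byte-based counterpart (dirty_bytes or dirty_background_bytes) is in use instead.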
> On 11/17/2010 12:10 PM, Ian Pratt wrote:
> >> Performance is noticeably lower with aio on these bursty write
> >> workloads; I've been getting a number of complaints.
> > That's the cost of having guest data safely committed to disk before being
> > ACK'ed. The users will presumably be happier when a host failure doesn't
> > trash their filesystems due to the total loss of any of the write ordering
> > the filesystem implementer intended.
> > Personally, I wouldn't want any data of mine stored on such a system, but I
> > guess others' mileage may vary.
> > If unsafe write buffering is desired, I'd be inclined to implement it
> > explicitly in tapdisk rather than rely on the total vagaries of the linux
> > buffer cache. It would thus be possible to bound the amount of outstanding
> > data, continue to respect ordering, and still respect explicit flushes.
> > Ian
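The idea of explicit buffering with a bound, preserved ordering, and honored flushes can be illustrated with a small sketch. This is not tapdisk code: BoundedWriteBuffer and its methods are hypothetical, it writes through a plain file descriptor, and "ordering" here only means that writes are handed to the kernel in submission order. Real on-disk ordering would additionally need O_DIRECT or O_SYNC, or a sync between dependent writes.

```python
import os
from collections import deque


class BoundedWriteBuffer:
    """Hypothetical write-back buffer holding at most max_bytes of pending data."""

    def __init__(self, fd, max_bytes):
        self.fd = fd
        self.max_bytes = max_bytes
        self.pending = deque()  # (offset, data) tuples in submission order
        self.pending_bytes = 0

    def write(self, offset, data):
        """Queue a write; drain oldest entries first if the bound is exceeded."""
        self.pending.append((offset, bytes(data)))
        self.pending_bytes += len(data)
        while self.pending_bytes > self.max_bytes:
            self._drain_one()

    def _drain_one(self):
        """Issue the oldest queued write, handling short writes."""
        offset, data = self.pending.popleft()
        view = memoryview(data)
        while len(view) > 0:
            n = os.pwrite(self.fd, view, offset)
            view = view[n:]
            offset += n
        self.pending_bytes -= len(data)

    def flush(self):
        """Explicit flush: drain everything in order, then fsync the file."""
        while self.pending:
            self._drain_one()
        os.fsync(self.fd)
```

The sketch leaves out reads, which would have to consult the pending queue before going to the file, and it acknowledges writes before they are durable, which is exactly the unsafe part of the trade-off discussed above.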
> >> I see that 2.6.36 has some page_writeback changes:
> >> http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.36.bz2;z=8379
> >> Any thoughts on whether these would make a difference for the problems
> >> with "file:"? I'm still trying to find a way to reproduce the issue in
> >> the lab, so I'd have to test the patch in production -- that's not a
> >> tantalizing prospect, unless there is a real chance that it will affect
> >> it.
> >> -John
> >> On 11/15/2010 9:59 AM, John Weekes wrote:
> >>>> They are throttled, but the single control I'm aware of
> >>>> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> >>>> per process, not a global limit. Could well be that's part of the
> >>>> problem -- outwitting mm with just too many writers on too many cores?
> >>>> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
> >>>> made it much easier than with e.g. 2.6.27 to drive everybody else into
> >>>> costly reclaims.
> >>>> The OOM shown here reports ~650M in dirty pages. The fact alone
> >>>> that this counts as an OOM condition doesn't sound quite right in
> >>>> itself. That qemu might just have dared to ask at the wrong point in
> >>>> time.
> >>>> Just to get an idea -- how many guests did this box carry?
> >>> It carries about two dozen guests, with a mix of mostly HVMs (all
> >>> stubdom-based, some with PV-on-HVM drivers) and some PV.
> >>> This problem occurred more often for me under 2.6.32 than 2.6.31, I
> >>> noticed. Since I made the switch to aio, I haven't seen a crash, but
> >>> it hasn't been long enough for that to mean much.
> >>> Having extra caching in the dom0 is nice because it allows for domUs
> >>> to get away with having small amounts of free memory, while still
> >>> having very good (much faster than hardware) write performance. If you
> >>> have a large number of domUs that are all memory-constrained and use
> >>> the disk in infrequent, large bursts, this can work out pretty well,
> >>> since the big communal pool provides a better value proposition than
> >>> giving each domU a few more megabytes of RAM.
> >>> If the OOM problem isn't something that can be fixed, it might be a
> >>> good idea to print out a warning to the user when a domain using
> >>> "file:" is started. Or, to go a step further and automatically run
> >>> "file" based domains as though "aio" was specified, possibly with a
> >>> warning and a way to override that behavior. It's not really intuitive
> >>> that "file" would cause crashes.
> >>> -John
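The suggested warning could look roughly like the sketch below. It is not taken from the xm/xend sources: the helper names and the example paths are hypothetical, and it assumes the usual disk spec layout of "backend:target,device,mode" (for example "file:..." or "tap:aio:...").

```python
def disk_backend(spec):
    """Return the backend prefix of a disk spec, e.g. 'file', 'tap' or 'phy'."""
    target = spec.split(",", 1)[0]
    return target.split(":", 1)[0]


def file_backend_warnings(disk_specs):
    """Return one warning string per disk that uses the buffered 'file:' backend."""
    warnings = []
    for spec in disk_specs:
        if disk_backend(spec) == "file":
            warnings.append(
                "warning: %r uses 'file:' (buffered writes through the dom0 "
                "page cache); consider 'tap:aio:' instead" % spec
            )
    return warnings


if __name__ == "__main__":
    # Example paths are made up for illustration.
    example = [
        "file:/var/images/guest1.img,hda,w",
        "tap:aio:/var/images/guest2.img,hda,w",
    ]
    for line in file_backend_warnings(example):
        print(line)
```

Rewriting "file:" to "tap:aio:" automatically, as proposed above, would be a policy decision on top of this check and is deliberately left out.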
> >>> _______________________________________________
> >>> Xen-devel mailing list
> >>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>> http://lists.xensource.com/xen-devel