This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] OOM problems

To: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Subject: Re: [Xen-devel] OOM problems
From: John Weekes <lists.xen@xxxxxxxxxxxxxxxxxx>
Date: Thu, 18 Nov 2010 23:27:10 -0800
Cc: Ian Pratt <Ian.Pratt@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
In-reply-to: <1290076883.6481.178.camel@ramone>
References: <4CDE44E2.2060807@xxxxxxxxxxxxxxxxxx> <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C25@xxxxxxxxxxxxxxxxxxxxxxxxx> <4CDE4C08.70309@xxxxxxxxxxxxxxxxxx> <4FA716B1526C7C4DB0375C6DADBC4EA38D80702C2E@xxxxxxxxxxxxxxxxxxxxxxxxx> <4CE1037402000078000222F0@xxxxxxxxxxxxxxxxxx> <1289814037.21694.22.camel@ramone> <4CE1751F.9020202@xxxxxxxxxxxxxxxxxx> <4CE2E163.2090809@xxxxxxxxxxxxxxxxxx> <4FA716B1526C7C4DB0375C6DADBC4EA38D80702E0E@xxxxxxxxxxxxxxxxxxxxxxxxx> <4CE450E7.9010508@xxxxxxxxxxxxxxxxxx> <1290043433.11102.1742.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE49D98.2030402@xxxxxxxxxxxxxxxxxx> <1290053337.18200.28.camel@xxxxxxxxxxxxxxxxxxxxxxx> <4CE4D285.5060500@xxxxxxxxxxxxxxxxxx> <1290076883.6481.178.camel@ramone>
Daniel, thank you for the help and in-depth information, as well as the test code off-list. The corruption problem with blktap2 O_DIRECT is easily reproducible for me on multiple machines, so I hope that we'll be able to nail this one down pretty quickly.

To follow up on my question about the potential performance difference between blktap2 without O_DIRECT and loop (both of which use the page cache), I did some tests inside a sparse file-backed domU by timing copying a folder containing 7419 files and folders totalling 1.6 GB (of mixed sizes), and found that loop returned this:

real    1m18.257s
user    0m0.050s
sys     0m6.550s

While tapdisk2 aio w/o O_DIRECT clocked in at:

real    0m55.373s
user    0m0.050s
sys     0m6.690s

With each, I saw a few more seconds of disk activity on dom0, since dirty_ratio was set to 2. I ran the tests several times and dropped caches on dom0 between each one; all of the results were within a second or two of each other.

This represents a significant ~41% performance bump for that particular workload. In light of this, I would recommend to anyone who is using "file:" that they try out tapdisk2 aio with a modified block-aio.c to remove O_DIRECT, and see how it goes. If you find results similar to mine, it might be worth modifying this into another blktap2 driver.
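
For anyone who wants to check the arithmetic behind that figure, here is a minimal sketch in Python. It only uses the two "real" times quoted above; the helper name `to_seconds` is made up for illustration. Note that the ~41% is a throughput ratio (loop time divided by tapdisk2 time), which corresponds to roughly 29% less wall-clock time.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time`-style 'XmY.YYYs' reading to seconds (hypothetical helper)."""
    return minutes * 60 + seconds

# "real" times reported above for the same copy workload
loop_s = to_seconds(1, 18.257)  # loop device
tap_s = to_seconds(0, 55.373)   # tapdisk2 aio without O_DIRECT

# Throughput ratio: how much more work per second tapdisk2 got through
speedup = loop_s / tap_s
# Fraction of wall-clock time saved relative to loop
time_saved = 1 - tap_s / loop_s

print(f"speedup: {(speedup - 1) * 100:.0f}% faster, {time_saved * 100:.0f}% less wall-clock time")
```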


On 11/18/2010 2:41 AM, Daniel Stodden wrote:
On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:
I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.

In my last bit of troubleshooting, I took O_DIRECT out of the open call
in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
that this might have eliminated the problem with corruption. I'm testing
further now, but could there be an issue with alignment (since the
kernel is apparently very strict about it with direct I/O)?
Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
yelling quite miserably in cases like that. Keeping an eye on syslog
(the daemon and kern facilities) is a generally good idea btw.
I've been doing that and haven't seen any unusual output so far, which I
guess is good.
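
As a rough illustration of the alignment rule being discussed, here is a minimal Python sketch of the kind of check one could apply to a request before issuing it with O_DIRECT. The function name and the 4096-byte unit are assumptions for illustration (the 4k figure comes from the reply above); this is not code from blktap2, and the real requirement depends on the device and filesystem. Buffer address alignment is not covered here.

```python
BLOCK_SIZE = 4096  # assumed alignment unit, taken from the "4k-aligned" remark above

def is_aligned(offset: int, length: int, block_size: int = BLOCK_SIZE) -> bool:
    """Return True if both the file offset and the transfer length
    are multiples of block_size (hypothetical helper)."""
    return offset % block_size == 0 and length % block_size == 0

# A request starting at 8 KiB and covering one 4 KiB block passes;
# one starting at a 512-byte offset does not.
print(is_aligned(8192, 4096))
print(is_aligned(512, 4096))
```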

this flag also brings back in use of the page cache, of course.)
I/O-wise it's not much different from the file:-path. Meaning it should
have carried you directly back into the Oom realm.
Does it make a difference that it's not using "loop" and instead the CPU
usage (and presumably some blocking) occurs in user-space?
It's certainly a different path taken. I just meant to say file access
has about the same properties, so you're likely back to the original

  There's not
too much information on this out there, but it seems as though the OOM
issue might be at least somewhat loop device-specific. One document that
references loop OOM problems that I found is this one:
  My initial take on it was
that it might be saying that it mattered when these things were being
done in the kernel, but now I'm not so certain --

".. [their method and loop] submit[s] [I/O requests] via a kernel thread
to the VFS layer using traditional I/O calls (read, write etc.). This
has the advantage that it should work with any file system type
supported by the Linux VFS (including networked file systems), but has
some drawbacks that may affect performance and scalability. This is
because it is hard to predict what a file system may attempt to do when
an I/O request is submitted; for example, it may need to allocate memory
to handle the request and the loopback driver has no control over this.
Particularly under low-memory or intensive I/O scenarios this can lead
to out of memory (OOM) problems or deadlocks as the kernel tries to make
memory available to the VFS layer while satisfying a request from the
block layer. "

Would there be an advantage to using blktap/blktap2 over loop, if I
leave off O_DIRECT? Would it be faster, or anything like that?
No, it's essentially the same thing. Both blktap and loopdevs sit on the
vfs in a similar fashion, without O_DIRECT even more so. The deadlocking
and OOM hazards are also the same, btw.

Deadlocks are a fairly general problem whenever you layer two subsystems
depending on the same resource on top of each other. Both in the blktap
and loopback case the system has several opportunities to hang itself,
because there's even more stuff stacked than normal. The layers are, top
to bottom

  (1) potential caching of {tap/loop}dev writes (Xen doesn't do that)
  (2) The block device, which needs some minimum amount of memory to run
      its request queue
  (3) Cached writes on the file layer
  (4) The filesystem needs memory to launder those pages
  (5) The disk's block device, equivalent to 2.
  (6) The driver running the data transfers.

The shared resource is memory. Now consider what happens when upper
layers in combination grab everything the lower layers need to make
progress. The upper layer can't roll back, so won't get off their memory
before that happened. So we're stuck.

It shouldn't happen, the kernel has a bunch of mechanisms to prevent
that. It obviously doesn't quite work here.

That's why I'm suggesting that the most obvious fix for your case is to
limit the cache dirtying rate.

Just reducing the cpu count alone sounds like sth worth trying even on a
production box, if the current state of things already tends to take the
system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
at runtime.
That's good to hear.

The default for dirty_ratio is 20. I tried halving that to 10, but it
didn't help.
Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
a decent 30M write cache and should block all out of 24 disks after some
700M, worst case. Or so I think...
Ah, ok. I was thinking that it was global. With a small per-process
cache like that, it becomes much closer to AIO for writes, but at least
the leftover memory could still be used for the read cache.
I agree it doesn't do what you want. I have no idea why there's no
global limit, seriously.

Note that in theory, 24*2% would still approach the oom state you were
in with the log you sent. I think it's going to be less likely though.
With all guests going mad at the same time, it may still not be low
enough. In case that happens, you could resort to pumping even more
memory into dom0.
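
The numbers above can be reproduced with a small sketch like the following (Python). It follows the per-task reading of dirty_ratio given in this thread and rounds 1.5G to 1500 MB; the function name and parameters are made up for illustration, and nothing here reads or changes the actual sysctl.

```python
def dirty_budget_mb(mem_mb: float, dirty_ratio_pct: float, writers: int):
    """Return (per-writer, worst-case total) dirty page budget in MB,
    assuming dirty_ratio applies per writing task as described in the thread."""
    per_writer = mem_mb * dirty_ratio_pct / 100
    return per_writer, per_writer * writers

# dom0 with ~1.5G of memory and 24 disks all writing at once
per_writer, total = dirty_budget_mb(1500, 2, 24)
print(per_writer, total)

# The earlier setting of 10 would allow far more dirty data than dom0 has memory
per_writer_10, total_10 = dirty_budget_mb(1500, 10, 24)
print(per_writer_10, total_10)
```

With dirty_ratio at 2 this gives the 30M per-writer cache and the roughly 700M worst case mentioned above; at 10 the worst-case total exceeds the 1.5G available, which fits with that setting not helping.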


Xen-devel mailing list
