
Re: [Xen-devel] Self-ballooning question / cache issue

Hi Dan,

> > I have been testing autoballooning on a production Xen system today
> > (with cleancache + frontswap on Xen-provided tmem).  For most of the
> > idle or CPU-centric VMs it seems to work just fine.
> > 
> > However, on one of the web-serving VMs, there is also a cron job running
> > every few minutes which runs over a rather large directory (plus, this
> > directory is on OCFS2 so this is a rather time-consuming process).  Now,
> > if the dcache/inode cache is large enough (which it was before, since
> > the VM got allocated 4 GB and is only using 1-2 most of the time), this
> > was not a problem.
> > 
> > Now, with self-ballooning, the memory gets reduced to somewhere between 1
> > and 2 GB and after a few minutes the load is going through the ceiling.
> > Jobs reading through said directories are piling up (stuck in D state,
> > waiting for the FS).  And most of the time kswapd is spinning at 100%.
> > If I deactivate self-ballooning and assign the VM 3 GB, everything goes
> > back to normal after a few minutes. (and, "ls -l" on said directory is
> > served from the cache again).
> > 
> > Now, I am aware that said problem is a self-made one.  The directory was
> > not actually supposed to contain that many files and the next job not
> > waiting for the previous job to terminate is cause for trouble - but
> > still, I would consider this a possible regression since it seems
> > self-ballooning is constantly thrashing the VM's caches.  Not all caches
> > can be saved in cleancache.
> > 
> > What about an additional tunable: a user-specified amount of pages that
> > is added on top of the computed target number of pages?  This way, one
> > could manually reserve a bit more room for other types of caches. (in
> > fact, I might try this myself, since it shouldn't be too hard to do so)
> > 
> > Any opinions on this?
> Thanks for doing this analysis.  While your workload is a bit
> unusual, I agree that you have exposed a problem that will need
> to be resolved.  It was observed three years ago that the next
> "frontend" for tmem could handle a cleancache-like mechanism
> for the dcache.  Until now, I had thought that this was purely
> optional and would yield only a small performance improvement.
> But with your workload, I think the combination of the facts that
> selfballooning is forcing out dcache entries and they aren't
> being saved in tmem is resulting in the problem you are seeing.

Yes.  In fact, I've been rolling out selfballooning across a development
system and most VMs were just fine with the default.  The overall memory
savings from going from a static to a dynamic memory allocation are quite
significant - without the VMs having to resort to actual to-disk paging
when there is a sudden increase in memory usage.  Quite nice.

Just for information: The filesystem which this machine was using is
OCFS2 (shared across 5 VMs) and the directory contains 45k files
(*cough* - I'm aware that's not optimal, and I'm currently talking to the
dev of that application about not scanning the entire list of files every
minute).  A full scan takes a few minutes (especially stat'ing every file).

I have been observing that kswapd seems rather busy at times on some
VMs, even when there is no actual swapping taking place (or could it
be frontswap, or just page reclaim?).  This can be mitigated by increasing
the memory reserve a bit using my trivial test patch (see below).

> I think the best solution for this will be a "cleandcache"
> patch in the Linux guest... but given how long it has taken
> to get cleancache and frontswap into the kernel (and the fact
> that a working cleandcache patch doesn't even exist yet), I
> wouldn't hold my breath ;-)  I will put it on the "to do"
> list though.

That sounds nice!

> Your idea of the tunable is interesting (and patches are always
> welcome!) but I am skeptical that it will solve the problem
> since I would guess the Linux kernel is shrinking dcache
> proportional to the size of the page cache.  So adding more
> RAM with your "user-specified amount of pages that is
> added on top of the computed target number of pages",
> the RAM will still be shared across all caches and only
> some small portion of the added RAM will likely be used
> for dcache.

That's true.  In fact, I have to add about 1 GB of memory in order to
keep the relevant dcache / inode cache entries in the cache.
When I do that the largest portion of memory is still eaten up by the
regular page cache.  So this is more of a workaround than a solution,
but for now it works.

I've attached the simple patch I've whipped up below.

> However, if you have a chance to try it, I would be interested
> in your findings.  Note that you already can set a
> permanent floor for selfballooning ("min_usable_mb") or,
> of course, just turn off selfballooning altogether.

Sure, that's always a possibility.  However, the VM already had an
overly large amount of memory before to avoid the problem.  Now it runs
with less memory (still a bit more than required), and when a load spike
comes, it can quickly balloon up, which is exactly what I was looking
for.

Author: Jana Saout <jana@xxxxxxxx>
Date:   Sun Apr 29 22:09:29 2012 +0200

    Add selfballooning memory reservation tunable.

diff --git a/drivers/xen/xen-selfballoon.c b/drivers/xen/xen-selfballoon.c
index 146c948..7d041cb 100644
--- a/drivers/xen/xen-selfballoon.c
+++ b/drivers/xen/xen-selfballoon.c
@@ -105,6 +105,12 @@ static unsigned int selfballoon_interval __read_mostly = 5;
 static unsigned int selfballoon_min_usable_mb;
+/*
+ * Amount of RAM in MB to add to the target number of pages.
+ * Can be used to reserve some more room for caches and the like.
+ */
+static unsigned int selfballoon_reserved_mb;
+
 static void selfballoon_process(struct work_struct *work);
 static DECLARE_DELAYED_WORK(selfballoon_worker, selfballoon_process);
@@ -217,7 +223,8 @@ static void selfballoon_process(struct work_struct *work)
                cur_pages = totalram_pages;
                tgt_pages = cur_pages; /* default is no change */
                goal_pages = percpu_counter_read_positive(&vm_committed_as) +
-                               totalreserve_pages;
+                               totalreserve_pages +
+                               MB2PAGES(selfballoon_reserved_mb);
                /* allow space for frontswap pages to be repatriated */
                if (frontswap_selfshrinking && frontswap_enabled)
@@ -397,6 +404,30 @@ static DEVICE_ATTR(selfballoon_min_usable_mb, S_IRUGO | 
+SELFBALLOON_SHOW(selfballoon_reserved_mb, "%d\n",
+                               selfballoon_reserved_mb);
+
+static ssize_t store_selfballoon_reserved_mb(struct device *dev,
+                                            struct device_attribute *attr,
+                                            const char *buf,
+                                            size_t count)
+{
+       unsigned long val;
+       int err;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+       err = strict_strtoul(buf, 10, &val);
+       if (err)
+               return -EINVAL;
+       selfballoon_reserved_mb = val;
+       return count;
+}
+
+static DEVICE_ATTR(selfballoon_reserved_mb, S_IRUGO | S_IWUSR,
+                  show_selfballoon_reserved_mb,
+                  store_selfballoon_reserved_mb);
+
 SELFBALLOON_SHOW(frontswap_selfshrinking, "%d\n", frontswap_selfshrinking);
@@ -480,6 +511,7 @@ static struct attribute *selfballoon_attrs[] = {
+       &dev_attr_selfballoon_reserved_mb.attr,

Xen-devel mailing list


