
Re: [Xen-devel] [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes

On 07/19/2012 04:22 PM, Dario Faggioli wrote:
On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote:

Thanks for the warm welcome.

>> ...
As you can see, the nodes with more memory are _way_ overloaded, while
the lower memory ones are underutilized. In fact the first 20 guests
didn't use the other nodes at all.
I don't care so much about the two memory-less nodes, but I'd like to
know how you came to the magic "3" in the formula:

+    return sign(3*freememkb_diff + nrdomains_diff);

Ok. The idea behind the current implementation of that heuristics is to
prefer nodes with more free memory. In fact, this leaves larger "holes",
maximizing the probability of being able to put more domain there. Of
course that means more domains exploiting local accesses, but introduces
the risk of overloading large (from a memory POV) nodes with a lot of
domains (which seems right what's happening to you! :-P).

I always assumed the vast majority of actual users/customers use comparatively small domains, something like 1-4 vCPUs and about 4 GB of RAM. So these domains are much smaller than a usual node. I'd consider a node size of 16 GB the lower boundary, with up to 128 GB as the common scenario. Sure, there are bigger or smaller machines, but I'd consider this the sweet spot.

Therefore, I wanted to balance that by putting something related to the
current load on a node into the equation. Unfortunately, I really am not
sure yet what a reasonable estimation of the actual "load on a node"
could be. Even worse, Xen does not report anything even close to that,
at least not right now. That's why I went for a quite dumb count of the
number of domains for now, waiting to find the time to implement
something more clever.

Right. So we just use the number of already pinned vCPUs as the metric. Let me see if I can change the code to really use the number of vCPUs instead of the number of domains. A domain could be UP or 8-way SMP, which makes a big difference with respect to the load on a node.
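
To make the difference between the two metrics concrete, here is a minimal sketch. The `dom_placement` struct and `count_node_load` function are hypothetical names for illustration, not libxl code, and the sketch assumes every domain is pinned to exactly one node:

```c
#include <stddef.h>

/* Hypothetical description of an already placed domain (illustration only). */
struct dom_placement {
    int node;      /* node the domain is pinned to (assumed to be exactly one) */
    int nr_vcpus;  /* number of vCPUs of the domain */
};

/*
 * Fill nr_domains[] and nr_vcpus[] (both of length nr_nodes) with the
 * per-node number of domains and the per-node number of pinned vCPUs.
 */
static void count_node_load(const struct dom_placement *doms, size_t nr_doms,
                            int *nr_domains, int *nr_vcpus, int nr_nodes)
{
    for (int n = 0; n < nr_nodes; n++) {
        nr_domains[n] = 0;
        nr_vcpus[n] = 0;
    }

    for (size_t i = 0; i < nr_doms; i++) {
        int n = doms[i].node;

        /* Skip entries that refer to a node outside the arrays. */
        if (n < 0 || n >= nr_nodes)
            continue;

        nr_domains[n]++;
        nr_vcpus[n] += doms[i].nr_vcpus;
    }
}
```

With one 8-way domain on node 0 and two UP domains on node 1, the domain count ranks node 1 as the busier one (2 vs. 1), while the vCPU count ranks node 0 as the busier one (8 vs. 2).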

In the long run we need something like a per-node (or per-pCPU) load average. We cannot foresee the future, but we can assume that the past is a good indicator of it. xl top already generates such numbers on demand. But that surely is something for 4.3, I just wanted to mention it.

So, that is basically why I thought it could be a good idea to
overweight the differences in free memory wrt the differences in number
of assigned domain. The first implementation was only considering the
number of assigned domain to decide which was the best candidate between
two that were less 10% different in their amount of free memory.
However, that didn't produce a good comparison function, and thus I
rewrote it like above, with the magic 3 selected via trial and error to
mimic something similar to the old 10% rule.
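
For reference, a minimal sketch of how a comparison of this shape behaves. Only the final `sign(3*freememkb_diff + nrdomains_diff)` expression is taken from the line quoted above; the struct, the helper names and the normalization (dividing each difference by the larger of the two values) are assumptions made for illustration:

```c
#include <stdint.h>

/* Hypothetical per-candidate summary; the field names are illustrative. */
struct candidate {
    uint64_t free_memkb;  /* free memory on the candidate's nodes, in KiB */
    int nr_domains;       /* domains already placed on those nodes */
};

/* Relative difference in [-1, 1]; 0 if both values are 0 (assumed scheme). */
static double normalized_diff(double a, double b)
{
    double max = a > b ? a : b;

    if (max == 0.0)
        return 0.0;
    return (a - b) / max;
}

static int sign(double x)
{
    return (x > 0) - (x < 0);
}

/*
 * Returns -1 if c1 is the better candidate, 1 if c2 is, 0 on a tie.
 * More free memory and fewer domains both count in a candidate's favour,
 * with the memory difference weighted three times as much.
 */
static int candidate_cmp(const struct candidate *c1, const struct candidate *c2)
{
    double freememkb_diff = normalized_diff((double)c2->free_memkb,
                                            (double)c1->free_memkb);
    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);

    return sign(3 * freememkb_diff + nrdomains_diff);
}
```

Under these assumptions, a candidate with 16 GB free and 4 domains still beats one with 8 GB free and no domains: the memory term contributes 3 * -0.5 = -1.5 and the domain term only +1.0, so the result is -1.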

OK, I see. I thought about this a bit more and agree that a single heuristic formula isn't easy to find. After reading the code I consider this a bit over-engineered, but I cannot possibly complain about that after having remained silent for such a long time. So let's see what we can make out of this code. I'm just throwing out some ideas, feel free to ignore them in case they are dumb ;-)

So if you agree with the small-domain assumption, then domains easily fitting into a node are the rule, not the exception. We should handle it that way. Maybe we can also solve the complexity problem by generating only single-node candidates in the first place, and looking at alternatives only if none of these fit?

I really admire that lazy comb_next generation function, so why don't we use it in a really lazy way? I think there was already a discussion about this, I just don't remember what its outcome was. Some debugging code showed that on the above (empty) machine a 2 vCPUs/2GB domain already generated 255 candidates. That really looks like overkill, especially if we actually should focus on the 8 single-node ones.
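
The 255 is consistent with enumerating every non-empty subset of the 8 nodes (2^8 - 1 = 255), although the debugging output above does not say which subsets were generated. A small sketch of the arithmetic, with `binomial` and `count_candidates` as hypothetical helper names:

```c
#include <stdint.h>

/* Binomial coefficient C(n, k); every intermediate value is an integer. */
static uint64_t binomial(unsigned n, unsigned k)
{
    uint64_t result = 1;

    if (k > n)
        return 0;
    for (unsigned i = 1; i <= k; i++)
        result = result * (n - k + i) / i;
    return result;
}

/*
 * Number of candidates when all combinations of 1..max_size nodes
 * out of nr_nodes are generated.
 */
static uint64_t count_candidates(unsigned nr_nodes, unsigned max_size)
{
    uint64_t total = 0;

    for (unsigned k = 1; k <= max_size; k++)
        total += binomial(nr_nodes, k);
    return total;
}
```

For 8 nodes, `count_candidates(8, 1)` is 8 and `count_candidates(8, 8)` is 255, so stopping after the single-node candidates would skip 247 of them whenever one of the 8 fits.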

Maybe we can use a two-step approach? First use a simple heuristic similar to the xend one: we only consider nodes with enough free memory. Then we look for the least utilized ones: simply calculate the difference between the number of currently pinned vCPUs and the number of pCPUs. So any node with free (aka non-overcommitted) CPUs should really be considered first. After all, we don't need to care about memory latency if the domains starve for compute time and only get a fraction of each pCPU.
If you don't want to believe this, I can run some benchmarks to prove it.

If we somehow determine that this approach doesn't work (no nodes with enough free memory or more vCPUs than CPUs-per-node) we should use the sophisticated algorithm.
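
A rough sketch of what that first step could look like. The `node_info` struct and `pick_node_simple` are hypothetical names, not libxl or xend code, and the per-node numbers are assumed to have been collected elsewhere:

```c
#include <stdint.h>

/* Hypothetical per-node snapshot (illustration only). */
struct node_info {
    uint64_t free_memkb;  /* free memory on the node, in KiB */
    int nr_pcpus;         /* physical CPUs on the node */
    int nr_pinned_vcpus;  /* vCPUs already pinned to the node */
};

/*
 * Simple first step: among the nodes that have enough free memory and at
 * least as many pCPUs as the domain has vCPUs, pick the one with the most
 * spare pCPUs (pCPUs minus already pinned vCPUs).
 *
 * Returns the node index, or -1 if no node qualifies, in which case the
 * caller would fall back to the full candidate search.
 */
static int pick_node_simple(const struct node_info *nodes, int nr_nodes,
                            uint64_t need_memkb, int need_vcpus)
{
    int best = -1;
    int best_spare = 0;

    for (int i = 0; i < nr_nodes; i++) {
        int spare;

        if (nodes[i].free_memkb < need_memkb)
            continue;  /* not enough free memory */
        if (need_vcpus > nodes[i].nr_pcpus)
            continue;  /* more vCPUs than CPUs per node */

        spare = nodes[i].nr_pcpus - nodes[i].nr_pinned_vcpus;
        if (best < 0 || spare > best_spare) {
            best = i;
            best_spare = spare;
        }
    }
    return best;
}
```

Ties go to the lowest-numbered node here; a real implementation might prefer to break ties on free memory instead.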

Also consider the following: with really big machines or with odd configurations, people will probably do their pinning/placement themselves (or through external management applications). This automatic placement algorithm is aimed more at the What-is-this-NUMA-thingie-anyways people.

That all being said, this is the first time the patchset had the chance
to run on such a big system, so I'm definitely open to suggestion on how
to make that formula better in reflecting what we think it's The Right

I haven't done any measurements on this, but I guess scheduling 36 vCPUs
on 8 pCPUs has a much bigger performance penalty than any remote NUMA
access, which nowadays is much better than a few years ago, with big L3
caches, better predictors and faster interconnects.

Definitely. Consider that the guests are being pinned because that is
what XenD does and because there has been no time to properly refine the
implementation of a more NUMA-aware scheduler for 4.2. In future, as
soon as I'll have it ready, the vcpus from the overloaded nodes would
get some runtime on the otherwise idle ones, even if they're remote.

Right, that sounds good. If you have any good (read: meaningful to customers) benchmarks, I can do some experiments on my machine to fine-tune this.

Nevertheless, this is what Xen 4.2 will have, and I really think initial
placement is a very important step, and we must get the most out of
being able to do it well (as opposed to other technologies, where
something like that has to happen in the kernel/hypervisor, which
entails a lot of limitations we don't have!), and am therefore happy
about trying to do so as hard as I can.

Right. We definitely need some placement for 4.2. Let's push this in if at all possible.

I will now put in the memory again and also try to play a bit with the
amount of memory and VCPUs per guest.

Great, keep me super-posted about these things and feel free to ask
anything that comes to your mind! :-)

So changing the vCPUs and memory config didn't make any real difference. I think that is because the number of domains is considered, not the number of vCPUs. This should be fixed.

Second, I put the memory back in. I only have 24 DIMMs for 32 sockets (we have easy access to boards and CPUs, but memory we have to buy like everyone else ;-), so I have to go with this alternating setup, having four 16 GB nodes and four 8 GB nodes.

This didn't change much, so 16 guests ended up in a
4-1-0-0-4-0-4-3 setup (domains per node). The 0's and 1's were the 8 GB nodes. The guests were 2P/2GB ones.

So far.


Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
