
Re: [Xen-devel] [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes

To: Dario Faggioli <raistlin@xxxxxxxx>
From: Andre Przywara <andre.przywara@xxxxxxx>
Date: Fri, 20 Jul 2012 10:19:08 +0200
Cc: Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>
Delivery-date: Fri, 20 Jul 2012 08:24:54 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 07/19/2012 04:22 PM, Dario Faggioli wrote:

On Thu, 2012-07-19 at 14:21 +0200, Andre Przywara wrote:

Dario,


thanks for the warm welcome.

>> ...

As you can see, the nodes with more memory are _way_ overloaded, while
the lower memory ones are underutilized. In fact the first 20 guests
didn't use the other nodes at all.
I don't care so much about the two memory-less nodes, but I'd like to
know how you came to the magic "3" in the formula:

+
+    return sign(3*freememkb_diff + nrdomains_diff);
+}

Ok. The idea behind the current implementation of that heuristic is to
prefer nodes with more free memory. In fact, this leaves larger "holes",
maximizing the probability of being able to put more domains there. Of
course that means more domains exploiting local accesses, but introduces
the risk of overloading large (from a memory POV) nodes with a lot of
domains (which seems to be exactly what's happening to you! :-P).

I always assumed the vast majority of actual users/customers use
comparatively small domains, something like 1-4 VCPUs and about 4 GB of
RAM. So these domains are much smaller than a usual node. I'd consider a
node size of 16 GB the lower boundary, with up to 128 GB as the common
scenario. Sure, there are bigger or smaller machines, but I'd consider
this the sweet spot.

Therefore, I wanted to balance that by putting something related to the
current load on a node into the equation. Unfortunately, I really am not
sure yet what a reasonable estimation of the actual "load on a node"
could be. Even worse, Xen does not report anything even close to that,
at least not right now. That's why I went for a quite dumb count of the
number of domains for now, waiting to find the time to implement
something more clever.

Right. So we just use the number of already pinned vCPUs as the metric.
Let me see if I can change the code to really use the number of vCPUs
instead of the number of domains. A domain could be UP or 8-way SMP,
which really makes a big difference wrt load on a node.

In the long run we need something like a per-node (or per-pCPU) load
average. We cannot foresee the future, but we just assume that the past
is a good indicator for it. xl top generates such numbers on demand
already. But that surely is something for 4.3, just wanted to mention it.

So, that is basically why I thought it could be a good idea to
overweight the differences in free memory wrt the differences in number
of assigned domains. The first implementation was only considering the
number of assigned domains to decide which was the best candidate between
two that were less than 10% different in their amount of free memory.
However, that didn't produce a good comparison function, and thus I
rewrote it like above, with the magic 3 selected via trial and error to
mimic something similar to the old 10% rule.
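
For readers following along, here is a minimal sketch of a comparison
function of this shape. Only the return expression comes from the quoted
patch; the struct, its field names, candidate_cmp() and the way
normalized_diff() scales its arguments are assumptions made for
illustration, not the libxl code.

```c
#include <stdint.h>

/* Hypothetical candidate summary; field names are illustrative only. */
struct candidate {
    uint64_t free_memkb;   /* free memory on the candidate's nodes, in kB */
    int nr_domains;        /* domains already assigned to those nodes */
};

/* Assumed normalization: relative difference in [-1, 1], 0 if both are 0. */
static double normalized_diff(double a, double b)
{
    double max = a > b ? a : b;
    return max == 0.0 ? 0.0 : (a - b) / max;
}

static int sign(double x)
{
    return (x > 0.0) - (x < 0.0);
}

/*
 * qsort()-style result: negative means c1 sorts first, i.e. c1 is the
 * better candidate.
 */
static int candidate_cmp(const struct candidate *c1,
                         const struct candidate *c2)
{
    /* Negative when c1 has more free memory than c2. */
    double freememkb_diff = normalized_diff((double)c2->free_memkb,
                                            (double)c1->free_memkb);
    /* Negative when c1 hosts fewer domains than c2. */
    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);

    /* Free memory is weighted three times as heavily as domain count. */
    return sign(3 * freememkb_diff + nrdomains_diff);
}
```

Under these assumptions, a candidate with 16 GB free and 4 domains still
beats one with 8 GB free and no domains (3 * -0.5 + 1 = -0.5), which shows
how the memory term can dominate the load term.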

OK, I see. I thought about this a bit more and agree a single heuristic
formula isn't easy to find. After reading the code I consider this a bit
over-engineered, but I cannot possibly complain about this after having
remained silent for such a long time.
So let's see what we can make out of this code. I am just firing off some
ideas, feel free to ignore them in case they are dumb ;-)

So if you agree to the small-domain assumption, then domains easily
fitting into a node are the rule, not the exception. We should handle it
that way. Maybe we can also solve the complexity problem by only
generating single node candidates in the first place and only if these
don't fit look at alternatives?

I really admire that lazy comb_next generation function, so why don't we
use it in a really lazy way? I think there was already a discussion
about this, I just don't remember what its outcome was.
Some debugging code showed that on the above (empty) machine a 2
VCPUs/2GB domain already generated 255 candidates. That really looks
like overkill, especially if we actually should focus on the 8
single-node ones.
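
The 255 figure matches the number of non-empty subsets of 8 nodes
(2^8 - 1), of which only 8 are single-node. A minimal sketch of
size-ordered, lazy enumeration follows; next_combination() and
count_candidates() are hypothetical helpers written for this note and are
not the comb_next code from the patch.

```c
#include <stdbool.h>

/*
 * Advance comb[0..k-1], a strictly increasing list of node indexes in
 * [0, n), to the next k-combination in lexicographic order.
 * Returns false when comb already held the last combination.
 */
static bool next_combination(int *comb, int k, int n)
{
    int i = k - 1;

    /* Find the rightmost position that can still be incremented. */
    while (i >= 0 && comb[i] == n - k + i)
        i--;
    if (i < 0)
        return false;

    comb[i]++;
    for (int j = i + 1; j < k; j++)
        comb[j] = comb[j - 1] + 1;
    return true;
}

/*
 * Count candidates of sizes min_nodes..max_nodes, smallest size first.
 * A real placement loop would evaluate each combination where the counter
 * is incremented, and could stop after the first size that yields a fit.
 */
static unsigned long count_candidates(int n, int min_nodes, int max_nodes)
{
    int comb[64];
    unsigned long count = 0;

    for (int k = min_nodes; k <= max_nodes && k <= n && k <= 64; k++) {
        for (int i = 0; i < k; i++)
            comb[i] = i;          /* first combination: 0, 1, ..., k-1 */
        do {
            count++;
        } while (next_combination(comb, k, n));
    }
    return count;
}
```

With 8 nodes, count_candidates(8, 1, 8) visits 255 combinations, while
stopping after size 1 visits only 8.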

Maybe we can use a two-step approach? First use a simple heuristic
similar to the xend one:
We only consider nodes with enough free memory. Then we look for the
least utilized ones: simply calculate the difference between the number
of currently pinned vCPUs and the number of pCPUs. So any node with free
(aka non-overcommitted) CPUs should really be considered first.
After all we don't need to care about memory latency if the domains
starve for compute time and only get a fraction of each pCPU.

If you don't want to believe this, I can run some benchmarks to prove this.

If we somehow determine that this approach doesn't work (no nodes with
enough free memory or more vCPUs than CPUs-per-node) we should use the
sophisticated algorithm.
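
A minimal sketch of this two-step idea, including the fallback signal,
could look like the following. The node_info struct and
pick_single_node() are hypothetical names for illustration; neither xend
nor libxl is claimed to implement it this way.

```c
#include <stdint.h>

/* Hypothetical per-node summary; not a libxl structure. */
struct node_info {
    uint64_t free_memkb;   /* free memory on the node, in kB */
    int nr_pcpus;          /* physical CPUs belonging to the node */
    int nr_pinned_vcpus;   /* vCPUs already pinned to the node */
};

/*
 * Return the index of the preferred node, or -1 if no single node is
 * suitable and the caller should fall back to the full candidate search.
 */
static int pick_single_node(const struct node_info *nodes, int nr_nodes,
                            uint64_t dom_memkb, int dom_vcpus)
{
    int best = -1;
    int best_spare = 0;

    for (int i = 0; i < nr_nodes; i++) {
        /* Step 1: skip nodes that cannot hold the domain at all. */
        if (nodes[i].free_memkb < dom_memkb || nodes[i].nr_pcpus < dom_vcpus)
            continue;

        /* Step 2: prefer the node with the most spare pCPUs; a negative
         * value means the node is already overcommitted. */
        int spare = nodes[i].nr_pcpus - nodes[i].nr_pinned_vcpus;
        if (best < 0 || spare > best_spare) {
            best = i;
            best_spare = spare;
        }
    }
    return best;
}
```

Ties go to the lowest-numbered node here, which is an arbitrary choice; a
real implementation might break ties on free memory instead.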

Also consider the following: With really big machines or with odd
configurations people will probably do their pinning/placement
themselves (or by external mgmt applications).
What this automatic placement algorithm is good for is more the
What-is-this-NUMA-thingie-anyways people.


That all being said, this is the first time the patchset had the chance
to run on such a big system, so I'm definitely open to suggestions on how
to make that formula better at reflecting what we think is The Right
Thing!

I haven't done any measurements on this, but I guess scheduling 36 vCPUs
on 8 pCPUs has a much bigger performance penalty than any remote NUMA
access, which nowadays is much better than a few years ago, with big L3
caches, better predictors and faster interconnects.

Definitely. Consider that the guests are being pinned because that is
what XenD does and because there has been no time to properly refine the
implementation of a more NUMA-aware scheduler for 4.2. In future, as
soon as I have it ready, the vcpus from the overloaded nodes will
get some runtime on the otherwise idle ones, even if they're remote.

Right, that sounds good. If you have any good (read: meaningful to
customers) benchmarks I can do some experiments on my machine to
fine-tune this.

Nevertheless, this is what Xen 4.2 will have, and I really think initial
placement is a very important step, and we must get the most out of
being able to do it well (as opposed to other technologies, where
something like that has to happen in the kernel/hypervisor, which
entails a lot of limitations we don't have!), and am therefore happy
about trying to do so as hard as I can.

Right. We definitely need some placement for 4.2. Let's push this in if
at all possible.

I will now put in the memory again and also try to play a bit with the
amount of memory and VCPUs per guest.

Great, keep me super-posted about these things and feel free to ask
anything that comes to your mind! :-)

So changing the VCPUs and memory config didn't make any real difference.
I think that is because the number of domains is considered, not the
number of vCPUs. This should be fixed.

Second, I inserted the memory again. I only have 24 DIMMs for 32 sockets
(we have easy access to boards and CPUs, but memory we have to buy like
everyone else ;-), so I have to go with this alternating setup, having
four 16 GB nodes and four 8 GB nodes.


This didn't change much, so 16 guests ended up with:

4-1-0-0-4-0-4-3 setup (domains per node). The 0's or 1's were the 8 GB
nodes. The guests were 2P/2GB ones.


So far.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12


