
Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

To: Dario Faggioli <raistlin@xxxxxxxx>
From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Date: Thu, 3 May 2012 14:41:47 +0100
Cc: Andre Przywara <andre.przywara@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>
Delivery-date: Thu, 03 May 2012 13:43:41 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 02/05/12 17:30, Dario Faggioli wrote:

+
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;

Why are "nodes_policy" and "num_nodes_policy" not passed in along with
b_info?

That was my first implementation. Then I figured out that I want to do
the placement in _xl_, not in _libxl_, so I really don't need to muck up
build info with placement related stuff. Should I use b_info anyway,
even if I don't need these fields while in libxl?

Ah right -- yeah, probably since b_info is a libxl structure, you
shouldn't add it in there. But in that case you should probably add
another xl-specific structure and pass it through, rather than having
global variables, I think. It's only used in the handful of placement
functions, right?
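A minimal sketch of the "xl-specific structure" idea, for illustration only. The struct name, the helper names and the NODES_POLICY_FFIT value are hypothetical; only nodes_policy, num_nodes_policy and NODES_POLICY_DEFAULT appear in the patch, and the enum below is a stand-in for however the patch defines them.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in policy values; NODES_POLICY_FFIT is a hypothetical name. */
enum {
    NODES_POLICY_DEFAULT = 0,
    NODES_POLICY_FFIT
};

/* Hypothetical xl-only container replacing the two file-scope globals. */
struct xl_placement_cfg {
    int nodes_policy;      /* policy chosen while parsing the config */
    int num_nodes_policy;  /* number of nodes requested, 0 if unset */
};

/* Give the structure the same defaults the globals had. */
void xl_placement_cfg_init(struct xl_placement_cfg *cfg)
{
    assert(cfg != NULL);
    cfg->nodes_policy = NODES_POLICY_DEFAULT;
    cfg->num_nodes_policy = 0;
}

/* Stand-in for the config parser: records what the user asked for. */
void xl_placement_cfg_set(struct xl_placement_cfg *cfg,
                          int policy, int num_nodes)
{
    assert(cfg != NULL && num_nodes >= 0);
    cfg->nodes_policy = policy;
    cfg->num_nodes_policy = num_nodes;
}

/* Stand-in for a placement function: it reads the settings from its
 * argument instead of from global state. */
int xl_placement_nodes_wanted(const struct xl_placement_cfg *cfg,
                              int nr_nodes)
{
    assert(cfg != NULL && nr_nodes > 0);
    if (cfg->num_nodes_policy > 0 && cfg->num_nodes_policy <= nr_nodes)
        return cfg->num_nodes_policy;
    return 1; /* nothing usable requested: start from a single node */
}
```

The parser would fill one such structure and hand it to the handful of placement functions, so libxl's build info stays untouched.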

Sounds definitely nicer. I just did it like that because I found a very
similar example in xl itself, but I'm open about changing this to
whatever you and libxl maintainers reach a consensus on. :-)

Right. This is always a bit tricky, balancing your own taste for how to
do things, and following the style of the code that you're modifying.

Also, is it really necessary for a VM to have an equal amount of memory
on every node? It seems like it would be better to have 75% on one node
and 25% on a second node than to have 25% on four nodes, for example.
Furthermore, insisting on an even amount fragments the node memory
further -- i.e., if we chose to have 25% on four nodes instead of 75% on
one and 25% on another, that will make it harder for another VM to fit
on a single node as well.

Ok, that is something quite important to discuss. What you propose makes
a lot of sense, although some issues come to my mind:

- which percent should I try, and in what order? I mean, 75%/25%
    sounds reasonable, but maybe also 80%/20% or even 60%/40% helps your
    point.

I had in mind no constraints at all on the ratios -- basically, if you
can find N nodes such that the sum of free memory is enough to create
the VM, even 99%/1%, then go for that rather than looking for N+1.
Obviously finding a more balanced option would be better. One option
would be to scan through finding all sets of N nodes that will satisfy
the criteria, and then choose the most "balanced" one. That might be
more than we need for 4.2, so another option would be to look for evenly
balanced nodes first, then if we don't find a set, look for any set.
(That certainly fits with the "first fit" description!)
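One way to sketch the "smallest N with no ratio constraint" idea: sort the nodes by free memory, largest first, and accumulate until the guest fits. This is an illustration under assumptions, not the patch's code; struct node_free and min_nodes_for_memory() are hypothetical names, and the sketch ignores the "most balanced set" refinement.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-node record: node id and its free memory in KiB. */
struct node_free {
    int node;
    uint64_t free_kb;
};

/* qsort comparator: largest free memory first. */
static int cmp_free_desc(const void *a, const void *b)
{
    const struct node_free *x = a, *y = b;

    if (x->free_kb < y->free_kb)
        return 1;
    if (x->free_kb > y->free_kb)
        return -1;
    return 0;
}

/*
 * Sort nodes[] in place (largest free memory first) and return the
 * smallest N such that the first N entries together hold need_kb.
 * Returns -1 if even all nodes together are not enough.
 * No constraint is placed on how the memory is split among the N nodes.
 */
int min_nodes_for_memory(struct node_free *nodes, int nr_nodes,
                         uint64_t need_kb)
{
    uint64_t sum = 0;
    int i;

    assert(nodes != NULL && nr_nodes > 0);
    qsort(nodes, (size_t)nr_nodes, sizeof(*nodes), cmp_free_desc);

    for (i = 0; i < nr_nodes; i++) {
        sum += nodes[i].free_kb;
        if (sum >= need_kb)
            return i + 1;
    }
    return -1;
}
```

Taking the largest nodes first guarantees that no smaller set of nodes can cover the request, but the resulting split can be very uneven.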

- suppose I go for 75%/25%, what about the scheduling of the VM?

Haha -- yeah, for a research paper, you'd probably implement some kind
of lottery scheduling algorithm that would schedule it on one node 75%
of the time and another node 25% of the time. :-) But I think that just
making the node affinity equal to both of them will be good enough for
now. There will be some variability in performance, but there will be
some of that anyway depending on what node's memory the guest happens to
use more.

This is actually kind of a different issue, but I'll bring it up now
because it's related. (Something to toss in for thinking about in 4.3
really.) Suppose there are 4 cores and 16GiB per node, and a VM has 8
vcpus and 8GiB of RAM. The algorithm you have here will attempt to put
4GiB on each of two nodes (since it will take 2 nodes to get 8 cores).
However, it's pretty common for larger VMs to have way more vcpus than
they actually use at any one time. So it might actually have better
performance to put all 8GiB on one node, and set the node affinity
accordingly. In the rare event that more than 4 vcpus are active, a
handful of vcpus will have all remote accesses, but the majority of the
time, all of the cpus will have local accesses. (OTOH, maybe that
should be only a policy thing that we should recommend in the
documentation...)

Please, don't get me wrong, I see your point and really think it makes
sense. I've actually thought along the same line for a while, but then I
couldn't find answers to the questions above.

That's why I kind of fell back on Xen's default "striped" approach
(although on as few nodes as possible, which is _much_ better than
Xen's actual default!). It looked simple enough to write, read and
understand, while still providing statistically consistent performance.

Dude, this is open source.  Be opinionated. ;-)

What do you think of my suggestions above?

Hmm -- if I'm reading this right, the only time the nodemap won't be
all nodes is if (1) the user specified nodes, or (2) there's a
cpumask in effect. If we're going to override that setting, wouldn't
it make sense to just expand to all numa nodes?

As you wish, the whole "what to do if what I've been provided with
doesn't work" is in the *wild guess* status, meaning I tried to figure
out what would be best to do, but I might well be far from the actual
correct solution, provided there is one.

Trying to enlarge the nodemap step by step could potentially yield
better performance, but is probably not so close to the "least surprise"
principle one should use when designing UIs. :-(

Hmm -- though I suppose what you'd really want to try is adding each
node in turn, rather than one at a time (i.e., if the cpus are pinned to
nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before
trying [1,2,3,4]).
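The "try each one-node extension before growing further" idea can be sketched with a bitmask nodemap. This is only an illustration: the function name and the bitmask representation are assumptions (the patch uses its own nodemap type), and nodes are numbered from 0 here, so the [2,3] example above becomes bits 1 and 2.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given a nodemap 'base' (bit i set means node i is
 * in the map), write every map obtained by adding exactly one missing
 * node into out[], in increasing node order.
 * Returns the number of candidates written (at most max_out).
 */
int nodemap_grow_by_one(unsigned int base, int nr_nodes,
                        unsigned int *out, int max_out)
{
    int node, count = 0;

    assert(out != NULL);
    assert(nr_nodes > 0 && nr_nodes <= 32);

    for (node = 0; node < nr_nodes && count < max_out; node++) {
        unsigned int bit = 1u << node;

        if (base & bit)
            continue;            /* node already in the map */
        out[count++] = base | bit;
    }
    return count;
}
```

A caller would try each candidate in turn and only call the helper again on the candidates once all of them have failed, which is where the extra complexity mentioned below comes from.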

Yep, that makes a real lot of sense, thanks! I can definitely try doing
that, although it will complicate the code a bit...

But that's starting to get really complicated -- I
wonder if it's better to just fail and let the user change the pinnings
/ configuration node mapping.

Well, that will probably be the least surprising behaviour.

Again, just let me know what you think is best among the various
alternatives and I'll go for it.

I think if the user specifies a nodemap, and that nodemap doesn't have
enough memory, we should throw an error.

If there's a node_affinity set, no memory on that node, but memory on a
*different* node, what will Xen do? It will allocate memory on some
other node, right?

So ATM even if you specify a cpumask, you'll get memory on the masked
nodes first, and then memory elsewhere (probably in a fairly random
manner); but as much of the memory as possible will be on the masked
nodes. I wonder then if we shouldn't just keep that behavior -- i.e.,
if there's a cpumask specified, just return the nodemask from that mask,
and let Xen put as much as possible on that node and let the rest fall
where it may.
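Deriving "the nodemask from that mask" could look roughly like the sketch below. It assumes a plain cpu-to-node lookup table and 64-bit/32-bit bitmasks; the function name and both representations are hypothetical, not what xl or libxl actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return a nodemask with bit n set for every node n
 * that hosts at least one cpu selected in 'cpumask'.
 * cpu_to_node[c] gives the node of cpu c (assumed to be known already).
 */
uint32_t nodemask_from_cpumask(uint64_t cpumask, const int *cpu_to_node,
                               int nr_cpus)
{
    uint32_t nodemask = 0;
    int cpu;

    assert(cpu_to_node != NULL);
    assert(nr_cpus > 0 && nr_cpus <= 64);

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (!(cpumask & ((uint64_t)1 << cpu)))
            continue;            /* cpu not in the mask */
        assert(cpu_to_node[cpu] >= 0 && cpu_to_node[cpu] < 32);
        nodemask |= (uint32_t)1 << cpu_to_node[cpu];
    }
    return nodemask;
}
```

The result would only be used as the node affinity; whether the memory actually fits on those nodes is left to Xen, as described above.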


What do you think?

+
+        if (use_cpus>= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }

Hmm -- there's got to be a better way to find out the minimum number of
nodes to house a given number of vcpus than just starting at 1 and
re-trying until we have enough.

+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);

Same comment as above.

I'm not sure I'm getting this. The whole point here is let's consider
free memory on the various nodes first, and then adjust the result if
some other constraints are being violated.

Sorry, wrong above -- I meant the other comment about
__add_nodes_to_nodemap(). :-)

However, if what you mean is I could check beforehand whether or not the
user provided configuration will give us enough CPUs and avoid testing
scenarios that are guaranteed to fail, then I agree and I'll reshape the
code to look like that. This triggers the heuristics re-designing stuff
from above again, as one has to decide what to do if the user asks for
"nodes=[1,3]" and I discover (earlier) that I need one more node for
having enough CPUs (I mean, what node should I try first?).

No, that's not exactly what I meant. Suppose there are 4 cores per
node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other
parameters. If I'm reading your code right, what it will do is first
try to find a set of 1 node that will satisfy the constraints, then 2
nodes, then 3 nodes, then 4, &c. Since there are at most 4 cores per
node, we know that 1, 2, and 3 nodes are going to fail, regardless of
how much memory there is or how many cpus are offline. So why not just
start with 4, if the user hasn't specified anything? Then if 4 doesn't
work (either because there's not enough memory, or some of the cpus are
offline), then we can start bumping it up to 5, 6, &c.
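The lower bound described above is just a ceiling division. The sketch below shows it together with a retry loop that starts from that bound; all names are hypothetical, and fits_if_at_least() is a toy stand-in for the real "does the domain fit on n nodes" check in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of nodes that can supply max_vcpus cpus when each node
 * has at most cores_per_node of them: ceil(max_vcpus / cores_per_node). */
int min_nodes_for_vcpus(int max_vcpus, int cores_per_node)
{
    assert(max_vcpus > 0 && cores_per_node > 0);
    return (max_vcpus + cores_per_node - 1) / cores_per_node;
}

/* Hypothetical callback: nonzero if the domain fits on n nodes. */
typedef int (*fit_fn)(int n, void *opaque);

/* Toy stand-in for the real check: succeeds once n reaches *opaque. */
int fits_if_at_least(int n, void *opaque)
{
    return n >= *(int *)opaque;
}

/*
 * Start from the cpu-derived lower bound instead of 1, and only then
 * bump the node count. Returns the node count that worked, or -1.
 */
int place_from_lower_bound(int max_vcpus, int cores_per_node, int nr_nodes,
                           fit_fn try_fit, void *opaque)
{
    int n;

    assert(try_fit != NULL && nr_nodes > 0);
    for (n = min_nodes_for_vcpus(max_vcpus, cores_per_node);
         n <= nr_nodes; n++) {
        if (try_fit(n, opaque))
            return n;
    }
    return -1;
}
```

With 16 vcpus and 4 cores per node the loop starts at 4, so the passes for 1, 2 and 3 nodes that are bound to fail are never attempted.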

That's what I was getting at -- but again, if it makes it too
complicated, trading a bit of extra passes for a significant chunk of
your debugging time is OK. :-)

So, I'm not entirely sure I answered your question but the point is your
idea above is the best one: if you ask for something and we don't manage
to get it done, just stop and let you figure things out.
I've only one question about this approach, what if the automatic
placement is/becomes the default? I mean, avoiding any kind of fallback
(which again, makes sense to me in case the user is explicitly asking
something specific) would mean a completely NUMA-unaware VM creation can
be aborted even if the user did not say anything... How do we deal with
this?

Well, if the user didn't specify anything, then we can't contradict
anything he specified, right? :-) If the user doesn't specify anything,
and the default is "numa=auto", then I think we're free to do whatever
we think is best regarding NUMA placement; in fact, I think we should
try to avoid failing VM creation if it's at all possible. I just meant
what I think we should do if the user asked for specific NUMA nodes or a
specific number of nodes. (I think that cpu masks should probably
behave as it does now -- set the numa_affinity, but not fail domain
creation if there's not enough memory on those nodes.)
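The three cases above can be summarized as a small decision function. This is a sketch of the proposal as stated in this mail, not of any existing xl behaviour; every enum and function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* What the user asked for, as far as placement is concerned. */
enum placement_request {
    PLACE_AUTO,      /* nothing specified, default automatic placement */
    PLACE_CPUMASK,   /* a cpu mask was given */
    PLACE_NODES      /* specific nodes or a number of nodes was given */
};

/* What to do once we know whether the request can be satisfied. */
enum placement_action {
    ACT_USE_PLACEMENT,   /* the request fits: use it as is */
    ACT_FAIL_CREATION,   /* explicit request cannot be met: error out */
    ACT_AFFINITY_ONLY,   /* set the affinity, let Xen spill the rest */
    ACT_BEST_EFFORT      /* automatic mode: do whatever works */
};

enum placement_action placement_decide(enum placement_request req,
                                       bool fits)
{
    if (fits)
        return ACT_USE_PLACEMENT;

    switch (req) {
    case PLACE_NODES:
        return ACT_FAIL_CREATION;   /* don't contradict the user */
    case PLACE_CPUMASK:
        return ACT_AFFINITY_ONLY;   /* behave as it does now */
    case PLACE_AUTO:
        return ACT_BEST_EFFORT;     /* avoid failing VM creation */
    }
    return ACT_FAIL_CREATION;       /* unknown request value */
}
```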

It seems like we have a number of issues here that would be good for
more people to come in on -- what if I attempt to summarize the
high-level decisions we're talking about so that it's easier for more
people to comment on them?


 -George

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
...

This should be in its own patch.

Ok.

Thanks a lot again for taking a look!

Regards,
Dario


