
Re: [Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

On 02/05/12 17:30, Dario Faggioli wrote:
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;
Why are "nodes_policy" and "num_nodes_policy" not passed in along with b_info?

That was my first implementation. Then I figured out that I want to do
the placement in _xl_, not in _libxl_, so I really don't need to muck up
build info with placement related stuff. Should I use b_info anyway,
even if I don't need these fields while in libxl?
Ah right -- yeah, probably since b_info is a libxl structure, you shouldn't add it in there. But in that case you should probably add another xl-specific structure and pass it through, rather than having global variables, I think. It's only used in the handful of placement functions, right?
Sounds definitely nicer. I just did it like that because I found a very
similar example in xl itself, but I'm open about changing this to
whatever you and libxl maintainers reach a consensus on. :-)
Right. This is always a bit tricky, balancing your own taste for how to do things, and following the style of the code that you're modifying.
Also, is it really necessary for a VM to have an equal amount of memory
on every node? It seems like it would be better to have 75% on one node
and 25% on a second node than to have 25% on four nodes, for example.
Furthermore, insisting on an even amount fragments the node memory
further -- i.e., if we chose to have 25% on four nodes instead of 75% on
one and 25% on another, that will make it harder for another VM to fit
on a single node as well.

Ok, that is something quite important to discuss. What you propose makes
a lot of sense, although some issues come to mind:

- which percentages should I try, and in what order? I mean, 75%/25%
    sounds reasonable, but maybe 80%/20% or even 60%/40% would help just
    as much...
I had in mind no constraints at all on the ratios -- basically, if you can find N nodes such that the sum of free memory is enough to create the VM, even 99%/1%, then go for that rather than looking for N+1. Obviously finding a more balanced option would be better. One option would be to scan through finding all sets of N nodes that will satisfy the criteria, and then choose the most "balanced" one. That might be more than we need for 4.2, so another option would be to look for evenly balanced nodes first, then if we don't find a set, look for any set. (That certainly fits with the "first fit" description!)
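For illustration, a greedy version of that "first fit, fewest nodes" search could look something like this: sort nodes by free memory and take the biggest ones until the guest fits. All names here are made up for the sketch, not the actual xl/libxl API:

```c
#include <stdint.h>

/* Pick the smallest set of nodes whose free memory covers the guest,
 * preferring nodes with the most free memory (which also tends to keep
 * the split reasonably balanced). Returns the number of nodes chosen
 * (their indices go in chosen[]), or -1 if nothing fits. */
static int pick_nodes(const uint64_t *free_mem, int nr_nodes,
                      uint64_t need, int *chosen)
{
    int idx[64];

    /* Sort node indices by free memory, descending (nr_nodes <= 64). */
    for (int i = 0; i < nr_nodes; i++)
        idx[i] = i;
    for (int i = 0; i < nr_nodes; i++)
        for (int j = i + 1; j < nr_nodes; j++)
            if (free_mem[idx[j]] > free_mem[idx[i]]) {
                int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }

    /* Greedily accumulate nodes until the guest fits. */
    uint64_t sum = 0;
    for (int n = 0; n < nr_nodes; n++) {
        chosen[n] = idx[n];
        sum += free_mem[idx[n]];
        if (sum >= need)
            return n + 1;   /* first (i.e. smallest) fitting set */
    }
    return -1;              /* not enough free memory anywhere */
}
```

This is the "look for any set" half; a "most balanced" variant would enumerate all sets of the same size and rank them by how evenly the allocation splits.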
- suppose I go for 75%/25%, what about the scheduling of the VM?
Haha -- yeah, for a research paper, you'd probably implement some kind of lottery scheduling algorithm that would schedule it on one node 75% of the time and another node 25% of the time. :-) But I think that just making the node affinity equal to both of them will be good enough for now. There will be some variability in performance, but there will be some of that anyway depending on what node's memory the guest happens to use more.

This is actually kind of a different issue, but I'll bring it up now because it's related. (Something to toss in for thinking about in 4.3, really.) Suppose there are 4 cores and 16GiB per node, and a VM has 8 vcpus and 8GiB of RAM. The algorithm you have here will attempt to put 4GiB on each of two nodes (since it will take 2 nodes to get 8 cores). However, it's pretty common for larger VMs to have way more vcpus than they actually use at any one time. So it might actually give better performance to put all 8GiB on one node, and set the node affinity accordingly. In the rare event that more than 4 vcpus are active, a handful of vcpus will have all remote accesses, but the majority of the time, all of the cpus will have local accesses. (OTOH, maybe that should be only a policy thing that we should recommend in the documentation...)

Please, don't get me wrong, I see your point and really think it makes
sense. I've actually thought along the same lines for a while, but then I
couldn't find answers to the questions above.

That's why I kind of fell back to Xen's default "striped" approach
(although on as few nodes as possible, which is _much_ better than
Xen's actual default!). It looked simple enough to write, read and
understand, while still providing statistically consistent performance.
Dude, this is open source.  Be opinionated. ;-)

What do you think of my suggestions above?
Hmm -- if I'm reading this right, the only time the nodemap won't be all nodes is if (1) the user specified nodes, or (2) there's a cpumask in effect. If we're going to override that setting, wouldn't it make sense to just expand to all numa nodes?
As you wish, the whole "what to do if what I've been provided with
doesn't work" is in the *wild guess* status, meaning I tried to figure
out what would be best to do, but I might well be far from the actual
correct solution, provided there is one.

Trying to enlarge the nodemap step by step potentially yields
better performance, but it is probably not very close to the "least
surprise" principle one should follow when designing UIs. :-(

Hmm -- though I suppose what you'd really want to try is adding each
node in turn, rather than one at a time (i.e., if the cpus are pinned to
nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before
trying [1,2,3,4]).

Yep, that makes a lot of sense, thanks! I can definitely try doing
that, although it will complicate the code a bit...

But that's starting to get really complicated -- I
wonder if it's better to just fail and let the user change the pinnings
/ configuration node mapping.

Well, that will probably be the least surprising behaviour.

Again, just let me know what you think it's best among the various
alternatives and I'll go for it.
I think if the user specifies a nodemap, and that nodemap doesn't have enough memory, we should throw an error.

If there's a node_affinity set, no memory on that node, but memory on a *different* node, what will Xen do? It will allocate memory on some other node, right?

So ATM even if you specify a cpumask, you'll get memory on the masked nodes first, and then memory elsewhere (probably in a fairly random manner); but as much of the memory as possible will be on the masked nodes. I wonder then if we shouldn't just keep that behavior -- i.e., if there's a cpumask specified, just return the nodemask from that mask, and let Xen put as much as possible on those nodes and let the rest fall where it may.
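Deriving the nodemask from the cpumask is straightforward; a minimal sketch, assuming a hypothetical cpu_to_node[] lookup table (in the real code this information comes from the libxl topology info):

```c
/* Build the nodemask implied by a vcpu pinning mask: a node is
 * included iff at least one pinned cpu lives on it. Both masks are
 * plain bitmasks; cpu_to_node[] is a hypothetical cpu->node table. */
static unsigned nodemask_from_cpumask(unsigned cpumask,
                                      const int *cpu_to_node,
                                      int nr_cpus)
{
    unsigned nodemask = 0;

    for (int c = 0; c < nr_cpus; c++)
        if (cpumask & (1u << c))
            nodemask |= 1u << cpu_to_node[c];
    return nodemask;
}
```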

What do you think?
+        if (use_cpus >= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }
Hmm -- there's got to be a better way to find out the minimum number of
nodes to house a given number of vcpus than just starting at 1 and
re-trying until we have enough.
+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
Same comment as above.

I'm not sure I'm getting this. The whole point here is let's consider
free memory on the various nodes first, and then adjust the result if
some other constraints are being violated.
Sorry, wrong above -- I meant the other comment about __add_nodes_to_nodemap(). :-)
However, if what you mean is that I could check beforehand whether the
user-provided configuration will give us enough CPUs, and avoid testing
scenarios that are guaranteed to fail, then I agree and I'll reshape the
code to look like that. This triggers the heuristics re-design from
above again, as one has to decide what to do if the user asks for
"nodes=[1,3]" and I discover (earlier) that I need one more node to
have enough CPUs (I mean, which node should I try first?).
No, that's not exactly what I meant. Suppose there are 4 cores per node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other parameters. If I'm reading your code right, what it will do is first try to find a set of 1 node that will satisfy the constraints, then 2 nodes, then 3 nodes, then 4, &c. Since there are at most 4 cores per node, we know that 1, 2, and 3 nodes are going to fail, regardless of how much memory there is or how many cpus are offline. So why not just start with 4, if the user hasn't specified anything? Then if 4 doesn't work (either because there's not enough memory, or some of the cpus are offline), we can start bumping it up to 5, 6, &c.
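The starting point is just a ceiling division; something like this (illustrative names):

```c
/* Smallest node count that can possibly supply enough pcpus for the
 * guest, i.e. ceil(max_vcpus / cores_per_node). The node-set search
 * can start here instead of at 1 node. */
static int min_nodes_for_vcpus(int max_vcpus, int cores_per_node)
{
    return (max_vcpus + cores_per_node - 1) / cores_per_node;
}
```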

That's what I was getting at -- but again, if it makes it too complicated, trading a bit of extra passes for a significant chunk of your debugging time is OK. :-)

So, I'm not entirely sure I answered your question, but the point is
that your idea above is the best one: if you ask for something and we
don't manage to get it done, just stop and let you figure things out.
I've only one question about this approach: what if the automatic
placement is/becomes the default? I mean, avoiding any kind of fallback
(which, again, makes sense to me in case the user is explicitly asking
for something specific) would mean a completely NUMA-unaware VM creation
could be aborted even if the user did not say anything... How do we deal
with that?
Well, if the user didn't specify anything, then we can't contradict anything he specified, right? :-) If the user doesn't specify anything, and the default is "numa=auto", then I think we're free to do whatever we think is best regarding NUMA placement; in fact, I think we should try to avoid failing VM creation if it's at all possible. I just meant what I think we should do if the user asked for specific NUMA nodes or a specific number of nodes. (I think that cpu masks should probably behave as it does now -- set the numa_affinity, but not fail domain creation if there's not enough memory on those nodes.)

It seems like we have a number of issues here that would be good for more people to come in on -- what if I attempt to summarize the high-level decisions we're talking about so that it's easier for more people to comment on them?


diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c

This should be in its own patch.


Thanks a lot again for taking a look!

