
Re: [Xen-devel] [PATCH 1 of 3 v5/leftover] libxl: enable automatic placement of guests on NUMA nodes

On Fri, 2012-07-20 at 10:19 +0200, Andre Przywara wrote:
> thanks for the warm welcome.

> I always assumed the vast majority of actual users/customers use 
> comparatively small domains, something like 1-4 VCPUs and 4 GB of RAM, 
> so these domains are much smaller than a usual node. I'd consider a node 
> size of 16 GB the lower boundary, with up to 128 GB as the common 
> scenario. Sure, there are bigger and smaller machines, but I'd consider 
> this the sweet spot.
Yep, I agree this should be our target, at least for the default

> Right. So we just use the number of already pinned vCPUs as the metric. 
> Let me look at whether I can change the code to really use the number 
> of vCPUs instead of the number of domains. A domain could be UP or 
> 8-way SMP, which really makes a big difference w.r.t. the load on a node.
As we both said in the other e-mail, this is probably a good thing to
do. I'm working on a new version of the patchset that I'm going to
release later today (addressing the comments and the outcome of the
long discussion the last round generated). If you can go as far as
producing a patch, that would be great, and I guess it could make it in
even in the early -rc days.

> In the long run we need something like a per-node (or per-pCPU) load 
> average. We cannot foresee the future, but we just assume that the past 
> is a good indicator for it. xl top generates such numbers on demand 
> already. But that surely is something for 4.3, just wanted to mention it.

> After reading the code I consider this a bit 
> over-engineered, but I cannot possibly complain about this after having 
> remained silent for such a long time.

> So let's see what we can make out of this code. I'm just firing off some 
> ideas, feel free to ignore them in case they are dumb ;-)
I bet they are not!

> So if you agree with the small-domain assumption, then domains easily 
> fitting into a node are the rule, not the exception. We should handle it 
> that way. Maybe we can also solve the complexity problem by only 
> generating single-node candidates in the first place, and looking at 
> alternatives only if these don't fit?
That's exactly what I'm doing right now. You'll see the code later but,
basically, at the i-th step, I now compare all the candidates with i
nodes and, if I find at least one, I quit the whole thing and avoid
proceeding to step i+1. Of course, if I find more than one, I return the
one that is best according to the heuristics.

Given that the very first step is very likely going to be looking at
candidates with 1 node in them, here you have exactly what you're
talking about.
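For the record, the search described above can be sketched roughly like
this (a toy Python model only; all names and the data layout are made
up, the real implementation lives in libxl and is C):

```python
from itertools import combinations

def place(nodes, dom_mem, better_than):
    """Examine candidates in increasing node-count order; stop at the
    first size i that yields any valid candidate, instead of ever
    proceeding to i+1 nodes (illustrative sketch, not the libxl code)."""
    for i in range(1, len(nodes) + 1):
        best = None
        for cand in combinations(nodes, i):
            # skip candidates without enough free memory for the domain
            if sum(n["free_mem"] for n in cand) < dom_mem:
                continue
            # keep only the best candidate seen so far at this size
            if best is None or better_than(cand, best):
                best = cand
        if best is not None:
            return best  # found something with i nodes: don't try i+1
    return None  # domain does not fit anywhere
```

So a small domain that fits in one node never causes multi-node
candidates to be generated at all.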

> I really admire that lazy comb_next generation function, so why don't we 
> use it in a really lazy way? I think there was already a discussion 
> about this, I just don't remember what its outcome was.
> Some debugging code showed that on the above (empty) machine a 2 
> VCPU/2 GB domain already generated 255 candidates. That really looks 
> like overkill, especially if we should actually focus on the 8 
> single-node ones.
Yep, and in fact 255 is what it takes on 8 nodes. As said above, I
sort-of read your mind and started implementing what you wrote here
yesterday. :-P
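Just to spell out the arithmetic: with n nodes there are 2^n - 1
non-empty subsets, hence the 255 candidates on 8 nodes, while stopping
at step 1 means examining only C(8,1) = 8 of them. In Python terms:

```python
from math import comb

def total_candidates(n):
    # all non-empty subsets of n nodes: 2^n - 1
    return 2 ** n - 1

def candidates_at_step(n, i):
    # subsets containing exactly i nodes: n choose i
    return comb(n, i)
```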

> Maybe we can use a two-step approach? First use a simple heuristic 
> similar to the xend one:
> We only consider nodes with enough free memory. Then we look for the 
> least utilized ones: simply calculate the difference between the number 
> of currently pinned vCPUs and the number of pCPUs. So any node with free 
> (aka non-overcommitted) CPUs should really be considered first.
Again, with the change above, this thing you're saying here can be
achieved just by removing the memfree_diff from the comparison function
(which is no longer used during a proper sort; rather, it is called
on-line, as soon as a new candidate is found, to compare it with the
current cached best). And yes, of course, turning the domain count into
a vcpu count, as said above.

> If we somehow determine that this approach doesn't work (no nodes with 
> enough free memory or more vCPUs than CPUs-per-node) we should use the 
> sophisticated algorithm.
And again, you'll see the new code and will tell me what you think
later, but I really think I turned it into something like that. The only

> Right, that sounds good. If you have any good (read: meaningful to 
> customers) benchmarks I can do some experiments on my machine to 
> fine-tune this.
Nothing that goes that far. I tried to run specjbb2005 concurrently in
some VMs with and without placement, but I only have a very small
testbox. :-(

> > Nevertheless, this is what Xen 4.2 will have, and I really think initial
> > placement is a very important step, and we must get the most out of
> > being able to do it well (as opposed to other technologies, where
> > something like that has to happen in the kernel/hypervisor, which
> > entails a lot of limitations we don't have!), and am therefore happy
> > about trying to do so as hard as I can.
> Right. We definitely need some placement for 4.2. Let's push this in if 
> at all possible.
That's what I'm working quite hard for. :-)

Thanks and Regards,

<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

