
Re: [Xen-devel] Re: NUMA and SMP

On Jan 16, 2007, at 15:19, Petersson, Mats wrote:
There is a strong argument for making hypervisors and OSes NUMA
aware in the sense that:
1- They know about system topology.
2- They can export this information up the stack to applications.
3- They can take in directives from users and applications to
   host and place some threads and memory in specific partitions.
4- They use an interleaved (or random) initial memory placement
   strategy by default.
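As a rough illustration of point 4 (this is not Xen or Linux code; the node count and function name are invented for the sketch), an interleaved default placement amounts to a trivial per-page node chooser:

```c
#include <assert.h>

#define NUM_NODES 4   /* invented topology: four memory nodes */

/* Interleaved (round-robin) initial placement: the n-th page a
 * process allocates lands on node n mod NUM_NODES, spreading its
 * memory evenly across the machine by default. */
static int interleave_node(unsigned long page_index)
{
    return (int)(page_index % NUM_NODES);
}
```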

The argument that the OS on its own -- without user or application
directives -- can make better placement decisions than round-robin or
random placement is -- in my opinion -- flawed.

Debatable - it depends a lot on WHAT applications you expect to run and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single-threaded
process (say, a compiler), then allocating memory on the local node
should be the "first choice".

Multithreaded apps can use a similar approach: if a thread is allocating
memory, there is a good chance that the memory will be used by that
thread too [although this doesn't work for message passing between
threads, obviously; that is again a case where "knowledge from the app"
is the only solution better than "random"].

This approach is far from perfect, but given that applications often
make short-term allocations, it makes sense to allocate on the local
node if possible.
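The heuristic Mats describes is essentially "first touch" placement. A toy sketch (the fixed topology and both function names are invented here, purely for illustration):

```c
#include <assert.h>

#define CPUS_PER_NODE 2   /* invented topology: two CPUs per node */

/* Map a CPU to the node it sits on. */
static int node_of_cpu(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/* "First touch" / local placement: back the page with memory on the
 * node of the CPU whose thread faulted it in. */
static int local_node(int faulting_cpu)
{
    return node_of_cpu(faulting_cpu);
}
```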

I do not agree.

Just because a thread happens to run on processor X when
it first faults in a page off the process's heap doesn't give you
a good indication that the memory will be used mostly by
this thread, or that the thread will continue running on the
same processor. There are at least as many cases where
this assumption is invalid as where it is valid. Without any
solid indication that something else will work better, round-robin
allocation has to be the default strategy.

Also, if you allow one process to consume a large percentage
of one node's memory, you are indirectly hurting all competing
multi-threaded apps which benefit from higher total memory
bandwidth when they spread their data across nodes.

I understand your point: if a single-threaded process quickly
shrinks its heap after growing it, it is less likely to migrate to
a different processor while it is using this memory. I'm not sure
how you would predict, at allocation time, that the memory will be
quickly released though. Even if you could, I maintain you would
still need safeguards in place to balance that process' needs
with that of competing multi-threaded apps benefiting from the
memory bandwidth scaling with number of hosting nodes.

You could try to compromise and allocate round robin starting
locally, perhaps with diminishing strides as the total allocation
grows (i.e. allocate locally at first and progressively move towards
a per-page round-robin scheme as more memory is requested). I'm not
sure this would do any better than plain old dumb round robin in the
average case, but it's worth a thought.
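That compromise could look something like the following sketch (every constant here is invented; a real policy would have to tune the stride and phase length):

```c
#include <assert.h>

#define NUM_NODES   4   /* invented topology */
#define INIT_STRIDE 8   /* consecutive pages per node at the start */
#define PHASE       32  /* pages allocated per halving of the stride */

/* Start close to local allocation (long runs of pages on the local
 * node) and converge on per-page round robin as the allocation
 * grows: the stride halves every PHASE pages until it reaches 1. */
static int hybrid_node(unsigned long page_index, int local)
{
    unsigned long halvings = page_index / PHASE;
    unsigned long stride = (halvings >= 3) ? 1
                                           : (INIT_STRIDE >> halvings);
    return (int)((local + page_index / stride) % NUM_NODES);
}
```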

However, supporting NUMA in the hypervisor and forwarding arch info to
the guest would make sense. At the least, the very basic principle of
"if the guest is to run on a limited set of processors (nodes),
allocate memory for the guest from that (those) node(s)" would make a
lot of sense.
I suspect there is widespread agreement on this point.
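A sketch of that basic principle (nothing here is actual Xen code; the allowed-node set is modelled as a plain bitmask for illustration): interleave the guest's memory over only the nodes its VCPUs may run on.

```c
#include <assert.h>

/* Pick the node for the n-th guest page, cycling over the nodes set
 * in allowed_mask (bit i set means node i may be used). */
static int guest_node(unsigned long page_index, unsigned allowed_mask)
{
    int allowed[32];
    int count = 0;

    for (int i = 0; i < 32; i++)
        if (allowed_mask & (1u << i))
            allowed[count++] = i;

    if (count == 0)
        return -1;   /* no nodes allowed: configuration error */

    /* Round robin restricted to the guest's node set. */
    return allowed[page_index % count];
}
```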
