
RE: [Xen-devel] Re: NUMA and SMP

> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xxxxxxxxxxxxx] 
> Sent: 16 January 2007 13:56
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> On the topic of NUMA:
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.

I'm not saying that it can ALWAYS make good decisions, but it has a
better chance than software that just places things in the "first
available" spot.

> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.

I don't disagree with that. 
> To illustrate, consider an app which lays out a bunch of data in memory
> in a single thread and then spawns worker threads to process it.

That's a good example of a hard nut to crack - not easily solved in the
OS, that's for sure.
> Is the OS to place memory close to the initial thread? How can it
> possibly know how many threads will eventually process the data?
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will work on
> an assigned data subset (and if so, which one) or if it will act as a
> pipeline stage with all the data being passed from one thread to the
> next.
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.

Sure, I certainly wasn't suggesting memory migration. 

However, there is a case where NUMA information COULD be helpful: when
the system is paging in, it could try to place the page on the local
node rather than a "random" one [although without knowing what the
future holds, this could be wrong - as any strategy without knowledge of
the future could be]. Of course, I wouldn't disagree if you said "the
system probably has too little memory if it's paging"!
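To make the idea concrete, here is a toy sketch (not any real kernel or Xen interface - the `Node` and `place_page` names are invented) of the page-in policy suggested above: prefer the faulting CPU's local node, and fall back to rotating over the remaining nodes only when the local one is exhausted.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    free_pages: int

def place_page(local_node: Node, all_nodes: list[Node],
               rr_counter: int) -> tuple[int, int]:
    """Pick a node for a paged-in page; return (node id, new rr counter)."""
    if local_node.free_pages > 0:
        # First choice: the node local to the faulting CPU.
        local_node.free_pages -= 1
        return local_node.node_id, rr_counter
    # Local node exhausted: round-robin over nodes that still have memory.
    candidates = [n for n in all_nodes if n.free_pages > 0]
    if not candidates:
        raise MemoryError("no free pages on any node")
    chosen = candidates[rr_counter % len(candidates)]
    chosen.free_pages -= 1
    return chosen.node_id, rr_counter + 1
```

As the bracketed caveat says, this is only a first-choice heuristic: nothing here predicts which CPU will actually touch the page next.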

> And it gets worse for you when your app is successfully taking
> advantage of the memory cache hierarchy because its performance is
> less impacted by raw memory latency and bandwidth.

> Things also get more difficult on a time-sharing host with competing
> apps.

> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications and
>      users
> 3- They can take in directives from users and applications to
>      partition the host and place some threads and memory in specific
>      partitions.
> 4- They use an interleaved (or random) initial memory placement
>      strategy by default.
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.

Debatable - it depends a lot on WHAT applications you expect to run, and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single-threaded
process (say, a compiler), then allocating memory on the local node
should be the "first choice".

Multithreaded apps can use a similar approach: if a thread is allocating
memory, there's a good chance that the memory will mostly be used by
that thread too [although this doesn't work for message passing between
threads, obviously - again a case where "knowledge from the app" is the
only strategy that beats "random"].

This approach is far from perfect, but given that applications often
make short-term allocations, it makes sense to allocate on the local
node when possible.
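A toy comparison illustrates the point for the compiler-like case above: under a "first touch" policy (allocate on the allocating thread's node) a single-threaded workload sees no remote accesses, while static interleave necessarily spreads its pages across nodes. The function name and policies are illustrative only, not any real kernel API.

```python
def remote_access_fraction(num_nodes: int, num_pages: int,
                           policy: str, home_node: int = 0) -> float:
    """Fraction of pages placed remotely for a single-threaded workload
    pinned to `home_node`, under a given placement policy."""
    remote = 0
    for page in range(num_pages):
        if policy == "first_touch":
            placed_on = home_node           # allocate on the toucher's node
        elif policy == "interleave":
            placed_on = page % num_nodes    # round-robin across all nodes
        else:
            raise ValueError(policy)
        if placed_on != home_node:
            remote += 1
    return remote / num_pages
```

On a 4-node system the interleaved workload ends up with three quarters of its pages remote, which is exactly the kind of single-threaded case where local-first wins; the multithreaded pipeline cases discussed earlier are where this simple model breaks down.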
> I also am skeptical that the complexity associated with page migration
> strategies would be worthwhile: If you got it wrong the first time,
> what makes you think you'll do better this time?

I'm not advocating any page migration, with the possible exception that
page faults resolved by paging in should have the local node as first
choice.

However, supporting NUMA in the hypervisor and forwarding architecture
information to the guest would make sense. At the least, the very basic
principle - if the guest is to run on a limited set of processors
(nodes), allocate memory for the guest from that (those) node(s) -
would make a lot of sense.
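That basic principle can be sketched as follows. This is a hypothetical stand-in, not Xen's actual allocator: satisfy a guest's memory request from the nodes its vCPUs are restricted to, spilling to remote nodes only if those run dry.

```python
def allocate_guest_pages(need: int, vcpu_nodes: set[int],
                         free_by_node: dict[int, int]) -> dict[int, int]:
    """Return pages taken per node, preferring the guest's vCPU nodes."""
    taken: dict[int, int] = {}
    preferred = [n for n in free_by_node if n in vcpu_nodes]
    others = [n for n in free_by_node if n not in vcpu_nodes]
    for node in preferred + others:   # spill to remote nodes only if needed
        if need == 0:
            break
        grab = min(need, free_by_node[node])
        if grab:
            taken[node] = grab
            free_by_node[node] -= grab
            need -= grab
    if need:
        raise MemoryError("insufficient memory for guest")
    return taken
```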

[Note that I'm by no means a NUMA expert - I just happen to work for AMD
that happens to have a ccNUMA architecture]. 

> Emmanuel.
