WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
xen-devel

RE: [Xen-devel] Re: NUMA and SMP

To: "Emmanuel Ackaouy" <ack@xxxxxxxxxxxxx>
Subject: RE: [Xen-devel] Re: NUMA and SMP
From: "Petersson, Mats" <Mats.Petersson@xxxxxxx>
Date: Tue, 16 Jan 2007 15:19:06 +0100
Cc: Anthony Liguori <aliguori@xxxxxxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, David Pilger <pilger.david@xxxxxxxxx>, Ryan Harper <ryanh@xxxxxxxxxx>
Delivery-date: Tue, 16 Jan 2007 06:22:03 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <8790346913e7b2e96fdc58199e039895@xxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: Acc5di+SSR68F9xEStynsV7EXrlhEQAADmBQ
Thread-topic: [Xen-devel] Re: NUMA and SMP
> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xxxxxxxxxxxxx] 
> Sent: 16 January 2007 13:56
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
> 
> On the topic of NUMA:
> 
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.

I'm not saying that it can ALWAYS make good decisions, but it's got a
better chance than software that just places things in a "first
available" way.

> 
> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.

I don't disagree with that. 
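To make the comparison concrete, here is a hypothetical sketch of the two
default strategies under discussion - round-robin interleaving versus
"first available" packing. The function names and page-to-node mappings are
purely illustrative, not Xen code:

```python
# Illustrative sketch (not Xen code) of the two default placement
# policies: round-robin interleaving spreads consecutive pages across
# nodes, while "first available" packs one node before moving to the next.

def interleave_node(page_index, num_nodes):
    """Round-robin: page i goes to node i mod N."""
    return page_index % num_nodes

def first_available_node(page_index, pages_per_node):
    """First-available: fill node 0 completely, then node 1, and so on."""
    return page_index // pages_per_node

# With 4 nodes, interleaving spreads 8 pages evenly across all nodes:
print([interleave_node(i, 4) for i in range(8)])
# [0, 1, 2, 3, 0, 1, 2, 3]

# First-available puts the same 8 pages on nodes 0 and 1 only:
print([first_available_node(i, 4) for i in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]
```

Interleaving trades the best case (all-local access) for a predictable
average case, which is the crux of Emmanuel's argument.
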
> 
> To illustrate, consider an app which lays out a bunch of data in memory
> in a single thread and then spawns worker threads to process it.

That's a good example of a hard nut to crack. Not easily solved in the
OS, that's for sure.
> 
> Is the OS to place memory close to the initial thread? How can it
> possibly know how many threads will eventually process the data?
> 
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will work on
> an assigned data subset (and if so, which one) or if it will act as a
> pipeline stage with all the data being passed from one thread to the next.
> 
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.

Sure, I certainly wasn't suggesting memory migration. 

However, there is a case where NUMA information COULD be helpful: if the
system is paging in, it could try to find a free page on the local node
rather than a "random" one [although without knowing what the future
holds, this could be wrong - as any non-future-knowing strategy could
be]. Of course, I wouldn't disagree if you said "The system probably has
too little memory if it's paging"!
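A hypothetical sketch of that page-in preference: try the faulting CPU's
own node first, and fall back to any node with free frames. The
`free_frames` table and function name are illustrative only:

```python
# Illustrative sketch (not Xen code): when servicing a page-in, prefer a
# free frame on the faulting CPU's node; otherwise take any node with
# space. free_frames maps node id -> count of free page frames.

def pick_node_for_pagein(local_node, free_frames):
    if free_frames.get(local_node, 0) > 0:
        return local_node                    # local hit: cheapest access
    for node, free in free_frames.items():   # fall back to any node
        if free > 0:
            return node
    return None                              # genuinely out of memory

# Local node 1 is full, so the fallback picks another node:
print(pick_node_for_pagein(1, {0: 10, 1: 0, 2: 3}))  # 0
```

As the paragraph above concedes, this is still a guess about the future:
the page may end up being touched mostly from another node anyway.
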

> 
> And it gets worse for you when your app is successfully taking advantage
> of the memory cache hierarchy because its performance is less impacted
> by raw memory latency and bandwidth.

Indeed. 
> 
> Things also get more difficult on a time-sharing host with competing
> apps.

Agreed.
> 
> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications and
>      users.
> 3- They can take in directives from users and applications to partition
>      the host and place some threads and memory in specific partitions.
> 4- They use an interleaved (or random) initial memory placement strategy
>      by default.
> 
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.

Debatable - it depends a lot on WHAT applications you expect to run, and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single-threaded
process (say, a compiler), then allocating memory on the local node
should be the "first choice".

Multithreaded apps can use a similar approach: if a thread is allocating
memory, there's often a good chance that the memory will be used by that
thread too [although this obviously doesn't work for message passing
between threads - that is again a case where "knowledge from the app" is
the only solution better than "random"].

This approach is far from perfect, but given that applications often
make short-term allocations, it makes sense to allocate on the local
node when possible.
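That "allocate on the allocating thread's node" heuristic boils down to a
topology lookup at allocation time. A hypothetical sketch, with an
illustrative CPU-to-node table (not any real allocator's interface):

```python
# Illustrative sketch (not real allocator code): the first-choice node
# for a new allocation is simply the node of the CPU the allocating
# thread is currently running on. cpu_to_node is a made-up topology
# table: 4 CPUs, 2 nodes.

cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}

def preferred_node(current_cpu):
    """First choice for a new allocation: the caller's own node."""
    return cpu_to_node[current_cpu]

# A thread on CPU 3 gets its short-lived allocations from node 1:
print(preferred_node(3))  # 1
```

The heuristic wins exactly when the allocating thread is also the main
consumer - the common case for short-lived, single-threaded allocation
patterns, and the case that breaks down under message passing.
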
> 
> I also am skeptical that the complexity associated with page migration
> strategies would be worthwhile: If you got it wrong the first time,
> what makes you think you'll do better this time?

I'm not advocating any page migration, with the possible exception that
page faults resolved by paging in should have the local node as first
choice.

However, supporting NUMA in the hypervisor and forwarding architecture
info to the guest would make sense. At the very least, the basic
principle of "if the guest is to run on a limited set of processors
(nodes), allocate the guest's memory from that (those) node(s)" would
make a lot of sense.
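That guest-placement principle can be sketched in a few lines. This is a
hypothetical illustration of the policy, not the hypervisor's allocator;
the function and parameter names are made up:

```python
# Illustrative sketch (not Xen code) of the policy described above: if a
# guest's VCPUs are confined to a set of nodes, satisfy its memory
# allocation from only those nodes, striping across them.

def place_guest_pages(num_pages, guest_nodes):
    """Interleave the guest's pages across only the nodes it runs on."""
    return [guest_nodes[i % len(guest_nodes)] for i in range(num_pages)]

# A guest pinned to nodes 2 and 3 never receives remote memory:
print(place_guest_pages(6, [2, 3]))  # [2, 3, 2, 3, 2, 3]
```

This combines Emmanuel's interleaving default with the one piece of
placement knowledge the hypervisor reliably has: the guest's CPU
affinity.
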

[Note that I'm by no means a NUMA expert - I just happen to work for
AMD, which happens to have a ccNUMA architecture].

--
Mats
> 
> Emmanuel.
> 
> 
> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
