Xen project Mailing List

RE: [Xen-devel] Re: NUMA and SMP

To: "Emmanuel Ackaouy" <ack@xxxxxxxxxxxxx>

From: "Petersson, Mats" <Mats.Petersson@xxxxxxx>

Date: Tue, 16 Jan 2007 15:19:06 +0100

Cc: Anthony Liguori <aliguori@xxxxxxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, David Pilger <pilger.david@xxxxxxxxx>, Ryan Harper <ryanh@xxxxxxxxxx>

Delivery-date: Tue, 16 Jan 2007 06:22:03 -0800

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: Acc5di+SSR68F9xEStynsV7EXrlhEQAADmBQ

Thread-topic: [Xen-devel] Re: NUMA and SMP

> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xxxxxxxxxxxxx]
> Sent: 16 January 2007 13:56
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
>
> On the topic of NUMA:
>
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.

I'm not saying that it ALWAYS can make good decisions, but it's got a
better chance than software that just places things in "first available"
way.

>
> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.

I don't disagree with that.

>
> To illustrate, consider an app which lays out a bunch of data in memory
> in a single thread and then spawns worker threads to process it.

That's a good example of a hard to crack nut. Not easily solved in the
OS, that's for sure.

>
> Is the OS to place memory close to the initial thread? How can it
> possibly know how many threads will eventually process the data?
>
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will work on an
> assigned data subset (and if so, which one) or if it will act as a
> pipeline stage with all the data being passed from one thread to the
> next.
>
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.

Sure, I certainly wasn't suggesting memory migration.

However, there is a case where NUMA information COULD be helpful, and
that is if the system is paging in, it could try to find a page in the
local node rather than "random" [although without knowing what the
future holds, this could be wrong - as any non-future-knowing strategy
would be]. Of course, I wouldn't disagree if you said "The system
probably has too little memory if it's paging"!.

>
> And it gets worse for you when your app is successfully taking advantage
> of the memory cache hierarchy because its performance is less impacted
> by raw memory latency and bandwidth.

Indeed.

>
> Things also get more difficult on a time-sharing host with competing
> apps.

Agreed.

>
> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications and
>    users
> 3- They can take in directives from users and applications to partition
>    the host and place some threads and memory in specific partitions.
> 4- They use an interleaved (or random) initial memory placement strategy
>    by default.
>
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.

Debatable - it depends a lot on WHAT applications you expect to run, and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single threaded
process (say compiler), then allocating memory in the local node should
be the "first choice".

Multithreaded apps can use a similar approach, if a thread is allocating
memory, it's often a good chance that the memory is being used by that
thread too [although this doesn't work for message passing between
threads, obviously, this is again a case where "knowledge from the app"
will be the only better solution than "random"].

This approach is by far not perfect, but if you consider that
applications often do short term allocations, it makes sense to allocate
on the local node if possible.

>
> I also am skeptical that the complexity associated with page migration
> strategies would be worthwhile: If you got it wrong the first time, what
> makes you think you'll do better this time?

I'm not advocating for any page-migration, with the possible exception
that page-faults that are resolved by paging in should have a
first-choice of local node.

However, supporting NUMA in the Hypervisor and forwarding arch-info to
the guest would make sense. At the least the very basic principle of: If
the guest is to run on a limited set of processors (nodes), allocate
memory from that (those) node(s) for the guest would make a lot of
sense.

[Note that I'm by no means a NUMA expert - I just happen to work for AMD
that happens to have a ccNUMA architecture].

--
Mats

>
> Emmanuel.
>
>
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
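
The two workloads discussed in this thread (a single-threaded allocate-and-use process, and one thread laying out data that workers on other nodes then process) can be compared with a toy model. The following is only a rough sketch and not anything from Xen or a real OS allocator: all function names are hypothetical, every page is assumed to be accessed exactly once, and caches, latency and migration are ignored. It counts the fraction of accesses that land on a remote node and the share of accesses served by the busiest node, for a "local node first" policy versus round-robin interleaving.

```python
from collections import Counter


def place_local(alloc_node, n_pages, n_nodes):
    """Local-first policy (hypothetical): all pages go on the allocating node."""
    return [alloc_node] * n_pages


def place_interleaved(alloc_node, n_pages, n_nodes):
    """Round-robin policy (hypothetical): page i goes on node i mod n_nodes."""
    return [i % n_nodes for i in range(n_pages)]


def single_thread_workload(n_pages, node=0):
    """One thread on `node` touches every page once (compiler-like case)."""
    return [(page, node) for page in range(n_pages)]


def fan_out_workload(n_pages, n_nodes):
    """Worker w runs on node w and processes the w-th contiguous chunk."""
    chunk = max(1, n_pages // n_nodes)
    return [(page, min(page // chunk, n_nodes - 1)) for page in range(n_pages)]


def remote_fraction(placement, accesses):
    """Fraction of (page, accessing_node) pairs that hit a remote node."""
    remote = sum(1 for page, node in accesses if placement[page] != node)
    return remote / len(accesses)


def busiest_node_share(placement, accesses):
    """Share of all accesses that are served by the most loaded node."""
    load = Counter(placement[page] for page, _ in accesses)
    return max(load.values()) / len(accesses)


if __name__ == "__main__":
    n_pages, n_nodes = 16, 4
    policies = {"local": place_local, "interleaved": place_interleaved}
    workloads = {
        "single-thread": single_thread_workload(n_pages),
        "fan-out": fan_out_workload(n_pages, n_nodes),
    }
    for wname, accesses in workloads.items():
        for pname, policy in policies.items():
            placement = policy(0, n_pages, n_nodes)  # thread on node 0 allocates
            print(wname, pname,
                  remote_fraction(placement, accesses),
                  busiest_node_share(placement, accesses))
```

Under these assumptions (16 pages, 4 nodes), local placement gives no remote accesses for the single-threaded case while interleaving makes three quarters of them remote. For the fan-out case both policies end up with the same remote fraction, but local placement sends every access to one node's memory, whereas interleaving spreads them evenly. That is consistent with both positions in the thread: which default wins depends on the workload.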
