[Xen-ia64-devel] How to support NUMA?

   We discussed this a little bit at Xen Summit, but we didn't leave
with a plan to move forward.  Jes is now to the point where he's got
Altix booting to some extent and we need to be in agreement on what NUMA
support in Xen/ia64 is going to look like.

   First, there are a couple ways that NUMA is described and implemented
in Linux.  Many of us are more familiar with the ACPI approach (or "DIG"
as Jes might call it).  This consists of ACPI static tables and methods
in the namespace.  The SRAT static table defines processors and memory
ranges and assigns each to a proximity domain.  The SLIT table defines
the relative distance between proximity domains.  The ACPI namespace
also provides _PXM methods on objects that allow us to place things like
PCI buses and iommu hardware into the right locality.
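
   To make the roles of the two tables concrete, here is a minimal
sketch in C.  The struct is a simplified stand-in, not the real ACPI
table layout; the two-node ranges and distances are made-up values, and
paddr_to_pxm() and pxm_distance() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an SRAT memory affinity entry.
 * This is NOT the on-disk ACPI layout, only the information it carries. */
struct mem_affinity {
    uint64_t base;    /* start of the physical range */
    uint64_t length;  /* size of the range in bytes */
    uint32_t pxm;     /* proximity domain the range belongs to */
};

/* Hypothetical two-node system with 4 GB per node (made-up values). */
static const struct mem_affinity srat_mem[] = {
    { UINT64_C(0x000000000), UINT64_C(0x100000000), 0 },
    { UINT64_C(0x100000000), UINT64_C(0x100000000), 1 },
};

#define NR_PXM 2U

/* Simplified SLIT: relative distance between proximity domains.
 * ACPI uses 10 for "local"; the remote value of 20 is an assumption. */
static const uint8_t slit[NR_PXM][NR_PXM] = {
    { 10, 20 },
    { 20, 10 },
};

/* Return the proximity domain containing paddr, or -1 if none matches. */
int paddr_to_pxm(uint64_t paddr)
{
    for (size_t i = 0; i < sizeof(srat_mem) / sizeof(srat_mem[0]); i++) {
        if (paddr >= srat_mem[i].base &&
            paddr - srat_mem[i].base < srat_mem[i].length)
            return (int)srat_mem[i].pxm;
    }
    return -1;
}

/* Return the SLIT distance between two domains, or -1 if out of range. */
int pxm_distance(unsigned int from, unsigned int to)
{
    if (from >= NR_PXM || to >= NR_PXM)
        return -1;
    return slit[from][to];
}
```

   Note that the lookup walks a table of ranges; nothing in the address
itself says which domain it belongs to.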

   Another approach is that used on the SGI Altix systems.  I'm no
expert here, but as I understand it, a range of bits within the physical
address defines which node the physical address resides on.  I haven't
looked in the SN code base, but presumably PCI root buses, iommus, and
perhaps other hardware including processors are associated with nodes in
a similar way.  Maybe Jes can expand on this a bit for us.  Also, is
there a way to describe multiple levels of locality in the Altix scheme,
or is it simply local vs non-local?
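
   As I understand the scheme, the node lookup reduces to a shift and a
mask.  A rough sketch in C follows; NODE_SHIFT and NODE_BITS are values
I picked for illustration, not the real SN layout, and the function
names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout for illustration only: the node ID occupies NODE_BITS
 * bits starting at bit NODE_SHIFT.  The real SN values may differ. */
#define NODE_SHIFT  38U
#define NODE_BITS   11U
#define NODE_MASK   ((UINT64_C(1) << NODE_BITS) - 1)
#define OFFSET_MASK ((UINT64_C(1) << NODE_SHIFT) - 1)

/* Extract the node ID encoded in a physical address. */
unsigned int paddr_to_node(uint64_t paddr)
{
    return (unsigned int)((paddr >> NODE_SHIFT) & NODE_MASK);
}

/* Extract the offset of the address within its node. */
uint64_t paddr_node_offset(uint64_t paddr)
{
    return paddr & OFFSET_MASK;
}

/* Compose a physical address from a node ID and an in-node offset. */
uint64_t make_paddr(unsigned int node, uint64_t offset)
{
    return ((uint64_t)(node & NODE_MASK) << NODE_SHIFT) |
           (offset & OFFSET_MASK);
}
```

   The point is that no table walk is needed: the node falls straight
out of the address, which is why the address layout matters so much to
the guest.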

   In order to incur minimal changes to the Linux code base, Jes has
proposed a P==M model.  This is where the guest physical (or
meta/pseudo-physical) address is equal to the machine physical address.
This might seem like a step backwards, since we just transitioned from
P==M to a virtual physical (VP) model about a year ago.  However, I
think this might be a more loosely interpreted P==M model than we had
previously, see below.  The obvious benefit to this approach is that the
NUMA layout of the system is plain to see in the metaphysical addresses
provided to the guest.  The downside here is that we think this might
break the grant table API that we worked so hard to fix with the VP
transition.

   An alternative might be available using the current VP approach.  One
could imagine that a contiguous chunk of metaphysical memory could be
allocated out of memory from a given node.  Xen could then rewrite the
SLIT & SRAT tables for the domain.  Perhaps this is more of a VP with
P->node==M->node model.  The actual metaphysical addresses are
irrelevant, but the node that metaphysical memory is assigned to must
match the node of the machine memory, and we must not re-arrange
proximity domains (unless someone wants to volunteer to rewrite AML from
within Xen).
This approach helps the ACPI NUMA systems, but obviously doesn't work
for the Altix systems since they need specific bits in their
metaphysical address for locality.
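
   A rough sketch in C of what rewriting the memory side of the SRAT for
a domain could look like under this VP approach.  This is only my
illustration, not existing Xen code: struct guest_mem_range and
build_guest_srat() are hypothetical names, and packing the ranges
upward from address zero is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one guest-visible memory range. */
struct guest_mem_range {
    uint64_t base;    /* metaphysical start address */
    uint64_t length;  /* size in bytes */
    uint32_t pxm;     /* proximity domain, same number as on the host */
};

/*
 * Lay out one contiguous metaphysical range per proximity domain that
 * contributes memory to the guest.  bytes_per_pxm[i] is the amount of
 * machine memory taken from host domain i.  Domain numbers are kept
 * unchanged, so _PXM results in the namespace stay valid.
 * 'out' must have room for nr_pxm entries.
 * Returns the number of ranges written.
 */
size_t build_guest_srat(const uint64_t *bytes_per_pxm, size_t nr_pxm,
                        struct guest_mem_range *out)
{
    uint64_t next_base = 0;
    size_t n = 0;

    for (size_t pxm = 0; pxm < nr_pxm; pxm++) {
        if (bytes_per_pxm[pxm] == 0)
            continue;  /* guest has no memory on this node */
        out[n].base = next_base;
        out[n].length = bytes_per_pxm[pxm];
        out[n].pxm = (uint32_t)pxm;
        next_base += bytes_per_pxm[pxm];
        n++;
    }
    return n;
}
```

   For example, a guest given 1 GB from domain 0 and 2 GB from domain 2
would see two ranges, [0, 1 GB) in domain 0 and [1 GB, 3 GB) in
domain 2.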

   Will this latter approach eventually devolve/evolve into the former? 
I think all that Jes really needs is a way to get the node info from a
metaphysical address.  To support NUMA, there's no way to get around
P->node==M->node, correct?  We simply can't do per page lookups in the
mm code to get a node ID and expect any kind of performance.  The guest
needs to be able to assume contiguous metaphysical addresses come from
the same locality (except of course at the edges of a node).  We have to
assign some kind of metaphysical address to a guest, so why shouldn't at
least the Node ID bits of the metaphysical address match the machine
physical addresses?  The part that I think we're missing is that pages
within a node don't need to map 1:1, P==M.  Effectively we end up with a
pool of VP memory for each node.  In the SGI case, a few high order bits
in the metaphysical address will happen to match the machine physical
high order bits.  In the ACPI NUMA case, we might choose to do something
similar so that we have to modify the SRAT table a little less.
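
   To illustrate the "pool of VP memory for each node" idea, here is a
toy sketch in C.  Everything in it is an assumption for illustration:
the 8-bit in-node frame number, the two-node pool, and the names
pfn_to_node(), alloc_mfn_on_node() and map_gpfn().  Machine frames are
deliberately handed out from the top of each node so that the in-node
offsets do not match, while the node bits always do.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES        2U
#define NODE_PFN_SHIFT  8U  /* assumed: node ID sits above 8 offset bits */
#define FRAMES_PER_NODE (1U << NODE_PFN_SHIFT)
#define INVALID_MFN     UINT64_MAX

/* Number of machine frames already handed out from each node's pool. */
static uint32_t frames_used[NR_NODES];

/* Node ID encoded in the high bits of a (meta)physical frame number. */
uint64_t pfn_to_node(uint64_t pfn)
{
    return pfn >> NODE_PFN_SHIFT;
}

/* Take a machine frame from the given node's pool, top of node first.
 * Returns INVALID_MFN if the node is unknown or its pool is exhausted. */
uint64_t alloc_mfn_on_node(uint64_t node)
{
    if (node >= NR_NODES || frames_used[node] >= FRAMES_PER_NODE)
        return INVALID_MFN;

    uint32_t offset = FRAMES_PER_NODE - 1U - frames_used[node]++;
    return (node << NODE_PFN_SHIFT) | offset;
}

/* Back a guest metaphysical frame with a machine frame from the node
 * named by the guest frame's own node bits. */
uint64_t map_gpfn(uint64_t gpfn)
{
    uint64_t mfn = alloc_mfn_on_node(pfn_to_node(gpfn));

    if (mfn != INVALID_MFN)
        assert(pfn_to_node(mfn) == pfn_to_node(gpfn)); /* P->node==M->node */
    return mfn;
}
```

   The guest can still derive the node from the high bits of a
metaphysical address with a shift, but Xen remains free to pick any
machine page within that node.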

   Even if this is the base, there are still a lot of questions.  Is
this model only for dom0, or can we specify it for domU also?  There are
obvious performance advantages to a NUMA aware domU if it's running on a
NUMA box and doesn't entirely fit within a node.  How do we specify
which resources go to which domains for both the dom0 and domU cases?
Can NUMA aware domains be migrated or restored?  Do non-NUMA aware
domains have zero-based metaphysical memory (below 4G)?  Does a non-NUMA
aware domain that spans nodes have a discontiguous address map?  How do
driver domains fit into the picture?  How can a NUMA aware domain be
told the locality of a PCI device?  Will we make an attempt to allocate
non-NUMA aware guests within a node?

   Please comment and discuss.  Let me know if I'm way off base.  If
this doesn't meet our needs or is not feasible, let's come up with
something that is.  Thanks,

        Alex
   
-- 
Alex Williamson                             HP Open Source & Linux Org.

