
Re: [Xen-devel] Hwloc with Xen host topology



On 02/01/14 20:26, Andrew Cooper wrote:
> Hello,
>
> For some post-holiday hacking, I tried playing around with getting hwloc
> to understand Xen's full system topology, rather than the faked up
> topology dom0 receives.
>
> I present here some code which works (on some interestingly shaped
> servers in the XenRT test pool), and some discoveries/problems found
> along the way.
>
> Code can be found at:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v1
>
> You will need a libxc with the following patch:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental
>
> Instructions for use can be found in the commit message of the
> hwloc.git tree.  It is worth noting that, with the help of the
> hwloc-devel list, v2 is already quite a bit different, but it is still
> in progress.
>
>
> Anyway, for the Xen issues I encountered.  If memory serves, some of
> them might have been brought up on xen-devel in the past.
>
> The first problem, as indicated by the extra patch required against
> libxc, is that the current interface for xc_{topology,numa}info() sucks
> if you are not libxl.  The current interface forces the caller to handle
> hypercall bounce buffering, which is even harder to do sensibly because
> half the bounce buffer macros are private to libxc.  Bounce buffering is
> the kind of detail which libxc should deal with on behalf of its
> callers, and should only be exposed to callers who want to do something
> special.
>
> My patch implements xc_{topology,numa}info_bounced() (name up for
> reconsideration), which takes some uint{32,64}_t arrays (optionally
> NULL) and properly bounce buffers them.  This results in not needing to
> mess around with any of the bounce buffering in hwloc.
>
> The second problem is with the choice of max_node_id, which is
> MAX_NUMNODES-1, or 63.  This means that the toolstack has to bounce a
> 16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
> a single or dual node system.  The issue is less pronounced with the
> node_to_mem{size,free} arrays, which only have to be 64 * uint64_t long,
> but it is still wasteful, especially if node_to_memfree is being
> periodically polled.  Having nr_node_ids set dynamically (similar to
> nr_cpu_ids) would alleviate this overhead, as the number of nodes
> available on the system is fixed after boot.
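>
> (Sketch only, reusing the hypothetical wrapper above: with a dynamic
> node count, a caller could size its buffers in two calls rather than
> always allocating for 64 nodes.)
>
>   static int get_numa_info(xc_interface *xch)
>   {
>       uint32_t max_node_index = 0;
>
>       /* First call with NULL arrays just to learn the real node count. */
>       if ( xc_numainfo_bounced(xch, &max_node_index, NULL, NULL, NULL) )
>           return -1;
>
>       uint32_t nr_nodes = max_node_index + 1;
>       uint64_t *memsize  = calloc(nr_nodes, sizeof(*memsize));
>       uint64_t *memfree  = calloc(nr_nodes, sizeof(*memfree));
>       uint32_t *distance = calloc(nr_nodes * nr_nodes, sizeof(*distance));
>
>       /* Second call fills in only nr_nodes worth of data. */
>       int rc = xc_numainfo_bounced(xch, &max_node_index, memsize,
>                                    memfree, distance);
>
>       /* ... use the data ... */
>
>       free(memsize); free(memfree); free(distance);
>       return rc;
>   }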
>
> The third problem is the one which created the only real bug in my hwloc
> implementation.  Cores are numbered per-socket in Xen, while sockets,
> numa nodes and cpus are numbered on an absolute scale.  There is
> currently a gross hack in my hwloc code which adds (socket_id *
> cores_per_socket * threads_per_core) onto each core id to make them
> similarly numbered on an absolute scale.  This is fine for a homogeneous
> system, but not for a heterogeneous one.
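>
> (In pseudo-C, the hack amounts to this, per cpu:)
>
>   /* Only valid when every socket has the same geometry. */
>   abs_core_id = core_id +
>                 socket_id * cores_per_socket * threads_per_core;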
>
> Relatedly, when debugging the third problem on an AMD Opteron 63xx
> system, I noticed that it advertises 8 cores per socket and 2 threads
> per core, but numbers the cores 1-16 on each socket.  This is broken.
> It should either be 16 cores per socket and 1 thread per core, or
> genuinely 8 cores per socket and 2 threads per core, with the cores
> numbered 1-8 and each pair of cpus sharing the same core id.
>
> Fourth, the API for identifying offline cpus is broken.  To mark a cpu
> as offline, it has its topology information shot, meaning that an
> offline cpu cannot be positively located in the topology.  I happen to
> know it sometimes can be, as Xen writes the records sequentially, so a
> single offline cpu can be identified from the valid information either
> side of it, but a block of offline cpus becomes rather harder to locate.
> XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
> being a bitmap from 0 to max_cpu_index identifying which cpus are
> online, and writing the correct core/socket/node information (when
> known) into the other parameters.  However, being an ABI now makes this
> somewhat harder to do.
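>
> (Something along these lines, purely as a sketch - the names are made
> up, and as noted it would have to be a new subop or interface version
> rather than a change to the existing structure:)
>
>   struct xen_sysctl_topologyinfo_v2 {
>       uint32_t max_cpu_index;                   /* IN/OUT */
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_core;
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_socket;
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_node;
>       XEN_GUEST_HANDLE_64(uint8)  cpu_online;   /* bitmap, one bit per cpu */
>   };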
>
> Fifth, Xen has no way of querying the cpu cache information.  hwloc
> likes to know the entire cache hierarchy, which is arguably more useful
> for its primary purpose of optimising HPC than for simply viewing the
> Xen topology, but is nonetheless a missing feature as far as Xen is
> concerned.  I was considering adding a sysctl along the lines of "please
> execute cpuid with these parameters on that pcpu and give me the answers".
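>
> (i.e. something of this shape, which is entirely hypothetical at this
> point:)
>
>   struct xen_sysctl_cpuid {
>       uint32_t cpu;                 /* IN: pcpu to execute cpuid on */
>       uint32_t leaf, subleaf;       /* IN: eax/ecx inputs */
>       uint32_t eax, ebx, ecx, edx;  /* OUT: cpuid results */
>   };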
>
> Sixth and finally, and conceptually the hardest problem to solve: Xen
> has no notion of IO proximity.  Devices on the system can report their
> location using _PXM() methods in the DSDT/SSDTs, but only dom0 can
> gather this information, and dom0 does not have an accurate view of the
> NUMA or CPU topology.

Seventh, as some very up-to-the-minute hacking: XEN_SYSCTL_numainfo is
not giving back valid information.

From a Haswell-EP SDP, running XenServer trunk (xen-4.3 based):

Xen NUMA information:
  numa count 64, max numa id 1
  node[  0], size 19327352832, free 15262810112
  node[  1], size 17179869184, free 15961382912

This sums to ~2.5GB more than the total system RAM of:
(XEN) System RAM: 32320MB (33096268kB)

It would appear that a node's memsize includes the IO regions encompassed
by the node's start/end pfns, rather than just the RAM contained inside
that range.

(XEN) SRAT: Node 0 PXM 0 0-480000000
(XEN) SRAT: Node 1 PXM 1 480000000-880000000
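
If I am reading it right, the reported sizes are exactly the spans of
those SRAT ranges, holes and all:

  node 0: 0x480000000 - 0x000000000 = 19327352832 bytes
  node 1: 0x880000000 - 0x480000000 = 17179869184 bytes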

Is this intentional or an oversight?

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
