
Re: [Xen-devel] Hwloc with Xen host topology



On 02/01/14 20:26, Andrew Cooper wrote:
> Hello,
>
> For some post-holiday hacking, I tried playing around with getting hwloc
> to understand Xen's full system topology, rather than the faked up
> topology dom0 receives.
>
> I present here some code which works (on some interestingly shaped
> servers in the XenRT test pool), and some discoveries/problems found
> along the way.
>
> Code can be found at:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v1
>
> You will need a libxc with the following patch:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental
>
> Instructions for use can be found in the commit message of the
> hwloc.git tree.  It is worth noting that, with the help of the
> hwloc-devel list, v2 is already quite a bit different, but it is still
> in progress.
>
>
> Anyway, for the Xen issues I encountered.  If memory serves, some of
> them might have been brought up on xen-devel in the past.
>
> The first problem, as indicated by the extra patch required against
> libxc, is that the current interface for xc_{topology,numa}info() sucks
> if you are not libxl.  The current interface forces the caller to handle
> hypercall bounce buffering, which is even harder to do sensibly because
> half the bounce buffer macros are private to libxc.  Bounce buffering is
> the kind of detail which libxc should deal with on behalf of its
> callers, and should only be exposed to callers who want to do something
> special.
>
> My patch implements xc_{topology,numa}info_bounced() (name up for
> reconsideration), which takes some uint{32,64}_t arrays (optionally
> NULL) and properly bounce buffers them.  This results in not needing to
> mess around with any of the bounce buffering in hwloc.
>
> The second problem is with the choice of max_node_id, which is
> MAX_NUMNODES-1, or 63.  This means that the toolstack has to bounce a
> 16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
> a single or dual node system.  The issue is less pronounced with the
> node_to_mem{size,free} arrays, which only have to be 64 * uint64_t long,
> but it is still wasteful, especially if node_to_memfree is being
> periodically polled.  Having nr_node_ids set dynamically (similar to
> nr_cpu_ids) would alleviate this overhead, as the number of nodes
> available on the system is fixed after boot.
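>
> (Sketch only, reusing the hypothetical wrapper above: with a dynamic
> node count, a caller could size its buffers in two calls rather than
> always allocating for 64 nodes.)
>
>   static int get_numa_info(xc_interface *xch)
>   {
>       uint32_t max_node_index = 0;
>
>       /* First call with NULL arrays just to learn the real node count. */
>       if ( xc_numainfo_bounced(xch, &max_node_index, NULL, NULL, NULL) )
>           return -1;
>
>       uint32_t nr_nodes = max_node_index + 1;
>       uint64_t *memsize  = calloc(nr_nodes, sizeof(*memsize));
>       uint64_t *memfree  = calloc(nr_nodes, sizeof(*memfree));
>       uint32_t *distance = calloc(nr_nodes * nr_nodes, sizeof(*distance));
>
>       /* Second call fills in only nr_nodes worth of data. */
>       int rc = xc_numainfo_bounced(xch, &max_node_index, memsize,
>                                    memfree, distance);
>
>       /* ... use the data ... */
>
>       free(memsize); free(memfree); free(distance);
>       return rc;
>   }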
>
> The third problem is the one which created the only real bug in my hwloc
> implementation.  Cores are numbered per-socket in Xen, while sockets,
> numa nodes and cpus are numbered on an absolute scale.  There is
> currently a gross hack in my hwloc code which adds (socket_id *
> cores_per_socket * threads_per_core) onto each core id to make them
> similarly numbered on an absolute scale.  This is fine for a homogeneous
> system, but not for a heterogeneous one.
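>
> (In pseudo-C, the hack amounts to this, per cpu:)
>
>   /* Only valid when every socket has the same geometry. */
>   abs_core_id = core_id +
>                 socket_id * cores_per_socket * threads_per_core;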
>
> Relatedly, when debugging the third problem on an AMD Opteron 63xx
> system, I noticed that it advertises 8 cores per socket and 2 threads
> per core, but numbers the cores 1-16 on each socket.  This is broken.
> It should either be 16 cores per socket and 1 thread per core, or
> genuinely 8 cores per socket and 2 threads per core, with the cores
> numbered 1-8 and each pair of cpus sharing the same core id.
>
> Fourth, the API for identifying offline cpus is broken.  To mark a cpu
> as offline, it has its topology information shot, meaning that an
> offline cpu cannot be positively located in the topology.  I happen to
> know it sometimes can be, as Xen writes the records sequentially, so a
> single offline cpu can be identified from the valid information either
> side of it, but a block of offline cpus becomes rather harder to locate.
> XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
> being a bitmap from 0 to max_cpu_index identifying which cpus are
> online, and writing the correct core/socket/node information (when
> known) into the other parameters.  However, being an ABI now makes this
> somewhat harder to do.
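>
> (Something along these lines, purely as a sketch - the names are made
> up, and as noted it would have to be a new subop or interface version
> rather than a change to the existing structure:)
>
>   struct xen_sysctl_topologyinfo_v2 {
>       uint32_t max_cpu_index;                   /* IN/OUT */
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_core;
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_socket;
>       XEN_GUEST_HANDLE_64(uint32) cpu_to_node;
>       XEN_GUEST_HANDLE_64(uint8)  cpu_online;   /* bitmap, one bit per cpu */
>   };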
>
> Fifth, Xen has no way of querying the cpu cache information.  hwloc
> likes to know the entire cache hierarchy, which is arguably more useful
> for its primary purpose of optimising HPC than for simply viewing the
> Xen topology, but is nonetheless a missing feature as far as Xen is
> concerned.  I was considering adding a sysctl along the lines of "please
> execute cpuid with these parameters on that pcpu and give me the answers".
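>
> (i.e. something of this shape, which is entirely hypothetical at this
> point:)
>
>   struct xen_sysctl_cpuid {
>       uint32_t cpu;                 /* IN: pcpu to execute cpuid on */
>       uint32_t leaf, subleaf;       /* IN: eax/ecx inputs */
>       uint32_t eax, ebx, ecx, edx;  /* OUT: cpuid results */
>   };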
>
> Sixth and finally, and conceptually the hardest problem to solve: Xen
> has no notion of IO proximity.  Devices on the system can report their
> location using _PXM() methods in the DSDT/SSDTs, but only dom0 can
> gather this information, and dom0 does not have an accurate view of the
> NUMA or CPU topology.

Seventh, as some very up-to-the-minute hacking: XEN_SYSCTL_numainfo is
not giving back valid information.

From a Haswell-EP SDP, running XenServer trunk (xen-4.3 based):

Xen NUMA information:
  numa count 64, max numa id 1
  node[  0], size 19327352832, free 15262810112
  node[  1], size 17179869184, free 15961382912

This sums to ~2.5GB more than the total system RAM of:
(XEN) System RAM: 32320MB (33096268kB)

It would appear that a node's memsize includes the IO regions encompassed
by the node's start/end pfns, rather than just the RAM contained inside
that range.

(XEN) SRAT: Node 0 PXM 0 0-480000000
(XEN) SRAT: Node 1 PXM 1 480000000-880000000
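
If I am reading it right, the reported sizes are exactly the spans of
those SRAT ranges, holes and all:

  node 0: 0x480000000 - 0x000000000 = 19327352832 bytes
  node 1: 0x880000000 - 0x480000000 = 17179869184 bytes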

Is this intentional or an oversight?

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
