Xen project Mailing List

[Xen-devel] [PATCH 11 of 11] Some automatic NUMA placement documentation

From: Dario Faggioli <raistlin@xxxxxxxx>

Date: Thu, 31 May 2012 14:11:16 +0200

Cc: Andre Przywara <andre.przywara@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Stefano Stabellini <Stefano.Stabellini@xxxxxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>

Delivery-date: Thu, 31 May 2012 12:12:39 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

About rationale, usage and API.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

diff --git a/docs/misc/xl-numa-placement.markdonw b/docs/misc/xl-numa-placement.markdonw
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdonw
@@ -0,0 +1,113 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA)
+machines by assigning to each domain a "node affinity", i.e., a set of NUMA
+nodes of the host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non-node-local memory locations is very high, and the performance degradation
+is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask
+for either manual or automatic placement of them across the host's NUMA
+nodes.
+
+Note that xm/xend does the very same thing, the only differences residing
+in the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify where a domain
+should be created and scheduled, directly in its config file. This
+affects NUMA placement and memory accesses, as the hypervisor constructs
+the node affinity of a VM based on its CPU affinity when it is
+created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won't be able to guarantee the locality of their memory
+accesses.
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, xl tries
+to figure out on its own on which node(s) the domain could fit best.
+
+First of all, it needs to find a node (or a set of nodes) that has
+enough free memory to accommodate the domain. After that, the actual
+decision on where to put the new guest is made by generating all the
+possible combinations of nodes that satisfy the above and choosing among
+them according to the following heuristics:
+
+  * candidates involving fewer nodes come first. In case two (or more)
+    candidates span the same number of nodes,
+  * candidates with a greater amount of free memory come first. In case
+    two (or more) candidates differ in their amount of free memory by
+    less than 10%,
+  * candidates with fewer domains already placed on them come first.
+
+Giving preference to small candidates ensures better performance for
+the guest, as it avoids spreading its memory among different nodes.
+Using the nodes that have the largest amounts of free memory helps
+keep memory fragmentation small, from a system-wide perspective.
+Finally, in case several candidates fulfil these criteria to the same
+extent, choosing the candidate that is hosting fewer domains helps
+balance the load across the various nodes.
+
+The last step is figuring out whether the selected candidate contains
+at least as many CPUs as the number of VCPUs of the VM. The current
+solution for the case when this does not hold is just to add some
+more nodes, until the condition becomes true. When doing
+this, the nodes with the least possible distance from the ones
+already in the nodemap are considered.
+
+## Guest Placement in libxl ##
+
+xl achieves automatic NUMA placement by means of the following API
+calls, provided by libxl.
+
+    libxl_numa_candidate *libxl_domain_numa_candidates(libxl_ctx *ctx,
+                                          libxl_domain_build_info *b_info,
+                                          int min_nodes, int *nr_cndts);
+
+This is what should be used to generate the full set of placement
+candidates. In fact, the function returns an array containing nr_cndts
+libxl_numa_candidate entries (see below). Each candidate is basically a set of nodes
+that has been checked against the memory requirement derived from the
+provided libxl_domain_build_info.
+
+    int libxl_numa_candidate_add_cpus(libxl_ctx *ctx,
+                                      int min_cpus, int max_nodes,
+                                      libxl_numa_candidate *candidate);
+
+This is what should be used to ensure a placement candidate has at least
+min_cpus CPUs. In case it does not, the function also takes care of
+adding more nodes to the candidate itself (until the value specified
+in max_nodes is reached). When adding new nodes, the one that has the
+smallest "distance" from the current node map is selected at each step.
+
+    libxl_numa_candidate_count_domains(libxl_ctx *ctx,
+                                       libxl_numa_candidate *candidate);
+
+This is what counts the number of domains that are currently pinned
+to the CPUs of the nodes of a given candidate.
+
+Finally, a placement candidate is represented by the following data
+structure:
+
+    typedef struct libxl_numa_candidate {
+        int nr_nodes;
+        int nr_domains;
+        uint32_t free_memkb;
+        libxl_nodemap nodemap;
+    } libxl_numa_candidate;
+
+It basically tells which nodes the candidate spans (in the nodemap),
+how many of them there are (nr_nodes), how much free memory they can
+provide altogether (free_memkb) and how many domains are running pinned
+to their CPUs (nr_domains).

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
