Re: [Xen-devel] [PATCH 08 of 10 v3] libxl: enable automatic placement of guests on NUMA nodes



On Wed, 2012-07-04 at 17:18 +0100, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli <raistlin@xxxxxxxx>
> # Date 1341416324 -7200
> # Node ID 7087d3622ee2051654c9e78fe4829da10c2d46f1
> # Parent  6fd693e7f3bc8b4d9bd20befff2c13de5591a7c5
> libxl: enable automatic placement of guests on NUMA nodes
> 
> If a domain does not have a VCPU affinity, try to pin it automatically to
> some PCPUs. This is done taking into account the NUMA characteristics of the
> host. In fact, we look for a combination of the host's NUMA nodes with
> enough free memory and PCPUs for the new domain, and pin it to the PCPUs of
> those nodes.
> 
> Once we know which ones, among all the possible combinations, represent
> valid placement candidates for a domain, use some heuristics for deciding
> which is the best. For instance, smaller candidates are considered to be
> better, both from the domain's point of view (less spreading of memory among
> nodes) and from the point of view of the system as a whole (less memory
> fragmentation). In case of candidates of equal sizes (i.e., with the same
> number of nodes), the amount of free memory and the number of domains
> already assigned to their nodes are considered. Very often, candidates with
> a greater amount of free memory are the ones we want, as this is also good
> for keeping memory fragmentation under control.
> However, if the difference in how much free memory two candidates have, the
> number of assigned domains might be what decides which candidate wins.

I can't parse this last sentence. Are there some words missing after
"how much free memory two candidates have"?

If you want to post the corrected text I think we can fold it in while
applying rather than reposting (assuming no other reason to repost crops
up).

> This all happens internally to libxl, and no API for driving the mechanism is
> provided for now. This matches what xend already does.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Acked-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> 
> ---
> Changes from v2:
>  * Fixed lots of typos.
>  * Clarified some comments, as requested during (ijc's) review.
>  * Added some more information/reference for the combination generation
>    algorithm.
>  * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
>  * libxl_bitmap_init() used to make sure we do not try to free random
>    memory on failure paths of functions that allocate a libxl_bitmap.
>  * Always invoke libxl__sort_numa_candidates(), even if we get there
>    with just 1 candidate, as requested during review.
>  * Simplified the if-s that check for input parameter consistency in
>    libxl__get_numa_candidates() as requested during (gwd's) review.
>  * Comparison function for candidates changed so that it now provides
>    total ordering, as requested during review. It is still using FP
>    arithmetic, though. Also, I think that putting the differences in free
>    memory and in the number of assigned domains between two candidates
>    into a single formula (after normalizing and weighting them) is both
>    clear and effective enough.
>  * Function definitions moved to a numa specific source file (libxl_numa.c),
>    as suggested during review.
> 
> 
> Changes from v1:
>  * This patch incorporates the changes from both "libxl, xl: enable
>    automatic placement of guests on NUMA nodes" and "libxl, xl: heuristics
>    for reordering NUMA placement candidates" from v1.
>  * The logic of the algorithm is basically the same as in v1, but the way
>    it is split across the various functions has been completely redesigned
>    from scratch.
>  * No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review.
>  * The relevant documentation has been moved near the actual functions and
>    features. Also, the amount and (hopefully!) the quality of the
>    documentation have been improved a lot, as requested.
>  * All the comments about using the proper libxl facilities and helpers for
>    allocations, etc., have been considered and applied.
>  * This patch still bails out from NUMA optimizations if it finds out that
>    cpupools are being utilized. It is the next patch that makes the two
>    things interact properly, as suggested during review.
> 
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
> 
>  =item B<cpus="CPU-LIST">
> 
> -List of which cpus the guest is allowed to use. Default behavior is
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
> 
>  =over 4
> 
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
> 
>  =back
> 
> +If this option is not specified, libxl automatically tries to place the new
> +domain on the host's NUMA nodes (provided the host has more than one NUMA
> +node) by pinning it to the cpus of those nodes. A heuristic approach is
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host's CPUs and RAM.
> +
>  =item B<cpu_weight=WEIGHT>
> 
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm
>  LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>                         libxl_dom.o libxl_exec.o libxl_xshelp.o 
> libxl_device.o \
>                         libxl_internal.o libxl_utils.o libxl_uuid.o \
> -                       libxl_json.o libxl_aoutils.o \
> +                       libxl_json.o libxl_aoutils.o libxl_numa.o \
>                         libxl_save_callout.o _libxl_save_msgs_callout.o \
>                         libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
>  LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -98,6 +98,106 @@ out:
>      return sched;
>  }
> 
> +/* Subtract two values and translate the result in [0, 1] */
> +static double normalized_diff(double a, double b)
> +{
> +#define max(a, b) (a > b ? a : b)
> +    if (!a && a == b)
> +        return 0.0;
> +    return (a - b) / max(a, b);
> +}

I think this actually returns a result in [-1,1] rather than [0,1] like
the comment says. Considering the a==1 b==2 case -> 
        ( 1 - 2 ) / max(1, 2)
=>      -1 / 2
=>      -0.5

Is it the comment or the code which is wrong? (From the following
numa_cmpf I think the comment)
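
FWIW, a minimal standalone check (normalized_diff copied from the patch,
with the max macro hoisted out of the function, plus a throwaway main()
of my own) agrees that it is the comment which is wrong:

    #include <stdio.h>

    #define max(a, b) (a > b ? a : b)

    /* Copied from the patch */
    static double normalized_diff(double a, double b)
    {
        if (!a && a == b)
            return 0.0;
        return (a - b) / max(a, b);
    }

    int main(void)
    {
        printf("%f\n", normalized_diff(1.0, 2.0)); /* -0.500000 */
        printf("%f\n", normalized_diff(2.0, 1.0)); /*  0.500000 */
        printf("%f\n", normalized_diff(0.0, 0.0)); /*  0.000000 */
        return 0;
    }

So [-1, 1] it is, at least for non-negative inputs.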

> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - the amount of free memory and the number of domains assigned to the
> + *    candidates are considered. In doing that, candidates with greater
> + *    amount of free memory and fewer domains assigned to them are preferred,
> + *    with free memory "weighting" three times as much as number of domains.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = v1;
> +    const libxl__numa_candidate *c2 = v2;
> +#define sign(a) a > 0 ? 1 : a < 0 ? -1 : 0

Does the caller of numa_cmpf rely on the result being specifically -1, 0
or +1? Usually such functions only care about -ve, 0 or +ve.

Or maybe this is a double->int conversion thing? I guess (int)0.1 == 0
rather than +ve like you'd want? Yes, I suppose that's it, nevermind. I
might have written the floating point zeroes as 0.0 to make it clearer
rather than relying on automatic promotion.
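
For the record, a quick illustration of the truncation -- hypothetical
snippet, not from the patch:

    #include <stdio.h>

    int main(void)
    {
        double d = 0.1;
        /* A small positive result collapses to 0 on a plain cast... */
        printf("%d\n", (int)d);                         /* prints 0 */
        /* ...whereas the sign() logic keeps it +ve; writing the zeroes
         * as 0.0 would just make the double comparison explicit. */
        printf("%d\n", d > 0.0 ? 1 : d < 0.0 ? -1 : 0); /* prints 1 */
        return 0;
    }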

> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    return sign(3*freememkb_diff + nrdomains_diff);
> +}
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)

I didn't review the rest -- George already acked it.

Ian.


