
Re: [Xen-devel] [PATCH v10 8/9] libxl: vnuma nodes placement bits



On mer, 2014-09-03 at 00:24 -0400, Elena Ufimtseva wrote:

> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -23,6 +23,7 @@
>  #include <xc_dom.h>
>  #include <xen/hvm/hvm_info_table.h>
>  #include <xen/hvm/hvm_xs_strings.h>
> +#include <libxl_vnuma.h>
>  
>  libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid)
>  {
> @@ -227,12 +228,114 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
>                      libxl_defbool_val(info->u.hvm.nested_hvm));
>  }
>  
> +/* sets vnode_to_pnode map. */
> +static int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
> +                        libxl_domain_build_info *info)
>
Indentation.

> +{
> +    unsigned int i, n;
> +    int nr_nodes = 0;
> +    uint64_t *vnodes_mem;
> +    unsigned long long *nodes_claim = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +
> +    if (info->vnuma_vnodemap == NULL) {
> +        info->vnuma_vnodemap = libxl__calloc(gc, info->vnodes,
> +                                      sizeof(*info->vnuma_vnodemap));
> +    }
> +
So, help me a bit, do you mind? :-D

If I'm not wrong, here, vnuma_vnodemap is NULL (only?) if no vnode to
pnode mapping has been specified in the config file, and so you have to
allocate it for the first time. In this case, info->vnodes will be 1, so
you're allocating a one-element array (which is not wrong, I'm just
asking). Is this correct?

If yes, I'll repeat what I said while reviewing one of the other
patches: I think this can, and hence should, be done in
libxl__domain_build_info_setdefault().

Or am I missing something?

> +    /* default setting. */
> +    for (i = 0; i < info->vnodes; i++)
> +        info->vnuma_vnodemap[i] = LIBXC_VNUMA_NO_NODE;
> +
This too, of course, belongs in libxl__domain_build_info_setdefault().
BTW, why the LIBXC_ prefix, for something that is used here in libxl?
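
Something along these lines is what I mean (just an untested sketch; I
am keeping the constant name used in this patch, although I'd expect it
to become a LIBXL_ prefixed one, and I'm assuming vnodes itself is also
defaulted to 1 in there):

    /* In libxl__domain_build_info_setdefault(gc, b_info) */
    if (b_info->vnodes == 0)
        b_info->vnodes = 1;

    if (b_info->vnuma_vnodemap == NULL) {
        unsigned int i;

        b_info->vnuma_vnodemap = libxl__calloc(gc, b_info->vnodes,
                                        sizeof(*b_info->vnuma_vnodemap));
        /* "no node" everywhere, until someone (xl, or the automatic
         * placement) provides an actual mapping. */
        for (i = 0; i < b_info->vnodes; i++)
            b_info->vnuma_vnodemap[i] = LIBXC_VNUMA_NO_NODE;
    }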

> +    /* Get NUMA info. */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL)
> +        return ERROR_FAIL;
>
The error reporting is ok, this time. However, I think you should hold
the error code in an rc variable, have a 'goto out;' here and, at the
'out' label, issue the proper cleanups.

There may not be much to clean up at this very early stage, but it's a
pattern that is pretty much always followed in libxl.

> +    /* Nothing to see if only one NUMA node. */
> +    if (nr_nodes <= 1)
> +        return 0;
> +
And in fact, once you're here, you've got at least the
libxl_numainfo_list_free() disposal function to call. That's why I was
saying above: have an rc variable and return only at the bottom,
through a goto label.
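
I.e., roughly (untested):

    int rc = 0;

    /* ... */

    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
    if (ninfo == NULL) {
        rc = ERROR_FAIL;
        goto out;
    }

    /* Nothing to do if there is only one NUMA node. */
    if (nr_nodes <= 1)
        goto out;

    /* ... rest of the function, using 'rc = ERROR_xxx; goto out;'
     * instead of returning directly ... */

 out:
    if (ninfo)
        libxl_numainfo_list_free(ninfo, nr_nodes);
    return rc;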

> +    vnodes_mem = info->vnuma_mem;
> +    /*
> +     * TODO: change algorithm. The current just fits the nodes
> +     * by its memory sizes. If no p-node found, will be used default
> +     * value of LIBXC_VNUMA_NO_NODE.
> +     */
> +    nodes_claim = libxl__calloc(gc, info->vnodes, sizeof(*nodes_claim));
> +    if ( !nodes_claim )
> +        return ERROR_FAIL;
> +
Same as above, about the 'goto out'.

However, rather than changing the algorithm, I think this last part of
the function (the one below this very comment) should just go away or,
at most, be moved somewhere else.

I'll try as hard as I can to explain how I think this code and the
existing NUMA automatic placement should interact. To achieve that, I
may also consider sending some code, in the form of a draft patch (but
that will likely happen on Monday).

> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->vnodes; i++)
> +        {
> +            unsigned long mem_sz = vnodes_mem[i] << 20;
> +            if ((nodes_claim[n] + mem_sz <= ninfo[n].free) &&
> +                 /* vnode was not set yet. */
> +                 (info->vnuma_vnodemap[i] == LIBXC_VNUMA_NO_NODE ) )
> +            {
> +                info->vnuma_vnodemap[i] = n;
> +                nodes_claim[n] += mem_sz;
> +            }
> +        }
> +    }
> +
So, AFAIUI, in the absence of any indication from the user, you are
trying to 'compact' things as much as possible, i.e., putting the
vnodes on the smallest possible number of pnodes.

This is fine, but does not belong here (more details below).

> +    return 0;
> +}
> +
> +/*
> + * Builds vnode memory regions from configuration info
> + * for vnuma nodes with multiple regions.
> + */
> +static int libxl__build_vnuma_ranges(libxl__gc *gc,
> +                              uint32_t domid,
> +                              /* IN: mem sizes in megabytes */
> +                              libxl_domain_build_info *b_info,
> +                              /* OUT: linux NUMA blocks addresses */
> +                              vmemrange_t **memrange)
>
This is going away, right?

>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_config *d_config, libxl__domain_build_state *state)
>  {
>      libxl_domain_build_info *const info = &d_config->b_info;
>      libxl_ctx *ctx = libxl__gc_owner(gc);
>      char *xs_domid, *con_domid;
> +    struct vmemrange *memrange;
>      int rc;
>  
>      if (xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus) != 0) {
> @@ -240,6 +343,20 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>          return ERROR_FAIL;
>      }
>  
> +    if (libxl__build_vnuma_ranges(gc, domid, info, &memrange) != 0) {
> +        LOG(DETAIL, "Failed to build vnuma nodes memory ranges.\n");
> +        return ERROR_FAIL;
> +
> +    }
> +
> +    /*
> +     * NUMA placement and vNUMA autoplacement handling:
> +     * If numa_placement is set to default, do not use vnode to pnode
> +     * mapping as automatic placement algorithm will find best numa nodes.
> +     * If numa_placement is not used, we can try and use domain vnode
> +     * to pnode mask.
> +     */
> +
>      /*
>       * Check if the domain has any CPU or node affinity already. If not, try
>       * to build up the latter via automatic NUMA placement. In fact, in case
> @@ -298,7 +415,33 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                                     NULL, &cpumap_soft);
>  
>          libxl_bitmap_dispose(&cpumap_soft);
> +
> +        /*
> +         * If vnode_to_pnode mask was defined, dont use it if we automatically
> +         * place domain on NUMA nodes, just give warning.
> +         */
> +        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
> +            LOG(INFO, "Automatic NUMA placement for domain is turned on. \
> +                vnode to physical nodes mapping will not be used.");
> +        }
> +        if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +            LOG(ERROR, "Failed to build vnode to pnode map\n");
> +            return ERROR_FAIL;
> +        }
> +    } else {
> +        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
> +                if (!libxl__vnodemap_is_usable(gc, info)) {
> +                    LOG(ERROR, "Defined vnode to pnode domain map cannot be used.\n");
> +                    return ERROR_FAIL;
> +                }
> +        } else {
> +            if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +                LOG(ERROR, "Failed to build vnode to pnode map.\n");
> +                return ERROR_FAIL;
> +            }
> +        }
>      }
> +
The additional vnuma_autoplacement bool switch is going away, so this
is changing a lot in the next version, and I'll comment directly on
that.

About the interactions between automatic placement and vNUMA, here's
what I think.

There are the following possible situations:
 1) the user does not specify *any* vNUMA topology (or explicitly asks
    for 1 node, which is pointless, but possible) in the config file;
 2) the user specifies a vNUMA topology, and also a vnode to pnode map,
    in the config file;
 3) the user specifies a vNUMA topology, but _not_ a vnode to pnode
    map, in the config file;

Am I missing anything?

In all of the above cases, I think libxc should always receive a well
filled and consistent dom->vnode_to_pnode structure, and build the
domain image according to that. What changes, is where and how we
populate such vnode_to_pnode structure.

Here's how I'd proceed:

1) nothing to do, wrt vNUMA. In my opinion,
   libxl__domain_build_info_setdefault() should already set all the
   vNUMA libxl fields to match this situation, and we should not touch
   them in this case. libxl__dom_vnuma_init() (see below) will do the
   libxl-->libxc 'translation', and we're done.
   Note that this does not have any relationship or dependency with
   automatic NUMA placement, or with the existing numa_placement
   boolean build_info switch. That may or may not be enabled, for
   reasons vNUMA does not have to care or deal with!

2) quite similar to 1: we have to do something, but nothing too fancy or
   complicated. In fact, the values of the vNUMA libxl fields are again
   already set for us to what we want to just pass down to libxc... The
   only difference is that they've been set by xl, rather than by
   libxl__domain_build_info_setdefault(). :-)
   The only *practical* difference is that, as soon as we recognize that
   we're being explicitly passed a vnode to pnode map, we should disable
   automatic placement. That can easily happen by means of the existing
   boolean numa_placement switch, in build_info. What I'd do is set
   it to false from within xl, in the same code branch where the
   vnode_to_pnode config entry is parsed (look at what happens when, in
   xl, we find a hard or soft affinity being explicitly set); there is
   a small sketch of this right after the list.

3) in this case, since the user has not been explicit about vnode to
   pnode placement, we need to figure that out ourselves, and that is
   the job of NUMA automatic placement. What we hence do is leave the
   automatic placement enabled, and modify the algorithm in order for it
   to:
    - take into account the vNUMA topology, when looking for an optimal
      placement solution,
    - provide a valid vnode_to_pnode map, as an output.
   At that point, the resulting vnode_to_pnode map will be passed down
   to libxc, as in all the previous cases.
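
To be concrete about 2), all I mean is something like the below, in xl,
right after the vnode to pnode map coming from the config file
(whatever the entry ends up being called) has been parsed and stored in
b_info->vnuma_vnodemap:

    /* The user told us where to place the vnodes, so automatic NUMA
     * placement must not run and override that. */
    libxl_defbool_set(&b_info->numa_placement, false);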

I understand that this may look complex, but this is probably one of the
cases where the explanation is more complex than the code itself! :-P

In any case, I felt it was important for me to provide my view on how
this should be implemented. I also appreciate that modifying the
automatic placement logic, especially, may be a bit hard for someone
who is not me. So, if Elena wants to have a go at this, please, be my
guest. :-)

In case I don't see anything on xen-devel, as I said, I'll hack up a
prototype on Monday and send it to the list.
 
> +/*
> + * Function fills the xc_dom_image with memory sizes for later
> + * use in domain memory allocator. No regions here are being
> + * filled and used in allocator as the regions do belong to one node.
> + */
> +static int libxl__dom_vnuma_init(struct libxl_domain_build_info *info,
> +                                 struct xc_dom_image *dom)
> +{
> +    errno = ERROR_INVAL;
> +
> +    if (info->vnodes == 0)
> +        return -1;
> +
> +    info->vmemranges = info->vnodes;
> +
> +    dom->vnode_to_pnode = (unsigned int *)malloc(
> +                            info->vnodes * sizeof(*info->vnuma_vnodemap));
> +    dom->numa_memszs = (uint64_t *)malloc(
> +                          info->vnodes * sizeof(*info->vnuma_mem));
> +
This is still libxl, so I think libxl__malloc() and friends are still
the right calls to use. If you don't want the memory you're allocating
to be garbage collected automatically, just use NOGC.

> +    errno = ERROR_FAIL;
> +    if ( dom->numa_memszs == NULL || dom->vnode_to_pnode == NULL ) {
> +        info->vnodes = 0;
> +        if (dom->vnode_to_pnode)
> +            free(dom->vnode_to_pnode);
> +        if (dom->numa_memszs)
> +            free(dom->numa_memszs);
> +        return -1;
> +    }
>
Don't use errno; return only libxl error codes in libxl. However, if
you use the proper libxl__*alloc() variant, this error handling will
just go away entirely (there's a sketch of what I mean below the quoted
code)! :-)

> +    memcpy(dom->numa_memszs, info->vnuma_mem,
> +            sizeof(*info->vnuma_mem) * info->vnodes);
> +    memcpy(dom->vnode_to_pnode, info->vnuma_vnodemap,
> +            sizeof(*info->vnuma_vnodemap) * info->vnodes);
> +
> +    dom->vnodes = info->vnodes;
> +
> +    return 0;
> +}
> +
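
FWIW, putting the two remarks above together, the whole function could
look more or less like this (untested; note that I am adding a gc
parameter, since NOGC needs one in scope, and I am assuming libxc
remains responsible for freeing the two arrays, as it is with the plain
malloc()-s of this patch):

    static int libxl__dom_vnuma_init(libxl__gc *gc,
                                     libxl_domain_build_info *info,
                                     struct xc_dom_image *dom)
    {
        if (info->vnodes == 0)
            return ERROR_INVAL;

        info->vmemranges = info->vnodes;

        /* Not gc-managed (NOGC): the xc_dom_image keeps these around. */
        dom->vnode_to_pnode = libxl__calloc(NOGC, info->vnodes,
                                            sizeof(*info->vnuma_vnodemap));
        dom->numa_memszs = libxl__calloc(NOGC, info->vnodes,
                                         sizeof(*info->vnuma_mem));

        /* libxl__calloc() never returns NULL, so no error handling. */
        memcpy(dom->numa_memszs, info->vnuma_mem,
               sizeof(*info->vnuma_mem) * info->vnodes);
        memcpy(dom->vnode_to_pnode, info->vnuma_vnodemap,
               sizeof(*info->vnuma_vnodemap) * info->vnodes);

        dom->vnodes = info->vnodes;

        return 0;
    }
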
Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

 

