[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] xen: make sure the node-affinity is always updated



On Thu, Sep 12, 2013 at 4:57 PM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> in particular when a domain moves from a cpupool to another, and
> when the list of cpus in a cpupool changes.
>
> Not doing that may result in the domain's node-affinity mask
> (as retrieved by csched_balance_cpumask(..CSCHED_BALANCE_NODE_AFFINITY..) )
> and online mask (as retrieved by cpupool_scheduler_cpumask() ) to have
> an empty intersection.
>
> This in turn means that, in _csched_cpu_pick(), we think it is
> legitimate to do a node-affinity load balancing step and, when
> running this:
>
>     ...
>     /* Pick an online CPU from the proper affinity mask */
>     csched_balance_cpumask(vc, balance_step, &cpus);
>     cpumask_and(&cpus, &cpus, online);
>     ...
>
> we end up with an empty cpumask (in cpus). At this point, in
> the following code:
>
>     ....
>     /* If present, prefer vc's current processor */
>     cpu = cpumask_test_cpu(vc->processor, &cpus)
>             ? vc->processor
>             : cpumask_cycle(vc->processor, &cpus);
>     ....
>
> an ASSERT (from inside cpumask_cycle() ) triggers like this:
>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
> (XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
> (XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
> (XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
> (XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
> (XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
> (XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
> (XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
> (XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'cpu < nr_cpu_ids' failed at 
> /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481
>
> It is for example sufficient to have a domain with node-affinity
> to NUMA node 1 running, and issueing a `xl cpupool-numa-split'
> would make the above happen. That is because, by default, all
> the existing domains remain assigned to the first cpupool, and
> it now (after the cpupool-numa-split) only includes NUMA node 0.
>
> This change prevents that, by making sure that the node-affinity
> mask is reset and automatically recomputed, if that is the case,
> in domain_update_node_affinity(), as well as making sure that
> such function is invoked every time that is required.
>
> In fact, it makes much more sense to reset the node-affinity
> when either it and the domain's vcpu-affinity or the cpumap of
> the domain's cpupool are inconsistent, rather than not touching
> it, as it was before. Also, updating the domain's node affinity
> on the cpupool_unassing_cpu() path, as it already happens on
> the cpupool_assign_cpu_locked() path, was something we were
> missing anyway, even independently from the ASSERT triggering.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
> Cc: Jan Beulich <JBeulich@xxxxxxxx>
> Cc: Keir Fraser <keir@xxxxxxx>
> Cc: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
> ---
>  xen/common/cpupool.c |    8 ++++++++
>  xen/common/domain.c  |   40 +++++++++++++++++++++++-----------------
>  2 files changed, 31 insertions(+), 17 deletions(-)
>
> diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
> index 2164a9f..23e461d 100644
> --- a/xen/common/cpupool.c
> +++ b/xen/common/cpupool.c
> @@ -355,6 +355,14 @@ int cpupool_unassign_cpu(struct cpupool *c, unsigned int 
> cpu)
>      atomic_inc(&c->refcnt);
>      cpupool_cpu_moving = c;
>      cpumask_clear_cpu(cpu, c->cpu_valid);
> +
> +    rcu_read_lock(&domlist_read_lock);
> +    for_each_domain_in_cpupool(d, c)
> +    {
> +        domain_update_node_affinity(d);
> +    }
> +    rcu_read_unlock(&domlist_read_lock);
> +
>      spin_unlock(&cpupool_lock);
>
>      work_cpu = smp_processor_id();
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 5999779..f708b3c 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d)
>      cpumask_var_t cpumask;
>      cpumask_var_t online_affinity;
>      const cpumask_t *online;
> -    nodemask_t nodemask = NODE_MASK_NONE;
>      struct vcpu *v;
>      unsigned int node;
>
> @@ -374,28 +373,35 @@ void domain_update_node_affinity(struct domain *d)
>          cpumask_or(cpumask, cpumask, online_affinity);
>      }
>
> -    if ( d->auto_node_affinity )
> -    {
> -        /* Node-affinity is automaically computed from all vcpu-affinities */
> -        for_each_online_node ( node )
> -            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_set(node, nodemask);
> -
> -        d->node_affinity = nodemask;
> -    }
> -    else
> +    if ( !d->auto_node_affinity )
>      {
>          /* Node-affinity is provided by someone else, just filter out cpus
>           * that are either offline or not in the affinity of any vcpus. */
> -        nodemask = d->node_affinity;
>          for_each_node_mask ( node, d->node_affinity )
>              if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_clear(node, nodemask);//d->node_affinity);
> +                node_clear(node, d->node_affinity);
> +
> +        /*
> +         * If we end up with an empty node-affinity, e.g., because the user
> +         * specified an incompatible vcpu-affinity, or moved the domain to
> +         * a different cpupool, reset the node-affinity and re-calculate it
> +         * (in the body of the if() below).
> +         *
> +         * This is necessary to prevent the schedulers from attempting
> +         * node-affinity load balancing on empty cpumasks, with (potentially)
> +         * nasty effects (increased overhead or even crash!).
> +         */
> +        if ( nodes_empty(d->node_affinity) )
> +            d->auto_node_affinity = 1;

Hmm -- silently erasing one set of parameters and changing it to
"auto" doesn't seem that great from a UI perspective -- "principle of
least surprise", and all that.

Also, consider someone who has already set both a node and a vcpu
affinity, and decides they now want to change it such that the new
vcpu affinity doesn't intersect with the old node affinity, and does
the following:

1. Set new node affinity.  The update notices that the node & vcpu
affinity clash, and effectively removes the node affinity
2. Set new vcpu affinity.

Now we have the desired vcpu affinity, but the node affinity has been
erased.  To avoid this you have to set the vcpu affinity first, which
isn't exactly obvious.  (Either that, or #1 will not actually remove
the node affinity, and we still have the potential for a disjoint set
tripping over an assert if someone only updates node affinity but not
vcpu affinity.)

Since node affinity is just advisory, wouldn't it make more sense to
make the scheduling system tolerate a disjoint set of affinities?  If
we could make cpu_cycle indicate somehow that the mask is empty,
instead of crashing, then we can detect that and skip the "look for an
idle processor within the NUMA affinity" step.

For cpupool_numa split, it's probably a good idea to do one of the following:
1. Detect the numa affinity and place it in the appropriate pool
2. Detect the conflict and break the affinity from xl (giving a
warning that you're doing so)
3. Detect the conflict and just give a warning.

But that's a UI polish thing, rather than a bug fix, so is slightly
lower priority.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.