
Re: [Xen-devel] [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity



On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
> Signed-off-by: Justin T. Weaver <jtweaver@xxxxxxxxxx>
>
Still on the subject of referencing "history": one usually puts here a
quick summary of what changed in each patch with respect to the
previously submitted version.

This is actually not really important in this case, since _a_lot_ of
things changed, as usually happens when moving from an RFC to an actual
submission... But consider doing something like this for the next round.

Style aside, this is overall a very good submission. Since last time,
you have really made huge progress, and you are now tackling the issue
with a good approach. I have a few questions on the approach itself, and
some comments on the code, but I'd say you are on the right track.

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 1bcd6c0..90e9cdf 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -501,8 +501,9 @@ runq_tickle(const struct scheduler *ops, unsigned int 
> cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>      
> -    /* Get a mask of idle, but not tickled */
> +    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
>
This looks ok.

>      /* If it's not empty, choose one */
>      i = cpumask_cycle(cpu, &mask);
> @@ -513,9 +514,11 @@ runq_tickle(const struct scheduler *ops, unsigned int 
> cpu, struct csched2_vcpu *
>      }
>  
>      /* Otherwise, look for the non-idle cpu with the lowest credit,
> -     * skipping cpus which have been tickled but not scheduled yet */
> +     * skipping cpus which have been tickled but not scheduled yet,
> +     * that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->active, &rqd->idle);
>      cpumask_andnot(&mask, &mask, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
>
And this too.

I wonder whether there is a way to group the cpumask operations
differently to save at least one of them (yes, they may be quite costly
on large boxes), but let's leave this for later...
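Just to give an idea of what I mean (a sketch only, and it needs one
more scratch mask, so it may well not be worth it): compute the "in hard
affinity and not tickled" mask once and reuse it, e.g.:

  /* pcpus in new's hard affinity that have not been tickled yet */
  cpumask_andnot(&not_tickled, new->vcpu->cpu_hard_affinity, &rqd->tickled);

  /* idle ones among them */
  cpumask_and(&mask, &not_tickled, &rqd->idle);

  /* ... and, later, for the non-idle case */
  cpumask_and(&mask, &not_tickled, &rqd->active);
  cpumask_andnot(&mask, &mask, &rqd->idle);

That would be 4 cpumask operations instead of 5. Anyway, as said,
later...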

> @@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>      int i, min_rqi = -1, new_cpu;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> +    cpumask_t temp_mask;
>  
Still on cpumasks: we really should try to avoid adding new ones on the
stack like this.
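One way of doing that (just a sketch, the name is made up) is a per-pCPU
scratch mask, allocated once and reused:

  static DEFINE_PER_CPU(cpumask_t, csched2_scratch_mask);

and then, at the usage sites, instead of an on-stack cpumask_t:

  cpumask_t *temp_mask = &per_cpu(csched2_scratch_mask, smp_processor_id());

  cpumask_and(temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);

We can discuss the details, but anything is better than NR_CPUS/8 bytes
on the stack.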

> @@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>       *
>       * Since one of the runqueue locks is already held, we can't
>       * just grab the prv lock.  Instead, we'll have to trylock, and
> -     * do something else reasonable if we fail.
> +     * return a safe cpu.
>  
I don't think this change is particularly useful.

> @@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>              d2printk("%pv -\n", svc->vcpu);
>              clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>          }
> -        /* Leave it where it is for now.  When we actually pay attention
> -         * to affinity we'll have to figure something out... */
> -        return vc->processor;
> +
> +        /* Check vc's hard affinity mask with the run queue's active mask. */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            return cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                return vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                return cpumask_any(&temp_mask);
>      }
>  
I see what you are after, and it should be fine. However, how about
something like this?

  /* We can't perform the operation properly because of locking issues.
   * Check whether we can at least leave the vcpu where it is, or if
   * we need to send it somewhere else.
   */
  cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
  if ( unlikely(cpumask_empty(&temp_mask)) )
    cpumask_and(&temp_mask, vc->cpu_hard_affinity,
                cpupool_online_cpumask(vc->domain->cpupool));

  if ( cpumask_test_cpu(vc->processor, &temp_mask) )
    return vc->processor;
  else
    return cpumask_any(&temp_mask);

>      /* First check to see if we're here because someone else suggested a 
> place
> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>          else
>          {
>              d2printk("%pv +\n", svc->vcpu);
> -            new_cpu = cpumask_cycle(vc->processor, 
> &svc->migrate_rqd->active);
> -            goto out_up;
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                &svc->migrate_rqd->active);
> +            if ( !cpumask_empty(&temp_mask) )
> +            {
> +                new_cpu = cpumask_any(&temp_mask);
> +                goto out_up;
> +            }
> +            /* Fall-through to normal cpu pick */
>
So, if we've been asked to move to a cpu where we are not allowed to run
(i.e., in case temp_mask ends up empty), we just silently ignore the
request. This has the potential of screwing up the work of the load
balancer. If we need to keep this as a sort of "safety catch", then
fine, but I'd really like to see a lot of effort made in the load
balancing code to make sure this does not trigger!

Actually, since we've been asked to do something and we're doing
something completely different, I even wonder whether, if we can't go to
migrate_rqd, that "something completely different" is better than doing
nothing, and if yes, under what conditions and circumstances...

>      /* Find the runqueue with the lowest instantaneous load */
> @@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>          rqd = prv->rqd + i;
>  
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
>           * impartiality */
>          if ( rqd == svc->rqd )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )
> +                continue;
>
You can use cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) for
the test. However, can we really be on a runq and not have hard affinity
with _any_ of its pcpus? I think not. And if we can, then we should
try to change that and make it impossible. :-)
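I.e., here, just:

  if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
      continue;

which spares both the temporary mask and the explicit emptiness check.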

>              rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )
>
Again, cpumask_intersects()

> +            {
> +                spin_unlock(&rqd->lock);
> +                continue;
> +            }
>              rqd_avgload = rqd->b_avgload;
>              spin_unlock(&rqd->lock);
>          }

> @@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>          }
>      }
>  
> -    /* We didn't find anyone (most likely because of spinlock contention); 
> leave it where it is */
>      if ( min_rqi == -1 )
> -        new_cpu = vc->processor;
> +    {
> +        /* No runqs found (most likely because of spinlock contention). */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            new_cpu = cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                new_cpu = vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                new_cpu = cpumask_any(&temp_mask);
> +    }
>
This is the exact same code as above, isn't it? Put it in a helper and
call it from both sites.

Well, actually, put my rework of the code above in the helper and call
it. :-)
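Something like the following, perhaps (the name is made up, pick
whatever you like):

  /*
   * Return a pcpu where vc can run: vc->processor if possible, or else
   * any pcpu of its current runqueue (or, failing that, of its cpupool)
   * that it has hard affinity with.
   */
  static int get_fallback_cpu(struct csched2_vcpu *svc)
  {
      struct vcpu *vc = svc->vcpu;
      cpumask_t mask; /* or, better, a scratch mask, as said above */

      cpumask_and(&mask, vc->cpu_hard_affinity, &svc->rqd->active);
      if ( unlikely(cpumask_empty(&mask)) )
          cpumask_and(&mask, vc->cpu_hard_affinity,
                      cpupool_online_cpumask(vc->domain->cpupool));

      if ( cpumask_test_cpu(vc->processor, &mask) )
          return vc->processor;

      return cpumask_any(&mask);
  }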

>      else
>      {
> -        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +            &prv->rqd[min_rqi].active);
> +        new_cpu = cpumask_any(&temp_mask);
>          BUG_ON(new_cpu >= nr_cpu_ids);
>      }
>
Looks ok.
 
> @@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
>      }
>      else
>      {
> -        int on_runq=0;
> -        /* It's not running; just move it */
> +        /* It's not running; move it if it's on a different runq than trqd. 
> */
> +        bool_t on_runq = 0;
> +        cpumask_t temp_mask;
> +
>
Ditto.

>          d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
> +
> +        /* Re-assign vcpu's processor, if necessary. */
> +        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
> +        svc->vcpu->processor = cpumask_any(&temp_mask);
> +        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
> +            svc->vcpu->processor = cpumask_any(&temp_mask);
> +
Not following: you assign processor one of the bits set in temp_mask.
Then, if the new value of processor is not in temp_mask, you assign to
it cpumask_any(&temp_mask), i.e., the exact same thing as before...

Am I missing something obvious? What are you trying to achieve here?
If the goal is to do something sane and safe when trqd->active and
svc->vcpu->cpu_hard_affinity do not intersect, then we certainly need
it, but I really can't tell how this is supposed to handle such a
situation.

BTW, after an operation like cpumask_first(), cpumask_next(),
cpumask_any(), etc., a quick way to check whether the operation
succeeded is to check whether the returned value is < nr_cpu_ids (see
the implementation of those operations). If it is, then you can be sure
that the bit returned is one of the bits set in the mask.
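E.g., something like this (purely illustrative) deals with an empty
intersection without computing the mask twice:

  unsigned int new_proc;

  cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
  new_proc = cpumask_any(&temp_mask);
  if ( new_proc < nr_cpu_ids )
      svc->vcpu->processor = new_proc;
  else
  {
      /* The masks do not intersect: decide here what "safe" means,
       * which is exactly the open question above. */
  }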

>          if ( __vcpu_on_runq(svc) )
> +            on_runq = 1;
> +
> +        /* If the runqs are different, move svc to trqd. */
> +        if ( svc->rqd != trqd )
>          {
> -            __runq_remove(svc);
> -            update_load(ops, svc->rqd, svc, -1, now);
> -            on_runq=1;
> +            if ( on_runq )
> +            {
> +                __runq_remove(svc);
> +                update_load(ops, svc->rqd, svc, -1, now);
> +            }
> +            __runq_deassign(svc);
> +            __runq_assign(svc, trqd);
> +            if ( on_runq )
> +            {
> +                update_load(ops, svc->rqd, svc, 1, now);
> +                runq_insert(ops, svc->vcpu->processor, svc);
> +            }
>          }
> -        __runq_deassign(svc);
> -        svc->vcpu->processor = cpumask_any(&trqd->active);
> -        __runq_assign(svc, trqd);
> +
>
Mmm... I do not like the way the code looks after this is applied.
Before the patch, it was really straightforward and easy to understand.
Now it's way more involved. Can you explain why this rework is
necessary? For now, explain it here; then we'll see whether and how to
put that into a doc comment.

I know there is a comment at the call site of this function, but that
does not help much either (as I say there).

>          if ( on_runq )
>          {
> -            update_load(ops, svc->rqd, svc, 1, now);
> -            runq_insert(ops, svc->vcpu->processor, svc);
>              runq_tickle(ops, svc->vcpu->processor, svc, now);
>          }
>      }

> @@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, 
> int cpu, s_time_t now)
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, max_delta_rqi = -1;
>      struct list_head *push_iter, *pull_iter;
> +    cpumask_t temp_mask;
>  
...

>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
>      
> @@ -1250,6 +1318,11 @@ retry:
>      for_each_cpu(i, &prv->active_queues)
>      {
>          s_time_t delta;
> +        /* true if there are no vcpus to push due to hard affinity */
> +        bool_t ha_no_push = 1;
> +        /* true if there are no vcpus to pull due to hard affinity */
> +        bool_t ha_no_pull = 1;
> +        struct list_head *iter;
>          
>          st.orqd = prv->rqd + i;
>  
> @@ -1257,6 +1330,47 @@ retry:
>               || !spin_trylock(&st.orqd->lock) )
>              continue;
>  
> +        /*
> +         * If due to hard affinity there are no vcpus that can be
> +         * pulled or pushed, move to the next runq in the loop.
> +         */
> +
> +        /* See if there are any vcpus that can be pushed from lrqd to orqd. 
> */
> +        list_for_each( iter, &st.lrqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.orqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pushed from lrqd to ordq. */
> +                ha_no_push = 0;
> +                break;
> +            }
> +        }
> +
>
So, here, for each runqueue j, we scan all the vcpus in lrqd, to see if
at least one can migrate to j.

This means that, if we have 4 runqueues, we potentially loop through all
the vcpus in lrqd 4 times, right?

> +        /* See if there are any vcpus that can be pulled from orqd to lrqd. 
> */
> +        list_for_each( iter, &st.orqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.lrqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pulled from orqd to lrdq. */
> +                ha_no_pull = 0;
> +                break;
> +            }
> +        }
> +
>
And here, for each runq j, we scan all its vcpus to see if at least one
can run on lrqd.

In summary, this means (at least potentially) looping through all the
active vcpus, some of them more than once, which sounds pretty
expensive. Add to this the fact that, since we are releasing and
re-acquiring the locks, things may even change under our feet, making
all of that rather useless.

I like the idea of filtering out runqueues that do not have any suitable
vcpus due to hard affinity, but done like this, it looks really
expensive. Can we perhaps think of a data structure that makes it
cheaper?

In the meantime, I'd go for a more 'greedy' approach. Off the top of my
head, I think I would leave this loop alone and just have it select a
candidate runqueue. Below, where the vcpus are considered, I'd filter
out the inappropriate ones (as this patch already does). If it turns out
that there are no suitable vcpus in the runqueue the loop selected, I'd
"goto retry", perhaps adding a maximum number of attempts, or something
like that (see the sketch below).

What do you think?
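To make the last bit more concrete, here is a very rough sketch of the
retry idea (the counter and its bound are totally made up):

  #define CSCHED2_MAX_LB_RETRIES 3 /* arbitrary */
  unsigned int lb_retries = 0;

 retry:
  /* ... existing code selecting the candidate runqueue, unchanged ... */

  /* When looking for push/pull candidates below, skip any vcpu whose
   * hard affinity does not intersect the other runqueue's active mask.
   * If, at the end of that, nothing suitable was found, try again: */
  if ( st.best_push_svc == NULL && st.best_pull_svc == NULL &&
       ++lb_retries < CSCHED2_MAX_LB_RETRIES )
      goto retry;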

> @@ -1338,11 +1458,17 @@ retry:
>              {
>                  __update_svc_load(ops, pull_svc, 0, now);
>              }
> -        
> +
>
Avoid things like these. I appreciate that this is an actual
improvement, but let's not mix coding style fixes with new features. :-)

> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>  
>      trqd = RQD(ops, new_cpu);
>  
> -    if ( trqd != svc->rqd )
> -        migrate(ops, svc, trqd, NOW());
> +    /*
> +     * Call migrate even if svc->rqd == trqd; there may have been an
> +     * affinity change that requires a call to runq_tickle for a new
> +     * processor within the same run queue.
> +     */
> +    migrate(ops, svc, trqd, NOW());
>  }
>  
As said above, I don't think I see the reason for this. Affinity
changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, force
the vcpu through a sleep/wakeup cycle (vcpu_sleep_nosync() is called
directly, while vcpu_wake() is called inside vcpu_migrate()).

So it looks like what you are after (i.e., runq_tickle being called)
should already happen, shouldn't it? Are there other reasons you need
it?
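For reference, the flow I am thinking of is roughly the following
(quoting from memory, so do double check schedule.c):

  /* in vcpu_set_affinity(): */
  set_bit(_VPF_migrating, &v->pause_flags);
  ...
  vcpu_sleep_nosync(v);
  vcpu_migrate(v); /* picks the new cpu and calls vcpu_wake() */

and it is that vcpu_wake() which ends up causing runq_tickle to be
called.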

>  static int
> @@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>      {
>          struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, 
> runq_elem);
>  
> +        /*
> +         * If vcpu is not allowed to run on this processor due to
> +         * hard affinity, continue to the next vcpu on the queue.
> +         */
> +        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
> +            continue;
> +
>
Looks ok. The comment explains the "what" a bit too much, which one can
easily see from the code. Comments (most of the time :-) ) should be
about "why" things are done in a certain way.

Here, the relevant piece of information is "Only consider vcpus allowed
to run on this processor", so I'd just say so. The fact that you are
continuing to scan the runqueue is pretty evident. :-D
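I.e., just:

  /* Only consider vcpus allowed to run on this processor. */
  if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
      continue;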

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
