
Re: [Xen-devel] [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity



On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> Still about referencing "history", one usually puts here a quick summary
> of what changed in each patch, wrt the previously submitted version.
>
> This is actually not really important in this case, since _a_lot_ of
> things changed, as it usually happens when moving from RFC to actual
> submissions... But consider doing something like this for next round.

I'll put in a change summary when I send out the next version.

> Style apart, this overall is a very good submission. Since last time,
> you really made huge progress, and are now tackling the issue with a
> good approach. I have a few questions on the approach itself, and some
> comments on the code, but I'd say you are on the right track.

Thanks, very glad to hear it!

>> @@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>      int i, min_rqi = -1, new_cpu;
>>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>>      s_time_t min_avgload;
>> +    cpumask_t temp_mask;
>>
> Still about cpumasks, we really should try to avoid adding new ones on
> the stack like this.

I'll update it, making balance_mask a cpumask_var_t struct member like
the one you've got in credit 1, or something similar.
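
Very roughly, something like the below is what I have in mind (the
names, and whether it ends up per-scheduler or per-cpu like credit 1's
scratch mask, are just placeholders at this point):

    /* Sketch only: a preallocated scratch mask, so choose_cpu() doesn't
     * need a cpumask_t on the stack. */
    static DEFINE_PER_CPU(cpumask_var_t, balance_mask);

    /* Allocated once per pcpu, e.g. when the pcpu comes up... */
    if ( !zalloc_cpumask_var(&per_cpu(balance_mask, cpu)) )
        return -ENOMEM;

    /* ...and then used in place of the on-stack temp_mask, indexed by
     * the cpu currently running the scheduler code: */
    cpumask_t *mask = per_cpu(balance_mask, smp_processor_id());
    cpumask_and(mask, vc->cpu_hard_affinity, &svc->rqd->active);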

>> @@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>       *
>>       * Since one of the runqueue locks is already held, we can't
>>       * just grab the prv lock.  Instead, we'll have to trylock, and
>> -     * do something else reasonable if we fail.
>> +     * return a safe cpu.
>>
> I don't think this change is particularly useful.

You mean the comment change isn't useful? I don't think it should be
left as is, because we're no longer "doing something else reasonable";
we're now doing something specific. I'll reword it to make that clearer.
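
Maybe something along these lines (wording is of course up for
discussion):

    /* Since one of the runqueue locks is already held, we can't just
     * grab the prv lock.  Instead, we'll have to trylock, and if that
     * fails, fall back to a cpu the vcpu is at least allowed to run on. */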

>> @@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>              d2printk("%pv -\n", svc->vcpu);
>>              clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>>          }
>> -        /* Leave it where it is for now.  When we actually pay attention
>> -         * to affinity we'll have to figure something out... */
>> -        return vc->processor;
>> +
>> +        /* Check vc's hard affinity mask with the run queue's active mask. */
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>> +        if ( cpumask_empty(&temp_mask) )
>> +        {
>> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                cpupool_online_cpumask(vc->domain->cpupool));
>> +            return cpumask_any(&temp_mask);
>> +        }
>> +        else
>> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>> +                /* Leave it where it is. */
>> +                return vc->processor;
>> +            else
>> +                /* Same runq, different cpu; affinity must have changed. */
>> +                return cpumask_any(&temp_mask);
>>      }
>>
> I see what you are after, and it should be fine. However, how about
> something like this?
>
>   /* We can't perform the operation properly because of locking issues.
>    * Check whether we can at least leave the vcpu where it is, or if
>    * we need to send it somewhere else.
>    */
>   cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>   if ( unlikely(cpumask_empty(&temp_mask)) )
>     cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>                 cpupool_online_cpumask(vc->domain->cpupool));
>
>   if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>     return vc->processor;
>   else
>     return cpumask_any(&temp_mask);

Looks good to me. I'll change it.

>>      /* First check to see if we're here because someone else suggested a place
>> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          else
>>          {
>>              d2printk("%pv +\n", svc->vcpu);
>> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
>> -            goto out_up;
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                &svc->migrate_rqd->active);
>> +            if ( !cpumask_empty(&temp_mask) )
>> +            {
>> +                new_cpu = cpumask_any(&temp_mask);
>> +                goto out_up;
>> +            }
>> +            /* Fall-through to normal cpu pick */
>>
> So, if we've been asked to move to a cpu where we are not allowed to run
> (i.e., in case temp_mask ends up empty), we just, silently, ignore the
> request. This has the potential of screwing the work of the load
> balancer. If we need to keep this as a sort of "safety catch", then
> fine, but I'd really like to see a lot of efforts made in the load
> balancing code to make this not trigger!
>
> Actually, since we've been asked to do something, and we're doing
> something completely different, I even wonder if, if we can't go to
> migrate_rqd, that "something completely different" is better than doing
> nothing, and if yes, under what conditions and circumstances...

I'm glad you brought this up, because it's not clear to me what the
relationship between migrate() and choose_cpu() is, if any. This block
of code in choose_cpu() only triggers if __CSFLAG_runq_migrate_request
is set; that flag is only set in migrate(), and only if
__CSFLAG_scheduled is set. choose_cpu() is only called in response to
the pick_cpu call in schedule.c's vcpu_migrate(), which doesn't have
anything to do with credit 2's load balancer, balance_load(). It
seems to me that the hard affinity check is needed at the end of this
block in case affinity changes after __CSFLAG_runq_migrate_request
is set in migrate() and before the call to choose_cpu(). Please let me
know what you think.

>>      /* Find the runqueue with the lowest instantaneous load */
>> @@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          rqd = prv->rqd + i;
>>
>>          /* If checking a different runqueue, grab the lock,
>> -         * read the avg, and then release the lock.
>> +         * check hard affinity, read the avg, and then release the lock.
>>           *
>>           * If on our own runqueue, don't grab or release the lock;
>>           * but subtract our own load from the runqueue load to simulate
>>           * impartiality */
>>          if ( rqd == svc->rqd )
>>          {
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
>> +            if ( cpumask_empty(&temp_mask) )
>> +                continue;
>>
> You can use cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) for
> the test. However, can we really be on a runq, and not have hard
> affinity with _any_ of its pcpus? I think no. And if yes, then we should
> try to change that and make it no. :-)

I'll update the code to use cpumask_intersects in all cases like this.
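
So, for example, the check above would just become something like:

    /* Skip runqueues whose active pcpus don't intersect vc's hard
     * affinity at all. */
    if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
        continue;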

If I use xl vcpu-pin to remove all of the pcpus in a vcpu's currently
assigned run queue from its hard affinity, then Xen goes through
vcpu_set_affinity() and vcpu_migrate() in schedule.c, and then
choose_cpu() in credit 2. This is credit 2's first chance to react to
the change in hard affinity, so there is a window here in which a vcpu
can be assigned to a run queue whose pcpus are no longer in its hard
affinity at all. Also, svc->rqd is the run queue that svc is assigned
to, but svc isn't necessarily on that queue waiting to run. Please let
me know what your thoughts are on this.

>>              rqd_avgload = rqd->b_avgload - svc->avgload;
>>          }
>>          else if ( spin_trylock(&rqd->lock) )
>>          {
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
>> +            if ( cpumask_empty(&temp_mask) )
>>
> Again, cpumask_intersects()

Will do.

>> +            {
>> +                spin_unlock(&rqd->lock);
>> +                continue;
>> +            }
>>              rqd_avgload = rqd->b_avgload;
>>              spin_unlock(&rqd->lock);
>>          }
>
>> @@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          }
>>      }
>>
>> -    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
>>      if ( min_rqi == -1 )
>> -        new_cpu = vc->processor;
>> +    {
>> +        /* No runqs found (most likely because of spinlock contention). */
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>> +        if ( cpumask_empty(&temp_mask) )
>> +        {
>> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                cpupool_online_cpumask(vc->domain->cpupool));
>> +            new_cpu = cpumask_any(&temp_mask);
>> +        }
>> +        else
>> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>> +                /* Leave it where it is. */
>> +                new_cpu = vc->processor;
>> +            else
>> +                /* Same runq, different cpu; affinity must have changed. */
>> +                new_cpu = cpumask_any(&temp_mask);
>> +    }
>>
> This is the exact same code as above, isn't it? Put it in a helper and
> call it from both sites.
>
> Well, actually, put my rework of the code above in the helper and call
> it. :-)

Will do.
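
Roughly, I'm picturing a helper like the one below (the name is just a
placeholder), built around your rework and called from both sites; the
on-stack mask would of course become the preallocated scratch mask
discussed earlier:

    /* Pick a reasonable cpu for vc without taking any further locks:
     * stay on vc->processor if hard affinity allows it, else any
     * allowed cpu in the current runqueue, else any online cpu in the
     * hard affinity mask. */
    static int get_safe_pcpu(struct csched2_vcpu *svc)
    {
        struct vcpu *vc = svc->vcpu;
        cpumask_t mask;

        cpumask_and(&mask, vc->cpu_hard_affinity, &svc->rqd->active);
        if ( unlikely(cpumask_empty(&mask)) )
            cpumask_and(&mask, vc->cpu_hard_affinity,
                        cpupool_online_cpumask(vc->domain->cpupool));

        if ( cpumask_test_cpu(vc->processor, &mask) )
            return vc->processor;
        else
            return cpumask_any(&mask);
    }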

>>      else
>>      {
>> -        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +            &prv->rqd[min_rqi].active);
>> +        new_cpu = cpumask_any(&temp_mask);
>>          BUG_ON(new_cpu >= nr_cpu_ids);
>>      }
>>
> Looks ok.
>
>> @@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
>>      }
>>      else
>>      {
>> -        int on_runq=0;
>> -        /* It's not running; just move it */
>> +        /* It's not running; move it if it's on a different runq than trqd. */
>> +        bool_t on_runq = 0;
>> +        cpumask_t temp_mask;
>> +
>>
> Ditto.
>
>>          d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
>> +
>> +        /* Re-assign vcpu's processor, if necessary. */
>> +        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
>> +        svc->vcpu->processor = cpumask_any(&temp_mask);
>> +        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
>> +            svc->vcpu->processor = cpumask_any(&temp_mask);
>> +
> Not following: you assign processor one of the bits set in temp_mask.
> Then, if such new value of processor is not in temp_mask, you assign to
> it cpumask_any(&temp_mask), i.e., the exact same thing as before...
>
> Am I missing something obvious? What is it that you are up for here?
> If the goal is to do something sane and safe if trqd->active and
> svc->vcpu->cpu_hard_affinity does not intersect, then, we certainly need
> it, but I really can't tell how this is supposed
> to handle such a situation.

Not sure what I was thinking here, whoops! The point is to assign
svc->vcpu->processor to a pcpu in run queue trqd that is in svc's hard
affinity. If for some reason no pcpu satisfies hard affinity, it should
probably be a BUG_ON, because migrate() shouldn't be able to decide not
to do the migration. What do you think, BUG_ON() or return without
migrating?

> BTW, after an operation like cpumask_first(), cpumask_next(),
> cpumask_any(), etc., a quick way to check whether the operation
> succeeded is to check that the returned value is < nr_cpu_ids (see the
> implementation of those operations). If that is the case, then you can
> be sure that the bit returned is one of the bits set in the mask.

I'll keep that in mind while fixing this block.
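
In other words, something like this is what I should have written here
(modulo the BUG_ON-vs-bail question above, and with temp_mask eventually
becoming the preallocated scratch mask):

    /* Pick a processor for svc in the target runqueue that respects
     * hard affinity. */
    cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
    svc->vcpu->processor = cpumask_any(&temp_mask);
    BUG_ON(svc->vcpu->processor >= nr_cpu_ids);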

>>          if ( __vcpu_on_runq(svc) )
>> +            on_runq = 1;
>> +
>> +        /* If the runqs are different, move svc to trqd. */
>> +        if ( svc->rqd != trqd )
>>          {
>> -            __runq_remove(svc);
>> -            update_load(ops, svc->rqd, svc, -1, now);
>> -            on_runq=1;
>> +            if ( on_runq )
>> +            {
>> +                __runq_remove(svc);
>> +                update_load(ops, svc->rqd, svc, -1, now);
>> +            }
>> +            __runq_deassign(svc);
>> +            __runq_assign(svc, trqd);
>> +            if ( on_runq )
>> +            {
>> +                update_load(ops, svc->rqd, svc, 1, now);
>> +                runq_insert(ops, svc->vcpu->processor, svc);
>> +            }
>>          }
>> -        __runq_deassign(svc);
>> -        svc->vcpu->processor = cpumask_any(&trqd->active);
>> -        __runq_assign(svc, trqd);
>> +
>>
> Mmm.. I do not like the way the code looks after this is applied. Before
> the patch, it was really straightforward and easy to understand. Now
> it's way more involved. Can you explain why this rework is necessary?
> For now do it here, then we'll see whether and how to put that into a
> doc comment.

When I was testing, if I changed a vcpu's hard affinity from its
current pcpu to another pcpu in the same run queue, the VM would stop
executing. I'll go back and look at this, because I see what you wrote
below about wake being called by vcpu_migrate() in schedule.c; the vcpu
shouldn't freeze on the old cpu, it should wake on the new cpu whether
or not the run queue changed. I'll address this again after some
testing.

> I know there is a comment on the call site of this function, but that
> does not help much either (as I say there).
>
>>          if ( on_runq )
>>          {
>> -            update_load(ops, svc->rqd, svc, 1, now);
>> -            runq_insert(ops, svc->vcpu->processor, svc);
>>              runq_tickle(ops, svc->vcpu->processor, svc, now);
>>          }
>>      }
>
>> @@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>>      int i, max_delta_rqi = -1;
>>      struct list_head *push_iter, *pull_iter;
>> +    cpumask_t temp_mask;
>>
> ...
>
>>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
>>
>> @@ -1250,6 +1318,11 @@ retry:
>>      for_each_cpu(i, &prv->active_queues)
>>      {
>>          s_time_t delta;
>> +        /* true if there are no vcpus to push due to hard affinity */
>> +        bool_t ha_no_push = 1;
>> +        /* true if there are no vcpus to pull due to hard affinity */
>> +        bool_t ha_no_pull = 1;
>> +        struct list_head *iter;
>>
>>          st.orqd = prv->rqd + i;
>>
>> @@ -1257,6 +1330,47 @@ retry:
>>               || !spin_trylock(&st.orqd->lock) )
>>              continue;
>>
>> +        /*
>> +         * If due to hard affinity there are no vcpus that can be
>> +         * pulled or pushed, move to the next runq in the loop.
>> +         */
>> +
>> +        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
>> +        list_for_each( iter, &st.lrqd->svc )
>> +        {
>> +            struct csched2_vcpu * svc =
>> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
>> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
>> +                &st.orqd->active);
>> +            if (!cpumask_empty(&temp_mask))
>> +            {
>> +                /* vcpu can be pushed from lrqd to ordq. */
>> +                ha_no_push = 0;
>> +                break;
>> +            }
>> +        }
>> +
>>
> So, here, for each runqueue j, we scan all the vcpus in lrqd, to see if
> at least one can migrate to j.
>
> This means that, if we have 4 runqueues, we potentially loop through all
> the vcpus in lrqd 4 times, right?

Three times, I think, but yes, lots of looping. It loops through
lrqd once for each additional run queue orqd.

>> +        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
>> +        list_for_each( iter, &st.orqd->svc )
>> +        {
>> +            struct csched2_vcpu * svc =
>> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
>> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
>> +                &st.lrqd->active);
>> +            if (!cpumask_empty(&temp_mask))
>> +            {
>> +                /* vcpu can be pulled from orqd to lrdq. */
>> +                ha_no_pull = 0;
>> +                break;
>> +            }
>> +        }
>> +
>>
> And here, for each runq j, we scan all its vcpus to see if at least one
> can run on lrqd.
>
> In summary, this means (at least potentially) looping through all the
> active vcpus, some more than one time, which sounds pretty expensive.
> Add to this the fact that, since we are releasing and re-acquiring the
> locks, things may even change under our feet, making all that sort of
> useless.

Yes, I was aware that things could change between dropping and
re-acquiring the locks, potentially making all this looping pointless.

> I like the idea of filtering out runqueues that do not have any
> suitable vcpus, due to hard affinity issues, but done like this, it
> looks really expensive. Can we perhaps think of a data structure to make
> it cheaper?

I will think about a solution with a data structure.

> In the meanwhile, I'd go for a more 'greedy' approach. Off the top of
> my head, I think I would leave this loop alone, and just have it select
> a candidate runqueue. Below, where the vcpus are considered, I'd filter
> out the inappropriate one (as this patch does also). If it turns out
> that there are no suitable vcpus in the runqueue the loop selected, I'd
> "goto retry", perhaps adding a maximum number of attempts, or something
> like that.

I think balance_load() needs some work beyond these hard affinity
changes. I noticed some odd behavior where it would migrate a vcpu
away from a runqueue that only had that one vcpu running on it. Maybe
it hasn't been tested much because of the single run queue issue,
which your patches fix? Anyway, I will examine this section further
and see what I can come up with.
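
As a first cut at the 'greedy' approach, I picture something roughly
like the fragment below: keep the runqueue-selection loop as it is,
filter candidates by hard affinity where the vcpus are actually
considered, and bound the retries (the retries counter and the
MAX_RETRIES value are made up for illustration):

    /* While scanning push/pull candidates, skip vcpus that hard
     * affinity keeps out of the other runqueue entirely. */
    if ( !cpumask_intersects(svc->vcpu->cpu_hard_affinity,
                             &st.orqd->active) )
        continue;

    /* ...and if the runqueue we picked yields no candidates at all,
     * go pick another one, a bounded number of times. */
    if ( !st.best_push_svc && !st.best_pull_svc
         && ++retries < MAX_RETRIES )
        goto retry;

I'll see how that plays out against the odd behavior I mentioned.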

>> @@ -1338,11 +1458,17 @@ retry:
>>              {
>>                  __update_svc_load(ops, pull_svc, 0, now);
>>              }
>> -
>> +
>>
> Avoid things like these. I appreciate that this is an actual
> improvement, but let's not mix coding style fixes with new features. :-)

Got it, I will take this out.

>> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>>
>>      trqd = RQD(ops, new_cpu);
>>
>> -    if ( trqd != svc->rqd )
>> -        migrate(ops, svc, trqd, NOW());
>> +    /*
>> +     * Call migrate even if svc->rqd == trqd; there may have been an
>> +     * affinity change that requires a call to runq_tickle for a new
>> +     * processor within the same run queue.
>> +     */
>> +    migrate(ops, svc, trqd, NOW());
>>  }
>>
> As said above, I don't think I see the reason for this. Affinity
> changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, force
> the vcpu through a sleep/wakeup cycle (it calls vcpu_sleep_nosync()
> directly, while vcpu_wake() is called inside vcpu_migrate()).
>
> So, looks like what you are after (i.e., runq_tickle being called)
> should happen already, isn't it? Are there other reasons you need it
> for?

As I said above, I will look at this again. My VMs were getting
stuck after certain hard affinity changes. I'll roll back some of
these changes and test again.

>>  static int
>> @@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>>      {
>>          struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
>>
>> +        /*
>> +         * If vcpu is not allowed to run on this processor due to
>> +         * hard affinity, continue to the next vcpu on the queue.
>> +         */
>> +        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
>> +            continue;
>> +
>>
> Looks ok. The comment explains a bit too much of the "what", which one
> can easily see from the code. Comments (most of the time :-) ) should
> be about the "why" things are done in a certain way.
>
> Here, the relevant piece of information is that "Only consider vcpus
> allowed to run on this processor", so I'd just say so. The fact that you
> keep scanning the runqueue is pretty evident. :-D

Understood, I'll fix the comment here.
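
So, for instance, just:

    /* Only consider vcpus allowed to run on this processor. */
    if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
        continue;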

Thank you for the feedback! I agree that the hard affinity piece needs
to be solid before tackling soft affinity. I'm going to hold off on
replying to your feedback on my second patch until I get this one
fixed. I hope that's OK.

Justin Weaver
University of Hawaii at Manoa

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

