Re: [Xen-devel] [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
>
> Signed-off-by: Justin T. Weaver <jtweaver@xxxxxxxxxx>
>
Still about referencing "history": one usually puts a quick summary here of what changed in each patch, wrt the previously submitted version. This is actually not really important in this case, since _a_lot_ of things changed, as usually happens when moving from RFC to actual submission... But consider doing something like this for the next round.

Style apart, overall this is a very good submission. Since last time you have really made huge progress, and you are now tackling the issue with a good approach. I have a few questions on the approach itself, and some comments on the code, but I'd say you are on the right track.

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 1bcd6c0..90e9cdf 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -501,8 +501,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>
> -    /* Get a mask of idle, but not tickled */
> +    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);

This looks ok.

>      /* If it's not empty, choose one */
>      i = cpumask_cycle(cpu, &mask);
> @@ -513,9 +514,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>      }
>
>      /* Otherwise, look for the non-idle cpu with the lowest credit,
> -     * skipping cpus which have been tickled but not scheduled yet */
> +     * skipping cpus which have been tickled but not scheduled yet,
> +     * that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->active, &rqd->idle);
>      cpumask_andnot(&mask, &mask, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);

And this too. I wonder whether there is a way to group the cpumask operations differently to save at least one of them (yes, they may be quite costly on large boxes), but let's leave this for later...
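For instance, since both spots want "hard affinity and not tickled", that part could be hoisted and computed only once, which saves one cpumask operation overall. Just a sketch: it needs a second scratch mask (so it may not be a win in the end), 'allowed_mask' is a made-up name for wherever such a mask would live, and it assumes nothing touches rqd->idle or rqd->tickled between the two spots, which I believe is the case here:

    /* Hoist the part both checks below share: allowed to run and not tickled. */
    cpumask_andnot(&allowed_mask, new->vcpu->cpu_hard_affinity, &rqd->tickled);

    /* Idle, not tickled, allowed: one operation instead of two. */
    cpumask_and(&mask, &rqd->idle, &allowed_mask);

    /* ... the "pick one if not empty" logic in between stays the same ... */

    /* Non-idle, not tickled, allowed: two operations instead of three. */
    cpumask_andnot(&mask, &rqd->active, &rqd->idle);
    cpumask_and(&mask, &mask, &allowed_mask);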
> @@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>      int i, min_rqi = -1, new_cpu;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> +    cpumask_t temp_mask;

Still about cpumasks, we really should try to avoid adding new ones on the stack like this.

> @@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>       *
>       * Since one of the runqueue locks is already held, we can't
>       * just grab the prv lock. Instead, we'll have to trylock, and
> -     * do something else reasonable if we fail.
> +     * return a safe cpu.

I don't think this change is particularly useful.

> @@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>              d2printk("%pv -\n", svc->vcpu);
>              clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>          }
> -        /* Leave it where it is for now.  When we actually pay attention
> -         * to affinity we'll have to figure something out... */
> -        return vc->processor;
> +
> +        /* Check vc's hard affinity mask with the run queue's active mask. */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            return cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                return vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                return cpumask_any(&temp_mask);
>      }

I see what you are after, and it should be fine. However, how about something like this?

    /* We can't perform the operation properly because of locking issues.
     * Check whether we can at least leave the vcpu where it is, or if
     * we need to send it somewhere else. */
    cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
    if ( unlikely(cpumask_empty(&temp_mask)) )
        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
                    cpupool_online_cpumask(vc->domain->cpupool));

    if ( cpumask_test_cpu(vc->processor, &temp_mask) )
        return vc->processor;
    else
        return cpumask_any(&temp_mask);

>      /* First check to see if we're here because someone else suggested a place
> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          else
>          {
>              d2printk("%pv +\n", svc->vcpu);
> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
> -            goto out_up;
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                &svc->migrate_rqd->active);
> +            if ( !cpumask_empty(&temp_mask) )
> +            {
> +                new_cpu = cpumask_any(&temp_mask);
> +                goto out_up;
> +            }
> +            /* Fall-through to normal cpu pick */

So, if we've been asked to move to a cpu where we are not allowed to run (i.e., in case temp_mask ends up empty), we just, silently, ignore the request. This has the potential of screwing up the work of the load balancer. If we need to keep this as a sort of "safety catch", then fine, but I'd really like to see a lot of effort made in the load balancing code to make this not trigger!

Actually, since we've been asked to do something and we're doing something completely different, I even wonder whether, if we can't go to migrate_rqd, that "something completely different" is better than doing nothing, and if yes, under what conditions and circumstances...

>      /* Find the runqueue with the lowest instantaneous load */
> @@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          rqd = prv->rqd + i;
>
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
>           * impartiality */
>          if ( rqd == svc->rqd )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )
> +                continue;

You can use cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) for the test. However, can we really be on a runq, and not have hard affinity with _any_ of its pcpus? I think no. And if yes, then we should try to change that and make it no. :-)
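I.e., something like this (same semantics as the and + empty pair, and no scratch mask needed):

    /* Skip runqueues none of whose pcpus the vcpu is allowed to run on. */
    if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
        continue;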
>              rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )

Again, cpumask_intersects()

> +            {
> +                spin_unlock(&rqd->lock);
> +                continue;
> +            }
>              rqd_avgload = rqd->b_avgload;
>              spin_unlock(&rqd->lock);
>          }
> @@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          }
>      }
>
> -    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
>      if ( min_rqi == -1 )
> -        new_cpu = vc->processor;
> +    {
> +        /* No runqs found (most likely because of spinlock contention). */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            new_cpu = cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                new_cpu = vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                new_cpu = cpumask_any(&temp_mask);
> +    }

This is the exact same code as above, isn't it? Put it in a helper and call it from both sites. Well, actually, put my rework of the code above in the helper and call it. :-)
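Something like this, perhaps. Just a sketch: the name is made up, and I'm ignoring for now the fact that, as said above, the scratch mask should not live on the stack.

    /*
     * Return a "safe" pcpu for vc: stay on vc->processor if hard affinity
     * and the current runqueue allow it, otherwise pick any pcpu the vcpu
     * is allowed to run on.
     */
    static int get_safe_pcpu(struct csched2_vcpu *svc)
    {
        struct vcpu *vc = svc->vcpu;
        cpumask_t mask; /* XXX on-stack cpumask, see above */

        cpumask_and(&mask, vc->cpu_hard_affinity, &svc->rqd->active);
        if ( unlikely(cpumask_empty(&mask)) )
            cpumask_and(&mask, vc->cpu_hard_affinity,
                        cpupool_online_cpumask(vc->domain->cpupool));

        if ( cpumask_test_cpu(vc->processor, &mask) )
            return vc->processor;

        return cpumask_any(&mask);
    }

With that, the trylock-failure path above becomes just "return get_safe_pcpu(svc);" and this hunk just "new_cpu = get_safe_pcpu(svc);".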
>      else
>      {
> -        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +            &prv->rqd[min_rqi].active);
> +        new_cpu = cpumask_any(&temp_mask);
>          BUG_ON(new_cpu >= nr_cpu_ids);
>      }

Looks ok.

> @@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
>      }
>      else
>      {
> -        int on_runq=0;
> -        /* It's not running; just move it */
> +        /* It's not running; move it if it's on a different runq than trqd. */
> +        bool_t on_runq = 0;
> +        cpumask_t temp_mask;
> +

Ditto.

>          d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
> +
> +        /* Re-assign vcpu's processor, if necessary. */
> +        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
> +        svc->vcpu->processor = cpumask_any(&temp_mask);
> +        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
> +            svc->vcpu->processor = cpumask_any(&temp_mask);
> +

Not following: you assign to processor one of the bits set in temp_mask. Then, if such new value of processor is not in temp_mask, you assign to it cpumask_any(&temp_mask), i.e., the exact same thing as before... Am I missing something obvious? What is it that you are after here?

If the goal is to do something sane and safe when trqd->active and svc->vcpu->cpu_hard_affinity do not intersect, then we certainly need it, but I really can't tell how this is supposed to handle such a situation.

BTW, after an operation like cpumask_first(), cpumask_next(), cpumask_any(), etc., a quick way to check whether the operation succeeded is to check whether the returned value is < nr_cpu_ids (see the implementation of those operations). If that is the case, then you can be sure that the bit returned is one of the bits set in the mask.
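To make that concrete, the re-assignment above could then look something like this. Sketch only: what to actually do when the two masks do not intersect is exactly the open question, so the early return is just a placeholder.

    unsigned int cpu;

    cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
    cpu = cpumask_any(&temp_mask);
    if ( cpu >= nr_cpu_ids )
    {
        /* temp_mask is empty: trqd has no pcpu the vcpu is allowed on. */
        return;
    }
    svc->vcpu->processor = cpu;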
>          if ( __vcpu_on_runq(svc) )
> +            on_runq = 1;
> +
> +        /* If the runqs are different, move svc to trqd. */
> +        if ( svc->rqd != trqd )
>          {
> -            __runq_remove(svc);
> -            update_load(ops, svc->rqd, svc, -1, now);
> -            on_runq=1;
> +            if ( on_runq )
> +            {
> +                __runq_remove(svc);
> +                update_load(ops, svc->rqd, svc, -1, now);
> +            }
> +            __runq_deassign(svc);
> +            __runq_assign(svc, trqd);
> +            if ( on_runq )
> +            {
> +                update_load(ops, svc->rqd, svc, 1, now);
> +                runq_insert(ops, svc->vcpu->processor, svc);
> +            }
>          }
> -        __runq_deassign(svc);
> -        svc->vcpu->processor = cpumask_any(&trqd->active);
> -        __runq_assign(svc, trqd);
> +

Mmm... I do not like the way the code looks after this is applied. Before the patch, it was really straightforward and easy to understand. Now it's way more involved. Can you explain why this rework is necessary? For now, do it here, then we'll see whether and how to put that into a doc comment. I know there is a comment on the call site of this function, but that does not help much either (as I say there).

>          if ( on_runq )
>          {
> -            update_load(ops, svc->rqd, svc, 1, now);
> -            runq_insert(ops, svc->vcpu->processor, svc);
>              runq_tickle(ops, svc->vcpu->processor, svc, now);
>          }
>      }
> @@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, max_delta_rqi = -1;
>      struct list_head *push_iter, *pull_iter;
> +    cpumask_t temp_mask;

...

>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
>
> @@ -1250,6 +1318,11 @@ retry:
>      for_each_cpu(i, &prv->active_queues)
>      {
>          s_time_t delta;
> +        /* true if there are no vcpus to push due to hard affinity */
> +        bool_t ha_no_push = 1;
> +        /* true if there are no vcpus to pull due to hard affinity */
> +        bool_t ha_no_pull = 1;
> +        struct list_head *iter;
>
>          st.orqd = prv->rqd + i;
>
> @@ -1257,6 +1330,47 @@ retry:
>              || !spin_trylock(&st.orqd->lock) )
>              continue;
>
> +        /*
> +         * If due to hard affinity there are no vcpus that can be
> +         * pulled or pushed, move to the next runq in the loop.
> +         */
> +
> +        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
> +        list_for_each( iter, &st.lrqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.orqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pushed from lrqd to ordq. */
> +                ha_no_push = 0;
> +                break;
> +            }
> +        }
> +

So, here, for each runqueue j, we scan all the vcpus in lrqd, to see if at least one can migrate to j. This means that, if we have 4 runqueues, we potentially loop through all the vcpus in lrqd 4 times, right?

> +        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
> +        list_for_each( iter, &st.orqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.lrqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pulled from orqd to lrdq. */
> +                ha_no_pull = 0;
> +                break;
> +            }
> +        }
> +

And here, for each runq j, we scan all its vcpus to see if at least one can run on lrqd.

In summary, this means (at least potentially) looping through all the active vcpus, some of them more than once, which sounds pretty expensive. Add to this the fact that, since we are releasing and re-acquiring the locks, things may even change under our feet, making all of that sort of useless.

I like the idea of filtering out runqueues that do not have any suitable vcpus, due to hard affinity issues, but done like this, it looks really expensive. Can we perhaps think of a data structure to make it cheaper?

In the meantime, I'd go for a more 'greedy' approach. Off the top of my head, I think I would leave this loop alone, and just have it select a candidate runqueue. Below, where the vcpus are considered, I'd filter out the inappropriate ones (as this patch also does). If it turns out that there are no suitable vcpus in the runqueue the loop selected, I'd "goto retry", perhaps adding a maximum number of attempts, or something like that. What do you think?
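Concretely, down where the push and pull candidates are actually evaluated, a filter like the following should be all that is needed (just a sketch; I'm assuming the usual push_svc/pull_svc names used in those loops). If nothing survives it for the runqueue the loop selected, we "goto retry" as said, maybe with a cap on the number of attempts.

    /* Push side: skip vcpus that can't run on any pcpu of the candidate runqueue. */
    if ( !cpumask_intersects(push_svc->vcpu->cpu_hard_affinity,
                             &st.orqd->active) )
        continue;

    /* Pull side, symmetrically: */
    if ( !cpumask_intersects(pull_svc->vcpu->cpu_hard_affinity,
                             &st.lrqd->active) )
        continue;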
> @@ -1338,11 +1458,17 @@ retry:
>          {
>              __update_svc_load(ops, pull_svc, 0, now);
>          }
> -
> +

Avoid things like this. I appreciate that this is an actual improvement, but let's not mix coding style fixes with new features. :-)

> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>
>      trqd = RQD(ops, new_cpu);
>
> -    if ( trqd != svc->rqd )
> -        migrate(ops, svc, trqd, NOW());
> +    /*
> +     * Call migrate even if svc->rqd == trqd; there may have been an
> +     * affinity change that requires a call to runq_tickle for a new
> +     * processor within the same run queue.
> +     */
> +    migrate(ops, svc, trqd, NOW());
>  }

As said above, I don't think I see the reason for this. Affinity changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, force the vcpu through a sleep/wakeup cycle (it calls vcpu_sleep_nosync() directly, while vcpu_wake() is called inside vcpu_migrate()). So it looks like what you are after (i.e., runq_tickle being called) should happen already, shouldn't it? Are there other reasons you need it for?

>  static int
> @@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>      {
>          struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
>
> +        /*
> +         * If vcpu is not allowed to run on this processor due to
> +         * hard affinity, continue to the next vcpu on the queue.
> +         */
> +        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
> +            continue;
> +

Looks ok. The comment explains the "what" a bit too much, which one can easily see from the code. Comments (most of the time :-) ) should be about the "why" things are done in a certain way. Here, the relevant piece of information is that "Only consider vcpus allowed to run on this processor", so I'd just say so. The fact that you continue scanning the runqueue is pretty evident. :-D
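I.e., just:

    /* Only consider vcpus allowed to run on this processor. */
    if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
        continue;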
Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)