
Re: [Xen-devel] [PATCH v2] sched: credit2: respect per-vcpu hard affinity



On Sun, 2015-02-08 at 17:45 -1000, Justin T. Weaver wrote:
> From: "Justin T. Weaver" <jtweaver@xxxxxxxxxx>
> 
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
Ok, here I am reviewing this, at last... sorry for the delay! :-(

> Signed-off-by: Justin T. Weaver <jtweaver@xxxxxxxxxx>
> ---
> Changes in v2:
>  * Added dynamically allocated cpu masks to avoid putting them on the stack;
>    replaced temp masks from v1 throughout
>  * Added helper function for code suggested in v1 review and called it in two
>    locations in function choose_cpu
>  * Removed v1 change to comment in the beginning of choose_cpu
>  * Replaced two instances of cpumask_and/cpumask_empty with cpumask_intersects
>  * Removed v1 re-work of code in function migrate; only change in migrate in
>    v2 is the assignment of a valid pcpu from the destination run queue to
>    vc->processor
>  * In function csched2_vcpu_migrate: removed change from v1 that called
>    function migrate even if cur and dest run queues were the same in order
>    to get a runq_tickle call; added processor assignment to new_cpu to fix
>    the real underlying issue which was the vcpu not getting a call to
>    sched_move_irqs
>
Aha, so that was the issue! Glad you figured this out. :-)

>  * Removed the looping added in v1 in function balance_load; may be added back
>    later because it would help to have balance_load be more aware of hard
>    affinity, but adding it does not affect credit2's current inability to
>    respect hard affinity.
>  * Removed coding style fix in function balance_load
>  * Improved comment in function runq_candidate
>
Thanks for putting this recap of the changes here. :-)

> ---
>  xen/common/sched_credit2.c |  122 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index cf53770..de8fb5a 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -194,6 +194,12 @@ int opt_overload_balance_tolerance=-3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Use this to avoid having too many cpumask_t structs on the stack
> + */
> +static cpumask_t **cpumask = NULL;
>
Not just 'cpumask', please... It's too generic a name. Let's pick
something that makes it easier to understand what its purpose is.

I'm not really sure right now what something like that could be...
Credit has balance_mask, but we're not going as far as introducing
multiple-step load balancing here (not with this patch, at least).

Maybe affinity_cpumask, or hard_affinity_cpumask (and we'll rename to
something else when introducing soft affinity, if needed).

> +#define csched2_cpumask cpumask[smp_processor_id()]
> +
I like the idea, but put the right hand side between parentheses. Of
course, what I just said about the name applies here too. :-)
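To make that concrete, here is a quick sketch of what I have in mind
(the array name is, as said, just one suggestion):

    /* Per-pcpu scratch mask, to avoid having cpumask_t-s on the stack. */
    static cpumask_t **hard_affinity_cpumask = NULL;
    #define csched2_cpumask (hard_affinity_cpumask[smp_processor_id()])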

> @@ -268,6 +274,23 @@ struct csched2_dom {
>      uint16_t nr_vcpus;
>  };
>  
> +/*
> + * When a hard affinity change occurs, we may not be able to check some or
> + * all of the other run queues for a valid new processor for the given vcpu.
> + * Return svc's current pcpu if valid, otherwise return a safe pcpu.
> + */
>
Add at least a very quick mention of why this is (or could be) necessary.

> +static int get_safe_pcpu(struct csched2_vcpu *svc)
> +{
>
I also don't like the name... __choose_cpu(), maybe?

> +    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity, &svc->rqd->active);
> +    if ( unlikely(cpumask_empty(csched2_cpumask)) )
> +        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> +            cpupool_online_cpumask(svc->vcpu->domain->cpupool));
>
VCPU2ONLINE(svc->vcpu) would make the line shorter.

Also, I know I'm the one that suggested this form for the code but,
thinking about it again, I'm not sure the first if is worth it:

    cpumask_and(csched2_cpumask, &svc->rqd->active, VCPU2ONLINE(svc->vcpu));
    cpumask_and(csched2_cpumask, csched2_cpumask, svc->vcpu->cpu_hard_affinity);

Oh, and, with either your variant or mine... can csched2_cpumask end up
empty after the two &&-s? That's important because, if it can, the
cpumask_any() below will return garbage! :-/

From a quick inspection, it does not seem that it can, in which case I'd
put down an ASSERT() about this somewhere. OTOH, if it can, I'd look for
ways of preventing that from happening before getting here... It seems
easier and more correct to do that, rather than trying to figure out
what to do here if the mask is empty. :-O

> +
> +    if ( cpumask_test_cpu(svc->vcpu->processor, csched2_cpumask) )
> +        return svc->vcpu->processor;
> +    else
> +        return cpumask_any(csched2_cpumask);
>
And, perhaps, we could put back the likely/unlikely hint here (provided
we think the then branch is actually likely):

    if ( likely(cpumask_test_cpu(svc->vcpu->processor, csched2_cpumask)) )
        return svc->vcpu->processor;
    else
        return cpumask_any(csched2_cpumask);

> @@ -1081,13 +1106,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          else
>          {
>              d2printk("%pv +\n", svc->vcpu);
> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
> -            goto out_up;
> +            cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
> +                &svc->migrate_rqd->active);
> +            if ( !cpumask_empty(csched2_cpumask) )
> +            {
> +                new_cpu = cpumask_any(csched2_cpumask);
> +                goto out_up;
> +            }
>
               cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
                           &svc->migrate_rqd->active);
               new_cpu = cpumask_any(csched2_cpumask);
               if ( new_cpu < nr_cpu_ids )
               {
                   d2printk("%pv +\n", svc->vcpu);
                   goto out_up;
               }

As I told you last round, checking the results of most of the
cpumask_foo() operations against nr_cpu_ids can often spare the need for
more cpumask fiddling, and hence should be done whenever possible.

> +            /* Fall-through to normal cpu pick */
>
Don't add this here. Instead, kill the instance of the exact same
comment in the if(){}, and only have it once ...

>          }
>
           ^ ... here, i.e., outside of the if (){}else{} block.

>      }
>  
> -    /* FIXME: Pay attention to cpu affinity */
> -
>      min_avgload = MAX_LOAD;
>  
>      /* Find the runqueue with the lowest instantaneous load */
> @@ -1099,17 +1128,24 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          rqd = prv->rqd + i;
>  
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
>           * impartiality */
>          if ( rqd == svc->rqd )
>          {
> +            if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                continue;
>
I'd add a short comment on how it is possible that we get here without
hard affinity with any pCPU in our runqueue (exactly as you explained it
during v1 review, i.e., in case the affinity is being changed).
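Something as simple as this would do (wording is just a suggestion):

            /*
             * Note that we can end up here without hard affinity with
             * any pCPU of our own runqueue, e.g., if the affinity is
             * being changed right now.
             */
            if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
                continue;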

>              rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> +            if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +            {
> +                spin_unlock(&rqd->lock);
> +                continue;
> +            }
>              rqd_avgload = rqd->b_avgload;
>              spin_unlock(&rqd->lock);
>          }
>
Something like this would be easier to read, IMO:

           if ( rqd == svc->rqd )
           {
               if ( cpumask_intersects(vc->cpu_hard_affinity,
                                       &rqd->active) )
                   rqd_avgload = rqd->b_avgload - svc->avgload;
           }
           else if ( spin_trylock(&rqd->lock) )
           {
               if ( cpumask_intersects(vc->cpu_hard_affinity,
                                       &rqd->active) )
                   rqd_avgload = rqd->b_avgload;

                spin_unlock(&rqd->lock);
           }
           else
               continue;

The semantics should be the same, provided we initialize rqd_avgload to
MAX_LOAD, I would say.

In fact, without the two 'continue;' statements you were introducing,
we'll execute the if() that follows this block even if there was no
intersection with the hard affinity mask; but, in that case, there's no
chance we have updated rqd_avgload, so it really should behave the same.
What do you think?
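I.e., at the top of the loop body (assuming rqd_avgload is an s_time_t,
like the other load variables in this file):

        s_time_t rqd_avgload = MAX_LOAD;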

> @@ -1123,12 +1159,16 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          }
>      }
>  
> -    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
>      if ( min_rqi == -1 )
> -        new_cpu = vc->processor;
> +    {
> +        /* No runqs found (most likely because of spinlock contention). */
> +        new_cpu = get_safe_pcpu(svc);
> +    }
>
No need to move the comment inside the if(), just kill the 'leave it
where it is' part.

> @@ -1330,6 +1375,12 @@ retry:
>          if ( test_bit(__CSFLAG_runq_migrate_request, &push_svc->flags) )
>              continue;
>  
> +        /* Skip if it can't run on the destination runq. */
> +        cpumask_and(csched2_cpumask, push_svc->vcpu->cpu_hard_affinity,
> +            &st.orqd->active);
> +        if ( cpumask_empty(csched2_cpumask) )
> +            continue;
> +
>          list_for_each( pull_iter, &st.orqd->svc )
>          {
>              struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
> @@ -1343,6 +1394,12 @@ retry:
>              if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
>                  continue;
>  
> +            /* Skip if it can't run on the destination runq. */
> +            cpumask_and(csched2_cpumask, pull_svc->vcpu->cpu_hard_affinity,
> +                &st.lrqd->active);
> +            if ( cpumask_empty(csched2_cpumask) )
> +                continue;
> +
>              consider(&st, push_svc, pull_svc);
>          }
>  
> @@ -1360,6 +1417,12 @@ retry:
>          if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
>              continue;
>  
> +        /* Skip if it can't run on the destination runq. */
> +        cpumask_and(csched2_cpumask, pull_svc->vcpu->cpu_hard_affinity,
> +            &st.lrqd->active);
> +        if ( cpumask_empty(csched2_cpumask) )
> +            continue;
> +
>
Same 4 lines of code (5, if we count the comment), repeated 3 times:
make a helper. :-)
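Something along these lines, for instance (name and shape are just a
suggestion; note that using cpumask_intersects() means we would not even
need the scratch mask here, as the result is only ever tested for
emptiness):

    /* Can svc run on (some pCPU of) runqueue rqd? */
    static bool_t vcpu_has_cpus_in_rqd(const struct csched2_vcpu *svc,
                                       const struct csched2_runqueue_data *rqd)
    {
        return cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active);
    }

and then each of the three sites becomes something like:

            if ( !vcpu_has_cpus_in_rqd(pull_svc, st.lrqd) )
                continue;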

> @@ -1396,6 +1459,15 @@ csched2_vcpu_migrate(
>  
>      /* Check if new_cpu is valid */
>      BUG_ON(!cpumask_test_cpu(new_cpu, &CSCHED2_PRIV(ops)->initialized));
> +    BUG_ON(!cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));
>
Not sure... An ASSERT() seems more suitable than a BUG_ON() here...
What's the reasoning behind this?
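I.e., if we go the ASSERT() route, just:

    ASSERT(cpumask_test_cpu(new_cpu, vc->cpu_hard_affinity));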

> +    /*
> +     * Assign new_cpu to vc->processor here to get a call to sched_move_irqs
> +     * in schedule.c in case there was a hard affinity change within the same
> +     * run queue. vc will not be able to run in certain situations without
> +     * this call.
> +     */
> +    vc->processor = new_cpu;
>
Oh, and this is what was causing you trouble, in case source and
destination runqueues were the same... Help me understand: which call to
sched_move_irqs() in schedule.c were we missing? I'd say it is the one
in vcpu_migrate(), but that does not seem to care about vc->processor
(much rather about new_cpu)... what am I missing?

However, if they are not the same, the call to migrate() will override
this right away, won't it? What I mean to say is, if this is required
only in case trqd and svc->rqd are equal, can we tweak this part of
csched2_vcpu_migrate() as follows?

    if ( trqd != svc->rqd )
        migrate(ops, svc, trqd, NOW());
    else
        vc->processor = new_cpu;


Thanks and Regards,
Dario
