
Re: [Xen-devel] [PATCH 1/4] xen: credit2: implement utilization cap



Hey,

Thanks for looking at the patch! :-)

On Mon, 2017-06-12 at 12:16 +0100, Anshul Makkar wrote:
> On 08/06/2017 13:08, Dario Faggioli wrote:
> > This commit implements the Xen part of the cap mechanism for
> > Credit2.
> > 
> > A cap is how much, in terms of % of physical CPU time, a domain
> > can execute at most.
> > 
> > For instance, a domain that must not use more than 1/4 of one
> > physical CPU, must have a cap of 25%; one that must not use more
> > than 1+1/2 of physical CPU time, must be given a cap of 150%.
> > 
> > Caps are per domain, so it is all a domain's vCPUs, cumulatively,
> > that will be forced to execute no more than the decided amount.
> > 
> > This is implemented by giving each domain a 'budget', and using
> > a (per-domain again) periodic timer. Values of budget and 'period'
> > are chosen so that budget/period is equal to the cap itself.
> > 
> > Budget is burned by the domain's vCPUs, in a similar way to how
> > credits are.
> > 
> > When a domain runs out of budget, its vCPUs can't run any longer.
> 
> if the vcpus of a domain have credit and the budget has run out, will
> the vcpus not be scheduled.
>
Is this a question? Assuming it is, what do you mean by "domain has
credit"? Domains always have credits, and they never run out of them.
There's no such thing as a domain not being runnable because it does
not have credits.

As for budget, a domain with <= 0 budget means all of its vcpus are not
runnable, and hence won't be scheduled, no matter their credits.
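
To make the math explicit: with a replenishment period of, say, 10ms,
a 150% cap means 15ms of budget per period. A minimal sketch, in the
style of the patch (the helper name and the period value are mine,
purely for illustration):

    /* cap = budget / period, hence budget = period * cap / 100. */
    #define CSCHED2_BDGT_REPL_PERIOD MILLISECS(10) /* assumed value */

    static s_time_t cap_to_budget(unsigned int cap_pct)
    {
        /* e.g., cap_pct = 150 with a 10ms period gives 15ms of budget */
        return CSCHED2_BDGT_REPL_PERIOD * cap_pct / 100;
    }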

> > @@ -92,6 +92,82 @@
> >   */
> > 
> >  /*
> > + * Utilization cap:
> > + *
> > + * Setting a pCPU utilization cap for a domain means the
> > following:
> > + *
> > + * - a domain can have a cap, expressed in terms of % of physical
> > CPU time.
> > + *   A domain that must not use more than 1/4 of _one_ physical
> > CPU, will
> > + *   be given a cap of 25%; a domain that must not use more than
> > 1+1/2 of
> > + *   physical CPU time, will be given a cap of 150%;
> > + *
> > + * - caps are per-domain (not per-vCPU). If a domain has only 1
> > vCPU, and
> > + *   a 40% cap, that one vCPU will use 40% of one pCPU. If a
> > domain has 4
> > + *   vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for
> > (the
> > + *   equivalent of) 100% time on 2 pCPUs. How much each of the
> > various 4
> > + *   vCPUs will get, is unspecified (will depend on various
> > aspects: workload,
> > + *   system load, etc.).
> 
> or the ratio can vary, e.g., 4 vcpus are allowed to run for 50% of
> the time if on 4 pcpus?
>
Yes, this is just an example. From the cap mechanism point of view,
running for 4x50% is exactly the same as running for 2x100%, and
there is no way to control what the actual distribution of runtime will
be, on a multi-vcpu guest.

I can try to make the wording a bit clearer... I can certainly add more
examples (but I have no chance of being exhaustive, as the combinations
are virtually infinite!  :-P).
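
Anyway, the arithmetic is just that any combination whose sum stays
within the cap is equally legitimate; e.g., with a 200% cap:

    2 vCPUs x 100%         = 200%
    4 vCPUs x  50%         = 200%
    1 x 80% + 3 x 40%      = 200%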

> > + *
> > + * For implementing this, we use the following approach:
> > + *
> > + * - each domain is given a 'budget', and each domain has a timer,
> > which
> > + *   replenishes the domain's budget periodically. The budget is
> > the amount
> > + *   of time the vCPUs of the domain can use every 'period';
> > + *
> > + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for
> > all domains
> > + *   (but each domain has its own timer; so they are all periodic
> > with the same
> > + *   period, but replenishments of the budgets of the various
> > domains, at
> > + *   period boundaries, are not synchronous);
> > + *
> > + * - when vCPUs run, they consume budget. When they don't run,
> > they don't
> > + *   consume budget. If there is no budget left for the domain, no
> > vCPU of
> > + *   that domain can run. If a vCPU tries to run and finds that
> > there is no
> > + *   budget, it blocks.
> > + *   Budget never expires, so at whatever time a vCPU wants to
> > run, it can
> > + *   check the domain's budget, and if there is some, it can use
> > it.
> > + *
> > + * - budget is replenished to the top of the capacity for the
> > domain once
> > + *   per period. Even if there was some leftover budget from
> > previous period,
> > + *   though, the budget after a replenishment will always be at
> > most equal
> > to the total capacity of the domain ('tot_budget');
> > + *
> 
> budget is replenished but credits not available ?
>
Still not getting this.

> budget is finished but the vcpu has not reached the rate limit
> boundary?
> 
Budget takes precedence over ratelimiting. This is important to keep
cap working "regularly", rather than in some kind of permanent "trying-
to-keep-up-with-overruns-in-previous-periods" state.

And, ideally, a vcpu's cap and ratelimiting should be set in such a way
that they don't step on each other's toes (or do so only rarely). I can
see about trying to print a warning when I detect potentially tricky
values (but it's not easy, considering budget is per-domain, so I can't
be sure about how much each vcpu will actually get, and whether or not
that will turn out to be significantly less than the ratelimit most of
the time).
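
Just to put (purely hypothetical) numbers on "tricky values": with
ratelimit_us=1000, a 1-vCPU domain with a 5% cap and a 10ms
replenishment period gets 500us of budget per period, so it would
pretty much always be stopped by the budget clamp well before the
ratelimit expires.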

> > + * - when a budget replenishment occurs, if there are vCPUs that
> > had been
> > + *   blocked because of lack of budget, they'll be unblocked, and
> > they will
> > + *   (potentially) be able to run again.
> > + *
> > + * Finally, some even more implementation-related details:
> > + *
> > + * - budget is stored in a domain-wide pool. vCPUs of the domain
> > that want
> > + *   to run go to such a pool, and grab some. When they do so, the
> > amount
> > + *   they grabbed is _immediately_ removed from the pool. This
> > happens in
> > + *   vcpu_try_to_get_budget();
> > + *
> > + * - when vCPUs stop running, if they've not consumed all the
> > budget they
> > + *   took, the leftover is put back in the pool. This happens in
> > + *   vcpu_give_budget_back();
> 
> 200% budget, 4 vcpus to run on 4 pcpus, each allowed only 50% of the
> budget. This is a static allocation.
>
Err... again, are you telling or asking?

>  e.g., 2 vcpus running on 2 pcpus at 20% of the
> budgeted time: if vcpu3 wants to execute some cpu-intensive task,
> then it won't be allowed to use more than 50% of the pcpus.
> 
> 
With what parameters? You mean the ones you cite here (i.e., a
200% budget)? If the VM has 200%, and vcpu1 and vcpu2 consume
20%+20%=40%, there's 160% free for vcpu3 and vcpu4.

> I checked the implementation below and I believe we can allow for
> this type of dynamic budget_quota allocation per vcpu. Not for the
> initial version, but certainly we can consider it for future versions.
>
But... it's already totally dynamic.
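
To illustrate, this is roughly what the give-back path looks like (a
sketch modelled on vcpu_give_budget_back() as described above; the
details here are illustrative, not the literal patch code):

    static void vcpu_give_budget_back(struct csched2_vcpu *svc)
    {
        struct csched2_dom *sdom = svc->sdom;

        /* budget_lock nests inside the runqueue lock, as above. */
        spin_lock(&sdom->budget_lock);

        /*
         * Whatever the vCPU grabbed but did not consume goes back
         * into the domain-wide pool, for other vCPUs to pick up.
         */
        if ( svc->budget > 0 )
            sdom->budget += svc->budget;
        svc->budget = 0;

        spin_unlock(&sdom->budget_lock);
    }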

> > @@ -408,6 +505,10 @@ struct csched2_vcpu {
> >      unsigned int residual;
> > 
> >      int credit;
> > +
> > +    s_time_t budget;
> 
> it's confusing; please can we have different member names for budget
> in the domain and vcpu structures.
>
Mmm... I don't think it is. It's "how much budget this _thing_ has",
where "thing" can be the domain or a vcpu, and you can tell which by
looking at the containing structure. Most of the time, that's rather
explicit, the former being sdom->budget, the latter being svc->budget.

What different names did you have in mind?

The only alternative that I can come up with would be something like:

struct csched2_dom {
 ...
 s_time_t dom_budget;
 ...
};
struct csched2_vcpu {
 ...
 s_time_t vcpu_budget;
 ...
};

Which I don't like (because of the repetition).

> > @@ -1354,7 +1469,16 @@ static void reset_credit(const struct
> > scheduler *ops, int cpu, s_time_t now,
> >           * that the credit it has spent so far get accounted.
> >           */
> >          if ( svc->vcpu == curr_on_cpu(svc_cpu) )
> > +        {
> >              burn_credits(rqd, svc, now);
> > +            /*
> > +             * And, similarly, in case it has run out of budget,
> > as a
> > +             * consequence of this round of accounting, we also
> > must inform
> > +             * its pCPU that it's time to park it, and pick up
> > someone else.
> > +             */
> > +            if ( unlikely(svc->budget <= 0) )
> 
> use of unlikely here is not saving many cpu cycles.
>
Well, considering that not all domains will have a cap, and that, even
for the domains with caps, we don't expect all their vcpus to exhaust
their budget at every reset event, I think annotating this as an
unlikely event makes sense.

It's not a super big deal, though, and I can get rid of it, if people
don't like it / are not convinced by it. :-)

> > @@ -1410,27 +1534,35 @@ void burn_credits(struct
> > csched2_runqueue_data *rqd,
> > 
> >      delta = now - svc->start_time;
> > 
> > -    if ( likely(delta > 0) )
> > -    {
> > -        SCHED_STAT_CRANK(burn_credits_t2c);
> > -        t2c_update(rqd, delta, svc);
> > -        svc->start_time = now;
> > -    }
> > -    else if ( delta < 0 )
> > +    if ( unlikely(delta <= 0) )
> >      {
> > -        d2printk("WARNING: %s: Time went backwards? now
> > %"PRI_stime" start_time %"PRI_stime"\n",
> > -                 __func__, now, svc->start_time);
> > +        if ( unlikely(delta < 0) )
> > +            d2printk("WARNING: %s: Time went backwards? now
> > %"PRI_stime
> > +                     " start_time %"PRI_stime"\n", __func__, now,
> > +                     svc->start_time);
> > +        goto out;
> >      }
> > 
> > +    SCHED_STAT_CRANK(burn_credits_t2c);
> > +    t2c_update(rqd, delta, svc);
> > +
> > +    if ( unlikely(svc->budget != STIME_MAX) )
> 
> not clear; what is this check about?
>
Ah, good catch. This should have been has_cap()! I'll fix it.
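
I.e., that hunk was meant to read something like this (a sketch of the
fix; budget is burned only for domains that actually have a cap):

    SCHED_STAT_CRANK(burn_credits_t2c);
    t2c_update(rqd, delta, svc);

    /* Only capped domains have a (finite) budget to burn. */
    if ( unlikely(has_cap(svc)) )
        svc->budget -= delta;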

> > @@ -1438,6 +1570,217 @@ void burn_credits(struct
> > 
> > +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc)
> > +{
> > +    struct csched2_dom *sdom = svc->sdom;
> > +    unsigned int cpu = svc->vcpu->processor;
> > +
> > +    ASSERT(spin_is_locked(per_cpu(schedule_data,
> > cpu).schedule_lock));
> > +
> > +    if ( svc->budget > 0 )
> > +        return true;
> > +
> > +    /* budget_lock nests inside runqueue lock. */
> > +    spin_lock(&sdom->budget_lock);
> > +
> > +    /*
> > +     * Here, svc->budget is <= 0 (as, if it was > 0, we'd have
> > taken the if
> > +     * above!). That basically means the vCPU has overrun a bit --
> > because of
> > +     * various reasons-- and we want to take that into account.
> > With the +=,
> > +     * we are actually subtracting the amount of budget the vCPU
> > has
> > +     * overconsumed, from the total domain budget.
> > +     */
> > +    sdom->budget += svc->budget;
> > +
> > +    if ( sdom->budget > 0 )
> > +    {
> > +        svc->budget = sdom->budget;
> 
> why are you assigning the remaining sdom->budget to only this svc?
> svc should be assigned a proportionate budget.
> Each vcpu is assigned a %age of the domain budget based on the cap
> and the number of vcpus.
> There is a difference between the code here and the code in branch
> git://xenbits.xen.org/people/dariof/xen.git (fetch)
> rel/sched/credti2-caps. The logic in the branch code looks fine,
> where you are taking svc->budget_quota into consideration.
>
Yeah... maybe look at patch 3/4. :-P
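
For reference, the idea there (a sketch; svc->budget_quota is the field
you mention, from the branch / a later patch of the series) is to bound
how much a single vCPU can grab from the pool in one go:

    if ( sdom->budget > 0 )
    {
        /* Take at most our own quota, not the whole pool. */
        svc->budget = min(sdom->budget, svc->budget_quota);
        sdom->budget -= svc->budget;
    }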

> > @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler
> > *ops, int cpu,
> >       * credit values of MIN,MAX per vcpu, since each vcpu burns
> > credit
> >       * at a different rate.
> >       */
> > -    if (rt_credit > 0)
> > +    if ( rt_credit > 0 )
> >          time = c2t(rqd, rt_credit, snext);
> >      else
> >          time = 0;
> > 
> > -    /* 3) But never run longer than MAX_TIMER or less than
> > MIN_TIMER or
> > -     * the rate_limit time. */
> > -    if ( time < min_time)
> > +    /*
> > +     * 3) But, if capped, never run more than our budget.
> > +     */
> > +    if ( unlikely(has_cap(snext)) )
> > +        time = snext->budget < time ? snext->budget : time;
> > +
> 
> does the budget take precedence over ratelimit and credit?
>
It does take precedence over ratelimiting, yes. There's no precedence
relationship between budget and credits, and, anyway, the code here has
nothing to do with that.

In fact, all this code is saying is that, if this vcpu has 748us of
budget left, I want csched2_schedule() to be called again no later
than 748us from now.

> please replace snext->budget with something which is less confusing,
> e.g. snext->budget_allocated.
>
How would budget_allocated be less confusing?

> > @@ -2544,11 +2914,13 @@ runq_candidate(struct csched2_runqueue_data
> > *rqd,
> >          }
> > 
> 
> In runq_candidate() we have this code block:
> /*
>   * Return the current vcpu if it has executed for less than
> ratelimit.
>   * Adjustment for the selected vcpu's credit and the decision
>   * for how long it will run will be taken in csched2_runtime.
>   *
>   * Note that, if scurr is yielding, we don't let rate limiting kick
> in.
>   * In fact, it may be the case that scurr is about to spin, and
> there's
>   * no point forcing it to do so until rate limiting expires.
>   */
>   if ( !yield && prv->ratelimit_us && !is_idle_vcpu(scurr->vcpu) &&
>        vcpu_runnable(scurr->vcpu) &&
>       (now - scurr->vcpu->runstate.state_entry_time) <
>         MICROSECS(prv->ratelimit_us) )
> In this code block we return scurr. Here there is no check for
> vcpu->budget.
> Even if the scurr vcpu has executed for less than the rate limit and
> scurr is not yielding, we need to check its budget before returning
> scurr.
> 
But we check vcpu_runnable(scurr). And we've already called, in
csched2_schedule(), vcpu_try_to_get_budget(scurr). And if scurr could
not get any budget, we called park_vcpu(scurr), which sets scurr up in
such a way that vcpu_runnable(scurr) is false.
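
FWIW, a sketch of the mechanism (modelled on the patch's park_vcpu();
names as in the patch, modulo details):

    static void park_vcpu(struct csched2_vcpu *svc)
    {
        struct vcpu *v = svc->vcpu;

        ASSERT(spin_is_locked(&svc->sdom->budget_lock));

        /*
         * Raise the _VPF_parked pause flag: from now on,
         * vcpu_runnable(v) returns false.
         */
        __set_bit(_VPF_parked, &v->pause_flags);

        /*
         * Keep track of parked vCPUs, so they can all be unparked
         * (and woken up) at the next budget replenishment.
         */
        list_add(&svc->parked_elem, &svc->sdom->parked_vcpus);
    }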

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
