
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
>On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote:
>> On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
>> > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
>> > If the system is not overbooked, it's a bit strange that the
>> > scheduler
>> > is the bottleneck.
>> I looked at the original data again. I don't see detailed data to
>> describe the
>> dom0 configuration.
>> 
>I see. No collection of output of top and xentop in dom0 either then,
>I'm guessing? :-/
>

Probably; let me check with someone to see whether we have any luck.

>> The exact user model is not accessible from our client. We guess
>> their model
>> looks like this.
>> 
>> 
>>      +--------+     +-----------+         +-------------+
>>      |Timer   | --->|Coordinator| ---+--->|Worker       |
>>      +--------+     +-----------+    |    +-------------+
>>                                      |
>>                                      |    +-------------+
>>                                      +--->|Worker       |
>>                                      |    +-------------+
>>                                      |
>>                                      |    +-------------+
>>                                      +--->|Worker       |
>>                                           +-------------+
>> 
>> One Coordinator would drive several workers based on a high
>> resolution timer. Periodically, the workers would be woken up by the
>> coordinator. So at one moment, many workers would be woken up, which
>> would trigger vcpu_wake() in Xen.
>> 
>It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if
>the graph describes the workload run inside the (and if yes, which
>ones) VMs... but that is not terribly important, after all.
>

Oh, yes, these are threads in a VM. Each VM may contain several
groups (instances) of these threads.
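
To make the pattern concrete, here is a minimal sketch of what I believe
they do (my guess at the shape of their code, not their actual code): one
timer-driven coordinator broadcasts to N workers, so N threads become
runnable at almost the same instant, and that is what shows up as a burst
of vcpu_wake() in Xen.

    /* Hypothetical coordinator/worker pattern, not the client's code:
     * one timer-driven coordinator broadcasts to all workers, so they
     * all become runnable at (almost) the same instant. */
    #include <pthread.h>
    #include <unistd.h>

    #define NR_WORKERS 8

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  tick = PTHREAD_COND_INITIALIZER;
    static unsigned long generation;

    static void *worker(void *arg)
    {
        unsigned long seen = 0;

        for ( ; ; )
        {
            pthread_mutex_lock(&lock);
            while ( generation == seen )
                pthread_cond_wait(&tick, &lock);  /* sleep until next tick */
            seen = generation;
            pthread_mutex_unlock(&lock);
            /* ... do the periodic work ... */
        }
        return NULL;
    }

    static void *coordinator(void *arg)
    {
        for ( ; ; )
        {
            usleep(1000);                   /* stand-in for the hires timer */
            pthread_mutex_lock(&lock);
            generation++;
            pthread_cond_broadcast(&tick);  /* all workers wake at once */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int i;

        for ( i = 0; i < NR_WORKERS; i++ )
            pthread_create(&tid, NULL, worker, NULL);

        pthread_create(&tid, NULL, coordinator, NULL);
        pthread_join(tid, NULL);            /* runs forever */
        return 0;
    }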

>> Not sure this would be a possible reason for the burst vcpu_wake()?
>> 
>Well, there would be, at least potentially, a sensible number of vcpus
>waking up, which indeed can make the runqueue locks of the various
>pcpus contended.
>
>But then again, if the system is not oversubscribed, I'd tend to think
>it to be tolerable, and I'd expect the biggest problem to be the work-
>stealing logic (considering the high number of pcpus), rather than the
>duration of the critical sections within vcpu_wake().
>

Yes, we are trying to improve the stealing part too.

Sounds reasonable: vcpu_wake() is O(1), while "stealing" is O(N) in terms of
#pcpus.
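
Just to check that I read the stealing path right, here is a rough sketch of
its shape, written from memory (so function names, arguments, and the way the
peer lock is reached may all differ from the real 4.1 source; in particular
peer_schedule_lock() below is only a stand-in):

    /* Sketch only: for every other online pcpu we trylock its runqueue
     * lock and look for work to steal, hence O(#pcpus) per invocation.
     * peer_schedule_lock() is a stand-in for however the per-pcpu lock
     * is actually reached in the tree. */
    static struct csched_vcpu *load_balance_sketch(int cpu)
    {
        struct csched_vcpu *speer = NULL;
        int peer_cpu;

        for_each_cpu_mask(peer_cpu, cpu_online_map)       /* O(#pcpus) */
        {
            if ( peer_cpu == cpu )
                continue;

            /* Don't spin on a busy peer (it may be balancing towards us
             * at the same time): just try the lock and move on. */
            if ( !spin_trylock(peer_schedule_lock(peer_cpu)) )
                continue;

            speer = csched_runq_steal(peer_cpu, cpu /*, pri, ... */);
            spin_unlock(peer_schedule_lock(peer_cpu));

            if ( speer != NULL )
                break;
        }

        return speer;
    }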

>> >  - pcpu 6 is either idle or it is running someone already (say d3v4)
>> >    + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
>> >      pcpu 6 itself, and we're done
>> Ok, it looks like the behavior differs between 4.1 and current
>> upstream. Upstream just raises the softirq to pcpu6 when it is idle,
>> while 4.1 would raise the softirq to both pcpu6 and other idlers even
>> when pcpu6 is idle.
>> 
>> I think current upstream is cleverer.
>> 
>I also think current upstream is a bit better, especially because it's
>mostly me that made it look the way it does. :-D

Ah, I wasn't trying to flatter you.

>
>But I was actually describing how 4.1 works. In fact, in 4.1, if pcpu 6
>is idle (see the '//xxx xxx xxx' comments I'm adding to the code
>excerpts):
>
>    if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
>    {
>        ....
>    }
>    if ( !cpus_empty(mask) ) //the mask contains pcpu 6 only
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>

Hmm... I don't have the code at hand, but it looks like you are right. I
misunderstood the code.

>On the other hand, if pcpu 6 is not idle (and, sticking to the example
>started in the last email, is running d3v4):
>
>    if ( new->pri > cur->pri )  //depends on d2v2's prio and d3v4's prio
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE ) //is true, so let's see...
>    {
>        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle 
>pcpus at all. Let's assume there are some (i.e., let's assume this is false)
>        {
>            ....
>        }
>        else
>        {
>            cpumask_t idle_mask;
>
>            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>            if ( !cpus_empty(idle_mask) )  //is true if there are idlers 
>suitable for new (let's assume there are)
>            {
>                if ( opt_tickle_one_idle ) //chosen on boot, default is true
>                {
>                    this_cpu(last_tickle_cpu) = 
>                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
>                    cpu_set(this_cpu(last_tickle_cpu), mask);

I may have misunderstood the code previously; I like this part too.

So only one idler would be tickled even if there are several idlers in the
system. I thought we would tickle several idlers, which might introduce some
contention between them.

BTW, what is the idea behind cycle_cpu()? If it is about locality, how about
cycling within a node, e.g. the node where v->processor is? (I have put a
small sketch of what I mean further below, after the rest of your excerpt.)

>                }
>                else
>                    ....
>            }
>            cpus_and(mask, mask, new->vcpu->cpu_affinity);
>        }
>    }
>    if ( !cpus_empty(mask) ) //mask contains pcpu 6 if d2v2's prio was higher, 
>and also contains one idle pcpu
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>
>So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's
>not, at least one idler (if it exists) is tickled, and pcpu 6 is
>tickled or not, depending on priorities.
>

That's clear to me now :-)
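
Coming back to my cycle_cpu() question above, this is roughly what I had in
mind for the opt_tickle_one_idle branch (only a sketch; cpu_to_node() and
node_to_cpumask() are the names I am assuming for the NUMA helpers, I have
not checked them against the tree): prefer idlers on the node of
v->processor, and only fall back to all idlers if that node has none.

    /* Sketch only: restrict the idlers we cycle through to the node of
     * new's current processor, if that node has any idlers at all.
     * cpu_to_node()/node_to_cpumask() are assumed helper names. */
    cpumask_t node_mask, node_idlers;

    node_mask = node_to_cpumask(cpu_to_node(new->vcpu->processor));
    cpus_and(node_idlers, idle_mask, node_mask);
    if ( !cpus_empty(node_idlers) )
        idle_mask = node_idlers;           /* cycle within the local node */

    this_cpu(last_tickle_cpu) =
        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
    cpu_set(this_cpu(last_tickle_cpu), mask);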

>> > And in fact, in more recent version of Xen, I made the code do
>> > something very close to what you're suggesting. Here's the comment
>> > that
>> > you can find where all this logic is implemented (it's way more
>> > complicated than this, and than the code in 4.1, because it takes
>> > soft-
>> > affinity into account).
>> > 
>> >     /*
>> >      * If the pcpu is idle, or there are no idlers and the new
>> >      * vcpu is a higher priority than the old vcpu, run it here.
>> >      *
>> >      * If there are idle cpus, first try to find one suitable to
>> > run
>> >      * new, so we can avoid preempting cur.  If we cannot find a
>> >      * suitable idler on which to run new, run it here, but try to
>> >      * find a suitable idler on which to run cur instead.
>> >      */
>> > 
>> I like this idea.
>> 
>Then update... Xen 4.2 or 4.3 should already contain it. :-P
>
>> We found the schedule lock to be a point of contention in our scenario,
>> so the direct thought is to try to avoid grabbing it. Hmm... though our
>> idea is not that sparkling.
>> 
>Again, I can't say how sparkling it will reveal once implemented.
>Architecturally, it doesn't sound much different from the status quo to
>me, and so I'd say it's unlikely that it will solve your problem, but
>this is something always very hard to tell.
>
>And again, I'd personally spend some more time --even by temporarily
>and hackishly instrumenting the code-- to understand better where the
>bottleneck is.
>

Hmm... usually we use xentrace and lock profiling to identify the bottleneck;
is there any other method you would suggest? And does instrumenting the code
mean adding some logging to the code?
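
For example, is something along these lines what you mean by hackishly
instrumenting the code? (Only a rough sketch, not a proposed patch; the
locking macro names are from my memory of schedule.c and may differ between
versions.)

    /* Hacky instrumentation sketch: measure how long vcpu_wake() waits
     * for the per-vcpu schedule lock and how long it then holds it, and
     * print only the outliers to limit the noise. */
    void vcpu_wake(struct vcpu *v)
    {
        unsigned long flags;
        s_time_t before, entered, left;

        before = NOW();
        vcpu_schedule_lock_irqsave(v, flags);
        entered = NOW();

        /* ... original body of vcpu_wake() ... */

        vcpu_schedule_unlock_irqrestore(v, flags);
        left = NOW();

        if ( entered - before > MICROSECS(100) ||
             left - entered > MICROSECS(100) )
            printk("d%dv%d: wake waited %ldns, held %ldns\n",
                   v->domain->domain_id, v->vcpu_id,
                   (long)(entered - before), (long)(left - entered));
    }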

>> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised
>> > the
>> > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to
>> > grab
>> > their own locks for running csched_schedule(), after having reacted
>> > to
>> > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
>> > 
>> > BTW, you say that pcpu 9 is idle, but then that "it can't do his
>> > job"
>> > because pcpu 4 holds the lock of pcpu 6... I don't think I
>> > understand
>> > what you mean with this.
>> After pcpu9 gets the softirq from pcpu4, it tries to schedule and do
>> load balancing, which would take pcpu6's schedule lock.
>> 
>> As I thought previously, pcpu6's schedule lock is held by pcpu4, so
>> pcpu9 has to wait until pcpu4 releases it. This is what I meant by
>> "it can't do its job immediately".
>> 
>Yeah, but...
>
>> While as you mentioned, compared with pcpu9, pcpu4 would release
>> pcpu6
>> schedule lock earlier. :-)
>> 
>... indeed that's what I think happens 99% of the time. And this is
>also why I'd be tempted to think that the issue may be lying somewhere
>else.
>
>E.g., something that has proven effective for others (and for which
>I've got an unfinished and never submitted patch to upstream), is
>keeping track of what pcpus actually have at least 1 vcpu in their
>runqueues (i.e., at least 1 vcpus that is runnable and not running)
>and, inside balance_load(), only consider those when looking for work
>to steal.

Looks like we keep a cpumask of the pcpus that have at least one
runnable-but-not-running vcpu, and during the stealing process we just scan
that mask instead of the online cpumask.
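
Something along these lines, I guess (all the names below are made up by me
just to check my understanding; this is not your patch):

    /* Sketch: keep a mask of pcpus whose runqueue has at least one
     * runnable-but-not-running vcpu, and only scan that mask when
     * looking for work to steal. */
    static cpumask_t csched_has_work;      /* would need its own locking */

    /* Call from the runq insert/remove paths whenever the number of
     * queued, not-running vcpus on 'cpu' goes 0 -> 1 or 1 -> 0. */
    static void csched_mark_work(int cpu, int has_work)
    {
        if ( has_work )
            cpu_set(cpu, csched_has_work);
        else
            cpu_clear(cpu, csched_has_work);
    }

    /* In balance_load(), walk csched_has_work instead of cpu_online_map. */
    static int csched_pick_steal_victim(int cpu)
    {
        int peer_cpu;

        for_each_cpu_mask(peer_cpu, csched_has_work)
            if ( peer_cpu != cpu )
                return peer_cpu;   /* then trylock its runq lock and steal */

        return -1;                 /* nothing queued anywhere else */
    }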

>
>Regards,
>Dario
>-- 
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Wei Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

