Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake
On Wed, Sep 14, 2016 at 06:44:17PM +0800, Wei Yang wrote:
>On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>>[using xendevel correct address]
>>
>>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
>>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
>>> >
>>> > I'm not surprised by that. Yet, I'd be interested in hearing more about this profiling you have done (things like, how you captured the data, what workloads you are exactly considering, how you determined what is the bottleneck, etc).
>>> Let me try to explain this.
>>>
>>> Workload: a. Blue Screen in Windows Guests
>>>           b. some client using Windows to do some video processing which need precise timestamp (we are not sure the core reason but we see the system is very slow)
>>>
>>Do you mind sharing just a bit more, such as:
>> - number of pcpus
>> - number of vcpus of the various VMs
>
>160 pcpus
>16 vcpus in VM and 8 VMs
>
>>I also am not sure what "a. Blue screen in Windows guests" above means... is there a workload called "Blue Screen"? Or is it that you are hitting a BSOD in some of your guests (which ones, what were they doing)? Or is it that you _want_ to provoke a BSOD on some of your guests? Or is that something else? :-P
>>
>Yes, the "Blue Screen" is what we want to mimic the behavior from client.
>
>The "Blue Screen" will force the hypervisor to do load balance in my mind.
>
>>> Capture the data: lock profile
>>> Bottleneck Check: The schedule lock wait time is really high
>>>
>>Ok, cool. Interesting that lock profiling works on 4.1! :-O
>>
>>> > The scheduler tries to see whether the v->processor of the waking vcpu can be re-used, but that's not at all guaranteed, and again, on a very loaded system, not even that likely!
>>> >
>>> Hmm... I may missed something.
>>>
>>> Took your assumption below for example. In my mind, the process looks like this:
>>>
>>>     csched_vcpu_wake()
>>>         __runq_insert(), insert the vcpu in pcpu's 6 runq
>>>         __runq_tickle(), raise SCHEDULE_SOFTIRQ on pcpu 6 or other (1)
>>>
>>Well, yes. More precisely, at least in current staging, SCHEDULE_SOFTIRQ is raised for pcpu 6:
>> - if pcpu 6 is idle, or
>> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and the waking vcpu is higher in priority than what is running on pcpu 6
>>
>>>     __do_softirq()
>>>         schedule()
>>>             csched_schedule(), for pcpu 6, it may wake up d2v2 based on it's priority
>>>
>>Yes, but it is pcpu 6 that will run csched_schedule() only if the conditions mentioned above are met. If not, it will be some other pcpu that will do so (or none!).
>>
>>But I just realized that the fact that you are actually working on 4.1 is going to be an issue. In fact, the code that is now in Xen has changed **a lot** since 4.1. In fact, you're missing all the soft-affinity work (which you may or may not find useful) but also improvements and bugfixing.
>>
>That's true...
>
>>I'll have a quick look at how __runq_tickle() looks like in Xen 4.1, but there's very few chances I'll be available to provide detailed review, advice, testing, etc., on such an old codebase. :-(
>>
>>> By looking at the code, what I missed may be in __runq_tickle(), which in case there are idle pcpu, schedule_softirq is also raised on them. By doing so, those idle pcpus would steal other busy tasks and may in chance get d2v2?
>>>
>>Yes, it looks like, in Xen 4.1, this is more or less what happens. The idea is to always tickle pcpu 6, if the priority of the waking vcpu is higher than what is running there.
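To make the tickling rule above concrete, here is a minimal standalone sketch in C. This is toy code, not the Xen 4.1 sources: the names (raise_schedule_softirq(), cur_prio[], idle[]) and the simplified priority check are made up for illustration. It shows the two sides discussed above: pcpu 6 is only poked when the waking vcpu outranks what it is running, and idlers are poked so one of them can pick up whatever is left sitting in pcpu 6's runqueue.

    #include <stdio.h>

    #define NR_PCPUS 16

    struct vcpu { int prio; int processor; };

    static int cur_prio[NR_PCPUS];   /* priority of what each pcpu is running */
    static int idle[NR_PCPUS];       /* 1 if the pcpu is idling               */

    static void raise_schedule_softirq(int cpu)
    {
        /* stands in for the SCHEDULE_SOFTIRQ notification */
        printf("SCHEDULE_SOFTIRQ -> pcpu %d\n", cpu);
    }

    static void runq_tickle(const struct vcpu *new)
    {
        int cpu = new->processor, i, tickled = 0;

        /* poke v->processor only if the waking vcpu should preempt what
         * runs there (the real check in Xen is more involved) */
        if (idle[cpu] || new->prio > cur_prio[cpu]) {
            raise_schedule_softirq(cpu);
            tickled = 1;
        }

        /* also poke idlers, so one of them can steal whatever ends up
         * queued on cpu (the preempted vcpu, or the waking vcpu itself) */
        for (i = 0; i < NR_PCPUS; i++) {
            if (i != cpu && idle[i]) {
                raise_schedule_softirq(i);
                tickled = 1;
            }
        }

        if (!tickled)
            printf("no tickle: vcpu just waits in pcpu %d's runq\n", cpu);
    }

    int main(void)
    {
        struct vcpu d2v2 = { .prio = 5, .processor = 6 };

        cur_prio[6] = 7;     /* pcpu 6 runs something of higher priority */
        runq_tickle(&d2v2);  /* and there are no idlers: nobody is tickled */
        return 0;
    }

Note how, in the low-priority/no-idler case, nobody is tickled at all and the woken vcpu just sits in the runqueue, which is exactly the situation discussed a bit further down.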
>
>Hmm... in case there are idle pcpus and the priority of the waking vcpu is higher than what is running on pcpu 6, would pcpu 6 have more chance to run it? or would another idle pcpu steal it from pcpu 6? or do they have equal chance?
>
>>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
>> - if the waking vcpu preempted what was running on pcpu 6, an idler can pick-up ("steal") such preempted vcpu and run it;
>> - if the waking vcpu ended up in the runqueue, an idler can steal it
>
>Hmm... I don't get the difference between these two cases.
>
>Looks both are an idler steal the vcpu.
>
>>> BTW, if the system is in heavy load, how much chance would we have idle pcpu?
>>>
>>It's not very likely that there will be idle pcpus in a very busy system, I agree.
>>
>>> > > We can divide this in three steps:
>>> > > a) Send IPI to the target CPU and tell it which vcpu we want it to wake up.
>>> > > b) The interrupt handler on target cpu insert vcpu to a percpu queue and raise softirq.
>>> > > c) softirq will dequeue the vcpu and wake it up.
>>> > >
>>> > I'm not sure I see how you think this would improve the situation.
>>> >
>>> > So, quickly, right now:
>>> >
>>> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
>>> > - let's assume d2v2->processor = 6
>>> > - let's assume the wakeup is happening on pcpu 4
>>> >
>>> > Right now:
>>> >
>>> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2, which is the runqueue lock of pcpu 6 (in Credit1, there is 1 runqueue per pcpu, and locks are per-cpu already)
>>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>>> > - still executing on pcpu 4, __runq_tickle() is called, and it determines on which pcpu d2v2 should run
>>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the following two cases:
>>> >   a) if it was determined that d2v2 should run on pcpu 6:
>>> >      - in a (little, hopefully) while, pcpu 6 will schedule
>>> >      - it takes its own runqueue lock
>>> >      - it finds d2v2 in there, and runs it
>>> Even in this case, it depends on the priority of d2v2 whether it would be run now. Right?
>>>
>>Yes. But both in 4.1 and in current staging, we only raise SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the waking vcpu is higher and it should preempt the currently running vcpu.
>
>Oh, you are right. I didn't catch this either.
>
>This means in case
>a) the priority is lower than current one
>b) no idler in system
>
>The vcpu will sit quietly in the runq and wait for the schedule next time.
>
>>> >   b) if it was determined that d2v2 should run on, say, pcpu 9:
>>> >      - in a (little, hopefully) while, pcpu 9 will schedule
>>> >      - it takes its own runqueue lock
>>> >      - if it has nothing in its runqueue (likely, if __runq_tickle() chose it, but not necessarily and always true, especially under high load), it looks around to see if other pcpus have any vcpu that it can execute
>>> >      - it will try to take the locks of the various pcpu's runqueues it comes across
>>> >      - if it manages to take the lock of pcpu's 6 runqueue, it sees d2v2 in it, steals it and runs it.
>>> Oh, my understanding matches yours :-)
>>>
>>> BTW, we also found that csched_load_balance() will hold the schedule lock, but not find a proper vcpu to steal. Maybe this would be a potential optimization point.
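As a rough illustration of that stealing path, and of the "held the lock but stole nothing" case just mentioned, here is a standalone toy in C. The names are invented (it is not csched_load_balance() itself): an idle pcpu walks its peers' runqueues, takes each lock with a trylock so it never spins on a contended remote lock, and may well acquire a lock and still find nothing it can migrate.

    #include <pthread.h>
    #include <stdio.h>

    #define NR_PCPUS 4

    struct runq {
        pthread_spinlock_t lock;   /* stand-in for the per-pcpu schedule lock */
        int nr_stealable;          /* how many queued vcpus could be migrated */
    };

    static struct runq runqs[NR_PCPUS];

    /* An idler that found its own runqueue empty looks for work elsewhere. */
    static int steal_work(int my_cpu)
    {
        for (int peer = 0; peer < NR_PCPUS; peer++) {
            if (peer == my_cpu)
                continue;
            if (pthread_spin_trylock(&runqs[peer].lock))
                continue;                   /* lock contended: skip this peer */

            /* we hold peer's lock, but there may still be nothing suitable
             * to steal (the csched_load_balance() case mentioned above) */
            int got = 0;
            if (runqs[peer].nr_stealable > 0) {
                runqs[peer].nr_stealable--; /* "migrate" one vcpu */
                got = 1;
            }
            pthread_spin_unlock(&runqs[peer].lock);
            if (got)
                return peer;                /* stole from this pcpu */
        }
        return -1;                          /* nothing found: stay idle */
    }

    int main(void)
    {
        for (int i = 0; i < NR_PCPUS; i++)
            pthread_spin_init(&runqs[i].lock, PTHREAD_PROCESS_PRIVATE);
        runqs[2].nr_stealable = 1;
        printf("idler 0 stole from pcpu %d\n", steal_work(0));
        return 0;
    }

On a big, busy host this walk is exactly where a lot of runqueue-lock traffic can come from, which is why the lock-profiling numbers point at the schedule lock.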
>>>
>>Mmm... I actually don't understand what you mean here... what is the possible optimization?
>>
>>> > With your modifications, AFAICT:
>>> >
>>> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock of pcpu 6 and queues d2v2 in pcpu's 6 deferred wakeup list
>>> > - poke pcpu 6 with an IPI
>>> > - in a (little, hopefully) while, pcpu 6 reacts to the IPI and, _I_think_, calls vcpu_wake(d2v2) ? [*]
>>> You are almost right. Here in the interrupt handler, it does two things:
>>> 1. echo back to notify_remote_vcpu_wake() that it finished the job
>>> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>>>
>>> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
>>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu's 6 runqueue
>>> > - on pcpu 6, __runq_tickle() is called, and it determines on which pcpu d2v2 should run
>>> At this place, the behavior of __runq_tickle() is what I mentioned above. Raise softirq on pcpu 6 and idle pcpus. Whoever is faster gets d2v2.
>>>
>>Sort of, yes. In general, you can expect pcpu 6 going through csched_schedule(), and hence setting d2v2 to run, to be faster than SCHEDULE_SOFTIRQ being notified to someone remote, and it scheduling and getting to work stealing from pcpu 6's runqueue.
>>
>>However, that is the case only if d2v2 had higher priority than what is running on pcpu 6. If it does not, you:
>> - _don't_ tickle pcpu 6
>> - tickle one or more idlers, if any.
>>
>>So, again, it's not that you always wake and run it on pcpu 6
>
>I see what you mean :-)
>
>So I am curious about why we add this to pcpu's 6 runqueue, when we know in __runq_tickle() that the priority is low and won't tickle it. And we are sure someone in the future will grab pcpu's 6 schedule lock and take it.
>
>Or in this case -- when there are idlers and the vcpu's priority is higher -- instead of letting pcpu 6 wake this, let an idler take it? Ask someone who is free to help a busy one.
>
>>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the following two cases:
>>> >   a) if it was determined that d2v2 should run on pcpu 6 (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
>>> >      - at the first softirq check, pcpu 6 will schedule
>>> >      - it takes its own runqueue lock
>>> >      - it finds d2v2 in there, and runs it
>>> >   b) if it was determined that d2v2 should run on, say, pcpu 9:
>>> >      - in a (little, hopefully) while, pcpu 9 will schedule
>>> >      - it takes its own runqueue lock
>>> >      - if it has nothing in its runqueue (likely, if __runq_tickle() chose it, but not necessarily and always true, especially under high load), it looks around to see if other pcpus have any vcpu that it can execute
>>> >      - it will try to take the locks of the various pcpu's runqueues it comes across
>>> >      - if it manages to take the lock of pcpu's 6 runqueue, it sees d2v2 in it, steals it and runs it.
>>> >
>>> > [*] I don't see in the code you shared what happens in reaction to the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually calls vcpu_wake()
>>> >
>>> > So, basically, it looks to me that you're adding a level of indirection, I'm not sure for what benefit.
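For clarity, here is a compilable toy model of the proposed deferred-wakeup path as it is described above. It is not the actual patch: wake_vcpu_lock, notify_remote_vcpu_wake() and the list layout are stand-ins taken from the discussion or invented for illustration. The point it shows is the intended split: the waking pcpu only touches the target's short wakeup list under wake_vcpu_lock and sends the IPI, while the target drains the list and does the real vcpu_wake() with only local locking.

    #include <pthread.h>
    #include <stdio.h>

    #define NR_PCPUS 8

    struct vcpu { int id; struct vcpu *next; };

    struct wakeup_queue {
        pthread_spinlock_t wake_vcpu_lock;  /* protects only this short list */
        struct vcpu *wake_list;             /* vcpus waiting to be woken here */
    };

    static struct wakeup_queue wq[NR_PCPUS];  /* one per pcpu */

    static void send_wakeup_ipi(int cpu)
    {
        /* stands in for the WAKEUP_TASK_VECTOR IPI */
        printf("IPI WAKEUP_TASK_VECTOR -> pcpu %d\n", cpu);
    }

    /* runs on the waking pcpu (pcpu 4 in the example): queue, then poke */
    static void notify_remote_vcpu_wake(struct vcpu *v, int target_cpu)
    {
        pthread_spin_lock(&wq[target_cpu].wake_vcpu_lock);
        v->next = wq[target_cpu].wake_list;
        wq[target_cpu].wake_list = v;
        pthread_spin_unlock(&wq[target_cpu].wake_vcpu_lock);

        send_wakeup_ipi(target_cpu);
    }

    /* runs on the target pcpu, from the softirq raised by the IPI handler:
     * drain the list, then do the real wakeups with only local locking */
    static void wakeup_softirq(int cpu)
    {
        struct vcpu *v;

        pthread_spin_lock(&wq[cpu].wake_vcpu_lock);
        v = wq[cpu].wake_list;
        wq[cpu].wake_list = NULL;
        pthread_spin_unlock(&wq[cpu].wake_vcpu_lock);

        /* each of these still takes the local runqueue lock: insert + tickle */
        for (; v; v = v->next)
            printf("pcpu %d: vcpu_wake(vcpu %d)\n", cpu, v->id);
    }

    int main(void)
    {
        struct vcpu d2v2 = { .id = 2, .next = NULL };

        for (int i = 0; i < NR_PCPUS; i++)
            pthread_spin_init(&wq[i].wake_vcpu_lock, PTHREAD_PROCESS_PRIVATE);

        notify_remote_vcpu_wake(&d2v2, 6);  /* the "pcpu 4" side */
        wakeup_softirq(6);                  /* the "pcpu 6" side */
        return 0;
    }

As the sketch makes visible, the runqueue lock is still taken on the target for every wakeup; what changes hands across pcpus is only the short wake_vcpu_lock critical section, which is exactly the point being debated below.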
>>> In my mind, we are trying to reduce the contention on the schedule lock from two aspects:
>>> 1. for those vcpus that would run on other pcpus, it will take wake_vcpu_lock instead of the schedule lock
>>>
>>I don't see how you can say "instead". It looks to me that, for those vcpus that would run on other pcpus, we need to take wake_vcpu_lock _in_addition_ to the runqueue lock.
>>
>>> 2. and in vcpu_wake(), it will not grab a remote schedule lock, which also reduces the cache coherency traffic between pcpus
>>>
>>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a remote lock, though.
>
>Hmm... you are right.
>
>>> > In fact, in the best case (i.e., d2v2 will actually be run on its v->processor), in the original code there is:
>>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
>>> > In your case, there is:
>>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
>>> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>> >
>>> Our idea may look like this:
>>> - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>> - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>>> - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>>
>>So, it's even worse than how I imagined! :-P
>>
>>> > Also, as far as locking within the wakeup path is concerned, in the original code:
>>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake() (i.e., for the time of __runq_insert() and __runq_tickle()).
>>> > In your case:
>>> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the deferred wakeup
>>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>>> >
>>> The wake_vcpu_lock is held in the first step above. The runqueue lock (I think you mean the schedule lock?) won't be taken in this case.
>>>
>>It will be taken, for executing vcpu_wake(). From your own description of the mechanism above:
>>
>>"Our idea may look like this:
>> 1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>> 2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>> 3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>>
>>1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
>>2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
>>3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for csched_schedule()
>
>Oh, you are right.
>
>We try to let pcpu's 6 schedule lock be grabbed only by pcpu 6, while the wake_vcpu_lock is what gets grabbed by others.
>
>I am thinking our change may benefit like this.
>
>The situation is:
>* vcpu's priority is higher
>* pcpu 9 is idle
>* pcpu 6 and pcpu 9 both get SCHEDULE_SOFTIRQ raised
>
>At this moment pcpu 4 is holding pcpu 6's runqueue lock, so neither pcpu 6 nor pcpu 9 can do their job until pcpu 4 releases pcpu 6's runqueue lock.
>
>While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 and pcpu 9 can do their job immediately. Ah I know, a very corner case, not sure about the effect.
>
>BTW, I suddenly found softirq is implemented as an IPI. So I am wondering whether it is safe to raise a softirq to ourselves, since it looks like we already hold the runqueue lock. Or is it because we have turned the irq off in vcpu_wake()?
>
>>> > (i.e., for the time of __runq_insert() and __runq_tickle()).
>>> >
>>> > Which, to me, looks like both more inter-CPU communication overhead and more chances for lock contention.
>>> Hmm... yes, more inter-CPU communication, while less lock contention I think.
>>>
>>More inter-CPU communication, the same level of local lock contention plus some added moderately to low contended remote lock contention, AFAICT.
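On the "is it safe to raise a softirq to ourselves with the runqueue lock held and IRQs off" question: as far as I understand (this is a simplified model, not the actual Xen softirq code), raising a softirq for the local cpu just sets a bit in that cpu's pending mask and nothing runs immediately; only a raise targeted at a remote cpu needs an IPI. The pending bits are consumed later, at the next do_softirq() point, after the lock is dropped and IRQs are back on. Roughly:

    #include <stdio.h>

    #define NR_CPUS           8
    #define SCHEDULE_SOFTIRQ  1u

    static unsigned int softirq_pending[NR_CPUS];  /* one pending mask per cpu */
    static int this_cpu = 6;                       /* pretend we run on pcpu 6 */

    static void send_ipi(int cpu) { printf("IPI -> pcpu %d\n", cpu); }

    /* Raising for the local cpu is just a bit set: nothing runs right away,
     * so doing it under a lock / with IRQs off is fine. */
    static void raise_softirq_on(int cpu, unsigned int nr)
    {
        softirq_pending[cpu] |= 1u << nr;
        if (cpu != this_cpu)
            send_ipi(cpu);   /* only a remote target needs to be interrupted */
    }

    /* Called later, once locks are dropped and IRQs are enabled again
     * (idle loop, return-to-guest path, ...): run what is pending. */
    static void do_softirq(void)
    {
        while (softirq_pending[this_cpu] & (1u << SCHEDULE_SOFTIRQ)) {
            softirq_pending[this_cpu] &= ~(1u << SCHEDULE_SOFTIRQ);
            printf("pcpu %d: schedule()\n", this_cpu);
        }
    }

    int main(void)
    {
        raise_softirq_on(6, SCHEDULE_SOFTIRQ);  /* to self: just a pending bit */
        raise_softirq_on(9, SCHEDULE_SOFTIRQ);  /* remote: pending bit + IPI    */
        do_softirq();
        return 0;
    }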
>>> > So, again, more IPIs and more (potential) lock contention.
>>> >
>>> > True, the queueing into the deferred wakeup list will be very simple and quick, and hence the critical section needed for implementing it very short. But the problem is that you are not shortening something long, or splitting something long in two shorter chunks, you are _adding_ something, and thus you can't possibly win. :-/
>>> >
>>> Yes, we are moving some thing, let me digest it for a while.
>>>
>>Sure. Let me also add that I'm not saying it's not going to work, or that it's not going to improve the situation.
>>
>>What I'm saying is that, architecturally, it's not too different from the current situation, so I would not expect wonders from just this.
>>
>>What I'd do would be to try figuring out where it is that the most waiting time is being accumulated, i.e., which *specific* locking attempts of the scheduler's runqueues' locks are the most contended, and focus on those first.
>>
>>If you're on a big host, I think the way in which Credit1 does load balancing (i.e., by work stealing) may be the thing to blame. But all these are things that only profiling and benchmarking can really tell! :-)
>>
>>As a side note, we're also working on something which, at the end of the day, is rather similar, although for different purposes. In fact, see here:
>>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html
>>
>>The goal is not having to disable IRQs during scheduling, and I'm looking at keeping the per-pcpu deferred wakeup list lockless, in order not to add more lock contention.
>>

Dario,

Just got a chance to look into this. It is interesting, and I am trying to understand the problem you want to solve first.

vcpu_wake() is a _special_ case that has to turn off IRQs, because SCHED_OP(wake, v) not only enqueues the vcpu but also tickles a pcpu to pick up the queued vcpu, and the tickling part needs IRQs to be turned off.

This is the part I don't get. Why do we have to turn off IRQs to tickle a pcpu? Is the motivation that enabling IRQs in the schedule handlers benefits performance, or something else?

I didn't look into your code yet. From Andrew's description, I come up with several questions:

1. Which per-cpu list do you want to queue the vcpu on? The one of v->processor?
2. Would SCHEDULE_WAKE_SOFTIRQ do the runqueue insert and the tickle?
3. Is the purpose that the schedule softirq could then run with IRQs enabled?
4. Would the schedule softirq do the jobs currently done in vcpu_wake()?

Taking the per-cpu branch as an example, I see several commits. The top 6 are the related ones, right?

>>If you're interested in knowing more or working on that, just let me know. But, of course, this has to be done on Xen upstream, not on an old version.
>>
>>Regards,
>>Dario
>>--
>><<This happens because I choose it to happen!>> (Raistlin Majere)
>>-----------------------------------------------------------------
>>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
>--
>Richard Yang
>Help you, Help me

--
Wei Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel