
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Wed, Sep 14, 2016 at 06:44:17PM +0800, Wei Yang wrote:
>On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>>[using xendevel correct address]
>>
>>On Tue, 2016-09-13 at 16:54 +0800, Wei Yang wrote:
>>> On Fri, 2016-09-09 at 17:41 +0800, Wei Yang wrote:
>>> > 
>>> > I'm not surprised by that. Yet, I'd be interested in hearing more
>>> > about the profiling you have done (things like how you captured
>>> > the data, what workloads exactly you are considering, how you
>>> > determined what the bottleneck is, etc.).
>>> Let me try to explain this.
>>> 
>>> Workload:  a. Blue Screen in Windows guests
>>>            b. some client using Windows to do video processing,
>>>               which needs precise timestamps (we are not sure of
>>>               the root cause, but we see the system is very slow)
>>>
>>Do you mind sharing just a bit more, such as:
>> - number of pcpus
>> - number of vcpus of the various VMs
>
>160 pcpus
>16 vcpus per VM, and 8 VMs
>
>>
>>I also am not sure what "a. Blue screen in Windows guests" above
>>means... is there a workload called "Blue Screen"? Or is it that you
>>are hitting a BSOD in some of your guests (which ones, what were they
>>doing)? Or is it that you _want_ to provoke a BSOD on some of your
>>guests? Or is that something else? :-P
>>
>
>Yes, the "Blue Screen" is how we mimic the behavior we saw at the client site.
>
>In my mind, a "Blue Screen" forces the hypervisor to do load balancing.
>
>>> Capture the data:  lock profiling
>>> Bottleneck check:   the schedule lock wait time is really high
>>> 
>>Ok, cool. Interesting that lock profiling works on 4.1! :-O
>>
>>> > The scheduler tries to see whether the v->processor of the waking
>>> > vcpu can be re-used, but that's not at all guaranteed, and again,
>>> > on a very loaded system, not even that likely!
>>> > 
>>> Hmm... I may have missed something.
>>> 
>>> Taking your example below, in my mind the process looks like this:
>>> 
>>>     csched_vcpu_wake()
>>>         __runq_insert(),  insert the vcpu into pcpu 6's runq
>>>         __runq_tickle(),  raise SCHEDULE_SOFTIRQ on pcpu 6 or others (1)
>>>     
>>Well, yes. More precisely, at least in current staging,
>>SCHEDULE_SOFTIRQ is raised for pcpu 6:
>> - if pcpu 6 is idle, or
>> - if pcpu 6 is not idle but there actually isn't any idle vcpu, and
>>   the waking vcpu is higher in priority than what is running on pcpu 6
>>
>>>     __do_softirq()
>>>         schedule()
>>>             csched_schedule(),  for pcpu 6, it may wake up d2v2
>>>                                 based on its priority
>>>
>>Yes, but it is pcpu 6 that will run csched_schedule() only if the
>>conditions mentioned above are met. If not, it will be some other pcpu
>>that will do so (or none!).
>>
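(Editorial aside: the tickling behavior discussed above can be condensed
into a small, self-contained C sketch. This is NOT the actual Xen code;
every name and data structure below is made up for illustration.)

    #include <stdbool.h>

    #define NR_PCPUS 160

    struct pcpu_state {
        bool idle;       /* is this pcpu running the idle vcpu? */
        int  curr_prio;  /* priority of the vcpu currently running here */
        bool tickle;     /* should SCHEDULE_SOFTIRQ be raised here? */
    };

    /* Decide whom to tickle when a vcpu with priority waking_prio wakes
     * up and its v->processor is "target" (pcpu 6 in the example). */
    static void tickle_sketch(struct pcpu_state p[NR_PCPUS],
                              int target, int waking_prio)
    {
        int i;

        if ( p[target].idle )
        {
            p[target].tickle = true;  /* target is idle: just run it there */
            return;
        }

        if ( waking_prio > p[target].curr_prio )
            p[target].tickle = true;  /* preempt what runs on target */

        /* Whether the waking vcpu preempts something or just sits in the
         * runqueue, idlers (if any) are tickled too, so one of them can
         * come and steal whichever vcpu is left behind. */
        for ( i = 0; i < NR_PCPUS; i++ )
            if ( p[i].idle )
                p[i].tickle = true;
    }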
>>But I just realized that the fact that you are actually working on 4.1
>>is going to be an issue. The code that is now in Xen has changed
>>**a lot** since 4.1. You're missing all the soft-affinity work (which
>>you may or may not find useful), but also improvements and bugfixes.
>>
>
>That's true...
>
>>I'll have a quick look at what __runq_tickle() looks like in Xen 4.1,
>>but there's very little chance I'll be able to provide detailed
>>review, advice, testing, etc., on such an old codebase. :-(
>>
>>> By looking at the code, what I missed may be in __runq_tickle():
>>> in case there are idle pcpus, SCHEDULE_SOFTIRQ is also raised on
>>> them. By doing so, those idle pcpus would steal other busy pcpus'
>>> tasks, and might by chance get d2v2?
>>> 
>>Yes, it looks like, in Xen 4.1, this is more or less what happens. The
>>idea is to always tickle pcpu 6, if the priority of the waking vcpu is
>>higher than what is running there. 
>>
>
>Hmm... in case there are idle pcpus and the priority of the waking vcpu is
>higher than what is running on pcpu 6, would pcpu 6 have a better chance of
>running it? Would some other idle pcpu steal it from pcpu 6? Or do they have
>equal chances?
>
>>If pcpu 6 was not idle, we also tickle one or more idle pcpus so that:
>> - if the waking vcpu preempted what was running on pcpu 6, an idler
>>   can pick up ("steal") such preempted vcpu and run it;
>> - if the waking vcpu ended up in the runqueue, an idler can steal it
>>
>
>Hmm... I don't get the difference between these two cases.
>
>It looks like, in both cases, an idler steals the vcpu.
>
>>> BTW, if the system is under heavy load, what are the chances of
>>> having an idle pcpu?
>>> 
>>It's not very likely that there will be idle pcpus in a very busy
>>system, I agree.
>>
>>> > > We can divide this into three steps:
>>> > > a) Send an IPI to the target cpu, telling it which vcpu we want
>>> > >    it to wake up.
>>> > > b) The interrupt handler on the target cpu inserts the vcpu into
>>> > >    a percpu queue and raises a softirq.
>>> > > c) The softirq handler dequeues the vcpu and wakes it up.
>>> > > 
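(Editorial aside: a minimal sketch of how the three steps above could
look. notify_remote_vcpu_wake() and WAKEUP_TASK_VECTOR are names used
later in this thread; every other name, field, and helper below is
hypothetical.)

    /* One deferred-wakeup queue per pcpu, protected by wake_vcpu_lock. */
    struct wake_queue {
        spinlock_t       wake_vcpu_lock;
        struct list_head deferred;        /* vcpus waiting to be woken */
    };
    static DEFINE_PER_CPU(struct wake_queue, wakeq);

    /* Step a: runs on the waking pcpu (pcpu 4 in the example). */
    void notify_remote_vcpu_wake(struct vcpu *v)
    {
        struct wake_queue *q = &per_cpu(wakeq, v->processor);
        unsigned long flags;

        spin_lock_irqsave(&q->wake_vcpu_lock, flags);
        list_add_tail(&v->wake_elem, &q->deferred); /* wake_elem: new field */
        spin_unlock_irqrestore(&q->wake_vcpu_lock, flags);

        send_IPI(v->processor, WAKEUP_TASK_VECTOR); /* poke pcpu 6 */
    }

    /* Step b: IPI handler, already running on the target pcpu (pcpu 6). */
    void wakeup_task_interrupt(void)
    {
        raise_softirq(SCHEDULE_WAKEUP_SOFTIRQ);     /* defer the real work */
    }

    /* Step c: softirq handler, still on the target pcpu. */
    static void wakeup_softirq_action(void)
    {
        struct wake_queue *q = &this_cpu(wakeq);
        unsigned long flags;

        spin_lock_irqsave(&q->wake_vcpu_lock, flags);
        while ( !list_empty(&q->deferred) )
        {
            struct vcpu *v = list_first_entry(&q->deferred,
                                              struct vcpu, wake_elem);

            list_del(&v->wake_elem);
            spin_unlock_irqrestore(&q->wake_vcpu_lock, flags);

            vcpu_wake(v);   /* takes this pcpu's own runqueue lock */

            spin_lock_irqsave(&q->wake_vcpu_lock, flags);
        }
        spin_unlock_irqrestore(&q->wake_vcpu_lock, flags);
    }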
>>> > I'm not sure I see how you think this would improve the situation.
>>> > 
>>> > So, quickly, right now:
>>> > 
>>> > - let's say vcpu 2 of domain 2 (from now d2v2) is waking up
>>> > - let's assume d2v2->processor = 6
>>> > - let's assume the wakeup is happening on pcpu 4
>>> > 
>>> > Right now:
>>> > 
>>> > - on pcpu 4, vcpu_wake(d2v2) takes the scheduler lock of d2v2,
>>> >   which is the runqueue lock of pcpu 6 (in Credit1, there is 1
>>> >   runqueue per pcpu, and locks are per-cpu already)
>>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu 6's runqueue
>>> > - still executing on pcpu 4, __runq_tickle() is called, and it
>>> >   determines on which pcpu d2v2 should run
>>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
>>> >   following two cases:
>>> >    a) if it was determined that d2v2 should run on pcpu 6:
>>> >        - in a (little, hopefully) while, pcpu 6 will schedule
>>> >        - it takes its own runqueue lock
>>> >        - it finds d2v2 in there, and runs it
>>> Even in this case, whether d2v2 would run now depends on its
>>> priority. Right?
>>> 
>>Yes. But both in 4.1 and in current staging, we raise
>>SCHEDULE_SOFTIRQ on pcpu 6 if and only if the priority of the waking
>>vcpu is higher, i.e., if it should preempt the currently running vcpu.
>>
>
>Oh, you are right. I didn't catch this either.
>
>This means that in case
>a) the priority is lower than the current one, and
>b) there is no idler in the system,
>
>the vcpu will sit quietly in the runq, waiting for the next schedule.
>
>>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>>> >        - in a (little, hopefully) while, pcpu 9 will schedule
>>> >        - it takes its own runqueue lock
>>> >        - if it has nothing in its runqueue (likely, if
>>> >          __runq_tickle() chose it, but not necessarily and always
>>> >          true, especially under high load), it looks around to
>>> >          see if other pcpus have any vcpu that it can execute
>>> >        - it will try to take the locks of the various pcpu's
>>> >          runqueues it comes across
>>> >        - if it manages to take the lock of pcpu 6's runqueue, it
>>> >          sees d2v2 in it, steals it, and runs it.
>>> Oh, my understanding matches yours :-)
>>> 
>>> BTW, we also found that csched_load_balance() will hold the schedule
>>> lock but sometimes not find a proper vcpu to steal. Maybe this would
>>> be a potential optimization point.
>>> 
>>Mmm... I actually don't understand what you mean here... what is the
>>possible optimization?
>>
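(Editorial aside: the stealing done by an idle pcpu, i.e. the core of
the csched_load_balance() behavior being discussed here, boils down to
a trylock scan of the other pcpus' runqueues. Below is a made-up C
sketch of that pattern, not the real Xen code; it also shows how a pcpu
can take a peer's lock and still find nothing to steal.)

    /* What an idler roughly does: scan the other pcpus' runqueues and
     * try to steal a runnable vcpu.  All helper names are made up. */
    struct vcpu *load_balance_sketch(int me)
    {
        int peer;

        for ( peer = 0; peer < NR_PCPUS; peer++ )
        {
            struct vcpu *v = NULL;

            if ( peer == me )
                continue;

            /* trylock: if the peer's lock is busy (e.g. a wakeup is in
             * progress there), just move on to the next pcpu */
            if ( !spin_trylock(runq_lock(peer)) )
                continue;

            /* We now hold the peer's lock, but may still find nothing
             * suitable: the case mentioned above as a possible
             * optimization point. */
            v = find_stealable_vcpu(peer);
            if ( v != NULL )
                runq_remove(peer, v);
            spin_unlock(runq_lock(peer));

            if ( v != NULL )
                return v;               /* e.g. d2v2: run it locally */
        }

        return NULL;                    /* nothing to steal: stay idle */
    }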
>>> > With your modifications, AFAICT:
>>> > 
>>> > - on pcpu 4, notify_remote_vcpu_wake(d2v2) takes the wake_vcpu_lock
>>> >   of pcpu 6 and queues d2v2 in pcpu 6's deferred wakeup list
>>> > - poke pcpu 6 with an IPI
>>> > - in a (little, hopefully) while, pcpu 6 reacts to the IPI and,
>>> >   _I_think_, calls vcpu_wake(d2v2)? [*]
>>> You are almost right. Here the interrupt handler does two things:
>>> 1. echo back to notify_remote_vcpu_wake() that it has finished the
>>>    job
>>> 2. raise a separate softirq, which will call vcpu_wake(d2v2)
>>> 
>>> > 
>>> > - on pcpu 6, vcpu_wake(d2v2) takes its own runqueue lock
>>> > - in csched_vcpu_wake(d2v2), d2v2 is inserted in pcpu 6's runqueue
>>> > - on pcpu 6, __runq_tickle() is called, and it determines on which
>>> >   pcpu d2v2 should run
>>> At this place, the behavior of __runq_tickle() is what I mentioned
>>> above: raise the softirq on pcpu 6 and on idle pcpus. Whoever is
>>> faster gets d2v2.
>>>
>>Sort of, yes. In general, you can expect pcpu 6 going through
>>csched_schedule(), and hence setting d2v2 to run, to be faster than
>>SCHEDULE_SOFTIRQ being notified to someone remote, and that someone
>>scheduling and getting to work stealing from pcpu 6's runqueue.
>>
>>However, that is the case only if d2v2 had higher priority than what is
>>running on pcpu 6. If it does not, you:
>> - _don't_ tickle pcpu 6
>> - tickle one or more idlers, if any.
>>
>>So, again, it's not that you always wake and run it on pcpu 6.
>>
>
>I see what you mean :-)
>
>So I am curious why we add the vcpu to pcpu 6's runqueue when we already
>know, in __runq_tickle(), that its priority is low and we won't tickle
>pcpu 6. We are sure someone will have to grab pcpu 6's schedule lock
>later and take the vcpu out.
>
>Or, in this case -- when there are idlers and the vcpu's priority is
>higher -- instead of letting pcpu 6 wake it, why not let an idler take
>it? Ask someone who is free to help out a busy one.
>
>>> > - it raises the SCHEDULE_SOFTIRQ for such pcpu. Let's look at the
>>> >   following two cases:
>>> >    a) if it was determined that d2v2 should run on pcpu 6
>>> >       (i.e., SCHEDULE_SOFTIRQ was raised by pcpu 6 on itself):
>>> >        - at the first softirq check, pcpu 6 will schedule
>>> >        - it takes its own runqueue lock
>>> >        - it finds d2v2 in there, and runs it
>>> >    b) if it was determined that d2v2 should run on, say, pcpu 9:
>>> >        - in a (little, hopefully) while, pcpu 9 will schedule
>>> >        - it takes its own runqueue lock
>>> >        - if it has nothing in its runqueue (likely, if
>>> >          __runq_tickle() chose it, but not necessarily and always
>>> >          true, especially under high load), it looks around to
>>> >          see if other pcpus have any vcpu that it can execute
>>> >        - it will try to take the locks of the various pcpu's
>>> >          runqueues it comes across
>>> >        - if it manages to take the lock of pcpu 6's runqueue, it
>>> >          sees d2v2 in it, steals it, and runs it.
>>> > 
>>> > [*] I don't see in the code you shared what happens in reaction to
>>> > the IPI WAKEUP_TASK_VECTOR, so I'm going to assume that it actually
>>> > calls vcpu_wake()
>>> > 
>>> > So, basically, it looks to me that you're adding a level of
>>> > indirection, and I'm not sure for what benefit.
>>
>>> In my mind, we are trying to reduce the contention on the schedule
>>> lock in two ways:
>>> 1. for those vcpus that would run on another pcpu, we take the
>>>    wake_vcpu_lock instead of the schedule lock
>>>
>>I don't see how you can say "instead". It looks to me that, for those
>>vcpus that would run on another pcpu, we need to take the wake_vcpu_lock
>>_in_addition_ to the runqueue lock.
>>
>>> 2. and vcpu_wake() will not grab a remote schedule lock, which also
>>>    reduces cache-coherency traffic between pcpus
>>> 
>>Right. Taking the wake_vcpu_lock of pcpu 6 from pcpu 4 is taking a
>>remote lock, though.
>
>Hmm... you are right.
>
>>
>>> > In fact, in the best case (i.e., d2v2 will actually be run on its
>>> > v->processor), in the original code there is:
>>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6.
>>> > In your case, there is:
>>> > - 1 IPI (SCHEDULE_SOFTIRQ) from pcpu 4 to pcpu 6
>>> > - 1 softirq (again SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>> > 
>>> Our idea may look like this:
>>>  - 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>>  - 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the vcpu.
>>>  - 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6.
>>> 
>>So, it's even worse than how I imagined! :-P
>>
>>> > Also, as far as locking within the wakeup path is concerned, in the
>>> > original code:
>>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>>> >   (i.e., for the time of __runq_insert() and __runq_tickle()).
>>> > In your case:
>>> > - the wake_vcpu_lock of pcpu 6 is held during queueing of the
>>> >   deferred wakeup
>>> > - the runqueue lock of pcpu 6 is held for the time of vcpu_wake()
>>> > 
>>> The wake_vcpu_lock is held in the first step above.
>>> The runqueue lock (I think you mean the schedule lock?) won't be
>>> taken in this case.
>>> 
>>It will be taken, for executing vcpu_wake(). From your own description
>>of the mechanism above:
>>
>>"Our idea may looks like this:
>>   1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
>>   2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the
>>      vcpu.
>>   3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
>>
>>1) runs on pcpu 4 and takes pcpu 6's wake_vcpu_lock
>>2) runs on pcpu 6 and takes pcpu 6's runqueue lock, for vcpu_wake()
>>3) runs on pcpu 6 and takes pcpu 6's runqueue lock, for
>>   csched_schedule()
>>
>
>Oh, you are right.
>
>We try to make pcpu 6's schedule lock be grabbed only by pcpu 6 itself,
>while the wake_vcpu_lock is what gets grabbed by others.
>
>I am thinking our change may help in a situation like this.
>
>The situation is:
>* the waking vcpu's priority is higher
>* pcpu 9 is idle
>* SCHEDULE_SOFTIRQ is raised on both pcpu 6 and pcpu 9
>
>At this moment pcpu 4 is still holding pcpu 6's runqueue lock, so neither
>pcpu 6 nor pcpu 9 can do their job until pcpu 4 releases it.
>
>In our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6 and pcpu 9
>can do their job immediately. Ah, I know, a very corner case; not sure
>about the effect.
>
>BTW, I suddenly noticed that a softirq is delivered via an IPI. So I am
>wondering whether it is safe to raise a softirq to oneself, since it looks
>like we already hold the runqueue lock at that point. Or is it fine because
>we have turned irqs off in vcpu_wake()?
>
>>> > 
>>> > (i.e., for the time of __runq_insert() and __runq_tickle()).
>>> > 
>>> > Which, to me, looks like both more inter-CPU communication overhead
>>> > and more chances for lock contention.
>>> Hmm... yes, more inter-CPU communication, but less lock contention,
>>> I think.
>>> 
>>More inter-CPU communication, the same level of local lock contention,
>>plus some added, moderately-to-lightly contended, remote lock
>>contention, AFAICT.
>>
>>> > So, again, more IPIs and more (potential) lock contention.
>>> > 
>>> > True, the queueing into the deferred wakeup list will be very simple
>>> > and quick, and hence the critical section needed for implementing
>>> > it very short. But the problem is that you are not shortening
>>> > something long, or splitting something long in two shorter chunks,
>>> > you are _adding_ something, and thus you can't possibly win. :-/
>>> > 
>>> Yes, we are moving something around; let me digest it for a while.
>>> 
>>Sure. Let me also add that I'm not saying it's not going to work, or
>>that it's not going to improve the situation.
>>
>>What I'm saying is that, architecturally, it's not too different from
>>the current situation, so I would not expect wonders from just this.
>>
>>What I'd do would be to try figuring out where it is that the most
>>waiting time is being accumulated, i.e., which *specific* locking
>>attempts of the scheduler's runqueues' locks are the most contended,
>>and focus on those first.
>>
>>If you're on a big host, I think the way in which Credit1 does load
>>balancing (i.e., by work stealing) may be the thing to blame. But all
>>these are things that only profiling and benchmarking can really tell!
>>:-)
>>
>>As a side note, we're also working on something which, at the end of
>>the day, is rather similar, although for different purposes. In fact,
>>see here:
>>https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01892.html
>>
>>The goal is not having to disable IRQs during scheduling, and I'm
>>looking at keeping the per-pcpu deferred wakeup list lockless, in order
>>not to add more lock contention.
>>
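(Editorial aside: a lockless multi-producer/single-consumer list of the
kind mentioned above is typically built around atomic operations on the
list head. Below is a generic C11 sketch of the idea; it is not the
code from the series linked above.)

    #include <stdatomic.h>
    #include <stddef.h>

    struct wake_node {
        struct wake_node *next;
        /* ... a pointer to the vcpu to wake would live here ... */
    };

    /* Producers: any pcpu can push a node without taking a lock. */
    void wake_list_push(_Atomic(struct wake_node *) *head,
                        struct wake_node *n)
    {
        struct wake_node *old = atomic_load(head);

        do {
            n->next = old;  /* "old" is refreshed on each CAS failure */
        } while ( !atomic_compare_exchange_weak(head, &old, n) );
    }

    /* Consumer: the owning pcpu detaches the whole list in one shot,
     * so producers immediately start a fresh (empty) list. */
    struct wake_node *wake_list_pop_all(_Atomic(struct wake_node *) *head)
    {
        return atomic_exchange(head, NULL);
    }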

Dario,

I just got a chance to look into this. It is interesting, and I am trying
to understand the problem you want to solve first.

vcpu_wake() is a _special_ case which has to turn off irqs, because
SCHED_OP(wake, v) not only enqueues the vcpu but also tickles a pcpu to pick
up the queued vcpu, and the tickling part needs irqs to be turned off.
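
(Editorial aside: paraphrasing from memory, today's wakeup path looks
roughly like the sketch below; it is not the literal Xen code. One
reason irqs must be off is that vcpu_wake() can itself be called from
interrupt context, so the runqueue lock has to be taken irq-safely.)

    /* Rough paraphrase of today's vcpu_wake(); note that irqs stay off
     * across both the enqueue and the tickling. */
    void vcpu_wake(struct vcpu *v)
    {
        unsigned long flags;

        vcpu_schedule_lock_irqsave(v, flags);   /* irqs off from here... */

        if ( vcpu_runnable(v) )
            SCHED_OP(wake, v);  /* __runq_insert() + __runq_tickle(),   */
                                /* i.e. tickling runs with irqs off     */

        vcpu_schedule_unlock_irqrestore(v, flags);  /* ...to here */
    }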

I don't get this part. Why do we have to turn off irqs for tickling a pcpu?

And does enabling irqs in the schedule handlers benefit performance, or
something else? What is the motivation behind this?

I didn't look into your code yet. From Andrew's description, I come up
with several questions:
1. on which pcpu's list do you want to queue the vcpu? v->processor's?
2. would SCHEDULE_WAKE_SOFTIRQ do the runqueue insert and the tickling?
3. is the purpose so that the schedule softirq can run with irqs enabled?
4. would the schedule softirq do the jobs currently done in vcpu_wake()?

Taking the per-cpu branch as an example, I see several commits. The top 6
are the related ones, right?

>>If you're interested in knowing more or working on that, just let me
>>know. But, of course, this has to be done on Xen upstream, not on an
>>old version.
>>
>>Regards,
>>Dario
>>-- 
>><<This happens because I choose it to happen!>> (Raistlin Majere)
>>-----------------------------------------------------------------
>>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>
>
>
>
>-- 
>Richard Yang
>Help you, Help me

-- 
Wei Yang
Help you, Help me
