
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Fri, Sep 16, 2016 at 06:07:08PM +0200, Dario Faggioli wrote:
>On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote:
>> On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
>> > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
>> > If the system is not overbooked, it's a bit strange that the
>> > scheduler
>> > is the bottleneck.
>> I looked at the original data again. I don't see detailed data to
>> describe the
>> dom0 configuration.
>> 
>I see. No collection of output of top and xentop in dom0 either then,
>I'm guessing? :-/
>

Probably; let me check with someone to see whether we have any luck.

>> The exact user model is not accessible from our client. We guess
>> their model
>> looks like this.
>> 
>> 
>>      +--------+     +-----------+         +-------------+
>>      |Timer   | --->|Coordinator| ---+--->|Worker       |
>>      +--------+     +-----------+    |    +-------------+
>>                                      |
>>                                      |    +-------------+
>>                                      +--->|Worker       |
>>                                      |    +-------------+
>>                                      |
>>                                      |    +-------------+
>>                                      +--->|Worker       |
>>                                           +-------------+
>> 
>> One Coordinator would drive several workers based on a high
>> resolution timer. Periodically, the workers would be woken up by the
>> coordinator. So at one moment, many workers would be woken up, which
>> would trigger vcpu_wake() in Xen.
>> 
>It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if
>the graph describes the workload run inside the (and if yes, which
>ones) VMs... but that is not terribly important, after all.
>

Oh, yes, these are threads in a VM. Each VM may contain several
groups (instances) of these threads.
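
To make the pattern concrete, here is a minimal sketch of what I believe
they do (my guess at the shape of their code, not their actual code): one
timer-driven coordinator broadcasts to N workers, so N threads become
runnable at almost the same instant, and that is what shows up as a burst
of vcpu_wake() in Xen.

    /* Hypothetical coordinator/worker pattern, not the client's code:
     * one timer-driven coordinator broadcasts to all workers, so they
     * all become runnable at (almost) the same instant. */
    #include <pthread.h>
    #include <unistd.h>

    #define NR_WORKERS 8

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  tick = PTHREAD_COND_INITIALIZER;
    static unsigned long generation;

    static void *worker(void *arg)
    {
        unsigned long seen = 0;

        for ( ; ; )
        {
            pthread_mutex_lock(&lock);
            while ( generation == seen )
                pthread_cond_wait(&tick, &lock);  /* sleep until next tick */
            seen = generation;
            pthread_mutex_unlock(&lock);
            /* ... do the periodic work ... */
        }
        return NULL;
    }

    static void *coordinator(void *arg)
    {
        for ( ; ; )
        {
            usleep(1000);                   /* stand-in for the hires timer */
            pthread_mutex_lock(&lock);
            generation++;
            pthread_cond_broadcast(&tick);  /* all workers wake at once */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int i;

        for ( i = 0; i < NR_WORKERS; i++ )
            pthread_create(&tid, NULL, worker, NULL);

        pthread_create(&tid, NULL, coordinator, NULL);
        pthread_join(tid, NULL);            /* runs forever */
        return 0;
    }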

>> Not sure this would be a possible reason for the burst vcpu_wake()?
>> 
>Well, there would be, at least potentially, a sensible number of vcpus
>waking up, which indeed can make the runqueue locks of the various
>pcpus contended.
>
>But then again, if the system is not oversubscribed, I'd tend to think
>it to be tolerable, and I'd expect the biggest problem to be the work-
>stealing logic (considering the high number of pcpus), rather than the
>duration of the critical sections within vcpu_wake().
>

Yes, we are trying to improve the stealing part too.

Sounds reasonable: vcpu_wake() is O(1), while "stealing" is O(N) in terms of
#pcpus.
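
Just to check that I read the stealing path right, here is a rough sketch of
its shape, written from memory (so function names, arguments, and the way the
peer lock is reached may all differ from the real 4.1 source; in particular
peer_schedule_lock() below is only a stand-in):

    /* Sketch only: for every other online pcpu we trylock its runqueue
     * lock and look for work to steal, hence O(#pcpus) per invocation.
     * peer_schedule_lock() is a stand-in for however the per-pcpu lock
     * is actually reached in the tree. */
    static struct csched_vcpu *load_balance_sketch(int cpu)
    {
        struct csched_vcpu *speer = NULL;
        int peer_cpu;

        for_each_cpu_mask(peer_cpu, cpu_online_map)       /* O(#pcpus) */
        {
            if ( peer_cpu == cpu )
                continue;

            /* Don't spin on a busy peer (it may be balancing towards us
             * at the same time): just try the lock and move on. */
            if ( !spin_trylock(peer_schedule_lock(peer_cpu)) )
                continue;

            speer = csched_runq_steal(peer_cpu, cpu /*, pri, ... */);
            spin_unlock(peer_schedule_lock(peer_cpu));

            if ( speer != NULL )
                break;
        }

        return speer;
    }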

>> >  - pcpu 6 is either idle or it is running someone already (say d3v4)
>> >    + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
>> >      pcpu 6 itself, and we're done
>> Ok, it looks like the behavior differs between 4.1 and current
>> upstream. Upstream just raises the softirq to pcpu6 when it is idle,
>> while 4.1 would raise the softirq to both pcpu6 and other idlers even
>> when pcpu6 is idle.
>> 
>> I think current upstream is cleverer.
>> 
>I also think current upstream is a bit better, especially because it's
>mostly me that made it look the way it does. :-D

Ah, I wasn't trying to flatter you.

>
>But I was actually describing how 4.1 works. In fact, in 4.1, if pcpu 6
>is idle (see the '//xxx xxx xxx' comments I'm adding to the code
>excerpts):
>
>    if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
>    {
>        ....
>    }
>    if ( !cpus_empty(mask) ) //the mask contains pcpu 6 only
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>

Hmm... I don't have the code at hand, but it looks like you are right. I
misunderstood the code.

>On the other hand, if pcpu 6 is not idle (and, sticking to the example
>started in the last email, is running d3v4):
>
>    if ( new->pri > cur->pri )  //depends on d2v2's prio and d3v4's prio
>    {
>        cpu_set(cpu, mask);
>    }
>    if ( cur->pri > CSCHED_PRI_IDLE ) //is true, so let's see...
>    {
>        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle 
>pcpus at all. Let's assume there are some (i.e., let's assume this is false)
>        {
>            ....
>        }
>        else
>        {
>            cpumask_t idle_mask;
>
>            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>            if ( !cpus_empty(idle_mask) )  //is true if there are idlers 
>suitable for new (let's assume there are)
>            {
>                if ( opt_tickle_one_idle ) //chosen on boot, default is true
>                {
>                    this_cpu(last_tickle_cpu) = 
>                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
>                    cpu_set(this_cpu(last_tickle_cpu), mask);

I may have misunderstood the code previously; I like this part too.

So only one idler would be tickled even if there are several idlers in the
system. I thought we would tickle several idlers, which might introduce some
contention between them.

BTW, what is the idea behind cycle_cpu()? If it is about locality, how about
cycling within a node, e.g. the node where v->processor is? (I have put a
small sketch of what I mean further below, after the rest of your excerpt.)

>                }
>                else
>                    ....
>            }
>            cpus_and(mask, mask, new->vcpu->cpu_affinity);
>        }
>    }
>    if ( !cpus_empty(mask) ) //mask contains pcpu 6 if d2v2's prio was higher, 
>and also contains one idle pcpu
>        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);
>
>So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's
>not, at least one idler (if it exists) is tickled, and pcpu 6 is
>tickled or not, depending on priorities.
>

That's clear to me now :-)
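
Coming back to my cycle_cpu() question above, this is roughly what I had in
mind for the opt_tickle_one_idle branch (only a sketch; cpu_to_node() and
node_to_cpumask() are the names I am assuming for the NUMA helpers, I have
not checked them against the tree): prefer idlers on the node of
v->processor, and only fall back to all idlers if that node has none.

    /* Sketch only: restrict the idlers we cycle through to the node of
     * new's current processor, if that node has any idlers at all.
     * cpu_to_node()/node_to_cpumask() are assumed helper names. */
    cpumask_t node_mask, node_idlers;

    node_mask = node_to_cpumask(cpu_to_node(new->vcpu->processor));
    cpus_and(node_idlers, idle_mask, node_mask);
    if ( !cpus_empty(node_idlers) )
        idle_mask = node_idlers;           /* cycle within the local node */

    this_cpu(last_tickle_cpu) =
        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
    cpu_set(this_cpu(last_tickle_cpu), mask);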

>> > And in fact, in more recent version of Xen, I made the code do
>> > something very close to what you're suggesting. Here's the comment
>> > that
>> > you can find where all this logic is implemented (it's way more
>> > complicated than this, and than the code in 4.1, because it takes
>> > soft-
>> > affinity into account).
>> > 
>> >     /*
>> >      * If the pcpu is idle, or there are no idlers and the new
>> >      * vcpu is a higher priority than the old vcpu, run it here.
>> >      *
>> >      * If there are idle cpus, first try to find one suitable to
>> > run
>> >      * new, so we can avoid preempting cur.  If we cannot find a
>> >      * suitable idler on which to run new, run it here, but try to
>> >      * find a suitable idler on which to run cur instead.
>> >      */
>> > 
>> I like this idea.
>> 
>Then update... Xen 4.2 or 4.3 should already contain it. :-P
>
>> We found the schedule lock to be a point of contention in our scenario,
>> so the direct thought is to try to avoid grabbing it. Hmm... though our
>> idea is not that sparkling.
>> 
>Again, I can't say how sparkling it will reveal once implemented.
>Architecturally, it doesn't sound much different from the status quo to
>me, and so I'd say it's unlikely that it will solve your problem, but
>this is something always very hard to tell.
>
>And again, I'd personally spend some more time --even by temporarily
>and hackishly instrumenting the code-- to understand better where the
>bottleneck is.
>

Hmm... usually we use xentrace and lock profiling to identify the bottleneck;
is there any other method you would suggest? And does instrumenting the code
mean adding some logging to the code?
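
For example, is something along these lines what you mean by hackishly
instrumenting the code? (Only a rough sketch, not a proposed patch; the
locking macro names are from my memory of schedule.c and may differ between
versions.)

    /* Hacky instrumentation sketch: measure how long vcpu_wake() waits
     * for the per-vcpu schedule lock and how long it then holds it, and
     * print only the outliers to limit the noise. */
    void vcpu_wake(struct vcpu *v)
    {
        unsigned long flags;
        s_time_t before, entered, left;

        before = NOW();
        vcpu_schedule_lock_irqsave(v, flags);
        entered = NOW();

        /* ... original body of vcpu_wake() ... */

        vcpu_schedule_unlock_irqrestore(v, flags);
        left = NOW();

        if ( entered - before > MICROSECS(100) ||
             left - entered > MICROSECS(100) )
            printk("d%dv%d: wake waited %ldns, held %ldns\n",
                   v->domain->domain_id, v->vcpu_id,
                   (long)(entered - before), (long)(left - entered));
    }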

>> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised
>> > the
>> > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to
>> > grab
>> > their own locks for running csched_schedule(), after having reacted
>> > to
>> > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
>> > 
>> > BTW, you say that pcpu 9 is idle, but then that "it can't do his
>> > job"
>> > because pcpu 4 holds the lock of pcpu 6... I don't think I
>> > understand
>> > what you mean with this.
>> After pcpu9 gets the softirq from pcpu4, it tries to schedule and do
>> load balancing, which would take pcpu6's schedule lock.
>> 
>> As I thought previously, pcpu6's schedule lock is held by pcpu4, so
>> pcpu9 has to wait until pcpu4 releases it. This is what I meant by
>> "it can't do its job immediately".
>> 
>Yeah, but...
>
>> While as you mentioned, compared with pcpu9, pcpu4 would release
>> pcpu6
>> schedule lock earlier. :-)
>> 
>... indeed that's what I think happens 99% of the time. And this is
>also why I'd be tempted to think that the issue may be lying somewhere
>else.
>
>E.g., something that has proven effective for others (and for which
>I've got an unfinished and never submitted patch to upstream), is
>keeping track of what pcpus actually have at least 1 vcpu in their
>runqueues (i.e., at least 1 vcpus that is runnable and not running)
>and, inside balance_load(), only consider those when looking for work
>to steal.

Looks like we keep a cpumask of the pcpus that have at least one
runnable-but-not-running vcpu, and during the stealing process we just scan
that mask instead of the online cpumask.
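
Something along these lines, I guess (all the names below are made up by me
just to check my understanding; this is not your patch):

    /* Sketch: keep a mask of pcpus whose runqueue has at least one
     * runnable-but-not-running vcpu, and only scan that mask when
     * looking for work to steal. */
    static cpumask_t csched_has_work;      /* would need its own locking */

    /* Call from the runq insert/remove paths whenever the number of
     * queued, not-running vcpus on 'cpu' goes 0 -> 1 or 1 -> 0. */
    static void csched_mark_work(int cpu, int has_work)
    {
        if ( has_work )
            cpu_set(cpu, csched_has_work);
        else
            cpu_clear(cpu, csched_has_work);
    }

    /* In balance_load(), walk csched_has_work instead of cpu_online_map. */
    static int csched_pick_steal_victim(int cpu)
    {
        int peer_cpu;

        for_each_cpu_mask(peer_cpu, csched_has_work)
            if ( peer_cpu != cpu )
                return peer_cpu;   /* then trylock its runq lock and steal */

        return -1;                 /* nothing queued anywhere else */
    }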

>
>Regards,
>Dario
>-- 
><<This happens because I choose it to happen!>> (Raistlin Majere)
>-----------------------------------------------------------------
>Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>



-- 
Wei Yang
Help you, Help me

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

