
Re: [Xen-devel] [Help] Trigger Watchdog when adding an IPI in vcpu_wake



On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
> On Tue, Sep 13, 2016 at 01:30:17PM +0200, Dario Faggioli wrote:
>
> > Do you mind sharing just a bit more, such as:
> >  - number of pcpus
> >  - number of vcpus of the various VMs
> 160 pcpus
> 16 vcpus in VM and 8 VMs
> 
So, 16x8=128, which means you're not even oversubscribing. Maybe some
of the pcpus are hyperthreads and not full cores (are they?), but
still, this is not something I'd describe as "super intensive cpu load".

Oh, well, there's dom0 of course. So, if dom0 has more than 160-128=32
vcpus (is this the case?), you technically are oversubscribing. But
then, what does dom0 do? Is it very busy doing some stuff on its own,
or running the backends/etc for the VMs? Can you check that with top,
xentop, and the like?

If the system is not overbooked, it's a bit strange that the scheduler
is the bottleneck.

> Yes, the "Blue Screen" is what we want to mimic the behavior from
> client.
> 
> The "Blue Screen" will force the hypervisor to do load balance in my
> mind.
> 
I've no idea what this means (but I don't know much of Windows and of
what happens when it crashes with a blue screen).

> > I'll have a quick look at how __runq_tickle() looks like in Xen
> > 4.1,
> > but there's very few chances I'll be available to provide detailed
> > review, advice, testing, etc., on such an old codebase. :-(
> > 
> > > 
> > > By looking at the code, what I missed may be in __runq_tickle(),
> > > which in case
> > > there are idle pcpu, schedule_softirq is also raised on them. By
> > > doing so,
> > > those idle pcpus would steal other busy tasks and may in chance
> > > get
> > > d2v2?
> > > 
> > Yes, it looks like, in Xen 4.1, this is more or less what happens.
> > The
> > idea is to always tickle pcpu 6, if the priority of the waking vcpu
> > is
> > higher than what is running there. 
> > 
> Hmm... in case there are idle pcpus and the priority of the waking
> vcpu is
> higher than what is running on pcpu 6, would pcpu 6 have more chance
> to run it?
> or other idle pcpu would steal it from pcpu 6? or they have equal
> chance?
> 
No, it works like this:

 - pcpu 6 is either idle or it is running someone already (say d3v4)
   + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
     pcpu 6 itself, and we're done
   + if pcpu 6 is running d3v4 and there is no other idle pcpu:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 and we're done
     * if prio(d2v2) < prio(d3v4) we just don't tickle anyone
   + if pcpu 6 is running d3v4 and there are some idle pcpus:
     * if prio(d2v2) > prio(d3v4) we tickle pcpu 6 _AND_ one or  [1]
       more of the idle pcpus
     * if prio(d2v2) < prio(d3v4) we _ONLY_ tickle one or more   [2]
       of the idle pcpus

Idea behind [1] is that d2v2 should preempt d3v4 on pcpu 6, and that's
why we tickle pcpu 6. However, that would mean that d3v4 would be put
back in pcpu 6's runqueue. So, if there are idle pcpus, we tickle them
so that one can come and steal d3v4.

Idea behind [2] is that d3v4 should continue to run on pcpu 6, while
d2v2 will be put in pcpu 6's runqueue. However, if there are idle
pcpus, we tickle them so that one can come and steal d2v2.

What really happens is not always what is expected, because it's
possible that, even if prio(d2v2) > prio(d3v4), an idler is quicker in
waking up and stealing d2v2 from pcpu 6's runqueue, where it was stashed
while pcpu 6 itself reschedules and enacts the preemption... But that's
unlikely and, all in all, not catastrophic (although, of course, worse
in terms of overhead, locality, etc.).
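
Just to put the above in code form, here's an illustrative sketch of
that decision. This is not the literal 4.1 source: prio(), tickle()
and tickle_idlers() are made-up helpers standing in for the real
priority comparison and for raising SCHEDULE_SOFTIRQ.

    static void runq_tickle_sketch(unsigned int cpu, struct vcpu *new,
                                   struct vcpu *cur, bool_t idlers_exist)
    {
        if ( is_idle_vcpu(cur) )
        {
            /* pcpu "cpu" is idle: tickle it and we're done. */
            tickle(cpu);
            return;
        }

        if ( prio(new) > prio(cur) )
            tickle(cpu);           /* case [1]: new should preempt cur */

        if ( idlers_exist )
            tickle_idlers();       /* cases [1] and [2]: let an idler come
                                    * and steal whoever is left queued
                                    * (cur in [1], new in [2]). */
    }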

> > If pcpu 6 was not idle, we also tickle one or more idle pcpus so
> > that:
> >  - if the waking vcpu preempted what was running on pcpu 6, an
> > idler
> >    can pick-up ("steal") such preempted vcpu and run it;
> >  - if the waking vcpu ended up in the runqueue, an idler can steal
> > it
> > 
> Hmm... I don't get the difference between these two cases.
> 
> Looks both are an idler steal the vcpu.
> 
I hope it's more clear now. :-)

> > Yes. But both in 4.1 and in current staging, we only raise
> > SCHEDULE_SOFTIRQ on pcpu 6, if and only if the priority of the
> > waking
> > vcpu is higher and it should preempt the currently running vcpu.
> > 
> Oh, you are right. I didn't catch this either.
> 
> This means in case 
> a) the priority is lower than current one 
> b) no idler in system
> 
> The vcpu will sit quietly in the runq and waiting for the schedule
> next time.
> 
Well, yes. It will stay in pcpu 6's runqueue until either:
 - what's running on pcpu 6 blocks, or finishes its credits, etc.,
   and pcpu 6 reschedules and picks it up
 - some other pcpu becomes idle and, when poking around the various
   pcpus for stealing work, picks it up

Of course, both depend on the position the vcpu occupies in pcpu 6's
runqueue when the specific event happens.

This is a limit of the Credit1 scheduler, especially of the version
that you find in 4.1. Newer code is a bit better in this respect
(although not dramatically so... that just has not been possible! :-/).

> > However, that is the case only if d2v2 had higher priority than
> > what is
> > running on pcpu 6. If it does not, you:
> >  - _don't_ tickle pcpu 6
> >  - tickle one or more idlers, if any.
> > 
> > So, again, it's not that you always wake and run it on pcpu 6
> > 
> I see what you mean :-)
> 
> So I am curious about why we add this in pcpu's 6 queue, when we know
> the
> priority is low in __runq_tickle() and won't tickle that. And we are
> sure
> someone in the future will grab pcpu's 6 schedule lock and take it.
> 
Yes, but what should we do? We've woken it up and its v->processor
points to pcpu 6, so the lock that vcpu_schedule_lock() takes is
pcpu 6's runqueue lock.
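
(For reference, this is roughly why: the lock that gets taken is looked
up through v->processor itself. The sketch below is from memory,
simplified, and not the verbatim 4.1 code.)

    static inline void vcpu_schedule_lock_sketch(struct vcpu *v)
    {
        unsigned int cpu;

        for ( ; ; )
        {
            cpu = v->processor;
            spin_lock(&per_cpu(schedule_data, cpu).schedule_lock);
            if ( likely(v->processor == cpu) )
                break;
            /* v was migrated while we were spinning: retry on the new pcpu. */
            spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
        }
    }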

To queue it on some other pcpu we'd need to basically migrate it (i.e.,
at least changing the value of v->processor, but also other things,
like checking hard affinity, etc.). And we would also need to take the
runqueue lock of that other pcpu. All this in the wakeup path, which
would become slow and complex. And we'd still need all the work stealing
logic (in load_balance(), etc.) to exist, to cover the case of work
needing to be stolen when wakeups are not involved (i.e., it's not that
we make other parts of the scheduler's code any less complex).

So, basically, of course there probably were alternatives, but the
basic model, pretty much everywhere, is "queue where it was and tickle
others to come pick it up if suitable", and that's what's done here too
(and consistency is a very good thing :-D).

> Or in this case -- when there are idlers and vcpu's priority is
> higher,
> instead of let pcpu 6 to wake this, but let an idler to take it? Ask
> someone
> who is free to help a busy one.
> 
And in fact, in more recent versions of Xen, I made the code do
something very close to what you're suggesting. Here's the comment that
you can find where all this logic is implemented (it's way more
complicated than this, and than the code in 4.1, because it takes soft-
affinity into account):

    /*
     * If the pcpu is idle, or there are no idlers and the new
     * vcpu is a higher priority than the old vcpu, run it here.
     *
     * If there are idle cpus, first try to find one suitable to run
     * new, so we can avoid preempting cur.  If we cannot find a
     * suitable idler on which to run new, run it here, but try to
     * find a suitable idler on which to run cur instead.
     */
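
In (purely hypothetical) pseudo-C, ignoring soft-affinity entirely and
with find_suitable_idler() and tickle() being made-up helpers, that
comment translates to something like:

    /* Conceptual sketch only: the real csched_runq_tickle() is far more
     * involved (soft affinity, multiple idlers, etc.). */
    unsigned int i;

    if ( is_idle_vcpu(cur) || (!idlers_exist && prio(new) > prio(cur)) )
        tickle(cpu);                    /* run new right here */
    else if ( idlers_exist )
    {
        if ( (i = find_suitable_idler(new)) != NR_CPUS )
            tickle(i);                  /* run new there, don't preempt cur */
        else
        {
            tickle(cpu);                /* run new here after all... */
            if ( (i = find_suitable_idler(cur)) != NR_CPUS )
                tickle(i);              /* ...and let that idler pick up cur */
        }
    }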

> > It will be taken, for executing vcpu_wake(). From your own
> > description
> > of the mechanism above:
> > 
> > "Our idea may looks like this:
> >    1) 1 IPI (WAKEUP_TASK_VECTOR) from pcpu 4 to pcpu 6
> >    2) 1 softirq (SCHEDULE_WAKEUP) to self on pcpu 6 to queue the
> >       vcpu.
> >    3) 1 softirq (SCHEDULE_SOFTIRQ) to self on pcpu 6."
> > 
> > 1) runs on pcpu 4 and takes pcpu's 6 wake_vcpu_lock
> > 2) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for vcpu_wake()
> > 3) runs on pcpu 6 and takes pcpu's 6 runqueue lock, for
> >    csched_schedule()
> > 
> Oh, you are right.
> 
> We try to let the pcpu's 6 schedule lock just be grabbed by pcpu 6,
> while
> looks the wake_vcpu_lock is grabbed by others.
> 
I understand that, and I can't say that this is not going to improve
things for you. All I'm saying is that you're potentially reducing the
overhead of one "operation", but at the same time adding some overhead
in other "operations".

Depending on a multiplicity of aspects, the net effect could be
positive or negative.

You asked for general advice on the solution, and help with the
code. I can't help much debugging hangs in Xen 4.1. My opinion on the
solution is that, architecturally, it is not something that I'd call an
absolute improvement.

Let me put it this way: if you'd send a patch series implementing what
we're talking about for upstream Xen, I would not Ack it without solid
data, coming from benchmarks run on different platforms, with different
workloads and under different load conditions, clearly showing that
performance is improved.

Translated to your case, which is, AFAICT, that you need something on
top of 4.1: if you can hack this up quickly and try it, then good. If
this is taking a lot of time, I'm not sure I'd invest such time in this
(e.g., I'd rather try to narrow down even more the cause of the issue
you're seeing and, after that, look at other solutions). But of course
I'm not you, this is only my opinion, and you absolutely have no reason
to follow my advice. :-) :-)

> I am thinking our change may benefit like this .
> 
> The situation is:
> * vcpu's priority is higher
> * pcpu 9 is idle
> * pcpu 6 and pcpu 9 are raised SCHEDULE_SOFTIRQ
> 
> At this moment pcpu 4 is holding pcpu 6 runqueue lock, neither pcpu 6
> nor pcpu 9 could do their job until pcpu 4 release pcpu 6 runqueue
> lock.
> 
Yes, but pcpu 4 releases pcpu 6's lock right after having raised the
softirq for pcpus 6 and 9. Most likely, by the time 6 and 9 try to grab
their own locks for running csched_schedule(), after having reacted to
the IPI, etc., pcpu 4 will have released pcpu 6's lock already.

BTW, you say that pcpu 9 is idle, but then that it "can't do its job"
because pcpu 4 holds the lock of pcpu 6... I don't think I understand
what you mean by this.

> While in our case, pcpu 4 doesn't hold the runqueue lock, so pcpu 6
> and pcpu 9
> could do their job immediately. Ah I know, very corner, not sure
> about the
> effect.
> 
Again, I don't get what "do their job" and "immediately" mean.

In both cases the sequence of events is as follows:
 1. queue the task in one of pcpu 6's queues (either the runq, in the
    original code, or the wakeup list, with your modifications)
 2. tickle pcpu 6
 3. pcpu 6 reacts to being tickled
 4. pcpu 6 starts running the new vcpu

In the original code, 1 and 4 are serialized by the same lock; with
your modifications, half of 1 can happen under a different lock. But 1
and 4 almost never overlap anyway, so I don't think you're gaining much
parallelism.

What (maybe) you're achieving is that 1 interferes a little bit less
with other things that want to happen on pcpu 6 (e.g., because of
other vcpus than the one waking up), which may be a good thing if there
are a lot of wakeups. But then again, if there are a lot of wakeups,
you will probably see contention on the wake_vcpu_lock!
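
Just to make the "half of 1 can happen under a different lock" point
concrete, here's a sketch of the two wakeup paths as I understand them.
Everything below is illustrative: wake_data, wake_lock, wake_list,
wake_elem and send_wakeup_ipi() are assumptions based on your
description, not real code.

    /* Original path, running on pcpu 4: queue the waking vcpu straight
     * into pcpu 6's runqueue, under pcpu 6's runqueue lock, and tickle
     * (the real code passes the credit-specific wrapper of v, of course). */
    vcpu_schedule_lock_irq(v);              /* pcpu 6's runqueue lock */
    __runq_insert(v->processor, v);         /* step 1 */
    __runq_tickle(v->processor, v);         /* step 2 */
    vcpu_schedule_unlock_irq(v);

    /* Modified path, running on pcpu 4 (as I understand your proposal):
     * only touch a per-pcpu wakeup list, protected by its own lock, and
     * let pcpu 6 move the vcpu into its runqueue later, under its own
     * runqueue lock. I.e., half of step 1 moves over to pcpu 6. */
    spin_lock(&per_cpu(wake_data, v->processor).wake_lock);
    list_add_tail(&v->wake_elem, &per_cpu(wake_data, v->processor).wake_list);
    spin_unlock(&per_cpu(wake_data, v->processor).wake_lock);
    send_wakeup_ipi(v->processor);          /* your WAKEUP_TASK_VECTOR */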

> BTW, I suddenly found softirq is implemented as an IPI. So I am
> thinking
> whether it is safe to raise softirq to itself?
>
Softirqs are implemented by means of IPIs, but only when they're raised
for other processors. Self-raising a softirq does not involve any IPI,
just setting a bit in the pending mask (see raise_softirq() in
xen/common/softirq.c).
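
For reference, it looks roughly like this (simplified, not a verbatim
copy of xen/common/softirq.c):

    /* Self-raise: just set the bit in this pcpu's pending mask; no IPI. */
    void raise_softirq(unsigned int nr)
    {
        set_bit(nr, &softirq_pending(smp_processor_id()));
    }

    /* Raising on another pcpu: set the bit and, only if it was not
     * already set and the target is not ourselves, send the IPI. */
    void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
    {
        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
             (cpu != smp_processor_id()) )
            smp_send_event_check_cpu(cpu);
    }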

> Since it looks we already hold
> the runqueue lock. Or because we have turned the irq off in
> vcpu_wake()?
> 
We hold the runqueue lock, so what? Should we schedule that
immediately? That would probably be possible, but it's again an
architectural decision, and a matter of consistency. In Xen, we
schedule when we find the SCHEDULE_SOFTIRQ to be pending.

Anyhow, I would not expect wonders by embedding the re-scheduling code
directly here either.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

