
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



On Tue, 2015-06-09 at 22:54 -0700, Meng Xu wrote:
> Hi Dario and Dagaen,
> 
Hi,

> 2015-06-09 5:53 GMT-07:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:

> > Thanks for doing this. The first question I have is whether you run any
> > test/benchmark that you can show the result of here.
> 
> Thank you very much Dario for your quick review!
> 
It wasn't as quick as I wanted it to be. I'll improve on that. :-)

> > For instance, a few weeks ago, Meng (while trying to do a completely
> > different thing) sent here on the list some numbers about the frequency
> > of the scheduler being called, and the overhead caused by that... Would
> > it be hard to run the same experiments and collect the numbers, with and
> > without your patch?
> 
> I think there should be two ways to evaluate the performance:
> 1) Using xentrace to trace the frequency of the scheduler being called
> and the overhead, when we run workload inside guest domains;
>
Yes, that counts more as a way of testing whether what is done here
actually works: we expect the scheduler to be invoked a lot less, and
those numbers would tell us whether or not that is the case.

So, please, provide these numbers (at least the number/frequency of
invocations; overhead measurements can come later) as soon as practical.

> 2) Using cyclictest as Dario mentioned before to test the real-time
> performance at end user. Dagaen, I can provide you the commands to run
> it, which is actually quite simple to run.
> 
It is indeed, and that's more similar to a proper performance evaluation
of an aspect which is rather critical for this scheduler, and really
important for people wanting to use it.

So, yes, seeing some results would be great, independently from the
specific work done in this patch.

> > On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> > > To do this, we create a new list that holds, for each
> > > vcpu, the time least into the future that it may need to be
> > > rescheduled.
> > >
> > Ok. Actually, what I really expected to see in this patch was, either:
> >  a) a new, scheduler wide, list of replenishment times _and_ a timer,
> >     always (re)programmed to fire at the most imminent one; or
> >  b) one timer for each vcpu, always (re)programmed to fire at the time
> >     instant when the replenishment for that vcpu should happen.
> 
> Ah...
> 
> Here what we are trying to do is:
> Reuse the existing timer that is used by rt_schedule().
>
Yeah, and what I can't figure out is why you decided to do so. The
reason I don't like it is that things become a lot (more) complex to
understand, maintain and modify.

>  We use  that
> timer as the timer to invoke the schedule function (rt_schedule). So
> that timer has two purposes:
> (a) budget enforcement as it was;
> (b) budget replenish (and deadline enforcement since deadline = period
> for each VCPU).
> 
Exactly, the timer has two purposes. Actually, it has three, as you're
saying yourself ("budget replenish (*and* deadline enforcement)"). It
should have exactly one. :-)

> When the timer fires, the rt_schedule is called: enforcement the
> budget, replenish the budget of VCPU(s), and pick the (next) VCPU.
> 
Right. OTOH, I think that all it should do is pick the next vcpu, if
necessary. In fact, the whole point of what Dagaen is doing is to
dramatically reduce the calls to rt_schedule(), when they are not
necessary. Replenishing the budgets of all the runnable vcpus is
not the core scheduling function's business, especially considering that
we're holding a global spinlock while doing it.

> In other words, we can think about the scheduler in another way:
> The scheduling decision happens at the following two time points:
> (a) budget enforcement as it was;
> (b) budget replenish and deadline enforcement;
> 
The scheduler is everything that's in sched_rt.c; the point here is how
things inside it should be organized.

> Whenever any of the above two situations occurs, the scheduler may
> need to pick the next VCPU to run or preempt a current running VCPU.
> Therefore, the logic for scheduling decision is unavoidable when
> either of these two situation occurs, IMO.
> 
A budget replenishment for a vcpu with a big period, in a busy system,
may well end up not requiring any immediate rescheduling of any of the
shorter-period vcpus, may it not?

So, with your approach, it is like this, when replenishment time for
vcpu X arrives:

  timer interrupt
    TIMER_SOFTIRQ raised
      process softirqs
        SCHEDULE_SOFTIRQ raised
          process softirqs
            rt_schedule()
              [spin_lock]
              burn_budget(scurr)
              __repl_update()  (goes through all runnable vcpus!!)
              snext = __runq_pick(): snext == scurr
              runq_tickle(snext)...WAIT, WHAT?!?! :-O

I mean, and I'm only noticing this now: if the replenishments done during
a particular call to rt_schedule() are not enough to change the situation
on that particular pcpu, and hence the vcpu which was running (and which
you are deliberately disturbing with _a_full_execution_ of the
scheduler's core function!) should continue to do so, then you're calling
runq_tickle() right on it. So, AFAICT, you're tickling scurr, not a
newly replenished vcpu!
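
In pseudo-C, if I'm reading the patch right, the tail of rt_schedule()
is effectively doing this (a paraphrase of the callchain above, not the
literal code):

  burn_budget(ops, scurr, now);
  __repl_update(ops, now);                    /* walks ALL runnable vcpus  */
  snext = __runq_pick(ops, cpumask_of(cpu));  /* may well be scurr again.. */
  runq_tickle(ops, snext);                    /* ..and then we tickle it!  */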

Am I missing or misreading something? Let's assume not, and see what
happens in this case...

Looking at runq_tickle(), if there are idle pcpus, you'll probably
tickle one of them, which will likely pick up the newly replenished
vcpu, like this (as a follow-up of the 'call chain' above):

  process softirqs
    SCHEDULE_SOFTIRQ
      rt_schedule()
        [spin_lock]
        __repl_update()  (goes through all runnable vcpus!! AGAIN!!)
        snext = runq_pick(): snext == vcpu X
        [spin_unlock]

If there are no idle pcpus, you probably won't tickle anyone, which is
also fine.

So, yay, it works!! Is this abuse of runq_tickle() there by design? :-D

Jokes apart, if the above analysis is accurate, I think this is a good
enough example of what I meant when saying to Dagaen "this is making
things too complex".

> In terms of performance/overhead, I think the option b) you pointed
> out has the benefit of low overhead in updating the budget because we
> don't need to hold the lock. However, the budget replenish occurs when
> deadline of the VCPU is changed (which means the next period of the
> VCPU arrives). Then it means the scheduler (may) need to make a new
> scheduling decision, in which situation, the scheduler will hold the
> lock for the runq. So I'm curious about the source of overhead the
> option b) could save over the current implementation/design Dagaen is
> doing.
> 
Leave my b) option alone, for now. With a) done as I'm suggesting, for
one, you'll be removing __repl_update() from rt_schedule(), which means
no longer going through the list of all runnable vcpus with the global
scheduler spin lock held, which really is something we should aim at for
this scheduler, sooner rather than later.

Here's how I envision things to go. Again, I'm talking about sticking
with option a), so no per-vcpu timers, just 1 timer and a global queue,
which now is a replenishment queue:

  timer interrupt
    TIMER_SOFTIRQ raised
      process softirqs
        replenishment_timer_handler()
          [spin_lock]
            <for_each_replenishment_event(repl_time < NOW()) {
               replenish(vcpu)
               runq_tickle(vcpu)
             }>
          [spin_unlock]
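
To make it a bit more concrete, here is a sketch, in C, of the kind of
handler I have in mind. Take it with a grain of salt: replq, repl_timer
and the __replq_*() helpers are names I'm making up for the new
replenishment queue machinery (they don't exist anywhere yet), while
rt_priv(), rt_update_deadline(), runq_tickle(), NOW() and set_timer()
are meant to be the existing helpers, or something very close to them:

  static void replenishment_timer_handler(void *data)
  {
      const struct scheduler *ops = data;
      struct rt_private *prv = rt_priv(ops);
      s_time_t now = NOW();
      struct rt_vcpu *svc;
      unsigned long flags;

      spin_lock_irqsave(&prv->lock, flags);

      /* In RTDS, the replenishment instant is cur_deadline (deadline ==
       * period), so serve every replenishment that is due by now. */
      while ( (svc = __replq_earliest(prv)) != NULL &&
              svc->cur_deadline <= now )
      {
          __replq_remove(prv, svc);
          rt_update_deadline(now, svc);  /* refill budget, push deadline */
          __replq_insert(prv, svc);      /* queue the next replenishment */
          runq_tickle(ops, svc);         /* poke a pcpu, if appropriate  */
      }

      /* Re-arm the timer for the next (earliest) replenishment, if any. */
      if ( (svc = __replq_earliest(prv)) != NULL )
          set_timer(&prv->repl_timer, svc->cur_deadline);

      spin_unlock_irqrestore(&prv->lock, flags);
  }

The timer itself would be init_timer()'d and first armed somewhere in
the init / vcpu insertion paths, but you get the idea.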

Then, on the tickled pcpus (if any):

  process softirqs
    SCHEDULE_SOFTIRQ
      rt_schedule()
        [spin_lock]
        snext = runq_pick(): snext == vcpu X
        [spin_unlock]

And get rid of __repl_update(), which makes my eyes bleed every time I
open sched_rt.c. :-)
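
And, to show what I mean by "get rid of", here is roughly how I picture
rt_schedule() ending up. Again, just a sketch: I'm ignoring the idle
vcpu, tasklet work and the requeueing of scurr, and the field names may
not be 100% accurate; the point is only that no replenishment related
code is left in here:

  static struct task_slice
  rt_schedule(const struct scheduler *ops, s_time_t now,
              bool_t tasklet_work_scheduled)
  {
      const unsigned int cpu = smp_processor_id();
      struct rt_vcpu *scurr = rt_vcpu(current);
      struct rt_vcpu *snext;
      struct task_slice ret = { .migrated = 0 };

      /* The global scheduler lock (which is also the pcpu schedule
       * lock) is assumed to be held already when we get here. */

      burn_budget(ops, scurr, now);

      /* No __repl_update() here: replenishments happen in their own
       * timer handler; this function only picks who runs next. */
      snext = __runq_pick(ops, cpumask_of(cpu));

      ret.task = snext->vcpu;
      ret.time = snext->cur_budget;  /* i.e., budget enforcement only */

      return ret;
  }

All the rest (comparing snext and scurr, setting 'migrated', etc.) would
stay pretty much as it is today.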

Oh, and note that we probably can use a new spin lock for the
replenishment queue, different from the existing global scheduler
spinlock, lowering contention and increasing scalability even further!
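
Data structure wise, what I'm thinking of is something like the below
(the new field names are, of course, just placeholders; and if we do
call runq_tickle() from the replenishment handler, we'll have to think
carefully about which lock protects what):

  struct rt_private {
      /* ... existing fields: global lock, sdom list, runq, depletedq ... */

      struct list_head replq;    /* replenishment events, sorted by time  */
      spinlock_t repl_lock;      /* protects replq only, not the runqueue */
      struct timer repl_timer;   /* armed for the earliest replenishment  */
  };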

> > >  The scheduler chooses the lowest time off of this
> > > list and waits until the specified time instead of running every
> > > 1 ms as it did before.
> > >
> > Yeah, well, I see what you mean and how this is actually succeeding
> > (at least AFAICT, by looking at the code, again, it would be nice to
> > have some numbers) in improving the scheduler behavior.
> >
> > However, this transition toward event driven-ness has two goals:
> >  * improve the scheduler behavior [check, at least to some extent]
> >  * improve the code (readability, maintainability, etc.)
> >    [not check at all :-(]
> 
> I see. We did consider the readability and maintainability factor in
> the design, but I think we just neglected the standards that "define" the
> rules of readability and maintainability. Is there some documentation
> that we could follow?
> 
There is not, and probably never will be... and that's fine, because
that is not something you write down in a document.

It's trial and error: following suit, and making (and looking at)
examples, like the analysis I made above.

> I see how you think these interdependencies should be handled (via Xen
> timers per VCPU in option b), but I didn't quite get the
> reason/principles why you think the current design is not good to
> handle such interdependencies. :-(
> 
Right. I hope this is clearer now.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

