
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



On Tue, 2015-06-09 at 22:54 -0700, Meng Xu wrote:
> Hi Dario and Dagaen,
> 
Hi,

> 2015-06-09 5:53 GMT-07:00 Dario Faggioli <dario.faggioli@xxxxxxxxxx>:

> > Thanks for doing this. The first question I have is whether you run any
> > test/benchmark that you can show the result of here.
> 
> Thank you very much Dario for your quick review!
> 
It wasn't as quick as I wanted it to be. I'll improve on that. :-)

> > For instance, a few weeks ago, Meng (while trying to do a completely
> > different thing) sent here on the list some numbers about the frequency
> > of the scheduler being called, and the overhead caused by that... Would
> > it be hard to run the same experiments and collect the numbers, with and
> > without your patch?
> 
> I think there should be two ways to evaluate the performance:
> 1) Using xentrace to trace the frequency of the scheduler being called
> and the overhead, when we run workload inside guest domains;
>
Yes, that counts more as a way of testing whether what is done here
actually works: we expect the scheduler to be invoked a lot less, and
those numbers would tell us whether or not that is the case.

So, please, provide these numbers (at least the number/frequency of
invocations; overhead measurements can come later) as soon as practical.

> 2) Using cyclictest as Dario mentioned before to test the real-time
> performance at end user. Dagaen, I can provide you the commands to run
> it, which is actually quite simple to run.
> 
It is indeed, and that's more similar to a proper performance evaluation
of an aspect which is rather critical for this scheduler, and really
important for people wanting to use it.

So, yes, seeing some results would be great, independently from the
specific work done in this patch.

> > On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> > > To do this, we create a new list that holds, for each
> > > vcpu, the time least into the future that it may need to be
> > > rescheduled.
> > >
> > Ok. Actually, what I really expected to see in this patch was, either:
> >  a) a new, scheduler wide, list of replenishment times _and_ a timer,
> >     always (re)programmed to fire at the most imminent one; or
> >  b) one timer for each vcpu, always (re)programmed to fire at the time
> >     instant when the replenishment for that vcpu should happen.
> 
> Ah...
> 
> Here what we are trying to do is:
> Reuse the existing timer that is used by rt_schedule().
>
Yeah, and what I can't figure out is why you decided to do so. The
reason I don't like it is that things become a lot (more) complex to
understand, maintain and modify.

>  We use  that
> timer as the timer to invoke the schedule function (rt_schedule). So
> that timer has two purposes:
> (a) budget enforcement as it was;
> (b) budget replenish (and deadline enforcement since deadline = period
> for each VCPU).
> 
Exactly, the timer has two purposes. Actually, it has three, as you're
saying yourself ("budget replenish (*and* deadline enforcement)"). It
should have exactly one. :-)

> When the timer fires, the rt_schedule is called: enforcement the
> budget, replenish the budget of VCPU(s), and pick the (next) VCPU.
> 
Right. OTOH, I think that all it should do is pick the next vcpu, if
necessary. In fact, the whole point of what Dagaen is doing is to
dramatically reduce the calls to rt_schedule(), when they are not
necessary. Replenishing the budgets of all the runnable vcpus is
not the core scheduling function's business, especially considering that
we're holding a global spinlock while doing it.

> In other words, we can think about the scheduler in another way:
> The scheduling decision happens at the following two time points:
> (a) budget enforcement as it was;
> (b) budget replenish and deadline enforcement;
> 
The scheduler is everything that's in sched_rt.c; the point here is how
things inside it should be organized.

> Whenever any of the above two situations occurs, the scheduler may
> need to pick the next VCPU to run or preempt a current running VCPU.
> Therefore, the logic for scheduling decision is unavoidable when
> either of these two situation occurs, IMO.
> 
A budget replenishment for a vcpu with a big period, in a busy system,
may well end up not requiring any immediate rescheduling of any of the
shorter-period vcpus, may it not?

So, with your approach, it is like this, when replenishment time for
vcpu X arrives:

  timer interrupt
    TIMER_SOFTIRQ raised
      process softirqs
        SCHEDULE_SOFTIRQ raised
          process softirqs
            rt_schedule()
              [spin_lock]
              burn_budget(scurr)
              __repl_update()  (goes through all runnable vcpus!!)
              snext = __runq_pick(): snext == scurr
              runq_tickle(snext)...WAIT, WHAT?!?! :-O

I mean, and I'm only noticing this now: if the replenishments done during
a particular call to rt_schedule() are not enough to change the situation
on that particular pcpu, and hence the vcpu which was running (and which
you are deliberately disturbing with _a_full_execution_ of the
scheduler's core function!) should continue to do so, then you're calling
runq_tickle() right on it. So, AFAICT, you're tickling scurr, not a
newly replenished vcpu!
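
In pseudo-C, if I'm reading the patch right, the tail of rt_schedule()
is effectively doing this (a paraphrase of the callchain above, not the
literal code):

  burn_budget(ops, scurr, now);
  __repl_update(ops, now);                    /* walks ALL runnable vcpus  */
  snext = __runq_pick(ops, cpumask_of(cpu));  /* may well be scurr again.. */
  runq_tickle(ops, snext);                    /* ..and then we tickle it!  */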

Am I missing or misreading something? Let's assume not, and see what
happens in this case...

Looking at runq_tickle(), if there are idle pcpus, you'll probably
tickle one of them, which will likely pick up the newly replenished
vcpu, like this (as a follow-up of the 'call chain' above):

  process softirqs
    SCHEDULE_SOFTIRQ
      rt_schedule()
        [spin_lock]
        __repl_update()  (goes through all runnable vcpus!! AGAIN!!)
        snext = runq_pick(): snext == vcpu X
        [spin_unlock]

If there are no idle pcpus, you probably won't tickle anyone, which is
also fine.

So, yay, it works!! Is this abuse of runq_tickle() there by design? :-D

Jokes apart, if the above analysis is accurate, I think this is a good
enough example of what I meant when saying to Dagaen "this is making
things too complex".

> In terms of performance/overhead, I think the option b) you pointed
> out has the benefit of low overhead in updating the budget because we
> don't need to hold the lock. However, the budget replenish occurs when
> deadline of the VCPU is changed (which means the next period of the
> VCPU arrives). Then it means the scheduler (may) need to make a new
> scheduling decision, in which situation, the scheduler will hold the
> lock for the runq. So I'm curious about the source of overhead the
> option b) could save over the current implementation/design Dagaen is
> doing.
> 
Leave my b) option alone, for now. With a) done as I'm suggesting, for
one, you'll be removing __repl_update() from rt_schedule(), which means
no longer going through the list of all runnable vcpus with the global
scheduler spin lock held, which really is something we should aim at for
this scheduler, sooner rather than later.

Here's how I envision things to go. Again, I'm talking about sticking
with option a), so no per-vcpu timers, just 1 timer and a global queue,
which now is a replenishment queue:

  timer interrupt
    TIMER_SOFTIRQ raised
      process softirqs
        replenishment_timer_handler()
          [spin_lock]
            <for_each_replenishment_event(repl_time < NOW()) {
               replenish(vcpu)
               runq_tickle(vcpu)
             }>
          [spin_unlock]
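
To make it a bit more concrete, here is a sketch, in C, of the kind of
handler I have in mind. Take it with a grain of salt: replq, repl_timer
and the __replq_*() helpers are names I'm making up for the new
replenishment queue machinery (they don't exist anywhere yet), while
rt_priv(), rt_update_deadline(), runq_tickle(), NOW() and set_timer()
are meant to be the existing helpers, or something very close to them:

  static void replenishment_timer_handler(void *data)
  {
      const struct scheduler *ops = data;
      struct rt_private *prv = rt_priv(ops);
      s_time_t now = NOW();
      struct rt_vcpu *svc;
      unsigned long flags;

      spin_lock_irqsave(&prv->lock, flags);

      /* In RTDS, the replenishment instant is cur_deadline (deadline ==
       * period), so serve every replenishment that is due by now. */
      while ( (svc = __replq_earliest(prv)) != NULL &&
              svc->cur_deadline <= now )
      {
          __replq_remove(prv, svc);
          rt_update_deadline(now, svc);  /* refill budget, push deadline */
          __replq_insert(prv, svc);      /* queue the next replenishment */
          runq_tickle(ops, svc);         /* poke a pcpu, if appropriate  */
      }

      /* Re-arm the timer for the next (earliest) replenishment, if any. */
      if ( (svc = __replq_earliest(prv)) != NULL )
          set_timer(&prv->repl_timer, svc->cur_deadline);

      spin_unlock_irqrestore(&prv->lock, flags);
  }

The timer itself would be init_timer()'d and first armed somewhere in
the init / vcpu insertion paths, but you get the idea.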

Then, on the tickled pcpus (if any):

  process softirqs
    SCHEDULE_SOFTIRQ
      rt_schedule()
        [spin_lock]
        snext = runq_pick(): snext == vcpu X
        [spin_unlock]

And get rid of __repl_update(), which makes my eyes bleed every time I
open sched_rt.c. :-)
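
And, to show what I mean by "get rid of", here is roughly how I picture
rt_schedule() ending up. Again, just a sketch: I'm ignoring the idle
vcpu, tasklet work and the requeueing of scurr, and the field names may
not be 100% accurate; the point is only that no replenishment related
code is left in here:

  static struct task_slice
  rt_schedule(const struct scheduler *ops, s_time_t now,
              bool_t tasklet_work_scheduled)
  {
      const unsigned int cpu = smp_processor_id();
      struct rt_vcpu *scurr = rt_vcpu(current);
      struct rt_vcpu *snext;
      struct task_slice ret = { .migrated = 0 };

      /* The global scheduler lock (which is also the pcpu schedule
       * lock) is assumed to be held already when we get here. */

      burn_budget(ops, scurr, now);

      /* No __repl_update() here: replenishments happen in their own
       * timer handler; this function only picks who runs next. */
      snext = __runq_pick(ops, cpumask_of(cpu));

      ret.task = snext->vcpu;
      ret.time = snext->cur_budget;  /* i.e., budget enforcement only */

      return ret;
  }

All the rest (comparing snext and scurr, setting 'migrated', etc.) would
stay pretty much as it is today.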

Oh, and note that we probably can use a new spin lock for the
replenishment queue, different from the existing global scheduler
spinlock, lowering contention and increasing scalability even further!
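
Data structure wise, what I'm thinking of is something like the below
(the new field names are, of course, just placeholders; and if we do
call runq_tickle() from the replenishment handler, we'll have to think
carefully about which lock protects what):

  struct rt_private {
      /* ... existing fields: global lock, sdom list, runq, depletedq ... */

      struct list_head replq;    /* replenishment events, sorted by time  */
      spinlock_t repl_lock;      /* protects replq only, not the runqueue */
      struct timer repl_timer;   /* armed for the earliest replenishment  */
  };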

> > >  The scheduler chooses the lowest time off of this
> > > list and waits until the specified time instead of running every
> > > 1 ms as it did before.
> > >
> > Yeah, well, I see what you mean and how this is actually succeeding
> > (at least AFAICT, by looking at the code, again, it would be nice to
> > have some numbers) in improving the scheduler behavior.
> >
> > However, this transition toward event driven-ness has two goals:
> >  * improve the scheduler behavior [check, at least to some extent]
> >  * improve the code (readability, maintainability, etc.)
> >    [not check at all :-(]
> 
> I see. We did consider the readability and maintainability factor in
> the design, but I think we just neglected the standards that "define" the
> rules of readability and maintainability. Is there some documentation
> that we could follow?
> 
There is not, and probably never will be... and that's fine, because
that is not something you write down in a document.

It's trial and error: following suit, and making (and looking at)
examples, like the analysis I made above.

> I see how you think these interdependencies should be handled (via Xen
> timers per VCPU in option b), but I didn't quite get the
> reason/principles why you think the current design is not good to
> handle such interdependencies. :-(
> 
Right. I hope this is clearer now.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

