
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



> Thanks for this actually... I love discussing these things; it reminds me
> of the time when I was doing this stuff myself, and makes me feel
> young! :-P

And thank you for the very detailed and well-thought response!

>
>> Separating the replenishment from the scheduler may be problematic. The 
>> nature
>> of the gEDF scheduler is that when a priority changes it should be instantly
>> reflected.
>>
> 'instantly'. Is there such a thing? :-O

I meant that there is no accounting period, and that a violation may happen
implicitly while the scheduler functions themselves are running; but from
your reply I know you understand this :P

>
>> If replenishments were done by themselves, then the scheduler may
>> decide to wait for some period of time, and in this same time period a
>> replenishment
>> could change what should be executing.
>>
> Well, the scheduler then would better *not* decide to wait for any
> period of time. It better act like this: as soon as a replenishment
> happens, and a new deadline is assigned to a vCPU, put the vCPU in the
> runqueue, in the proper position; and if such a newly replenished vCPU
> should now preempt any other, currently running, vCPU, 'tickle' the pCPU
> in question and let it reschedule, and pick up the newly replenished
> vCPU.

Yes, this is an option. However, I thought this would actually be an option
you would not like. For the replenishment routine to know when to call the
scheduler and when not to, it basically has to perform the scheduler's
logic, at which point, if there is a change... why doesn't it just do it
(i.e., be the scheduler) instead of adding logic, and possibly overhead, to
call it?
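
Just so we are quantifying the same thing: the piece of scheduler logic in
question is essentially a preemption check against what is currently running.
A minimal sketch, in the style of sched_rt.c (this is not the actual patch
code; assume the caller already holds prv->lock, and forgive any slips in the
helper names):

static void replenish_and_maybe_tickle(const struct scheduler *ops,
                                       struct rt_vcpu *svc)
{
    struct rt_private *prv = rt_priv(ops);
    int cpu;

    /* Re-queue svc according to its new (post-replenishment) deadline. */
    __runq_insert(ops, svc);

    /* EDF check: does svc now preempt anything currently running? */
    for_each_cpu ( cpu, &prv->cpus )
    {
        struct vcpu *cur = curr_on_cpu(cpu);

        if ( is_idle_vcpu(cur) ||
             svc->cur_deadline < rt_vcpu(cur)->cur_deadline )
        {
            /* Ask just that pCPU to go through the scheduler. */
            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
            break;
        }
    }
}

This is more or less what runq_tickle() already does internally, which is
exactly my point about duplicating the scheduler's decision logic.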

>
>> Either the gEDF policy will be
>> violated or
>> the replenishment routine would have to notify the scheduler after any
>> replenishment, requiring some form of interaction and possibly more scheduler
>> invocations than the current structure.
>>
> Obviously not. In fact, it's your 'current structure' that already
> invokes the scheduler for any replenishment. Doing more than that would
> be really hard (impossible, I should say).
>
> In fact, it's the s_timer that you're making fire for any replenishment,
> and replenishments happen inside rt_schedule() itself. This is
> _the_definition_ of invoking the scheduler!
>
> OTOH, by using a (some) replenishment timer(s), the scheduler *may*
> *potentially* be invoked for any replenishment, yes, but it may also
> very well not. In fact, if even after a replenishment, the new deadline
> of the replenished vCPU is farther than the deadlines of all the
> currently running vCPUs, there's no need to tickle any pCPU, no need to
> call the scheduler.
>
> BTW, from what you say, it seems to me that we are in agreement: we
> _do_not_ want to call the scheduler upon any replenishment. Do as I say
> and we'll achieve that.

I'm glad we are in agreement on that :)
Also, I'd like to point out that when I said things like "make more intelligent"
decision handling, I meant exactly things like the above: using as much
information as possible to prevent scheduler invocations. The current patch
doesn't reschedule on every replenishment. For one thing, it doesn't even look
at the run queue: replenishments here can only happen on a miss.

>
>> The other option is for the scheduler to
>> check for replenishments, as it does now.
>>
> Yeah, the wrong option. :-D :-D
>
>> Without any interaction a violation of
>> the policy could ensue. The gEDF is not like the credit scheduler, where
>> there is
>> an accounting period during which the policy may be violated
>> temporarily.
>>
> Yep, I do have an idea of what EDF is.
>
>> Further, with two timer routines we now need to deal with any ordering
>> or preemption
>> between them (or logic to prevent / order such), which I could argue is far
>> more
>> complex than having them done at once, as it is now.
>>
> Mmm... I think I'm starting to lose you. Ordering between two timers? I
> don't see what you mean.
>
> Allow me, though, to walk through the behavior of your current approach.
> I've done this already in a previous email, but no big deal re-doing it.
> Oh, actually, the current approach is buggy, as you tickle the wrong
> vCPU, but let's forget about that (let's assume it's fixed).
>
> Let's say that time for a replenishment to vCPU w has come.
>
>  on pCPU X:
>  . timer interrupt
>  .   TIMER_SOFTIRQ raised
>  .     process softirqs
>  .       SCHEDULE_SOFTIRQ raised
>  .         process softirqs
>  .           rt_schedule()
>  .             [spin_lock]
>  .             burn_budget(scurr)
>  .             __repl_update()  (goes through all runnable vcpus!!)
>  .             snext = __runq_pick() -> says vCPU w should run on pCPU Y
>  .             runq_tickle(snext) ----> tickle pCPU Y [*]
>  .             [spin_unlock]
>  .
>  on pCPU Y:
>  . process softirqs
>  .   SCHEDULE_SOFTIRQ
>  .     rt_schedule()
>  .       [spin_lock]
>  .       burn_budget(scurr)
>  .       __repl_update()  (goes through all runnable vcpus!! AGAIN!!)
>  .       snext = runq_pick() ----------> pick vCPU w
>  .       [spin_unlock]
>  .
>  context switch: vCPU w now runs on pCPU Y
>
> So, you forced pCPU X through a full execution of rt_schedule(), with
> the spinlock, the scan of runqueue and depleted queue, and everything,
> just for the sake of figuring out that pCPU Y should reschedule. Then
> you (as soon as practical) actually reschedule on pCPU Y and context
> switch in favour of vCPU w, enacting EDF.
>
> That's 2 scheduler invocations, and rt_schedule() is still ugly and
> complex to read, maintain and improve.
>
> [*] and don't forget that this needs fixing, because right now it's just
> incorrect, and (just speculating) making runq_pick() behave in the way
> you want, may not be super-simple.
>
> Oh, I was almost forgetting! Wonder what happens in the time interval
> between when __repl_update() is called (from inside rt_schedule()) on
> pCPU X, and the actual time instant of the context switch on pCPU Y?
> Yeah, right, what happens is that you're violating EDF. :-/
>
> This can be accounted as a form of overhead introduced by the
> implementation, I guess. In this specific case, it's due to the  fact
> that you can't, from pCPU X, just change *instantly* what's running on
> pCPU Y. No, you've got to send an IPI, then Y has to react and it has to
> schedule, etc. This is of course dependent on the architecture of the
> scheduler. Xen's scheduler works like this. FWIW, Linux's scheduler, wrt
> this, also does.
>
> So, see? Transient  violation of EDF is just there, no matter the
> approach!

Of course, a transient violation while the scheduler runs is possible in any implementation :(

>
> Let's look at the timers way:
>
>  on pCPU X:
>  . timer interrupt
>  .   TIMER_SOFTIRQ raised
>  .     process softirqs
>  .       replenishment_timer_handler()
>  .         [spin_lock]
>  .           <for_each_depleted_vcpu(i, i.repl_time < NOW()) {
>  .              replenish(i) ---------> replenish vCPU w
>  .              runq_tickle(i) -------> tickle pCPU Y
>  .            }>
>  .         [spin_unlock]
>  .
>  on pCPU Y:
>  . process softirqs
>  .   SCHEDULE_SOFTIRQ
>  .     rt_schedule()
>  .       [spin_lock]
>  .       burn_budget(scurr)
>  .       snext = runq_pick() ----------> pick vCPU w
>  .       [spin_unlock]
>  .
>  context switch: vCPU w now runs on pCPU Y
>
> So this means only 1 invocation of the scheduler, and only on the pCPU
> that actually needs to reschedule. There's some overhead due to the
> timer, of course, but, notably, this allows for shorter critical
> sections, not to mention better looking and easier to maintain code.
>
> Are we still violating EDF? Of course we are, between the replenishment
> and context switch time, as above. That's unavoidable, I'm afraid.
>
> Summarizing, the two solutions are on par wrt temporary violation of the
> theoretical algorithm, but the timer based approach has loads of other
> benefits, so let's go with timers. :-)
>
>> If both are armed for the same time, which should execute? Scheduler
>> first before
>> a possibly applicable replenishment? Or replenishment always first?
>> Both with added
>> logic to enforce this and prevent preemption, of course.
>>
> I really don't see what you mean here. There won't be anything like that
> to take care of. Xen offers timers as an abstraction, and deals with
> them itself. You only need to take care of properly serializing
> rt_schedule() and the timer handling routine, for the code sections that
> require that.

This is good to know, thanks. The question would then change to how Xen
handles such an occurrence, but I think this is irrelevant now that I know
which "side" you had in mind - that is, that the replenishment handler would
decide when the scheduler should be invoked.
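
For reference, and to make sure we mean the same wiring, here is roughly what
I would expect this to look like with Xen's timer abstraction. It is only a
sketch: the repl_timer field and next_replenishment_time() are made up for
illustration, while init_timer(), set_timer(), rt_priv() and prv->lock are the
existing primitives.

/*
 * Set up once, e.g. from rt_init():
 *   init_timer(&prv->repl_timer, repl_timer_handler, (void *)ops, cpu);
 *   set_timer(&prv->repl_timer, <time of the first replenishment>);
 */
static void repl_timer_handler(void *data)
{
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    unsigned long flags;

    /* The same lock rt_schedule() runs under, so the two are serialized. */
    spin_lock_irqsave(&prv->lock, flags);

    /* ... replenish whatever is due, tickle pCPUs where needed ... */

    /* Re-arm for the next earliest pending replenishment. */
    set_timer(&prv->repl_timer, next_replenishment_time(prv));

    spin_unlock_irqrestore(&prv->lock, flags);
}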

>
>> Due to this, it is my belief that by the nature of the gEDF policy the
>> current solution is
>> better, mostly because it appears to actually be the least complex way that 
>> is
>> functionally correct.
>>
> Not really, no. I wouldn't be convinced of this, even if what you have
> right now were functionally correct.
>
>> The gEDF policy just isn't well suited for
>> decoupling events, as
>> the events are highly dependent on one another, and particularly dependent 
>> at an
>> instantaneous timescale.
>>
> And here it comes, 'instantaneous timescale', again.
>
> No, sorry, but I don't buy this. In fact, why does it not apply to
> vCPUs' wakeups? I think it does, conceptually... So, should we move the
> wakeup logic inside rt_schedule() too?
>
>> It would be a hard pitch for a gEDF scheduler with a
>> similar "accounting period" where the gEDF policy could be violated.
>>
> There's no accounting period of any sort. It's called overhead!

This part was referring to the case where we have two timers without extensive
communication, cooperation, or information/logic sharing between them.

>
> What we need to do is to try to keep it as small as possible. And I'm
> quite sure that an effective way to do so is to keep critical sections
> short, especially when protected by (potentially) highly contented spin
> locks. RTDS, currently, suffer from poor scalability (and, I suspect,
> poor latency as well, at least under pressure), and one of the reasons
> why the work you're doing is interesting is that it can alleviate
> this... if done with that in mind, of course.
>
>> That is
>> blasphemy in the real-time world.
>>
> Blasphemy, that's a nice one! :-D :-D
>
> Well, I've never been good at religions, I have to admit. So, yes, I can
> live with being called blasphemous, I guess. :-)

You get the compliment of blasphemy, I get that of "nice hack" :)

>
>> It's also worth noting that the stereotypical textbook event-driven
>> model is as we have now: a single event loop that handles events. In this
>> case the scheduler is conceptually the main loop, and this makes perfect
>> sense: it's deciding what to run (think of the running VCPUs as callback
>> functions and the abstraction falls into place perfectly). The event loop
>> needs some information (about replenishments) before deciding, and
>> collecting this would be part of the decode and decision phase for an
>> event signal.
>>
> Talking about 'stereotypes', I don't have any textbook describing an
> event-driven model at hand right now. However, you RT-Xen people like
> and use LitmusRT, don't you? Well, you're doing the right thing, because
> it's a great piece of software!
>
> If you're interested, have a look in there. Spoiler: you'll find a lot
> of timers. :-D :-D
>
> More seriously, it's of course rather different, as Linux is not Xen,
> but since you're looking for stereotypes:
>
>  - Linux (or at least LitmusRT patch 2014.2, i.e., what I'm looking at
>    right now) has ticks. This means a timer fires every X milliseconds
>    (on every CPU), and task_tick_litmus() is run as a consequence. That
>    function does not (at all!) invoke the scheduler directly; it rather
>    just checks whether doing so is necessary, in this case
>    because some task has exhausted its budget. If so, it
>    calls litmus_reschedule_local() (see below). I know, this is budget
>    enforcement, not replenishment, but I think it works as a good
>    example of my point about using timers.
>
>  - Actually, budget enforcement can be done, in LitmusRT, in two ways.
>    One is tick based (described above), the other, called 'precise', is
>    timer based. To see that in action, check how the enforcement timer
>    is handled, e.g., in update_enforcement_timer(). Look at
>    on_enforcement_timeout(), and find that it also _does_not_ schedule.
>    It again just asks for the scheduler to be invoked as soon as
>    practical, via (again) litmus_reschedule_local().
>
>  - Finally, a 'job release', in LitmusRT terminology, is probably what
>    is closest to a budget replenishment for us here. In fact, when a
>    new RT job is released, it's given full budget, and its scheduling
>    parameters (deadline, in most cases) are updated accordingly, like it
>    happens, for us, with replenishments.
>    Check, therefore, how that happens in add_release() and
>    __add_release(). You'll find a call to arm_release_timer() which
>    calls, when firing, on_release_timer(). From there, go, e.g., to
>    gsnedf_release_jobs(), and follow the call chain through
>    check_for_preemptions() --> preempt() --> preempt_if_preemptable()
>    --> litmus_reschedule(), which then calls either
>    set_tsk_need_resched(current) or smp_send_reschedule(cpu).
>
> litmus_reschedule_local(), and similar functions, do something very
> similar to our tickling, i.e., they ask a specific CPU to go through the
> scheduler. When this happens via set_tsk_need_resched(current) it's
> like, in Xen, tickling the pCPU we're on. When it happens via
> smp_send_reschedule(cpu) it's like, in Xen, tickling the remote pCPU
> cpu.
>
> Now, although some years have passed, I've interacted quite a bit with
> LitmusRT folks. They're very smart, and usually very good at what they
> do, and I'd be surprised to find them guilty of that much real-time
> blasphemy. :-P
>
> You also may know that there is an EDF implementation in mainline Linux,
> and you may want to go have a look there too. I'm not including
> any example from there because I'm one of the main authors of it, so
> it's pretty obvious that it uses the approach I think is best (and,
> more important, it would be rather inelegant to cite myself! :-D).
>
> Anyway, if you fancy doing that, the core is in kernel/sched/deadline.c.

Thanks for the history and background lesson here :)

>
>> Not only is there a complexity issue, but as mentioned before this may be the
>> best performing option, making the most information available and the
>> best timing
>> decision possible. Adding a few extra cycles to a hot path, even with
>> a lock being
>> held, is worth it if the scheduler and context switching are done less often.
>>
> Right, so let's use timers: they require fewer or equal scheduler
> invocations wrt... always invoking the scheduler!! (Not sure why you
> mention the number of context switches; that's independent of which of
> the two approaches you choose.)
>
>> For the above reasons I think you should reconsider the current
>> implementation, as it
>> appears it may be the least complex and easiest to reason about
>> solution.
>>
> Well, no, sorry. :-(
>
> I mean, I appreciate that what you have now (once fixed) may be
> effective in avoiding some unnecessary invocations of the scheduler, with
> respect to what we currently have in the repo, so I'm not saying it
> can't be seen as an improvement. It's a "nice hack", actually.

"nice hack." heh, heh, thanks :P

>
> However, if we're still aiming at making this scheduler a first class
> citizen within Xen, we're interested in fewer hacks, rather than in more
> --no matter how nice they are. So, with that in mind, it would be a step
> in the wrong direction, IMO.
>
>> Let me know if I'm missing some key insight into how the behavior
>> could be implemented
>> correctly and beautifully using the multiple timer approach.
>>
> well, I tried. I really did.
>
>> I simply
>> don't see how it can
>> be done without heavy interaction and information sharing between them
>> which really
>> defeats the purpose.
>>
> No, it doesn't.

OK, this last line is the zinger.

Almost this entire email was built on the premise that you would NOT like the
idea of the replenishment handler having basically the same decision logic as
the scheduler and then tickling the pCPU to pick up the new vCPU. Actually,
with it done this way, I would have a hard time calling the tickle-invoked
method the "scheduler." It would be more like the mover, with the
replenishment function being essentially the scheduler. In this case, I'm not
too sure how different this would actually be from what we have now. It feels
like, from what you want, we could get the same behavior by modifying
rt_schedule to do replenishments first, then check whether a reschedule is
needed (these two parts would be in this proposed replenishment timer
routine), and then perform any move if necessary. I know this isn't exactly
what you want, but it sounds close, right?
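
To make that concrete, the restructuring I mean would give rt_schedule()
roughly this shape. It is only a sketch: do_replenishments() and
pick_next_vcpu() are placeholder names, idle/error handling is elided, and
burn_budget() and the task_slice fields are the existing ones.

static struct task_slice rt_schedule(const struct scheduler *ops,
                                     s_time_t now, bool_t tasklet_work)
{
    struct rt_private *prv = rt_priv(ops);
    struct rt_vcpu *scurr = rt_vcpu(current);
    struct rt_vcpu *snext;
    struct task_slice ret = { .migrated = 0 };

    spin_lock(&prv->lock);

    burn_budget(ops, scurr, now);
    do_replenishments(ops, now);    /* 1. all due replenishments first */
    snext = pick_next_vcpu(ops);    /* 2. is a switch actually needed? */

    spin_unlock(&prv->lock);

    ret.task = snext->vcpu;         /* 3. the "move", if there is one  */
    ret.time = snext->budget;
    return ret;
}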

But with the separate replenishment handler, the scheduler will still have to
decide which vCPU to move in, so that logic is done twice. Also, if these are
done back-to-back and both require the lock, isn't it really the same as
having one long hot path? If you want maintainability, couldn't we just do
the replenishment, then schedule and move (if necessary), in one timer (the
one we have now) and split those steps into functions? It seems this can be
done neatly with one timer.

So here's my proposal, let's see if it fits what you want:
1.) The scheduler does not do any timing.
2.) Replenishments are scheduled via a timer armed for each [next]
replenishment period. Each replenishment re-sorts the replenished vCPUs,
and if any of the first #CPUs entries in the runq change, we tickle a pCPU
for each change.

In this case, we can use one timer. We could use the current one as the
replenishment timer, and make it so rt_schedule isn't the default invoked
method. Roughly, the handler would look like the sketch below.
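
Filling in the body of the handler sketched earlier (again illustrative only:
runq_position(), nr_pcpus(), next_replenishment_time() and the repl_timer
field are invented names, while rt_update_deadline(), __q_remove(),
__runq_insert() and runq_tickle() are the existing sched_rt.c helpers):

static void repl_timer_handler(void *data)
{
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    struct rt_vcpu *svc, *tmp;
    s_time_t now = NOW();
    unsigned long flags;

    spin_lock_irqsave(&prv->lock, flags);

    /* 1. Replenish every depleted vCPU whose replenishment time has passed. */
    list_for_each_entry_safe ( svc, tmp, &prv->depletedq, q_elem )
    {
        if ( now < svc->cur_deadline )
            continue;

        rt_update_deadline(now, svc);   /* new deadline + full budget */
        __q_remove(svc);
        __runq_insert(ops, svc);        /* re-sort into the runq      */

        /* 2. Tickle only if svc lands among the first #CPUs entries. */
        if ( runq_position(ops, svc) < nr_pcpus(ops) )
            runq_tickle(ops, svc);
    }

    /* 3. Re-arm the single timer for the next earliest replenishment. */
    set_timer(&prv->repl_timer, next_replenishment_time(prv));

    spin_unlock_irqrestore(&prv->lock, flags);
}

This keeps all the locking in one short critical section and only tickles the
pCPUs that actually need to reschedule.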

Does this sound similar to what you are suggesting? I have to ask because I
thought you wouldn't like the idea, and it's not really *that* far off from
what we have now. It's a little restructuring so that replenishments occur
before any scheduling activity, and the handler checks whether switching is
needed (basically acting as the scheduler) and then calls tickle. Sounds like
what you had in mind?

Regards,
~Dagaen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

