
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



Hello Dagaen,

Thanks for doing this. The first question I have is whether you ran any
test/benchmark whose results you can show here.

For instance, a few weeks ago, Meng (while trying to do a completely
different thing) sent here on the list some numbers about the frequency
of the scheduler being called, and the overhead caused by that... Would
it be hard to run the same experiments and collect the numbers, with and
without your patch?

On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> To do this, we create a new list that holds, for each
> vcpu, the time least into the future that it may need to be
> rescheduled.
>
Ok. Actually, what I really expected to see in this patch was, either:
 a) a new, scheduler wide, list of replenishment times _and_ a timer,
    always (re)programmed to fire at the most imminent one; or
 b) one timer for each vcpu, always (re)programmed to fire at the time
    instant when the replenishment for that vcpu should happen.

And note that, when I say "timer", I mean an actual Xen timer, i.e.,
those things that are started, stopped, and with a timer handling
routine being called when they expire. For an example, you can have a
look, in sched_credit.c, at 'ticker' or 'master_ticker'.
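
Just to make this concrete, here's a rough sketch of what I have in mind
for a) (only the timer API, from xen/include/xen/timer.h, is real; the
names repl_timer and repl_handler are made up):

    /* A scheduler-wide replenishment timer, e.g. in rt_private: */
    struct rt_private {
        /* ... existing fields ... */
        struct timer repl_timer;    /* replenishment timer */
    };

    /* At scheduler initialization time: */
    init_timer(&prv->repl_timer, repl_handler, (void *)ops,
               smp_processor_id());

    /* Whenever the most imminent replenishment instant changes: */
    set_timer(&prv->repl_timer, next_replenishment_time);

When it fires, repl_handler() does the actual replenishment(s) and
re-arms the timer for the next event. For b), it would be one
struct timer per vcpu, programmed to fire at that vcpu's cur_deadline.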

Deciding between a) and b) isn't easy without actually trying to
implement them. I personally prefer b), and I think it would be a
superior long term solution (see right below), but given the current
architecture of the RTDS scheduler (e.g., the fact that it has a global
runq, etc.), I won't nack a), which would most likely be simpler.

Performance and overhead wise, again, it's hard to tell in advance. b) would
introduce more overhead (as there are more timers), but will likely
prove more scalable (which is not something of great importance for
this scheduler) and may allow for better latency (which _is_ something
of great importance for this scheduler :-) ), since there won't be the
need to lock the whole scheduler during the replenishments (which is,
OTOH, necessary with a)).

And that's it for the replenishments. For enforcing the budget, well, we
have the ret.time value of the task_slice struct returned by
rt_schedule, so that's event driven already, and we don't need to do
much about it.
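
I.e., something like this already happens (I'm sketching from memory,
so don't take it as the exact code):

    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        struct task_slice ret;

        /* ... burn budget, pick snext, etc. ... */

        ret.task = snext->vcpu;
        /* Make sure we are invoked again when snext's budget runs out: */
        ret.time = snext->cur_budget;

        return ret;
    }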

Does all this make sense? Am I making myself clear enough?
If not, feel free to ask.

>  The scheduler chooses the lowest time off of this
> list and waits until the specified time instead of running every
> 1 ms as it did before.
> 
Yeah, well, I see what you mean, and how this is actually succeeding
(at least AFAICT, by looking at the code; again, it would be nice to
have some numbers) in improving the scheduler's behavior.

However, this transition toward event driven-ness has two goals:
 * improve the scheduler behavior [check, at least to some extent]
 * improve the code (readability, maintainability, etc.)
   [no check at all :-(]

As an example, the __repl_update() function: it's called twice inside
rt_schedule() and a third time from rt_context_saved(), which is basically
like saying it's called 3 times from rt_schedule(), and this has always made
very little sense to me. In fact, I think that scheduling decisions and
replenishment events shouldn't be so tightly coupled. There are
interdependencies, of course (a replenishment could call for a
scheduling decision to be taken), but not like this. That's why I say
I'd like to see a timer handling routine dealing with replenishments, and
let rt_schedule() do its job, which is (see the sketch right after this list):
 - enforcing the budget of the running vcpu. If it expires,
   _(re)program_ the replenishment timer
 - choosing the best vcpu to run next, if necessary
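
Roughly (and this is just a sketch mixing existing helpers with made-up
names like repl_timer, not working code):

    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        /* ... */
        burn_budget(ops, scurr, now);           /* budget enforcement */

        if ( scurr->cur_budget <= 0 )
            /* Out of budget: replenishment is due at its cur_deadline */
            set_timer(&prv->repl_timer, scurr->cur_deadline);

        snext = __runq_pick(ops, /* ... */);    /* best vcpu to run next */

        ret.task = snext->vcpu;
        ret.time = snext->cur_budget;
        return ret;
    }

No __repl_update(), and no replenishment related work at all, in here.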

With this patch, you're still using rt_schedule() to do both scheduling
and replenishments, although you're now doing it at actual replenishment
times, instead of checking for them every millisecond.

Also, the bits and pieces that you need to add in order to deal with
this new queue are, really, making things even more complex than they are
now, which is undesirable.

So, all this being said, let me know what you think about it (and that
applies to everyone else as well, of course, in particular Meng).

I won't comment on the code in too much detail, as it will require some
quite radical restructuring, but I'll skim through it and provide some
hints anyway.

> Signed-off-by: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>
> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
> ---
>  xen/common/sched_rt.c |  319 
> ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 222 insertions(+), 97 deletions(-)
> 
First of all, it's a big patch. It's only changing one file and one
logical component, and for that reason, TBH, it's not that hard to
review. Still, I think you can break it into at least two patches, and
perhaps even more, if you try to implement what I described above.

> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 7c39a9e..25f0458 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c

> @@ -156,6 +160,7 @@ struct rt_vcpu {
>      s_time_t cur_budget;        /* current budget */
>      s_time_t last_start;        /* last start time */
>      s_time_t cur_deadline;      /* current deadline for EDF */
> +    s_time_t next_sched_needed; /* next time to make scheduling decision */
>
As an example of why I said that things should become simpler, and are
instead becoming more complex: with my solution, you don't need anything
like this. In fact, budget enforcement is already being done ok in
rt_schedule(), so no need to do anything more/different about it.
Replenishments should be programmed to fire at cur_deadline, so again,
no need to add this new field, and multiplex its actual value between
budget and deadline (which, yes, counts as being something rather
complex for me! :-D).

> @@ -361,6 +387,12 @@ __q_remove(struct rt_vcpu *svc)
>          list_del_init(&svc->q_elem);
>  }
>  
> +static inline void __t_remove(struct rt_vcpu *svc)
> +{
> +     if( !list_empty(&svc->t_elem) )
> +             list_del_init(&svc->t_elem);
> +}
> +
>
You're using hard tabs for indentation here. As you can see everywhere
else, Xen wants 4 spaces for this.

>  /*
>   * Insert svc with budget in RunQ according to EDF:
>   * vcpus with smaller deadlines go first.
> @@ -395,6 +427,72 @@ __runq_insert(const struct scheduler *ops, struct 
> rt_vcpu *svc)
>  }
>  
>  /*
> + * Insert svc into the timerq, maintaining ascending order by 
> next_sched_needed.
> + */
> +static void __timerq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter = timerq;
> +
> +    ASSERT( spin_is_locked(&prv->lock) );
> +
> +    ASSERT( list_empty(&svc->t_elem) );
> +
The blank line between the asserts, I'd kill it.

> +     list_for_each(iter, timerq)
> +     {
>
You're still using tabs, and mixing them with spaces, making things look
even more cumbersome.

> +/*
> + * Return how far the lowest time on the timerq (that is after NOW) is in the
> + * future.
> + * Lock must be grabbed before calling this.
> + */
> +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter, *tmp;
> +
> +    list_for_each_safe(iter, tmp, timerq)
> +    {
> +        struct rt_vcpu * iter_svc = __t_elem(iter);
> +
> +        if ( iter_svc->next_sched_needed > now )
> +            return (iter_svc->next_sched_needed - now);
> +        else
> +            __t_remove(iter_svc);
> +    }
> +
> +    return MAX_SCHEDULE;
> +}
>
If using a timer, you can just get rid of the items as you handle, in the
timer routine, the events associated with them.

Also, why is MAX_SCHEDULE still there?
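
With a real timer, the handler itself would be the one consuming the
queue, more or less like this (again just a sketch; repl_handler,
rt_replq and the __replq_* helpers are made up, while rt_update_deadline()
and runq_tickle() already exist):

    static void repl_handler(void *data)
    {
        const struct scheduler *ops = data;
        struct rt_private *prv = rt_priv(ops);
        s_time_t now = NOW();
        struct list_head *replq = rt_replq(ops);
        struct list_head *iter, *tmp;

        spin_lock_irq(&prv->lock);

        list_for_each_safe ( iter, tmp, replq )
        {
            struct rt_vcpu *svc = __replq_elem(iter);

            if ( svc->cur_deadline > now )
                break;                    /* the queue is kept sorted */

            __replq_remove(svc);
            rt_update_deadline(now, svc); /* new deadline, full budget */
            __replq_insert(ops, svc);     /* back in, at its new position */
            runq_tickle(ops, svc);        /* may trigger a scheduling decision */
        }

        /* Re-arm for the next replenishment event, if there is one. */
        if ( !list_empty(replq) )
            set_timer(&prv->repl_timer,
                      __replq_elem(replq->next)->cur_deadline);

        spin_unlock_irq(&prv->lock);
    }

This way, there's no MAX_SCHEDULE, and no expired entries left around
to skip over (or lazily remove) during scheduling.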

> +/*
> + * Updates the next_sched_needed field for a vcpu, which is used for
> + * determining the next needed schedule time for this vcpu. Then updates
> + * timerq via reinsert.
> + */
> +static void update_sched_time(const struct scheduler *ops, s_time_t now, 
> +                              struct rt_vcpu *svc)
> +{
> +    /* update next needed schedule time */
> +    if( test_bit(__RTDS_scheduled, &svc->flags) )
> +        svc->next_sched_needed = now + svc->cur_budget;
> +    else
> +        svc->next_sched_needed = svc->cur_deadline;
> +
> +    /* Remove and reinsert to maintain sorted order in timerq */
> +    __t_remove(svc);
> +    __timerq_insert(ops, svc);
> +
> +    return;
> +}
>
And here's the multiplexing, which --even worse-- (may) require
rearranging the queue! We really don't need anything like this.

>  /*
> + * Pick a cpu where to run a vcpu,
> + * possibly kicking out the vcpu running there
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy;
> + *    among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct rt_vcpu *latest_deadline_vcpu = NULL; /* lowest priority */
> +    struct rt_vcpu *iter_svc;
> +    struct vcpu *iter_vc;
> +    int cpu = 0, cpu_to_tickle = 0;
> +    cpumask_t not_tickled;
> +    cpumask_t *online;
> +
>
[snip]

And here you are moving a function up. In general, it is better to have
separate patches that do the moving, with the fact that they are all
about moving, and that the code is not being changed, clearly stated in
the commit message. This is because it is really, really hard to figure
out, e.g. from here, whether you also changed something in the function
while moving it, which makes reviewing a lot harder and more prone to error.

That being said, in this specific case, you're moving up runq_tickle(),
the purpose of which is to trigger a call to rt_schedule() (after going
through the softirq machinery)... in order to call it in rt_schedule().
That's pretty tricky, and is not how tickling should work.

Again, with an approach that really takes advantage of timers, this won't
be necessary.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
