
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



Thanks for the feedback.


Thanks for doing this. The first question I have is whether you ran any
test/benchmark whose results you can show here.

For instance, a few weeks ago, Meng (while trying to do a completely
different thing) sent here on the list some numbers about the frequency
of the scheduler being called, and the overhead caused by that... Would
it be hard to run the same experiments and collect the numbers, with and
without your patch?

I could run some experiments to measure this. An easier quick test may be to count invocations while running the same workload on each version, making sure each runs for a long time to remove biases.
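For instance (just a sketch, assuming a perfc=y build; the counter name below is made up and not something already in the tree):

    /* In xen/include/xen/perfc_defn.h: */
    PERFCOUNTER(rtds_schedule_calls, "rtds: rt_schedule() invocations")

    /* At the top of rt_schedule() in xen/common/sched_rt.c: */
    perfc_incr(rtds_schedule_calls);

Reading the counters back after each run (e.g. with xenperf from dom0) would then give the invocation counts with and without the patch.
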
On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> To do this, we create a new list that holds, for each
> vcpu, the time least into the future that it may need to be
> rescheduled.
>
Ok. Actually, what I really expected to see in this patch was either:
 a) a new, scheduler-wide, list of replenishment times _and_ a timer,
    always (re)programmed to fire at the most imminent one; or
 b) one timer for each vcpu, always (re)programmed to fire at the time
    instant when the replenishment for that vcpu should happen.

a) is essentially what we have going on here. The list is exactly that: a list of possible scheduling-change times (including replenishments, as they are coalesced with the period), one per vcpu, and the timer is armed using the earliest one.

And note that, when I say "timer", I mean an actual Xen timer, i.e.,
those things that are started, stopped, and with a timer handling
routine being called when they expire. For an example, you can have a
look, in sched_credit.c, at 'ticker' or 'master_ticker'.
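For concreteness, the timer API looks roughly like this (just a sketch with made-up names, to show the shape of the thing rather than anything from sched_credit.c or from this patch):

    #include <xen/timer.h>
    #include <xen/time.h>
    #include <xen/smp.h>

    /* One scheduler-wide timer, as in option a); everything named repl_*
     * here is invented for illustration. */
    static struct timer repl_timer;

    /* The handler runs in softirq context when the programmed instant arrives. */
    static void repl_timer_handler(void *data)
    {
        /* ... do the replenishments using 'data' (e.g. the scheduler's
         * private structure), tickle pcpus if needed ... */

        /* Re-arm at the next most imminent replenishment (value made up here). */
        set_timer(&repl_timer, NOW() + MILLISECS(1));
    }

    /* Done once, e.g. from the scheduler's init hook. */
    static void repl_timer_setup(void *priv)
    {
        init_timer(&repl_timer, repl_timer_handler, priv, smp_processor_id());
        set_timer(&repl_timer, NOW() + MILLISECS(1));
    }

Cancellation and teardown go through stop_timer()/kill_timer().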

I will look at this. However, the current solution is likely functionally equivalent and, with only one timer and a single list used to arm it, I'm not sure if using a Xen timer is necessary. Do they incur any extra overhead or coarsen granularity?

Deciding between a) or b) isn't easy, without actually trying to
implement them. I personally prefer b), and I think it would be a
superior long-term solution (see right below), but given the current
architecture of the RTDS scheduler (e.g., the fact that it has a global
runq, etc), I won't nack a), which would most likely be simpler.

I agree b) may well be better in the long run, but with the current architecture a) is simpler. b) can be future work as the scheduler evolves.

Performance and overhead wise, again, hard to tell in advance. b) would
introduce more overhead (as there are more timers), but will likely
prove more scalable (which is not something of great importance for
this scheduler) and may allow for better latency (which _is_ something
of great importance for this scheduler :-) ), since there won't be the
need to lock the whole scheduler during the replenishments (which is,
OTOH, necessary with a)).

And that's it for the replenishments. For enforcing the budget, well, we
have the ret.time value of the task_slice struct returned by
rt_schedule, so that's event driven already, and we don't need to do
much about it.

The timerq will also hold the end of the period when that is the next relevant event, and when such an entry is at the head of the timerq its value is used to arm the timer, letting the task run until budget exhaustion.
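Put differently, the intent is that the tail of rt_schedule() gets armed from the head of the timerq, roughly like this (a sketch of the idea, not the literal hunk from the patch):

    /* End of rt_schedule(): instead of the old fixed 1 ms tick, program the
     * next invocation from the timerq.  __timerq_next() returns the delta to
     * the earliest next_sched_needed value (budget exhaustion for a running
     * vcpu, deadline/period boundary otherwise), bounded by MAX_SCHEDULE. */
    ret.time = __timerq_next(ops, now);
    ret.task = snext->vcpu;
    return ret;

So budget enforcement and replenishment/period events are all funnelled through the same queue.
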
Does all this make sense? Am I making myself clear enough?
If not, feel free to ask.

>  The scheduler chooses the lowest time off of this
> list and waits until the specified time instead of running every
> 1 ms as it did before.
>
Yeah, well, I see what you mean and how this is actually succeeding
(at least AFAICT, by looking at the code, again, it would be nice to
have some numbers) in improving the scheduler behavior.

However, this transition toward event driven-ness has two goals:
 * improve the scheduler behavior [check, at least to some extent]
 * improve the code (readability, maintainability, etc.)
   [not check at all :-(]

As an example, the __repl_update() function: it's called 2 times inside
rt_schedule() and a third from rt_context_saved(), which is basically
like saying it's called 3 times from rt_schedule(), and this always made
very little sense to me. In fact, I think that scheduling decisions and
replenishment events shouldn't be so tightly coupled. There are
interdependencies, of course (a replenishment could call for a
scheduling decision to be taken), but not like this. That's why I say
I'd like to see a timer handling routine dealing with replenishment, and
let rt_schedule() do its job, which is:
 - enforcing the budget of the running vcpu. If it expires,
   _(re)program_ the replenishment timer
 - choose the best vcpu to run next, if necessary

With this patch, you're still using rt_schedule() to do both scheduling
and replenishments, although you're now doing it at actual replenishment
times, instead of checking for them every millisecond.
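For concreteness, the kind of split being described might look roughly like this (a sketch; the replq_* helpers and the prv->repl_timer field are invented for illustration, while rt_priv(), rt_update_deadline() and runq_tickle() already exist in sched_rt.c):

    /* Replenishment path: a dedicated timer handler, decoupled from picking
     * who runs next (beyond tickling when a replenishment changes the EDF
     * ordering). */
    static void repl_timer_handler(void *data)
    {
        const struct scheduler *ops = data;
        struct rt_private *prv = rt_priv(ops);
        s_time_t now = NOW();
        struct rt_vcpu *svc;

        spin_lock_irq(&prv->lock);

        /* Replenish every vcpu whose current deadline has passed... */
        while ( (svc = replq_head(prv)) != NULL && svc->cur_deadline <= now )
        {
            rt_update_deadline(now, svc); /* refill budget, push the deadline */
            replq_reinsert(prv, svc);     /* hypothetical: re-sort by new deadline */
            runq_tickle(ops, svc);        /* a pcpu may now need to reschedule */
        }

        /* ... then re-arm at the now-earliest replenishment instant. */
        if ( (svc = replq_head(prv)) != NULL )
            set_timer(&prv->repl_timer, svc->cur_deadline);

        spin_unlock_irq(&prv->lock);
    }

With something like that in place, rt_schedule() is left with exactly the two bullets above: enforce the budget of the running vcpu and, if necessary, pick the next one.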


I think once you understand that the timerq holds not only replenishments, but any event that could cause a change in the scheduler's behavior, it becomes more palatable to have them intertwined. Really, the timerq and scheduling aren't as intertwined as they appear; they are practically isolated, functionally. It's just that the timerq is most naturally serviced during scheduling events, when the data that affects scheduling decisions is changing. It's straightforward and efficient. Let me know your thoughts on this.

Also, the bits and pieces that you need to add in order to deal with
this new queue are, really, making things even more complex than they are
now, which is undesirable.

While it does add some complexity, I don't feel that a single queue and the code to manage it amount to that much extra.

So, all this being said, let me know what you think about it (and that
applies to everyone else as well, of course, in particular Meng).

I won't comment on the code in too much detail, as it will require some
quite radical restructuring, but I'll skim through it and provide some
hints anyway.

> Signed-off-by: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>
> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
> ---
>  xen/common/sched_rt.c |  319 ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 222 insertions(+), 97 deletions(-)
>
First of all, it's a big patch. It's only changing one file and one
logical component, and for that reason, TBH, it's not that hard to
review. Still I think you can break it into at least two, and perhaps even
more, if you try to implement what I described above.

Noted.
Note that it's not as large as it seems: a large chunk of the deletions, and an equal number of insertions, come from moving a function so that it is visible to other functions.

> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 7c39a9e..25f0458 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c

> @@ -156,6 +160,7 @@ struct rt_vcpu {
>      s_time_t cur_budget;        /* current budget */
>      s_time_t last_start;        /* last start time */
>      s_time_t cur_deadline;      /* current deadline for EDF */
> +    s_time_t next_sched_needed; /* next time to make scheduling decision */
>
As an example of why I said that things should become simpler, and are
instead becoming more complex: with my solution, you don't need anything
like this. In fact, budget enforcement is already being done ok in
rt_schedule(), so no need to do anything more/different about it.
Replenishments should be programmed to fire at cur_deadline, so again,
no need to add this new field, and multiplex its actual value between
budget and deadline (which, yes, counts as being something rather
complex for me! :-D).

As mentioned, the timerq handles all events that could change the scheduling decision, not just replenishments. It tracks the earliest time at which anything could cause a vcpu to be scheduled differently, and could be extended if any more insights are found. Budget enforcement is still done in rt_schedule(), but it is driven by this very list: a budget expiration is one possible event that requires a vcpu reschedule.

> @@ -361,6 +387,12 @@ __q_remove(struct rt_vcpu *svc)
>          list_del_init(&svc->q_elem);
>  }
>
> +static inline void __t_remove(struct rt_vcpu *svc)
> +{
> +    if( !list_empty(&svc->t_elem) )
> +        list_del_init(&svc->t_elem);
> +}
> +
>
You're using hard tabs for indenting here. As you see everywhere else,
Xen wants 4 spaces for this.

Sorry, I thought I had gotten them all changed to spaces. My editor wasn't configured correctly at first, so some are lurking around. I will exterminate the remaining ones.

>  /*
>   * Insert svc with budget in RunQ according to EDF:
>   * vcpus with smaller deadlines go first.
> @@ -395,6 +427,72 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>  }
>
>Â /*
> + * Insert svc into the timerq, maintaining ascending order by next_sched_needed.
> + */
> +static void __timerq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter = timerq;
> +
> +    ASSERT( spin_is_locked(&prv->lock) );
> +
> +    ASSERT( list_empty(&svc->t_elem) );
> +
The blank line between the asserts, I'd kill it.

Can do.

> +    list_for_each(iter, timerq)
> +    {
>
You're still using tabs, and mixing them with spaces, making things look
even more cumbersome.

Again, sorry. Will exterminate.

> +/*
> + * Return how far the lowest time on the timerq (that is after NOW) is in the
> + * future.
> + * Lock must be grabbed before calling this.
> + */
> +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter, *tmp;
> +
> +    list_for_each_safe(iter, tmp, timerq)
> +    {
> +        struct rt_vcpu * iter_svc = __t_elem(iter);
> +
> +        if ( iter_svc->next_sched_needed > now )
> +            return (iter_svc->next_sched_needed - now);
> +        else
> +            __t_remove(iter_svc);
> +    }
> +
> +    return MAX_SCHEDULE;
> +}
>
If using a timer, you can just get rid of the items as you go: in the timer
routine, you handle the event associated with each of them.

Also, why is MAX_SCHEDULE still there?

Honestly, events do not have to be removed here, but it's done to prevent walking a list of stale values to get to the newest one on the next call. This is done more for performance; their non-removal would be functionally equivalent.
With the timer alternative you mention, where would the timer events and their data be held? I think that could be extra complexity, unless I fail to understand the idea completely.

The current approach is actually somewhat simple if you think about it, requiring just one piece of information per vcpu (the next_sched_needed field), which is updated as information becomes available. It also uses a guaranteed O(1) increase in memory, given how the queue and next_sched_needed are set up, and most of the computational parts are rolled into existing routines that need to run anyway.

MAX_SCHEDULE may not be required, but I have it there as a just-in-case. For example, it will cause the scheduler to be called after MAX_SCHEDULE even when no events are occurring (such as when there are no vcpus). It's there to ensure progress in case of any bugs or unexpected behavior.
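On "where would the timer events and their data be held": under the single scheduler-wide timer of option a), one possible shape, sketched with invented names that are not in the current code, is simply:

    #include <xen/list.h>
    #include <xen/timer.h>

    /* The "event data" is what the scheduler already tracks: vcpus and their
     * cur_deadline, kept on one more list, plus a single struct timer.  This
     * could hang off struct rt_private. */
    struct rt_repl_state {
        struct timer repl_timer;   /* armed at the earliest pending cur_deadline */
        struct list_head replq;    /* vcpus awaiting replenishment, deadline order */
    };

The timer handler would then consume entries from the head of replq much as __timerq_next() walks the timerq today, just outside of rt_schedule().
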
> +/*
> + * Updates the next_sched_needed field for a vcpu, which is used for
> + * determining the next needed schedule time for this vcpu. Then updates
> + * timerq via reinsert.
> + */
> +static void update_sched_time(const struct scheduler *ops, s_time_t now,
> +                              struct rt_vcpu *svc)
> +{
> +    /* update next needed schedule time */
> +    if( test_bit(__RTDS_scheduled, &svc->flags) )
> +        svc->next_sched_needed = now + svc->cur_budget;
> +    else
> +        svc->next_sched_needed = svc->cur_deadline;
> +
> +    /* Remove and reinsert to maintain sorted order in timerq */
> +    __t_remove(svc);
> +    __timerq_insert(ops, svc);
> +
> +    return;
> +}
>
And here's the multiplexing, which --even worse-- (may) require
rearranging the queue! We really don't need anything like this.

I see what you mean here, but this doesn't seem bad to me. Right now it's only these two events and a simple if-else, but what is great is that this method can be updated to include any number of tricks or insights that could help determine when something may need to be scheduled or not. It may seem complex for what appears to be just two cases, but many more cases, or an entirely different accounting method, could be placed here and it would be fine. It makes for a nice "ground zero" for modifying the timer update behavior.

>  /*
> + * Pick a cpu where to run a vcpu,
> + * possibly kicking out the vcpu running there
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy;
> + *    among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct rt_vcpu *latest_deadline_vcpu = NULL; /* lowest priority */
> +    struct rt_vcpu *iter_svc;
> +    struct vcpu *iter_vc;
> +    int cpu = 0, cpu_to_tickle = 0;
> +    cpumask_t not_tickled;
> +    cpumask_t *online;
> +
>
[snip]

And here you are moving a function up. In general, it is better to have
separate patches that do the moving, with the fact that they are all
about moving and that the code is not being changed, clearly stated in
the commit message. This is because it is really really hard to figure
out, e.g. from here, whether you also changed something in the function
while moving it, making reviewing a lot harder and more prone to error.

Ah, thanks for mentioning this. The move was required, but I didn't know a separate patch was customary. I can make it so.

That being said, in this specific case, you're moving up runq_tickle(),
the purpose of which is to trigger a call to rt_schedule() (after going
through the softirq machinery)... in order to call it in rt_schedule().
That's pretty tricky, and is not how tickling should work.

It is set up to only tickle if needed. I'm not sure if this is an issue.

Again, with an approach that really takes advantage of timers, this won't
be necessary.

I think there could be some discussion on the approaches; hopefully others comment as well. As mentioned, this is actually a logically quite simple change, with good efficiency, and the "ground zero" update function makes changes that use more information easy, although right now it was kept simple to get the change out. Improvements to that function were somewhat expected as future work.

Thanks for the comments.
~Dagaen
