
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



Hello Dagaen,

Thanks for doing this. The first question I have is whether you ran any
test/benchmark whose results you can show here.

For instance, a few weeks ago, Meng (while trying to do a completely
different thing) sent here on the list some numbers about the frequency
of the scheduler being called, and the overhead caused by that... Would
it be hard to run the same experiments and collect the numbers, with and
without your patch?

On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> To do this, we create a new list that holds, for each
> vcpu, the time least into the future that it may need to be
> rescheduled.
>
Ok. Actually, what I really expected to see in this patch was, either:
 a) a new, scheduler wide, list of replenishment times _and_ a timer,
    always (re)programmed to fire at the most imminent one; or
 b) one timer for each vcpu, always (re)programmed to fire at the time
    instant when the replenishment for that vcpu should happen.

And note that, when I say "timer", I mean an actual Xen timer, i.e.,
those things that are started, stopped, and with a timer handling
routine being called when they expire. For an example, you can have a
look, in sched_credit.c, at 'ticker' or 'master_ticker'.
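
Just to make this concrete, here's a rough sketch of what I have in mind
for a) (only the timer API, from xen/include/xen/timer.h, is real; the
names repl_timer and repl_handler are made up):

    /* A scheduler-wide replenishment timer, e.g. in rt_private: */
    struct rt_private {
        /* ... existing fields ... */
        struct timer repl_timer;    /* replenishment timer */
    };

    /* At scheduler initialization time: */
    init_timer(&prv->repl_timer, repl_handler, (void *)ops,
               smp_processor_id());

    /* Whenever the most imminent replenishment instant changes: */
    set_timer(&prv->repl_timer, next_replenishment_time);

When it fires, repl_handler() does the actual replenishment(s) and
re-arms the timer for the next event. For b), it would be one
struct timer per vcpu, programmed to fire at that vcpu's cur_deadline.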

Deciding between a) and b) isn't easy without actually trying to
implement them. I personally prefer b), and I think it would be a
superior long term solution (see right below), but given the current
architecture of the RTDS scheduler (e.g., the fact that it has a global
runq, etc.), I won't nack a), which would most likely be simpler.

Performance and overhead wise, again, it's hard to tell in advance. b) would
introduce more overhead (as there are more timers), but will likely
prove more scalable (which is not something of great importance for
this scheduler) and may allow for better latency (which _is_ something
of great importance for this scheduler :-) ), since there won't be the
need to lock the whole scheduler during the replenishments (which is,
OTOH, necessary with a)).

And that's it for the replenishments. For enforcing the budget, well, we
have the ret.time value of the task_slice struct returned by
rt_schedule, so that's event driven already, and we don't need to do
much about it.
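
I.e., something like this already happens (I'm sketching from memory,
so don't take it as the exact code):

    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        struct task_slice ret;

        /* ... burn budget, pick snext, etc. ... */

        ret.task = snext->vcpu;
        /* Make sure we are invoked again when snext's budget runs out: */
        ret.time = snext->cur_budget;

        return ret;
    }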

Does all this make sense? Am I making myself clear enough?
If not, feel free to ask.

>  The scheduler chooses the lowest time off of this
> list and waits until the specified time instead of running every
> 1 ms as it did before.
> 
Yeah, well, I see what you mean, and how this is actually succeeding
(at least AFAICT, by looking at the code; again, it would be nice to
have some numbers) in improving the scheduler's behavior.

However, this transition toward event driven-ness has two goals:
 * improve the scheduler behavior [check, at least to some extent]
 * improve the code (readability, maintainability, etc.)
   [no check at all :-(]

As an example, the __repl_update() function: it's called twice inside
rt_schedule() and a third time from rt_context_saved(), which is basically
like saying it's called 3 times from rt_schedule(), and this has always made
very little sense to me. In fact, I think that scheduling decisions and
replenishment events shouldn't be so tightly coupled. There are
interdependencies, of course (a replenishment could call for a
scheduling decision to be taken), but not like this. That's why I say
I'd like to see a timer handling routine dealing with replenishments, and
let rt_schedule() do its job, which is (see the sketch right after this list):
 - enforcing the budget of the running vcpu. If it expires,
   _(re)program_ the replenishment timer
 - choosing the best vcpu to run next, if necessary
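
Roughly (and this is just a sketch mixing existing helpers with made-up
names like repl_timer, not working code):

    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        /* ... */
        burn_budget(ops, scurr, now);           /* budget enforcement */

        if ( scurr->cur_budget <= 0 )
            /* Out of budget: replenishment is due at its cur_deadline */
            set_timer(&prv->repl_timer, scurr->cur_deadline);

        snext = __runq_pick(ops, /* ... */);    /* best vcpu to run next */

        ret.task = snext->vcpu;
        ret.time = snext->cur_budget;
        return ret;
    }

No __repl_update(), and no replenishment related work at all, in here.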

With this patch, you're still using rt_schedule() to do both scheduling
and replenishments, although you're now doing it at actual replenishment
times, instead of checking for them every millisecond.

Also, the bits and pieces that you need to add in order to deal with
this new queue are, really, making things even more complex than they are
now, which is undesirable.

So, all this being said, let me know what you think about it (and that
applies to everyone else as well, of course, in particular Meng).

I won't comment on the code in too much detail, as it will require some
quite radical restructuring, but I'll skim through it and provide some
hints anyway.

> Signed-off-by: Dagaen Golomb <dgolomb@xxxxxxxxxxxxxx>
> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
> ---
>  xen/common/sched_rt.c |  319 
> ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 222 insertions(+), 97 deletions(-)
> 
First of all, it's a big patch. It's only changing one file and one
logical component, and for that reason, TBH, it's not that hard to
review. Still, I think you can break it into at least two patches, and
perhaps even more, if you try to implement what I described above.

> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index 7c39a9e..25f0458 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c

> @@ -156,6 +160,7 @@ struct rt_vcpu {
>      s_time_t cur_budget;        /* current budget */
>      s_time_t last_start;        /* last start time */
>      s_time_t cur_deadline;      /* current deadline for EDF */
> +    s_time_t next_sched_needed; /* next time to make scheduling decision */
>
As an example of why I said that things should become simpler, and are
instead becoming more complex: with my solution, you don't need anything
like this. In fact, budget enforcement is already being done ok in
rt_schedule(), so no need to do anything more/different about it.
Replenishments should be programmed to fire at cur_deadline, so again,
no need to add this new field, and multiplex its actual value between
budget and deadline (which, yes, counts as being something rather
complex for me! :-D).

> @@ -361,6 +387,12 @@ __q_remove(struct rt_vcpu *svc)
>          list_del_init(&svc->q_elem);
>  }
>  
> +static inline void __t_remove(struct rt_vcpu *svc)
> +{
> +     if( !list_empty(&svc->t_elem) )
> +             list_del_init(&svc->t_elem);
> +}
> +
>
You're using hard tabs for indentation here. As you can see everywhere
else, Xen wants 4 spaces for this.

>  /*
>   * Insert svc with budget in RunQ according to EDF:
>   * vcpus with smaller deadlines go first.
> @@ -395,6 +427,72 @@ __runq_insert(const struct scheduler *ops, struct 
> rt_vcpu *svc)
>  }
>  
>  /*
> + * Insert svc into the timerq, maintaining ascending order by 
> next_sched_needed.
> + */
> +static void __timerq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter = timerq;
> +
> +    ASSERT( spin_is_locked(&prv->lock) );
> +
> +    ASSERT( list_empty(&svc->t_elem) );
> +
The blank line between the asserts, I'd kill it.

> +     list_for_each(iter, timerq)
> +     {
>
You're still using tabs, and mixing them with spaces, making things look
even more cumbersome.

> +/*
> + * Return how far the lowest time on the timerq (that is after NOW) is in the
> + * future.
> + * Lock must be grabbed before calling this.
> + */
> +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *timerq = rt_timerq(ops);
> +    struct list_head *iter, *tmp;
> +
> +    list_for_each_safe(iter, tmp, timerq)
> +    {
> +        struct rt_vcpu * iter_svc = __t_elem(iter);
> +
> +        if ( iter_svc->next_sched_needed > now )
> +            return (iter_svc->next_sched_needed - now);
> +        else
> +            __t_remove(iter_svc);
> +    }
> +
> +    return MAX_SCHEDULE;
> +}
>
If using a timer, you can just get rid of the items as you handle, in the
timer routine, the events associated with them.

Also, why is MAX_SCHEDULE still there?
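
With a real timer, the handler itself would be the one consuming the
queue, more or less like this (again just a sketch; repl_handler,
rt_replq and the __replq_* helpers are made up, while rt_update_deadline()
and runq_tickle() already exist):

    static void repl_handler(void *data)
    {
        const struct scheduler *ops = data;
        struct rt_private *prv = rt_priv(ops);
        s_time_t now = NOW();
        struct list_head *replq = rt_replq(ops);
        struct list_head *iter, *tmp;

        spin_lock_irq(&prv->lock);

        list_for_each_safe ( iter, tmp, replq )
        {
            struct rt_vcpu *svc = __replq_elem(iter);

            if ( svc->cur_deadline > now )
                break;                    /* the queue is kept sorted */

            __replq_remove(svc);
            rt_update_deadline(now, svc); /* new deadline, full budget */
            __replq_insert(ops, svc);     /* back in, at its new position */
            runq_tickle(ops, svc);        /* may trigger a scheduling decision */
        }

        /* Re-arm for the next replenishment event, if there is one. */
        if ( !list_empty(replq) )
            set_timer(&prv->repl_timer,
                      __replq_elem(replq->next)->cur_deadline);

        spin_unlock_irq(&prv->lock);
    }

This way, there's no MAX_SCHEDULE, and no expired entries left around
to skip over (or lazily remove) during scheduling.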

> +/*
> + * Updates the next_sched_needed field for a vcpu, which is used for
> + * determining the next needed schedule time for this vcpu. Then updates
> + * timerq via reinsert.
> + */
> +static void update_sched_time(const struct scheduler *ops, s_time_t now, 
> +                              struct rt_vcpu *svc)
> +{
> +    /* update next needed schedule time */
> +    if( test_bit(__RTDS_scheduled, &svc->flags) )
> +        svc->next_sched_needed = now + svc->cur_budget;
> +    else
> +        svc->next_sched_needed = svc->cur_deadline;
> +
> +    /* Remove and reinsert to maintain sorted order in timerq */
> +    __t_remove(svc);
> +    __timerq_insert(ops, svc);
> +
> +    return;
> +}
>
And here's the multiplexing, which --even worse-- (may) require
rearranging the queue! We really don't need anything like this.

>  /*
> + * Pick a cpu where to run a vcpu,
> + * possibly kicking out the vcpu running there
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy;
> + *    among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct rt_vcpu *latest_deadline_vcpu = NULL; /* lowest priority */
> +    struct rt_vcpu *iter_svc;
> +    struct vcpu *iter_vc;
> +    int cpu = 0, cpu_to_tickle = 0;
> +    cpumask_t not_tickled;
> +    cpumask_t *online;
> +
>
[snip]

And here you are moving a function up. In general, it is better to have
separate patches that do the moving, with the fact that they are all
about moving, and that the code is not being changed, clearly stated in
the commit message. This is because it is really, really hard to figure
out, e.g. from here, whether you also changed something in the function
while moving it, which makes reviewing a lot harder and more prone to error.

That being said, in this specific case, you're moving up runq_tickle(),
the purpose of which is to trigger a call to rt_schedule() (after going
through the softirq machinery)... in order to call it in rt_schedule().
That's pretty tricky, and is not how tickling should work.

Again, with an approach that really takes advantage of timers, this won't
be necessary.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
