
Re: [Xen-devel] [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.



> No HTML, please.

Got it, sorry.

>>         And note that, when I say "timer", I mean an actual Xen timer,
>>         i.e.,
>>         those things that are started, stopped, and with a timer
>>         handling
>>         routine being called when they expire. For an example, you can
>>         have a
>>         look, in sched_credit.c, at 'ticker' or 'master_ticker'.
>>
>>
>> I will look at this. However, the current solution is likely
>> functionally equivalent and, with only one timer and a single list
>> used to arm it, I'm not sure if using a Xen timer is necessary.
>>
> It is functionally equivalent, by chance (see the issue about calling
> runq_tickle() on current vcpus in my reply to Meng). The reason why a
> different approach is required is making the code look better, reducing
> (instead of increasing) the complexity of sched_rt.c, lowering the
> overhead caused by long running operations performed while holding the
> global scheduler spin lock, and improving scalability.
>
>> Do they incur any extra overhead or coarsen granularity?
>>
> "extra overhead or coarsen granularity" as compared to what? s_timer in
> schedule.c (which is what you're using) is one of them already!
>
> What I meant by "an actual timer" is that you should add a new one, and
> move some of the stuff currently being done in rt_schedule() into its
> handling routine, rather than just adding a new queue of events to be
> serviced by abusing s_timer.

Okay, I was wondering what you meant by "an actual timer."
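To make sure I'm looking at the right thing, this is the shape of the
interface I see around the 'ticker' in sched_credit.c. It is only a sketch
from my reading of that file, with placeholder names (ticker, ticker_fn,
ticker_setup, ticker_teardown), not code from the patch:

/*
 * Rough shape of the Xen timer interface, as used for the 'ticker' in
 * sched_credit.c. All names here are placeholders.
 */
#include <xen/timer.h>
#include <xen/time.h>

static struct timer ticker;

static void ticker_fn(void *data)
{
    /* Work done when the timer fires (e.g. replenishments), typically
     * ending with a set_timer() call to re-arm for the next event. */
}

static void ticker_setup(void *data, unsigned int cpu)
{
    init_timer(&ticker, ticker_fn, data, cpu);  /* bind handler and cpu */
    set_timer(&ticker, NOW() + MILLISECS(1));   /* fire at an absolute time */
}

static void ticker_teardown(void)
{
    stop_timer(&ticker);    /* disarm */
    kill_timer(&ticker);    /* destroy on scheduler teardown */
}

If that is the intended mechanism, then rt_schedule() no longer has to work
out when to wake itself up; the handler re-arms the timer on its own.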

>
>>         Deciding between a) or b) isn't easy, without actually trying
>>         to
>>         implement them. I personally prefer b), and I think it would
>>         be a
>>         superior long term solution (see right below), but given the
>>         current
>>         architecture of the RTDS scheduler (e.g., the fact that it has
>>         a global
>>         runq, etc), I won't nack a), which would most likely be
>>         simpler.
>>
>>
>> I agree b) may be better in the long run, but with the current
>> architecture a) is simpler. b) can be future work as the scheduler
>> evolves.
>>
> Sure. And in fact, I'm fine with a), if done properly.
>>
>>         With this patch, you're still using rt_schedule() to do both
>>         scheduling
>>         and replenishments, although you're now doing it at actual
>>         replenishment
>>         times, instead of checking for them every millisecond.
>
>>
>> I think once you understand that the timerq is not only
>> replenishments, but any event that could cause a branch in the
>> scheduler's behavior, it becomes more palatable to have them
>> intertwined.
>>
> I got that, and I'm asking you to do it differently.
>
>> Really, the timerq and scheduling aren't as intertwined as they
>> appear; they are practically isolated functionally. It's just that the
>> timerq is most naturally serviced during scheduling events when data
>> that affects scheduling decisions is changing. It's straightforward and
>> efficient.
>>
> Yeah, replenishments are 'naturally serviced' in a 'straightforward and
> efficient' way by looping on all runnable vcpus, in more than one place,
> from within super-hot paths, with the global scheduler spin lock held.
> Neat! :-/
>
> Oh, and that's before your introduction of another list to be taken care
> of, still under the same conditions. :-O

The paths are already hot, so adding a small amount of extra work is only a
small percentage increase. I'm usually against this kind of thing too, but
splitting this out into another timer, while it could run independently, may
actually be more work if it needs to reuse some of the same maintenance
behavior. I guess the main argument against keeping it all in rt_schedule()
is maintainability.

>
>>         Also, the bits and pieces that you need to add in order to
>>         deal with
>>         this new queue are, really, making things even more complex
>>         than they are
>>         now, which is undesirable.
>
>> While it does add some complexity, I don't feel that a single queue and
>> the code to manage it add that much extra complexity.
>>
> But in fact the point of making the scheduler 'event driven' is to take
> advantage of the chances that this offers for getting stuff *out* of
> rt_schedule(), and the purpose is not "not adding that much extra
> complexity", is making it _simpler_.

I understand where you are coming from now. I was looking at it mostly
from the perspective of performance. This explains our differences.

>
>
>>         > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
>>         > index 7c39a9e..25f0458 100644
>>         > --- a/xen/common/sched_rt.c
>>         > +++ b/xen/common/sched_rt.c
>>
>>         > @@ -156,6 +160,7 @@ struct rt_vcpu {
>>         >      s_time_t cur_budget;        /* current budget */
>>         >      s_time_t last_start;        /* last start time */
>>         >      s_time_t cur_deadline;      /* current deadline for EDF
>>         */
>>         > +    s_time_t next_sched_needed; /* next time to make
>>         scheduling decision */
>>         >
>>         As an example of why I said that things should become simpler,
>>         and are
>>         instead becoming more complex: with my solution, you don't
>>         need anything
>>         like this. In fact, budget enforcement is already being done
>>         ok in
>>         rt_schedule(), so no need to do anything more/different about
>>         it.
>>         Replenishments should be programmed to fire at cur_deadline,
>>         so again,
>>         no need to add this new field, and multiplex its actual value
>>         between
>>         budget and deadline (which, yes, counts as being something
>>         rather
>>         complex for me! :-D).
>>
>>
>> As mentioned, the timerq is handling all events that could change the
>> scheduling decision, not just replenishments.
>>
> Yes, exactly, it's handling *too many* events. :-)
>
> For example, have a look at 'struct vcpu', in
> xen/include/xen/sched.h. It's got 3 timers in it:
>
>     struct timer     periodic_timer;
>     struct timer     singleshot_timer;
>     struct timer     poll_timer;    /* timeout for SCHEDOP_poll */
>
> It probably would have been possible to just use one, with a single and
> mixed event queue, as you're doing, a 1k-line handling routine, etc.
>
> Do you think it would have been easier or harder to implement, understand,
> debug and maintain? I bet on harder.

Point taken.

>
>> This tracks the earliest time anything could cause this to be scheduled
>> differently, and could be extended if any more insights are found.
>> Budget enforcement is being done in rt_schedule(), but it's being done by
>> this very list: a budget expiration is one possible event that would
>> require a vcpu reschedule.
>>
> OMG, 'extended'... You're thinking of actually putting more stuff in
> there?!? :-O
>
> It's a rather key and already too long and complex critical section, so
> please, let's aim at shortening and making it simpler, i.e., at
> improving things, not making them worse.

This can again be explained by our goal mismatch. I see where you are
coming from now.

>>
>>         > +/*
>>         > + * Return how far the lowest time on the timerq (that is
>>         after NOW) is in the
>>         > + * future.
>>         > + * Lock must be grabbed before calling this.
>>         > + */
>>         > +static s_time_t __timerq_next(const struct scheduler *ops,
>>         s_time_t now)
>>         > +{
>>         > +    struct list_head *timerq = rt_timerq(ops);
>>         > +    struct list_head *iter, *tmp;
>>         > +
>>         > +    list_for_each_safe(iter, tmp, timerq)
>>         > +    {
>>         > +        struct rt_vcpu * iter_svc = __t_elem(iter);
>>         > +
>>         > +        if ( iter_svc->next_sched_needed > now )
>>         > +            return (iter_svc->next_sched_needed - now);
>>         > +        else
>>         > +            __t_remove(iter_svc);
>>         > +    }
>>         > +
>>         > +    return MAX_SCHEDULE;
>>         > +}
>>         >
>>         If using a timer, you can just get rid of items while, in the
>>         timer
>>         routine, you handle the event associated to them.
>>
>>         Also, why is MAX_SCHEDULE still there?
>>
>>
>> Honestly, events do not have to be removed here, but it's done to
>> prevent walking a list of stale values to get to the newest one on the
>> next call. This is done more for performance. Their non-removal would
>> be functionally equivalent.
>>
> Of course you should remove the stale entries, I certainly was not
> arguing that the list should just grow indefinitely!
>
> Point is, again, that this is another list walk occurring in a hot
> path, with a big spin lock held. We want to get rid of such thing, not
> adding more of it.
>
>> With the timer alternative you mention, where would the timer events
>> and their data be held? I think that could be extra complexity, unless
>> I fail to understand the idea completely.
>>
> In an event queue like yours, of course. But you won't go through it
> and/or bookkeep it while in hot paths, with the scheduler lock held.
>
> See my email to Meng to have more details on what I have in mind, or
> feel free to ask for more details.
>
>> MAX_SCHEDULE may not be required, but I have it there as an "in case."
>> For example, it will cause the scheduler to be called after
>> MAX_SCHEDULE even when no events are occurring (such as when there are no vcpus). It's
>> there to ensure progress in case of any bugs or unexpected behavior.
>>
> Wait, so, all this work, and then you still want to call rt_schedule()
> every millisecond, when there's nothing to do?!?! :-O
>
>>         That being said, in this specific case, you're moving up
>>         runq_tickle(),
>>         the purpose of which is to trigger a call to rt_schedule()
>>         (after going
>>         through the softirq machinery)... in order to call it in
>>         rt_schedule().
>>         That's pretty tricky, and is not how tickling should work.
>
>> It is set up to only tickle if needed. I'm not sure if this is an issue,
>>
> It's wrong, AFAICT. See my reply to Meng and, please, comment by
> replying to it, if you've got anything to say about this, to make the
> discussion easier to follow.
>
> Regards,
> Dario

I understand what caused our mismatch now. I think the option a) you
mentioned makes sense given your goals. I am going to look into the details
of this approach and get back with any questions or concerns.
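
As a very first cut, this is roughly what I am picturing, mostly to check
that I have the idea right. Every name below (repl_timer,
repl_timer_handler, the replenishment queue) is hypothetical, just my
sketch of the approach, not code from the patch or from your email:

static void repl_timer_handler(void *data)
{
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    s_time_t now, next = 0;

    spin_lock_irq(&prv->lock);
    now = NOW();

    /*
     * Walk a queue of vcpus kept ordered by cur_deadline: replenish the
     * budget of those whose deadline has passed, tickle pcpus if any of
     * them should now preempt, and record the earliest deadline still in
     * the future as the next time this handler must run.
     */

    spin_unlock_irq(&prv->lock);

    /* Re-arm for the next replenishment event, if there is one. */
    if ( next > now )
        set_timer(&prv->repl_timer, next);
}

If I have that right, rt_schedule() keeps only budget enforcement and
picking the next vcpu, which is the simplification you are after.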

Regards,
~Dagaen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

