[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2] Modified RTDS scheduler to use an event-driven model instead of polling.



On Mon, 2015-07-06 at 22:51 -0700, Meng Xu wrote:

> >> > +static void replenishment_handler(void* data)
> >> > +{
> >> > +    const struct scheduler *ops = data;
> >> > +    struct rt_private *prv = rt_priv(ops);
> >> > +    struct list_head *depletedq = rt_depletedq(ops);
> >> > +    struct list_head *iter;
> >> > +    struct list_head *tmp;
> >> > +    struct rt_vcpu *svc = NULL;
> >> > +    s_time_t now = NOW();
> >> > +    int new_position = 0;
> >> > +    unsigned long flags;
> >> > +
> >> > +    spin_lock_irqsave(&prv->lock, flags);
> >> > +
> >> > +    // Replenish the vCPU that triggered this, and resort it into runq
> >> > +    list_for_each_safe(iter, tmp, depletedq)
> >> > +    {
> >> > +        svc = __q_elem(iter);
> >> > +        if ( now >= svc->cur_deadline )
> >> > +        {
> >> > +            rt_update_deadline(now, svc);
> >> > +            __q_remove(svc); /* remove from depleted queue */
> >> > +            new_position = __runq_insert(ops, svc); /* add to runq */
> >> > +        }
> >> > +        else break; // leave list_for_each_safe
> >> > +    }
> >> > +
> >> > +    // If we become one of top [# CPUs] in the runq, tickle it
> >> > +    // TODO: make this work when multiple tickles are required
> >>
> >> Why do you need multiple tickles?
> >>
> > Because the loop above may have been done more than one replenishment,
> > which makes sense.
> 
> Ah-ha, I got the reason.I should have read it twice to figure it out. :-)
> >
> > While we are here. I've said that I don't like the fact that we have one
> > big spinlock for the whole scheduler. However, we do have it right now,
> > so let's make good use of it.
> 
> OK.
> 
> >
> > So (but please double check what I'm about to say, and provide your
> > view), it should be safe to access the various svc->vc->processor from
> > inside here, since we're holding the big log. If that's the case, then
> > it should be fairly easy to figure out which are the pCPUs that needs to
> > reschedule (playing with the index, etc), and then use their
> > vc->processor to decide what pCPU to actually tickle, isn't this the
> > case?
> 
> Hmm, I don't quite get what you meant here although I read it for
> three times. :-(
> 
Yeah, sorry, my bad. :-/

> Did you mean we can decide which PCPUs to tickle for the several
> "highest" priority vcpus after their budgets get replenished?
> If so, doesn't that mean that we will "duplicate" (part of) the
> runq_tickle code to find the pCPUs to preempt? Is it better than the
> approach that just call runq_tickle each time whenever a high priority
> "ready" VCPU is found?
> 
I was just sketching a possible improvement, which as usual is something
ttivially done, without actually writing the code. However, let me try
again.

Let's look at runq_tickle(). It is, right now, called from:
 (1) rt_vcpu_wake(), and it _does_belong_ in there (at least, the logic
     it implements does), but it's called in a way that I really don't
     like. In fact, there is a call to __repl_update(), followed by a
     __runq_pick(), and I think both should go away;
 (2) in rt_context_saved(), again, following a replenishment update and
     a pick. Same as above, I'd love for the update+pick to be kick out
     of here as well;
 (3) with Dagaen patch, it's called from the replenishment handler, and
     again I think the rescheduling request logic (i.e., what
     runq_tickle() currently implements) is fine in there, although how
     this is done needs to improve to support tickling multiple pCPUs.

Let's look at these one by one.

1) rt_vcpu_wake():
   You are waking up vCPU i, and you're doing it right until the call to
   __runq_insert(), where you insert the vCPU on the runq. But then, why
   do you call __repl_update()? Why, at every vCPU wakeup, you want to
   scan the whole runqueue updating everyone's budget? I mean, I (not
   completely, but, well...) understand why you've been doing this
   _until_now_, but after Dagaen patch, we have the replenishment timer,
   and replenishments will be done by the replenishment timer handler,
   so you don't need to do this in here any longer!

   So, basically, I would like Dagaen, as a part of this really cool
   work he's doing, to remove completely the __repl_update() and
   __runq_pick() logic from rt_vcpu_wake().  Then, it is ok to call
   runq_tickle(), but you'll be calling it like this:

     runq_tickle(ops, svc);

  i.e., you'll go checking whether the vCPU that just woke up needs to  
  preempt any other one currently running on a pCPU, AND THAT'S IT!
  We're dealing with a vCPU waking up, not with a wakeup+replenishment
  +pick+reschedule... Let's keep functions focused on their own purpose,
  as we were saying in the previous thread, ok? :-)

  So, as far as this function is concerned, it is ok to call
  runq_tickle(), and runq_tickle() is doing the right thing already, in
  the way it's currently implemented, provided it's called _on_ the
  waking vCPU, not on something else.

2) rt_context_saved()
   This looks a lot similar to (1). In fact, the replenishment update
   and the pick should be removed from here completely. In fact, with a 
   similar reasoning to the above case, what is this function doing? It
   is inserting a vCPU in the runq. It's actually rather similar to a
   wake-up, the only difference being the wake-up is an actual insertion
   (i.e., the vCPU --most times-- was not there before), while in this
   case it is a re-insertion (i.e., the vCPU was there already, but we
   needed to delay fiddling with it).

   So, indeed, everything just said in (1) applies here too, and I'd
   like this function to be restructured accordingly (kill
   __repl_update(), kill __runq_pick(), kill snext, and call
   runq_tickle(ops, svc)).

Now, allow me to stop for a sec, and make some considerations. Dagaen,
in his patch, is changing __runq_insert(), making it return the position
in the runqueue where the vCPU has been inserted. This is actually a
good thing, IMO. 

In different, and generally more scalable implementations of global EDF,
e.g., where you have multiple runqueues, and hence multiple spinlocks,
doing something like this is an absolute nightmare. More specifically,
what is really really difficult (not to implement, but to implement in a
way that does not defeat the scalability benefits of splitting
runqueues) is to actually have a reliable and low overhead way of
looking at other queues and/or pCPUs, to figure out what vCPUs are
running there, and whether or not the one we are dealing with (e.g.,
during a vCPU wakeup) should preempt any one of them. E.g., if you look
at the Linux EDF implementation, which uses different queues, there is a
whole data structure dedicate to that, implemented in its own source
file!

Here, everything is below the umbrella of the big, scheduler wide,
spinlock... So, really, it's rather easy to keep track of that, and that
is actually what runq_tickle() is doing (well, it does not keep track of
it, it goes figuring that out), and what Dagaen is doing by returning
the index is exactly another way to do that.

That's actually what I was (badly) trying to hint at in my previous
mail. Certainly, we don't want to duplicate code from runq_tickle() by
cutting-&-pasting it around... But maybe we can change things in a way
that they're even simpler than they are now (which is really what I
hoped, when thinking to the outcomes of this work, and at using this
design for it! :-D).

So, it looks to me that, as far as (1) and (2) are concerned, since we
are "just" inserting a vCPU in the runq, if we have M pCPUs, and we know
whether we inserted it within the first M spots, we already have what we
want, or am I missing something? And if __runq_insert() now (with Dagaen
patch) tells us this, well, we can simplify the tickling logic, can't
we?

With an example:
We are waking up (or re-inserting, in rt_context_saved()) vCPU j. We
have 6 pCPUs. __runq_insert() tells us that it put vCPU j at the 3rd
place in the runq. This means vCPU j should be set to run as soon as
possible. So, if vCPU j is 3rd in runq, either
 (a) there are only 3 runnable vCPUs (i.e., if we are waking up j, there
     were 2 of them, and j is the third; if we are in context_saved,
     there already where 3, and j just got it's deadline postponed, or
     someone else got its one replenished);
 (b) there are more than 3 runnable vCPUs, i.e., there is at least a 4th
     vCPU --say vCPU k-- in the runq, which was the 3rd before vCPU j
     were woken (or re-inserted), but now became the 4th, because
     deadline(j)<deadline(k).
In case (a), there are for sure idle pCPUs, and we should tickle one of
them. In case (b) there may be idle pCPUs (and, if that's the case, we
should tickle one of them, of course) or not. If not, we need to go
figure out which pCPU to tickle, which is exactly what runq_tickle()
does, but we at least know for sure that we want to tickle the pCPU
where vCPU k runs, or others where vCPUs with deadline greater than vCPU
k run.

Does this make sense? If yes, I think this means we can (and hence
should) restructure runq_tickle() in order to make it behave as above,
as I think it would be more quick and effective. It's probably necessary
to turn the for_each_cpu() loop in there, in a runq scan, but one that
only starts from a specific vCPU (the one that is ->next wrt the newly
inserted vCPU, k, in the example above), and stopping at the M-eth vCPU
in the queue. It's probably necessary to have not only the index, but
also an rt_vcpu pointer, from __runq_insert() (or maybe not, if svc is
already available, and we chan check and traverse its q_elem->next
pointer), and to pass either (or both) to runq_tickle()... But this is
better determined while implementing things.

Honestly, I originally thought that runq_tickle() could be simplified
even more, by looking at the svc->vc->processor field of the vCPU that
our vCPU is inserted before. I see now, however, that this can't be
done, as we, not always tickle exactly the pCPU of the vCPU that is now
next to us (i.e., the pCPU where vCPU k runs, in the example), but we
may need to tickle the one that is running the latest deadline vCPU,
which may be another one.

Still, I think I gave enough material for an actual optimization. What
do you think?

Oh, just out of the top of my head, I think that to implement the above
more easily and effectively, it would make sense to track the idle and
the already tickled pCPUs at a scheduler-wide level. If that reveals to
be true, have a look at how Credit2 is doing it, it's not that
difficult.

Let's go back to cases above, and look at (3). Well, right now, I don't
think I see much alternatives to putting the tickling in the loop, and
(potentially) issuing one call to runq_tickle() for each vCPU that an
execution instance of the replenishment handler actually replenishes.
That may not look cheap, but:
 - it won't be so common that we'll replenish tens of vCPUs in a single
   instance. Ideally, each execution of the handler will replenish (and
   hence insert) only one vCPU (and hence tickle only one pCPU), with
   the loop being there to cover for non-ideal scenarios due to
   overhead, etc;
 - with the optimization to the tickling logic suggested above, it
   should be a bit cheaper to issue multiple tickling calls;

So, I think I've said everything I wanted to say, and I hope this is a
bit more clear now.

Dagaen, Meng, any question?

I really think that, if we manage to implement all this, code quality
and performance would improve a lot. Oh, considering all the various and
(logically) different changes that I've hinted at, the patch needs to
become a patch series! :-D

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.