Xen project Mailing List

Re: [Xen-devel] [PATCH v2] Modified RTDS scheduler to use an event-driven model instead of polling.

On Mon, 2015-07-06 at 22:51 -0700, Meng Xu wrote: > >> > +static void replenishment_handler(void* data) > >> > +{ > >> > + const struct scheduler *ops = data; > >> > + struct rt_private *prv = rt_priv(ops); > >> > + struct list_head *depletedq = rt_depletedq(ops); > >> > + struct list_head *iter; > >> > + struct list_head *tmp; > >> > + struct rt_vcpu *svc = NULL; > >> > + s_time_t now = NOW(); > >> > + int new_position = 0; > >> > + unsigned long flags; > >> > + > >> > + spin_lock_irqsave(&prv->lock, flags); > >> > + > >> > + // Replenish the vCPU that triggered this, and resort it into runq > >> > + list_for_each_safe(iter, tmp, depletedq) > >> > + { > >> > + svc = __q_elem(iter); > >> > + if ( now >= svc->cur_deadline ) > >> > + { > >> > + rt_update_deadline(now, svc); > >> > + __q_remove(svc); /* remove from depleted queue */ > >> > + new_position = __runq_insert(ops, svc); /* add to runq */ > >> > + } > >> > + else break; // leave list_for_each_safe > >> > + } > >> > + > >> > + // If we become one of top [# CPUs] in the runq, tickle it > >> > + // TODO: make this work when multiple tickles are required > >> > >> Why do you need multiple tickles? > >> > > Because the loop above may have been done more than one replenishment, > > which makes sense. > > Ah-ha, I got the reason.I should have read it twice to figure it out. :-) > > > > While we are here. I've said that I don't like the fact that we have one > > big spinlock for the whole scheduler. However, we do have it right now, > > so let's make good use of it. > > OK. > > > > > So (but please double check what I'm about to say, and provide your > > view), it should be safe to access the various svc->vc->processor from > > inside here, since we're holding the big log. If that's the case, then > > it should be fairly easy to figure out which are the pCPUs that needs to > > reschedule (playing with the index, etc), and then use their > > vc->processor to decide what pCPU to actually tickle, isn't this the > > case? > > Hmm, I don't quite get what you meant here although I read it for > three times. :-( > Yeah, sorry, my bad. :-/ > Did you mean we can decide which PCPUs to tickle for the several > "highest" priority vcpus after their budgets get replenished? > If so, doesn't that mean that we will "duplicate" (part of) the > runq_tickle code to find the pCPUs to preempt? Is it better than the > approach that just call runq_tickle each time whenever a high priority > "ready" VCPU is found? > I was just sketching a possible improvement, which as usual is something ttivially done, without actually writing the code. However, let me try again. Let's look at runq_tickle(). It is, right now, called from: (1) rt_vcpu_wake(), and it _does_belong_ in there (at least, the logic it implements does), but it's called in a way that I really don't like. In fact, there is a call to __repl_update(), followed by a __runq_pick(), and I think both should go away; (2) in rt_context_saved(), again, following a replenishment update and a pick. Same as above, I'd love for the update+pick to be kick out of here as well; (3) with Dagaen patch, it's called from the replenishment handler, and again I think the rescheduling request logic (i.e., what runq_tickle() currently implements) is fine in there, although how this is done needs to improve to support tickling multiple pCPUs. Let's look at these one by one. 1) rt_vcpu_wake(): You are waking up vCPU i, and you're doing it right until the call to __runq_insert(), where you insert the vCPU on the runq. But then, why do you call __repl_update()? Why, at every vCPU wakeup, you want to scan the whole runqueue updating everyone's budget? I mean, I (not completely, but, well...) understand why you've been doing this _until_now_, but after Dagaen patch, we have the replenishment timer, and replenishments will be done by the replenishment timer handler, so you don't need to do this in here any longer! So, basically, I would like Dagaen, as a part of this really cool work he's doing, to remove completely the __repl_update() and __runq_pick() logic from rt_vcpu_wake(). Then, it is ok to call runq_tickle(), but you'll be calling it like this: runq_tickle(ops, svc); i.e., you'll go checking whether the vCPU that just woke up needs to preempt any other one currently running on a pCPU, AND THAT'S IT! We're dealing with a vCPU waking up, not with a wakeup+replenishment +pick+reschedule... Let's keep functions focused on their own purpose, as we were saying in the previous thread, ok? :-) So, as far as this function is concerned, it is ok to call runq_tickle(), and runq_tickle() is doing the right thing already, in the way it's currently implemented, provided it's called _on_ the waking vCPU, not on something else. 2) rt_context_saved() This looks a lot similar to (1). In fact, the replenishment update and the pick should be removed from here completely. In fact, with a similar reasoning to the above case, what is this function doing? It is inserting a vCPU in the runq. It's actually rather similar to a wake-up, the only difference being the wake-up is an actual insertion (i.e., the vCPU --most times-- was not there before), while in this case it is a re-insertion (i.e., the vCPU was there already, but we needed to delay fiddling with it). So, indeed, everything just said in (1) applies here too, and I'd like this function to be restructured accordingly (kill __repl_update(), kill __runq_pick(), kill snext, and call runq_tickle(ops, svc)). Now, allow me to stop for a sec, and make some considerations. Dagaen, in his patch, is changing __runq_insert(), making it return the position in the runqueue where the vCPU has been inserted. This is actually a good thing, IMO. In different, and generally more scalable implementations of global EDF, e.g., where you have multiple runqueues, and hence multiple spinlocks, doing something like this is an absolute nightmare. More specifically, what is really really difficult (not to implement, but to implement in a way that does not defeat the scalability benefits of splitting runqueues) is to actually have a reliable and low overhead way of looking at other queues and/or pCPUs, to figure out what vCPUs are running there, and whether or not the one we are dealing with (e.g., during a vCPU wakeup) should preempt any one of them. E.g., if you look at the Linux EDF implementation, which uses different queues, there is a whole data structure dedicate to that, implemented in its own source file! Here, everything is below the umbrella of the big, scheduler wide, spinlock... So, really, it's rather easy to keep track of that, and that is actually what runq_tickle() is doing (well, it does not keep track of it, it goes figuring that out), and what Dagaen is doing by returning the index is exactly another way to do that. That's actually what I was (badly) trying to hint at in my previous mail. Certainly, we don't want to duplicate code from runq_tickle() by cutting-&-pasting it around... But maybe we can change things in a way that they're even simpler than they are now (which is really what I hoped, when thinking to the outcomes of this work, and at using this design for it! :-D). So, it looks to me that, as far as (1) and (2) are concerned, since we are "just" inserting a vCPU in the runq, if we have M pCPUs, and we know whether we inserted it within the first M spots, we already have what we want, or am I missing something? And if __runq_insert() now (with Dagaen patch) tells us this, well, we can simplify the tickling logic, can't we? With an example: We are waking up (or re-inserting, in rt_context_saved()) vCPU j. We have 6 pCPUs. __runq_insert() tells us that it put vCPU j at the 3rd place in the runq. This means vCPU j should be set to run as soon as possible. So, if vCPU j is 3rd in runq, either (a) there are only 3 runnable vCPUs (i.e., if we are waking up j, there were 2 of them, and j is the third; if we are in context_saved, there already where 3, and j just got it's deadline postponed, or someone else got its one replenished); (b) there are more than 3 runnable vCPUs, i.e., there is at least a 4th vCPU --say vCPU k-- in the runq, which was the 3rd before vCPU j were woken (or re-inserted), but now became the 4th, because deadline(j)<deadline(k). In case (a), there are for sure idle pCPUs, and we should tickle one of them. In case (b) there may be idle pCPUs (and, if that's the case, we should tickle one of them, of course) or not. If not, we need to go figure out which pCPU to tickle, which is exactly what runq_tickle() does, but we at least know for sure that we want to tickle the pCPU where vCPU k runs, or others where vCPUs with deadline greater than vCPU k run. Does this make sense? If yes, I think this means we can (and hence should) restructure runq_tickle() in order to make it behave as above, as I think it would be more quick and effective. It's probably necessary to turn the for_each_cpu() loop in there, in a runq scan, but one that only starts from a specific vCPU (the one that is ->next wrt the newly inserted vCPU, k, in the example above), and stopping at the M-eth vCPU in the queue. It's probably necessary to have not only the index, but also an rt_vcpu pointer, from __runq_insert() (or maybe not, if svc is already available, and we chan check and traverse its q_elem->next pointer), and to pass either (or both) to runq_tickle()... But this is better determined while implementing things. Honestly, I originally thought that runq_tickle() could be simplified even more, by looking at the svc->vc->processor field of the vCPU that our vCPU is inserted before. I see now, however, that this can't be done, as we, not always tickle exactly the pCPU of the vCPU that is now next to us (i.e., the pCPU where vCPU k runs, in the example), but we may need to tickle the one that is running the latest deadline vCPU, which may be another one. Still, I think I gave enough material for an actual optimization. What do you think? Oh, just out of the top of my head, I think that to implement the above more easily and effectively, it would make sense to track the idle and the already tickled pCPUs at a scheduler-wide level. If that reveals to be true, have a look at how Credit2 is doing it, it's not that difficult. Let's go back to cases above, and look at (3). Well, right now, I don't think I see much alternatives to putting the tickling in the loop, and (potentially) issuing one call to runq_tickle() for each vCPU that an execution instance of the replenishment handler actually replenishes. That may not look cheap, but: - it won't be so common that we'll replenish tens of vCPUs in a single instance. Ideally, each execution of the handler will replenish (and hence insert) only one vCPU (and hence tickle only one pCPU), with the loop being there to cover for non-ideal scenarios due to overhead, etc; - with the optimization to the tickling logic suggested above, it should be a bit cheaper to issue multiple tickling calls; So, I think I've said everything I wanted to say, and I hope this is a bit more clear now. Dagaen, Meng, any question? I really think that, if we manage to implement all this, code quality and performance would improve a lot. Oh, considering all the various and (logically) different changes that I've hinted at, the patch needs to become a patch series! :-D Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.