Xen project Mailing List

Re: [Xen-devel] [Design RFC] Towards work-conserving RTDS scheduler

To: Meng Xu <mengxu@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

From: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Date: Mon, 8 Aug 2016 11:38:31 +0200

Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>

Delivery-date: Mon, 08 Aug 2016 09:39:16 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Thu, 2016-08-04 at 01:15 -0400, Meng Xu wrote:
> Hi Dario,
>
Hi,

> I'm thinking about changing the current RTDS scheduler to
> work-conserving version as we briefly discussed before.
> Below is a design of the work-conserving RTDS.
> I'm hoping to get your feedback about the design ideas first before I
> start writing it in code.
>
Here I am, sorry for the delay.

> I think the code change should not be a lot as long as we don't
> provide the functionality of switching between work-conserving and
> non-work-conserving. Because the following design will keep the
> real-time property of the current RTDS scheduler, I don't see the
> reason why we should let users switch to non-work-conserving version.
> :-)
>
Oh, but there's a big one: _money_! :-O

If you're a service/cloud provider, you may or may not want a customer
who pays for a 40% utilization VM to be able to use more than that. In
particular, you may want to charge them more in order to enable that
possibility! :-P

Anyway, I don't think --with this design of yours-- that it is such a
big deal to make it possible to switch work-conserving*ness on and off
(see below). Actually, I think it's even possible to do that on a
per-vcpu basis, which I think would be quite cool!

> --- Below is the design ---
>
> [...]
>
> *** Requirement of the work-conserving RTDS scheduler ***
> 1) The new RTDS scheduler should be work-conserving, of course.
> 2) The new RTDS scheduler should not break any real-time guarantee
> provided by the current RTDS scheduler.
>
> *** Design of Work-Conserving RTDS Scheduler ***
> VCPU model
> 1) (Period, Budget): Guaranteed <Budget> time for each <Period>
> 2) Priority index: It indicates the current priority level of the
> VCPU. When a VCPU’s budget is depleted in the current period, its
> priority index will increase by 1 and its budget will be replenished.
> 3) A VCPU’s budget and priority index will be reset at the beginning
> of each period
>
Ok, I think I see what you mean and it seems to make sense to me. Just
one question/observation.

As you know, I come from a CBS mindset. CBS postpones a task/vcpu's
deadline when it runs out of budget, and it can, natively, work in
work conserving or non-work conserving mode (just by either continuing
to consider the vcpu runnable, with the later deadline, which means
demoted priority, or blocking it until the next period, respectively).

The nice thing about this is that the scheduling analysis that has
been developed works for both modes. Of course, what it says is that
you can only guarantee to each vcpu the reserved utilization, and you
should not rely on the additional capacity that you may be getting
because you're in work conserving mode and the system happened to be
idle for some time in this or that period (so, very similar to what
you're proposing).

_HOWEVER_, there are evolutions of CBS (called GRUB and SHRUB, I'm
sure you'll be able to find the papers), where the 'unused bandwidth'
(i.e., the otherwise idle time that you're making use of iff you're in
work conserving mode) is distributed in a precise way (according to
some weights, IIRC) to the various vcpus, hence making scheduling
analysis both possible and useful again.

Now, I'm not at all saying that we (you! :-D) should turn RTDS into
using CBS(ish) or anything like that. I'm just thinking out loud and
wondering:
 - could it be useful to have a scheduling analysis in place for the
   scheduler in work conserving mode (one, of course, that takes into
   account and gives guarantees on the otherwise idle bandwidth... I
   know that the existing one holds! :-P) ?
 - if yes, do you already have one --or do you think it will be
   possible to develop one-- for your priority-index based model?

Note that I'm not saying you should, and I'd be perfectly fine with a
"no analysis, but let's keep things simple for now"...
This just came to my mind, and I'm just pointing it out, to make sure
we consider and think about it, and make a conscious decision.

> Scheduling policy: modified gEDF
> 1) Priority comparison:
>    a) VCPUs with lower priority index has higher priority than VCPUs
> with higher priority index
>    b) VCPUs with same priority index uses gEDF policy to decide the
> priority order
> 2) Scheduling point
>    a) VCPU’s budget is depleted for the current priority index
>    b) VCPU starts a new period
>    c) VCPU is blocked or waked up
> 3) Scheduling decision is made when scheduler is invoked
>    a) Always pick the current M highest-priority VCPUs to run on the
> M cores.
>
So, still about the analysis point above, and just off the top of my
head (and without being used to doing these things any longer!!), it
looks like it's possible to think of some analysis for this.

In fact, since:
 - vcpus with different priority indexes are totally disjoint sets,
 - there's a strict ordering between priority indexes,
 - vcpus sort of use their scheduling parameters at each priority
   index

this looks to me like vcpus are subject to a "hierarchy" of RTDS
schedulers, the one at level x+1 running in the idle time of the one
at level x... And I think there's scope for writing down some maths
formulas that model this situation. :-)

Actually, it's quite likely that you either have already noticed this
and done the analysis, or that someone else in the literature has done
something similar --maybe with other schedulers-- before.

Anyway, the idea itself looks fair enough to me. I'd like to hear, if
that's fine with you, how you plan to actually implement it, as there
of course are multiple different ways to do it, and there are, IMO, a
couple of things that should be kept in mind.

Finally, about the work-conserving*ness on-off switch, what added
difficulty or increase in code complexity prevents us from doing,
instead of this:

 "2) Priority index: It indicates the current priority level of the
     VCPU. When a VCPU’s budget is depleted in the current period,
     its priority index will increase by 1 and its budget will be
     replenished."

something like this:

 "2) Priority index: It indicates the current priority level of the
     VCPU. When a VCPU's budget is depleted in the current period:
      2a) if the VCPU has the work conserving flag set, its priority
          index will be increased by 1, and its budget replenished;
      2b) if the VCPU has the work conserving flag clear, it's
          blocked until next period."

?

Thanks and Regards,
Dario
---
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
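
For illustration only, the priority-index rules and the 2a/2b variant discussed above could be sketched roughly as follows. This is a toy Python model, not Xen code: the `VCPU` class, the `work_conserving` and `throttled` fields, and the `pick()` helper are all hypothetical names made up for this sketch, time is in arbitrary units, and "blocked until next period" is modelled with a simple flag.

```python
from dataclasses import dataclass


@dataclass
class VCPU:
    name: str
    period: int                    # length of one period (arbitrary time units)
    budget: int                    # guaranteed run time per period
    work_conserving: bool = True   # hypothetical per-VCPU flag (2a vs 2b)
    deadline: int = 0              # absolute deadline = end of current period
    cur_budget: int = 0            # budget left at the current priority index
    prio_index: int = 0            # 0 is the highest priority level
    throttled: bool = False        # True while held back until the next period

    def start_period(self, now: int) -> None:
        """Reset budget and priority index at the beginning of a period."""
        self.deadline = now + self.period
        self.cur_budget = self.budget
        self.prio_index = 0
        self.throttled = False

    def on_budget_depleted(self) -> None:
        """Apply rule 2a or 2b, depending on the (assumed) flag."""
        if self.work_conserving:
            # 2a) demote by one priority level and replenish the budget
            self.prio_index += 1
            self.cur_budget = self.budget
        else:
            # 2b) do not run again until the next period starts
            self.throttled = True


def pick(vcpus, m):
    """Return the M highest-priority runnable VCPUs.

    Lower priority index wins; ties are broken by earliest deadline (gEDF).
    """
    runnable = [v for v in vcpus if not v.throttled]
    runnable.sort(key=lambda v: (v.prio_index, v.deadline))
    return runnable[:m]
```

With two cores and three VCPUs, a work-conserving VCPU that runs out of budget drops behind every VCPU still at priority index 0 but stays runnable, while a non-work-conserving one disappears from the candidates until its next `start_period()`.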


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
