
Re: [Xen-devel] [PATCH v2 1/4] xen: add real time scheduler rt



2014-09-08 14:44 GMT-04:00 George Dunlap <george.dunlap@xxxxxxxxxxxxx>:
> On 09/07/2014 08:40 PM, Meng Xu wrote:
>>
>> This scheduler follows the Preemptive Global Earliest Deadline First
>> (EDF) theory from the real-time scheduling field.
>> At any scheduling point, the VCPU with the earlier deadline has higher
>> priority. The scheduler always picks the highest priority VCPU to run on a
>> feasible PCPU.
>> A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> idle or has a lower-priority VCPU running on it).
>>
>> Each VCPU has a dedicated period and budget.
>> The deadline of a VCPU is at the end of each of its periods;
>> A VCPU has its budget replenished at the beginning of each of its periods;
>> While scheduled, a VCPU burns its budget.
>> The VCPU needs to finish its budget before its deadline in each period;
>> The VCPU discards its unused budget at the end of each of its periods.
>> If a VCPU runs out of budget in a period, it has to wait until the next
>> period.
>>
>> Each VCPU is implemented as a deferrable server.
>> When a VCPU has a task running on it, its budget is continuously burned;
>> When a VCPU has no task to run but still has budget left, its budget is preserved.
>>
>> Queue scheme: A global runqueue for each CPU pool.
>> The runqueue holds all runnable VCPUs.
>> VCPUs in the runqueue are divided into two parts: with and without budget.
>> In the first part, VCPUs are sorted by the EDF priority scheme.
>>
>> Note: cpumask and cpupool are supported.
>>
>> This is an experimental scheduler.
>>
>> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
>> Signed-off-by: Sisu Xi <xisisu@xxxxxxxxx>
>> ---
>>   xen/common/Makefile         |    1 +
>>   xen/common/sched_rt.c       | 1146 +++++++++++++++++++++++++++++++++++++++++++
>>   xen/common/schedule.c       |    1 +
>>   xen/include/public/domctl.h |    6 +
>>   xen/include/public/trace.h  |    1 +
>>   xen/include/xen/sched-if.h  |    1 +
>>   6 files changed, 1156 insertions(+)
>>   create mode 100644 xen/common/sched_rt.c
>>
>> diff --git a/xen/common/Makefile b/xen/common/Makefile
>> index 3683ae3..5a23aa4 100644
>> --- a/xen/common/Makefile
>> +++ b/xen/common/Makefile
>> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>>   obj-y += sched_credit2.o
>>   obj-y += sched_sedf.o
>>   obj-y += sched_arinc653.o
>> +obj-y += sched_rt.o
>>   obj-y += schedule.o
>>   obj-y += shutdown.o
>>   obj-y += softirq.o
>> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
>> new file mode 100644
>> index 0000000..412f8b1
>> --- /dev/null
>> +++ b/xen/common/sched_rt.c
>> @@ -0,0 +1,1146 @@
>>
>> +/******************************************************************************
>> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
>> + * EDF scheduling is a real-time scheduling algorithm used in the
>> + * embedded field.
>> + *
>> + * by Sisu Xi, 2013, Washington University in Saint Louis
>> + * and Meng Xu, 2014, University of Pennsylvania
>> + *
>> + * based on the code of the credit scheduler
>> + */
>> +
>> +#include <xen/config.h>
>> +#include <xen/init.h>
>> +#include <xen/lib.h>
>> +#include <xen/sched.h>
>> +#include <xen/domain.h>
>> +#include <xen/delay.h>
>> +#include <xen/event.h>
>> +#include <xen/time.h>
>> +#include <xen/perfc.h>
>> +#include <xen/sched-if.h>
>> +#include <xen/softirq.h>
>> +#include <asm/atomic.h>
>> +#include <xen/errno.h>
>> +#include <xen/trace.h>
>> +#include <xen/cpu.h>
>> +#include <xen/keyhandler.h>
>> +#include <xen/trace.h>
>> +#include <xen/guest_access.h>
>> +
>> +/*
>> + * TODO:
>> + *
>> + * Migration compensation and resistance, like in credit2, to make better use of caches;
>> + * Lock Holder Problem, using yield?
>> + * Self switch problem: VCPUs of the same domain may preempt each other;
>> + */
>> +
>> +/*
>> + * Design:
>> + *
>> + * This scheduler follows the Preemptive Global Earliest Deadline First
>> + * (EDF) theory from the real-time scheduling field.
>> + * At any scheduling point, the VCPU with the earlier deadline has higher
>> + * priority.
>> + * The scheduler always picks the highest-priority VCPU to run on a
>> + * feasible PCPU.
>> + * A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> + * idle or has a lower-priority VCPU running on it).
>> + *
>> + * Each VCPU has a dedicated period and budget.
>> + * The deadline of a VCPU is at the end of each of its periods;
>> + * A VCPU has its budget replenished at the beginning of each of its
>> + * periods;
>> + * While scheduled, a VCPU burns its budget.
>> + * The VCPU needs to finish its budget before its deadline in each
>> + * period;
>> + * The VCPU discards its unused budget at the end of each of its periods.
>> + * If a VCPU runs out of budget in a period, it has to wait until the
>> + * next period.
>> + *
>> + * Each VCPU is implemented as a deferrable server.
>> + * When a VCPU has a task running on it, its budget is continuously
>> + * burned;
>> + * When a VCPU has no task to run but still has budget left, its budget
>> + * is preserved.
>> + *
>> + * Queue scheme: A global runqueue for each CPU pool.
>> + * The runqueue holds all runnable VCPUs.
>> + * VCPUs in the runqueue are divided into two parts:
>> + * with and without remaining budget.
>> + * In the first part, VCPUs are sorted by the EDF priority scheme.
>> + *
>> + * Note: cpumask and cpupool are supported.
>> + */
>> +
>> +/*
>> + * Locking:
>> + * A global system lock is used to protect the RunQ.
>> + * The global lock is referenced by schedule_data.schedule_lock
>> + * from all physical cpus.
>> + *
>> + * The lock is already grabbed when calling the wake/sleep/schedule
>> + * functions in schedule.c
>> + * in schedule.c
>> + *
>> + * The functions that involve the RunQ and need to grab the lock are:
>> + *    vcpu_insert, vcpu_remove, context_saved, __runq_insert
>> + */
>> +
>> +
>> +/*
>> + * Default parameters:
>> + * The default period and budget are 10 ms and 4 ms, respectively.
>> + */
>> +#define RT_DS_DEFAULT_PERIOD     (MICROSECS(10000))
>> +#define RT_DS_DEFAULT_BUDGET     (MICROSECS(4000))
>> +
>> +/*
>> + * Flags
>> + */
>> +/*
>> + * RT_scheduled: Is this vcpu either running on, or context-switching
>> + * off, a physical cpu?
>> + * + Accessed only with Runqueue lock held.
>> + * + Set when chosen as next in rt_schedule().
>> + * + Cleared after context switch has been saved in rt_context_saved()
>> + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we
>> + *   should set RT_delayed_runq_add
>> + * + Checked to be false in runq_insert.
>> + */
>> +#define __RT_scheduled            1
>> +#define RT_scheduled (1<<__RT_scheduled)
>> +/*
>> + * RT_delayed_runq_add: Do we need to add this to the Runqueue once it is
>> + * done being context-switched out?
>> + * + Set when scheduling out in rt_schedule() if prev is runnable
>> + * + Set in rt_vcpu_wake if it finds RT_scheduled set
>> + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
>> + *   clears the bit.
>> + */
>> +#define __RT_delayed_runq_add     2
>> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
>> +
>> +/*
>> + * Debug only. Used to print out debug information
>> + */
>> +#define printtime()\
>> +        ({s_time_t now = NOW(); \
>> +          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
>> +          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
>> +
>> +/*
>> + * rt tracing events ("only" 512 available!). Check
>> + * include/public/trace.h for more details.
>> + */
>> +#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
>> +#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
>> +#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
>> +#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
>> +#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
>> +#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
>> +
>> +/*
>> + * System-wide private data, including the global RunQueue
>> + * Global lock is referenced by schedule_data.schedule_lock from all
>> + * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
>> + */
>> +struct rt_private {
>> +    spinlock_t lock;           /* The global coarse-grained lock */
>> +    struct list_head sdom;     /* list of available domains, used for dump */
>> +    struct list_head runq;     /* Ordered list of runnable VMs */
>> +    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
>> +    cpumask_t cpus;            /* cpumask_t of available physical cpus */
>> +    cpumask_t tickled;         /* cpus that have been tickled */
>> +};
>> +
>> +/*
>> + * Virtual CPU
>> + */
>> +struct rt_vcpu {
>> +    struct list_head runq_elem; /* On the runqueue list */
>> +    struct list_head sdom_elem; /* On the domain VCPU list */
>> +
>> +    /* Up-pointers */
>> +    struct rt_dom *sdom;
>> +    struct vcpu *vcpu;
>> +
>> +    /* VCPU parameters, in nanoseconds */
>> +    s_time_t period;
>> +    s_time_t budget;
>> +
>> +    /* Current VCPU information, in nanoseconds */
>> +    s_time_t cur_budget;        /* current budget */
>> +    s_time_t last_start;        /* last start time */
>> +    s_time_t cur_deadline;      /* current deadline for EDF */
>> +
>> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
>> +};
>> +
>> +/*
>> + * Domain
>> + */
>> +struct rt_dom {
>> +    struct list_head vcpu;      /* link its VCPUs */
>> +    struct list_head sdom_elem; /* link list on rt_priv */
>> +    struct domain *dom;         /* pointer to upper domain */
>> +};
>> +
>> +/*
>> + * Useful inline functions
>> + */
>> +static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
>> +{
>> +    return ops->sched_data;
>> +}
>> +
>> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
>> +{
>> +    return vcpu->sched_priv;
>> +}
>> +
>> +static inline struct rt_dom *RT_DOM(const struct domain *dom)
>> +{
>> +    return dom->sched_priv;
>> +}
>> +
>> +static inline struct list_head *RUNQ(const struct scheduler *ops)
>> +{
>> +    return &RT_PRIV(ops)->runq;
>> +}
>> +
>> +/*
>> + * RunQueue helper functions
>> + */
>> +static int
>> +__vcpu_on_runq(const struct rt_vcpu *svc)
>> +{
>> +    return !list_empty(&svc->runq_elem);
>> +}
>> +
>> +static struct rt_vcpu *
>> +__runq_elem(struct list_head *elem)
>> +{
>> +    return list_entry(elem, struct rt_vcpu, runq_elem);
>> +}
>> +
>> +/*
>> + * Debug related code, dump vcpu/cpu information
>> + */
>> +static void
>> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    char cpustr[1024];
>> +    cpumask_t *cpupool_mask;
>> +
>> +    ASSERT(svc != NULL);
>> +    /* flag vcpu */
>> +    if( svc->sdom == NULL )
>> +        return;
>> +
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
>> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
>> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
>> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>> +            svc->vcpu->domain->domain_id,
>> +            svc->vcpu->vcpu_id,
>> +            svc->vcpu->processor,
>> +            svc->period,
>> +            svc->budget,
>> +            svc->cur_budget,
>> +            svc->cur_deadline,
>> +            svc->last_start,
>> +            __vcpu_on_runq(svc),
>> +            vcpu_runnable(svc->vcpu),
>> +            cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
>> +    printk("cpupool=%s ", cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
>> +    printk("prv->cpus=%s\n", cpustr);
>> +
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            unsigned dom:16,vcpu:16;
>> +            unsigned processor;
>> +            unsigned cur_budget_lo, cur_budget_hi;
>> +            unsigned cur_deadline_lo, cur_deadline_hi;
>> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
>> +        } d;
>> +        d.dom = svc->vcpu->domain->domain_id;
>> +        d.vcpu = svc->vcpu->vcpu_id;
>> +        d.processor = svc->vcpu->processor;
>> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
>> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
>> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
>> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
>> +        trace_var(TRC_RT_VCPU_DUMP, 1,
>> +                  sizeof(d),
>> +                  (unsigned char *)&d);
>> +    }
>> +}
>> +
>> +static void
>> +rt_dump_pcpu(const struct scheduler *ops, int cpu)
>> +{
>> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
>> +
>> +    printtime();
>> +    rt_dump_vcpu(ops, svc);
>> +}
>> +
>> +/*
>> + * should not need lock here. only showing stuff
>> + */
>
>
> This isn't true -- you're walking both the runqueue and the lists of domains
> and vcpus, each of which may change under your feet.

I see.  So even when I only read (and never write) the runqueue, I
still need to hold the lock, because the lists can change under my
feet while I am walking them.  I will add the lock to these dump
functions.
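
For example, rt_dump() could hold the global lock for the whole walk,
something like this (untested sketch, assuming prv->lock may be taken
with interrupts enabled here):

    static void
    rt_dump(const struct scheduler *ops)
    {
        unsigned long flags;
        struct rt_private *prv = RT_PRIV(ops);

        spin_lock_irqsave(&prv->lock, flags);
        /* ... walk prv->runq and prv->sdom and print, as before ... */
        spin_unlock_irqrestore(&prv->lock, flags);
    }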

>
>> +
>> +        /* TRACE */
>> +        {
>> +            struct {
>> +                unsigned dom:16,vcpu:16;
>> +                unsigned cur_budget_lo, cur_budget_hi;
>> +            } d;
>> +            d.dom = svc->vcpu->domain->domain_id;
>> +            d.vcpu = svc->vcpu->vcpu_id;
>> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
>> +                      sizeof(d),
>> +                      (unsigned char *) &d);
>> +        }
>> +
>> +        return;
>> +    }
>> +}
>> +
>> +static inline void
>> +__runq_remove(struct rt_vcpu *svc)
>> +{
>> +    if ( __vcpu_on_runq(svc) )
>> +        list_del_init(&svc->runq_elem);
>> +}
>> +
>> +/*
>> + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
>> + * go first.
>> + */
>> +static void
>> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    spinlock_t *schedule_lock;
>> +
>> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
>> +    ASSERT( spin_is_locked(schedule_lock) );
>> +
>> +    ASSERT( !__vcpu_on_runq(svc) );
>> +
>> +    /* svc still has budget */
>> +    if ( svc->cur_budget > 0 )
>> +    {
>> +        list_for_each(iter, runq)
>> +        {
>> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
>> +            if ( iter_svc->cur_budget == 0 ||
>> +                 svc->cur_deadline <= iter_svc->cur_deadline )
>> +                    break;
>> +        }
>> +        list_add_tail(&svc->runq_elem, iter);
>> +    }
>> +    else
>> +    {
>> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
>> +    }
>
>
> OK, this thing with the "flag vcpu" seems a bit weird.  Why not just have
> two queues -- a runq and a depletedq.  You don't need to have another
> function; you just add it to depleted_runq rather than runq in
> __runq_insert().  Then you don't have to have this "cur_budget==0" stuff.
> The only extra code you'd have is (I think) in __repl_update().

I would need to add a little extra code, like a static inline
DEPLETEDQ() helper to get the depletedq from struct rt_private, and
the DepletedQ's helper functions, like __vcpu_on_depletedq(), etc.
But that code is small, so yes, I will change it to two queues in the
next version.
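
Roughly what I have in mind (untested sketch; depletedq would be a new
list_head in struct rt_private, initialized alongside the runq):

    static inline struct list_head *DEPLETEDQ(const struct scheduler *ops)
    {
        return &RT_PRIV(ops)->depletedq;
    }

    static void
    __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        struct list_head *iter;

        ASSERT( !__vcpu_on_runq(svc) );

        if ( svc->cur_budget > 0 )
        {
            /* The runq is kept sorted by cur_deadline (EDF). */
            list_for_each(iter, RUNQ(ops))
            {
                struct rt_vcpu * iter_svc = __runq_elem(iter);
                if ( svc->cur_deadline <= iter_svc->cur_deadline )
                    break;
            }
            list_add_tail(&svc->runq_elem, iter);
        }
        else
            /* Depleted vcpus are unsorted; they all wait for replenishment. */
            list_add_tail(&svc->runq_elem, DEPLETEDQ(ops));
    }

The flag_vcpu and the cur_budget == 0 check in the sorting loop would
then go away entirely.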

>> +
>> +/*
>> + * Burn budget in nanosecond granularity
>> + */
>> +static void
>> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
>> +{
>> +    s_time_t delta;
>> +
>> +    /* don't burn budget for idle VCPU */
>> +    if ( is_idle_vcpu(svc->vcpu) )
>> +        return;
>> +
>> +    rt_update_helper(now, svc);
>> +
>> +    /* do not burn budget when the vcpu misses its deadline */
>> +    if ( now >= svc->cur_deadline )
>> +        return;
>> +
>> +    /* burn at nanoseconds level */
>> +    delta = now - svc->last_start;
>> +    /*
>> +     * delta < 0 only happens in nested virtualization;
>> +     * TODO: how should we handle delta < 0 in a better way?
>
>
> I think what I did in credit2 was just
>
> if(delta < 0) delta = 0;
>
> What you're doing here basically takes away an entire budget when the time
> goes backwards for whatever reason.  Much better, it seems to me, to just
> give the vcpu some "free" time and deal with it. :-)

I will remove the svc->cur_budget = 0 line, so the vcpu's budget is
not zeroed when time goes backwards: if the vcpu still has some budget
left in this period and has a high enough priority, it should be able
to run.
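
The end of burn_budgets() would then look something like this
(untested sketch, keeping the rest of the function as in this patch):

    /* burn at nanoseconds level */
    delta = now - svc->last_start;
    if ( delta < 0 )
        delta = 0;    /* time went backwards: give the vcpu "free" time */

    svc->cur_budget -= delta;
    if ( svc->cur_budget < 0 )
        svc->cur_budget = 0;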

>> +
>> +/*
>> + * Update each vcpu's budget and keep the runq sorted by re-inserting the
>> + * modified vcpus back into the runq.
>> + * The lock is grabbed before calling this function.
>> + */
>> +static void
>> +__repl_update(const struct scheduler *ops, s_time_t now)
>> +{
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    struct list_head *tmp;
>> +    struct rt_vcpu *svc = NULL;
>> +
>> +    list_for_each_safe(iter, tmp, runq)
>> +    {
>> +        svc = __runq_elem(iter);
>> +
>> +        /* do not update the flag_vcpu's budget */
>> +        if(svc->sdom == NULL)
>> +            continue;
>> +
>> +        rt_update_helper(now, svc);
>> +        /* reinsert the vcpu if its deadline is updated */
>> +        if ( now >= 0 )
>
>
> Uum, when is this ever not going to be >= 0?  The comment here seems
> completely inaccurate...

My bad, this is incorrect. :-(  It should be diff >= 0, where diff is
now - svc->cur_deadline.  Sorry, will change it in the next patch.

>
> Also, it seems like you could make this a bit more efficient by pulling the
> check into this loop itself, rather than putting it in the helper function.
> Since the queue is sorted by deadline, you could stop processing once you
> reach one for which now < cur_deadline, since you know all subsequent ones
> will be even later than this one.
>
> Of course, that wouldn't take care of the depleted ones, but if those were
> already on a separate queue, you'd be OK. :-)

Sure! Will do that.
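
With the depletedq split out, the runq pass of __repl_update() could
then look roughly like this (untested sketch; the depletedq would get
a similar walk that moves replenished vcpus back onto the runq):

    list_for_each_safe(iter, tmp, runq)
    {
        svc = __runq_elem(iter);

        /* The runq is sorted by deadline: later entries are even later. */
        if ( now < svc->cur_deadline )
            break;

        rt_update_helper(now, svc);   /* replenish budget, update deadline */
        /* ... and re-insert svc so the queue stays sorted ... */
    }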

>
> Right, past time for me to go home... I've given a quick scan over the other
> things and nothing jumped out at me, but I'll come back to it again tomorrow
> and see how we fare.

Thank you so much for your comments and time! I really appreciate it
and will tackle these comments in the next version asap.

Thanks!

Meng



-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel