
Re: [Xen-devel] [PATCH v2 1/4] xen: add real time scheduler rt



2014-09-08 14:44 GMT-04:00 George Dunlap <george.dunlap@xxxxxxxxxxxxx>:
> On 09/07/2014 08:40 PM, Meng Xu wrote:
>>
>> This scheduler follows the Preemptive Global Earliest Deadline First
>> (EDF) theory from the real-time scheduling field.
>> At any scheduling point, the VCPU with the earlier deadline has higher
>> priority. The scheduler always picks the highest priority VCPU to run on a
>> feasible PCPU.
>> A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> idle or has a lower-priority VCPU running on it).
>>
>> Each VCPU has a dedicated period and budget.
>> The deadline of a VCPU is at the end of each of its periods;
>> A VCPU has its budget replenished at the beginning of each of its periods;
>> While scheduled, a VCPU burns its budget.
>> The VCPU needs to finish its budget before its deadline in each period;
>> The VCPU discards its unused budget at the end of each of its periods.
>> If a VCPU runs out of budget in a period, it has to wait until the next
>> period.
>>
>> Each VCPU is implemented as a deferrable server.
>> When a VCPU has a task running on it, its budget is continuously burned;
>> When a VCPU has no task to run but still has budget left, its budget is preserved.
>>
>> Queue scheme: A global runqueue for each CPU pool.
>> The runqueue holds all runnable VCPUs.
>> VCPUs in the runqueue are divided into two parts: with and without budget.
>> In the first part, VCPUs are sorted by the EDF priority scheme.
>>
>> Note: cpumask and cpupool are supported.
>>
>> This is an experimental scheduler.
>>
>> Signed-off-by: Meng Xu <mengxu@xxxxxxxxxxxxx>
>> Signed-off-by: Sisu Xi <xisisu@xxxxxxxxx>
>> ---
>>   xen/common/Makefile         |    1 +
>>   xen/common/sched_rt.c       | 1146 +++++++++++++++++++++++++++++++++++++++++++
>>   xen/common/schedule.c       |    1 +
>>   xen/include/public/domctl.h |    6 +
>>   xen/include/public/trace.h  |    1 +
>>   xen/include/xen/sched-if.h  |    1 +
>>   6 files changed, 1156 insertions(+)
>>   create mode 100644 xen/common/sched_rt.c
>>
>> diff --git a/xen/common/Makefile b/xen/common/Makefile
>> index 3683ae3..5a23aa4 100644
>> --- a/xen/common/Makefile
>> +++ b/xen/common/Makefile
>> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>>   obj-y += sched_credit2.o
>>   obj-y += sched_sedf.o
>>   obj-y += sched_arinc653.o
>> +obj-y += sched_rt.o
>>   obj-y += schedule.o
>>   obj-y += shutdown.o
>>   obj-y += softirq.o
>> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
>> new file mode 100644
>> index 0000000..412f8b1
>> --- /dev/null
>> +++ b/xen/common/sched_rt.c
>> @@ -0,0 +1,1146 @@
>>
>> +/******************************************************************************
>> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
>> + * EDF scheduling is a real-time scheduling algorithm used in the
>> + * embedded field.
>> + *
>> + * by Sisu Xi, 2013, Washington University in Saint Louis
>> + * and Meng Xu, 2014, University of Pennsylvania
>> + *
>> + * based on the code of the credit scheduler
>> + */
>> +
>> +#include <xen/config.h>
>> +#include <xen/init.h>
>> +#include <xen/lib.h>
>> +#include <xen/sched.h>
>> +#include <xen/domain.h>
>> +#include <xen/delay.h>
>> +#include <xen/event.h>
>> +#include <xen/time.h>
>> +#include <xen/perfc.h>
>> +#include <xen/sched-if.h>
>> +#include <xen/softirq.h>
>> +#include <asm/atomic.h>
>> +#include <xen/errno.h>
>> +#include <xen/trace.h>
>> +#include <xen/cpu.h>
>> +#include <xen/keyhandler.h>
>> +#include <xen/trace.h>
>> +#include <xen/guest_access.h>
>> +
>> +/*
>> + * TODO:
>> + *
>> + * Migration compensation and resistance, like in credit2, to make better use of caches;
>> + * Lock Holder Problem, using yield?
>> + * Self switch problem: VCPUs of the same domain may preempt each other;
>> + */
>> +
>> +/*
>> + * Design:
>> + *
>> + * This scheduler follows the Preemptive Global Earliest Deadline First
>> + * (EDF) theory from the real-time scheduling field.
>> + * At any scheduling point, the VCPU with the earlier deadline has higher
>> + * priority.
>> + * The scheduler always picks the highest-priority VCPU to run on a
>> + * feasible PCPU.
>> + * A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> + * idle or has a lower-priority VCPU running on it).
>> + *
>> + * Each VCPU has a dedicated period and budget.
>> + * The deadline of a VCPU is at the end of each of its periods;
>> + * A VCPU has its budget replenished at the beginning of each of its
>> + * periods;
>> + * While scheduled, a VCPU burns its budget.
>> + * The VCPU needs to finish its budget before its deadline in each
>> + * period;
>> + * The VCPU discards its unused budget at the end of each of its periods.
>> + * If a VCPU runs out of budget in a period, it has to wait until the
>> + * next period.
>> + *
>> + * Each VCPU is implemented as a deferrable server.
>> + * When a VCPU has a task running on it, its budget is continuously
>> + * burned;
>> + * When a VCPU has no task to run but still has budget left, its budget
>> + * is preserved.
>> + *
>> + * Queue scheme: A global runqueue for each CPU pool.
>> + * The runqueue holds all runnable VCPUs.
>> + * VCPUs in the runqueue are divided into two parts:
>> + * with and without remaining budget.
>> + * In the first part, VCPUs are sorted by the EDF priority scheme.
>> + *
>> + * Note: cpumask and cpupool are supported.
>> + */
>> +
>> +/*
>> + * Locking:
>> + * A global system lock is used to protect the RunQ.
>> + * The global lock is referenced by schedule_data.schedule_lock
>> + * from all physical cpus.
>> + *
>> + * The lock is already grabbed when calling the wake/sleep/schedule
>> + * functions in schedule.c
>> + * in schedule.c
>> + *
>> + * The functions that involve the RunQ and need to grab the lock are:
>> + *    vcpu_insert, vcpu_remove, context_saved, __runq_insert
>> + */
>> +
>> +
>> +/*
>> + * Default parameters:
>> + * The default period and budget are 10 ms and 4 ms, respectively.
>> + */
>> +#define RT_DS_DEFAULT_PERIOD     (MICROSECS(10000))
>> +#define RT_DS_DEFAULT_BUDGET     (MICROSECS(4000))
>> +
>> +/*
>> + * Flags
>> + */
>> +/*
>> + * RT_scheduled: Is this vcpu either running on, or context-switching
>> + * off, a physical cpu?
>> + * + Accessed only with Runqueue lock held.
>> + * + Set when chosen as next in rt_schedule().
>> + * + Cleared after context switch has been saved in rt_context_saved()
>> + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we
>> + *   should set RT_delayed_runq_add
>> + * + Checked to be false in runq_insert.
>> + */
>> +#define __RT_scheduled            1
>> +#define RT_scheduled (1<<__RT_scheduled)
>> +/*
>> + * RT_delayed_runq_add: Do we need to add this to the Runqueue once it is
>> + * done being context-switched out?
>> + * + Set when scheduling out in rt_schedule() if prev is runnable
>> + * + Set in rt_vcpu_wake if it finds RT_scheduled set
>> + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
>> + *   clears the bit.
>> + */
>> +#define __RT_delayed_runq_add     2
>> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
>> +
>> +/*
>> + * Debug only. Used to print out debug information
>> + */
>> +#define printtime()\
>> +        ({s_time_t now = NOW(); \
>> +          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
>> +          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
>> +
>> +/*
>> + * rt tracing events ("only" 512 available!). Check
>> + * include/public/trace.h for more details.
>> + */
>> +#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
>> +#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
>> +#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
>> +#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
>> +#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
>> +#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
>> +
>> +/*
>> + * System-wide private data, including the global RunQueue
>> + * Global lock is referenced by schedule_data.schedule_lock from all
>> + * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
>> + */
>> +struct rt_private {
>> +    spinlock_t lock;           /* The global coarse-grained lock */
>> +    struct list_head sdom;     /* list of available domains, used for dump */
>> +    struct list_head runq;     /* Ordered list of runnable VMs */
>> +    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
>> +    cpumask_t cpus;            /* cpumask_t of available physical cpus */
>> +    cpumask_t tickled;         /* cpus that have been tickled */
>> +};
>> +
>> +/*
>> + * Virtual CPU
>> + */
>> +struct rt_vcpu {
>> +    struct list_head runq_elem; /* On the runqueue list */
>> +    struct list_head sdom_elem; /* On the domain VCPU list */
>> +
>> +    /* Up-pointers */
>> +    struct rt_dom *sdom;
>> +    struct vcpu *vcpu;
>> +
>> +    /* VCPU parameters, in nanoseconds */
>> +    s_time_t period;
>> +    s_time_t budget;
>> +
>> +    /* Current VCPU information, in nanoseconds */
>> +    s_time_t cur_budget;        /* current budget */
>> +    s_time_t last_start;        /* last start time */
>> +    s_time_t cur_deadline;      /* current deadline for EDF */
>> +
>> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
>> +};
>> +
>> +/*
>> + * Domain
>> + */
>> +struct rt_dom {
>> +    struct list_head vcpu;      /* link its VCPUs */
>> +    struct list_head sdom_elem; /* link list on rt_priv */
>> +    struct domain *dom;         /* pointer to upper domain */
>> +};
>> +
>> +/*
>> + * Useful inline functions
>> + */
>> +static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
>> +{
>> +    return ops->sched_data;
>> +}
>> +
>> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
>> +{
>> +    return vcpu->sched_priv;
>> +}
>> +
>> +static inline struct rt_dom *RT_DOM(const struct domain *dom)
>> +{
>> +    return dom->sched_priv;
>> +}
>> +
>> +static inline struct list_head *RUNQ(const struct scheduler *ops)
>> +{
>> +    return &RT_PRIV(ops)->runq;
>> +}
>> +
>> +/*
>> + * RunQueue helper functions
>> + */
>> +static int
>> +__vcpu_on_runq(const struct rt_vcpu *svc)
>> +{
>> +    return !list_empty(&svc->runq_elem);
>> +}
>> +
>> +static struct rt_vcpu *
>> +__runq_elem(struct list_head *elem)
>> +{
>> +    return list_entry(elem, struct rt_vcpu, runq_elem);
>> +}
>> +
>> +/*
>> + * Debug related code, dump vcpu/cpu information
>> + */
>> +static void
>> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    char cpustr[1024];
>> +    cpumask_t *cpupool_mask;
>> +
>> +    ASSERT(svc != NULL);
>> +    /* flag vcpu */
>> +    if( svc->sdom == NULL )
>> +        return;
>> +
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
>> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
>> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
>> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>> +            svc->vcpu->domain->domain_id,
>> +            svc->vcpu->vcpu_id,
>> +            svc->vcpu->processor,
>> +            svc->period,
>> +            svc->budget,
>> +            svc->cur_budget,
>> +            svc->cur_deadline,
>> +            svc->last_start,
>> +            __vcpu_on_runq(svc),
>> +            vcpu_runnable(svc->vcpu),
>> +            cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
>> +    printk("cpupool=%s ", cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
>> +    printk("prv->cpus=%s\n", cpustr);
>> +
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            unsigned dom:16,vcpu:16;
>> +            unsigned processor;
>> +            unsigned cur_budget_lo, cur_budget_hi;
>> +            unsigned cur_deadline_lo, cur_deadline_hi;
>> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
>> +        } d;
>> +        d.dom = svc->vcpu->domain->domain_id;
>> +        d.vcpu = svc->vcpu->vcpu_id;
>> +        d.processor = svc->vcpu->processor;
>> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
>> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
>> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
>> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
>> +        trace_var(TRC_RT_VCPU_DUMP, 1,
>> +                  sizeof(d),
>> +                  (unsigned char *)&d);
>> +    }
>> +}
>> +
>> +static void
>> +rt_dump_pcpu(const struct scheduler *ops, int cpu)
>> +{
>> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
>> +
>> +    printtime();
>> +    rt_dump_vcpu(ops, svc);
>> +}
>> +
>> +/*
>> + * should not need lock here. only showing stuff
>> + */
>
>
> This isn't true -- you're walking both the runqueue and the lists of domains
> and vcpus, each of which may change under your feet.

I see.  So even when I only read (and never write) the runqueue, I
still need to hold the lock, because the lists can change under my
feet while I am walking them.  I will add the lock to these dump
functions.
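
For example, rt_dump() could hold the global lock for the whole walk,
something like this (untested sketch, assuming prv->lock may be taken
with interrupts enabled here):

    static void
    rt_dump(const struct scheduler *ops)
    {
        unsigned long flags;
        struct rt_private *prv = RT_PRIV(ops);

        spin_lock_irqsave(&prv->lock, flags);
        /* ... walk prv->runq and prv->sdom and print, as before ... */
        spin_unlock_irqrestore(&prv->lock, flags);
    }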

>
>> +
>> +        /* TRACE */
>> +        {
>> +            struct {
>> +                unsigned dom:16,vcpu:16;
>> +                unsigned cur_budget_lo, cur_budget_hi;
>> +            } d;
>> +            d.dom = svc->vcpu->domain->domain_id;
>> +            d.vcpu = svc->vcpu->vcpu_id;
>> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
>> +                      sizeof(d),
>> +                      (unsigned char *) &d);
>> +        }
>> +
>> +        return;
>> +    }
>> +}
>> +
>> +static inline void
>> +__runq_remove(struct rt_vcpu *svc)
>> +{
>> +    if ( __vcpu_on_runq(svc) )
>> +        list_del_init(&svc->runq_elem);
>> +}
>> +
>> +/*
>> + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
>> + * go first.
>> + */
>> +static void
>> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    spinlock_t *schedule_lock;
>> +
>> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
>> +    ASSERT( spin_is_locked(schedule_lock) );
>> +
>> +    ASSERT( !__vcpu_on_runq(svc) );
>> +
>> +    /* svc still has budget */
>> +    if ( svc->cur_budget > 0 )
>> +    {
>> +        list_for_each(iter, runq)
>> +        {
>> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
>> +            if ( iter_svc->cur_budget == 0 ||
>> +                 svc->cur_deadline <= iter_svc->cur_deadline )
>> +                    break;
>> +        }
>> +        list_add_tail(&svc->runq_elem, iter);
>> +    }
>> +    else
>> +    {
>> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
>> +    }
>
>
> OK, this thing with the "flag vcpu" seems a bit weird.  Why not just have
> two queues -- a runq and a depletedq.  You don't need to have another
> function; you just add it to depleted_runq rather than runq in
> __runq_insert().  Then you don't have to have this "cur_budget==0" stuff.
> The only extra code you'd have is (I think) in __repl_update().

I would need to add a little extra code, like a static inline
DEPLETEDQ() helper to get the depletedq from struct rt_private, and
the DepletedQ's helper functions, like __vcpu_on_depletedq(), etc.
But that code is small, so yes, I will change it to two queues in the
next version.
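
Roughly what I have in mind (untested sketch; depletedq would be a new
list_head in struct rt_private, initialized alongside the runq):

    static inline struct list_head *DEPLETEDQ(const struct scheduler *ops)
    {
        return &RT_PRIV(ops)->depletedq;
    }

    static void
    __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        struct list_head *iter;

        ASSERT( !__vcpu_on_runq(svc) );

        if ( svc->cur_budget > 0 )
        {
            /* The runq is kept sorted by cur_deadline (EDF). */
            list_for_each(iter, RUNQ(ops))
            {
                struct rt_vcpu * iter_svc = __runq_elem(iter);
                if ( svc->cur_deadline <= iter_svc->cur_deadline )
                    break;
            }
            list_add_tail(&svc->runq_elem, iter);
        }
        else
            /* Depleted vcpus are unsorted; they all wait for replenishment. */
            list_add_tail(&svc->runq_elem, DEPLETEDQ(ops));
    }

The flag_vcpu and the cur_budget == 0 check in the sorting loop would
then go away entirely.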

>> +
>> +/*
>> + * Burn budget in nanosecond granularity
>> + */
>> +static void
>> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
>> +{
>> +    s_time_t delta;
>> +
>> +    /* don't burn budget for idle VCPU */
>> +    if ( is_idle_vcpu(svc->vcpu) )
>> +        return;
>> +
>> +    rt_update_helper(now, svc);
>> +
>> +    /* do not burn budget when the vcpu misses its deadline */
>> +    if ( now >= svc->cur_deadline )
>> +        return;
>> +
>> +    /* burn at nanoseconds level */
>> +    delta = now - svc->last_start;
>> +    /*
>> +     * delta < 0 only happens in nested virtualization;
>> +     * TODO: how should we handle delta < 0 in a better way?
>
>
> I think what I did in credit2 was just
>
> if(delta < 0) delta = 0;
>
> What you're doing here basically takes away an entire budget when the time
> goes backwards for whatever reason.  Much better, it seems to me, to just
> give the vcpu some "free" time and deal with it. :-)

I will remove the svc->cur_budget = 0 line, so the vcpu's budget is
not zeroed when time goes backwards: if the vcpu still has some budget
left in this period and has a high enough priority, it should be able
to run.
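
The end of burn_budgets() would then look something like this
(untested sketch, keeping the rest of the function as in this patch):

    /* burn at nanoseconds level */
    delta = now - svc->last_start;
    if ( delta < 0 )
        delta = 0;    /* time went backwards: give the vcpu "free" time */

    svc->cur_budget -= delta;
    if ( svc->cur_budget < 0 )
        svc->cur_budget = 0;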

>> +
>> +/*
>> + * Update each vcpu's budget and keep the runq sorted by re-inserting the
>> + * modified vcpus back into the runq.
>> + * The lock is grabbed before calling this function.
>> + */
>> +static void
>> +__repl_update(const struct scheduler *ops, s_time_t now)
>> +{
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    struct list_head *tmp;
>> +    struct rt_vcpu *svc = NULL;
>> +
>> +    list_for_each_safe(iter, tmp, runq)
>> +    {
>> +        svc = __runq_elem(iter);
>> +
>> +        /* do not update the flag_vcpu's budget */
>> +        if(svc->sdom == NULL)
>> +            continue;
>> +
>> +        rt_update_helper(now, svc);
>> +        /* reinsert the vcpu if its deadline is updated */
>> +        if ( now >= 0 )
>
>
> Uum, when is this ever not going to be >= 0?  The comment here seems
> completely inaccurate...

My bad, this is incorrect. :-(  It should be diff >= 0, where diff is
now - svc->cur_deadline.  Sorry, will change it in the next patch.

>
> Also, it seems like you could make this a bit more efficient by pulling the
> check into this loop itself, rather than putting it in the helper function.
> Since the queue is sorted by deadline, you could stop processing once you
> reach one for which now < cur_deadline, since you know all subsequent ones
> will be even later than this one.
>
> Of course, that wouldn't take care of the depleted ones, but if those were
> already on a separate queue, you'd be OK. :-)

Sure! Will do that.
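
With the depletedq split out, the runq pass of __repl_update() could
then look roughly like this (untested sketch; the depletedq would get
a similar walk that moves replenished vcpus back onto the runq):

    list_for_each_safe(iter, tmp, runq)
    {
        svc = __runq_elem(iter);

        /* The runq is sorted by deadline: later entries are even later. */
        if ( now < svc->cur_deadline )
            break;

        rt_update_helper(now, svc);   /* replenish budget, update deadline */
        /* ... and re-insert svc so the queue stays sorted ... */
    }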

>
> Right, past time for me to go home... I've given a quick scan over the other
> things and nothing jumped out at me, but I'll come back to it again tomorrow
> and see how we fare.

Thank you so much for your comments and time! I really appreciate it
and will tackle these comments in the next version asap.

Thanks!

Meng



-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel