
Re: [Xen-devel] [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor



[Adding Jan]

On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> This is the core rt scheduler patch. It adds the new real time scheduler to
> the hypervisor, as the non-default scheduler.
> 
So, honestly, I think these two lines above can just go away...

The subject can be improved a bit too, probably. I don't know...
something like "scheduling: introduce XEN_SCHEDULER_RT" or "sched:
implement XEN_SCHEDULER_RT scheduling policy", etc.

> This scheduler follows the pre-emptive Global EDF theory in real-time field.
> Each VCPU can have a dedicated period and budget.
> While scheduled, a VCPU burns its budget.
> A VCPU has its budget replenished at the beginning of each of its periods;
> The VCPU discards its unused budget at the end of each of its periods.
> If a VCPU runs out of budget in a period, it has to wait until next period.
> The mechanism of how to burn a VCPU's budget depends on the server mechanism
> implemented for each VCPU.
> 
> Server mechanism: a VCPU is implemented as a deferable server.
> When a VCPU has a task running on it, its budget is continuously
> burned;
>
"When a VCPU is scheduled", or "When a VCPU executes on a PCPU". Let's
avoid messing with _why_ a VCPU executes, i.e., what the VCPU is
actually doing... That's the guest's business, not ours! :-)

> When a VCPU has no task but with budget left, its budget is
> preserved.
> 
Ditto.

> Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> At any scheduling point, the VCPU with earliest deadline has highest
> priority.
> 
> Queue scheme: A global runqueue for each CPU pool.
  ^Runqueue, it's a bit more clear, I think.

> The runqueue holds all runnable VCPUs.
> VCPUs in the runqueue are divided into two parts: with and without budget.
> At each part, VCPUs are sorted based on EDF priority scheme.
> 
> Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
> 
> Note: cpumask and cpupool is supported.
>
No need to say this in the changelog. Actually, it's probably useful to
say it right now, given this is an RFC... Not so much when this won't be
an RFC any longer.

> This is still in the development phase.
> 
Right, and therefore let's concentrate on the code... we'll refine the
changelog version after version. :-)


> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..41543a2
> --- /dev/null
> +++ b/xen/common/sched_rt.c

> +/*
> + * TODO:
> + *
> + * Migration compensation and resist like credit2 to better use cache;
> + * Lock Holder Problem, using yield?
> + * Self switch problem: VCPUs of the same domain may preempt each other;
> + */
> +
This last one is an interesting point, even for non real-time
scheduling. I've given some thought to this a few times, but never had time
to properly gather ideas and/or collect some numbers... if you have
anything like that, feel free to share (in another thread, as it's, I
think, more general).

> +/*
> + * Locking:
> + * Just like credit2, a global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + *
Mmm... Actually, I think credit2 at least tries to do something
different. Well, each RunQ does have its own lock, but there is more
than one RunQ in credit2 (or, again, at least that's the design. There
are a few bugs in the credit2 implementation, but that does not matter
here!! :-P).

No big deal, of course, just perhaps try to describe a bit better what's
the intended locking scheme. :-)

> + * The lock is already grabbed when calling wake/sleep/schedule/ functions in schedule.c
> + *
> + * The functions involes RunQ and needs to grab locks are:
> + *    dump, vcpu_insert, vcpu_remove, context_saved,
> + */
> +
> +
> +/*
> + * Default parameters in ms
> + */
> +#define RT_DEFAULT_PERIOD     10
> +#define RT_DEFAULT_BUDGET      4
> +
ms? us? Can you put a comment saying what time units these 10 and 4 are in?

> +/*
> + * Insert a vcpu in the RunQ basing on vcpu's deadline: 
> + * vcpus with shorter deadline are inserted first.
> + * EDF schedule policy: vcpu with smaller deadline has higher priority;
>
The two lines above sound redundant. I'd ditch the latter.

> + * When vcpus have the same deadline, insert the current one at the head of these vcpus.
> + */
>
Knowing how ties are broken is useful. However, I'm not sure what you
mean by "insert the current one at the head of these vcpus". Can you
rephrase it?

> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +   
> +    ASSERT( spin_is_locked(per_cpu(schedule_data, svc->vcpu->processor).schedule_lock) );
> +
Coding style: this line looks very long. :-)

> +    if ( __vcpu_on_runq(svc) )
> +        return;
> +
Can this happen? I mean, the fact that you try to put a VCPU on a RunQ
while it's already there. If yes, and if that's by design, ok, but
that's the sort of thing I usually want to keep under control, as it
frequently hides bugs etc.

Again, I'm not saying this is the case. What I'd do is, if it can
happen, put a comment explaining under what circumstances that is the
case. If it can't, then put down another ASSERT rather than just
returning.
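
FWIW, what I have in mind is just this (assuming the double insertion
really can't happen by design):

    ASSERT( !__vcpu_on_runq(svc) );

in place of the early return.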

> +    list_for_each(iter, runq) {
> +        struct rt_vcpu * iter_svc = __runq_elem(iter);
> +
> +        if ( svc->cur_budget > 0 ) { /* svc still has budget */
>
Put the comment above (minor thing, though).

> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        } else { /* svc has no budget */
> +            if ( iter_svc->cur_budget == 0 &&
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        }
>
Bear with me, as I'm really thinking out loud here. Have you thought
about, or perhaps explored, the solution of using two lists --one for
vcpus with budget, the other for depleted ones-- instead of only one
divided into two parts?

Perhaps it's not a big deal, but looking at this code, it looks like
inserting a fully depleted vcpu potentially involves pointlessly going
through all the elements with budget > 0, which also means holding the
global RunQ spinlock longer than one would expect. Thoughts?
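
Just to show what I mean, here's a very rough sketch (DEPLETEDQ() is a
made up accessor, to be defined along the lines of RUNQ()):

    static void
    __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
    {
        /* pick the queue based on whether there is budget left */
        struct list_head *queue = svc->cur_budget > 0 ?
                                      RUNQ(ops) : DEPLETEDQ(ops);
        struct list_head *iter;

        list_for_each( iter, queue )
        {
            struct rt_vcpu * iter_svc = __runq_elem(iter);

            /* EDF: insert before the first element with a later deadline */
            if ( svc->cur_deadline <= iter_svc->cur_deadline )
                break;
        }
        /* list_add_tail() adds right before iter, i.e., it appends at
         * the end if we fell off the end of the loop */
        list_add_tail(&svc->runq_elem, iter);
    }

This way, inserting a depleted vcpu never walks the vcpus that still
have budget, and __runq_pick() would only ever have to walk the queue
of vcpus that can actually run.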


Also, coding style: inside the hypervisor, we do:

if ( bla )
{
    blabla;
}
else    /* or else if (bleblu ) */
{
    bleah
}

(Of course, this needs fixing everywhere).

> +/*
> + * Init/Free related code
> + */
> +static int
> +rt_init(struct scheduler *ops)
> +{
> +    struct rt_private *prv;
> +
> +    prv = xzalloc(struct rt_private);
> +    if ( prv == NULL ) {
> +        printk("xzalloc failed at rt_private\n");
> +        return -ENOMEM;
> +    }
> +
> +    ops->sched_data = prv;
> +    spin_lock_init(&prv->lock);
> +    INIT_LIST_HEAD(&prv->sdom);
> +    INIT_LIST_HEAD(&prv->runq);
> +    cpumask_clear(&prv->tickled);
> +    cpumask_clear(&prv->cpus);
> +
> +    printtime();
> +    printk("\n");
> +
Perhaps printtime() can take a string argument and append it to the time
it prints. Well, it probably does not matter much, as I don't think
things like printtime() should make it to the final version. I mean,
it's useful now, but once upstream, debugging happens via tracing, so it
hopefully will become useless.

Actually, I think adding a few xentrace tracepoints would be a great
plus, as it'll allow Xen devs to start investigating and debugging this
the usual 'Xen way'. Look at entries like TRC_CSCHED2_RUNQ_POS in
credit2, or similar ones in credit.
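
For instance, something like this in __runq_pick(), right before
returning (TRC_RT_RUNQ_PICK is completely made up, and would need to
be defined together with the other TRC_* scheduler events):

    /* TRACE */
    if ( svc != NULL )
    {
        struct {
            unsigned dom:16, vcpu:16;
            unsigned cur_budget;
        } d;
        d.dom = svc->vcpu->domain->domain_id;
        d.vcpu = svc->vcpu->vcpu_id;
        d.cur_budget = svc->cur_budget;
        trace_var(TRC_RT_RUNQ_PICK, 1, sizeof(d), (unsigned char *)&d);
    }

pretty much as credit2 does in its runq_insert().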

> +    return 0;
> +}

> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +    long count;
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;            /* init last_start is 0 */
> +
> +    svc->period = RT_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) && vc->domain->domain_id != 0 ) {
> +        svc->budget = RT_DEFAULT_BUDGET;
> +    } else {
> +        svc->budget = RT_DEFAULT_PERIOD; /* give vcpus of dom0 100% utilization */
> +    }
> +
Oh, so idle VCPUs and Dom0's VCPUs get 100%? Well, for one, write this
explicitly somewhere. Anyway, idle is not so much important, it's kind
of obvious actually. Dom0 is certainly worth a few words, both here and
up in the 'Design' description.

I've got to think about whether or not that's a sane default setting. For
sure Dom0 is a very peculiar domain, but the beauty (well, one of the
beauties :-D) of Xen is that Dom0 is actually just one domain, there may
exist other "special" domains (driver/stub domains, etc)... So let's try
not making Dom0 itself too special, ok? :-P

> +    count = (now/MILLISECS(svc->period)) + 1;
> +    /* sync all VCPU's start time to 0 */
> +    svc->cur_deadline += count*MILLISECS(svc->period);
> +
> +    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
> +    /* Debug only: dump new vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    return svc;
> +}

> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    /* IDLE VCPU not allowed on RunQ */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
Mmm... Why are we speaking about the RunQ... this is
not the function adding a VCPU to the RunQ, is it?

> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
> +}
> +
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( !is_idle_vcpu(vc) ) {
> +        list_del_init(&svc->sdom_elem);
> +    }
> +}
> +
About these two functions (rt_vcpu_insert and
rt_vcpu_remove), in addition to what I wrote yesterday.

It looks to me that, when calling them, for instance, from
sched_move_domain() in schedule.c, what you do is remove the VCPU from
the list of VCPUs of its domain, and then add it back there. Sure, the
net effect is ok, but it certainly does not look like the best possible
thing that could happen in such a case.

I can't be positive about what I'm about to say, but it looks to me
that, given how the insert_vcpu hook is used, you may not want to define
it (just don't put any .insert_vcpu=xxx nor .remove_vcpu=yyy there when
initializing sched_rt_def, at the bottom of the file). Of course
you'd need to find another place from where to add the vcpu to the list
of its own domain's VCPUs, but that should be easy.

Or am I missing something?

For sure, the way it is now, although probably functional (it does
look so; I haven't tested this, but I trust you did :-) ), is rather
inefficient and confusing to read.

> +/*
> + * Implemented as deferrable server. 
> + * Different server mechanism has different implementation.
>
Kill this line, until (if ever) we'll actually have more than one server
mechanism.

> + * burn budget at microsecond level. 
> + */
I.e., using microsecs everywhere, including in the interface, would kill
the need of doing these conversions inside the scheduler logic, which,
to me, is already a fair reason to go that way.

I see you've got this in your TODO list already, so good. :-)
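
E.g., with period and budget kept directly as s_time_t (values made
up, of course):

    #define RT_DEFAULT_PERIOD     MICROSECS(10000)
    #define RT_DEFAULT_BUDGET     MICROSECS(4000)

and then the various deadline/budget updates become just things like
svc->cur_deadline += count * svc->period; with no MILLISECS()/'* 1000'
juggling in the middle of the scheduler logic.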

> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) {
> +    s_time_t delta;
> +    unsigned int consume;
> +    long count = 0;
> +
> +    /* first time called for this svc, update last_start */
> +    if ( svc->last_start == 0 ) {
> +        svc->last_start = now;
> +        return;
> +    }
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) {
> +        return;
> +    }
> +
This check on is_idle_vcpu() can be the very first thing you do,
can't it? I mean, is it important that we update last_start for the idle
VCPUs? If not, move it up.

> +    /* don't burn budget for Domain-0, RT-Xen use only */
> +    if ( svc->sdom->dom->domain_id == 0 ) {
> +        return;
> +    }
> +
Oh, so Dom0 is not inside a (deferrable) server? Nope, that needs to
change. It's probably ok to give it 100% bandwidth by default (still
thinking about this), but it certainly needs to be possible for a
sysadmin to change that, assign to Dom0's VCPUs budgets and periods, and
have it scheduled like all the other domains.

Is it going to be hard to make happen? I honestly don't think
so. I think that just killing all of these (domain_id == 0) checks would
pretty much do...

> +    /* update deadline info */
>
Be a bit more explicit in the comment, explaining what's going on
here, i.e., the deadline is in the past, so we need to compensate,
etc.

Also, can you provide me a little insight on what could cause delta to
be non-negative? Have you observed it? If yes, a lot? Under what
circumstances?

More importantly, is it ok to just silently fix things, or should we
warn the user/sysadmin that we're lagging behind?

> +    delta = now - svc->cur_deadline;
> +    if ( delta >= 0 ) {
> +        count = ( delta/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +        return;
> +    }
> +
So, in case delta was >= 0, the deadline was in the past. What about
cur_budget? I mean, what about overruns?

I know it's something that should not happen, but if it does, is
forgetting about it the best option?

However, I see budget is handled below, so I'll comment there.

> +    delta = now - svc->last_start;
> +    if ( delta < 0 ) {
> +        printk("%s, delta = %ld for ", __func__, delta);
> +        rt_dump_vcpu(svc);
> +        svc->last_start = now;  /* update last_start */
> +        svc->cur_budget = 0;
> +        return;
> +    }
> +
As said above, put down some words of comment on when this can happen
and why you think this is the best recovery action, please. :-)

> +    if ( svc->cur_budget == 0 ) return;
> +
> +    /* burn at microseconds level */
> +    consume = ( delta/MICROSECS(1) );
> +    if ( delta%MICROSECS(1) > MICROSECS(1)/2 ) consume++;
> +
ditto (about conversions).

> +    svc->cur_budget -= consume;
> +    if ( svc->cur_budget < 0 ) svc->cur_budget = 0;
>
What I've always done is have the task (in this case, that would be
the vcpu) pay back for that. So if its max_budget was 100 and I at
some point find out it managed to execute for 115, i.e., it overran by
15, I only replenish it to 100-15=85.

I'm saying this here because, even if it's not here that you replenish,
all the mechanisms I can think of to take care of overruns need the
current budget to actually be negative, when replenishment happens, in
order to be able to take overrun itself into account.

So what do you think? Can overrun happen in this scheduler? If yes, how
severely? Don't you think it's something important to consider?
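
I.e., I'd let cur_budget go negative here (it looks like it's signed
already):

    svc->cur_budget -= consume;   /* going below zero == overrun */

and then, wherever the replenishment happens, do something like:

    if ( svc->cur_budget < 0 )
        svc->cur_budget += svc->budget;   /* pay the overrun back */
    else
        svc->cur_budget = svc->budget;

(unit conversions aside, of course).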

> +}
> +
> +/* 
> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq) {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
>
What's in priv->cpus, BTW? How is it different from the cpupool online
mask (returned in 'online' by cpupool_scheduler_cpumask() )?

> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
> +        if ( iter_svc->cur_budget <= 0 )
> +            continue;
> +
> +        svc = iter_svc;
> +        break;
> +    }
> +
> +    return svc;
> +}
> +
> +/*
> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    s_time_t diff;
> +    long count;
> +
> +    list_for_each_safe(iter, tmp, runq) {
> +        svc = __runq_elem(iter);
> +
Wow, so we scan the whole RunQ... :-/

Let's check when this is called, but I personally really don't like this
things. :-(

Can't we use something like a timer for replenishments?

> +        diff = now - svc->cur_deadline;
> +        if ( diff > 0 ) {
> +            count = (diff/MILLISECS(svc->period)) + 1;
> +            svc->cur_deadline += count * MILLISECS(svc->period);
> +            svc->cur_budget = svc->budget * 1000;
>
Payback for possible overrun again?

In any case, payback or not, this is the same code from inside
burn_budget() above. A helper function is probably what we want.
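
Something like this is what I'd expect (name made up), called from
burn_budgets(), __repl_update() and rt_vcpu_wake():

    static void
    rt_update_deadline(struct rt_vcpu *svc, s_time_t now)
    {
        s_time_t diff = now - svc->cur_deadline;

        if ( diff >= 0 )
        {
            long count = ( diff / MILLISECS(svc->period) ) + 1;

            svc->cur_deadline += count * MILLISECS(svc->period);
            svc->cur_budget = svc->budget * 1000;
        }
    }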

> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret;
> +
    struct task_slice ret = { .migrated = 0 };

should work, and should allow you to get rid of the ret.migrated=0
below. It's a minor thing, though, mostly a matter of taste, I think.

> +    /* clear ticked bit now that we've been scheduled */
> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> +        cpumask_clear_cpu(cpu, &prv->tickled);
> +
> +    /* burn_budget would return for IDLE VCPU */
> +    burn_budgets(ops, scurr, now);
> +
> +    __repl_update(ops, now);
> +
And here it is. So, at each tick (and possibly even more frequently) we
scan the full RunQ, just to update the deadlines and perform
replenishments.

I confirm that I don't like this very much. burn_budget() is something
that should indeed happen here. __repl_update(), OTOH, should somehow be
made event based. I can't provide a deferrable server example off the
top of my head.

IIRC, in my Linux CBS implementation I was setting a timer when a task
was inserted in the runqueue, to fire at the task's absolute deadline
time (well, after one period, if you want to allow deadlines different
from periods). At that point, you either find the task still running,
with some leftover budget, or you find it still running with no (or very
little) budget, which means overrun (or something very close to overrun),
or (most of the cases) you find it stopped, because it ran out of
budget. That's when --in the timer handler-- you issue the replenishment
and postpone the deadline accordingly (depending on which one of the 3
above the actual situation was). When re-enqueueing the task, with new
budget and new deadline, I was also setting the timer again.
Of course, when the task was removed from the runqueue because it
blocked by itself (i.e., _not_ because it depleted its budget), I was
stopping the timer too.

I've also implemented the sporadic server on Linux, using the exact same
mechanism, although it was a bit more complex, as the sporadic server
handles replenishments in a more complicated way (it replenishes in
chunks, so I had to set the timer for each chunk, one after the other).

I'm sure something like this can be done here, in Xen, and with DS as an
algorithm.

It shouldn't even be too hard to just use either only one timer, or,
perhaps, one timer per PCPU. I'm quite sure we don't want one timer per
VCPU, but this is of course something we can discuss. For now, what do
you think about the timer idea?
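
Just to sketch the idea (repl_timer and next_repl_time() are made up,
and locking/error handling is omitted entirely):

    static struct timer repl_timer; /* one global timer, to begin with */

    static void repl_timer_handler(void *data)
    {
        const struct scheduler *ops = data;

        /* issue the replenishments that are due, which re-sorts the
         * RunQ and can tickle PCPUs as needed... */
        __repl_update(ops, NOW());

        /* ...and re-arm for the next replenishment event */
        set_timer(&repl_timer, next_repl_time(ops));
    }

with an init_timer(&repl_timer, repl_timer_handler, ops, 0); in
rt_init(). That would confine the RunQ scanning to the timer handler,
instead of having it in every wakeup and in every invocation of
rt_schedule().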

> +    if ( tasklet_work_scheduled ) {
> +        snext = RT_VCPU(idle_vcpu[cpu]);
> +    } else {
> +        cpumask_t cur_cpu;
> +        cpumask_clear(&cur_cpu);
> +        cpumask_set_cpu(cpu, &cur_cpu);
> +        snext = __runq_pick(ops, cur_cpu);
> +        if ( snext == NULL )
> +            snext = RT_VCPU(idle_vcpu[cpu]);
> +
> +        /* if scurr has higher priority and budget, still pick scurr */
> +        if ( !is_idle_vcpu(current) &&
> +             vcpu_runnable(current) &&
> +             scurr->cur_budget > 0 &&
> +             ( is_idle_vcpu(snext->vcpu) ||
> +               scurr->cur_deadline <= snext->cur_deadline ) ) {
> +            snext = scurr;
> +        }
> +    }
> +
> +    if ( snext != scurr &&
> +         !is_idle_vcpu(current) &&
> +         vcpu_runnable(current) ) {
> +        set_bit(__RT_delayed_runq_add, &scurr->flags);
> +    }
> +
> +    snext->last_start = now;
> +    ret.migrated = 0;
> +    if ( !is_idle_vcpu(snext->vcpu) ) {
> +        if ( snext != scurr ) {
> +            __runq_remove(snext);
> +            set_bit(__RT_scheduled, &snext->flags);
> +        }
> +        if ( snext->vcpu->processor != cpu ) {
> +            snext->vcpu->processor = cpu;
> +            ret.migrated = 1;
> +        }
> +    }
> +
> +    ret.time = MILLISECS(1);
> +    ret.task = snext->vcpu;
> +
> +    return ret;
> +}
> +
> +/*
> + * Remove VCPU from RunQ
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static void
> +rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( curr_on_cpu(vc->processor) == vc ) {
> +        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    if ( __vcpu_on_runq(svc) ) {
> +        __runq_remove(svc);
> +    }
> +
As said, can it actually happen that we try to put to sleep a vcpu
that's not in the RunQ? If yes, fine, but I'd like a comment about why
that's the case. If not, let's convert the if into an ASSERT.

Let me state once more that I'm not questioning the correctness of your
code, I just want to be sure that the final result is as straightforward
to read and understand as possible, when it comes to study or debug it.
Well, this, in my opinion, is one of the spots where having such an aid
(a comment or an ASSERT) from the original implementors would help a lot
with that goal. :-)

> +    clear_bit(__RT_delayed_runq_add, &svc->flags);
> +}
> +
> +/*
> + * Pick a vcpu on a cpu to kick out to place the running candidate
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function 
> + */
Now this is a really good looking doc comment!! :-D

> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * scheduled = NULL;    /* lowest priority scheduled */
>
Can I suggest a better name for this variable? Maybe
latest_deadline_vcpu, or latest_scheduled, or just latest... In general,
I'd like it to reflect the concept that it points to the VCPU with the
latest deadline (i.e., the lowest prio, but as repeatedly said, we're
dealing with EDF only!).

> +    struct rt_vcpu * iter_svc;
> +    struct vcpu * iter_vc;
> +    int cpu = 0;
> +    cpumask_t not_tickled;                      /* not tickled cpus */
>
Comment not necessary, the name is talking enough by itself in this
case.

One issue here is that we really are trying hard to avoid putting
cpumask_t on the stack, as they can be big on large boxes, with lots of
cpus. However, let's consider this low prio for now.
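
One way out could be a pre-allocated scratch mask, e.g., a per-PCPU
one, so it does not need its own locking (name made up):

    static DEFINE_PER_CPU(cpumask_t, tickle_scratch_mask);

    ...
    cpumask_t *not_tickled = &this_cpu(tickle_scratch_mask);

But again, low prio.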

> +    cpumask_t *online;
> +
> +    if ( new == NULL || is_idle_vcpu(new->vcpu) ) return;
> +
> +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> +    cpumask_and(&not_tickled, online, &prv->cpus);
> +    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
> +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> +
> +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) {
>
Remember: '{' goes on a new line inside the hypervisor.

> +        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    /* 2) if there are any idle pcpu, kick it */
> +    /* The same loop also find the one with lowest priority */
> +    for_each_cpu(cpu, &not_tickled) {
> +        iter_vc = curr_on_cpu(cpu);
> +        if ( is_idle_vcpu(iter_vc) ) {
> +            cpumask_set_cpu(cpu, &prv->tickled);
> +            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> +            return;
> +        }
> +        iter_svc = RT_VCPU(iter_vc);
> +        if ( scheduled == NULL || 
> +             iter_svc->cur_deadline > scheduled->cur_deadline ) {
> +            scheduled = iter_svc;
> +        }
> +    }
> +
> +    /* 3) candicate has higher priority, kick out the lowest priority vcpu */
> +    if ( scheduled != NULL && new->cur_deadline < scheduled->cur_deadline ) {
> +        cpumask_set_cpu(scheduled->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(scheduled->vcpu->processor, SCHEDULE_SOFTIRQ);
> +    }
> +    return;
> +}
> +
> +/* 
> + * Should always wake up runnable vcpu, put it back to RunQ. 
> + * Check priority to raise interrupt 
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + * TODO: what if these two vcpus belongs to the same domain? 
> + */
> +static void
> +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    s_time_t diff;
> +    s_time_t now = NOW();
> +    long count = 0;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) return;
> +
ASSERT or comment.

Also, here and everywhere else where this happens, 'return' (in general,
the then branch of the if) goes on a new line, even if it's a single
statement.

> +    /* on RunQ, just update info is ok */
> +    if ( unlikely(__vcpu_on_runq(svc)) ) return;
> +
This happens to have a comment, but not as good as the case
requires. :-P

> +    /* If context hasn't been saved yet, set flag so it will add later */
> +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) {
> +        set_bit(__RT_delayed_runq_add, &svc->flags);
> +        return;
> +    }
> +
The credit2 implementation has proper comments on what this flag (and
the other one too, actually) means and how it's used, both where the
flags are defined and where they are used. Either you provide similar
comments in here (taking them from sched_credit2.c is ok), or you
somehow reference those in sched_credit2.c.

Personally, I don't like referencing comments in other files, so I'd
go for the cut-&-paste-ish approach.

> +    /* update deadline info */
> +    diff = now - svc->cur_deadline;
> +    if ( diff >= 0 ) {
> +        count = ( diff/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +    }
> +
And here comes the same code again. Definitely make a helper function
(like the one sketched above) and call it from all 3 spots.

> +    __runq_insert(ops, svc);
> +    __repl_update(ops, now);
>
So, every time a VCPU wakes up, we scan the full RunQ, which also means
grabbing and holding the global RunQ lock, to check whether any task
needs a replenishment... Why is this required?

Once more, can't we use timers for replenishments, and get rid of this?
It surely comes at a rather high cost, doesn't it?

> +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
> +    runq_tickle(ops, snext);
> +
> +    return;
> +}
> +
> +/* 
> + * scurr has finished context switch, insert it back to the RunQ,
> + * and then pick the highest priority vcpu from runq to run 
> + */
> +static void
> +rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * svc = RT_VCPU(vc);
> +    struct rt_vcpu * snext = NULL;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    spinlock_t *lock;
> +
> +    clear_bit(__RT_scheduled, &svc->flags);
>
You clear this flag outside of the critical section on lock... This is
different from what happens in credit2, and it does not sound safe to
me. Is it intentional?
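
I.e., I'd have expected the flags to be manipulated with the lock
held, credit2-style, something like:

    lock = vcpu_schedule_lock_irq(vc);

    clear_bit(__RT_scheduled, &svc->flags);

    if ( !is_idle_vcpu(vc) &&
         test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) &&
         likely(vcpu_runnable(vc)) )
    {
        /* ... insert, update, pick and tickle, as below ... */
    }

    vcpu_schedule_unlock_irq(lock, vc);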

> +    if ( is_idle_vcpu(vc) ) return;
> +
> +    lock = vcpu_schedule_lock_irq(vc);
> +    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
> +         likely(vcpu_runnable(vc)) ) {
> +        __runq_insert(ops, svc);
> +        __repl_update(ops, NOW());
ditto.

> +        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
> +        runq_tickle(ops, snext);
>
I also think that, if we use something like timers for replenishments,
we can avoid calling __runq_pick() here too, as we won't be re-sorting
the queue, so it'd be enough to call runq_tickle() for svc... I think. :-)

> +    }
> +    vcpu_schedule_unlock_irq(lock, vc);
> +}
> +
> +/*
> + * set/get each vcpu info of each domain
> + */
> +static int
> +rt_dom_cntl(const struct scheduler *ops, struct domain *d, struct xen_domctl_scheduler_op *op)
>
Lots of long lines in this function...

> +{
> +    xen_domctl_sched_rt_params_t local_sched;
> +    struct rt_dom * const sdom = RT_DOM(d);
> +    struct list_head *iter;
> +    int vcpu_index = 0;
> +    int rc = -EINVAL;
> +
> +    switch ( op->cmd )
> +    {
> +    case XEN_DOMCTL_SCHEDOP_getinfo:
> +        /* for debug use, whenever adjust Dom0 parameter, do global dump */
> +        if ( d->domain_id == 0 ) {
> +            rt_dump(ops);
> +        }
> +
> +        local_sched.num_vcpus = 0;
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            ASSERT(vcpu_index < XEN_LEGACY_MAX_VCPUS);
> +            local_sched.vcpus[vcpu_index].budget = svc->budget;
> +            local_sched.vcpus[vcpu_index].period = svc->period;
> +
> +            vcpu_index++;
> +        }
> +        local_sched.num_vcpus = vcpu_index;
> +        copy_to_guest(op->u.rt.schedule, &local_sched, 1);
> +        rc = 0;
> +        break;
> +    case XEN_DOMCTL_SCHEDOP_putinfo:
> +        copy_from_guest(&local_sched, op->u.rt.schedule, 1);
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            if ( local_sched.vcpu_index == svc->vcpu->vcpu_id ) { /* adjust per VCPU parameter */
> +                vcpu_index = local_sched.vcpu_index;
> +
> +                if ( vcpu_index < 0 || vcpu_index > XEN_LEGACY_MAX_VCPUS) {
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
> +                            vcpu_index);
> +                }else{
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
> +                            vcpu_index, local_sched.vcpus[vcpu_index].period,
> +                            local_sched.vcpus[vcpu_index].budget);
> +                }
> +
> +                svc->period = local_sched.vcpus[vcpu_index].period;
> +                svc->budget = local_sched.vcpus[vcpu_index].budget;
> +
> +                break;
> +            }
> +        }
> +        rc = 0;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +
> +static struct rt_private _rt_priv;
> +
> +const struct scheduler sched_rt_def = {
> +    .name           = "SMP RT Scheduler",
> +    .opt_name       = "rt",
> +    .sched_id       = XEN_SCHEDULER_RT,
> +    .sched_data     = &_rt_priv,
> +
> +    .dump_cpu_state = rt_dump_pcpu,
> +    .dump_settings  = rt_dump,
> +    .init           = rt_init,
> +    .deinit         = rt_deinit,
> +    .alloc_pdata    = rt_alloc_pdata,
> +    .free_pdata     = rt_free_pdata,
> +    .alloc_domdata  = rt_alloc_domdata,
> +    .free_domdata   = rt_free_domdata,
> +    .init_domain    = rt_dom_init,
> +    .destroy_domain = rt_dom_destroy,
> +    .alloc_vdata    = rt_alloc_vdata,
> +    .free_vdata     = rt_free_vdata,
> +    .insert_vcpu    = rt_vcpu_insert,
> +    .remove_vcpu    = rt_vcpu_remove,
> +
> +    .adjust         = rt_dom_cntl,
> +
> +    .pick_cpu       = rt_cpu_pick,
> +    .do_schedule    = rt_schedule,
> +    .sleep          = rt_vcpu_sleep,
> +    .wake           = rt_vcpu_wake,
> +    .context_saved  = rt_context_saved,
> +
> +    .yield          = NULL,
> +    .migrate        = NULL,
>
You can skip the hooks that you want to be NULL. If you don't, because
you think it's worthwhile to give more visual and explicit info about
a particular hook not being implemented, put down a comment on why that
is the case.

I think, in this case, it's good to show the reviewers and future
readers/hackers that we don't have yield and migrate here, so ack on the
"=NULL", but do add a comment on why that is.

> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -169,7 +169,7 @@ extern const struct scheduler sched_sedf_def;
>  extern const struct scheduler sched_credit_def;
>  extern const struct scheduler sched_credit2_def;
>  extern const struct scheduler sched_arinc653_def;
> -
> +extern const struct scheduler sched_rt_def;
>  
Don't kill blank lines, just add your stuff.

Phew... that's it for this patch... let's move to (hopefully) simpler
ones! :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

