[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 34/47] xen/sched: add fall back to idle vcpu when scheduling unit



On 14.09.2019 10:52, Juergen Gross wrote:
> When scheduling an unit with multiple vcpus there is no guarantee all
> vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
> idle vcpu of the current cpu in that case. This requires to store the
> correct schedule_unit pointer in the idle vcpu as long as it used as
> fallback vcpu.
> 
> In order to modify the runstates of the correct vcpus when switching
> schedule units merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an access function to the
> current variable of other cpus.
> 
> Today context_saved() is called in case previous and next vcpus differ
> when doing a context switch. With an idle vcpu being capable to be a
> substitute for an offline vcpu this is problematic when switching to
> an idle scheduling unit. An idle previous vcpu leaves us in doubt which
> schedule unit was active previously, so save the previous unit pointer
> in the per-schedule resource area. If it is NULL the unit has not
> changed and we don't have to set the previous unit to be not running.
> 
> When running an idle vcpu in a non-idle scheduling unit use a specific
> guest idle loop not performing any tasklets and livepatching in order
> to avoid populating the cpu caches with memory used by other domains
> (as far as possible). Softirqs are considered to be save.

Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas
"softirqs" includes tasklets run in softirq context. I think it would
help if the description made this distinction. With this I then wonder
whether the cache related argumentation above still holds: VT-d's IOMMU
fault handling tasklet runs in softirq context, for example, and
hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
course this could be considered covered by "(as far as possible)" ...

> @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
>  
>  static void noreturn continue_idle_domain(struct vcpu *v)
>  {
> +    /* Idle vcpus might be attached to non-idle units! */
> +    if ( !is_idle_domain(v->sched_unit->domain) )
> +        reset_stack_and_jump_nolp(guest_idle_loop);

The construct and comment make me wonder - did you audit all uses
of is_idle_{domain,vcpu}() for being in line with this new usage
mode?

> @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource 
> *sd,
>                                 struct sched_unit *next, struct sched_unit 
> *prev,
>                                 s_time_t now)
>  {
> -    sd->curr = next;
> -
> -    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, 
> prev->unit_id,
> -             now - prev->state_entry_time);
> -    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, 
> next->unit_id,
> -             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> -             (now - next->state_entry_time) : 0, prev->next_time);
> +    int cpu;

unsigned int?

>      ASSERT(unit_running(prev));
>  
> -    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> -             next->domain->domain_id, next->unit_id);
> +    if ( prev != next )
> +    {
> +        sd->curr = next;
> +        sd->prev = prev;
>  
> -    sched_unit_runstate_change(prev, false, now);
> +        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
> +                 prev->unit_id, now - prev->state_entry_time);
> +        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
> +                 next->unit_id,
> +                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> +                 (now - next->state_entry_time) : 0, prev->next_time);
> +        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> +                 next->domain->domain_id, next->unit_id);
>  
> -    ASSERT(!unit_running(next));
> -    sched_unit_runstate_change(next, true, now);
> +        ASSERT(!unit_running(next));
>  
> -    /*
> -     * NB. Don't add any trace records from here until the actual context
> -     * switch, else lost_records resume will not work properly.
> -     */
> +        /*
> +         * NB. Don't add any trace records from here until the actual context
> +         * switch, else lost_records resume will not work properly.
> +         */
> +
> +        ASSERT(!next->is_running);
> +        next->is_running = 1;
> +        next->state_entry_time = now;
> +
> +        if ( is_idle_unit(prev) )
> +        {
> +            prev->runstate_cnt[RUNSTATE_running] = 0;
> +            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
> +        }
> +        if ( is_idle_unit(next) )
> +        {
> +            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
> +            next->runstate_cnt[RUNSTATE_runnable] = 0;
> +        }

Is this correct when some of the vCPU-s a substitute idle ones?

> @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct 
> sched_unit *prev, s_time_t now,
>      if ( prev->next_time >= 0 ) /* -ve means no limit */
>          set_timer(&sd->s_timer, now + prev->next_time);
>  
> -    if ( likely(prev != next) )
> -        sched_switch_units(sd, next, prev, now);
> +    sched_switch_units(sd, next, prev, now);
>  
>      return next;
>  }
>  
> -static void context_saved(struct vcpu *prev)
> +static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
>  {
> -    struct sched_unit *unit = prev->sched_unit;
> -
>      /* Clear running flag /after/ writing context to memory. */
>      smp_wmb();
>  
> -    prev->is_running = 0;
> +    if ( vprev != vnext )
> +        vprev->is_running = 0;
> +}

With this "vnext" could be const qualified as it seems. And "false"
instead of "0" perhaps, as you touch this anyway?

> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>  
>      pcpu_schedule_unlock_irq(lock, cpu);
>  
> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>  }
>  
>  /*
> @@ -2075,7 +2140,8 @@ static void schedule(void)
>      pcpu_schedule_unlock_irq(lock, cpu);
>  
>      vnext = sched_unit2vcpu_cpu(next, cpu);
> -    sched_context_switch(vprev, vnext, now);
> +    sched_context_switch(vprev, vnext,
> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>  }

As a minor remark, I think between such constructs it would be good
if there was no difference, unless there's a reason to have one. Yet
if there was a reason, it surely would want to be spelled out.

> @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define CHECK_FOR_LIVEPATCH_WORK ""
>  #endif
>  
> -#define reset_stack_and_jump(__fn)                                      \
> +#define switch_stack_and_jump(fn, instr)                                \

Is there any dependency on "instr" to just be a single instruction?
I'm inclined to ask for it to be named "extra" or some such.

> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>  /* Representing HT and core siblings in each socket. */
>  extern cpumask_t **socket_cpumask;
>  
> +#define get_cpu_current(cpu) \
> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

I don't think this can go without a comment clarifying under what
(pretty narrow I think) conditions this is legitimate to use.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.