
Re: [Xen-devel] [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts




> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@xxxxxxxxxx]
> Sent: Tuesday, July 28, 2015 10:16 PM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Keir Fraser; Jan Beulich; Andrew Cooper; Tian,
> Kevin; George Dunlap
> Subject: Re: [v4 16/17] vmx: Add some scheduler hooks for VT-d posted
> interrupts
> 
> On Thu, 2015-07-23 at 19:35 +0800, Feng Wu wrote:
> > This patch adds the following arch hooks in scheduler:
> > - vmx_pre_ctx_switch_pi():
> > It is called before context switch, we update the posted
> > interrupt descriptor when the vCPU is preempted, go to sleep,
> > or is blocked.
> >
> > - vmx_post_ctx_switch_pi()
> > It is called after context switch, we update the posted
> > interrupt descriptor when the vCPU is going to run.
> >
> > - arch_vcpu_wake()
> > It will be called when waking up the vCPU, we update
> > the posted interrupt descriptor when the vCPU is unblocked.
> >
> Thanks again for doing this. I'll never get tired to say how much better
> I like this, wrt the previous RUNSTATE_* based approach. :-)

I should be the one thanking you; you are the one who suggested this method! :)

> 
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1550,9 +1550,19 @@ void context_switch(struct vcpu *prev, struct
> vcpu *next)
> >
> >      set_current(next);
> >
> > +    /*
> > +     * We need to update posted interrupt descriptor for each context
> switch,
> > +     * hence cannot use the lazy context switch for this.
> > +     */
> >
> Perhaps it's me, but I don't get the comment. Why do you mention "the
> lazy context switch"? We can't use it "for this", as opposed to what
> other circumstance where we can use it?

Oh, maybe I shouldn't have used that wording. What I meant is that
__context_switch() isn't called on every context switch (for example,
non-idle vCPU -> idle vCPU), so we need to call
prev->arch.pi_ctxt_switch_from explicitly here rather than from inside
__context_switch().
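
To make sure we are on the same page, here is roughly how I intend the
hooks to sit in context_switch() (a simplified sketch based only on the
hunks in this patch, with unrelated code elided):

void context_switch(struct vcpu *prev, struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();

    ....

    set_current(next);

    /*
     * __context_switch() below can be skipped (e.g. when switching to the
     * idle vCPU), so the PI descriptor update for 'prev' has to be done
     * here, on every context switch, rather than in __context_switch().
     */
    if ( !is_idle_vcpu(prev) && prev->arch.pi_ctxt_switch_from )
        prev->arch.pi_ctxt_switch_from(prev);

    if ( (per_cpu(curr_vcpu, cpu) == next) ||
         (is_idle_vcpu(next) && cpu_online(cpu)) )
    {
        /* Lazy path: __context_switch() is not called. */
        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
            next->arch.pi_ctxt_switch_to(next);

        local_irq_enable();
    }
    else
    {
        __context_switch();

        ....
    }

    ....
}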

> 
> > +    if ( !is_idle_vcpu(prev) && prev->arch.pi_ctxt_switch_from )
> > +        prev->arch.pi_ctxt_switch_from(prev);
> > +
> Flip the order in the conditions, maybe? So that guests with
> pi_ctxt_switch_from==NULL suffers the least possible overhead?
> 

Sounds good!
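
Just to confirm the resulting check, something like this (sketch only,
with the pointer test first):

    /*
     * Test the hook pointer first, so vCPUs that never install the PI
     * hooks (and idle vCPUs) bail out on the cheapest check.
     */
    if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
        prev->arch.pi_ctxt_switch_from(prev);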

> 
> >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> >           (is_idle_vcpu(next) && cpu_online(cpu)) )
> >      {
> > +        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
> >
> Same as above.
> 
> > +            next->arch.pi_ctxt_switch_to(next);
> > +
> >          local_irq_enable();
> >
> Another thing: if prev == next --and let's call such vcpu pp-- you go
> through both:
> 
>     pp->arch.pi_ctxt_switch_from(pp);
>     pp->arch.pi_ctxt_switch_to(pp);

In my understanding, if the scheduler chooses the same vCPU to run again,
it returns early from schedule(), as shown below:

static void schedule(void)
{
    ....

    /* get policy-specific decision on scheduling... */
    sched = this_cpu(scheduler);
    next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);

    next = next_slice.task;

    sd->curr = next;

    if ( next_slice.time >= 0 ) /* -ve means no limit */
        set_timer(&sd->s_timer, now + next_slice.time);

    if ( unlikely(prev == next) )
    {
        pcpu_schedule_unlock_irq(lock, cpu);
        trace_continue_running(next);
        return continue_running(prev);
    }

    ....

}

If that is the case, by the time we reach context_switch(), prev and next
are guaranteed to be different. Am I missing something?

> 
> Is this really necessary? AFAICT, it reduces to:
> 
>     pi_set_sn(&pp->arch.hvm_vmx.pi_desc);
>     if ( x2apic_enabled )
>         write_atomic(&pp->arch.hvm_vmx.pi_desc->ndst,
>                      cpu_physical_id(v->processor));
>     else
>         write_atomic(&pp->arch.hvm_vmx.pi_desc->ndst,
>                      MASK_INSR(cpu_physical_id(v->processor),
>                      PI_xAPIC_NDST_MASK));
>     pi_clear_sn(pi_desc);
> 
> In fact, if the scheduler picked prev again, as the next vcpu to be run,
> I expect
> 
>     ( cpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
> 
> to be true, and hence vmx_pre_ctx_switch_pi() reducing to just a call to
> pi_set_sn().
> 
> So, either it is the write_atomic() that you absolutely need, which I
> don't think, or you really can check whether such situation is
> occurring, and avoid going through _pre_ and _post_ back to back.
> 
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> 
> > @@ -116,10 +118,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >      INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> >      INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
> >
> > +    v->arch.hvm_vmx.pi_block_cpu = -1;
> > +
> > +    spin_lock_init(&v->arch.hvm_vmx.pi_lock);
> > +
> >      v->arch.schedule_tail    = vmx_do_resume;
> >      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> >      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
> >
> > +    v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
> > +    v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
> > +
> Would it make sense to get the two initialization above with:
> 
>   if ( iommu_intpost && is_hvm_vcpu(v) )
> 
> and leave them NULL if that is false? I think it works, and it looks
> worthwhile to me.
> 
> In fact, this would make the operands flip suggested above actually
> effective at cutting the overhead for non-PI enabled guests/hosts.
> 
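
If I understand the suggestion correctly, vmx_vcpu_initialise() would then
look something like this (just a sketch, assuming iommu_intpost is the
right condition to gate on):

    v->arch.schedule_tail    = vmx_do_resume;
    v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
    v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;

    /*
     * Only install the PI hooks when VT-d posted interrupts can actually
     * be used, so all other vCPUs keep NULL hooks and skip the calls in
     * context_switch() entirely.
     */
    if ( iommu_intpost && is_hvm_vcpu(v) )
    {
        v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
        v->arch.pi_ctxt_switch_to   = vmx_post_ctx_switch_pi;
    }
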
> > +static void vmx_pre_ctx_switch_pi(struct vcpu *v)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    struct pi_desc old, new;
> > +    unsigned long flags, gflags;
> > +
> > +    if ( !iommu_intpost || !is_hvm_vcpu(v)
> || !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> And, if you do as I suggest in context_switch() and
> vmx_vcpu_initialise(), this will reduce to:
> 
>     if ( !has_arch_pdevs(v->domain) )
>         return;
> 
> > +    else if ( test_bit(_VPF_blocked, &v->pause_flags) )
> > +    {
> > +        /*
> > +         * The vCPU is blocking, we need to add it to one of the per pCPU
> lists.
> > +         * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and
> use it for
> > +         * the per-CPU list, we also save it to posted-interrupt descriptor
> and
> > +         * make it as the destination of the wake-up notification event.
> > +         */
> > +        v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > +        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > +                          v->arch.hvm_vmx.pi_block_cpu), flags);
> > +        list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > +                      &per_cpu(pi_blocked_vcpu,
> v->arch.hvm_vmx.pi_block_cpu));
> > +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> > +
> > +        do {
> > +            old.control = new.control = pi_desc->control;
> > +
> > +            /* Should not block the vCPU if an interrupt was posted for it
> */
> > +
> There's no need for the blank line, and I'd say kill it.
> 
> > +            if ( pi_test_on(&old) )
> > +            {
> > +                spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
> gflags);
> > +                vcpu_unblock(v);
> > +                return;
> > +            }
> 
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -381,6 +381,8 @@ void vcpu_wake(struct vcpu *v)
> >      unsigned long flags;
> >      spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
> >
> > +    arch_vcpu_wake(v);
> > +
> So, in the draft you sent a few days back, this was called at the end of
> vcpu_wake(), right before releasing the lock. Now it's at the beginning,
> before the scheduler's wakeup routine has a chance to run.
> 
> IMO, it feels more natural for it to be at the bottom (i.e., generic
> stuff first, arch specific stuff afterwards), and, after a quick
> inspection, I don't think I see nothing preventing things to be that
> way.
> 
> However, I recall you mentioning having issues with such draft, which
> are now resolved with this version. 

The long-latency issue I mentioned previously had a different cause.
Originally I called 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to' from
__context_switch(); however, that function is not invoked on every
context switch, as I described above. After fixing that (calling the
hooks from context_switch() itself), the performance issue disappeared.

> Since this is one of the differences
> between the two, was it the cause of the issues you were seeing? If yes,
> can you elaborate on how and why?
> 
> In the end, I'm not too opposed to the hook being at the beginning
> rather than at the end, but there has to be a reason, which may well end
> up better be stated in a comment...

Here is the reason I put arch_vcpu_wake() at the beginning of vcpu_wake():
arch_vcpu_wake() does some prerequisite work for a vCPU that is about to
run, such as setting SN again and changing the NV field back to
'posted_intr_vector', and this must be finished before the vCPU is
actually scheduled to run. However, if we put arch_vcpu_wake() later in
vcpu_wake(), right before 'vcpu_schedule_unlock_irqrestore', then once the
'wake' hook has finished the vCPU could start running at any time (maybe
on another pCPU, since only the current pCPU is protected by the lock),
before that prerequisite work is done, which would be incorrect. Does my
understanding make sense?
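
To illustrate the placement, here is a simplified sketch of vcpu_wake()
with the hook at the top (the generic wakeup logic itself is elided):

void vcpu_wake(struct vcpu *v)
{
    unsigned long flags;
    spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);

    /*
     * Do the PI housekeeping (set SN again, change NV back to
     * posted_intr_vector) before the generic wakeup code below can make
     * the vCPU runnable.
     */
    arch_vcpu_wake(v);

    if ( likely(vcpu_runnable(v)) )
    {
        ....   /* runstate update, scheduler 'wake' hook, etc. */
    }

    ....

    vcpu_schedule_unlock_irqrestore(lock, flags, v);
}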

Thanks,
Feng

> 
> Thanks and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

