
Re: [Xen-devel] [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts




> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@xxxxxxxxxx]
> Sent: Friday, July 31, 2015 2:27 AM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Keir Fraser; Jan Beulich; Andrew Cooper; Tian,
> Kevin; George Dunlap
> Subject: Re: [v4 16/17] vmx: Add some scheduler hooks for VT-d posted
> interrupts
> 
> On Thu, 2015-07-30 at 02:04 +0000, Wu, Feng wrote:
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@xxxxxxxxxx]
> 
> > > > --- a/xen/arch/x86/domain.c
> > > > +++ b/xen/arch/x86/domain.c
> > > > @@ -1550,9 +1550,19 @@ void context_switch(struct vcpu *prev, struct
> > > vcpu *next)
> > > >
> > > >      set_current(next);
> > > >
> > > > +    /*
> > > > +     * We need to update posted interrupt descriptor for each context
> > > switch,
> > > > +     * hence cannot use the lazy context switch for this.
> > > > +     */
> > > >
> > > Perhaps it's me, but I don't get the comment. Why do you mention "the
> > > lazy context switch"? We can't use it "for this", as opposed to what
> > > other circumstance where we can use it?
> >
> > Oh, maybe I shouldn't have used that term here. What I want to say is
> > that __context_switch() isn't called on every context switch (e.g.
> > non-idle vcpu -> idle vcpu), so we need to call
> > prev->arch.pi_ctxt_switch_from explicitly, instead of from
> > __context_switch().
> >
> Ok, I see what you mean now, and it's probably correct, as 'lazy context
> switch' is, in this context, exactly that (i.e., not actually context
> switching if next is the idle vcpu).
> 
> It's just that the term is used, in the literature, in different places
> to mean (slightly) different things, and there is no nearby reference
> for it (e.g. to the function), so I still see a bit of room for
> potential confusion.
> 
> In the end, as you wish. If it were me, I'd add a few words to specify
> things better, something very similar to what you've put in this email,
> e.g.:
> 
> "When switching from non-idle to idle, we only do a lazy context switch.
> However, in order for posted interrupt (if available and enabled) to
> work properly, we at least need to update the descriptors"

Sounds good!

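Just to make sure we are on the same page, with that wording the hunk
would look roughly like the sketch below (only a sketch; the exact guard
may differ, and pi_ctxt_switch_from is the hook field introduced by this
patch):

    /*
     * When switching from a non-idle vCPU to the idle vCPU we only do a
     * lazy context switch (__context_switch() is not called).  However,
     * for posted interrupts (if available and enabled) to work properly,
     * the posted interrupt descriptor must be updated on every context
     * switch, so the hook is invoked here rather than from
     * __context_switch().
     */
    if ( !is_idle_vcpu(prev) && prev->arch.pi_ctxt_switch_from )
        prev->arch.pi_ctxt_switch_from(prev);
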
> 
> Or some better English form of it. :-)
> 
> But that's certainly not critical, and I'll be ok with whatever the
> other maintainers agree on.
> 
> > > >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> > > >           (is_idle_vcpu(next) && cpu_online(cpu)) )
> > > >      {
> > > > +        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
> > > >
> > > Same as above.
> > >
> > > > +            next->arch.pi_ctxt_switch_to(next);
> > > > +
> > > >          local_irq_enable();
> > > >
> > > Another thing: if prev == next --and let's call such vcpu pp-- you go
> > > through both:
> > >
> > >     pp->arch.pi_ctxt_switch_from(pp);
> > >     pp->arch.pi_ctxt_switch_to(pp);
> >
> > In my understanding, if the scheduler chooses the same vcpu to run, it
> > will return early in schedule() as below:
> >
> > static void schedule(void)
> > {
> >     ....
> >
> >     /* get policy-specific decision on scheduling... */
> >     sched = this_cpu(scheduler);
> >     next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);
> >
> >     next = next_slice.task;
> >
> >     sd->curr = next;
> >
> >     if ( next_slice.time >= 0 ) /* -ve means no limit */
> >         set_timer(&sd->s_timer, now + next_slice.time);
> >
> >     if ( unlikely(prev == next) )
> >     {
> >         pcpu_schedule_unlock_irq(lock, cpu);
> >         trace_continue_running(next);
> >         return continue_running(prev);
> >     }
> >
> >     ....
> >
> > }
> >
> > If that is the case, by the time we get to context_switch(), prev and
> > next are different. Am I missing something?
> >
> That looks correct. Still, there are checks like '(prev != next)'
> scattered around context_switch(), for both x86 and ARM... weird. I
> shall have a deeper look...
> 
> In any case, as far as this hunk is concerned, the
> '(per_cpu(curr_vcpu,cpu)==next)' is there to deal with the case where we
> went from vcpu v to idle, and we're now going from idle to v again,
> which is something you want to intercept.
> 
> So, at least for now, ignore my comments about it. I'll let you know if
> I find something interesting that you should take into account.
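
To make sure I read this correctly, that branch then handles the
"v -> idle -> v" case as sketched below (just my understanding, written
down as comments on the hunk above):

    if ( (per_cpu(curr_vcpu, cpu) == next) ||
         (is_idle_vcpu(next) && cpu_online(cpu)) )
    {
        /*
         * Either 'next' is the vCPU whose state is still loaded on this
         * pCPU (we lazily switched to idle earlier and are now resuming
         * it), or 'next' is the idle vCPU itself.  No full state switch
         * is needed, but the PI descriptor of a non-idle 'next' must
         * still be brought up to date.
         */
        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
            next->arch.pi_ctxt_switch_to(next);

        local_irq_enable();
        ...
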
> 
> > > > --- a/xen/common/schedule.c
> > > > +++ b/xen/common/schedule.c
> > > > @@ -381,6 +381,8 @@ void vcpu_wake(struct vcpu *v)
> > > >      unsigned long flags;
> > > >      spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
> > > >
> > > > +    arch_vcpu_wake(v);
> > > > +
> > > So, in the draft you sent a few days back, this was called at the end of
> > > vcpu_wake(), right before releasing the lock. Now it's at the beginning,
> > > before the scheduler's wakeup routine has a chance to run.
> > >
> > > IMO, it feels more natural for it to be at the bottom (i.e., generic
> > > stuff first, arch specific stuff afterwards), and, after a quick
> > > inspection, I don't see anything preventing things from being that
> > > way.
> > >
> > > However, I recall you mentioning having issues with that draft, which
> > > are now resolved in this version.
> >
> > The long latency issue mentioned previously was caused by something
> > else. Originally I called 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to'
> > from __context_switch(); however, as I described above, that function
> > is not called on every context switch. After fixing this, the
> > performance issue disappeared.
> >
> I see, thanks for explaining this.
> 
> > > Since this is one of the differences
> > > between the two, was it the cause of the issues you were seeing? If yes,
> > > can you elaborate on how and why?
> > >
> > > In the end, I'm not too opposed to the hook being at the beginning
> > > rather than at the end, but there has to be a reason, which may well
> > > end up being better stated in a comment...
> >
> > Here is the reason I put arch_vcpu_wake() at the beginning of
> > vcpu_wake(): arch_vcpu_wake() does some prerequisite work for a vCPU
> > which is about to run, such as setting SN again and changing the NV
> > field back to 'posted_intr_vector', which should be finished before the
> > vCPU is actually scheduled to run. However, if we put arch_vcpu_wake()
> > later in vcpu_wake(), right before 'vcpu_schedule_unlock_irqrestore',
> > then once the 'wake' hook has finished the vCPU can run at any time
> > (maybe on another pCPU, since only the current pCPU is protected by the
> > lock); if that can happen, it is incorrect. Does my understanding make
> > sense?
> >
> It's safe in any case. In fact, the spinlock will prevent both the
> vcpu's processor from scheduling, as well as any other processor from
> stealing the waking vcpu from the runqueue to run it.

Good to know. Regarding "any other processor from stealing the waking
vcpu from the runqueue to run it": could you please point me at the
relevant code, so that I can better understand how the spinlock protects
against this? Thank you!
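
In the meantime, let me write down my current (possibly wrong) picture of
the serialization, so you can point out where I go astray (a sketch only,
assuming the credit1 load balancing path; details elided):

    /* (1) Wakeup: vcpu_wake() holds the schedule lock of v->processor
     *     for the whole function. */
    lock = vcpu_schedule_lock_irqsave(v, &flags);
    /* ... arch_vcpu_wake(v), SCHED_OP(wake), ... */
    vcpu_schedule_unlock_irqrestore(lock, flags, v);

    /* (2) Local scheduling: schedule() on v->processor takes the very
     *     same per-pCPU lock before picking 'next', so it cannot select
     *     v while the wakeup still holds the lock. */
    lock = pcpu_schedule_lock_irq(cpu);

    /* (3) Remote stealing: a load balancing pCPU (e.g. credit1's
     *     csched_load_balance() -> csched_runq_steal()) has to take the
     *     peer pCPU's schedule lock first, so it too is serialized
     *     against the wakeup. */
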

Thanks,
Feng

> 
> That's actually why I wanted to double check you changing the position
> of the hook (wrt the draft), as it felt weird that the issue were in
> there. :-)
> 
> So, now that we know that safety is not an issue, where should we put
> the hook?
> 
> Having it before SCHED_OP(wake) may make people think that arch specific
> code is (or can, at some point) somehow influencing the scheduler
> specific wakeup code, which is not (and should not become, if possible)
> the case.
> 
> However, I kind of like the fact that the spinlock is released as soon
> as possible after the call to SCHED_OP(wake). That makes it more likely
> that the processors we may have sent IPIs to, during the scheduler
> specific wakeup code, will find the spinlock free. So, looking at things
> from this angle, it would be better to avoid putting stuff in between
> SCHED_OP(wake) and vcpu_schedule_unlock().
> 
> So, all in all, I'd say leave it on top, where it is in this patch. Of
> course, if others have opinions, I'm all ears. :-)
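
Works for me, thanks for the detailed analysis. Just so we agree on the
resulting layout, vcpu_wake() would then look like the sketch below (body
elided; arch_vcpu_wake() is the hook added by this patch):

    void vcpu_wake(struct vcpu *v)
    {
        unsigned long flags;
        spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);

        /* PI preparation first, while the lock is already held ... */
        arch_vcpu_wake(v);

        /* ... then the generic/scheduler specific wakeup (elided) ... */

        /* ... and the lock is dropped right after SCHED_OP(wake). */
        vcpu_schedule_unlock_irqrestore(lock, flags, v);
    }
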
> 
> Thanks and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

