Re: [Xen-devel] [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts
> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@xxxxxxxxxx]
> Sent: Friday, July 31, 2015 2:27 AM
> To: Wu, Feng
> Cc: xen-devel@xxxxxxxxxxxxx; Keir Fraser; Jan Beulich; Andrew Cooper; Tian, Kevin; George Dunlap
> Subject: Re: [v4 16/17] vmx: Add some scheduler hooks for VT-d posted interrupts
>
> On Thu, 2015-07-30 at 02:04 +0000, Wu, Feng wrote:
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@xxxxxxxxxx]
> >
> > > > --- a/xen/arch/x86/domain.c
> > > > +++ b/xen/arch/x86/domain.c
> > > > @@ -1550,9 +1550,19 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> > > >
> > > >      set_current(next);
> > > >
> > > > +    /*
> > > > +     * We need to update posted interrupt descriptor for each context switch,
> > > > +     * hence cannot use the lazy context switch for this.
> > > > +     */
> > > >
> > > Perhaps it's me, but I don't get the comment. Why do you mention "the
> > > lazy context switch"? We can't use it "for this", as opposed to what
> > > other circumstance where we can use it?
> >
> > Oh, maybe I shouldn't have used that term here. What I want to say is
> > that __context_switch() isn't called on every context switch (e.g.,
> > non-idle vcpu -> idle vcpu), so we need to call
> > prev->arch.pi_ctxt_switch_from explicitly, instead of in
> > __context_switch().
> >
> Ok, I see what you mean now, and it's probably correct, as 'lazy context
> switch' is, in this context, exactly that (i.e., not actually context
> switching if next is the idle vcpu).
>
> It's just that such a term is used, in the literature, to mean (slightly)
> different things in different places, and there is no close reference to
> it (like in the function), so I still see a bit of room for potential
> confusion.
>
> In the end, as you wish. If it were me, I'd add a few words to specify
> things better, something very similar to what you've put in this email,
> e.g.:
>
> "When switching from non-idle to idle, we only do a lazy context switch.
> However, in order for posted interrupts (if available and enabled) to
> work properly, we at least need to update the descriptors"

Sounds good!

> Or some better English form of it. :-)
>
> But that's certainly not critical, and I'll be ok with whatever the
> other maintainers agree on.
>
> > > >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> > > >           (is_idle_vcpu(next) && cpu_online(cpu)) )
> > > >      {
> > > > +        if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
> > >
> > > Same as above.
> > >
> > > > +            next->arch.pi_ctxt_switch_to(next);
> > > > +
> > > >          local_irq_enable();
> > >
> > > Another thing: if prev == next --and let's call such a vcpu pp-- you
> > > go through both:
> > >
> > >     pp->arch.pi_ctxt_switch_from(pp);
> > >     pp->arch.pi_ctxt_switch_to(pp);
> >
> > In my understanding, if the scheduler chooses the same vcpu to run, it
> > will return early in schedule(), as below:
> >
> > static void schedule(void)
> > {
> >     ....
> >
> >     /* get policy-specific decision on scheduling... */
> >     sched = this_cpu(scheduler);
> >     next_slice = sched->do_schedule(sched, now, tasklet_work_scheduled);
> >
> >     next = next_slice.task;
> >
> >     sd->curr = next;
> >
> >     if ( next_slice.time >= 0 ) /* -ve means no limit */
> >         set_timer(&sd->s_timer, now + next_slice.time);
> >
> >     if ( unlikely(prev == next) )
> >     {
> >         pcpu_schedule_unlock_irq(lock, cpu);
> >         trace_continue_running(next);
> >         return continue_running(prev);
> >     }
> >
> >     ....
> > }
> >
> > If this is the case, when we get to context_switch(), prev and next
> > are different. Do I miss something?
> >
> That looks correct. Still, there are checks like '(prev != next)' around
> in context_switch(), for both x86 and ARM... weird. I shall have a
> deeper look...
>
> In any case, as far as this hunk is concerned, the
> '(per_cpu(curr_vcpu, cpu) == next)' is there to deal with the case where
> we went from vcpu v to idle, and we're now going from idle to v again,
> which is something you want to intercept.
>
> So, at least for now, ignore my comments about it. I'll let you know if
> I find something interesting that you should take into account.
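Putting the hunks above together, the context_switch() flow under
discussion looks roughly like this. This is only a sketch based on the
quoted diff, with Dario's suggested comment folded in; the
pi_ctxt_switch_from/to hooks are the ones this series introduces, while
the surrounding code is simplified rather than the actual
xen/arch/x86/domain.c:

    void context_switch(struct vcpu *prev, struct vcpu *next)
    {
        unsigned int cpu = smp_processor_id();

        /* ... */

        set_current(next);

        /*
         * When switching from non-idle to idle, we only do a lazy context
         * switch (__context_switch() is not called).  However, for posted
         * interrupts (if available and enabled) to work properly, the
         * posted-interrupt descriptor must be updated on every context
         * switch, so the hooks are invoked explicitly here.
         */
        if ( !is_idle_vcpu(prev) && prev->arch.pi_ctxt_switch_from )
            prev->arch.pi_ctxt_switch_from(prev);

        if ( (per_cpu(curr_vcpu, cpu) == next) ||
             (is_idle_vcpu(next) && cpu_online(cpu)) )
        {
            /* Lazy path: no full __context_switch(), but a non-idle
             * 'next' still needs its descriptor updated. */
            if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
                next->arch.pi_ctxt_switch_to(next);

            local_irq_enable();
        }
        else
        {
            __context_switch();
            /* ... the full, non-lazy switch path ... */
        }

        /* ... */
    }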
> > > > --- a/xen/common/schedule.c
> > > > +++ b/xen/common/schedule.c
> > > > @@ -381,6 +381,8 @@ void vcpu_wake(struct vcpu *v)
> > > >      unsigned long flags;
> > > >      spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
> > > >
> > > > +    arch_vcpu_wake(v);
> > > > +
> > > So, in the draft you sent a few days back, this was called at the end
> > > of vcpu_wake(), right before releasing the lock. Now it's at the
> > > beginning, before the scheduler's wakeup routine has a chance to run.
> > >
> > > IMO, it feels more natural for it to be at the bottom (i.e., generic
> > > stuff first, arch specific stuff afterwards), and, after a quick
> > > inspection, I don't see anything preventing things from being that
> > > way.
> > >
> > > However, I recall you mentioning having issues with such a draft,
> > > which are now resolved with this version.
> >
> > The long latency issue mentioned previously was caused by something
> > else. Originally I called 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to'
> > in __context_switch(); however, that function is not called on every
> > context switch, as I described above. After fixing this, the
> > performance issue disappeared.
> >
> I see, thanks for explaining this.
>
> > > Since this is one of the differences between the two, was it the
> > > cause of the issues you were seeing? If yes, can you elaborate on how
> > > and why?
> > >
> > > In the end, I'm not too opposed to the hook being at the beginning
> > > rather than at the end, but there has to be a reason, which may well
> > > end up being better stated in a comment...
> >
> > Here is the reason I put arch_vcpu_wake() at the beginning of
> > vcpu_wake(): arch_vcpu_wake() does some prerequisites for a vCPU which
> > is about to run, such as setting SN again and changing the NV field
> > back to 'posted_intr_vector', which should be finished before the vCPU
> > is actually scheduled to run. However, if we put arch_vcpu_wake() later
> > in vcpu_wake(), right before 'vcpu_schedule_unlock_irqrestore', then
> > after the 'wake' hook has finished the vCPU could run at any time
> > (maybe on another pCPU, since only the current pCPU is protected by the
> > lock), and if that can happen, it is incorrect. Does my understanding
> > make sense?
> >
> It's safe in any case. In fact, the spinlock will prevent both the
> vcpu's processor from scheduling, and any other processor from stealing
> the waking vcpu from the runqueue to run it.

Good to know this. Regarding "any other processor from stealing the
waking vcpu from the runqueue to run it": could you please point me to
the relevant code, so I can better understand how this is protected by
the spinlock? Thank you!

Thanks,
Feng

> That's actually why I wanted to double check your changing the position
> of the hook (wrt the draft), as it felt weird that the issue was in
> there. :-)
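As an illustration of the serialization Dario describes, consider the
following sketch. The waking side is a simplified version of
xen/common/schedule.c; the stealing side is modelled on the credit
scheduler's load balancing (csched_load_balance()/csched_runq_steal()),
with steal_work() being an illustrative name rather than a real
function:

    /*
     * Waking side (xen/common/schedule.c, simplified): the whole wakeup
     * runs under the schedule lock of v->processor.
     */
    void vcpu_wake(struct vcpu *v)
    {
        unsigned long flags;
        spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);

        arch_vcpu_wake(v);                  /* the hook added by this patch */

        if ( likely(vcpu_runnable(v)) )
            SCHED_OP(VCPU2OP(v), wake, v);  /* queue v, possibly send IPIs */

        vcpu_schedule_unlock_irqrestore(lock, flags, v);
    }

    /*
     * Stealing side (heavily simplified): a peer pCPU must take the
     * victim pCPU's schedule lock (the very lock vcpu_wake() is holding
     * above) before it can take vcpus off that runqueue.
     */
    static struct vcpu *steal_work(unsigned int peer_cpu)
    {
        struct vcpu *v = NULL;
        spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);

        if ( lock != NULL )
        {
            /* ... scan peer_cpu's runqueue for a migratable vcpu ... */
            pcpu_schedule_unlock(lock, peer_cpu);
        }

        return v;  /* NULL if the lock was busy or nothing could be stolen */
    }

The key point is that vcpu_schedule_lock_irqsave(v, &flags) takes the
schedule lock of v->processor, i.e. the same per-pCPU lock a would-be
thief has to acquire, so the wakeup and the steal cannot overlap.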
> So, now that we know that safety is not an issue, where should we put
> the hook?
>
> Having it before SCHED_OP(wake) may make people think that arch specific
> code is (or can, at some point) somehow influencing the scheduler
> specific wakeup code, which is not (and should not become, if possible)
> the case.
>
> However, I kind of like the fact that the spinlock is released as soon
> as possible after the call to SCHED_OP(wake). That makes it more likely
> for the processors we may have sent IPIs to, during the scheduler
> specific wakeup code, to find the spinlock free. So, looking at things
> from this angle, it would be better to avoid putting stuff in between
> SCHED_OP(wake) and vcpu_schedule_unlock().
>
> So, all in all, I'd say leave it on top, where it is in this patch. Of
> course, if others have opinions, I'm all ears. :-)
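To make the trade-off concrete, the two orderings being weighed look
like this (a sketch of the vcpu_wake() critical section only; see the
hunk quoted earlier for the real context):

    /*
     * Placement in this patch: the arch hook runs before the scheduler's
     * wakeup, so nothing sits between SCHED_OP(wake) and the unlock, and
     * the pCPUs IPI'd during the wakeup find the lock free sooner.
     */
    lock = vcpu_schedule_lock_irqsave(v, &flags);
    arch_vcpu_wake(v);
    SCHED_OP(VCPU2OP(v), wake, v);        /* may send IPIs */
    vcpu_schedule_unlock_irqrestore(lock, flags, v);

    /*
     * Placement in the earlier draft: the hook runs after the wakeup,
     * lengthening the window in which the IPI'd pCPUs spin on the lock.
     */
    lock = vcpu_schedule_lock_irqsave(v, &flags);
    SCHED_OP(VCPU2OP(v), wake, v);        /* may send IPIs */
    arch_vcpu_wake(v);
    vcpu_schedule_unlock_irqrestore(lock, flags, v);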
> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)