
Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts



>>> On 10.09.15 at 10:59, <feng.wu@xxxxxxxxx> wrote:
>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>> Sent: Thursday, September 10, 2015 4:28 PM
>> >>> On 10.09.15 at 04:07, <feng.wu@xxxxxxxxx> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>> >> Sent: Wednesday, September 09, 2015 6:27 PM
>> >> >>> On 09.09.15 at 10:56, <feng.wu@xxxxxxxxx> wrote:
>> >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>> >> >> Sent: Monday, September 07, 2015 8:55 PM
>> >> >> >>> On 25.08.15 at 03:57, <feng.wu@xxxxxxxxx> wrote:
>> >> >> > +
>> >> >> > +        /*
>> >> >> > +         * Delete the vCPU from the related block list
>> >> >> > +         * if we are resuming from blocked state.
>> >> >> > +         */
>> >> >> > +        if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
>> >> >> > +        {
>> >> >> > +            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> >> >> > +                              v->arch.hvm_vmx.pi_block_cpu),
>> >> >> > +                              flags);
>> >> >> > +            list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> >> >>
>> >> >> Shouldn't you set .pi_block_cpu back to -1 along with removing
>> >> >> the vCPU from the list? Which then ought to allow using just
>> >> >> list_del() here?
>> >> >
>> >> > Here is the story about using list_del_init() instead of
>> >> > list_del() (the same reason applies to using list_del_init() in
>> >> > pi_wakeup_interrupt()).
>> >> > If we used list_del(), we would need to set .pi_block_cpu to
>> >> > -1 after removing the vCPU from the blocked list, and there are
>> >> > two places where the vCPU is removed from the list:
>> >> > - arch_vcpu_wake_prepare()
>> >> > - pi_wakeup_interrupt()
>> >> >
>> >> > That means we would also need to set .pi_block_cpu to -1 in
>> >> > pi_wakeup_interrupt(), which seems problematic to me: since
>> >> > .pi_block_cpu is used to pick the per-CPU spin lock, it is not
>> >> > safe to change it in pi_wakeup_interrupt() while another CPU
>> >> > may be using it at the same time.
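
For reference, the difference that makes list_del_init() safe to call
on an already-removed entry boils down to the following (a minimal
sketch of the usual doubly-linked list implementation, not the actual
Xen code):

    struct list_head {
        struct list_head *next, *prev;
    };

    /* Plain deletion: the entry's own pointers are left stale, so a
     * second list_del() on the same entry corrupts the list. */
    static inline void list_del(struct list_head *entry)
    {
        entry->next->prev = entry->prev;
        entry->prev->next = entry->next;
    }

    /* Deletion that re-initialises the entry to point at itself, so a
     * second list_del_init() on the same entry is a harmless no-op. */
    static inline void list_del_init(struct list_head *entry)
    {
        entry->next->prev = entry->prev;
        entry->prev->next = entry->next;
        entry->next = entry;
        entry->prev = entry;
    }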
>> >> >
>> >> > And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
>> >> > only relies on .pi_block_cpu being set to -1 at initialization
>> >> > time.
>> >> >
>> >> > If you have any good suggestions about this, I will be all ears.
>> >>
>> >> Latching pi_block_cpu into a local variable would take care of that
>> >> part of the problem. Of course you then may also need to check
>> >> that while waiting to acquire the lock pi_block_cpu didn't change.
>> >
>> > Thanks for the suggestion! But I think it is not easy to "check
>> > that while waiting to acquire the lock pi_block_cpu didn't change".
>> > Let's take the following pseudo code as an example:
>> >
>> > int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
>> >
>> > ......
>> >
>> > spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
>> >                local_pi_block_cpu), flags);
>> > list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>> > spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
>> >                    local_pi_block_cpu), flags);
>> >
>> > If .pi_block_cpu is changed to -1 by another CPU while we are
>> > waiting for the lock above, that means the vCPU has already been
>> > removed from the list by someone else, and calling list_del() on
>> > it again here would be problematic. I think there might be other
>> > corner cases like this. So I suggest adding some comments for
>> > list_del_init() as you mentioned below.
>> 
>> That's why I said "check that while waiting to acquire the lock
>> pi_block_cpu didn't change." The code you present does not do
>> this.
> 
> I didn't mean we need to implement it as the code above; I just
> listed it to show that this is hard to do.
> First, how do we check, while waiting to acquire the lock, that
> .pi_block_cpu didn't change?

Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).

> Secondly, even if we can check it, what should we do if .pi_block_cpu
> has changed by the time we acquire the lock, as I mentioned above?

Drop the lock and start over. I.e. (taking your pseudo code)

restart:
    local_pi_block_cpu = ...;
    bail-if-invalid (e.g. -1 in current model)
    spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
    if ( local_pi_block_cpu != actual_pi_block_cpu )
    {
        spin_unlock_irqrestore(&per_cpu(, local_pi_block_cpu), flags);
        goto restart;
    }
    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
    spin_unlock_irqrestore(&per_cpu(, local_pi_block_cpu), flags);
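
Filled out into a complete helper, that sketch might look like the
following (illustrative only: the helper name vmx_pi_list_del() is
made up, and it additionally writes .pi_block_cpu back to -1 under the
lock, as suggested earlier, so that plain list_del() suffices):

    static void vmx_pi_list_del(struct vcpu *v)
    {
        unsigned long flags;
        int cpu;

     restart:
        /* Latch the (possibly changing) blocking CPU into a local. */
        cpu = v->arch.hvm_vmx.pi_block_cpu;
        if ( cpu == -1 )  /* not on any blocked list - nothing to do */
            return;

        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);

        /*
         * .pi_block_cpu may have changed while we were waiting for
         * the lock; if so, we locked the wrong per-CPU list and must
         * start over.
         */
        if ( cpu != v->arch.hvm_vmx.pi_block_cpu )
        {
            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, cpu),
                                   flags);
            goto restart;
        }

        list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
        v->arch.hvm_vmx.pi_block_cpu = -1;
        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);
    }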

>> > Anyway, I went through the main path of the code, as below, and I
>> > really can't find anything unsafe here:
>> >
>> > context_switch() -->
>> >     pi_ctxt_switch_from() -->
>> >         vmx_pre_ctx_switch_pi() -->
>> >             vcpu_unblock() -->
>> >                 vcpu_wake() -->
>> >                     SCHED_OP(VCPU2OP(v), wake, v) -->
>> >                         csched_vcpu_wake() -->
>> >                             __runq_insert()
>> >                             __runq_tickle()
>> >
>> > If you still think it is unsafe, pointing out the exact place
>> > would be welcome!
>> 
>> Once again - I didn't say it's unsafe (and hence can't point at a
>> particular place). What bothers me is that vcpu_unblock() affects
>> scheduler state, and altering scheduler state from inside the
>> context-switch path looks problematic at first glance. I'd be
>> happy to be told there is no such problem, but be aware that this
>> would have to include latent ones: even if things are safe right
>> now, feedback of this kind becoming an actual problem with a later
>> scheduler change is already an issue, as finding the problem then
>> is likely going to be rather hard (and I suspect it won't be you
>> who has to debug it).
> 
> What I can say is that after investigation I haven't found anything
> bad about this vcpu_unblock(), and I tested it in my experiments.
> That is why I'd like to ask for input from the scheduler experts.

Agreed - I'm also awaiting their input.

>> >> >> > +static void vmx_post_ctx_switch_pi(struct vcpu *v)
>> >> >> > +{
>> >> >> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> >> >> > +
>> >> >> > +    if ( !has_arch_pdevs(v->domain) )
>> >> >> > +        return;
>> >> >> > +
>> >> >> > +    if ( x2apic_enabled )
>> >> >> > +        write_atomic(&pi_desc->ndst,
>> >> >> > +                     cpu_physical_id(v->processor));
>> >> >> > +    else
>> >> >> > +        write_atomic(&pi_desc->ndst,
>> >> >> > +                     MASK_INSR(cpu_physical_id(v->processor),
>> >> >> > +                               PI_xAPIC_NDST_MASK));
>> >> >> > +
>> >> >> > +    pi_clear_sn(pi_desc);
>> >> >> > +}
>> >> >>
>> >> >> So you alter where notifications go, but not via which vector? How
>> >> >> is the vCPU going to get removed from the blocked list then?
>> >> >
>> >> > vmx_post_ctx_switch_pi() is called when the vCPU is about to
>> >> > run; by that point, the vector has already been set properly,
>> >> > and the vCPU has been removed from the blocked list if it was
>> >> > blocked before.
>> >>
>> >> So why do you do two updates then (first [elsewhere] to set the
>> >> vector and then here to set the destination)?
>> >
>> > When the vCPU is unblocked and thereby removed from the blocked
>> > list, we need to change the vector back to the normal notification
>> > vector, since the wakeup vector is only used while the vCPU is
>> > blocked and hence on the blocked list. And this is the first place
>> > where we know which pCPU the vCPU will be scheduled on, so we
>> > update the destination field here.
>> 
>> Right, that's what I understood. But isn't the state things are in
>> between these two updates inconsistent? I.e. - where does the
>> notification go if one arrives in this window?
> 
> Before we arrive here, SN is set, so no notification events are sent
> at all; hence there is no notification in that window.

Ah, okay.
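
For readers following along, the fields being discussed (SN, NV, NDST)
live in the posted-interrupt descriptor, which looks roughly like the
following (a sketch based on the VT-d specification and this patch
series; the exact declaration in Xen may differ):

    struct pi_desc {
        u32 pir[8];          /* posted-interrupt requests, 1 bit/vector */
        union {
            struct {
                u16 on : 1,  /* outstanding notification */
                    sn : 1,  /* suppress notification: while set, the
                              * hardware sends no notification events */
                    rsvd_1 : 14;
                u8  nv;      /* notification vector: switched between the
                              * normal and the wakeup vector */
                u8  rsvd_2;
                u32 ndst;    /* notification destination (APIC ID); in
                              * xAPIC mode the ID sits in bits 15:8,
                              * hence the MASK_INSR() above */
            };
            u64 control;
        };
        u32 rsvd[6];
    } __attribute__ ((aligned (64)));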

Jan
