
Re: [Xen-devel] [PATCH RFC] pass-through: sync pir to irr after msix vector been updated



On 9/16/19 1:01 AM, Jan Beulich wrote:
> On 13.09.2019 18:38, Joe Jin wrote:
>> Hi Jan,
>>
>> Thanks for your reply, see my reply in line please.
>>
>> On 9/13/19 12:14 AM, Jan Beulich wrote:
>>> On 12.09.2019 20:03, Joe Jin wrote:
>>>> With below testcase, guest kernel reported "No irq handler for vector":
>>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>>>   2). Start rds-stress between 2 guests.
>>>>   3). Scale down 2 guests vcpu from 32 to 6 at the same time.
>>>>
>>>> Repeating the above test for several iterations, the guest kernel reported "No irq
>>>> handler for vector", and IB traffic dropped to zero, caused by a lost interrupt.
>>>>
>>>> When a vCPU goes offline, the guest kernel disables local IRQs, migrates the
>>>> IRQ to another CPU, updates the MSI-X table, and re-enables IRQs. Any interrupt
>>>> that arrives after local IRQs are disabled but before the MSI-X table has been
>>>> updated still uses the old vector and destination CPU, so when local IRQs are
>>>> enabled again the interrupt is sent to the wrong CPU and vector.
>>>>
>>>> Syncing PIR to IRR after the MSI-X table has been updated appears to help with
>>>> this issue.
>>>
>>> I'm having trouble making the connection, which quite possibly simply
>>> means the description needs to be further extended: Sync-ing PIR to
>>> IRR has nothing to do with a vector change. It would help if nothing
>>> else caused this bitmap propagation, and an interrupt was lost (rather
>>> than delivered through the wrong vector, or to the wrong CPU).
>>> Furthermore with vector and destination being coupled, after a CPU has
>>> been offlined this would generally mean
>>> - if there was just a single destination permitted, lack of delivery
>>>   altogether,
>>> - if there were multiple destinations permitted, delivery to one of
>>>   the other CPUs, at which point the vector would still be valid.
>>
>> When a CPU goes offline in the guest kernel, it only migrates IRQs whose affinity
>> is set to that CPU alone; with multiple destinations, the kernel does not migrate
>> (which would include updating the MSI-X table with a new destination and vector).
>>
>> After IRQ migration, the kernel checks each vector's IRR; if the APIC IRR bit is
>> set, it retriggers the IRQ to the new destination. This is intended to avoid
>> losing any interrupt.
>>
>> But on Xen, after the MSI-X table has been updated, nothing checks for a pending
>> IRQ or notifies the guest kernel about it.
>>
>>>
>>> An interesting aspect would be on which CPU the log message was
>>> observed, and how this correlates with the destination sets of the
>>> CPUs that have got offlined. From there it would then further be
>>> interesting to understand why the interrupt made it to that CPU,
>>> since - as said - destination and vector get changed together, and
>>> hence with things going wrong it would be of interest to know whether
>>> the CPU receiving the IRQ is within the new destination set, or some
>>> (random?) other one.
>>
>> irq_retrigger() is called after the kernel updates the vector, and it sends any
>> pending IRQ(s) to the new destination.
>>
>> Here is the kernel log from when the issue happened. The guest kernel is 4.1; on
>> a 4.14 guest it is almost the same, except there is no "(irq -1)" due to kernel
>> changes, and the IRQ migration workflow is the same (fixup_irqs()):
>>
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 17 is now offline
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 18 is now offline
>> Sep 12 20:26:46 localhost kernel: smpboot: CPU 19 is now offline
>> Sep 12 20:26:47 localhost kernel: Broke affinity for irq 251
>> Sep 12 20:26:47 localhost kernel: do_IRQ: 20.178 No irq handler for vector (irq -1)
>> Sep 12 20:26:47 localhost kernel: smpboot: CPU 20 is now offline
>> Sep 12 20:26:47 localhost kernel: smpboot: CPU 21 is now offline
>>
>> From the above you can see the IRQ was sent to CPU 20, which was going offline.
>>
>> The IRQ arrived at that CPU as soon as IRQs were re-enabled; only after the CPU
>> had gone offline did the kernel print "smpboot: CPU 20 is now offline".
>>
>> The call path in the kernel is as follows:
>> cpu_down()
>>   |-> cpu_down_maps_locked()
>>   |     _cpu_down
>>   |       |-> __stop_machine
>>   |             |-> stop_cpus()
>>   |                   |->__stop_cpus()
>>   |                        |- queue_stop_cpus_work() ---+
>>   |-> __cpu_die()                                       |
>>          |-> pr_info("CPU %u is now offline\n", cpu);   |
>>      +--------------------------------------------------+
>>      |
>>      +
>> multi_cpu_stop()
>>   |-> local_save_flags
>>   |-> take_cpu_down()
>>   |      |-> __cpu_disable
>>   |            |-> smp_ops.cpu_disable = xen_cpu_disable
>>   |                  |-> cpu_disable_common
>>   |                        |-> fixup_irqs <== IRQ migration.
>>   |-> local_irq_restore()
> 
> Ah yes, this makes sense. You want to extend the description to
> reflect some of the further explanation above.

I will add more explanation later.
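For the description, the guest-side sequence I am referring to is roughly the
following (paraphrased from the fixup_irqs() path above, not the exact kernel
code):

    /* on the vCPU going offline, with local interrupts disabled */
    fixup_irqs();
        /* 1) pick a new destination CPU and vector              */
        /* 2) write the new MSI-X table entry                    */
        /* 3) check the local APIC IRR for the old vector and,   */
        /*    if a pending bit is set, retrigger the IRQ to the  */
        /*    new destination so it is not lost                  */
    /* local interrupts enabled again */

The window is between interrupts being disabled and the MSI-X update: an
interrupt posted in that window still uses the old vector and destination, and
because Xen does not propagate PIR to IRR at that point, the IRR check in step
3 finds nothing to retrigger and the interrupt is lost.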

> 
>>>> --- a/xen/drivers/passthrough/io.c
>>>> +++ b/xen/drivers/passthrough/io.c
>>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>>              }
>>>> +
>>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>>
>>> If the need for this change can be properly explained, then it
>>> still wants converting to alternative_vcall() - like the other
>>> caller of this hook. Or perhaps even better, move vlapic.c's
>>> wrapper (suitably renamed) into hvm.h, and use it here.
>>
>> Yes, I agree. I'm not 100% sure, which is why I marked it RFC.
> 
> And btw, please also attach a brief comment here, to clarify
> why the syncing is needed precisely at this point.
> 
>>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>>> (right after your code insertion) allows for the field to be
>>> invalid, which I think you need to guard against.
>>
>> I think you mean multiple destinations, in which case it's -1?
> 
> The reasons why it might be -1 are irrelevant here, I think.
> You need to handle the case both to avoid an out-of-bounds
> array access and to make sure an IRR bit wouldn't still get
> propagated too late in some special case.

Add the following check?
            if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus &&
                 d->vcpu[dest_vcpu_id]->runstate.state <= RUNSTATE_blocked )
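With the vlapic.c wrapper moved into hvm.h as you suggested, the whole hunk
could then look something like the following (only a sketch, the wrapper name
is tentative):

    /* e.g. in hvm.h, lifted from vlapic.c and renamed */
    static inline void hvm_sync_pir_to_irr(struct vcpu *v)
    {
        if ( hvm_funcs.sync_pir_to_irr )
            alternative_vcall(hvm_funcs.sync_pir_to_irr, v);
    }

    /* in pt_irq_create_bind(), after gvec/gflags have been updated */
    if ( dest_vcpu_id >= 0 && dest_vcpu_id < d->max_vcpus )
        /*
         * Propagate any interrupt posted with the old vector/destination
         * into the guest's IRR so that its retrigger logic can see it.
         */
        hvm_sync_pir_to_irr(d->vcpu[dest_vcpu_id]);

(possibly plus the runstate check above, if that turns out to be needed)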

> 
> Also - what about the respective other path in the function,
> dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
> seems to me that there's the same chance of deferring IRR
> propagation for too long?

This is possible. Could you please advise how to find which vCPU is associated
with the IRQ in those cases? I did not find any helper for this in current Xen.
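If there is no direct mapping to a single vCPU in those cases, would syncing
all of the domain's vCPUs be acceptable as a fallback? Again only a sketch,
using the same tentative wrapper as above:

    struct vcpu *v;

    for_each_vcpu ( d, v )
        hvm_sync_pir_to_irr(v);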

Thanks,
Joe

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

