
Re: [Xen-devel] [PATCH for-4.13 v4 2/3] x86/passthrough: fix migration of MSI when using posted interrupts



On Thu, Nov 14, 2019 at 02:35:56PM +0100, Jan Beulich wrote:
> On 13.11.2019 16:59, Roger Pau Monne wrote:
> > +    for ( id = find_first_bit(vcpus, d->max_vcpus);
> > +          id < d->max_vcpus;
> > +          id = find_next_bit(vcpus, d->max_vcpus, id + 1) )
> > +    {
> > +        if ( d->vcpu[id] != current )
> > +            vcpu_pause(d->vcpu[id]);
> 
> Isn't this setting us up for a deadlock if two parties come here
> for the same domain, and both on a vCPU belonging to that domain
> (and with the opposite one's bit set in the bitmap)? But it looks
> like d would never be the current domain here - you will want to
> assert and comment on this, though. At that point the comparisons
> against current can then go away as well afaict.

The above is true for syncs triggered by MSI changes that bounce to
QEMU and then get forwarded to Xen as DOMCTLs, but AFAICT syncs that
result from a vIO-APIC entry write (patch #3) will have v ==
current.

vIO-APIC writes however use the d->arch.hvm.irq_lock, so it's not
possible to process multiple vCPUs vIO-APIC accesses at the same time.
I'm afraid I don't know which kind of assert should be added
here. I could add a comment, but that seems fragile.

> 
> > @@ -345,6 +289,8 @@ int pt_irq_create_bind(
> >          const struct vcpu *vcpu;
> >          uint32_t gflags = pt_irq_bind->u.msi.gflags &
> >                            ~XEN_DOMCTL_VMSI_X86_UNMASKED;
> > +        DECLARE_BITMAP(dest_vcpus, MAX_VIRT_CPUS) = { };
> > +        DECLARE_BITMAP(prev_vcpus, MAX_VIRT_CPUS) = { };
> 
> This is reachable for HVM domains only, isn't it? In which case
> why the much larger MAX_VIRT_CPUS (creating two unreasonably big
> local variables) instead of HVM_MAX_VCPUS? However, even once
> switched I'd be opposed to this - There'd be a fair chance that
> the need to deal with these variables might go unnoticed once
> the maximum vCPU count for HVM gets increased (which has been a
> pending todo item for many years now).

See below, after your rant about how to fix it.

> > @@ -420,20 +384,16 @@ int pt_irq_create_bind(
> >          delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
> >                                    XEN_DOMCTL_VMSI_X86_DELIV_MASK);
> >  
> > -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> > +        hvm_intr_get_dests(d, dest, dest_mode, delivery_mode, dest_vcpus);
> > +        dest_vcpu_id = bitmap_weight(dest_vcpus, d->max_vcpus) != 1 ?
> > +            -1 : find_first_bit(dest_vcpus, d->max_vcpus);
> >          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> >          spin_unlock(&d->event_lock);
> >  
> >          pirq_dpci->gmsi.posted = false;
> >          vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
> > -        if ( iommu_intpost )
> > -        {
> > -            if ( delivery_mode == dest_LowestPrio )
> > -                vcpu = vector_hashing_dest(d, dest, dest_mode,
> > -                                           pirq_dpci->gmsi.gvec);
> > -            if ( vcpu )
> > -                pirq_dpci->gmsi.posted = true;
> > -        }
> > +        if ( vcpu && iommu_intpost )
> > +            pirq_dpci->gmsi.posted = true;
> 
> One aspect that I'm curious about: How much posting opportunity do
> we lose in practice by no longer posting when the guest uses lowest
> priority mode with multiple destinations?

Linux seems to use dest_Fixed exclusively, and the same goes for
FreeBSD.

> > @@ -442,6 +402,9 @@ int pt_irq_create_bind(
> >              pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
> >                             info, pirq_dpci->gmsi.gvec);
> >  
> > +        if ( hvm_funcs.deliver_posted_intr )
> > +            domain_sync_vlapic_pir(d, prev_vcpus);
> 
> Accessing hvm_funcs here looks like a layering violation. This
> wants either moving into the function or (seeing the other use)
> abstracting away. Seeing the conditional here (and below) I also
> notice that you calculate prev_vcpus in vain when there's no
> interrupt posting in use.

I could indeed only fill prev_vcpus when posting is in use.

> I guess together with the variable size issue mentioned above a
> possible solution would be:
> - have one bitmap hanging off of pirq_dpci->gmsi,
> - have one bitmap per pCPU,
> - populate the new destination bits into the per-pCPU one,
> - issue the PIR->IRR sync,
> - exchange the per-pCPU and per-DPCI pointers.
> You could then leave the pointers at NULL when no posting is to
> be used, addressing the apparent layering violation here at the
> same time.

Right, the above option avoids having to calculate the possible
destinations twice (once on setup and once on teardown), however it
expands the size of gmsi.

While here, I've also realized that interrupts injected using
XEN_DMOP_inject_msi will also be posted, and I'm afraid there's no way
to track and flush those unless we provide a posted-flush hypercall or
some such, so that emulators can request a PIR flush when interrupts
of fully emulated devices are reconfigured.

OTOH, maybe XEN_DMOP_inject_msi should pause the vCPU, set the bit in
IRR and unpause?

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

