[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 4/9] AMD/IOMMU: introduce 128-bit IRTE non-guest-APIC IRTE format



>>> On 18.06.19 at 13:57, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 13/06/2019 14:23, Jan Beulich wrote:
>> This is in preparation of actually enabling x2APIC mode, which requires
>> this wider IRTE format to be used.
>>
>> A specific remark regarding the first hunk changing
>> amd_iommu_ioapic_update_ire(): This bypass was introduced for XSA-36,
>> i.e. by 94d4a1119d ("AMD,IOMMU: Clean up old entries in remapping
>> tables when creating new one"). Other code introduced by that change has
>> meanwhile disappeared or further changed, and I wonder if - rather than
>> adding an x2apic_enabled check to the conditional - the bypass couldn't
>> be deleted altogether. For now the goal is to affect the non-x2APIC
>> paths as little as possible.
>>
>> Take the liberty and use the new "fresh" flag to suppress an unneeded
>> flush in update_intremap_entry_from_ioapic().
> 
> What is the meaning of fresh?  Wouldn't "needs_update" be a more
> descriptive name?

I don't think so, no. "Fresh" means "freshly allocated" and hence not
holding any meaningful data yet.

>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>> @@ -35,12 +35,34 @@ struct irte_basic {
>>      unsigned int :8;
>>  };
>>  
>> +struct irte_full {
>> +    unsigned int remap_en:1;
>> +    unsigned int sup_io_pf:1;
>> +    unsigned int int_type:3;
>> +    unsigned int rq_eoi:1;
>> +    unsigned int dm:1;
>> +    unsigned int guest_mode:1; /* MBZ */
> 
> /* MBZ - not implemented yet. */
> 
> Seeing as interrupt posting will be a minor tweak to this data
> structure, rather than implementing a new one.

Again I don't think so: Bits 2...6 have entirely different meaning,
so I think we'd better not try to fold both into one structure.
Separate structures will also better enforce thinking about using
the correct one in any given context, I think (hope).

>> +    unsigned int dest_lo:24;
>> +    unsigned int :32;
>> +    unsigned int vector:8;
>> +    unsigned int :24;
>> +    unsigned int :24;
>> +    unsigned int dest_hi:8;
> 
> The manual says that we should prefer aligned 64bit access, so some
> raw_{lo,hi} fields here will allow...
> 
>> @@ -136,7 +170,21 @@ static void free_intremap_entry(unsigned
>>  {
>>      union irte_ptr entry = get_intremap_entry(seg, bdf, offset);
>>  
>> -    *entry.basic = (struct irte_basic){};
>> +    switch ( irte_mode )
>> +    {
>> +    case irte_basic:
>> +        *entry.basic = (struct irte_basic){};
>> +        break;
>> +
>> +    case irte_full:
>> +        entry.full->remap_en = 0;
>> +        wmb();
> 
> ... this to become
> 
> entry._128->raw_lo = 0;
> smp_wmb();
> entry._128->raw_hi = 0;
> 
> The interrupt mapping table is allocated in WB memory and accessed
> coherently, so an sfence instruction isn't necessary.  All that matters
> is that remap_en gets cleared first.

I've been trying to spot such a coherency statement - where did I
overlook this being said?

As to introducing "raw" fields - see my reply on the earlier patch. It's
possible, but has other downsides.

And finally on smb_wmb() - there's nothing SMP-ish here. The barrier
is specifically not to vanish (in the theoretical case of us patching in
SMP alternatives) in the UP case. Either it's wmb(), or it's just barrier().

>> @@ -154,8 +202,38 @@ static void update_intremap_entry(union
>>          .dest = dest,
>>          .vector = vector,
>>      };
>> +    struct irte_full full = {
>> +        .remap_en = 1,
>> +        .sup_io_pf = 0,
>> +        .int_type = int_type,
>> +        .rq_eoi = 0,
>> +        .dm = dest_mode,
>> +        .dest_lo = dest,
>> +        .dest_hi = dest >> 24,
>> +        .vector = vector,
>> +    };
> 
> Looking at the resulting code after this patch, I think these structures
> should move into their respective case blocks, to help the compiler to
> avoid initialising both.

I admit I didn't check the code, but I'm pretty surprised you say they
can't track this. I pretty much dislike the improperly indented braces
we use to frame case blocks with their own local variables, which is
why I've decided to put the variables where they are now (assuming
that it ought to be rather easy for the compiler to move the actual
initialization into the switch()).

>> +
>> +    switch ( irte_mode )
>> +    {
>> +        __uint128_t ret;
>> +        union {
>> +            __uint128_t raw;
>> +            struct irte_full full;
>> +        } old;
>> +
>> +    case irte_basic:
>> +        *entry.basic = basic;
>> +        break;
>> +
>> +    case irte_full:
>> +        old.full = *entry.full;
>> +        ret = cmpxchg16b(entry.full, &old, &full);
>> +        ASSERT(ret == old.raw);
> 
> Similarly, this can be implemented with
> 
> entry.full->remap_en = 0;
> smp_wmb();
> entry._128->raw_hi = full.raw_hi;
> smp_wmb();
> entry._128->raw_lo = full.raw_lo;
> 
> which avoids using a locked operation.

Right, and the locked operation goes away again in the last patch.
As said there, that part of that patch can probably be moved here.

But first of all we need to settle on whether we want "raw" union
members.

>> @@ -169,6 +247,11 @@ static inline void set_rte_index(struct
>>      rte->delivery_mode = offset >> 8;
>>  }
>>  
>> +static inline unsigned int get_full_dest(const struct irte_full *entry)
>> +{
>> +    return entry->dest_lo | (entry->dest_hi << 24);
> 
> Given your observation on my still-not-upstream-yet patch about GCC not
> honouring the type of bitfields, doesn't dest_hi need explicitly casting
> to unsigned int before the shift, to avoid it being performed as int?

Hmm, interesting question. I think you're right, and a little bit of
experimenting supports your position.

>> @@ -280,6 +392,18 @@ int __init amd_iommu_setup_ioapic_remapp
>>              dest_mode = rte.dest_mode;
>>              dest = rte.dest.logical.logical_dest;
>>  
>> +            if ( iommu->ctrl.xt_en )
>> +            {
>> +                /*
>> +                 * In x2APIC mode we have no way of discovering the high 24
>> +                 * bits of the destination of an already enabled interrupt.
> 
> Yes we do.  We read the interrupt remapping table, which is where that
> information lives.
> 
> Any firmware driver which is following the spec won't have let an IRTE
> get cached, then played with the table without appropriate flushing.

Which interrupt remapping table? This is code which runs when we've
just freshly allocated one, with all zeros in it. I'm unaware of a protocol
to communicate whatever interrupt remapping table the firmware may
have used.

But - how would there legitimately be an enabled interrupt here in the
first place?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.