
Re: [PATCH v2 for-4.21 2/9] x86/HPET: use single, global, low-priority vector for broadcast IRQ


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 22 Oct 2025 11:21:15 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
  • Delivery-date: Wed, 22 Oct 2025 09:21:27 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 21.10.2025 15:49, Roger Pau Monné wrote:
> On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote:
>> On 20.10.2025 18:22, Roger Pau Monné wrote:
>>> On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote:
>>>> Using dynamically allocated / maintained vectors has several downsides:
>>>> - possible nesting of IRQs due to the effects of IRQ migration,
>>>> - reduction of vectors available for devices,
>>>> - IRQs not moving as intended if there's shortage of vectors,
>>>> - higher runtime overhead.
>>>>
>>>> As the vector also doesn't need to be of any priority (first and foremost
>>>> it really shouldn't be of higher or same priority as the timer IRQ, as
>>>> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogether
>>>> and use a vector from the 0x10...0x1f exception vector space. Exception vs
>>>> interrupt can easily be distinguished by checking for the presence of an
>>>> error code.
>>>>
>>>> With a fixed vector, less updating is now necessary in
>>>> set_channel_irq_affinity(); in particular channels don't need transiently
>>>> masking anymore, as the necessary update is now atomic. To fully leverage
>>>> this, however, we want to stop using hpet_msi_set_affinity() there. With
>>>> the transient masking dropped, we're no longer at risk of missing events.
>>>>
>>>> In principle a change to setup_vector_irq() would be necessary, but only
>>>> if we used low-prio vectors as direct-APIC ones. Since the change would be
>>>> at best benign here, it is being omitted.
>>>>
>>>> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
>>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>>> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
>>>> ---
>>>> This is an alternative proposal to
>>>> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
>>>>
>>>> Should we keep hpet_msi_set_affinity() at all? We'd better not have the
>>>> generic IRQ subsystem play with our IRQs' affinities ... (If so, this
>>>> likely would want to be a separate patch, though.)
>>>
>>> I think that needs to become a no-op, with possibly an ASSERT?  Is it
>>> possible for dom0 to try to balance this IRQ?  I would think not.
>>
>> I'd consider it an error if that was possible. But then the same goes for
>> other Xen-internal IRQs, like the IOMMU ones. They all implement a
>> .set_affinity hook ...
> 
> We need such hook for fixup_irqs() at least, so that interrupts can be
> evacuated when an AP goes offline.

Hmm, yes. Just not here.

>>>> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g
>>>>  static void set_channel_irq_affinity(struct hpet_event_channel *ch)
>>>>  {
>>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
>>>> +    struct msi_msg msg = ch->msi.msg;
>>>>  
>>>>      ASSERT(!local_irq_is_enabled());
>>>>      spin_lock(&desc->lock);
>>>> -    hpet_msi_mask(desc);
>>>> -    hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
>>>> -    hpet_msi_unmask(desc);
>>>> +
>>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
>>>> +
>>>> +    /*
>>>> +     * Open-coding a reduced form of hpet_msi_set_affinity() here.  With the
>>>> +     * actual update below (either of the IRTE or of [just] message address;
>>>> +     * with interrupt remapping message address/data don't change) now being
>>>> +     * atomic, we can avoid masking the IRQ around the update.  As a result
>>>> +     * we're no longer at risk of missing IRQs (provided hpet_broadcast_enter()
>>>> +     * keeps setting the new deadline only afterwards).
>>>> +     */
>>>> +    cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
>>>> +
>>>>      spin_unlock(&desc->lock);
>>>>  
>>>> -    spin_unlock(&ch->lock);
>>>> +    msg.dest32 = cpu_physical_id(ch->cpu);
>>>> +    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
>>>> +    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
>>>> +    if ( msg.dest32 != ch->msi.msg.dest32 )
>>>> +    {
>>>> +        ch->msi.msg = msg;
>>>> +
>>>> +        if ( iommu_intremap != iommu_intremap_off )
>>>> +        {
>>>> +            int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
>>>>  
>>>> -    /* We may have missed an interrupt due to the temporary masking. */
>>>> -    if ( ch->event_handler && ch->next_event < NOW() )
>>>> -        ch->event_handler(ch);
>>>> +            ASSERT(rc <= 0);
>>>> +            if ( rc > 0 )
>>>> +            {
>>>> +                ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
>>>> +                ASSERT(msg.address_lo ==
>>>> +                       hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
>>>> +            }
>>>
>>> The sequence of asserts seems wrong here: the asserts inside of the
>>> rc > 0 check will never trigger, because there's an ASSERT(rc <= 0)
>>> ahead of them?
>>
>> Hmm. My way of thinking was that if we get back 1 (which we shouldn't),
>> we ought to check (and presumably fail on) data or address having changed.
> 
> Right, but the ASSERT(rc <= 0) will prevent reaching any of the
> followup ASSERTs if rc == 1?

Which is no problem, as we'd be dead already anyway if the first assertion
triggered. Nevertheless I've switched the if() to >= 0 (which then pointed
out a necessary change in AMD IOMMU code).
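
To illustrate the difference between the two guards, here is a minimal
stand-alone sketch (not Xen code). struct route_sketch and
compare_if_guarded() are hypothetical names made up for this example, and
the sketch assumes, as the exchange above suggests, that a positive rc is
not expected on this path. It only models which rc values let the
data/address comparison run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the message data / address-low pair. */
struct route_sketch {
    uint32_t data;
    uint32_t address_lo;
};

/*
 * Decide whether the "did data/address stay the same" comparison runs for
 * a given rc, and if so store its result in *same.
 *   inclusive == false models the posted guard,  if ( rc > 0 )
 *   inclusive == true  models the revised guard, if ( rc >= 0 )
 * Returns true when the comparison was performed.
 */
static bool compare_if_guarded(int rc, bool inclusive,
                               const struct route_sketch *msg,
                               const struct route_sketch *hw,
                               bool *same)
{
    bool run = inclusive ? rc >= 0 : rc > 0;

    assert(msg != NULL && hw != NULL && same != NULL);

    if ( run )
        *same = msg->data == hw->data && msg->address_lo == hw->address_lo;

    return run;
}
```

With rc == 0 the strict guard skips the comparison, whereas the inclusive
one performs it on the ordinary success path, so a mismatch there would be
noticed.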

>  IOW, we possibly want:
> 
>             if ( rc > 0 )
>             {
>                 dprintk(XENLOG_ERR,
>                         "Unexpected HPET MSI setup returned: data: %#x address: %#lx expected data %#x address %#lx\n",
>                         msg.data, msg.address,
>                         ch->msi.msg.data, ch->msi.msg.address);
>                 ASSERT_UNREACHABLE();
>                 hpet_msi_mask(desc);
>                 hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx));
>                 hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
>                 hpet_msi_unmask(desc);
>             }
>             ASSERT(!rc);

To be honest, for my taste this goes too far as to what follows an
ASSERT_UNREACHABLE().

> I'm unsure about attempting to propagate the returned values on release
> builds, I guess it's slightly better than possibly using an outdated
> RTE entry?  Albeit this should never happen.

Yes to the last remark; I don't actually see what you would want to do
with the propagated return value.

> Also, should the desc->arch.cpu_mask update only be done after the MSI
> fields have been correctly updated, so that in case of failure of
> iommu_update_ire_from_msi() we could return early from the function
> and avoid changing desc->arch.cpu_mask?

Hmm, yes, I could do that, but then also in hpet_msi_set_affinity().
However, as this needs doing under the IRQ descriptor lock, I'd have to
either extend the locked region here (again), or re-acquire the lock
later. Neither looks very attractive to me.
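
The trade-off can be sketched roughly as follows. This is a stand-alone
model, not Xen code: struct desc_sketch, the sketch_lock() / sketch_unlock()
helpers and both retarget_*() functions are hypothetical, the "lock" merely
counts acquisitions, and update_rc stands in for whatever the IRTE / message
update would return (negative assumed to mean failure).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an IRQ descriptor; the lock only counts uses. */
struct desc_sketch {
    unsigned int lock_acquisitions;
    unsigned int cpu_mask;      /* single-CPU mask modeled as a CPU number */
};

static void sketch_lock(struct desc_sketch *d)   { d->lock_acquisitions++; }
static void sketch_unlock(struct desc_sketch *d) { (void)d; }

/* Variant A: the mask is written up front, before the update is attempted. */
static int retarget_mask_first(struct desc_sketch *d, unsigned int cpu,
                               int update_rc)
{
    assert(d != NULL);

    sketch_lock(d);
    d->cpu_mask = cpu;          /* changes even if the update later fails */
    sketch_unlock(d);

    return update_rc;
}

/* Variant B: the mask is written only on success, by re-taking the lock. */
static int retarget_mask_last(struct desc_sketch *d, unsigned int cpu,
                              int update_rc)
{
    assert(d != NULL);

    sketch_lock(d);
    /* Work that already needs the lock would go here. */
    sketch_unlock(d);

    if ( update_rc < 0 )
        return update_rc;       /* early return, mask left untouched */

    sketch_lock(d);
    d->cpu_mask = cpu;
    sketch_unlock(d);

    return update_rc;
}
```

Variant A takes the lock once but leaves the mask changed on failure;
variant B keeps the mask intact on failure at the price of a second
acquisition on every successful call.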

Jan



 

