Re: [PATCH v2 for-4.21 2/9] x86/HPET: use single, global, low-priority vector for broadcast IRQ
On Wed, Oct 22, 2025 at 11:21:15AM +0200, Jan Beulich wrote:
> On 21.10.2025 15:49, Roger Pau Monné wrote:
> > On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote:
> >> On 20.10.2025 18:22, Roger Pau Monné wrote:
> >>> On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote:
> >>>> Using dynamically allocated / maintained vectors has several downsides:
> >>>> - possible nesting of IRQs due to the effects of IRQ migration,
> >>>> - reduction of vectors available for devices,
> >>>> - IRQs not moving as intended if there's shortage of vectors,
> >>>> - higher runtime overhead.
> >>>>
> >>>> As the vector also doesn't need to be of any priority (first and foremost
> >>>> it really shouldn't be of higher or the same priority as the timer IRQ, as
> >>>> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogether
> >>>> and use a vector from the 0x10...0x1f exception vector space. Exception
> >>>> vs interrupt can easily be distinguished by checking for the presence of
> >>>> an error code.
> >>>>
> >>>> With a fixed vector, less updating is now necessary in
> >>>> set_channel_irq_affinity(); in particular channels don't need transient
> >>>> masking anymore, as the necessary update is now atomic. To fully leverage
> >>>> this, however, we want to stop using hpet_msi_set_affinity() there. With
> >>>> the transient masking dropped, we're no longer at risk of missing events.
> >>>>
> >>>> In principle a change to setup_vector_irq() would be necessary, but only
> >>>> if we used low-prio vectors as direct-APIC ones. Since the change would
> >>>> be at best benign here, it is being omitted.
> >>>>
> >>>> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> >>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> >>>> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
> >>>> ---
> >>>> This is an alternative proposal to
> >>>> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
> >>>>
> >>>> Should we keep hpet_msi_set_affinity() at all? We'd better not have the
> >>>> generic IRQ subsystem play with our IRQs' affinities ... (If so, this
> >>>> likely would want to be a separate patch, though.)
> >>>
> >>> I think that needs to become a no-op, with possibly an ASSERT? Is it
> >>> possible for dom0 to try to balance this IRQ? I would think not.
> >>
> >> I'd consider it an error if that was possible. But then the same goes for
> >> other Xen-internal IRQs, like the IOMMU ones. They all implement a
> >> .set_affinity hook ...
> >
> > We need such hook for fixup_irqs() at least, so that interrupts can be
> > evacuated when an AP goes offline.
>
> Hmm, yes. Just not here.
>
> >>>> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g
> >>>> static void set_channel_irq_affinity(struct hpet_event_channel *ch)
> >>>> {
> >>>> struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>>> + struct msi_msg msg = ch->msi.msg;
> >>>>
> >>>> ASSERT(!local_irq_is_enabled());
> >>>> spin_lock(&desc->lock);
> >>>> - hpet_msi_mask(desc);
> >>>> - hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >>>> - hpet_msi_unmask(desc);
> >>>> +
> >>>> + per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >>>> +
> >>>> + /*
> >>>> + * Open-coding a reduced form of hpet_msi_set_affinity() here. With the
> >>>> + * actual update below (either of the IRTE or of [just] message address;
> >>>> + * with interrupt remapping message address/data don't change) now being
> >>>> + * atomic, we can avoid masking the IRQ around the update. As a result
> >>>> + * we're no longer at risk of missing IRQs (provided hpet_broadcast_enter()
> >>>> + * keeps setting the new deadline only afterwards).
> >>>> + */
> >>>> + cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
> >>>> +
> >>>> spin_unlock(&desc->lock);
> >>>>
> >>>> - spin_unlock(&ch->lock);
> >>>> + msg.dest32 = cpu_physical_id(ch->cpu);
> >>>> + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >>>> + msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >>>> + if ( msg.dest32 != ch->msi.msg.dest32 )
> >>>> + {
> >>>> + ch->msi.msg = msg;
> >>>> +
> >>>> + if ( iommu_intremap != iommu_intremap_off )
> >>>> + {
> >>>> + int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
> >>>>
> >>>> - /* We may have missed an interrupt due to the temporary masking. */
> >>>> - if ( ch->event_handler && ch->next_event < NOW() )
> >>>> - ch->event_handler(ch);
> >>>> + ASSERT(rc <= 0);
> >>>> + if ( rc > 0 )
> >>>> + {
> >>>> + ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
> >>>> + ASSERT(msg.address_lo ==
> >>>> + hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
> >>>> + }
> >>>
> >>> The sequence of asserts seems wrong here, the asserts inside of the
> >>> rc > 0 check will never trigger, because there's an ASSERT(rc <= 0)
> >>> ahead of them?
> >>
> >> Hmm. My way of thinking was that if we get back 1 (which we shouldn't),
> >> we ought to check (and presumably fail on) data or address having changed.
> >
> > Right, but the ASSERT(rc <= 0) will prevent reaching any of the
> > followup ASSERTs if rc == 1?
>
> Which is no problem, as we'd be dead already anyway if the first assertion
> triggered. Nevertheless I've switched the if() to >= 0 (which then pointed
> out a necessary change in AMD IOMMU code).
Right, so an adjusted if() condition plus an ASSERT_UNREACHABLE() at
the end of the if() code block?
> > IOW, we possibly want:
> >
> >     if ( rc > 0 )
> >     {
> >         dprintk(XENLOG_ERR,
> >                 "Unexpected HPET MSI setup returned: data: %#x address: %#lx expected data %#x address %#lx\n",
> >                 msg.data, msg.address,
> >                 ch->msi.msg.data, ch->msi.msg.address);
> >         ASSERT_UNREACHABLE();
> >         hpet_msi_mask(desc);
> >         hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx));
> >         hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
> >         hpet_msi_unmask(desc);
> >     }
> >     ASSERT(!rc);
>
> To be honest, for my taste this goes too far as to what follows an
> ASSERT_UNREACHABLE().
I can understand that. It's the best way I've come up with to attempt
recovering from a possible error in the release case, but I don't
particularly like it either.
> > I'm unsure about attempting to propagate the returned values on release
> > builds, I guess it's slightly better than possibly using an outdated
> > RTE entry? Albeit this should never happen.
>
> Yes to the last remark; I don't actually see what you would want to do
> with the propagated return value.
OK, I can see this not being clear. By propagate here I mean
propagate to the hardware registers, not to the function caller.
> > Also, should the desc->arch.cpu_mask update only be done after the MSI
> > fields have been correctly updated, so that in case of failure of
> > iommu_update_ire_from_msi() we could return early from the function
> > and avoid changing desc->arch.cpu_mask?
>
> Hmm, yes, I could do that, but then also in hpet_msi_set_affinity().
> However, as this needs doing under the IRQ descriptor lock, I'd have to
> either extend the locked region here (again), or re-acquire the lock
> later. Neither looks very attractive to me.
Hm, I guess given the point in the release cycle we can leave it
as-is. It would be nice, but this change is big enough as it is.
Thanks, Roger.