Re: [PATCH v2 for-4.21 2/9] x86/HPET: use single, global, low-priority vector for broadcast IRQ
On Wed, Oct 22, 2025 at 11:21:15AM +0200, Jan Beulich wrote:
> On 21.10.2025 15:49, Roger Pau Monné wrote:
> > On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote:
> >> On 20.10.2025 18:22, Roger Pau Monné wrote:
> >>> On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote:
> >>>> Using dynamically allocated / maintained vectors has several downsides:
> >>>> - possible nesting of IRQs due to the effects of IRQ migration,
> >>>> - reduction of vectors available for devices,
> >>>> - IRQs not moving as intended if there's shortage of vectors,
> >>>> - higher runtime overhead.
> >>>>
> >>>> As the vector also doesn't need to be of any priority (first and foremost
> >>>> it really shouldn't be of higher or the same priority as the timer IRQ, as
> >>>> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogether
> >>>> and use a vector from the 0x10...0x1f exception vector space. Exception
> >>>> vs interrupt can easily be distinguished by checking for the presence of
> >>>> an error code.
> >>>>
> >>>> With a fixed vector, less updating is now necessary in
> >>>> set_channel_irq_affinity(); in particular channels don't need transient
> >>>> masking anymore, as the necessary update is now atomic. To fully leverage
> >>>> this, however, we want to stop using hpet_msi_set_affinity() there. With
> >>>> the transient masking dropped, we're no longer at risk of missing events.
> >>>>
> >>>> In principle a change to setup_vector_irq() would be necessary, but only
> >>>> if we used low-prio vectors as direct-APIC ones. Since the change would
> >>>> be at best benign here, it is being omitted.
> >>>>
> >>>> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> >>>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> >>>> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
> >>>> ---
> >>>> This is an alternative proposal to
> >>>> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
> >>>>
> >>>> Should we keep hpet_msi_set_affinity() at all? We'd better not have the
> >>>> generic IRQ subsystem play with our IRQs' affinities ... (If so, this
> >>>> likely would want to be a separate patch, though.)
> >>>
> >>> I think that needs to become a no-op, with possibly an ASSERT? Is it
> >>> possible for dom0 to try to balance this IRQ? I would think not.
> >>
> >> I'd consider it an error if that was possible. But then the same goes for
> >> other Xen-internal IRQs, like the IOMMU ones. They all implement a
> >> .set_affinity hook ...
> >
> > We need such hook for fixup_irqs() at least, so that interrupts can be
> > evacuated when an AP goes offline.
>
> Hmm, yes. Just not here.
>
> >>>> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g
> >>>> static void set_channel_irq_affinity(struct hpet_event_channel *ch)
> >>>> {
> >>>> struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>>> + struct msi_msg msg = ch->msi.msg;
> >>>>
> >>>> ASSERT(!local_irq_is_enabled());
> >>>> spin_lock(&desc->lock);
> >>>> - hpet_msi_mask(desc);
> >>>> - hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >>>> - hpet_msi_unmask(desc);
> >>>> +
> >>>> + per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >>>> +
> >>>> + /*
> >>>> + * Open-coding a reduced form of hpet_msi_set_affinity() here. With the
> >>>> + * actual update below (either of the IRTE or of [just] message address;
> >>>> + * with interrupt remapping message address/data don't change) now being
> >>>> + * atomic, we can avoid masking the IRQ around the update. As a result
> >>>> + * we're no longer at risk of missing IRQs (provided hpet_broadcast_enter()
> >>>> + * keeps setting the new deadline only afterwards).
> >>>> + */
> >>>> + cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
> >>>> +
> >>>> spin_unlock(&desc->lock);
> >>>>
> >>>> - spin_unlock(&ch->lock);
> >>>> + msg.dest32 = cpu_physical_id(ch->cpu);
> >>>> + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >>>> + msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >>>> + if ( msg.dest32 != ch->msi.msg.dest32 )
> >>>> + {
> >>>> + ch->msi.msg = msg;
> >>>> +
> >>>> + if ( iommu_intremap != iommu_intremap_off )
> >>>> + {
> >>>> + int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
> >>>>
> >>>> - /* We may have missed an interrupt due to the temporary masking. */
> >>>> - if ( ch->event_handler && ch->next_event < NOW() )
> >>>> - ch->event_handler(ch);
> >>>> + ASSERT(rc <= 0);
> >>>> + if ( rc > 0 )
> >>>> + {
> >>>> + ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
> >>>> + ASSERT(msg.address_lo ==
> >>>> + hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
> >>>> + }
> >>>
> >>> The sequence of asserts seems wrong here, the asserts inside of the
> >>> rc > 0 check will never trigger, because there's an ASSERT(rc <= 0)
> >>> ahead of them?
> >>
> >> Hmm. My way of thinking was that if we get back 1 (which we shouldn't),
> >> we ought to check (and presumably fail on) data or address having changed.
> >
> > Right, but the ASSERT(rc <= 0) will prevent reaching any of the
> > followup ASSERTs if rc == 1?
>
> Which is no problem, as we'd be dead already anyway if the first assertion
> triggered. Nevertheless I've switched the if() to >= 0 (which then pointed
> out a necessary change in AMD IOMMU code).
Right, so an adjusted if() condition plus an ASSERT_UNREACHABLE() at
the end of the if() code block?
> > IOW, we possibly want:
> >
> >     if ( rc > 0 )
> >     {
> >         dprintk(XENLOG_ERR,
> >                 "Unexpected HPET MSI setup returned: data: %#x address: %#lx expected data %#x address %#lx\n",
> >                 msg.data, msg.address,
> >                 ch->msi.msg.data, ch->msi.msg.address);
> >         ASSERT_UNREACHABLE();
> >         hpet_msi_mask(desc);
> >         hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx));
> >         hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
> >         hpet_msi_unmask(desc);
> >     }
> >     ASSERT(!rc);
>
> To be honest, for my taste this goes too far as to what follows an
> ASSERT_UNREACHABLE().
I can understand that. It's the best way I've come up with to attempt
recovering from a possible error in the release case, but I don't
particularly like it either.
> > I'm unsure about attempting to propagate the returned values on release
> > builds, I guess it's slightly better than possibly using an outdated
> > RTE entry? Albeit this should never happen.
>
> Yes to the last remark; I don't actually see what you would want to do
> with the propagated return value.
OK, I can see this not being clear. By propagate here I mean
propagate to the hardware registers, not to the function caller.
> > Also, should the desc->arch.cpu_mask update only be done after the MSI
> > fields have been correctly updated, so that in case of failure of
> > iommu_update_ire_from_msi() we could return early from the function
> > and avoid changing desc->arch.cpu_mask?
>
> Hmm, yes, I could do that, but then also in hpet_msi_set_affinity().
> However, as this needs doing under the IRQ descriptor lock, I'd have to
> either extend the locked region here (again), or re-acquire the lock
> later. Neither looks very attractive to me.
Hm, I guess given the point in the release cycle we can leave it
as-is. It would be nice, but this change is big enough as it is.
Thanks, Roger.