Xen project Mailing List

Re: [PATCH v2 for-4.21 2/9] x86/HPET: use single, global, low-priority vector for broadcast IRQ

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Tue, 21 Oct 2025 14:49:46 +0100

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=8jM7wPd1hok6EuGBb0PaoWOgDB2IorXtzdr9mJMJro0=; b=j2kp9/+LmChVRcjM/nLLMpbdPs3Bz32W7Qj/QOeppWuHcUCn+Kstno+CFm5uhdHeRkKQHbPRCZsotiMcUV0cNrgw4cNptJez/VqFn15foZ5EKgKCTiZvj8cTMWo8qmv6I0lDQBbZd9ABc/YnLhQ8heWYjxw4MY+jVWxHJqb++RN1Z0XRm5ktoPrmwe+YUm/9MQH8RPRvdvS4x23+dpCY37FPEJlGATqaaY6ZDl+LfJG5I/mShmyTuUHGAPC4so1CBxxRY8JUkx8W2PV0qA515oV/2UwY7zFer9XQuDvOFqa5QfJ1s+6UBI13lQ6fGdVhzLIbFRbEx8QdjI1r8ZA6bg==

Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=oYib2CYQa+Vx1hIeUTv6EoMnZg8hI+hEbkyimzC09wQQlhC7TatAoW2t0OdikbSH7++8rIM6oGZDsko0+o7rv3MTXTAZEMnYuASvjvizMtwq5XF8+b0HrzEU2DnulsMMR8gEJasvXLfZeAKPKiRcwmNzRuknK1CUTfcTPpS9H+t1PkwDmnvY3G1NVWnQrvwEXQefTPr/NQrc+FOXuY9KAE/TZEjpSNquG3pOTTArU3AwAwJC8rc0LHrjQMQdapb2rEf31pjhSfVAcLkxnsMj4aQ4JuIT1qP/Pyoh6aeEs2bdfiRNkioY6DkK3Z5XafQ+dVqnAias/Ufghq8IzNDt4Q==

Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>

Delivery-date: Tue, 21 Oct 2025 13:50:01 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote: > On 20.10.2025 18:22, Roger Pau Monné wrote: > > On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote: > >> Using dynamically allocated / maintained vectors has several downsides: > >> - possible nesting of IRQs due to the effects of IRQ migration, > >> - reduction of vectors available for devices, > >> - IRQs not moving as intended if there's shortage of vectors, > >> - higher runtime overhead. > >> > >> As the vector also doesn't need to be of any priority (first and foremost > >> it really shouldn't be of higher or same priority as the timer IRQ, as > >> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogther > >> and use a vector from the 0x10...0x1f exception vector space. Exception vs > >> interrupt can easily be distinguished by checking for the presence of an > >> error code. > >> > >> With a fixed vector, less updating is now necessary in > >> set_channel_irq_affinity(); in particular channels don't need transiently > >> masking anymore, as the necessary update is now atomic. To fully leverage > >> this, however, we want to stop using hpet_msi_set_affinity() there. With > >> the transient masking dropped, we're no longer at risk of missing events. > >> > >> In principle a change to setup_vector_irq() would be necessary, but only > >> if we used low-prio vectors as direct-APIC ones. Since the change would be > >> at best benign here, it is being omitted. > >> > >> Fixes: 996576b965cc ("xen: allow up to 16383 cpus") > >> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> > >> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@xxxxxxxxx> > >> --- > >> This is an alternative proposal to > >> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html. > >> > >> Should we keep hpet_msi_set_affinity() at all? We'd better not have the > >> generic IRQ subsystem play with our IRQs' affinities ... (If so, this > >> likely would want to be a separate patch, though.) > > > > I think that needs to become a no-op, with possibly an ASSERT? Is it > > possibly for dom0 to try to balance this IRQ? I would think not. > > I'd consider it an error if that was possible. But then the same goes for > other Xen-internal IRQs, like the IOMMU ones. They all implement a > .set_affinity hook ... We need such hook for fixup_irqs() at least, so that interrupts can be evacuated when an AP goes offline. However movement of Xen owned IRQs should be limited to Xen (if it's not already). > >> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g > >> static void set_channel_irq_affinity(struct hpet_event_channel *ch) > >> { > >> struct irq_desc *desc = irq_to_desc(ch->msi.irq); > >> + struct msi_msg msg = ch->msi.msg; > >> > >> ASSERT(!local_irq_is_enabled()); > >> spin_lock(&desc->lock); > >> - hpet_msi_mask(desc); > >> - hpet_msi_set_affinity(desc, cpumask_of(ch->cpu)); > >> - hpet_msi_unmask(desc); > >> + > >> + per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq; > >> + > >> + /* > >> + * Open-coding a reduced form of hpet_msi_set_affinity() here. With > >> the > >> + * actual update below (either of the IRTE or of [just] message > >> address; > >> + * with interrupt remapping message address/data don't change) now > >> being > >> + * atomic, we can avoid masking the IRQ around the update. As a > >> result > >> + * we're no longer at risk of missing IRQs (provided > >> hpet_broadcast_enter() > >> + * keeps setting the new deadline only afterwards). > >> + */ > >> + cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu)); > >> + > >> spin_unlock(&desc->lock); > >> > >> - spin_unlock(&ch->lock); > >> + msg.dest32 = cpu_physical_id(ch->cpu); > >> + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; > >> + msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32); > >> + if ( msg.dest32 != ch->msi.msg.dest32 ) > >> + { > >> + ch->msi.msg = msg; > >> + > >> + if ( iommu_intremap != iommu_intremap_off ) > >> + { > >> + int rc = iommu_update_ire_from_msi(&ch->msi, &msg); > >> > >> - /* We may have missed an interrupt due to the temporary masking. */ > >> - if ( ch->event_handler && ch->next_event < NOW() ) > >> - ch->event_handler(ch); > >> + ASSERT(rc <= 0); > >> + if ( rc > 0 ) > >> + { > >> + ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx))); > >> + ASSERT(msg.address_lo == > >> + hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4)); > >> + } > > > > The sequence of asserts seem wrong here, the asserts inside of the rc > > > 0 check will never trigger, because there's an ASSERT(rc <= 0) > > ahead of them? > > Hmm. My way of thinking was that if we get back 1 (which we shouldn't), > we ought to check (and presumably fail on) data or address having changed. Right, but the ASSERT(rc <= 0) will prevent reaching any of the followup ASSERTs if rc == 1? IOW, we possibly want: if ( rc > 0 ) { dprintk(XENLOG_ERR, "Unexpected HPET MSI setup returned: data: %#x address: %#lx expected data %#x address %#lx\n", msg.data, msg.address, ch->msi.msg.data, ch->msi.msg.address); ASSERT_UNREACHABLE(); hpet_msi_mask(desc); hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx)); hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4); hpet_msi_unmask(desc); } ASSERT(!rc); I'm unsure about attempting to propagate the returned values on release builds, I guess it's slightly better than possibly using an outdated RTE entry? Albeit this should never happen. Also, should the desc->arch.cpu_mask update only be done after the MSI fields have correctly updated, so that in case of failure of iommu_update_ire_from_msi(9 we could return early form the function and avoid changing desc->arch.cpu_mask? > Whereas when we get back 0, we're told "no change" anyway, and hence > checking isn't even needed. Did I misunderstand the purpose of the zero > vs positive return value here? > > Of could I could switch to using "rc >= 0" anyway; I actually had it that > way first, but then decided the extra checks would be redundant in the 0 > case. > > >> + } > >> + else > >> + hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4); > > > > If you avoid the HPET register update here you possibly need to make > > sure that both fields are unconditionally written on the first call > > after resume from suspension. hpet_resume() needs to somehow taint > > the channels to signal that a re-write of the address and data fields > > is mandatory regardless of what iommu_update_ire_from_msi() has > > returned. > > hpet_broadcast_resume() calls __hpet_setup_msi_irq() (and hence > hpet_msi_write()), which I thought is enough? Oh, sorry, I was looking at hpet_resume(), not hpet_broadcast_resume(). Thanks, Roger.

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.