
Re: [PATCH] xen/arm64: flushtlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI


  • To: Luca Fancellu <Luca.Fancellu@xxxxxxx>
  • From: "Orzel, Michal" <michal.orzel@xxxxxxx>
  • Date: Tue, 14 Apr 2026 17:09:03 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "Stefano Stabellini" <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, "Bertrand Marquis" <Bertrand.Marquis@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Mark Rutland <Mark.Rutland@xxxxxxx>
  • Delivery-date: Tue, 14 Apr 2026 15:09:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>


On 14/04/2026 16:00, Luca Fancellu wrote:
> Hi Michal,
> 
>> On 14 Apr 2026, at 09:11, Michal Orzel <michal.orzel@xxxxxxx> wrote:
>>
>> The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
>> errata where broadcast TLBI;DSB sequences don't provide all the
>> architecturally required synchronization. The workaround performs more
>> work than necessary, and can have significant overhead. This patch
>> optimizes the workaround, as explained below.
>>
>> 1. All relevant errata only affect the ordering and/or completion of
>>   memory accesses which have been translated by an invalidated TLB
>>   entry. The actual invalidation of TLB entries is unaffected.
>>
>> 2. The existing workaround is applied to both broadcast and local TLB
>>   invalidation, whereas for all relevant errata it is only necessary to
>>   apply a workaround for broadcast invalidation.
>>
>> 3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
>>   sequence, whereas for all relevant errata it is only necessary to
>>   execute a single additional TLBI;DSB sequence after any number of
>>   TLBIs are completed by a DSB.
>>
>>   For example, for a sequence of batched TLBIs:
>>
>>       TLBI <op1>[, <arg1>]
>>       TLBI <op2>[, <arg2>]
>>       TLBI <op3>[, <arg3>]
>>       DSB ISH
>>
>>   ... the existing workaround will expand this to:
>>
>>       TLBI <op1>[, <arg1>]
>>       DSB ISH                  // additional
>>       TLBI <op1>[, <arg1>]     // additional
>>       TLBI <op2>[, <arg2>]
>>       DSB ISH                  // additional
>>       TLBI <op2>[, <arg2>]     // additional
>>       TLBI <op3>[, <arg3>]
>>       DSB ISH                  // additional
>>       TLBI <op3>[, <arg3>]     // additional
>>       DSB ISH
>>
>>   ... whereas it is sufficient to have:
>>
>>       TLBI <op1>[, <arg1>]
>>       TLBI <op2>[, <arg2>]
>>       TLBI <op3>[, <arg3>]
>>       DSB ISH
>>       TLBI <opX>[, <argX>]     // additional
>>       DSB ISH                  // additional
>>
>>   Using a single additional TBLI and DSB at the end of the sequence can
> 
> NIT: Typo s/TBLI/TLBI
> 
>>   have significantly lower overhead as each DSB which completes a TLBI
>>   must synchronize with other PEs in the system, with potential
>>   performance effects both locally and system-wide.
>>
>> 4. The existing workaround repeats each specific TLBI operation, whereas
>>   for all relevant errata it is sufficient for the additional TLBI to
>>   use *any* operation which will be broadcast, regardless of which
>>   translation regime or stage of translation the operation applies to.
>>
>>   For example, for a single TLBI:
>>
>>       TLBI ALLE2IS
>>       DSB ISH
>>
>>   ... the existing workaround will expand this to:
>>
>>       TLBI ALLE2IS
>>       DSB ISH
>>       TLBI ALLE2IS             // additional
>>       DSB ISH                  // additional
>>
>>   ... whereas it is sufficient to have:
>>
>>       TLBI ALLE2IS
>>       DSB ISH
>>       TLBI VALE1IS, XZR        // additional
>>       DSB ISH                  // additional
>>
>>   As the additional TLBI doesn't have to match a specific earlier TLBI,
>>   the additional TLBI can be implemented in separate code, with no
>>   memory of the earlier TLBIs. The additional TLBI can also use a
>>   cheaper TLBI operation.
>>
>> 5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
>>   invalidation, whereas for all relevant errata it is only necessary to
>>   apply a workaround for Stage-1 invalidation.
>>
>>   Architecturally, TLBI operations which invalidate only Stage-2
>>   information (e.g. IPAS2E1IS) are not required to invalidate TLB
>>   entries which combine information from Stage-1 and Stage-2
>>   translation table entries, and consequently may not complete memory
>>   accesses translated by those combined entries. In these cases,
>>   completion of memory accesses is only guaranteed after subsequent
>>   invalidation of Stage-1 information (e.g. VMALLE1IS).
>>
>> Rework the workaround logic as follows:
>> - add TLB_HELPER_LOCAL() to be used for local TLB ops without a
>>   workaround,
>> - modify TLB_HELPER() workaround to use tlbi vale2is, xzr as a second
>>   TLB,
> 
> TLBI ?
> 
>> - drop TLB_HELPER_VA(). It's used only by __flush_xen_tlb_one_local
>>   which is local and does not need workaround and by
>>   __flush_xen_tlb_one. In the latter case, since it's used in a loop,
>>   we don't need a workaround in the middle. Add __tlb_repeat_sync with
>>   a workaround to be used at the end after DSB and before final ISB,
>> - TLBI VALE2IS passing XZR is used as an additional TLBI. While there is
>>   an identity mapping there, it's used very rarely. The performance
>>   impact is therefore negligible. If things change in the future, we
>>   can revisit the decision.
>>
>> Signed-off-by: Michal Orzel <michal.orzel@xxxxxxx>
>> ---
>> Linux counterpart (already merged):
>> https://lore.kernel.org/linux-arm-kernel/20260218164348.2022831-1-mark.rutland@xxxxxxx/
>> ---
>> xen/arch/arm/include/asm/arm32/flushtlb.h |   3 +
>> xen/arch/arm/include/asm/arm64/flushtlb.h | 108 ++++++++++++++--------
>> xen/arch/arm/include/asm/flushtlb.h       |   1 +
>> 3 files changed, 71 insertions(+), 41 deletions(-)
>>
>> diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
>> index 61c25a318998..5483be08fbbe 100644
>> --- a/xen/arch/arm/include/asm/arm32/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
>> @@ -57,6 +57,9 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
>>     asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
>> }
>>
>> +/* Only for ARM64_WORKAROUND_REPEAT_TLBI */
>> +static inline void __tlb_repeat_sync(void) {}
>> +
>> #endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
>> /*
>>  * Local variables:
>> diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
>> index 3b99c11b50d1..1606b26bf28a 100644
>> --- a/xen/arch/arm/include/asm/arm64/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
>> @@ -12,9 +12,14 @@
>>  * ARM64_WORKAROUND_REPEAT_TLBI:
>>  * Modification of the translation table for a virtual address might lead to
>>  * read-after-read ordering violation.
>> - * The workaround repeats TLBI+DSB ISH operation for all the TLB flush
>> - * operations. While this is strictly not necessary, we don't want to
>> - * take any risk.
>> + * The workaround repeats TLBI+DSB ISH operation for broadcast TLB flush
>> + * operations. The workaround is not needed for local operations.
>> + *
>> + * It is sufficient for the additional TLBI to use *any* operation which will
>> + * be broadcast, regardless of which translation regime or stage of translation
>> + * the operation applies to. TLBI VALE2IS is used passing XZR. While there is
>> + * an identity mapping there, it's only used during suspend/resume, CPU on/off,
>> + * so the impact (performance if any) is negligible.
>>  *
>>  * For Xen page-tables the ISB will discard any instructions fetched
>>  * from the old mappings.
>> @@ -26,69 +31,90 @@
>>  * Note that for local TLB flush, using non-shareable (nsh) is sufficient
>>  * (see D5-4929 in ARM DDI 0487H.a). Although, the memory barrier in
>>  * for the workaround is left as inner-shareable to match with Linux
>> - * v6.1-rc8.
>> + * v6.19.
>>  */
>> -#define TLB_HELPER(name, tlbop, sh)              \
>> +#define TLB_HELPER_LOCAL(name, tlbop)            \
>> static inline void name(void)                    \
>> {                                                \
>>     asm_inline volatile (                        \
>> -        "dsb  "  # sh  "st;"                     \
>> +        "dsb  nshst;"                            \
>>         "tlbi "  # tlbop  ";"                    \
>> -        ALTERNATIVE(                             \
>> -            "nop; nop;",                         \
>> -            "dsb  ish;"                          \
>> -            "tlbi "  # tlbop  ";",               \
>> -            ARM64_WORKAROUND_REPEAT_TLBI,        \
>> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
>> -        "dsb  "  # sh  ";"                       \
>> +        "dsb  nsh;"                              \
>>         "isb;"                                   \
>>         : : : "memory");                         \
>> }
>>
>> -/*
>> - * FLush TLB by VA. This will likely be used in a loop, so the caller
>> - * is responsible to use the appropriate memory barriers before/after
>> - * the sequence.
>> - *
>> - * See above about the ARM64_WORKAROUND_REPEAT_TLBI sequence.
>> - */
>> -#define TLB_HELPER_VA(name, tlbop)               \
>> -static inline void name(vaddr_t va)              \
>> -{                                                \
>> -    asm_inline volatile (                        \
>> -        "tlbi "  # tlbop  ", %0;"                \
>> -        ALTERNATIVE(                             \
>> -            "nop; nop;",                         \
>> -            "dsb  ish;"                          \
>> -            "tlbi "  # tlbop  ", %0;",           \
>> -            ARM64_WORKAROUND_REPEAT_TLBI,        \
>> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
>> -        : : "r" (va >> PAGE_SHIFT) : "memory");  \
>> +#define TLB_HELPER(name, tlbop)                       \
>> +static inline void name(void)                         \
>> +{                                                     \
>> +    asm_inline volatile (                             \
>> +        "dsb  ishst;"                                 \
>> +        "tlbi "  # tlbop  ";"                         \
>> +        ALTERNATIVE(                                  \
>> +            "nop; nop;",                              \
>> +            "dsb  ish;"                               \
>> +            "tlbi vale2is, xzr;",                     \
>> +            ARM64_WORKAROUND_REPEAT_TLBI,             \
>> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)      \
>> +        "dsb  ish;"                                   \
>> +        "isb;"                                        \
>> +        : : : "memory"); \
>> }
>>
>> /* Flush local TLBs, current VMID only. */
>> -TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
>> +TLB_HELPER_LOCAL(flush_guest_tlb_local, vmalls12e1)
>>
>> /* Flush innershareable TLBs, current VMID only */
>> -TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
>> +TLB_HELPER(flush_guest_tlb, vmalls12e1is)
>>
>> /* Flush local TLBs, all VMIDs, non-hypervisor mode */
>> -TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
>> +TLB_HELPER_LOCAL(flush_all_guests_tlb_local, alle1)
>>
>> /* Flush innershareable TLBs, all VMIDs, non-hypervisor mode */
>> -TLB_HELPER(flush_all_guests_tlb, alle1is, ish)
>> +TLB_HELPER(flush_all_guests_tlb, alle1is)
>>
>> /* Flush all hypervisor mappings from the TLB of the local processor. */
>> -TLB_HELPER(flush_xen_tlb_local, alle2, nsh)
>> +TLB_HELPER_LOCAL(flush_xen_tlb_local, alle2)
>> +
>> +#undef TLB_HELPER_LOCAL
>> +#undef TLB_HELPER
>> +
>> +/*
>> + * FLush TLB by VA. This will likely be used in a loop, so the caller
>> + * is responsible to use the appropriate memory barriers before/after
>> + * the sequence.
>> + */
>>
>> /* Flush TLB of local processor for address va. */
>> -TLB_HELPER_VA(__flush_xen_tlb_one_local, vae2)
>> +static inline void __flush_xen_tlb_one_local(vaddr_t va)
>> +{
>> +    asm_inline volatile (
>> +        "tlbi vae2, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
>> +}
>>
>> /* Flush TLB of all processors in the inner-shareable domain for address va. */
>> -TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
>> +static inline void __flush_xen_tlb_one(vaddr_t va)
>> +{
>> +    asm_inline volatile (
>> +        "tlbi vae2is, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
>> +}
>>
>> -#undef TLB_HELPER
>> -#undef TLB_HELPER_VA
>> +/*
>> + * ARM64_WORKAROUND_REPEAT_TLBI:
>> + * For all relevant erratas it is only necessary to execute a single
>> + * additional TLBI;DSB sequence after any number of TLBIs are completed by DSB.
>> + */
>> +static inline void __tlb_repeat_sync(void)
>> +{
>> +    asm_inline volatile (
>> +        ALTERNATIVE(
>> +            "nop; nop;",
>> +            "tlbi vale2is, xzr;"
>> +            "dsb  ish;",
>> +            ARM64_WORKAROUND_REPEAT_TLBI,
>> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
>> +        : : : "memory");
>> +}
>>
>> #endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
>> /*
>> diff --git a/xen/arch/arm/include/asm/flushtlb.h b/xen/arch/arm/include/asm/flushtlb.h
>> index e45fb6d97b02..c292c3c00d29 100644
>> --- a/xen/arch/arm/include/asm/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/flushtlb.h
>> @@ -65,6 +65,7 @@ static inline void flush_xen_tlb_range_va(vaddr_t va,
>>         va += PAGE_SIZE;
>>     }
>>     dsb(ish); /* Ensure the TLB invalidation has completed */
>> +    __tlb_repeat_sync();
> 
> More a question here rather than a comment, shall we have a comment on top
> of this stating that it’s deliberate to have it before the isb?
> Or developer should infer it from the code and from git blame?
I think the explanation in flushtlb.h and the commit message should be enough. To
fully understand it, you would still need to read the complete commit message
anyway, so I don't think a one-sentence comment improves the situation.

> 
>>     isb();
>> }
>>
>> -- 
>> 2.43.0
>>
>>
> 
> With these fixed:
The NITs can be fixed on commit, provided there are no other remarks from the
other maintainers, @Julien in particular.

~Michal




 

