Re: [PATCH] xen/arm64: flushtlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
On 14/04/2026 16:00, Luca Fancellu wrote:
> Hi Michal,
>
>> On 14 Apr 2026, at 09:11, Michal Orzel <michal.orzel@xxxxxxx> wrote:
>>
>> The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
>> errata where broadcast TLBI;DSB sequences don't provide all the
>> architecturally required synchronization. The workaround performs more
>> work than necessary, and can have significant overhead. This patch
>> optimizes the workaround, as explained below.
>>
>> 1. All relevant errata only affect the ordering and/or completion of
>>    memory accesses which have been translated by an invalidated TLB
>>    entry. The actual invalidation of TLB entries is unaffected.
>>
>> 2. The existing workaround is applied to both broadcast and local TLB
>>    invalidation, whereas for all relevant errata it is only necessary
>>    to apply a workaround for broadcast invalidation.
>>
>> 3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
>>    sequence, whereas for all relevant errata it is only necessary to
>>    execute a single additional TLBI;DSB sequence after any number of
>>    TLBIs are completed by a DSB.
>>
>>    For example, for a sequence of batched TLBIs:
>>
>>        TLBI <op1>[, <arg1>]
>>        TLBI <op2>[, <arg2>]
>>        TLBI <op3>[, <arg3>]
>>        DSB ISH
>>
>>    ... the existing workaround will expand this to:
>>
>>        TLBI <op1>[, <arg1>]
>>        DSB ISH              // additional
>>        TLBI <op1>[, <arg1>] // additional
>>        TLBI <op2>[, <arg2>]
>>        DSB ISH              // additional
>>        TLBI <op2>[, <arg2>] // additional
>>        TLBI <op3>[, <arg3>]
>>        DSB ISH              // additional
>>        TLBI <op3>[, <arg3>] // additional
>>        DSB ISH
>>
>>    ... whereas it is sufficient to have:
>>
>>        TLBI <op1>[, <arg1>]
>>        TLBI <op2>[, <arg2>]
>>        TLBI <op3>[, <arg3>]
>>        DSB ISH
>>        TLBI <opX>[, <argX>] // additional
>>        DSB ISH              // additional
>>
>>    Using a single additional TBLI and DSB at the end of the sequence can
>
> NIT: Typo s/TBLI/TLBI
>
>>    have significantly lower overhead as each DSB which completes a TLBI
>>    must synchronize with other PEs in the system, with potential
>>    performance effects both locally and system-wide.
>>
>> 4. The existing workaround repeats each specific TLBI operation, whereas
>>    for all relevant errata it is sufficient for the additional TLBI to
>>    use *any* operation which will be broadcast, regardless of which
>>    translation regime or stage of translation the operation applies to.
>>
>>    For example, for a single TLBI:
>>
>>        TLBI ALLE2IS
>>        DSB ISH
>>
>>    ... the existing workaround will expand this to:
>>
>>        TLBI ALLE2IS
>>        DSB ISH
>>        TLBI ALLE2IS // additional
>>        DSB ISH      // additional
>>
>>    ... whereas it is sufficient to have:
>>
>>        TLBI ALLE2IS
>>        DSB ISH
>>        TLBI VALE1IS, XZR // additional
>>        DSB ISH           // additional
>>
>>    As the additional TLBI doesn't have to match a specific earlier TLBI,
>>    the additional TLBI can be implemented in separate code, with no
>>    memory of the earlier TLBIs. The additional TLBI can also use a
>>    cheaper TLBI operation.
>>
>> 5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
>>    invalidation, whereas for all relevant errata it is only necessary
>>    to apply a workaround for Stage-1 invalidation.
>>
>>    Architecturally, TLBI operations which invalidate only Stage-2
>>    information (e.g. IPAS2E1IS) are not required to invalidate TLB
>>    entries which combine information from Stage-1 and Stage-2
>>    translation table entries, and consequently may not complete memory
>>    accesses translated by those combined entries. In these cases,
>>    completion of memory accesses is only guaranteed after subsequent
>>    invalidation of Stage-1 information (e.g. VMALLE1IS).
>>
>> Rework the workaround logic as follows:
>> - add TLB_HELPER_LOCAL() to be used for local TLB ops without a
>>   workaround,
>> - modify TLB_HELPER() workaround to use tlbi vale2is, xzr as a second
>>   TLB,
>
> TLBI ?
>
>> - drop TLB_HELPER_VA(). It's used only by __flush_xen_tlb_one_local
>>   which is local and does not need workaround and by
>>   __flush_xen_tlb_one. In the latter case, since it's used in a loop,
>>   we don't need a workaround in the middle. Add __tlb_repeat_sync with
>>   a workaround to be used at the end after DSB and before final ISB,
>> - TLBI VALE2IS passing XZR is used as an additional TLBI. While there is
>>   an identity mapping there, it's used very rarely. The performance
>>   impact is therefore negligible. If things change in the future, we
>>   can revisit the decision.
>>
>> Signed-off-by: Michal Orzel <michal.orzel@xxxxxxx>
>> ---
>> Linux counterpart (already merged):
>> https://lore.kernel.org/linux-arm-kernel/20260218164348.2022831-1-mark.rutland@xxxxxxx/
>> ---
>>  xen/arch/arm/include/asm/arm32/flushtlb.h |   3 +
>>  xen/arch/arm/include/asm/arm64/flushtlb.h | 108 ++++++++++++++--------
>>  xen/arch/arm/include/asm/flushtlb.h       |   1 +
>>  3 files changed, 71 insertions(+), 41 deletions(-)
>>
>> diff --git a/xen/arch/arm/include/asm/arm32/flushtlb.h b/xen/arch/arm/include/asm/arm32/flushtlb.h
>> index 61c25a318998..5483be08fbbe 100644
>> --- a/xen/arch/arm/include/asm/arm32/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/arm32/flushtlb.h
>> @@ -57,6 +57,9 @@ static inline void __flush_xen_tlb_one(vaddr_t va)
>>      asm volatile(STORE_CP32(0, TLBIMVAHIS) : : "r" (va) : "memory");
>>  }
>>
>> +/* Only for ARM64_WORKAROUND_REPEAT_TLBI */
>> +static inline void __tlb_repeat_sync(void) {}
>> +
>>  #endif /* __ASM_ARM_ARM32_FLUSHTLB_H__ */
>>  /*
>>   * Local variables:
>> diff --git a/xen/arch/arm/include/asm/arm64/flushtlb.h b/xen/arch/arm/include/asm/arm64/flushtlb.h
>> index 3b99c11b50d1..1606b26bf28a 100644
>> --- a/xen/arch/arm/include/asm/arm64/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/arm64/flushtlb.h
>> @@ -12,9 +12,14 @@
>>   * ARM64_WORKAROUND_REPEAT_TLBI:
>>   * Modification of the translation table for a virtual address might lead to
>>   * read-after-read ordering violation.
>> - * The workaround repeats TLBI+DSB ISH operation for all the TLB flush
>> - * operations. While this is strictly not necessary, we don't want to
>> - * take any risk.
>> + * The workaround repeats TLBI+DSB ISH operation for broadcast TLB flush
>> + * operations. The workaround is not needed for local operations.
>> + *
>> + * It is sufficient for the additional TLBI to use *any* operation which will
>> + * be broadcast, regardless of which translation regime or stage of translation
>> + * the operation applies to. TLBI VALE2IS is used passing XZR. While there is
>> + * an identity mapping there, it's only used during suspend/resume, CPU on/off,
>> + * so the impact (performance if any) is negligible.
>>   *
>>   * For Xen page-tables the ISB will discard any instructions fetched
>>   * from the old mappings.
>> @@ -26,69 +31,90 @@
>>   * Note that for local TLB flush, using non-shareable (nsh) is sufficient
>>   * (see D5-4929 in ARM DDI 0487H.a). Although, the memory barrier in
>>   * for the workaround is left as inner-shareable to match with Linux
>> - * v6.1-rc8.
>> + * v6.19.
>>   */
>> -#define TLB_HELPER(name, tlbop, sh)                        \
>> +#define TLB_HELPER_LOCAL(name, tlbop)                      \
>>  static inline void name(void)                              \
>>  {                                                          \
>>      asm_inline volatile (                                  \
>> -        "dsb " # sh "st;"                                  \
>> +        "dsb nshst;"                                       \
>>          "tlbi " # tlbop ";"                                \
>> -        ALTERNATIVE(                                       \
>> -            "nop; nop;",                                   \
>> -            "dsb ish;"                                     \
>> -            "tlbi " # tlbop ";",                           \
>> -            ARM64_WORKAROUND_REPEAT_TLBI,                  \
>> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)           \
>> -        "dsb " # sh ";"                                    \
>> +        "dsb nsh;"                                         \
>>          "isb;"                                             \
>>          : : : "memory");                                   \
>>  }
>>
>> -/*
>> - * FLush TLB by VA. This will likely be used in a loop, so the caller
>> - * is responsible to use the appropriate memory barriers before/after
>> - * the sequence.
>> - *
>> - * See above about the ARM64_WORKAROUND_REPEAT_TLBI sequence.
>> - */
>> -#define TLB_HELPER_VA(name, tlbop)                         \
>> -static inline void name(vaddr_t va)                        \
>> -{                                                          \
>> -    asm_inline volatile (                                  \
>> -        "tlbi " # tlbop ", %0;"                            \
>> -        ALTERNATIVE(                                       \
>> -            "nop; nop;",                                   \
>> -            "dsb ish;"                                     \
>> -            "tlbi " # tlbop ", %0;",                       \
>> -            ARM64_WORKAROUND_REPEAT_TLBI,                  \
>> -            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)           \
>> -        : : "r" (va >> PAGE_SHIFT) : "memory");            \
>> +#define TLB_HELPER(name, tlbop)                            \
>> +static inline void name(void)                              \
>> +{                                                          \
>> +    asm_inline volatile (                                  \
>> +        "dsb ishst;"                                       \
>> +        "tlbi " # tlbop ";"                                \
>> +        ALTERNATIVE(                                       \
>> +            "nop; nop;",                                   \
>> +            "dsb ish;"                                     \
>> +            "tlbi vale2is, xzr;",                          \
>> +            ARM64_WORKAROUND_REPEAT_TLBI,                  \
>> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)           \
>> +        "dsb ish;"                                         \
>> +        "isb;"                                             \
>> +        : : : "memory");                                   \
>>  }
>>
>>  /* Flush local TLBs, current VMID only. */
>> -TLB_HELPER(flush_guest_tlb_local, vmalls12e1, nsh)
>> +TLB_HELPER_LOCAL(flush_guest_tlb_local, vmalls12e1)
>>
>>  /* Flush innershareable TLBs, current VMID only */
>> -TLB_HELPER(flush_guest_tlb, vmalls12e1is, ish)
>> +TLB_HELPER(flush_guest_tlb, vmalls12e1is)
>>
>>  /* Flush local TLBs, all VMIDs, non-hypervisor mode */
>> -TLB_HELPER(flush_all_guests_tlb_local, alle1, nsh)
>> +TLB_HELPER_LOCAL(flush_all_guests_tlb_local, alle1)
>>
>>  /* Flush innershareable TLBs, all VMIDs, non-hypervisor mode */
>> -TLB_HELPER(flush_all_guests_tlb, alle1is, ish)
>> +TLB_HELPER(flush_all_guests_tlb, alle1is)
>>
>>  /* Flush all hypervisor mappings from the TLB of the local processor. */
>> -TLB_HELPER(flush_xen_tlb_local, alle2, nsh)
>> +TLB_HELPER_LOCAL(flush_xen_tlb_local, alle2)
>> +
>> +#undef TLB_HELPER_LOCAL
>> +#undef TLB_HELPER
>> +
>> +/*
>> + * FLush TLB by VA. This will likely be used in a loop, so the caller
>> + * is responsible to use the appropriate memory barriers before/after
>> + * the sequence.
>> + */
>>
>>  /* Flush TLB of local processor for address va. */
>> -TLB_HELPER_VA(__flush_xen_tlb_one_local, vae2)
>> +static inline void __flush_xen_tlb_one_local(vaddr_t va)
>> +{
>> +    asm_inline volatile (
>> +        "tlbi vae2, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
>> +}
>>
>>  /* Flush TLB of all processors in the inner-shareable domain for address va. */
>> -TLB_HELPER_VA(__flush_xen_tlb_one, vae2is)
>> +static inline void __flush_xen_tlb_one(vaddr_t va)
>> +{
>> +    asm_inline volatile (
>> +        "tlbi vae2is, %0" : : "r" (va >> PAGE_SHIFT) : "memory");
>> +}
>>
>> -#undef TLB_HELPER
>> -#undef TLB_HELPER_VA
>> +/*
>> + * ARM64_WORKAROUND_REPEAT_TLBI:
>> + * For all relevant erratas it is only necessary to execute a single
>> + * additional TLBI;DSB sequence after any number of TLBIs are completed by DSB.
>> + */
>> +static inline void __tlb_repeat_sync(void)
>> +{
>> +    asm_inline volatile (
>> +        ALTERNATIVE(
>> +            "nop; nop;",
>> +            "tlbi vale2is, xzr;"
>> +            "dsb ish;",
>> +            ARM64_WORKAROUND_REPEAT_TLBI,
>> +            CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)
>> +        : : : "memory");
>> +}
>>
>>  #endif /* __ASM_ARM_ARM64_FLUSHTLB_H__ */
>>  /*
>> diff --git a/xen/arch/arm/include/asm/flushtlb.h b/xen/arch/arm/include/asm/flushtlb.h
>> index e45fb6d97b02..c292c3c00d29 100644
>> --- a/xen/arch/arm/include/asm/flushtlb.h
>> +++ b/xen/arch/arm/include/asm/flushtlb.h
>> @@ -65,6 +65,7 @@ static inline void flush_xen_tlb_range_va(vaddr_t va,
>>          va += PAGE_SIZE;
>>      }
>>      dsb(ish); /* Ensure the TLB invalidation has completed */
>> +    __tlb_repeat_sync();
>
> More a question here rather than a comment, shall we have a comment on top
> of this stating that it's deliberate to have it before the isb?
> Or developer should infer it from the code and from git blame?

I think the explanation in flushtlb.h and commit msg should be enough. To
understand it fully, you still need to read the complete message to get a
full picture, so I don't think that adding a comment with one sentence
improves the situation.

>
>>      isb();
>>  }
>>
>> --
>> 2.43.0
>>
>>
>
> With these fixed:

The NITs can be fixed on commit provided no other remarks from other
maintainers, specifically @Julien.

~Michal