
Re: Limitations for Running Xen on KVM Arm64




> On 31. Oct 2025, at 01:20, Mohamed Mediouni <mohamed@xxxxxxxxxxxxxxxx> wrote:
> 
> 
> 
>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xxxxxxx> wrote:
>> 
>> Hi Mohamed,
>> 
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@xxxxxxxxxxx wrote:
>>>> 
>>>> Adding @julien@xxxxxxx and replying to the questions he asked over
>>>> #XenDevel:matrix.org.
>>>> 
>>>> can you add some details on why the implementation cannot be optimized in
>>>> KVM? Asking because I have never seen such an issue when running Xen on QEMU
>>>> (without nested virt enabled).
>>>> AFAIK, when Xen is run on QEMU without virtualization, instructions are
>>>> emulated in QEMU, while with KVM the instructions should ideally run
>>>> directly on hardware except in some special cases (those trapped by
>>>> FGT/CGT), such as this one, where KVM maintains shadow page tables for
>>>> each VM. It traps these instructions and emulates them with a callback
>>>> such as handle_vmalls12e1is(). The way this callback is implemented, it
>>>> has to iterate over the whole address space and clean up the page tables,
>>>> which is a costly operation. Regardless of this, it should still be
>>>> optimized in Xen, as invalidating a selective range would be much better
>>>> than invalidating the whole 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am 
>>>> interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes,
>>>> most of the features are enabled except VHE and those which are disabled
>>>> by KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? 
>>> Interesting to see people caring about nested virt there :) - and hopefully 
>>> using it wasn’t too much of a pain for you to deal with.
>>>> 
>>>> ; switch to the current VMID
>>>> tlbi rvae1, guest_vaddr    ; first invalidate stage-1 TLB by guest VA for the current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for the current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>    • This is where I am not quite sure, and I was hoping that someone
>>>> with Arm expertise could sign off on this so that I can work on its
>>>> implementation in Xen. This will be an optimization not only for
>>>> virtualized hardware but also in general for Xen on arm64 machines.
>>>> 
>>> Note that the documentation says
>>>> The invalidation is not required to apply to caching structures that 
>>>> combine stage 1 and stage 2 translation table entries.
>>> for TLBIP RIPAS2E1
>>>>    • The second place in Xen where this is problematic is when multiple
>>>> vCPUs of the same domain are juggled on a single pCPU: TLBs are invalidated
>>>> every time a different vCPU runs on a pCPU. I do not know how this can be
>>>> optimized. Any support on this is appreciated.
>>> One way to handle this is to make every invalidate within the VM a broadcast
>>> TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forgo that
>>> TLB maintenance, as it’s no longer necessary. This should not have a
>>> practical performance impact.
>> 
>> To confirm my understanding, you are suggesting to rely on the L2 guest to
>> send the TLB flush. Did I understand correctly? If so, wouldn't this open
>> a security hole, because a misbehaving guest may never send the TLB flush?
>> 
> Hello,
> 
> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which
> is a stage-1 one) a broadcast TLB invalidate.
> 
> If a TLB invalidate wasn’t issued, then the cached stage-1 translations
> could already have been out of date on the core the VM was running on in the
> first place.
> 
> If a core-local TLB invalidate was issued, this bit forces it to become a 
> broadcast, so that you don’t have to worry about flushing TLBs when moving a 
> vCPU between different pCPUs. KVM operates with this bit set.
> 
> As for the hypervisor, it is responsible for issuing the appropriate TLB
> invalidates when it changes stage-2 mappings. This includes a stage-2 TLB
> invalidate and further maintenance if the CPU core caches combined TLB
> entries. Whether a CPU core does that can be queried through FEAT_nTLBPA.
> 
> On processors without FEAT_nTLBPA, it should be assumed that there are
> non-coherent caching structures within the TLB, and as such the corresponding
> stage-1 maintenance should also be done when invalidating stage-2 entries.
> 
> Thank you,
> -Mohamed

On the Neoverse V3 core for example, there’s this note in the TRM:

https://developer.arm.com/documentation/107734/0002/AArch64-registers/AArch64-Identification-registers-summary/ID-AA64MMFR1-EL1--AArch64-Memory-Model-Feature-Register-1?lang=en

> nTLBPA: The intermediate caching of translation table walks does not include 
> non-coherent physical translation caches.

Which means that the heavyweight flush on stage-2 invalidation (dropping the
stage-1 entries as well) is no longer necessary on that core.
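
As a minimal sketch (a hypothetical helper, not taken from Xen or KVM), a
hypervisor could detect this at runtime by reading ID_AA64MMFR1_EL1.nTLBPA
(bits [51:48]); a non-zero value means FEAT_nTLBPA is implemented:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: true when ID_AA64MMFR1_EL1.nTLBPA (bits [51:48])
     * is non-zero, i.e. the intermediate caching of translation table walks
     * does not include non-coherent physical translation caches. */
    static inline bool cpu_has_ntlbpa(void)
    {
        uint64_t mmfr1;

        asm volatile("mrs %0, id_aa64mmfr1_el1" : "=r" (mmfr1));
        return ((mmfr1 >> 48) & 0xf) != 0;
    }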

On Neoverse V2, this field is RES0 instead. And as such, invalidating the
whole of stage 1 is in practice necessary on Neoverse V2 when doing stage-2
invalidates in a hypervisor… (or doing more heavyweight tracking…)

What KVM currently does in arch/arm64/kvm/hyp/nvhe/tlb.c (~line 158,
__kvm_tlb_flush_vmid_ipa): it just always flushes stage 1 when doing a
stage-2 flush.
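
Roughly, and only as a sketch of that pattern (not KVM's actual code; the
function name is made up and it assumes the caller has already switched to
the guest's VMID/VTTBR): a by-IPA stage-2 invalidate followed by a full
stage-1 invalidate for the current VMID, which in principle could be skipped
on cores that report nTLBPA:

    /* Sketch only: invalidate the stage-2 TLB entries for one IPA and then,
     * because combined stage-1/stage-2 entries cannot be ruled out without
     * FEAT_nTLBPA, drop all stage-1 entries for the current VMID as well. */
    static inline void flush_s2_ipa_then_s1(unsigned long ipa)
    {
        /* Make preceding page-table updates visible to the table walker. */
        asm volatile("dsb ishst" ::: "memory");
        /* Stage-2 invalidate by IPA for the current VMID, broadcast to the
         * inner-shareable domain; Xt takes the IPA shifted right by 12. */
        asm volatile("tlbi ipas2e1is, %0" :: "r" (ipa >> 12) : "memory");
        asm volatile("dsb ish" ::: "memory");
        /* The heavyweight part: drop all stage-1 entries for this VMID. On
         * cores with FEAT_nTLBPA (see cpu_has_ntlbpa() above) this step
         * could be skipped. */
        asm volatile("tlbi vmalle1is" ::: "memory");
        asm volatile("dsb ish" ::: "memory");
        asm volatile("isb" ::: "memory");
    }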

Thank you,
-Mohamed






 

