
Re: Limitations for Running Xen on KVM Arm64


  • To: Julien Grall <julien@xxxxxxx>, Mohamed Mediouni <mohamed@xxxxxxxxxxxxxxxx>
  • From: "haseeb.ashraf@xxxxxxxxxxx" <haseeb.ashraf@xxxxxxxxxxx>
  • Date: Fri, 31 Oct 2025 13:01:15 +0000
  • Accept-language: en-US
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "Volodymyr_Babchuk@xxxxxxxx" <Volodymyr_Babchuk@xxxxxxxx>, "Driscoll, Dan" <dan.driscoll@xxxxxxxxxxx>, "Bachtel, Andrew" <andrew.bachtel@xxxxxxxxxxx>, "fahad.arslan@xxxxxxxxxxx" <fahad.arslan@xxxxxxxxxxx>, "noor.ahsan@xxxxxxxxxxx" <noor.ahsan@xxxxxxxxxxx>, "brian.sheppard@xxxxxxxxxxx" <brian.sheppard@xxxxxxxxxxx>
  • Delivery-date: Fri, 31 Oct 2025 13:01:32 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: Limitations for Running Xen on KVM Arm64

Hello,

Thanks for your reply.

> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
Yes, I am using Graviton4 (r8g.metal-24xl). Nope, it wasn't much of an issue to use G4.
> KVM has similar logic, see "last_vcpu_ran" and "__kvm_flush_cpu_context()". That said... they are using "vmalle1" whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if this would make any difference for the performance though.
I have seen no such performance issue with nested KVM. For Xen, if this can be relaxed from vmalls12e1 to vmalle1, it would still be a huge performance improvement. I used ftrace to get the execution time of each of these handler functions:
handle_vmalls12e1is() min-max = 1464441 - 9495486 us
handle_tlbi_el1() min-max = 10 - 27 us
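
To make the proposed relaxation concrete, here is a minimal sketch of the two local flush variants in plain C with inline assembly. The helper names are mine and the barriers are chosen conservatively; this is not code from the Xen tree:

/* Sketch only: invalidate stage-1 and stage-2 entries for the current
 * VMID on this PE. Under nested virt this is the instruction that ends
 * up in the expensive handle_vmalls12e1is()-style emulation path. */
static inline void flush_guest_tlb_s1s2_local(void)
{
    asm volatile ( "dsb sy\n\t"
                   "tlbi vmalls12e1\n\t"
                   "dsb sy\n\t"
                   "isb"
                   : : : "memory" );
}

/* Sketch only: invalidate stage-1 entries only for the current VMID on
 * this PE, i.e. the relaxed variant discussed above. */
static inline void flush_guest_tlb_s1_local(void)
{
    asm volatile ( "dsb sy\n\t"
                   "tlbi vmalle1\n\t"
                   "dsb sy\n\t"
                   "isb"
                   : : : "memory" );
}

The only difference is the TLBI opcode: vmalls12e1 drops both stage-1 and stage-2 entries for the current VMID, while vmalle1 drops only the stage-1 entries, which should be all that is needed when the stage-2 tables themselves have not changed.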

So, to summarize: using HCR_EL2.FB (which Xen already enables) together with vmalle1 instead of vmalls12e1 should resolve issue-2, the flushing needed when vCPUs of the same domain switch on a pCPU.
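
For concreteness, here is a minimal sketch of where that relaxed flush would sit in the vCPU-switch path, loosely modelled on the last_vcpu_ran logic mentioned above. All type and function names are illustrative assumptions, not the actual Xen code:

#define NR_PCPUS 128                      /* illustrative upper bound */

struct vcpu_sketch {
    unsigned int vcpu_id;
};

struct p2m_sketch {
    int last_vcpu_ran[NR_PCPUS];          /* -1 when nothing ran here yet */
};

static void restore_vcpu_tlb_state(struct p2m_sketch *p2m,
                                   struct vcpu_sketch *v,
                                   unsigned int pcpu)
{
    /*
     * A different vCPU of the same domain (same VMID) ran on this pCPU
     * before: its stage-1 translations may still sit in this PE's
     * private TLB, so flush them before running the new vCPU.
     */
    if ( p2m->last_vcpu_ran[pcpu] != -1 &&
         p2m->last_vcpu_ran[pcpu] != (int)v->vcpu_id )
        flush_guest_tlb_s1_local();       /* the relaxed vmalle1 flush above */

    p2m->last_vcpu_ran[pcpu] = v->vcpu_id;
}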

Coming back to issue-1, what do you think about creating a batch version of the XENMEM_remove_from_physmap hypercall (batch versions exist for others, e.g. XENMEM_add_to_physmap_batch) and doing the TLB invalidation only once per hypercall? I just realized that ripas2e1 is a range TLBI instruction which is only available from Armv8.4 onwards, indicated by ID_AA64ISAR0_EL1.TLB == 2. So, on older architectures, a full stage-2 invalidation would still be required. For an architecture-independent solution, creating a batch version seems to be the better way.
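
For illustration only, here is a rough sketch of what such a batched interface could look like, loosely modelled on the existing struct xen_add_to_physmap_batch. Neither the struct nor the helpers below exist today; they are just meant to show the "unmap everything first, invalidate once at the end" shape:

/* Hypothetical interface sketch, not present in the public headers. */
struct xen_remove_from_physmap_batch {
    domid_t  domid;                       /* target domain */
    uint16_t size;                        /* number of entries in gpfns */
    XEN_GUEST_HANDLE(xen_pfn_t) gpfns;    /* guest frames to unmap */
};

/* Pseudo-handler: unmap every entry, invalidate the TLB only once. */
static long remove_from_physmap_batch(struct xen_remove_from_physmap_batch *b)
{
    for ( unsigned int i = 0; i < b->size; i++ )
    {
        xen_pfn_t gpfn;

        if ( copy_from_guest_offset(&gpfn, b->gpfns, i, 1) )
            return -EFAULT;

        /* Hypothetical helper: drop the mapping, defer the invalidation. */
        if ( p2m_remove_mapping_no_flush(b->domid, gpfn) )
            return -EINVAL;
    }

    /* Hypothetical helper: a single stage-2 invalidation for the batch. */
    p2m_flush_tlb_for_domain(b->domid);

    return 0;
}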

Regards,
Haseeb

From: Julien Grall <julien@xxxxxxx>
Sent: Friday, October 31, 2025 2:18 PM
To: Mohamed Mediouni <mohamed@xxxxxxxxxxxxxxxx>
Cc: Ashraf, Haseeb (DI SW EDA HAV SLS EPS RTOS LIN) <haseeb.ashraf@xxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx <xen-devel@xxxxxxxxxxxxxxxxxxxx>; Volodymyr_Babchuk@xxxxxxxx <Volodymyr_Babchuk@xxxxxxxx>
Subject: Re: Limitations for Running Xen on KVM Arm64
 


On 31/10/2025 00:20, Mohamed Mediouni wrote:
>
>
>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xxxxxxx> wrote:
>>
>> Hi Mohamed,
>>
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@xxxxxxxxxxx wrote:
>>>>
>>>> Adding @julien@xxxxxxx and replying to the questions he asked over #XenDevel:matrix.org.
>>>>
>>>> can you add some details on why the implementation cannot be optimized in KVM? Asking because I have never seen such an issue when running Xen on QEMU (without nested virt enabled).
>>>> AFAIK, when Xen is run on QEMU without virtualization, the instructions are emulated in QEMU, while with KVM the instructions should ideally run directly on hardware except in some special cases (those trapped by FGT/CGT), such as this one, where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with a callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean up the page tables, which is a costly operation. Regardless of this, it should still be optimized in Xen, as invalidating a selective range would be much better than invalidating the whole 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, most of the features are enabled, except VHE and those which are disabled by KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>>>
>>>> ; switch to current VMID
>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>      • This is where I am not quite sure, and I was hoping that someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This would be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>>>
>>> Note that the documentation says
>>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>>> for TLBIP RIPAS2E1
>>>>      • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on a single pCPU: TLBs are invalidated every time a different vCPU runs on the pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>>> One way to handle this is to make every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
>>
>> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understand correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
>>
> Hello,
>
> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage1 one) a broadcast TLB invalidate.

Xen already sets HCR_EL2.FB. But I believe this only solves the
problem where a vCPU is moved to another pCPU. It doesn't solve the
problem where two vCPUs from the same VM are sharing the same pCPU.

Per the Arm Arm, each CPU has its own private TLBs. So we have to
flush between vCPUs of the same domain to avoid translations from vCPU 1
"leaking" to vCPU 2 (they may have conflicting page-tables).

KVM has similar logic, see "last_vcpu_ran" and
"__kvm_flush_cpu_context()". That said... they are using "vmalle1"
whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if
this would make any difference for the performance though.

Cheers,

--
Julien Grall


 

