
Re: Limitations for Running Xen on KVM Arm64




> On 31. Oct 2025, at 10:18, Julien Grall <julien@xxxxxxx> wrote:
> 
> 
> 
> On 31/10/2025 00:20, Mohamed Mediouni wrote:
>>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xxxxxxx> wrote:
>>> 
>>> Hi Mohamed,
>>> 
>>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@xxxxxxxxxxx wrote:
>>>>> 
>>>>> Adding @julien@xxxxxxx and replying to his questions he asked over 
>>>>> #XenDevel:matrix.org.
>>>>> 
>>>>> Can you add some details on why the implementation cannot be optimized 
>>>>> in KVM? Asking because I have never seen such an issue when running Xen 
>>>>> on QEMU (without nested virt enabled).
>>>>> AFAIK when Xen is run on QEMU without virtualization, the instructions 
>>>>> are emulated in QEMU, while with KVM they should ideally run directly on 
>>>>> hardware except in some special cases (those trapped by FGT/CGT), such 
>>>>> as this one, where KVM maintains shadow page tables for each VM. It 
>>>>> traps these instructions and emulates them with a callback such as 
>>>>> handle_vmalls12e1is(). The way this callback is implemented, it has to 
>>>>> iterate over the whole address space and clean up the page tables, which 
>>>>> is a costly operation. Regardless of this, it should still be optimized 
>>>>> in Xen, as invalidating a selective range would be much better than 
>>>>> invalidating the whole 48-bit address space.
>>>>> Some details about your platform and use case would be helpful. I am 
>>>>> interested to know whether you are using all the features for nested virt.
>>>>> I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, 
>>>>> most of the features are enabled except VHE and those which are disabled 
>>>>> by KVM.
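
A hedged illustration of the cost described earlier for
handle_vmalls12e1is(): without range information, emulating the guest's
"invalidate everything" against the shadow stage-2 has nothing better to go
on than the whole IPA space. The names below (unmap_shadow_s2_range(),
IPA_SPACE_SIZE, the emulate_* wrappers) are made up for the sketch and are
not KVM's or Xen's actual code.

#include <stdint.h>

#define IPA_SPACE_SIZE  (1ULL << 48)   /* assumed 48-bit IPA space */

/* Assumed helper: tears down shadow stage-2 entries in [base, base+size). */
void unmap_shadow_s2_range(uint64_t base, uint64_t size);

/* Emulating a guest vmalls12e1is: must cover the entire IPA space. */
static void emulate_vmalls12e1is(void)
{
    unmap_shadow_s2_range(0, IPA_SPACE_SIZE);
}

/* A range-scoped invalidate only touches the region the guest named. */
static void emulate_range_invalidate(uint64_t ipa, uint64_t size)
{
    unmap_shadow_s2_range(ipa, size);
}
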
>>>> Hello,
>>>> You mean Graviton4 (for reference to others, from a bare metal instance)? 
>>>> Interesting to see people caring about nested virt there :) - and 
>>>> hopefully using it wasn’t too much of a pain for you to deal with.
>>>>> 
>>>>> ; switch to current VMID
>>>>> tlbi rvae1, guest_vaddr     ; first invalidate stage-1 TLB by guest VA, current VMID
>>>>> tlbi ripas2e1, guest_paddr  ; then invalidate stage-2 TLB by IPA range, current VMID
>>>>> dsb ish
>>>>> isb
>>>>> ; switch back the VMID
>>>>>     • This is where I am not quite sure, and I was hoping that someone 
>>>>> with Arm expertise could sign off on this so that I can work on its 
>>>>> implementation in Xen. This would be an optimization not only for 
>>>>> virtualized hardware but also in general for Xen on arm64 machines.
>>>>> 
>>>> Note that the documentation says
>>>>> The invalidation is not required to apply to caching structures that 
>>>>> combine stage 1 and stage 2 translation table entries.
>>>> for TLBIP RIPAS2E1
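
As a hedged sketch only, the sequence proposed above could look roughly
like the C/inline-assembly below. This is not Xen code: the helper name is
made up, the caller is assumed to have already switched VTTBR_EL2 to the
target guest's VMID, and rva_op/ripa_op are assumed to already carry the
encoded range operands (BaseADDR, TG, SCALE, NUM, TTL) defined by the Arm
ARM. The range TLBI mnemonics require FEAT_TLBIRANGE (Armv8.4), and the
barrier placement simply mirrors the proposal; given the note just quoted
about combined stage-1+2 caching structures, the stage-1 part may also
need to be broader than a single VA range.

#include <stdint.h>

/* Hedged sketch, not Xen code; see the caveats above. */
static inline void flush_guest_range_current_vmid(uint64_t rva_op,
                                                  uint64_t ripa_op)
{
    asm volatile(
        "tlbi rvae1, %0\n"      /* stage-1 TLB, VA range, current VMID  */
        "tlbi ripas2e1, %1\n"   /* stage-2 TLB, IPA range, current VMID */
        "dsb ish\n"
        "isb\n"
        :
        : "r" (rva_op), "r" (ripa_op)
        : "memory");
}
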
>>>>>     • The second place in Xen where this is problematic is when multiple 
>>>>> vCPUs of the same domain are juggled on a single pCPU: TLBs are 
>>>>> invalidated every time a different vCPU runs on a pCPU. I do not know 
>>>>> how this can be optimized. Any support on this is appreciated.
>>>> One way to handle this is to make every TLB invalidate within the VM a 
>>>> broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then 
>>>> forego that TLB maintenance, as it’s no longer necessary. This should not 
>>>> have a practical performance impact.
>>> 
>>> To confirm my understanding, you are suggesting to rely on the L2 guest to 
>>> send the TLB flush. Did I understand correctly? If so, wouldn't this open 
>>> a security hole, because a misbehaving guest may never send the TLB flush?
>>> 
>> Hello,
>> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which 
>> is a stage-1 invalidate) a broadcast TLB invalidate.
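
For context, HCR_EL2.FB ("Force Broadcast", bit 9) makes the TLB and
instruction-cache maintenance the guest performs at EL1 behave as the
broadcast (Inner Shareable) variants. A minimal sketch of setting it, with
illustrative accessor and macro names that are not Xen's actual
definitions; this must of course execute at EL2:

#include <stdint.h>

#define HCR_FB  (1ULL << 9)    /* HCR_EL2.FB: force broadcast */

static inline uint64_t read_hcr_el2(void)
{
    uint64_t v;
    asm volatile("mrs %0, hcr_el2" : "=r" (v));
    return v;
}

static inline void write_hcr_el2(uint64_t v)
{
    asm volatile("msr hcr_el2, %0\n\tisb" : : "r" (v));
}

static void enable_force_broadcast(void)
{
    write_hcr_el2(read_hcr_el2() | HCR_FB);
}
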
> 
> Xen already sets HCR_EL2.FB. But I believe this only solves the problem 
> where the vCPU is moved to another pCPU. It doesn't solve the problem where 
> two vCPUs from the same VM are sharing the same pCPU.
> 
> Per the Arm Arm, each CPU has its own private TLBs. So we have to flush 
> between vCPUs of the same domain to avoid translations from vCPU 1 "leaking" 
> to vCPU 2 (they may have conflicting page-tables).
Hm… it depends on whether the VM uses CnP or not (and whether the HW supports 
it)… (Linux does…)
> KVM has similar logic, see "last_vcpu_ran" and "__kvm_flush_cpu_context()". 
> That said... they are using "vmalle1" whereas we are using "vmalls12e1". So 
> maybe we can relax it. Not sure if this would make any difference for 
> performance though.
vmalle1 avoids the problem here (because it only invalidates stage-1 
translations). 
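
For reference, the KVM pattern mentioned above boils down to recording, per
physical CPU, which vCPU ran there last and doing a local, stage-1-only
flush when a different vCPU of the same VM is about to run. A hedged sketch
of that pattern follows; the structure and helper names are illustrative
and are not KVM's or Xen's actual code.

#include <stdint.h>

#define NR_PCPUS  64            /* assumed number of physical CPUs */

struct vm {
    /* vCPU index that last ran on each pCPU; assumed initialised to -1. */
    int last_vcpu_ran[NR_PCPUS];
};

/* Local, current-VMID, stage-1-only TLB invalidate (vmalle1). */
static inline void local_flush_stage1_tlb(void)
{
    asm volatile("tlbi vmalle1\n\tdsb nsh\n\tisb" ::: "memory");
}

/* Called when scheduling vCPU 'vcpu_idx' onto pCPU 'cpu', after the
 * VM's VMID (VTTBR_EL2) has been installed. */
static void vcpu_load(struct vm *vm, int vcpu_idx, int cpu)
{
    if (vm->last_vcpu_ran[cpu] != vcpu_idx) {
        /*
         * A different vCPU of this VM ran here last: flush so its
         * stage-1 translations cannot leak into this vCPU (their
         * page-tables may conflict).
         */
        local_flush_stage1_tlb();
        vm->last_vcpu_ran[cpu] = vcpu_idx;
    }
}
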
> Cheers,
> 
> -- 
> Julien Grall
> 
> 




 

