
Re: [Xen-devel] Altp2m use with PML can deadlock Xen



On Thu, May 9, 2019 at 10:19 AM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 09/05/2019 14:38, Tamas K Lengyel wrote:
> > Hi all,
> > I'm investigating an issue with altp2m that can easily be reproduced
> > and leads to a hypervisor deadlock when PML is available in hardware.
> > I haven't been able to trace down where the actual deadlock occurs.
> >
> > The problem seems to stem from hvm/vmx/vmcs.c:vmx_vcpu_flush_pml_buffer,
> > which calls p2m_change_type_one on all gfns that were recorded in the
> > PML buffer. The problem occurs when the PML-buffer-full vmexit happens
> > while the active p2m is an altp2m. Switching p2m_change_type_one to
> > work on the altp2m instead of the hostp2m, however, results in EPT
> > misconfiguration crashes.
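> >
> > For context, the relevant loop in vmx_vcpu_flush_pml_buffer is
> > roughly the following (paraphrased from memory, not a verbatim copy
> > of what is in staging):
> >
> >     for ( ; pml_idx < NR_PML_ENTRIES; pml_idx++ )
> >     {
> >         unsigned long gfn = pml_buf[pml_idx] >> PAGE_SHIFT;
> >
> >         /* Flip the page back from log-dirty to normal RAM in the
> >          * hostp2m... */
> >         p2m_change_type_one(v->domain, gfn, p2m_ram_logdirty,
> >                             p2m_ram_rw);
> >
> >         /* ...and record it in the dirty bitmap. */
> >         paging_mark_gfn_dirty(v->domain, gfn);
> >     }
> >
> > Note that it only ever touches the hostp2m, even if the vcpu's active
> > p2m is an altp2m at the time of the vmexit.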
> >
> > Adding to the issue, it seems to only occur when the altp2m has
> > remapped GFNs. Since PML records entries based on GFN, this leads me
> > to question whether it is safe at all to use PML when altp2m is used
> > with GFN remapping. However, AFAICT the GFNs in the PML buffer are
> > not the remapped GFNs, and my understanding is that it should be safe
> > as long as the GFNs being tracked by PML are never the remapped GFNs.
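> >
> > For reference, the remapping in question is done through the normal
> > altp2m interface, something like the following (the domid, view id
> > and gfn values here are just placeholders):
> >
> >     /* Remap old_gfn to the page backing new_gfn in altp2m view 1. */
> >     xc_altp2m_change_gfn(xch, domid, 1, old_gfn, new_gfn);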
> >
> > Booting Xen with ept=pml=0 resolves the issue.
> >
> > If anyone has any insight into what might be happening, please let me know.
>
>
> I could have sworn that George spotted a problem here and fixed it.  I
> shouldn't be surprised if we have more.
>
> The problem that PML introduced (and this is mostly my fault, as I
> suggested the buggy solution) is that the vmexit handler from one vcpu
> pauses others to drain the PML queue into the dirty bitmap.  Overall I
> wasn't happy with the design and I've got some ideas to improve it, but
> within the scope of how altp2m was engineered, I proposed
> domain_pause_except_self().
>
> As it turns out, that is vulnerable to deadlocks when you get two vcpus
> trying to pause each other and waiting for each other to become
> de-scheduled.

Makes sense.
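
If I follow, the failure mode is roughly the following (a sketch of the
race as I understand it; the exact call sites are my guess):

    /* vCPU 0: e.g. an altp2m op pausing the rest of the domain. */
    domain_pause_except_self(d);  /* waits for vCPU 1 to deschedule */

    /* vCPU 1, concurrently, in another path doing the same. */
    domain_pause_except_self(d);  /* waits for vCPU 0 to deschedule */

    /*
     * Each side spins in vcpu_sleep_sync() waiting for the other vcpu
     * to stop running, but neither ever gets descheduled, so both spin
     * forever.
     */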

>
> I see this has been reused by the altp2m code, but it *should* be safe
> from deadlocks now that it takes the hypercall_deadlock_mutex.

Is that already in staging or in your x86-next branch? I would like to
verify whether the problem is still present with that change. I tested
with the Xen 4.12 release and that definitely still deadlocks.

> Anyway - sorry for not being more help, but I bet the problem is going
> to be somewhere around vcpu pausing.

No problem, I appreciate the help.

Thanks,
Tamas
