
Re: [Xen-devel] [PATCH v1] x86/hvm: Generic instruction re-execution mechanism for execute faults



On Thu, Nov 22, 2018 at 05:25:02PM +0200, Razvan Cojocaru wrote:
> On 11/22/18 4:49 PM, Roger Pau Monné wrote:
> > On Thu, Nov 22, 2018 at 02:48:07PM +0200, Razvan Cojocaru wrote:
> >> On 11/22/18 12:58 PM, Roger Pau Monné wrote:
> >>> On Thu, Nov 22, 2018 at 12:14:59PM +0200, Razvan Cojocaru wrote:
> >>>> On 11/22/18 12:05 PM, Roger Pau Monné wrote:
> >>>>> On Wed, Nov 21, 2018 at 08:55:48PM +0200, Razvan Cojocaru wrote:
> >>>>>> On 11/16/18 7:04 PM, Roger Pau Monné wrote:
> >>>>>>>> +            if ( a == v )
> >>>>>>>> +                continue;
> >>>>>>>> +
> >>>>>>>> +            /* Pause, synced. */
> >>>>>>>> +            while ( !a->arch.in_host )
> >>>>>>> Why not use a->is_running as a way to know whether the vCPU is
> >>>>>>> running?
> >>>>>>>
> >>>>>>> I think the logic of using vcpu_pause and expecting the running vcpu
> >>>>>>> to take a vmexit (and thus set in_host) is wrong: a vcpu that
> >>>>>>> wasn't running when vcpu_pause_nosync was called won't get scheduled
> >>>>>>> anymore, thus never takes a vmexit, and this function will lock up.
> >>>>>>>
> >>>>>>> I don't think you need the in_host boolean at all.
> >>>>>>>
> >>>>>>>> +                cpu_relax();
> >>>>>>> Is this really better than using vcpu_pause?
> >>>>>>>
> >>>>>>> I assume this is done to avoid waiting on each vcpu, and instead doing
> >>>>>>> it here likely means less wait time?
> >>>>>>
> >>>>>> The problem with plain vcpu_pause() is that we weren't able to use it,
> >>>>>> for the same reason (which remains unclear as of yet) that we couldn't
> >>>>>> use a->is_running: we get CPU stuck hypervisor crashes that way. Here's
> >>>>>> one that uses the same logic, but loops on a->is_running instead of
> >>>>>> !a->arch.in_host:
> >>>
> >>> [...]
> >>>
> >>>>>> Some scheduler magic appears to happen here where it is unclear why
> >>>>>> is_running doesn't seem to end up being 0 as expected in our case.
> >>>>>> We'll keep digging.
> >>>>>
> >>>>> There seems to be some kind of deadlock between
> >>>>> vmx_start_reexecute_instruction and hap_track_dirty_vram/handle_mmio.
> >>>>> Are you holding a lock while trying to put the other vcpus to sleep?
> >>>>
> >>>> d->arch.rexec_lock, but I don't see how that would matter in this case.
> >>>
> >>> The trace from pCPU#0:
> >>>
> >>> (XEN) [ 3668.016989] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v0)
> >>> [...]
> >>> (XEN) [ 3668.275417] Xen call trace:
> >>> (XEN) [ 3668.278714]    [<ffff82d0801327d2>] vcpu_sleep_sync+0x40/0x71
(XEN) [ 3668.284952]    [<ffff82d08010735b>] domain.c#do_domain_pause+0x33/0x4f
(XEN) [ 3668.291973]    [<ffff82d08010879a>] domain_pause+0x25/0x27
(XEN) [ 3668.297952]    [<ffff82d080245e69>] hap_track_dirty_vram+0x2c1/0x4a7
(XEN) [ 3668.304797]    [<ffff82d0801dd8f5>] do_hvm_op+0x18be/0x2b58
(XEN) [ 3668.310864]    [<ffff82d080172aca>] pv_hypercall+0x1e5/0x402
(XEN) [ 3668.317017]    [<ffff82d080250899>] entry.o#test_all_events+0/0x3d
> >>>
> >>> This shows a hypercall executed from Dom0 that's trying to pause
> >>> the domain, thus pausing all the vCPUs.
> >>>
> >>> Then pCPU#3:
> >>>
> >>> (XEN) [ 3669.062841] RFLAGS: 0000000000000202   CONTEXT: hypervisor (d1v0)
> >>> [...]
> >>> (XEN) [ 3669.322832] Xen call trace:
> >>> (XEN) [ 3669.326128]    [<ffff82d08021006a>] vmx_start_reexecute_instruction+0x107/0x68a
> >>> (XEN) [ 3669.333925]    [<ffff82d080210b3e>] p2m_mem_access_check+0x551/0x64d
> >>> (XEN) [ 3669.340774]    [<ffff82d0801dee9e>] hvm_hap_nested_page_fault+0x2f2/0x631
> >>> (XEN) [ 3669.348051]    [<ffff82d080202c00>] vmx_vmexit_handler+0x156c/0x1e45
> >>> (XEN) [ 3669.354899]    [<ffff82d08020820c>] vmx_asm_vmexit_handler+0xec/0x250
> >>>
> >>> d1v0 seems to be blocked in vmx_start_reexecute_instruction, and thus
> >>> never gets paused, triggering the watchdog on pCPU#0?
> >>>
> >>> You should check which vCPU the trace from pCPU#0 is waiting on; if
> >>> that's the vCPU running on pCPU#3 (d1v0), you will have to check
> >>> what's taking so long in vmx_start_reexecute_instruction.
> >>
> >> Right, so this is what appears to be happening, if the output of my test
> >> is to be trusted: https://pastebin.com/YEDqNuwh
> >>
> >> 1. vmx_start_reexecute_instruction() pauses all VCPUs but self (which
> >> appears to be VCPU 1):
> >>
> >> (XEN) [  195.427141] 0 pause_count 0
> >> (XEN) [  195.427142] 2 pause_count 0
> >> (XEN) [  195.427143] 3 pause_count 0
> >> (XEN) [  195.427144] 4 pause_count 0
> >> (XEN) [  195.427146] 5 pause_count 0
> >> (XEN) [  195.427147] 6 pause_count 0
> >> (XEN) [  195.427148] 7 pause_count 0
> > 
> > The diff below doesn't show where you add this message, nor what it's
> > actually printing. I guess the first number is the vCPU ID, and
> > the second the value of pause_count at some point?
> 
> Yes, exactly. So the above tells us that VCPUs 0 and 2-7 have been
> paused (nosync) by vmx_start_reexecute_instruction(), which is now doing
> a while ( a->is_running ) cpu_relax().
> 
> >> 2. The hypercall happens, which calls domain_pause(), which I've
> >> modified thus:
> >>
> >> @@ -959,7 +961,10 @@ static void do_domain_pause(struct domain *d,
> >>      atomic_inc(&d->pause_count);
> >>
> >>      for_each_vcpu( d, v )
> >> +    {
> >> +        printk("domain_pause %d\n", v->vcpu_id);
> > 
> > Could you print both the domain and the vcpu ids?
> 
> Of course, but I think I've found the issue (please see below).
> 
> >>          sleep_fn(v);
> >> +    }
> >>
> >>      arch_domain_pause(d);
> >>  }
> >>
> >> and which says:
> >>
> >> (XEN) [  195.492064] domain_pause 0
> > 
> > This is the hypercall code waiting for domain 1 vCPU 0 to pause?
> 
> Yes.
> 
> >> 3. At this point, according to addr2line,
> >> vmx_start_reexecute_instruction() does "while ( a->is_running )
> >> cpu_relax();" for all VCPUs but itself.
> > 
> > Why don't you just start by using:
> > 
> > for_each_vcpu( d, v )
> >     if ( v != current )
> >         vcpu_pause(v);
> > 
> > Instead of open-coding it in vmx_start_reexecute_instruction.
> 
> That's the intention if we can get it to work.
> 
> >> Now, d1v0, which, if I'm reading this correctly, is the VCPU that
> >> domain_pause() is stuck waiting for, does:
> >>
> >> (XEN) [  200.829874] Xen call trace:
> >> (XEN) [  200.833166]    [<ffff82d0801278c6>] queue_read_lock_slowpath+0x25/0x4d
> >> (XEN) [  200.840186]    [<ffff82d08020c1f6>] get_page_from_gfn_p2m+0x14e/0x3b0
> >> (XEN) [  200.847121]    [<ffff82d080247213>] hap_p2m_ga_to_gfn_4_levels+0x48/0x299
> >> (XEN) [  200.854400]    [<ffff82d080247480>] hap_gva_to_gfn_4_levels+0x1c/0x1e
> >> (XEN) [  200.861331]    [<ffff82d08021275c>] paging_gva_to_gfn+0x10e/0x11d
> >> (XEN) [  200.867918]    [<ffff82d0801d66e0>] hvm.c#__hvm_copy+0x98/0x37f
> >> (XEN) [  200.874329]    [<ffff82d0801d848d>] hvm_fetch_from_guest_virt_nofault+0x14/0x16
> >> (XEN) [  200.882130]    [<ffff82d0801d141a>] emulate.c#_hvm_emulate_one+0x118/0x2bc
> >> (XEN) [  200.889496]    [<ffff82d0801d16b4>] hvm_emulate_one+0x10/0x12
> >> (XEN) [  200.895735]    [<ffff82d0801e0902>] handle_mmio+0x52/0xc9
> >> (XEN) [  200.901626]    [<ffff82d0801e09ba>] handle_mmio_with_translation+0x41/0x43
> >> (XEN) [  200.908994]    [<ffff82d0801ded1f>] hvm_hap_nested_page_fault+0x133/0x631
> >> (XEN) [  200.916271]    [<ffff82d080202c40>] vmx_vmexit_handler+0x156c/0x1e45
> >> (XEN) [  200.923117]    [<ffff82d08020824c>] vmx_asm_vmexit_handler+0xec/0x250
> > 
> > What lock is it waiting on? Is this the paging lock? If so you will
> > have to figure out who is holding this lock.
> 
> It turns out that it's the p2m lock. I've looked at the code more
> closely, and hvm_hap_nested_page_fault() takes a p2m lock:
> 
> 1923     /*
> 1924      * Take a lock on the host p2m speculatively, to avoid potential
> 1925      * locking order problems later and to handle unshare etc.
> 1926      */
> 1927     hostp2m = p2m_get_hostp2m(currd);
> 
> then ends up calling p2m_mem_access_check() with said lock taken.
> 
> Then p2m_mem_access_check() also does a bit of gfn_lock(p2m, gfn, 0)
> (although it also unlocks those by the time
> vmx_start_reexecute_instruction() gets called).
> 
> And then, of course, vmx_start_reexecute() runs, gets stuck in that loop
> (it doesn't matter if we manually look at is_running or call
> vcpu_pause(), the effect is the same), while the other VCPU ends up
> running get_page_from_gfn_p2m(), which tries to also lock the p2m and
> deadlocks.

I don't think you are supposed to pause other vCPUs while holding a
lock: as you can see, it's quite likely that you will end up
deadlocking, because the vCPU you are trying to pause may itself be
stuck waiting on the lock that you are holding.

You should figure out whether you can get into vmx_start_reexecute
without holding any locks, or alternatively drop the lock, pause the
vCPUs and then take the lock again.

See for example how hap_track_dirty_vram releases the lock before
attempting to pause the domain for this same reason.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

