Re: [Xen-devel] 4.11.0 RC1 panic



On Tue, May 15, 2018 at 03:30:17AM -0600, Jan Beulich wrote:
> >> So in combination with your later reply I'm confused: Are you observing
> >> this with 64-bit guests as well (your later reply appears to hint towards
> >> 64-bit-ness), or (as the stack trace suggests) only 32-bit ones? Knowing
> >> this may already narrow areas where to look.
> > 
> > I've seen it on a server where, I think, only 32-bit domUs are running.
> > But the dom0 is a 64-bit NetBSD anyway.
> 
> Right; Dom0 bitness is of no interest. I've been going through numerous
> possibly racing combinations of code paths, without being able to spot
> anything yet. I'm afraid I'm not in the position to try to set up the full
> environment you're observing the problem in. It would therefore really
> help if you could
> - debug this yourself, or

In my experience this kind of bug can only be found by code inspection,
or by adding asserts to try to detect the problem earlier. Both need
good knowledge of the affected code, and I don't have this knowledge.

> - reduce the test environment (ideally to a simple [XTF?] test), or
> - at least narrow the conditions, or

Now that I know where to find the domU number in the panic message,
I can say that, so far, only 32-bit domUs have caused this assert failure.

> - at the very least summarize the relevant actions NetBSD takes in
>   terms of page table management, to hopefully reduce the sets of
>   code paths potentially involved (for example, across a larger set of
>   crashes knowing whether UNPIN is always involved would be
>   helpful; I've been blindly assuming it would be short of having
>   further data)

So far I've seen 2 stack traces with 4.11:
(XEN) Xen call trace:
(XEN)    [<ffff82d080284bd2>] mm.c#dec_linear_entries+0x12/0x20
(XEN)    [<ffff82d08028922e>] mm.c#_put_page_type+0x13e/0x350
(XEN)    [<ffff82d08023a00d>] _spin_lock+0xd/0x50
(XEN)    [<ffff82d0802898af>] mm.c#put_page_from_l2e+0xdf/0x110
(XEN)    [<ffff82d080288c59>] free_page_type+0x2f9/0x790
(XEN)    [<ffff82d0802891f7>] mm.c#_put_page_type+0x107/0x350
(XEN)    [<ffff82d0802898ef>] put_page_type_preemptible+0xf/0x10
(XEN)    [<ffff82d080272adb>] domain.c#relinquish_memory+0xab/0x460
(XEN)    [<ffff82d080276ae3>] domain_relinquish_resources+0x203/0x290
(XEN)    [<ffff82d0802068bd>] domain_kill+0xbd/0x150
(XEN)    [<ffff82d0802039e3>] do_domctl+0x7d3/0x1a90
(XEN)    [<ffff82d080203210>] do_domctl+0/0x1a90
(XEN)    [<ffff82d080367b95>] pv_hypercall+0x1f5/0x430
(XEN)    [<ffff82d08036e422>] lstar_enter+0xa2/0x120
(XEN)    [<ffff82d08036e42e>] lstar_enter+0xae/0x120
(XEN)    [<ffff82d08036e422>] lstar_enter+0xa2/0x120
(XEN)    [<ffff82d08036e42e>] lstar_enter+0xae/0x120
(XEN)    [<ffff82d08036e422>] lstar_enter+0xa2/0x120
(XEN)    [<ffff82d08036e42e>] lstar_enter+0xae/0x120
(XEN)    [<ffff82d08036e48c>] lstar_enter+0x10c/0x120

and
(XEN)    [<ffff82d080284bd2>] mm.c#dec_linear_entries+0x12/0x20
(XEN)    [<ffff82d08028922e>] mm.c#_put_page_type+0x13e/0x350
(XEN)    [<ffff82d0802898af>] mm.c#put_page_from_l2e+0xdf/0x110
(XEN)    [<ffff82d080288c59>] free_page_type+0x2f9/0x790
(XEN)    [<ffff82d0802891f7>] mm.c#_put_page_type+0x107/0x350
(XEN)    [<ffff82d0802898ef>] put_page_type_preemptible+0xf/0x10
(XEN)    [<ffff82d080290b6d>] do_mmuext_op+0x73d/0x1810
(XEN)    [<ffff82d080295630>] compat_mmuext_op+0x430/0x450
(XEN)    [<ffff82d080367d4a>] pv_hypercall+0x3aa/0x430
(XEN)    [<ffff82d08036bbf4>] entry_int82+0x74/0xc0
(XEN)    [<ffff82d08036bbe8>] entry_int82+0x68/0xc0
(XEN)    [<ffff82d08036bbf4>] entry_int82+0x74/0xc0
(XEN)    [<ffff82d08036bbe8>] entry_int82+0x68/0xc0
(XEN)    [<ffff82d08036bbf4>] entry_int82+0x74/0xc0
(XEN)    [<ffff82d08036bbe8>] entry_int82+0x68/0xc0
(XEN)    [<ffff82d08036bbf4>] entry_int82+0x74/0xc0
(XEN)    [<ffff82d08036957e>] do_entry_int82+0x1e/0x20
(XEN)    [<ffff82d08036bc31>] entry_int82+0xb1/0xc0

Both are from 4.11rc4.
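
To compare traces across a larger set of crashes, something like the
following might help. It is only a sketch in Python and not part of Xen or
its tooling: the names FRAME_RE, frames and shared_frames are made up for
illustration, and the regular expression assumes the frame format seen in
the two traces above.

```python
import re

# Assumed frame format, taken from the traces above, e.g.
#   (XEN)    [<ffff82d080284bd2>] mm.c#dec_linear_entries+0x12/0x20
# The "file.c#" prefix is optional, and the offset may lack "0x"
# (as in "do_domctl+0/0x1a90").
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\S+#)?(\w+)\+(?:0x)?[0-9a-f]+/0x[0-9a-f]+"
)


def frames(trace):
    """Return the function names in a call trace, innermost frame first."""
    return [m.group(1) for m in FRAME_RE.finditer(trace)]


def shared_frames(a, b):
    """Return functions present in both lists, in the order of a, without repeats."""
    in_b = set(b)
    seen = set()
    out = []
    for name in a:
        if name in in_b and name not in seen:
            seen.add(name)
            out.append(name)
    return out
```

Calling shared_frames(frames(t1), frames(t2)) on the text of two traces
lists the functions both crashes went through, which is one way to check
whether a given path (UNPIN, for example) is always involved.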

> (besides a more reliable confirmation - or otherwise - that this indeed
> is an issue with 32-bit guests only).
> 
> While I think I have ruled out the TLB flush time stamp setting still
> happening too early / wrongly in certain cases, there's a small
> debugging patch that I would hope could help prove this one or the
> other way (see below).

I applied this patch to 4.11rc4 a week ago, but the assert hasn't fired so far.
It still panics with:
(XEN) Assertion 'oc > 0' failed at mm.c:681

-- 
Manuel Bouyer <bouyer@xxxxxxxxxxxxxxx>
     NetBSD: 26 years of experience will always make the difference