
Re: Recent upgrade of 4.13 -> 4.14 issue



On Sat, Oct 31, 2020 at 04:27:58AM +0100, Dario Faggioli wrote:
> On Sat, 2020-10-31 at 03:54 +0100, marmarek@xxxxxxxxxxxxxxxxxxxxxx
> wrote:
> > On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
> > (XEN) *** Dumping CPU7 host state: ***
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
> > (XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
> > (XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
> > (XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
> > (XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
> > (XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
> > (XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30
> > 
> > And so on, for (almost?) all CPUs.
>
> Right. So, it seems like a live (I would say) lock. It might happen on
> some resource which is shared among domains. And introduced (the
> livelock, not the resource or the sharing) in 4.14.
> 
> Just giving a quick look, I see that vmx_do_resume() calls
> vmx_clear_vmcs() which calls on_selected_cpus() which takes the
> call_lock spinlock.
> 
> And none of these seems to have received much attention recently.
> 
> But this is just a really basic analysis!

I've looked at on_selected_cpus() and my understanding is this:
1. take call_lock spinlock
2. record the function, its arguments and the target CPU mask in the global "call_data" variable
3. ask CPUs to execute that function (smp_send_call_function_mask() call)
4. wait for all requested CPUs to execute the function, still holding
the spinlock
5. only then - release the spinlock

So, if any CPU does not execute the requested function for any reason, it
will keep call_lock locked forever (see the sketch below).
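To spell that out, here is a minimal sketch of the pattern, just my
paraphrase of steps 1-5 above, not the actual Xen source; the IPI handler
in which each target CPU clears its own bit after running the function is
omitted:

    /*
     * Simplified sketch of the on_selected_cpus() flow described above.
     * Not the real implementation - names and details are approximate.
     */
    static DEFINE_SPINLOCK(call_lock);
    static struct {
        void (*func)(void *info);
        void *info;
        cpumask_t selected;          /* CPUs still expected to run func */
    } call_data;

    void on_selected_cpus_sketch(const cpumask_t *selected,
                                 void (*func)(void *info), void *info)
    {
        spin_lock(&call_lock);                   /* 1. take call_lock */

        cpumask_copy(&call_data.selected, selected);
        call_data.func = func;                   /* 2. publish func+args+mask */
        call_data.info = info;

        smp_send_call_function_mask(selected);   /* 3. IPI the target CPUs */

        /*
         * 4. spin until every target CPU has run func and cleared its bit
         *    in call_data.selected (done in the IPI handler, not shown).
         *    If one CPU never services the IPI, we spin here forever ...
         */
        while ( !cpumask_empty(&call_data.selected) )
            cpu_relax();

        spin_unlock(&call_lock);                 /* 5. ... still holding the lock */
    }

So any other CPU entering this path in the meantime just spins on
call_lock, which matches the _spin_lock+0x35 traces above.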

I don't see any CPU waiting on step 4, but I also don't see call traces
from CPU3 and CPU8 in the log - that's because they are in guest (dom0
here) context, right? I do see "guest state" dumps from them.
The only three CPUs that did log Xen call traces and are not waiting on
that spinlock are:

CPU0:
(XEN) Xen call trace:
(XEN)    [<ffff82d040240f89>] R vcpu_unblock+0x9/0x50
(XEN)    [<ffff82d0402e0171>] S vcpu_kick+0x11/0x60
(XEN)    [<ffff82d0402259c8>] S tasklet.c#do_tasklet_work+0x68/0xc0
(XEN)    [<ffff82d040225a59>] S tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

CPU4:
(XEN) Xen call trace:
(XEN)    [<ffff82d040227043>] R set_timer+0x133/0x220
(XEN)    [<ffff82d040234e90>] S credit.c#csched_tick+0/0x3a0
(XEN)    [<ffff82d04022660f>] S timer.c#timer_softirq_action+0x9f/0x300
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

CPU14:
(XEN) Xen call trace:
(XEN)    [<ffff82d040222dc0>] R do_softirq+0/0x10
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

I'm not sure whether any of those is related to that spinlock, the
on_selected_cpus() call, or anything like that...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


