
Re: BUG: credit=sched2 machine hang when using DRAKVUF



On Wed, 2020-10-28 at 08:45 +0100, Jan Beulich wrote:
> On 28.10.2020 03:04, Michał Leszczyński wrote:
> 
> 
> I have to admit that the log makes me wonder whether this isn't a
> Dom0 internal issue:
> 
> > [  338.968676] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
> > [  346.963959] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [xenconsoled:2747]
> 
Yeah, weird.

> For these two vCPU-s we see ...
> 
> > (XEN) Domain info:
> > (XEN)   Domain: 0 w 256 c 0 v 14
> > (XEN)     1: [0.0] flags=20 cpu=0 credit=-10000000 [w=256] load=4594 (~1%)
> > (XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
> > (XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
> > (XEN)     4: [0.3] flags=20 cpu=6 credit=-10000000 [w=256] load=4299 (~1%)
> > (XEN)     5: [0.4] flags=20 cpu=8 credit=-10000000 [w=256] load=4537 (~1%)
> > (XEN)     6: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)
> 
> ... that both are fully loaded and ...
> 
> > [...]
> 
> ... they're actively running,
>
True indeed. But as I said in my other reply, it's weird that we have
so many vCPUs with the artificial value that we use to represent the
minimum value of credits we allow a vCPU to have.

And it's weird that, with some CPUs idle and with two CPUs running
vCPUs with negative credits, we have a vCPU with positive credits
sitting in the runqueue.

Unless the debug-key captured a transient state. Like, d0v1 is in the
runqueue because it just woke up, and the 'r' dump occurred between when
it was put in the runqueue and when a physical CPU (which is poked
during the wake-up itself) picked it up.

It seems unlikely, and it still would neither explain nor justify the
-10000000. But, still, Michał, can you perhaps check whether, while the
issue manifests, poking at the 'r' key a few times always shows the same
(or a similar) situation?
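
To make comparing several 'r' dumps less error-prone, something like the
following could help. It is only a rough sketch: the regular expressions
are guessed from the lines quoted in this thread, the floor value is
simply the -10000000 seen in the log (not looked up in the Xen sources),
and summarize() and its result keys are hypothetical names. A vCPU
reported as "positive_not_running" may be either waiting in a runqueue
or blocked; the sketch does not tell the two apart.

```python
import re

# Assumption: the credit floor as it appears in the dump quoted in this
# thread, not a value taken from the Xen sources.
CREDIT_FLOOR = -10000000

# Matches vCPU lines such as:
# (XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
VCPU_RE = re.compile(
    r"\[(?P<dom>\d+)\.(?P<vcpu>\d+)\]\s+flags=[0-9a-fA-F]+\s+"
    r"cpu=\d+\s+credit=(?P<credit>-?\d+)"
)

# Matches per-CPU lines such as:
# (XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
# Lines with current=d[IDLE]vN do not match, so idle CPUs are skipped.
CPU_RE = re.compile(r"CPU\[\d+\]\s+current=d(?P<dom>\d+)v(?P<vcpu>\d+)")


def summarize(dump):
    """Summarize one 'r' dump: who is at the floor, who runs, who waits."""
    credits = {}     # (domain, vcpu) -> credit
    running = set()  # (domain, vcpu) currently on a physical CPU
    for line in dump.splitlines():
        m = VCPU_RE.search(line)
        if m:
            credits[(int(m["dom"]), int(m["vcpu"]))] = int(m["credit"])
            continue
        m = CPU_RE.search(line)
        if m:
            running.add((int(m["dom"]), int(m["vcpu"])))
    return {
        "at_floor": sorted(v for v, c in credits.items() if c <= CREDIT_FLOOR),
        "running": sorted(running),
        "positive_not_running": sorted(
            v for v, c in credits.items() if c > 0 and v not in running
        ),
    }


# A few lines copied from the dump quoted in this thread.
SAMPLE = """\
(XEN)     1: [0.0] flags=20 cpu=0 credit=-10000000 [w=256] load=4594 (~1%)
(XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
(XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
(XEN) CPU[02] current=d[IDLE]v2, curr=d[IDLE]v2, prev=NULL
(XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running it on each captured dump and comparing the results should show
whether the same vCPUs stay at the floor and whether d0v1 keeps sitting
there with positive credits, or whether that was just a transient state.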

> > (XEN) RUNQ:
> > (XEN) CPUs info:
> > (XEN) CPU[00] current=d[IDLE]v0, curr=d[IDLE]v0, prev=NULL
> > (XEN) CPU[02] current=d[IDLE]v2, curr=d[IDLE]v2, prev=NULL
> > (XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
> > (XEN) CPU[06] current=d[IDLE]v6, curr=d[IDLE]v6, prev=NULL
> > (XEN) CPU[08] current=d[IDLE]v8, curr=d[IDLE]v8, prev=NULL
> > (XEN) CPU[10] current=d0v5, curr=d0v5, prev=NULL
> 
> ... here. Hence an additional question is what exactly they're doing.
> '0' and possibly 'd' debug key output may shed some light on it, but
> to interpret that output the exact kernel and hypervisor binaries
> would need to be known / available.
> 
Yes, I agree. Even considering all that I said (which seems to point
back at a Xen issue, rather than a kernel one), knowing more about what
the vCPUs are doing could indeed be helpful!

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
