
Re: BUG: credit=sched2 machine hang when using DRAKVUF



On 28.10.2020 03:04, Michał Leszczyński wrote:
> As I've said before, I'm using RELEASE-4.14.0, this is DELL PowerEdge R640 
> with 14 PCPUs.

I.e. you haven't tried the tip of the 4.14 stable branch?

> I have the following additional pieces of log (enclosed below). As you could 
> see, the issue is about particular vCPUs of Dom0 not being scheduled for a 
> long time, which really decreases stability of the host system.

I have to admit that the log makes me wonder whether this isn't a
Dom0 internal issue:

> [  338.968676] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
> [  346.963959] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [xenconsoled:2747]

For these two vCPU-s we see ...

> (XEN) Domain info:
> (XEN)   Domain: 0 w 256 c 0 v 14
> (XEN)     1: [0.0] flags=20 cpu=0 credit=-10000000 [w=256] load=4594 (~1%)
> (XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
> (XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
> (XEN)     4: [0.3] flags=20 cpu=6 credit=-10000000 [w=256] load=4299 (~1%)
> (XEN)     5: [0.4] flags=20 cpu=8 credit=-10000000 [w=256] load=4537 (~1%)
> (XEN)     6: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)

... that both are fully loaded and ...

> (XEN) Runqueue 0:
> (XEN) CPU[00] runq=0, sibling={0}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN) CPU[02] runq=0, sibling={2}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN) CPU[04] runq=0, sibling={4}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN)   run: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
> (XEN) CPU[06] runq=0, sibling={6}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN) CPU[08] runq=0, sibling={8}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN) CPU[10] runq=0, sibling={10}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
> (XEN)   run: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)

... they're actively running, confirmed another time ...

> (XEN) RUNQ:
> (XEN) CPUs info:
> (XEN) CPU[00] current=d[IDLE]v0, curr=d[IDLE]v0, prev=NULL
> (XEN) CPU[02] current=d[IDLE]v2, curr=d[IDLE]v2, prev=NULL
> (XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
> (XEN) CPU[06] current=d[IDLE]v6, curr=d[IDLE]v6, prev=NULL
> (XEN) CPU[08] current=d[IDLE]v8, curr=d[IDLE]v8, prev=NULL
> (XEN) CPU[10] current=d0v5, curr=d0v5, prev=NULL

... here. Hence an additional question is what exactly they're doing.
'0' and possibly 'd' debug key output may shed some light on it, but
to interpret that output the exact kernel and hypervisor binaries
would need to be known / available.
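
The cross-check walked through above (which vCPUs are fully loaded, and
which of those are actually current on a pCPU) can be sketched as a small
parser. This is only an illustration: the regular expressions are written
against the lines quoted in this mail, the function names are made up for
the example, and other Xen versions or wrapped log lines may need
adjustments.

```python
import re

# Per-vCPU line as quoted in this mail, e.g.
# "(XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)"
VCPU_RE = re.compile(
    r"\[(?P<dom>\d+)\.(?P<vcpu>\d+)\] flags=[0-9a-f]+ cpu=(?P<cpu>\d+) "
    r"credit=-?\d+ \[w=\d+\] load=\d+ \(~(?P<pct>\d+)%\)"
)
# Per-pCPU line, e.g. "(XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL".
# Idle entries ("d[IDLE]v0") deliberately do not match.
CURR_RE = re.compile(r"CPU\[(?P<cpu>\d+)\] current=d(?P<dom>\d+)v(?P<vcpu>\d+),")


def busy_vcpus(log, threshold=100):
    """Return {(dom, vcpu): pcpu} for vCPUs reported at >= threshold % load."""
    return {
        (int(m["dom"]), int(m["vcpu"])): int(m["cpu"])
        for m in VCPU_RE.finditer(log)
        if int(m["pct"]) >= threshold
    }


def running_vcpus(log):
    """Return {(dom, vcpu): pcpu} for vCPUs that are current on some pCPU."""
    return {
        (int(m["dom"]), int(m["vcpu"])): int(m["cpu"])
        for m in CURR_RE.finditer(log)
    }


def busy_and_running(log):
    """vCPUs that are both fully loaded and currently running, sorted."""
    return sorted(busy_vcpus(log).keys() & running_vcpus(log).keys())


# Excerpt of the lines quoted above (unwrapped).
SAMPLE = """\
(XEN)     1: [0.0] flags=20 cpu=0 credit=-10000000 [w=256] load=4594 (~1%)
(XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
(XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
(XEN)     6: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)
(XEN) CPU[02] current=d[IDLE]v2, curr=d[IDLE]v2, prev=NULL
(XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
(XEN) CPU[10] current=d0v5, curr=d0v5, prev=NULL
"""

if __name__ == "__main__":
    for dom, vcpu in busy_and_running(SAMPLE):
        print(f"d{dom}v{vcpu} is fully loaded and currently running")
```

On the excerpt this reports d0v2 and d0v5, i.e. the two vCPUs for which the
kernel logged soft lockups (CPU#2 and CPU#5).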

Furthermore, to tell a deadlock from a livelock, more than one invocation
of any of the involved debug keys is often helpful.
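
As a rough sketch of why several invocations help: if a vCPU is found at
the same place in every sample, that is consistent with it being stuck,
while a changing location means it is still executing code. The helper
below is hypothetical, and the location strings are placeholders for
whatever one extracts by hand from the debug key output; it is a heuristic,
not a verdict (a vCPU spinning in a tight loop may show slightly different
locations each time).

```python
def compare_samples(samples):
    """Compare per-vCPU locations across debug-key invocations.

    samples: list of dicts, one per invocation, mapping a vCPU name to an
    opaque location string (placeholder values, extracted by hand).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    verdict = {}
    for vcpu in samples[0]:
        seen = {sample.get(vcpu) for sample in samples}
        if len(seen) == 1:
            verdict[vcpu] = "same location each time"
        else:
            verdict[vcpu] = "location changes"
    return verdict


if __name__ == "__main__":
    # Placeholder data only; these are not values from the reported system.
    samples = [
        {"d0v2": "loc_a", "d0v5": "loc_x"},
        {"d0v2": "loc_a", "d0v5": "loc_y"},
    ]
    for vcpu, result in compare_samples(samples).items():
        print(vcpu, "->", result)
```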

Jan



 

