
Re: BUG: credit=sched2 machine hang when using DRAKVUF


  • To: Jürgen Groß <jgross@xxxxxxxx>
  • From: Michał Leszczyński <michal.leszczynski@xxxxxxx>
  • Date: Wed, 28 Oct 2020 03:04:12 +0100 (CET)
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 28 Oct 2020 02:04:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: BnNcHQLsrjmQr0BXl+REe3376mCu2Q==
  • Thread-topic: credit=sched2 machine hang when using DRAKVUF

----- On 23 Oct 2020, at 6:47, Jürgen Groß jgross@xxxxxxxx wrote:

> On 23.10.20 00:59, Michał Leszczyński wrote:
>> Hello,
>> 
>> When using DRAKVUF against a Windows 7 x64 DomU, the whole machine hangs
>> after a few minutes.
>> 
>> The chance of a hang seems to be correlated with the number of PCPUs: in
>> this case we have 14 PCPUs and the hang is very easily reproducible, while
>> on other machines with 2-4 PCPUs it is very rare (but still occurs
>> sometimes). The issue is observed with the default sched=credit2 and is no
>> longer reproducible once sched=credit is set.
> 
> Interesting. Can you please share some more information?
> 
> Which Xen version are you using?
> 
> Is there any additional information in the dom0 log which could be
> related to the hang (earlier WARN() splats, Oopses, Xen related
> messages, hardware failure messages, ...)?
> 
> Can you please try to get backtraces of all cpus at the time of the
> hang?
> 
> It would help to know which cpu was the target of the call of
> smp_call_function_single(), so a disassembly of that function would
> be needed to find that information from the dumped registers.
> 
> I'm asking because I've seen a similar problem recently and I was
> suspecting a fifo event channel issue rather than the Xen scheduler,
> but your data suggests it could be the scheduler after all (if it is
> the same issue, of course).
> 
> 
> Juergen


As I've said before, I'm using RELEASE-4.14.0. The machine is a Dell PowerEdge
R640 with 14 PCPUs.

I have some additional log output (enclosed below). As you can see, particular
vCPUs of Dom0 are not scheduled for long stretches of time, which severely
degrades the stability of the host system.
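
For reference, a scheduler dump like the one enclosed can be triggered either
from the serial console (Ctrl-a three times, then the 'r' debug key) or from
dom0 via the xl toolstack. A minimal sketch, assuming the standard xl tools
are installed in dom0 (the exact output format varies between Xen versions):

    xl debug-keys r          # ask Xen to dump its scheduler run queues
    xl dmesg | tail -n 200   # read the dump back from the Xen console ring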

Hope this helps somehow.
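
As for the per-CPU backtraces and the disassembly of
smp_call_function_single(): a rough sketch of how those could be collected on
a Debian dom0 (the debug-symbols package name and the vmlinux path below are
assumptions, not verified):

    # Xen-level register dump of all physical CPUs ('d' debug key)
    xl debug-keys d
    xl dmesg | tail -n 400

    # dom0-level backtraces of all active CPUs (sysrq 'l', as root)
    echo l > /proc/sysrq-trigger
    dmesg | tail -n 200

    # disassemble smp_call_function_single from the dom0 kernel image
    # (assumes linux-image-<version>-dbg is installed and provides an
    #  unstripped vmlinux at this path)
    gdb -batch -ex 'disassemble smp_call_function_single' \
        /usr/lib/debug/boot/vmlinux-$(uname -r)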



Best regards,
Michał Leszczyński
CERT Polska

---

[  313.730969] rcu: INFO: rcu_sched self-detected stall on CPU
[  313.731154] rcu:     5-....: (5249 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4625 fqs=2624
[  313.731474] rcu:      (t=5250 jiffies g=10309 q=220)
[  338.968676] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  346.963959] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[xenconsoled:2747]
(XEN) *** Serial input to Xen (type 'CTRL-a' three times to switch input)
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=384307105230
(XEN) Online Cpus: 0,2,4,6,8,10,12,14,16,18,20,22,24,26
(XEN) Cpupool 0:
(XEN) Cpus: 0,2,4,6,8,10,12,14,16,18,20,22,24,26
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 2
(XEN)   default-weight     = 256
(XEN) Runqueue 0:
(XEN)   ncpus              = 7
(XEN)   cpus               = 0,2,4,6,8,10,12
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 10
(XEN)   instload           = 3
(XEN)   aveload            = 805194 (~307%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,00001145
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 
00000000,00000000,00000000,00000000,00000000,00000000,00001145
(XEN) Runqueue 1:
(XEN)   ncpus              = 7
(XEN)   cpus               = 14,16,18,20,22,24,26
(XEN)   max_weight         = 256
(XEN)   pick_bias          = 22
(XEN)   instload           = 0
(XEN)   aveload            = 51211 (~19%)
(XEN)   idlers: 00000000,00000000,00000000,00000000,00000000,00000000,05454000
(XEN)   tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN)   fully idle cores: 
00000000,00000000,00000000,00000000,00000000,00000000,05454000
(XEN) Domain info:
(XEN)   Domain: 0 w 256 c 0 v 14
(XEN)     1: [0.0] flags=20 cpu=0 credit=-10000000 [w=256] load=4594 (~1%)
(XEN)     2: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
(XEN)     3: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
(XEN)     4: [0.3] flags=20 cpu=6 credit=-10000000 [w=256] load=4299 (~1%)
(XEN)     5: [0.4] flags=20 cpu=8 credit=-10000000 [w=256] load=4537 (~1%)
(XEN)     6: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)
(XEN)     7: [0.6] flags=20 cpu=12 credit=-10000000 [w=256] load=5158 (~1%)
(XEN)     8: [0.7] flags=20 cpu=14 credit=10053352 [w=256] load=5150 (~1%)
(XEN)     9: [0.8] flags=20 cpu=16 credit=10200526 [w=256] load=5155 (~1%)
(XEN)    10: [0.9] flags=20 cpu=18 credit=10207025 [w=256] load=4939 (~1%)
(XEN)    11: [0.10] flags=20 cpu=20 credit=10131199 [w=256] load=5753 (~2%)
(XEN)    12: [0.11] flags=20 cpu=22 credit=8103663 [w=256] load=22544 (~8%)
(XEN)    13: [0.12] flags=20 cpu=24 credit=10213151 [w=256] load=4905 (~1%)
(XEN)    14: [0.13] flags=20 cpu=26 credit=10235821 [w=256] load=4858 (~1%)
(XEN)   Domain: 29 w 256 c 0 v 4
(XEN)    15: [29.0] flags=0 cpu=16 credit=10500000 [w=256] load=0 (~0%)
(XEN)    16: [29.1] flags=0 cpu=26 credit=10500000 [w=256] load=0 (~0%)
(XEN)    17: [29.2] flags=0 cpu=10 credit=8294046 [w=256] load=0 (~0%)
(XEN)    18: [29.3] flags=0 cpu=12 credit=9009727 [w=256] load=0 (~0%)
(XEN)   Domain: 30 w 256 c 0 v 4
(XEN)    19: [30.0] flags=0 cpu=26 credit=10500000 [w=256] load=0 (~0%)
(XEN)    20: [30.1] flags=0 cpu=16 credit=10500000 [w=256] load=0 (~0%)
(XEN)    21: [30.2] flags=0 cpu=18 credit=10500000 [w=256] load=0 (~0%)
(XEN)    22: [30.3] flags=0 cpu=22 credit=10500000 [w=256] load=0 (~0%)
(XEN) Runqueue 0:
(XEN) CPU[00] runq=0, sibling={0}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[02] runq=0, sibling={2}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[04] runq=0, sibling={4}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN)   run: [0.2] flags=22 cpu=4 credit=-10000000 [w=256] load=262144 (~100%)
(XEN) CPU[06] runq=0, sibling={6}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[08] runq=0, sibling={8}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[10] runq=0, sibling={10}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN)   run: [0.5] flags=22 cpu=10 credit=-10000000 [w=256] load=262144 (~100%)
(XEN) CPU[12] runq=0, sibling={12}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) RUNQ:
(XEN)     0: [0.1] flags=20 cpu=2 credit=9134904 [w=256] load=262144 (~100%)
(XEN) Runqueue 1:
(XEN) CPU[14] runq=1, sibling={14}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[16] runq=1, sibling={16}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[18] runq=1, sibling={18}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[20] runq=1, sibling={20}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[22] runq=1, sibling={22}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[24] runq=1, sibling={24}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) CPU[26] runq=1, sibling={26}, core={0,2,4,6,8,10,12,14,16,18,20,22,24,26}
(XEN) RUNQ:
(XEN) CPUs info:
(XEN) CPU[00] current=d[IDLE]v0, curr=d[IDLE]v0, prev=NULL
(XEN) CPU[02] current=d[IDLE]v2, curr=d[IDLE]v2, prev=NULL
(XEN) CPU[04] current=d0v2, curr=d0v2, prev=NULL
(XEN) CPU[06] current=d[IDLE]v6, curr=d[IDLE]v6, prev=NULL
(XEN) CPU[08] current=d[IDLE]v8, curr=d[IDLE]v8, prev=NULL
(XEN) CPU[10] current=d0v5, curr=d0v5, prev=NULL
(XEN) CPU[12] current=d[IDLE]v12, curr=d[IDLE]v12, prev=NULL
(XEN) CPU[14] current=d[IDLE]v14, curr=d[IDLE]v14, prev=NULL
(XEN) CPU[16] current=d[IDLE]v16, curr=d[IDLE]v16, prev=NULL
(XEN) CPU[18] current=d[IDLE]v18, curr=d[IDLE]v18, prev=NULL
(XEN) CPU[20] current=d[IDLE]v20, curr=d[IDLE]v20, prev=NULL
(XEN) CPU[22] current=d[IDLE]v22, curr=d[IDLE]v22, prev=NULL
(XEN) CPU[24] current=d[IDLE]v24, curr=d[IDLE]v24, prev=NULL
(XEN) CPU[26] current=d[IDLE]v26, curr=d[IDLE]v26, prev=NULL
[  366.966158] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  374.961437] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[xenconsoled:2747]
[  376.737277] rcu: INFO: rcu_sched self-detected stall on CPU
[  376.737457] rcu:     5-....: (21002 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4630 fqs=10491
[  376.737773] rcu:      (t=21003 jiffies g=10309 q=4631)
[  402.958905] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[xenconsoled:2747]
[  402.962904] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [sshd:5991]
[  413.434099] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 
2-... 5-... } 5657 jiffies s: 57 root: 0x24/.
[  413.434470] rcu: blocking rcu_node structures:
[  430.956362] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  430.960361] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  439.743562] rcu: INFO: rcu_sched self-detected stall on CPU
[  439.743741] rcu:     5-....: (36755 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4630 fqs=18363
[  439.744060] rcu:      (t=36756 jiffies g=10309 q=8837)
[  458.953810] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  466.957079] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  476.916310] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 
2-... 5-... } 21529 jiffies s: 57 root: 0x24/.
[  476.916712] rcu: blocking rcu_node structures:
[  486.951250] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  491.251030] INFO: task sshd:5993 blocked for more than 120 seconds.
[  491.251249]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  491.251535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  491.251891] INFO: task sshd:5995 blocked for more than 120 seconds.
[  491.252078]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  491.252321] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  491.252657] INFO: task qemu-system-i38:6011 blocked for more than 120 
seconds.
[  491.252924]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  491.253167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  491.253512] INFO: task sshd:6056 blocked for more than 120 seconds.
[  491.253703]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  491.253947] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  494.954517] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  502.749802] rcu: INFO: rcu_sched self-detected stall on CPU
[  502.749981] rcu:     5-....: (52508 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4630 fqs=26225
[  502.750307] rcu:      (t=52509 jiffies g=10309 q=12321)
[  514.948683] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  526.951580] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  540.398469] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 
2-... 5-... } 37401 jiffies s: 57 root: 0x24/.
[  540.398843] rcu: blocking rcu_node structures:
[  542.946109] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  554.949003] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  565.756021] rcu: INFO: rcu_sched self-detected stall on CPU
[  565.756203] rcu:     5-....: (68261 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4630 fqs=34096
[  565.756521] rcu:      (t=68262 jiffies g=10309 q=14838)
[  570.943529] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  590.945683] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  598.940945] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  603.880631] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 
2-... 5-... } 53273 jiffies s: 57 root: 0x24/.
[  603.881006] rcu: blocking rcu_node structures:
[  612.071898] INFO: task ntpd:2726 blocked for more than 120 seconds.
[  612.072120]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.072409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  612.072806] INFO: task sshd:5993 blocked for more than 120 seconds.
[  612.073016]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.073258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  612.073605] INFO: task sshd:5995 blocked for more than 120 seconds.
[  612.085380]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.097439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  612.109691] INFO: task qemu-system-i38:6011 blocked for more than 120 
seconds.
[  612.121877]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.134123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  612.146501] INFO: task sshd:6056 blocked for more than 120 seconds.
[  612.158836]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.171134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  612.183643] INFO: task qemu-system-i38:6105 blocked for more than 120 
seconds.
[  612.196174]       Tainted: P           OEL    4.19.0-6-amd64 #1 Debian 
4.19.67-2+deb10u2
[  612.208899] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  618.943102] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]
[  626.938354] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  628.762185] rcu: INFO: rcu_sched self-detected stall on CPU
[  628.774574] rcu:     5-....: (84011 ticks this GP) 
idle=c6e/1/0x4000000000000002 softirq=4625/4630 fqs=41964
[  628.787208] rcu:      (t=84015 jiffies g=10309 q=21278)
[  654.935761] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! 
[xenconsoled:2747]
[  654.939869] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [sshd:5991]



 

