[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Recent upgrade of 4.13 -> 4.14 issue


  • To: Frédéric Pierret <frederic.pierret@xxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 26 Oct 2020 13:54:38 +0000
  • Authentication-results: esa2.hc3370-68.iphmx.com; dkim=none (message not signed) header.i=none
  • Cc: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 26 Oct 2020 13:55:36 +0000
  • Ironport-sdr: qpxJKNKJ347VBdUy9zJ3ZSghHRMxR/BTnjGjJLi7Kj4meswgFDkA2RguwqDnthxAPsRkifHSOt B9+3wWteUI1jUPRj6tT+TN8YZGs5qWOD5d2/B/TDS4cL5GZRk5HTjil8ZcRvMbH8woSwBjKmG2 Xmdc4M2wCuGL8K/rYA25ZSkykZa3oBCZc/ak//MwjDEiuOsey7Xk6h/egsmkTZXK3slnjHBw59 Ctmi3lp902fwYn6YfkD24YcYTZl/4UqJAar9oN4t6M89zdm3hvuiJ4U8qcqMAtLG8E/ijMD2JA 1rg=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26/10/2020 13:37, Frédéric Pierret wrote:
> Hi all,
>
> I'm experiencing problem with a HP ProLiant DL360p Gen8 and recent
> upgrade of 4.13 -> 4.14. dom0 and the entire system becomes unstable
> and freeze at some point.
>
> I've managed to get three pieces of logs (last one after a reboot and
> just before total freeze) by connecting to the serial console with
> loglvl options and also redirecting linux kernel output to the serial
> console:
>
> ```
> [ 2150.954883] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 2150.954934] rcu:     7-...0: (3 GPs behind)
> idle=842/1/0x4000000000000000 softirq=64670/64671 fqs=14673
> [ 2150.954962]     (detected by 12, t=60002 jiffies, g=236593, q=126)
> [ 2150.954984] Sending NMI from CPU 12 to CPUs 7:
> [ 2160.968519] rcu: rcu_sched kthread starved for 10008 jiffies!
> g236593 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
> [ 2160.968568] rcu: RCU grace-period kthread stack dump:
> [ 2160.968586] rcu_sched       R  running task        0    10      2
> 0x80004000
> [ 2160.968612] Call Trace:
> [ 2160.968634]  ? xen_hypercall_xen_version+0xa/0x20
> [ 2160.968658]  ? xen_force_evtchn_callback+0x9/0x10
> [ 2160.968918]  ? check_events+0x12/0x20
> [ 2160.968946]  ? schedule+0x39/0xa0
> [ 2160.968964]  ? schedule_timeout+0x96/0x150
> [ 2160.968981]  ? __next_timer_interrupt+0xd0/0xd0
> [ 2160.969002]  ? rcu_gp_fqs_loop+0xea/0x2a0
> [ 2160.969019]  ? rcu_gp_kthread+0xb5/0x140
> [ 2160.969035]  ? rcu_gp_init+0x470/0x470
> [ 2160.969052]  ? kthread+0x115/0x140
> [ 2160.969067]  ? __kthread_bind_mask+0x60/0x60
> [ 2160.969085]  ? ret_from_fork+0x35/0x40
> ```
>
> and also
>
> ```
> [ 2495.945931] CPU: 14 PID: 24181 Comm: Xorg Not tainted
> 5.4.72-2.qubes.x86_64 #1
> [ 2495.945954] Hardware name: HP ProLiant DL360p Gen8, BIOS P71
> 05/24/2019
> [ 2495.945984] RIP: e030:smp_call_function_many+0x20a/0x270
> [ 2495.946004] Code: 8a 00 3b 05 4c b5 69 01 89 c7 0f 83 89 fe ff ff
> 48 63 c7 49 8b 17 48 03 14 c5 80 f9 3d 82 8b 42 18 a8 01 74 09 f3 90
> 8b 42 18 <a8> 01 75 f7 eb c9 48 c7 c2 a0 6f 82 82 4c 89 f6 89 df e8 bf b2
> 8a
> [ 2495.946051] RSP: e02b:ffffc90003aa7cf0 EFLAGS: 00000202
> [ 2495.946068] RAX: 0000000000000003 RBX: 0000000000000010 RCX:
> 0000000000000007
> [ 2495.946090] RDX: ffff8882413ef640 RSI: 0000000000000010 RDI:
> 0000000000000007
> [ 2495.946113] RBP: ffffffff81082fc0 R08: 0000000000000007 R09:
> 0000000000000000
> [ 2495.946134] R10: 0000000000000000 R11: ffffffff8265b6a8 R12:
> 0000000000000000
> [ 2495.946156] R13: 0000000000000001 R14: 0000000000029ac0 R15:
> ffff8882415a9b00
> [ 2495.946211] FS:  00007a0d51a91a40(0000) GS:ffff888241580000(0000)
> knlGS:0000000000000000
> [ 2495.946235] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2495.946253] CR2: 000070abab15a000 CR3: 00000001e18ee000 CR4:
> 0000000000040660
> [ 2495.946290] Call Trace:
> [ 2495.946314]  ? do_kernel_range_flush+0x50/0x50
> [ 2495.946334]  on_each_cpu+0x28/0x50
> [ 2495.946354]  decrease_reservation+0x22f/0x310
> [ 2495.946377]  alloc_xenballooned_pages+0xeb/0x120
> [ 2495.946396]  ? __kmalloc+0x183/0x260
> [ 2495.946413]  gnttab_alloc_pages+0x11/0x40
> [ 2495.946434]  gntdev_alloc_map+0x12f/0x250 [xen_gntdev]
> [ 2495.946454]  gntdev_ioctl_map_grant_ref+0x73/0x1d0 [xen_gntdev]
> [ 2495.946479]  do_vfs_ioctl+0x2fb/0x490
> [ 2495.946500]  ? syscall_trace_enter+0x1d1/0x2c0
> [ 2495.946551]  ksys_ioctl+0x5e/0x90
> [ 2495.946567]  __x64_sys_ioctl+0x16/0x20
> [ 2495.946583]  do_syscall_64+0x5b/0xf0
> [ 2495.946601]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 2495.946620] RIP: 0033:0x7a0d51f763bb
> [ 2495.946727] Code: 0f 1e fa 48 8b 05 dd aa 0c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad aa 0c 00 f7 d8 64 89 01
> 48
> [ 2495.946804] RSP: 002b:00007fffc48d5058 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> [ 2495.946863] RAX: ffffffffffffffda RBX: 0000000000001000 RCX:
> 00007a0d51f763bb
> [ 2495.946885] RDX: 00007fffc48d5060 RSI: 0000000000184700 RDI:
> 0000000000000009
> [ 2495.946939] RBP: 00007fffc48d5100 R08: 00007fffc48d512c R09:
> 00007a0d51a30010
> [ 2495.946998] R10: 0000000000000000 R11: 0000000000000206 R12:
> 00007fffc48d5060
> [ 2495.947020] R13: 0000000000000001 R14: 0000000000000009 R15:
> 0000000000000001
> [ 2510.964667] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 2510.964716] rcu:     7-...0: (3 GPs behind)
> idle=842/1/0x4000000000000000 softirq=64670/64671 fqs=96182
> [ 2510.964744]     (detected by 12, t=420012 jiffies, g=236593, q=11404)
> [ 2510.964769] Sending NMI from CPU 12 to CPUs 7:
> [ 2523.945643] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!
> [Xorg:24181]
> [ 2523.945686] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq
> snd_seq_device snd_timer snd soundcore br_netfilter xt_physdev
> xen_netback loop bridge stp llc rfkill ebtable_filter ebtables
> ip6table_filter ip
> 6_tables iptable_filter intel_rapl_msr iTCO_wdt ipmi_ssif
> iTCO_vendor_support intel_rapl_common sb_edac rapl raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> raid6_pq pcspkr joydev hpilo lpc
> _ich hpwdt ioatdma dca tg3 r8169 ipmi_si ipmi_devintf ipmi_msghandler
> acpi_power_meter xenfs ip_tables dm_thin_pool dm_persistent_data
> libcrc32c dm_bio_prison dm_crypt uas usb_storage uhci_hcd
> crct10dif_pclmul cr
> c32_pclmul crc32c_intel ghash_clmulni_intel mgag200 drm_kms_helper
> serio_raw drm_vram_helper ttm drm ata_generic pata_acpi i2c_algo_bit
> ehci_pci ehci_hcd xhci_pci xhci_hcd hpsa scsi_transport_sas
> xen_privcmd xen_
> pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
> pkcs8_key_parser
> [ 2523.945934] CPU: 14 PID: 24181 Comm: Xorg Tainted: G            
> L    5.4.72-2.qubes.x86_64 #1
> [ 2523.945960] Hardware name: HP ProLiant DL360p Gen8, BIOS P71
> 05/24/2019
> [ 2523.945989] RIP: e030:smp_call_function_many+0x20a/0x270
> [ 2523.946010] Code: 8a 00 3b 05 4c b5 69 01 89 c7 0f 83 89 fe ff ff
> 48 63 c7 49 8b 17 48 03 14 c5 80 f9 3d 82 8b 42 18 a8 01 74 09 f3 90
> 8b 42 18 <a8> 01 75 f7 eb c9 48 c7 c2 a0 6f 82 82 4c 89 f6 89 df e8 bf b2
> 8a
> [ 2523.946057] RSP: e02b:ffffc90003aa7cf0 EFLAGS: 00000202
> [ 2523.946075] RAX: 0000000000000003 RBX: 0000000000000010 RCX:
> 0000000000000007
> [ 2523.946097] RDX: ffff8882413ef640 RSI: 0000000000000010 RDI:
> 0000000000000007
> [ 2523.946119] RBP: ffffffff81082fc0 R08: 0000000000000007 R09:
> 0000000000000000
> [ 2523.946162] R10: 0000000000000000 R11: ffffffff8265b6a8 R12:
> 0000000000000000
> [ 2523.946184] R13: 0000000000000001 R14: 0000000000029ac0 R15:
> ffff8882415a9b00
> [ 2523.946233] FS:  00007a0d51a91a40(0000) GS:ffff888241580000(0000)
> knlGS:0000000000000000
> [ 2523.946256] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2523.946275] CR2: 000070abab15a000 CR3: 00000001e18ee000 CR4:
> 0000000000040660
> [ 2523.946313] Call Trace:
> [ 2523.946336]  ? do_kernel_range_flush+0x50/0x50
> [ 2523.946356]  on_each_cpu+0x28/0x50
> [ 2523.946376]  decrease_reservation+0x22f/0x310
> [ 2523.946397]  alloc_xenballooned_pages+0xeb/0x120
> [ 2523.946418]  ? __kmalloc+0x183/0x260
> [ 2523.946434]  gnttab_alloc_pages+0x11/0x40
> [ 2523.946457]  gntdev_alloc_map+0x12f/0x250 [xen_gntdev]
> [ 2523.946478]  gntdev_ioctl_map_grant_ref+0x73/0x1d0 [xen_gntdev]
> [ 2523.946502]  do_vfs_ioctl+0x2fb/0x490
> [ 2523.946559]  ? syscall_trace_enter+0x1d1/0x2c0
> [ 2523.946578]  ksys_ioctl+0x5e/0x90
> [ 2523.946594]  __x64_sys_ioctl+0x16/0x20
> [ 2523.946610]  do_syscall_64+0x5b/0xf0
> [ 2523.946713]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 2523.946738] RIP: 0033:0x7a0d51f763bb
> [ 2523.946782] Code: 0f 1e fa 48 8b 05 dd aa 0c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad aa 0c 00 f7 d8 64 89 01
> 48
> [ 2523.946867] RSP: 002b:00007fffc48d5058 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000010
> [ 2523.946927] RAX: ffffffffffffffda RBX: 0000000000001000 RCX:
> 00007a0d51f763bb
> [ 2523.946950] RDX: 00007fffc48d5060 RSI: 0000000000184700 RDI:
> 0000000000000009
> [ 2523.946972] RBP: 00007fffc48d5100 R08: 00007fffc48d512c R09:
> 00007a0d51a30010
> [ 2523.947029] R10: 0000000000000000 R11: 0000000000000206 R12:
> 00007fffc48d5060
> [ 2523.947051] R13: 0000000000000001 R14: 0000000000000009 R15:
> 0000000000000001
> ```
>
> and finally
>
> ```
> [ 1597.971380] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1597.971409] rcu:     0-...0: (5 ticks this GP) idle=cd2/0/0x1
> softirq=59121/59122 fqs=14998
> [ 1597.971420] rcu:     7-...0: (2 ticks this GP)
> idle=e9e/1/0x4000000000000000 softirq=49519/49519 fqs=14998
> [ 1597.971431]     (detected by 11, t=60002 jiffies, g=234321, q=83)
> [ 1597.971441] Sending NMI from CPU 11 to CPUs 0:
> [ 1597.972452] NMI backtrace for cpu 0
> [ 1597.972453] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 5.4.72-2.qubes.x86_64 #1
> [ 1597.972453] Hardware name: HP ProLiant DL360p Gen8, BIOS P71
> 05/24/2019
> [ 1597.972454] RIP: e030:xen_hypercall_sched_op+0xa/0x20
> [ 1597.972454] Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00
> 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc
> [ 1597.972455] RSP: e02b:ffffc90001403d88 EFLAGS: 00000002
> [ 1597.972456] RAX: 0000000000000000 RBX: ffff888241215f80 RCX:
> ffffffff810013aa
> [ 1597.972456] RDX: ffff88823c070428 RSI: ffffc90001403da8 RDI:
> 0000000000000003
> [ 1597.972456] RBP: ffff8882365d8bf0 R08: ffffffff8265b6a0 R09:
> 0000000000000000
> [ 1597.972457] R10: 0000000000000000 R11: 0000000000000202 R12:
> 0000000000000049
> [ 1597.972457] R13: 0000000000000100 R14: ffff8882422b6640 R15:
> 0000000000000001
> [ 1597.972458] FS:  0000729633b7d700(0000) GS:ffff888241200000(0000)
> knlGS:0000000000000000
> [ 1597.972458] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1597.972458] CR2: 000077f6508c47c0 CR3: 000000020d006000 CR4:
> 0000000000040660
> [ 1597.972459] Call Trace:
> [ 1597.972459]  <IRQ>
> [ 1597.972459]  ? xen_poll_irq+0x73/0xa0
> [ 1597.972459]  ? xen_qlock_wait+0x7b/0x80
> [ 1597.972460]  ? pv_wait_head_or_lock+0x85/0xb0
> [ 1597.972460]  ? __pv_queued_spin_lock_slowpath+0xb5/0x1b0
> [ 1597.972460]  ? _raw_spin_lock_irqsave+0x32/0x40
> [ 1597.972461]  ? bfq_finish_requeue_request+0xb5/0x120
> [ 1597.972461]  ? blk_mq_free_request+0x3a/0xf0
> [ 1597.972461]  ? scsi_end_request+0x95/0x140
> [ 1597.972461]  ? scsi_io_completion+0x74/0x190
> [ 1597.972462]  ? blk_done_softirq+0xea/0x180
> [ 1597.972462]  ? __do_softirq+0xd9/0x2c8
> [ 1597.972462]  ? irq_exit+0xcf/0x110
> [ 1597.972462]  ? xen_evtchn_do_upcall+0x2c/0x40
> [ 1597.972463]  ? xen_do_hypervisor_callback+0x29/0x40
> [ 1597.972463]  </IRQ>
> [ 1597.972463]  ? xen_hypercall_sched_op+0xa/0x20
> [ 1597.972464]  ? xen_hypercall_sched_op+0xa/0x20
> [ 1597.972464]  ? xen_safe_halt+0xc/0x20
> [ 1597.972464]  ? default_idle+0x1a/0x140
> [ 1597.972465]  ? cpuidle_idle_call+0x139/0x190
> [ 1597.972465]  ? do_idle+0x73/0xd0
> [ 1597.972465]  ? cpu_startup_entry+0x19/0x20
> [ 1597.972466]  ? start_kernel+0x68a/0x6bf
> [ 1597.972466]  ? xen_start_kernel+0x6a2/0x6c1
> [ 1597.972470] Sending NMI from CPU 11 to CPUs 7:
> [ 1607.976873] rcu: rcu_sched kthread starved for 10007 jiffies!
> g234321 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=11
> [ 1607.976924] rcu: RCU grace-period kthread stack dump:
> [ 1607.976942] rcu_sched       I    0    10      2 0x80004000
> [ 1607.976972] Call Trace:
> [ 1607.976999]  __schedule+0x217/0x570
> [ 1607.977020]  ? schedule+0x39/0xa0
> [ 1607.977038]  ? schedule_timeout+0x96/0x150
> [ 1607.977056]  ? __next_timer_interrupt+0xd0/0xd0
> [ 1607.977079]  ? rcu_gp_fqs_loop+0xea/0x2a0
> [ 1607.977096]  ? rcu_gp_kthread+0xb5/0x140
> [ 1607.977112]  ? rcu_gp_init+0x470/0x470
> [ 1607.977130]  ? kthread+0x115/0x140
> [ 1607.977145]  ? __kthread_bind_mask+0x60/0x60
> [ 1607.977164]  ? ret_from_fork+0x35/0x40
> ```
>
> I've tried to increase memory, set maximum of dom0 vcps to 1 or 4, pin
> vcpus too, multiple 5.4 kernels tool...no luck.
>
> I've also observed that some (never the same) VM (PVH or HVM) fail to
> start randomly because of being stuck at boot time with analog stack
> trace as the first piece of log provided above. Those VM are
> impossible to desroy and then it "propagates" though dom0 with the two
> latest piece of codes.
>
> If anyone would have any idea of what's going on, that would be very
> appreciated. Thank you.

Does booting Xen with `sched=credit` make a difference?

~Andrew

Attachment: signature.asc
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.