Xen project Mailing List

Re: [BUG] Core scheduling patches causing deadlock in some situations

To: Michał Leszczyński <michal.leszczynski@xxxxxxx>

From: Tamas K Lengyel <tamas.k.lengyel@xxxxxxxxx>

Date: Fri, 29 May 2020 08:48:01 -0600

Cc: Jürgen Groß <jgross@xxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, chivay@xxxxxxx, bonus@xxxxxxx

Delivery-date: Fri, 29 May 2020 14:48:39 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński <michal.leszczynski@xxxxxxx> wrote:
>
> ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
>
> > On 29.05.20 14:51, Michał Leszczyński wrote:
> >> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
> >>
> >>> On 29.05.20 14:30, Michał Leszczyński wrote:
> >>>> Hello,
> >>>>
> >>>> I'm running DRAKVUF on Dell Inc. PowerEdge R640/08HT8T server with Intel(R)
> >>>> Xeon(R) Gold 6132 CPU @ 2.60GHz CPU.
> >>>> When upgrading from Xen RELEASE 4.12 to 4.13, we have noticed some stability
> >>>> problems concerning freezes of Dom0 (Debian Buster):
> >>>>
> >>>> ---
> >>>>
> >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP) idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
> >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
> >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
> >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 Tainted: P OE 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
> >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.1.8 04/30/2019
> >>>> maj 27 23:17:02 debian kernel: Call Trace:
> >>>> maj 27 23:17:02 debian kernel: <IRQ>
> >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
> >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
> >>>> maj 27 23:17:02 debian kernel: ? lapic_can_unplug_cpu.cold.29+0x3b/0x3b
> >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
> >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
> >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
> >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
> >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
> >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
> >>>>
> >>>> ---
> >>>>
> >>>> This usually results in machine being completely unresponsive and performing an
> >>>> automated reboot after some time.
> >>>>
> >>>> I've bisected commits between 4.12 and 4.13 and it seems like this is the patch
> >>>> which introduced a bug:
> >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
> >>>>
> >>>> Enclosed you can find the `xl dmesg` log (attachment: dmesg.txt) from the fresh
> >>>> boot of the machine on which the bug was reproduced.
> >>>>
> >>>> I'm also attaching the `xl info` output from this machine:
> >>>>
> >>>> ---
> >>>>
> >>>> release : 4.19.0-6-amd64
> >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
> >>>> machine : x86_64
> >>>> nr_cpus : 14
> >>>> max_cpu_id : 223
> >>>> nr_nodes : 1
> >>>> cores_per_socket : 14
> >>>> threads_per_core : 1
> >>>> cpu_mhz : 2593.930
> >>>> hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
> >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
> >>>> total_memory : 130541
> >>>> free_memory : 63591
> >>>> sharing_freed_memory : 0
> >>>> sharing_used_memory : 0
> >>>> outstanding_claims : 0
> >>>> free_cpus : 0
> >>>> xen_major : 4
> >>>> xen_minor : 13
> >>>> xen_extra : -unstable
> >>>> xen_version : 4.13-unstable
> >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> >>>> xen_scheduler : credit2
> >>>> xen_pagesize : 4096
> >>>> platform_params : virt_start=0xffff800000000000
> >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
> >>>
> >>> Which is your original Xen base? This output is clearly obtained at the
> >>> end of the bisect process.
> >>>
> >>> There have been quite some corrections since the release of Xen 4.13, so
> >>> please make sure you are running the most actual version (4.13.1).
> >>>
> >>>
> >>> Juergen
> >>
> >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. Unfortunately these
> >> corrections didn't help and the bug is still reproducible.
> >>
> >> From our testing it turns out that:
> >>
> >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
> >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
> >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
> >
> > Would it be possible to test xen unstable, too?
> >
> > I could imagine e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 to have
> > an impact here.
> >
> >
> > Juergen
>
>
> I've tried b492c65da5ec5ed revision but it seems that there is some problem
> with ALTP2M support, so I can't launch anything at all.
>
> maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: -1
> maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view returned rc: -1

Ough, great, that's another regression in 4.14-unstable. I ran into it
myself but couldn't spend time to figure out whether it's just something
in my configuration or not so I reverted to 4.13.1. Now we know it's a
real bug.

Tamas
