Re: [BUG] Core scheduling patches causing deadlock in some situations
On Fri, May 29, 2020 at 8:48 AM Tamas K Lengyel
<tamas.k.lengyel@xxxxxxxxx> wrote:
>
> On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński
> <michal.leszczynski@xxxxxxx> wrote:
> >
> > ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
> >
> > > On 29.05.20 14:51, Michał Leszczyński wrote:
> > >> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
> > >>
> > >>> On 29.05.20 14:30, Michał Leszczyński wrote:
> > >>>> Hello,
> > >>>>
> > >>>> I'm running DRAKVUF on Dell Inc. PowerEdge R640/08HT8T server with
> > >>>> Intel(R)
> > >>>> Xeon(R) Gold 6132 CPU @ 2.60GHz CPU.
> > >>>> When upgrading from Xen RELEASE 4.12 to 4.13, we have noticed some
> > >>>> stability
> > >>>> problems concerning freezes of Dom0 (Debian Buster):
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected
> > >>>> stall on CPU
> > >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP)
> > >>>> idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
> > >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
> > >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
> > >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1
> > >>>> Tainted: P OE
> > >>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
> > >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge
> > >>>> R640/08HT8T,
> > >>>> BIOS 2.1.8 04/30/2019
> > >>>> maj 27 23:17:02 debian kernel: Call Trace:
> > >>>> maj 27 23:17:02 debian kernel: <IRQ>
> > >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
> > >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
> > >>>> maj 27 23:17:02 debian kernel: ? lapic_can_unplug_cpu.cold.29+0x3b/0x3b
> > >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
> > >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
> > >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
> > >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
> > >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
> > >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> This usually results in the machine being completely unresponsive and
> > >>>> performing an
> > >>>> automated reboot after some time.
> > >>>>
> > >>>> I've bisected commits between 4.12 and 4.13 and it seems like this is
> > >>>> the patch
> > >>>> which introduced a bug:
> > >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
> > >>>>
> > >>>> Enclosed you can find the `xl dmesg` log (attachment: dmesg.txt) from
> > >>>> the fresh
> > >>>> boot of the machine on which the bug was reproduced.
> > >>>>
> > >>>> I'm also attaching the `xl info` output from this machine:
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> release : 4.19.0-6-amd64
> > >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
> > >>>> machine : x86_64
> > >>>> nr_cpus : 14
> > >>>> max_cpu_id : 223
> > >>>> nr_nodes : 1
> > >>>> cores_per_socket : 14
> > >>>> threads_per_core : 1
> > >>>> cpu_mhz : 2593.930
> > >>>> hw_caps :
> > >>>> bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
> > >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
> > >>>> total_memory : 130541
> > >>>> free_memory : 63591
> > >>>> sharing_freed_memory : 0
> > >>>> sharing_used_memory : 0
> > >>>> outstanding_claims : 0
> > >>>> free_cpus : 0
> > >>>> xen_major : 4
> > >>>> xen_minor : 13
> > >>>> xen_extra : -unstable
> > >>>> xen_version : 4.13-unstable
> > >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> > >>>> hvm-3.0-x86_32p
> > >>>> hvm-3.0-x86_64
> > >>>> xen_scheduler : credit2
> > >>>> xen_pagesize : 4096
> > >>>> platform_params : virt_start=0xffff800000000000
> > >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
> > >>>
> > >>> Which is your original Xen base? This output is clearly obtained at the
> > >>> end of the bisect process.
> > >>>
> > >>> There have been quite some corrections since the release of Xen 4.13, so
> > >>> please make sure you are running the most recent version (4.13.1).
> > >>>
> > >>>
> > >>> Juergen
> > >>
> > >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. Unfortunately
> > >> these
> > >> corrections didn't help and the bug is still reproducible.
> > >>
> > >> From our testing it turns out that:
> > >>
> > >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
> > >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
> > >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
> > >
> > > Would it be possible to test xen unstable, too?
> > >
> > > I could imagine e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 to have
> > > an impact here.
> > >
> > >
> > > Juergen
> >
> >
> > I've tried b492c65da5ec5ed revision but it seems that there is some problem
> > with ALTP2M support, so I can't launch anything at all.
> >
> > maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: -1
> > maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view
> > returned rc: -1
>
> Ough, great, that's another regression in 4.14-unstable. I ran into it
> myself but couldn't spend time to figure out whether it's just
> something in my configuration or not so I reverted to 4.13.1. Now we
> know it's a real bug.

This was a bug in libxl. I've sent in a patch that fixes it, but you can
grab it from https://github.com/tklengyel/xen/tree/libxl_fix. There is
also an update to DRAKVUF that will need to be applied due to the
recent altp2m visibility option having to be specified; you can grab
that from https://github.com/tklengyel/drakvuf/tree/4.14.

Tamas
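
For anyone comparing builds while reproducing this, a rough sketch of
pulling the hypervisor version, scheduler and changeset out of saved
`xl info` output, so it is obvious which Xen build a given report came
from. This is only an illustration, not part of xl or DRAKVUF: the
names parse_xl_info and xen_version are made up here, and the parser
assumes every field line has whitespace before the " : " separator (as
in the output quoted above) and treats any other line as a wrapped
continuation of the previous field.

```python
import re

# Field lines look like "xen_scheduler : credit2". Requiring whitespace
# before the colon (an assumption based on the quoted output) keeps
# wrapped values such as "bfebfbff:77fef3ff:..." from being read as keys.
FIELD_RE = re.compile(r"^(\w+)\s+:\s*(.*)$")


def parse_xl_info(text):
    """Parse `xl info`-style "key : value" lines into a dict of strings."""
    info = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            last_key = match.group(1)
            info[last_key] = match.group(2).strip()
        elif last_key is not None:
            # Mail clients wrap long values; append to the previous field.
            info[last_key] = (info[last_key] + " " + line).strip()
    return info


def xen_version(info):
    """Rebuild the version string from xen_major, xen_minor and xen_extra."""
    return "{}.{}{}".format(info["xen_major"], info["xen_minor"], info["xen_extra"])


# A few fields copied from the report quoted above.
SAMPLE = """\
xen_major : 4
xen_minor : 13
xen_extra : -unstable
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
hvm-3.0-x86_32p
hvm-3.0-x86_64
xen_scheduler : credit2
xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
"""

if __name__ == "__main__":
    parsed = parse_xl_info(SAMPLE)
    print(xen_version(parsed), parsed["xen_scheduler"], parsed["xen_changeset"])
```

A "-dirty" suffix on xen_changeset, as in the output above, means the
tree had local modifications when it was built, which is worth
mentioning alongside the commit hash in a bisect report.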