
Re: [BUG] Core scheduling patches causing deadlock in some situations



On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński
<michal.leszczynski@xxxxxxx> wrote:
>
> ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
>
> > On 29.05.20 14:51, Michał Leszczyński wrote:
> >> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
> >>
> >>> On 29.05.20 14:30, Michał Leszczyński wrote:
> >>>> Hello,
> >>>>
> >>>> I'm running DRAKVUF on a Dell Inc. PowerEdge R640/08HT8T server with an
> >>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz.
> >>>> After upgrading from Xen RELEASE 4.12 to 4.13, we noticed stability
> >>>> problems in the form of Dom0 (Debian Buster) freezes:
> >>>>
> >>>> ---
> >>>>
> >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP) idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
> >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
> >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
> >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 Tainted: P OE 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
> >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.1.8 04/30/2019
> >>>> maj 27 23:17:02 debian kernel: Call Trace:
> >>>> maj 27 23:17:02 debian kernel: <IRQ>
> >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
> >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
> >>>> maj 27 23:17:02 debian kernel: ? lapic_can_unplug_cpu.cold.29+0x3b/0x3b
> >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
> >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
> >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
> >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
> >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
> >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
> >>>>
> >>>> ---
> >>>>
> >>>> This usually leaves the machine completely unresponsive, and it performs
> >>>> an automated reboot after some time.
> >>>>
> >>>> I've bisected the commits between 4.12 and 4.13, and this appears to be
> >>>> the patch that introduced the bug:
> >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
> >>>>
> >>>> Enclosed you can find the `xl dmesg` log (attachment: dmesg.txt) from a
> >>>> fresh boot of the machine on which the bug was reproduced.
> >>>>
> >>>> I'm also attaching the `xl info` output from this machine:
> >>>>
> >>>> ---
> >>>>
> >>>> release : 4.19.0-6-amd64
> >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
> >>>> machine : x86_64
> >>>> nr_cpus : 14
> >>>> max_cpu_id : 223
> >>>> nr_nodes : 1
> >>>> cores_per_socket : 14
> >>>> threads_per_core : 1
> >>>> cpu_mhz : 2593.930
> >>>> hw_caps : bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
> >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
> >>>> total_memory : 130541
> >>>> free_memory : 63591
> >>>> sharing_freed_memory : 0
> >>>> sharing_used_memory : 0
> >>>> outstanding_claims : 0
> >>>> free_cpus : 0
> >>>> xen_major : 4
> >>>> xen_minor : 13
> >>>> xen_extra : -unstable
> >>>> xen_version : 4.13-unstable
> >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> >>>> xen_scheduler : credit2
> >>>> xen_pagesize : 4096
> >>>> platform_params : virt_start=0xffff800000000000
> >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
> >>>
> >>> Which is your original Xen base? This output is clearly obtained at the
> >>> end of the bisect process.
> >>>
> >>> There have been quite a few fixes since the release of Xen 4.13, so
> >>> please make sure you are running the most recent version (4.13.1).
> >>>
> >>>
> >>> Juergen
> >>
> >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. Unfortunately,
> >> those fixes didn't help and the bug is still reproducible.
> >>
> >> Our testing shows the following:
> >>
> >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
> >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
> >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
> >
> > Would it be possible to test xen unstable, too?
> >
> > I could imagine that e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 has
> > an impact here.
> >
> >
> > Juergen
>
>
> I've tried revision b492c65da5ec5ed, but there seems to be a problem
> with ALTP2M support, so I can't launch anything at all.
>
> maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: -1
> maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view returned rc: -1

Ugh, great, that's another regression in 4.14-unstable. I ran into it
myself but couldn't spend the time to figure out whether it was just
something in my configuration, so I reverted to 4.13.1. Now we
know it's a real bug.

Tamas



 

