
Re: [BUG] Core scheduling patches causing deadlock in some situations



On Fri, May 29, 2020 at 8:48 AM Tamas K Lengyel
<tamas.k.lengyel@xxxxxxxxx> wrote:
>
> On Fri, May 29, 2020 at 7:51 AM Michał Leszczyński
> <michal.leszczynski@xxxxxxx> wrote:
> >
> > ----- On 29 May 2020 at 15:15, Jürgen Groß jgross@xxxxxxxx wrote:
> >
> > > On 29.05.20 14:51, Michał Leszczyński wrote:
> > >> ----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:
> > >>
> > >>> On 29.05.20 14:30, Michał Leszczyński wrote:
> > >>>> Hello,
> > >>>>
> > >>>> I'm running DRAKVUF on a Dell Inc. PowerEdge R640/08HT8T server with an
> > >>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz.
> > >>>> When upgrading from Xen RELEASE 4.12 to 4.13, we noticed some stability
> > >>>> problems involving freezes of Dom0 (Debian Buster):
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected 
> > >>>> stall on CPU
> > >>>> maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP)
> > >>>> idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
> > >>>> maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
> > >>>> maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
> > >>>> maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 
> > >>>> Tainted: P OE
> > >>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
> > >>>> maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge 
> > >>>> R640/08HT8T,
> > >>>> BIOS 2.1.8 04/30/2019
> > >>>> maj 27 23:17:02 debian kernel: Call Trace:
> > >>>> maj 27 23:17:02 debian kernel: <IRQ>
> > >>>> maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
> > >>>> maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
> > >>>> maj 27 23:17:02 debian kernel: ? lapic_can_unplug_cpu.cold.29+0x3b/0x3b
> > >>>> maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
> > >>>> maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
> > >>>> maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
> > >>>> maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
> > >>>> maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
> > >>>> maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> This usually results in the machine becoming completely unresponsive and
> > >>>> performing an automated reboot after some time.
> > >>>>
> > >>>> I've bisected the commits between 4.12 and 4.13 and it seems this is the
> > >>>> patch that introduced the bug:
> > >>>> https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
> > >>>>
> > >>>> Enclosed you can find the `xl dmesg` log (attachment: dmesg.txt) from a
> > >>>> fresh boot of the machine on which the bug was reproduced.
> > >>>>
> > >>>> I'm also attaching the `xl info` output from this machine:
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> release : 4.19.0-6-amd64
> > >>>> version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
> > >>>> machine : x86_64
> > >>>> nr_cpus : 14
> > >>>> max_cpu_id : 223
> > >>>> nr_nodes : 1
> > >>>> cores_per_socket : 14
> > >>>> threads_per_core : 1
> > >>>> cpu_mhz : 2593.930
> > >>>> hw_caps :
> > >>>> bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
> > >>>> virt_caps : pv hvm hvm_directio pv_directio hap shadow
> > >>>> total_memory : 130541
> > >>>> free_memory : 63591
> > >>>> sharing_freed_memory : 0
> > >>>> sharing_used_memory : 0
> > >>>> outstanding_claims : 0
> > >>>> free_cpus : 0
> > >>>> xen_major : 4
> > >>>> xen_minor : 13
> > >>>> xen_extra : -unstable
> > >>>> xen_version : 4.13-unstable
> > >>>> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 
> > >>>> hvm-3.0-x86_32p
> > >>>> hvm-3.0-x86_64
> > >>>> xen_scheduler : credit2
> > >>>> xen_pagesize : 4096
> > >>>> platform_params : virt_start=0xffff800000000000
> > >>>> xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty
> > >>>
> > >>> What is your original Xen base? This output was clearly obtained at the
> > >>> end of the bisect process.
> > >>>
> > >>> There have been quite a few corrections since the release of Xen 4.13, so
> > >>> please make sure you are running the most recent version (4.13.1).
> > >>>
> > >>>
> > >>> Juergen
> > >>
> > >> Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. Unfortunately 
> > >> these
> > >> corrections didn't help and the bug is still reproducible.
> > >>
> > >> From our testing it turns out that:
> > >>
> > >> Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
> > >> Broken revision: 6278553325a9f76d37811923221b21db3882e017
> > >> First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21
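For reference, the bisect workflow that produces such a result can be sketched as follows. This is a self-contained toy run on a throwaway repository; the real run would use the Xen tree, e.g. `git bisect start 6278553325a9 997d6248a9ae`, with a test script that boots each build and checks for the stall.

```shell
# Sketch of the "git bisect" workflow described above, demonstrated on a
# throwaway repository so it can run anywhere.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "rev $i" > sched.c
    # the simulated bug enters the tree at commit 7
    if [ "$i" -ge 7 ]; then echo "deadlock" >> sched.c; fi
    git add sched.c
    git commit -qm "commit $i"
done
# bad tip first, then a known-good ancestor
git bisect start HEAD HEAD~9 >/dev/null
# "git bisect run" drives the search; a non-zero exit marks a revision bad
result=$(git bisect run sh -c '! grep -q deadlock sched.c')
echo "$result" | grep "is the first bad commit"
```

With a real hypervisor bug, the script handed to `git bisect run` would build Xen, boot it (or a test VM), and exit non-zero when the stall reproduces.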
> > >
> > > Would it be possible to test xen unstable, too?
> > >
> > > I could imagine e.g. commit b492c65da5ec5ed or 99266e31832fb4a4 to have
> > > an impact here.
> > >
> > >
> > > Juergen
> >
> >
> > I've tried revision b492c65da5ec5ed, but it seems there is some problem
> > with ALTP2M support, so I can't launch anything at all.
> >
> > maj 29 15:45:32 debian drakrun[1223]: Failed to set HVM_PARAM_ALTP2M, RC: -1
> > maj 29 15:45:32 debian drakrun[1223]: VMI_ERROR: xc_altp2m_switch_to_view 
> > returned rc: -1
>
> Ugh, great, that's another regression in 4.14-unstable. I ran into it
> myself but couldn't spend time to figure out whether it's just
> something in my configuration or not, so I reverted to 4.13.1. Now we
> know it's a real bug.

This was a bug in libxl. I've sent in a patch that fixes it, and in the
meantime you can grab it from https://github.com/tklengyel/xen/tree/libxl_fix.
There is also an update to DRAKVUF that will need to be applied, since the
recent altp2m visibility option now has to be specified; you can grab that
from https://github.com/tklengyel/drakvuf/tree/4.14.
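For anyone else hitting the HVM_PARAM_ALTP2M error above: external altp2m control of the kind DRAKVUF performs also has to be enabled on the guest side. A sketch of the relevant xl domain configuration fragment follows (option name and modes as documented in recent xl.cfg(5) — verify against your Xen version):

```
# xl.cfg fragment (sketch): allow an external tool such as DRAKVUF to
# drive altp2m views for this HVM guest; see xl.cfg(5) for all modes
type = "hvm"
altp2m = "external"
```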

Tamas



 

