
Re: Recent upgrade of 4.13 -> 4.14 issue



Hi Jan,
    Response inline...

Liwei

On Wed, 16 Dec 2020 at 16:12, Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 15.12.2020 20:08, Liwei wrote:
> > Hi list,
> >     This is a reply to the thread of the same title (linked here:
> > https://www.mail-archive.com/xen-devel@xxxxxxxxxxxxxxxxxxxx/msg84916.html
> > ) which I could not reply to because I receive this list by digest.
> >
> >     I'm unclear if this is exactly the reason, but I experienced the
> > same symptoms when upgrading to 4.14. The issue does not occur if I
> > downgrade to 4.11 (the previous version that was provided by Debian).
> > Kernel is 5.9.11 and unchanged between xen versions.
> >
> >     One thing I noticed is that if I disable the monitor/mwait
> > instructions on my CPU (Intel Xeon E5-2699 v4 ES), the stalls seem to
> > occur later into the boot. With the instructions enabled, the system
> > usually stalls less than a few minutes after boot; disabled, it can
> > last for tens of minutes.
> >
> >     Further disabling the HPET or forcing the kernel to use PIT causes
> > it to be somewhat usable. The stalls still occur tens of minutes in
> > but somehow everything seems to continue chugging along fine?
>
> By "the kernel" do you really mean the kernel, or Xen?

Sorry, I meant Xen. I'm too used to forgetting that Xen is there when
I'm talking about dom0.

>
> >     I've also verified that the stalls do not occur in all the above
> > cases if I just boot into the kernel without xen.
> >
> >     When the stalls happen, I get the "rcu: INFO: rcu_sched detected
> > stalls on CPUs/tasks" backtraces printed on the console periodically,
> > but keystrokes don't do anything on the console, and I can't spawn new
> > SSH sessions even though pinging the system produces a reply. The last
> > item in the call trace is usually "xen_safe_halt", but I've seen it
> > occur for other functions related to btrfs and the network adapter as
> > well.
>
> The kernel log may not be the only relevant thing here - the hypervisor
> log may also need looking at (with full verbosity enabled and
> preferably a debug build in use).

I will build a debug version and get back to you on that. Should I just
set loglvl and guest_loglvl to all, enable console_to_ring, and capture
the entire serial output? I recall that you also wanted to see the
output of the I, q and r debug keys.
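
A rough sketch of what I have in mind, assuming Debian's GRUB packaging
(the GRUB_CMDLINE_XEN_DEFAULT variable in /etc/default/grub) and a
serial port on com1; the baud rate and port are placeholders for
illustration, not my actual settings:

```shell
# /etc/default/grub (assumed Debian layout; run update-grub afterwards)
# loglvl=all / guest_loglvl=all : full verbosity for Xen and guest messages
# console_to_ring               : also copy guest console output into Xen's ring
# com1=... console=com1,vga     : placeholder serial settings, adjust to the box
GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all console_to_ring com1=115200,8n1 console=com1,vga"
```

If dom0 is still responsive at that point, the debug-key dumps could
presumably also be triggered with "xl debug-keys" and read back with
"xl dmesg" instead of over serial.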

>
> >     Do let me know if there's anything I can provide to help
> > troubleshoot this. At the moment I've reverted to 4.11, but I can
> > temporarily switch over to 4.14 to collect any necessary information.
>
> In that earlier thread a number of things to try were suggested, iirc
> (switching scheduler or disabling use of deep C states come to mind).
> Did you experiment with those? If so, can you let us know of the
> results, so we can see whether there's a pattern?

1. Switching to credit didn't seem to make any difference in my case.
2. I have tried cpuidle=off and max_cstate=1, and that actually gives
the same result as having mwait/monitor and HPET turned off (even when
I leave mwait and HPET on in the BIOS).
3. I could not try dom0=PVH, as my system reboots after (or while?)
the kernel is loaded when I do that.
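
For clarity, the three experiments above correspond roughly to the
following Xen command lines. This is a sketch from memory of the Xen
command-line documentation, again assuming Debian's
GRUB_CMDLINE_XEN_DEFAULT variable; only one line would be active at a
time:

```shell
# 1. Use the credit scheduler instead of the default (credit2 in 4.14)
GRUB_CMDLINE_XEN_DEFAULT="sched=credit"

# 2. Disable Xen's cpuidle handling and cap C states at C1
# GRUB_CMDLINE_XEN_DEFAULT="cpuidle=off max_cstate=1"

# 3. Boot dom0 as PVH (this is the variant that reboots for me)
# GRUB_CMDLINE_XEN_DEFAULT="dom0=pvh"
```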

I did realise, after running with the cpuidle=off and max_cstate=1
combination for a day, that the system is actually limping. Most of the
visible issues seem to stem from storage hanging or responding very
slowly, though that might be due to the btrfs tasks hanging in the
background.

>
> Jan
