Re: [ANNOUNCE] Xen 4.15 - call for notification/status of significant bugs

On Thu, 2021-02-04 at 12:12 +0000, Ian Jackson wrote:
> B. "scheduler broken" bugs.
> Information from
>   Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>   Dario Faggioli <dfaggioli@xxxxxxxx>
> Quoting Andrew Cooper
> > We've had 4 or 5 reports of Xen not working, and very little
> > investigation on whats going on.  Suspicion is that there might be
> > two bugs, one with smt=0 on recent AMD hardware, and one more
> > general "some workloads cause negative credit" and might or might
> > not be specific to credit2 (debugging feedback differs - also might
> > be 3 underlying issue).
> I reviewed a thread about this and it is not clear to me where we are
> with this.
Ok, let me try to summarize the current status.

- BUG: credit=sched2 machine hang when using DRAKVUF


  99% sure that it's a Credit2 scheduler issue.
  I'm actively working on it.
  "Seems a tricky one; I'm still in the analysis phase"

  Manifests only with certain combinations of hardware and workload.
  I can't reproduce it myself, but there are multiple reports of it (see
  above). I'm investigating and trying to at least come up with
  debug patches that one of the reporters should be able and willing to

- Null scheduler and vwfi native problem


  An RCU issue, but it manifests due to scheduler behavior (especially
  with the NULL scheduler, and especially on ARM).
  I'm actively working on it.

  Patches that should solve the issue for ARM posted already. They 
  will need to be slightly adjusted to cover x86 as well. Waiting a 
  couple days more for a confirmation from the reporter that the
  patches do help, at least on ARM.

- Xen crash after S3 suspend - Xen 4.13


  S3 suspend issue, but root cause seems to be in the scheduler.

  Marek is, as usual, providing good info and feedback. It comes
  third on my list (below the two above, basically), but I will look
  into it.

- Ryzen 4000 (Mobile) Softlocks/Micro-stutters


  Seems like it could be scheduling-related, but the amount of info is
  limited.

  What we know is that with `dom0_max_vcpus=1 dom0_vcpus_pin`, all 
  schedulers seem to work fine. Without those params, Credit2 is the 
  "least bad", although not satisfactory. Other schedulers don't even 
  Fact is, it is reported to occur on QubesOS, which has its own
  downstream patches, plus there are no logs.
  There's a feeling that this (together with others) hints at SMT off 
  having issues on AMD (Ryzen?), but again, it's not crystal clear to 
  me whether this is the issue (or an issue at all) and, if yes, in 
  what subsystem the problem lies.
  I can try to have a look, mostly for trying to understand whether or 
  not it is really the case that some AMDs have issues with SMT=off.
  But that will probably be after I'm done with the other issues
  I've mentioned above.
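
  For anyone wanting to try the `dom0_max_vcpus=1 dom0_vcpus_pin`
  workaround: these are Xen (hypervisor) command line options, not Linux
  kernel ones. A minimal sketch, assuming a Debian-style GRUB setup where
  /etc/default/grub is read and GRUB_CMDLINE_XEN_DEFAULT is honoured; the
  file location and the way the boot config is regenerated are assumptions
  and differ between distros (QubesOS included):

  ```shell
  # /etc/default/grub (assumed location; adjust for your distro)
  # Options passed to the Xen hypervisor itself, not to the dom0 kernel.
  # Limits dom0 to a single vCPU and pins dom0 vCPUs to physical CPUs.
  GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=1 dom0_vcpus_pin"

  # Optionally select the scheduler explicitly while testing, e.g.:
  # GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=1 dom0_vcpus_pin sched=credit2"

  # Afterwards regenerate the GRUB config (update-grub on Debian-like
  # systems) and reboot.
  ```

  After rebooting, `xl info` should show the options in its
  xen_commandline field, which is a quick way to confirm they were
  actually applied.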

- Recent upgrade of 4.13 -> 4.14 issue


  In my judgment, it's not at all clear whether or not this is a
  scheduler issue. And at least with the amount of info that we have
  so far, I'd lean toward "no, it's not". I'm happy to help with it
  anyway, of course, but it comes after the others.

So, Ian, was this of any help?

If not, help me understand how I can help you. :-P

Thanks and Regards
Dario Faggioli, Ph.D
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

