
Re: HVM/PVH Balloon crash



On Mon, Sep 06, 2021 at 09:52:17AM +0200, Jan Beulich wrote:
> On 06.09.2021 00:10, Elliott Mitchell wrote:
> > I brought this up a while back, but it still appears to be present and
> > the latest observations appear rather serious.
> > 
> > I'm unsure of the entire set of conditions for reproduction.
> > 
> > Domain 0 on this machine is PV (I think the BIOS enables the IOMMU, but
> > this is an older AMD IOMMU).
> > 
> > This has been confirmed with Xen 4.11 and Xen 4.14.  This includes
> > Debian's patches, but those are mostly backports or environment
> > adjustments.
> > 
> > Domain 0 is presently using a 4.19 kernel.
> > 
> > The trigger is creating a HVM or PVH domain where memory does not equal
> > maxmem.
> 
> I take it you refer to "[PATCH] x86/pod: Do not fragment PoD memory
> allocations" submitted very early this year? There you said the issue
> was with a guest's maxmem exceeding host memory size. Here you seem to
> be talking of PoD in its normal form of use. Personally I use this
> all the time (unless enabling PCI pass-through for a guest, for being
> incompatible). I've not observed any badness as severe as you've
> described.
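
For reference, the trigger described above (memory not equal to maxmem,
which is what puts PoD into use for an HVM or PVH guest) corresponds to a
guest configuration roughly like the one below. This is only a sketch and
not the configuration from the affected machine: the domain name, the
sizes, and the kernel path are placeholders chosen for illustration.

```
# Hypothetical xl guest configuration; name, sizes and path are illustrative.
name   = "pod-test"
type   = "pvh"                     # the report says "hvm" shows the same behaviour
memory = 1024                      # MiB given to the guest at start
maxmem = 2048                      # MiB; memory < maxmem is the reported trigger
vcpus  = 1
kernel = "/path/to/pvh-kernel"     # placeholder, e.g. a PVH GRUB image
```

Setting memory equal to maxmem in such a file is the mitigation mentioned
further down (no ballooning for HVM/PVH).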

I've got very little idea what is occurring as I'm expecting to be doing
ARM debugging, not x86 debugging.

I was starting to wonder whether this was widespread or not.  As such I
was reporting the factors which might be different in my environment.

The one which sticks out is the computer has an older AMD processor (are
you a 100% Intel shop?).  The processor has the AMD NPT feature, but a very
early/limited IOMMU (according to Linux "AMD IOMMUv2 functionality not
available").

Xen 4.14 refused to load the Domain 0 kernel as PVH (not enough of an
IOMMU).


There is also the possibility Debian added a bad patch, but that seems
improbable as there aren't enough bug reports.


> > New observations:
> > 
> > I discovered this occurs with PVH domains in addition to HVM ones.
> > 
> > I got PVH GRUB operational.  PVH GRUB appeared to operate normally
> > and not trigger the crash/panic.
> > 
> > The crash/panic occurred some number of seconds after the Linux kernel
> > was loaded.
> > 
> > 
> > Mitigation by not using ballooning with HVM/PVH is workable, but this is
> > quite a large mine in the configuration.
> > 
> > I'm wondering if perhaps it is actually the Linux kernel in Domain 0
> > which is panicking.
> > 
> > The crash/panic occurring AFTER the main kernel loads suggests some
> > action the user domain is doing is the actual trigger of the
> > crash/panic.
> 
> All of this is pretty vague: If you don't even know what component it
> is that crashes / panics, I don't suppose you have any logs. Yet what
> do you expect us to do without any technical detail?

Initially this had looked so spectacular as to be easy to reproduce.

No logs, I wasn't expecting to be doing hardware-level debugging on x86.
I've got several USB to TTL-serial cables (ARM/MIPS debug), I may need to
hunt down a USB to full voltage EIA-232C cable.
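
Once a suitable cable is available, hypervisor output can be sent to a
serial port from the Xen command line. The fragment below is a sketch for
a Debian-style /etc/default/grub. The port (com1), the baud rate, and the
machine having a usable legacy serial port at all are assumptions, not
details known from this report.

```
# /etc/default/grub fragment (assumed Debian layout); run update-grub afterwards.
# com1 and 115200,8n1 are guesses; adjust to the actual port and speed.
GRUB_CMDLINE_XEN_DEFAULT="com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all noreboot"
# The Domain 0 kernel would additionally need console=hvc0 on its own
# command line for its messages to reach the Xen console.
```

Here noreboot keeps the machine from restarting after a hypervisor crash,
so the final messages remain readable on the serial line.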


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

