
[Xen-devel] MCE/EDAC Status/Updating?

  • To: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Elliott Mitchell <ehem+xen@xxxxxxx>
  • Date: Thu, 14 Feb 2019 20:23:34 -0800
  • Delivery-date: Fri, 15 Feb 2019 04:23:54 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

The MCE/EDAC support code appears to be in rather poor shape.

The AMD code mentions Family 10h, which might have been available 10
years ago.  Used parts could likely still be found, with difficulty, more
recently, but no hardware made in the past 5 years matches this profile.

The Intel code has had some more recent minor updates.  Intel may have
managed to keep their hardware supporting the interface used by Xen, and
so the driver /may/ function on current Intel hardware.

It looks like both drivers originated with employees of the respective
companies (I suspect both were paid for by the corporations).

Given the recent trends in Xen's development I'd tend to suggest going in
a different direction from the existing code.  The existing code
attempts to handle MCE/EDAC errors by emulating them and passing them
to the affected domain.  Instead of this approach, let Domain 0 handle
talking to MCE/EDAC hardware and merely have Xen decode addresses.
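
To make "merely have Xen decode addresses" concrete, here is a rough
sketch of the lookup I have in mind: given the machine address from an
error report, find which domain owns it.  Everything here (the range
table, owner_of(), DOMID_NONE) is made up for illustration; real Xen
tracks ownership per page, not in a flat table like this.

```c
#include <stddef.h>
#include <stdint.h>

#define DOMID_NONE 0xFFFFu  /* hypothetical "no owner found" marker */

/* Hypothetical, simplified ownership record for a span of machine memory. */
struct mem_range {
    uint64_t start;   /* first machine address in the range */
    uint64_t end;     /* one past the last machine address */
    uint16_t domid;   /* domain that owns the range */
};

/* Return the owning domain of maddr, or DOMID_NONE if no range matches. */
uint16_t owner_of(const struct mem_range *map, size_t n, uint64_t maddr)
{
    for (size_t i = 0; i < n; i++)
        if (maddr >= map[i].start && maddr < map[i].end)
            return map[i].domid;
    return DOMID_NONE;
}
```

Domain 0 would then receive the raw report plus the decoded owner, and
could take it from there.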

If errors/warnings are occurring, you need those reports centralized,
which points to handling them in Domain 0.  If an uncorrectable error
occurs, Domain 0 should choose whether to kill a given VM or panic the
entire machine.  Either way, Domain 0 really needs to be alerted that
hardware is misbehaving and may need to be replaced.
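
As a rough sketch of the kind of decision Domain 0 would be making (the
struct, the enum and the policy itself are assumptions of mine, not
anything in the existing code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of one error, after the address has been decoded. */
struct mce_report {
    bool     uncorrectable;  /* false: corrected error, only needs logging */
    bool     owner_known;    /* did address decoding find an owning domain? */
    uint16_t owner_domid;    /* meaningful only if owner_known is true */
};

enum mce_action { MCE_LOG_ONLY, MCE_KILL_DOMAIN, MCE_PANIC_HOST };

/* Assumed policy: log corrected errors, kill the owning guest on an
 * uncorrectable one, and panic if the damage cannot be pinned on a guest. */
enum mce_action mce_decide(const struct mce_report *r)
{
    if (!r->uncorrectable)
        return MCE_LOG_ONLY;
    if (!r->owner_known || r->owner_domid == 0)
        return MCE_PANIC_HOST;   /* unknown owner, or Domain 0 itself */
    return MCE_KILL_DOMAIN;
}
```

A real policy would want to be configurable by the administrator, but
the point is that the choice is made in one place, by Domain 0.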

The other part is that alerting Domain 0 is *far* more likely to get the
correct type of attention.  A business owning a Domain U on a random
machine may run a kernel without MCE/EDAC support, or could miss the
entries in their system log, and would not necessarily know the correct
personnel to contact about failing hardware.

The one case where you would want to pass MCE/EDAC messages through is an
HVM domain where you are testing a new system image.  Even then you would
likely be injecting fake exceptions instead of real ones, at which point
we're talking QEMU, rather than Xen.

(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
