
Re: [Xen-devel] MCE/EDAC Status/Updating?



On Fri, Feb 15, 2019 at 03:58:49AM -0700, Jan Beulich wrote:
> >>> On 15.02.19 at 05:23, <ehem+xen@xxxxxxx> wrote:
> > The MCE/EDAC support code appears to be in rather poor shape.
> > 
> > The AMD code mentions Family 10h, which might have been available 10
> > years ago.  Used parts could likely still be found, with difficulty, more
> > recently, but no hardware made in the past 5 years matches this profile.
> 
> Well, Fam10 is mentioned explicitly, but as per the use of e.g.
> mcheck_amd_famXX newer ones are supported by this code
> as well.

In that case, sometime between Xen 4.1 and Xen 4.4 the AMD MCE/EDAC code
broke completely, and it hasn't been fixed since.



> > Given the recent trends in Xen's development I'd tend to suggest going a
> > different direction from the existing code.  The existing code was
> > attempting to handle MCE/EDAC errors by emulating them and passing them
> > to the affected domain.  Instead of this approach, let Domain 0 handle
> > talking to MCE/EDAC hardware and merely have Xen decode addresses.
> > 
> > If errors/warnings are occurring, you need those reports centralized,
> > which points to handling them in Domain 0.  If an uncorrectable error
> > occurs, Domain 0 should choose whether to kill a given VM or panic the
> > entire machine.  Either way, Domain 0 really needs to be alerted that
> > hardware is misbehaving and may need to be replaced.
> 
> But the point of the virtualization is to allow guests to more or less
> gracefully recover (at least as far as the theory of it goes), e.g. by
> killing just a process, rather than getting blindly killed.
> 
> As to panic-ing the entire machine - if that's necessary, Dom0 is
> unlikely to be in the right position. There's way too high a chance for
> further things to go wrong until the event has even just arrived in
> Dom0, let alone it having taken a decision.

I'll agree it does make sense to try sending a corrupted memory alert to
the affected domain, rather than nuking the entire VM.  Alerting the
owner of the hardware, though, should be higher priority, as they will then
know they need to schedule downtime and replace the module.


> > The other part is that alerting Domain 0 is *far* more likely to get the
> > correct type of attention.  A business owning a Domain U on a random
> > machine may run a kernel without MCE/EDAC support or could miss the
> > entries in their system log, and would not necessarily know the correct
> > personnel to contact about failing hardware.
> 
> Alerting Dom0 alongside the affected DomU may indeed be desirable,
> but mainly for the purpose of logging, only as a last resort for the
> purpose of killing a guest.

I think alerting Dom0 should be rather higher priority than alerting
DomUs.  A given DomU may see one correctable memory error per month,
which might seem harmless until you find there are a hundred DomUs on
that hardware and every one of them is seeing one error per month.
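
To put rough numbers on that, here is a minimal sketch of the arithmetic.
The function name and the figures are purely illustrative, not anything
in Xen:

```python
def host_error_rate(per_domu_errors_per_month, domu_count):
    """Correctable errors per month seen by the host as a whole,
    assuming every DomU sees the same per-guest rate."""
    return per_domu_errors_per_month * domu_count

# One error per month per DomU looks harmless from inside any one guest...
per_guest = 1
# ...but with a hundred guests the host sees a hundred per month.
whole_host = host_error_rate(per_guest, 100)
# Assuming a 30-day month, that is more than three errors per day.
per_day = whole_host / 30
```

Only something with a view of the whole machine can see the second number.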

The only really useful place to report correctable errors like that is
Dom0.  Uncorrectable errors, meanwhile, are likely better handled by
sending a PV message to the DomU.  Let QEMU turn it into something which
looks like real hardware if needed.  Dom0 may also have a more up-to-date
driver for the hardware than Xen.
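
As a rough sketch of the routing I have in mind (all names here are
hypothetical, made up for illustration; this is not an existing Xen
interface): every error is reported to Dom0, and an uncorrectable error
additionally produces a PV notification to the DomU which owns the
memory.

```python
from enum import Enum

class Severity(Enum):
    CORRECTABLE = "correctable"
    UNCORRECTABLE = "uncorrectable"

def route_error(severity, owner_domid):
    """Return a list of (target, action) pairs for one memory error.

    Hypothetical policy: Dom0 always gets a log entry; the owning
    DomU gets a PV notification only for uncorrectable errors.
    """
    targets = [("dom0", "log")]
    if severity is Severity.UNCORRECTABLE and owner_domid != 0:
        targets.append((f"dom{owner_domid}", "pv_notify"))
    return targets

# A correctable error in memory owned by domain 5 goes to Dom0 only.
correctable = route_error(Severity.CORRECTABLE, 5)
# An uncorrectable one goes to Dom0 and also to domain 5.
uncorrectable = route_error(Severity.UNCORRECTABLE, 5)
```

Whether Dom0 or Xen itself should make the kill-or-panic decision is the
part still under discussion above; the sketch only covers who gets told.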


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




 

