
Re: [Xen-devel] MCE/EDAC Status/Updating?

>>> On 15.02.19 at 05:23, <ehem+xen@xxxxxxx> wrote:
> The MCE/EDAC support code appears to be in rather poor shape.
> The AMD code mentions Family 10h, which might have been available 10
> years ago.  Used parts have likely been findable, with difficulty, more
> recently, but no hardware made in the past 5 years matches this profile.

Well, Fam10 is mentioned explicitly, but as the use of e.g.
mcheck_amd_famXX shows, newer families are supported by this code
as well.

> The Intel code has had some more recent minor updates.  Intel may have
> managed to keep their hardware supporting the interface used by Xen, and
> so the driver /may/ function on current Intel hardware.
> Looks like both drivers originated with employees of the respective
> companies (I'm suspecting both were paid for by the corporations).
> Given the recent trends in Xen's development I'd tend to suggest going a
> different direction from the existing code.  The existing code was
> attempting to handle MCE/EDAC errors by emulating them and passing them
> to the affected domain.  Instead of this approach, let Domain 0 handle
> talking to MCE/EDAC hardware and merely have Xen decode addresses.
> If errors/warnings are occurring, you need those reports centralized,
> which points to handling them in Domain 0.  If an uncorrectable error
> occurs, Domain 0 should choose whether to kill a given VM or panic the
> entire machine.  Either way, Domain 0 really needs to be alerted that
> hardware is misbehaving and may need to be replaced.

But the point of virtualizing these events is to allow guests to recover
more or less gracefully (at least in theory), e.g. by killing just a
process, rather than getting killed blindly.

As to panicking the entire machine - if that's necessary, Dom0 is
unlikely to be in the right position to do so. There's way too high a
chance of further things going wrong before the event has even arrived
in Dom0, let alone before Dom0 has taken a decision.

> The other part is that alerting Domain 0 is *far* more likely to get the
> correct type of attention.  A business owning a Domain U on a random
> machine may run a kernel without MCE/EDAC support or could miss the
> entries in their system log, and would not necessarily know the correct
> personnel to contact about failing hardware.

Alerting Dom0 alongside the affected DomU may indeed be desirable,
but mainly for the purpose of logging, and only as a last resort for
the purpose of killing a guest.
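
As a rough illustration of that ordering, here is a small Python sketch.
It is not Xen code: the names MachineCheck, Action, decide and the
guest_can_recover flag are all hypothetical, and the real decision
depends on details of the error that are left out here. It only shows
the preference order argued for above: log to Dom0, let the guest try to
recover, kill the guest only as a last resort, and keep a host-wide
panic decision in the hypervisor rather than waiting for Dom0.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class Action(Enum):
    LOG_TO_DOM0 = auto()        # telemetry only, so the host admin sees it
    INJECT_INTO_GUEST = auto()  # virtual MCE: guest may kill just a process
    KILL_GUEST = auto()         # last resort for a guest that cannot cope
    PANIC_HOST = auto()         # decided in the hypervisor, not in Dom0


@dataclass
class MachineCheck:
    # Hypothetical event record, not a Xen structure.
    domid: Optional[int]      # owning guest, or None if no single guest owns it
    uncorrectable: bool       # False for corrected errors
    guest_can_recover: bool   # assumed flag: guest has a usable MCE handler


def decide(event: MachineCheck) -> List[Action]:
    """Return the actions to take, in order, for one event."""
    if event.uncorrectable and event.domid is None:
        # Nothing to contain the error to; do not round-trip through Dom0.
        return [Action.PANIC_HOST]

    actions = [Action.LOG_TO_DOM0]
    if not event.uncorrectable:
        return actions  # corrected error: logging is all that is needed

    if event.guest_can_recover:
        actions.append(Action.INJECT_INTO_GUEST)
    else:
        actions.append(Action.KILL_GUEST)
    return actions
```

With this sketch, an uncorrectable error in a guest that can handle it
yields LOG_TO_DOM0 followed by INJECT_INTO_GUEST, and KILL_GUEST only
appears when the guest has no way to recover on its own.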

