Xen project Mailing List

Re: [Xen-devel] MCE/EDAC Status/Updating?

From: Elliott Mitchell <ehem+xen@xxxxxxx>

Date: Fri, 15 Feb 2019 10:20:23 -0800

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Fri, 15 Feb 2019 18:20:35 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Feb 15, 2019 at 03:58:49AM -0700, Jan Beulich wrote:
> >>> On 15.02.19 at 05:23, <ehem+xen@xxxxxxx> wrote:
> > The MCE/EDAC support code appears to be in rather poor shape.
> >
> > The AMD code mentions Family 10h, which might have been available 10
> > years ago.  They've likely been findable used with difficulty more
> > recently, but no hardware made in the past 5 years matches this profile.
>
> Well, Fam10 is mentioned explicitly, but as per the use of e.g.
> mcheck_amd_famXX newer ones are supported by this code
> as well.

In that case, sometime between Xen 4.1 and Xen 4.4 the AMD MCE/EDAC code
was completely broken, and it hasn't been fixed.

> > Given the recent trends in Xen's development I'd tend to suggest going a
> > different direction from the existing code.  The existing code was
> > attempting to handle MCE/EDAC errors by emulating them and passing them
> > to the affected domain.  Instead of this approach, let Domain 0 handle
> > talking to MCE/EDAC hardware and merely have Xen decode addresses.
> >
> > If errors/warnings are occurring, you need those reports centralized,
> > which points to handling them in Domain 0.  If an uncorrectable error
> > occurs, Domain 0 should choose whether to kill a given VM or panic the
> > entire machine.  Either way, Domain 0 really needs to be alerted that
> > hardware is misbehaving and may need to be replaced.
>
> But the point of the virtualization is to allow guests to more or less
> gracefully recover (at least as far as the theory of it goes), e.g. by
> killing just a process, rather than getting blindly killed.
>
> As to panic-ing the entire machine - if that's necessary, Dom0 is
> unlikely to be in the right position.  There's way too high a chance for
> further things to go wrong until the event has even just arrived in
> Dom0, let alone it having taken a decision.

I'll agree it does make sense to try sending a corrupted memory alert to
the affected domain, rather than nuking the entire VM.  Alerting the owner
of the hardware, though, should be higher priority, as they will then know
they need to schedule downtime and replace the module.

> > The other part is alerting Domain 0 is *far* more likely to get the
> > correct type of attention.  A business owning a Domain U on a random
> > machine, may run a kernel without MCE/EDAC support or could miss the
> > entries in their system log, nor would they necessarily know the correct
> > personnel to contact about hardware failing.
>
> Alerting Dom0 alongside the affected DomU may indeed be desirable,
> but mainly for the purpose of logging, only as a last resort for the
> purpose of killing a guest.

I think alerting Dom0 should be rather higher priority than alerting
DomUs.  A given DomU may see one correctable memory error per month, which
might seem harmless until you find there are a hundred DomUs on that
hardware and every one of them is seeing one error per month.  The only
really useful place to report correctable errors like that is Dom0.

Meanwhile, uncorrectable errors are likely better handled by sending a PV
message to the DomU.  Let QEMU turn it into something which looks like
real hardware if needed.

Meanwhile, Dom0 may have a more up-to-date driver for the hardware than
Xen.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
