Xen project Mailing List

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

To: Christoph Egger <chegger@xxxxxxxxx>

From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>

Date: Mon, 6 May 2013 16:00:57 +0000

Accept-language: en-US

Cc: Jan Beulich <JBeulich@xxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Mon, 06 May 2013 16:02:22 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: AQHOSm7MGsdbS84pvUuls/cWiRHQfJj4T1GA

Thread-topic: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

Christoph Egger wrote: > On 06.05.13 17:14, Liu, Jinsong wrote: >> Egger, Christoph wrote: >>> On 06.05.13 17:00, Liu, Jinsong wrote: >>>> Christoph Egger wrote: >>>>> On 06.05.13 11:50, Liu, Jinsong wrote: >>>>>> Christoph Egger wrote: >>>>>>> On 06.05.13 11:24, Liu, Jinsong wrote: >>>>>>>> Jan Beulich wrote: >>>>>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx> >>>>>>>>>>>> wrote: >>>>>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote: >>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" >>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong" >>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong" >>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep >>>>>>>>>>>>>>>>> 17 00:00:00 2001 From: Liu Jinsong >>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> Date: Sat, 27 Apr 2013 >>>>>>>>>>>>>>>>> 22:37:35 +0800 >>>>>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove >>>>>>>>>>>>>>>>> problematic is_vmce_ready check >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> is_vmce_ready() is problematic: >>>>>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog >>>>>>>>>>>>>>>>> driver. If not, it results dom0 crash. However, it's >>>>>>>>>>>>>>>>> problematic and overkilled since mcelog as a dom0 >>>>>>>>>>>>>>>>> feature could be enabled/disabled per dom0 option: >>>>>>>>>>>>>>>>> (XEN) MCE: This error page is ownded by DOM 0 (XEN) >>>>>>>>>>>>>>>>> DOM0 not ready for vMCE (XEN) domain_crash called from >>>>>>>>>>>>>>>>> mcaction.c:133 (XEN) Domain 0 reported crashed by >>>>>>>>>>>>>>>>> domain 32767 on cpu#31: (XEN) Domain 0 crashed: >>>>>>>>>>>>>>>>> rebooting machine in 5 seconds. (XEN) Resetting with >>>>>>>>>>>>>>>>> ACPI MEMORY or I/O RESET_REG. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> * For dom0, if really need check, it should check >>>>>>>>>>>>>>>>> whether vMCE injection for dom0 ready (say, exception >>>>>>>>>>>>>>>>> trap bounce check, which has been done at >>>>>>>>>>>>>>>>> inject_vmce()), not check dom0 mcelog ready (which has >>>>>>>>>>>>>>>>> been done at mce_softirq() before send global virq to >>>>>>>>>>>>>>>>> dom0). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Following the argumentation above, I wonder which of >>>>>>>>>>>>>>>> the other "goto vmce_failed" are really appropriate, >>>>>>>>>>>>>>>> i.e. whether the patch shouldn't be extended (at least >>>>>>>>>>>>>>>> for the Dom0 case). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> You mean other 'goto vmce_failed' are also not >>>>>>>>>>>>>>> appropriate (I'm not quite clear your point)? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Would you please point out which point you think not >>>>>>>>>>>>>>> appropriate? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I question whether it is correct/necessary to crash the >>>>>>>>>>>>>> domain in any of those failure cases. Perhaps when we >>>>>>>>>>>>>> fail to unmap the page it is, but failure of >>>>>>>>>>>>>> fill_vmsr_data() and inject_vmce() don't appear to be >>>>>>>>>>>>>> valid reasons once the is_vmce_ready() path is being >>>>>>>>>>>>>> dropped. >>>>>>>>>>>>> >>>>>>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP >>>>>>>>>>>>> bit still set when next vMCE# occur, means the 2nd vMCE# >>>>>>>>>>>>> occur when the 1st vMCE# not handled yet. Per SDM it >>>>>>>>>>>>> should shutdown. >>>>>>>>>>>>> >>>>>>>>>>>>> For inject_vmce(), it failed when >>>>>>>>>>>>> 1). vcpu is still mce_pending, or >>>>>>>>>>>>> 2). pv not register trap callback >>>>>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's >>>>>>>>>>>>> ok to kill them), but any graceful way to quit? >>>>>>>>>>>> >>>>>>>>>>>> Just exit and do nothing (except perhaps log a rate >>>>>>>>>>>> limited message)? >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> One concern of quiet exit is, the error will be totally >>>>>>>>>>> ignored by guest --> it >>>>>>>>>> didn't get preperly handled, and may recursively occur to >>>>>>>>>> make worse error --> it's better to kill guest under such >>>>>>>>>> case. >>>>>>>>>>> >>>>>>>>>>>>> or, considering it rarely happens, how about keep current >>>>>>>>>>>>> way (kill guest no matter dom0 or not)? >>>>>>>>>>>> >>>>>>>>>>>> Possibly - I was merely asking why this one condition was >>>>>>>>>>>> found to be too strict, while the others are being left as >>>>>>>>>>>> is. >>>>>>>>>>>> >>>>>>>>>>>> Jan >>>>>>>>>>> >>>>>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's >>>>>>>>>>> problematic (check mcelog driver, not vmce tap callback), >>>>>>>>>>> and overkilled (since defaultly dom0 will not start mcelog >>>>>>>>>>> driver, under which case system will crash whenever vmce >>>>>>>>>>> inject to dom0) >>>>>>>>>>> >>>>>>>>>>> --> So patch 2/2 is not too strict for dom0. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Please keep in mind the mcelog userland/kernel interface is >>>>>>>>>> not designed with xen in mind. mcelog cannot report which >>>>>>>>>> guest is impacted for example, although xen reports that to >>>>>>>>>> dom0. >>>>>>>>>> I object 'fixing' the hypervisor to come over with mcelog >>>>>>>>>> drawbacks. I prefer fixing Dom0 instead. >>>>>>>>>> >>>>>>>> >>>>>>>> Sure, xen mcelog driver in linux is implemented by me :-) >>>>>>>> This patch does not intend to 'fix' hypervisor but just avoid >>>>>>>> overkilled system (when xen mcelog driver in dom0 not loaded as >>>>>>>> default). >>>>>>> >>>>>>> I assume dom0 w/o xen mcelog driver active means dom0 is not >>>>>>> capable to deal with machine check errors. Is this correct? >>>>>>> >>>>>> >>>>>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was >>>>>> affected. Dom0 is still capable of handling vmce as long as it >>>>>> registered trap callback (which is checked at hypervisor >>>>>> inject_vmce()). >>>>> >>>>> Oh, I thought the xen mcelog *driver* registers the trap callback >>>>> and in a Dom0 it additionally registers the logging callback. >>>>> What registers the logging callback and what registers >>>>> the trap callback? >>>>> >>>>> Christoph >>>>> >>>> > [...] >>> >>> Ok, that's how I understand the code, too. But you say above that >>> w/o this driver Dom0 is still capable to handle vmce. That puzzle's >>> me. >>> >>> Christoph >> >> Ah, what I meant is, w/o mcelog dirver, kernel still successfully >> register its mce handler as trap callback to hypervisor, hence the >> injection of vmce from hypervisor to dom0 is OK, and surely got >> properly handled at dom0. mcelog driver is just for error logging, >> getting error info from hypervisor and provide interface for user >> daemon tools 'mcelog'. > > Ah, so the trap handler is *always* installed in a Linux Dom0? Hypervisor didn't have such assumption, it's agnostic to guest. > > NetBSD Dom0/DomU still has no machine check support at all, so in that > case Dom0/DomU is not vmce ready. That means the vmce ready check is > needed. > > Christoph Hypervisor indeed check vmce ready or not at inject_vmce(). Point of patch 2/2 is, is_vmce_ready itself is problematic. Thanks, Jinsong _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.