[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check


  • To: Christoph Egger <chegger@xxxxxxxxx>
  • From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
  • Date: Mon, 6 May 2013 16:00:57 +0000
  • Accept-language: en-US
  • Cc: Jan Beulich <JBeulich@xxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Mon, 06 May 2013 16:02:22 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>
  • Thread-index: AQHOSm7MGsdbS84pvUuls/cWiRHQfJj4T1GA
  • Thread-topic: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

Christoph Egger wrote:
> On 06.05.13 17:14, Liu, Jinsong wrote:
>> Egger, Christoph wrote:
>>> On 06.05.13 17:00, Liu, Jinsong wrote:
>>>> Christoph Egger wrote:
>>>>> On 06.05.13 11:50, Liu, Jinsong wrote:
>>>>>> Christoph Egger wrote:
>>>>>>> On 06.05.13 11:24, Liu, Jinsong wrote:
>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx>
>>>>>>>>>>>> wrote:
>>>>>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong"
>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong"
>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong"
>>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep
>>>>>>>>>>>>>>>>> 17 00:00:00 2001 From: Liu Jinsong
>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> Date: Sat, 27 Apr 2013
>>>>>>>>>>>>>>>>> 22:37:35 +0800 
>>>>>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove
>>>>>>>>>>>>>>>>> problematic is_vmce_ready check
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog
>>>>>>>>>>>>>>>>> driver. If not, it results dom0 crash. However, it's
>>>>>>>>>>>>>>>>> problematic and overkilled since mcelog as a dom0
>>>>>>>>>>>>>>>>> feature could be enabled/disabled per dom0 option:
>>>>>>>>>>>>>>>>> (XEN) MCE: This error page is ownded by DOM 0 (XEN)
>>>>>>>>>>>>>>>>> DOM0 not ready for vMCE (XEN) domain_crash called from
>>>>>>>>>>>>>>>>> mcaction.c:133 (XEN) Domain 0 reported crashed by
>>>>>>>>>>>>>>>>> domain 32767 on cpu#31: (XEN) Domain 0 crashed:
>>>>>>>>>>>>>>>>> rebooting machine in 5 seconds. (XEN) Resetting with
>>>>>>>>>>>>>>>>> ACPI MEMORY or I/O RESET_REG. 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> * For dom0, if really need check, it should check
>>>>>>>>>>>>>>>>> whether vMCE injection for dom0 ready (say, exception
>>>>>>>>>>>>>>>>> trap bounce check, which has been done at
>>>>>>>>>>>>>>>>> inject_vmce()), not check dom0 mcelog ready (which has
>>>>>>>>>>>>>>>>> been done at mce_softirq() before send global virq to
>>>>>>>>>>>>>>>>> dom0).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Following the argumentation above, I wonder which of
>>>>>>>>>>>>>>>> the other "goto vmce_failed" are really appropriate,
>>>>>>>>>>>>>>>> i.e. whether the patch shouldn't be extended (at least
>>>>>>>>>>>>>>>> for the Dom0 case).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You mean other 'goto vmce_failed' are also not
>>>>>>>>>>>>>>> appropriate (I'm not quite clear your point)?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Would you please point out which point you think not
>>>>>>>>>>>>>>> appropriate?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I question whether it is correct/necessary to crash the
>>>>>>>>>>>>>> domain in any of those failure cases. Perhaps when we
>>>>>>>>>>>>>> fail to unmap the page it is, but failure of
>>>>>>>>>>>>>> fill_vmsr_data() and inject_vmce() don't appear to be
>>>>>>>>>>>>>> valid reasons once the is_vmce_ready() path is being
>>>>>>>>>>>>>> dropped. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP
>>>>>>>>>>>>> bit still set when next vMCE# occur, means the 2nd vMCE#
>>>>>>>>>>>>> occur when the 1st vMCE# not handled yet. Per SDM it
>>>>>>>>>>>>> should shutdown. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For inject_vmce(), it failed when
>>>>>>>>>>>>> 1). vcpu is still mce_pending, or
>>>>>>>>>>>>> 2). pv not register trap callback
>>>>>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's
>>>>>>>>>>>>> ok to kill them), but any graceful way to quit?
>>>>>>>>>>>> 
>>>>>>>>>>>> Just exit and do nothing (except perhaps log a rate
>>>>>>>>>>>> limited message)? 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> One concern of quiet exit is, the error will be totally
>>>>>>>>>>> ignored by guest --> it
>>>>>>>>>> didn't get preperly handled, and may recursively occur to
>>>>>>>>>> make worse error --> it's better to kill guest under such
>>>>>>>>>> case. 
>>>>>>>>>>> 
>>>>>>>>>>>>> or, considering it rarely happens, how about keep current
>>>>>>>>>>>>> way (kill guest no matter dom0 or not)?
>>>>>>>>>>>> 
>>>>>>>>>>>> Possibly - I was merely asking why this one condition was
>>>>>>>>>>>> found to be too strict, while the others are being left as
>>>>>>>>>>>> is. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Jan
>>>>>>>>>>> 
>>>>>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>>>>>>>>> problematic (check mcelog driver, not vmce tap callback),
>>>>>>>>>>> and overkilled (since defaultly dom0 will not start mcelog
>>>>>>>>>>> driver, under which case system will crash whenever vmce
>>>>>>>>>>> inject to dom0) 
>>>>>>>>>>> 
>>>>>>>>>>> --> So patch 2/2 is not too strict for dom0.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Please keep in mind the mcelog userland/kernel interface is
>>>>>>>>>> not designed with xen in mind. mcelog cannot report which
>>>>>>>>>> guest is impacted for example, although xen reports that to
>>>>>>>>>> dom0. 
>>>>>>>>>> I object 'fixing' the hypervisor to come over with mcelog
>>>>>>>>>> drawbacks. I prefer fixing Dom0 instead.
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sure, xen mcelog driver in linux is implemented by me :-)
>>>>>>>> This patch does not intend to 'fix' hypervisor but just avoid
>>>>>>>> overkilled system (when xen mcelog driver in dom0 not loaded as
>>>>>>>> default).
>>>>>>> 
>>>>>>> I assume dom0 w/o xen mcelog driver active means dom0 is not
>>>>>>> capable to deal with machine check errors. Is this correct?
>>>>>>> 
>>>>>> 
>>>>>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was
>>>>>> affected. Dom0 is still capable of handling vmce as long as it
>>>>>> registered trap callback (which is checked at hypervisor
>>>>>> inject_vmce()).
>>>>> 
>>>>> Oh, I thought the xen mcelog *driver* registers the trap callback
>>>>> and in a Dom0 it additionally registers the logging callback.
>>>>> What registers the logging callback and what registers
>>>>> the trap callback?
>>>>> 
>>>>> Christoph
>>>>> 
>>>> 
> [...]
>>> 
>>> Ok, that's how I understand the code, too. But you say above that
>>> w/o this driver Dom0 is still capable to handle vmce. That puzzle's
>>> me. 
>>> 
>>> Christoph
>> 
>> Ah, what I meant is, w/o mcelog dirver, kernel still successfully
>> register its mce handler as trap callback to hypervisor, hence the
>> injection of vmce from hypervisor to dom0 is OK, and surely got
>> properly handled at dom0. mcelog driver is just for error logging,
>> getting error info from hypervisor and provide interface for user
>> daemon tools 'mcelog'. 
> 
> Ah, so the trap handler is *always* installed in a Linux Dom0?

Hypervisor didn't have such assumption, it's agnostic to guest.

> 
> NetBSD Dom0/DomU still has no machine check support at all, so in that
> case Dom0/DomU is not vmce ready. That means the vmce ready check is
> needed.
> 
> Christoph

Hypervisor indeed check vmce ready or not at inject_vmce().
Point of patch 2/2 is, is_vmce_ready itself is problematic.

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.