[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check



On 06.05.13 17:14, Liu, Jinsong wrote:
> Egger, Christoph wrote:
>> On 06.05.13 17:00, Liu, Jinsong wrote:
>>> Christoph Egger wrote:
>>>> On 06.05.13 11:50, Liu, Jinsong wrote:
>>>>> Christoph Egger wrote:
>>>>>> On 06.05.13 11:24, Liu, Jinsong wrote:
>>>>>>> Jan Beulich wrote:
>>>>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx>
>>>>>>>>>>> wrote:
>>>>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong"
>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong"
>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong"
>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove
>>>>>>>>>>>>>>>> problematic is_vmce_ready check
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog
>>>>>>>>>>>>>>>> driver. If not, it results dom0 crash. However, it's
>>>>>>>>>>>>>>>> problematic and overkilled since mcelog as a dom0
>>>>>>>>>>>>>>>> feature could be enabled/disabled per dom0 option:
>>>>>>>>>>>>>>>> (XEN) MCE: This error page is ownded by DOM 0 (XEN)
>>>>>>>>>>>>>>>> DOM0 not ready for vMCE (XEN) domain_crash called from
>>>>>>>>>>>>>>>> mcaction.c:133 (XEN) Domain 0 reported crashed by
>>>>>>>>>>>>>>>> domain 32767 on cpu#31: (XEN) Domain 0 crashed:
>>>>>>>>>>>>>>>> rebooting machine in 5 seconds. (XEN) Resetting with
>>>>>>>>>>>>>>>> ACPI MEMORY or I/O RESET_REG. 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * For dom0, if really need check, it should check
>>>>>>>>>>>>>>>> whether vMCE injection for dom0 ready (say, exception
>>>>>>>>>>>>>>>> trap bounce check, which has been done at
>>>>>>>>>>>>>>>> inject_vmce()), not check dom0 mcelog ready (which has
>>>>>>>>>>>>>>>> been done at mce_softirq() before send global virq to
>>>>>>>>>>>>>>>> dom0). 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Following the argumentation above, I wonder which of the
>>>>>>>>>>>>>>> other "goto vmce_failed" are really appropriate, i.e.
>>>>>>>>>>>>>>> whether the patch shouldn't be extended (at least for the
>>>>>>>>>>>>>>> Dom0 case).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You mean other 'goto vmce_failed' are also not appropriate
>>>>>>>>>>>>>> (I'm not quite clear your point)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Would you please point out which point you think not
>>>>>>>>>>>>>> appropriate?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I question whether it is correct/necessary to crash the
>>>>>>>>>>>>> domain in any of those failure cases. Perhaps when we fail
>>>>>>>>>>>>> to unmap the page it is, but failure of fill_vmsr_data()
>>>>>>>>>>>>> and inject_vmce() don't appear to be valid reasons once the
>>>>>>>>>>>>> is_vmce_ready() path is being dropped.
>>>>>>>>>>>>
>>>>>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP
>>>>>>>>>>>> bit still set when next vMCE# occur, means the 2nd vMCE#
>>>>>>>>>>>> occur when the 1st vMCE# not handled yet. Per SDM it should
>>>>>>>>>>>> shutdown. 
>>>>>>>>>>>>
>>>>>>>>>>>> For inject_vmce(), it failed when
>>>>>>>>>>>> 1). vcpu is still mce_pending, or
>>>>>>>>>>>> 2). pv not register trap callback
>>>>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's
>>>>>>>>>>>> ok to kill them), but any graceful way to quit?
>>>>>>>>>>>
>>>>>>>>>>> Just exit and do nothing (except perhaps log a rate limited
>>>>>>>>>>> message)? 
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One concern of quiet exit is, the error will be totally
>>>>>>>>>> ignored by guest --> it
>>>>>>>>> didn't get preperly handled, and may recursively occur to make
>>>>>>>>> worse error --> it's better to kill guest under such case.
>>>>>>>>>>
>>>>>>>>>>>> or, considering it rarely happens, how about keep current
>>>>>>>>>>>> way (kill guest no matter dom0 or not)?
>>>>>>>>>>>
>>>>>>>>>>> Possibly - I was merely asking why this one condition was
>>>>>>>>>>> found to be too strict, while the others are being left as
>>>>>>>>>>> is. 
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>>>>>>>> problematic (check mcelog driver, not vmce tap callback),
>>>>>>>>>> and overkilled (since defaultly dom0 will not start mcelog
>>>>>>>>>> driver, under which case system will crash whenever vmce
>>>>>>>>>> inject to dom0) 
>>>>>>>>>>
>>>>>>>>>> --> So patch 2/2 is not too strict for dom0.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please keep in mind the mcelog userland/kernel interface is not
>>>>>>>>> designed with xen in mind. mcelog cannot report which guest is
>>>>>>>>> impacted for example, although xen reports that to dom0.
>>>>>>>>> I object 'fixing' the hypervisor to come over with mcelog
>>>>>>>>> drawbacks. I prefer fixing Dom0 instead.
>>>>>>>>>
>>>>>>>
>>>>>>> Sure, xen mcelog driver in linux is implemented by me :-)
>>>>>>> This patch does not intend to 'fix' hypervisor but just avoid
>>>>>>> overkilled system (when xen mcelog driver in dom0 not loaded as
>>>>>>> default).
>>>>>>
>>>>>> I assume dom0 w/o xen mcelog driver active means dom0 is not
>>>>>> capable to deal with machine check errors. Is this correct?
>>>>>>
>>>>>
>>>>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was
>>>>> affected. Dom0 is still capable of handling vmce as long as it
>>>>> registered trap callback (which is checked at hypervisor
>>>>> inject_vmce()).
>>>>
>>>> Oh, I thought the xen mcelog *driver* registers the trap callback
>>>> and in a Dom0 it additionally registers the logging callback.
>>>> What registers the logging callback and what registers
>>>> the trap callback?
>>>>
>>>> Christoph
>>>>
>>>
[...]
>>
>> Ok, that's how I understand the code, too. But you say above that w/o
>> this driver Dom0 is still capable to handle vmce. That puzzle's me.
>>
>> Christoph
> 
> Ah, what I meant is, w/o mcelog dirver, kernel still successfully
> register its mce handler as trap callback to hypervisor, hence the
> injection of vmce from hypervisor to dom0 is OK, and surely got properly
> handled at dom0.
> mcelog driver is just for error logging, getting error info from hypervisor
> and provide interface for user daemon tools 'mcelog'.

Ah, so the trap handler is *always* installed in a Linux Dom0?

NetBSD Dom0/DomU still has no machine check support at all, so in that
case Dom0/DomU is not vmce ready. That means the vmce ready check is
needed.

Christoph


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.