[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

On 06.05.13 11:50, Liu, Jinsong wrote:
> Christoph Egger wrote:
>> On 06.05.13 11:24, Liu, Jinsong wrote:
>>> Jan Beulich wrote:
>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx> wrote:
>>>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>>>> Jan Beulich wrote:
>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>> wrote:
>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong"
>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove problematic
>>>>>>>>>>>> is_vmce_ready check 
>>>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog driver. If
>>>>>>>>>>>> not, it results dom0 crash. However, it's problematic and
>>>>>>>>>>>> overkilled since mcelog as a dom0 feature could be
>>>>>>>>>>>> enabled/disabled per dom0 option: (XEN) MCE: This error page
>>>>>>>>>>>> is ownded by DOM 0 (XEN) DOM0 not ready for vMCE (XEN)
>>>>>>>>>>>> domain_crash called from mcaction.c:133 (XEN) Domain 0
>>>>>>>>>>>> reported crashed by domain 32767 on cpu#31: (XEN) Domain 0
>>>>>>>>>>>> crashed: rebooting machine in 5 seconds. (XEN) Resetting
>>>>>>>>>>>> with ACPI MEMORY or I/O RESET_REG. 
>>>>>>>>>>>> * For dom0, if really need check, it should check whether
>>>>>>>>>>>> vMCE injection for dom0 ready (say, exception trap bounce
>>>>>>>>>>>> check, which has been done at inject_vmce()), not check
>>>>>>>>>>>> dom0 mcelog ready (which has been done at mce_softirq()
>>>>>>>>>>>> before send global virq to dom0).
>>>>>>>>>>> Following the argumentation above, I wonder which of the
>>>>>>>>>>> other "goto vmce_failed" are really appropriate, i.e.
>>>>>>>>>>> whether the patch shouldn't be extended (at least for the
>>>>>>>>>>> Dom0 case). 
>>>>>>>>>> You mean other 'goto vmce_failed' are also not appropriate
>>>>>>>>>> (I'm not quite clear your point)?
>>>>>>>>> Yes.
>>>>>>>>>> Would you please point out which point you think not
>>>>>>>>>> appropriate?
>>>>>>>>> I question whether it is correct/necessary to crash the domain
>>>>>>>>> in any of those failure cases. Perhaps when we fail to unmap
>>>>>>>>> the page it is, but failure of fill_vmsr_data() and
>>>>>>>>> inject_vmce() don't appear to be valid reasons once the
>>>>>>>>> is_vmce_ready() path is being dropped.
>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP bit
>>>>>>>> still set when next vMCE# occur, means the 2nd vMCE# occur when
>>>>>>>> the 1st vMCE# not handled yet. Per SDM it should shutdown.
>>>>>>>> For inject_vmce(), it failed when
>>>>>>>> 1). vcpu is still mce_pending, or
>>>>>>>> 2). pv not register trap callback
>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's ok to
>>>>>>>> kill them), but any graceful way to quit?
>>>>>>> Just exit and do nothing (except perhaps log a rate limited
>>>>>>> message)? 
>>>>>> One concern of quiet exit is, the error will be totally ignored by
>>>>>> guest --> it
>>>>> didn't get preperly handled, and may recursively occur to make
>>>>> worse error --> it's better to kill guest under such case.
>>>>>>>> or, considering it rarely happens, how about keep current way
>>>>>>>> (kill guest no matter dom0 or not)?
>>>>>>> Possibly - I was merely asking why this one condition was found
>>>>>>> to be too strict, while the others are being left as is.
>>>>>>> Jan
>>>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>>>> problematic (check mcelog driver, not vmce tap callback),
>>>>>> and overkilled (since defaultly dom0 will not start mcelog driver,
>>>>>> under which case system will crash whenever vmce inject to dom0)
>>>>>> --> So patch 2/2 is not too strict for dom0.
>>>>> Please keep in mind the mcelog userland/kernel interface is not
>>>>> designed with xen in mind. mcelog cannot report which guest is
>>>>> impacted for example, although xen reports that to dom0.
>>>>> I object 'fixing' the hypervisor to come over with mcelog
>>>>> drawbacks. I prefer fixing Dom0 instead. 
>>> Sure, xen mcelog driver in linux is implemented by me :-)
>>> This patch does not intend to 'fix' hypervisor but just avoid
>>> overkilled system (when xen mcelog driver in dom0 not loaded as
>>> default). 
>> I assume dom0 w/o xen mcelog driver active means dom0 is not capable
>> to deal with machine check errors. Is this correct?
> No, w/o xen mcelog driver active, only user daemon 'mcelog' was affected.
> Dom0 is still capable of handling vmce as long as it registered trap callback
> (which is checked at hypervisor inject_vmce()).

Oh, I thought the xen mcelog *driver* registers the trap callback
and in a Dom0 it additionally registers the logging callback.
What registers the logging callback and what registers
the trap callback?


>>>>> From the design perspective, the virq for Dom0 is for logging
>>>>> purpose only and the trap handler has equal purpose for both Dom0
>>>>> and DomU. 
>>> Sure, that's what I meant 'problematic' check.
>> What do you want to do when Dom0 is not capable to deal with
>> machine check errors and Dom0 is impacted?
>> Christoph
> As above comments :)
> Thanks,
> Jinsong
>>> Thanks,
>>> Jinsong
>>>> So as this doesn't read like "don't care" - is this an ack, nak, or
>>>> a request to Jinsong to change something for the patch to be
>>>> acceptable? 
>>>> Jan

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.