[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check



On 06.05.13 11:24, Liu, Jinsong wrote:
> Jan Beulich wrote:
>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx> wrote:
>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>> Jan Beulich wrote:
>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>> wrote: 
>>>>>> Jan Beulich wrote:
>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>> wrote:
>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove problematic
>>>>>>>>>> is_vmce_ready check 
>>>>>>>>>>
>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog driver. If
>>>>>>>>>> not, it results dom0 crash. However, it's problematic and
>>>>>>>>>> overkilled since mcelog as a dom0 feature could be
>>>>>>>>>> enabled/disabled per dom0 option: (XEN) MCE: This error page
>>>>>>>>>> is ownded by DOM 0 (XEN) DOM0 not ready for vMCE (XEN)
>>>>>>>>>> domain_crash called from mcaction.c:133 (XEN) Domain 0
>>>>>>>>>> reported crashed by domain 32767 on cpu#31: (XEN) Domain 0
>>>>>>>>>> crashed: rebooting machine in 5 seconds. (XEN) Resetting with
>>>>>>>>>> ACPI MEMORY or I/O RESET_REG. 
>>>>>>>>>>
>>>>>>>>>> * For dom0, if really need check, it should check whether vMCE
>>>>>>>>>> injection for dom0 ready (say, exception trap bounce check,
>>>>>>>>>> which has been done at inject_vmce()), not check dom0 mcelog
>>>>>>>>>> ready (which has been done at mce_softirq() before send
>>>>>>>>>> global virq to dom0).
>>>>>>>>>
>>>>>>>>> Following the argumentation above, I wonder which of the other
>>>>>>>>> "goto vmce_failed" are really appropriate, i.e. whether the
>>>>>>>>> patch shouldn't be extended (at least for the Dom0 case).
>>>>>>>>
>>>>>>>> You mean other 'goto vmce_failed' are also not appropriate (I'm
>>>>>>>> not quite clear your point)?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> Would you please point out which point you think not
>>>>>>>> appropriate? 
>>>>>>>
>>>>>>> I question whether it is correct/necessary to crash the domain in
>>>>>>> any of those failure cases. Perhaps when we fail to unmap the
>>>>>>> page it is, but failure of fill_vmsr_data() and inject_vmce()
>>>>>>> don't appear to be valid reasons once the is_vmce_ready() path
>>>>>>> is being dropped.
>>>>>>
>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP bit
>>>>>> still set when next vMCE# occur, means the 2nd vMCE# occur when
>>>>>> the 1st vMCE# not handled yet. Per SDM it should shutdown.
>>>>>>
>>>>>> For inject_vmce(), it failed when
>>>>>> 1). vcpu is still mce_pending, or
>>>>>> 2). pv not register trap callback
>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's ok to
>>>>>> kill them), but any graceful way to quit?
>>>>>
>>>>> Just exit and do nothing (except perhaps log a rate limited
>>>>> message)? 
>>>>>
>>>>
>>>> One concern of quiet exit is, the error will be totally ignored by
>>>> guest --> it 
>>> didn't get preperly handled, and may recursively occur to make worse
>>> error --> it's better to kill guest under such case.
>>>>
>>>>>> or, considering it rarely happens, how about keep current way
>>>>>> (kill guest no matter dom0 or not)?
>>>>>
>>>>> Possibly - I was merely asking why this one condition was found to
>>>>> be too strict, while the others are being left as is.
>>>>>
>>>>> Jan
>>>>
>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>> problematic (check mcelog driver, not vmce tap callback),
>>>> and overkilled (since defaultly dom0 will not start mcelog driver,
>>>> under which case system will crash whenever vmce inject to dom0)
>>>>
>>>> --> So patch 2/2 is not too strict for dom0.
>>>>
>>>
>>> Please keep in mind the mcelog userland/kernel interface is not
>>> designed 
>>> with xen in mind. mcelog cannot report which guest is impacted for
>>> example, although xen reports that to dom0.
>>> I object 'fixing' the hypervisor to come over with mcelog drawbacks.
>>> I prefer fixing Dom0 instead.
>>>
> 
> Sure, xen mcelog driver in linux is implemented by me :-)
> This patch does not intend to 'fix' hypervisor but just avoid
> overkilled system (when xen mcelog driver in dom0 not loaded as default).

I assume dom0 w/o xen mcelog driver active means dom0 is not capable
to deal with machine check errors. Is this correct?

>>> From the design perspective, the virq for Dom0 is for logging purpose
>>> only and the trap handler has equal purpose for both Dom0 and DomU.
> 
> Sure, that's what I meant 'problematic' check.

What do you want to do when Dom0 is not capable to deal with
machine check errors and Dom0 is impacted?

Christoph

> Thanks,
> Jinsong
> 
>> So as this doesn't read like "don't care" - is this an ack, nak, or
>> a request to Jinsong to change something for the patch to be
>> acceptable?
>>
>> Jan
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.