[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check


  • To: Jan Beulich <JBeulich@xxxxxxxx>, Christoph Egger <chegger@xxxxxxxxx>
  • From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
  • Date: Mon, 6 May 2013 09:24:22 +0000
  • Accept-language: en-US
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Mon, 06 May 2013 09:24:52 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>
  • Thread-index: AQHOSjeCGsdbS84pvUuls/cWiRHQfJj3V1EAgACJnJA=
  • Thread-topic: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

Jan Beulich wrote:
>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx> wrote:
>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>> Jan Beulich wrote:
>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>> wrote: 
>>>>> Jan Beulich wrote:
>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>> Jan Beulich wrote:
>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>> wrote:
>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove problematic
>>>>>>>>> is_vmce_ready check 
>>>>>>>>> 
>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog driver. If
>>>>>>>>> not, it results dom0 crash. However, it's problematic and
>>>>>>>>> overkilled since mcelog as a dom0 feature could be
>>>>>>>>> enabled/disabled per dom0 option: (XEN) MCE: This error page
>>>>>>>>> is ownded by DOM 0 (XEN) DOM0 not ready for vMCE (XEN)
>>>>>>>>> domain_crash called from mcaction.c:133 (XEN) Domain 0
>>>>>>>>> reported crashed by domain 32767 on cpu#31: (XEN) Domain 0
>>>>>>>>> crashed: rebooting machine in 5 seconds. (XEN) Resetting with
>>>>>>>>> ACPI MEMORY or I/O RESET_REG. 
>>>>>>>>> 
>>>>>>>>> * For dom0, if really need check, it should check whether vMCE
>>>>>>>>> injection for dom0 ready (say, exception trap bounce check,
>>>>>>>>> which has been done at inject_vmce()), not check dom0 mcelog
>>>>>>>>> ready (which has been done at mce_softirq() before send
>>>>>>>>> global virq to dom0).
>>>>>>>> 
>>>>>>>> Following the argumentation above, I wonder which of the other
>>>>>>>> "goto vmce_failed" are really appropriate, i.e. whether the
>>>>>>>> patch shouldn't be extended (at least for the Dom0 case).
>>>>>>> 
>>>>>>> You mean other 'goto vmce_failed' are also not appropriate (I'm
>>>>>>> not quite clear your point)?
>>>>>> 
>>>>>> Yes.
>>>>>> 
>>>>>>> Would you please point out which point you think not
>>>>>>> appropriate? 
>>>>>> 
>>>>>> I question whether it is correct/necessary to crash the domain in
>>>>>> any of those failure cases. Perhaps when we fail to unmap the
>>>>>> page it is, but failure of fill_vmsr_data() and inject_vmce()
>>>>>> don't appear to be valid reasons once the is_vmce_ready() path
>>>>>> is being dropped.
>>>>> 
>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP bit
>>>>> still set when next vMCE# occur, means the 2nd vMCE# occur when
>>>>> the 1st vMCE# not handled yet. Per SDM it should shutdown.
>>>>> 
>>>>> For inject_vmce(), it failed when
>>>>> 1). vcpu is still mce_pending, or
>>>>> 2). pv not register trap callback
>>>>> Maybe it's some overkilled for dom0 (for other guest, it's ok to
>>>>> kill them), but any graceful way to quit?
>>>> 
>>>> Just exit and do nothing (except perhaps log a rate limited
>>>> message)? 
>>>> 
>>> 
>>> One concern of quiet exit is, the error will be totally ignored by
>>> guest --> it 
>> didn't get preperly handled, and may recursively occur to make worse
>> error --> it's better to kill guest under such case.
>>> 
>>>>> or, considering it rarely happens, how about keep current way
>>>>> (kill guest no matter dom0 or not)?
>>>> 
>>>> Possibly - I was merely asking why this one condition was found to
>>>> be too strict, while the others are being left as is.
>>>> 
>>>> Jan
>>> 
>>> Ah, the reason of removing is_vmce_ready check is, it's
>>> problematic (check mcelog driver, not vmce tap callback),
>>> and overkilled (since defaultly dom0 will not start mcelog driver,
>>> under which case system will crash whenever vmce inject to dom0)
>>> 
>>> --> So patch 2/2 is not too strict for dom0.
>>> 
>> 
>> Please keep in mind the mcelog userland/kernel interface is not
>> designed 
>> with xen in mind. mcelog cannot report which guest is impacted for
>> example, although xen reports that to dom0.
>> I object 'fixing' the hypervisor to come over with mcelog drawbacks.
>> I prefer fixing Dom0 instead.
>> 

Sure, xen mcelog driver in linux is implemented by me :-)
This patch does not intend to 'fix' hypervisor but just avoid overkilled system 
(when xen mcelog driver in dom0 not loaded as default).

>> From the design perspective, the virq for Dom0 is for logging purpose
>> only and the trap handler has equal purpose for both Dom0 and DomU.
> 

Sure, that's what I meant 'problematic' check.

Thanks,
Jinsong

> So as this doesn't read like "don't care" - is this an ack, nak, or
> a request to Jinsong to change something for the patch to be
> acceptable?
> 
> Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.