[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check


  • To: Christoph Egger <chegger@xxxxxxxxx>
  • From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
  • Date: Mon, 6 May 2013 15:00:43 +0000
  • Accept-language: en-US
  • Cc: Jan Beulich <JBeulich@xxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxx>
  • Delivery-date: Mon, 06 May 2013 15:01:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>
  • Thread-index: AQHOSk5EGsdbS84pvUuls/cWiRHQfJj4N5kg
  • Thread-topic: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check

Christoph Egger wrote:
> On 06.05.13 11:50, Liu, Jinsong wrote:
>> Christoph Egger wrote:
>>> On 06.05.13 11:24, Liu, Jinsong wrote:
>>>> Jan Beulich wrote:
>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx>
>>>>>>>> wrote: 
>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>>>>> Jan Beulich wrote:
>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>> wrote:
>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong"
>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong"
>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove
>>>>>>>>>>>>> problematic is_vmce_ready check 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog driver.
>>>>>>>>>>>>> If not, it results dom0 crash. However, it's problematic
>>>>>>>>>>>>> and overkilled since mcelog as a dom0 feature could be
>>>>>>>>>>>>> enabled/disabled per dom0 option: (XEN) MCE: This error
>>>>>>>>>>>>> page is ownded by DOM 0 (XEN) DOM0 not ready for vMCE
>>>>>>>>>>>>> (XEN) domain_crash called from mcaction.c:133 (XEN)
>>>>>>>>>>>>> Domain 0 reported crashed by domain 32767 on cpu#31:
>>>>>>>>>>>>> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>>>>>>>>>>>>> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * For dom0, if really need check, it should check whether
>>>>>>>>>>>>> vMCE injection for dom0 ready (say, exception trap bounce
>>>>>>>>>>>>> check, which has been done at inject_vmce()), not check
>>>>>>>>>>>>> dom0 mcelog ready (which has been done at mce_softirq()
>>>>>>>>>>>>> before send global virq to dom0).
>>>>>>>>>>>> 
>>>>>>>>>>>> Following the argumentation above, I wonder which of the
>>>>>>>>>>>> other "goto vmce_failed" are really appropriate, i.e.
>>>>>>>>>>>> whether the patch shouldn't be extended (at least for the
>>>>>>>>>>>> Dom0 case).
>>>>>>>>>>> 
>>>>>>>>>>> You mean other 'goto vmce_failed' are also not appropriate
>>>>>>>>>>> (I'm not quite clear your point)?
>>>>>>>>>> 
>>>>>>>>>> Yes.
>>>>>>>>>> 
>>>>>>>>>>> Would you please point out which point you think not
>>>>>>>>>>> appropriate?
>>>>>>>>>> 
>>>>>>>>>> I question whether it is correct/necessary to crash the
>>>>>>>>>> domain in any of those failure cases. Perhaps when we fail
>>>>>>>>>> to unmap the page it is, but failure of fill_vmsr_data() and
>>>>>>>>>> inject_vmce() don't appear to be valid reasons once the
>>>>>>>>>> is_vmce_ready() path is being dropped.
>>>>>>>>> 
>>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP bit
>>>>>>>>> still set when next vMCE# occur, means the 2nd vMCE# occur
>>>>>>>>> when the 1st vMCE# not handled yet. Per SDM it should
>>>>>>>>> shutdown. 
>>>>>>>>> 
>>>>>>>>> For inject_vmce(), it failed when
>>>>>>>>> 1). vcpu is still mce_pending, or
>>>>>>>>> 2). pv not register trap callback
>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's ok
>>>>>>>>> to kill them), but any graceful way to quit?
>>>>>>>> 
>>>>>>>> Just exit and do nothing (except perhaps log a rate limited
>>>>>>>> message)? 
>>>>>>>> 
>>>>>>> 
>>>>>>> One concern of quiet exit is, the error will be totally ignored
>>>>>>> by guest --> it
>>>>>> didn't get preperly handled, and may recursively occur to make
>>>>>> worse error --> it's better to kill guest under such case.
>>>>>>> 
>>>>>>>>> or, considering it rarely happens, how about keep current way
>>>>>>>>> (kill guest no matter dom0 or not)?
>>>>>>>> 
>>>>>>>> Possibly - I was merely asking why this one condition was found
>>>>>>>> to be too strict, while the others are being left as is.
>>>>>>>> 
>>>>>>>> Jan
>>>>>>> 
>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>>>>> problematic (check mcelog driver, not vmce tap callback),
>>>>>>> and overkilled (since defaultly dom0 will not start mcelog
>>>>>>> driver, under which case system will crash whenever vmce inject
>>>>>>> to dom0) 
>>>>>>> 
>>>>>>> --> So patch 2/2 is not too strict for dom0.
>>>>>>> 
>>>>>> 
>>>>>> Please keep in mind the mcelog userland/kernel interface is not
>>>>>> designed with xen in mind. mcelog cannot report which guest is
>>>>>> impacted for example, although xen reports that to dom0.
>>>>>> I object 'fixing' the hypervisor to come over with mcelog
>>>>>> drawbacks. I prefer fixing Dom0 instead.
>>>>>> 
>>>> 
>>>> Sure, xen mcelog driver in linux is implemented by me :-)
>>>> This patch does not intend to 'fix' hypervisor but just avoid
>>>> overkilled system (when xen mcelog driver in dom0 not loaded as
>>>> default).
>>> 
>>> I assume dom0 w/o xen mcelog driver active means dom0 is not capable
>>> to deal with machine check errors. Is this correct?
>>> 
>> 
>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was
>> affected. Dom0 is still capable of handling vmce as long as it
>> registered trap callback (which is checked at hypervisor
>> inject_vmce()). 
> 
> Oh, I thought the xen mcelog *driver* registers the trap callback
> and in a Dom0 it additionally registers the logging callback.
> What registers the logging callback and what registers
> the trap callback?
> 
> Christoph
> 

They are 2 paths (refer latest kernl):
1. for xen mcelog driver, dom0 registers irq handler at driver init via
        drivers/xen/mcelog.c
            --> bind_virq_for_mce()
                --> bind_virq_to_irqhandler()
        which got VIRQ_MCA via eventchannel binding and handle mcelog error 
logging accordingly;

2. for vmce injection, dom0 registers trap callback (which re-use kernel 
machine check handler) via normal kernel traps init:
        arch/x86/kernel/traps.c
            --> set_intr_gate_ist()
                --> write_idt_entry()
                    --> xen_write_idt_entry()
                        --> cvt_gate_to_trap()  & HYPERVISOR_set_trap_table()
        hypervisor entry.S use the registerred trap callback to construct 
bounce back stack for TRAP_machine_check, bounce back to dom0 mce handler.

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.