|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check
Christoph Egger wrote:
> On 06.05.13 11:50, Liu, Jinsong wrote:
>> Christoph Egger wrote:
>>> On 06.05.13 11:24, Liu, Jinsong wrote:
>>>> Jan Beulich wrote:
>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx>
>>>>>>>> wrote:
>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote:
>>>>>>> Jan Beulich wrote:
>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>> wrote:
>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong"
>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>> Jan Beulich wrote:
>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong"
>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote:
>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon Sep 17
>>>>>>>>>>>>> 00:00:00 2001 From: Liu Jinsong <jinsong.liu@xxxxxxxxx>
>>>>>>>>>>>>> Date: Sat, 27 Apr 2013 22:37:35 +0800
>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove
>>>>>>>>>>>>> problematic is_vmce_ready check
>>>>>>>>>>>>>
>>>>>>>>>>>>> is_vmce_ready() is problematic:
>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog driver.
>>>>>>>>>>>>> If not, it results dom0 crash. However, it's problematic
>>>>>>>>>>>>> and overkilled since mcelog as a dom0 feature could be
>>>>>>>>>>>>> enabled/disabled per dom0 option: (XEN) MCE: This error
>>>>>>>>>>>>> page is ownded by DOM 0 (XEN) DOM0 not ready for vMCE
>>>>>>>>>>>>> (XEN) domain_crash called from mcaction.c:133 (XEN)
>>>>>>>>>>>>> Domain 0 reported crashed by domain 32767 on cpu#31:
>>>>>>>>>>>>> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>>>>>>>>>>>>> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * For dom0, if really need check, it should check whether
>>>>>>>>>>>>> vMCE injection for dom0 ready (say, exception trap bounce
>>>>>>>>>>>>> check, which has been done at inject_vmce()), not check
>>>>>>>>>>>>> dom0 mcelog ready (which has been done at mce_softirq()
>>>>>>>>>>>>> before send global virq to dom0).
>>>>>>>>>>>>
>>>>>>>>>>>> Following the argumentation above, I wonder which of the
>>>>>>>>>>>> other "goto vmce_failed" are really appropriate, i.e.
>>>>>>>>>>>> whether the patch shouldn't be extended (at least for the
>>>>>>>>>>>> Dom0 case).
>>>>>>>>>>>
>>>>>>>>>>> You mean other 'goto vmce_failed' are also not appropriate
>>>>>>>>>>> (I'm not quite clear your point)?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> Would you please point out which point you think not
>>>>>>>>>>> appropriate?
>>>>>>>>>>
>>>>>>>>>> I question whether it is correct/necessary to crash the
>>>>>>>>>> domain in any of those failure cases. Perhaps when we fail
>>>>>>>>>> to unmap the page it is, but failure of fill_vmsr_data() and
>>>>>>>>>> inject_vmce() don't appear to be valid reasons once the
>>>>>>>>>> is_vmce_ready() path is being dropped.
>>>>>>>>>
>>>>>>>>> For fill_vmsr_data(), it failed only when MCG_STATUS_MCIP bit
>>>>>>>>> still set when next vMCE# occur, means the 2nd vMCE# occur
>>>>>>>>> when the 1st vMCE# not handled yet. Per SDM it should
>>>>>>>>> shutdown.
>>>>>>>>>
>>>>>>>>> For inject_vmce(), it failed when
>>>>>>>>> 1). vcpu is still mce_pending, or
>>>>>>>>> 2). pv not register trap callback
>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, it's ok
>>>>>>>>> to kill them), but any graceful way to quit?
>>>>>>>>
>>>>>>>> Just exit and do nothing (except perhaps log a rate limited
>>>>>>>> message)?
>>>>>>>>
>>>>>>>
>>>>>>> One concern of quiet exit is, the error will be totally ignored
>>>>>>> by guest --> it
>>>>>> didn't get preperly handled, and may recursively occur to make
>>>>>> worse error --> it's better to kill guest under such case.
>>>>>>>
>>>>>>>>> or, considering it rarely happens, how about keep current way
>>>>>>>>> (kill guest no matter dom0 or not)?
>>>>>>>>
>>>>>>>> Possibly - I was merely asking why this one condition was found
>>>>>>>> to be too strict, while the others are being left as is.
>>>>>>>>
>>>>>>>> Jan
>>>>>>>
>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's
>>>>>>> problematic (check mcelog driver, not vmce tap callback),
>>>>>>> and overkilled (since defaultly dom0 will not start mcelog
>>>>>>> driver, under which case system will crash whenever vmce inject
>>>>>>> to dom0)
>>>>>>>
>>>>>>> --> So patch 2/2 is not too strict for dom0.
>>>>>>>
>>>>>>
>>>>>> Please keep in mind the mcelog userland/kernel interface is not
>>>>>> designed with xen in mind. mcelog cannot report which guest is
>>>>>> impacted for example, although xen reports that to dom0.
>>>>>> I object 'fixing' the hypervisor to come over with mcelog
>>>>>> drawbacks. I prefer fixing Dom0 instead.
>>>>>>
>>>>
>>>> Sure, xen mcelog driver in linux is implemented by me :-)
>>>> This patch does not intend to 'fix' hypervisor but just avoid
>>>> overkilled system (when xen mcelog driver in dom0 not loaded as
>>>> default).
>>>
>>> I assume dom0 w/o xen mcelog driver active means dom0 is not capable
>>> to deal with machine check errors. Is this correct?
>>>
>>
>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was
>> affected. Dom0 is still capable of handling vmce as long as it
>> registered trap callback (which is checked at hypervisor
>> inject_vmce()).
>
> Oh, I thought the xen mcelog *driver* registers the trap callback
> and in a Dom0 it additionally registers the logging callback.
> What registers the logging callback and what registers
> the trap callback?
>
> Christoph
>
They are 2 paths (refer latest kernl):
1. for xen mcelog driver, dom0 registers irq handler at driver init via
drivers/xen/mcelog.c
--> bind_virq_for_mce()
--> bind_virq_to_irqhandler()
which got VIRQ_MCA via eventchannel binding and handle mcelog error
logging accordingly;
2. for vmce injection, dom0 registers trap callback (which re-use kernel
machine check handler) via normal kernel traps init:
arch/x86/kernel/traps.c
--> set_intr_gate_ist()
--> write_idt_entry()
--> xen_write_idt_entry()
--> cvt_gate_to_trap() & HYPERVISOR_set_trap_table()
hypervisor entry.S use the registerred trap callback to construct
bounce back stack for TRAP_machine_check, bounce back to dom0 mce handler.
Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |