[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler



>>> On 11.10.11 at 10:15, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
> Jan Beulich wrote:
>>>>> On 08.10.11 at 10:29, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>> wrote: 
>> 
>>> 
>>>> -----Original Message-----
>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] 
>>>> Sent: Friday, September 30, 2011 3:25 PM
>>>> To: Liu, Jinsong; Jiang, Yunhong
>>>> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx 
>>>> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
>>>> 
>>>>>>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>>>> wrote: 
>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] 
>>>>>> This made me look at the current source, and there I see in
>>>>>> mce_urgent_action() 
>>>>>> 
>>>>>>     if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))      
>>>>>> return -1; 
>>>>>> 
>>>>>> which I think should say ... _EIPV and use || instead. Thoughts?
>>>>> 
>>>>> I think this code means, if the error happens in hypervisor mode
>>>>> (i.e. !guest_mode()), and RIPV indicate the RIP in stack can't be
>>>>> restarted, we have to panic.
>>>> 
>>>> Then the guest_mode() check still lacks an extra check of EIPV, like
>>>> 
>>>>      if ( !(gstatus & MCG_STATUS_RIPV) &&
>>>>           (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))     
>>>> return -1; 
>>>> 
>>> 
>>> The RIPV is not related to the EIPV. RIPV means the context saved in
>>> the stack can't be restarted anymore. According to the SDM, RIPV
>>> means "execution can be restarted reliably at the instruction
>>> pointed to by the instruction pointer pushed on the stack". It's not
>>> about error happened synchronously or asynchronously. The point is,
>>> if the program is running in hypervisor context, and RIPV tells us
>>> that the program can't be restarted, we can't do anything but panic,
>>> because we can't switch context while we are in xen. So this code
>>> have nothing to do with EIPV. 
>> 
>> I continue to disagree (including the statement in your other
>> response): RIPV only tells us whether we can resume, not in which
>> context the error occurred. EIPV tells us whether, by looking at the
>> saved registers, we can determine the context that the error occurred
>> in. Since with !RIPV 
>> we have to determine in what context the error occurred in order to
>> decide whether to panic or just kill a guest, we can't ignore EIPV
>> (and if it's not set we have to assume the worst case, since even if
>> the registers indicate guest mode the error may have occurred in
>> hypervisor context or accessing hypervisor structures [consider e.g. a
>> data load error during a GDT access]).
>> 
>> Jan
> 
> Yes, I agree EIPV=0 may indicate async error, but I think your solution 
> *overkilled* most cases (i.e. the real guest instruction fetch error).
> 
> Our idea is,
>   * xen mce would flush prefetched instruction so we can delay handle it 
> until if real need;
>   * a h/w error will not disappear, but if it was not being *consumed*, it's 
> OK for system keep going (like SRAO error which do not need s/w handle 
> immediately);
> 
> Suppose an async instruction fetch error (RIVP=EIVP=0), triggered at guest 
> context but instruction prefetch hypervisor context. The scenario is,
>   * at xen mce, the prefetched instruction has been flushed. xen mce handler 
> needn't panic, instead it mark the page as broken page, then trigger vmce to 
> guest;

If the prefetch was from Xen space (only in guest context), delivering a
vMCE to the guest is pointless (and perhaps confusing to the guest).

>   * guest may kill app, kernel thread, guest itself, or whatever;
> 
> The error is still an error, w/ 2 possibilities in the future:
>   1. it may not be consumed as an SRAR error, system keep going, h/w 
> mechanism may detect a SRAO error (i.e. memroy scrub) at some time point and 
> handled then;
>   2. it may be consumed at some time point and a SRAR error triggered again. 
> At this time,
>    1). if srar occurred at hypervisor context, xen will panic. or, 
>    2). if srar occurred at guest context, xen kill the guest as a malicious 
> one (as what the 2nd patch do), and move the page to broken page list;
> 
> Considering the rare possibility of the above case, I think it's acceptable 
> to handle it in this way.
> Thoughts?

You're only discussing instruction fetches (which can be discarded), but
you're not covering the other example I gave (GDT access from guest
context - just like this is a ring-0 operations from the paging unit's pov,
this ought to be an out-of-context operation from MCE's perspective).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.