WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler

>>> On 11.10.11 at 10:15, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
> Jan Beulich wrote:
>>>>> On 08.10.11 at 10:29, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>> wrote: 
>> 
>>> 
>>>> -----Original Message-----
>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] 
>>>> Sent: Friday, September 30, 2011 3:25 PM
>>>> To: Liu, Jinsong; Jiang, Yunhong
>>>> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx 
>>>> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
>>>> 
>>>>>>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>>>> wrote: 
>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx] 
>>>>>> This made me look at the current source, and there I see in
>>>>>> mce_urgent_action() 
>>>>>> 
>>>>>>     if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))      
>>>>>> return -1; 
>>>>>> 
>>>>>> which I think should say ... _EIPV and use || instead. Thoughts?
>>>>> 
>>>>> I think this code means, if the error happens in hypervisor mode
>>>>> (i.e. !guest_mode()), and RIPV indicate the RIP in stack can't be
>>>>> restarted, we have to panic.
>>>> 
>>>> Then the guest_mode() check still lacks an extra check of EIPV, like
>>>> 
>>>>      if ( !(gstatus & MCG_STATUS_RIPV) &&
>>>>           (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))     
>>>> return -1; 
>>>> 
>>> 
>>> The RIPV is not related to the EIPV. RIPV means the context saved in
>>> the stack can't be restarted anymore. According to the SDM, RIPV
>>> means "execution can be restarted reliably at the instruction
>>> pointed to by the instruction pointer pushed on the stack". It's not
>>> about error happened synchronously or asynchronously. The point is,
>>> if the program is running in hypervisor context, and RIPV tells us
>>> that the program can't be restarted, we can't do anything but panic,
>>> because we can't switch context while we are in xen. So this code
>>> have nothing to do with EIPV. 
>> 
>> I continue to disagree (including the statement in your other
>> response): RIPV only tells us whether we can resume, not in which
>> context the error occurred. EIPV tells us whether, by looking at the
>> saved registers, we can determine the context that the error occurred
>> in. Since with !RIPV 
>> we have to determine in what context the error occurred in order to
>> decide whether to panic or just kill a guest, we can't ignore EIPV
>> (and if it's not set we have to assume the worst case, since even if
>> the registers indicate guest mode the error may have occurred in
>> hypervisor context or accessing hypervisor structures [consider e.g. a
>> data load error during a GDT access]).
>> 
>> Jan
> 
> Yes, I agree EIPV=0 may indicate async error, but I think your solution 
> *overkilled* most cases (i.e. the real guest instruction fetch error).
> 
> Our idea is,
>   * xen mce would flush prefetched instruction so we can delay handle it 
> until if real need;
>   * a h/w error will not disappear, but if it was not being *consumed*, it's 
> OK for system keep going (like SRAO error which do not need s/w handle 
> immediately);
> 
> Suppose an async instruction fetch error (RIVP=EIVP=0), triggered at guest 
> context but instruction prefetch hypervisor context. The scenario is,
>   * at xen mce, the prefetched instruction has been flushed. xen mce handler 
> needn't panic, instead it mark the page as broken page, then trigger vmce to 
> guest;

If the prefetch was from Xen space (only in guest context), delivering a
vMCE to the guest is pointless (and perhaps confusing to the guest).

>   * guest may kill app, kernel thread, guest itself, or whatever;
> 
> The error is still an error, w/ 2 possibilities in the future:
>   1. it may not be consumed as an SRAR error, system keep going, h/w 
> mechanism may detect a SRAO error (i.e. memroy scrub) at some time point and 
> handled then;
>   2. it may be consumed at some time point and a SRAR error triggered again. 
> At this time,
>    1). if srar occurred at hypervisor context, xen will panic. or, 
>    2). if srar occurred at guest context, xen kill the guest as a malicious 
> one (as what the 2nd patch do), and move the page to broken page list;
> 
> Considering the rare possibility of the above case, I think it's acceptable 
> to handle it in this way.
> Thoughts?

You're only discussing instruction fetches (which can be discarded), but
you're not covering the other example I gave (GDT access from guest
context - just like this is a ring-0 operations from the paging unit's pov,
this ought to be an out-of-context operation from MCE's perspective).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel