[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] X86 MCE: Add more strict sanity check of one SRAR case



Jan,

Attached is the email we discuss last month, and 2 mce patches:
srar-1.patch is the patch we discussed before,
srar-guest-mode.patch is an additional patch w/ more strict sanity checkof RIPV 
= EIPV = 0.
Any comments please let me know.

Thanks,
Jinsong

________________________________

From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Liu, Jinsong
Sent: Friday, December 02, 2011 11:55 PM
To: jbeulich@xxxxxxxx
Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx; Jiang, Yunhong
Subject: [Xen-devel] [PATCH] X86 MCE: Add more strict sanity check of one SRAR 
case


Jan,
 
I cannot access my remote office desktop today (network maintain), so cannot 
reply the SRAR patch email we discussed last month.
 
At the last email we basically agree that the SRAR patch itself looks fine, 
only with 1 concern about 'guest_mode' (in fact the SRAR patch itself didn't 
involved guest_mode), and the SRAR patch not in xen unstable tree yet.
Currently I still don't find out an alternative way to solve the guest_mode 
issue, so in order to be safe I add this patch.
I think although it may be some overkilled, it's OK to keep it safe here, after 
all the SRAR case which RIPV = EIPV = 0 occur rarely.
 
Thanks,
Jinsong
 
======================================
 
X86 MCE: Add more strict sanity check of one SRAR case
 
When RIPV = EIPV = 0, it's a little bit tricky. It may be an asynchronic error, 
currently we have no way to precisely locate whether the error occur at guest 
or hypervisor.
To avoid handling error in wrong way, we treat it as unrecovered.
 
Signed-off-by: Liu, Jinsong <jinsong.liu@xxxxxxxxx>
Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
 
# HG changeset patch
# Parent e758b1d928e3d663602741ddc7f55461e2187829
 
diff -r e758b1d928e3 xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c       Sun Oct 30 17:50:53 2011 -0800
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c       Fri Dec 02 15:16:52 2011 -0800
@@ -367,8 +367,18 @@
         return 0;
 
     gstatus = mca_rdmsr(MSR_IA32_MCG_STATUS);
-    /* Xen is not pre-emptible */
-    if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
+
+    /*
+     * FIXME: When RIPV = EIPV = 0, it's a little bit tricky. It may be an
+     * asynchronic error, currently we have no way to precisely locate
+     * whether the error occur at guest or hypervisor.
+     * To avoid handling error in wrong way, we treat it as unrecovered.
+     *
+     * Another unrecovered case is RIPV = 0 while in hypervisor
+     * since Xen is not pre-emptible.
+     */
+    if ( !(gstatus & MCG_STATUS_RIPV) &&
+        (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)) )
         return -1;
 
     return mce_action(regs, mctc) == MCER_RESET ? -1 : 0;
--- Begin Message ---
>>> On 21.10.11 at 22:25, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
> I update a little for my former patch, as attached.
> For my former patch, you mainly have 2 concerns (list below). I double check 
> xen mce code, w/ my opinion append:

Those concerns were really just triggered by the patch, not directly
related to it. The patch itself looks okay to me.

> Concern 1: for SRAR IFU error, since RIPV=EIPV=0, it maybe an async error 
> which occur at guest but root from hypervisor.
> [Jinsong]: 
>     Yes, but EIPV didn't tell us where the error root from (it's just a 
> hint, warning us async possibility). 
>     It no need to overkill xen at mce isr, instead, at mce softirq we can 
> find out error root location and then handle accordingly:
>     * at mce isr:
>             /* a total insurance */
>             /* if error is async, we delay handle it at mce softirq */
>             if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
>                 return -1;

I continue to think that guest_mode() must not be used without
EIPV, no matter what the purpose of the use.

>     * at mce softirq:
>             /* detect error location by bank->mc_addr */
>             /* handle different page OWNER cases at intel_memerr_dhandler() 
> and offline_page() */
>             /* who own, who take */
>             if (error page owner is guest)
>                 trigger vmce to guest;
>             else
>                 panic xen;

That part is certainly fine.

> Concern 2: If a guest accesses the hypervisor part of the GDT or page 
> tables, or some other shared data structure owned by the hypervisor (like the 
> M2P table), its handler may get utterly confused by being presented an 
> address it doesn't own and knows nothing about.
> [Jinsong]: for such cases, page owner would be dom_xen/dom_cow or NULL, but 
> not guest --> it would be handled at hypervisor, not triggering vmce to guest 
> --> 
> who own, who take.

That latter part of your explanation is fine too, but with the caveat that
the bogus use of guest_mode() above may have an overall effect on the
behavior.

Jan


--- End Message ---

Attachment: srar-1.patch
Description: srar-1.patch

Attachment: srar-guest-mode.patch
Description: srar-guest-mode.patch

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.