
RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN



Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> Ok, here is a different interpretation of what is correctable and
> uncorrectable. Uncorrectable in your interpretation means neither hardware
> nor software can do anything.
> Uncorrectable in my interpretation means the hardware can't correct it, but
> software may have more information and correct it.

Yes. Maybe "fatal" is a more appropriate name here.

> 
>> The main reason we need these flags is that MCA handling requires several
>> steps. For example, when MCEs hit multiple CPUs, each CPU first checks its
>> own severity; second, we need to find the most severe result across the
>> CPUs and act on it. For example, CPU A may get unrecoverable while CPU B
>> gets recoverable; they will check the information and the results, and the
>> final outcome will be unrecoverable.
> 
> I brought up an example of a broken memory page for my argumentation,
> you bring up a broken CPU for your argumentation.
> 
> We need to find a common denominator to compare.
> 
> If a CPU is completely broken and you are on UP, then the game is over. Not
> even a reboot can help. On an SMP system, offline the CPU and inform Dom0.

Sorry, I didn't get the relationship between the flags and the comparison of
the two examples :$
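
The two-step severity check in the quoted text above can be sketched roughly
as follows. This is only a minimal illustration under assumptions: the enum
values and the function name mce_worst_severity are hypothetical and are not
taken from the patch. Each CPU is assumed to have already computed its own
severity (step 1); the function then picks the most severe one (step 2).

```c
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most severe. */
enum mce_severity {
    MCE_NONE = 0,
    MCE_RECOVERABLE,
    MCE_UNRECOVERABLE
};

/*
 * Step 2 of the quoted description: given the severity each CPU computed
 * for itself in step 1, return the most severe one. An empty array yields
 * MCE_NONE.
 */
static enum mce_severity mce_worst_severity(const enum mce_severity *per_cpu,
                                            size_t nr_cpus)
{
    enum mce_severity worst = MCE_NONE;
    size_t i;

    for (i = 0; i < nr_cpus; i++)
        if (per_cpu[i] > worst)
            worst = per_cpu[i];

    return worst;
}
```

With CPU A reporting MCE_UNRECOVERABLE and CPU B reporting MCE_RECOVERABLE,
the merged result is MCE_UNRECOVERABLE, matching the example in the quote.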

>> Currently we only plan to support these two types. Do you have plans for
>> other recovery actions? And would such an action be better done in Dom0
>> than in Xen?
> 
> Yes!! Solaris maintains a list of broken pages which is even persistent
> across reboots as long as the serial number of the DIMM hasn't changed.
> To do page offlining properly, SUN should design a hypercall allowing
> Dom0 to give Xen this list as early as possible at boot time.

We have a patch to support page offlining (sent as an RFC to the mailing
list), and it already exports a hypercall for Dom0 to ask Xen to offline
pages (this is for proactive action on CE errors from Dom0). Also, as Frank
suggested, we will add a hypercall for Dom0 to get a page's offline status,
so it should be OK.
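
To make the request/query pair concrete, here is a minimal sketch of the
bookkeeping such a pair of operations might sit on top of. It is not the
interface from the RFC patch: struct offline_table, page_offline,
page_is_offline and the fixed capacity are all hypothetical names and
choices made only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OFFLINE_PAGES 64   /* arbitrary capacity for this sketch */

/* Hypothetical table of machine frame numbers that were taken offline. */
struct offline_table {
    unsigned long mfn[MAX_OFFLINE_PAGES];
    size_t count;
};

/* Query side: is this frame already recorded as offline? */
static bool page_is_offline(const struct offline_table *t, unsigned long mfn)
{
    size_t i;

    for (i = 0; i < t->count; i++)
        if (t->mfn[i] == mfn)
            return true;

    return false;
}

/*
 * Request side: record a frame as offline. Repeated requests for the same
 * frame are harmless. Returns 0 on success, -1 if the table is full.
 */
static int page_offline(struct offline_table *t, unsigned long mfn)
{
    if (page_is_offline(t, mfn))
        return 0;
    if (t->count == MAX_OFFLINE_PAGES)
        return -1;

    t->mfn[t->count++] = mfn;
    return 0;
}
```

A persistent list handed over at boot, as in the Solaris case quoted above,
would amount to calling the request side once per saved entry.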

> Further, with our Shanghai CPU, we can disable certain parts of its L3
> cache. Instead of offlining that broken CPU completely, just disable the
> broken part of it. The registers for this are in PCI config space.
> Since Xen delegates PCI access to Dom0, Dom0 can do that.

Sorry, I don't know Shanghai, but I'm a bit surprised that when an error
happens in the cache, we would transfer control to Dom0 and wait for Dom0's
MCA handler to disable the cache; that is a really long code path. Per my
understanding, if there is an issue in the cache, we should clear/disable
the cache ASAP to avoid more severe results, and this is an extreme example
of a case where Xen should handle the MCA. Or maybe I missed something
important in this feature?

BTW, I want to clarify that this patch is for #MC handling (i.e. the
"uncorrected" error in your terms). For hardware-correctable errors (i.e.
"correctable"), Xen will do nothing but pass them to Dom0 as a vIRQ, as our
previous patch
(http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00970.html )
showed, because a CE will not impact the system. So if the "cache index
disable" is meant to disable part of the cache after too many CEs
(Correctable Errors) as a proactive action, I think we are on the same page.
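
The "too many CEs" policy mentioned above could look roughly like the
following on the Dom0 side. This is a sketch under assumptions: struct
ce_counter, ce_record and the threshold value are hypothetical, and the
actual proactive step (page offline, cache index disable) is left out.

```c
#include <stdbool.h>

/* Hypothetical correctable-error counter for one component. */
struct ce_counter {
    unsigned int count;      /* CEs seen so far, capped at threshold */
    unsigned int threshold;  /* assumed policy value chosen by Dom0 */
};

/*
 * Record one correctable error. Returns true once the count has reached
 * the threshold, i.e. when a proactive action would be warranted.
 */
static bool ce_record(struct ce_counter *c)
{
    if (c->count < c->threshold)
        c->count++;

    return c->count >= c->threshold;
}
```

With a threshold of 3, the first two calls return false and every call from
the third onward returns true.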

I attached two foils that are part of our Xen summit presentation. Page 1 is
mainly for #MC handling; page 2 is for CE handling (through CMCI or
polling). Page 1 is described clearly in the patch. Page 2 is what our
previous patch did.

Thanks
-- Yunhong Jiang

> 
> Christoph

Attachment: MCA.pdf
Description: MCA.pdf
