
RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN



xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
>> I think the major differences include: a) How to handle the #MC, i.e.
>> reset the system, decide the impacted components, take recovery actions like
>> page offline etc. b) How to handle errors that impact a guest. As to other
>> items like log/telemetry, I think our implementation doesn't differ much
>> from the current implementation.
> 
> The hardware doesn't know what recovery actions the software can take.
> If page A is faulty, and software maintains a copy in page B, then
> software can turn an uncorrectable error into a correctable one.
> If the hardware is aware of that copy (memory mirroring done by the memory
> controller), then the hardware itself turns the uncorrectable error
> into a correctable one and reports a correctable error.
> 
> Therefore, I don't see why flags other than correctable and uncorrectable
> are needed at all.

Christoph, thanks for your reply.

I think recoverable means the VMM/OS can take a recovery action like page
offline, while unrecoverable means the VMM/OS can't do anything and we have to
reboot. The main reason we need these flags is that MCA handling requires
several steps. For example, when multiple MCEs happen on multiple CPUs, firstly
each CPU checks its own severity, and secondly we need to find the CPU with the
highest severity and take action accordingly. For example, CPU A may get
unrecoverable while CPU B gets recoverable; they will check the information and
the results, and the final result will be unrecoverable.
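
To illustrate the two-step idea, here is a minimal sketch in C. It is not
taken from the Xen tree: the enum, its values and the function name
mce_worst_severity are made up for this mail. It only assumes that severities
can be ordered so that a larger value is worse, which is why an intermediate
"recoverable" level is useful when results from several CPUs are combined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered so that a larger value is worse. */
enum mce_severity {
    MCE_SEV_NONE = 0,        /* no error seen on this CPU */
    MCE_SEV_CORRECTED,       /* hardware already corrected the error */
    MCE_SEV_RECOVERABLE,     /* VMM/OS can act, e.g. offline a page */
    MCE_SEV_UNRECOVERABLE    /* nothing can be done, reboot required */
};

/*
 * Step 2: combine the per-CPU results produced in step 1.
 * The system-wide severity is the worst one reported by any CPU.
 */
enum mce_severity mce_worst_severity(const enum mce_severity *per_cpu,
                                     size_t ncpus)
{
    enum mce_severity worst = MCE_SEV_NONE;
    size_t i;

    assert(per_cpu != NULL || ncpus == 0);

    for (i = 0; i < ncpus; i++)
        if (per_cpu[i] > worst)
            worst = per_cpu[i];

    return worst;
}
```

With CPU A reporting MCE_SEV_UNRECOVERABLE and CPU B reporting
MCE_SEV_RECOVERABLE, the combined result is MCE_SEV_UNRECOVERABLE, matching
the example above.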

> 
> 
> After some thinking on taking some quick actions, I can
> agree on it if it meets the condition below. Be aware, error analysis
> is highly CPU vendor and even CPU family/model specific. Doing a
> complete analysis as Solaris does blows Xen up a *lot*.

I didn't check the Solaris code, so can Gavin or Frank give us more information?
At least currently it will not be large AFAIK, and if we do need model-specific
support (I don't know of such a requirement now, and I suppose it will not be
common if it exists; please correct me if I'm wrong), dom0 can inform Xen of it.
 
> 
> Therefore, a *cheap* error analysis must be enough to figure out
> if recovery actions like page-offlining or cpu-offlining
> are *obviously* the only right thing to do.

Currently we only plan to support these two types; do you have plans for other
recovery actions? And would such an action be better done in Dom0 than in Xen?
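
For discussion, the split between quick actions in Xen and everything else in
Dom0 could look roughly like the sketch below. This is only an illustration
under assumptions: struct mce_quick_info, its fields and mce_quick_decide are
hypothetical names, and the three boolean inputs stand in for whatever cheap,
vendor-neutral information is actually available at #MC time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical set of outcomes of the quick check done inside Xen. */
enum mce_action {
    MCE_ACT_PAGE_OFFLINE,    /* obvious memory error with a known address */
    MCE_ACT_CPU_OFFLINE,     /* obvious CPU-internal error */
    MCE_ACT_DEFER_TO_DOM0    /* not obvious: let Dom0 analyse and decide */
};

/* Hypothetical result of a cheap look at the error (names are assumptions). */
struct mce_quick_info {
    bool addr_valid;         /* a faulting physical address was reported */
    bool is_memory_error;    /* error is attributed to memory */
    bool is_cpu_internal;    /* error is attributed to the CPU itself */
};

/*
 * Act in Xen only when exactly one recovery action is obviously right;
 * in every ambiguous case hand the decision to Dom0.
 */
enum mce_action mce_quick_decide(const struct mce_quick_info *info)
{
    assert(info != NULL);

    if (info->is_memory_error && info->addr_valid && !info->is_cpu_internal)
        return MCE_ACT_PAGE_OFFLINE;

    if (info->is_cpu_internal && !info->is_memory_error)
        return MCE_ACT_CPU_OFFLINE;

    return MCE_ACT_DEFER_TO_DOM0;
}
```

Any vendor or family/model specific analysis would then sit behind the
MCE_ACT_DEFER_TO_DOM0 path rather than in Xen.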

Thanks
-- Yunhong Jiang

> 
> If this is not the case, then let Dom0 decide what to do.

> 
> Christoph
> 
> 
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel


 

