
Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN



On Thursday 19 February 2009 10:13:18 Jiang, Yunhong wrote:
> xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> > On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
> >> I think the major differences are: a) how to handle the #MC, i.e.
> >> reset the system, determine the impacted components, take recovery
> >> actions such as page offlining, etc.; b) how to handle errors that impact
> >> a guest. As for other items such as logging/telemetry, I don't think our
> >> implementation differs much from the current implementation.
> >
> > The hardware doesn't know what recovery actions the software can take.
> > If page A is faulty, and software maintains a copy in page B, then
> > software can turn an uncorrectable error into a correctable one.
> > If the hardware is aware of that copy (memory mirroring done by the memory
> > controller), then the hardware itself turns the uncorrectable error
> > into a correctable one and reports a correctable error.
> >
> > Therefore, I don't see why flags other than correctable and uncorrectable
> > are needed at all.
>
> Christoph, thanks for your reply.
>
> I think recoverable means the VMM/OS can take a recovery action such as page
> offlining, while unrecoverable means the VMM/OS can't do anything and we
> have to reboot.

Ok, so we have different interpretations of what correctable and
uncorrectable mean.
Uncorrectable in your interpretation means neither hardware nor software can
do anything.
Uncorrectable in my interpretation means the hardware can't correct it, but
software may have more information and be able to correct it.
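
To illustrate my interpretation, here is a minimal sketch, not actual Xen code. All names (handle_hw_uncorrectable, enum sw_outcome, SKETCH_PAGE_SIZE) are hypothetical and only show the idea: the hardware has already reported the error as uncorrectable, and software decides whether it can still correct it because it holds an intact copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size, for illustration only */

/* Hypothetical outcome of software-level handling. */
enum sw_outcome {
    SW_CORRECTED,   /* software restored the data from its own copy */
    SW_UNCORRECTED  /* no copy available; the error stays uncorrectable */
};

/*
 * Hypothetical handler, called after the hardware reported an
 * uncorrectable error for some page A. 'copy' points to page B if
 * software maintains one, or is NULL otherwise. 'spare' is a healthy
 * page that takes over the role of the faulty page.
 */
static enum sw_outcome handle_hw_uncorrectable(unsigned char *spare,
                                               const unsigned char *copy)
{
    assert(spare != NULL);

    if (copy == NULL)
        return SW_UNCORRECTED;

    /* Software knows more than the hardware: restore from the copy. */
    memcpy(spare, copy, SKETCH_PAGE_SIZE);
    return SW_CORRECTED;
}
```

With a copy available the error ends up corrected from the software's point of view, even though the hardware flagged it as uncorrectable.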

> The main reason we need these flags is that MCA handling requires several
> steps. For example, when multiple MCEs hit multiple CPUs, each CPU first
> checks its own severity, and then we need to find the most severe CPU
> and take action. For example, CPU A may get unrecoverable while CPU B
> gets recoverable; they will check the information and the results, and the
> final result will be unrecoverable.
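
As a rough sketch of the two-step check described in the quoted text (each CPU determines its own severity, then the worst result decides): the enum values and merge_severity() below are hypothetical names, not taken from any existing patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most severe. */
enum mc_severity {
    MC_SEV_NONE = 0,
    MC_SEV_CORRECTED,
    MC_SEV_RECOVERABLE,
    MC_SEV_UNRECOVERABLE
};

/*
 * Step 2 of the scheme: combine the per-CPU results from step 1.
 * The most severe per-CPU result becomes the system-wide result.
 */
static enum mc_severity merge_severity(const enum mc_severity *per_cpu,
                                       size_t ncpus)
{
    enum mc_severity worst = MC_SEV_NONE;

    assert(per_cpu != NULL || ncpus == 0);

    for (size_t i = 0; i < ncpus; i++)
        if (per_cpu[i] > worst)
            worst = per_cpu[i];

    return worst;
}
```

For the example given, CPU A unrecoverable and CPU B recoverable, the merged result is unrecoverable.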

I brought up an example of a broken memory page for my argument,
you bring up a broken CPU for yours.

We need to find a common denominator to compare.

If a CPU is completely broken and you are on UP, then the game is over.
Not even a reboot can help.
On an SMP system, offline the CPU and inform Dom0.

> > After some thought about taking quick actions, I can
> > agree to it if it meets the condition below. Be aware that error analysis
> > is highly CPU vendor specific and even CPU family/model specific. Doing a
> > complete analysis as Solaris does blows Xen up a *lot*.
>
> I didn't check the Solaris code, so can Gavin or Frank give us more
> information? At least currently it will not be large AFAIK, and if we do
> need model-specific support (I don't know of such a requirement now, and I
> suppose it will not be common if it exists; please correct me if I'm
> wrong), dom0 can inform Xen of it.
>
> > Therefore, a *cheap* error analysis must be enough to figure out
> > whether recovery actions like page offlining or CPU offlining
> > are *obviously* the only right thing to do.
>
> Currently we only plan to support these two types. Do you have plans for
> other recovery actions? And would such an action be done better in Dom0
> than in Xen?

Yes!! Solaris maintains a list of broken pages which is even persistent
across reboots as long as the serial number of the DIMM has not changed.
To do page offlining properly, Sun should design a hypercall allowing
Dom0 to give Xen this list as early as possible at boot time.
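
To make the idea concrete, here is a minimal sketch of what Dom0 might do before handing such a list to Xen. No such hypercall exists in this discussion yet, so struct bad_page, its fields and filter_bad_pages() are purely hypothetical; the sketch only shows the "still valid if the DIMM serial number is unchanged" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one retired page, as Dom0 might persist it. */
struct bad_page {
    uint64_t mfn;           /* machine frame number of the faulty page */
    unsigned int dimm_slot; /* slot of the DIMM that backed the page */
    uint64_t dimm_serial;   /* DIMM serial number when the fault was logged */
};

/*
 * Keep only entries whose DIMM is still installed, i.e. the serial
 * number now found in the slot matches the recorded one. Entries for
 * replaced DIMMs or unknown slots are dropped. The surviving entries
 * are compacted to the front of 'list'; the return value is their count.
 */
static size_t filter_bad_pages(struct bad_page *list, size_t n,
                               const uint64_t *current_serial, size_t nslots)
{
    size_t kept = 0;

    assert(list != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        unsigned int slot = list[i].dimm_slot;

        if (slot < nslots && current_serial[slot] == list[i].dimm_serial)
            list[kept++] = list[i];
    }
    return kept;
}
```

The filtered list is what a boot-time hypercall would then pass to Xen, so that the pages are never handed out in the first place.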

Further, with our Shanghai CPU, we can disable certain parts of its L3 cache.
Instead of offlining the broken CPU completely, just disable the broken
part of it. The registers for this are in PCI config space. Since Xen delegates
PCI access to Dom0, Dom0 can do that.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

