
Re: [Xen-devel] RFC: MCA/MCE concept



Hi,

On 05/30/07 10:10, Christoph Egger wrote:

[cut]

2b) error == UE and UE impacts Xen or Dom0:
A very important aspect here is how you want to classify what impact an
uncorrectable has - generally, I can see very few situations where you
could confine the impact to a sub-portion of the system (i.e. a single
domU, dom0, or Xen). The general rule in my opinion must be to halt the
system, the question just is how likely it is that you can get a
meaningful message out (to screen, serial, or logs) that can help
analyze the problem afterwards. If it is somewhat likely, then dom0
should be involved, otherwise Xen should just shut down the system.
Here you can best help out using HW features to handle errors.
AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.

CPUs such as SPARC feature data poisoning. That would be the
handiest technique to use here.
But that assumes the error is recoverable (i.e. no other data got
corrupted). You still didn't clarify how you intend to determine the
impact an uncorrectable error had.

I know. I am lacking a sudden inspiration here.
That's why I am discussing this here before writing code that goes nowhere.
Anyone here with a flash of genius? :-)

For a first phase I'd suggest that treating an uncorrectable error as
terminal to the entire system (e.g., panic the hypervisor or set up a hardware
reset mechanism such as Sync Flood) is practical and safe, and allows
us to concentrate on getting some more basic elements in place.
As Christoph says we really need some form of data poisoning supported
on the platform to really be able to isolate the impact of an uncorrectable
error.  In the absence of such support I think some fancy heuristics could
work in some limited cases (e.g., an uncorrectable memory error on a page
that only a single domU has a mapping to and which is not shared with any
other domain, not even via a front/backend driver), but the penalty for bugs
in those heuristics is silent data corruption, which is the ultimate crime.
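To make that concrete, such a heuristic might look something like the
sketch below.  The page structure and field names here are illustrative
assumptions, not Xen's real struct page_info; the point is just that the
error is only confined to a domU when page ownership and sharing state
both permit it.

```c
#include <stdbool.h>

/* Hypothetical page descriptor -- not Xen's real struct page_info. */
struct page {
    int owner_domid;   /* owning domain; 0 for dom0, -1 for Xen itself */
    int share_count;   /* extra mappings, e.g. grants held by a
                        * front/backend driver pair                    */
};

enum ue_impact { UE_FATAL, UE_DOMU_ONLY };

/* Classify the blast radius of an uncorrectable memory error.  Only
 * when the page belongs to exactly one unprivileged domain and has no
 * shared mappings can the impact be confined to that domU; everything
 * else is treated as fatal to the whole system.                       */
enum ue_impact classify_ue(const struct page *pg)
{
    if (pg->owner_domid > 0 && pg->share_count == 0)
        return UE_DOMU_ONLY;
    return UE_FATAL;
}
```

Note the conservative default: any case the heuristic cannot prove
isolated falls through to UE_FATAL, since the cost of a wrong answer is
the silent corruption mentioned above.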


3a) DomU is a PV guest:
      if DomU installed an MCA event handler, it gets notified to perform
         self-healing
      if DomU did not install an MCA event handler, notify Dom0 to do
         some operations on DomU (case II)
      if neither DomU nor Dom0 installed an MCA event handler,
         then Xen kills DomU
3b) DomU is a HVM guest:
      if DomU features a PV driver then behave as in 3a)
What significance do pv drivers have here? Or do you mean a pv MCA
driver?
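Whatever driver mechanism is meant, the 3a) decision chain itself is
simple; here is a minimal sketch, where the boolean flags stand in for
whatever handler-registration mechanism (hypercall, PV MCA driver for
HVM) is eventually chosen:

```c
#include <stdbool.h>

enum mce_action { NOTIFY_DOMU, NOTIFY_DOM0, KILL_DOMU };

/* Case 3a) as a decision chain: prefer letting the domU self-heal,
 * fall back to dom0 acting on the domU's behalf, and only kill the
 * domU when nobody registered a handler.                              */
enum mce_action dispatch_domu_mce(bool domu_has_handler,
                                  bool dom0_has_handler)
{
    if (domu_has_handler)
        return NOTIFY_DOMU;   /* domU performs self-healing            */
    if (dom0_has_handler)
        return NOTIFY_DOM0;   /* dom0 does some operations on the domU */
    return KILL_DOMU;         /* last resort: Xen kills the domU       */
}
```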
[cut]

My feeling is that the hypervisor and dom0 own the hardware and as such
all hardware fault management should reside there.  So we should never
deliver any form of #MC to a domU, nor should a poll of MCA state from
a domU ever observe valid state (e.g., make the RDMSR return 0).
So all handling, logging and diagnosis as well as hardware response actions
(such as to deploy an online spare chip-select) are controlled
in the hypervisor/dom0 combination.  That seems a consistent model - e.g.,
if a domU is migrated to another system it should not carry the
diagnosis state of the original system across etc, since that belongs with
the one domain that cannot migrate.
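The "make the RDMSR return 0" part could be sketched roughly as follows.
The MSR numbers are the architectural x86 machine-check ones; the
intercept function and the fixed six-bank range are assumptions for
illustration (a real implementation would size the bank range from the
host's MCG_CAP):

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 machine-check MSRs: MCG_CAP/STATUS/CTL at 0x179-0x17b, and the
 * per-bank registers starting at MC0_CTL (0x400).                     */
#define MSR_IA32_MCG_CAP   0x179u
#define MSR_IA32_MCG_CTL   0x17bu
#define MSR_IA32_MC0_CTL   0x400u
#define MSR_MC_BANK_END    (MSR_IA32_MC0_CTL + 6 * 4 - 1)  /* assumed */

static bool is_mca_msr(uint32_t msr)
{
    return (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL) ||
           (msr >= MSR_IA32_MC0_CTL && msr <= MSR_MC_BANK_END);
}

/* Hypothetical RDMSR intercept for a domU: machine-check state reads
 * as zero, so the guest never observes valid MCA banks; all other
 * MSRs pass through whatever value the existing emulation yields.     */
uint64_t domu_rdmsr(uint32_t msr, uint64_t emulated_value)
{
    return is_mca_msr(msr) ? 0 : emulated_value;
}
```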

But that is not to say that (I think at a future phase) domU should not
participate in a higher-level fault management function, at the direction
of the hypervisor/dom0 combo.  For example if/when we can isolate an
uncorrectable error to a single domU we could forward such an event to
the affected domU if it has registered its ability/interest in such
events.  These won't be in the form of a faked #MC or anything;
instead they'd be some form of synchronous trap experienced when
the affected domU context next resumes on a CPU.  The intelligent domU handler
can then decide whether the domU must panic, whether it could simply
kill the affected process etc.  Those details are clearly sketchy, but the
idea is to up-level the communication to a domU to be more like
"you're broken" rather than "here's a machine-level hardware error for
you to interpret and decide what to do with".
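As one way of picturing that up-levelled event, the sketch below names a
victim (a guest page) rather than exposing raw MCA bank data, and shows
the kind of policy an intelligent in-guest handler might apply.  All
names here are purely illustrative, not a proposed ABI:

```c
#include <stdbool.h>

/* Up-levelled fault event as delivered to a registered domU via a
 * synchronous trap on next resume: "you're broken", scoped as tightly
 * as the hypervisor/dom0 combo could determine.                       */
enum fault_scope { FAULT_DOMAIN_FATAL, FAULT_PAGE_LOST };

struct domu_fault_event {
    enum fault_scope scope;
    unsigned long    gpfn;   /* lost guest frame if FAULT_PAGE_LOST */
};

/* Example in-guest policy: if only a user-space page was lost, kill
 * the owning process (return 0); otherwise the guest must panic
 * (return 1).                                                         */
int domu_fault_policy(const struct domu_fault_event *ev, bool page_is_user)
{
    if (ev->scope == FAULT_PAGE_LOST && page_is_user)
        return 0;
    return 1;
}
```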

Gavin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

