
RE: [Xen-devel] RFC: MCA/MCE concept

  • To: "Gavin Maltby" <Gavin.Maltby@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx
  • From: "Petersson, Mats" <Mats.Petersson@xxxxxxx>
  • Date: Wed, 30 May 2007 17:03:55 +0200
  • Delivery-date: Wed, 30 May 2007 08:03:12 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AceiweeIdhki3L9NSz6dxRZ0hVG3ZQAAJZHQ
  • Thread-topic: [Xen-devel] RFC: MCA/MCE concept

> My feeling is that the hypervisor and dom0 own the hardware and as such
> all hardware fault management should reside there.  So we should never
> deliver any form of #MC to a domU, nor should a poll of MCA state from
> a domU ever observe valid state (e.g, make the RDMSR return 0).
> So all handling, logging and diagnosis as well as hardware response
> actions (such as to deploy an online spare chip-select) are controlled
> in the hypervisor/dom0 combination.  That seems a consistent model -
> e.g., if a domU is migrated to another system it should not carry the
> diagnosis state of the original system across etc, since that belongs
> with the one domain that cannot migrate.

I agree entirely with this. 
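
To make the "RDMSR returns 0" part concrete, here is a rough sketch in C.
It is not existing Xen code: struct domain_info, guest_rdmsr_filter() and
MCA_BANK_MSR_LIMIT are names and values made up for illustration. Only the
MSR numbers for MCG_CAP/MCG_STATUS/MCG_CTL and the start of the bank
registers are the architectural ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural machine-check MSR numbers. */
#define MSR_IA32_MCG_CAP    0x179u
#define MSR_IA32_MCG_STATUS 0x17au
#define MSR_IA32_MCG_CTL    0x17bu
#define MSR_IA32_MC0_CTL    0x400u  /* bank i uses 0x400 + 4*i .. +3 */

/* Assumption for this sketch: treat 0x400..0x47f as bank registers. */
#define MCA_BANK_MSR_LIMIT  0x480u

/* Hypothetical per-domain state; true only for the hypervisor/dom0 side. */
struct domain_info {
    bool owns_hardware;
};

static bool is_mca_msr(uint32_t msr)
{
    if (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL)
        return true;
    return msr >= MSR_IA32_MC0_CTL && msr < MCA_BANK_MSR_LIMIT;
}

/*
 * Hypothetical filter applied to an intercepted guest RDMSR.
 * A domU never observes valid MCA state: every MCA register reads as 0,
 * which also makes MCG_CAP report zero banks. Other MSRs pass through.
 */
uint64_t guest_rdmsr_filter(const struct domain_info *d, uint32_t msr,
                            uint64_t hw_value)
{
    assert(d != NULL);
    if (!d->owns_hardware && is_mca_msr(msr))
        return 0;
    return hw_value;
}
```

With a filter of this shape, a domU polling its bank status registers
simply never sees a VALID bit, so an unmodified guest has nothing to log
or act on.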

> But that is not to say that (I think at a future phase) domU should not
> participate in a higher-level fault management function, at the
> direction of the hypervisor/dom0 combo.  For example if/when we can
> isolate an uncorrectable error to a single domU we could forward such
> an event to the affected domU if it has registered its ability/interest
> in such events.  These won't be in the form of a faked #MC or anything,
> instead they'd be some form of synchronous trap experienced when next
> the affected domU context resumes on CPU.  The intelligent domU handler
> can then decide whether the domU must panic, whether it could simply
> kill the affected process etc.  Those details are clearly sketchy, but
> the idea is to up-level the communication to a domU to be more like
> "you're broken" rather than "here's a machine-level hardware error for
> you to interpret and decide what to do with".

Yes, this makes much more sense than forwarding #MC, as the guest would
have a hard time doing anything useful with it. As far as I know, most
uncorrectable errors are effectively fatal in most commercial
non-Enterprise OSes anyway - e.g. in Windows XP or Server 2003, it always
ends in a blue-screen - which is hardly any better than the guest being
"humanely euthanized" by Dom0.

I take it this would be some sort of hypercall (available through the
regular PV-driver interface for HVM guests) to say "Let me know if I'm
broken - trap on vector X". 
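
Roughly what I have in mind, as a sketch in C. None of this is an
existing Xen interface: struct guest, fault_notify_register(),
fault_mark() and fault_on_resume() are hypothetical names, and the rule
that an unregistered guest gets destroyed is just one possible policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the hypervisor does when a guest context is about to resume. */
enum fault_action {
    FAULT_NONE,            /* nothing pending, resume normally       */
    FAULT_DELIVER_TRAP,    /* inject the guest's registered vector   */
    FAULT_DESTROY_DOMAIN   /* guest never registered: dom0 kills it  */
};

/* Hypothetical per-guest state kept by the hypervisor. */
struct guest {
    bool    registered;     /* guest asked to be told "you're broken" */
    uint8_t vector;         /* vector X chosen by the guest           */
    bool    fault_pending;  /* an uncorrectable error was isolated    */
};

/* "Let me know if I'm broken - trap on vector X". Returns 0 on success. */
int fault_notify_register(struct guest *g, unsigned int vector)
{
    assert(g != NULL);
    /* Vectors 0-31 are reserved for CPU exceptions on x86. */
    if (vector < 32 || vector > 255)
        return -1;
    g->registered = true;
    g->vector = (uint8_t)vector;
    return 0;
}

/* Called by the hypervisor/dom0 side once an error is tied to this guest. */
void fault_mark(struct guest *g)
{
    assert(g != NULL);
    g->fault_pending = true;
}

/* Checked when the guest context next resumes on a CPU. */
enum fault_action fault_on_resume(struct guest *g, uint8_t *vector_out)
{
    assert(g != NULL && vector_out != NULL);
    if (!g->fault_pending)
        return FAULT_NONE;
    g->fault_pending = false;
    if (!g->registered)
        return FAULT_DESTROY_DOMAIN;
    *vector_out = g->vector;
    return FAULT_DELIVER_TRAP;
}
```

The point of deferring to the resume path is that the guest sees a
synchronous "you're broken" event in its own context, rather than a
faked #MC arriving at an arbitrary moment.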

> Gavin
