Re: [Xen-devel] RFC: MCA/MCE concept

To: xen-devel@xxxxxxxxxxxxxxxxxxx
From: "Christoph Egger" <Christoph.Egger@xxxxxxx>
Date: Wed, 6 Jun 2007 15:24:33 +0200
Cc: Gavin Maltby <Gavin.Maltby@xxxxxxx>, Keir Fraser <keir@xxxxxxxxxxxxx>
On Wednesday 06 June 2007 14:25:26 Gavin Maltby wrote:
> Hi,
>
> On 06/06/07 12:57, Christoph Egger wrote:
> >>>> For the first I've assumed so far that an event channel notification
> >>>> of the MCA event will suffice;  as long as the hypervisor only polls
> >>>> for correctable MCA errors at a low-frequency rate (currently 15s
> >>>> interval) there is no danger of spamming that single notification.
> >>>
> >>> Why polling?
> >>
> >> Polling for correctable errors, but #MC as usual for others.  Setting
> >> MCi_CTL bits for correctable errors does not produce a machine check,
> >> so polling is the only approach unless one sets additional (and
> >> undocumented, certainly for AMD chips) config bits.  What I was getting
> >> at here is that polling at largish intervals for correctables is
> >> the correct approach - trapping for them or polling at high frequency
> >> is bad because in cases where you have some form of solid correctable
> >> error (say a single bad pin in a DIMM socket affecting one or two ranks
> >> of that DIMM but never able to produce a UE) the trap handling and
> >> diagnosis software consume the machine and things make little useful
> >> forward progress.
> >
> > I still don't see why #MC for all kinds of errors is bad.
>
> I'm talking about whether the hypervisor takes a machine check
> for an error or polls for it.  We do not want #MC for correctable
> errors stopping the hypervisor from making progress.  And if the
> hypervisor poll interval were too small, a solid error would again
> keep the hypervisor busy producing (mostly/all duplicate)
> error telemetry, and the diagnosis code in dom0 would burn
> CPU cycles, too.
>
> How errors observed by the hypervisor, be they from #MC or from
> a poll, are propagated to the domains is unimportant from this
> point of view - e.g., if we decide to take error telemetry
> discovered via a poll in the hypervisor and propagate it
> to the domain as if it were indistinguishable from a machine
> check, that will not hurt or limit the domain's processing.
>
> An untested design I had in mind, unashamedly influenced by what
> we do in Solaris, was to have some common memory shared between
> hypervisor and domain into which the hypervisor produces
> error telemetry and the domain consumes that telemetry.

That is the struct vcpu_info in the public xen.h. It is accessible
from the hypervisor as well as from the guest.

> Producing and consuming is lockless using compare-and-swap.
> There are two queues in this shared memory - one for uncorrectable
> error telemetry and one for correctable error telemetry.  When the
> domain gets whatever event to notify it of telemetry for processing
> it processes the queues;  the event would be synchronous for
> uncorrectable errors (ie, domain must process the telemetry
> right now) or asynchronous in the case of correctable errors
> (process when convenient).  The separation of CE and UE queues
> stops CEs from flooding the more important UE events (you can
> always drop CEs if there is no more space, but you can never
> drop UEs).
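
To make the two-queue idea concrete, here is a minimal sketch in C11. It is
not code from Xen or Solaris: every name in it (telem_rec, telem_queue,
telem_shared, telem_report, TELEM_SLOTS) is hypothetical, and the record
layout is a placeholder. The ring is a bounded queue with per-slot sequence
numbers, with slots claimed by compare-and-swap, which is one common way to
get lockless produce/consume; the scheme described above may well differ in
detail. What happens when the UE queue is full is not specified in the
thread, so the sketch only reports that case to the caller instead of
dropping the record.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TELEM_SLOTS 8u                      /* hypothetical ring size */
static_assert((TELEM_SLOTS & (TELEM_SLOTS - 1)) == 0, "must be a power of two");

struct telem_rec { uint32_t bank; uint64_t status, addr; };  /* placeholder layout */
struct telem_cell { atomic_size_t seq; struct telem_rec rec; };
struct telem_queue {
    struct telem_cell cell[TELEM_SLOTS];
    atomic_size_t head;                     /* next position to produce into */
    atomic_size_t tail;                     /* next position to consume from */
};
struct telem_shared {                       /* would live in the shared memory */
    struct telem_queue ce, ue;
    atomic_ulong ce_dropped;                /* count of CEs discarded when full */
};
enum telem_result { TELEM_QUEUED, TELEM_CE_DROPPED, TELEM_UE_OVERFLOW };

void telem_queue_init(struct telem_queue *q)
{
    for (size_t i = 0; i < TELEM_SLOTS; i++)
        atomic_init(&q->cell[i].seq, i);    /* slot i is free for position i */
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

void telem_shared_init(struct telem_shared *s)
{
    telem_queue_init(&s->ce);
    telem_queue_init(&s->ue);
    atomic_init(&s->ce_dropped, 0);
}

/* Producer side: returns false if the ring is full. */
bool telem_put(struct telem_queue *q, const struct telem_rec *r)
{
    size_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        struct telem_cell *c = &q->cell[pos & (TELEM_SLOTS - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)(seq - pos);
        if (diff == 0) {
            /* Slot is free: claim it by advancing head with CAS. */
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                c->rec = *r;
                atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
                return true;
            }                               /* CAS failure reloaded pos; retry */
        } else if (diff < 0) {
            return false;                   /* slot not yet consumed: full */
        } else {
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}

/* Consumer side: returns false if the ring is empty. */
bool telem_get(struct telem_queue *q, struct telem_rec *out)
{
    size_t pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        struct telem_cell *c = &q->cell[pos & (TELEM_SLOTS - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)(seq - (pos + 1));
        if (diff == 0) {
            if (atomic_compare_exchange_weak_explicit(&q->tail, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *out = c->rec;
                /* Hand the slot back to producers for the next lap. */
                atomic_store_explicit(&c->seq, pos + TELEM_SLOTS,
                                      memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;                   /* nothing published here: empty */
        } else {
            pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
        }
    }
}

/* Policy from the thread: CEs may be dropped when full, UEs never silently. */
enum telem_result telem_report(struct telem_shared *s,
                               const struct telem_rec *r, bool uncorrectable)
{
    if (uncorrectable)
        return telem_put(&s->ue, r) ? TELEM_QUEUED : TELEM_UE_OVERFLOW;
    if (telem_put(&s->ce, r))
        return TELEM_QUEUED;
    atomic_fetch_add(&s->ce_dropped, 1);
    return TELEM_CE_DROPPED;
}
```

Because the two rings are independent, a flood of correctable errors can only
fill the CE ring and bump ce_dropped; it cannot take slots away from
uncorrectable error records.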

So we use the asynchronous event mechanism VIRQ_DOM_EXC to report
correctable errors to Dom0, and the NMI mechanism to report uncorrectable
errors to Dom0 and DomU, right?

The fact that VIRQ_DOM_EXC is for Dom0 only doesn't hurt here, since we never 
report CEs to DomUs.
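
Spelled out as a decision function, the routing proposed here would look
roughly like the sketch below. The enum and function names are made up for
illustration and are not Xen identifiers; only the policy itself (CE via the
Dom0-only virtual IRQ, UE via NMI to Dom0 and DomU, no CE reports to DomUs)
comes from this mail.

```c
#include <assert.h>
#include <stdbool.h>

enum mc_class  { MC_CORRECTABLE, MC_UNCORRECTABLE };        /* hypothetical */
enum mc_notify { MC_NOTIFY_NONE, MC_NOTIFY_VIRQ, MC_NOTIFY_NMI };

/* Pick the notification mechanism for one error and one target domain. */
enum mc_notify mc_notify_for(enum mc_class cls, bool is_dom0)
{
    assert(cls == MC_CORRECTABLE || cls == MC_UNCORRECTABLE);

    if (cls == MC_UNCORRECTABLE)
        return MC_NOTIFY_NMI;           /* synchronous: Dom0 and DomU alike */

    /* Correctable: asynchronous virtual IRQ, delivered to Dom0 only. */
    return is_dom0 ? MC_NOTIFY_VIRQ : MC_NOTIFY_NONE;
}
```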


> [cut]
>
> > After some code reading I found a nmi_pending, nmi_masked and nmi_addr in
>
> [cut]
>
> Still chewing on that ...


Christoph


-- 
AMD Saxony, Dresden, Germany
Operating System Research Center



