
xen-devel

Re: [Xen-devel] RFC: MCA/MCE concept

Hi,

On 06/06/07 12:57, Christoph Egger wrote:

For the first I've assumed so far that an event channel notification
of the MCA event will suffice;  as long as the hypervisor only polls
for correctable MCA errors at a low-frequency rate (currently 15s
interval) there is no danger of spamming that single notification.
Why polling?
Polling for correctable errors, but #MC as usual for others.  Setting
MCi_CTL bits for correctable errors does not produce a machine check,
so polling is the only approach unless one sets additional (and
undocumented, certainly for AMD chips) config bits.  What I was getting
at here is that polling at largish intervals for correctables is
the correct approach.  Trapping for them, or polling at a high
frequency, is bad: in cases where you have some form of solid
correctable error (say a single bad pin in a DIMM socket affecting one
or two ranks of that DIMM but never able to produce a UE), the trap
handling and diagnosis software consume the machine and little useful
forward progress is made.

I still don't see why #MC for all kinds of errors is bad.

I'm talking about whether the hypervisor takes a machine check
for an error or polls for it.  We do not want #MC for correctable
errors stopping the hypervisor from making progress.  And if the
hypervisor poll interval were too small, a solid error would again
keep the hypervisor busy producing (mostly or entirely duplicate)
error telemetry, and the diagnosis code in dom0 would burn
CPU cycles, too.

How errors observed by the hypervisor, be they from #MC or from
a poll, are propagated to the domains is unimportant from this
point of view.  E.g., if we decide to take error telemetry
discovered via a poll in the hypervisor and propagate it
to the domain, pretending it is indistinguishable from a machine
check, that will not hurt or limit the domain's processing.

An untested design I had in mind, unashamedly influenced by what
we do in Solaris, was to have some common memory shared between
hypervisor and domain into which the hypervisor produces
error telemetry and the domain consumes that telemetry.
Producing and consuming are lockless, using compare-and-swap.
There are two queues in this shared memory: one for uncorrectable
error telemetry and one for correctable error telemetry.  When the
domain gets whatever event notifies it of telemetry for processing,
it processes the queues;  the event would be synchronous for
uncorrectable errors (i.e., the domain must process the telemetry
right now) or asynchronous in the case of correctable errors
(process when convenient).  The separation of CE and UE queues
stops CEs from flooding the more important UE events (you can
always drop CEs if there is no more space, but you can never
drop UEs).
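
To make the shape of that idea concrete, here is a rough and equally
untested sketch in C11 of the two-queue layout with compare-and-swap
slot claiming.  Everything named here (struct telem, struct
telem_queue, struct shared_page, q_produce, q_consume, QLEN) is
hypothetical and invented for illustration; none of it is existing
Xen or Solaris code, and the record layout is a placeholder.  It
assumes any number of producers (hypervisor CPUs) and a single
consumer (the domain) per queue.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QLEN 4                  /* slots per queue; hypothetical size */

struct telem {                  /* hypothetical telemetry record */
    uint64_t status;            /* e.g. a copy of an MCi_STATUS value */
    uint64_t addr;              /* e.g. a copy of an MCi_ADDR value */
};

struct slot {
    atomic_bool ready;          /* true once the record is fully written */
    struct telem rec;
};

struct telem_queue {
    atomic_uint head;           /* next slot the domain will consume */
    atomic_uint tail;           /* next slot a producer will claim */
    atomic_uint dropped;        /* count of records rejected when full */
    struct slot slots[QLEN];
};

/* Layout of the memory shared between hypervisor and domain. */
struct shared_page {
    struct telem_queue ue;      /* uncorrectable errors: must not be lost */
    struct telem_queue ce;      /* correctable errors: may be dropped */
};

/*
 * Producer side: claim a slot by compare-and-swap on tail, fill it in,
 * then publish it.  Returns false if the queue is full.  For the CE
 * queue a false return just means the record was dropped; for the UE
 * queue the caller has to escalate somehow (an assumption, the policy
 * is not part of this sketch).
 */
static bool q_produce(struct telem_queue *q, const struct telem *t)
{
    unsigned tail = atomic_load(&q->tail);

    for (;;) {
        if (tail - atomic_load(&q->head) >= QLEN) {
            atomic_fetch_add(&q->dropped, 1);
            return false;
        }
        /* On failure the CAS reloads tail with the current value. */
        if (atomic_compare_exchange_weak(&q->tail, &tail, tail + 1))
            break;
    }

    struct slot *s = &q->slots[tail % QLEN];
    s->rec = *t;
    atomic_store(&s->ready, true);      /* publish to the consumer */
    return true;
}

/*
 * Consumer side (single consumer): take the record at head if it has
 * been published.  Returns false if the queue is empty or the next
 * record is still being written.
 */
static bool q_consume(struct telem_queue *q, struct telem *out)
{
    unsigned head = atomic_load(&q->head);
    struct slot *s = &q->slots[head % QLEN];

    if (!atomic_load(&s->ready))
        return false;

    *out = s->rec;
    atomic_store(&s->ready, false);     /* free the slot ... */
    atomic_store(&q->head, head + 1);   /* ... before exposing it */
    return true;
}
```

On a UE notification the domain would drain the ue queue immediately
with q_consume; on a CE notification it would drain the ce queue
whenever convenient.  Because the queues are separate, a flood of CEs
can only fill (and overflow) the ce queue and never takes slots away
from UEs.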

[cut]

After some code reading I found a nmi_pending, nmi_masked and nmi_addr in
[cut]

Still chewing on that ...

Cheers

Gavin
