RE: [Xen-devel] RFC: MCA/MCE concept
> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> Christoph Egger
> Sent: 01 June 2007 09:12
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Gavin Maltby
> Subject: Re: [Xen-devel] RFC: MCA/MCE concept
>
> On Wednesday 30 May 2007 17:03:55 Petersson, Mats wrote:
>
> [snip]
>
> > > My feeling is that the hypervisor and dom0 own the hardware and as
> > > such all hardware fault management should reside there. So we
> > > should never deliver any form of #MC to a domU, nor should a poll
> > > of MCA state from a domU ever observe valid state (e.g., make the
> > > RDMSR return 0). So all handling, logging and diagnosis, as well as
> > > hardware response actions (such as to deploy an online spare
> > > chip-select), are controlled in the hypervisor/dom0 combination.
> > > That seems a consistent model - e.g., if a domU is migrated to
> > > another system it should not carry the diagnosis state of the
> > > original system across, since that belongs with the one domain
> > > that cannot migrate.
> >
> > I agree entirely with this.
> >
> > > But that is not to say that (I think at a future phase) domU should
> > > not participate in a higher-level fault management function, at the
> > > direction of the hypervisor/dom0 combo. For example, if/when we can
> > > isolate an uncorrectable error to a single domU we could forward
> > > such an event to the affected domU if it has registered its
> > > ability/interest in such events. These won't be in the form of a
> > > faked #MC or anything; instead they'd be some form of synchronous
> > > trap experienced when next the affected domU context resumes on a
> > > CPU. The intelligent domU handler can then decide whether the domU
> > > must panic, whether it could simply kill the affected process, etc.
> > > Those details are clearly sketchy, but the idea is to up-level the
> > > communication to a domU to be more like "you're broken" rather than
> > > "here's a machine-level hardware error for you to interpret and
> > > decide what to do with".
> >
> > Yes, this makes much more sense than forwarding #MC, as the guest
> > would have a hard time actually doing anything really useful with it.
> > As far as I know, most uncorrectable errors are near enough entirely
> > fatal in most commercial non-Enterprise OS's anyway - e.g. in Windows
> > XP or Server 2K3 it always ends in a blue screen - which is hardly any
> > better than the guest being "humanely euthanized" by Dom0.
> >
> > I take it this would be some sort of hypercall (available through the
> > regular PV-driver interface for HVM guests) to say "Let me know if
> > I'm broken - trap on vector X".
>
> For short, guests with a PV MCA driver will see a certain event
> (assuming the event mechanism will be used for the notification)
> and guests w/o a PV MCA driver will see a "General Protection Fault".
> Is that right?

I'm not sure a GP fault is the right thing for non-"PV MCA driver"
domains; I think "just killing" the domain is the right thing to do. We
can't guarantee that a GP fault is actually going to "kill" the guest.
Let's assume the code that ran on the guest was something along the
lines of:

int some_function(...)
{
    ...
    try {
        ...
        /* Some code that does quite a lot of "random" processing
           that may cause, for example, a GP fault */
        ...
    } catch (Exception e) {
        ...
        /* handles the GP fault within the kernel code */
        ...
    }
}
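(In real driver code that try/catch would be Windows structured
exception handling rather than C++ - presumably something along the
lines of the sketch below; the routine and the particular faulting
operation are made up, it's only to show the shape of it:)

#include <ntddk.h>

/* Rough illustration only - the routine and its arguments are
 * invented.  The point is that the fault is absorbed inside the
 * driver and execution simply carries on. */
NTSTATUS some_driver_function(PVOID buf, SIZE_T len)
{
    NTSTATUS status = STATUS_SUCCESS;

    __try {
        ProbeForRead(buf, len, 1);       /* may raise an exception    */
        /* ... more processing that may fault ... */
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        status = STATUS_UNSUCCESSFUL;    /* handled inside the kernel */
    }

    return status;
}

The fault never leaves the driver, so from the outside the guest looks
perfectly healthy.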
Note that Windows kernel drivers are allowed to use the kernel exception
handling, and ARE allowed to "allow" GP faults if they wish to do so.
[Don't ask me why MS allows this, but that's the case, so we have to
live with it.] I'm not sure whether Linux, Solaris, *BSD, OS/2 or other
OS's will allow "catching" a kernel GP fault in a non-precise fashion
(I know Linux has exception handling for EXACT positions in the code),
but since at least one kernel DOES allow it, we can't be sure that a GPF
will destroy the guest.

The second point to note is of course that if the guest is in user mode
when the GPF happens, then almost all OS's will just kill the
application - and there's absolutely no reason to believe that the
running application is necessarily where the actual memory problem is;
it may be caused by memory scrubbing, for example.

Whatever we do to the guest, it should be "certain death", unless the
kernel has told us "I can handle MCEs".
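To make that contract concrete, I'd imagine the guest side looking
vaguely like the sketch below. This is purely hypothetical - none of
these names exist in the Xen interface - it's only meant to illustrate a
guest registering "I can handle MCEs" and saying where the notification
should be delivered:

/* Purely hypothetical sketch - every name here is invented. */
#include <stdint.h>

#define MCA_NOTIFY_GUEST   0x1    /* made-up flag: notify rather than kill */
#define MCA_EVENT_VECTOR   0x2c   /* made-up vector/event number           */

struct xen_mca_register {
    uint32_t flags;               /* what the guest claims it can handle   */
    uint32_t event_vector;        /* where to deliver "you're broken"      */
};

/* Hypothetical hypercall wrapper that a PV MCA driver would provide. */
extern int HYPERVISOR_mca_register(struct xen_mca_register *reg);

static int pv_mca_init(void)
{
    struct xen_mca_register reg = {
        .flags        = MCA_NOTIFY_GUEST,
        .event_vector = MCA_EVENT_VECTOR,
    };

    /* Absent this call, the hypervisor must assume the guest cannot
     * cope, and "certain death" is the safe default. */
    return HYPERVISOR_mca_register(&reg);
}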
--
Mats

> > --
> > Mats
> >
> > > Gavin
>
> --
> AMD Saxony, Dresden, Germany
> Operating System Research Center

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel