Re: [Xen-devel] RFC: MCA/MCE concept
On Wednesday 06 June 2007 12:35:15 Gavin Maltby wrote:
> Hi,
>
> On 06/06/07 10:28, Christoph Egger wrote:
> > On Monday 04 June 2007 18:16:56 Gavin Maltby wrote:
> >> Hi,
> >>
> >> On 05/30/07 10:10, Christoph Egger wrote:
> >>> On Wednesday 30 May 2007 10:49:40 Jan Beulich wrote:
> >>>>>>> "Christoph Egger" <Christoph.Egger@xxxxxxx> 30.05.07 09:45 >>>
> >>>>>
> >>>>> On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >>>>>>> case I) - Xen receives a MCE from the CPU
> >>>>>>>
> >>>>>>> 1) Xen MCE handler figures out if the error is a correctable
> >>>>>>>    error (CE) or an uncorrectable error (UE)
> >>>>>>> 2a) error == CE:
> >>>>>>>     Xen notifies Dom0 if Dom0 installed an MCA event handler,
> >>>>>>>     for statistical purposes
> >>
> >> [rest cut]
> >>
> >> For the hypervisor to dom0 communication that 2a) above refers to I
> >> think we need to agree on two aspects: what form the notification
> >> event will take, and what error telemetry data and additional
> >> information will be provided by the hypervisor for dom0 to chew on
> >> for statistical and diagnosis purposes.
> >
> > Additionally, the hypervisor must be able to notify a domU that has
> > a PV MCA driver.
>
> Yes, forgot that; although I guess I view that most likely as a
> future phase.

Yes, but ignoring this can lead to a design that is bad for DomU and
requires a re-design in the worst case.

> >> For the first I've assumed so far that an event channel
> >> notification of the MCA event will suffice; as long as the
> >> hypervisor only polls for correctable MCA errors at a low-frequency
> >> rate (currently a 15s interval) there is no danger of spamming that
> >> single notification.
> >
> > Why polling?
>
> Polling for correctable errors, but #MC as usual for others. Setting
> MCi_CTL bits for correctable errors does not produce a machine check,
> so polling is the only approach unless one sets additional (and
> undocumented, certainly for AMD chips) config bits. What I was
> getting at here is that polling at largish intervals for correctables
> is the correct approach - trapping for them or polling at a high
> frequency is bad because in cases where you have some form of solid
> correctable error (say a single bad pin in a dimm socket affecting
> one or two ranks of that dimm but never able to produce a UE) the
> trap handling and diagnosis software consume the machine and things
> make little useful forward progress.

I still don't see why #MC for all kinds of errors is bad.
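To make sure we are talking about the same thing, here is a rough
sketch of what I understand the low-frequency poller to look like.
The MSR numbering follows the generic x86 MCA layout (four MSRs per
bank starting at MC0_CTL); nr_mce_banks exists in Xen's mcheck code
today, while mca_queue_for_dom0() is a made-up name for whatever ends
up queueing the dom0 notification:

    /* Rough sketch only, not real Xen code: poll the MCA banks for
     * correctable errors at a coarse interval, as discussed above. */
    #define MSR_MC0_CTL       0x400
    #define MSR_MC_STATUS(b)  (MSR_MC0_CTL + 4 * (b) + 1)
    #define MSR_MC_ADDR(b)    (MSR_MC0_CTL + 4 * (b) + 2)
    #define MSR_MC_MISC(b)    (MSR_MC0_CTL + 4 * (b) + 3)

    #define MCI_STATUS_VAL    (1ULL << 63) /* bank holds valid telemetry */
    #define MCI_STATUS_UC     (1ULL << 61) /* uncorrected -> #MC path    */
    #define MCI_STATUS_MISCV  (1ULL << 59) /* MCi_MISC holds valid data  */
    #define MCI_STATUS_ADDRV  (1ULL << 58) /* MCi_ADDR holds valid data  */

    static struct timer mce_poll_timer;  /* armed via init_timer()/set_timer() */

    static void mce_poll_banks(void *unused)
    {
        unsigned int b;
        uint64_t status, addr, misc;

        for ( b = 0; b < nr_mce_banks; b++ )
        {
            rdmsrl(MSR_MC_STATUS(b), status);
            if ( !(status & MCI_STATUS_VAL) || (status & MCI_STATUS_UC) )
                continue; /* nothing logged, or a UE the #MC handler owns */

            addr = misc = 0;
            if ( status & MCI_STATUS_ADDRV )
                rdmsrl(MSR_MC_ADDR(b), addr);
            if ( status & MCI_STATUS_MISCV )
                rdmsrl(MSR_MC_MISC(b), misc);

            /* Raw telemetry only - any "cooking" happens in dom0. */
            mca_queue_for_dom0(b, status, addr, misc);

            wrmsrl(MSR_MC_STATUS(b), 0); /* re-arm the bank */
        }

        /* Deliberately coarse - see the solid-error argument above. */
        set_timer(&mce_poll_timer, NOW() + SECONDS(15));
    }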
> >> On receipt of the notification the event handler will need to suck
> >> some event data out of somewhere - uncertain which somewhere would
> >> be best?
> >>
> >> We should standardize both the format and the content of this
> >> event data. The following is just to get the conversation started
> >> in this area.
> >>
> >> Content first. Obviously we need the raw MCA register content -
> >> MCi_STATUS, MCi_ADDR, MCi_MISC. We also need to know which MCA
> >> detector bank made the observation, so we need to include some
> >> indication of which chip (where I use "chip" to coincide with
> >> "socket"), core on that chip, and MCA bank number the telemetry
> >> came from. I think I am correct in saying that hyperthreaded CPUs
> >> do not have any MCA banks per-thread, but we may want to allow for
> >> that future possibility (I know, for instance, that some SPARC
> >> cpus have error state for each hardware thread).
> >
> > And we need the domain and the domain's vcpu to identify who is
> > impacted.
>
> Yes, the domain ID. I'm not sure we need the vcpu id if we instead
> present some physical identifiers such as chip, core number etc
> (and have the namespaces well-defined). If we don't present those,
> we need the vcpu in the payload and some external method to resolve
> it to physical components. Since errors correlate to physical
> components it would, I think, be nicer to report detector info in
> some physical sense.

The vcpu is more interesting for the domU than for dom0. See below.

> As regards a vcpu to physical translation, I didn't think there was
> any fixed mapping (or certainly any mapping that a dom0 should
> interpret and rely on). For example if we have two physical cores
> but choose to present 32 vcpus to a domain I don't believe there is
> anything to say that vcpus 0-15 always run on physical core 0?
>
> >> We should also allow for additional model-specific error telemetry
> >> that may be available and relevant - I know that will be necessary
> >> for some upcoming x86 cpu models. We should probably avoid adding
> >> "cooked" content to this error event payload - such cooking of the
> >> raw data is much more easily performed in dom0 (the example I'm
> >> thinking of here is physical address to memory location
> >> translation).
> >>
> >> In terms of the form of the error event data, the simplest but
> >> also the dumbest would be a binary structure passed from
> >> hypervisor to dom0:
> >
> > struct mca_error_data_ver1 {
> >     uint8_t  version;      /* structure version */
> >     uint64_t mc_status;
> >     uint64_t mc_addr;
> >     uint64_t mc_misc;
> >     uint16_t mc_chip;
> >     uint16_t mc_core;
> >     uint16_t mc_bank;
> >     uint16_t domid;
> >     uint16_t vcpu_id;
> >     ...
> > };
>
> >> That is easily passed around and can be extended by versioning.
> >> A more self-describing and naturally extensible approach would be
> >> to parcel the error data in some form of name-type-value list.
> >> That's what we do in the corresponding kernel->userland error
> >> code in Solaris; the downside is that the supporting libnvpair
> >> library is not tiny and likely not the sort of footprint to
> >> include in a hypervisor. Perhaps some cut-down form would do.
> >
> > In the public xen.h header a VIRQ_DOM_EXC is defined, which seems
> > to be appropriate for an NMI event.
> > There are two functions to send VIRQs: send_guest_vcpu_virq() and
> > send_guest_global_virq().
> >
> > However, VIRQ_DOM_EXC is not properly implemented: all virtual
> > interrupts are maskable. We definitely need an event that is
> > guaranteed to interrupt the guest immediately, no matter whether
> > it is Dom0 or DomU and whatever it is doing.
> >
> > And VIRQ_DOM_EXC is explicitly reserved for Dom0. Maybe we should
> > introduce a VIRQ_MCA as a special NMI event for both Dom0 and
> > DomU?
>
> Sounds like it may be necessary. I don't know this mechanism very
> well so I'll go and do some reading (after a big long unrelated code
> review).

After some code reading I found nmi_pending, nmi_masked and nmi_addr
in struct vcpu in xen/include/xen/sched.h. xen/include/xen/nmi.h is
also of interest. The implementation is in xen/common/kernel.c.

There is only one callback per vcpu allowed and only Dom0 can
register an NMI. So the guest's NMI handler must multiplex several
NMI handlers - at least for Dom0 (MCA + watchdog timer). It's fine
with me to allow DomUs to register only the MCA NMI.

To inform a domU (having a PV MCA driver), it must be able to
register an NMI callback as well. To allow this, struct vcpu_info in
the PUBLIC xen.h also needs nmi_pending and nmi_addr.
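To sketch what I mean for the guest side (illustrative only:
mca_fetch_event() and watchdog_tick() are made-up names, and a real
PV kernel would register an assembly entry stub rather than a plain C
function):

    #include <xen/interface/nmi.h> /* struct xennmi_callback, XENNMI_* */

    /* All NMI sources funnel through the single per-vcpu callback,
     * so the guest has to demultiplex them itself. */
    static void guest_nmi_handler(void)
    {
        if ( mca_fetch_event() )   /* made-up PV MCA query */
            return;                /* MCA telemetry consumed */
        watchdog_tick();           /* Dom0's other NMI user today */
    }

    static int register_mca_nmi(void)
    {
        struct xennmi_callback cb = {
            .handler_address = (unsigned long)guest_nmi_handler,
        };

        /* Today this hypercall is accepted from Dom0 only; the
         * proposal is to accept it from a domU with a PV MCA driver
         * as well. */
        return HYPERVISOR_nmi_op(XENNMI_register_callback, &cb);
    }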
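And for the maskable event-channel path Gavin describes for
correctable errors, the dom0 consumer of the proposed VIRQ_MCA could
look roughly like this. VIRQ_MCA and mca_ring_consume() are part of
the proposal, i.e. they do not exist yet; bind_virq_to_irqhandler()
is the existing Linux-side helper:

    static irqreturn_t mca_interrupt(int irq, void *dev_id,
                                     struct pt_regs *regs)
    {
        struct mca_error_data_ver1 rec;

        /* Drain all records queued since the last notification. */
        while ( mca_ring_consume(&rec) ) /* proposed, does not exist yet */
        {
            if ( rec.version != 1 )
                continue; /* skip records newer than we understand */

            printk(KERN_INFO
                   "MCA: chip %u core %u bank %u status %016llx\n",
                   rec.mc_chip, rec.mc_core, rec.mc_bank,
                   (unsigned long long)rec.mc_status);
            /* ... hand the raw telemetry to the diagnosis engine ... */
        }

        return IRQ_HANDLED;
    }

    static int __init mca_virq_init(void)
    {
        int irq = bind_virq_to_irqhandler(VIRQ_MCA /* proposed */, 0,
                                          mca_interrupt, 0, "mca", NULL);
        return (irq < 0) ? irq : 0;
    }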
Keir: How do you feel about all this? Is this the right way or do you
see things that should be done in a different way?

Christoph

--
AMD Saxony, Dresden, Germany
Operating System Research Center