Re: Re: [Xen-devel] RFC: MCA/MCE concept
Hi,

Apologies for the screwy quoting below - I did not receive the first half of this thread so it's been forwarded to me.

- Dom0 got enough CEs so that UEs are very likely to happen in order to "circumvent" UEs.

The greatest rewards here are in syndrome/row/column/bank analysis of the error stream. Where something like a bad pin produces tonnes of CEs, they are always on the same bit, and your chance of a UE is that of a random radiation-type CE colliding within the set of ECC checkwords being undermined by that pin - not very high. On the other hand, if we're seeing repeated distinct syndromes from the same chip-select (or chip-select in a pair), then there is a good chance they could collide "soon" - our data is that this combination predicts a UE within hours to a few days. If you have row/column/bank decoding you can also perform further analysis of the error source and assess the chances of a collision that would produce a UE.

That example has DIMM memory in mind, but similar approaches apply to cache memory where it is ECC protected, and so on.

- Possible operations on a DomU
  - save/restore DomU
  - (live-)migrate DomU to a different physical machine
  - etc.

Very heavy-weight operations, which I think are unlikely to succeed if you already suspect the system's going to suffer a UE soon.

As above, some predictors can give you hours to a few days warning of a UE.

Gavin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
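
The syndrome analysis described above could be sketched roughly as follows. This is only an illustration, not the predictor referred to in the message: the function name, the `(chip_select, syndrome)` event format, the labels, and both thresholds are assumptions made up for the example. It separates a chip-select that keeps reporting one syndrome (the bad-pin case) from one reporting several distinct syndromes (the case said to precede a UE).

```python
from collections import defaultdict


def classify_chip_selects(events, min_distinct=2, min_repeats=3):
    """Classify chip-selects from a stream of correctable-error (CE) events.

    `events` is an iterable of (chip_select, syndrome) pairs. Both the
    event format and the thresholds are hypothetical, chosen only to
    illustrate the idea.
    """
    syndromes = defaultdict(set)   # chip_select -> distinct syndromes seen
    counts = defaultdict(int)      # chip_select -> total CE count

    for chip_select, syndrome in events:
        syndromes[chip_select].add(syndrome)
        counts[chip_select] += 1

    report = {}
    for chip_select, total in counts.items():
        distinct = len(syndromes[chip_select])
        if distinct >= min_distinct:
            # Several different syndromes from one chip-select: the
            # combination described as likely to collide into a UE.
            report[chip_select] = "collision-risk"
        elif total >= min_repeats:
            # Many CEs, always the same syndrome: consistent with a
            # single bad pin or bit, so low chance of a UE.
            report[chip_select] = "same-bit"
        else:
            report[chip_select] = "isolated"
    return report
```

Called with five CEs of one syndrome on chip-select 0, three different syndromes on chip-select 1, and a single CE on chip-select 2, only chip-select 1 would be flagged as a collision risk. A real predictor would also need time windows and the row/column/bank decoding mentioned above, which this sketch leaves out.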