
Re: Re: [Xen-devel] RFC: MCA/MCE concept


Apologies for the screwy quoting below - I did not receive the first half of
the thread, so it has been forwarded to me.

  - Dom0 got enough CEs so that UEs are very likely to happen in order
     to "circumvent" UEs.

The greatest rewards here are in syndrome/row/column/bank analysis of the
error stream.  Where something like a bad pin produces tonnes of CEs,
they are always on the same bit, and your chance of a UE is that of a random
radiation-type CE colliding within the set of ECC checkwords being undermined
by that pin - not very high.  On the other hand, if we're seeing repeated
distinct syndromes from the same chip-select (or chip-select in a pair),
then there is a good chance they could collide "soon" - our data suggests that
this combination predicts a UE within hours to a few days.  If you have
row/column/bank decoding you can also perform further analysis of the
error source and assess the chances of a collision that would produce a UE.
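
To make the distinction above concrete, here is a rough sketch of how a
CE stream could be bucketed per chip-select by the number of distinct
syndromes seen.  Everything in it is an assumption for illustration: the
record fields, the function name, the labels and the threshold of two
distinct syndromes are all hypothetical, and a real predictor would also
want time windows and row/column/bank data.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class CorrectableError(NamedTuple):
    """Hypothetical CE record; field names are illustrative only."""
    chip_select: int  # chip-select that reported the CE
    syndrome: int     # ECC syndrome logged with the CE


def classify_chip_selects(errors: Iterable[CorrectableError],
                          distinct_threshold: int = 2) -> Dict[int, str]:
    """Label each chip-select by how many distinct syndromes it produced.

    distinct_threshold is an assumed tuning knob, not a measured value.
    """
    syndromes = defaultdict(set)
    for err in errors:
        syndromes[err.chip_select].add(err.syndrome)

    labels = {}
    for chip_select, seen in syndromes.items():
        if len(seen) >= distinct_threshold:
            # Several distinct syndromes from one chip-select: the case
            # described above as likely to collide into a UE "soon".
            labels[chip_select] = "elevated-ue-risk"
        else:
            # Many CEs with one syndrome: looks like a single bad bit/pin.
            labels[chip_select] = "same-bit-pattern"
    return labels


# Chip-select 0 repeats one syndrome; chip-select 1 shows two distinct ones.
stream = [CorrectableError(0, 0x1B)] * 1000 + [
    CorrectableError(1, 0x1B),
    CorrectableError(1, 0x47),
    CorrectableError(1, 0x1B),
]
result = classify_chip_selects(stream)
```

Note that the raw CE count plays no part in the label: chip-select 0 has
far more errors but is the less worrying of the two, which is the point of
looking at syndromes rather than totals.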

That example has DIMM memory in mind, but similar approaches apply to
cache memory where it is ECC protected and so on.

  - Possible operations on a DomU
       - save/restore DomU
       - (live-)migrate DomU to a different physical machine
       - etc.
Very heavy-weight operations, which I think are unlikely to succeed if
you already suspect the system's going to suffer a UE soon.

As above, some predictors can give you hours to a few days' warning of a UE.


Xen-devel mailing list


