|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix
At 22:55 +0000 on 30 Nov (1354316132), Andrew Cooper wrote:
> Any spinlocking whatsoever is out, until we completely fix the
> re-entrant entry to do_{nmi,mce}(), at which point spinlocks which are
> ok so long as they are exclusively used inside their respective
> handlers.
AFAICT spinlocks are bad news even if they're only in their respective
handlers, as one CPU can get (NMI (MCE)) while the other gets (MCE (NMI)).
> >>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> >>> MCE itself if an MCE interrupts an NMI.
> >> The same questions apply as to #1 (just replace NMI with MCE)
> > Andrew pointed out that some MCE code uses rdmsr_safe().
> >
> > FWIW, I think that constraining MCE and NMI code not to do anything that
> > can fault is perfectly reasonable. The MCE code has grown a lot
> > recently and probably needs an audit to check for spinlocks, faults &c.
>
> Yes. However, being able to deal gracefully with the case were we miss
> things on code review which touches the NMI/MCE paths is certainly
> better than crashing in subtle ways.
Agreed.
> At the moment, the clearing of the MCIP bit is quite early in a few of
> the cpu family specific MCE handlers. As it is an architectural MSR, I
> was considering moving it outside the family specific handlers, and make
> one of the last things on the MCE path, to help reduce the race
> condition window until we properly fix reentrant MCEs.
Why not have a per-cpu mce-in-progress flag, and clear MCIP early? That
way you get a panic instead of silently losing a CPU.
> >>> [1] In an effort to prevent a flamewar with my comment, the situation we
> >>> find outself in now is almost certainly the result of unforseen
> >>> interactions of individual features, but we are left to pick up the many
> >>> pieces in way which cant completely be solved.
> >>>
> >> Happy to have my comments completely shot down into little bits, but I'm
> >> worrying that we're looking to solve a problem that doesn't actually
> >> need solving - at least as long as the code in the respective handlers
> >> are "doing the right thing", and if that happens to be broken, then we
> >> should fix THAT, not build lots of extra code to recover from such a thing.
> > I agree. The only things we can't fix by DTRT in do_nmi() and do_mce()
> > are:
> > - IRET in SMM mode re-enabling NMIs; and
> > - detecting _every_ case where we get a nested NMI/MCE (all we can
> > do is detect _most_ cases, but the detection is just so we can print
> > a message before the crash).
>
> We would need to modify the asm stubs to detect nesting.
We can detect _almost all_ nesting from C with an in-progress flag. We
can probably expand that to cover all nesting by pushing the
flag-setting/flag-clearing out to the asm but that'd still be only a
couple of lines of change - a lot simpler than what we'd need to allow
nested MCE/NMIs to continue without crashing.
> I think it is
> unreasonable to expect the C functions do_{nmi,mce}() to be reentrantly
> safe.
Good. :) I'm not suggesting that.
Tim.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |