
Re: [Xen-devel] [PATCH] x86: adjust handling of interrupts coming in via legacy vectors

On 14/05/12 15:28, Jan Beulich wrote:
>>>> On 14.05.12 at 15:33, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 14/05/12 13:55, Jan Beulich wrote:
>>>>>> On 14.05.12 at 14:39, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
>>>> The debugging code added in c/s 24707:96987c324a4f was hit a (small)
>>>> number of times (one report being
>>>> http://lists.xen.org/archives/html/xen-devel/2012-05/msg00332.html),
>>>> apparently always with a vector within the legacy range. Obviously,
>>>> besides legacy vectors not normally expected to be in use on systems
>>>> with IO-APIC(s), they should never make it to the IRQ migration logic.
>>>> This wasn't being prevented so far: Since we don't have a one-to-one
>>>> mapping between vectors and IRQs - legacy IRQs may have two vectors
>>>> associated with them (one used in either 8259A, the other used in one
>>>> of the IO-APICs) -, vector-to-IRQ translations for legacy vectors (as
>>>> used in do_IRQ()) would yield a valid IRQ number despite the IRQ
>>>> really being handled via an IO-APIC.
>>>> This gets changed here - disable_8259A_irq() zaps the legacy
>>>> vector-to-IRQ mapping, and enable_8259A_irq(), should it ever be
>>>> called for a particular interrupt, restores it.
>>>> Additionally, the spurious interrupt logic in do_IRQ() gets adjusted
>>>> too: Interrupts coming in via legacy vectors obviously didn't get
>>>> reported through the IO-APIC/LAPIC pair (as we never program these
>>>> vectors into any RTE), and hence shouldn't get ack_APIC_irq() called on
>>>> them. Instead, a new function (pointer) bogus_8259A_irq() gets used to
>>>> have the 8259A driver take care of the bogus interrupt (as outside of
>>>> automatic EOI mode it may need an EOI to be issued for it to prevent
>>>> other interrupts that may legitimately go through the 8259As from
>>>> getting masked out).
>>> Note that this patch does not make any attempt at dealing with the
>>> underlying issue that causes the bogus interrupt(s) to show up. If
>>> my analysis is right, we shouldn't see crashes anymore, but instead
>>> observe instances of spurious interrupts on legacy vectors. It would
>>> certainly be nice to have an actual proof of this (albeit I realize that
>>> this isn't readily reproducible), in order to then - if indeed behaving
>>> as expected - add debugging code to identify whether such interrupts
>>> in fact get raised by one of the 8259A-s (particularly printing the
>>> cached and physical mask register values), or whether they get
>>> introduced into the system by yet another obscure mechanism.
>>> One particular thing I'm suspicious about is the numerous aliases
>>> to the two (each) 8259A I/O ports that various chipsets have: What
>>> if some component in Dom0 accesses one of the alias ports in order
>>> to do something specific to a non-standard platform (say, probe for
>>> some special hardware interface), not realizing that it actually plays
>>> with PIC state? Linux under the same conditions wouldn't severely
>>> suffer - as it has a 1:1 vector <-> IRQ translation, it likely would
>>> merely observe an extra interrupt.
>> On the whole, the patch looks sensible, but what happens if the spurious
>> interrupt is coming in through the Local APIC ?  If this is the case,
>> then we still need to ACK it, even if it is a bogus PIC interrupt.
>> Perhaps in irq.c, the changes should check whether the observed vector
>> has been raised in the LAPIC and ack it, and then decide whether it is
>> bogus or not.
> Should that really turn out to be the case, we're in much bigger trouble,
> as then we need an explanation how an interrupt at that vector could
> have got raised in the first place. I'd therefore like to keep the current
> change dealing only with things that we know can happen.

We would be in huge trouble.  As it currently stands, I am not certain
that we can be sure that this is not happening.

As a concession, perhaps a test of the LAPIC IIR, and an obvious error
to the console?  It would be more useful than having Xen crash/hang
due to no longer always ack'ing the LAPIC.
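
To make the suggestion concrete, here is a minimal sketch of such a test. It
assumes the register meant above is the LAPIC in-service register (ISR), which
holds one bit per vector across eight 32-bit words. The snapshot structure and
both function names are hypothetical and are not taken from the patch or from
Xen; real code would read the registers from the LAPIC instead of a snapshot.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the LAPIC in-service bits: 256 vectors,
 * one bit each, spread over eight 32-bit words. */
struct lapic_isr_snapshot {
    uint32_t word[8];
};

/* Return 1 if the bit for 'vector' is set in the snapshot, else 0. */
int vector_in_service(const struct lapic_isr_snapshot *isr,
                      unsigned int vector)
{
    assert(vector < 256);
    return (isr->word[vector / 32] >> (vector % 32)) & 1u;
}

/* Hypothetical check for an interrupt that arrived on a legacy vector.
 * Returns 1 (and logs loudly) if the LAPIC claims the vector is in
 * service, meaning the caller would still have to ack the LAPIC.
 * Returns 0 if only the 8259A path needs to deal with it. */
int check_legacy_vector(const struct lapic_isr_snapshot *isr,
                        unsigned int vector)
{
    if (vector_in_service(isr, vector)) {
        fprintf(stderr,
                "legacy vector %#x unexpectedly in service in LAPIC\n",
                vector);
        return 1;
    }
    return 0;
}
```

A caller on the spurious-interrupt path would ack the LAPIC only when
check_legacy_vector() returns 1, and otherwise leave the interrupt to the
8259A driver.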

>> Might it also be sensible to remove dom0's permissions to use the PIC
>> ports, in case it is some weird issue like that?
> That's already being done iirc. The problem is that it's non-trivial (and
> perhaps non-reliable) to determine the aliases, and hence we can't
> blindly remove more than the two real ports from Dom0's permitted
> set.
> Jan

Ah yes - in which case it's not feasible.

Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
