
Re: NetBSD dom0 PVH: hardware interrupts stalls



On 23.11.2020 18:39, Manuel Bouyer wrote:
> On Mon, Nov 23, 2020 at 06:06:10PM +0100, Roger Pau Monné wrote:
>> OK, I'm afraid this is likely too verbose and messes with the timings.
>>
>> I've been looking (again) into the code, and I found something weird
>> that I think could be related to the issue you are seeing, but haven't
>> managed to try to boot the NetBSD kernel provided in order to assert
>> whether it solves the issue or not (or even whether I'm able to
>> repro it). Would you mind giving the patch below a try?
> 
> With this, I get the same hang but XEN outputs don't wake up the interrupt
> any more. The NetBSD counter shows only one interrupt for ioapic2 pin 2,
> while I would have about 8 at the time of the hang.
> 
> So, now it looks like interrupts are blocked forever.

Which may be a good thing for debugging purposes, because now we have
a way to investigate what is actually blocking the interrupt's
delivery without having to worry about more output screwing the
overall picture.

> At
> http://www-soc.lip6.fr/~bouyer/xen-log5.txt
> you'll find the output of the 'i' key.

(XEN)    IRQ:  34 vec:59 IO-APIC-level   status=010 aff:{0}/{0-7} in-flight=1 d0: 34(-MM)

(XEN)     IRQ 34 Vec 89:
(XEN)       Apic 0x02, Pin  2: vec=59 delivery=LoPri dest=L status=1 polarity=1 irr=1 trig=L mask=0 dest_id:00000001

(XEN) ioapic 2 pin 2 gsi 34 vector 0x67
(XEN)   delivery mode 0 dest mode 0 delivery status 0
(XEN)   polarity 1 IRR 0 trig mode 1 mask 0 dest id 0

IOW from the guest's pov the interrupt is entirely idle (mask and IRR
clear), while Xen sees it as in-flight, with IRR already set again. I
continue to suspect the EOI timer not doing its job. Yet as said
before, for it to have to do anything in the first place the "guest"
(really Dom0 here) would need to fail to EOI the IRQ within the
timeout period. Which in turn, given your description of how you
handle interrupts, cannot be excluded (i.e. the handling may simply
take "slightly" too long).
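
To make the mismatch between the two views easier to see, here is a minimal
sketch that packs the logged fields into the standard 64-bit IO-APIC
redirection table entry layout and lists the fields that differ. The helper
names (encode_rte, decode_rte, differing_fields) are made up for illustration
and are not Xen functions. The numeric encodings assumed for "LoPri" (delivery
mode 1), "dest=L" (logical, 1) and "trig=L" (level, 1) are my reading of the
log, not something the log states.

```python
# Bit layout of a 64-bit IO-APIC redirection table entry (xAPIC format):
# field name -> (shift, width).
RTE_FIELDS = {
    "vector":          (0, 8),
    "delivery_mode":   (8, 3),
    "dest_mode":       (11, 1),
    "delivery_status": (12, 1),
    "polarity":        (13, 1),
    "remote_irr":      (14, 1),
    "trigger_mode":    (15, 1),
    "mask":            (16, 1),
    "dest_id":         (56, 8),
}


def encode_rte(**fields):
    """Pack named fields into a raw 64-bit RTE value."""
    raw = 0
    for name, value in fields.items():
        shift, width = RTE_FIELDS[name]
        raw |= (value & ((1 << width) - 1)) << shift
    return raw


def decode_rte(raw):
    """Unpack a raw 64-bit RTE value into a dict of named fields."""
    return {name: (raw >> shift) & ((1 << width) - 1)
            for name, (shift, width) in RTE_FIELDS.items()}


def differing_fields(a, b):
    """Return the sorted names of the fields whose values differ."""
    return sorted(name for name in RTE_FIELDS if a[name] != b[name])


# Physical RTE as reported by the 'i' key output (values assumed from the log;
# the log prints the vector in hex).
xen_view = decode_rte(encode_rte(
    vector=0x59, delivery_mode=1, dest_mode=1, delivery_status=1,
    polarity=1, remote_irr=1, trigger_mode=1, mask=0, dest_id=1))

# Entry as the guest sees it (the "ioapic 2 pin 2 gsi 34" lines).
guest_view = decode_rte(encode_rte(
    vector=0x67, delivery_mode=0, dest_mode=0, delivery_status=0,
    polarity=1, remote_irr=0, trigger_mode=1, mask=0, dest_id=0))

print(differing_fields(xen_view, guest_view))
```

Vector, delivery mode and destination are expected to differ between the two
entries; the interesting difference is remote_irr (and delivery_status), which
is set on the Xen side and clear on the guest side.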

What we're missing is LAPIC information, since the masked status logged
is unclear: (-MM) isn't fully matching up with "mask=0". But of course
the former is just a software representation, while the latter is what
the RTE holds. IOW for the interrupt to not get delivered, there needs
to be this or a higher ISR bit set (considering we don't use the TPR),
or (I think we can pretty much exclude this) we'd need to be running
with IRQs off for extended periods of time.
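
For reference, the delivery condition can be sketched as a simplified model of
the local APIC priority rules as I understand them from the Intel SDM. This is
not Xen or NetBSD code, and the function names are hypothetical. Strictly
speaking it is the priority class (vector bits 7:4) that matters, so an
in-service vector in the same class also blocks delivery. Vector 0x59 is taken
from the log above purely as the example.

```python
def highest_vector(bitmap):
    """Return the highest vector set in a 256-bit ISR bitmap, or None."""
    return bitmap.bit_length() - 1 if bitmap else None


def processor_priority_class(tpr, isr):
    """Priority class of the PPR: max of the TPR class and the ISRV class."""
    isrv = highest_vector(isr)
    isr_class = 0 if isrv is None else isrv >> 4
    return max((tpr >> 4) & 0xF, isr_class)


def deliverable(vector, tpr, isr, irqs_enabled=True):
    """True if a pending vector would be dispatched to the CPU.

    Simplified: the vector's class must exceed the processor priority
    class, and the CPU must have interrupts enabled.
    """
    return irqs_enabled and (vector >> 4) > processor_priority_class(tpr, isr)


VECTOR = 0x59  # example vector, taken from the log

print(deliverable(VECTOR, tpr=0, isr=0))                      # nothing in service
print(deliverable(VECTOR, tpr=0, isr=1 << VECTOR))            # same vector never EOI'd
print(deliverable(VECTOR, tpr=0, isr=1 << 0x61))              # higher class in service
print(deliverable(VECTOR, tpr=0, isr=0, irqs_enabled=False))  # IRQs off
```

With the TPR left at zero, only the second, third and fourth cases block the
interrupt, which matches the two possibilities named above: a stale ISR bit at
this or a higher priority, or IRQs being off.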

Jan



 

