Re: Interrupt for port 19, but apparently not enabled; per-user 000000004af23acc

Hi Juergen,

On 22/06/2021 17:14, Juergen Gross wrote:
> On 22.06.21 14:21, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 22/06/2021 13:04, Juergen Gross wrote:
>>> On 22.06.21 12:24, Julien Grall wrote:
>>>> Hi Juergen,
>>>>
>>>> As discussed on IRC yesterday, we noticed a couple of splats in
>>>> 5.13-rc6 (and stable 5.4) in the evtchn driver:
>>>>
>>>> [    7.581000] ------------[ cut here ]------------
>>>> [    7.581899] Interrupt for port 19, but apparently not enabled; per-user 000000004af23acc
>>>> [    7.583401] WARNING: CPU: 0 PID: 467 at /home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:169 evtchn_interrupt+0xd5/0x100
>>>> [    7.585583] Modules linked in:
>>>> [    7.586188] CPU: 0 PID: 467 Comm: xenstore-read Not tainted 5.13.0-rc6 #240
>>>> [    7.587462] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>> [    7.589462] RIP: e030:evtchn_interrupt+0xd5/0x100
>>>> [    7.590361] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 00 e8 5f 72 c4 ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 03 5f 82 e8 6b 2d 96 ff <0f> 0b e9 4d ff ff ff 41 0f b6 f4 48 c7 c7 80 da a2 82 e8 f0
>>>> [    7.593662] RSP: e02b:ffffc90040003e60 EFLAGS: 00010082
>>>> [    7.594636] RAX: 0000000000000000 RBX: ffff888102328c00 RCX: 0000000000000027
>>>> [    7.595924] RDX: 0000000000000000 RSI: ffff88817fe18ad0 RDI: ffff88817fe18ad8
>>>> [    7.597216] RBP: ffff888108ef8140 R08: 0000000000000000 R09: 0000000000000001
>>>> [    7.598522] R10: 0000000000000000 R11: 7075727265746e49 R12: 0000000000000000
>>>> [    7.599810] R13: ffffc90040003ec4 R14: ffff8881001b8000 R15: ffff888109b36f80
>>>> [    7.601113] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000) knlGS:0000000000000000
>>>> [    7.602570] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    7.603700] CR2: 00007f15b390e368 CR3: 000000010bb04000 CR4: 0000000000050660
>>>> [    7.604993] Call Trace:
>>>> [    7.605501]  <IRQ>
>>>> [    7.605929]  __handle_irq_event_percpu+0x4c/0x330
>>>> [    7.606817]  handle_irq_event_percpu+0x32/0xa0
>>>> [    7.607670]  handle_irq_event+0x3a/0x60
>>>> [    7.608416]  handle_edge_irq+0x9b/0x1f0
>>>> [    7.609154]  generic_handle_irq+0x4f/0x60
>>>> [    7.609918]  __evtchn_fifo_handle_events+0x195/0x3a0
>>>> [    7.610864]  __xen_evtchn_do_upcall+0x66/0xb0
>>>> [    7.611693]  __xen_pv_evtchn_do_upcall+0x1d/0x30
>>>> [    7.612582]  xen_pv_evtchn_do_upcall+0x9d/0xc0
>>>> [    7.613439]  </IRQ>
>>>> [    7.613882]  exc_xen_hypervisor_callback+0x8/0x10
>>>>
>>>> This is quite similar to the problem I reported a few months ago
>>>> (see [1]), but this time it is happening with fifo rather than 2L.
>>>>
>>>> I haven't been able to reproduce it reliably so far. But looking at
>>>> the code, I think I have found another potential race after commit:
>>>>
>>>> commit b6622798bc50b625a1e62f82c7190df40c1f5b21
>>>> Author: Juergen Gross <jgross@xxxxxxxx>
>>>> Date:   Sat Mar 6 17:18:33 2021 +0100
>>>>
>>>>     xen/events: avoid handling the same event on two cpus at the
>>>>     same time
>>>>
>>>>     When changing the cpu affinity of an event it can happen today
>>>>     that (with some unlucky timing) the same event will be handled
>>>>     on the old and the new cpu at the same time.
>>>>
>>>>     Avoid that by adding an "event active" flag to the per-event
>>>>     data and call the handler only if this flag isn't set.
>>>>
>>>>     Cc: stable@xxxxxxxxxxxxxxx
>>>>     Reported-by: Julien Grall <julien@xxxxxxx>
>>>>     Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>>>     Reviewed-by: Julien Grall <jgrall@xxxxxxxxxx>
>>>>     Link: https://lore.kernel.org/r/20210306161833.4552-4-jgross@xxxxxxxx
>>>>     Signed-off-by: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>>>>
>>>> The evtchn driver will use the lateeoi handlers. So the code to ack
>>>> looks like:
>>>>
>>>>     do_mask(..., EVT_MASK_REASON_EOI_PENDING);
>>>>     smp_store_release(&info->is_active, 0);
>>>>     clear_evtchn(info->evtchn);
>>>>
>>>> The code to handle an interrupt looks like:
>>>>
>>>>     clear_link(...);
>>>>     if (evtchn_fifo_is_pending(port) && !evtchn_fifo_is_masked(port)) {
>>>>         if (xchg_acquire(&info->is_active, 1))
>>>>             return;
>>>>         generic_handle_irq();
>>>>     }
>>>>
>>>> After changing the affinity, an interrupt may be received once on
>>>> the previous vCPU. So, I think the following can happen:
>>>>
>>>>   vCPU0                 | vCPU1
>>>>                         |
>>>>   Receive event         |
>>>>                         | change affinity to vCPU1
>>>>   clear_link()          |
>>>>                         | /* The interrupt is re-raised */
>>>>                         | receive event
>>>>                         | /* The interrupt is not masked */
>>>>   info->is_active = 1   |
>>>>   do_mask(...)          |
>>>>   info->is_active = 0   |
>>>>                         | info->is_active = 1
>>>>   clear_evtchn(...)     |
>>>>                         | do_mask(...)
>>>>                         | info->is_active = 0
>>>>                         | clear_evtchn(...)
>>>>
>>>> Does this look plausible to you?
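
To make that interleaving easier to follow, below is a small user-space
program that replays it deterministically. It is only an illustration: the
atomics stand in for the per-event is_active/masked/pending state, the step
counter forces exactly the ordering shown in the diagram, and none of the
helpers are the real kernel ones.

/*
 * race_sketch.c - replay of the interleaving in the diagram above.
 * Build: gcc -O2 -pthread race_sketch.c -o race_sketch
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-ins for the per-event state (not the kernel structures). */
static atomic_int is_active, masked, pending;
static atomic_int step;     /* forces the exact ordering from the diagram */
static atomic_int handled;  /* how many times the handler ran             */

static void wait_for(int s) { while (atomic_load(&step) < s) ; }
static void advance(int s)  { atomic_store(&step, s); }

static void handle_event(const char *cpu)
{
    atomic_fetch_add(&handled, 1);
    printf("%s: handler runs for the event\n", cpu);
}

/* vCPU0: received the event before the affinity change. */
static void *vcpu0(void *arg)
{
    (void)arg;
    wait_for(2);                            /* vCPU1 already saw "not masked" */
    if (atomic_load(&pending) && !atomic_load(&masked)) {
        if (atomic_exchange(&is_active, 1) == 0) {
            handle_event("vCPU0");
            /* lateeoi ack ordering quoted above: mask, release, clear */
            atomic_store(&masked, 1);       /* do_mask(...)                   */
            atomic_store(&is_active, 0);    /* smp_store_release(...)         */
            advance(3);                     /* <-- the window for vCPU1       */
            wait_for(4);
            atomic_store(&pending, 0);      /* clear_evtchn(...)              */
        }
    }
    return NULL;
}

/* vCPU1: the affinity has moved here and the event was re-raised. */
static void *vcpu1(void *arg)
{
    (void)arg;
    wait_for(1);
    if (atomic_load(&pending) && !atomic_load(&masked)) {  /* not masked yet */
        advance(2);
        wait_for(3);                        /* vCPU0's ack dropped is_active  */
        if (atomic_exchange(&is_active, 1) == 0)
            handle_event("vCPU1");          /* same event handled again       */
        advance(4);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    atomic_store(&pending, 1);              /* one event is pending           */
    pthread_create(&t0, NULL, vcpu0, NULL);
    pthread_create(&t1, NULL, vcpu1, NULL);
    advance(1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("event handled %d time(s)\n", atomic_load(&handled));
    return 0;
}

It reports the handler running twice for the single pending event, which is
the same double handling the splat above complains about.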
>>> Yes, it does. Thanks for the analysis.
>>>
>>> So I guess for lateeoi events we need to clear is_active only in
>>> xen_irq_lateeoi()? At a first glance this should fix the issue.
>>
>> It should work and would be quite neat. But I believe clear_evtchn()
>> would have to stay in the ack helper to avoid losing interrupts.
>
> Could you try the attached patch, please? Only compile tested.

Thanks for the patch! I have also found a reproducer on Linux 5.13, so it
was easier to confirm that the patch works. The reproducer continuously
changes the affinity of the interrupt under high interrupt load (a rough
sketch of the sort of loop I mean is at the end of this mail). After a few
seconds I can see dozens of WARN splats.

Regarding the patch itself, a few suggestions:

1) It is not entirely obvious from the code why ack_mask_dynirq() is not
modified. My understanding is that we are assuming xen_irq_lateeoi_locked()
will not be called in this case. I would suggest spelling this out clearly
in the commit message.

2) I would suggest adding a comment in the code explaining why
event_handler_exit() is not used. It is probably worth adding one in
event_handler_exit() as well, so one knows that it doesn't cover all the
paths.
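For completeness, here is the sketch mentioned above. It only shows the
affinity-flipping side; the IRQ number and the CPU masks are placeholders
for whatever /proc/interrupts reports for the event channel, and the
interrupt load itself has to be generated separately:

/*
 * affinity_flip.c - keep bouncing an IRQ between two CPUs.
 * Build: gcc -O2 affinity_flip.c -o affinity_flip   (run as root)
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    const char *path = "/proc/irq/42/smp_affinity";   /* placeholder IRQ */
    const char *masks[] = { "1", "2" };                /* CPU0 <-> CPU1   */
    unsigned int i = 0;

    for (;;) {
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, masks[i++ & 1], 1) < 0)
            perror("write");
        close(fd);
        usleep(1000);   /* flip roughly every millisecond */
    }
}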
Cheers,

-- 
Julien Grall