Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids

To: Juergen Gross <jgross@xxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, linux-block@xxxxxxxxxxxxxxx, netdev@xxxxxxxxxxxxxxx, linux-scsi@xxxxxxxxxxxxxxx
From: Julien Grall <julien@xxxxxxx>
Date: Sat, 6 Feb 2021 18:46:30 +0000
Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, stable@xxxxxxxxxxxxxxx, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jens Axboe <axboe@xxxxxxxxx>, Wei Liu <wei.liu@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, "David S. Miller" <davem@xxxxxxxxxxxxx>, Jakub Kicinski <kuba@xxxxxxxxxx>
Delivery-date: Sat, 06 Feb 2021 18:46:59 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi Juergen,

On 06/02/2021 10:49, Juergen Gross wrote:

> The first three patches are fixes for XSA-332. They avoid WARN splats
> and a performance issue with interdomain events.

Thanks for helping to figure out the problem. Unfortunately, I still reliably see the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.

I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.

After some debugging, I think I have an idea of what went wrong. The problem happens when the event, initially bound to vCPU0, is rebound to a different vCPU.

From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it from being delivered on an unexpected vCPU. However, I believe the following can happen:


vCPU0                           | vCPU1
                                |
                                | Call xen_rebind_evtchn_to_cpu()
receive event X                 |
                                | mask event X
                                | bind to vCPU1
<vCPU descheduled>              | unmask event X
                                |
                                | receive event X
                                |
                                | handle_edge_irq(X)
handle_edge_irq(X)              |  -> handle_irq_event()
                                |   -> set IRQD_IN_PROGRESS
 -> set IRQS_PENDING            |
                                |   -> evtchn_interrupt()
                                |   -> clear IRQD_IN_PROGRESS
                                |  -> IRQS_PENDING is set
                                |  -> handle_irq_event()
                                |   -> evtchn_interrupt()
                                |     -> WARN()
                                |

All the lateeoi handlers expect ONESHOT semantics, and evtchn_interrupt() doesn't tolerate any deviation.

I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
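
To make the interleaving above easier to follow, here is a rough, single-threaded toy model of it in plain C. It is only a sketch under assumptions: struct irq_model, model_handler(), model_edge_irq() and run_scenario() are made-up names, not kernel code, and the "other vCPU" is simulated by a nested call made while the handler is running. The disable_in_handler flag is my stand-in for the older behaviour where the interrupt was disabled.

```c
#include <stdbool.h>

/* Toy model only: these fields mimic the flags named in the diagram. */
struct irq_model {
    bool in_progress;        /* stands in for IRQD_IN_PROGRESS */
    bool pending;            /* stands in for IRQS_PENDING */
    bool disabled;           /* stands in for a disabled interrupt */
    bool enabled;            /* handler's one-shot "enabled" state */
    bool stale_event;        /* vCPU0 still has event X to process */
    bool disable_in_handler; /* assumed model of the older behaviour */
    int handler_calls;
    int warnings;            /* counts what would be a WARN() */
};

static void model_edge_irq(struct irq_model *m);

/* Mimics a ONESHOT-style handler such as evtchn_interrupt(). */
static void model_handler(struct irq_model *m)
{
    m->handler_calls++;
    if (!m->enabled)
        m->warnings++;       /* called again before being re-armed */
    m->enabled = false;
    if (m->disable_in_handler)
        m->disabled = true;

    /* vCPU0 gets scheduled again and processes its stale event X now. */
    if (m->stale_event) {
        m->stale_event = false;
        model_edge_irq(m);
    }
}

/* Simplified shape of the handle_edge_irq() loop. */
static void model_edge_irq(struct irq_model *m)
{
    if (m->in_progress) {
        m->pending = true;   /* someone else is handling it: just flag it */
        return;
    }
    do {
        m->pending = false;
        m->in_progress = true;
        model_handler(m);
        m->in_progress = false;
    } while (m->pending && !m->disabled);
}

/* Runs event X on "vCPU1", optionally racing with a stale event on "vCPU0". */
static struct irq_model run_scenario(bool race, bool disable_in_handler)
{
    struct irq_model m = {
        .enabled = true,
        .stale_event = race,
        .disable_in_handler = disable_in_handler,
    };
    model_edge_irq(&m);
    return m;
}
```

In this model, run_scenario(true, false) ends with two handler calls and one warning, matching the diagram, while run_scenario(true, true) stops after a single handler call because the loop condition sees the interrupt as disabled.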

Aside from the handlers, I think it may impact the deferred EOI mitigation because, in theory, a 3rd vCPU could join the party (let's say vCPU A migrates the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?

For a fix, we may want to consider holding evtchn_rwlock with write permission. Although, I am not 100% sure this is going to prevent everything.


Does my write-up make sense to you?

Cheers,

--
Julien Grall
