[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] xen-block: race condition when stopping the device (WAS: Re: [xen-4.13-testing test] 144736: regressions - FAIL)
> -----Original Message----- [snip] > >> > >> This feels like a race condition between the init/free code with > >> handler. Anthony, does it ring any bell? > >> > > > > From that stack bt it looks like an iothread managed to run after the > sring was NULLed. This should not be able happen as the dataplane should > have been moved back onto QEMU's main thread context before the ring is > unmapped. > > My knowledge of this code is fairly limited, so correct me if I am wrong. > > blk_set_aio_context() would set the context for the block aio. AFAICT, > the only aio for the block is xen_block_complete_aio(). Not quite. xen_block_dataplane_start() calls xen_device_bind_event_channel() and that will add an event channel fd into the aio context, so the shared ring is polled by the iothread as well as block i/o completion. > > In the stack above, we are not dealing with a block aio but an aio tie > to the event channel (see the call from xen_device_poll). So I don't > think the blk_set_aio_context() would affect the aio. > For the reason I outline above, it does. > So it would be possible to get the iothread running because we received > a notification on the event channel while we are stopping the block (i.e > xen_block_dataplane_stop()). > We should assume an iothread can essentially run at any time, as it is a polling entity. It should eventually block polling on fds assign to its aio context but I don't think the abstraction guarantees that it cannot be awoken for other reasons (e.g. off a timeout). However and event from the frontend will certainly cause the evtchn fd poll to wake up. > If xen_block_dataplane_stop() grab the context lock first, then the > iothread dealing with the event may wait on the lock until its released. > > By the time the lock is grabbed, we may have free all the resources > (including srings). So the event iothread will end up to dereference a > NULL pointer. > I think the problem may actually be that xen_block_dataplane_event() does not acquire the context and thus is not synchronized with xen_block_dataplane_stop(). The documentation in multiple-iothreads.txt is not clear whether a poll handler called by an iothread needs to acquire the context though; TBH I would not have thought it necessary. > It feels to me we need a way to quiesce all the iothreads (blk, > event,...) before continuing. But I am a bit unsure how to do this in > QEMU. > Looking at virtio-blk.c I see that it does seem to close off its evtchn equivalent from iothread context via aio_wait_bh_oneshot(). So I wonder whether the 'right' thing to do is to call xen_device_unbind_event_channel() using the same mechanism to ensure xen_block_dataplane_event() can't race. Paul > Cheers, > > -- > Julien Grall _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |