|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [BUG] xl devd segmentation fault on xl block-detach
Wei Liu writes ("Re: [Xen-devel] [BUG] xl devd segmentation fault on xl
block-detach"):
> Can you give the following patch a try? This applies to 4.8.
>
> Not sure if there is a better way to fix it though. Ian and Roger?
I find the logic here rather awkward. I do remember reviewing it and
becoming a bit confused at the time and it seems that even though I
eventually convinced myself it was OK, I was wrong.
> From: Wei Liu <wei.liu2@xxxxxxxxxx>
> Date: Wed, 3 May 2017 17:55:42 +0100
> Subject: [PATCH] libxl: fix backend_watch_callback
>
> That function needs to cope with spurious events. The original "skip"
> path blindly freed dguest even when it needed to stay in ddomain list.
>
> Free dguest iff it is newly added to the list. That way we don't free
> the one that should stay on the list and we don't unnecessarily add a
> stale dguest entry to ddomain list.
AFAICT right now you are right. But I see another possible way of
fixing it:
How about moving the num_devs == 0 check, and associated cleanup, to
the exit path ? That way a if new guest struct is spuriously
allocated, it will automatically be freed. It would mean that the
freeing of dguest would depend only on other invariants already in the
code, rather than on explicit tracking.
The invariants are, I think:
* Any libxl__ddomain_device is either
* on some list libxl__ddomain_guest->devices
* being processed for removal, and referenced by a device
remove async call initiated by remove_device and which will
call device_complete() when done
but not both!
* Any libxl__domain_guest is on the list libxl__ddomain->guests.
The above apply even within any function, except very briefly when
transitioning from one state to another (eg creation, destruction).
* SUM(libxl__domain_guest->num_*) != 0, when we return from the
outermost callback. (Ie, there are no leftover empty guest
structs.)
Thinking about this like this, and observing the control flow, leads
me to think I have found another bug.
Consider what happens if a device is removed while it is still being
added. That is, an event comes in which causes us to call add_device.
add_device sets up the callback and starts doing work (eg hotplug
scripts). Before that finishes, the device is removed again.
backend_watch_callback will tear the device down and free dev.
But dev is still referenced by the add_device operation, and when it
completes, device_complete will call
libxl__device_backend_path(gc, aodev->dev)
There ought to be a (perhaps implicit) invariant that
* Any dev referenced by an aodev call is legit
But this invariant is violated by backend_watch_callback, which frees
it despite it not knowing whether there is a callback in flight.
Perhaps we should do explicit reference counting.
Ian.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |