
Re: [Xen-devel] Race condition on device add handling in xl devd

On Thu, Feb 28, 2019 at 11:08:37AM +0100, Roger Pau Monné wrote:
> On Mon, Feb 25, 2019 at 12:14:02AM +0100, Marek Marczykowski-Górecki wrote:
> > On Mon, Dec 17, 2018 at 05:09:19PM +0100, Roger Pau Monné wrote:
> > > On Mon, Dec 17, 2018 at 02:42:23PM +0000, Paul Durrant wrote:
> > > > I suspect I must be remembering a XenServer-specific hack^Wpatch then. 
> > > > I'd have to dig... it's been a while since I messed with the netif 
> > > > state model, which is of course different from the blkif state model.
> > > 
> > > Quite likely. With udev scripts it was feasible to only execute
> > > hotplug scripts for vifs with an attached frontend.
> > > 
> > > With libxl this is not possible, since hotplug scripts are run during
> > > domain creation, at which point the guest is completely paused.
> > > 
> > > I'm not that familiar with bridges and vifs, but maybe the vifs status
> > > can be set to offline until there's a frontend attached in order to
> > > reduce the bridge distributor load? (if that's not already the case).
> > 
> > I've found what the problem was, and with some definition of "race condition"
> > it could be named this way.
> > The problem is that for some reason the xenstore watch on device add
> > sometimes does not fire in xl devd. But then, when libxl in dom0
> > times out and removes the device, the xenstore watch in xl devd fires and
> > the hotplug script is called. At this point the device is already gone, so
> > it fails. xl devd then quickly calls the hotplug script a second time, for
> > device removal.
> > 
> > I have no idea why this xenstore watch does not fire, but triggering a
> > no-op write into the watched path (to trigger the watch again) works around
> > the problem. I use a xenstore watch in dom0 for that[1] - which works.
> > I suspect something related to KVM nested virtualization (lost
> > interrupt?)...
> 
> That's very weird, could you try to run xenstored in dom0 with trace
> enabled [0] in order to try to figure out what's happening?

I've tried already, but it was way too slow (remember it's nested KVM,
it doesn't really improve the performance). I hit multiple timeouts even
without hitting this problem. Unfortunately I don't have logs from that
experiment anymore.

I can try again...

> I assume this only happens when running nested in KVM?

I'd say so. I'm not entirely sure, because I've seen similar symptoms on
bare metal Xen too in the past, but I think it could be a different
problem and also I haven't seen it in the past 3 months.

Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
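
Note for readers of the archive: the workaround described in the quoted
message (a no-op write into the watched path, so that the watch fires
again) can be sketched roughly as follows. This is not the script
referenced as [1] in the thread; it is a minimal illustration that shells
out to the xenstore-read and xenstore-write command line tools. The
function name, the `run` parameter (there only so the helper can be
exercised without a running xenstored) and the choice of path are
assumptions made for this example.

```python
import subprocess


def rewrite_same_value(path, run=subprocess.run):
    """Read `path` from xenstore and write the same value back.

    The stored value does not change, but the write itself is what is
    expected to make xenstored fire any watch covering `path` again.
    `path` is supplied by the caller; which node to touch is an
    assumption of this sketch, not something stated in the thread.
    """
    # xenstore-read prints the value followed by a newline.
    read = run(["xenstore-read", path],
               check=True, capture_output=True, text=True)
    value = read.stdout.rstrip("\n")
    # Writing the identical value back is the "no-op write".
    run(["xenstore-write", path, value], check=True)
    return value
```

The read and the write are two separate commands, so this is not atomic:
if something else updates the node in between, the older value would be
written back over it. A real workaround would need to account for that.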

