[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] tools/hotplug: fix locking



On Thu, Jun 21, 2012 at 08:08:51AM -0400, Zhigang Wang wrote:
> On 06/21/2012 07:49 AM, Ian Campbell wrote:
> > On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote:
> >> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote:
> >>> On 06/20/2012 11:34 AM, Ian Campbell wrote:
> >>>> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote:
> >>>>> # HG changeset patch
> >>>>> # User Zhigang Wang <zhigang.x.wang@xxxxxxxxxx>
> >>>>> # Date 1339608534 14400
> >>>>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8
> >>>>> # Parent  32034d1914a607d7b6f1f060352b4cac973600f8
> >>>>> tools/hotplug: fix locking
> >>>> A better title would be "tools/hotplug: Switch to locking with flock"
> >>>> since "fix" is not very descriptive.
> >>> Agree.
> >>>> The commit message should also state why changing this scheme is
> >>>> preferable to fixing the current one.
> >>> I have two points:
> >>>
> >>> 1. No timeout: in the old one, if one process holding the lock more than 
> >>> 100
> >>>    seconds, other processes could steal it.
> >>> 2. No leftovers: if a process holding this lock is dead, it will close 
> >>> the lock
> >>>    file, so it will not affect other processes.
> >>>
> >>>> Is this flock tool widely available? Does it need to be added to the
> >>>> dependencies in the README?
> >>> It is widely distributed: it's part of the util-linux package. So I think 
> >>> no
> >>> need to document it.
> >>>>> The current locking implementation would allow two processes get the 
> >>>>> lock
> >>>>> simultaneously:
> >>>> [...snip shell trace...]
> >>>>
> >>>> Can you add a line or two of analysis to explain where in that shell
> >>>> spew things are actually going wrong and why?
> >>> I didn't spent much time on this complicated implement. But it fails for 
> >>> me and
> >>> also for others (from response).
> >>>>> This patch is ported from Red Hat Enterprise Linux 5.8.
> >>>> I think we need a Signed-off-by from the original patch author. (Unless
> >>>> DCO clause b somehow applies?)
> >>> I'm not sure how to handle this. There's no signed-off-by on the original 
> >>> Red
> >>> Hat patch. Could anyone in Red Hat help to get it signed off?
> >> Perhaps Andrew Jones or Don Dutile can point us in the right direction
> >> (both CCd) if you identify the precise source of the patch (i.e. the
> >> name of the .src.rpm and the name of the patch within it)?
> >>
> >> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly
> >> recent) but maybe I'm looking in the wrong place.
> > Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I'd have sworn
> > that RH shipped kernel+xen in a single source package, oh well).
> >
> > %changelog says:
> >         * Fri Sep 14 2007 Daniel P. Berrange <berrange@xxxxxxxxxx> - 
> > 3.0.3-40.el5
> >         - Rewrite locking in hotplug scripts to fix timeouts (rhbz #267861)
> >
> > Daniel CCd.
> >
> > I wonder if there were any followup fixes or changes, especially wrt to
> > your point 1 above (the lack of a timeout now) requiring xend (or now
> > libxl) changes?
> 
> We have another timeout in xend/xl: if device backend is not setup properly, 
> the
> device creation will fail.
> 
> Without any timeout for this locking, all backend uevent scripts will hang if
> one goes wrong. Just stealing the lock upon timeoutis not a good idea: it may
> cause error or damage. What about adding a long (10minutes) timeout? (by 
> adding
> -w/--wait/--timeout 600)

Timeouts are a really bad idea for this area of code. If you make the
timeout really long (10 mins as you suggest), then no other guest can
be started during this entire 10 minute window if something goes wrong.
IMHO this is unacceptable.  If you make the timeout short enough that
you don't unduly delay other VM startup operations, then you will often
hit bogus timeouts under high load.

In practice the scripts should not hang unless something is seriously
wrong, in which case timing out & pretending every thing is still fine
for the future is bogus.  If you have real world examples of hangs
then you should focus efforts on fixing the hangs, rather than papering
over the problem with a timeout.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.