
RE: [Xen-devel] RE: Bug in tmem: refcount leak leaves zombie saved domains



> I think both of these should be backported for Xen 4.0.1?

Thanks for handling it.  I agree it should go into Xen 4.0.1,
but I won't be able to give it much testing before that release
ships (unless it takes a few more RCs), so if you are confident
and Jan doesn't see any problems, go for it.

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
> Sent: Thursday, June 10, 2010 3:43 PM
> To: Dan Magenheimer; xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: JBeulich@xxxxxxxxxx
> Subject: Re: [Xen-devel] RE: Bug in tmem: refcount leak leaves zombie
> saved domains
> 
> On 10/06/2010 21:06, "Keir Fraser" <keir.fraser@xxxxxxxxxxxxx> wrote:
> 
> > This patch looks good, I'll test it tomorrow.
> 
> It tests okay so I applied it as xen-unstable:21595.
> 
> > What you *do* need to do when setting up a new tmem client is check
> > that the associated domain is not dying. Without that check the code
> > is in fact currently buggy: you can end up with a zombie domain that
> > is a client of tmem and will never stop being a client because it
> > became a client after tmem_destroy() was called on it.
> 
> I implemented this as xen-unstable:21596. Take a look. It's pretty
> straightforward.
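
For illustration, a minimal sketch of the kind of check being described,
assuming a client-creation path roughly like the one in tmem.c (the helper
names, e.g. client_create() and do_client_init(), are assumptions, not
necessarily what went in as 21596):

    /* Sketch only: refuse to attach a tmem client to a dying domain,
     * otherwise tmem_destroy() has already run and the reference taken
     * here could never be dropped, leaving a zombie domain. */
    static struct client *client_create(domid_t cli_id)
    {
        struct domain *d = rcu_lock_domain_by_id(cli_id);
        struct client *client = NULL;

        if ( d == NULL )
            return NULL;

        if ( !d->is_dying && get_domain(d) )
        {
            client = do_client_init(d);    /* hypothetical helper */
            if ( client == NULL )
                put_domain(d);             /* don't leak the ref on failure */
        }

        rcu_unlock_domain(d);
        return client;
    }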
> 
> I think both of these should be backported for Xen 4.0.1?
> 
>  -- Keir
> 
> > Does that make sense?
> >
> >  -- Keir
> >
> > On 10/06/2010 18:54, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx>
> wrote:
> >
> >> Could you give the attached a try on your test case?  If
> >> it passes and Jan thinks it is OK (as I backed out most of
> >> his patch at cs 20918), then:
> >>
> >> Tmem: fix domain refcount leak by returning to simpler model
> >> which claims a ref once when the tmem client is first associated
> >> with the domain, and puts it once when the tmem client is
> >> destroyed.
> >>
> >> Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
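
A rough sketch of the model described above: the reference claimed once at
association (as in the creation sketch earlier) is put exactly once when the
client is torn down (the structure and field names are assumptions):

    /* Sketch only: the single matching put_domain() at client teardown,
     * balancing the get_domain() taken when the client was first
     * associated with the domain. */
    static void client_destroy(struct client *client)
    {
        put_domain(client->domain);    /* field name assumed */
        xfree(client);
    }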
> >>
> >>> If you have a handle on a domain already I wonder why you need to
> >>> continually look up by domid...
> >>
> >> Nearly all tmem uses are via current->domain.  The remaining
> >> (such as from the tools) are via a specified domid.  I don't
> >> keep a domid->domain lookup table around as the frequency is
> >> very low and the existing mechanism works fine (or it does
> >> if I use it correctly anyway ;-)
> >>
> >>> RCU locking is fully implemented already. It's highly unlikely to
> >>> change in future and we can work out something else for your case
> >>> if that happens.
> >>
> >> I guess I was confused by the fact that the rcu_lock/unlock macros
> >> are no-ops.
> >>
> >> In any case, I think I understand the semantics well enough now
> >> after your second reply pointing me to rcu_unlock_domain, so
> >> I think the attached patch should avoid special cases in the
> >> future.
> >>
> >>>> I'd like to do a get_domain_by_id() without doing a get_domain()
> >>>> as the tmem code need only get_domain() once on first use
> >>>> and put_domain() once when destroying, but frequently needs
> >>>> to lookup a domain by id.
> >>>
> >>> If you have a handle on a domain already I wonder why you need to
> >>> continually look up by domid...
> >>>
> >>>> It looks like rcu_lock_domain_by_id() does what I need, but
> >>>> I don't need any rcu critical sections (outside of the domain
> >>>> lookup itself) and am fearful that if rcu locking ever is fully
> >>>> implemented, my use of rcu_lock_domain_by_id() would become
> >>>> incorrect and I may have a problem.  Should I create an equivalent
> >>>> get_domain_by_id_no_ref()?  Or am I misunderstanding something?
> >>>
> >>> If you really know what you're doing, I suggest just have your own
> >>> tmh_lookup_domain() macro mapping onto rcu_lock_domain_by_id(). RCU
> >>> locking is fully implemented already. It's highly unlikely to change
> >>> in future and we can work out something else for your case if that
> >>> happens.
> >>>
> >>> I'm not keen on providing an explicitly synchronisation-free version
> >>> in common code. It just encourages people not to think about
> >>> synchronisation at all.
> >>>
> >>>  -- Keir
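
The wrapper suggested here can be as small as a pair of tmem-local macros; a
sketch, assuming tmh_-prefixed names (the exact names and header are
assumptions):

    /* Sketch only: give tmem its own names for domain lookup/release so
     * the RCU-based helpers are used consistently from one place and can
     * be swapped out later if the RCU semantics ever change. */
    #define tmh_lookup_domain(id)   rcu_lock_domain_by_id(id)
    #define tmh_release_domain(d)   rcu_unlock_domain(d)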
> >>>
> >>>> Semi-related, rcu_lock_domain_by_id() has a "return d" inside
> >>>> the for loop without an rcu_read_unlock().  I see that this
> >>>> is irrelevant now because rcu_read_unlock() is a no-op anyway,
> >>>> but maybe this should be cleaned up for the same reason --
> >>>> in case rcu locking is ever fully implemented.
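
For reference, the shape of the code path being described, paraphrased rather
than quoted from xen/common/domain.c (details may differ from the real
source):

    /* Paraphrased sketch of rcu_lock_domain_by_id(), not the exact code. */
    struct domain *rcu_lock_domain_by_id(domid_t dom)
    {
        struct domain *d;

        rcu_read_lock(&domlist_read_lock);
        for_each_domain ( d )
        {
            if ( d->domain_id == dom )
                return d;   /* early return, no rcu_read_unlock() on this
                             * path; the caller ends the read-side section
                             * via rcu_unlock_domain(d) */
        }
        rcu_read_unlock(&domlist_read_lock);
        return NULL;
    }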
> >>>>
> >>>> Thanks,
> >>>> Dan
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Dan Magenheimer
> >>>>> Sent: Thursday, June 10, 2010 7:08 AM
> >>>>> To: Keir Fraser; xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>> Subject: [Xen-devel] RE: Bug in tmem: refcount leak leaves zombie
> >>>>> saved domains
> >>>>>
> >>>>> OK, will take a look.
> >>>>>
> >>>>> IIRC, Jan's work to fix the domain reference stuff just
> >>>>> before 4.0 shipped was a heavy hammer, but since it seemed
> >>>>> to work I didn't want to mess with it so close to release...
> >>>>> really there's only a need to take a reference once on
> >>>>> first use and release it at shutdown, rather than
> >>>>> take/release frequently.  IIRC, I had used a macro that
> >>>>> took references when they weren't really needed, and
> >>>>> Jan added the matching macros that did the release.
> >>>>>
> >>>>> Dan
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
> >>>>>> Sent: Thursday, June 10, 2010 3:47 AM
> >>>>>> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>> Cc: Dan Magenheimer
> >>>>>> Subject: Bug in tmem: refcount leak leaves zombie saved domains
> >>>>>>
> >>>>>> Dan,
> >>>>>>
> >>>>>> Just doing some save/restore testing on xen-unstable tip, I
> >>>>>> noticed that:
> >>>>>> # xm create ./pv_config
> >>>>>> # xm save PV1
> >>>>>>
> >>>>>> Would leave the saved guest as a zombie in the DOMDYING_dead
> >>>>>> state with no pages, yet with refcnt=1. This happens absolutely
> >>>>>> consistently. Just as consistently, it does not happen when I
> >>>>>> boot Xen with no-tmem. My conclusion is that tmem is leaking a
> >>>>>> domain reference count during domain save. This doesn't happen
> >>>>>> if I merely "xm create ...; xm destroy ...".
> >>>>>>
> >>>>>> My pv_config file contains nothing exciting:
> >>>>>> kernel = "/nfs/keir/xen/xen64.hg/dist/install/boot/vmlinuz-2.6.18.8-xenU"
> >>>>>> memory = 750
> >>>>>> name = "PV1"
> >>>>>> vcpus = 2
> >>>>>> vif = [ 'mac=00:1a:00:00:01:01' ]
> >>>>>> disk = [ 'phy:/dev/VG/Suse10.1_64_1,sda1,w' ]
> >>>>>> root = "/dev/sda1 ro xencons=tty"
> >>>>>> extra = ""
> >>>>>> tsc_native = 1
> >>>>>> on_poweroff = 'destroy'
> >>>>>> on_reboot   = 'restart'
> >>>>>> on_crash    = 'preserve'
> >>>>>>
> >>>>>> The dom{0,U} kernels are tip of linux-2.6.18-xen, default
> >>>>>> -xen{0,U} configs.
> >>>>>>
> >>>>>>  -- Keir
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Xen-devel mailing list
> >>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>>> http://lists.xensource.com/xen-devel
> >>>
> >>>
> >
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > http://lists.xensource.com/xen-devel
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

