
Re: [Xen-devel] [PATCH] Re: Crash on blktap shutdown



On Thu, 2010-02-25 at 18:18 -0500, Jeremy Fitzhardinge wrote:
> On 02/24/2010 07:03 PM, Daniel Stodden wrote:
> > On Wed, 2010-02-24 at 20:47 -0500, Daniel Stodden wrote:
> >    
> >> On Wed, 2010-02-24 at 19:37 -0500, Jeremy Fitzhardinge wrote:
> >>      
> >>> On 02/24/2010 04:29 PM, Daniel Stodden wrote:
> >>>        
> >>>> On Wed, 2010-02-24 at 18:52 -0500, Jeremy Fitzhardinge wrote:
> >>>>
> >>>>          
> >>>>> On 02/24/2010 03:49 PM, Daniel Stodden wrote:
> >>>>>
> >>>>>            
> >>>>>> On Wed, 2010-02-24 at 17:55 -0500, Jeremy Fitzhardinge wrote:
> >>>>>>
> >>>>>>
> >>>>>>              
> >>>>>>> When rebooting the machine, I got this crash from blktap. The rip
> >>>>>>> maps to line 262 in
> >>>>>>> 0xffffffff812548a1 is in blktap_request_pool_free
> >>>>>>> (/home/jeremy/git/linux/drivers/xen/blktap/request.c:262).
> >>>>>>>
> >>>>>>>
> >>>>>>>                
> >>>>>> Uhm, where did that RIP come from?
> >>>>>>
> >>>>>> pool_free is on the module exit path. The stack trace below looks
> >>>>>> like a crash from the broadcasted SIGTERM before reboot.
> >>>>>>
> >>>>>>
> >>>>>>              
> >>>>> Ignore it; I generated it from a different kernel from the one that
> >>>>> crashed.  But the other oops I posted should be all consistent and
> >>>>> meaningful.
> >>>>>
> >>>>>            
> >>>> Ignore only the debuginfo quote, right?
> >>>> Cos this looks like a different issue to me.
> >>>>
> >>>>          
> >>> Perhaps.  I got all the others on normal domain shutdown, but this one
> >>> was on machine reboot.  I'll try to repro (as I boot the test kernel
> >>> with your patch in it).
> >>>        
> >> (gdb) list *(blktap_device_restart+0x7a)
> >> 0x2a73 is in blktap_device_restart
> >> (/local/exp/dns/scratch/xenbits/xen-unstable.hg/linux-2.6-pvops.git/drivers/xen/blktap/device.c:920).
> >> 915         /* Re-enable calldowns. */
> >> 916         if (blk_queue_stopped(dev->gd->queue))
> >> 917                 blk_start_queue(dev->gd->queue);
> >> 918
> >> 919         /* Kick things off immediately. */
> >> 920         blktap_device_do_request(dev->gd->queue);
> >> 921
> >> 922         spin_unlock_irq(&dev->lock);
> >> 923 }
> >> 924
> >>
> >> I'm assuming we've been dereferencing a NULL gendisk, i.e.
> >> device_destroy racing against device_restart.
> >>
> >> That would take:
> >>
> >>   * Tapdisk killed on the other thread, which goes through into
> >>     a device_restart(). Which is what your stacktrace shows.
> >>
> >>   * Device removal pending, blocking until
> >>     device->users drops to 0, then doing the device_destroy().
> >>     That might have happened during bdev .release.
> >>
> >> Both running at the same time sounds like what happens if you kill them
> >> all at once.
> >>
> >> That clearly takes another patch then.
> >>      
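To make the suspected race concrete, here is a minimal userspace sketch. It is not the attached patch and not the blktap code: every fake_* name is hypothetical, and a pthread mutex stands in for dev->lock. The idea it illustrates is only that the destroy path clears the disk pointer under the lock, and the restart path rechecks that pointer under the same lock before dereferencing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the gendisk and the tap device. */
struct fake_disk {
    int kicked;                 /* how often the queue was kicked */
};

struct fake_dev {
    pthread_mutex_t lock;       /* stands in for dev->lock */
    struct fake_disk *gd;       /* NULL once the device is destroyed */
};

static void fake_dev_init(struct fake_dev *dev, struct fake_disk *gd)
{
    assert(dev != NULL);
    pthread_mutex_init(&dev->lock, NULL);
    dev->gd = gd;
}

/* Destroy path: drop the disk pointer while holding the lock. */
static void fake_device_destroy(struct fake_dev *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->gd = NULL;
    pthread_mutex_unlock(&dev->lock);
}

/*
 * Restart path: only touch the disk if it still exists.
 * Returns 0 if the queue was kicked, -1 if the device was already gone.
 */
static int fake_device_restart(struct fake_dev *dev)
{
    int ret = -1;

    pthread_mutex_lock(&dev->lock);
    if (dev->gd != NULL) {
        dev->gd->kicked++;
        ret = 0;
    }
    pthread_mutex_unlock(&dev->lock);

    return ret;
}
```

Without the NULL check, a restart arriving after the destroy would dereference a NULL pointer, which is the failure mode assumed above.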
> > Jeremy,
> >
> > can you try out the attached patch for me?
> >
> > This should close the above shutdown race as well.
> >
> > Should be nowhere near as frequent as the timer_sync crash fixed earlier.
> >    
> 
> Hm, the two patches changed things but I'm still seeing problems on 
> domain shutdown.  Still looks like use-after-free.

All these new-fashioned debug switches. Only causing trouble.

This is yet another, separate issue. The sysfs code was causing a double
unref on the ring device.
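
For anyone following along, a double unref can be modelled in a few lines of userspace C. This is only a sketch under assumptions: fake_ring and its helpers are hypothetical names, and the plain integer counter is a stand-in for whatever reference counting the ring device really uses.

```c
#include <assert.h>

/* Hypothetical refcounted object standing in for the ring device. */
struct fake_ring {
    int refs;    /* outstanding references */
    int freed;   /* set once the last reference is dropped */
};

/* Take one reference. */
static void fake_ring_get(struct fake_ring *r)
{
    r->refs++;
}

/*
 * Drop one reference. Returns 1 if this call released the object,
 * 0 otherwise. Dropping a reference that was never taken trips the assert.
 */
static int fake_ring_put(struct fake_ring *r)
{
    assert(r->refs > 0);
    if (--r->refs == 0) {
        r->freed = 1;
        return 1;
    }
    return 0;
}
```

If the owner holds one reference and a second user takes one but drops two, the second put releases the object while the owner still believes it is alive. Any later access by the owner is then a use-after-free, which matches the symptom reported above.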

Daniel

