On Mon, 2010-08-30 at 16:30 -0400, Scott Garron wrote:
> On 08/30/2010 03:13 PM, Daniel Stodden wrote:
> > Are you sure it's spinning or just freezing?
>
> I'm not sure that I understand the difference between those two
> terms, so I'm going to guess "freezing" is probably a more accurate
> description. The best way to describe what I was seeing was that my
> scripted backup procedure would get to a certain point and freeze, then
> I wouldn't be able to break out of it or issue a kill from another SSH
> session on its PID. The kill command freezes the same way (never
> returns to a shell prompt and pressing CTRL-C just shows ^C on the
> display without breaking out).
If it were just one or more tasks hanging at first, caught in some
wait state, then identifying the point where things broke can sometimes
be quite straightforward. That doesn't seem to be the case here.
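One quick way to tell the two cases apart is to look for tasks in
uninterruptible sleep (state D). Such tasks typically don't react to
signals until they wake up, which would match kill and CTRL-C having no
effect. A task that is spinning would instead show up as runnable (R)
and burn CPU. Below is a minimal Python sketch, assuming /proc is still
readable from a shell opened before the hang; it reports roughly what
the wchan column of ps shows, filtered to D-state tasks. The function
names are hypothetical helpers, not part of any existing tool.

```python
import os

def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state

def stuck_tasks(proc="/proc"):
    """Yield (pid, comm, wchan) for tasks in uninterruptible sleep ('D')."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            # wchan names the kernel function the task is sleeping in.
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            continue  # the task exited between listdir() and open()
        yield pid, comm, wchan
```

Running `for t in stuck_tasks(): print(*t)` a few times would show
whether the same tasks stay parked on the same wait channel.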
> > Can you try to find the minimum number of steps necessary to get into
> > that state and try something like $ ps -eH -owchan,nwchan,cmd
>
> The minimum number of steps that I took, just now, to make it
> happen was as follows:
>
> There's an HVM domU that's active and running Windows 2008 Server,
> called "scrappy", with the following Xen configuration:
>
> kernel = "hvmloader"
> builder='hvm'
> memory = 768
> name = "scrappy"
> vcpus=1
> vif = [ 'type=ioemu, mac=00:16:3e:00:00:18, bridge=eth0',
>         'type=ioemu, mac=00:16:3e:00:00:19, bridge=xenbr1',
>         'type=ioemu, mac=00:16:3e:00:00:1A, bridge=xenbr2' ]
> disk = [ 'phy:hurricanevg1/scrappy-primarymaster,xvda,w',
> 'file:/mnt/scratch/WindowsServerStd2008OEM_x86-64.iso,xvdb:cdrom,r',
> 'phy:hurricanevg1/scrappy-secondarymaster,xvdc,w' ]
> on_reboot = 'restart'
> device_model = 'qemu-dm'
> sdl=0
> opengl=1
> vnc=1
> vnclisten="192.168.0.90"
> vncdisplay=3
> vncunused=1
> stdvga=0
> serial='pty'
> tsc_mode=0
> localtime=1
> rtc_timeoffset=-3600
>
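For reference, the disk list above hands two LVM volumes to the guest
read-write through phy: backends, so both snapshot origins are attached
to the running HVM domU while the snapshots are created and removed. A
small Python sketch of that reading, assuming the simple
target,vdev,mode form used in this config (paths containing commas
would break it); parse_disk_spec and phy_volumes are hypothetical
helpers.

```python
def parse_disk_spec(spec):
    """Split an xm-style disk entry such as 'phy:vg/lv,xvda,w' into parts."""
    target_part, vdev, mode = [s.strip() for s in spec.split(",")]
    backend, _, target = target_part.partition(":")
    # A ':cdrom' suffix on the virtual device marks a CD-ROM drive.
    vdev, _, kind = vdev.partition(":")
    return {"backend": backend, "target": target, "vdev": vdev,
            "cdrom": kind == "cdrom", "mode": mode}

def phy_volumes(disks):
    """Return the block-device targets the guest has attached read-write."""
    return [d["target"] for d in map(parse_disk_spec, disks)
            if d["backend"] == "phy" and d["mode"] == "w"]

# The disk list from the configuration above.
disk = [ 'phy:hurricanevg1/scrappy-primarymaster,xvda,w',
         'file:/mnt/scratch/WindowsServerStd2008OEM_x86-64.iso,xvdb:cdrom,r',
         'phy:hurricanevg1/scrappy-secondarymaster,xvdc,w' ]
```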
>
> While that's running, I created a snapshot of the primarymaster
> volume and removed it, created one for the secondarymaster and removed
> it, then created another one for the primarymaster and tried to remove
> it, at which point the lvremove command froze. A minute or two later, I
> got a kernel OOPS message on my console similar to the one that I
> posted before. These are the commands that I used to create and remove
> the volumes:
>
> lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s hurricanevg1/scrappy-primarymaster
>
> lvremove hurricanevg1/scrappy-primarymaster-backupsnap
>
> lvcreate -L 2G -n scrappy-secondarymaster-backupsnap -s hurricanevg1/scrappy-secondarymaster
>
> lvremove hurricanevg1/scrappy-secondarymaster-backupsnap
>
> lvcreate -L 2G -n scrappy-primarymaster-backupsnap -s hurricanevg1/scrappy-primarymaster
>
> lvremove hurricanevg1/scrappy-primarymaster-backupsnap
>
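The sequence above could be scripted so each attempt runs exactly the
same way. This is a minimal Python sketch, assuming the VG name, LV
names and 2G snapshot size from the report; it only builds and returns
the commands unless dry_run=False. The -f on lvremove is an addition
that skips the confirmation prompt for an active volume, since the
original commands were run interactively.

```python
import subprocess

VG = "hurricanevg1"   # volume group from the report
SNAP_SIZE = "2G"      # snapshot size used in the report

def snapshot_cycle(origin, run=subprocess.run, dry_run=True):
    """Create and then remove one snapshot of VG/origin.

    Returns both argument lists. With dry_run=True (the default) nothing
    is executed, which keeps the sketch safe on a production dom0.
    """
    snap = origin + "-backupsnap"
    cmds = [
        ["lvcreate", "-L", SNAP_SIZE, "-n", snap, "-s", f"{VG}/{origin}"],
        # -f: remove the active snapshot without asking for confirmation.
        ["lvremove", "-f", f"{VG}/{snap}"],
    ]
    if not dry_run:
        for cmd in cmds:
            # check=True stops at the first failing command; a command
            # that hangs in the kernel will simply never return here.
            run(cmd, check=True)
    return cmds

# Order from the report: primary, secondary, then primary again.
SEQUENCE = ["scrappy-primarymaster", "scrappy-secondarymaster",
            "scrappy-primarymaster"]

def reproduce(dry_run=True):
    """Return (and optionally run) the full six-command sequence."""
    return [cmd for origin in SEQUENCE
            for cmd in snapshot_cycle(origin, dry_run=dry_run)]
```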
>
> This time, the console froze completely and I couldn't open any new
> SSH sessions into the machine, and couldn't run the ps -eH command that
> you asked for in your previous message. If I go for another attempt,
> I'll try to have a few logins already going so I can try to get that
> output for you. This is a somewhat critical, production server, though,
> so I didn't want to keep bouncing it in the middle of the day.
>
> > Also, is that sequence completely reproducible or does the behaviour
> > change every time? Just trying to see if there's some point where the
> > deadlock ends and corruption like the one quoted below would start.
>
> It seems to be 3 for 3 at this point.
Okay. I guess that won't be simple to repro. I wonder what you are
running in dom0. Distro and version, what you upgraded and what not, any
customized software builds etc.
Given the rate at which you reproduce this, and because only the
snapshots seem to trigger the problem, to me this looks more like an
LVM/DM issue than a pvops-specific one.
Also, it might be worth trying to turn off udev and see whether that
changes anything.
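If udev is stopped for such a test, it's worth confirming the daemon is
really gone before rerunning the snapshot sequence. A minimal Python
sketch, assuming the daemon shows up as udevd (or systemd-udevd on
later systems) in argv[0]; the name set and helpers are hypothetical,
and how to stop udev depends on the distro.

```python
import os

# Assumed daemon names; adjust for the distro in dom0.
UDEV_NAMES = {"udevd", "systemd-udevd"}

def argv0_name(cmdline_bytes):
    """Return the basename of argv[0] from a raw /proc/<pid>/cmdline blob."""
    argv0 = cmdline_bytes.split(b"\0", 1)[0].decode(errors="replace")
    return os.path.basename(argv0)

def udev_pids(proc="/proc"):
    """Return sorted PIDs whose argv[0] looks like a udev daemon."""
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                name = argv0_name(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        # Kernel threads have an empty cmdline and never match.
        if name in UDEV_NAMES:
            pids.append(int(entry))
    return sorted(pids)
```

An empty list from udev_pids() suggests nothing udev-like is left
running.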
Daniel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel