[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Crashing / unable to start domUs due to high number of luns?


  • To: Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>
  • From: Nathan March <nathan@xxxxxx>
  • Date: Wed, 01 Feb 2012 11:48:23 -0800
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 01 Feb 2012 19:48:41 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On 1/31/2012 5:30 PM, Konrad Rzeszutek Wilk wrote:
On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
Hi All,

We've got a Xen setup based around a Dell iSCSI device, with each Xen
host having two LUNs; we then run multipath on top of that. After adding
a couple of new virtual disks the other day, a couple of our stable,
online VMs suddenly hard-locked. Attaching to the console gave me nothing;
it looked like they had lost their disk devices.

Attempting to restart them on the same dom0 failed with hotplug errors,
as did attempting to start them on a few different dom0s. After doing a
"multipath -F" to remove unused devices and manually bringing in just
the selected LUNs via "multipath diskname", I was able to start them
successfully. This initially made me think I might be hitting some sort
of udev / multipath / iSCSI device LUN limit (136 LUNs, 8 paths per LUN
= 1088 iSCSI sessions). To be clear, the problem occurred on multiple
dom0s at the same time, so it definitely seems iSCSI-related.
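The arithmetic behind that suspicion, plus the recovery steps described above, can be sketched as follows (the map name is illustrative, and the multipath commands are left commented since they alter live device maps):

```shell
# Back-of-envelope session count from the figures above
# (136 LUNs, 8 paths each).
luns=136
paths_per_lun=8
sessions=$((luns * paths_per_lun))
echo "$sessions iSCSI sessions"

# Recovery steps used above (map name "diskname" is illustrative):
#   multipath -F          # flush all unused multipath maps
#   multipath diskname    # re-add only the named map
```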

Now, a day later, I'm debugging this further and am again unable to
start VMs, even with all extra multipath devices removed. I rebooted
one of the dom0s and was able to migrate our production VMs off a
broken server, so I now have an empty dom0 that's unable to start
any VMs.

Starting a VM results in the following in xend.log:

[2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
[2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
[2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
    return op_method(op, req)
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
    return self.dom.waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
    self.getDeviceController(devclass).waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
    (devid, self.deviceClass))
VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.

Was there anything in the kernel log (dmesg) about vifs? What does your
/proc/interrupts look like? Can you provide the dmesg output you get
during startup? I am mainly looking for:

NR_IRQS:16640 nr_irqs:1536 16

How many guests are you running when this happens?

One theory is that you are running out of dom0 interrupts. Though
I *think* that was made dynamic in 3.0..


Though that does explain your iSCSI network going wonky in the guest -
was there anything in the dmesg when the guest started going bad?

I was running approximately 15 guests, although the problem persisted after migrating them off.

Nothing in dmesg (dom0 dmesg or xm dmesg) looked abnormal at all, and there were no references to vifs. Aside from the inability to start a VM, I couldn't find any sort of error anywhere.

All the hosts show the same IRQ counts:

[   34.903763] NR_IRQS:4352 nr_irqs:4352 16

Unfortunately I'm not able to reproduce this now, but I've posted several different copies of /proc/interrupts here: http://pastebin.com/n7PWNeaZ

Full xm / kernel dmesg is uploaded here: http://pastebin.com/AtCvFBDS

[2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
[2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model

I tried turning up udev's log level but that didn't reveal anything.
Reading the xenstore for the vif doesn't show anything unusual either:

ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
0 = ""
  bridge = "vlan91"
  domain = "nathanxenuk1"
  handle = "0"
  uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
  script = "/etc/xen/scripts/vif-bridge"
  state = "1"
  frontend = "/local/domain/35/device/vif/0"
  mac = "00:16:3d:03:00:44"
  online = "1"
  frontend-id = "35"
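For reference, the `state = "1"` above can be decoded with a small helper; this is a sketch based on the XenbusState enumeration in Xen's xen/io/xenbus.h. State 1 (Initialising) plus the absence of a hotplug-status node is exactly the condition DevController times out on:

```shell
# Map a backend "state" key (XenbusState enum) to its name.
xenbus_state_name() {
  case "$1" in
    1) echo Initialising ;;
    2) echo InitWait ;;
    3) echo Initialised ;;
    4) echo Connected ;;
    5) echo Closing ;;
    6) echo Closed ;;
    *) echo Unknown ;;
  esac
}

# The vif backend above reports state 1: the frontend/backend handshake
# never progressed, and the hotplug scripts never wrote hotplug-status.
xenbus_state_name 1
```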

The bridge device (vlan91) exists, and trying a different bridge makes
no difference. Removing the vif completely results in the same error for
the VBD. Adding debugging to the hotplug/network scripts didn't reveal
anything; it looks like they aren't even being executed. Nothing is
logged to xen-hotplug.log.
OK, so that would imply the kernel hasn't been able to do the right
thing. Hmm.

What do you see when this happens with udevadm monitor --kernel --udev
--property ?

The remaining server I thought was exhibiting this apparently isn't (I was probably mistaken), so the two that definitely were have been rebooted, and I can't reproduce this at the moment.

I've been abusing a free server all morning with a loop that repeatedly spawns and shuts down a VM, then flushes and rescans multipath, to see if I can reproduce this. No luck so far, unfortunately, but I'll keep trying.
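The loop described above might look something like this; the guest name, config path, and iteration count are illustrative, and DRY_RUN defaults to echoing the commands so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Spawn/shutdown stress loop with a multipath flush/rescan each pass.
# Set DRY_RUN= (empty) to actually execute the commands.
DRY_RUN=${DRY_RUN-echo}

stress_loop() {
  for i in 1 2 3; do
    $DRY_RUN xm create /etc/xen/testvm.cfg   # start the test guest
    $DRY_RUN xm shutdown -w testvm           # wait for a clean shutdown
    $DRY_RUN multipath -F                    # flush unused multipath maps
    $DRY_RUN multipath -r                    # reload/rescan the maps
  done
}

stress_loop
```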


The only thing I can think of that may be related: Gentoo defaulted to
a 10 MB /dev, which we filled up a few months back. We upped the size to
50 MB in the mount options, and everything has been completely stable
since (~33 days). None of the /dev filesystems on the dom0s is above
25% usage. Aside from adding the new LUNs, no changes have been made in
the past month.

To test whether removing some devices would help, I tried an
"iscsiadm -m node --logout", and it promptly hard-locked the entire
box. After a reboot, I was unable to reproduce the problem on that
particular dom0.

I've still got one dom0 that's exhibiting the problem, if anyone is able
to suggest any further debugging steps?

- Nathan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


--
Nathan March<nathan@xxxxxx>
Gossamer Threads Inc. http://www.gossamer-threads.com/
Tel: (604) 687-5804 Fax: (604) 687-5806




 

