xen-users

Re: [Xen-users] Dell Poweredge 2650 - heavy IO hangs domU machines; xen

To: xen-users@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-users] Dell Poweredge 2650 - heavy IO hangs domU machines; xen 2.0.7, xen kernel 2.6.11.12
From: Stephen Bosch <posting@xxxxxxxxxxx>
Date: Sun, 12 Feb 2006 15:02:00 -0700
In-reply-to: <43ECDA92.2000205@xxxxxxxxxxx>
References: <43ECDA92.2000205@xxxxxxxxxxx>
Hello:

At the off-list suggestion of another user, we have tried adding
'noirqbalance' to the xen start line in grub, we've disabled USB in the
system BIOS, and we've added 'nousb' to the kernel parameters.

The problem is still there, exactly as before, even with all those changes.

*All* the virtual machines lose network connectivity, not just the ones
involved in the backup. We have an LDAP server VM running on this
hardware that is totally idle when this hang happens. We cannot ping or
ssh into any of the VMs. We can get a console using 'xm console', but after
entering the userid, the login times out (after 60 seconds) before we
ever get a password prompt.

I still suspect an interrupt problem: it would appear that the tty is
unable to do a disk read to do authentication. At the same time, the
tape backup process hangs.
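
One possible way to test the interrupt theory is to snapshot
/proc/interrupts (in dom0 and, over 'xm console', in a domU) before and
during the backup and compare the counters. The following is only a
minimal Python sketch: the function names are hypothetical, and it
assumes the usual /proc/interrupts layout (a header row of CPU names,
then one "label: counts... description" row per interrupt source).

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (total_count, description)}."""
    lines = text.strip().splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Per-CPU counters come first; rows such as ERR may have fewer columns.
        per_cpu = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        description = " ".join(fields[len(per_cpu):])
        counts[label.strip()] = (sum(per_cpu), description)
    return counts


def interrupt_deltas(before, after):
    """Return {label: increase in count} for every label present in both snapshots."""
    return {
        label: after[label][0] - before[label][0]
        for label in after
        if label in before
    }


def watch(interval_seconds=10, path="/proc/interrupts"):
    """Take two snapshots interval_seconds apart and print the per-IRQ increase."""
    with open(path) as handle:
        first = parse_interrupts(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        second = parse_interrupts(handle.read())
    for label, delta in sorted(interrupt_deltas(first, second).items()):
        print(f"{label:>6} {delta:>10}  {second[label][1]}")
```

Calling watch() once while the system is healthy and once during the
hang would show whether a particular counter (for example the one for
the RAID controller) stops increasing or climbs unusually fast.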

If we kill the bacula storage daemon on dom0, all of the virtual
machines release and we can log in again. At no point does anything
reboot -- it just hangs, and it's not a fatal hang. If the backup
process stops, whether through a timeout or by forcibly stopping the
storage daemon, the virtual machines are again pingable and we can log
in with either ssh or 'xm console'.

We tried monitoring the memory usage during the backup test by running
'top' in separate console windows. Loads were actually modest and there
was plenty of memory remaining on all the virtual machines (over 1 GB of
free RAM in one case).

To recap: this is a Dell *2650*, not a 2850. It has a Serverworks, not
an Intel chipset. The RAID controller is a PERC 3 DC (LSI Logic) which
uses the Megaraid drivers. The controller firmware has been upgraded to
3.35/1.07, the most recent available.

Note also -- dom0 is unaffected. We can still interact with dom0 without
trouble. This hang affects only the virtual machines.

Cheers,

-Stephen-


Stephen Bosch wrote:
> Hello:
> 
> We are running three domU machines on a Dell 2650 and using Bacula to do
> backups to an Exabyte VXA SCSI tape drive attached to the external
> channel of a PERC 3 DC, with a RAID 1 running on the internal channel.
> 
> Xen version is 2.0.7
> Kernel is xen-kernel-2.6.11.12
> 
> We have the bacula storage daemon running on dom0.
> 
> When we begin a large backup (several gigabytes), all of the domU
> machines will lock up, regardless of whether they are involved in the
> backup or not.
> 
> Characteristics of the lockup:
> - We lose all network connectivity to all of them. We cannot ping or ssh
> to them -- you cannot do anything. Even an nmap fails.
> 
> - the dom0 is still running fine.
> 
> - We can 'xm console' to the affected domU's and get a login prompt, but
> we can only enter the login id; the login times out waiting for the
> password prompt.
> 
> 
> Eventually, the bacula backup will time out: at this point, the machines
> come back to life. This takes about 15 - 20 minutes. The backup,
> however, does not complete successfully. In fact, very little happens on
> the backup at all :)
> 
> We're very puzzled by this -- we suspect an interrupt issue, but we
> really don't have a clue where to start looking. Other people seem to
> have reported similar IO-related problems.


