
[Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x


  • To: xen-devel@xxxxxxxxxxxxxxxxxxx
  • From: Teck Choon Giam <giamteckchoon@xxxxxxxxx>
  • Date: Sat, 4 Jul 2009 14:32:26 +0800
  • Delivery-date: Fri, 03 Jul 2009 23:32:47 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hi,

I am seeing network stalls under Xen 3.4.x on all of our Dell
PE850/860/R200 servers, which use the onboard Broadcom NIC with the
tg3 driver.

I have tested both xen-3.3.2-rc3 and xen-3.4.1-rc5 with
linux-2.6.18-xen.hg changeset 913.  In the test, a domU runs several
concurrent scp transfers of a few 1MB/10MB/100MB files to another
server.  Within an hour the network stalls on xen-3.4.1-rc5 but not on
xen-3.3.2-rc3.  ifconfig, route -n and ip link all look normal, but
the gateway cannot be pinged.  A custom script run from crontab pings
the gateway and, on 100% packet loss, executes the steps below.  This
sometimes brings the network back, but not always, in which case a
reboot is needed:

1. xm shutdown all domUs
2. service xendomains stop
3. stop network-bridge
4. service xend stop
5. service xend start
6. xm create all domUs

However, the above can leave the ext3 file systems of some domUs
dirty, so that e2fsck is required.
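
For illustration, the crontab check and the six steps above could look
roughly like the sketch below.  This is not the actual script: the
gateway address, the domU config directory, the function names and the
DRY_RUN switch are all placeholders, and the network-bridge path is
assumed to be the stock /etc/xen/scripts location.

```shell
#!/bin/sh
# Sketch of a gateway watchdog; names, paths and addresses are placeholders.

GATEWAY="${GATEWAY:-192.0.2.1}"                 # hypothetical gateway address
DOMU_CFG_DIR="${DOMU_CFG_DIR:-/etc/xen/auto}"   # assumed location of domU configs

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Read ping output on stdin; succeed only if it reports 100% packet loss.
all_packets_lost() {
    grep -q ' 100% packet loss'
}

# The six recovery steps, in the same order as the list above.
recover_network() {
    run xm shutdown -a -w
    run service xendomains stop
    run /etc/xen/scripts/network-bridge stop
    run service xend stop
    run service xend start
    for cfg in "$DOMU_CFG_DIR"/*; do
        if [ -f "$cfg" ]; then
            run xm create "$cfg"
        fi
    done
}

# Ping the gateway and trigger the recovery only on total packet loss.
watchdog() {
    if ping -c 5 "$GATEWAY" 2>&1 | all_packets_lost; then
        recover_network
    fi
}
```

Calling watchdog at the end of the script and running it from crontab
every few minutes gives the behaviour described above.  Setting
DRY_RUN=1 only prints the recovery commands, which is a safer way to
try it out first.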

I have repeated the test many times (more than 5 times on each of 3
Dell PE850/860 servers) and the results are always the same.  With
xen-3.3.2-rc3 there is no issue: the network does not go down or stall
during the scp transfer test to the other server.  With xen-3.4.1-rc5,
the stall happens within an hour whenever at least 5 instances of the
test run concurrently.  In fact, xen-3.4.0 and xen-3.4.1-rc1 through
rc5 all behave the same way.
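
A rough sketch of a load generator along the lines of the test
described above follows.  The function names, file names and the scp
target are made up for illustration, and the loop assumes key-based
ssh login so that scp does not prompt for a password.

```shell
#!/bin/sh
# Sketch of the concurrent scp load test; all names are placeholders.

# Create one file of random data per size (in MB) under directory $1.
make_test_files() {
    dir=$1; shift
    mkdir -p "$dir"
    for mb in "$@"; do
        dd if=/dev/urandom of="$dir/test-${mb}MB.bin" \
           bs=1024 count=$((mb * 1024)) 2>/dev/null
    done
}

# Copy every file in directory $1 to scp target $2, repeating $3 times.
scp_loop() {
    n=0
    while [ "$n" -lt "$3" ]; do
        scp -q "$1"/* "$2" || return 1
        n=$((n + 1))
    done
}

# Start $1 copies of scp_loop in the background and wait for all of them.
run_load() {
    instances=$1; shift
    i=0
    while [ "$i" -lt "$instances" ]; do
        scp_loop "$@" &
        i=$((i + 1))
    done
    wait
}
```

For example, inside the domU (user@otherserver is a hypothetical
target):

  make_test_files /tmp/scptest 1 10 100
  run_load 5 /tmp/scptest user@otherserver:/tmp/ 1000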

/var/log/messages shows the following when the network stalls:
tg3: peth0: transmit timed out, resetting

I have tried turning off the following offload settings on eth0:
/sbin/ethtool -K eth0 tx off
/sbin/ethtool -K eth0 rx off
/sbin/ethtool -K eth0 gso off
/sbin/ethtool -K eth0 tso off
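
The same four settings can be applied to several interfaces and then
checked in one go, as in the sketch below.  The function name and the
ETHTOOL override variable are placeholders.  Including peth0 is an
assumption based on the log line above, where the timeout is reported
on peth0, the physical device under network-bridge.

```shell
#!/bin/sh
# Sketch: disable the four offload settings above and print the result.

ETHTOOL="${ETHTOOL:-/sbin/ethtool}"   # can be overridden for a dry run

# Turn off tx, rx, gso and tso on each interface named in the arguments.
disable_offloads() {
    for iface in "$@"; do
        for feature in tx rx gso tso; do
            "$ETHTOOL" -K "$iface" "$feature" off
        done
        # Lower-case -k prints the current offload settings for checking.
        "$ETHTOOL" -k "$iface"
    done
}
```

It would be called as "disable_offloads eth0 peth0"; the output of
ethtool -k shows whether each setting is really off.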

Are there any netfront/netback changes between xen-3.3.x and xen-3.4.x
that could cause this issue?  Has anybody else seen such network
stalls with tg3 in a bridged network environment?

The same test on non-tg3 servers, such as those using the e100/e1000
drivers, does not cause the network stall.

All servers are running CentOS 5.3 with linux-2.6.18.8-xen for all
dom0s and domUs.

Any idea?

Thanks.

Kindest regards,
Giam Teck Choon

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
