
[Xen-devel] Xen instability on HP Moonshot m400



Hi,

I have been experiencing a crash when running Xen on the m400 over the last few days. I already spoke to Ian and Stefano about this, but thought I'd summarize what I've seen so far and loop in a wider audience.

The basic setup is this:
 - Two m400 nodes, one running Linux bare-metal, the other running Xen.
 - The Xen node runs Dom0 and 1 DomU
 - The m400 has a Mellanox ConnectX-3 PCIe 10G ethernet card with two ports on it
 - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to the internet) and regular bridging to eth1, which is connected to a private VLAN shared with the bare-metal node
 - Dom0 and DomU are each configured with 14GB of RAM and 4 CPUs
 - DomU runs apache2 serving the GCC manual (see https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
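
For reference, a DomU configuration matching the description above might look roughly like the following xl config. This is only a sketch: the domain name, kernel path and bridge name are placeholders I made up for illustration, not values copied from the actual setup.

```
# Hypothetical xl domain config (sketch only)
# Assumed placeholder name:
name = "domu-apache"
# Assumed placeholder path to the guest kernel:
kernel = "/path/to/guest/Image"
# 14GB of RAM, expressed in MiB:
memory = 14336
vcpus = 4
# Assumed bridge name; the bridge is the one that contains Dom0's eth1:
vif = [ 'bridge=xenbr1' ]
```

The Dom0 side (14GB, 4 CPUs) would be set on the Xen command line rather than in a domain config, e.g. with dom0_mem=14G and dom0_max_vcpus=4.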

The bare-metal node runs apache bench, like this: "ab -n 100000 -c 100 http://10.10.1.120/gcc/index.html"

(10.10.1.120 is the DomU IP address of the bridged interface to eth1)

What happens now is that the entire Xen node goes down. I see various errors in the kernel log, some examples:
http://pastebin.ubuntu.com/10642148/
http://pastebin.ubuntu.com/10642177/
http://pastebin.ubuntu.com/10642181/

I have also tried applying a set of swiotlb fixes provided by Stefano to both the Dom0 and DomU kernel, like this:

With these patches I sometimes also saw this error in the kernel log (but not always):

Other data points of interest:
 - Bare-metal serving apache doesn't exhibit this behavior
 - KVM guests with bridged networking on identical hardware/setup with the same kernels also don't exhibit this behavior
 - Other physically identical nodes exhibit the same behavior
 - Just running Dom0 serving apache without running DomU doesn't appear to exhibit this behavior
 - Running apache on Dom0 and benchmarking the system using Dom0's IP address, with DomU running idle in the background, causes this behavior (http://pastebin.ubuntu.com/10642311/), but the system seems to stay alive (at least for much longer)!

Stefano suggested that this could be related to DMA cache coherency, but I'm not sure how to investigate that further.

This is a somewhat urgent issue for us at Columbia so I would appreciate any feedback and/or ideas and will be happy to try out any debugging steps to get to the bottom of this.

Thanks,
-Christoffer

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

