Scott Garron wrote:
Another issue that comes up is that if I run the 18.104.22.168 pvops
kernel for my Linux domUs, after a time (usually only about an
hour or so), the network interfaces stop responding.
Jeremy Fitzhardinge wrote:
That's a separate problem in netfront that appears to be a bug in
the "smartpoll" code. I think Dongxiao is looking into it.
On 8/31/2010 2:59 AM, Xu, Dongxiao wrote:
Yes, I have tried to reproduce it these days; however, I couldn't catch it
locally. I tried both netperf and ping for a long time, but the bug
was not triggered. What workload were you running when you hit the bug?
I'd say that the whole machine is under moderate to high
utilization because it has 10 virtual machines running - three of which
are Windows 2008 Servers as HVM guests. However, as far as the "load"
goes, most of the virtual machines are fairly idle and probably not
under much stress, overall. Just to give you an idea, we have a
10Mbit/s connection to the Internet, and this server's physical network
interface (all 10 of the domUs' traffic, combined) usually accounts for
less than 2Mbit/s of the outbound traffic at any given point in the day.
Aside from Windows being Windows (the HVM guests are running graphical
desktops), I wouldn't say that any of them cause a high CPU load,
either. Database load is fairly low to moderate on guests running MySQL
and/or PostgreSQL. The only guest that seems to use more CPU and
RAM is one serving e-mail, and that's because it runs ClamAV and
SpamAssassin. That e-mail server was one that kept its network
connectivity the longest, though (after a few hours, it did stop
responding, but that was after some guests with lighter loads stopped
responding).
An observation that I made, and it may just be coincidental,
but at least noteworthy, is that the virtual machines that are assigned
less RAM seem to lose connectivity more quickly than those with more
RAM. The most recent time that I was able to trigger the bug, the
virtual machine that lost connectivity was only assigned 384MB RAM,
running 22.214.171.124. At the time, the rest of my paravirtualized guests
were running 126.96.36.199, and they didn't experience the problem.
I've previously triggered the bug in multiple domUs that were
running a more recent kernel (I think it was 188.8.131.52 - before I
reverted to a netback-patched 184.108.40.206 kernel), and the first ones to
disappear from the network were ones that were only assigned 256MB.
Eventually, they all disappeared, though. The only "load" on one of the
first to disappear is an installation of bind9, servicing about 50
domain names - none of which receive an abnormally high hit count.
The first time I noticed the problem, I had started 7
paravirtualized guests, of varying memory assignments. The moment I
started the 8th guest, an HVM Windows 2008 Server, the networking on all
of the running guests (the paravirt ones) stopped responding at
the same time. That may also be something to try/look at.
After a reboot, I avoided starting any of the HVM guests, and the
connectivity lasted a couple of hours on the 7 running paravirt guests,
but started disappearing one guest at a time, over the course of the
next few hours.
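For anyone trying to reproduce the one-guest-at-a-time drop-off described above, here is a minimal sketch of how the time at which each guest stops answering could be logged from another machine on the network. It is not part of the original report. The guest names and addresses are hypothetical placeholders, and it assumes a Linux iputils `ping` (the `-c` and `-W` flags).

```python
import subprocess
import time

# Hypothetical guest names and addresses (documentation range);
# replace with the real domU addresses.
GUESTS = {"guest-a": "192.0.2.11", "guest-b": "192.0.2.12"}


def is_reachable(address):
    """Send one ICMP echo with a 2 s timeout (Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def record_losses(guests, probe, lost, now):
    """Add guests that fail the probe to `lost`, keeping the first failure time."""
    for name, address in guests.items():
        if name not in lost and not probe(address):
            lost[name] = now
    return lost


def monitor(guests, interval=30):
    """Poll until every guest has stopped answering, printing each drop-off."""
    start = time.time()
    lost = {}
    while len(lost) < len(guests):
        before = set(lost)
        record_losses(guests, is_reachable, lost, time.time() - start)
        for name in set(lost) - before:
            print(f"{name} stopped responding after {lost[name]:.0f} s")
        time.sleep(interval)
    return lost
```

Calling `monitor(GUESTS)` would print one line per guest as it disappears. Note that this sketch treats a single missed ping as a loss, so an occasional dropped packet would be reported as a drop-off; requiring several consecutive failures would be more robust.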
I didn't mention in my previous e-mail that in order to get
networking to work in a stable fashion in the 220.127.116.11 kernel (the one
I reverted to), I had to apply the patch mentioned here:
Otherwise, networking became unstable immediately at the time of guest
creation. That patch was already applied to the 18.104.22.168 kernel that
is giving me the eventual network loss problems, though.
More specifics about my configuration can be found here:
Xen-devel mailing list