
Re: [Xen-devel] Re: new netfront and occasional receive path lockup



 On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of the
> smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
>   $ git diff 
> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> I deployed this new image on all domU's (except for two of them, as a
> control group) and added xen_netfront.use_smartpoll=0 to the grub
> kernel parameters.

That's good to hear.  But I also included a fix from Dongxiao which, if
correct, means it should work with use_smartpoll=1 (or nothing, as
that's the default).  Could you verify whether the fix in
cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
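
A quick way to check whether that commit is in what you built, run from
the tree you merged:

  $ git branch --contains cb09635065163a933d0d00d077ddd9f0c0a908a1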

> Problem solved.  Only the two domU's I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. This is my first NNTP post through gmane; I have no idea whether
> it will reach the list, keep the Message-Id/References intact, or CC
> Christophe, Jeremy, Dongxiao et al.

There were no cc:s.
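
For reference, the knob Gerald used is nothing exotic: a module
parameter that gates whether netfront arms the smart-poll hrtimer at
all.  Roughly this shape (a simplified sketch with illustrative names,
not the literal patch):

  #include <linux/module.h>
  #include <linux/moduleparam.h>

  /* Sketch of the opt-out knob.  As a module parameter it can be set
   * with xen_netfront.use_smartpoll=0 on the kernel command line when
   * the driver is built in, or with
   * "options xen-netfront use_smartpoll=0" when it is a module. */
  static int use_smartpoll = 1;     /* default on, as noted above */
  module_param(use_smartpoll, int, 0444);
  MODULE_PARM_DESC(use_smartpoll,
          "Poll the RX ring from an hrtimer instead of taking an "
          "event-channel interrupt per packet (0 = disable)");

When it is 0, the connect path simply never starts the per-device
hrtimer, leaving the classic interrupt-driven receive path in place.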

Thanks,
    J

>
>> Jeremy Fitzhardinge wrote:
>>>  On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>>> Hi Christophe,
>>>>>
>>>>> Thanks for finding and checking the problem.
>>>>> I will try to reproduce the issue and check what caused the
>>>>> problem.
>>>>>
>>>> Hello,
>>>>
>>>> Was this issue resolved? Some users have been complaining about
>>>> "network freezing up" issues recently in ##xen on IRC.
>>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>>> leave it off by default).
>>>
>>>     J
>>>
>>>> -- Pasi
>>>>
>>>>> Thanks,
>>>>> Dongxiao
>>>>>
>>>>> Jeremy Fitzhardinge wrote:
>>>>>>  On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've been playing with some of the new pvops code, namely DomU
>>>>>>> guest code.  What I've been observing on one of the virtual
>>>>>>> machines is that the network (vif) is dying after about ten to
>>>>>>> sixty minutes of uptime. The unfortunate thing here is that I can
>>>>>>> only reproduce it on a production VM and have so far been unable
>>>>>>> to trigger the bug on a test machine.  While this has not been
>>>>>>> tragic (rebooting fixes the issue), it unfortunately means I can't
>>>>>>> spend much time on debugging once the issue pops up.
>>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>>> happened to me then...
>>>>>>
>>>>>>
>>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>>> DomU can send packets to Dom0 and those are visible using tcpdump
>>>>>>> on the Dom0 on the virtual interface, but not the other way
>>>>>>> around.
>>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>>> that's what seems to be happening here too.
>>>>>>
>>>>>>> Now, I have made more than one change at a time (I'd like to
>>>>>>> avoid having to pin down which one, since I can only reproduce it
>>>>>>> on a production machine, as I said, so suggestions are welcome), but
>>>>>>> my suspicion is that it might have to do with the new "smart
>>>>>>> polling" feature in xen/netfront.  Note that I have also updated
>>>>>>> Dom0 to pull in the latest dom0/backend and netback changes, just
>>>>>>> to make sure it's not due to an issue that has been fixed there,
>>>>>>> but I'm still seeing the same.
>>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>>> into netfront.
>>>>>>
>>>>>>     J
>>>>>>
>>>>>>> The production machine doesn't have much network load, but deals
>>>>>>> with a lot of small network requests (mostly DNS and SMTP), a
>>>>>>> workload which is hard to reproduce on the test machine.  Days of
>>>>>>> heavy network load (NFS, FTP and so on) haven't triggered the
>>>>>>> problem.  Also, toggling segmentation offloading and similar
>>>>>>> settings has no effect.
>>>>>>>
>>>>>>> The machine has 2 physical CPUs and the VM 2 virtual CPUs; the
>>>>>>> DomU has PREEMPT enabled.
>>>>>>>
>>>>>>> I've been looking at the code to see whether there might be a
>>>>>>> race condition somewhere, something like a situation where the
>>>>>>> hrtimer doesn't run while Dom0 still believes the DomU is polling
>>>>>>> and therefore doesn't emit an interrupt (a sketch of that pattern
>>>>>>> follows the quoted thread), but I'm afraid I don't know the code
>>>>>>> well enough to judge; the spinlocks look safe to me.
>>>>>>>
>>>>>>> Do you have any suggestions on what to try?  I can trigger the
>>>>>>> issue on the production VM again, but debugging should not take
>>>>>>> more than a few minutes once it happens.  Access is only possible
>>>>>>> via the console.  Neither Dom0 nor the guest shows anything
>>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>>> normally after the network goes dead (I'm also able to shut down
>>>>>>> the guest normally).
>>>>>>>
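
To make the failure mode Christophe describes above more concrete, here
is a userspace analogue of the suspected lost-wakeup pattern.  It is
entirely hypothetical (plain pthreads, invented names), not the
netfront/netback code; it only reproduces the shape of the race: the
producer samples the consumer's "I am polling" flag without
synchronization and can therefore skip the one notification the
consumer was depending on.

  /* lost_wakeup.c: userspace analogue of the suspected race.
   * producer = "netback", consumer = "netfront", signal = "interrupt".
   *
   * Build: gcc -O2 -pthread lost_wakeup.c -o lost_wakeup
   * Run it a few times; when the window is hit it reports LOST WAKEUP
   * instead of "consumed everything". */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ITEMS 2000

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static int pending;            /* queued work: the "ring"            */
  static volatile int polling;   /* "I'm polling, don't interrupt me"  */
  static int consumed;

  static void *producer(void *unused)
  {
      int i, skip_wakeup;
      for (i = 0; i < ITEMS; i++) {
          skip_wakeup = polling; /* BUG: sampled without the lock...   */
          usleep(200);           /* ...window widened so the race is   */
                                 /* easy to hit in a demo              */
          pthread_mutex_lock(&lock);
          pending++;
          if (!skip_wakeup)
              pthread_cond_signal(&cond);    /* the "interrupt"        */
          pthread_mutex_unlock(&lock);
          usleep(300);           /* pace the stream of small packets   */
      }
      return NULL;
  }

  static void *consumer(void *unused)
  {
      int spin;
      while (consumed < ITEMS) {
          polling = 1;           /* poll phase: drain without wakeups  */
          for (spin = 0; spin < 5; spin++) {
              pthread_mutex_lock(&lock);
              consumed += pending;
              pending = 0;
              pthread_mutex_unlock(&lock);
              usleep(100);
          }
          polling = 0;           /* stop polling; if the producer just */
                                 /* sampled polling==1, the next item  */
                                 /* arrives with no signal...          */
          pthread_mutex_lock(&lock);
          while (!pending && consumed < ITEMS)
              pthread_cond_wait(&cond, &lock);   /* ...and if it was   */
          consumed += pending;                   /* the last one, we   */
          pending = 0;                           /* sleep here forever */
          pthread_mutex_unlock(&lock);
      }
      printf("consumed everything (%d items)\n", consumed);
      exit(0);
  }

  int main(void)
  {
      pthread_t p, c;
      pthread_create(&c, NULL, consumer, NULL);
      pthread_create(&p, NULL, producer, NULL);
      sleep(10);  /* watchdog: a clean run finishes in a second or two */
      printf("LOST WAKEUP: consumer stuck, pending=%d consumed=%d\n",
             pending, consumed);
      return 1;
  }

Note that a steady packet stream papers over the bug, because the next
item's notification rescues the stuck consumer; only the last item
before a lull hangs it.  That would fit the observation that heavily
loaded test machines never die while the low-traffic DNS/SMTP box does.
Whatever the actual fix looks like, it presumably has to make "stop
polling" and "recheck the ring" atomic with respect to the notifier,
which is the usual cure for this class of bug.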

