
Re: [Xen-devel] Re: new netfront and occasional receive path lockup



 On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of the
> smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
>   $ git diff 
> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> I deployed this new image on all domU's (except for two of them, as a
> control group) and added xen_netfront.use_smartpoll=0 to the grub
> kernel parameters.

That's good to hear.  But I also included a fix from Dongxiao which, if
correct, means it should work with use_smartpoll=1 (or nothing, as
that's the default).  Could you verify whether the fix in
cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
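
A quick way to check whether that commit is in what you built, run from
the tree you merged:

  $ git branch --contains cb09635065163a933d0d00d077ddd9f0c0a908a1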

> Problem solved.  Only the two domU's I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. This is my first NNTP post through gmane; I have no idea whether
> it will reach the list, keep the Message-Id/References intact, or CC
> Christophe, Jeremy, Dongxiao et al.

There were no cc:s.
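
For reference, the knob Gerald used is nothing exotic: a module
parameter that gates whether netfront arms the smart-poll hrtimer at
all.  Roughly this shape (a simplified sketch with illustrative names,
not the literal patch):

  #include <linux/module.h>
  #include <linux/moduleparam.h>

  /* Sketch of the opt-out knob.  As a module parameter it can be set
   * with xen_netfront.use_smartpoll=0 on the kernel command line when
   * the driver is built in, or with
   * "options xen-netfront use_smartpoll=0" when it is a module. */
  static int use_smartpoll = 1;     /* default on, as noted above */
  module_param(use_smartpoll, int, 0444);
  MODULE_PARM_DESC(use_smartpoll,
          "Poll the RX ring from an hrtimer instead of taking an "
          "event-channel interrupt per packet (0 = disable)");

When it is 0, the connect path simply never starts the per-device
hrtimer, leaving the classic interrupt-driven receive path in place.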

Thanks,
    J

>
>> Jeremy Fitzhardinge wrote:
>>>  On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>>> Hi Christophe,
>>>>>
>>>>> Thanks for finding and checking the problem.
>>>>> I will try to reproduce the issue and check what caused the
>>>>> problem.
>>>>>
>>>> Hello,
>>>>
>>>> Was this issue resolved? Some users have been complaining about
>>>> "network freezing up" issues recently in ##xen on IRC.
>>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>>> leave it off by default).
>>>
>>>     J
>>>
>>>> -- Pasi
>>>>
>>>>> Thanks,
>>>>> Dongxiao
>>>>>
>>>>> Jeremy Fitzhardinge wrote:
>>>>>>  On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've been playing with some of the new pvops code, namely DomU
>>>>>>> guest code.  What I've been observing on one of the virtual
>>>>>>> machines is that the network (vif) is dying after about ten to
>>>>>>> sixty minutes of uptime. The unfortunate thing here is that I can
>>>>>>> only reproduce it on a production VM and have so far been unable
>>>>>>> to trigger the bug on a test machine.  While this has not been
>>>>>>> tragic (rebooting fixes the issue), it unfortunately means I can't
>>>>>>> spend much time on debugging once the issue pops up.
>>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>>> happened to me then...
>>>>>>
>>>>>>
>>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>>> DomU can send packets to Dom0 and those are visible using tcpdump
>>>>>>> on the Dom0 on the virtual interface, but not the other way
>>>>>>> around.
>>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>>> that's what seems to be happening here too.
>>>>>>
>>>>>>> Now, I have made more than one change at a time (I'd like to
>>>>>>> avoid having to pin down which one, since I can only reproduce it
>>>>>>> on a production machine, as I said, so suggestions are welcome), but
>>>>>>> my suspicion is that it might have to do with the new "smart
>>>>>>> polling" feature in xen/netfront.  Note that I have also updated
>>>>>>> Dom0 to pull in the latest dom0/backend and netback changes, just
>>>>>>> to make sure it's not due to an issue that has been fixed there,
>>>>>>> but I'm still seeing the same.
>>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>>> into netfront.
>>>>>>
>>>>>>     J
>>>>>>
>>>>>>> The production machine doesn't have much network load, but deals
>>>>>>> with a lot of small network requests (mostly DNS and SMTP), a
>>>>>>> workload which is hard to reproduce on the test machine.  Days of
>>>>>>> heavy network load (NFS, FTP and so on) haven't triggered the
>>>>>>> problem.  Also, toggling segmentation offloading and similar
>>>>>>> settings has no effect.
>>>>>>>
>>>>>>> The machine has 2 physical CPUs and the VM 2 virtual CPUs; the
>>>>>>> DomU has PREEMPT enabled.
>>>>>>>
>>>>>>> I've been looking at the code to see whether there might be a
>>>>>>> race condition somewhere, something like a situation where the
>>>>>>> hrtimer doesn't run while Dom0 still believes the DomU is polling
>>>>>>> and therefore doesn't emit an interrupt (a sketch of that pattern
>>>>>>> follows the quoted thread), but I'm afraid I don't know the code
>>>>>>> well enough to judge; the spinlocks look safe to me.
>>>>>>>
>>>>>>> Do you have any suggestions on what to try?  I can trigger the
>>>>>>> issue on the production VM again, but debugging should not take
>>>>>>> more than a few minutes once it happens.  Access is only possible
>>>>>>> via the console.  Neither Dom0 nor the guest shows anything
>>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>>> normally after the network goes dead (I'm also able to shut down
>>>>>>> the guest normally).
>>>>>>>
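
To make the failure mode Christophe describes above more concrete, here
is a userspace analogue of the suspected lost-wakeup pattern.  It is
entirely hypothetical (plain pthreads, invented names), not the
netfront/netback code; it only reproduces the shape of the race: the
producer samples the consumer's "I am polling" flag without
synchronization and can therefore skip the one notification the
consumer was depending on.

  /* lost_wakeup.c: userspace analogue of the suspected race.
   * producer = "netback", consumer = "netfront", signal = "interrupt".
   *
   * Build: gcc -O2 -pthread lost_wakeup.c -o lost_wakeup
   * Run it a few times; when the window is hit it reports LOST WAKEUP
   * instead of "consumed everything". */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ITEMS 2000

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static int pending;            /* queued work: the "ring"            */
  static volatile int polling;   /* "I'm polling, don't interrupt me"  */
  static int consumed;

  static void *producer(void *unused)
  {
      int i, skip_wakeup;
      for (i = 0; i < ITEMS; i++) {
          skip_wakeup = polling; /* BUG: sampled without the lock...   */
          usleep(200);           /* ...window widened so the race is   */
                                 /* easy to hit in a demo              */
          pthread_mutex_lock(&lock);
          pending++;
          if (!skip_wakeup)
              pthread_cond_signal(&cond);    /* the "interrupt"        */
          pthread_mutex_unlock(&lock);
          usleep(300);           /* pace the stream of small packets   */
      }
      return NULL;
  }

  static void *consumer(void *unused)
  {
      int spin;
      while (consumed < ITEMS) {
          polling = 1;           /* poll phase: drain without wakeups  */
          for (spin = 0; spin < 5; spin++) {
              pthread_mutex_lock(&lock);
              consumed += pending;
              pending = 0;
              pthread_mutex_unlock(&lock);
              usleep(100);
          }
          polling = 0;           /* stop polling; if the producer just */
                                 /* sampled polling==1, the next item  */
                                 /* arrives with no signal...          */
          pthread_mutex_lock(&lock);
          while (!pending && consumed < ITEMS)
              pthread_cond_wait(&cond, &lock);   /* ...and if it was   */
          consumed += pending;                   /* the last one, we   */
          pending = 0;                           /* sleep here forever */
          pthread_mutex_unlock(&lock);
      }
      printf("consumed everything (%d items)\n", consumed);
      exit(0);
  }

  int main(void)
  {
      pthread_t p, c;
      pthread_create(&c, NULL, consumer, NULL);
      pthread_create(&p, NULL, producer, NULL);
      sleep(10);  /* watchdog: a clean run finishes in a second or two */
      printf("LOST WAKEUP: consumer stuck, pending=%d consumed=%d\n",
             pending, consumed);
      return 1;
  }

Note that a steady packet stream papers over the bug, because the next
item's notification rescues the stuck consumer; only the last item
before a lull hangs it.  That would fit the observation that heavily
loaded test machines never die while the low-traffic DNS/SMTP box does.
Whatever the actual fix looks like, it presumably has to make "stop
polling" and "recheck the ring" atomic with respect to the notifier,
which is the usual cure for this class of bug.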

