RE: [Xen-devel] new netfront and occasional receive path lockup

Jeremy Fitzhardinge wrote:
>  On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>> 
>> I was frustrated that I couldn't reproduce this bug at my site.
> 
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections.  I'm not sure that I've seen it under heavy load.   
> 
>> However, I investigated the code, and there is indeed one race
>> condition that probably causes the bug. See the attached patch.
>> 
>> Could anybody who sees this bug help by trying it? Much appreciated!
> 
> Thanks for looking into this.  Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).    
> 
> However, I am concerned about these manipulations of a cross-cpu
> shared variable without any barriers or other ordering constraints. 
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs); and if the compiler decides to access it more or
> less often than the source says it should?    

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
It is a flag in the shared ring structure, so operations on this flag
are handled the same way as the other components of the shared ring,
for example under the spinlock.
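
To make the ordering question concrete, here is a minimal user-space
sketch, not the netfront code. It assumes C11 <stdatomic.h> as a
stand-in for the kernel's barrier primitives. The names
shared_ring_sketch, publish_response() and poller_may_stop() are
hypothetical; only the field name smartpoll_active comes from the
driver. The idea is that each side stores first, issues a full fence,
and only then reads what the other side wrote, so at least one of the
two is guaranteed to see the other's update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared ring; only the field name
 * smartpoll_active is taken from the driver. */
struct shared_ring_sketch {
    atomic_uint rsp_prod;          /* responses published so far */
    atomic_bool smartpoll_active;  /* consumer says "I am polling" */
};

/* Producer side: publish one response, then decide whether to notify.
 * Returns true if an event must be sent to the consumer. */
bool publish_response(struct shared_ring_sketch *r)
{
    assert(r != NULL);
    atomic_fetch_add_explicit(&r->rsp_prod, 1, memory_order_release);
    /* Full fence: the new producer index must be visible before the
     * flag is read, otherwise a stale "polling" value could be used. */
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_load_explicit(&r->smartpoll_active,
                                 memory_order_relaxed);
}

/* Consumer side: clear the flag, then look at the ring once more.
 * Returns true if nothing is pending and it is safe to stop polling. */
bool poller_may_stop(struct shared_ring_sketch *r, unsigned rsp_cons)
{
    assert(r != NULL);
    atomic_store_explicit(&r->smartpoll_active, false,
                          memory_order_relaxed);
    /* Full fence: the cleared flag must be visible before the final
     * check of the producer index. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&r->rsp_prod,
                                memory_order_acquire) == rsp_cons;
}
```

If poller_may_stop() returns false, the caller would be expected to set
the flag again and keep polling. Whether the real code needs this
pattern depends on who else reads the flag; a lock taken on one side
only does not by itself order the accesses seen by the other side.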

I will keep ssh sessions open between dom0 and domU for some time to
see if the bug still occurs.

Thanks,
Dongxiao

> 
> Thanks,
>     J
> 
>> Thanks,
>> Dongxiao
>> 
>> 
>> Jeremy Fitzhardinge wrote:
>>>  On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>>> Hi Christophe,
>>>>> 
>>>>> Thanks for finding and checking the problem.
>>>>> I will try to reproduce the issue and check what caused the
>>>>> problem. 
>>>>> 
>>>> Hello,
>>>> 
>>>> Was this issue resolved? Some users have been complaining about
>>>> "network freezing up" issues recently on ##xen on IRC.
>>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>>> leave it off by default). 
>>> 
>>>     J
>>> 
>>>> -- Pasi
>>>> 
>>>>> Thanks,
>>>>> Dongxiao
>>>>> 
>>>>> Jeremy Fitzhardinge wrote:
>>>>>>  On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I've been playing with some of the new pvops code, namely DomU
>>>>>>> guest code.  What I've been observing on one of the virtual
>>>>>>> machines is that the network (vif) is dying after about ten to
>>>>>>> sixty minutes of uptime. The unfortunate thing here is that I
>>>>>>> can only reproduce it on a production VM and have so far been
>>>>>>> unable to trigger the bug on a test machine.  While this has
>>>>>>> not been tragic (rebooting fixed the issue), unfortunately I
>>>>>>> can't spend very much time on debugging after the issue pops up.
>>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>>> happened to me then... 
>>>>>> 
>>>>>> 
>>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>>> DomU can send packets to Dom0 and those are visible using
>>>>>>> tcpdump on the Dom0 on the virtual interface, but not the other
>>>>>>> way around.
>>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>>> that's what seems to be happening here too.
>>>>>> 
>>>>>>> Now, I have done more than one change at a time (I'd like to
>>>>>>> avoid going into pinning it down since I can only reproduce it
>>>>>>> on a production machine, as I said, so suggestions are welcome),
>>>>>>> but my suspicion is that it might have to do with the new "smart
>>>>>>> polling" feature in xen/netfront.  Note that I have also updated
>>>>>>> Dom0 to pull in the latest dom0/backend and netback changes, just
>>>>>>> to make sure it's not due to an issue that has been fixed there,
>>>>>>> but I'm still seeing the same.
>>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>>> into netfront. 
>>>>>> 
>>>>>>     J
>>>>>> 
>>>>>>> The production machine is a machine that doesn't have much
>>>>>>> network load, but deals with a lot of small network requests
>>>>>>> (DNS and SMTP mostly), a workload which is hard to reproduce on
>>>>>>> the test machine. Heavy network load (NFS, FTP and so on) for
>>>>>>> days hasn't triggered the problem.  Also, segmentation offloading
>>>>>>> and similar settings don't have any effect.
>>>>>>> 
>>>>>>> The machine has 2 physical and the VM 2 virtual CPUs, DomU has
>>>>>>> PREEMPT enabled. 
>>>>>>> 
>>>>>>> I've been looking at the code to see whether there might be a
>>>>>>> race condition somewhere, for example a situation where the
>>>>>>> hrtimer doesn't run while Dom0 believes the DomU should be
>>>>>>> polling and so doesn't emit an interrupt, but I'm afraid I
>>>>>>> don't know enough to judge this (I mean, there are spinlocks
>>>>>>> which look safe to me).
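
The kind of sequence suspected here can be replayed deterministically.
The following is only an illustrative model of a lost notification,
not a description of the actual bug or of the attached patch. Every
name in it (vif_model, backend_publish() and so on) is hypothetical,
and it runs single-threaded, so it shows the effect of step ordering
only, not of memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from netfront/netback. */
struct vif_model {
    bool polling_flag;  /* frontend advertises "my poll timer runs" */
    bool timer_armed;   /* is the frontend's timer really armed? */
    int  pending;       /* packets queued but not yet picked up */
    int  irqs_sent;     /* notifications the backend raised */
};

/* Backend: read the flag that says whether the frontend is polling. */
bool backend_sample_flag(const struct vif_model *m)
{
    assert(m != NULL);
    return m->polling_flag;
}

/* Backend: put one packet on the ring. */
void backend_publish(struct vif_model *m)
{
    m->pending++;
}

/* Backend: skip the interrupt if the frontend was seen polling. */
void backend_maybe_notify(struct vif_model *m, bool saw_polling)
{
    if (!saw_polling)
        m->irqs_sent++;
}

/* Frontend: stop the timer, clear the flag, then look at the ring one
 * last time and re-arm if something arrived in the meantime. */
void frontend_stop_polling(struct vif_model *m)
{
    m->timer_armed = false;
    m->polling_flag = false;
    if (m->pending > 0) {
        m->timer_armed = true;
        m->polling_flag = true;
    }
}

/* Stuck: work is pending, no timer will look, no interrupt was sent. */
bool rx_stuck(const struct vif_model *m)
{
    return m->pending > 0 && !m->timer_armed && m->irqs_sent == 0;
}

/* Backend samples the flag first; the frontend stops in between. */
bool stuck_when_flag_sampled_first(void)
{
    struct vif_model m = { .polling_flag = true, .timer_armed = true };
    bool saw = backend_sample_flag(&m);  /* sees "polling" */
    frontend_stop_polling(&m);           /* ring is idle, really stops */
    backend_publish(&m);
    backend_maybe_notify(&m, saw);       /* stale flag: no interrupt */
    return rx_stuck(&m);
}

/* Backend publishes first; the frontend stops in between. */
bool stuck_when_packet_queued_first(void)
{
    struct vif_model m = { .polling_flag = true, .timer_armed = true };
    backend_publish(&m);
    frontend_stop_polling(&m);           /* final check sees the packet */
    backend_maybe_notify(&m, backend_sample_flag(&m));
    return rx_stuck(&m);
}
```

In this model the receive path only gets stuck when the flag is sampled
before the packet is queued, which matches the symptom of a guest that
can still send but never receives.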
>>>>>>> 
>>>>>>> Do you have any suggestions what to try?  I can trigger the
>>>>>>> issue on the production VM again, but debugging should not take
>>>>>>> more than a few minutes if it happens.  Access is only possible
>>>>>>> via the console. Neither Dom0 nor the guest shows anything
>>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>>> normally after the network goes dead (the guest can also be shut
>>>>>>> down normally).
>>>>>>> 
>>>>>>> Thanks,
>>>>>>>         Christophe
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>>>>>>> http://lists.xensource.com/xen-devel

