xen-devel

RE: [Xen-devel] new netfront and occasional receive path lockup

To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Subject: RE: [Xen-devel] new netfront and occasional receive path lockup
From: "Xu, Dongxiao" <dongxiao.xu@xxxxxxxxx>
Date: Fri, 10 Sep 2010 10:37:37 +0800
Accept-language: en-US
Acceptlanguage: en-US
Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Christophe Saout <christophe@xxxxxxxx>
Delivery-date: Thu, 09 Sep 2010 19:39:28 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4C8996FE.2040500@xxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <1282495384.12843.11.camel@xxxxxxxxxxxxxxxxxxxx> <4C73166D.3030000@xxxxxxxx> <D5AB6E638E5A3E4B8F4406B113A5A19A2A44184D@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20100909185058.GR2804@xxxxxxxxxxx> <4C8981E5.6010000@xxxxxxxx> <D5AB6E638E5A3E4B8F4406B113A5A19A2A5ED71F@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4C8996FE.2040500@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActQj2edZF86c9vgR4+nKp40VPDQAgAARTzA
Thread-topic: [Xen-devel] new netfront and occasional receive path lockup
Jeremy Fitzhardinge wrote:
>  On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>> 
>> I was frustrated that I couldn't reproduce this bug at my site.
> 
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections.  I'm not sure that I've seen it under heavy load.   
> 
>> However, I investigated the code, and indeed there is one race
>> condition that probably causes the bug. See the attached patch.
>> 
>> Could anybody who is seeing this bug help to try it? Much appreciated!
> 
> Thanks for looking into this.  Your logic seems reasonable, so I'll
> apply it (however, I also added a patch to make smartpoll default to
> "off"; I guess I can switch that back to defaulting to "on" to make
> sure it gets tested, but leave the option as a workaround if there are
> still problems).
> 
> However, I am concerned about these manipulations of a cross-CPU
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (by either the
> compiler or the CPUs), and if the compiler decides to access it more
> or less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
It is a flag in the shared ring structure, so operations on this flag
follow the same rules as the other fields of the shared ring, e.g. they
are performed under the spinlock.
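
For illustration, here is a minimal user-space sketch of the compiler-caching
part of that concern.  Everything in it (the struct, field and function names)
is a stand-in, not the real netfront code, and it only demonstrates the
compiler side; ordering across CPUs would additionally need memory barriers
(or the barriers implied by the spinlock) in kernel code:

/*
 * Hypothetical sketch: without ACCESS_ONCE()-style volatile accesses,
 * the compiler may legally cache a cross-CPU flag in a register and
 * read it more or less often than the source says.
 * Build with: gcc -O2 -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct fake_ring { int smartpoll_active; };     /* stand-in for the shared ring */
static struct fake_ring ring;

/* Broken pattern: a plain load can be hoisted out of the loop at -O2,
 * so this poller may spin forever even after the flag is cleared. */
static void *poll_plain(void *arg)
{
        while (ring.smartpoll_active)
                ;
        return NULL;
}

/* Safer pattern: the volatile access forces one real load per iteration. */
static void *poll_once(void *arg)
{
        while (ACCESS_ONCE(ring.smartpoll_active))
                ;
        return NULL;
}

int main(void)
{
        pthread_t t;

        (void)poll_plain;                       /* not run; shown for contrast */
        ring.smartpoll_active = 1;
        pthread_create(&t, NULL, poll_once, NULL);
        sleep(1);
        ACCESS_ONCE(ring.smartpoll_active) = 0; /* "frontend" clears the flag */
        pthread_join(t, NULL);
        puts("poller observed the cleared flag and exited");
        return 0;
}

In kernel code the corresponding tools would be ACCESS_ONCE() from
linux/compiler.h plus memory barriers or the ring spinlock; whether the
smartpoll paths already guarantee this on both the frontend and backend
sides is exactly the question above.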

I will keep ssh connections between dom0 and domU open for some time
to see if the bug still exists.

Thanks,
Dongxiao

> 
> Thanks,
>     J
> 
>> Thanks,
>> Dongxiao
>> 
>> 
>> Jeremy Fitzhardinge wrote:
>>>  On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
>>>> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>>>>> Hi Christophe,
>>>>> 
>>>>> Thanks for finding and checking the problem.
>>>>> I will try to reproduce the issue and check what caused the
>>>>> problem. 
>>>>> 
>>>> Hello,
>>>> 
>>>> Was this issue resolved? Some users have been complaining about
>>>> "network freezing up" issues recently in ##xen on IRC.
>>> Yeah, I'll add a command-line parameter to disable smartpoll (and
>>> leave it off by default). 
>>> 
>>>     J
>>> 
>>>> -- Pasi
>>>> 
>>>>> Thanks,
>>>>> Dongxiao
>>>>> 
>>>>> Jeremy Fitzhardinge wrote:
>>>>>>  On 08/22/2010 09:43 AM, Christophe Saout wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I've been playing with some of the new pvops code, namely the
>>>>>>> DomU guest code.  What I've been observing on one of the virtual
>>>>>>> machines is that the network (vif) dies after about ten to sixty
>>>>>>> minutes of uptime.  The unfortunate thing here is that I can only
>>>>>>> reproduce it on a production VM and have been unlucky so far in
>>>>>>> triggering the bug on a test machine.  While this has not been
>>>>>>> tragic (rebooting fixed the issue), unfortunately I can't spend
>>>>>>> very much time on debugging after the issue pops up.
>>>>>> Ah, OK.  I've seen this a couple of times as well.  And it just
>>>>>> happened to me then... 
>>>>>> 
>>>>>> 
>>>>>>> Now, what is happening is that the receive path goes dead.  The
>>>>>>> DomU can send packets to Dom0 and those are visible using
>>>>>>> tcpdump 
>>>>>>> on the Dom0 on the virtual interface, but not the other way
>>>>>>> around.
>>>>>> I hadn't got to that level of diagnosis, but I can confirm that
>>>>>> that's what seems to be happening here too.
>>>>>> 
>>>>>>> Now, I have made more than one change at a time (I'd like to
>>>>>>> avoid having to pin it down, since I can only reproduce it on a
>>>>>>> production machine, as I said, so suggestions are welcome), but
>>>>>>> my suspicion is that it might have to do with the new "smart
>>>>>>> polling" feature in xen/netfront.  Note that I have also updated
>>>>>>> Dom0 to pull in the latest dom0/backend and netback changes, just
>>>>>>> to make sure it's not due to an issue that has been fixed there,
>>>>>>> but I'm still seeing the same behaviour.
>>>>>> I agree.  I think I started seeing this once I merged smartpoll
>>>>>> into netfront. 
>>>>>> 
>>>>>>     J
>>>>>> 
>>>>>>> The production machine doesn't have much network load, but it
>>>>>>> deals with a lot of small network requests (mostly DNS and SMTP),
>>>>>>> a workload which is hard to reproduce on the test machine.  Heavy
>>>>>>> network load (NFS, FTP and so on) for days hasn't triggered the
>>>>>>> problem.  Also, segmentation offloading and similar settings
>>>>>>> don't have any effect.
>>>>>>> 
>>>>>>> The machine has 2 physical CPUs and the VM 2 virtual CPUs; the
>>>>>>> DomU has PREEMPT enabled.
>>>>>>> 
>>>>>>> I've been looking at the code to see if there might be a race
>>>>>>> condition somewhere, something like a situation where the hrtimer
>>>>>>> doesn't run while Dom0 believes the DomU is still polling and
>>>>>>> therefore doesn't emit an interrupt, but I'm afraid I don't know
>>>>>>> enough to judge this (I mean, the spinlocks look safe to me).
>>>>>>> 
>>>>>>> Do you have any suggestions on what to try?  I can trigger the
>>>>>>> issue on the production VM again, but debugging should not take
>>>>>>> more than a few minutes once it happens.  Access is only possible
>>>>>>> via the console.  Neither Dom0 nor the guest shows anything
>>>>>>> unusual in the kernel messages, and both continue to behave
>>>>>>> normally after the network goes dead (I am also able to shut down
>>>>>>> the guest normally).
>>>>>>> 
>>>>>>> Thanks,
>>>>>>>         Christophe
>>>>>>> 
>>>>>>> 
>>>>>>> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel