Re: [Xen-devel] [PATCH] xen/netback: calculate correctly the SKB slots
On Tue, 2012-05-22 at 20:24 +0100, Adnan Misherfi wrote:
>
> Konrad Rzeszutek Wilk wrote:
> >>>> wrong, which caused the RX ring to be erroneously declared full,
> >>>> and the receive queue to be stopped. The problem shows up when two
> >>>> guests running on the same server try to communicate using large
> >
> > .. snip..
> >
> >>> The function name is xen_netbk_count_skb_slots() in net-next. This
> >>> appears to depend on the series in
> >>> <http://lists.xen.org/archives/html/xen-devel/2012-01/msg00982.html>.
> >>>
> >> Yes, I don't think that patchset was intended for prime time just yet.
> >> Can this issue be reproduced without it?
> >
> > It was based on 3.4, but the bug and the work to fix it were done on
> > top of a 3.4 version of netback backported into a 3.0 kernel. Let me
> > double check whether there were some missing patches.
> >
> >>>>  	int i, copy_off;
> >>>>
> >>>>  	count = DIV_ROUND_UP(
> >>>> -		offset_in_page(skb->data)+skb_headlen(skb), PAGE_SIZE);
> >>>> +		offset_in_page(skb->data + skb_headlen(skb)), PAGE_SIZE);
> >>>>
> >>> The new version would be equivalent to:
> >>>     count = offset_in_page(skb->data + skb_headlen(skb)) != 0;
> >>> which is not right, as netbk_gop_skb() will use one slot per page.
> >>>
> >> Just outside the context of this patch we separately count the frag
> >> pages.
> >>
> >> However I think you are right if skb->data covers > 1 page, since the
> >> new version can only ever return 0 or 1. I expect this patch papers
> >> over the underlying issue by not stopping often enough, rather than
> >> actually fixing the underlying issue.
> >
> > Ah, any thoughts? Have you guys seen this behavior as well?
> >
> >>> The real problem is likely that you're not using the same condition
> >>> to stop and wake the queue.
> >>
> >> Agreed, it would be useful to see the argument for this patch
> >> presented in that light. In particular the relationship between
> >> xenvif_rx_schedulable() (used to wake the queue) and
> >> xen_netbk_must_stop_queue() (used to stop the queue).
> >
> > Do you have any debug patches to ... do open-heart surgery on the
> > rings of netback as it's hitting the issues Adnan has found?
> >
> >> As it stands the description describes a setup which can reproduce
> >> the problem but doesn't really analyse what actually happens, nor
> >> justify the correctness of the fix.
> >
> > Hm, Adnan - you dug into this and you have tons of notes. Could you
> > describe what you saw that caused this?
>
> The problem is that the function xen_netbk_count_skb_slots() returns two
> different counts for packets of the same type and size (ICMP, 3991). At
> the start of the test the count is one; later on the count changes to
> two. Soon after the count becomes two, the ring-full condition becomes
> true, the queue gets stopped, and it never gets started again. There are
> a few points to make here:
> 1- It takes fewer than 128 ping packets to reproduce this.
> 2- What is interesting here is that it works correctly for many packet
>    sizes, including 1500, 400, 500, 9000 (and 3990, but not 3991).
> 3- The count is inconsistent for the same packet size and type.
> 4- I do not believe the ring was actually full when it was declared
>    full; I think the consumer pointer was wrong (vif->rx_req_cons_peek
>    in xenvif_start_xmit()).
> 5- After changing the code, the count returned from
>    xen_netbk_count_skb_slots() was always consistent and worked just
>    fine; I let it run for at least 12 hours.
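As an aside, the equivalence Jan points out above is easy to check in
userspace. The following is a minimal sketch, not netback code: it
re-creates DIV_ROUND_UP() and offset_in_page() as macros, assumes a
4096-byte PAGE_SIZE, and uses a made-up address together with the
3991-byte head length from Adnan's report:

    #include <stdio.h>

    #define PAGE_SIZE          4096UL
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
    #define offset_in_page(p)  ((unsigned long)(p) & (PAGE_SIZE - 1))

    int main(void)
    {
        unsigned long data    = 0x10000UL + 256; /* hypothetical: head starts 256 bytes into a page */
        unsigned long headlen = 3991;            /* ICMP size from the bug report */

        /* Old expression: pages spanned by [data, data + headlen). */
        unsigned long old = DIV_ROUND_UP(offset_in_page(data) + headlen,
                                         PAGE_SIZE);

        /* Patched expression: offset_in_page() is always < PAGE_SIZE,
         * so dividing it (rounded up) by PAGE_SIZE can only yield 0 or 1. */
        unsigned long new = DIV_ROUND_UP(offset_in_page(data + headlen),
                                         PAGE_SIZE);

        printf("old count = %lu, patched count = %lu\n", old, new); /* 2 vs 1 */
        return 0;
    }

With these numbers the old expression counts two slots (the head really
does cross a page boundary) while the patched one counts one, which is
why the patch stops the queue less often rather than counting correctly.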
That doesn't really explain why you think your fix is correct though,
which is what I was asking for.

In any case, does Simon's patch also fix things for you? As far as I can
tell that is the right fix.

Ian.
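For readers following the "same condition to stop and wake the queue"
point raised earlier in the thread, here is a toy illustration of why
mismatched thresholds can stall a queue. The names and numbers are
hypothetical; this is not the real xenvif_rx_schedulable() or
xen_netbk_must_stop_queue() logic:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical ring state and thresholds, for illustration only. */
    static unsigned free_slots = 1;

    static bool must_stop_queue(unsigned needed)  /* stop path */
    {
        return free_slots < needed;
    }

    static bool rx_schedulable(void)              /* wake path */
    {
        return free_slots >= 1;  /* checks a different, smaller threshold */
    }

    int main(void)
    {
        unsigned pkt_slots = 2;  /* next packet needs two slots */

        if (must_stop_queue(pkt_slots))
            printf("stopped: %u free < %u needed\n", free_slots, pkt_slots);

        /* The wake side is satisfied with one free slot, so it considers
         * the queue runnable even though the stop side will park it again
         * at once; with mismatched conditions the queue can ping-pong or,
         * depending on when each check runs, never restart at all. */
        printf("wake path thinks schedulable: %d\n", rx_schedulable());
        return 0;
    }

This is the shape of bug a fix that aligns the two conditions would
address, which fits Ian's remark above about what the right fix is.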