
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Monday, March 17, 2014, 11:35:24 AM, you wrote:

> On Wed, Mar 12, 2014 at 05:47:29PM +0100, Sander Eikelenboom wrote:
>> 
>> Wednesday, March 12, 2014, 4:45:01 PM, you wrote:
>> 
>> > On Wed, Mar 12, 2014 at 04:20:03PM +0100, Sander Eikelenboom wrote:
>> > [...]
>> >> 
>> >> > Sorry, remove the trailing "S". Actually you only need to look at 
>> >> > netback.c.
>> >> 
>> >> Which producer index should I compare with? There are quite a few 
>> >> RING_GET_REQUESTS calls, and I see:
>> >> npo->meta_prod
>> >> vif->rx.sring->req_prod
>> >> vif->pending_prod
>> >> 
>> >> to name a few. Is there a particular RING_GET_REQUESTS call and a 
>> >> particular producer index you are interested in?
>> >> 
>> >> 
>> 
>> > There are two macros you can use
>> 
>> > RING_REQUEST_CONS_OVERFLOW and RING_REQUEST_PROD_OVERFLOW.
>> 
>> Ah, I already rolled my own .. the diff to netback is attached ..
>> 
>> Netback:
>> Mar 12 17:41:26 serveerstertje kernel: [  464.778614] vif vif-7-0 vif7.0: 
>> ?!? npo->meta_prod:37 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431007
>> Mar 12 17:41:26 serveerstertje kernel: [  464.786203] vif vif-7-0 vif7.0: 
>> ?!? npo->meta_prod:38 vif->rx.sring->req_prod:431006 vif->rx.req_cons:431008

> req_prod < req_cons, so there's an overflow. I'm actually curious how
> this could happen.

> Back to the code, before netback enqueues SKB to its internal queue, it
> will check if there's enough room in the ring. Before Paul's changeset,
> it checks against a static number (the possible maximum slots that can
> be consumed by an SKB). Paul's changeset made it check against the
> actual slots the incoming SKB consumes. See
> interface.c:xenvif_start_xmit.  Another interesting site would be where
> the SKB is broken down later on in the internal queue. See
> netback.c:xenvif_rx_action. The routine to break down an SKB is
> xenvif_gop_skb.

> Although they look alright to me, you might want to instrument them
> a bit more to see what triggers that overflow. It's a bit frustrating,
> but a bug that cannot be easily reproduced is indeed extremely hard to
> fix.

[  554.166914] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:7 vif->rx.sring->req_prod:301750 vif->rx.req_cons:301751
[  571.143788] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823
[  571.150762] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:37 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
req->gref:19791875 req->id:22
[  571.164691] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364823 
npo->copy_gref:19791875
[  571.179474] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:364822 
vif->rx.req_cons:364823 gso_type:1 gso_size:1448 nr_frags:1 req->gref:576 
req->id:14
[  571.194844] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
gso_type:0 gso_size:0 nr_frags:0
[  571.211215] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
npo->meta_prod:38 vif->rx.sring->req_prod:364822 vif->rx.req_cons:364824 
gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:23
[  571.228471] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
npo->meta_prod:39 old_meta_prod:38 vif->rx.sring->req_prod:364822 
vif->rx.req_cons:364824 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 
req->id:23
[  571.246412] vif vif-7-0 vif7.0: Bad status -3 from copy to DOM7.
[  571.255531] vif vif-7-0 vif7.0: ?!? xenvif_check_gop status err? status:-1 
i:7 nr_meta_slots:8 copy_op->status:-3 npo->copy_cons:38
[  571.264804] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:8 flags:7 size:1 ip_summed:1 is_gso:1 
skb_shinfo(skb)->gso_type:1634 vif->meta[npo.meta_cons].gso_type:3 nr_frags:1 
npo.copy_prod:39, npo.meta_cons:30

It seems that in 'get_next_rx_buffer' you can get into the situation where 
cons > prod, but the first time, somehow, the code after the actual 
RING_GET_REQUEST isn't run (why?!? queue stop?) .. and it doesn't lead to an 
immediate problem ..
The second time it does get to the code after the RING_GET_REQUEST in 
'get_next_rx_buffer', and that leads to mayhem ...

So I added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. 
unfortunately it now trips net_ratelimit at the end:

[  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222
[  403.409624] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:191612 vif->rx.req_cons:191609
[  403.658975] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:192860 vif->rx.req_cons:192860
[  406.417504] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:204870 vif->rx.req_cons:204869
[  408.156112] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:1 vif->rx.sring->req_prod:212862 vif->rx.req_cons:212862
[  408.407375] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:7 vif->rx.sring->req_prod:213613 vif->rx.req_cons:213609
[  408.899250] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:215352 vif->rx.req_cons:215352
[  413.655552] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:237250 vif->rx.req_cons:237249
[  417.649867] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:252580 vif->rx.req_cons:252580
[  418.400194] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:255828 vif->rx.req_cons:255826
[  419.147431] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:37 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746
[  419.149966] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:1 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258745
[  419.153145] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:37 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 
req->gref:19791875 req->id:185
[  419.159518] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258746 
npo->copy_gref:19791875
[  419.166980] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 
npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:258745 
vif->rx.req_cons:258746 gso_type:1 gso_size:1448 nr_frags:1 req->gref:686 
req->id:177
[  419.175597] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258747 
gso_type:0 gso_size:0 nr_frags:0
[  419.185284] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
npo->meta_prod:38 vif->rx.sring->req_prod:258745 vif->rx.req_cons:258747 
gso_type:0 gso_size:0 nr_frags:0 req->gref:4325379 req->id:186
[  422.717242] net_ratelimit: 158 callbacks suppressed

Hope it sheds some light this time :-)

diff to netback is attached.

--
Sander

> Wei.

Attachment: netback_debug.diff
Description: Binary data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

