[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



On Tue, Mar 18, 2014 at 04:21:27PM +0100, Sander Eikelenboom wrote:
[...]
> 
> Added even more warns ...
> 
> [  297.885969] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:4 vif->rx.sring->req_prod:21764 vif->rx.req_cons:21762
> [  298.760555] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:3 vif->rx.sring->req_prod:22488 vif->rx.req_cons:22486
> 
> [  306.376176] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
> npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
> [  306.376556] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:1 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
> [  306.391863] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
> npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
> req->gref:4325377 req->id:153
> 
> [  306.407599] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
> npo->copy_gref:4325377  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:640 
> size:640 i:4
> [  306.423913] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
> npo->copy_gref:4325377 npo->copy_off:640  MAX_BUFFER_OFFSET:4096 bytes:640 
> size:0 i:5
> 
> 
> [  306.440941] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 
> npo->meta_prod:31 old_meta_prod:25 vif->rx.sring->req_prod:28313 
> vif->rx.req_cons:28314 gso_type:1 gso_size:1448 nr_frags:1 req->gref:638 
> req->id:147
> [  306.458334] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
> gso_type:0 gso_size:0 nr_frags:0
> [  306.476097] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
> gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154
> [  306.494462] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 before 
> npo->meta_prod:32 old_meta_prod:31 vif->rx.sring->req_prod:28313 
> vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 
> req->id:154 j:0
> [  306.513424] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here start   
> npo->meta_prod:32 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
> npo->copy_gref:4325377 npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:0 
> size:66 i:0
> [  311.390883] net_ratelimit: 317 callbacks suppressed
> [  311.400901] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> min_slots_needed:3 vif->rx.sring->req_prod:32386 vif->rx.req_cons:32322
> 
> - So in this case we are in the 3rd iteration of the loop in 
> xenvif_gop_frag_copy ...
> - Xenvif_start_xmit stop the queue since it has detected it needs one more 
> slot which is unavailable at that time.

Yes.

> - The current rx thread however doesn't know and doesn't check  (neither in 
> the loop in xenvif_gop_frag_copy nor in get_next_rx_buffer that the ring if 
> full) .. while prod == cons now .. consumes another one ..

It does check -- but not in xenvif_gop_frag_copy -- see
xenvif_rx_action, which calls xenvif_rx_ring_slots_available before
queueing skb to break down. That is, when you call xenvif_gop_skb there
should be enough room to accommodate that SKB.

> - That ring request leads to the bad grant references reported by the 
> hypervisor
> 
> (XEN) [2014-03-18 15:02:58.928] grant_table.c:1857:d0v2 Bad grant reference 
> 4325377
> 
> So should there be a check added there ... or should the callers  
> "xenvif_gop_frag_copy" and the caller of that one "xenvif_gop_skb" already 
> have anticipated that what the were about
> to do wasn't going to fit anyway ?
> 

No, see above.

> And of course .. how made Paul's change trigger this ?
> 

Before Paul's change, we always reserve very large room for an incoming
SKB. After Paul's change, we only reserve just enough room. Probably
some extra room prevents this bug being triggered.

Wei.

> 
> >> The second time it does get to the code after the RING_GET_REQUEST in 
> >> 'get_next_rx_buffer' and that leads to mayhem ...
> >> 
> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue" case .. 
> >> unfortunately it now triggers net_ratelimit at the end:
> >> 
> >> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
> >> min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222
> 
> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
> > req_cons < needed, the queue is stopeed.
> 
> > Wei.
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.