|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Wednesday, March 19, 2014, 12:35:32 PM, you wrote:
> On Wed, Mar 19, 2014 at 12:11:04AM +0100, Sander Eikelenboom wrote:
> [...]
>>
>> >>> Before Paul's change, we always reserve very large room for an incoming
>> >>> SKB. After Paul's change, we only reserve just enough room. Probably
>> >>> some extra room prevents this bug being triggered.
>>
>> >> [ 599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req
>> >> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387
>> >> vif->rx.sring->req_event:504174
>> >> [ 599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
>> >> min_slots_needed:1 vif->rx.sring->req_prod:506387
>> >> vif->rx.req_cons:506387 vif->rx.sring->req_event:506388
>> >> [ 600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req
>> >> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388
>> >> req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
>> >> [ 600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2
>> >> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388
>> >> npo->copy_gref:165543936 npo->copy_off:0 MAX_BUFFER_OFFSET:4096
>> >> bytes:1168 size:1168 i:6 vif->rx.sring->req_event:506388
>> >> estimated_slots_needed:8
>> >> [ 600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end
>> >> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388
>> >> npo->copy_gref:165543936 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096
>> >> bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388
>> >> estimated_slots_needed:8
>> >> [ 600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4
>> >> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387
>> >> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570
>> >> req->id:11 estimated_slots_needed:8 i(frag): 0
>> >> [ 600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5
>> >> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387
>> >> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570
>> >> req->id:11 estimated_slots_needed:8
>> >> [ 600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..
>> >> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388
>> >> sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1
>> >> max_slots_needed:8 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6
>> >> min_slots_needed:3
>> >> [ 600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 ..
>> >> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388
>> >> max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0
>> >> nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1
>>
>> >> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't
>> >> the queue have been stopped before getting here ..
>>
>> > *hrmm please just ignore* .. got to get some sleep i guess ..
>>
>> Just went the empirical way .. and unconditionally upped the calculated
>> "max_slots_needed" by one in "xenvif_rx_action" just before checking the
>> "xenvif_rx_ring_slots_available",
>> this has prevented all non-fatal and fatal occurrences of "cons > prod" so
>> far. I will leave my tests running overnight, see if it survives the
>> pounding.
>>
>> >From other net drivers i see different calculations .. seems it is kind of
>> >voodoo to determine the value .. one of which
>> >(drivers/net/ethernet/marvell/sky2.c)
>> seems to unconditionally reserve a slot for both GSO and CKSUM. Don't know
>> if that makes any sense though:
>>
>> /* This is the worst case number of transmit list elements for a single skb:
>> VLAN:GSO + CKSUM + Data + skb_frags * DMA */
>> #define MAX_SKB_TX_LE (2 +
>> (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))
>>
> Xen network "wire" format doesn't have a CKSUM metadata type though.
>>
>>
>> >>> Wei.
>>
>> >>>>
>> >>>> >> The second time it does get to the code after the RING_GET_REQUEST
>> >>>> >> in 'get_next_rx_buffer' and that leads to mayhem ...
>> >>>> >>
>> >>>> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue"
>> >>>> >> case .. unfortunately it now triggers net_ratelimit at the end:
>> >>>> >>
>> >>>> >> [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping
>> >>>> >> queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228
>> >>>> >> vif->rx.req_cons:189222
>> >>>>
>> >>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
>> >>>> > req_cons < needed, the queue is stopeed.
>>
>> >> So it seems .. most of the time .. but if i look at the calculation of
>> >> "min_slots_needed" in this function it seems completely different from
>> >> the one in
>> >> xenvif_rx_action for max_slots_needed .. though both seem to be used for
>> >> the same thing .. to calcultate how many slots the brokendown SKB would
>> >> need to fit in ..
>> >> So i added the calculation method from xenvif_start_xmit to
>> >> xenvif_rx_action .. in the error case you see min_slots_needed was 3 ..
>> >> but max_slots_needed and max_slots_used both were 8.
>>
> Those are different estimations. max_slots_needed estimates the worst
> case while min_slots_needed estimates the best case. When
> min_slots_needed is met, netback puts the SKB into its internal queue.
> xenvif_rx_action then will dequeue that packet, check max_slots_needed,
> if met, break SKB down.
I realised that, though if you don't count the cost of calculation .. the best
would be
to have min_slots and max_slots the same .. and being correct. But cost of the
calculation count ..
on the other hand min_slots seems to be far too low in this case. Is the
nr_frags the best thing to check for ?
Wouldn't something like skbsize / PAGE_SIZE be a better estimate ?
> What I would expect is, even if they yield different values, checking
> the ring availability is enough before breaking SKB down.
> In your above case, max_slots_needed was met and SKB broken down. And as
> you said in your empirical test, bumping max_slots_needed by 1 prevented
> issues, then we might have problem calculating max_slots_needed. If you
> can work out what went wrong that can be very useful.
OK ran the whole night with "max_slots_needed+1" and i saw no triggering of
both:
- the non-fatal oddity (only a temporary cons > prod ..)
- the fatal case (leading to the grantref error)
So it seems we are off by one .. somewhere
At first the non-fatal oddity .. this one i really can't grasp:
[ 567.880282] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:562972 vif->rx.req_cons:562969
vif->rx.sring->req_event:562973
[ 569.893801] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:571188 vif->rx.req_cons:571187
vif->rx.sring->req_event:571189
[ 571.892419] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:577141 vif->rx.req_cons:577138
vif->rx.sring->req_event:577142
[ 575.451383] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req
npo->meta_prod:6 vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603
vif->rx.sring->req_event:589615, reserved_slots_left:0
[ 575.487721] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req
npo->meta_prod:6 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590604
req->gref:569 req->id:11 vif->rx.sring->req_event:589615 reserved_slots_left:-1
[ 575.524427] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590660
vif->rx.sring->req_event:590661
[ 576.508063] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req
npo->meta_prod:16 vif->rx.sring->req_prod:594343 vif->rx.req_cons:594343
vif->rx.sring->req_event:591619, reserved_slots_left:0
[ 576.544708] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req
npo->meta_prod:16 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594344
req->gref:668 req->id:167 vif->rx.sring->req_event:591619 reserved_slots_left:-1
[ 576.581595] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594389
vif->rx.sring->req_event:594390
[ 577.325826] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:2 vif->rx.sring->req_prod:597248 vif->rx.req_cons:597247
vif->rx.sring->req_event:597249
[ 577.576973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:598146 vif->rx.req_cons:598143
vif->rx.sring->req_event:598147
What i don't get is:
- i do warn just before the RING_GET_REQUEST on vif->rx.sring->req_prod:590603
vif->rx.req_cons:590603 .. so cons == prod
- i warn because req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++); is
going to add one to cons .. leading to a situation that should not exist ..
- however in the warn just after this RING_GET_REQUEST .. prod is somehow
increased a lot .. probably the "next_rx_buffer" aspect of things .. so cons is
much smaller than prod again ..
- How everi don't see anything in the RING_GET_REQUEST macro fiddling with
vif->rx.sring->req_prod ..
- so who has been fiddling with vif->rx.sring->req_prod in the mean time .. my
very limited c skills fail to see how this works :-)
Then the fatal case ...
[ 1108.566486] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:1 vif->rx.sring->req_prod:1843278 vif->rx.req_cons:1843278
vif->rx.sring->req_event:1843279
[ 1109.775757] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:1847779 vif->rx.req_cons:1847776
vif->rx.sring->req_event:1847780
[ 1110.041661] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:1848795 vif->rx.req_cons:1848794
vif->rx.sring->req_event:1848796
[ 1110.301778] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:4 vif->rx.sring->req_prod:1849755 vif->rx.req_cons:1849754
vif->rx.sring->req_event:1849756
[ 1111.565412] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req
npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291
vif->rx.sring->req_event:1853482, reserved_slots_left:0
[ 1111.565973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !
min_slots_needed:1 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291
vif->rx.sring->req_event:1854292
[ 1111.638425] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req
npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292
req->gref:4325377 req->id:83 vif->rx.sring->req_event:1854292
reserved_slots_left:-1
[ 1111.675042] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2
npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292
npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168
size:1168 i:5 vif->rx.sring->req_event:1854292 estimated_slots_needed:7
reserved_slots_left:-1
[ 1111.711833] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end
npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292
npo->copy_gref:4325377 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168
size:0 i:6 vif->rx.sring->req_event:1854292 gso_gaps:0 estimated_slots_needed:7
reserved_slots_left:-1
[ 1111.766836] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 after
npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291
vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1
req->gref:674 req->id:76 estimated_slots_needed:7 i(frag):0 j(data):1
reserved_slots_left:-1
[ 1111.803702] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5
npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291
vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1
req->gref:674 req->id:76 estimated_slots_needed:7 j(data):1
reserved_slots_left:-1 used in funcstart: 0 + 1 .. used_dataloop:1 ..
used_fragloop:6
[ 1111.858602] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..
vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 sco->meta_slots_used:7
max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:7 j:3 is_gso:1 nr_frags:1
firstpart:1 secondpart:5 reserved_slots_left:-1
Ok so if i try to pull this apart (from bottom to top):
- it starts out with "xenvif_rx_action" and calculated we would need:
1 slot for the DIV_ROUND_UP(offset_in_page(skb->data) +
skb_headlen(skb), PAGE_SIZE);
5 slots for 1 frag
1 slot for GSO
So 7 slots in total ..
- it checks if there are enough slots available if
(!xenvif_rx_ring_slots_available(vif, max_slots_needed)) and aparently there
are .. so it continues
- We end up in "xenvif_gop_skb" processing .. and everything seems to go well ..
- In the start of this function we do one unconditional RING_GET_REQUEST
.. and potentially one conditional (if gso)
- Question: in "xenvif_rx_action" we only reserved ONE slot for GSO
stuff .. so this could cost TWO slots .. is that right ?
- But in this case we only use one ... see "used in funcstart: 0 + 1"
- Now we loop 1 time through the "while (data < skb_tail_pointer(skb))"
loop ... since j(data)=1
- and call "xenvif_gop_frag_copy" from inside this loop here .. which
potentially calls .. get_next_rx_buffer
- If i understand it correctly the number of slots this loop should
take at a maximum should be what we calculated before with the
DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb),
PAGE_SIZE) ? (1 slot in this case)
- Nothing is logged yet because cons is still smaller than prod at
this time, but from the aftermath we see this was "used_dataloop:1"
- So that seems correct ..
- Now we are at the loop through "nr_frags" and loop the 1 frag we have
..
- This all goes well until the sixth iteration (i from 0..5)
.. and call "xenvif_gop_frag_copy" from inside this loop ...
- "xenvif_gop_frag_copy" calls "get_next_rx_buffer" .. cons
becomes larger than prod ..and the "miracle" from the non-fatal case of prod
suddenly becoming larger does not happen
- So the calculation of how many slots are required for the frags in
"xenvif_rx_action" does not match wath the code in "xenvif_gop_frag_copy" seems
to need,
I don't see why .. it's rounding up .. should behave correct if the
left over happens to be 0 size ..
> Wei.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |