
Re: [Xen-devel] [PATCH] fix netfront alloc_page error handling bug, need to raise up rx_refill timer if rx request buffer not big enough for backend



First thing, for a network driver patch you should send to the
netdev@ list as well. As a bug fix it would need to be prefixed with
[PATCH net], targeting the net tree.

A good read on how netdev@ works can be found in the kernel's
documentation directory, in netdev-FAQ.txt.

Second, the analysis is good but your commit message is not formal.
Could you please have a look at SubmittingPatches.txt to see how a
commit message should be written.

On Thu, Nov 14, 2013 at 12:15:48AM +0800, Ma JieYue wrote:
> Hi,
> 
> We found a netfront driver bug when a user downloaded a large 8G file 
> using wget, in a CentOS guest on Xen. The virtual machine had only 512MB 
> of memory and the system was under high IO stress while wget was running. 
> After some time, wget hung and netfront could not receive any packets 
> from netback at all. We also observed the netback vif device dropping 
> packets at that time.
> 
> By debugging we found lots of alloc_page errors in the netfront driver, 
> occurring in xennet_alloc_rx_buffers after wget had been running for a 
> while. The alloc_page errors did not stop until wget hung for good. After 
> that, the rx IO ring parameters from the frontend and backend drivers 
> indicated that netfront did not provide enough rx request buffers to 
> netback, so the netback driver considered the ring full and dropped 
> packets.
> 
> The rx IO rings looked as below:
> 
> [netback]
> rsp_prod_pvt 666318, req_cons 666318, nr_ents 256
> sring req_prod 666337, req_event 666338, rsp_prod 666318, rsp_event 666319
> 
> [netfront]
> rsp_cons 666318, req_prod_pvt 666337
> sring req_prod 666337, req_event 666338, rsp_prod 666318, rsp_event 666319
> 
> So, the rx request ring only had 19 requests for the backend, while the 
> backend needs at least 20 (MAX_SKB_FRAGS + 2) request buffers in order 
> to send a single skb to netfront when SG and TSO are enabled.
> 

FWIW newer kernels have MAX_SKB_FRAGS = 17, not 18.
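
To make the 20-slot figure concrete: one slot for the linear head, up to
MAX_SKB_FRAGS slots for the fragments, plus one extra slot for the GSO
metadata. A minimal sketch of that accounting (hypothetical helper, not
code from the driver, which does the computation inline as in the patch
below):

    /* Worst-case number of rx ring slots one skb can consume, per the
     * analysis above: head + fragments + GSO extra slot. Illustrative
     * only. */
    static unsigned int rx_slots_per_skb(bool sg_tso_enabled)
    {
            return sg_tso_enabled ? MAX_SKB_FRAGS + 2 : 1;
    }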

> We also gathered some info about the alloc_page failures, for the last 
> 6 occurrences before netfront/netback stopped communicating. At the last 
> alloc_page failure, the rx request ring had only 18 request buffers left, 
> and the driver started to allocate 46 pages to refill it. Unfortunately, 
> it failed to allocate the second page, so after this round it had 
> refilled only one page and the rx request ring had only 19 request 
> buffers left.
> 

I just looked at the code snippet; would it be simpler to move the
mod_timer call before the "goto refill" in that path?
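
Something like this, an untested sketch against the current no_skb path
(context abbreviated):

     no_skb:
            /* Arm the retry timer on any allocation failure, even if
             * some buffers were already queued, so the ring is
             * guaranteed to be topped up later. */
            mod_timer(&np->rx_refill_timer, jiffies + (HZ/10));
            /* Any skbuffs queued for refill? Force them out. */
            if (i != 0)
                    goto refill;
            break;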

> The debug log looks as below:
> 
> alloc_rx_buffers, alloc_page failed at 2 page, try to alloc 26 pages, 
> rx_target 64, req_prod 552445, rsp_cons 552407
> alloc_rx_buffers, alloc_page failed at 0 page, try to alloc 36 pages, 
> rx_target 64, req_prod 552447, rsp_cons 552419
> alloc_rx_buffers, alloc_page failed at 0 page, try to alloc 46 pages, 
> rx_target 64, req_prod 552447, rsp_cons 552429
> alloc_rx_buffers, alloc_page failed at 0 page, try to alloc 46 pages, 
> rx_target 64, req_prod 552447, rsp_cons 552429
> alloc_rx_buffers, alloc_page failed at 4 page, try to alloc 28 pages, 
> rx_target 64, req_prod 639240, rsp_cons 639204
> alloc_rx_buffers, alloc_page failed at 1 page, try to alloc 46 pages, 
> rx_target 64, req_prod 639244, rsp_cons 639226
> 
> The problem is that if the alloc_page failure does not occur at the 
> first page, the rx_refill_timer will not be set, and it is that timer 
> that would later top up the rx request ring. So, our fix is to set the 
> rx_refill_timer whenever the rx request ring may not be big enough. The 
> patch works well under our stress test.
> 
> 
> Signed-off-by: Ma JieYue <jieyue.majy@xxxxxxxxxxxxxx>
> ---
>  drivers/net/xen-netfront.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 1db10141..680e959 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -243,10 +243,16 @@ static void xennet_alloc_rx_buffers(struct net_device *dev)
>       unsigned long pfn;
>       void *vaddr;
>       struct xen_netif_rx_request *req;
> +     int needed;
>  
>       if (unlikely(!netif_carrier_ok(dev)))
>               return;
>  
> +     if (dev->features & (NETIF_F_SG|NETIF_F_TSO))

This is the guest receive path; why does NETIF_F_TSO matter?

Wei.

> +             needed = MAX_SKB_FRAGS + 2;
> +     else
> +             needed = 1;
> +
>       /*
>        * Allocate skbuffs greedily, even though we batch updates to the
>        * receive ring. This creates a less bursty demand on the memory
> @@ -327,6 +333,9 @@ no_skb:
>  
>       /* Above is a suitable barrier to ensure backend will see requests. */
>       np->rx.req_prod_pvt = req_prod + i;
> +     if (i && (np->rx.req_prod_pvt - np->rx.sring->rsp_prod < needed))
> +             mod_timer(&np->rx_refill_timer, jiffies + (HZ/10));
> +
>   push:
>       RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&np->rx, notify);
>       if (notify)
> -- 
> 1.8.4
> 
> 
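
For reference, the check added at the end of the patch relies on the
usual wrap-safe unsigned ring-index arithmetic; here is a standalone
sketch with the values from the ring dump above (illustrative only):

    /* The ring indices are free-running unsigned counters, so
     * req_prod_pvt - rsp_prod stays correct even across wrap-around.
     * "needed" uses the analysis's MAX_SKB_FRAGS (18) + 2 = 20. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t req_prod_pvt = 666337; /* requests produced by frontend */
            uint32_t rsp_prod     = 666318; /* responses produced by backend */
            uint32_t needed       = 20;     /* MAX_SKB_FRAGS + 2 */

            uint32_t outstanding = req_prod_pvt - rsp_prod; /* 19 */

            if (outstanding < needed)
                    printf("ring short: %u < %u, arm rx_refill_timer\n",
                           outstanding, needed);
            return 0;
    }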

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
