
[Xen-devel] ADs over dom0 iSCSI = high page_count()




I've come across a disturbing page ref count situation and need some advice.
This only happens very rarely, when writing through ADs to iSCSI storage.
(My guess is that this is probably during a tcp fragmented retransmit.)

Novell SLES10sp2 kernel :: Xen 3.2, but all of the blkback and netback
code is the same as unstable.

1.  blkback :: maps the foreign page :: page_count() == 1
2.  blkback :: submits a bio with this foreign page
3.  iscsi_tcp :: makes a tcp request with this foreign page
4.  tcp :: gets twice, page_count() == 3
5.  tcp :: puts once, page_count() == 2
6.  tcp :: gets twice, page_count() == 4
7.  __gnttab_dma_map_page(), sets page_mapcount() == 1
8.  tcp :: puts twice, page_count() == 2
9.  tcp :: done, but page_count() == 2, not 1
10. iscsi_tcp :: done, bio completes
11. blkback :: __end_block_io_op() calls fast_flush_area()
       page state:  page_count() == 2, page_mapcount() == 1

BUT:    page_count() should be 1 and page_mapcount() should be 0
       Perhaps these two counts are related, but I'm wondering if these
       might be two separate issues.  However, in all of my reproductions
       of this issue, whenever __gnttab_dma_map_page() gets called, the
       page_count() also ends up high.
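
To make the arithmetic in steps 1-11 concrete, here is a small user-space C model of the reported sequence. It is only a sketch: struct page_model and the model_*() helpers are hypothetical names, not kernel code, and the mapcount is treated as a simple flag, matching the way it is described in this post.

```c
#include <assert.h>

/* Toy stand-in for the two counters discussed above.
 * This is NOT the kernel's struct page; all names are hypothetical. */
struct page_model {
    int count;     /* models the value reported by page_count() */
    int mapcount;  /* models page_mapcount() as reported here: 0 or 1 */
};

static void model_get(struct page_model *p)
{
    p->count++;
}

static void model_put(struct page_model *p)
{
    assert(p->count > 0);   /* a put without a reference is a bug */
    p->count--;
}

/* Models step 7: the value is set like a flag, not incremented. */
static void model_dma_map(struct page_model *p)
{
    p->mapcount = 1;
}

/* Replays steps 1-9 and returns the state blkback sees in step 11. */
static struct page_model replay_reported_sequence(void)
{
    struct page_model p = { .count = 1, .mapcount = 0 };  /* step 1 */

    model_get(&p); model_get(&p);   /* step 4: count == 3 */
    model_put(&p);                  /* step 5: count == 2 */
    model_get(&p); model_get(&p);   /* step 6: count == 4 */
    model_dma_map(&p);              /* step 7: mapcount == 1 */
    model_put(&p); model_put(&p);   /* step 8: count == 2 */
    return p;
}
```

Calling replay_reported_sequence() yields count == 2 and mapcount == 1: four gets against three puts leave one reference more than the single one blkback started with.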

QUESTION 1:  Is a high page_count() after leaving the tcp layer, when
   the packets are fragmented, a known unsolved problem?

Looking at netback.c I see the comment in the read path:

   net_rx_action()
       /* We can't rely on skb_release_data to release the
          pages used by fragments for us, since it tries to
          touch the pages in the fraglist.  If we're in
          flipping mode, that doesn't work.  In copying mode,
          we still have access to all of the pages, and so
          it's safe to let release_data deal with it. */
       /* (Freeing the fragments is safe since we copy
          non-linear skbs destined for flipping interfaces) */

Also in netback.c in net_tx_action_dealloc() after make_tx_response() I see:

   /* Ready for next use. */
   gnttab_reset_grant_page()

Sure, this resets the page_mapcount() back to 0, but it also sets the
page_count() to 1 regardless of its current value.

QUESTION 2:  Why does the page_count() have to be set to 1?

QUESTION 3:  If the page_count() is known to be high after leaving the
   tcp layer by only 1 (i.e. page_count() == 2 instead of 1), then
   wouldn't an atomic_cmpxchg() be safer, or can the count be even
   higher?
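
As a rough illustration of the difference Question 3 is asking about, the sketch below uses C11 atomics in user space as an analogue of a conditional reset. The helper name reset_count_if_equal() is hypothetical. Note that C11's atomic_compare_exchange_strong() returns a success flag, whereas the kernel's atomic_cmpxchg() returns the old value, so this is not a drop-in for kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: set the count to 1 only if it is currently
 * exactly `expected_high`.  An unconditional store of 1 would instead
 * overwrite any value, including one higher than expected. */
static bool reset_count_if_equal(atomic_int *count, int expected_high)
{
    int expected = expected_high;

    assert(expected_high > 1);  /* nothing to reset otherwise */
    /* On failure, `expected` is overwritten with the observed value
     * and *count is left untouched. */
    return atomic_compare_exchange_strong(count, &expected, 1);
}
```

With a count of 2, reset_count_if_equal(&count, 2) succeeds and leaves 1. With a count of 3 it fails and leaves 3 in place. The compare only checks the value, so it cannot tell a leaked reference from a legitimate one that happens to produce the same count.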

I can add a call to gnttab_reset_grant_page() in blkback.  However, we
have found legitimate cases where the page_count() is 2, such as when
dhcpd is sniffing for a release_renew while there are IOs in progress.
Thus I'd like more understanding before setting the page_count().

Thank you,

Joshua

PS: Below is a more detailed walk-through of the get_page() and
   put_page() calls that resulted in the page_count() being high.

PPS: The thread that originally discussed the dhcpd SEGV, which occurs
   when dhcpd loses the race against blkback unmapping the page from
   dom0, is:

Problem with PV disk and iSCSI
http://lists.xensource.com/archives/html/xen-devel/2008-02/msg00330.html

================================================================
================================================================
================================================================

blkback maps the foreign page
   page_count() == 1

GetPage_Trace [ffff8800087ba6c0] (1) G 1 0
   | 562 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp.c
       do_tcp_sendpages()
           !can_coalesce

GetPage_Trace [ffff8800087ba6c0] (2) G 2 0
   | 1576 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
       skb_split_no_header()
           pos < len
           /* Split frag.
            * We have two variants in this case:
            * 1. Move all the frag to the second
            *    part, if it is possible. F.e.
            *    this approach is mandatory for TUX,
            *    where splitting is expensive.
            * 2. Split is accurately. We make this.
            */
   | 1134 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp_output.c
       tcp_write_xmit()
           calls tso_fragment()
           which eventually calls skb_split_no_header()

PutPage_Trace [ffff8800087ba6c0] (3) P 3 0
   | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
       skb_release_data()
           for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
   | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
       sk_stream_free_skb()
           calls __kfree_skb()
           which eventually calls skb_release_data()

??? second put_page() seems to be missing ???

================ ??? retransmit maybe ??? ================

GetPage_Trace [ffff8800087ba6c0] (4) G 2 0
   | 562 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp.c
       do_tcp_sendpages()
           !can_coalesce

GetPage_Trace [ffff8800087ba6c0] (5) G 3 0
   | 1576 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
       skb_split_no_header()
           pos < len
           /* Split frag.
            * We have two variants in this case:
            * 1. Move all the frag to the second
            *    part, if it is possible. F.e.
            *    this approach is mandatory for TUX,
            *    where splitting is expensive.
            * 2. Split is accurately. We make this.
            */
   | 1134 /srcTrees/na_main/nex.bk/linux/net/ipv4/tcp_output.c
       tcp_write_xmit()
           calls tso_fragment()
           which eventually calls skb_split_no_header()

dma_map_single()
   swiotlb_map_single()
       gnttab_dma_map_page()
           __gnttab_dma_map_page()
               In: drivers/xen/core/gnttab.c

               page->_mapcount gets set
               (Not an increment, but like a flag)

Sometimes this gets called multiple times for this same page

PutPage_Trace [ffff8800087ba6c0] (6) P 4 1
   | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
       skb_release_data()
           for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
   | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
       sk_stream_free_skb()
           calls __kfree_skb()
           which eventually calls skb_release_data()

PutPage_Trace [ffff8800087ba6c0] (7) P 3 1
   | 281 /srcTrees/na_main/nex.bk/linux/net/core/skbuff.c
       skb_release_data()
           for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
   | 462 /srcTrees/na_main/nex.bk/linux/include/net/sock.h
       sk_stream_free_skb()
           calls __kfree_skb()
           which eventually calls skb_release_data()
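
The seven trace records above can be tallied mechanically. The sketch below assumes that the two trailing numbers on each GetPage_Trace / PutPage_Trace line are the page_count() before the operation and the page_mapcount(); that reading is an assumption based on how the numbers line up with steps 1-9, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One trace record: op is 'G' (get) or 'P' (put).  The meaning of the
 * two numbers is an assumption, see the text above. */
struct trace_event {
    char op;
    int count_before;
    int mapcount;
};

/* Records (1) through (7), copied from the trace in this post. */
static const struct trace_event reported_trace[] = {
    { 'G', 1, 0 }, { 'G', 2, 0 }, { 'P', 3, 0 },
    { 'G', 2, 0 }, { 'G', 3, 0 },
    { 'P', 4, 1 }, { 'P', 3, 1 },
};

#define REPORTED_TRACE_LEN \
    (sizeof(reported_trace) / sizeof(reported_trace[0]))

struct trace_summary {
    int gets;
    int puts;
    int final_count;
    int consistent;   /* 1 if every count_before matched the tally */
};

static struct trace_summary summarize_trace(const struct trace_event *ev,
                                            size_t n)
{
    struct trace_summary s = { 0, 0, 0, 1 };
    int count = n ? ev[0].count_before : 0;

    for (size_t i = 0; i < n; i++) {
        if (ev[i].count_before != count)
            s.consistent = 0;   /* a get or put was not traced */
        if (ev[i].op == 'G') {
            s.gets++;
            count++;
        } else {
            assert(ev[i].op == 'P');
            s.puts++;
            count--;
        }
    }
    s.final_count = count;
    return s;
}
```

For reported_trace this gives 4 gets, 3 puts, and a final count of 2, with every recorded value matching the running tally. Under the stated assumption, no get or put went untraced between the records, so the imbalance is a put that was never issued rather than one the trace missed.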

================================================================

Joshua Nicolas
Virtual Iron Software, Inc.


