
Re: [Xen-devel] [PATCH RFC V2] xen/netback: Count ring slots properly when larger MTU sizes are used



On Mon, 2012-12-17 at 20:09 +0000, Matt Wilson wrote:
> On Mon, Dec 17, 2012 at 11:26:38AM +0000, Ian Campbell wrote:
> > On Fri, 2012-12-14 at 18:53 +0000, Matt Wilson wrote:
> > > On Thu, Dec 13, 2012 at 11:12:50PM +0000, Palagummi, Siva wrote:
> > > > > -----Original Message-----
> > > > > From: Matt Wilson [mailto:msw@xxxxxxxxxx]
> > > > > Sent: Wednesday, December 12, 2012 3:05 AM
> > > > >
> > > > > On Tue, Dec 11, 2012 at 10:25:51AM +0000, Palagummi, Siva wrote:
> > > > > >
> > > > > You can clearly see below that copy_off is an input to
> > > > > start_new_rx_buffer while copying frags.
> > > > > 
> > > > > Yes, but that's the right thing to do. copy_off should be set to the
> > > > > destination offset after copying the last byte of linear data, which
> > > > > means "skb_headlen(skb) % PAGE_SIZE" is correct.
> > > > > 
> > > >
> > > > No. It is not correct, for two reasons. For example, what if
> > > > skb_headlen(skb) is exactly a multiple of PAGE_SIZE? copy_off
> > > > would be set to ZERO. And now if there exists some data in frags,
> > > > ZERO will be passed in as the copy_off value and
> > > > start_new_rx_buffer will return FALSE. The second reason is the
> > > > obvious case from the current code where an
> > > > "offset_in_page(skb->data)"-sized hole will be left in the first
> > > > buffer after the first pass, in case the remaining data that needs
> > > > to be copied is going to overflow the first buffer.
> > > 
> > > Right, and I'm arguing that having the code leave a hole is less
> > > desirable than potentially increasing the number of copy
> > > operations. I'd like to hear from Ian and others if using the buffers
> > > efficiently is more important than reducing copy operations. Intuitively,
> > > I think it's more important to use the ring efficiently.
> > 
> > Do you mean the ring or the actual buffers?
> 
> Sorry, the actual buffers.
> 
> > The current code tries to coalesce multiple small frags/heads because it
> > is usually trivial but doesn't try too hard with multiple larger frags,
> > since they take up most of a page by themselves anyway. I suppose this
> > does waste a bit of buffer space and therefore could take more ring
> > slots, but it's not clear to me how much this matters in practice (it
> > might be tricky to measure this with any realistic workload).
> 
> In the case where we're consistently handling large heads (like when
> using an MTU of 9000 for streaming traffic), we're wasting 1/3 of
> the available buffers.

Sorry if I missed this earlier in the thread, but how do we end up
wasting so much?

For an skb with 9000 bytes in the linear area, which must necessarily be
contiguous, do we not fill the first two page-sized buffers completely?
The remaining 808 bytes must then go in a buffer of their own. Hrm, I
suppose that's about 27% wasted over the three pages. If we are doing
something worse than that, though, then we have a bug somewhere (nb:
older netbacks would only fill the first 2048 bytes of each buffer, so
the wastage was presumably phenomenal in that case ;-), although
MAX_BUFFER_OFFSET is now == PAGE_SIZE).
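
Spelling that arithmetic out:

    9000  = 4096 + 4096 + 808          (two full buffers plus one partial)
    waste = (3 * 4096 - 9000) / (3 * 4096)
          = 3288 / 12288
          = ~27%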

Unless I've misunderstood this thread and we are considering packing
data from multiple skbs into a single buffer? (i.e. the remaining
4096-808=3288 bytes in the third buffer above would contain data for the
next skb). Does the ring protocol even allow for that possibility? It
seems like a path to complexity to me.

> > The cost of splitting a copy into two should be low, though: the
> > copies are already batched into a single hypercall and I'd expect
> > things to be mostly dominated by the data copy itself rather than the
> > setup of each individual op, which would argue for splitting a copy
> > in two if that helps fill the buffers.
> 
> That was my thought as well. We're testing a patch that does just this
> now.
> 
> > The flip side is that once you get past the headers etc the paged
> > frags likely tend to be either bits and bobs (fine) or mostly whole
> > pages. In the whole-page case, trying to fill the buffers will result
> > in every copy getting split. My gut tells me that the whole-page case
> > probably dominates, but I'm not sure what the real-world impact of
> > splitting all the copies would be.
> 
> Right, I'm less concerned about the paged frags. It might make sense
> to skip some space so that the copying can be page-aligned. I suppose
> it depends on how many different pages are in the list, and what the
> total size is.
> 
> In practice I'd think it would be rare to see a paged SKB for ingress
> traffic to domUs unless there is significant intra-host communication
> (dom0->domU, domU->domU). When domU ingress traffic is originating
> from an Ethernet device it shouldn't be paged. Paged SKBs would come
> into play when a SKB is formed for transmit on an egress device that
> is SG-capable. Or am I misunderstanding how paged SKBs are used these
> days?

I think it depends on the hardware and/or driver. IIRC some devices push
down frag zero into the device for RX DMA and then share it with the
linear area (I think this might have something to do with making LRO or
GRO easier/workable).

Also things such as GRO can commonly cause received skbs that are passed
up the stack to contain several frags.

I'm not quite sure how this works, but in the case of s/w GRO I wouldn't
be surprised if it resulted in an skb with lots of 1500-byte (i.e. wire
MTU) frags. I think we would at least end up coalescing those at two per
buffer on transmit (3000/4096 = ~73% of the page filled).

Doing better would need start_new_rx_buffer either to always completely
fill each buffer or to take a much more global view of the frags (e.g.
also taking into consideration the size of the next frag and how it
would fit).
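
For reference, the test in start_new_rx_buffer is roughly the following
(paraphrased from memory rather than quoted, so treat it as a sketch):

    static bool start_new_rx_buffer(int offset, unsigned long size, int head)
    {
            /* Simple case: the current buffer is completely full. */
            if (offset == MAX_BUFFER_OFFSET)
                    return true;

            /*
             * Otherwise only start a fresh buffer if this chunk would
             * overflow the current one, would fit entirely in the next
             * one, there is already some data at "offset" and this is
             * not the head. That is why a copy_off of zero never
             * starts a new buffer, and why a near-page-sized chunk is
             * pushed whole into the next buffer, leaving the tail of
             * the current one unused.
             */
            if ((offset + size > MAX_BUFFER_OFFSET) &&
                (size <= MAX_BUFFER_OFFSET) && offset && !head)
                    return true;

            return false;
    }

"Always completely fill" would basically mean dropping the
"size <= MAX_BUFFER_OFFSET" part of that test and splitting the
offending copy at the buffer boundary, i.e. one more entry in the
already-batched grant-copy array rather than an extra hypercall.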

Ian.

