
Re: [Xen-devel] [RFC] netif: staging grants for requests



On Wed, 14 Dec 2016, Joao Martins wrote:
> # Proposed Extension
> 
> ## Terminology
> 
> `data gref` refers to the reusable/staging grants, whereas `gref` are the
> ones that are granted/revoked for each packet being sent. `command ring`
> refers to the current ring, whereas `data list` refers to the one proposed
> below.
> 
> ## Xenstore
> 
> "feature-staging-grants" is a capability to allow the use of a set of
> recycled buffers permanently mapped at the device setup. If
> advertised, the frontend will grant a region equivalent to ```maximum number
> of slots per ring * size of buffer``` where:
> 
>  * `maximum number of slots per ring` is the number of request structures
>    that can fit within the command ring. By default a request
>    structure consumes 16 bytes. The first 64 bytes of a ring are used by
>    the producer and consumer indices.

This means that by default the number of slots is (4096-64)/16 = 252,
correct?
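
For reference, a quick sketch of that sizing arithmetic as I read it from the
quoted text (the field sizes are the RFC's own numbers, nothing verified
against the actual headers):

```
#include <stdio.h>

#define PAGE_SIZE       4096u
#define RING_HDR_BYTES    64u   /* producer/consumer indices               */
#define REQUEST_BYTES     16u   /* per-request footprint, per the proposal */

int main(void)
{
    unsigned slots   = (PAGE_SIZE - RING_HDR_BYTES) / REQUEST_BYTES; /* 252 */
    unsigned buf_len = 256;     /* example "size of buffer"                */
    unsigned staging = slots * buf_len;
    unsigned grefs   = (staging + PAGE_SIZE - 1) / PAGE_SIZE;

    /* The proposal's own example rounds the slot count up to 256, which is
     * how it arrives at a 64KB staging region, i.e. 16 grefs. */
    printf("slots=%u staging=%u bytes (%u grefs)\n", slots, staging, grefs);
    return 0;
}
```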


>  * `size of buffer` is the maximum portion of the packet that the backend
>    (and frontend) have negotiated to be placed in each slot of the
>    `data ring`.
> 
> This means that for 256 possible packets in the ring and a chosen buffer size
> of 256 bytes, the amount of memory to be permanently granted is a 64KB
> region, i.e. 16 grefs.
> 
> The lack of "feature-staging-grants" or a value of 0 means that it's not
> supported. If the feature is present, then a second entry "staging-grants-sizes"
> must exist, containing the sizes that can be used per slot. To keep the
> frontend from picking arbitrary values, we limit it to a fixed set of sizes,
> semicolon-delimited.
> 
> The allowed values are implementation specific. Examples of good values
> include: 256 (protocol/header region), 2048 (fits 1 MSS, which is commonly in
> the linear part of Linux packets), 4096 (one grant per packet if feature-sg=0,
> for DPDK/XDP/netmap buffers) and 65535 (maximum packet size, i.e.
> ```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).

I am not following. So far we have been talking about two values:
- maximum number of slots per ring
- size of buffer
We have also discussed the number of bytes used for indexes at the
beginning of the command ring, which is 64 bytes.

But now you are introducing a few other values:
- protocol/header region
- fits 1 MSS which is common to be in the linear part of linux packets
- grant per packet
- maximum packet size (I take this is the same as size of buffer)

Could you please describe what these values represent in more detail
and how they relate to the two previous values?


> The individual size covered per entry is set by the frontend through
> ```tx-data-len``` or ```rx-data-len```, which contain the maximum amount of
> data for a single packet. The chosen values of "tx-data-len" and "rx-data-len"
> may be asymmetrical (hence can be different between TX and RX) and are
> validated against this list of valid sizes.
> For their use see the [datapath](#datapath-changes) section further below.
> 
>       /local/domain/1/device/vif/0/feature-staging-grants = "1"
>       /local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"

I don't get what the 4 numbers mean. Are they possible supported options
for "tx-data-len" and "rx-data-len", which in turn are the "size of
buffer" as described above?

If so, shouldn't they be written by the backend to backend paths? Why
only 4 options? Could there be more? What is the max number of allowed
options? And what is the minimum?
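
For what it's worth, this is how I picture the validation against that list;
the helper below is only an illustration (the function name is mine) of
checking a chosen "tx-data-len"/"rx-data-len" against the advertised,
semicolon-delimited sizes:

```
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Return true if 'chosen' (e.g. the value of "tx-data-len") is one of the
 * sizes advertised in "staging-grants-sizes", e.g. "128;256;2048;4096". */
static bool staging_len_valid(const char *sizes, unsigned long chosen)
{
    char *copy = strdup(sizes);
    bool ok = false;

    if (!copy)
        return false;

    for (char *tok = strtok(copy, ";"); tok; tok = strtok(NULL, ";")) {
        if (strtoul(tok, NULL, 10) == chosen) {
            ok = true;
            break;
        }
    }
    free(copy);
    return ok;
}
```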


>       /local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
>       /local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
>       /local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
>       /local/domain/1/device/vif/0/queue-0/rx-data-len = "256"
> 
> The ```tx-data-ref%u``` and ```rx-data-ref%u``` entries describe the list of
> grants to be used for each ring. The example above shows a list composed of
> a single

For what ring, the data ring?


> page, whereas multiple pages would look like:
> 
>       tx-data-ref0 = <data-ref-tx0>
>       tx-data-ref1 = <data-ref-tx1>
>       rx-data-ref0 = <data-ref-rx0>
>       rx-data-ref1 = <data-ref-rx1>
> 
> Each slot in the `data-ref` list is formatted as following:
> 
>             0     1     2     3     4     5     6     7  octet
>          +-----+-----+-----+-----+-----+-----+-----+-----+
>          | id        | gref                  | offset    |
>          +-----+-----+-----+-----+-----+-----+-----+-----+
> 
>       id: the frame identifier.

What is a "frame identifier"?


>       gref: grant reference to the frame.
>       offset: offset within the frame.

I assumed that each grant used for the data ring would be used in its
entirety. Does it really make sense to create a ring using portions of
pages? Given that we can only share memory at a page granularity, I
don't think it makes sense: the rest of the page would still be
accessible from the backend, which could compromise its content. I would
share a list of pages that can be mapped contiguously to create a data
ring.
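
As an aside, the 8-octet slot format drawn above translates to something like
the following struct; only the field widths come from the diagram, the name
and the packing attribute are mine:

```
#include <stdint.h>

struct netif_staging_slot {        /* hypothetical name */
    uint16_t id;                   /* frame identifier                      */
    uint32_t gref;                 /* grant reference to the frame          */
    uint16_t offset;               /* offset of the buffer within the frame */
} __attribute__((packed));         /* 8 octets on the wire, no padding      */
```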


> This list has twice as many slots as `tx-ring-ref` or `rx-ring-ref` would
> have, respectively, and it is set up at device creation and removed at device
> teardown, same as the command ring entries. This way we keep up with ring size
> changes, as is expected to happen with the command ring. A hypothetical
> multi-page command ring would increase the number of slots and thus this data
> list would grow accordingly. The list is terminated by an entry whose
> ```gref``` field is 0; the other fields of that terminating entry are ignored.

Could you please explain a bit more how the command ring increases the
number of slots and why we need to allocate twice the number of required
slots at the beginning?
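
On the termination rule itself, my reading is simply a scan until a zero
`gref` is found, along these lines (reusing the hypothetical struct from the
sketch above):

```
#include <stddef.h>

static size_t count_staging_slots(const struct netif_staging_slot *list,
                                  size_t max_slots)
{
    size_t n;

    for (n = 0; n < max_slots; n++) {
        if (list[n].gref == 0)     /* terminator: remaining fields ignored */
            break;
    }
    return n;
}
```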



> ## Datapath Changes
> 
> The packet is copied to/from the mapped data list grefs of up to `tx-data-len`
> or `rx-data-len`. This means that the buffer (referenced by `gref` from within
> the `struct netif_[tx|rx]_request`) has the same data up to `size`.  In other
> words, *gref[0->size] contents is replicated on the `data-ring` at `idx`. 
> Hence
> netback should ignore up to `size` of the `gref` when processing as the
> `data-ring` has the contents of it.

My lack of netfront/netback knowledge might show, but I am not
following. What do gref, size and idx represent in this context? Please
explain the new workflow in more detail.
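
To make the question concrete, here is one possible reading of the proposed TX
path (my sketch only; the helper names are hypothetical, not from the RFC):

```
#include <stdint.h>
#include <string.h>

extern uint8_t *staging_buf(unsigned id);   /* hypothetical: pre-mapped data gref */
extern unsigned tx_data_len;                /* negotiated per queue               */

static void frontend_stage_tx(unsigned id, const void *pkt, unsigned len)
{
    unsigned head = len < tx_data_len ? len : tx_data_len;

    /* The head of the packet is replicated into the staging slot matching
     * the request id ... */
    memcpy(staging_buf(id), pkt, head);
    /* ... while the tail (if any) is still described by the request's gref,
     * which is what netback would then copy/map as before. */
}
```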


> Values bigger than the 4096 page/grant boundary only have a special meaning
> for the backend, namely how much needs to be copied/pulled across the whole
> packet (which can be composed of multiple slots). Hence (e.g.) a value of
> 65536 vs 4096 leads to the same data list size, but the latter value would
> copy/pull only one gref in the whole packet, whereas the former would be a
> copy-only interface for all slots.
> 
> ## Buffer Identification and Flags
> 
> The data list ids must start from 0 and are global and continguous (across 
> both
> lists).  Data slot is identified by ring slot ```id``` field. Resultant data
> gref id to be used in RX data list is computed by subtract of `struct
> netif_[tx|rx]_request` ```id``` from total amount of tx data grefs.Example of
> the lists layout  below:

What is the "resultant data gref id"? What is the "RX data list"?
Please explain in more details.
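
My best guess at the id scheme, going by the example layout quoted below, is
the following (a sketch; `num_tx_data_grefs` is my name for the total number
of TX data grefs):

```
/* TX ids run 0..num_tx_data_grefs-1 and RX ids continue from there, so the
 * index into the RX data list is the request id minus the TX total. */
static unsigned rx_staging_index(unsigned id, unsigned num_tx_data_grefs)
{
    /* e.g. id 256 with 256 TX data grefs -> RX list entry 0 */
    return id - num_tx_data_grefs;
}
```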


> ```
>  [tx-data-ref-0, tx-data-len=256]
>  { .id = 0, gref = 512, .offset = 0x0   }
>  { .id = 1, gref = 512, .offset = 0x100 }
>  { .id = 2, gref = 512, .offset = 0x200 }
>  ...
>  { .id = 256, gref = 0, .offset = 0x0   }
> 
>  [rx-data-ref-0, rx-data-len=4096]
>  { .id = 256, gref = 529, .offset = 0x0 }
>  { .id = 257, gref = 530, .offset = 0x0 }
>  { .id = 258, gref = 531, .offset = 0x0 }
>  ...
>  { .id = 512, gref = 0,   .offset = 0x0 }
> ```
> 
> Permissions of RX data grefs are read-write, whereas TX data grefs are
> read-only.
> 
> ## Zerocopy
> 
> The frontend may wish to provide a bigger RX list than TX, and use RX buffers
> for transmission in a zerocopy fashion for guests mainly doing forwarding. In
> such cases the backend sets the NETTXF_staging_buffer flag in the
> ```netif_tx_request``` flags field, so that the `gref` field instead
> designates the `id` of a data gref.
> 
> This is only valid either when packets are solely described by the staging
> grants for the slot packet size being written, or when [tx|rx]-data-len is
> 4096 (for feature-sg 0) or 65535 (for feature-sg 1), so that no new `gref` is
> needed to describe the packet payload.
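
A sketch of how I read that flag (only NETTXF_staging_buffer and the request
fields come from the text above; the bit position and the helpers are
hypothetical):

```
#include <stdint.h>

#define NETTXF_staging_buffer (1U << 5)      /* bit position: placeholder only */

extern uint8_t *staging_buf(unsigned id);    /* hypothetical: pre-mapped slot  */
extern uint8_t *map_tx_gref(uint32_t gref);  /* hypothetical: normal grant map */

static uint8_t *tx_payload(uint16_t flags, uint32_t gref, uint16_t offset)
{
    if (flags & NETTXF_staging_buffer)
        return staging_buf(gref) + offset;   /* gref carries a data gref id */
    return map_tx_gref(gref) + offset;       /* gref is a real grant ref    */
}
```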
> 
> \clearpage
> 
> ## Performance
> 
> These numbers give a rough idea of the performance benefits of this extension.
> They are Guest <-> Dom0 measurements, which test the communication between
> backend and frontend while excluding other bottlenecks in the datapath (the
> software switch).
> 
> ```
> # grant copy
> Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
> Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
> Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
> Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps
> 
> # grant copy + grant map (see next subsection)
> Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
> Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps
> 
> # drop at the guest network stack
> Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
> Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
> ```
> 
> With this extension:
> ```
> # memcpy
> data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
> data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
> data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
> data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps
> 
> # memcpy + grant map (see next subsection)
> data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
> data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps
> 
> # (drop at the guest network stack)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps
> 
> # (drop with guest XDP_DROP prog)
> data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
> data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
> ```

Very nice!


> Latency measurements (netperf TCP_RR request size 1 and response size 1):
> ```
> 24 KTps vs 28 KTps
> 39 KTps vs 50 KTps (with kernel busy poll)
> ```
> 
> TCP bulk transfer measurements aren't showing a representative increase in
> maximum throughput (sometimes ~10%), but rather fewer retransmissions and
> more stable throughput. This is probably because of a slight decrease in RTT
> (i.e. the receiver acknowledging data quicker). I am currently exploring
> other data list sizes and will probably have a better idea of the effects of
> this.

This is strange. By TCP Bulk transfers, do you mean iperf? From my
pvcalls experience, I would expect a great improvement there too.


> ## Linux grant copy vs map remark
> 
> Based on the numbers above, there's a sudden 2x performance drop when we
> switch from grant copy to also grant mapping the `gref`: 1 295 001 vs 577 782
> pps for 256 and 260 byte packets respectively. This is all the more visible
> when replacing the grant copy with memcpy in this extension (3 248 392 vs
> 588 428). While there have been discussions about avoiding the TLB flush on
> unmap, one could wonder what the threshold of that improvement would be.
> Chances are that this is the least of our concerns in a fully populated host
> (or an oversubscribed one). Would it be worth experimenting with increasing
> the threshold of the copy beyond the header?

+1
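
If that experiment goes ahead, the knob I would picture is something like the
following (sketch only, not from the RFC):

```
/* Copy the first 'threshold' bytes through the staging area; only whatever
 * remains needs the grant map/unmap (and TLB flush) path. Returns 0 when
 * the whole packet fits the memcpy path. */
static unsigned bytes_to_grant_map(unsigned pkt_len, unsigned threshold)
{
    return pkt_len > threshold ? pkt_len - threshold : 0;
}
```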


> \clearpage
> 
> # References
> 
> [0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
> 
> [1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
> 
> [2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
> 
> [3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
> 
> [4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
> 
> [5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
> 
> [6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
> 
> # History
> 
> A table of changes to the document, in chronological order.
> 
> ------------------------------------------------------------------------
> Date       Revision Version  Notes
> ---------- -------- -------- -------------------------------------------
> 2016-12-14 1        Xen 4.9  Initial version.
> ---------- -------- -------- -------------------------------------------
> 
