
[Xen-devel] [RFC] netif: staging grants for requests



Hey,

Back in the Xen hackathon '16 networking session there were a couple of ideas
brought up. One of them was about exploring permanently mapped grants between
xen-netback/xen-netfront.

I started experimenting and came up with a design document (in pandoc)
describing what is being proposed. This is meant as a seed for discussion,
and also as a request for input on whether this is a good direction. Of
course, I am willing to try alternatives we come up with beyond the contents
of the spec, or any other suggested changes ;)

Any comments or feedback are welcome!

Cheers,
Joao

---
% Staging grants for network I/O requests
% Joao Martins <<joao.m.martins@xxxxxxxxxx>>
% Revision 1

\clearpage

--------------------------------------------------------------------
Status: **Experimental**

Architecture(s): x86 and ARM

Component(s): Guest

Hardware: Intel and AMD
--------------------------------------------------------------------

# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a permanently
mapped region to describe the header/linear region of packet buffers. This
document outlines that proposal, covering its motivation and its applicability
to other use-cases, alongside the necessary changes. This proposal is an RFC
and also includes alternative solutions.

The motivation of this work is to eliminate grant operations for packet I/O
intensive workloads, such as those observed with smaller request sizes (i.e.
<= 256 bytes or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K
packets) perform really well (up to 80 Gbit/s on a few CPUs), usually backing
end-hosts and server appliances. Anything that involves higher packet rates
(<= 1500 MTU) or runs without sg performs badly, at little more than 1 Gbit/s
of throughput.

# Proposal

The proposal is to leverage the already implicit copy from and to the packet
linear data on netfront and netback, doing it instead from a permanently mapped
region. On some (physical) NICs this is known as header/data split.

For some workloads (e.g. NFV) this would provide a big increase in throughput
when we switch to (zero)copying in the backend/frontend instead of issuing the
grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device teardown - without taking
up too many grant entries for the general case (i.e. covering only the header
region, <= 256 bytes, 16 grants per ring), while being configurable by the
kernel when one wants to resort to a fully copy-based interface as opposed to
grant copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with
the GPFNs, permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

     0     1     2     3     4     5     6     7 octet
    +------------+-----------+------------------------+
    | flags      | domain id | frame                  |
    +------------+-----------+------------------------+

Where there are N grant entries in a grant table, for example:

    @0:
    +------------+-----------+------------------------+
    | rw         | 0         | 0xABCDEF               |
    +------------+-----------+------------------------+
    | rw         | 0         | 0xFA124                |
    +------------+-----------+------------------------+
    | ro         | 1         | 0xBEEF                 |
    +------------+-----------+------------------------+

      .....
    @N:
    +------------+-----------+------------------------+
    | rw         | 0         | 0x9923A                |
    +------------+-----------+------------------------+

Each entry consumes 8 bytes, therefore 512 entries can fit in one page.
`gnttab_max_frames` defaults to 32 pages, hence 16,384 grants. The
ParaVirtualized (PV) drivers use the grant reference (the index in the grant
table, 0 .. N) in their command ring.
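
For reference, this is how the entry is declared in Xen's public headers
(`xen/include/public/grant_table.h`); the comments are abridged:

```
/* Grant table entry, version 1 (8 bytes). */
typedef struct grant_entry_v1 {
    uint16_t flags;   /* GTF_* permission/status flags.   */
    domid_t  domid;   /* Domain being granted access.     */
    uint32_t frame;   /* GPFN of the granted frame.       */
} grant_entry_v1_t;
```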

\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+-\
    | gref                   | offset    | flags      | |
    +------------+-----------+------------------------+ +-'struct
    | id         | size      | id        | status     | | netif_tx_sring_entry'
    +-------------------------------------------------+-/
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+

Each entry consumes 16 octets, therefore 256 entries can fit on one page.
`struct netif_tx_sring_entry` includes both `struct netif_tx_request` (first
12 octets) and `struct netif_tx_response` (last 4 octets). Additionally a
`struct netif_extra_info` may overlay the request, in which case the format is:

    +------------------------+------------------------+-\
    | type |flags| type specific data (gso, hash, etc)| |
    +------------+-----------+------------------------+ +-'struct
    | padding for tx         | unused                 | | netif_extra_info'
    +-------------------------------------------------+-/
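
For reference, the request/response declarations backing this layout, from
Xen's `xen/include/public/io/netif.h` (comments abridged):

```
struct netif_tx_request {
    grant_ref_t gref;      /* Reference to buffer page.   */
    uint16_t offset;       /* Offset within buffer page.  */
    uint16_t flags;        /* NETTXF_*                    */
    uint16_t id;           /* Echoed in response message. */
    uint16_t size;         /* Packet size in bytes.       */
};

struct netif_tx_response {
    uint16_t id;           /* Matches the request id.     */
    int16_t  status;       /* NETIF_RSP_*                 */
};
```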

In essence, the transmission of a packet from the frontend to the backend
network stack goes as follows:

**Frontend**

1) Calculate how many slots are needed for transmitting the packet.
   Fail if there aren't enough slots.

[ Calculation needs to estimate slots taking into account 4k page boundary ]

2) Make first request for the packet.
   The first request contains the whole packet size, checksum info,
   flag whether it contains extra metadata, and if following slots contain
   more data.

3) Put grant in the `gref` field of the tx slot.

4) Set extra info if packet requires special metadata (e.g. GSO size)

5) If there's still data to be granted set flag `NETTXF_more_data` in
request `flags`.

6) Grant remaining packet pages one per slot. (grant boundary is 4k)

7) Fill the resultant grefs in the slots, setting `NETTXF_more_data` for all
but the last one.

8) Fill the total packet size in the first request.

9) Set checksum info of the packet (if checksum offload is supported)

10) Update the request producer index (`req_prod`)

11) Check whether backend needs a notification

11.1) Perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

**Backend**

12) Backend gets an interrupt and runs its interrupt service routine.

13) Backend checks if there are unconsumed requests

14) Backend consumes a request from the ring

15) Process extra info (e.g. if GSO info was set)

16) Counts all requests for this packet to be processed (while
`NETTXF_more_data` is set) and performs a few validation tests:

16.1) Fail transmission if total packet size is smaller than the Ethernet
minimum allowed;

  Failing transmission means filling in the `id` of the request and a
  `status` of `NETIF_RSP_ERROR` in `struct netif_tx_response`, updating
  rsp_prod and finally notifying the frontend (through `EVTCHNOP_send`).

16.2) Fail transmission if one of the slots (size + offset) crosses the page
boundary

16.3) Fail transmission if the number of slots is bigger than the
spec-defined maximum (18 slots max in netif.h)

17) Allocate packet metadata

[ *Linux specific*: This structure encompasses a linear data region which
generally accommodates the protocol headers. Netback allocates up to 128
bytes for it. ]

18) *Linux specific*: Set up a `GNTTABOP_copy` to copy up to 128 bytes into
this small region (the linear part of the skb), *only* from the first slot.

19) Set up GNTTABOP operations to copy/map the packet

20) Perform the `GNTTABOP_copy` (grant copy) and/or `GNTTABOP_map_grant_ref`
    hypercalls.

[ *Linux-specific*: does a copy for the linear region (<=128 bytes) and maps the
         remaining slots as frags for the rest of the data ]

21) Check if the grant operations were successful and fail transmission if
any of the resultant operation `status` fields differs from `GNTST_okay`.

21.1) If it's a grant-copying backend, produce responses for all the copied
grants as in 16.1). The only difference is that `status` is
`NETIF_RSP_OKAY`.

21.2) Update the response producer index (`rsp_prod`)

22) Set up gso info requested by frontend [optional]

23) Set frontend provided checksum info

24) *Linux-specific*: Register destructor callback when packet pages are freed.

25) Call into the network stack.

26) Update `req_event` to `request consumer index + 1` to receive a notification
    on the first produced request from frontend.
    [optional, if backend is polling the ring and never sleeps]

27) *Linux-specific*: Packet destructor callback is called.

27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet pages.

27.2) Once done, perform the `GNTTABOP_unmap_grant_ref` hypercall. Underlying
this hypercall, a TLB flush of all backend vCPUs is done.

27.3) Produce Tx response like step 21.1) and 21.2)

[*Linux-specific*: netback contains a thread that is woken for this purpose,
and it batches these unmap operations. The callback just queues another unmap.]

27.4) Check whether frontend requested a notification

27.4.1) If so, perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

**Frontend**

28) Transmit interrupt is raised which signals the packet transmission
completion.

29) Transmit completion routine checks for unconsumed responses

30) Processes the responses and revokes the grants provided.

31) Updates `rsp_cons` (response consumer index)
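
The notification checks in steps 11) and 26) rely on the ring event indexes
(`req_event`/`rsp_event`). Below is a minimal sketch of the producer-side
decision, modelled on the `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` macro from
Xen's `ring.h` (barrier names abbreviated):

```
/* Sketch of the event-index check (cf. RING_PUSH_REQUESTS_AND_CHECK_NOTIFY
 * in xen/include/public/io/ring.h). The consumer stores in req_event the
 * index of the first request it wants to be notified about; the producer
 * sends an event only if it has just produced that request. */
static int push_requests_and_check_notify(struct netif_tx_sring *sring,
                                          RING_IDX new_prod)
{
    RING_IDX old_prod = sring->req_prod;

    wmb();                      /* requests visible before index update */
    sring->req_prod = new_prod;
    mb();                       /* index visible before reading req_event */

    /* True iff req_event falls in the window (old_prod, new_prod]. */
    return (RING_IDX)(new_prod - sring->req_event) <
           (RING_IDX)(new_prod - old_prod);
}
```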

This proposal aims at replacing steps 19) 20) 21) with a memcpy from a
permanently mapped region. Additionally the backend may choose to use these
permanently mapped pages for the rest of the packet, therefore replacing step
27) (the unmap) and preventing the TLB flush.

Note that a grant copy does the following (in pseudo code):

        rcu_lock(src_domain);
        rcu_lock(dst_domain);

        for (op = gntcopy[0]; op < nr_ops; op++) {
                /* Implies holding a potentially contended per-CPU lock
                   on the remote grant table. */
                src_frame = __acquire_grant_for_copy(src_domain,
                                                     <op.src.gref>);
                src_vaddr = map_domain_page(src_frame);

                dst_frame = __get_paged_frame(dst_domain, <op.dst.mfn>);
                dst_vaddr = map_domain_page(dst_frame);

                memcpy(dst_vaddr + <op.dst.offset>,
                       src_vaddr + <op.src.offset>,
                       <op.size>);

                unmap_domain_page(src_vaddr);
                unmap_domain_page(dst_vaddr);
        }

        rcu_unlock(src_domain);
        rcu_unlock(dst_domain);

Whereas we propose doing a memcpy from a premapped region that the frontend
would have set up. This region would also let us avoid the tricky case of the
frontend linear region spanning a 4K page boundary (which leads to one slot
carrying a tiny amount of data). The disadvantage is requiring an additional
copy in steps 4) and 6). But even with that, the benefits are still quite big,
as the results hint in section [Performance](#performance).
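
As an illustration, under this proposal the backend's copy of the packet head
could become a plain memcpy from the staging area. This is only a sketch:
`staging_vaddr()` and `tx_data_len` are hypothetical names for the premapped
data gref lookup and the negotiated staging size:

```
/* Hypothetical backend Tx path under this proposal: the first
 * tx_data_len bytes of the packet already sit in the permanently mapped
 * staging buffer tied to this slot id, so no GNTTABOP_copy (and no
 * remote grant table lock) is needed to fetch them. */
static void copy_linear_from_staging(struct xenvif_queue *queue,
                                     const struct netif_tx_request *txreq,
                                     uint8_t *linear /* skb linear area */)
{
    const uint8_t *src = staging_vaddr(queue, txreq->id); /* premapped */
    size_t len = txreq->size < queue->tx_data_len ?
                 txreq->size : queue->tx_data_len;

    memcpy(linear, src, len);   /* replaces steps 19)-21) for the head */
}
```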

\clearpage

## Guest Receive

The view of the shared receive ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+
    | id         | pad       | gref                   | ->'struct netif_rx_request'
    +------------+-----------+------------------------+
    | id         | offset    | flags     | status     | ->'struct netif_rx_response'
    +-------------------------------------------------+
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+


Each entry in the ring occupies 16 octets which means a page fits 256 entries.
Additionally a `struct netif_extra_info` may overlay the rx request in which
case the format is:

    +------------------------+------------------------+
    | type |flags| type specific data (gso, hash, etc)| ->'struct netif_extra_info'
    +------------+-----------+------------------------+

Notice the lack of padding: it is not needed on Rx, as the Rx request
boundary is 8 octets.
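
For reference, the corresponding declarations from `netif.h` (comments
abridged):

```
struct netif_rx_request {
    uint16_t    id;        /* Echoed in response message.          */
    uint16_t    pad;
    grant_ref_t gref;      /* Reference to incoming granted frame. */
};

struct netif_rx_response {
    uint16_t id;           /* Matches the request id.                 */
    uint16_t offset;       /* Offset in page of start of received data. */
    uint16_t flags;        /* NETRXF_*                                */
    int16_t  status;       /* -ve: NETIF_RSP_*; +ve: received length. */
};
```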

In essence, the steps for receiving a packet, from the backend to the
frontend network stack, are as follows:

**Backend**

1) Backend transmit function starts

[*Linux-specific*: This means we take a packet and add it to an internal queue
 (protected by a lock), whereas a separate thread takes packets from that
 queue and processes them as in the steps below. This thread has the purpose
 of aggregating as many copies as possible.]

2) Checks if there are enough rx ring slots to accommodate the packet.

3) Gets a request from the ring for the first data slot and fetches the `gref`
   from it.

4) Create grant copy op from packet page to `gref`.

[ It's up to the backend to choose how it fills this data. E.g. the backend
  may choose to merge as much data as possible from different pages into this
  single gref, similar to mergeable rx buffers in vhost. ]

5) Sets up flags/checksum info on first request.

6) Gets a response from the ring for this data slot.

7) Prefill expected response ring with the request `id` and slot size.

8) Update the request consumer index (`req_cons`)

9) Gets a request from the ring for the first extra info [optional]

10) Sets up extra info (e.g. GSO descriptor) [optional] repeat step 8).

11) Repeat steps 3 through 8 for all packet pages and set `NETRXF_more_data`
    in all slots but the last.

12) Perform the `GNTTABOP_copy` hypercall.

13) Check if any grant operation status was incorrect and, if so, set the
    `status` field of `struct netif_rx_response` to NETIF_RSP_ERROR.

14) Update the response producer index (`rsp_prod`)

**Frontend**

15) Frontend gets an interrupt and runs its interrupt service routine

16) Checks if there are unconsumed responses

17) Consumes a response from the ring (first response for a packet)

18) Revoke the `gref` in the response

19) Consumes extra info response [optional]

20) While the responses have `NETRXF_more_data` set, fetch each of the
    remaining N-1 responses and revoke the designated `gref`.

21) Update the response consumer index (`rsp_cons`)

22) *Linux-specific*: Copy (from first slot gref) up to 256 bytes to the linear
    region of the packet metadata structure (skb). The rest of the pages
    processed in the responses are then added as frags.

23) Set checksum info based on first response flags.

24) Pass the packet into the network stack.

25) Allocate new pages and any necessary packet metadata structures for new
    requests. These requests will then be used in step 1) and so forth.

26) Update the request producer index (`req_prod`)

27) Check whether backend needs notification:

27.1) If so, perform hypercall `EVTCHNOP_send` which might mean a __VMEXIT__
      depending on the guest type.

28) Update `rsp_event` to `response consumer index + 1` such that the frontend
    receives a notification on the first newly produced response.
    [optional, if frontend is polling the ring and never sleeps]

This proposal aims at replacing steps 4), 12) and 22) with a memcpy, allowing
the frontend to instead pull from a permanently mapped region. Furthermore it
allows fast recycling of unused grefs for replenishing the ring.

Depending on the implementation, on the receive path this would mean we no
longer need to aggregate as many grant ops as possible (step 1) and could
transmit the packet in the transmit function itself (e.g. Linux
```ndo_start_xmit```) as previously proposed
here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html)\].
This would heavily improve efficiency, specifically for smaller packets, which
in turn would decrease RTT, with data being acknowledged much quicker.
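
By analogy with the transmit side, a backend could then fill the premapped RX
staging buffer directly; as before, `staging_vaddr()` and `rx_data_len` are
hypothetical names used only for this sketch:

```
/* Hypothetical backend Rx path under this proposal: instead of a
 * GNTTABOP_copy into the frontend-granted page (step 12), copy up to
 * rx_data_len bytes straight into the premapped staging buffer for this
 * slot. The frontend pulls them out on its side (step 22). */
static void produce_rx_data(struct xenvif_queue *queue,
                            const struct netif_rx_request *rxreq,
                            const uint8_t *pkt, size_t len)
{
    uint8_t *dst = staging_vaddr(queue, rxreq->id);   /* premapped */
    size_t n = len < queue->rx_data_len ? len : queue->rx_data_len;

    memcpy(dst, pkt, n);        /* replaces the grant copy in step 4) */
}
```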

\clearpage

# Use Cases and Recent Trends in Networking

Recent trends in networking are aiming heavily at specialized network stacks
and packet I/O engines. Network backends and both partial and full kernel-bypass
packet I/O engines have been appearing from vendors (OpenOnload, RDMA, DPDK,
etc) and operating systems (FreeBSD netmap, PFRING_DNA). These bring novel
techniques to NFV and fast packet processing. Some of the techniques used in
these approaches include:

 * Pre-allocated TX/RX buffers/pages
 * Efficient device drivers
 * Memory mapped buffers
 * I/O batching
 * Poll-based methods

In this section we further discuss some of these use-cases and how this
extension can be useful in these scenarios.

While this document initially aimed (i.e. in the hackathon discussion) at
points [3.1](#guest-transmit) and [3.2](#guest-receive), it should be
discussed with extensibility in mind, as reasoned in the
[4.1](#netmap-on-freebsd), [4.2](#dataplane-development-kit-dpdk) and
[4.3](#express-data-plane-xdp) use cases. For consideration, I also have
working prototypes for [4.2](#dataplane-development-kit-dpdk) and
[4.3](#express-data-plane-xdp) showing really promising results. The whole
argument here is to improve the netif protocol to further advertise and allow
recycling grants on the backend _if_ hinted/requested by the frontend.

## Netmap on FreeBSD

Buffers used by netmap-mode drivers in FreeBSD are permanently allocated, by
default 2K bytes
\[[1](https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362)\],
and are statically allocated per netmap-mode device and freed on device
teardown. netmap is targeted at network intensive workloads (i.e. being able
to fill up 10/40 Gbit cards for at least packet sizes <= 1500 bytes), so being
able to avoid the grant ops enables a FreeBSD network backend to take full
advantage of netmap, which has been in tree for the last couple of years. For
a netmap xen-netback, one would map the frontend pages and use the indirect
buffers feature on the backend for a vif on top of the netmap-based switch
(VALE). VALE is a software switch (or an engine) based on netmap
\[[2](https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1)\]. It is
really fast, but somewhat limited in features when compared to openvswitch or
the Linux bridge. It has a nice concept of modularity, decoupling the
switching logic (lookup and switch state) from the forwarding engine (port
locking and the engine algorithm for traversing packets between different
ports).

## DataPlane Development Kit (DPDK)

DPDK uses statically allocated buffers like netmap but, more importantly, it
is a complete kernel-bypass framework and its general operation consists of
busy polling the NIC ring descriptors *without* any interrupts or system
calls. Everything is done in userspace, thus having to make a hypercall to
copy packet buffers heavily deteriorates throughput in the rest of that
pipeline (that is, the rest of the NICs in a given poll-mode thread). Adding
support for this permanently mapped region would avoid the need for making
any system calls and just use data buffers granted by the guest, with the
added benefit of a really fast memcpy without the costs associated with FPU
save/restore (in the backend kernel or hypervisor). Only the notification
would be the last thing in the way, which can be mitigated by having each
side (i.e. each DPDK driver) not set the respective event indexes (rsp_event
and req_event, as mentioned previously).

## eXpress Data Plane (XDP)

Starting with 4.8, Linux brings an experimental feature
\[[3](https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf)\]
to receive and/or forward packet buffers at early stages of the driver RX
path. Packets are filtered/forwarded by loading an eBPF (Extended Berkeley
Packet Filter) program which the driver runs before any skbs are even
allocated. Some of the use-cases include an ILA router or a load-balancer,
and future applications could be a programmable switch.

Drivers supporting XDP (currently) require
\[[4](http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data)\]
setting up a bidirectional DMA region (or in our case a granted region with
R/W permissions) and reusing packet pages after the eBPF program is done with
them. No unbounded memory allocations are made (such as queueing with new
pages); instead the packet pages are recycled after we're done in BPF.

This creates an opportunity for a backend to avoid mapping and unmapping
grants. The BPF program could also modify the packet and request it to be
forwarded on the same port, reusing the same RX grant on TX. We should be
able to create a grant region with R/W permissions for the backend, which
would be the equivalent of [4.1](#netmap-on-freebsd). Although packets
between guests would still involve a copy.

## Existing page recycle approaches in NICs

Linux (and probably other OSes) follows similar approaches in network drivers
(e.g. ixgbe and mellanox) to address the bottlenecks of the page allocator
and DMA APIs. For example:

1) ixgbe (and probably other Intel cards) will memcpy up to 256 bytes into
the newly allocated packet buffer and recycle half of the page for refilling
other available descriptors
\[[5](http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073)\];

2) other drivers (such as mellanox) choose to allocate pages of order > 1 and
divide them among multiple NIC descriptors
\[[6](http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52)\];

\clearpage

# Wire Performance

This section is a reference for wire-level numbers that are worth keeping in
mind.

The minimum size of a single Ethernet packet (excluding CRC) is calculated as:

  Packet = Ethernet Header (14) + Protocol Data Unit (46 - 1500) = 60 bytes

On the wire it's a bit more:

  Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap (12) = 84 bytes

For a given link speed in bits/sec and packet size, the real packet rate is
calculated as:

  Rate = Link-speed / ((Preamble + SFD + Packet + CRC + Interframe gap) * 8)
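
As a worked example, a 64-byte packet (CRC included) on a 10 Gbit/s link:

  Rate = 10^10 / ((7 + 1 + 64 + 12) * 8) = 10^10 / 672 ~ 14.88 Mpps

which matches the first row of the table below.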

Numbers to keep in mind (packet size excludes PHY layer, though packet rates
disclosed by vendors take those into account, since it's what goes on the
wire):

| Packet + CRC (bytes)   | 10 Gbit/s  |  40 Gbit/s |  100 Gbit/s  |
|------------------------|:----------:|:----------:|:------------:|
| 64                     | 14.88  Mpps|  59.52 Mpps|  148.80 Mpps |
| 128                    |  8.44  Mpps|  33.78 Mpps|   84.46 Mpps |
| 256                    |  4.52  Mpps|  18.11 Mpps|   45.29 Mpps |
| 1500                   |   822  Kpps|   3.28 Mpps|    8.22 Mpps |
| 65535                  |   ~19  Kpps|  76.27 Kpps|  190.68 Kpps |

Caption:  Mpps (Million packets per second) ; Kpps (Kilo packets per second)

\clearpage

# Proposed Extension

## Terminology

`data gref` refers to the reusable/staging grants, whereas `gref` refers to
the ones that are granted/revoked for each packet being sent. `command ring`
refers to the current ring, whereas `data list` refers to the one proposed
below.

## Xenstore

"feature-staging-grants" is a capability to allow the use of a set of
recycled buffers permanently mapped at the device setup. If
advertised, the frontend will grant a region equivalent to ```maximum number
of slots per ring * size of buffer``` where:

 * `maximum number of slots per ring` is the number of request structure
   that can be fit within the command ring. By default a request
   structure consumes 16 bytes. The first 64 bytes of a ring are used by
   producer and consumer indicies.
 * `size of buffer` is the maximum portion of the packet the backend (and
   frontend) have negotiated which will be put for each slot of the
   `data ring`.

This means that for 256 possible packets in the ring with a chosen size of
256 bytes, the amount of memory to be permanently granted is a 64KB region,
i.e. 16 grefs.

The lack of "feature-staging-grants" or a value of 0 means that it's not
supported. If feature is present then a second entry "staging-grants-sizes"
must exist and it contains the sizes that can be used per slot. To avoid
frontend clobbering with various different values, we limit to a set of fixed
ones semi-colon delimited.

The allowed values are implementation specific. Examples of good values
include: 256 (protocol/header region), 2048 (fits 1 MSS, which is commonly in
the linear part of Linux packets), 4096 (a grant per packet if feature-sg=0,
for DPDK/XDP/netmap buffers) and 65535 (maximum packet size, i.e.
```NETIF_NR_SLOTS_MIN * XEN_PAGE_SIZE``` for feature-sg=1).

The individual size covered per entry by the frontend is set through
```tx-data-len``` or ```rx-data-len```, which contain the maximum amount of
data for a single packet. The chosen values of "tx-data-len" and
"rx-data-len" are asymmetrical (hence can be different between TX and RX) and
are validated against this list of valid sizes. For their use see the
[datapath](#datapath-changes) section further below.

        /local/domain/1/device/vif/0/feature-staging-grants = "1"
        /local/domain/1/device/vif/0/staging-grants-sizes = "128;256;2048;4096"
        /local/domain/1/device/vif/0/queue-0/tx-data-ref0 = "<data-ref-tx0>"
        /local/domain/1/device/vif/0/queue-0/tx-data-len = "128"
        /local/domain/1/device/vif/0/queue-0/rx-data-ref0 = "<data-ref-rx0>"
        /local/domain/1/device/vif/0/queue-0/rx-data-len = "256"

The ```tx-data-ref%u``` and ```rx-data-ref%u``` entries describe the list of
grants to be used for each ring. The example above shows a list composed of a
single page, whereas multiple pages would look like:

        tx-data-ref0 = <data-ref-tx0>
        tx-data-ref1 = <data-ref-tx1>
        rx-data-ref0 = <data-ref-rx0>
        rx-data-ref1 = <data-ref-rx1>

Each slot in the `data-ref` list is formatted as following:

            0     1     2     3     4     5     6     7  octet
         +-----+-----+-----+-----+-----+-----+-----+-----+
         | id        | gref                  | offset    |
         +-----+-----+-----+-----+-----+-----+-----+-----+

        id: the frame identifier.
        gref: grant reference to the frame.
        offset: offset within the frame.
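
A possible C encoding of one slot, assuming a packed 8-octet layout (the
struct name is illustrative, not an existing declaration):

```
/* Illustrative data-ref list entry (8 octets, matching the diagram);
 * not part of the current netif.h. */
struct netif_staging_slot {
    uint16_t    id;        /* frame identifier             */
    grant_ref_t gref;      /* grant reference (4 octets)   */
    uint16_t    offset;    /* offset within the frame      */
} __attribute__((__packed__));
```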

This list has twice as many slots as `tx-ring-ref` or `rx-ring-ref`
respectively would have, and it is set up at device creation and removed at
device teardown, same as the command ring entries. This way we keep up with
ring size changes, as is expected in the command ring. A hypothetical
multi-page command ring would increase the number of slots, and thus this
data list would grow accordingly. The list is terminated by an entry whose
```gref``` field is 0; the other fields of that entry are ignored.

## Datapath Changes

The packet is copied to/from the mapped data list grefs up to `tx-data-len`
or `rx-data-len`. This means that the buffer (referenced by `gref` from within
the `struct netif_[tx|rx]_request`) has the same data up to `size`. In other
words, the contents of gref[0..size) are replicated in the data list at the
slot identified by `id`. Hence netback should ignore up to `size` of the
`gref` when processing, as the data list already has its contents.

Values bigger than the 4096 page/grant boundary only have special meaning for
the backend, namely how much must be copied/pulled across the whole packet
(which can be composed of multiple slots). Hence (e.g.) a value of 4096 vs
65536: both have the same data list size, but the former leads to
copying/pulling only one gref in the whole packet, whereas the latter makes
it a copy-only interface for all slots.
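
For illustration, the frontend side of this contract could be as sketched
below, with `staging_vaddr()` and `tx_data_len` again hypothetical names:

```
/* Hypothetical frontend Tx under this proposal: replicate the first
 * tx_data_len bytes of the packet into the staging buffer for this slot
 * before producing the request. The backend then reads that span from
 * the staging copy and ignores the same span of `gref`. */
static void stage_tx_packet(struct netfront_queue *queue,
                            struct netif_tx_request *txreq,
                            const uint8_t *data, size_t len)
{
    size_t n = len < queue->tx_data_len ? len : queue->tx_data_len;

    memcpy(staging_vaddr(queue, txreq->id), data, n);
    txreq->size = len;          /* full packet size, as before */
}
```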

## Buffer Identification and Flags

The data list ids must start from 0 and are global and contiguous (across
both lists). A data slot is identified by the ring slot ```id``` field. The
resultant data gref id to be used in the RX data list is computed by
subtracting the total number of TX data grefs from the `struct
netif_[tx|rx]_request` ```id```. An example of the lists' layout is below:

```
 [tx-data-ref-0, tx-data-len=256]
 { .id = 0, gref = 512, .offset = 0x0   }
 { .id = 1, gref = 512, .offset = 0x100 }
 { .id = 2, gref = 512, .offset = 0x200 }
 ...
 { .id = 256, gref = 0, .offset = 0x0   }

 [rx-data-ref-0, rx-data-len=4096]
 { .id = 256, gref = 529, .offset = 0x0 }
 { .id = 257, gref = 530, .offset = 0x0 }
 { .id = 258, gref = 531, .offset = 0x0 }
 ...
 { .id = 512, gref = 0,   .offset = 0x0 }
```

Permissions of RX data grefs are read-write whereas TX data grefs are
read-only.
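
A small sketch of the id-to-slot mapping just described (helper name
illustrative):

```
/* Ids are global and contiguous across both lists, so an RX slot's
 * index into the RX data list is its ring id minus the total number of
 * TX data grefs (256 in the example above: id 257 -> index 1). */
static inline unsigned int rx_data_index(uint16_t ring_id,
                                         unsigned int nr_tx_data_grefs)
{
    return ring_id - nr_tx_data_grefs;
}
```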

## Zerocopy

The frontend may wish to provide a bigger RX list than TX, and use RX buffers
for transmission in a zerocopy fashion for guests mainly doing forwarding. In
such cases the frontend sets the NETTXF_staging_buffer flag in the
```netif_tx_request``` flags field, such that the `gref` field instead
designates the `id` of a data gref.

This is only valid when packets are solely described by the staging grants
for the slot packet size being written, or when [tx|rx]-data-len is 4096 (for
feature-sg 0) or 65535 (for feature-sg 1), and thus no new `gref` is needed
for describing the packet payload.

\clearpage

## Performance

These numbers give a rough idea of the performance benefits of this
extension. They are Guest <-> Dom0 tests which exercise the communication
between backend and frontend, excluding other bottlenecks in the datapath
(the software switch).

```
# grant copy
Guest TX (1vcpu,  64b, UDP in pps):  1 506 170 pps
Guest TX (4vcpu,  64b, UDP in pps):  4 988 563 pps
Guest TX (1vcpu, 256b, UDP in pps):  1 295 001 pps
Guest TX (4vcpu, 256b, UDP in pps):  4 249 211 pps

# grant copy + grant map (see next subsection)
Guest TX (1vcpu, 260b, UDP in pps):    577 782 pps
Guest TX (4vcpu, 260b, UDP in pps):  1 218 273 pps

# drop at the guest network stack
Guest RX (1vcpu,  64b, UDP in pps):  1 549 630 pps
Guest RX (4vcpu,  64b, UDP in pps):  2 870 947 pps
```

With this extension:
```
# memcpy
data-len=256 TX (1vcpu,  64b, UDP in pps):  3 759 012 pps
data-len=256 TX (4vcpu,  64b, UDP in pps): 12 416 436 pps
data-len=256 TX (1vcpu, 256b, UDP in pps):  3 248 392 pps
data-len=256 TX (4vcpu, 256b, UDP in pps): 11 165 355 pps

# memcpy + grant map (see next subsection)
data-len=256 TX (1vcpu, 260b, UDP in pps):    588 428 pps
data-len=256 TX (4vcpu, 260b, UDP in pps):  1 668 044 pps

# (drop at the guest network stack)
data-len=256 RX (1vcpu,  64b, UDP in pps):  3 285 362 pps
data-len=256 RX (4vcpu,  64b, UDP in pps): 11 761 847 pps

# (drop with guest XDP_DROP prog)
data-len=256 RX (1vcpu,  64b, UDP in pps):  9 466 591 pps
data-len=256 RX (4vcpu,  64b, UDP in pps): 33 006 157 pps
```

Latency measurements (netperf TCP_RR request size 1 and response size 1):
```
24 KTps vs 28 KTps
39 KTps vs 50 KTps (with kernel busy poll)
```

TCP bulk transfer measurements aren't showing a representative increase in
maximum throughput (sometimes ~10%), but rather fewer retransmissions and
more stability. This is probably because of a slight decrease in RTT (i.e.
the receiver acknowledging data quicker). I am currently exploring other data
list sizes and will probably have a better idea of the effects of this.

## Linux grant copy vs map remark

Based on the numbers above, there's a sudden 2x performance drop when we
switch from grant copying to also grant mapping the `gref`: 1 295 001 vs
577 782 pps for 256 and 260 byte packets respectively. This is all the more
visible when replacing the grant copy with a memcpy in this extension
(3 248 392 vs 588 428). While there have been discussions about avoiding the
TLB flush on unmap, one could wonder what the threshold of that improvement
would be. Chances are that this is the least of our concerns in a fully
populated host (or an oversubscribed one). Would it be worth experimenting
with increasing the threshold of the copy beyond the header?

\clearpage

# References

[0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html

[1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362

[2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1

[3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

[4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data

[5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073

[6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52

# History

A table of changes to the document, in chronological order.

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2016-12-14 1        Xen 4.9  Initial version.
---------- -------- -------- -------------------------------------------
