
[Xen-devel] [PATCH v5 2/2] docs/misc: add netif staging grants design document



Add a document outlining how the guest can map a set of grants
on the backend through the control ring.

Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
New in v5
---
 docs/misc/netif-staging-grants.pandoc | 587 ++++++++++++++++++++++++++++++++++
 1 file changed, 587 insertions(+)
 create mode 100644 docs/misc/netif-staging-grants.pandoc

diff --git a/docs/misc/netif-staging-grants.pandoc b/docs/misc/netif-staging-grants.pandoc
new file mode 100644
index 0000000000..b26a6e0915
--- /dev/null
+++ b/docs/misc/netif-staging-grants.pandoc
@@ -0,0 +1,587 @@
+% Staging grants for network I/O requests
+% Revision 4
+
+\clearpage
+
+--------------------------------------------------------------------
+Architecture(s): Any
+--------------------------------------------------------------------
+
+# Background and Motivation
+
+At the Xen hackathon '16 networking session, we spoke about having a permanently
+mapped region to describe the header/linear region of packet buffers. This
+document outlines that proposal, covering its motivation and applicability to
+other use cases, alongside the necessary changes.
+
+The motivation of this work is to eliminate grant ops for packet I/O intensive
+workloads, such as those observed with smaller request sizes (i.e. <= 256
+bytes or <= MTU). Currently on Xen, bulk transfers (e.g. 32K..64K packets) are
+the only ones performing really well (up to 80 Gbit/s on a few CPUs), usually
+backing end-hosts and server appliances. Anything that involves higher packet
+rates (<= 1500 MTU) or runs without scatter-gather performs badly, at close to
+1 Gbit/s of throughput.
+
+# Proposal
+
+The proposal is to leverage the already-implicit copy from and to the packet
+linear data in netfront and netback, performing it instead from a permanently
+mapped region. In some (physical) NICs this is known as header/data split.
+
+Specifically, for some workloads (e.g. NFV) it would provide a big increase in
+throughput when we switch to (zero)copying in the backend/frontend instead of
+using the grant hypercalls. Thus this extension aims at future-proofing the
+netif protocol by adding the possibility of guests setting up a list of grants
+at device creation, revoked at device teardown - without consuming too many
+grant entries in the general case (i.e. covering only the header region,
+<= 256 bytes, 16 grants per ring), while remaining configurable by the kernel
+when one wants to resort to a fully copy-based approach as opposed to grant
+copy/map.
+
+\clearpage
+
+# General Operation
+
+Here we describe how netback and netfront generally operate, and where the
+proposed solution will fit. The security mechanism currently involves grant
+references, which in essence are round-robin recycled 'tickets' stamped with
+the GPFNs, permission attributes, and the authorized domain:
+
+(This is an in-memory view of struct grant_entry_v1):
+
+     0     1     2     3     4     5     6     7 octet
+    +------------+-----------+------------------------+
+    | flags      | domain id | frame                  |
+    +------------+-----------+------------------------+
+
+Where there are N grant entries in a grant table, for example:
+
+    @0:
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0xABCDEF               |
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0xFA124                |
+    +------------+-----------+------------------------+
+    | ro         | 1         | 0xBEEF                 |
+    +------------+-----------+------------------------+
+
+      .....
+    @N:
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0x9923A                |
+    +------------+-----------+------------------------+
+
+Each entry consumes 8 bytes, therefore 512 entries can fit on one page.
+`gnttab_max_frames` defaults to 32 pages, hence 16,384 grants in total. The
+ParaVirtualized (PV) drivers use the grant reference (the index into the
+grant table, 0 .. N) in their command ring.
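+
+For reference, the in-memory view above corresponds to the following
+declaration from Xen's public `grant_table.h`:
+
+    struct grant_entry_v1 {
+        uint16_t flags;  /* GTF_* - type of grant and access flags */
+        domid_t  domid;  /* domain being granted access            */
+        uint32_t frame;  /* GPFN of the page being granted         */
+    };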
+
+\clearpage
+
+## Guest Transmit
+
+The view of the shared transmit ring is the following:
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | req_prod               | req_event              |
+    +------------------------+------------------------+
+    | rsp_prod               | rsp_event              |
+    +------------------------+------------------------+
+    | pvt                    | pad[44]                |
+    +------------------------+                        |
+    | ....                                            | [64bytes]
+    +------------------------+------------------------+-\
+    | gref                   | offset    | flags      | |
+    +------------+-----------+------------------------+ +-'struct
+    | id         | size      | id        | status     | | netif_tx_sring_entry'
+    +-------------------------------------------------+-/
+    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
+    +-------------------------------------------------+
+
+Each entry consumes 16 octets, therefore 256 entries can fit on one page.
+`struct netif_tx_sring_entry` includes both `struct netif_tx_request` (first
+12 octets) and `struct netif_tx_response` (last 4 octets). Additionally a
+`struct netif_extra_info` may overlay the request, in which case the format is:
+
+    +------------------------+------------------------+-\
+    | type |flags| type specific data (gso, hash, etc)| |
+    +------------+-----------+------------------------+ +-'struct
+    | padding for tx         | unused                 | | netif_extra_info'
+    +-------------------------------------------------+-/
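+
+In C terms, each slot combines the following structures declared in `netif.h`:
+
+    struct netif_tx_request {
+        grant_ref_t gref;    /* Reference to buffer page            */
+        uint16_t offset;     /* Offset within buffer page           */
+        uint16_t flags;      /* NETTXF_*                            */
+        uint16_t id;         /* Echoed in response message          */
+        uint16_t size;       /* For the first request in a packet,
+                                the total packet size in bytes      */
+    };
+
+    struct netif_tx_response {
+        uint16_t id;
+        int16_t  status;     /* NETIF_RSP_*                         */
+    };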
+
+In essence, the transmission of a packet from the frontend to the backend
+network stack goes as follows (a code sketch of the frontend steps is given
+right after them):
+
+**Frontend**
+
+1) Calculate how many slots are needed for transmitting the packet.
+   Fail if there aren't enough slots.
+
+[ The calculation needs to estimate slots taking the 4k page boundary into
+account ]
+
+2) Make the first request for the packet.
+   The first request contains the whole packet size, checksum info, a flag
+   for whether it contains extra metadata, and whether the following slots
+   contain more data.
+
+3) Put the grant in the `gref` field of the tx slot.
+
+4) Set extra info if packet requires special metadata (e.g. GSO size)
+
+5) If there's still data to be granted, set the flag `NETTXF_more_data` in
+the request `flags`.
+
+6) Grant the remaining packet pages, one per slot (the grant boundary is 4k).
+
+7) Fill the resulting grefs in the slots, setting `NETTXF_more_data` on all
+but the last slot.
+
+8) Fill the total packet size in the first request.
+
+9) Set checksum info of the packet (if checksum offload is supported)
+
+10) Update the request producer index (`req_prod`)
+
+11) Check whether backend needs a notification
+
+11.1) Perform the hypercall `EVTCHNOP_send`, which might mean a __VMEXIT__
+      depending on the guest type.
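+
+A minimal sketch of the frontend side of these steps, using the standard ring
+macros from Xen's `ring.h`; `queue`, `get_gref_for_page()` and `notify()` are
+illustrative placeholders rather than existing APIs:
+
+    struct netif_tx_request *req;
+    RING_IDX i = queue->tx.req_prod_pvt;
+    int notify_needed;
+
+    /* Steps 2) - 9): one request per slot; the first request carries
+       the whole packet size and the checksum/more_data flags. */
+    req = RING_GET_REQUEST(&queue->tx, i++);
+    req->gref   = get_gref_for_page(first_page);   /* placeholder */
+    req->offset = first_offset;
+    req->size   = total_packet_size;
+    req->flags  = NETTXF_csum_blank | NETTXF_more_data;
+    /* ... further slots for the remaining pages, with
+       NETTXF_more_data cleared on the last one ... */
+
+    /* Steps 10) - 11.1): publish requests and notify if needed. */
+    queue->tx.req_prod_pvt = i;
+    RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&queue->tx, notify_needed);
+    if (notify_needed)
+        notify(queue->tx_evtchn);                  /* EVTCHNOP_send */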
+
+**Backend**
+
+12) Backend gets an interrupt and runs its interrupt service routine.
+
+13) Backend checks if there are unconsumed requests
+
+14) Backend consumes a request from the ring
+
+15) Process extra info (e.g. if GSO info was set)
+
+16) Counts all requests for this packet to be processed (while
+`NETTXF_more_data` is set) and performs a few validation tests:
+
+16.1) Fail transmission if the total packet size is smaller than the Ethernet
+minimum allowed;
+
+  Failing transmission means filling the `id` field of `struct
+  netif_tx_response` from the request and setting `status` to
+  `NETIF_RSP_ERROR`; updating `rsp_prod`; and finally notifying the frontend
+  (through `EVTCHNOP_send`).
+
+16.2) Fail transmission if one of the slots (size + offset) crosses the page
+boundary
+
+16.3) Fail transmission if the number of slots is bigger than the spec allows
+(18 slots max in netif.h)
+
+17) Allocate packet metadata
+
+[ *Linux specific*: This structure encompasses a linear data region which
+generally accommodates the protocol header and such. Netback allocates up to
+128 bytes for that. ]
+
+18) *Linux specific*: Set up a `GNTTABOP_copy` to copy up to 128 bytes to
+this small region (linear part of the skb) *only* from the first slot.
+
+19) Set up GNTTABOP operations to copy/map the packet
+
+20) Perform the `GNTTABOP_copy` (grant copy) and/or `GNTTABOP_map_grant_ref`
+    hypercalls.
+
+[ *Linux-specific*: does a copy for the linear region (<=128 bytes) and maps
+the remaining slots as frags for the rest of the data ]
+
+21) Check if the grant operations were successful and fail transmission if
+any of the resulting operations' `status` differs from `GNTST_okay`.
+
+21.1) If it's a grant-copying backend, produce responses for all the copied
+grants as in 16.1), the only difference being that `status` is
+`NETIF_RSP_OKAY` (a sketch of response production follows the backend steps).
+
+21.2) Update the response producer index (`rsp_prod`)
+
+22) Set up gso info requested by frontend [optional]
+
+23) Set frontend provided checksum info
+
+24) *Linux-specific*: Register a destructor callback for when packet pages
+are freed.
+
+25) Call into the network stack.
+
+26) Update `req_event` to `request consumer index + 1` to receive a
+    notification on the first produced request from the frontend.
+    [optional; unnecessary if the backend is polling the ring and never
+    sleeps]
+
+27) *Linux-specific*: Packet destructor callback is called.
+
+27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet pages.
+
+27.2) Once done, perform the `GNTTABOP_unmap_grant_ref` hypercall. Underlying
+this hypercall, a TLB flush of all backend vCPUs is done.
+
+27.3) Produce a Tx response as in steps 21.1) and 21.2)
+
+[*Linux-specific*: netback has a thread that is woken for this purpose, and
+it batches these unmap operations. The callback just queues another unmap.]
+
+27.4) Check whether frontend requested a notification
+
+27.4.1) If so, perform the hypercall `EVTCHNOP_send`, which might mean a
+      __VMEXIT__ depending on the guest type.
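+
+A sketch of how the backend produces a Tx response in steps 21.1), 21.2) and
+27.3), loosely modeled on (and simplified from) Linux netback's
+`make_tx_response()`; the `queue` type is illustrative:
+
+    static void make_tx_response(struct queue *queue,
+                                 struct netif_tx_request *txp,
+                                 int8_t status)
+    {
+        RING_IDX i = queue->tx.rsp_prod_pvt;
+        struct netif_tx_response *resp;
+
+        resp = RING_GET_RESPONSE(&queue->tx, i);
+        resp->id     = txp->id;   /* echo the request id               */
+        resp->status = status;    /* NETIF_RSP_OKAY or NETIF_RSP_ERROR */
+
+        queue->tx.rsp_prod_pvt = ++i;
+    }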
+
+**Frontend**
+
+28) A transmit interrupt is raised, which signals the packet transmission
+    completion.
+
+29) Transmit completion routine checks for unconsumed responses
+
+30) Processes the responses and revokes the grants provided.
+
+31) Updates `rsp_cons` (response consumer index)
+
+This proposal aims at removing steps 19), 20) and 21) by using grefs
+previously mapped at the guest's request. The guest decides how to distribute
+or use these premapped grefs, for either the linear region or the full
+packet. This also allows us to replace step 27) (the unmap), preventing the
+TLB flush.
+
+Note that a grant copy does the following (in pseudo code):
+
+       rcu_lock(src_domain);
+       rcu_lock(dst_domain);
+
+       for (i = 0; i < nr_ops; i++) {
+               op = &gntcopy[i];
+
+               /* Acquiring the grant implies holding a potentially
+                  contended per-CPU lock on the remote grant table. */
+               src_frame = __acquire_grant_for_copy(src_domain, <op.src.gref>);
+               src_vaddr = map_domain_page(src_frame);
+
+               dst_frame = __get_paged_frame(dst_domain, <op.dst.mfn>);
+               dst_vaddr = map_domain_page(dst_frame);
+
+               memcpy(dst_vaddr + <op.dst.offset>,
+                      src_vaddr + <op.src.offset>,
+                      <op.size>);
+
+               unmap_domain_page(src_vaddr);
+               unmap_domain_page(dst_vaddr);
+       }
+
+       rcu_unlock(src_domain);
+       rcu_unlock(dst_domain);
+
+The Linux netback implementation copies the first 128 bytes into the linear
+region of its network buffer. Hence, for this first region, the grant copy
+would be replaced by a memcpy on the backend.
+
+\clearpage
+
+## Guest Receive
+
+The view of the shared receive ring is the following:
+
+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | req_prod               | req_event              |
+    +------------------------+------------------------+
+    | rsp_prod               | rsp_event              |
+    +------------------------+------------------------+
+    | pvt                    | pad[44]                |
+    +------------------------+                        |
+    | ....                                            | [64bytes]
+    +------------------------+------------------------+
+    | id         | pad       | gref                   | ->'struct netif_rx_request'
+    +------------+-----------+------------------------+
+    | id         | offset    | flags     | status     | ->'struct netif_rx_response'
+    +-------------------------------------------------+
+    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
+    +-------------------------------------------------+
+
+
+Each entry in the ring occupies 16 octets which means a page fits 256 entries.
+Additionally a `struct netif_extra_info` may overlay the rx request in which
+case the format is:
+
+    +------------------------+------------------------+
+    | type |flags| type specific data (gso, hash, etc)| ->'struct netif_extra_info'
+    +------------+-----------+------------------------+
+
+Notice the lack of padding; it is not needed on Rx because the Rx request
+boundary is 8 octets.
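+
+The corresponding declarations from `netif.h` are:
+
+    struct netif_rx_request {
+        uint16_t    id;      /* Echoed in response message          */
+        uint16_t    pad;
+        grant_ref_t gref;    /* Reference to incoming granted frame */
+    };
+
+    struct netif_rx_response {
+        uint16_t id;
+        uint16_t offset;     /* Offset in page of start of data     */
+        uint16_t flags;      /* NETRXF_*                            */
+        int16_t  status;     /* -ve: NETIF_RSP_*; +ve: Rx'ed size   */
+    };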
+
+In essence, the steps for receiving a packet from the backend into the
+frontend network stack are as follows:
+
+**Backend**
+
+1) Backend transmit function starts
+
+[*Linux-specific*: This means we take a packet and add it to an internal
+ queue (protected by a lock), while a separate thread takes it from that
+ queue and does the actual processing as per the steps below. This thread
+ has the purpose of aggregating as many copies as possible.]
+
+2) Checks if there are enough rx ring slots to accommodate the packet.
+
+3) Gets a request from the ring for the first data slot and fetches the `gref`
+   from it.
+
+4) Create a grant copy op from the packet page to the `gref` (see the sketch
+   after the backend steps).
+
+[ It's up to the backend to choose how it fills this data. E.g. the backend
+  may choose to merge as much data as possible from different pages into this
+  single gref, similar to mergeable rx buffers in vhost. ]
+
+5) Sets up flags/checksum info on first request.
+
+6) Gets a response from the ring for this data slot.
+
+7) Prefill the expected response in the ring with the request `id` and the
+   slot size.
+
+8) Update the request consumer index (`req_cons`)
+
+9) Gets a request from the ring for the first extra info [optional]
+
+10) Sets up extra info (e.g. GSO descriptor) [optional] repeat step 8).
+
+11) Repeat steps 3 through 8 for all packet pages, setting `NETRXF_more_data`
+   on all but the last slot.
+
+12) Perform the `GNTTABOP_copy` hypercall.
+
+13) Check if any grant operation status was incorrect, and if so set the
+    `status` field of `struct netif_rx_response` to `NETIF_RSP_ERROR`.
+
+14) Update the response producer index (`rsp_prod`)
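+
+A sketch of the grant copy in steps 4), 12) and 13); `struct gnttab_copy`,
+`GNTTABOP_copy` and `GNTCOPY_dest_gref` are the real grant-table interface,
+while the surrounding variables are illustrative:
+
+    struct gnttab_copy op = {
+        .source.u.gmfn = backend_frame,        /* backend packet page   */
+        .source.domid  = DOMID_SELF,
+        .source.offset = src_offset,
+        .dest.u.ref    = rx_req_gref,          /* gref from step 3)     */
+        .dest.domid    = frontend_domid,
+        .dest.offset   = dst_offset,
+        .len           = chunk_len,
+        .flags         = GNTCOPY_dest_gref,    /* destination is a gref */
+    };
+
+    /* Step 12): in practice ops are batched rather than issued singly. */
+    HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
+    if (op.status != GNTST_okay)
+        rx_resp->status = NETIF_RSP_ERROR;     /* step 13) */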
+
+**Frontend**
+
+15) Frontend gets an interrupt and runs its interrupt service routine
+
+16) Checks if there are unconsumed responses
+
+17) Consumes a response from the ring (first response for a packet)
+
+18) Revoke the `gref` in the response
+
+19) Consumes extra info response [optional]
+
+20) While the responses have `NETRXF_more_data` set, fetch each of the
+    responses and revoke the designated `gref`.
+
+21) Update the response consumer index (`rsp_cons`)
+
+22) *Linux-specific*: Copy (from first slot gref) up to 256 bytes to the linear
+    region of the packet metadata structure (skb). The rest of the pages
+    processed in the responses are then added as frags.
+
+23) Set checksum info based on first response flags.
+
+24) Pass the packet into the network stack.
+
+25) Allocate new pages and any necessary packet metadata structures for new
+    requests. These requests will then be used in step 1) and so forth.
+
+26) Update the request producer index (`req_prod`)
+
+27) Check whether backend needs notification:
+
+27.1) If so, perform the hypercall `EVTCHNOP_send`, which might mean a
+      __VMEXIT__ depending on the guest type.
+
+28) Update `rsp_event` to `response consumer index + 1` such that the
+    frontend receives a notification on the first newly produced response.
+    [optional; unnecessary if the frontend is polling the ring and never
+    sleeps]
+
+This proposal aims at replacing steps 4), 12) and 22) with memcpy, if the
+grefs on the Rx ring were requested to be mapped by the guest. The frontend
+may use strategies to allow fast recycling of grants for replenishing the
+ring, hence letting Domain-0 replace the grant copies with a memcpy instead,
+which is faster.
+
+Depending on the implementation, it would mean that we would no longer need
+to aggregate as many grant ops as possible (step 1) and could transmit the
+packet in the transmit function (e.g. Linux ```ndo_start_xmit```) as
+previously proposed
+here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html)\].
+This would heavily improve efficiency, specifically for smaller packets,
+which in turn would decrease RTT, with data being acknowledged much quicker.
+
+\clearpage
+
+# Proposed Extension
+
+The idea is to give the guest more control over how its grants are mapped.
+Currently there's no control over this for frontends or backends, and the
+latter cannot make assumptions about the mapping of transmit or receive
+grants; hence we need the frontend to take the initiative in managing its own
+mapping of grants. Guests may then opportunistically recycle these grants
+(e.g. Linux) and avoid resorting to the copies that come with using a fixed
+amount of buffers. Other frameworks (e.g. XDP, netmap, DPDK) use a fixed set
+of buffers, which also makes the case for this extension.
+
+## Terminology
+
+`staging grants` is a term used in this document to refer to the overall
+concept of having a set of grants permanently mapped with the backend,
+containing data staged there until completion. The term should therefore not
+be confused with a new kind of grant in the hypervisor.
+
+## Control Ring Messages
+
+### `XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`
+
+This message is sent by the frontend to fetch the number of grefs that can
+be kept mapped in the backend. It takes only the queue index as argument, and
+returns data representing the number of free entries in the mapping table.
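+
+Like the existing control messages, this would travel in a `struct
+xen_netif_ctrl_request` from `netif.h`; the type constant is part of this
+proposal, and the use of `data[0]` for the queue index is an assumption:
+
+    struct xen_netif_ctrl_request req = {
+        .id   = next_id++,                     /* echoed in the response */
+        .type = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE,
+        .data = { queue_index, 0, 0 },
+    };
+    /* The response's 'data' field would carry the number of free
+       entries in the backend's mapping table. */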
+
+### `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`
+
+This is sent by the frontend to map a list of grant references in the
+backend. It receives the queue index, the grant containing the list (the
+offset is implicitly zero) and the number of entries in the list. Each entry
+in this list has the following format:
+
+           0     1     2     3     4     5     6     7  octet
+        +-----+-----+-----+-----+-----+-----+-----+-----+
+        | grant ref             |  flags    |  status   |
+        +-----+-----+-----+-----+-----+-----+-----+-----+
+
+        grant ref: grant reference
+        flags: flags describing the control operation
+        status: XEN_NETIF_CTRL_STATUS_*
+
+The list can have a maximum of 512 entries to be mapped at once.
+The 'status' field is not used for adding new mappings; instead, the message
+returns an error code describing whether the operation was successful. On
+failure, none of the specified grant mappings get added.
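+
+A C view of the entry format above might look as follows (the structure name
+is illustrative, the layout follows the diagram):
+
+    struct netif_gref_mapping {     /* illustrative name */
+        grant_ref_t gref;           /* grant reference          */
+        uint16_t    flags;          /* control operation flags  */
+        uint16_t    status;         /* XEN_NETIF_CTRL_STATUS_*  */
+    };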
+
+### `XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING`
+
+This is sent by the frontend for the backend to unmap a list of grant
+references. The arguments are the same as for
+`XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`, including the format of the list. The
+only valid entries are the ones representing grant references that were
+previously the subject of a `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING` operation.
+Any other entries will have their status set to
+`XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER` upon completion. The entry 'status'
+field determines if the entry was successfully removed.
+
+## Datapath Changes
+
+The control ring is only available after the backend state is
+`XenbusConnected`, therefore only after this state change can the frontend
+query the total number of mappings it can keep. It then grants N entries per
+queue on both TX and RX rings, which creates the underlying backend
+gref -> page association (e.g. stored in a hash table). The frontend may wish
+to recycle these pregranted buffers or choose a copy approach to replace
+granting.
+
+In step 19) of Guest Transmit and step 3) of Guest Receive, the data gref is
+first looked up in this table, and the underlying page is used if a mapping
+already exists. In the successful case, steps 20), 21) and 27) of Guest
+Transmit are skipped, with 19) being replaced by a memcpy of up to 128 bytes.
+On Guest Receive, steps 4), 12) and 22) are replaced with a memcpy instead of
+a grant copy, as sketched below.
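+
+A sketch of the lookup on the backend Tx path, assuming a hypothetical
+per-queue `gref_mapping_lookup()` over the hash table mentioned above:
+
+    void *stage = gref_mapping_lookup(queue, txreq.gref);  /* hypothetical */
+    if (stage) {
+        /* Staging grant: plain memcpy, no GNTTABOP_* hypercall. */
+        memcpy(skb->data, stage + txreq.offset, txreq.size);
+    } else {
+        /* No mapping: fall back to the regular grant copy/map path. */
+        queue_grant_copy_op(queue, &txreq);                 /* hypothetical */
+    }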
+
+Failing to obtain the total number of mappings
+(`XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`) means the guest falls back to
+normal usage without pre-granted buffers.
+
+\clearpage
+
+# Wire Performance
+
+This section is a quick reference for the relevant on-the-wire numbers.
+
+The size of a single packet (excluding CRC) is calculated as:
+
+  Packet = Ethernet Header (14) + Protocol Data Unit (46 - 1500) = 60 bytes minimum
+
+On the wire it's a bit more:
+
+  Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap (12) = 84 bytes
+
+For a given link speed in bits/sec and packet size, the real packet rate is
+calculated as:
+
+  Rate = Link-speed / ((Preamble + SFD + Packet + CRC + Interframe gap) * 8)
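+
+For example, at 10 Gbit/s with a minimum-sized frame (64 bytes of packet plus
+CRC):
+
+  Rate = 10 * 10^9 / ((7 + 1 + 64 + 12) * 8) ~= 14.88 Mpps
+
+which matches the first row of the table below.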
+
+Numbers to keep in mind (packet size excludes PHY layer, though packet rates
+disclosed by vendors take those into account, since it's what goes on the
+wire):
+
+| Packet + CRC (bytes)   | 10 Gbit/s  |  40 Gbit/s |  100 Gbit/s  |
+|------------------------|:----------:|:----------:|:------------:|
+| 64                     | 14.88  Mpps|  59.52 Mpps|  148.80 Mpps |
+| 128                    |  8.44  Mpps|  33.78 Mpps|   84.46 Mpps |
+| 256                    |  4.52  Mpps|  18.11 Mpps|   45.29 Mpps |
+| 1500                   |   822  Kpps|   3.28 Mpps|    8.22 Mpps |
+| 65535                  |   ~19  Kpps|  76.27 Kpps|  190.68 Kpps |
+
+Caption:  Mpps (Million packets per second) ; Kpps (Kilo packets per second)
+
+\clearpage
+
+# Performance
+
+The following are numbers between a Linux v4.11 guest and another host
+connected by a 100 Gbit/s NIC, on an E5-2630 v4 2.2 GHz host, to give an idea
+of the performance benefits of this extension. Please refer to this
+presentation[7] for a better overview of the results.
+
+( Numbers include protocol overhead )
+
+**bulk transfer (Guest TX/RX)**
+
+ Queues  Before (Gbit/s)  After (Gbit/s)
+ ------  ---------------  ---------------
+ 1queue  17244/6000       38189/28108
+ 2queue  24023/9416       54783/40624
+ 3queue  29148/17196      85777/54118
+ 4queue  39782/18502      99530/46859
+
+( Guest -> Dom0 )
+
+**Packet I/O (Guest TX/RX) in UDP 64b**
+
+ Queues  Before (Mpps)  After (Mpps)
+ ------  -------------  ------------
+ 1queue  0.684/0.439    2.49/2.96
+ 2queue  0.953/0.755    4.74/5.07
+ 4queue  1.890/1.390    8.80/9.92
+
+\clearpage
+
+# References
+
+[0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html
+
+[1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362
+
+[2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1
+
+[3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
+
+[4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data
+
+[5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073
+
+[6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52
+
+[7] https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/e6/ToGrantOrNotToGrant-XDDS2017_v3.pdf
+
+# History
+
+A table of changes to the document, in chronological order.
+
+------------------------------------------------------------------------
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2016-12-14 1        Xen 4.9  Initial version for RFC
+
+2017-09-01 2        Xen 4.10 Rework to use control ring
+
+                             Trim down the specification
+
+                             Added some performance numbers from the
+                             presentation
+
+2017-09-13 3        Xen 4.10 Addressed changes from Paul Durrant
+
+2017-09-19 4        Xen 4.10 Addressed changes from Paul Durrant
+
+------------------------------------------------------------------------
-- 
2.11.0

