Re: [Xen-devel] [DOC v8] PV Calls protocol design
.snip..
> #### Frontend XenBus Nodes
>
> version
> Values: <string>
>
> Protocol version, chosen among the ones supported by the backend
> (see **versions** under [Backend XenBus Nodes]). Currently the
> value must be "1".
>
> port
> Values: <uint32_t>
>
> The identifier of the Xen event channel used to signal activity
> in the command ring.
>
> ring-ref
> Values: <uint32_t>
>
> The Xen grant reference granting permission for the backend to map
> the sole page in a single page sized command ring.
>
> #### Backend XenBus Nodes
>
> versions
> Values: <string>
>
> List of comma separated protocol versions supported by the backend.
> For example "1,2,3". Currently the value is just "1", as there is
> only one version.
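For what it is worth, this is how I would expect a frontend to pick the
**version** value from the backend's **versions** list. It is only a sketch:
`pvcalls_pick_version` and `max_supported` are names I made up, and the rule
"take the highest common version" is my assumption, the document only says the
value is chosen among the ones supported by the backend.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical frontend helper (not part of the protocol text): return the
 * highest version in the backend's comma-separated list that does not
 * exceed max_supported, or 0 if there is no common version.
 */
static unsigned int pvcalls_pick_version(const char *versions,
                                         unsigned int max_supported)
{
    unsigned int best = 0;
    const char *p = versions;

    assert(versions != NULL);
    for (;;) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)                 /* no digits here: stop parsing */
            break;
        if (v >= 1 && v <= max_supported && v > best)
            best = (unsigned int)v;
        if (*end != ',')              /* end of list or malformed entry */
            break;
        p = end + 1;
    }
    return best;
}
```

With the current backend value "1" and a frontend that only knows version 1,
this returns 1; a return of 0 would mean the frontend should not connect.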
>
> max-page-order
> Values: <uint32_t>
>
> The maximum supported size of a memory allocation in units of
> log2(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages,
> etc.
.. for the **data rings** (not to be confused with the command ring).
>
> function-calls
> Values: <uint32_t>
>
> Value "0" means that no calls are supported.
> Value "1" means that socket, connect, release, bind, listen, accept
> and poll are supported.
>
..snip..
> ### Commands Ring
>
> The shared ring is used by the frontend to forward POSIX function calls
> to the backend. We shall refer to this ring as **commands ring** to
> distinguish it from other rings which can be created later in the
> lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
> reference for the shared page of this ring is published on xenstore (see
> [Frontend XenBus Nodes]). The ring format is defined using the familiar
> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend
> requests are allocated on the ring using the `RING_GET_REQUEST` macro.
> The list of commands below is in calling order.
>
> The format is defined as follows:
>
>     #define PVCALLS_SOCKET  0
>     #define PVCALLS_CONNECT 1
>     #define PVCALLS_RELEASE 2
>     #define PVCALLS_BIND    3
>     #define PVCALLS_LISTEN  4
>     #define PVCALLS_ACCEPT  5
>     #define PVCALLS_POLL    6
>
>     struct xen_pvcalls_request {
>         uint32_t req_id; /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         union {
>             struct xen_pvcalls_socket {
>                 uint64_t id;
>                 uint32_t domain;
>                 uint32_t type;
>                 uint32_t protocol;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];
Could that be shifted to the right?
>                 #endif
>             } socket;
>             struct xen_pvcalls_connect {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];
>                 #endif
>             } connect;
>             struct xen_pvcalls_release {
>                 uint64_t id;
>                 uint8_t reuse;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[7];
Could that be shifted to the right?
>                 #endif
>             } release;
>             struct xen_pvcalls_bind {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>             } bind;
>             struct xen_pvcalls_listen {
>                 uint64_t id;
>                 uint32_t backlog;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];
Could that be shifted to the right?
>                 #endif
>             } listen;
>             struct xen_pvcalls_accept {
>                 uint64_t id;
>                 uint64_t id_new;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } accept;
>             struct xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             /* dummy member to force sizeof(struct xen_pvcalls_request) to
>                match across archs */
>             struct xen_pvcalls_dummy {
>                 uint8_t dummy[56];
>             } dummy;
>         } u;
>     };
>
> The first two fields are common for every command. Their binary layout
> is:
>
>     0       4       8
>     +-------+-------+
>     |req_id |  cmd  |
>     +-------+-------+
>
> - **req_id** is generated by the frontend and is a cookie used to
>   identify one specific request/response pair of commands. Not to be
>   confused with any command **id**, which is used to identify a socket
>   across multiple commands, see [Socket].
> - **cmd** is the command requested by the frontend:
>
>     - `PVCALLS_SOCKET`: 0
>     - `PVCALLS_CONNECT`: 1
>     - `PVCALLS_RELEASE`: 2
>     - `PVCALLS_BIND`: 3
>     - `PVCALLS_LISTEN`: 4
>     - `PVCALLS_ACCEPT`: 5
>     - `PVCALLS_POLL`: 6
>
> Both fields are echoed back by the backend. See [Socket families and
> address format] for the format of the **addr** field of connect and
> bind. The maximum size of command specific arguments is 56 bytes. Any
> future command that requires more than that will need to bump the
> **version** of the protocol.
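It might be worth having the header enforce the 56-byte limit at compile time.
A rough sketch of what I mean is below. It only carries the connect and dummy
members, leaves out the CONFIG_X86_32 padding, and assumes `grant_ref_t` is a
`uint32_t` as in the Xen public headers; the offsets in the assertions are
what I get from the struct quoted above, so please double check them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* assumption: matches Xen's public headers */

/* Trimmed copy of the request: only connect (the largest member) and dummy. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_connect {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_pvcalls_dummy {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* 8 bytes of common header plus 56 bytes of command specific arguments. */
_Static_assert(offsetof(struct xen_pvcalls_request, u) == 8,
               "command arguments must start right after req_id and cmd");
_Static_assert(sizeof(struct xen_pvcalls_request) == 64,
               "request must be 64 bytes on every architecture");
_Static_assert(sizeof(struct xen_pvcalls_connect) <= 56,
               "connect arguments must fit in the 56-byte budget");
/* ref sits 44 bytes into the connect arguments (8 id + 28 addr + 4 + 4). */
_Static_assert(offsetof(struct xen_pvcalls_connect, ref) == 44,
               "unexpected offset of ref in connect");
```

A new command whose arguments exceed 56 bytes would then fail to build instead
of silently changing the ring entry size.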
>
> Similarly to other Xen ring based protocols, after writing a request to
> the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
> issues an event channel notification when a notification is required.
>
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro.
> The format is the following:
>
>     struct xen_pvcalls_response {
>         uint32_t req_id;
>         uint32_t cmd;
>         int32_t ret;
>         uint32_t pad;
>         union {
>             struct _xen_pvcalls_socket {
>                 uint64_t id;
>             } socket;
>             struct _xen_pvcalls_connect {
>                 uint64_t id;
>             } connect;
>             struct _xen_pvcalls_release {
>                 uint64_t id;
>             } release;
>             struct _xen_pvcalls_bind {
>                 uint64_t id;
>             } bind;
>             struct _xen_pvcalls_listen {
>                 uint64_t id;
>             } listen;
>             struct _xen_pvcalls_accept {
>                 uint64_t id;
>             } accept;
>             struct _xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             struct _xen_pvcalls_dummy {
>                 uint8_t dummy[8];
>             } dummy;
>         } u;
>     };
>
> The first four fields are common for every response. Their binary layout
> is:
>
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |req_id |  cmd  |  ret  |  pad  |
>     +-------+-------+-------+-------+
>
> - **req_id**: echoed back from request
> - **cmd**: echoed back from request
> - **ret**: return value, identifies success (0) or failure (see [Error
> numbers] in further sections). If the **cmd** is not supported by the
> backend, ret is ENOTSUP.
> - **pad**: padding
>
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via event channel.
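To check my understanding of the echoed fields, here is how I imagine the
frontend pairing a response with an outstanding request. This is a sketch
only: `struct pvcalls_pending` and both helpers are hypothetical names, and I
only model the 16 common bytes of the response.

```c
#include <assert.h>
#include <stdint.h>

/* Common part of struct xen_pvcalls_response (the first 16 bytes). */
struct pvcalls_rsp_hdr {
    uint32_t req_id;
    uint32_t cmd;
    int32_t  ret;
    uint32_t pad;
};

_Static_assert(sizeof(struct pvcalls_rsp_hdr) == 16,
               "common response fields must occupy 16 bytes");

/* Hypothetical record the frontend keeps for each outstanding request. */
struct pvcalls_pending {
    uint32_t req_id;
    uint32_t cmd;
};

/* Returns 1 if rsp answers the pending request: both fields are echoed. */
static int pvcalls_rsp_matches(const struct pvcalls_rsp_hdr *rsp,
                               const struct pvcalls_pending *p)
{
    return rsp->req_id == p->req_id && rsp->cmd == p->cmd;
}

/* Returns 1 if a matched response reports success, i.e. ret is 0. */
static int pvcalls_rsp_ok(const struct pvcalls_rsp_hdr *rsp)
{
    return rsp->ret == 0;
}
```

Anything other than 0 in **ret** would then be looked up in the error numbers
section, which I have not tried to model here.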
>
> A description of each command, with its additional request and response
> fields, follows.
>
>
> #### Socket
>
> The **socket** operation corresponds to the POSIX [socket][socket]
> function. It creates a new socket of the specified family, type and
> protocol. **id** is freely chosen by the frontend and references this
> specific socket from this point forward. See [Socket families and
> address format].
.. to see which ones are supported by different versions of the
protocol.
>
> Request fields:
>
> - **cmd** value: 0
> - additional fields:
> - **id**: generated by the frontend, it identifies the new socket
> - **domain**: the communication domain
> - **type**: the socket type
> - **protocol**: the particular protocol to be used with the socket, usually 0
>
> Request binary layout:
>
>     8       12      16      20      24      28
>     +-------+-------+-------+-------+-------+
>     |       id      |domain | type  |protoco|
>     +-------+-------+-------+-------+-------+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX socket function][socket] for error names; see
> [Error numbers] in further sections.
>
> #### Connect
>
> The **connect** operation corresponds to the POSIX [connect][connect]
> function. It connects a previously created socket (identified by **id**)
> to the specified address.
>
> The connect operation creates a new shared ring, which we'll call **data
> ring**. The data ring is used to send and receive data from the
> socket. The connect operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data
> ring. **ref** is the grant reference of the **indexes page**: a page
> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of
s/data ring/**data ring**/
> grant references for the data ring. When the frontend issues a
> **connect** command, the backend:
>
> - finds its own internal socket corresponding to **id**
> - connects the socket to **addr**
> - maps the grant reference **ref**, the indexes page, see struct
> pvcalls_data_intf
> - maps all the grant references listed in `struct pvcalls_data_intf` and
> uses them as shared memory for the data ring
s/data ring/**data ring**/ perhaps?
> - binds the **evtchn**
> - replies to the frontend
>
> The [Indexes Page and Data ring] format will be described in the
> following section. The data ring is unmapped and freed upon issuing a
> **release** command on the active socket identified by **id**. A
> frontend stage change can also cause data rings to be unmapped.
s/stage/state/
>
> Request fields:
>
> - **cmd** value: 1
> - additional fields:
> - **id**: identifies the socket
> - **addr**: address to connect to, see [Socket families and address format]
Hm, so what do we do if we want to support AF_UNIX which has an addr of
108 bytes?
> - **len**: address length
up to 28 octets.
> - **flags**: flags for the connection, reserved for future usage
> - **ref**: grant reference of the indexes page
> - **evtchn**: port number of the evtchn to signal activity on the data ring
>
> Request binary layout:
>
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |  len  | flags |  ref  |evtchn |
>     +-------+-------+-------+-------+
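To illustrate my AF_UNIX question above: with a 28-byte **addr** the frontend
has to reject anything larger before it reaches the ring. Below is a sketch of
such a check; `pvcalls_set_addr`, `struct pvcalls_addr` and `PVCALLS_ADDR_MAX`
are names I invented, and returning -1 is just a placeholder for whatever
error the frontend would report. On Linux a `struct sockaddr_in6` is typically
exactly 28 bytes, while a `struct sockaddr_un` is far larger.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

#define PVCALLS_ADDR_MAX 28   /* size of addr[] in the connect and bind requests */

/* Hypothetical holder for the addr/len pair of a connect or bind request. */
struct pvcalls_addr {
    uint8_t  addr[PVCALLS_ADDR_MAX];
    uint32_t len;
};

/*
 * Copy a socket address into the request fields.
 * Returns 0 on success, -1 if the address does not fit in addr[].
 */
static int pvcalls_set_addr(struct pvcalls_addr *dst,
                            const void *sa, size_t sa_len)
{
    if (sa_len > PVCALLS_ADDR_MAX)
        return -1;
    memset(dst->addr, 0, sizeof(dst->addr));   /* do not leak stale bytes */
    memcpy(dst->addr, sa, sa_len);
    dst->len = (uint32_t)sa_len;
    return 0;
}
```

So AF_INET and AF_INET6 addresses go through, and an AF_UNIX address would
need either a larger field (and a version bump) or a different mechanism.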
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX connect function][connect] for error names; see
> [Error numbers] in further sections.
>
> #### Release
>
> The **release** operation closes an existing active or a passive socket.
>
> When a release command is issued on a passive socket, the backend
> releases it and frees its internal mappings. When a release command is
> issued for an active socket, the data ring and indexes page are also
> unmapped and freed:
>
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the data ring
> - backend unmaps the indexes page
> - backend unbinds the event channel
> - backend replies to frontend with a **ret** value
> - frontend frees data ring, indexes page and unbinds event channel
>
> Request fields:
>
> - **cmd** value: 2
> - additional fields:
> - **id**: identifies the socket
> - **reuse**: an optimization hint for the backend. The field is
> ignored for passive sockets. When set to 1, the frontend lets the
> backend know that it will reuse exactly the same set of grant pages
> (indexes page and data ring) and event channel when creating one of
> the next active sockets. The backend can take advantage of it by
> delaying unmapping grants and unbinding the event channel. The
> backend is free to ignore the hint. Reused data rings are found by
> **ref**, the grant reference of the page containing the indexes.
>
> Request binary layout:
>
>     8       12      16    17
>     +-------+-------+-----+
>     |       id      |reuse|
>     +-------+-------+-----+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX shutdown function][shutdown] for error names; see
> [Error numbers] in further sections.
>
> #### Bind
>
> The **bind** operation corresponds to the POSIX [bind][bind] function.
> It assigns the address passed as parameter to a previously created
> socket, identified by **id**. **Bind**, **listen** and **accept** are
> the three operations required to have fully working passive sockets and
> should be issued in that order.
>
> Request fields:
>
> - **cmd** value: 3
> - additional fields:
> - **id**: identifies the socket
> - **addr**: address to bind to, see [Socket families and address
> format]
> - **len**: address length
.. up to 28 octets.
>
> Request binary layout:
>
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |  len  |
>     +-------+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX bind function][bind] for error names; see
> [Error numbers] in further sections.
>
>
..snip..
> #### Accept
>
> The **accept** operation extracts the first connection request on the
> queue of pending connections for the listening socket identified by
> **id** and creates a new connected socket. The id of the new socket is
> also chosen by the frontend and passed as an additional field of the
> accept request struct (**id_new**). See the [POSIX accept function][accept]
> as reference.
>
> Similarly to the **connect** operation, **accept** creates new [Indexes
> Page and Data ring]. The data ring is used to send and receive data from
> the socket. The **accept** operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data
s/data/**data/
> ring. **ref** is the grant reference of the **indexes page**: a page
s/ring/ring**/
> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of
Perhaps highlight data ring here?
> grant references for the data ring.
>
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or
> EWOULDBLOCK.
>
> Example workflow:
>
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **id_new** to the new socket
> - backend maps the grant reference **ref**, the indexes page, see struct
> pvcalls_data_intf
> - backend maps all the grant references listed in `struct
> pvcalls_data_intf` and uses them as shared memory for the new data
> ring **in** and **out** arrays
> - backend binds to the **evtchn**
> - backend replies to the frontend with a **ret** value
>
> Request fields:
>
> - **cmd** value: 5
> - additional fields:
> - **id**: id of listening socket
> - **id_new**: id of the new socket
> - **ref**: grant reference of the indexes page
> - **evtchn**: port number of the evtchn to signal activity on the data ring
>
> Request binary layout:
>
>     8       12      16      20      24      28      32
>     +-------+-------+-------+-------+-------+-------+
>     |       id      |    id_new     |  ref  |evtchn |
>     +-------+-------+-------+-------+-------+-------+
>
> Response additional fields:
>
> - **id**: id of the listening socket, echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX accept function][accept] for error names; see
> [Error numbers] in further sections.
>
>
..snip..
> ### Indexes Page and Data ring
>
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command.
> The **sendmsg** and **recvmsg** calls are implemented by sending data and
> receiving data from a data ring, and updating the corresponding indexes
> on the **indexes page**.
>
> Firstly, the **indexes page** is shared by a **connect** or **accept**
> command, see **ref** parameter in their sections. The content of the
> **indexes page** is represented by `struct pvcalls_data_intf`, see
> below. The structure contains the list of grant references which
> constitute the **in** and **out** buffers of the data ring, see ref[]
> below. The backend maps the grant references contiguously. Of the
> resulting shared memory, the first half is dedicated to the **in** array
> and the second half to the **out** array. They are used as circular
> buffers for transferring data, and, together, they are the data ring.
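Just to spell out the split between **in** and **out** as I read it, here is a
small sketch. `PVCALLS_PAGE_SHIFT` and the helper names are mine, and I am
assuming 4 KiB pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4 KiB machine pages */

/* Total size in bytes of the contiguously mapped data ring. */
static size_t pvcalls_ring_bytes(uint32_t ring_order)
{
    return ((size_t)1 << ring_order) << PVCALLS_PAGE_SHIFT;
}

/* Size of each of the in and out arrays: half of the data ring. */
static size_t pvcalls_array_bytes(uint32_t ring_order)
{
    return pvcalls_ring_bytes(ring_order) / 2;
}

/* in starts at offset 0 of the mapping; out starts right after in. */
static size_t pvcalls_out_offset(uint32_t ring_order)
{
    return pvcalls_array_bytes(ring_order);
}
```

With ring_order = 1 that gives 8192 bytes in total, **in** at bytes 0 to 4095
and **out** at bytes 4096 to 8191, so each array is exactly one page.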
>
>
>     +---------------------------+                 Indexes page
>     | Command ring:             |                 +----------------------+
>     | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
^-- The first 64 bytes are reserved for the in_cons, etc.
Perhaps just start at @64 (And naturally add that to the 'ref')
>     | @44: ref  +-------------------------------->+@76: ring_order = 1   |
>     |                           |                 |@80: ref[0]+          |
>     +---------------------------+                 |@84: ref[1]+          |
>                                                   |           |          |
>                                                   |           |          |
>                                                   +----------------------+
>                                                               |
>                                                               v (data ring)
>                                                       +-------+-----------+
>                                                       |  @0->4095: in     |
>                                                       |  ref[0]           |
>                                                       |-------------------|
>                                                       |  @4096->8191: out |
>                                                       |  ref[1]           |
>                                                       +-------------------+
>
>
Thank you!
> #### Indexes Page Structure
>
>     typedef uint32_t PVCALLS_RING_IDX;
>
>     struct pvcalls_data_intf {
>         PVCALLS_RING_IDX in_cons, in_prod;
>         int32_t in_error;
You don't want to perhaps include in_event?
>
>         uint8_t pad[52];
>
>         PVCALLS_RING_IDX out_cons, out_prod;
>         int32_t out_error;
And out_event as way to do some form of interrupt mitigation
(similar to what you had proposed?)
>
>         uint32_t ring_order;
>         grant_ref_t ref[];
>     };
>
>     /* not actually C compliant (ring_order changes from socket to socket) */
>     struct pvcalls_data {
>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>     };
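I tried to verify the offsets used in the diagram (64, 76, 80) against this
struct. A sketch follows, assuming `grant_ref_t` is a `uint32_t`. The helper
`pvcalls_max_order_for_page` is my own addition: it works out how large
ring_order could get if the indexes page is a single page and ref[] fills the
rest of it, which is an assumption the text does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;
typedef uint32_t grant_ref_t;   /* assumption: matches Xen's public headers */

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;

    uint8_t pad[52];            /* pushes the out_* fields to offset 64 */

    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;

    uint32_t ring_order;
    grant_ref_t ref[];
};

_Static_assert(offsetof(struct pvcalls_data_intf, out_cons) == 64,
               "out_* indexes must start at byte 64");
_Static_assert(offsetof(struct pvcalls_data_intf, ring_order) == 76,
               "ring_order expected at byte 76");
_Static_assert(offsetof(struct pvcalls_data_intf, ref) == 80,
               "ref[] expected at byte 80");

/*
 * Hypothetical helper: largest ring_order whose (1 << ring_order) grant
 * references still fit after the fixed fields in a page of page_size bytes.
 */
static uint32_t pvcalls_max_order_for_page(size_t page_size)
{
    size_t slots;
    uint32_t order = 0;

    assert(page_size >= offsetof(struct pvcalls_data_intf, ref)
                        + sizeof(grant_ref_t));
    slots = (page_size - offsetof(struct pvcalls_data_intf, ref))
            / sizeof(grant_ref_t);
    while (((size_t)2 << order) <= slots)   /* would order + 1 still fit? */
        order++;
    return order;
}
```

If my arithmetic is right, a 4096-byte indexes page has room for 1004
references, so ring_order could not exceed 9 under that assumption. Perhaps
worth mentioning next to **max-page-order**.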
>
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-page-order**, as specified by the backend on XenBus. It has to
>   be one at minimum.
Oh? Why not zero? (4KB) as the 'max-page-order' has an example of zero order?
Perhaps if it MUST be one or more then the 'max-page-order' should say
that at least it MUST be one?
> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguously in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array. The arrays must
>   have a power of two size. Together, their size is `(1 << ring_order) *
>   PAGE_SIZE`.
> - **in** is an array used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer indexes for data read from the socket. They keep track
>   of how much data has already been consumed by the frontend from the **in**
>   array. **in_prod** is increased by the backend, after writing data to
>   **in**. **in_cons** is increased by the frontend, after reading data from
>   **in**.
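Since the arrays must have a power of two size, I assume the indexes are meant
to be free-running 32-bit counters that get masked on access, as in other Xen
rings. The quoted text does not say so explicitly, so treat the sketch below
as my reading rather than the spec; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Offset into the in or out array; array_size must be a power of two. */
static uint32_t pvcalls_mask(PVCALLS_RING_IDX idx, uint32_t array_size)
{
    return idx & (array_size - 1);
}

/*
 * Bytes produced but not yet consumed. Unsigned subtraction stays correct
 * when the producer index has wrapped around 2^32 and the consumer has not.
 */
static uint32_t pvcalls_queued(PVCALLS_RING_IDX prod, PVCALLS_RING_IDX cons)
{
    return prod - cons;
}

/* Bytes the producer may still write without overrunning the consumer. */
static uint32_t pvcalls_free(PVCALLS_RING_IDX prod, PVCALLS_RING_IDX cons,
                             uint32_t array_size)
{
    return array_size - pvcalls_queued(prod, cons);
}
```

For **in**, the backend would write at `pvcalls_mask(in_prod, size)` and the
frontend read from `pvcalls_mask(in_cons, size)`; the same helpers would apply
to **out** with the roles swapped. If that is the intent, it would be good to
state it in the text.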
> -ring-page-order
???