
Re: [Xen-devel] [DOC v8] PV Calls protocol design



On Tue, 7 Feb 2017, Konrad Rzeszutek Wilk wrote:
> .snip..
> > #### Frontend XenBus Nodes
> > 
> > version
> >      Values:         <string>
> > 
> >      Protocol version, chosen among the ones supported by the backend
> >      (see **versions** under [Backend XenBus Nodes]). Currently the
> >      value must be "1".
> > 
> > port
> >      Values:         <uint32_t>
> > 
> >      The identifier of the Xen event channel used to signal activity
> >      in the command ring.
> > 
> > ring-ref
> >      Values:         <uint32_t>
> > 
> >      The Xen grant reference granting permission for the backend to map
> >      the sole page in a single page sized command ring.
> > 
> > #### Backend XenBus Nodes
> > 
> > versions
> >      Values:         <string>
> > 
> >      List of comma separated protocol versions supported by the backend.
> >      For example "1,2,3". Currently the value is just "1", as there is
> >      only one version.
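
The version negotiation above amounts to the frontend checking its chosen
version against the backend's comma-separated list. A minimal sketch of that
check follows; the helper name `pvcalls_version_supported` is hypothetical
and not part of the protocol or of any Xen header.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `wanted` is one element of the
 * comma-separated list `versions` (e.g. "1,2,3"). */
static bool pvcalls_version_supported(const char *versions, const char *wanted)
{
    size_t wlen = strlen(wanted);
    const char *p = versions;

    assert(wlen > 0);
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current element */

        /* Compare whole elements, so "1" does not match "10". */
        if (len == wlen && strncmp(p, wanted, len) == 0)
            return true;
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return false;
}
```

A frontend would read **versions** from the backend, run a check like this
for each version it implements, and write the selected one to its own
**version** node.
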
> > 
> > max-page-order
> >      Values:         <uint32_t>
> > 
> >      The maximum supported size of a memory allocation in units of
> >      log2n(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages,
> >      etc.
> 
> .. for the **data rings** (not to be confused with the command ring).
> 
> > 
> > function-calls
> >      Values:         <uint32_t>
> > 
> >      Value "0" means that no calls are supported.
> >      Value "1" means that socket, connect, release, bind, listen, accept
> >      and poll are supported.
> > 
> ..snip..
> > ### Commands Ring
> > 
> > The shared ring is used by the frontend to forward POSIX function calls
> > to the backend. We shall refer to this ring as **commands ring** to
> > distinguish it from other rings which can be created later in the
> > lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
> > reference for the shared page of this ring is published on xenstore (see
> > [Frontend XenBus Nodes]). The ring format is defined using the familiar
> > `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`).  Frontend
> > requests are allocated on the ring using the `RING_GET_REQUEST` macro.
> > The list of commands below is in calling order.
> > 
> > The format is defined as follows:
> >     
> >     #define PVCALLS_SOCKET         0
> >     #define PVCALLS_CONNECT        1
> >     #define PVCALLS_RELEASE        2
> >     #define PVCALLS_BIND           3
> >     #define PVCALLS_LISTEN         4
> >     #define PVCALLS_ACCEPT         5
> >     #define PVCALLS_POLL           6
> > 
> >     struct xen_pvcalls_request {
> >             uint32_t req_id; /* private to guest, echoed in response */
> >             uint32_t cmd;    /* command to execute */
> >             union {
> >                     struct xen_pvcalls_socket {
> >                             uint64_t id;
> >                             uint32_t domain;
> >                             uint32_t type;
> >                             uint32_t protocol;
> >                 #ifdef CONFIG_X86_32
> >                 uint8_t pad[4];
> 
> Could that be shifted to the right?

Tabs vs Spaces, sigh. I fixed it.


> >                 #endif
> >                     } socket;
> >                     struct xen_pvcalls_connect {
> >                             uint64_t id;
> >                             uint8_t addr[28];
> >                             uint32_t len;
> >                             uint32_t flags;
> >                             grant_ref_t ref;
> >                             uint32_t evtchn;
> >                 #ifdef CONFIG_X86_32
> >                 uint8_t pad[4];
> >                 #endif
> >                     } connect;
> >                     struct xen_pvcalls_release {
> >                             uint64_t id;
> >                             uint8_t reuse;
> >                 #ifdef CONFIG_X86_32
> >                 uint8_t pad[7];
> 
> Could that be shifted to the right?

yep


> >                 #endif
> >                     } release;
> >                     struct xen_pvcalls_bind {
> >                             uint64_t id;
> >                             uint8_t addr[28];
> >                             uint32_t len;
> >                     } bind;
> >                     struct xen_pvcalls_listen {
> >                             uint64_t id;
> >                             uint32_t backlog;
> >                 #ifdef CONFIG_X86_32
> >                 uint8_t pad[4];
> 
> Could that be shifted to the right?

yep


> >                 #endif
> >                     } listen;
> >                     struct xen_pvcalls_accept {
> >                             uint64_t id;
> >                             uint64_t id_new;
> >                             grant_ref_t ref;
> >                             uint32_t evtchn;
> >                     } accept;
> >                     struct xen_pvcalls_poll {
> >                             uint64_t id;
> >                     } poll;
> >                     /* dummy member to force sizeof(struct xen_pvcalls_request)
> >                      * to match across archs */
> >                     struct xen_pvcalls_dummy {
> >                             uint8_t dummy[56];
> >                     } dummy;
> >             } u;
> >     };
> > 
> > The first two fields are common for every command. Their binary layout
> > is:
> > 
> >     0       4       8
> >     +-------+-------+
> >     |req_id |  cmd  |
> >     +-------+-------+
> > 
> > - **req_id** is generated by the frontend and is a cookie used to
> >   identify one specific request/response pair of commands. Not to be
> >   confused with a command's **id**, which is used to identify a socket
> >   across multiple commands, see [Socket].
> > - **cmd** is the command requested by the frontend:
> > 
> >     - `PVCALLS_SOCKET`:  0
> >     - `PVCALLS_CONNECT`: 1
> >     - `PVCALLS_RELEASE`: 2
> >     - `PVCALLS_BIND`:    3
> >     - `PVCALLS_LISTEN`:  4
> >     - `PVCALLS_ACCEPT`:  5
> >     - `PVCALLS_POLL`:    6
> > 
> > Both fields are echoed back by the backend. See [Socket families and
> > address format] for the format of the **addr** field of connect and
> > bind. The maximum size of command specific arguments is 56 bytes. Any
> > future command that requires more than that will need to bump the
> > **version** of the protocol.
> > 
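
The 56-byte limit and the purpose of the dummy member can be checked at
compile time. The sketch below is a trimmed copy of the request struct (only
the connect and dummy members are kept). It assumes `grant_ref_t` is a
32-bit type, which is not stated in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t; /* assumption: 32-bit grant reference */

/* Trimmed copy of the request layout, for illustration only. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct {
            uint8_t dummy[56];  /* fixes the size of the union */
        } dummy;
    } u;
};

/* Command arguments start right after req_id and cmd. */
_Static_assert(offsetof(struct xen_pvcalls_request, u) == 8,
               "arguments must start at offset 8");
/* 8 bytes of header plus 56 bytes of arguments. */
_Static_assert(sizeof(struct xen_pvcalls_request) == 64,
               "request must be 64 bytes on every arch");
```

The connect members add up to 52 bytes. Where `uint64_t` is 8-byte aligned
the compiler rounds that up to 56, and where it is only 4-byte aligned the
dummy member (or the explicit pad) does the same job, so both ends agree on
the slot size.
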
> > Similarly to other Xen ring based protocols, after writing a request to
> > the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
> > issues an event channel notification when a notification is required.
> > 
> > Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> > macro. The format is the following:
> > 
> >     struct xen_pvcalls_response {
> >         uint32_t req_id;
> >         uint32_t cmd;
> >         int32_t ret;
> >         uint32_t pad;
> >         union {
> >                     struct _xen_pvcalls_socket {
> >                             uint64_t id;
> >                     } socket;
> >                     struct _xen_pvcalls_connect {
> >                             uint64_t id;
> >                     } connect;
> >                     struct _xen_pvcalls_release {
> >                             uint64_t id;
> >                     } release;
> >                     struct _xen_pvcalls_bind {
> >                             uint64_t id;
> >                     } bind;
> >                     struct _xen_pvcalls_listen {
> >                             uint64_t id;
> >                     } listen;
> >                     struct _xen_pvcalls_accept {
> >                             uint64_t id;
> >                     } accept;
> >                     struct _xen_pvcalls_poll {
> >                             uint64_t id;
> >                     } poll;
> >                     struct _xen_pvcalls_dummy {
> >                             uint8_t dummy[8];
> >                     } dummy;
> >             } u;
> >     };
> > 
> > The first four fields are common for every response. Their binary layout
> > is:
> > 
> >     0       4       8       12      16
> >     +-------+-------+-------+-------+
> >     |req_id |  cmd  |  ret  |  pad  |
> >     +-------+-------+-------+-------+
> > 
> > - **req_id**: echoed back from request
> > - **cmd**: echoed back from request
> > - **ret**: return value, identifies success (0) or failure (see [Error
> >   numbers] in further sections). If the **cmd** is not supported by the
> >   backend, ret is ENOTSUP.
> > - **pad**: padding
> > 
> > After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> > whether it needs to notify the frontend and does so via event channel.
> > 
> > A description of each command and of its additional request and response
> > fields follows.
> > 
> > 
> > #### Socket
> > 
> > The **socket** operation corresponds to the POSIX [socket][socket]
> > function. It creates a new socket of the specified family, type and
> > protocol. **id** is freely chosen by the frontend and references this
> > specific socket from this point forward. See [Socket families and
> > address format].
> 
> .. to see which ones are supported by different versions of the
> protocol.

OK


> > 
> > Request fields:
> > 
> > - **cmd** value: 0
> > - additional fields:
> >   - **id**: generated by the frontend, it identifies the new socket
> >   - **domain**: the communication domain
> >   - **type**: the socket type
> >   - **protocol**: the particular protocol to be used with the socket,
> >     usually 0
> > 
> > Request binary layout:
> > 
> >     8       12      16      20      24      28
> >     +-------+-------+-------+-------+-------+
> >     |       id      |domain | type  |protoco|
> >     +-------+-------+-------+-------+-------+
> > 
> > Response additional fields:
> > 
> > - **id**: echoed back from request
> > 
> > Response binary layout:
> > 
> >     16      20      24
> >     +-------+-------+
> >     |       id      |
> >     +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - See the [POSIX socket function][socket] for error names; see
> >     [Error numbers] in further sections.
> > 
> > #### Connect
> > 
> > The **connect** operation corresponds to the POSIX [connect][connect]
> > function. It connects a previously created socket (identified by **id**)
> > to the specified address.
> > 
> > The connect operation creates a new shared ring, which we'll call **data
> > ring**. The data ring is used to send and receive data from the
> > socket. The connect operation passes two additional parameters:
> > **evtchn** and **ref**. **evtchn** is the port number of a new event
> > channel which will be used for notifications of activity on the data
> > ring. **ref** is the grant reference of the **indexes page**: a page
> > which contains shared indexes that point to the write and read locations
> > in the data ring. The **indexes page** also contains the full array of
> 
> s/data ring/**data ring**/ 

OK


> > grant references for the data ring. When the frontend issues a
> > **connect** command, the backend:
> > 
> > - finds its own internal socket corresponding to **id**
> > - connects the socket to **addr**
> > - maps the grant reference **ref**, the indexes page, see struct
> >   pvcalls_data_intf
> > - maps all the grant references listed in `struct pvcalls_data_intf` and
> >   uses them as shared memory for the data ring
> 
> s/data ring/**data ring**/ perhaps?

OK


> > - binds the **evtchn**
> > - replies to the frontend
> > 
> > The [Indexes Page and Data ring] format will be described in the
> > following section. The data ring is unmapped and freed upon issuing a
> > **release** command on the active socket identified by **id**. A
> > frontend stage change can also cause data rings to be unmapped.
> 
> s/stage/state/

OK


> > 
> > Request fields:
> > 
> > - **cmd** value: 1
> > - additional fields:
> >   - **id**: identifies the socket
> >   - **addr**: address to connect to, see [Socket families and address
> >     format]
> 
> 
> Hm, so what do we do if we want to support AF_UNIX which has an addr of
> 108 bytes?

We write a protocol extension and bump the protocol version. However, we
could make the addr array size larger now to be more future proof, but
it takes up memory and I have no use for it, given that we can use
loopback for the same purpose.
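
To make the 28-byte limit concrete, here is a small sketch comparing the
fixed addr[] size with the socket address structures of the host C library.
`PVCALLS_ADDR_MAX` and `pvcalls_addr_fits` are hypothetical names used only
for this illustration, and the structure sizes are whatever the local libc
defines.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical name for the size of addr[] in connect and bind requests. */
#define PVCALLS_ADDR_MAX 28

/* True if an address of `len` bytes fits in the request's addr[] array. */
static bool pvcalls_addr_fits(size_t len)
{
    return len <= PVCALLS_ADDR_MAX;
}
```

On common systems `pvcalls_addr_fits(sizeof(struct sockaddr_in))` and
`pvcalls_addr_fits(sizeof(struct sockaddr_in6))` hold, while
`pvcalls_addr_fits(sizeof(struct sockaddr_un))` does not, which is why
AF_UNIX would need the protocol extension mentioned above.
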

 
> >   - **len**: address length
> 
> up to 28 octets.
> 
> >   - **flags**: flags for the connection, reserved for future usage
> >   - **ref**: grant reference of the indexes page
> >   - **evtchn**: port number of the evtchn to signal activity on the data
> >     ring
> > 
> > Request binary layout:
> > 
> >     8       12      16      20      24      28      32      36      40      44
> >     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
> >     |       id      |                            addr                       |
> >     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
> >     | len   | flags |  ref  |evtchn |
> >     +-------+-------+-------+-------+
> > 
> > Response additional fields:
> > 
> > - **id**: echoed back from request
> > 
> > Response binary layout:
> > 
> >     16      20      24
> >     +-------+-------+
> >     |       id      |
> >     +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - See the [POSIX connect function][connect] for error names; see
> >     [Error numbers] in further sections.
> > 
> > #### Release
> > 
> > The **release** operation closes an existing active or passive socket.
> > 
> > When a release command is issued on a passive socket, the backend
> > releases it and frees its internal mappings. When a release command is
> > issued for an active socket, the data ring and indexes page are also
> > unmapped and freed:
> > 
> > - frontend sends release command for an active socket
> > - backend releases the socket
> > - backend unmaps the data ring
> > - backend unmaps the indexes page
> > - backend unbinds the event channel
> > - backend replies to frontend with a **ret** value
> > - frontend frees data ring, indexes page and unbinds event channel
> > 
> > Request fields:
> > 
> > - **cmd** value: 2
> > - additional fields:
> >   - **id**: identifies the socket
> >   - **reuse**: an optimization hint for the backend. The field is
> >     ignored for passive sockets. When set to 1, the frontend lets the
> >     backend know that it will reuse exactly the same set of grant pages
> >     (indexes page and data ring) and event channel when creating one of
> >     the next active sockets. The backend can take advantage of it by
> >     delaying unmapping grants and unbinding the event channel. The
> >     backend is free to ignore the hint. Reused data rings are found by
> >     **ref**, the grant reference of the page containing the indexes.
> > 
> > Request binary layout:
> > 
> >     8       12      16    17
> >     +-------+-------+-----+
> >     |       id      |reuse|
> >     +-------+-------+-----+
> > 
> > Response additional fields:
> > 
> > - **id**: echoed back from request
> > 
> > Response binary layout:
> > 
> >     16      20      24
> >     +-------+-------+
> >     |       id      |
> >     +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - See the [POSIX shutdown function][shutdown] for error names; see
> >     [Error numbers] in further sections.
> > 
> > #### Bind
> > 
> > The **bind** operation corresponds to the POSIX [bind][bind] function.
> > It assigns the address passed as parameter to a previously created
> > socket, identified by **id**. **Bind**, **listen** and **accept** are
> > the three operations required to have fully working passive sockets and
> > should be issued in that order.
> > 
> > Request fields:
> > 
> > - **cmd** value: 3
> > - additional fields:
> >   - **id**: identifies the socket
> >   - **addr**: address to bind to, see [Socket families and address
> >     format]
> >   - **len**: address length
> 
> .. up to 28 octets.

OK


> > Request binary layout:
> > 
> >     8       12      16      20      24      28      32      36      40      44
> >     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
> >     |       id      |                            addr                       |
> >     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
> >     |  len  |
> >     +-------+
> > 
> > Response additional fields:
> > 
> > - **id**: echoed back from request
> > 
> > Response binary layout:
> > 
> >     16      20      24
> >     +-------+-------+
> >     |       id      |
> >     +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - See the [POSIX bind function][bind] for error names; see
> >     [Error numbers] in further sections.
> > 
> > 
> ..snip..
> > #### Accept
> > 
> > The **accept** operation extracts the first connection request on the
> > queue of pending connections for the listening socket identified by
> > **id** and creates a new connected socket. The id of the new socket is
> > also chosen by the frontend and passed as an additional field of the
> > accept request struct (**id_new**). See the [POSIX accept function][accept]
> > as reference.
> > 
> > Similarly to the **connect** operation, **accept** creates new [Indexes
> > Page and Data ring]. The data ring is used to send and receive data from
> > the socket. The **accept** operation passes two additional parameters:
> > **evtchn** and **ref**. **evtchn** is the port number of a new event
> > channel which will be used for notifications of activity on the data
> 
> s/data/**data/
> > ring. **ref** is the grant reference of the **indexes page**: a page
> 
> s/ring/ring**/

OK


> > which contains shared indexes that point to the write and read locations
> > in the data ring. The **indexes page** also contains the full array of
> 
> Perhaps highlight data ring here?

Yep


> 
> > grant references for the data ring.
> > 
> > The backend will reply to the request only when a new connection is
> > successfully accepted, i.e. the backend does not return EAGAIN or
> > EWOULDBLOCK.
> > 
> > Example workflow:
> > 
> > - frontend issues an **accept** request
> > - backend waits for a connection to be available on the socket
> > - a new connection becomes available
> > - backend accepts the new connection
> > - backend creates an internal mapping from **id_new** to the new socket
> > - backend maps the grant reference **ref**, the indexes page, see struct
> >   pvcalls_data_intf
> > - backend maps all the grant references listed in `struct
> >   pvcalls_data_intf` and uses them as shared memory for the new data
> >   ring **in** and **out** arrays
> > - backend binds to the **evtchn**
> > - backend replies to the frontend with a **ret** value
> > 
> > Request fields:
> > 
> > - **cmd** value: 5
> > - additional fields:
> >   - **id**: id of listening socket
> >   - **id_new**: id of the new socket
> >   - **ref**: grant reference of the indexes page
> >   - **evtchn**: port number of the evtchn to signal activity on the data
> >     ring
> > 
> > Request binary layout:
> > 
> >     8       12      16      20      24      28      32
> >     +-------+-------+-------+-------+-------+-------+
> >     |       id      |    id_new     |  ref  |evtchn |
> >     +-------+-------+-------+-------+-------+-------+
> > 
> > Response additional fields:
> > 
> > - **id**: id of the listening socket, echoed back from request
> > 
> > Response binary layout:
> > 
> >     16      20      24
> >     +-------+-------+
> >     |       id      |
> >     +-------+-------+
> > 
> > Return value:
> > 
> >   - 0 on success
> >   - See the [POSIX accept function][accept] for error names; see
> >     [Error numbers] in further sections.
> > 
> > 
> ..snip..
> > ### Indexes Page and Data ring
> > 
> > Data rings are used for sending and receiving data over a connected socket.
> > They are created upon a successful **accept** or **connect** command.
> > The **sendmsg** and **recvmsg** calls are implemented by sending data to and
> > receiving data from a data ring, and updating the corresponding indexes
> > on the **indexes page**.
> > 
> > Firstly, the **indexes page** is shared by a **connect** or **accept**
> > command, see **ref** parameter in their sections. The content of the
> > **indexes page** is represented by `struct pvcalls_data_intf`, see
> > below. The structure contains the list of grant references which
> > constitute the **in** and **out** buffers of the data ring, see ref[]
> > below. The backend maps the grant references contiguously. Of the
> > resulting shared memory, the first half is dedicated to the **in** array
> > and the second half to the **out** array. They are used as circular
> > buffers for transferring data, and, together, they are the data ring.
> > 
> > 
> >   +---------------------------+                 Indexes page
> >   | Command ring:             |                 +----------------------+
> >   | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
>       ^-- The first 64 bytes are reserved for the in_cons, etc.
>            Perhaps just start at @64 (And naturally add that to the 'ref')
> 
>       
> >   | @44: ref  +-------------------------------->+@76: ring_order = 1   |
> >   |                           |                 |@80: ref[0]+          |
> >   +---------------------------+                 |@84: ref[1]+          |
> >                                                 |           |          |
> >                                                 |           |          |
> >                                                 +----------------------+
> >                                                             |
> >                                                             v (data ring)
> >                                                     +-------+-----------+
> >                                                     |  @0->4095: in     |
> >                                                     |  ref[0]           |
> >                                                     |-------------------|
> >                                                     |  @4096->8191: out |
> >                                                     |  ref[1]           |
> >                                                     +-------------------+
> >  
> > 
> 
> Thank you!

You are welcome :-)


> > #### Indexes Page Structure
> > 
> >     typedef uint32_t PVCALLS_RING_IDX;
> > 
> >     struct pvcalls_data_intf {
> >             PVCALLS_RING_IDX in_cons, in_prod;
> >             int32_t in_error;
> 
> You don't want to perhaps include in_event?
> > 
> >             uint8_t pad[52];
> > 
> >             PVCALLS_RING_IDX out_cons, out_prod;
> >             int32_t out_error;
> 
> And out_event as way to do some form of interrupt mitigation
> (similar to what you had proposed?)

Yes, the in_event / out_event optimization that I wrote for the 9pfs
protocol could work here too. However, I thought you preferred to remove
it for now as it is not required and increases complexity?

We could always add it later, if we reserved some padding here for it.
Something like:

   struct pvcalls_data_intf {
        PVCALLS_RING_IDX in_cons, in_prod;
        int32_t in_error;

        uint8_t pad1[52];

        PVCALLS_RING_IDX out_cons, out_prod;
        int32_t out_error;

        uint8_t pad2[52]; /* <--- this is new */

        uint32_t ring_order;
        grant_ref_t ref[];
   };

We have plenty of space for the grant refs anyway. This way, we can
introduce in_event and out_event by eating up 4 bytes from each pad
array.
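
The resulting layout can be sanity-checked at compile time. This is a sketch
under a few assumptions: 4 KiB pages, a 32-bit `grant_ref_t`, and the two
padding arrays named `pad1` and `pad2` (C does not allow two members with
the same name). `INDEXES_PAGE_SIZE` and `pvcalls_max_refs` are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;
typedef uint32_t grant_ref_t;          /* assumption: 32-bit grant reference */

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;
    uint8_t pad1[52];                  /* room for a future in_event */
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;
    uint8_t pad2[52];                  /* room for a future out_event */
    uint32_t ring_order;
    grant_ref_t ref[];
};

#define INDEXES_PAGE_SIZE 4096u        /* assumption: 4 KiB pages */

/* The in and out index groups each start on their own 64-byte boundary. */
_Static_assert(offsetof(struct pvcalls_data_intf, out_cons) == 64, "out group");
_Static_assert(offsetof(struct pvcalls_data_intf, ring_order) == 128, "order");
_Static_assert(offsetof(struct pvcalls_data_intf, ref) == 132, "refs");

/* Number of grant references that still fit in the indexes page. */
static unsigned int pvcalls_max_refs(void)
{
    return (unsigned int)((INDEXES_PAGE_SIZE -
                           offsetof(struct pvcalls_data_intf, ref)) /
                          sizeof(grant_ref_t));
}
```

With these assumptions the page still holds 991 grant references, enough for
a ring of order 9 (512 pages), so the extra 52 bytes of padding cost very
little.
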


> > 
> >             uint32_t ring_order;
> >             grant_ref_t ref[];
> >     };
> > 
> >     /* not actually C compliant (ring_order changes from socket to socket) */
> >     struct pvcalls_data {
> >         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
> >         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
> >     };
> > 
> > - **ring_order**
> >   It represents the order of the data ring. The following list of grant
> >   references is of `(1 << ring_order)` elements. It cannot be greater than
> >   **max-page-order**, as specified by the backend on XenBus. It has to
> >   be one at minimum.
> 
> Oh? Why not zero? (4KB) as the 'max-page-order' has an example of zero order?
> Perhaps if it MUST be one or more then the 'max-page-order' should say
> that at least it MUST be one?

So that each in and out array gets to have its own dedicated page,
although I don't think it's strictly necessary. With zero, they would
get half a page each.
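
As a worked example of the sizes involved, here is a sketch that follows the
`struct pvcalls_data` expression quoted above, assuming 4 KiB pages.
`PVCALLS_PAGE_SHIFT` and the two helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Number of pages (and grant references) in a data ring of this order. */
static size_t pvcalls_ring_pages(unsigned int ring_order)
{
    return (size_t)1 << ring_order;
}

/* Size in bytes of each of the in and out arrays: half of the data ring. */
static size_t pvcalls_array_size(unsigned int ring_order)
{
    return (((size_t)1 << ring_order) << PVCALLS_PAGE_SHIFT) / 2;
}
```

With these assumptions, order 0 gives 2048 bytes per array (half a page
each), order 1 gives 4096 bytes (one dedicated page each), and order 2 gives
8192 bytes.
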

 
> > - **ref[]**
> >   The list of grant references which will contain the actual data. They are
> >   mapped contiguously in virtual memory. The first half of the pages is the
> >   **in** array, the second half is the **out** array. The arrays must
> >   have a power of two size. Together, their size is `(1 << ring_order) *
> >   PAGE_SIZE`.
> > - **in** is an array used as circular buffer
> >   It contains data read from the socket. The producer is the backend, the
> >   consumer is the frontend.
> > - **out** is an array used as circular buffer
> >   It contains data to be written to the socket. The producer is the
> >   frontend, the consumer is the backend.
> > - **in_cons** and **in_prod**
> >   Consumer and producer indexes for data read from the socket. They keep
> >   track of how much data has already been consumed by the frontend from
> >   the **in** array. **in_prod** is increased by the backend, after writing
> >   data to **in**. **in_cons** is increased by the frontend, after reading
> >   data from **in**.
> > -ring-page-order
> 
> ??? 

Sorry, it must have been a copy/paste error.
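
Coming back to **in_cons** and **in_prod**: since the arrays have a power of
two size, the index arithmetic can be sketched as below. This assumes the
indexes are free-running 32-bit counters that are masked on access, as in
other Xen rings; that detail is not spelled out in the quoted text, and the
helper names `ring_mask` and `ring_queued` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Offset into an array of `array_size` bytes (must be a power of two). */
static uint32_t ring_mask(PVCALLS_RING_IDX idx, uint32_t array_size)
{
    assert(array_size != 0 && (array_size & (array_size - 1)) == 0);
    return idx & (array_size - 1);
}

/* Bytes produced but not yet consumed. Unsigned subtraction stays correct
 * when the producer index has wrapped around 2^32 and the consumer index
 * has not. */
static uint32_t ring_queued(PVCALLS_RING_IDX prod, PVCALLS_RING_IDX cons)
{
    return prod - cons;
}
```

Under that assumption the frontend would compute
`ring_queued(in_prod, in_cons)` to learn how much it can read from **in**,
and start copying at offset `ring_mask(in_cons, array_size)`.
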

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
