
[Xen-devel] [DRAFT 1] XenSock protocol design document



Hi all,

as promised, this is the design document for the XenSock protocol I
mentioned here:

http://marc.info/?l=xen-devel&m=146520572428581

It is still in its early days, but it should give you a good idea of what
the protocol looks like and how it is supposed to work. Let me know if you
find gaps in the document and I'll fill them in the next version.

You can find prototypes of the Linux frontend and backend drivers here:

git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1

To use them, make sure to enable CONFIG_XENSOCK in your kernel config
and add "xensock=1" to the command line of your DomU Linux kernel. You
also need the toolstack to create the initial xenstore nodes for the
protocol. To do that, please apply the attached patch to libxl (the
patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
file.

Feel free to try them out! Please be kind, they are only prototypes with
a few known issues :-) But they should work well enough to run simple
tests. If you find something missing, let me know or, even better, write
a patch!

I'll follow up with a separate document to cover the design of my
particular implementation of the protocol.

Cheers,

Stefano

---

# XenSocks Protocol v1

## Rationale

XenSocks is a paravirtualized protocol for the POSIX socket API.

The purpose of XenSocks is to allow a specific set of POSIX socket calls to be
implemented in a domain other than the caller's own: connect, accept, bind,
release, listen, poll, recvmsg and sendmsg are all carried out in another
domain.

XenSocks provides the following benefits:
* guest networking works out of the box with VPNs, wireless networks and
  any other complex configurations on the host
* guest services listen on ports bound directly to the backend domain IP
  addresses
* localhost becomes a secure namespace for inter-VM communications
* full visibility of the guest behavior on the backend domain, allowing
  for inexpensive filtering and manipulation of any guest calls
* excellent performance


## Design

### Xenstore

The frontend and the backend connect to each other by exchanging information via
xenstore. The toolstack creates front and back nodes with state
XenbusStateInitialising. There can only be one XenSock frontend per domain.

#### Frontend XenBus Nodes

port
     Values:         <uint32_t>

     The identifier of the Xen event channel used to signal activity
     in the ring buffer.

ring-ref
     Values:         <uint32_t>

     The Xen grant reference granting permission for the backend to map
     the sole page in a single page sized ring buffer.


#### State Machine

    **Front**                             **Back**
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.                          |
    - Allocate and initialize the                        |
      request ring.                                      V
    - Publish transport parameters                XenbusStateInitWait
      that will be in effect during
      this connection.
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel, which is
used to send notifications.


### Commands Ring

The shared ring is used by the frontend to forward socket API calls to the
backend. I'll refer to this ring as the **commands ring**, to distinguish it
from other rings which will be created later in the lifecycle of the protocol
(data rings). The ring format is defined using the familiar `DEFINE_RING_TYPES`
macro (`xen/include/public/io/ring.h`). Frontend requests are allocated on the
ring using the `RING_GET_REQUEST` macro.

The format is defined as follows:

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
    
    #define XENSOCK_CONNECT        0
    #define XENSOCK_RELEASE        3
    #define XENSOCK_BIND           4
    #define XENSOCK_LISTEN         5
    #define XENSOCK_ACCEPT         6
    #define XENSOCK_POLL           7
    
    struct xen_xensock_request {
        uint32_t id;     /* private to guest, echoed in response */
        uint32_t cmd;    /* command to execute */
        uint64_t sockid; /* id of the socket */
        union {
            struct xen_xensock_connect {
                uint8_t addr[28];
                uint32_t len;
                uint32_t flags;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } connect;
            struct xen_xensock_bind {
                uint8_t addr[28]; /* ipv6 ready */
                uint32_t len;
            } bind;
            struct xen_xensock_accept {
                uint64_t sockid;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } accept;
        } u;
    };

The first three fields are common to every command. Their binary layout
is:

    0       4       8       12      16
    +-------+-------+-------+-------+
    |  id   |  cmd  |     sockid    |
    +-------+-------+-------+-------+

- **id** is generated by the frontend and identifies one specific request
- **cmd** is the command requested by the frontend:
    - `XENSOCK_CONNECT`: 0
    - `XENSOCK_RELEASE`: 3
    - `XENSOCK_BIND`:    4
    - `XENSOCK_LISTEN`:  5
    - `XENSOCK_ACCEPT`:  6
    - `XENSOCK_POLL`:    7
- **sockid** is generated by the frontend and identifies the socket to connect,
  bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
  commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
  socket.
  
All three fields are echoed back by the backend.

As with other Xen ring-based protocols, after writing a request to the ring,
the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
channel notification if one is required.
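
As an illustration, here is a minimal sketch of the frontend side, assuming a
Linux frontend: the ring is declared with the standard macro (the response
struct is the one defined below), and `ring`, `cmd_evtchn` and
`xensock_submit_request` are illustrative names, not part of the protocol.

    #include <xen/interface/io/ring.h>   /* DEFINE_RING_TYPES, RING_* macros */
    #include <xen/events.h>              /* notify_remote_via_evtchn() */

    DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
                      struct xen_xensock_response);

    static struct xen_xensock_front_ring ring;  /* built on the shared page */

    /* Copy a prepared request onto the commands ring and notify the
     * backend if required; cmd_evtchn is the event channel advertised
     * in the "port" xenstore node. */
    static void xensock_submit_request(struct xen_xensock_request *r,
                                       int cmd_evtchn)
    {
        struct xen_xensock_request *req;
        int notify;

        req = RING_GET_REQUEST(&ring, ring.req_prod_pvt);
        *req = *r;                   /* id, cmd, sockid and the union */
        ring.req_prod_pvt++;

        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&ring, notify);
        if (notify)
                notify_remote_via_evtchn(cmd_evtchn);
    }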

Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
The format is the following:

    struct xen_xensock_response {
        uint32_t id;
        uint32_t cmd;
        uint64_t sockid;
        int32_t ret;
    };
   
    0       4       8       12      16      20
    +-------+-------+-------+-------+-------+
    |  id   |  cmd  |     sockid    |  ret  |
    +-------+-------+-------+-------+-------+

- **id**: echoed back from request
- **cmd**: echoed back from request
- **sockid**: echoed back from request
- **ret**: return value, identifies success or failure

After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
it needs to notify the frontend and does so via event channel.
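
The backend side is symmetric; a similarly illustrative sketch, assuming a back
ring already mapped from the frontend's `ring-ref`:

    /* Illustrative only: echo the three common fields of the request
     * back to the frontend together with the operation's return value. */
    static struct xen_xensock_back_ring back_ring;  /* mapped from ring-ref */

    static void xensock_send_response(struct xen_xensock_request *req,
                                      int32_t ret, int cmd_evtchn)
    {
        struct xen_xensock_response *rsp;
        int notify;

        rsp = RING_GET_RESPONSE(&back_ring, back_ring.rsp_prod_pvt);
        rsp->id = req->id;           /* echoed back */
        rsp->cmd = req->cmd;         /* echoed back */
        rsp->sockid = req->sockid;   /* echoed back */
        rsp->ret = ret;
        back_ring.rsp_prod_pvt++;

        RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&back_ring, notify);
        if (notify)
                notify_remote_via_evtchn(cmd_evtchn);
    }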

A description of each command, its additional request fields and the
expected response follows.


#### Connect

The **connect** operation corresponds to the connect system call. It connects a
socket to the specified address. **sockid** is freely chosen by the frontend and
references this specific socket from this point forward.

The connect operation creates a new shared ring, which we'll call the **data
ring**. The new ring is used to send and receive data over the connected socket.
The information necessary to set up the new ring, such as grant table references
and event channel ports, is passed from the frontend to the backend as part of
this request. A **data ring** is unmapped and freed when a **release** command
is issued on the active socket identified by **sockid**.
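
As an illustration of how a Linux frontend might produce that information, the
sketch below grants the data ring pages and allocates an unbound event channel
using the Linux grant table and xenbus helpers; `dev`, `data` and
`xensock_fill_connect` are illustrative names, error unwinding is omitted, and
this is not meant as the reference implementation.

    #include <xen/grant_table.h>   /* gnttab_grant_foreign_access() */
    #include <xen/xenbus.h>        /* xenbus_alloc_evtchn(), struct xenbus_device */
    #include <xen/page.h>          /* virt_to_gfn() */

    /* Grant the 64 pages of a freshly allocated, physically contiguous
     * data ring to the backend and allocate an unbound event channel,
     * filling the connect union of the request. */
    static int xensock_fill_connect(struct xenbus_device *dev, void *data,
                                    struct xen_xensock_request *req)
    {
        int i, err, evtchn;

        for (i = 0; i < XENSOCK_DATARING_PAGES; i++)
                req->u.connect.ref[i] = gnttab_grant_foreign_access(
                        dev->otherend_id,
                        virt_to_gfn(data + i * PAGE_SIZE),
                        0 /* read-write */);

        err = xenbus_alloc_evtchn(dev, &evtchn);
        if (err)
                return err;
        req->u.connect.evtchn = evtchn;
        return 0;
    }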

When the frontend issues a **connect** command, the backend:
- creates a new socket and connects it to **addr**
- creates an internal mapping from **sockid** to its own socket
- maps all the grant references and uses them as shared memory for the new data
  ring
- binds the **evtchn**
- replies to the frontend

The data ring format is described in the **Data ring** section below.

Fields:

- **cmd** value: 0
- additional fields:
  - **addr**: address to connect to, in struct sockaddr format
  - **len**: address length
  - **flags**: flags for the connection, reserved for future usage
  - **ref**: grant references of the data ring
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |                            addr                       |  len  |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[63]|evtchn |  
        +-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the socket system call

#### Release

The **release** operation closes an existing active or passive socket.

When a release command is issued on a passive socket, the backend releases it
and frees its internal mappings. When a release command is issued for an active
socket, the data ring is also unmapped and freed:

- frontend sends release command for an active socket
- backend releases the socket
- backend unmaps the ring
- backend unbinds the evtchn
- backend replies to frontend
- frontend frees ring and unbinds evtchn

Fields:

- **cmd** value: 3
- additional fields: none

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the shutdown system call

#### Bind

The **bind** operation assigns the address passed as parameter to the socket.
It corresponds to the bind system call. **sockid** is freely chosen by the
frontend and references this specific socket from this point forward. **Bind**,
**listen** and **accept** are the three operations required to have fully
working passive sockets and should be issued in this order.
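
For example, a frontend could fill the request as in the sketch below
(illustrative only; `xensock_fill_bind` is not part of the protocol). For an
IPv4 bind the caller would pass a struct sockaddr_in and its size, which fits
comfortably in the 28 byte **addr** field.

    #include <linux/socket.h>   /* struct sockaddr */
    #include <linux/string.h>   /* memcpy() */

    /* Illustrative only: fill the common fields and the bind union. */
    static void xensock_fill_bind(struct xen_xensock_request *req,
                                  uint32_t id, uint64_t sockid,
                                  const struct sockaddr *sa, uint32_t len)
    {
        req->id = id;            /* freely generated by the frontend */
        req->cmd = XENSOCK_BIND;
        req->sockid = sockid;    /* freely chosen by the frontend */
        memcpy(req->u.bind.addr, sa, len);   /* at most 28 bytes */
        req->u.bind.len = len;
    }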

Fields:

- **cmd** value: 4
- additional fields:
  - **addr**: address to bind to, in struct sockaddr format
  - **len**: address length

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |                            addr                       |  len  |
        +-------+-------+-------+-------+-------+-------+-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the bind system call


#### Listen

The **listen** operation marks the socket as a passive socket. It corresponds to
the listen system call.

Fields:

- **cmd** value: 5
- additional fields: none

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the listen system call


#### Accept

The **accept** operation extracts the first connection request on the queue of
pending connections for the listening socket identified by **sockid** and
creates a new connected socket. The **sockid** of the new socket is also chosen
by the frontend and passed as an additional field of the accept request struct.

Similarly to the **connect** operation, **accept** creates a new data ring. The
information necessary to set up the new ring, such as grant table references and
event channel ports, is passed from the frontend to the backend as part of
the request.

The backend will reply to the request only when a new connection is successfully
accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.

Example workflow:

- frontend issues an **accept** request
- backend waits for a connection to be available on the socket
- a new connection becomes available
- backend accepts the new connection
- backend creates an internal mapping from **sockid** to the new socket
- backend maps all the grant references and uses them as shared memory for the
  new data ring
- backend binds the **evtchn**
- backend replies to the frontend

Fields:

- **cmd** value: 6
- additional fields:
  - **sockid**: id of the new socket
  - **ref**: grant references of the data ring
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

        16      20      24      28      32      36      40      44     48
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] | 
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |ref[62]|ref[63]|evtchn | 
        +-------+-------+-------+

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the accept system call


#### Poll

The **poll** operation is only valid for passive sockets. For active sockets,
the frontend should look at the state of the data ring. When a new connection is
available in the queue of the passive socket, the backend generates a response
and notifies the frontend.
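
To illustrate the active-socket case, a frontend could derive readiness
directly from the data ring described below. The mapping to poll events in this
sketch is an assumption, not something mandated by the protocol, and
`xensock_poll_active` is an illustrative name.

    #include <linux/poll.h>   /* POLLIN, POLLOUT */

    /* Illustrative only: compute poll events for an active socket from
     * the data ring state (struct xensock_ring_intf and
     * xensock_ring_queued() are defined in the Data ring section). */
    static unsigned int xensock_poll_active(struct xensock_ring_intf *intf)
    {
        unsigned int mask = 0;

        if (intf->in_error || xensock_ring_queued(intf->in_prod,
                        intf->in_cons, sizeof(intf->in)) > 0)
                mask |= POLLIN;    /* data, or an error, to report */
        if (!intf->out_error && xensock_ring_queued(intf->out_prod,
                        intf->out_cons, sizeof(intf->out)) < sizeof(intf->out))
                mask |= POLLOUT;   /* room left in the out buffer */
        return mask;
    }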

Fields:

- **cmd** value: 7
- additional fields: none

Return value:

  - 0 on success
  - less than 0 on failure, see the error codes of the poll system call


### Data ring

Data rings are used for sending and receiving data over a connected socket. They
are created upon a successful **accept** or **connect** command. The ring works
in a similar way to the existing Xen console ring.

#### Format

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
    typedef uint32_t XENSOCK_RING_IDX;
    
    struct xensock_ring_intf {
        char in[XENSOCK_DATARING_SIZE/4];
        char out[XENSOCK_DATARING_SIZE/2];
        XENSOCK_RING_IDX in_cons, in_prod;
        XENSOCK_RING_IDX out_cons, out_prod;
        int32_t in_error, out_error;
    };

The design is flexible and can support different ring sizes (at compile time).
The following description is based on order 6 rings, chosen because they provide
excellent performance.

- **in** is an array of 65536 bytes, used as a circular buffer.
  It contains data read from the socket. The producer is the backend, the
  consumer is the frontend.
- **out** is an array of 131072 bytes, used as a circular buffer.
  It contains data to be written to the socket. The producer is the frontend,
  the consumer is the backend.
- **in_cons** and **in_prod** are the consumer and producer pointers for data
  read from the socket. They keep track of how much data has already been
  consumed by the frontend from the **in** array. **in_prod** is increased by
  the backend after writing data to **in**. **in_cons** is increased by the
  frontend after reading data from **in**.
- **out_cons** and **out_prod** are the consumer and producer pointers for data
  to be written to the socket. They keep track of how much data has been written
  by the frontend to **out** and how much of it has already been consumed by the
  backend. **out_prod** is increased by the frontend after writing data to
  **out**. **out_cons** is increased by the backend after reading data from
  **out**.
- **in_error** and **out_error** signal errors when reading from the socket
  (**in_error**) or writing to it (**out_error**). 0 means no error. When an
  error occurs, no further read or write operations are performed on the socket.
  In the case of an orderly socket shutdown (i.e. read returns 0), **in_error**
  is set to -ENOTCONN. **in_error** and **out_error** are never set to -EAGAIN
  or -EWOULDBLOCK.

The binary layout follows:

    0          65536            196608    196612    196616    196620    196624    196628    196632
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
    |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+

#### Workflow

The **in** and **out** arrays are used as circular buffers:
    
    0                               sizeof(array)
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently
left unconsumed in an array:

    #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))

    /* Return the number of unconsumed bytes between cons and prod. The
     * indexes are free running: equal indexes mean an empty ring, while
     * indexes that are equal only after masking mean a full ring. */
    static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
                XENSOCK_RING_IDX cons,
                XENSOCK_RING_IDX ring_size)
    {
        XENSOCK_RING_IDX size;
    
        if (prod == cons)
                return 0;
    
        prod = _MASK_XENSOCK_IDX(prod, ring_size);
        cons = _MASK_XENSOCK_IDX(cons, ring_size);
    
        if (prod == cons)
                return ring_size;
    
        if (prod > cons)
                size = prod - cons;
        else {
                size = ring_size - cons;
                size += prod;
        }
        return size;
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *cons*, *prod*, *error* from shared memory
- memory barrier
- return on *error*
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *prod*
- notify the other end via evtchn
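
A sketch of the producer path, following the steps above for the frontend
writing to **out** (illustrative names; rmb()/wmb() stand in for the required
memory barriers and the event channel notification is not shown):

    /* Illustrative only: write up to len bytes into the out array.
     * Returns the number of bytes written, or the error reported by the
     * backend in out_error. */
    static int xensock_write_out(struct xensock_ring_intf *intf,
                                 const char *buf, XENSOCK_RING_IDX len)
    {
        XENSOCK_RING_IDX cons, prod, space, i;
        const XENSOCK_RING_IDX size = sizeof(intf->out);

        cons = intf->out_cons;
        prod = intf->out_prod;
        rmb();                     /* read the indexes and error first */

        if (intf->out_error)
                return intf->out_error;

        space = size - xensock_ring_queued(prod, cons, size);
        if (len > space)
                len = space;       /* write only up to cons */

        for (i = 0; i < len; i++)  /* wraps around the circular buffer */
                intf->out[_MASK_XENSOCK_IDX(prod + i, size)] = buf[i];

        wmb();                     /* data must be visible before prod moves */
        intf->out_prod = prod + len;
        /* notify the other end via the data ring event channel (not shown) */
        return len;
    }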

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *prod*, *cons*, *error* from shared memory
- memory barrier
- return on *error*
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- memory barrier
- increase *cons*
- notify the other end via evtchn
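
The matching consumer path, for the frontend reading from **in** (again an
illustrative sketch):

    /* Illustrative only: read up to len bytes from the in array.
     * Returns the number of bytes read, or the error reported by the
     * backend in in_error. */
    static int xensock_read_in(struct xensock_ring_intf *intf,
                               char *buf, XENSOCK_RING_IDX len)
    {
        XENSOCK_RING_IDX cons, prod, queued, i;
        const XENSOCK_RING_IDX size = sizeof(intf->in);

        cons = intf->in_cons;
        prod = intf->in_prod;
        rmb();                     /* read the indexes and error first */

        if (intf->in_error)
                return intf->in_error;

        queued = xensock_ring_queued(prod, cons, size);
        if (len > queued)
                len = queued;      /* read only up to prod */

        for (i = 0; i < len; i++)  /* wraps around the circular buffer */
                buf[i] = intf->in[_MASK_XENSOCK_IDX(cons + i, size)];

        mb();                      /* finish reading before releasing the space */
        intf->in_cons = cons + len;
        /* notify the other end via the data ring event channel (not shown) */
        return len;
    }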

The producer takes care of writing no more bytes than there is free space in the
buffer, i.e. up to *cons*. The consumer takes care of reading no more bytes than
are available in the buffer, i.e. up to *prod*. *error* is set by the backend
when an error occurs while writing to or reading from the socket.

Attachment: xensock-libxl
Description: Text document
