Re: [Xen-devel] [DRAFT v3] XenSock protocol design document
ping

On Wed, 20 Jul 2016, Stefano Stabellini wrote:
> Hi all,
>
> This is the design document of the XenSock protocol. You can find
> prototypes of the Linux frontend and backend drivers here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-3
>
> To use them, make sure to enable CONFIG_XENSOCK in your kernel config
> and add "xensock=1" to the command line of your DomU Linux kernel. You
> also need the toolstack to create the initial xenstore nodes for the
> protocol. To do that, please apply the attached patch to libxl (the
> patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
> file.
>
> Cheers,
>
> Stefano
>
>
> Changes in v3:
> - add a dummy element to struct xen_xensock_request to make sure the
>   size of the struct is the same on both x86_32 and x86_64
>
> Changes in v2:
> - add max-dataring-page-order
> - add "Publish backend features and transport parameters" to backend
>   xenbus workflow
> - update new cmd values
> - update xen_xensock_request
> - add backlog parameter to listen and binary layout
> - add description of new data ring format (interface+data)
> - modify connect and accept to reflect new data ring format
> - add link to POSIX docs
> - add error numbers
> - add address format section and relevant numeric definitions
> - add explicit mention of unimplemented commands
> - add protocol node name
> - add xenbus shutdown diagram
> - add socket operation
>
> ---
>
>
> # XenSocks Protocol v1
>
> ## Rationale
>
> XenSocks is a paravirtualized protocol for the POSIX socket API.
>
> The purpose of XenSocks is to allow the implementation of a specific set
> of POSIX functions to be done in a domain other than your own. It allows
> connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
> implemented in another domain.
>
> XenSocks provides the following benefits:
> * guest networking works out of the box with VPNs, wireless networks and
>   any other complex configurations on the host
> * guest services listen on ports bound directly to the backend domain IP
>   addresses
> * localhost becomes a secure namespace for inter-VM communication
> * full visibility of the guest behavior on the backend domain, allowing
>   for inexpensive filtering and manipulation of any guest calls
> * excellent performance
>
>
> ## Design
>
> ### Xenstore
>
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. The protocol node name is **xensock**. There can only
> be one XenSock frontend per domain.
>
> #### Frontend XenBus Nodes
>
> port
>      Values:         <uint32_t>
>
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
>
> ring-ref
>      Values:         <uint32_t>
>
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.
>
> #### Backend XenBus Nodes
>
> max-dataring-page-order
>      Values:         <uint32_t>
>
>      The maximum supported size of the data ring, expressed as the base-2
>      logarithm of the number of machine pages
>      (e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages, etc.).
>
> #### State Machine
>
> Initialization:
>
>     *Front*                               *Back*
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.           - Publish backend features
>     - Allocate and initialize the           and transport parameters
>       request ring.                                      |
>     - Publish transport parameters                       |
>       that will be in effect during                      V
>       this connection.                            XenbusStateInitWait
>                  |
>                  |
>                  V
>        XenbusStateInitialised
>
>                                           - Query frontend transport
>                                             parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
>
>     - Query backend device properties.
>     - Finalize OS virtual device
>       instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
>
> Once frontend and backend are connected, they have a shared page, which
> is used to exchange messages over a ring, and an event channel,
> which is used to send notifications.
>
> Shutdown:
>
>     *Front*                            *Back*
>     XenbusStateConnected               XenbusStateConnected
>                 |
>                 |
>                 V
>       XenbusStateClosing
>
>                                        - Unmap grants
>                                        - Unbind evtchns
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosing
>
>     - Unbind evtchns
>     - Free rings
>     - Free data structures
>                |
>                |
>                V
>        XenbusStateClosed
>
>                                        - Free remaining data structures
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosed
>
>
> ### Commands Ring
>
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES`
> macro (`xen/include/public/io/ring.h`). Frontend requests are allocated on
> the ring using the `RING_GET_REQUEST` macro.
>
> The format is defined as follows:
>
>     #define XENSOCK_SOCKET         0
>     #define XENSOCK_CONNECT        1
>     #define XENSOCK_RELEASE        2
>     #define XENSOCK_BIND           3
>     #define XENSOCK_LISTEN         4
>     #define XENSOCK_ACCEPT         5
>     #define XENSOCK_POLL           6
>
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid;
>         union {
>             struct xen_xensock_socket {
>                 uint32_t domain;
>                 uint32_t type;
>                 uint32_t protocol;
>             } socket;
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28];
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_listen {
>                 uint32_t backlog;
>             } listen;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } accept;
>             /* dummy member to force sizeof(struct xen_xensock_request) to
>                match across archs */
>             struct xen_xensock_dummy {
>                 uint8_t dummy[48];
>             } dummy;
>         } u;
>     };
>
> The first three fields are common for every command. Their binary layout
> is:
>
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |    sockid     |
>     +-------+-------+-------+-------+
>
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_SOCKET`:  0
>     - `XENSOCK_CONNECT`: 1
>     - `XENSOCK_RELEASE`: 2
>     - `XENSOCK_BIND`:    3
>     - `XENSOCK_LISTEN`:  4
>     - `XENSOCK_ACCEPT`:  5
>     - `XENSOCK_POLL`:    6
> - **sockid** is generated by the frontend and identifies the socket to
>   connect, bind, etc. A new sockid is required on the `XENSOCK_SOCKET`
>   command. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>   socket.
>
> All three fields are echoed back by the backend.
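
To make the v3 sizing argument concrete, here is a minimal sketch that restates the request struct and checks its layout at compile time. It assumes `grant_ref_t` is a 32-bit integer (as in Xen's public headers) and that the compiler supports C11 `static_assert`; the checks themselves are my addition, not part of the draft.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: grant_ref_t is 32 bits wide, as in Xen's public headers. */
typedef uint32_t grant_ref_t;

struct xen_xensock_request {
    uint32_t id;     /* private to guest, echoed in response */
    uint32_t cmd;    /* command to execute */
    uint64_t sockid;
    union {
        struct xen_xensock_socket {
            uint32_t domain;
            uint32_t type;
            uint32_t protocol;
        } socket;
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28];
            uint32_t len;
        } bind;
        struct xen_xensock_listen {
            uint32_t backlog;
        } listen;
        struct xen_xensock_accept {
            uint64_t sockid;
            grant_ref_t ref;
            uint32_t evtchn;
        } accept;
        /* pads the union to 48 bytes regardless of uint64_t alignment */
        struct xen_xensock_dummy {
            uint8_t dummy[48];
        } dummy;
    } u;
};

/* Common header occupies bytes 0..15, command payload starts at 16. */
static_assert(offsetof(struct xen_xensock_request, cmd) == 4, "cmd at 4");
static_assert(offsetof(struct xen_xensock_request, sockid) == 8, "sockid at 8");
static_assert(offsetof(struct xen_xensock_request, u) == 16, "payload at 16");

/* 16 bytes of header + 48 bytes of union, with or without 8-byte alignment. */
static_assert(sizeof(struct xen_xensock_request) == 64, "request is 64 bytes");
```

Without the dummy member the largest payload is `connect` at 44 bytes, which an ABI that aligns `uint64_t` to 8 bytes rounds up to 48 while a 4-byte-alignment ABI does not, so the two would disagree on the total size.
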
>
> As for the other Xen ring based protocols, after writing a request to the
> ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an
> event channel notification when a notification is required.
>
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro. The format is the following:
>
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |    sockid     |  ret  |
>     +-------+-------+-------+-------+-------+
>
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success (0) or failure (see error numbers
>   below). If the **cmd** is not supported by the backend, ret is ENOTSUPP.
>
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via event channel.
>
> A description of each command, its additional request fields and the
> expected responses follows.
>
>
> #### Socket
>
> The **socket** operation corresponds to the POSIX [socket][socket] function.
> It creates a new socket of the specified family, type and protocol.
> **sockid** is freely chosen by the frontend and references this specific
> socket from this point forward. See "Socket families and address format"
> below.
>
> Fields:
>
> - **cmd** value: 0
> - additional fields:
>   - **domain**: the communication domain
>   - **type**: the socket type
>   - **protocol**: the particular protocol to be used with the socket, usually
>     0
>
> Binary layout:
>
>     16       20       24       28
>     +--------+--------+--------+
>     | domain |  type  |protocol|
>     +--------+--------+--------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX socket function][socket] for error names; the
>   corresponding error numbers are specified later in this document.
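
Going back to the response format for a moment: a minimal sketch of how a backend might fill a response and how a frontend might match it to its request. The helper names, the `xs_req_hdr` type and the `XS_ENOTSUPP` constant are hypothetical; the only things taken from the draft are the echoed fields, the command numbers and the -524 value for ENOTSUPP.

```c
#include <assert.h>
#include <stdint.h>

#define XENSOCK_SOCKET   0
#define XENSOCK_POLL     6      /* highest command number in this draft */
#define XS_ENOTSUPP      524    /* hypothetical name; value from the error table */

struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

/* Hypothetical view of the three request fields common to every command. */
struct xs_req_hdr {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
};

/* Backend side: echo id/cmd/sockid; unknown commands get -ENOTSUPP. */
static void xs_fill_response(const struct xs_req_hdr *req, int32_t ret,
                             struct xen_xensock_response *rsp)
{
    rsp->id = req->id;
    rsp->cmd = req->cmd;
    rsp->sockid = req->sockid;
    rsp->ret = (req->cmd > XENSOCK_POLL) ? -XS_ENOTSUPP : ret;
}

/* Frontend side: a response belongs to a request if all three fields match. */
static int xs_response_matches(const struct xs_req_hdr *req,
                               const struct xen_xensock_response *rsp)
{
    return rsp->id == req->id && rsp->cmd == req->cmd &&
           rsp->sockid == req->sockid;
}
```
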
>
> #### Connect
>
> The **connect** operation corresponds to the POSIX [connect][connect]
> function. It connects a previously created socket (identified by
> **sockid**) to the specified address.
>
> The connect operation creates a new shared ring, which we'll call **data
> ring**. The data ring is used to send and receive data from the socket.
> The connect operation passes two additional parameters which are
> utilized to set up the new ring: **evtchn** and **ref**. **evtchn** is the
> port number of a new event channel which will be used for notifications
> of activity on the data ring. **ref** is the grant reference of a page
> which contains shared pointers to write and read data from the data ring
> and the full array of grant references for the ring buffers. It will be
> described in more detail later. The data ring is unmapped and freed upon
> issuing a **release** command on the active socket identified by **sockid**.
>
> When the frontend issues a **connect** command, the backend:
> - finds its own internal socket corresponding to **sockid**
> - connects the socket to **addr**
> - maps the grant reference **ref**; the shared page contains the data
>   ring interface (`struct xensock_data_intf`)
> - maps all the grant references listed in `struct xensock_data_intf` and
>   uses them as shared memory for the ring buffers
> - binds the **evtchn**
> - replies to the frontend
>
> The data ring format will be described in the following section.
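
On the request side, a minimal sketch of how a frontend might fill the connect payload for an AF_INET address. Assumptions on my part: `grant_ref_t` is 32 bits, port and address are stored in network byte order (following POSIX), and **len** carries the size of the 28-byte address structure defined in this document. `xs_sockaddr_in`, `XS_AF_INET` and `xs_fill_connect` are hypothetical names chosen to avoid clashing with system headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t grant_ref_t;   /* assumption: 32-bit grant references */

#define XS_AF_INET 2            /* same value as AF_INET in this document */

struct xen_xensock_connect {
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    grant_ref_t ref;
    uint32_t evtchn;
};

/* Mirrors the document's AF_INET address format (28 bytes in total). */
struct xs_sockaddr_in {
    uint16_t family;
    uint16_t port;        /* network byte order (assumption) */
    uint32_t addr;        /* network byte order (assumption) */
    char zero[20];
};

static_assert(sizeof(struct xs_sockaddr_in) == 28, "must fit addr[28]");

static void xs_fill_connect(struct xen_xensock_connect *c,
                            const uint8_t ip[4], uint16_t port,
                            grant_ref_t ref, uint32_t evtchn)
{
    struct xs_sockaddr_in sin;
    /* Build the big-endian port by hand so no byte-order helper is needed. */
    const uint8_t port_be[2] = { (uint8_t)(port >> 8), (uint8_t)(port & 0xff) };

    memset(&sin, 0, sizeof(sin));
    sin.family = XS_AF_INET;
    memcpy(&sin.port, port_be, sizeof(port_be));
    memcpy(&sin.addr, ip, 4);

    memset(c, 0, sizeof(*c));
    memcpy(c->addr, &sin, sizeof(sin));
    c->len = sizeof(sin);   /* assumption: length of the structure above */
    c->flags = 0;           /* reserved for future usage */
    c->ref = ref;           /* page holding struct xensock_data_intf */
    c->evtchn = evtchn;     /* event channel for the new data ring */
}
```
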
>
> Fields:
>
> - **cmd** value: 1
> - additional fields:
>   - **addr**: address to connect to, see the address format section for more
>     information
>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant reference of the page containing `struct
>     xensock_data_intf`
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>
>
> Binary layout:
>
>     16      20      24      28      32      36      40      44     48
>     +-------+-------+-------+-------+-------+-------+-------+-------+
>     |                        addr                           |  len  |
>     +-------+-------+-------+-------+-------+-------+-------+-------+
>     | flags |  ref  |evtchn |
>     +-------+-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX connect function][connect] for error names; the
>   corresponding error numbers are specified later in this document.
>
> #### Release
>
> The **release** operation closes an existing active or passive socket.
>
> When a release command is issued on a passive socket, the backend releases it
> and frees its internal mappings. When a release command is issued for an
> active socket, the data ring is also unmapped and freed:
>
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the data ring buffers
> - backend unmaps the data ring interface
> - backend unbinds the evtchn
> - backend replies to frontend
> - frontend frees ring and unbinds evtchn
>
> Fields:
>
> - **cmd** value: 2
> - additional fields: none
>
> Return value:
>
> - 0 on success
> - See the [POSIX shutdown function][shutdown] for error names; the
>   corresponding error numbers are specified later in this document.
>
> #### Bind
>
> The **bind** operation corresponds to the POSIX [bind][bind] function. It
> assigns the address passed as parameter to a previously created socket,
> identified by **sockid**.
> **Bind**, **listen** and **accept** are the three
> operations required to have fully working passive sockets and should be issued
> in this order.
>
> Fields:
>
> - **cmd** value: 3
> - additional fields:
>   - **addr**: address to bind to, see the address format section for more
>     information
>   - **len**: address length
>
> Binary layout:
>
>     16      20      24      28      32      36      40      44     48
>     +-------+-------+-------+-------+-------+-------+-------+-------+
>     |                        addr                           |  len  |
>     +-------+-------+-------+-------+-------+-------+-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX bind function][bind] for error names; the corresponding
>   error numbers are specified later in this document.
>
>
> #### Listen
>
> The **listen** operation marks the socket as a passive socket. It corresponds
> to the [POSIX listen function][listen].
>
> Fields:
>
> - **cmd** value: 4
> - additional fields:
>   - **backlog**: the maximum length to which the queue of pending
>     connections may grow
>
> Binary layout:
>
>     16      20
>     +-------+
>     |backlog|
>     +-------+
>
> Return value:
> - 0 on success
> - See the [POSIX listen function][listen] for error names; the corresponding
>   error numbers are specified later in this document.
>
>
> #### Accept
>
> The **accept** operation extracts the first connection request on the queue of
> pending connections for the listening socket identified by **sockid** and
> creates a new connected socket. The **sockid** of the new socket is also
> chosen by the frontend and passed as an additional field of the accept
> request struct. See the [POSIX accept function][accept] as reference.
>
> Similarly to the **connect** operation, **accept** creates a new data ring.
> Information necessary to set up the new ring, such as the grant table
> reference of the page containing the data ring interface (`struct
> xensock_data_intf`) and event channel port, are passed from the frontend to
> the backend as part of the request.
>
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or
> EWOULDBLOCK.
>
> Example workflow:
>
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **sockid** to the new socket
> - backend maps the grant reference **ref**; the shared page contains the
>   data ring interface (`struct xensock_data_intf`)
> - backend maps all the grant references listed in `struct
>   xensock_data_intf` and uses them as shared memory for the new data
>   ring
> - backend binds the **evtchn**
> - backend replies to the frontend
>
> Fields:
>
> - **cmd** value: 5
> - additional fields:
>   - **sockid**: id of the new socket
>   - **ref**: grant reference of the data ring interface (`struct
>     xensock_data_intf`)
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>
> Binary layout:
>
>     16      20      24      28      32
>     +-------+-------+-------+-------+
>     |    sockid     |  ref  |evtchn |
>     +-------+-------+-------+-------+
>
> Return value:
>
> - 0 on success
> - See the [POSIX accept function][accept] for error names; the corresponding
>   error numbers are specified later in this document.
>
>
> #### Poll
>
> The **poll** operation is only valid for passive sockets. For active sockets,
> the frontend should look at the state of the data ring. When a new connection
> is available in the queue of the passive socket, the backend generates a
> response and notifies the frontend.
>
> Fields:
>
> - **cmd** value: 6
> - additional fields: none
>
> Return value:
>
> - 0 on success
> - See the [POSIX poll function][poll] for error names; the corresponding
>   error numbers are specified later in this document.
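
The ordering constraint on passive sockets (bind, then listen, then accept, with poll only valid on a passive socket) can be sketched as a small frontend-side state check. This is purely illustrative: the draft does not say what the backend returns for out-of-order commands, so the use of -EINVAL (22 in the error table) and the `xs_*` names are my assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define XENSOCK_BIND     3
#define XENSOCK_LISTEN   4
#define XENSOCK_ACCEPT   5
#define XENSOCK_POLL     6
#define XS_EINVAL        22   /* hypothetical name; value from the error table */

/* Hypothetical per-socket state kept by a frontend. */
enum xs_sock_state { XS_CREATED, XS_BOUND, XS_LISTENING };

/*
 * Returns 0 and advances the state if cmd is acceptable for a socket in
 * state *st, or -XS_EINVAL if the command would be issued out of order.
 */
static int xs_passive_step(enum xs_sock_state *st, uint32_t cmd)
{
    switch (cmd) {
    case XENSOCK_BIND:
        if (*st != XS_CREATED)
            return -XS_EINVAL;
        *st = XS_BOUND;
        return 0;
    case XENSOCK_LISTEN:
        if (*st != XS_BOUND)
            return -XS_EINVAL;
        *st = XS_LISTENING;
        return 0;
    case XENSOCK_ACCEPT:
    case XENSOCK_POLL:
        /* both only make sense once the socket is passive */
        return *st == XS_LISTENING ? 0 : -XS_EINVAL;
    default:
        return -XS_EINVAL;
    }
}
```
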
>
> #### Error numbers
>
> The numbers corresponding to the error names specified by POSIX are:
>
>     [EPERM]         -1
>     [ENOENT]        -2
>     [ESRCH]         -3
>     [EINTR]         -4
>     [EIO]           -5
>     [ENXIO]         -6
>     [E2BIG]         -7
>     [ENOEXEC]       -8
>     [EBADF]         -9
>     [ECHILD]        -10
>     [EAGAIN]        -11
>     [EWOULDBLOCK]   -11
>     [ENOMEM]        -12
>     [EACCES]        -13
>     [EFAULT]        -14
>     [EBUSY]         -16
>     [EEXIST]        -17
>     [EXDEV]         -18
>     [ENODEV]        -19
>     [EISDIR]        -21
>     [EINVAL]        -22
>     [ENFILE]        -23
>     [EMFILE]        -24
>     [ENOSPC]        -28
>     [EROFS]         -30
>     [EMLINK]        -31
>     [EDOM]          -33
>     [ERANGE]        -34
>     [EDEADLK]       -35
>     [EDEADLOCK]     -35
>     [ENAMETOOLONG]  -36
>     [ENOLCK]        -37
>     [ENOSYS]        -38
>     [ENOTEMPTY]     -39
>     [ENODATA]       -61
>     [ETIME]         -62
>     [EBADMSG]       -74
>     [EOVERFLOW]     -75
>     [EILSEQ]        -84
>     [ERESTART]      -85
>     [ENOTSOCK]      -88
>     [EOPNOTSUPP]    -95
>     [EAFNOSUPPORT]  -97
>     [EADDRINUSE]    -98
>     [EADDRNOTAVAIL] -99
>     [ENOBUFS]       -105
>     [EISCONN]       -106
>     [ENOTCONN]      -107
>     [ETIMEDOUT]     -110
>     [ENOTSUPP]      -524
>
> #### Socket families and address format
>
> The following definitions and explicit sizes, together with POSIX
> [sys/socket.h][address] and [netinet/in.h][in] define socket families and
> address format. Please be aware that only the **domain** `AF_INET`, **type**
> `SOCK_STREAM` and **protocol** `0` are supported by this version of the spec.
>
>     #define AF_UNSPEC   0
>     #define AF_UNIX     1   /* Unix domain sockets      */
>     #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
>     #define AF_INET     2   /* Internet IP Protocol     */
>     #define AF_INET6    10  /* IP version 6             */
>
>     #define SOCK_STREAM 1
>     #define SOCK_DGRAM  2
>     #define SOCK_RAW    3
>
>     /* generic address format */
>     struct sockaddr {
>         uint16_t sa_family_t;
>         char sa_data[26];
>     };
>
>     struct in_addr {
>         uint32_t s_addr;
>     };
>
>     /* AF_INET address format */
>     struct sockaddr_in {
>         uint16_t         sa_family_t;
>         uint16_t         sin_port;
>         struct in_addr   sin_addr;
>         char             sin_zero[20];
>     };
>
>
> ### Data ring
>
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command.
>
> A data ring is composed of two pieces: the interface and the **in** and
> **out** buffers. The interface, represented by `struct xensock_data_intf`, is
> shared first and resides on the page whose grant reference is passed by
> **accept** and **connect** as parameter. `struct xensock_data_intf` contains
> the list of grant references which constitute the **in** and **out** data
> buffers.
>
> #### Data ring interface
>
>     struct xensock_data_intf {
>         XENSOCK_RING_IDX in_cons, in_prod;
>         XENSOCK_RING_IDX out_cons, out_prod;
>         int32_t in_error, out_error;
>
>         uint32_t ring_order;
>         grant_ref_t ref[];
>     };
>
>     /* not actually C compliant (ring_order changes from socket to socket) */
>     struct xensock_data {
>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>     };
>
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-dataring-page-order**, as specified by the backend on XenBus.
> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguously in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array.
> - **in** is an array used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer pointers for data read from the socket. They keep
>   track of how much data has already been consumed by the frontend from the
>   **in** array. **in_prod** is increased by the backend, after writing data
>   to **in**. **in_cons** is increased by the frontend, after reading data
>   from **in**.
> - **out_cons**, **out_prod**
>   Consumer and producer pointers for the data to be written to the socket.
>   They keep track of how much data has been written by the frontend to
>   **out** and how much data has already been consumed by the backend.
>   **out_prod** is increased by the frontend, after writing data to **out**.
>   **out_cons** is increased by the backend, after reading data from **out**.
> - **in_error** and **out_error**
>   They signal errors when reading from the socket (**in_error**) or when
>   writing to the socket (**out_error**). 0 means no errors. When an error
>   occurs, no further read or write operations are performed on the socket.
>   In the case of an orderly socket shutdown (i.e. read returns 0)
>   **in_error** is set to ENOTCONN. **in_error** and **out_error** are never
>   set to EAGAIN or EWOULDBLOCK.
>
> The binary layout of `struct xensock_data_intf` follows:
>
>     0         4         8         12        16        20        24         28
>     +---------+---------+---------+---------+---------+---------+----------+
>     | in_cons | in_prod |out_cons |out_prod |in_error |out_error|ring_order|
>     +---------+---------+---------+---------+---------+---------+----------+
>
>     28        32        36        40        4092      4096
>     +---------+---------+---------+----//---+---------+
>     | ref[0]  | ref[1]  | ref[2]  |         | ref[N]  |
>     +---------+---------+---------+----//---+---------+
>
> The binary layout of the ring buffers follows:
>
>     0               ((1<<ring_order)<<PAGE_SHIFT)/2     ((1<<ring_order)<<PAGE_SHIFT)
>     +------------//-------------+------------//-------------+
>     |            in             |           out             |
>     +------------//-------------+------------//-------------+
>
> #### Workflow
>
> The **in** and **out** arrays are used as circular buffers:
>
>     0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
>     +-----------------------------------+
>     |to consume|    free    |to consume |
>     +-----------------------------------+
>                ^            ^
>                prod         cons
>
>     0                               sizeof(array)
>     +-----------------------------------+
>     |  free    | to consume |   free    |
>     +-----------------------------------+
>                ^            ^
>                cons         prod
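
To put numbers on the layouts above, a minimal sketch of the size arithmetic. It assumes 4 KiB pages (`PAGE_SHIFT` of 12, matching the 4096-byte interface page in the diagram), 4-byte ring indexes and 4-byte grant references, so that the fixed fields end at byte 28. The `XS_*` and `xs_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XS_PAGE_SHIFT   12u                     /* assumption: 4 KiB pages */
#define XS_PAGE_SIZE    (1u << XS_PAGE_SHIFT)
#define XS_INTF_HEADER  28u                     /* in_cons .. ring_order */
#define XS_REF_SIZE     4u                      /* sizeof(grant_ref_t), assumed */

/* Number of grant references listed in ref[] for a given ring_order. */
static uint32_t xs_nr_refs(uint32_t ring_order)
{
    assert(ring_order < 20);    /* keeps the shifts below from overflowing */
    return 1u << ring_order;
}

/* Size in bytes of each of the in and out arrays. */
static uint32_t xs_array_size(uint32_t ring_order)
{
    return (xs_nr_refs(ring_order) << XS_PAGE_SHIFT) / 2;
}

/* How many references fit after the fixed fields in the interface page. */
static uint32_t xs_max_refs(void)
{
    return (XS_PAGE_SIZE - XS_INTF_HEADER) / XS_REF_SIZE;
}

/* Largest ring_order whose reference list still fits in that page. */
static uint32_t xs_max_order(void)
{
    uint32_t order = 0;

    while (xs_nr_refs(order + 1) <= xs_max_refs())
        order++;
    return order;
}
```

Under these assumptions a single interface page has room for 1017 references, so the page itself caps `ring_order` at 9 (512 pages, 1 MiB per direction); the backend's **max-dataring-page-order** may of course be lower.
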
>
> The following function is provided to calculate how many bytes are currently
> left unconsumed in an array:
>
>     /* ring_size must be a power of two */
>     #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))
>
>     static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
>             XENSOCK_RING_IDX cons,
>             XENSOCK_RING_IDX ring_size)
>     {
>         XENSOCK_RING_IDX size;
>
>         /* equal unmasked indexes: the array is empty */
>         if (prod == cons)
>             return 0;
>
>         prod = _MASK_XENSOCK_IDX(prod, ring_size);
>         cons = _MASK_XENSOCK_IDX(cons, ring_size);
>
>         /* equal only after masking: the array is full */
>         if (prod == cons)
>             return ring_size;
>
>         if (prod > cons)
>             size = prod - cons;
>         else {
>             /* the queued data wraps around the end of the array */
>             size = ring_size - cons;
>             size += prod;
>         }
>         return size;
>     }
>
> The producer (the backend for **in**, the frontend for **out**) writes to the
> array in the following way:
>
> - read *cons*, *prod*, *error* from shared memory
> - memory barrier
> - return on *error*
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - memory barrier
> - increase *prod*
> - notify the other end via evtchn
>
> The consumer (the backend for **out**, the frontend for **in**) reads from the
> array in the following way:
>
> - read *prod*, *cons*, *error* from shared memory
> - memory barrier
> - return on *error*
> - read from array at position *cons* up to *prod*, wrapping around the
>   circular buffer when necessary
> - memory barrier
> - increase *cons*
> - notify the other end via evtchn
>
> The producer takes care of writing only as many bytes as available in the
> buffer up to *cons*. The consumer takes care of reading only as many bytes as
> available in the buffer up to *prod*. *error* is set by the backend when an
> error occurs writing or reading from the socket.
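
The producer and consumer steps above can be sketched in plain C as follows. This is a single-process illustration, not the driver code: `struct xs_dir` is a hypothetical type bundling one direction of a data ring, `XENSOCK_RING_IDX` is assumed to be 32 bits, C11 `atomic_thread_fence` stands in for whatever barriers the real drivers use, and the event channel notification is only a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t XENSOCK_RING_IDX;   /* assumption: 32-bit ring indexes */

/* One direction of a data ring (hypothetical helper type). */
struct xs_dir {
    XENSOCK_RING_IDX cons, prod;     /* free-running indexes */
    int32_t error;                   /* in_error or out_error */
    XENSOCK_RING_IDX size;           /* array size in bytes, a power of two */
    char *array;
};

/* Bytes queued between cons and prod, same logic as xensock_ring_queued. */
static XENSOCK_RING_IDX xs_queued(XENSOCK_RING_IDX prod, XENSOCK_RING_IDX cons,
                                  XENSOCK_RING_IDX size)
{
    if (prod == cons)
        return 0;
    prod &= size - 1;
    cons &= size - 1;
    if (prod == cons)
        return size;
    return prod > cons ? prod - cons : size - cons + prod;
}

/* Producer: returns the number of bytes written, or the pending error. */
static int32_t xs_write(struct xs_dir *d, const char *buf, uint32_t len)
{
    XENSOCK_RING_IDX cons = d->cons, prod = d->prod;
    int32_t error = d->error;
    uint32_t space, idx, first;

    atomic_thread_fence(memory_order_seq_cst);
    if (error)
        return error;

    space = d->size - xs_queued(prod, cons, d->size);
    if (len > space)
        len = space;                          /* never write past cons */
    idx = prod & (d->size - 1);
    first = d->size - idx < len ? d->size - idx : len;
    memcpy(d->array + idx, buf, first);           /* up to end of array */
    memcpy(d->array, buf + first, len - first);   /* wrapped remainder */

    atomic_thread_fence(memory_order_seq_cst);    /* data before index */
    d->prod = prod + len;
    /* a real implementation would now notify the other end via evtchn */
    return (int32_t)len;
}

/* Consumer: returns the number of bytes read, or the pending error. */
static int32_t xs_read(struct xs_dir *d, char *buf, uint32_t len)
{
    XENSOCK_RING_IDX prod = d->prod, cons = d->cons;
    int32_t error = d->error;
    uint32_t queued, idx, first;

    atomic_thread_fence(memory_order_seq_cst);
    if (error)
        return error;

    queued = xs_queued(prod, cons, d->size);
    if (len > queued)
        len = queued;                         /* never read past prod */
    idx = cons & (d->size - 1);
    first = d->size - idx < len ? d->size - idx : len;
    memcpy(buf, d->array + idx, first);
    memcpy(buf + first, d->array, len - first);

    atomic_thread_fence(memory_order_seq_cst);    /* reads before index */
    d->cons = cons + len;
    /* a real implementation would now notify the other end via evtchn */
    return (int32_t)len;
}
```

Keeping the indexes free-running and masking only when addressing the array is what lets `prod == cons` mean "empty" while "equal after masking" means "full", as in the function quoted above.
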
>
>
> [address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
> [in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
> [socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
> [connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
> [shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
> [bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
> [listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
> [accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
> [poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html