[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [DRAFT v5] PV Calls protocol design document (former XenSock)



Thank you!

On Sun, 21 Aug 2016, Christopher Clark wrote:
> The PV Calls (formerly XenSock) protocol design has recently attracted 
> interest within the OpenXT software
> development community, as it is a novel interdomain communication mechanism 
> and protocol.
> 
> We have a longstanding and active interest in interdomain communication 
> transport. v4v is a protocol and mechanism
> originally designed at Citrix for XenClient and XenClient XT, now in use by 
> OpenXT, and has been deployed in
> production systems for several years at this point. It has benefitted from 
> previous reviews by the Xen Community,
> and continued engagement from its original designers, and is our selection of 
> protocol and implementation to meet
> the particular software requirements of our project. Members of the OpenXT 
> community also have interdomain
> protocol experience through being involved in the development of both vchan 
> and the XSM architecture.
> 
> The PV Calls (formerly XenSock) work described in this thread is interesting 
> and quite a different technology,
> evidently with a different set of design constraints and focus on solving a 
> different problem: it enables the
> insertion of VM boundaries and separation in-between components that are 
> communicating via the POSIX socket API,
> which could not otherwise be so isolated from each other.
> 
> In contrast, v4v prioritizes isolation between VMs over seamless integration 
> of existing components, preferring no
> memory sharing between VMs, and mandatory access control enforced on 
> communication channels by the hypervisor.
> 
> So while PV Calls is not a fit for the needs of OpenXT, this proposed 
> approach looks very good for the problem
> that it aims to address. It is complimentary to the other mechanisms 
> available in the Xen ecosystem.
> 
> I would like to convey my positive support for the PV Calls proposal.
> 
> Christopher Clark
> BAE Systems, OpenXT Project
> http://openxt.org
> 
> 
> On Thu, Aug 4, 2016 at 10:17 AM, Stefano Stabellini <stefano@xxxxxxxxxxx> 
> wrote:
>       Hi all,
> 
>       This is the design document of the PV Calls protocol. You can find
>       prototypes of the Linux frontend and backend drivers here:
> 
>       git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git 
> pvcalls-5
> 
>       To use them, make sure to enable CONFIG_PVCALLS in your kernel config
>       and add "pvcalls=1" to the command line of your DomU Linux kernel. You
>       also need the toolstack to create the initial xenstore nodes for the
>       protocol. To do that, please apply the attached patch to libxl (the
>       patch is based on Xen 4.7.0-rc3) and add "pvcalls=1" to your DomU config
>       file.
> 
>       Note that previous versions of the protocols were named XenSock. It has
>       been renamed for clarity of scope and to avoid confusion with hv_sock
>       and vsock, which are used for inter-VMs communications.
> 
>       Cheers,
> 
>       Stefano
> 
>       Changes in v5:
>       - clarify text
>       - rename id to req_id
>       - rename sockid to id
>       - move id to request and response specific fields
>       - add version node to xenstore
> 
>       Changes in v4:
>       - rename xensock to pvcalls
> 
>       Changes in v3:
>       - add a dummy element to struct xen_xensock_request to make sure the
>         size of the struct is the same on both x86_32 and x86_64
> 
>       Changes in v2:
>       - add max-dataring-page-order
>       - add "Publish backend features and transport parameters" to backend
>         xenbus workflow
>       - update new cmd values
>       - update xen_xensock_request
>       - add backlog parameter to listen and binary layout
>       - add description of new data ring format (interface+data)
>       - modify connect and accept to reflect new data ring format
>       - add link to POSIX docs
>       - add error numbers
>       - add address format section and relevant numeric definitions
>       - add explicit mention of unimplemented commands
>       - add protocol node name
>       - add xenbus shutdown diagram
>       - add socket operation
> 
>       ---
> 
>       # PV Calls Protocol version 1
> 
>       ## Rationale
> 
>       PV Calls is a paravirtualized protocol that allows the implementation of
>       a set of POSIX functions in a different domain. The PV Calls frontend
>       sends POSIX function calls to the backend, which implements them and
>       returns a value to the frontend.
> 
>       This version of the document covers networking function calls, such as
>       connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
>       the protocol is meant to be easily extended to cover different sets of
>       calls. Unimplemented commands return ENOTSUPP.
> 
>       PV Calls provide the following benefits:
>       * full visibility of the guest behavior on the backend domain, allowing
>         for inexpensive filtering and manipulation of any guest calls
>       * excellent performance
> 
>       Specifically, PV Calls for networking offer these advantages:
>       * guest networking works out of the box with VPNs, wireless networks and
>         any other complex configurations on the host
>       * guest services listen on ports bound directly to the backend domain IP
>         addresses
>       * localhost becomes a secure namespace for inter-VMs communications
> 
> 
>       ## Design
> 
>       ### Xenstore
> 
>       The frontend and the backend connect to each other exchanging 
> information via
>       xenstore. The toolstack creates front and back nodes with state
>       XenbusStateInitialising. The protocol node name is **pvcalls**. There 
> can only
>       be one PV Calls frontend per domain.
> 
>       #### Frontend XenBus Nodes
> 
>       port
>            Values:         <uint32_t>
> 
>            The identifier of the Xen event channel used to signal activity
>            in the ring buffer.
> 
>       ring-ref
>            Values:         <uint32_t>
> 
>            The Xen grant reference granting permission for the backend to map
>            the sole page in a single page sized ring buffer.
> 
>       #### Backend XenBus Nodes
> 
>       version
>            Values:         <uint32_t>
> 
>            Protocol version supported by the backend.
> 
>       max-dataring-page-order
>            Values:         <uint32_t>
> 
>            The maximum supported size of the data ring in units of lb(machine
>            pages). (e.g. 0 == 1 page,  1 = 2 pages, 2 == 4 pages, etc.).
> 
>       #### State Machine
> 
>       Initialization:
> 
>           *Front*                               *Back*
>           XenbusStateInitialising               XenbusStateInitialising
>           - Query virtual device                - Query backend device
>             properties.                           identification data.
>           - Setup OS device instance.           - Publish backend features
>           - Allocate and initialize the           and transport parameters
>             request ring.                                      |
>           - Publish transport parameters                       |
>             that will be in effect during                      V
>             this connection.                            XenbusStateInitWait
>                        |
>                        |
>                        V
>              XenbusStateInitialised
> 
>                                                 - Query frontend transport 
> parameters.
>                                                 - Connect to the request ring 
> and
>                                                   event channel.
>                                                                |
>                                                                |
>                                                                V
>                                                        XenbusStateConnected
> 
>            - Query backend device properties.
>            - Finalize OS virtual device
>              instance.
>                        |
>                        |
>                        V
>               XenbusStateConnected
> 
>       Once frontend and backend are connected, they have a shared page, which
>       will is used to exchange messages over a ring, and an event channel,
>       which is used to send notifications.
> 
>       Shutdown:
> 
>           *Front*                            *Back*
>           XenbusStateConnected               XenbusStateConnected
>                       |
>                       |
>                       V
>              XenbusStateClosing
> 
>                                              - Unmap grants
>                                              - Unbind evtchns
>                                                        |
>                                                        |
>                                                        V
>                                                XenbusStateClosing
> 
>           - Unbind evtchns
>           - Free rings
>           - Free data structures
>                      |
>                      |
>                      V
>              XenbusStateClosed
> 
>                                              - Free remaining data structures
>                                                        |
>                                                        |
>                                                        V
>                                                XenbusStateClosed
> 
> 
>       ### Commands Ring
> 
>       The shared ring is used by the frontend to forward POSIX function calls 
> to the
>       backend. I'll refer to this ring as **commands ring** to distinguish it 
> from
>       other rings which can be created later in the lifecycle of the protocol 
> (data
>       rings). The ring format is defined using the familiar 
> `DEFINE_RING_TYPES` macro
>       (`xen/include/public/io/ring.h`). Frontend requests are allocated on 
> the ring
>       using the `RING_GET_REQUEST` macro.
> 
>       The format is defined as follows:
> 
>           #define PVCALLS_SOCKET         0
>           #define PVCALLS_CONNECT        1
>           #define PVCALLS_RELEASE        2
>           #define PVCALLS_BIND           3
>           #define PVCALLS_LISTEN         4
>           #define PVCALLS_ACCEPT         5
>           #define PVCALLS_POLL           6
> 
>           struct xen_pvcalls_request {
>               uint32_t req_id; /* private to guest, echoed in response */
>               uint32_t cmd;    /* command to execute */
>               union {
>                       struct xen_pvcalls_socket {
>                       uint64_t id;
>                               uint32_t domain;
>                               uint32_t type;
>                               uint32_t protocol;
>                       } socket;
>                       struct xen_pvcalls_connect {
>                       uint64_t id;
>                               uint8_t addr[28];
>                               uint32_t len;
>                               uint32_t flags;
>                               grant_ref_t ref;
>                               uint32_t evtchn;
>                       } connect;
>                       struct xen_pvcalls_release {
>                       uint64_t id;
>                   } release;
>                       struct xen_pvcalls_bind {
>                       uint64_t id;
>                               uint8_t addr[28];
>                               uint32_t len;
>                       } bind;
>                       struct xen_pvcalls_listen {
>                       uint64_t id;
>                               uint32_t backlog;
>                       } listen;
>                       struct xen_pvcalls_accept {
>                       uint64_t id;
>                               uint64_t id_new;
>                               grant_ref_t ref;
>                               uint32_t evtchn;
>                       } accept;
>                       struct xen_pvcalls_poll {
>                       uint64_t id;
>                   } poll;
>                       /* dummy member to force sizeof(struct 
> xen_pvcalls_request) to match across archs */
>                       struct xen_pvcalls_dummy {
>                               uint8_t dummy[56];
>                       } dummy;
>               } u;
>           };
> 
>       The first two fields are common for every command. Their binary layout
>       is:
> 
>           0       4       8
>           +-------+-------+
>           |req_id |  cmd  |
>           +-------+-------+
> 
>       - **req_id** is generated by the frontend and identifies one specific 
> request
>       - **cmd** is the command requested by the frontend:
> 
>           - `PVCALLS_SOCKET`:  0
>           - `PVCALLS_CONNECT`: 1
>           - `PVCALLS_RELEASE`: 2
>           - `PVCALLS_BIND`:    3
>           - `PVCALLS_LISTEN`:  4
>           - `PVCALLS_ACCEPT`:  5
>           - `PVCALLS_POLL`:    6
> 
>       Both fields are echoed back by the backend.
> 
>       As for the other Xen ring based protocols, after writing a request to 
> the ring,
>       the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an 
> event
>       channel notification when a notification is required.
> 
>       Backend responses are allocated on the ring using the 
> `RING_GET_RESPONSE` macro.
>       The format is the following:
> 
>           struct xen_pvcalls_response {
>               uint32_t req_id;
>               uint32_t cmd;
>               int32_t ret;
>               uint32_t pad;
>               union {
>                       struct _xen_pvcalls_socket {
>                       uint64_t id;
>                       } socket;
>                       struct _xen_pvcalls_connect {
>                       uint64_t id;
>                       } connect;
>                       struct _xen_pvcalls_release {
>                       uint64_t id;
>                       } release;
>                       struct _xen_pvcalls_bind {
>                       uint64_t id;
>                   } bind;
>                       struct _xen_pvcalls_listen {
>                       uint64_t id;
>                       } listen;
>                       struct _xen_pvcalls_accept {
>                       uint64_t id;
>                       } accept;
>                       struct _xen_pvcalls_poll {
>                       uint64_t id;
>                       } poll;
>                       struct _xen_pvcalls_dummy {
>                       uint8_t dummy[8];
>                   } dummy;
>               } u;
>           };
> 
>       The first four fields are common for every response. Their binary layout
>       is:
> 
>           0       4       8       12      16
>           +-------+-------+-------+-------+
>           |req_id |  cmd  |  ret  |  pad  |
>           +-------+-------+-------+-------+
> 
>       - **req_id**: echoed back from request
>       - **cmd**: echoed back from request
>       - **ret**: return value, identifies success (0) or failure (see error 
> numbers
>         below). If the **cmd** is not supported by the backend, ret is 
> ENOTSUPP.
>       - **pad**: padding
> 
>       After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend 
> checks whether
>       it needs to notify the frontend and does so via event channel.
> 
>       A description of each command, their additional request and response
>       fields follow.
> 
> 
>       #### Socket
> 
>       The **socket** operation corresponds to the POSIX [socket][socket] 
> function. It
>       creates a new socket of the specified family, type and protocol. **id** 
> is
>       freely chosen by the frontend and references this specific socket from 
> this
>       point forward. See "Socket families and address format" below.
> 
>       Request fields:
> 
>       - **cmd** value: 0
>       - additional fields:
>         - **id**: generated by the frontend, it identifies the new socket
>         - **domain**: the communication domain
>         - **type**: the socket type
>         - **protocol**: the particular protocol to be used with the socket, 
> usually 0
> 
>       Request binary layout:
> 
>           8       12      16      20     24       28
>           +-------+-------+-------+-------+-------+
>           |       id      |domain | type  |protoco|
>           +-------+-------+-------+-------+-------+
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16       20       24
>           +-------+--------+
>           |       id       |
>           +-------+--------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX socket function][connect] for error names; the 
> corresponding
>           error numbers are specified later in this document.
> 
>       #### Connect
> 
>       The **connect** operation corresponds to the POSIX [connect][connect] 
> function.
>       It connects a previously created socket (identified by **id**) to the
>       specified address.
> 
>       The connect operation creates a new shared ring, which we'll call **data
>       ring**. The data ring is used to send and receive data from the socket.
>       The connect operation passes two additional parameters which are
>       utilized to setup the new ring: **evtchn** and **ref**. **evtchn** is 
> the
>       port number of a new event channel which will be used for notifications
>       of activity on the data ring. **ref** is the grant reference of a page
>       which containes shared pointers to write and read data from the data 
> ring
>       and the full array of grant references for the ring buffers. It will be
>       described in more detailed later. The data ring is unmapped and freed 
> upon
>       issuing a **release** command on the active socket identified by **id**.
> 
>       When the frontend issues a **connect** command, the backend:
>       - finds its own internal socket corresponding to **id**
>       - connects the socket to **addr**
>       - maps the grant reference **ref**, the shared page contains the data
>         ring interface (`struct pvcalls_data_intf`)
>       - maps all the grant references listed in `struct pvcalls_data_intf` and
>         uses them as shared memory for the ring buffers
>       - bind the **evtchn**
>       - replies to the frontend
> 
>       The data ring format will be described in the following section.
> 
>       Request fields:
> 
>       - **cmd** value: 0
>       - additional fields:
>         - **id**: identifies the socket
>         - **addr**: address to connect to, see the address format section for 
> more
>           information
>         - **len**: address length
>         - **flags**: flags for the connection, reserved for future usage
>         - **ref**: grant reference of the page containing `struct
>           pvcalls_data_intf`
>         - **evtchn**: port number of the evtchn to signal activity on the 
> data ring
> 
>       Request binary layout:
> 
>           8       12      16      20      24      28      32      36      40  
>     44
>           
> +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>           |       id      |                            addr                   
>     |
>           
> +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>           | len   | flags |  ref  |evtchn |
>           +-------+-------+-------+-------+
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16      20      24
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX connect function][connect] for error names; the 
> corresponding
>           error numbers are specified later in this document.
> 
>       #### Release
> 
>       The **release** operation closes an existing active or a passive socket.
> 
>       When a release command is issued on a passive socket, the backend 
> releases it
>       and frees its internal mappings. When a release command is issued for 
> an active
>       socket, the data ring is also unmapped and freed:
> 
>       - frontend sends release command for an active socket
>       - backend releases the socket
>       - backend unmaps the data ring buffers
>       - backend unmaps the data ring interface
>       - backend unbinds the evtchn
>       - backend replies to frontend
>       - frontend frees ring and unbinds evtchn
> 
>       Request fields:
> 
>       - **cmd** value: 1
>       - additional fields:
>         - **id**: identifies the socket
> 
>       Request binary layout:
> 
>           8       12      16
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16      20      24
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX shutdown function][shutdown] for error names; the
>           corresponding error numbers are specified later in this document.
> 
>       #### Bind
> 
>       The **bind** operation corresponds to the POSIX [bind][bind] function. 
> It
>       assigns the address passed as parameter to a previously created socket,
>       identified by **id**. **Bind**, **listen** and **accept** are the three
>       operations required to have fully working passive sockets and should be
>       issued in this order.
> 
>       Request fields:
> 
>       - **cmd** value: 2
>       - additional fields:
>         - **id**: identifies the socket
>         - **addr**: address to connect to, see the address format section for 
> more
>           information
>         - **len**: address length
> 
>       Request binary layout:
> 
>           8       12      16      20      24      28      32      36      40  
>     44
>           
> +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>           |       id      |                            addr                   
>     |
>           
> +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>           |  len  |
>           +-------+
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16      20      24
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX bind function][bind] for error names; the 
> corresponding error
>           numbers are specified later in this document.
> 
> 
>       #### Listen
> 
>       The **listen** operation marks the socket as a passive socket. It 
> corresponds to
>       the [POSIX listen function][listen].
> 
>       Reuqest fields:
> 
>       - **cmd** value: 3
>       - additional fields:
>         - **id**: identifies the socket
>         - **backlog**: the maximum length to which the queue of pending
>           connections may grow
> 
>       Request binary layout:
> 
>           8       12      16      20
>           +-------+-------+-------+
>           |       id      |backlog|
>           +-------+-------+-------+
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16      20      24
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Return value:
>         - 0 on success
>         - See the [POSIX listen function][listen] for error names; the 
> corresponding
>           error numbers are specified later in this document.
> 
> 
>       #### Accept
> 
>       The **accept** operation extracts the first connection request on the
>       queue of pending connections for the listening socket identified by
>       **id** and creates a new connected socket. The id of the new socket is
>       also chosen by the frontend and passed as an additional field of the
>       accept request struct (**id_new**). See the [POSIX accept 
> function][accept]
>       as reference.
> 
>       Similarly to the **connect** operation, **accept** creates a new data 
> ring.
>       Information necessary to setup the new ring, such the grant table 
> reference of
>       the page containing the data ring interface (`struct 
> pvcalls_data_intf`) and
>       event channel port, are passed from the frontend to the backend as part 
> of the
>       request.
> 
>       The backend will reply to the request only when a new connection is 
> successfully
>       accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.
> 
>       Example workflow:
> 
>       - frontend issues an **accept** request
>       - backend waits for a connection to be available on the socket
>       - a new connection becomes available
>       - backend accepts the new connection
>       - backend creates an internal mapping from **id_new** to the new socket
>       - backend maps the grant reference **ref**, the shared page contains the
>         data ring interface (`struct pvcalls_data_intf`)
>       - backend maps all the grant references listed in `struct
>         pvcalls_data_intf` and uses them as shared memory for the new data
>         ring
>       - backend binds the **evtchn**
>       - backend replies to the frontend
> 
>       Request fields:
> 
>       - **cmd** value: 4
>       - additional fields:
>         - **id**: id of listening socket
>         - **id_new**: id of the new socket
>         - **ref**: grant reference of the data ring interface (`struct
>           pvcalls_data_intf`)
>         - **evtchn**: port number of the evtchn to signal activity on the 
> data ring
> 
>       Request binary layout:
> 
>           8       12      16      20      24      28      32
>           +-------+-------+-------+-------+-------+-------+
>           |       id      |    id_new     |  ref  |evtchn |
>           +-------+-------+-------+-------+-------+-------+
> 
>       Response additional fields:
> 
>       - **id**: id of the listening socket, echoed back from request
> 
>       Response binary layout:
> 
>           16      20      24
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX accept function][accept] for error names; the 
> corresponding
>           error numbers are specified later in this document.
> 
> 
>       #### Poll
> 
>       In this version of the protocol, the **poll** operation is only valid
>       for passive sockets. For active sockets, the frontend should look at the
>       state of the data ring. When a new connection is available in the queue
>       of the passive socket, the backend generates a response and notifies the
>       frontend.
> 
>       Request fields:
> 
>       - **cmd** value: 5
>       - additional fields:
>         - **id**: identifies the listening socket
> 
>       Request binary layout:
> 
>           8       12      16
>           +-------+-------+
>           |       id      |
>           +-------+-------+
> 
> 
>       Response additional fields:
> 
>       - **id**: echoed back from request
> 
>       Response binary layout:
> 
>           16       20       24
>           +--------+--------+
>           |        id       |
>           +--------+--------+
> 
>       Return value:
> 
>         - 0 on success
>         - See the [POSIX poll function][poll] for error names; the 
> corresponding error
>           numbers are specified later in this document.
> 
>       #### Error numbers
> 
>       The numbers corresponding to the error names specified by POSIX are:
> 
>           [EPERM]         -1
>           [ENOENT]        -2
>           [ESRCH]         -3
>           [EINTR]         -4
>           [EIO]           -5
>           [ENXIO]         -6
>           [E2BIG]         -7
>           [ENOEXEC]       -8
>           [EBADF]         -9
>           [ECHILD]        -10
>           [EAGAIN]        -11
>           [EWOULDBLOCK]   -11
>           [ENOMEM]        -12
>           [EACCES]        -13
>           [EFAULT]        -14
>           [EBUSY]         -16
>           [EEXIST]        -17
>           [EXDEV]         -18
>           [ENODEV]        -19
>           [EISDIR]        -21
>           [EINVAL]        -22
>           [ENFILE]        -23
>           [EMFILE]        -24
>           [ENOSPC]        -28
>           [EROFS]         -30
>           [EMLINK]        -31
>           [EDOM]          -33
>           [ERANGE]        -34
>           [EDEADLK]       -35
>           [EDEADLOCK]     -35
>           [ENAMETOOLONG]  -36
>           [ENOLCK]        -37
>           [ENOTEMPTY]     -39
>           [ENOSYS]        -38
>           [ENODATA]       -61
>           [ETIME]         -62
>           [EBADMSG]       -74
>           [EOVERFLOW]     -75
>           [EILSEQ]        -84
>           [ERESTART]      -85
>           [ENOTSOCK]      -88
>           [EOPNOTSUPP]    -95
>           [EAFNOSUPPORT]  -97
>           [EADDRINUSE]    -98
>           [EADDRNOTAVAIL] -99
>           [ENOBUFS]       -105
>           [EISCONN]       -106
>           [ENOTCONN]      -107
>           [ETIMEDOUT]     -110
>           [ENOTSUPP]      -524
> 
>       #### Socket families and address format
> 
>       The following definitions and explicit sizes, together with POSIX
>       [sys/socket.h][address] and [netinet/in.h][in] define socket families 
> and
>       address format. Please be aware that only the **domain** `AF_INET`, 
> **type**
>       `SOCK_STREAM` and **protocol** `0` are supported by this version of the 
> spec.
> 
>           #define AF_UNSPEC   0
>           #define AF_UNIX     1   /* Unix domain sockets      */
>           #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
>           #define AF_INET     2   /* Internet IP Protocol     */
>           #define AF_INET6    10  /* IP version 6         */
> 
>           #define SOCK_STREAM 1
>           #define SOCK_DGRAM  2
>           #define SOCK_RAW    3
> 
>           /* generic address format */
>           struct sockaddr {
>               uint16_t sa_family_t;
>               char sa_data[26];
>           };
> 
>           struct in_addr {
>               uint32_t s_addr;
>           };
> 
>           /* AF_INET address format */
>           struct sockaddr_in {
>               uint16_t         sa_family_t;
>               uint16_t         sin_port;
>               struct in_addr   sin_addr;
>               char             sin_zero[20];
>           };
> 
> 
>       ### Data ring
> 
>       Data rings are used for sending and receiving data over a connected 
> socket. They
>       are created upon a successful **accept** or **connect** command.
> 
>       A data ring is composed of two pieces: the interface and the **in** and 
> **out**
>       buffers. The interface, represented by `struct pvcalls_ring_intf` is 
> shared
>       first and resides on the page whose grant reference is passed by 
> **accept** and
>       **connect** as parameter. `struct pvcalls_ring_intf` contains the list 
> of grant
>       references which constitute the **in** and **out** data buffers.
> 
>       #### Data ring interface
> 
>           struct pvcalls_data_intf {
>               PVCALLS_RING_IDX in_cons, in_prod;
>               PVCALLS_RING_IDX out_cons, out_prod;
>               int32_t in_error, out_error;
> 
>               uint32_t ring_order;
>               grant_ref_t ref[];
>           };
> 
>           /* not actually C compliant (ring_order changes from socket to 
> socket) */
>           struct pvcalls_data {
>               char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>               char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>           };
> 
>       - **ring_order**
>         It represents the order of the data ring. The following list of grant
>         references is of `(1 << ring_order)` elements. It cannot be greater 
> than
>         **max-dataring-page-order**, as specified by the backend on XenBus.
>       - **ref[]**
>         The list of grant references which will contain the actual data. They 
> are
>         mapped contiguosly in virtual memory. The first half of the pages is 
> the
>         **in** array, the second half is the **out** array.
>       - **in** is an array used as circular buffer
>         It contains data read from the socket. The producer is the backend, 
> the
>         consumer is the frontend.
>       - **out** is an array used as circular buffer
>         It contains data to be written to the socket. The producer is the 
> frontend,
>         the consumer is the backend.
>       - **in_cons** and **in_prod**
>         Consumer and producer pointers for data read from the socket. They 
> keep track
>         of how much data has already been consumed by the frontend from the 
> **in**
>         array. **in_prod** is increased by the backend, after writing data to 
> **in**.
>         **in_cons** is increased by the frontend, after reading data from 
> **in**.
>       - **out_cons**, **out_prod**
>         Consumer and producer pointers for the data to be written to the 
> socket. They
>         keep track of how much data has been written by the frontend to 
> **out** and
>         how much data has already been consumed by the backend. **out_prod** 
> is
>         increased by the frontend, after writing data to **out**. 
> **out_cons** is
>         increased by the backend, after reading data from **out**.
>       - **in_error** and **out_error** They signal errors when reading from 
> the socket
>         (**in_error**) or when writing to the socket (**out_error**). 0 means 
> no
>         errors. When an error occurs, no further reads or writes operations 
> are
>         performed on the socket. In the case of an orderly socket shutdown 
> (i.e. read
>         returns 0) **in_error** is set to ENOTCONN. **in_error** and 
> **out_error**
>         are never set to EAGAIN or EWOULDBLOCK.
> 
>       The binary layout of `struct pvcalls_data_intf` follows:
> 
>           0         4         8         12        16        20        24      
>   28
>           
> +---------+---------+---------+---------+---------+---------+----------+
>           | in_cons | in_prod |out_cons |out_prod |in_error 
> |out_error|ring_order|
>           
> +---------+---------+---------+---------+---------+---------+----------+
> 
>           28        32        36        40        4092     4096
>           +---------+---------+---------+----//---+---------+
>           |  ref[0] |  ref[1] |  ref[2] |         |  ref[N] |
>           +---------+---------+---------+----//---+---------+
> 
>       The binary layout of the ring buffers follow:
> 
>           0         ((1<<ring_order)<<PAGE_SHIFT)/2       
> ((1<<ring_order)<<PAGE_SHIFT)
>           +------------//-------------+------------//-------------+
>           |            in             |           out             |
>           +------------//-------------+------------//-------------+
> 
>       #### Workflow
> 
>       The **in** and **out** arrays are used as circular buffers:
> 
>           0                               sizeof(array) == 
> ((1<<ring_order)<<PAGE_SHIFT)/2
>           +-----------------------------------+
>           |to consume|    free    |to consume |
>           +-----------------------------------+
>                      ^            ^
>                      prod         cons
> 
>           0                               sizeof(array)
>           +-----------------------------------+
>           |  free    | to consume |   free    |
>           +-----------------------------------+
>                      ^            ^
>                      cons         prod
> 
>       The following function is provided to calculate how many bytes are 
> currently
>       left unconsumed in an array:
> 
>           #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & (ring_size-1))
> 
>           static inline PVCALLS_RING_IDX pvcalls_ring_queued(PVCALLS_RING_IDX 
> prod,
>                       PVCALLS_RING_IDX cons,
>                       PVCALLS_RING_IDX ring_size)
>           {
>               PVCALLS_RING_IDX size;
> 
>               if (prod == cons)
>                       return 0;
> 
>               prod = _MASK_PVCALLS_IDX(prod, ring_size);
>               cons = _MASK_PVCALLS_IDX(cons, ring_size);
> 
>               if (prod == cons)
>                       return ring_size;
> 
>               if (prod > cons)
>                       size = prod - cons;
>               else {
>                       size = ring_size - cons;
>                       size += prod;
>               }
>               return size;
>           }
> 
>       The producer (the backend for **in**, the frontend for **out**) writes 
> to the
>       array in the following way:
> 
>       - read *cons*, *prod*, *error* from shared memory
>       - memory barrier
>       - return on *error*
>       - write to array at position *prod* up to *cons*, wrapping around the 
> circular
>         buffer when necessary
>       - memory barrier
>       - increase *prod*
>       - notify the other end via evtchn
> 
>       The consumer (the backend for **out**, the frontend for **in**) reads 
> from the
>       array in the following way:
> 
>       - read *prod*, *cons*, *error* from shared memory
>       - memory barrier
>       - return on *error*
>       - read from array at position *cons* up to *prod*, wrapping around the 
> circular
>         buffer when necessary
>       - memory barrier
>       - increase *cons*
>       - notify the other end via evtchn
> 
>       The producer takes care of writing only as many bytes as available in 
> the buffer
>       up to *cons*. The consumer takes care of reading only as many bytes as 
> available
>       in the buffer up to *prod*. *error* is set by the backend when an error 
> occurs
>       writing or reading from the socket.
> 
> 
>       [address]: 
> http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
>       [in]: 
> http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
>       [socket]: 
> http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
>       [connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
>       [shutdown]: 
> http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
>       [bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
>       [listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
>       [accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
>       [poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
> 
>       _______________________________________________
>       Xen-devel mailing list
>       Xen-devel@xxxxxxxxxxxxx
>       https://lists.xen.org/xen-devel
> 
> 
> 
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.