
Re: [Xen-devel] [DOC v2] Xen transport for 9pfs



On Wed, 4 Jan 2017, Oleksandr Andrushchenko wrote:
> If this is not too late for comments...
> 
> On 12/06/2016 03:33 AM, Stefano Stabellini wrote:
> > Changes in v2:
> > - fix copy/paste error
> > - rename ring-ref-<num> to ring-ref<num>
> > - fix memory barriers
> > - add "verify prod/cons against local copy"
> > - add a paragraph on high level design
> > - add a note on the maximum possible max-ring-page-order value
> > - add mechanisms to avoid unnecessary evtchn notifications
> > 
> > ---
> > 
> > # Xen transport for 9pfs version 1
> > 
> > ## Background
> > 
> > 9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
> > simple and describes a series of commands and responses. It is
> > completely independent from the communication channels, in fact many
> > clients and servers support multiple channels, usually called
> > "transports". For example the Linux client supports tcp and unix
> > sockets, fds, virtio and rdma.
> > 
> > 
> > ### 9pfs protocol
> > 
> > This document won't cover the full 9pfs specification. Please refer to
> > this [paper] and this [website] for a detailed description of it.
> > However it is useful to know that each 9pfs request and response has the
> > following header:
> > 
> >      struct header {
> >             uint32_t size;
> >             uint8_t id;
> >             uint16_t tag;
> >      } __attribute__((packed));
> As per my previous experience with sndif/displif
> 
> __attribute__((packed)); is not expected to be in a generic
> protocol

That's right, but this is a description of the existing 9pfs protocol,
there is nothing I can do about that.
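
As an illustration, an implementation that prefers not to rely on the
packed attribute can decode the seven header bytes one at a time. This is
only a minimal sketch, assuming the little-endian integer encoding that
9P uses on the wire; `p9_hdr`, `p9_hdr_decode` and `P9_HDR_SIZE` are
names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wire size of the header: 4 (size) + 1 (id) + 2 (tag) bytes. */
#define P9_HDR_SIZE 7

/* Hypothetical host-side view of the header; padding is irrelevant here
 * because the struct is never copied to or from the wire directly. */
struct p9_hdr {
    uint32_t size;
    uint8_t  id;
    uint16_t tag;
};

/* Decode the 7-byte header from raw bytes (little-endian fields). */
static inline struct p9_hdr p9_hdr_decode(const uint8_t *b)
{
    struct p9_hdr h;

    assert(b != NULL);
    h.size = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
             ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    h.id   = b[4];
    h.tag  = (uint16_t)(b[5] | (b[6] << 8));
    return h;
}
```

Decoding field by field keeps the wire layout independent of how a given
compiler pads the in-memory struct.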


> > 
> >      0         4  5    7
> >      +---------+--+----+
> >      |  size   |id|tag |
> >      +---------+--+----+
> > 
> > - *size*
> > The size of the request or response.
> > 
> > - *id*
> > The 9pfs request or response operation.
> > 
> > - *tag*
> > Unique id that identifies a specific request/response pair. It is used
> > to multiplex operations on a single channel.
> > 
> > It is possible to have multiple requests in-flight at any given time.
> > 
> > 
> > ## Rationale
> > 
> > This document describes a Xen based transport for 9pfs, in the
> > traditional PV frontend and backend format. The PV frontend is used by
> > the client to send commands to the server. The PV backend is used by the
> > 9pfs server to receive commands from clients and send back responses.
> > 
> > The transport protocol supports multiple rings up to the maximum
> > supported by the backend. The size of every ring is also configurable
> > and can span multiple pages, up to the maximum supported by the backend
> > (although it cannot be more than 2MB). The design is to exploit
> > parallelism at the vCPU level and support multiple outstanding requests
> > simultaneously.
> > 
> > This document does not cover the 9pfs client/server design or
> > implementation, only the transport for it.
> > 
> > 
> > ## Xenstore
> > 
> > The frontend and the backend connect via xenstore to exchange
> > information. The toolstack creates front and back nodes with state
> > [XenbusStateInitialising]. The protocol node name is **9pfs**.
> > 
> > Multiple rings are supported for each frontend and backend connection.
> > 
> > ### Frontend XenBus Nodes
> > 
> >      num-rings
> port and ring-ref both have indices, thus allowing to find out how
> many rings are there, so why do we need to specify it explicitly?

For clarity.


> >           Values:         <uint32_t>
> >                Number of rings. It needs to be lower than or equal to max-rings.
> >           port-<num> (port-0, port-1, etc)
> Correct me if I'm wrong, but "event-channel" is most used name in the
> protocols, not "port"

That is true. However, for reasons unknown to me, often in xenstore
protocols the event channel number is specified as "port".  Of course, I
don't have any problems changing port-<num> to event-channel-<num>.


> >           Values:         <uint32_t>
> >                The identifier of the Xen event channel used to signal
> >           activity in the ring buffer. One for each ring.
> Here you refer to port as to event channel... So, please consider
> changing it accordingly

OK


> >           ring-ref<num> (ring-ref0, ring-ref1, etc)
> >           Values:         <uint32_t>
> >                The Xen grant reference granting permission for the backend
> >           to map a page with information to set up a shared ring. One for
> >           each ring.
> > 
> > ### Backend XenBus Nodes
> > 
> > Backend specific properties, written by the backend, read by the
> > frontend:
> > 
> >      version
> >           Values:         <uint32_t>
> >                Protocol version supported by the backend. Currently the
> >           value is 1.
> >           max-rings
> >           Values:         <uint32_t>
> >                The maximum supported number of rings.
> Per frontend? If not, how does a frontend know how many
> it is allowed to use?

That's right, per frontend. I'll clarify it.


> >           max-ring-page-order
> Are there any specific requirements that this is order, not size?
> IMHO size allows better control on memory allocations and
> gives more flexibility. The only requirement on size I see is that
> it should be even value (because you divide allocated space for
> in and out)

That's a good question. It needs to be an order because the number of
pages has to be a power of 2 for the indices to work correctly.
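
To make the power-of-2 requirement concrete, here is a minimal sketch.
It assumes 4 KiB pages and 32-bit indices that are masked when used;
`ring_array_size`, `mask_idx`, `queued` and `SKETCH_PAGE_SHIFT` are
illustrative names, not part of the protocol.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t XEN_9PFS_RING_IDX;

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Bytes in each of the in/out arrays for a given ring order. */
static inline uint32_t ring_array_size(uint32_t ring_order)
{
    return ((UINT32_C(1) << ring_order) << SKETCH_PAGE_SHIFT) / 2;
}

/* Map an index to an array offset; only valid for power-of-two sizes. */
static inline uint32_t mask_idx(XEN_9PFS_RING_IDX idx, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return idx & (size - 1);
}

/* Bytes queued between cons and prod; unsigned arithmetic keeps the
 * result correct even after prod has wrapped past 2^32. */
static inline uint32_t queued(XEN_9PFS_RING_IDX prod, XEN_9PFS_RING_IDX cons)
{
    return prod - cons;
}
```

With a size that is not a power of 2 (three pages, say), the mask would
no longer be equivalent to a modulo, and the offset would jump when a
32-bit index wraps around.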


> >           Values:         <uint32_t>
> >                The maximum supported size of a memory allocation in units of
> >           lb(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages,
> >           etc.
> > 
> > Backend configuration nodes, written by the toolstack, read by the
> > backend:
> > 
> >      path
> >           Values:         <string>
> >                Host filesystem path to share.
> >           tag
> >           Values:         <string>
> >                Alphanumeric tag that identifies the 9pfs share. The client
> >           needs to know the tag to be able to mount it.
> >           security_model
> >           Values:         "none"
> >                *none*: files are stored using the same credentials as they
> >                   are created on the guest
> >           Only "none" is supported in this version of the protocol.
> > 
> > 
> > ### State Machine
> > 
> > Initialization:
> > 
> >      *Front*                               *Back*
> >      XenbusStateInitialising               XenbusStateInitialising
> >      - Query virtual device                - Query backend device
> >        properties.                           identification data.
> >      - Setup OS device instance.           - Publish backend features
> >      - Allocate and initialize the           and transport parameters
> >        request ring.                                      |
> >      - Publish transport parameters                       |
> >        that will be in effect during                      V
> >        this connection.                            XenbusStateInitWait
> >                   |
> >                   |
> >                   V
> >         XenbusStateInitialised
> > 
> >                                            - Query frontend transport
> >                                              parameters.
> >                                            - Connect to the request ring and
> >                                              event channel.
> >                                                           |
> >                                                           |
> >                                                           V
> >                                                   XenbusStateConnected
> > 
> >       - Query backend device properties.
> >       - Finalize OS virtual device
> >         instance.
> >                   |
> >                   |
> >                   V
> >          XenbusStateConnected
> > 
> > Once frontend and backend are connected, they have a shared page per
> > ring, which is used to set up the ring, and an event channel per ring,
> > which is used to send notifications.
> > 
> > Shutdown:
> > 
> >      *Front*                            *Back*
> >      XenbusStateConnected               XenbusStateConnected
> >                  |
> >                  |
> >                  V
> >         XenbusStateClosing
> > 
> >                                         - Unmap grants
> >                                         - Unbind evtchns
> >                                                   |
> >                                                   |
> >                                                   V
> >                                           XenbusStateClosing
> > 
> >      - Unbind evtchns
> >      - Free rings
> >      - Free data structures
> >                 |
> >                 |
> >                 V
> >         XenbusStateClosed
> > 
> >                                         - Free remaining data structures
> >                                                   |
> >                                                   |
> >                                                   V
> >                                           XenbusStateClosed
> > 
> > 
> > ## Ring Setup
> > 
> > The shared page has the following layout:
> > 
> >      typedef uint32_t XEN_9PFS_RING_IDX;
> > 
> >      struct xen_9pfs_intf {
> >             XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
> >             XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
> >             uint32_t ring_order;
> >             grant_ref_t ref[];
> Please consider changing ref[] to ref[1]: there are concerns
> I faced that not all of the compilers can handle that

OK, but I'll add a comment that the number of elements in the array is
not actually 1.
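
Something along these lines, as a sketch only. `grant_ref_t` is written
out as a 32-bit integer so that the example is self-contained, and
`xen_9pfs_intf_bytes` is a hypothetical helper. Note that with `ref[1]`
the size has to be computed from `offsetof`, not `sizeof`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t XEN_9PFS_RING_IDX;
typedef uint32_t grant_ref_t;   /* 32-bit, as in Xen's grant table header */

struct xen_9pfs_intf {
    XEN_9PFS_RING_IDX in_cons, in_prod, in_event;
    XEN_9PFS_RING_IDX out_cons, out_prod, out_event;
    uint32_t ring_order;
    /* Really (1 << ring_order) entries; [1] is only for portability. */
    grant_ref_t ref[1];
};

/* Hypothetical helper: bytes of the shared page in use for an order. */
static inline size_t xen_9pfs_intf_bytes(uint32_t ring_order)
{
    assert(ring_order <= 9);   /* limit given in the N.B. of the draft */
    return offsetof(struct xen_9pfs_intf, ref) +
           ((size_t)1 << ring_order) * sizeof(grant_ref_t);
}
```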


> >      };
> > 
> >      /* not actually C compliant (ring_order changes from ring to ring) */
> >      struct ring_data {
> >          char in[((1 << ring_order) << PAGE_SHIFT) / 2];
> >          char out[((1 << ring_order) << PAGE_SHIFT) / 2];
> >      };
> > 
> > - **ring_order**
> >    It represents the order of the data ring. The following list of grant
> >    references is of `(1 << ring_order)` elements. It cannot be greater than
> >    **max-ring-page-order**, as specified by the backend on XenBus.
> > - **ref[]**
> >    The list of grant references which will contain the actual data. They are
> >    mapped contiguously in virtual memory. The first half of the pages is the
> >    **in** array, the second half is the **out** array. The array must
> >    have a power of two number of elements.
> > - **out** is an array used as circular buffer
> >    It contains client requests. The producer is the frontend, the
> >    consumer is the backend.
> > - **in** is an array used as circular buffer
> >    It contains server responses. The producer is the backend, the
> >    consumer is the frontend.
> > - **out_cons**, **out_prod**
> >    Consumer and producer indices for client requests. They keep track of
> >    how much data has been written by the frontend to **out** and how much
> >    data has already been consumed by the backend. **out_prod** is
> >    increased by the frontend, after writing data to **out**. **out_cons**
> >    is increased by the backend, after reading data from **out**.
> > - **in_cons** and **in_prod**
> >    Consumer and producer indices for responses. They keep track of how
> >    much data has been written by the backend to **in** and how much data
> >    has already been consumed by the frontend. **in_prod** is increased by
> >    the backend, after writing data to **in**.  **in_cons** is increased by
> >    the frontend, after reading data from **in**.
> > 
> > The binary layout of `struct xen_9pfs_intf` follows:
> > 
> >      0         4         8         12        16        20         24        28
> >      +---------+---------+---------+---------+---------+----------+---------+
> >      | in_cons | in_prod |in_event |out_cons |out_prod |out_event |ring_orde|
> >      +---------+---------+---------+---------+---------+----------+---------+
> > 
> >      28        32        36      4092      4096
> >      +---------+---------+----//---+---------+
> >      |  ref[0] |  ref[1] |         |  ref[N] |
> >      +---------+---------+----//---+---------+
> > 
> > **N.B** For one page, N is maximum 1017 ((4096-28)/4), but given that N
> > needs to be a power of two, actually max N is 512. As 512 == (1 << 9),
> > the maximum possible max-ring-page-order value is 9.
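
The arithmetic in the note above can be checked with a few lines of C. A
minimal sketch; `max_ring_page_order` is a name invented for this
example, and the 4096/28/4 figures are the ones from the note.

```c
#include <assert.h>
#include <stdint.h>

/* Largest order such that (1 << order) grant references of ref_size
 * bytes fit after the hdr_size-byte header in a page_size-byte page. */
static inline unsigned max_ring_page_order(uint32_t page_size,
                                           uint32_t hdr_size,
                                           uint32_t ref_size)
{
    assert(ref_size != 0 && page_size >= hdr_size + ref_size);
    uint32_t max_refs = (page_size - hdr_size) / ref_size;
    unsigned order = 0;

    /* Keep doubling while the next power of two still fits. */
    while ((UINT32_C(2) << order) <= max_refs)
        order++;
    return order;
}
```

For a 4096-byte page this gives order 9, that is 512 data pages of 4 KiB
each, which is where the 2MB upper bound on the ring size comes from.
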
> > 
> > The binary layout of the ring buffers follows:
> > 
> >      0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
> >      +------------//-------------+------------//-------------+
> >      +------------//-------------+------------//-------------+
> >      |            in             |           out             |
> >      +------------//-------------+------------//-------------+
> > 
> > 
> > ## Ring Usage
> > 
> > The **in** and **out** arrays are used as circular buffers:
> > 
> >      0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
> >      +-----------------------------------+
> >      |to consume|    free    |to consume |
> >      +-----------------------------------+
> >                 ^            ^
> >                 prod         cons
> > 
> >      0                               sizeof(array)
> >      +-----------------------------------+
> >      |  free    | to consume |   free    |
> >      +-----------------------------------+
> >                 ^            ^
> >                 cons         prod
> > 
> > The following functions are provided to read and write to an array:
> > 
> >      #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))
> I am really concerned on memcpy in both read and write and no way
> to implement zero-copy: please consider extending the API with something
> like get_{read|write}_ptr. As we are dealing with a circular buffer, then
> there are cases when we end up having 2 chunks (at most), so kind of
> simplified scatter-gather may work.
> BTW, do you have some use-cases in mind (on front and back side),
> which will clarify if further memcpy avoidance can be reached?

The two functions below, xen_9pfs_read and xen_9pfs_write, are just for
convenience. Of course, nothing prevents a frontend or a backend from
accessing the data on the ring directly. The binary layout is clearly
specified, so it is not an issue. I have done it myself while writing
prototypes. However, people should be aware that data on the ring is not
protected from the other end: the frontend (or the backend) could be
changing it at any time. Usually the threat model in Xen deployments is
that frontends are untrusted while backends are more trusted, or fully
trusted.  For this reason, I discourage backends from reading data
directly. Frontends could do so, if they trust the backend.
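
To illustrate the concern, a backend would typically copy the bytes it
is about to interpret into private memory first (two chunks at most, as
you say) and only then validate them. A minimal sketch, assuming a fixed
4096-byte array and little-endian fields; `ring_snapshot`,
`size_is_sane` and `SKETCH_RING_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RING_SIZE 4096u   /* assumed; must be a power of two */

/* Copy len bytes starting at masked offset cons out of the shared array
 * into private memory, handling the wrap as at most two chunks. */
static inline void ring_snapshot(const uint8_t *ring, uint32_t cons,
                                 uint8_t *dst, uint32_t len)
{
    assert(cons < SKETCH_RING_SIZE && len <= SKETCH_RING_SIZE);
    uint32_t first = SKETCH_RING_SIZE - cons;   /* bytes before the wrap */

    if (first > len)
        first = len;
    memcpy(dst, ring + cons, first);
    memcpy(dst + first, ring, len - first);
}

/* Validate the private copy only: the peer can no longer change it.
 * The size field counts the whole message, header included. */
static inline int size_is_sane(const uint8_t *hdr_copy, uint32_t queued)
{
    uint32_t size = (uint32_t)hdr_copy[0] | ((uint32_t)hdr_copy[1] << 8) |
                    ((uint32_t)hdr_copy[2] << 16) |
                    ((uint32_t)hdr_copy[3] << 24);

    return size >= 7 && size <= queued;
}
```

Checking *size* in place on the ring and then reading it again later
would leave a window in which the other end can change the value.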


> >      static inline void xen_9pfs_read(char *buf,
> >                     XEN_9PFS_RING_IDX *masked_prod,
> >                     XEN_9PFS_RING_IDX *masked_cons,
> >                     uint8_t *h, size_t len) {
> >             if (*masked_cons < *masked_prod) {
> >                     memcpy(h, buf + *masked_cons, len);
> >             } else {
> >                     if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
> >                             memcpy(h, buf + *masked_cons,
> >                                    XEN_9PFS_RING_SIZE - *masked_cons);
> >                             memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons,
> >                                    buf,
> >                                    len - (XEN_9PFS_RING_SIZE - *masked_cons));
> >                     } else {
> >                             memcpy(h, buf + *masked_cons, len);
> >                     }
> >             }
> >             *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
> >      }
> > 
> >      static inline void xen_9pfs_write(char *buf,
> >                     XEN_9PFS_RING_IDX *masked_prod,
> >                     XEN_9PFS_RING_IDX *masked_cons,
> >                     uint8_t *opaque, size_t len) {
> >             if (*masked_prod < *masked_cons) {
> >                     memcpy(buf + *masked_prod, opaque, len);
> >             } else {
> >                     if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
> >                             memcpy(buf + *masked_prod, opaque,
> >                                    XEN_9PFS_RING_SIZE - *masked_prod);
> >                             memcpy(buf,
> >                                    opaque + (XEN_9PFS_RING_SIZE - *masked_prod),
> >                                    len - (XEN_9PFS_RING_SIZE - *masked_prod));
> >                     } else {
> >                             memcpy(buf + *masked_prod, opaque, len);
> >                     }
> >             }
> >             *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
> >      }
> > 
> > The producer (the backend for **in**, the frontend for **out**) writes to
> > the array in the following way:
> > 
> > - read *cons*, *prod* from shared memory
> > - general memory barrier
> > - verify *prod* against local copy (consumer shouldn't change it)
> > - write to array at position *prod* up to *cons*, wrapping around the
> >    circular buffer when necessary
> > - write memory barrier
> > - increase *prod*
> > - notify the other end via evtchn
> > 
> > The consumer (the backend for **out**, the frontend for **in**) reads from
> > the array in the following way:
> > 
> > - read *prod*, *cons* from shared memory
> > - read memory barrier
> > - verify *cons* against local copy (producer shouldn't change it)
> > - read from array at position *cons* up to *prod*, wrapping around the
> >    circular buffer when necessary
> > - general memory barrier
> > - increase *cons*
> > - notify the other end via evtchn, if *event* == 1
> > - general memory barrier
> > - read *prod* again from shared memory to check for new requests
> > 
> > The producer takes care of writing only as many bytes as available in the
> > buffer up to *cons*. The consumer takes care of reading only as many bytes
> > as available in the buffer up to *prod*.
> > 
> > To avoid unnecessary notifications, the consumer only issues an evtchn
> > notification if the **event** field (**in_event** or **out_event**), has
> > been set to **1**. In fact the producer doesn't usually require any
> > notifications, but if it is necessary, for example because the producer
> > is forced to wait because the ring is full, then it can request to be
> > notified by the consumer by setting **in_event** or **out_event**,
> > depending on the ring. After receiving the notification, the producer
> > can reset *event*.
> > 
> > The producer always notifies the consumer after incrementing **prod**.
> > However in some circumstances the producer is allowed not to notify the
> > consumer, just as a performance improvement, and still maintain
> > correctness. These are the steps to do it: after incrementing *prod*,
> > the producer reads *cons* a second time; if the value is changed from
> > the previous read and it is different from *prod* before the update,
> > then the notification can be avoided. These are the producer steps, with
> > the optimization:
> > 
> > - read *prod* (old_prod), *cons* (old_cons) from shared memory
> > - general memory barrier
> > - verify *prod* against local copy (consumer shouldn't change it)
> > - write to array at position *prod* up to *cons*, wrapping around the
> >    circular buffer when necessary
> > - write memory barrier
> > - increase *prod* (new_prod)
> > - general memory barrier
> > - read *cons* (new_cons)
> > - if new_cons == old_cons or new_cons == old_prod, then notify the
> >    consumer
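
The optimized producer steps listed above could be sketched as follows.
This is an illustration under assumptions, not reference code: C11
fences stand in for the platform's barrier primitives, the data copy is
elided, and `ring_idx`, `producer_must_notify` and `producer_publish`
are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t XEN_9PFS_RING_IDX;

/* Hypothetical stand-in for the cons/prod pair of one direction. */
struct ring_idx {
    XEN_9PFS_RING_IDX cons, prod;
};

/* Notify unless cons moved and the consumer had not caught up with the
 * old prod (in which case it is presumably still running). */
static inline bool producer_must_notify(XEN_9PFS_RING_IDX old_prod,
                                        XEN_9PFS_RING_IDX old_cons,
                                        XEN_9PFS_RING_IDX new_cons)
{
    return new_cons == old_cons || new_cons == old_prod;
}

/* Returns -1 if prod was tampered with, 1 to notify, 0 to skip. */
static inline int producer_publish(struct ring_idx *shared,
                                   XEN_9PFS_RING_IDX local_prod,
                                   uint32_t len)
{
    assert(shared != NULL);
    XEN_9PFS_RING_IDX old_prod = shared->prod;
    XEN_9PFS_RING_IDX old_cons = shared->cons;

    atomic_thread_fence(memory_order_seq_cst);   /* general barrier */
    if (old_prod != local_prod)
        return -1;                               /* consumer changed prod */
    /* ... copy len bytes into the array at old_prod here ... */
    atomic_thread_fence(memory_order_release);   /* write barrier */
    shared->prod = old_prod + len;
    atomic_thread_fence(memory_order_seq_cst);   /* general barrier */
    return producer_must_notify(old_prod, old_cons, shared->cons) ? 1 : 0;
}
```
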
> > 
> > 
> > ## Request/Response Workflow
> > 
> > The client chooses one of the available rings, then it sends a request
> > to the other end on the *out* array, following the producer workflow
> > described in [Ring Usage].
> I believe it is allowed to send part of the conversation via different rings?
> E.g. request/response are identified by the tag, so back/front may
> use different rings for the same "session"? Could you please explicitly
> describe this scenario if it is allowed or not?

It is not allowed, thanks for asking. A request and response pair,
identified by the same tag, should be on the same ring. I'll clarify it.


> > The server receives the notification and reads the request, following
> > the consumer workflow described in [Ring Usage]. The server knows how
> > much to read because it is specified in the *size* field of the 9pfs
> > header. The server processes the request and sends back a response on
> > the *in* array of the same ring, following the producer workflow as
> > usual.
> > 
> > The client receives a notification and reads the response from the *in*
> > array. The client knows how much data to read because it is specified in
> > the *size* field of the 9pfs header.
> > 
> > 
> > [paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
> > [website]: https://github.com/chaos/diod/blob/master/protocol.md
> > [XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxx
> > https://lists.xen.org/xen-devel
> Thank you,
> Oleksandr
> 
