[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3] RFC: extend the xenstore ring with a 'closing' signal



Hi Paul,

On 4 Jul 2014, at 09:21, Paul Durrant <Paul.Durrant@xxxxxxxxxx> wrote:

>> -----Original Message-----
>> From: David Scott [mailto:dave.scott@xxxxxxxxxx]
>> Sent: 03 July 2014 16:15
>> To: xen-devel@xxxxxxxxxxxxxxxxxxxx
>> Cc: Paul Durrant; Ian Campbell; john.liuqiming@xxxxxxxxxx;
>> yanqiangjun@xxxxxxxxxx; Andrew Cooper; Dave Scott
>> Subject: [PATCH v3] RFC: extend the xenstore ring with a 'closing' signal
>> 
>> Currently hvmloader uses the xenstore ring and then tries to
>> reset it back to its initial state. This is not part of the
>> ring protocol and, if xenstored reads the ring while it is
>> happening, xenstored will conclude it is corrupted. A corrupted
>> ring will prevent PV drivers from connecting. This seems to
>> be a rare failure.
>> 
>> Furthermore, when a VM crashes it may jump to a 'crash kernel'
>> to create a diagnostic dump. Without the ability to safely
>> reset the ring the PV drivers won't be able to reliably
>> establish connections, to (for example) stream a memory dump to
>> disk.
>> 
>> This prototype patch contains a simple extension of the
>> xenstore ring structure, enough to contain version numbers and
>> a 'closing' flag. This memory is currently unused within the
>> 4k page and should be zeroed as part of the domain setup
>> process. The oxenstored server advertises version 1, and
>> hvmloader detects this and sets the 'closing' flag. The server
>> then reinitialises the ring, filling it with obviously invalid
>> data to help debugging (unfortunately blocks of zeroes are
>> valid xenstore packets) and signals hvmloader by the event
>> channel that it's safe to boot the guest OS.
>> 
>> Updates since v2 (thanks to Andy Cooper for the review):
>> * hvmloader: use volatile for read of closing flag
>> * style improvements
>> * remove xenstore version #defines
>> 
>> Updates since v1 (thanks to Paul Durrant for the review):
>> * remove unused client version and associated dead code
>> * treat 'closing' as a flag by using "!=0" rather than "==1"
>> * hvmloader: move function up and remove forward decl
>> * document the existing xenstore ring and the extention in misc/
>> 
>> Signed-off-by: David Scott <dave.scott@xxxxxxxxxx>
>> ---
>> docs/misc/xenstore-ring.txt         |   79
>> +++++++++++++++++++++++++++++++++++
>> tools/firmware/hvmloader/xenbus.c   |   39 +++++++++--------
>> tools/ocaml/libs/xb/xb.ml           |   26 +++++++++++-
>> tools/ocaml/libs/xb/xb.mli          |    1 +
>> tools/ocaml/libs/xb/xs_ring.ml      |   11 +++++
>> tools/ocaml/libs/xb/xs_ring_stubs.c |   63
>> +++++++++++++++++++++++++++-
>> xen/include/public/io/xs_wire.h     |    5 +++
>> 7 files changed, 203 insertions(+), 21 deletions(-)
>> create mode 100644 docs/misc/xenstore-ring.txt
>> 
>> diff --git a/docs/misc/xenstore-ring.txt b/docs/misc/xenstore-ring.txt
>> new file mode 100644
>> index 0000000..df2a09f
>> --- /dev/null
>> +++ b/docs/misc/xenstore-ring.txt
>> @@ -0,0 +1,79 @@
>> +
>> +The domain builder allocates a single page and shares it with the xenstore
>> +daemon. This document describes the ring protocol used to communicate
>> via
>> +this page which is used to transmit and receive
>> +[xenstore protocol packets](xenstore.txt).
>> +
>> +In the original version (we call this "version 0"), the shared page has the
>> +following contents (all offsets and lengths are in octets):
>> +
>> +Offset  Length  Description
>> +-----------------------------------------------------------------
>> +0       1024    Requests: data sent to the xenstore daemon
>> +1024    1024    Replies: data sent to the domain
>> +2048    4       Request consumer offset
>> +2052    4       Request producer offset
>> +2056    4       Response consumer offset
>> +2060    4       Response producer offset
>> +
>> +When the page is allocated it is guaranteed to be full of zeroes.
>> +
>> +The "Requests" and "Replies" are treated as circular buffers, one for
>> +each direction. Each circular buffer is associated wth a producer and
>> +consumer offset, which are free-running counters starting from 0. A
>> "producer"
>> +offset is the offset in the byte stream of the next byte to be written; a
>> +"consumer" offset is the offset in the byte stream of the next byte to be
>> +read. The byte at offset 'x' in the byte stream will be stored in
>> +offset 'x mod 1024' in the circular buffer. "producer - consumer" gives
>> +the number of bytes still to be read, and "1024 - (producer - consumer)"
>> +therefore gives the amount of space currently available for writing,
>> +where we must avoid overwriting unread data.
>> +
>> +The circular buffers are only used to send sequences of bytes between
>> domains.
>> +It is the responsibility of the layer above to segment these bytes into
>> +packets, as described in [xenstore.txt](xenstore.txt).
>> +
>> +The client and server domains can run concurrently on different cores and
>> +different sockets. We must therefore take care to avoid corruption by:
>> +
>> +  1. using atomic loads and stores when reading and writing the producer
>> +     and consumer offsets. If an offset were to be updated by a non-atomic
>> +     store then the other domain may read an invalid offset value.
>> +  2. ensuring request and reply data is fully read or written before
>> +     updating the offsets by issuing a memory barrier.
>> +
>> +If we wish to read data but find the circular buffer empty, or wish to write
>> +data and find the circular buffer full, we wait on a pre-arranged event
>> +channel. When the other party next reads or writes data to the ring, it
>> +will notify() this event channel and we will wake.
>> +
>> +Protocol extension for reconnection
>> +-----------------------------------
>> +
>> +In version 0 of the protocol it is not possible to close and reopen the
>> +connection. This means that if the ring is corrupted, it can never be used
>> +(safely) again. Extending the protocol to allow reconnection would allow
>> +us to:
>> +
>> +  1. use the ring in the firmware (hvmloader) and safely reset it for use
>> +     by the guest
>> +  2. re-establish a ring in a 'crash kernel' environment, allowing us to
>> +     write crash dumps to PV disks or network devices.
>> +
>> +In version 1 of the protocol the ring is extended with the following
>> +fields:
>> +
>> +Offset  Length  Description
>> +-----------------------------------------------------------------
>> +2064    4       Xenstore daemon supported protocol version
>> +2068    4       Close request flag
>> +
>> +In a system supporting only version 0, both these fields are guaranteed
>> +to contain 0 by the domain builder.
>> +
>> +In a system supporting version 1, the xenstore daemon will write "1" into
>> +the support protocol version field. The guest xenstore client (eg in
> 
> s/support/supported
> 
>> +hvmloader) can query the version, and if it is set to "1" it can write
>> +"1" to the close request flag and call notify(). The supporting xenstore
> 
> Perhaps, rather than 'call notify()' this should be 'signal the store event 
> channel'?
> 
>> +daemon can reset the ring to an empty state and signal completion by
>> +clearing the flag and calling notify() again.
> 
> Same here.
> 
>> diff --git a/tools/firmware/hvmloader/xenbus.c
>> b/tools/firmware/hvmloader/xenbus.c
>> index fe72e97..f85832c 100644
>> --- a/tools/firmware/hvmloader/xenbus.c
>> +++ b/tools/firmware/hvmloader/xenbus.c
>> @@ -37,6 +37,19 @@ static struct xenstore_domain_interface *rings; /*
>> Shared ring with dom0 */
>> static evtchn_port_t event;                     /* Event-channel to dom0 */
>> static char payload[XENSTORE_PAYLOAD_MAX + 1];  /* Unmarshalling area
>> */
>> 
>> +static void ring_wait(void)
>> +{
>> +    struct shared_info *shinfo = get_shared_info();
>> +    struct sched_poll poll;
>> +
>> +    memset(&poll, 0, sizeof(poll));
>> +    set_xen_guest_handle(poll.ports, &event);
>> +    poll.nr_ports = 1;
>> +
>> +    while ( !test_and_clear_bit(event, shinfo->evtchn_pending) )
>> +        hypercall_sched_op(SCHEDOP_poll, &poll);
>> +}
>> +
>> /* Connect our xenbus client to the backend.
>>  * Call once, before any other xenbus actions. */
>> void xenbus_setup(void)
>> @@ -68,10 +81,15 @@ void xenbus_shutdown(void)
>> 
>>     ASSERT(rings != NULL);
>> 
>> -    /* We zero out the whole ring -- the backend can handle this, and it's
>> -     * not going to surprise any frontends since it's equivalent to never
>> -     * having used the rings. */
>> -    memset(rings, 0, sizeof *rings);
>> +    if (rings->server_version > 0) {
>> +        rings->closing = 1;
>> +        while (*(volatile uint32_t*)&rings->closing == 1)
>> +            ring_wait ();
>> +    } else
>> +        /* If the backend reads the state while we're erasing it then the
>> +         * ring state will become corrupted, preventing guest frontends from
>> +         * connecting. This is rare. */
>> +        memset(rings, 0, sizeof *rings);
>> 
>>     /* Clear the event-channel state too. */
>>     memset(shinfo->vcpu_info, 0, sizeof(shinfo->vcpu_info));
>> @@ -81,19 +99,6 @@ void xenbus_shutdown(void)
>>     rings = NULL;
>> }
>> 
>> -static void ring_wait(void)
>> -{
>> -    struct shared_info *shinfo = get_shared_info();
>> -    struct sched_poll poll;
>> -
>> -    memset(&poll, 0, sizeof(poll));
>> -    set_xen_guest_handle(poll.ports, &event);
>> -    poll.nr_ports = 1;
>> -
>> -    while ( !test_and_clear_bit(event, shinfo->evtchn_pending) )
>> -        hypercall_sched_op(SCHEDOP_poll, &poll);
>> -}
>> -
>> /* Helper functions: copy data in and out of the ring */
>> static void ring_write(const char *data, uint32_t len)
>> {
>> diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
>> index 29d354d..d5cd776 100644
>> --- a/tools/ocaml/libs/xb/xb.ml
>> +++ b/tools/ocaml/libs/xb/xb.ml
>> @@ -84,7 +84,26 @@ let write con s len =
>>      | Fd backfd     -> write_fd backfd con s len
>>      | Xenmmap backmmap -> write_mmap backmmap con s len
>> 
>> -let output con =
>> +(* If a function throws Xs_ring.Closing, then clear the ring state
>> +   and serve the ring again. *)
>> +let rec handle_closing f con =
>> +    match (try Some (f con) with Xs_ring.Closing -> None) with
>> +    | Some x -> x
>> +    | None ->
>> +            begin match con.backend with
>> +            | Fd _ -> raise Xs_ring.Closing (* should never happen, but
>> just in case *)
>> +            | Xenmmap backend ->
>> +                    Xs_ring.close backend.mmap;
>> +                    backend.eventchn_notify ();
>> +                    (* Clear our old connection state *)
>> +                    Queue.clear con.pkt_in;
>> +                    Queue.clear con.pkt_out;
>> +                    con.partial_in <- init_partial_in ();
>> +                    con.partial_out <- "";
>> +                    handle_closing f con
>> +            end
>> +
>> +let output = handle_closing (fun con ->
>>      (* get the output string from a string_of(packet) or partial_out *)
>>      let s = if String.length con.partial_out > 0 then
>>                      con.partial_out
>> @@ -101,8 +120,9 @@ let output con =
>>      );
>>      (* after sending one packet, partial is empty *)
>>      con.partial_out = ""
>> +)
>> 
>> -let input con =
>> +let input = handle_closing (fun con ->
>>      let newpacket = ref false in
>>      let to_read =
>>              match con.partial_in with
>> @@ -133,6 +153,7 @@ let input con =
>>                      HaveHdr (Partial.of_string buf) else NoHdr (i - sz, buf)
>>      );
>>      !newpacket
>> +)
>> 
>> let newcon backend = {
>>      backend = backend;
>> @@ -145,6 +166,7 @@ let newcon backend = {
>> let open_fd fd = newcon (Fd { fd = fd; })
>> 
>> let open_mmap mmap notifyfct =
>> +    Xs_ring.set_server_version mmap 1; (* defined in xs_wire.h *)
>>      newcon (Xenmmap {
>>              mmap = mmap;
>>              eventchn_notify = notifyfct;
>> diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
>> index 58234ae..7f65fa3 100644
>> --- a/tools/ocaml/libs/xb/xb.mli
>> +++ b/tools/ocaml/libs/xb/xb.mli
>> @@ -80,6 +80,7 @@ val read : t -> string -> int -> int
>> val write_fd : backend_fd -> 'a -> string -> int -> int
>> val write_mmap : backend_mmap -> 'a -> string -> int -> int
>> val write : t -> string -> int -> int
>> +val handle_closing : (t -> 'a) -> t -> 'a
>> val output : t -> bool
>> val input : t -> bool
>> val newcon : backend -> t
>> diff --git a/tools/ocaml/libs/xb/xs_ring.ml b/tools/ocaml/libs/xb/xs_ring.ml
>> index 9469609..d7f0fd4 100644
>> --- a/tools/ocaml/libs/xb/xs_ring.ml
>> +++ b/tools/ocaml/libs/xb/xs_ring.ml
>> @@ -14,5 +14,16 @@
>>  * GNU Lesser General Public License for more details.
>>  *)
>> 
>> +exception Closing
>> +
>> +let _ =
>> +  Callback.register_exception "Xs_ring.Closing" Closing
>> +
>> external read: Xenmmap.mmap_interface -> string -> int -> int =
>> "ml_interface_read"
>> external write: Xenmmap.mmap_interface -> string -> int -> int =
>> "ml_interface_write"
>> +
>> +
>> +external set_server_version: Xenmmap.mmap_interface -> int -> unit =
>> "ml_interface_set_server_version" "noalloc"
>> +external get_server_version: Xenmmap.mmap_interface -> int =
>> "ml_interface_get_server_version" "noalloc"
>> +
>> +external close: Xenmmap.mmap_interface -> unit = "ml_interface_close"
>> "noalloc"
>> diff --git a/tools/ocaml/libs/xb/xs_ring_stubs.c
>> b/tools/ocaml/libs/xb/xs_ring_stubs.c
>> index 8bd1047..27c98cd 100644
>> --- a/tools/ocaml/libs/xb/xs_ring_stubs.c
>> +++ b/tools/ocaml/libs/xb/xs_ring_stubs.c
>> @@ -35,19 +35,28 @@
>> 
>> #define GET_C_STRUCT(a) ((struct mmap_interface *) a)
>> 
>> +#define ERROR_UNKNOWN (-1)
>> +#define ERROR_CLOSING (-2)
>> +
>> static int xs_ring_read(struct mmap_interface *interface,
>>                              char *buffer, int len)
>> {
>>      struct xenstore_domain_interface *intf = interface->addr;
>>      XENSTORE_RING_IDX cons, prod; /* offsets only */
>>      int to_read;
>> +    uint32_t closing;
>> 
>>      cons = *(volatile uint32*)&intf->req_cons;
>>      prod = *(volatile uint32*)&intf->req_prod;
>> +    closing = *(volatile uint32*)&intf->closing;
>> +
>> +    if (closing != 0)
>> +            return ERROR_CLOSING;
>> +
>>      xen_mb();
>> 
>>      if ((prod - cons) > XENSTORE_RING_SIZE)
>> -        return -1;
>> +        return ERROR_UNKNOWN;
>> 
>>      if (prod == cons)
>>              return 0;
>> @@ -71,9 +80,15 @@ static int xs_ring_write(struct mmap_interface
>> *interface,
>>      struct xenstore_domain_interface *intf = interface->addr;
>>      XENSTORE_RING_IDX cons, prod;
>>      int can_write;
>> +    uint32_t closing;
>> 
>>      cons = *(volatile uint32*)&intf->rsp_cons;
>>      prod = *(volatile uint32*)&intf->rsp_prod;
>> +    closing = *(volatile uint32*)&intf->closing;
>> +
>> +    if (closing != 0)
>> +            return ERROR_CLOSING;
>> +
>>      xen_mb();
>>      if ( (prod - cons) >= XENSTORE_RING_SIZE )
>>              return 0;
>> @@ -97,8 +112,12 @@ CAMLprim value ml_interface_read(value interface,
>> value buffer, value len)
>> 
>>      res = xs_ring_read(GET_C_STRUCT(interface),
>>                         String_val(buffer), Int_val(len));
>> -    if (res == -1)
>> +    if (res == ERROR_UNKNOWN)
>>              caml_failwith("bad connection");
>> +
>> +    if (res == ERROR_CLOSING)
>> +
> 
> This looks a bit wrong. The blank line above should be removed and the 
> following line indented appropriately.
> 
>>      caml_raise_constant(*caml_named_value("Xs_ring.Closing"));
>> +
>>      result = Val_int(res);
>>      CAMLreturn(result);
>> }
>> @@ -111,6 +130,46 @@ CAMLprim value ml_interface_write(value interface,
>> value buffer, value len)
>> 
>>      res = xs_ring_write(GET_C_STRUCT(interface),
>>                          String_val(buffer), Int_val(len));
>> +
>> +    if (res == ERROR_CLOSING)
>> +
> 
> Same here.
> 
>>      caml_raise_constant(*caml_named_value("Xs_ring.Closing"));
>> +
>>      result = Val_int(res);
>>      CAMLreturn(result);
>> }
>> +
>> +CAMLprim value ml_interface_set_server_version(value interface, value v)
>> +{
>> +    CAMLparam2(interface, v);
>> +    struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
>>> addr;
>> +
>> +    intf->server_version = Int_val(v);
>> +
>> +    CAMLreturn(Val_unit);
>> +}
>> +
>> +CAMLprim value ml_interface_get_server_version(value interface)
>> +{
>> +    CAMLparam1(interface);
>> +    struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
>>> addr;
>> +
>> +    CAMLreturn(Val_int (intf->server_version));
>> +}
>> +
>> +CAMLprim value ml_interface_close(value interface)
>> +{
>> +    CAMLparam1(interface);
>> +    const char invalid_data[] = { 'd', 'e', 'a', 'd', 'b', 'e', 'e', 'f' };
>> +    struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
>>> addr;
>> +    int i;
>> +
>> +    intf->req_cons = intf->req_prod = intf->rsp_cons = intf->rsp_prod =
>> 0;
>> +    /* Ensure the unused space is full of invalid xenstore packets. */
>> +    for (i = 0; i < XENSTORE_RING_SIZE; i++) {
>> +            intf->req[i] = invalid_data[i % sizeof(invalid_data)];
>> +            intf->rsp[i] = invalid_data[i % sizeof(invalid_data)];
>> +    }
> 
> This does not reset the rings to their initial state (i.e. all zeroes). I 
> don't think that's necessarily a problem, but it is inconsistent.

Indeed — I should document this difference in the API.

In one case where I saw the original bug the xenstore logs were full of ‘DEBUG’ 
messages, because message id = 0 = DEBUG, request id = 0, transaction id = 0 
and length = 0 is a valid packet. This invalid data should cause the corruption 
to be noticed more quickly :-)

Thanks,
Dave

> 
>> +    xen_mb ();
>> +    intf->closing = 0;
>> +    CAMLreturn(Val_unit);
>> +}
>> diff --git a/xen/include/public/io/xs_wire.h
>> b/xen/include/public/io/xs_wire.h
>> index 585f0c8..022d614 100644
>> --- a/xen/include/public/io/xs_wire.h
>> +++ b/xen/include/public/io/xs_wire.h
>> @@ -108,14 +108,19 @@ enum xs_watch_type
>>  * `incontents 150 xenstore_struct XenStore wire protocol.
>>  *
>>  * Inter-domain shared memory communications. */
>> +
> 
> Pure whitespace change. Not necessary.
> 
>  Paul
> 
>> #define XENSTORE_RING_SIZE 1024
>> typedef uint32_t XENSTORE_RING_IDX;
>> #define MASK_XENSTORE_IDX(idx) ((idx) & (XENSTORE_RING_SIZE-1))
>> struct xenstore_domain_interface {
>> +    /* XENSTORE_VERSION_0 */
>>     char req[XENSTORE_RING_SIZE]; /* Requests to xenstore daemon. */
>>     char rsp[XENSTORE_RING_SIZE]; /* Replies and async watch events. */
>>     XENSTORE_RING_IDX req_cons, req_prod;
>>     XENSTORE_RING_IDX rsp_cons, rsp_prod;
>> +    uint32_t server_version;
>> +    /* server_version 1 and later: */
>> +    uint32_t closing;             /* Non-zero means close in progress */
>> };
>> 
>> /* Violating this is very bad.  See docs/misc/xenstore.txt. */
>> --
>> 1.7.10.4


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.