[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 08/19] libxl: wait for qemu to acknowledge logdirty command



On Fri, 2012-06-08 at 18:34 +0100, Ian Jackson wrote:
> The current migration code in libxl instructs qemu to start or stop
> logdirty, but it does not wait for an acknowledgement from qemu before
> continuing.  This might lead to memory corruption (!)
> 
> Fix this by waiting for qemu to acknowledge the command.
> 
> Unfortunately the necessary ao arrangements for waiting for this
> command are unique because qemu has a special protocol for this
> particular operation.
> 
> Also, this change means that the switch_qemu_logdirty callback
> implementation in libxl can no longer synchronously produce its return
> value, as it now needs to wait for xenstore.  So we tell the
> marshalling code generator that it is a message which does not need a
> reply.  This turns the callback function called by the marshaller into
> one which returns void; the callback function arranges to later
> explicitly sends the reply to the helper, when the xs watch triggers
> and the appropriate value is read from xenstore.
> 
> Signed-off-by: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> ---
>  tools/libxl/libxl_dom.c            |  176 
> +++++++++++++++++++++++++++++++++---
>  tools/libxl/libxl_internal.h       |   18 ++++-
>  tools/libxl/libxl_save_callout.c   |    8 ++
>  tools/libxl/libxl_save_msgs_gen.pl |    7 +-
>  4 files changed, 193 insertions(+), 16 deletions(-)
> 
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a597627..d5ac79f 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -528,30 +528,180 @@ int libxl__toolstack_restore(uint32_t domid, const 
> uint8_t *buf,
>  static void domain_suspend_done(libxl__egc *egc,
>                          libxl__domain_suspend_state *dss, int rc);
> 
> -/*----- callbacks, called by xc_domain_save -----*/
> +/*----- complicated callback, called by xc_domain_save -----*/
> +
> +/*
> + * We implement the other end of protocol for controlling qemu-dm's
> + * logdirty.  There is no documentation for this protocol, but our
> + * counterparty's implementation is in
> + * qemu-xen-traditional.git:xenstore.c in the function
> + * xenstore_process_logdirty_event
> + */
> +
> +static void switch_logdirty_timeout(libxl__egc *egc, libxl__ev_time *ev,
> +                                    const struct timeval *requested_abs);
> +static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch*,
> +                            const char *watch_path, const char *event_path);
> +static void switch_logdirty_done(libxl__egc *egc,
> +                                 libxl__domain_suspend_state *dss, int ok);
> 
> -int libxl__domain_suspend_common_switch_qemu_logdirty
> +static void logdirty_init(libxl__logdirty_switch *lds)
> +{
> +    lds->cmd_path = 0;
> +    libxl__ev_xswatch_init(&lds->watch);
> +    libxl__ev_time_init(&lds->timeout);

I meant to mention this once before when reviewing some patch or other
but I'm not sure I actually did, so at the risk of repeating myself:

This watch with timeout pattern seems to be reasonably common (in fact,
I'm not sure we have any watches without a timeout). It might be a
candidate for a specific helper routine in the future?

[...]
> +static void switch_logdirty_timeout(libxl__egc *egc, libxl__ev_time *ev,
> +                                    const struct timeval *requested_abs)
> +{
> +    libxl__domain_suspend_state *dss = CONTAINER_OF(ev, *dss, 
> logdirty.timeout);
> +    STATE_AO_GC(dss->ao);
> +    LOG(ERROR,"logdirty switch: wait for device model timed out");
> +    switch_logdirty_done(egc,dss,0);
>  }
> 
> +static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch 
> *watch,
> +                            const char *watch_path, const char *event_path)
> +{
> +    libxl__domain_suspend_state *dss =
> +        CONTAINER_OF(watch, *dss, logdirty.watch);
> +    libxl__logdirty_switch *lds = &dss->logdirty;
> +    STATE_AO_GC(dss->ao);
> +    const char *got;
> +    xs_transaction_t t = 0;
> +    int rc;
> +
> +    for (;;) {
> +        rc = libxl__xs_transaction_start(gc, &t);
> +        if (rc) goto out;
> +
> +        rc = libxl__xs_read_checked(gc, t, lds->ret_path, &got);
> +        if (rc) goto out;
> +
> +        if (!got) {
> +            rc = +1;
> +            goto out;
> +        }
> +
> +        if (strcmp(got, lds->cmd)) {
> +            LOG(ERROR,"logdirty switch: sent command `%s' but got reply `%s'"
> +                " (xenstore paths `%s' / `%s')", lds->cmd, got,
> +                lds->cmd_path, lds->ret_path);
> +            rc = ERROR_FAIL;
> +            goto out;
> +        }
> +
> +        rc = libxl__xs_rm_checked(gc, t, lds->cmd_path);
> +        if (rc) goto out;
> +
> +        rc = libxl__xs_rm_checked(gc, t, lds->ret_path);
> +        if (rc) goto out;
> +
> +        rc = libxl__xs_transaction_commit(gc, &t);
> +        if (!rc) break;
> +        if (rc<0) goto out;
> +    }
> +
> + out:
> +    /* rc < 0: error
> +     * rc == 0: ok, we are done
> +     * rc == +1: need to keep waiting
> +     */
> +    libxl__xs_transaction_abort(gc, &t);
> +
> +    if (!rc) {
> +        switch_logdirty_done(egc,dss,1);
> +    } else if (rc < 0) {
> +        LOG(ERROR,"logdirty switch: response read failed (rc=%d)",rc);

Is it only "response read failed" which can cause us to end up here,
looks like the read, rm etc could do it?

Anyway, I've got no substantive comments so:

Acked-by: Ian Campbell <ian.campbell@xxxxxxxxxx>

> +        switch_logdirty_done(egc,dss,0);
> +    }
> +}
[...]



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.