
Re: [Xen-devel] [PATCH v5 11/28] xsplice: Implement support for applying/reverting/replacing patches.



>>> On 24.03.16 at 21:00, <konrad.wilk@xxxxxxxxxx> wrote:
> From: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> 
> Implement support for the apply, revert and replace actions.
> 
> To perform an action on a payload, the hypercall sets up a data
> structure to schedule the work.  A hook is added in all the
> return-to-guest paths to check for work to do and execute it if needed.

Looking at the diffstat alone I cannot see this being the case.
Perhaps it's just the description here which needs to become
more precise.

> In this way, patches can be applied with all CPUs idle and without
> stacks.  The first CPU to do_xsplice() becomes the master and triggers a
> reschedule softirq to trigger all the other CPUs to enter do_xsplice()
> with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
> and NMIs are ignored. The system is then quiescent and the master
> performs the action.  After this, all CPUs enable IRQs and NMIs are
> re-enabled.
> 
> Note that it is unsafe to patch do_nmi and the xSplice internal functions.
> Patching functions on the NMI/MCE path is liable to end in disaster.

So what measures are (planned to be) taken to make sure this
doesn't happen by accident?

> The action to perform is one of:
> - APPLY: For each function in the module, store the first 5 bytes of the
>   old function and replace it with a jump to the new function.
> - REVERT: Copy the previously stored bytes into the first 5 bytes of the
>   old function.
> - REPLACE: Revert each applied module and then apply the new module.
> 
> To prevent a deadlock with any other barrier in the system, the master
> will wait for up to 30ms before timing out.
> Measurements found patch application to take about 100 μs on a
> 72 CPU system, whether idle or fully loaded.

That's for an individual patch I suppose? What if REPLACE has to
revert dozens or hundreds of patches?

> We also add a BUILD_BUG_ON to make sure that the size of the structure
> of the payload is not inadvertently changed.
> 
> Lastly we unroll the 'vmap_to_page' on x86 as inside the macro there
> is a possibility of a NULL pointer. Hence we unroll it with extra
> ASSERTs. Note that ASSERTs on non-debug builds are compiled out, hence
> the extra checks that will just return (and leak memory).

I'm afraid I can't really make sense of this: What does "unroll" mean
here? There's no loop involved. And where's that potential for NULL?

> --- a/docs/misc/xsplice.markdown
> +++ b/docs/misc/xsplice.markdown
> @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
>   * Be able to lookup in the Xen hypervisor the symbol names of functions 
> from the ELF payload.
>   * Be able to patch .rodata, .bss, and .data sections.
>   * Further safety checks (blacklist of which functions cannot be patched, 
> check
> -   the stack, make sure the payload is built with same compiler as 
> hypervisor).
> +   the stack, make sure the payload is built with same compiler as 
> hypervisor,
> +   and NMI/MCE handlers and do_nmi for right now - until an safe solution is 
> found).

The whole thing doesn't parse anymore for me with the addition,
so I can't really conclude what you mean to say here (and hence
whether that addresses the earlier question).

> @@ -120,6 +121,7 @@ static void idle_loop(void)
>          (*pm_idle)();
>          do_tasklet();
>          do_softirq();
> +        check_for_xsplice_work(); /* Must be last. */
>      }
>  }
>  
> @@ -136,6 +138,7 @@ void startup_cpu_idle_loop(void)
>  
>  static void noreturn continue_idle_domain(struct vcpu *v)
>  {
> +    check_for_xsplice_work();
>      reset_stack_and_jump(idle_loop);
>  }

The placement here kind of contradicts the comment right above.
And anyway - why in both places? And why here and not in
common code, e.g. in do_softirq(), or - considering the further
addition below - inside reset_stack_and_jump() _after_ having
reset the stack (after all by doing it up front you do not really
meet your goal of doing the action "without stacks")?

> +void arch_xsplice_apply_jmp(struct xsplice_patch_func *func)
> +{
> +    uint32_t val;

The way it's being used below, it clearly should be int32_t.

> +    uint8_t *old_ptr;
> +
> +    BUILD_BUG_ON(PATCH_INSN_SIZE > sizeof(func->undo));
> +    BUILD_BUG_ON(PATCH_INSN_SIZE != (1 + sizeof val));
> +
> +    old_ptr = (uint8_t *)func->old_addr;

(Considering this cast, the "old_addr" member should be
unsigned long (or void *), not uint64_t: At the latest on ARM32
such a cast would otherwise cause problems.)

Also - where is the verification that
func->old_size >= PATCH_INSN_SIZE?

> +void arch_xsplice_revert_jmp(struct xsplice_patch_func *func)

const

> --- a/xen/common/xsplice.c
> +++ b/xen/common/xsplice.c
> @@ -3,6 +3,7 @@
>   *
>   */
>  
> +#include <xen/cpu.h>
>  #include <xen/err.h>
>  #include <xen/guest_access.h>
>  #include <xen/keyhandler.h>
> @@ -11,17 +12,29 @@
>  #include <xen/mm.h>
>  #include <xen/sched.h>
>  #include <xen/smp.h>
> +#include <xen/softirq.h>
>  #include <xen/spinlock.h>
>  #include <xen/vmap.h>
> +#include <xen/wait.h>
>  #include <xen/xsplice_elf.h>
>  #include <xen/xsplice.h>
>  
>  #include <asm/event.h>
> +#include <asm/nmi.h>

Urgh?

>  #include <public/sysctl.h>
>  
> +/*
> + * Protects against payload_list operations and also allows only one
> + * caller in schedule_work.
> + */
>  static DEFINE_SPINLOCK(payload_lock);
>  static LIST_HEAD(payload_list);
>  
> +/*
> + * Patches which have been applied.
> + */

Comment style.

> +struct xsplice_work
> +{
> +    atomic_t semaphore;          /* Used for rendezvous. First to grab it 
> will
> +                                    do the patching. */

"First to grab it" doesn't seem to make sense for an atomic_t variable.

> +    atomic_t irq_semaphore;      /* Used to signal all IRQs disabled. */
> +    uint32_t timeout;                    /* Timeout to do the operation. */
> +    struct payload *data;        /* The payload on which to act. */
> +    volatile bool_t do_work;     /* Signals work to do. */
> +    volatile bool_t ready;       /* Signals all CPUs synchronized. */
> +    uint32_t cmd;                /* Action request: XSPLICE_ACTION_* */
> +};

Please can you do without fixed size types when the fixed size
isn't really a requirement?

> +/* There can be only one outstanding patching action. */
> +static struct xsplice_work xsplice_work;
> +
> +/* Indicate whether the CPU needs to consult xsplice_work structure. */
> +static DEFINE_PER_CPU(bool_t, work_to_do);

Peeking ahead to the uses of this, I can't see why this is needed
alongside xsplice_work.do_work.

> @@ -296,6 +331,77 @@ static int secure_payload(struct payload *payload, 
> struct xsplice_elf *elf)
>      return rc;
>  }
>  
> +static int check_special_sections(struct payload *payload,
> +                                  struct xsplice_elf *elf)

const (twice, but the first parameter appears to be unused anyway)

> +{
> +    unsigned int i;
> +    static const char *const names[] = { ".xsplice.funcs" };
> +
> +    for ( i = 0; i < ARRAY_SIZE(names); i++ )
> +    {
> +        struct xsplice_elf_sec *sec;

const

> +        sec = xsplice_elf_sec_by_name(elf, names[i]);
> +        if ( !sec )
> +        {
> +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
> +                   XSPLICE, elf->name, names[i]);
> +            return -EINVAL;
> +        }
> +
> +        if ( !sec->sec->sh_size )
> +            return -EINVAL;
> +    }
> +
> +    return 0;
> +}

So you check for there being one such section. Is having multiple
of them okay / meaningful?

> +static int prepare_payload(struct payload *payload,
> +                           struct xsplice_elf *elf)
> +{
> +    struct xsplice_elf_sec *sec;
> +    unsigned int i;
> +    struct xsplice_patch_func *f;

const (multiple times, and I'm not going to further make this remark:
Please deal with this throughout the series)

> +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
> +    if ( sec )

Assuming check_special_sections() got invoked before you get here,
why the conditional?

> +    {
> +        if ( sec->sec->sh_size % sizeof *payload->funcs )
> +        {
> +            dprintk(XENLOG_DEBUG, "%s%s: Wrong size of .xsplice.funcs!\n",
> +                    XSPLICE, elf->name);
> +            return -EINVAL;
> +        }
> +
> +        payload->funcs = (struct xsplice_patch_func *)sec->load_addr;
> +        payload->nfuncs = sec->sec->sh_size / (sizeof *payload->funcs);

Since this repeats (so far I thought this was more like a mistake),
and despite being a matter of taste to some degree - may I ask
that the canonical sizeof() be used, instead of (sizeof ...) or
whatever else variants?

> +    }
> +
> +    for ( i = 0; i < payload->nfuncs; i++ )
> +    {
> +        unsigned int j;
> +
> +        f = &(payload->funcs[i]);
> +
> +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )

Isn't new_size == 0 a particularly easy to deal with case?

> @@ -499,6 +614,321 @@ static int xsplice_list(xen_sysctl_xsplice_list_t *list)
>      return rc ? : idx;
>  }
>  
> +/*
> + * The following functions get the CPUs into an appropriate state and
> + * apply (or revert) each of the payload's functions. This is needed
> + * for XEN_SYSCTL_XSPLICE_ACTION operation (see xsplice_action).
> + */
> +
> +static int apply_payload(struct payload *data)
> +{
> +    unsigned int i;
> +
> +    dprintk(XENLOG_DEBUG, "%s%s: Applying %u functions.\n", XSPLICE,
> +            data->name, data->nfuncs);
> +
> +    arch_xsplice_patching_enter();
> +
> +    for ( i = 0; i < data->nfuncs; i++ )
> +        arch_xsplice_apply_jmp(&data->funcs[i]);
> +
> +    arch_xsplice_patching_leave();
> +
> +    list_add_tail(&data->applied_list, &applied_list);
> +
> +    return 0;
> +}
> +
> +/*
> + * This function is executed having all other CPUs with no stack (we may
> + * have cpu_idle on it) and IRQs disabled.
> + */
> +static int revert_payload(struct payload *data)

Wouldn't the same comment apply to apply_payload()? I.e. is it
useful to have here but not there (i.e. isn't the comment ahead
of xsplice_do_action() sufficient)?

> +static void xsplice_do_action(void)
> +{
> +    int rc;
> +    struct payload *data, *other, *tmp;
> +
> +    data = xsplice_work.data;
> +    /* Now this function should be the only one on any stack.
> +     * No need to lock the payload list or applied list. */

Comment style.

> +    switch ( xsplice_work.cmd )
> +    {
> +    case XSPLICE_ACTION_APPLY:
> +        rc = apply_payload(data);
> +        if ( rc == 0 )
> +            data->state = XSPLICE_STATE_APPLIED;
> +        break;
> +
> +    case XSPLICE_ACTION_REVERT:
> +        rc = revert_payload(data);
> +        if ( rc == 0 )
> +            data->state = XSPLICE_STATE_CHECKED;
> +        break;
> +
> +    case XSPLICE_ACTION_REPLACE:
> +        rc = 0;
> +        /* N.B: Use 'applied_list' member, not 'list'. */
> +        list_for_each_entry_safe_reverse ( other, tmp, &applied_list, 
> applied_list )
> +        {
> +            other->rc = revert_payload(other);

Why does this set ->rc, but the two earlier ones only set the local
variable?

> +            if ( other->rc == 0 )
> +                other->state = XSPLICE_STATE_CHECKED;
> +            else
> +            {
> +                rc = -EINVAL;
> +                break;
> +            }
> +        }
> +
> +        if ( rc != -EINVAL )

Perhaps better "rc == 0"?

> +        {
> +            rc = apply_payload(data);
> +            if ( rc == 0 )
> +                data->state = XSPLICE_STATE_APPLIED;
> +        }
> +        break;
> +
> +    default:
> +        rc = -EINVAL;
> +        break;
> +    }
> +
> +    data->rc = rc;

Oh, here it is. But why would an xsplice_work.cmd (which probably
shouldn't make it here anyway) mark the payload having an error?
It didn't change state or anything after all.

> +/*
> + * MUST be holding the payload_lock.
> + */

comment style (another comment I'm not going to repeat any further)

> +static int schedule_work(struct payload *data, uint32_t cmd, uint32_t 
> timeout)
> +{
> +    unsigned int cpu;
> +
> +    ASSERT(spin_is_locked(&payload_lock));

Also the comment above is clearly redundant with this ASSERT().

> +    /* Fail if an operation is already scheduled. */
> +    if ( xsplice_work.do_work )
> +        return -EBUSY;
> +
> +    if ( !get_cpu_maps() )
> +    {
> +        printk(XENLOG_ERR "%s%s: unable to get cpu_maps lock!\n",
> +               XSPLICE, data->name);
> +        return -EBUSY;
> +    }
> +
> +    xsplice_work.cmd = cmd;
> +    xsplice_work.data = data;
> +    xsplice_work.timeout = timeout ?: MILLISECS(30);
> +
> +    dprintk(XENLOG_DEBUG, "%s%s: timeout is %"PRI_stime"ms\n",
> +            XSPLICE, data->name, xsplice_work.timeout / MILLISECS(1));
> +
> +    /*
> +     * Once the patching has been completed, the semaphore value will
> +     * be num_online_cpus()-1.
> +     */

Is this comment really useful? I.e. is that final value meaningful
for anything?

> +    atomic_set(&xsplice_work.semaphore, -1);
> +    atomic_set(&xsplice_work.irq_semaphore, -1);
> +
> +    xsplice_work.ready = 0;
> +    smp_wmb();
> +    xsplice_work.do_work = 1;
> +    smp_wmb();
> +    /*
> +     * Above smp_wmb() gives us an compiler barrier, as we MUST do this
> +     * after setting the global structure.
> +     */

"a compiler barrier"

> +static int mask_nmi_callback(const struct cpu_user_regs *regs, int cpu)
> +{
> +    /* TODO: Handle missing NMI/MCE.*/
> +    return 1;
> +}

Aren't we in common code here?

> +void check_for_xsplice_work(void)
> +{
> +    unsigned int cpu = smp_processor_id();
> +    nmi_callback_t saved_nmi_callback;

This again looks very x86-ish.

> +    s_time_t timeout;
> +    unsigned long flags;
> +
> +    /* Fast path: no work to do. */
> +    if ( !per_cpu(work_to_do, cpu ) )
> +        return;
> +
> +    /* In case we aborted, other CPUs can skip right away. */
> +    if ( (!xsplice_work.do_work) )

Stray parentheses.

> +    {
> +        per_cpu(work_to_do, cpu) = 0;
> +        return;
> +    }
> +
> +    ASSERT(local_irq_is_enabled());
> +
> +    /* Set at -1, so will go up to num_online_cpus - 1. */
> +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
> +    {
> +        struct payload *p;
> +        unsigned int total_cpus;
> +
> +        p = xsplice_work.data;
> +        if ( !get_cpu_maps() )
> +        {
> +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
> +                   XSPLICE, p->name, cpu);
> +            per_cpu(work_to_do, cpu) = 0;
> +            xsplice_work.data->rc = -EBUSY;
> +            xsplice_work.do_work = 0;

On x86 such possibly simultaneous accesses may be okay, but is
that universally the case? Wouldn't it better be only the monarch
which updates shared data?

> +            /*
> +             * Do NOT decrement semaphore down - as that may cause the other
> +             * CPU (which may be at this exact moment checking the ASSERT)

Which ASSERT()? (I guess the one a few lines up - but does that
matter?)

> +             * to assume the role of master and then needlessly time out
> +             * out (as do_work is zero).
> +             */
> +            return;
> +        }
> +
> +        barrier(); /* MUST do it after get_cpu_maps. */
> +        total_cpus = num_online_cpus() - 1;

Considering this the variable (and earlier function parameter) is
misnamed.

> +        if ( total_cpus )
> +        {
> +            dprintk(XENLOG_DEBUG, "%s%s: CPU%u - IPIing the %u CPUs\n",

Not being a native speaker, I've previously observed too many
articles - the "the" here again looks quite odd, or there is an
"other" missing.

> +                    XSPLICE, p->name, cpu, total_cpus);
> +            smp_call_function(reschedule_fn, NULL, 0);
> +        }
> +
> +        timeout = xsplice_work.timeout + NOW();
> +        if ( xsplice_do_wait(&xsplice_work.semaphore, timeout, total_cpus,
> +                             "Timed out on CPU semaphore") )
> +            goto abort;
> +
> +        /* "Mask" NMIs. */
> +        saved_nmi_callback = set_nmi_callback(mask_nmi_callback);

So what if an NMI got raised on a remote CPU right before you set
this? I.e. doesn't this need to move earlier?

> +        /* All CPUs are waiting, now signal to disable IRQs. */
> +        xsplice_work.ready = 1;
> +        smp_wmb();
> +
> +        atomic_inc(&xsplice_work.irq_semaphore);
> +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, 
> total_cpus,
> +                              "Timed out on IRQ semaphore") )

I'd prefer if the common parts of that message moved into
xsplice_do_wait() - no reason to have more string literal space
occupied than really needed. Also, looking back, the respective
function parameter could do with a more descriptive name.

And then - does it make sense to wait the perhaps full 30ms
on this second step? Rendezvoused CPUs should react rather
quickly to the signal to disable interrupts.

> +        {
> +            local_irq_save(flags);
> +            /* Do the patching. */
> +            xsplice_do_action();
> +            /* To flush out pipeline. */
> +            arch_xsplice_post_action();

The comment needs to become more generic, and maybe the
function name more specific.

> +            local_irq_restore(flags);
> +        }
> +        set_nmi_callback(saved_nmi_callback);
> +
> + abort:
> +        per_cpu(work_to_do, cpu) = 0;
> +        xsplice_work.do_work = 0;
> +
> +        smp_wmb(); /* Synchronize with waiting CPUs. */

What "synchronization" does this barrier do?

> +        ASSERT(local_irq_is_enabled());

Is there really anything between the earlier identical ASSERT() and
this one which could leave interrupts off?

> +        put_cpu_maps();
> +
> +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
> +               p->name, p->rc);

And no record left of what was done with that payload?

> +    }
> +    else
> +    {
> +        /* Wait for all CPUs to rendezvous. */
> +        while ( xsplice_work.do_work && !xsplice_work.ready )
> +        {
> +            cpu_relax();
> +            smp_rmb();

What is the barrier doing inside this and the other loop below?

> @@ -635,15 +1063,30 @@ static void xsplice_printall(unsigned char key)
>      }
>  
>      list_for_each_entry ( data, &payload_list, list )
> +    {
>          printk(" name=%s state=%s(%d) %p (.data=%p, .rodata=%p) using %zu 
> pages.\n",
>                 data->name, state2str(data->state), data->state, 
> data->text_addr,
>                 data->rw_addr, data->ro_addr, data->payload_pages);
>  
> +        for ( i = 0; i < data->nfuncs; i++ )
> +        {
> +            struct xsplice_patch_func *f = &(data->funcs[i]);
> +            printk("    %s patch 0x%"PRIx64"(%u) with 0x%"PRIx64"(%u)\n",

%# again please.

> +                   f->name, f->old_addr, f->old_size, f->new_addr, 
> f->new_size);
> +            if ( !(i % 100) )

Using a power of 2 in situations like this is going to produce both
smaller and faster code (the faster aspect may not matter here,
but anyway). Also I don't think you want to do this on the first
iteration.

> +                process_pending_softirqs();

With a lock held?

>  static int __init xsplice_init(void)
>  {
> +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
> +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
> +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );

What assumptions get broken if these sizes change?

> --- a/xen/include/xen/xsplice.h
> +++ b/xen/include/xen/xsplice.h
> @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
>  struct xsplice_elf_sym;
>  struct xen_sysctl_xsplice_op;
>  
> +#include <xen/elfstructs.h>
> +/*
> + * The structure which defines the patching. This is what the hypervisor
> + * expects in the '.xsplice.func' section of the ELF file.
> + *
> + * This MUST be in sync with what the tools generate.

In which case this need to be part of the public interface, I would
say.

> + */
> +struct xsplice_patch_func {
> +    const char *name;
> +    uint64_t new_addr;
> +    uint64_t old_addr;
> +    uint32_t new_size;
> +    uint32_t old_size;
> +    uint8_t undo[8];

Shouldn't the size of this array be a per-arch constant?

> +    uint8_t pad[24];

What is this padding field good for?

Jan
