
Re: [Xen-devel] [PATCH] x86: Meltdown band-aid against malicious 64-bit PV guests



On 12/01/18 10:19, Jan Beulich wrote:
> This is a very simplistic change limiting the amount of memory a running
> 64-bit PV guest has mapped (and hence available for attacking): Only the
> mappings of stack, IDT, and TSS are being cloned from the direct map
> into per-CPU page tables. Guest controlled parts of the page tables are
> being copied into those per-CPU page tables upon entry into the guest.
> Cross-vCPU synchronization of top level page table entry changes is
> being effected by forcing other active vCPU-s of the guest into the
> hypervisor.
>
> The change to context_switch() isn't strictly necessary, but there's no
> reason to keep switching page tables once a PV guest is being scheduled
> out.
>
> There is certainly much room for improvement, especially of performance,
> here - first and foremost suppressing all the negative effects on AMD
> systems. But in the interest of backportability (including to really old
> hypervisors, which may not even have alternative patching) any such is
> being left out here.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>

I could do with an answer to my question at the end before completing a
review.  In the meantime, some observations.

> ---
> TBD: Is forcing an event check interrupt for synchronization purposes
> enough? It may be necessary to actually wait for remote vCPU-s to have
> entered the hypervisor, in which case a function-call-IPI should be
> sent, with an empty handler (a flush-IPI with zero operation mask would
> also do). Otoh, if the vCPU isn't already in hypervisor context,
> delivery of the IPI should be almost instant (as interrupts are always
> enabled while in guest mode).

From a vcpu consistency point of view, once the hypercall making this
change returns, no other vcpus should have executed an instruction with
a stale view of the L4.

Therefore, I think you need to wait until the remote vcpus have at least
entered hypervisor context before releasing the current vcpu, safe in the
knowledge that the update will be picked up on the way back out.
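
One way to get that property, as a rough sketch only (using the empty
function-call-IPI handler your TBD note mentions; "nop" is a name I've
made up), would be on_selected_cpus() with wait set:

    static void nop(void *unused) { }

    ...
    /* Wait until all dirty remote CPUs have entered the hypervisor. */
    on_selected_cpus(mask, nop, NULL, 1);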

> ---
> Backporting notes:
> - This needs f9eb74789a ("x86/entry: Remove support for partial
>   cpu_user_regs frames") as a prereq, due to the uses of %r14 and %r15.
>   But that's intended to be backported anyway (for Spectre/SP2).
> - The use of "root" instead of "l4" here is mainly to not make 5-level
>   page table additions any harder. In backports "l4" should probably be
>   preferred.
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3683,6 +3683,20 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> +                    if ( !rc )

Perhaps && (d->max_vcpus > 1) ?

> +                    {
> +                        /*
> +                         * Force other vCPU-s of the affected guest to pick up
> +                         * the change (if any).
> +                         */
> +                        unsigned int cpu = smp_processor_id();
> +                        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> +
> +                        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask,
> +                                       cpumask_of(cpu));

cpumask_copy() and __clear_bit(, cpu) is probably faster?

> +                        if ( !cpumask_empty(mask) )
> +                            smp_send_event_check_mask(mask);
> +                    }

In terms of performance, if this shadowing/sync algorithm is correct,
then it would be better to defer the IPI until after the update loop. 
We only need to force other vcpus into the hypervisor once per mmu_update
hypercall if there is an L4 update, rather than once for each L4 update in
the batch.
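
Roughly what I have in mind (untested sketch only, folding in the
max_vcpus and cpumask suggestions above; "sync_l4" is a name I've
invented):

    bool sync_l4 = false;

    /* In the PGT_l4_page_table case, instead of sending the IPI: */
    if ( !rc )
        sync_l4 = true;

    /* After the update loop: */
    if ( sync_l4 && pt_owner->max_vcpus > 1 )
    {
        unsigned int cpu = smp_processor_id();
        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);

        cpumask_copy(mask, pt_owner->domain_dirty_cpumask);
        __clear_bit(cpu, cpumask_bits(mask));
        if ( !cpumask_empty(mask) )
            smp_send_event_check_mask(mask);
    }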

>                      break;
>  
>                  case PGT_writable_page:
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -327,6 +327,9 @@ void start_secondary(void *unused)
>       */
>      spin_debug_disable();
>  
> +    get_cpu_info()->xen_cr3 = 0;
> +    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
> +
>      load_system_tables();
>  
>      /* Full exception support from here on in. */
> @@ -633,6 +636,181 @@ void cpu_exit_clear(unsigned int cpu)
>      set_cpu_state(CPU_STATE_DEAD);
>  }
>  
> +static bool clone_mapping(const void *ptr, root_pgentry_t *rpt)

Could we introduce these functions with ints and use -ENOMEM?
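
i.e. something of this shape (sketch only, not a full conversion):

    static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
    {
        ...
        l3t = alloc_xen_pagetable();
        if ( !l3t )
            return -ENOMEM;
        ...
        return 0;
    }

with the callers then checking for a nonzero return instead of false.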

> +{
> +    unsigned long linear = (unsigned long)ptr, pfn;
> +    unsigned int flags;
> +    l3_pgentry_t *l3t = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]);
> +    l2_pgentry_t *l2t;
> +    l1_pgentry_t *l1t;
> +
> +    if ( linear < DIRECTMAP_VIRT_START )
> +        return true;
> +
> +    flags = l3e_get_flags(l3t[l3_table_offset(linear)]);
> +    ASSERT(flags & _PAGE_PRESENT);
> +    if ( flags & _PAGE_PSE )
> +    {
> +        pfn = (l3e_get_pfn(l3t[l3_table_offset(linear)]) &
> +               ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
> +              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));

This logic would be easier to read by having an extra

l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];

broken out.  Conversely, I can't think of a cleaner way to express the
pfn calculation, despite the fact it is very complicated.
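
With that broken out, the hunk would read something like (sketch,
reusing your existing names):

    l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];

    flags = l3e_get_flags(*l3e);
    ASSERT(flags & _PAGE_PRESENT);
    if ( flags & _PAGE_PSE )
    {
        pfn = (l3e_get_pfn(*l3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
        flags &= ~_PAGE_PSE;
    }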

> +        flags &= ~_PAGE_PSE;

I presume we don't care for shuffling caching attributes?  This should
only really be called on WB memory.

> +    }
> +    else
> +    {
> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
> +        flags = l2e_get_flags(l2t[l2_table_offset(linear)]);
> +        ASSERT(flags & _PAGE_PRESENT);
> +        if ( flags & _PAGE_PSE )
> +        {
> +            pfn = (l2e_get_pfn(l2t[l2_table_offset(linear)]) &
> +                   ~((1UL << PAGETABLE_ORDER) - 1)) |
> +                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
> +            flags &= ~_PAGE_PSE;
> +        }
> +        else
> +        {
> +            l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
> +            flags = l1e_get_flags(l1t[l1_table_offset(linear)]);
> +            if ( !(flags & _PAGE_PRESENT) )
> +                return true;
> +            pfn = l1e_get_pfn(l1t[l1_table_offset(linear)]);
> +        }
> +    }
> +
> +    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l3t = alloc_xen_pagetable();
> +        if ( !l3t )
> +            return false;
> +        clear_page(l3t);
> +        l4e_write(&rpt[root_table_offset(linear)],
> +                  l4e_from_paddr(__pa(l3t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l3t = l4e_to_l3e(rpt[root_table_offset(linear)]);
> +
> +    if ( !(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l2t = alloc_xen_pagetable();
> +        if ( !l2t )
> +            return false;
> +        clear_page(l2t);
> +        l3e_write(&l3t[l3_table_offset(linear)],
> +                  l3e_from_paddr(__pa(l2t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PSE));
> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
> +    }
> +
> +    if ( !(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l1t = alloc_xen_pagetable();
> +        if ( !l1t )
> +            return false;
> +        clear_page(l1t);
> +        l2e_write(&l2t[l2_table_offset(linear)],
> +                  l2e_from_paddr(__pa(l1t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PSE));
> +        l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
> +    }
> +
> +    if ( l1e_get_flags(l1t[l1_table_offset(linear)]) & _PAGE_PRESENT )
> +    {
> +        ASSERT(l1e_get_pfn(l1t[l1_table_offset(linear)]) == pfn);
> +        ASSERT(l1e_get_flags(l1t[l1_table_offset(linear)]) == flags);

Calculate l1e_from_pfn(pfn, flags) first and ASSERT() that the full PTE
matches?
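
i.e. (sketch):

    l1_pgentry_t pte = l1e_from_pfn(pfn, flags);

    if ( l1e_get_flags(l1t[l1_table_offset(linear)]) & _PAGE_PRESENT )
        ASSERT(l1e_get_intpte(l1t[l1_table_offset(linear)]) ==
               l1e_get_intpte(pte));
    else
        l1e_write(&l1t[l1_table_offset(linear)], pte);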

> +    }
> +    else
> +        l1e_write(&l1t[l1_table_offset(linear)], l1e_from_pfn(pfn, flags));
> +
> +    return true;
> +}
> +
> +DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
> +
> +static bool setup_cpu_root_pgt(unsigned int cpu)
> +{
> +    root_pgentry_t *rpt = alloc_xen_pagetable();

As an observation, alloc_xen_pagetable() should zero internally.  There
is no circumstance under which we want to forget the clear_page().

Another issue which I attempted to address in my series is that
alloc_xen_pagetable() isn't NUMA-local.  It's not worth adjusting for
backports, but it is something we should consider moving forwards.
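
Until the allocator is changed, even a trivial wrapper would avoid
repeating the pattern (hypothetical helper name, sketch only):

    static void *alloc_cleared_pagetable(void)
    {
        void *pt = alloc_xen_pagetable();

        if ( pt )
            clear_page(pt);

        return pt;
    }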

> +    unsigned int off;
> +
> +    if ( !rpt )
> +        return false;
> +
> +    clear_page(rpt);
> +    per_cpu(root_pgt, cpu) = rpt;
> +
> +    rpt[root_table_offset(RO_MPT_VIRT_START)] =
> +        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
> +    /* SH_LINEAR_PT inserted together with guest mappings. */
> +    /* PERDOMAIN inserted during context switch. */
> +    rpt[root_table_offset(XEN_VIRT_START)] =
> +        idle_pg_table[root_table_offset(XEN_VIRT_START)];
> +
> +    /* Install direct map page table entries for stack, IDT, and TSS. */
> +    for ( off = 0; off < STACK_SIZE; off += PAGE_SIZE )
> +        if ( !clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt) )
> +            break;
> +
> +    return off == STACK_SIZE &&
> +           clone_mapping(idt_tables[cpu], rpt) &&
> +           clone_mapping(&per_cpu(init_tss, cpu), rpt);

Can we put an outer set of brackets in, so editors retain the
indentation like this?
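
i.e. just re-quoting your return statement with the extra parentheses:

    return (off == STACK_SIZE &&
            clone_mapping(idt_tables[cpu], rpt) &&
            clone_mapping(&per_cpu(init_tss, cpu), rpt));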

> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -41,6 +41,8 @@ struct cpu_info {
>      struct vcpu *current_vcpu;
>      unsigned long per_cpu_offset;
>      unsigned long cr4;
> +    unsigned long xen_cr3;
> +    unsigned long pv_cr3;

These definitely need more description of how they work.

As far as I've reverse engineered, pv_cr3 is the paddr_t of the per-cpu
root pagetable, and is static after a cpu is up and operational.

xen_cr3, if not 0, is a cr3 to restore on the way into the hypervisor?
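
If that's right, then comments along these lines (my wording, based on
the guesswork above) would go a long way:

    /*
     * Meltdown band-aid state (as I understand it):
     *  - xen_cr3: if non-zero, a CR3 value to restore (switch to) on
     *    the way into the hypervisor;
     *  - pv_cr3: physical address of this CPU's per-CPU root page
     *    table; static once the CPU is up and operational.
     */
    unsigned long xen_cr3;
    unsigned long pv_cr3;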

~Andrew

>      /* get_stack_bottom() must be 16-byte aligned */
>  };
>  

