
Re: [Xen-devel] [PATCH] x86: Meltdown band-aid against malicious 64-bit PV guests



On 12/01/18 10:19, Jan Beulich wrote:
> This is a very simplistic change limiting the amount of memory a running
> 64-bit PV guest has mapped (and hence available for attacking): Only the
> mappings of stack, IDT, and TSS are being cloned from the direct map
> into per-CPU page tables. Guest controlled parts of the page tables are
> being copied into those per-CPU page tables upon entry into the guest.
> Cross-vCPU synchronization of top level page table entry changes is
> being effected by forcing other active vCPU-s of the guest into the
> hypervisor.
>
> The change to context_switch() isn't strictly necessary, but there's no
> reason to keep switching page tables once a PV guest is being scheduled
> out.
>
> There is certainly much room for improvement, especially of performance,
> here - first and foremost suppressing all the negative effects on AMD
> systems. But in the interest of backportability (including to really old
> hypervisors, which may not even have alternative patching) any such is
> being left out here.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>

I could do with an answer to my question at the end before completing a
review.  In the meantime, some observations.

> ---
> TBD: Is forcing an event check interrupt for synchronization purposes
> enough? It may be necessary to actually wait for remote vCPU-s to have
> entered the hypervisor, in which case a function-call-IPI should be
> sent, with an empty handler (a flush-IPI with zero operation mask would
> also do). Otoh, if the vCPU isn't already in hypervisor context,
> delivery of the IPI should be almost instant (as interrupts are always
> enabled while in guest mode).

From a vcpu consistency point of view, once the hypercall making this
change returns, no other vcpus should have executed an instruction with
a stale view of the L4.

Therefore, I think you need to wait until the remote vcpus have at least
entered hypervisor context before releasing the current vcpu, safe in the
knowledge that the update will be picked up on the way back out.
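
One way to get that property, as a rough sketch only (using the empty
function-call-IPI handler your TBD note mentions; "nop" is a name I've
made up), would be on_selected_cpus() with wait set:

    static void nop(void *unused) { }

    ...
    /* Wait until all dirty remote CPUs have entered the hypervisor. */
    on_selected_cpus(mask, nop, NULL, 1);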

> ---
> Backporting notes:
> - This needs f9eb74789a ("x86/entry: Remove support for partial
>   cpu_user_regs frames") as a prereq, due to the uses of %r14 and %r15.
>   But that's intended to be backported anyway (for Spectre/SP2).
> - The use of "root" instead of "l4" here is mainly to not make 5-level
>   page table additions any harder. In backports "l4" should probably be
>   preferred.
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3683,6 +3683,20 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> +                    if ( !rc )

Perhaps && (d->max_vcpus > 1) ?

> +                    {
> +                        /*
> +                         * Force other vCPU-s of the affected guest to pick up
> +                         * the change (if any).
> +                         */
> +                        unsigned int cpu = smp_processor_id();
> +                        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> +
> +                        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask,
> +                                       cpumask_of(cpu));

cpumask_copy() and __clear_bit(, cpu) is probably faster?

> +                        if ( !cpumask_empty(mask) )
> +                            smp_send_event_check_mask(mask);
> +                    }

In terms of performance, if this shadowing/sync algorithm is correct,
then it would be better to defer the IPI until after the update loop. 
We only need to force other vcpus into the hypervisor once per mmu_update
hypercall if there is an L4 update, rather than once for each L4 update in
the batch.
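
Roughly what I have in mind (untested sketch only, folding in the
max_vcpus and cpumask suggestions above; "sync_l4" is a name I've
invented):

    bool sync_l4 = false;

    /* In the PGT_l4_page_table case, instead of sending the IPI: */
    if ( !rc )
        sync_l4 = true;

    /* After the update loop: */
    if ( sync_l4 && pt_owner->max_vcpus > 1 )
    {
        unsigned int cpu = smp_processor_id();
        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);

        cpumask_copy(mask, pt_owner->domain_dirty_cpumask);
        __clear_bit(cpu, cpumask_bits(mask));
        if ( !cpumask_empty(mask) )
            smp_send_event_check_mask(mask);
    }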

>                      break;
>  
>                  case PGT_writable_page:
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -327,6 +327,9 @@ void start_secondary(void *unused)
>       */
>      spin_debug_disable();
>  
> +    get_cpu_info()->xen_cr3 = 0;
> +    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
> +
>      load_system_tables();
>  
>      /* Full exception support from here on in. */
> @@ -633,6 +636,181 @@ void cpu_exit_clear(unsigned int cpu)
>      set_cpu_state(CPU_STATE_DEAD);
>  }
>  
> +static bool clone_mapping(const void *ptr, root_pgentry_t *rpt)

Could we introduce these functions with ints and use -ENOMEM?
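
i.e. something of this shape (sketch only, not a full conversion):

    static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
    {
        ...
        l3t = alloc_xen_pagetable();
        if ( !l3t )
            return -ENOMEM;
        ...
        return 0;
    }

with the callers then checking for a nonzero return instead of false.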

> +{
> +    unsigned long linear = (unsigned long)ptr, pfn;
> +    unsigned int flags;
> +    l3_pgentry_t *l3t = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]);
> +    l2_pgentry_t *l2t;
> +    l1_pgentry_t *l1t;
> +
> +    if ( linear < DIRECTMAP_VIRT_START )
> +        return true;
> +
> +    flags = l3e_get_flags(l3t[l3_table_offset(linear)]);
> +    ASSERT(flags & _PAGE_PRESENT);
> +    if ( flags & _PAGE_PSE )
> +    {
> +        pfn = (l3e_get_pfn(l3t[l3_table_offset(linear)]) &
> +               ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
> +              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));

This logic would be easier to read by having an extra

l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];

broken out.  Conversely, I can't think of a cleaner way to express the
pfn calculation, despite the fact it is very complicated.
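
With that broken out, the hunk would read something like (sketch,
reusing your existing names):

    l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];

    flags = l3e_get_flags(*l3e);
    ASSERT(flags & _PAGE_PRESENT);
    if ( flags & _PAGE_PSE )
    {
        pfn = (l3e_get_pfn(*l3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
        flags &= ~_PAGE_PSE;
    }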

> +        flags &= ~_PAGE_PSE;

I presume we don't care for shuffling caching attributes?  This should
only really be called on WB memory.

> +    }
> +    else
> +    {
> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
> +        flags = l2e_get_flags(l2t[l2_table_offset(linear)]);
> +        ASSERT(flags & _PAGE_PRESENT);
> +        if ( flags & _PAGE_PSE )
> +        {
> +            pfn = (l2e_get_pfn(l2t[l2_table_offset(linear)]) &
> +                   ~((1UL << PAGETABLE_ORDER) - 1)) |
> +                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
> +            flags &= ~_PAGE_PSE;
> +        }
> +        else
> +        {
> +            l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
> +            flags = l1e_get_flags(l1t[l1_table_offset(linear)]);
> +            if ( !(flags & _PAGE_PRESENT) )
> +                return true;
> +            pfn = l1e_get_pfn(l1t[l1_table_offset(linear)]);
> +        }
> +    }
> +
> +    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l3t = alloc_xen_pagetable();
> +        if ( !l3t )
> +            return false;
> +        clear_page(l3t);
> +        l4e_write(&rpt[root_table_offset(linear)],
> +                  l4e_from_paddr(__pa(l3t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +        l3t = l4e_to_l3e(rpt[root_table_offset(linear)]);
> +
> +    if ( !(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l2t = alloc_xen_pagetable();
> +        if ( !l2t )
> +            return false;
> +        clear_page(l2t);
> +        l3e_write(&l3t[l3_table_offset(linear)],
> +                  l3e_from_paddr(__pa(l2t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PSE));
> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
> +    }
> +
> +    if ( !(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        l1t = alloc_xen_pagetable();
> +        if ( !l1t )
> +            return false;
> +        clear_page(l1t);
> +        l2e_write(&l2t[l2_table_offset(linear)],
> +                  l2e_from_paddr(__pa(l1t), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PSE));
> +        l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
> +    }
> +
> +    if ( l1e_get_flags(l1t[l1_table_offset(linear)]) & _PAGE_PRESENT )
> +    {
> +        ASSERT(l1e_get_pfn(l1t[l1_table_offset(linear)]) == pfn);
> +        ASSERT(l1e_get_flags(l1t[l1_table_offset(linear)]) == flags);

Calculate l1e_from_pfn(pfn, flags) first and ASSERT() that the full PTE
matches?
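
i.e. (sketch):

    l1_pgentry_t pte = l1e_from_pfn(pfn, flags);

    if ( l1e_get_flags(l1t[l1_table_offset(linear)]) & _PAGE_PRESENT )
        ASSERT(l1e_get_intpte(l1t[l1_table_offset(linear)]) ==
               l1e_get_intpte(pte));
    else
        l1e_write(&l1t[l1_table_offset(linear)], pte);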

> +    }
> +    else
> +        l1e_write(&l1t[l1_table_offset(linear)], l1e_from_pfn(pfn, flags));
> +
> +    return true;
> +}
> +
> +DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
> +
> +static bool setup_cpu_root_pgt(unsigned int cpu)
> +{
> +    root_pgentry_t *rpt = alloc_xen_pagetable();

As an observation, alloc_xen_pagetable() should zero internally.  There
is no circumstance under which we want to forget the clear_page().

Another issue which I attempted to address in my series is that
alloc_xen_pagetable() isn't NUMA-local.  It's not worth adjusting for
backports, but it is something we should consider moving forwards.
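
Until the allocator is changed, even a trivial wrapper would avoid
repeating the pattern (hypothetical helper name, sketch only):

    static void *alloc_cleared_pagetable(void)
    {
        void *pt = alloc_xen_pagetable();

        if ( pt )
            clear_page(pt);

        return pt;
    }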

> +    unsigned int off;
> +
> +    if ( !rpt )
> +        return false;
> +
> +    clear_page(rpt);
> +    per_cpu(root_pgt, cpu) = rpt;
> +
> +    rpt[root_table_offset(RO_MPT_VIRT_START)] =
> +        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
> +    /* SH_LINEAR_PT inserted together with guest mappings. */
> +    /* PERDOMAIN inserted during context switch. */
> +    rpt[root_table_offset(XEN_VIRT_START)] =
> +        idle_pg_table[root_table_offset(XEN_VIRT_START)];
> +
> +    /* Install direct map page table entries for stack, IDT, and TSS. */
> +    for ( off = 0; off < STACK_SIZE; off += PAGE_SIZE )
> +        if ( !clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt) )
> +            break;
> +
> +    return off == STACK_SIZE &&
> +           clone_mapping(idt_tables[cpu], rpt) &&
> +           clone_mapping(&per_cpu(init_tss, cpu), rpt);

Can we put an outer set of brackets in, so editors retain the
indentation like this?
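
i.e. just re-quoting your return statement with the extra parentheses:

    return (off == STACK_SIZE &&
            clone_mapping(idt_tables[cpu], rpt) &&
            clone_mapping(&per_cpu(init_tss, cpu), rpt));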

> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -41,6 +41,8 @@ struct cpu_info {
>      struct vcpu *current_vcpu;
>      unsigned long per_cpu_offset;
>      unsigned long cr4;
> +    unsigned long xen_cr3;
> +    unsigned long pv_cr3;

These definitely need more description of how they work.

As far as I've reverse engineered, pv_cr3 is the paddr_t of the per-cpu
root pagetable, and is static after a cpu is up and operational.

xen_cr3, if not 0, is a cr3 to restore on the way into the hypervisor?
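
If that's right, then comments along these lines (my wording, based on
the guesswork above) would go a long way:

    /*
     * Meltdown band-aid state (as I understand it):
     *  - xen_cr3: if non-zero, a CR3 value to restore (switch to) on
     *    the way into the hypervisor;
     *  - pv_cr3: physical address of this CPU's per-CPU root page
     *    table; static once the CPU is up and operational.
     */
    unsigned long xen_cr3;
    unsigned long pv_cr3;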

~Andrew

>      /* get_stack_bottom() must be 16-byte aligned */
>  };
>  

