Re: [Xen-devel] [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
On 16/01/18 07:46, Jan Beulich wrote:
>>>> On 15.01.18 at 19:23, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 15/01/18 11:06, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_64/compat/entry.S
>>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>>> @@ -199,6 +199,17 @@ ENTRY(cstar_enter)
>>>          pushq $0
>>>          movl  $TRAP_syscall, 4(%rsp)
>>>          SAVE_ALL
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
>>> +        neg   %rcx
>>> +UNLIKELY_START(nz, cstar_cr3)
>>> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +        neg   %rcx
>>> +        write_cr3 rcx, rdi, rsi
>>> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +UNLIKELY_END(cstar_cr3)
>> These UNLIKELY()s aren't correct.  It will depend on hardware and
>> command-line settings as to whether we expect to update cr3.
> Why are they not correct?  What 64-bit kernels do you know that
> use the CSTAR entry method in PV mode?

None, but I also don't see you excluding 32-bit PV guests, which means
we do take the unlikely path all the time in general.

> Afaik this was in existence for a range of kernel versions only in
> our forward port.  The INT80 path is perhaps indeed more questionable
> in this regard.
>
>> Furthermore, they will complicate splitting the early entry code
>> away from the main .text section for a full isolation implementation.
>>
>> For now, I'd drop them and have a simple jz .Lskip.
> I will replace them (on the entry paths; I think the one on the
> exit-to-Xen path is valid) to please you and in the interest of
> forward progress; maybe this will even slightly help backporting to
> really old Xen versions, where the UNLIKELY* constructs don't exist
> yet.
>
>> Also, can we collect these together into macros, rather than
>> opencoding?  We seem to have 3 distinct variations.
> I had considered that (following the model you use in the SP2
> series), but decided against it not least because of the dependent
> but placement-wise separated code additions to restore original
> values.  Plus again this might be a hindrance to backporting to
> really old Xen (which then typically will also be built on really
> old tool chains) - as you certainly recall, old gas had quite a few
> issues with macro handling.

There is nothing special in these macros though?  I found the SP2-style
far easier to backport because they are a single slot-in line.

Anyway, I'm not overly fussed, but I have a

>
>>> --- a/xen/arch/x86/x86_64/entry.S
>>> +++ b/xen/arch/x86/x86_64/entry.S
>>> @@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
>>>  /* %rbx: struct vcpu, interrupts disabled */
>>>  restore_all_guest:
>>>          ASSERT_INTERRUPTS_DISABLED
>>> +
>>> +        /* Copy guest mappings and switch to per-CPU root page table. */
>>> +        mov   %cr3, %r9
>>> +        GET_STACK_END(dx)
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
>>> +        movabs $PADDR_MASK & PAGE_MASK, %rsi
>>> +        movabs $DIRECTMAP_VIRT_START, %rcx
>>> +        mov   %rdi, %rax
>>> +        and   %rsi, %rdi
>>> +        and   %r9, %rsi
>>> +        add   %rcx, %rdi
>>> +        add   %rcx, %rsi
>>> +        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
>>> +        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
>>> +        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
>>> +        rep movsq
>>> +        mov   $ROOT_PAGETABLE_ENTRIES - \
>>> +               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
>>> +        rep movsq
>>> +        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>>> +        write_cr3 rax, rdi, rsi
>>> +
>> Can we possibly move this up into C?  For this simplistic algorithm
>> it is ok in ASM, but if we want to do any optimisations to avoid the
>> 4k memcpy (generation count hidden somewhere in page_info?), ASM is
>> going to quickly become unwieldy.
> I'd prefer to move it into C when it really becomes necessary.  Also
> you don't properly qualify "this" - for example, I'd rather not move
> the write_cr3 invocation into C, yet the placement of your comment
> suggests doing so.

I meant the whole hunk, not write_cr3 itself.
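[For illustration, the copy logic of the whole hunk expressed in C might
look roughly like the sketch below.  This is not part of the patch;
copy_root_pgt() and its calling environment are invented for the
example, though the slot constants, root_pgentry_t and
l4_table_offset() are the real Xen names.]

    /*
     * Hypothetical C rendition of the restore_all_guest copy above.
     * root_pgt is the per-CPU root page table (pv_cr3), guest_l4 the
     * guest's own L4 (%cr3); both are assumed reachable through the
     * direct map, as in the asm.
     */
    static void copy_root_pgt(root_pgentry_t *root_pgt,
                              const root_pgentry_t *guest_l4)
    {
        unsigned int i;

        /* Guest-controlled slots below the Xen region. */
        for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; ++i )
            root_pgt[i] = guest_l4[i];

        /* The shadow-linear slot, which lives inside the Xen region. */
        root_pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
            guest_l4[l4_table_offset(SH_LINEAR_PT_VIRT_START)];

        /* Guest-controlled slots above the Xen region. */
        for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1;
              i < ROOT_PAGETABLE_ENTRIES; ++i )
            root_pgt[i] = guest_l4[i];
    }

[The asm additionally records the old %cr3 in xen_cr3 and issues the
write_cr3 afterwards; only the copy itself is modelled here.]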
>
>> Another optimisation I found made a massive difference for the
>> KAISER series was to have an MRU cache of 4 pagetables, so in-guest
>> syscalls don't result in any copying as we pass in and out of Xen.
> As said elsewhere - optimization can come later.  Plus - is avoiding
> the copying at _any_ time actually correct, considering possible
> racing L4 entry updates on another vCPU?

Indeed.  The common behaviour is that L4's don't really change after
fork(), so having a generation count on them will allow skipping of the
memcpy() in most cases.  You can't skip the check, but the amount of
logic used to avoid a 4k memcpy can be quite substantial and still be a
performance win.
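[To make the generation-count idea concrete, a sketch along the
following lines - every name here (l4_gen, l4_sync_state,
sync_guest_l4()) is invented for illustration, and the ordering of the
generation bump against the actual PTE write would need care precisely
because of the racing-update concern raised above.]

    /*
     * Hypothetical generation-count check; reuses the copy_root_pgt()
     * sketch from above.  Writers of an L4 would bump a counter (here
     * imagined as an l4_gen field in struct page_info); the exit path
     * re-copies only when the counter moved since the last sync.
     */
    struct l4_sync_state {
        unsigned long synced_mfn;  /* guest L4 last copied from */
        uint64_t      synced_gen;  /* its generation at that time */
    };
    static DEFINE_PER_CPU(struct l4_sync_state, l4_sync);

    static void sync_guest_l4(root_pgentry_t *root_pgt,
                              unsigned long l4_mfn)
    {
        struct l4_sync_state *s = &this_cpu(l4_sync);
        uint64_t gen = read_atomic(&mfn_to_page(l4_mfn)->l4_gen);

        /* The check itself can't be skipped - only the 4k copy can. */
        if ( s->synced_mfn != l4_mfn || s->synced_gen != gen )
        {
            copy_root_pgt(root_pgt, mfn_to_virt(l4_mfn));
            s->synced_mfn = l4_mfn;
            s->synced_gen = gen;
        }
    }

[The MRU cache suggested for KAISER would extend this from one
(root table, mfn, generation) triple per CPU to a small array of them.]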
>
>>> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>>>          ALIGN
>>>  /* No special register assumptions. */
>>>  restore_all_xen:
>>> +        /*
>>> +         * Check whether we need to switch to the per-CPU page tables,
>>> +         * in case we return to late PV exit code (from an NMI or #MC).
>>> +         */
>>> +        GET_STACK_END(ax)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
>>> +        test  %rdx, %rdx
>>> +UNLIKELY_START(g, exit_cr3)
>> cmp or ne ?
> "ne" (or really "nz" when used with "test") is outright wrong - we
> want to skip the restore when the value is zero _or_ negative.
> What's wrong with "jg" and "test" in combination?  There simply is
> no "jnsz" (other than e.g. "jnbe").  "cmp" against zero could be
> used here, but why would I use the larger instruction when "test"
> does?

Greater/less-than conditions are not commonly paired with the test
instruction, which is why this looks wrong to a reviewer.  A comment of
/* Ideally jnsz, but jg will have to do */ would go a very long way.

I've double checked the logic and I agree with your conclusions, but
the only reason this works is because test unconditionally zeroes the
overflow flag.
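[As a reviewer's crib, the flag logic and the xen_cr3 value convention
the various paths dispatch on can be modelled in C as below.  This is
purely illustrative - ist_switch_cr3() is not a real function; it
mirrors the three-way dispatch visible in the double_fault hunk further
down, while restore_all_xen acts only on the strictly positive case,
hence jg rather than jnz.]

    /*
     * Hypothetical model of the xen_cr3 encoding: 0 = Xen page tables
     * already active, nothing to do; > 0 = a CR3 value to load;
     * < 0 = the negation of a CR3 value to load.
     *
     * After "test %rdx, %rdx": ZF = (val == 0), SF = (val < 0), and
     * OF is unconditionally cleared.  "jg" requires ZF == 0 && SF == OF,
     * which with OF == 0 degenerates to val > 0 - the "jnsz" that
     * doesn't exist as a mnemonic.
     */
    static void ist_switch_cr3(long xen_cr3)
    {
        if ( xen_cr3 == 0 )      /* jz: nothing to restore */
            return;
        if ( xen_cr3 < 0 )       /* sign set: value stored negated */
            xen_cr3 = -xen_cr3;  /* neg */
        write_cr3(xen_cr3);      /* the write_cr3 macro in the asm */
    }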
>
>>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>>          movl  $TRAP_double_fault,4(%rsp)
>>>          /* Set AC to reduce chance of further SMAP faults */
>>>          SAVE_ALL STAC
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>>> +        test  %rbx, %rbx
>>> +        jz    .Ldblf_cr3_okay
>>> +        jns   .Ldblf_cr3_load
>>> +        neg   %rbx
>>> +.Ldblf_cr3_load:
>>> +        write_cr3 rbx, rdi, rsi
>>> +.Ldblf_cr3_okay:
>>> +
>> It is moderately common for there to be cascade faults in #DF.  This
>> would be better if it were the general IST switch.
> Based on the issues I had with #DF occurring while debugging this,
> I've decided to keep the code here as simple as possible without
> being incorrect: There's no point looking at the incoming CR3.
> There's no point in trying to avoid nested faults (including
> subsequent #DF) restoring CR3.  There's also no point in retaining
> the value for later restoring here, as we never return.  In fact, as
> mentioned elsewhere, we should imo indeed consider unilaterally
> switching to idle_pg_table[] here.
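[The unilateral switch alluded to here would amount to something like
the following - a sketch, not what this patch does, though
idle_pg_table and __pa() are the real Xen symbols.]

    /*
     * Hypothetical #DF variant: don't inspect the incoming xen_cr3 at
     * all, just move to the idle page tables - we never return anyway.
     */
    static void dblflt_switch_to_idle(void)
    {
        write_cr3(__pa(idle_pg_table));
    }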
Ok.

~Andrew

>
> Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel