
Re: [PATCH 3/3] x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling



On 14/01/2022 12:50, Roger Pau Monné wrote:
> On Thu, Jan 13, 2022 at 04:38:33PM +0000, Andrew Cooper wrote:
>> The logic was based on a mistaken understanding of how NMI blocking on vmexit
>> works.  NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
>> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
>> and the guest's value will be clobbered before it is saved.
>>
>> Switch to using MSR load/save lists.  This causes the guest value to be saved
>> atomically with respect to NMIs/MCEs/etc.
>>
>> First, update vmx_cpuid_policy_changed() to configure the load/save lists at
>> the same time as configuring the intercepts.  This function is always used in
>> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
>> function, rather than having multiple remote acquisitions of the same VMCS.
>>
>> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the
>> guest, so handle failures using domain_crash().  vmx_del_msr() can fail with
>> -ESRCH but we don't care, and this path will be taken during domain create
>> on a system lacking IBRSB.
>>
>> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists
>> rather than vcpu_msrs, and update the comment to describe the new state
>> location.
>>
>> Finally, adjust the entry/exit asm.  Drop DO_SPEC_CTRL_ENTRY_FROM_HVM
>> entirely, and use a shorter code sequence to simply reload Xen's setting from
>> the top-of-stack block.
>>
>> Because the guest values are loaded/saved atomically, we do not need to use
>> the shadowing logic to cope with late NMIs/etc, so we can omit
>> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in
>> context.  Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's
>> no need to switch back to Xen's context on an early failure.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> ---
>> CC: Jan Beulich <JBeulich@xxxxxxxx>
>> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> CC: Wei Liu <wl@xxxxxxx>
>> CC: Jun Nakajima <jun.nakajima@xxxxxxxxx>
>> CC: Kevin Tian <kevin.tian@xxxxxxxxx>
>>
>> Needs backporting as far as people can tolerate.
>>
>> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off,
>> but this is awkward to arrange in asm.
>> ---
>>  xen/arch/x86/hvm/vmx/entry.S             | 19 ++++++++++-------
>>  xen/arch/x86/hvm/vmx/vmx.c               | 36 +++++++++++++++++++++++++++-----
>>  xen/arch/x86/include/asm/msr.h           | 10 ++++++++-
>>  xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------
>>  4 files changed, 56 insertions(+), 41 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
>> index 30139ae58e9d..297ed8685126 100644
>> --- a/xen/arch/x86/hvm/vmx/entry.S
>> +++ b/xen/arch/x86/hvm/vmx/entry.S
>> @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler)
>>  
>>          /* SPEC_CTRL_ENTRY_FROM_VMX    Req: b=curr %rsp=regs/cpuinfo, Clob: acd */
>>          ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM
>> -        ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM
>> +
>> +        .macro restore_spec_ctrl
>> +            mov    $MSR_SPEC_CTRL, %ecx
>> +            movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax
>> +            xor    %edx, %edx
>> +            wrmsr
>> +        .endm
>> +        ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM
> I assume loading the host value into SPEC_CTRL strictly needs to
> happen after the RSB overwrite, or else you could use a VM exit host
> load MSR entry to handle MSR_SPEC_CTRL.

We could use the host load list, but Intel warned that the lists aren't
as efficient as writing it like this, which is why I didn't use them
originally, back when we thought the NMI behaviour was different.  Obviously,
we're using them now for correctness reasons, which is more important
than avoiding them for (unquantified) perf reasons.
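
For reference, the host-load-list alternative Roger describes would look
roughly like the below, in vmx_cpuid_policy_changed() (illustrative sketch
only, untested; it assumes the existing vmx_add_host_load_msr() helper and
default_xen_spec_ctrl as the value Xen wants back in context):

    /*
     * Sketch: have the processor restore Xen's MSR_SPEC_CTRL via the
     * VM-exit MSR-load list, instead of the explicit WRMSR early in
     * vmx_asm_vmexit_handler.  The loads happen as part of the VMExit
     * transition itself, before any Xen instructions run, but list
     * processing is reportedly slower than a single WRMSR on this path.
     */
    rc = vmx_add_host_load_msr(v, MSR_SPEC_CTRL, default_xen_spec_ctrl);
    if ( rc )
        domain_crash(v->domain);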

I've got some plans for changes to how Xen handles MSR_SPEC_CTRL for its
own protection, which I hope will bring a substantial efficiency
improvement, and those would require us not to use a host load list here.

> Also I don't understand why SPEC_CTRL shadowing is not cleared before
> loading Xen's value now.

Because we no longer set up shadowing in the first place, because of ...

>
>>          /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
>>  
>>          /* Hardware clears MSR_DEBUGCTL on VMExit.  Reinstate it if debugging Xen. */
>> @@ -83,7 +90,6 @@ UNLIKELY_END(realmode)
>>  
>>          /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
>>          /* SPEC_CTRL_EXIT_TO_VMX   Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
>> -        ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM
>> -        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM

... this hunk.

Also, see the note about ASSERT() in the commit message.
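
For what it's worth, the property a C version could assert is roughly this
(illustrative sketch only; it assumes the spec_ctrl_flags/SCF_use_shadow
tracking in struct cpu_info):

    /*
     * Sketch: with the guest value handled by the MSR load/save lists,
     * the shadowing machinery is never armed for HVM vcpus, so the
     * VMExit path can reload Xen's MSR_SPEC_CTRL unconditionally.
     */
    ASSERT(!(get_cpu_info()->spec_ctrl_flags & SCF_use_shadow));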

I'm happy to try and make this clearer in the commit message if you have
any suggestions.

>>  
>>          mov  VCPU_hvm_guest_cr2(%rbx),%rax
>> @@ -119,12 +125,11 @@ UNLIKELY_END(realmode)
>>          SAVE_ALL
>>  
>>          /*
>> -         * PV variant needed here as no guest code has executed (so
>> -         * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable
>> -         * to hit (in which case the HVM variant might corrupt things).
>> +         * SPEC_CTRL_ENTRY notes
>> +         *
>> +         * If we end up here, no guest code has executed.  We still have Xen's
>> +         * choice of MSR_SPEC_CTRL in context, and the RSB is safe.
>>           */
>> -        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */
>> -        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
>>  
>>          call vmx_vmentry_failure
>>          jmp  .Lvmx_process_softirqs
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 28ee6393f11e..d7feb5f9c455 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -592,6 +592,7 @@ void vmx_update_exception_bitmap(struct vcpu *v)
>>  static void vmx_cpuid_policy_changed(struct vcpu *v)
>>  {
>>      const struct cpuid_policy *cp = v->domain->arch.cpuid;
>> +    int rc = 0;
>>  
>>      if ( opt_hvm_fep ||
>>           (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) )
>> @@ -601,17 +602,27 @@ static void vmx_cpuid_policy_changed(struct vcpu *v)
>>  
>>      vmx_vmcs_enter(v);
>>      vmx_update_exception_bitmap(v);
>> -    vmx_vmcs_exit(v);
>>  
>>      /*
>>       * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP
>>       * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored.
>>       */
>>      if ( cp->feat.ibrsb )
>> +    {
>>          vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);
>> +
>> +        rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0);
>> +        if ( rc )
>> +            goto err;
>> +    }
>>      else
>> +    {
>>          vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);
>>  
>> +        /* Can only fail with -ESRCH, and we don't care. */
>> +        vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST);
>> +    }
>> +
>>      /* MSR_PRED_CMD is safe to pass through if the guest knows about it. */
>>      if ( cp->feat.ibrsb || cp->extd.ibpb )
>>          vmx_clear_msr_intercept(v, MSR_PRED_CMD,  VMX_MSR_RW);
>> @@ -623,6 +634,15 @@ static void vmx_cpuid_policy_changed(struct vcpu *v)
>>          vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
>>      else
>>          vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW);
>> +
>> + err:
> Nit: I would name this out, since it's used by both the error and the
> normal exit paths, but that's just my taste.

Ok.

>
>> +    vmx_vmcs_exit(v);
>> +
>> +    if ( rc )
>> +    {
>> +        printk(XENLOG_G_ERR "MSR load/save list error: %d\n", rc);
>> +        domain_crash(v->domain);
>> +    }
>>  }
>>  
>>  int vmx_guest_x86_mode(struct vcpu *v)
>> @@ -3128,7 +3148,6 @@ static int is_last_branch_msr(u32 ecx)
>>  static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>>  {
>>      struct vcpu *curr = current;
>> -    const struct vcpu_msrs *msrs = curr->arch.msrs;
>>      uint64_t tmp;
>>  
>>      HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr);
>> @@ -3136,7 +3155,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>>      switch ( msr )
>>      {
>>      case MSR_SPEC_CTRL: /* guest_rdmsr() has already performed #GP checks. */
>> -        *msr_content = msrs->spec_ctrl.raw;
>> +        if ( vmx_read_guest_msr(curr, msr, msr_content) )
>> +        {
>> +            ASSERT_UNREACHABLE();
>> +            goto gp_fault;
>> +        }
> AFAICT this is required just for Xen internal callers, as a guest
> attempt to access MSR_SPEC_CTRL will never be trapped, or if trapped it
> would be because !cp->feat.ibrsb and thus won't get here anyway.

We can end up here through FEP, or introspection caring about the MSR,
but most importantly, for moving MSR_SPEC_CTRL on migrate.

What the ASSERT_UNREACHABLE() is saying is that "if the common code
declares no #GP fault, then MSR_SPEC_CTRL *definitely* exists in the MSR
list".
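
Spelled out as code, the invariant is roughly this (hypothetical helper,
purely illustrative, not part of the patch):

    /*
     * MSR_SPEC_CTRL is on the guest load/save list if and only if the
     * policy offers IBRSB, which is the same condition guest_rdmsr()
     * uses to decide whether the access #GPs.
     */
    static void check_spec_ctrl_list_invariant(const struct vcpu *v)
    {
        const struct cpuid_policy *cp = v->domain->arch.cpuid;
        uint64_t val;

        if ( cp->feat.ibrsb )
            ASSERT(vmx_read_guest_msr(v, MSR_SPEC_CTRL, &val) == 0);
    }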

~Andrew



 

