Xen project Mailing List

[Xen-devel] [PATCH v2 9/9] x86/vmx: Don't leak EFER.NXE into guest context

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Fri, 8 Jun 2018 19:48:43 +0100

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>

Delivery-date: Fri, 08 Jun 2018 18:48:55 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Intel hardware only uses 4 bits in MSR_EFER. Changes to LME and LMA are handled automatically via the VMENTRY_CTLS.IA32E_MODE bit. SCE is handled by ad-hoc logic in context_switch(), vmx_restore_guest_msrs() and vmx_update_guest_efer(), and works by altering the host SCE value to match the setting the guest wants. This works because, in HVM vcpu context, Xen never needs to execute a SYSCALL or SYSRET instruction. However, NXE has never been context switched. Unlike SCE, NXE cannot be context switched at vcpu boundaries because disabling NXE makes PTE.NX bits reserved and cause a pagefault when encountered. This means that the guest always has Xen's setting in effect, irrespective of the bit it can see and modify in its virtualised view of MSR_EFER. This isn't a major problem for production operating systems because they, like Xen, always turn the NXE on when it is available. However, it does have an observable effect on which guest PTE bits are valid, and whether PFEC_insn_fetch is visible in a #PF error code. Second generation VT-x hardware has host and guest EFER fields in the VMCS, and support for loading and saving them automatically. First generation VT-x hardware needs to use MSR load/save lists to cause an atomic switch of MSR_EFER on vmentry/exit. Therefore we update vmx_init_vmcs_config() to find and use guest/host EFER support when available (and MSR load/save lists on older hardware) and drop all ad-hoc alteration of SCE. There are two minor complications when selecting the EFER setting: * For shadow guests, NXE is a paging setting and must remain under host control, but this is fine as Xen also handles the pagefaults. * When the Unrestricted Guest control is clear, hardware doesn't tolerate LME and LMA being different. This doesn't matter in practice as we intercept all writes to CR0 and reads from MSR_EFER, so can provide architecturally consistent behaviour from the guests point of view. With changing how EFER is loaded, vmcs_dump_vcpu() needs adjusting. Read EFER from the appropriate information source, and identify when dumping the guest EFER value which source was used. As a result of fixing EFER context switching, we can remove the Intel-special case from hvm_nx_enabled() and let guest_walk_tables() work with the real guest paging settings. Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> Reviewed-by: Tim Deegan <tim@xxxxxxx> --- CC: Jan Beulich <JBeulich@xxxxxxxx> CC: Jun Nakajima <jun.nakajima@xxxxxxxxx> CC: Kevin Tian <kevin.tian@xxxxxxxxx> CC: Tim Deegan <tim@xxxxxxx> CC: Wei Liu <wei.liu2@xxxxxxxxxx> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx> v2: * Fix a vmentry failure on Nehalem/Westmere hardware. LME != LMA is actuall dependent on Unrestricted Guest (which is clear for Shadow guests as a side effect of clearing EPT). * Fix vmcs_dump_vcpu() to cope with the new EFER behaviour. --- xen/arch/x86/domain.c | 10 ----- xen/arch/x86/hvm/vmx/vmcs.c | 26 +++++++++---- xen/arch/x86/hvm/vmx/vmx.c | 78 ++++++++++++++++++++++++++++---------- xen/include/asm-x86/hvm/hvm.h | 4 +- xen/include/asm-x86/hvm/vmx/vmcs.h | 16 ++++++++ 5 files changed, 95 insertions(+), 39 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index 0ca820a..e817795 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -1723,16 +1723,6 @@ void context_switch(struct vcpu *prev, struct vcpu *next) { __context_switch(); - if ( is_pv_domain(nextd) && - (is_idle_domain(prevd) || - is_hvm_domain(prevd) || - is_pv_32bit_domain(prevd) != is_pv_32bit_domain(nextd)) ) - { - uint64_t efer = read_efer(); - if ( !(efer & EFER_SCE) ) - write_efer(efer | EFER_SCE); - } - /* Re-enable interrupts before restoring state which may fault. */ local_irq_enable(); diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index 0ebccc4..9e4e29d 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -342,8 +342,8 @@ static int vmx_init_vmcs_config(void) } min = VM_EXIT_ACK_INTR_ON_EXIT; - opt = VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_LOAD_HOST_PAT | - VM_EXIT_CLEAR_BNDCFGS; + opt = (VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_LOAD_HOST_PAT | + VM_EXIT_LOAD_HOST_EFER | VM_EXIT_CLEAR_BNDCFGS); min |= VM_EXIT_IA32E_MODE; _vmx_vmexit_control = adjust_vmx_controls( "VMExit Control", min, opt, MSR_IA32_VMX_EXIT_CTLS, &mismatch); @@ -383,7 +383,8 @@ static int vmx_init_vmcs_config(void) _vmx_secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS; min = 0; - opt = VM_ENTRY_LOAD_GUEST_PAT | VM_ENTRY_LOAD_BNDCFGS; + opt = (VM_ENTRY_LOAD_GUEST_PAT | VM_ENTRY_LOAD_GUEST_EFER | + VM_ENTRY_LOAD_BNDCFGS); _vmx_vmentry_control = adjust_vmx_controls( "VMEntry Control", min, opt, MSR_IA32_VMX_ENTRY_CTLS, &mismatch); @@ -1148,6 +1149,8 @@ static int construct_vmcs(struct vcpu *v) v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS; __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0); __vmwrite(HOST_CR4, mmu_cr4_features); + if ( cpu_has_vmx_efer ) + __vmwrite(HOST_EFER, read_efer()); /* Host CS:RIP. */ __vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS); @@ -1905,7 +1908,17 @@ void vmcs_dump_vcpu(struct vcpu *v) vmentry_ctl = vmr32(VM_ENTRY_CONTROLS), vmexit_ctl = vmr32(VM_EXIT_CONTROLS); cr4 = vmr(GUEST_CR4); - efer = vmr(GUEST_EFER); + + /* + * The guests EFER setting comes from the GUEST_EFER VMCS field whenever + * available, or the guest load-only MSR list on Gen1 hardware, the entry + * for which may be elided for performance reasons if identical to Xen's + * setting. + */ + if ( cpu_has_vmx_efer ) + efer = vmr(GUEST_EFER); + else if ( vmx_read_guest_loadonly_msr(v, MSR_EFER, &efer) ) + efer = read_efer(); printk("*** Guest State ***\n"); printk("CR0: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n", @@ -1942,9 +1955,8 @@ void vmcs_dump_vcpu(struct vcpu *v) vmx_dump_sel("LDTR", GUEST_LDTR_SELECTOR); vmx_dump_sel2("IDTR", GUEST_IDTR_LIMIT); vmx_dump_sel(" TR", GUEST_TR_SELECTOR); - if ( (vmexit_ctl & (VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_SAVE_GUEST_EFER)) || - (vmentry_ctl & (VM_ENTRY_LOAD_GUEST_PAT | VM_ENTRY_LOAD_GUEST_EFER)) ) - printk("EFER = 0x%016lx PAT = 0x%016lx\n", efer, vmr(GUEST_PAT)); + printk("EFER(%s) = 0x%016lx PAT = 0x%016lx\n", + cpu_has_vmx_efer ? "VMCS" : "MSR LL", efer, vmr(GUEST_PAT)); printk("PreemptionTimer = 0x%08x SM Base = 0x%08x\n", vmr32(GUEST_PREEMPTION_TIMER), vmr32(GUEST_SMBASE)); printk("DebugCtl = 0x%016lx DebugExceptions = 0x%016lx\n", diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index b167fde..5d04e45 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -509,15 +509,6 @@ static void vmx_restore_guest_msrs(struct vcpu *v) wrmsrl(MSR_LSTAR, v->arch.hvm_vmx.lstar); wrmsrl(MSR_SYSCALL_MASK, v->arch.hvm_vmx.sfmask); - if ( (v->arch.hvm_vcpu.guest_efer ^ read_efer()) & EFER_SCE ) - { - HVM_DBG_LOG(DBG_LEVEL_2, - "restore guest's EFER with value %lx", - v->arch.hvm_vcpu.guest_efer); - write_efer((read_efer() & ~EFER_SCE) | - (v->arch.hvm_vcpu.guest_efer & EFER_SCE)); - } - if ( cpu_has_rdtscp ) wrmsr_tsc_aux(hvm_msr_tsc_aux(v)); } @@ -1646,22 +1637,71 @@ static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr, static void vmx_update_guest_efer(struct vcpu *v) { - unsigned long vm_entry_value; + unsigned long entry_ctls, guest_efer = v->arch.hvm_vcpu.guest_efer, + xen_efer = read_efer(); + + if ( paging_mode_shadow(v->domain) ) + { + /* + * When using shadow pagetables, EFER.NX is a Xen-owned bit and is not + * under guest control. + */ + guest_efer &= ~EFER_NX; + guest_efer |= xen_efer & EFER_NX; + } + + if ( !(v->arch.hvm_vmx.secondary_exec_control & + SECONDARY_EXEC_UNRESTRICTED_GUEST) ) + { + /* + * When Unrestricted Guest is not enabled in the VMCS, hardware does + * not tolerate the LME and LMA settings being different. As writes + * to CR0 are intercepted, it is safe to leave LME clear at this + * point, and fix up both LME and LMA when CR0.PG is set. + */ + if ( !(guest_efer & EFER_LMA) ) + guest_efer &= ~EFER_LME; + } vmx_vmcs_enter(v); - __vmread(VM_ENTRY_CONTROLS, &vm_entry_value); - if ( v->arch.hvm_vcpu.guest_efer & EFER_LMA ) - vm_entry_value |= VM_ENTRY_IA32E_MODE; + /* + * The intended guest running mode is derived from VM_ENTRY_IA32E_MODE, + * which (architecturally) is the guest's LMA setting. + */ + __vmread(VM_ENTRY_CONTROLS, &entry_ctls); + + entry_ctls &= ~VM_ENTRY_IA32E_MODE; + if ( guest_efer & EFER_LMA ) + entry_ctls |= VM_ENTRY_IA32E_MODE; + + __vmwrite(VM_ENTRY_CONTROLS, entry_ctls); + + /* We expect to use EFER loading in the common case, but... */ + if ( likely(cpu_has_vmx_efer) ) + __vmwrite(GUEST_EFER, guest_efer); + + /* ... on Gen1 VT-x hardware, we have to use MSR load/save lists instead. */ else - vm_entry_value &= ~VM_ENTRY_IA32E_MODE; - __vmwrite(VM_ENTRY_CONTROLS, vm_entry_value); + { + /* + * When the guests choice of EFER matches Xen's, remove the load/save + * list entries. It is unnecessary overhead, especially as this is + * expected to be the common case for 64bit guests. + */ + if ( guest_efer == xen_efer ) + { + vmx_del_msr(v, MSR_EFER, VMX_MSR_HOST); + vmx_del_msr(v, MSR_EFER, VMX_MSR_GUEST_LOADONLY); + } + else + { + vmx_add_msr(v, MSR_EFER, xen_efer, VMX_MSR_HOST); + vmx_add_msr(v, MSR_EFER, guest_efer, VMX_MSR_GUEST_LOADONLY); + } + } vmx_vmcs_exit(v); - - if ( v == current ) - write_efer((read_efer() & ~EFER_SCE) | - (v->arch.hvm_vcpu.guest_efer & EFER_SCE)); } void nvmx_enqueue_n2_exceptions(struct vcpu *v, diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h index ef5e198..fcfc5cf 100644 --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -296,10 +296,8 @@ int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode); (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMEP)) #define hvm_smap_enabled(v) \ (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_SMAP)) -/* HVM guests on Intel hardware leak Xen's NX settings into guest context. */ #define hvm_nx_enabled(v) \ - ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && cpu_has_nx) || \ - ((v)->arch.hvm_vcpu.guest_efer & EFER_NX)) + ((v)->arch.hvm_vcpu.guest_efer & EFER_NX) #define hvm_pku_enabled(v) \ (hvm_paging_enabled(v) && ((v)->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE)) diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h index 6629b4f..d4b59a0 100644 --- a/xen/include/asm-x86/hvm/vmx/vmcs.h +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h @@ -311,6 +311,8 @@ extern u64 vmx_ept_vpid_cap; (vmx_cpu_based_exec_control & CPU_BASED_MONITOR_TRAP_FLAG) #define cpu_has_vmx_pat \ (vmx_vmentry_control & VM_ENTRY_LOAD_GUEST_PAT) +#define cpu_has_vmx_efer \ + (vmx_vmentry_control & VM_ENTRY_LOAD_GUEST_EFER) #define cpu_has_vmx_unrestricted_guest \ (vmx_secondary_exec_control & SECONDARY_EXEC_UNRESTRICTED_GUEST) #define vmx_unrestricted_guest(v) \ @@ -596,6 +598,20 @@ static inline int vmx_read_guest_msr(const struct vcpu *v, uint32_t msr, return 0; } +static inline int vmx_read_guest_loadonly_msr( + const struct vcpu *v, uint32_t msr, uint64_t *val) +{ + const struct vmx_msr_entry *ent = + vmx_find_msr(v, msr, VMX_MSR_GUEST_LOADONLY); + + if ( !ent ) + return -ESRCH; + + *val = ent->data; + + return 0; +} + static inline int vmx_write_guest_msr(struct vcpu *v, uint32_t msr, uint64_t val) { -- 2.1.4 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.