
[Xen-devel] [PATCH v2 5/5] x86: Rework MSR_TSC_AUX handling from scratch.



There are many problems with MSR_TSC_AUX handling.

To begin with, the RDPID instruction reads MSR_TSC_AUX, but it is only the
RDTSCP feature which enumerates the MSR.  Therefore, RDPID functionally
depends on RDTSCP.
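
As a guest-side illustration (not part of this patch; hypothetical helper
names), both instructions hand back the content of MSR_TSC_AUX, which is why
the dependency needs expressing:

    /*
     * RDTSCP returns the TSC in EDX:EAX and MSR_TSC_AUX in ECX, while RDPID
     * returns MSR_TSC_AUX directly.  Neither is meaningful unless the MSR is
     * enumerated, i.e. unless RDTSCP is exposed.
     */
    static inline uint32_t tsc_aux_via_rdtscp(void)
    {
        uint32_t lo, hi, aux;

        asm volatile ( "rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux) );
        return aux;
    }

    static inline uint64_t tsc_aux_via_rdpid(void)
    {
        uint64_t aux;

        /* 64-bit form; needs an assembler which knows the RDPID mnemonic. */
        asm volatile ( "rdpid %0" : "=r" (aux) );
        return aux;
    }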

For PV guests, we hide RDTSCP but advertise RDPID.  We also silently drop
writes to MSR_TSC_AUX, which is very antisocial.  Therefore, enable RDTSCP for
PV guests, which in turn allows RDPID to work.

To support RDTSCP properly for PV guests, the MSR_TSC_AUX handling is moved
into the generic MSR policy infrastructure, and becomes common.  One
improvement is that we will now reject invalid values, rather than silently
truncating and accepting them.  This also causes the emulator to reject RDTSCP
for guests without the feature.

One complication is TSC_MODE_PVRDTSCP, in which Xen takes control of
MSR_TSC_AUX and the reported value is actually the migration incarnation.  The
previous behaviour of this mode was to silently drop writes, but as the mode is
already a departure from the x86 ABI, switch to saner semantics and have
TSC_MODE_PVRDTSCP make the MSR properly read-only.
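
For context, a PVRDTSCP-aware guest is expected to consume the value roughly
as follows (hypothetical guest-side sketch, not part of this patch):

    /*
     * Under TSC_MODE_PVRDTSCP, the aux value returned by RDTSCP is the
     * migration incarnation, so a change in it tells the guest to refresh
     * whatever TSC-derived time parameters it has cached.
     */
    static bool tsc_params_stale(uint32_t *cached_incarnation)
    {
        uint32_t lo, hi, aux;

        asm volatile ( "rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux) );

        if ( aux == *cached_incarnation )
            return false;

        *cached_incarnation = aux;
        return true;
    }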

With PV guests now getting to use MSR_TSC_AUX properly, the MSR needs
migrating, so it is moved into the common MSR logic for PV guests.  Care must
be taken, however, to avoid sending the MSR if TSC_MODE_PVRDTSCP is active, as
guest_wrmsr() on the receiving side would reject it.  The HVM side is likewise
tweaked to only send/receive hvm_hw_cpu.msr_tsc_aux when the TSC logic isn't
in control of the value.

What remains is that tsc_set_info() needs to broadcast d->arch.incarnation into
each vCPU's MSR block when in TSC_MODE_PVRDTSCP, so that the context switching
and emulation code functions correctly.

Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
---
CC: Jan Beulich <JBeulich@xxxxxxxx>
CC: Jun Nakajima <jun.nakajima@xxxxxxxxx>
CC: Kevin Tian <kevin.tian@xxxxxxxxx>
CC: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
CC: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx>
CC: Wei Liu <wei.liu2@xxxxxxxxxx>
CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>
CC: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

v2:
 * Spelling corrections in the commit message
 * Fix HVM migration issues
---
 xen/arch/x86/domain.c                       |  3 +--
 xen/arch/x86/domctl.c                       | 15 ++++++++++++++-
 xen/arch/x86/hvm/hvm.c                      | 19 ++++++-------------
 xen/arch/x86/hvm/svm/svm.c                  |  4 ++--
 xen/arch/x86/hvm/vmx/vmx.c                  |  4 ++--
 xen/arch/x86/msr.c                          | 16 ++++++++++++++++
 xen/arch/x86/pv/emul-inv-op.c               |  4 +---
 xen/arch/x86/pv/emul-priv-op.c              |  5 -----
 xen/arch/x86/time.c                         | 11 +++++++++++
 xen/arch/x86/x86_emulate/x86_emulate.c      |  1 +
 xen/include/asm-x86/hvm/hvm.h               |  6 ------
 xen/include/asm-x86/hvm/vcpu.h              |  1 -
 xen/include/asm-x86/msr.h                   |  9 +++++++++
 xen/include/public/arch-x86/cpufeatureset.h |  2 +-
 xen/include/public/arch-x86/hvm/save.h      |  5 +++++
 xen/tools/gen-cpuid.py                      |  3 +++
 16 files changed, 72 insertions(+), 36 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 9c3527f..144d6f0 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1534,8 +1534,7 @@ void paravirt_ctxt_switch_to(struct vcpu *v)
         activate_debugregs(v);
 
     if ( cpu_has_rdtscp )
-        wrmsr_tsc_aux(v->domain->arch.tsc_mode == TSC_MODE_PVRDTSCP
-                      ? v->domain->arch.incarnation : 0);
+        wrmsr_tsc_aux(v->arch.msr->tsc_aux);
 }
 
 /* Update per-VCPU guest runstate shared memory area (if registered). */
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3a..979afdf 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1249,6 +1249,7 @@ long arch_do_domctl(
         static const uint32_t msrs_to_send[] = {
             MSR_SPEC_CTRL,
             MSR_INTEL_MISC_FEATURES_ENABLES,
+            MSR_TSC_AUX,
         };
         uint32_t nr_msrs = ARRAY_SIZE(msrs_to_send);
 
@@ -1284,7 +1285,18 @@ long arch_do_domctl(
                 for ( j = 0; j < ARRAY_SIZE(msrs_to_send); ++j )
                 {
                     uint64_t val;
-                    int rc = guest_rdmsr(v, msrs_to_send[j], &val);
+                    int rc;
+
+                    /*
+                     * Skip MSR_TSC_AUX if using TSC_MODE_PVRDTSCP.  In this
+                     * case, the MSR is read-only, and should be rejected if
+                     * seen on the restore side.
+                     */
+                    if ( msrs_to_send[j] == MSR_TSC_AUX &&
+                         d->arch.tsc_mode == TSC_MODE_PVRDTSCP )
+                        continue;
+
+                    rc = guest_rdmsr(v, msrs_to_send[j], &val);
 
                     /*
                      * It is the programmers responsibility to ensure that
@@ -1373,6 +1385,7 @@ long arch_do_domctl(
                 {
                 case MSR_SPEC_CTRL:
                 case MSR_INTEL_MISC_FEATURES_ENABLES:
+                case MSR_TSC_AUX:
                     if ( guest_wrmsr(v, msr.index, msr.value) != X86EMUL_OKAY )
                         break;
                     continue;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 0539551..e45f6df 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -792,7 +792,9 @@ static int hvm_save_cpu_ctxt(struct domain *d, hvm_domain_context_t *h)
 
         ctxt.tsc = hvm_get_guest_tsc_fixed(v, d->arch.hvm_domain.sync_tsc);
 
-        ctxt.msr_tsc_aux = hvm_msr_tsc_aux(v);
+        /* Only send MSR_TSC_AUX if it isn't being handled by the TSC logic. */
+        if ( d->arch.tsc_mode != TSC_MODE_PVRDTSCP )
+            ctxt.msr_tsc_aux = v->arch.msr->tsc_aux;
 
         hvm_get_segment_register(v, x86_seg_idtr, &seg);
         ctxt.idtr_limit = seg.limit;
@@ -1046,7 +1048,9 @@ static int hvm_load_cpu_ctxt(struct domain *d, hvm_domain_context_t *h)
     if ( hvm_funcs.tsc_scaling.setup )
         hvm_funcs.tsc_scaling.setup(v);
 
-    v->arch.hvm_vcpu.msr_tsc_aux = ctxt.msr_tsc_aux;
+    /* Only accept MSR_TSC_AUX if it isn't being handled by the TSC logic. */
+    if ( d->arch.tsc_mode != TSC_MODE_PVRDTSCP )
+        v->arch.msr->tsc_aux = ctxt.msr_tsc_aux;
 
     hvm_set_guest_tsc_fixed(v, ctxt.tsc, d->arch.hvm_domain.sync_tsc);
 
@@ -3424,10 +3428,6 @@ int hvm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
         *msr_content = v->arch.hvm_vcpu.msr_tsc_adjust;
         break;
 
-    case MSR_TSC_AUX:
-        *msr_content = hvm_msr_tsc_aux(v);
-        break;
-
     case MSR_IA32_APICBASE:
         *msr_content = vcpu_vlapic(v)->hw.apic_base_msr;
         break;
@@ -3575,13 +3575,6 @@ int hvm_msr_write_intercept(unsigned int msr, uint64_t msr_content,
         hvm_set_guest_tsc_adjust(v, msr_content);
         break;
 
-    case MSR_TSC_AUX:
-        v->arch.hvm_vcpu.msr_tsc_aux = (uint32_t)msr_content;
-        if ( cpu_has_rdtscp
-             && (v->domain->arch.tsc_mode != TSC_MODE_PVRDTSCP) )
-            wrmsr_tsc_aux(msr_content);
-        break;
-
     case MSR_IA32_APICBASE:
         if ( !vlapic_msr_set(vcpu_vlapic(v), msr_content) )
             goto gp_fault;
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index f53f430..9406624 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1093,7 +1093,7 @@ static void svm_ctxt_switch_to(struct vcpu *v)
     svm_tsc_ratio_load(v);
 
     if ( cpu_has_rdtscp )
-        wrmsr_tsc_aux(hvm_msr_tsc_aux(v));
+        wrmsr_tsc_aux(v->arch.msr->tsc_aux);
 }
 
 static void noreturn svm_do_resume(struct vcpu *v)
@@ -2842,7 +2842,7 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
         break;
 
     case VMEXIT_RDTSCP:
-        regs->rcx = hvm_msr_tsc_aux(v);
+        regs->rcx = v->arch.msr->tsc_aux;
         /* fall through */
     case VMEXIT_RDTSC:
         svm_vmexit_do_rdtsc(regs);
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 31acb0e..45fd9c5 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -618,7 +618,7 @@ static void vmx_restore_guest_msrs(struct vcpu *v)
     }
 
     if ( cpu_has_rdtscp )
-        wrmsr_tsc_aux(hvm_msr_tsc_aux(v));
+        wrmsr_tsc_aux(v->arch.msr->tsc_aux);
 }
 
 void vmx_update_cpu_exec_control(struct vcpu *v)
@@ -3858,7 +3858,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         vmx_invlpg_intercept(exit_qualification);
         break;
     case EXIT_REASON_RDTSCP:
-        regs->rcx = hvm_msr_tsc_aux(v);
+        regs->rcx = v->arch.msr->tsc_aux;
         /* fall through */
     case EXIT_REASON_RDTSC:
         update_guest_eip(); /* Safe: RDTSC, RDTSCP */
diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index 8ae3b4e..8a8cad2 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -175,6 +175,13 @@ int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
                _MSR_MISC_FEATURES_CPUID_FAULTING;
         break;
 
+    case MSR_TSC_AUX:
+        if ( !cp->extd.rdtscp )
+            goto gp_fault;
+
+        *val = vp->tsc_aux;
+        break;
+
     default:
         return X86EMUL_UNHANDLEABLE;
     }
@@ -250,6 +257,15 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
         break;
     }
 
+    case MSR_TSC_AUX:
+        if ( !cp->extd.rdtscp ||                      /* MSR available? */
+             d->arch.tsc_mode == TSC_MODE_PVRDTSCP || /* MSR read-only? */
+             val != (uint32_t)val )                   /* Rsvd bits set? */
+            goto gp_fault;
+
+        wrmsr_tsc_aux(vp->tsc_aux = val);
+        break;
+
     default:
         return X86EMUL_UNHANDLEABLE;
     }
diff --git a/xen/arch/x86/pv/emul-inv-op.c b/xen/arch/x86/pv/emul-inv-op.c
index be95a5f..c071372 100644
--- a/xen/arch/x86/pv/emul-inv-op.c
+++ b/xen/arch/x86/pv/emul-inv-op.c
@@ -46,7 +46,6 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
     char opcode[3];
     unsigned long eip, rc;
     struct vcpu *v = current;
-    const struct domain *currd = v->domain;
 
     eip = regs->rip;
     if ( (rc = copy_from_user(opcode, (char *)eip, sizeof(opcode))) != 0 )
@@ -59,8 +58,7 @@ static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
     eip += sizeof(opcode);
 
     msr_split(regs, pv_soft_rdtsc(v, regs));
-    regs->rcx = (currd->arch.tsc_mode == TSC_MODE_PVRDTSCP
-                 ? currd->arch.incarnation : 0);
+    regs->rcx = v->arch.msr->tsc_aux;
 
     pv_emul_instruction_done(regs, eip);
     return EXCRET_fault_fixed;
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index f2888ab..7e64c10 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -880,11 +880,6 @@ static int read_msr(unsigned int reg, uint64_t *val,
         *val = currd->arch.vtsc ? pv_soft_rdtsc(curr, ctxt->regs) : rdtsc();
         return X86EMUL_OKAY;
 
-    case MSR_TSC_AUX:
-        *val = (uint32_t)(currd->arch.tsc_mode == TSC_MODE_PVRDTSCP
-                          ? currd->arch.incarnation : 0);
-        return X86EMUL_OKAY;
-
     case MSR_EFER:
         *val = read_efer();
         if ( is_pv_32bit_domain(currd) )
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 1a6fde6..37d487c 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -2178,7 +2178,18 @@ void tsc_set_info(struct domain *d,
         }
         break;
     }
+
     d->arch.incarnation = incarnation + 1;
+
+    if ( d->arch.tsc_mode == TSC_MODE_PVRDTSCP )
+    {
+        struct vcpu *v;
+
+        /* Distribute incarnation into each vcpu's MSR_TSC_AUX. */
+        for_each_vcpu ( d, v )
+            v->arch.msr->tsc_aux = d->arch.incarnation;
+    }
+
     if ( is_hvm_domain(d) )
     {
         if ( hvm_tsc_scaling_supported && !d->arch.vtsc )
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index 85383ea..efb6003 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5141,6 +5141,7 @@ x86_emulate(
 
         case 0xf9: /* rdtscp */
             fail_if(ops->read_msr == NULL);
+            /* Getting a result implies vcpu_has_rdtscp() */
             if ( (rc = ops->read_msr(MSR_TSC_AUX,
                                      &msr_val, ctxt)) != X86EMUL_OKAY )
                 goto done;
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index dd3dd5f..7708fdf 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -511,12 +511,6 @@ static inline void hvm_invalidate_regs_fields(struct cpu_user_regs *regs)
 int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
                               struct npfec npfec);
 
-#define hvm_msr_tsc_aux(v) ({                                               \
-    struct domain *__d = (v)->domain;                                       \
-    (__d->arch.tsc_mode == TSC_MODE_PVRDTSCP)                               \
-        ? (u32)__d->arch.incarnation : (u32)(v)->arch.hvm_vcpu.msr_tsc_aux; \
-})
-
 int hvm_x2apic_msr_read(struct vcpu *v, unsigned int msr, uint64_t *msr_content);
 int hvm_x2apic_msr_write(struct vcpu *v, unsigned int msr, uint64_t msr_content);
 
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index d93166f..60425f3 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -170,7 +170,6 @@ struct hvm_vcpu {
 
     struct hvm_vcpu_asid n1asid;
 
-    u32                 msr_tsc_aux;
     u64                 msr_tsc_adjust;
     u64                 msr_xss;
 
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index a5072a2..c13388a 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -258,6 +258,15 @@ struct msr_vcpu_policy
         bool available; /* This MSR is non-architectural */
         bool cpuid_faulting;
     } misc_features_enables;
+
+    /*
+     * 0xc0000103 - MSR_TSC_AUX
+     *
+     * Usually guest chosen.  However, when using TSC_MODE_PVRDTSCP the value
+     * is Xen-controlled and is the migration incarnation, at which point the
+     * MSR becomes read-only to the guest.
+     */
+    uint32_t tsc_aux;
 };
 
 void init_guest_msr_policy(void);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index fa81af1..78fa466 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -156,7 +156,7 @@ XEN_CPUFEATURE(NX,            2*32+20) /*A  Execute Disable */
 XEN_CPUFEATURE(MMXEXT,        2*32+22) /*A  AMD MMX extensions */
 XEN_CPUFEATURE(FFXSR,         2*32+25) /*A  FFXSR instruction optimizations */
 XEN_CPUFEATURE(PAGE1GB,       2*32+26) /*H  1Gb large page support */
-XEN_CPUFEATURE(RDTSCP,        2*32+27) /*S  RDTSCP */
+XEN_CPUFEATURE(RDTSCP,        2*32+27) /*A  RDTSCP */
 XEN_CPUFEATURE(LM,            2*32+29) /*A  Long Mode (x86-64) */
 XEN_CPUFEATURE(3DNOWEXT,      2*32+30) /*A  AMD 3DNow! extensions */
 XEN_CPUFEATURE(3DNOW,         2*32+31) /*A  3DNow! */
diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index 4691d4d..0df3e26 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -141,6 +141,11 @@ struct hvm_hw_cpu {
     uint64_t msr_cstar;
     uint64_t msr_syscall_mask;
     uint64_t msr_efer;
+
+    /*
+     * Since 4.11, this value is written as 0 if TSC_MODE_PVRDTSCP is active,
+     * and the value is actually controlled by the TSC logic.
+     */
     uint64_t msr_tsc_aux;
 
     /* guest's idea of what rdtsc() would return */
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index 613b909..601a7a0 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -235,6 +235,9 @@ def crunch_numbers(state):
         # absence of any enabled xstate.
         AVX: [FMA, FMA4, F16C, AVX2, XOP],
 
+        # MSR_TSC_AUX is enumerated by RDTSCP, but RDPID also reads TSC_AUX.
+        RDTSCP: [RDPID],
+
         # CX16 is only encodable in Long Mode.  LAHF_LM indicates that the
         # SAHF/LAHF instructions are reintroduced in Long Mode.  1GB
         # superpages, PCID and PKU are only available in 4 level paging.
-- 
2.1.4

