[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 9/18 V2]: PVH xen: create PVH vmcs, and initialization



On Mon, 18 Mar 2013 11:28:43 -0400
Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:

> On Fri, Mar 15, 2013 at 05:39:25PM -0700, Mukesh Rathor wrote:
> > This patch mainly contains code to create a VMCS for PVH guest, and
> > HVM specific vcpu/domain creation code.
> > 
> > Changes in V2:
> >   - Avoid call to hvm_do_resume() at call site rather than return
> > in it.
> >   - Return for PVH vmx_do_resume prior to intel debugger stuff.
> > 
> > Signed-off-by: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>
> > ---
> >  xen/arch/x86/hvm/hvm.c      |   90 ++++++++++++++-
> >  xen/arch/x86/hvm/vmx/vmcs.c |  266
> > ++++++++++++++++++++++++++++++++++++++++++-
> > xen/arch/x86/hvm/vmx/vmx.c  |   34 ++++++ 3 files changed, 383
> > insertions(+), 7 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index ea7adf6..18889ad 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -510,6 +510,29 @@ static int hvm_print_line(
> >      return X86EMUL_OKAY;
> >  }
> >  
> > +static int hvm_pvh_dom_initialise(struct domain *d)
> > +{
> > +    int rc;
> > +
> > +    if (!d->arch.hvm_domain.hap_enabled)
> > +        return -EINVAL;
> > +
> > +    spin_lock_init(&d->arch.hvm_domain.irq_lock);
> > +    hvm_init_guest_time(d);
> > +
> > +    hvm_init_cacheattr_region_list(d);
> > +
> > +    if ( (rc=paging_enable(d,
> > PG_refcounts|PG_translate|PG_external)) != 0 )
> > +        goto fail1;        <=================================== GOTO
> > +
> > +    if ( (rc = hvm_funcs.domain_initialise(d)) == 0 )
> > +        return 0;
> > +
> > +fail1:
> 
> I don't think you need the label here? You are not doing an goto.

Right above.

> > long))hvm_assert_evtchn_irq,
> > +                          (unsigned long)v );
> > +
> > +    v->arch.hvm_vcpu.hcall_64bit = 1;
> > +    v->arch.hvm_vcpu.hvm_pvh.vcpu_info_mfn = INVALID_MFN;
> > +    v->arch.user_regs.eflags = 2;
> 
> So that sets the Reserved flag. Could you include a comment
> explaining why.. Ah, is it b/c we later on bit-shift it and use it to
> figure out whether IOPL needs to be virtualized in
> arch_set_info_guest? Or is it just b/c this function is based off
> hvm_vcpu_initialise? If so, since you are being called by it, can you
> skip it?

That resvd bit is required to be set for bootstrap. Set in other places
also, like arch_set_info_guest():

v->arch.user_regs.eflags |= 2;

> 
> > +    v->arch.hvm_vcpu.inject_trap.vector = -1;
> > +
> > +    if ( (rc=hvm_vcpu_cacheattr_init(v)) != 0 ) {
> 
> The syntax here is off.

Hmm... space surrounding "=" in rc=hvm* ?

> >  int hvm_vcpu_initialise(struct vcpu *v)
> >  {
> >      int rc;
> >      struct domain *d = v->domain;
> > -    domid_t dm_domid =
> > d->arch.hvm_domain.params[HVM_PARAM_DM_DOMAIN];
> > +    domid_t dm_domid; 
> 
> Not sure I follow, why the move of it further down?

params is not defined/allocated for PVH.

> > +    /* VMCS controls. */
> > +    vmx_pin_based_exec_control &= ~PIN_BASED_VIRTUAL_NMIS;
> > +    __vmwrite(PIN_BASED_VM_EXEC_CONTROL,
> > vmx_pin_based_exec_control); +
> > +    v->arch.hvm_vmx.exec_control = vmx_cpu_based_exec_control;
> > +
> > +    /* if rdtsc exiting is turned on and it goes thru
> > emulate_privileged_op,
> > +     * then pv_vcpu.ctrlreg must be added to pvh struct */
> 
> That would be the 'timer_mode' syntax in the guest config right?
> Perhaps then a check at the top of the function to see which
> timer_mode is used and exit out with -ENOSYS?

The vtsc setting. We set it to 0 for PVH guests.

> 
> > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
> > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> > +
> > +    v->arch.hvm_vmx.exec_control &= ~(CPU_BASED_INVLPG_EXITING |
> > +                                      CPU_BASED_CR3_LOAD_EXITING |
> > +                                      CPU_BASED_CR3_STORE_EXITING);
> > +    v->arch.hvm_vmx.exec_control |=
> > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > +    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
> 
> Is that b/c the PV code ends up making the SCHED_yield_op hypercall
> and we don't need the monitor/mwait capability? If so, could you add
> that comment in please?

No, MTF is debugging feature used mostly for single step.

> > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_CS,
> > msr_type);
> > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_ESP,
> > msr_type);
> > +        vmx_disable_intercept_for_msr(v, MSR_IA32_SYSENTER_EIP,
> > msr_type);
> > +        vmx_disable_intercept_for_msr(v, MSR_SHADOW_GS_BASE,
> > msr_type);
> 
> So this looks like the one vmcs.c except that one has this extra:
> 
>  895         if ( cpu_has_vmx_pat && paging_mode_hap(d) )
>  896             vmx_disable_intercept_for_msr(v, MSR_IA32_CR_PAT,
> MSR_TYPE_R | MSR_TYPE_W); 897     }
> 
> Did you miss that?

I'll add it. I guess default must be disabled.

> > +
> > +        /* pure hvm doesn't do this. safe? see:
> > long_mode_do_msr_write() */ +#if 0
> > +        vmx_disable_intercept_for_msr(v, MSR_STAR);
> > +        vmx_disable_intercept_for_msr(v, MSR_LSTAR);
> > +        vmx_disable_intercept_for_msr(v, MSR_CSTAR);
> > +        vmx_disable_intercept_for_msr(v, MSR_SYSCALL_MASK);
> > +#endif
> 
> I would just provide a comment saying:
> 
>       /*
>       * long_mode_do_msr_write() takes care of
> MSR_[STAR|LSTAR|CSTAR|SYSCALL_MASK] */

Good Idea. I left the "#if 0" there for suggestion.


> > +    } else {
> > +        printk("PVH: CPU does NOT have msr bitmap\n");
> 
> Perhaps:
> 
>       printk(XENLOG_G_ERR "%s: ..\n", __func__);
> ?
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( !cpu_has_vmx_vpid ) {
> > +        printk("PVH: At present VPID support is required to run
> > PVH\n");
> 
> Should you de-allocate msr_bitmap at this point?
> 
> Or perhaps move this check (and the one below) to the start of the
> function? So you have:
> 
>       if ( !cpu_has_vmx_vpid )
>               gdprintk ("%s: VPID required for PVH mode.\n",
> __func__);
> 
>       if ( !cpu_has_vmx_secondary_exec_control )
>               .. bla bla
> 
> 
> > +        return -EINVAL;
> > +    }
> > +
> > +    v->arch.hvm_vmx.secondary_exec_control =
> > vmx_secondary_exec_control; +
> > +    if ( cpu_has_vmx_secondary_exec_control ) {
> > +        v->arch.hvm_vmx.secondary_exec_control &= ~0x4FF; /* turn
> > off all */
> > +        v->arch.hvm_vmx.secondary_exec_control |= 
> > +
> > SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> > +        v->arch.hvm_vmx.secondary_exec_control |=
> > SECONDARY_EXEC_ENABLE_VPID; +
> > +        v->arch.hvm_vmx.secondary_exec_control |=
> > SECONDARY_EXEC_ENABLE_EPT;
> > +        __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> > +                  v->arch.hvm_vmx.secondary_exec_control);
> > +    } else {
> > +        printk("PVH: NO Secondary Exec control\n");
> > +        return -EINVAL;
> 
> Ditto - should you de-allocate msr_bitmap ? Or if you are going to
> move the check for cpu_has_vmx_secondary_exec_control, then there is
> no need for this if (.. ) else ..
> 
> 
> > +    }
> > +
> > +    __vmwrite(VM_EXIT_CONTROLS, vmexit_ctl);
> > +
> > +    #define VM_ENTRY_LOAD_DEBUG_CTLS 0x4
> > +    #define VM_ENTRY_LOAD_EFER 0x8000
> > +    vmentry_ctl &= ~VM_ENTRY_LOAD_DEBUG_CTLS;
> > +    vmentry_ctl &= ~VM_ENTRY_LOAD_EFER;
> > +    vmentry_ctl &= ~VM_ENTRY_SMM;
> > +    vmentry_ctl &= ~VM_ENTRY_DEACT_DUAL_MONITOR;
> > +    vmentry_ctl |= VM_ENTRY_IA32E_MODE;
> > +    __vmwrite(VM_ENTRY_CONTROLS, vmentry_ctl);
> > +
> 
> From here on, it looks mostly the same as construct_vmcs right?
> 
> Perhaps you can add a comment saying so - so when a cleanup effort
> is done later on - these can be candidates for it?
> 
> > +    /* MSR intercepts. */
> > +    __vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
> > +    __vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
> > +    __vmwrite(VM_EXIT_MSR_STORE_COUNT, 0);
> > +
> > +    /* Host data selectors. */
> > +    __vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
> > +    __vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
> > +    __vmwrite(HOST_ES_SELECTOR, __HYPERVISOR_DS);
> > +    __vmwrite(HOST_FS_SELECTOR, 0);
> > +    __vmwrite(HOST_GS_SELECTOR, 0);
> > +    __vmwrite(HOST_FS_BASE, 0);
> > +    __vmwrite(HOST_GS_BASE, 0);
> > +
> > +    vmx_set_host_env(v);
> > +
> > +    /* Host control registers. */
> > +    v->arch.hvm_vmx.host_cr0 = read_cr0() | X86_CR0_TS;
> > +    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
> > +    __vmwrite(HOST_CR4, mmu_cr4_features|(cpu_has_xsave ?
> > X86_CR4_OSXSAVE : 0));
> 
> That formatting looks odd. 

Copied from hvm code. whats wrong?

> > +    /* Set default guest context values here. Some of these are
> > then overwritten
> > +     * in vmx_pvh_set_vcpu_info() by guest itself during vcpu
> > bringup */
> > +    __vmwrite(GUEST_CS_BASE, 0);
> > +    __vmwrite(GUEST_CS_LIMIT, ~0u);
> > +    __vmwrite(GUEST_CS_AR_BYTES, 0xa09b); /* CS.L == 1 */
> > +    __vmwrite(GUEST_CS_SELECTOR, 0x10);
> 
> 0x10. Could you use a #define for it? Somehow I thought it would
> be running in FLAT_KERNEL_CS but that would be odd. And of course
> since are booting in the Linux kernel without the PV MMU we would
> be using its native segments. So this would correspond to
> GDT_ENTRY_KERNEL_CS right? Might want to mention that
> so if the Linux kernel alters its GDT page we don't blow up?
> 
> Thought I guess it does not matter - this is just the initial
> bootstrap segments. Presumarily the load_gdt in the Linux kernel
> later on resets it to whatever the "new" GDT is.

Correct:
#define __KERNEL_CS     (GDT_ENTRY_KERNEL_CS*8)

And load_gdt loads a new GDT natively.


> > +    __vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0);
> > +    __vmwrite(GUEST_DR7, 0);
> > +    __vmwrite(VMCS_LINK_POINTER, ~0UL);
> > +
> > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
> > +    __vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);
> 
> Weird. In the vmcs.c file these are somewhat higher in the code.

Yes. I just didn't copy the existing function, but created PVH function
to make it easier for PVH.

> > +
> > +    v->arch.hvm_vmx.exception_bitmap = 
> > +                                   HVM_TRAP_MASK     | (1 <<
> > TRAP_debug) | 
> > +                                   (1U << TRAP_int3) | (1U <<
> > TRAP_no_device);
> 
> Odd syntax.

Similar to existing hvm code, whats wrong?


> > +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> > +
> > +    __vmwrite(TSC_OFFSET, 0);
> 
> Hm, so you did earlier:
> 
> v->arch.hvm_vmx.exec_control &= ~CPU_BASED_USE_TSC_OFFSETING;
> 
> so is this neccessary? Or is just that you want it to be set
> to default baseline state?

Not necessary, doesn't hurt either. I can remove it.

> > +
> > +    /* Set WP bit so rdonly pages are not written from CPL 0 */
> > +    tmpval = X86_CR0_PG | X86_CR0_NE | X86_CR0_PE | X86_CR0_WP;
> > +    __vmwrite(GUEST_CR0, tmpval);
> > +    __vmwrite(CR0_READ_SHADOW, tmpval);
> > +    v->arch.hvm_vcpu.hw_cr[0] = v->arch.hvm_vcpu.guest_cr[0] =
> > tmpval; +
> > +    tmpval = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> > +    required = X86_CR4_PAE | X86_CR4_VMXE | X86_CR4_OSFXSR;
> > +    if ( (tmpval & required) != required )
> > +    {
> > +        printk("PVH: required CR4 features not available:%lx\n",
> > required);
> > +        return -EINVAL;
> 
> You might want to move that to the top of the code. Or if you want
> it here, then at least free the msr_bitmap

I think I'll just move all the checks top of the code.

> >  {
> >      struct domain *d = v->domain;
> > @@ -825,6 +1072,12 @@ static int construct_vmcs(struct vcpu *v)
> >  
> >      vmx_vmcs_enter(v);
> >  
> > +    if ( is_pvh_vcpu(v) ) {
> > +        int rc = pvh_construct_vmcs(v);
> > +        vmx_vmcs_exit(v);
> 
> Do you need to call paging_update_paging_modes as construct_vmcs()
> does?

Nop. We don't need to as the arch_set_info_guest() does it for PVH.


Thanks Konrad.
Mukesh


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.