
Re: [Xen-devel] [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c



>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> wrote:
> The heart of this patch is the VMX exit handler for PVH guests. It is
> nicely isolated in a separate module, as preferred by most of us. A call
> to it is added to vmx_pvh_vmexit_handler().

vmx_vmexit_handler()?

> +static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
> +                              unsigned int count)
> +{
> +    switch ( cmd )
> +    {
> +        /*
> +         * Only the following Grant Ops have been verified for PVH guest, hence
> +         * we check for them here.
> +         */
> +        case GNTTABOP_map_grant_ref:
> +        case GNTTABOP_unmap_grant_ref:
> +        case GNTTABOP_setup_table:
> +        case GNTTABOP_copy:
> +        case GNTTABOP_query_size:
> +        case GNTTABOP_set_version:
> +            return do_grant_table_op(cmd, uop, count);
> +    }
> +    return -ENOSYS;
> +}

As said before - I object to this sort of whitelisting. A PVH guest
ought to be permitted to issue any hypercall, with the sole exception
of the MMU ones and very few others. So if anything, specific
hypercall functions should be blacklisted.
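
E.g. something along these lines (a sketch only - pvh_hcall_blacklisted
is a made-up name, and which entries belong on the list is of course up
for discussion):

    /* Permit everything except the (few) PV-MMU-only hypercalls. */
    static bool_t pvh_hcall_blacklisted(unsigned int eax)
    {
        switch ( eax )
        {
        case __HYPERVISOR_mmu_update:
        case __HYPERVISOR_update_va_mapping:
            return 1;
        default:
            return 0;
        }
    }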

And of course, even if this is a new file, the indentation of case
statements needs adjustment (to match that in the file it got
cloned from, despite there being a small number of bad examples
in hvm.c).

> +static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
> +    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
> +    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
> +    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
> +    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
> +    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
> +    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
> +    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
> +    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
> +    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
> +    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
> +    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
> +    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
> +    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
> +    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
> +};

Is this table complete? What about multicalls, timer_op, kexec_op,
tmem_op, mca? I again think that copying hypercall_table and
adjusting the entries you explicitly need to override might be
better than creating yet another table that needs attention when
a new hypercall gets added.
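
E.g. (sketch only - pv_hypercall64_table stands in for whatever the
actual PV table symbol is, and the init function is made up):

    static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls];

    static void __init pvh_hypercall_table_init(void)
    {
        memcpy(pvh_hypercall64_table, pv_hypercall64_table,
               sizeof(pvh_hypercall64_table));
        /* Override only the entries PVH genuinely needs to differ in. */
        pvh_hypercall64_table[__HYPERVISOR_grant_table_op] =
            (hvm_hypercall_t *)pvh_grant_table_op;
        pvh_hypercall64_table[__HYPERVISOR_vcpu_op] =
            (hvm_hypercall_t *)pvh_vcpu_op;
        pvh_hypercall64_table[__HYPERVISOR_physdev_op] =
            (hvm_hypercall_t *)pvh_physdev_op;
        pvh_hypercall64_table[__HYPERVISOR_hvm_op] =
            (hvm_hypercall_t *)pvh_hvm_op;
    }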

> +/*
> + * Check if hypercall is valid
> + * Returns: 0 if hcall is not valid with eax set to the errno to ret to guest
> + */
> +static bool_t hcall_valid(struct cpu_user_regs *regs)
> +{
> +    struct segment_register sreg;
> +
> +    hvm_get_segment_register(current, x86_seg_ss, &sreg);
> +    if ( unlikely(sreg.attr.fields.dpl == 3) )

    if ( unlikely(sreg.attr.fields.dpl != 0) )

> +    {
> +        regs->eax = -EPERM;
> +        return 0;
> +    }
> +
> +    /* Following HCALLs have not been verified for PVH domUs */
> +    if ( !IS_PRIV(current->domain) &&
> +         (regs->eax == __HYPERVISOR_xsm_op ||
> +          regs->eax == __HYPERVISOR_platform_op ||
> +          regs->eax == __HYPERVISOR_domctl) )       /* for privcmd mmap */
> +    {
> +        regs->eax = -ENOSYS;
> +        return 0;
> +    }

This looks bogus - it shouldn't be the job of the generic handler
to verify permission to use certain hypercalls - the individual
handlers should be doing this quite well. And I suppose you saw
Daniel De Graaf's effort to eliminate IS_PRIV() from the tree?

> +    if ( regs->eax == __HYPERVISOR_sched_op && regs->rdi == SCHEDOP_shutdown )
> +    {
> +        domain_crash_synchronous();
> +        return HVM_HCALL_completed;
> +    }

??? Why would a guest-requested SCHEDOP_shutdown crash the domain?

> +    regs->rax = pvh_hypercall64_table[hnum](regs->rdi, regs->rsi, regs->rdx,
> +                                            regs->r10, regs->r8, regs->r9);

This again lacks an annotation to clarify that 32-bit support is missing.

> @@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
>  
>          vmx_clear_vmcs(v);
>          vmx_load_vmcs(v);
> -        if ( !is_pvh_vcpu(v) ) {
> +        if ( !is_pvh_vcpu(v) )
> +        {

Surely an unnecessary adjustment, if an earlier patch got it right
from the beginning?

> +    switch ( regs->ecx )
> +    {
> +        case MSR_IA32_MISC_ENABLE:
> +        {
> +            rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> +            msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> +                           MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> +            break;
> +        }
> +        default:
> +        {
> +            /* pvh fixme: see hvm_msr_read_intercept() */
> +            rdmsrl(regs->ecx, msr_content);
> +            break;
> +        }
> +    }

Apart from the recurring indentation problem, please also don't
add braces when you don't really need them (they're warranted only
when declaring case-specific local variables, which isn't the case
here).
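
I.e., following the style of the file this got cloned from:

    switch ( regs->ecx )
    {
    case MSR_IA32_MISC_ENABLE:
        rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
        msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
                       MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
        break;
    default:
        /* pvh fixme: see hvm_msr_read_intercept() */
        rdmsrl(regs->ecx, msr_content);
        break;
    }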

> +    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
> +         guest_kernel_mode(vp, regs) &&  vp->domain->debugger_attached )
> +    {
> +        domain_pause_for_debugger();
> +    } else {
> +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
> +    }

In order to avoid two patch iterations here: There's no need for
braces at all in cases like this.
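
I.e.:

    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
         guest_kernel_mode(vp, regs) && vp->domain->debugger_attached )
        domain_pause_for_debugger();
    else
        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);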

> +static int vmxit_int3(struct cpu_user_regs *regs)
> +{
> +    int ilen = vmx_get_instruction_length();
> +    struct vcpu *vp = current;
> +    struct hvm_trap trap_info = {
> +                        .vector = TRAP_int3,
> +                        .type = X86_EVENTTYPE_SW_EXCEPTION,
> +                        .error_code = HVM_DELIVER_NO_ERROR_CODE,
> +                        .insn_len = ilen

Indentation.

> +    };
> +
> +    regs->eip += ilen;
> +
> +    /* gdbsx or another debugger. Never pause dom0 */
> +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
> +    {
> +        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
> +        current->arch.gdbsx_vcpu_event = TRAP_int3;
> +        domain_pause_for_debugger();
> +        return 0;
> +    }
> +
> +    regs->eip -= ilen;

Please move the first adjustment into the if() body, making the
second adjustment here unnecessary.
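
I.e.:

    /* gdbsx or another debugger. Never pause dom0 */
    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
    {
        regs->eip += ilen;
        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
        current->arch.gdbsx_vcpu_event = TRAP_int3;
        domain_pause_for_debugger();
        return 0;
    }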

> +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> +{
> +    ulong addr = 0;
> +
> +    if ( guest_kernel_mode(current, regs) ||
> +         emulate_forced_invalid_op(regs, &addr) == 0 )
> +    {
> +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> +        return 0;
> +    }
> +    if ( addr )
> +        hvm_inject_page_fault(0, addr);

This cannot be conditional upon addr being non-zero - zero is a
perfectly valid faulting address.

> +        case TRAP_no_device:
> +            hvm_funcs.fpu_dirty_intercept();  /* vmx_fpu_dirty_intercept */

It ought to be perfectly valid to avoid the indirect call here.
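
I.e. simply (assuming the function gets made reachable from this
VMX-specific file):

    vmx_fpu_dirty_intercept();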

> +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        u64 old_cr4 = __vmread(GUEST_CR4);
> +
> +        if ( (old_cr4 ^ (*regp)) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> +            vpid_sync_all();
> +
> +        __vmwrite(GUEST_CR4, *regp);

No modification of CR4_READ_SHADOW here?
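
I'd have expected the read shadow to be kept in sync, along the lines
of (sketch - any bits Xen forces on in the real CR4 would need masking
out):

    __vmwrite(GUEST_CR4, *regp);
    __vmwrite(CR4_READ_SHADOW, *regp);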

> +static int vmxit_io_instr(struct cpu_user_regs *regs)
> +{
> +    int curr_lvl;
> +    int requested = (regs->rflags >> 12) & 3;
> +
> +    read_vmcs_selectors(regs);
> +    curr_lvl = regs->cs & 3;

Shouldn't you look at SS's DPL instead?
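
I.e. something like (the DPL being bits 5 and 6 of the access rights):

    curr_lvl = (__vmread(GUEST_SS_AR_BYTES) >> 5) & 3;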

> +static int pvh_ept_handle_violation(unsigned long qualification,
> +                                    paddr_t gpa, struct cpu_user_regs *regs)
> +{
> +    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
> +    p2m_type_t p2mt;
> +    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
> +
> +    gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "

This lacks the _G_ infix (i.e. XENLOG_G_ERR), or else you're risking a
security issue here by allowing a DomU to spam the log.

> +    if ( qualification & EPT_GLA_VALID )
> +    {
> +        gla = __vmread(GUEST_LINEAR_ADDRESS);
> +        gdprintk(XENLOG_ERR, " --- GLA %#lx\n", gla);

Same here, obviously.

> +/*
> + * The cpuid macro clears rcx, so execute cpuid here exactly as the user
> + * process would on a PV guest.
> + */
> +static void pvh_user_cpuid(struct cpu_user_regs *regs)
> +{
> +    unsigned int eax, ebx, ecx, edx;
> +
> +    asm volatile ( "cpuid"
> +              : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
> +              : "0" (regs->eax), "2" (regs->rcx) );
> +
> +    regs->rax = eax; regs->rbx = ebx; regs->rcx = ecx; regs->rdx = edx;

One assignment per line please.
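
I.e.:

    regs->rax = eax;
    regs->rbx = ebx;
    regs->rcx = ecx;
    regs->rdx = edx;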

> +    switch ( (uint16_t)exit_reason )
> +    {
> +        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
> +            rc = vmxit_exception(regs);
> +            break;

Why would an NMI be blindly reflected to the guest?

> +        case EXIT_REASON_CPUID:              /* 10 */
> +        {
> +            if ( guest_kernel_mode(vp, regs) )
> +                pv_cpuid(regs);
> +            else
> +                pvh_user_cpuid(regs);

What's the reason for this distinction? I would think it's actually a
benefit of PVH to allow also hiding unwanted features from guest
user mode (like HVM, but unlike PV without CPUID faulting).

> +    if ( rc )
> +    {
> +        exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        gdprintk(XENLOG_ERR,
> +                 "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n",
> +                 ccpu, exit_reason, exit_reason, exit_qualification,
> +                 exit_qualification, __vmread(GUEST_CR0));
> +        gdprintk(XENLOG_ERR, "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n",
> +                 regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3));
> +        domain_crash_synchronous();
> +    }

The log levels are too high again here, and lacking the _G_ infix.

> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> +
> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +        return -EINVAL;

Lacking vmx_vmcs_exit().
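
I.e. presumably:

    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
    {
        vmx_vmcs_exit(v);
        return -EINVAL;
    }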

> +int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
> +                            const struct cpu_user_regs *regs,
> +                            unsigned long *base, unsigned long *limit,
> +                            unsigned int *ar)
> +{
> +    unsigned int tmp_ar = 0;
> +    ASSERT(v == current);
> +    ASSERT(is_pvh_vcpu(v));
> +
> +    if ( sel == (unsigned int)regs->cs )
> +    {
> +        *base = __vmread(GUEST_CS_BASE);
> +        *limit = __vmread(GUEST_CS_LIMIT);
> +        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
> +    }
> +    else if ( sel == (unsigned int)regs->ds )

This if/else-if sequence can't be right - a selector can be in more
than one selector register (and one of them may have got reloaded
after a GDT/LDT adjustment, while another may not), so you can't
base the descriptor read upon the selector value. The caller will
have to tell you which register it wants the descriptor for, not which
selector.
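
E.g. keyed on enum x86_segment, as the emulator uses (a sketch only -
the exact signature is for illustration):

    int vmx_pvh_read_descriptor(enum x86_segment seg, const struct vcpu *v,
                                unsigned long *base, unsigned long *limit,
                                unsigned int *ar)
    {
        switch ( seg )
        {
        case x86_seg_cs:
            *base = __vmread(GUEST_CS_BASE);
            *limit = __vmread(GUEST_CS_LIMIT);
            *ar = __vmread(GUEST_CS_AR_BYTES);
            break;
        case x86_seg_ss:
            *base = __vmread(GUEST_SS_BASE);
            *limit = __vmread(GUEST_SS_LIMIT);
            *ar = __vmread(GUEST_SS_AR_BYTES);
            break;
        /* ... ds, es, fs, gs alike ... */
        default:
            return -EINVAL;
        }
        return 0;
    }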

> +    if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
> +    {
> +        *base = 0UL;
> +        *limit = ~0UL;
> +    }

Doing that adjustment here rather than only in the CS case suggests
that this can happen for registers other than CS, in which case this
is wrong specifically for FS and GS.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel