
Re: [Xen-devel] [PATCH 10/17] PVH xen: introduce vmx_pvh.c and pvh.c



>>> On 23.04.13 at 23:25, Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> wrote:
> The heart of this patch is the VMX exit handler for PVH guests. It is
> nicely isolated in a separate module, as preferred by most of us. A call
> to it is added to vmx_pvh_vmexit_handler().

vmx_vmexit_handler()?

> +static int pvh_grant_table_op(unsigned int cmd, XEN_GUEST_HANDLE(void) uop,
> +                              unsigned int count)
> +{
> +    switch ( cmd )
> +    {
> +        /*
> +         * Only the following Grant Ops have been verified for PVH guest, hence
> +         * we check for them here.
> +         */
> +        case GNTTABOP_map_grant_ref:
> +        case GNTTABOP_unmap_grant_ref:
> +        case GNTTABOP_setup_table:
> +        case GNTTABOP_copy:
> +        case GNTTABOP_query_size:
> +        case GNTTABOP_set_version:
> +            return do_grant_table_op(cmd, uop, count);
> +    }
> +    return -ENOSYS;
> +}

As said before - I object to this sort of whitelisting. A PVH guest
ought to be permitted to issue any hypercall, with the sole exception
of the MMU ones and very few others. So if anything, specific
hypercall functions should be blacklisted.
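
E.g. something along these lines (a sketch only - pvh_hcall_blacklisted
is a made-up name, and which entries belong on the list is of course up
for discussion):

    /* Permit everything except the (few) PV-MMU-only hypercalls. */
    static bool_t pvh_hcall_blacklisted(unsigned int eax)
    {
        switch ( eax )
        {
        case __HYPERVISOR_mmu_update:
        case __HYPERVISOR_update_va_mapping:
            return 1;
        default:
            return 0;
        }
    }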

And of course, even if this is a new file, the indentation of case
statements needs adjustment (to match that in the file it got
cloned from, despite there being a small number of bad examples
in hvm.c).

> +static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls] = {
> +    [__HYPERVISOR_platform_op]     = (hvm_hypercall_t *)do_platform_op,
> +    [__HYPERVISOR_memory_op]       = (hvm_hypercall_t *)do_memory_op,
> +    [__HYPERVISOR_xen_version]     = (hvm_hypercall_t *)do_xen_version,
> +    [__HYPERVISOR_console_io]      = (hvm_hypercall_t *)do_console_io,
> +    [__HYPERVISOR_grant_table_op]  = (hvm_hypercall_t *)pvh_grant_table_op,
> +    [__HYPERVISOR_vcpu_op]         = (hvm_hypercall_t *)pvh_vcpu_op,
> +    [__HYPERVISOR_mmuext_op]       = (hvm_hypercall_t *)do_mmuext_op,
> +    [__HYPERVISOR_xsm_op]          = (hvm_hypercall_t *)do_xsm_op,
> +    [__HYPERVISOR_sched_op]        = (hvm_hypercall_t *)do_sched_op,
> +    [__HYPERVISOR_event_channel_op]= (hvm_hypercall_t *)do_event_channel_op,
> +    [__HYPERVISOR_physdev_op]      = (hvm_hypercall_t *)pvh_physdev_op,
> +    [__HYPERVISOR_hvm_op]          = (hvm_hypercall_t *)pvh_hvm_op,
> +    [__HYPERVISOR_sysctl]          = (hvm_hypercall_t *)do_sysctl,
> +    [__HYPERVISOR_domctl]          = (hvm_hypercall_t *)do_domctl
> +};

Is this table complete? What about multicalls, timer_op, kexec_op,
tmem_op, mca? I again think that copying hypercall_table and
adjusting the entries you explicitly need to override might be
better than creating yet another table that needs attention when
a new hypercall gets added.
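
E.g. (sketch only - pv_hypercall64_table stands in for whatever the
actual PV table symbol is, and the init function is made up):

    static hvm_hypercall_t *pvh_hypercall64_table[NR_hypercalls];

    static void __init pvh_hypercall_table_init(void)
    {
        memcpy(pvh_hypercall64_table, pv_hypercall64_table,
               sizeof(pvh_hypercall64_table));
        /* Override only the entries PVH genuinely needs to differ in. */
        pvh_hypercall64_table[__HYPERVISOR_grant_table_op] =
            (hvm_hypercall_t *)pvh_grant_table_op;
        pvh_hypercall64_table[__HYPERVISOR_vcpu_op] =
            (hvm_hypercall_t *)pvh_vcpu_op;
        pvh_hypercall64_table[__HYPERVISOR_physdev_op] =
            (hvm_hypercall_t *)pvh_physdev_op;
        pvh_hypercall64_table[__HYPERVISOR_hvm_op] =
            (hvm_hypercall_t *)pvh_hvm_op;
    }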

> +/*
> + * Check if hypercall is valid
> + * Returns: 0 if hcall is not valid with eax set to the errno to ret to guest
> + */
> +static bool_t hcall_valid(struct cpu_user_regs *regs)
> +{
> +    struct segment_register sreg;
> +
> +    hvm_get_segment_register(current, x86_seg_ss, &sreg);
> +    if ( unlikely(sreg.attr.fields.dpl == 3) )

    if ( unlikely(sreg.attr.fields.dpl != 0) )

> +    {
> +        regs->eax = -EPERM;
> +        return 0;
> +    }
> +
> +    /* Following HCALLs have not been verified for PVH domUs */
> +    if ( !IS_PRIV(current->domain) &&
> +         (regs->eax == __HYPERVISOR_xsm_op ||
> +          regs->eax == __HYPERVISOR_platform_op ||
> +          regs->eax == __HYPERVISOR_domctl) )       /* for privcmd mmap */
> +    {
> +        regs->eax = -ENOSYS;
> +        return 0;
> +    }

This looks bogus - it shouldn't be the job of the generic handler
to verify permission to use certain hypercalls - the individual
handlers should be doing this quite well. And I suppose you saw
Daniel De Graaf's effort to eliminate IS_PRIV() from the tree?

> +    if ( regs->eax == __HYPERVISOR_sched_op && regs->rdi == SCHEDOP_shutdown )
> +    {
> +        domain_crash_synchronous();
> +        return HVM_HCALL_completed;
> +    }

??? Why would a guest-requested SCHEDOP_shutdown crash the domain?

> +    regs->rax = pvh_hypercall64_table[hnum](regs->rdi, regs->rsi, regs->rdx,
> +                                            regs->r10, regs->r8, regs->r9);

This again lacks an annotation to clarify that 32-bit support is missing.

> @@ -1503,7 +1503,8 @@ void vmx_do_resume(struct vcpu *v)
>  
>          vmx_clear_vmcs(v);
>          vmx_load_vmcs(v);
> -        if ( !is_pvh_vcpu(v) ) {
> +        if ( !is_pvh_vcpu(v) )
> +        {

Surely an unnecessary adjustment, if an earlier patch got it right
from the beginning?

> +    switch ( regs->ecx )
> +    {
> +        case MSR_IA32_MISC_ENABLE:
> +        {
> +            rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> +            msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> +                           MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> +            break;
> +        }
> +        default:
> +        {
> +            /* pvh fixme: see hvm_msr_read_intercept() */
> +            rdmsrl(regs->ecx, msr_content);
> +            break;
> +        }
> +    }

Apart from the recurring indentation problem, please also don't
add braces when you don't really need them (they're warranted only
when declaring case-specific local variables, which isn't the case
here).
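
I.e., following the style of the file this got cloned from:

    switch ( regs->ecx )
    {
    case MSR_IA32_MISC_ENABLE:
        rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
        msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
                       MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
        break;
    default:
        /* pvh fixme: see hvm_msr_read_intercept() */
        rdmsrl(regs->ecx, msr_content);
        break;
    }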

> +    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
> +         guest_kernel_mode(vp, regs) &&  vp->domain->debugger_attached )
> +    {
> +        domain_pause_for_debugger();
> +    } else {
> +        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);
> +    }

In order to avoid two patch iterations here: There's no need for
braces at all in cases like this.
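
I.e.:

    if ( vp->domain->domain_id != 0 &&    /* never pause dom0 */
         guest_kernel_mode(vp, regs) && vp->domain->debugger_attached )
        domain_pause_for_debugger();
    else
        hvm_inject_hw_exception(TRAP_debug, HVM_DELIVER_NO_ERROR_CODE);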

> +static int vmxit_int3(struct cpu_user_regs *regs)
> +{
> +    int ilen = vmx_get_instruction_length();
> +    struct vcpu *vp = current;
> +    struct hvm_trap trap_info = {
> +                        .vector = TRAP_int3,
> +                        .type = X86_EVENTTYPE_SW_EXCEPTION,
> +                        .error_code = HVM_DELIVER_NO_ERROR_CODE,
> +                        .insn_len = ilen

Indentation.

> +    };
> +
> +    regs->eip += ilen;
> +
> +    /* gdbsx or another debugger. Never pause dom0 */
> +    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
> +    {
> +        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
> +        current->arch.gdbsx_vcpu_event = TRAP_int3;
> +        domain_pause_for_debugger();
> +        return 0;
> +    }
> +
> +    regs->eip -= ilen;

Please move the first adjustment into the if() body, making the
second adjustment here unnecessary.
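
I.e.:

    /* gdbsx or another debugger. Never pause dom0 */
    if ( vp->domain->domain_id != 0 && guest_kernel_mode(vp, regs) )
    {
        regs->eip += ilen;
        dbgp1("[%d]PVH: domain pause for debugger\n", smp_processor_id());
        current->arch.gdbsx_vcpu_event = TRAP_int3;
        domain_pause_for_debugger();
        return 0;
    }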

> +static int vmxit_invalid_op(struct cpu_user_regs *regs)
> +{
> +    ulong addr = 0;
> +
> +    if ( guest_kernel_mode(current, regs) ||
> +         emulate_forced_invalid_op(regs, &addr) == 0 )
> +    {
> +        hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
> +        return 0;
> +    }
> +    if ( addr )
> +        hvm_inject_page_fault(0, addr);

This cannot be conditional upon addr being non-zero - zero is a
perfectly valid faulting address.

> +        case TRAP_no_device:
> +            hvm_funcs.fpu_dirty_intercept();  /* vmx_fpu_dirty_intercept */

It ought to be perfectly valid to avoid the indirect call here.
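
I.e. simply (assuming the function gets made reachable from this
VMX-specific file):

    vmx_fpu_dirty_intercept();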

> +static int access_cr4(struct cpu_user_regs *regs, uint acc_typ, uint64_t *regp)
> +{
> +    if ( acc_typ == VMX_CONTROL_REG_ACCESS_TYPE_MOV_TO_CR )
> +    {
> +        u64 old_cr4 = __vmread(GUEST_CR4);
> +
> +        if ( (old_cr4 ^ (*regp)) & (X86_CR4_PSE | X86_CR4_PGE | X86_CR4_PAE) )
> +            vpid_sync_all();
> +
> +        __vmwrite(GUEST_CR4, *regp);

No modification of CR4_READ_SHADOW here?
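
I'd have expected the read shadow to be kept in sync, along the lines
of (sketch - any bits Xen forces on in the real CR4 would need masking
out):

    __vmwrite(GUEST_CR4, *regp);
    __vmwrite(CR4_READ_SHADOW, *regp);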

> +static int vmxit_io_instr(struct cpu_user_regs *regs)
> +{
> +    int curr_lvl;
> +    int requested = (regs->rflags >> 12) & 3;
> +
> +    read_vmcs_selectors(regs);
> +    curr_lvl = regs->cs & 3;

Shouldn't you look at SS's DPL instead?
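
I.e. something like (the DPL being bits 5 and 6 of the access rights):

    curr_lvl = (__vmread(GUEST_SS_AR_BYTES) >> 5) & 3;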

> +static int pvh_ept_handle_violation(unsigned long qualification,
> +                                    paddr_t gpa, struct cpu_user_regs *regs)
> +{
> +    unsigned long gla, gfn = gpa >> PAGE_SHIFT;
> +    p2m_type_t p2mt;
> +    mfn_t mfn = get_gfn_query_unlocked(current->domain, gfn, &p2mt);
> +
> +    gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "

This lacks the _G_ infix (i.e. XENLOG_G_ERR), or else you're risking a
security issue here by allowing a DomU to spam the log.

> +    if ( qualification & EPT_GLA_VALID )
> +    {
> +        gla = __vmread(GUEST_LINEAR_ADDRESS);
> +        gdprintk(XENLOG_ERR, " --- GLA %#lx\n", gla);

Same here, obviously.

> +/*
> + * The cpuid macro clears rcx, so execute cpuid here exactly as the user
> + * process would on a PV guest.
> + */
> +static void pvh_user_cpuid(struct cpu_user_regs *regs)
> +{
> +    unsigned int eax, ebx, ecx, edx;
> +
> +    asm volatile ( "cpuid"
> +              : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
> +              : "0" (regs->eax), "2" (regs->rcx) );
> +
> +    regs->rax = eax; regs->rbx = ebx; regs->rcx = ecx; regs->rdx = edx;

One assignment per line please.
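
I.e.:

    regs->rax = eax;
    regs->rbx = ebx;
    regs->rcx = ecx;
    regs->rdx = edx;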

> +    switch ( (uint16_t)exit_reason )
> +    {
> +        case EXIT_REASON_EXCEPTION_NMI:      /* 0 */
> +            rc = vmxit_exception(regs);
> +            break;

Why would an NMI be blindly reflected to the guest?

> +        case EXIT_REASON_CPUID:              /* 10 */
> +        {
> +            if ( guest_kernel_mode(vp, regs) )
> +                pv_cpuid(regs);
> +            else
> +                pvh_user_cpuid(regs);

What's the reason for this distinction? I would think it's actually a
benefit of PVH to allow also hiding unwanted features from guest
user mode (like HVM, but unlike PV without CPUID faulting).

> +    if ( rc )
> +    {
> +        exit_qualification = __vmread(EXIT_QUALIFICATION);
> +        gdprintk(XENLOG_ERR,
> +                 "PVH: [%d] exit_reas:%d 0x%x qual:%ld 0x%lx cr0:0x%016lx\n",
> +                 ccpu, exit_reason, exit_reason, exit_qualification,
> +                 exit_qualification, __vmread(GUEST_CR0));
> +        gdprintk(XENLOG_ERR, "PVH: RIP:%lx RSP:%lx EFLAGS:%lx CR3:%lx\n",
> +                 regs->rip, regs->rsp, regs->rflags, __vmread(GUEST_CR3));
> +        domain_crash_synchronous();
> +    }

The log levels are too high again here, and lacking the _G_ infix.

> +int vmx_pvh_set_vcpu_info(struct vcpu *v, struct vcpu_guest_context *ctxtp)
> +{
> +    if ( v->vcpu_id == 0 )
> +        return 0;
> +
> +    vmx_vmcs_enter(v);
> +    __vmwrite(GUEST_GDTR_BASE, ctxtp->gdt.pvh.addr);
> +    __vmwrite(GUEST_GDTR_LIMIT, ctxtp->gdt.pvh.limit);
> +    __vmwrite(GUEST_GS_BASE, ctxtp->gs_base_user);
> +
> +    __vmwrite(GUEST_CS_SELECTOR, ctxtp->user_regs.cs);
> +    __vmwrite(GUEST_DS_SELECTOR, ctxtp->user_regs.ds);
> +    __vmwrite(GUEST_ES_SELECTOR, ctxtp->user_regs.es);
> +    __vmwrite(GUEST_SS_SELECTOR, ctxtp->user_regs.ss);
> +    __vmwrite(GUEST_GS_SELECTOR, ctxtp->user_regs.gs);
> +
> +    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
> +        return -EINVAL;

Lacking vmx_vmcs_exit().
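
I.e. presumably:

    if ( vmx_add_guest_msr(MSR_SHADOW_GS_BASE) )
    {
        vmx_vmcs_exit(v);
        return -EINVAL;
    }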

> +int vmx_pvh_read_descriptor(unsigned int sel, const struct vcpu *v,
> +                            const struct cpu_user_regs *regs,
> +                            unsigned long *base, unsigned long *limit,
> +                            unsigned int *ar)
> +{
> +    unsigned int tmp_ar = 0;
> +    ASSERT(v == current);
> +    ASSERT(is_pvh_vcpu(v));
> +
> +    if ( sel == (unsigned int)regs->cs )
> +    {
> +        *base = __vmread(GUEST_CS_BASE);
> +        *limit = __vmread(GUEST_CS_LIMIT);
> +        tmp_ar = __vmread(GUEST_CS_AR_BYTES);
> +    }
> +    else if ( sel == (unsigned int)regs->ds )

This if/else-if sequence can't be right - a selector can be in more
than one selector register (and one of them may have got reloaded
after a GDT/LDT adjustment, while another may not), so you can't
base the descriptor read upon the selector value. The caller will
have to tell you which register it wants the descriptor for, not which
selector.
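
E.g. keyed on enum x86_segment, as the emulator uses (a sketch only -
the exact signature is for illustration):

    int vmx_pvh_read_descriptor(enum x86_segment seg, const struct vcpu *v,
                                unsigned long *base, unsigned long *limit,
                                unsigned int *ar)
    {
        switch ( seg )
        {
        case x86_seg_cs:
            *base = __vmread(GUEST_CS_BASE);
            *limit = __vmread(GUEST_CS_LIMIT);
            *ar = __vmread(GUEST_CS_AR_BYTES);
            break;
        case x86_seg_ss:
            *base = __vmread(GUEST_SS_BASE);
            *limit = __vmread(GUEST_SS_LIMIT);
            *ar = __vmread(GUEST_SS_AR_BYTES);
            break;
        /* ... ds, es, fs, gs alike ... */
        default:
            return -EINVAL;
        }
        return 0;
    }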

> +    if ( tmp_ar & X86_SEG_AR_CS_LM_ACTIVE )
> +    {
> +        *base = 0UL;
> +        *limit = ~0UL;
> +    }

Doing that adjustment here rather than only in the CS case suggests
that this can happen for registers other than CS, in which case this
is wrong specifically for FS and GS.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel