Re: [Xen-devel] [PATCH for-4.5 v8 4/7] xen: Add vmware_port support
On 01/16/15 05:09, Jan Beulich wrote:
>>>> On 03.10.14 at 00:40, <dslutz@xxxxxxxxxxx> wrote:
>> This is a new domain_create() flag, DOMCRF_vmware_port. It is
>> passed to domctl as XEN_DOMCTL_CDF_vmware_port.
>
> Can you explain why a HVM param isn't suitable here?

The issue is that you need this flag during construct_vmcb() and
construct_vmcs(). While Intel has vmx_update_exception_bitmap(), AMD
does not. So when HVM params are set up and/or changed there is
currently no way to adjust AMD's exception bitmap. So this is the
simpler way.

>> This is both more complete support than is currently provided by
>> QEMU and/or KVM, and less. The missing part requires QEMU changes
>> and has been left out until the QEMU patches are accepted upstream.
>
> I vaguely recall the question having been asked before, but I can't
> find it or the answer to it: If qemu has support for this, why can't
> you build on that rather than adding everything in the hypervisor?

The v10 version of this patch set (which is waiting for an adjusted
QEMU (the released 2.2.0 is one)) does use QEMU for more VMware port
support. The issues are:

1) QEMU needs access to parts of the CPU registers to handle the
   VMware port.
2) You need to allow ring 3 access to this one I/O port.
3) There is more state in Xen that would also need to be sent to QEMU
   if all support were moved to QEMU.

>> For AMD (svm) the max instruction length of 15 is hard coded. This
>> is because __get_instruction_length_from_list() has issues in that,
>> when called from the #GP handler, NRIP is not available, or NRIP may
>> not be available at all on particular HW, leading to the need to
>> read the instruction twice --- once in
>> __get_instruction_length_from_list() and then again in
>> vmport_gp_check(). Which is bad because memory may change between
>> the reads.
>
> I don't get the connection between the first sentence (which just
> states an architectural fact) and the rest of this paragraph.

This is an AMD thing. Boris Ostrovsky suggested using the AMD-only
routine __get_instruction_length_from_list(), which I did in v5, and
found out that many (maybe all) AMD chips would still do the hand
decoding. Now I could make this patch set even bigger by adjusting or
adding AMD-only routines that return the decode of the current
instruction. This would also split vmport_gp_check() into AMD and
Intel versions. So I went with the stranger way of saying that the
current instruction length is 15 for AMD in all cases. Then in
vmport_gp_check() I only fetch bytes to the end of the page (which
should be almost the same overhead between 1 and 15 when all 15 are in
the same page). I.e. the wasted fetching on AMD is not a problem. I
will try and add some of this to the commit message.

>> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>>      return;
>>  }
>>
>> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
>> +                                    struct vcpu *v)
>> +{
>> +    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>> +    /*
>> +     * Just use 15 for the instruction length; vmport_gp_check will
>> +     * adjust it. This is because
>> +     * __get_instruction_length_from_list() has issues, and may
>> +     * require a double read of the instruction bytes. At some
>> +     * point a new routine could be added that is based on the code
>> +     * in vmport_gp_check with extensions to make it more general.
>> +     * Since that routine is the only user of this code this can be
>> +     * done later.
>> +     */
>> +    unsigned long inst_len = 15;
>
> Surely this can be unsigned int?

The code is smaller this way.
In vmx_vmexit_gp_intercept():

    unsigned long inst_len;
...
    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
...
    rc = vmport_gp_check(regs, v, &inst_len, inst_addr, ...

So changing the argument to vmport_gp_check() to "unsigned int" would
add code there.

> And the value be MAX_INST_LEN?

Yes, will change.

>> --- /dev/null
>> +++ b/xen/arch/x86/hvm/vmware/vmport.c
>> @@ -0,0 +1,262 @@
>> +/*
>> + * HVM VMPORT emulation
>> + *
>> + * Copyright (C) 2012 Verizon Corporation
>> + *
>> + * This file is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License Version 2 (GPLv2)
>> + * as published by the Free Software Foundation.
>> + *
>> + * This file is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details. <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include <xen/config.h>
>
> No need for this.

Dropped.

>> +#define MAX_INST_LEN 15
>
> Please move SVM's identical definition into e.g. asm-x86/processor.h
> or even x86_emulate/x86_emulate.h (so it can also be used in
> x86_emulate/x86_emulate.c), and avoid adding another instance
> here.

Sure.

>> +#ifndef NDEBUG
>> +unsigned int opt_vmport_debug __read_mostly;
>> +integer_param("vmport_debug", opt_vmport_debug);
>> +#endif
>
> If this was used anywhere, the variable ought to be static. But
> since it seems unused, it ought to be dropped.

Done.

>> +/* More VMware defines */
>> +
>> +#define VMWARE_GUI_AUTO_GRAB              0x001
>> +#define VMWARE_GUI_AUTO_UNGRAB            0x002
>> +#define VMWARE_GUI_AUTO_SCROLL            0x004
>> +#define VMWARE_GUI_AUTO_RAISE             0x008
>> +#define VMWARE_GUI_EXCHANGE_SELECTIONS    0x010
>> +#define VMWARE_GUI_WARP_CURSOR_ON_UNGRAB  0x020
>> +#define VMWARE_GUI_FULL_SCREEN            0x040
>> +
>> +#define VMWARE_GUI_TO_FULL_SCREEN         0x080
>> +#define VMWARE_GUI_TO_WINDOW              0x100
>> +
>> +#define VMWARE_GUI_AUTO_RAISE_DISABLED    0x200
>> +
>> +#define VMWARE_GUI_SYNC_TIME              0x400
>
> What do all of the above mean? Without any explanation it is
> impossible to understand why reporting any of them set below
> is correct/acceptable.

Will move to QEMU, so dropping them.

>> +int vmport_ioport(int dir, uint32_t port, uint32_t bytes, uint32_t *val)
>> +{
>> +    struct cpu_user_regs *regs = guest_cpu_user_regs();
>> +    uint16_t cmd = regs->rcx;
>
> As you already have most other variables needed only inside the if()
> below declared in that scope, please be consistent with this one.
> Albeit the value of this variable is questionable anyway - it's being
> used exactly once.

Ok, dropping this variable (it was used more in older versions).

>> +    int rc = X86EMUL_OKAY;
>> +
>> +    if ( regs->_eax == BDOOR_MAGIC )
>
> With this, is handling other than 32-bit in/out really meaningful/
> correct?

Yes. Harder to use, but since VMware allows it, I allow it also.

>> +        case BDOOR_CMD_GETHWVERSION:
>> +            /* vmware_hw */
>> +            regs->_eax = 0;
>> +            if ( is_hvm_vcpu(curr) )
>
> Since you can't get here for PV, I can't see what you need this
> conditional for.

Since I was not 100% sure, I was being safe. Would converting this to
be a "debug=y" check be ok?

>> +            {
>> +                struct hvm_domain *hd = &d->arch.hvm_domain;
>> +
>> +                regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
>> +            }
>> +            if ( !regs->_eax )
>> +                regs->_eax = 4;  /* Act like version 4 */
>
> Why version 4?
That is the first version where VMware was more consistent in its
handling of the "VMware hardware version". Any value between 1 and 6
would be ok. This should only happen in strange configs.

>> +        case BDOOR_CMD_GETTIME:
>> +            value = get_localtime_us(d) - d->time_offset_seconds *
>> +                    1000000ULL;
>> +            /* hostUsecs */
>> +            regs->_ebx = value % 1000000UL;
>> +            /* hostSecs */
>> +            regs->_eax = value / 1000000ULL;
>> +            /* maxTimeLag */
>> +            regs->_ecx = 1000000;
>> +            /* offset to GMT in minutes */
>> +            regs->_edx = d->time_offset_seconds / 60;
>> +            break;
>> +        case BDOOR_CMD_GETTIMEFULL:
>> +            value = get_localtime_us(d) - d->time_offset_seconds *
>> +                    1000000ULL;
>> +            /* ... */
>
> ???

There used to be a lot of setting of registers to BDOOR_MAGIC, but since

>> +            regs->_eax = BDOOR_MAGIC;
>
> regs->_eax already has this value.

You are right. Will drop this setting and comment.

>> +            /* hostUsecs */
>> +            regs->_ebx = value / 1000000ULL;
>> +            /* maxTimeLag */
>> +            regs->_ecx = 1000000;
>> +            break;
>
> Perhaps this should share code with BDOOR_CMD_GETTIME; I have
> to admit though that I can't make any sense of why the latter one
> has a FULL suffix when it returns _less_ information.

Sharing of code is not simple. I did not pick the names, VMware did.
Bug found: the FULL variant returns data in si & dx; will fix. That
also makes sharing more complex than not sharing.

>> +    if ( dir == IOREQ_READ )
>> +    {
>> +        switch ( bytes )
>> +        {
>> +        case 1:
>> +            regs->rax = (saved_rax & 0xffffff00) | (regs->rax & 0xff);
>> +            break;
>> +        case 2:
>> +            regs->rax = (saved_rax & 0xffff0000) | (regs->rax & 0xffff);
>> +            break;
>
> Both of these zero the high 32 bits when they shouldn't. But also see
> below.

You are right, will fix.

>> +        case 4:
>> +            regs->rax = regs->_eax;
>> +            break;
>> +        }
>> +        *val = regs->rax;
>> +    }
>> +    else
>> +        regs->rax = saved_rax;
>
> This is all rather dubious - instead of clobbering reg->rax within the
> earlier switch, write the value to a local variable and then merge it
> here.

Ok, will adjust that way (roughly as in the merge sketch further
below).

> But as much as above, the question on what to do with
> operand size being other than 32-bit - in particular for the cases
> where other registers get modified - is relevant here too.

I tested what happens on VMware and it "does what I have here" on
sizes other than 32-bit. The other registers still change.

> Even more
> so that the "port" function parameter isn't even checked (and hence
> you'd also handle e.g. "in %dx,%al" with %dx being BDOOR_PORT
> + 1, 2, or 3 afaict).

Yes, I need to check and reject if it is not exactly this port.

>> +int vmport_gp_check(struct cpu_user_regs *regs, struct vcpu *v,
>> +                    unsigned long *inst_len, unsigned long inst_addr,
>> +                    unsigned long ei1, unsigned long ei2)
>> +{
>> +    if ( !v->domain->arch.hvm_domain.is_vmware_port_enabled )
>> +        return X86EMUL_VMPORT_NOT_ENABLED;
>> +
>> +    if ( *inst_len && *inst_len <= MAX_INST_LEN &&
>> +         (regs->rdx & 0xffff) == BDOOR_PORT && ei1 == 0 && ei2 == 0 &&
>
> regs->_edx may yield slightly better code; I wonder whether we
> shouldn't extend __DECL_REG() to also give us ->dx (and maybe
> even ->dl, ->dh, etc).

I will switch to _edx for now. I see no need to extend __DECL_REG() in
this patch set. Since most of the time QEMU will get involved and be
where most of the time is spent, the overhead of the "and" is not
that much.
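For reference, this is roughly how a guest (open-vm-tools style)
invokes the backdoor and ends up at the BDOOR_PORT/BDOOR_MAGIC checks
above. It is only an illustrative sketch, not code from this patch,
and the BDOOR_CMD_GETHWVERSION value is assumed from the open-vm-tools
headers:

    /* Guest-side sketch (not part of this patch). */
    #include <stdint.h>

    #define BDOOR_MAGIC            0x564D5868u  /* "VMXh" */
    #define BDOOR_PORT             0x5658
    #define BDOOR_CMD_GETHWVERSION 17           /* assumed value */

    static uint32_t vmw_get_hw_version(void)
    {
        uint32_t eax, ebx, ecx, edx;

        /* inl (%dx),%eax is the canonical 32-bit backdoor call. */
        asm volatile ( "inl (%%dx), %%eax"
                       : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                       : "0" (BDOOR_MAGIC), "1" (0),
                         "2" (BDOOR_CMD_GETHWVERSION), "3" (BDOOR_PORT)
                       : "memory" );
        return eax;  /* the emulated "VMware hardware version" */
    }

Run from ring 3 this only works when the port is really emulated;
otherwise the in faults with #GP, which is exactly the case the
intercept above is there to catch.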
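And the local-variable merge suggested a few comments up could look
roughly like the sketch below. This is just an idea of the shape, not
code from any version of the patch; the helper name is made up:

    /*
     * Sketch only: merge an 8/16/32-bit IN result into the saved rax
     * without disturbing the bytes the access width leaves alone.
     */
    static inline uint64_t vmport_merge_rax(uint64_t saved_rax,
                                            uint32_t new_eax,
                                            uint32_t bytes)
    {
        switch ( bytes )
        {
        case 1:
            return (saved_rax & ~0xffULL) | (new_eax & 0xff);
        case 2:
            return (saved_rax & ~0xffffULL) | (new_eax & 0xffff);
        default:
            return new_eax;  /* a 32-bit write zero-extends in long mode */
        }
    }

The IOREQ_READ path would then end with something like
"regs->rax = vmport_merge_rax(saved_rax, val, byte_cnt);".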
> These ei1/ei2 checks belong in the callers imo - even if both SVM
> and VMX happen to have them be zero in the cases you're
> interested in, these are still vendor dependent values which
> shouldn't be interpreted by vendor independent code.

Ok, will move the checks.

>> +         regs->_eax == BDOOR_MAGIC )
>> +    {
>> +        int i = 0;
>
> unsigned int

Sure.

>> +        uint32_t val;
>> +        uint32_t byte_cnt = hvm_guest_x86_mode(v);
>
> Not the best variable name - x86emul uses op_bytes. Which gets me
> to a fundamental question: With all this custom decoding and
> linearizing of CS:RIP, did you investigate using x86emul with suitably
> set up callbacks instead? That would e.g. at once make you properly
> ignore benign instruction prefixes (you look for 0x66 only below).

Will change the name. I did not investigate x86emul; I will look into
what it would take. One of the risks that comes to mind is that if
x86emul incorrectly executes an instruction (a non-I/O one) and does
not generate a #GP, that is bad. From my quick look, I should be able
to limit it to suppressing the #GP only for an I/O instruction. Since
this code path is only for avoiding the #GP fault when using the
VMware port, the only code I know of that does not use a plain "0xed"
with no prefix bytes is the testing code I have. Handling the VMware
port under unexpected instruction encodings has not been high on the
list of tasks to do.

>> +        unsigned char bytes[MAX_INST_LEN];
>> +        unsigned int fetch_len;
>> +        int frc;
>> +
>> +        /* in or out are limited to 32bits */
>> +        if ( byte_cnt > 4 )
>> +            byte_cnt = 4;
>> +
>> +        /*
>> +         * Fetch up to the next page break; we'll fetch from the
>> +         * next page later if we have to.
>> +         */
>> +        fetch_len = min_t(unsigned int, *inst_len,
>> +                          PAGE_SIZE - (inst_addr & ~PAGE_MASK));
>> +        frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
>> +                                                PFEC_page_present);
>> +        if ( frc != HVMCOPY_okay )
>> +        {
>> +            gdprintk(XENLOG_WARNING,
>> +                     "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
>> +                     (unsigned long) inst_addr, frc, *inst_len, fetch_len);
>
> Pointless cast. But the value of log messages like this one is
> questionable anyway.

Will drop the cast. I am not sure it is possible to get here. The best
I have come up with is to change the GDT entry for CS to fault, then
do this instruction. I am not sure it would fault, and it is clearly
an attempt to break in. I do know that if Xen is running under VMware
(why would anyone do this?) this is possible. With all this, should I
just drop this message (or make it debug=y only)?

>> +            return X86EMUL_VMPORT_FETCH_ERROR_BYTE1;
>> +        }
>> +
>> +        /* Check for operand size prefix */
>> +        while ( (i < MAX_INST_LEN) && (bytes[i] == 0x66) )
>> +        {
>> +            i++;
>> +            if ( i >= fetch_len )
>> +            {
>> +                frc = hvm_fetch_from_guest_virt_nofault(
>> +                    &bytes[fetch_len], inst_addr + fetch_len,
>> +                    MAX_INST_LEN - fetch_len, PFEC_page_present);
>> +                if ( frc != HVMCOPY_okay )
>> +                {
>> +                    gdprintk(XENLOG_WARNING,
>> +                             "Bad instruction fetch at %#lx + %#x (frc=%d)\n",
>> +                             inst_addr, fetch_len, frc);
>> +                    return X86EMUL_VMPORT_FETCH_ERROR_BYTE2;
>> +                }
>> +                fetch_len = MAX_INST_LEN;
>> +            }
>> +        }
>> +        *inst_len = i + 1;
>
> i may be MAX_INST_LEN already when you get here.

Yes, I need to terminate the loop sooner.

>> +
>> +        /* Only adjust byte_cnt 1 time */
>> +        if ( bytes[0] == 0x66 )  /* operand size prefix */
>> +        {
>> +            if ( byte_cnt == 4 )
>> +                byte_cnt = 2;
>> +            else
>> +                byte_cnt = 4;
>> +        }
>
> Iirc REX.W set following 0x66 cancels the effect of the latter.
> Another thing x86emul would be taking care of for you if you used it.

I did not know this. Most of my testing was done without any check for
prefixes. I.e. (Open) VMware Tools only uses the plain inl. I do not
know of anybody using 16-bit segments with VMware Tools.

> Also this byte_cnt handling isn't correct for the real and VM86 mode
> cases (where hvm_guest_x86_mode() returns 0/1 respectively).

Ok, will make better use of hvm_guest_x86_mode()'s return value.

>> +        if ( bytes[i] == 0xed )       /* in (%dx),%eax or in (%dx),%ax */
>> +            return vmport_ioport(IOREQ_READ, BDOOR_PORT, byte_cnt, &val);
>> +        else if ( bytes[i] == 0xec )  /* in (%dx),%al */
>> +            return vmport_ioport(IOREQ_READ, BDOOR_PORT, 1, &val);
>> +        else if ( bytes[i] == 0xef )  /* out %eax,(%dx) or out %ax,(%dx) */
>> +            return vmport_ioport(IOREQ_WRITE, BDOOR_PORT, byte_cnt, &val);
>> +        else if ( bytes[i] == 0xee )  /* out %al,(%dx) */
>> +            return vmport_ioport(IOREQ_WRITE, BDOOR_PORT, 1, &val);
>> +        else
>> +        {
>> +            *inst_len = 0;  /* This is unknown. */
>> +            return X86EMUL_VMPORT_BAD_OPCODE;
>> +        }
>
> switch() please.

Sure.

>> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
>> +                                    struct vcpu *v)
>> +{
>> +    unsigned long exit_qualification;
>> +    unsigned long inst_len;
>> +    unsigned long inst_addr = vmx_rip2pointer(regs, v);
>> +    unsigned long ecode;
>> +    int rc;
>> +#ifndef NDEBUG
>> +    unsigned long orig_inst_len;
>> +    unsigned long vector;
>> +
>> +    __vmread(VM_EXIT_INTR_INFO, &vector);
>> +    BUG_ON(!(vector & INTR_INFO_VALID_MASK));
>> +    BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
>> +#endif
>
> If you use ASSERT() instead of BUG_ON(), I think you can avoid most
> of this preprocessor conditional.

I do not see how. vector only exists in "debug=y" builds. So yes, if
using ASSERT() I could move the #endif up 2 lines, but that does not
look better to me.

>> +    __vmread(EXIT_QUALIFICATION, &exit_qualification);
>> +    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>
> get_instruction_length(). But is it architecturally defined that
> #GP intercept vmexits actually set this field?

I could not find a clear statement. My reading of the following
(directly out of "Intel® 64 and IA-32 Architectures Software
Developer's Manual Volume 3 (3A, 3B & 3C): System Programming Guide",
Order Number 325384-052US, September 2014):

  VM-exit instruction length. This field is used in the following
  cases:
  - For fault-like VM exits due to attempts to execute one of the
    following instructions that cause VM exits unconditionally (see
    Section 25.1.2) or based on the settings of VM-execution controls
    (see Section 25.1.3): CLTS, CPUID, GETSEC, HLT, IN, INS, INVD,
    INVEPT, INVLPG, INVPCID, INVVPID, LGDT, LIDT, LLDT, LMSW, LTR,
    MONITOR, MOV CR, MOV DR, MWAIT, OUT, OUTS, PAUSE, RDMSR, RDPMC,
    RDRAND, RDSEED, RDTSC, RDTSCP, RSM, SGDT, SIDT, SLDT, STR, VMCALL,
    VMCLEAR, VMLAUNCH, VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE,
    VMXOFF, VMXON, WBINVD, WRMSR, XRSTORS, XSETBV, and XSAVES.

and

  25.1.2 Instructions That Cause VM Exits Unconditionally
  The following instructions cause VM exits when they are executed in
  VMX non-root operation: CPUID, GETSEC, INVD, and XSETBV. This is
  also true of instructions introduced with VMX, which include:
  INVEPT, INVVPID, VMCALL, VMCLEAR, VMLAUNCH, VMPTRLD, VMPTRST,
  VMRESUME, VMXOFF, and VMXON.

  25.1.3 Instructions That Cause VM Exits Conditionally
  Certain instructions cause VM exits in VMX non-root operation
  depending on the setting of the VM-execution controls.
  The following instructions can cause "fault-like" VM exits based on
  the conditions described:
  - CLTS. The CLTS instruction causes a VM exit if the bits in
    position 3 (corresponding to CR0.TS) are set in both the CR0
    guest/host mask and the CR0 read shadow.
  - HLT. The HLT instruction causes a VM exit if the "HLT exiting"
    VM-execution control is 1.
  - IN, INS/INSB/INSW/INSD, OUT, OUTS/OUTSB/OUTSW/OUTSD. The behavior
    of each of these instructions is determined by the settings of the
    "unconditional I/O exiting" and "use I/O bitmaps" VM-execution
    controls:
    - If both controls are 0, the instruction executes normally.
    - If the "unconditional I/O exiting" VM-execution control is 1 and
      the "use I/O bitmaps" VM-execution control is 0, the instruction
      causes a VM exit.
    - If the "use I/O bitmaps" VM-execution control is 1, the
      instruction causes a VM exit if it attempts to access an I/O
      port corresponding to a bit set to 1 in the appropriate I/O
      bitmap (see Section 24.6.4). If an I/O operation "wraps around"
      the 16-bit I/O-port space (accesses ports FFFFH and 0000H), the
      I/O instruction causes a VM exit (the "unconditional I/O
      exiting" VM-execution control is ignored if the "use I/O
      bitmaps" VM-execution control is 1). See Section 25.1.1 for
      information regarding the priority of VM exits relative to
      faults that may be caused by the INS and OUTS instructions.

to me says that yes, this field is set on a #GP exit on an IN. But the
#GP case is not called out by name. My read is that a #GP fault is a
"VM exit unconditionally" case based on the setting of the exception
bitmap. In all my testing this field had the correct value on the
small set of Intel chips I have tested on. Since AMD does not provide
it (and has the hard-coded 15), I could switch to 15 here also. Not
using get_instruction_length() does avoid a possible BUG_ON() if I am
wrong. So there are 3 options here:

1) Add an ASSERT() like the BUG_ON() in get_instruction_length().
2) Switch to using get_instruction_length().
3) Switch to using MAX_INST_LEN.

Let me know which way to go.

>> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>>          if ( v->fpu_dirtied )
>>              nvcpu->nv_vmexit_pending = 1;
>>      }
>> +    else if ( vector == TRAP_gp_fault )
>> +        nvcpu->nv_vmexit_pending = 1;
>
> Doesn't that mean an unconditional vmexit even if the L1 hypervisor
> didn't ask for such?

It might. I have not done any testing here for the nested VMX case. I
could just drop this for now and decide what to do for this code
later.

>> --- a/xen/include/asm-x86/hvm/io.h
>> +++ b/xen/include/asm-x86/hvm/io.h
>> @@ -25,7 +25,7 @@
>>  #include <public/hvm/ioreq.h>
>>  #include <public/event_channel.h>
>>
>> -#define MAX_IO_HANDLER             16
>> +#define MAX_IO_HANDLER             17
>
> If you're really getting beyond 16 (which I don't see, I'm counting
> 14 current users) this should be bumped by more than just 1.
>
>> +/*
>> + * Additional return values from vmport_gp_check.
>> + *
>> + * Note: return values include:
>> + *   X86EMUL_OKAY
>> + *   X86EMUL_UNHANDLEABLE
>> + *   X86EMUL_EXCEPTION
>> + *   X86EMUL_RETRY
>> + *   X86EMUL_CMPXCHG_FAILED
>> + *
>> + * The additional do not overlap any of the above.
>> + */
>> +#define X86EMUL_VMPORT_NOT_ENABLED       10
>> +#define X86EMUL_VMPORT_FETCH_ERROR_BYTE1 11
>> +#define X86EMUL_VMPORT_FETCH_ERROR_BYTE2 12
>> +#define X86EMUL_VMPORT_BAD_OPCODE        13
>> +#define X86EMUL_VMPORT_BAD_STATE         14
>
> Going through the patch, you only ever return these, but never
> check for any of them.
> Why do you add these in the first place,
> risking future collisions even if there are none now?

There used to be a way to log or trace them, but I think that code has
been dropped. Will remove them.
    -Don Slutz

> Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel