
Re: [Xen-devel] [PATCH v4 02/17] x86emul: support MMX/SSE{, 2, 3} moves



On 01/03/17 14:19, Jan Beulich wrote:
>>>> On 01.03.17 at 14:59, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 28/02/17 12:50, Jan Beulich wrote:
>>> Previously supported insns are being converted to the new model, and
>>> several new ones are being added.
>>>
>>> To keep the stub handling reasonably simple, integrate SET_SSE_PREFIX()
>>> into copy_REX_VEX(), at once switching the stubs to use an empty REX
>>> prefix instead of a double DS: one (no byte registers are being
>>> accessed, so an empty REX prefix has no effect), except (of course) for
>>> the 32-bit test harness build.
>> Why switch a %ds override to REX?  There doesn't appear to be any benefit.
> It eliminates a mode_64bit() conditional from the non-VEX path in
> the macro. And then, honestly, this is a question I would have
> expected (if at all) the first time you came across this.

This is an extremely complicated series to review.  I am sorry, but I
can't always spot all issues in v1.


> I also think avoiding two identical prefixes is (marginally) better
> architecture-wise.

There is no specific advice in the AMD optimisation guide.

The Intel guide warns against unnecessary use of 0x66 and 0x67
(specifically in the Length-Changing Prefixes section), which
dynamically change the length of the instruction.  This doesn't apply to
us in this situation.
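
For reference, the classic LCP case (not something from this patch; just
the manual's concern) is a 66 prefix changing an instruction's immediate
size, e.g.:

    b8 78 56 34 12    mov $0x12345678,%eax   /* imm32 */
    66 b8 34 12       mov $0x1234,%ax        /* 66 shrinks the imm to 16 bits */

where the predecoder's length calculation, based on the default operand
size, gets the prefixed form wrong and stalls.  Nothing in these stubs
changes instruction length like that.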

The only other reference to prefixes comes from the Other Decoding
Guidelines section, which states (obviously) that extra prefixes decrease
instruction bandwidth (as more bytes need consuming to decode the
instruction), and that any instruction with multiple prefixes at all must
be decoded by the first decoder, which creates contention for that
resource.

I can't see anything suggesting that two %ds prefixes vs. a single %ds
plus an empty REX prefix would make any difference.
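
To make sure I'm reading the new macros correctly: assuming REX_PREFIX is
the usual 0x40, the stub prefix bytes come out as

    64-bit build:  3e 40 0f ...   /* %ds override, empty REX, 0f escape */
    32-bit build:  3e 3e 0f ...   /* two %ds overrides, 0f escape */

(with copy_REX_VEX() possibly replacing byte 0 by an SSE prefix and ORing
rex into byte 1), so the only difference under discussion is whether byte
1 is a second 3e or an empty 40.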

>
>>> @@ -383,15 +383,35 @@ union vex {
>>>      };
>>>  };
>>>  
>>> +#ifdef __x86_64__
>>> +# define PFX2 REX_PREFIX
>>> +#else
>>> +# define PFX2 0x3e
>>> +#endif
>>> +#define PFX_BYTES 3
>>> +#define init_prefixes(stub) ({ \
>>> +    uint8_t *buf_ = get_stub(stub); \
>>> +    buf_[0] = 0x3e; \
>>> +    buf_[1] = PFX2; \
>>> +    buf_[2] = 0x0f; \
>>> +    buf_ + 3; \
>>> +})
>>> +
>>>  #define copy_REX_VEX(ptr, rex, vex) do { \
>>>      if ( (vex).opcx != vex_none ) \
>>>      { \
>>>          if ( !mode_64bit() ) \
>>>              vex.reg |= 8; \
>>> -        ptr[0] = 0xc4, ptr[1] = (vex).raw[0], ptr[2] = (vex).raw[1]; \
>>> +        (ptr)[0 - PFX_BYTES] = 0xc4; \
>>> +        (ptr)[1 - PFX_BYTES] = (vex).raw[0]; \
>>> +        (ptr)[2 - PFX_BYTES] = (vex).raw[1]; \
>>> +    } \
>>> +    else \
>>> +    { \
>>> +        if ( (vex).pfx ) \
>>> +            (ptr)[0 - PFX_BYTES] = sse_prefix[(vex).pfx - 1]; \
>>> +        (ptr)[1 - PFX_BYTES] |= rex; \
>> This is no longer guarded by mode_64bit().  Won't this result in %ds |
>> rex in the 32bit test stubs?
> Correct. But please realize that rex is zero at all times when
> emulating other than 64-bit mode.

Then please leave a comment explaining why this is safe.
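
Something along these lines would do (just a sketch; wording up to you):

    else \
    { \
        if ( (vex).pfx ) \
            (ptr)[0 - PFX_BYTES] = sse_prefix[(vex).pfx - 1]; \
        /* \
         * rex is necessarily zero when emulating other than 64-bit mode, \
         * so in the 32-bit build this OR leaves the second %ds override \
         * untouched. \
         */ \
        (ptr)[1 - PFX_BYTES] |= rex; \
    } \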

>
>>> @@ -5429,6 +5496,57 @@ x86_emulate(
>>>          singlestep = _regs._eflags & X86_EFLAGS_TF;
>>>          break;
>>>  
>>> +    CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
>>> +    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
>>> +    CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
>>> +    case X86EMUL_OPC_VEX_66(0x0f, 0xd7):   /* vpmovmskb {x,y}mm,reg */
>>> +        generate_exception_if(ea.type != OP_REG, EXC_UD);
>>> +
>>> +        if ( vex.opcx == vex_none )
>>> +        {
>>> +            if ( vex.pfx & VEX_PREFIX_DOUBLE_MASK )
>>> +                vcpu_must_have(sse2);
>>> +            else
>>> +            {
>>> +                if ( b != 0x50 )
>>> +                    host_and_vcpu_must_have(mmx);
>>> +                vcpu_must_have(sse);
>>> +            }
>>> +            if ( b == 0x50 || (vex.pfx & VEX_PREFIX_DOUBLE_MASK) )
>>> +                get_fpu(X86EMUL_FPU_xmm, &fic);
>>> +            else
>>> +                get_fpu(X86EMUL_FPU_mmx, &fic);
>>> +        }
>>> +        else
>>> +        {
>>> +            generate_exception_if(vex.reg != 0xf, EXC_UD);
>> Isn't this TwoOp?
> Yes, hence the #UD. Or is the question "Why is this being done
> here, instead of on the common code path?" If so - the common
> code path doing this isn't being reached, as we invoke the stub
> inside the case block.

My question was actually "Why isn't this based on d & TwoOp?"
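
i.e. keyed off the decode table rather than open-coded, along the lines
of what the common path below already does (sketch, assuming these table
entries are marked TwoOp):

    generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf, EXC_UD);

so the operand-count information lives in one place.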

>
>>> +            if ( b == 0x50 || !vex.l )
>>> +                host_and_vcpu_must_have(avx);
>>> +            else
>>> +                host_and_vcpu_must_have(avx2);
>>> +            get_fpu(X86EMUL_FPU_ymm, &fic);
>>> +        }
>>> +
>>> +        opc = init_prefixes(stub);
>>> +        opc[0] = b;
>>> +        /* Convert GPR destination to %rAX. */
>>> +        rex_prefix &= ~REX_R;
>>> +        vex.r = 1;
>>> +        if ( !mode_64bit() )
>>> +            vex.w = 0;
>>> +        opc[1] = modrm & 0xc7;
>>> +        fic.insn_bytes = PFX_BYTES + 2;
>>> +        opc[2] = 0xc3;
>>> +
>>> +        copy_REX_VEX(opc, rex_prefix, vex);
>>> +        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
>>> +
>>> +        put_stub(stub);
>>> +        put_fpu(&fic);
>>> +
>>> +        dst.bytes = 4;
>> Somewhere there should probably be an ASSERT() that state->simd_size is
>> 0, so we don't try to invoke the stub twice.
> I can do this, but it didn't seem natural to do so when putting this
> together, as - obviously - I did produce/check the table entries at
> basically the same time as I did write this code.

It is obvious to you now, and I do trust that you checked the
correctness as it related to this patch, but it will not be obvious in 6
months' time with another dev-cycle's worth of change on top.

The emulator is very hard-to-follow code.  Complexity is a necessary
consequence of its purpose, but it is also the source of a lot of bugs,
most of them very subtle.  Wherever possible, I would prefer that we take
every opportunity to make the logic easier to follow, and harder to
accidentally break with new changes, for the sake of everyone needing to
edit it in the future.

In this specific case, I don't wish to prescribe exactly how to prevent
accidental breakage, but some kind of assertion that we don't execute a
stub twice would be very wise, because recent history has shown that AFL
is very good at reaching unintended codepaths.
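
Even something as small as (a sketch; exact spelling and placement are of
course up to you):

    ASSERT(!state->simd_size);

next to each in-place invoke_stub()/put_stub() pair would turn
"accidentally also reached the generic stub path" into an obvious failure
rather than a subtle one.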

>
>>>          generate_exception_if(!op_bytes, EXC_UD);
>>>          generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf,
>>>                                EXC_UD);
>>>  
>>> -        if ( !buf )
>>> +        if ( !opc )
>>>              BUG();
>>> -        if ( vex.opcx == vex_none )
>>> -            SET_SSE_PREFIX(buf[0], vex.pfx);
>>> -
>>> -        buf[fic.insn_bytes] = 0xc3;
>>> -        copy_REX_VEX(buf, rex_prefix, vex);
>>> +        opc[fic.insn_bytes - PFX_BYTES] = 0xc3;
>> fic.insn_bytes - PFX_BYTES is in the middle of the opcode, isn't it?
> No - note the difference between opc and buf: The former points
> past the common prefix bytes.

Oh - so it does.
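
(Spelling it out for my own benefit: init_prefixes() returns buf_ +
PFX_BYTES, so the stub layout is

    opc[-3] opc[-2] opc[-1] | opc[0]  opc[1] ...
      3e     PFX2    0f     | opcode  ModRM  ... c3

with fic.insn_bytes counting from opc[-3], which is why
opc[fic.insn_bytes - PFX_BYTES] lands just past the instruction proper,
where the ret belongs.)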

~Andrew
