
Re: [Xen-devel] [PATCH v4 02/17] x86emul: support MMX/SSE{, 2, 3} moves



On 01/03/17 14:19, Jan Beulich wrote:
>>>> On 01.03.17 at 14:59, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 28/02/17 12:50, Jan Beulich wrote:
>>> Previously supported insns are being converted to the new model, and
>>> several new ones are being added.
>>>
>>> To keep the stub handling reasonably simple, integrate SET_SSE_PREFIX()
>>> into copy_REX_VEX(), at once switching the stubs to use an empty REX
>>> prefix instead of a double DS: one (no byte registers are being
>>> accessed, so an empty REX prefix has no effect), except (of course) for
>>> the 32-bit test harness build.
>> Why switch a %ds override to REX?  There doesn't appear to be any benefit.
> It eliminates a mode_64bit() conditional from the non-VEX path in
> the macro. And then, honestly, this is a question I would have
> expected (if at all) the first time you came across this.

This is an extremely complicated series to review.  I am sorry, but I
can't always spot all issues in v1.


> I also think avoiding two identical prefixes is (marginally) better
> architecture-wise.

There is no specific advice in the AMD optimisation guide.

The Intel guide warns against unnecessary use of 0x66 and 0x67
(specifically in the Length-Changing Prefixes section), which
dynamically change the length of the instruction.  This doesn't apply to
us in this situation.
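
For reference, the classic LCP case (not something from this patch; just
the manual's concern) is a 66 prefix changing an instruction's immediate
size, e.g.:

    b8 78 56 34 12    mov $0x12345678,%eax   /* imm32 */
    66 b8 34 12       mov $0x1234,%ax        /* 66 shrinks the imm to 16 bits */

where the predecoder's length calculation, based on the default operand
size, gets the prefixed form wrong and stalls.  Nothing in these stubs
changes instruction length like that.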

The only other reference to prefixes comes from the Other Decoding
Guidelines section, which states (obviously) that extra prefixes decrease
instruction bandwidth (as more bytes need consuming to decode the
instruction), and that any instruction with multiple prefixes at all must
be decoded by the first decoder, which creates contention for that
resource.

I can't see anything suggesting that two %ds prefixes vs. a single %ds
plus an empty REX prefix would make any difference.
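
To make sure I'm reading the new macros correctly: assuming REX_PREFIX is
the usual 0x40, the stub prefix bytes come out as

    64-bit build:  3e 40 0f ...   /* %ds override, empty REX, 0f escape */
    32-bit build:  3e 3e 0f ...   /* two %ds overrides, 0f escape */

(with copy_REX_VEX() possibly replacing byte 0 by an SSE prefix and ORing
rex into byte 1), so the only difference under discussion is whether byte
1 is a second 3e or an empty 40.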

>
>>> @@ -383,15 +383,35 @@ union vex {
>>>      };
>>>  };
>>>  
>>> +#ifdef __x86_64__
>>> +# define PFX2 REX_PREFIX
>>> +#else
>>> +# define PFX2 0x3e
>>> +#endif
>>> +#define PFX_BYTES 3
>>> +#define init_prefixes(stub) ({ \
>>> +    uint8_t *buf_ = get_stub(stub); \
>>> +    buf_[0] = 0x3e; \
>>> +    buf_[1] = PFX2; \
>>> +    buf_[2] = 0x0f; \
>>> +    buf_ + 3; \
>>> +})
>>> +
>>>  #define copy_REX_VEX(ptr, rex, vex) do { \
>>>      if ( (vex).opcx != vex_none ) \
>>>      { \
>>>          if ( !mode_64bit() ) \
>>>              vex.reg |= 8; \
>>> -        ptr[0] = 0xc4, ptr[1] = (vex).raw[0], ptr[2] = (vex).raw[1]; \
>>> +        (ptr)[0 - PFX_BYTES] = 0xc4; \
>>> +        (ptr)[1 - PFX_BYTES] = (vex).raw[0]; \
>>> +        (ptr)[2 - PFX_BYTES] = (vex).raw[1]; \
>>> +    } \
>>> +    else \
>>> +    { \
>>> +        if ( (vex).pfx ) \
>>> +            (ptr)[0 - PFX_BYTES] = sse_prefix[(vex).pfx - 1]; \
>>> +        (ptr)[1 - PFX_BYTES] |= rex; \
>> This is no longer guarded by mode_64bit().  Won't this result in %ds |
>> rex in the 32bit test stubs?
> Correct. But please realize that rex is zero at all times when
> emulating other than 64-bit mode.

Then please leave a comment explaining why this is safe.
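
Something along these lines would do (just a sketch; wording up to you):

    else \
    { \
        if ( (vex).pfx ) \
            (ptr)[0 - PFX_BYTES] = sse_prefix[(vex).pfx - 1]; \
        /* \
         * rex is necessarily zero when emulating other than 64-bit mode, \
         * so in the 32-bit build this OR leaves the second %ds override \
         * untouched. \
         */ \
        (ptr)[1 - PFX_BYTES] |= rex; \
    } \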

>
>>> @@ -5429,6 +5496,57 @@ x86_emulate(
>>>          singlestep = _regs._eflags & X86_EFLAGS_TF;
>>>          break;
>>>  
>>> +    CASE_SIMD_PACKED_FP(, 0x0f, 0x50):     /* movmskp{s,d} xmm,reg */
>>> +    CASE_SIMD_PACKED_FP(_VEX, 0x0f, 0x50): /* vmovmskp{s,d} {x,y}mm,reg */
>>> +    CASE_SIMD_PACKED_INT(0x0f, 0xd7):      /* pmovmskb {,x}mm,reg */
>>> +    case X86EMUL_OPC_VEX_66(0x0f, 0xd7):   /* vpmovmskb {x,y}mm,reg */
>>> +        generate_exception_if(ea.type != OP_REG, EXC_UD);
>>> +
>>> +        if ( vex.opcx == vex_none )
>>> +        {
>>> +            if ( vex.pfx & VEX_PREFIX_DOUBLE_MASK )
>>> +                vcpu_must_have(sse2);
>>> +            else
>>> +            {
>>> +                if ( b != 0x50 )
>>> +                    host_and_vcpu_must_have(mmx);
>>> +                vcpu_must_have(sse);
>>> +            }
>>> +            if ( b == 0x50 || (vex.pfx & VEX_PREFIX_DOUBLE_MASK) )
>>> +                get_fpu(X86EMUL_FPU_xmm, &fic);
>>> +            else
>>> +                get_fpu(X86EMUL_FPU_mmx, &fic);
>>> +        }
>>> +        else
>>> +        {
>>> +            generate_exception_if(vex.reg != 0xf, EXC_UD);
>> Isn't this TwoOp?
> Yes, hence the #UD. Or is the question "Why is this being done
> here, instead of on the common code path?" If so - the common
> code path doing this isn't being reached, as we invoke the stub
> inside the case block.

My question was actually "Why isn't this based on d & TwoOp?"
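
i.e. keyed off the decode table rather than open-coded, along the lines
of what the common path below already does (sketch, assuming these table
entries are marked TwoOp):

    generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf, EXC_UD);

so the operand-count information lives in one place.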

>
>>> +            if ( b == 0x50 || !vex.l )
>>> +                host_and_vcpu_must_have(avx);
>>> +            else
>>> +                host_and_vcpu_must_have(avx2);
>>> +            get_fpu(X86EMUL_FPU_ymm, &fic);
>>> +        }
>>> +
>>> +        opc = init_prefixes(stub);
>>> +        opc[0] = b;
>>> +        /* Convert GPR destination to %rAX. */
>>> +        rex_prefix &= ~REX_R;
>>> +        vex.r = 1;
>>> +        if ( !mode_64bit() )
>>> +            vex.w = 0;
>>> +        opc[1] = modrm & 0xc7;
>>> +        fic.insn_bytes = PFX_BYTES + 2;
>>> +        opc[2] = 0xc3;
>>> +
>>> +        copy_REX_VEX(opc, rex_prefix, vex);
>>> +        invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
>>> +
>>> +        put_stub(stub);
>>> +        put_fpu(&fic);
>>> +
>>> +        dst.bytes = 4;
>> Somewhere there should probably be an ASSERT() that state->simd_size is
>> 0, so we don't try to invoke the stub twice.
> I can do this, but it didn't seem natural to do so when putting this
> together, as - obviously - I did produce/check the table entries at
> basically the same time as I did write this code.

It is obvious to you now, and I do trust that you checked the
correctness as it related to this patch, but it will not be obvious in 6
months' time with another dev-cycle's worth of change on top.

The emulator is very hard-to-follow code.  Complexity is a necessary
consequence of its purpose, but it is also the source of a lot of bugs,
most of them very subtle.  Wherever possible, I would prefer that we take
every opportunity to make the logic easier to follow, and harder to
accidentally break with new changes, for the sake of everyone needing to
edit it in the future.

In this specific case, I don't wish to prescribe exactly how to prevent
accidental breakage, but some kind of assertion that we don't execute a
stub twice would be very wise, because recent history has shown that AFL
is very good at reaching unintended codepaths.
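
Even something as small as (a sketch; exact spelling and placement are of
course up to you):

    ASSERT(!state->simd_size);

next to each in-place invoke_stub()/put_stub() pair would turn
"accidentally also reached the generic stub path" into an obvious failure
rather than a subtle one.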

>
>>>          generate_exception_if(!op_bytes, EXC_UD);
>>>          generate_exception_if(vex.opcx && (d & TwoOp) && vex.reg != 0xf,
>>>                                EXC_UD);
>>>  
>>> -        if ( !buf )
>>> +        if ( !opc )
>>>              BUG();
>>> -        if ( vex.opcx == vex_none )
>>> -            SET_SSE_PREFIX(buf[0], vex.pfx);
>>> -
>>> -        buf[fic.insn_bytes] = 0xc3;
>>> -        copy_REX_VEX(buf, rex_prefix, vex);
>>> +        opc[fic.insn_bytes - PFX_BYTES] = 0xc3;
>> fic.insn_bytes - PFX_BYTES is in the middle of the opcode, isn't it?
> No - note the difference between opc and buf: The former points
> past the common prefix bytes.

Oh - so it does.
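
(Spelling it out for my own benefit: init_prefixes() returns buf_ +
PFX_BYTES, so the stub layout is

    opc[-3] opc[-2] opc[-1] | opc[0]  opc[1] ...
      3e     PFX2    0f     | opcode  ModRM  ... c3

with fic.insn_bytes counting from opc[-3], which is why
opc[fic.insn_bytes - PFX_BYTES] lands just past the instruction proper,
where the ret belongs.)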

~Andrew
