[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 5/5] x86/PV: use generic emulator for privileged instruction handling



>>> On 06.12.16 at 20:14, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 06/12/16 11:16, Jan Beulich wrote:
>> @@ -2023,6 +2020,148 @@ static int read_gate_descriptor(unsigned
>>      return 1;
>>  }
>>  
>> +struct priv_op_ctxt {
>> +    struct x86_emulate_ctxt ctxt;
>> +    struct {
>> +        unsigned long base, limit;
>> +    } cs;
>> +    char *io_emul_stub;
>> +    unsigned int bpmatch;
>> +    unsigned int tsc;
>> +#define TSC_BASE 1
>> +#define TSC_AUX 2
>> +};
>> +
>> +static bool priv_op_to_linear(unsigned long base, unsigned long offset,
>> +                              unsigned int bytes, unsigned long limit,
>> +                              enum x86_segment seg,
>> +                              struct x86_emulate_ctxt *ctxt,
>> +                              unsigned long *addr)
> 
> This is entirely generic.  Could it be called pv_virtual_to_linear() in
> case it can be reused elsewhere?

Well, it's not entirely generic, as it requires an emulator context
structure. It could be used by other emulator hooks. Therefore I'm
not sure the suggested name is suitable - I'll use
pv_emul_virt_to_linear() instead. And I'll move it ahead of the
struct priv_op_ctxt declaration, to demonstrate it has no
dependency on it.

>> +static int priv_op_read_segment(enum x86_segment seg,
>> +                                struct segment_register *reg,
>> +                                struct x86_emulate_ctxt *ctxt)
>> +{
>> +    if ( ctxt->addr_size < 8 )
>> +    {
>> +        unsigned long limit;
>> +        unsigned int sel, ar;
>> +
>> +        switch ( seg )
>> +        {
>> +        case x86_seg_cs: sel = ctxt->regs->cs; break;
>> +        case x86_seg_ds: sel = read_sreg(ds);  break;
>> +        case x86_seg_es: sel = read_sreg(es);  break;
>> +        case x86_seg_fs: sel = read_sreg(fs);  break;
>> +        case x86_seg_gs: sel = read_sreg(gs);  break;
>> +        case x86_seg_ss: sel = ctxt->regs->ss; break;
>> +        case x86_seg_tr:
>> +            /* Check if this is an attempt to access to I/O bitmap. */
>> +            if ( (ctxt->opcode & ~0xb) == 0xe4 || (ctxt->opcode & ~3) == 
>> 0x6c )
>> +                return X86EMUL_DONE;
> 
> This this leftovers from before my system-segment work?
> 
> I/O bitmap accesses are now through the ->read() hook with x86_seg_tr.

No, the first thing ioport_access_check() does is a read_segment().
Plus there's no functional ->read() hook being installed by
emulate_privileged_op() in the first place.

>> +            /* fall through */
>> +        default: return X86EMUL_UNHANDLEABLE;
>> +        }
>> +
>> +        if ( !read_descriptor(sel, current, &reg->base, &limit, &ar, 0) )
>> +            return X86EMUL_UNHANDLEABLE;
>> +
>> +        reg->limit = limit;
>> +        reg->attr.bytes = ar >> 8;
>> +    }
>> +    else
>> +    {
>> +        switch ( seg )
>> +        {
>> +        default:
>> +            reg->base = 0;
>> +            break;
> 
> This is incorrect for system segments.  If we don't intend to service
> them, we should ASSERT() their absence.

Well, to be in line with the 32-bit code I'd rather make this an
X86EMUL_UNHANDLEABLE return.

>> +        case x86_seg_fs:
>> +            reg->base = rdfsbase();
>> +            break;
>> +        case x86_seg_gs:
>> +            reg->base = rdgsbase();
>> +            break;
>> +        }
>> +
>> +        reg->limit = ~0U;
>> +
>> +        reg->attr.bytes = 0;
>> +        reg->attr.fields.type = _SEGMENT_WR >> 8;
>> +        if ( seg == x86_seg_cs )
>> +            reg->attr.fields.type |= _SEGMENT_CODE >> 8;
>> +        reg->attr.fields.s   = 1;
>> +        reg->attr.fields.dpl = 3;
>> +        reg->attr.fields.p   = 1;
>> +        reg->attr.fields.l   = 1;
>> +        reg->attr.fields.db  = 1;
>> +        reg->attr.fields.g   = 1;
> 
> This can't all be correct.  Having %cs with both D and L set is
> definitely wrong in 64bit mode.

Oh, of course - while it's fine the way it is, I'll better set exactly
one of them (depending on whether it's CS).

>> +    }
>> +
>> +    /*
>> +     * For x86_emulate.c's mode_ring0() to work, fake a DPL of zero.
>> +     * Also do this for consistency for non-conforming code segments.
>> +     */
>> +    if ( (seg == x86_seg_ss ||
>> +          (seg == x86_seg_cs &&
>> +           !(reg->attr.fields.type & (_SEGMENT_EC >> 8)))) &&
>> +         guest_kernel_mode(current, ctxt->regs) )
>> +        reg->attr.fields.dpl = 0;
> 
> Have you run the XTF pv-iopl tests?  I'm fairly sure this simplification
> will regress the behaviour of guests not using my recent
> VMASST_TYPE_architectural_iopl extension.

I didn't run that test yet, but I also don't see where you think a
regression could slip in here: State is being faked for x86_emulate()
here, invisible to the guest.

>> +static int priv_op_read_io(unsigned int port, unsigned int bytes,
>> +                           unsigned long *val, struct x86_emulate_ctxt 
>> *ctxt)
>> +{
>> +    struct priv_op_ctxt *poc = container_of(ctxt, struct priv_op_ctxt, 
>> ctxt);
>> +    struct vcpu *curr = current;
>> +    struct domain *currd = current->domain;
>> +
>> +    /* INS must not come here. */
>> +    ASSERT((ctxt->opcode & ~9) == 0xe4);
> 
> In this case, I think it would be better to have:
> 
> if ( unlikely((ctxt->opcode & ~9) != 0xe4) )
> {
>     /* INS must not come here.  Only tolerate IN. */
>     ASSERT_UNREACHABLE();
>     return X86EMUL_UNHANDLEABLE;
> }
> 
> So we still bail in release builds.

I don't see the point: The missing ->write() hook will have the same
effect (and the fake ->read() one for the OUTS equivalent).

>>  int pv_emul_cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx,
>>                    unsigned int *edx, struct x86_emulate_ctxt *ctxt)
>>  {
>>      struct cpu_user_regs regs = *ctxt->regs;
>>  
>> +    /*
>> +     * x86_emulate uses this function to query CPU features for its own
>> +     * internal use. Make sure we're actually emulating CPUID before 
>> checking
>> +     * for emulated CPUID faulting.
>> +     */
>> +    if ( ctxt->opcode == X86EMUL_OPC(0x0f, 0xa2) )
>> +    {
>> +        const struct vcpu *curr = current;
>> +
>> +        /* If cpuid faulting is enabled and CPL>0 leave the #GP untouched. 
>> */
>> +        if ( curr->arch.cpuid_faulting &&
>> +             !guest_kernel_mode(curr, ctxt->regs) )
>> +            return X86EMUL_UNHANDLEABLE;
> 
> I don't think this is conceptually correct.  The emulator should
> explicitly raise #GP[0] here, rather than relying on the calling
> behaviour of bouncing the pending exception back to the caller.

Well, I think it's okay either way, but I've changed it to
X86EMUL_EXCEPTION. This will go away anyway with your
other series.

>> +    case 0x6c ... 0x6f: /* ins / outs */
>> +    case 0xe4 ... 0xe7: /* in / out (immediate port) */
>> +    case 0xec ... 0xef: /* in / out (port in %dx) */
>> +    case X86EMUL_OPC(0x0f, 0x06): /* clts */
>> +    case X86EMUL_OPC(0x0f, 0x09): /* wbinvd */
>> +    case X86EMUL_OPC(0x0f, 0x20) ...
>> +         X86EMUL_OPC(0x0f, 0x23): /* mov to/from cr/dr */
>> +    case X86EMUL_OPC(0x0f, 0x30): /* wrmsr */
>> +    case X86EMUL_OPC(0x0f, 0x31): /* rdtsc */
>> +    case X86EMUL_OPC(0x0f, 0x32): /* rdmsr */
>> +    case X86EMUL_OPC(0x0f, 0xa2): /* cpuid */
>> +        return X86EMUL_OKAY;
>>  
>> -        case 0x6e: /* OUTSB */
>> -            op_bytes = 1;
>> -        case 0x6f: /* OUTSW/OUTSL */
>> -            if ( (data_limit < (op_bytes - 1)) ||
>> -                 (rd_ad(esi) > (data_limit - (op_bytes - 1))) ||
>> -                  !guest_io_okay(port, op_bytes, v, regs) )
>> -                goto fail;
>> -            if ( (rc = copy_from_user(&data, (void *)data_base + rd_ad(esi),
>> -                                      op_bytes)) != 0 )
>> -            {
>> -                pv_inject_page_fault(0, data_base + rd_ad(esi)
>> -                                     + op_bytes - rc);
>> -                return EXCRET_fault_fixed;
>> -            }
>> -            guest_io_write(port, op_bytes, data, currd);
>> -            wr_ad(esi, regs->esi + (int)((regs->eflags & X86_EFLAGS_DF)
>> -                                         ? -op_bytes : op_bytes));
>> +    case 0xfa: case 0xfb: /* cli / sti */
>> +        if ( !iopl_ok(current, ctxt->regs) )
> 
> Given the use of DONE below, this part should also explicitly raise
> #GP[0], rather than falling through to UNHANDLEABLE, as we are providing
> an emulation of the instruction.

We're providing an emulation only if iopl_ok(), so I don't see the
need. The way it's done now matches previous behavior.

>> +        if ( x86_insn_modrm(state, &modrm_rm, &modrm_reg) != 3 ||
>> +             (modrm_rm & 7) != 1 )
>>              break;
>> -        case 0xd1: /* XSETBV */
>> +        switch ( modrm_reg & 7 )
>>          {
>> -            u64 new_xfeature = (u32)regs->eax | ((u64)regs->edx << 32);
>> +        case 2: /* xgetbv */
> 
> You appear to have altered XSETBV for XGETBV here.  XGETBV wasn't
> previously emulated, and doesn't have any condition (we care about
> emulating) which suffers #GP.

Indeed the comment was wrong, but the code was correct (0xd1
[ = 0b11010001] decodes to modrm_rm = 1 and modrm_reg = 2).

>> +    case X86EMUL_EXCEPTION:
>> +        pv_inject_event(&ctxt.ctxt.event);
>> +        return EXCRET_fault_fixed;
>>      }
>>  
>> -#undef wr_ad
>> -#undef rd_ad
>> -
>> - done:
>> -    instruction_done(regs, eip, bpmatch);
>> - skip:
>> -    if ( io_emul_stub )
>> -        unmap_domain_page(io_emul_stub);
>> -    return EXCRET_fault_fixed;
>> -
>> - fail:
>> -    if ( io_emul_stub )
>> -        unmap_domain_page(io_emul_stub);
>>      return 0;
> 
> This return can be dropped.  There are no paths out of the switch
> statement, and dropping this return will cause the compiler to notice if
> one gets introduced in the future.

I disagree - X86EMUL_UNHANDLEABLE comes here (as well as any
potential bogus return values).

>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -1182,7 +1182,7 @@ static int ioport_access_check(
>>  
>>      fail_if(ops->read_segment == NULL);
>>      if ( (rc = ops->read_segment(x86_seg_tr, &tr, ctxt)) != 0 )
>> -        return rc;
>> +        return rc == X86EMUL_DONE ? X86EMUL_OKAY : rc;
> 
> Ah, so the real purpose of having read_segment(x86_seg_tr, ...) return
> X86EMUL_DONE is to defer the port check into the hooks themselves.
> 
> This doesn't actually match your intended meaning of X86EMUL_DONE, or
> the documentation you provide below.  I think it would be cleaner in
> this case to have an input x86_emulate_ctxt boolean requesting to skip
> %tr ioport checks, which causes ioport_access_check() to exit early with
> X86EMUL_OKAY.

How does it not match the intended or documented meaning of
DONE? There's nothing said that this means to skip all further
emulation. Instead the meaning for the different hooks this is
valid to be used in is being described explicitly. I don't think
another input would be a good idea here; in particular I'd like
to avoid adding new inputs where possible, as that always
requires updating all existing callers of x86_emulate().

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.