[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 4/7] x86/HVM: implement memory read caching for insn emulation

On 03.02.2020 20:48, Andrew Cooper wrote:
> On 31/01/2020 16:44, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses.
> This statement isn't quite accurate.  Various hypercalls and IPIs can
> play with GPR state in the middle of emulation.
> What matters is that the GPR values aren't expected to change, and a
> guest gets to keep all the pieces if they try to play games in this area.

Not just GPR values. I've added a footnote:

[1] Other than on actual hardware, actions like
    XEN_DOMCTL_sethvmcontext, XEN_DOMCTL_setvcpucontext,
    XEN_DOMCTL_set_ext_vcpucontext, VCPUOP_initialise, INIT, or SIPI
    issued against the vCPU can occur while the vCPU is blocked waiting
    for a device model to return data. This won't affect stability of
    Xen, though, it may just lead to guest misbehavior.

I'll see whether I can come up with a reasonable way to deal with the
situation. If so, I may insert another patch ahead of this on in v5.
But I don't think this is an urgent issue to address.

>> As to the actual data page in this scenario, there are a couple of
>> aspects to take into consideration:
>> - We must be talking about an insn accessing two locations (two memory
>>   ones, one of which is MMIO, or a memory and an I/O one).
> Really?
> We're talking about all instructions, even without any memory operands,
> because at the very minimum we cache the instruction stream, and (with
> the following patch), the pagewalk to it.

I'm describing the problem case here that wants fixing, not the general
one. I.e. the case where without this change re-execution might go wrong
after the device model has returned data.

>> TBD: In principle the caching here yields unnecessary the one used for
>>      insn bytes (vio->mmio_insn{,_bytes}. However, to seed the cache
>>      with the data SVM may have made available, we'd have to also know
>>      the corresponding GPA. It's not safe, however, to re-walk the page
>>      tables to find out, as the page tables may have changed in the
>>      meantime. Therefore I guess we need to keep the duplicate
>>      functionality for now. A possible solution to this could be to use
>>      a physical-address-based cache for page table accesses (and looking
>>      forward also e.g. SVM/VMX insn emulation), and a linear-address-
>>      based one for all other reads.
> Splitting caching like that will re-introduce the same bugs I pointed
> out in earlier revisions of this series.  It is not correct to have
> multiple (and therefore, non-coherent) caches of memory.

Well, I continue to disagree, and I can only point you back at the
example of insns with multiple explicit memory references actually
picking up changes done by a remote agent. E.g. REP CMPSB with
rSI == rDI can be observed to terminate with ZF clear and rCX non-
zero. I can only re-iterate that what I introduce here is not a
cache in the sense the CPU caches work (it is now coming closer
than in prior versions, but there are differences which I think
will need to remain - see further down). Instead it is one in the
sense of "secret store for intermediate data". There is no
question of coherency whatsoever for the purposes here: A datum
once read can't become stale, no matter what writes (and by whom)
to the same memory location might occur.

> The AMD instruction stream bytes support is by no means perfect either. 
> It doesn't function properly when SMAP is active (Erratum #1096), which
> is actually caused by the instruction stream being re-fetched at vmexit
> time, with a data side access (hence the interaction with SMAP).
> Just as with the emulation for non-Nrips hardware, the contents of the
> instruction buffer (if valid) need cross referencing with the vmexit
> reason to spot races, and the same checking would make it safe to
> deduplicate the caching.

Right, and we've put in place some basic sanity checks for this
purpose, but no, I'm not convinced this is good enough to allow
safely dropping the duplicate caching. But anyway, this isn't
anything to be done in this patch, but only - if at all - in a
follow-on one.

>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -28,6 +28,17 @@
>>  #include <asm/iocap.h>
>>  #include <asm/vm_event.h>
>> +struct hvmemul_cache
>> +{
> This needs a comment (see below.)

I don't think I've spotted which of your further remarks this refers
to, and hence what it is a comment here would need to say.

>> +    unsigned int num_ents;
>> +    unsigned int max_ents;
>> +    struct {
>> +        paddr_t gpa:PADDR_BITS;
>> +        unsigned int size:BITS_PER_LONG - PADDR_BITS;
> This would probably result in rather better code if size was 8 bits
> rather than 12.

I can make it so, by inserting a BITS_PER_LONG - PADDR_BITS - 8
unnamed field. I agree that this way no shift is going to be

> Longer term, the size field should disappear and functionality adjusted
> to always read aligned 8 (or larger) byte values.  That way, the cache
> behaves much more like caches in real processors.

As per above, I don't think this can or should be the goal. Not
the least because this cache (as said, as in "secret store") is
also used to "cache" uncacheable data. In such cases we may not
read more than what was asked.

>> @@ -2838,6 +2868,123 @@ void hvm_dump_emulation_state(const char
>>             hvmemul_ctxt->insn_buf);
>>  }
>> +int hvmemul_cache_init(struct vcpu *v)
>> +{
>> +    /*
>> +     * No insn can access more than 16 independent linear addresses (AVX512F
>> +     * scatters/gathers being the worst).
> It is at least 5 more than the worse case number of regular operands, to
> cover exceptions, and therefore accesses into the IDT/GDT/LDT and TR and
> stack.

The emulator doesn't actually emulate exception delivery (yet?),
so no such memory accesses would result. I'd also not call this
accesses the insn itself causes (other then e.g. a descriptor
table access by a segment register load).

> You can manage that with a 32bit guest with an exception, %cs and %ss in
> different tables, and having to cross a page boundary when pushing the
> esp0 exception frame.
> It is one fewer generally in 64bit mode, because there are no %ss
> references to deal with, but both cases get even more complicated for a
> contributory exception, as the second %cs can be different to %cs for
> the first exception.
>>  Each such linear range can span a
>> +     * page boundary, i.e. may require two page walks. Account for each insn
>> +     * byte individually.
> Why are we counting 5 entries for every instruction byte?  Given an
> 8-byte maximum cached size, we need at most 2 entries to cover a single
> instruction.

3 entries if the insn spans a page boundary, plus the two
corresponding page walks. But: While the page walk data would
be shared for the individual, perhaps as small as byte, reads,
the fetched data may consume as many slots as there are
separate fetch requests. Recall that reads of adjacent
addresses don't get folded, even if the overall range would
fit. I can of course make the expression here more complicated,
to express MAX_INST_LEN actual data slots plus two sets of
page walk ones. Alternatively I can append "for simplicity" to
the comment, if that helps. Just let me know.

>> +     */
>> +    const unsigned int nents = (CONFIG_PAGING_LEVELS + 1) *
>> +                               (MAX_INST_LEN + 16 * 2);
>> +    struct hvmemul_cache *cache = xmalloc_flex_struct(struct hvmemul_cache,
>> +                                                      ents, nents);
>> +
>> +    if ( !cache )
>> +        return -ENOMEM;
>> +
>> +    /* Cache is disabled initially. */
>> +    cache->num_ents = nents + 1;
>> +    cache->max_ents = nents;
> (Below:) You need to document the internal semantics (probably best in
> the struct definition.)

I've added "The cache is disabled as long as num_ents > max_ents." Is
there anything else I should add?

> I thought I'd reverse engineered how disabling works (starting from the
> completely obscure hvmemul_cache_disabled()),

If it's "completely obscure" - any suggestions how to improve the

> but given that the cache
> is apparently disabled by default, I now can't figure out how it ever
> gets enabled.
> I can't see any code path where num_ents ends up lower than max_ents,
> and the read/write helpers below don't take their early exit path.

Check the hunk changing the top of _hvm_emulate_one().

>> +bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa,
>> +                        void *buffer, unsigned int size)
>> +{
>> +    const struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
>> +    unsigned int i;
>> +
>> +    /* Cache unavailable? */
>> +    if ( cache->num_ents > cache->max_ents )
>> +        return false;
>> +
>> +    while ( size > sizeof(cache->ents->data) )
>> +    {
>> +        i = gpa & (sizeof(cache->ents->data) - 1)
>> +            ? -gpa & (sizeof(cache->ents->data) - 1)
>> +            : sizeof(cache->ents->data);
>> +        if ( !hvmemul_read_cache(v, gpa, buffer, i) )
>> +            return false;
> What is this call trying to achieve?

When the full request won't fit a single entry, this reads
the first so many bytes to make gpa 8-byte aligned if it
isn't already, or a single entry's worth of data otherwise.

>> +        gpa += i;
>> +        buffer += i;
>> +        size -= i;
>> +    }
>> +
>> +    for ( i = 0; i < cache->num_ents; ++i )
>> +        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
> With num_ents currently topping out at 235 (and needing to increase
> anyway), this can end up being a very long loop.
> Given that it will be populated with higher level PTEs to being with,
> the common case of accessing real data will be at the end of the O(n)
> search rather than the start.

Any "real data" access will also be accompanied by a page walk.
I therefore don't see how re-arrangement would help. Also
num_ents is hardly ever going to be more than a dozen or so,
unless we really need to emulate something exotic like an S/G

> The cache is a single block of memory, so it is almost certainly better
> to keep it sorted.  That said, it is problematic due to the overlap of
> sizes, which I'm still concerned concerned about generally.

Well, I thought I'd start with something really simple. As per
above the "overlap" of sizes is not a problem, since each
individual read issued may be considered to observe memory
state at the time of its issuing (rather than that at the time
when a prior overlapping read was issued). Merging would _also_
be okay, but is not required for correctness.

The question is whether the higher cost of insertion (when
wanting to make the cache sorted) is outweighed by the cheaper
lookup. The common case is for there to not be re-execution:
Neither internally handled MMIO need it, nor writes sent to
the DM, nor almost anything that's introspection related.

One option I've been considering is to actually "sequence-
number" the accesses. Upon replay the same sequence of
operations is to occur anyway, which would make an actual
lookup (as in "search") of the correct entry unnecessary. Of
course this would move us further away again from how actual
CPU caches work. I could also imagine a hybrid model, where
the last read slot is recorded, and the next read would look
at the subsequent slot first.

>> --- a/xen/include/asm-x86/hvm/emulate.h
>> +++ b/xen/include/asm-x86/hvm/emulate.h
>> @@ -13,6 +13,7 @@
>>  #define __ASM_X86_HVM_EMULATE_H__
>>  #include <xen/err.h>
>> +#include <xen/sched.h>
>>  #include <asm/hvm/hvm.h>
>>  #include <asm/x86_emulate.h>
>> @@ -96,6 +97,31 @@ int hvmemul_do_pio_buffer(uint16_t port,
>>                            uint8_t dir,
>>                            void *buffer);
>> +#ifdef CONFIG_HVM
> This needs a comment block stating, at a minimum, the semantics and
> expected use for the cache.
> Otherwise, this is ~150 lines of totally undocumented and practically
> uncommented, critical and subtle infrastructure.

Will do; let's see if what I can come up with is good enough.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.