
Re: [Xen-devel] [PATCH v2 0/4] x86/HVM: implement memory read caching



>>> On 12.10.18 at 15:55, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 11/09/18 14:10, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>> currently find enough time to carry out the requested further
>> rework.
> 
> Following the x86 call, I've had some thoughts and suggestions about how
> to make this work in a reasonable way, without resorting to the full
> caching approach.

Actually I now think I'll go that full caching (in the sense meant here)
route, see below, but of course things could easily be limited to just
the page table accesses in a first step.

> First and foremost, I'd like to recommend against trying to combine
> the fix for repeated PDPTR reading with that for repeated PTE reading.
> While they are both repeated-reading problems, one really is a knobbly
> corner case of 32-bit PAE paging, and the other is a general emulation
> problem.  Fixing these problems independently makes the result rather
> more simple, and far closer to how real CPUs work.

Well, that's a separate patch anyway. If you disagree with just
that part, then it can easily be left out (or be further refined),
especially as it's (intentionally) last in the series.

> For naming, how about "access once" in place of cache?  This is the best
> description of the purpose I can come up with.

For descriptive purposes that's fine, but I wouldn't want to introduce
an hvmemul_access_once structure type. I also didn't particularly like
the suggestions George made. If "cache" weren't already taken to mean
something else in computing, I'd still consider it the right term to
use. Right now I intend to call the thing hvmemul_data_buf or some
such, along the lines of insn_buf[], which holds the fetched
instruction bytes.
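To illustrate the idea (all names, sizes, and the helper below are hypothetical, not actual Xen code), such a buffer with the exact-match-only semantics of the current series might look roughly like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a per-emulation buffer recording data read once,
 * analogous to insn_buf[] holding the fetched instruction bytes. */
#define DATA_BUF_ENTRIES 8

struct data_buf_ent {
    uint64_t gpa;      /* guest physical address of the access */
    unsigned int size; /* access size in bytes */
    uint8_t data[8];   /* recorded data, returned verbatim on re-execution */
};

struct hvmemul_data_buf {
    unsigned int num_ents;
    struct data_buf_ent ents[DATA_BUF_ENTRIES];
};

/* Replay recorded data for an exact (gpa, size) match (returning 1), or
 * record the freshly read data for later passes (returning 0). */
static int data_buf_access(struct hvmemul_data_buf *buf, uint64_t gpa,
                           void *data, unsigned int size)
{
    unsigned int i;

    for ( i = 0; i < buf->num_ents; i++ )
        if ( buf->ents[i].gpa == gpa && buf->ents[i].size == size )
        {
            memcpy(data, buf->ents[i].data, size); /* replay earlier read */
            return 1;
        }

    if ( buf->num_ents < DATA_BUF_ENTRIES && size <= sizeof(buf->ents[0].data) )
    {
        struct data_buf_ent *ent = &buf->ents[buf->num_ents++];

        ent->gpa = gpa;
        ent->size = size;
        memcpy(ent->data, data, size); /* record for re-execution */
    }
    return 0;
}
```

Only exact matches are satisfied, mirroring the simplistic implementation so far: no overlaps or partial reads.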

> Next, there should be a single hvmemul_read_once(gaddr, bytes, ...)
> (name subject to improvement), which does a transparent read-through of
> the "access once cache" in terms of a single flat guest physical
> address space.  This allows individual callers to opt into using the
> access-once semantics, and doesn't burden them with the substantial
> boilerplate of the sole correct way to use this interface.
> Furthermore, this behaviour has the same semantics as the correct
> longer term fix.

That's an option, perhaps, but it has the downside of needing to split
apart the combined linear->phys translation and memory access done by
linear_{read,write}(). Plus (again see below) hvmemul_read_once()
doesn't fit the page walking code very well, due to the open-coded
reads there, which in turn are helpful for the logic setting the A/D
bits. I'd like to do this slightly differently (in the page walking
code in particular, closer to what the implementation here was, i.e.
with buffered data maintenance separated from the actual memory
accesses). But before I go and try to re-implement this (in as
transparent a way as possible, while at the same time covering all
memory accesses, not just page table reads), I'd like to settle on a
few principles, after having thought more about our options and done a
few experiments.

The goal of this, as before, is not a performance improvement, but a
correctness guarantee: Upon re-execution of a previously partly
emulated insn (requiring e.g. device model assistance), all memory
accesses done so far have to return (reads) or take (writes) the
exact same data. This means that the original guest memory accesses
have to record their addresses and data. A fundamental assumption
here is that no instruction may update register state if it may
subsequently require re-execution (see below for where this is violated).

The transparency goal suggests that maintenance of this data should be
done at as low a layer as possible. However, not all guest memory
accesses result from instruction emulation, and any that don't may not
consume (and preferably should also not produce) buffered contents. That is,
maintenance has to either live above the
hvm_copy_{to,from}_guest_linear() layer (as that's what also backs
copy_{from,to}_user_hvm()), or there needs to be an indicator (function
parameter or struct vcpu flag) whether to access the buffered data.

As you didn't like the function parameter approach, I assume the
approach to take is the struct vcpu flag one, which could be set and
cleared in _hvm_emulate_one() along the lines of what the current
version of the series does for the num_ents field.
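A minimal sketch of that flag variant (struct and function names below are made up for illustration; in Xen the flag would live in struct vcpu and be toggled by _hvm_emulate_one()):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU state: buffered data is consulted only while an
 * instruction is being emulated, so guest memory accesses not resulting
 * from emulation bypass the buffer entirely. */
struct vcpu_sketch {
    bool data_buf_active;   /* set for the duration of emulation */
    unsigned int num_ents;  /* valid buffered entries */
};

/* Lower-level copy routines would test this instead of taking an extra
 * function parameter. */
static bool use_buffered_data(const struct vcpu_sketch *v)
{
    return v->data_buf_active;
}

static bool flag_seen_during_work;

static void sample_work(struct vcpu_sketch *v)
{
    flag_seen_during_work = use_buffered_data(v);
}

/* The emulation entry point toggles the flag around the actual work,
 * along the lines of what the series does for the num_ents field. */
static void emulate_one(struct vcpu_sketch *v,
                        void (*work)(struct vcpu_sketch *))
{
    v->data_buf_active = true;
    v->num_ents = 0;            /* fresh buffer for a new instruction */
    work(v);
    v->data_buf_active = false;
}
```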

Since guest page walks also need to be taken care of (in fact these are
the primary goal, as that's where the issue to fix was noticed), and
since guest_walk_tables() doesn't use lower level routines to access
guest memory, the function either needs to be converted to use them, or
the buffer accesses need to be open coded there, as was done in this
series up to now. Using the lower level routines would in particular
complicate set_ad_bits(), so I'm not currently intending to go that
route.
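A rough sketch of what the open-coded variant amounts to (all names below are invented; real PTE reads, PSE handling, and A/D bit logic are omitted): the buffer lookup and maintenance steps stay separate from the actual memory access, which is what keeps set_ad_bits() workable.

```c
#include <assert.h>
#include <stdint.h>

#define WALK_LEVELS 4

static uint64_t fake_memory[WALK_LEVELS];   /* stand-in for guest PTEs */

struct walk_buf {
    unsigned int num;
    struct { unsigned int level; uint64_t pte; } ents[WALK_LEVELS];
};

static uint64_t read_pte(struct walk_buf *buf, unsigned int level)
{
    unsigned int i;
    uint64_t pte;

    /* 1) Consult the buffer first: a repeated walk must see the exact
     *    same PTEs, yielding the exact same translation. */
    for ( i = 0; i < buf->num; i++ )
        if ( buf->ents[i].level == level )
            return buf->ents[i].pte;

    /* 2) Actual (open-coded) memory access. */
    pte = fake_memory[level];

    /* 3) Separate maintenance step: record what was read. */
    if ( buf->num < WALK_LEVELS )
    {
        buf->ents[buf->num].level = level;
        buf->ents[buf->num].pte = pte;
        buf->num++;
    }

    return pte;
}
```

The second assertion in the test below corresponds to the motivating bug: the guest paging out the buffer between two emulation passes must not change the translation seen on re-execution.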

Now on to the intended behavior of the buffer: If we want to mimic
actual hardware behavior as closely as possible, we have to distinguish
the ordinary data cache from TLB and paging structure caches. While the
former is coherent, the latter aren't. This in particular means e.g.
that while two distinct L<n> page table reads from the same physical
address may return the same data (because there can't be any
invalidation between the start of a single insn's execution and its
completion), the same may not be appropriate for two independent
ordinary data reads. Specifically, insns with multiple independent
memory operands (CMPS{B,D,Q,W}, V{,P}GATHER*) can observe different data
for the different accesses, even if all addresses are the same, as long
as another CPU manages to modify the memory location(s) quickly enough.

If we want to retain this behavior in emulation, we'll have to tag
memory accesses such that during re-execution correct association with
their earlier matching accesses is possible, and such that distinct
accesses would not consume data buffered by earlier ones. This tagging
can, I think, still be done transparently at the layer where the
buffered data gets maintained, except that memory accesses resulting
from page walks need to be recognizable, so they won't be treated the
same way. But as per above those accesses go through independent code
paths anyway, so this wouldn't be difficult to arrange for.
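One way such tagging could work (purely illustrative; keying by position in the instruction's access sequence is just one possible scheme) is to record a sequence tag alongside the address, so two independent reads of the same address (e.g. CMPSB with source and destination aliasing) get distinct entries and may legitimately observe different data:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ACCESSES 8

struct tagged_buf {
    unsigned int seq;   /* next access number, in (re-)execution order */
    unsigned int num;
    struct { unsigned int tag; uint64_t gpa; uint64_t data; } ents[MAX_ACCESSES];
};

/* "fresh" stands in for the actual memory contents at the time of the
 * access; it is used only when no matching entry was recorded yet. */
static uint64_t tagged_read(struct tagged_buf *buf, uint64_t gpa,
                            uint64_t fresh)
{
    unsigned int tag = buf->seq++;
    unsigned int i;

    /* Replay only the entry with the same tag AND address, so a distinct
     * access doesn't consume data buffered by an earlier one. */
    for ( i = 0; i < buf->num; i++ )
        if ( buf->ents[i].tag == tag && buf->ents[i].gpa == gpa )
            return buf->ents[i].data;

    if ( buf->num < MAX_ACCESSES )
    {
        buf->ents[buf->num].tag = tag;
        buf->ents[buf->num].gpa = gpa;
        buf->ents[buf->num].data = fresh;
        buf->num++;
    }
    return fresh;
}
```

Page table reads would bypass the tag check (matching on address alone), reflecting that paging structure caches, unlike the coherent data cache, may return stale data for the duration of the insn.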

But there's one possible caveat here: The way gathers currently get
handled in the insn emulator, X86EMUL_RETRY may have two different
meanings: It may either indicate that a read is awaiting device model
assistance, and hence re-execution will occur without exiting to guest
context, or it may indicate what we'd call a continuation if this was a
hypercall.
In either case the code updates certain register state (specifically
the register used as operation mask) to record successfully completed
parts. While this is fine when exiting back to guest context, it would
confuse the buffered data access logic, as the access pattern would no
longer match that seen during the first execution run.

As an aside I'd like to note that a possible misinterpretation on my
part of the description of how gathers work may mean that the
continuation-like exit-to-guest behavior is wrong altogether. I've requested
clarification from Intel. Should this need to change, we'll run into
capacity problems with struct hvm_vcpu_io's mmio_cache[]. But in the end
I hope to also be able to do away with mmio_cache[].

> For the PDPTRs, this corner case is special, and should be handled by
> the pagewalk code.  I'm still going to go with my previous suggestion of
> having top_map point onto the caller stack.  For the VT-x case, the
> values can be pulled straight out of the VMCS, while for AMD, the values
> can be read through the "access once cache", which matches the behaviour
> of being read from memory, but ensures they won't be repeatedly read.

What is different here from how the last patch already implements it?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

