
Re: [Xen-devel] [PATCH v2 0/4] x86/HVM: implement memory read caching



>>> On 12.10.18 at 15:55, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 11/09/18 14:10, Jan Beulich wrote:
>> Emulation requiring device model assistance uses a form of instruction
>> re-execution, assuming that the second (and any further) pass takes
>> exactly the same path. This is a valid assumption as far as use of CPU
>> registers goes (as those can't change without any other instruction
>> executing in between), but is wrong for memory accesses. In particular
>> it has been observed that Windows might page out buffers underneath
>> an instruction currently under emulation (hitting between two passes).
>> If the first pass translated a linear address successfully, any subsequent
>> pass needs to do so too, yielding the exact same translation.
>>
>> Introduce a cache (used just by guest page table accesses for now, i.e.
>> a form of "paging structure cache") to make sure above described
>> assumption holds. This is a very simplistic implementation for now: Only
>> exact matches are satisfied (no overlaps or partial reads or anything).
>>
>> There's also some seemingly unrelated cleanup here which was found
>> desirable on the way.
>>
>> 1: x86/mm: add optional cache to GLA->GFN translation
>> 2: x86/mm: use optional cache in guest_walk_tables()
>> 3: x86/HVM: implement memory read caching
>> 4: x86/HVM: prefill cache with PDPTEs when possible
>>
>> "VMX: correct PDPTE load checks" is omitted from v2, as I can't
>> currently find enough time to carry out the requested further
>> rework.
> 
> Following the x86 call, I've had some thoughts and suggestions about how
> to make this work in a reasonable way, without resorting to the full
> caching approach.

Actually I now think I'll go that full caching (in the sense meant here)
route, see below, but of course things could easily be limited to just
the page table accesses in a first step.

> First and foremost, I'd like to recommend against trying to combine
> the fix for repeated PDPTR reading with that for repeated PTE reading.
> While they are both repeated-reading problems, one really is a knobbly
> corner case of 32-bit PAE paging, and the other is a general emulation
> problem.  Fixing these problems independently makes the result rather
> more simple, and far closer to how real CPUs work.

Well, that's a separate patch anyway. If you disagree with just
that part, then it can easily be left out (or be further refined),
especially as it's (intentionally) last in the series.

> For naming, how about "access once" in place of cache?  This is the best
> description of the purpose I can come up with.

For descriptive purposes that's fine, but I wouldn't want to introduce
an hvmemul_access_once structure type. I also didn't particularly like
the suggestions George made. If "cache" weren't already taken to mean
something else in computing, I'd still consider it the right term to
use. Right now I intend to call the thing hvmemul_data_buf or some
such, along the lines of insn_buf[], which holds the fetched
instruction bytes.
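To illustrate the idea (all names, sizes, and the helper below are hypothetical, not actual Xen code), such a buffer with the exact-match-only semantics of the current series might look roughly like:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a per-emulation buffer recording data read once,
 * analogous to insn_buf[] holding the fetched instruction bytes. */
#define DATA_BUF_ENTRIES 8

struct data_buf_ent {
    uint64_t gpa;      /* guest physical address of the access */
    unsigned int size; /* access size in bytes */
    uint8_t data[8];   /* recorded data, returned verbatim on re-execution */
};

struct hvmemul_data_buf {
    unsigned int num_ents;
    struct data_buf_ent ents[DATA_BUF_ENTRIES];
};

/* Replay recorded data for an exact (gpa, size) match (returning 1), or
 * record the freshly read data for later passes (returning 0). */
static int data_buf_access(struct hvmemul_data_buf *buf, uint64_t gpa,
                           void *data, unsigned int size)
{
    unsigned int i;

    for ( i = 0; i < buf->num_ents; i++ )
        if ( buf->ents[i].gpa == gpa && buf->ents[i].size == size )
        {
            memcpy(data, buf->ents[i].data, size); /* replay earlier read */
            return 1;
        }

    if ( buf->num_ents < DATA_BUF_ENTRIES && size <= sizeof(buf->ents[0].data) )
    {
        struct data_buf_ent *ent = &buf->ents[buf->num_ents++];

        ent->gpa = gpa;
        ent->size = size;
        memcpy(ent->data, data, size); /* record for re-execution */
    }
    return 0;
}
```

Only exact matches are satisfied, mirroring the simplistic implementation so far: no overlaps or partial reads.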

> Next, there should be a single hvmemul_read_once(gaddr, bytes, ...)
> (name subject to improvement), which does a transparent read-through of
> the "access once cache" in terms of a single flat guest physical
> address space.  This allows individual callers to opt into using the
> access-once semantics, and doesn't burden them with the substantial
> boilerplate of the sole correct way to use this interface.
> Furthermore, this behaviour has the same semantics as the correct
> longer term fix.

That's an option, perhaps, but it has the downside of needing to split
apart the combined linear->phys translation and memory access done by
linear_{read,write}(). Plus (again see below) hvmemul_read_once()
doesn't fit the page walking code very well, due to the open-coded
reads there, which in turn are helpful for the logic setting the A/D
bits. I'd like to do this slightly differently (in the page walking
code in particular, closer to what the implementation here was, i.e.
with buffered data maintenance separated from the actual memory
accesses). But before I go and try to re-implement this (in as
transparent a way as possible, while at the same time covering all
memory accesses, not just page table reads), I'd like to settle on a
few principles, after having thought more about our options and done a
few experiments.

The goal of this, as before, is not a performance improvement, but a
correctness guarantee: Upon re-execution of a previously partly
emulated insn (requiring e.g. device model assistance), all memory
accesses done so far have to return (reads) or take (writes) the
exact same data. This means that the original guest memory accesses
have to record their addresses and data. A fundamental assumption
here is that no instruction may update register state if it may
subsequently require re-execution (see below for where this is violated).

The transparency goal suggests that maintenance of this data should be
done at as low a layer as possible. However, not all guest memory
accesses result from instruction emulation, and any that don't may not
consume (and preferably should also not produce) buffered contents. That is,
maintenance has to either live above the
hvm_copy_{to,from}_guest_linear() layer (as that's what also backs
copy_{from,to}_user_hvm()), or there needs to be an indicator (function
parameter or struct vcpu flag) whether to access the buffered data.

As you didn't like the function parameter approach, I assume the
approach to take is the struct vcpu flag one, which could be set and
cleared in _hvm_emulate_one() along the lines of what the current
version of the series does for the num_ents field.
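A minimal sketch of that flag variant (struct and function names below are made up for illustration; in Xen the flag would live in struct vcpu and be toggled by _hvm_emulate_one()):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU state: buffered data is consulted only while an
 * instruction is being emulated, so guest memory accesses not resulting
 * from emulation bypass the buffer entirely. */
struct vcpu_sketch {
    bool data_buf_active;   /* set for the duration of emulation */
    unsigned int num_ents;  /* valid buffered entries */
};

/* Lower-level copy routines would test this instead of taking an extra
 * function parameter. */
static bool use_buffered_data(const struct vcpu_sketch *v)
{
    return v->data_buf_active;
}

static bool flag_seen_during_work;

static void sample_work(struct vcpu_sketch *v)
{
    flag_seen_during_work = use_buffered_data(v);
}

/* The emulation entry point toggles the flag around the actual work,
 * along the lines of what the series does for the num_ents field. */
static void emulate_one(struct vcpu_sketch *v,
                        void (*work)(struct vcpu_sketch *))
{
    v->data_buf_active = true;
    v->num_ents = 0;            /* fresh buffer for a new instruction */
    work(v);
    v->data_buf_active = false;
}
```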

Since guest page walks also need to be taken care of (in fact these are
the primary goal, as that's where the issue to fix was noticed), and
since guest_walk_tables() doesn't use lower level routines to access
guest memory, the function either needs to be converted to use them, or
the buffer accesses need to be open coded there, as was done in this
series up to now. Using the lower level routines would in particular
complicate set_ad_bits(), so I'm not currently intending to go that
route.
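A rough sketch of what the open-coded variant amounts to (all names below are invented; real PTE reads, PSE handling, and A/D bit logic are omitted): the buffer lookup and maintenance steps stay separate from the actual memory access, which is what keeps set_ad_bits() workable.

```c
#include <assert.h>
#include <stdint.h>

#define WALK_LEVELS 4

static uint64_t fake_memory[WALK_LEVELS];   /* stand-in for guest PTEs */

struct walk_buf {
    unsigned int num;
    struct { unsigned int level; uint64_t pte; } ents[WALK_LEVELS];
};

static uint64_t read_pte(struct walk_buf *buf, unsigned int level)
{
    unsigned int i;
    uint64_t pte;

    /* 1) Consult the buffer first: a repeated walk must see the exact
     *    same PTEs, yielding the exact same translation. */
    for ( i = 0; i < buf->num; i++ )
        if ( buf->ents[i].level == level )
            return buf->ents[i].pte;

    /* 2) Actual (open-coded) memory access. */
    pte = fake_memory[level];

    /* 3) Separate maintenance step: record what was read. */
    if ( buf->num < WALK_LEVELS )
    {
        buf->ents[buf->num].level = level;
        buf->ents[buf->num].pte = pte;
        buf->num++;
    }

    return pte;
}
```

The second assertion in the test below corresponds to the motivating bug: the guest paging out the buffer between two emulation passes must not change the translation seen on re-execution.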

Now on to the intended behavior of the buffer: If we want to mimic
actual hardware behavior as closely as possible, we have to distinguish
the ordinary data cache from TLB and paging structure caches. While the
former is coherent, the latter aren't. This in particular means e.g.
that while two distinct L<n> page table reads from the same physical
address may return the same data (because there can't be any
invalidation between the start of a single insn's execution and its
completion), the same may not be appropriate for two independent
ordinary data reads. Specifically, insns with multiple independent
memory operands (CMPS{B,D,Q,W}, V{,P}GATHER*) can observe different data
for the different accesses, even if all addresses are the same, as long
as another CPU manages to modify the memory location(s) quickly enough.

If we want to retain this behavior in emulation, we'll have to tag
memory accesses such that during re-execution correct association with
their earlier matching accesses is possible, and such that distinct
accesses would not consume data buffered by earlier ones. This tagging
can, I think, still be done transparently at the layer where the
buffered data gets maintained, except that memory accesses resulting
from page walks need to be recognizable, so they won't be treated the
same way. But as per above those accesses go through independent code
paths anyway, so this wouldn't be difficult to arrange for.
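One way such tagging could work (purely illustrative; keying by position in the instruction's access sequence is just one possible scheme) is to record a sequence tag alongside the address, so two independent reads of the same address (e.g. CMPSB with source and destination aliasing) get distinct entries and may legitimately observe different data:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ACCESSES 8

struct tagged_buf {
    unsigned int seq;   /* next access number, in (re-)execution order */
    unsigned int num;
    struct { unsigned int tag; uint64_t gpa; uint64_t data; } ents[MAX_ACCESSES];
};

/* "fresh" stands in for the actual memory contents at the time of the
 * access; it is used only when no matching entry was recorded yet. */
static uint64_t tagged_read(struct tagged_buf *buf, uint64_t gpa,
                            uint64_t fresh)
{
    unsigned int tag = buf->seq++;
    unsigned int i;

    /* Replay only the entry with the same tag AND address, so a distinct
     * access doesn't consume data buffered by an earlier one. */
    for ( i = 0; i < buf->num; i++ )
        if ( buf->ents[i].tag == tag && buf->ents[i].gpa == gpa )
            return buf->ents[i].data;

    if ( buf->num < MAX_ACCESSES )
    {
        buf->ents[buf->num].tag = tag;
        buf->ents[buf->num].gpa = gpa;
        buf->ents[buf->num].data = fresh;
        buf->num++;
    }
    return fresh;
}
```

Page table reads would bypass the tag check (matching on address alone), reflecting that paging structure caches, unlike the coherent data cache, may return stale data for the duration of the insn.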

But there's one possible caveat here: The way gathers currently get
handled in the insn emulator, X86EMUL_RETRY may have two different
meanings: It may either indicate that a read is awaiting device model
assistance, and hence re-execution will occur without exiting to guest
context, or it may indicate what we'd call a continuation if this was a
hypercall.
In either case the code updates certain register state (specifically
the register used as operation mask) to record successfully completed
parts. While this is fine when exiting back to guest context, it would
confuse the buffered data access logic, as the access pattern would no
longer match that seen during the first execution run.

As an aside I'd like to note that a possible misinterpretation on my
part of the description of how gathers work may mean that the
continuation-like exit-to-guest behavior is wrong altogether. I've requested
clarification from Intel. Should this need to change, we'll run into
capacity problems with struct hvm_vcpu_io's mmio_cache[]. But in the end
I hope to also be able to do away with mmio_cache[].

> For the PDPTRs, this corner case is special, and should be handled by
> the pagewalk code.  I'm still going to go with my previous suggestion of
> having top_map point onto the caller stack.  For the VT-x case, the
> values can be pulled straight out of the VMCS, while for AMD, the values
> can be read through the "access once cache", which matches the behaviour
> of being read from memory, but ensures they won't be repeatedly read.

What is different here from how the last patch already implements it?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

