Xen project Mailing List

Re: [Xen-devel] [PATCH v4 4/7] x86/HVM: implement memory read caching for insn emulation

To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 3 Feb 2020 19:48:52 +0000

Authentication-results: esa5.hc3370-68.iphmx.com; dkim=none (message not signed) header.i=none; spf=None smtp.pra=andrew.cooper3@xxxxxxxxxx; spf=Pass smtp.mailfrom=Andrew.Cooper3@xxxxxxxxxx; spf=None smtp.helo=postmaster@xxxxxxxxxxxxxxx

Autocrypt: addr=andrew.cooper3@xxxxxxxxxx; prefer-encrypt=mutual; keydata= mQINBFLhNn8BEADVhE+Hb8i0GV6mihnnr/uiQQdPF8kUoFzCOPXkf7jQ5sLYeJa0cQi6Penp VtiFYznTairnVsN5J+ujSTIb+OlMSJUWV4opS7WVNnxHbFTPYZVQ3erv7NKc2iVizCRZ2Kxn srM1oPXWRic8BIAdYOKOloF2300SL/bIpeD+x7h3w9B/qez7nOin5NzkxgFoaUeIal12pXSR Q354FKFoy6Vh96gc4VRqte3jw8mPuJQpfws+Pb+swvSf/i1q1+1I4jsRQQh2m6OTADHIqg2E ofTYAEh7R5HfPx0EXoEDMdRjOeKn8+vvkAwhviWXTHlG3R1QkbE5M/oywnZ83udJmi+lxjJ5 YhQ5IzomvJ16H0Bq+TLyVLO/VRksp1VR9HxCzItLNCS8PdpYYz5TC204ViycobYU65WMpzWe LFAGn8jSS25XIpqv0Y9k87dLbctKKA14Ifw2kq5OIVu2FuX+3i446JOa2vpCI9GcjCzi3oHV e00bzYiHMIl0FICrNJU0Kjho8pdo0m2uxkn6SYEpogAy9pnatUlO+erL4LqFUO7GXSdBRbw5 gNt25XTLdSFuZtMxkY3tq8MFss5QnjhehCVPEpE6y9ZjI4XB8ad1G4oBHVGK5LMsvg22PfMJ ISWFSHoF/B5+lHkCKWkFxZ0gZn33ju5n6/FOdEx4B8cMJt+cWwARAQABtClBbmRyZXcgQ29v cGVyIDxhbmRyZXcuY29vcGVyM0BjaXRyaXguY29tPokCOgQTAQgAJAIbAwULCQgHAwUVCgkI CwUWAgMBAAIeAQIXgAUCWKD95wIZAQAKCRBlw/kGpdefoHbdD/9AIoR3k6fKl+RFiFpyAhvO 59ttDFI7nIAnlYngev2XUR3acFElJATHSDO0ju+hqWqAb8kVijXLops0gOfqt3VPZq9cuHlh IMDquatGLzAadfFx2eQYIYT+FYuMoPZy/aTUazmJIDVxP7L383grjIkn+7tAv+qeDfE+txL4 SAm1UHNvmdfgL2/lcmL3xRh7sub3nJilM93RWX1Pe5LBSDXO45uzCGEdst6uSlzYR/MEr+5Z JQQ32JV64zwvf/aKaagSQSQMYNX9JFgfZ3TKWC1KJQbX5ssoX/5hNLqxMcZV3TN7kU8I3kjK mPec9+1nECOjjJSO/h4P0sBZyIUGfguwzhEeGf4sMCuSEM4xjCnwiBwftR17sr0spYcOpqET ZGcAmyYcNjy6CYadNCnfR40vhhWuCfNCBzWnUW0lFoo12wb0YnzoOLjvfD6OL3JjIUJNOmJy RCsJ5IA/Iz33RhSVRmROu+TztwuThClw63g7+hoyewv7BemKyuU6FTVhjjW+XUWmS/FzknSi dAG+insr0746cTPpSkGl3KAXeWDGJzve7/SBBfyznWCMGaf8E2P1oOdIZRxHgWj0zNr1+ooF /PzgLPiCI4OMUttTlEKChgbUTQ+5o0P080JojqfXwbPAyumbaYcQNiH1/xYbJdOFSiBv9rpt TQTBLzDKXok86LkCDQRS4TZ/ARAAkgqudHsp+hd82UVkvgnlqZjzz2vyrYfz7bkPtXaGb9H4 Rfo7mQsEQavEBdWWjbga6eMnDqtu+FC+qeTGYebToxEyp2lKDSoAsvt8w82tIlP/EbmRbDVn 7bhjBlfRcFjVYw8uVDPptT0TV47vpoCVkTwcyb6OltJrvg/QzV9f07DJswuda1JH3/qvYu0p vjPnYvCq4NsqY2XSdAJ02HrdYPFtNyPEntu1n1KK+gJrstjtw7KsZ4ygXYrsm/oCBiVW/OgU g/XIlGErkrxe4vQvJyVwg6YH653YTX5hLLUEL1NS4TCo47RP+wi6y+TnuAL36UtK/uFyEuPy wwrDVcC4cIFhYSfsO0BumEI65yu7a8aHbGfq2lW251UcoU48Z27ZUUZd2Dr6O/n8poQHbaTd 6bJJSjzGGHZVbRP9UQ3lkmkmc0+XCHmj5WhwNNYjgbbmML7y0fsJT5RgvefAIFfHBg7fTY/i kBEimoUsTEQz+N4hbKwo1hULfVxDJStE4sbPhjbsPCrlXf6W9CxSyQ0qmZ2bXsLQYRj2xqd1 bpA+1o1j2N4/au1R/uSiUFjewJdT/LX1EklKDcQwpk06Af/N7VZtSfEJeRV04unbsKVXWZAk uAJyDDKN99ziC0Wz5kcPyVD1HNf8bgaqGDzrv3TfYjwqayRFcMf7xJaL9xXedMcAEQEAAYkC HwQYAQgACQUCUuE2fwIbDAAKCRBlw/kGpdefoG4XEACD1Qf/er8EA7g23HMxYWd3FXHThrVQ HgiGdk5Yh632vjOm9L4sd/GCEACVQKjsu98e8o3ysitFlznEns5EAAXEbITrgKWXDDUWGYxd pnjj2u+GkVdsOAGk0kxczX6s+VRBhpbBI2PWnOsRJgU2n10PZ3mZD4Xu9kU2IXYmuW+e5KCA vTArRUdCrAtIa1k01sPipPPw6dfxx2e5asy21YOytzxuWFfJTGnVxZZSCyLUO83sh6OZhJkk b9rxL9wPmpN/t2IPaEKoAc0FTQZS36wAMOXkBh24PQ9gaLJvfPKpNzGD8XWR5HHF0NLIJhgg 4ZlEXQ2fVp3XrtocHqhu4UZR4koCijgB8sB7Tb0GCpwK+C4UePdFLfhKyRdSXuvY3AHJd4CP 4JzW0Bzq/WXY3XMOzUTYApGQpnUpdOmuQSfpV9MQO+/jo7r6yPbxT7CwRS5dcQPzUiuHLK9i nvjREdh84qycnx0/6dDroYhp0DFv4udxuAvt1h4wGwTPRQZerSm4xaYegEFusyhbZrI0U9tJ B8WrhBLXDiYlyJT6zOV2yZFuW47VrLsjYnHwn27hmxTC/7tvG3euCklmkn9Sl9IAKFu29RSo d5bD8kMSCYsTqtTfT6W4A3qHGvIDta3ptLYpIAOD2sY3GYq2nf3Bbzx81wZK14JdDDHUX2Rs 6+ahAA==

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Paul Durrant <paul@xxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>

Delivery-date: Mon, 03 Feb 2020 19:49:10 +0000

Ironport-sdr: wNggGSVAacpR+iX8hwEJ2dWBZT2TbuJfOgI0Y6Qv0h7f17fITorjHcZigv6YveFcmlnMsmHLKe 5Ol1bxwJl/J39n3abjlwtkjSaD0K+eeYI/R/jFHc6gBnw3l5+E58Xz2jZw8DwcmKsC12PXSed1 tXTV3WESP4FEx3xfvi7fnVKdz9e1mCWy4eZpntJx9ucEjjpB9yS13jjSB0LO+5qY/4jHz8noep ESDQeYO4Ttz9NThFDOVixcvY37UmKPAKwU04QTL1RFYW1YCTWtIVXOJF5vxpqE91imWDnBg4xr nrg=

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Openpgp: preference=signencrypt

On 31/01/2020 16:44, Jan Beulich wrote: > Emulation requiring device model assistance uses a form of instruction > re-execution, assuming that the second (and any further) pass takes > exactly the same path. This is a valid assumption as far as use of CPU > registers goes (as those can't change without any other instruction > executing in between), but is wrong for memory accesses. This statement isn't quite accurate. Various hypercalls and IPIs can play with GPR state in the middle of emulation. What matters is that the GPR values aren't expected to change, and a guest gets to keep all the pieces if they try to play games in this area. > In particular > it has been observed that Windows might page out buffers underneath an > instruction currently under emulation (hitting between two passes). If > the first pass read a memory operand successfully, any subsequent pass > needs to get to see the exact same value. > > Introduce a cache to make sure above described assumption holds. This > is a very simplistic implementation for now: Only exact matches are > satisfied (no overlaps or partial reads or anything); this is sufficient > for the immediate purpose of making re-execution an exact replay. The > cache also won't be used just yet for guest page walks; that'll be the > subject of a subsequent change. > > With the cache being generally transparent to upper layers, but with it > having limited capacity yet being required for correctness, certain > users of hvm_copy_from_guest_*() need to disable caching temporarily, > without invalidating the cache. Note that the adjustments here to > hvm_hypercall() and hvm_task_switch() are benign at this point; they'll > become relevant once we start to be able to emulate respective insns > through the main emulator (and more changes will then likely be needed > to nested code). > > As to the actual data page in this scenario, there are a couple of > aspects to take into consideration: > - We must be talking about an insn accessing two locations (two memory > ones, one of which is MMIO, or a memory and an I/O one). Really? We're talking about all instructions, even without any memory operands, because at the very minimum we cache the instruction stream, and (with the following patch), the pagewalk to it. > - If the non I/O / MMIO side is being read, the re-read (if it occurs at > all) is having its result discarded, by taking the shortcut through > the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note > how, among all the re-issue sanity checks there, we avoid comparing > the actual data. > - If the non I/O / MMIO side is being written, it is the OSes > responsibility to avoid actually moving page contents to disk while > there might still be a write access in flight - this is no different > in behavior from bare hardware. > - Read-modify-write accesses are, as always, complicated, and while we > deal with them better nowadays than we did in the past, we're still > not quite there to guarantee hardware like behavior in all cases > anyway. Nothing is getting worse by the changes made here, afaict. > > In __hvm_copy() also reduce p's scope and change its type to void *. > > Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> > --- > TBD: In principle the caching here yields unnecessary the one used for > insn bytes (vio->mmio_insn{,_bytes}. However, to seed the cache > with the data SVM may have made available, we'd have to also know > the corresponding GPA. It's not safe, however, to re-walk the page > tables to find out, as the page tables may have changed in the > meantime. Therefore I guess we need to keep the duplicate > functionality for now. A possible solution to this could be to use > a physical-address-based cache for page table accesses (and looking > forward also e.g. SVM/VMX insn emulation), and a linear-address- > based one for all other reads. Splitting caching like that will re-introduce the same bugs I pointed out in earlier revisions of this series. It is not correct to have multiple (and therefore, non-coherent) caches of memory. The AMD instruction stream bytes support is by no means perfect either. It doesn't function properly when SMAP is active (Erratum #1096), which is actually caused by the instruction stream being re-fetched at vmexit time, with a data side access (hence the interaction with SMAP). Just as with the emulation for non-Nrips hardware, the contents of the instruction buffer (if valid) need cross referencing with the vmexit reason to spot races, and the same checking would make it safe to deduplicate the caching. > --- a/xen/arch/x86/hvm/emulate.c > +++ b/xen/arch/x86/hvm/emulate.c > @@ -28,6 +28,17 @@ > #include <asm/iocap.h> > #include <asm/vm_event.h> > > +struct hvmemul_cache > +{ This needs a comment (see below.) > + unsigned int num_ents; > + unsigned int max_ents; > + struct { > + paddr_t gpa:PADDR_BITS; > + unsigned int size:BITS_PER_LONG - PADDR_BITS; This would probably result in rather better code if size was 8 bits rather than 12. Longer term, the size field should disappear and functionality adjusted to always read aligned 8 (or larger) byte values. That way, the cache behaves much more like caches in real processors. > @@ -2838,6 +2868,123 @@ void hvm_dump_emulation_state(const char > hvmemul_ctxt->insn_buf); > } > > +int hvmemul_cache_init(struct vcpu *v) > +{ > + /* > + * No insn can access more than 16 independent linear addresses (AVX512F > + * scatters/gathers being the worst). It is at least 5 more than the worse case number of regular operands, to cover exceptions, and therefore accesses into the IDT/GDT/LDT and TR and stack. You can manage that with a 32bit guest with an exception, %cs and %ss in different tables, and having to cross a page boundary when pushing the esp0 exception frame. It is one fewer generally in 64bit mode, because there are no %ss references to deal with, but both cases get even more complicated for a contributory exception, as the second %cs can be different to %cs for the first exception. > Each such linear range can span a > + * page boundary, i.e. may require two page walks. Account for each insn > + * byte individually. Why are we counting 5 entries for every instruction byte? Given an 8-byte maximum cached size, we need at most 2 entries to cover a single instruction. > + */ > + const unsigned int nents = (CONFIG_PAGING_LEVELS + 1) * > + (MAX_INST_LEN + 16 * 2); > + struct hvmemul_cache *cache = xmalloc_flex_struct(struct hvmemul_cache, > + ents, nents); > + > + if ( !cache ) > + return -ENOMEM; > + > + /* Cache is disabled initially. */ > + cache->num_ents = nents + 1; > + cache->max_ents = nents; (Below:) You need to document the internal semantics (probably best in the struct definition.) I thought I'd reverse engineered how disabling works (starting from the completely obscure hvmemul_cache_disabled()), but given that the cache is apparently disabled by default, I now can't figure out how it ever gets enabled. I can't see any code path where num_ents ends up lower than max_ents, and the read/write helpers below don't take their early exit path. > +bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa, > + void *buffer, unsigned int size) > +{ > + const struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache; > + unsigned int i; > + > + /* Cache unavailable? */ > + if ( cache->num_ents > cache->max_ents ) > + return false; > + > + while ( size > sizeof(cache->ents->data) ) > + { > + i = gpa & (sizeof(cache->ents->data) - 1) > + ? -gpa & (sizeof(cache->ents->data) - 1) > + : sizeof(cache->ents->data); > + if ( !hvmemul_read_cache(v, gpa, buffer, i) ) > + return false; What is this call trying to achieve? > + gpa += i; > + buffer += i; > + size -= i; > + } > + > + for ( i = 0; i < cache->num_ents; ++i ) > + if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size ) With num_ents currently topping out at 235 (and needing to increase anyway), this can end up being a very long loop. Given that it will be populated with higher level PTEs to being with, the common case of accessing real data will be at the end of the O(n) search rather than the start. The cache is a single block of memory, so it is almost certainly better to keep it sorted. That said, it is problematic due to the overlap of sizes, which I'm still concerned concerned about generally. > --- a/xen/include/asm-x86/hvm/emulate.h > +++ b/xen/include/asm-x86/hvm/emulate.h > @@ -13,6 +13,7 @@ > #define __ASM_X86_HVM_EMULATE_H__ > > #include <xen/err.h> > +#include <xen/sched.h> > #include <asm/hvm/hvm.h> > #include <asm/x86_emulate.h> > > @@ -96,6 +97,31 @@ int hvmemul_do_pio_buffer(uint16_t port, > uint8_t dir, > void *buffer); > > +#ifdef CONFIG_HVM This needs a comment block stating, at a minimum, the semantics and expected use for the cache. Otherwise, this is ~150 lines of totally undocumented and practically uncommented, critical and subtle infrastructure. ~Andrew > +int __must_check hvmemul_cache_init(struct vcpu *v); > +static inline void hvmemul_cache_destroy(struct vcpu *v) > +{ > + XFREE(v->arch.hvm.hvm_io.cache); > +} > +bool hvmemul_read_cache(const struct vcpu *, paddr_t gpa, > + void *buffer, unsigned int size); > +void hvmemul_write_cache(const struct vcpu *, paddr_t gpa, > + const void *buffer, unsigned int size); > +unsigned int hvmemul_cache_disable(struct vcpu *); > +void hvmemul_cache_restore(struct vcpu *, unsigned int token); > +/* For use in ASSERT()s only: */ > +static inline bool hvmemul_cache_disabled(struct vcpu *v) > +{ > + return hvmemul_cache_disable(v) == hvmemul_cache_disable(v); > +} > +#else > +static inline bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa, > + void *buf, > + unsigned int size) { return false; } > +static inline void hvmemul_write_cache(const struct vcpu *v, paddr_t gpa, > + const void *buf, unsigned int size) {} > +#endif > + > void hvm_dump_emulation_state(const char *loglvl, const char *prefix, > struct hvm_emulate_ctxt *hvmemul_ctxt, int rc); > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.