
Re: [Xen-devel] Ongoing/future speculative mitigation work


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>, Xen-devel List <xen-devel@xxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Fri, 19 Oct 2018 13:17:11 +0100
  • Cc: Juergen Gross <JGross@xxxxxxxx>, Lars Kurth <lars.kurth@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Anthony Liguori <aliguori@xxxxxxxxxx>, Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Ross Philipson <ross.philipson@xxxxxxxxxx>, Daniel Kiper <daniel.kiper@xxxxxxxxxx>, Konrad Wilk <konrad.wilk@xxxxxxxxxx>, Marek Marczykowski <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Martin Pohlack <mpohlack@xxxxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, "Dannowski, Uwe" <uwed@xxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Mihai Donțu <mdontu@xxxxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>, Joao Martins <joao.m.martins@xxxxxxxxxx>, "Woodhouse, David" <dwmw@xxxxxxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • Delivery-date: Fri, 19 Oct 2018 12:17:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 19/10/18 09:09, Dario Faggioli wrote:
> On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
>> Hello,
>>
> Hey,
>
> This is very accurate and useful... thanks for it. :-)
>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which
>> is still experimental, and comes with a ~30% perf hit in the common
>> case), this is unavoidable.  Furthermore, throwing a few
>> array_index_nospec() into the code isn't a viable solution to the
>> problem.
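
To make the gadget shape concrete, here is a minimal, self-contained C
sketch (illustrative only, not actual Xen code; the clamp merely mirrors
the intent of array_index_nospec(), whose real implementation is an
architecture-specific branchless sequence):

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 16
    static uint64_t table[TABLE_SIZE];

    /*
     * Classic SP1 shape: the bounds check holds architecturally, but a
     * mispredicted branch can run the load speculatively with an
     * out-of-range, attacker-controlled index, pulling attacker-chosen
     * cache lines into the shared L1D.
     */
    uint64_t vulnerable_lookup(size_t idx)
    {
        if ( idx >= TABLE_SIZE )
            return 0;
        return table[idx];
    }

    /*
     * Hardened variant: clamp the index with a data dependency rather
     * than a branch, so the clamp also holds under speculation.
     */
    uint64_t hardened_lookup(size_t idx)
    {
        size_t mask;

        if ( idx >= TABLE_SIZE )
            return 0;

        /* All-ones when idx is in range, zero otherwise. */
        mask = (size_t)0 - (size_t)(idx < TABLE_SIZE);
        return table[idx & mask];
    }

The mask trick is the same basic idea the Xen and Linux
array_index_nospec() helpers are built around.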
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> [...]
>>
>> 2) Scheduler improvements.
>>
>> (I'm afraid this is rather more sparse because I'm less familiar with
>> the scheduler details.)
>>
>> At the moment, all of Xen's schedulers will happily put two vcpus
>> from different domains on sibling hyperthreads.  There has been a lot
>> of sidechannel research over the past decade demonstrating ways for
>> one thread to infer what is going on in the other, but L1TF is the
>> first vulnerability I'm aware of which allows one thread to directly
>> read data out of the other.
>>
>> Either way, it is now definitely a bad thing to run different guests
>> concurrently on siblings.  
>>
> Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
> first serious issues discovered so far; and, for instance, even on x86,
> not all Intel CPUs are affected and, AFAIK, none of the AMD ones are.

TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit and port contention, and so on.

> Therefore, although I certainly think we _must_ have the proper
> scheduler enhancements in place (and in fact I'm working on that :-D)
> it should IMO still be possible for the user to decide whether or not
> to use them (either by opting-in or opting-out, I don't care much at
> this stage).

I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.

>
>> Fixing this by simply not scheduling vcpus from a different guest on
>> siblings does result in lower resource utilisation, most notably when
>> there is an odd number of runnable vcpus in a domain, as the other
>> thread is forced to idle.
>>
> Right.
>
>> A step beyond this is core-aware scheduling, where we schedule in
>> units of a virtual core rather than a virtual thread.  This has much
>> better behaviour from the guest's point of view, as the
>> actually-scheduled topology remains consistent, but does potentially
>> come with even lower utilisation if every other thread in the guest
>> is idle.
>>
> Yes, basically, what you describe as 'core-aware scheduling' here can
> be built on top of what you had described above as 'not scheduling
> vcpus from different guests'.
>
> I mean, we can/should put ourselves in a position where the user can
> choose if he/she wants:
> - just 'plain scheduling', as we have now,
> - "just" that only vcpus of the same domains are scheduled on siblings
> hyperthread,
> - full 'core-aware scheduling', i.e., only vcpus that the guest
> actually sees as virtual hyperthread siblings, are scheduled on
> hardware hyperthread siblings.
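
For concreteness, the three options above could be expressed as a single
placement predicate along these lines (an illustrative sketch only; the
names and structures below are made up, not existing Xen interfaces):

    #include <stdbool.h>

    enum smt_policy {
        SMT_POLICY_ANY,          /* today's behaviour: no restriction */
        SMT_POLICY_SAME_DOMAIN,  /* only vcpus of one domain share a core */
        SMT_POLICY_CORE_AWARE,   /* only virtual siblings share a core */
    };

    struct vcpu_placement {
        unsigned int domain_id;
        unsigned int vcore_id;   /* virtual core the vcpu belongs to */
    };

    /* May 'candidate' run on a thread whose sibling is running 'running'? */
    static bool can_share_core(enum smt_policy policy,
                               const struct vcpu_placement *candidate,
                               const struct vcpu_placement *running)
    {
        switch ( policy )
        {
        case SMT_POLICY_ANY:
            return true;
        case SMT_POLICY_SAME_DOMAIN:
            return candidate->domain_id == running->domain_id;
        case SMT_POLICY_CORE_AWARE:
            return candidate->domain_id == running->domain_id &&
                   candidate->vcore_id == running->vcore_id;
        }
        return false;
    }

Each level is strictly more restrictive than the previous one, which fits
with building them incrementally.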
>
> About the performance impact: indeed, it's even higher with core-aware
> scheduling. Something that we can look into doing is acting on the
> guest scheduler, e.g., telling it to try to "pack the load" and keep
> siblings busy, instead of trying to avoid doing that (which is what
> happens by default in most cases).
>
> In Linux, this can be done by playing with the sched-flags (see, e.g.,
> https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20
> and /proc/sys/kernel/sched_domain/cpu*/domain*/flags).
>
> The idea would be to avoid, as much as possible, the case when "every
> other thread is idle in the guest". I'm not sure about being able to do
> something by default, but we can certainly document things (like "if
> you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
> in your Linux guests").
>
> I haven't checked whether other OSs' schedulers have something similar.
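
As a hedged illustration of the guest-side knob being discussed (no
particular flag value is being recommended; which SD_* bits matter
depends on the guest kernel, see the topology.h link above), a guest
admin or tool could at least enumerate the current sched_domain flags
like this:

    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        glob_t g;

        /* Per-domain flags files exposed by Linux (v4.18 era). */
        if ( glob("/proc/sys/kernel/sched_domain/cpu0/domain*/flags",
                  0, NULL, &g) != 0 )
            return 1;

        for ( size_t i = 0; i < g.gl_pathc; i++ )
        {
            FILE *f = fopen(g.gl_pathv[i], "r");
            unsigned long flags = 0;

            if ( f && fscanf(f, "%lu", &flags) == 1 )
                printf("%s = %#lx\n", g.gl_pathv[i], flags);
            if ( f )
                fclose(f);
        }

        globfree(&g);
        return 0;
    }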
>
>> A side requirement for core-aware scheduling is for Xen to have an
>> accurate idea of the topology presented to the guest.  I need to dust
>> off my Toolstack CPUID/MSR improvement series and get that upstream.
>>
> Indeed. Without knowing which of the guest's vcpus are to be
> considered virtual hyperthread siblings, I can only get you as far as
> "only scheduling vcpus of the same domain on sibling hyperthreads". :-)
>
>> One of the most insidious problems with L1TF is that, with
>> hyperthreading enabled, a malicious guest kernel can engineer
>> arbitrary data leakage by having one thread scanning the expected
>> physical address, and the other thread using an arbitrary cache-load
>> gadget in hypervisor context.  This occurs because the L1 data cache
>> is shared by threads.
>>
> Right. So, sorry if this is a stupid question, but how does this relate
> to the "secret-free hypervisor", and to the "if a piece of memory
> isn't mapped, it can't be loaded into the cache" point?
>
> So, basically, I'm asking whether I am understanding it correctly that
> secret-free Xen + core-aware scheduling would *not* be enough for
> mitigating L1TF properly (and if the answer is no, why... but only if
> you have 5 mins to explain it to me :-P).
>
> In fact, ISTR that core-scheduling, plus something that looked to me
> similar enough to "secret-free Xen", is how Microsoft claims to be
> mitigating L1TF on Hyper-V...

Correct - that is what Hyper-V appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1 and the risk of future issues, but does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe against L1TF
without a secret-free Xen is to have the synchronised entry/exit logic
working.
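
To make "synchronised entry/exit" a little more concrete, this is
roughly the shape of the per-core rendezvous the idea implies (purely an
illustrative sketch, not the eventual implementation; real logic would
also need IPIs to force the sibling out of the guest, interrupt handling
and careful memory ordering):

    #include <stdatomic.h>

    #define THREADS_PER_CORE 2

    /* One instance shared by the two hyperthreads of a core. */
    struct core_rendezvous {
        atomic_uint arrived;
        atomic_uint generation;
    };

    /*
     * Each sibling calls this just before vmentry and again right after
     * vmexit, so the core never runs guest code on one thread while the
     * other is executing Xen code (and filling the shared L1D).
     */
    static void sync_core_siblings(struct core_rendezvous *r)
    {
        unsigned int gen = atomic_load(&r->generation);

        if ( atomic_fetch_add(&r->arrived, 1) + 1 == THREADS_PER_CORE )
        {
            /* Last sibling to arrive: reset and release the core. */
            atomic_store(&r->arrived, 0);
            atomic_fetch_add(&r->generation, 1);
        }
        else
        {
            /* Wait for the sibling (cpu_relax()/pause in real code). */
            while ( atomic_load(&r->generation) == gen )
                ;
        }
    }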

>> A solution to this issue was proposed, whereby Xen synchronises
>> siblings on vmexit/entry, so we are never executing code in two
>> different privilege levels.  Getting this working would make it safe
>> to continue using hyperthreading even in the presence of L1TF.
>>
> Err... ok, but we still want core-aware scheduling, or at least we want
> to avoid having vcpus from different domains on siblings, don't we? In
> order to avoid leaks between guests, I mean.

Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

