
Re: [Xen-devel] Ongoing/future speculative mitigation work


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>, Xen-devel List <xen-devel@xxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Fri, 19 Oct 2018 13:17:11 +0100
  • Cc: Juergen Gross <JGross@xxxxxxxx>, Lars Kurth <lars.kurth@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Anthony Liguori <aliguori@xxxxxxxxxx>, Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Ross Philipson <ross.philipson@xxxxxxxxxx>, Daniel Kiper <daniel.kiper@xxxxxxxxxx>, Konrad Wilk <konrad.wilk@xxxxxxxxxx>, Marek Marczykowski <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Martin Pohlack <mpohlack@xxxxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, "Dannowski, Uwe" <uwed@xxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Mihai Donțu <mdontu@xxxxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>, Joao Martins <joao.m.martins@xxxxxxxxxx>, "Woodhouse, David" <dwmw@xxxxxxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • Delivery-date: Fri, 19 Oct 2018 12:17:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 19/10/18 09:09, Dario Faggioli wrote:
> On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
>> Hello,
>>
> Hey,
>
> This is very accurate and useful... thanks for it. :-)
>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which
>> is still experimental, and comes with a ~30% perf hit in the common
>> case), this is unavoidable.  Furthermore, throwing a few
>> array_index_nospec() into the code isn't a viable solution to the
>> problem.
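
To make the gadget shape concrete, here is a minimal, self-contained C
sketch (illustrative only, not actual Xen code; the clamp merely mirrors
the intent of array_index_nospec(), whose real implementation is an
architecture-specific branchless sequence):

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 16
    static uint64_t table[TABLE_SIZE];

    /*
     * Classic SP1 shape: the bounds check holds architecturally, but a
     * mispredicted branch can run the load speculatively with an
     * out-of-range, attacker-controlled index, pulling attacker-chosen
     * cache lines into the shared L1D.
     */
    uint64_t vulnerable_lookup(size_t idx)
    {
        if ( idx >= TABLE_SIZE )
            return 0;
        return table[idx];
    }

    /*
     * Hardened variant: clamp the index with a data dependency rather
     * than a branch, so the clamp also holds under speculation.
     */
    uint64_t hardened_lookup(size_t idx)
    {
        size_t mask;

        if ( idx >= TABLE_SIZE )
            return 0;

        /* All-ones when idx is in range, zero otherwise. */
        mask = (size_t)0 - (size_t)(idx < TABLE_SIZE);
        return table[idx & mask];
    }

The mask trick is the same basic idea the Xen and Linux
array_index_nospec() helpers are built around.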
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> [...]
>>
>> 2) Scheduler improvements.
>>
>> (I'm afraid this is rather more sparse because I'm less familiar with
>> the scheduler details.)
>>
>> At the moment, all of Xen's schedulers will happily put two vcpus
>> from different domains on sibling hyperthreads.  There has been a lot
>> of sidechannel research over the past decade demonstrating ways for
>> one thread to infer what is going on in the other, but L1TF is the
>> first vulnerability I'm aware of which allows one thread to directly
>> read data out of the other.
>>
>> Either way, it is now definitely a bad thing to run different guests
>> concurrently on siblings.  
>>
> Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
> first serious issues discovered so far; and, for instance, even on x86,
> not all Intel CPUs are affected and, AFAIK, none of the AMD ones are.

TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit and port contention, and so on.

> Therefore, although I certainly think we _must_ have the proper
> scheduler enhancements in place (and in fact I'm working on that :-D)
> it should IMO still be possible for the user to decide whether or not
> to use them (either by opting-in or opting-out, I don't care much at
> this stage).

I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.

>
>> Fixing this by simply not scheduling vcpus from a different guest on
>> siblings does result in lower resource utilisation, most notably when
>> there is an odd number of runnable vcpus in a domain, as the other
>> thread is forced to idle.
>>
> Right.
>
>> A step beyond this is core-aware scheduling, where we schedule in
>> units of a virtual core rather than a virtual thread.  This has much
>> better behaviour from the guest's point of view, as the
>> actually-scheduled topology remains consistent, but does potentially
>> come with even lower utilisation if every other thread in the guest
>> is idle.
>>
> Yes, basically, what you describe as 'core-aware scheduling' here can
> be built on top of what you had described above as 'not scheduling
> vcpus from different guests'.
>
> I mean, we can/should put ourselves in a position where the user can
> choose if he/she wants:
> - just 'plain scheduling', as we have now,
> - "just" that only vcpus of the same domains are scheduled on siblings
> hyperthread,
> - full 'core-aware scheduling', i.e., only vcpus that the guest
> actually sees as virtual hyperthread siblings, are scheduled on
> hardware hyperthread siblings.
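
For concreteness, the three options above could be expressed as a single
placement predicate along these lines (an illustrative sketch only; the
names and structures below are made up, not existing Xen interfaces):

    #include <stdbool.h>

    enum smt_policy {
        SMT_POLICY_ANY,          /* today's behaviour: no restriction */
        SMT_POLICY_SAME_DOMAIN,  /* only vcpus of one domain share a core */
        SMT_POLICY_CORE_AWARE,   /* only virtual siblings share a core */
    };

    struct vcpu_placement {
        unsigned int domain_id;
        unsigned int vcore_id;   /* virtual core the vcpu belongs to */
    };

    /* May 'candidate' run on a thread whose sibling is running 'running'? */
    static bool can_share_core(enum smt_policy policy,
                               const struct vcpu_placement *candidate,
                               const struct vcpu_placement *running)
    {
        switch ( policy )
        {
        case SMT_POLICY_ANY:
            return true;
        case SMT_POLICY_SAME_DOMAIN:
            return candidate->domain_id == running->domain_id;
        case SMT_POLICY_CORE_AWARE:
            return candidate->domain_id == running->domain_id &&
                   candidate->vcore_id == running->vcore_id;
        }
        return false;
    }

Each level is strictly more restrictive than the previous one, which fits
with building them incrementally.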
>
> About the performance impact: indeed, it's even higher with core-aware
> scheduling. Something that we can look into doing is acting on the
> guest scheduler, e.g., telling it to try to "pack the load" and keep
> siblings busy, instead of trying to avoid doing that (which is what
> happens by default in most cases).
>
> In Linux, this can be done by playing with the sched-flags (see, e.g.,
> https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20
> and /proc/sys/kernel/sched_domain/cpu*/domain*/flags).
>
> The idea would be to avoid, as much as possible, the case when "every
> other thread is idle in the guest". I'm not sure about being able to do
> something by default, but we can certainly document things (like "if
> you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
> in your Linux guests").
>
> I haven't checked whether other OSs' schedulers have something similar.
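
As a hedged illustration of the guest-side knob being discussed (no
particular flag value is being recommended; which SD_* bits matter
depends on the guest kernel, see the topology.h link above), a guest
admin or tool could at least enumerate the current sched_domain flags
like this:

    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        glob_t g;

        /* Per-domain flags files exposed by Linux (v4.18 era). */
        if ( glob("/proc/sys/kernel/sched_domain/cpu0/domain*/flags",
                  0, NULL, &g) != 0 )
            return 1;

        for ( size_t i = 0; i < g.gl_pathc; i++ )
        {
            FILE *f = fopen(g.gl_pathv[i], "r");
            unsigned long flags = 0;

            if ( f && fscanf(f, "%lu", &flags) == 1 )
                printf("%s = %#lx\n", g.gl_pathv[i], flags);
            if ( f )
                fclose(f);
        }

        globfree(&g);
        return 0;
    }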
>
>> A side requirement for core-aware scheduling is for Xen to have an
>> accurate idea of the topology presented to the guest.  I need to dust
>> off my Toolstack CPUID/MSR improvement series and get that upstream.
>>
> Indeed. Without knowing which of the guest's vcpus are to be
> considered virtual hyperthread siblings, I can only get you as far as
> "only scheduling vcpus of the same domain on sibling hyperthreads". :-)
>
>> One of the most insidious problems with L1TF is that, with
>> hyperthreading enabled, a malicious guest kernel can engineer
>> arbitrary data leakage by having one thread scanning the expected
>> physical address, and the other thread using an arbitrary cache-load
>> gadget in hypervisor context.  This occurs because the L1 data cache
>> is shared by threads.
>>
> Right. So, sorry if this is a stupid question, but how does this relate
> to the "secret-free hypervisor", and to the "if a piece of memory
> isn't mapped, it can't be loaded into the cache" point?
>
> So, basically, I'm asking whether I am understanding it correctly that
> secret-free Xen + core-aware scheduling would *not* be enough for
> mitigating L1TF properly (and if the answer is no, why... but only if
> you have 5 mins to explain it to me :-P).
>
> In fact, ISTR that core-scheduling, plus something that looked to me
> similar enough to "secret-free Xen", is how Microsoft claims to be
> mitigating L1TF on Hyper-V...

Correct - that is what Hyper-V appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1 and the risk of future issues, but does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe against L1TF
without a secret-free Xen is to have the synchronised entry/exit logic
working.
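
To make "synchronised entry/exit" a little more concrete, this is
roughly the shape of the per-core rendezvous the idea implies (purely an
illustrative sketch, not the eventual implementation; real logic would
also need IPIs to force the sibling out of the guest, interrupt handling
and careful memory ordering):

    #include <stdatomic.h>

    #define THREADS_PER_CORE 2

    /* One instance shared by the two hyperthreads of a core. */
    struct core_rendezvous {
        atomic_uint arrived;
        atomic_uint generation;
    };

    /*
     * Each sibling calls this just before vmentry and again right after
     * vmexit, so the core never runs guest code on one thread while the
     * other is executing Xen code (and filling the shared L1D).
     */
    static void sync_core_siblings(struct core_rendezvous *r)
    {
        unsigned int gen = atomic_load(&r->generation);

        if ( atomic_fetch_add(&r->arrived, 1) + 1 == THREADS_PER_CORE )
        {
            /* Last sibling to arrive: reset and release the core. */
            atomic_store(&r->arrived, 0);
            atomic_fetch_add(&r->generation, 1);
        }
        else
        {
            /* Wait for the sibling (cpu_relax()/pause in real code). */
            while ( atomic_load(&r->generation) == gen )
                ;
        }
    }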

>> A solution to this issue was proposed, whereby Xen synchronises
>> siblings on vmexit/entry, so we are never executing code in two
>> different privilege levels.  Getting this working would make it safe
>> to continue using hyperthreading even in the presence of L1TF.
>>
> Err... ok, but we still want core-aware scheduling, or at least we want
> to avoid having vcpus from different domains on siblings, don't we? In
> order to avoid leaks between guests, I mean.

Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

