
[Xen-devel] Ongoing/future speculative mitigation work

  • To: Xen-devel List <xen-devel@xxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Thu, 18 Oct 2018 18:46:22 +0100
  • Cc: Martin Pohlack <mpohlack@xxxxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Joao Martins <joao.m.martins@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Daniel Kiper <daniel.kiper@xxxxxxxxxx>, Marek Marczykowski <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Anthony Liguori <aliguori@xxxxxxxxxx>, "Dannowski, Uwe" <uwed@xxxxxxxxx>, Lars Kurth <lars.kurth@xxxxxxxxxx>, Konrad Wilk <konrad.wilk@xxxxxxxxxx>, Ross Philipson <ross.philipson@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Juergen Gross <JGross@xxxxxxxx>, Sergey Dyasli <sergey.dyasli@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Mihai Donțu <mdontu@xxxxxxxxxxxxxxx>, "Woodhouse, David" <dwmw@xxxxxxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • Delivery-date: Thu, 18 Oct 2018 17:46:36 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt


This is an accumulation and summary of various tasks which have been
discussed since the revelation of the speculative security issues in
January, and also an invitation to discuss alternative ideas.  They are
x86 specific, but a lot of the principles are architecture-agnostic.

1) A secrets-free hypervisor.

Basically every hypercall can be (ab)used by a guest, and used as an
arbitrary cache-load gadget.  Logically, this is the first half of a
Spectre SP1 gadget, and is usually the first stepping stone to
exploiting one of the speculative sidechannels.

Short of compiling Xen with LLVM's Speculative Load Hardening (which is
still experimental, and comes with a ~30% perf hit in the common case),
this is unavoidable.  Furthermore, throwing a few array_index_nospec()
into the code isn't a viable solution to the problem.
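For readers unfamiliar with the technique, here is a minimal sketch of what a single index-masking site looks like.  It is not Xen's or Linux's actual implementation: index_mask() and read_clamped() are hypothetical names, and real implementations add compiler barriers or inline assembly so the optimiser cannot turn the mask back into a branch.  The point is that each such site hardens exactly one array access, which is why sprinkling a few of them around does not close the general problem.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns all-ones when idx < size and zero otherwise,
 * with no conditional branch in the source.  Requires 0 < size <= LONG_MAX
 * and relies on arithmetic right shift of negative values, which is
 * implementation-defined in C but is what GCC and Clang provide.
 */
static unsigned long index_mask(unsigned long idx, unsigned long size)
{
    /* The top bit of (idx | (size - 1 - idx)) is set iff idx >= size. */
    return (unsigned long)(~(long)(idx | (size - 1UL - idx)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds-checked read whose index is also clamped under misspeculation. */
static int read_clamped(const int *table, unsigned long size, unsigned long idx)
{
    assert(size > 0 && size <= LONG_MAX);

    if (idx >= size)
        return -1;

    /* If the branch above is mispredicted, the mask forces idx to 0. */
    idx &= index_mask(idx, size);
    return table[idx];
}
```

Every guest-controlled index in every hypercall path would need this treatment individually, and a single missed site is enough for an attacker.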

An alternative option is to have less data mapped into Xen's virtual
address space - if a piece of memory isn't mapped, it can't be loaded
into the cache.

An easy first step here is to remove Xen's directmap, which will mean
that guests' general RAM isn't mapped by default into Xen's address
space.  This will come with some performance hit, as the
map_domain_page() infrastructure will now have to actually
create/destroy mappings, but removing the directmap will cause an
improvement for non-speculative security as well (No possibility of
ret2dir as an exploit technique).
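The trade-off can be illustrated with a toy model.  This is only a sketch under stated assumptions: the toy_* names and the 4-slot mapping area are invented for illustration and do not reflect the real map_domain_page() implementation.  With a directmap, every frame is reachable all of the time; without one, a frame is only reachable between an explicit map and unmap.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 4            /* hypothetical size of the on-demand mapping area */
#define NO_FRAME (~0UL)       /* marks an unused slot */

struct toy_mapcache {
    unsigned long frame[NR_SLOTS];   /* frame currently mapped in each slot */
};

static void toy_mapcache_init(struct toy_mapcache *mc)
{
    for (size_t i = 0; i < NR_SLOTS; i++)
        mc->frame[i] = NO_FRAME;
}

/* Create a mapping for 'frame'; returns the slot used, or -1 if none is free. */
static int toy_map_page(struct toy_mapcache *mc, unsigned long frame)
{
    assert(frame != NO_FRAME);

    for (int i = 0; i < NR_SLOTS; i++) {
        if (mc->frame[i] == NO_FRAME) {
            mc->frame[i] = frame;
            return i;
        }
    }
    return -1;
}

/* Destroy the mapping in 'slot'; the frame is no longer reachable afterwards. */
static void toy_unmap_page(struct toy_mapcache *mc, int slot)
{
    assert(slot >= 0 && slot < NR_SLOTS);
    mc->frame[slot] = NO_FRAME;
}

/* With a directmap this would be true for every frame, all of the time. */
static int toy_is_reachable(const struct toy_mapcache *mc, unsigned long frame)
{
    for (int i = 0; i < NR_SLOTS; i++)
        if (mc->frame[i] == frame)
            return 1;
    return 0;
}
```

The cost is visible in the model: every access now pays for a slot search plus (in a real system) pagetable updates and TLB maintenance, which is where the performance hit comes from.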

Beyond the directmap, there are plenty of other interesting secrets in
the Xen heap and other mappings, such as the stacks of the other pcpus. 
Fixing this requires moving Xen to having a non-uniform memory layout,
and this is much harder to change.  I already experimented with this as
a meltdown mitigation around about a year ago, and posted the resulting
series on Jan 4th,
some trivial bits of which have already found their way upstream.

To have a non-uniform memory layout, Xen may not share L4 pagetables. 
i.e. Xen must never have two pcpus which reference the same pagetable in
%cr3.

This property already holds for 32bit PV guests, and all HVM guests, but
64bit PV guests are the sticking point.  Because Linux has a flat memory
layout, when a 64bit PV guest schedules two threads from the same
process on separate vcpus, those two vcpus have the same virtual %cr3,
and currently, Xen programs the same real %cr3 into hardware.

If we want Xen to have a non-uniform layout, our two options are:
* Fix Linux to have the same non-uniform layout that Xen wants
(Backwards compatibility for older 64bit PV guests can be achieved with
* Make use of the XPTI algorithm (specifically, the pagetable sync/copy part)
forever more in the future.

Option 2 isn't great (especially for perf on fixed hardware), but does
keep all the necessary changes in Xen.  Option 1 looks to be the better
option longterm.
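To make the sync/copy idea concrete, here is a rough sketch of building a per-pcpu top-level table from a shared guest one.  Everything here is an assumption made for illustration: the toy_* names, the 8-entry table (real x86-64 L4 tables have 512 entries) and the choice of slots 4-5 as "Xen's" are not taken from the Xen source.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_L4_ENTRIES 8      /* real x86-64 L4 tables have 512 entries */
#define TOY_XEN_FIRST  4      /* hypothetical range of slots owned by Xen */
#define TOY_XEN_LAST   5

struct toy_l4 {
    uint64_t e[TOY_L4_ENTRIES];
};

static int toy_is_xen_slot(int i)
{
    return i >= TOY_XEN_FIRST && i <= TOY_XEN_LAST;
}

/*
 * Build the table a pcpu actually loads: guest-owned slots are copied from
 * the guest's L4, Xen-owned slots come from that pcpu's private entries.
 */
static void toy_sync_l4(struct toy_l4 *percpu, const struct toy_l4 *guest,
                        const struct toy_l4 *xen_private)
{
    /* The whole point is that the pcpu never runs on the guest's own table. */
    assert(percpu != guest);

    for (int i = 0; i < TOY_L4_ENTRIES; i++)
        percpu->e[i] = toy_is_xen_slot(i) ? xen_private->e[i] : guest->e[i];
}
```

Two pcpus running vcpus with the same virtual %cr3 end up with identical guest slots but different Xen slots, at the price of redoing the copy whenever the guest's top-level table changes.  That recurring copy is the perf cost of Option 2.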

As an interesting point to note, the 32bit PV ABI prohibits sharing of
L3 pagetables, because back in the 32bit hypervisor days, we used to
have linear mappings in the Xen virtual range.  This check is stale
(from a functionality point of view), but still present in Xen.  A
consequence of this is that 32bit PV guests definitely don't share
top-level pagetables across vcpus.

Juergen/Boris: Do you have any idea if/how easy this infrastructure
would be to implement for 64bit PV guests as well?  If a PV guest can
advertise via Elfnote that it won't share top-level pagetables, then we
can audit this trivially in Xen.

2) Scheduler improvements.

(I'm afraid this is rather more sparse because I'm less familiar with
the scheduler details.)

At the moment, all of Xen's schedulers will happily put two vcpus from
different domains on sibling hyperthreads.  There has been a lot of
sidechannel research over the past decade demonstrating ways for one
thread to infer what is going on in the other, but L1TF is the first
vulnerability I'm aware of which allows one thread to directly read data
out of the other.

Either way, it is now definitely a bad thing to run different guests
concurrently on siblings.  Fixing this by simply not scheduling vcpus
from a different guest on siblings does result in a lower resource
utilisation, most notably when there are an odd number of runnable vcpus in
a domain, as the other thread is forced to idle.
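As a back-of-the-envelope sketch of that constraint and its cost (the toy_* names and the use of -1 for "idle" are assumptions for illustration, not scheduler code):

```c
#include <assert.h>

#define TOY_IDLE (-1)         /* hypothetical domain id meaning "thread is idle" */

/* Two siblings may run together only if they belong to the same domain,
 * or if at least one of them is idle. */
static int toy_may_share_core(int dom_a, int dom_b)
{
    return dom_a == TOY_IDLE || dom_b == TOY_IDLE || dom_a == dom_b;
}

/* Threads left idle when one domain's runnable vcpus are packed onto cores
 * that it may not share with any other domain. */
static unsigned int toy_forced_idle_threads(unsigned int runnable,
                                            unsigned int threads_per_core)
{
    assert(threads_per_core > 0);

    /* Round up to whole cores, then count the threads left over. */
    unsigned int cores = (runnable + threads_per_core - 1) / threads_per_core;

    return cores * threads_per_core - runnable;
}
```

With two threads per core, a domain with 3 runnable vcpus occupies 2 cores and wastes 1 thread, whereas 4 runnable vcpus waste none.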

A step beyond this is core-aware scheduling, where we schedule in units
of a virtual core rather than a virtual thread.  This has much better
behaviour from the guest's point of view, as the actually-scheduled
topology remains consistent, but does potentially come with even lower
utilisation if every other thread in the guest is idle.

A side requirement for core-aware scheduling is for Xen to have an
accurate idea of the topology presented to the guest.  I need to dust
off my Toolstack CPUID/MSR improvement series and get that upstream.

One of the most insidious problems with L1TF is that, with
hyperthreading enabled, a malicious guest kernel can engineer arbitrary
data leakage by having one thread scanning the expected physical
address, and the other thread using an arbitrary cache-load gadget in
hypervisor context.  This occurs because the L1 data cache is shared by
sibling threads.

A solution to this issue was proposed, whereby Xen synchronises siblings
on vmexit/entry, so we are never executing code in two different
privilege levels.  Getting this working would make it safe to continue
using hyperthreading even in the presence of L1TF.  Obviously, it's going
to come with a perf hit, but compared to disabling hyperthreading, all it's
got to do is beat a 60% perf hit to make it the preferable option for
making your system L1TF-proof.
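The invariant being asked for can be sketched as a tiny single-threaded state model.  This is purely illustrative: the toy_* names are invented, a 2-thread core is assumed, idle siblings are ignored, and a real implementation would need IPIs and a proper rendezvous rather than direct assignment to the sibling's state.

```c
#include <assert.h>

enum toy_level { TOY_IN_XEN, TOY_IN_GUEST };

struct toy_core {
    enum toy_level level[2];   /* privilege level of each sibling thread */
    int wants_guest[2];        /* thread has finished in Xen, wants to resume */
};

static void toy_core_init(struct toy_core *c)
{
    c->level[0] = c->level[1] = TOY_IN_XEN;
    c->wants_guest[0] = c->wants_guest[1] = 0;
}

/* The property to preserve: siblings are never at different levels. */
static int toy_levels_match(const struct toy_core *c)
{
    return c->level[0] == c->level[1];
}

/* Thread t exits to Xen; its sibling is pulled out of the guest as well,
 * but still wants to go back as soon as it is allowed to. */
static void toy_vmexit(struct toy_core *c, int t)
{
    assert(t == 0 || t == 1);
    c->wants_guest[t] = 0;
    c->level[0] = c->level[1] = TOY_IN_XEN;
}

/* Thread t is ready to resume; entry only happens once both siblings are. */
static void toy_request_vmentry(struct toy_core *c, int t)
{
    assert(t == 0 || t == 1);
    c->wants_guest[t] = 1;
    if (c->wants_guest[0] && c->wants_guest[1])
        c->level[0] = c->level[1] = TOY_IN_GUEST;
}
```

The model makes the cost visible: every vmexit on one thread stalls the sibling's guest until the first thread is ready to re-enter.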

Anyway - enough of my rambling for now.  Thoughts?

