
Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory

  • To: "Johnson, Ethan" <ejohns48@xxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Thu, 22 Aug 2019 14:51:03 +0100
  • Delivery-date: Thu, 22 Aug 2019 13:51:36 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22/08/2019 03:06, Johnson, Ethan wrote:
> On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>>> Similarly, to what extent does the dom0 (or other such
>>> privileged domain) utilize "foreign memory maps" to reach into another
>>> guest's memory? I understand that this is necessary when creating a
>>> guest, for live migration, and for QEMU to emulate stuff for HVM guests;
>>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
>>> access a guest's memory?
>> I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
>> if it chooses.  There is no "force" about it.
>> Debuggers and/or Introspection are other reasons why dom0 might choose to
>> map guest RAM, but I think you've covered the common cases.
>>> (I ask because the research project I'm working on is seeking to protect
>>> guests from a compromised hypervisor and dom0, so I need to limit
>>> outside access to a guest's memory to explicitly shared pages that the
>>> guest will treat as untrusted - not storing any secrets there, vetting
>>> input as necessary, etc.)
>> Sorry to come along with roadblocks, but how on earth do you intend to
>> prevent a compromised Xen from accessing guest memory?  A compromised
>> Xen can do almost anything it likes, and without recourse.  This is
>> ultimately why technologies such as Intel SGX or AMD Secure Encrypted
>> Virtualization (SEV) are coming along, because only the hardware itself is
>> in a position to isolate an untrusted hypervisor/kernel from guest data.
>> For dom0, that's perhaps easier.  You could reference count the number
>> of foreign mappings into the domain as it is created, and refuse to
>> unpause the guest's vcpus until the foreign map count has dropped to 0.
> We're using a technique where privileged system software (in this case, 
> the hypervisor) is compiled to a virtual instruction set (based on LLVM 
> IR) that limits its access to hardware features and its view of 
> available memory. These limitations are/can be enforced in a variety of 
> ways but the main techniques we're employing are software fault 
> isolation (i.e., memory loads and stores in privileged code are 
> instrumented with checks to ensure they aren't accessing forbidden 
> regions), and mediation of page table updates (by modifying privileged 
> software to make page table updates through a virtual instruction set 
> interface, very similarly to how Xen PV guests make page table updates 
> through hypercalls which gives Xen the opportunity to ensure mappings 
> aren't made to protected regions).
> Our technique is based on that used by the "Virtual Ghost" project (see 
> https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF 
> link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf), 
> which does something similar to protect applications from a compromised 
> operating system kernel without relying on something like a hypervisor 
> operating at a higher privileged level. We're looking to extend that 
> approach to hypervisors to protect guest VMs from a compromised hypervisor.

I have come across that paper before.

The extra language safety (which is effectively what this is) should
make it harder to compromise the hypervisor (and this is certainly a
good thing), but nothing at this level will get in the way of an
actually-compromised piece of ring 0 code from doing whatever it wants.

Suffice it to say that I'll be delighted if someone manages to
prove me wrong.
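
For readers unfamiliar with the software fault isolation idea described in
the quoted text, here is a minimal sketch of what an instrumented load might
look like.  Everything in it is an assumption made for illustration: the
region bounds, the names sfi_touches_protected() and sfi_checked_load64(),
and the choice to return false rather than trap.  It is not taken from
Virtual Ghost or from Xen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical protected region; the bounds are made up for illustration. */
#define PROTECTED_BASE  UINT64_C(0xffffff0000000000)
#define PROTECTED_LAST  UINT64_C(0xffffff7fffffffff)

/* True if the byte range [addr, addr + len) overlaps the protected region. */
static bool sfi_touches_protected(uint64_t addr, uint64_t len)
{
    uint64_t last;

    if (len == 0)
        return false;

    /* Treat address wrap-around as forbidden rather than guessing. */
    if (addr > UINT64_MAX - (len - 1))
        return true;

    last = addr + (len - 1);
    return addr <= PROTECTED_LAST && last >= PROTECTED_BASE;
}

/*
 * What a compiler-inserted check around an 8-byte load might reduce to.
 * Returns false without touching memory when the check fails; a real
 * system would more likely trap or redirect the access.
 */
static bool sfi_checked_load64(const uint64_t *p, uint64_t *out)
{
    if (sfi_touches_protected((uint64_t)(uintptr_t)p, sizeof(*p)))
        return false;

    *out = *p;
    return true;
}
```

A check like this only constrains code that actually runs it, which is why
such schemes depend on every load and store being instrumented and on
control flow being unable to skip the checks.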

>>> Again, this mostly boils down to: under what circumstances, if ever,
>>> does Xen ever "force" access to any part of a guest's memory?
>>> (Particularly for PV(H). Clearly that must happen for HVM since, by
>>> definition, the guest is unaware there's a hypervisor controlling its
>>> world and emulating hardware behavior, and thus is in no position to
>>> cooperatively/voluntarily give the hypervisor and dom0 access to its
>>> memory.)
>> There are cases for all guest types where Xen will need to emulate
>> instructions.  Xen will access guest memory in order to perform
>> architecturally correct actions, which generally starts with reading the
>> instruction under %rip.
>> For PV guests, this is almost entirely restricted to guest-kernel
>> operations which are privileged in nature.  Access to MSRs, writes to
>> pagetables, etc.
>> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
>> mean a complete absence of emulation.  The Local APIC is emulated by Xen
>> in most cases, as a bare minimum, but for example, the LMSW instruction
>> on AMD hardware doesn't have any intercept decoding to help the
>> hypervisor out when a guest uses the instruction.
>> ~Andrew
> I've found a number of files in the Xen source tree which seem to be 
> related to instruction/x86 platform emulation:
> arch/x86/x86_emulate.c
> arch/x86/hvm/emulate.c
> arch/x86/hvm/vmx/realmode.c
> arch/x86/hvm/svm/emulate.c
> arch/x86/pv/emulate.c
> arch/x86/pv/emul-priv-op.c
> arch/x86/x86_emulate/x86_emulate.c
> The last of these, in particular, looks especially hairy (it seems to 
> support emulation of essentially the entire x86 instruction set through 
> a quite impressive edifice of switch statements).

Lovely, isn't it.  For Introspection, we need to be able to emulate an
instruction which took a permission fault (including No Execute), was
sent to the analysis engine, and deemed ok to continue.

Other users of emulation are arch/x86/pv/ro-page-fault.c and

That said, most of these can be ignored in common cases.  vmx/realmode.c
is only for pre-Westmere Intel CPUs which lack the unrestricted_guest
feature.  svm/emulate.c is only for K8 hardware which lacks the NRIPS

> How does all of this fit into the big picture of how Xen virtualizes the 
> different types of VMs (PV/HVM/PVH)?

Consider this "core x86 support".  All areas which need to emulate an
instruction for whatever reason use this function.  (We previously had
multiple areas of code each doing subsets of x86 instruction
decode/execute, and it was an even bigger mess.)

> My impression (from reading the original "Xen and the Art of 
> Virtualization" SOSP '03 paper that describes the basic architecture) 
> had been that PV guests, in particular, used hypercalls in place of all 
> privileged operations that the guest kernel would otherwise need to 
> execute in ring 0; and that all other (unprivileged) operations could 
> execute natively on the CPU without requiring emulation. From what 
> you're saying (and what I'm seeing in the source code), though, it 
> sounds like in reality things are a bit fuzzier - that there are some 
> operations that Xen traps and emulates instead of explicitly 
> paravirtualizing.

Correct.  Few theories survive contact with the real world.

Some emulation, such as writeable_pagetable support, was added to make it
easier to port guests to being PV.  In this case, writes to pagetables
are trapped and emulated, as if an equivalent hypercall had been made.
Sure, it's slower than the hypercall, but it's far easier to get started with.

Some emulation is a consequence of CPUs changing in the 16 years
since that paper was published, and some emulation is a stopgap for
things which really should be paravirtualised properly.  A whole load of
speculative security fits into this category, as we haven't had time to
fix it nicely, following the panic of simply fixing it safely.

> Likewise, the Xen design described in the SOSP paper discussed guest I/O 
> as something that's fully paravirtualized, taking place not through 
> emulation of either memory-mapped or port I/O but rather through ring 
> buffers shared between the guest and dom0 via grant tables.

This is still correct and accurate.  Paravirtual split front/back driver
pairs for network and block are by far the most efficient way of
shuffling data in and out of the VM.

> I was a bit 
> confused to find I/O emulation code under arch/x86/pv (see e.g. 
> arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and 
> the like. Is this another example of things being fuzzier in reality 
> than in the "theoretical" PV design?

This is "general x86 architecture".  Xen handles all exceptions,
including from PV userspace (possibly being naughty), so at a bare
minimum needs to filter those which should be handed to the guest kernel
to deal with.

When it comes to x86 Port IO, it is a critical point of safety that Xen
runs with IOPL set to 0, or a guest kernel could modify the real
interrupt flag with a popf instruction.  As a result, all `in` and `out`
instructions trap with a #GP fault.

Guest userspace could use iopl() to logically gain access to IO
ports, after which `in` and `out` instructions would not fault.  Also,
these instructions don't fault in kernel context.  In both cases, Xen
has to filter between actually passing the IO request to hardware (if
the guest is suitably configured), or terminating it with defaults, so it
fails in a manner consistent with how x86 behaves.

For VT-x/SVM guests, filtering of #GP faults happens before the VMExit
so Xen doesn't have to handle those, but still has to handle all IO
accesses which are fine (permission wise) according to the guest kernel.
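
The filtering described above can be sketched roughly as follows.  This is
an illustration under stated assumptions, not Xen's code: struct pv_vcpu,
filter_port_io() and terminated_read_value() are hypothetical names, the
per-domain permission is reduced to a single port range, the TSS I/O bitmap
is ignored, and "defaults" is assumed to mean that reads return all ones
and writes are discarded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified view of a PV vCPU; not Xen's structures. */
struct pv_vcpu {
    bool     in_kernel;   /* guest is logically in kernel context */
    unsigned virt_iopl;   /* IOPL the guest believes it has (0 or 3 here) */
    uint16_t hw_first;    /* first port this domain may pass to hardware */
    uint16_t hw_last;     /* last port this domain may pass to hardware */
};

enum io_verdict {
    IO_INJECT_GP,    /* guest had no right: hand #GP to the guest kernel */
    IO_PASS_TO_HW,   /* perform the access on real hardware */
    IO_TERMINATE     /* complete the access with default behaviour */
};

/* Decide what to do with an in/out that trapped because real IOPL is 0. */
static enum io_verdict filter_port_io(const struct pv_vcpu *v,
                                      uint16_t port, unsigned bytes)
{
    /* Would the guest, by its own view of the world, have been allowed? */
    bool guest_may = v->in_kernel || v->virt_iopl == 3;

    if (!guest_may)
        return IO_INJECT_GP;

    /* The whole access must fall inside the range granted to the domain. */
    if (port >= v->hw_first && (uint32_t)port + bytes - 1 <= v->hw_last)
        return IO_PASS_TO_HW;

    return IO_TERMINATE;
}

/* Assumed default for a terminated read: all ones in the accessed width. */
static uint32_t terminated_read_value(unsigned bytes)
{
    return bytes >= 4 ? UINT32_MAX : (UINT32_C(1) << (bytes * 8)) - 1;
}
```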

> What devices, if any, are emulated rather than paravirtualized for a PV guest?

Look for XEN_X86_EMU_* throughout the code.  Those are all the discrete
devices which Xen may emulate, for both kinds of guests.  There is a
very restricted set of valid combinations.

PV dom0s get an emulated PIT to partially forward to real hardware.
ISTR it is legacy for some laptops where DRAM refresh was still
configured off timer 1.  I doubt it is relevant these days.
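
As a rough picture of what "a restricted set of valid combinations" of
per-device emulation flags can look like, here is a small sketch.  The flag
names, bit positions and the policy in emu_flags_ok() are all invented for
illustration and are deliberately not the XEN_X86_EMU_* definitions; the
only parts taken from this thread are that a PV dom0 gets an emulated PIT
and that HVM-style guests have at least the Local APIC emulated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flags only: stand-ins, not copied from Xen's headers. */
#define EMU_LAPIC   (UINT32_C(1) << 0)
#define EMU_IOAPIC  (UINT32_C(1) << 1)
#define EMU_HPET    (UINT32_C(1) << 2)
#define EMU_PIT     (UINT32_C(1) << 3)
#define EMU_ALL     (EMU_LAPIC | EMU_IOAPIC | EMU_HPET | EMU_PIT)

enum guest_kind { GUEST_PV, GUEST_HVM };

/* Accept only a short whitelist of flag combinations per guest kind. */
static bool emu_flags_ok(enum guest_kind kind, bool is_dom0, uint32_t flags)
{
    /* Reject bits this sketch does not know about. */
    if (flags & ~EMU_ALL)
        return false;

    /* PV: nothing emulated, except a PIT for dom0. */
    if (kind == GUEST_PV)
        return flags == (is_dom0 ? EMU_PIT : 0);

    /*
     * HVM-style (invented policy): either the Local APIC alone, as a
     * PVH-like minimum, or the full set of devices.
     */
    return flags == EMU_LAPIC || flags == EMU_ALL;
}
```

Checking against a whitelist rather than validating each bit independently
keeps the number of configurations that need testing small.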

> I know that for PVH, you 
> mentioned that the Local APIC is (at a minimum) emulated, along with 
> some special instructions; is that true for classic PV as well?

Classic PV guests don't get a Local APIC.  They are required to use the
event channel interface instead.

> For HVM, obviously anything that can't be virtualized natively by the 
> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't 
> expected to be cooperative to issue PV hypercalls instead); but I would 
> expect emulation to be limited to the relatively small subset of the ISA 
> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c 
> supports emulating just about everything. Under what circumstances does 
> Xen actually need to put all that emulation code to use?

Introspection, as I said earlier, which is potentially any instruction.

MMIO regions (including to the Local APIC when it is in xAPIC mode, and
hardware acceleration isn't available) can be the target of any
instruction with a memory operand.  While mov is by far the most common
instruction, other instructions such as and/or/xadd are used in some
cases.  Various of the vector moves (movups/movaps/movnti) are very
common with framebuffers.

The cfc/cf8 IO ports are used for PCI Config space accesses, which all
kernels try to use, and any kernel with real devices needs to use.  The
alternative is the MMCFG scheme which is plain MMIO as above.
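
To make the cf8/cfc scheme concrete: the guest writes an address to port
0xcf8 and then reads or writes data through ports 0xcfc-0xcff, so whoever
emulates it has to decode the latched address.  The sketch below shows that
decode for the standard layout (enable bit 31, bus in bits 23:16, device in
15:11, function in 10:8, dword-aligned register in 7:2).  The struct and
function names are hypothetical, and extended register bits used by some
platforms are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of a value written to port 0xcf8 (names are illustrative). */
struct pci_cfg_addr {
    bool    enabled;  /* bit 31: config cycle enable */
    uint8_t bus;      /* bits 23:16 */
    uint8_t dev;      /* bits 15:11 */
    uint8_t fn;       /* bits 10:8 */
    uint8_t reg;      /* bits 7:2, kept as a dword-aligned byte offset */
};

static struct pci_cfg_addr decode_cf8(uint32_t cf8)
{
    struct pci_cfg_addr a;

    a.enabled = (cf8 >> 31) & 1;
    a.bus     = (cf8 >> 16) & 0xff;
    a.dev     = (cf8 >> 11) & 0x1f;
    a.fn      = (cf8 >> 8) & 0x7;
    a.reg     = cf8 & 0xfc;
    return a;
}

/*
 * Byte offset into config space for an access through one of the data
 * ports 0xcfc-0xcff: the low two bits come from the port number.
 */
static unsigned cfg_offset(uint32_t cf8, uint16_t port)
{
    return (cf8 & 0xfc) | (port & 3);
}
```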

> I'm also wondering just how much of this is Xen's responsibility vs. 
> QEMU's. I understand that when QEMU is used on its own (i.e., not with 
> Xen), it uses dynamic binary recompilation to handle the parts of the 
> ISA that can't be virtualized natively in lower-privilege modes. Does 
> Xen only use QEMU for emulating off-CPU devices (interrupt controller, 
> non-paravirtualized disk/network/graphics/etc.), or does it ever employ 
> any of QEMU's x86 emulation support in addition to Xen's own emulation code?

We only use QEMU for off-CPU devices.  For performance reasons, some of
the interrupt emulation (IO-APIC in particular), and timer emulation
(HPET, PIT) is done in Xen, even when it would logically be part of the
motherboard if we were looking for a clear delineation of where Xen
stops and QEMU starts.

> Is there any particular place in the code where I can go to get a 
> comprehensive "list" (or other such summary) of which parts of the ISA 
> and off-CPU system are emulated for each respective guest type (PV, HVM, 
> and PVH)?

XEN_X86_EMU_* should cover you here.

> I realize that the difference between HVM and PVH is more of a 
> continuum than a line; what I'm especially interested in is, what's the 
> *bare minimum* of emulation required for a PVH guest that's using as 
> much paravirtualization as possible? (That's the setting I'm looking to 
> target for my research on protecting guests from a compromised 
> hypervisor, since I'm trying to minimize the scope of interactions 
> between the guest and hypervisor/dom0 that our virtual instruction set 
> layer needs to mediate.)

If you are using PVH guests, on not-ancient hardware, and you can
persuade the guest kernel to use x2APIC mode, and without using any
ins/outs instructions, then you just might be able to get away without
any x86_emulate() at all.

x2APIC mode has an MSR-based interface rather than an MMIO interface,
which means that the VMExit intercept information alone is sufficient to
work out exactly what to do, and ins/outs are the only other instructions
(which come to mind) liable to trap and need emulator support above and
beyond the intercept information.
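
To illustrate why the MSR-based interface avoids instruction decode: an MSR
intercept already supplies the register index (in %ecx) and the value (in
%edx:%eax), and x2APIC registers sit at a fixed position in MSR space
derived from the old MMIO offset.  A minimal sketch of that mapping follows;
the helper names are hypothetical, and the special cases (for example the
ICR becoming a single 64-bit register) are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* x2APIC registers occupy MSRs 0x800-0x8ff. */
#define X2APIC_MSR_BASE  UINT32_C(0x800)
#define X2APIC_MSR_LAST  UINT32_C(0x8ff)

/* Does this MSR index fall in the x2APIC range? */
static bool is_x2apic_msr(uint32_t msr)
{
    return msr >= X2APIC_MSR_BASE && msr <= X2APIC_MSR_LAST;
}

/* xAPIC MMIO register offset (16-byte aligned) to x2APIC MSR index. */
static uint32_t xapic_offset_to_msr(uint32_t offset)
{
    return X2APIC_MSR_BASE + (offset >> 4);
}

/* x2APIC MSR index back to the equivalent xAPIC MMIO register offset. */
static uint32_t msr_to_xapic_offset(uint32_t msr)
{
    return (msr - X2APIC_MSR_BASE) << 4;
}
```

For example, the Task Priority Register at MMIO offset 0x80 becomes MSR
0x808, and the EOI register at offset 0xb0 becomes MSR 0x80b.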

That said, whatever you do here is going to have to cope with dom0 and
all the requirements for keeping the system running.  Depending on
exactly how you're approaching the problem, it might be possible to
declare that out of scope and leave it to one side.

> On a somewhat related note, I also have a question about a particular 
> piece of code in arch/x86/pv/emul-priv-op.c, namely the function 
> io_emul_stub_setup(). It looks like it is, at runtime, crafting a 
> function that switches to the guest register context, emulates a 
> particular I/O operation, then switches back to the host register 
> context. This caught our attention while we were implementing Control 
> Flow Integrity (CFI) instrumentation for Xen (which is necessary for us 
> to enforce the software fault isolation (SFI) instrumentation that 
> provides our memory protections). Why does Xen use dynamically-generated 
> code here? Is it just for implementation convenience (i.e., to improve 
> the generalizability of the code)?

This mechanism is for dom0 only, and exists because some firmware is

Some AML in ACPI tables uses an IO port to generate an SMI, and has an
API which uses the GPRs.  It turns out things go rather wrong when Xen
intercepts the IO instruction, and replays it to hardware in Xen's GPR
context, rather than the guest kernel's.

This bodge swaps Xen's and dom0's GPRs just around the IO instruction,
so the SMI API gets its parameters properly, and the results get fed
back properly into AML.

There is a related hypercall, SCHEDOP_pin_override, used by dom0,
because sometimes the AML really does need to execute on CPU0, and not
wherever dom0's vcpu0 happens to be executing.

> Thanks again for all your time and effort spent answering my questions. 
> I know I'm throwing a lot of unusual questions out there - this 
> back-and-forth has been very helpful for me in figuring out *what* 
> questions I need to be asking in the first place to understand what's 
> feasible to do in the Xen architecture and how I might go about doing 
> it. :-)

Not a problem in the slightest.

