
Re: [Xen-devel] More questions about Xen memory layout/usage, access to guest memory

I can at least confirm that no emulation is needed to execute a Linux guest, even with the Xen PVH interface, but I don't think that works out of the box with Xen today. It is something we are currently working on, and we will hopefully have some more data near the end of the year. x2APIC helps, but it currently takes some work to convince Linux to use it. The trick is to avoid PortIO and, where possible, MMIO interfaces.


On Thu, Aug 22, 2019 at 1:53 PM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
On 22/08/2019 03:06, Johnson, Ethan wrote:
> On 8/17/2019 7:04 AM, Andrew Cooper wrote:
>>> Similarly, to what extent does the dom0 (or other such
>>> privileged domain) utilize "foreign memory maps" to reach into another
>>> guest's memory? I understand that this is necessary when creating a
>>> guest, for live migration, and for QEMU to emulate stuff for HVM guests;
>>> but for PVH, is it ever necessary for Xen or the dom0 to "forcibly"
>>> access a guest's memory?
>> I'm not sure what you mean by forcibly.  Dom0 has the ability to do so,
>> if it chooses.  There is no "force" about it.
>> Debuggers and/or Introspection are other reasons why dom0 might choose to
>> map guest RAM, but I think you've covered the common cases.
>>> (I ask because the research project I'm working on is seeking to protect
>>> guests from a compromised hypervisor and dom0, so I need to limit
>>> outside access to a guest's memory to explicitly shared pages that the
>>> guest will treat as untrusted - not storing any secrets there, vetting
>>> input as necessary, etc.)
>> Sorry to come along with roadblocks, but how on earth do you intend to
>> prevent a compromised Xen from accessing guest memory?  A compromised
>> Xen can do almost anything it likes, and without recourse.  This is
>> ultimately why technologies such as Intel SGX or AMD Secure Encrypted VM
>> are coming along, because only the hardware itself is in a position to
>> isolate an untrusted hypervisor/kernel from guest data.
>> For dom0, that's perhaps easier.  You could reference count the number
>> of foreign mappings into the domain as it is created, and refuse to
>> unpause the guest's vcpus until the foreign map count has dropped to 0.
> We're using a technique where privileged system software (in this case,
> the hypervisor) is compiled to a virtual instruction set (based on LLVM
> IR) that limits its access to hardware features and its view of
> available memory. These limitations are/can be enforced in a variety of
> ways but the main techniques we're employing are software fault
> isolation (i.e., memory loads and stores in privileged code are
> instrumented with checks to ensure they aren't accessing forbidden
> regions), and mediation of page table updates (by modifying privileged
> software to make page table updates through a virtual instruction set
> interface, very similarly to how Xen PV guests make page table updates
> through hypercalls which gives Xen the opportunity to ensure mappings
> aren't made to protected regions).
> Our technique is based on that used by the "Virtual Ghost" project (see
> https://dl.acm.org/citation.cfm?id=2541986 for the paper; direct PDF
> link: http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf),
> which does something similar to protect applications from a compromised
> operating system kernel without relying on something like a hypervisor
> operating at a higher privileged level. We're looking to extend that
> approach to hypervisors to protect guest VMs from a compromised hypervisor.

I have come across that paper before.

The extra language safety (which is effectively what this is) should
make it harder to compromise the hypervisor (and this is certainly a
good thing), but nothing at this level will get in the way of an
actually-compromised piece of ring 0 code from doing whatever it wants.

Suffice it to say that I'll be delighted if someone manages to
prove me wrong.

>>> Again, this mostly boils down to: under what circumstances, if ever,
>>> does Xen ever "force" access to any part of a guest's memory?
>>> (Particularly for PV(H). Clearly that must happen for HVM since, by
>>> definition, the guest is unaware there's a hypervisor controlling its
>>> world and emulating hardware behavior, and thus is in no position to
>>> cooperatively/voluntarily give the hypervisor and dom0 access to its
>>> memory.)
>> There are cases for all guest types where Xen will need to emulate
>> instructions.  Xen will access guest memory in order to perform
>> architecturally correct actions, which generally starts with reading the
>> instruction under %rip.
>> For PV guests, this is almost entirely restricted to guest-kernel
>> operations which are privileged in nature.  Access to MSRs, writes to
>> pagetables, etc.
>> For HVM and PVH guests, while PVH means "HVM without Qemu", it doesn't
>> mean a complete absence of emulation.  The Local APIC is emulated by Xen
>> in most cases, as a bare minimum, but for example, the LMSW instruction
>> on AMD hardware doesn't have any intercept decoding to help the
>> hypervisor out when a guest uses the instruction.
>> ~Andrew
> I've found a number of files in the Xen source tree which seem to be
> related to instruction/x86 platform emulation:
> arch/x86/x86_emulate.c
> arch/x86/hvm/emulate.c
> arch/x86/hvm/vmx/realmode.c
> arch/x86/hvm/svm/emulate.c
> arch/x86/pv/emulate.c
> arch/x86/pv/emul-priv-op.c
> arch/x86/x86_emulate/x86_emulate.c
> The last of these, in particular, looks especially hairy (it seems to
> support emulation of essentially the entire x86 instruction set through
> a quite impressive edifice of switch statements).

Lovely, isn't it.  For Introspection, we need to be able to emulate an
instruction which took a permission fault (including No Execute), was
sent to the analysis engine, and deemed ok to continue.

Other users of emulation are arch/x86/pv/ro-page-fault.c and

That said, most of these can be ignored in common cases.  vmx/realmode.c
is only for pre-Westmere Intel CPUs which lack the unrestricted_guest
feature.  svm/emulate.c is only for K8 hardware which lacks the NRIPS

> How does all of this fit into the big picture of how Xen virtualizes the
> different types of VMs (PV/HVM/PVH)?

Consider this "core x86 support".  All areas which need to emulate an
instruction for whatever reason use this function.  (We previously had
multiple areas of code each doing subsets of x86 instruction
decode/execute, and it was an even bigger mess.)

> My impression (from reading the original "Xen and the Art of
> Virtualization" SOSP '03 paper that describes the basic architecture)
> had been that PV guests, in particular, used hypercalls in place of all
> privileged operations that the guest kernel would otherwise need to
> execute in ring 0; and that all other (unprivileged) operations could
> execute natively on the CPU without requiring emulation. From what
> you're saying (and what I'm seeing in the source code), though, it
> sounds like in reality things are a bit fuzzier - that there are some
> operations that Xen traps and emulates instead of explicitly
> paravirtualizing.

Correct.  Few theories survive contact with the real world.

Some emulation, such as writeable_pagetable support, was added to make it
easier to port guests to being PV.  In this case, writes to pagetables
are trapped and emulated, as if an equivalent hypercall had been made.
Sure, it's slower than the hypercall, but it's far easier to get started with.
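
To make the "as if an equivalent hypercall had been made" point concrete, here is a minimal toy model in C.  It is not Xen's code: every name in it (toy_mmu_update_hypercall(), toy_emulated_pt_write(), frame_is_protected(), the protected range) is hypothetical, and the real validation is far more involved.  The only thing it tries to show is that the explicit path and the trap-and-emulate path can funnel into one shared check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT   0x1ULL
#define PTE_ADDR_MASK 0x000ffffffffff000ULL
#define NR_PTES       512u

/* Hypothetical policy: one machine address range guests may never map. */
#define PROT_START 0x100000000ULL
#define PROT_END   0x140000000ULL

static int frame_is_protected(uint64_t maddr)
{
    return maddr >= PROT_START && maddr < PROT_END;
}

/* Shared validation: both entry points below funnel through here. */
static int validate_and_write(uint64_t *table, uint64_t idx, uint64_t new_pte)
{
    assert(table != NULL);
    if (idx >= NR_PTES)
        return -1;
    if ((new_pte & PTE_PRESENT) && frame_is_protected(new_pte & PTE_ADDR_MASK))
        return -1;
    table[idx] = new_pte;
    return 0;
}

/* Path 1: the guest asks explicitly (the paravirtualised route). */
int toy_mmu_update_hypercall(uint64_t *table, uint64_t idx, uint64_t new_pte)
{
    return validate_and_write(table, idx, new_pte);
}

/*
 * Path 2: the guest wrote straight to a read-only pagetable page and
 * faulted.  We assume an emulator has already decoded the faulting
 * instruction into (byte offset within the table, value written).
 */
int toy_emulated_pt_write(uint64_t *table, uint64_t byte_off, uint64_t value)
{
    if (byte_off % sizeof(uint64_t))
        return -1;  /* this toy only accepts aligned, whole-entry writes */
    return validate_and_write(table, byte_off / sizeof(uint64_t), value);
}
```

In this sketch, a write decoded at byte offset 16 has the same effect as a hypercall naming index 2; the extra cost on the emulated path is the fault plus the instruction decode.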

Some emulation is a consequence of CPUs changing in the 16 years
since that paper was published, and some emulation is a stopgap for
things which really should be paravirtualised properly.  A whole load of
speculative security fits into this category, as we haven't had time to
fix it nicely, following the panic of simply fixing it safely.

> Likewise, the Xen design described in the SOSP paper discussed guest I/O
> as something that's fully paravirtualized, taking place not through
> emulation of either memory-mapped or port I/O but rather through ring
> buffers shared between the guest and dom0 via grant tables.

This is still correct and accurate.  Paravirtual split front/back driver
pairs for network and block are by far the most efficient way of
shuffling data in and out of the VM.

> I was a bit
> confused to find I/O emulation code under arch/x86/pv (see e.g.
> arch/x86/pv/emul-priv-op.c) that seems to be talking about "ports" and
> the like. Is this another example of things being fuzzier in reality
> than in the "theoretical" PV design?

This is "general x86 architecture".  Xen handles all exceptions,
including from PV userspace (possibly being naughty), so at a bare
minimum needs to filter those which should be handed to the guest kernel
to deal with.

When it comes to x86 Port IO, it is a critical point of safety that Xen
runs with IOPL set to 0, or a guest kernel could modify the real
interrupt flag with a popf instruction.  As a result, all `in` and `out`
instructions trap with a #GP fault.

Guest userspace could use iopl() to logically gain access to IO
ports, after which `in` and `out` instructions would not fault.  Also,
these instructions don't fault in kernel context.  In both cases, Xen
has to filter between actually passing the IO request to hardware (if
the guest is suitably configured), or terminating it with defaults, so it
fails in a manner consistent with how x86 behaves.
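
The filtering step can be sketched roughly as below.  This is a toy model, not the code in emul-priv-op.c: struct toy_guest, the per-port bitmap, and the fake_hw array standing in for real hardware are all assumptions made for illustration.  The "default" used for a refused read is all-ones, which is what reading a port with nothing behind it commonly returns on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PORTS 65536u

/* Hypothetical per-guest state: one permission bit per IO port. */
struct toy_guest {
    uint8_t allowed[NR_PORTS / 8];
};

/* Stand-in for real hardware so the sketch runs anywhere. */
static uint8_t fake_hw[NR_PORTS];

void toy_permit_ports(struct toy_guest *g, uint16_t first, uint16_t last)
{
    assert(g != NULL && first <= last);
    for (uint32_t p = first; p <= last; p++)
        g->allowed[p / 8] |= (uint8_t)(1u << (p % 8));
}

static bool port_allowed(const struct toy_guest *g, uint16_t port)
{
    return (g->allowed[port / 8] >> (port % 8)) & 1u;
}

/* Called after an `in` has trapped and been decoded. */
uint8_t toy_guest_inb(const struct toy_guest *g, uint16_t port)
{
    assert(g != NULL);
    if (port_allowed(g, port))
        return fake_hw[port];   /* pass through to "hardware" */
    return 0xff;                /* terminate: read as all-ones */
}

/* Called after an `out` has trapped and been decoded. */
void toy_guest_outb(const struct toy_guest *g, uint16_t port, uint8_t val)
{
    assert(g != NULL);
    if (port_allowed(g, port))
        fake_hw[port] = val;    /* pass through to "hardware" */
    /* otherwise the write is silently discarded */
}
```

With only ports 0x3f8-0x3ff permitted, a write to 0x3f8 reaches the backing array, while a write to 0x70 is dropped and a read from 0x70 returns 0xff.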

For VT-x/SVM guests, filtering of #GP faults happens before the VMExit
so Xen doesn't have to handle those, but still has to handle all IO
accesses which are fine (permission wise) according to the guest kernel.

> What devices, if any, are emulated rather than paravirtualized for a PV guest?

Look for XEN_X86_EMU_* throughout the code.  Those are all the discrete
devices which Xen may emulate, for both kinds of guests.  There is a
very restricted set of valid combinations.

PV dom0s get an emulated PIT to partially forward to real hardware.
ISTR it is legacy for some laptops where DRAM refresh was still
configured off timer 1.  I doubt it is relevant these days.
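
As a rough illustration of what "a very restricted set of valid combinations" looks like in practice, here is a small C sketch.  The flag names (TOY_EMU_*), the guest types, and the policy table are hypothetical; they are not the XEN_X86_EMU_* definitions and not Xen's actual rules.  The policy only encodes what is said in this thread: classic PV guests get nothing, a PV dom0 may get a PIT, and a PVH guest gets a Local APIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags, one bit per discrete emulated device. */
#define TOY_EMU_LAPIC (1u << 0)
#define TOY_EMU_PIT   (1u << 1)
#define TOY_EMU_HPET  (1u << 2)
#define TOY_EMU_ALL   (TOY_EMU_LAPIC | TOY_EMU_PIT | TOY_EMU_HPET)

enum toy_guest_type {
    TOY_PV_DOMU,
    TOY_PV_DOM0,
    TOY_PVH_DOMU,
};

/* Accept only whole, known combinations rather than arbitrary bit mixes. */
bool toy_emu_flags_ok(enum toy_guest_type type, uint32_t flags)
{
    assert((flags & ~TOY_EMU_ALL) == 0);

    switch (type) {
    case TOY_PV_DOMU:
        return flags == 0;                  /* event channels only */
    case TOY_PV_DOM0:
        return flags == 0 || flags == TOY_EMU_PIT;
    case TOY_PVH_DOMU:
        return flags == TOY_EMU_LAPIC;      /* the bare minimum */
    }
    return false;
}
```

Checking for exact equality against a short list, instead of testing individual bits, is what keeps the set of combinations small and auditable.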

> I know that for PVH, you
> mentioned that the Local APIC is (at a minimum) emulated, along with
> some special instructions; is that true for classic PV as well?

Classic PV guests don't get a Local APIC.  They are required to use the
event channel interface instead.

> For HVM, obviously anything that can't be virtualized natively by the
> hardware needs to be emulated by Xen/QEMU (since the guest kernel isn't
> expected to be cooperative to issue PV hypercalls instead); but I would
> expect emulation to be limited to the relatively small subset of the ISA
> that VMX/SVM can't natively virtualize. Yet I see that x86_emulate.c
> supports emulating just about everything. Under what circumstances does
> Xen actually need to put all that emulation code to use?

Introspection, as I said earlier, which is potentially any instruction.

MMIO regions (including to the Local APIC when it is in xAPIC mode, and
hardware acceleration isn't available) can be the target of any
instruction with a memory operand.  While mov is by far the most common
instruction, other instructions such as and/or/xadd are used in some
cases.  Various of the vector moves (movups/movaps/movnti) are very
common with framebuffers.

The cfc/cf8 IO ports are used for PCI Config space accesses, which all
kernels try to use, and any kernel with real devices needs to use.  The
alternative is the MMCFG scheme which is plain MMIO as above.
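
For reference, the two addressing schemes can be sketched in C as follows.  The bit layouts are the standard ones (the address written to port cf8 for the legacy mechanism, and the per-function 4K window used by MMCFG); the struct and function names are made up for this example and do not correspond to anything in the Xen tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded form of a PCI config address. */
struct pci_cfg_addr {
    bool    enabled;  /* bit 31 of the value written to cf8 */
    uint8_t bus;      /* bits 23:16 */
    uint8_t dev;      /* bits 15:11 */
    uint8_t fn;       /* bits 10:8  */
    uint8_t reg;      /* bits 7:2, kept as a dword-aligned byte offset */
};

/* Decode the 32-bit address a guest wrote to IO port cf8. */
struct pci_cfg_addr decode_cf8(uint32_t cf8)
{
    struct pci_cfg_addr a;

    a.enabled = (cf8 >> 31) & 1u;
    a.bus     = (cf8 >> 16) & 0xffu;
    a.dev     = (cf8 >> 11) & 0x1fu;
    a.fn      = (cf8 >> 8)  & 0x7u;
    a.reg     = cf8 & 0xfcu;
    return a;
}

/*
 * Offset of a config register from the MMCFG base address.
 * Each function gets 4K of config space, so reg may go up to 0xfff.
 */
uint64_t mmcfg_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    assert(dev < 32 && fn < 8 && reg < 4096);
    return ((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
           ((uint64_t)fn << 12) | reg;
}
```

The legacy path costs two trapped IO accesses per config access (address to cf8, then data via cfc), whereas with MMCFG the target is identified by the faulting guest physical address alone.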

> I'm also wondering just how much of this is Xen's responsibility vs.
> QEMU's. I understand that when QEMU is used on its own (i.e., not with
> Xen), it uses dynamic binary recompilation to handle the parts of the
> ISA that can't be virtualized natively in lower-privilege modes. Does
> Xen only use QEMU for emulating off-CPU devices (interrupt controller,
> non-paravirtualized disk/network/graphics/etc.), or does it ever employ
> any of QEMU's x86 emulation support in addition to Xen's own emulation code?

We only use QEMU for off-CPU devices.  For performance reasons, some of
the interrupt emulation (IO-APIC in particular), and timer emulation
(HPET, PIT) is done in Xen, even when it would logically be part of the
motherboard if we were looking for a clear delineation of where Xen
stops and QEMU starts.

> Is there any particular place in the code where I can go to get a
> comprehensive "list" (or other such summary) of which parts of the ISA
> and off-CPU system are emulated for each respective guest type (PV, HVM,
> and PVH)?

XEN_X86_EMU_* should cover you here.

> I realize that the difference between HVM and PVH is more of a
> continuum than a line; what I'm especially interested in is, what's the
> *bare minimum* of emulation required for a PVH guest that's using as
> much paravirtualization as possible? (That's the setting I'm looking to
> target for my research on protecting guests from a compromised
> hypervisor, since I'm trying to minimize the scope of interactions
> between the guest and hypervisor/dom0 that our virtual instruction set
> layer needs to mediate.)

If you are using PVH guests, on not-ancient hardware, and you can
persuade the guest kernel to use x2APIC mode, and without using any
ins/outs instructions, then you just might be able to get away without
any x86_emulate() at all.

x2APIC mode has an MSR-based interface rather than an MMIO interface,
which means that the VMExit intercept information alone is sufficient to
work out exactly what to do, and ins/outs are the only other instructions
(which come to mind) liable to trap and need emulator support above and
beyond the intercept information.
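
To illustrate why the MSR-based interface needs no instruction decode: in x2APIC mode each register is addressed by an MSR index in the 0x800-0x8ff range, derived from the register's old xAPIC MMIO offset divided by 16.  The C sketch below shows just that mapping; the helper names are invented for the example, and it ignores the handful of registers that differ between the two modes (the ICR, for instance, becomes a single 64-bit MSR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X2APIC_MSR_BASE 0x800u
#define X2APIC_MSR_LAST 0x8ffu

/* True if a trapped RDMSR/WRMSR index targets the Local APIC. */
bool is_x2apic_msr(uint32_t msr)
{
    return msr >= X2APIC_MSR_BASE && msr <= X2APIC_MSR_LAST;
}

/*
 * Map an xAPIC MMIO register offset (16-byte aligned, within the 4K
 * APIC page) to the corresponding x2APIC MSR index.
 */
uint32_t xapic_offset_to_x2apic_msr(uint32_t offset)
{
    assert(offset < 0x1000u && (offset & 0xfu) == 0);
    return X2APIC_MSR_BASE + (offset >> 4);
}

/* Inverse mapping, for a handler that reuses xAPIC-style register code. */
uint32_t x2apic_msr_to_xapic_offset(uint32_t msr)
{
    assert(is_x2apic_msr(msr));
    return (msr - X2APIC_MSR_BASE) << 4;
}
```

On a WRMSR intercept the index and value are already sitting in guest registers, so a handler can dispatch on the index directly.  An MMIO write to the same register could come from any instruction with a memory operand, which is where the emulator gets pulled in.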

That said, whatever you do here is going to have to cope with dom0 and
all the requirements for keeping the system running.  Depending on
exactly how you're approaching the problem, it might be possible to
declare that out of scope and leave it to one side.

> On a somewhat related note, I also have a question about a particular
> piece of code in arch/x86/pv/emul-priv-op.c, namely the function
> io_emul_stub_setup(). It looks like it is, at runtime, crafting a
> function that switches to the guest register context, emulates a
> particular I/O operation, then switches back to the host register
> context. This caught our attention while we were implementing Control
> Flow Integrity (CFI) instrumentation for Xen (which is necessary for us
> to enforce the software fault isolation (SFI) instrumentation that
> provides our memory protections). Why does Xen use dynamically-generated
> code here? Is it just for implementation convenience (i.e., to improve
> the generalizability of the code)?

This mechanism is for dom0 only, and exists because some firmware is

Some AML in ACPI tables uses an IO port to generate an SMI, and has an
API which uses the GPRs.  It turns out things go rather wrong when Xen
intercepts the IO instruction, and replays it to hardware in Xen's GPR
context, rather than the guest kernel's.

This bodge swaps Xen's and dom0's GPRs just around the IO instruction,
so the SMI API gets its parameters properly, and the results get fed
back properly into AML.

There is a related hypercall, SCHEDOP_pin_override, used by dom0,
because sometimes the AML really does need to execute on CPU0, and not
wherever dom0's vcpu0 happens to be executing.

> Thanks again for all your time and effort spent answering my questions.
> I know I'm throwing a lot of unusual questions out there - this
> back-and-forth has been very helpful for me in figuring out *what*
> questions I need to be asking in the first place to understand what's
> feasible to do in the Xen architecture and how I might go about doing
> it. :-)

Not a problem in the slightest.


Xen-devel mailing list