
[Xen-devel] Notes for upcoming PCI emulation call



I’ll try to summarize the current issues/difficulties in extending PCIe
passthrough support, along with the possible ways to resolve these
problems which have been discussed on the mailing list so far.

Possible options to extend PCI passthrough/emulation capabilities
-----------------------------------------------------------------

There is a growing need to support PCIe-specific features for PCI
passthrough. A lot of devices have PCIe Extended Capabilities above the
100h offset. Even if we don’t want to support these capabilities in Xen
right away, a proprietary driver for a passed through device might want
to use them anyway; the Vendor-Specific Extended Capability is a classic
example, though a device driver may try to read any other Extended
Capability from its device’s config space.

Apart from supporting PCIe Extended Capabilities, another possible (and
big) direction is supporting PCIe-specific features in general, like
native PCIe hotplug, new PM facilities, or forwarding AER events to the
guest OS. This will require some cooperation between passed through and
emulated devices in a PCIe hierarchy, which in turn needs major changes
in the emulated PCI bus architecture. At the moment, all PCIe devices
are passed through in legacy PCI mode in Xen, so there is currently no
support for PCIe-specific features like the extended PCI config space
via ECAM.

Even providing support for PCIe Extended Capabilities alone requires
some changes; we need to
1. Emulate ECAM (MMIO accesses to the MMCONFIG area) to allow
   reading/writing the PCIe extended configuration space
2. Present a PCIe-capable system to the guest OS.
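As a reference point for the ECAM emulation, the MMIO address of a
function's config register inside an MMCONFIG window is fixed by the
PCIe specification (bus in bits 27:20, device in 19:15, function in
14:12, register in 11:0). A minimal sketch, where the helper name and
mmcfg_base are illustrative:

```c
#include <stdint.h>

/*
 * ECAM maps each function's 4 KiB extended config space into MMIO.
 * The field layout below (bus[27:20], device[19:15], function[14:12],
 * register[11:0]) comes from the PCIe specification; mmcfg_base stands
 * in for the guest's MMCONFIG window base address.
 */
static uint64_t ecam_addr(uint64_t mmcfg_base, uint8_t bus,
                          uint8_t dev, uint8_t fn, uint16_t reg)
{
    return mmcfg_base |
           ((uint64_t)bus << 20) |
           ((uint64_t)(dev & 0x1f) << 15) |
           ((uint64_t)(fn & 0x7) << 12) |
           (reg & 0xfff);
}
```

For example, register 100h of device 00:03.0 lands at mmcfg_base +
18100h, which is exactly why a 4 KiB-granular trap on the window is
enough to recover the target BDF.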

This can be achieved by adding QEMU Q35 emulation support to Xen (an RFC
patch series for this feature was sent). For ECAM, in the simplest
case, QEMU’s existing MMCONFIG emulation can be reused. However, there
are at least two incompatibility problems which need a solution. These
are:

- The multiple PCI device emulators feature, used by VGPU in XenServer

- Emulating a (simplest possible) upstream PCIe hierarchy for passed
through PCIe devices. The issue was described in detail here:
http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03593.html

The latter problem must be resolved properly by introducing emulated
PCIe Root Ports for passed through devices. Basically, this means we
need to emulate PCI-PCI bridges whose secondary bus hosts the real
passed through devices, ideally using function grouping for related
devices like a GPU and its HDAudio function.

There are different approaches to _who_ should emulate these PCI-PCI
bridges. QEMU has support for emulated RPs and PCIe switches, but we
might want to remove that privilege from QEMU, as emulating RPs/switches
above _real_ passed through PCIe devices is more of a system-level
concern. We also need to consider future PCIe passthrough extensions,
like handling PM events from passed through PCIe devices, as such
features assume some additional support in the upstream PCIe hierarchy.

So, we need to decide who will control the emulated Root Ports for
passed through devices: either Xen or QEMU. For a number of reasons it
would be beneficial to do it on the Xen side; on the other hand,
sticking with QEMU allows reusing existing functionality.

Now, regarding the multiple PCI device emulators. With multiple PCI
device emulators, a specific passed through device may be assigned to a
separate (non-QEMU) device model. At the low level this will appear as
more than one IOREQ server being present: most PCI devices will still
be handled by QEMU, with some assigned to another (device-specific)
device model, a distinct binary, via the same
xc_hvm_map_pcidev_to_ioreq_server() call. Later,
hvm_select_ioreq_server() will select the proper device model
destination based on the BDF location of the device, and ioreqs will be
sent to the chosen target.
This works well for legacy CF8h/CFCh PCI config accesses, but MMCONFIG
support introduces some problems.
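The BDF-based dispatch can be pictured with a small sketch; the names
and data layout below are purely illustrative, not the actual
hvm_select_ioreq_server() implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of BDF-based IOREQ routing. Devices claimed via
 * a map_pcidev_to_ioreq_server-style call get an entry; everything
 * else falls through to the default server (QEMU). */

#define DEFAULT_SERVER 0u   /* QEMU handles unclaimed devices */

struct pcidev_map {
    uint16_t bdf;        /* bus << 8 | dev << 3 | fn */
    unsigned server_id;  /* IOREQ server that claimed this device */
};

static unsigned select_server(const struct pcidev_map *map, size_t n,
                              uint16_t bdf)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].bdf == bdf)
            return map[i].server_id;
    return DEFAULT_SERVER;
}
```

The key point is that the lookup key is only the BDF, which the
hypervisor can always extract from a CF8h/CFCh access pair; MMCONFIG
accesses need extra knowledge before the same key can be derived.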

First of all, MMCONFIG itself is a chipset-specific thing. Both the
registers which control it and the number of MMCONFIG ranges
(ECAM-capable PCIe segments) may differ between emulated machines. This
means that some designated device model should control it according to
the user-selected emulated machine; a device-specific device model
doesn’t know anything about the emulated machine.

Secondly, in order to have all the necessary information to forward
ioreqs to the correct device model, Xen needs to know
1. The MMCONFIG base address and size (ideally extendable to support
   multiple MMCONFIGs)
2. The MMCONFIG layout, corresponding to the current map of the PCI
   bus. This layout may change at any time due to a PCI-PCI bridge
   re-initialization or a device being hotplugged.

There are different options for passing this information to Xen; in
some solutions Xen may even control it itself.

The MMCONFIG layout can be obtained passively, by simply observing
map_pcidev_to_ioreq_server calls to determine and store all emulated
PCI device BDF locations.
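For illustration, once Xen knows the MMCONFIG base, a trapped MMIO
access can be translated back into a (BDF, register) pair and then
routed exactly like a legacy config access. This is a hypothetical
sketch, not actual Xen code:

```c
#include <stdint.h>

/* Sketch: translate a trapped MMIO access inside the MMCONFIG window
 * back into a (bdf, reg) pair so it can be dispatched like a legacy
 * CF8h/CFCh access. The field layout follows ECAM; names are
 * illustrative. */

struct cfg_access {
    uint16_t bdf;  /* bus << 8 | dev << 3 | fn */
    uint16_t reg;  /* 0 .. 0xfff */
};

static struct cfg_access decode_mmcfg(uint64_t addr, uint64_t mmcfg_base)
{
    uint64_t off = addr - mmcfg_base;
    struct cfg_access a = {
        .bdf = (uint16_t)(off >> 12),   /* bus(8) + dev(5) + fn(3) bits */
        .reg = (uint16_t)(off & 0xfff), /* register within the function */
    };
    return a;
}
```

This is why both pieces of information listed above are needed: without
the base address the offset can't be computed, and without the current
bus layout the recovered BDF can't be matched to an IOREQ server.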

Another thing to consider here is the MMIO hole layout and its impact.
For example, adding PCI-PCI bridges creates some complication, as they
provide windows in IO/MMIO space which must be sized according to the
secondary PCI bus content. In some cases, like hotplugging a PCIe
device (which should belong to some RP or switch DP), existing bridge
windows might be too small to provide space for the newly added device,
triggering PCI-PCI bridge and BAR re-initialization (aka PCI resource
rebalancing in Windows terms) in the guest. This action may change the
PCI bus layout, which needs to be addressed somehow. Also, by utilizing
the ACPI _DSM method (luckily not our case, as we don’t provide it),
Windows may invoke a complete PCI BAR/PCI-PCI bridge re-initialization
unconditionally on system boot.


Possible directions to make multiple PCI device emulators compatible
with PCIe/MMCONFIG
--------------------------------------------------------------------

I. “Notification” approach. In this case QEMU will continue to emulate
PCIEXBAR and handle MMCONFIG accesses, but upon encountering any change
in the PCIEXBAR value, QEMU will report it to Xen via any suitable
channel: either a dedicated dmop, a XenStore param or anything else.
Xen will store this information and use it to select the proper IOREQ
server destination for trapped MMCONFIG accesses.
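For instance, assuming a Q35-style PCIEXBAR (enable in bit 0, window
length in bits 2:1, with 0 = 256 MiB, 1 = 128 MiB, 2 = 64 MiB), the
reported value could be decoded roughly like this; the structure and
function names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of decoding a Q35-style PCIEXBAR value reported by QEMU.
 * Bit 0 is the enable bit; bits 2:1 encode the window length
 * (0 = 256 MiB, 1 = 128 MiB, 2 = 64 MiB). The base address is
 * naturally aligned to the window size, so masking off the low bits
 * (which also covers the control bits) recovers it. */

struct mmcfg_window {
    bool     enabled;
    uint64_t base;
    uint64_t size;
};

static struct mmcfg_window decode_pciexbar(uint64_t val)
{
    struct mmcfg_window w;
    unsigned length_code = (val >> 1) & 3;

    w.enabled = val & 1;
    w.size    = (uint64_t)0x10000000 >> length_code; /* 256M/128M/64M */
    w.base    = val & ~(w.size - 1);
    return w;
}
```

Xen would update its trap region whenever such a decoded window changes,
which is exactly the state the notification channel has to carry.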

II. “Own chipset device model”. In this case Xen will emulate some
chipset-specific devices itself. Of particular interest are the MCH and
ICH9. Both the emulated Root Complex and Root Ports will belong to Xen,
allowing PCIe-specific features like AER reporting to be implemented in
any convenient way. Ideally, only a set of distinct PCIDevices will
remain on the QEMU side: storage, networking, etc. A dummy pci-host
will forward IOREQ_TYPE_PCI_CONFIG accesses for the remaining
PCIDevices. The PCI bus layout seen by QEMU can then differ from the
real layout seen by the guest. The final result will look like a new,
very reduced QEMU machine with a dummy PCIBus/ISABus, perhaps even based
on top of the QEMU null machine.

While this approach is beneficial in many ways, it will affect
compatibility with QEMU very, very badly. For example, the NVDIMM
support patches from Intel rely on QEMU ACPI facilities which can become
completely inoperable once the emulated NB+SB and their corresponding
subtypes and properties are removed. Multiple similar issues and
breakages may arise in the future, though the QEMU PM/ACPI facilities
are the main problem. Note that Xen already emulates some of the PMBASE
registers, and the PMBASE value itself is hardcoded (at B000h, IIRC).
Emulating the PMBASE BAR ourselves would allow removing this limitation.

III. “Transparent emulation”. In this case Xen will intercept only some
known registers of the chipset-specific devices emulated by QEMU:
PCIEXBAR, PMBASE, possibly MMIO-hole-controlling registers and some
others. A handler for this kind of register can be selectively called
before or after the corresponding DM emulation (at different stages of
IOREQ processing) and should be free to specify whether the DM may see
the read/write (otherwise it is handled internally). This will allow us
to provide our own PCIEXBAR/MMCONFIG emulation while keeping
compatibility with QEMU. Zero changes will be needed on the QEMU side.
Xen will detect the emulated chipset either passively or by sending
IOREQ_TYPE_PCI_CONFIG to read the VID/DID from the device model
directly. The NB/SB VID/DID values will be used to distinguish between
different emulated machines and to set up the correct handlers for
chipset-specific registers.
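A sketch of such VID/DID-based detection, using the well-known IDs of
the host bridges QEMU emulates at 00:00.0 (the surrounding plumbing,
enum and function names are hypothetical):

```c
#include <stdint.h>

/* Sketch: identify the emulated machine from the host bridge's
 * VID/DID at 00:00.0. QEMU's i440FX machine presents the Intel
 * 82441FX host bridge (8086:1237) and its Q35 machine presents the
 * Q35 MCH (8086:29c0); how the IDs are fetched (e.g. via an
 * IOREQ_TYPE_PCI_CONFIG read) is left out of this sketch. */

enum machine { MACHINE_UNKNOWN, MACHINE_I440FX, MACHINE_Q35 };

static enum machine detect_machine(uint16_t vid, uint16_t did)
{
    if (vid != 0x8086)          /* Intel vendor ID */
        return MACHINE_UNKNOWN;

    switch (did) {
    case 0x1237:                /* 82441FX host bridge (i440FX) */
        return MACHINE_I440FX;
    case 0x29c0:                /* Q35 MCH host bridge */
        return MACHINE_Q35;
    default:
        return MACHINE_UNKNOWN;
    }
}
```

The detected machine would then select which chipset-specific register
handlers (PCIEXBAR, PMBASE, etc.) Xen installs.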

Due to the requirement for a PCIe device to cooperate with the upstream
PCIe hierarchy (at least to belong to some RP/switch), some changes for
multiple PCI emulator support must be made regardless of the chosen
solution.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
