
Re: [PATCH 00/17] Q35 initial support for HVM guests



Hello,

I'm glad someone wants to pick up these patches (and surprised that
they still apply after so many years) - thank you for this effort.
Feel free to proceed; it would be good if you manage to upstream them
to the Xen/QEMU code. But be prepared that it won't be an easy
task - the patches cross multiple areas of responsibility, so it will
take some effort to make all the Xen/QEMU maintainers involved happy.
I don't work on virtualization/x86 anymore and barely remember
anything after 8 years, so I probably won't be able to help much, but
I'll keep an eye on this thread.


Some historical background for the Q35 patches:

The project I was working on relied on Xen for PCIe device
passthrough (mostly GPUs, NICs and storage controllers) to HVM guests.
So PCIe passthrough and HVM were the top priority - that shaped many
of my decisions.

IIRC, there were two major obstacles to successfully passing through any
PCIe device:

1. Even back then, there were **multiple PCIe devices whose drivers
were reading/writing registers in their device's PCIe extended config
space** (offsets above 100h). Supporting this required a working
MMCONFIG/ECAM, which at that time was only available with Q35
emulation => hence Q35 support was added, mostly with PCIe
passthrough in mind. In the process I also discovered that dreadful
"PCIe topology check" issue, which was bypassed by presenting the
passed-through PCIe device to the OS as a chipset built-in device.
This solution was a bit hacky, but it allowed PCIe devices to be
passed through to a Q35 HVM guest successfully.

2. Some devices had mirrors of their BAR values _accessed through
a proprietary mechanism_, e.g. readable via device-specific
MMIO registers. As such, their drivers do not read a BAR value from
the PCI config space but rather get it directly from, say, MMIO whose
layout is completely unknown to us. This makes all BAR emulation in the
hypervisor useless for such devices - the hypervisor returns one value
for BARs read via the PCI config space, but the driver sees the real
values as it bypasses the PCI config space entirely. Among such devices
were Nvidia GPUs, BTW - though not the "pro" models AFAIR, which were
more virtualization-friendly.

That "BAR desync" problem was tricky - I solved it by implementing a
per-device option in the domain config file which, when turned on,
basically enabled 1:1 matching between the virtual and physical BAR
values for that device, without affecting other devices (be they PT or
emulated). This way the virtual addresses in the BARs match the real
physical ones - so the device driver sees the same values whether it
reads the PCI config space or the proprietary registers.

But it wasn't that simple, unfortunately - having a specific "locked"
BAR value means we need to adjust the guest's MMIO hole size
accordingly. A straightforward approach is to make the MMIO hole very
big. This in turn brought other problems to solve:

2.1. When a recent (back then) Windows OS sees a PCI BAR allocation
that is far from perfect, it can completely reallocate all BARs of all
devices to other, very different addresses. They called this feature
PCIe "resource rebalancing" IIRC. This breaks the 1:1 mirroring of a
given device's virtual/physical BARs - it's OK to present BARs with
real physical addresses (the sneaky device driver knows them via MMIO
registers anyway), but allowing the BAR values to be modified is a
no-go, of course.

Luckily, this problem was solved by a specific PCI BAR allocation - the
idea was to keep the MMIO hole as small as possible while avoiding
large unused gaps inside it not claimed by any BAR. It was implemented
inside hvmloader, which populated the MMIO hole taking into account
both fixed and freely movable BARs and then reported the new RAM/MMIO
hole layout back to Xen. This prevented the OS from reallocating the
PCI BARs - and hotplugging still worked thanks to the high MMIO hole
(above 4 GB).

2.2. After experimenting with dynamic resizing of the MMIO hole, I
realized that Xen and QEMU each have their own view of the system
memory layout, and these can get out of sync. MMIO hole resizing did
in fact create this bad situation, producing some hard to
debug/reproduce bugs with unexpected guest memory corruption.

The way I fixed this memory mismatch was by emulating the real Q35
facility for this - namely, the chipset's REMAP register, which was
designed precisely for this purpose: reconfiguring the MMIO hole
size/position while relocating the underlying RAM to another range (so
no RAM is wasted). As the chipset was emulated by QEMU and the whole
idea of HVM was to emulate real hardware as closely as possible, this
was the obvious solution - we do it the way real firmware does, and
then QEMU knows the RAM/MMIO hole layout, allowing it to be synced
with Xen's. Some other fixes relied on this feature too - AFAIR, I
also needed it to make populate-on-demand work with (hotplugged?) PT
devices.

I was planning to send patches for this feature too, after settling the
Q35 patches. I'll try to find the relevant code/notes, maybe they will
be helpful.

On Fri, 13 Mar 2026 16:35:01 +0000
"Thierry Escande" <thierry.escande@xxxxxxxxxx> wrote:

>This series introduces initial Q35 chipset support for HVM guests, based on the
>patchset at [1] by Alexey Gerasimenko.
>
>Basic support means that this patchset allows to start an HVM guest that
>emulates a Q35 chipset via Qemu and implements access to PCIe extended
>configuration space for such devices emulated by Qemu.
>
>Support for PCIe device passthrough is not implemented yet. This is planned but
>implies modifications in the hypervisor and the firmwares, mainly for the
>support of multiple PCI buses.
>
>In order to create a Q35 guest, a new domain config option has been added,
>named 'device_model_machine'. Possible values are:
>- "i440" - i440 emulation (default)
>- "q35"  - emulate a Q35 machine
>
>If the option is omitted it defaults to "i440", not impacting existing domain
>configuration files.
>
>DSDT files for Q35 and i440 are largely similar so the existing file dsdt.asl
>has been split with i440 and q35 specific parts put in separate files.
>
>The PCIe MMCONFIG area is configured by hvmloader and its base address and size
>are set in Xen using a new pair of hypercalls HVMOP_get|set_ecam_space. Access
>to the MMCONFIG area from a guest is trapped by Xen and transferred to the
>emulator as XEN_DMOP_IO_RANGE_PCI ioreq type.
>
>[1] https://lore.kernel.org/xen-devel/cover.1520867740.git.x1917x@xxxxxxxxx/
>
>Thierry Escande (17):
>  libacpi: Split dsdt.asl file and extract i440 specific parts
>  libacpi: new DSDT ACPI table for Q35
>  hvmloader: add function to set the emulated machine type (i440/Q35)
>  hvmloader: add ACPI enabling for Q35
>  hvmloader: add Q35 DSDT table loading
>  hvmloader: Move pci devices setup to a separate function
>  hvmloader: add basic Q35 support
>  hvmloader: Extend PCI BAR struct
>  xev/hvm: Add HVMOP_get|set_ecam_space hypercalls
>  hvmloader: Add support for HVMOP_set|get_ecam_space hypercalls
>  hvmloader: allocate MMCONFIG area in the MMIO hole
>  libxl: Q35 support (new option device_model_machine)
>  libxl: Add xen-platform device for Q35 machine
>  libacpi: build ACPI MCFG table if requested
>  hvmloader: Set MCFG in ACPI table
>  Handle PCIe ECAM space access from guests
>  docs: provide description for device_model_machine option




 

