Re: [PATCH 00/17] Q35 initial support for HVM guests
Hello, I'm glad someone wants to pick up these patches (and surprised that they still apply after so many years), thank you for this effort. Feel free to proceed; it would be good if you manage to upstream them into the Xen/QEMU code. But be prepared that it won't be an easy task - the patches cross multiple areas of responsibility, so it will take some effort to make all the involved Xen/QEMU maintainers happy. I don't work on virtualization/x86 anymore and I barely remember anything after 8 years, so I probably won't be able to help much, but I'll keep an eye on this thread.

Some historical background for the Q35 patches: the project I was working on relied on Xen for PCIe device passthrough (mostly GPUs, NICs and storage controllers) to HVM guests. PCIe passthrough and HVM were therefore the top priority - this shaped many of my decisions. IIRC, there were two major obstacles to successfully passing through an arbitrary PCIe device:

1. Even back then, there were **multiple PCIe devices whose drivers were reading/writing registers in their device's PCIe extended config space** (offsets above 100h). Supporting this required a working MMCONFIG/ECAM, which at the time was only available with Q35 emulation - hence Q35 support was added, mostly with PCIe passthrough in mind. In the process I also discovered that dreadful "PCIe topology check" issue, which was bypassed by presenting the passed-through PCIe device to the OS as a chipset built-in device. This solution was a bit hacky, but it made it possible to successfully pass PCIe devices through to a Q35 HVM guest.

2. Some devices had mirrors of their BAR registers' values _accessed through a proprietary mechanism_, e.g. via device-specific MMIO registers. Their drivers therefore do not read a BAR value from PCI config space but fetch it directly from, say, MMIO, whose layout is completely unknown to us.
This makes all BAR emulation in the hypervisor useless for such a device - the hypervisor returns one value for BARs read via PCI config space, but the driver sees the real values because it bypasses PCI config space entirely. Nvidia GPUs were among such devices, BTW - though not the "pro" models AFAIR, which were more virtualization-friendly.

That "BAR desync" problem was tricky. I solved it by implementing a per-device option in the domain config file which, when enabled, establishes a 1:1 mapping between virtual and physical BAR values for that device, without affecting other devices (passed-through or emulated). This way the physical addresses in the virtual BARs match the real ones, so the device driver sees the same values whether it looks at PCI config space or at its proprietary registers. But it wasn't that simple, unfortunately - having a fixed, "locked" BAR value means we need to adjust the guest's MMIO hole size accordingly. A straightforward approach is to make the MMIO hole very big. This in turn brought other problems to solve:

2.1. When a (then) recent Windows version sees a PCI BAR allocation that is far from perfect, it can completely reallocate the BARs of all devices to very different addresses - they called this PCIe "resource rebalancing" IIRC. This breaks the 1:1 mirroring of a device's virtual/physical BARs: it's fine to present BARs with real physical addresses (the sneaky device driver knows them via MMIO registers anyway), but letting the OS modify the BAR values is a no-go, of course. Luckily, this problem was solved by a specific PCI BAR allocation strategy - the idea was to keep the MMIO hole as small as possible while avoiding large unused gaps inside it that no BAR claims. It was implemented inside hvmloader, which populated the MMIO hole taking both fixed and freely movable BARs into account, and then reported the new RAM/MMIO-hole layout back to Xen.
This prevented the OS from reallocating the PCI BARs - and hotplug still worked thanks to the high MMIO hole (above 4 GB).

2.2. After experimenting with dynamic resizing of the MMIO hole, I realized that Xen and QEMU each keep their own view of the system memory layout, and these views can get out of sync. MMIO hole resizing was in fact creating this bad situation, giving some hard-to-debug/reproduce bugs with unexpected guest memory corruption. I fixed this mismatch by emulating the real Q35 facility for it - namely, the chipset's REMAP register, which was designed precisely for this goal: to reconfigure the MMIO hole size/position while relocating the underlying RAM to another range (so no RAM is wasted). As the chipset was emulated by QEMU and the whole idea of HVM was to emulate real hardware as closely as possible, this was the obvious solution - we do it the way real firmware does, and then QEMU knows the RAM/MMIO-hole layout and can sync it with Xen's. Some other fixes relied on this feature as well - AFAIR, I also needed it to make populate-on-demand work with (hotplugged?) PT devices. I was planning to send patches for this feature too, after settling the Q35 patches. I'll try to find the relevant code/notes; maybe they will be helpful.

On Fri, 13 Mar 2026 16:35:01 +0000 "Thierry Escande" <thierry.escande@xxxxxxxxxx> wrote:

>This series introduces initial Q35 chipset support for HVM guests, based on the
>patchset at [1] by Alexey Gerasimenko.
>
>Basic support means that this patchset allows starting an HVM guest that
>emulates a Q35 chipset via QEMU and implements access to the PCIe extended
>configuration space for devices emulated by QEMU.
>
>Support for PCIe device passthrough is not implemented yet. This is planned but
>implies modifications in the hypervisor and the firmware, mainly to support
>multiple PCI buses.
>
>In order to create a Q35 guest, a new domain config option has been added,
>named 'device_model_machine'. Possible values are:
>- "i440" - i440 emulation (default)
>- "q35" - emulate a Q35 machine
>
>If the option is omitted it defaults to "i440", so existing domain
>configuration files are not impacted.
>
>The DSDT files for Q35 and i440 are largely similar, so the existing file
>dsdt.asl has been split, with the i440- and q35-specific parts put in
>separate files.
>
>The PCIe MMCONFIG area is configured by hvmloader and its base address and size
>are set in Xen using a new pair of hypercalls, HVMOP_get|set_ecam_space.
>Accesses to the MMCONFIG area from a guest are trapped by Xen and transferred
>to the emulator as the XEN_DMOP_IO_RANGE_PCI ioreq type.
>
>[1] https://lore.kernel.org/xen-devel/cover.1520867740.git.x1917x@xxxxxxxxx/
>
>Thierry Escande (17):
>  libacpi: Split dsdt.asl file and extract i440 specific parts
>  libacpi: new DSDT ACPI table for Q35
>  hvmloader: add function to set the emulated machine type (i440/Q35)
>  hvmloader: add ACPI enabling for Q35
>  hvmloader: add Q35 DSDT table loading
>  hvmloader: Move pci devices setup to a separate function
>  hvmloader: add basic Q35 support
>  hvmloader: Extend PCI BAR struct
>  xen/hvm: Add HVMOP_get|set_ecam_space hypercalls
>  hvmloader: Add support for HVMOP_set|get_ecam_space hypercalls
>  hvmloader: allocate MMCONFIG area in the MMIO hole
>  libxl: Q35 support (new option device_model_machine)
>  libxl: Add xen-platform device for Q35 machine
>  libacpi: build ACPI MCFG table if requested
>  hvmloader: Set MCFG in ACPI table
>  Handle PCIe ECAM space access from guests
>  docs: provide description for device_model_machine option