
Re: [Xen-devel] PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)



On Mon, Nov 10, 2014 at 04:32:48PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 10, 2014 at 01:07:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> > > On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > > > Hey,
> > > > 
> > > > With Xen 4.5 (today's staging), when I boot a guest and then do 
> > > > pci-attach
> > > > the BAR values are corrupt.
> 
> I can reproduce this with Xen 4.4, Xen 4.3 and Xen 4.1.
> 
> A bit digging in and I realized that:
> 
> (XEN) memory_map:add: dom1 gfn=f4000 mfn=d8000 nr=4000 [64M]
> (XEN) AMD-Vi: update_paging_mode Try to access pdev_list without aquiring 
> pcidevs_lock.
> (XEN) memory_map:add: dom1 gfn=f8000 mfn=fc000 nr=2000 [32M]
> (XEN) ioport_map:add: dom1 gport=1000 mport=c000 nr=80
> (XEN) AMD-Vi: Disable: device id = 0x500, domain = 0, paging mode = 3
> (XEN) AMD-Vi: Setup I/O page table: device id = 0x500, type = 0x1, root table 
> = 0x228b02000, domain = 1, paging mode = 3
> 
> The size annotations are my own editing. This means QEMU is putting the
> devices in the MMIO region - and doing it successfully. But then:
> 
> > > 
> > > 
> > > > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 
> > > > 0x08000000 64bit  pref]
> > [  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit 
> > pref]
> 
> .. The guest computes the right size for them, but reads the wrong BAR value
> that was set by QEMU and also created in the hypervisor.
> 
> Perhaps this is the Linux kernel being on the fritz. Will try another kernel.

I figured this out.


When we pass in the device at bootup, the hvmloader does:

(d4) pci dev 05:0 bar 14 size 008000000: 0e000000c
(d4) pci dev 05:0 bar 1c size 004000000: 0e800000c
(d4) pci dev 05:0 bar 10 size 002000000: 0ec000000
(d4) pci dev 05:0 bar 24 size 000000080: 00000c201

That is - it finds the size, and then it sets the BARs to fit within
the MMIO region. QEMU is not involved in this.
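
(For reference, the sizing in that log is the standard PCI probe: save the
BAR, write all-ones to it, read it back, restore it, and the zeroed low bits
give the size. A minimal sketch of that - the pci_readl()/pci_writel()
config-space accessors below are placeholders, not hvmloader's actual
helpers:)

#include <stdint.h>

/* Placeholder config-space accessors - not hvmloader's real helpers. */
uint32_t pci_readl(unsigned int devfn, unsigned int reg);
void pci_writel(unsigned int devfn, unsigned int reg, uint32_t val);

/* Size a 32-bit memory BAR the way firmware does: write all-ones,
 * read back, restore, mask the flag bits, take two's complement. */
static uint64_t bar_size(unsigned int devfn, unsigned int reg)
{
    uint32_t orig = pci_readl(devfn, reg);
    uint32_t probe;

    pci_writel(devfn, reg, 0xffffffff);
    probe = pci_readl(devfn, reg);
    pci_writel(devfn, reg, orig);        /* restore the original value */

    probe &= ~0xfU;                      /* drop the memory BAR flag bits */
    return probe ? ((~(uint64_t)probe + 1) & 0xffffffffULL) : 0;
}

(A 64-bit BAR such as bar 14 above also needs the same dance on the upper
half at reg+4, but the idea is the same.)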

When we PCI hot-insert a device, the BARs are not set at all - and hence
the Linux kernel is the one that tries to assign the BARs. The
reason it cannot fit the device in the MMIO region is that the
_CRS only advertises certain ranges (even though the MMIO region can
cover 2GB). See:

Without any devices at boot (and me doing the PCI insertion after that):
# dmesg | grep "bus resource"
[    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]

With the device (my GPU card) inserted so that hvmloader can enumerate it:
# dmesg | grep 'resource'
[    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
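
Doing the arithmetic on those two windows shows the problem: the GPU's three
memory BARs from the hvmloader log above need more than the default window,
while the window hvmloader builds when it sees the card at boot has room to
spare - and that is before counting alignment and the emulated devices' BARs
that already sit in the window. A throwaway calculation, using only the
numbers from the logs above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Root bus MMIO windows from the two dmesg outputs above. */
    uint64_t win_hotplug = 0xfbffffffULL - 0xf0000000ULL + 1; /* no device at boot */
    uint64_t win_boot    = 0xfbffffffULL - 0xe0000000ULL + 1; /* device seen by hvmloader */

    /* The GPU's memory BARs as sized by hvmloader above. */
    uint64_t bars = 0x08000000ULL + 0x04000000ULL + 0x02000000ULL;

    printf("window without the device at boot: %4llu MB\n",
           (unsigned long long)(win_hotplug >> 20));   /* 192 MB */
    printf("window with the device at boot:    %4llu MB\n",
           (unsigned long long)(win_boot >> 20));      /* 448 MB */
    printf("MMIO the GPU's BARs need:          %4llu MB\n",
           (unsigned long long)(bars >> 20));          /* 224 MB */
    return 0;
}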

I chatted with Bjorn and Rafael on IRC about how PCI insertion works
on bare metal, and it sounds like Thunderbolt device insertion is an
interesting case. The SMM sets the BAR regions to fit within the MMIO
window (which is advertised by the _CRS) and then pokes the OS to enumerate
the BARs. The OS is free to use what the firmware has set or to reassign
them. The end result is that since the SMM 'fits' the BARs inside the
pre-set _CRS window it all works. We do not do that.

The two ways I can think of to make this work are:
 - QEMU tracks BAR enumeration. When a new device is inserted it would
   set the BARs to fit within the E820 "HOLE" region. If it can't
   (because the MMIO region is too small) it puts them at the end of memory.
   Naturally the 'end of memory' part would require adding a
   _CRS entry to cover the end of GPFN space out to never never land. And the
   _CRS region for the MMIO under 4GB would also have to be expanded so QEMU
   can jam things in there.

 - Or add in dsdt.asl another _CRS region controlled by hvmloader.
   This one would start at the end of GPFN space + the delta of maxmem - mem
   and continue on to never never land. hvmloader would just write the
   values into the BIOS OperationRegion (0xFC000000) and let the
   AML code take care of parsing them and constructing the #9 _CRS region.
   This would allow kernels that are picky about BARs not being inside a
   _CRS region to deal with cards that are hot-plugged after BIOS boot
   (a rough sketch of the hvmloader side follows below).
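
To make the second option a bit more concrete, here is a very rough sketch of
what the hvmloader side could look like. Everything in it is hypothetical -
the struct layout, the field names, and the exact placement inside the BIOS
info area are made up for illustration; the real layout would have to match
whatever the AML in dsdt.asl ends up parsing:

#include <stdint.h>
#include <string.h>

/* Hypothetical layout for the extra window, written where the
 * OperationRegion in dsdt.asl would look for it.  Illustration only. */
struct hotplug_mmio_window {
    uint64_t base;    /* end of GPFN space + (maxmem - mem) delta */
    uint64_t length;  /* how far toward never never land to extend */
};

/* The BIOS OperationRegion mentioned above sits at 0xFC000000; the
 * offset of this struct inside it is a made-up assumption. */
#define HOTPLUG_WINDOW_ADDR 0xFC000000UL

static void advertise_hotplug_window(uint64_t gpfn_end, uint64_t maxmem_delta)
{
    struct hotplug_mmio_window win;

    win.base   = gpfn_end + maxmem_delta;
    win.length = (1ULL << 40) - win.base;   /* arbitrarily stop at 1TB */

    /* hvmloader runs identity-mapped, so it can write the values
     * directly for the AML code to parse into the #9 _CRS entry. */
    memcpy((void *)HOTPLUG_WINDOW_ADDR, &win, sizeof(win));
}

The guest-side AML would then read base/length out of that region and emit a
matching descriptor (e.g. a QWordMemory entry) in the root bus _CRS.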




 

