
Re: [Xen-devel] PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)



On Mon, Nov 10, 2014 at 04:32:48PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 10, 2014 at 01:07:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> > > On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > > > Hey,
> > > > 
> > > > With Xen 4.5 (today's staging), when I boot a guest and then do 
> > > > pci-attach
> > > > the BAR values are corrupt.
> 
> I can reproduce this with Xen 4.4, Xen 4.3 and Xen 4.1.
> 
> A bit digging in and I realized that:
> 
> (XEN) memory_map:add: dom1 gfn=f4000 mfn=d8000 nr=4000 [64M]
> (XEN) AMD-Vi: update_paging_mode Try to access pdev_list without aquiring 
> pcidevs_lock.
> (XEN) memory_map:add: dom1 gfn=f8000 mfn=fc000 nr=2000 [32M]
> (XEN) ioport_map:add: dom1 gport=1000 mport=c000 nr=80
> (XEN) AMD-Vi: Disable: device id = 0x500, domain = 0, paging mode = 3
> (XEN) AMD-Vi: Setup I/O page table: device id = 0x500, type = 0x1, root table 
> = 0x228b02000, domain = 1, paging mode = 3
> 
> The size annotations are my own editing. This means QEMU is putting the
> devices in the MMIO region - and doing it successfully. But then:
> 
> > > 
> > > 
> > > > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 
> > > > 0x08000000 64bit  pref]
> > [  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit 
> > pref]
> 
> .. The guest computes the right size for them, but reads the wrong BAR value
> that was set by QEMU and also created in the hypervisor.
> 
> Perhaps this is the Linux kernel being on the fritz. Will try another kernel.

I figured this out.


When we pass in the device at bootup, the hvmloader does:

(d4) pci dev 05:0 bar 14 size 008000000: 0e000000c
(d4) pci dev 05:0 bar 1c size 004000000: 0e800000c
(d4) pci dev 05:0 bar 10 size 002000000: 0ec000000
(d4) pci dev 05:0 bar 24 size 000000080: 00000c201

That is - it finds the size, and then it sets the BARs to fit within
the MMIO region. QEMU is not involved in this.
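
(For reference, the sizing in that log is the standard PCI probe: save the
BAR, write all-ones to it, read it back, restore it, and the zeroed low bits
give the size. A minimal sketch of that - the pci_readl()/pci_writel()
config-space accessors below are placeholders, not hvmloader's actual
helpers:)

#include <stdint.h>

/* Placeholder config-space accessors - not hvmloader's real helpers. */
uint32_t pci_readl(unsigned int devfn, unsigned int reg);
void pci_writel(unsigned int devfn, unsigned int reg, uint32_t val);

/* Size a 32-bit memory BAR the way firmware does: write all-ones,
 * read back, restore, mask the flag bits, take two's complement. */
static uint64_t bar_size(unsigned int devfn, unsigned int reg)
{
    uint32_t orig = pci_readl(devfn, reg);
    uint32_t probe;

    pci_writel(devfn, reg, 0xffffffff);
    probe = pci_readl(devfn, reg);
    pci_writel(devfn, reg, orig);        /* restore the original value */

    probe &= ~0xfU;                      /* drop the memory BAR flag bits */
    return probe ? ((~(uint64_t)probe + 1) & 0xffffffffULL) : 0;
}

(A 64-bit BAR such as bar 14 above also needs the same dance on the upper
half at reg+4, but the idea is the same.)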

When we PCI hot-insert a device, the BARs are not set at all - and hence
the Linux kernel is the one that tries to assign the BARs. The
reason it cannot fit the device in the MMIO region is that the
_CRS only advertises certain ranges (even though the MMIO region can
cover 2GB). See:

Without any devices at boot (and me doing the PCI insertion after that):
# dmesg | grep "bus resource"
[    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]

With the device (my GPU card) inserted so that hvmloader can enumerate it:
# dmesg | grep 'resource'
[    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
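
Doing the arithmetic on those two windows shows the problem: the GPU's three
memory BARs from the hvmloader log above need more than the default window,
while the window hvmloader builds when it sees the card at boot has room to
spare - and that is before counting alignment and the emulated devices' BARs
that already sit in the window. A throwaway calculation, using only the
numbers from the logs above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Root bus MMIO windows from the two dmesg outputs above. */
    uint64_t win_hotplug = 0xfbffffffULL - 0xf0000000ULL + 1; /* no device at boot */
    uint64_t win_boot    = 0xfbffffffULL - 0xe0000000ULL + 1; /* device seen by hvmloader */

    /* The GPU's memory BARs as sized by hvmloader above. */
    uint64_t bars = 0x08000000ULL + 0x04000000ULL + 0x02000000ULL;

    printf("window without the device at boot: %4llu MB\n",
           (unsigned long long)(win_hotplug >> 20));   /* 192 MB */
    printf("window with the device at boot:    %4llu MB\n",
           (unsigned long long)(win_boot >> 20));      /* 448 MB */
    printf("MMIO the GPU's BARs need:          %4llu MB\n",
           (unsigned long long)(bars >> 20));          /* 224 MB */
    return 0;
}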

I chatted with Bjorn and Rafael on IRC about how PCI insertion works
on bare metal, and it sounds like Thunderbolt device insertion is an
interesting case. The SMM sets the BAR regions to fit within the MMIO
window (which is advertised by the _CRS) and then pokes the OS to enumerate
the BARs. The OS is free to use what the firmware has set or to reassign
them. The end result is that since the SMM 'fits' the BARs inside the
pre-set _CRS window it all works. We do not do that.

The two ways I can think of to make this work are:
 - QEMU tracks BAR enumeration. When a new device is inserted it would
   set the BARs to fit within the E820 "HOLE" region. If it can't
   (because the MMIO region is too small) it puts them at the end of memory.
   Naturally the 'end of memory' part would require adding a
   _CRS entry to cover the end of GPFN space out to never never land. And the
   _CRS region for the MMIO under 4GB would also have to be expanded so QEMU
   can jam things in there.

 - Or add in dsdt.asl another _CRS region controlled by hvmloader.
   This one would start at the end of GPFN space + the delta of maxmem - mem
   and continue on to never never land. hvmloader would just write the
   values into the BIOS OperationRegion (0xFC000000) and let the
   AML code take care of parsing them and constructing the #9 _CRS region.
   This would allow kernels that are picky about BARs not being inside a
   _CRS region to deal with cards that are hot-plugged after BIOS boot
   (a rough sketch of the hvmloader side follows below).
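
To make the second option a bit more concrete, here is a very rough sketch of
what the hvmloader side could look like. Everything in it is hypothetical -
the struct layout, the field names, and the exact placement inside the BIOS
info area are made up for illustration; the real layout would have to match
whatever the AML in dsdt.asl ends up parsing:

#include <stdint.h>
#include <string.h>

/* Hypothetical layout for the extra window, written where the
 * OperationRegion in dsdt.asl would look for it.  Illustration only. */
struct hotplug_mmio_window {
    uint64_t base;    /* end of GPFN space + (maxmem - mem) delta */
    uint64_t length;  /* how far toward never never land to extend */
};

/* The BIOS OperationRegion mentioned above sits at 0xFC000000; the
 * offset of this struct inside it is a made-up assumption. */
#define HOTPLUG_WINDOW_ADDR 0xFC000000UL

static void advertise_hotplug_window(uint64_t gpfn_end, uint64_t maxmem_delta)
{
    struct hotplug_mmio_window win;

    win.base   = gpfn_end + maxmem_delta;
    win.length = (1ULL << 40) - win.base;   /* arbitrarily stop at 1TB */

    /* hvmloader runs identity-mapped, so it can write the values
     * directly for the AML code to parse into the #9 _CRS entry. */
    memcpy((void *)HOTPLUG_WINDOW_ADDR, &win, sizeof(win));
}

The guest-side AML would then read base/length out of that region and emit a
matching descriptor (e.g. a QWordMemory entry) in the root bus _CRS.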




 

