
Re: [Xen-devel] PCIe devices that are hotplugged after MMIO has been setup fail due to _CRS not covering 64-bit area



On Thu, Oct 13, 2016 at 03:20:24AM -0600, Jan Beulich wrote:
> >>> On 12.10.16 at 23:15, <konrad.wilk@xxxxxxxxxx> wrote:
> > On Wed, Sep 28, 2016 at 03:21:08AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 16:43, <konrad.wilk@xxxxxxxxxx> wrote:
> >> > If the guest is booted with 'pci' we nicely expand the MMIO region below
> >> > 4GB and try to fit in the BARs in there. If that fails (not enough
> >> > space) we move it above the memory (64-bit). And throughout all of this
> >> > we also update the _CRS field to cover these ranges.
> >> > 
> >> > (Note, I need to check if the 64-bit area is also set, I think it is).
> >> > 
> >> > But the situation is different if we hot-plug a device that has too big
> >> > BAR to fit in the MMIO region. We move it in the 64-bit area but we
> >> > don't update the _CRS. Which means that Linux will complain (unless
> >> > booted with pci=nocrs). Not sure about Windows but I would assume so
> >> > too.
> >> > 
> >> > I was wondering what would be a good way to solve this? I looked at some
> >> > Dell machines to see how they deal with hotplug PCIe devices and they
> >> > just declared all the memory in the _CRS (including RAM).
> >> > 
> >> > We could do a hybrid - during bootup make the _CRS region have entry from
> >> > end of RAM to .. end of memory?
> >> 
> >> End of physical address space you mean? Generally yes, but we
> >> need to be a little careful there: For one, on AMD we'd better not
> >> overlap with the HT area. And then there's this MTRR related
> >> comment next to the setting of pci_hi_mem_end (albeit both HT
> >> area start and end of PA space should be aligned well enough).

This got interesting. The existing code that sets the variable
MTRRs runs out of them when covering, say, 1<<36 of space. The reason
is that it starts with small granularity (4KB) and builds up
from there, so covering, say, 4GB to 64GB exhausts the MTRRs.
I modified it to be subtractive, so that it starts with
large areas and then uses smaller and smaller ones:

(d2)  - CPU0 ... 36-bit phys ... fixed MTRRs ... Cover @000004344(MB) to 000065536(MB) with 7 MTRRs.
(d2) MTRR 1     @000004344(MB)  000037112(MB)
(d2) MTRR 2     @000037112(MB)  000053496(MB)
(d2) MTRR 3     @000053496(MB)  000061688(MB)
(d2) MTRR 4     @000061688(MB)  000063736(MB)
(d2) MTRR 5     @000063736(MB)  000064760(MB)
(d2) MTRR 6     @000064760(MB)  000065272(MB)
(d2) MTRR 7     @000065272(MB)  000065528(MB)
(d2) var MTRRs [8/8] ... done.

But of course on 48-bit hardware, even with this we ran out of MTRRs:
(d1)  - CPU0 ... 48-bit phys ... fixed MTRRs ... Cover @000004344(MB) to 0268435456(MB) with 7 MTRRs.
(d1) MTRR 1     @000004344(MB)  0134222072(MB)
(d1) MTRR 2     @0134222072(MB) 0201330936(MB)
(d1) MTRR 3     @0201330936(MB) 0234885368(MB)
(d1) MTRR 4     @0234885368(MB) 0251662584(MB)
(d1) MTRR 5     @0251662584(MB) 0260051192(MB)
(d1) MTRR 6     @0260051192(MB) 0264245496(MB)
(d1) MTRR 7     @0264245496(MB) 0266342648(MB)
(d1) var MTRRs [8/8] ... done.

[I figured that it would be OK to set the UC MTRR even for the
HT region: FC FFFF FFFF -> FF FFFF FFFF as you surely don't want WB there?]

Then it occurred to me that maybe I am overthinking it and
should just pick the biggest one:

(d32) Multiprocessor initialisation:
(d32)  - CPU0 ... 48-bit phys ... fixed MTRRs ... Cover @000004344(MB) to 0268435456(MB) with 7 MTRRs.
(d32) MTRR 1    @000004344(MB)  0268439800(MB)
(d32) var MTRRs [1/8] ... done.

Which would cover _past_ the end of the CPU's physical address space, but
that surely won't be healthy for the CPU? The Intel SDM doesn't mention what
happens then.

Also I realized that the "Range Size and Alignment Requirement" isn't met
by the code I wrote - a range of size 2^n must be aligned on a
2^n boundary, and that is certainly not the case here.
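
To make the alignment problem concrete, here is a minimal sketch of the
variable-range match rule as the SDM describes it (an address matches when
it equals the base under the size mask). The function name is made up for
illustration, and the base and size are taken from "MTRR 1" in the 36-bit
log above. Following the rule literally, an unaligned base ends up
describing a different range than the one intended:

```python
MB = 1 << 20

def mtrr_covers(addr, base, size):
    """Hypothetical helper: variable-range MTRR match, i.e.
    (addr & mask) == (base & mask), with the mask derived from the size."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    mask = ~(size - 1)
    return (addr & mask) == (base & mask)

# "MTRR 1" from the 36-bit log: base 4344MB, size 32768MB (not aligned).
base, size = 4344 * MB, 32768 * MB

# Under the mask the base collapses to 0, so low RAM matches ...
print(mtrr_covers(1024 * MB, base, size))
# ... while an address inside the intended range does not.
print(mtrr_covers(33000 * MB, base, size))
```

In other words, if the hardware applies the rule as written, such an entry
would act on 0-32GB rather than on 4344MB-37112MB.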

Anyhow, the point is that even with these modifications I will
still run into the variable MTRR limit if I am to cover most of the
space. I can cover up to a certain value. And that 'value' could
become the pci_hi_mem_end?
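
A rough way to estimate that 'value' is to split the range into naturally
aligned power-of-two blocks and count them. The sketch below is a simple
greedy split, not the hvmloader code. The function names are hypothetical,
the 4344MB start is taken from the logs above, and the budget of 7 is the
number of registers the logs report as available:

```python
MB = 1 << 20

def aligned_blocks(start, end):
    """Greedily split [start, end) into power-of-two blocks, each aligned
    on its own size (the MTRR size/alignment requirement)."""
    blocks = []
    while start < end:
        # Largest power of two that still fits in the remaining span.
        size = 1 << ((end - start).bit_length() - 1)
        if start:
            # Also limited by the alignment of the current start address.
            size = min(size, start & -start)
        blocks.append((start, size))
        start += size
    return blocks

def reachable_end(start, end, budget):
    """Highest address reached when only `budget` blocks may be used."""
    covered = aligned_blocks(start, end)[:budget]
    return covered[-1][0] + covered[-1][1] if covered else start

start = 4344 * MB
print(len(aligned_blocks(start, 1 << 36)))        # registers for 36-bit
print(len(aligned_blocks(start, 1 << 48)))        # registers for 48-bit
print(reachable_end(start, 1 << 48, 7) // MB)     # end in MB with 7 registers
```

With this split the 36-bit case needs 8 registers and the 48-bit case 20,
and 7 registers reach only 32768MB, which would be one candidate for the
end of the high MMIO window.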

Or perhaps revisit a6a822324:
Author: Keir Fraser <keir.fraser@xxxxxxxxxx>
Date:   Wed Apr 16 13:36:44 2008 +0100

    x86, hvm: Lots of MTRR/PAT emulation cleanup.
    
     - Move MTRR MSR initialisation into hvmloader.
     - Simplify initialisation logic by overlaying UC on default WB rather
       than vice versa.
     - Clean up hypervisor HVM MTRR/PAE code's interface with rest of
       hypervisor.
    

The default MTRR type there is WB. If it were UC instead, we could just
set MTRRs for the RAM regions and make those regions WB?

I am not sure though if that is a good direction either?
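
To illustrate the trade-off, here is a small sketch comparing the two
schemes. The guest layout (RAM at 0-3GB and 4-8GB, MMIO hole at 3-4GB) and
the helper name are assumptions made up for the example. All ranges are
aligned on their size, so a plain interval check stands in for the mask
match, and only the "UC wins over WB" overlap rule is modelled:

```python
GB = 1 << 30
UC, WB = "UC", "WB"

def effective_type(addr, var_mtrrs, default):
    """Hypothetical lookup: UC wins on overlap; the default type applies
    where no variable MTRR matches."""
    hits = {mtype for base, size, mtype in var_mtrrs
            if base <= addr < base + size}
    if not hits:
        return default
    return UC if UC in hits else WB

# Current scheme: default WB, UC overlaid on the MMIO hole below 4GB.
default_wb = [(3 * GB, 1 * GB, UC)]

# Alternative: default UC, WB entries for each aligned chunk of RAM.
default_uc = [(0, 2 * GB, WB), (2 * GB, 1 * GB, WB), (4 * GB, 4 * GB, WB)]

for addr in (1 * GB, 3 * GB + GB // 2, 6 * GB, 100 * GB):
    print(addr // GB,
          effective_type(addr, default_wb, WB),
          effective_type(addr, default_uc, UC))
```

In this made-up layout both schemes agree on RAM and on the low MMIO hole.
Only the default-UC one treats an address far above RAM (where a 64-bit BAR
might land) as UC without any extra register, but it spends three registers
on RAM where the default-WB scheme spends one on the hole.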

> >> 
> >> > Or perhaps add some extra logic between QEMU and ACPI AML to expand (or
> >> > perhaps modify the last _CRS entry) when PCIe devices are hotplugged?
> >> 
> >> While that would be the most flexible variant, I'd be afraid of this
> >> getting rather complicated. Or have you already got some
> >> reasonable layout of how this would look like?
> > 
> > I did this and while all the plumbing works great and I can see that
> > the pci_hi_len gets incremented by the size of the 64-bit BARS of the
> > new device (and also decremented if hot-unplugged) I hit a snag:
> > 
> > Linux evaluates this only once (actually twice, but only during bootup).
> 
> Ah - quite reasonable to expect this won't change.
> 
> > For right now let me jump with the "simpler" solution of just
> > hardcoding the end of physical address space and see how that works out.
> 
> Right.

And that actually worked out nicely. Linux sees the new _CRS regions
and I got [this includes two extra regions - so that the HT region
is not touched]:

 ...
     pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
     pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
     pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
     pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff window]
     pci_bus 0000:00: root bus resource [mem 0x10fc00000-0xfcfffffffe window]
     pci_bus 0000:00: root bus resource [mem 0x10000000000-0xffffffffffff window]
     pci_bus 0000:00: root bus resource [bus 00-ff]

from:
    pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
    pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
    pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
    pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff window]
    pci_bus 0000:00: root bus resource [bus 00-ff]
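
The two extra regions come from carving the HT area out of one large
window. A minimal sketch of that split follows. The function name is
hypothetical, and the HT bounds (0xFD00000000 up to 0xFFFFFFFFFF) are an
assumption based on the usual AMD HyperTransport reserved range. The log
above shows the first window ending at 0xfcfffffffe; the sketch computes
the inclusive end 0xfcffffffff and makes no attempt to reproduce that
difference:

```python
HT_START = 0xFD_0000_0000      # assumed start of the HT reserved area
HT_END = 0x100_0000_0000       # assumed end (exclusive)

def split_around(start, end, hole_start, hole_end):
    """Return inclusive (first, last) windows covering [start, end)
    with [hole_start, hole_end) removed."""
    windows = []
    low_end = min(end, hole_start)
    if start < low_end:
        windows.append((start, low_end - 1))
    high_start = max(start, hole_end)
    if high_start < end:
        windows.append((high_start, end - 1))
    return windows

# Start address taken from the Linux log above; end is 2^48.
for first, last in split_around(0x10fc00000, 1 << 48, HT_START, HT_END):
    print(f"[mem {first:#x}-{last:#x} window]")
```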

Except that when I tried this with Windows 2000 I found out that
its AML interpreter blows up if any of the values are bigger than
8GB. With a bit of extra AML duct-tape that got solved, albeit I need
to verify other Windows platforms. Which reminds me - you had dabbled
in this - are there any other surprises I should be aware of?

> 
> Jan
> 
