
Re: [Xen-devel] [RFC PATCH 07/12] hvmloader: allocate MMCONFIG area in the MMIO hole + minor code refactoring



On Mon, 19 Mar 2018 15:58:02 +0000
Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:

>On Tue, Mar 13, 2018 at 04:33:52AM +1000, Alexey Gerasimenko wrote:
>> Much like normal PCI BARs or other chipset-specific memory-mapped
>> resources, MMCONFIG area needs space in MMIO hole, so we must
>> allocate it manually.
>> 
>> The actual MMCONFIG size depends on a number of PCI buses available
>> which should be covered by ECAM. Possible options are 64MB, 128MB
>> and 256MB. As we are limited to the bus 0 currently, thus using
>> lowest possible setting (64MB), #defined via PCI_MAX_MCFG_BUSES in
>> hvmloader/config.h. When multiple PCI buses support for Xen will be
>> implemented, PCI_MAX_MCFG_BUSES may be changed to calculation of the
>> number of buses according to results of the PCI devices enumeration.
>> 
>> The way to allocate MMCONFIG range in MMIO hole is similar to how
>> other PCI BARs are allocated. The patch extends 'bars' structure to
>> make it universal for any arbitrary BAR type -- either IO, MMIO, ROM
>> or a chipset-specific resource.  
>
>I'm not sure this is fully correct. The IOREQ interface can
>differentiate PCI devices and forward config space accesses to
>different emulators (see IOREQ_TYPE_PCI_CONFIG). With this change you
>will forward all MCFG accesses to QEMU, which will likely be wrong if
>there are multiple PCI-device emulators for the same domain.
>
>Ie: AFAICT Xen needs to know about the MCFG emulation and detect
>accesses to it in order to forward them to the right emulators.
>
>Adding Paul who knows more about all this.

In which use cases are multiple PCI-device emulators used for a single
HVM domain? Is it a proprietary setup?

I assume it is somehow related to this code in xen-hvm.c:
                /* Fake a write to port 0xCF8 so that
                 * the config space access will target the
                 * correct device model.
                 */
                val = (1u << 31) | ((req->addr & 0x0f00) <...>
                do_outp(0xcf8, 4, val);
If yes, a similar thing can be done for IOREQ_TYPE_COPY accesses to
the emulated MMCONFIG if needed.
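
Just to illustrate (a rough sketch, not actual xen-hvm.c code; the
function name and the mmcfg_base parameter are made up, and I'm assuming
the existing do_outp()/do_inp() helpers there): an IOREQ_TYPE_COPY access
landing inside the emulated MMCONFIG window could be turned into the same
CF8h/CFCh pattern:

/* Sketch: route a trapped MMCONFIG access through the legacy CF8h/CFCh
 * mechanism so that it targets the correct device model, just like the
 * fake 0xCF8 write above.  Note that CF8h/CFCh only reaches the first
 * 256 bytes of config space, while ECAM provides 4K per function. */
static void mmcfg_copy_via_cf8(uint64_t addr, uint64_t mmcfg_base,
                               uint32_t size, uint32_t *val, bool is_write)
{
    uint64_t off = addr - mmcfg_base;
    uint32_t bus = (off >> 20) & 0xff;
    uint32_t dev = (off >> 15) & 0x1f;
    uint32_t fn  = (off >> 12) & 0x07;
    uint32_t reg = off & 0xfff;
    uint32_t cf8 = (1u << 31) | (bus << 16) | (dev << 11) | (fn << 8) |
                   (reg & 0xfc);

    do_outp(0xcf8, 4, cf8);
    if (is_write) {
        do_outp(0xcfc + (reg & 3), size, *val);
    } else {
        *val = do_inp(0xcfc + (reg & 3), size);
    }
}

(Registers above 0FCh obviously cannot be reached via CF8h/CFCh, so
extended config space would still need the plain MMIO path.)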

In the HVM+QEMU case we are not limited to merely passed-through devices;
most of the observable PCI config space devices belong to one particular
QEMU instance. This dictates the overall emulated MMCONFIG layout for a
domain, which should be in sync with what QEMU emulates via CF8h/CFCh
accesses... and between multiple device model instances (if there are
any; I'm still not sure what the multiple PCI-device emulators you
mention really are).

Basically, we have an emulated MMCONFIG area of 64/128/256MB size in
the MMIO hole of the guest HVM domain. (BTW, this area itself can be
considered a feature of the chipset the device model emulates.)
It can be relocated to some other place in the MMIO hole; this means that
QEMU will trap accesses to the chipset-specific emulated PCIEXBAR
register and will issue the same MMIO unmap/map calls as for any normal
emulated MMIO range.

On the other hand, it won't be easy to provide emulated MMCONFIG
translation into IOREQ_TYPE_PCI_CONFIG from the Xen side. Xen would need
to know the current emulated MMCONFIG area position and size in order to
translate (or not) accesses to it into the corresponding BDF/reg pair
(plus whether that area is enabled for decoding at all). This will likely
require introducing new hypercall(s).
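
To be clear about what that translation would involve on the Xen side,
here is a rough sketch (the struct and function names are made up; the
point is only to show what state the device model would have to
communicate, e.g. via such a new hypercall):

#include <stdbool.h>
#include <stdint.h>

/* State Xen would need to track per domain (supplied by the device model). */
struct mmcfg_info {
    uint64_t base;      /* current PCIEXBAR base programmed by the guest */
    uint64_t size;      /* 64/128/256 MB depending on the covered buses */
    bool     enabled;   /* PCIEXBAR enable bit */
};

/*
 * Decode a trapped MMIO address into a BDF/register pair suitable for an
 * IOREQ_TYPE_PCI_CONFIG request.  Returns false if the access is outside
 * the MMCONFIG window and should be handled as plain MMIO instead.
 */
static bool mmcfg_decode(const struct mmcfg_info *m, uint64_t addr,
                         uint16_t *bdf, uint16_t *reg)
{
    uint64_t off;

    if ( !m->enabled || addr < m->base || addr - m->base >= m->size )
        return false;

    off  = addr - m->base;
    *bdf = off >> 12;       /* bus[15:8], device[7:3], function[2:0] */
    *reg = off & 0xfff;     /* ECAM: 4K of config space per function */
    return true;
}

The decode itself is trivial; the real cost is keeping base/size/enabled
in sync with the device model and splitting wide accesses (see below).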

The question is whether there will be any difference or benefit at all.

It's basically the same emulated MMIO range after all, but in one case
we trap accesses to it in Xen and translate them into
IOREQ_TYPE_PCI_CONFIG requests.
We have to provide some infrastructure to let Xen know where the device
model/guest expects the MMCONFIG area to be (and its size). The device
model will need to use this infrastructure, informing Xen of any changes.
Also, due to the nature of MMCONFIG there may be pitfalls, like the need
to send multiple IOREQ_TYPE_PCI_CONFIG ioreqs for a single memory
read/write operation.

In the other case, we still have an emulated MMIO range, but Xen sends
plain IOREQ_TYPE_COPY requests to QEMU, which handles them itself.
In that case, all code to handle MMCONFIG accesses is available for
reuse right away (mmcfg -> pci_* translation in QEMU); no new
functionality is required in either Xen or QEMU.

>> One important new field is addr_mask, which tells which bits of the
>> base address can (should) be written. Different address types (ROM,
>> MMIO BAR, PCIEXBAR) will have different addr_mask values.
>> 
>> For every assignable BAR range we store its size, PCI device BDF
>> (devfn actually) to which it belongs, BAR type (mem/io/mem64) and
>> corresponding register offset in device PCI conf space. This way we
>> can insert MMCONFIG entry into bars array in the same manner like
>> for any other BARs. In this case, the devfn field will point to MCH
>> PCI device and bar_reg will contain PCIEXBAR register offset. It
>> will be assigned a slot in MMIO hole later in a very same way like
>> for plain PCI BARs, with respect to its size alignment.
>> 
>> Also, to reduce code complexity, all long mem/mem64 BAR flags checks
>> are replaced by simple bars[i] field probing, eg.:
>> -        if ( (bar_reg == PCI_ROM_ADDRESS) ||
>> -             ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>> -              PCI_BASE_ADDRESS_SPACE_MEMORY) )
>> +        if ( bars[i].is_mem )  
>
>This should be a separate change IMO.

OK, no problem.

>>  tools/firmware/hvmloader/config.h   |   4 ++
>>  tools/firmware/hvmloader/pci.c      | 127 ++++++++++++++++++++++++++++--------
>>  tools/firmware/hvmloader/pci_regs.h |   2 +
>>  3 files changed, 106 insertions(+), 27 deletions(-)
>> 
>> diff --git a/tools/firmware/hvmloader/config.h b/tools/firmware/hvmloader/config.h
>> index 6fde6b7b60..5443ecd804 100644
>> --- a/tools/firmware/hvmloader/config.h
>> +++ b/tools/firmware/hvmloader/config.h
>> @@ -53,10 +53,14 @@ extern uint8_t ioapic_version;
>>  #define PCI_ISA_DEVFN       0x08    /* dev 1, fn 0 */
>>  #define PCI_ISA_IRQ_MASK    0x0c20U /* ISA IRQs 5,10,11 are PCI connected */
>>  #define PCI_ICH9_LPC_DEVFN  0xf8    /* dev 31, fn 0 */
>> +#define PCI_MCH_DEVFN       0       /* bus 0, dev 0, func 0 */
>>  
>>  /* MMIO hole: Hardcoded defaults, which can be dynamically expanded. */
>>  #define PCI_MEM_END         0xfc000000
>>  
>> +/* possible values are: 64, 128, 256 */
>> +#define PCI_MAX_MCFG_BUSES  64  
>
>What the reasoning for this value? Do we know which devices need ECAM
>areas?

Yes, Xen is currently limited to bus 0 emulation; the description
states "When multiple PCI buses support for Xen will be implemented,
PCI_MAX_MCFG_BUSES may be changed to calculation of the number of buses
according to results of the PCI devices enumeration".

I think it might be better to replace 'switch (PCI_MAX_MCFG_BUSES)'
with the real code right away, i.e. change it to

'switch (max_bus_num, aligned up to a 64/128/256 boundary)',

where max_bus_num would be set by the PCI device enumeration code in
pci_setup(). As we are limited to bus 0 currently, we'll just set it
to 0 for now, before/after the PCI device enumeration loop (which should
become multi-bus capable eventually). See the sketch below.
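
Roughly like this (the helper name is made up; PCIEXBAR_*_BUSES /
PCIEXBAR_ENABLE are the defines from this RFC and MB() is the existing
hvmloader macro):

/* Pick the smallest PCIEXBAR length that covers all enumerated buses.
 * max_bus_num is the highest bus number found during PCI enumeration
 * (0 for now, while only bus 0 is emulated). */
static void mmcfg_size_from_buses(unsigned int max_bus_num,
                                  uint32_t *bar_data, uint64_t *bar_sz)
{
    if ( max_bus_num < 64 )
    {
        *bar_data = PCIEXBAR_64_BUSES | PCIEXBAR_ENABLE;
        *bar_sz = MB(64);
    }
    else if ( max_bus_num < 128 )
    {
        *bar_data = PCIEXBAR_128_BUSES | PCIEXBAR_ENABLE;
        *bar_sz = MB(128);
    }
    else
    {
        *bar_data = PCIEXBAR_256_BUSES | PCIEXBAR_ENABLE;
        *bar_sz = MB(256);
    }
}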

>>  #define ACPI_TIS_HDR_ADDRESS 0xFED40F00UL
>>  
>>  extern unsigned long pci_mem_start, pci_mem_end;
>> diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
>> index 033bd20992..6de124bbd5 100644
>> --- a/tools/firmware/hvmloader/pci.c
>> +++ b/tools/firmware/hvmloader/pci.c
>> @@ -158,9 +158,10 @@ static void class_specific_pci_device_setup(uint16_t vendor_id,
>>  void pci_setup(void)
>>  {
>> -    uint8_t is_64bar, using_64bar, bar64_relocate = 0;
>> +    uint8_t is_64bar, using_64bar, bar64_relocate = 0, is_mem;
>>      uint32_t devfn, bar_reg, cmd, bar_data, bar_data_upper;
>>      uint64_t base, bar_sz, bar_sz_upper, mmio_total = 0;
>> +    uint64_t addr_mask;
>>      uint16_t vendor_id, device_id;
>>      unsigned int bar, pin, link, isa_irq;
>>      int is_running_on_q35 = 0;
>> @@ -172,10 +173,14 @@ void pci_setup(void)
>>  
>>      /* Create a list of device BARs in descending order of size. */
>>      struct bars {
>> -        uint32_t is_64bar;
>>          uint32_t devfn;
>>          uint32_t bar_reg;
>>          uint64_t bar_sz;
>> +        uint64_t addr_mask; /* which bits of the base address can be written */
>> +        uint32_t bar_data;  /* initial value - BAR flags here */
>> +        uint8_t  is_64bar;
>> +        uint8_t  is_mem;
>> +        uint8_t  padding[2];  
>
>Why are you manually adding a padding here? Also why not make this
>fields bool?

I was just following the existing code style; hvmloader/pci.c for some
reason prefers uint8_t for boolean variables. OK, I will change them
to bool.

>>      } *bars = (struct bars *)scratch_start;
>>      unsigned int i, nr_bars = 0;
>>      uint64_t mmio_hole_size = 0;
>> @@ -259,13 +264,21 @@ void pci_setup(void)
>>                  bar_reg = PCI_ROM_ADDRESS;
>>  
>>              bar_data = pci_readl(devfn, bar_reg);
>> +
>> +            is_mem = !!(((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>> +                       PCI_BASE_ADDRESS_SPACE_MEMORY) ||
>> +                       (bar_reg == PCI_ROM_ADDRESS));
>> +
>>              if ( bar_reg != PCI_ROM_ADDRESS )
>>              {
>> -                is_64bar = !!((bar_data & (PCI_BASE_ADDRESS_SPACE |
>> -                             PCI_BASE_ADDRESS_MEM_TYPE_MASK)) ==
>> -                             (PCI_BASE_ADDRESS_SPACE_MEMORY |
>> +                is_64bar = !!(is_mem &&
>> +                             ((bar_data & PCI_BASE_ADDRESS_MEM_TYPE_MASK) == PCI_BASE_ADDRESS_MEM_TYPE_64));
>> +
>>                  pci_writel(devfn, bar_reg, ~0);
>> +
>> +                addr_mask = is_mem ? PCI_BASE_ADDRESS_MEM_MASK
>> +                                   : PCI_BASE_ADDRESS_IO_MASK;
>>              }
>>              else
>>              {
>> @@ -273,28 +286,35 @@ void pci_setup(void)
>>                  pci_writel(devfn, bar_reg,
>>                             (bar_data | PCI_ROM_ADDRESS_MASK) &
>>                             ~PCI_ROM_ADDRESS_ENABLE);
>> +
>> +                addr_mask = PCI_ROM_ADDRESS_MASK;
>>              }
>> +
>>              bar_sz = pci_readl(devfn, bar_reg);
>>              pci_writel(devfn, bar_reg, bar_data);
>>  
>>              if ( bar_reg != PCI_ROM_ADDRESS )
>> -                bar_sz &= (((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>> -                            PCI_BASE_ADDRESS_SPACE_MEMORY) ?
>> -                           PCI_BASE_ADDRESS_MEM_MASK :
>> -                           (PCI_BASE_ADDRESS_IO_MASK & 0xffff));
>> +                bar_sz &= is_mem ? PCI_BASE_ADDRESS_MEM_MASK :
>> +                                   (PCI_BASE_ADDRESS_IO_MASK & 0xffff);
>>              else
>>                  bar_sz &= PCI_ROM_ADDRESS_MASK;
>> -            if (is_64bar) {
>> +
>> +            if (is_64bar)  
>
>Coding style (spaces between parentheses).

OK, will add.

>> +            {
>>                  bar_data_upper = pci_readl(devfn, bar_reg + 4);
>>                  pci_writel(devfn, bar_reg + 4, ~0);
>>                  bar_sz_upper = pci_readl(devfn, bar_reg + 4);
>>                  pci_writel(devfn, bar_reg + 4, bar_data_upper);
>>                  bar_sz = (bar_sz_upper << 32) | bar_sz;
>>              }
>> +
>>              bar_sz &= ~(bar_sz - 1);
>>              if ( bar_sz == 0 )
>>                  continue;
>>  
>> +            /* leave only memtype/enable bits etc */
>> +            bar_data &= ~addr_mask;
>> +
>>              for ( i = 0; i < nr_bars; i++ )
>>                  if ( bars[i].bar_sz < bar_sz )
>>                      break;
>> @@ -302,14 +322,15 @@ void pci_setup(void)
>>              if ( i != nr_bars )
>>                  memmove(&bars[i+1], &bars[i], (nr_bars-i) * sizeof(*bars));
>>  
>> -            bars[i].is_64bar = is_64bar;
>> -            bars[i].devfn   = devfn;
>> -            bars[i].bar_reg = bar_reg;
>> -            bars[i].bar_sz  = bar_sz;
>> +            bars[i].is_64bar  = is_64bar;
>> +            bars[i].is_mem    = is_mem;
>> +            bars[i].devfn     = devfn;
>> +            bars[i].bar_reg   = bar_reg;
>> +            bars[i].bar_sz    = bar_sz;
>> +            bars[i].addr_mask = addr_mask;
>> +            bars[i].bar_data  = bar_data;
>>  
>> -            if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>> -                  PCI_BASE_ADDRESS_SPACE_MEMORY) ||
>> -                 (bar_reg == PCI_ROM_ADDRESS) )
>> +            if ( is_mem )
>>                  mmio_total += bar_sz;
>>  
>>              nr_bars++;
>> @@ -339,6 +360,63 @@ void pci_setup(void)
>>          pci_writew(devfn, PCI_COMMAND, cmd);
>>      }
>>  
>> +    /*
>> +     *  Calculate MMCONFIG area size and squeeze it into the bars array
>> +     *  for assigning a slot in the MMIO hole
>> +     */
>> +    if (is_running_on_q35)
>> +    {
>> +        /* disable PCIEXBAR decoding for now */
>> +        pci_writel(PCI_MCH_DEVFN, PCI_MCH_PCIEXBAR, 0);
>> +        pci_writel(PCI_MCH_DEVFN, PCI_MCH_PCIEXBAR + 4, 0);  
>
>I'm afraid I will need some context here, where is the description for
>the config space of dev 0 fn 0? I don't seem to be able to find it in
>the ich9 spec.

ICH9 is a southbridge; you need to check the NB/MCH datasheet, namely
"Intel® 3 Series Express Chipset Family".

>> +
>> +#define PCIEXBAR_64_BUSES    (2 << 1)
>> +#define PCIEXBAR_128_BUSES   (1 << 1)
>> +#define PCIEXBAR_256_BUSES   (0 << 1)
>> +#define PCIEXBAR_ENABLE      (1 << 0)  
>
>Why those strange definitions? (0 << 1)? (2 << 1) instead of (1 << 2)?

These are bitfields. It's just to show their bitfield nature:
bits [2:1] and bit 0. I'll change them to something more readable
(like shifts with _BITPOS defines) in the non-RFC patches, along the
lines of the sketch below.
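
For instance (illustrative only, the final names may differ):

/* PCIEXBAR low dword: LENGTH field in bits [2:1], ENABLE in bit 0 */
#define PCIEXBAR_LENGTH_BITPOS  1
#define PCIEXBAR_LENGTH_256MB   (0u << PCIEXBAR_LENGTH_BITPOS)
#define PCIEXBAR_LENGTH_128MB   (1u << PCIEXBAR_LENGTH_BITPOS)
#define PCIEXBAR_LENGTH_64MB    (2u << PCIEXBAR_LENGTH_BITPOS)
#define PCIEXBAR_ENABLE_BITPOS  0
#define PCIEXBAR_ENABLE         (1u << PCIEXBAR_ENABLE_BITPOS)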

>> +
>> +        switch (PCI_MAX_MCFG_BUSES)
>> +        {
>> +        case 64:
>> +            bar_data = PCIEXBAR_64_BUSES | PCIEXBAR_ENABLE;
>> +            bar_sz = MB(64);
>> +            break;
>> +
>> +        case 128:
>> +            bar_data = PCIEXBAR_128_BUSES | PCIEXBAR_ENABLE;
>> +            bar_sz = MB(128);
>> +            break;
>> +
>> +        case 256:
>> +            bar_data = PCIEXBAR_256_BUSES | PCIEXBAR_ENABLE;
>> +            bar_sz = MB(256);
>> +            break;
>> +
>> +        default:
>> +            /* unsupported number of buses specified */
>> +            BUG();
>> +        }  
>
>I don't see how PCI_MAX_MCFG_BUSES should be used. Is the user
>supposed to know what value to use at compile time? What about distro
>packagers?

Mostly answered above. We're limited to bus 0 currently. However, it is
possible to change the MMCONFIG size manually for now (e.g. to 256MB,
which covers the whole 00-FF bus range).


 

