[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 6/9] xen/vpci: add handlers to map the BARs



>>> On 30.06.17 at 17:01, <roger.pau@xxxxxxxxxx> wrote:
> Introduce a set of handlers that trap accesses to the PCI BARs and the command
> register, in order to emulate BAR sizing and BAR relocation.

I don't think "emulate" is the right term here - you really don't mean to
change anything, you only want to snoop Dom0 writes.

> --- /dev/null
> +++ b/xen/drivers/vpci/header.c
> @@ -0,0 +1,473 @@
> +/*
> + * Generic functionality for handling accesses to the PCI header from the
> + * configuration space.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see 
> <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xen/p2m-common.h>
> +
> +#define MAPPABLE_BAR(x)                                                 \
> +    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
> +     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
> +     (x)->addr != INVALID_PADDR)
> +
> +static struct rangeset *vpci_get_bar_memory(const struct domain *d,
> +                                            const struct vpci_bar *map)
> +{
> +    const struct pci_dev *pdev;
> +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> +    int rc;
> +
> +    if ( !mem )
> +        return ERR_PTR(-ENOMEM);
> +
> +    /*
> +     * Create a rangeset that represents the current BAR memory region
> +     * and compare it against all the currently active BAR memory regions.
> +     * If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
> +     *
> +     * NB: the rangeset uses frames, and if start and end addresses are
> +     * equal it means only one frame is used, that's why PFN_DOWN is used
> +     * to calculate the end of the rangeset.
> +     */

That explanation doesn't seem to fit: Did you perhaps mean to
point out that rangeset ranges are inclusive ones?

> +    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
> +                            PFN_DOWN(map->addr + map->size));

Don't you need to subtract 1 here (and elsewhere below)?

> +    if ( rc )
> +    {
> +        rangeset_destroy(mem);
> +        return ERR_PTR(rc);
> +    }
> +
> +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
> +    {
> +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
> +                                       PCI_SLOT(pdev->devfn),
> +                                       PCI_FUNC(pdev->devfn),
> +                                       PCI_COMMAND);

This is quite a lot of overhead - a loop over all devices plus a config
space read on each one. What state the memory decode bit is in
could be recorded in the ->enabled flag, couldn't it? And devices on
different sub-branches of the topology can't possibly have
overlapping entries that we need to worry about, as the bridge
windows would suppress actual accesses.

> +        unsigned int i;
> +
> +        /* Check if memory decoding is enabled. */
> +        if ( !(cmd & PCI_COMMAND_MEMORY) )
> +            continue;
> +
> +        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> +        {
> +            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
> +
> +            if ( bar == map || !MAPPABLE_BAR(bar) ||
> +                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
> +                                          PFN_DOWN(bar->addr + bar->size)) )
> +                continue;
> +
> +            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
> +                                       PFN_DOWN(bar->addr + bar->size));

I'm struggling to convince myself of the correctness of this approach
(including other code further down which is also involved). I think you
should have taken the time to add a few words on the approach
chosen to the description. For example, it doesn't look like things will
go right if the device being dealt with has two BARs both using part
of the same page.

> +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> +                           const bool map)
> +{
> +    struct rangeset *mem;
> +    struct map_data data = { .d = d, .map = map };
> +    int rc;
> +
> +    ASSERT(MAPPABLE_BAR(bar));
> +
> +    mem = vpci_get_bar_memory(d, bar);
> +    if ( IS_ERR(mem) )
> +        return -PTR_ERR(mem);

The negation looks wrong to me.

> +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    uint16_t cmd = val.u16, current_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    int rc;
> +
> +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> +
> +    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
> +    {
> +        /*
> +         * Let the guest play with all the bits directly except for the
> +         * memory decoding one.
> +         */
> +        pci_conf_write16(seg, bus, slot, func, reg, cmd);
> +        return;

Please invert the condition and have both cases use the same write
at the end of the function.

> +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint32_t wdata = val.u32, size_mask;
> +    bool hi = false;
> +
> +    switch ( bar->type )
> +    {
> +    case VPCI_BAR_MEM32:
> +    case VPCI_BAR_MEM64_LO:
> +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> +        break;
> +    case VPCI_BAR_MEM64_HI:
> +        size_mask = ~0u;
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    if ( (wdata & size_mask) == size_mask )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        bar->sizing = true;
> +        return;

I think the comment needs extending to explain why the written
sizing value can't possibly be an address. This is particularly
relevant because I'm not sure that assumption would hold on e.g.
ARM (which I don't think has guaranteed ROM right below 4Gb).

> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    bar->sizing = false;
> +
> +    /*
> +     * Ignore attempts to change the position of the BAR if memory decoding 
> is
> +     * active.
> +     */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +        return;

Especially as long as this code supports only Dom0 I think we want
a warning here.

> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( !hi )
> +        wdata &= PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));

Maybe shorter "0xffffffffull << (hi ? 0 : 32)"?

> +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    const uint32_t wdata = val.u32;
> +
> +    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        rom->sizing = true;
> +        return;
> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    rom->sizing = false;
> +
> +    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
> +
> +    /* Check if memory decoding is enabled. */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY &&
> +         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )

Just like you parenthesize the operands of ^, please also do so for
the ones of &. Also the ^-expression relies on the particular value
of PCI_ROM_ADDRESS_ENABLE, which I'd prefer if you avoided.

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t header_type;
> +    uint16_t cmd;
> +    uint32_t rom_val;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 
> 0x7f;
> +    switch ( header_type )

I'd prefer if you didn't introduce variables used just once.

> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);

This is unrelated to what you mean to do here. Please move it ...

> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */

... here.

> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 
> 4,
> +                                   &bars[i]);
> +            if ( rc )
> +                return rc;
> +
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;

Perhaps ignore the 64-bit indicator if it appears in the last BAR?

> +        /* Size the BAR and map it. */
> +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> +                              &addr, &size);
> +        if ( rc < 0 )
> +            return rc;
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;

This doesn't match up with logic further up: When the memory decode
bit gets cleared, you don't zap the addresses, so I think you'd better
store it here too. Use INVALID_PADDR only when the value read has
all address bits set (same caveat as pointed out earlier).

> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    /* Check expansion ROM. */
> +    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
> +    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
> +        pci_conf_write32(seg, bus, slot, func, rom_reg,
> +                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);

Do you really need to do this when you've cleared the memory
decode bit already?

> +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);

You can't use this function here without first making it capable of
dealing with ROM BARs - it expects the low bits to be different
than what we have here (see the early ASSERT() that's there).

> +    if ( rc < 0 )
> +        return rc;

Perhaps I didn't pay attention elsewhere, but here it is quite obvious
that in the error case you return with the device in a state other than
on input.

> +    if ( size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
> +        if ( rom->enabled )
> +            rom->addr = addr;
> +        else
> +            rom->addr = INVALID_PADDR;

Same remark as further up.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.