
Re: [PATCH 06/10] vpci: Make every domain handle its own BARs


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Oleksandr Andrushchenko <andr2000@xxxxxxxxx>
  • From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Date: Thu, 12 Nov 2020 13:16:10 +0000
  • Cc: "Rahul.Singh@xxxxxxx" <Rahul.Singh@xxxxxxx>, "Bertrand.Marquis@xxxxxxx" <Bertrand.Marquis@xxxxxxx>, "julien.grall@xxxxxxx" <julien.grall@xxxxxxx>, "jbeulich@xxxxxxxx" <jbeulich@xxxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "iwj@xxxxxxxxxxxxxx" <iwj@xxxxxxxxxxxxxx>, "wl@xxxxxxx" <wl@xxxxxxx>
  • Delivery-date: Thu, 12 Nov 2020 13:16:25 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH 06/10] vpci: Make every domain handle its own BARs

On 11/12/20 11:40 AM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>
>> At the moment there is an identity mapping between how a guest sees its
>> BARs and how they are programmed into guest domain's p2m. This is not
>> going to work as guest domains have their own view on the BARs.
>> Extend existing vPCI BAR handling to allow every domain to have its own
>> view of the BARs: only hardware domain sees physical memory addresses in
>> this case and for the rest those are emulated, including logic required
>> for the guests to detect memory sizes and properties.
>>
>> While emulating BAR access for the guests create a link between the
>> virtual BAR address and physical one: use full memory address while
>> creating range sets used to map/unmap corresponding address spaces and
>> exploit the fact that PCI BAR value doesn't use 8 lower bits of the
> I think you mean the low 4 bits rather than the low 8 bits?
Yes, you are right. Will fix that
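
For reference, a memory BAR only has 4 low bits that are not part of the
address, which is exactly why only 4 bits are available for stashing an index:

    /*
     * Memory BAR low bits (PCI_BASE_ADDRESS_MEM_MASK == ~0xfUL):
     *   bit  0     - space indicator (0 = memory)
     *   bits 2..1  - type (00 = 32-bit, 10 = 64-bit)
     *   bit  3     - prefetchable
     *   bits 31..4 - base address
     */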
>
>> memory address. Use those bits to pass physical BAR's index, so we can
>> build/remove proper p2m mappings.
> I find this quite hard to review, given it's a fairly big and
> complicated patch. Do you think you could split into smaller chunks?

I'll try. But at the moment this code isn't meant to be production quality
yet, as what I'd like to achieve is to collect the community's view on the
subject. Once all questions are resolved I'll start working on the agreed
solution, which I expect to be production quality then.

>
> Maybe you could split into smaller patches that add bits towards the
> end goal but still keep the identity mappings?
>
> I've tried to do some review below, but I would appreciate if you
> could split this.
Thank you so much for doing so!
>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>> ---
>>   xen/drivers/vpci/header.c | 276 ++++++++++++++++++++++++++++++++++----
>>   xen/drivers/vpci/vpci.c   |   1 +
>>   xen/include/xen/vpci.h    |  24 ++--
>>   3 files changed, 265 insertions(+), 36 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index f74f728884c0..7dc7c70e24f2 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -31,14 +31,87 @@
>>   struct map_data {
>>       struct domain *d;
>>       bool map;
>> +    struct pci_dev *pdev;
> If the field is required please place it after the domain one.
I will, but may I ask why?
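
My guess at the rationale, in case it helps: it groups the related pointer
members together and keeps the smallest member last, which in larger structs
also avoids interior padding holes, i.e.:

    struct map_data {
        struct domain *d;
        struct pci_dev *pdev; /* Pointers grouped together. */
        bool map;             /* Smallest member last. */
    };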
>
>>   };
>>   
>> +static struct vpci_header *get_vpci_header(struct domain *d,
>> +                                           const struct pci_dev *pdev);
>> +
>> +static struct vpci_header *get_hwdom_vpci_header(const struct pci_dev *pdev)
>> +{
>> +    if ( unlikely(list_empty(&pdev->vpci->headers)) )
>> +        return get_vpci_header(hardware_domain, pdev);
> I'm not sure I understand why you need a list here: each device can
> only be owned by a single guest, and thus there shouldn't be multiple
> views of the BARs (or the header).

That is because of the over-engineering happening here: you are 100% right,
and this all must be made way simpler, without lists and all that. I just
blindly thought that we could have multiple guests, but I have overlooked
the simple fact you mentioned: physical BARs are for hwdom and virtual ones
are for *a single* guest, as we can't pass through the same device to
multiple guests at a time.

>
>> +
>> +    /* hwdom's header is always the very first entry. */
>> +    return list_first_entry(&pdev->vpci->headers, struct vpci_header, node);
>> +}
>> +
>> +static struct vpci_header *get_vpci_header(struct domain *d,
>> +                                           const struct pci_dev *pdev)
>> +{
>> +    struct list_head *prev;
>> +    struct vpci_header *header;
>> +    struct vpci *vpci = pdev->vpci;
>> +
>> +    list_for_each( prev, &vpci->headers )
>> +    {
>> +        struct vpci_header *this = list_entry(prev, struct vpci_header, node);
>> +
>> +        if ( this->domain_id == d->domain_id )
>> +            return this;
>> +    }
>> +    printk(XENLOG_DEBUG "--------------------------------------" \
>> +           "Adding new vPCI BAR headers for domain %d: " PRI_pci" \n",
>> +           d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
>> +           pdev->sbdf.dev, pdev->sbdf.fn);
>> +    header = xzalloc(struct vpci_header);
>> +    if ( !header )
>> +    {
>> +        printk(XENLOG_ERR
>> +               "Failed to add new vPCI BAR headers for domain %d: " 
>> PRI_pci" \n",
>> +               d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
>> +               pdev->sbdf.dev, pdev->sbdf.fn);
>> +        return NULL;
>> +    }
>> +
>> +    if ( !is_hardware_domain(d) )
>> +    {
>> +        struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
>> +
>> +        /* Make a copy of the hwdom's BARs as the initial state for vBARs. */
>> +        memcpy(header, hwdom_header, sizeof(*header));
>> +    }
>> +
>> +    header->domain_id = d->domain_id;
>> +    list_add_tail(&header->node, &vpci->headers);
> Same here, I think you want a single header, and then some fields
> would be read-only for domUs (like the position of the BARs on the
> physmap).
Ditto, will remove the list.
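
Roughly what I'll aim for instead (field names are provisional, assuming a
device is only ever assigned to one domU at a time):

    /* Single header: the physical view plus at most one guest view. */
    struct vpci_header {
        struct vpci_bar {
            uint64_t addr;       /* Physical address: read-only for domUs. */
            uint64_t guest_addr; /* Guest view: writable by the domU. */
            uint64_t size;
            bool enabled;
        } bars[7];               /* At most 6 BARs + 1 ROM BAR. */
    };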
>
>> +    return header;
>> +}
>> +
>> +static struct vpci_bar *get_vpci_bar(struct domain *d,
>> +                                     const struct pci_dev *pdev,
>> +                                     int bar_idx)
> unsigned
ok
>
>> +{
>> +    struct vpci_header *vheader;
>> +
>> +    vheader = get_vpci_header(d, pdev);
>> +    if ( !vheader )
>> +        return NULL;
>> +
>> +    return &vheader->bars[bar_idx];
>> +}
>> +
>>   static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>>   {
>>       const struct map_data *map = data;
>> -    int rc;
>> -
>> +    unsigned long mfn;
>> +    int rc, bar_idx;
>> +    struct vpci_header *header = get_hwdom_vpci_header(map->pdev);
>> +
>> +    bar_idx = s & ~PCI_BASE_ADDRESS_MEM_MASK;
> I'm not sure it's fine to stash the BAR index in the low bits of the
> address, what about a device having concatenated BARs?

Hm, I am not an expert in PCI, so I didn't think about that. Probably
nothing stops a PCI device from splitting the same memory region into
multiple BARs...

>
> The rangeset would normally join them into a single range, and then
> you won't be able to notice whether a region in the rangeset belongs
> to one BAR or another.
Yes, I see. Very good catch, thank you
>
> IMO it might be easier to just have a rangeset for each BAR and
> structure the pending work as a linked list of BARs, that will contain
> the physical addresses of each BAR (the real phymap one and the guest
> physmap view) plus the rangeset of memory regions to map.
I'll try to think how to do that, thank you.
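
If I understand the suggestion correctly, something along these lines
(structure and names are mine, not final):

    /*
     * Pending work for a single BAR: one rangeset per BAR, so ranges
     * belonging to different BARs can never get merged.
     */
    struct vpci_bar_map {
        struct list_head node;  /* Linked into the list of pending BARs. */
        paddr_t addr;           /* BAR address in the host physmap. */
        paddr_t guest_addr;     /* BAR address in the guest physmap. */
        struct rangeset *mem;   /* Regions of this BAR left to {un}map. */
    };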
>
>> +    s = PFN_DOWN(s);
>> +    e = PFN_DOWN(e);
> Changing the rangeset to store memory addresses instead of frames
> could for example be split into a separate patch.
Ok
>
> I think you are doing the calculation of the end pfn wrong here, you
> should use PFN_UP instead in case the address is not aligned.

PFN_DOWN for the start seems to be ok if the address is not aligned, which
is the case if I pass bar_index in the lower bits: PCI memory has PAGE_SIZE
granularity, so besides the fact that I use bar_index, the address must be
page aligned. The end address is expressed in (size - 1) form, again page
aligned, so to get the last page to be mapped PFN_DOWN also seems to be
appropriate. Do I miss something here?
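
For example, with 4K pages and bar_index 2 stashed in the low bits:

    s = 0xe0000002            /* page-aligned BAR address | bar_idx 2 */
    e = 0xe0003fff            /* (s & PCI_BASE_ADDRESS_MEM_MASK) + 0x4000 - 1 */
    PFN_DOWN(s) = 0xe0000     /* index bits discarded with the page offset */
    PFN_DOWN(e) = 0xe0003     /* last frame to map, inclusive */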

>
>> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>>       for ( ; ; )
>>       {
>>           unsigned long size = e - s + 1;
>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>            * - {un}map_mmio_regions doesn't support preemption.
>>            */
>>   
>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
>> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>>           if ( rc == 0 )
>>           {
>> -            *c += size;
>> +            /*
>> +             * Range set is not expressed in frame numbers and the size
>> +             * is the number of frames, so update accordingly.
>> +             */
>> +            *c += size << PAGE_SHIFT;
>>               break;
>>           }
>>           if ( rc < 0 )
>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>               break;
>>           }
>>           ASSERT(rc < size);
>> -        *c += rc;
>> +        *c += rc << PAGE_SHIFT;
>>           s += rc;
>> +        mfn += rc;
>>           if ( general_preempt_check() )
>>                   return -ERESTART;
>>       }
>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>   static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>                               bool rom_only)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>       bool map = cmd & PCI_COMMAND_MEMORY;
>>       unsigned int i;
>>   
>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>>           struct map_data data = {
>>               .d = v->domain,
>>               .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>> +            .pdev = v->vpci.pdev,
>>           };
>>           int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>>   
>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>>   static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>                               struct rangeset *mem, uint16_t cmd)
>>   {
>> -    struct map_data data = { .d = d, .map = true };
>> +    struct map_data data = { .d = d, .map = true,
>> +        .pdev = (struct pci_dev *)pdev };
> Dropping the const here is not fine. IT either needs to be dropped
> from apply_map and further up, or this needs to also be made const.
Ok, I'll try to keep it const
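
E.g. making the field itself const should avoid the cast (sketch;
get_hwdom_vpci_header() already takes a const pdev, so map_range() wouldn't
need changes):

    struct map_data {
        struct domain *d;
        const struct pci_dev *pdev; /* Only ever read in map_range(). */
        bool map;
    };

    struct map_data data = { .d = d, .map = true, .pdev = pdev };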
>
>>       int rc;
>>   
>>       while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>   
>>   static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header;
>>       struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>       struct pci_dev *tmp, *dev = NULL;
>>   #ifdef CONFIG_X86
>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       if ( !mem )
>>           return -ENOMEM;
>>   
>> +    if ( is_hardware_domain(current->domain) )
>> +        header = get_hwdom_vpci_header(pdev);
>> +    else
>> +        header = get_vpci_header(current->domain, pdev);
>> +
>>       /*
>>        * Create a rangeset that represents the current device BARs memory region
>>        * and compare it against all the currently active BAR memory regions. If
>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>        * First fill the rangeset with all the BARs of this device or with the ROM
>>        * BAR only, depending on whether the guest is toggling the memory decode
>>        * bit of the command register, or the enable bit of the ROM BAR register.
>> +     *
>> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
>>        */
>>       for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>       {
>>           const struct vpci_bar *bar = &header->bars[i];
>> -        unsigned long start = PFN_DOWN(bar->addr);
>> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
>> +            bar->size - 1;
> Will this work fine on Arm 32bits with LPAE? It's my understanding
> that in that case unsigned long is 32bits, but the physical address
> space is 44bits, in which case this won't work.
Hm, good question
>
> I think you need to keep the usage of frame numbers here.
If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here and 
elsewhere
>
>>   
>>           if ( !MAPPABLE_BAR(bar) ||
>>                (rom_only ? bar->type != VPCI_BAR_ROM
>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       /* Remove any MSIX regions if present. */
>>       for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>>       {
>> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
>> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>> -                                     vmsix_table_size(pdev->vpci, i) - 1);
>> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
>> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
>> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
>> +                             vmsix_table_size(pdev->vpci, i) - 1;
>>   
>>           rc = rangeset_remove_range(mem, start, end);
>>           if ( rc )
>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>        */
>>       for_each_pdev ( pdev->domain, tmp )
>>       {
>> +        struct vpci_header *header;
>> +
>>           if ( tmp == pdev )
>>           {
>>               /*
>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                   continue;
>>           }
>>   
>> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>> +        header = get_vpci_header(tmp->domain, pdev);
>> +
>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>           {
>> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>> -            unsigned long start = PFN_DOWN(bar->addr);
>> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>> +            const struct vpci_bar *bar = &header->bars[i];
>> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>> +                + bar->size - 1;
>>   
>>               if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
>>                    /*
>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>           pci_conf_write16(pdev->sbdf, reg, cmd);
>>   }
>>   
>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>>       struct vpci_bar *bar = data;
>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       {
>>           /* If the value written is the current one avoid printing a warning. */
>>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>> +        {
>> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
>> +
>>               gprintk(XENLOG_WARNING,
>>                       "%04x:%02x:%02x.%u: ignored BAR %lu write with memory 
>> decoding enabled\n",
>>                       pdev->seg, pdev->bus, slot, func,
>> -                    bar - pdev->vpci->header.bars + hi);
>> +                    bar - header->bars + hi);
>> +        }
>>           return;
>>       }
>>   
>> -
>>       /*
>>        * Update the cached address, so that when memory decoding is enabled
>>        * Xen can map the BAR into the guest p2m.
>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    return vpci_hw_read32(pdev, reg, data);
>> +}
>> +
>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *vbar = data;
>> +    bool hi = false;
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        vbar--;
>> +        hi = true;
>> +    }
>> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>> +}
>> +
>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    struct vpci_bar *vbar = data;
>> +    uint32_t val;
>> +    bool hi = false;
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        vbar--;
>> +        hi = true;
>> +    }
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
> I think this would be clearer using a switch statement.
I'll think about it.
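
Something like this, I suppose (only the type dispatch shown; the
size-detection branches for ~0 writes would stay inside the respective
cases):

    switch ( vbar->type )
    {
    case VPCI_BAR_MEM64_LO:
    case VPCI_BAR_MEM64_HI:
        val = hi ? vbar->addr >> 32 : vbar->addr & 0xffffffff;
        /* Size-detection path goes here. */
        break;

    case VPCI_BAR_MEM32:
        val = vbar->addr;
        /* Size-detection path goes here. */
        break;

    default:
        val = vbar->addr;
        break;
    }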
>
>> +    {
>> +        if ( hi )
>> +            val = vbar->addr >> 32;
>> +        else
>> +            val = vbar->addr & 0xffffffff;
>> +        if ( val == ~0 )
> Strictly speaking I think you are not forced to write 1s to the
> reserved 4 bits in the low register (and in the 32bit case).

Ah, so the Linux kernel, for instance, could have written 0xfffffff0 while
I expect 0xffffffff?
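
If so, I guess the sizing check on the low register should only look at the
writable address bits, e.g. (sketch):

    /* 0xfffffff0 must trigger sizing too: ignore the 4 read-only bits. */
    if ( (val & (uint32_t)PCI_BASE_ADDRESS_MEM_MASK) ==
         (uint32_t)PCI_BASE_ADDRESS_MEM_MASK )

(the high half of a 64-bit BAR is all address bits, so val == ~0 stays
correct there).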

>
>> +        {
>> +            /* Guests detects BAR's properties and sizes. */
>> +            if ( !hi )
>> +            {
>> +                val = 0xffffffff & ~(vbar->size - 1);
>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +            }
>> +            else
>> +                val = vbar->size >> 32;
>> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>> +        }
>> +    }
>> +    else if ( vbar->type == VPCI_BAR_MEM32 )
>> +    {
>> +        val = vbar->addr;
>> +        if ( val == ~0 )
>> +        {
>> +            if ( !hi )
> There's no way hi can be true at this point AFAICT.
Sure, thank you
>
>> +            {
>> +                val = 0xffffffff & ~(vbar->size - 1);
>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +            }
>> +        }
>> +    }
>> +    else
>> +    {
>> +        val = vbar->addr;
>> +    }
>> +    return val;
>> +}
>> +
>>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>       struct vpci_bar *rom = data;
>>       uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>>       uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>   }
> Don't you need to also protect a domU from writing to the ROM BAR
> register?

ROM was not a target of this RFC as I have no HW to test that, but the final
code must also handle the ROM BAR, you are right.
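
At a minimum the guest path could simply drop such writes for now, e.g.
(untested sketch, handler name is mine):

    static void rom_write_guest(const struct pci_dev *pdev, unsigned int reg,
                                uint32_t val, void *data)
    {
        /* Placeholder: ignore domU ROM BAR writes until properly handled. */
        gprintk(XENLOG_WARNING,
                "%04x:%02x:%02x.%u: ignored guest ROM BAR write\n",
                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
                PCI_FUNC(pdev->devfn));
    }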

>
>>   
>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>> +                                  void *data)
>> +{
>> +    struct vpci_bar *vbar, *bar = data;
>> +
>> +    if ( is_hardware_domain(current->domain) )
>> +        return bar_read_hwdom(pdev, reg, data);
>> +
>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>> +    if ( !vbar )
>> +        return ~0;
>> +
>> +    return bar_read_guest(pdev, reg, vbar);
>> +}
>> +
>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>> +                               uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +
>> +    if ( is_hardware_domain(current->domain) )
>> +        bar_write_hwdom(pdev, reg, val, data);
>> +    else
>> +    {
>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>> +
>> +        if ( !vbar )
>> +            return;
>> +        bar_write_guest(pdev, reg, val, vbar);
>> +    }
>> +}
> You should assign different handlers based on whether the domain that
> has the device assigned is a domU or the hardware domain, rather than
> doing the selection here.

Hm, handlers are assigned once in init_bars and this function is only called
for hwdom, so there is no way I can do that for the guests. Hence the
dispatcher.
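
Unless the handlers could be re-registered when the device gets assigned to
a domU, that is. A hypothetical sketch (the hook and its call site are made
up, not existing code):

    /* Swap in the guest accessors when pdev gets assigned to a domU. */
    static int swap_bar_handlers(struct pci_dev *pdev)
    {
        unsigned int i;

        for ( i = 0; i < 6; i++ ) /* Normal header: up to 6 BARs. */
        {
            unsigned int reg = PCI_BASE_ADDRESS_0 + i * 4;
            int rc;

            rc = vpci_remove_register(pdev->vpci, reg, 4);
            if ( rc )
                return rc;

            rc = vpci_add_register(pdev->vpci, bar_read_guest,
                                   bar_write_guest, reg, 4,
                                   &pdev->vpci->header.bars[i]);
            if ( rc )
                return rc;
        }

        return 0;
    }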

>
>> +
>> +/*
>> + * FIXME: This is called early while adding vPCI handlers which is done
>> + * by and for hwdom.
>> + */
>>   static int init_bars(struct pci_dev *pdev)
>>   {
>>       uint16_t cmd;
>>       uint64_t addr, size;
>>       unsigned int i, num_bars, rom_reg;
>> -    struct vpci_header *header = &pdev->vpci->header;
>> -    struct vpci_bar *bars = header->bars;
>> +    struct vpci_header *header;
>> +    struct vpci_bar *bars;
>>       int rc;
>>   
>> +    header = get_hwdom_vpci_header(pdev);
>> +    if ( !header )
>> +        return -ENOMEM;
>> +    bars = header->bars;
>> +
>>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>       {
>>       case PCI_HEADER_TYPE_NORMAL:
>> @@ -496,11 +714,12 @@ static int init_bars(struct pci_dev *pdev)
>>           uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>>           uint32_t val;
>>   
>> +        bars[i].index = i;
>>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>>           {
>>               bars[i].type = VPCI_BAR_MEM64_HI;
>> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> -                                   4, &bars[i]);
>> +            rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
>> +                                   bar_write_dispatch, reg, 4, &bars[i]);
>>               if ( rc )
>>               {
>>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -540,8 +759,8 @@ static int init_bars(struct pci_dev *pdev)
>>           bars[i].size = size;
>>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
>> -                               &bars[i]);
>> +        rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
>> +                               bar_write_dispatch, reg, 4, &bars[i]);
>>           if ( rc )
>>           {
>>               pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -558,6 +777,7 @@ static int init_bars(struct pci_dev *pdev)
>>           rom->type = VPCI_BAR_ROM;
>>           rom->size = size;
>>           rom->addr = addr;
>> +        rom->index = num_bars;
>>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>>                                 PCI_ROM_ADDRESS_ENABLE;
>>   
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index a5293521a36a..728029da3e9c 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -69,6 +69,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>>           return -ENOMEM;
>>   
>>       INIT_LIST_HEAD(&pdev->vpci->handlers);
>> +    INIT_LIST_HEAD(&pdev->vpci->headers);
>>       spin_lock_init(&pdev->vpci->lock);
>>   
>>       for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index c3501e9ec010..54423bc6556d 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -55,16 +55,14 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
>>    */
>>   bool __must_check vpci_process_pending(struct vcpu *v);
>>   
>> -struct vpci {
>> -    /* List of vPCI handlers for a device. */
>> -    struct list_head handlers;
>> -    spinlock_t lock;
>> -
>>   #ifdef __XEN__
>> -    /* Hide the rest of the vpci struct from the user-space test harness. */
>>       struct vpci_header {
>> +    struct list_head node;
>> +    /* Domain that owns this view of the BARs. */
>> +    domid_t domain_id;
> Indentation seems screwed here.
It did ;)
>
>>           /* Information about the PCI BARs of this device. */
>>           struct vpci_bar {
>> +            int index;
> unsigned
ok
>
>>               uint64_t addr;
>>               uint64_t size;
>>               enum {
>> @@ -88,8 +86,18 @@ struct vpci {
>>            * is mapped into guest p2m) if there's a ROM BAR on the device.
>>            */
>>           bool rom_enabled      : 1;
>> -        /* FIXME: currently there's no support for SR-IOV. */
> Unless you are also adding support for SR-IOV, I would keep the
> comment.

WRT SR-IOV I do need your series [1] ;) SR-IOV is one of our targets.

> Thanks, Roger.

Thank you so much for reviewing this,

Oleksandr

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-07/msg01494.html

 

