[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 06/10] vpci: Make every domain handle its own BARs


  • To: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 12 Nov 2020 15:46:43 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=JNqSDsDxZXRO2EpaifFeChNnd7skvjiStIlunFt7o5E=; b=h1xYNaolTC0Q39WLb7jXI35S7Ix6qkBBNgARkkkoKpnaCWsn0pfqMU3FlFHBq2SzcTwTsxDrdk5ab21DGtj0XS9U9dEkLJO0q9vOInePW1EObOO286XC7vF8lKjYPzFWFDADoFS8iViTKQGIgPyQujJ9DQG6qkZJCscBfhY+UoKiaSA7B9o2AXs/Ok5pz4AU88WjZ9gqsJDtQ6f/FcXIM4d3hTMQkA1Z5ESvtkLKJrsNvX4Rc6yJmkDf6U2ZYZzA3a1OQ+05+rV5JHnB9qCjqp0zUQv/8tGeomER8wfjRM6xjDw3TcT3XeG1pgnVZyqQso22B+zJadXGowSpmrzVGg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=F+NBEsZ/t2yyaXUQwItIvh1st19d+hDYb/TbPGC/iQHX7Lr7rKdYDPgQEKMx0X2GAY3/P0DYszY9EEjdVtxz3mwVpB1hukqKaPsZ3rkd9DlrLlj1AgndGXkSRDVrUird0dkQKDYUCwwvEn59g8uEPRvDbWtLk2uu7wiQZjxZ4cGOWy5l84hTRxT4j2Aqa2CcXkXj6luZthrIGMZC0j5wfD7T7stJ8eBSdwDILaDqwLZ+wGFLsTea4wdTLSXQSjqzRi4BlBOCMWd2b+pY1U2rOXleu5EmXTBAvyK6NJ1NDmbQ9zM1wJ7lyD69PNgYsiyk5gLpKYW71/H6oYhRc1pvhA==
  • Authentication-results: esa6.hc3370-68.iphmx.com; dkim=pass (signature verified) header.i=@citrix.onmicrosoft.com
  • Cc: Oleksandr Andrushchenko <andr2000@xxxxxxxxx>, "Rahul.Singh@xxxxxxx" <Rahul.Singh@xxxxxxx>, "Bertrand.Marquis@xxxxxxx" <Bertrand.Marquis@xxxxxxx>, "julien.grall@xxxxxxx" <julien.grall@xxxxxxx>, "jbeulich@xxxxxxxx" <jbeulich@xxxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "iwj@xxxxxxxxxxxxxx" <iwj@xxxxxxxxxxxxxx>, "wl@xxxxxxx" <wl@xxxxxxx>
  • Delivery-date: Thu, 12 Nov 2020 14:47:17 +0000
  • Ironport-sdr: s5jYqgir+5Jc9u+Ngzt2+mSePeNIo5/5fNjUPdK0NeN+WER9xtPeWyWpkrMo/fBtzhQtNuY106 zoftY40MI5nnotSGeS/NEYznwUbyw36J4okrI99mz7P3HHtL+rRUBDpffNNWxaXJeakNs6FlL+ n9UizFQCYzbI4dFh5b1ne2CBzaGcwuk23oG1xFNRYEnx8I9XZy+uMVpPc1KD7FmB9OkAlVW5YU 4XLZ/wIXq/VZ6mpXFIrMjN/jSbztXWcyQttybWt7/Rl7JIWNyBoO72SrODTzT3P629IfUvobsp +Ss=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
> 
> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
> > On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
> >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >> index f74f728884c0..7dc7c70e24f2 100644
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -31,14 +31,87 @@
> >>   struct map_data {
> >>       struct domain *d;
> >>       bool map;
> >> +    struct pci_dev *pdev;
> > If the field is required please place it after the domain one.
> I will, but may I ask why?

So that if we add further boolean fields we can do at the end of the
struct for layout reasons. If we do:

struct map_data {
    struct domain *d;
    bool map;
    struct pci_dev *pdev;
    bool foo;
}

We will end up with a bunch of padding that could be avoided by doing:

struct map_data {
    struct domain *d;
    struct pci_dev *pdev;
    bool map;
    bool foo;
}

> >> +    s = PFN_DOWN(s);
> >> +    e = PFN_DOWN(e);
> > Changing the rangeset to store memory addresses instead of frames
> > could for example be split into a separate patch.
> Ok
> >
> > I think you are doing the calculation of the end pfn wrong here, you
> > should use PFN_UP instead in case the address is not aligned.
> 
> PFN_DOWN for the start seems to be ok if the address is not aligned
> 
> which is the case if I pass bar_index in the lower bits: PCI memory has
> 
> PAGE_SIZE granularity, so besides the fact that I use bar_index the address

No, BARs don't need to be aligned to page boundaries, you can even
have different BARs inside the same physical page.

The spec recommends that the minimum size of a BAR should be 4KB, but
that's not a strict requirement in which case a BAR can be as small as
16bytes, and then you can have multiple ones inside the same page.

> must be page aligned.
> 
> The end address is expressed in (size - 1) form, again page aligned,
> 
> so to get the last page to be mapped PFN_DOWN also seems to be appropriate.
> 
> Do I miss something here?

I'm not aware of any  of those addresses or sizes being guaranteed to
be page aligned, so I think you need to account for that.

Some of the code here uses PFN_DOWN to calculate the end address
because the rangesets are used in an inclusive fashion, so the end
frame also gets mapped.

> >
> >> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
> >>       for ( ; ; )
> >>       {
> >>           unsigned long size = e - s + 1;
> >> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long 
> >> e, void *data,
> >>            * - {un}map_mmio_regions doesn't support preemption.
> >>            */
> >>   
> >> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> >> -                      : unmap_mmio_regions(map->d, _gfn(s), size, 
> >> _mfn(s));
> >> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
> >> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
> >>           if ( rc == 0 )
> >>           {
> >> -            *c += size;
> >> +            /*
> >> +             * Range set is not expressed in frame numbers and the size
> >> +             * is the number of frames, so update accordingly.
> >> +             */
> >> +            *c += size << PAGE_SHIFT;
> >>               break;
> >>           }
> >>           if ( rc < 0 )
> >> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, 
> >> void *data,
> >>               break;
> >>           }
> >>           ASSERT(rc < size);
> >> -        *c += rc;
> >> +        *c += rc << PAGE_SHIFT;
> >>           s += rc;
> >> +        mfn += rc;
> >>           if ( general_preempt_check() )
> >>                   return -ERESTART;
> >>       }
> >> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, 
> >> void *data,
> >>   static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
> >>                               bool rom_only)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >>       bool map = cmd & PCI_COMMAND_MEMORY;
> >>       unsigned int i;
> >>   
> >> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
> >>           struct map_data data = {
> >>               .d = v->domain,
> >>               .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> >> +            .pdev = v->vpci.pdev,
> >>           };
> >>           int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> >>   
> >> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
> >>   static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> >>                               struct rangeset *mem, uint16_t cmd)
> >>   {
> >> -    struct map_data data = { .d = d, .map = true };
> >> +    struct map_data data = { .d = d, .map = true,
> >> +        .pdev = (struct pci_dev *)pdev };
> > Dropping the const here is not fine. IT either needs to be dropped
> > from apply_map and further up, or this needs to also be made const.
> Ok, I'll try to keep it const
> >
> >>       int rc;
> >>   
> >>       while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == 
> >> -ERESTART )
> >> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev 
> >> *pdev,
> >>   
> >>   static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool 
> >> rom_only)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header;
> >>       struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>       struct pci_dev *tmp, *dev = NULL;
> >>   #ifdef CONFIG_X86
> >> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, 
> >> uint16_t cmd, bool rom_only)
> >>       if ( !mem )
> >>           return -ENOMEM;
> >>   
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        header = get_hwdom_vpci_header(pdev);
> >> +    else
> >> +        header = get_vpci_header(current->domain, pdev);
> >> +
> >>       /*
> >>        * Create a rangeset that represents the current device BARs memory 
> >> region
> >>        * and compare it against all the currently active BAR memory 
> >> regions. If
> >> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, 
> >> uint16_t cmd, bool rom_only)
> >>        * First fill the rangeset with all the BARs of this device or with 
> >> the ROM
> >>        * BAR only, depending on whether the guest is toggling the memory 
> >> decode
> >>        * bit of the command register, or the enable bit of the ROM BAR 
> >> register.
> >> +     *
> >> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
> >>        */
> >>       for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >>       {
> >>           const struct vpci_bar *bar = &header->bars[i];
> >> -        unsigned long start = PFN_DOWN(bar->addr);
> >> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> >> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
> >> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
> >> +            bar->size - 1;
> > Will this work fine on Arm 32bits with LPAE? It's my understanding
> > that in that case unsigned long is 32bits, but the physical address
> > space is 44bits, in which case this won't work.
> Hm, good question
> >
> > I think you need to keep the usage of frame numbers here.
> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here 
> and elsewhere
> >
> >>   
> >>           if ( !MAPPABLE_BAR(bar) ||
> >>                (rom_only ? bar->type != VPCI_BAR_ROM
> >> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, 
> >> uint16_t cmd, bool rom_only)
> >>       /* Remove any MSIX regions if present. */
> >>       for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
> >>       {
> >> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
> >> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> >> -                                     vmsix_table_size(pdev->vpci, i) - 1);
> >> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
> >> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
> >> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
> >> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
> >> +                             vmsix_table_size(pdev->vpci, i) - 1;
> >>   
> >>           rc = rangeset_remove_range(mem, start, end);
> >>           if ( rc )
> >> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, 
> >> uint16_t cmd, bool rom_only)
> >>        */
> >>       for_each_pdev ( pdev->domain, tmp )
> >>       {
> >> +        struct vpci_header *header;
> >> +
> >>           if ( tmp == pdev )
> >>           {
> >>               /*
> >> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, 
> >> uint16_t cmd, bool rom_only)
> >>                   continue;
> >>           }
> >>   
> >> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >> +        header = get_vpci_header(tmp->domain, pdev);
> >> +
> >> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >>           {
> >> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >> -            unsigned long start = PFN_DOWN(bar->addr);
> >> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> >> +            const struct vpci_bar *bar = &header->bars[i];
> >> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) 
> >> | i;
> >> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
> >> +                + bar->size - 1;
> >>   
> >>               if ( !bar->enabled || !rangeset_overlaps_range(mem, start, 
> >> end) ||
> >>                    /*
> >> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, 
> >> unsigned int reg,
> >>           pci_conf_write16(pdev->sbdf, reg, cmd);
> >>   }
> >>   
> >> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> >> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
> >>                         uint32_t val, void *data)
> >>   {
> >>       struct vpci_bar *bar = data;
> >> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, 
> >> unsigned int reg,
> >>       {
> >>           /* If the value written is the current one avoid printing a 
> >> warning. */
> >>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> >> +        {
> >> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >> +
> >>               gprintk(XENLOG_WARNING,
> >>                       "%04x:%02x:%02x.%u: ignored BAR %lu write with 
> >> memory decoding enabled\n",
> >>                       pdev->seg, pdev->bus, slot, func,
> >> -                    bar - pdev->vpci->header.bars + hi);
> >> +                    bar - header->bars + hi);
> >> +        }
> >>           return;
> >>       }
> >>   
> >> -
> >>       /*
> >>        * Update the cached address, so that when memory decoding is enabled
> >>        * Xen can map the BAR into the guest p2m.
> >> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, 
> >> unsigned int reg,
> >>       pci_conf_write32(pdev->sbdf, reg, val);
> >>   }
> >>   
> >> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int 
> >> reg,
> >> +                               void *data)
> >> +{
> >> +    return vpci_hw_read32(pdev, reg, data);
> >> +}
> >> +
> >> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
> >> +                            uint32_t val, void *data)
> >> +{
> >> +    struct vpci_bar *vbar = data;
> >> +    bool hi = false;
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> >> +    {
> >> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >> +        vbar--;
> >> +        hi = true;
> >> +    }
> >> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> >> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> >> +}
> >> +
> >> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int 
> >> reg,
> >> +                               void *data)
> >> +{
> >> +    struct vpci_bar *vbar = data;
> >> +    uint32_t val;
> >> +    bool hi = false;
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> >> +    {
> >> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >> +        vbar--;
> >> +        hi = true;
> >> +    }
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == 
> >> VPCI_BAR_MEM64_HI )
> > I think this would be clearer using a switch statement.
> I'll think about
> >
> >> +    {
> >> +        if ( hi )
> >> +            val = vbar->addr >> 32;
> >> +        else
> >> +            val = vbar->addr & 0xffffffff;
> >> +        if ( val == ~0 )
> > Strictly speaking I think you are not forced to write 1s to the
> > reserved 4 bits in the low register (and in the 32bit case).
> 
> Ah, so Linux kernel, for instance, could have written 0xffffff0 while
> 
> I expect 0xffffffff?

I think real hardware would return the size when written 1s to all
bits except the reserved ones.

> 
> >
> >> +        {
> >> +            /* Guests detects BAR's properties and sizes. */
> >> +            if ( !hi )
> >> +            {
> >> +                val = 0xffffffff & ~(vbar->size - 1);
> >> +                val |= vbar->type == VPCI_BAR_MEM32 ? 
> >> PCI_BASE_ADDRESS_MEM_TYPE_32
> >> +                                                    : 
> >> PCI_BASE_ADDRESS_MEM_TYPE_64;
> >> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH 
> >> : 0;
> >> +            }
> >> +            else
> >> +                val = vbar->size >> 32;
> >> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> >> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> >> +        }
> >> +    }
> >> +    else if ( vbar->type == VPCI_BAR_MEM32 )
> >> +    {
> >> +        val = vbar->addr;
> >> +        if ( val == ~0 )
> >> +        {
> >> +            if ( !hi )
> > There's no way hi can be true at this point AFAICT.
> Sure, thank you
> >
> >> +            {
> >> +                val = 0xffffffff & ~(vbar->size - 1);
> >> +                val |= vbar->type == VPCI_BAR_MEM32 ? 
> >> PCI_BASE_ADDRESS_MEM_TYPE_32
> >> +                                                    : 
> >> PCI_BASE_ADDRESS_MEM_TYPE_64;
> >> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH 
> >> : 0;
> >> +            }
> >> +        }
> >> +    }
> >> +    else
> >> +    {
> >> +        val = vbar->addr;
> >> +    }
> >> +    return val;
> >> +}
> >> +
> >>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>                         uint32_t val, void *data)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >>       struct vpci_bar *rom = data;
> >>       uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> >>       uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> >> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, 
> >> unsigned int reg,
> >>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
> >>   }
> > Don't you need to also protect a domU from writing to the ROM BAR
> > register?
> 
> ROM was not a target of this RFC as I have no HW to test that, but final code 
> must
> 
> also handle ROM as well, you are right
> 
> >
> >>   
> >> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned 
> >> int reg,
> >> +                                  void *data)
> >> +{
> >> +    struct vpci_bar *vbar, *bar = data;
> >> +
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        return bar_read_hwdom(pdev, reg, data);
> >> +
> >> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
> >> +    if ( !vbar )
> >> +        return ~0;
> >> +
> >> +    return bar_read_guest(pdev, reg, vbar);
> >> +}
> >> +
> >> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int 
> >> reg,
> >> +                               uint32_t val, void *data)
> >> +{
> >> +    struct vpci_bar *bar = data;
> >> +
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        bar_write_hwdom(pdev, reg, val, data);
> >> +    else
> >> +    {
> >> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, 
> >> bar->index);
> >> +
> >> +        if ( !vbar )
> >> +            return;
> >> +        bar_write_guest(pdev, reg, val, vbar);
> >> +    }
> >> +}
> > You should assign different handlers based on whether the domain that
> > has the device assigned is a domU or the hardware domain, rather than
> > doing the selection here.
> 
> Hm, handlers are assigned once in init_bars and this function is only called
> 
> for hwdom, so there is no way I can do that for the guests. Hence, the 
> dispatcher

I think we might want to reset the vPCI handlers when a devices gets
assigned and deassigned. In order to do passthrough to domUs safely
we will have to add more handlers than what's required for dom0, and
having is_hardware_domain sprinkled in all of them is not a suitable
solution.

Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.