|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] vpci: introduce per-domain lock to protect vpci structure
On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> Introduce a per-domain read/write lock to check whether vpci is present,
> so we are sure there are no accesses to the contents of the vpci struct
> if not. This lock can be used (and in a few cases is used right away)
> so that vpci removal can be performed while holding the lock in write
> mode. Previously such removal could race with vpci_read for example.
Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt
pci_remove_device, and likely when vPCI gets also used in
{de}assign_device I think.
> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
> from being removed.
>
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if done
> under the read lock, requires vpci->lock to be acquired on both devices
> being compared, which may produce a deadlock. It is not possible to
> upgrade read lock to write lock in such a case. So, in order to prevent
> the deadlock, check which registers are going to be written and acquire
> the lock in the appropriate mode from the beginning.
>
> All other code, which doesn't lead to pdev->vpci destruction and does not
> access multiple pdevs at the same time, can still use a combination of the
> read lock and pdev->vpci->lock.
>
> 3. Optimize if ROM BAR write lock required detection by caching offset
> of the ROM BAR register in vpci->header->rom_reg which depends on
> header's type.
>
> 4. Reduce locked region in vpci_remove_device as it is now possible
> to set pdev->vpci to NULL early right after the write lock is acquired.
>
> 5. Reduce locked region in vpci_add_handlers as it is possible to
> initialize many more fields of the struct vpci before assigning it to
> pdev->vpci.
>
> 6. vpci_{add|remove}_register are required to be called with the write lock
> held, but it is not feasible to add an assert there as it requires
> struct domain to be passed for that. So, add a comment about this requirement
> to these and other functions with the equivalent constraints.
>
> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>
> 8. This is based on the discussion at [1].
>
> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/
>
> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>
> ---
> This was checked on x86: with and without PVH Dom0.
> ---
> xen/arch/x86/hvm/vmsi.c | 2 +
> xen/common/domain.c | 3 +
> xen/drivers/vpci/header.c | 8 +++
> xen/drivers/vpci/msi.c | 8 ++-
> xen/drivers/vpci/msix.c | 40 +++++++++++--
> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++----------
> xen/include/xen/sched.h | 3 +
> xen/include/xen/vpci.h | 2 +
> 8 files changed, 146 insertions(+), 34 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 13e2a190b439..351cb968a423 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
> {
> unsigned int i;
>
> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock));
^ no need for the double negation.
Also this asserts that the lock is taken, but could be by a different
pCPU. I guess it's better than nothing.
> +
> for ( i = 0; i < msix->max_entries; i++ )
> {
> const struct vpci_msix_entry *entry = &msix->entries[i];
Since this function is now called with the per-domain rwlock read
locked it's likely not appropriate to call process_pending_softirqs
while holding such lock (check below).
We will likely need to re-iterate over the list of pdevs assigned to
the domain and assert that the pdev is still assigned to the same
domain.
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 2048ebad86ff..10558c22285d 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid,
>
> #ifdef CONFIG_HAS_PCI
> INIT_LIST_HEAD(&d->pdev_list);
> +#ifdef CONFIG_HAS_VPCI
> + rwlock_init(&d->vpci_rwlock);
> +#endif
> #endif
>
> /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 40ff79c33f8f..9e2aeb2055c9 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v)
> if ( rc == -ERESTART )
> return true;
>
> + read_lock(&v->domain->vpci_rwlock);
> spin_lock(&v->vpci.pdev->vpci->lock);
> /* Disable memory decoding unconditionally on failure. */
> modify_decoding(v->vpci.pdev,
> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> !rc && v->vpci.rom_only);
> spin_unlock(&v->vpci.pdev->vpci->lock);
> + read_unlock(&v->domain->vpci_rwlock);
>
> rangeset_destroy(v->vpci.mem);
> v->vpci.mem = NULL;
> @@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev
> *pdev,
> raise_softirq(SCHEDULE_SOFTIRQ);
> }
>
> +/* This must hold domain's vpci_rwlock in write mode. */
> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool
> rom_only)
> {
> struct vpci_header *header = &pdev->vpci->header;
> @@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev)
> struct vpci_bar *bars = header->bars;
> int rc;
>
> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
> {
> case PCI_HEADER_TYPE_NORMAL:
> @@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev)
> {
> struct vpci_bar *rom = &header->bars[num_bars];
>
> + header->rom_reg = rom_reg;
> rom->type = VPCI_BAR_ROM;
> rom->size = size;
> rom->addr = addr;
> @@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev)
> if ( rc )
> rom->type = VPCI_BAR_EMPTY;
You can also set 'rom_reg = ~0' here. Or move the setting of rom_reg
after the handler has been successfully installed.
I think it would be easier to just signal no ROM BAR with rom_reg ==
0. There's no header where the ROM BAR is at offset 0. That way you
will only have to set rom_reg on the successful path, but you don't
need to care about the case where there's no ROM BAR.
> }
> + else
> + header->rom_reg = ~(unsigned int)0;
No need for the cast.
>
> return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
> }
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 5757a7aed20f..5df3dfa8243c 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev)
> uint16_t control;
> int ret;
>
> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
> if ( !pos )
> return 0;
>
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>
> void vpci_dump_msi(void)
> {
> - const struct domain *d;
> + struct domain *d;
>
> rcu_read_lock(&domlist_read_lock);
> for_each_domain ( d )
> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>
> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>
> + if ( !read_trylock(&d->vpci_rwlock) )
> + continue;
> +
> for_each_pdev ( d, pdev )
> {
> const struct vpci_msi *msi;
> @@ -326,6 +331,7 @@ void vpci_dump_msi(void)
> spin_unlock(&pdev->vpci->lock);
> process_pending_softirqs();
> }
> + read_unlock(&d->vpci_rwlock);
Same here, you are calling process_pending_softirqs while holding
vpci_rwlock.
> }
> rcu_read_unlock(&domlist_read_lock);
> }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 846f1b8d7038..5296d6025d8e 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev,
> unsigned int reg,
> pci_conf_write16(pdev->sbdf, reg, val);
> }
>
> +/* This must hold domain's vpci_rwlock in write mode. */
> static struct vpci_msix *msix_find(const struct domain *d, unsigned long
> addr)
> {
> struct vpci_msix *msix;
> @@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain
> *d, unsigned long addr)
>
> static int msix_accept(struct vcpu *v, unsigned long addr)
> {
> - return !!msix_find(v->domain, addr);
> + int rc;
> +
> + read_lock(&v->domain->vpci_rwlock);
> + rc = !!msix_find(v->domain, addr);
> + read_unlock(&v->domain->vpci_rwlock);
Newline before return.
> + return rc;
> }
>
> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> index fb0947179b79..16bb3b832e6a 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev)
> }
> #endif /* __XEN__ */
>
> -static int vpci_register_cmp(const struct vpci_register *r1,
> - const struct vpci_register *r2)
> +static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size,
> + unsigned int r2_offset, unsigned int r2_size)
> {
> /* Return 0 if registers overlap. */
> - if ( r1->offset < r2->offset + r2->size &&
> - r2->offset < r1->offset + r1->size )
> + if ( r1_offset < r2_offset + r2_size &&
> + r2_offset < r1_offset + r1_size )
> return 0;
> - if ( r1->offset < r2->offset )
> + if ( r1_offset < r2_offset )
> return -1;
> - if ( r1->offset > r2->offset )
> + if ( r1_offset > r2_offset )
> return 1;
>
> ASSERT_UNREACHABLE();
> return 0;
> }
>
> +static int vpci_register_cmp(const struct vpci_register *r1,
> + const struct vpci_register *r2)
> +{
> + return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size);
> +}
Seeing how this gets used below I'm not sure it's very helpful to
reuse vpci_register_cmp, see my comment below.
> +
> /* Dummy hooks, writes are ignored, reads return 1's */
> static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int
> reg,
> void *data)
> @@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev,
> unsigned int reg,
> return pci_conf_read32(pdev->sbdf, reg);
> }
>
> +/* This must hold domain's vpci_rwlock in write mode. */
> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
> vpci_write_t *write_handler, unsigned int offset,
> unsigned int size, void *data)
> @@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t
> *read_handler,
> r->offset = offset;
> r->private = data;
>
> - spin_lock(&vpci->lock);
> -
> /* The list of handlers must be kept sorted at all times. */
> list_for_each ( prev, &vpci->handlers )
> {
> @@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t
> *read_handler,
> break;
> if ( cmp == 0 )
> {
> - spin_unlock(&vpci->lock);
> xfree(r);
> return -EEXIST;
> }
> }
>
> list_add_tail(&r->node, prev);
> - spin_unlock(&vpci->lock);
>
> return 0;
> }
>
> +/* This must hold domain's vpci_rwlock in write mode. */
> int vpci_remove_register(struct vpci *vpci, unsigned int offset,
> unsigned int size)
> {
> const struct vpci_register r = { .offset = offset, .size = size };
> struct vpci_register *rm;
>
> - spin_lock(&vpci->lock);
> list_for_each_entry ( rm, &vpci->handlers, node )
> {
> int cmp = vpci_register_cmp(&r, rm);
> @@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned
> int offset,
> if ( !cmp && rm->offset == offset && rm->size == size )
> {
> list_del(&rm->node);
> - spin_unlock(&vpci->lock);
> xfree(rm);
> return 0;
> }
> if ( cmp <= 0 )
> break;
> }
> - spin_unlock(&vpci->lock);
>
> return -ENOENT;
> }
> @@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new,
> unsigned int size,
>
> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> @@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
> unsigned int size)
> if ( !pdev )
> return vpci_read_hw(sbdf, reg, size);
>
> + read_lock(&d->vpci_rwlock);
After taking the domain lock you need to check that pdev->vpci !=
NULL, as vpci_remove_device will set pdev->vpci == NULL before
removing the device from the domain. Same applies to vpci_add_handlers
which will be called with the device already added to the domain, but
with pdev->vpci == NULL.
We would also need some kind of protection arround
pci_get_pdev_by_domain so that devices are not removed (from the
domain) while we are iterating over it.
> spin_lock(&pdev->vpci->lock);
>
> /* Read from the hardware or the emulated register handlers. */
> @@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
> unsigned int size)
> ASSERT(data_offset < size);
> }
> spin_unlock(&pdev->vpci->lock);
> + read_unlock(&d->vpci_rwlock);
>
> if ( data_offset < size )
> {
> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev
> *pdev,
> r->private);
> }
>
> +static bool vpci_header_write_lock(const struct pci_dev *pdev,
> + unsigned int start, unsigned int size)
I think this should live in header.c, for consistency.
I'm also not sure it's worth adding vpci_offset_cmp: you just need to
do a range overlap check, and that can be easily open coded. It's just
the first 'if' in vpci_register_cmp that you want, the rest of the
code is just adding overhead.
> +{
> + /*
> + * Writing the command register and ROM BAR register may trigger
> + * modify_bars to run which in turn may access multiple pdevs while
> + * checking for the existing BAR's overlap. The overlapping check, if
> done
> + * under the read lock, requires vpci->lock to be acquired on both
> devices
> + * being compared, which may produce a deadlock. It is not possible to
> + * upgrade read lock to write lock in such a case. So, in order to
> prevent
> + * the deadlock, check which registers are going to be written and
> acquire
> + * the lock in the appropriate mode from the beginning.
> + */
> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) )
> + return true;
> +
> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) )
No need for the comparison if rom_reg is unset. Also you can OR both
conditions into a single if.
> + return true;
> +
> + return false;
> +}
> +
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> uint32_t data)
> {
> - const struct domain *d = current->domain;
> + struct domain *d = current->domain;
> const struct pci_dev *pdev;
> const struct vpci_register *r;
> unsigned int data_offset = 0;
> const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
> + bool write_locked = false;
>
> if ( !size )
> {
> @@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg,
> unsigned int size,
> return;
> }
>
> - spin_lock(&pdev->vpci->lock);
> + if ( vpci_header_write_lock(pdev, reg, size) )
> + {
> + /* Gain exclusive access to all of the domain pdevs vpci. */
> + write_lock(&d->vpci_rwlock);
> + write_locked = true;
Here you need to check that pdev->vpci != NULL...
> + }
> + else
> + {
> + read_lock(&d->vpci_rwlock);
...and also here.
> + spin_lock(&pdev->vpci->lock);
> + }
>
> /* Write the value to the hardware or emulated registers. */
> list_for_each_entry ( r, &pdev->vpci->handlers, node )
> @@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg,
> unsigned int size,
> break;
> ASSERT(data_offset < size);
> }
> - spin_unlock(&pdev->vpci->lock);
> +
> + if ( write_locked )
> + write_unlock(&d->vpci_rwlock);
> + else
> + {
> + spin_unlock(&pdev->vpci->lock);
> + read_unlock(&d->vpci_rwlock);
> + }
>
> if ( data_offset < size )
> /* Tailing gap, write the remaining. */
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 37f78cc4c4c9..ecd34481a7af 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -444,6 +444,9 @@ struct domain
>
> #ifdef CONFIG_HAS_PCI
> struct list_head pdev_list;
> +#ifdef CONFIG_HAS_VPCI
> + rwlock_t vpci_rwlock;
> +#endif
> #endif
>
> #ifdef CONFIG_HAS_PASSTHROUGH
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index e8ac1eb39513..e19e462ee5cb 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -88,6 +88,8 @@ struct vpci {
> * is mapped into guest p2m) if there's a ROM BAR on the device.
> */
> bool rom_enabled : 1;
> + /* Offset to the ROM BAR register if any. */
> + unsigned int rom_reg;
Could you please place this field after 'type'? That will avoid some
padding in the structure.
Thanks, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |