|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] vpci: introduce per-domain lock to protect vpci structure
Hi, Roger!
On 10.02.22 18:16, Roger Pau Monné wrote:
> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>
>> Introduce a per-domain read/write lock to check whether vpci is present,
>> so we are sure there are no accesses to the contents of the vpci struct
>> if not. This lock can be used (and in a few cases is used right away)
>> so that vpci removal can be performed while holding the lock in write
>> mode. Previously such removal could race with vpci_read for example.
> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt
> pci_remove_device, and likely when vPCI gets also used in
> {de}assign_device I think.
Yes, this is indeed an issue, but I was not trying to solve it in
context of vPCI locking yet. I think we should discuss how do
we approach pdev locking, so I can create a patch for that.
that being said, I would like not to solve pdev in this patch yet
...I do understand we do want to avoid that, but at the moment
a single reliable way for making sure pdev is alive seems to
be pcidevs_lock....
>
>> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock. It is not possible to
>> upgrade read lock to write lock in such a case. So, in order to prevent
>> the deadlock, check which registers are going to be written and acquire
>> the lock in the appropriate mode from the beginning.
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>>
>> 3. Optimize if ROM BAR write lock required detection by caching offset
>> of the ROM BAR register in vpci->header->rom_reg which depends on
>> header's type.
>>
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>>
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>>
>> 6. vpci_{add|remove}_register are required to be called with the write lock
>> held, but it is not feasible to add an assert there as it requires
>> struct domain to be passed for that. So, add a comment about this requirement
>> to these and other functions with the equivalent constraints.
>>
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>>
>> 8. This is based on the discussion at [1].
>>
>> [1]
>> https://urldefense.com/v3/__https://lore.kernel.org/all/20220204063459.680961-4-andr2000@xxxxxxxxx/__;!!GF_29dbcQIUBPA!gObSySzN7s6zSKrcpSEi6vw18fRPls157cuRoqq4KDd7Ic_Nvh_cFlyVXPRpEWBkI38pgsvvfg$
>> [lore[.]kernel[.]org]
>>
>> Suggested-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>
>> ---
>> This was checked on x86: with and without PVH Dom0.
>> ---
>> xen/arch/x86/hvm/vmsi.c | 2 +
>> xen/common/domain.c | 3 +
>> xen/drivers/vpci/header.c | 8 +++
>> xen/drivers/vpci/msi.c | 8 ++-
>> xen/drivers/vpci/msix.c | 40 +++++++++++--
>> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++----------
>> xen/include/xen/sched.h | 3 +
>> xen/include/xen/vpci.h | 2 +
>> 8 files changed, 146 insertions(+), 34 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 13e2a190b439..351cb968a423 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>> {
>> unsigned int i;
>>
>> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock));
> ^ no need for the double negation.
Ok, will update all asserts which use !!
>
> Also this asserts that the lock is taken, but could be by a different
> pCPU. I guess it's better than nothing.
Fair enough. Do you still want the asserts or should I remove them?
>
>> +
>> for ( i = 0; i < msix->max_entries; i++ )
>> {
>> const struct vpci_msix_entry *entry = &msix->entries[i];
> Since this function is now called with the per-domain rwlock read
> locked it's likely not appropriate to call process_pending_softirqs
> while holding such lock (check below).
You are right, as it is possible that:
process_pending_softirqs -> vpci_process_pending -> read_lock
Even more, vpci_process_pending may also
read_unlock -> vpci_remove_device -> write_lock
in its error path. So, any invocation of process_pending_softirqs
must not hold d->vpci_rwlock at least.
And also we need to check that pdev->vpci was not removed
in between or *re-created*
>
> We will likely need to re-iterate over the list of pdevs assigned to
> the domain and assert that the pdev is still assigned to the same
> domain.
So, do you mean a pattern like the below should be used at all
places where we need to call process_pending_softirqs?
read_unlock
process_pending_softirqs
read_lock
pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) )
<continue processing>
>
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 2048ebad86ff..10558c22285d 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid,
>>
>> #ifdef CONFIG_HAS_PCI
>> INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> + rwlock_init(&d->vpci_rwlock);
>> +#endif
>> #endif
>>
>> /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 40ff79c33f8f..9e2aeb2055c9 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v)
>> if ( rc == -ERESTART )
>> return true;
>>
>> + read_lock(&v->domain->vpci_rwlock);
>> spin_lock(&v->vpci.pdev->vpci->lock);
>> /* Disable memory decoding unconditionally on failure. */
>> modify_decoding(v->vpci.pdev,
>> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
>> v->vpci.cmd,
>> !rc && v->vpci.rom_only);
>> spin_unlock(&v->vpci.pdev->vpci->lock);
>> + read_unlock(&v->domain->vpci_rwlock);
>>
>> rangeset_destroy(v->vpci.mem);
>> v->vpci.mem = NULL;
>> @@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev
>> *pdev,
>> raise_softirq(SCHEDULE_SOFTIRQ);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool
>> rom_only)
>> {
>> struct vpci_header *header = &pdev->vpci->header;
>> @@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev)
>> struct vpci_bar *bars = header->bars;
>> int rc;
>>
>> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>> {
>> case PCI_HEADER_TYPE_NORMAL:
>> @@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev)
>> {
>> struct vpci_bar *rom = &header->bars[num_bars];
>>
>> + header->rom_reg = rom_reg;
>> rom->type = VPCI_BAR_ROM;
>> rom->size = size;
>> rom->addr = addr;
>> @@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev)
>> if ( rc )
>> rom->type = VPCI_BAR_EMPTY;
> You can also set 'rom_reg = ~0' here. Or move the setting of rom_reg
> after the handler has been successfully installed.
>
> I think it would be easier to just signal no ROM BAR with rom_reg ==
> 0. There's no header where the ROM BAR is at offset 0. That way you
> will only have to set rom_reg on the successful path, but you don't
> need to care about the case where there's no ROM BAR.
Yes, it is possible
>
>> }
>> + else
>> + header->rom_reg = ~(unsigned int)0;
> No need for the cast.
Ok
>
>>
>> return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>> }
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 5757a7aed20f..5df3dfa8243c 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev)
>> uint16_t control;
>> int ret;
>>
>> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>> if ( !pos )
>> return 0;
>>
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>
>> void vpci_dump_msi(void)
>> {
>> - const struct domain *d;
>> + struct domain *d;
>>
>> rcu_read_lock(&domlist_read_lock);
>> for_each_domain ( d )
>> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>>
>> printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>
>> + if ( !read_trylock(&d->vpci_rwlock) )
>> + continue;
>> +
>> for_each_pdev ( d, pdev )
>> {
>> const struct vpci_msi *msi;
>> @@ -326,6 +331,7 @@ void vpci_dump_msi(void)
>> spin_unlock(&pdev->vpci->lock);
>> process_pending_softirqs();
>> }
>> + read_unlock(&d->vpci_rwlock);
> Same here, you are calling process_pending_softirqs while holding
> vpci_rwlock.
Same as above
>
>> }
>> rcu_read_unlock(&domlist_read_lock);
>> }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 846f1b8d7038..5296d6025d8e 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev,
>> unsigned int reg,
>> pci_conf_write16(pdev->sbdf, reg, val);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> static struct vpci_msix *msix_find(const struct domain *d, unsigned long
>> addr)
>> {
>> struct vpci_msix *msix;
>> @@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain
>> *d, unsigned long addr)
>>
>> static int msix_accept(struct vcpu *v, unsigned long addr)
>> {
>> - return !!msix_find(v->domain, addr);
>> + int rc;
>> +
>> + read_lock(&v->domain->vpci_rwlock);
>> + rc = !!msix_find(v->domain, addr);
>> + read_unlock(&v->domain->vpci_rwlock);
> Newline before return.
Ok
>
>> + return rc;
>> }
>>
>> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> index fb0947179b79..16bb3b832e6a 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev)
>> }
>> #endif /* __XEN__ */
>>
>> -static int vpci_register_cmp(const struct vpci_register *r1,
>> - const struct vpci_register *r2)
>> +static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size,
>> + unsigned int r2_offset, unsigned int r2_size)
>> {
>> /* Return 0 if registers overlap. */
>> - if ( r1->offset < r2->offset + r2->size &&
>> - r2->offset < r1->offset + r1->size )
>> + if ( r1_offset < r2_offset + r2_size &&
>> + r2_offset < r1_offset + r1_size )
>> return 0;
>> - if ( r1->offset < r2->offset )
>> + if ( r1_offset < r2_offset )
>> return -1;
>> - if ( r1->offset > r2->offset )
>> + if ( r1_offset > r2_offset )
>> return 1;
>>
>> ASSERT_UNREACHABLE();
>> return 0;
>> }
>>
>> +static int vpci_register_cmp(const struct vpci_register *r1,
>> + const struct vpci_register *r2)
>> +{
>> + return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size);
>> +}
> Seeing how this gets used below I'm not sure it's very helpful to
> reuse vpci_register_cmp, see my comment below.
Yes, the only thing which is used is the first check for the overlap
>
>> +
>> /* Dummy hooks, writes are ignored, reads return 1's */
>> static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int
>> reg,
>> void *data)
>> @@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev,
>> unsigned int reg,
>> return pci_conf_read32(pdev->sbdf, reg);
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>> vpci_write_t *write_handler, unsigned int offset,
>> unsigned int size, void *data)
>> @@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t
>> *read_handler,
>> r->offset = offset;
>> r->private = data;
>>
>> - spin_lock(&vpci->lock);
>> -
>> /* The list of handlers must be kept sorted at all times. */
>> list_for_each ( prev, &vpci->handlers )
>> {
>> @@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t
>> *read_handler,
>> break;
>> if ( cmp == 0 )
>> {
>> - spin_unlock(&vpci->lock);
>> xfree(r);
>> return -EEXIST;
>> }
>> }
>>
>> list_add_tail(&r->node, prev);
>> - spin_unlock(&vpci->lock);
>>
>> return 0;
>> }
>>
>> +/* This must hold domain's vpci_rwlock in write mode. */
>> int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>> unsigned int size)
>> {
>> const struct vpci_register r = { .offset = offset, .size = size };
>> struct vpci_register *rm;
>>
>> - spin_lock(&vpci->lock);
>> list_for_each_entry ( rm, &vpci->handlers, node )
>> {
>> int cmp = vpci_register_cmp(&r, rm);
>> @@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned
>> int offset,
>> if ( !cmp && rm->offset == offset && rm->size == size )
>> {
>> list_del(&rm->node);
>> - spin_unlock(&vpci->lock);
>> xfree(rm);
>> return 0;
>> }
>> if ( cmp <= 0 )
>> break;
>> }
>> - spin_unlock(&vpci->lock);
>>
>> return -ENOENT;
>> }
>> @@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t
>> new, unsigned int size,
>>
>> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> @@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
>> unsigned int size)
>> if ( !pdev )
>> return vpci_read_hw(sbdf, reg, size);
>>
>> + read_lock(&d->vpci_rwlock);
> After taking the domain lock you need to check that pdev->vpci !=
> NULL, as vpci_remove_device will set pdev->vpci == NULL before
> removing the device from the domain. Same applies to vpci_add_handlers
> which will be called with the device already added to the domain, but
> with pdev->vpci == NULL.
Sure, will add
>
> We would also need some kind of protection arround
> pci_get_pdev_by_domain so that devices are not removed (from the
> domain) while we are iterating over it.
Again, this is a problem. And we need to find a way to solve that
>
>> spin_lock(&pdev->vpci->lock);
>>
>> /* Read from the hardware or the emulated register handlers. */
>> @@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
>> unsigned int size)
>> ASSERT(data_offset < size);
>> }
>> spin_unlock(&pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>>
>> if ( data_offset < size )
>> {
>> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev
>> *pdev,
>> r->private);
>> }
>>
>> +static bool vpci_header_write_lock(const struct pci_dev *pdev,
>> + unsigned int start, unsigned int size)
> I think this should live in header.c, for consistency.
Ok, will move
>
> I'm also not sure it's worth adding vpci_offset_cmp: you just need to
> do a range overlap check, and that can be easily open coded. It's just
> the first 'if' in vpci_register_cmp that you want, the rest of the
> code is just adding overhead.
Agree
>
>> +{
>> + /*
>> + * Writing the command register and ROM BAR register may trigger
>> + * modify_bars to run which in turn may access multiple pdevs while
>> + * checking for the existing BAR's overlap. The overlapping check, if
>> done
>> + * under the read lock, requires vpci->lock to be acquired on both
>> devices
>> + * being compared, which may produce a deadlock. It is not possible to
>> + * upgrade read lock to write lock in such a case. So, in order to
>> prevent
>> + * the deadlock, check which registers are going to be written and
>> acquire
>> + * the lock in the appropriate mode from the beginning.
>> + */
>> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) )
>> + return true;
>> +
>> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) )
> No need for the comparison if rom_reg is unset. Also you can OR both
> conditions into a single if.
If we open code vpci_offset_cmp with a single if all this is going
to be a bit clumsy:
if ( r1_offset < r2_offset + r2_size &&
r2_offset < r1_offset + r1_size )
return 0;
This is a single check.
Now we need to check two registers with the code above and
also check that pdev->vpci->header.rom_reg != 0
I think it would be more readable if we have a tiny helper function
static bool vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size,
unsigned int r2_offset, unsigned int r2_size)
{
/* Return 0 if registers overlap. */
if ( r1_offset < r2_offset + r2_size &&
r2_offset < r1_offset + r1_size )
return false;
return true;
}
So, then we can have something like
static bool vpci_header_write_lock(const struct pci_dev *pdev,
unsigned int start, unsigned int size)
{
if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ||
(pdev->vpci->header.rom_reg &&
!vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4)) )
return true;
return false;
}
>
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> uint32_t data)
>> {
>> - const struct domain *d = current->domain;
>> + struct domain *d = current->domain;
>> const struct pci_dev *pdev;
>> const struct vpci_register *r;
>> unsigned int data_offset = 0;
>> const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
>> + bool write_locked = false;
>>
>> if ( !size )
>> {
>> @@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg,
>> unsigned int size,
>> return;
>> }
>>
>> - spin_lock(&pdev->vpci->lock);
>> + if ( vpci_header_write_lock(pdev, reg, size) )
>> + {
>> + /* Gain exclusive access to all of the domain pdevs vpci. */
>> + write_lock(&d->vpci_rwlock);
>> + write_locked = true;
> Here you need to check that pdev->vpci != NULL...
Will do
>
>> + }
>> + else
>> + {
>> + read_lock(&d->vpci_rwlock);
> ...and also here.
Will do
>
>> + spin_lock(&pdev->vpci->lock);
>> + }
>>
>> /* Write the value to the hardware or emulated registers. */
>> list_for_each_entry ( r, &pdev->vpci->handlers, node )
>> @@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg,
>> unsigned int size,
>> break;
>> ASSERT(data_offset < size);
>> }
>> - spin_unlock(&pdev->vpci->lock);
>> +
>> + if ( write_locked )
>> + write_unlock(&d->vpci_rwlock);
>> + else
>> + {
>> + spin_unlock(&pdev->vpci->lock);
>> + read_unlock(&d->vpci_rwlock);
>> + }
>>
>> if ( data_offset < size )
>> /* Tailing gap, write the remaining. */
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 37f78cc4c4c9..ecd34481a7af 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -444,6 +444,9 @@ struct domain
>>
>> #ifdef CONFIG_HAS_PCI
>> struct list_head pdev_list;
>> +#ifdef CONFIG_HAS_VPCI
>> + rwlock_t vpci_rwlock;
>> +#endif
>> #endif
>>
>> #ifdef CONFIG_HAS_PASSTHROUGH
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index e8ac1eb39513..e19e462ee5cb 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -88,6 +88,8 @@ struct vpci {
>> * is mapped into guest p2m) if there's a ROM BAR on the device.
>> */
>> bool rom_enabled : 1;
>> + /* Offset to the ROM BAR register if any. */
>> + unsigned int rom_reg;
> Could you please place this field after 'type'? That will avoid some
> padding in the structure.
Sure
>
> Thanks, Roger.
Thank you,
Oleksandr
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |