
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>
  • From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Date: Tue, 8 Feb 2022 11:13:41 +0000
  • Accept-language: en-US
  • Cc: "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "paul@xxxxxxx" <paul@xxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Delivery-date: Tue, 08 Feb 2022 11:14:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v6 03/13] vpci: move lock outside of struct vpci


On 08.02.22 12:50, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:37, Jan Beulich wrote:
>>>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>>>> 1b. Make vpci_write use the write lock for writes to the command
>>>>>>>> register and BARs only; keep using the read lock for all other writes.
>>>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>                      uint32_t data)
>>>>>>> [snip]
>>>>>>>     list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>>>>     {
>>>>>>> [snip]
>>>>>>>         if ( r->needs_write_lock )
>>>>>>>             write_lock(d->vpci_lock)
>>>>>>>         else
>>>>>>>             read_lock(d->vpci_lock)
>>>>>>> ....
>>>>>>>
>>>>>>> And provide rw as an argument to:
>>>>>>>
>>>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>>>                       vpci_write_t *write_handler, unsigned int offset,
>>>>>>>                       unsigned int size, void *data,
>>>>>>>                       --->>> bool write_path <<<-----)
>>>>>>>
>>>>>>> Is this what you mean?
>>>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>>>> in write mode.
>>>>> Yes, I started writing a reply with that. So, the summary (ROM
>>>>> position depends on header type):
>>>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>>>> {
>>>>>         read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>>>         if ( enabled )
>>>>>             write_lock(d->vpci_lock)
>>>>>         else
>>>>>             read_lock(d->vpci_lock)
>>>>> }
>>>> Hmm, yes, you can actually get away without using "size", since both
>>>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>>>> accesses get split in vpci_ecam_write().
>>> But the OS may want to read a single byte of the ROM BAR, so I think
>>> I'll need to check whether reg+size falls into the PCI_COMMAND and
>>> ROM BAR ranges
>>>> For the command register the memory- / IO-decoding-enabled check may
>>>> end up a little more complicated, as the value to be written also
>>>> matters. Maybe read the command register only for the ROM BAR write,
>>>> using the write lock uniformly for all command register writes?
>>> Sounds good for the start.
>>> Another concern is that if we go with a read_lock and then, in the
>>> underlying code, we disable memory decoding and try doing
>>> something and call the cmd_write handler for any reason, then....
>>>
>>> I mean that the check in vpci_write is something we can tolerate,
>>> but then it must be taken into account that no code in the read path
>>> is allowed to perform write-path functions. Which brings a pretty
>>> valid use-case: say in read mode we detect an unrecoverable error
>>> and need to remove the device:
>>> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>>>
>>> What do we do then? It is all going to be fragile...
>> I have tried to summarize the options we have wrt locking
>> and would love to hear from @Roger and @Jan.
>>
>> In every variant there is the task of dealing with the overlap
>> detection in modify_bars, so this is the only place as of now
>> which needs special treatment.
>>
>> Existing limitations: there is no way to upgrade a read lock to a write
>> lock, so paths which may require write lock protection need to take the
>> write lock from the very beginning. Workarounds can be applied.
>>
>> 1. Per-domain rw lock, aka d->vpci_lock
>> ==============================================================
>> Note: with a per-domain rw lock it is possible to do without introducing
>> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
>> is required.
> Er, no, I think you still need a per-device lock unless you intend to
> take the per-domain rwlock in write mode every time you modify data
> in vpci.
This is exactly the assumption stated below. I am trying to discuss
all the possible options, so this one is also listed.
>   I still think you need pdev->vpci->lock. It's possible this
> approach doesn't require moving the lock outside of the vpci struct.
>
>> This is only going to work if vpci_write always takes the write lock
>> and vpci_read takes a read lock, and no path in vpci_read is allowed to
>> perform write-path operations.
> I think that's likely too strong?
>
> You could get away with both vpci_{read,write} only taking the read
> lock and using a per-device vpci lock?
But as discussed before:
- if pdev->vpci_lock is used this still leads to ABBA
- we need to know beforehand whether to take the write lock (see the
  sketch below)
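For completeness, a minimal sketch of how such an up-front check could look,
derived only from reg/size as suggested earlier in the thread. The helper
name and the config-space read of the header type are assumptions of this
sketch, not existing code:

/*
 * Hypothetical helper: decide whether a config space access needs the
 * per-domain lock in write mode. reg + size overlap checks are used so
 * that partial (e.g. single byte) accesses are also caught.
 */
static bool vpci_header_need_write_lock(const struct pci_dev *pdev,
                                        unsigned int reg, unsigned int size)
{
    /* The ROM BAR position depends on the header type. */
    unsigned int rom = (pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f) ==
                       PCI_HEADER_TYPE_NORMAL ? PCI_ROM_ADDRESS
                                              : PCI_ROM_ADDRESS1;

    if ( reg < PCI_COMMAND + 2 && reg + size > PCI_COMMAND )
        return true;

    return reg < rom + 4 && reg + size > rom;
}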
>
> Otherwise you are likely to introduce contention in msix_write if a
> guest makes heavy use of the MSI-X entry mask bit.
>
>> vpci_process_pending uses the write lock as it has vpci_remove_device in its
>> error path.
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>>
>> Cons:
>> - all writes are serialized
>> - need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> 1.1. Semi read lock upgrade in modify_bars
>> --------------------------------------------------------------
>> In this case both vpci_read and vpci_write take a read lock and when it comes
>> to modify_bars:
>>
>> 1. read_unlock(d->vpci_lock)
>> 2. write_lock(d->vpci_lock)
>> 3. Check that pdev->vpci is still available and is the same object:
>> if ( pdev->vpci && (pdev->vpci == old_vpci) )
>> {
>>       /* vpci structure is valid and can be used. */
>> }
>> else
>> {
>>       /* vpci has gone, return an error. */
>> }
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>> - readers and writers are NOT serialized
>> - NO need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> Cons:
>> - ???
>>
>> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
>> ==============================================================
>> In order to solve overlap ABBA, we introduce a per-domain helper
>> lock to protect the overlapping code in modify_bars:
>>
>>       old_vpci = pdev->vpci;
>>       spin_unlock(pdev->vpci_lock);
>>       spin_lock(pdev->domain->overlap_chk_lock);
> Since you drop the pdev lock you get a window here where either vpci
> or even pdev itself could be removed under your feet, so using
> pdev->vpci_lock like you do below could dereference a stale pdev.
pdev is not protected with the pcidevs lock here anyway, so even
now it is possible to have pdev disappear in between.
We do not use pcidevs_lock in MMIO handlers...
>
>>       spin_lock(pdev->vpci_lock);
>>       if ( pdev->vpci && (pdev->vpci == old_vpci) )
>>           for_each_pdev ( pdev->domain, tmp )
>>           {
>>               if ( tmp != pdev )
>>               {
>>                   spin_lock(tmp->vpci_lock);
>>                   if ( tmp->vpci )
>>                       ...
>>               }
>>           }
>>
>> Pros:
>> - all accesses are independent, only the same device access is serialized
>> - no need to care about readers and writers wrt read lock upgrade issues
>>
>> Cons:
>> - helper spin lock
>>
>> 3. Move overlap detection into process pending
>> ==============================================================
>> There is Roger's patch [1] which adds a possibility for vpci_process_pending
>> to perform different tasks rather than just map/unmap. With this patch
>> extended in a way that it can hold a request queue, it is possible to delay
>> execution of the overlap code until no pdev->vpci_lock is held, but before
>> returning to a guest after vpci_{read|write} or similar.
>>
>> Pros:
>> - no need to emulate a read lock upgrade
>> - fully parallel read/write
>> - the queue in vpci_process_pending will later on be used by SR-IOV,
>>     so this is going to help the future code
>> Cons:
>> - ???
> Maybe? It's hard to devise what that would end up looking like, and
> whether it won't still require that kind of double locking. We would
> still need to prevent doing a rangeset_remove_range for the device we
> are trying to set up the mapping for, at which point we still need to
> lock the current device plus the device we are iterating against?
>
> Since the code in vpci_process_pending is always executed in guest
> vCPU context, requiring all guest vCPUs to be paused when doing a
> device addition or removal would prevent devices from going away, but
> we could still have issues with concurrent accesses from other vCPUs.
Yes, I understand that this may not be easily done, but this is still
an option.
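If we do explore this direction, a rough sketch of how the request queue
might look (struct vpci_task, vpci_defer() and the per-vCPU list are purely
hypothetical and not part of Roger's patch [1]):

/* Hypothetical deferred-work entry, drained by vpci_process_pending(). */
struct vpci_task {
    struct list_head node;
    struct pci_dev *pdev;
    void (*fn)(struct pci_dev *pdev);
};

/* Queue work to be run later, once no pdev->vpci_lock is held. */
static int vpci_defer(struct vcpu *v, struct pci_dev *pdev,
                      void (*fn)(struct pci_dev *pdev))
{
    struct vpci_task *task = xzalloc(struct vpci_task);

    if ( !task )
        return -ENOMEM;

    task->pdev = pdev;
    task->fn = fn;
    /* v->vpci_task_list would be a new, hypothetical per-vCPU field. */
    list_add_tail(&task->node, &v->vpci_task_list);

    return 0;
}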
>
>> 4. Re-write overlap detection code
>> ==============================================================
>> It is possible to re-write the overlap detection code, so the information
>> about the mapped/unmapped regions is not read from vpci->header->bars[i] of
>> each device, but instead there is a per-domain structure which holds the
>> regions and implements reference counting.
>>
>> Pros:
>> - solves ABBA
>>
>> Cons:
>> - very complex code is expected
>>
>> 5. You name it
>> ==============================================================
>>
>>  From all the above I would recommend we go with option 2, which seems to
>> reliably solve ABBA and does not bring the cons of the other approaches.
> 6. per-domain rwlock + per-device vpci lock
>
> Introduce a vpci_header_write_lock(start, {end, size}) helper: return
> whether a range requires the per-domain lock in write mode. This will
> only return true if the range overlaps with the ROM BAR or the command
> register.
>
> In vpci_{read,write}:
>
> if ( vpci_header_write_lock(...) )
>      /* Gain exclusive access to all of the domain pdevs vpci. */
>      write_lock(d->vpci);
> else
> {
>      read_lock(d->vpci);
>      spin_lock(vpci->lock);
> }
> ...
>
> The vpci assign/deassign functions would need to be modified to write
> lock the per-domain rwlock. The MSI-X table MMIO handler will also
> need to read lock the per-domain vpci lock.
Ok, so it seems you are in favor of this implementation, and I have
no objection either. The only limitation we should be aware of is
that once a path has acquired the read lock, it is not possible to do
any write-path operations in there.
vpci_process_pending will acquire the write lock, though, as it can
lead to vpci_remove_device on its error path.

So, I am going to implement pdev->vpci->lock + d->vpci_lock.
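As a starting point, the lock acquisition in vpci_{read,write} could look
roughly like the below. This is only a sketch under assumptions: d->vpci_lock
is the new per-domain rwlock, vpci_header_write_lock() is the helper Roger
suggested above, and the vpci_lock_for_access()/vpci_unlock_for_access()
names are made up for illustration:

/*
 * Sketch only: take the per-domain lock in write mode for command
 * register / ROM BAR writes, otherwise take it in read mode together
 * with the existing per-device vpci->lock.
 */
static void vpci_lock_for_access(struct domain *d, struct vpci *vpci,
                                 unsigned int reg, unsigned int size,
                                 bool *write_locked)
{
    if ( vpci_header_write_lock(reg, size) )
    {
        /* Gain exclusive access to all of the domain's pdevs' vpci. */
        write_lock(&d->vpci_lock);
        *write_locked = true;
    }
    else
    {
        read_lock(&d->vpci_lock);
        spin_lock(&vpci->lock);
        *write_locked = false;
    }
}

static void vpci_unlock_for_access(struct domain *d, struct vpci *vpci,
                                   bool write_locked)
{
    if ( write_locked )
        write_unlock(&d->vpci_lock);
    else
    {
        spin_unlock(&vpci->lock);
        read_unlock(&d->vpci_lock);
    }
}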
>
> I think it's either something along the lines of my suggestion above,
> or maybe option 3, albeit you would have to investigate how to
> implement option 3.
>
> Thanks, Roger.

@Roger, @Jan!
Thank you!!

 

