
Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>
  • From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Date: Tue, 8 Feb 2022 11:13:41 +0000
  • Accept-language: en-US
  • Cc: "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "paul@xxxxxxx" <paul@xxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Delivery-date: Tue, 08 Feb 2022 11:14:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v6 03/13] vpci: move lock outside of struct vpci


On 08.02.22 12:50, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:37, Jan Beulich wrote:
>>>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>>>> 1b. Make vpci_write use the write lock for writes to the command
>>>>>>>> register and BARs only; keep using the read lock for all other writes.
>>>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>                      uint32_t data)
>>>>>>> [snip]
>>>>>>>     list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>>>>     {
>>>>>>> [snip]
>>>>>>>         if ( r->needs_write_lock )
>>>>>>>             write_lock(d->vpci_lock)
>>>>>>>         else
>>>>>>>             read_lock(d->vpci_lock)
>>>>>>> ....
>>>>>>>
>>>>>>> And provide rw as an argument to:
>>>>>>>
>>>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>>>                       vpci_write_t *write_handler, unsigned int offset,
>>>>>>>                       unsigned int size, void *data,
>>>>>>>                       --->>> bool write_path <<<-----)
>>>>>>>
>>>>>>> Is this what you mean?
>>>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>>>> in write mode.
>>>>> Yes, I started writing a reply with that. So, the summary (ROM
>>>>> position depends on header type):
>>>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>>>> {
>>>>>         read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>>>         if ( enabled )
>>>>>             write_lock(d->vpci_lock)
>>>>>         else
>>>>>             read_lock(d->vpci_lock)
>>>>> }
>>>> Hmm, yes, you can actually get away without using "size", since both
>>>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>>>> accesses get split in vpci_ecam_write().
>>> But the OS may want to read a single byte of the ROM BAR, so I think
>>> I'll need to check whether reg+size falls into the PCI_COMMAND and
>>> ROM BAR ranges
>>>> For the command register the memory- / IO-decoding-enabled check may
>>>> end up a little more complicated, as the value to be written also
>>>> matters. Maybe read the command register only for the ROM BAR write,
>>>> using the write lock uniformly for all command register writes?
>>> Sounds good for the start.
>>> Another concern is that if we go with a read_lock and then, in the
>>> underlying code, we disable memory decoding and try doing
>>> something and call the cmd_write handler for any reason, then....
>>>
>>> I mean that the check in vpci_write is something we can tolerate,
>>> but then it must be taken into account that no code in the read path
>>> is allowed to perform write-path functions. Which brings a pretty
>>> valid use-case: say in read mode we detect an unrecoverable error
>>> and need to remove the device:
>>> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>>>
>>> What do we do then? It is all going to be fragile...
>> I have tried to summarize the options we have wrt locking
>> and would love to hear from @Roger and @Jan.
>>
>> In every variant there is the task of dealing with the overlap
>> detection in modify_bars, so this is the only place as of now
>> which needs special treatment.
>>
>> Existing limitations: there is no way to upgrade a read lock to a write
>> lock, so paths which may require write lock protection need to take the
>> write lock from the very beginning. Workarounds can be applied.
>>
>> 1. Per-domain rw lock, aka d->vpci_lock
>> ==============================================================
>> Note: with a per-domain rw lock it is possible to do without introducing
>> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
>> is required.
> Er, no, I think you still need a per-device lock unless you intend to
> take the per-domain rwlock in write mode every time you modify data
> in vpci.
This is exactly the assumption stated below. I am trying to discuss
all the possible options, so this one is also listed.
>   I still think you need pdev->vpci->lock. It's possible this
> approach doesn't require moving the lock outside of the vpci struct.
>
>> This is only going to work if vpci_write always takes the write lock
>> and vpci_read takes a read lock, and no path in vpci_read is allowed to
>> perform write-path operations.
> I think that's likely too strong?
>
> You could get away with both vpci_{read,write} only taking the read
> lock and using a per-device vpci lock?
But as discussed before:
- if pdev->vpci_lock is used this still leads to ABBA
- we need to know beforehand whether to take the write lock (see the
  sketch below)
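For completeness, a minimal sketch of how such an up-front check could look,
derived only from reg/size as suggested earlier in the thread. The helper
name and the config-space read of the header type are assumptions of this
sketch, not existing code:

/*
 * Hypothetical helper: decide whether a config space access needs the
 * per-domain lock in write mode. reg + size overlap checks are used so
 * that partial (e.g. single byte) accesses are also caught.
 */
static bool vpci_header_need_write_lock(const struct pci_dev *pdev,
                                        unsigned int reg, unsigned int size)
{
    /* The ROM BAR position depends on the header type. */
    unsigned int rom = (pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f) ==
                       PCI_HEADER_TYPE_NORMAL ? PCI_ROM_ADDRESS
                                              : PCI_ROM_ADDRESS1;

    if ( reg < PCI_COMMAND + 2 && reg + size > PCI_COMMAND )
        return true;

    return reg < rom + 4 && reg + size > rom;
}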
>
> Otherwise you are likely to introduce contention in msix_write if a
> guest makes heavy use of the MSI-X entry mask bit.
>
>> vpci_process_pending uses the write lock as it has vpci_remove_device in its
>> error path.
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>>
>> Cons:
>> - all writes are serialized
>> - need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> 1.1. Semi read lock upgrade in modify_bars
>> --------------------------------------------------------------
>> In this case both vpci_read and vpci_write take a read lock and when it comes
>> to modify_bars:
>>
>> 1. read_unlock(d->vpci_lock)
>> 2. write_lock(d->vpci_lock)
>> 3. Check that pdev->vpci is still available and is the same object:
>> if ( pdev->vpci && (pdev->vpci == old_vpci) )
>> {
>>       /* vpci structure is valid and can be used. */
>> }
>> else
>> {
>>       /* vpci has gone, return an error. */
>> }
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>> - readers and writers are NOT serialized
>> - NO need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> Cons:
>> - ???
>>
>> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
>> ==============================================================
>> In order to solve overlap ABBA, we introduce a per-domain helper
>> lock to protect the overlapping code in modify_bars:
>>
>>       old_vpci = pdev->vpci;
>>       spin_unlock(pdev->vpci_lock);
>>       spin_lock(pdev->domain->overlap_chk_lock);
> Since you drop the pdev lock you get a window here where either vpci
> or even pdev itself could be removed under your feet, so using
> pdev->vpci_lock like you do below could dereference a stale pdev.
pdev is not protected with the pcidevs lock here anyway, so even
now it is possible to have pdev disappear in between.
We do not use pcidevs_lock in MMIO handlers...
>
>>       spin_lock(pdev->vpci_lock);
>>       if ( pdev->vpci && (pdev->vpci == old_vpci) )
>>           for_each_pdev ( pdev->domain, tmp )
>>           {
>>               if ( tmp != pdev )
>>               {
>>                   spin_lock(tmp->vpci_lock);
>>                   if ( tmp->vpci )
>>                       ...
>>               }
>>           }
>>
>> Pros:
>> - all accesses are independent, only the same device access is serialized
>> - no need to care about readers and writers wrt read lock upgrade issues
>>
>> Cons:
>> - helper spin lock
>>
>> 3. Move overlap detection into process pending
>> ==============================================================
>> There is Roger's patch [1] which adds a possibility for vpci_process_pending
>> to perform different tasks rather than just map/unmap. With this patch
>> extended in a way that it can hold a request queue, it is possible to delay
>> execution of the overlap code until no pdev->vpci_lock is held, but before
>> returning to a guest after vpci_{read|write} or similar.
>>
>> Pros:
>> - no need to emulate a read lock upgrade
>> - fully parallel read/write
>> - the queue in vpci_process_pending will later on be used by SR-IOV,
>>     so this is going to help the future code
>> Cons:
>> - ???
> Maybe? It's hard to devise what that would end up looking like, and
> whether it won't still require that kind of double locking. We would
> still need to prevent doing a rangeset_remove_range for the device we
> are trying to set up the mapping for, at which point we still need to
> lock the current device plus the device we are iterating against?
>
> Since the code in vpci_process_pending is always executed in guest
> vCPU context, requiring all guest vCPUs to be paused when doing a
> device addition or removal would prevent devices from going away, but
> we could still have issues with concurrent accesses from other vCPUs.
Yes, I understand that this may not be easily done, but this is still
an option.
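If we do explore this direction, a rough sketch of how the request queue
might look (struct vpci_task, vpci_defer() and the per-vCPU list are purely
hypothetical and not part of Roger's patch [1]):

/* Hypothetical deferred-work entry, drained by vpci_process_pending(). */
struct vpci_task {
    struct list_head node;
    struct pci_dev *pdev;
    void (*fn)(struct pci_dev *pdev);
};

/* Queue work to be run later, once no pdev->vpci_lock is held. */
static int vpci_defer(struct vcpu *v, struct pci_dev *pdev,
                      void (*fn)(struct pci_dev *pdev))
{
    struct vpci_task *task = xzalloc(struct vpci_task);

    if ( !task )
        return -ENOMEM;

    task->pdev = pdev;
    task->fn = fn;
    /* v->vpci_task_list would be a new, hypothetical per-vCPU field. */
    list_add_tail(&task->node, &v->vpci_task_list);

    return 0;
}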
>
>> 4. Re-write overlap detection code
>> ==============================================================
>> It is possible to re-write the overlap detection code, so the information
>> about the mapped/unmapped regions is not read from vpci->header->bars[i] of
>> each device, but instead there is a per-domain structure which holds the
>> regions and implements reference counting.
>>
>> Pros:
>> - solves ABBA
>>
>> Cons:
>> - very complex code is expected
>>
>> 5. You name it
>> ==============================================================
>>
>>  From all the above I would recommend we go with option 2, which seems to
>> reliably solve ABBA and does not bring the cons of the other approaches.
> 6. per-domain rwlock + per-device vpci lock
>
> Introduce a vpci_header_write_lock(start, {end, size}) helper: return
> whether a range requires the per-domain lock in write mode. This will
> only return true if the range overlaps with the ROM BAR or the command
> register.
>
> In vpci_{read,write}:
>
> if ( vpci_header_write_lock(...) )
>      /* Gain exclusive access to all of the domain pdevs vpci. */
>      write_lock(d->vpci);
> else
> {
>      read_lock(d->vpci);
>      spin_lock(vpci->lock);
> }
> ...
>
> The vpci assign/deassign functions would need to be modified to write
> lock the per-domain rwlock. The MSI-X table MMIO handler will also
> need to read lock the per-domain vpci lock.
Ok, so it seems you are in favor of this implementation, and I have
no objection either. The only limitation we should be aware of is
that once a path has acquired the read lock, it is not possible to do
any write-path operations in there.
vpci_process_pending will acquire the write lock, though, as it can
lead to vpci_remove_device on its error path.

So, I am going to implement pdev->vpci->lock + d->vpci_lock.
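As a starting point, the lock acquisition in vpci_{read,write} could look
roughly like the below. This is only a sketch under assumptions: d->vpci_lock
is the new per-domain rwlock, vpci_header_write_lock() is the helper Roger
suggested above, and the vpci_lock_for_access()/vpci_unlock_for_access()
names are made up for illustration:

/*
 * Sketch only: take the per-domain lock in write mode for command
 * register / ROM BAR writes, otherwise take it in read mode together
 * with the existing per-device vpci->lock.
 */
static void vpci_lock_for_access(struct domain *d, struct vpci *vpci,
                                 unsigned int reg, unsigned int size,
                                 bool *write_locked)
{
    if ( vpci_header_write_lock(reg, size) )
    {
        /* Gain exclusive access to all of the domain's pdevs' vpci. */
        write_lock(&d->vpci_lock);
        *write_locked = true;
    }
    else
    {
        read_lock(&d->vpci_lock);
        spin_lock(&vpci->lock);
        *write_locked = false;
    }
}

static void vpci_unlock_for_access(struct domain *d, struct vpci *vpci,
                                   bool write_locked)
{
    if ( write_locked )
        write_unlock(&d->vpci_lock);
    else
    {
        spin_unlock(&vpci->lock);
        read_unlock(&d->vpci_lock);
    }
}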
>
> I think it's either something along the lines of my suggestion above,
> or maybe option 3, albeit you would have to investigate how to
> implement option 3.
>
> Thanks, Roger.

@Roger, @Jan!
Thank you!!

 

