Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
Hi Roger, Jan!

On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>> For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on an error path or
>>>> on device de-assign/remove, I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially for something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have a scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be used in
>> vpci_process_pending to flag that it is running, so that de-assign/remove
>> has to wait, spinning on that check?
> First of all, let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or a lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
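Just to make sure we are talking about the same thing, below is a sketch of
how I read the draining suggestion for de-assign. This is only an
illustration: vpci_deassign_drain() and the handler-removal helper are
made-up names, not existing code.

/*
 * Sketch only. Step 1 cuts off new requests by removing the vPCI trap
 * handlers; step 2 asks the caller to continue the hypercall while any
 * vCPU of the domain still has deferred work pending for this device.
 */
static int vpci_deassign_drain(struct domain *d, struct pci_dev *pdev)
{
    struct vcpu *v;

    vpci_remove_device_handlers(pdev); /* hypothetical helper */

    for_each_vcpu ( d, v )
        if ( v->vpci.mem && v->vpci.pdev == pdev )
            return -ERESTART; /* caller re-issues the hypercall */

    return 0;
}

The -ERESTART would then be turned into a continuation the same way
XEN_DOMCTL_assign_device already does (see item 3 below).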
Too long, but please read.

The below is some simplified analysis of what is happening with respect to
deferred mapping. First let us look at which hypercalls may run in parallel
with vpci_process_pending, and which locks they hold:

1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
  pci_add_device <- pcidevs_lock()
    vpci_add_handlers
      init_bars
        cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
        modify_bars <- if cmd & PCI_COMMAND_MEMORY
          struct rangeset *mem = rangeset_new(NULL, NULL, 0);
          if ( system_state < SYS_STATE_active ) <- Dom0 is being created
              return apply_map(pdev->domain, pdev, mem, cmd);
          defer_map(dev->domain, dev, mem, cmd, rom_only); <- Dom0 is running
            curr->vpci.pdev = pdev;
            curr->vpci.mem = mem;
  ret = iommu_add_device(pdev); <- FAIL
  if ( ret )
    vpci_remove_device
      remove vPCI register handlers
      xfree(pdev->vpci);
      pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                            scheduled and is yet to run

2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
  pci_remove_device <- pcidevs_lock()
    vpci_remove_device
      pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                            scheduled and is yet to run

3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device: <- pcidevs_lock()
    ret = assign_device(d, seg, bus, devfn, flags);
    if ( ret == -ERESTART )
        ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);

4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device)
===============================================================================
case XEN_DOMCTL_deassign_device: <- pcidevs_lock(), no hypercall_create_continuation
    ret = deassign_device(d, seg, bus, devfn);

5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
  vpci_ecam_{read|write}
    vpci_{read|write} <- NO locking yet
      pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
      spin_lock(&pdev->vpci->lock);
      cmd_write
        modify_bars
          defer_map

6. SoftIRQ processing
===============================================================================
hvm_do_resume
  vcpu_ioreq_handle_completion
    vpci_process_pending
      if ( v->vpci.mem )
        rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
        if ( rc == -ERESTART )
            return true; <- re-scheduling
        spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
        spin_unlock(&v->vpci.pdev->vpci->lock);
        v->vpci.mem = NULL;
      if ( rc ) <- rc is from rangeset_consume_ranges
        vpci_remove_device <- this is a BUG, as it is possible that
                              vpci_process_pending is running on another vCPU

So, from the above it can clearly be seen that a mapping triggered by a
PCI_COMMAND write may be happening on one vCPU in parallel with a hypercall
on another.

Which domains are eligible targets of these hypercalls:

1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device

Possible crash paths
===============================================================================
1. A failure in PHYSDEVOP_pci_device_add after defer_map may make
   vpci_process_pending crash because of pdev->vpci == NULL
2. vpci_process_pending may crash if it runs on another vCPU in parallel
   with itself, because vpci_remove_device may be called
3. vpci_mmio_{read|write} may crash after PHYSDEVOP_pci_device_remove, as
   vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
   unaffected

Synchronization is needed between:
- vpci_remove_device
- vpci_process_pending
- vpci_mmio_{read|write}

Possible locking and other work needed:
=======================================
1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used, as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,8 @@ struct domain

 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; <- atomic?
 #endif

then vpci_remove_device is a writer (cold path), and vpci_process_pending and
vpci_mmio_{read|write} are readers (hot path). A sketch of the reader side
follows right below.
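For illustration, the reader side could then look like the sketch below
(assuming the diff above is applied; the body of the function is elided):

/*
 * Sketch only: vpci_process_pending taking the proposed per-domain rwlock
 * for reading, so that vpci_remove_device (the writer) cannot free
 * pdev->vpci underneath it.
 */
bool vpci_process_pending(struct vcpu *v)
{
    bool pending = false;

    read_lock(&v->domain->vpci_rwlock);

    if ( v->vpci.mem )
    {
        /*
         * v->vpci.pdev->vpci is stable here: freeing it requires
         * vpci_rwlock to be held for writing.
         */
        pending = true;
        /* ... rangeset_consume_ranges() etc. as today ... */
    }

    read_unlock(&v->domain->vpci_rwlock);

    return pending;
}

vpci_{read|write} would take the same lock for reading around the
pdev->vpci dereference.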
do_physdev_op(PHYSDEVOP_pci_device_remove) will need
hypercall_create_continuation to be implemented, so that the removal can be
re-started if need be (a sketch of the caller side follows after item 4):

vpci_remove_device()
{
  d->vpci_terminating = true;

  remove vPCI register handlers <- this cuts off PCI_COMMAND emulation among others

  if ( !write_trylock(&d->vpci_rwlock) )
    return -ERESTART;

  xfree(pdev->vpci);
  pdev->vpci = NULL;
}

Then this d->vpci_rwlock becomes a dedicated per-domain vpci lock for other
operations which may require it, e.g. the virtual bus topology code can use
it when assigning a vSBDF etc.

4. The call to vpci_remove_device needs to be removed from
vpci_process_pending; instead the latter should do nothing for Dom0 and
crash a DomU otherwise:

if ( rc )
{
    /*
     * FIXME: in case of failure remove the device from the domain.
     * Note that there might still be leftover mappings. While this is
     * safe for Dom0, for DomUs the domain needs to be killed in order
     * to avoid leaking stale p2m mappings on failure.
     */
    if ( !is_hardware_domain(v->domain) )
        domain_crash(v->domain);
}
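Coming back to item 3 above: the caller side could create the continuation
the same way XEN_DOMCTL_assign_device does. A sketch only (the exact shape
of the switch in pci_physdev_op may differ):

/* Sketch only: re-issue the hypercall when removal returns -ERESTART. */
case PHYSDEVOP_pci_device_remove:
{
    struct physdev_pci_device dev;

    ret = -EFAULT;
    if ( copy_from_guest(&dev, arg, 1) != 0 )
        break;

    ret = pci_remove_device(dev.seg, dev.bus, dev.devfn);
    if ( ret == -ERESTART )
        ret = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                            "ih", cmd, arg);
    break;
}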
I do hope we can finally come up with some decision which I can implement...

Thank you,
Oleksandr