[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device


this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the 
situation captured in RHBZ#688673, in a nutshell:

- let's say we have an Intel 82576 card (igb),
- the card exports virtual functions (igbvf),
- one virtual function is passed through to a PV guest,
- the igbvf driver, co-operating with pcifront, pciback, and the hypervisor, 
sets up three MSI-X vectors for the virtual function PCI device,
- when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because 
nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors() 
during shutdown,
- therefore configuration of the VF during the next boot of such a guest 
doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are 
already allocated/mapped for the device)

  dom0 (msix_capability_init(), output beautified a bit):
    msix entry 0 for dev 01:10:0 are not freed before acquire again.
    msix entry 1 for dev 01:10:0 are not freed before acquire again.
    msix entry 2 for dev 01:10:0 are not freed before acquire again.

      Determining IP information for eth1...
      Failed to obtain physical IRQ 255
      Failed to obtain physical IRQ 254
      Failed to obtain physical IRQ 253

- if the device or the full igbvf module is removed before shutdown in the 
guest (in case the boot was successful to begin with!), then pci_disable_msix() 
is called and everything works fine; the next boot / ifup succeeds.

I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is 
not an abrupt termination (destroy), but a contolled shutdown. Here's what I 
suspect happens (in RHEL-5) when device_shutdown() walks the list of devices in 
the PV domU and calls the appropriate shutdown hook for igbvf:

domU: igbvf_shutdown()
                    dev, PCI_COMMAND,
                    orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
dom0:                   pciback_do_op()
                              command_write() [via PCI_COMMAND funcptr]
                                    dev->msix_enabled = 0;

The final assignment above precludes c/s 1070 from doing the job.

Of course igbvf_shutdown() could invoke igbvf_remove() instead (or another 
lower-level function that ends up in dom0's pci_disable_msix()), but it should 
not depend on the guest playing nice.

Below's my proposed patch for RHEL-5, ported to upstream. 

introduce msi_prune_pci_irq_vectors()

extend pciback_reset_device() with the following functionality 
- remove all (MSI-X vector, entry number) pairs that might still belong, in 
dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might 
otherwise go leaked

The function, modeled after msi_remove_pci_irq_vectors(), doesn't touch the 
hypervisor, nor the PCI device; it only trims the list used by dom0 for 
filtering. Justification being: when this is called, either the owner domain is 
gone already (or very close to being gone), or the device was already correctly 

Now I think one comment in the patch below does not hold for upstream: 
"msi_dev_head is only maintained in dom0". According to c/s 680, this doesn't 
seem to be the case for upstream.

Is the case described above possible at all in upstream? Do you think the fix I 
propose is correct? It seemed to do the job with RHEL-5 in my testing.

    PCI: 0000:01:10.0: cleaning up MSI-X entry 0
    PCI: 0000:01:10.0: cleaning up MSI-X entry 1
    PCI: 0000:01:10.0: cleaning up MSI-X entry 2
    PCI: 0000:01:10.0: cleaning up mask_base

Just in case:

 drivers/pci/msi-xen.c             |   46 ++++++++++++++++++++++++++++++++++++++
 drivers/xen/pciback/pciback_ops.c |    5 ++++
 include/linux/pci.h               |    1 
 3 files changed, 52 insertions(+)

Signed-off-by: Laszlo Ersek<lersek@xxxxxxxxxx>

Thank you very much!

diff -r 876a5aaac026 drivers/pci/msi-xen.c
--- a/drivers/pci/msi-xen.c     Thu May 26 12:33:41 2011 +0100
+++ b/drivers/pci/msi-xen.c     Tue May 31 19:17:26 2011 +0200
@@ -894,6 +894,51 @@ void msi_remove_pci_irq_vectors(struct p
        dev->irq = msi_dev_entry->default_irq;
+void msi_prune_pci_irq_vectors(struct pci_dev *dev)
+       unsigned long flags;
+       struct msi_dev_list *msi_dev_scan, *msi_dev_entry = NULL;
+       struct msi_pirq_entry *pirq_entry, *tmp;
+       if (!pci_msi_enable || !dev)
+               return;
+       /* msi_dev_head is only maintained in dom0 */
+       BUG_ON(!is_initial_xendomain());
+       /* search for the set of MSI-X vectors, don't extend list */
+       spin_lock_irqsave(&msi_dev_lock, flags);
+       list_for_each_entry(msi_dev_scan, &msi_dev_head, list)
+               if (msi_dev_scan->dev == dev) {
+                       msi_dev_entry = msi_dev_scan;
+                       break;
+               }
+       spin_unlock_irqrestore(&msi_dev_lock, flags);
+       if (msi_dev_entry == NULL)
+               return;
+       /* Empty the set. No PHYSDEVOP_unmap_pirq hypercalls: the domU is
+        * already gone, or the device is already unplugged.
+        */
+       spin_lock_irqsave(&msi_dev_entry->pirq_list_lock, flags);
+       if (!list_empty(&msi_dev_entry->pirq_list_head))
+               list_for_each_entry_safe(pirq_entry, tmp,
+                                        &msi_dev_entry->pirq_list_head, list) {
+                       printk(KERN_INFO "PCI: %s: cleaning up MSI-X entry "
+                              "%d\n", pci_name(dev), pirq_entry->entry_nr);
+                       list_del(&pirq_entry->list);
+                       kfree(pirq_entry);
+               }
+       spin_unlock_irqrestore(&msi_dev_entry->pirq_list_lock, flags);
+       if (msi_dev_entry->mask_base != NULL) {
+               printk(KERN_INFO "PCI: %s: cleaning up mask_base\n",
+                      pci_name(dev));
+               iounmap(msi_dev_entry->mask_base);
+               msi_dev_entry->mask_base = NULL;
+       }
 void pci_no_msi(void)
        pci_msi_enable = 0;
@@ -906,5 +951,6 @@ EXPORT_SYMBOL(pci_disable_msix);
 #ifdef CONFIG_XEN
diff -r 876a5aaac026 drivers/xen/pciback/pciback_ops.c
--- a/drivers/xen/pciback/pciback_ops.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/xen/pciback/pciback_ops.c Tue May 31 19:17:26 2011 +0200
@@ -29,6 +29,11 @@ void pciback_reset_device(struct pci_dev
                if (dev->msi_enabled)
+               /* After a controlled shutdown or the crash fixup above, prune
+                * dom0's MSI-X vector list for the device.
+                */
+               msi_prune_pci_irq_vectors(dev);
diff -r 876a5aaac026 include/linux/pci.h
--- a/include/linux/pci.h       Thu May 26 12:33:41 2011 +0100
+++ b/include/linux/pci.h       Tue May 31 19:17:26 2011 +0200
@@ -640,6 +640,7 @@ extern void msi_remove_pci_irq_vectors(s
 #ifdef CONFIG_XEN
 extern int register_msi_get_owner(int (*func)(struct pci_dev *dev));
 extern int unregister_msi_get_owner(int (*func)(struct pci_dev *dev));
+extern void msi_prune_pci_irq_vectors(struct pci_dev *dev);

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.