Hi,
this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the
situation captured in RHBZ#688673, in a nutshell:
- let's say we have an Intel 82576 card (igb),
- the card exports virtual functions (igbvf),
- one virtual function is passed through to a PV guest,
- the igbvf driver, co-operating with pcifront, pciback, and the hypervisor,
sets up three MSI-X vectors for the virtual function PCI device,
- when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because
nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors()
during shutdown,
- therefore configuration of the VF during the next boot of such a guest
doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are
already allocated/mapped for the device)
dom0 (msix_capability_init(), output beautified a bit):
msix entry 0 for dev 01:10:0 are not freed before acquire again.
msix entry 1 for dev 01:10:0 are not freed before acquire again.
msix entry 2 for dev 01:10:0 are not freed before acquire again.
guest:
Determining IP information for eth1...
Failed to obtain physical IRQ 255
Failed to obtain physical IRQ 254
Failed to obtain physical IRQ 253
- if the device or the full igbvf module is removed before shutdown in the
guest (in case the boot was successful to begin with!), then pci_disable_msix()
is called and everything works fine; the next boot / ifup succeeds.
I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is
not an abrupt termination (destroy), but a contolled shutdown. Here's what I
suspect happens (in RHEL-5) when device_shutdown() walks the list of devices in
the PV domU and calls the appropriate shutdown hook for igbvf:
domU: igbvf_shutdown()
igbvf_suspend()
pci_disable_device()
pcibios_disable_device()
pcibios_disable_resources()
pci_write_config_word(
dev, PCI_COMMAND,
orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
)
...
pcifront_bus_write()
do_pci_op(XEN_PCI_OP_conf_write)
dom0: pciback_do_op()
pciback_config_write()
conf_space_write()
command_write() [via PCI_COMMAND funcptr]
pci_disable_device()
disable_msi_mode()
dev->msix_enabled = 0;
The final assignment above precludes c/s 1070 from doing the job.
Of course igbvf_shutdown() could invoke igbvf_remove() instead (or another
lower-level function that ends up in dom0's pci_disable_msix()), but it should
not depend on the guest playing nice.
Below's my proposed patch for RHEL-5, ported to upstream.
----v----
introduce msi_prune_pci_irq_vectors()
extend pciback_reset_device() with the following functionality
(msi_prune_pci_irq_vectors()):
- remove all (MSI-X vector, entry number) pairs that might still belong, in
dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might
otherwise go leaked
The function, modeled after msi_remove_pci_irq_vectors(), doesn't touch the
hypervisor, nor the PCI device; it only trims the list used by dom0 for
filtering. Justification being: when this is called, either the owner domain is
gone already (or very close to being gone), or the device was already correctly
detached.
----^----
Now I think one comment in the patch below does not hold for upstream:
"msi_dev_head is only maintained in dom0". According to c/s 680, this doesn't
seem to be the case for upstream.
Is the case described above possible at all in upstream? Do you think the fix I
propose is correct? It seemed to do the job with RHEL-5 in my testing.
dom0:
PCI: 0000:01:10.0: cleaning up MSI-X entry 0
PCI: 0000:01:10.0: cleaning up MSI-X entry 1
PCI: 0000:01:10.0: cleaning up MSI-X entry 2
PCI: 0000:01:10.0: cleaning up mask_base
Just in case:
drivers/pci/msi-xen.c | 46 ++++++++++++++++++++++++++++++++++++++
drivers/xen/pciback/pciback_ops.c | 5 ++++
include/linux/pci.h | 1
3 files changed, 52 insertions(+)
Signed-off-by: Laszlo Ersek<lersek@xxxxxxxxxx>
Thank you very much!
lacos
diff -r 876a5aaac026 drivers/pci/msi-xen.c
--- a/drivers/pci/msi-xen.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/pci/msi-xen.c Tue May 31 19:17:26 2011 +0200
@@ -894,6 +894,51 @@ void msi_remove_pci_irq_vectors(struct p
dev->irq = msi_dev_entry->default_irq;
}
+void msi_prune_pci_irq_vectors(struct pci_dev *dev)
+{
+ unsigned long flags;
+ struct msi_dev_list *msi_dev_scan, *msi_dev_entry = NULL;
+ struct msi_pirq_entry *pirq_entry, *tmp;
+
+ if (!pci_msi_enable || !dev)
+ return;
+
+ /* msi_dev_head is only maintained in dom0 */
+ BUG_ON(!is_initial_xendomain());
+
+ /* search for the set of MSI-X vectors, don't extend list */
+ spin_lock_irqsave(&msi_dev_lock, flags);
+ list_for_each_entry(msi_dev_scan, &msi_dev_head, list)
+ if (msi_dev_scan->dev == dev) {
+ msi_dev_entry = msi_dev_scan;
+ break;
+ }
+ spin_unlock_irqrestore(&msi_dev_lock, flags);
+ if (msi_dev_entry == NULL)
+ return;
+
+ /* Empty the set. No PHYSDEVOP_unmap_pirq hypercalls: the domU is
+ * already gone, or the device is already unplugged.
+ */
+ spin_lock_irqsave(&msi_dev_entry->pirq_list_lock, flags);
+ if (!list_empty(&msi_dev_entry->pirq_list_head))
+ list_for_each_entry_safe(pirq_entry, tmp,
+ &msi_dev_entry->pirq_list_head, list) {
+ printk(KERN_INFO "PCI: %s: cleaning up MSI-X entry "
+ "%d\n", pci_name(dev), pirq_entry->entry_nr);
+ list_del(&pirq_entry->list);
+ kfree(pirq_entry);
+ }
+ spin_unlock_irqrestore(&msi_dev_entry->pirq_list_lock, flags);
+
+ if (msi_dev_entry->mask_base != NULL) {
+ printk(KERN_INFO "PCI: %s: cleaning up mask_base\n",
+ pci_name(dev));
+ iounmap(msi_dev_entry->mask_base);
+ msi_dev_entry->mask_base = NULL;
+ }
+}
+
void pci_no_msi(void)
{
pci_msi_enable = 0;
@@ -906,5 +951,6 @@ EXPORT_SYMBOL(pci_disable_msix);
#ifdef CONFIG_XEN
EXPORT_SYMBOL(register_msi_get_owner);
EXPORT_SYMBOL(unregister_msi_get_owner);
+EXPORT_SYMBOL(msi_prune_pci_irq_vectors);
#endif
diff -r 876a5aaac026 drivers/xen/pciback/pciback_ops.c
--- a/drivers/xen/pciback/pciback_ops.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/xen/pciback/pciback_ops.c Tue May 31 19:17:26 2011 +0200
@@ -29,6 +29,11 @@ void pciback_reset_device(struct pci_dev
pci_disable_msix(dev);
if (dev->msi_enabled)
pci_disable_msi(dev);
+
+ /* After a controlled shutdown or the crash fixup above, prune
+ * dom0's MSI-X vector list for the device.
+ */
+ msi_prune_pci_irq_vectors(dev);
#endif
pci_disable_device(dev);
diff -r 876a5aaac026 include/linux/pci.h
--- a/include/linux/pci.h Thu May 26 12:33:41 2011 +0100
+++ b/include/linux/pci.h Tue May 31 19:17:26 2011 +0200
@@ -640,6 +640,7 @@ extern void msi_remove_pci_irq_vectors(s
#ifdef CONFIG_XEN
extern int register_msi_get_owner(int (*func)(struct pci_dev *dev));
extern int unregister_msi_get_owner(int (*func)(struct pci_dev *dev));
+extern void msi_prune_pci_irq_vectors(struct pci_dev *dev);
#endif
#endif
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|