>>> On 01.06.11 at 12:05, Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
> Hi,
>
> this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the
> situation captured in RHBZ#688673, in a nutshell:
>
> - let's say we have an Intel 82576 card (igb),
> - the card exports virtual functions (igbvf),
> - one virtual function is passed through to a PV guest,
> - the igbvf driver, co-operating with pcifront, pciback, and the hypervisor,
> sets up three MSI-X vectors for the virtual function PCI device,
> - when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because
> nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors()
> during shutdown,
> - therefore configuration of the VF during the next boot of such a guest
> doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are
> already allocated/mapped for the device)
>
> dom0 (msix_capability_init(), output beautified a bit):
> msix entry 0 for dev 01:10:0 are not freed before acquire again.
> msix entry 1 for dev 01:10:0 are not freed before acquire again.
> msix entry 2 for dev 01:10:0 are not freed before acquire again.
>
> guest:
> Determining IP information for eth1...
> Failed to obtain physical IRQ 255
> Failed to obtain physical IRQ 254
> Failed to obtain physical IRQ 253
>
> - if the device or the full igbvf module is removed before shutdown in the
> guest (in case the boot was successful to begin with!), then
> pci_disable_msix() is called and everything works fine; the next boot / ifup
> succeeds.
>
> I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is
> not an abrupt termination (destroy), but a contolled shutdown. Here's what I
> suspect happens (in RHEL-5) when device_shutdown() walks the list of devices
> in the PV domU and calls the appropriate shutdown hook for igbvf:
>
> domU: igbvf_shutdown()
> igbvf_suspend()
> pci_disable_device()
> pcibios_disable_device()
> pcibios_disable_resources()
> pci_write_config_word(
> dev, PCI_COMMAND,
> orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
> )
> ...
> pcifront_bus_write()
> do_pci_op(XEN_PCI_OP_conf_write)
> dom0: pciback_do_op()
> pciback_config_write()
> conf_space_write()
> command_write() [via PCI_COMMAND funcptr]
> pci_disable_device()
> disable_msi_mode()
> dev->msix_enabled = 0;
>
> The final assignment above precludes c/s 1070 from doing the job.
That seems to be a problem in the PCI subsystem, in that
disable_msi_mode() is being used from pci_disable_device() (instead
of pci_disable_msi{,x}()), and does not seem to be done in a similarly
bad way in newer kernels. So I wonder whether you shouldn't fix
pci_disable_device() instead, or alternatively move the vector
de-allocation (and then for MSI and MSI-X) into disable_msi_mode().
While the approach you take covers the guest shutdown/restart
case, what if the guest called its pci_disable_device() at runtime,
expecting to be able to call pci_enable_{device,msi,msix}() on it
again later on? Your newly added function would also be called here,
but in this situation you would have to call into the hypervisor (as
the domain is still alive), i.e. you could as well call
msi_remove_pci_irq_vectors().
Additionally, while covering the MSI-X case, I don't see how the
similar already-mapped case would be cleanly handled for MSI.
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|