
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq



Tuesday, November 18, 2014, 3:49:27 AM, you wrote:

> On Mon, Nov 17, 2014 at 11:40:11PM +0100, Sander Eikelenboom wrote:
>> 
>> Monday, November 17, 2014, 9:43:47 PM, you wrote:
>> 
>> > . snip..
>> >> > # cat /proc/interrupts |grep eth
>> >> >  36:     384183          0  xen-pirq-ioapic-level  eth0
>> >> >  63:          1          0  xen-pirq-msi-x     eth1
>> >> >  64:         24     661961  xen-pirq-msi-x     eth1-rx-0
>> >> >  65:        205          0  xen-pirq-msi-x     eth1-rx-1
>> >> >  66:        162          0  xen-pirq-msi-x     eth1-tx-0
>> >> >  67:        190          0  xen-pirq-msi-x     eth1-tx-1
>> >> > Is that a similar distribution of IRQ/MSIx you end up having?
>> >> 
>> >> These are when they are still active and assigned to dom0 (and not owned
>> >> by pci-back) or in the guest ?
>> 
>> > In the guest.
>> >> 
>> >> I attached my /proc/interrupts for both dom0 and guest 16 with all guests
>> >> running (on a Xen from before the dpci changes).
>> >> With the devices passed through I only see one line with the IRQ of a
>> >> PCI soundcard passed through to a PV guest:
>> >>   22:      38959          0          0          0          0          0  xen-pirq-ioapic-level  xen-pciback[0000:03:06.0]
>> >> 
>> >> All the other devices passed through (to HVM guests) are not visible in 
>> >> /proc/interrupts of dom0.
>> 
>> > Right.
>> >> 
>> >> In the guest i do get these:
>> >>  23:         35          0          0          0  xen-pirq-ioapic-level  uhci_hcd:usb3
>> >>  40:   13440077          0          0          0  xen-pirq-ioapic-level  cx25821[1], cx25821[1]
>> 
>> > That is a bit odd. You have two 'request_irq' off this sole device, which
>> > would imply that there are _two_ devices which are using the same
>> > interrupt line.
>> 
>> > But how is that possible when your device:
>> 
>> > 0a:00.0 Multimedia video controller: Conexant Systems, Inc. Device 8210
>> >         Flags: bus master, fast devsel, latency 0, IRQ 47
>> >         Memory at fe200000 (64-bit, non-prefetchable) [size=2M]
>> >         Capabilities: [40] Express Endpoint, MSI 00
>> >         Capabilities: [80] Power Management version 3
>> >         Capabilities: [90] Vital Product Data
>> >         Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
>> >         Capabilities: [100] Advanced Error Reporting
>> >         Capabilities: [200] Virtual Channel
>> >         Kernel driver in use: pciback
>> 
>> > Has only one IRQ! What is the name of this device? Perhaps I've another
>> > one that is similar to this. Could you attach
>> 
>> Well, it's a videograbber .. which also has one port for audio (not used)
>> that registers with ALSA. I can have a look if I can disable the audio part
>> and see if it makes a difference, I don't use it anyway.

> That is OK. I have a videograbber too - but I could not reproduce
> this.

Ok, I disabled the audio part; it now says there isn't a soundcard in the guest
and the line is just:
 40:   13440077          0          0          0  xen-pirq-ioapic-level  cx25821[1]

However .. the guest still crashed tonight; this time it lasted for about an
hour (still with qemu-xen).

<BIG SNIP>
>> > Back to your crash:
>> 
>> > d16 OK-softirq 458msec ago, state:1, 52039 count, [prev:ffff83054ef283e0, next:ffff83054ef283e0] ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT  PIRQ:0
>> > d16 OK-raise   489msec ago, state:1, 52049 count, [prev:0000000000200200, next:0000000000100100] ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT  PIRQ:0
>> > d16 ERR-poison 561msec ago, state:0, 1 count, [prev:0000000000200200, next:0000000000100100] ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT  PIRQ:0
>> > d16 Z-softirq  731msec ago, state:3, 3 count, [prev:ffff83054ef283e0, next:ffff83054ef283e0] ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT  PIRQ:0
>> > domain_crash called from io.c:938
>> > Domain 16 reported crashed by domain 32767 on cpu#5:
>> 
>> > All of it points to the legacy interrupt - that is the one that starts at
>> > Xen IRQ 47 (guest IRQ 40):
>> >  io.c:550: d16: bind: m_gsi=47 g_gsi=40 dev=00.00.6 intx=0
>> > IRQ:  47 affinity:02 vec:d1 type=IO-APIC-level   status=00000030 in-flight=1 domain-list=16: +47(P-M),
>> 
>> > which looks OK.
>> OK, I still don't get why the output of debug-key 'i' reports +47 as pirq
>> here instead of the guest value (the g_gsi for this legacy interrupt, which
>> is 40 ?), like it does when it's an MSI with the PIRQ ?

> The GSIs (m_gsi in here) are hard-wired - one I/O APIC can only handle
> so many of them (24 I believe). Anything above that is via MSI or
> MSI-X which do not require IO-APIC and can be any value that the OS
> wants.

> Xen does it in sequence - so after it has exhausted the GSIs then there
> are MSIs and other vectors.
>> 
>> > I am puzzled by the driver binding twice to the same interrupt, but
>> > perhaps that is just a buggy driver.
>> 
>> Doesn't that happen more often like with integrated USB controllers ?
>>   17:          4          0          0          0          0          0  xen-pirq-ioapic-level  ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
>>   18:       4385          0          0          0          0          0  xen-pirq-ioapic-level  ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7

> That was my thinking too. I passed in all my USB devices that looked
> like that to my guest, but instead of making them be on the same
> IRQ line - QEMU put them on separate IRQs!
 
> And even with that I couldn't reproduce this crash.
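
For anyone who wants to check their own guests for shared legacy lines like the ones discussed above, here is a minimal sketch that lists every IRQ in /proc/interrupts-style text with more than one handler attached. It is only an illustration: the function names are made up for this example, and it assumes the chip/type column is a single token (as in the xen-pirq-* lines quoted in this thread), which is not true for every kernel.

```python
import re


def parse_interrupt_line(line):
    """Parse one /proc/interrupts line into (irq, counts, chip, handlers).

    Returns None for lines that do not start with a numeric IRQ
    (the CPU header row, NMI/LOC rows, and so on).
    Assumption: the chip/type column is one whitespace-free token.
    """
    m = re.match(r"\s*(\d+):\s+(.*)$", line)
    if m is None:
        return None
    irq = int(m.group(1))
    fields = m.group(2).split()
    counts = []
    # The leading all-digit fields are the per-CPU counters.
    while fields and fields[0].isdigit():
        counts.append(int(fields.pop(0)))
    if not fields:
        return None
    chip = fields[0]
    # Everything after the chip name is a comma-separated handler list.
    rest = " ".join(fields[1:])
    handlers = [h.strip() for h in rest.split(",") if h.strip()]
    return irq, counts, chip, handlers


def shared_irqs(text):
    """Return {irq: [handler, ...]} for IRQs with more than one handler."""
    shared = {}
    for line in text.splitlines():
        parsed = parse_interrupt_line(line)
        if parsed is not None and len(parsed[3]) > 1:
            shared[parsed[0]] = parsed[3]
    return shared


# Lines adapted (unwrapped) from the output quoted in this thread.
SAMPLE = """\
 17:          4          0  xen-pirq-ioapic-level  ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 40:   13440077          0  xen-pirq-ioapic-level  cx25821[1], cx25821[1]
 64:         24     661961  xen-pirq-msi-x     eth1-rx-0
"""

if __name__ == "__main__":
    for irq, handlers in shared_irqs(SAMPLE).items():
        print(irq, handlers)
```

On a live guest one would pass the contents of /proc/interrupts instead of SAMPLE. MSI/MSI-X lines normally show a single handler, so only the legacy IO-APIC lines are expected to turn up here.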
Hmm I am now testing with qemu-xen-traditional, i just noticed the output at 
guest start is different between qemu-xen-traditional and qemu-xen:

qemu-xen-traditional gives:
(XEN) [2014-11-18 08:46:33.409] io.c:550: d16: bind: m_gsi=87 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2014-11-18 08:46:33.798] AMD-Vi: Disable: device id = 0x800, domain = 0, paging mode = 3
(XEN) [2014-11-18 08:46:33.798] AMD-Vi: Setup I/O page table: device id = 0x800, type = 0x1, root table = 0x3fab6a000, domain = 16, paging mode = 3
(XEN) [2014-11-18 08:46:33.798] AMD-Vi: Re-assign 0000:08:00.0 from dom0 to dom16
(XEN) [2014-11-18 08:46:34.917] io.c:550: d16: bind: m_gsi=86 g_gsi=40 dev=00.00.6 intx=0
(XEN) [2014-11-18 08:46:34.923] AMD-Vi: Disable: device id = 0xa00, domain = 0, paging mode = 3
(XEN) [2014-11-18 08:46:34.923] AMD-Vi: Setup I/O page table: device id = 0xa00, type = 0x1, root table = 0x3fab6a000, domain = 16, paging mode = 3
(XEN) [2014-11-18 08:46:34.923] AMD-Vi: Re-assign 0000:0a:00.0 from dom0 to dom16
and when the guest is booting it gives:
(XEN) [2014-11-18 08:47:02.128] io.c:584: d16: unbind: m_gsi=87 g_gsi=36 dev=00:00.5 intx=0
(XEN) [2014-11-18 08:47:02.128] io.c:684: d16 final unmap: m_irq=87 dev=00:00.5 intx=0
(XEN) [2014-11-18 08:47:02.128] io.c:550: d16: bind: m_gsi=37 g_gsi=16 dev=00.00.0 intx=0

with qemu-xen it only gives the first part:
(XEN) [2014-11-18 10:51:18.481] io.c:550: d16: bind: m_gsi=37 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2014-11-18 10:51:18.889] AMD-Vi: Disable: device id = 0x800, domain = 0, paging mode = 3
(XEN) [2014-11-18 10:51:18.889] AMD-Vi: Setup I/O page table: device id = 0x800, type = 0x1, root table = 0x5071a6000, domain = 16, paging mode = 3
(XEN) [2014-11-18 10:51:18.889] AMD-Vi: Re-assign 0000:08:00.0 from dom0 to dom16
(XEN) [2014-11-18 10:51:20.016] io.c:550: d16: bind: m_gsi=47 g_gsi=40 dev=00.00.6 intx=0
(XEN) [2014-11-18 10:51:20.022] AMD-Vi: Disable: device id = 0xa00, domain = 0, paging mode = 3
(XEN) [2014-11-18 10:51:20.022] AMD-Vi: Setup I/O page table: device id = 0xa00, type = 0x1, root table = 0x5071a6000, domain = 16, paging mode = 3
(XEN) [2014-11-18 10:51:20.022] AMD-Vi: Re-assign 0000:0a:00.0 from dom0 to dom16

Looking at the m_gsi numbers .. could it be "pci_msitranslate=1" is not working 
for qemu-xen and that this causes this difference in output ?
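
To make the m_gsi difference between the two device models easier to see, the "bind:" lines can be pulled out of the hypervisor log and compared per guest GSI. The following is only a rough sketch: the regular expression is written against the io.c:550 debug lines pasted above (which come from the debug build used in this thread, not necessarily from stock Xen), and the helper names are invented for this example.

```python
import re

# Matches e.g. "d16: bind: m_gsi=87 g_gsi=36" from the debug output above.
BIND_RE = re.compile(r"d(\d+): bind: m_gsi=(\d+) g_gsi=(\d+)")


def parse_binds(log_text):
    """Return {(domid, g_gsi): m_gsi} for every 'bind:' line in the log."""
    binds = {}
    for m in BIND_RE.finditer(log_text):
        domid, m_gsi, g_gsi = (int(g) for g in m.groups())
        # If the same (domid, g_gsi) is bound again, the later line wins.
        binds[(domid, g_gsi)] = m_gsi
    return binds


def diff_binds(a, b):
    """Return {(domid, g_gsi): (m_gsi_a, m_gsi_b)} where the logs disagree.

    A side that has no bind for that key is reported as None.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# The bind lines quoted above, one log per device model.
TRADITIONAL_LOG = """\
(XEN) [2014-11-18 08:46:33.409] io.c:550: d16: bind: m_gsi=87 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2014-11-18 08:46:34.917] io.c:550: d16: bind: m_gsi=86 g_gsi=40 dev=00.00.6 intx=0
(XEN) [2014-11-18 08:47:02.128] io.c:550: d16: bind: m_gsi=37 g_gsi=16 dev=00.00.0 intx=0
"""

QEMU_XEN_LOG = """\
(XEN) [2014-11-18 10:51:18.481] io.c:550: d16: bind: m_gsi=37 g_gsi=36 dev=00.00.5 intx=0
(XEN) [2014-11-18 10:51:20.016] io.c:550: d16: bind: m_gsi=47 g_gsi=40 dev=00.00.6 intx=0
"""

if __name__ == "__main__":
    delta = diff_binds(parse_binds(TRADITIONAL_LOG), parse_binds(QEMU_XEN_LOG))
    for (domid, g_gsi), (trad, upstream) in delta.items():
        print(f"d{domid} g_gsi={g_gsi}: traditional m_gsi={trad}, qemu-xen m_gsi={upstream}")
```

This only shows where the two logs differ; it does not say anything about why, so the pci_msitranslate question stays open.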


Another strange thing I noticed with qemu-xen-traditional: after a while the
IRQ count in /proc/interrupts is "stuck" .. it doesn't increase anymore:
 40:      10851          0          0          0  xen-pirq-ioapic-level  cx25821[1]
However, the device still continues to grab video ...

I left it running for 2 hours, and for at least 1 hour of that the IRQ count in
/proc/interrupts did not change for the legacy IRQ 40 of the videograbber.
The other IRQ counts in /proc/interrupts do keep increasing (also for the
passed-through USB device, which has MSI-X enabled).
There is no crash and no debug output or errors in xl dmesg or guest dmesg, and
the device was still working until shutdown.
This is not good for one's sanity .. :-)
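
A "stuck" counter like this is easier to pin down with a small poller than by eyeballing /proc/interrupts. Below is a minimal sketch of one way to do it; the function names, the default interval, and the number of rounds are arbitrary choices for illustration, not something taken from this thread.

```python
import time


def irq_total(text, irq):
    """Sum the per-CPU counters for `irq` in /proc/interrupts-style text.

    Returns None if no line for that IRQ number is present.
    """
    prefix = f"{irq}:"
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == prefix:
            total = 0
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the chip/type column
                total += int(field)
            return total
    return None


def stalled_intervals(totals):
    """Count the trailing sampling intervals in which the total did not change."""
    n = 0
    for prev, cur in zip(reversed(totals[:-1]), reversed(totals[1:])):
        if prev != cur:
            break
        n += 1
    return n


def watch(irq, interval=10.0, rounds=6, path="/proc/interrupts"):
    """Sample the total for `irq` `rounds` times, `interval` seconds apart."""
    totals = []
    for _ in range(rounds):
        with open(path) as f:
            totals.append(irq_total(f.read(), irq))
        time.sleep(interval)
    return totals


if __name__ == "__main__":
    samples = watch(40)
    print(samples, "unchanged intervals:", stalled_intervals(samples))
```

Run inside the guest, this prints the sampled totals for IRQ 40 and how many of the most recent intervals saw no increase. Logging the timestamps alongside would make it possible to line the stall up with xl dmesg.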

> Anyhow I was wondering if you could send (or point me to)
> your xen-syms file(s). I've also attached some extra debug code that
> should give me an idea if the crash/issue shows up in certain
> situations - when we have _two_ entries to deal with on one CPU.

> It should apply cleanly on top of the other one.

This one included your previous debug patch, so I had to revert that one;
then it applied cleanly, so no problem !

> Oh, and the xen-syms  - it can be either before this patch or
> after - it won't matter much as I will be looking at the
> assembler code.

> Also what version of GCC compiler are you using ?

# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc-4.7.real
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' 
--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs 
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr 
--program-suffix=-4.7 --enable-shared --enable-linker-build-id 
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object 
--enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic 
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu 
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

That's default Debian wheezy/stable.

> And lastly, the code also has an #ifdef DIFF_LIST - if you
> want to turn that on (just add #define DIFF_LIST 1 at the top of 
> the file) - it might stop the crash. Or not  :-(

> If it does stop the crash then I think we are looking at a
> GCC bug - in which case the xen-syms of that build (with
> the DIFF_LIST) would also be interesting!

Will give this patch with and without the #define DIFF_LIST 1 a shot with 
qemu-xen and report back.

> Thank you.




 

