
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq



Wednesday, November 19, 2014, 4:04:59 PM, you wrote:

> On Wed, Nov 19, 2014 at 12:16:44PM +0100, Sander Eikelenboom wrote:
>> 
>> Wednesday, November 19, 2014, 2:55:41 AM, you wrote:
>> 
>> > On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote:
>> >> 
>> >> Tuesday, November 18, 2014, 9:56:33 PM, you wrote:
>> >> 
>> >> >> 
>> >> >> Uhmm i thought i had these switched off (due to problems earlier and
>> >> >> then forgot about them .. however looking at the earlier reports these
>> >> >> lines were also in those reports).
>> >> >> 
>> >> >> The xen-syms and these last runs are all with a pristine xen tree
>> >> >> cloned today (staging branch), so the qemu-xen and seabios defined with
>> >> >> that were also freshly cloned and had a new default seabios config
>> >> >> (just to rule out anything stale in my tree).
>> >> >> 
>> >> >> If you don't see those messages .. perhaps your seabios and qemu trees
>> >> >> (and at least the seabios config) are not the most recent (they don't
>> >> >> get updated automatically when you just do a git pull on the main tree)?
>> >> >> 
>> >> >> In /tools/firmware/seabios-dir/.config i have:
>> >> >> CONFIG_USB=y
>> >> >> CONFIG_USB_UHCI=y
>> >> >> CONFIG_USB_OHCI=y
>> >> >> CONFIG_USB_EHCI=y
>> >> >> CONFIG_USB_XHCI=y
>> >> >> CONFIG_USB_MSC=y
>> >> >> CONFIG_USB_UAS=y
>> >> >> CONFIG_USB_HUB=y
>> >> >> CONFIG_USB_KEYBOARD=y
>> >> >> CONFIG_USB_MOUSE=y
>> >> >> 
>> >> 
>> >> > I seem to have the same thing. Perhaps it is my XHCI controller being
>> >> > wonky.
>> >> 
>> >> >> And this is all just from a:
>> >> >> - git clone git://xenbits.xen.org/xen.git -b staging
>> >> >> - make clean && ./configure && make -j6 && make -j6 install
>> >> 
>> >> > Aye. 
>> >> > .. snip..
>> >> >> >  1) test_and_[set|clear]_bit sometimes return unexpected values.
>> >> >> >     [But this might be invalid as the addition of the
>> >> >> >      ffff8303faaf25a8 might be correct - as the second dpci the
>> >> >> >      softirq is processing could be the MSI one]
>> >> >> 
>> >> >> Would there be an easy way to stress test this function separately in
>> >> >> some debugging function to see if it indeed is returning unexpected
>> >> >> values ?
>> >> 
>> >> > Sadly no. But you got me looking in the right direction when you
>> >> > mentioned 'timeout'.
>> >> >> 
>> >> >> >  2) INIT_LIST_HEAD operations on the same CPU are not honored.
>> >> >> 
>> >> >> Just curious, have you also tested the patches on AMD hardware ?
>> >> 
>> >> > Yes. To reproduce this the first thing I did was to get an AMD box.
>> >> 
>> >> >> 
>> >> >>  
>> >> >> >> When i look at the combination of (2) and (3), it seems it could be
>> >> >> >> an interaction between the two passed through devices and/or
>> >> >> >> different IRQ types.
>> >> >> 
>> >> >> > Could be - as in it is causing this issue to show up faster than
>> >> >> > expected. Or it is the one that triggers more than one dpci happening
>> >> >> > at the same time.
>> >> >> 
>> >> >> Well that didn't seem to be it (see separate amendment i mailed
>> >> >> previously)
>> >> 
>> >> > Right, the current theory I have is that the interrupts are not being
>> >> > acked within 8 milliseconds and we reset the 'state' - and at the same
>> >> > time we get an interrupt and schedule it - while we are still processing
>> >> > the same interrupt. This would explain why the 'test_and_clear_bit'
>> >> > got the wrong value.
>> >> 
>> >> > In regards to the list poison - following this thread of logic - with
>> >> > the 'state = 0' set we open the floodgates for any CPU to put the same
>> >> > 'struct hvm_pirq_dpci' on its list.
>> >> 
>> >> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so
>> >> > we also reset the 'state' for the MSI one (XHCI). Anyhow in your case:
>> >> 
>> >> > CPUX:                                   CPUY:
>> >> > pt_irq_time_out:
>> >> > state = 0;
>> >> > [out of timer code, the                 raise_softirq
>> >> >  pirq_dpci is on the dpci_list]         [adds the pirq_dpci as state == 0]
>> >> 
>> >> > dpci_softirq:                           dpci_softirq:
>> >> >         list_del
>> >> >         [entries poison]
>> >> >                                                 list_del <= BOOM
>> >> >
>> >> > Is what I believe is happening.
>> >> 
>> >> > The INTX device - once I put a load on it - does not trigger
>> >> > any pt_irq_time_out, so that would explain why I cannot hit this.
>> >> 
>> >> > But I believe your card hits these "hiccups".   
>> >> 
>> >> 
>> >> Hi Konrad,
>> >> 
>> >> I just tested your 5 patches and as a result i still got an(other) host
>> >> crash:
>> >> (complete serial log attached)
>> >> 
>> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
>> >> (XEN) [2014-11-18 21:55:41.591] CPU:    0
>> >> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
>> >> (XEN) [2014-11-18 21:55:41.591] RIP:    e008:[<ffff82d08012c7e7>]CPU:    2
>> >> (XEN) [2014-11-18 21:55:41.591] RIP:    e008:[<ffff82d08014a461>] hvm_do_IRQ_dpci+0xbd/0x13c
>> >> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006    _spin_unlock+0x1f/0x30CONTEXT: hypervisor
>> 
>> > Duh!
>> 
>> > Here is another patch on top of the five you have (attached and inline).
>> 
>> Hi Konrad,
>> 
>> Happy to report it has been running with this additional patch for 2 hours
>> now without any problems. I think you nailed it :-)

> Could you also do an 'xl debug-keys k' and send that please?

Sure:

(XEN) [2014-11-19 17:26:05.839] CPU00:
(XEN) [2014-11-19 17:26:05.839] d16 OK-softirq 1msec ago, state:1, 751216 count, [prev:ffff82d0802e7e70, next:ffff82d0802e7e70] ffff8303fab608a8 22c258
(XEN) [2014-11-19 17:26:05.839] d16 OK-raise   1msec ago, state:1, 751216 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c257
(XEN) [2014-11-19 17:26:05.839] d16 OK-raise   347977msec ago, state:1, 61 count, [prev:ffff82d080329160, next:ffff82d080329160] ffff8303fab608a8 203775
(XEN) [2014-11-19 17:26:05.839] d16 OK-reset   1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c256
(XEN) [2014-11-19 17:26:05.839] d16 OK-timeout 1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c254
(XEN) [2014-11-19 17:26:05.839] d16 OK-timeout 1msec ago, state:0, 258049 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22c255
(XEN) [2014-11-19 17:26:05.839] d16 Z-softirq  5746msec ago, state:6, 669 count, [prev:0200200200200200, next:0100100100100100] ffff8303fab608a8 22b871
(XEN) [2014-11-19 17:26:05.839] d16 Z-raise    5746msec ago, state:4, 669 count, [prev:ffff82d080329160, next:ffff82d080329160] ffff8303fab608a8 22b86f
(XEN) [2014-11-19 17:26:05.839] CPU01:
(XEN) [2014-11-19 17:26:05.839] CPU02:
(XEN) [2014-11-19 17:26:05.839] CPU03:
(XEN) [2014-11-19 17:26:05.839] CPU04:
(XEN) [2014-11-19 17:26:05.840] CPU05:


>> More than happy to test the definitive patch as well.
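
For anyone following along, the race from the CPUX/CPUY diagram quoted
earlier in this mail can be replayed in a small user-space sketch. This is
not the Xen code: fake_dpci, raise_dpci, list_del_checked and simulate_race
are hypothetical names made up for illustration, the "scheduled" bit is an
assumed stand-in for the real 'state' handling, and the interleaving is
replayed on a single thread so the outcome is deterministic. The poison
values are the ones visible in the debug-key output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Poison values as they appear in the debug-key output in this thread. */
#define POISON_NEXT ((struct list_head *)(uintptr_t)0x0100100100100100ULL)
#define POISON_PREV ((struct list_head *)(uintptr_t)0x0200200200200200ULL)

/* Hypothetical stand-in for struct hvm_pirq_dpci: a state word and a link. */
struct fake_dpci { unsigned int state; struct list_head entry; };
#define STATE_SCHED 1u  /* assumed "already scheduled" bit */

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *e, struct list_head *head)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

/* Unlink and poison. Returns false where real code would dereference
 * the poison pointers and crash. */
static bool list_del_checked(struct list_head *e)
{
    if (e->next == POISON_NEXT || e->prev == POISON_PREV)
        return false;
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = POISON_NEXT;
    e->prev = POISON_PREV;
    return true;
}

/* Queue d on a per-CPU list unless it is already marked as scheduled. */
static bool raise_dpci(struct fake_dpci *d, struct list_head *cpu_list)
{
    bool was_set = (d->state & STATE_SCHED) != 0;

    d->state |= STATE_SCHED;
    if (was_set)
        return false;
    list_add_tail(&d->entry, cpu_list);
    return true;
}

/* Replays the interleaving; returns how many list_del calls hit poison. */
int simulate_race(bool reset_state_on_timeout)
{
    struct list_head cpu_x, cpu_y;
    struct fake_dpci d = { .state = 0 };
    int poisoned_dels = 0;

    init_list_head(&cpu_x);
    init_list_head(&cpu_y);
    init_list_head(&d.entry);

    raise_dpci(&d, &cpu_x);      /* CPUX: interrupt gets scheduled */
    if (reset_state_on_timeout)
        d.state = 0;             /* CPUX: timeout fires while d is still queued */
    raise_dpci(&d, &cpu_y);      /* CPUY: queues d again only if state == 0 */

    /* Both softirq handlers pick the first entry of their own list. */
    struct list_head *x_pick = cpu_x.next;
    struct list_head *y_pick = cpu_y.next;

    if (reset_state_on_timeout)
        assert(x_pick == y_pick);   /* same entry reachable from both CPUs */

    if (x_pick != &cpu_x && !list_del_checked(x_pick))
        poisoned_dels++;
    if (y_pick != &cpu_y && !list_del_checked(y_pick))
        poisoned_dels++;
    return poisoned_dels;
}
```

Calling simulate_race(true) from a main() reports one list_del on an
already poisoned entry, which is the "BOOM" in the diagram. With
simulate_race(false) the second raise is refused because the bit is still
set, and no list_del ever sees poison.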


