
Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq



> 
> Uhmm, I thought I had these switched off (due to problems earlier, and then
> forgot about them .. however, looking at the earlier reports, these lines
> were also in those reports).
> 
> The xen-syms and these last runs are all with a pristine xen tree cloned
> today (staging branch), so the qemu-xen and seabios defined with that were
> also freshly cloned and had a new default seabios config (just to rule out
> anything stale in my tree).
> 
> If you don't see those messages .. perhaps your seabios and qemu trees (and
> at least the seabios config) are not the most recent (they don't get updated
> automatically when you just do a git pull on the main tree)?
> 
> In /tools/firmware/seabios-dir/.config I have:
> CONFIG_USB=y
> CONFIG_USB_UHCI=y
> CONFIG_USB_OHCI=y
> CONFIG_USB_EHCI=y
> CONFIG_USB_XHCI=y
> CONFIG_USB_MSC=y
> CONFIG_USB_UAS=y
> CONFIG_USB_HUB=y
> CONFIG_USB_KEYBOARD=y
> CONFIG_USB_MOUSE=y
> 

I seem to have the same thing. Perhaps it is my XHCI controller being wonky.

> And this is all just from a:
> - git clone git://xenbits.xen.org/xen.git -b staging
> - make clean && ./configure && make -j6 && make -j6 install

Aye. 
.. snip..
> >  1) test_and_[set|clear]_bit sometimes return unexpected values.
> >     [But this might be invalid as the addition of the ffff8303faaf25a8
> >      might be correct - as the second dpci the softirq is processing
> >      could be the MSI one]
> 
> Would there be an easy way to stress test this function separately in some 
> debugging function to see if it indeed is returning unexpected values ?

Sadly no. But you got me looking in the right direction when you mentioned
'timeout'.
> 
> >  2) INIT_LIST_HEAD operations on the same CPU are not honored.
> 
> Just curious, have you also tested the patches on AMD hardware ?

Yes. To reproduce this the first thing I did was to get an AMD box.

> 
>  
> >> When i look at the combination of (2) and (3), It seems it could be an 
> >> interaction between the two passed through devices and/or different IRQ 
> >> types.
> 
> > Could be - as in it is causing this issue to show up faster than
> > expected. Or it is the one that triggers more than one dpci happening
> > at the same time.
> 
> Well that didn't seem to be it (see separate amendment i mailed previously)

Right, the current theory I have is that the interrupts are not being
acked within 8 milliseconds, so we reset the 'state' - and at the same
time we get an interrupt and schedule it - while we are still processing
the same interrupt. This would explain why the 'test_and_clear_bit'
got the wrong value.
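
To make that theory concrete, here is a minimal single-threaded model of
the sequence. It is only a sketch under assumptions: the bit helpers are
plain non-atomic stand-ins (not the hypervisor's bitops), MODEL_SCHED_BIT
and replay_state_reset() are hypothetical names, and the ordering of the
steps is the one I suspect, not something I have traced.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SCHED_BIT 0u /* hypothetical bit index, illustration only */

/* Non-atomic stand-ins for test_and_set_bit / test_and_clear_bit. That is
 * fine here because the interleaving is replayed on a single thread. */
static bool model_test_and_set_bit(unsigned int nr, unsigned long *word)
{
    unsigned long mask = 1UL << nr;
    bool old = (*word & mask) != 0;

    *word |= mask;
    return old;
}

static bool model_test_and_clear_bit(unsigned int nr, unsigned long *word)
{
    unsigned long mask = 1UL << nr;
    bool old = (*word & mask) != 0;

    *word &= ~mask;
    return old;
}

struct state_replay {
    bool queued_twice;         /* scheduled again while still in flight */
    bool first_clear_saw_bit;  /* return value of the first clear */
    bool second_clear_saw_bit; /* return value of the second clear */
};

static struct state_replay replay_state_reset(void)
{
    unsigned long state = 0;
    struct state_replay r;

    /* 1. Interrupt arrives: the bit was clear, so it gets scheduled. */
    bool first = !model_test_and_set_bit(MODEL_SCHED_BIT, &state);
    assert(first);

    /* 2. The timeout path wipes the state while (1) is still in flight. */
    state = 0;

    /* 3. Next interrupt: the bit looks clear, so it is scheduled again. */
    bool second = !model_test_and_set_bit(MODEL_SCHED_BIT, &state);
    r.queued_twice = first && second;

    /* 4. The handler for (1) finishes and clears the bit set by (3). */
    r.first_clear_saw_bit = model_test_and_clear_bit(MODEL_SCHED_BIT, &state);

    /* 5. The handler for (3) finishes and finds the bit already clear. */
    r.second_clear_saw_bit = model_test_and_clear_bit(MODEL_SCHED_BIT, &state);
    return r;
}
```

In this model the second clear returns 0 even though the entry was
scheduled, which is the kind of "wrong value" I mean.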

With regard to the list poison - following this thread of logic - once
'state = 0' is set we open the floodgates for any CPU to put the same
'struct hvm_pirq_dpci' on its list.

We do reset the 'state' on _every_ GSI that is mapped to a guest - so
we also reset the 'state' for the MSI one (XHCI). Anyhow in your case:

CPUX:                                   CPUY:
pt_irq_time_out:
  state = 0;
  [out of timer code, the               raise_softirq
   pirq_dpci is on the dpci_list]       [adds the pirq_dpci as state == 0]

dpci_softirq:                           dpci_softirq:
  list_del
  [entries poison]
                                          list_del <= BOOM
                        
Is what I believe is happening.
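
The interleaving in the diagram can be replayed with a small list model.
This is a sketch under assumptions, not the hypervisor code: every
model_* name and replay_double_queue() are made up for illustration, the
poison values are sentinel objects rather than real poison addresses, and
model_list_del() reports the poison instead of dereferencing it. It also
assumes both softirq handlers pick their entry before either one deletes
it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Sentinels standing in for the list poison values; never dereferenced. */
static struct list_head model_poison_next, model_poison_prev;

static void model_init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void model_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Returns false where a real list_del would dereference poison. */
static bool model_list_del(struct list_head *node)
{
    if (node->next == &model_poison_next || node->prev == &model_poison_prev)
        return false;
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = &model_poison_next;
    node->prev = &model_poison_prev;
    return true;
}

struct model_dpci {
    unsigned long state;           /* nonzero: already scheduled */
    struct list_head softirq_list;
};

/* Queue the entry on a CPU's list unless 'state' says it is already queued. */
static bool model_raise_softirq(struct model_dpci *d, struct list_head *cpu_list)
{
    if (d->state)
        return false;
    d->state = 1;
    model_list_add_tail(&d->softirq_list, cpu_list);
    return true;
}

/* Returns true if the second list_del would have hit poisoned pointers. */
static bool replay_double_queue(bool timeout_fires)
{
    struct list_head cpux_list, cpuy_list;
    struct model_dpci dpci;

    model_init_list_head(&cpux_list);
    model_init_list_head(&cpuy_list);
    model_init_list_head(&dpci.softirq_list);
    dpci.state = 0;

    bool queued_x = model_raise_softirq(&dpci, &cpux_list);
    assert(queued_x);

    if (timeout_fires)
        dpci.state = 0;            /* the reset done in the timeout path */

    bool queued_y = model_raise_softirq(&dpci, &cpuy_list);

    /* Both handlers pick up their entry before either deletes it. */
    struct list_head *x_entry = cpux_list.next;
    struct list_head *y_entry = queued_y ? cpuy_list.next : NULL;

    bool x_ok = model_list_del(x_entry);   /* CPUX: list_del, then poison */
    assert(x_ok);

    bool y_ok = (y_entry == NULL) || model_list_del(y_entry);
    return !y_ok;                          /* CPUY: list_del <= BOOM */
}
```

Calling replay_double_queue(true) reports the poisoned delete, while
replay_double_queue(false) does not, because without the reset the second
raise is refused and the entry only ever sits on one list.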

The INTX device - once I put a load on it - does not trigger
any pt_irq_time_out, so that would explain why I cannot hit this.

But I believe your card hits these "hiccups".   



 

