|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
Monday, November 17, 2014, 5:34:16 PM, you wrote:
> On Fri, Nov 14, 2014 at 11:09:58PM +0100, Sander Eikelenboom wrote:
>>
>> Friday, November 14, 2014, 9:25:13 PM, you wrote:
>>
>> > On Fri, Nov 14, 2014 at 05:59:23PM +0100, Sander Eikelenboom wrote:
>> >>
>> >> Friday, November 14, 2014, 4:43:58 PM, you wrote:
>> >>
>> >> >>>> On 14.11.14 at 16:20, <linux@xxxxxxxxxxxxxx> wrote:
>> >> >> If it still helps i could try Andrews suggestion and try out with only
>> >> >> commit aeeea485 ..
>> >>
>> >> > Yes, even if it's pretty certain it's the second of the commits,
>> >> > verifying
>> >> > this would be helpful (or if the assumption is wrong, the pattern it's
>> >> > dying with would change and hence perhaps provide further clues).
>> >>
>> >> > Jan
>> >>
>> >>
>> >> Ok with a revert of f6dd295 .. it survived cooking and eating a nice bowl
>> >> of
>> >> pasta without a panic. So it would probably be indeed that specific
>> >> commit.
>>
>> > Could you try running with these two patches while you enjoy an beer in
>> > the evening?
>>
>> Hmm i didn't expect it not to panic and reboot anymore :-)
> I should have also asked for your to run with 'iommu=verbose,debug', but
> that can be done later..
> The guest d16 looks to have two PCI passthrough devices:
> XEN) [2014-11-14 21:31:26.569] io.c:550: d16: bind: m_gsi=37 g_gsi=36
> dev=00.00.5 intx=0
> XEN) [2014-11-14 21:31:28.095] io.c:550: d16: bind: m_gsi=47 g_gsi=40
> dev=00.00.6 intx=0
> And one of them uses just the GSI while the other uses four MSI-X, is
> that about right?
> I tried to reproduce that on my AMD box with two NICs:
> # lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton
> II] (rev 01)
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> 00:02.0 VGA compatible controller: Technical Corp. Device 1111
> 00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
> 00:04.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
> Connection (rev 01)
> 00:05.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet
> Controller (rev 05)
> # cat /proc/interrupts |grep eth
> 36: 384183 0 xen-pirq-ioapic-level eth0
> 63: 1 0 xen-pirq-msi-x eth1
> 64: 24 661961 xen-pirq-msi-x eth1-rx-0
> 65: 205 0 xen-pirq-msi-x eth1-rx-1
> 66: 162 0 xen-pirq-msi-x eth1-tx-0
> 67: 190 0 xen-pirq-msi-x eth1-tx-1
> Is that a similar distribution of IRQ/MSIx you end up having?
>>
>> However xl dmesg (complete one attached) showed it would have:
>>
>> (XEN) [2014-11-14 21:35:50.646] --MARK--
>> (XEN) [2014-11-14 21:35:56.861] grant_table.c:305:d0v0 Increased maptrack
>> size to 9 frames
>> (XEN) [2014-11-14 21:36:00.647] --MARK--
>> (XEN) [2014-11-14 21:36:10.410] grant_table.c:1299:d16v1 Expanding dom (16)
>> grant table from (5) to (6) frames.
>> (XEN) [2014-11-14 21:36:10.820] --MARK--
>> (XEN) [2014-11-14 21:36:20.820] --MARK--
>> (XEN) [2014-11-14 21:36:30.820] --MARK--
>> (XEN) [2014-11-14 21:36:40.821] --MARK--
>> (XEN) [2014-11-14 21:36:50.821] --MARK--
>> (XEN) [2014-11-14 21:37:00.388] CPU00:
>> (XEN) [2014-11-14 21:37:00.399] CPU01:
>> (XEN) [2014-11-14 21:37:00.410] d16 OK-softirq 20msec ago, state:1, 41220
>> count, [prev:ffff83054ef5e3e0, next:ffff83054ef5e3e0] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.445] d16 OK-raise 46msec ago, state:1, 41223
>> count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.481] d16 ERR-poison 92msec ago, state:0, 1 count,
>> [prev:0000000000200200, next:0000000000100100] PIRQ:0
>> (XEN) [2014-11-14 21:37:00.515] d16 Z-softirq 28853msec ago, state:2, 1
>> count, [prev:0000000000200200, next:0000000000100100] PIRQ:0
> The PIRQ:0 would imply that this is the legacy interrupt - which would be you
> 0a:00.0 device
> (Conexant Systems, Inc. Device 8210).
> And it is pounding on this CPU - and the issue is that the
> 'test_and_clear_bit' ends
> up returning 0 - which means it was not able to set STATE_SCHED:
> (!?)
> if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> {
>
> hvm_dirq_assist(d, pirq_dpci);
>
> put_domain(d);
>
> }
>
> else
>
> {
>
> _record(&debug->zombie_softirq, pirq_dpci);
> which causes us to record it [Z-softirq], which says we we are in state 2
> (1<<STATE_RUN).
> reset = 1;
>
> }
> .. eons ago (28853msec).
> Hmm. There is something fishy there but the only theory I have is that
> we end up doing 'list_del' twice on different CPUs on the same structure.
> That should not be possible, but then this check - 'test_and_clear_bit'
> returned
> 0 which means that somebody had cleared it (or it failed to clear it?)
> But the only other path for 'clearing' it is via the error paths and you are
> not hitting any of them.
> In the mean-time, could you try this patch. It adds a bit more debug to help
> me figure this out.
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 23e5ed1..443975c 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -126,17 +126,17 @@ static void dump_record(struct _debug_f *d, unsigned
> int type)
> BUG();
>
> now = NOW();
> - printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] ",
> + printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] %p",
> d->domid, names[type],
> (unsigned long)((now - d->last) / MILLISECS(1)),
- d->>state, d->count, d->list.prev, d->list.next);
+ d->>state, d->count, d->list.prev, d->list.next, d->dpci);
>
> if ( d->dpci )
> {
> struct hvm_pirq_dpci *pirq_dpci = d->dpci;
>
> for ( i = 0; i <= _HVM_IRQ_DPCI_GUEST_MSI_SHIFT; i++ )
> - if ( pirq_dpci->flags & 1 << _HVM_IRQ_DPCI_TRANSLATE_SHIFT )
> + if ( pirq_dpci->flags & (1 << i) )
> printk("%s ", names_flag[i]);
>
> printk(" PIRQ:%d", pirq_dpci->pirq);
Hi Konrad,
Here is the xl dmesg output with this patch (attached with debug-key i and M
output). What i don't get .. is that d16 and d17 each have a device passed
through
that seems to be using the same pirq 87 ?
--
Sander
(XEN) [2014-11-17 17:54:18.695] CPU00:
(XEN) [2014-11-17 17:54:18.705] CPU01:
(XEN) [2014-11-17 17:54:18.716] d16 OK-softirq 62msec ago, state:1, 2628 count,
[prev:ffff83054ef57e70, next:ffff83054ef57e70] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:18.765] d16 OK-raise 112msec ago, state:1, 2628
count, [prev:0000000000200200, next:0000000000100100] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:18.815] CPU02:
(XEN) [2014-11-17 17:54:18.825] d17 OK-softirq 500msec ago, state:1, 3439
count, [prev:ffff83054ef47e70, next:ffff83054ef47e70] ffff83051a1c8c28<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:18.875] d17 OK-raise 549msec ago, state:1, 3439
count, [prev:0000000000200200, next:0000000000100100] ffff83051a1c8c28<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:18.924] CPU03:
(XEN) [2014-11-17 17:54:18.935] d16 OK-softirq 313msec ago, state:1, 3533
count, [prev:ffff83054ef37e70, next:ffff83054ef37e70] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:18.984] d16 OK-raise 363msec ago, state:1, 3533
count, [prev:0000000000200200, next:0000000000100100] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:19.034] CPU04:
(XEN) [2014-11-17 17:54:19.044] d16 OK-softirq 359msec ago, state:1, 3691
count, [prev:ffff83054ef27e88, next:ffff83054ef27e88] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:19.094] d16 OK-raise 408msec ago, state:1, 3691
count, [prev:0000000000200200, next:0000000000100100] ffff83051b904428<NULL>
MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87
(XEN) [2014-11-17 17:54:19.143] CPU05:
(XEN) [2014-11-17 17:54:19.154] d16 OK-softirq 458msec ago, state:1, 52039
count, [prev:ffff83054ef283e0, next:ffff83054ef283e0]
ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0
(XEN) [2014-11-17 17:54:19.205] d16 OK-raise 489msec ago, state:1, 52049
count, [prev:0000000000200200, next:0000000000100100]
ffff83051b95fd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0
(XEN) [2014-11-17 17:54:19.257] d16 ERR-poison 561msec ago, state:0, 1 count,
[prev:0000000000200200, next:0000000000100100] ffff83051b95fd28MACH_PCI_SHIFT
MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0
(XEN) [2014-11-17 17:54:19.307] d16 Z-softirq 731msec ago, state:3, 3 count,
[prev:ffff83054ef283e0, next:ffff83054ef283e0] ffff83051b95fd28MACH_PCI_SHIFT
MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0
(XEN) [2014-11-17 17:54:19.356] domain_crash called from io.c:938
(XEN) [2014-11-17 17:54:19.356] Domain 16 reported crashed by domain 32767 on
cpu#5:Attachment:
xl-dmesg.txt _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |