WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/

xen-devel

Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)

To: Bruce Edge <bruce.edge@xxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
Subject: Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
From: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Date: Tue, 17 Aug 2010 19:01:07 +0100
Cc: Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>

On 17/08/2010 18:28, "Bruce Edge" <bruce.edge@xxxxxxxxx> wrote:

> On Tue, Jun 29, 2010 at 1:42 AM, Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
>>>>> On 28.06.10 at 20:22, Dante Cinco <dantecinco@xxxxxxxxx> wrote:
>>> I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0
>>> and dom0 Linux 2.6.32.12 x86_64 pvops and domU Linux kernel 2.6.30.1 x86_64.
>>> I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre
>>> Channel HBA to domU. After running I/Os for several hours, both dom0 and
>>> domU hang, and the Xen console shows the interrupt binding below, where IRQ
>>> 66 shows in-flight=1 and mask set (---M). What's the best way to debug this
>>> problem?
>> 
>> There are potentially two problems here: One is that the guest may
>> fail to send the EOI notification. You would want to check whether
>> pirq_guest_eoi() got run after that last occurrence of the interrupt.
>> 
>> The more worrying part is that Xen should time out on a guest failing
>> to send the EOI notification, and ack the interrupt nevertheless.
>> Looking at the code I fail to see how the ack_APIC_irq() would get
>> sent in this case: non-maskable MSIs get this issued from
>> end_msi_irq(), but ->end doesn't get invoked from
>> irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
>> something?

I don't think that timer logic is designed to handle non-maskable MSIs, only
maskable ones. It ought to be not too hard to fix it up for non-maskable
ones too by issuing the ->end() call from the timer handler?

 -- Keir
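Keir's suggestion can be modelled in a few lines. The sketch below is a hypothetical, heavily simplified Python model, not Xen code: `IrqOps`, `GuestIrq` and `eoi_timeout` are invented names standing in for the controller callbacks (`->enable`, `->end`) and for `irq_guest_eoi_timer_fn()`. It assumes, as Jan describes, that the timeout path today only calls `enable`, and it adds an `end` call (the local APIC ack that `end_msi_irq()` would otherwise issue) for MSIs that cannot be masked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IrqOps:
    """Stand-in for the per-IRQ controller callbacks; records calls for inspection."""
    calls: List[str] = field(default_factory=list)

    def enable(self, irq: int) -> None:
        # Unmask the source. For an MSI without per-vector masking there is
        # nothing to unmask, so this alone does not retire the interrupt.
        self.calls.append(f"enable:{irq}")

    def end(self, irq: int) -> None:
        # Complete handling, e.g. ack the local APIC.
        self.calls.append(f"end:{irq}")


@dataclass
class GuestIrq:
    irq: int
    maskable: bool      # device advertises per-vector MSI masking
    ops: IrqOps
    in_flight: int = 0  # delivered to the guest, EOI not yet received


def eoi_timeout(g: GuestIrq) -> None:
    """Hypothetical EOI timer handler: force completion when the guest is late."""
    if g.in_flight == 0:
        return              # the guest sent its EOI in time; nothing to do
    g.in_flight = 0
    g.ops.enable(g.irq)     # behaviour described in the thread: only ->enable
    if not g.maskable:
        g.ops.end(g.irq)    # proposed addition: also issue ->end()
```

Without the extra branch, nothing on the timeout path sends the ack for a non-maskable MSI, which matches the `in-flight=1` state that never clears in the dumps in this thread.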

>> Otoh I can't see how this can work reliably in the first place: Since
>> there's no other way to mask such interrupts, sending an ack to the
>> LAPIC could result in an interrupt storm. Disabling MSI on the
>> affected device isn't a good option either, as we know there are
>> devices that switch to legacy IRQ mode irreversibly in that case,
>> and hence the device becomes unusable (presumably until being
>> reset). But very likely this would still be better than hanging the
>> entire box; it probably would just need a more graceful timeout.
>> 
>> Jan
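One possible shape for the "more graceful timeout" Jan mentions is an escalation policy: keep acking after an occasional missed EOI, and disable MSI on the device only once the guest has missed several in a row. This is a hypothetical sketch; neither the policy nor the threshold comes from the thread, and `EoiWatch`, `Action` and `MAX_EOI_TIMEOUTS` are invented names.

```python
from dataclasses import dataclass
from enum import Enum

MAX_EOI_TIMEOUTS = 3  # hypothetical threshold, not a Xen parameter


class Action(Enum):
    ACK = "ack"                  # ack the local APIC and keep the device running
    DISABLE_MSI = "disable_msi"  # last resort; some devices then stay in legacy IRQ mode


@dataclass
class EoiWatch:
    consecutive_timeouts: int = 0
    msi_disabled: bool = False

    def on_guest_eoi(self) -> None:
        # A timely EOI shows the guest is servicing the device again.
        self.consecutive_timeouts = 0

    def on_eoi_timeout(self) -> Action:
        if self.msi_disabled:
            return Action.DISABLE_MSI
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts < MAX_EOI_TIMEOUTS:
            return Action.ACK
        self.msi_disabled = True
        return Action.DISABLE_MSI
```

The trade-off is the one Jan describes: acking an unmaskable source risks an interrupt storm, and disabling MSI may leave the device unusable until reset, but either outcome is preferable to hanging the whole host.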
> 
> 
> This is still happening. I have 2 identical boxes that were running a stress
> test and both hung after a few hours. They have identical hardware and
> software configs so I'll report the config for one and attach the xen dump for
> both.
> 
> dom0 info:
> 
> HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) 
> 
> # cat /proc/cmdline 
> root=/dev/mapper/system-dom0_0 ro earlyprintk=xen loglevel=10 debug acpi=force console=hvc0,115200n8
> 
> # uname -a
> Linux dpm8800-09 2.6.32.16 #1 SMP Wed Aug 4 15:38:21 PDT 2010 x86_64 GNU/Linux
> 
> The domU runs an Ubuntu 10.04 kernel, 2.6.32.15+drm33.5, in HVM mode.
> 
> # xm info
> host                   : dpm8800-09
> release                : 2.6.32.16
> version                : #1 SMP Wed Aug 4 15:38:21 PDT 2010
> machine                : x86_64
> nr_cpus                : 16
> nr_nodes               : 2
> cores_per_socket       : 4
> threads_per_core       : 2
> cpu_mhz                : 2533
> hw_caps                : bfebfbff:28100800:00000000:00001b40:009ce3bd:00000000:00000001:00000000
> virt_caps              : hvm hvm_directio
> total_memory           : 12277
> free_memory            : 11631
> node_to_cpu            : node0:0,2,4,6,8,10,12,14
>                          node1:1,3,5,7,9,11,13,15
> node_to_memory         : node0:5601
>                          node1:6029
> node_to_dma32_mem      : node0:3506
>                          node1:0
> max_node_id            : 1
> xen_major              : 4
> xen_minor              : 0
> xen_extra              : .1-rc4
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : unavailable
> xen_commandline        : dom0_mem=512M dom0_max_vcpus=1 dom0_vcpus_pin=true iommu=1,passthrough,no-intremap loglvl=all loglvl_guest=all loglevl=10 debug apic=on apic_verbosity=verbose extra_guest_irqs=80 com1=115200,8n1 console=com1 console_to_ring xen-pciback.permissive acpi=force numa=on
> cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) 
> cc_compile_by          : bedge
> cc_compile_domain      : lsi.com
> cc_compile_date        : Sun Aug  1 09:44:29 PDT 2010
> xend_config_format     : 4
> 
> This device (as well as a few more of these) is passed through via pciback:
> 
> dpm8800-09:~# lspci | grep 10:
> 10:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.1 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.2 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08) <- in both cases it's this device that loses the interrupt in flight
> 
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
>         Flags: bus master, fast devsel, latency 0, IRQ 5
>         I/O ports at a800 [size=256]
>         I/O ports at ac00 [size=256]
>         Memory at fbdc0000 (64-bit, non-prefetchable) [size=32K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
>         Capabilities: [70] Express Endpoint, MSI 01
>         Capabilities: [b0] MSI-X: Enable- Mask- TabSize=9
>         Capabilities: [100] Advanced Error Reporting <?>
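The `Mask-` flag on the MSI capability line above is what makes this a non-maskable MSI in Jan's sense: the function does not implement per-vector masking. The sketch below decodes the 16-bit MSI Message Control register into the fields lspci prints; bit positions follow the PCI Local Bus specification, with macro names borrowed from Linux's `pci_regs.h`. `MsiControl` and `decode_msi_control` are hypothetical helpers, and the example value `0x0082` is reconstructed from the printed flags (assuming `Queue=x/y` is the log2 Multiple Message Enable/Capable pair), not read from the device.

```python
from typing import NamedTuple

# MSI Message Control register bits (PCI Local Bus spec; names from Linux pci_regs.h)
PCI_MSI_FLAGS_ENABLE = 0x0001   # MSI enable
PCI_MSI_FLAGS_QMASK = 0x000E    # Multiple Message Capable, log2 encoded
PCI_MSI_FLAGS_QSIZE = 0x0070    # Multiple Message Enable, log2 encoded
PCI_MSI_FLAGS_64BIT = 0x0080    # 64-bit message address capable
PCI_MSI_FLAGS_MASKBIT = 0x0100  # per-vector masking capable


class MsiControl(NamedTuple):
    enabled: bool
    is_64bit: bool
    per_vector_mask: bool  # False corresponds to lspci's "Mask-"
    log2_enabled: int      # first number of lspci's "Queue=x/y"
    log2_capable: int      # second number of lspci's "Queue=x/y"


def decode_msi_control(ctrl: int) -> MsiControl:
    return MsiControl(
        enabled=bool(ctrl & PCI_MSI_FLAGS_ENABLE),
        is_64bit=bool(ctrl & PCI_MSI_FLAGS_64BIT),
        per_vector_mask=bool(ctrl & PCI_MSI_FLAGS_MASKBIT),
        log2_enabled=(ctrl & PCI_MSI_FLAGS_QSIZE) >> 4,
        log2_capable=(ctrl & PCI_MSI_FLAGS_QMASK) >> 1,
    )
```

Decoding `0x0082` gives `enabled=False, is_64bit=True, per_vector_mask=False, log2_enabled=0, log2_capable=1`, i.e. the `Mask- 64bit+ Queue=0/1 Enable-` line shown for 10:00.3.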
> 
> 
> From host dpm8800-10:
> (XEN)    IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:94 type=PCI-MSI         status=00000050 in-flight=0 domain-list=2:126(----),
> (XEN)    IRQ: 134 affinity:00000000,00000000,00000000,00000001 vec:d4 type=PCI-MSI         status=00000050 in-flight=1 domain-list=2:125(---M),
> (XEN)    IRQ: 135 affinity:00000000,00000000,00000000,00000004 vec:9c type=PCI-MSI         status=00000010 in-flight=0 domain-list=2:124(----),
>  
> From host dpm8800-09:
> (XEN)    IRQ: 131 affinity:00000000,00000000,00000000,00002000 vec:7f type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 62(----),
> (XEN)    IRQ: 132 affinity:00000000,00000000,00000000,00000001 vec:dd type=PCI-MSI         status=00000010 in-flight=1 domain-list=2:127(---M),
> (XEN)    IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:3e type=PCI-MSI         status=00000010 in-flight=0 domain-list=2:126(----),
>  
> This time both cases correspond to 10:00.3:
>  
> (XEN) 10:00.3 - dom 2   - MSIs < 132 >
>  
> (XEN)  MSI   132 vec=dc  fixed  edge   assert phys    cpu dest=00000010 mask=0/0/-1
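When several of these dumps have to be compared, the stuck binding can be picked out mechanically. The sketch below assumes each `(XEN) IRQ:` entry sits on one line (the mail client wrapped them in the original post) and that the device-to-MSI lines have the form shown above. `find_stuck_irqs` is a hypothetical helper; it reports the per-guest flag string verbatim rather than interpreting it.

```python
import re
from typing import Dict, List, Tuple

IRQ_LINE = re.compile(
    r"IRQ:\s*(\d+)\s.*?vec:([0-9a-f]+).*?in-flight=(\d+)\s+domain-list=(.*)$"
)
# A domain-list entry looks like "2:127(---M)": domain id, guest pirq, flags.
BINDING = re.compile(r"(\d+):\s*(\d+)\(([^)]*)\)")
DEVICE_LINE = re.compile(
    r"([0-9a-f]{2}:[0-9a-f]{2}\.\d)\s+-\s+dom\s+(\d+)\s+-\s+MSIs\s+<([^>]*)>"
)

Stuck = Tuple[int, str, str, List[Tuple[int, int, str]]]


def find_stuck_irqs(dump: str) -> List[Stuck]:
    """Return (irq, vector, device, bindings) for every IRQ with in-flight > 0."""
    irq_to_dev: Dict[int, str] = {}
    for m in DEVICE_LINE.finditer(dump):
        for irq in m.group(3).split():
            irq_to_dev[int(irq)] = m.group(1)

    stuck: List[Stuck] = []
    for line in dump.splitlines():
        m = IRQ_LINE.search(line)
        if not m or int(m.group(3)) == 0:
            continue
        irq = int(m.group(1))
        bindings = [(int(d), int(p), f) for d, p, f in BINDING.findall(m.group(4))]
        stuck.append((irq, m.group(2), irq_to_dev.get(irq, "?"), bindings))
    return stuck
```

Run against the dpm8800-09 lines above, it reports IRQ 132 (vector `dd`) on device 10:00.3, bound to pirq 127 of domain 2 with flags `---M`.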
> 
> 
> Let me know if there's anything else I can provide to assist in diagnosing
> this problem.
> 
> Thanks
> 
> -Bruce
> 
>> 
>>> (XEN)    IRQ:  66 affinity:00000000,00000000,00000000,00000001 vec:b9 type=PCI-MSI         status=00000010 in-flight=1 domain-list=1: 79(---M),
>>> (XEN)    IRQ:  67 affinity:00000000,00000000,00000000,00000004 vec:d9 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 78(----),
>>> (XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000010 vec:22 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 77(----),
>>> (XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000040 vec:2a type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 76(----),
>>> 
>>> (XEN) 07:00.3 - dom 1   - MSIs < 69 >
>>> (XEN) 07:00.2 - dom 1   - MSIs < 68 >
>>> (XEN) 07:00.1 - dom 1   - MSIs < 67 >
>>> (XEN) 07:00.0 - dom 1   - MSIs < 66 >
>>> 
>>> (XEN)  MSI    66 vec=b9  fixed  edge   assert phys    cpu dest=00000000 mask=0/0/-1
>>> (XEN)  MSI    67 vec=d9  fixed  edge   assert phys    cpu dest=00000004 mask=0/0/-1
>>> (XEN)  MSI    68 vec=22  fixed  edge   assert phys    cpu dest=00000002 mask=0/0/-1
>>> (XEN)  MSI    69 vec=2a  fixed  edge   assert phys    cpu dest=00000006 mask=0/0/-1
>>> 
>>> Thanks.
>>> 
>>> Dante
>> 
>> 
>> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel