This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast

To: Pasi Kärkkäinen <pasik@xxxxxx>
Subject: Re: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast
From: Andreas Kinzler <ml-xen-devel@xxxxxx>
Date: Wed, 29 Sep 2010 20:08:19 +0200
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, JBeulich@xxxxxxxxxx
Delivery-date: Wed, 29 Sep 2010 11:09:10 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20100921115604.GP2804@xxxxxxxxxxx>
References: <4C88A6F3.9020207@xxxxxx> <20100921115604.GP2804@xxxxxxxxxxx>
On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
I have been corresponding with Jan by email for a while to track down
the following problem, and he suggested that I report it on xen-devel:

Jul  9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
Jul  9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
Jul  9 01:49:10 virt kernel: Calling adapter init
Jul  9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
Jul  9 01:49:49 virt kernel: Acquiring adapter information
Jul  9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
Jul  9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
Jul  9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing
Jul  9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
Jul  9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)

After the VMs have been running for a while, the aacraid driver reports
a non-responding RAID controller. Most of the time the NIC also stops
working.
I tried nearly every combination of dom0 kernel (pvops0, xenified SUSE
2.6.31.x, xenified SUSE 2.6.32.x, xenified SUSE 2.6.34.x) with Xen
hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
No success in two months. Sooner or later, every combination showed the
problem above. I did extensive tests to make sure that the hardware is
OK. And it is - I am sure it is a Xen/dom0 problem.

Jan suggested trying the fix in c/s 22051, but it did not help. My answer
to him:

In the meantime I tried xen-unstable c/s 22068 (which contains staging
c/s 22051), and it did not fix the problem at all. I was able to fix a
problem with the serial console, so I got some debug info, which is
attached to this email. The following line looks suspicious to me
(irr=1, delivery_status=1):

(XEN)     IRQ 16 Vec216:
(XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1, delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1

IRQ 16 is the aacraid controller, which after a while seems to be
unable to receive interrupts. Can you see from the debug info what is
going on?
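
For readers who want to scan a longer dump for entries like the one above, here is a minimal Python sketch. It assumes the dump uses the same key=value layout as the quoted line. The helper names (parse_ioapic_entry, looks_stuck) are hypothetical, and the "stuck" test is only a heuristic based on the fields flagged as suspicious here (level-triggered, unmasked, irr=1, delivery_status=1), not anything taken from the Xen source.

```python
import re

# Matches "key=value" and "key:value" fields (e.g. "irr=1", "dest_id:1").
# "Pin 16: vector" is not matched because a space follows the colon.
FIELD_RE = re.compile(r"(\w+)[=:](\w+)")


def parse_ioapic_entry(text):
    """Return the fields of one IO-APIC pin entry as a dict of strings."""
    # Drop the "(XEN)" console prefix and join lines wrapped by the mailer.
    joined = " ".join(
        line.replace("(XEN)", "").strip() for line in text.splitlines()
    )
    return dict(FIELD_RE.findall(joined))


def looks_stuck(entry):
    """Heuristic only (an assumption): a level-triggered, unmasked pin with
    both Remote IRR and delivery status set, as in the IRQ 16 entry above."""
    return (
        entry.get("trigger") == "level"
        and entry.get("mask") == "0"
        and entry.get("irr") == "1"
        and entry.get("delivery_status") == "1"
    )


# The entry quoted in this email, including the original line wrapping.
sample = """(XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1,
             delivery_status=1, polarity=1, irr=1, trigger=level,
mask=0, dest_id:1"""

entry = parse_ioapic_entry(sample)
if looks_stuck(entry):
    print("suspicious entry, vector", entry["vector"])
```

A single snapshot with these bits set can also be a normal in-flight interrupt, so the check is only meaningful if the same entry shows up unchanged across repeated dumps.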

I also applied a small patch which disables HPET broadcast. The machine
has now been running for 110 hours without a crash, while normally it
crashes within a few minutes. Is there something wrong (race, deadlock)
with HPET broadcasts in relation to blocked interrupt reception (see
above)?
What kind of hardware does this happen on?

It is a Supermicro X8SIL-F, Intel Xeon 3450 system.

Should this patch be merged?

Not easy to answer. I spent more than 10 weeks searching nearly full time for the cause of the stability issues. Finally I was able to track it down to the HPET broadcast code.

We need to find the developer of the HPET broadcast code, who should then try to fix it. I consider it quite a severe bug, as it renders Xen nearly useless on affected systems. That is why I (and my boss, who pays me) have spent so much time (developing/fixing Xen is not really my core job) and money (buying an E5620 machine just for testing Xen).

I think many people on affected systems are having problems. See http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html

Regards Andreas
