WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
To: "Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>, Andreas Kinzler <ml-xen-devel@xxxxxx>, Pasi Kärkkäinen <pasik@xxxxxx>
Subject: RE: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast
From: "Wei, Gang" <gang.wei@xxxxxxxxx>
Date: Thu, 30 Sep 2010 14:02:34 +0800
Accept-language: zh-CN, en-US
Cc: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "JBeulich@xxxxxxxxxx" <JBeulich@xxxxxxxxxx>, "Wei, Gang" <gang.wei@xxxxxxxxx>
Delivery-date: Wed, 29 Sep 2010 23:04:35 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <BC00F5384FCFC9499AF06F92E8B78A9E1A90A388F5@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4C88A6F3.9020207@xxxxxx> <20100921115604.GP2804@xxxxxxxxxxx> <4CA38093.9070802@xxxxxx> <BC00F5384FCFC9499AF06F92E8B78A9E1A90A388F5@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActgAWv5RLewlCr+RTez+lw/SpBv2QAVrBygAAMXXPA=
Thread-topic: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast
I am the original developer of the HPET broadcast code.

First of all, no additional patch is required to disable HPET broadcast.
Simply add the option "cpuidle=off" or "max_cstate=1" to the Xen command
line in /boot/grub/grub.conf.
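For reference, a grub.conf stanza with that option might look something like the following; the title, kernel versions, and paths are purely illustrative and will differ per distribution:

```shell
# /boot/grub/grub.conf -- illustrative entry; adjust versions/paths to your system
title Xen (HPET broadcast disabled)
    root (hd0,0)
    # cpuidle=off disables the cpuidle driver entirely; alternatively,
    # max_cstate=1 keeps C1 but avoids the deep C-states whose wakeups
    # rely on HPET broadcast.
    kernel /boot/xen.gz cpuidle=off
    module /boot/vmlinuz-2.6.32-xen root=/dev/sda1 ro console=tty0
    module /boot/initrd-2.6.32-xen.img
```

Only the Xen command line (the "kernel /boot/xen.gz ..." line) matters here; dom0 kernel options stay on the "module" line.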

Second, I noticed that the issue occurs only on pre-Nehalem server
processors. I will check whether I can reproduce it.

Meanwhile, I am looking forward to seeing whether Jeremy's and Xiantao's
suggestions have any effect. So Andreas, could you give their suggestions a try?

Jimmy

Zhang, Xiantao wrote:
> Maybe you can disable pirq_set_affinity to have a try with the
> following patch. It may trigger IRQ migration in the hypervisor,
> and the IRQ migration logic for (especially shared)
> level-triggered ioapic IRQs is not well tested, because it had
> no users before.  After introducing pirq_set_affinity in
> c/s 21625, that logic is used frequently when vcpu migration
> occurs, so I suspect it may expose the issue you met.
> Besides, there is a bug in the event driver which is fixed in
> the latest pv_ops dom0; it seems the dom0 you are using doesn't
> include the fix.  This bug may result in lost events in dom0
> and eventually cause dom0 to hang. To work around this bug, you
> can disable irqbalance in dom0. Good luck!
> Xiantao
> 
> diff -r fc29e13f669d xen/arch/x86/irq.c
> --- a/xen/arch/x86/irq.c        Mon Aug 09 16:36:07 2010 +0100
> +++ b/xen/arch/x86/irq.c        Thu Sep 30 20:33:11 2010 +0800
> @@ -516,6 +516,7 @@ void irq_set_affinity(struct irq_desc *d
> 
> void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
> {
> +#if 0
>     unsigned long flags;
>     struct irq_desc *desc = domain_spin_lock_irq_desc(d, pirq, &flags);
> 
> @@ -523,6 +524,7 @@ void pirq_set_affinity(struct domain *d,
>         return;
>     irq_set_affinity(desc, mask);
>     spin_unlock_irqrestore(&desc->lock, flags);
> +#endif
> }
> 
> DEFINE_PER_CPU(unsigned int, irq_count);
> 
> 
> Andreas Kinzler wrote:
>> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>>>   I am talking a while (via email) with Jan now to track the
>>>> following problem and he suggested that I report the problem on
>>>> xen-devel: 
>>>> 
>>>> Jul  9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>>> Jul  9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>>> Jul  9 01:49:10 virt kernel: Calling adapter init
>>>> Jul  9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>>> Jul  9 01:49:49 virt kernel: Acquiring adapter information
>>>> Jul  9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>>> Jul  9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>>> Jul  9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>>> Jul  9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>>> Jul  9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>> 
>>>> After the VMs have been running for a while, the aacraid driver
>>>> reports a non-responding RAID controller. Most of the time the NIC
>>>> is also no longer working. I tried nearly every combination of dom0
>>>> kernel (pvops, xenified SUSE 2.6.31.x, 2.6.32.x, 2.6.34.x) with Xen
>>>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, and unstable.
>>>> No success in two months. Every combination sooner or later showed
>>>> the problem above. I did extensive tests to make sure that the
>>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>> 
>>>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>>>> answer to him:
>>>> 
>>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging
>>>>> c/s 22051) and it did not fix the problem at all. I was able to
>>>>> fix a problem with the serial console and so I got some debug info
>>>>> that is attached to this email. The following line looks
>>>>> suspicious to me (irr=1, delivery_status=1):
>>>> 
>>>>> (XEN)     IRQ 16 Vec216:
>>>>> (XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1,
>>>>>              dest_mode=logical, delivery_status=1, polarity=1,
>>>>> irr=1, trigger=level, mask=0, dest_id:1
>>>> 
>>>>> IRQ 16 is the aacraid controller, which after some while seems to
>>>>> be unable to receive interrupts. Can you see from the debug info
>>>>> what is going on?
>>>> 
>>>> I also applied a small patch which disables HPET broadcast. The
>>>> machine is now running for 110 hours without a crash while normally
>>>> it crashes within a few minutes. Is there something wrong (race,
>>>> deadlock) with HPET broadcasts in relation to blocked interrupt
>>>> reception (see above)?
>>> What kind of hardware does this happen on?
>> 
>> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.
>> 
>>> Should this patch be merged?
>> 
>> Not easy to answer. I spent more than 10 weeks searching nearly full
>> time for the cause of the stability issues. Finally I was able to
>> track it down to the HPET broadcast code.
>> 
>> We need to find the developer of the HPET broadcast code. Then, he
>> should try to fix the code. I consider it a quite severe bug, as it
>> renders Xen nearly useless on affected systems. That is why I (and my
>> boss who pays me) spent so much time (developing/fixing Xen is not
>> really my core job) and money (buying an E5620 machine just for
>> testing Xen).
>> 
>> I think many people on affected systems are having problems. See
>> 
>> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html
>> 
>> Regards Andreas
>> 
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel