xen-devel
To: |
"Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>, Andreas Kinzler <ml-xen-devel@xxxxxx>, Pasi Kärkkäinen <pasik@xxxxxx> |
Subject: |
RE: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast |
From: |
"Wei, Gang" <gang.wei@xxxxxxxxx> |
Date: |
Thu, 30 Sep 2010 14:02:34 +0800 |
Cc: |
Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "JBeulich@xxxxxxxxxx" <JBeulich@xxxxxxxxxx>, "Wei, Gang" <gang.wei@xxxxxxxxx> |
I am the original developer of the HPET broadcast code.
First of all, no additional patch is required to disable HPET broadcast.
Simply add the option "cpuidle=off" or "max_cstate=1" to the Xen command line in
/boot/grub/grub.conf.
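For illustration, a legacy GRUB entry with one of these options appended might look like the sketch below. The title, root device, and file names are placeholders and are not taken from this thread; only the option on the "kernel" line matters. With a Xen entry, the "kernel" line loads the hypervisor and the first "module" line loads the dom0 kernel, so the option belongs on the "kernel" line.

```
# /boot/grub/grub.conf (legacy GRUB); all paths and names are placeholders
title Xen
    root (hd0,0)
    kernel /boot/xen.gz cpuidle=off
    module /boot/vmlinuz-<dom0-version> ro root=<root-device>
    module /boot/initrd-<dom0-version>.img
```

To try the second option instead, replace cpuidle=off with max_cstate=1 on the same line.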
Second, I noticed that the issue only occurs on pre-Nehalem server processors. I
will check whether I can reproduce it.
Meanwhile, I am looking forward to seeing whether Jeremy's and Xiantao's
suggestions have any effect. Andreas, could you please try their suggestions?
Jimmy
On , xen-devel-bounces@xxxxxxxxxxxxxxxxxxx wrote:
> Maybe you can try disabling pirq_set_affinity with the
> following patch. That function may trigger IRQ migration in the
> hypervisor, and the IRQ migration logic for (especially
> shared) level-triggered IO-APIC IRQs is not well tested because
> it had no users before. Since pirq_set_affinity was introduced in
> #Cset21625, this logic is used frequently whenever vcpu migration
> occurs, so I suspect it may expose the issue you hit.
> Besides, there is a bug in the event driver which is fixed in
> the latest pv_ops dom0; it seems the dom0 you are using doesn't
> include the fix. This bug may result in lost events in dom0
> and eventually cause dom0 to hang. To work around this bug, you
> can disable irqbalance in dom0. Good luck!
> Xiantao
>
> diff -r fc29e13f669d xen/arch/x86/irq.c
> --- a/xen/arch/x86/irq.c Mon Aug 09 16:36:07 2010 +0100
> +++ b/xen/arch/x86/irq.c Thu Sep 30 20:33:11 2010 +0800
> @@ -516,6 +516,7 @@ void irq_set_affinity(struct irq_desc *d
>
>  void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
>  {
> +#if 0
>      unsigned long flags;
>      struct irq_desc *desc = domain_spin_lock_irq_desc(d, pirq, &flags);
>
> @@ -523,6 +524,7 @@ void pirq_set_affinity(struct domain *d,
>          return;
>      irq_set_affinity(desc, mask);
>      spin_unlock_irqrestore(&desc->lock, flags);
> +#endif
>  }
>
> DEFINE_PER_CPU(unsigned int, irq_count);
>
>
> Andreas Kinzler wrote:
>> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>>> I have been talking with Jan (via email) for a while now to track
>>>> down the following problem, and he suggested that I report it on
>>>> xen-devel:
>>>>
>>>> Jul 9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>>> Jul 9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>>> Jul 9 01:49:10 virt kernel: Calling adapter init
>>>> Jul 9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>>> Jul 9 01:49:49 virt kernel: Acquiring adapter information
>>>> Jul 9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>>> Jul 9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>>> Jul 9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>>> Jul 9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>>> Jul 9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>>
>>>> After the VMs have been running for a while, the aacraid driver
>>>> reports a non-responding RAID controller. Most of the time the NIC
>>>> also stops working. I tried nearly every combination of dom0 kernel
>>>> (pvops0, xenified SUSE 2.6.31.x, xenified SUSE 2.6.32.x, xenified
>>>> SUSE 2.6.34.x) with Xen hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1,
>>>> unstable.
>>>> No success in two months. Sooner or later, every combination showed
>>>> the problem above. I did extensive tests to make sure that the
>>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>>
>>>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>>>> answer to him:
>>>>
>>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging
>>>>> c/s 22051) and it did not fix the problem at all. I was able to
>>>>> fix a problem with the serial console and so I got some debug info
>>>>> that is attached to this email. The following line looks
>>>>> suspicious to me (irr=1, delivery_status=1):
>>>>
>>>>> (XEN) IRQ 16 Vec216:
>>>>> (XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1,
>>>>> dest_mode=logical, delivery_status=1, polarity=1,
>>>>> irr=1, trigger=level, mask=0, dest_id:1
>>>>
>>>>> IRQ 16 is the aacraid controller, which after a while seems to
>>>>> be unable to receive interrupts. Can you see from the debug info
>>>>> what is going on?
>>>>
>>>> I also applied a small patch which disables HPET broadcast. The
>>>> machine has now been running for 110 hours without a crash, while
>>>> normally it crashes within a few minutes. Is there something wrong
>>>> (race, deadlock) with HPET broadcasts in relation to blocked
>>>> interrupt reception (see above)?
>>> What kind of hardware does this happen on?
>>
>> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.
>>
>>> Should this patch be merged?
>>
>> Not easy to answer. I spent more than 10 weeks searching nearly full
>> time for the cause of the stability issues. Finally I was able to
>> track it down to the HPET broadcast code.
>>
>> We need to find the developer of the HPET broadcast code. Then, he
>> should try to fix the code. I consider it a quite severe bug as it
>> renders Xen nearly useless on affected systems. That is why I (and my
>> boss who pays me) have spent so much time (developing/fixing Xen is
>> not really my core job) and money (buying an E5620 machine just for
>> testing Xen).
>>
>> I think many people on affected systems are having problems. See
>>
>> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html
>>
>> Regards Andreas
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel