
RE: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug


  • To: "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
  • Date: Tue, 30 Jan 2007 22:11:32 +0800
  • Delivery-date: Tue, 30 Jan 2007 06:11:24 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcdESFqDCWsISfq5RGeHgxcxVzRqmQACelaDAAAZiDAAAP4q2wADxf1QAAFVV3AAAMUYXAAACCkwAACFZyAAAV9OMA==
  • Thread-topic: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug

>From: Keir Fraser [mailto:Keir.Fraser@xxxxxxxxxxxx]
>Sent: 30 January 2007 21:13
>On 30/1/07 1:09 pm, "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
>>> I'm sure this will fix the issue. But who knows what real underlying
>>> issue it might be hiding?
>>>
>>> -- Keir
>>
>> I'm not sure whether it hides something. But the current situation
>> seems like a self-trap to me: the watchdog waits for the timer
>> interrupt to wake it at a 1s interval, while the timer interrupt
>> deliberately schedules a longer interval without considering the
>> watchdog, and then blames the watchdog thread for not running
>> within 10s. :-)
>
>Actually I think you're right -- if this fixes the issue then it points to a
>problem in the next_timer_event code. So it would actually be interesting
>to try clamping the timeout to one second.
>
> -- Keir

By a simple change like this:

@@ -962,7 +962,8 @@ u64 jiffies_to_st(unsigned long j)
                } else if (((unsigned long)delta >> (BITS_PER_LONG-3)) != 0) {
                        /* Very long timeout means there is no pending timer.
                         * We indicate this to Xen by passing zero timeout. */
-                       st = 0;
+                       //st = 0;
+                       st = processed_system_time + HZ * (u64)NS_PER_TICK;
                } else {
                        st = processed_system_time + delta * (u64)NS_PER_TICK;
                }

I hoped to report this as the root fix, but I can't, even though the
change clearly made things better. I created a domU with 4 VCPUs on a
2-CPU box and hot-removed/plugged vcpus 1, 2 and 3 alternately. After
about ten rounds of testing everything was fine; however, several
minutes later I saw the warning again, though far less frequently than
before.

So I have to dig further into this bug. The first thing I plan to do is
to find out whether such a long timeout is what the guest actually
requests, or whether Xen enlarges the timeout underneath... :-(
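
One way to check the guest side (a minimal sketch only -- the printk
below is hypothetical debug instrumentation, not something I'm proposing
to commit) is to log the requested delta in the same branch of
jiffies_to_st() that the change above touches:

                } else if (((unsigned long)delta >> (BITS_PER_LONG-3)) != 0) {
                        /* Hypothetical debug printk: show how long a timeout
                         * the guest actually asked for before it is clamped. */
                        if (printk_ratelimit())
                                printk(KERN_DEBUG
                                       "jiffies_to_st: huge delta=%ld ticks "
                                       "(j=%lu jiffies=%lu)\n",
                                       (long)delta, j, jiffies);
                        st = processed_system_time + HZ * (u64)NS_PER_TICK;
                }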

BTW, do you think it's worth destroying the vcpu in the scheduler when
it is brought down and then re-initialising it into the scheduler when
it comes back up? I don't know whether this would affect the scheduler's
accounting. Domain save/restore doesn't show this bug, and one obvious
difference from vcpu-hotplug is that the domain is restored in a new
context...

Thanks,
Kevin

P.S. some trace log attached. You can see that the drift in each warning
is just around 1000 ticks (0x41e = 1054, 0x447 = 1095, 0x43f = 1087,
0x3ea = 1002).
[root@localhost ~]# BUG: soft lockup detected on CPU#1!
BUG: drift by 0x41e
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#2!
BUG: drift by 0x447
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x43f
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x3ea
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c0137699>] __rcu_process_callbacks+0x99/0x100
 [<c0129867>] tasklet_action+0x87/0x130
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

