
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split


  • To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
  • From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
  • Date: Thu, 17 Feb 2011 10:11:25 +0100
  • Cc: Andre Przywara <andre.przywara@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On 02/17/11 08:05, Juergen Gross wrote:
On 02/16/11 14:54, George Dunlap wrote:
Andre (and Juergen), can you try again with the attached patch?

What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-) Namely, the various
scheduler-related interrupts (both per-cpu ticks and the master tick)
are part of the scheduler, so disable them before doing anything, and
don't enable them until the cpu is really ready to go again.

To be precise:
* cpu_disable_scheduler() disables ticks
* schedule_cpu_switch() only enables ticks if adding a cpu to a pool,
and does it after inserting the idle vcpu
* Modify semantics so that {alloc,free}_pdata() don't actually start or
stop tickers
+ Call tick_{resume,suspend} in cpu_{up,down}, respectively
* Modify credit1's tick_{suspend,resume} to handle the master ticker
as well.
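
The ordering described in the list above can be sketched as a toy model. This is only an illustration under stated assumptions: struct toy_pcpu and the toy_* functions are hypothetical stand-ins, not the Xen data structures or the attached patch, and the real functions do considerably more work.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state, invented for this sketch only. */
struct toy_pcpu {
    bool ticks_enabled;       /* per-cpu tick and master tick, lumped together */
    bool idle_vcpu_inserted;  /* idle vcpu known to the pool's scheduler */
    int  pool;                /* -1 means the CPU belongs to no pool */
};

/* Taking a CPU away: stop the ticks before touching anything else. */
static void toy_remove_from_pool(struct toy_pcpu *c)
{
    c->ticks_enabled = false;
    c->idle_vcpu_inserted = false;
    c->pool = -1;
}

/*
 * Switching a CPU: ticks are enabled only when the CPU is added to a
 * pool, and only after the idle vcpu has been inserted.
 */
static void toy_schedule_cpu_switch(struct toy_pcpu *c, int new_pool)
{
    c->pool = new_pool;
    if (new_pool < 0)
        return;               /* not adding to a pool: ticks stay off */
    c->idle_vcpu_inserted = true;
    c->ticks_enabled = true;
}
```

The point of the ordering is that no tick can fire on a CPU that is in the middle of changing pools.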

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.

(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)

Tried again, this time with the following patch:

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {


Worked without any flaw for 30000 iterations.
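
The idea behind the one-line change above is to skip stealing from a peer cpu that is not in the pool's online mask. A minimal standalone sketch of that guard follows; the toy_* names, the 64-bit mask and the one-entry run queues are assumptions made for illustration and are not the Xen cpumask or credit scheduler code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 16

/* Toy stand-in for a cpumask: bit n set means cpu n is in the pool. */
typedef uint64_t toy_cpumask_t;

struct toy_vcpu { int id; int pri; };

/* At most one runnable vcpu per cpu; NULL means the run queue is empty. */
static struct toy_vcpu *toy_runq[TOY_NR_CPUS];

static int toy_cpu_isset(unsigned int cpu, toy_cpumask_t mask)
{
    return cpu < TOY_NR_CPUS && ((mask >> cpu) & 1u) != 0;
}

/* Take the peer's vcpu if its priority beats pri (higher wins here). */
static struct toy_vcpu *toy_runq_steal(unsigned int peer_cpu, int pri)
{
    struct toy_vcpu *v;

    if (peer_cpu >= TOY_NR_CPUS)
        return NULL;
    v = toy_runq[peer_cpu];
    if (v != NULL && v->pri > pri) {
        toy_runq[peer_cpu] = NULL;
        return v;
    }
    return NULL;
}

/* Guarded variant: never inspect the run queue of a cpu outside the pool. */
static struct toy_vcpu *toy_steal_if_online(unsigned int peer_cpu,
                                            toy_cpumask_t online, int pri)
{
    return toy_cpu_isset(peer_cpu, online)
        ? toy_runq_steal(peer_cpu, pri) : NULL;
}
```

With cpu 9 absent from the mask, toy_steal_if_online() returns NULL and leaves that cpu's queue untouched; once bit 9 is set, the same call hands back the queued vcpu.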


Juergen


After some thousand iterations the machine hung, and after I dumped the Dom0
registers to the console it continued running and crashed about a second later:

(XEN) cpupool_unassign_cpu(pool=0,cpu=9)
(XEN) cpupool_unassign_cpu(pool=0,cpu=9) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_unassign_cpu(pool=0,cpu=4)
(XEN) cpupool_unassign_cpu(pool=0,cpu=4) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_assign_cpu(pool=1,cpu=9)
(XEN) cpupool_assign_cpu(pool=1,cpu=9) ffff83083002de40
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
(XEN) CPU: 9
(XEN) RIP: e008:[<ffff82c480126100>] active_timer+0xc/0x37
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 0000000000000000
(XEN) rdx: ffff830839d8ff18 rsi: 0000010dbb628a80 rdi: ffff83083ffbcf98
(XEN) rbp: ffff830839d8fd50 rsp: ffff830839d8fd50 r8: ffff83083ffbcf90
(XEN) r9: ffff82c480213680 r10: 00000000ffffffff r11: 0000000000000010
(XEN) r12: ffff82c4802d3f80 r13: ffff82c4802d3f80 r14: ffff83083ffbcf98
(XEN) r15: ffff83083ffbcfc0 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 000000007809c000 cr2: 0000000000620048
(XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff830839d8fd50:
(XEN) ffff830839d8fda0 ffff82c480126ef9 0000000000000000 0000010dbb628a80
(XEN) 0000000000000086 0000000000000009 ffff83083002de40 ffff83083002dd50
(XEN) 0000000000000009 0000000000000009 ffff830839d8fdc0 ffff82c480117906
(XEN) ffff83083ffa3b40 ffff83083ffa5d70 ffff830839d8fe30 ffff82c4801214fa
(XEN) ffff83083002dd00 0000000900000100 0000000000000286 ffff8300780da000
(XEN) ffff83083ffbcf80 ffff83083ffbcf90 ffff82c480247e00 0000000000000009
(XEN) 00000000fffffff0 ffff83083002dd00 0000000000000000 ffff8300781cc198
(XEN) ffff830839d8fe60 ffff82c4801019ff 0000000000000009 0000000000000009
(XEN) ffff8300781cc198 ffff830839d990d0 ffff830839d8fe80 ffff82c480101bd9
(XEN) ffff83107e80c5b0 ffff8300781cc000 ffff830839d8fea0 ffff82c480104f21
(XEN) 0000000000000009 ffff830839d990e0 ffff830839d8fee0 ffff82c480125b6c
(XEN) ffff82c48024a020 ffff830839d8ff18 ffff82c48024a020 ffff830839d8ff18
(XEN) ffff830839d99060 ffff830839d99040 ffff830839d8ff10 ffff82c48015645a
(XEN) 0000000000000000 ffff8300780da000 ffff8300780da000 ffffffffffffffff
(XEN) ffff830839d8fe00 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 ffffffff8062bda0 ffff880fbb1e5fd8 0000000000000246
(XEN) 0000000000000000 000000010003347d 0000000000000000 0000000000000000
(XEN) ffffffff800033aa 00000000deadbeef 00000000deadbeef 00000000deadbeef
(XEN) 0000010000000000 ffffffff800033aa 000000000000e033 0000000000000246
(XEN) ffff880fbb1e5f08 000000000000e02b 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c480126100>] active_timer+0xc/0x37
(XEN) [<ffff82c480126ef9>] set_timer+0x102/0x218
(XEN) [<ffff82c480117906>] csched_tick_resume+0x53/0x75
(XEN) [<ffff82c4801214fa>] schedule_cpu_switch+0x1f1/0x25c
(XEN) [<ffff82c4801019ff>] cpupool_assign_cpu_locked+0x61/0xd6
(XEN) [<ffff82c480101bd9>] cpupool_assign_cpu_helper+0x9f/0xcd
(XEN) [<ffff82c480104f21>] continue_hypercall_tasklet_handler+0x51/0xc3
(XEN) [<ffff82c480125b6c>] do_tasklet+0xe1/0x155
(XEN) [<ffff82c48015645a>] idle_loop+0x5f/0x67
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 9:
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ****************************************


Juergen



--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
