
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split



Andre (and Juergen), can you try again with the attached patch?

What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-)  Namely, the various
scheduler-related interrupts (both the per-cpu ticks and the master
tick) are part of the scheduler, so disable them before doing anything,
and don't enable them again until the cpu is really ready to go.

To be precise:
* cpu_disable_scheduler() disables the ticks
* schedule_cpu_switch() only enables the ticks when adding a cpu to a
pool, and does so after inserting the idle vcpu
* Modify the semantics such that {alloc,free}_pdata() don't actually
start or stop tickers
 + Call tick_{resume,suspend} in cpu_{up,down}, respectively
* Modify credit1's tick_{suspend,resume} to handle the master ticker as
well (a rough sketch of the resulting ordering is below).
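
A rough, self-contained sketch of that ordering (the struct and helpers
below are simplified stand-ins for illustration, not the real Xen
interfaces; only the sequence of operations matters):

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a physical cpu's scheduler-related state. */
struct pcpu {
    int id;
    bool cpu_tick_active;     /* per-cpu accounting tick */
    bool master_tick_active;  /* credit1 master ticker */
    bool idle_vcpu_inserted;  /* idle vcpu present for the current pool */
};

/* Credit1 tick handling: suspend/resume cover the master ticker too. */
static void tick_suspend(struct pcpu *c)
{
    c->cpu_tick_active = false;
    c->master_tick_active = false;
    printf("cpu%d: ticks suspended\n", c->id);
}

static void tick_resume(struct pcpu *c)
{
    c->cpu_tick_active = true;
    c->master_tick_active = true;
    printf("cpu%d: ticks resumed\n", c->id);
}

/* {alloc,free}_pdata no longer start or stop any tickers. */
static void alloc_pdata(struct pcpu *c) { c->idle_vcpu_inserted = false; }
static void free_pdata(struct pcpu *c)  { (void)c; }

/* cpu_disable_scheduler: the ticks are part of the scheduler, so they
 * go quiet before anything else is torn down. */
static void cpu_disable_scheduler(struct pcpu *c)
{
    tick_suspend(c);
    /* ... migrate vcpus away, etc. ... */
}

/* schedule_cpu_switch: ticks come back only when the cpu is added to a
 * pool, and only after the idle vcpu has been inserted. */
static void schedule_cpu_switch(struct pcpu *c, bool adding_to_pool)
{
    free_pdata(c);
    alloc_pdata(c);
    c->idle_vcpu_inserted = true;   /* insert idle vcpu for the new pool */
    if (adding_to_pool)
        tick_resume(c);             /* scheduler state is ready now */
}

int main(void)
{
    struct pcpu c = { .id = 4, .cpu_tick_active = true,
                      .master_tick_active = true, .idle_vcpu_inserted = true };

    cpu_disable_scheduler(&c);      /* remove from the old pool: ticks off */
    schedule_cpu_switch(&c, true);  /* add to the new pool: ticks on last */
    return 0;
}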

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.

(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)

 -George

On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
<juergen.gross@xxxxxxxxxxxxxx> wrote:
> Okay, I have some more data.
>
> I activated cpupool_dprintk() and included checks in sched_credit.c to
> test for weight inconsistencies. To reduce race possibilities I've
> applied my patch that always executes cpu assigning/unassigning in a
> tasklet on the cpu to be moved.
>
> Here is the result:
>
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
> (XEN) cpupool_assign_cpu(cpu=1) ret 0
> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
> (XEN) cpupool_assign_cpu(cpu=4) ret 0
> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
> (XEN) Xen BUG at sched_credit.c:570
> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    4
> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 4:
> (XEN) Xen BUG at sched_credit.c:570
> (XEN) ****************************************
>
> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
> BUG_ON triggered in csched_acct() is a logical result of this.
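>
> A minimal model of why that BUG_ON has to fire (the struct and the
> accounting loop below are simplified stand-ins, not the real
> csched_acct()): pool 1 never accounted Dom0's weight, so any positive
> contribution from it exceeds weight_left:
>
> #include <assert.h>
>
> /* Simplified stand-in for a credit1 domain's accounting data. */
> struct sdom { int weight; int active_vcpu_count; };
>
> /* weight_total is what the pool accounted for its *own* active domains;
>  * the check mirrors the BUG_ON condition quoted further down:
>  * (sdom->weight * sdom->active_vcpu_count) > weight_left. */
> static void csched_acct_model(struct sdom *active[], int n, int weight_total)
> {
>     int weight_left = weight_total;
>
>     for (int i = 0; i < n; i++) {
>         struct sdom *sdom = active[i];
>         assert(sdom->weight * sdom->active_vcpu_count <= weight_left);
>         weight_left -= sdom->weight * sdom->active_vcpu_count;
>     }
> }
>
> int main(void)
> {
>     /* Pool 1 has no domains of its own, so its accounted total is 0. */
>     int pool1_weight_total = 0;
>
>     /* A Dom0 vcpu (weight 256) wrongly shows up on a pool 1 cpu and is
>      * put on pool 1's active list without being added to its total... */
>     struct sdom dom0 = { .weight = 256, .active_vcpu_count = 1 };
>     struct sdom *active[] = { &dom0 };
>
>     /* ...so 256 * 1 > 0 and the assertion (the BUG_ON) fires. */
>     csched_acct_model(active, 1, pool1_weight_total);
>     return 0;
> }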
>
> I don't know yet how this can happen.
> Does anyone have an idea? I'll keep searching...
>
>
> Juergen
>
> On 02/15/11 08:22, Juergen Gross wrote:
>>
>> On 02/14/11 18:57, George Dunlap wrote:
>>>
>>> The good news is, I've managed to reproduce this on my local test
>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>> attached script. It's time to go home now, but I should be able to
>>> dig something up tomorrow.
>>>
>>> To use the script:
>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>> * You can modify elements by adding "arg=val" as arguments.
>>> * Arguments are:
>>> + dryrun={true,false}: Do the work, but don't actually execute any xl
>>> commands. Default false.
>>> + left: Number of commands to execute. Default 10.
>>> + maxcpus: Highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>> 8 cpus).
>>> + verbose={true,false}: Print what you're doing. Default true.
>>>
>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>> this case, libxl will print an error. If the script gets an error
>>> under that condition, it will ignore it; under any other condition, it
>>> will print diagnostic information.
>>>
>>> What finally crashed it for me was this command:
>>> # ./cpupool-test.sh verbose=false left=1000
>>
>> Nice!
>> With your script I finally managed to get the error, too. On my box
>> (2 sockets with 6 cores each) I had to use
>>
>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>
>> to trigger it.
>> Looking for more data now...
>>
>>
>> Juergen
>>
>>>
>>> -George
>>>
>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre
>>> Przywara<andre.przywara@xxxxxxx> wrote:
>>>>
>>>> Juergen Gross wrote:
>>>>>
>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>
>>>>>> Andre Przywara wrote:
>>>>>>>
>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>
>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>
>>>>>>>>> Andre, George,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What seems interesting: I think the problem always occurred
>>>>>>>>> when a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>
>>>>>>>>> I think my previous assumption regarding the master_ticker was
>>>>>>>>> not too bad. I think somehow the master_ticker of the new
>>>>>>>>> cpupool is becoming active before the scheduler is really
>>>>>>>>> initialized properly. This could happen if enough time is spent
>>>>>>>>> between alloc_pdata for the cpu to be moved and the critical
>>>>>>>>> section in schedule_cpu_switch().
>>>>>>>>>
>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>> scheduler is ready for them.
>>>>>>>>>
>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>> suspend_ticker as well? I still see potential problems for
>>>>>>>>> entering deep C-States. I think I'll prepare a patch which will
>>>>>>>>> keep the master_ticker active for the C-State case and migrate
>>>>>>>>> it for the schedule_cpu_switch() case.
>>>>>>>>
>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>> without any problems.
>>>>>>>> Andre, could you give it a try?
>>>>>>>
>>>>>>> I did, but unfortunately it crashed as always. I tried twice and
>>>>>>> made sure I booted the right kernel. Sorry.
>>>>>>> The idea of a race between the timer and the state change sounded
>>>>>>> very appealing; that part had actually looked suspicious to me from
>>>>>>> the beginning.
>>>>>>>
>>>>>>> I will add some code to dump the state of all cpupools at the
>>>>>>> BUG_ON, to see which situation we are in when the bug triggers.
>>>>>>
>>>>>> OK, here is a first try at this: the patch iterates over all CPU
>>>>>> pools and outputs some data if the BUG_ON condition
>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) triggers:
>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask:
>>>>>> fffffffc003f
>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>> ....
>>>>>> The masks look proper (6 cores per node); the bug triggers when the
>>>>>> first CPU is about to be(?) inserted.
>>>>>
>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>> I'll try to reproduce the problem on a larger machine here (24 cores,
>>>>> 4 numa nodes).
>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>> are you running, and do you have any additional patches in use?
>>>>
>>>> The grub lines:
>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>
>>>> All of my experiments use c/s 22858 as a base.
>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>> G34), you should add the following patch (removing the line):
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>          __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>          __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>          __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>          __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>          break;
>>>>      case 5: /* MONITOR/MWAIT */
>>>>
>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>> raises the probability of triggering the bug, probably because it
>>>> increases the pressure on the Dom0 scheduler. If you cannot trigger it
>>>> with Dom0, try creating a guest with many VCPUs and squeezing it into
>>>> a small CPU-pool.
>>>>
>>>> Good luck ;-)
>>>> Andre.
>>>>
>>>> --
>>>> Andre Przywara
>>>> AMD-OSRC (Dresden)
>>>> Tel: x29712
>>>>
>>>>
>>
>>
>
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions      e-mail: juergen.gross@xxxxxxxxxxxxxx
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen          Company details: ts.fujitsu.com/imprint.html
>

Attachment: cpupools-tick-rearrange.diff
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

