Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

Juergen Gross wrote:

On 02/21/11 11:00, Andre Przywara wrote:

George Dunlap wrote:

Andre (and Juergen), can you try again with the attached patch?

I applied this patch on top of 22931 and it did _not_ work.
The crash occurred almost immediately after I started my script, so the
same behaviour as without the patch.


Did you try my patch addressing races in the scheduler when moving cpus
between cpupools?

Sorry, I tried yours first, but it didn't apply cleanly on my particular tree (sched_jg_fix ;-). So I tested George's first.

I've attached it again. For me it works quite well, while George's patch
seems not to be enough (machine hanging after some tests with cpupools).

OK, it now applied after a rebase.

And yes, I didn't see a crash! At least until the script stopped while a lot of these messages appeared:

(XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)

That is what I reported before and is most probably totally unrelated to this issue.

So I consider this fix working!

I will try to match my recent theories and debug results with your patch to see whether this fits.

OTOH I can't reproduce an error as fast as you even without any patch :-)

(attached my script for reference, though it will most likely only make
sense on bigger NUMA machines)


Yeah, on my 2-node system I need several hundred tries to get an error.
But it seems to be more effective than George's script.

I consider the large over-provisioning the reason. With Dom0 having 48 VCPUs finally squashed together to 6 pCPUs, my script triggered at the second run at the latest.

With your patch it made 24 iterations before the other bug kicked in.

Thanks very much!
Andre.



Juergen

Regards,
Andre.

What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-) Namely, the various
scheduler-related interrupts (both per-cpu ticks and the master tick)
are part of the scheduler, so disable them before doing anything, and
don't enable them until the cpu is really ready to go again.

To be precise:
* cpu_disable_scheduler() disables ticks
* schedule_cpu_switch() only enables ticks if adding a cpu to a pool,
and does it after inserting the idle vcpu
* Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
stop tickers
+ Call tick_{resume,suspend} in cpu_{up,down}, respectively
* Modify credit1's tick_{suspend,resume} to handle the master ticker
as well.
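
The ordering in the list above can be illustrated with a small toy model. This is only a sketch in Python, not Xen code: the class ToyCpu and its fields are hypothetical, and the two functions merely borrow the names used in the list to show the intended order (ticks off first, ticks on last, and only when a cpu is added to a pool).

```python
class ToyCpu:
    """Hypothetical stand-in for a physical cpu; not a Xen data structure."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.pool = None            # pool the cpu currently belongs to
        self.has_idle_vcpu = False  # idle vcpu inserted into the scheduler?
        self.ticks_enabled = False  # per-cpu tick / master tick armed?

    def tick(self):
        # With ticks disabled, a tick can never reach the accounting code.
        if not self.ticks_enabled:
            return "ignored"
        # If ticks were enabled too early, this is where things would go wrong.
        assert self.pool is not None and self.has_idle_vcpu, \
            "tick on a half-initialised cpu"
        return "accounted"


def cpu_disable_scheduler(cpu):
    # Toy version: stop the ticks before touching anything else.
    cpu.ticks_enabled = False
    cpu.has_idle_vcpu = False
    cpu.pool = None


def schedule_cpu_switch(cpu, new_pool):
    # Toy version: enable ticks only when adding the cpu to a pool,
    # and only after the idle vcpu has been inserted.
    cpu.pool = new_pool
    if new_pool is None:
        return
    cpu.has_idle_vcpu = True
    cpu.ticks_enabled = True
```

In this model a tick that arrives between cpu_disable_scheduler() and the end of schedule_cpu_switch() is simply ignored, which is the property the patch is aiming for.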

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.

(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)

-George

On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
<juergen.gross@xxxxxxxxxxxxxx> wrote:

Okay, I have some more data.

I activated cpupool_dprintk() and included checks in sched_credit.c to
test for weight inconsistencies. To reduce race possibilities I've added
my patch to execute cpu assigning/unassigning always in a tasklet on the
cpu to be moved.

Here is the result:

(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_assign_cpu(pool=0,cpu=1)
(XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
(XEN) cpupool_assign_cpu(cpu=1) ret 0
(XEN) cpupool_assign_cpu(pool=1,cpu=4)
(XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
(XEN) cpupool_assign_cpu(cpu=4) ret 0
(XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
(XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
(XEN) Xen BUG at sched_credit.c:570
(XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
(XEN) CPU: 4
(XEN) RIP: e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff830839d3ec30 rcx: 0000000000000000
(XEN) rdx: ffff830839dcff18 rsi: 000000000000000a rdi: ffff82c4802542e8
(XEN) rbp: ffff830839dcfe38 rsp: ffff830839dcfde8 r8: 0000000000000004
(XEN) r9: ffff82c480213520 r10: 00000000fffffffc r11: 0000000000000001
(XEN) r12: 0000000000000004 r13: ffff830839d3ec40 r14: ffff831002ad5e40
(XEN) r15: ffff830839d66f90 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 0000001020a98000 cr2: 00007fc5e9b79d98
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff830839dcfde8:
(XEN) ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
(XEN) 0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
(XEN) ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
(XEN) 0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
(XEN) ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
(XEN) ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
(XEN) ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
(XEN) ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
(XEN) 0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
(XEN) 00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
(XEN) 00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
(XEN) 0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
(XEN) 0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
(XEN) 00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
(XEN) 0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000004
(XEN) ffff830077eee000 00000043b9afd180 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN) [<ffff82c480126204>] execute_timer+0x4e/0x6c
(XEN) [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
(XEN) [<ffff82c480123647>] __do_softirq+0x88/0x99
(XEN) [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) Xen BUG at sched_credit.c:570
(XEN) ****************************************

As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
triggered in csched_acct() is a logical result of this.
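
Why this is a "logical result" can be seen from a small model of the check. The sketch below is Python, not the scheduler source. It uses the condition quoted elsewhere in this thread, (sdom->weight * sdom->active_vcpu_count) > weight_left, and assumes that weight_left starts out as the total weight recorded for the pool (the "weight 0" in the log above). The function name check_weights is made up for illustration.

```python
def check_weights(pool_weight, active_domains):
    """Toy model of the accounting check.

    pool_weight    -- total weight recorded for the pool (assumed to be
                      the starting value of weight_left)
    active_domains -- list of (weight, active_vcpu_count) pairs
    Returns the weight left over, or raises AssertionError at the point
    where the BUG_ON condition would be true.
    """
    weight_left = pool_weight
    for weight, active_vcpu_count in active_domains:
        share = weight * active_vcpu_count
        if share > weight_left:
            raise AssertionError(
                f"BUG_ON: {share} > weight_left {weight_left}")
        weight_left -= share
    return weight_left
```

With the values from the log, check_weights(0, [(256, 1)]) fails immediately: pool 1 has recorded no weight for Dom0, yet a Dom0 vcpu with weight 256 shows up as active there. In a consistent pool, check_weights(256, [(256, 1)]) passes.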

How this can happen I don't know yet.
Anyone any idea? I'll keep searching...


Juergen

On 02/15/11 08:22, Juergen Gross wrote:

On 02/14/11 18:57, George Dunlap wrote:

The good news is, I've managed to reproduce this on my local test
hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
attached script. It's time to go home now, but I should be able to
dig something up tomorrow.

To use the script:
* Rename cpupool0 to "p0", and create an empty second pool, "p1"
* You can modify elements by adding "arg=val" as arguments.
* Arguments are:
+ dryrun={true,false} Do the work, but don't actually execute any xl
arguments. Default false.
+ left: Number of commands to execute. Default 10.
+ maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
8 cpus).
+ verbose={true,false} Print what you're doing. Default is true.

The script sometimes attempts to remove the last cpu from cpupool0; in
this case, libxl will print an error. If the script gets an error
under that condition, it will ignore it; under any other condition, it
will print diagnostic information.
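
The attached script itself is not reproduced in this thread. As an illustration of the "arg=val" convention and the defaults listed above, here is a minimal Python sketch. The function name parse_args and the error handling are assumptions, not taken from the script.

```python
# Defaults as listed in the description above, kept as strings like
# command-line input.
DEFAULTS = {"dryrun": "false", "left": "10", "maxcpus": "7", "verbose": "true"}


def parse_args(argv):
    """Parse a list of 'arg=val' strings on top of the documented defaults."""
    opts = dict(DEFAULTS)
    for arg in argv:
        key, sep, val = arg.partition("=")
        if not sep or key not in opts:
            raise ValueError(f"unknown argument: {arg!r}")
        opts[key] = val
    return {
        "dryrun": opts["dryrun"] == "true",
        "left": int(opts["left"]),
        "maxcpus": int(opts["maxcpus"]),  # highest cpu number, so 7 means 8 cpus
        "verbose": opts["verbose"] == "true",
    }
```

For example, parse_args(["verbose=false", "left=1000"]) keeps the default maxcpus of 7 and dryrun of false while overriding the other two values.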

What finally crashed it for me was this command:
# ./cpupool-test.sh verbose=false left=1000

Nice!
With your script I finally managed to get the error, too. On my box (2
sockets with 6 cores each) I had to use

./cpupool-test.sh verbose=false left=10000 maxcpus=11

to trigger it.
Looking for more data now...


Juergen

-George

On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara
<andre.przywara@xxxxxxx> wrote:

Juergen Gross wrote:

On 02/10/11 15:18, Andre Przywara wrote:

Andre Przywara wrote:

On 02/10/2011 07:42 AM, Juergen Gross wrote:

On 02/09/11 15:21, Juergen Gross wrote:

Andre, George,


What seems to be interesting: I think the problem did always occur when
a new cpupool was created and the first cpu was moved to it.

I think my previous assumption regarding the master_ticker was not too bad.
I think somehow the master_ticker of the new cpupool is becoming active
before the scheduler is really initialized properly. This could happen if
enough time is spent between alloc_pdata for the cpu to be moved and the
critical section in schedule_cpu_switch().

The solution should be to activate the timers only if the scheduler is
ready for them.

George, do you think the master_ticker should be stopped in suspend_ticker
as well? I still see potential problems for entering deep C-States. I think
I'll prepare a patch which will keep the master_ticker active for the
C-State case and migrate it for the schedule_cpu_switch() case.

Okay, here is a patch for this. It ran on my 4-core machine without any
problems.
Andre, could you give it a try?

Did, but unfortunately it crashed as always. Tried twice and made sure
I booted the right kernel. Sorry.
The idea with the race between the timer and the state changing
sounded very appealing, actually that was suspicious to me from the
beginning.

I will add some code to dump the state of all cpupools to the BUG_ON
to see in which situation we are when the bug triggers.

OK, here is a first try of this. The patch iterates over all CPU pools
and outputs some data if the
BUG_ON((sdom->weight * sdom->active_vcpu_count) > weight_left)
condition triggers:
(XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
(XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
(XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
(XEN) Xen BUG at sched_credit.c:1010
....
The masks look proper (6 cores per node), the bug triggers when the
first CPU is about to be(?) inserted.

Sure? I'm missing the cpu with mask 2000.
I'll try to reproduce the problem on a larger machine here (24 cores,
4 numa nodes).
Andre, can you give me your xen boot parameters? Which xen changeset are
you running, and do you have any additional patches in use?
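
The pool masks quoted above can be decoded mechanically to see which cpus belong to no pool at that moment. The following Python sketch does this. The helper names are made up for illustration, and the cpu count of 48 is an assumption based on the 12-digit mask of pool #0 and the 48 Dom0 VCPUs mentioned elsewhere in this thread.

```python
def cpus_in_mask(mask_hex):
    """Return the list of cpu numbers whose bit is set in a hex cpu mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def unassigned_mask(pool_masks, nr_cpus):
    """Return the mask of cpus below nr_cpus that appear in no pool mask."""
    used = 0
    for mask_hex in pool_masks:
        used |= int(mask_hex, 16)
    return ((1 << nr_cpus) - 1) & ~used


pool_masks = ["fffffffc003f", "fc0", "1000"]  # pools #0, #1, #2 from the log
free = unassigned_mask(pool_masks, 48)        # 48 cpus is an assumption
print(format(free, "x"))                      # 3e000
print(cpus_in_mask(format(free, "x")))        # [13, 14, 15, 16, 17]
```

Pool #1 holds cpus 6-11 and pool #2 holds only cpu 12, so cpus 13-17 (mask 3e000, which includes the 2000 asked about) are in no pool at the time of the dump.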

The grub lines:
kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0 console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0

All of my experiments use c/s 22858 as a base.
If you use an AMD Magny-Cours box for your experiments (socket C32 or
G34), you should add the following patch (removing the line)
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
__clear_bit(X86_FEATURE_SKINIT % 32,&c);
__clear_bit(X86_FEATURE_WDT % 32,&c);
__clear_bit(X86_FEATURE_LWP % 32,&c);
- __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
__clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
break;
case 5: /* MONITOR/MWAIT */

This is not necessary (in fact that reverts my patch c/s 22815), but
raises the probability of triggering the bug, probably because it
increases the pressure on the Dom0 scheduler. If you cannot trigger it
with Dom0, try to create a guest with many VCPUs and squeeze it into a
small CPU-pool.

Good luck ;-)
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel




--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
