[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4

Hi Konrad,

sorry I missed your entry, google mail might not be the best software to view mailing lists ;)

The RCU stall happens roughly 2 minutes after the machine is fully booted, and I'm usually working via SSH by then..

I basically have two cases where the stall happens:

1) Without the no-cpuidle function, It happens when I start xencommons
2) With or without no-cpuidle, this happens sometimes and arbitrary and I have the feeling that logging in via SSH (or network traffic in general?) will increase the chance of the rcu stall and (and this is only a guess) in most cases this actually happens when I enter a command of more then 16 chars in the ssh command prompt. (I don't really think that this is really causing the issue, I just noticed that when entering the usual commands to start all the xen stuff / boot the domUs, it stalls mostly on the same commands / when ssh freezes I came to the same part of the command). But more ssh-intensive commands like 'dmesg' or 'htop' don't cause it..

Also, I can't really say what is on the screen because my dom0 does not have a vga card / both vga cards in the server are passed to different domUs and when I don't hide the vga cards on boot via xen-pciback.hide, the rcu usually does not stall and everything is fine..

2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> Hi Konrad,
> good call! I was able to reproduce the error with the 3.12-rc2 kernel, got
> a lot of information with the new NMI traces (log attached), but since I'm
> not a xen hacker I don't really know how to continue from here. So I might
> add this to the original post and maybe someone can help me. After all the
> error persists for half a year now and besides 2 kernel version / .config
> Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue
> back (even with bisecting the .config because at some point it seemed
> random).

Can you tell me a bit on how this happens? Is it happening after you
boot the machine? Does it happen after a specific workload?

It looks like something in the RCU is taking far too long and
the RCU callback mechanism starts complaining. The CPU0 is when the
RCU mechanism detects that something is off and starts sending NMI to
all CPUs. CPU2 is the only one that looks to be doing RCU callback:

NMI backtrace for cpu 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
 ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
Call Trace:
 <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
 [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
 [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
 [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
 [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
 [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
 [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
 [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
 [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
 [<ffffffff813db9c8>] ? printk+0x4f/0x51
 [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598                          <==================
 [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
 [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
 [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
 [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
 [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
 [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
 [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
 [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
 [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
 [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
 [<ffffffff8103df22>] ? check_events+0x12/0x20
 [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
 [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
 [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
 [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
 [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
 [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
 [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
 [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
 [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
 [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
 [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3

Which looks to be printing something on the VT console (which is running
in KMS mode as it uses framebuffer calls). So is there something on the
screen scrolling widly in a loop?

But then there are also complains about

INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs

this taking too long. I am wondering if there is some time issue
on your box.

What version of Xen do you have?
> 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > > kernel I found which doesn't give me this issue:
> > > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> >
> > So v3.12 (or rather the latest and greaters of the Linus) has the mechanism
> > for the NMI - so you can actually see what is causing the stall.
> >

Xen-devel mailing list

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.