[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4



On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote:
>    Hi Konrad,
> 
>    sorry I missed your entry, google mail might not be the best software to
>    view mailing lists ;)
> 
>    The RCU stall happens roughly 2 minutes after the machine is fully booted,
>    and I'm usually working via SSH by then..
> 
>    I basically have two cases where the stall happens:
> 
>    1) Without the no-cpuidle function, It happens when I start xencommons
>    2) With or without no-cpuidle, this happens sometimes and arbitrary and I
>    have the feeling that logging in via SSH (or network traffic in general?)
>    will increase the chance of the rcu stall and (and this is only a guess)
>    in most cases this actually happens when I enter a command of more then 16
>    chars in the ssh command prompt. (I don't really think that this is really
>    causing the issue, I just noticed that when entering the usual commands to
>    start all the xen stuff / boot the domUs, it stalls mostly on the same
>    commands / when ssh freezes I came to the same part of the command). But
>    more ssh-intensive commands like 'dmesg' or 'htop' don't cause it..
> 
>    Also, I can't really say what is on the screen because my dom0 does not
>    have a vga card / both vga cards in the server are passed to different
>    domUs and when I don't hide the vga cards on boot via xen-pciback.hide,
>    the rcu usually does not stall and everything is fine..
> 

For debugging you should have a serial console.. so maybe get a pci serial card,
if you don't have any management processors offering SOL ? 

-- Pasi

>    2013/9/27 Konrad Rzeszutek Wilk <[1]konrad.wilk@xxxxxxxxxx>
> 
>      On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
>      > Hi Konrad,
>      >
>      > good call! I was able to reproduce the error with the 3.12-rc2 kernel,
>      got
>      > a lot of information with the new NMI traces (log attached), but since
>      I'm
>      > not a xen hacker I don't really know how to continue from here. So I
>      might
>      > add this to the original post and maybe someone can help me. After all
>      the
>      > error persists for half a year now and besides 2 kernel version /
>      .config
>      > Combinations (a 3.8.2 and a 3.6.something) I could never trace this
>      issue
>      > back (even with bisecting the .config because at some point it seemed
>      > random).
> 
>      Can you tell me a bit on how this happens? Is it happening after you
>      boot the machine? Does it happen after a specific workload?
> 
>      It looks like something in the RCU is taking far too long and
>      the RCU callback mechanism starts complaining. The CPU0 is when the
>      RCU mechanism detects that something is off and starts sending NMI to
>      all CPUs. CPU2 is the only one that looks to be doing RCU callback:
> 
>      NMI backtrace for cpu 1
>      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
>      Hardware name: System manufacturer System Product Name/Crosshair IV
>      Formula, BIOS 3029    10/09/2012
>      task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
>      RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>]
>      cfb_imageblit+0x1b3/0x411
>      RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
>      RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
>      RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
>      RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
>      R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
>      R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
>      FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000)
>      knlGS:0000000000000000
>      CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
>      CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
>      Stack:
>       0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
>       ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
>       0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
>      Call Trace:
>       <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
>       [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
>       [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
>       [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
>       [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
>       [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
>       [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
>       [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
>       [<ffffffff813db9c8>] ? printk+0x4f/0x51
>       [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
>               <==================
>       [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
>       [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
>       [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
>       [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
>       [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
>       [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
>       [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
>       [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
>       [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
>       [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
>       [<ffffffff8103df22>] ? check_events+0x12/0x20
>       [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
>       [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
>       [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
>       [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
>       [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
>       [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
>       [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
>       [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
>       <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
>       [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
>       [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
>      Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29
>      c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89
>      c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
> 
>      Which looks to be printing something on the VT console (which is running
>      in KMS mode as it uses framebuffer calls). So is there something on the
>      screen scrolling widly in a loop?
> 
>      But then there are also complains about
> 
>      INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long
>      to run: 1.115 msecs
> 
>      this taking too long. I am wondering if there is some time issue
>      on your box.
> 
>      What version of Xen do you have?
>      >
>      >
>      > 2013/9/27 Konrad Rzeszutek Wilk <[2]konrad.wilk@xxxxxxxxxx>
>      >
>      > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
>      > > > I'm currently on a vanilla 3.8.2 kernel because this is the only
>      >3.4
>      > > > kernel I found which doesn't give me this issue:
>      > > >
>      [3]http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>      > >
>      > > So v3.12 (or rather the latest and greaters of the Linus) has the
>      mechanism
>      > > for the NMI - so you can actually see what is causing the stall.
>      > >
> 
>      _______________________________________________
>      Xen-devel mailing list
>      [4]Xen-devel@xxxxxxxxxxxxx
>      [5]http://lists.xen.org/xen-devel
> 
> References
> 
>    Visible links
>    1. mailto:konrad.wilk@xxxxxxxxxx
>    2. mailto:konrad.wilk@xxxxxxxxxx
>    3. http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>    4. mailto:Xen-devel@xxxxxxxxxxxxx
>    5. http://lists.xen.org/xen-devel

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.