
Re: [Xen-devel] RFH: Kernel OOPS in xen_netbk_rx_action / xenvif_gop_skb



On Fri, Jun 06, 2014 at 12:26:55PM +0200, Philipp Hahn wrote:
> Hello,
> 
> on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
> running 16 Linux VMs (linux-3.2.39 and others) netback crashes during
> the night when one of the VMs is rebooted by a cron-job:
> > [38551.549615] Oops: 0000 [#1] SMP

Is there any more output above this line? Is it a NULL pointer
dereference or something else?
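
The `0000` after `Oops:` and the CR2 value further down in the trace already hint at an answer. A minimal sketch, assuming the usual x86 page-fault error code bit layout; the helper names are made up for illustration and the two input values are copied from the quoted trace:

```python
# Decode the page-fault error code printed after "Oops:" and check whether
# the faulting address (CR2) looks like a NULL pointer plus a small offset.
# Assumption: standard x86 page-fault error code bits (P, W/R, U/S, RSVD, I/D).

def decode_oops_error_code(code: int) -> dict:
    return {
        "page_present": bool(code & 0x01),       # False: no page mapped there
        "write_access": bool(code & 0x02),       # False: the access was a read
        "user_mode": bool(code & 0x04),          # False: fault in kernel mode
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }

def looks_like_null_deref(cr2: int, page_size: int = 4096) -> bool:
    # Heuristic only: NULL plus a struct member offset lands in the first page.
    return cr2 < page_size

error_code = int("0000", 16)      # from "Oops: 0000 [#1] SMP"
cr2 = 0xffffc900108641d8          # from the "CR2:" line of the trace

info = decode_oops_error_code(error_code)
print(info)
print("NULL-like address:", looks_like_null_deref(cr2))
```

With these inputs the fault decodes as a kernel-mode read of an address that has no page mapped and is far outside the first page, which would suggest something other than a plain NULL pointer dereference.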

> > [38551.549665] Modules linked in: tun xt_physdev xen_blkback xen_netback ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables xen_gntdev nfsv3 nfsv4 rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss oid_registry nfs fscache dns_resolver lockd sunrpc fuse loop xen_blkfront xen_evtchn blktap quota_v2 quota_tree xenfs xen_privcmd coretemp crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 snd_pcm snd_timer snd soundcore snd_page_alloc tpm_tis tpm lpc_ich tpm_bios i7core_edac i2c_i801 psmouse microcode edac_core serio_raw pcspkr mperf ioatdma mfd_core processor evdev thermal_sys ext4 jbd2 crc16 bonding bridge stp llc dm_snapshot dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif ehci_pci uhci_hcd ehci_hcd mptsas mptscsih mptbase scsi_transport_sas usbcore usb_common igb dca i2c_algo_bit i2c_core ptp pps_core button
> > [38551.550601] CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian 3.10.11-1.58.201405060908
> > [38551.550693] Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051 07/22/2011
> > [38551.550781] task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
> > [38551.550865] RIP: e030:[<ffffffffa04147dc>]  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]

Try addr2line?
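
A rough sketch of how the symbol+offset from the RIP line could be turned into an addr2line invocation. It assumes the function starts at offset 0x651 in the module, as the `0x651 + 0x18B` arithmetic in the quoted objdump section implies, and that the module file is named xen-netback.ko. Whether plain `addr2line -e` resolves addresses in a relocatable .ko without further options depends on the binutils build, so treat the command as an assumption. The helper name is made up for illustration:

```python
# Turn "symbol+offset/size" from the oops RIP line into a module-relative
# address and print a candidate addr2line command (not executed here).

def parse_symbol_offset(text: str):
    """Split 'name+0xOFF/0xSIZE' into (name, offset, size)."""
    symbol, rest = text.split("+", 1)
    offset, size = rest.split("/", 1)
    return symbol, int(offset, 16), int(size, 16)

FUNC_START = 0x651            # assumed start of the function in the module
MODULE = "xen-netback.ko"     # assumed module file name

symbol, offset, size = parse_symbol_offset("xen_netbk_rx_action+0x18b/0x6f0")
address = FUNC_START + offset

# -e selects the object file, -f prints function names, -i unwinds inlines.
command = ["addr2line", "-e", MODULE, "-f", "-i", hex(address)]
print(" ".join(command))
```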

> > [38551.550959] RSP: e02b:ffff8800561edce8  EFLAGS: 00010202
> > [38551.551009] RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
> > [38551.551064] RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
> > [38551.551120] RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
> > [38551.551175] R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
> > [38551.551231] R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
> > [38551.551289] FS:  00007f19a54a8700(0000) GS:ffff88005da00000(0000) knlGS:0000000000000000
> > [38551.551374] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [38551.551425] CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
> > [38551.551481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [38551.551537] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [38551.551592] Stack:
> > [38551.551630]  ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
> > [38551.551726]  0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
> > [38551.551823]  0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
> > [38551.551920] Call Trace:
> > [38551.551966]  [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
> > [38551.552021]  [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
> > [38551.552106]  [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
> > [38551.560239]  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
> > [38551.560325]  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
> > [38551.560381]  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
> > [38551.560466]  [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
> > [38551.560518]  [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
> > [38551.560572]  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
> > [38551.560628]  [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
> > [38551.560680]  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
> > [38551.560734] Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07 c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24 60 00 00 00 00 8b 44 d1 04 89 44 24
> > [38551.561151] RIP  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> > [38551.561238]  RSP <ffff8800561edce8>
> > [38551.561283] CR2: ffffc900108641d8
> > [38551.561624] ---[ end trace 8c260c6af259c4aa ]---
> 
> The host itself is still alive and reachable by network, but all VMs are
> no longer reachable.
> The crash does not happen on every reboot: the VM ran fine for
> 1½ weeks after a dom0 kernel update, but has now crashed on each of
> the past two nights.
> 

What was the Dom0 kernel version before upgrading? That would help us
narrow down the range of changesets.

The oops happens in the guest receive path. Unfortunately that's a very
complex function, so it's hard to identify the problem just by looking
at the code.

And as you seem to be using a distro kernel, have you reported this to
Debian yet? I'm not sure which Debian release ships a 3.10 kernel,
though.

> I'm not yet able to reproduce this on demand, but would like to be
> prepared the next time it happens.
> 
> @Ian: I found your mail "Re: [Xen-devel] Kernel 3.7.0-pre-rc1 kernel BUG
> at drivers/net/xen-netback/netback.c:405 RIP: e030:[<ffffffff814714f9>]
> [<ffffffff814714f9>] netbk_gop_frag_copy+0x379/0x380" from 2012-10-09,
> which describes a crash in the same function, but at a completely
> different (later) location. You hinted that a difference in hardware
> might explain why I'm unable to reproduce it, as my test environment
> has different HW (no "igb", but "e1000e").
> 

3.7.0 is too old. There have been lots of changes since then.

> Running "objdump -Sl xen-netback.ko" shows the OOPs to happen here:
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
> >                 meta->gso_size = skb_shinfo(skb)->gso_size;
> >      7b1:       8b b3 d0 00 00 00       mov    0xd0(%rbx),%esi
> >      7b7:       48 8b bb d8 00 00 00    mov    0xd8(%rbx),%rdi
> >      7be:       0f b7 74 37 02          movzwl 0x2(%rdi,%rsi,1),%esi
> >      7c3:       89 70 08                mov    %esi,0x8(%rax)
> >      7c6:       eb 07                   jmp    7cf <xen_netbk_rx_action+0x17e>
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:608

You mentioned 3.10.26 at the beginning but now it's 3.10.11? I'm
confused.
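
One way to check that the disassembled module matches what was actually running, whatever its version string says, is to compare the `Code:` bytes from the oops with the objdump bytes. A minimal sketch, with both byte strings copied from the quoted trace and listing; the helper name is made up for illustration, and 0x7dc is taken to be the faulting address inside the module as the quoted arithmetic suggests:

```python
# Compare the "Code:" bytes of the oops with the objdump listing of the module.
# The byte marked <..> in the oops is the first byte of the faulting instruction.

def parse_code_line(code: str):
    """Return (raw bytes, index of the byte marked with <..>)."""
    data = []
    fault_index = None
    for token in code.split():
        if token.startswith("<") and token.endswith(">"):
            fault_index = len(data)
            token = token[1:-1]
        data.append(int(token, 16))
    return bytes(data), fault_index

OOPS_CODE = (
    "8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 "
    "89 70 08 eb 07 c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 "
    "48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24 60 00 00 00 00 "
    "8b 44 d1 04 89 44 24"
)

# Instruction bytes from the quoted objdump output, offsets 0x7b1 to 0x7f1.
OBJDUMP_START = 0x7b1
OBJDUMP_BYTES = bytes.fromhex(
    "8bb3d0000000" "488bbbd8000000" "0fb7743702" "897008" "eb07"
    "c7400800000000" "89d2" "c7400400000000" "4883c208"
    "0fb734d1" "8930" "c744246000000000" "8b44d104" "89442464"
)

FAULT_ADDR = 0x7dc  # assumed: 0x651 + 0x18b

code, fault_index = parse_code_line(OOPS_CODE)
window_start = FAULT_ADDR - fault_index          # where the Code: window begins
lo = window_start - OBJDUMP_START
matches = code == OBJDUMP_BYTES[lo:lo + len(code)]

print(hex(window_start), "bytes match objdump:", matches)
```

A match over this 64-byte window does not prove the two builds are identical, but a mismatch would show that the wrong binary was disassembled.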

If it's dereferencing a NULL pointer, is skb_shinfo(skb) == NULL?

> >         else
> >                 meta->gso_size = 0;
> >      7c8:       c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
> > 
> >         meta->size = 0;
> >         meta->id = req->id;
> >      7cf:       89 d2                   mov    %edx,%edx
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:610
> >         if (!vif->gso_prefix)
> >                 meta->gso_size = skb_shinfo(skb)->gso_size;
> >         else
> >                 meta->gso_size = 0;
> > 
> >         meta->size = 0;
> >      7d1:       c7 40 04 00 00 00 00    movl   $0x0,0x4(%rax)
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
> >         meta->id = req->id;
> >      7d8:       48 83 c2 08             add    $0x8,%rdx
> >      7dc:       0f b7 34 d1             movzwl (%rcx,%rdx,8),%esi
> 0x651 + 0x18B = 0x7DC
> 
> >      7e0:       89 30                   mov    %esi,(%rax)
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:612
> >         npo->copy_off = 0;
> >      7e2:       c7 44 24 60 00 00 00    movl   $0x0,0x60(%rsp)
> >      7e9:       00 
> > /root/linux-3.10.11/drivers/net/xen-netback/netback.c:613
> >         npo->copy_gref = req->gref;
> >      7ea:       8b 44 d1 04             mov    0x4(%rcx,%rdx,8),%eax
> >      7ee:       89 44 24 64             mov    %eax,0x64(%rsp)
> 
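The register dump allows a cross-check of which access faulted. A minimal sketch, assuming the instruction at 7dc is the `movzwl (%rcx,%rdx,8),%esi` shown in the listing and that the RCX and RDX values in the oops are those at the time of the fault; the function name is made up for illustration, and the slot arithmetic assumes 8-byte ring entries after a 64-byte shared-ring header:

```python
# Recompute the effective address of "movzwl (%rcx,%rdx,8),%esi" from the
# register dump and compare it with the faulting address reported in CR2.

MASK64 = 0xFFFFFFFFFFFFFFFF

def effective_address(base: int, index: int, scale: int, disp: int = 0) -> int:
    """x86 addressing: base + index * scale + displacement, modulo 2**64."""
    return (base + index * scale + disp) & MASK64

RCX = 0xffffc90010864000   # base register, from the oops
RDX = 0x000000000000003b   # index register, already incremented by 8 at 7d8
CR2 = 0xffffc900108641d8   # faulting address, from the oops

addr = effective_address(RCX, RDX, 8)
print(hex(addr), "equals CR2:", addr == CR2)

# Assumption: the "add $0x8,%rdx" skips a 64-byte header made of eight
# 8-byte units, so the ring slot being read would be RDX - 8.
slot = RDX - 8
print("ring slot index:", slot)
```

The computed address equals CR2, which would point at the `req->id` load through RCX as the faulting access rather than at the `skb_shinfo(skb)` line above it.
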
> Ignoring the name change from {netbk -> xenvif}_gop_skb() and the
> addition of GSO for IPv6 the function looks unchanged compared to
> current GIT, so to me it looks like it might still be a problem with the
> current implementation.
> I tried to review the GIT commits myself, but I didn't see anything
> obvious, but with all the recent additional changes to netback I'm
> unsure of how to best proceed:
> 1. Is this a known bug and has someone observed it, too?

Not that I know of.

> 2. If yes, is there a fix in newer Linux kernels?
> 3. If no, What data should I collect in addition?
> 

There's one more patch that you can pick up from 3.10.y tree. I doubt it
will make much difference though.

I think the first thing to do is to identify which line of code is
causing the problem. If it is actually the line you're referring to in
your analysis then we need to figure out why skb_shinfo(skb) is NULL...

> Xen-Hypervisor is 4.1.3 from Debian, but as this is a kernel crash, I
> don't expect a newer version of Xen to fix it (correct me if I'm wrong).
> 

You're correct. Upgrading hypervisor won't help.

Wei.

> Thanks in advance.
> 
> Philipp
> 
> PS: I'm not afraid of getting my hands dirty doing Linux coding, but
> currently I'm out of ideas of how to best proceed.
> -- 
> Philipp Hahn
> Open Source Software Engineer
> 
> Univention GmbH
> be open.
> Mary-Somerville-Str. 1
> D-28359 Bremen
> Tel.: +49 421 22232-0
> Fax : +49 421 22232-99
> hahn@xxxxxxxxxxxxx
> 
> http://www.univention.de/
> Geschäftsführer: Peter H. Ganten
> HRB 20755 Amtsgericht Bremen
> Steuer-Nr.: 71-597-02876
