
Re: [Xen-devel] Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot



Hello!

On 14/07/14 11:52, Wei Liu wrote:
Hello

On Mon, Jul 14, 2014 at 04:25:54AM +0200, Armin Zentai wrote:
Dear Xen Developers!


We're running Xen on multiple machines, most of them Dell R410 or SM
X8DTL, with one E5645 CPU and 48 GB of RAM. We've updated the kernel to
3.15.4 after some of our hypervisors started rebooting at random
times.

The logs were empty and we had no information about the crashes. We
tried some tricks, and in the end the netconsole kernel module helped,
giving us a very thin layer of remote kernel logging. We found the
following in the remote logs:

It's good you've got netconsole working. I would still like to point out
that we have a wiki page on setting up serial console on Xen, which
might be helpful.

http://wiki.xen.org/wiki/Xen_Serial_Console
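
For reference, the core of that setup is just two command-line
additions; the port and speed below are illustrative, adjust them to
your hardware:

  # Xen (hypervisor) command line, in the bootloader entry:
  com1=115200,8n1 console=com1,vga
  # dom0 kernel command line:
  console=hvc0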


We've set up the Xen serial console, but we wanted to avoid rebooting the hypervisors unless necessary, so it will be activated on each system at its next reboot.

(We have set up a system that logs into every Dell iDRAC via telnet [we have 18 nodes, so we cannot attach a physical serial link to every machine], sets up SOL, and logs every output, but netconsole was a much less painful way to gather the logs.)
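
(For reference, netconsole takes a single module parameter of the form
src-port@src-ip/dev,target-port@target-ip/target-mac; the addresses and
interface name below are placeholders:

  modprobe netconsole \
      netconsole=6665@10.0.0.5/eth0,6666@10.0.0.2/aa:bb:cc:dd:ee:ff

plus a syslog or netcat listener on the target side.)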


Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
[157060.106476] ------------[ cut here ]------------
[157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
[157060.106616] invalid opcode: 0000 [#1] SMP
[...]
[157060.112705] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E 3.15.4 #1
[157060.112776] Hardware name: Supermicro X8DTL/X8DTL, BIOS 1.1b    03/19/2010
[157060.112848] task: ffffffff81c1b480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[157060.112936] RIP: e030:[<ffffffffa025f61d>]  [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
[157060.113078] RSP: e02b:ffff88008ea03d48  EFLAGS: 00010292
[157060.113147] RAX: 000000000000004a RBX: 000000000000000c RCX: 0000000000000000
[157060.113234] RDX: ffff88008a40b600 RSI: ffff88008ea03a18 RDI: 000000000000021b
[157060.113321] RBP: ffff88008ea03d88 R08: 0000000000000000 R09: ffff88008a40b600
[157060.113408] R10: ffff88008a0004e8 R11: 00000000000006d8 R12: ffff8800569708c0
[157060.113495] R13: ffff88006558fec0 R14: ffff8800569708c0 R15: 0000000000000001
[157060.113589] FS:  00007f351684b700(0000) GS:ffff88008ea00000(0000) knlGS:0000000000000000
[157060.113679] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[157060.113747] CR2: 00007fc2a4372000 CR3: 00000000049f3000 CR4: 0000000000002660
[157060.113835] Stack:
[157060.113896]  ffff880056979f90 ff00000000000001 ffff880b0605e000 0000000000000000
[157060.114143]  ffff0000ffffffff 00000000fffffff6 0000000000000001 ffff8800569769d0
[157060.114390]  ffff88008ea03e58 ffffffffa02622fc ffff88008ea03dd8 ffffffff810b5223
[157060.114637] Call Trace:
[157060.114700]  <IRQ>
[157060.114750]  [<ffffffffa02622fc>] xenvif_tx_action+0x27c/0x7f0 [xen_netback]
[157060.114927]  [<ffffffff810b5223>] ? __wake_up+0x53/0x70
[157060.114998]  [<ffffffff810ca077>] ? handle_irq_event_percpu+0xa7/0x1b0
[157060.115073]  [<ffffffffa02647d1>] xenvif_poll+0x31/0x64 [xen_netback]
[157060.115147]  [<ffffffff81653d4b>] net_rx_action+0x10b/0x290
[157060.115221]  [<ffffffff81071c73>] __do_softirq+0x103/0x320
[157060.115292]  [<ffffffff81072015>] irq_exit+0x135/0x140
[157060.115363]  [<ffffffff8144759c>] xen_evtchn_do_upcall+0x3c/0x50
[157060.115436]  [<ffffffff8175c07e>] xen_do_hypervisor_callback+0x1e/0x30
[157060.115506]  <EOI>
[157060.115551]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[157060.115722]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[157060.115794]  [<ffffffff8100a200>] ? xen_safe_halt+0x10/0x20
[157060.115869]  [<ffffffff8101dbbf>] ? default_idle+0x1f/0xc0
[157060.115939]  [<ffffffff8101d38f>] ? arch_cpu_idle+0xf/0x20
[157060.116009]  [<ffffffff810b5aa1>] ? cpu_startup_entry+0x201/0x360
[157060.116084]  [<ffffffff817420a7>] ? rest_init+0x77/0x80
[157060.116156]  [<ffffffff81d3a156>] ? start_kernel+0x406/0x413
[157060.116227]  [<ffffffff81d39b6e>] ? repair_env_string+0x5b/0x5b
[157060.116298]  [<ffffffff81d39603>] ? x86_64_start_reservations+0x2a/0x2c
[157060.116373]  [<ffffffff81d3d5dc>] ? xen_start_kernel+0x584/0x586
[...]
[157060.119179] RIP  [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
[157060.119312]  RSP <ffff88008ea03d48>
[157060.119395] ---[ end trace 7e021c96c8cfea53 ]---
[157060.119465] Kernel panic - not syncing: Fatal exception in interrupt
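
For context, the BUG() that fires is netback's sanity check on its
grant-handle bookkeeping. Paraphrased from the 3.15
drivers/net/xen-netback/netback.c (a sketch from memory, not a verbatim
quote):

  /* Fires when the handle for pending_idx is already invalid, i.e.
   * the slot is being unmapped twice or was never mapped. */
  static inline void xenvif_grant_handle_reset(struct xenvif *vif,
                                               u16 pending_idx)
  {
          if (unlikely(vif->grant_tx_handle[pending_idx] ==
                       NETBACK_INVALID_HANDLE)) {
                  netdev_err(vif->dev,
                             "Trying to unmap invalid handle! pending_idx: %x\n",
                             pending_idx);
                  BUG();
          }
          vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
  }

In the trace above, pending_idx 0xc reached this check via
xenvif_idx_unmap() in the TX completion path.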


h14z4mzbvfrrhb was the name of a VIF. This VIF belonged to a Windows
Server 2008 R2 x64 virtual machine. We have had 6 random reboots so far,
and all of the offending VIFs belonged to the same operating system, but
to different virtual machines. So only the virtual interfaces of Windows
Server 2008 R2 x64 systems caused the crashes, and these systems were
provisioned from different installs or templates. The GPLPV driver
versions also differ.


Unfortunately I don't have Windows Server 2008 R2. :-(

This bug is in the guest TX path. What's the workload of your guest? Is
there any pattern to its traffic?

There's no obvious pattern: some of them used one core at nearly 100%, some had 1-2% CPU with 5-10 Mbps of networking and/or I/O. I've tried to stress the CPU with cpuburn and prime95, and to stress the network with a SYN flood, IIS load testing via apache ab, and throughput/bandwidth tests, but none of these attempts caused a reboot.
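
(For concreteness, the stress runs were along these lines; targets and
counts are placeholders, and hping3 stands in for whichever SYN-flood
tool was used:

  ab -n 1000000 -c 200 http://<guest-ip>/
  hping3 -S --flood -p 80 <guest-ip>
)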


I've checked the changesets between 3.15.4 and 3.16-rc5 and there's no
fix for this, so this is the first report of this issue. If there's a
reliable way to reproduce it, that would be great.
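
(For anyone who wants to double-check, something like

  git log --oneline v3.15.4..v3.16-rc5 -- drivers/net/xen-netback/

run in a tree that carries both tags, e.g. linux-stable, lists the
netback changes in that range.)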

Zoltan, have you seen this before? Can your work on pktgen help?

[root@c2-node11 ~]# uname -a
Linux c2-node11 3.15.4 #1 SMP Tue Jul 8 17:58:26 CEST 2014 x86_64 x86_64
x86_64 GNU/Linux


The xm create config file of the VM in question (the other VMs' config
files are the same):

kernel = "/usr/lib/xen/boot/hvmloader"
device_model = "/usr/lib64/xen/bin/qemu-dm"
builder = "hvm"
memory = "2000"
name = "vna3mhwnv9pn4m"
vcpus = "1"

timer_mode = "2"
viridian = "1"

vif = [ "type=ioemu, mac=00:16:3e:64:c8:ba, bridge=x0evss6g1ztoa4, ip=...,
vifname=h14z4mzbvfrrhb, rate=100Mb/s" ]

disk = [ "phy:/dev/q7jiqc2gh02b2b/xz7wget4ycmp77,ioemu:hda,w" ]
vnc = 1
vncpasswd="aaaaa1"
usbdevice="tablet"


The HV's networking looks like the following: we are using dual Emulex
10Gbit network adapters with bonding (LACP), and on top of the bond
we're using VLANs for the VM, management and iSCSI traffic.
We've tried to reproduce the error, but we couldn't; each crash/reboot
happened at a random time.
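
(Roughly, in iproute2/brctl terms, a sketch of that topology; the
interface names and VLAN ID are placeholders, the bridge name is the one
from the vif line above:

  ip link add bond0 type bond mode 802.3ad
  ip link set eth0 master bond0
  ip link set eth1 master bond0
  ip link add link bond0 name bond0.100 type vlan id 100
  brctl addbr x0evss6g1ztoa4
  brctl addif x0evss6g1ztoa4 bond0.100

with one such VLAN per traffic class, and only the guest VLANs
bridged.)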


In that case you will need to instrument netback to spit out more
information. Zoltan, is there any other information that you would like
to know?
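
Even a single extra print just before the existing check would tie the
failing pending_idx to its handle on netconsole. A minimal, untested
sketch (field names as in the 3.15 sources):

  /* Hypothetical debug aid: dump the handle about to be checked so the
   * offending pending_idx's state is visible before BUG() fires. */
  netdev_warn(vif->dev, "unmap: pending_idx %u handle %x\n",
              pending_idx, vif->grant_tx_handle[pending_idx]);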

Wei.

Thanks for your help,

  - Armin Zentai



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
