[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] Intermittent xenvif_disconnect() hang on domU destroy



I'm seeing the xenwatch kernel thread hang intermittently when
destroying a domU on recent stable xen 4.5, with Linux 4.4.11 + grsec
dom0.

The domU is created with a virtual network interface connected to a
physical interface (ixgbevf) via an openvswitch virtual switch.

Everything works fine until the domain is destroyed. Once in a while,
a few seconds after the domain goes away, xenwatch hangs in
xenvif_disconnect(), calling kthread_stop() on a dealloc task.

I added a warning to xenvif_dealloc_kthread_should_stop() when
kthread_should_stop() is true and queue->inflight_packets > 0,
printing inflight_packets as well as stats.tx_zerocopy_*. Each time
the hang occurs, inflight_packets == 1 and tx_zerocopy_sent ==
tx_zerocopy_success + tx_zerocopy_fail + 1.

I also added a warning to xenvif_skb_zerocopy_complete() when
queue->task is null. If I manually bring down the physical interface
to which the vif was connected (ifconfig down), this somehow causes
the last in-flight packet to be transmitted, and everything is
unblocked.

The following shows xenwatch hung trying to stop vif44.0-q0-dealloc,
waking up again after I bring down the physical interface net0_52.

[xl destroy]
...
[ 2914.510070] net vif44.0: stopping vif44.0-q0-dealloc task (pid 28045)
[ 2914.510224] xen_netback:xenvif_dealloc_kthread_should_stop: 
vif44.0-q0-dealloc task (pid 28045) should_stop=1 inflight_packets=1 
tx_zerocopy_sent=209494 tx_zerocopy_success=209492 tx_zerocopy_fail=1
...
[ 2933.561404] device net0_52 left promiscuous mode
[ 2933.564813] device vif44.0 left promiscuous mode
[ 3136.324009] INFO: task xenwatch:29 blocked for more than 120 seconds.
[ 3136.324119]       Not tainted 4.4.11-grsec-skyport #66
[ 3136.324181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 3136.324284] xenwatch        D ffffc90005d0baf8 12640    29      2 0x00000000
[ 3136.324411]  ffffc90005d0baf8 0000000000000000 ffffffff81b70d60 
ffff8804ec08c500
[ 3136.324536]  ffff8804ec08dbb8 ffffc900115abec0 ffffc900115abeb8 
ffff8804ec08c500
[ 3136.324662]  0000000000000005 ffffc90005d0bb10 ffffffff816d6ff7 
7fffffffffffffff
[ 3136.324806] Call Trace:
[ 3136.324866]  [<ffffffff816d6ff7>] schedule+0x37/0x80
[ 3136.324932]  [<ffffffff816dbdec>] schedule_timeout+0x1bc/0x240
[ 3136.325004]  [<ffffffff8100a73c>] ? xen_clocksource_read+0x1c/0x30
[ 3136.325078]  [<ffffffff810194e3>] ? sched_clock+0x13/0x20
[ 3136.325153]  [<ffffffff8109f72c>] ? local_clock+0x1c/0x20
[ 3136.325228]  [<ffffffff810bdd69>] ? mark_held_locks+0x79/0xa0
[ 3136.325298]  [<ffffffff816dd077>] ? _raw_spin_unlock_irq+0x27/0x50
[ 3136.325367]  [<ffffffff810bdecd>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[ 3136.325441]  [<ffffffff816d8426>] wait_for_completion+0xd6/0x110
[ 3136.325514]  [<ffffffff81099570>] ? wake_up_q+0x70/0x70
[ 3136.325585]  [<ffffffff8108f087>] kthread_stop+0x47/0x80
[ 3136.325660]  [<ffffffff814f1661>] xenvif_disconnect+0xb1/0x130
[ 3136.325729]  [<ffffffff814ef3c6>] set_backend_state+0x116/0xde0
[ 3136.325805]  [<ffffffff8143717e>] ? xenbus_gather+0x10e/0x140
[ 3136.325881]  [<ffffffff811a5e42>] ? kfree+0x1c2/0x1e0
[ 3136.325960]  [<ffffffff8109f72c>] ? local_clock+0x1c/0x20
[ 3136.326026]  [<ffffffff814f0577>] frontend_changed+0xb7/0xc0
[ 3136.326095]  [<ffffffff81437fb0>] xenbus_otherend_changed+0x80/0x90
[ 3136.330341]  [<ffffffff81437410>] ? unregister_xenbus_watch+0x260/0x260
[ 3136.330414]  [<ffffffff81438d2b>] frontend_changed+0xb/0x10
[ 3136.330483]  [<ffffffff8143744a>] xenwatch_thread+0x3a/0x130
[ 3136.330553]  [<ffffffff810b2ba0>] ? wake_up_atomic_t+0x30/0x30
[ 3136.330621]  [<ffffffff8108eccc>] kthread+0xfc/0x120
[ 3136.330686]  [<ffffffff8108ebd0>] ? kthread_create_on_node+0x240/0x240
[ 3136.330775]  [<ffffffff816ddbee>] ret_from_fork+0x3e/0x70
[ 3136.330840]  [<ffffffff8108ebd0>] ? kthread_create_on_node+0x240/0x240
[ 3136.330911] 1 lock held by xenwatch/29:
[ 3136.330972]  #0:  (xenwatch_mutex){+.+.+.}, at: [<ffffffff814374a7>] 
ffffffff814374a7
...
[ifconfig net0_52 down]
...
[ 3162.217907] ixgbe 0000:81:00.0 net0: VF Reset msg received from vf 52
[ 3162.228840] ------------[ cut here ]------------
[ 3162.228945] WARNING: CPU: 3 PID: 31978 at 
drivers/net/xen-netback/interface.c:71 xenvif_skb_zerocopy_complete+0x79/0x90()
[ 3162.229104] vif44.0: dead queue vif44.0-q0 decremented inflight_packets to 0
[ 3162.229184] Modules linked in: xt_physdev br_netfilter bridge stp llc tun 
xen_blkback ip_gre ip_tunnel gre ixgbevf drbg ansi_cprng dm_crypt 
algif_skcipher af_alg xen_evtchn xenfs xen_privcmd xen_pciback openvswitch 
nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache 
sunrpc nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 
iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp 
crct10dif_pclmul crc32_pclmul raid1 aesni_intel aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 
ioatdma ipmi_ssif ipmi_msghandler squashfs lz4_decompress ixgbe mdio vxlan 
xhci_pci xhci_hcd ip6_udp_tunnel udp_tunnel ptp pps_core dca ahci libahci 
ehci_pci ehci_hcd usbcore usb_common tpm_tis
[ 3162.230580] CPU: 3 PID: 31978 Comm: ifconfig Not tainted 
4.4.11-grsec-skyport #66
[ 3162.230681] Hardware name: ABCD, BIOS SE5C610.86B.01.01.0009.C1.060120151350 
06/01/2015
[ 3162.230814]  0000000000000000 ffffc9000f7c3a68 ffffffff8137d3f8 
0000000000000000
[ 3162.230938]  ffffc9000f7c3ab0 ffffffff81aa9060 ffffc9000f7c3aa0 
ffffffff81068de8
[ 3162.231062]  ffffc90011565000 ffffc9001156f788 0000000000000001 
ffffc90011565000
[ 3162.231187] Call Trace:
[ 3162.231251]  [<ffffffff8137d3f8>] dump_stack+0x9a/0xe2
[ 3162.231324]  [<ffffffff81068de8>] warn_slowpath_common+0x78/0xb0
[ 3162.231392]  [<ffffffff81068e67>] warn_slowpath_fmt+0x47/0x50
[ 3162.231469]  [<ffffffff810bdecd>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[ 3162.231540]  [<ffffffff814f0dc9>] xenvif_skb_zerocopy_complete+0x79/0x90
[ 3162.231612]  [<ffffffff814ede0f>] xenvif_zerocopy_callback+0x9f/0xc0
[ 3162.231694]  [<ffffffff8156cdb4>] skb_release_data+0xc4/0xe0
[ 3162.231761]  [<ffffffff8156cdef>] skb_release_all+0x1f/0x30
[ 3162.231828]  [<ffffffff8156cecd>] consume_skb+0x1d/0x40
[ 3162.231900]  [<ffffffff81586a74>] __dev_kfree_skb_any+0x34/0x40
[ 3162.231974]  [<ffffffffa0bd00d0>] 
ixgbevf_unmap_and_free_tx_resource.isra.46+0x20/0x80 [ixgbevf]
[ 3162.232083]  [<ffffffffa0bd016c>] ixgbevf_clean_tx_ring+0x3c/0x80 [ixgbevf]
[ 3162.232156]  [<ffffffffa0bd418e>] ixgbevf_down+0x2be/0x330 [ixgbevf]
[ 3162.232226]  [<ffffffffa0bd5332>] ixgbevf_close+0x22/0xa0 [ixgbevf]
[ 3162.232299]  [<ffffffff81581240>] __dev_close_many+0x90/0xe0
[ 3162.232378]  [<ffffffff815813be>] __dev_close+0x2e/0x50
[ 3162.232448]  [<ffffffff8158cee8>] __dev_change_flags+0x98/0x160
[ 3162.232517]  [<ffffffff8158cfd4>] dev_change_flags+0x24/0x60
[ 3162.232590]  [<ffffffff8161f534>] devinet_ioctl+0x834/0x8f0
[ 3162.232661]  [<ffffffff8162077b>] inet_ioctl+0x4b/0x70
[ 3162.232739]  [<ffffffff8155e6c0>] sock_do_ioctl+0x20/0x50
[ 3162.232804]  [<ffffffff8155e8d0>] sock_ioctl+0x1e0/0x290
[ 3162.232876]  [<ffffffff811da630>] do_vfs_ioctl+0x430/0x7f0
[ 3162.232939]  [<ffffffff811daa64>] SyS_ioctl+0x74/0x80
[ 3162.233009]  [<ffffffff816dd83a>] entry_SYSCALL_64_fastpath+0x16/0x7e
[ 3162.233129] ---[ end trace 2e4a237dee6f3318 ]---
[ 3162.357009] xen:events: domain 45 does not have 103 anymore
[ 3162.357099] xen:events: domain 45 does not have 102 anymore
...
[xenwatch unblocked]
...
[ 3162.394227] net vif50.0: stopping vif50.0-q0-dealloc task (pid 29413)
[ 3162.667880] net vif42.0: stopping vif42.0-q0-dealloc task (pid 27272)
[ 3162.705464] net vif48.0: stopping vif48.0-q0-dealloc task (pid 28779)

It's not clear to me whether the problem lies in netback, ixgbevf, or
somewhere in between. Is the root cause ixgbevf hanging onto a skb for
so long, and doing nothing with it until I bring the interface down,
or is that a symptom of some other problem? Or is netback supposed to
somehow flush in-flight transmit packets before it gets as far as
xenvif_disconnect()?  Or should it forget about the in-flight packets
since the interface is disappearing anyway?

Any clues would be appreciated.

--Ed

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.