
[Xen-devel] The strange case of xen_netback not returning ARP replies



Hello,

I'm facing a rather strange problem with the netback interface. My setup
involves a netvm, which has some physical network interfaces assigned,
and a client VM where netfront is running (exposed as eth0) and which
is connected to that netvm (via the vif42.0 interface, as seen in the
netvm in the dumps below).

Now, the netvm has two physical network interfaces assigned:
1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is
just an assigned PCI device

2) A USB 3G modem (cdc_ncm module, usb0 interface) -- this has been made
available to the netvm by assigning the whole USB controller to which
the 3G modem is connected. This works fine.

We do NAT in the netvm for the traffic coming in on vif* and send it out
through the default outgoing interface, e.g. wlan0. Now, as long as I
use wlan0 for networking, all works great. I've been using this setup
for years, no problem here.

However, when I switch to usb0 as the default outgoing interface in the
netvm, something strange happens. The networking works fine via usb0 for
some time (a few minutes typically), yet suddenly, after enough packets
have been exchanged, the networking stops working.

When I run tcpdump on the vif* interface I can see that suddenly there
is nobody (in the netvm) to reply to the ARP requests from the client
VM (the client VM has Xen ID = 42 in this dump, IP = .5, and gateway
= .1):

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:59.287308 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:00.283853 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:01.283816 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:03.231324 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length

... and this now continues without end.

For comparison, this is how it looks when I use networking via wlan0:

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28

We can see that every once in a while an ARP request for 10.137.1.1
appears (the gateway for the client VM, i.e. the netvm), and it is
immediately answered (by the netvm, as I understand).
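
To make the difference between the two captures easier to spot in longer
dumps, here is a minimal sketch that pairs ARP requests with replies in
tcpdump text output. It assumes the line format shown in the dumps above;
the function name unanswered_requests and the two sample lists are
illustrative only and not part of any existing tool.

```python
import re

# Line formats as printed by "tcpdump -ni vif42.0 arp" in the dumps above.
REQUEST = re.compile(r"^(\S+) ARP, Request who-has (\S+) tell")
REPLY = re.compile(r"^(\S+) ARP, Reply (\S+) is-at")


def unanswered_requests(lines):
    """Return (timestamp, target_ip) for every request that got no reply
    before the next request for the same target or the end of the capture."""
    pending = {}      # target_ip -> timestamp of the last unanswered request
    unanswered = []
    for line in lines:
        m = REQUEST.match(line)
        if m:
            ts, target = m.groups()
            if target in pending:
                # The previous request for this target was never answered.
                unanswered.append((pending[target], target))
            pending[target] = ts
            continue
        m = REPLY.match(line)
        if m:
            # A reply for this IP clears the outstanding request, if any.
            pending.pop(m.group(2), None)
    unanswered.extend((ts, target) for target, ts in pending.items())
    return unanswered


# Sample lines copied from the two captures above.
WLAN0_CAPTURE = [
    "13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28",
    "13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28",
]
USB0_CAPTURE = [
    "13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
    "13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28",
]

if __name__ == "__main__":
    print("wlan0:", unanswered_requests(WLAN0_CAPTURE))
    print("usb0: ", unanswered_requests(USB0_CAPTURE))
```

On the sample lines, the wlan0 capture has no unanswered requests, while
every request in the usb0 capture is reported as unanswered.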

Now, this behavior seems really strange, because:

1) AFAIU, the ARP replies are/should be generated by the netback
interface in the netvm (vif*).

2) It shouldn't matter, for the netback code, how the packets are later
routed (via wlan0 vs. usb0) to provide this (dummy) arp response?

3) ...yet, for some reason, in the case when packets are later routed
through usb0, the netback is not willing to generate arp response???

Or am I misunderstanding this, and it is somebody else who is generating
the arp responses? The final NIC?

Some additional notes:
1) We make sure to set /proc/sys/net/ipv4/conf/vif*/proxy_arp to 1

2) When this "arp hang" happens, the networking (via usb0) is still
working fine in the netvm (i.e. I can do ping google.com from the netvm)

3) This has been tested on various VM kernels (in the netvm): 3.0.4,
3.2.7, and 3.3.5 -- all exhibit the same behavior.

4) Nothing spectacular in the logs of the netvm; however, I can often
see this warning (with a call trace) in the *client* VM:

[ 1257.228761] ------------[ cut here ]------------
[ 1257.228767] WARNING: at
/home/user/qubes-src/kernel/kernel-3.3.5/linux-3.3.5/fs/sysfs/file.c:498
sysfs_attr_ns+0x93/0xa0()
[ 1257.228776] sysfs: kobject eth0 without dirent
[ 1257.228780] Modules linked in: iptable_raw bnep bluetooth rfkill
ipt_MASQUERADE ipt_REJECT xt_state xt_tcpudp xen_netback iptable_filter
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
ip_tables x_tables xen_netfront microcode pcspkr u2mfn(O) xen_blkback
xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last
unloaded: scsi_wait_scan]
[ 1257.228819] Pid: 11, comm: xenwatch Tainted: G        W  O
3.3.5-1.pvops.qubes.x86_64 #1
[ 1257.228825] Call Trace:
[ 1257.228830]  [<ffffffff810495aa>] warn_slowpath_common+0x7a/0xb0
[ 1257.228836]  [<ffffffff81049681>] warn_slowpath_fmt+0x41/0x50
[ 1257.228842]  [<ffffffff81057ba7>] ? lock_timer_base+0x37/0x70
[ 1257.228850]  [<ffffffff811a7433>] sysfs_attr_ns+0x93/0xa0
[ 1257.228856]  [<ffffffff811a7aef>] sysfs_remove_file+0x1f/0x40
[ 1257.228862]  [<ffffffff812e5622>] device_remove_file+0x12/0x20
[ 1257.228870]  [<ffffffffa00faf5a>] xennet_remove+0x84/0xac [xen_netfront]
[ 1257.228875]  [<ffffffff812b5c82>] xenbus_dev_remove+0x42/0xa0
[ 1257.228881]  [<ffffffff812e85a7>] __device_release_driver+0x77/0xd0
[ 1257.228887]  [<ffffffff812e86e8>] device_release_driver+0x28/0x40
[ 1257.228895]  [<ffffffff812e790f>] bus_remove_device+0x10f/0x180
[ 1257.228901]  [<ffffffff812e5808>] device_del+0x118/0x1c0
[ 1257.228906]  [<ffffffff812e58cd>] device_unregister+0x1d/0x60
[ 1257.228914]  [<ffffffff812b5a46>] xenbus_dev_changed+0x96/0x1b0
[ 1257.228920]  [<ffffffff812b74b4>] frontend_changed+0x24/0x50
[ 1257.228926]  [<ffffffff812b4221>] xenwatch_thread+0xb1/0x170
[ 1257.228933]  [<ffffffff8106aea0>] ? wake_up_bit+0x40/0x40
[ 1257.228939]  [<ffffffff812b4170>] ? xenbus_thread+0x40/0x40
[ 1257.228944]  [<ffffffff8106a9a6>] kthread+0x96/0xa0
[ 1257.228951]  [<ffffffff81465724>] kernel_thread_helper+0x4/0x10
[ 1257.228959]  [<ffffffff8145c7fc>] ? retint_restore_args+0x5/0x6
[ 1257.228964]  [<ffffffff81465720>] ? gs_change+0x13/0x13
[ 1257.228968] ---[ end trace 75286ef58ce0391f ]---

But this seems rather irrelevant, as it seems like it is the netvm that
is failing here, i.e. it doesn't generate ARP responses?
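
Regarding note 1 above, one way to confirm that proxy_arp is still set
on every vif* interface at the moment the hang occurs is a small check
like the following. This is only a sketch: the helper name
proxy_arp_settings and its parameters are made up for illustration, and
it simply reads the same /proc/sys/net/ipv4/conf/vif*/proxy_arp files
mentioned in that note.

```python
import glob
import os


def proxy_arp_settings(pattern="vif*", conf_root="/proc/sys/net/ipv4/conf"):
    """Map interface name -> proxy_arp value for interfaces matching pattern.

    conf_root is a parameter only so the function can be pointed at a
    different directory tree; by default it reads the live sysctl files.
    """
    settings = {}
    for path in sorted(glob.glob(os.path.join(conf_root, pattern, "proxy_arp"))):
        iface = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            settings[iface] = int(f.read().strip())
    return settings


if __name__ == "__main__":
    # Run inside the netvm; prints one line per vif* interface found.
    for iface, value in proxy_arp_settings().items():
        print(f"{iface}: proxy_arp={value}")
```

If any interface reports 0 while the client VM is still sending ARP
requests, that would point at the proxy_arp setting rather than at
netback itself.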

I would appreciate any help with this issue!

Thanks,
joanna.
