[Xen-devel] The strange case of xen_netback not returning ARP replies
Hello,

I'm facing a rather strange problem with the netback interface. My setup involves a netvm, which has some physical network interfaces assigned, and a client VM where a netfront is running (exposed as eth0) and which is connected to that netvm (via the vif42.0 interface, as seen in the netvm in the dumps below).

The netvm has two physical network interfaces assigned:

1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is just an assigned PCI device.
2) A USB 3G modem (cdc_ncm module, interface usb0) -- this has been made available to the netvm by assigning the whole USB controller to which the 3G modem is connected. This works fine.

We do NAT in the netvm for the traffic coming in on vif* and send it out through the default outgoing interface, e.g. wlan0.

As long as I use wlan0 for networking, all works great. I've been using this setup for years, no problem here. However, when I switch to usb0 as the default outgoing interface in the netvm, something strange happens. Networking works fine via usb0 for some time (typically a few minutes), yet suddenly, after enough packets have been exchanged, it stops working.
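To confirm which interface actually carries the default route in the netvm at a given moment (wlan0 vs. usb0), something like the following could be used. This is only a minimal sketch, assuming the usual Linux /proc/net/route layout (whitespace-separated columns, hex Destination/Flags/Mask, decimal Metric). The function name and the sample table are hypothetical and are not taken from the actual netvm.

```python
RTF_UP = 0x1  # route is usable (flag value from the Linux routing table)


def default_route_interface(route_table_text):
    """Return the interface of the lowest-metric default route, or None.

    Expects text in /proc/net/route format: a header line followed by
    rows of Iface Destination Gateway Flags RefCnt Use Metric Mask ...
    """
    best = None  # (metric, iface) of the best default route seen so far
    lines = route_table_text.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 8:
            continue  # malformed or truncated row
        iface, dest, flags, metric, mask = (
            fields[0], fields[1], fields[3], fields[6], fields[7])
        is_up = int(flags, 16) & RTF_UP
        # A default route has an all-zero destination and netmask.
        if dest == "00000000" and mask == "00000000" and is_up:
            candidate = (int(metric), iface)
            if best is None or candidate < best:
                best = candidate
    return best[1] if best else None


# Hypothetical sample table, for illustration only.
SAMPLE = """Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
usb0 00000000 0101A8C0 0003 0 0 0 00000000 0 0 0
vif42.0 0501890A 00000000 0005 0 0 32747 FFFFFFFF 0 0 0
"""

if __name__ == "__main__":
    with open("/proc/net/route") as f:
        print(default_route_interface(f.read()))
```

Run inside the netvm, this prints the name of the current default outgoing interface, which makes it easy to record whether the hang coincides with usb0 being the default.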

When I run tcpdump on the vif* interface, I can see that suddenly there is nobody (in the netvm) replying to the ARP requests from the client VM (the client VM has Xen ID = 42 in this dump, IP = .5, and gateway = .1):

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:59.287308 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:00.283853 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:01.283816 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:03.231324 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length ...

and this now continues without end.

For comparison, this is how it looks when I use networking via wlan0:

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28

We can see that every once in a while an ARP request for 10.137.1.1 appears (the gateway for the client VM, i.e. the netvm), and it is immediately answered (by the netvm, as I understand it).

Now, this behavior seems really strange, because:

1) AFAIU, the ARP replies are/should be generated by the netback interface in the netvm (vif*).
2) It shouldn't matter to the netback code how the packets are later routed (via wlan0 vs. usb0) in order to provide this (dummy) ARP response.
3) ...yet, for some reason, when packets are later routed through usb0, the netback is not willing to generate the ARP response???

Or am I misunderstanding this, and it is somebody else who is generating the ARP responses? The final NIC?

Some additional notes:

1) We make sure to set /proc/sys/net/ipv4/conf/vif*/proxy_arp to 1.
2) When this "ARP hang" happens, networking (via usb0) is still working fine in the netvm itself (i.e. I can do ping google.com from the netvm).
3) This has been tested on various VM kernels (in the netvm): 3.0.4, 3.2.7, and 3.3.5 -- all exhibit the same behavior.
4) Nothing spectacular in the logs of the netvm; however, I can often see this warning in the *client* VM:

[ 1257.228761] ------------[ cut here ]------------
[ 1257.228767] WARNING: at /home/user/qubes-src/kernel/kernel-3.3.5/linux-3.3.5/fs/sysfs/file.c:498 sysfs_attr_ns+0x93/0xa0()
[ 1257.228776] sysfs: kobject eth0 without dirent
[ 1257.228780] Modules linked in: iptable_raw bnep bluetooth rfkill ipt_MASQUERADE ipt_REJECT xt_state xt_tcpudp xen_netback iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables xen_netfront microcode pcspkr u2mfn(O) xen_blkback xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last unloaded: scsi_wait_scan]
[ 1257.228819] Pid: 11, comm: xenwatch Tainted: G W O 3.3.5-1.pvops.qubes.x86_64 #1
[ 1257.228825] Call Trace:
[ 1257.228830] [<ffffffff810495aa>] warn_slowpath_common+0x7a/0xb0
[ 1257.228836] [<ffffffff81049681>] warn_slowpath_fmt+0x41/0x50
[ 1257.228842] [<ffffffff81057ba7>] ? lock_timer_base+0x37/0x70
[ 1257.228850] [<ffffffff811a7433>] sysfs_attr_ns+0x93/0xa0
[ 1257.228856] [<ffffffff811a7aef>] sysfs_remove_file+0x1f/0x40
[ 1257.228862] [<ffffffff812e5622>] device_remove_file+0x12/0x20
[ 1257.228870] [<ffffffffa00faf5a>] xennet_remove+0x84/0xac [xen_netfront]
[ 1257.228875] [<ffffffff812b5c82>] xenbus_dev_remove+0x42/0xa0
[ 1257.228881] [<ffffffff812e85a7>] __device_release_driver+0x77/0xd0
[ 1257.228887] [<ffffffff812e86e8>] device_release_driver+0x28/0x40
[ 1257.228895] [<ffffffff812e790f>] bus_remove_device+0x10f/0x180
[ 1257.228901] [<ffffffff812e5808>] device_del+0x118/0x1c0
[ 1257.228906] [<ffffffff812e58cd>] device_unregister+0x1d/0x60
[ 1257.228914] [<ffffffff812b5a46>] xenbus_dev_changed+0x96/0x1b0
[ 1257.228920] [<ffffffff812b74b4>] frontend_changed+0x24/0x50
[ 1257.228926] [<ffffffff812b4221>] xenwatch_thread+0xb1/0x170
[ 1257.228933] [<ffffffff8106aea0>] ? wake_up_bit+0x40/0x40
[ 1257.228939] [<ffffffff812b4170>] ? xenbus_thread+0x40/0x40
[ 1257.228944] [<ffffffff8106a9a6>] kthread+0x96/0xa0
[ 1257.228951] [<ffffffff81465724>] kernel_thread_helper+0x4/0x10
[ 1257.228959] [<ffffffff8145c7fc>] ? retint_restore_args+0x5/0x6
[ 1257.228964] [<ffffffff81465720>] ? gs_change+0x13/0x13
[ 1257.228968] ---[ end trace 75286ef58ce0391f ]---

But this seems rather irrelevant, as it looks like it is the netvm that is failing here, i.e. it doesn't generate ARP responses.

I would appreciate any help with this issue!

Thanks,
joanna.

Attachment:
signature.asc
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel