[Xen-users] Strange ARP problem in a bridged config

I'm having an odd issue that I think is related to ARP, and I'm hoping that
someone can help me figure out why it's happening...

I am running nagios, monitoring a number of Xen hosts.  Yesterday, I
rebooted several of the machines (the physical hosts, not virtual machines).
Since then, nagios has intermittently reported those hosts as down because
pings to them fail.

Pinging the hosts manually confirms this: the pings fail intermittently.

The problem occurs only on servers that are on the local subnet; servers on
another subnet never lose connectivity.

Checking the ARP table on the nagios server, I discovered that the machines
that were reported as down had entries like the following:

xenhost9  ether   FE:FF:FF:FF:FF:FF   C                     eth0

When the machines become available again, the entry changes to look like
this:

xenhost9  ether   00:E0:81:40:2A:AE   C                     eth0

So, it appears that the nagios server (and, on at least one occasion,
another server on my network) is picking up a MAC address that is not that
of the physical interface on the xenhost.
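
A rough sketch for spotting such entries quickly: it filters `arp`-style
output for the fe:ff:ff:ff:ff:ff address, which the `ip addr` listings in this
post show on peth0, the vif devices, and xenbr0. The function name
find_placeholder_macs is made up for illustration, and the sketch assumes the
column layout shown above (name, hardware type, MAC, flags, interface).

```shell
# Print the name of every ARP entry whose cached MAC is fe:ff:ff:ff:ff:ff.
# Reads arp-style output on stdin, so it also works on saved output.
# Assumes the MAC is the third whitespace-separated column.
find_placeholder_macs() {
    awk 'toupper($3) == "FE:FF:FF:FF:FF:FF" { print $1 }'
}

# Example, run on the monitoring server:
#   arp | find_placeholder_macs
```

Run from cron or by hand on the nagios server, this would show which hosts
are currently cached with the wrong address.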

Taking a look at the xenhost at a time when nagios was reporting that it was
down, I found these entries in its ARP table:

nagios  ether   00:16:3E:0C:DC:AC   C                     xenbr0
nagios  ether   00:16:3E:0C:DC:AC   C                     eth0

I deleted the entry on xenbr0 by doing `arp -i xenbr0 -d nagios`, and
immediately nagios was able to ping the host again.
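
To check other hosts for the same symptom, a small filter like the following
could list any name that is cached on more than one interface, as nagios is
above on both xenbr0 and eth0. This is only a sketch: duplicate_arp_hosts is a
hypothetical helper name, and it assumes the same `arp` column layout as the
entries shown in this post.

```shell
# List hosts that appear in the ARP table on more than one interface.
# Reads arp-style output on stdin; the interface is taken from the last column.
duplicate_arp_hosts() {
    awk '$2 == "ether" { count[$1]++; ifaces[$1] = ifaces[$1] " " $NF }
         END { for (h in count) if (count[h] > 1) print h ":" ifaces[h] }'
}

# Example, run on a Xen host:
#   arp | duplicate_arp_hosts
# A stale entry can then be removed as described above:
#   arp -i xenbr0 -d nagios
```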

So, something is a little wonky here, but I don't know what...

To make things stranger, I have a number of machines that are all running
the same configuration.  Only the machines that were rebooted yesterday
morning are showing this issue.

The configuration that I'm working with is:
- openSUSE 10.3
- Xen 3.1.0_15042-51.3 installed from opensuse-packaged RPMs
- Two bridges (xenbr0 and xenbr1), created with a custom network-script that
  runs:
      /etc/xen/scripts/network-bridge start vifnum=0 bridge=xenbr0 netdev=eth0 &&
      /etc/xen/scripts/network-bridge start vifnum=1 bridge=xenbr1 netdev=eth1
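
Because the two network-bridge calls are chained with `&&`, the second bridge
is never set up if the first call exits non-zero, and nothing reports which
step failed. A minimal sketch of a wrapper that makes this visible is below.
The NETWORK_BRIDGE variable and the start_bridges function are hypothetical
names; the script path and arguments are the ones quoted above.

```shell
# Path as given in the post; can be overridden, e.g. for a dry run.
NETWORK_BRIDGE=${NETWORK_BRIDGE:-/etc/xen/scripts/network-bridge}

# Bring up both bridges and report which step failed, instead of
# silently skipping the second bridge when the first call errors out.
start_bridges() {
    "$NETWORK_BRIDGE" start vifnum=0 bridge=xenbr0 netdev=eth0 || {
        echo "xenbr0 setup failed" >&2; return 1; }
    "$NETWORK_BRIDGE" start vifnum=1 bridge=xenbr1 netdev=eth1 || {
        echo "xenbr1 setup failed" >&2; return 2; }
}
```

Calling start_bridges from the custom network-script and logging its exit
status would show whether both bridges were actually configured at boot.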

On a machine that is having this problem, `ip addr` shows this:
[10:25:13] marlier@xenhost9:~> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: peth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:e0:81:40:2a:af brd ff:ff:ff:ff:ff:ff
    inet 192.168.xx.229/24 brd 192.168.xx.255 scope global eth1
    inet6 fe80::2e0:81ff:fe40:2aaf/64 scope link
       valid_lft forever preferred_lft forever
4: vif0.0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
5: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:e0:81:40:2a:ae brd ff:ff:ff:ff:ff:ff
    inet 192.168.xx.80/24 brd 192.168.xx.255 scope global eth0
    inet 192.168.xx.229/24 brd 192.168.xx.255 scope global eth0:2
    inet6 fe80::2e0:81ff:fe40:2aae/64 scope link
       valid_lft forever preferred_lft forever
6: vif0.1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
7: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
8: vif0.2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
9: veth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
10: vif0.3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
11: veth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
12: xenbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::200:ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
13: xenbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
14: vif1.0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 32
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
[10:25:16] marlier@xenhost9:~>


On another machine that is _not_ having this issue (and which was not
rebooted yesterday), and that also has an identical configuration in terms
of scripts, versions, base OS, and so on, "ip addr" shows this:
[10:33:02] marlier@xenhost2:~> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: peth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
3: peth1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
4: vif0.0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
5: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:e0:81:45:82:bc brd ff:ff:ff:ff:ff:ff
    inet 192.168.xx.86/24 brd 192.168.xx.255 scope global eth0
    inet 192.168.xx.222/24 brd 192.168.xx.255 scope global eth0:2
    inet6 fe80::2e0:81ff:fe45:82bc/64 scope link
       valid_lft forever preferred_lft forever
6: vif0.1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fcff:ffff:feff:ffff/64 scope link
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 00:e0:81:45:82:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.xx.222/24 brd 192.168.xx.255 scope global eth1
    inet6 fe80::2e0:81ff:fe45:82bd/64 scope link
       valid_lft forever preferred_lft forever
8: vif0.2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
9: veth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
10: vif0.3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
11: veth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
14: xenbr0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
15: xenbr1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
[10:33:08] marlier@xenhost2:~>
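
The two listings are long, so it may be easier to compare them after reducing
each to one line per interface. The sketch below keeps only the interface name
and its flags from `ip addr` output; iface_flags is a hypothetical helper name,
and it assumes the output format shown above.

```shell
# Reduce "ip addr" output to "<interface> <flags>" lines, one per interface.
# Reads the listing on stdin; address lines (which are indented) are skipped.
iface_flags() {
    awk '/^[0-9]+: / { name = $2; sub(/:$/, "", name); print name, $3 }'
}

# Example: save a summary on each host, copy the files to one place, then diff.
#   ip addr | iface_flags | sort > /tmp/$(hostname).flags
#   diff /tmp/xenhost2.flags /tmp/xenhost9.flags
```

On the listings in this post, such a diff would show the flag differences on
xenbr0 and xenbr1, and that peth1 exists only on xenhost2.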

I notice that xenbr0 and xenbr1 carry the NOARP flag on the working machine
but not on the broken one, and I wonder if that might be the difference.  But
the two machines are using the same scripts to create the bridges, so why
would they result in different configurations?  And if that is the issue, is
there a way to force the bridges to be created with the NOARP flag?
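
One way to test the NOARP hypothesis by hand: iproute2's `ip link set dev
<name> arp off` sets the NOARP flag on an interface. The sketch below only
prints that command for each xenbr* bridge whose flags lack NOARP; it does not
run anything. The helper name noarp_commands is made up for illustration, and
whether setting the flag actually cures the stale ARP entries is an untested
assumption.

```shell
# For each xenbr* interface in "ip addr" output that lacks the NOARP flag,
# print (but do not execute) the command that would set it.
noarp_commands() {
    awk '/^[0-9]+: xenbr/ && $3 !~ /NOARP/ {
             name = $2; sub(/:$/, "", name)
             print "ip link set dev " name " arp off"
         }'
}

# Example: review the output first, then run the printed commands as root.
#   ip addr | noarp_commands
```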


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users