[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: possible kernel/libxl race with xl network-attach



On Mon, Jan 24, 2022 at 10:07:54AM +0100, Roger Pau Monné wrote:
> On Fri, Jan 21, 2022 at 03:05:07PM +0000, James Dingwall wrote:
> > On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> > > On Fri, Jan 21, 2022 at 01:34:54PM +0000, James Dingwall wrote:
> > > > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > > > On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> > > > > > 
> > > > > > I have been trying to debug a problem where a vif with the backend
> > > > > > in a
> > > > > > driver domain is added to dom0.  Intermittently the hotplug script 
> > > > > > is
> > > > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > > > enabling some debug for the driver domain kernel and libxl I have
> > > > > > these
> > > > > > messages:
> > > > > > 
> > > > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > > > 
> > > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > > > Successfully created xenvif
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Initialising
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Connected
> > > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > > > 
> > > > > > xl devd (Xen 4.14.3):
> > > > > > 
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > > > token=2/1: deregister slotnum=2
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:1039:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1092:device_backend_callback: calling
> > > > > > device_backend_cleanup
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > > > with path /local/domain/2/backend/vif/0/0
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569280: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1470:device_complete: device
> > > > > > /local/domain/2/backend/vif/0/0 add failed
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > > > > 
> > > > > > the xenstore content for the backend:
> > > > > > 
> > > > > > # xenstore-ls /local/domain/2/backend/vif/0
> > > > > > 0 = ""
> > > > > >  frontend = "/local/domain/0/device/vif/0"
> > > > > >  frontend-id = "0"
> > > > > >  online = "1"
> > > > > >  state = "4"
> > > > > >  script = "/etc/xen/scripts/vif-zynstra"
> > > > > >  vifname = "dom0.0"
> > > > > >  mac = "00:16:3e:6c:de:82"
> > > > > >  bridge = "cluster"
> > > > > >  handle = "0"
> > > > > >  type = "vif"
> > > > > >  feature-sg = "1"
> > > > > >  feature-gso-tcpv4 = "1"
> > > > > >  feature-gso-tcpv6 = "1"
> > > > > >  feature-ipv6-csum-offload = "1"
> > > > > >  feature-rx-copy = "1"
> > > > > >  feature-rx-flip = "0"
> > > > > >  feature-multicast-control = "1"
> > > > > >  feature-dynamic-multicast-control = "1"
> > > > > >  feature-split-event-channels = "1"
> > > > > >  multi-queue-max-queues = "2"
> > > > > >  feature-ctrl-ring = "1"
> > > > > >  hotplug-status = "connected"
> > > 
> > > The relevant point here is that `hotplug-status = "connected"` in the
> > > backend xenstore nodes, and that's set by the hotplug script.
> > > 
> > > Having hotplug-status == "connected" will allow the backend to proceed
> > > to the connected state, so there's some component in your system that
> > > sets this xenstore node before `xl devd` get a change to run the
> > > hotplug script, hence me asking for any udev rules in the previous
> > > email.
> > > 
> > > If it's not an udev rule then I'm lost. Do you have any modifications
> > > to the Xen tools that could set this xenstore node?
> > 
> > I am wondering if that xenstore content was captured after the environment
> > had been manually fixed.  I have been able to reproduce it by hotplugging
> > an interface where libxl isn't patched:
> > 
> > 
> > dom0# xl network-attach 0 backend=netdd script=vif-zynstra vifname=dom0.2 
> > bridge=abridge
> > netdd# xenstore-ls /local/domain/2/backend/vif/0/2
> > frontend = "/local/domain/0/device/vif/2"
> > frontend-id = "0"
> > online = "1"
> > state = "4"
> > script = "/etc/xen/scripts/vif-zynstra"
> > vifname = "dom0.2"
> > mac = "00:16:3e:5f:fc:51"
> > bridge = "abridge"
> > handle = "2"
> > type = "vif"
> > feature-sg = "1"
> > feature-gso-tcpv4 = "1"
> > feature-gso-tcpv6 = "1"
> > feature-ipv6-csum-offload = "1"
> > feature-rx-copy = "1"
> > feature-rx-flip = "0"
> > feature-multicast-control = "1"
> > feature-dynamic-multicast-control = "1"
> > feature-split-event-channels = "1"
> > multi-queue-max-queues = "2"
> > feature-ctrl-ring = "1"
> > 
> > If I have understood the backend kernel code it only waits for the
> > hotplug-status == "connected" if they key exists which it doesn't
> > appear to by default.
> 
> Indeed. I have to admit this is quite weird. I have the following
> completely untested patch, could you give it a try?
> 
> Adding netback maintainers for feedback on whether it's fine for libxl
> to force netback to wait for hotplug script execution. Not sure the
> reason why netback doesn't do that by default, but it seems it's been
> like that since the module was imported into Linux in 2011.
> 
> Thanks, Roger.
> ---
> diff --git a/tools/libs/light/libxl_nic.c b/tools/libs/light/libxl_nic.c
> index 0b45469dca..0b9e70c9d1 100644
> --- a/tools/libs/light/libxl_nic.c
> +++ b/tools/libs/light/libxl_nic.c
> @@ -248,6 +248,13 @@ static int libxl__set_xenstore_nic(libxl__gc *gc, 
> uint32_t domid,
>      flexarray_append(ro_front, "mtu");
>      flexarray_append(ro_front, GCSPRINTF("%u", nic->mtu));
>  
> +    /*
> +     * Force backend to wait for hotplug script execution before switching to
> +     * connected state.
> +     */
> +    flexarray_append(back, "hotplug-status");
> +    flexarray_append(back, "");
> +
>      return 0;
>  }
>  

I rebuilt libxl with this patch and installed it (without reboot) on the
dom0 of the system I was testing with last week.  With a 'for' loop to
add 15 interfaces I didn't see the same problem so initialising xenstore
with the empty hotplug-status key fixes this problem.  I'll add it to
some more of our systems to give it some wider testing.

Thanks,
James



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.