
Re: [Xen-devel] [xen-4.9-testing test] 126201: regressions - FAIL



On Wed, Aug 22, 2018 at 04:52:27PM -0600, Jim Fehlig wrote:
> On 08/21/2018 05:14 AM, Jan Beulich wrote:
> > > > > On 21.08.18 at 03:11, <osstest-admin@xxxxxxxxxxxxxx> wrote:
> > > flight 126201 xen-4.9-testing real [real]
> > > http://logs.test-lab.xenproject.org/osstest/logs/126201/
> > > 
> > > Regressions :-(
> > > 
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >   test-amd64-amd64-libvirt-pair 22 guest-migrate/src_host/dst_host fail 
> > > REGR. vs. 124328
> > 
> > Something needs to be done about this, as this continued failure is
> > blocking the 4.9.3 release. I did mail about this on Aug 2nd already
> > for flight 125710, I've got back from Wei:
> > 
> > > This is libvirtd's error message.
> > > 
> > > The remote host can't obtain the state change lock because it is
> > > already held by another task/thread. It could be a libvirt / libxl bug.
> > > 
> > > 2018-08-01 16:12:13.433+0000: 3491: warning : libxlDomainObjBeginJob:151 :
> > > Cannot start job (modify) for domain debian.guest.osstest; current job is 
> > > (modify) owned by (24975)
> 
> I took a closer look at the logs and it appears the finish phase of
> migration fails to acquire the domain job lock since it is already held by
> the perform phase. In the perform phase, after the vm has been transferred
> to the dst, the qemu process associated with the vm is started. For whatever
> reason that takes a long time on this host:
> 
> 2018-08-19 17:05:19.182+0000: libxl: libxl_dm.c:2235:libxl__spawn_local_dm:
> Domain 1:Spawning device-model /usr/local/lib/xen/bin/qemu-system-i386 with
> arguments: ...
> 2018-08-19 17:05:19.188+0000: libxl: libxl_exec.c:398:spawn_watch_event:
> domain 1 device model: spawn watch p=(null)

This is a spurious event after the watch has been set up.

> ...
> 2018-08-19 17:05:51.529+0000: libxl: libxl_event.c:573:watchfd_callback:
> watch w=0x7f84a0047ee8 wpath=/local/domain/0/device-model/1/state token=2/1:
> event epath=/local/domain/0/device-model/1/state
> 2018-08-19 17:05:51.529+0000: libxl: libxl_exec.c:398:spawn_watch_event:
> domain 1 device model: spawn watch p=running

So it has taken 32s for QEMU to write "running" in xenstore. This,
however, is still within the timeout limit set by libxl (60s).
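For reference, the pattern in play here (libxl sets a xenstore watch on the device model's state node and gives up after a fixed timeout) can be sketched roughly as follows. This is a simplified Python illustration, not libxl's actual code; the 60-second limit and the state path mirror the logs above, everything else is hypothetical:

```python
import time

def wait_for_dm_state(read_state, expected="running", timeout=60.0, poll=0.5):
    """Poll a xenstore-like state node until the device model reports
    the expected value, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # e.g. read /local/domain/0/device-model/1/state
        if read_state() == expected:
            return True
        time.sleep(poll)
    return False

# Simulated store: in the failing flight, QEMU only wrote "running"
# after ~32 seconds -- slow, but inside libxl's 60s limit.
store = {"state": "running"}
print(wait_for_dm_state(lambda: store["state"], poll=0.01))  # True
```

(Real libxl uses an event-driven xenstore watch rather than polling, but the timeout semantics are the same: slow QEMU startup is tolerated as long as it beats the deadline.)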

> 
> In the meantime we move to the finish phase and time out waiting for the
> above perform phase to complete:
> 
> 2018-08-19 17:05:19.096+0000: 3492: debug : virThreadJobSet:96 : Thread 3492
> (virNetServerHandleJob) is now running job
> remoteDispatchDomainMigrateFinish3Params
> ...
> 2018-08-19 17:05:49.253+0000: 3492: warning : libxlDomainObjBeginJob:151 :
> Cannot start job (modify) for domain debian.guest.osstest; current job is
> (modify) owned by (24982)
> 2018-08-19 17:05:49.253+0000: 3492: error : libxlDomainObjBeginJob:155 :
> Timed out during operation: cannot acquire state change lock
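The failure mode described here follows libvirt's usual per-domain job pattern: only one modify job may run at a time, and a second caller waits a bounded time for the lock before giving up, as the finish phase does above (roughly 30 seconds in the log, 17:05:19 to 17:05:49). A minimal sketch of that pattern; the names and wait time are illustrative, not libvirt's actual implementation:

```python
import threading

class DomainJob:
    """Sketch of a libvirt-style per-domain job lock: one job at a
    time; a second caller waits up to `wait_secs`, then fails."""
    def __init__(self, wait_secs=30.0):
        self.cond = threading.Condition()
        self.active = None          # name of the job holding the lock
        self.wait_secs = wait_secs

    def begin_job(self, name):
        with self.cond:
            ok = self.cond.wait_for(lambda: self.active is None,
                                    timeout=self.wait_secs)
            if not ok:
                # Analogous to "cannot acquire state change lock"
                raise TimeoutError("cannot acquire state change lock; "
                                   f"current job is ({self.active})")
            self.active = name

    def end_job(self):
        with self.cond:
            self.active = None
            self.cond.notify_all()

job = DomainJob(wait_secs=0.1)      # short wait for demonstration
job.begin_job("perform")            # perform phase holds the job
try:
    job.begin_job("finish")         # finish phase times out, as in the log
except TimeoutError as e:
    print(e)
```

Because QEMU took ~32s to come up while the finish phase only waits ~30s for the job lock, the migration fails even though libxl itself never hit its own timeout.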
> 
> What could be causing the long startup time of qemu on these hosts? Does
> dom0 have enough cpu/memory? As you noticed, the libvirt commit used for
> this test has not changed in a long time, well before the failures appeared.
> Perhaps a subtle change in libxl is exposing the bug?

There have only been two changes to libxl in the range of changesets
being tested.

c257e35a libxl: qemu_disk_scsi_drive_string: Break out common parts of disk 
config
5d92007c libxl: restore passing "readonly=" to qemu for SCSI disks

They wouldn't change how libxl interacts with libvirt. The QEMU tag didn't
change either.

Wei.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
