[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [xen-4.9-testing test] 126201: regressions - FAIL



On 08/21/2018 05:14 AM, Jan Beulich wrote:
On 21.08.18 at 03:11, <osstest-admin@xxxxxxxxxxxxxx> wrote:
flight 126201 xen-4.9-testing real [real]
http://logs.test-lab.xenproject.org/osstest/logs/126201/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
  test-amd64-amd64-libvirt-pair 22 guest-migrate/src_host/dst_host fail REGR. 
vs. 124328

Something needs to be done about this, as this continued failure is
blocking the 4.9.3 release. I did mail about this on Aug 2nd already
for flight 125710, I've got back from Wei:

This is libvirtd's error message.

The remote host can't obtain the state change log due to it is already
held by another task/thread. It could be a libvirt / libxl bug.

2018-08-01 16:12:13.433+0000: 3491: warning : libxlDomainObjBeginJob:151 :
Cannot start job (modify) for domain debian.guest.osstest; current job is 
(modify) owned by (24975)

I took a closer look at the logs and it appears the finish phase of migration fails to acquire the domain job lock since it is already held by the perform phase. In the perform phase, after the vm has been transferred to the dst, the qemu process associated with the vm is started. For whatever reason that takes a long time on this host:

2018-08-19 17:05:19.182+0000: libxl: libxl_dm.c:2235:libxl__spawn_local_dm: Domain 1:Spawning device-model /usr/local/lib/xen/bin/qemu-system-i386 with arguments: ... 2018-08-19 17:05:19.188+0000: libxl: libxl_exec.c:398:spawn_watch_event: domain 1 device model: spawn watch p=(null)
...
2018-08-19 17:05:51.529+0000: libxl: libxl_event.c:573:watchfd_callback: watch w=0x7f84a0047ee8 wpath=/local/domain/0/device-model/1/state token=2/1: event epath=/local/domain/0/device-model/1/state 2018-08-19 17:05:51.529+0000: libxl: libxl_exec.c:398:spawn_watch_event: domain 1 device model: spawn watch p=running

In the meantime we move to the finish phase and timeout waiting for the above perform phase to complete

2018-08-19 17:05:19.096+0000: 3492: debug : virThreadJobSet:96 : Thread 3492 (virNetServerHandleJob) is now running job remoteDispatchDomainMigrateFinish3Params
...
2018-08-19 17:05:49.253+0000: 3492: warning : libxlDomainObjBeginJob:151 : Cannot start job (modify) for domain debian.guest.osstest; current job is (modify) owned by (24982) 2018-08-19 17:05:49.253+0000: 3492: error : libxlDomainObjBeginJob:155 : Timed out during operation: cannot acquire state change lock

What could be causing the long startup time of qemu on these hosts? Does dom0 have enough cpu/memory? As you noticed, the libvirt commit used for this test has not changed in a long time, well before the failures appeared. Perhaps a subtle change in libxl is exposing the bug?

Regardless, I'm happy to have looked at the issue since I think libvirt can be improved to cope with the problem. The thread running in the dst receiving the vm via libxl_domain_create_restore() can be created with joinable flag, then joined in the finish phase before attempting to acquire the job lock. I'll look into making such an improvement in libvirt's libxl driver.

Regards,
Jim

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.