
Re: [Xen-devel] [Wg-test-framework] osstest Massachusetts test lab resource usage investigation



Ian Jackson writes ("Re: [Wg-test-framework] osstest Massachusetts test lab resource usage investigation"):
> I have now completed the investigation into db queries.

I've been looking at the bisector too.  I captured a history snapshot
containing two bisections chosen arbitrarily, and wrote some scripts
which I used to analyse the snapshot.
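
The scripts themselves will follow in a separate mail.  To show the
shape of the analysis, here is a minimal Python sketch (not the actual
scripts) of the kind of interval bucketing behind the tables below:
each bucket is named by a pair of patterns "START | END", and
accumulates the seconds between a log event matching START and the
next event matching END.  The event tuples and pattern list are
assumptions for illustration only.

  # Minimal sketch only, not the real analysis scripts.  Input is
  # assumed to be (unix_time, event_text) tuples in time order.
  import re
  from collections import defaultdict

  def bucket_intervals(events, patterns):
      """patterns: list of (start_re, end_re) regexp strings."""
      totals = defaultdict(int)
      open_at = {}                     # pattern index -> start time
      for t, text in events:
          for i, (start_re, end_re) in enumerate(patterns):
              if i not in open_at and re.search(start_re, text):
                  open_at[i] = t
              elif i in open_at and re.search(end_re, text):
                  totals[start_re + " | " + end_re] += t - open_at.pop(i)
      return totals

  def report(totals, elapsed):
      # 'elapsed' is the overall wall clock span, so the buckets need
      # not sum to 100%.
      for name, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
          print("%6.2f%% %10d %s" % (100.0 * secs / elapsed, secs, name))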


Xen 4.5 build failure around the 20th of September.

The bisector completed bisections of both the amd64 and i386 build
failures, in that order.  Here is the top time spent for amd64:

  68.69%      41208 flight - step start build hosts-allocate | flight - step finish build
  10.96%       6577 flight - step start build host-install(3) | flight - step finish build
  10.53%       6316 flight - step start build .*-build | flight - step finish build
   5.42%       3249 flight - step start build host-(?:install|build-prep) | flight - step finish build
   0.76%        453 email - testing | flight - job start

And for i386:

  49.22%       8976 flight - step start build hosts-allocate | flight - step finish build
  19.51%       3557 flight - step start build .*-build | flight - step finish build
  13.10%       2389 flight - step start build host-install(3) | flight - step finish build
   5.49%       1002 flight - step start build host-(?:install|build-prep) | flight - step finish build
   2.24%        409 flight - flight ending | mtime - transcript

Overall it spent 50-70% of the elapsed time waiting for a slot on a
build machine, and then 16-20% of the elapsed time reinstalling the
build machine.

I think this could be improved by providing one or more hosts which
were dedicated to building.  They would not need reinstalling so
often, and would often be idle.

Looking at the logs, there is a particularly long delay (15ks, 4h12m)
before the first repro job completes.  I think this is probably
because each bisection job runs with the start time priority of the
first one, so that the first job has to wait behind (roughly) the
whole of the existing queue.  This is done deliberately to avoid
trying to bisect things which are fixed quickly.  Given the small
proportion of our resources being used for bisections, we may want to
reconsider that.
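
As a toy illustration of that priority behaviour (this is not
osstest's real scheduler, and the job names and timings below are
made up): hosts are handed out in order of a start-time priority, the
first repro job queues with its real submission time, and later
iterations reuse that same timestamp, so only the first iteration
pays the full queueing delay.

  # Toy model only -- not osstest's actual allocator.
  import heapq

  def allocation_order(jobs):
      """jobs: (priority_time, seq, name) tuples; returns service order."""
      heap = list(jobs)
      heapq.heapify(heap)
      return [heapq.heappop(heap)[2] for _ in range(len(jobs))]

  # Hypothetical queue: ordinary flights already waiting (t=0..4), the
  # bisection's first job submitted at t=5, a later ordinary flight at
  # t=6, and the bisection's second iteration generated later still
  # but keeping the inherited priority of 5.
  queue = [(t, t, "regular-flight-%d" % t) for t in range(5)]
  queue += [(5, 5, "bisect-repro"),
            (6, 6, "regular-flight-5"),
            (5, 7, "bisect-iteration-2")]
  print(allocation_order(queue))
  # -> the backlog first, then bisect-repro, then bisect-iteration-2
  #    ahead of regular-flight-5: only the first job waits in line.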


Xen 4.5 guest migration failure around the 22nd of September.
(test-amd64-amd64-xl-qemuu-winxpsp3, step guest-localmigrate/x10.)

The bisector first ran for a different failure,
test-amd64-amd64-xl-qemuu-ovmf-amd64, and determined that that one
was unreproducible.

I have analysed data only up to Thu, 22 Sep 2016 07:59:02 GMT (since
that was in my collection snapshot).

Counting the whole period from the failure of the main flight to the
end of the snapshot recording, we have the following elapsed times:

  35.05%      23613 flight - step start build hosts-allocate | flight - step finish build
  18.85%      12698 flight - step start test hosts-allocate | flight - step finish test
  14.79%       9965 flight - step start test windows-install | flight - step finish test

These figures are disproportionately biased towards the initial
startup host allocation delay (see above), since this is not a
finished bisection.

Counting only the period after the ovmf bisection was abandoned,

  37.11%       9965 flight - step start test windows-install | flight - step finish test
  10.97%       2946 flight - step start test hosts-allocate | flight - step finish test
   8.13%       2183 flight - step start test host-install(3) | flight - step finish test
   7.05%       1892 flight - step start test (?!capture|host-install|hosts-allocate|windows-install).* | flight - step finish test
   6.32%       1697 flight - step start build .*-build | flight - step finish build
   6.15%       1650 flight - step start build host-install(3) | flight - step finish build
   5.47%       1470 mtime - bisection-report | mtime - transcript
   4.54%       1220 crlog - begin | email - testing
   3.42%        918 flight - step start build host-(?:install|build-prep) | flight - step finish build
   1.75%        470 flight - job finish | flight - step finish build

Looking at the logs, each iteration takes about 1 hour.

This bisection involves a much longer iteration for each step, because
the test involves a Windows install.  So the host allocation delay
here is a much smaller proportion, even though the bisector needs to
get exactly the right host.

11% is not that much here, but a faster test would make this look
worse.  I have a half-baked idea to allow an in-progress bisection to
reserve its test host.  I think that this would be worth pursuing,
although there's a fair amount of complication involved.
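
For concreteness, here is a very rough sketch of the shape that idea
might take (purely hypothetical; none of these names exist in
osstest): the bisector takes a lease on the host it used for the
basis repro, renews it after each iteration, and the allocator
refuses the host to other requesters while the lease is live.

  # Hypothetical sketch of the host-reservation idea; not real
  # osstest code.
  import time

  class HostLease:
      def __init__(self, host, holder, ttl=3 * 3600):
          self.host, self.holder, self.ttl = host, holder, ttl
          self.expires = time.time() + ttl

      def renew(self):
          # Called by the bisector at the end of each iteration.
          self.expires = time.time() + self.ttl

      def valid(self):
          return time.time() < self.expires

  def may_allocate(host, leases, requester):
      """Allocator check: only the lease holder may take a leased host."""
      lease = leases.get(host)
      return not (lease and lease.valid() and lease.holder != requester)

The complication mentioned above presumably includes questions such
as how long a lease should last and what should happen if the
bisection stalls.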

38% of the wall clock time was spent doing a Windows install.  The
test does a fresh install each time, rather than saving a VM image and
reusing it.

In principle it might be possible to use saved VM images.  We do fresh
installs because installs are a good exercise of a variety of
functionality, and because that avoids having to maintain and
comprehend a library of VMs.  In particular: if we were maintaining a
library of VMs, they would have to be updated occasionally (when,
exactly?), and
problems which arose due to changes in the VM library would be
obscure.  I don't think changes in this area are particularly easy.

Windows installs are a pathological case because they are so slow.
Most guest installs done by osstest are much faster.  And when there
seem to be multiple regressions, osstest chooses to work first on the
one whose test is shortest - hence picking a Debian install on OVMF
first, here.

About 15% of the time (depending how you count it) seems to be going
on bookkeeping of one kind or another, including the archaeology
necessary to decide what the next revision to test is.  In a faster
test this would be quite a large proportion of the elapsed time.  My
other reports, particularly the one on database transaction
performance, contain some suggestions on how this might be improved.

In general I think the database concurrency issues I discussed in my
email
  Subject: Re: [Wg-test-framework] osstest Massachusetts test lab resource usage investigation
  Date: Tue, 30 Aug 2016 11:53:18 +0100
will help with this.
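
For what it's worth, the "archaeology" step amounts to deciding which
revision to test next from the results gathered so far.  The real
bisector has to cope with revisions from several trees at once; the
simplified, single-linear-history version looks roughly like this
sketch (illustrative only):

  # Simplified illustration; the real bisector handles combinations
  # of revisions across multiple trees.
  def next_revision(history, results):
      """history: revisions oldest to newest; results: rev -> 'pass'/'fail'."""
      last_pass = max((i for i, r in enumerate(history)
                       if results.get(r) == 'pass'), default=None)
      first_fail = min((i for i, r in enumerate(history)
                        if results.get(r) == 'fail'), default=None)
      if last_pass is None or first_fail is None:
          return None                  # basis pass/fail still needed
      if first_fail - last_pass <= 1:
          return None                  # bisection complete
      return history[(last_pass + first_fail) // 2]

  # With a pass at 'a' and a failure at 'e', test the middle revision:
  print(next_revision(list('abcde'), {'a': 'pass', 'e': 'fail'}))  # 'c'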


I expect many developers will think that osstest's bisector is
spending far too much time on setup, during each bisection iteration.
It's always wiping a host, reinstalling with the relevant Xen, and
(if the failure is not before then) reinstalling the guest OS.

But of course a human is likely to be able to tell whether a
particular issue could have been the result of (for example)
corruption which occurred during the install phase; the automated
bisector cannot make that judgement.  It's also possible for bugs
even to cause disk corruption on the host.

To avoid giving wrong answers it seems best to me for the osstest
bisector to use a strategy which is somewhat slower but which is sure
not to be misled.


I will send my scripts as a followup to this email.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

