
Re: [Xen-devel] osstest commits and Xen releases



Juergen Gross writes ("OSStest commits and Xen releases"):
> I have found an alarming tendency regarding changes in the OSStest
> repository: over the last 2 years (or 3 Xen versions) there has been
> a pattern of OSStest commits being more frequent during the RC phase
> of a Xen release. On average there were about 4 commits to osstest.git
> per week. The numbers were significantly higher during RC-phases:
> 
> Version   RC-phase                 OSStest commits per week
> 4.12      2019/01/16 -             19
> 4.11      2018/04/17 - 2018/07/09  10
> 4.10      2017/10/16 - 2017/12/13  6
> 
> I have looked at this as I would have liked to cut 4.12-RC2 this
> Monday, but the OSStest flights for xen-unstable failed over the
> weekend. Ian suspected a change in OSStest was to blame (this still
> needs to be verified).
> 
> As the release manager I don't like RCs being delayed due to changes
> in our infrastructure. For Xen we have a code freeze, and patches
> wanting to go in need the release manager's ack. Shouldn't the same
> apply to OSStest?
> 
> I like OSStest very much as it helps catch bugs early. But I believe
> the main development should not be done at the time when we need its
> results to be most reliable.
> 
> Thoughts?


Thanks for raising this.  I have three lines of response.


Firstly, in the most general case: I think you have a point.

(I think this effect is probably due to the unblocking of changes
which had been starved of effort because of the impending Xen freeze,
but I would have to do a full chart to be sure.)
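
For what it is worth, the counting part of such a chart is
straightforward.  A minimal sketch in Python (the repository path and
start date below are placeholders, not what produced the numbers
quoted above):

    #!/usr/bin/env python3
    # Count osstest.git commits per ISO week; plotting these against
    # the Xen RC windows would give the chart in question.
    import subprocess
    from collections import Counter

    def commits_per_week(repo, since="2017-01-01"):
        log = subprocess.run(
            ["git", "-C", repo, "log", "--since", since,
             "--date=format:%G-W%V", "--pretty=%cd"],
            capture_output=True, text=True, check=True).stdout
        return Counter(log.split())

    for week, n in sorted(commits_per_week("osstest.git").items()):
        print(week, n)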

I suggest we improve this by adopting a release ack system for pushes
to osstest pretest after the Xen codefreeze date.  In practice it will
sometimes be necessary to make changes quickly (eg debian-installer
kernel updates), so I think I (as osstest maintainer) would need some
discretion to waive the need for a release ack or to make one myself,
but that would certainly involve informing you, and asking your
opinion if you are available.

Another possibility would be to arrange for xen-unstable to have its
own separate branch of osstest, so that xen-unstable's runs can be
detached from the rest.  I think that, while this is technically
possible, it is not worth the additional complexity (admin hassle,
risk of confusion, work to reconcile branches, etc. etc.).

Do you think a release ack should be needed for commissioning new
hardware?


Secondly, on this specific set of changes, looking at it from the
point of view of whether such a release ack ought to have been
forthcoming:

We have been having hardware failures.  In particular, we have been
having PDU port failures which I am fairly sure are due to the high
frequency with which we use the PDU relays to hard power cycle the
machines.  We have also had a higher rate of other hardware problems
than I think is to be expected, which might be related.

These PDU relay problems themselves lead to osstest unreliability and
of course the longer the situation goes on the more stuff breaks.

So I think that for these changes a release ack should probably have
been granted, although perhaps additional formal testing (or some
other assurance) would, or should, have been done - see below.


Thirdly, in this case these recent changes in fact had nothing to do
with the fact that we didn't get a push over the weekend.
Looking at the recent flights, the first of the changes I made at the
end of last week took effect in 132504 (which reported late on
Monday).

The osstest changes were:

 * Substantial changes to host (and L1 host/guest) power
   on/off/reboot machinery.  In particular hosts are now normally
   soft-rebooted via ssh at the start of a test, rather than
   hard power cycled.  (See the sketch after this list.)

 * Small changes to reporting functions.
 * One tiny change to improve some error messages.
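
To make the first of those changes concrete, the new behaviour is
roughly "try a clean reboot over ssh first, and only fall back to the
PDU if that fails".  A minimal sketch of that shape, in Python
(osstest itself is Perl; the helper names here are hypothetical, and a
real implementation would also wait for the host to go down before
polling for it to come back):

    import subprocess, time

    def wait_for_ssh(host, timeout=300):
        # Poll until the host answers over ssh again, or give up.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if subprocess.run(["ssh", host, "true"]).returncode == 0:
                return True
            time.sleep(10)
        return False

    def pdu_power_cycle(host):
        # Placeholder: the real thing would drive the PDU port feeding
        # this host.
        raise NotImplementedError("site-specific PDU control goes here")

    def reboot_host(host):
        # Prefer a soft reboot: it is much gentler on the PDU relays.
        # The ssh connection may drop mid-reboot, so ignore its status.
        subprocess.run(["ssh", host, "reboot"])
        if not wait_for_ssh(host):
            # Host never came back (or was unreachable to begin with):
            # fall back to a hard power cycle.
            pdu_power_cycle(host)
            if not wait_for_ssh(host):
                raise RuntimeError(host + " did not come back")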


These changes *did* cause a regression in 132504:
 test-amd64-amd64-examine      4 memdisk-try-append         fail pass in 132478

This was not considered blocking by osstest because, from the
archaeologist's point of view, it is intermittent (the archaeologist
is right, but for the wrong reason).  But to justify that, osstest had
to look at 132478, which has other failures, so this osstest
regression was part of the reason for not getting a push on Monday
night.


The bug was effectively introduced by dropping, late in development,
the power management changes for the FreeBSD tests.  Those changes
were dropped late because, as I was writing more comprehensive design
comments, I realised that my intended scheme was not 100% sound.

This problem was not detected by osstest's formal self-test because
the formal self-test did not encounter the triggering condition.  (The
bug triggers when the FreeBSD test runs on a box which, for some
reason, was left by the previous test in a state where it could not be
rebooted via ssh; the latter is quite rare.)

This risk would have been obvious to me if I had been asked (or asked
myself) how thoroughly the changes ought to have been tested.  For
example, in the context of deciding whether to grant a release-ack.
So I think your implied proposal to apply the freeze to osstest would
have avoided this: probably, I would have done additional testing and
then a better version would have gone into production.

The FreeBSD changes were made properly later.  Ie, the bug was fixed
on Friday and the fix is now in production.  The currently-running
xen-unstable flight picked up the fixed version.


As for the problems which actually stopped us getting a push in 132457
and 132478, and contributed to failing to get a push in 132504:

132457
 test-amd64-amd64-examine memdisk-try-append

  This is the single test step in that test which uses FreeBSD,
  which is not UEFI-capable, and it ran on one of our few UEFI
  hosts.

  I don't want to only run the test on non-UEFI hosts because part of
  the point is to check that osstest's host interaction stuff is still
  working (after changes to osstest, or indeed Xen).

  This test step ought to be skipped on UEFI hosts.  That it is not is
  a bug (there is a sketch of the missing guard below, after this
  flight's entries).  The workaround from a Xen pov is either to get
  lucky or, if the test is sticky enough, a force push.

 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm guest-localmigrate/x10

  [  317.522719] Freezing of tasks failed after 20.005 seconds (1 tasks refusing to freeze, wq_busy=0):
  [  317.540911] jbd2/xvda5-8    D ffffffff8109e380     0   112      2 0x00000000
  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 21:guest did not suspend, timed out

  This looks like some kind of Xen-specific bug in the Debian
  kernel.
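
Returning to the memdisk-try-append step above: the missing guard
amounts to something of roughly this shape (a sketch only, with a
hypothetical host-properties data model; osstest is Perl and records
UEFI-ness differently):

    def memdisk_try_append_step(host):
        # `host` here is a dict of host properties; "uefi" is an
        # assumed key.
        if host.get("uefi"):
            # This step uses FreeBSD via memdisk, which is not
            # UEFI-capable, so on a UEFI host it should be recorded
            # as skipped rather than run and failed.
            return "skip"
        return run_memdisk_test(host)

    def run_memdisk_test(host):
        ...  # the real FreeBSD/memdisk test body, elided here
        return "pass"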

132478
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm debian-hvm-install fail

  I do not understand what goes wrong here.  The host and guest are
  apparently working.  The guest is doing a Debian install using
  debian-installer.  The guest installer asks to reboot, as is
  expected.  osstest manages this reboot itself by detecting the
  guest's state change, because it wants to remove the virtual
  installation media.  So the first thing it does is to destroy the
  old domain.  This fails with some kind of libvirt error.

  I think this is a bug in libvirt.  Presumably a race.

 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm guest-localmigrate/x10

  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 37:guest did not suspend, timed out
  [  365.637795] Freezing of tasks failed after 20.002 seconds (1 tasks refusing to freeze, wq_busy=0):
  [  365.645857] jbd2/xvda5-8    D ffffffff8109e380     0   115      2 0x00000000

  Same as above.
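
For the debian-hvm-install failure above, the sequence osstest is
effectively performing looks roughly like this when written against
the libvirt Python bindings (a sketch only; the connection URI and
domain name are placeholders, and osstest itself drives libvirt
differently):

    import time
    import libvirt

    conn = libvirt.open("xen:///system")
    dom = conn.lookupByName("debianhvm.guest")     # placeholder name

    # Wait for the installer's "please reboot" point, i.e. the guest's
    # state changing away from plain running.
    while dom.state()[0] == libvirt.VIR_DOMAIN_RUNNING:
        time.sleep(5)

    # Destroy the old domain so the virtual installation media can be
    # removed before it is started again.  The suspicion is that this
    # destroy races with libvirt's own handling of the guest's
    # shutdown and therefore sometimes fails.
    try:
        dom.destroy()
    except libvirt.libvirtError as err:
        print("destroy failed:", err)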


132504
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm 16 guest-localmigrate/x10 fail REGR. vs. 132422
  libxl: error: libxl_dom_suspend.c:367:suspend_common_wait_guest_timeout: Domain 41:guest did not suspend, timed out
  [  383.837386] Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
  [  383.845464] jbd2/xvda5-8    D ffffffff8109e380     0   115      2 0x00000000

  Same as above.

 test-amd64-amd64-examine      4 memdisk-try-append         fail pass in 132478

  osstest bug, discussed above.


HTH.

Ian.
