
Re: [Xen-devel] [OSSTEST PATCH 00/21] Abandon jobs which are unreasonably delaying their flight



Hi Ian,

On 4/18/19 5:31 PM, Ian Jackson wrote:
Sometimes we find ourselves seriously lacking the capacity to run
particular job(s).  The result can be that the whole system stands
mostly idle while a small proportion of the resources runs flat out
with a giant queue.

In this series we arrange for osstest to be able to spot this
happening, and automatically rebalance load by giving up earlier on the
jobs which are overly-contended.

There are some tuning parameters, of course.  To summarise, I have
chosen here to treat jobs as starved if (for example):
   We have completed 90% of the flight, and the remaining 10%
   is projected to take 5x as long as the first 90%.
(The "90%" is by number of jobs.)  See the patch
   starvation: Infrastructure for jobs which are delaying their flights
for details of the heuristic and its parameters.
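
As an illustration only, the example condition above can be written as a
small check.  This is a minimal sketch, not the code from the patch: the
function name, its parameters and the way the projection is supplied are
all assumptions made for the example, and the real heuristic may differ.

```python
def is_starved(done_jobs, total_jobs, elapsed, projected_remaining,
               done_fraction=0.9, slowdown=5.0):
    """Hypothetical sketch of the example starvation condition.

    done_jobs / total_jobs  -- progress of the flight, by number of jobs
    elapsed                 -- time taken so far (any consistent unit)
    projected_remaining     -- estimated time still needed (same unit)
    done_fraction, slowdown -- the "90%" and "5x" tuning parameters
    """
    if total_jobs <= 0:
        return False
    # Only consider starvation once enough of the flight has completed.
    mostly_done = done_jobs / total_jobs >= done_fraction
    # The tail is projected to take far longer than everything so far.
    tail_too_slow = projected_remaining >= slowdown * elapsed
    return mostly_done and tail_too_slow


# 9 of 10 jobs done after 100 minutes, the rest projected to need 500 more.
print(is_starved(9, 10, 100.0, 500.0))
```

With these inputs both conditions hold, so the remaining jobs would count
as starved; with only 8 of 10 jobs done, the same projection would not.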

When situations like this persist it will still be good to manually
balance the load by adjusting the job mix in submitted flights.  This
is because the starvation will not necessarily drop the same job in
subsequent flights on the same "branch", so starvation will impair the
regression detection.

As we discussed on IRC, I understand this will have an impact on Arm32 testing. Do you have an estimate of how likely the tests are to be skipped?

I am wondering whether we should discuss reducing the number of tests run on Arm32. We did that in the past on Arm64 when we were struggling with broken laxton0/laxton1.

Cheers,


Ian Jackson (21):
   ts-hosts-allocate-Executive: with -U, just append to the same logfile
   selecthost: Honour new $none_ok optional parameter
   ts-logs-capture: Do not try to capture logs of hosts not allocated
   alloc_resources: Support special abandonment values
   starvation: Teach sg-report-flight about starved step state
   starvation: Teach archaeologists about starved job state
   starvation: Teach ms-flights-summary about job state starved
   starvation: Teach sg-execute-flight about job state starved
   step handling: Preserve step states set by ts-* scripts
   TestSupport: Make "broken" print the actual job state
   JobDB::Executive: step_*: fix log messages to talk about "steps"
   starvation: Permit step_finish to set the state `starved'
   TestSupport: Make "broken" set the step state too
   tcl/JobDB-Executive: Do not squash "starved" status
   starvation: Propagate starved job status into dependent jobs
   ts-host-allocate-Executive: Break out $now and add a newline
   starvation: Use "starved" for hostalloc_maxwait_max
   starvation: Infrastructure for jobs which are delaying their flights
   starvation: Abandon jobs which are unreasonably delaying their flight
   hostalloc_maxwait_max: Use starvation most_optimistic
   starvation: Better logging/debugging output

  Osstest/Executive.pm         |  95 ++++++++++++++++++++++++++---
  Osstest/JobDB/Executive.pm   |   8 ++-
  Osstest/TestSupport.pm       |  24 ++++++--
  mg-hostalloc-starvation-demo |  53 ++++++++++++++++
  ms-flights-summary           |   9 +--
  sg-execute-flight            |   2 +-
  sg-report-flight             |  17 +++++-
  tcl/JobDB-Executive.tcl      |   6 +-
  ts-hosts-allocate-Executive  | 142 ++++++++++++++++++++++++++++++++++++++++---
  ts-logs-capture              |   7 ++-
  10 files changed, 328 insertions(+), 35 deletions(-)
  create mode 100755 mg-hostalloc-starvation-demo


--
Julien Grall
