
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split



Hi folks,
  long time no see. :-)

On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
> I asked Stephan Diestelhorst for help and after I convinced him that 
> removing credit and making SEDF the default again is not an option he 
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We 
> will work further on this tomorrow.

Andre and I have been looking through this further, in particular sanity
checking the invariant

prv->weight >= sdom->weight * sdom->active_vcpu_count

each time someone tweaks the active vcpu count. This happens only in
__csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
to observe the invariant being violated when splitting cpupools.
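
Roughly, the check in question looks like this (a sketch only, not the
literal instrumentation; prv->weight and sdom->{weight,active_vcpu_count}
are the fields from xen/common/sched_credit.c, the helper itself is
illustrative):

    /* Sketch: must hold whenever sdom->active_vcpu_count changes, i.e.
     * in __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked
     * (the "_locked" variant expects prv->lock to be held already). */
    static void check_weight_invariant(const struct csched_private *prv,
                                       const struct csched_dom *sdom)
    {
        BUG_ON(prv->weight < sdom->weight * sdom->active_vcpu_count);
    }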

We have the following theory of what happens:
* some vcpus of a particular domain are currently in the process of
  being moved to the new pool

* some are still left on the old pool (vcpus_old) and some are already
  in the new pool (vcpus_new)

* we now have vcpus_old->sdom == vcpus_new->sdom, and following from this
  * vcpus_old->sdom->weight == vcpus_new->sdom->weight
  * vcpus_old->sdom->active_vcpu_count == vcpus_new->sdom->active_vcpu_count

* active_vcpu_count thus does not reflect the split of the actual
  vcpus across the two pools (it may count the sum, only the old or
  only the new ones; which one does not matter)

* however, the two pools have distinct scheduler instances,
  sched_old != sched_new, and thus
  * sched_old->prv != sched_new->prv
  * sched_old->prv->weight != sched_new->prv->weight

* each pool's prv->weight field hence does see the incremental move of
  vcpus (through the modifications in *acct_start and *acct_stop_locked)

* if at any point during this half-way migration the scheduler runs
  csched_acct, it checks the invariant against an active_vcpu_count
  that no longer matches that pool's prv->weight (see the sketch after
  this list)
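
To see the breakage in isolation, here is a toy, self-contained model
of the situation (plain C, nothing Xen-specific; the names merely
mirror the fields discussed above and the weight value is arbitrary):

    /* Toy model of the half-way pool migration: one sdom shared by
     * both pools, but a separate prv per pool/scheduler instance. */
    #include <assert.h>
    #include <stdio.h>

    struct prv_t  { int weight; };                         /* per scheduler */
    struct sdom_t { int weight; int active_vcpu_count; };  /* per domain    */

    /* mimic __csched_vcpu_acct_start / _stop_locked for one vcpu */
    static void acct_start(struct prv_t *prv, struct sdom_t *sdom)
    {
        sdom->active_vcpu_count++;
        prv->weight += sdom->weight;
    }

    static void acct_stop(struct prv_t *prv, struct sdom_t *sdom)
    {
        sdom->active_vcpu_count--;
        prv->weight -= sdom->weight;
    }

    int main(void)
    {
        struct prv_t  prv_old = { 0 }, prv_new = { 0 };
        struct sdom_t sdom    = { .weight = 256, .active_vcpu_count = 0 };

        /* two vcpus of the domain, both active in the old pool */
        acct_start(&prv_old, &sdom);
        acct_start(&prv_old, &sdom);

        /* move one vcpu: deactivate in the old pool, activate in the
         * new one -- the shared counter goes 2 -> 1 -> 2 */
        acct_stop(&prv_old, &sdom);
        acct_start(&prv_new, &sdom);

        /* csched_acct in either pool now compares the shared counter
         * against its own prv->weight:
         *   old pool: 256 * 2 = 512  vs  prv_old.weight = 256
         *   new pool: 256 * 2 = 512  vs  prv_new.weight = 256
         * -> the invariant is broken on both sides. */
        printf("old: %d vs %d, new: %d vs %d\n",
               sdom.weight * sdom.active_vcpu_count, prv_old.weight,
               sdom.weight * sdom.active_vcpu_count, prv_new.weight);
        assert(prv_old.weight >= sdom.weight * sdom.active_vcpu_count);
        return 0;
    }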

Workarounds / fixes (none tried yet):
* disable scheduler accounting while a domain is half-way through its
  migration (a dom->pool_migrating flag checked in csched_acct; see the
  sketch below)
* temporarily split the sdom structure while migrating, to account for
  the transient split of vcpus
* synchronously disable all vcpus, migrate them, and then re-enable them
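
To make the first workaround a bit more concrete, a rough sketch (not
a tested patch; locking/ordering questions are deliberately ignored,
and the pool_migrating flag does not exist yet -- it would have to be
added to struct domain):

    #include <stdbool.h>

    /* Hypothetical new flag; in Xen it would live on struct domain. */
    struct domain_ish { bool pool_migrating; };
    struct sdom_ish   { struct domain_ish *dom; };

    /* Set before the first vcpu leaves the old pool ...              */
    static void pool_migrate_begin(struct domain_ish *d)
    {
        d->pool_migrating = true;
    }

    /* ... and cleared once the last vcpu has arrived in the new one. */
    static void pool_migrate_end(struct domain_ish *d)
    {
        d->pool_migrating = false;
    }

    /* csched_acct would then skip such domains for the current
     * accounting period instead of tripping over the (temporarily
     * meaningless) active_vcpu_count. */
    static bool acct_skip_domain(const struct sdom_ish *sdom)
    {
        return sdom->dom->pool_migrating;
    }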

Caveats:
* prv->lock does not guarantee mutual exclusion between the (same type
  of) schedulers of different pools, since each pool has its own prv
  and hence its own lock

<rant>
The general locking policy, versus what the comments actually document,
is a nightmare. I know that we have some advanced data-structure folks
here, but intuitively reasoning about when specific things are atomic
and mutually excluded is a pain in the scheduler / cpupool code; see
the issue with the separate prv->locks above.

E.g. the cpupool_unassign_cpu / cpupool_unassign_cpu_helper interplay
(a reduced sketch follows after this list):
* cpupool_unassign_cpu unlocks cpupool_lock
* sets up the continuation calling cpupool_unassign_cpu_helper
* cpupool_unassign_cpu_helper locks cpupool_lock
* intuitively, one would think that both should see a consistent
  snapshot of the pool state, and hence dropping the lock in the middle
  is a bad idea
* also communicating continuation-local state through global variables
  mandates that only a single global continuation can be pending

* reading cpu outside of the lock protection in
  cpupool_unassign_cpu_helper also smells
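
The pattern above, boiled down to a toy illustration (names and the
continuation mechanism are stand-ins, not a faithful copy of
cpupool.c):

    #include <stdio.h>

    /* stand-ins for spin_lock/spin_unlock on cpupool_lock */
    static void lock(void)   { }
    static void unlock(void) { }

    /* continuation-local state passed through a global: only ONE such
     * operation can ever be pending at a time */
    static int moving_cpu = -1;

    static long unassign_cpu_helper(void *unused)
    {
        /* read before the lock is (re)taken; anything that ran since
         * the unlock below could have changed the pool state */
        int cpu = moving_cpu;

        lock();
        printf("helper: completing move of cpu %d\n", cpu);
        /* ... the actual per-cpu teardown work would happen here ... */
        unlock();
        return 0;
    }

    static long unassign_cpu(int cpu)
    {
        lock();
        moving_cpu = cpu;     /* set up state for the continuation */
        unlock();
        /* lock dropped here: the two halves do NOT operate on one
         * consistent snapshot of the pool state */
        return unassign_cpu_helper(NULL);  /* stands in for the remote
                                            * continuation */
    }

    int main(void)
    {
        return (int)unassign_cpu(3);
    }
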
</rant>

Despite the rant, it is amazing to see that running things can be moved
around through this remote continuation trick! In my (ancient) balancer
experiments I added hypervisor threads just to side-step this issue.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@xxxxxxx
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany
Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551


