WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
xen-devel

Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: Stephan Diestelhorst <stephan.diestelhorst@xxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date: Wed, 02 Feb 2011 16:14:25 +0100
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, "Przywara, Andre" <Andre.Przywara@xxxxxxx>, Keir Fraser <keir@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
In-reply-to: <201102021539.06664.stephan.diestelhorst@xxxxxxx>
Organization: Fujitsu Technology Solutions
References: <4D41FD3A.5090506@xxxxxxx> <AANLkTi=ppBtb1nhdfbhGZa0Rt6kVyopdS3iJPr5fVA1x@xxxxxxxxxxxxxx> <4D483599.1060807@xxxxxxx> <201102021539.06664.stephan.diestelhorst@xxxxxxx>
On 02/02/11 15:39, Stephan Diestelhorst wrote:
> Hi folks,
>    long time no see. :-)
>
> On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
>> I asked Stephan Diestelhorst for help, and after I convinced him that
>> removing credit and making SEDF the default again is not an option, he
>> worked together with me on that ;-) Many thanks for that!
>> We haven't come to a final solution yet, but we could gather some debug
>> data. I will simply dump some data here; maybe somebody has got a clue.
>> We will work further on this tomorrow.
>
> Andre and I have been looking through this further, in particular sanity
> checking the invariant
>
>     prv->weight >= sdom->weight * sdom->active_vcpu_count
>
> each time someone tweaks the active vcpu count. This happens only in
> __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
> to observe the broken invariant when splitting cpupools.
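For illustration, such a check can be written as a small helper (the prv / sdom field names are the ones used in sched_credit.c; calling it right after the count is changed in the two functions above is an assumption of this sketch, not existing code):

    /* Debug sketch: every active vcpu of sdom should have added
     * sdom->weight to the pool-wide prv->weight, so the pool total must
     * never drop below this domain's own contribution. */
    static void csched_check_weight_invariant(const struct csched_private *prv,
                                              const struct csched_dom *sdom)
    {
        ASSERT(prv->weight >= sdom->weight * sdom->active_vcpu_count);
    }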

> We have the following theory of what happens:
> * some vcpus of a particular domain are currently in the process of
>   being moved to the new pool

The only _vcpus_ to be moved between pools are the idle vcpus, and those
never contribute to accounting in the credit scheduler.

We are moving _pcpus_ only (well, moving a domain between pools actually
moves vcpus as well, but then the domain is paused).
On the pcpu to be moved, the idle vcpu should be running. Obviously you
have found a scenario where this isn't true. I have no idea how this could
happen, as vcpus other than the idle vcpus are taken into account for
scheduling only if the pcpu is valid in the cpupool. And the pcpu is set
valid after the BUG_ON you have triggered in your tests.


> * some are still left in the old pool (vcpus_old) and some are already
>   in the new pool (vcpus_new)
>
> * we now have vcpus_old->sdom = vcpus_new->sdom, and following from this
>   * vcpus_old->sdom->weight = vcpus_new->sdom->weight
>   * vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count
>
> * active_vcpu_count thus does not represent the separation of the
>   actual vcpus (it may be the sum, or only the old or the new ones; it
>   does not matter)
>
> * however, sched_old != sched_new, and thus
>   * sched_old->prv != sched_new->prv
>   * sched_old->prv->weight != sched_new->prv->weight
>
> * the prv->weight field hence sees the incremental move of VCPUs
>   (through modifications in *acct_start and *acct_stop_locked)
>
> * if at any point in this half-way migration the scheduler runs
>   csched_acct, it checks the wrong active_vcpu_count
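For illustration: the accounting hands each domain roughly a share of the pool's credit proportional to sdom->weight * sdom->active_vcpu_count, normalised by prv->weight, which is what the invariant above protects. A toy calculation with made-up numbers (deliberately not Xen code) shows how the stale count overshoots:

    /* Standalone toy illustration, not Xen code.  A domain with weight 256
     * and 4 active vcpus has had 2 of them already accounted to the new
     * pool, so the old pool's prv->weight only contains 2 * 256. */
    #include <stdio.h>

    int main(void)
    {
        int credit_total      = 300;      /* credit the old pool distributes  */
        int sdom_weight       = 256;
        int active_vcpu_count = 4;        /* still counts vcpus of BOTH pools */
        int prv_weight        = 2 * 256;  /* only 2 vcpus still in this pool  */

        int credit_fair = credit_total * sdom_weight * active_vcpu_count
                          / prv_weight;

        /* Prints "credit_fair = 600 of 300": the domain is granted more
         * credit than the pool has, because
         * prv->weight < sdom->weight * active_vcpu_count. */
        printf("credit_fair = %d of %d\n", credit_fair, credit_total);
        return 0;
    }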

> Workarounds / fixes (none tried):
> * disable scheduler accounting while half-way migrating a domain
>   (a dom->pool_migrating flag, checked in csched_acct; sketched below)
> * temporarily split the sdom structures while migrating, to account for
>   the transient split of vcpus
> * synchronously disable all vcpus, migrate, and then re-enable
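A sketch of the first of these (pool_migrating is purely hypothetical, and sdom->dom, a back-pointer from the csched_dom to its domain, is assumed):

    /* Hypothetical: mark the domain while its vcpus are split across pools. */
    d->pool_migrating = 1;
    /* ... move all vcpus of d to the new pool ... */
    d->pool_migrating = 0;

    /* Hypothetical check in csched_acct(), while walking the active sdoms:
     * skip domains whose active_vcpu_count cannot be trusted right now. */
    if ( sdom->dom->pool_migrating )
        continue;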

> Caveats:
> * prv->lock does not guarantee mutual exclusion between (same-type)
>   schedulers of different pools
>
> <rant>
> The general locking policy versus the comments describing it is a
> nightmare. I know that we have some advanced data-structure folks here,
> but intuitively reasoning about when specific things are atomic and
> mutually excluded is a pain in the scheduler / cpupool code; see the
> issue with the separate prv->locks above.
>
> E.g. the cpupool_unassign_cpu / cpupool_unassign_cpu_helper interplay:
> * cpupool_unassign_cpu unlocks cpupool_lock
> * it sets up the continuation calling cpupool_unassign_cpu_helper
> * cpupool_unassign_cpu_helper locks cpupool_lock again
> * intuitively, one would think that both should see a consistent
>   snapshot, and hence releasing the lock in the middle is a bad idea
> * also, communicating continuation-local state through global variables
>   mandates that only a single such continuation can be pending
> * reading cpu outside of the lock protection in
>   cpupool_unassign_cpu_helper also smells
> </rant>
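For reference, the shape of the pattern being criticised looks roughly like this (a paraphrase, not the literal cpupool.c code; the global's name is invented):

    static int cpu_being_moved;          /* continuation state in a global */

    static long cpupool_unassign_cpu_helper(void *info)
    {
        int cpu = cpu_being_moved;       /* read before the lock is re-taken */

        (void)info; (void)cpu;           /* placeholders in this sketch */
        spin_lock(&cpupool_lock);
        /* ... finish moving 'cpu', trusting that the state parked in the
         * global is still the one set up before the lock was dropped ... */
        spin_unlock(&cpupool_lock);
        return 0;
    }

    int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
    {
        spin_lock(&cpupool_lock);
        /* ... validate 'cpu' and mark it as moving ... */
        cpu_being_moved = cpu;
        spin_unlock(&cpupool_lock);      /* the consistent snapshot ends here */

        /* the second half runs later as a continuation */
        return continue_hypercall_on_cpu(cpu, cpupool_unassign_cpu_helper, c);
    }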

> Despite the rant, it is amazing to see the ability to move running
> things around through this remote continuation trick! In my (ancient)
> balancer experiments I added hypervisor threads just for side-stepping
> this issue.

I think the easiest way to solve the problem would be to move the cpu to the
new pool in a tasklet. This is possible now because tasklets are always
executed by the idle vcpus.
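A rough sketch of that idea (tasklet_init() and tasklet_schedule_on_cpu() are the existing tasklet interface; the function and variable names around them are made up):

    #include <xen/tasklet.h>

    static struct tasklet cpupool_move_tasklet;

    /* Runs in the idle vcpu of the target pcpu, so no guest vcpu can be
     * active there while it is detached from the old pool and attached
     * to the new one. */
    static void cpupool_move_cpu_fn(unsigned long data)
    {
        unsigned int cpu = (unsigned int)data;

        /* ... switch 'cpu' over to the new pool's scheduler here ... */
        (void)cpu;                       /* placeholder in this sketch */
    }

    static void schedule_cpu_move(unsigned int cpu)
    {
        tasklet_init(&cpupool_move_tasklet, cpupool_move_cpu_fn, cpu);
        tasklet_schedule_on_cpu(&cpupool_move_tasklet, cpu);
    }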

OTOH I'd like to understand what is wrong with my current approach...


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
