WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
xen-devel

Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: Andre Przywara <andre.przywara@xxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date: Mon, 21 Feb 2011 15:50:14 +0100
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>
Delivery-date: Mon, 21 Feb 2011 06:51:04 -0800
In-reply-to: <4D627A6F.5070105@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Organization: Fujitsu Technology Solutions
References: <4D41FD3A.5090506@xxxxxxx> <4D4C08B6.30600@xxxxxxx> <4D4FE7E2.9070605@xxxxxxx> <4D4FF452.6060508@xxxxxxxxxxxxxx> <AANLkTinoRUQC_suVYFM9-x3D00KvYofq3R=XkCQUj6RP@xxxxxxxxxxxxxx> <4D50D80F.9000007@xxxxxxxxxxxxxx> <AANLkTinKJUAXhiXpKui_XX8XCD6T5fmzNARwHE6Fjafv@xxxxxxxxxxxxxx> <AANLkTinP0z9GynF1RFd8RwzWuqvxYdb+UBE+7xKpX6D4@xxxxxxxxxxxxxx> <4D517051.10402@xxxxxxx> <AANLkTi=MiELBnPFvb6-jzVth+T7aKxP5JMFhVh3Crdmo@xxxxxxxxxxxxxx> <AANLkTikgGNz=imS1xRVVjntY5P=+MuT_Qsb=-h3QHajY@xxxxxxxxxxxxxx> <4D529BD9.5050200@xxxxxxx> <4D52A2CD.9090507@xxxxxxxxxxxxxx> <4D5388DF.8040900@xxxxxxxxxxxxxx> <4D53AF27.7030909@xxxxxxx> <4D53F3BC.4070807@xxxxxxx> <4D54D478.9000402@xxxxxxxxxxxxxx> <4D54E79E.3000800@xxxxxxx> <AANLkTimkRAHtM4CoTskQ7w6B-8Pis4B2+k7=frxM3oyW@xxxxxxxxxxxxxx> <4D5A29C0.4050702@xxxxxxxxxxxxxx> <4D5B9D2B.107@xxxxxxxxxxxxxx> <AANLkTin+rE1=+vpmTg9xeQdYn7_hucSFkrz1qCtiKfkY@xxxxxxxxxxxxxx> <4D6237C6.1050206@xxxxx om> <4D62666C.6010608@xxxxxxxxxxx com> <4D627A6F.5070105@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20101226 Iceowl/1.0b1 Icedove/3.0.11
On 02/21/11 15:45, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 02/21/11 11:00, Andre Przywara wrote:
>>> George Dunlap wrote:
>>>> Andre (and Juergen), can you try again with the attached patch?
>>> I applied this patch on top of 22931 and it did _not_ work.
>>> The crash occurred almost immediately after I started my script, so the
>>> same behaviour as without the patch.
>>
>> Did you try my patch addressing races in the scheduler when moving cpus
>> between cpupools?
> Sorry, I tried to apply yours first, but it didn't apply cleanly on my
> particular tree (sched_jg_fix ;-). So I tested George's first.
>
>> I've attached it again. For me it works quite well, while George's patch
>> seems not to be enough (machine hanging after some tests with cpupools).
> OK, it now applied after a rebase.
> And yes, I didn't see a crash! At least until the script stopped while
> a lot of these messages appeared:
> (XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
>
> That is what I reported before and is most probably totally unrelated to
> this issue.
> So I consider this fix working!
> I will try to match my recent theories and debug results with your patch
> to see whether this fits.
>
>> OTOH I can't reproduce an error as fast as you even without any patch :-)
>
> (attached my script for reference, though it will most likely only make
> sense on bigger NUMA machines)

Yeah, on my 2-node system I need several hundred tries to get an error.
But it seems to be more effective than George's script.

> I consider the large over-provisioning the reason. With Dom0 having 48
> VCPUs finally squashed together onto 6 pCPUs, my script triggered by the
> second run at the latest.
> With your patch it made 24 iterations before the other bug kicked in.

Okay, I'll prepare an official patch. Might take some days, as I'm not in the
office until Thursday.
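[Archive note: the stress scripts discussed above were sent as attachments and are not preserved with this page. A minimal sketch of the kind of split/merge loop being described, assuming the standard `xl cpupool-numa-split` and `xl cpupool-list` commands, might look like the following; the iteration count and structure are illustrative guesses, not Andre's actual script.]

```shell
#!/bin/sh
# Hypothetical stress loop in the spirit of the scripts discussed in this
# thread: repeatedly split the cpupools along NUMA node boundaries, which
# moves CPUs between pools and exercises the racy scheduler paths.
ITERATIONS=${ITERATIONS:-5}

run_split_cycle() {
    # One pool per NUMA node, then show the result.  Merging the pools
    # back (cpupool-destroy / cpupool-cpu-add into Pool-0) is
    # machine-specific and omitted here.
    xl cpupool-numa-split &&
    xl cpupool-list
}

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    if command -v xl >/dev/null 2>&1; then
        run_split_cycle || break    # stop as soon as a cycle fails
    else
        :   # no xl on this host; dry-run the loop
    fi
    i=$((i + 1))
done
echo "completed $i iterations"
```

On a real NUMA host the merge step would destroy the per-node pools and return their CPUs to Pool-0 before the next iteration; on a host without xl the loop simply dry-runs.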


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel