
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split


  • To: Andre Przywara <andre.przywara@xxxxxxx>
  • From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
  • Date: Mon, 21 Feb 2011 15:50:14 +0100
  • Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>
  • Delivery-date: Mon, 21 Feb 2011 06:51:04 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On 02/21/11 15:45, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 02/21/11 11:00, Andre Przywara wrote:
>>> George Dunlap wrote:
>>>> Andre (and Juergen), can you try again with the attached patch?
>>> I applied this patch on top of 22931 and it did _not_ work.
>>> The crash occurred almost immediately after I started my script, so the
>>> same behaviour as without the patch.
>>
>> Did you try my patch addressing races in the scheduler when moving cpus
>> between cpupools?
> Sorry, I tried yours first, but it didn't apply cleanly on my particular
> tree (sched_jg_fix ;-). So I tested George's first.
>
>> I've attached it again. For me it works quite well, while George's patch
>> seems not to be enough (machine hanging after some tests with cpupools).
> OK, it now applied after a rebase.
> And yes, I didn't see a crash! At least until the script stopped while
> a lot of these messages appeared:
> (XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
>
> That is what I reported before and is most probably totally unrelated to
> this issue.
> So I consider this fix working!
> I will try to match my recent theories and debug results with your patch
> to see whether this fits.
>
>> OTOH I can't reproduce an error as fast as you even without any patch :-)
>
> (attached my script for reference, though it will most likely only make
> sense on bigger NUMA machines)

Yeah, on my 2-node system I need several hundred tries to get an error.

> But it seems to be more effective than George's script.
> I consider the large over-provisioning the reason. With Dom0's 48 VCPUs
> finally squashed together onto 6 pCPUs, my script triggered on the second
> run at the latest.
> With your patch it made 24 iterations before the other bug kicked in.

Okay, I'll prepare an official patch. It might take some days, as I'm not in
the office until Thursday.
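[Editor's note: Andre's attached stress script is not preserved in this archive. As a rough, hypothetical sketch only, a cpupool stress loop of the kind described above might look like the following. The pool naming ("Pool-node<N>"), the node count, and the teardown step are assumptions; real runs may additionally need `xl cpupool-cpu-remove` before a pool can be destroyed, depending on the xl version.]

```shell
#!/bin/sh
# Hypothetical sketch of a cpupool stress loop (not Andre's actual script).
# Each iteration splits the CPUs into one cpupool per NUMA node via
# `xl cpupool-numa-split`, then tears the per-node pools down again,
# repeatedly exercising the scheduler paths that move CPUs between pools.
stress_cpupools() {
    xl=$1          # "xl" on a real host, "echo" for a dry run
    iterations=$2
    nodes=$3       # assumed NUMA node count of the machine

    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Split the CPUs into one cpupool per NUMA node.
        $xl cpupool-numa-split
        # Merge back: destroy every per-node pool except node 0's.
        # (Assumed pool naming; real runs may first need
        # `xl cpupool-cpu-remove` for each CPU in the pool.)
        n=1
        while [ "$n" -lt "$nodes" ]; do
            $xl cpupool-destroy "Pool-node$n"
            n=$((n + 1))
        done
        i=$((i + 1))
    done
}

# Dry run: print the xl commands that would be issued instead of running them.
stress_cpupools echo 2 4
```

With a high iteration count on a real host, a loop like this hammers the cpu-move paths far harder than a single manual split/merge would.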


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

