   
 
 

xen-devel

Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL

To: Jan Beulich <JBeulich@xxxxxxxxxx>
Subject: Re: [Xen-devel] [xen-unstable test] 6374: regressions - FAIL
From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date: Mon, 14 Mar 2011 15:40:27 +0100
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
Delivery-date: Mon, 14 Mar 2011 07:41:43 -0700
In-reply-to: <4D7DFD130200007800036344@xxxxxxxxxxxxxxxxxx>
Organization: Fujitsu Technology Solutions
References: <osstest-6374-mainreport@xxxxxxx> <19834.24888.630582.491364@xxxxxxxxxxxxxxxxxxxxxxxx> <4D7DFD130200007800036344@xxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20101226 Iceowl/1.0b1 Icedove/3.0.11
On 03/14/11 11:33, Jan Beulich wrote:
On 11.03.11 at 18:51, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:
xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
flight 6374 xen-unstable real [real]
Tests which did not succeed and are blocking:
  test-amd64-i386-pv            5 xen-boot               fail REGR. vs. 6369

Xen crash in scheduler (non-credit2).

Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre  x86_64  debug=y  Not tainted ]----
Mar 11 13:46:57.931763 (XEN) CPU:    1
Mar 11 13:46:57.931784 (XEN) RIP:    e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047   CONTEXT: hypervisor
Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0   rbx: ffff8301a7fafc78   rcx: 0000000000000002
Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0   rsi: 0000000000000080   rdi: ffff8301a7fafc78
Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8   rsp: ffff8301a7fafc00   r8:  0000000000000002
Mar 11 13:46:57.966770 (XEN) r9:  0000ffff0000ffff   r10: 00ff00ff00ff00ff   r11: 0f0f0f0f0f0f0f0f
Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68   r13: 0000000000000001   r14: 0000000000000001
Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0   cr0: 000000008005003b   cr4: 00000000000006f0
Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000   cr2: 00000000c45e5770
Mar 11 13:46:57.987800 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
...
Mar 11 13:46:58.154777 (XEN) Xen call trace:
Mar 11 13:46:58.154798 (XEN)    [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:58.163767 (XEN)    [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
Mar 11 13:46:58.163802 (XEN)    [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
Mar 11 13:46:58.178768 (XEN)    [<ffff82c480122e24>] context_saved+0x62/0x7b
Mar 11 13:46:58.178799 (XEN)    [<ffff82c480157f17>] context_switch+0xd98/0xdca
Mar 11 13:46:58.183766 (XEN)    [<ffff82c4801226b4>] schedule+0x5fc/0x624
Mar 11 13:46:58.183795 (XEN)    [<ffff82c480123837>] __do_softirq+0x88/0x99
Mar 11 13:46:58.198784 (XEN)    [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I suppose that's a result of 22957:c5c4688d5654 - as I understand it
exiting the loop is only possible if two consecutive invocations of
pick_cpu return the same result. This, however, is precisely what the
pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
systems (so that it's not always the same entity that gets selected).

But even beyond that particular aspect, relying on any form of
"stability" of the returned value isn't correct.

Plus running pick_cpu repeatedly without actually using its result
is wrong wrt idle_bias updating too - that's why
csched_vcpu_acct() calls _csched_cpu_pick() with the commit
argument set to false (a subsequent call - through pick_cpu -
with the argument set to true is then likely to return the same
value, but there's no correctness dependency on that). So
22948:2d35823a86e7 already wasn't really correct in putting a
loop around pick_cpu.
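
To illustrate the point about the loop, here is a minimal toy model in C. It is not the Xen code: toy_pcpu, toy_pick_cpu() and toy_pick_until_stable() are hypothetical names, and the rotation rule is only an assumed stand-in for what idle_bias does. It shows that a loop which exits only when two consecutive picks agree cannot terminate once every committed pick moves the bias to a different sibling.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pCPU's idle_bias (hypothetical, not Xen's struct). */
struct toy_pcpu {
    int idle_bias;    /* sibling handed out by the last committed pick */
    int nr_siblings;  /* number of idle siblings to rotate over */
};

/*
 * Hypothetical picker: returns the sibling after idle_bias.  Only a
 * committed pick updates idle_bias, so uncommitted picks are repeatable.
 */
static int toy_pick_cpu(struct toy_pcpu *p, bool commit)
{
    assert(p->nr_siblings > 0);
    int cpu = (p->idle_bias + 1) % p->nr_siblings;
    if (commit)
        p->idle_bias = cpu;
    return cpu;
}

/*
 * Loop that exits only when two consecutive picks agree.  Returns the
 * number of extra picks needed, or -1 if 'limit' picks never agreed.
 */
static int toy_pick_until_stable(struct toy_pcpu *p, bool commit, int limit)
{
    int prev = toy_pick_cpu(p, commit);
    for (int i = 1; i <= limit; i++) {
        int next = toy_pick_cpu(p, commit);
        if (next == prev)
            return i;
        prev = next;
    }
    return -1;
}
```

With two siblings and commit set to true, the picks alternate and toy_pick_until_stable() hits its limit; with commit set to false, the second pick already matches the first and idle_bias is left untouched. In the real scheduler there is no such limit, which would be consistent with the watchdog report above.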

It's also not clear to me what the surrounding
if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
is supposed to filter, as the lock pointer gets set only when a
CPU gets brought up.

Yeah, but the vcpu can change cpus while we don't hold the lock.
This means old_cpu can change between selecting the lock and actually
taking it...
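
The race can be sketched with a small single-threaded toy in C. All names here (toy_lock, toy_vcpu, toy_vcpu_lock(), the racer hook) are made up for illustration, and the sketch re-checks the vcpu's processor rather than comparing lock pointers as the Xen code does. The hook stands in for a concurrent migration that happens between selecting the lock and taking it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

struct toy_lock { bool held; };      /* hypothetical per-CPU lock */
struct toy_vcpu { int processor; };  /* CPU the vcpu currently sits on */

static struct toy_lock toy_cpu_lock[TOY_NR_CPUS];

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/*
 * Take the lock of the CPU the vcpu is on.  'racer', if non-NULL, runs
 * once between reading v->processor and taking the lock, simulating a
 * migration by another CPU.  Returns the CPU whose lock is held.
 */
static int toy_vcpu_lock(struct toy_vcpu *v, void (*racer)(struct toy_vcpu *))
{
    for (;;) {
        int cpu = v->processor;          /* select the lock ... */
        if (racer != NULL) {
            racer(v);                    /* ... vcpu may move here ... */
            racer = NULL;
        }
        toy_lock_acquire(&toy_cpu_lock[cpu]);  /* ... before we take it */
        if (v->processor == cpu)
            return cpu;                  /* still the right lock */
        toy_lock_release(&toy_cpu_lock[cpu]);  /* stale lock: retry */
    }
}

/* Example racer: moves the vcpu to CPU 2. */
static void toy_migrate_to_cpu2(struct toy_vcpu *v)
{
    v->processor = 2;
}
```

Starting with the vcpu on CPU 0 and passing toy_migrate_to_cpu2 as the racer, the first pass takes CPU 0's lock, notices the vcpu has moved, drops it, and the second pass ends up holding CPU 2's lock.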

As I don't really understand what this is trying to achieve,
I also can't really suggest a possible fix other than reverting both
offending changesets.

I'll send a patch as a suggestion :-)


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel