
Re: [Xen-devel] dom0 show call trace and failed to boot on HSW-EX platform



On 02/02/16 07:40, Li, Liang Z wrote:
Hi David,

We found that dom0 crashes when booting on an HSW-EX server; the dom0 kernel version is v4.4. By debugging, I found that your patch
'x86/xen: discard RAM regions above the maximum reservation' (commit f5775e0b6116b7e2425ccf535243b21)
caused the regression. The debug messages are listed below:
===============================================================
 (XEN) mm.c:884:d0v14 pg_owner 0 l1e_owner 0, but real_pg_owner -1
 (XEN) mm.c:955:d0v14 Error getting mfn 1080000 (pfn ffffffffffffffff) from L1 
 (XEN) mm.c:1269:d0v14 Failure in alloc_l1_table: entry 0
 (XEN) mm.c:2175:d0v14 Error while validating mfn 188d903 (pfn 17a7cc) for type 
 (XEN) mm.c:3101:d0v14 Error -16 while pinning mfn 188d903
 [   33.768792] ------------[ cut here ]------------
WARNING: CPU: 14 PID: 1 at arch/x86/xen/multicalls.c:129 xen_mc_
 [   33.783809] Modules linked in:
 [   33.787304] CPU: 14 PID: 1 Comm: swapper/0 Not tainted 4.4.0 #1
 [   33.793991] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
 [   33.805624]  0000000000000081 ffff88017d2537c8 ffffffff812ff954 000000000000
 [   33.813961]  0000000000000000 0000000000000081 0000000000000000 ffff88017d25
 [   33.822300]  ffffffff810ca120 ffffffff81cb7f00 ffff8801879ca280 000000000000
 [   33.830639] Call Trace:
 [   33.833457]  [<ffffffff812ff954>] dump_stack+0x48/0x64
 [   33.839277]  [<ffffffff810ca120>] warn_slowpath_common+0x90/0xd0
 [   33.846058]  [<ffffffff810ca175>] warn_slowpath_null+0x15/0x20
 [   33.852659]  [<ffffffff81060133>] xen_mc_flush+0x1c3/0x1d0
 [   33.858858]  [<ffffffff8106449f>] xen_alloc_pte+0x20f/0x300
 [   33.865158]  [<ffffffff810beef5>] ? update_page_count+0x45/0x60
 [   33.871855]  [<ffffffff817a1194>] ? phys_pte_init+0x170/0x183
 [   33.878345]  [<ffffffff817a148d>] phys_pmd_init+0x2e6/0x389
 [   33.884649]  [<ffffffff817a17dd>] phys_pud_init+0x2ad/0x3dc
 [   33.890954]  [<ffffffff817a290d>] kernel_physical_mapping_init+0xec/0x211
 [   33.898613]  [<ffffffff8179df8d>] init_memory_mapping+0x17d/0x2f0
 [   33.905496]  [<ffffffff81104f11>] ? __raw_callee_save___pv_queued_spin_unloc
 [   33.914516]  [<ffffffff813643f7>] ? acpi_os_signal_semaphore+0x2e/0x32
 [   33.921889]  [<ffffffff810ba7b8>] arch_add_memory+0x48/0xf0
 [   33.928186]  [<ffffffff8179eb80>] add_memory_resource+0x80/0x110
 [   33.934967]  [<ffffffff8179ec8d>] add_memory+0x7d/0xc0
 [   33.940787]  [<ffffffff81399538>] acpi_memory_device_add+0x14f/0x237
 [   33.947963]  [<ffffffff81369a6d>] acpi_bus_attach+0xcb/0x166
 [   33.954359]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.960854]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.967350]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.973848]  [<ffffffff8136aff1>] acpi_bus_scan+0x5b/0x66
 [   33.979962]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   33.986450]  [<ffffffff81d32187>] acpi_scan_init+0x7d/0x1c4
 [   33.992755]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   33.999248]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   34.005747]  [<ffffffff81d3204a>] acpi_init+0x246/0x282
 [   34.011659]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   34.018156]  [<ffffffff810020b1>] do_one_initcall+0x81/0x1e0
 [   34.024557]  [<ffffffff81cf5c06>] kernel_init_freeable+0x19d/0x238
 [   34.031542]  [<ffffffff81cf5ca1>] ? kernel_init_freeable+0x238/0x238
 [   34.038711]  [<ffffffff8179d490>] ? rest_init+0x80/0x80
 [   34.044626]  [<ffffffff8179d499>] kernel_init+0x9/0xe0
 [   34.050450]  [<ffffffff817aa3cf>] ret_from_fork+0x3f/0x70
 [   34.056552]  [<ffffffff8179d490>] ? rest_init+0x80/0x80
 [   34.062475] ---[ end trace 854dae1bef359299 ]---
============================================================================================

You can get more information in 'error_log.txt'.

Any ideas?
I don't know the original intention of this patch, so simply sending a revert patch to fix the issue is not a good choice.
Maybe you have a better solution.
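
For context, my rough understanding of what that commit does (a simplified sketch of the clipping idea taken from the commit subject, not the exact upstream diff; the function and parameter names below are placeholders):

 /*
  * Simplified sketch: while building dom0's pseudo-physical E820 map,
  * RAM regions lying above dom0's maximum reservation are dropped (or
  * trimmed) instead of being kept, so the kernel never tries to map
  * or hotplug-add them later.  Placeholder names, not the real code
  * from f5775e0b.
  */
 static void clip_ram_above_reservation(struct e820entry *map, u32 *nr,
                                        unsigned long max_reservation_pfn)
 {
         u64 limit = (u64)max_reservation_pfn << PAGE_SHIFT;
         u32 i, out = 0;

         for (i = 0; i < *nr; i++) {
                 struct e820entry *e = &map[i];

                 if (e->type == E820_RAM && e->addr >= limit)
                         continue;                    /* discard entirely */
                 if (e->type == E820_RAM && e->addr + e->size > limit)
                         e->size = limit - e->addr;   /* trim the tail */
                 map[out++] = *e;
         }
         *nr = out;
 }

If that is roughly what happens, the E820 ranges backing this machine's ACPI memory-hotplug regions would no longer be visible to dom0, which might be related to why the add_memory() path in the trace above trips over an unmapped pfn.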

Liang


error_log.txt

(XEN) Bad console= option '8n1'

8n1 should be part of com1= or com2=, rather than console=
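
i.e. something along these lines (standard Xen serial-console syntax; adjust the port and baud rate to your setup):

    com1=115200,8n1 console=com1

rather than folding the 8n1 into console=.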

 Xen 4.7-unstable
(XEN) Xen version 4.7-unstable (build@) (gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)) debug=y Thu Jan 21 23:21:32 EST 2016
(XEN) Latest ChangeSet: Tue Jan 19 17:47:19 2016 +0000 git:1949868-dirty
(XEN) Console output is synchronous.
(XEN) Bootloader: GNU GRUB 0.97
(XEN) Command line: dom0_mem=4096M loglvl=all guest_loglvl=all unrestricted_guest=1 msi=1 console=com1,115200,8n1 sync_console hap_1gb=1 conring_size=128M iommu=on,intpost psr=cmt ps
[20;80Hr=cat psr=cdp

This is very hard to read with the VT escape characters still present.  However, you probably meant dom0_mem=4096M:max=4096M, or dom0 gets all the remaining RAM.
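
For reference, the xen-command-line documentation spells this as a comma-separated list with a max: keyword, so the clamped form would be something like

    dom0_mem=4096M,max:4096M

(worth double-checking against the Xen version in use).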

Having said that, giving dom0 all the RAM should work, and...

 [   33.656695] ACPI: NR_CPUS/possible_cpus limit of 64 reached.  Processor 99/0
 [   33.665648] ACPI: Unable to map lapic to logical cpu number
 (XEN) mm.c:884:d0v14 pg_owner 0 l1e_owner 0, but real_pg_owner -1
 (XEN) mm.c:955:d0v14 Error getting mfn 1080000 (pfn ffffffffffffffff) from L1 
 (XEN) mm.c:1269:d0v14 Failure in alloc_l1_table: entry 0
 (XEN) mm.c:2175:d0v14 Error while validating mfn 188d903 (pfn 17a7cc) for type 
 (XEN) mm.c:3101:d0v14 Error -16 while pinning mfn 188d903

This is a -EBUSY.  Is there anything magic about mfn 188d903?  It just looks like plain RAM in the E820 table.

Have you got dom0 configured to use linear p2m mode?  Without it, dom0 can only have a maximum of 512GB of RAM.
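
(Where the 512GB figure comes from, as I understand the old 3-level p2m layout: each level is one 4KiB page holding 512 eight-byte entries, so three levels cover 512^3 = 2^27 frames, and 2^27 * 4KiB = 512GiB.)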

~Andrew
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

