
Re: [Xen-devel] dom0 show call trace and failed to boot on HSW-EX platform



On 02/02/16 07:40, Li, Liang Z wrote:
Hi David,

We found that dom0 crashes when booting on an HSW-EX server; the dom0 kernel version is v4.4. By debugging, I found that your patch
'x86/xen: discard RAM regions above the maximum reservation' (commit f5775e0b6116b7e2425ccf535243b21)
caused the regression. The debug messages are listed below:
===============================================================
 (XEN) mm.c:884:d0v14 pg_owner 0 l1e_owner 0, but real_pg_owner -1
 (XEN) mm.c:955:d0v14 Error getting mfn 1080000 (pfn ffffffffffffffff) from L1 
 (XEN) mm.c:1269:d0v14 Failure in alloc_l1_table: entry 0
 (XEN) mm.c:2175:d0v14 Error while validating mfn 188d903 (pfn 17a7cc) for type 
 (XEN) mm.c:3101:d0v14 Error -16 while pinning mfn 188d903
 [   33.768792] ------------[ cut here ]------------
WARNING: CPU: 14 PID: 1 at arch/x86/xen/multicalls.c:129 xen_mc_
 [   33.783809] Modules linked in:
 [   33.787304] CPU: 14 PID: 1 Comm: swapper/0 Not tainted 4.4.0 #1
 [   33.793991] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
 [   33.805624]  0000000000000081 ffff88017d2537c8 ffffffff812ff954 000000000000
 [   33.813961]  0000000000000000 0000000000000081 0000000000000000 ffff88017d25
 [   33.822300]  ffffffff810ca120 ffffffff81cb7f00 ffff8801879ca280 000000000000
 [   33.830639] Call Trace:
 [   33.833457]  [<ffffffff812ff954>] dump_stack+0x48/0x64
 [   33.839277]  [<ffffffff810ca120>] warn_slowpath_common+0x90/0xd0
 [   33.846058]  [<ffffffff810ca175>] warn_slowpath_null+0x15/0x20
 [   33.852659]  [<ffffffff81060133>] xen_mc_flush+0x1c3/0x1d0
 [   33.858858]  [<ffffffff8106449f>] xen_alloc_pte+0x20f/0x300
 [   33.865158]  [<ffffffff810beef5>] ? update_page_count+0x45/0x60
 [   33.871855]  [<ffffffff817a1194>] ? phys_pte_init+0x170/0x183
 [   33.878345]  [<ffffffff817a148d>] phys_pmd_init+0x2e6/0x389
 [   33.884649]  [<ffffffff817a17dd>] phys_pud_init+0x2ad/0x3dc
 [   33.890954]  [<ffffffff817a290d>] kernel_physical_mapping_init+0xec/0x211
 [   33.898613]  [<ffffffff8179df8d>] init_memory_mapping+0x17d/0x2f0
 [   33.905496]  [<ffffffff81104f11>] ? __raw_callee_save___pv_queued_spin_unloc
 [   33.914516]  [<ffffffff813643f7>] ? acpi_os_signal_semaphore+0x2e/0x32
 [   33.921889]  [<ffffffff810ba7b8>] arch_add_memory+0x48/0xf0
 [   33.928186]  [<ffffffff8179eb80>] add_memory_resource+0x80/0x110
 [   33.934967]  [<ffffffff8179ec8d>] add_memory+0x7d/0xc0
 [   33.940787]  [<ffffffff81399538>] acpi_memory_device_add+0x14f/0x237
 [   33.947963]  [<ffffffff81369a6d>] acpi_bus_attach+0xcb/0x166
 [   33.954359]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.960854]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.967350]  [<ffffffff81369acd>] acpi_bus_attach+0x12b/0x166
 [   33.973848]  [<ffffffff8136aff1>] acpi_bus_scan+0x5b/0x66
 [   33.979962]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   33.986450]  [<ffffffff81d32187>] acpi_scan_init+0x7d/0x1c4
 [   33.992755]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   33.999248]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   34.005747]  [<ffffffff81d3204a>] acpi_init+0x246/0x282
 [   34.011659]  [<ffffffff81d31e04>] ? acpi_early_init+0xeb/0xeb
 [   34.018156]  [<ffffffff810020b1>] do_one_initcall+0x81/0x1e0
 [   34.024557]  [<ffffffff81cf5c06>] kernel_init_freeable+0x19d/0x238
 [   34.031542]  [<ffffffff81cf5ca1>] ? kernel_init_freeable+0x238/0x238
 [   34.038711]  [<ffffffff8179d490>] ? rest_init+0x80/0x80
 [   34.044626]  [<ffffffff8179d499>] kernel_init+0x9/0xe0
 [   34.050450]  [<ffffffff817aa3cf>] ret_from_fork+0x3f/0x70
 [   34.056552]  [<ffffffff8179d490>] ? rest_init+0x80/0x80
 [   34.062475] ---[ end trace 854dae1bef359299 ]---
============================================================================================

You can get more information in 'error_log.txt'.

Any ideas?
I don't know the original intention of this patch, so simply sending a revert patch to fix the issue is not a good choice.
Maybe you have a better solution.
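
For context, my rough understanding of what that commit does (a simplified sketch of the clipping idea taken from the commit subject, not the exact upstream diff; the function and parameter names below are placeholders):

 /*
  * Simplified sketch: while building dom0's pseudo-physical E820 map,
  * RAM regions lying above dom0's maximum reservation are dropped (or
  * trimmed) instead of being kept, so the kernel never tries to map
  * or hotplug-add them later.  Placeholder names, not the real code
  * from f5775e0b.
  */
 static void clip_ram_above_reservation(struct e820entry *map, u32 *nr,
                                        unsigned long max_reservation_pfn)
 {
         u64 limit = (u64)max_reservation_pfn << PAGE_SHIFT;
         u32 i, out = 0;

         for (i = 0; i < *nr; i++) {
                 struct e820entry *e = &map[i];

                 if (e->type == E820_RAM && e->addr >= limit)
                         continue;                    /* discard entirely */
                 if (e->type == E820_RAM && e->addr + e->size > limit)
                         e->size = limit - e->addr;   /* trim the tail */
                 map[out++] = *e;
         }
         *nr = out;
 }

If that is roughly what happens, the E820 ranges backing this machine's ACPI memory-hotplug regions would no longer be visible to dom0, which might be related to why the add_memory() path in the trace above trips over an unmapped pfn.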

Liang


error_log.txt

(XEN) Bad console= option '8n1'

8n1 should be part of com1= or com2=, rather than console=
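
i.e. something along these lines (standard Xen serial-console syntax; adjust the port and baud rate to your setup):

    com1=115200,8n1 console=com1

rather than folding the 8n1 into console=.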

 Xen 4.7-unstable
(XEN) Xen version 4.7-unstable (build@) (gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)) debug=y Thu Jan 21 23:21:32 EST 2016
(XEN) Latest ChangeSet: Tue Jan 19 17:47:19 2016 +0000 git:1949868-dirty
(XEN) Console output is synchronous.
(XEN) Bootloader: GNU GRUB 0.97
(XEN) Command line: dom0_mem=4096M loglvl=all guest_loglvl=all unrestricted_guest=1 msi=1 console=com1,115200,8n1 sync_console hap_1gb=1 conring_size=128M iommu=on,intpost psr=cmt ps
[20;80Hr=cat psr=cdp

This is very hard to read with the VT escape characters still present.  However, you probably meant dom0_mem=4096M:max=4096M, or dom0 gets all the remaining RAM.
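
For reference, the xen-command-line documentation spells this as a comma-separated list with a max: keyword, so the clamped form would be something like

    dom0_mem=4096M,max:4096M

(worth double-checking against the Xen version in use).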

Having said that, giving dom0 all the RAM should work, and...

 [   33.656695] ACPI: NR_CPUS/possible_cpus limit of 64 reached.  Processor 99/0
 [   33.665648] ACPI: Unable to map lapic to logical cpu number
 (XEN) mm.c:884:d0v14 pg_owner 0 l1e_owner 0, but real_pg_owner -1
 (XEN) mm.c:955:d0v14 Error getting mfn 1080000 (pfn ffffffffffffffff) from L1 
 (XEN) mm.c:1269:d0v14 Failure in alloc_l1_table: entry 0
 (XEN) mm.c:2175:d0v14 Error while validating mfn 188d903 (pfn 17a7cc) for type 
 (XEN) mm.c:3101:d0v14 Error -16 while pinning mfn 188d903

This is a -EBUSY.  Is there anything magic about mfn 188d903?  It just looks like plain RAM in the E820 table.

Have you got dom0 configured to use linear p2m mode?  Without it, dom0 can only have a maximum of 512GB of RAM.
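
(Where the 512GB figure comes from, as I understand the old 3-level p2m layout: each level is one 4KiB page holding 512 eight-byte entries, so three levels cover 512^3 = 2^27 frames, and 2^27 * 4KiB = 512GiB.)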

~Andrew
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

