Thanks for your reply.
I installed the debug hypervisor and now have a new crash dump.
I must confess I have little to no experience debugging crash dumps, but this seems to be a different kind of error, or at least the error is reached in a different way.
The pattern of “restore” lines and “page number X invalid” messages repeats for all preceding domains visible in the dump.
[…]
(XEN) memory.c:269:d164v0 Domain 164 page number 54fc invalid
(XEN) memory.c:269:d164v0 Domain 164 page number 54fd invalid
(XEN) grant_table.c:1491:d164v0 Expanding dom (164) grant table from (4) to (32) frames.
(XEN) Dom164 callback via changed to GSI 28
(XEN) HVM165 restore: VM saved on one CPU (0x206c2) and restored on another (0x106a5).
(XEN) HVM165 restore: CPU 0
(XEN) HVM165 restore: PIC 0
(XEN) HVM165 restore: PIC 1
(XEN) HVM165 restore: IOAPIC 0
(XEN) HVM165 restore: LAPIC 0
(XEN) HVM165 restore: LAPIC_REGS 0
(XEN) HVM165 restore: PCI_IRQ 0
(XEN) HVM165 restore: ISA_IRQ 0
(XEN) HVM165 restore: PCI_LINK 0
(XEN) HVM165 restore: PIT 0
(XEN) HVM165 restore: RTC 0
(XEN) HVM165 restore: HPET 0
(XEN) HVM165 restore: PMTIMER 0
(XEN) HVM165 restore: MTRR 0
(XEN) HVM165 restore: VMCE_VCPU 0
(XEN) HVM165 restore: TSC_ADJUST 0
(XEN) memory.c:269:d165v0 Domain 165 page number 54de invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54df invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e0 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e1 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e2 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e3 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e4 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e5 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e6 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e7 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e8 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54e9 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54ea invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54eb invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54ec invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54ed invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54ee invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54ef invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f0 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f1 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f2 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f3 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f4 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f5 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f6 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f7 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f8 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54f9 invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54fa invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54fb invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54fc invalid
(XEN) memory.c:269:d165v0 Domain 165 page number 54fd invalid
(XEN) grant_table.c:1491:d165v0 Expanding dom (165) grant table from (4) to (32) frames.
(XEN) Dom165 callback via changed to GSI 28
(XEN) Debugging connection not set up.
(XEN) ----[ Xen-4.6.1 x86_64 debug=y Not tainted ]----
(XEN) CPU: 6
(XEN) RIP: e008:[<ffff82d0801fd23a>] vmx_vmenter_helper+0x27e/0x30a
(XEN) RFLAGS: 0000000000010003 CONTEXT: hypervisor
(XEN) rax: 000000008005003b rbx: ffff8300e72fc000 rcx: 0000000000000000
(XEN) rdx: 0000000000006c00 rsi: ffff830617fd7fc0 rdi: ffff8300e6fc0000
(XEN) rbp: ffff830617fd7c40 rsp: ffff830617fd7c30 r8: 0000000000000000
(XEN) r9: ffff830be8dc9310 r10: 0000000000000000 r11: 00003475e9cf85d0
(XEN) r12: 0000000000000006 r13: ffff830c14ee1000 r14: ffff8300e6fc0000
(XEN) r15: ffff830617fd0000 cr0: 000000008005003b cr4: 00000000000026e0
(XEN) cr3: 00000001bd665000 cr2: 0000000004510000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff830617fd7c30:
(XEN) ffff830617fd7c40 ffff8300e72fc000 ffff830617fd7ca0 ffff82d080174f91
(XEN) ffff830617fd7f18 ffff830be8dc9000 0000000000000286 ffff830617fd7c90
(XEN) 0000000000000206 0000000000000246 0000000000000001 ffff830617e91250
(XEN) ffff8300e72fc000 ffff830be8dc9000 ffff830617fd7cc0 ffff82d080178c19
(XEN) 0000000000bdeeae ffff8300e72fc000 ffff830617fd7cd0 ffff82d080178c3e
(XEN) ffff830617fd7d20 ffff82d080179740 ffff8300e6fc2000 ffff830c17e38e80
(XEN) ffff830617e91250 ffff820080000000 0000000000000002 ffff830617e91250
(XEN) ffff830617e91240 ffff830be8dc9000 ffff830617fd7d70 ffff82d080196152
(XEN) ffff830617fd7d50 ffff82d0801f7c6b ffff8300e6fc2000 ffff830617e91250
(XEN) ffff8300e6fc2000 ffff830617e91250 ffff830617e91240 ffff830be8dc9000
(XEN) ffff830617fd7d80 ffff82d080244a62 ffff830617fd7db0 ffff82d0801d3fe2
(XEN) ffff8300e6fc2000 0000000000000000 ffff830617e91f28 ffff830617e91000
(XEN) ffff830617fd7dd0 ffff82d080175c2c ffff8300e6fc2000 ffff8300e6fc2000
(XEN) ffff830617fd7e00 ffff82d080105dd4 ffff830c17e38040 0000000000000000
(XEN) 0000000000000000 ffff830617fd0000 ffff830617fd7e30 ffff82d0801215fd
(XEN) ffff8300e6fc0000 ffff82d080329280 ffff82d080328f80 fffffffffffffffd
(XEN) ffff830617fd7e60 ffff82d08012caf8 0000000000000006 ffff830c17e3bc60
(XEN) 0000000000000002 ffff830c17e3bbe0 ffff830617fd7e70 ffff82d08012cb3b
(XEN) ffff830617fd7ef0 ffff82d0801c23a8 ffff8300e72fc000 ffffffffffffffff
(XEN) ffff82d0801f3200 ffff830617fd7f08 ffff82d080329280 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82d0801fd23a>] vmx_vmenter_helper+0x27e/0x30a
(XEN) [<ffff82d080174f91>] __context_switch+0xdb/0x3b5
(XEN) [<ffff82d080178c19>] __sync_local_execstate+0x5e/0x7a
(XEN) [<ffff82d080178c3e>] sync_local_execstate+0x9/0xb
(XEN) [<ffff82d080179740>] map_domain_page+0xa0/0x5d4
(XEN) [<ffff82d080196152>] destroy_perdomain_mapping+0x8f/0x1e8
(XEN) [<ffff82d080244a62>] free_compat_arg_xlat+0x26/0x28
(XEN) [<ffff82d0801d3fe2>] hvm_vcpu_destroy+0x73/0xb0
(XEN) [<ffff82d080175c2c>] vcpu_destroy+0x5d/0x72
(XEN) [<ffff82d080105dd4>] complete_domain_destroy+0x49/0x192
(XEN) [<ffff82d0801215fd>] rcu_process_callbacks+0x19a/0x1fb
(XEN) [<ffff82d08012caf8>] __do_softirq+0x82/0x8d
(XEN) [<ffff82d08012cb3b>] process_pending_softirqs+0x38/0x3a
(XEN) [<ffff82d0801c23a8>] mwait_idle+0x10c/0x315
(XEN) [<ffff82d080174825>] idle_loop+0x51/0x6b
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 6:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Debugging connection not set up.
(XEN) Executing kexec image on cpu6
(XEN) Shot down all CPUs
bt gives a longer backtrace for CPU 6, with an additional call (#11 [ffff830617fd7d38] vmx_vcpu_update_eptp at ffff82d0801f7c6b) between #12 (free_compat_arg_xlat) and #10 (destroy_perdomain_mapping).
According to the source code of free_compat_arg_xlat this additional call should never happen, which baffles me (unless bt picked up a stale return address left over on the stack).
PCPU: 6 VCPU: ffff8300e72fc000
#0 [ffff830617fd7a90] kexec_crash at ffff82d080115bb9
#1 [ffff830617fd7ab0] panic at ffff82d080144202
#2 [ffff830617fd7b20] do_invalid_op at ffff82d0801a2bba
#3 [ffff830617fd7b30] pmt_update_time at ffff82d0801e0c88
#4 [ffff830617fd7b80] handle_exception_saved at ffff82d08024e5d0
#5 [ffff830617fd7c08] vmx_vmenter_helper at ffff82d0801fd23a
#6 [ffff830617fd7c48] __context_switch at ffff82d080174f91
#7 [ffff830617fd7ca8] __sync_local_execstate at ffff82d080178c19
#8 [ffff830617fd7cc8] sync_local_execstate at ffff82d080178c3e
#9 [ffff830617fd7cd8] map_domain_page at ffff82d080179740
#10 [ffff830617fd7d28] destroy_perdomain_mapping at ffff82d080196152
#11 [ffff830617fd7d38] vmx_vcpu_update_eptp at ffff82d0801f7c6b
#12 [ffff830617fd7d78] free_compat_arg_xlat at ffff82d080244a62
#13 [ffff830617fd7d88] hvm_vcpu_destroy at ffff82d0801d3fe2
#14 [ffff830617fd7db8] vcpu_destroy at ffff82d080175c2c
#15 [ffff830617fd7dd8] complete_domain_destroy at ffff82d080105dd4
#16 [ffff830617fd7e08] rcu_process_callbacks at ffff82d0801215fd
#17 [ffff830617fd7e38] __do_softirq at ffff82d08012caf8
#18 [ffff830617fd7e68] process_pending_softirqs at ffff82d08012cb3b
#19 [ffff830617fd7e78] mwait_idle at ffff82d0801c23a8
#20 [ffff830617fd7e90] vmx_intr_assist at ffff82d0801f3200
#21 [ffff830617fd7ef8] idle_loop at ffff82d080174825
#22 [ffff830617fd7f00] do_softirq at ffff82d08012cb50
Instead, vmx_vcpu_update_eptp should be called before free_compat_arg_xlat, via
hvm_vcpu_destroy -> altp2m_vcpu_destroy -> altp2m_vcpu_update_p2m -> hvm_funcs.altp2m_vcpu_update_p2m
with the hook set up in vmx.c by
.altp2m_vcpu_update_p2m = vmx_vcpu_update_eptp
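For reference, this is how I read the relevant 4.6.1 code (abridged and paraphrased, so treat it as a sketch rather than a verbatim quote of the sources):

/* xen/arch/x86/x86_64/mm.c: free_compat_arg_xlat() is a thin wrapper
 * around destroy_perdomain_mapping() and contains no call into vmx
 * code at all, which is why frame #11 above looks impossible to me. */
void free_compat_arg_xlat(struct vcpu *v)
{
    destroy_perdomain_mapping(v->domain, ARG_XLAT_START(v),
                              PFN_UP(COMPAT_ARG_XLAT_SIZE));
}

/* xen/arch/x86/hvm/hvm.c: the legitimate route to vmx_vcpu_update_eptp()
 * runs before free_compat_arg_xlat() in the vcpu teardown path. */
void hvm_vcpu_destroy(struct vcpu *v)
{
    altp2m_vcpu_destroy(v);  /* -> altp2m_vcpu_update_p2m(v)
                              *    -> hvm_funcs.altp2m_vcpu_update_p2m(v)
                              *       == vmx_vcpu_update_eptp(v) */
    /* ... */
    free_compat_arg_xlat(v);
    /* ... */
}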
The vmcs pointer of the vcpu running on CPU 6 is 0x0, which would fit a vmwrite failing with an invalid VMCS pointer:
struct vcpu {
vcpu_id = 6,
processor = 6,
vcpu_info = 0x0,
domain = 0xffff830c14ee1000,
[...]
vmx = {
vmcs = 0x0,
[...]
The vcpus command gives:
VCID PCID VCPU ST T DOMID DOMAIN
> 0 0 ffff8300e7557000 RU I 32767 ffff830c14ee1000
> 1 1 ffff8300e75f2000 RU I 32767 ffff830c14ee1000
2 2 ffff8300e72fe000 RU I 32767 ffff830c14ee1000
> 3 3 ffff8300e75f1000 RU I 32767 ffff830c14ee1000
> 4 4 ffff8300e75f0000 RU I 32767 ffff830c14ee1000
> 5 5 ffff8300e72fd000 RU I 32767 ffff830c14ee1000
>* 6 6 ffff8300e72fc000 RU I 32767 ffff830c14ee1000
> 7 7 ffff8300e72fb000 RU I 32767 ffff830c14ee1000
> 0 2 ffff8300e72f9000 RU 0 0 ffff830c17e32000
1 3 ffff8300e72f8000 BL 0 0 ffff830c17e32000
2 5 ffff8300e755f000 BL 0 0 ffff830c17e32000
3 0 ffff8300e755e000 BL 0 0 ffff830c17e32000
4 6 ffff8300e755d000 BL 0 0 ffff830c17e32000
5 4 ffff8300e755c000 BL 0 0 ffff830c17e32000
6 7 ffff8300e755b000 BL 0 0 ffff830c17e32000
7 5 ffff8300e755a000 BL 0 0 ffff830c17e32000
0 1 ffff8300e6fc7000 BL U 162 ffff830bdee8f000
0 3 ffff8300e6fc9000 BL U 163 ffff830be20d3000
0 6 ffff8300e6fc0000 BL U 164 ffff830be8dc9000
0 0 ffff8300e6fc6000 BL U 165 ffff830bd0cc0000
So in contrast to the last dump, the crashing CPU is running a vcpu of DOMID 32767 (the idle domain rather than Dom0), if I understand the output correctly; that also fits the idle_loop frame at the bottom of the trace.
Kevin
From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
Sent: Friday, 29 July 2016 12:05
To: Mayer, Kevin <Kevin.Mayer@xxxxxxxx>; xen-devel@xxxxxxxxxxxxx
Subject: Re: [Xen-devel] Xen 4.6.1 crash with altp2m enabled by default
Hi guys
We are using Xen 4.6.1 to manage our virtual machines on x86-64-servers.
We start dozens of VMs and destroy them again after 60 seconds, which works fine as it is, but the next step in our approach requires the altp2m functionality.
Since libvirt does not pass the altp2m-enable flag to the hypervisor, we enabled altp2m unconditionally by patching hvm.c. Since all of our machines support altp2m, this seemed to be OK.
altp2m is emulated in software when hardware support isn't available, so it should work on all hardware (albeit with rather higher overhead).
     d->arch.hvm_domain.params[HVM_PARAM_HPET_ENABLED] = 1;
     d->arch.hvm_domain.params[HVM_PARAM_TRIPLE_FAULT_REASON] = SHUTDOWN_reboot;
+    d->arch.hvm_domain.params[HVM_PARAM_ALTP2M] = 1;
+
This looks to be ok, given your situation.
     vpic_init(d);
     rc = vioapic_init(d);
Since applying this patch, the hypervisor crashes after several hundred restarted VMs (without our using any altp2m functionality) with the following dmesg:
(XEN) ----[ Xen-4.6.1 x86_64 debug=n Not tainted ]----
As a start, please always use a debug hypervisor for investigating issues like this.
(XEN) CPU: 7
(XEN) RIP: e008:[<ffff82d0801f5a55>] vmx_vmenter_helper+0x2b5/0x340
(XEN) RFLAGS: 0000000000010003 CONTEXT: hypervisor (d0v3)
(XEN) rax: 000000008005003b rbx: ffff8300e7038000 rcx: 0000000000000008
(XEN) rdx: 0000000000006c00 rsi: ffff83062eb5e000 rdi: ffff8300e7038000
(XEN) rbp: ffff830c17e3f000 rsp: ffff830617fc7d70 r8: 0000000000000000
(XEN) r9: ffff83014f8d7028 r10: 000002700f858000 r11: 00002201be6861f0
(XEN) r12: ffff83062eb5e000 r13: ffff8300e752f000 r14: ffff82d08030ea40
(XEN) r15: 0000000000000007 cr0: 000000008005003b cr4: 00000000000026e0
(XEN) cr3: 00000001bf4da000 cr2: 00000000dd840c00
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff830617fc7d70:
(XEN) ffff8300e7038000 ffff82d080170c04 0000000000000000 0000000780109f6a
(XEN) ffff830617fc7f18 ffff83000000001e 0000000000000000 ffff8300e752f19c
(XEN) 0000000000000286 ffff8300e752f000 ffff8300e72fc000 0000000000000007
(XEN) ffff830c17e3f000 ffff830c14ee1000 ffff82d08030ea40 ffff82d080173d6a
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) ffff82d08030ea40 ffff8300e72fc000 000002700f481091 0000000000000001
(XEN) ffff82d080324560 ffff82d08030ea40 ffff8300e752f000 ffff82d080128004
(XEN) 0000000000000001 0000000001c9c380 ffff830c14ef60e8 0000000017fce600
(XEN) 0000000000000001 ffff82d0801bd18b ffff82d0801d9e88 ffff8300e752f000
(XEN) 0000000001c9c380 ffff82d08012e700 0000006e00000171 ffffffffffffffff
(XEN) ffff830617fc0000 ffff82d0802f8f80 00000000ffffffff ffff83062eb5e000
(XEN) ffff82d08030ea40 ffff82d08012b040 ffff8300e7038000 ffff830617fc0000
(XEN) ffff8300e7038000 00000000ffffffff ffff830c14ee1000 ffff82d080170970
(XEN) ffff8300e72fc000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000080550f50 00000000ffdffc70 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 000000002fcffe19
(XEN) 00000000ffdffc70 0000000000000000 00000000ffdffc50 00000000853b0918
(XEN) 000000fa00000000 00000000f0e48162 0000000000000000 0000000000000246
(XEN) 0000000080550f34 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000007 ffff8300e752f000
(XEN) Xen call trace:
(XEN) [<ffff82d0801f5a55>] vmx_vmenter_helper+0x2b5/0x340
(XEN) [<ffff82d080170c04>] __context_switch+0xb4/0x350
(XEN) [<ffff82d080173d6a>] context_switch+0xca/0xef0
(XEN) [<ffff82d080128004>] schedule+0x264/0x5f0
(XEN) [<ffff82d0801bd18b>] mwait_idle+0x25b/0x3a0
(XEN) [<ffff82d0801d9e88>] hvm_vcpu_has_pending_irq+0x58/0xc0
(XEN) [<ffff82d08012e700>] timer_softirq_action+0x80/0x250
(XEN) [<ffff82d08012b040>] __do_softirq+0x60/0x90
(XEN) [<ffff82d080170970>] idle_loop+0x20/0x50
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 7:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Executing kexec image on cpu7
(XEN) Shot down all CPUs
The RIP points to a ud2 instruction:
0xffff82d0801f5a55: ud2
From the RFLAGS we concluded that the vmwrite failed due to an invalid VMCS pointer (CF = 1), but this is where we are stuck, since we have no idea how the pointer could have become corrupted.
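To spell out the reasoning: per the Intel SDM, a failing VMWRITE sets CF=1 when there is no valid current VMCS (VMfailInvalid) and ZF=1 when there is one (VMfailValid), and Xen's __vmwrite() branches to a ud2 when either flag is set. A paraphrased sketch of that pattern (not the literal 4.6.1 source):

#include <stdbool.h>

static void vmwrite_sketch(unsigned long field, unsigned long value)
{
    bool ok;

    /* setnbe: ok = (CF == 0 && ZF == 0), i.e. the vmwrite succeeded. */
    asm volatile ( "vmwrite %[val], %[fld]\n\t"
                   "setnbe %[ok]"
                   : [ok] "=qm" (ok)
                   : [fld] "r" (field), [val] "r" (value)
                   : "cc" );

    if ( !ok )
        __builtin_trap();  /* emits ud2 -- the invalid-opcode trap above */
}

In the saved RFLAGS (0000000000010003), bit 0 (CF) is set and bit 6 (ZF) is clear, so this is VMfailInvalid, i.e. the current-VMCS pointer was invalid at the time of the vmwrite.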
crash> vcpu
gives vmcs = 0xffffffff817cbc20 for vcpu_id = 7,
and vcpus gives
VCID PCID VCPU ST T DOMID DOMAIN
0 0 ffff8300e75f2000 RU I 32767 ffff830c14ee1000
1 1 ffff8300e72fe000 RU I 32767 ffff830c14ee1000
2 2 ffff8300e7527000 RU I 32767 ffff830c14ee1000
> 3 3 ffff8300e7526000 RU I 32767 ffff830c14ee1000
4 4 ffff8300e75f1000 RU I 32767 ffff830c14ee1000
> 5 5 ffff8300e75f0000 RU I 32767 ffff830c14ee1000
> 6 6 ffff8300e72fd000 RU I 32767 ffff830c14ee1000
7 7 ffff8300e72fc000 RU I 32767 ffff830c14ee1000
0 0 ffff8300e72fa000 BL 0 0 ffff830c17e3f000
1 6 ffff8300e72f9000 BL 0 0 ffff830c17e3f000
2 3 ffff8300e72f8000 BL 0 0 ffff830c17e3f000
> 3 7 ffff8300e752f000 RU 0 0 ffff830c17e3f000
4 5 ffff8300e752e000 RU 0 0 ffff830c17e3f000
> 5 2 ffff8300e752d000 RU 0 0 ffff830c17e3f000
> 6 1 ffff8300e752c000 BL 0 0 ffff830c17e3f000
>* 7 0 ffff8300e752b000 RU 0 0 ffff830c17e3f000
0 4 ffff8300e7042000 OF U 127 ffff830475bbe000
> 0 4 ffff8300e7040000 RU U 128 ffff83062a7bc000
0 1 ffff8300e7038000 RU U 129 ffff83062eb5e000
0 5 ffff8300e703e000 BL U 130 ffff830475bd1000
Do you have any ideas about what could cause this crash, or how to proceed?
As a start, use a debug hypervisor. That will get you accurate backtraces, and you might get lucky and hit an earlier assertion. Can you identify which domain this vmcs should belong to, and whether it is in the process of being destroyed?
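For example (untested, but with the crash setup you already have, something along these lines should answer both; the addresses are placeholders to be taken from your vcpus listing):

crash> struct vcpu.domain <vcpu-address>
crash> struct domain.domain_id,is_dying <domain-address>

A non-zero is_dying for the owning domain would point towards a race with domain destruction.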
~Andrew