[Xen-devel] DomU crash during migration when suspending source domain

To: <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-devel] DomU crash during migration when suspending source domain
From: "Graham, Simon" <Simon.Graham@xxxxxxxxxxx>
Date: Tue, 13 Feb 2007 22:42:15 -0500

I have just run into an odd DomU crash while live-migrating a 4-VCPU domain
(with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The actual
panic is attached at the end of this mail, but the bottom line is that
cache_remove_shared_cpu_map (in arch/i386/kernel/cpu/intel_cacheinfo.c) is
attempting to clean up the cache info for a processor that does not yet have
this info set up: the code dereferences a pointer in the cpuid4_info[]
array, and looking at the dump I can see that this pointer is NULL.
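
The fault address of 0x10 looks consistent with this. Reading the Code: bytes
in the oops at the end of this mail, the function appears to compute
cpuid4_info[cpu] + index * 20 and then load from offset 0x10 of that entry.
With a NULL base and index 0, that load touches address 0x10. The sketch below
only redoes that arithmetic in user space. The 20-byte entry size and the 0x10
offset are inferred from the disassembly, not taken from the kernel headers,
and the helper name is hypothetical.

```c
#include <stdint.h>

/* Both values are assumptions read off the oops disassembly. */
#define ENTRY_SIZE   20u    /* lea eax,[edx+edx*4] ; lea eax,[edx+eax*4] */
#define FIELD_OFFSET 0x10u  /* mov eax,[eax+0x10] is the faulting load   */

/* Hypothetical helper: the address the faulting load would touch. */
static uintptr_t faulting_address(uintptr_t base, unsigned int index)
{
    return base + (uintptr_t)index * ENTRY_SIZE + FIELD_OFFSET;
}
```

With base 0 and index 0 this gives 0x10, which matches both edi in the
register dump and the reported virtual address 00000010.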

My working theory is that we attempted the migration very early, before the
array of cache info pointers had been initialized for all processors. It would
be relatively easy to protect against this by checking for NULL, but I'm not
sure whether this is the correct solution. If anyone is familiar with this
code and can comment on an appropriate fix, I'd be grateful.
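
To make the NULL-check idea concrete, here is a minimal user-space sketch of
what such a guard might look like. It is not the kernel code: struct
cache_leaf, the unsigned long bitmask and the function name
remove_shared_cpu_map are simplified stand-ins chosen for illustration, and
only the cpuid4_info[] array name and its 32 entries come from the dumps in
this mail.

```c
#include <stddef.h>

#define NR_CPUS 32  /* matches the 32-entry arrays in the crash dumps */

/* Simplified stand-in for the kernel's per-leaf cache descriptor. */
struct cache_leaf {
    unsigned long shared_cpu_map;  /* bit n set: CPU n shares this cache */
};

/* One pointer per CPU; NULL until that CPU's cache info has been set up. */
static struct cache_leaf *cpuid4_info[NR_CPUS];

/*
 * Remove 'cpu' from the shared map of every sibling for cache leaf 'index'.
 * Returns 0 on success, -1 if this CPU never had its cache info set up.
 */
static int remove_shared_cpu_map(unsigned int cpu, unsigned int index)
{
    struct cache_leaf *this_leaf;
    unsigned int sibling;

    /* The guard under discussion: nothing to clean up for this CPU. */
    if (cpu >= NR_CPUS || cpuid4_info[cpu] == NULL)
        return -1;

    this_leaf = &cpuid4_info[cpu][index];
    for (sibling = 0; sibling < NR_CPUS; sibling++) {
        if (!(this_leaf->shared_cpu_map & (1UL << sibling)))
            continue;
        /* A sibling may not have been initialized yet either. */
        if (cpuid4_info[sibling] == NULL)
            continue;
        cpuid4_info[sibling][index].shared_cpu_map &= ~(1UL << cpu);
    }
    return 0;
}
```

In this sketch a CPU whose entry is still NULL is skipped rather than
dereferenced. Whether silently skipping is the right behaviour in the real
hotplug path, as opposed to fixing the ordering so that the entry always
exists, is exactly the open question.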

One thing I'm really not sure about is the timing of marking the CPUs up with
respect to the trace messages about initializing CPUs (see console output
below). I can see that the four VCPUs are set up in the cpu_sys_devices array
(which is set up by the code that outputs the 'Initializing CPU#n' trace), but
the array of cache info structures only has an entry for VCPU 0:

crash> cpu_sys_devices
cpu_sys_devices = $3 =
 {0xc0464448, 0xc046448c, 0xc04644d0, 0xc0464514, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

crash> cpuid4_info
cpuid4_info = $4 =
 {0xc7971180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}

Any suggestions for appropriate fixes here?
Simon

--- console output ---

Enabling SMP...
Initializing CPU#3
Initializing CPU#2
Initializing CPU#1
eth0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c010dd3a
0204a000 -> *pde = 00000001:0d8ec001
06a9c000 -> *pme = 00000000:00000000
Oops: 0000 [#1]
SMP 
Modules linked in: ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core 
binfmt_misc dm_mirror dm_mod bnx2 ata_piix libata mptscsih mptfc mptspi mptsas 
mptscsi scsi_mod mptbase
CPU:    0
EIP:    0061:[<c010dd3a>]    Tainted: GF    VLI
EFLAGS: 00010202  (2.6.16.29-xen #1) 
EIP is at cache_remove_shared_cpu_map+0x1a/0x90
eax: 00000000  ebx: 00000001  ecx: 00000001  edx: 00000000
esi: 00000000  edi: 00000010  ebp: c3913f14  esp: c3913f08
ds: 007b  es: 007b  ss: 0069
Process suspend (pid: 4038, threadinfo=c3912000 task=c2244570)
Stack: <0>00000001 00000001 00000000 c3913f28 c010e3ba 00000007 00000001 
00000007 
      c3913f34 c010e425 c03bd804 c3913f48 c012fae8 ffffffea 00000001 c568c570 
      c3913f7c c013b889 c3913fc0 00000002 00000001 00000000 00000003 00000000 
Call Trace:
 [<c0105401>] show_stack_log_lvl+0xa1/0xe0
 [<c01055f1>] show_registers+0x181/0x200
 [<c0105810>] die+0x100/0x1a0
 [<c01156f6>] do_page_fault+0x3c6/0x8b1
 [<c0105067>] error_code+0x2b/0x30
 [<c010e3ba>] cache_remove_dev+0x2a/0x60
 [<c010e425>] cacheinfo_cpu_callback+0x35/0x40
 [<c012fae8>] notifier_call_chain+0x18/0x40
 [<c013b889>] cpu_down+0x139/0x260
 [<c028bc9f>] smp_suspend+0x7f/0x100
 [<c028ca80>] __do_suspend+0x40/0x180
 [<c0136a06>] kthread+0x96/0xe0
 [<c0102e95>] kernel_thread_helper+0x5/0x10
Code: 0c 5b 5e 5f 5d c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 89 d6 
53 89 c3 8d 04 92 8b 14 9d 20 4d 46 c0 8d 04 82 8d 78 10 <8b> 40 10 ba 20 00 00 
00 85 c0 74 03 0f bc d0 83 fa 21 b9 20 00 

-and-

crash> bt
PID: 4038   TASK: c2244570  CPU: 0   COMMAND: "suspend"
 #0 [c3913ddc] xen_panic_event at c010a527
 #1 [c3913df8] notifier_call_chain at c012fae6
 #2 [c3913e0c] panic at c0120b16
 #3 [c3913e20] die at c0105866
 #4 [c3913e6c] do_page_fault at c01156f1
 #5 [c3913ed0] error_code (via page_fault) at c0105065
    EAX: 00000000  EBX: 00000001  ECX: 00000001  EDX: 00000000  EBP: c3913f14
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000010
    CS:  0061      EIP: c010dd3a  ERR: ffffffff  EFLAGS: 00010202
 #6 [c3913f04] cache_remove_shared_cpu_map at c010dd3a
 #7 [c3913f18] cache_remove_dev at c010e3b5
 #8 [c3913f2c] cacheinfo_cpu_callback at c010e420
 #9 [c3913f38] notifier_call_chain at c012fae6
#10 [c3913f4c] cpu_down at c013b884
#11 [c3913f80] smp_suspend at c028bc9a
#12 [c3913f98] __do_suspend at c028ca7b
#13 [c3913fc4] kthread at c0136a03
#14 [c3913fe8] kernel_thread_helper at c0102e93
crash>
