[Xen-devel] BUG: unable to handle kernel NULL pointer dereference at IP: [<ffffffff8105ae4c>] process_one_work+
On Mon, Jun 13, 2011 at 07:20:34PM -0400, Scott Garron wrote:
> On 06/13/2011 06:03 PM, Konrad Rzeszutek Wilk wrote:
> >Can you do one more thing - bootup the same kernel as baremetal?
> >Without any Xen and with the same options .. and also with
> >/proc/interrupts so I can see what native Linux sees?
>
> The serial console plus cat /proc/interrupts pasted onto the end of it
> is here:

Thank you.

> http://pridelands.org/~simba/xen/hailstorm-fullserial20110613.txt

So IRQ 9 is correct. Somehow I thought that this:

[    1.646560] dc 0FF ACPI Warning: Large Reference Count (0x1FEA) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136398] ACPI Warning: Large Reference Count (0x1FE9) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136426] BUG: unable to handle kernel NULL pointer dereference at (null)
[    4.136436] IP: [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136459] PGD 0
[    4.136465] Oops: 0000 [#1] SMP
[    4.136475] CPU 0
[    4.136479] Modules linked in:
[    4.136485]
[    4.136492] Pid: 374, comm: kworker/0:1 Tainted: G        W   2.6.39+ #2 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[    4.136505] RIP: e030:[<ffffffff8105ae4c>]  [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136516] RSP: e02b:ffff88001eb4be40  EFLAGS: 00010046

(from http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt)

are related - as in the ACPI IRQ gets triggered, it does something (and it looks to make the ACPI parser complain about it), then puts some function on the workqueue, which dies trying to access ffff88001ebb3b80. It died, and whatever that function was supposed to do never completed.

I was thinking that IRQ 9 having the wrong polarity (which it has not) or trigger (which it has not) was causing this mayhem - but that is not the case. Sorry about wasting your time heading down this wrong path.

The boot process continues, the xen clocksource kicks in, and it does a hypercall .. and is probably looping between the hypercall, the xen upcall handler, and back. IRQ 9 is pending, so it hasn't been acknowledged by the Linux kernel. In fact, there are a couple of events that are stuck and are locally masked. That means 'spin_lock_irqsave' has been called and it masks the vcpu, but 'spin_unlock_irqrestore' has not - which could be due to process_one_work dying.

But the curious thing is that you have two CPUs assigned to Dom0, and while CPU0 looks to be bouncing back and forth, CPU1 is doing something. The RIP is 0xffffffff8108820c. Can you try to run this through System.map? Or the whole bunch of these:

ffffffff8108820c
ffffffff81088100
ffffffff810881a7
ffffffff8108811a
ffffffff816101a8
ffffffff81006c32
ffffffff816114a4
ffffffff8108803a
ffffffff8105f5bd
ffffffff81618564
ffffffff81617973
ffffffff816117a1
ffffffff81618560

The other idea is to limit Dom0 to running on only one CPU. You can do that by booting with 'dom0_max_vcpus=1 dom0_vcpus_pin' and seeing if it fails somewhere else. It will probably die at 0xffffffff810013aa :-(

But regardless of what I mentioned above, we need to find out why process_one_work got a toxic parameter. Can you disassemble 0xffffffff8105ae4c and see what it does and how it corresponds to 'process_one_work' in kernel/workqueue.c? You can also instrument the code to find out what:

1804         work_func_t f = work->func;

is.

Jeremy, any thoughts on what else might be afoot here?

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
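
For reference, below is a minimal sketch of the System.map lookup suggested above: for each RIP, find the closest symbol at or below the address and print the offset into it. It is illustrative only - the program name, the System.map path, and the build command are assumptions, not something taken from this thread; scanning System.map by hand, or addr2line against a vmlinux built with debug info, gives the same answer.

/*
 * symlookup.c -- a hypothetical helper (not from the thread) that resolves
 * kernel addresses against System.map.  System.map lines have the form
 * "<address> <type> <symbol>", sorted by address.
 *
 * Build/run (paths and kernel version are assumptions):
 *   gcc -o symlookup symlookup.c
 *   ./symlookup /boot/System.map-2.6.39+ ffffffff8108820c ffffffff81088100
 */
#include <stdio.h>
#include <stdlib.h>

struct sym {
	unsigned long long addr;
	char name[128];
};

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s System.map addr [addr ...]\n", argv[0]);
		return 1;
	}

	FILE *fp = fopen(argv[1], "r");
	if (!fp) {
		perror(argv[1]);
		return 1;
	}

	/* Read "<address> <type> <name>" lines into a growable array. */
	size_t nsyms = 0, cap = 4096;
	struct sym *syms = malloc(cap * sizeof(*syms));
	char line[512], type;
	while (syms && fgets(line, sizeof(line), fp)) {
		if (nsyms == cap) {
			cap *= 2;
			syms = realloc(syms, cap * sizeof(*syms));
			if (!syms)
				break;
		}
		if (sscanf(line, "%llx %c %127s",
			   &syms[nsyms].addr, &type, syms[nsyms].name) == 3)
			nsyms++;
	}
	fclose(fp);

	/* For each address, report the closest preceding symbol and offset. */
	for (int i = 2; i < argc; i++) {
		unsigned long long addr = strtoull(argv[i], NULL, 16);
		struct sym *best = NULL;

		for (size_t j = 0; j < nsyms; j++)
			if (syms[j].addr <= addr &&
			    (!best || syms[j].addr > best->addr))
				best = &syms[j];

		if (best)
			printf("%s -> %s+0x%llx\n",
			       argv[i], best->name, addr - best->addr);
		else
			printf("%s -> no matching symbol\n", argv[i]);
	}

	free(syms);
	return 0;
}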