[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split



Andre Przywara wrote:
On 02/10/2011 07:42 AM, Juergen Gross wrote:
On 02/09/11 15:21, Juergen Gross wrote:
Andre, George,


What seems to be interesting: I think the problem did always occur when
a new cpupool was created and the first cpu was moved to it.

I think my previous assumption regarding the master_ticker was not too bad.
I think somehow the master_ticker of the new cpupool is becoming active
before the scheduler is really initialized properly. This could happen, if
enough time is spent between alloc_pdata for the cpu to be moved and the
critical section in schedule_cpu_switch().

The solution should be to activate the timers only if the scheduler is
ready for them.

George, do you think the master_ticker should be stopped in suspend_ticker
as well? I still see potential problems for entering deep C-States. I think
I'll prepare a patch which will keep the master_ticker active for the
C-State case and migrate it for the schedule_cpu_switch() case.
Okay, here is a patch for this. It ran on my 4-core machine without any
problems.
Andre, could you give it a try?
Did, but unfortunately it crashed as always. Tried twice and made sure I booted the right kernel. Sorry. The idea with the race between the timer and the state changing sounded very appealing, actually that was suspicious to me from the beginning.

I will add some code to dump the state of all cpupools to the BUG_ON to see in which situation we are when the bug triggers.
OK, here is a first try of this, the patch iterates over all CPU pools and outputs some data if the BUG_ON
((sdom->weight * sdom->active_vcpu_count) > weight_left) condition triggers:
(XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
(XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
(XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
(XEN) Xen BUG at sched_credit.c:1010
....
The masks look proper (6 cores per node), the bug triggers when the first CPU is about to be(?) inserted.

HTH,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.