On 16/08/11 11:09, Jan Beulich wrote:
>>>> On 16.08.11 at 11:47, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> We have had a bug raised against Xen-3.4 that the kexec path fails, on
>> HP BL465c G7 blades. The problem does not reproduce on any other AMD
>> machines I have to hand.
>>
>> On further investigation, it appears that if the crashing cpu is #0,
>> then the kexec path hangs forever trying to grab the already locked
>> legacy_hpet_event.lock in hpet_disable_legacy_broadcast(). Removing the
>> lock/unlock pair causes the kexec crash path to work as expected.
> Are you sure it is locked (rather than never initialized)? The problem
> could be that hpet_broadcast_is_available() returns true because of
> num_hpets_used > 0, yet hpet_broadcast_init() didn't make it down
> to spin_lock_init(&legacy_hpet_event.lock).
That is a very good point. I had not considered it, and it turns out
that legacy broadcast is never set up:
(XEN) HPET: starting hpet_broadcast_init()
(XEN) HPET: hpet_setup() successful
(XEN) HPET: 4 timers in total, 3 timers will be used for broadcast
hpet_broadcast_init() exits inside the "if ( num_hpets_used > 0 )"
clause (as the boot dmesg doesn't printk the line immediately following
the if clause), meaning that legacy broadcasts are never set up.
Therefore, the logic
    if ( hpet_broadcast_is_available() )
        hpet_disable_legacy_broadcast();
in several places is wrong, and should be "if legacy HPET broadcast is
in use". Judging by the similarities in this regard between Xen-3.4 and
Xen-4.x, I am now not certain that Xen-4.x is immune and will now
proceed to investigate this.
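
As a rough illustration of the guard described above, here is a small
standalone model in C. It is only a sketch, not the Xen source: every
mock_* name, the legacy_in_use flag and the mock_lock type are
hypothetical stand-ins, and the availability check is modelled on the
behaviour discussed in this thread rather than copied from the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a spinlock; not the Xen spinlock_t. */
struct mock_lock { bool initialised; bool held; };

static struct mock_lock legacy_lock;  /* models legacy_hpet_event.lock */
static unsigned int num_hpets_used;   /* channels used for broadcast */
static bool legacy_in_use;            /* hypothetical "legacy set up" flag */
static bool legacy_disabled;          /* records that the disable path ran */

/* Models the two exits of hpet_broadcast_init() described in the thread. */
static void mock_broadcast_init(unsigned int channels)
{
    num_hpets_used = channels;
    if ( num_hpets_used > 0 )
        return;                       /* early exit: lock never initialised */

    legacy_lock.initialised = true;   /* models spin_lock_init() */
    legacy_in_use = true;
}

/* Assumed behaviour: true for either broadcast mode. */
static bool mock_broadcast_is_available(void)
{
    return legacy_in_use || num_hpets_used > 0;
}

/* The narrower question the crash path actually needs answered. */
static bool mock_legacy_broadcast_in_use(void)
{
    return legacy_in_use;
}

static void mock_disable_legacy_broadcast(void)
{
    /* Taking a lock that was never initialised is the suspected hang. */
    assert(legacy_lock.initialised);
    legacy_lock.held = true;
    legacy_disabled = true;
    legacy_lock.held = false;
}

/* Crash path with the proposed guard; returns true if legacy was disabled. */
static bool mock_crash_path(void)
{
    if ( !mock_legacy_broadcast_in_use() )
        return false;

    mock_disable_legacy_broadcast();
    return true;
}
```

With three channels in use, mock_broadcast_is_available() is true while
mock_legacy_broadcast_in_use() is false, so the guarded crash path never
touches the uninitialised lock.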
>> If the crashing cpu is not #0, then local_time_calibration() gets
>> worried and dumps the calibration data, and hangs at some later point
>> which I have yet to find. This hang happens while performing the NMI
>> shootdown of other cpus.
>>
>> The support engineer who raised the bug says that it doesn't occur with
>> Xen-4.1. Is there anything architecturally new in the Magny-Cours
>> processors which might explain this behavior?
> Possibly more a question of the surrounding platform, namely whether
> there are HPETs in the system, and whether they get used for the
> C-state broadcasting.
>
> Jan
>
Why would C-state broadcasting make a difference at this point? I have
narrowed the crash down a bit, and local_time_calibration() is dumping
its state after one_cpu_only() and before the shootdown actually
occurs. However, I can't see any code between these two points which
alters the state of the other CPU, which should still be running
normally at this point.
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel