Re: [Xen-devel] Debugging a weird hardware fault.

On 29/07/11 08:10, Keir Fraser wrote:
> On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx> wrote:
>
>> Initially, an SMI was what I was thinking, but the triple fault occurs
>> whether you start bringing down CPUs or not.  While waiting 10 seconds in
>> the platform_op select statement, the fault still occurs when all CPUs are
>> still up, all IRQs still enabled and potentially domUs still up.  (Also,
>> from studying the Xen 3.4 code, I believe that interrupts are still actually
>> up during time_suspend(), but are soon brought down by lapic_suspend() later
>> in device_power_down().)
>>
>> Conversely, in the hacked-up case where I ditched most of the shared S3/S5
>> codepath and just hit the PM1A, the server correctly shut down and stayed
>> shut down, implying that the fault was caused by software (be it BIOS or OS)
>> rather than hardware.  From what I understand of the ACPI spec (and I claim
>> very little knowledge), there are a multitude of hardware events which could
>> bring the server out of S5, appearing as a triple fault, which would not be
>> affected by whether you had hit the PM1A register.
>>
>> In this specific example, dom0's regular shutdown code had already brought
>> down the domUs (of which there were none because we never started any), and
>> we were running with 1 CPU only so no others were up.  This opens up a whole
>> host of other possibilities which could be having an effect between the
>> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.
> Well I expect dom0 has done some going-to-sleep work that has left the
> platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
> register and actually finalising the shutdown.
>
> For example, it will have executed the _GTS ACPI method if there is one.
> That is supposed to happen immediately before writing PM1.SLP_EN, with no
> intervening interrupt activity or I/O. Obviously things don't work out quite
> like that when running on Xen!
>
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. e.g., do a hypercall just to prepare Xen for
> shutdown, but return back to dom0 in some limited environment to actually
> have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
> block that Xen should simply jump at to get the sleep to happen (where that
> code block would basically be dom0's acpi_enter_sleep() function). There are
> a few, somewhat distasteful, options that are more respectful of the ACPI
> spec than we are right now.
>
>  -- Keir
Just for information, this turned out to be a BIOS bug.  It was setting
a 6 second timer when executing _PTS, which triggered a system reset if
PM1{a,b} had not been hit when the timer expired.  As Xen does all of
its shutdown after the call to _PTS and before PM1{a,b}, there is a
significant time gap, which was falling foul of the timer in most cases.
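
To make the race concrete, here is a minimal model of the timing only.
It is a sketch, not the BIOS logic: the 6 second timeout comes from the
description above, while the function name and the example timings are
made up for illustration.

```python
# Timeout described above: the BIOS arms a timer when _PTS executes.
BIOS_RESET_TIMEOUT_S = 6.0


def reset_fires(pts_time_s, pm1_write_time_s, timeout_s=BIOS_RESET_TIMEOUT_S):
    """Return True if the BIOS timer expires before PM1{a,b} is written.

    pts_time_s       -- when _PTS ran (the timer is armed here)
    pm1_write_time_s -- when the PM1{a,b} write finally happens
    Both timestamps are illustrative inputs, not measurements.
    """
    return (pm1_write_time_s - pts_time_s) > timeout_s


# Hypothetical timings: a slow Xen shutdown between _PTS and the PM1 write
# loses the race, while a near-immediate write (as on native) does not.
slow_path_resets = reset_fires(0.0, 10.0)
fast_path_resets = reset_fires(0.0, 0.5)
```

Anything that stretches the gap between _PTS and the PM1 write past the
timeout ends in a reset rather than a power-off in this model.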

In this case, it seems likely that a BIOS fix can be done, as Supermicro
do provide a custom BIOS for the NetScaler box in question.

However, if anyone else comes across this issue, we did make a software
solution.  You can replace /etc/init.d/halt (or the equivalent for your
chosen dom0 distro) so that it kexec-reboots into a native kernel which
listens for a special command line parameter and calls
pm_power_off_prepare() and pm_power_off() after the ACPI module has
initialized[1].
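
As a rough sketch of the userspace half of that workaround, the
replacement halt script would need to issue something like the kexec
commands built below.  The kernel path, initrd path and parameter name
are all hypothetical placeholders, not what we actually shipped; only
the kexec options (-l, --initrd=, --append=, -e) are the standard ones.
The kernel-side handling of the parameter is not shown.

```python
# All three values are hypothetical placeholders for illustration.
NATIVE_KERNEL = "/boot/vmlinuz-native"
NATIVE_INITRD = "/boot/initrd-native.img"
POWEROFF_PARAM = "acpi_poweroff_now"


def kexec_commands(kernel=NATIVE_KERNEL, initrd=NATIVE_INITRD, extra_args="ro"):
    """Build the argv lists a halt script could run; nothing is executed."""
    append = f"{extra_args} {POWEROFF_PARAM}".strip()
    load = ["kexec", "-l", kernel, f"--initrd={initrd}", f"--append={append}"]
    execute = ["kexec", "-e"]  # jump into the previously loaded kernel
    return [load, execute]


def cmdline_requests_poweroff(cmdline):
    """Model of the check the native kernel makes on its command line."""
    return POWEROFF_PARAM in cmdline.split()


commands = kexec_commands()
```

Passing each list to subprocess.run() in order would perform the actual
kexec reboot, so only do that from the real halt path.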

This issue does, however, show that Xen itself is in breach of the ACPI
spec, which is a dangerous situation to be in given the fragility of
ACPI at the best of times.  In due course, I will put my mind to solving
the dom0-Xen ACPI interaction problems if the question is still open.

~Andrew Cooper

[1] Yes, this is a hack.  Sorry.  It's the easiest solution without
rewriting Xen.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel