
[Xen-devel] Debugging a weird hardware fault.



Hello,

I am trying to debug an issue which appears on the surface as "run
shutdown -h +0 in dom0 and the machine reboots".  The issue reproduces
on a Supermicro X8DT6 motherboard
(http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm)
only (as far as we can tell - we can't reproduce it on any other
hardware), on both Xen 3.4 and Xen 4.1.  The debugging described below
is specifically against Xen 3.4.

It reproduces irrespective of number of CPUs and irrespective of IOMMU
utilization.  For all tests, the server is being run with maxcpus=1 on
the Xen command-line and no domUs at all.

Tracing the path of execution, Xen is getting the XENPF_enter_acpi_sleep
platform op and acting on it correctly, going down the ACPI S5 codepath.

My assumption is that the reboot is caused by a triple fault, as the
server reboots before it actually writes to the PM1A register (except
for the case where it actually works, at which point it writes correctly
and properly shuts down).  There is no indication on the serial console
of a fault or double fault.
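For reference, the write that is never reached can be sketched as follows.
This is a minimal illustration of how a PM1 control value for entering a
sleep state is composed according to the ACPI specification (SLP_TYPx in
bits 10-12, SLP_EN in bit 13); it is not Xen's actual code.  The helper
name pm1_cnt_sleep_value is hypothetical, and the SLP_TYP value for S5 is
platform specific (it comes from the \_S5 object in the firmware tables),
so the numbers used with it here are placeholders only.

```c
#include <assert.h>
#include <stdint.h>

#define ACPI_SLP_TYP_SHIFT 10u
#define ACPI_SLP_TYP_MASK  (0x7u << ACPI_SLP_TYP_SHIFT) /* bits 10-12 */
#define ACPI_SLP_EN        (1u << 13)                   /* bit 13 */

/*
 * Hypothetical helper: given the current PM1 control register contents
 * and the platform's SLP_TYP value for the target state, return the
 * value whose write to PM1a_CNT would request the transition.
 * All other bits of the register are preserved.
 */
static uint16_t pm1_cnt_sleep_value(uint16_t current, uint8_t slp_typ)
{
    uint32_t v = current;

    assert(slp_typ <= 7);                     /* SLP_TYP is a 3-bit field */
    v &= ~(ACPI_SLP_TYP_MASK | ACPI_SLP_EN);  /* clear old type and enable */
    v |= (uint32_t)slp_typ << ACPI_SLP_TYP_SHIFT;
    v |= ACPI_SLP_EN;
    return (uint16_t)v;
}
```

Placing a trace marker immediately before and after the corresponding
port write is what distinguishes "rebooted before the write" from
"rebooted as a result of the write".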

My method of tracing is
#define SERIAL_CHAR(ch) __asm__ __volatile__ ("mov %0, %%al\n\t"       \
                                              "mov $0x3f8, %%dx\n\t"   \
                                              "out %%al,%%dx\n\t"      \
                                              :: "g"(ch) : "%ax", "%dx");
scattered over the codebase.
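
A possible refinement of this tracing approach, sketched below, is to poll
the UART's line status register until the transmit holding register is
empty before writing, so that markers emitted back to back are not lost.
It assumes a 16550-compatible UART at the same 0x3f8 base.  The struct
port_ops indirection and the fake_* functions are hypothetical and exist
only so the logic can be exercised in user space; in the hypervisor the
two callbacks would simply wrap inb/outb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COM1_BASE     0x3f8u /* same port as SERIAL_CHAR */
#define UART_THR      0u     /* transmit holding register offset */
#define UART_LSR      5u     /* line status register offset */
#define UART_LSR_THRE 0x20u  /* LSR bit 5: transmit holding register empty */

/* Hypothetical abstraction over port I/O, for illustration only. */
struct port_ops {
    uint8_t (*in8)(uint16_t port);
    void    (*out8)(uint16_t port, uint8_t val);
};

/*
 * Emit one trace character.  Returns 0 on success, or -1 if the UART
 * still reported "busy" after max_polls additional status reads.
 */
static int serial_trace_char(const struct port_ops *ops, char ch,
                             unsigned int max_polls)
{
    assert(ops != NULL && ops->in8 != NULL && ops->out8 != NULL);

    while (!(ops->in8(COM1_BASE + UART_LSR) & UART_LSR_THRE)) {
        if (max_polls-- == 0)
            return -1;          /* give up rather than spin forever */
    }
    ops->out8(COM1_BASE + UART_THR, (uint8_t)ch);
    return 0;
}

/* Fake UART used only to exercise the logic outside the hypervisor. */
static unsigned int fake_busy_polls; /* LSR reads that report "busy" */
static char fake_log[16];            /* characters "transmitted" */
static size_t fake_len;

static uint8_t fake_in8(uint16_t port)
{
    if (port == COM1_BASE + UART_LSR && fake_busy_polls > 0) {
        fake_busy_polls--;
        return 0;               /* THRE clear: transmitter busy */
    }
    return UART_LSR_THRE;
}

static void fake_out8(uint16_t port, uint8_t val)
{
    if (port == COM1_BASE + UART_THR && fake_len < sizeof(fake_log))
        fake_log[fake_len++] = (char)val;
}

static const struct port_ops fake_ops = { fake_in8, fake_out8 };
```

The bounded poll count is a deliberate choice: in a shutdown path that is
already misbehaving, a tracing helper that can spin forever would make the
symptoms harder to interpret.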


The fault itself is time dependent - it occasionally works when the
shutdown code spends very little time in get_cmos_time.

By waiting at certain points, and in particular by inserting:

     for ( i = 0; i < 10; ++i )
     {
         SERIAL_CHAR('*');
         mdelay(1000);
     }

in the XENPF_enter_acpi_sleep case statement, I can show that the triple
fault reliably occurs 5 seconds after the hypercall, and in otherwise safe
code.  I SERIAL_CHAR'd the entry and exit of the NMI handler, which
shows that the triple fault is not caused by the NMI watchdog, which I
thought might be having an effect.
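
Because the fault appears at a fixed offset from the hypercall, one
variation on the delay loop is to emit a different character each second
rather than '*', so that the last character on the console gives the
elapsed time directly.  The sketch below shows only the marker selection;
marker_for_second and fill_markers are hypothetical helpers, and in the
hypervisor each marker would be passed to SERIAL_CHAR and followed by
mdelay(1000) instead of being stored in a buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Map an elapsed-seconds count to one printable character (0-9, a-z). */
static char marker_for_second(unsigned int sec)
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";

    /* sizeof includes the trailing NUL, hence the "- 1"; wraps after 36. */
    return digits[sec % (sizeof(digits) - 1)];
}

/*
 * Store the markers for seconds [0, n) in buf and NUL-terminate it.
 * buf must have room for n + 1 bytes.
 */
static void fill_markers(char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < n; ++i)
        buf[i] = marker_for_second((unsigned int)i);
    buf[n] = '\0';
}
```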

While waiting to print '*' every second, the serial console buffer
continues to be written to the UART, showing that other tasks are going
on while XENPF_enter_acpi_sleep is being serviced.

The server itself is otherwise totally stable, running PV, HVM (and some
bodged pv-on-hvm container for FreeBSD), along with performing SR-IOV
from 8 NICs with 40 VFs each.  I have a workaround: removing the call
to time_suspend(), after which prodding the PM1A register happens
reliably before whatever causes the triple fault.  However, this
is not a suitable solution for the S3 codepath, which suffers from the
same problem but really does need to run time_suspend().

My questions to the Xen community are:

What (if any) new tasks get scheduled while XENPF_enter_acpi_sleep is
in progress and, more generally, how can I go about finding out which
tasks are being run?

Thanks in advance for any advice/tips.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


