Xen project Mailing List

[Xen-devel] Debugging a weird hardware fault.

To: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 28 Jul 2011 20:53:28 +0100

Delivery-date: Thu, 28 Jul 2011 12:54:01 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hello,

I am trying to debug an issue which appears on the surface as "run shutdown -h +0 in dom0 and the machine reboots".

As far as we can tell, the issue reproduces only on a Supermicro X8DT6 motherboard (http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm); we can't reproduce it on any other hardware. It occurs on both Xen 3.4 and Xen 4.1. The debugging described below is specifically against 3.4. It reproduces irrespective of the number of CPUs and irrespective of IOMMU utilization. For all tests, the server is being run with maxcpus=1 on the Xen command line and no domUs at all.

Tracing the path of execution, Xen is getting the XENPF_enter_acpi_sleep platform op and acting on it correctly, going down the ACPI S5 codepath. My assumption is that the reboot is caused by a triple fault, as the server reboots before it actually writes to the PM1A register (except for the case where it actually works, at which point it writes correctly and properly shuts down). There is no indication on the serial console of a fault or double fault.

My method of tracing is

    #define SERIAL_CHAR(ch) __asm__ __volatile__ ("mov %0, %%al\n\t" \
                            "mov $0x3f8, %%dx\n\t" \
                            "out %%al,%%dx\n\t" :: "g"(ch) : "%ax", "%dx");

scattered over the codebase.

The fault itself is time dependent: it occasionally works when the shutdown code spends very little time in get_cmos_time. Waiting at certain points, and in particular inserting

    for( i=0; i < 10; ++i)
    {
        SERIAL_CHAR('*');
        mdelay(1000);
    }

in the XENPF_enter_acpi_sleep case statement, shows that the triple fault occurs reliably 5 seconds after the hypercall, and in otherwise safe code.

I SERIAL_CHAR'd the entry and exit of the NMI handler, which shows that the triple fault is not caused by the NMI watchdog, which I thought might be having an effect. While waiting to print '*' every second, the serial console buffer continues to be written to the UART, showing that other tasks are going on while XENPF_enter_acpi_sleep is being serviced.

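For readers who want to try the same kind of tracing, here is a minimal sketch of a polled variant of the single-character trace described above. It is not the code from this post. It assumes a standard 16550-compatible UART at COM1 (0x3f8), where the line status register sits at base+5 and bit 5 (THRE) signals that the transmitter can accept a byte. The `port_io` hooks, `serial_trace_char` and the `fake_*` helpers are hypothetical names introduced only so the logic can be exercised in user space; inside a hypervisor the hooks would be the real inb/outb primitives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COM1_BASE 0x3f8u  /* same port as the SERIAL_CHAR macro */
#define UART_THR  0u      /* transmit holding register (write) */
#define UART_LSR  5u      /* line status register (read) */
#define LSR_THRE  0x20u   /* transmit holding register empty */

/* Hypothetical port-I/O hooks, so the logic can run outside ring 0. */
struct port_io {
    uint8_t (*in8)(uint16_t port);
    void (*out8)(uint16_t port, uint8_t value);
};

/*
 * Emit one trace character, polling LSR until the UART is ready.
 * Returns 0 on success, or -1 if the UART is still busy after
 * max_spins retries, so that tracing can never hang the caller.
 */
static int serial_trace_char(const struct port_io *io, char ch,
                             unsigned max_spins)
{
    assert(io && io->in8 && io->out8);
    while (!(io->in8(COM1_BASE + UART_LSR) & LSR_THRE)) {
        if (max_spins-- == 0)
            return -1;
    }
    io->out8(COM1_BASE + UART_THR, (uint8_t)ch);
    return 0;
}

/* User-space stand-in for the hardware: busy for the first few polls. */
static unsigned fake_busy_polls;
static char fake_log[64];
static size_t fake_len;

static uint8_t fake_in8(uint16_t port)
{
    if (port == COM1_BASE + UART_LSR) {
        if (fake_busy_polls > 0) {
            fake_busy_polls--;
            return 0;          /* transmitter not ready yet */
        }
        return LSR_THRE;       /* ready to accept a byte */
    }
    return 0;
}

static void fake_out8(uint16_t port, uint8_t value)
{
    if (port == COM1_BASE + UART_THR && fake_len < sizeof(fake_log) - 1)
        fake_log[fake_len++] = (char)value;
}
```

Polling THRE before the write avoids dropping trace characters when several are emitted back to back, and the bounded spin keeps the trace itself from perturbing a timing-sensitive path more than necessary.
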
The server itself is otherwise totally stable, running PV, HVM (and some bodged pv-on-hvm container for FreeBSD), along with performing SR-IOV from 8 NICs with 40 VFs each.

I have a workaround: removing the call to time_suspend(), at which point prodding the PM1A register happens reliably before whatever causes the triple fault later. However, this is not a suitable solution for the S3 codepath, which suffers from the same problem but really does need to run time_suspend().

My questions to the Xen community are: what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is in action, and more generally, how can I go about debugging which tasks are being run?

Thanks in advance for any advice/tips

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
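As background to the PM1A write mentioned in the message, the following is a rough sketch of how the value written to the PM1a control register is usually composed, based on the ACPI register layout (SLP_TYP in bits 10 to 12, SLP_EN in bit 13). It is not Xen's code. The function name `pm1_sleep_value` is hypothetical, and the SLP_TYP value for a given sleep state is platform specific: it comes from the firmware's \_S5 (or \_S3) object, so the numbers used with it here are placeholders.

```c
#include <assert.h>
#include <stdint.h>

#define PM1_SLP_TYP_SHIFT 10u
#define PM1_SLP_TYP_MASK  (0x7u << PM1_SLP_TYP_SHIFT) /* bits 10..12 */
#define PM1_SLP_EN        (1u << 13)                  /* triggers the transition */

/*
 * Compose the PM1 control value for entering a sleep state.
 * 'current' is the value read back from the register, so that
 * unrelated bits (for example SCI_EN in bit 0) are preserved.
 * 'slp_typ' is the 3-bit, platform-specific sleep type.
 */
static uint16_t pm1_sleep_value(uint16_t current, uint8_t slp_typ)
{
    uint32_t v = current;

    assert(slp_typ <= 7u);
    v &= ~(PM1_SLP_TYP_MASK | PM1_SLP_EN);
    v |= ((uint32_t)slp_typ << PM1_SLP_TYP_SHIFT) | PM1_SLP_EN;
    return (uint16_t)v;
}
```

With a placeholder sleep type of 7 and only SCI_EN set in the current value, `pm1_sleep_value(0x0001, 7)` yields 0x3C01. Writing such a value is the last step of the S5 path, which is why a reset that arrives before that write never reaches a clean power-off.
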
