
[Xen-devel] [RFC] x86/watchdog: Always disable watchdog before console_force_unlock()



Depending on the state of the conring and serial_tx_buffer,
console_force_unlock() can be a long-running operation, usually because of
serial_start_sync().

XenServer testing has found a reliable case where console_force_unlock() on
one PCPU takes long enough for another PCPU to time out due to the watchdog
(for example, while waiting for a TLB flush callin).

The watchdog timeout causes the second PCPU to repeat the
console_force_unlock(), at which point the first PCPU typically fails an
assertion in spin_unlock_irqrestore(&port->tx_lock) (because the tx_lock has
been unlocked behind its back).

console_force_unlock() is only on emergency paths, so one way or another the
host is going down.  Disable the watchdog before forcing the console lock to
help prevent PCPUs competing with each other to bring the host down.
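
As a rough illustration of the ordering argument above, here is a minimal
single-threaded sketch.  It is not Xen code: every model_* name is a
hypothetical stand-in, and it assumes that a disabled watchdog never reaches
its timeout path on another PCPU.  It only counts how many times the console
lock would be forced.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the Xen implementations. */
static bool model_watchdog_enabled = true;
static int model_force_unlock_calls;  /* times the console lock was forced */

static void model_watchdog_disable(void)
{
    model_watchdog_enabled = false;
}

static void model_console_force_unlock(void)
{
    model_force_unlock_calls++;
}

/* Emergency path on the first PCPU, with or without the proposed ordering. */
static void model_emergency_path(bool disable_watchdog_first)
{
    if ( disable_watchdog_first )
        model_watchdog_disable();
    model_console_force_unlock();
}

/*
 * Watchdog expiry on a second PCPU while the first is still busy.
 * Assumption: a disabled watchdog does not take the timeout path.
 */
static void model_watchdog_expiry(void)
{
    if ( !model_watchdog_enabled )
        return;
    model_console_force_unlock();
}

/* Run one scenario from a clean state and return the number of forced unlocks. */
static int model_run(bool disable_watchdog_first)
{
    model_watchdog_enabled = true;
    model_force_unlock_calls = 0;
    model_emergency_path(disable_watchdog_first);
    model_watchdog_expiry();
    return model_force_unlock_calls;
}
```

In this model, model_run(false) forces the lock twice (the problematic case
described above), while model_run(true) forces it only once.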

Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
CC: Keir Fraser <keir@xxxxxxx>
CC: Jan Beulich <JBeulich@xxxxxxxx>
CC: Tim Deegan <tim@xxxxxxx>
---
 xen/arch/x86/cpu/mcheck/mce.c |    1 +
 xen/arch/x86/nmi.c            |    1 +
 xen/arch/x86/traps.c          |    3 +++
 3 files changed, 5 insertions(+)

diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
index 93d7ae1..4c679f3 100644
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -1537,6 +1537,7 @@ static void mc_panic_dump(void)
 void mc_panic(char *s)
 {
     is_mc_panic = 1;
+    watchdog_disable();
     console_force_unlock();
 
     printk("Fatal machine check: %s\n", s);
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index c93812f..091e520 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -439,6 +439,7 @@ void nmi_watchdog_tick(struct cpu_user_regs * regs)
         this_cpu(alert_counter)++;
         if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
         {
+            watchdog_disable();
             console_force_unlock();
             printk("Watchdog timer detects that CPU%d is stuck!\n",
                    smp_processor_id());
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 57dbd0c..b12869e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3163,6 +3163,7 @@ static void pci_serr_error(struct cpu_user_regs *regs)
         raise_softirq(PCI_SERR_SOFTIRQ);
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("\n\nNMI - PCI system error (SERR)\n");
         fatal_trap(TRAP_nmi, regs);
@@ -3178,6 +3179,7 @@ static void io_check_error(struct cpu_user_regs *regs)
     case 'i': /* 'ignore' */
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("\n\nNMI - I/O ERROR\n");
         fatal_trap(TRAP_nmi, regs);
@@ -3197,6 +3199,7 @@ static void unknown_nmi_error(struct cpu_user_regs *regs, unsigned char reason)
     case 'i': /* 'ignore' */
         break;
     default:  /* 'fatal' */
+        watchdog_disable();
         console_force_unlock();
         printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
         printk("Do you have a strange power saving mode enabled?\n");
-- 
1.7.10.4


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

