
Re: [Xen-devel] [PATCH v2] x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap



On Fri, 16 Feb 2018 17:46:48 +0000
Igor Druzhinin <igor.druzhinin@xxxxxxxxxx> wrote:

>We're noticing a reproducible system boot hang on certain
>post-Skylake platforms where the BIOS is configured in
>legacy boot mode with x2APIC disabled. The system stalls
>immediately after writing the first SMP initialization
>sequence into the APIC ICR.
>
>The cause of the problem is watchdog NMI handler execution -
>somewhere near the end of NMI handling (after it has already
>rescheduled the next NMI) it tries to access IO port 0x61
>to get the actual NMI reason on CPU0. Unfortunately, this
>port is emulated by the BIOS using SMIs, and this emulation
>for some reason takes more time than we expect during the
>INIT-SIPI-SIPI sequence. As a result, the system constantly
>bounces between the NMI and SMI handlers and makes no
>progress.
>
>To avoid this, initialize the watchdog after SMP bootstrap on
>CPU0 and, additionally, protect the NMI handler by moving the
>IO port access before the NMI re-scheduling.
>
>Signed-off-by: Igor Druzhinin <igor.druzhinin@xxxxxxxxxx>

This is far better than merely reducing the NMI rate.
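To restate the failure mode in code, the pre-patch do_nmi() on CPU0
roughly does the following (a condensed sketch based on the context
lines of the traps.c hunk quoted below, not the verbatim source):

    void do_nmi(const struct cpu_user_regs *regs)
    {
        unsigned int cpu = smp_processor_id();
        unsigned char reason;
        bool handle_unknown = false;

        ++nmi_count(cpu);

        if ( nmi_callback(regs, cpu) )
            return;

        /* Per the commit message, this is where the *next* NMI gets
         * rescheduled. */
        if ( (nmi_watchdog == NMI_NONE) ||
             (!nmi_watchdog_tick(regs) && watchdog_force) )
            handle_unknown = true;

        /* Only the BSP gets external NMIs from the system. */
        if ( cpu == 0 )
        {
            /* On the affected boards this read traps into SMM; if the
             * SMI takes longer than one watchdog period, another NMI
             * is already pending by the time we get back here. */
            reason = inb(0x61);
            if ( reason & 0x80 )
                pci_serr_error(regs);
            ...
        }
    }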

The most likely reason why SMI handling takes so long during the MP
initialization sequence in particular is the need to wait for the
other CPUs inside the SMI handler.

Sending INIT makes the target CPU unresponsive for a relatively long
time (the SDM recommends a 10ms delay after it in the MP init
sequence). If, right after sending the INIT IPI, we immediately
trigger an SMI I/O trap by reading port 61h, then at the beginning of
the SMI handler the primary CPU has to wait (via a spinlock) for the
other CPUs, including the one that has only just started to come out
of its INIT coma. AFAIR the handler doesn't even begin actual SMI
processing until all CPUs have synchronized their execution in SMM
via that spinlock. Hence the overall SMI processing time is long,
easily enough for the next NMI watchdog tick to arrive.
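In rough pseudo-C, the suspected rendezvous looks something like the
sketch below. All names in it are made up for illustration; the real
SMM firmware is vendor-specific and not visible from Xen:

    /* Hypothetical per-CPU SMI entry point inside the BIOS SMM code. */
    static volatile unsigned int cpus_in_smm;     /* hypothetical counter */

    void smi_entry(unsigned int cpu, unsigned int total_cpus)
    {
        atomic_inc(&cpus_in_smm);                 /* announce SMM entry */

        /*
         * Rendezvous: spin until every logical CPU has entered SMM.
         * An AP that has just been hit with INIT is slow to get here,
         * so (per the theory above) this is where the time goes.
         */
        while ( cpus_in_smm < total_cpus )
            cpu_relax();

        if ( cpu == 0 )
            emulate_port_61_read();   /* hypothetical: the actual I/O trap work */

        /* ...release the other CPUs and RSM back to the interrupted code. */
    }

Whatever the exact mechanism, the key point is that the BSP's SMI
cannot complete until the freshly INITed AP joins the rendezvous.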

Basically, it's a race between sending INIT to the APs, triggering an
SMI by reading port 61h, spin-waiting for the other CPUs at the start
of the SMI handler, and the NMI watchdog timer (whose NMIs, through
the port 61h reads, keep producing a flow of SMIs).

If these assumptions are correct, then delaying the watchdog-induced
SMIs until MP initialization is complete is the right solution.
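With the patch applied, the two orderings that matter end up as
follows (condensed from the hunks below rather than quoted verbatim):

    /*
     * 1. CPU0 no longer arms its watchdog in setup_local_APIC(); it is
     *    armed only once all APs have been brought up.
     */
    void __init smp_cpus_done(void)
    {
        if ( nmi_watchdog == NMI_LOCAL_APIC )
        {
            setup_apic_nmi_watchdog();    /* CPU0, after INIT-SIPI-SIPI */
            check_nmi_watchdog();
        }
        ...
    }

    /*
     * 2. In do_nmi(), the SMI-provoking port read now happens before
     *    the watchdog re-arms itself, so a slow SMI can no longer
     *    overlap with an already-scheduled next NMI.
     */
    if ( cpu == 0 )
        reason = inb(0x61);

    if ( (nmi_watchdog == NMI_NONE) ||
         (!nmi_watchdog_tick(regs) && watchdog_force) )
        handle_unknown = true;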

>---
> xen/arch/x86/apic.c    | 2 +-
> xen/arch/x86/smpboot.c | 3 +++
> xen/arch/x86/traps.c   | 9 +++++++--
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
>diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
>index 5039173..ffa5a69 100644
>--- a/xen/arch/x86/apic.c
>+++ b/xen/arch/x86/apic.c
>@@ -684,7 +684,7 @@ void setup_local_APIC(void)
>         printk("Leaving ESR disabled.\n");
>     }
> 
>-    if (nmi_watchdog == NMI_LOCAL_APIC)
>+    if (nmi_watchdog == NMI_LOCAL_APIC && smp_processor_id())
>         setup_apic_nmi_watchdog();
>     apic_pm_activate();
> }
>diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>index 2ebef03..1844116 100644
>--- a/xen/arch/x86/smpboot.c
>+++ b/xen/arch/x86/smpboot.c
>@@ -1248,7 +1248,10 @@ int __cpu_up(unsigned int cpu)
> void __init smp_cpus_done(void)
> {
>     if ( nmi_watchdog == NMI_LOCAL_APIC )
>+    {
>+        setup_apic_nmi_watchdog();
>         check_nmi_watchdog();
>+    }
> 
>     setup_ioapic_dest();
> 
>diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>index 2e022b0..c16f146 100644
>--- a/xen/arch/x86/traps.c
>+++ b/xen/arch/x86/traps.c
>@@ -1706,7 +1706,7 @@ static nmi_callback_t *nmi_callback = dummy_nmi_callback;
> void do_nmi(const struct cpu_user_regs *regs)
> {
>     unsigned int cpu = smp_processor_id();
>-    unsigned char reason;
>+    unsigned char reason = 0;
>     bool handle_unknown = false;
> 
>     ++nmi_count(cpu);
>@@ -1714,6 +1714,12 @@ void do_nmi(const struct cpu_user_regs *regs)
>     if ( nmi_callback(regs, cpu) )
>         return;
> 
>+    /* This IO port access is likely to produce SMI which, in turn,
>+     * may take enough time for the next NMI tick to happen. To avoid having
>+     * nested NMIs as the result let's call it before watchdog re-scheduling */
>+    if ( cpu == 0 )
>+        reason = inb(0x61);
>+
>     if ( (nmi_watchdog == NMI_NONE) ||
>          (!nmi_watchdog_tick(regs) && watchdog_force) )
>         handle_unknown = true;
>@@ -1721,7 +1727,6 @@ void do_nmi(const struct cpu_user_regs *regs)
>     /* Only the BSP gets external NMIs from the system. */
>     if ( cpu == 0 )
>     {
>-        reason = inb(0x61);
>         if ( reason & 0x80 )
>             pci_serr_error(regs);
>         if ( reason & 0x40 )

