[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] x86/watchdog: Use real timestamps for watchdog timeout



>>> On 23.05.13 at 22:32, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
> test system being debugged, I observed a PCI SERR being continuously 
> asserted
> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
> timeout within 1 second.

This is a questionable rationale for the patch, no matter that
conceptually I don't mind a change like this. On broken systems
like this it may be more reasonable to require the watchdog (which
is disabled by default anyway) to not get enabled.

> @@ -432,23 +432,22 @@ void nmi_watchdog_tick(struct cpu_user_r
>      if ( (this_cpu(last_irq_sums) == sum) &&
>           !atomic_read(&watchdog_disable_count) )
>      {
> -        /*
> -         * Ayiee, looks like this CPU is stuck ... wait for the timeout
> -         * before doing the oops ...
> -         */
> -        this_cpu(alert_counter)++;
> -        if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
> +        s_time_t last_change = this_cpu(last_irq_change);
> +
> +        if ( (NOW() - last_change) > SECONDS(opt_watchdog_timeout) )

You can't use NOW() here - while the time updating code is safe
against normal interrupts, it's not atomic wrt NMIs.

Since you don't really need this calculation to be very precise,
doing it using plain TSC values might be acceptable, with the one
caveat that you'd need to check that doing this in the middle of a
TSC update (by time_calibration_tsc_rendezvous()) is still safe
and sufficiently precise.

The only other (non-generic) alternative I see is to use the HPET
main counter if 64-bit capable _and_ we ran it as 64-bit counter.
But that would neither cover all machines nor be backportable
(because of 32-bit's constraints).

Jan

>          {
> +            /* Ayiee, looks like this CPU is stuck. */
> +
>              console_force_unlock();
>              printk("Watchdog timer detects that CPU%d is stuck!\n",
>                     smp_processor_id());
>              fatal_trap(TRAP_nmi, regs);
>          }
> -    } 
> -    else 
> +    }
> +    else
>      {
>          this_cpu(last_irq_sums) = sum;
> -        this_cpu(alert_counter) = 0;
> +        this_cpu(last_irq_change) = NOW();
>      }
>  
>      if ( nmi_perfctr_msr )




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.