Re: [Xen-devel] [PATCH v3 1/9] x86/mce: handle LMCE locally
>>> On 30.03.17 at 08:19, <haozhong.zhang@xxxxxxxxx> wrote:
> --- a/xen/arch/x86/cpu/mcheck/mce.c
> +++ b/xen/arch/x86/cpu/mcheck/mce.c
> @@ -42,6 +42,13 @@ DEFINE_PER_CPU_READ_MOSTLY(struct mca_banks *, poll_bankmask);
> DEFINE_PER_CPU_READ_MOSTLY(struct mca_banks *, no_cmci_banks);
> DEFINE_PER_CPU_READ_MOSTLY(struct mca_banks *, mce_clear_banks);
>
> +/*
> + * Flag to indicate that at least one non-local MCE on this CPU has
> + * not been completed handled. It's set by mcheck_cmn_handler() and
completely
> + * cleared by mce_softirq().
> + */
> +static DEFINE_PER_CPU(bool, nonlocal_mce_in_progress);
Does a boolean really suffice here? I.e. can't there be a 2nd one
while the first one is still being dealt with?
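(Purely for illustration, not something the patch does: one way to cover that
case would be a per-CPU generation count instead of a bool, so that an #MC
arriving while mce_softirq() is still running isn't lost when the flag gets
cleared. All names below are made up for the sketch.)

/* Hypothetical sketch, not part of the patch. */
static DEFINE_PER_CPU(unsigned int, nonlocal_mce_count);

/* mcheck_cmn_handler(), non-local MCE path: */
    this_cpu(nonlocal_mce_count)++;

/* mce_softirq(): snapshot on entry, and only declare the work done if no
 * further #MC arrived while we were processing. */
    unsigned int gen = this_cpu(nonlocal_mce_count);

    /* ... process pending telemetry ... */

    if ( gen == this_cpu(nonlocal_mce_count) )
        this_cpu(nonlocal_mce_count) = 0;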
> @@ -186,7 +193,29 @@ static struct mce_softirq_barrier mce_trap_bar;
> */
> static DEFINE_SPINLOCK(mce_logout_lock);
>
> +/*
> + * 'severity_cpu' is used in both mcheck_cmn_handler() and mce_softirq().
> + *
> + * When handling broadcasting MCE, MCE barriers take effect to prevent
> + * 'severity_cpu' being modified in one function on one CPU and accessed
> + * in another on a different CPU.
> + *
> + * When handling LMCE, mce_barrier_enter() and mce_barrier_exit() are
> + * effectively NOP, so it's possible that mcheck_cmn_handler() is handling
> + * a LMCE on CPUx while mce_softirq() is handling another LMCE on CPUy.
> + * If both are modifying 'severity_cpu', they may interfere with each
> + * other. Therefore, it's better for mcheck_cmn_handler() and mce_softirq()
> + * to avoid accessing 'severity_cpu' when handling LMCE, unless other
> + * approaches are taken to avoid the above interference.
> + */
> static atomic_t severity_cpu = ATOMIC_INIT(-1);
> +/*
> + * The following two global variables are used only in mcheck_cmn_handler().
In which case the better approach would be to move them into
that function. In no event are these "global" variables - they're
static after all.
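Something along these lines, i.e. (the actual variable names aren't visible in
the quoted hunk, so the ones below are only assumed):

/* Sketch only -- names and prototype assumed, not taken from the patch. */
void mcheck_cmn_handler(const struct cpu_user_regs *regs)
{
    /* Same lifetime as the file-scope statics, but no longer visible to
     * the rest of mce.c. */
    static atomic_t found_error = ATOMIC_INIT(0);
    static cpumask_t mce_fatal_cpus;

    /* ... existing handler body ... */
}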
> @@ -1700,38 +1736,61 @@ static void mce_softirq(void)
> {
> int cpu = smp_processor_id();
> unsigned int workcpu;
> + /*
> + * On platforms supporting broadcasting MCE, if there is a
> + * non-local MCE waiting for process on this CPU, it implies other
> + * CPUs received the same MCE as well. Therefore, mce_softirq()
> + * will be launched on other CPUs as well and compete with this
> + * mce_softirq() to handle all MCE's on all CPUs. In that case, we
> + * should use MCE barriers to sync with other CPUs. Otherwise, we
> + * do not need to wait for other CPUs.
> + */
> + bool nowait = !this_cpu(nonlocal_mce_in_progress);
>
> mce_printk(MCE_VERBOSE, "CPU%d enter softirq\n", cpu);
>
> - mce_barrier_enter(&mce_inside_bar);
> + mce_barrier_enter(&mce_inside_bar, nowait);
>
> /*
> - * Everybody is here. Now let's see who gets to do the
> - * recovery work. Right now we just see if there's a CPU
> - * that did not have any problems, and pick that one.
> - *
> - * First, just set a default value: the last CPU who reaches this
> - * will overwrite the value and become the default.
> + * When LMCE is being handled and no non-local MCE is waiting for
> + * process, mce_softirq() does not need to set 'severity_cpu'. And
> + * it should not, because the modification in this function may
> + * interfere with the simultaneous modification in mce_cmn_handler()
> + * on another CPU.
> */
> -
> - atomic_set(&severity_cpu, cpu);
> -
> - mce_barrier_enter(&mce_severity_bar);
> - if (!mctelem_has_deferred(cpu))
> + if (!nowait) {
> + /*
> + * Everybody is here. Now let's see who gets to do the
> + * recovery work. Right now we just see if there's a CPU
> + * that did not have any problems, and pick that one.
> + *
> + * First, just set a default value: the last CPU who reaches this
> + * will overwrite the value and become the default.
> + */
> atomic_set(&severity_cpu, cpu);
> - mce_barrier_exit(&mce_severity_bar);
>
> - /* We choose severity_cpu for further processing */
> - if (atomic_read(&severity_cpu) == cpu) {
> + mce_barrier_enter(&mce_severity_bar, nowait);
> + if (!mctelem_has_deferred(cpu))
> + atomic_set(&severity_cpu, cpu);
> + mce_barrier_exit(&mce_severity_bar, nowait);
You're in a !nowait conditional - please pass false explicitly to
enter/exit.
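I.e. roughly (re-using the quoted hunk, with the literals spelled out):

    if (!nowait) {
        /*
         * Everybody is here.  Set a default value: the last CPU to reach
         * this point overwrites it and becomes the default.
         */
        atomic_set(&severity_cpu, cpu);

        /* Inside the !nowait branch the decision is already made, so pass
         * the literal instead of re-evaluating 'nowait'. */
        mce_barrier_enter(&mce_severity_bar, false);
        if (!mctelem_has_deferred(cpu))
            atomic_set(&severity_cpu, cpu);
        mce_barrier_exit(&mce_severity_bar, false);
        ...
    }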
> @@ -1740,7 +1799,14 @@ static void mce_softirq(void)
> }
> }
>
> - mce_barrier_exit(&mce_inside_bar);
> + mce_barrier_exit(&mce_inside_bar, nowait);
> +
> + /*
> + * At this point, either the only LMCE has been handled, or all MCE's
> + * on this CPU waiting for process have been handled on this CPU (if
"for processing" or "to be processed" I think.
> --- a/xen/arch/x86/cpu/mcheck/mce.h
> +++ b/xen/arch/x86/cpu/mcheck/mce.h
> @@ -109,12 +109,14 @@ struct mca_summary {
> int eipv; /* meaningful on #MC */
> bool uc; /* UC flag */
> bool pcc; /* PCC flag */
> + bool lmce; /* LMCE flag (Intel only) */
> bool recoverable; /* software error recoverable flag */
> };
>
> DECLARE_PER_CPU(struct mca_banks *, poll_bankmask);
> DECLARE_PER_CPU(struct mca_banks *, no_cmci_banks);
> DECLARE_PER_CPU(struct mca_banks *, mce_clear_banks);
> +DECLARE_PER_CPU(bool, mce_in_process);
This should have been deleted afaict.
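(Presumably a leftover from the earlier revision: the flag the patch actually
adds is the file-scope static 'nonlocal_mce_in_progress' in mce.c, so mce.h
needs no declaration at all -- the line to drop being simply:)

DECLARE_PER_CPU(bool, mce_in_process);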
Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel