[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 10/10] xen/arm: Enable errata for secondary CPU on hotplug after the boot



Hi,

+Dario

On Wed, May 9, 2018 at 6:32 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
>
>
> On 09/05/18 16:48, Mirela Simonovic wrote:
>>
>> Hi Julien,
>
>
> Hi Mirela,
>
>
>> On Mon, Apr 30, 2018 at 6:09 PM, Julien Grall <julien.grall@xxxxxxx>
>> wrote:
>>>
>>> Hi Mirela,
>>>
>>>
>>> On 27/04/18 18:12, Mirela Simonovic wrote:
>>>>
>>>>
>>>> On boot, enabling errata workarounds will be triggered by the boot CPU
>>>> from start_xen(). On CPU hotplug (non-boot scenario) this would not be
>>>> done. This patch adds the code required to enable errata workarounds
>>>> for a CPU being hotplugged after the system boots. This is triggered
>>>> using a notifier. If the CPU fails to enable the errata Xen will panic.
>>>> This is done because it is assumed that the CPU which is hotplugged
>>>> after the system/Xen boots, was initially hotplugged during the
>>>> system/Xen boot. Therefore, enabling errata workarounds should never
>>>> fail.
>>>>
>>>> Signed-off-by: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx>
>>>>
>>>> ---
>>>> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
>>>> CC: Julien Grall <julien.grall@xxxxxxx>
>>>> ---
>>>>    xen/arch/arm/cpuerrata.c         | 35
>>>> +++++++++++++++++++++++++++++++++++
>>>>    xen/arch/arm/cpufeature.c        | 23 +++++++++++++++++++++++
>>>>    xen/include/asm-arm/cpufeature.h |  1 +
>>>>    3 files changed, 59 insertions(+)
>>>>
>>>> diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c
>>>> index 1baa20654b..4040f781ec 100644
>>>> --- a/xen/arch/arm/cpuerrata.c
>>>> +++ b/xen/arch/arm/cpuerrata.c
>>>> @@ -5,6 +5,8 @@
>>>>    #include <xen/spinlock.h>
>>>>    #include <xen/vmap.h>
>>>>    #include <xen/warning.h>
>>>> +#include <xen/notifier.h>
>>>> +#include <xen/cpu.h>
>>>>    #include <asm/cpufeature.h>
>>>>    #include <asm/cpuerrata.h>
>>>>    #include <asm/psci.h>
>>>> @@ -349,6 +351,39 @@ void __init enable_errata_workarounds(void)
>>>>        enable_cpu_capabilities(arm_errata);
>>>>    }
>>>>    +static int cpu_errata_callback(
>>>> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>>>> +{
>>>> +    switch ( action )
>>>> +    {
>>>> +    case CPU_STARTING:
>>>> +        enable_nonboot_cpu_caps(arm_errata);
>>>> +        break;
>>>> +    default:
>>>> +        break;
>>>> +    }
>>>> +
>>>> +    return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block cpu_errata_nfb = {
>>>> +    .notifier_call = cpu_errata_callback,
>>>> +};
>>>> +
>>>> +static int __init cpu_errata_notifier_init(void)
>>>> +{
>>>> +    register_cpu_notifier(&cpu_errata_nfb);
>>>> +    return 0;
>>>> +}
>>>> +/*
>>>> + * Initialization has to be done at init rather than presmp_init phase
>>>> because
>>>> + * the callback should execute only after the secondary CPUs are
>>>> initially
>>>> + * booted (in hotplug scenarios when the system state is not boot). On
>>>> boot,
>>>> + * the enabling of errata workarounds will be triggered by the boot CPU
>>>> from
>>>> + * start_xen().
>>>> + */
>>>> +__initcall(cpu_errata_notifier_init);
>>>> +
>>>>    /*
>>>>     * Local variables:
>>>>     * mode: C
>>>> diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
>>>> index 525b45e22f..dd30f0d29c 100644
>>>> --- a/xen/arch/arm/cpufeature.c
>>>> +++ b/xen/arch/arm/cpufeature.c
>>>> @@ -68,6 +68,29 @@ void __init enable_cpu_capabilities(const struct
>>>> arm_cpu_capabilities *caps)
>>>>        }
>>>>    }
>>>>    +/* Run through the enabled capabilities and enable() them on the
>>>> calling CPU */
>>>> +void enable_nonboot_cpu_caps(const struct arm_cpu_capabilities *caps)
>>>> +{
>>>> +    ASSERT(system_state != SYS_STATE_boot);
>>>> +
>>>> +    for ( ; caps->matches; caps++ )
>>>> +    {
>>>> +        if ( !cpus_have_cap(caps->capability) )
>>>> +            continue;
>>>> +
>>>> +        if ( caps->enable )
>>>> +        {
>>>> +            /*
>>>> +             * Since the CPU has enabled errata workarounds on boot, it
>>>> should
>>>
>>>
>>>
>>> This function is not really about errata, it is about capabilities.
>>> Errata
>>> is just a sub-category of them.
>>>
>>
>> I've fixed the comment, thanks.
>>
>>>> +             * never fail to enable them here.
>>>> +             */
>>>> +            if ( caps->enable((void *)caps) )
>>>> +                panic("CPU%u failed to enable capability %u\n",
>>>> +                      smp_processor_id(), caps->capability);
>>>
>>>
>>>
>>> We should really avoid to use panic(...) if this is something the system
>>> can
>>> survive. In that specific case, it would only affect the current CPU. So
>>> it
>>> would be better to return an error and let the caller decide what to do.
>>>
>>
>> I need to emphasize two points:
>> 1) I don't see how is this different compared to PSCI CPU OFF where we
>> do panic. Essentially, in both cases the system will not be able to
>> use that CPU and we already agreed that is a good reason to panic.
>
>
> You can't compare PSCI CPU off and the enable callback failing. The *only*
> reason PSCI CPU off can fail is because the Trusted OS is resident on that
> CPU. If that ever happen it is a programming error on Xen, and it makes
> sense to fail because you don't want that CPU to spin in Xen.
>
> Enabling a capability can fail because of a failure of allocating memory or
> mapping (see spectre workaround). It is not a programming error but an
> expected behavior and it is not a valid reason to assume we want to kill the
> system.
>
>> As oppose to CPU_OFF which wasn't called on boot so we indeed have no
>> idea whether it will pass on suspend, no matter how unlikely it could
>> fail, in this scenario we are sure that enabling capability should
>> pass because it already passed on boot. So if it doesn't pass, which I
>> consider to be impossible, I believe we should panic.
>> On the other hand, I understand how would this make a difference on
>> big.LITTLE where you try to hotplug a CPU that was never booted.
>> However, that scenario is out of this scope.
>
> While I agree that big.LITTLE is out of scope of your series, what I ask has
> nothing to do with big.LITTLE. There are valid reason for the enable
> callback to fail whether it is the case today or not.
>
>>
>> 2) I still wanted to give a chance to your proposal and just convert
>> panic into stop_cpu+printing error. The system cannot survive if
>> enabling a capability fails. In order to test this I added a
>> capability that will always fail after the boot. This is not realistic
>> in my opinion, but I used it only for testing to check whether the
>> system will survive. Instead of panic I printed an error and stopped
>> the CPU. However, Xen crashed. The boot CPU properly concluded that
>> the erroneous CPU will never become online, but later on credit
>> scheduler's assertion fails.
>
>
> Please provide more details.
>
>> I believe this is something that a person
>> who adds big.LITTLE support should deal with.
>
>
> If there is a bug in the scheduler it should be fixed rather trying to
> workaround with a panic in the code. If you provide more details, we might
> be able to help here.
>

This flow seems to have several bugs. Lets start from here:

Please take a look at function cpu_schedule_callback in schedule.c.
Within switch, case CPU_DEAD doesn't have a break, causing the bellow
CPU_UP_CANCELED to execute as well when the CPU goes down. This looks
wrong to me.
Dario, could you please confirm that this is a bug? Otherwise could
you please confirm the reasoning beyond?

Thanks,
Mirela

>>
>> Do we have an agreement to keep panic?
>
>
> I am afraid not, panic (and BUG*) should only be used when there are no way
> to come back or it is a programming error to end up here. I don't think this
> is the case with the information I have in hand.
>
> The two solutions I find acceptable would be:
>         1) Log a warning and ignore the error. Likely your CPU will break
> later on.
>         2) Return an error and let the caller deal with it. The caller might
> decide to kill the system, but that's not our business. This function should
> only report an error.
>
> Cheers,
>
> --
> Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.