[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading

>>> On 11.02.19 at 14:35, <jgross@xxxxxxxx> wrote:
> On 11/02/2019 14:23, Jan Beulich wrote:
>>>>> On 11.02.19 at 06:40, <chao.gao@xxxxxxxxx> wrote:
>>> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
>>>>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote:
>>>>> +    /*
>>>>> +     * Initiate an update on all processors which don't have an online 
> sibling
>>>>> +     * thread with a lower thread id. Other sibling threads just await 
>>>>> the
>>>>> +     * completion of microcode update.
>>>>> +     */
>>>>> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>>>>> +        ret = microcode_update_cpu();
>>>>> +    /*
>>>>> +     * Increase the wait timeout to a safe value here since we're 
> serializing
>>>>> +     * the microcode update and that could take a while on a large 
>>>>> number 
> of
>>>>> +     * CPUs. And that is fine as the *actual* timeout will be determined 
>>>>> by
>>>>> +     * the last CPU finished updating and thus cut short
>>>>> +     */
>>>>> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * 
>>>>> nr_cores) )
>>>>> +        panic("Timeout when finishing updating microcode");
>>>> While I expect this to go away again in the next patch, I'd still like to
>>>> see this improved, in particular in case the patch here goes in
>>>> independently of the next one. After all on a system with 100 cores
>>>> the timeout totals to a whopping 3 seconds.
>>> To be clear, the timeout remains the same in the next patch due to
>>> the serial print clause in apply_microcode().
>>>> Generally the time needed to wait scales by the number of CPUs still
>>>> in need of doing the update. And if a timeout is really to occur, it's
>>>> perhaps because of one bad core or socket, not because nothing
>>>> works at all. Hence it would seem both nice and possible to scale the
>>>> "remaining time to wait" by the (known) number of remaining
>>>> processors to respond.
>>> Basically, I think the benefit is we can recognize the failure earlier
>>> if no core called in in a given interval (i.e. 30ms), and trigger a
>>> panic. Considering for such case, even with this optimization, the
>>> system needs reboot, which generally takes several minutes, what's the
>>> value of this optimization?
>> Hmm, on one hand this is a fair point you make. Otoh, why do
>> you add any timeout at all, if we say we're hosed anyway if the
>> timeout expires? You could then as well log a message (say
>> once a second) about how many (or which) CPUs still didn't
>> respond. The admin can then still reboot the system if desired.
> That's not a data center friendly approach.
> The ability to do microcode update in an online system might by
> risky, but in case of failure requiring access to the console or
> power settings of the system isn't nice.

I realize this, but I also wouldn't bet the system would reboot cleanly
in such a case (i.e. a power cycle may be required anyway).

> I think doing a panic() after some timeout is a sensible way to
> handle a failure.

I don't disagree; I'm just not convinced this is the only "sensible"

> In case you'd like having a way to wait longer: we could allow the
> "noreboot" parameter to be modified at runtime and do the panic only
> if opt_noreboot isn't set.

Why would runtime modification come into play here?


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.