|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading
On Mon, Apr 30, 2018 at 09:25:26AM -0600, Jan Beulich wrote:
>>>> On 25.04.18 at 13:46, <chao.gao@xxxxxxxxx> wrote:
>> @@ -281,24 +288,56 @@ static int microcode_update_cpu(const void *buf,
>> size_t size)
>> return err;
>> }
>>
>> -static long do_microcode_update(void *_info)
>> +/* Wait for all CPUs to rendezvous with a timeout (us) */
>> +static int wait_for_cpus(atomic_t *cnt, int timeout)
>
>unsigned int
>
>> +static int do_microcode_update(void *_info)
>> +{
>> + struct microcode_info *info = _info;
>> + unsigned int cpu = smp_processor_id();
>> + int ret;
>> +
>> + ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
>> + if ( ret )
>> + return ret;
>> + /*
>> + * Logical threads which set the first bit in cpu_sibling_mask can do
>> + * the update. Other sibling threads just await the completion of
>> + * microcode update.
>> + */
>> + if ( cpumask_test_and_set_cpu(
>> + cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus)
>> )
>> + ret = microcode_update_cpu(info->buffer, info->buffer_size);
>
>Isn't the condition inverted (i.e. missing a ! )?
Yes.
>
>Also I take it that you've confirmed that loading ucode in parallel on multiple
>cores of the same socket is not a problem? The comment in the last hunk
>suggests otherwise.
No. In microcode_update_cpu(), microcode_mutex makes the update
sequential.
>
>> + /*
>> + * Increase the wait timeout to a safe value here since we're
>> serializing
>> + * the microcode update and that could take a while on a large number of
>> + * CPUs. And that is fine as the *actual* timeout will be determined by
>> + * the last CPU finished updating and thus cut short
>> + */
>> + if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT *
>> + num_online_cpus()) )
>> + panic("Timeout when finishing updating microcode");
>
>A 3s timeout (as an example for a system with 100 CPU threads) is still
>absurdly high to me, but considering you panic() anyway if you hit the
>timeout the question mainly is whether there's a slim chance for this to
>complete a brief moment before the timeout expires. If all goes well,
>you won't come close to even 1s, but as said before - there may be
>guests running, and they may become utterly confused if they don't
>get any time within a second or more.
>
>With you no longer doing things sequentially I don't, however, see why
No. It is still sequential. And only one thread in a core will try to
acquire the lock -- microcode_mutex.
>you need to scale the timeout by CPU count.
Maybe by the number of core. But I did't find an existing variable to
count cores.
>
>> +
>> + return ret;
>> }
>
>You're losing this return value (once for every CPU making it into this
>function).
I don't understand this remark. This function is called by
stop_machine_run(). stop_machine_run() could return error if
any cpu failed during update. We don't care the specific CPU and
how many CPUs failed to do the update.
Thanks
Chao
>
>> @@ -318,26 +357,52 @@ int
>> microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>>
>> ret = copy_from_guest(info->buffer, buf, len);
>> if ( ret != 0 )
>> - {
>> - xfree(info);
>> - return ret;
>> - }
>> + goto free;
>>
>> info->buffer_size = len;
>> - info->error = 0;
>> - info->cpu = cpumask_first(&cpu_online_map);
>> +
>> + /* cpu_online_map must not change during update */
>> + if ( !get_cpu_maps() )
>> + {
>> + ret = -EBUSY;
>> + goto free;
>> + }
>>
>> if ( microcode_ops->start_update )
>> {
>> ret = microcode_ops->start_update();
>> if ( ret != 0 )
>> - {
>> - xfree(info);
>> - return ret;
>> - }
>> + goto put;
>> }
>>
>> - return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>> + cpumask_empty(&info->cpus);
>
>DYM cpumask_clear()?
>
>> + atomic_set(&info->cpu_in, 0);
>> + atomic_set(&info->cpu_out, 0);
>> +
>> + /*
>> + * We intend to disable interrupt for long time, which may lead to
>> + * watchdog timeout.
>> + */
>> + watchdog_disable();
>> + /*
>> + * Late loading dance. Why the heavy-handed stop_machine effort?
>> + *
>> + * -HT siblings must be idle and not execute other code while the other
>> + * sibling is loading microcode in order to avoid any negative
>> + * interactions cause by the loading.
>> + *
>> + * -In addition, microcode update on the cores must be serialized until
>> + * this requirement can be relaxed in the feature. Right now, this is
>> + * conservative and good.
>
>This is the comment I've referred to above.
>
>Jan
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |