
Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading

  • To: Jan Beulich <JBeulich@xxxxxxxx>, Chao Gao <chao.gao@xxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Mon, 11 Feb 2019 14:35:30 +0100
  • Autocrypt: addr=jgross@xxxxxxxx; prefer-encrypt=mutual
  • Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Ashok Raj <ashok.raj@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, tglx@xxxxxxxxxxxxx, Borislav Petkov <bp@xxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • Delivery-date: Mon, 11 Feb 2019 13:35:35 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 11/02/2019 14:23, Jan Beulich wrote:
>>>> On 11.02.19 at 06:40, <chao.gao@xxxxxxxxx> wrote:
>> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
>>>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote:
>>>> +    /*
>>>> +     * Initiate an update on all processors which don't have an online sibling
>>>> +     * thread with a lower thread id. Other sibling threads just await the
>>>> +     * completion of microcode update.
>>>> +     */
>>>> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>>>> +        ret = microcode_update_cpu();
>>>> +    /*
>>>> +     * Increase the wait timeout to a safe value here since we're serializing
>>>> +     * the microcode update and that could take a while on a large number of
>>>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>>>> +     * the last CPU finished updating and thus cut short
>>>> +     */
>>>> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
>>>> +        panic("Timeout when finishing updating microcode");
>>> While I expect this to go away again in the next patch, I'd still like to
>>> see this improved, in particular in case the patch here goes in
>>> independently of the next one. After all on a system with 100 cores
>>> the timeout totals to a whopping 3 seconds.
>> To be clear, the timeout remains the same in the next patch due to
>> the serial print clause in apply_microcode().
>>> Generally the time needed to wait scales by the number of CPUs still
>>> in need of doing the update. And if a timeout is really to occur, it's
>>> perhaps because of one bad core or socket, not because nothing
>>> works at all. Hence it would seem both nice and possible to scale the
>>> "remaining time to wait" by the (known) number of remaining
>>> processors to respond.
>> Basically, I think the benefit is that we can recognize the failure
>> earlier if no core checks in within a given interval (e.g. 30ms), and
>> trigger a panic. But considering that even with this optimization the
>> system still needs a reboot in that case, which generally takes several
>> minutes, what's the value of this optimization?
> Hmm, on one hand this is a fair point you make. Otoh, why do
> you add any timeout at all, if we say we're hosed anyway if the
> timeout expires? You could then as well log a message (say
> once a second) about how many (or which) CPUs still didn't
> respond. The admin can then still reboot the system if desired.

That's not a data-center-friendly approach.

The ability to do a microcode update on an online system might be
risky, but requiring access to the console or the power controls of
the system in case of failure isn't nice either.

I think doing a panic() after some timeout is a sensible way to
handle a failure.

In case you'd like a way to wait longer: we could allow the
"noreboot" parameter to be modified at runtime and do the panic only
if opt_noreboot isn't set.
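Jan's suggestion above, rescaling the remaining wait by the number of CPUs still pending, could be sketched roughly as below. This is a self-contained toy model, not the actual Xen code: wait_for_cpus_scaled(), remaining_budget_us(), and the microsecond "budget" counter are invented for illustration, and a real implementation would spin against a time deadline on an atomic counter rather than decrement a variable.

```c
#define MICROCODE_DEFAULT_TIMEOUT_US 30000UL

/*
 * Instead of one flat (timeout * nr_cores) budget, rescale the budget to
 * (timeout * nr_pending) each time a CPU checks in.  A single stuck core
 * is then detected after roughly one per-core timeout, not nr_cores of
 * them.
 */
static unsigned long remaining_budget_us(unsigned int nr_pending)
{
    return MICROCODE_DEFAULT_TIMEOUT_US * nr_pending;
}

/*
 * Toy wait loop: 'responders' is the number of CPUs that will ever check
 * in; each iteration of the else-branch models 1us of elapsed time.
 * Returns 0 on success, -1 on timeout.
 */
static int wait_for_cpus_scaled(unsigned int nr_cpus,
                                unsigned int responders)
{
    unsigned int done = 0;
    unsigned long budget = remaining_budget_us(nr_cpus);

    while ( done < nr_cpus )
    {
        if ( done < responders )
        {
            done++;                                       /* CPU checked in */
            budget = remaining_budget_us(nr_cpus - done); /* rescale */
        }
        else if ( budget == 0 )
            return -1;                                    /* timed out */
        else
            budget--;                                     /* burn 1us */
    }

    return 0;
}
```

With this shape, the worst-case total wait is unchanged when all CPUs are merely slow, but a genuinely wedged core trips the timeout after about one per-core interval, which is where the "panic only if opt_noreboot isn't set" policy could then be applied.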

