[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH V2] x86, amd_ucode: Safeguard against #GP




On 30/05/2014 17:01, Aravind Gopalakrishnan wrote:
On 5/28/2014 12:56 PM, boris ostrovsky wrote:

On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
On 5/27/2014 6:47 PM, Andrew Cooper wrote:
On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
When HW tries to load a corrupted patch, it generates #GP
and hangs the system. Use wrmsr_safe instead so that we
fail to load microcode gracefully.

Also, massage error handling around apply_microcode to keep
in tune with error handling style of other parts of the code.

Example on a Fam15h system-
(XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
(XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
0x6012 rev 0x6000637
(XEN) microcode: CPU0 found a matching microcode update with version
0x6000637 (current=0x6000626)
(XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
^^^^^^^^^^^^^^^^^^^^^^
As shown, the log message above has the two revisions reversed. Fix this

Changes in V2:
     - Do not ignore return value from wrmsr_safe
     - Flip revision numbers as shown above

Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@xxxxxxx>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
I thought we had identified that the hangs were to do with your use of
'noreboot' on the Xen command line.


Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot into dom0 without
having to run through reboot loops. (lazy alternative I guess)

Nevermind then. Thanks for the comments (Jan and Andrew). Will keep in mind for the future.

I don't understand --- the fact that you had noreboot option meant that your system wouldn't reboot (duh!) when a patch is corrupted (aka "it will hang"). But I'd think we don't want a reboot neither --- we want to safely skip the patch (and possibly backlist it).



So, allowing the reboot to happen in turn allows entry to grub.
Where we can simply remove 'ucode=' option and this is same as
'skipping' the patch using wrmsr_safe right?

Except this is now explicitly done by the sysadmin
while wrmsr_safe just works without anyone doing some extra work;
and printing a log message informs user that update went wrong
for whatever reason.

My understanding from earlier comments is that Andrew (and Jan)
would rather not see a change from wrmsrl to wrmsr_safe if it
is needless because there is already a way someone can circumvent the
corrupt patch: just don't provide it on the grub menu.

Thanks,
-Aravind.

Sorry - I think my comment confused the issue.  Let me retry.

Originally, the bug was described approximately as "A corrupt ucode will cause a GP fault, causing the server to hang".

The unhandled #GP fault certainly should be wrapped with wrmsr_safe(), and an error/warning presented to the user. In the case that a bad ucode is discovered, it should be discarded and the server allowed to boot. It is substantially more useful for the server to come up and say "I couldn't load that bit of microcode you wanted me to", than to sit in a reboot loop because you made a typo in the bootloader config, and have to get someone in the datacenter to poke the physical server.


My objection was to the wording of the comment alone. Unhandled #GP faults do not "hang" Xen unless you ask for them to behave in that way, given "noreboot" on the command line.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.