Re: [Xen-devel] [PATCH] misc/xenmicrocode: Upload /lib/firmware/<some blob> to the hypervisor
On Wed, Jan 28, 2015, at 06:39, Borislav Petkov wrote:
> On Wed, Jan 28, 2015 at 12:10:43AM +0000, Andrew Cooper wrote:
> > There was a thread on xen-devel but I cant currently find it in the
> > archives.
> >
> > To the best of my memory, it was a 4 core APU system where the BIOS had
> > updated the microcode on cpu 0 but left 1-3 at a lower patch level.

Which is a situation that will *always* happen when the late microcode
update driver is used, both on Intel and AMD.  You will always have a
time window where the processor is running with mismatched microcode
between some of the hardware threads/cores/modules/whatever.

This is not an issue when using the bare-metal early microcode update
driver, because it updates the BSP while still in uniprocessor mode, and
it updates the APs very early in the processor bootstrap code, well
before they are on-lined.  We can control what machine code runs in a
hardware thread that might be running mismatched microcode (i.e. the
code that runs before the processor bootstrap code attempts to update
the microcode) and keep it simple and away from anything that would
heavily object to mismatched microcode.

Likewise, it is not an issue for a non-broken BIOS/UEFI, as it is
*supposed* to update everything to the same microcode well before it
attempts to do anything complex.

> > Every time the reporter tried creating an HVM guest (i.e. entering SVM
> > non-root mode), the system reset.
> >
> > The instability was sorted by ensuring each core was at the same
> > microcode level.
>
> That sounds like a BIOS bug to me, frankly.

Sort of.  The extremely wide time window of mismatched microcode in that
computer was a BIOS bug, of course.  But the fact that you cannot trust
a system with mismatched microcode to be stable is the hard truth:
neither AMD nor Intel really guarantees that late microcode updates will
always be safe in all conditions.
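For what it's worth, spotting a mismatch like the one in that report is
easy enough on a running Linux system.  A minimal sketch, assuming the
x86 /proc/cpuinfo layout with per-CPU "processor" and "microcode"
fields; the function names and the revision values in the sample are
made up for illustration:

```python
from collections import defaultdict


def microcode_revisions(cpuinfo_text):
    """Map each microcode revision string to the logical CPUs reporting it."""
    revisions = defaultdict(list)
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "microcode" and cpu is not None:
            revisions[value].append(cpu)
    return dict(revisions)


def is_mismatched(revisions):
    """True when the CPUs do not all report the same revision."""
    return len(revisions) > 1


# Hypothetical sample: cpu 0 was updated, cpus 1-3 were left behind.
SAMPLE = """\
processor\t: 0
microcode\t: 0x6000822

processor\t: 1
microcode\t: 0x600081c

processor\t: 2
microcode\t: 0x600081c

processor\t: 3
microcode\t: 0x600081c
"""

if __name__ == "__main__":
    found = microcode_revisions(SAMPLE)
    for rev, cpus in sorted(found.items()):
        print(rev, cpus)
    print("mismatched:", is_mismatched(found))
```

On a real machine one would feed it the contents of /proc/cpuinfo
instead of SAMPLE.  It only tells you that a mismatch exists; it says
nothing about whether the system is stable in that state.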
What we can do about it in the Linux kernel late microcode driver is to
shorten that window as much as possible, and try to quiesce the system
as much as possible during the microcode update until all cores have
been updated.

It still looks like Xen should *never* trigger a late microcode update,
unless it freezes all VMs first.

> > As Xen updates microcode one cpu at a time from 0, it could easily
> > create a similar situation if microcode is updated after VMs have been
> > started. Come to think of it, this is also an impending problem for PVH
> > dom0 systems.
>
> The common way for doing microcode updates is to update all cores at
> the same time, possibly. Or at least as close to one another in time as
> possible.

The latter.  We serialize microcode updates across CPUs, and doing them
all at the same time is neither trivial (unforeseen side effects on a
running system) nor future-proof.  For example, on Intel you must
*never* have two CPUs attempt to update the same "microcode store" at
the same time, which requires that you actually know how the microcode
is partitioned relative to packages/cores/threads (so far, this is easy:
HT siblings share microcode, nothing else does.  But what about future
processors?).

> * the late update is an addition to the early one to cover the cases of
> long running systems where a reboot is prohibitively painful. With that,
> as with the early method, you would want to update all hardware cores in
> one go.

And, unfortunately, you have a time window of mismatched microcode
during the "one go", which is not something we can fix.  So we would
have to try to limit what happens during that time window, instead.

> Now, this is where it becomes tricky for virt: you need to stop guests,
> do the update and then resume them.
> Even worse, if all of a sudden you
> want to hide hardware features and/or instructions like HSW TSX for
> example, you most likely want to even avoid the late update and warn the
> admin that she has to reboot that machine and apply microcode with the
> early method.

Exactly.  But it goes further: we likely should freeze the entire kernel
and run nothing (not even interrupt handling) on non-up-to-date cores.
I.e. offline every CPU but one, switch to the last online CPU, update
its microcode, then update the other ones one at a time, onlining them
after they are up-to-date (and leaving them offline if something goes
wrong).  Or something to that effect.

It is no wonder we currently "hope for the best" as far as late
microcode update mode goes, and also that Linux distros are switching to
"early updates only" by default.

BTW, most datacenter people I know have a policy of never updating *any*
firmware at all outside of maintenance downtime, so they're actually
quite fine with the idea of a reboot being required to update processor
microcode.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique de Moraes Holschuh <hmh@xxxxxxxxxx>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
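
The "offline every CPU but one" ordering proposed in the reply can be
sketched as follows.  This is only a model of the sequencing, not kernel
or Xen code: `late_update_plan`, `update(cpu)` and `set_online(cpu,
state)` are hypothetical names, the choice of the first CPU as the
survivor is an assumption, and so is the decision to undo everything if
the survivor itself fails to update (the message does not cover that
case).  How an offlined CPU actually gets its microcode loaded is left
to the `update` callback.

```python
def late_update_plan(cpus, update, set_online):
    """Run the 'offline all but one' update ordering over a list of CPU ids.

    update(cpu) -> bool and set_online(cpu, state) are hypothetical
    callbacks standing in for the real platform operations.
    Returns (online, left_offline) once the sequence has finished.
    """
    if not cpus:
        return [], []
    survivor, others = cpus[0], list(cpus[1:])

    # 1. Offline every CPU but one, so nothing runs on stale cores.
    for cpu in others:
        set_online(cpu, False)

    # 2. Update the last online CPU first.
    if not update(survivor):
        # Assumption: no CPU has changed yet, so bring the others back
        # and give up with every core still at the old revision.
        for cpu in others:
            set_online(cpu, True)
        return list(cpus), []

    # 3. Update the remaining CPUs one at a time; online each one only
    #    after it is up to date, and leave it offline on failure.
    online, left_offline = [survivor], []
    for cpu in others:
        if update(cpu):
            set_online(cpu, True)
            online.append(cpu)
        else:
            left_offline.append(cpu)
    return online, left_offline


if __name__ == "__main__":
    log = []

    def fake_update(cpu):
        log.append(("update", cpu))
        return cpu != 2  # pretend CPU 2 rejects the update

    def fake_set_online(cpu, state):
        log.append(("online" if state else "offline", cpu))

    print(late_update_plan([0, 1, 2, 3], fake_update, fake_set_online))
    for step in log:
        print(step)
```

The point of the ordering is that no CPU executes anything while its
microcode differs from that of the CPU doing the work; the price is that
a failed update costs you a core until the next reboot.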