
Re: [PATCH for-4.16] x86/cpuid: do not shrink number of leaves in max policies



On 24/11/2021 18:07, Ian Jackson wrote:
> (Hoisting Roger and Jan to the To:)
>
> Andrew Cooper writes ("Re: [PATCH for-4.16] x86/cpuid: do not shrink number 
> of leaves in max policies"):
>> On 24/11/2021 16:24, Ian Jackson wrote:
>>> Questions from my RM hat:
>>>
>>> Is there a workaround ?
>> No.
>>
>> The safety check being tripped is intended to prevent the VM crashing on
>> resume, and is functioning correctly.
>>
>>> What proportion of machines do we think this might affect ?
>> Any pre-xsave machines (~2012 and older), and any newer machines booted
>> with no-xsave.
>>
>> All AMD machines are actually broken by this, except that failure is
>> being masked by other changes in 4.16.  Future AMD machines will break
>> in the same way.
> This is quite bad, then, I think.  I'm inclined to treat this as a
> blocker for the release.

I would also classify it as a blocker.

>
>>> Jan, Andy, do you have an opinion ?
>> The reversion doesn't go far enough.
>>
>> While the shrinking of the max policies manifests as a concrete breakage
>> here, there is further breakage caused by shrinking the default
>> policies, because it renders some cpuid= settings in VM config files broken.
>>
>> There is still no feedback or error checking from individual cpuid=
>> settings, so this will manifest as the VM admin settings silently no
>> longer taking effect.
>>
>>
>> I recommend a full and complete reversion of 540d911c28.  The
>> justification for it in the first place is especially weak because it is
>> explicitly contrary to how real hardware behaves, and this is the 3rd
>> ABI breakage it has caused, with more expected in the future based on
>> the analysis of what has gone wrong so far.
> I would like to collect as many opinions as possible.  Do we have
> other options besides (a) reverting 540d911c28, or (b) releasing with
> this bug ?

There is a 3rd option of taking this patch as-is, which is halfway
between (a) and (b), but anything other than (a) leaves us with known
breakages that have no workaround.

Shutting the VM down on the old host, copying its disks and config file
manually, then booting it clean would avoid this specific breakage on
migrate, but you'd still be subject to the silent breakage from certain
cpuid= settings not taking effect.

> What bad consequences follow, for users of Xen, from reverting
> 540d911c28 ?

None.  It will take everything back to the same behaviour as 4.15 and
older.

>   Presumably it had some purpose which will be undermined
> by reverting it.  The commit message speaks of details but doesn't
> explain the ultimate impact, at least not to someone like me who only
> dimly perceives the underlying technical aspects.

540d911c28 "fixes" an issue which is theoretical at best.

Real hardware behaviour does not trim max leaf when certain features are
turned off, and will report blocks of trailing zeros.

None of the software manuals permit any inference based on max leaf,
which is why the 4.15 behaviour has been fine for the lifetime of Xen so
far.
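
As a toy illustration of why trimming max leaf matters on resume, here is
a minimal sketch.  It is not Xen code: the policy is modelled as a plain
list of leaves, and shrink() and fits() are hypothetical names invented
for this example.  It assumes a host whose last leaf (0xd, the xsave
leaf) is all zeros, and a guest that was created while that empty leaf
was still counted.

```python
# Toy model only.  A "policy" is a list of 4-register leaves indexed by
# leaf number.  None of these names correspond to real Xen functions.

ZERO = (0, 0, 0, 0)


def max_leaf(policy):
    """Highest leaf number the policy reports."""
    return len(policy) - 1


def shrink(policy):
    """Drop trailing all-zero leaves, which lowers the reported max leaf."""
    end = len(policy)
    while end > 1 and policy[end - 1] == ZERO:
        end -= 1
    return policy[:end]


def fits(guest, host_max):
    """Hypothetical resume-time check: the guest must not report more
    leaves than the host's max policy allows."""
    return max_leaf(guest) <= max_leaf(host_max)


# Assumed example: leaves 0..0xc populated, leaf 0xd present but empty.
host_max = [(1, 0, 0, 0)] * 0xd + [ZERO]

# Guest built with the unshrunk policy, so it still reports leaf 0xd.
guest = list(host_max)

unshrunk_ok = fits(guest, host_max)         # check against the full policy
shrunk_ok = fits(guest, shrink(host_max))   # check against the trimmed one
```

In this model the guest passes the check against the unshrunk policy and
fails it against the shrunk one, even though the trimmed leaf held only
zeros.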

> I did an experimental git-revert.  It seemed to go cleanly.
> If we go for the revert, we would need a commit message.

It may revert cleanly, but it won't build because of the first hunk in
81da2b544cbb00.  That hunk needs reverting too, because it too breaks
some cpuid= settings in VM config files.

In principle, the *final* thing the toolstack should do, *for brand new
VMs only*, is a shrink of that form, but this depends on a whole load more
toolstack work before it can be done safely.  There is a plan to fix
CPUID handling, in a safe way, and it is ongoing (subject to all the
security interruptions), but has a long way to go yet.

~Andrew



 

