[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Regressed XSA-286, was [xen-unstable test] 161917: regressions - FAIL


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 17 Jun 2021 13:56:29 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=9aaQA8KNCkXZiPBW7WjoCJ0rMpVDOJvdPF2AbAm77Cs=; b=Nr0dhYkvkrbYVDUussRWX12WcvYJGriIoKgHqEZLE0YlObFrMF8uNd7/5jYKs+iBh4zB4ucCN7rfFo2D3pi+m0HTKbbxVy96x72YjxT688BvNJrlhx5ViUqGEHxbj2OQO+Js5eZet3wB4T2pLUeXDf66DeQCb6v1wk0KeCkY5GqIszu5yX8k5Ns03tsr8q8+ipVlqHy8kKP/s9wPiztdHSipC7C3x7Q9E7wWMpuqNrO2ATsd+/O5gD0tPrnnx98skmxDHW11MGInDnFIVv0c7wAHT+nsfX+2e1jdEzyl18ns4267UueCgbE0jhaveqvTmiCVq4jWpUpEWwrrEBM39w==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=DG+5xt692pjn7BjBlLqPZ6SILRmN1nGDJGjPdVKre7OMNwhCWfZaUj7OOHnJzX28RQqKse2mSH9Qp/GNJy8sbuHw/uJgZOL9qt/tGkK83vwMRyDjC44zPA1tF0mdlbM43fA/thmgoVBl33S7HR46OEnNW42drMQ1vA7VzdIlbZWud02H6DpcTuVttyLAo4s/W2UTA7Zjpy63kay9n5LKGBvLt+1efun+Qz9eAPPH4e7fnQW238anLsgTg8lMhmv4f5enZE8U7De1sRezVS2BiQRttd7WLN2elZ/iEOiwbXuu2d3CDAsdmRpomp9Oa0mCarG4kFuc/+2aAHJvb/91jA==
  • Authentication-results: xenproject.org; dkim=none (message not signed) header.d=none;xenproject.org; dmarc=none action=none header.from=suse.com;
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Roger Pau Monné <roger.pau@xxxxxxxxxx>, "committers@xxxxxxxxxxxxxx" <committers@xxxxxxxxxxxxxx>
  • Delivery-date: Thu, 17 Jun 2021 11:56:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 16.06.2021 17:43, Andrew Cooper wrote:
> On 16/06/2021 09:48, Jan Beulich wrote:
>> On 13.05.2021 22:15, Andrew Cooper wrote:
>>> On 13/05/2021 04:56, osstest service owner wrote:
>>>> flight 161917 xen-unstable real [real]
>>>> http://logs.test-lab.xenproject.org/osstest/logs/161917/
>>>>
>>>> Regressions :-(
>>>>
>>>> Tests which did not succeed and are blocking,
>>>> including tests which could not be run:
>>>>  test-arm64-arm64-examine      8 reboot                   fail REGR. vs. 
>>>> 161898
>>>>  test-arm64-arm64-xl-thunderx  8 xen-boot                 fail REGR. vs. 
>>>> 161898
>>>>  test-arm64-arm64-xl-credit1   8 xen-boot                 fail REGR. vs. 
>>>> 161898
>>>>  test-arm64-arm64-xl-credit2   8 xen-boot                 fail REGR. vs. 
>>>> 161898
>>>>  test-arm64-arm64-xl           8 xen-boot                 fail REGR. vs. 
>>>> 161898
>>> I reported these on IRC, and Julien/Stefano have already committed a fix.
>>>
>>>> Tests which are failing intermittently (not blocking):
>>>>  test-xtf-amd64-amd64-3 92 xtf/test-pv32pae-xsa-286 fail in 161909 pass in 
>>>> 161917
>>> While noticing the ARM issue above, I also spotted this one by chance. 
>>> There are two issues.
>>>
>>> First, I have reverted bed7e6cad30 and edcfce55917.  The XTF test is
>>> correct, and they really do reintroduce XSA-286.  It is a miracle of
>>> timing that we don't need an XSA/CVE against Xen 4.15.
>> As expressed at the time already, I view this reverting you did, without
>> there being any emergency and without you having gathered any acks or
>> allowed for objections, as overstepping your competencies. I did post a
>> patch to the XTF test, which I believe is wrong, without having had any
>> feedback there either. Unless I hear back by the end of this week with
>> substantial arguments of why I am wrong (which would need to also cover
>> the fact that an issue was found with 32-bit PAE only, in turn supporting
>> my view on the overall state), I intend to revert your revert early next
>> week.
> 
> It has frankly taken a while to formulate a civil reply.
> 
> I am very irritated that you have *twice* recently introduced security
> vulnerabilities by bypassing my reviews/objections on patches.

I'm sorry, Andrew, but already in my original reply a month ago I did
express that I couldn't find any record of you having objected to the
changes. It doesn't help that you claim you've objected when you
really didn't (which is the impression I get from not finding anything,
and which also matches my recollection of what was discussed).

I don't think I know which 2nd instance you're referring to, and hence
I can't respond to that aspect.

> At the time, I had to drop work on an in-progress security issue to
> urgently investigate why we'd regressed upstream, and why OSSTest hadn't
> blocked it.
> 
> I am more generally irritated that you are constantly breaking things
> which GitlabCI can tell you is broken, and that I'm having to drop work
> I'm supposed to be doing to unbreak them.

GitlabCI doesn't tell me anything just yet, unless I go actively poll
it. And as mentioned just yesterday on irc, I don't think I can easily
navigate my way through those web pages, to find breakage I may have
introduced and hence would better go fix. Unlike osstest, where I am
told what failed, and I know where to find the corresponding logs.

It's also not clear to me at all in how far GitlabCI would have
spotted the issue here, no matter whether it's caused by a hypervisor
change or the XTF test being wrong. So far I've seen GitlabCI only
spot build issues.

I'm also puzzled, to put it mildly, of your use of "constantly" here.

> In the case of this revert specifically, I did get agreement on IRC
> before reverting.

How can I know you did? You didn't even care to reply to my mail from
a month ago. And there was no reason to make an emergency out of this
and ask on irc. You could have sent mail just like is done for all
other normal bug fixes etc. Iirc I was on PTO at that time; it would
hence only have been fair to wait until my return.

> In your proposed edit to the XTF test, you say
> 
>   L3 entry updates aren't specified to take immediate effect in PAE mode:
> 
> but this is not accurate.  It's what the Intel SDM says, but is
> contradicted by the AMD APM which states that this behaviour is not true
> under NPT under any circumstance, nor is it true on native.
> 
> Furthermore, any 32bit PV guest knowing it is running on a 64bit Xen
> (even from simply checking Xen >= 4.3) can rely on the relaxed
> behaviour, irrespective of what the unwritten PV ABI might want to say
> on the matter, due to knowing that it is running on Long mode paging as
> opposed to legacy PAE paging.

Neither of these are reasons for a 32-bit guest to _rely_ on such
behavior. Hence the change to the XTF test, which so far you also
didn't care to reply to.

I'm aware of NPT having different behavior, but can you point me to
the place in AMD doc saying so also for native? In fact I can find a
statement to the contrary:

"The behavior of PAE mode in a nested-paging guest differs slightly
 from the behavior of (host-only) legacy PAE mode, in that the
 guest’s four PDPEs are not loaded into the processor at the time
 CR3 is written. Instead, the PDPEs are accessed on demand as part
 of a table walk."

This to me implies that in the native case the behavior matches
Intel's.

> If these two technical reasons aren't good enough, then consider the
> manifestation of the issue itself.  XSA-286 is specifically about Xen
> editing the wrong PTE, because of the use of linear pagetables, in light
> of the guest not flushing the TLB.

The PTE edited is, as said, only perceived wrong by the XTF test.
Hence the patch to correct it.

> If you were to remove linear pagetables from Xen, the issue
> (do_mmu_update() edits the wrong PTE) would cease to manifest even on
> legacy PAE paging, demonstrating that the problem is with Xen's actions,
> not with the guests.

And if I introduced shadowing of the L3E writes, pushing the new ones
into the live page tables only upon CR3 writes, the issue would
reappear. It ought to be permissible to make such a change to Xen,
even if we may have no specific reason to do so at this point (albeit
I think we really better would, to match bare metal behavior).

Also, in my request to you (still in context above) I did specifically
ask about the aspect of the observed issue only manifesting on 32-bit,
yet you claiming a general problem, i.e. also affecting 64-bit. You
didn't comment on this at all.

Jan




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.