[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [qemu-upstream-4.11-testing test] 136184: regressions - FAIL



On Wed, 5 Jun 2019, Julien Grall wrote:
> Hi Stefano,
> 
> On 05/06/2019 00:11, Stefano Stabellini wrote:
> > On Tue, 4 Jun 2019, Julien Grall wrote:
> > > On 6/4/19 6:39 PM, Stefano Stabellini wrote:
> > > > On Tue, 4 Jun 2019, Julien Grall wrote:
> > > > > Hi Stefano,
> > > > > 
> > > > > On 6/4/19 6:09 PM, Stefano Stabellini wrote:
> > > > > > On Tue, 4 Jun 2019, Julien Grall wrote:
> > > > > > > Hi Jan,
> > > > > > > 
> > > > > > > On 6/4/19 8:06 AM, Jan Beulich wrote:
> > > > > > > > > > > On 03.06.19 at 19:15, <anthony.perard@xxxxxxxxxx> wrote:
> > > > > > > > > On Tue, May 21, 2019 at 05:52:12PM +0100, Julien Grall wrote:
> > > > > > > > > > The same error cannot be reproduced on laxton*. Looking at
> > > > > > > > > > the
> > > > > > > > > > test
> > > > > > > > > > history,
> > > > > > > > > > it looks like qemu-upstream-4.12-testing flight has run
> > > > > > > > > > successfully
> > > > > > > > > > a
> > > > > > > > > > few
> > > > > > > > > > times on rochester*. So we may have fixed the error in Xen
> > > > > > > > > > 4.12.
> > > > > > > > > > 
> > > > > > > > > > Potential candidates would be:
> > > > > > > > > >        - 00c96d7742 "xen/arm: mm: Set-up page permission for
> > > > > > > > > > Xen
> > > > > > > > > > mappings
> > > > > > > > > > earlier on"
> > > > > > > > > >        - f60658c6ae "xen/arm: Stop relocating Xen"
> > > > > > > > > > 
> > > > > > > > > > Ian, is it something the bisector could automatically look
> > > > > > > > > > at?
> > > > > > > > > > If not, I will need to find some time and borrow the board
> > > > > > > > > > to
> > > > > > > > > > bisect
> > > > > > > > > > the
> > > > > > > > > > issues.
> > > > > > > > > 
> > > > > > > > > I attempted to do that bisection myself, and the first commit
> > > > > > > > > that
> > > > > > > > > git
> > > > > > > > > wanted to try, a common commit to both branches, boots just
> > > > > > > > > fine.
> > > > > > > > 
> > > > > > > > Thanks for doing this!
> > > > > > > > 
> > > > > > > > One thing that, for now, completely escapes me: How come the
> > > > > > > > main 4.11 branch has progressed fine, but the qemuu one has
> > > > > > > > got stalled like this?
> > > > > > > 
> > > > > > > Because Xen on Arm today does not fully respect the Arm Arm when
> > > > > > > modifying
> > > > > > > the
> > > > > > > page-tables. This may result to TLB conflict and break of
> > > > > > > coherency.
> > > > > > 
> > > > > > Yes, I follow your reasoning, but it is still quite strange that it
> > > > > > only
> > > > > > happens with the qemu testing branch. Maybe it is because laxton was
> > > > > > picked instead of rochester to run the tests for this branch?
> > > > > > Otherwise,
> > > > > > there must be a difference in the Xen configuration between the
> > > > > > normal
> > > > > > branch and the qemu testing branch, aside from QEMU of course, that
> > > > > > shouldn't make any differences.
> > > > > 
> > > > > Per the discussion before, the .config is different between the 2
> > > > > flights.
> > > > > QEMU testing is not selecting CONFIG_LIVEPATCH while staging-4.11 is.
> > > > 
> > > > Has anybody tried to start selecting CONFIG_LIVEPATCH in the QEMU
> > > > testing
> > > > branch? Is it possible to give it a try?
> > > 
> > > I don't know and I am not sure how this would help here it is pretty clear
> > > that backporting 00c96d7742 "xen/arm: mm: Set-up page permission for Xen
> > > mappings earlier on" is actually going to help booting.
> > > 
> > > So it is very unlikely that CONFIG_LIVEPATCH is the problem.
> > 
> > I am not blaming CONFIG_LIVEPATCH at all. If we decide that we don't
> > want to backport 00c96d7742 for one reason or the other, and basically
> > we cannot fix this bug, enabling CONFIG_LIVEPATCH would probably unblock
> > the CI-loop (it would be nice to be sure about it).  Let's keep in mind
> > that we always had this bug -- the next 4.11 release is not going to be
> > any more broken than the previous 4.11 release if we don't fix this
> > issue, unless you think we backported something that affected the
> > underlying problem, making it worse.
> > 
> > Note that I am not advocating for leaving this bug unfixed. I am only
> > suggesting that if we decide it is too risky to backport 00c96d7742 and
> > we don't know what else to do, it would be good to have a way to unblock
> > 4.11 without having to force-push it. Let's settle the discussion below
> > first.
> 
> One way to unblock is not testing 4.11 (or just this flight) on Thunder-X.

Yeah, let's keep these options in mind.


> > > No, this patch introducing another source of TLB conflict if the processor
> > > is
> > > caching intermediate translation (this is implementation defined).
> > 
> > By "another source of TLB conflict" are you referring to something new
> > that wasn't there before? Or are you referring to the fact that still we
> > are not following the proper sequence to update the Xen pagetable? If
> > you are referring to the latter, wouldn't it be reasonable to say that
> > such a problem could have happened also before 00c96d7742?
> 
> It is existent but in a different form. I can't tell whether this is bad or
> not because the re-ordering of the code (and therefore memory access) will
> affect how TLBs are used. So it is a bit of gambling here.

If I read this right, this is the same underlying issue but due to the
re-ordering of the code, it could manifest differently. For instance the
impact on cache lines could be different.

Is this the case? If so, I think this is a tolerable risk, as other
things could affect it too, such as CONFIG options being
enabled/disabled, as we have just seen with CONFIG_LIVEPATCH. It is
almost "random".

I did take this into account when I wrote earlier that I think it should
be backported. But if you see a different class of problems potentially
being introduced by 00c96d7742 then I think the discussion would change
because it can be considered a regression.


> > > The bug reported by osstest actually taught me that even if Xen may boot
> > > today
> > > on a given platform, this may not be the case tomorrow because of the
> > > slight
> > > change in the code ordering (and therefore memory access).
> > > 
> > > /!\ Below is my interpretation and does not imply I am correct ;)
> > > 
> > > However, such Arm Arm violations are mostly gathered around boot and
> > > shouldn't
> > > affect runtime. IOW, Xen would stop booting on those platforms rather than
> > > making unrealiable. So it would not be too bad.
> > > 
> > > /!\ End
> > > 
> > > We just have to be aware of the risk we are taking with backporting the
> > > patch.
> > 
> > What you wrote here seems to make sense but I would like to understand
> > the problem mentioned earlier a bit better
> > 
> > 
> > > > > > What about the other older stanging branches?
> > > > > 
> > > > > The only one we could consider is 4.10, but AFAICT Jan already did cut
> > > > > the
> > > > > last release for it.
> > > > > 
> > > > > So I wouldn't consider any backport unless we begin to see the branch
> > > > > failing.
> > > > 
> > > > If Jan already made the last release for 4.10, then little point in
> > > > backporting it to it. However, it is not ideal to have something like
> > > > 00c96d7742 in some still-maintained staging branches but not all.
> 
> Jan pointed out it is not yet release. However, we didn't get any report for
> problem (aside the Arm Arm violation) with Xen 4.10 today. So I would rather
> avoid such backport in a final point release as we have a risk to make more
> broken than it is today.
> 
> I find this acceptable for Xen 4.11 because it has been proven to help. We
> also still have point release afterwards if this goes wrong.

If we do the backport, I would prefer to backport it to both trees, for
consistency, and because there might be machines out there where 4.10
doesn't boot with the wrong kconfig. This patch should decrease the risk
of breakage.

However, I see your point too. This is a judgement call -- we have not
enough data but we have to make a decision anyway. No way to tell which
way is best "scientifically".

My vote is to backport to both. Jan/others please express your opinion.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.