
Re: [Xen-devel] 4.11.0 RC1 panic



>>> On 25.06.18 at 10:33, <bouyer@xxxxxxxxxxxxxxx> wrote:
> On Thu, Jun 14, 2018 at 08:33:17AM -0600, Jan Beulich wrote:
>> > So far I've not been able to make Xen panic with the new xen kernel.
>> > Attached is a log of the serial console, in case you notice something.
>> 
>> None of the printk()s replacing ASSERT()s have triggered, so nothing
>> interesting to learn from the log, unfortunately.
>> 
>> > I'll keep anita tests running in a loop overnight, in case it ends up
>> > hitting an assert.
> 
> Hello,
> the dom0 has been running for a week now, running the daily NetBSD tests.
> Attached is the console log.
> I didn't notice anything suspect, except a few domU crashes (crashing in
> Xen, the problem is not reported back to the domU). But as this is
> running NetBSD-HEAD tests it can also be a bug in the domU, that has
> been fixed since then.
> 
> It's possible that the printk changed timings in a way that prevents the
> race condition from happening ...

It may have made it less likely, but there is at least one instance in the
log (around line 6830). Sadly, this follows a set of dropped messages
(which may have been sufficient to make the race trigger again). That
is - we know nothing about d32 ahead of the crash, which is not helpful
at all. The only interesting aspect is that this appears to trigger for two
slots in a row. To me this makes it less likely again for there to be a
race in updating the counter, and more likely for the counter (living in a
union, as you may recall) to be overwritten by other code.
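To illustrate the overwrite theory, here is a minimal sketch of how a
counter living in a union can be clobbered by a store through another
member. The union layout and all demo_* names are hypothetical
stand-ins, not the real struct page_info definition; only the aliasing
matters.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the union the counter lives in. The real
 * layout in Xen differs; the point is only that both members share
 * the same storage.
 */
union demo_page_aux {
    uint32_t tlbflush_timestamp;  /* written when a flush time is recorded */
    int16_t  linear_pt_count;     /* counter aliasing the same bytes */
};

/* Stand-in for a timestamp update: stores through the other member. */
static void demo_set_timestamp(union demo_page_aux *aux, uint32_t stamp)
{
    aux->tlbflush_timestamp = stamp;
}

/* Returns the counter as seen after a timestamp store has happened. */
static int16_t demo_count_after_stamp(int16_t count, uint32_t stamp)
{
    union demo_page_aux aux = { 0 };

    aux.linear_pt_count = count;     /* counter is valid here */
    demo_set_timestamp(&aux, stamp); /* no race needed: a plain store */
    return aux.linear_pt_count;      /* counter now holds stamp bytes */
}
```

A corruption of this kind needs no concurrent update of the counter,
which would fit with it showing up for two slots in a row.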

There's another similar instance around line 14480. The 3rd instance
(around line 13580) is a little different, in that there's no direct sign of
dropped messages, but then again there are also no useful messages
for d63 immediately ahead of the crash.

What is clear is that the referenced page always has a correct count
associated (it's always printed as zero, meaning it was incremented
from -1 just before the crash).

I now wonder whether the set_tlbflush_timestamp() invocation from
_put_page_type() is still too aggressive. In commit 2c458dfcb5 we
reduced the invocations only as much as was deemed necessary at the
time, and the description explicitly says "for now". I see two options for
refining the conditional: One would be "if ( !ptpg )" (i.e. just drop the
other half of the || ), another would be to fully match the comment
and invoke it only for non-page-table pages (sort of the inverse of
the earlier if(), i.e. (x & PGT_type_mask) > PGT_l4_page_table).
The change was kept that minimal because we were afraid of losing a
flush that is in fact necessary.
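The two candidate conditions can be sketched as standalone predicates.
This is only an illustration: the DEMO_* constants are stand-in values
that assume nothing beyond the page-table types sorting at or below
L4, ptpg is modeled as a plain pointer, and the half of the existing
|| that option 1 would drop is deliberately not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in values, not the real Xen definitions. */
#define DEMO_PGT_type_mask     7u
#define DEMO_PGT_l4_page_table 4u

/* Hypothetical stand-in for struct page_info. */
struct demo_page {
    unsigned long type_info;
};

/* Option 1: "if ( !ptpg )", i.e. stamp only when ptpg is absent. */
static bool demo_stamp_if_no_ptpg(const struct demo_page *ptpg)
{
    return ptpg == NULL;
}

/*
 * Option 2: stamp only for non-page-table pages, i.e.
 * (x & PGT_type_mask) > PGT_l4_page_table.
 */
static bool demo_stamp_if_not_pagetable(unsigned long x)
{
    return (x & DEMO_PGT_type_mask) > DEMO_PGT_l4_page_table;
}
```

Option 2 does not look at ptpg at all, so it would skip the timestamp
update for every page table type regardless of how we got here.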

But if that was the case, and if the linear page table use in NetBSD
is not too different between 32- and 64-bit, I'd expect the same
issue to be observable with 64-bit guests. Or wait - in the 32-bit
case we can come here with ptpg either L2 or L3, while in the 64-bit
case this would only ever be an L4 (unless someone artificially set
up linear tables at the L3 level). So this might explain the difference
in behavior. The only remaining issue then is that I can't come up
with a scenario where we would reach that second if() for a page
table in the first place: There would need to be one with (initially)
a single type ref but both PGT_validated and PGT_partial clear.

Andrew, George, do you have any helpful thoughts here?

Jan


