Re: [Xen-devel] 4.11.0 RC1 panic
>>> On 25.06.18 at 10:33, <bouyer@xxxxxxxxxxxxxxx> wrote:
> On Thu, Jun 14, 2018 at 08:33:17AM -0600, Jan Beulich wrote:
>> > So far I've not been able to make Xen panic with the new xen kernel.
>> > Attached is a log of the serial console, in case you notice something.
>>
>> None of the printk()s replacing ASSERT()s have triggered, so nothing
>> interesting to learn from the log, unfortunately.
>>
>> > I'll keep anita tests running in a loop overnight, in case it ends up
>> > hitting an assert.
>
> Hello,
> the dom0 has been running for a week now, running the daily NetBSD tests.
> Attached is the console log.
> I didn't notice anything suspect, except a few domU crashes (crashing in
> Xen, the problem is not reported back to the domU). But as this is
> running NetBSD-HEAD tests it can also be a bug in the domU, that has
> been fixed since then.
>
> It's possible that the printk changed timings in a way that prevents the
> race condition from happening ...

It may have made it less likely, but there is at least one instance in
the log (around line 6830). Sadly, this follows a set of dropped
messages (which may have been sufficient to make the race trigger
again). That is - we know nothing about d32 ahead of the crash, which
is not helpful at all. The only interesting aspect is that this appears
to trigger for two slots in a row. To me this makes it less likely again
for there to be a race in updating the counter, and more likely for the
counter (living in a union, as you may recall) to be overwritten by
other code.

There's another similar instance around line 14480.

The 3rd instance (around line 13580) is a little different, in that
there's no direct sign of dropped messages, but then again there are
also no useful messages for d63 immediately ahead of the crash.

What is clear is that the referenced page always has a correct count
associated (it's always printed as zero, meaning it was incremented
from -1 just before the crash).

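To illustrate the "overwritten by other code" hypothesis, here is a minimal sketch of how a store through one union member can clobber a counter that shares the same storage, with no race involved. All names and the layout (`toy_page_aux`, `flush_stamp`, `linear_count`) are invented for illustration; this is not Xen's struct page_info, and the helpers are not Xen functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, invented for this sketch; it is NOT Xen's
 * struct page_info. The point is only that both views share storage.
 */
union toy_page_aux {
    uint32_t flush_stamp;          /* stands in for a TLB flush time stamp */
    struct {
        uint16_t other;            /* unrelated bookkeeping */
        int16_t  linear_count;     /* stands in for the linear PT counter */
    } pt;
};

/* Take a reference through the counter view of the union. */
static inline void toy_get_linear_ref(union toy_page_aux *aux)
{
    assert(aux->pt.linear_count >= 0);
    aux->pt.linear_count++;
}

/* Store a time stamp through the other view; this rewrites all four bytes. */
static inline void toy_set_flush_stamp(union toy_page_aux *aux, uint32_t stamp)
{
    aux->flush_stamp = stamp;
}

/* Read the counter back, whatever was stored into the union last. */
static inline int toy_linear_count(const union toy_page_aux *aux)
{
    return aux->pt.linear_count;
}
```

In this toy model, a time stamp store after a reference has been taken silently replaces the counter: an all-ones stamp makes it read back as -1 and a zero stamp makes it read back as 0, independent of how many references were taken before.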
I now wonder whether the set_tlbflush_timestamp() invocation from
_put_page_type() is still too aggressive. In commit 2c458dfcb5 we've
reduced the invocations just as much as was deemed necessary then, and
the description explicitly says "for now". I see two options for
refining the conditional: One would be "if ( !ptpg )" (i.e. just drop
the other half of the || ), another would be to fully match the comment
and invoke it only for non-page-table pages (sort of the inverse of the
earlier if(), i.e. (x & PGT_type_mask) > PGT_l4_page_table).

It was done that minimal way because we were afraid of losing a flush
that indeed is necessary. But if that was the case, and if the linear
page table use in NetBSD is not too different between 32- and 64-bit,
I'd expect the same issue to be observable with 64-bit guests. Or wait -
in the 32-bit case we can come here with ptpg either L2 or L3, while in
the 64-bit case this would only ever be an L4 (unless someone
artificially set up linear tables at the L3 level). So this might
explain the difference in behavior.

The only remaining issue then is that I can't seem to be able to make up
a scenario where we would reach that second if() for a page table in the
first place: There would need to be one with (initially) a single type
ref but both PGT_validated and PGT_partial clear.

Andrew, George, do you have any helpful thoughts here?

Jan