
Re: [Xen-devel] [PATCH v2 1/2] x86/mm: fix a potential race condition in map_pages_to_xen().





On 11/13/2017 5:31 PM, Jan Beulich wrote:
On 10.11.17 at 15:05, <yu.c.zhang@xxxxxxxxxxxxxxx> wrote:
On 11/10/2017 5:49 PM, Jan Beulich wrote:
I'm not certain this is important enough a fix to consider for 4.10,
and you seem to think it's good enough if this gets applied only
after the tree would be branched, as you didn't Cc Julien. Please
indicate if you actually simply weren't aware, or if there indeed
is an important aspect to this that I'm overlooking.
Well, at first I had not expected this to be accepted for 4.10. But since
we have met this issue in practice, when running a graphics application
which consumes memory intensively in dom0, I think it also makes sense to
fix it in a Xen release as early as possible. Do you think this is a
reasonable request? :-)
You'd need to provide further details for us to understand the
scenario. It obviously depends on whether you have other
patches to Xen which actually trigger this. If the problem can
be triggered from outside of a vanilla upstream Xen, then yes,
I think I would favor the fixes being included.

Thanks, Jan. Let me try to give an explanation of the scenario. :-)

We saw an ASSERT failure, ASSERT((page->count_info & PGC_count_mask) != 0),
in is_iomem_page() <- put_page_from_l1e() <- alloc_l1_table(), when running
a graphics application (a memory eater, but closed source) in dom0. This
failure only happens when dom0 is configured with 2 vCPUs.
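
For reference, the failing check sits at the top of is_iomem_page().
Roughly (paraphrasing the Xen code of that era; details may differ by
version):

    bool is_iomem_page(mfn_t mfn)
    {
        struct page_info *page;

        if ( !mfn_valid(mfn) )
            return true;

        /* Caller must hold a reference, or know this is an iomem page. */
        page = mfn_to_page(mfn);
        ASSERT((page->count_info & PGC_count_mask) != 0);   /* fires here */

        return (page_get_owner(page) == dom_io);
    }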

Our debugging showed that the page's count_info had already (and
unexpectedly) been cleared in free_xenheap_pages(), reached via a call
trace like this:

free_xenheap_pages()
    ^
    |
free_xen_pagetable()
    ^
    |
map_pages_to_xen()
    ^
    |
update_xen_mappings()
    ^
    |
get_page_from_l1e()
    ^
    |
mod_l1_entry()
    ^
    |
do_mmu_update()
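
(The top of the chain is mechanical: outside of early boot,
free_xen_pagetable() is just a thin wrapper, roughly:

    void free_xen_pagetable(void *v)
    {
        if ( system_state != SYS_STATE_early_boot )
            free_xenheap_page(v);      /* i.e. free_xenheap_pages(v, 0) */
    }

so whatever page table map_pages_to_xen() hands it goes straight back to
the xenheap, where its count_info gets cleared.)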

We then realized that it happened while dom0 was updating its page tables:
when the cache attributes of a referenced page frame are about to change,
the corresponding mappings in the Xen VA space are updated by
map_pages_to_xen() as well.
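
That path is update_xen_mappings(); trimmed down, it looks roughly like
this (a simplified sketch, omitting the handling of the Xen image alias):

    /* Sketch of update_xen_mappings() in xen/arch/x86/mm.c: re-map Xen's
     * own 1:1 mapping of the frame with the new cache attributes. */
    static int update_xen_mappings(unsigned long mfn, unsigned int cacheattr)
    {
        return map_pages_to_xen((unsigned long)mfn_to_virt(mfn), mfn, 1,
                                PAGE_HYPERVISOR |
                                cacheattr_to_pte_flags(cacheattr));
    }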

However, since map_pages_to_xen() has the aforementioned race, when
MMU_NORMAL_PT_UPDATE is triggered concurrently on different CPUs, it may
mistakenly free a superpage referenced by pl2e. That is why our ASSERT
failure only happens when dom0 has more than one vCPU configured.
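
The hazard boils down to the L2 entry being read more than once with no
lock held across the reads. A minimal sketch of the pattern (not the
literal upstream code):

    /* Double-read hazard in the superpage re-consolidation path: */
    if ( !(l2e_get_flags(*pl2e) & _PAGE_PSE) )   /* 1st read: sees L1 table */
    {
        /*
         * Window: a concurrent map_pages_to_xen() on another CPU can
         * install a 2M superpage into *pl2e here.
         */
        free_xen_pagetable(l2e_to_l1e(*pl2e));   /* 2nd read: may now free */
    }                                            /* the superpage itself   */

Freeing the live superpage clears its count_info in free_xenheap_pages(),
which is exactly the state the ASSERT later trips over.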

As to the code base, we were running XenGT code, which carries only a few
non-upstreamed patches against Xen - I believe most of them are libxl
related, and none of them touches the MMU code. So I believe this issue
could be triggered by a PV guest against a vanilla upstream Xen.
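
For illustration, a PV kernel reaches this path with a plain mmu_update
hypercall whose new l1e carries a non-WB cache attribute. A hypothetical
snippet using the Linux-style hypercall wrapper (machine_addr and new_mfn
are placeholders):

    /* Hypothetical PV-guest snippet: rewrite an L1 entry with PCD set,
     * so Xen must adjust the cache attributes of its own mapping too. */
    struct mmu_update req;

    req.ptr = machine_addr | MMU_NORMAL_PT_UPDATE;    /* m.a. of the l1e  */
    req.val = (new_mfn << PAGE_SHIFT) |
              _PAGE_PRESENT | _PAGE_RW | _PAGE_PCD;   /* non-WB attribute */
    HYPERVISOR_mmu_update(&req, 1, NULL, DOMID_SELF);

Two dom0 vCPUs issuing such updates concurrently for frames under the same
L2 entry is what opens the window described above.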

Is the above description convincing enough? :-)

Yu

Jan



