[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v1 03/10] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes



I think I know what's going wrong:

Pages that are pinned via gfn_to_pfn() and friends take a references,
however are often released via
kvm_release_pfn_clean()/kvm_release_pfn_dirty()/kvm_release_page_clean()...


E.g., in arch/x86/kvm/x86.c:reexecute_instruction()

...
pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
...
kvm_release_pfn_clean(pfn);



void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
        if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
                put_page(pfn_to_page(pfn));
}

This function makes perfect sense as the counterpart for kvm_get_pfn():

void kvm_get_pfn(kvm_pfn_t pfn)
{
        if (!kvm_is_reserved_pfn(pfn))
                get_page(pfn_to_page(pfn));
}


As all ZONE_DEVICE pages are currently reserved, pages pinned via
gfn_to_pfn() and friends will often not see a put_page() AFAIKS.

Assuming gup() takes a reference for ZONE_DEVICE pages, yes, this is a
KVM bug.

Yes, it does take a reference AFAIKs. E.g.,

mm/gup.c:gup_pte_range():
...
                if (pte_devmap(pte)) {
                        if (unlikely(flags & FOLL_LONGTERM))
                                goto pte_unmap;

                        pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
                        if (unlikely(!pgmap)) {
                                undo_dev_pagemap(nr, nr_start, pages);
                                goto pte_unmap;
                        }
                } else if (pte_special(pte))
                        goto pte_unmap;

                VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
                page = pte_page(pte);

                head = try_get_compound_head(page, 1);

try_get_compound_head() will increment the reference count.


Now, my patch does not change that, the result of
kvm_is_reserved_pfn(pfn) will be unchanged. A proper fix for that would
probably be

a) To drop the reference to ZONE_DEVICE pages in gfn_to_pfn() and
friends, after you successfully pinned the pages. (not sure if that's
the right thing to do but you're the expert)

b) To not use kvm_release_pfn_clean() and friends on pages that were
definitely pinned.

This is already KVM's intent, i.e. the purpose of the PageReserved() check
is simply to avoid putting a non-existent reference.  The problem is that
KVM assumes pages with PG_reserved set are never pinned, which AFAICT was
true when the code was first added.

(talking to myself, sorry)

Thinking again, dropping this patch from this series could effectively also
fix that issue. E.g., kvm_release_pfn_clean() and friends would always do a
put_page() if "pfn_valid() and !PageReserved()", so after patch 9 also on
ZONDE_DEVICE pages.

Yeah, this appears to be the correct fix.

But it would have side effects that might not be desired. E.g.,:

1. kvm_pfn_to_page() would also return ZONE_DEVICE pages (might even be the
right thing to do).

This should be ok, at least on x86.  There are only three users of
kvm_pfn_to_page().  Two of those are on allocations that are controlled by
KVM and are guaranteed to be vanilla MAP_ANONYMOUS.  The third is on guest
memory when running a nested guest, and in that case supporting ZONE_DEVICE
memory is desirable, i.e. KVM should play nice with a guest that is backed
by ZONE_DEVICE memory.

2. kvm_set_pfn_dirty() would also set ZONE_DEVICE pages dirty (might be
okay)

This is ok from a KVM perspective.

What about

void kvm_get_pfn(kvm_pfn_t pfn)
{
        if (!kvm_is_reserved_pfn(pfn))
                get_page(pfn_to_page(pfn));
}

Is a pure get_page() sufficient in case of ZONE_DEVICE?
(asking because of the live references obtained via get_dev_pagemap(pte_pfn(pte), pgmap) in mm/gup.c:gup_pte_range() somewhat confuse me :) )



The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.

The x86 KVM MMU code is one of the ugliest code I know (sorry, but it had to be said :/ ). Luckily, this should be independent of the PG_reserved thingy AFAIKs.

--

Thanks,

David / dhildenb


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.