
Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)



On Wed, Oct 23, 2019 at 12:26 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 22.10.19 23:54, Dan Williams wrote:
> > Hi David,
> >
> > Thanks for tackling this!
>
> Thanks for having a look :)
>
> [...]
>
>
> >> I am probably a little bit too careful (but I don't want to break things).
> >> In most places (besides KVM and vfio that are nuts), the
> >> pfn_to_online_page() check could most probably be avoided by a
> >> is_zone_device_page() check. However, I usually get suspicious when I see
> >> a pfn_valid() check (especially after I learned that people mmap parts of
> >> /dev/mem into user space, including memory without memmaps. Also, people
> >> could mmap offline memory blocks this way :/). As long as this does not
> >> hurt performance, I think we should rather do it the clean way.
> >
> > I'm concerned about using is_zone_device_page() in places that are not
> > known to already have a reference to the page. Here's an audit of
> > current usages, and the ones I think need to be cleaned up. The "unsafe"
> > ones do not appear to have any protections against the device page
> > being removed (get_dev_pagemap()). Yes, some of these were added by
> > me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
> > pages into anonymous memory paths and I'm not up to speed on how it
> > guarantees 'struct page' validity vs device shutdown without using
> > get_dev_pagemap().
> >
> > smaps_pmd_entry(): unsafe
> >
> > put_devmap_managed_page(): safe, page reference is held
> >
> > is_device_private_page(): safe? gpu driver manages private page lifetime
> >
> > is_pci_p2pdma_page(): safe, page reference is held
> >
> > uncharge_page(): unsafe? HMM
> >
> > add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
> >
> > soft_offline_page(): unsafe
> >
> > remove_migration_pte(): unsafe? HMM
> >
> > move_to_new_page(): unsafe? HMM
> >
> > migrate_vma_pages() and helpers: unsafe? HMM
> >
> > try_to_unmap_one(): unsafe? HMM
> >
> > __put_page(): safe
> >
> > release_pages(): safe
> >
> > I'm hoping all the HMM ones can be converted to
> > is_device_private_page() directly and have that routine grow a nice
> > comment about how it knows it can always safely de-reference its @page
> > argument.
> >
> > For the rest I'd like to propose that we add a facility to determine
> > ZONE_DEVICE by pfn rather than page. The most straightforward way I
> > can think of would be to just add another bitmap to mem_section_usage
> > to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it 
> in this thread)
>
> I dislike this for three reasons
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and removed at
>    any time. Racy.

True, we need to solve that problem too. That seems to want something
lighter weight than the hotplug lock that can be held over pfn lookups
+ use, rather than requiring a page lookup in paths where it's not
clear that a page reference would prevent unplug.
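
For ZONE_DEVICE the reference-based protection I keep referring to looks
roughly like this today. A sketch of the pattern behind the "safe"
entries in the audit above, not any particular call site; the helper
name is made up:

        /*
         * Pin the pagemap before touching the struct page for a random
         * pfn, then try to take a page reference before the pagemap
         * reference is dropped again.
         */
        static struct page *devmap_page_from_pfn(unsigned long pfn)
        {
                struct dev_pagemap *pgmap;
                struct page *page;

                pgmap = get_dev_pagemap(pfn, NULL);
                if (!pgmap)
                        return NULL;    /* pfn not covered by a dev_pagemap */

                page = pfn_to_page(pfn);
                if (!get_page_unless_zero(page))
                        page = NULL;

                put_dev_pagemap(pgmap);
                return page;
        }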

> c) We mix ZONE-specific stuff into the core. It should be "just another
>    zone".

Not sure I grok this when the RFC is sprinkling zone-specific
is_zone_device_page() throughout the core?

>
> What I propose instead (already discussed in 
> https://lkml.org/lkml/2019/10/10/87)

Sorry I missed this earlier...

>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap has been initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE

This does not seem to reduce any complexity because it still requires
a pfn to zone lookup at the end of the process.

I.e. converting pfn_to_online_page() to use a new pfn_active()
subsection map plus looking up the zone from pfn_to_page() is more
steps than just doing a direct pfn to zone lookup. What am I missing?
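
For reference, here is a rough sketch of how I picture steps 2/3; the
subsection_active_map name is invented for illustration, it is not
something from the series:

        /*
         * Sketch only: one "active" bit per subsection, kept in struct
         * mem_section_usage next to the existing subsection_map, e.g.:
         *
         *      DECLARE_BITMAP(subsection_active_map, SUBSECTIONS_PER_SECTION);
         *
         * Set once the subsection's memmap has been initialized, cleared
         * before the memmap is invalidated.
         */
        static inline bool pfn_active(unsigned long pfn)
        {
                struct mem_section *ms;

                if (!pfn_valid(pfn))
                        return false;

                ms = __pfn_to_section(pfn);
                return test_bit(subsection_map_index(pfn),
                                ms->usage->subsection_active_map);
        }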

>
> Especially, driver-reserved device memory will not get set active in
> the subsection bitmap.
>
> Now to the race. Taking the memory hotplug lock at random places is ugly. I do
> wonder if we can use RCU:

Ah, yes, exactly what I was thinking above.

>
> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
>
>         /* the memmap is guaranteed to remain active under RCU */
>         rcu_read_lock();
>         if (pfn_active(random_pfn)) {
>                 page = pfn_to_page(random_pfn);
>                 ... use the page, stays valid
>         }
>         rcu_read_unlock();
>
> Memory offlining/memremap code:
>
>         set_subsections_inactive(pfn, nr_pages); /* clears the bit atomically */
>         synchronize_rcu();
>         /* all users saw the bitmap update, we can invalidate the memmap */
>         remove_pfn_range_from_zone(zone, pfn, nr_pages);

Looks good to me.
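
To spell out how steps 6/7 might look on top of that, combined with the
RCU pattern (again just a sketch; callers are assumed to stay inside the
RCU read section for as long as they use the page):

        static inline struct page *pfn_to_online_page(unsigned long pfn)
        {
                struct page *page;

                RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
                                 "pfn_to_online_page() needs rcu_read_lock()");

                if (!pfn_active(pfn))
                        return NULL;

                page = pfn_to_page(pfn);
                return is_zone_device_page(page) ? NULL : page;
        }

        static inline struct page *pfn_to_device_page(unsigned long pfn)
        {
                struct page *page;

                if (!pfn_active(pfn))
                        return NULL;

                page = pfn_to_page(pfn);
                return is_zone_device_page(page) ? page : NULL;
        }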

>
> >
> >>
> >> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >> on x86-64 and PPC.
> >
> > I'll give it a spin, but I don't think the kernel wants to grow more
> > is_zone_device_page() users.
>
> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> The other parts can rely on pfn_to_online_page() only.
>
> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes"
> - Basically never used with ZONE_DEVICE.
> - We hold a reference!
> - All it protects is a SetPageDirty(page);
>
> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> - Same as 1.
>
> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
>   references (I assume this should be fine as we don't come via random
>   PFNs)
> - We check that we don't mix Reserved (including device memory) and CMA
>   pages when crossing compound pages.
>
> I think we can drop 1. and 2., resulting in a total of 2 new users in
> the same context. I think that is totally tolerable to finally clean
> this up.

...but more is_zone_device_page() doesn't "finally clean this up".
Like we discussed above, it's the missing locking that's the real
cleanup; the pfn_to_online_page() internals are secondary.

>
>
> However, I think we also have to clarify if we need the change in 3 at all.
> It comes from
>
> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> Author: Kees Cook <keescook@xxxxxxxxxxxx>
> Date:   Tue Jun 7 11:05:33 2016 -0700
>
>     mm: Hardened usercopy
>
>     This is the start of porting PAX_USERCOPY into the mainline kernel. This
>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>     work is based on code by PaX Team and Brad Spengler, and an earlier port
>     from Casey Schaufler. Additional non-slab page tests are from Rik van 
> Riel.
> [...]
>     - otherwise, object must not span page allocations (excepting Reserved
>       and CMA ranges)
>
> Not sure if we really have to care about ZONE_DEVICE at this point.

That check needs to be careful to ignore ZONE_DEVICE pages. There's
nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
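
I.e. something along these lines in the page-span check (sketch only,
with a hypothetical helper, not the actual mm/usercopy.c code):

        /*
         * The "object must not span page allocations" rule already
         * exempts Reserved and CMA ranges; device pages would simply
         * join that exemption instead of being reported as a violation.
         */
        static bool span_is_allowed(struct page *start, struct page *end)
        {
                if (PageReserved(start) && PageReserved(end))
                        return true;
                if (is_migrate_cma_page(start) && is_migrate_cma_page(end))
                        return true;
                /* A copy spanning ZONE_DEVICE and typical pages is fine. */
                if (is_zone_device_page(start) || is_zone_device_page(end))
                        return true;
                return false;
        }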
