[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

To: David Hildenbrand <david@xxxxxxxxxx>
From: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Wed, 23 Oct 2019 12:39:31 -0700
Cc: Kate Stewart <kstewart@xxxxxxxxxxxxxxxxxxx>, linux-hyperv@xxxxxxxxxxxxxxx, Michal Hocko <mhocko@xxxxxxxx>, Radim Krčmář <rkrcmar@xxxxxxxxxx>, KVM list <kvm@xxxxxxxxxxxxxxx>, Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>, KarimAllah Ahmed <karahmed@xxxxxxxxx>, Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, Alexander Duyck <alexander.duyck@xxxxxxxxx>, Michal Hocko <mhocko@xxxxxxxxxx>, Paul Mackerras <paulus@xxxxxxxxxx>, Linux MM <linux-mm@xxxxxxxxx>, Paul Mackerras <paulus@xxxxxxxxx>, Michael Ellerman <mpe@xxxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Wanpeng Li <wanpengli@xxxxxxxxxxx>, "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>, Fabio Estevam <festevam@xxxxxxxxx>, Ben Chan <benchan@xxxxxxxxxxxx>, Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>, devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>, "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx>, Joerg Roedel <joro@xxxxxxxxxx>, X86 ML <x86@xxxxxxxxxx>, YueHaibing <yuehaibing@xxxxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxxxxx>, Madhumitha Prabakaran <madhumithabiw@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Vlastimil Babka <vbabka@xxxxxxx>, Nishka Dasgupta <nishkadg.linux@xxxxxxxxx>, Anthony Yznaga <anthony.yznaga@xxxxxxxxxx>, Oscar Salvador <osalvador@xxxxxxx>, Dan Carpenter <dan.carpenter@xxxxxxxxxx>, "Isaac J. Manjarres" <isaacm@xxxxxxxxxxxxxx>, Matt Sickler <Matt.Sickler@xxxxxxxxxxxxxx>, Kees Cook <keescook@xxxxxxxxxxxx>, Anshuman Khandual <anshuman.khandual@xxxxxxx>, Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>, Simon Sandström <simon@xxxxxxxxxx>, Sasha Levin <sashal@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, kvm-ppc@xxxxxxxxxxxxxxx, Qian Cai <cai@xxxxxx>, Alex Williamson <alex.williamson@xxxxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Nicholas Piggin <npiggin@xxxxxxxxx>, Andy Lutomirski <luto@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Todd Poynor <toddpoynor@xxxxxxxxxx>, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>, Allison Randal <allison@xxxxxxxxxxx>, Jim Mattson <jmattson@xxxxxxxxxx>, Christophe Leroy <christophe.leroy@xxxxxx>, Vandana BN <bnvandana@xxxxxxxxx>, Jeremy Sowden <jeremy@xxxxxxxxxx>, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>, Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>, Cornelia Huck <cohuck@xxxxxxxxxx>, Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>, Sean Christopherson <sean.j.christopherson@xxxxxxxxx>, Rob Springer <rspringer@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Johannes Weiner <hannes@xxxxxxxxxxx>, Paolo Bonzini <pbonzini@xxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, linuxppc-dev <linuxppc-dev@xxxxxxxxxxxxxxxx>
Delivery-date: Wed, 23 Oct 2019 19:39:54 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, Oct 23, 2019 at 10:28 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> >> I dislike this for three reasons
> >>
> >> a) It does not protect against any races, really, it does not improve 
> >> things.
> >> b) We do have the exact same problem with pfn_to_online_page(). As long as 
> >> we
> >>    don't hold the memory hotplug lock, memory can get offlined and remove 
> >> any time. Racy.
> >
> > True, we need to solve that problem too. That seems to want something
> > lighter weight than the hotplug lock that can be held over pfn lookups
> > +  use rather than requiring a page lookup in paths where it's not
> > clear that a page reference would prevent unplug.
> >
> >> c) We mix in ZONE specific stuff into the core. It should be "just another 
> >> zone"
> >
> > Not sure I grok this when the RFC is sprinkling zone-specific
> > is_zone_device_page() throughout the core?
>
> Most users should not care about the zone. pfn_active() would be enough
> in most situations, especially most PFN walkers - "this memmap is valid
> and e.g., contains a valid zone ...".

Oh, I see, you're saying convert most users to pfn_active() (and some
TBD rcu locking), but only pfn_to_online_page() users would need the
zone lookup? I can get on board with that.

>
> >
> >>
> >> What I propose instead (already discussed in 
> >> https://lkml.org/lkml/2019/10/10/87)
> >
> > Sorry I missed this earlier...
> >
> >>
> >> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> >> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> >> 3. Introduce pfn_active() that checks against the subsection bitmap
> >> 4. Once the memmap was initialized / prepared, set the subsection active
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. Before the memmap gets invalidated, set the subsection inactive
> >>    (similar to SECTION_IS_ONLINE in the buddy right now)
> >> 5. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> >> 6. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
> >
> > This does not seem to reduce any complexity because it still requires
> > a pfn to zone lookup at the end of the process.
> >
> > I.e. converting pfn_to_online_page() to use a new pfn_active()
> > subsection map plus looking up the zone from pfn_to_page() is more
> > steps than just doing a direct pfn to zone lookup. What am I missing?
>
> That a real "pfn to zone" lookup without going via the struct page will
> require to have more than just a single bitmap. IMHO, keeping the
> information at a single place (memmap) is the clean thing to do (not
> replicating it somewhere else). Going via the memmap might not be as
> fast as a direct lookup, but do we actually care? We are already looking
> at "random PFNs we are not sure if there is a valid memmap".

True, we only care about the validity of the check, and as you pointed
out moving the check to the pfn level does not solve the validity
race. It needs a lock.

>
> >>
> >> Especially, driver-reserved device memory will not get set active in
> >> the subsection bitmap.
> >>
> >> Now to the race. Taking the memory hotplug lock at random places is ugly. 
> >> I do
> >> wonder if we can use RCU:
> >
> > Ah, yes, exactly what I was thinking above.
> >
> >>
> >> The user of pfn_active()/pfn_to_online_page()/pfn_to_device_page():
> >>
> >>         /* the memmap is guaranteed to remain active under RCU */
> >>         rcu_read_lock();
> >>         if (pfn_active(random_pfn)) {
> >>                 page = pfn_to_page(random_pfn);
> >>                 ... use the page, stays valid
> >>         }
> >>         rcu_unread_lock();
> >>
> >> Memory offlining/memremap code:
> >>
> >>         set_subsections_inactive(pfn, nr_pages); /* clears the bit 
> >> atomically */
> >>         synchronize_rcu();
> >>         /* all users saw the bitmap update, we can invalide the memmap */
> >>         remove_pfn_range_from_zone(zone, pfn, nr_pages);
> >
> > Looks good to me.
> >
> >>
> >>>
> >>>>
> >>>> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> >>>> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> >>>> on x86-64 and PPC.
> >>>
> >>> I'll give it a spin, but I don't think the kernel wants to grow more
> >>> is_zone_device_page() users.
> >>
> >> Let's recap. In this RFC, I introduce a total of 4 (!) users only.
> >> The other parts can rely on pfn_to_online_page() only.
> >>
> >> 1. "staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved 
> >> changes"
> >> - Basically never used with ZONE_DEVICE.
> >> - We hold a reference!
> >> - All it protects is a SetPageDirty(page);
> >>
> >> 2. "staging/gasket: Prepare gasket_release_page() for PG_reserved changes"
> >> - Same as 1.
> >>
> >> 3. "mm/usercopy.c: Prepare check_page_span() for PG_reserved changes"
> >> - We come via virt_to_head_page() / virt_to_head_page(), not sure about
> >>   references (I assume this should be fine as we don't come via random
> >>   PFNs)
> >> - We check that we don't mix Reserved (including device memory) and CMA
> >>   pages when crossing compound pages.
> >>
> >> I think we can drop 1. and 2., resulting in a total of 2 new users in
> >> the same context. I think that is totally tolerable to finally clean
> >> this up.
> >
> > ...but more is_zone_device_page() doesn't "finally clean this up".
> > Like we discussed above it's the missing locking that's the real
> > cleanup, the pfn_to_online_page() internals are secondary.
>
> It's a different cleanup IMHO. We can't do everything in one shot. But
> maybe I can drop the is_zone_device_page() parts from this patch and
> completely rely on pfn_to_online_page(). Yes, that needs fixing to, but
> it's a different story.
>
> The important part of this patch:
>
> While pfn_to_online_page() will always exclude ZONE_DEVICE pages,
> checking PG_reserved on ZONE_DEVICE pages (what we do right now!) is
> racy as hell (especially when concurrently initializing the memmap).
>
> This does improve the situation.

True that's a race a vector I was not considering.

>
> >>
> >> However, I think we also have to clarify if we need the change in 3 at all.
> >> It comes from
> >>
> >> commit f5509cc18daa7f82bcc553be70df2117c8eedc16
> >> Author: Kees Cook <keescook@xxxxxxxxxxxx>
> >> Date:   Tue Jun 7 11:05:33 2016 -0700
> >>
> >>     mm: Hardened usercopy
> >>
> >>     This is the start of porting PAX_USERCOPY into the mainline kernel. 
> >> This
> >>     is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. 
> >> The
> >>     work is based on code by PaX Team and Brad Spengler, and an earlier 
> >> port
> >>     from Casey Schaufler. Additional non-slab page tests are from Rik van 
> >> Riel.
> >> [...]
> >>     - otherwise, object must not span page allocations (excepting Reserved
> >>       and CMA ranges)
> >>
> >> Not sure if we really have to care about ZONE_DEVICE at this point.
> >
> > That check needs to be careful to ignore ZONE_DEVICE pages. There's
> > nothing wrong with a copy spanning ZONE_DEVICE and typical pages.
>
> Please note that the current check would *forbid* this (AFAIKs for a
> single heap object). As discussed in the relevant patch, we might be
> able to just stop doing that and limit it to real PG_reserved pages
> (without ZONE_DEVICE). I'd be happy to not introduce new
> is_zone_device_page() users.

At least for non-HMM ZONE_DEVICE usage, i.e. the dax + pmem stuff, is
excluded from this path by:

52f476a323f9 libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

So this case is one more to add to the pile of HMM auditing.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

Follow-Ups:
- Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: David Hildenbrand

References:
- [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: David Hildenbrand
- Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: Dan Williams
- Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: David Hildenbrand
- Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: Dan Williams
- Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
  - From: David Hildenbrand

Prev by Date: [Xen-devel] [freebsd-master test] 143059: regressions - trouble: blocked/fail
Next by Date: [Xen-devel] bug: unable to LZ4 decompress ub1910 installer kernel when launching domU
Previous by thread: Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
Next by thread: Re: [Xen-devel] [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.