
Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()



On Thu 07-04-22 14:12:38, David Hildenbrand wrote:
> On 07.04.22 14:04, Michal Hocko wrote:
> > On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
> > [...]
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 3589febc6d31..130a2feceddc 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, 
> >>> struct zoneref *zonerefs)
> >>>   do {
> >>>           zone_type--;
> >>>           zone = pgdat->node_zones + zone_type;
> >>> -         if (managed_zone(zone)) {
> >>> -                 zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> -                 check_highest_zone(zone_type);
> >>> -         }
> >>> +         zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> +         check_highest_zone(zone_type);
> >>>   } while (zone_type);
> >>>  
> >>>   return nr_zones;
> >>
> >> I don't think having !populated zones in the zonelist is a particularly
> >> good idea. Populated vs !populated changes only during page
> >> onlining/offlining.
> >>
> >> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
> > 
> > What kind of problem that would cause? The allocator wouldn't see any
> > pages at all so it would fallback to the next one. Maybe kswapd would
> > need some tweak to have a bail out condition but as mentioned in the
> > thread already. !populated or !managed for that matter are not all that
> > much different from completely depleted zones. The fact that we are
> > making that distinction has led to some bugs and I suspect it makes the
> > code more complex without a very good reason.
> 
> I assume performance problems. Assume you have an ordinary system with
> multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
> ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
> always remain empty to be traversed on each and every allocation
> fallback. Of course, we could measure, but IMHO at least *that* part of
> memory onlining/offlining is not the complicated part :D

You've got a good point here. I guess there will be use cases that really
benefit from every single CPU cycle spared in the allocator hot path,
which depends heavily on how the zonelists are traversed.

I have very briefly had a look at our remaining usage of managed_zone()
and there are not that many left. We have 2 in page_alloc.c via
has_managed_dma(). I guess we could drop that one and use __GFP_NOWARN
in dma_atomic_pool_init, but this is nothing really earth shattering.
The remaining occurrences are in vmscan.c:
        - should_continue_reclaim, pgdat_balanced - required because
          they iterate all zones within the zone index and need to
          decide whether they are balanced or not. We could
          special-case empty zones and treat them as always balanced
        - kswapd_shrink_node - required because we would be increasing
          reclaim target for empty zones
        - update_reclaim_active - required because we do not want to
          alter zone state if it is not a subject of the reclaim which
          empty zones are not by definition.
        - balance_pgdat - first check is likely a microoptimization,
          reclaim_idx is needed to have a populated zone there
        - wakeup_kswapd - I dunno
        - shrink_node, allow_direct_reclaim, lruvec_lru_size -
          micro-optimizations
        - pgdat_watermark_boosted - a micro-optimization I suspect, as
          an empty zone shouldn't ever get watermark_boost
        - pgdat_balanced - functionally ditto, covered above

So we can get rid of quite some but we will still need some of them.
-- 
Michal Hocko
SUSE Labs
