[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 2/7] mm: introduce local state for lazy_mmu sections


  • To: Kevin Brodsky <kevin.brodsky@xxxxxxx>, Alexander Gordeev <agordeev@xxxxxxxxxxxxx>
  • From: David Hildenbrand <david@xxxxxxxxxx>
  • Date: Wed, 10 Sep 2025 17:37:51 +0200
  • Autocrypt: addr=david@xxxxxxxxxx; keydata= xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwZoEEwEIAEQCGwMCF4ACGQEFCwkIBwICIgIG FQoJCAsCBBYCAwECHgcWIQQb2cqtc1xMOkYN/MpN3hD3AP+DWgUCaJzangUJJlgIpAAKCRBN 3hD3AP+DWhAxD/9wcL0A+2rtaAmutaKTfxhTP0b4AAp1r/eLxjrbfbCCmh4pqzBhmSX/4z11 opn2KqcOsueRF1t2ENLOWzQu3Roiny2HOU7DajqB4dm1BVMaXQya5ae2ghzlJN9SIoopTWlR 0Af3hPj5E2PYvQhlcqeoehKlBo9rROJv/rjmr2x0yOM8qeTroH/ZzNlCtJ56AsE6Tvl+r7cW 3x7/Jq5WvWeudKrhFh7/yQ7eRvHCjd9bBrZTlgAfiHmX9AnCCPRPpNGNedV9Yty2Jnxhfmbv Pw37LA/jef8zlCDyUh2KCU1xVEOWqg15o1RtTyGV1nXV2O/mfuQJud5vIgzBvHhypc3p6VZJ lEf8YmT+Ol5P7SfCs5/uGdWUYQEMqOlg6w9R4Pe8d+mk8KGvfE9/zTwGg0nRgKqlQXrWRERv cuEwQbridlPAoQHrFWtwpgYMXx2TaZ3sihcIPo9uU5eBs0rf4mOERY75SK+Ekayv2ucTfjxr Kf014py2aoRJHuvy85ee/zIyLmve5hngZTTe3Wg3TInT9UTFzTPhItam6dZ1xqdTGHZYGU0O otRHcwLGt470grdiob6PfVTXoHlBvkWRadMhSuG4RORCDpq89vu5QralFNIf3EysNohoFy2A LYg2/D53xbU/aa4DDzBb5b1Rkg/udO1gZocVQWrDh6I2K3+cCs7BTQRVy5+RARAA59fefSDR 9nMGCb9LbMX+TFAoIQo/wgP5XPyzLYakO+94GrgfZjfhdaxPXMsl2+o8jhp/hlIzG56taNdt VZtPp3ih1AgbR8rHgXw1xwOpuAd5lE1qNd54ndHuADO9a9A0vPimIes78Hi1/yy+ZEEvRkHk /kDa6F3AtTc1m4rbbOk2fiKzzsE9YXweFjQvl9p+AMw6qd/iC4lUk9g0+FQXNdRs+o4o6Qvy iOQJfGQ4UcBuOy1IrkJrd8qq5jet1fcM2j4QvsW8CLDWZS1L7kZ5gT5EycMKxUWb8LuRjxzZ 3QY1aQH2kkzn6acigU3HLtgFyV1gBNV44ehjgvJpRY2cC8VhanTx0dZ9mj1YKIky5N+C0f21 zvntBqcxV0+3p8MrxRRcgEtDZNav+xAoT3G0W4SahAaUTWXpsZoOecwtxi74CyneQNPTDjNg azHmvpdBVEfj7k3p4dmJp5i0U66Onmf6mMFpArvBRSMOKU9DlAzMi4IvhiNWjKVaIE2Se9BY FdKVAJaZq85P2y20ZBd08ILnKcj7XKZkLU5FkoA0udEBvQ0f9QLNyyy3DZMCQWcwRuj1m73D sq8DEFBdZ5eEkj1dCyx+t/ga6x2rHyc8Sl86oK1tvAkwBNsfKou3v+jP/l14a7DGBvrmlYjO 59o3t6inu6H7pt7OL6u6BQj7DoMAEQEAAcLBfAQYAQgAJgIbDBYhBBvZyq1zXEw6Rg38yk3e EPcA/4NaBQJonNqrBQkmWAihAAoJEE3eEPcA/4NaKtMQALAJ8PzprBEXbXcEXwDKQu+P/vts IfUb1UNMfMV76BicGa5NCZnJNQASDP/+bFg6O3gx5NbhHHPeaWz/VxlOmYHokHodOvtL0WCC 8A5PEP8tOk6029Z+J+xUcMrJClNVFpzVvOpb1lCbhjwAV465Hy+NUSbbUiRxdzNQtLtgZzOV Zw7jxUCs4UUZLQTCuBpFgb15bBxYZ/BL9MbzxPxvfUQIPbnzQMcqtpUs21CMK2PdfCh5c4gS sDci6D5/ZIBw94UQWmGpM/O1ilGXde2ZzzGYl64glmccD8e87OnEgKnH3FbnJnT4iJchtSvx yJNi1+t0+qDti4m88+/9IuPqCKb6Stl+s2dnLtJNrjXBGJtsQG/sRpqsJz5x1/2nPJSRMsx9 5YfqbdrJSOFXDzZ8/r82HgQEtUvlSXNaXCa95ez0UkOG7+bDm2b3s0XahBQeLVCH0mw3RAQg r7xDAYKIrAwfHHmMTnBQDPJwVqxJjVNr7yBic4yfzVWGCGNE4DnOW0vcIeoyhy9vnIa3w1uZ 3iyY2Nsd7JxfKu1PRhCGwXzRw5TlfEsoRI7V9A8isUCoqE2Dzh3FvYHVeX4Us+bRL/oqareJ CIFqgYMyvHj7Q06kTKmauOe4Nf0l0qEkIuIzfoLJ3qr5UyXc2hLtWyT9Ir+lYlX9efqh7mOY qIws/H2t
  • Cc: linux-mm@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, Andreas Larsson <andreas@xxxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Catalin Marinas <catalin.marinas@xxxxxxx>, Christophe Leroy <christophe.leroy@xxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, "David S. Miller" <davem@xxxxxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Jann Horn <jannh@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>, Madhavan Srinivasan <maddy@xxxxxxxxxxxxx>, Michael Ellerman <mpe@xxxxxxxxxxxxxx>, Michal Hocko <mhocko@xxxxxxxx>, Mike Rapoport <rppt@xxxxxxxxxx>, Nicholas Piggin <npiggin@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>, Ryan Roberts <ryan.roberts@xxxxxxx>, Suren Baghdasaryan <surenb@xxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Vlastimil Babka <vbabka@xxxxxxx>, Will Deacon <will@xxxxxxxxxx>, Yeoreum Yun <yeoreum.yun@xxxxxxx>, linux-arm-kernel@xxxxxxxxxxxxxxxxxxx, linuxppc-dev@xxxxxxxxxxxxxxxx, sparclinux@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, Mark Rutland <Mark.Rutland@xxxxxxx>
  • Delivery-date: Wed, 10 Sep 2025 15:38:05 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>


Somewhat, but in the regular case where enter() is called followed by
leave() there is really no complexity for the caller, just an extra
local variable.

There are complications where we want to exit lazy_mmu temporarily, as
in mm/kasan/shadow.c [1k], but this is in fact unavoidable. Chatting
with Mark Rutland, I realised that to truly support nested sections,
this must be handled in a special way in any case. To be clear, I am
referring to this situation:

__kasan_populate_vmalloc:
     apply_to_page_range:
         arch_enter_lazy_mmu_mode() {1}

         kasan_populate_vmalloc_pte:
             arch_leave_lazy_mmu_mode() {2}
             arch_enter_lazy_mmu_mode() {3}

         arch_leave_lazy_mmu_mode() {4}

With the approach this series takes, call {2} is made safe by passing a
special parameter (say LAZY_MMU_FLUSH) that forces lazy_mmu to be fully
exited - and call {3} will then re-enter lazy_mmu. This works regardless
of whether __kasan_populate_vmalloc() has been called with lazy_mmu
already enabled (i.e. calls {1} and {4} can be nested).

On the other hand, with a pagefault_disabled-like approach, there is no
way to instruct call {3} to fully exit lazy_mmu regardless of the
nesting level.

Sure there is, with a better API. See below. :)


It would be possible to make both approaches work by introducing a new
API, along the lines of:
- int arch_disable_save_lazy_mmu_mode() (the return value indicates the
nesting level)
- void arch_restore_lazy_mmu_mode(int state) (re-enter lazy_mmu at the
given nesting level)

Yes, I think we really need a proper API.


This is arguably more self-documenting than passing LAZY_MMU_FLUSH in
call {2}. This API is however no simpler when using a
pagefault_disabled-like approach (and less consistent than when always
saving state on the stack).

Yes, a proper API is warranted. In particular, thinking about the following:

arch_enter_lazy_mmu_mode() {1}
        arch_enter_lazy_mmu_mode() {2}

        kasan_populate_vmalloc_pte:
                arch_leave_lazy_mmu_mode() {3}
                arch_enter_lazy_mmu_mode() {4}

        arch_leave_lazy_mmu_mode() {5}
arch_leave_lazy_mmu_mode() {6}


Imagine if we have the following API instead:

lazy_mmu_enable() {1}
        lazy_mmu_enable() {2}

        kasan_populate_vmalloc_pte:
                lazy_mmu_pause() {3}
                lazy_mmu_continue() {4}

        lazy_mmu_disable() {5}
lazy_mmu_disable() {6}


I think it is crucial that after lazy_mmu_save/lazy_mmu_restore, no more nesting must happen.

Assume we store in the task_struct

uint8_t lazy_mmu_enabled_count;
bool lazy_mmu_paused;

We can do things like

a) Sanity check that while we are paused that we get no more enable/disable requests
b) Sanity check that while we are paused that we get no more pause requests.

[...]


If LAZY_MMU_DEFAULT etc. are not for common code, then please
maintain them for the individual archs as well, just like you do with the
opaque type.

I see your point - having them defined in <linux/mm_types.h> could be
misleading. I just wanted to avoid all 4 architectures defining the same
macros. Maybe call them __LAZY_MMU_* to suggest they're not supposed to
be used in generic code?

Maybe look into avoiding them completely :) Let's agree on the API first and then figure out how to pass the information we need to pass.

[...]

Worse, it does not
truly enable states to be nested: it allows the outermost section to
store some state, but nested sections cannot allocate extra space. This
is really what the stack is for.

If it's really just 8 bytes I don't really see the problem. So likely
there is
more to it?

I suppose 8 extra bytes per task is acceptable, but some architectures
may want to add more state there.

Just for reference: we currently perform an order-2 allocation, effectively leaving ~4KiB "unused".

If there are any real such case on the horizon where we need to store significantly more (in which case storing it on the stack might probably also bad), please let me know.


The one case that is truly problematic (though not required at this
point) is where each (nested) section needs to store its own state. With
this series it works just fine as there is a lazy_mmu_state_t for each
section, however if we use task_struct/thread_struct there can be only
one member shared by all nested sections.

Do we have a use case for that on the horizon? If so, I fully agree, we have to store information per level. How/what information we have to store would be another question.

--
Cheers

David / dhildenb




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.