
Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier



I haven't had enough time to fully understand the deferred logic in this
change. I spotted one problem; see my comments inline.

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
>
> Of the 13 users of mmu_notifiers, 8 use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs. Of those 8,
> 4 use an interval tree (i915_gem, radeon_mn, umem_odp, hfi1) and 4 use
> a linked list of some kind (scif_dma, vhost, gntdev, hmm).
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages().  Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
>   - An interval tree tracking VA ranges, with per-range callbacks
>   - A read/write locking scheme for the interval tree that avoids
>     sleeping in the notifier path (for OOM killer)
>   - A sequence counter based collision-retry locking scheme to tell
>     device page fault that a VA range is being concurrently invalidated.
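
Just to check my understanding of the collision-retry scheme from the
driver side: I would expect a device page-fault path to look roughly like
the sketch below. This is only pseudo-code of how I picture it being used;
I'm writing it with mmu_interval_read_begin()/mmu_interval_read_retry()
style names (apologies if the helpers in this series are named
differently), and struct demo_ctx plus the demo_* functions are made-up
placeholders.

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

/* Made-up driver context, just for illustration. */
struct demo_ctx {
	struct mmu_interval_notifier notifier;	/* registered on the VA range */
	struct mutex lock;			/* also taken in ops->invalidate() */
};

static int demo_fault(struct demo_ctx *ctx, unsigned long start,
		      unsigned long npages, struct page **pages)
{
	unsigned long seq;
	int ret;

again:
	/* Sample the invalidation sequence before faulting the pages. */
	seq = mmu_interval_read_begin(&ctx->notifier);

	/* get_user_pages()/hmm_range_fault() would go here, no locks held. */
	ret = demo_get_pages(start, npages, pages);
	if (ret)
		return ret;

	mutex_lock(&ctx->lock);
	if (mmu_interval_read_retry(&ctx->notifier, seq)) {
		/* An invalidation ran concurrently; drop the pages and retry. */
		mutex_unlock(&ctx->lock);
		demo_put_pages(pages, npages);
		goto again;
	}
	/* Still valid: update the device page tables under ctx->lock. */
	demo_install_pages(ctx, pages, npages);
	mutex_unlock(&ctx->lock);
	return 0;
}

In particular I want to confirm that, as long as the driver's invalidate()
callback takes the same ctx->lock before tearing down the device mappings,
the read_retry() check under that lock is what closes the race with a
concurrent invalidation.
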
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>    invalidates are done, via active_invalidate_ranges count.
>    This approach avoids having to intersect the interval tree twice (as
>    umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>    via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>    This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>    write side by protecting it with active_invalidate_ranges and a spinlock
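
And my (simplified) reading of the multi-holder write side, as pseudo-code
rather than the patch's actual code (the helper names
mn_itree_inv_start()/mn_itree_inv_end() are my own shorthand, and corner
cases are glossed over):

/* First invalidation in makes invalidate_seq odd ("write held"). */
static void mn_itree_inv_start(struct mmu_notifier_mm *mmn_mm)
{
	spin_lock(&mmn_mm->lock);
	mmn_mm->active_invalidate_ranges++;
	mmn_mm->invalidate_seq |= 1;
	spin_unlock(&mmn_mm->lock);
}

/* Last invalidation out applies deferred changes and makes seq even again. */
static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
{
	spin_lock(&mmn_mm->lock);
	if (--mmn_mm->active_invalidate_ranges == 0) {
		/* apply deferred adds/removes from mmn_mm->deferred_list here */
		mmn_mm->invalidate_seq++;
	}
	spin_unlock(&mmn_mm->lock);
}

If I read it right, this is what lets several concurrent invalidations
share the "write side" without a full seqlock write lock, and what makes
interval tree add/remove deterministic by deferring them to the last
mn_itree_inv_end(). Please correct me if I've misread it.
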
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as that of
> most existing users, which already had a lock (or two) of some sort on
> the invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Acked-by: Christian König <christian.koenig@xxxxxxx>
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> ---
>   include/linux/mmu_notifier.h |  98 +++++++
>   mm/Kconfig                   |   1 +
>   mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>   3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>    * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>    * can't go away from under us as exit_mmap holds an mm_count pin
>    * itself.
>    */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +                          struct mm_struct *mm)
>   {
>       struct mmu_notifier *mn;
>       int id;
>   
> +     if (mmn_mm->has_interval)
> +             mn_itree_release(mmn_mm, mm);
> +
> +     if (hlist_empty(&mmn_mm->list))
> +             return;

This seems to duplicate the conditions in __mmu_notifier_release. See my
comments below; I think one of the two copies is wrong, and I suspect it
is this one, because __mmu_notifier_release follows the same pattern as
the other notifiers.


> +
>       /*
>        * SRCU here will block mmu_notifier_unregister until
>        * ->release returns.
>        */
>       id = srcu_read_lock(&srcu);
> -     hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +     hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>               /*
>                * If ->release runs before mmu_notifier_unregister it must be
>                * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>               if (mn->ops->release)
>                       mn->ops->release(mn, mm);
>   
> -     spin_lock(&mm->mmu_notifier_mm->lock);
> -     while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -             mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +     spin_lock(&mmn_mm->lock);
> +     while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +             mn = hlist_entry(mmn_mm->list.first,
>                                struct mmu_notifier,
>                                hlist);
>               /*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>                */
>               hlist_del_init_rcu(&mn->hlist);
>       }
> -     spin_unlock(&mm->mmu_notifier_mm->lock);
> +     spin_unlock(&mmn_mm->lock);
>       srcu_read_unlock(&srcu, id);
>   
>       /*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>       synchronize_srcu(&srcu);
>   }
>   
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +     struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +     if (mmn_mm->has_interval)
> +             mn_itree_release(mmn_mm, mm);

If mmn_mm->list is not empty, this will be done twice because 
mn_hlist_release duplicates this.


> +
> +     if (!hlist_empty(&mmn_mm->list))
> +             mn_hlist_release(mmn_mm, mm);

mn_hlist_release() already checks this same condition itself. A rough
sketch of the de-duplicated version I would expect is below, after the
quoted hunk.


> +}
> +
>   /*
>    * If no young bitflag is supported by the hardware, ->clear_flush_young can
>    * unmap the address and return 1 or 0 depending if the mapping previously
[snip]
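
To make the above concrete, here is an untested sketch of the
de-duplication I would expect, keeping the checks only in
__mmu_notifier_release() (the elided parts of mn_hlist_release() are
meant to be unchanged from the patch):

static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
			     struct mm_struct *mm)
{
	struct mmu_notifier *mn;
	int id;

	/*
	 * No has_interval/hlist_empty checks here; the caller has already
	 * done them.
	 */
	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
		if (mn->ops->release)
			mn->ops->release(mn, mm);

	/* ... hlist teardown under mmn_mm->lock, unchanged from the patch ... */

	srcu_read_unlock(&srcu, id);
	synchronize_srcu(&srcu);
}

void __mmu_notifier_release(struct mm_struct *mm)
{
	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;

	if (mmn_mm->has_interval)
		mn_itree_release(mmn_mm, mm);

	if (!hlist_empty(&mmn_mm->list))
		mn_hlist_release(mmn_mm, mm);
}

Dropping the checks the other way around would work too, of course; the
point is just that only one of the two places should be doing them.
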
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel