
Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers



On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
> 
> Any further feedback is highly appreciated of course.

Any other feedback before I post this as non-RFC?

> ---
> From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxxx>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start, and that is a problem for the
> oom_reaper because it needs to guarantee forward progress and so cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover, the majority of notifiers only care about a portion of the
> address space and there is absolutely zero reason to fail when we are
> unmapping an unrelated range. Many notifiers do really block and wait
> for HW, which is harder to handle, and there we have to bail out.
> 
> This patch handles the low hanging fruit.
> __mmu_notifier_invalidate_range_start gets a blockable flag and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks and continuing as long as we do not block down the call
> chain.
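>
> For illustration, the notifier-side pattern used throughout the patch
> looks roughly like this (a simplified sketch; "example_mn" and its lock
> are made up, the real conversions are in the diff below):
>
>     static int example_invalidate_range_start(struct mmu_notifier *mn,
>                                               struct mm_struct *mm,
>                                               unsigned long start,
>                                               unsigned long end,
>                                               bool blockable)
>     {
>             struct example_mn *emn = container_of(mn, struct example_mn, mn);
>
>             /* Only sleep on the lock when the caller can tolerate blocking. */
>             if (blockable)
>                     mutex_lock(&emn->lock);
>             else if (!mutex_trylock(&emn->lock))
>                     return -EAGAIN;
>
>             /* ... tear down mappings in [start, end) ... */
>
>             mutex_unlock(&emn->lock);
>             return 0;
>     }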
> 
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper side then simply retries if there is at least one
> notifier which couldn't make any progress in !blockable mode. A retry
> loop is already implemented to wait for the mmap_sem and this is
> basically the same thing.
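>
> Concretely, the reaper only has to check the return value of the new
> non-blocking variant and leave the range for a later retry; condensed
> from the __oom_reap_task_mm hunk below:
>
>     if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
>             /* a notifier would have had to block - try this vma later */
>             ret = false;
>             continue;
>     }
>     unmap_page_range(&tlb, vma, start, end, NULL);
>     mmu_notifier_invalidate_range_end(mm, start, end);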
> 
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>   on a lock and amdgpu_mn_invalidate_node on an unbound timeout); make sure
>   we bail out when we have an intersecting range, for starters
> - log which notifier callback failed, for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>   callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>   path so we have to back off early on overlapping ranges
> 
> Cc: "David (ChunMing) Zhou" <David1.Zhou@xxxxxxx>
> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Cc: "Radim Krčmář" <rkrcmar@xxxxxxxxxx>
> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> Cc: "Christian König" <christian.koenig@xxxxxxx>
> Cc: David Airlie <airlied@xxxxxxxx>
> Cc: Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx>
> Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> Cc: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>
> Cc: Doug Ledford <dledford@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>
> Cc: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
> Cc: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>
> Cc: Sudeep Dutt <sudeep.dutt@xxxxxxxxx>
> Cc: Ashutosh Dixit <ashutosh.dixit@xxxxxxxxx>
> Cc: Dimitri Sivanich <sivanich@xxxxxxx>
> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
> Cc: Juergen Gross <jgross@xxxxxxxx>
> Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Felix Kuehling <felix.kuehling@xxxxxxx>
> Cc: kvm@xxxxxxxxxxxxxxx (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@xxxxxxxxxxxxxxx (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx (open list:DRM DRIVERS)
> Cc: intel-gfx@xxxxxxxxxxxxxxxxxxxxx (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@xxxxxxxxxxxxxxx (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@xxxxxxxxx (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
>  arch/x86/kvm/x86.c                      |  7 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>  drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>  drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>  drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>  drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>  drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>  drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>  include/linux/kvm_host.h                |  4 +--
>  include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>  include/linux/oom.h                     |  2 +-
>  include/rdma/ib_umem_odp.h              |  3 +-
>  mm/hmm.c                                |  7 ++--
>  mm/mmap.c                               |  2 +-
>  mm/mmu_notifier.c                       | 19 ++++++++---
>  mm/oom_kill.c                           | 29 ++++++++--------
>  virt/kvm/kvm_main.c                     | 15 ++++++---
>  19 files changed, 225 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>       kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>  }
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -             unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +             unsigned long start, unsigned long end,
> +             bool blockable)
>  {
>       unsigned long apic_address;
>  
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>       apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>       if (start <= apic_address && apic_address < end)
>               kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +     return 0;
>  }
>  
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * Take the rmn read side lock.
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>  {
> -     mutex_lock(&rmn->read_lock);
> +     if (blockable)
> +             mutex_lock(&rmn->read_lock);
> +     else if (!mutex_trylock(&rmn->read_lock))
> +             return -EAGAIN;
> +
>       if (atomic_inc_return(&rmn->recursion) == 1)
>               down_read_non_owner(&rmn->lock);
>       mutex_unlock(&rmn->read_lock);
> +
> +     return 0;
>  }
>  
>  /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>                                                struct mm_struct *mm,
>                                                unsigned long start,
> -                                              unsigned long end)
> +                                              unsigned long end,
> +                                              bool blockable)
>  {
>       struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>       struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>       /* notification is exclusive, but interval is inclusive */
>       end -= 1;
>  
> -     amdgpu_mn_read_lock(rmn);
> +     /* TODO we should be able to split locking for interval tree and
> +      * amdgpu_mn_invalidate_node
> +      */
> +     if (amdgpu_mn_read_lock(rmn, blockable))
> +             return -EAGAIN;
>  
>       it = interval_tree_iter_first(&rmn->objects, start, end);
>       while (it) {
>               struct amdgpu_mn_node *node;
>  
> +             if (!blockable) {
> +                     amdgpu_mn_read_unlock(rmn);
> +                     return -EAGAIN;
> +             }
> +
>               node = container_of(it, struct amdgpu_mn_node, it);
>               it = interval_tree_iter_next(it, start, end);
>  
>               amdgpu_mn_invalidate_node(node, start, end);
>       }
> +
> +     return 0;
>  }
>  
>  /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   * necessitates evicting all user-mode queues of the process. The BOs
>   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>   */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>                                                struct mm_struct *mm,
>                                                unsigned long start,
> -                                              unsigned long end)
> +                                              unsigned long end,
> +                                              bool blockable)
>  {
>       struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>       struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>       /* notification is exclusive, but interval is inclusive */
>       end -= 1;
>  
> -     amdgpu_mn_read_lock(rmn);
> +     if (amdgpu_mn_read_lock(rmn, blockable))
> +             return -EAGAIN;
>  
>       it = interval_tree_iter_first(&rmn->objects, start, end);
>       while (it) {
>               struct amdgpu_mn_node *node;
>               struct amdgpu_bo *bo;
>  
> +             if (!blockable) {
> +                     amdgpu_mn_read_unlock(rmn);
> +                     return -EAGAIN;
> +             }
> +
>               node = container_of(it, struct amdgpu_mn_node, it);
>               it = interval_tree_iter_next(it, start, end);
>  
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>                               amdgpu_amdkfd_evict_userptr(mem, mm);
>               }
>       }
> +
> +     return 0;
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>       mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>                                                      struct mm_struct *mm,
>                                                      unsigned long start,
> -                                                    unsigned long end)
> +                                                    unsigned long end,
> +                                                    bool blockable)
>  {
>       struct i915_mmu_notifier *mn =
>               container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>       LIST_HEAD(cancelled);
>  
>       if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -             return;
> +             return 0;
>  
>       /* interval ranges are inclusive, but invalidate range is exclusive */
>       end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>       spin_lock(&mn->lock);
>       it = interval_tree_iter_first(&mn->objects, start, end);
>       while (it) {
> +             if (!blockable) {
> +                     spin_unlock(&mn->lock);
> +                     return -EAGAIN;
> +             }
>               /* The mmu_object is released late when destroying the
>                * GEM object so it is entirely possible to gain a
>                * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  
>       if (!list_empty(&cancelled))
>               flush_workqueue(mn->wq);
> +
> +     return 0;
>  }
>  
>  static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>                                            struct mm_struct *mm,
>                                            unsigned long start,
> -                                          unsigned long end)
> +                                          unsigned long end,
> +                                          bool blockable)
>  {
>       struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>       struct ttm_operation_ctx ctx = { false, false };
>       struct interval_tree_node *it;
> +     int ret = 0;
>  
>       /* notification is exclusive, but interval is inclusive */
>       end -= 1;
>  
> -     mutex_lock(&rmn->lock);
> +     /* TODO we should be able to split locking for interval tree and
> +      * the tear down.
> +      */
> +     if (blockable)
> +             mutex_lock(&rmn->lock);
> +     else if (!mutex_trylock(&rmn->lock))
> +             return -EAGAIN;
>  
>       it = interval_tree_iter_first(&rmn->objects, start, end);
>       while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>               struct radeon_bo *bo;
>               long r;
>  
> +             if (!blockable) {
> +                     ret = -EAGAIN;
> +                     goto out_unlock;
> +             }
> +
>               node = container_of(it, struct radeon_mn_node, it);
>               it = interval_tree_iter_next(it, start, end);
>  
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>               }
>       }
>       
> +out_unlock:
>       mutex_unlock(&rmn->lock);
> +
> +     return ret;
>  }
>  
>  static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +187,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>       rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>                                     ULLONG_MAX,
>                                     ib_umem_notifier_release_trampoline,
> +                                   true,
>                                     NULL);
>       up_read(&context->umem_rwsem);
>  }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>       return 0;
>  }
>  
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                                   struct mm_struct *mm,
>                                                   unsigned long start,
> -                                                 unsigned long end)
> +                                                 unsigned long end,
> +                                                 bool blockable)
>  {
>       struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +     int ret;
>  
>       if (!context->invalidate_range)
> -             return;
> +             return 0;
> +
> +     if (blockable)
> +             down_read(&context->umem_rwsem);
> +     else if (!down_read_trylock(&context->umem_rwsem))
> +             return -EAGAIN;
>  
>       ib_ucontext_notifier_start_account(context);
> -     down_read(&context->umem_rwsem);
> -     rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +     ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>                                     end,
> -                                   invalidate_range_start_trampoline, NULL);
> +                                   invalidate_range_start_trampoline,
> +                                   blockable, NULL);
>       up_read(&context->umem_rwsem);
> +
> +     return ret;
>  }
>  
>  static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>       if (!context->invalidate_range)
>               return;
>  
> +     /*
> +      * TODO: we currently bail out if there is any sleepable work to be done
> +      * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +      * here. But this is ugly and fragile.
> +      */
>       down_read(&context->umem_rwsem);
>       rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>                                     end,
> -                                   invalidate_range_end_trampoline, NULL);
> +                                   invalidate_range_end_trampoline, true, NULL);
>       up_read(&context->umem_rwsem);
>       ib_ucontext_notifier_end_account(context);
>  }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>                                 u64 start, u64 last,
>                                 umem_call_back cb,
> +                               bool blockable,
>                                 void *cookie)
>  {
>       int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  
>       for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>                       node; node = next) {
> +             /* TODO move the blockable decision up to the callback */
> +             if (!blockable)
> +                     return -EAGAIN;
>               next = rbt_ib_umem_iter_next(node, start, last - 1);
>               umem = container_of(node, struct ib_umem_odp, interval_tree);
>               ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>  
>  static unsigned long mmu_node_start(struct mmu_rb_node *);
>  static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>                                    struct mm_struct *,
> -                                  unsigned long, unsigned long);
> +                                  unsigned long, unsigned long, bool);
>  static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>                                          unsigned long, unsigned long);
>  static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>       handler->ops->remove(handler->ops_arg, node);
>  }
>  
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>                                    struct mm_struct *mm,
>                                    unsigned long start,
> -                                  unsigned long end)
> +                                  unsigned long end,
> +                                  bool blockable)
>  {
>       struct mmu_rb_handler *handler =
>               container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>  
>       if (added)
>               queue_work(handler->wq, &handler->del_work);
> +
> +     return 0;
>  }
>  
>  /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>  
>       down_read(&ctx->umem_rwsem);
>       rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -                                   mr_leaf_free, imr);
> +                                   mr_leaf_free, true, imr);
>       up_read(&ctx->umem_rwsem);
>  
>       wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>       schedule_work(&scif_info.misc_work);
>  }
>  
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                                    struct mm_struct *mm,
>                                                    unsigned long start,
> -                                                  unsigned long end)
> +                                                  unsigned long end,
> +                                                  bool blockable)
>  {
>       struct scif_mmu_notif   *mmn;
>  
>       mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>       scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +     return 0;
>  }
>  
>  static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>  /*
>   * MMUOPS notifier callout functions
>   */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>                                      struct mm_struct *mm,
> -                                    unsigned long start, unsigned long end)
> +                                    unsigned long start, unsigned long end,
> +                                    bool blockable)
>  {
>       struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>                                                ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>       gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>               start, end, atomic_read(&gms->ms_range_active));
>       gru_flush_tlb_range(gms, start, end - start);
> +
> +     return 0;
>  }
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>  
>  /* ------------------------------------------------------------------ */
>  
> +static bool in_range(struct grant_map *map,
> +                           unsigned long start, unsigned long end)
> +{
> +     if (!map->vma)
> +             return false;
> +     if (map->vma->vm_start >= end)
> +             return false;
> +     if (map->vma->vm_end <= start)
> +             return false;
> +
> +     return true;
> +}
> +
>  static void unmap_if_in_range(struct grant_map *map,
>                             unsigned long start, unsigned long end)
>  {
>       unsigned long mstart, mend;
>       int err;
>  
> -     if (!map->vma)
> -             return;
> -     if (map->vma->vm_start >= end)
> -             return;
> -     if (map->vma->vm_end <= start)
> -             return;
>       mstart = max(start, map->vma->vm_start);
>       mend   = min(end,   map->vma->vm_end);
>       pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>       WARN_ON(err);
>  }
>  
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>                               struct mm_struct *mm,
> -                             unsigned long start, unsigned long end)
> +                             unsigned long start, unsigned long end,
> +                             bool blockable)
>  {
>       struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>       struct grant_map *map;
> +     int ret = 0;
> +
> +     /* TODO do we really need a mutex here? */
> +     if (blockable)
> +             mutex_lock(&priv->lock);
> +     else if (!mutex_trylock(&priv->lock))
> +             return -EAGAIN;
>  
> -     mutex_lock(&priv->lock);
>       list_for_each_entry(map, &priv->maps, next) {
> +             if (in_range(map, start, end)) {
> +                     ret = -EAGAIN;
> +                     goto out_unlock;
> +             }
>               unmap_if_in_range(map, start, end);
>       }
>       list_for_each_entry(map, &priv->freeable_maps, next) {
> +             if (in_range(map, start, end)) {
> +                     ret = -EAGAIN;
> +                     goto out_unlock;
> +             }
>               unmap_if_in_range(map, start, end);
>       }
> +
> +out_unlock:
>       mutex_unlock(&priv->lock);
> +
> +     return ret;
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>  }
>  #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -             unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +             unsigned long start, unsigned long end, bool blockable);
>  
>  #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>  int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>        * address space but may still be referenced by sptes until
>        * the last refcount is dropped.
>        *
> -      * If both of these callbacks cannot block, and invalidate_range
> -      * cannot block, mmu_notifier_ops.flags should have
> -      * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +      * If blockable argument is set to false then the callback cannot
> +      * sleep and has to return with -EAGAIN. 0 should be returned
> +      * otherwise.
> +      *
>        */
> -     void (*invalidate_range_start)(struct mmu_notifier *mn,
> +     int (*invalidate_range_start)(struct mmu_notifier *mn,
>                                      struct mm_struct *mm,
> -                                    unsigned long start, unsigned long end);
> +                                    unsigned long start, unsigned long end,
> +                                    bool blockable);
>       void (*invalidate_range_end)(struct mmu_notifier *mn,
>                                    struct mm_struct *mm,
>                                    unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>                                    unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>                                     unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -                               unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +                               unsigned long start, unsigned long end,
> +                               bool blockable);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>                                 unsigned long start, unsigned long end,
>                                 bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>                                 unsigned long start, unsigned long end)
>  {
>       if (mm_has_notifiers(mm))
> -             __mmu_notifier_invalidate_range_start(mm, start, end);
> +             __mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +                               unsigned long start, unsigned long end)
> +{
> +     int ret = 0;
> +     if (mm_has_notifiers(mm))
> +             ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +     return ret;
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  {
>  }
>  
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +                               unsigned long start, unsigned long end)
> +{
> +     return 0;
> +}
> +
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>                                 unsigned long start, unsigned long end)
>  {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>       return 0;
>  }
>  
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>  
>  extern unsigned long oom_badness(struct task_struct *p,
>               struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>   */
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>                                 u64 start, u64 end,
> -                               umem_call_back cb, void *cookie);
> +                               umem_call_back cb,
> +                               bool blockable, void *cookie);
>  
>  /*
>   * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>       up_write(&hmm->mirrors_sem);
>  }
>  
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>                                      struct mm_struct *mm,
>                                      unsigned long start,
> -                                    unsigned long end)
> +                                    unsigned long end,
> +                                    bool blockable)
>  {
>       struct hmm *hmm = mm->hmm;
>  
>       VM_BUG_ON(!hmm);
>  
>       atomic_inc(&hmm->sequence);
> +
> +     return 0;
>  }
>  
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>                * reliably test it.
>                */
>               mutex_lock(&oom_lock);
> -             __oom_reap_task_mm(mm);
> +             (void)__oom_reap_task_mm(mm);
>               mutex_unlock(&oom_lock);
>  
>               set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>       srcu_read_unlock(&srcu, id);
>  }
>  
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -                               unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +                               unsigned long start, unsigned long end,
> +                               bool blockable)
>  {
>       struct mmu_notifier *mn;
> +     int ret = 0;
>       int id;
>  
>       id = srcu_read_lock(&srcu);
>       hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -             if (mn->ops->invalidate_range_start)
> -                     mn->ops->invalidate_range_start(mn, mm, start, end);
> +             if (mn->ops->invalidate_range_start) {
> +                     int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +                     if (_ret) {
> +                             pr_info("%pS callback failed with %d in %sblockable context.\n",
> +                                             mn->ops->invalidate_range_start, _ret,
> +                                             !blockable ? "non-": "");
> +                             ret = _ret;
> +                     }
> +             }
>       }
>       srcu_read_unlock(&srcu, id);
> +
> +     return ret;
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>       struct vm_area_struct *vma;
> +     bool ret = true;
>  
>       /*
>        * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>                       struct mmu_gather tlb;
>  
>                       tlb_gather_mmu(&tlb, mm, start, end);
> -                     mmu_notifier_invalidate_range_start(mm, start, end);
> +                     if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +                             ret = false;
> +                             continue;
> +                     }
>                       unmap_page_range(&tlb, vma, start, end, NULL);
>                       mmu_notifier_invalidate_range_end(mm, start, end);
>                       tlb_finish_mmu(&tlb, start, end);
>               }
>       }
> +
> +     return ret;
>  }
>  
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>               goto unlock_oom;
>       }
>  
> -     /*
> -      * If the mm has invalidate_{start,end}() notifiers that could block,
> -      * sleep to give the oom victim some more time.
> -      * TODO: we really want to get rid of this ugly hack and make sure that
> -      * notifiers cannot block for unbounded amount of time
> -      */
> -     if (mm_has_blockable_invalidate_notifiers(mm)) {
> -             up_read(&mm->mmap_sem);
> -             schedule_timeout_idle(HZ);
> -             goto unlock_oom;
> -     }
> -
>       /*
>        * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>        * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>       trace_start_task_reaping(tsk->pid);
>  
> -     __oom_reap_task_mm(mm);
> +     /* failed to reap part of the address space. Try again later */
> +     if (!__oom_reap_task_mm(mm)) {
> +             up_read(&mm->mmap_sem);
> +             ret = false;
> +             goto unlock_oom;
> +     }
>  
>       pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>                       task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>  
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -             unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +             unsigned long start, unsigned long end, bool blockable)
>  {
> +     return 0;
>  }
>  
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>       srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                                   struct mm_struct *mm,
>                                                   unsigned long start,
> -                                                 unsigned long end)
> +                                                 unsigned long end,
> +                                                 bool blockable)
>  {
>       struct kvm *kvm = mmu_notifier_to_kvm(mn);
>       int need_tlb_flush = 0, idx;
> +     int ret;
>  
>       idx = srcu_read_lock(&kvm->srcu);
>       spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  
>       spin_unlock(&kvm->mmu_lock);
>  
> -     kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +     ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>  
>       srcu_read_unlock(&kvm->srcu, idx);
> +
> +     return ret;
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

