[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror




On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
>>> This test for range_blockable should be before mutex_lock, I can move
>>> it up
>>>
>> yes, thanks.
> 
> Okay, I wrote it like this:
> 
>       if (mmu_notifier_range_blockable(range))
>               mutex_lock(&adev->notifier_lock);
>       else if (!mutex_trylock(&adev->notifier_lock))
>               return false;
> 
>>> Also, do you know if notifier_lock is held while calling
>>> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
>>> to amdgpu_ttm_tt_get_user_pages_done()?
>>
>> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check
>> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but
>> check mem->invalid flag which is updated from invalidate callback. It
>> takes more time to change, I will come to another patch to fix it later.
> 
> Ah.. confusing, OK, I'll let you sort that
> 
>>> However, this is all pre-existing bugs, so I'm OK go ahead with this
>>> patch as modified. I advise AMD to make a followup patch ..
>>>
>> yes, I will.
> 
> While you are here, this is also wrong:
> 
> int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
> {
>       down_read(&mm->mmap_sem);
>       r = hmm_range_fault(range, 0);
>       up_read(&mm->mmap_sem);
>       if (unlikely(r <= 0)) {
>               if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>                       goto retry;
>               goto out_free_pfns;
>       }
> 
>       for (i = 0; i < ttm->num_pages; i++) {
>               pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
> 
> It is not allowed to read the results of hmm_range_fault() outside
> locking, and in particular, we can't convert to a struct page.
> 
> This must be done inside the notifier_lock, after checking
> mmu_range_read_retry(), all handling of the struct page must be
> structured like that.
> 
Below change will fix this, then driver will call mmu_range_read_retry 
second time using same range->notifier_seq to check if range is 
invalidated inside amdgpu_cs_submit, this looks ok for me.

@@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                 goto out_free_pfns;
         }

+       mutex_lock(&adev->notifier_lock);
+
+       if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
+               mutex_unlock(&adev->notifier_lock);
+               goto retry;
+       }
+
         for (i = 0; i < ttm->num_pages; i++) {
                 pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
                 if (unlikely(!pages[i])) {
@@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                                i, range->pfns[i]);
                         r = -ENOMEM;

+                       mutex_unlock(&adev->notifier_lock);
                         goto out_free_pfns;
                 }
         }

+       mutex_unlock(&adev->notifier_lock);
         gtt->range = range;
         mmput(mm);

Philip

>>>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct 
>>>> ttm_tt *ttm)
>>>>            sg_free_table(ttm->sg);
>>>>    
>>>>    #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>>>> -  if (gtt->range &&
>>>> -      ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>>>> -                                                gtt->range->pfns[0]))
>>>> -          WARN_ONCE(1, "Missing get_user_page_done\n");
>>>> +  if (gtt->range) {
>>>> +          unsigned long i;
>>>> +
>>>> +          for (i = 0; i < ttm->num_pages; i++) {
>>>> +                  if (ttm->pages[i] !=
>>>> +                          hmm_device_entry_to_page(gtt->range,
>>>> +                                        gtt->range->pfns[i]))
>>>> +                          break;
>>>> +          }
>>>> +
>>>> +          WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>>>> +  }
>>>
>>> Is this related/necessary? I can put it in another patch if it is just
>>> debugging improvement? Please advise
>>>
>> I see this WARN backtrace now, but I didn't see it before. This is
>> somehow related.
> 
> Hm, might be instructive to learn what is going on..
> 
> Thanks,
> Jason
> 
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.