[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror



On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
> > This test for range_blockable should be before mutex_lock, I can move
> > it up
> > 
> yes, thanks.

Okay, I wrote it like this:

        if (mmu_notifier_range_blockable(range))
                mutex_lock(&adev->notifier_lock);
        else if (!mutex_trylock(&adev->notifier_lock))
                return false;

> > Also, do you know if notifier_lock is held while calling
> > amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> > to amdgpu_ttm_tt_get_user_pages_done()?
> 
> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check 
> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but 
> check mem->invalid flag which is updated from invalidate callback. It 
> takes more time to change, I will come to another patch to fix it later.

Ah.. confusing, OK, I'll let you sort that

> > However, this is all pre-existing bugs, so I'm OK go ahead with this
> > patch as modified. I advise AMD to make a followup patch ..
> > 
> yes, I will.

While you are here, this is also wrong:

int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
{
        down_read(&mm->mmap_sem);
        r = hmm_range_fault(range, 0);
        up_read(&mm->mmap_sem);
        if (unlikely(r <= 0)) {
                if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
                        goto retry;
                goto out_free_pfns;
        }

        for (i = 0; i < ttm->num_pages; i++) {
                pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);

It is not allowed to read the results of hmm_range_fault() outside
locking, and in particular, we can't convert to a struct page.

This must be done inside the notifier_lock, after checking
mmu_range_read_retry(), all handling of the struct page must be
structured like that.

> >> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct 
> >> ttm_tt *ttm)
> >>    sg_free_table(ttm->sg);
> >>   
> >>   #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> >> -  if (gtt->range &&
> >> -      ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
> >> -                                                gtt->range->pfns[0]))
> >> -          WARN_ONCE(1, "Missing get_user_page_done\n");
> >> +  if (gtt->range) {
> >> +          unsigned long i;
> >> +
> >> +          for (i = 0; i < ttm->num_pages; i++) {
> >> +                  if (ttm->pages[i] !=
> >> +                          hmm_device_entry_to_page(gtt->range,
> >> +                                        gtt->range->pfns[i]))
> >> +                          break;
> >> +          }
> >> +
> >> +          WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
> >> +  }
> > 
> > Is this related/necessary? I can put it in another patch if it is just
> > debugging improvement? Please advise
> > 
> I see this WARN backtrace now, but I didn't see it before. This is 
> somehow related.

Hm, might be instructive to learn what is going on..

Thanks,
Jason

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.