
Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror




On 2019-11-01 11:12 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote:
>>
>>
>> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
>>> On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
>>>> Hi Jason,
>>>>
>>>> I did quick test after merging amd-staging-drm-next with the
>>>> mmu_notifier branch, which includes this set changes. The test result
>>>> has different failures, app stuck intermittently, GUI no display etc. I
>>>> am understanding the changes and will try to figure out the cause.
>>>
>>> Thanks! I'm not surprised by this given how difficult this patch was
>>> to make. Let me know if I can assist in any way
>>>
>>> Please ensure you run with lockdep enabled... Your symptoms sound sort
>>> of like deadlocking?
>>>
>> Hi Jason,
>>
>> The attached patch fixes several issues in the amdgpu driver; maybe you can
>> squash this into patch 14. With this done, patches 12, 13 and 14 are
>> Reviewed-by and Tested-by: Philip Yang <philip.yang@xxxxxxx>
> 
> Wow, this is great thanks! Can you clarify what the problems you found
> were? Was the bug the 'return !r' below?
> 
Yes. The 'return !r' is the critical one, and the retry when hmm_range_fault
returns -EBUSY is needed too.

> I'll also add your signed off by
> 
> Here are some remarks:
> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> index cb718a064eb4..c8bbd06f1009 100644
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> @@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct 
>> mmu_range_notifier *mrn,
>>      struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>      long r;
>>   
>> -    /*
>> -     * FIXME: Must hold some lock shared with
>> -     * amdgpu_ttm_tt_get_user_pages_done()
>> -     */
>> -    mmu_range_set_seq(mrn, cur_seq);
>> +    mutex_lock(&adev->notifier_lock);
>>   
>> -    /* FIXME: Is this necessary? */
>> -    if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
>> -                                      range->end))
>> -            return true;
>> +    mmu_range_set_seq(mrn, cur_seq);
>>   
>> -    if (!mmu_notifier_range_blockable(range))
>> +    if (!mmu_notifier_range_blockable(range)) {
>> +            mutex_unlock(&adev->notifier_lock);
>>              return false;
> 
> This test for range_blockable should be before mutex_lock, I can move
> it up
> 
Yes, thanks.
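
For reference, a rough sketch of the reordered callback (based on the hunk
above; the exact signature and the eviction logic in the middle are my
assumptions, not copied from the patch):

    static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
                                         const struct mmu_notifier_range *range,
                                         unsigned long cur_seq)
    {
            struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
            struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);

            /* A non-blockable invalidation must not sleep on the mutex,
             * so check it before taking notifier_lock. */
            if (!mmu_notifier_range_blockable(range))
                    return false;

            mutex_lock(&adev->notifier_lock);
            mmu_range_set_seq(mrn, cur_seq);
            /* ... wait for / evict the BO as the existing callback does ... */
            mutex_unlock(&adev->notifier_lock);
            return true;
    }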
> Also, do you know if notifier_lock is held while calling
> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> to amdgpu_ttm_tt_get_user_pages_done()?
> 
The gpu side holds notifier_lock, but the kfd side doesn't. The kfd side
doesn't check the amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry
return value; it checks the mem->invalid flag instead, which is updated from
the invalidate callback. That takes more time to change, so I will follow up
with another patch to fix it later.
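
Once the kfd path also takes the lock, the assertion Jason suggests could be
something like this at the top of amdgpu_ttm_tt_get_user_pages_done() (just a
sketch; adev would have to be looked up from the ttm object, and it cannot be
enabled until kfd holds the lock too):

    /* The seq comparison below must not race with the invalidate
     * callback, so the caller has to hold notifier_lock. */
    lockdep_assert_held(&adev->notifier_lock);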

>> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
>> struct page **pages)
>>              r = -EPERM;
>>              goto out_unlock;
>>      }
>> +    up_read(&mm->mmap_sem);
>> +    timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>> +
>> +retry:
>> +    range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>>   
>> +    down_read(&mm->mmap_sem);
>>      r = hmm_range_fault(range, 0);
>>      up_read(&mm->mmap_sem);
>> -
>> -    if (unlikely(r < 0))
>> +    if (unlikely(r <= 0)) {
>> +            if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>> +                    goto retry;
>>              goto out_free_pfns;
>> +    }
> 
> This isn't really right: a retry loop like this needs to go all the
> way to mmu_range_read_retry() and be done under the notifier_lock, i.e.
> mmu_range_read_retry() can fail just as easily as hmm_range_fault()
> can, and drivers are supposed to retry in both cases, with a single
> timeout.
> 
For gpu, the check of the mmu_range_read_retry return value under the
notifier_lock to decide whether to retry is in a separate location, not in
the same retry loop.
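
For reference, the single-timeout retry loop Jason describes would look
roughly like this if the fault and the seq check lived in one function (a
sketch only; in amdgpu the seq check actually sits in
amdgpu_ttm_tt_get_user_pages_done(), so the two halves are in separate
places):

    timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);

retry:
    range->notifier_seq = mmu_range_read_begin(&bo->notifier);

    down_read(&mm->mmap_sem);
    r = hmm_range_fault(range, 0);
    up_read(&mm->mmap_sem);
    if (r <= 0) {
            if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
                    goto retry;
            goto out_free_pfns;
    }

    mutex_lock(&adev->notifier_lock);
    if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
            /* Invalidated while faulting: retry under the same timeout
             * instead of reporting failure to the caller. */
            mutex_unlock(&adev->notifier_lock);
            if (!time_after(jiffies, timeout))
                    goto retry;
            r = -EBUSY;
            goto out_free_pfns;
    }
    /* ... use the pages while still holding notifier_lock ... */
    mutex_unlock(&adev->notifier_lock);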

> AFAICT it is a major bug that many places ignore the return code of
> amdgpu_ttm_tt_get_user_pages_done() ???
>
For kfd, this is explained above.

> However, these are all pre-existing bugs, so I'm OK with going ahead with
> this patch as modified. I advise AMD to make a followup patch ..
> 
Yes, I will.
> I'll add a FIXME note to this effect.
> 
>>      for (i = 0; i < ttm->num_pages; i++) {
>>              pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>> @@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt 
>> *ttm)
>>              gtt->range = NULL;
>>      }
>>   
>> -    return r;
>> +    return !r;
> 
> Ah, is this the major error? hmm_range_valid() is inverted vs
> mmu_range_read_retry()?
> 
Yes.
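
To spell out the inversion: hmm_range_valid() returned true when the snapshot
was still good, while mmu_range_read_retry() returns true when the range was
invalidated and the caller must fault again, so the value has to be negated to
keep the function's meaning. Roughly (a sketch, using the names from the diff
hunks above):

    /* Old API: true == snapshot still valid, safe to use the pages. */
    r = hmm_range_valid(range);
    return r;

    /* New API: true == invalidated, caller must retry, hence the '!'. */
    r = mmu_range_read_retry(&bo->notifier, range->notifier_seq);
    return !r;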
>>   }
>>   #endif
>>   
>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt 
>> *ttm)
>>      sg_free_table(ttm->sg);
>>   
>>   #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>> -    if (gtt->range &&
>> -        ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>> -                                                  gtt->range->pfns[0]))
>> -            WARN_ONCE(1, "Missing get_user_page_done\n");
>> +    if (gtt->range) {
>> +            unsigned long i;
>> +
>> +            for (i = 0; i < ttm->num_pages; i++) {
>> +                    if (ttm->pages[i] !=
>> +                            hmm_device_entry_to_page(gtt->range,
>> +                                          gtt->range->pfns[i]))
>> +                            break;
>> +            }
>> +
>> +            WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>> +    }
> 
> Is this related/necessary? I can put it in another patch if it is just a
> debugging improvement. Please advise.
> 
I see this WARN backtrace now, but I didn't see it before, so it is somehow
related.

> Thanks a lot,
> Jason
> 