Xen project Mailing List

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

From: "Yang, Philip" <Philip.Yang@xxxxxxx>

Date: Fri, 1 Nov 2019 19:45:22 +0000

Accept-language: en-ZA, en-US

Delivery-date: Fri, 01 Nov 2019 19:45:35 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Thread-topic: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
>>> This test for range_blockable should be before mutex_lock, I can move
>>> it up
>>>
>> yes, thanks.
>
> Okay, I wrote it like this:
>
>     if (mmu_notifier_range_blockable(range))
>         mutex_lock(&adev->notifier_lock);
>     else if (!mutex_trylock(&adev->notifier_lock))
>         return false;
>
>>> Also, do you know if notifier_lock is held while calling
>>> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
>>> to amdgpu_ttm_tt_get_user_pages_done()?
>>
>> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check
>> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but
>> check mem->invalid flag which is updated from invalidate callback. It
>> takes more time to change, I will come to another patch to fix it later.
>
> Ah.. confusing, OK, I'll let you sort that
>
>>> However, this is all pre-existing bugs, so I'm OK go ahead with this
>>> patch as modified. I advise AMD to make a followup patch ..
>>>
>> yes, I will.
>
> While you are here, this is also wrong:
>
> int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
> {
>     down_read(&mm->mmap_sem);
>     r = hmm_range_fault(range, 0);
>     up_read(&mm->mmap_sem);
>     if (unlikely(r <= 0)) {
>         if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>             goto retry;
>         goto out_free_pfns;
>     }
>
>     for (i = 0; i < ttm->num_pages; i++) {
>         pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>
> It is not allowed to read the results of hmm_range_fault() outside
> locking, and in particular, we can't convert to a struct page.
>
> This must be done inside the notifier_lock, after checking
> mmu_range_read_retry(), all handling of the struct page must be
> structured like that.

The change below will fix this. The driver will then call
mmu_range_read_retry() a second time inside amdgpu_cs_submit, using the
same range->notifier_seq, to check whether the range has been
invalidated. This looks OK to me.

@@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
                 goto out_free_pfns;
         }

+        mutex_lock(&adev->notifier_lock);
+
+        if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
+                mutex_unlock(&adev->notifier_lock);
+                goto retry;
+        }
+
         for (i = 0; i < ttm->num_pages; i++) {
                 pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
                 if (unlikely(!pages[i])) {
@@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
                                i, range->pfns[i]);
                         r = -ENOMEM;

+                        mutex_unlock(&adev->notifier_lock);
                         goto out_free_pfns;
                 }
         }

+        mutex_unlock(&adev->notifier_lock);
         gtt->range = range;
         mmput(mm);

Philip

>>>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct
>>>> ttm_tt *ttm)
>>>>          sg_free_table(ttm->sg);
>>>>
>>>>  #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>>>> -        if (gtt->range &&
>>>> -            ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>>>> -                                                      gtt->range->pfns[0]))
>>>> -                WARN_ONCE(1, "Missing get_user_page_done\n");
>>>> +        if (gtt->range) {
>>>> +                unsigned long i;
>>>> +
>>>> +                for (i = 0; i < ttm->num_pages; i++) {
>>>> +                        if (ttm->pages[i] !=
>>>> +                            hmm_device_entry_to_page(gtt->range,
>>>> +                                                     gtt->range->pfns[i]))
>>>> +                                break;
>>>> +                }
>>>> +
>>>> +                WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>>>> +        }
>>>
>>> Is this related/necessary? I can put it in another patch if it is just
>>> debugging improvement? Please advise
>>>
>> I see this WARN backtrace now, but I didn't see it before. This is
>> somehow related.
>
> Hm, might be instructive to learn what is going on..
>
> Thanks,
> Jason
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel