
Re: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Thu, 5 May 2022 15:19:56 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Thu, 05 May 2022 13:20:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>

Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>

> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables would take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Yet then again
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway, so this would only benefit huge
> systems where 512 1G mappings could be re-coalesced (once suitable code
> is in place) into a single L4 entry. And re-coalescing wouldn't result
> in scheduling-for-freeing of full trees of lower level pagetables.

I would think part of this should go into the commit message, to
explain why enabling 512G superpages is fine.
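
FWIW, replacing a fully populated tree of 4K mappings below a single L4
entry would mean scheduling-for-freeing of up to 1 + 512 + 512*512 =
262657 intermediate page tables (the L3 table plus all L2 and L1 tables
below it), which is a useful number to keep in mind when judging the
512G case.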

> ---
> v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
>     where possible.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
> +{
> +    if ( level > 1 )
> +    {
> +        union amd_iommu_pte *pt = map_domain_page(mfn);
> +        unsigned int i;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
> +            if ( pt[i].pr && pt[i].next_level )
> +            {
> +                ASSERT(pt[i].next_level < level);
> +                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
> +            }
> +
> +        unmap_domain_page(pt);
> +    }
> +
> +    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
> +}
> +
>  int cf_check amd_iommu_map_page(
>      struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
>      unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
>      int rc;
>      unsigned long pt_mfn = 0;
>      union amd_iommu_pte old;
> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>          return rc;
>      }
>  

I think it might be helpful to assert or otherwise check that the
input order is supported by the IOMMU, just to be on the safe side.
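
Something along these lines near the top of the function is what I
have in mind (only a sketch; whether to check against the exact set of
advertised sizes or merely that the order maps to a whole page-table
level is of course up to you):

    ASSERT(!(IOMMUF_order(flags) % PTE_PER_TABLE_SHIFT));
    ASSERT((PAGE_SIZE_4K << IOMMUF_order(flags)) &
           (PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G));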

> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
>           !pt_mfn )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> @@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
>          return -EFAULT;
>      }
>  
> -    /* Install 4k mapping */
> -    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
> +    /* Install mapping */
> +    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
>                                  (flags & IOMMUF_writable),
>                                  (flags & IOMMUF_readable));
>  
> @@ -403,8 +425,13 @@ int cf_check amd_iommu_map_page(
>  
>      *flush_flags |= IOMMU_FLUSHF_added;
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( IOMMUF_order(flags) && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> @@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
>  {
>      unsigned long pt_mfn = 0;
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
>      union amd_iommu_pte old = {};
>  
>      spin_lock(&hd->arch.mapping_lock);
> @@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
>          return 0;
>      }
>  
> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
> @@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
>      if ( pt_mfn )
>      {
>          /* Mark PTE as 'page not present'. */
> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
>      }
>  
>      spin_unlock(&hd->arch.mapping_lock);
>  
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( order && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>  }
>  
>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
> -    .page_sizes = PAGE_SIZE_4K,
> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,

As mentioned in a previous email, I'm worried about the case where we
ever get to replace an entry populated with 4K pages by a single 512G
superpage, as the freeing cost of the associated pagetables would be
quite high.

I guess we will have to implement a more preemptive freeing behavior
if issues arise.
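
To illustrate what I mean (a very rough sketch only: the budget
parameter and defer_free_pt() are made-up names, the latter standing
for stashing the sub-tree on some per-domain list to be drained later,
e.g. from a tasklet; callers would pass a suitable budget):

    static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn,
                              unsigned int level, unsigned int *budget)
    {
        if ( level > 1 )
        {
            union amd_iommu_pte *pt = map_domain_page(mfn);
            unsigned int i;

            for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
                if ( pt[i].pr && pt[i].next_level )
                {
                    if ( !*budget )
                    {
                        /* Over budget: defer the whole sub-tree. */
                        defer_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
                        continue;
                    }
                    queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level,
                                  budget);
                }

            unmap_domain_page(pt);
        }

        if ( *budget )
            --*budget;
        iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
    }

But that's probably not worth the extra complexity unless the 512G
case is actually seen in practice.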

Thanks, Roger.



 

