
Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 4 May 2022 11:46:37 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Wed, 04 May 2022 09:46:52 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 03.05.2022 16:49, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>> For large page mappings to be easily usable (i.e. in particular without
>> un-shattering of smaller page mappings) and for mapping operations to
>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>> In dom0_construct_pv() and its helpers (covering strict mode) this
>> additionally requires establishing the type of those pages (albeit with
>> zero type references).
> 
> I think it's possible I've already asked this.  Would it make sense to
> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> specific flag?

I don't think you did ask, but now that you do: This would look like a
layering violation to me. I don't think allocation should ever have
mapping (into the IOMMU or elsewhere) as a "side effect", no matter
that ...

> It would seem to me that doing it that way would also allow the
> mappings to get established in blocks for domUs.

... then this would perhaps be possible.

>> The installing of zero-ref writable types has in fact shown (observed
>> while putting together the change) that despite the intention by the
>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
>> sufficiently ordinary pages (at the very least initrd and P2M ones as
>> well as pages that are part of the initial allocation but not part of
>> the initial mapping) still have been starting out as PGT_none, meaning
>> that they would have gained IOMMU mappings only the first time these
>> pages would get mapped writably. Consequently an open question is
>> whether iommu_memory_setup() should set the pages to PGT_writable_page
>> independent of need_iommu_pt_sync().
> 
> I think I'm confused, doesn't the setting of PGT_writable_page happen
> as a result of need_iommu_pt_sync() and having those pages added to
> the IOMMU page tables? (so they can be properly tracked and IOMMU
> mappings are removed if the page is also removed)

In principle yes - in guest_physmap_add_page(). But this function isn't
called for the pages I did enumerate in the remark. XSA-288 really only
cared about getting this right for DomU-s.

> If the pages are not added here (because dom0 is not running in strict
> mode) then setting PGT_writable_page is not required?

Correct - in that case we skip fiddling with IOMMU mappings on
transitions to/from PGT_writable_page, and hence putting this type in
place would be benign (but improve consistency).

>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>> occurs only afterwards anyway, so nothing can have made it into TLBs
>> just yet.
> 
> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> away, as we must strictly do the hwdom init before enabling the iommu
> itself.

Why would that be? That's imo as much of an implementation detail as
...

> The one in dom0 build I'm less convinced, just to be on the safe side
> if we ever change the order of IOMMU init and memory setup.

... this. Just like we populate tables with the IOMMU already enabled
for DomU-s, I think the same would be valid to do for Dom0.

> I would expect flushing an empty TLB to not be very expensive?

I wouldn't "expect" this - it might be this way, but it surely depends
on whether an implementation can easily tell whether the TLB is empty,
and whether its emptiness actually makes a difference for the latency
of a flush operation.

>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>  
>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>  {
>> -    unsigned long i, top, max_pfn;
>> -    unsigned int flush_flags = 0;
>> +    unsigned long i, top, max_pfn, start, count;
>> +    unsigned int flush_flags = 0, start_perms = 0;
>>  
>>      BUG_ON(!is_hardware_domain(d));
>>  
>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>       * setting up potentially conflicting mappings here.
>>       */
>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>  
>> -    for ( ; i < top; i++ )
>> +    for ( i = start, count = 0; i < top; )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>          if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> +        {
>>              rc = p2m_add_identity_entry(d, pfn,
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: identity mapping of %lx failed: %d\n",
>> +                       d, pfn, rc);
>> +        }
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>> +                           &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>> +                       d, pfn, pfn + count, rc);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
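[Stripped of the Xen specifics, the loop in the hunk above is a run-length
batching pattern: extend the current run while the pfn stays contiguous and
the permissions match, and commit (map) the pending run otherwise. A minimal
self-contained model of that shape — the type and function names here are
invented for illustration and are not Xen's:]

```c
#include <assert.h>
#include <stddef.h>

/* One committed run of contiguous, identically-permissioned pfns. */
struct run {
    unsigned long start, count;
    unsigned int perms;
};

/*
 * Walk perms[pfn] for pfn in [first, top) and commit maximal runs of
 * contiguous pfns sharing the same non-zero permissions.  Returns the
 * number of runs written to "out" (capacity "max").  As in the hunk,
 * perms == 0 ends a pending run without starting a new one.
 */
static size_t batch_runs(const unsigned int *perms, unsigned long first,
                         unsigned long top, struct run *out, size_t max)
{
    size_t n = 0;
    unsigned long start = first, count = 0;
    unsigned int start_perms = 0;

    for ( unsigned long pfn = first; pfn < top; pfn++ )
    {
        if ( !perms[pfn] || pfn != start + count || perms[pfn] != start_perms )
        {
            if ( count && n < max )  /* commit the pending run */
                out[n++] = (struct run){ start, count, start_perms };
            start = pfn;
            start_perms = perms[pfn];
            count = perms[pfn] ? 1 : 0;
        }
        else
            count++;                 /* contiguous, same perms: extend */
    }
    if ( count && n < max )          /* commit the final run */
        out[n++] = (struct run){ start, count, start_perms };

    return n;
}
```

[The real loop additionally has the translated-mode branch and commits via
iommu_map() with the accumulated flush_flags; the model only captures the
coalescing itself.]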
> 
> Seeing as we want to process this in blocks now, I wonder whether it
> would make sense to take a different approach, and use a rangeset to
> track which regions need to be mapped.  What gets added would be based
> on the host e820 plus the options
> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> based on the logic in hwdom_iommu_map() and finally we could iterate
> over the regions afterwards using rangeset_consume_ranges().
> 
> Not that you strictly need to do it here, just think the end result
> would be clearer.

The end result might indeed be, but it would be more of a change right
here. Hence I'd prefer to leave that out of the series for now.
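[For reference, the flow Roger suggests — accumulate candidate ranges, punch
holes per hwdom_iommu_map(), then iterate the survivors — can be sketched with
a toy stand-in for Xen's rangeset. Everything below (struct rset and its
helpers) is invented for illustration; Xen's actual interface lives in
xen/include/xen/rangeset.h:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy inclusive-range set; a fixed-size array stands in for the sorted
 * list a real rangeset keeps. */
struct rset { unsigned long s[16], e[16]; size_t n; };

/* Append [s, e]; assumes ranges arrive in ascending, non-overlapping
 * order, as a walk over the host e820 would produce. */
static void rset_add(struct rset *r, unsigned long s, unsigned long e)
{
    r->s[r->n] = s;
    r->e[r->n] = e;
    r->n++;
}

/* Punch the hole [hs, he] out of every range, splitting where needed. */
static void rset_remove(struct rset *r, unsigned long hs, unsigned long he)
{
    struct rset out = { .n = 0 };

    for ( size_t i = 0; i < r->n; i++ )
    {
        if ( r->e[i] < hs || r->s[i] > he )      /* disjoint: keep whole */
            rset_add(&out, r->s[i], r->e[i]);
        else
        {
            if ( r->s[i] < hs )                  /* left remainder */
                rset_add(&out, r->s[i], hs - 1);
            if ( r->e[i] > he )                  /* right remainder */
                rset_add(&out, he + 1, r->e[i]);
        }
    }
    *r = out;
}

/* Visit each surviving range once (the consume step would iommu_map()
 * each one); here we just total the pages. */
static unsigned long rset_total(const struct rset *r)
{
    unsigned long pages = 0;

    for ( size_t i = 0; i < r->n; i++ )
        pages += r->e[i] - r->s[i] + 1;
    return pages;
}
```

[With this shape the mapping loop degenerates to one iommu_map() call per
surviving range, which is why the end result would likely read more clearly —
at the cost of the larger rework deferred above.]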

Jan
