
Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 4 May 2022 14:27:14 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Wed, 04 May 2022 12:27:29 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 04.05.2022 13:20, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
>> On 03.05.2022 16:49, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>>>> For large page mappings to be easily usable (i.e. in particular without
>>>> un-shattering of smaller page mappings) and for mapping operations to
>>>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>>>> In dom0_construct_pv() and its helpers (covering strict mode) this
>>>> additionally requires establishing the type of those pages (albeit with
>>>> zero type references).
>>>
>>> I think it's possible I've already asked this.  Would it make sense to
>>> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
>>> specific flag?
>>
>> I don't think you did ask, but now that you do: This would look like a
>> layering violation to me. I don't think allocation should ever have
>> mapping (into the IOMMU or elsewhere) as a "side effect", no matter
>> that ...
> 
> Hm, I'm certainly not familiar enough with PV itself to be able to
> make a proper argument here.  I fully agree with you for translated
> guests using a p2m.
> guests using a p2m.
> 
> For PV we currently establish/teardown IOMMU mappings in
> _get_page_type(), which already looks like a layering violation to me,
> hence also doing so in alloc_domheap_pages() wouldn't seem that bad if
> it allows simplifying the resulting code overall.

That's a layering violation too, I agree, but at least it's one central
place.
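
For reference, the maintenance in question looks roughly like this
(paraphrased from memory rather than quoted from the tree, so exact
guards and error handling may differ): IOMMU mappings simply follow
the writable type.

  /*
   * Sketch of the PV IOMMU sync on type changes in _get_page_type().
   * "x" holds the old type_info, "type" the type being acquired.
   */
  if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
  {
      mfn_t mfn = page_to_mfn(page);

      if ( (x & PGT_type_mask) == PGT_writable_page )
          /* Losing the writable type: tear down the 4k mapping. */
          rc = iommu_unmap(d, _dfn(mfn_x(mfn)), 1ul << PAGE_ORDER_4K,
                           &flush_flags);
      else if ( type == PGT_writable_page )
          /* Gaining the writable type: establish the 4k mapping. */
          rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, 1ul << PAGE_ORDER_4K,
                         IOMMUF_readable | IOMMUF_writable, &flush_flags);
  }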

>>> It would seem to me that doing it that way would also allow the
>>> mappings to get established in blocks for domUs.
>>
>> ... then this would perhaps be possible.
>>
>>>> Installing zero-ref writable types has in fact shown (observed
>>>> while putting together the change) that, despite the intention of
>>>> the XSA-288 changes (affecting DomU-s only), for Dom0 a number of
>>>> sufficiently ordinary pages (at the very least initrd and P2M ones,
>>>> as well as pages that are part of the initial allocation but not
>>>> part of the initial mapping) still start out as PGT_none, meaning
>>>> they would gain IOMMU mappings only the first time they get mapped
>>>> writably. Consequently an open question is whether
>>>> iommu_memory_setup() should set the pages to PGT_writable_page
>>>> independent of need_iommu_pt_sync().
>>>
>>> I think I'm confused, doesn't the setting of PGT_writable_page happen
>>> as a result of need_iommu_pt_sync() and having those pages added to
>>> the IOMMU page tables? (so they can be properly tracked and IOMMU
>>> mappings are removed if the page is also removed)
>>
>> In principle yes - in guest_physmap_add_page(). But this function isn't
>> called for the pages I enumerated in the remark. XSA-288 really only
>> cared about getting this right for DomU-s.
> 
> Would it make sense to change guest_physmap_add_page() to be able to
> pass the page_order parameter down to iommu_map(), and then use it for
> dom0 build instead of introducing iommu_memory_setup()?

To be quite frank: This is something I might have been willing to do
months ago, when this series was still fresh. If I were to start
re-doing all of this code now, it would take far more time than it
would have taken back then. Hence I'd like to avoid a full re-work here
unless what is currently done is entirely unacceptable (and it largely
fits with how we have been doing Dom0 setup).

Furthermore, guest_physmap_add_page() doesn't itself call iommu_map().
What you're suggesting would require get_page_and_type() to be able to
work on higher-order pages. I view adjustments like this as well out
of scope for this series.
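
To illustrate what the series does instead (a sketch only, not the
exact patch code; the printk wording and error handling here are made
up): the helper maps a contiguous range in one go and then installs the
zero-ref writable type, at which point _get_page_type() takes over
maintenance from there.

  static void __init iommu_memory_setup(struct domain *d, const char *what,
                                        struct page_info *page,
                                        unsigned long nr,
                                        unsigned int *flush_flags)
  {
      int rc;
      mfn_t mfn = page_to_mfn(page);

      if ( !need_iommu_pt_sync(d) )
          return;

      /* One call covering the whole range, permitting large pages. */
      rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
                     IOMMUF_readable | IOMMUF_writable, flush_flags);
      if ( rc )
      {
          printk(XENLOG_ERR "%pd: IOMMU mapping of %s failed: %d\n",
                 d, what, rc);
          return;
      }

      /* Writable type with zero type references. */
      for ( ; nr--; ++page )
          page->u.inuse.type_info = PGT_writable_page | PGT_validated;
  }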

> I think guest_physmap_add_page() will need to be adjusted at some
> point for domUs, and hence it could be unified with dom0 usage
> also?

As an optimization - perhaps. I view it as more important to have HVM
guests work reasonably well (which includes the performance aspect of
setting them up).

>>> If the pages are not added here (because dom0 is not running in strict
>>> mode) then setting PGT_writable_page is not required?
>>
>> Correct - in that case we skip fiddling with IOMMU mappings on
>> transitions to/from PGT_writable_page, and hence putting this type in
>> place would be benign (but improve consistency).
>>
>>>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>>>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>>>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>>>> occurs only afterwards anyway, so nothing can have made it into TLBs
>>>> just yet.
>>>
>>> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
>>> away, as we must strictly do the hwdom init before enabling the iommu
>>> itself.
>>
>> Why would that be? That's imo as much of an implementation detail as
>> ...
> 
> Well, you want to have the reserved/inclusive options applied (and
> mappings created) before enabling the IOMMU, because at that point
> devices have already been assigned.  So it depends on the combination
> of devices assigned & IOMMU enabled, rather than on the IOMMU being
> enabled alone.
> 
>>> I'm less convinced about the one in the dom0 build; keeping it would
>>> be on the safe side if we ever change the order of IOMMU init and
>>> memory setup.
>>
>> ... this. Just like we populate tables with the IOMMU already enabled
>> for DomU-s, I think the same would be valid to do for Dom0.
>>
>>> I would expect flushing an empty TLB to not be very expensive?
>>
>> I wouldn't "expect" this - it might be this way, but it surely depends
>> on whether an implementation can easily tell whether the TLB is empty,
>> and whether its emptiness actually makes a difference for the latency
>> of a flush operation.
>>
>>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>>>  
>>>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>>>  {
>>>> -    unsigned long i, top, max_pfn;
>>>> -    unsigned int flush_flags = 0;
>>>> +    unsigned long i, top, max_pfn, start, count;
>>>> +    unsigned int flush_flags = 0, start_perms = 0;
>>>>  
>>>>      BUG_ON(!is_hardware_domain(d));
>>>>  
>>>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>>>       * setting up potentially conflicting mappings here.
>>>>       */
>>>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>>  
>>>> -    for ( ; i < top; i++ )
>>>> +    for ( i = start, count = 0; i < top; )
>>>>      {
>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>          if ( !perms )
>>>>              rc = 0;
>>>>          else if ( paging_mode_translate(d) )
>>>> +        {
>>>>              rc = p2m_add_identity_entry(d, pfn,
>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>                                                                  : p2m_access_r,
>>>>                                          0);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: identity mapping of %lx failed: %d\n",
>>>> +                       d, pfn, rc);
>>>> +        }
>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>> +        {
>>>> +        commit:
>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>>>> +                           &flush_flags);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>>>> +                       d, pfn, pfn + count, rc);
>>>> +            SWAP(start, pfn);
>>>> +            start_perms = perms;
>>>> +            count = 1;
>>>> +        }
>>>>          else
>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           perms, &flush_flags);
>>>> +        {
>>>> +            ++count;
>>>> +            rc = 0;
>>>
>>> Seeing as we want to process this in blocks now, I wonder whether it
>>> would make sense to take a different approach, and use a rangeset to
>>> track which regions need to be mapped.  What gets added would be based
>>> on the host e820 plus the options
>>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
>>> based on the logic in hwdom_iommu_map() and finally we could iterate
>>> over the regions afterwards using rangeset_consume_ranges().
>>>
>>> Not that you strictly need to do it here; I just think the end result
>>> would be clearer.
>>
>> The end result might indeed be, but it would be more of a change right
>> here. Hence I'd prefer to leave that out of the series for now.
> 
> OK.  I think it might be nice to add a comment in that regard, mostly
> because I tend to forget myself.

Sure, I've added another post-commit-message remark.
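
For the record, I understand the suggestion to amount to roughly the
following (a hypothetical sketch: map_ctxt and map_range() are invented
names, rangeset_report_ranges() is used merely as one way of iterating,
and per-range permissions handling is omitted):

  struct map_ctxt {
      struct domain *d;
      unsigned int flush_flags;
  };

  static int __hwdom_init map_range(unsigned long s, unsigned long e,
                                    void *arg)
  {
      struct map_ctxt *ctxt = arg;

      /* Rangeset ranges are inclusive, hence the + 1. */
      return iommu_map(ctxt->d, _dfn(s), _mfn(s), e - s + 1,
                       IOMMUF_readable | IOMMUF_writable,
                       &ctxt->flush_flags);
  }

  /*
   * After populating a rangeset from the host E820 (plus the
   * inclusive/reserved options) and punching holes according to
   * hwdom_iommu_map():
   */
  rc = rangeset_report_ranges(map, 0, ~0UL, map_range, &ctxt);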

Jan