
Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Oleksandr Andrushchenko <andr2000@xxxxxxxxx>
  • From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Date: Thu, 9 Sep 2021 09:12:21 +0000
  • Accept-language: en-US
  • Cc: "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "roger.pau@xxxxxxxxxx" <roger.pau@xxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 09 Sep 2021 09:12:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR

On 09.09.21 11:24, Jan Beulich wrote:
> On 09.09.2021 07:22, Oleksandr Andrushchenko wrote:
>> On 08.09.21 18:00, Jan Beulich wrote:
>>> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:47, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@xxxxxxxx>
>>>>>>
>>>>>> Instead of handling a single range set that contains all the memory
>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>> Without looking at how you carry out this change - this looks wrong (as
>>>>> in: wasteful) to me. Despite ...
>>>>>
>>>>>> This is in preparation of making non-identity mappings in p2m for the
>>>>>> MMIOs/ROM.
>>>>> ... the need for this, every individual BAR is still contiguous in both
>>>>> host and guest address spaces, so can be represented as a single
>>>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>>>> values). No need to use a rangeset for this.
>>>> First of all this change is in preparation for non-identity mappings,
>>> I'm afraid I continue to not see how this matters in the discussion at
>>> hand. I'm fully aware that this is the goal.
>>>
>>>> e.g. currently we collect all the memory ranges which require mappings
>>>> into a single range set, then we cut off MSI-X regions and then use
>>>> range set functionality to call a callback for every memory range left
>>>> after MSI-X. This works perfectly fine for 1:1 mappings, e.g. what we
>>>> have as the range set's starting address is what we want to be
>>>> mapped/unmapped.
>>>> Why range sets? Because they allow partial mappings, e.g. you can map
>>>> part of the range and come back and continue from where you stopped.
>>>> And if I understand that correctly, that was the initial intention of
>>>> introducing range sets here.
>>>>
>>>> For non-identity mappings this becomes not that easy. Each individual
>>>> BAR may be mapped differently according to what the guest OS has
>>>> programmed as bar->guest_addr (the guest's view of the BAR start).
>>> I don't see how the rangeset helps here. You have a guest and a host pair
>>> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
>>> to their host counterpart address, yes, but you need to special-case
>>> these anyway: Accesses to them need to be handled. Hence I'm having a hard
>>> time seeing how a per-BAR rangeset (which will cover at most three distinct
>>> ranges afaict, which is way too little for this kind of data organization
>>> imo) can gain you all this much.
>>>
>>> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
>>> the majority (4 or more) of the rangesets will indeed merely represent a
>>> plain (start,end) pair (or be entirely empty).
>> First of all, let me explain why I decided to move to per-BAR
>> range sets.
>> Before this change all the MMIO regions and MSI-X holes were
>> accounted by a single range set, e.g. we go over all BARs and
>> add MMIOs and then subtract MSI-X from there. When it comes to
>> mapping/unmapping we have an assumption that the starting address of
>> each element in the range set is equal to map/unmap address, e.g.
>> we have identity mapping. Please note, that the range set accepts
>> a single private data parameter which is enough to hold all
>> required data about the pdev in common, but there is no way to provide
>> any per-BAR data.
>>
>> Now, that we want non-identity mappings, we can no longer assume
>> that starting address == mapping address and we need to provide
>> additional information on how to map and which is now per-BAR.
>> This is why I decided to use per-BAR range sets.
>>
>> One of the solutions may be that we form an additional list of
>> structures in a form (I omit some of the fields):
>> struct non_identity {
>>       unsigned long start_mfn;
>>       unsigned long start_gfn;
>>       unsigned long size;
>> };
>> So this way when the range set gets processed we go over the list
>> and find the corresponding list element which describes the
>> range set entry being processed (s, e, data):
>>
>> static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>> {
>> [snip]
>>       go over the list elements
>>           if ( list->start_mfn == s )
>>               found, can use list->start_gfn for mapping
>> [snip]
>> }
>> This has some complications as map_range may be called multiple times
>> for the same range: if {unmap|map}_mmio_regions was not able to complete
>> the operation it returns the number of pages it was able to process:
>>           rc = map->map ? map_mmio_regions(map->d, start_gfn,
>>                                            size, _mfn(s))
>>                         : unmap_mmio_regions(map->d, start_gfn,
>>                                              size, _mfn(s));
>> In this case we need to update the list item:
>>       list->start_mfn += rc;
>>       list->start_gfn += rc;
>>       list->size -= rc;
>> and if all the pages of the range were processed delete the list entry.
>>
>> With respect to creating the list, things are also not so complicated:
>> while processing each BAR create a list entry and fill it with mfn, gfn
>> and size. Then, if MSI-X region is present within this BAR, break the
>> list item into multiple ones with respect to the holes, for example:
>>
>> MMIO 0 list item
>> MSI-X hole 0
>> MMIO 1 list item
>> MSI-X hole 1
>>
>> Here instead of a single BAR description we now have 2 list elements
>> describing the BAR without MSI-X regions.
>>
>> All the above still relies on a single range set per pdev as it is in the
>> original code. We can go this route if we agree this is more acceptable
>> than the range sets per BAR.
> I guess I am now even more confused: I can't spot any "rangeset per pdev"
> either. The rangeset I see being used doesn't get associated with anything
> that's device-related; it gets accumulated as a transient data structure,
> but _all_ devices owned by a domain influence its final content.

You are absolutely right here, sorry for the confusion: in the current
code the range set belongs to struct vpci_vcpu, e.g.

/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */

>
> If you associate rangesets with either a device or a BAR, I'm failing to
> see how you'd deal with multiple BARs living in the same page (see also
> below).

This was exactly the issue I ran into while emulating RTL8139 on QEMU:
the MMIOs are 128 bytes long and Linux put them on the same page.
So, it is a known limitation that we can't deal with this [1].

>
> Considering that a rangeset really is a compressed representation of a
> bitmap, I wonder whether this data structure is suitable at all for what
> you want to express. You have two pieces of information to carry / manage,
> after all: Which ranges need mapping, and what their GFN <-> MFN
> relationship is. Maybe the latter needs expressing differently in the
> first place?

I proposed a list which can be extended to hold all the required
information there, e.g. MFN, GFN, size etc.

>   And then in a way that's ensuring by its organization that
> no conflicting GFN <-> MFN mappings will be possible?

If you mean the use-case above with different device MMIOs living
in the same page, then my understanding is that such a use-case is
not supported [1].

>   Isn't this
> precisely what is already getting recorded in the P2M?
>
> I'm also curious what your plan is to deal with BARs overlapping in MFN
> space: In such a case, the guest cannot independently change the GFNs of
> any of the involved BARs. (Same the other way around: overlaps in GFN
> space are only permitted when the same overlap exists in MFN space.) Are
> you excluding (forbidding) this case? If so, did I miss you saying so
> somewhere?
Again [1]
>   Yet if no overlaps are allowed in the first place, what
> modify_bars() does would be far more complicated than necessary in the
> DomU case, so it may be worthwhile considering to deviate more from how
> Dom0 gets taken care of. In the end a guest writing a BAR is merely a
> request to change its P2M. That's very different from Dom0 writing a BAR,
> which means the physical BAR also changes, and hence the P2M changes in
> quite different a way.

So, what is the difference then, besides hwdom really writing to a BAR?
To me most of the logic remains the same: we need to map/unmap.
The only difference I see here is that for Dom0 we have 1:1 at the moment
and for the guest we need GFN <-> MFN.


Anyways, I am open to any decision on what would be the right approach here:

1. Use range sets per BAR as in the patch
2. Remove range sets completely and have a per-vCPU list with mapping
   data as I described above
3. Anything else?

>
> Jan

Thank you,
Oleksandr

[1] https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough#I_get_.22non-page-aligned_MMIO_BAR.22_error_when_trying_to_start_the_guest
