
Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0


  • To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 12 May 2026 15:22:25 +0200
  • Cc: Mykyta Poturai <Mykyta_Poturai@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, "Daniel P. Smith" <dpsmith@xxxxxxxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Delivery-date: Tue, 12 May 2026 13:22:51 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, May 12, 2026 at 10:28:05AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@xxxxxxxxxx> writes:
> 
> > On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> >> 
> >> 
> >> On 5/12/26 09:20, Jan Beulich wrote:
> >> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> >> Hi Jan,
> >> >>
> >> >> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>
> >> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >> >>>> Jan Beulich <jbeulich@xxxxxxxx> writes:
> >> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
> >> >>>>>>>>>>
> >> >>>>>>>>>> This code is expected to only be used by privileged domains,
> >> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV 
> >> >>>>>>>>>> capability.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> >> >>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to 
> >> >>>>>>>>>> account
> >> >>>>>>>>>> for possible changes in the system page size register. Also 
> >> >>>>>>>>>> force VFs to
> >> >>>>>>>>>> always use emulated reads for command register, this is needed 
> >> >>>>>>>>>> to
> >> >>>>>>>>>> prevent some drivers accidentally unmapping BARs.
> >> >>>>>>>>>
> >> >>>>>>>>> This apparently refers to the change to vpci_init_header(). 
> >> >>>>>>>>> Writes are
> >> >>>>>>>>> already intercepted. How would a read lead to accidental BAR 
> >> >>>>>>>>> unmap? Even
> >> >>>>>>>>> for writes I don't see how a VF driver could accidentally unmap 
> >> >>>>>>>>> BARs, as
> >> >>>>>>>>> the memory decode bit there is hardwired to 0.
> >> >>>>>>>>>
> >> >>>>>>>>>> Discovery of VFs is
> >> >>>>>>>>>> done by Dom0, which must register them with Xen.
> >> >>>>>>>>>
> >> >>>>>>>>> If we intercept control register writes, why would we still 
> >> >>>>>>>>> require
> >> >>>>>>>>> Dom0 to report the VFs that appear?
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Sorry, I don't understand this question. You specifically 
> >> >>>>>>>> requested this
> >> >>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
> >> >>>>>>>>
> >> >>>>>>>>    > Aren't you effectively busy-waiting for these 100ms, by 
> >> >>>>>>>> simply
> >> >>>>>>>> returning "true"
> >> >>>>>>>>    > from vpci_process_pending() until the time has passed? This 
> >> >>>>>>>> imo is a
> >> >>>>>>>> no-go. You
> >> >>>>>>>>    > want to set a timer and put the vCPU to sleep, to wake it up 
> >> >>>>>>>> again
> >> >>>>>>>> when the
> >> >>>>>>>>    > timer has expired. That'll then eliminate the need for the
> >> >>>>>>>> not-so-nice patch 4.
> >> >>>>>>>>
> >> >>>>>>>>    > Question is whether we need to actually go this far (right 
> >> >>>>>>>> away). I
> >> >>>>>>>> expect you
> >> >>>>>>>>    > don't mean to hand PFs to DomU-s. As long as we keep them in 
> >> >>>>>>>> the hardware
> >> >>>>>>>>    > domain, can't we trust it to set things up correctly, just 
> >> >>>>>>>> like we
> >> >>>>>>>> trust it in
> >> >>>>>>>>    > a number of other aspects?
> >> >>>>>>>
> >> >>>>>>> How's any of this related to the question I raised here, or your 
> >> >>>>>>> reply
> >> >>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are 
> >> >>>>>>> created.
> >> >>>>>>> Why still demand Dom0 to report them then?
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE 
> >> >>>>>> bit is
> >> >>>>>> set to become alive. We discussed in the V2 that it is not 
> >> >>>>>> acceptable to
> >> >>>>>> do a required 100ms wait in Xen while blocking a domain. And not 
> >> >>>>>> doing
> >> >>>>>> that blocking would require some mechanism to only allow a domain 
> >> >>>>>> to run
> >> >>>>>> for precisely 99(or more?)ms. You yourself suggested that we can 
> >> >>>>>> trust
> >> >>>>>> the hardware domain with registering VFs if we already trust it with
> >> >>>>>> other PCI-related stuff. Did you change your mind, or am I 
> >> >>>>>> completely
> >> >>>>>> misunderstanding this question?
> >> >>>>>
> >> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we 
> >> >>>>> should
> >> >>>>> aim at being independent of it where possible. And I seem to recall 
> >> >>>>> that
> >> >>>>> I had also outlined an approach how to avoid spin-waiting for 100ms 
> >> >>>>> in
> >> >>>>> the hypervisor.
> >> >>>>
> >> >>>> I want to clarify: you are telling that Xen should not wait for hwdom 
> >> >>>> to
> >> >>>> report VFs and instead create them by itself. Is this correct?
> >> >>>
> >> >>> If that's technically possible, yes.
> >> >>
> >> >> Okay, so let's clear this. If I remember correct, you discussed this
> >> >> with Mykyta in the previous version and suggested to put the vCPU to
> >> >> sleep for 100ms.
> >> > 
> >> > I don't think I did (except perhaps from a very abstract perspective),
> >> > precisely because of ...
> >> > 
> >> >> I don't think that this is a good idea, because guest
> >> >> kernel will not be happy about that.
> >> > 
> >> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
> >> > VF register accesses for the next 100ms.
> >> > 
> >> > Jan
> >> 
> >> Do you have any suggestions on how to ensure that we accurately catch 
> >> the window where 100ms have already passed, but guests haven’t tried to 
> >> read anything yet, to flip this back? As I mentioned in the previous 
> >> version, Linux, for example, doesn’t attempt to re-read anything if the 
> >> first read failed after 100ms. So it appears to me that this approach 
> >> would be prone to racing with the guest for getting to the VF first. One 
> >> approach I can think of is to somehow swap the register handlers back 
> >> in-flight during the first read by the guest if 100ms have already 
> >> passed. However, this would still depend on Dom0 for registering VFs, 
> >> but in a more convoluted way. We also can’t add the VFs before 100ms 
> >> have passed and add timing checks to all register handlers, because 
> >> pci_add_device and everything below it expects the device to be 
> >> functional at the moment of addition.
> >> 
> >> 
> >> 
> >> Maybe you see some other way to avoid these problems that I am missing?
> >
> > We could maybe do some middle ground here, kind of similar to what
> > Linux does.  The overall idea would be to put on hold any accesses to
> > the device(s)' PCI config space for 100ms; that would include the PF
> > and any VFs.  At the point when VF Enable is set Xen already knows the
> > position of the VFs in the PCI config space.
> >
> > Any PCI config space access attempts to the PF or VFs during that
> > 100ms window would cause the guest vCPU to be put on hold, and the
> > access would only be retried once the 100ms window has passed and Xen
> > has registered the VFs with vPCI.  This approach needs extra logic to
> > put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> > into the p2m, and a timer to defer the adding of the VFs and the
> > unlocking of the affected PCI config space region.
> >
> > That would be a middle ground IMO, as the guest vCPUs could keep
> > running freely, unless accesses to the affected PCI config space were
> > attempted before the 100ms window has passed, at which point they would
> > be blocked waiting for the timeout to expire.  A well-behaved domain
> > shouldn't attempt to access that PCI config space ahead of the 100ms
> > window expiring anyway.
> 
> This approach seems reasonable to me, but it would require big changes
> in the vPCI logic, as pci_add_device() now needs the ability to defer
> all config space accesses until the VFs are ready, and in the meantime
> we'll have to deal with a half-initialized pdev.  The PCI/vPCI logic is
> already convoluted enough, and adding more intermediate states that need
> to be dealt with in different places will make things even worse.
> Unless I'm missing some easy fix, of course...

You could just defer the pci_add_device() call until after the 100ms
window that starts when VF Enable is set?  The only requirement would be
blocking PCI config space accesses to the VFs during those 100ms.  You
could use a bitmap to signal which SBDFs should be rejected in
vpci_{read,write}().  There's already logic in those functions to reject
accesses.
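
Something along these lines, perhaps (completely untested sketch; the
helper names vf_blocked_map, sriov_vf_enable() and sriov_defer_timeout()
are made up for illustration, and locking and segment handling are
omitted):

/* Bitmap of BDFs whose config space accesses are temporarily rejected. */
static DECLARE_BITMAP(vf_blocked_map, 1U << 16);

/* Assumed to have been initialized with init_timer() during vPCI setup. */
static struct timer sriov_defer_timer;

/* VFs discovered from the PF SR-IOV capability, pending registration. */
static pci_sbdf_t pending_vf[8];
static unsigned int nr_pending_vf;

/* Called from the PCI_SRIOV_CTRL write handler when VF Enable is set. */
static void sriov_vf_enable(void)
{
    unsigned int i;

    for ( i = 0; i < nr_pending_vf; i++ )
        __set_bit(pending_vf[i].bdf, vf_blocked_map);

    /* PCIe mandates waiting up to 100ms before VFs can be accessed. */
    set_timer(&sriov_defer_timer, NOW() + MILLISECS(100));
}

/* Timer callback: VFs are now usable, register and unblock them. */
static void sriov_defer_timeout(void *data)
{
    unsigned int i;

    for ( i = 0; i < nr_pending_vf; i++ )
    {
        /* Registration of the VF with vPCI would go here. */
        __clear_bit(pending_vf[i].bdf, vf_blocked_map);
    }
}

/* Checked in vpci_{read,write}() next to the existing rejection logic. */
static bool vf_access_blocked(pci_sbdf_t sbdf)
{
    return test_bit(sbdf.bdf, vf_blocked_map);
}

The VF SBDFs themselves can be derived from the PF's SR-IOV capability
(First VF Offset, VF Stride and NumVFs) at the point where VF Enable is
written, so no side-band reporting from Dom0 would be needed to learn
them.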

> What I am trying to say is that your suggestion is technically doable,
> but it requires a lot of work, and we don't have the resources for this
> right now.  So, what's your opinion on the existing approach?  Is
> relying on a domain to introduce VFs such a bad idea?

IMO yes, requiring the use of a hypercall (or any other side-band
interface) when not strictly required just adds a handicap for OSes
that then need to be ported to run on Xen.  It might seem easier at
first, but adding and maintaining such side-band interfaces in OSes
tends to be cumbersome and error-prone.

We could have avoided introducing vPCI at the cost of adding a
completely new side-band interface to manage PCI config space
accesses and interrupts on x86 PVH, yet we didn't, because albeit
easier to implement on the Xen side, it would have imposed a huge cost
on OSes that want to run in PVH mode.

Thanks, Roger.



 

