
Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests


  • To: Julien Grall <julien@xxxxxxx>, Julien Grall <julien.grall.oss@xxxxxxxxx>
  • From: Milan Djokic <milan_djokic@xxxxxxxx>
  • Date: Thu, 14 Aug 2025 18:26:00 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Nick Rosbrook <enr0n@xxxxxxxxxx>, George Dunlap <gwd@xxxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Delivery-date: Thu, 14 Aug 2025 16:26:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:
On 13/08/2025 11:04, Milan Djokic wrote:
Hello Julien,

Hi Milan,


We have prepared a design document and it will be part of the updated
patch series (added in docs/design). I'll also extend cover letter with
details on implementation structure to make review easier.

I would suggest just iterating on the design document for now.

The following is the design document content which will be provided in
the updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@xxxxxxxx>
Date:   2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation, each of which can be
independently enabled. An incoming address is logically translated from
VA to IPA in stage 1, and the IPA is then input to stage 2, which
translates it to the output PA. Stage-1 translation support is required
to provide isolation between different devices within the guest OS.

Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.
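
As a rough mental model of how the stages compose (a hedged C sketch,
not driver code; all names are made up)::

    #include <stdbool.h>
    #include <stdint.h>

    typedef bool (*xlate_fn)(uint64_t in, uint64_t *out);

    /* Stage 1 (guest-owned): VA -> IPA.  Stage 2 (Xen-owned):
     * IPA -> PA.  Either stage can be bypassed; a fault in either
     * stage aborts the transaction. */
    static bool translate(xlate_fn s1, xlate_fn s2, uint64_t va,
                          uint64_t *pa)
    {
        uint64_t ipa = va;           /* identity if stage 1 is bypassed */

        if ( s1 && !s1(va, &ipa) )
            return false;            /* stage-1 fault */
        if ( !s2 )
        {
            *pa = ipa;               /* stage 2 bypassed */
            return true;
        }
        return s2(ipa, pa);          /* stage-2 fault or success */
    }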

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without
stage-1.

Correct in terms of working with a guest-managed I/O space. I'll rephrase this statement, as it is ambiguous.


This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
    SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
- Register/Command Emulation: SMMUv3 register emulation and command
    queue handling

So what are you planning to expose to a guest? Is it one vIOMMU per
pIOMMU? Or a single one?

A single-vIOMMU model is used in this design.


Have you considered the pros/cons for both?


That's a point for consideration.
A single vIOMMU prevails in terms of implementation complexity and a simple guest IOMMU model: a single vIOMMU node, one interrupt path, one event queue, a single set of trap handlers for emulation, etc. The cons of a single-vIOMMU model are a less accurate hardware representation and a potential bottleneck in the one emulated queue and interrupt path. On the other hand, a vIOMMU per pIOMMU provides more accurate hardware modeling and better scalability when there are many IOMMUs in the system, but it comes with more complex emulation logic and device tree handling, as well as the guest having to manage multiple vIOMMUs. IMO, the single-vIOMMU model is the better option, mostly because it is less complex and easier to maintain and debug. Of course, this decision can and should be discussed.

For each pSMMU, we have a single command queue that will receive
commands from all the guests. How do you plan to prevent a guest
hogging the command queue?

In addition to that, AFAIU, the size of the virtual command queue is
fixed by the guest rather than Xen. If a guest is filling up the queue
with commands before notifying Xen, how do you plan to ensure we don't
spend too much time in Xen (which is not preemptible)?


We'll have to do a detailed analysis of these scenarios; they are not covered by the design (nor are some others, which is clear after your comments). I'll come back with an updated design.

Lastly, what do you plan to expose? Is it a full vIOMMU (including event
forwarding)?


Yes, the implementation provides full vIOMMU functionality, with stage-1 event forwarding to the guest.

- Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
    device trees for dom0 and dom0less scenarios (see the illustrative
    fragment after this list)
- Runtime Configuration: introduces a 'viommu' boot parameter for
    dynamic enablement
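
As an illustration, a guest with a passthrough device could see a
fragment like the one below. This is a hedged sketch: the unit
addresses, interrupt specifier and StreamID are made-up values, and
the exact bindings are whatever the final patches define::

    vsmmu: iommu@5000000 {
        compatible = "arm,smmu-v3";
        reg = <0x0 0x5000000 0x0 0x20000>;
        interrupts = <0 10 1>;           /* illustrative SPI */
        interrupt-names = "combined";
        #iommu-cells = <1>;
    };

    ethernet@8000000 {
        /* illustrative passthrough device; 0x10 is its StreamID
           on the virtual SMMU */
        iommus = <&vsmmu 0x10>;
    };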

Security Considerations
------------------------

vIOMMU security benefits:
- Stage-1 translation ensures guest devices cannot perform unauthorized
    DMA
- Emulated SMMUv3 for domains removes dependency on host hardware while
    maintaining isolation

I don't understand this sentence.


The current implementation emulates an IOMMU with predefined capabilities, exposed as a single vIOMMU to the guest. That's where "removes dependency on host hardware" came from. I'll rephrase this part to be clearer.



Observations and Potential Risks
--------------------------------

1. Observation:
Support for Stage-1 translation introduces new data structures
(s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
in the Stream Table Entry (STE), including an abort field for partial
config states.

Risk:
A partially applied Stage-1 configuration might leave guest DMA
mappings in an inconsistent state, enabling unauthorized access or
cross-domain interference.

I don't understand how a misconfigured stage-1 could lead to
cross-domain interference. Can you clarify?


For stage-1 support, SID-to-device mapping and per-device io_domain allocation are introduced in the Xen SMMU driver, and we have to take care that these mappings are valid at all times. If they are not properly managed, structures and SIDs could be mapped to the wrong device (and consequently the wrong guest) in some extreme cases. This is covered by the design, but it is listed as a risk anyway for eventual future updates in this area.
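
Roughly the kind of cross-check I mean (a hedged sketch; the structure and helper names are illustrative, not the driver's)::

    #include <stdint.h>

    struct domain;                    /* Xen domain handle */
    struct device;

    struct sid_entry {
        uint32_t sid;                 /* physical StreamID */
        struct device *dev;           /* device owning this SID */
        struct domain *owner;         /* domain the device is assigned to */
    };

    struct sid_entry *sid_table_find(uint32_t sid);   /* assumed helper */

    /* Validate ownership before any STE or io_domain is touched, so a
     * stale or reused SID can never be applied to another guest's
     * device. */
    static struct sid_entry *sid_lookup(struct domain *d, uint32_t sid)
    {
        struct sid_entry *e = sid_table_find(sid);

        return (e && e->owner == d) ? e : NULL;
    }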



Mitigation (Handled by design):
Both s1_cfg and s2_cfg are written atomically. The abort field ensures
Stage-1 config is only used when fully applied. Incomplete configs are
ignored by the hypervisor.
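
A hedged sketch of the idea (illustrative layout and names; the real
STE is a multi-word structure updated per the procedure in the SMMUv3
spec)::

    #include <stdatomic.h>
    #include <stdint.h>

    #define STE_V     (1ULL << 0)    /* entry valid */
    #define STE_ABORT (1ULL << 1)    /* abort transactions (made up) */

    struct ste {
        _Atomic uint64_t word0;      /* published atomically */
        uint64_t s1_ctxptr;          /* stage-1 context descriptor ptr */
        uint64_t s2_ttb;             /* stage-2 translation table base */
    };

    static void ste_update(struct ste *ste, uint64_t s1_ctxptr,
                           uint64_t s2_ttb)
    {
        /* 1. Make the entry abort while it is rewritten (the real
         * driver would follow with CMD_CFGI_STE + CMD_SYNC to drop
         * cached copies of the old entry). */
        atomic_store_explicit(&ste->word0, STE_V | STE_ABORT,
                              memory_order_release);

        /* 2. Write both stage configs; ordering is irrelevant while
         * the entry aborts. */
        ste->s1_ctxptr = s1_ctxptr;
        ste->s2_ttb = s2_ttb;

        /* 3. Publish the complete entry with a single atomic store,
         * so an incomplete stage-1 config can never be observed. */
        atomic_store_explicit(&ste->word0, STE_V, memory_order_release);
    }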

2. Observation:
Guests can now issue Stage-1 cache invalidations.

Risk:
Failure to propagate invalidations could leave stale mappings, enabling
data leakage or misrouting.

This is a risk from the guest PoV, right? IOW, this would not open up a
security hole in Xen.


Yes, this is from the guest PoV, although still related to the vIOMMU.


Mitigation (Handled by design):
Guest invalidations are forwarded to the hardware to ensure IOMMU
coherency.
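
For illustration, the forwarding path could look roughly like this
(a hedged sketch; domain_vmid(), pcmdq_issue() and pcmdq_sync() are
assumed helpers, and only the CMD_TLBI_NH_VA opcode comes from the
SMMUv3 spec)::

    #include <stdint.h>

    #define CMD_TLBI_NH_VA 0x12      /* stage-1 VA invalidation */

    struct smmu_cmd {
        uint8_t  opcode;
        uint16_t vmid;               /* stamped by Xen, never the guest */
        uint16_t asid;
        uint64_t addr;
    };

    struct domain;                   /* Xen domain handle */
    uint16_t domain_vmid(struct domain *d);
    void pcmdq_issue(const struct smmu_cmd *cmd);
    int pcmdq_sync(void);            /* waits for CMD_SYNC completion */

    static int forward_guest_tlbi(struct domain *d,
                                  const struct smmu_cmd *vcmd)
    {
        struct smmu_cmd pcmd = {
            .opcode = CMD_TLBI_NH_VA,
            /* Scope the invalidation to this domain: stage-1 TLB
             * entries are tagged with the owning VMID, so a guest can
             * only shoot down its own mappings. */
            .vmid = domain_vmid(d),
            .asid = vcmd->asid,      /* guest-chosen ASID is fine */
            .addr = vcmd->addr,
        };

        pcmdq_issue(&pcmd);
        return pcmdq_sync();         /* complete before unmap returns */
    }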

3. Observation:
The feature introduces large functional changes including the vIOMMU
framework, vsmmuv3 devices, command queues, event queues, domain
handling, and Device Tree modifications.

Risk:
Increased attack surface with risk of race conditions, malformed
commands, or misconfiguration via the device tree.

Mitigation:
- Improved sanity checks and error handling
- Feature is marked as Tech Preview and self-contained to reduce risk
    to unrelated code

Surely, you will want to use the code in production... No?


Yes, it is planned for production usage. At the moment it is optionally enabled (grouped under unsupported features) and still needs community feedback, a complete security analysis, and performance benchmarking/optimization. That's the reason it's marked as Tech Preview at this point.



4. Observation:
The implementation supports nested and standard translation modes,
using guest command queues (e.g. CMD_CFGI_STE) and events.

Risk:
Malicious commands could bypass validation and corrupt SMMUv3 state or
destabilize dom0.

Mitigation (Handled by design):
Command queues are validated, and only permitted configuration changes
are accepted. Handled in vsmmuv3 and cmdqueue logic.
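
As an illustration of "only permitted configuration changes", the
dispatch could be an explicit allowlist (a hedged sketch; the opcode
values are from the SMMUv3 spec, everything else is made up). Rejected
commands would raise CERROR_ILL towards the guest rather than silently
succeed::

    #include <stdbool.h>
    #include <stdint.h>

    static bool vcmd_permitted(uint8_t opcode)
    {
        switch ( opcode )
        {
        case 0x01:                   /* CMD_PREFETCH_CONFIG */
        case 0x03:                   /* CMD_CFGI_STE */
        case 0x11:                   /* CMD_TLBI_NH_ASID */
        case 0x12:                   /* CMD_TLBI_NH_VA */
        case 0x46:                   /* CMD_SYNC */
            return true;
        default:                     /* anything global or hypervisor-
                                      * scoped (e.g. CMD_TLBI_EL2_*)
                                      * is refused */
            return false;
        }
    }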

I didn't mention anything in observation 1, but now I have to say it...
The observations you wrote are what I would expect to be handled in any
submission to our code base. This is the bare minimum to have the code
secure. But you don't seem to address the more subtle ones, which are
more related to scheduling issues (see some above). They require some
design and discussion.


Yes, it's clear to me after your comments that some important observations are missing. We'll do additional analysis and come back with a more complete design.


5. Observation:
Device Tree changes inject iommus and vsmmuv3 nodes via libxl.

Risk:
Malicious or incorrect DT fragments could result in wrong device
assignments or hardware access.

Mitigation:
Only vetted and sanitized DT fragments are allowed. libxl limits what
guests can inject.

Today, libxl doesn't do any sanitisation on the DT. In fact, this is
pretty much impossible because libfdt expects trusted DT. Is this
something you are planning to change?

I was referring to libxl parsing only supported fragments/nodes from the DT, but yes, that's not actual sanitization. I'll update these statements.


6. Observation:
The feature is enabled per-guest via viommu setting.

Risk:
Guests without viommu may behave differently, potentially causing
confusion, privilege drift, or accidental exposure.

Mitigation:
Ensure downgrade paths are safe. Perform isolation audits in
multi-guest environments to ensure correct behavior.

Performance Impact
------------------

Hardware-managed translations are expected to have minimal overhead.
Emulated vIOMMU may introduce some latency during initialization or
event processing.

Latency to whom? We still expect isolation between guests, and a guest
will not go over its time slice.


This is more related to a comparison of emulated vs. hardware translation, and the overall overhead introduced by these mechanisms. I'll rephrase this part to be clearer.

For the guest itself, the main performance impact will be TLB flushes,
because they are commands that will end up being emulated by Xen.
Depending on your Linux configuration (I haven't checked others), this
will either happen on every unmap operation or they will be batched.
The performance of the latter will be the worse one.

Have you done any benchmark to confirm the impact? Just to note, I would
not gate the work for virtual SMMUv3 based on the performance. I think
it is ok to offer the support if the user wants extra security and
doesn't care about performance. But it would be good to outline the
numbers, as I expect them to be pretty bad...


We haven't performed detailed benchmarking, just a measurement of boot time and our domU application's execution rate with and without viommu. We could perform some measurements for viommu operations and add the results to this section.

Thank you for your feedback; I'll come back with an updated design document for further review.

BR,
Milan



 

