Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
Hi Volodymyr,

On 8/29/25 18:27, Volodymyr Babchuk wrote:

Hi Milan,

Thanks, the "Security Considerations" section looks really good. But I have more questions.

Milan Djokic <milan_djokic@xxxxxxxx> writes:

Hello Julien, Volodymyr

On 8/27/25 01:28, Volodymyr Babchuk wrote:

Hi Milan,

Milan Djokic <milan_djokic@xxxxxxxx> writes:

Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:

On 13/08/2025 11:04, Milan Djokic wrote:

Hello Julien,

Hi Milan,

We have prepared a design document and it will be part of the updated patch series (added in docs/design). I'll also extend the cover letter with details on the implementation structure to make review easier.

I would suggest to just iterate on the design document for now.

The following is the design document content which will be provided in the updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@xxxxxxxx>
Date: 2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation. Each stage of translation can be independently enabled. An incoming address is logically translated from VA to IPA in stage 1, then the IPA is input to stage 2, which translates the IPA to the output PA. Stage 1 translation support is required to provide isolation between different devices within the OS. Xen already supports Stage 2 translation, but there is no support for Stage 1 translation. This design proposal outlines the introduction of Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without stage-1.

Correct in terms of working with guest-managed I/O space. I'll rephrase this statement, it seems ambiguous.

This feature enables:

- Stage-1 translation in the guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in the SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling

So what are you planning to expose to a guest? Is it one vIOMMU per pIOMMU? Or a single one?

A single vIOMMU model is used in this design.

Have you considered the pros/cons for both?

- Register/Command Emulation: SMMUv3 register emulation and command queue handling

That's a point for consideration. A single vIOMMU prevails in terms of a less complex implementation and a simple guest IOMMU model: a single vIOMMU node, one interrupt path, one event queue, a single set of trap handlers for emulation, etc. Cons for a single vIOMMU model could be a less accurate hardware representation and a potential bottleneck with one emulated queue and interrupt path. On the other hand, vIOMMU per pIOMMU provides more accurate hardware modeling and offers better scalability in case of many IOMMUs in the system, but this comes with more complex emulation logic and device tree handling, as well as handling multiple vIOMMUs on the guest side. IMO, the single vIOMMU model seems like the better option, mostly because it is less complex and easier to maintain and debug. Of course, this decision can and should be discussed.

Well, I am not sure that this is possible, because of StreamID allocation. The biggest offender is of course PCI, as each Root PCI bridge will require its own SMMU instance with its own StreamID space.
But even without PCI you'll need some mechanism to map a vStreamID to <pSMMU, pStreamID>, because there will be overlaps in the SID space. Actually, PCI/vPCI with vSMMU is its own can of worms...

For each pSMMU, we have a single command queue that will receive commands from all the guests. How do you plan to prevent a guest hogging the command queue? In addition to that, AFAIU, the size of the virtual command queue is fixed by the guest rather than Xen. If a guest is filling up the queue with commands before notifying Xen, how do you plan to ensure we don't spend too much time in Xen (which is not preemptible)?

We'll have to do a detailed analysis on these scenarios; they are not covered by the design (as well as some others, which is clear after your comments). I'll come back with an updated design.

I think that can be handled akin to hypercall continuation, which is used in similar places, like P2M code [...]

I have updated the vIOMMU design document with additional security topics covered and performance impact results. Also added some additional explanations for vIOMMU components following your comments. Updated document content:

===========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
===========================================================

:Author: Milan Djokic <milan_djokic@xxxxxxxx>
:Date: 2025-08-07
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation can be independently enabled. An incoming address is logically translated from VA to IPA in stage 1, then the IPA is input to stage 2, which translates the IPA to the output PA. Stage 1 translation support is required to provide isolation between different devices within the OS. Xen already supports Stage 2 translation but there is no support for Stage 1 translation. This design proposal outlines the introduction of Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure secure DMA and guest-managed I/O memory mappings.

It is unclear for me what you mean by "guest managed IO memory mappings", could you please provide an example?

Basically, enabling stage-1 translation means that the guest is responsible for managing IOVA to IPA mappings through its own IOMMU driver. The guest manages its own stage-1 page tables and TLB. For example, when a guest driver wants to perform a DMA mapping (e.g. with dma_map_single()), it will request a mapping of its buffer physical address to an IOVA through the guest IOMMU driver. The guest IOMMU driver will then issue mapping commands which Xen emulates and translates into stage-2 mappings.

This feature enables:

- Stage-1 translation in the guest domain
- Safe device passthrough under secure memory translation

As I see it, ARM specs use "secure" mostly when referring to Secure mode (S-EL1, S-EL2, EL3) and the associated secure counterparts of architectural devices, like the secure GIC, secure Timer, etc. So I'd probably not use this word here, to reduce confusion.

Sure, secure in terms of isolation is the topic here. I'll rephrase this.

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support in the SMMUv3 driver.

"Nested translation" as in "nested virtualization"? Or is this something else?

No, this refers to the 2-stage translation IOVA->IPA->PA as a nested translation. Although with this feature, nested virtualization is also enabled, since a guest can emulate its own IOMMU, e.g. when KVM is run in the guest.
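To make the guest-managed mapping flow described above concrete, here is a minimal, illustrative guest-side sketch. It uses only the standard Linux DMA API; the function and buffer handling shown here are made up for the example and are not part of the patch series::

    /*
     * Illustrative guest view: the driver only uses the regular DMA API.
     * Underneath, the guest SMMUv3 driver builds the stage-1 (IOVA -> IPA)
     * page tables and issues the configuration/invalidation commands that
     * the emulated vIOMMU traps, while Xen keeps the stage-2 (IPA -> PA)
     * mappings.
     */
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    static int example_do_dma(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t iova;

        /* The guest IOMMU driver allocates an IOVA and creates the stage-1
         * mapping; the resulting SMMU commands are emulated by Xen. */
        iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, iova))
            return -ENOMEM;

        /* ... program the device with 'iova' and run the transfer ... */

        /* Tear-down triggers a stage-1 TLB invalidation, which the vIOMMU
         * forwards to the physical SMMUv3. */
        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);

        return 0;
    }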
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 handling.

I think this is the big topic. You see, apart from SMMU, there is at least the Renesas IP-MMU, which uses a completely different API. And probably there are other IO-MMU implementations possible. Right now the vIOMMU framework handles only SMMU, which is okay, but probably we should design it in such a way that other IO-MMUs will be supported as well. Maybe even IO-MMUs for other architectures (RISC-V maybe?).

I think that it is already designed in such a manner. We have a generic vIOMMU framework and a backend implementation for the target IOMMU as separate components. The backend implements the supported commands/mechanisms which are specific to the target IOMMU type. At this point, only SMMUv3 is supported, but it is possible to implement support for other IOMMU types under the same generic framework. AFAIK, RISC-V IOMMU stage-2 support is still in an early development stage, but I do believe that it will also be compatible with the vIOMMU framework.

- **Register/Command Emulation**: SMMUv3 register emulation and command queue handling.

Continuing the previous paragraph: what about other IO-MMUs? For example, if a platform provides only the Renesas IO-MMU, will the vIOMMU framework still emulate SMMUv3 registers and queue handling?

Yes, this is not supported in the current implementation. To support an IOMMU other than SMMUv3, a stage-1 emulation backend needs to be implemented for the target IOMMU, and probably the Xen driver for the target IOMMU has to be updated to handle stage-1 configuration. I will elaborate this part in the design, to make clear that we have a generic vIOMMU framework, but only an SMMUv3 backend exists atm.

- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for dynamic enablement.

The vIOMMU is exposed to the guest as a single device with predefined capabilities and supported commands. The single vIOMMU model abstracts the details of the actual IOMMU hardware, simplifying usage from the guest point of view. The guest OS handles only a single IOMMU, even if multiple IOMMU units are available on the host system.

In the previous email I asked how you are planning to handle potential SID overlaps, especially in the PCI use case. I want to return to this topic. I am not saying that this is impossible, but I'd like to see this covered in the design document.

Sorry, I've missed this part in the previous mail. This is a valid point: SID overlapping would be an issue for a single vIOMMU model. To prevent it, the design will have to be extended with SID namespace virtualization, introducing a remapping layer which will make sure that guest virtual SIDs are unique, and maintaining proper mappings of vSIDs to pSIDs. For the PCI case, we need extended remapping logic where the iommu-map property will also be patched in the guest device tree, since we need a range of unique vSIDs for every RC assigned to the guest. An alternative approach would be to switch to a vIOMMU per pIOMMU model. Since both approaches require major updates, I'll have to do a detailed analysis and come back with an updated design which would address this issue.
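As an illustration of the remapping layer mentioned above, below is a minimal sketch of a per-domain vSID lookup. All structure and function names here are hypothetical and only assume that each guest-visible StreamID resolves to a (physical SMMU, physical StreamID) pair; the real data structures would be defined by the updated design::

    #include <xen/types.h>

    struct arm_smmu_device;              /* physical SMMUv3 instance */

    /* Hypothetical per-domain vSID remapping entry: overlapping pSID spaces
     * behind different pSMMUs never collide in the single vIOMMU exposed to
     * the guest, because the guest only ever sees vSIDs. */
    struct vsid_entry {
        uint32_t vsid;                   /* StreamID seen by the guest */
        struct arm_smmu_device *smmu;    /* pSMMU the device sits behind */
        uint32_t psid;                   /* StreamID on that physical SMMU */
    };

    struct viommu_domain {
        struct vsid_entry *vsids;
        unsigned int nr_vsids;
    };

    /* Resolve a guest command's StreamID before shadowing it to hardware;
     * unknown vSIDs are rejected instead of being passed through. */
    static const struct vsid_entry *vsid_lookup(const struct viommu_domain *vd,
                                                uint32_t vsid)
    {
        unsigned int i;

        for ( i = 0; i < vd->nr_vsids; i++ )
            if ( vd->vsids[i].vsid == vsid )
                return &vd->vsids[i];

        return NULL; /* invalid vSID: fault the command, do not forward it */
    }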
Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while maintaining domain isolation.

I am not sure that I got this paragraph.

The first one refers to guest-controlled DMA access. Only IOVA->IPA mappings created by the guest are usable by the device when stage-1 is enabled. On the other hand, with only stage-2 enabled, the device could access the complete IOVA->PA mapping created by Xen for the guest. Since the guest has no control over device IOVA accesses, a malicious guest kernel could potentially access memory regions it shouldn't be allowed to, e.g. if stage-2 mappings are stale. With stage-1 enabled, the guest device driver has to explicitly map IOVAs, and this request is propagated through the emulated IOMMU, making sure that IOVA mappings are valid all the time.

The second claim means that with an emulated IOMMU, guests don't need direct access to physical IOMMU hardware. The hypervisor emulates IOMMU behavior for the guest, while still ensuring that memory access by devices remains properly isolated between guests, just like it would with real IOMMU hardware.

1. Observation:
---------------

Support for Stage-1 translation in SMMUv3 introduces new data structures (`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including an `abort` field to handle partial configuration states.

**Risk:** Without proper handling, a partially applied Stage-1 configuration might leave guest DMA mappings in an inconsistent state, potentially enabling unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)* This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the STE and manages the `abort` field, only considering the Stage-1 configuration if it is fully attached. This ensures incomplete or invalid guest configurations are safely ignored by the hypervisor.

2. Observation:
---------------

Guests can now invalidate Stage-1 caches; invalidations need forwarding to the SMMUv3 hardware to maintain coherence.

**Risk:** Failing to propagate cache invalidations could allow stale mappings, enabling access to old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)* This feature ensures that guest-initiated invalidations are correctly forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------

This design introduces substantial new functionality, including the `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues, event queues, domain management, and Device Tree modifications (e.g., `iommus` nodes and `libxl` integration).

**Risk:** Large feature expansions increase the attack surface: potential for race conditions, unchecked command inputs, or Device Tree-based misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in this feature.
- Further audits have to be performed for this feature and its dependencies in this area. Currently, the feature is marked as *Tech Preview* and is self-contained, reducing the risk to unrelated components.

4. Observation:
---------------

The code includes transformations to handle nested translation versus standard modes and uses guest-configured command queues (e.g., `CMD_CFGI_STE`) and event notifications.

**Risk:** Malicious or malformed queue commands from guests could bypass validation, manipulate SMMUv3 state, or cause Dom0 instability.

Only Dom0?

This is a mistake, the whole system could be affected. I'll fix this.

**Mitigation:** *(Handled by design)* Built-in validation of command queue entries and sanitization mechanisms ensure only permitted configurations are applied. This is supported via additions in the `vsmmuv3` and `cmdqueue` handling code.
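A minimal sketch of the kind of command sanitization meant here is shown below. The opcode values come from the SMMUv3 specification, while the function name and the exact set of accepted commands are illustrative only (it reuses the hypothetical `vsid_lookup()` from the earlier sketch)::

    #include <xen/errno.h>
    #include <xen/types.h>

    /* SMMUv3 command opcodes, encoded in bits [7:0] of the first command
     * doubleword (values per the SMMUv3 architecture specification). */
    #define CMDQ_OP_CFGI_STE     0x03
    #define CMDQ_OP_TLBI_NH_ASID 0x11
    #define CMDQ_OP_TLBI_NH_VA   0x12
    #define CMDQ_OP_CMD_SYNC     0x46

    /* Hypothetical validation of one guest command queue entry before it is
     * emulated or shadowed to the physical SMMUv3 command queue. */
    static int vsmmuv3_check_cmd(const struct viommu_domain *vd,
                                 const uint64_t cmd[2])
    {
        unsigned int opcode = cmd[0] & 0xff;

        switch ( opcode )
        {
        case CMDQ_OP_CFGI_STE:
            /* For CFGI_STE the StreamID sits in bits [63:32]; only SIDs
             * actually assigned to this guest may be configured. */
            if ( !vsid_lookup(vd, cmd[0] >> 32) )
                return -EINVAL;
            break;

        case CMDQ_OP_TLBI_NH_ASID:
        case CMDQ_OP_TLBI_NH_VA:
        case CMDQ_OP_CMD_SYNC:
            /* Accepted; emulated or forwarded after further field checks. */
            break;

        default:
            return -EINVAL; /* unsupported command: report CERROR_ILL */
        }

        return 0;
    }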
5. Observation:
---------------

Device Tree modifications enable device assignment and configuration; guest DT fragments (e.g., `iommus`) are added via `libxl`.

**Risk:** Erroneous or malicious Device Tree injection could result in device misbinding or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the guest Device Tree (DT) fragments.

6. Observation:
---------------

Introducing optional per-guest enabled features (the `viommu` argument in the xl guest config) means some guests may opt out.

**Risk:** Differences between guests with and without `viommu` may cause unexpected behavior or privilege drift.

**Mitigation:** Verify that downgrade paths are safe and well-isolated; ensure missing support doesn't cause security issues. Additional audits on emulation paths and domain interference need to be performed in a multi-guest environment.

7. Observation:
---------------

Guests have the ability to issue Stage-1 IOMMU commands like cache invalidation, stream table entry configuration, etc. An adversarial guest may issue a high volume of commands in rapid succession.

**Risk:** Excessive command requests can cause high hypervisor CPU consumption and disrupt scheduling, leading to degraded system responsiveness and potential denial-of-service scenarios.

**Mitigation:**

- The Xen credit scheduler limits guest vCPU execution time, securing basic guest rate-limiting.

I don't think that this feature is available only in the credit scheduler; AFAIK, all schedulers except the null scheduler will limit vCPU execution time.

I was not aware of that. I'll rephrase this part.

- Batch multiple commands of the same type to reduce overhead on the virtual SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support.

So, something like "hypercall continuation"?

Yes.

8. Observation:
---------------

Some guest commands issued towards the vIOMMU are propagated to the pIOMMU command queue (e.g. TLB invalidate). For each pIOMMU, only one command queue is available for all domains.

**Risk:** Excessive command requests from an abusive guest can cause flooding of the physical IOMMU command queue, leading to degraded pIOMMU responsiveness for commands issued from other guests.

**Mitigation:**

- The Xen credit scheduler limits guest vCPU execution time, securing basic guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command queue and enable support for batch execution pause/continuation (see the sketch at the end of this section).
- If possible, implement domain penalization by adding a per-domain cost counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------

The vIOMMU feature includes an event queue used for forwarding IOMMU events to the guest (e.g. translation faults, invalid stream IDs, permission errors). A malicious guest can misconfigure its SMMU state or intentionally trigger faults with high frequency.

**Risk:** Occurrence of IOMMU events with high frequency can cause Xen to flood the event queue and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults occur with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
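As referenced in observations 7 and 8, a continuation-style approach could bound the time spent in Xen per pass over a guest's virtual command queue. Below is a minimal sketch under that assumption; the batch size and all helpers (`vcmdq_pop()`, `vsmmuv3_emulate_cmd()`, `vsmmuv3_report_cerror()`, `vcmdq_empty()`, `vsmmuv3_schedule_continuation()`) are hypothetical, and `vsmmuv3_check_cmd()` is the validation sketch from earlier::

    #include <xen/types.h>

    /* Illustrative bound on the number of guest commands processed per
     * entry into Xen; the real value would be chosen by the design. */
    #define CMD_BATCH 64

    static void vsmmuv3_drain_cmdq(struct viommu_domain *vd)
    {
        unsigned int done = 0;
        uint64_t cmd[2];

        /* Process at most CMD_BATCH entries, then return to the guest with
         * the remaining work pending, akin to a hypercall continuation, so
         * a full virtual queue cannot keep a pCPU in the hypervisor for an
         * unbounded amount of time. */
        while ( done < CMD_BATCH && vcmdq_pop(vd, cmd) )
        {
            if ( vsmmuv3_check_cmd(vd, cmd) == 0 )
                vsmmuv3_emulate_cmd(vd, cmd);   /* may shadow to the pSMMU */
            else
                vsmmuv3_report_cerror(vd);      /* reject malformed command */

            done++;
        }

        /* Record progress and re-raise the emulation request so the rest is
         * handled on a later pass, after other vCPUs had a chance to run. */
        if ( !vcmdq_empty(vd) )
            vsmmuv3_schedule_continuation(vd);
    }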
Performance Impact
==================

With IOMMU stage-1 and nested translation included, performance overhead is introduced compared to the existing, stage-2-only usage in Xen. Once mappings are established, translations should not introduce significant overhead. Emulated paths may introduce moderate overhead, primarily affecting device initialization and event handling. The performance impact highly depends on target CPU capabilities. Testing is performed on a Cortex-A53 based platform.

Which platform exactly? While QEMU emulates SMMU to some extent, we are observing somewhat different SMMU behavior on real HW platforms (mostly due to cache coherence problems). Also, according to the MMU-600 errata, it can have lower than expected performance in some use cases.

Performance measurements are done on a QEMU-emulated Renesas platform. I'll add some details for this.

Performance is mostly impacted by emulated vIOMMU operations; results are shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With the vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU device and configure stage-1 mappings for the devices attached to it. The following table shows the initialization stages which impact stage-1 enabled guest boot time and compares them with a stage-1 disabled guest.

NOTE: Device probe execution time varies significantly depending on device complexity. virtio-gpu was selected as a test case due to its extensive use of dynamic DMA allocations and IOMMU mappings, making it a suitable candidate for benchmarking stage-1 vIOMMU behavior.

+----------------------+-----------------------+------------------------+
| Stage                | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+======================+=======================+========================+
| IOMMU Init           | ~25ms                 | /                      |
+----------------------+-----------------------+------------------------+
| Dev Attach / Mapping | ~220ms                | ~200ms                 |
+----------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA allocate/map/unmap operations is also impacted on stage-1 enabled guests. A dynamic DMA mapping operation issues emulated IOMMU functions like MMIO writes/reads and TLB invalidations. As a reference, the following table shows performance results for runtime DMA operations for the virtio-gpu device.
+---------------+--------------------------+-----------------------------+
| DMA Op        | Stage-1 Enabled Guest    | Stage-1 Disabled Guest      |
+===============+==========================+=============================+
| dma_alloc     | median: 27μs, worst: 7ms | median: 2.5μs, worst: 360μs |
+---------------+--------------------------+-----------------------------+
| dma_free      | median: 1ms, worst: 14ms | median: 2.2μs, worst: 85μs  |
+---------------+--------------------------+-----------------------------+
| dma_map       | median: 25μs, worst: 7ms | median: 1.5μs, worst: 336μs |
+---------------+--------------------------+-----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms | median: 1.3μs, worst: 65μs  |
+---------------+--------------------------+-----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation and nested virtualization.
- Actual hardware validation on platforms such as Renesas to ensure compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward compatibility.

BR,
Milan