
Re: Memory corruption bug with Xen PV Dom0 and BOSS-S1 RAID card



On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
> On 20.02.25 10:16, Roger Pau Monné wrote:
> > On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> > > Hello,
> > > 
> > > > So the issue doesn't happen on debug=y builds? That's unexpected. I
> > > > would expect the opposite, that some code in Linux assumes that
> > > > pfn + 1 == mfn + 1, and hence breaks when the relation is reversed.
> > > 
> > > It was also surprising to me, but I think the key thing is that debug=y
> > > causes the whole mapping to be reversed, so each PFN lands on a completely
> > > different MFN, e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in the ndebug
> > > build, but in debug it's mapped to PFN=0x5FFFFF. I guess that's why I
> > > can't reproduce the problem.
> > > 
> > > > Can you see if you can reproduce with dom0-iommu=strict in the Xen
> > > > command line?
> > > 
> > > Unfortunately, it doesn't help. But I have a few more observations.
> > > 
> > > Firstly, I checked the "xen-mfndump dump-m2p" output and found that
> > > misread blocks are mapped to suspiciously round MFNs. I have different
> > > versions of Xen and the Linux kernel on each machine, and I see the same
> > > pattern on both.
> > > 
> > > I'm writing a few huge files without Xen to ensure that they have been
> > > written correctly (because under Xen both reads and writeback are
> > > affected). Then I'm booting into Xen, memory-mapping the files and
> > > reading each page. I see that when a block is corrupted, it is mapped
> > > to a round MFN, e.g. pfn=0x5095d9/mfn=0x1600000, another at
> > > pfn=0x4095d9/mfn=0x1500000, etc.
> > > 
> > > On another machine with different Linux/Xen versions these faults appear
> > > at pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000, etc.
> > > 
> > > I also noticed that while reading the page mapped to
> > > pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> > > 
> > > ```
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > ```
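As a side note, those DMAR fault addresses are machine (bus) addresses, so shifting out the 4 KiB page offset gives the corresponding MFNs, which is presumably how Paweł arrived at the MFN range. A trivial sanity check (addresses taken from the log above):

```python
PAGE_SHIFT = 12  # assuming 4 KiB pages

def fault_addr_to_mfn(addr):
    """A DMAR fault address is a machine/bus address; the MFN is simply
    the address with the page offset shifted out."""
    return addr >> PAGE_SHIFT

# The seven fault addresses from the log above:
faults = [0x1200000000, 0x1200001000, 0x1200006000, 0x1200008000,
          0x1200009000, 0x120000a000, 0x120000c000]
mfns = [fault_addr_to_mfn(a) for a in faults]  # 0x1200000 .. 0x120000c
```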
> > 
> > That's interesting; it seems to me that Linux is assuming that pages
> > at certain boundaries are superpages, and thus that it can just
> > increment the mfn to get the next physical page.
> 
> I'm not sure this is true. See below.
> 
> > > and every time I drop the cache and read this region, I get DMAR
> > > faults at a few random addresses in the 1200000000-120000f000 range
> > > (I guess MFNs 0x1200000-0x120000f). MFNs 0x1200000-0x12000ff are not
> > > mapped to any PFN in Dom0 (based on the xen-mfndump output).
> > 
> > It would be very interesting to figure out where those requests
> > originate, iow: which entity in Linux creates the bios with the
> > faulting address(es).
> 
> I _think_ this is related to the kernel trying to get some contiguous areas
> for the buffers used by the I/Os. As those areas are being given back after
> the I/O, they don't appear in the mfndump.
> 
> > It's a wild guess, but could you try to boot Linux with swiotlb=force
> > on the command line and attempt to trigger the issue?  I wonder
> > whether imposing the usage of the swiotlb will surface the issues as
> > CPU accesses, rather than IOMMU faults, and that could get us a trace
> > inside Linux of how those requests are generated.
> > 
> > > On the other hand, I'm not getting these DMAR faults while reading
> > > other regions. Also, I can't trigger the bug with the reversed Dom0
> > > mapping, even if I fill the page cache with reads.
> > 
> > There's possibly some condition we are missing that causes a component
> > in Linux to assume the next address is mfn + 1, instead of doing the
> > full address translation from the linear or pfn space.
> 
> My theory is:
> 
> The kernel is seeing the used buffer as a physically contiguous area,
> so it is _not_ using a scatter-gather list (it does in the debug Xen case,
> resulting in it not showing any errors). Unfortunately the buffer is not
> aligned to its size, so swiotlb-xen will remap the buffer to a suitably
> aligned one. The driver will then use the returned machine address for
> I/Os to both devices of the RAID configuration. When the first I/O is
> done, the driver is probably already calling the DMA unmap or device sync
> function, causing the intermediate contiguous region to be destroyed again
> (this is the time when the DMAR errors should show up for the 2nd I/O
> still running).
> 
> So the main issue IMHO is that a DMA buffer mapped for one device is used
> for 2 devices instead.
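If I understand the alignment part of that theory correctly, it reduces to a check like the following toy model (my own sketch of the idea, not the actual swiotlb-xen code):

```python
PAGE_SHIFT = 12  # assuming 4 KiB pages

def needs_realignment(machine_addr, size):
    """Toy model (an assumption for illustration, not real swiotlb-xen
    logic): a contiguous DMA buffer of 2^order pages has to start on a
    boundary of its own size; if it doesn't, it gets exchanged for a
    suitably aligned machine-contiguous region."""
    pages = max(1, -(-size // (1 << PAGE_SHIFT)))  # pages, rounded up
    order = (pages - 1).bit_length()               # contiguity order
    align = (1 << order) << PAGE_SHIFT             # required alignment
    return machine_addr % align != 0

# E.g. a 2 MiB buffer at machine address 0x1200001000 is contiguous but
# misaligned, so it would be replaced; once the mapping is torn down
# after the first I/O completes, in-flight writes from the second RAID
# member to that region would then fault, matching the log above.
```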

But that won't cause IOMMU faults, will it? The memory used by the
bounce buffer would still be owned by dom0 (and thus part of its IOMMU
page-tables), just probably re-written to contain different data.

Or is the swiotlb contiguous region torn down after every operation?
That would seem extremely wasteful to me; I assume the buffer is
allocated during device init and stays the same until the device is
detached.

Thanks, Roger.


