[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Troubles running Xen on Raspberry Pi 4 with 5.6.1 DomU
On Wed, 6 May 2020, Jürgen Groß wrote: > On 06.05.20 00:34, Stefano Stabellini wrote: > > + Boris, Jürgen > > > > See the crash Roman is seeing booting dom0 on the Raspberry Pi. It is > > related to the recent xen dma_ops changes. Possibly the same thing > > reported by Peng here: > > > > https://marc.info/?l=linux-kernel&m=158805976230485&w=2 > > > > An in-depth analysis below. > > > > > > On Mon, 4 May 2020, Roman Shaposhnik wrote: > > > > > [ 2.534292] Unable to handle kernel paging request at virtual > > > > > address 000000000026c340 > > > > > [ 2.542373] Mem abort info: > > > > > [ 2.545257] ESR = 0x96000004 > > > > > [ 2.548421] EC = 0x25: DABT (current EL), IL = 32 bits > > > > > [ 2.553877] SET = 0, FnV = 0 > > > > > [ 2.557023] EA = 0, S1PTW = 0 > > > > > [ 2.560297] Data abort info: > > > > > [ 2.563258] ISV = 0, ISS = 0x00000004 > > > > > [ 2.567208] CM = 0, WnR = 0 > > > > > [ 2.570294] [000000000026c340] user address but active_mm is > > > > > swapper > > > > > [ 2.576783] Internal error: Oops: 96000004 [#1] SMP > > > > > [ 2.581784] Modules linked in: > > > > > [ 2.584950] CPU: 3 PID: 135 Comm: kworker/3:1 Not tainted > > > > > 5.6.1-default #9 > > > > > [ 2.591970] Hardware name: Raspberry Pi 4 Model B (DT) > > > > > [ 2.597256] Workqueue: events deferred_probe_work_func > > > > > [ 2.602509] pstate: 60000005 (nZCv daif -PAN -UAO) > > > > > [ 2.607431] pc : xen_swiotlb_free_coherent+0x198/0x1d8 > > > > > [ 2.612696] lr : dma_free_attrs+0x98/0xd0 > > > > > [ 2.616827] sp : ffff800011db3970 > > > > > [ 2.620242] x29: ffff800011db3970 x28: 00000000000f7b00 > > > > > [ 2.625695] x27: 0000000000001000 x26: ffff000037b68410 > > > > > [ 2.631129] x25: 0000000000000001 x24: 00000000f7b00000 > > > > > [ 2.636583] x23: 0000000000000000 x22: 0000000000000000 > > > > > [ 2.642017] x21: ffff800011b0d000 x20: ffff80001179b548 > > > > > [ 2.647461] x19: ffff000037b68410 x18: 0000000000000010 > > > > > [ 2.652905] x17: ffff000035d66a00 x16: 00000000deadbeef > > > > > [ 2.658348] x15: ffffffffffffffff x14: ffff80001179b548 > > > > > [ 2.663792] x13: ffff800091db37b7 x12: ffff800011db37bf > > > > > [ 2.669236] x11: ffff8000117c7000 x10: ffff800011db3740 > > > > > [ 2.674680] x9 : 00000000ffffffd0 x8 : ffff80001071e980 > > > > > [ 2.680124] x7 : 0000000000000132 x6 : ffff80001197a6ab > > > > > [ 2.685568] x5 : 0000000000000000 x4 : 0000000000000000 > > > > > [ 2.691012] x3 : 00000000f7b00000 x2 : fffffdffffe00000 > > > > > [ 2.696465] x1 : 000000000026c340 x0 : 000002000046c340 > > > > > [ 2.701899] Call trace: > > > > > [ 2.704452] xen_swiotlb_free_coherent+0x198/0x1d8 > > > > > [ 2.709367] dma_free_attrs+0x98/0xd0 > > > > > [ 2.713143] rpi_firmware_property_list+0x1e4/0x240 > > > > > [ 2.718146] rpi_firmware_property+0x6c/0xb0 > > > > > [ 2.722535] rpi_firmware_probe+0xf0/0x1e0 > > > > > [ 2.726760] platform_drv_probe+0x50/0xa0 > > > > > [ 2.730879] really_probe+0xd8/0x438 > > > > > [ 2.734567] driver_probe_device+0xdc/0x130 > > > > > [ 2.738870] __device_attach_driver+0x88/0x108 > > > > > [ 2.743434] bus_for_each_drv+0x78/0xc8 > > > > > [ 2.747386] __device_attach+0xd4/0x158 > > > > > [ 2.751337] device_initial_probe+0x10/0x18 > > > > > [ 2.755649] bus_probe_device+0x90/0x98 > > > > > [ 2.759590] deferred_probe_work_func+0x88/0xd8 > > > > > [ 2.764244] process_one_work+0x1f0/0x3c0 > > > > > [ 2.768369] worker_thread+0x138/0x570 > > > > > [ 2.772234] kthread+0x118/0x120 > > > > > [ 2.775571] ret_from_fork+0x10/0x18 > > > > > [ 2.779263] Code: d34cfc00 f2dfbfe2 d37ae400 8b020001 (f8626800) > > > > > [ 2.785492] ---[ end trace 4c435212e349f45f ]--- > > > > > [ 2.793340] usb 1-1: New USB device found, idVendor=2109, > > > > > idProduct=3431, bcdDevice= 4.20 > > > > > [ 2.801038] usb 1-1: New USB device strings: Mfr=0, Product=1, > > > > > SerialNumber=0 > > > > > [ 2.808297] usb 1-1: Product: USB2.0 Hub > > > > > [ 2.813710] hub 1-1:1.0: USB hub found > > > > > [ 2.817117] hub 1-1:1.0: 4 ports detected > > > > > > > > > > This is bailing out right here: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/raspberrypi.c?h=v5.6.1#n125 > > > > > > > > > > FYIW (since I modified the source to actually print what was returned > > > > > right before it bails) we get: > > > > > buf[1] == 0x800000004 > > > > > buf[2] == 0x00000001 > > > > > > > > > > Status 0x800000004 is of course suspicious since it is not even listed > > > > > here: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/soc/bcm2835/raspberrypi-firmware.h#n14 > > > > > > > > > > So it appears that this DMA request path is somehow busted and it > > > > > would be really nice to figure out why. > > > > > > > > You have actually discovered a genuine bug in the recent xen dma rework > > > > in Linux. Congrats :-) > > > > > > Nice! ;-) > > > > > > > I am doing some guesswork here, but from what I read in the thread and > > > > the information in this email I think this patch might fix the issue. > > > > If it doesn't fix the issue please add a few printks in > > > > drivers/xen/swiotlb-xen.c:xen_swiotlb_free_coherent and please let me > > > > know where exactly it crashes. > > > > > > > > > > > > diff --git a/include/xen/arm/page-coherent.h > > > > b/include/xen/arm/page-coherent.h > > > > index b9cc11e887ed..ff4677ed9788 100644 > > > > --- a/include/xen/arm/page-coherent.h > > > > +++ b/include/xen/arm/page-coherent.h > > > > @@ -8,12 +8,17 @@ > > > > static inline void *xen_alloc_coherent_pages(struct device *hwdev, > > > > size_t size, > > > > dma_addr_t *dma_handle, gfp_t flags, unsigned long > > > > attrs) > > > > { > > > > + void *cpu_addr; > > > > + if (dma_alloc_from_dev_coherent(hwdev, size, dma_handle, > > > > &cpu_addr)) > > > > + return cpu_addr; > > > > return dma_direct_alloc(hwdev, size, dma_handle, flags, attrs); > > > > } > > > > > > > > static inline void xen_free_coherent_pages(struct device *hwdev, > > > > size_t size, > > > > void *cpu_addr, dma_addr_t dma_handle, unsigned long > > > > attrs) > > > > { > > > > + if (dma_release_from_dev_coherent(hwdev, get_order(size), > > > > cpu_addr)) > > > > + return; > > > > dma_direct_free(hwdev, size, cpu_addr, dma_handle, attrs); > > > > } > > > > > > Applied the patch, but it didn't help and after printk's it turns out > > > it surprisingly crashes right inside this (rather convoluted if you > > > ask me) if statement: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/xen/swiotlb-xen.c?h=v5.6.1#n349 > > > > > > So it makes sense that the patch didn't help -- we never hit that > > > xen_free_coherent_pages. > > The crash happens here: > > > > if (!WARN_ON((dev_addr + size - 1 > dma_mask) || > > range_straddles_page_boundary(phys, size)) && > > TestClearPageXenRemapped(virt_to_page(vaddr))) > > xen_destroy_contiguous_region(phys, order); > > > > I don't know exactly what is causing the crash. Is it the WARN_ON somehow? > > Is it TestClearPageXenRemapped? Neither should cause a crash in theory. > > > > > > But I do know that there are problems with that if statement on ARM. It > > can trigger for one of the following conditions: > > > > 1) dev_addr + size - 1 > dma_mask > > 2) range_straddles_page_boundary(phys, size) > > > > > > The first condition might happen after bef4d2037d214 because > > dma_direct_alloc might not respect the device dma_mask. It is actually a > > bug and I would like to keep the WARN_ON for that. The patch I sent > > yesterday (https://marc.info/?l=xen-devel&m=158865080224504) should > > solve that issue. But Roman is telling us that the crash still persists. > > > > The second condition is completely normal and not an error on ARM > > because dom0 is 1:1 mapped. It is not an issue if the address range is > > straddling a page boundary. We certainly shouldn't WARN (or crash). > > > > So, I suggest something similar to Peng's patch, appended. > > > > Roman, does it solve your problem? > > > > > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c > > index b6d27762c6f8..994ca3a4b653 100644 > > --- a/drivers/xen/swiotlb-xen.c > > +++ b/drivers/xen/swiotlb-xen.c > > @@ -346,9 +346,8 @@ xen_swiotlb_free_coherent(struct device *hwdev, size_t > > size, void *vaddr, > > /* Convert the size to actually allocated. */ > > size = 1UL << (order + XEN_PAGE_SHIFT); > > - if (!WARN_ON((dev_addr + size - 1 > dma_mask) || > > - range_straddles_page_boundary(phys, size)) && > > - TestClearPageXenRemapped(virt_to_page(vaddr))) > > + WARN_ON(dev_addr + size - 1 > dma_mask); > > + if (TestClearPageXenRemapped(virt_to_page(vaddr))) > > xen_destroy_contiguous_region(phys, order); > > This is a very bad idea for x86. Calling xen_destroy_contiguous_region() > for cases where it shouldn't be called is causing subtle bugs in memory > management, like pages being visible twice and thus resulting in weird > memory inconsistencies. I understand what you are saying but there is one thing I am missing: on x86 how could it happen that either: dev_addr + size - 1 > dma_mask or range_straddles_page_boundary(phys, size) evaluates to true? It shouldn't right? If it does, it is an unexpected error condition, is that right? If that is the case, does it even make sense to continue at all? Wouldn't it be better to do something like: #ifdef CONFIG_X86 if (WARN_ON(dev_addr + size - 1 > dma_mask) || range_straddles_page_boundary(phys, size)) { return; } #endif if (TestClearPageXenRemapped(virt_to_page(vaddr))) xen_destroy_contiguous_region(phys, order); xen_free_coherent_pages(hwdev, size, vaddr, (dma_addr_t)phys, attrs); Or do you think that xen_free_coherent_pages should actually be called anyway?
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |