
Re: Crashes under Xen with Radeon graphics card


  • To: "Deucher, Alexander" <Alexander.Deucher@xxxxxxx>, lkml <linux-kernel@xxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "amd-gfx@xxxxxxxxxxxxxxxxxxxxx" <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Fri, 15 Dec 2023 17:32:56 +0100
  • Cc: "Koenig, Christian" <Christian.Koenig@xxxxxxx>, "Pan, Xinhui" <Xinhui.Pan@xxxxxxx>
  • Delivery-date: Fri, 15 Dec 2023 16:33:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.12.23 17:19, Deucher, Alexander wrote:
[AMD Official Use Only - General]

-----Original Message-----
From: Juergen Gross <jgross@xxxxxxxx>
Sent: Friday, December 15, 2023 11:13 AM
To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; lkml <linux-kernel@xxxxxxxxxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>
Subject: Re: Crashes under Xen with Radeon graphics card

On 15.12.23 17:04, Deucher, Alexander wrote:
[Public]

-----Original Message-----
From: Juergen Gross <jgross@xxxxxxxx>

...

The crashes vary, but often the kernel accesses non-canonical
addresses or tries to map illegal physical addresses. Sometimes the
system is just hanging, either with softlockups or without any further signs
of being alive.

I can easily reproduce the problem, so any debug patches to narrow
down the problem are welcome.

Some firmware required for proper operation is still missing.  Please fix
that up.

That was the starting point, of course!

Ah, ok.  Thanks for clarifying.  What exactly happens when you get this crash?  
System hang?  Kernel oops?  Is there anything in the dmesg when it happens?

As I wrote above, the symptoms differ from case to case. The crash normally
happens within 20 seconds after the system is completely up. I had one case
where it survived about 2 minutes.

One example:

[   64.549114] BUG: unable to handle page fault for address: ffff888121291000
[   64.562850] #PF: supervisor write access in kernel mode
[   64.573352] #PF: error_code(0x0003) - permissions violation
[   64.584589] PGD 2836067 P4D 2836067 PUD 3e73f7067 PMD 3e72ed067 PTE 8010000121291025
[   64.600212] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   64.608985] CPU: 3 PID: 2090 Comm: kioslave5 Tainted: G E 6.7.0-rc5-default #974
[   64.626721] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A25 05/30/2019
[   64.641193] RIP: e030:clear_page_erms+0x7/0x10
[   64.650161] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   64.687996] RSP: e02b:ffffc9004206fb50 EFLAGS: 00010246
[   64.698378] RAX: 0000000000000000 RBX: ffffea000484a400 RCX: 0000000000001000
[   64.712780] RDX: 0000000000052dc0 RSI: 0000000000000003 RDI: ffff888121291000
[   64.727154] RBP: 0000000000000901 R08: ffffea000484a440 R09: ffffea000484a600
[   64.741491] R10: 0000000000000002 R11: 000000000000241e R12: ffff8883e7d21d80
[   64.755843] R13: 000000000028d834 R14: 0000000000000901 R15: ffffea000484a400
[   64.770207] FS:  00007f4c2b79d280(0000) GS:ffff888409380000(0000) knlGS:0000000000000000
[   64.786487] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   64.798019] CR2: ffff888121291000 CR3: 000000014fef4000 CR4: 0000000000050660
[   64.812411] Call Trace:
[   64.817308]  <TASK>
[   64.821625]  ? __die_body+0x1a/0x60
[   64.828746]  ? page_fault_oops+0x151/0x470
[   64.837065]  ? search_bpf_extables+0x65/0x70
[   64.845717]  ? fixup_exception+0x22/0x320
[   64.853844]  ? exc_page_fault+0xb3/0x150
[   64.861792]  ? asm_exc_page_fault+0x22/0x30
[   64.870275]  ? clear_page_erms+0x7/0x10
[   64.878050]  prep_new_page+0x97/0xb0
[   64.885308]  get_page_from_freelist+0x7a4/0x1f40
[   64.894678]  __alloc_pages+0x18b/0x350
[   64.902270]  ? kvmalloc_node+0x3a/0xd0
[   64.909892]  __kmalloc_large_node+0x7a/0x140
[   64.918542]  __kmalloc_node+0xc1/0x130
[   64.926149]  kvmalloc_node+0x3a/0xd0
[   64.933399]  proc_sys_call_handler+0xfa/0x230
[   64.942259]  vfs_read+0x22f/0x2e0
[   64.949007]  ksys_read+0xa5/0xe0
[   64.955527]  do_syscall_64+0x5d/0xe0
[   64.962806]  ? do_user_addr_fault+0x5b3/0x8a0
[   64.971647]  ? exc_page_fault+0x6f/0x150
[   64.979587]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   64.989821] RIP: 0033:0x7f4c29f06a3e
[   64.997098] Code: 08 e8 f4 1e 02 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[   65.034962] RSP: 002b:00007ffd5a86f2b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   65.050071] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c29f06a3e
[   65.064415] RDX: 0000000000004000 RSI: 0000000002562c18 RDI: 0000000000000004
[   65.078775] RBP: 0000000002561d60 R08: 00007f4c2abd3418 R09: 0000000000000028
[   65.093155] R10: 000000000253b010 R11: 0000000000000246 R12: 0000000000004000
[   65.107492] R13: 0000000000004000 R14: 0000000000000004 R15: 0000000002562c18
[   65.121850]  </TASK>
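
For readers following the trace, the error code and the PTE value in the
oops can be decoded by hand. The snippet below is only a minimal sketch:
the bit positions are the architectural x86-64 ones, but the helper names
(decode_bits, pte_frame) and the tables are made up for illustration and
are not taken from the kernel sources. It only restates what the log
already reports (a supervisor write to a present page whose PTE lacks the
RW bit) and says nothing about why the page is mapped read-only.

```python
# Decode the page-fault error code and PTE flags printed in the oops.
# Bit positions are the x86-64 architectural ones; all names here are
# illustrative assumptions, not kernel identifiers.

PF_ERROR_BITS = {
    0: "PROT",    # fault on a present page (permissions violation)
    1: "WRITE",   # access was a write
    2: "USER",    # access came from user mode
    3: "RSVD",    # reserved bit set in a paging entry
    4: "INSTR",   # access was an instruction fetch
}

PTE_FLAG_BITS = {
    0: "PRESENT",
    1: "RW",
    2: "USER",
    3: "PWT",
    4: "PCD",
    5: "ACCESSED",
    6: "DIRTY",
    7: "PAT",
    8: "GLOBAL",
    63: "NX",
}


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]


def pte_frame(pte):
    """Extract the frame address field (bits 12..51) of a 4 KiB PTE."""
    return pte & 0x000FFFFFFFFFF000


error_code = 0x0003            # from "#PF: error_code(0x0003)"
pte = 0x8010000121291025       # from the "PTE" field of the page-table dump

print("error code:", decode_bits(error_code, PF_ERROR_BITS))
print("PTE flags: ", decode_bits(pte, PTE_FLAG_BITS))
print("PTE frame: ", hex(pte_frame(pte)))
```

The error code decodes to a write to a present page without the USER bit,
which matches the "supervisor write access" and "permissions violation"
lines. The PTE flags contain PRESENT but not RW, so the faulting write in
clear_page_erms hit a read-only mapping. The frame field, 0x121291000,
equals the faulting address ffff888121291000 minus 0xffff888000000000,
assuming that value is the direct-map base on this system. Bit 52 of the
PTE is also set; it is not decoded here.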



BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
the patch series merging swiotlb and swiotlb-xen could be to blame, but that
series already went into v5.19, so it is unlikely to be the cause.

Can you bisect?

I can try to find the offending commit, sure. I just wanted to share my current
findings in the hope that someone might have an idea ...


Juergen



 

