[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] [BUG] kernel panics with drbd
Hi, I'm getting dom0 kernel panics, relating to moderately heavy use of drbd. I think this is a Xen bug. My Xen hosts are Debian jessie amd64 boxes, on slightly elderly Intel kit. Linux ophon 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux Linux opus 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux Both have the standard jessie versions of Xen - 4.4.1-9+deb8u1 and xen-tools - 4.5-1 I have disable_sendpage enabled for drbd: root@opus:~# cat /etc/modprobe.d/drbd.conf options drbd disable_sendpage=1 root@opus:~# cat /sys/module/drbd/parameters/disable_sendpage Y root@ophon:~# cat /etc/modprobe.d/drbd.conf options drbd disable_sendpage=1 root@ophon:~# cat /sys/module/drbd/parameters/disable_sendpage Y I have a script running on "ophon" that sets up a drbd device (itself as primary, "opus" as secondary), makes an LVM pv+vg on top of that drbd device, and then calls xen-create-image[0]. "opus" typically kernel panics shortly after xen-create-image starts. I attach the relevant bit of kern.log from one such crash to this mail - you can see the drbd operations happening a second or so before the crash. I also attach the relevant drbd .res file The bug is not 100% repeatable, but still fairly reliable (for obvious reasons, extensive testing and hard-rebooting my kit is not a very joyous prospect). I did once achieve a similar result by running drbd-overview on opus, which said kernel:[ 1127.630208] BUG: soft lockup - CPU#2 stuck for 23s! [xenstored:864] on console and then panicked much as before. The "amusing" quirk is that similar code worked a couple of weeks ago when I last tried it; that code does now also produce kernel panics AFAICT (with a not-100%-repeatable bug and long reproduction timescales 'cos of having to power-cycle etc. it's hard to be completely certain). The two hosts are part of a pacemaker cluster, and "opus" is otherwise able to run guests fine. I hope that's sufficient information; I'm happy to supply other config files etc. if necessary. Regards, Matthew [0] The code in question is in fact a python script; running on ophon, it does the following (using ssh to run commands on opus): --both hosts-- lvcreate -L 20G -nmwsig-mws-02474 guests drbdadm -- --force create-md drbdadm up mws-02474 --ophon only-- drbdadm wait-connect drbdadm new-current-uuid --clear-bitmap minor-4 drbdadm primary mws-02474 pvcreate /dev/drbd4 vgcreate mws-02474-vg /dev/drbd4 xen-create-image ... --lvm mws-02474-vg resource mws-02474 { device /dev/drbd4; disk /dev/guests/mwsig-mws-02474; meta-disk internal; on ophon { address 131.111.8.218:7792; } on opus { address 131.111.8.217:7792; } net { after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri disconnect; } } Aug 3 16:03:13 opus kernel: [ 1250.026811] drbd mws-priv-1: Starting worker thread (from drbdsetup-84 [12987]) Aug 3 16:03:13 opus kernel: [ 1250.027313] block drbd4: disk( Diskless -> Attaching ) Aug 3 16:03:13 opus kernel: [ 1250.027409] drbd mws-priv-1: Method to ensure write ordering: flush Aug 3 16:03:13 opus kernel: [ 1250.027413] block drbd4: max BIO size = 4096 Aug 3 16:03:13 opus kernel: [ 1250.027418] block drbd4: drbd_bm_resize called with capacity == 41941688 Aug 3 16:03:13 opus kernel: [ 1250.027558] block drbd4: resync bitmap: bits=5242711 words=81918 pages=160 Aug 3 16:03:13 opus kernel: [ 1250.027561] block drbd4: size = 20 GB (20970844 KB) Aug 3 16:03:13 opus kernel: [ 1250.032268] block drbd4: Writing the whole bitmap, size changed Aug 3 16:03:13 opus kernel: [ 1250.047827] block drbd4: bitmap WRITE of 160 pages took 4 jiffies Aug 3 16:03:13 opus kernel: [ 1250.061634] block drbd4: 20 GB (5242711 bits) marked out-of-sync by on disk bit-map. Aug 3 16:03:13 opus kernel: [ 1250.180186] block drbd4: bitmap READ of 160 pages took 2 jiffies Aug 3 16:03:13 opus kernel: [ 1250.180291] block drbd4: recounting of set bits took additional 0 jiffies Aug 3 16:03:13 opus kernel: [ 1250.180293] block drbd4: 20 GB (5242711 bits) marked out-of-sync by on disk bit-map. Aug 3 16:03:13 opus kernel: [ 1250.180304] block drbd4: Suspended AL updates Aug 3 16:03:13 opus kernel: [ 1250.180307] block drbd4: disk( Attaching -> Inconsistent ) Aug 3 16:03:13 opus kernel: [ 1250.180310] block drbd4: attached to UUIDs 0000000000000004:0000000000000000:0000000000000000:0000000000000000 Aug 3 16:03:13 opus kernel: [ 1250.191161] drbd mws-priv-1: conn( StandAlone -> Unconnected ) Aug 3 16:03:13 opus kernel: [ 1250.191183] drbd mws-priv-1: Starting receiver thread (from drbd_w_mws-priv [12989]) Aug 3 16:03:13 opus kernel: [ 1250.191345] drbd mws-priv-1: receiver (re)started Aug 3 16:03:13 opus kernel: [ 1250.191360] drbd mws-priv-1: conn( Unconnected -> WFConnection ) Aug 3 16:03:13 opus kernel: [ 1250.689576] drbd mws-priv-1: Handshake successful: Agreed network protocol version 101 Aug 3 16:03:13 opus kernel: [ 1250.689580] drbd mws-priv-1: Agreed to support TRIM on protocol level Aug 3 16:03:13 opus kernel: [ 1250.689616] drbd mws-priv-1: conn( WFConnection -> WFReportParams ) Aug 3 16:03:13 opus kernel: [ 1250.689631] drbd mws-priv-1: Starting asender thread (from drbd_r_mws-priv [12992]) Aug 3 16:03:13 opus kernel: [ 1250.737084] block drbd4: max BIO size = 1048576 Aug 3 16:03:13 opus kernel: [ 1250.737091] block drbd4: drbd_sync_handshake: Aug 3 16:03:13 opus kernel: [ 1250.737094] block drbd4: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:5242711 flags:0 Aug 3 16:03:13 opus kernel: [ 1250.737096] block drbd4: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:5242711 flags:0 Aug 3 16:03:13 opus kernel: [ 1250.737098] block drbd4: uuid_compare()=0 by rule 10 Aug 3 16:03:13 opus kernel: [ 1250.737100] block drbd4: No resync, but 5242711 bits in bitmap! Aug 3 16:03:13 opus kernel: [ 1250.737105] block drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent ) Aug 3 16:03:13 opus kernel: [ 1250.737109] block drbd4: Resumed AL updates Aug 3 16:03:14 opus kernel: [ 1250.773903] block drbd4: Accepted new current UUID, preparing to skip initial sync Aug 3 16:03:14 opus kernel: [ 1250.777061] block drbd4: bitmap WRITE of 160 pages took 1 jiffies Aug 3 16:03:14 opus kernel: [ 1250.788564] block drbd4: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Aug 3 16:03:14 opus kernel: [ 1250.788573] block drbd4: disk( Inconsistent -> UpToDate ) pdsk( Inconsistent -> UpToDate ) Aug 3 16:03:14 opus kernel: [ 1250.797104] block drbd4: receiver updated UUIDs to 14460554106EF79A:0000000000000000:0000000000000000:0000000000000000 Aug 3 16:03:14 opus kernel: [ 1250.797117] block drbd4: peer( Secondary -> Primary ) Aug 3 16:03:15 opus kernel: [ 1251.748952] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) Aug 3 16:03:15 opus kernel: [ 1251.748983] BUG: unable to handle kernel paging request at ffff8800022f3d88 Aug 3 16:03:15 opus kernel: [ 1251.749016] IP: [<ffff8800022f3d88>] 0xffff8800022f3d88 Aug 3 16:03:15 opus kernel: [ 1251.749041] PGD 1814067 PUD 1815067 PMD 2f81067 PTE 80100000022f3067 Aug 3 16:03:15 opus kernel: [ 1251.749082] Oops: 0011 [#1] SMP Aug 3 16:03:15 opus kernel: [ 1251.749106] Modules linked in: xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core Aug 3 16:03:15 opus kernel: [ 1251.749691] CPU: 0 PID: 12993 Comm: drbd_a_mws-priv Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2 Aug 3 16:03:15 opus kernel: [ 1251.749726] Hardware name: Intel Corporation S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012 Aug 3 16:03:15 opus kernel: [ 1251.749765] task: ffff8800171742d0 ti: ffff8800022f0000 task.ti: ffff8800022f0000 Aug 3 16:03:15 opus kernel: [ 1251.749847] RIP: e030:[<ffff8800022f3d88>] [<ffff8800022f3d88>] 0xffff8800022f3d88 Aug 3 16:03:15 opus kernel: [ 1251.749934] RSP: e02b:ffff8800022f3d90 EFLAGS: 00010212 Aug 3 16:03:15 opus kernel: [ 1251.749984] RAX: 00000000fffffffc RBX: ffffffffffffffff RCX: 0000000000000113 Aug 3 16:03:15 opus kernel: [ 1251.750039] RDX: 0000000000000113 RSI: 00000000fffffe01 RDI: ffffffff81463f75 Aug 3 16:03:15 opus kernel: [ 1251.750094] RBP: ffff8800171742d0 R08: ffff8800022f0000 R09: 0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.750150] R10: ffff88001751b890 R11: 0000000000000000 R12: 0000000000000001 Aug 3 16:03:15 opus kernel: [ 1251.750205] R13: 0000000000000000 R14: 0000000000000010 R15: ffff880016c92000 Aug 3 16:03:15 opus kernel: [ 1251.750263] FS: 00007f90c43c0740(0000) GS:ffff88001fa00000(0000) knlGS:0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.750348] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b Aug 3 16:03:15 opus kernel: [ 1251.750399] CR2: ffff8800022f3d88 CR3: 000000001073f000 CR4: 0000000000002660 Aug 3 16:03:15 opus kernel: [ 1251.750454] Stack: Aug 3 16:03:15 opus kernel: [ 1251.750493] ffff8800022f3d88 0000000000000010 0000000000000000 0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.750589] ffff8800022f3d90 0000000000000001 0000000000000000 0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.750686] 0000000000004100 ffffffffa02577be ffff880016c92080 0000001000000000 Aug 3 16:03:15 opus kernel: [ 1251.750783] Call Trace: Aug 3 16:03:15 opus kernel: [ 1251.750829] [<ffffffffa02577be>] ? drbd_asender+0x27e/0x750 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.750887] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.750947] [<ffffffffa0260d46>] ? drbd_thread_setup+0x46/0x130 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.751006] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.751065] [<ffffffff81087fad>] ? kthread+0xbd/0xe0 Aug 3 16:03:15 opus kernel: [ 1251.751114] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180 Aug 3 16:03:15 opus kernel: [ 1251.751170] [<ffffffff815114d8>] ? ret_from_fork+0x58/0x90 Aug 3 16:03:15 opus kernel: [ 1251.751221] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180 Aug 3 16:03:15 opus kernel: [ 1251.751274] Code: ff ff ff ff ff ff ff ff ff ff ff 88 3d 2f 02 00 88 ff ff 30 e0 00 00 00 00 00 00 12 02 01 00 00 00 00 00 90 3d 2f 02 00 88 ff ff <2b> e0 00 00 00 00 00 00 88 3d 2f 02 00 88 ff ff 10 00 00 00 00 Aug 3 16:03:15 opus kernel: [ 1251.751694] RIP [<ffff8800022f3d88>] 0xffff8800022f3d88 Aug 3 16:03:15 opus kernel: [ 1251.751747] RSP <ffff8800022f3d90> Aug 3 16:03:15 opus kernel: [ 1251.751790] CR2: ffff8800022f3d88 Aug 3 16:03:15 opus kernel: [ 1251.752128] ---[ end trace 975e04f66c2d9004 ]--- Aug 3 16:03:15 opus kernel: [ 1251.835012] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 Aug 3 16:03:15 opus kernel: [ 1251.835235] IP: [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.835408] PGD 0 Aug 3 16:03:15 opus kernel: [ 1251.835531] Oops: 0002 [#2] SMP Aug 3 16:03:15 opus kernel: [ 1251.835704] Modules linked in: xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd ehci_hcd scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core Aug 3 16:03:15 opus kernel: [ 1251.840379] CPU: 0 PID: 12992 Comm: drbd_r_mws-priv Tainted: G D 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2 Aug 3 16:03:15 opus kernel: [ 1251.840515] Hardware name: Intel Corporation S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012 Aug 3 16:03:15 opus kernel: [ 1251.840648] task: ffff880016c1a050 ti: ffff8800173e4000 task.ti: ffff8800173e4000 Aug 3 16:03:15 opus kernel: [ 1251.840771] RIP: e030:[<ffffffffa02453bd>] [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.840951] RSP: e02b:ffff8800173e7ce0 EFLAGS: 00010097 Aug 3 16:03:15 opus kernel: [ 1251.841041] RAX: 0000000000000000 RBX: ffff880017439700 RCX: 000000000000009c Aug 3 16:03:15 opus kernel: [ 1251.841137] RDX: 0000000000000000 RSI: ffff88000c850200 RDI: ffff88000c85bed0 Aug 3 16:03:15 opus kernel: [ 1251.841233] RBP: ffff88000cb22800 R08: 0000000000000cce R09: ffff88000c850200 Aug 3 16:03:15 opus kernel: [ 1251.841329] R10: 0000000000007ff0 R11: 0000000000000000 R12: ffff880002b676a0 Aug 3 16:03:15 opus kernel: [ 1251.841424] R13: ffff88001f8463b0 R14: ffff88000cb22bb0 R15: ffff88000cb22800 Aug 3 16:03:15 opus kernel: [ 1251.841523] FS: 00007f90c43c0740(0000) GS:ffff88001fa00000(0000) knlGS:0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.841648] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b Aug 3 16:03:15 opus kernel: [ 1251.841739] CR2: 0000000000000008 CR3: 000000001073f000 CR4: 0000000000002660 Aug 3 16:03:15 opus kernel: [ 1251.841835] Stack: Aug 3 16:03:15 opus kernel: [ 1251.841913] 0000000000000000 0000000000030003 ffff8800174397b8 0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.842216] 0000000000000000 0000000000102800 0000000000400000 0000000000104800 Aug 3 16:03:15 opus kernel: [ 1251.842518] 0000000000000000 0000000000102800 0000000000000000 0000000000000000 Aug 3 16:03:15 opus kernel: [ 1251.842820] Call Trace: Aug 3 16:03:15 opus kernel: [ 1251.842906] [<ffffffffa0254bb5>] ? drbd_submit_peer_request+0x85/0x330 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843033] [<ffffffffa02556ea>] ? receive_Data+0x36a/0xe40 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843130] [<ffffffffa0257407>] ? drbd_receiver+0x117/0x250 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843228] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843328] [<ffffffffa0260d46>] ? drbd_thread_setup+0x46/0x130 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843427] [<ffffffffa0260d00>] ? drbd_destroy_connection+0xc0/0xc0 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.843524] [<ffffffff81087fad>] ? kthread+0xbd/0xe0 Aug 3 16:03:15 opus kernel: [ 1251.843614] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180 Aug 3 16:03:15 opus kernel: [ 1251.843709] [<ffffffff815114d8>] ? ret_from_fork+0x58/0x90 Aug 3 16:03:15 opus kernel: [ 1251.843801] [<ffffffff81087ef0>] ? kthread_create_on_node+0x180/0x180 Aug 3 16:03:15 opus kernel: [ 1251.843894] Code: 04 48 8b 45 00 48 8d b8 d0 00 00 00 e8 dd bc 2c e1 8b 53 58 49 89 c1 c1 ea 09 01 95 54 02 00 00 49 83 fd ff 48 8b 13 48 8b 43 08 <48> 89 42 08 48 89 10 48 8d 85 c0 03 00 00 48 8b 95 c8 03 00 00 Aug 3 16:03:15 opus kernel: [ 1251.846996] RIP [<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd] Aug 3 16:03:15 opus kernel: [ 1251.847168] RSP <ffff8800173e7ce0> Aug 3 16:03:15 opus kernel: [ 1251.847252] CR2: 0000000000000008 Aug 3 16:03:15 opus kernel: [ 1251.847335] ---[ end trace 975e04f66c2d9005 ]--- _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |