
Re: [Xen-devel] [BUG] kernel panics with drbd

On Tue, 2015-08-04 at 14:52 +0100, Matthew Vernon wrote:
> Hi,


> I'm getting dom0 kernel panics, relating to moderately heavy use of
> drbd. I think this is a Xen bug.

This looks remarkably similar to the oops described at
http://blog.chinewalking.com/drbd-kernel-oops-w-trim/
Are you using TRIM (discard) on these devices?
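
If it helps, one quick way to see whether a block device advertises
discard support is to read its discard_max_bytes queue attribute in
sysfs; 0 there means no discard. A minimal sketch only: the helper name
supports_discard and its sysfs_root parameter (there so it can be pointed
at a test directory) are made up for illustration, "sda" is just an
example backing disk, and drbd4 is taken from the /dev/drbd4 in your
report.

```python
from pathlib import Path


def supports_discard(device, sysfs_root="/sys/block"):
    """Return True if <sysfs_root>/<device>/queue/discard_max_bytes is non-zero.

    Returns None when the attribute cannot be read (for example, no such device).
    """
    attr = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attr.read_text().strip()) > 0
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # "sda" is an assumed example name; substitute the real backing disk.
    for dev in ("sda", "drbd4"):
        print(dev, supports_discard(dev))
```

Checking both the backing disk and the drbd device on each host would
show whether discards can reach drbd at all.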


> My Xen hosts are Debian jessie amd64 boxes, on slightly elderly Intel
> kit. 
> Linux ophon 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux
> Linux opus 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux
> Both have the standard jessie versions of Xen - 4.4.1-9+deb8u1 and
> xen-tools - 4.5-1
> I have disable_sendpage enabled for drbd:
> root@opus:~# cat /etc/modprobe.d/drbd.conf 
> options drbd disable_sendpage=1
> root@opus:~# cat /sys/module/drbd/parameters/disable_sendpage 
> Y
> root@ophon:~# cat /etc/modprobe.d/drbd.conf 
> options drbd disable_sendpage=1
> root@ophon:~# cat /sys/module/drbd/parameters/disable_sendpage 
> Y
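
As an aside, the same check (configured option versus the value the
loaded module reports) could be scripted if more hosts need comparing.
This is a rough sketch under my own assumptions: the function names are
hypothetical, the parser only handles simple "options <module> key=value"
lines and treats anything after a "#" as a comment, and it relies on
kernel bool parameters reading back as Y or N, as in your output.

```python
def parse_module_options(conf_text, module):
    """Collect key=value pairs from 'options <module> ...' lines."""
    opts = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # simplification: drop trailing comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "options" and fields[1] == module:
            for field in fields[2:]:
                key, _, value = field.partition("=")
                opts[key] = value
    return opts


def bool_param_enabled(sysfs_value):
    """Kernel bool module parameters read back as 'Y' or 'N' in sysfs."""
    return sysfs_value.strip() == "Y"


def sendpage_disabled_consistently(conf_text, sysfs_value):
    """True if the config asks for disable_sendpage and the module reports it."""
    configured = parse_module_options(conf_text, "drbd").get("disable_sendpage")
    return configured in ("1", "y", "Y") and bool_param_enabled(sysfs_value)
```

Fed the contents of /etc/modprobe.d/drbd.conf and
/sys/module/drbd/parameters/disable_sendpage from each host, it should
return True for both opus and ophon given what you pasted.
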
> I have a script running on "ophon" that sets up a drbd device (itself
> as primary, "opus" as secondary), makes an LVM pv+vg on top of that
> drbd device, and then calls xen-create-image[0]. "opus" typically kernel
> panics shortly after xen-create-image starts.
> I attach the relevant bit of kern.log from one such crash to this mail
> - you can see the drbd operations happening a second or so before the
> crash. I also attach the relevant drbd .res file.
> The bug is not 100% repeatable, but still fairly reliable (for obvious
> reasons, extensive testing and hard-rebooting my kit is not a very
> joyous prospect). I did once achieve a similar result by running
> drbd-overview on opus, which said
> kernel:[ 1127.630208] BUG: soft lockup - CPU#2 stuck for 23s! [xenstored:864]
> on console and then panicked much as before.
> The "amusing" quirk is that similar code worked a couple of weeks ago
> when I last tried it; that code does now also produce kernel panics
> AFAICT (with a not-100%-repeatable bug and long reproduction
> timescales 'cos of having to power-cycle etc. it's hard to be
> completely certain).
> The two hosts are part of a pacemaker cluster, and "opus" is otherwise
> able to run guests fine.
> I hope that's sufficient information; I'm happy to supply other config
> files etc. if necessary.
> Regards,
> Matthew
> [0] The code in question is in fact a python script; running on ophon,
> it does the following (using ssh to run commands on opus):
> --both hosts--
> lvcreate -L 20G -nmwsig-mws-02474 guests
> drbdadm -- --force create-md
> drbdadm up mws-02474
> --ophon only--
> drbdadm wait-connect
> drbdadm new-current-uuid --clear-bitmap minor-4
> drbdadm primary mws-02474
> pvcreate /dev/drbd4
> vgcreate mws-02474-vg /dev/drbd4
> xen-create-image ... --lvm mws-02474-vg
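
For anyone else trying to reproduce this, the sequence in [0] could be
replayed roughly as below. This is only a sketch and not the actual
script: build_steps and the dry-run printing are hypothetical, running
each shared command on ophon first and then on opus via ssh is my
assumption about the ordering, and the commands are copied exactly as
quoted (including the places where no resource argument is shown and the
"..." in the xen-create-image line), so it prints the commands rather
than executing them.

```python
import shlex

# Commands copied verbatim from the report; arguments elided there stay elided.
BOTH_HOSTS = [
    "lvcreate -L 20G -nmwsig-mws-02474 guests",
    "drbdadm -- --force create-md",
    "drbdadm up mws-02474",
]
OPHON_ONLY = [
    "drbdadm wait-connect",
    "drbdadm new-current-uuid --clear-bitmap minor-4",
    "drbdadm primary mws-02474",
    "pvcreate /dev/drbd4",
    "vgcreate mws-02474-vg /dev/drbd4",
    "xen-create-image ... --lvm mws-02474-vg",
]


def build_steps(local="ophon", remote="opus"):
    """Return (host, argv) pairs in the order described in the report."""
    steps = []
    for cmd in BOTH_HOSTS:
        argv = shlex.split(cmd)
        steps.append((local, argv))
        # The report says ssh is used to run commands on the peer.
        steps.append((remote, ["ssh", remote] + argv))
    for cmd in OPHON_ONLY:
        steps.append((local, shlex.split(cmd)))
    return steps


if __name__ == "__main__":
    # Dry run: show what would be executed, and on which host.
    for host, argv in build_steps():
        print(f"[{host}] {' '.join(shlex.quote(a) for a in argv)}")
```

Keeping it as a dry run seems sensible given that the real thing tends
to take opus down.
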
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
