[Xen-users] [XCP] ext3 crashes and slowdowns

Hi Folks.

I have two Intel boxes (Intel server S5520UR, 2x E5520, 32GB RAM, SATA HW-RAID,
BBU) running as an XCP-0.5 pool. Each runs an OpenFiler-2.3 domU; the two are
clustered active/passive. Data storage is provided as SCSISR (without an LVM
layer, like an HBASR) to OpenFiler. Shared storage is provided as an iSCSI
target by OpenFiler via clusterIP (storage frontend network), replication is
done by DRBD (storage backend network), and HA is done by heartbeat (heartbeat
network). All networks are built on top of redundant HP gigabit switches: 2
pairs of Intel gigabit NICs, each pair bonded and plugged into the same switch,
both bonds multipathed (active/passive multipathing, patched
OpenVSwitch-1.1.2p1) via the two switches, which are linked together with 2
ports each.

The XCP pool works, iSCSI works, replication works, HA works.

If filer 1 (running on server 1) is active, I can install and run domUs on
server 2 without problems, but I cannot install or run domUs on server 1.

If I switch to filer 2 (on server 2) as the active one, the running but
stalled domUs on server 1 come back to life, and the running domUs on filer
2's server stall in turn.

# dd if=/dev/zero of=/tmp/test bs=512M count=1 oflag=direct
shows a rate of 0.8 - 1.2 MB/sec.

The kernel shows traces like

INFO: task syslogd:1081 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff880001003460     0  1081      1          1084  1073 
(NOTLB)
 ffff8800367edd88  0000000000000286  ffff8800367edd98  ffffffff80262dd3 
 0000000000000009  ffff88003fb007a0  ffffffff804f4b80  0000000000000d5b 
 ffff88003fb00988  0000000000006d06 
Call Trace:
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff8029c60a>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8023138e>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff802d2ff1>] do_readv_writev+0x26e/0x291
 [<ffffffff802e555b>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80252276>] do_fsync+0x52/0xa4
 [<ffffffff802d37f5>] __do_fsync+0x23/0x36
 [<ffffffff802602f9>] tracesys+0xab/0xb6
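
To see which tasks hang most often, messages of this form can be tallied from
a saved kernel log. The following is only a minimal sketch: the function name
hung_tasks is made up for illustration, and it assumes the log text arrives on
stdin in the "INFO: task NAME:PID blocked for more than N seconds." format
shown above.

```shell
# hung_tasks: hypothetical helper, not part of any existing tool.
# Reads kernel log text on stdin and prints "COUNT TASKNAME" for every task
# reported by the hung-task detector, most frequent first.
hung_tasks() {
    awk '
        /INFO: task .* blocked for more than [0-9]+ seconds/ {
            for (i = 1; i < NF; i++) {
                if ($i == "task") {
                    name = $(i + 1)
                    # The field has the form NAME:PID; drop the trailing PID.
                    sub(/:[0-9]+$/, "", name)
                    count[name]++
                    break
                }
            }
        }
        END {
            for (name in count) print count[name], name
        }
    ' | sort -k1,1nr -k2,2
}
```

Typical use would be something like `dmesg | hung_tasks` inside the affected
domU. Task names containing spaces are not handled by this sketch.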


iscsiadm shows no errors.

# iscsiadm -m session -r 1 -s
Stats for session [sid: 1, target: 
iqn.2006-01.com.openfiler:tsn.26336ef50fe0:storage1_osimages, portal: 
172.16.0.2,3260]
iSCSI SNMP:
        txdata_octets: 486181549212
        rxdata_octets: 2622687792
        noptx_pdus: 0
        scsicmd_pdus: 15184105
        tmfcmd_pdus: 0
        login_pdus: 0
        text_pdus: 0
        dataout_pdus: 195910
        logout_pdus: 0
        snack_pdus: 0
        noprx_pdus: 0
        scsirsp_pdus: 15184088
        tmfrsp_pdus: 0
        textrsp_pdus: 0
        datain_pdus: 87898
        logoutrsp_pdus: 0
        r2t_pdus: 151200
        async_pdus: 0
        rjt_pdus: 0
        digest_err: 0
        timeout_err: 0
iSCSI Extended:
        tx_sendpage_failures: 0
        rx_discontiguous_hdr: 0
        eh_abort_cnt: 0
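
One detail in these counters: scsicmd_pdus (15184105) is 17 higher than
scsirsp_pdus (15184088). That may simply be commands that were in flight when
the statistics were read, so it is something to watch over time rather than
proof of a fault. A small sketch for tracking the difference, assuming the
output format shown above; the function name iscsi_outstanding is made up for
illustration:

```shell
# iscsi_outstanding: hypothetical helper, not part of open-iscsi.
# Reads the output of "iscsiadm -m session -r SID -s" on stdin and prints
# scsicmd_pdus minus scsirsp_pdus, i.e. SCSI commands sent for which no
# response PDU has been counted yet.
iscsi_outstanding() {
    awk '
        $1 == "scsicmd_pdus:" { cmd = $2 }
        $1 == "scsirsp_pdus:" { rsp = $2 }
        END { print cmd - rsp }
    '
}
```

For example, `iscsiadm -m session -r 1 -s | iscsi_outstanding` could be run
repeatedly while a domU is stalled to see whether the number keeps growing.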

If I reboot a domU after it has come back to life, in most cases the ext3
journal is corrupt, and the kernel panics after one more reboot.

If I try to install a PV domain (CentOS-5.5), the installer asks whether I
wish to initialize the disk xvda, but when the disk partitioning and layout
questions appear, the disk is missing from the list. There is nothing but a
question mark.
Sometimes the disk is in the list. If so, I can install the OS and all seems
fine, but after the second reboot the ext3 journal is missing, and the kernel
panics after the third reboot: the rootfs is gone.


Does anyone have any ideas? I have run out.

Thanks
Christian

Some kernel log output from the domU; there is nothing in the dom0 log.

EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743295
Aborting journal on device dm-0.
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743296
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743297
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743298
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743299
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743300
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743301
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743302
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743303
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743304
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for 
block 743305
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_truncate: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
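
The "bit already cleared" errors above cover 11 consecutive blocks, 743295
through 743305. To summarize such a run from a longer log, something like the
following sketch could be used. The function name ext3_cleared_blocks is made
up for illustration, and the sketch assumes the message format shown above,
including the case where the mail client has wrapped "block N" onto the next
line.

```shell
# ext3_cleared_blocks: hypothetical helper, not part of e2fsprogs.
# Reads kernel log text on stdin and prints "COUNT MIN MAX" for the block
# numbers named in "bit already cleared for block N" errors, or "0" if
# there are none.
ext3_cleared_blocks() {
    awk '
        /bit already cleared/ { pending = 1 }
        pending {
            # Look for "block N" on this line or, if wrapped, the next one.
            for (i = 1; i < NF; i++) {
                if ($i == "block" && $(i + 1) ~ /^[0-9]+$/) {
                    b = $(i + 1) + 0
                    if (n == 0 || b < min) min = b
                    if (n == 0 || b > max) max = b
                    n++
                    pending = 0
                    break
                }
            }
        }
        END {
            if (n > 0) print n, min, max
            else print 0
        }
    '
}
```

A contiguous range (MAX - MIN + 1 equal to COUNT) would suggest one damaged
region of the block bitmap rather than scattered corruption, though that is
only a reading of the log and not a diagnosis.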



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users