Hello,

I have the following problem on our server, which runs W2K3SRVx64 DomUs on Debian Etch under Xen 3.1.0.

The storage configuration is as follows:
omega:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdc2[0] sde2[2](S) sdd2[1]
      488287552 blocks [2/2] [UU]
md2 : active raid1 sdc1[0] sde1[2](S) sdd1[1]
      96256 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      488287552 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]
The arrays md1-md3 are used in the volume group for the LVM-managed logical volumes, which serve as block devices for the virtual instances.
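For reference, the layout is roughly like the sketch below; the VG/LV names and sizes are only examples, not the real ones:

pvcreate /dev/md1 /dev/md2 /dev/md3          # md arrays become PVs
vgcreate vg0 /dev/md1 /dev/md2 /dev/md3      # one volume group on top
lvcreate -L 40G -n w2k3-disk0 vg0            # one LV per virtual disk

# relevant line from the DomU config file (again, example names):
# disk = [ 'phy:/dev/vg0/w2k3-disk0,hda,w' ]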
Today I ran into a problem with one physical disk in the RAID array, which caused one virtual domain to crash. The output from kern.log looks like this:
Nov 15 14:14:19 omega kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Nov 15 14:14:19 omega kernel: sdb: Current: sense key: Medium Error
Nov 15 14:14:19 omega kernel: Additional sense: Unrecovered read error
Nov 15 14:14:19 omega kernel: Info fld=0x12832f4d
Nov 15 14:14:19 omega kernel: end_request: I/O error, dev sdb, sector 310587213
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394432
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394440
Nov 15 14:14:24 omega kernel: raid1: sda2: redirecting sector 310394432 to another mirror
Nov 15 14:14:28 omega kernel: raid1: sda2: redirecting sector 310394440 to another mirror
Nov 15 14:14:28 omega kernel: qemu-dm[6305]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041000ca8 error 14
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
Nov 15 14:14:28 omega kernel: device tap0 left promiscuous mode
Nov 15 14:14:28 omega kernel: audit(1195132468.260:16): dev=tap0 prom=0 old_prom=256 auid=4294967295
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
The question is: even if the disk /dev/sdb were failing, why did the virtual instance die with a segfault? There is nothing logged about this problem in xend.log...
The instance was still reported by xm list and xm top, but it used 0 CPU time, it could not be reached over VNC, and it did not answer to ping... After xm shutdown it took some time, but the domain could eventually be destroyed... After xm create the instance continued to work as usual...
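For completeness, the recovery went roughly like this (the domain name and config path are just placeholders):

xm list                          # domain still listed, 0 CPU time
xm shutdown w2k3-srv             # took a while before the domain was gone
xm create /etc/xen/w2k3-srv.cfg  # instance came back and worked normally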
What can I do to make this run more stably? I think there may have been a read timeout on the failing device, and the instance segfaulted before the RAID subsystem could fetch the data from the mirror: the first medium error is logged at 14:14:19, the reads are only redirected to the other mirror at 14:14:24 and 14:14:28, and qemu-dm segfaults in the same second as the second redirect, so the I/O was stalled for roughly 5-9 seconds. I always thought a virtual instance should survive such a problem when running from an md device...
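Assuming sdb really is dying, the workaround I am considering is to fail it out of the arrays proactively, so that a slow retry on that disk cannot stall DomU I/O again; roughly (partition names as in my mdstat above):

smartctl -a /dev/sdb             # confirm the medium errors first
mdadm /dev/md1 --fail /dev/sdb2  # kick the suspect mirror half out
mdadm /dev/md1 --remove /dev/sdb2
mdadm /dev/md0 --fail /dev/sdb1  # same for the small array on sdb1
mdadm /dev/md0 --remove /dev/sdb1

But I would still like to understand why the segfault happens at all.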
Any help or advice is appreciated...
With best regards,
Archie