I've been having terrible problems with ocfs2 getting corrupted. (Of
course this is after I said on this list a couple months ago that I've
been using it for a while without issues!)
I have two sets of SLES11 servers, and each set shares its own ocfs2
volume. I started having problems with the original set of servers and
opened a ticket with Novell. They wanted me to completely update the
systems. Since they were running critical VMs I didn't feel comfortable
doing that, so I installed two more servers with their own ocfs2 volume.
These two I then patched completely.
Unfortunately, these two servers started exhibiting their own
corruption problems. Just copying my virtual disk files and running some
VMs would cause the ocfs2 to get corrupted. Right now it's at a point
where I can't even fix it with fsck.ocfs2. I'm told the ticket has been
escalated to the ocfs2 devs.
Earlier this week I had a problem with one of the original servers and
had to hard restart it. This is a problem I've always had with Xen after
it runs for a long time: sometimes it will have memory allocation
issues, can't start VMs, etc. Worse yet, there's no way to restart it
nicely, because the VMs will not shut down and you can't get on the
console or ssh in to shut down the server cleanly. The only option (that
I know of) is to hard reset the box.
Of course this can have side effects. In this case everything came back
up ok, but I could see there was corruption. I asked Novell and they
said I should unmount the volumes, run fsck.ocfs2 and make sure it's
clean, then restart everything. This was on Monday, and since my
critical machines were up and running, I couldn't afford to have them
down right then.
So I decided that 2 AM this morning was a good time to take these systems
down, run the fsck and then get them back up. I thought this would be fairly
simple, take 30-60 mins, and get things stable for a while longer while
we work on the ocfs2 issue with Novell.
Unfortunately, after running fsck.ocfs2 and making sure it was clean, my
VMs would not all come back up. I could get 4 or 5 of them up, but not
the rest. After an unhealthy and very stressful investigation I found
that the ocfs2 volume is going read-only.
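For anyone who wants to check for the same symptom, here is a minimal
sketch in Python that lists ocfs2 mounts the kernel currently reports as
read-only. It assumes a Linux host where /proc/mounts is available; the
function name read_only_mounts is just made up for this example, and it
only detects the read-only state, it does not say why it happened.

```python
def read_only_mounts(mounts_text, fstype="ocfs2"):
    """Return mount points of `fstype` whose mount options include 'ro'.

    `mounts_text` is expected to be in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs, options = fields[:4]
        if fs == fstype and "ro" in options.split(","):
            result.append(mount_point)
    return result


if __name__ == "__main__":
    # Assumption: running on the Xen host itself, where /proc/mounts exists.
    with open("/proc/mounts") as f:
        for mount_point in read_only_mounts(f.read()):
            print("read-only:", mount_point)
```

Running it from cron or by hand before starting VMs would at least show
whether the volume has already flipped to read-only.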
I'm waiting for a call back from Novell right now. It seems that once my
ocfs2 volume gets corrupted there's no way to fix it or make it stable
again.
Our storage is on a Xiotech Magnitude 4000 3D. Each xen server is
assigned the same vdisk that is used for ocfs2. We use file based disks
for our VMs. Performance wise this does the job for us. It makes them
very easy to move around, copy for new VMs, etc.
What other options should I look at besides ocfs2? I also have a call in
to our Xiotech admin to create a new disk for each server that I can
assign directly to it, so I can copy my VMs over and get them up and
running, just in case Novell is not able to get a resolution for me. I'm
confident that the VMs will be stable once they are running on "local"
storage.
Sorry this got so long, but I don't think I can take much more stress
around the stability of my Xen servers. I've also looked at XenServer,
which seems to be really stable and has nice features, but you also lose
a lot of portability. It's hard for me to explain, but on SLES/Xen it's
incredibly easy to create SLES VMs. It's also nice to be able to mount
disk files if needed, copy them, etc.
If anyone gets this far into the message I'd appreciate any
suggestions.
Thanks a lot,
James
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users