On Sat, Jan 21, 2006 at 02:19:03PM -0500, Matt Ayres wrote:
> I have noticed my most major issue with putting xend into full
> production is with many xm commands being issued it hangs and only
> starts working (sometimes) after a "service xend restart". I created a
> bug a long time ago for this and have attached 3 different sets of logs
> using xen-bugtool. This happens to most servers after running for 3-4
> days. Those that have little activity on the xend daemon (older servers
> that were upgraded) can go 2 weeks+ at this point. Once Xen gets to
> this state even restarting xend so the list command (and others) work,
> running "xm shutdown -a" will guarantee an internal server error from
> Error: Error connecting to xend: Connection refused. Is xend running?
> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465

In the /var/log/xend-debug.log for both of your bugs, #465 and #486, you can
see the message "error: can't start new thread". That's going to be fatal --
there's no way that Xend can proceed if it cannot create new threads.

This points to a resource leak on the machine -- either you are leaking
threads or processes locally to Xend or globally to your machine, which would
show up on ps ax, or you are out of memory, which would show up in free or top
(press M to sort by memory usage). Possibly, this could be a manifestation of
a file descriptor leak, which would show up in lsof.

Could you try and track down the leak? This would give us a much better clue
as to where to look.
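
If it helps, here is a rough sketch of a script for watching those three
counters over time for a single process. It is only an illustration, not
something that ships with Xen: it assumes a Linux /proc that exposes the
"Threads" and "VmRSS" fields in /proc/<pid>/status, it assumes a current
Python 3, and the function names (parse_status, sample, grew, watch) are made
up for this example. Reading another process's fd directory normally needs
root.

```python
import os
import time


def parse_status(text):
    """Parse the text of /proc/<pid>/status into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def sample(pid, proc_root="/proc"):
    """Take one snapshot of thread count, resident memory and open fds."""
    base = os.path.join(proc_root, str(pid))
    with open(os.path.join(base, "status")) as f:
        status = parse_status(f.read())
    return {
        # Fields assumed present; a missing one is reported as 0.
        "threads": int(status.get("Threads", "0")),
        "rss_kb": int(status.get("VmRSS", "0 kB").split()[0]),
        "open_fds": len(os.listdir(os.path.join(base, "fd"))),
    }


def grew(samples, key):
    """True if the counter is higher in the last sample than in the first."""
    return len(samples) >= 2 and samples[-1][key] > samples[0][key]


def watch(pid, interval=60, count=10):
    """Collect `count` snapshots `interval` seconds apart, printing each."""
    samples = []
    for i in range(count):
        snap = sample(pid)
        samples.append(snap)
        print(time.strftime("%H:%M:%S"), snap)
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Something like watch(<pid of xend>, interval=600, count=100) left running
would show which of the three counters climbs before the hang; grew(samples,
"threads") and friends then give a crude yes/no answer. A counter that only
ever rises is the one to chase with ps, top or lsof.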
> I've also run into this once:
> Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
> vm20 xenstored: xenstored corruption: connection id -1: err No such file
> or directory: No child '(null)' found

If you get this, all bets are off. There is no way that the system as it
stands will recover gracefully if the store is corrupted. At best, you'll
just lose configuration data regarding the running VMs -- at worst, the
corruption could persist indefinitely, and you'll be unable to do anything.

Do you have xen-unstable changeset 8269:ac3ceb2d37d1 aka xen-3.0-testing
changeset 8250:1e3d31952015? This fixes the only xenstore corruption bug that
I know of, and if you've got that fix, then it's definitely a new bug. In
that case, we would appreciate it if you could either find a test case that
takes less than a few days to trigger this bug, or get your hands dirty
yourself and put some tracing and assertions into Xenstored around the TDB
manipulations to try and catch the corruption.

The corrupted TDB file itself might be useful to someone. Could you save
that, too?

As far as I'm aware, you are the only person who's ever seen this message, so
tracking it down without your help is going to be impossible. Is there
anything strange about your setup? Any network block devices or NFS involved,
any quotas on your filesystems, or SELinux? Any patches that you've applied,
non-standard kernel options, anything like that?