
Re: [Xen-devel] xenbus and the message of doom



On Fri, 2011-12-16 at 09:18 +0000, Stefan Bader wrote:
> On 15.12.2011 21:53, Ian Campbell wrote:
> > On Thu, 2011-12-15 at 19:20 +0000, Stefan Bader wrote:
> >> I was investigating a bug report[1] about newer kernels (>3.1) not booting
> >> as HVM guests on Amazon EC2. For some reason git bisect did give me some
> >> pain, but it led me at least close, and with some crash dump data I think
> >> I figured out the problem.
> >>
> >> commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
> >> Author: Olaf Hering <olaf@xxxxxxxxx>
> >> Date:   Thu Sep 22 16:14:49 2011 +0200
> >>
> >>     xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old
> >>     kernel
> >>
> >> This change introduced a xs_reset_watches() call. The problem seems to be
> >> that there is at least some version of Xen (I was able to reproduce with a
> >> 3.4.3 version which I admit to deliberately not having updated) for which
> >> xenstore will not return any reply.
> >>
> >> At least the backtraces in crash showed that xs_init had been calling
> >> xs_reset_watches() and that was happily idling in read_reply(). Effectively
> >> nothing was going on and the boot just hung.
> >> By just not doing that xs_reset_watches() call, I was able to boot under
> >> the same host. And for what it is worth there has not been an issue with
> >> Xen 4.1.1 and a 3.0 dom0 kernel. Just this "older" release is trouble.
> > 
> > I sent a patch to fix exactly this issue in oxenstored (the ocaml
> > xenstore) just this week. Is there any chance that you are running C
> > xenstored with Xen 4.1.1 and oxenstored with Xen 3.4.3?
> 
> Thanks for the pointer, I missed that thread. Now dumb question, would
> oxenstored be named that way? Or iow, how do I quickly find out what is
> running?

The process name will be oxenstored instead of xenstored.

> The binary running in 3.4.3 is xenstored which is a linked executable (same in
> 4.1.1).

That sounds like you are running the C xenstored in both cases so this is
a red herring.

> But I guess, whatever version is running, any oxenstored would not have the
> bugfix because things take longer to reach any packaged versions.

Correct.

> I rather would suspect that in 4.1.1, the reset watches message probably is
> just known and thus avoiding the problem. Unfortunately it is near impossible
> to tell for sure what exactly EC2 is running.
> 
> The major point here probably is that when the upstream kernels are calling
> that message and there are versions of xenstored in production that will just
> ignore it while the kernel blocks waiting, this is a painful path. Production
> systems tend to update slowly and the symptoms are not that obvious.

Yes. It is unfortunate that xenstored in the field has this bug but it
does mean that the approach taken here with this new message cannot
work.

FWIW there weren't all that many changes to C xenstored between 3.4.x
and current unstable so it wouldn't be hard to identify where the fix
is, but that doesn't really help people stuck with an older xenstored.

> Having a timeout
> maybe could be useful not only for this case, but clearly it is nothing that
> should be rushed.

A timeout may not help, it depends what the daemon does after the
invalid message.

It looks as if oxenstored just throws it away and will process the next
message fine so that is OK but I didn't check the C version.
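
To make the timeout idea concrete, here is a rough user-space sketch in
Python (not kernel code). The names read_reply and ReplyTimeout are made up
for illustration, and a queue.Queue stands in for the xenbus reply list. It
only shows the waiting side; as said above, whether the next request then
succeeds depends on what the daemon did with the message it did not
understand.

```python
import queue


class ReplyTimeout(Exception):
    """Raised when no reply arrives before the deadline (hypothetical name)."""


def read_reply(replies, timeout_s):
    """Wait for one reply, but give up after timeout_s seconds.

    `replies` is a queue.Queue standing in for the reply list that the
    real read_reply() sleeps on; this is a model, not the kernel function.
    """
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # Report the failure instead of idling forever.
        raise ReplyTimeout("no xenstore reply within %.2fs" % timeout_s) from None
```

A caller would catch ReplyTimeout, log something more helpful than a silent
hang, and carry on without the watch reset.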

I think (or hope!) that Olaf tested with the most recent version of C
xenstored which did not support this new message and so I presume that
it correctly returns an error for an unknown message but that doesn't help
us with the older C xenstored which you have.

In any case someone really needs to check both the ocaml and the C
version's behaviour.

> So reverting the patch introducing that call (at least in the distro kernel)
> may be the best thing to do (knowing that this will be bought by losing the
> fix for kexec boots for crash kernels).

Agreed. We should revert the kernel change for now and revisit it.

One potential solution, depending on the actual behaviour of the
daemons, would be to follow the potentially unsupported command with an
innocuous well established one and use the ID field to identify which we
get a response to.
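
As a rough illustration of the req_id idea, here is a small Python model
(again not kernel code). The header layout (type, req_id, tx_id, len as four
32-bit words) is the xenstore wire header; everything else is an assumption
made for the sketch: the message type numbers are placeholders, the daemon
classes are invented stand-ins for the possible behaviours, and the error
string is illustrative.

```python
import struct

# Wire header: type, req_id, tx_id, len (four 32-bit words).
HEADER = struct.Struct("<IIII")

# Placeholder type numbers for this sketch, not copied from xs_wire.h.
XS_READ = 2
XS_ERROR = 16
XS_RESET_WATCHES = 21


def pack_msg(msg_type, req_id, payload=b""):
    return HEADER.pack(msg_type, req_id, 0, len(payload)) + payload


def unpack_msg(raw):
    msg_type, req_id, _tx_id, length = HEADER.unpack_from(raw)
    return msg_type, req_id, raw[HEADER.size:HEADER.size + length]


class SilentDaemon:
    """Hypothetical daemon: drops unknown messages, keeps serving others."""
    known = {XS_READ}

    def handle(self, raw):
        msg_type, req_id, _ = unpack_msg(raw)
        if msg_type not in self.known:
            return None
        return pack_msg(msg_type, req_id, b"ok\0")


class ErrorDaemon(SilentDaemon):
    """Hypothetical daemon: answers unknown messages with an error."""

    def handle(self, raw):
        msg_type, req_id, _ = unpack_msg(raw)
        if msg_type not in self.known:
            return pack_msg(XS_ERROR, req_id, b"ENOSYS\0")  # illustrative
        return super().handle(raw)


class NewDaemon(SilentDaemon):
    """Hypothetical daemon that understands the reset message."""
    known = {XS_READ, XS_RESET_WATCHES}


class WedgedDaemon(SilentDaemon):
    """Hypothetical daemon: stops answering after an unknown message."""
    wedged = False

    def handle(self, raw):
        msg_type, _, _ = unpack_msg(raw)
        if msg_type not in self.known:
            self.wedged = True
        return None if self.wedged else super().handle(raw)


def probe_reset_watches(daemon):
    """Send the new command, then a well established one, and classify
    the daemon by the req_id of the first reply that comes back."""
    reset_id, probe_id = 1, 2
    requests = (pack_msg(XS_RESET_WATCHES, reset_id),
                pack_msg(XS_READ, probe_id, b"domid\0"))
    replies = [r for r in map(daemon.handle, requests) if r is not None]
    if not replies:
        return "no reply"      # the follow-up trick does not help here
    first_type, first_id, _ = unpack_msg(replies[0])
    if first_id == probe_id:
        return "ignored"       # reset was dropped, daemon still alive
    return "rejected" if first_type == XS_ERROR else "supported"
```

The "no reply" case is the one where the daemon stops talking after the
unknown message; the trick cannot rescue that, which is why the daemons'
real behaviour has to be checked first.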

Does the target kernel know that it has been kexec'd? Perhaps we should
only reset xenstore watches if we are booting after a kexec. Worst case
the kexec tool can add a command line argument to trigger this. Doing it
this way means there is no possibility of regressions for normal boot
and kexec wasn't supported on older xenstored anyway.
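
The command line variant could look roughly like this, modelled in Python
rather than as a kernel patch. The flag name xenbus.reset_watches is
invented for the example; no such parameter exists as far as this thread is
concerned.

```python
def should_reset_watches(cmdline, flag="xenbus.reset_watches"):
    """Return True only if the (hypothetical) flag is on the command line.

    A normal boot never carries the flag, so it never sends the new
    message and cannot regress on an older xenstored.
    """
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name == flag:
            # A bare flag or an affirmative value enables the reset.
            return value in ("", "1", "y", "on")
    return False
```

The kexec tool would append the flag when loading the new kernel, and
xs_init would consult it before calling xs_reset_watches().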

Ian.

> 
> -Stefan
> 
> >> Now the big question is, should this never happen and the host needs urgent
> >> updating. Or, should xs_talkv() set up a time limit and assume failure
> >> when not receiving a message after that? I could imagine the latter might
> >> lead at least to a more helpful "there is something wrong here, dude" than
> >> just hanging around without any response. ;)
> >>
> >> -Stefan
> >>
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >> http://lists.xensource.com/xen-devel


