Xen project Mailing List

Re: [Xen-devel] xenbus and the message of doom

To: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>

From: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

Date: Fri, 16 Dec 2011 09:31:57 +0000

Cc: Olaf Hering <olaf@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Delivery-date: Fri, 16 Dec 2011 09:32:28 +0000

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On Fri, 2011-12-16 at 09:18 +0000, Stefan Bader wrote:
> On 15.12.2011 21:53, Ian Campbell wrote:
> > On Thu, 2011-12-15 at 19:20 +0000, Stefan Bader wrote:
> >> I was investigating a bug report[1] about newer kernels (>3.1) not
> >> booting as HVM guests on Amazon EC2. For some reason git bisect did
> >> give me some pain, but it led me at least close and with some crash
> >> dump data I think I figured the problem.
> >>
> >> commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
> >> Author: Olaf Hering <olaf@xxxxxxxxx>
> >> Date: Thu Sep 22 16:14:49 2011 +0200
> >>
> >>     xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches
> >>     from old kernel
> >>
> >> This change introduced a xs_reset_watches() call. The problem seems
> >> to be that there is at least some version of Xen (I was able to
> >> reproduce with a 3.4.3 version which I admit to deliberately not
> >> having updated) for which xenstore will not return any reply.
> >>
> >> At least the backtraces in crash showed that xs_init had been calling
> >> xs_reset_watches() and that was happily idling in read_reply().
> >> Effectively nothing was going on and the boot just hung.
> >> By just not doing that xs_reset_watches() call, I was able to boot
> >> under the same host. And for what it is worth there has not been an
> >> issue with Xen 4.1.1 and a 3.0 dom0 kernel. Just this "older" release
> >> is trouble.
> >
> > I sent a patch to fix exactly this issue in oxenstored (the ocaml
> > xenstore) just this week. Is there any chance that you are running C
> > xenstored with Xen 4.1.1 and oxenstored with Xen 3.4.3?
>
> Thanks for the pointer, I missed that thread. Now dumb question, would
> oxenstored be named that way? Or iow, how do I quickly find out what is
> running?

The process name will be oxenstored instead of xenstored.

> The binary running in 3.4.3 is xenstored which is a linked executable
> (same in 4.1.1).

That sounds like you are running the C xenstored in both cases so this is
a red herring.

> But I guess, whatever version is running, any oxenstored would not have
> the bugfix because things take longer to reach any packaged versions.

Correct.

> I rather would suspect that in 4.1.1, the reset watches message probably
> is just known and thus avoiding the problem. Unfortunately it is near
> impossible to tell for sure what exactly EC2 is running.
>
> The major point here probably is that when the upstream kernels are
> calling that message and there are versions of xenstored in production
> that will just ignore it while the kernel blocks waiting, this is a
> painful path. Production systems tend to update slowly and the symptoms
> are not that obvious.

Yes. It is unfortunate that xenstored in the field has this bug but it
does mean that the approach taken here with this new message cannot work.
FWIW there weren't all that many changes to C xenstored between 3.4.x and
current unstable so it wouldn't be hard to identify where the fix is, but
that doesn't really help people stuck with an older xenstored.

> Having a timeout maybe could be useful not only for this case, but
> clearly it is nothing that should be rushed.

A timeout may not help, it depends what the daemon does after the invalid
message. It looks as if oxenstored just throws it away and will process
the next message fine so that is OK but I didn't check the C version. I
think (or hope!) that Olaf tested with the most recent version of C
xenstored which did not support this new message and so I presume that it
correctly returns an error for an unknown message but that doesn't help us
with the older C xenstored which you have. In any case someone really
needs to check both the ocaml and the C version's behaviour.
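
To make the timeout point concrete, here is a rough sketch in Python rather than kernel C. Everything in it is made up for illustration: `FakeDaemon`, `talk()`, the message-type strings and the 50 ms timeout are assumptions, not the real xenstored or xs_talkv() code. It models two possible daemon behaviours after an unknown message: drop it and carry on, or stop answering altogether.

```python
import queue

# Illustrative message names only; the fake daemon knows just these two.
KNOWN_TYPES = {"XS_READ", "XS_WATCH"}


class FakeDaemon:
    """Hypothetical stand-in for a xenstore daemon (not real xenstored code)."""

    def __init__(self, wedge_on_unknown):
        self.replies = queue.Queue()
        self.wedge_on_unknown = wedge_on_unknown
        self.wedged = False

    def send(self, msg_type, req_id):
        if self.wedged:
            return  # daemon no longer answers anything
        if msg_type in KNOWN_TYPES:
            self.replies.put((req_id, "OK"))
        elif self.wedge_on_unknown:
            self.wedged = True  # unknown message breaks the connection
        # otherwise: unknown message is silently dropped, no reply at all


def talk(daemon, msg_type, req_id, timeout=0.05):
    """Send one request and wait for a reply, giving up after `timeout` seconds."""
    daemon.send(msg_type, req_id)
    try:
        return daemon.replies.get(timeout=timeout)
    except queue.Empty:
        return None  # caller sees a failure instead of hanging forever


dropping = FakeDaemon(wedge_on_unknown=False)
wedging = FakeDaemon(wedge_on_unknown=True)
```

With the dropping daemon, the timed-out request is followed by a normal reply to the next one, so a timeout would be enough. With the wedging daemon every later request times out as well, which is the case where a timeout alone does not help.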

> So reverting the patch introducing that call (at least in the distro
> kernel) may be the best thing to do (knowing that this will be bought by
> losing the fix for kexec boots for crash kernels).

Agreed. We should revert the kernel change for now and revisit it.

One potential solution, depending on the actual behaviour of the daemons,
would be to follow the potentially unsupported command with an innocuous
well established one and use the ID field to identify which we get a
response to.

Does the target kernel know that it has been kexec'd? Perhaps we should
only reset xenstore watches if we are booting after a kexec. Worst case
the kexec tool can add a command line argument to trigger this. Doing it
this way means there is no possibility of regressions for normal boot and
kexec wasn't supported on older xenstored anyway.

Ian.

>
> -Stefan
>
> >> Now the big question is, should this never happen and the host needs
> >> urgent updating. Or, should xs_talkv() set up a time limit and assume
> >> failure when not receiving a message after that? I could imagine the
> >> latter might lead at least to a more helpful "there is something
> >> wrong here, dude" than just hanging around without any response. ;)
> >>
> >> -Stefan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
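
The idea of following the possibly unsupported command with a well-established one, and using the ID field to tell the replies apart, might look roughly like the sketch below. It is Python for brevity, not kernel code. `make_daemon`, `probe_reset_watches`, the message names and the three daemon behaviours are all hypothetical, and it assumes the daemon answers requests in the order it received them.

```python
from collections import deque


def make_daemon(behaviour):
    """Build a fake daemon; behaviour is 'supported', 'error' or 'dropped'.

    Purely illustrative: real daemons may behave differently.
    """
    replies = deque()

    def send(msg_type, req_id):
        if msg_type == "RESET_WATCHES":
            if behaviour == "supported":
                replies.append((req_id, "OK"))
            elif behaviour == "error":
                replies.append((req_id, "ERROR"))
            # 'dropped': the message is thrown away, nothing is queued
        else:
            replies.append((req_id, "OK"))  # well-established command

    return send, replies


def probe_reset_watches(send, replies):
    """Send the new command, then a harmless one, and inspect the first reply ID."""
    RESET_ID, PROBE_ID = 1, 2
    send("RESET_WATCHES", RESET_ID)
    send("READ", PROBE_ID)  # innocuous follow-up request

    req_id, payload = replies.popleft()
    if req_id == PROBE_ID:
        # First answer belongs to the follow-up: the reset was ignored.
        return "ignored"
    replies.popleft()  # discard the follow-up's reply
    return "done" if payload == "OK" else "unsupported"
```

Because the follow-up request always gets an answer in this model, the caller never blocks indefinitely: a reply carrying the follow-up's ID is the signal that the daemon silently dropped the new message. Whether real daemons keep processing after an unknown message is exactly the behaviour that still needs checking.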
