
Re: [Xen-devel] Fatal crash on xen4.2 HVM + qemu-xen dm + NFS



On Wed, 16 Jan 2013, Alex Bligh wrote:
> >> If QEMU is completing writes before they've actually been done, haven't
> >> we got a wider set of problems to worry about?
> >
> > Reading the thread you linked in a previous email, it seems that
> > it can actually happen that a userspace application is told that
> > the write is completed before all the outstanding network requests are
> > dealt with.
> 
> What is the 'userspace application' in this context? QEMU running in dom0?
> That would seem to me to be a kernel bug unless the page is marked
> CoW, wouldn't it? Otherwise a write() followed by altering the page
> might write the altered data. But perhaps I've misunderstood (see below).

Yes, the application is QEMU. I also think that it is a kernel bug.


> >> Could the problem be "cache=writeback" on the QEMU command
> >> line (evident from a 'ps'). If caching is writeback perhaps QEMU
> >> needs to copy the data. Is there some setting to turn this off in
> >> xl for test purposes?
> >
> > The command line cache options are ignored by xen_disk, so, assuming
> > that the guest is using the PV disk interface, that can't be the issue.
> 
> This appears not to be the case (at least in our environment).
> 
> We use PV on HVM and:
>  disk = [ 'tap:qcow2:/my/nfs/directory/testdisk.qcow2,xvda,w' ]
> (remainder of config file in the original message)
> 
> We tried modifying the cache= setting using the patch below (yes,
> the mail client will probably have eaten it, but in essence it changes
> the word 'writeback' to 'none'), and that stops VMs from booting at
> all, with
>  hd0 write error
>  error: couldn't read file
> so it would appear not to be entirely correct that the cache=
> settings are being ignored. I've not had time to find out why
> (possibly it's trying and failing to use O_DIRECT on NFS) but
> I'll try writethrough.

The cache command line option is ignored by xen_disk, the PV disk
backend.  I was assuming that the guest is using blkfront to access the
disk, but it looks like I am wrong.  If the guest is using the IDE
interface, then yes, the cache command line option makes a big
difference.

It is interesting that cache=none has such a terrible effect on disk
reads; it means that O_DIRECT doesn't work properly either.
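
For reference, a minimal sketch of the usual mapping between cache=
modes and open(2) flags (the function name is illustrative; this is
not QEMU's actual code):

  #define _GNU_SOURCE   /* for O_DIRECT */
  #include <fcntl.h>
  #include <string.h>

  /* Illustrative mapping of cache= modes to open(2) flags. */
  static int cache_mode_to_open_flags(const char *mode)
  {
      if (strcmp(mode, "none") == 0)
          return O_DIRECT;           /* bypass the host page cache */
      if (strcmp(mode, "writethrough") == 0)
          return O_DSYNC;            /* each write reaches stable storage */
      if (strcmp(mode, "directsync") == 0)
          return O_DIRECT | O_DSYNC;
      return 0;                      /* "writeback": writes complete into
                                        the page cache and flush later */
  }

If that mapping holds here, cache=none means opening the qcow2 file
with O_DIRECT over NFS, and O_DIRECT imposes strict buffer and offset
alignment requirements, which would fit the "hd0 write error" above.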


> One thing the guest is doing is writing to the partition table
> (UEC cloud images do this on boot). This isn't special cased in
> any way is it?

I don't think so.


> >> > Isn't there a way to prevent tcp_retransmit from running when the
> >> > request is already completed? Or stop it if you find out that the pages
> >> > are already gone?
> >>
> >> But what would you do? If you don't run the tcp_retransmit the write
> >> would be lost (to say nothing of the NFS connection to the server).
> >
> > Well, that is not true: if the write was really lost, the kernel wouldn't
> > have completed the AIO write and notified QEMU.
> 
> Isn't that exactly what you said did happen? The kernel completed the AIO
> write and notified QEMU prior to the write actually completing as the
> data to write is still sitting in some as-yet-unacked TCP buffer. The
> kernel then doesn't get the ACK in respect of that sequence number and
> decides to resend the entire TCP segment. That then blows up because
> the TCP segment it points to contains data pointing to a hole in memory.
> Perhaps I'm misunderstanding the problem.
> 
> If TCP does not retransmit, that segment will never get ACKed, and the
> TCP stream will lock up (this assumes that the cause of the original
> need to retransmit was packet loss - if it's simply buffering at
> a busy filer, then I agree).

Almost. I am saying that the kernel completed the AIO write and notified
QEMU after it received an ACK from the other end, but before the
tcp_retransmit was supposed to run. I admit I am not that familiar with
the network stack, so this is just a supposition.
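
To make the suspected ordering concrete, a minimal sketch, assuming
the NFS client hands O_DIRECT pages to TCP by reference (zero-copy)
rather than copying them; the function and timeline are illustrative,
not actual QEMU or kernel code:

  #include <libaio.h>
  #include <sys/mman.h>

  /* Suspected timeline:
   *  1. io_submit() an O_DIRECT write; the NFS client gives the user
   *     pages to TCP by reference instead of copying them.
   *  2. The server replies, the AIO completes, io_getevents() returns,
   *     and the caller believes the pages can be reused or unmapped.
   *  3. A lost segment triggers tcp_retransmit, which still points at
   *     the original pages -- by now unmapped grant pages under Xen. */
  static void suspected_race(io_context_t ctx, int fd, void *page)
  {
      struct iocb cb, *cbs[1] = { &cb };
      struct io_event ev;

      io_prep_pwrite(&cb, fd, page, 4096, 0);  /* fd opened O_DIRECT */
      io_submit(ctx, 1, cbs);                  /* step 1 */
      io_getevents(ctx, 1, 1, &ev, NULL);      /* step 2: "done" */
      munmap(page, 4096);                      /* step 3: a retransmit
                                                  now reads a hole */
  }

On bare metal step 3 would only send stale or garbage data; with the
grant pages unmapped it dereferences a hole, which would explain the
fatal crash.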


> >> > You could try persistent grants, that wouldn't solve the bug but they
> >> > should be able to "hide" it pretty well. Not ideal, I know.
> >> > The QEMU side commit is 9e496d7458bb01b717afe22db10a724db57d53fd.
> >> > Konrad issued a pull request recently with the corresponding Linux
> >> > blkfront changes:
> >> >
> >> > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> >> > stable/for-jens-3.8
> >>
> >> That's presumably the first 8 commits at:
> >> http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/stable/for-jens-3.8
> >>
> >> So I'd need a new dom0 kernel and to backport the QEMU patch.
> >
> > Yep.
> 
> What puzzles me about this is (a) why we never see the same problems
> on KVM, and (b) why this doesn't affect NFS clients even when no
> virtualisation is involved.

If it is the bug that I think it is, then it would also affect KVM and
other native clients, but it wouldn't cause such horrible host crashes.
For example, tcp_retransmit could send stale data, or even data that
has just been written by QEMU but is not supposed to go over the
network yet. After all, who knows what's written on those pages now
that the AIO is completed?
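
To spell out why the persistent grants mentioned above should hide
it: a minimal sketch, assuming the backend maps each grant once for
the lifetime of the device and copies request data through it
(illustrative types and names, not the actual blkfront/xen_disk code):

  #include <string.h>

  struct persistent_gnt {
      void *page;            /* mapped once, never unmapped per-request */
      unsigned int gref;
  };

  static void fill_request(struct persistent_gnt *gnt,
                           const void *data, size_t len)
  {
      /* The copy means a later tcp_retransmit always sees a stable,
       * still-mapped page, even after the guest request completes. */
      memcpy(gnt->page, data, len);
  }

The extra copy costs some throughput, but the pages the network stack
may still reference never go away, which is why persistent grants hide
the bug rather than fix it.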

