
Re: [Xen-devel] Fatal crash on xen4.2 HVM + qemu-xen dm + NFS



On Wed, 16 Jan 2013, Alex Bligh wrote:
> >> If QEMU is completing writes before they've actually been done, haven't
> >> we got a wider set of problems to worry about?
> >
> > Reading the thread you linked in a previous email, it seems that
> > it can actually happen that a userspace application is told that
> > the write is completed before all the outstanding network requests are
> > dealt with.
> 
> What is the 'userspace application' in this context? QEMU running in dom0?
> That would seem to me to be a kernel bug unless the page is marked
> CoW, wouldn't it? Otherwise a write() followed by altering the page
> might write the altered data. But perhaps I've misunderstood (see below).

Yes, the application is QEMU. I also think that it is a kernel bug.


> >> Could the problem be "cache=writeback" on the QEMU command
> >> line (evident from a 'ps'). If caching is writeback perhaps QEMU
> >> needs to copy the data. Is there some setting to turn this off in
> >> xl for test purposes?
> >
> > The command line cache options are ignored by xen_disk, so, assuming
> > that the guest is using the PV disk interface, that can't be the issue.
> 
> This appears not to be the case (at least in our environment).
> 
> We use PV on HVM and:
>  disk = [ 'tap:qcow2:/my/nfs/directory/testdisk.qcow2,xvda,w' ]
> (remainder of config file in the original message)
> 
> We tried modifying the cache= setting using the patch below (yes,
> the mail client will probably have eaten it, but in essence it changes
> the word 'writeback' to 'none'), and that stops VMs from booting at
> all, with
>  hd0 write error
>  error: couldn't read file
> so it would appear not to be entirely correct that the cache=
> settings are being ignored. I've not had time to find out why
> (possibly it's trying and failing to use O_DIRECT on NFS) but
> I'll try writethrough.

The cache command line option is ignored by xen_disk, the PV disk
backend.  I was assuming that the guest is using blkfront to access the
disk, but it looks like I am wrong.  If the guest is using the IDE
interface, then yes, the cache command line option makes a big
difference.

It is interesting that cache=none has such a terrible effect on disk
reads; it means that O_DIRECT doesn't work properly either.
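
For reference, a minimal sketch of the usual mapping between cache=
modes and open(2) flags (the function name is illustrative; this is
not QEMU's actual code):

  #define _GNU_SOURCE   /* for O_DIRECT */
  #include <fcntl.h>
  #include <string.h>

  /* Illustrative mapping of cache= modes to open(2) flags. */
  static int cache_mode_to_open_flags(const char *mode)
  {
      if (strcmp(mode, "none") == 0)
          return O_DIRECT;           /* bypass the host page cache */
      if (strcmp(mode, "writethrough") == 0)
          return O_DSYNC;            /* each write reaches stable storage */
      if (strcmp(mode, "directsync") == 0)
          return O_DIRECT | O_DSYNC;
      return 0;                      /* "writeback": writes complete into
                                        the page cache and flush later */
  }

If that mapping holds here, cache=none means opening the qcow2 file
with O_DIRECT over NFS, and O_DIRECT imposes strict buffer and offset
alignment requirements, which would fit the "hd0 write error" above.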


> One thing the guest is doing is writing to the partition table
> (UEC cloud images do this on boot). This isn't special cased in
> any way is it?

I don't think so.


> >> > Isn't there a way to prevent tcp_retransmit from running when the
> >> > request is already completed? Or stop it if you find out that the pages
> >> > are already gone?
> >>
> >> But what would you do? If you don't run the tcp_retransmit the write
> >> would be lost (to say nothing of the NFS connection to the server).
> >
> > Well, that is not true: if the write was really lost, the kernel wouldn't
> > have completed the AIO write and notified QEMU.
> 
> Isn't that exactly what you said did happen? The kernel completed the AIO
> write and notified QEMU prior to the write actually completing as the
> data to write is still sitting in some as-yet-unacked TCP buffer. The
> kernel then doesn't get the ACK in respect of that sequence number and
> decides to resend the entire TCP segment. That then blows up because
> the TCP segment it points to contains data pointing to a hole in memory.
> Perhaps I'm misunderstanding the problem.
> 
> If TCP does not retransmit, that segment will never get ACKed, and the
> TCP stream will lock up (this assumes that the cause of the original
> need to retransmit was packet loss - if it's simply buffering at
> a busy filer, then I agree).

Almost. I am saying that the kernel completed the AIO write and notified
QEMU after it received an ACK from the other end, but before the
tcp_retransmit was supposed to run. I admit I am not that familiar with
the network stack, so this is just a supposition.
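
To make the suspected ordering concrete, a minimal sketch, assuming
the NFS client hands O_DIRECT pages to TCP by reference (zero-copy)
rather than copying them; the function and timeline are illustrative,
not actual QEMU or kernel code:

  #include <libaio.h>
  #include <sys/mman.h>

  /* Suspected timeline:
   *  1. io_submit() an O_DIRECT write; the NFS client gives the user
   *     pages to TCP by reference instead of copying them.
   *  2. The server replies, the AIO completes, io_getevents() returns,
   *     and the caller believes the pages can be reused or unmapped.
   *  3. A lost segment triggers tcp_retransmit, which still points at
   *     the original pages -- by now unmapped grant pages under Xen. */
  static void suspected_race(io_context_t ctx, int fd, void *page)
  {
      struct iocb cb, *cbs[1] = { &cb };
      struct io_event ev;

      io_prep_pwrite(&cb, fd, page, 4096, 0);  /* fd opened O_DIRECT */
      io_submit(ctx, 1, cbs);                  /* step 1 */
      io_getevents(ctx, 1, 1, &ev, NULL);      /* step 2: "done" */
      munmap(page, 4096);                      /* step 3: a retransmit
                                                  now reads a hole */
  }

On bare metal step 3 would only send stale or garbage data; with the
grant pages unmapped it dereferences a hole, which would explain the
fatal crash.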


> >> > You could try persistent grants, that wouldn't solve the bug but they
> >> > should be able to "hide" it pretty well. Not ideal, I know.
> >> > The QEMU side commit is 9e496d7458bb01b717afe22db10a724db57d53fd.
> >> > Konrad issued a pull request recently with the corresponding Linux
> >> > blkfront changes:
> >> >
> >> > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> >> > stable/for-jens-3.8
> >>
> >> That's presumably the first 8 commits at:
> >> http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/stable/for-jens-3.8
> >>
> >> So I'd need a new dom0 kernel and to backport the QEMU patch.
> >
> > Yep.
> 
> What puzzles me about this is (a) why we never see the same problems
> on KVM, and (b) why this doesn't affect NFS clients even when no
> virtualisation is involved.

If it is the bug that I think it is, then it would also affect KVM and
other native clients, but it wouldn't cause such horrible host crashes.
For example, tcp_retransmit could send stale data, or even data that
has just been written by QEMU but is not supposed to go over the
network yet. After all, who knows what's written on those pages now
that the AIO is completed?
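
To spell out why the persistent grants mentioned above should hide
it: a minimal sketch, assuming the backend maps each grant once for
the lifetime of the device and copies request data through it
(illustrative types and names, not the actual blkfront/xen_disk code):

  #include <string.h>

  struct persistent_gnt {
      void *page;            /* mapped once, never unmapped per-request */
      unsigned int gref;
  };

  static void fill_request(struct persistent_gnt *gnt,
                           const void *data, size_t len)
  {
      /* The copy means a later tcp_retransmit always sees a stable,
       * still-mapped page, even after the guest request completes. */
      memcpy(gnt->page, data, len);
  }

The extra copy costs some throughput, but the pages the network stack
may still reference never go away, which is why persistent grants hide
the bug rather than fix it.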

