This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?

To: Anthony Liguori <aliguori@xxxxxxxxxx>
Subject: Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
From: Stephen Tweedie <sct@xxxxxxxxxx>
Date: Thu, 02 Feb 2006 21:42:08 -0500
Cc: Steve Dobbelstein <steved@xxxxxxxxxx>, "Philip R. Auld" <pauld@xxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Fri, 03 Feb 2006 02:52:22 +0000
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <43E29F27.10009@xxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <43E27DA3.80405@xxxxxxxxxx> <OF4FC3AD2A.9B8EA7AB-ON06257109.007A4F76-06257109.007B7876@xxxxxxxxxx> <20060202224106.GC17266@xxxxxxxxxxxxxxxxxx> <43E29F27.10009@xxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx

On Thu, 2006-02-02 at 18:09 -0600, Anthony Liguori wrote:

> Referring to the original question, which has been quoted away, 
> journaling doesn't require that data be written to disk per se but that 
> writes occur in a particular order.  A journal is always recoverable 
> given that writes occur in the expected order.

Sure... it's *internally* consistent, maybe.  But you need more than
that.  You need guarantees that things are on disk, else external
consistency guarantees will be broken.

Consider things like sendmail fsync()ing a spool file before telling the
sender that the email has been accepted.  After that acknowledgement,
the sender can delete the mail from its queues knowing that the
recipient MTA definitely has the data, and even if it crashes, the mail
won't be lost.  Databases frequently have similar consistency
requirements.  If a power failure loses writes that you have told the
domU have completed --- even if you maintain write ordering --- then you
*are* putting application correctness at risk, there's no doubt about
it.
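
The fsync()-before-acknowledge pattern described above could look roughly
like the following in C. This is a minimal sketch, not sendmail's actual
code: the helper name spool_message_durably and the file layout are
assumptions made for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write msg to a spool file and return 0 only after
 * fsync() has reported success. Returns -1 on any failure, in which case
 * the caller must NOT acknowledge the message to the sender. */
int spool_message_durably(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = msg;
    size_t left = strlen(msg);
    while (left > 0) {
        ssize_t n = write(fd, p, left);   /* may be a short write */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* write() only copies the data into the page cache; fsync() asks the
     * kernel to push it to stable storage before returning. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The caller would send its "accepted" reply only when this returns 0. The
guarantee is only as strong as the layers underneath: if the backend
reports a write as complete while it is still sitting in a dom0 cache, the
guest's fsync() returns early and the acknowledgement is premature.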

> A buffer cache will have 
> no effect on that order so you're no more likely to have corruption than 
> if you disabled the buffer cache.

That holds only if it's being used as a write-through cache.  If it's
write-back, it will have a major impact on ordering.

> You especially want the buffer cache if you have LVM partitions.  
> Sectors on an LVM disk are not necessarily contiguous and can even span 
> multiple disks.  You definitely want the IO scheduler involved there.

That does not at all imply the use of the buffer cache.  All that you
need to satisfy this is AIO (asynchronous *submission* of the IO)
combined with O_DIRECT IO (synchronous *completion*) --- i.e. you can
submit multiple IOs concurrently, but you know for sure when each one
completes.  That still lets the elevator get strongly involved in the
scheduling and reordering of the IOs, but lets you know reliably when
things hit disk.
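
As a user-space illustration of that combination, here is a rough sketch
using the raw Linux AIO system calls together with O_DIRECT.  It is not
what blkback does (blkback lives in the kernel); the helper name
write_blocks_direct_async, the 4096-byte alignment and the request count
are all assumptions chosen for the example.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> */
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BLK  4096   /* assumed alignment; the real requirement is device-specific */
#define NREQ 4      /* number of writes queued concurrently */

/* Hypothetical helper: queue NREQ aligned writes at once, then collect
 * their completion events. Returns the number of writes that completed in
 * full, or -1 if setup or submission fails (for instance when the
 * filesystem rejects O_DIRECT). */
int write_blocks_direct_async(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return -1;

    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, (long)NREQ, &ctx) < 0) {
        close(fd);
        return -1;
    }

    void *buf = NULL;
    if (posix_memalign(&buf, BLK, (size_t)BLK * NREQ) != 0) {
        syscall(SYS_io_destroy, ctx);
        close(fd);
        return -1;
    }
    memset(buf, 'x', (size_t)BLK * NREQ);

    struct iocb cbs[NREQ];
    struct iocb *list[NREQ];
    memset(cbs, 0, sizeof cbs);
    for (int i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = (uint32_t)fd;
        cbs[i].aio_lio_opcode = IOCB_CMD_PWRITE;
        cbs[i].aio_buf = (uint64_t)(uintptr_t)((char *)buf + (size_t)i * BLK);
        cbs[i].aio_nbytes = BLK;
        cbs[i].aio_offset = (int64_t)i * BLK;
        list[i] = &cbs[i];
    }

    /* Asynchronous submission: all requests are handed over in one call,
     * so the IO scheduler is free to merge and reorder them. */
    long submitted = syscall(SYS_io_submit, ctx, (long)NREQ, list);

    /* Explicit completion: each event reports one finished request. */
    int done = 0;
    long got = 0;
    struct io_event ev[NREQ];
    while (got < submitted) {
        long n = syscall(SYS_io_getevents, ctx, 1L, submitted - got,
                         ev + got, NULL);
        if (n <= 0)
            break;
        got += n;
    }
    for (long i = 0; i < got; i++)
        if (ev[i].res == BLK)
            done++;

    free(buf);
    syscall(SYS_io_destroy, ctx);
    close(fd);
    return submitted < 0 ? -1 : done;
}
```

The point of the sketch is the split between io_submit (nothing waits) and
io_getevents (each request is individually known to be finished).  Note
that O_DIRECT only bypasses the page cache; if the drive has a volatile
write cache, O_DSYNC or an explicit flush may still be needed on top.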

Fortunately, that's just what blkback is doing --- it's using submit_bio
to submit the write IOs without waiting for completion, and is using the
bio's bi_end_io callback to process the IO completion once it is hard on
disk.

