
Re: [Xen-devel] Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?


On Thu, 2006-02-02 at 18:09 -0600, Anthony Liguori wrote:

> Referring to the original question, which has been quoted away, 
> journaling doesn't require that data be written to disk per se but that 
> writes occur in a particular order.  A journal is always recoverable 
> given that writes occur in the expected order.

Sure... it's *internally* consistent, maybe.  But you need more than
that.  You need guarantees that things are on disk, else external
consistency guarantees will be broken.

Consider things like sendmail fsync()ing a spool file before telling the
sender that the email has been accepted.  After that acknowledgement,
the sender can delete the mail from its queues knowing that the
recipient MTA definitely has the data, and even if it crashes, the mail
won't be lost.  Databases frequently have similar consistency
requirements.  If a power failure loses writes that you have told the
domU have completed --- even if you maintain write ordering --- then you
*are* putting application correctness at risk, there's no doubt about that.
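
As a rough illustration of the sendmail case, here is a minimal sketch of
"make it durable first, acknowledge second". The function name
accept_message, the spool layout and the "250 OK" reply string are
assumptions made up for this example, not sendmail's actual code. The
point is only that the acknowledgement is returned after fsync() has
returned, so it is only as trustworthy as the storage stack underneath.

```python
import os
import tempfile


def accept_message(spool_dir, msg_id, data):
    """Spool a message and return an ack only after fsync() has returned.

    Hypothetical helper for illustration; not taken from any real MTA.
    """
    path = os.path.join(spool_dir, msg_id)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:                      # os.write may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # ask the kernel to push the data to stable storage
    finally:
        os.close(fd)

    # Also sync the directory so that the new file name itself is persistent.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now is it safe to tell the sender it may drop its own copy.
    return "250 OK"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        print(accept_message(spool, "msg-0001", b"Subject: test\r\n\r\nhello\r\n"))
```

If the backend in dom0 reports a write as complete while it is still
sitting in a write-back cache, the fsync() above returns early inside the
domU and the acknowledgement goes out before the data is safe.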

> A buffer cache will have 
> no effect on that order so you're no more likely to have corruption than 
> if you disabled the buffer cache.

Not if it's being used as a write-through cache.  If it's write-back, it
will have a major impact on ordering.

> You especially want the buffer cache if you have LVM partitions.  
> Sectors on an LVM disk are not necessarily contiguous and can even span 
> multiple disks.  You definitely want the IO scheduler involved there.

That does not at all imply the use of the buffer cache.  All that you
need to satisfy this is AIO (asynchronous *submission* of the IO)
combined with O_DIRECT IO (synchronous *completion*) --- i.e. you can
submit multiple IOs concurrently, but you know for sure when each one
completes.  That still lets the elevator get strongly involved in the
scheduling and reordering of the IOs, but lets you know reliably when
things hit disk.
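
To make the submit/complete split concrete, here is a user-space sketch
under some loud assumptions. Python's standard library has no binding for
Linux native AIO, so a thread pool stands in for asynchronous submission;
the names open_direct, write_block and write_blocks are hypothetical; the
4096-byte block size is assumed; and the code falls back to O_SYNC where
O_DIRECT is unavailable (for example on some tmpfs setups). Each write
bypasses the page cache and is counted as complete only when its own
pwrite() returns, while several writes are in flight at once.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed

BLOCK = 4096  # assumed alignment unit; real devices may require another size


def open_direct(path):
    """Open with O_DIRECT if possible, otherwise fall back to O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT
    try:
        return os.open(path, flags | os.O_DIRECT, 0o600), True
    except (OSError, AttributeError):
        # O_DIRECT is missing on this platform or rejected by this filesystem.
        return os.open(path, flags | os.O_SYNC, 0o600), False


def write_block(fd, index, fill):
    """Write one block at its own offset; returns only when the write is done."""
    buf = mmap.mmap(-1, BLOCK)           # anonymous mmap gives a page-aligned buffer
    try:
        buf.write(bytes([fill]) * BLOCK)
        written = os.pwrite(fd, buf, index * BLOCK)
    finally:
        buf.close()
    return index, written


def write_blocks(path, nblocks):
    """Submit nblocks writes concurrently and record each completion."""
    fd, direct = open_direct(path)
    completed = []
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(write_block, fd, i, 65 + i)
                       for i in range(nblocks)]
            for future in as_completed(futures):   # completions may arrive in any order
                index, written = future.result()
                if written != BLOCK:
                    raise OSError("short write on block %d" % index)
                completed.append(index)
    finally:
        os.close(fd)
    return direct, completed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        used_direct, done = write_blocks(os.path.join(d, "disk.img"), 8)
        print("O_DIRECT in use:", used_direct, "completion order:", done)
```

The writes are all outstanding together, so the IO scheduler is free to
merge and reorder them, yet the caller still learns about each completion
individually. Note that O_DIRECT by itself does not flush a drive's
volatile write cache; that part is outside what this sketch shows.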

Fortunately, that's just what blkback is doing --- it's using submit_bio
to submit the write IOs without waiting for completion, and is using the
bio's bi_end_io callback to process the IO completion once it is hard on
disk.

