Sybase Blog Center

Some Musings on Device Options

January 29, 2007 8:05 AM

Filed Under: Adaptive Server Enterprise (ASE) Linux Kernel

In ASE 12.0 we introduced official file system device support via the device-level dsync flag. Since then, many a DBA has pondered "to dsync, or not to dsync?" This tends to be part of the larger question of file vs. raw. Like just about anything related to performance, there is no yes-or-no answer that fits all cases. In this post I'll try to further your understanding of this and other device options.

I like to have an understanding of what a flag / switch / etc. really does before making a decision on its use, so I'll start my discussion of device flags with an explanation as to what they really are.  Before we talk about the various flags, let's take a look at file system i/o in general (generalizations and simplifications will be fine for this portion).

A typical file system has a cache into which blocks are read and written. You can think of this very much like ASE's data cache: a file system block is essentially like an ASE page, and the file system cache is like the ASE data cache. When a user requests some data from a file (or database), the appropriate blocks (or pages) are read from disk into the cache, and then returned from the cache to the user. When the user writes to a file (or a database), the write is reflected in the cache but is typically not written to disk until a later time. This makes both read and write operations appear faster than they really are. The next time that same data is read it may already be in cache, eliminating the need for a disk i/o. On the write side, because we are only updating memory (cache), we don't need to wait for an expensive disk i/o to complete.

Because nothing in life is free, the improved efficiency comes with some risk. I mentioned that a write will complete when the change is written to the cache, with the actual i/o to disk happening at some later time. What if the system crashes before that write to disk takes place? The application believes the write made it to disk, because the O/S returned success for the write. However, when the system comes back up, the changes will be lost. If the file that lost the write was one of your database devices, ouch. This is the reason that ASE did not officially support file system devices until 12.0.
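
To make the risk concrete, here is a minimal Python sketch of a cached write followed by an explicit flush. It is not ASE code; the function name and the file path are made up for illustration. The point is only that write() can return success while the data is still sitting in the file system cache, and that a separate call is needed to force it to disk.

```python
import os
import tempfile


def buffered_write_then_flush(path: str, payload: bytes) -> int:
    """Write payload through the file system cache, then force it to disk.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # write() normally returns as soon as the data is in the OS cache.
        # A system crash right here could still lose the data.
        written = os.write(fd, payload)
        # fsync() blocks until the OS reports the data as written to the device.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "device.dat")  # made-up file name
        print(buffered_write_then_flush(path, b"x" * 4096))
```

Without the fsync() call the function would still report success, which is exactly the window in which a crash loses data.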


In 12.0 we introduced the dsync flag - shorthand for "Data Synchronous." The dsync flag in ASE directly translates to the O_DSYNC open(2) flag. That is to say, when ASE opens a device that has the dsync flag set, ASE will pass O_DSYNC to open(2). This flag tells the file system that a write to that file must pass through the cache and be written to disk before the write is considered complete. In other words, for writes we throw away the cache efficiency and make sure the data goes to disk. This way if the system crashes, everything that we thought had been written to disk has in fact been written to disk.
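
As a rough illustration of what passing O_DSYNC to open(2) looks like from an application, here is a small Python sketch. It is not how ASE itself is written; open_device, write_page, and the 2048-byte page size are assumptions for the example. The fallback to O_SYNC is also an assumption, there only so the sketch runs on platforms whose Python build does not expose O_DSYNC.

```python
import os
import tempfile

# Assumption: fall back to O_SYNC (or no flag) where O_DSYNC is not exposed.
DSYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def open_device(path: str, dsync: bool) -> int:
    """Open a file-backed 'device', optionally with data synchronous writes."""
    flags = os.O_RDWR | os.O_CREAT
    if dsync:
        # With this flag, each write returns only after the data is on disk.
        flags |= DSYNC_FLAG
    return os.open(path, flags, 0o600)


def write_page(fd: int, page_no: int, page: bytes) -> int:
    """Write one fixed-size page at its offset within the device file."""
    return os.pwrite(fd, page, page_no * len(page))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = open_device(os.path.join(tmp, "device.dat"), dsync=True)
        try:
            print(write_page(fd, 2, b"\0" * 2048))
        finally:
            os.close(fd)
```

Note that the write call itself is unchanged; only the flag given at open time decides when the write is considered complete.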


(Note: Most file systems support an O_SYNC mode in addition to O_DSYNC. The difference has to do with metadata. File systems maintain metadata about a file in addition to the data in the file. The O_SYNC flag forces both metadata and data changes to be written to disk, whereas the O_DSYNC flag does not concern itself with metadata. O_DSYNC is sufficient for ASE because the metadata for ASE devices does not frequently change in any consequential way.)
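
The same data-versus-metadata split shows up in the explicit flush calls, which may be an easier way to see it. In the sketch below (the flush helper is hypothetical), fdatasync() plays roughly the role of O_DSYNC and fsync() roughly the role of O_SYNC. This is an analogy rather than an exact equivalence, and fdatasync() is not available on every platform, so the sketch falls back to fsync() when it is missing.

```python
import os
import tempfile


def flush(fd: int, include_metadata: bool) -> None:
    """Flush a file descriptor, with or without non-essential metadata.

    Hypothetical helper: fdatasync() is the rough analogue of O_DSYNC,
    fsync() the rough analogue of O_SYNC.
    """
    if include_metadata or not hasattr(os, "fdatasync"):
        os.fsync(fd)       # data plus metadata such as timestamps
    else:
        os.fdatasync(fd)   # data, plus only the metadata needed to read it back


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "device.dat"),
                     os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"\0" * 2048)
            flush(fd, include_metadata=False)
        finally:
            os.close(fd)
```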


Now that we know what dsync really is, let's consider when it should and shouldn't be used. This is pretty simple. Any time you are using file system devices and you care about the recoverability of the data, you should use the dsync option. Remember that without dsync, the OS may tell ASE that a write is complete without it ever having been written to the disk. In the event of a system crash, we could have some real problems, like data page changes being on disk without the accompanying log records having made it. The natural exception is tempdb devices. Here we don't care about recoverability, and therefore we can optimize write performance by running without dsync.


There are a few common questions and misconceptions about dsync that I want to clarify:


  1. A common misconception is that because dsync is “synchronous”, you can't do asynchronous I/O to the file. This is not true. This synchronous / asynchronous conflict is at a different level. With async i/o we are talking about the context in which the i/o is executed, i.e. whether the i/o blocks the caller or is done in a different context (see my first post on async i/o). With dsync we are talking about when the write() is considered complete. These are not mutually exclusive, and you can asynchronously do a data synchronous i/o. It is quite simple. The async portion is as always: the application issues an i/o and later polls (or is notified) for completion. The dsync portion means that the application won't be told that the I/O has completed until the data has made it to disk. (Note that the original implementation of KAIO in Linux did in fact block in the async i/o request unless direct i/o was being used.)

  2. Another question that occasionally comes up is whether or not dsync needs to be used for raw devices. Raw device i/o does not go through a file system, and therefore there is no caching of the i/o. When an application issues a write() to a raw device, the data goes directly to the disk. Therefore, dsync is out of context for raw. One way to think about it is: the safety that dsync provides is already guaranteed by raw i/o.

  3. Finally, some folks point out that even with dsync, we are not guaranteed that the write has hit the physical platter before the application is told it has completed. This is true. With dsync we consider a write complete when the driver / hardware says it has. With most disks / disk controllers / SANs, this means that the write has made it to the controller's cache. However, these devices guarantee that any i/o to the controller cache will eventually make it to the platter, so we don't worry about this.
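
To illustrate the first point above, that an i/o can be both asynchronous and data synchronous, here is a minimal Python sketch. It uses a thread pool to stand in for the "different context"; that is an assumption made for the example and is not the kernel async i/o (KAIO) mechanism ASE uses. The function name submit_dsync_write is hypothetical, and the O_SYNC fallback is only there for platforms that do not expose O_DSYNC.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

# Assumption: fall back to O_SYNC (or no flag) where O_DSYNC is not exposed.
DSYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def submit_dsync_write(pool: ThreadPoolExecutor, fd: int,
                       data: bytes, offset: int) -> Future:
    """Issue a write without blocking the caller.

    Async part: the write runs in a worker thread, so this returns at once.
    Dsync part: the future only completes after the O_DSYNC write returns,
    i.e. after the OS reports the data as written to the device.
    """
    return pool.submit(os.pwrite, fd, data, offset)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "device.dat"),
                     os.O_RDWR | os.O_CREAT | DSYNC_FLAG, 0o600)
        try:
            with ThreadPoolExecutor(max_workers=2) as pool:
                pending = submit_dsync_write(pool, fd, b"\0" * 2048, 0)
                # The caller is free to do other work here, then poll with
                # pending.done() or wait for completion with pending.result().
                print(pending.result())
        finally:
            os.close(fd)
```

The two properties live at different levels: the thread pool decides who waits for the write, while the open flag decides what "finished" means.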


No doubt many of you are wondering about the direct i/o option introduced in ASE 15.0, and I still haven't touched on raw vs. file system. I'll address those in my next post.


Cheers,

Dave

Posted by David Wein on January 29, 2007 8:05 AM

Comments

Greg Spielman:

Where I work, we really enjoyed your last three blog posts. We are eager for the remaining installments regarding device i/o.

(Also, I tried to create AvantGo RSS channels for you and two other bloggers, but it did not work. Can you bring this to somebody's attention? Thanks.)
