ZFS for Database Log Files
- by user12620111
I've been troubled by drop outs in CPU usage on my application server, 
characterized by the CPUs suddenly going from close to 90% busy to 
almost completely idle for a few seconds. Here is an example of a 
drop out, shown in a snippet of vmstat data taken while the 
application server is under a heavy workload.

# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 1 0 0 130160176 116381952 0 16 0 0 0 0  0  0  0  0  0 207377 117715 203884 70 21 9
 12 0 0 130160160 116381936 0 25 0 0 0 0 0  0  0  0  0 200413 117162 197250 70 20 9
 11 0 0 130160176 116381920 0 16 0 0 0 0 0  0  1  0  0 203150 119365 200249 72 21 7
 8 0 0 130160176 116377808 0 19 0 0 0 0  0  0  0  0  0 169826 96144 165194 56 17 27
 0 0 0 130160176 116377800 0 16 0 0 0 0  0  0  0  0  1 10245 9376 9164 2  1 97
 0 0 0 130160176 116377792 0 16 0 0 0 0  0  0  0  0  2 15742 12401 14784 4 1 95
 0 0 0 130160176 116377776 2 16 0 0 0 0  0  0  1  0  0 19972 17703 19612 6 2 92
 14 0 0 130160176 116377696 0 16 0 0 0 0 0  0  0  0  0 202794 116793 199807 71 21 8
 9 0 0 130160160 116373584 0 30 0 0 0 0  0  0 18  0  0 203123 117857 198825 69 20 11 
This behavior occurred consistently while the application server was 
processing synthetic transactions: HTTP requests from JMeter running on 
an external machine. 
I explored many theories trying to explain the drop outs, including: 

  - Unexpected JMeter behavior
  - Network contention
  - Java Garbage Collection
  - Application Server thread pool problems
  - Connection pool problems
  - Database transaction processing
  - Database I/O contention

Graphing the CPU %idle led to a breakthrough.
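For reference, here is a minimal sketch of one way to pull the %idle column out of vmstat for plotting; it is an assumption about the tooling rather than the exact commands used:

  # vmstat 1 | awk '$1 ~ /^[0-9]+$/ { print ++n, $NF }' > cpu_idle.dat

The pattern match skips vmstat's repeated header lines, leaving a sample number and the last column ("id") that any plotting tool can graph.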
Several of the drop outs were 30 seconds apart. With that insight, I 
went digging through the data again, looking for other outliers that 
were 30 seconds apart. In the database server statistics, I found spikes 
in the iostat "asvc_t" column (average service time of active disk 
transactions, in milliseconds) for the disk drive that was being used 
for the database log files. 
Here is an example:

                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 2053.6    0.0  8234.3  0.0  0.2    0.0    0.1   0  24 c3t60080E5...F4F6d0s0
    0.0 2162.2    0.0  8652.8  0.0  0.3    0.0    0.1   0  28 c3t60080E5...F4F6d0s0
    0.0 1102.5    0.0 10012.8  0.0  4.5    0.0    4.1   0  69 c3t60080E5...F4F6d0s0
    0.0   74.0    0.0  7920.6  0.0 10.0    0.0  135.1   0 100 c3t60080E5...F4F6d0s0
    0.0  568.7    0.0  6674.0  0.0  6.4    0.0   11.2   0  90 c3t60080E5...F4F6d0s0
    0.0 1358.0    0.0  5456.0  0.0  0.6    0.0    0.4   0  55 c3t60080E5...F4F6d0s0
    0.0 1314.3    0.0  5285.2  0.0  0.7    0.0    0.5   0  70 c3t60080E5...F4F6d0s0
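The output above is iostat's extended device statistics. A typical way to collect it looks something like this (the interval is an assumption, since the original is not shown, and -p can be added for per-slice numbers):

  # iostat -xnz 10

Here -x requests extended statistics, -n prints descriptive device names, and -z suppresses devices that had no activity during the interval.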
Here is a little more information about my database configuration: 

  - The database and application server were running on two different SPARC servers.
  - Storage for the database was on a storage array connected via 8 gigabit Fibre Channel.
  - Data and log files were on different physical disk drives.
  - Reliable low-latency I/O is provided by battery-backed NVRAM.
  - Highly available:
      - Two Fibre Channel links accessed via MPxIO (see the path check sketched after this list)
      - Two mirrored cache controllers
      - The log file physical disks were mirrored in the storage device
  - Database log files were on a ZFS filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming.
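As an aside, the MPxIO path state for a LUN can be checked with mpathadm. This is a hypothetical check rather than output captured from this setup, with the device path abbreviated the same way as elsewhere in this post:

  # mpathadm list lu
  # mpathadm show lu /dev/rdsk/c3t60080E5...F4F6d0s2

"list lu" reports each logical unit with its total and operational path counts, and "show lu" details the initiator and target ports behind each path.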
   
Why would I be getting service time spikes in my high-end storage? 
First, I wanted to verify that the database log disk service time spikes 
aligned with the application server CPU drop outs, and they did.
At first, I guessed that the disk service time spikes might be 
related to flushing the write cache on the storage device, but I 
was unable to validate that theory.
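One way to test the cache-flush theory, following the "Cache Flushes" section of the ZFS Evil Tuning Guide referenced below, is to stop ZFS from issuing SCSI cache-flush commands. That is only safe when the array cache is nonvolatile, and I did not end up applying it, but the /etc/system entry would look like this:

  * Tell ZFS not to send cache-flush commands; safe only with a battery-backed cache
  set zfs:zfs_nocacheflush = 1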
  After searching the WWW for a while, I decided to try using a separate log device:  
  # zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0  
The ZFS log device is configured in the same manner described above: two physical disks mirrored in the storage array.
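If the array had not already mirrored the log LUN, the redundancy could instead have been provided by ZFS by mirroring the log vdev itself. A hypothetical example with made-up device names:

  # zpool add ZFS-db-41 log mirror c3t...AAAAd0 c3t...BBBBd0

Losing an unmirrored dedicated log device can cost the most recent synchronous writes, so one layer or the other should supply the mirroring.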
This change to the database storage configuration eliminated the application server CPU drop outs.
  Here is the zpool configuration:  
# zpool status ZFS-db-41
  pool: ZFS-db-41
 state: ONLINE
  scan: none requested
config:

        NAME                   STATE
        ZFS-db-41              ONLINE
          c3t60080E5...F4F6d0  ONLINE
        logs
          c3t60080E5...F8A7d0  ONLINE
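With the separate log vdev in place, per-vdev traffic can be watched directly to confirm that the log writes and the transaction group flushes land on different devices (the 5-second interval is just an example):

  # zpool iostat -v ZFS-db-41 5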
   
  Now, the I/O spikes look like this:  
   
                      extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1053.5    0.0 4234.1  0.0  0.8    0.0    0.7   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1131.8    0.0 4555.3  0.0  0.8    0.0    0.7   0  76 c3t60080E5...F8A7d0s0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1167.6    0.0 4682.2  0.0  0.7    0.0    0.6   0  74 c3t60080E5...F8A7d0s0
    0.0  162.2    0.0 19153.9  0.0  0.7    0.0    4.2   0  12 c3t60080E5...F4F6d0s0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1247.2    0.0 4992.6  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0
    0.0   41.0    0.0   70.0  0.0  0.1    0.0    1.6   0   2 c3t60080E5...F4F6d0s0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1241.3    0.0 4989.3  0.0  0.8    0.0    0.6   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1193.2    0.0 4772.9  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0  
We can see the steady flow of 4K writes to the ZIL device from the database's O_SYNC log file writes (for example, 4772.9 KB/s over 1193.2 writes per second is about 4 KB per write). The occasional bursts of large writes to the data LUN, such as 19153.9 KB/s at only 162.2 writes per second, are the transaction group being flushed.
  Like almost all problems that I run into, once I thoroughly 
understand the problem, I find that other people have documented similar
 experiences. Thanks to all of you who have documented alternative approaches.  
  Saved for another day: now that the problem is obvious, I should try 
"zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning 
Guide. 
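For reference, that tunable is also an /etc/system setting. A sketch of the form it would take, where 32768 bytes (32 KB) is just the documented default rather than a tested recommendation for this workload:

  * zfs_immediate_write_sz threshold in bytes; 32768 (32 KB) is the default
  set zfs:zfs_immediate_write_sz = 32768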
References: 

  - The ZFS Intent Log
  - Solaris ZFS, Synchronous Writes and the ZIL Explained
  - ZFS Evil Tuning Guide: Cache Flushes
  - ZFS Evil Tuning Guide: Tuning ZFS for Database Performance