Poor DirectIO Reads
Poor Direct I/O read performance using Oracle with XFS on RHEL 7.1
Date: 03/01/2016
In a TPC-C benchmark test on a Superdome X running RHEL 7.1 with multiple Oracle databases installed, two of the 8 databases were experiencing poor performance, while the remaining 6 were performing as expected. The AWR reports showed severe database read performance issues for the two slow databases, even though their data resided on the same 3PAR storage array as the others.
Top 10 Foreground Events by Total Wait Time
| Event | Waits | Total Wait Time (sec) | Wait Avg(ms) | % DB time | Wait Class |
|---|---|---|---|---|---|
| db file scattered read | 126,247 | 13.5K | 107.05 | 48.0 | User I/O |
| db file sequential read | 128,698 | 13.5K | 104.90 | 48.0 | User I/O |
| enq: TX - row lock contention | 3,879 | 869.1 | 224.05 | 3.1 | Application |
| DB CPU | | 322.3 | | 1.1 | |
All of the database instances resided on XFS filesystems.
The LinuxKI Toolset was used to collect a 20-second KI dump of the system. The File Activity Report (kifile.*.txt) showed heavy mutex contention in __mutex_lock_slowpath() on one of the database files:
device: 0x0fd00000 node: 1716550918 fstype: REG filename: /oracle_data/tpccdata.dbf
System Call Name Count Rate ElpTime Avg Max Errs AvSz KB/s
readv 10485 524.2 910.428597 0.086832 0.393425 0 37954 19431.0
SLEEP 41089 2054.4 902.106968 0.021955
Sleep Func 10237 875.541484 0.085527 0.387676 __mutex_lock_slowpath
Sleep Func 9844 9.275561 0.000942 0.013671 rwsem_down_read_failed
Sleep Func 10475 8.958206 0.000855 0.015064 rwsem_down_write_failed
Sleep Func 10485 8.331718 0.000795 0.011906 do_blockdev_direct_IO
RUNQ 0.371456
CPU 7.459641
pread64 8793 439.6 748.182554 0.085088 0.390384 0 8192 3517.2
SLEEP 34473 1723.6 741.649100 0.021514
Sleep Func 8710 720.760638 0.082751 0.382259 __mutex_lock_slowpath
Sleep Func 8168 7.578238 0.000928 0.014525 rwsem_down_read_failed
Sleep Func 8771 7.568110 0.000863 0.014681 rwsem_down_write_failed
Sleep Func 8793 5.742113 0.000653 0.013306 do_blockdev_direct_IO
RUNQ 0.307918
CPU 6.080786
lseek 10531 526.5 0.010822 0.000001 0.000029 0
RUNQ 0.000003
CPU 0.000020
Subsequent analysis of one of the Oracle processes reading this file shows the top sleep stack traces:
PID 295368 oracleORCL
PPID 255353 bash
********* SCHEDULER ACTIVITY REPORT ********
RunTime : 0.761588 SysTime : 0.225303 UserTime : 0.536284
SleepTime : 19.006968 Sleep Cnt : 1268 Wakeup Cnt : 751
RunQTime : 0.006764 Switch Cnt: 1272 PreemptCnt : 4
Last CPU : 252 CPU Migrs : 63 NODE Migrs : 0
Policy : SCHED_NORMAL vss : 44006798 rss : 6492
...
Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
count wpct avg Stack trace
% msecs
===============================================================
80 43.68 103.783 __mutex_lock_slowpath mutex_lock xfs_file_aio_read do_sync_readv_writev
do_readv_writev vfs_readv sys_readv tracesys | sstk ksfd_skgfqio ksfd_skgfrvio ksfd_vio ksfdvread kcfrbd1
kcbz_table_scan_read
41 18.58 86.132 __mutex_lock_slowpath mutex_lock xfs_file_aio_read do_sync_read vfs_read
sys_pread64 tracesys | __pread_nocancel ksfd_skgfqio ksfd_io ksfdread kcfrbd1 kcbzib kcbgtcr ktrgtc2
29 11.60 75.996 SYSC_semtimedop sys_semtimedop tracesys | semop skgpwwait ksliwat kslwaitctx
ksqcmi ksqgtlctx ksqgelctx ktuGetTxForXid ktcwit1 kdddgb kdusru updrowFastPath
23 9.46 78.144 __mutex_lock_slowpath mutex_lock xfs_file_aio_read do_sync_read vfs_read
sys_pread64 tracesys | __pread_nocancel ksfd_skgfqio ksfd_io ksfdread kcfrbd1 kcbzib kcbgtcr ktrget2
19 6.98 69.874 __mutex_lock_slowpath mutex_lock xfs_file_aio_read do_sync_read vfs_read
sys_pread64 tracesys | __pread_nocancel ksfd_skgfqio ksfd_io ksfdread kcfrbd1 kcbzibmlt kcbzib kcbgtcr
Note that in the stack traces above, the mutex_lock() function is called from xfs_file_aio_read(). The nice thing about Linux is that the source code is readily available, and while reading kernel code can be intimidating, it is often not that difficult to follow what is going on. Below is the RHEL 7.1 source code for xfs_file_aio_read() where the locking is performed:
/*
 * Locking is a bit tricky here. If we take an exclusive lock
 * for direct IO, we effectively serialise all new concurrent
 * read IO to this file and block it behind IO that is currently in
 * progress because IO in progress holds the IO lock shared. We only
 * need to hold the lock exclusive to blow away the page cache, so
 * only take lock exclusively if the page cache needs invalidation.
 * This allows the normal direct IO case of no page cache pages to
 * proceeed concurrently without serialisation.
 */
xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
        xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);

        if (inode->i_mapping->nrpages) {
                ret = filemap_write_and_wait_range(
                                        VFS_I(ip)->i_mapping,
                                        pos, pos + size - 1);
                if (ret) {
                        xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
                        return ret;
                }

                /*
                 * Invalidate whole pages. This can return an error if
                 * we fail to invalidate a page, but this should never
                 * happen on XFS. Warn if it does fail.
                 */
                ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
                                        pos >> PAGE_CACHE_SHIFT,
                                        (pos + size - 1) >> PAGE_CACHE_SHIFT);
                WARN_ON_ONCE(ret);
                ret = 0;
        }
        xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
}
In the above code, if the file access is direct (IO_ISDIRECT) and the file has pages in the page cache (inode->i_mapping->nrpages is non-zero), the shared lock (XFS_IOLOCK_SHARED) is dropped and the exclusive lock (XFS_IOLOCK_EXCL) is obtained. Once the exclusive lock is held, the cached pages associated with the direct I/O are written back and invalidated. Note that not all of the file's pages are invalidated, only the ones that overlap the direct I/O, so every subsequent direct read that still finds nrpages non-zero has to take the exclusive lock again. Only if ALL of the file's pages eventually get pushed out of the file cache would performance return to normal.
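The condition above is hit when buffered and direct I/O are mixed on the same file: a single buffered read leaves pages in the page cache, and every later O_DIRECT read then finds nrpages non-zero. The stand-alone sketch below (not part of the article or the LinuxKI tooling; the default path and the 4 KiB block size are illustrative assumptions) reproduces that access pattern:

```c
/*
 * Minimal sketch of the access pattern that triggers the exclusive
 * XFS_IOLOCK path: one buffered read populates the page cache, then
 * O_DIRECT reads on the same file find i_mapping->nrpages != 0.
 * Illustrative assumption: path and block size below are examples only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ 4096

int main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "/oracle_data/tpccdata.dbf";
        char buf[BLKSZ];
        void *dbuf;
        int fd, i;

        /* Buffered read: leaves pages for this file in the page cache. */
        fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open buffered"); return 1; }
        if (read(fd, buf, sizeof(buf)) < 0)
                perror("buffered read");
        close(fd);

        /* Direct reads: on affected kernels, each of these must take the
         * exclusive iolock as long as any page of the file is cached. */
        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open O_DIRECT"); return 1; }
        if (posix_memalign(&dbuf, BLKSZ, BLKSZ)) { close(fd); return 1; }

        for (i = 0; i < 1000; i++) {
                ssize_t n = pread(fd, dbuf, BLKSZ, (off_t)i * BLKSZ);
                if (n < 0) { perror("direct pread"); break; }
                if (n == 0) break;      /* end of file */
        }

        free(dbuf);
        close(fd);
        return 0;
}
```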
Linux versions impacted
In Linux version 3.16 and earlier, the cached pages were invalidated from the position of the Direct I/O to the end of the file. Starting with Linux version 3.17, only the pages overlapping the Direct I/O were invalidated, so each new Direct I/O removed only a small portion of the cached pages, and the exclusive lock had to be taken again and again until the entire file was eventually pushed out of the cache, which may never happen.
In Linux version 4.3, when the first Direct I/O is performed, the entire file is forced out of the cache, thus avoiding the exclusive lock on subsequent Direct I/Os.
RHEL 7.1 follows the 3.17 behavior; see also http://oss.sgi.com/archives/xfs/2015-08/msg00296.html. RHEL 7.0 follows the 3.16 behavior and can still suffer from the same issue. RHEL 7.2 follows the 4.3 behavior and does not exhibit the issue.
SLES 12 and earlier follow the 3.16 behavior.
A quick solution to push the database pages out of the cache is to drop all of the file cache pages (as root):
# echo 3 > /proc/sys/vm/drop_caches
While this will resolve the Direct I/O performance problem, note that ALL cached pages for every filesystem are dropped. So if other files are accessed through the page cache rather than with Direct I/O, removing their pages from the cache may have other impacts.
If you want to impact only the one filesystem in question, you will need to stop the database, unmount the filesystem, and mount it again.
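To check whether a particular file still has pages in the page cache (and therefore whether direct reads will still hit the exclusive-lock path), the file's resident pages can be counted with mmap() and mincore(). This is a stand-alone illustrative sketch, not part of the LinuxKI tools:

```c
/*
 * Count how many of a file's pages are resident in the page cache
 * using mmap() + mincore(). A non-zero count means i_mapping->nrpages
 * is non-zero, i.e. direct reads still take the exclusive iolock path.
 * Stand-alone illustrative sketch; not part of the LinuxKI tools.
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }
        if (st.st_size == 0) { printf("empty file\n"); close(fd); return 0; }

        long pagesz = sysconf(_SC_PAGESIZE);
        size_t npages = ((size_t)st.st_size + pagesz - 1) / pagesz;

        /* Map the file; the mapping itself does not fault pages in. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        unsigned char *vec = malloc(npages);
        if (!vec) { munmap(map, st.st_size); close(fd); return 1; }

        /* mincore() reports which pages of the mapping are resident. */
        if (mincore(map, st.st_size, vec) < 0) {
                perror("mincore");
        } else {
                size_t resident = 0;
                for (size_t i = 0; i < npages; i++)
                        if (vec[i] & 1)
                                resident++;
                printf("%zu of %zu pages resident in page cache\n",
                       resident, npages);
        }

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
}
```

Running this against the database file before and after dropping the caches should show the resident page count fall to zero.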
Another workaround exists for Linux distributions that follow the 3.16 behavior (RHEL 7.0 and earlier, SLES 12 and earlier). Since the pages are invalidated from the position of the Direct I/O operation to the end of the file, you can simply do a single Direct I/O read from the file at file offset 0. For example:
$ dd if=<filename> of=/dev/null bs=4k count=1 iflag=direct
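The same single direct read at offset 0 can also be issued from a program instead of dd; a minimal sketch is shown below. The 4 KiB block size mirrors the dd example and is an assumption; O_DIRECT requires the buffer, offset, and length to be aligned to the device's logical block size.

```c
/*
 * Programmatic equivalent of
 *   dd if=<file> of=/dev/null bs=4k count=1 iflag=direct
 * one 4 KiB O_DIRECT read at offset 0. On kernels with the 3.16-style
 * behavior this invalidates the cached pages from offset 0 to EOF.
 * Illustrative sketch; the 4 KiB block size is an assumption.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        void *buf;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open O_DIRECT"); return 1; }

        /* O_DIRECT requires an aligned buffer. */
        if (posix_memalign(&buf, 4096, 4096)) { close(fd); return 1; }

        if (pread(fd, buf, 4096, 0) < 0)
                perror("pread");

        free(buf);
        close(fd);
        return 0;
}
```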
To resolve the problem permanently, upgrade to RHEL 7.2 or to a Linux kernel that follows the 4.3 or later behavior.