NUMA Balancing
Oracle Performance degrades after upgrading to RHEL 7
Dated: 04/06/2015
After upgrading from RHEL 6 to RHEL 7, some Oracle customers have reported degraded performance, especially on large Non-Uniform Memory Access (NUMA) systems like the DL980 or Superdome X servers.
During normal Oracle processing, LinuxKI trace data was collected and analyzed. Analysis of the individual Oracle processes shows that they spent a large portion of their time sleeping in sleep_on_page(). Further review of the stack traces for this wait event shows that the threads go to sleep while handling a page fault, waiting for a page to be migrated from one NUMA node to another.
PID 150089 oraclePROC1
PPID 1 /usr/lib/systemd/system
********* SCHEDULER ACTIVITY REPORT ********
RunTime : 1.240211 SysTime : 0.300939 UserTime : 0.939272
SleepTime : 18.566175 Sleep Cnt : 11321 Wakeup Cnt : 629
RunQTime : 0.182789 Switch Cnt: 11714 PreemptCnt : 393
Last CPU : 43 CPU Migrs : 4330 NODE Migrs : 858
Policy : SCHED_NORMAL vss : 27916259 rss : 7205
busy : 6.20%
sys : 1.51%
user : 4.70%
runQ : 0.91%
sleep : 92.88%
Kernel Functions calling sleep() - Top 20 Functions
Count Pct SlpTime Slp% TotalTime% Msec/Slp MaxMsecs Func
806 7.12% 5.9461 32.03% 29.75% 7.377 80.777 sleep_on_page
1669 14.74% 5.5614 29.95% 27.82% 3.332 75.438 do_blockdev_direct_IO
6728 59.43% 3.9617 21.34% 19.82% 0.589 159.637 sk_wait_data
1958 17.30% 2.7493 14.81% 13.75% 1.404 139.797 poll_schedule_timeout
126 1.11% 0.2368 1.28% 1.18% 1.879 73.157 read_events
32 0.28% 0.1013 0.55% 0.51% 3.167 10.763 __mutex_lock_slowpath
1 0.01% 0.0095 0.05% 0.05% 9.541 9.541 sleep_on_page_killable
Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
count wpct avg Stack trace
% msecs
===============================================================
1668 29.90 3.328 do_blockdev_direct_IO __blockdev_direct_IO blkdev_direct_IO generic_file_aio_read
do_sync_read vfs_read sys_pread64 tracesys | __pread_nocancel ksfd_skgfqio ksfd_io ksfdread kcfrbd1
kcbzib kcbgtcr
6621 21.33 0.598 sk_wait_data tcp_recvmsg inet_recvmsg sock_aio_read.part.7 sock_aio_read
do_sync_read vfs_read sys_read tracesys | __read_nocancel nttfprd nsbasic_brc nsbrecv nioqrc opikndf2
1907 14.16 1.379 poll_schedule_timeout do_sys_poll sys_poll tracesys | __poll_nocancel
sskgxp_selectex skgxpiwait skgxpwaiti skgxpwait ksxpwait ksliwat kslwaitctx ksxprcv_int ksxprcvimdwctx
kclwcrs
100 4.49 8.332 sleep_on_page __wait_on_bit wait_on_page_bit __migration_entry_wait.isra.37
migration_entry_wait handle_mm_fault __do_page_fault do_page_fault page_fault | lxeg2u ldxdts evadis
evaopn2 qerixGetKey qerixStart
84 3.53 7.805 sleep_on_page __wait_on_bit wait_on_page_bit __migration_entry_wait.isra.37
migration_entry_wait handle_mm_fault __do_page_fault do_page_fault page_fault | lxeg2u ldxdts evadis
evaopn2 qerixGetKey qerixStart
82 3.45 7.802 sleep_on_page __wait_on_bit wait_on_page_bit __migration_entry_wait.isra.37
migration_entry_wait handle_mm_fault __do_page_fault do_page_fault page_fault | ttcpip opitsk opiino
62 2.41 7.218 sleep_on_page __wait_on_bit wait_on_page_bit __migration_entry_wait.isra.37
migration_entry_wait handle_mm_fault __do_page_fault do_page_fault page_fault | kpobii kpobav opibvg
opiexe opiefn opiodr
44 1.28 5.396 sleep_on_page __wait_on_bit wait_on_page_bit __migration_entry_wait.isra.37
migration_entry_wait handle_mm_fault __do_page_fault do_page_fault page_fault | opitsk opiino
...
In addition, the kiprof output shows CPU time being spent in the page migration code itself...
non-idle GLOBAL HARDCLOCK STACK TRACES (sort by count):
Count Pct Stack trace
============================================================
416 0.82% get_gendisk blkdev_get raw_open chrdev_open do_dentry_open finish_open do_last path_openat
do_filp_open do_sys_open sys_open tracesys
382 0.75% __blk_run_queue __elv_add_request blk_insert_cloned_request dm_dispatch_request dm_request_fn
__blk_run_queue queue_unplugged blk_flush_plug_list blk_finish_plug do_blockdev_direct_IO __blockdev_direct_IO
blkdev_direct_IO generic_file_aio_read do_sync_read vfs_read sys_pread64
379 0.74% remove_migration_pte rmap_walk migrate_pages migrate_misplaced_page do_numa_page handle_mm_fault
__do_page_fault do_page_fault page_fault
362 0.71% __page_check_address try_to_unmap_one try_to_unmap_file try_to_unmap migrate_pages
migrate_misplaced_page do_numa_page handle_mm_fault __do_page_fault do_page_fault page_fault
344 0.67% __mutex_lock_slowpath mutex_lock try_to_unmap_file try_to_unmap migrate_pages
migrate_misplaced_page do_numa_page handle_mm_fault __do_page_fault do_page_fault page_fault
304 0.60% syscall_trace_leave int_check_syscall_exit_work
269 0.53% syscall_trace_enter tracesys
257 0.50% __schedule schedule schedule_timeout sk_wait_data tcp_recvmsg inet_recvmsg sock_aio_read.part.7
sock_aio_read do_sync_read vfs_read sys_read tracesys
Root Cause
In RHEL 7, code was added to make the Linux kernel more NUMA-aware by automatically migrating a task's pages to the NUMA node where the task is running. However, given the size of the Oracle SGA and the frequency with which the Oracle tasks migrate between nodes, this page migration becomes costly.
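Before applying the workaround, it can be worth confirming that automatic NUMA balancing is enabled and is actively migrating pages while the workload runs. One way to check is to read the kernel.numa_balancing setting and the NUMA hinting-fault counters exported in /proc/vmstat on kernels built with CONFIG_NUMA_BALANCING; a steadily increasing numa_pages_migrated matches the behavior seen in the traces above:
$ cat /proc/sys/kernel/numa_balancing
$ grep -E 'numa_pte_updates|numa_hint_faults|numa_pages_migrated' /proc/vmstat
Sampling the /proc/vmstat counters a few seconds apart shows how quickly pages are being migrated between nodes.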
Solution
The NUMA page migration code can be disabled by setting the kernel.numa_balancing parameter to 0 (as root):
# sysctl -w kernel.numa_balancing=0
or
# echo 0 > /proc/sys/kernel/numa_balancing
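The sysctl setting above does not persist across a reboot. One way to make it permanent, using the standard sysctl configuration mechanism (the file name below is only an example), is to place the setting in a file under /etc/sysctl.d/ and reload it:
# echo "kernel.numa_balancing = 0" > /etc/sysctl.d/90-numa-balancing.conf
# sysctl -p /etc/sysctl.d/90-numa-balancing.conf
Alternatively, automatic NUMA balancing can be disabled at boot time with the numa_balancing=disable kernel command-line parameter.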