-
Notifications
You must be signed in to change notification settings - Fork 51
SAP DB2 semget
Poor Performance for Linux SAP clients accessing DB2 database server
Date: 01/08/2014
When using an application running on a remote DB2 client such as SAP, poor performance can be observed when accessing the DB2 database due to a high number of failed semget() calls used by the internal trace facility (db2trc), even if tracing facility is not enabled.
In a recent benchmark, a customer was comparing the performance of their SAP application running on a remote DB2 client using both an HP DL380 (Linux) server and an IBM P770 (AIX) server. While the P770 server had a higher clock speed (3.1 GHz) than the DL380 (2.2 GHz), the performance of the DL380 was 2.5 to 3 times worse, which is worse than expected.
While the CPU speed was considered a primary factor, Linux KI trace data was collected for 40 seconds with the runki script and analyzed to understand how the SAP client processes spent their time. Below is edited output from the PID Analysis Report (kiinfo -kipid) for one of the SAP client processes:
PID 1026 dw.sapRU1_D00
PPID 7939 dw.sapRU1_D00
********* SCHEDULER ACTIVITY REPORT ********
RunTime : 6.187232 SysTime : 0.451327 UserTime : 5.735905
SleepTime : 31.581517 Sleep Cnt : 30901 Wakeup Cnt : 26398
RunQTime : 2.191165 Switch Cnt: 35158 PreemptCnt : 4257
Last CPU : 2 CPU Migrs : 7346 NODE Migrs : 4128
Policy : SCHED_NORMAL vss : 8060548 rss : 58919
busy : 15.48%
sys : 1.13%
user : 14.35%
runQ : 5.48%
sleep : 79.03%
Note that while the CPU is slower, the process spends most of its time sleeping. The System Call Report for this process shows the a lot of time spent sleeping in rwsem_down_failed_common():
******** SYSTEM CALL REPORT ********
System Call Name Count Rate ElpTime Avg Max Errs AvSz KB/s
semget 27022 676.2 26.619859 0.000985 0.006248 27022
SLEEP 26305 658.3 24.837537 0.000944
Sleep Func 26304 24.837537 0.000944 0.006212 rwsem_down_failed_common
RUNQ 1.462762
CPU 0.319563
recvfrom 4649 116.3 7.017916 0.001510 0.110619 60 761 86.5
SLEEP 4535 113.5 6.711683 0.001480
Sleep Func 4535 6.711683 0.001480 0.110599 sk_wait_data
RUNQ 0.250706
CPU 0.055527
sendto 4485 112.2 0.073516 0.000016 0.001277 0 686 75.2
RUNQ 0.004954
CPU 0.068562
Note that the process made 27022 semget() calls that all failed (27022 errors). Many of the calls to semget() would sleep trying to obtain an exclusive lock on the internal semaphore's rwsem lock. The lock contention is due to many SAP processes calling semget() in parallel. Looking at the individual kitrace records for the processes, we can verify that all the semget() calls fail with an errno 2 (ENOENT) using the same semaphore key value:
0.100159 cpu=2 pid=1026 recvfrom ret=0xd68 syscallbeg= 0.000802 sockfd=9 *buf=0x2b035e9d5fe0 len=131072
flags=0x0 *src_addr=0x0 *addrlen=0x0
0.100908 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.000725 key=0x10cabb74 nsems=1 semflag=0x0
0.101352 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.000441 key=0x10cabb74 nsems=1 semflag=0x0
0.102538 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001183 key=0x10cabb74 nsems=1 semflag=0x0
0.103752 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001210 key=0x10cabb74 nsems=1 semflag=0x0
0.104775 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001020 key=0x10cabb74 nsems=1 semflag=0x0
0.106332 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001554 key=0x10cabb74 nsems=1 semflag=0x0
0.106633 cpu=3 pid=1026 writev ret=0x224 syscallbeg= 0.000016 fd=16 *iov=0x7fffa5a18230 iovcnt=2
This ENOENT error indicates that the semaphore with the key 0x10cabb74 does not exist and the IPC_CREAT flag is not set in the semflag parameter. To identify why the SAP process is trying to obtain a semaphore using a non-existent key, gstack/pstack was used to identify user-space stack trace when the semget() call was made:
#0 0x00000033baaea777 in semget () from /lib64/libc.so.6
#1 0x00002baee1b3c959 in sharedMutexAttach(TRC_SHARED_IPC_INFO_T*, int*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#2 0x00002baee1b3bb60 in sharedAttach(TRC_SHARED_IPC_INFO_T*, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#3 0x00002baee1b37c80 in trcAttachShared () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#4 0x00002baee1ea4eeb in internalProcessTraceInit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#5 0x00002baee1ea4cea in sqltinit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#6 0x00002baee2081f9d in CLI_scrollGetDescAttr(db2UCinterface*, dataDescriptor**, unsigned long*, unsigned char) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#7 0x00002baee1c5145c in CLI_sqlCompleteCallback(db2UCinterface*, int) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#8 0x00002baee1e09871 in sqljrRqstComplete(sqljrDrdaArCb*, db2UCinterface*, int, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#9 0x00002baee22ca8d1 in sqljrParseOpenQueryReply(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#10 0x00002baee22e1f45 in sqljrParse(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#11 0x00002baee22d397f in sqljrDrdaArOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#12 0x00002baee21255d7 in csmOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#13 0x00002baee1c593ab in CLI_sqlOpen(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#14 0x00002baee1fe112f in SQLExecute2(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#15 0x00002baee1ffc245 in SQLExecute () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so...
From the stack trace, the semaphore appears to be defined in the SQL tracing facility for DB2 (libdb2.so). Note the call to internalProcessTraceInit() above.
Note that the P770 system has the same problem with the failing semget() calls, but the semget() calls on AIX do not suffer from the same rwlock contention in the kernel code that Linux does. So what is a minor problem on AIX is a major problem on Linux.
Note that the Kparse report (kiinfo -kparse) will detect the issue:
- Warning: SAP worker task with many semget() errors (Freq > 50)
The problem and its solution are documented by IBM in the following article:
https://www-304.ibm.com/support/docview.wss?rs=71&uid=swg21209523&wv=1
The solution is to issue the following command at the DB2 client to allocate the semaphore in question.
db2trc alloc
Take special care that the right db2trc binary is used if multiple library versions are present on the system. Customer should consult with IBM if further details are needed.
After implementing the db2trc alloc, the performance of the DL380 was 30% faster than the P770, despite the difference in the CPU clock speed for this specific test.
- LinuxKI Mainpage
- LinuxKI Basic Documentation
- LinuxKI 7.10 - New!!
- LinuxKI Video Series
-
LinuxKI Warnings
- High System CPU utilization during memory allocations, deallocations, and page faults
- RunQ delays for critical processes can impact performance in a variety of ways
- Performance degradation on Microsoft Windows due to TCP interrupt timeouts
- Microsoft SQLServer scaling issues caused by SQL auto statistics
- Excessive page faults on KVM host
- Large IOs (>1MB) causing performance degradation on servers with PCIe Smart Array Controllers
- Oracle column tracking causing high CPU usage by Oracle processes
- Side-Channel Attack mitigation
- High SYS CPU time by processes reading /proc/stat such
- hugetlb_fault lock contention
- Excessive CPU time in pcc_cpufreq driver
- Excessive poll() calls by Oracle
- High wait time in md_flush()
- High BLOCK SoftIRQ times
- Network Latency Tuned profile
- Power vs. Performance
- Unaligned Direct IO
- NUMA Balancing
- NUMA Off
- SAP DB2 semget
- Semaphore Lock Scaling
- Tasklet IRQs
- Unterminated ixgbe NICs
- Poor Direct IO Reads
- RHEL 7.3 / SLES 12SP2 Multipath bug
- Barrier Writes