Solving Oracle Log File Sync Waits Caused by High Commit Frequency

By: Richard Niemiec


High commit frequency is the number one cause of foreground log file sync waits. Find out whether the session that spends a lot of time on the log file sync event belongs to a batch or OLTP process, or whether it is a persistent connection from a middle-tier application server (Tuxedo, WebLogic, etc.).

If the session belongs to a batch process, it may be committing each database change inside a loop. Discover the module name and ask the developer to review the code to see if the number of commits can be reduced. This is an application issue, and the solution is simply to eliminate unnecessary commits and reduce the overall commit frequency.
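One way to discover who is waiting is to look at the active sessions directly. This is a sketch, not the only approach: the EVENT column in V$SESSION exists in Oracle 10g and later; on earlier releases, join V$SESSION to V$SESSION_WAIT on SID instead.

```sql
-- Find sessions currently waiting on 'log file sync' and the module
-- (set via DBMS_APPLICATION_INFO) responsible for them.
SELECT sid, username, module, action
FROM   v$session
WHERE  event = 'log file sync';
```

Catching the wait in flight may take several samples, since log file sync waits are usually short-lived.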

Some application developers have learned that if they commit infrequently, jobs may fail due to rollback segments running out of space, and they get calls in the middle of the night. Those who have been bitten by deadlocks may have been told to commit more frequently. Naturally, they become commit-happy people. The right thing to do is to properly define what a transaction is and commit at the end of each transaction. A transaction is a unit of work. A unit of work should either succeed or fail in its entirety. The proper place for a commit or rollback is at the end of each unit of work. Do not introduce additional commits for the sake of rollback segment space or deadlocks. That is treating the symptom and not the problem. If none of the existing rollback segments can handle the unit of work, then you as the DBA should provide one that will. (If you are using Automatic Undo Management, introduced in Oracle9i Database, then allocate enough space to the undo tablespace and set an appropriate undo retention time.)

Introducing additional commits can create other problems, among them, the infamous ORA-01555: snapshot too old error because the rollback (or undo) data can be overwritten. A high commit frequency also increases the overhead that is associated with starting and ending transactions. At the beginning of each transaction, Oracle assigns a rollback segment (called a rollback segment binding) and updates the transaction table in the rollback segment header. The transaction table must also be updated at the end of each transaction, followed by a commit cleanout activity. Updates to the rollback segment headers must also be recorded in the log buffer because the blocks are dirtied. Therefore, make the necessary adjustments in the application so that it only commits at the end of a unit of work. The commit statement that is inside a loop may need to be moved out of the loop so that the job commits only once instead of once in every loop iteration.
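The loop restructuring described above can be sketched as follows. The table and column names here are hypothetical, purely for illustration:

```sql
-- Hypothetical batch job: the commit belongs after the loop, not inside it.
BEGIN
  FOR r IN (SELECT order_id FROM orders WHERE status = 'PENDING') LOOP
    UPDATE orders SET status = 'PROCESSED' WHERE order_id = r.order_id;
    -- COMMIT;   -- committing here, once per row, forces a log file sync
    --           -- wait on every iteration
  END LOOP;
  COMMIT;         -- one commit at the end of the unit of work
END;
/
```

If the driving result set is too large for the available undo, the answer is to size the undo properly or break the job into well-defined units of work, not to sprinkle commits through the loop.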

If the session that spends a lot of time on the log file sync event is a persistent connection from a middle-tier layer, then this is a tough case because the connection services many front-end users. You have to trace the session with event 10046 and observe the application behavior over time. Look for the log file sync event in the trace file. It may give you an idea of the commit frequency. Alternatively, you can mine the redo log files with Oracle LogMiner. This will show you the systemwide commit behavior.
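Both techniques can be driven from SQL*Plus. This is a sketch: DBMS_MONITOR is available from Oracle 10g onward (earlier releases use DBMS_SUPPORT or the 10046 event directly), and the redo log file path shown is hypothetical.

```sql
-- Enable a level-8 (waits) 10046 trace on the middle-tier session,
-- identified by its SID and serial number from V$SESSION.
EXECUTE DBMS_MONITOR.SESSION_TRACE_ENABLE( -
  session_id => :sid, serial_num => :serial, -
  waits => TRUE, binds => FALSE);

-- ... let it run, then:
EXECUTE DBMS_MONITOR.SESSION_TRACE_DISABLE( -
  session_id => :sid, serial_num => :serial);

-- Alternatively, mine a redo log to count systemwide commits.
-- Replace the file name with one of your own online or archived logs.
EXECUTE DBMS_LOGMNR.ADD_LOGFILE( -
  logfilename => '/u01/oradata/ORCL/redo01.log', -
  options     => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.START_LOGMNR( -
  options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

SELECT COUNT(*) FROM v$logmnr_contents WHERE operation = 'COMMIT';

EXECUTE DBMS_LOGMNR.END_LOGMNR;
```

Grouping the V$LOGMNR_CONTENTS rows by session or by time window gives a clearer picture of which connections commit most aggressively.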

In OLTP databases, you normally notice a high log file sync wait time at the system level (V$SYSTEM_EVENT) but not at the session level. The high system-level wait time may be driven by many short transactions from OLTP sessions that actively log in and out of the database. If this is your scenario, then about the only thing you can do in the database is to ensure a smooth I/O path for the LGWR process. This includes using asynchronous I/O and putting your log files on raw devices, or an equivalent such as Veritas Quick I/O, serviced by dedicated I/O controllers—or better yet, using solid state disks for the log files. (James Morle has an excellent article and benchmark numbers on solid state disks for redo logs at http://www.oaktable.net/fullArticle.jsp?id=5) However, you only need to do this if the log file parallel write average wait time is bad. Otherwise, there isn’t much benefit. The real solution has to come from the application.

Slow I/O Subsystem

Query the V$SYSTEM_EVENT view as follows to discover the LGWR average wait time (log file parallel write event). An average wait time of 10ms (1 centisecond) or less is generally acceptable.

Fig 7-1
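A query along these lines produces the sort of listing Fig 7-1 shows; this is a sketch of the idea, not the original listing. Note that AVERAGE_WAIT in V$SYSTEM_EVENT is reported in centiseconds.

```sql
-- Average wait times for LGWR writes and session commits, in milliseconds.
SELECT event,
       total_waits,
       time_waited,                   -- centiseconds
       average_wait * 10 AS avg_ms    -- centiseconds converted to ms
FROM   v$system_event
WHERE  event IN ('log file sync', 'log file parallel write');
```

Remember that V$SYSTEM_EVENT accumulates statistics since instance startup; for a current picture, take two snapshots and compute the deltas.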
A higher system I/O throughput can improve the average wait time of the log file sync and log file parallel write events. However, this is not an excuse for not fixing the application if it is poorly designed and is committing too frequently. You may be tempted to mess with the database layout and I/O subsystem, perhaps because it is difficult to deal with the application group or it is a third-party application. However, just because it is a third-party application does not give the vendor the right to throw junk code at your database, right? Remember, you cannot solve an application problem from the database. Any change you make in the database may give your users the impression that it is a database problem. Besides, the problem won’t go away because the root cause is in the application.

That said, there are many things that you and your system administrator can do to provide some relief to the log file sync wait by increasing the I/O throughput to the log files. This includes using a Fibre Channel (FC) connection to databases on a SAN (storage area network); Gigabit Ethernet (Gig-E) or InfiniBand connections to databases on NAS (network attached storage); ultrawide SCSI or FC connections to databases on DAS (direct attached storage); private networks; high-speed switches; dedicated I/O controllers; asynchronous I/O; placing your log files on raw devices and binding the LUNs in RAID 0 or 0+1 instead of RAID 5; and so on. The questions you have to ask are which business unit is paying for the new hardware and how much faster the commits will be. Let’s take the preceding listing, for example. The log file parallel write average wait baseline is 6.1ms, and the log file sync average wait baseline is 9.9ms. What numbers can you expect to get with the new hardware—25 percent improvement, 50 percent improvement? How does a 50 percent improvement in the average log file sync wait translate to the session response time? Will the users be happy? You must weigh the hardware cost against the potential benefits.

In practice, you often have to compromise because of company policies and because you don’t make hardware decisions. Your log files may be on RAID 5 LUNs that are shared with other systems. This is especially true in SAN environments. The storage companies are not helping either by rolling out large-capacity disk drives. Most likely, your management makes storage decisions based on cost per MB instead of performance. As a result, you end up with fewer disks. Furthermore, you may be required to drive disk utilization up into the 90th percentile. Believe us, we feel your pain!

 

Comments

    • Hi Venkatesh,
      You can find the module (and action or other session information) by looking at V$SESSION during the time that the ‘log file sync’ wait event is occurring. If you run “select username, module, action from v$session where event = 'log file sync';” you should get some idea of who is causing it.
