Handling Oracle Data Guard Disconnects

on October 1, 2013

The speed and latency of the network used for log transport can have a considerable impact on how Oracle Data Guard operates. An equally important topic to understand is how network disconnects impact Oracle Data Guard. Let’s consider what happens during a simple network disconnect between a primary and standby database.

When the network connection between two hosts is disconnected or when one host within a TCP session is no longer available, the session is known as a dead connection. A dead connection indicates that there is no physical connection, but to the processes on each system, the connection still appears to exist. When the LGWR and RFS process are involved in a dead connection, LNS will discover the dead connection when attempting to send a new message to the RFS process. At this point, LNS will wait on the TCP layer to timeout on the network session between the primary and standby before establishing that network connectivity has indeed been lost.

The TCP timeout, as defined by TCP kernel parameter settings, is key to how long LNS will remain in a wait state before abandoning the network connection. On some platforms, the default for TCP timeout can be as high as two hours. To avoid this lengthy TCP timeout from severely impacting Oracle Data Guard, the MAX_FAILURE, REOPEN, and NET_TIMEOUT attributes were developed.

On the standby, the RFS process is always synchronously waiting for new information to arrive from the LNS process on the primary. The RFS process that is doing the network read operation is blocked until more data arrives or until the operating system’s network software determines that the connection is dead.

Once the RFS process receives notification of the dead network connection, it will terminate itself. However, until the RFS process terminates itself, it will retain lock information on the archivelog or the standby redo log on the standby database, whose redo information was being received from the primary database. Any attempt to perform a failover using the RECOVER MANAGED STANDBY DATABASE FINISH command will fail while the RFS process maintains a lock on the standby redo log or archivelog. The RECOVER command will fail with the following errors:

   ORA-00283: recovery session canceled due to errors
   ORA-00261: log 4 of thread 1 is being archived or modified
   ORA-00312: online log 4 thread 1:'+DBDATA/cosp/onlinelog/

At this point, we must wait for either the network stack on the OS to clean up the dead connection or kill the RFS process before the failover attempt will succeed. The preferred method to decrease the time for the OS network software to clean up the dead connection is the use of Oracle’s Dead Connection Detection (DCD) feature.

With Oracle’s DCD feature, Oracle Net periodically sends a network probe to verify that a client-server connection is still active. This ensures that connections are not left open indefinitely due to an abnormal client termination. If the probe finds a dead connection or a connection that is no longer in use, it returns an error that causes the RFS process to exit. Though DCD will minimize the detection time of a network disconnect, we are still dependent on the OS network stack for timeouts and retries, so a delay will occur before termination of the RFS process.

Once the network problem is resolved, and the primary database processes are again able to establish network connections to the standby database, a new RFS process will automatically be spawned on the standby database for each new network connection. These new RFS processes will resume the reception of redo data from the primary database.

Related Posts

Leave a Reply