Oracle Technologies: Recover Quickly From Failure

on September 26, 2013

Several technology options are available for achieving the redundancy needed to recover quickly from software or component failures. There is, for example, a plethora of backup options from which to choose. In an environment with terabytes of data, OS file copies no longer cut the mustard for most DBAs. This article explores the Oracle technologies that can be used to achieve these goals.

Again with Oracle RAC

Clearly, Oracle RAC, with its shared-everything architecture, is a key component of a redundant architecture. A failure affecting an individual server’s ability to function properly will not impact the entire database, because the other nodes in the cluster continue the work seamlessly, even if an entire node is lost. Other technologies such as Oracle Data Guard or Streams can achieve the same goal, but they do so by keeping copies of the data in separate and distinct databases, which implies a time delay in reacting to and recovering from relatively simple failures. Only Oracle RAC provides redundant access to the same database: even in an extreme case such as the complete failure of a node in your cluster, the other nodes are up and running and actively accessing that database throughout the server failure. The remaining instances automatically perform instance recovery for the instance that crashed, and any sessions that were connected to the downed server can reconnect immediately to another instance that is already actively accessing the database.
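
The reconnect behavior described above can be illustrated from the client's point of view with a small sketch. This is not Oracle's connection failover mechanism; it only shows the idea of trying the surviving instances in turn. The instance names, the InstanceUnavailable exception, and the fake_connect stand-in are all hypothetical and exist only so the example runs without a database.

```python
from typing import Callable, Sequence


class InstanceUnavailable(Exception):
    """Raised by the (hypothetical) connect function when an instance is down."""


def connect_with_failover(instances: Sequence[str],
                          connect: Callable[[str], object]) -> object:
    """Try each instance in order and return the first session that opens."""
    errors = []
    for name in instances:
        try:
            return connect(name)
        except InstanceUnavailable as exc:
            # Remember why this instance was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise InstanceUnavailable("no instance reachable: " + "; ".join(errors))


def fake_connect(name: str) -> str:
    """Stand-in for a real driver call; pretends that node rac1 has failed."""
    if name == "rac1":
        raise InstanceUnavailable("node down")
    return f"session on {name}"


session = connect_with_failover(["rac1", "rac2", "rac3"], fake_connect)
print(session)  # prints "session on rac2"
```

Because every instance serves the same database, the session that lands on the second node sees the same data the failed node was serving, which is the property the paragraph above attributes to Oracle RAC.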

Oracle Clusterware

The Oracle Clusterware component of Oracle’s grid infrastructure is a necessary underpinning of an Oracle RAC database. In addition to facilitating shared access to the actual database, Oracle Clusterware monitors for failures of critical processes such as instances, listeners, virtual IP addresses and the like, and it monitors node membership and responsiveness. When the Oracle Clusterware stack detects a failure of a critical component of the cluster, it automatically takes corrective action to restart the failed resource, up to and including the node itself. This architecture allows DBAs to recover quickly from localized and relatively minor failures without impacting the business.

Oracle Recovery Manager (RMAN)

With respect to backups, too much is never enough. RMAN allows you to take full or incremental backups, and lets you restore and recover as little as a single database block. If you are using Automatic Storage Management (ASM), the RMAN strategy is doubly important, because files stored in ASM are not directly accessible to ordinary OS copy utilities. This flexibility in backup and recovery is integral to maximizing the availability of the databases.
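
As a rough illustration of the full, incremental, and single-block options mentioned above, the following sketch generates the corresponding RMAN command text. The Python helper names are hypothetical, and the datafile and block numbers are made up. The RECOVER ... BLOCK form assumes Oracle Database 11g or later; earlier releases used the BLOCKRECOVER command. The commands are only built as strings here, not run against a database.

```python
def rman_backup_script(level: int) -> str:
    """Build an RMAN script for a level 0 (baseline) or level 1 (incremental) backup."""
    if level not in (0, 1):
        # This sketch deliberately restricts itself to levels 0 and 1.
        raise ValueError("incremental level must be 0 or 1")
    return "\n".join([
        "RUN {",
        f"  BACKUP INCREMENTAL LEVEL {level} DATABASE PLUS ARCHIVELOG;",
        "}",
    ])


def rman_block_recover_command(datafile: int, block: int) -> str:
    """Build the RMAN command that recovers one block of one datafile."""
    return f"RECOVER DATAFILE {datafile} BLOCK {block};"


print(rman_backup_script(0))                 # baseline backup, e.g. weekly
print(rman_backup_script(1))                 # changes since the baseline, e.g. nightly
print(rman_block_recover_command(4, 123))    # hypothetical corrupt block
```

The generated text could be saved to a command file and passed to the rman client; how often to schedule each level is a site-specific decision that this sketch does not address.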

Flashback Database

In days gone by, DBAs often found that they needed to perform an entire database restore to recover up to the time just prior to the occurrence of some critical error (such as an inadvertent deletion of data, some other logical corruption, or corruption of an online redo log). With the advent of Flashback Database, you can now avoid this by essentially storing all of the blocks necessary both to redo and undo transactions for a specified period of time. This means that you can “rewind” the database without first doing a full restore, potentially saving a great deal of time in a crisis.
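
The "specified period of time" matters in practice: a rewind is only possible to a point that the stored flashback data still covers. The sketch below, with hypothetical helper names, checks a target time against the oldest available flashback time (a value that would typically be read from the OLDEST_FLASHBACK_TIME column of V$FLASHBACK_DATABASE_LOG) and then builds the usual SQL*Plus sequence for a rewind. It assumes Flashback Database is already enabled, and it only produces the statements as text.

```python
from datetime import datetime, timedelta
from typing import List


def can_flashback(target: datetime, oldest_flashback_time: datetime,
                  now: datetime) -> bool:
    """True if the target time lies inside the window covered by flashback data."""
    return oldest_flashback_time <= target <= now


def flashback_statements(target: datetime) -> List[str]:
    """SQL*Plus steps to rewind a database to the target time (built, not executed)."""
    ts = target.strftime("%Y-%m-%d %H:%M:%S")
    return [
        "SHUTDOWN IMMEDIATE",
        "STARTUP MOUNT",  # FLASHBACK DATABASE requires a mounted, not open, database
        f"FLASHBACK DATABASE TO TIMESTAMP TO_TIMESTAMP('{ts}', 'YYYY-MM-DD HH24:MI:SS')",
        "ALTER DATABASE OPEN RESETLOGS",
    ]


# Made-up times: flashback data reaches back 24 hours, the error happened 1 hour ago.
now = datetime(2013, 9, 26, 12, 0, 0)
oldest = now - timedelta(hours=24)
target = now - timedelta(hours=1)

if can_flashback(target, oldest, now):
    for stmt in flashback_statements(target):
        print(stmt)
```

If the target falls outside the window, the fallback is the traditional restore and point-in-time recovery that the paragraph above describes.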