About the Oracle Clusterware Trace Files

on September 26, 2013

This article talks about the logs that are included in Oracle 11g Release 2. In Oracle 11g Release 1, RDBMS instances, listeners, and so on, moved to using the Automatic Diagnostic Repository (ADR). The default repository home location is the ORACLE_BASE/diag directory, and the various process types have directories under that, such as asm, rdbms, and tnslsnr. When an error occurs, the ADR framework creates an incident and gives you the option of packaging up that incident using the ADR Command Interpreter (ADRCI) to send to Oracle Support.

Although I will not go into ADR here, you need to know where to go for troubleshooting, because ASM now lives under the GI home and may be a critical part of the clusterware. The clusterware has not yet adopted the ADR model; however, its tracing has evolved significantly since 10g Release 1. Most of the trace files are fairly low level and intended for Oracle Support to diagnose problems. Only the cluster alert log is meant for user consumption, in the same way as the RDBMS alert log.

It is, of course, never a bad thing to look at the traces to get an understanding of what is going on so long as you don’t get paranoid about the contents and log Support Requests for every line that does not look right. (I guess I’m saying don’t go looking too hard for problems in these logs because many of the messages look like a problem but are quite normal.)

The cluster alert log is a good place to start looking for problems. It is found at $GI_HOME/log/<hostname>/alert<hostname>.log and includes only top-level information such as startups, shutdowns, Cluster Verification Utility (CVU) checks, evictions, and reconfigurations due to eviction or shutdown of other nodes, as well as actual problems encountered by other processes. Even some of these reported problems may not be of concern. For example, errors from processes on startup may occur simply because they need resources that are not yet online. Having said that, you may find it helpful to check out these logs.
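As a small illustration, the alert log location can be derived from the GI home and the node name. This is only a sketch in Python, assuming the 11.2 layout $GI_HOME/log/<hostname>/alert<hostname>.log; the function name, the hostname racnode1, and the fallback to the short local hostname are my own assumptions, not anything the clusterware provides.

```python
import socket


def cluster_alert_log_path(gi_home, hostname=None):
    """Return the expected cluster alert log path under a GI home.

    Assumption: when no hostname is given, the short local hostname
    (everything before the first dot) is the node name used in the path.
    """
    if hostname is None:
        hostname = socket.gethostname().split(".")[0]
    return "%s/log/%s/alert%s.log" % (gi_home.rstrip("/"), hostname, hostname)


if __name__ == "__main__":
    # GI home taken from the listings in this article; racnode1 is hypothetical.
    print(cluster_alert_log_path("/u01/app/11.2.0/grid", "racnode1"))
```

Passing the hostname explicitly is the safer choice on systems where the local hostname is fully qualified or differs from the cluster node name.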

Here is an example of a message in the cluster alert log:

   2010-09-22 02:35:11.474
   [/u01/app/11.2.0/grid/bin/oraagent.bin(4696)]CRS-5016:
   Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent
   "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details
   at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racassur-
   pfix01/agent/crsd/oraagent_grid/oraagent_grid.log"
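A message like this has a regular shape: a timestamp, the reporting process and its PID, a CRS message code, and a pointer to the detailed log file. The following Python sketch pulls those pieces apart. The regular expressions are inferred from this single example rather than from any documented format, and the hostname racnode1 in the sample is hypothetical, so treat it as a starting point only.

```python
import re

# Patterns inferred from the one sample entry shown above (an assumption).
ENTRY_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"\[(?P<process>[^\]\(]+)\((?P<pid>\d+)\)\]"
    r"(?P<code>CRS-\d+):\s*(?P<text>.*)$"
)
TAG_RE = re.compile(r'details at "(\(:[A-Z0-9]+:\))"')
LOG_REF_RE = re.compile(r'\bin "([^"]+)"')


def parse_alert_entry(entry):
    """Split one alert log entry into its parts, or return None if it does not match."""
    text = " ".join(entry.split())  # collapse line breaks and repeated spaces
    match = ENTRY_RE.match(text)
    if match is None:
        return None
    parsed = match.groupdict()
    parsed["pid"] = int(parsed["pid"])
    tag = TAG_RE.search(parsed["text"])
    log_ref = LOG_REF_RE.search(parsed["text"])
    parsed["tag"] = tag.group(1) if tag else None
    parsed["log_file"] = log_ref.group(1) if log_ref else None
    return parsed


# Sample entry; the hostname "racnode1" is hypothetical.
SAMPLE = (
    "2010-09-22 02:35:11.474\n"
    '[/u01/app/11.2.0/grid/bin/oraagent.bin(4696)]CRS-5016:Process '
    '"/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent '
    '"/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: '
    'details at "(:CLSN00010:)" in '
    '"/u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"'
)

if __name__ == "__main__":
    entry = parse_alert_entry(SAMPLE)
    print(entry["code"], entry["tag"], entry["log_file"])
```

The tag in parentheses is worth keeping, because the same tag shows up in the detailed agent log and makes the matching lines easy to find.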

You can see that CRSD’s oraagent tried to check the listener using the lsnrctl command, and the check failed. Let’s take a look at the file mentioned to see what happened. (I show only the more interesting lines and remove the date and time stamps.)

NOTE

If you do not find the timestamp you expect in the file, check the other files in the directory. The log files are aged out when they reach a certain size, so there can be up to ten old versions of the file, named (in this case) oraagent_grid.l01 to oraagent_grid.l10.

   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check] lsnrctl
   status LISTENER_SCAN3
   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check]
   (:CLSN00010:)Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)
   (KEY=LISTENER_SCAN3)))
   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check]
   (:CLSN00010:)TNS-12541: TNS:no listener
   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check]
   (:CLSN00010:) TNS-12560: TNS:protocol adapter error
   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check]
   (:CLSN00010:)  TNS-00511: No listener
   [ora.LISTENER_SCAN3.lsnr][1579252032] {1:63693:62933} [check]
   (:CLSN00010:)  Linux Error: 2: No such file or directory

These errors show that the listener was down when the check ran. However, the trace file also shows that a stop for that listener was called just before this, so the listener was expected to be down.

   2010-09-22 02:35:11.365: [    AGFW][1646369088] {1:63693:62933}
   Command: stop for resource: ora.LISTENER_SCAN3.lsnr 1 1 completed with
   status: SUCCESS

Nothing to worry about there, so long as the listener can be restarted when needed.
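The manual check above, looking through the current and aged-out agent logs for a successful stop that explains a failed check, can be roughly automated. The Python sketch below assumes the file naming described earlier (oraagent_grid.log plus .l01 to .l10) and assumes that the brace-enclosed identifier seen in both excerpts, {1:63693:62933}, ties the related lines together. The function names, the directory, and the hostname racnode1 are hypothetical.

```python
import os
import re

# Matches the "stop ... completed with status: SUCCESS" line shown above.
STOP_RE = re.compile(
    r"Command: stop for resource: (?P<resource>\S+) .*"
    r"completed with status: SUCCESS"
)


def agent_log_files(directory, base="oraagent_grid"):
    """Return the current log plus any aged-out copies (.l01 to .l10) that exist."""
    names = [base + ".log"] + ["%s.l%02d" % (base, n) for n in range(1, 11)]
    paths = [os.path.join(directory, name) for name in names]
    return [path for path in paths if os.path.isfile(path)]


def lines_with(paths, needle):
    """Yield (path, line) for every line in the files that contains needle."""
    for path in paths:
        with open(path, errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield path, line.rstrip("\n")


def successful_stops(paths, needle):
    """Return the resources with a successful stop among the lines containing needle."""
    stops = set()
    for _path, line in lines_with(paths, needle):
        match = STOP_RE.search(line)
        if match:
            stops.add(match.group("resource"))
    return stops


if __name__ == "__main__":
    # Hypothetical directory; substitute the path named in the alert log entry.
    log_dir = "/u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_grid"
    files = agent_log_files(log_dir)
    print(successful_stops(files, "{1:63693:62933}"))
```

If the listener resource appears in the returned set, the failed check was most likely the result of a deliberate stop rather than a fault.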

If you do happen to find a real problem that requires the help of Oracle Support, you can zip all the required trace files using $GI_HOME/bin/diagcollection.sh, which is a wrapper for diagcollection.pl. It can collect ADR incident data and Cluster Health Monitor O/S data, as well as clusterware trace files:

   [root]# /u01/app/11.2.0/grid/bin/diagcollection.sh -h
   Production Copyright 2004, 2010, Oracle.  All rights reserved
   Cluster Ready Services (CRS) diagnostic collection tool
   diagcollection
       --collect
        [--crs] For collecting crs diag information
        [--adr] For collecting diag information for ADR; specify ADR location
        [--chmos] For collecting Cluster Health Monitor (OS) data
        [--all] Default.For collecting all diag information.
        [--core] UNIX only. Package core files with CRS data
        [--afterdate] UNIX only. Collects archives from the specified
   Date. Specify in mm/dd/yyyy format
        [--aftertime] Supported with -adr option. Collects archives after the
   specified time. Specify in YYYYMMDDHHMISS24 format
        [--beforetime] Supported with -adr option. Collects archives before
   the specified date. Specify in YYYYMMDDHHMISS24 format
        [--crshome] Argument that specifies the CRS Home location
        [--incidenttime] Collects Cluster Health Monitor (OS) data from the
   specified time.  Specify in MM/DD/YYYY24HH:MM:SS format
        If not specified, Cluster Health Monitor (OS) data generated in the
   past 24 hours are collected
        [--incidentduration] Collects Cluster Health Monitor (OS) data for the
   duration after the specified time.  Specify in HH:MM format.
        If not specified, all Cluster Health Monitor (OS) data after
   incidenttime are collected

NOTE

For 11.2.0.1, check the valid arguments with the -h flag, because the arguments shown in this example are the 11.2.0.2 options.
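The date and time options are easy to get wrong by hand, so here is a small Python sketch that formats a timestamp the way the help text describes for --afterdate (mm/dd/yyyy) and for --aftertime and --beforetime (YYYYMMDDHHMISS24). I am assuming that the "24" means a 24-hour clock, and the function names are my own. The sketch only produces the value strings; how they are passed on the command line should be confirmed with -h on your version.

```python
from datetime import datetime


def afterdate_value(moment):
    """Format a datetime as mm/dd/yyyy, the format listed for --afterdate."""
    return moment.strftime("%m/%d/%Y")


def adr_time_value(moment):
    """Format a datetime as YYYYMMDDHHMISS24 (assumed to be a 24-hour clock),
    the format listed for --aftertime and --beforetime."""
    return moment.strftime("%Y%m%d%H%M%S")


if __name__ == "__main__":
    # Timestamp of the alert log entry used as the example in this article.
    incident = datetime(2010, 9, 22, 2, 35, 11)
    print(afterdate_value(incident))
    print(adr_time_value(incident))
```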