Fast and Accurate Oracle Root Cause Analysis

on June 26, 2013

In this article, we establish the characteristics of a data collector suitable for ongoing data collection as well as the kind of information you need to effectively perform root cause analyses of Oracle database performance problems. We also examine why Trace Event 10046 and Oracle Statspak don’t meet our criteria.

Characteristics of a Suitable Oracle Database Performance Data Collector

You need to be able to look back in time to discover what each foreground process did in the database. The facts should either lead you to the problem in the database or help you prove that the problem is external. If the facts reveal that the problem is outside the database, you can use these facts to encourage the other teams to review their processes and systems and prevent finger pointing at the database. The following is our recommended list of data collector characteristics:

Wait-based methodology–Root cause analyses rely heavily on session-level wait events data. Therefore, the data collector of choice must subscribe to the Oracle Wait Interface.
Session-level granularity–Performance root cause analyses also require raw (low-level) session-level data. Summarized session-level data has some value, but instance-level granularity is too coarse and is not suitable for this purpose.
Always-on and low overhead–Because no one knows when a performance problem will occur, the data collector must be running at all times (meaning continuous collection or sampling). The collector must also be able to simultaneously monitor all foreground database connections. Therefore, low processing overhead is extremely important.
Historical repositories–Repositories are important because they allow you to look back in time to discover what went on in the database, just like surveillance tapes that investigators rely on to solve a crime or mystery. You should have a separate repository for sessions or connections, wait events, and SQL statements. There is more discussion on repositories in the “Sampling for Performance Using PL/SQL Procedure” section.

Why Trace Event 10046 Is Not a Suitable Data Collector

Trace event 10046, not to be confused with wait events, is the best utility to trace processes, bar none. The trace file contains information on SQL statements, parses and executions, bind variables, and wait events. You can trace foreground and background processes with it. It is perfect for ad hoc tracing when you need runtime information at the lowest granular level and you have established the tracing timeframe.

However, the 10046 trace facility does not meet the always-on and low overhead requirement established in the previous section. There are two main reasons why you cannot use this trace facility as a continuous performance data collector:

It generates too much data. The fine-grain data that it offers is synonymous with high volume, especially at levels 8 and 12. The amount of trace data generated at these levels can quickly consume all the space in your UDUMP/BDUMP directory. For this reason, trace event 10046 is seldom enabled at the instance level. If it is, it is only for a brief moment, usually at the request and guidance of Oracle Support. Forget about enabling this for all processes on a 24×7 basis. It is not even practical to trace one long-running session from start to finish. You have to be very selective about when to start and stop tracing, which means you need to have a rough idea of when the problem will arise.
It is expensive. The comprehensiveness of this trace facility is synonymous with overhead. The overhead will further degrade the performance of an already slow-running process.

Besides these reasons, there are other reasons the trace event 10046 is unsuitable for the task, including the following:

Despite the vast amount of data generated at levels 8 and 12, it is still not enough. You will only find raw event data in the trace file. You may see something like “WAIT #12: nam=‘db file scattered read’ ela= 0 p1=106 p2=60227 p3=8”. Obviously, it is not user-friendly and there is no value-added interpretation. You have to manually translate P1 and P2 to discover the object name, which is a laborious task, especially when many objects are present. Furthermore, translations performed after the fact can be erroneous if the object is dropped and/or the same space is occupied by another object. There is also no cross-referencing in the trace file. For example, when there is an enqueue wait, the trace file records the P1, P2, and P3 values for the enqueue event, but there is no data about the blocking session or the SQL statement that the blocker executes. In this case, you only get half of the story.
Depending on the Oracle version, the trace files may be bug prone and are not always reliable. Consider the following snippet of an event 10046 trace file that belongs to the DBWR process in Oracle Release 9.2.0.1.0. Notice the negative P1 values. (This bug is fixed in 9.2.0.4.)
Perhaps the biggest drawback with the trace event 10046 facility is that it is impractical as an ongoing data collector, though it is technically possible.

Why Statspack Is Not a Suitable Data Collector

Oracle Corporation introduced Statspack in Oracle 8.1.6 as a tool to collect and store performance data. Statspack provides instance-level diagnostic data, which is too coarse for root cause analysis. Although it is possible to take session-level snapshots with Statspack, it is not designed for tracking every database connection. Therefore, we do not consider Statspack a suitable data collector. To find out more about why Statspack is not suitable for collecting session-level performance data, refer to the “10046 Alternatives” white paper at www.ioug.org.