A DBA is in his cube waiting for an export job to finish when the developer calls. The developer claims he ran a job around midnight the night before that took two hours when it should have run in about 20 minutes. He wants the DBA to explain why the job ran for so long. When the DBA asks him if he had made any code changes, he quickly says, “No”. However, the DBA is almost certain that he is hiding codes capable of mass destruction. So, the DBA says he doesn’t know the cause of the performance problem and offers to monitor the job if the developer will run it again. Having worked with the developer for a long time, the DBA knows what he is thinking: “First, ask the DBA what’s wrong with the database. If the DBA can’t provide an immediate answer, blame the database, otherwise, find others to blame.”
A question from a customer such as, “Why did the job run so slowly?” is not only a loaded question, it has to be the most challenging question a DBA can face. This kind of question catches many DBAs helpless like a deer in the headlights; not because they can’t tune, but they have no historical data to show what went on in the database.
Corporate IT organizations seem to think that DBAs are omniscient creatures and don’t need sleep. They expect speedy answers from their DBAs when they inquire about database performance no matter what time of day or day of the year it is. That’s why you carry a pager, right? (Sometimes two or even three pagers!) However, without reliable historical performance data, you simply cannot tell them why a job ran slowly. If you fail to provide an answer, the performance problem usually is labeled as a database problem. You are guilty until you prove yourself innocent. The burden of proof always falls on the DBA.
Are you tired of being blamed for problems unrelated to the database? Ask yourself what percentage of the performance calls you receive are truly database problems. Why is it that the database is such a magnet for blame? Perhaps being downstream may have something to do with it—or could it be that you simply have no way to find out what happened, making you an easy target for abuse? The truth is that nowadays application codes are moved into production systems with very little review and testing, and when there is a performance issue, the common question is what’s wrong with the database. You can defend yourself only when you are armed with session-level historical performance data. For this, you need a data collector that collects session-level performance data continuously.
Fast and Accurate Root Cause Analysis
Now that you understand the importance of historical performance data, the next step is to establish the characteristics of the data collector suitable for ongoing data collection as well as the kind of information you need to effectively perform root cause analyses of performance problems. You need to be able to look back in time to discover what each foreground process did in the database. The facts should either lead you to the problem in the database or help you prove that the problem is external. If the facts reveal that the problem is outside the database, you can use these facts to encourage the other teams to review their processes and systems and prevent finger pointing at the database. The following is our recommended list of data collector characteristics:
- Wait-based methodology: Root cause analyses rely heavily on session-level wait events data. Therefore, the data collector of choice must subscribe to the Oracle Wait Interface.
- Session-level granularity: Performance root cause analyses also require raw (low-level) session-level data. Summarized session-level data has some value, but instance-level granularity is too coarse and is not suitable for this purpose.
- Always-on and low overhead: Because no one knows when a performance problem will occur, the data collector must be running at all times (meaning continuous collection or sampling). The collector must also be able to simultaneously monitor all foreground database connections. Therefore, low processing overhead is extremely important.
- Historical repositories: Repositories are important because they allow you to look back in time to discover what went on in the database, just like surveillance tapes that investigators rely on to solve a crime or mystery. You should have a separate repository for sessions or connections, wait events, and SQL statements. There is more discussion on repositories in the “Sampling for Performance Using PL/SQL Procedure” section.