Histograms are a feature of the cost-based optimizer (CBO) that allows the Oracle engine to determine how data is distributed within a column. They are most useful for a column that is included in the WHERE clause of SQL and the data distribution is skewed.
Assume a table named PROCESS_QUEUE with one million rows including a column named PROCESSED_FLAG with five distinct values. Also assume a query similar to the following is executed:
SELECT id, serial_number FROM process_queue WHERE processed_flag = ‘N’; SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1087 Card=260363 Bytes=7029801) TABLE ACCESS (FULL) OF ‘PROCESS_QUEUE’ (TABLE) (Cost=1087 Card=260363 Bytes=7029801)
Without histograms and only five distinct values, Oracle assumes an even data distribution and would most likely perform a full table scan for this query. With one million rows and five values, Oracle assumes that each value would return 200,000 rows, or 20% of the rows.
However, what if the data for the PROCESSED_FLAG column was skewed:
SELECT processed_flag, COUNT(1) FROM process_queue GROUP BY processed_flag; PROCESSED_FLAG COUNT ——————————- ———- P 24 Y 999345 E 30 S 568 N 33
In this case, ony 33 rows have a value of ‘N’, so there has to be a way to tell Oracle to use the index on the PROCESSED_FLAG column. That is where histograms come into use. A histogram would include data similar to above and allow Oracle to know that only 33 rows would be returned for this query.
To collect histograms for this column, a command similar to the following could be used:
EXECUTE DBMS_STATS.GATHER_TABLE_STATS(user, ‘PROCESS_QUEUE’, method_opt => ‘for columns processed_flag size 5’) SELECT id, serial_number FROM process_queue WHERE processed_flag = ‘N’; SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=28 Bytes=756) TABLE ACCESS (BY INDEX ROWID) OF ‘PROCESS_QUEUE’ (TABLE) (Cost=1 Card=28 Bytes=756) INDEX (RANGE SCAN) OF ‘PQ_IX1’ (INDEX) (Cost=1 Card=28)
Notes About Histograms
Note 1: Using histograms works best for SQL statements that use literal values. If a statement uses a bind variable, the first time the query is parsed, Oracle will peek at the value of the bind variable and choose a plan accordingly. That same plan will be used until the SQL is reparsed. In this case, if the bind variable was ‘Y’ the first time, Oracle may perform a full table scan for this query no matter what value was passed in from then on. The opposite may also be true. Assume a similar data distribution to above but with 100 distinct values for the PROCESSED_FLAG column. The rows that have a ‘Y’ value are still be 95% of the rows. However, if you used the criteria “WHERE processed_flag=’Y'”, without histograms Oracle may decide to use the index when a full table scan may be a better option.
Note 2: The defaults for the METHOD_OPT parameter changed between Oracle 9i and 10g. In 9i the parameter defaulted to ‘for all columns size 1’ which essentially turns off histograms. The default value in Oracle 10g is ‘for all columns size auto’ which means that Oracle will decide whether or not to collect histograms for a column. In my experience it seems that unneccesary histograms are collected and histogram data is not collected for some columns where it would be useful.
Histograms allow Oracle to make much better performance decisions. The case we discussed in this article is one way that histograms are used and is commonly referred to as “table access method” histograms. Another use for histograms, referred to as “table order join” histograms, is to help Oracle decide the order in which tables will be joined. This helps the CBO know the size of the result sets or “cardinality” to properly determine the correct order in which to do joins.