To get started, let’s take a look at the various ways statistics can be created and updated. Statistics are created in several ways: automatically by the query optimizer (if the default option to automatically create statistics, AUTO_CREATE_STATISTICS, is on), when an index is created, and explicitly (for example, via the CREATE STATISTICS statement). Statistics can be created on one or more columns, and both the index and explicit creation methods support single- and multi-column statistics. However, the statistics that are automatically generated by the query optimizer are always single-column statistics. The components of statistics objects are the histogram, the density information, and the string statistics. Both histograms and string statistics are created only for the first column of a statistics object, the latter only if the column is of a string data type.
Density information is calculated for each set of columns that forms a prefix in the statistics object. Filtered statistics, on the other hand, are not created automatically by the query optimizer, but only when a filtered index is created or when a CREATE STATISTICS statement with a WHERE clause is issued. Filtered indexes and filtered statistics were both introduced in SQL Server 2008.
With the default configuration (if AUTO_UPDATE_STATISTICS is on), the query optimizer automatically updates statistics when they are out of date. The query optimizer does not automatically create multi-column or filtered statistics, but once they are created, they can be automatically updated. Alternatively, index rebuild operations and statements such as UPDATE STATISTICS can also be used to update statistics. Because both the auto-create and auto-update default choices will give you good quality statistics most of the time, it is strongly recommended that you keep these defaults. Naturally, you can also create or update statistics explicitly if you need more control over their quality.
So, by default, statistics may be automatically created (if nonexistent) and automatically updated (if out of date) as necessary during query optimization. By “out of date,” we refer to the data being changed and therefore the statistics not being representative of the underlying data (more on the exact mechanism later). If an execution plan for a specific query exists in the plan cache and the statistics that were used to build the plan are now out of date, then the plan is discarded, the statistics are updated, and a new plan is created. In a similar way, updating statistics, either manually or automatically, invalidates any existing execution plan that used those statistics, and will cause a new optimization the next time the query is executed.
When it comes to determining the quality of your statistics, one factor to consider is the size of the sample of the target table used to calculate them. The query optimizer determines a statistically significant sample by default when it creates or updates statistics, and the minimum sample size is 8MB (1,024 pages) or the size of the table if it’s smaller than 8MB. The sample size will increase for bigger tables, but it may still only be a small percentage of the table.
If needed, you can use the CREATE STATISTICS and UPDATE STATISTICS statements to explicitly request a bigger sample or scan the entire table to have better quality statistics. In order to do that you need to specify a sample size or use the WITH FULLSCAN option to scan the entire table. A sample size can be specified as number of rows or percentage and, because the query optimizer has to scan all the rows on a data page, these values are approximate. Using WITH FULLSCAN or using a larger sample can be of benefit, especially with data that is not randomly distributed throughout the table. Scanning the entire table will naturally give you the most accurate statistics possible. Consider that if statistics are built after scanning 50 percent of a table, then SQL Server will assume that the 50 percent of data that it has not seen is statistically exactly the same as the 50 percent it has seen. In fact, given that statistics are always created alongside a new index, and given that this operation scans the entire table anyway, index statistics are initially created with the equivalent of the WITH FULLSCAN option. However, if the query optimizer needs to automatically update these index statistics, it has to go back to a default sample because it may take too long to scan the entire table again.
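To make the 50 percent argument concrete, here is a toy model in Python. It is only a sketch and not SQL Server's sampling algorithm: the page capacity, the two-value table, and the fixed choice of sampled pages are all hypothetical, picked to keep the example deterministic. It shows how a sample of clustered data, scaled up to the whole table, can misrepresent the rows it never read.

```python
from collections import Counter

ROWS_PER_PAGE = 100  # hypothetical page capacity for this toy model

# Toy table of 1,000 rows whose values are clustered, not randomly distributed
table = ["A"] * 500 + ["B"] * 500
pages = [table[i:i + ROWS_PER_PAGE] for i in range(0, len(table), ROWS_PER_PAGE)]


def estimate_counts(sampled_pages, total_rows):
    """Scale the per-value counts seen in the sample up to the whole table."""
    sample = [value for page in sampled_pages for value in page]
    scale = total_rows / len(sample)
    return {value: count * scale for value, count in Counter(sample).items()}


# A 50 percent sample that happens to land on the first five pages
sampled = estimate_counts(pages[:5], len(table))

# The equivalent of a full scan: every page is read
full = estimate_counts(pages, len(table))

print(sampled)
print(full)
```

In this model the sampled estimate attributes every row to the value A and has no knowledge of B, whereas the full scan reports 500 rows for each value.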
By default, SQL Server needs to wait for the update statistics operation to complete before optimizing and executing the query; that is, statistics are updated synchronously. A database configuration option introduced with SQL Server 2005, AUTO_UPDATE_STATISTICS_ASYNC, can be used to change this default and let the statistics be updated asynchronously. As you might have guessed, with asynchronous statistics update, the query optimizer does not wait for the update statistics operation to complete, and instead just uses the current statistics for the optimization process. This can help in situations where applications experience timeouts caused by delays related to the automatic update of statistics. Although the current optimization will use the out-of-date statistics, they will be updated in the background and will be used by any later query optimization. However, asynchronous statistics updates usually only benefit OLTP workloads and may not be a good solution for more expensive queries, where getting a better plan is more important than an infrequent delay in statistics update.
SQL Server defines when statistics are out of date by using column modification counters, or colmodctrs, which count the total number of modifications for the leading statistics column since the last time statistics were updated. Basically, for tables bigger than 500 rows, a statistics object is considered out of date if the colmodctr value of the leading column has changed by more than 500 plus 20 percent of the number of rows in the table. The same formula is used by filtered statistics, but, because they are built only from a subset of the records of the table, the colmodctr value is first multiplied by the selectivity of the filter. The colmodctr values are exposed in the modification_counter column of the sys.dm_db_stats_properties DMF, which is available starting with SQL Server 2008 R2 Service Pack 2 and SQL Server 2012 Service Pack 1. (Previously, colmodctrs were only available using a dedicated administrator connection and looking at the rcmodified column of the sys.sysrscols base system table in SQL Server 2008 or the sys.sysrowsetcolumns base system table in SQL Server 2005.)
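The threshold arithmetic can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not SQL Server code: the function names are hypothetical, and comparing the scaled counter of a filtered statistics object against the threshold computed from the full table row count is an assumption of this sketch.

```python
def auto_update_threshold(table_rows: int) -> float:
    """Modifications that must be exceeded: 500 plus 20 percent of the rows."""
    if table_rows <= 500:
        raise ValueError("this sketch covers only tables bigger than 500 rows")
    return 500 + table_rows * 20 / 100


def is_out_of_date(colmodctr: int, table_rows: int,
                   filter_selectivity: float = 1.0) -> bool:
    """Apply the rule to the leading column's modification counter.

    For filtered statistics the counter is first multiplied by the filter
    selectivity (assumption: the threshold still uses the table row count).
    """
    effective_changes = colmodctr * filter_selectivity
    return effective_changes > auto_update_threshold(table_rows)


# A 10,000-row table tolerates 2,500 changes to the leading column
print(auto_update_threshold(10_000))
print(is_out_of_date(2_501, 10_000))
# Filtered statistics covering half the table: 4,000 changes count as 2,000
print(is_out_of_date(4_000, 10_000, filter_selectivity=0.5))
```

For a 10,000-row table the threshold is 2,500 modifications, so 2,501 changes mark the statistics as out of date, while 4,000 changes against a filter with 0.5 selectivity do not.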
SQL Server 2000 used rowmodctrs, or row modification counters, instead to keep track of the number of changes in a table or index. The main difference with colmodctrs is that rowmodctrs track any change to the row, whereas colmodctrs only track changes to the leading column of the statistics object. Currently, the sp_updatestats stored procedure, which is another way to update statistics, is still based on rowmodctrs, whose values are available as the rowmodctr column of the sys.sysindexes compatibility view.
Trace flag 2371 was introduced with SQL Server 2008 R2 Service Pack 1 as a way to automatically update statistics at a lower, dynamic percentage rate instead of the fixed 20 percent threshold mentioned earlier. With this dynamic rate, the more rows a table has, the lower the threshold that triggers an automatic update of statistics. Tables with fewer than 25,000 records will still use the 20 percent threshold, but as the number of records in the table increases, this threshold becomes lower and lower. For more details about this trace flag, see the article “Changes to Automatic Update Statistics in SQL Server – Traceflag 2371.”
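The text does not give the exact formula behind the dynamic rate. The following Python sketch assumes the square-root rule, SQRT(1,000 × rows), that Microsoft documents for automatic statistics updates in later SQL Server versions. That rule equals 20 percent at exactly 25,000 rows, which matches the crossover described above. The fixed 500-row addend is left out for simplicity, and the function names are hypothetical.

```python
import math


def static_threshold(rows: int) -> float:
    """The fixed 20 percent rule (the 500-row addend is omitted in this sketch)."""
    return rows * 20 / 100


def dynamic_threshold(rows: int) -> float:
    """Assumed trace flag 2371 behavior: the lower of 20 percent and sqrt(1000 * rows)."""
    return min(static_threshold(rows), math.sqrt(1000 * rows))


def threshold_percentage(rows: int) -> float:
    """Express the dynamic threshold as a percentage of the table size."""
    return 100 * dynamic_threshold(rows) / rows


for rows in (10_000, 25_000, 100_000, 1_000_000):
    print(rows, round(dynamic_threshold(rows)), round(threshold_percentage(rows), 2))
```

Under this assumption, a 100,000-row table would trigger an update after 10,000 changes (10 percent), and a 1,000,000-row table after roughly 31,623 changes (about 3.2 percent).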
The density information on multi-column statistics might improve the quality of execution plans when columns are statistically correlated. Density information is kept for all the columns in a statistics object, in the order that they appear in the statistics definition. By default, SQL Server assumes columns are independent; therefore, if a relationship or dependency exists between columns, multi-column statistics can help with cardinality estimation problems in queries that use these columns. Density information will also help on filters and GROUP BY operations. Filtered statistics can also be used for cardinality estimation problems with correlated columns.
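The effect of the independence assumption can be illustrated with a small Python sketch. It takes density to be 1 divided by the number of distinct values of a column set, and it uses a hypothetical 100-row table in which each city belongs to exactly one state. The estimation formulas are simplified for illustration and are not the optimizer's actual cardinality estimation logic.

```python
# Hypothetical table of (city, state) rows; the two columns are correlated
rows = (
    [("Seattle", "WA")] * 40
    + [("Spokane", "WA")] * 10
    + [("Portland", "OR")] * 30
    + [("Salem", "OR")] * 20
)


def density(data, columns):
    """Density of a column set: 1 / number of distinct value combinations."""
    distinct = {tuple(row[c] for c in columns) for row in data}
    return 1 / len(distinct)


total = len(rows)
city_density = density(rows, [0])           # single-column statistics on city
state_density = density(rows, [1])          # single-column statistics on state
city_state_density = density(rows, [0, 1])  # multi-column statistics (city, state)

# Treating the columns as independent multiplies the two selectivities
independent_estimate = total * city_density * state_density

# The multi-column density already accounts for the correlation
multi_column_estimate = total * city_state_density

print(independent_estimate, multi_column_estimate)
```

With this data, the independence assumption estimates 12.5 rows for a predicate on both columns. The multi-column density yields 25 rows, which is the true average number of rows per distinct city and state combination.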