About SQL Server 2012 Query Processing–Index Selection

on September 12, 2013

The identification of search arguments allows the SQL Server optimizer to decide whether one or more existing indices will be used. In this phase of query processing, the optimizer checks each search argument to see if there are indices in relation to the corresponding expression. If an index exists, the optimizer decides whether or not to use it. This decision depends mainly on the selectivity of the corresponding expression. The selectivity of an expression is defined as the ratio of the number of rows satisfying the condition to the total number of rows in the table.

The optimizer checks the selectivity of an expression with the indexed column by using statistics that are created in relation to the distribution of values in a column. The query optimizer uses this information to determine the optimal query plan by estimating the cost of using an index to execute the query.

The following sections discuss in detail selectivity of an expression with the indexed column and statistics. (Because statistics exist in relation to both indices and columns, they are discussed separately in two sections.)

NOTE

The Database Engine automatically creates (index and column) statistics if the database option called AUTO_CREATE_STATISTICS is activated.

Selectivity of an expression with the indexed column

As you already know, the optimizer uses indices to improve query execution time. When you query a table that doesn’t have indices, or if the optimizer decides not to use an existing index, the system performs a table scan. During the table scan, the Database Engine sequentially reads the table’s data pages to find the rows that belong to the result set. Index access is an access method in which the database system reads and writes data pages using an existing index. Because index access significantly reduces the number of I/O read operations, it often outperforms table scan.

The Database Engine uses a nonclustered index to search for data in one of two ways. If you have a heap (a table without a clustered index), the system first traverses the nonclustered index structure and then retrieves a row using the row identifier. If you have a clustered table, however, the traversal of the nonclustered index structure is followed by the traversal of the index structure of the table’s clustered index. On the other hand, the use of a clustered index to search for data is always unique: the Database Engine starts from the root of the corresponding B⁺-tree and usually after three or four read operations reaches the leaf nodes, where the data is stored. For this reason, the traversing of the index structure of a clustered index is almost always significantly faster than the traversing of the index structure of the corresponding nonclustered index.

From the preceding discussion, you can see that the answer to which access method (index scan or table scan) is faster isn’t straightforward and depends on the selectivity and the index type.

Tests that I performed showed that a table scan often starts to perform better than a nonclustered index access when at least 10 percent of the rows are selected. In this case, the optimizer’s decision of when to switch from nonclustered index access to a table scan must not be correct. (If you think that the optimizer forces a table scan prematurely, you can use the INDEX query hint to change its decision, as discussed later in this chapter.)

For several reasons, the clustered index usually performs better than the nonclustered index. When the system scans a clustered index, it doesn’t need to leave the B⁺-tree structure to scan data pages, because the pages already exist at the leaf level of the tree. Also, a nonclustered index requires more I/O operations than a corresponding clustered index. Either the nonclustered index needs to read data pages after traversing the B⁺-tree or, if a clustered index for another table’s column(s) exists, the nonclustered index needs to read the clustered index’s B⁺-tree structure.

Therefore, you can expect a clustered index to perform significantly better than a table scan even when selectivity is poor (that is, the percentage of returned rows is high, because the query returns many rows). The tests that I performed showed that when the selectivity of an expression is 75 percent or less, the clustered index access is generally faster than the table scan.

Index statistics

Index statistics are generally created when an index for the particular column(s) is created. The creation of index statistics for an index means that the Database Engine creates a histogram based on up to 200 values of the column. (Therefore, up to 199 intervals are built.) The histogram specifies, among other things, how many rows exactly match each interval, the average number of rows per distinct value inside the interval, and the density of values.

NOTE

Index statistics are always created for one column. If your index is a composite (multicolumn) index, the system generates statistics for the first column in the index.

If you want to create index statistics explicitly, you can use the following tools:

sp_createstats system procedure
SQL Server Management Studio

The sp_createstats system procedure creates single-column statistics for all columns of all user tables in the current database. The new statistic has the same name as the column where it is created.

To use SQL Server Management Studio for index statistics creation, expand the server, expand the Databases folder, expand the database, expand the Tables folder, expand the table, right-click Statistics, and choose New Statistics. The New Statistics on Table dialog box appears. In the dialog box, specify first the name for the new statistics. After that, click the Add button, select column(s) of the table to which to add the statistics, and click OK. Finally, click OK in the New Statistics on Table dialog box.

As the data in a column changes, index statistics become out of date. The out-of-date statistics can significantly influence the performance of the query. The Database Engine can automatically update index statistics if the database option AUTO_UPDATE_STATISTICS is activated (set to ON). In that case, any out-of-date statistics required by a query for optimization are automatically updated during query optimization.

There is also another database option, AUTO_CREATE_STATISTICS, that builds any missing statistics required by a query for optimization. Both options can be activated (or deactivated) using either the ALTER DATABASE statement or SQL Server Management Studio.

Column statistics

As you already know from the previous section, the Database Engine creates statistics for every existing index. The system can create statistics for nonindexed columns too. These statistics are called column statistics. Together with index statistics, column statistics are used to optimize execution plans. The Database Engine creates statistics for a nonindexed column that is a part of the condition in the WHERE clause.

There are several situations in which the existence of column statistics can help the optimizer to make the right decision. One of them is when you have a composite index on two or more columns. For such an index, the system generates statistics only for the first column in the index. The existence of column statistics for the second column (and all other columns) of the composite index can help the optimizer to choose the optimal execution plan.

The Database Engine supports two catalog views in relation to column statistics (these views can be used to edit the information concerning index statistics, too):

sys.stats
sys.stats_columns

The sys.stats view contains a row for each statistic of a table or a view. Besides the name column, which specifies the name of the statistics, this catalog view has, among others, two other columns: