SQL Server 2012 Query Processing–Join Order Selection

on September 12, 2013


Generally, the order in which two or more joined tables are written in the FROM clause of a SELECT statement doesn’t influence the decision made by the SQL Server optimizer in relation to their processing order.

Many different factors influence the decision of the optimizer regarding which table will be accessed first. On the other hand, you can influence the join order selection by using the FORCE ORDER hint.

Join processing techniques

The join operation is the most time-consuming operation in query processing. The Database Engine supports the following three different join processing techniques, so the optimizer can choose one of them depending on the statistics for both tables:

  •     Nested loop
  •     Merge join
  •     Hash join

The following subsections describe these techniques.

Nested loop

Nested loop is the processing technique that works by “brute force.” In other words, for each row of the outer table, each row from the inner table is retrieved and compared. The pseudo-code in Algorithm 1 demonstrates the nested loop processing technique for two tables.

ALGORITHM 1

0550_001

0551_001

 

In Algorithm 1, every row selected from the outer table (table A) causes the access of all rows of the inner table (table B). After that, the comparison of the values in the join columns is performed and the row is added to the result set if the values in both columns are equal.

The nested loop technique is very slow if there is no index for the join column of the inner table. Without such an index, the Database Engine would have to scan the outer table once and the inner table n times, where n is the number of rows of the outer table. Therefore, the query optimizer usually chooses this method if the join column of the inner table is indexed, so the inner table does not have to be scanned for each row in the outer table.

Merge join

The merge join technique provides a cost-effective alternative to constructing an index for nested loop. The rows of the joined tables must be physically sorted using the values of the join column. Both tables are then scanned in order of the join columns, matching the rows with the same value for the join columns. The pseudo-code in Algorithm 2 demonstrates the merge join processing technique for two tables.

ALGORITHM 2

0551_002

 

The merge join processing technique has a high overhead if the rows from both tables are unsorted. However, this method is preferable when the values of both join columns are sorted in advance. (This is always the case when both join columns are primary keys of corresponding tables, because the Database Engine creates by default the clustered index for the primary key of a table.)

Hash join

The hash join technique is usually used when there are no indices for join columns. In the case of the hash join technique, both tables that have to be joined are considered as two inputs: the build input and the probe input. (The smaller table usually represents the build input.) The process works as follows:

  1. The value of the join column of a row from the build input is stored in a hash bucket depending on the number returned by the hashing algorithm.
  2. Once all rows from the build input are processed, the processing of the rows from the probe input starts.
  3. Each value of the join column of a row from the probe input is processed using the same hashing algorithm.
  4. The corresponding rows in each bucket are retrieved and used to build the result set.

 

NOTE

The hash join technique requires no index. Therefore, this method is highly applicable for ad hoc queries, where indices cannot be expected. Also, if the optimizer uses this processing technique, it could be a hint that you should create additional indices for one or both join columns.

Related Posts

Leave a Reply