Boost Performance with SQL Server 2014 Columnstore Indexes

By:


One of the main new features when SQL Server 2012 was originally released was columnstore indexes. By using a new column-based storage approach and new query-processing algorithms, memory-optimized columnstore indexes were designed to improve the performance of data warehouse queries by several orders of magnitude. Although the inability to update data was the biggest drawback when this feature was originally released back in 2012, the current release, SQL Server 2014, has addressed this limitation and now has the ability to directly update its data and even create a columnstore clustered index on it. The fact that columnstore indexes were originally limited to only nonclustered indexes was also considered a limitation because it required duplicated data on an already very large object such as a fact table.

In an OLTP system, transactions usually access one or a few rows whereas typical data warehouse star join queries access a large number of rows. In addition, an OLTP transaction usually accesses all the columns in the row, which is opposite to a star join query, where usually only a few columns are required. This data access pattern showed that a columnar approach could benefit data warehouse workloads.

The traditional storage approach used by SQL Server is to store rows on data pages, which we call a “rowstore.” Rowstores in SQL Server include heaps and B-tree structures such as standard clustered and nonclustered indexes. Column-oriented storage like that used by columnstore indexes dedicates entire database pages to store data from a single column. Rowstore and columnstore are compared in Figure 1, where a rowstore contains pages with rows, each row containing all its columns, and a columnstore contains pages with data for only one column, labeled C1, C2, and so on. Because the data is stored in columns, a question frequently asked is how a row is retrieved from this columnar storage. It is the position of the value in the column that indicates to which row this data belongs. For example, the first value on each page (C1, C2, and so on) shown on Figure 1 belongs to the first row, the second value on each page belongs to the second row, and so on.

f0329-01

Figure 1 Rowstore and columnstore data layout comparison

Column-oriented storage is not new and has been used before by some other database vendors. Columnstore indexes are based on Microsoft xVelocity technology, formerly known as VertiPaq, which is also used in SQL Server Analysis Services (SSAS), PowerPivot for Excel, and SharePoint. As with Hekaton, columnstore indexes are an in-memory technology.

As mentioned, columnstore indexes dedicate entire database pages to store data from a single column. Columnstore indexes are also divided into segments, which consist of multiple pages, and each segment is stored in SQL Server as a separate BLOB. As indicated earlier, in SQL Server 2014, it is now possible to define a columnstore index as a clustered index, which is a great benefit because there is no need for duplicated data. For example, in SQL Server 2012, a nonclustered columnstore index had to be created on a heap or regular clustered index, thus duplicating the data of an already large fact table.

Performance Benefits

Columnstore indexes provide increased performance benefits based on the following:

  • Reduced I/O – Because in a rowstore pages contain all the columns in a row, in a data warehouse without columnstore indexes, SQL Server has to read all the columns, including the columns that are not required by the query. Typical star join queries use only 10 percent to 15 percent of the columns in a fact table. Based on this, using a columnstore to read only those columns can represent savings of 85 percent to 90 percent in disk I/O, compared to a rowstore.
  • Batch mode processing – New query-processing algorithms are designed to process data in batches and are very efficient at handling large amounts of data.
  • Compression – Because data from the same column is stored contiguously in the same pages, it will usually have similar or repeated values that can often be compressed more effectively. Compression can improve performance because fewer disk I/O operations are needed to read compressed data and more data can fit into memory. Columnstore compression is not the same as the row and page compression available since SQL Server 2008 and instead uses the VertiPaq compression algorithms. However, a difference is that in columnstore indexes, column values are automatically compressed and compression cannot be disabled, whereas row and page compression in rowstores is optional and has to be explicitly configured.
  • Segment elimination – As indicated earlier, columnstore indexes are divided into segments, and SQL Server keeps the minimum and maximum values for each column segment on the sys.column_store_segments catalog view. This information can be used by SQL Server to compare against query filter conditions to identify whether a segment can contain the requested data, thus avoiding reading segments that are not needed and saving both I/O and CPU resources.

Because data warehouse queries using columnstore indexes can now be executed, in many cases, several orders of magnitude faster than defining the same table in a rowstore structure, they can provide other benefits as well, such as reducing or eliminating the need to rely on pre-built aggregates such as OLAP cubes, indexed views, and summary tables. The only action required to benefit from these performance improvements is to define the columnstore indexes in your fact tables. There is no need to change your queries or use any specific syntax; the query optimizer will automatically consider the columnstore index—although, as always, whether or not it is used in a plan will be a cost-based decision. Columnstore indexes also provide more flexibility to changes than building these pre-built aggregates. If a query changes, the columnstore index will still be useful, and the query optimizer will automatically react to those query changes, whereas the pre-built aggregate may no longer be able to support the new query or may require changes to accommodate it.

Leave a Reply