SQL Server Primary Key vs Clustered Index, Part 3: Table Partitioning and Partition Key Selection

on November 12, 2012


Note: This is Part 3 of a three-part article on SQL Server primary keys and clustered indexes.

In Part 1 of this series we learned how to break apart the Clustered Index from the Primary Key. In Part 2, we saw there was little performance benefit in doing so. However, when table partitioning (especially for existing databases), separate Clustered Index and Primary Key may just be the answer.

In the real world, most large tables, for existing databases, join through foreign keys based only on an identity property in the parent table (the Primary Key). This works great from a performance aspect, as queries only need an integer column (unique) to join parent and child tables.

There are many white papers and articles on SQL Server table partitioning with academic examples (see references at end of this article), and most suggest partitioning on a natural key, like date. This is a good recommendation, but then how do you implement it when there are child tables? This article will show how to do so.

Partioning on a natural key when there are child tables: an example

In this case, we will use the container table as described in Part 2 with child tables having Foreign Key relations back to ContainerID.

Plan3 Fig1

 

The customer data retention plan called for data to be retained 6 months (since last update and/or create) and the data to be partitioned on a month-to-month basis. Data archiving and purging will be monthly just prior to the end of the month. Therefore, for this scenario, we defined a computed column LastDateKey as shown below:

ALTER TABLE dbo.Container ADD LastDateKey AS
(CONVERT(INT,CONVERT(varchar(8),ISNULL(UpdateDate,CreateDate),112))) PERSISTED;

Note the PERSISTED key word. This is required so that the LastDateKey can be part of the clustered index. Also, note this is an integer column in the form of yyyymmdd. Now we can create the Partition Function and Partition Scheme:

CREATE PARTITION FUNCTION DB_PF1 (INT)
       AS RANGE RIGHT FOR VALUES(20120301,20120401,20120501,20120601,20120701,20120801)
GO
CREATE PARTITION SCHEME DB_PS1
       AS PARTITION DB_PF1 ALL TO ([FileGroup01])
GO

The partition function defines 7 partitions covering the last 6 months and the partition scheme puts all the partitions in the same filegroup (FileGroup01). The RANGE RIGHT option defines the lower range boundary of each partition starting with partition 2 (the first partition has no lower boundary).

Before we can separate the Primary Key from the Clustered Index, the dependent Foreign Keys (from the child tables) must be dropped. Otherwise, you will encounter an error when trying to drop the Primary Key. Assuming this has been done, you can then run the following statements

IF  EXISTS (SELECT * FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.Container') AND name = N'PK_Container')
              ALTER TABLE dbo.Container DROP CONSTRAINT PK_Container;
GO
CREATE UNIQUE CLUSTERED INDEX CIDX_Container ON dbo.Container(ContainerID, LastDateKey)
       WITH (ONLINE = ON, DATA_COMPRESSION = ROW)
       on DB_PS1(LastDateKey);
GO
       --Rebuild PK
ALTER TABLE dbo.Container WITH CHECK ADD CONSTRAINT PK_Container PRIMARY KEY NONCLUSTERED
(ContainerID)
       ON [PRIMARY]

The first TSQL statement drops the Primary Key (and its respective clustered index). The second TSQL statement creates a unique clustered index based on ContainerID and LastDateKey with row compression. The reason for this order is from the performance degradation we saw in Part 2 of this series if the LastDateKey was first (for this particular database, most table joins are on the ContainerID). The third TSQL statement rebuilds the primary key (non-clustered). It is defined on the Primary file group and is not partitioned. This is because all of the child tables have a single column Foreign Key constraint referring back to ContainerID. If you wanted a partitioned Primary Key, all of the child tables would have to be modified with the addition of the LastDateKey column. This is very undesirable for a variety of reasons – mainly that the LastDateKey would be a property (and defined) on the parent table (Container) and have no relation to the data in the child table. There is no need to have 2 columns for join criteria (from a performance viewpoint) back to the container (parent) table when the ContainerID is already unique.

Now that the Container table partitioned by month (as defined above), it is very easy to add a new month:

ALTER PARTITION SCHEME DB_PS1 NEXT USED [FileGroup01]
GO
ALTER PARTITION FUNCTION DB_PF1() SPLIT RANGE (20120901);
GO

The NEXT USED statement is necessary to alert SQL Server to the filegroup to use for the next partition range. The SPLIT statement actually creates the new boundary. It can be run before or after that date – just be careful if the SPLIT is after the date there may be data movement within the filegroup with a resulting schema lock and possible application timeouts. The MERGE statement (not shown) can remove an old partition boundary – but, again, be careful – there also may be data movement from orphaned data or data retained because of other archiving constraints. This MERGE operation generates a schema lock and may cause application timeouts. The more data to move, the longer the locks are retained. Data movement during splits or merges can be very disruptive during your monthly partition maintenance operations — especially in a 24×7 environment.

Final thoughts

Do not forget to add back in the foreign keys that were dropped earlier. Another tip to consider is to drop all the existing non-clustered indexes on the parent table before dropping the Primary Key and creating the new Clustered Index and new Primary Key. The non-clustered indexes can then be rebuilt (I would suggest using the ONLINE = ON option and also partitioned on the partition scheme, e.g.,

CREATE NONCLUSTERED INDEX [IDX_Container_ParentContainerID] ON dbo.Container
       (ParentContainerID ASC,DelFlag ASC)
       INCLUDE (CreateDate,ContainerTypeID,UpdateDate)
       WITH (ONLINE = ON, DATA_COMPRESSION = ROW)
       ON [NGSCoreContainerPS1](ContainerID);

Data compression is another tool for consideration, but that is a topic for another article.

Consider an index on LastDateKey; it could be useful in queries using dates or a date range.

This article should have given you some insight into SQL Server table partitioning selecting a partition key. In any case picking a partition key needs a fair amount of thought – once you do so and implement it the cost of change may be prohibitive.

References:

  1. Partitioned Tables and Indexes in SQL Server 2005, Kimberly Tripp, http://msdn.microsoft.com/en-us/library/ms345146(v=sql.90).aspx
  2. Partitioning in SQL Server 2008, Muhammad Shujaat Siddiqi, http://www.sqlservercentral.com/articles/partition/64740/
  3. Easy Table Partitions with SQL Server 2008, Ravi Subramaniam, http://blogs.msdn.com/b/manisblog/archive/2009/01/18/easy-table-partitions-with-sql-server-2008.aspx
  4. SQL SERVER – 2005 – Database Table Partitioning Tutorial – How to Horizontal Partition Database Table, Pinal Dave, http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
  5. Creating a table with horizontal partitioning in SQL Server, Atif Shehzad, http://www.mssqltips.com/sqlservertip/1796/creating-a-table-with-horizontal-partitioning-in-sql-server/

Related Posts

Comments

  1. Hi Mike,

    Thanks for sharing your knowledge on Table partition , this is an exceptional article for partitioning existing tables, can you please let me know if there will be any performance impacts by introducing the unique Key and altering the current primary key from a clustered Index to a non clustered index ? Appreciate your efforts in writing such a detailed article.

    Regards,
    Kiran

    • Two things that I can think of. One is the addition of another index (DML impact) and the other is that I found our with SS 2012 that you could not swap partition windows when you have separate clustered index and primary key. I have not checked this in SS2014 and SS2016, but I suspect this is still the same. At my previous employment we just used partitions for helping us with data retention — not performance. There may have been some performance benefits, but our data retention for our OLTP main database was 6 months. Even when I got the window swapping to work I still have to feed back in data that had child or parent data referencing it within the six month window. Having said all that (and in my previous articles) really architect the solution before putting in development/production. That are many little gotchas that seems to pop up along the way.

      • Hi Mike,

        I’m considering Partition for both Performance and archiving purposes , I currently designed two strategies with only two partitions : First one using your above example with a unique Key on a new column LastDateKey with the existing primary Key (clustered Index).

        And second Partitoin Strategy by just partitioning the exiting Primary Key(Clustered index with a datatype : int) in this scenario I queried the first id value for the year of partition and used it to create the Parttion function as below.

        –Create the Partition Function :

        CREATE PARTITION FUNCTION PF_ArchiveYearly (INT)

        AS RANGE RIGHT FOR VALUES(6462116,9006517)

        GO

        –Create Partition Scheme :

        CREATE PARTITION SCHEME PS_ArchiveYearly

        AS PARTITION PF_ArchiveYearly TO (PartitionFileGroup_OlderThanTwoYears_Sales,PartitionFileGroup_PreviousYear,[PRIMARY])

        GO

        I’m not able to figure out the performance difference between these two strategies (The one with unique index vs Partitioning on the existing clustered index) can you please let me know your opinion on my approach,

        what are the implications to use just an existing clustered index as partition key ? I’m fine to figure out the year boundaries to archive data yearly , I’m being a bit cautious to change existing Clustered index of the table(s) as there is a lot of code relying on this clustered index (this db is a 7 year old well maintained db and this si the core db for the company). Thanks for your advise.

        Cheers,
        Kiran

        • Actually either approach has about the same performance effect. The only problem with using the original PK is that you must assign it at the start of the next time period you are using. For example, if you are partitioning by month, then at midnight of the next month you need to determine your maximum PK value (could be multiple columns) and assign it then. If you wait any time later to assign the new partition key, you will have to endure data movement from the old partition to the new one. This can be costly in both data movement and the fact that the table is under a schema lock at the time. Not sure this wholly answers your question, but if not, ask some more.

          Cheers,
          Mike Byrd

  2. Thank you for the very insightful article, I have been through a few and this was by far the most informative.

    You certainly know your stuff

  3. I am totally confused with the advantage of Partition. I have a legacy database where few table grows , so wanted a solution to speed up the queries on the tables. So created a partition based on a flag (a bit type partition key). Created a clustered index on id column with partition. When query the table without partition key , take double the cost of without partition on the same table. If I include the partition key in the query it take the same cost as that of the table without partition. Could you comment on the benefit in such as case? Is it worth relying on partition to gain performance?

Trackbacks

Leave a Reply