Clustered Index Keys - The Right Choice

Parent: SQL Server Cost Based Optimizer

Intro

In SQL Server, the use of a clustered index for almost all tables has been pervasive for a very long time to the point that little was put into features necessary to support efficient operation and maximum performance for Heap tables.

The principle basis of a clustered index is that an index seek to the cluster key does not require a lookup for the additional columns in the table.

The Display Estimated or Actual Execution Plan feature of Microsoft SQL Server Management Studio shows the Index seek on the clustered index key as:
  ClusteredIndexSeek

and the Index seek to a heap table requiring a Lookup as:
  IndexSeekHeap

The impression conveyed in Microsoft SQL Server via the Execution Plan is that a Clustered Index cuts the operation cost by half versus a Heap table in which a Lookup is required.

What is not shown are overhead and behind the scenes operations, such as the plan lookup, and system table accesses. When these are factored in, the efficiency advantages of the Clustered Index over a Heap are significantly diminished.

It is further presumed via the execution plan cost model that the Lookup operation is the same for both lookups to a clustered index
  IndexSeekKLC
and to a heap
IndexSeekKLH

In fact, Lookups to a Clustered Index is a more expensive operation than the lookup to a heap in ideal circumstances. Ideal circumstances are when SQL Server knows that there no forwarded records, and the execution plan does not employ a Compute Scalar containing the BmkToPage intrinsic function.

ComputeScalar

Despite the theoretical advantages of the Cluster Index not being as significant as the impression conveyed by the graphical execution plan cost model, there are also liabilities to a heap table.

In any case, the advantage of a Clustered Index occurs in the use of the clustered index key, not the index keys the nonclustered indexes, unless the query is supported by a covered index. The obvious strategy is then to choose the most frequent or otherwise important mode of access to be the lead cluster key.

The Primary Key and the Clustered Index Key

As obvious as that may be, another pattern seems to be all too common. This being one in which each table has an Identity column that is also the Primary Key. Notice that it is not necessary to explicitly declare the primary key to be clustered.

CREATE TABLE CommonExampleButNotUniversalTemplate (
  ID INT NOT NULL IDENTITY PRIMARY KEY,
  AdditionalColumns INT NOT NULL
)
GO

Another pattern that was common during a certain period in which each table has a uniqueidentifier as the primary key clustered.

In certain tables, the identity column as the primary key clustered is a good choice. In others, it is not the best choice, and sometimes not even a good choice.

A frequent circumstance is one in which the most common the mode of access to a table is via one of the foreign key columns to a parent table. In this case, that should be the lead cluster key.

CREATE TABLE A (
  AID INT NOT NULL IDENTITY
  , PID INT NOT NULL
  , INDEX UCX_AID UNIQUE CLUSTERED(PID,AID)
  , CONSTRAINT PK_A PRIMARY KEY NONCLUSTERED (AID)
  , CONSTRAINT FK_A_P FOREIGN KEY (PID) REFERENCES P (PID) )
GO

However, the foreign key column is usually not unique in the referencing table. A very good practice for the cluster key is that the full key be unique. One common method is for the lead key to be the most frequently accessed parent table key with a local identity at the end, if necessary, to make the full key unique.

If this table has its own child tables, it may be desirable to have a Primary Key Nonclustered defined on the Identity column. This supports child tables that foreign to it. Another option, perhaps a better one, is to have a unique nonclustered index serve this purpose, which can have Included Columns. Note: certain high-level tools look for the primary key. I would advise avoiding tools built by people with a limited understanding of the SQL Server engine.

It is possible to have implicitly named primary and foreign keys and default constraints as well as in the example below.

CREATE TABLE B (
  BID INT NOT NULL IDENTITY PRIMARY KEY NONCLUSTERED
  , AID INT NOT NULL FOREIGN KEY REFERENCES A (AID)
  , Bdate date NOT NULL DEFAULT getdate()
  , INDEX UCX_AID UNIQUE CLUSTERED (AID, BID)
)

My recommendation is that all constraints be explicitly named, ideally to a compact naming convention, sufficient to guarantee uniqueness as required for sys.objects entities.

CREATE TABLE C (
  CID INT NOT NULL IDENTITY
  , AID INT NOT NULL
  , Cdate date NOT NULL CONSTRAINT DF_C_Cdate DEFAULT getdate()
  , INDEX UCX_AID UNIQUE CLUSTERED (AID,CID)
  , CONSTRAINT PK_C PRIMARY KEY NONCLUSTERED (CID)
  , CONSTRAINT FK_C_A FOREIGN KEY (AID) REFERENCES A(AID)
)
GO

Another situation is the join table, that supports a many-to-many relation between two tables. Any single row in table B might have a relation to more than one row in table C, and vice versa. In this case, there is no point in having an identity column as below.

CREATE TABLE BC_bad(
  ID INT NOT NULL IDENTITY
  , BID INT NOT NULL
  , CID INT NOT NULL
  , CONSTRAINT PK_BC1 PRIMARY KEY (ID)
  , CONSTRAINT FK_BC1_B FOREIGN KEY (BID) REFERENCES B (BID)
  , CONSTRAINT FK_BC1_C FOREIGN KEY (CID) REFERENCES C (CID)
)

The more efficient method to support a join table is to simply have the two foreign key columns as either the Primary Key Clustered, or just a unique clustered index.

CREATE TABLE BC2 (
  BID INT NOT NULL
  , CID INT NOT NULL
  , CONSTRAINT PK_BC1 PRIMARY KEY CLUSTERED(BID,CID)
  , INDEX UX_CB UNIQUE NONCLUSTERED (CID,BID)
  , CONSTRAINT FK_BC1_B FOREIGN KEY (BID) REFERENCES B (BID)
  , CONSTRAINT FK_BC1_C FOREIGN KEY (CID) REFERENCES C (CID)
 

It may or may not be necessary to have an index in the opposite order depending on the needs of the business logic.