Composite Primary Keys

Ah … primary keys … such a topic! When discussing what columns to
define as a primary key in your data models, two large points always
tend to surface:

1. Surrogate Keys versus Natural Keys
2. Normalization

These can be very complicated and sometimes polarizing things to
debate. As I often try to do, I will attempt to approach this topic from
a slightly different perspective.

Let's start things off with what I feel is a good interview question:

How would you define what a primary key of a table is?

a. An auto-generated numeric or GUID column in the table that
uniquely identifies each row
b. A non-nullable column in the table that uniquely identifies each
row
c. None of the above

I suspect that many people will answer (a), and quite a few will
answer (b). If you answer (c), though, you are correct! Why?
Because a primary key is not necessarily a single column; it is a set
of one or more columns.
Many people who have designed large, complicated systems are
simply not aware of this.

I once worked with a consultant who kept claiming that importing data
into his system was complicated, because his database used “primary
keys”. It was very confusing (yet humorous) trying to discuss things
with him because he kept confusing primary keys with identity
columns. They are not the same! An identity column may be a type
of primary key, but a primary key is not an identity column; it is a set
of columns that you define that determine what makes the data in
your table unique. It defines your data. It may be an identity column,
it may be a varchar column or a datetime column or an integer
column, or it may be a combination of multiple columns.

When you define more than one column as your primary key on a
table, it is called a composite primary key. And many experienced and
otherwise talented database programmers have never used them and
may not even be aware of them. Yet, composite primary keys are very
important when designing a good, solid data model with integrity.

This will be greatly oversimplifying things, but for this discussion let's
categorize the tables in a database into these two types:

• Tables that define entities
• Tables that relate entities

Tables that define entities are tables that define customers, or sales
people, or even sales transactions. The primary key of these tables is
not what I am here to discuss. You can use GUID columns, identity
columns, long descriptive text columns, or whatever you feel
comfortable using as primary keys on tables that define entities. It’s
all fine by me; whatever floats your boat, as they say. There are lots of
discussions and ideas about how best to choose the primary key of
these tables, and pros and cons to all of the various approaches, but
that is not really what I am addressing.

Tables that relate entities, however, are a different story.

Suppose we have a system that tracks customers, and allows you to
assign multiple products to multiple customers to indicate what they
are eligible to order. This is called a many-to-many or N:N relation
between Customers and Products. We already have a table of
Products, and a table of Customers. The primary key of the Products
table is ProductID, and the Customers table is CustomerID. Whether
or not these “ID” columns are natural or surrogate, identity or GUID,
numerical or text or codes, is irrelevant at this point.

What is relevant and important, and what I am here to discuss, is how
we define our CustomerProducts table. This table relates customers to
products, so the purpose of the table is to relate two entities that have
already been defined in our database. Let’s also add a simple
“OrderLimit” column which indicates how many of that product they
are allowed to order. (This is just a simple example; any attribute will
do.) How should we define this table?

For some reason, a very common answer is that we simply create a
table with 4 columns: one that stores the CustomerID, one that stores
the ProductID we are relating it to, the OrderLimit, and of course the
primary key column, which is an identity:

create table CustomerProducts
(
    CustomerProductID int identity primary key,
    CustomerID int references Customers(CustomerID) not null,
    ProductID int references Products(ProductID) not null,
    OrderLimit int not null
)

This is what I see in perhaps most of the databases that I’ve worked
with over the years. The reason for designing a table in this manner?
Honestly, I don’t know! I can only surmise that it is because of the
lack of understanding what a primary key of a table really is, and that
it can be something other than an identity and that it can be
comprised of more than just a single column. As I mentioned, it
seems that many database architects are simply not aware of this fact.

So then, what is the problem here? The primary issue is data
integrity. This table allows me to enter the following data:

CustomerProductID  CustomerID  ProductID  OrderLimit
1                  1           100        25
2                  1           100        30

In the above data, what is the order limit for CustomerID #1,
ProductID #100? Is it 25 or 30? There is no way to know for sure.
Nothing in the database constrains this table so that we only
have exactly one row per CustomerID/ProductID combination.
Remember, our primary key is just an identity, which does not
constrain anything.
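
The problem is easy to reproduce. The following is only a sketch: it uses SQLite through Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made so the example runs anywhere), AUTOINCREMENT plays the role of the identity column, and the foreign key references are left out to keep it self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate-key-only design: the auto-generated ID is the sole constraint.
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerProductID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID INTEGER NOT NULL,
        ProductID  INTEGER NOT NULL,
        OrderLimit INTEGER NOT NULL
    )
""")

# Both rows are accepted, even though they describe the same relationship.
conn.execute(
    "INSERT INTO CustomerProducts (CustomerID, ProductID, OrderLimit) VALUES (1, 100, 25)"
)
conn.execute(
    "INSERT INTO CustomerProducts (CustomerID, ProductID, OrderLimit) VALUES (1, 100, 30)"
)

limits = [
    row[0]
    for row in conn.execute(
        "SELECT OrderLimit FROM CustomerProducts "
        "WHERE CustomerID = 1 AND ProductID = 100 "
        "ORDER BY CustomerProductID"
    )
]
print(limits)  # two conflicting order limits for the same customer and product
```

Nothing raised an error, so the table now holds two answers to a question that should have exactly one.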

Most database designs like this just assume (hope?) that the data will
always be OK and there will be no duplicates. The UI will handle
this, of course! But even if you think that only one single form on one
single application ever updates this table, you have to remember that
data will always get in and out of your system in different ways. What
happens if you upgrade your system and have to move the data over?
What if you need certain transactions restored from a backup? What
if you ever need to do a batch import to save valuable data entry
time? Or to convert data from a new system that you are absorbing or
integrating?

If you ever write a report or an application off of a system and
simply assume that the data will be constrained a certain way, but the
database itself does not guarantee that, you are either a) greatly
over-engineering what should be a simple SQL statement to deal with the
possibility of bad data or b) ignoring the possibility of bad data
completely and setting yourself up for issues down the road. It's
possible to constrain data properly, it's efficient, it's easy to do, and it
simply must be done or you should not really be working with a
database in the first place -- you are forgoing a very important
advantage it provides.

So, to handle that issue with this table design, we need to create a
unique constraint on our CustomerID/ProductID columns:

create unique index cust_products_unique on
    CustomerProducts (CustomerID, ProductID)

Now, we are guaranteed that there will only be exactly one row per
combination of CustomerID and ProductID. That handles that
problem, our data now has integrity, so we seem to be all set, right?

Well, let’s remember the definition of what a primary key really is. It
is the set of columns in a table that uniquely identifies each row of
data. Also, for a table to be normalized, all non-primary key columns
in a table should be fully dependent on the primary key of that table.

Consider instead the following design:

create table CustomerProducts
(
    CustomerID int references Customers(CustomerID) not null,
    ProductID int references Products(ProductID) not null,
    OrderLimit int not null,
    primary key (CustomerID, ProductID)
)

Notice here that we have eliminated the identity column, and have
instead defined a composite (multi-column) primary key as the
combination of the CustomerID and ProductID columns. Therefore, we
do not have to create an additional unique constraint. We also do not
need an additional identity column that really serves no purpose. We
have not only simplified our data model physically, but we’ve also
made it more logically sound and the primary key of this table
accurately explains what it is this table is modeling – the relationship
of a CustomerID to a ProductID.
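
To see the composite key doing the constraining on its own, here is a small sketch under the same assumptions as before: SQLite via Python's sqlite3 module stands in for SQL Server, and the foreign key references to Customers and Products are omitted so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Composite primary key: the pair (CustomerID, ProductID) identifies a row.
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerID INTEGER NOT NULL,
        ProductID  INTEGER NOT NULL,
        OrderLimit INTEGER NOT NULL,
        PRIMARY KEY (CustomerID, ProductID)
    )
""")

conn.execute("INSERT INTO CustomerProducts VALUES (1, 100, 25)")

# A second row for the same customer/product pair violates the key.
try:
    conn.execute("INSERT INTO CustomerProducts VALUES (1, 100, 30)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The same customer with a different product is a different relationship.
conn.execute("INSERT INTO CustomerProducts VALUES (1, 101, 10)")

row_count = conn.execute("SELECT COUNT(*) FROM CustomerProducts").fetchone()[0]
print(duplicate_rejected, row_count)
```

No extra unique index was needed; the primary key itself rejects the conflicting row while still allowing CustomerID 1 to appear more than once.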

Going back to normalization, we also know that our OrderLimit column
should be dependent on our primary key columns. Logically, our
OrderLimit is determined based on the combination of a CustomerID
and a ProductID, so physically this table design makes sense and is
fully normalized. If our primary key is just a meaningless
auto-generated identity column, it doesn’t make logical sense since our
OrderLimit is not dependent on that.

Some people argue that having more than one column in a primary
key “complicates things” or “makes things less efficient” rather than
always using identity columns. This is simply not the case. We’ve
already established that you must add additional unique constraints to
your data to have integrity, so instead of just:

1. A single indexed composite primary key that uniquely constrains
   our data

we instead need:

1. An additional identity column
2. A primary key index on that identity column
3. An additional unique constraint on the columns that logically
   define the data

So we are actually adding complexity and overhead to our design, not
simplifying! And we are requiring more memory and resources to
store and manipulate data in our table.

In addition, let's remember that a data model can be a complicated
thing. We have all kinds of tables that have primary keys defined that
let us identify what they are modeling, and we have relations and
constraints and data types and the rest. Ideally, you should be able to
look at a table's primary key and understand what it is all about, and
how it relates to other tables, and not need to basically ignore the
primary key of a table and instead investigate unique constraints on
that table to really determine what is going on! It simply makes no
sense and adds unnecessary confusion and complication to your
schema that is so easily avoided.

Some people will claim that being able to quickly label and identify the
relation of a Product to a Customer with a single integer value makes
things easier, but again we are over-complicating things. If we only
know we are editing CustomerProductID #452 in our user interface,
what does that tell us? Nothing! We need to select from the
CustomerProducts table every time just to get the CustomerID and the
ProductID that we are dealing with in order to display labels or
descriptions or to get any related data from those tables. If, instead,
we know that we are editing CustomerID #1 and productID #6
because we are using a true, natural primary key of our table, we
don’t need to select from that table at all to get those two very
important attributes.

There are lots of complexities and many ways to model things, and
there are many complicated situations that I did not discuss here. I
am really only scratching the surface. But my overall point is to at
least be aware of composite primary keys, and the fact that a primary
key is not always a single auto-generated column. There are pros
and cons to many different approaches, from both a logical design and
physical performance perspective, but please consider carefully the
idea of making your primary keys count for something and don’t
automatically assume that just tacking on identity columns to all of
your tables will give you the best possible database design.

And, remember -- when it comes to defining your entities, I
understand that using an identity or GUID or whatever you like instead
of real-world data has advantages. It is when we relate entities that
we should consider using those existing primary key columns from our
entity tables (however you had defined them) to construct an
intelligent and logical and accurate primary key for our entity relation
table to avoid the need to create extra, additional identity columns and
unique constraints.

A composite key has more than one attribute (field). In this example we
store details of tracks on albums - we need to use three columns to get
a unique key - each album may have more than one disk - each disk will
have tracks numbered 1, 2, 3...

The primary key must be different for each row of the table. The primary
key may not contain a null.

Source: http://sqlzoo.net/howto/source/z.dir/tip241027/i02create.xml

PRIMARY KEY Constraints
SQL Server 2000

A table usually has a column or combination of columns whose values uniquely
identify each row in the table. This column (or columns) is called the primary key of
the table and enforces the entity integrity of the table. You can create a primary key
by defining a PRIMARY KEY constraint when you create or alter a table.

A table can have only one PRIMARY KEY constraint, and a column that participates in
the PRIMARY KEY constraint cannot accept null values. Because PRIMARY KEY
constraints ensure unique data, they are often defined for an identity column.

When you specify a PRIMARY KEY constraint for a table, Microsoft® SQL Server™
2000 enforces data uniqueness by creating a unique index for the primary key
columns. This index also permits fast access to data when the primary key is used in
queries.

If a PRIMARY KEY constraint is defined on more than one column, values may be
duplicated within one column, but each combination of values from all the columns in
the PRIMARY KEY constraint definition must be unique.

For example, the au_id and title_id columns in the titleauthor table form a
composite PRIMARY KEY constraint for the titleauthor table, which ensures
that the combination of au_id and title_id is unique.

When you work with joins, PRIMARY KEY constraints relate one table to another. For
example, to determine which authors have written which books, you can use a
three-way join between the authors table, the titles table, and the titleauthor table.
Because titleauthor contains columns for both the au_id and title_id columns,
the titles table can be accessed by the relationship between titleauthor and titles.
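
The three-way join described above might look like the following sketch. It runs against SQLite through Python's sqlite3 module rather than SQL Server (an assumption made so it is runnable anywhere), the tables are cut down to the columns needed, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (au_id TEXT PRIMARY KEY NOT NULL, au_lname TEXT NOT NULL);
    CREATE TABLE titles  (title_id TEXT PRIMARY KEY NOT NULL, title TEXT NOT NULL);

    -- The relating table: its primary key is the pair (au_id, title_id).
    CREATE TABLE titleauthor (
        au_id    TEXT NOT NULL REFERENCES authors(au_id),
        title_id TEXT NOT NULL REFERENCES titles(title_id),
        PRIMARY KEY (au_id, title_id)
    );

    -- Invented sample data: one author with two books, one book with two authors.
    INSERT INTO authors VALUES ('A1', 'Green'), ('A2', 'Carson');
    INSERT INTO titles  VALUES ('T1', 'First Book'), ('T2', 'Second Book');
    INSERT INTO titleauthor VALUES ('A1', 'T1'), ('A1', 'T2'), ('A2', 'T2');
""")

# authors -> titleauthor -> titles
rows = conn.execute("""
    SELECT a.au_lname, t.title
    FROM authors AS a
    JOIN titleauthor AS ta ON ta.au_id = a.au_id
    JOIN titles AS t ON t.title_id = ta.title_id
    ORDER BY a.au_lname, t.title
""").fetchall()

for au_lname, title in rows:
    print(au_lname, "wrote", title)
```

Note that au_id 'A1' and title_id 'T2' each appear twice in titleauthor; only the combination has to be unique.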
Introduction

Not long ago, I had an interesting and extended debate with one of my friends
regarding which column should be the primary key in a table. The debate instigated an
in-depth discussion about candidate keys and primary keys. My present article
revolves around these two types of keys.

Let us first try to grasp the definition of the two keys.

Candidate Key - A Candidate Key can be any column or a combination of columns
that can qualify as a unique key in the database. There can be multiple Candidate Keys in
one table. Each Candidate Key can qualify as Primary Key.

Primary Key - A Primary Key is a column or a combination of columns that uniquely
identifies a record. Only one Candidate Key can be Primary Key.

One needs to be very careful in selecting the Primary Key as an incorrect selection
can adversely impact the database architecture and future normalization. For a
Candidate Key to qualify as a Primary Key, it should be Non-NULL and unique in
any domain. I have observed quite often that Primary Keys are seldom changed. I
would like to have your feedback on not changing a Primary Key.

An Example to Understand Keys

Let us look at an example where we have multiple Candidate Keys, from which we
will select an appropriate Primary Key.

Given below is an example of a table having three columns that can each qualify as a
single-column Candidate Key; counting every combination of those columns, the number
of unique column sets reaches seven (strictly speaking, the multi-column combinations
are superkeys rather than minimal Candidate Keys). A point to remember here is that only
one Candidate Key can be selected as Primary Key. The decision of Primary Key selection
from the possible Candidate Keys is often perplexing but very important!
Running the following script returns 504 rows for every option. This
shows that they are all unique in the database and meet that criterion for a Primary Key.

Run the following script to verify whether all the columns have unique values.

USE AdventureWorks
GO
SELECT *
FROM Production.Product
GO
SELECT DISTINCT ProductID
FROM Production.Product
GO
SELECT DISTINCT Name
FROM Production.Product
GO
SELECT DISTINCT ProductNumber
FROM Production.Product
GO

All of the above queries will return the same number of records; hence, they all
qualify as Candidate Keys. In other words, they are the candidates for Primary Key.
There are a few points to consider while turning any Candidate Key into a Primary Key.

Select a key that does not contain NULL

It may be possible that there are Candidate Keys that presently do not contain any
NULL values but technically can contain NULL. In this case, they will not
qualify for Primary Key. In the following table structure, we can see that even though
column [Name] does not have any NULL value, it does not qualify because it has the
potential to contain a NULL value in the future.

CREATE TABLE [Production].[Product](
    [ProductID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [dbo].[Name] NULL,
    [ProductNumber] [nvarchar](25) NOT NULL,
    [Manufacturer] [nvarchar](25) NOT NULL
)

Select a key that is unique and does not repeat

It may be possible that Candidate Keys that are unique at this moment may contain
duplicate values in the future. These kinds of Candidate Keys do not qualify for Primary
Key. Let us understand this scenario by looking into the example given above. It is
absolutely possible that two Manufacturers can create products with the same name; the
resulting name will be a duplicate and only the name of the Manufacturer will differ
in the table. This disqualifies Name in the table to be a Primary Key.
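
The "same row count" test and its weakness can be sketched as follows. This is an illustration only: it uses SQLite through Python's sqlite3 module instead of SQL Server, a cut-down Product table, invented sample rows, and a hypothetical helper named is_unique that compares the total row count with the count of distinct values.

```python
import sqlite3


def is_unique(conn, table, columns):
    """Return True if no two rows share the same values in `columns`.

    Hypothetical helper for illustration; table and column names are
    interpolated directly, so only pass trusted identifiers.
    """
    cols = ", ".join(columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total == distinct


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Product (
        ProductID     INTEGER PRIMARY KEY,
        Name          TEXT,
        ProductNumber TEXT NOT NULL,
        Manufacturer  TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [(1, "Road Bike", "RB-100", "Acme"), (2, "Helmet", "HL-200", "Acme")],
)

# With today's data, Name looks like a candidate key.
name_unique_before = is_unique(conn, "Product", ["Name"])

# A second manufacturer ships a product with the same name.
conn.execute("INSERT INTO Product VALUES (3, 'Helmet', 'HL-300', 'Globex')")

name_unique_after = is_unique(conn, "Product", ["Name"])
id_unique_after = is_unique(conn, "Product", ["ProductID"])
print(name_unique_before, name_unique_after, id_unique_after)
```

A column passing the distinct-count check only shows that the current data is unique; whether it can stay unique is a question about the business rules, not about the rows that happen to be in the table today.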

Make sure that Primary Key does not keep changing

This is not a hard and fast rule but rather a general recommendation: Primary Key
values should not keep changing. It is quite convenient for a database if the Primary Key
is static. Primary Keys are referenced in numerous places in the database, from
Indexes to Foreign Keys. If they keep changing, they can adversely affect
database integrity, data statistics, and the internals of Indexes.

Selection of Primary Key

Let us examine our case by applying the above three rules to the table and decide on
the appropriate candidate for Primary Key. Name can contain NULL, so it is disqualified
as per Rule 1, and it can repeat, so it is also disqualified as per Rule 2. Product Number
can be duplicated for different Manufacturers, so it is disqualified as per Rule 2.
ProductID is an Identity column, and Identity column values cannot be modified. So, in
this case ProductID qualifies as Primary Key.

Please note that many database experts suggest that it is not a good practice to
make an Identity Column the Primary Key. The reason behind this suggestion is that
many times an Identity Column that has been assigned as Primary Key does not play
any role in the database. This Primary Key is used neither in the application nor in
T-SQL. Besides, this Primary Key may not be used in Joins. A JOIN on the Primary Key,
or a WHERE condition on the Primary Key, usually gives better performance than one on
non primary key columns. This argument is absolutely valid and one must make sure not
to use such an Identity Column. However, our example presents a different case. Here,
although ProductID is an Identity Column, it uniquely defines the row and the same
column will be used as a foreign key in other tables. If a key is used in any other
table as a foreign key, it is likely that it will be used in joins.

Quick Note on Other Kinds of Keys

The above paragraph evokes another question - what is a foreign key? A foreign key
is a column or combination of columns in one table that refers to the primary key (or
another unique key) of another table. A primary key can be referenced by multiple
foreign keys from other tables. A primary key is not required to be referenced by any
foreign key. The interesting part is that a foreign key can also refer back to its own
table, from a different column. This kind of foreign key is known as a "self-referencing
foreign key".
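
A self-referencing foreign key can be sketched like this. The Employee table and its columns are hypothetical names chosen for illustration, and the example runs on SQLite through Python's sqlite3 module rather than SQL Server; SQLite only enforces foreign keys after the PRAGMA shown in the first statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# ManagerID points back at EmployeeID in the same table.
conn.execute("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        ManagerID  INTEGER REFERENCES Employee(EmployeeID)
    )
""")

conn.execute("INSERT INTO Employee VALUES (1, 'Ann', NULL)")  # no manager
conn.execute("INSERT INTO Employee VALUES (2, 'Bob', 1)")     # managed by Ann

# Employee 99 does not exist, so this row cannot reference it.
try:
    conn.execute("INSERT INTO Employee VALUES (3, 'Cy', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

The NULL in the first row is allowed because a foreign key only checks values that are actually present.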

Summary

A table can have multiple Candidate Keys, each unique either as a single column or
as a combination of columns. They are all candidates for Primary Key.
Candidate keys that follow all the three rules - 1) Not Null, 2) Unique Value in Table
and 3) Static - are the best candidates for Primary Key. If there are multiple
candidate keys that satisfy the criteria for Primary Key, the decision should be
made by experienced DBAs who should keep performance in mind.

SQL SERVER – 2008 – Creating Primary Key, Foreign Key and Default Constraint
September 8, 2008 by pinaldave

Primary Key, Foreign Key and Default constraints are the 3 main constraints that need to
be considered while creating tables or even after that. It seems very easy to apply these
constraints, but there is still some confusion and there are problems when implementing
them. So I tried to write about how these constraints can be created or added at
different levels and in different ways.

Primary Key Constraint: A Primary Key constraint prevents duplicate values in the key
columns and provides a unique identifier for each row; by default it also creates a
clustered index on those columns (unless the table already has one or NONCLUSTERED is
specified).

1) Create Table Statement to create Primary Key

a. Column Level

USE AdventureWorks2008
GO
CREATE TABLE Products
(
    ProductID INT CONSTRAINT pk_products_pid PRIMARY KEY,
    ProductName VARCHAR(25)
);
GO

b. Table Level

CREATE TABLE Products
(
    ProductID INT,
    ProductName VARCHAR(25),
    CONSTRAINT pk_products_pid PRIMARY KEY(ProductID)
);
GO

2) Alter Table Statement to create Primary Key

-- ProductID must already be declared NOT NULL for this to succeed
ALTER TABLE Products
ADD CONSTRAINT pk_products_pid PRIMARY KEY(ProductID)
GO

3) Alter Statement to Drop Primary key

ALTER TABLE Products
DROP CONSTRAINT pk_products_pid;
GO

Foreign Key Constraint: When a FOREIGN KEY constraint is added to an existing column or
columns in the table, SQL Server by default checks the existing data in the columns to
ensure that all values, except NULL, exist in the column(s) of the referenced PRIMARY KEY
or UNIQUE constraint.

1) Create Table Statement to create Foreign Key

a. Column Level

USE AdventureWorks2008
GO
CREATE TABLE ProductSales
(
    SalesID INT CONSTRAINT pk_productSales_sid PRIMARY KEY,
    ProductID INT CONSTRAINT fk_productSales_pid FOREIGN KEY REFERENCES Products(ProductID),
    SalesPerson VARCHAR(25)
);
GO

b. Table Level

CREATE TABLE ProductSales
(
    SalesID INT,
    ProductID INT,
    SalesPerson VARCHAR(25),
    CONSTRAINT pk_productSales_sid PRIMARY KEY(SalesID),
    CONSTRAINT fk_productSales_pid FOREIGN KEY(ProductID) REFERENCES Products(ProductID)
);
GO

2) Alter Table Statement to create Foreign Key

ALTER TABLE ProductSales
ADD CONSTRAINT fk_productSales_pid FOREIGN KEY(ProductID) REFERENCES Products(ProductID)
GO

3) Alter Table Statement to Drop Foreign Key

ALTER TABLE ProductSales
DROP CONSTRAINT fk_productSales_pid;
GO

Default Constraint: A Default constraint created on a column supplies the value given
in the constraint whenever a row is inserted without a value for that column.
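
As a quick illustration of that behavior, here is a sketch using SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption made so it is runnable anywhere; the sample rows are invented). It also shows a detail that is easy to miss: the default applies only when the column is left out of the INSERT, not when NULL is supplied explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID      INTEGER PRIMARY KEY,
        CustomerName    TEXT,
        CustomerAddress TEXT DEFAULT 'UNKNOWN'
    )
""")

# Column omitted: the default value is stored.
conn.execute("INSERT INTO Customer (CustomerID, CustomerName) VALUES (1, 'Ann')")

# NULL supplied explicitly: NULL is stored, the default is not used.
conn.execute("INSERT INTO Customer VALUES (2, 'Bob', NULL)")

addresses = conn.execute(
    "SELECT CustomerID, CustomerAddress FROM Customer ORDER BY CustomerID"
).fetchall()
print(addresses)
```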

1) Create Table Statement to create Default Constraint

a. Column Level

USE AdventureWorks2008
GO
CREATE TABLE Customer
(
    CustomerID INT CONSTRAINT pk_customer_cid PRIMARY KEY,
    CustomerName VARCHAR(30),
    CustomerAddress VARCHAR(50) CONSTRAINT df_customer_Add DEFAULT 'UNKNOWN'
);
GO

b. Table Level : Not applicable for Default Constraint

2) Alter Table Statement to Add Default Constraint

ALTER TABLE Customer
ADD CONSTRAINT df_customer_Add DEFAULT 'UNKNOWN' FOR CustomerAddress
GO

3) Alter Table to Drop Default Constraint

ALTER TABLE Customer
DROP CONSTRAINT df_customer_Add
GO