
CHAPTER 4

■■■

The Normalization Process

The Atomic Age is here to stay—but are we?

—Bennett Cerf

Normalization is the process of taking the entities and attributes that have been discovered and
making them suitable for the relational database system. The process does this by removing
redundancies and shaping the data into the form that the relational engine is designed to work
with. Once you are done with the process, working with the data will be more natural using the
set-based SQL language.
In computer science terms, atomic means that the value cannot (or more reasonably should
not) be broken down into smaller parts. If you want to get technical, almost any value can be
broken down into smaller parts. For the database definition, consider atomic to refer to a value
that needn’t be broken down any further for use in the relational database. Our eventual goal will
be to break down the piles of data we have identified into values that are atomic, that is, broken
down to the lowest form that will need to be accessed in Transact-SQL (T-SQL) code.
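To make this concrete, here is a minimal sketch in T-SQL; the tables and columns (PersonNonAtomic,
PersonAtomic, and so on) are hypothetical, invented purely for illustration rather than taken from
this chapter's examples:

--A column that carries two facts forces string parsing in every query that needs just one of them
CREATE TABLE dbo.PersonNonAtomic
(
    PersonId int          NOT NULL PRIMARY KEY,
    FullName varchar(100) NOT NULL  --for example, 'Cerf, Bennett'
);

--The same data broken down to the level the T-SQL code will actually need to access
CREATE TABLE dbo.PersonAtomic
(
    PersonId  int         NOT NULL PRIMARY KEY,
    FirstName varchar(50) NOT NULL,
    LastName  varchar(50) NOT NULL
);

--With atomic columns, filtering on a single fact requires no parsing
SELECT PersonId, FirstName, LastName
FROM   dbo.PersonAtomic
WHERE  LastName = 'Cerf';

Whether FirstName and LastName are atomic enough depends on how the code will use them; if no query
will ever need the pieces of the name separately, the single column might well be considered atomic
for that design.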
Normalization often gets a bad rap, and there are many misconceptions about it. I should
refute a few things that normalization may seem like but clearly is not:

• Myth: It’s primarily a method to annoy functional programmers (though this does tend to be
a fun side effect, if you have the clout to get away with it).
• Myth: It’s a way to keep database professionals in a job.
• Myth: It’s a silver bullet to end world suffering or even global warming.

The process of normalization is based on a set of levels, each of which achieves a level of
correctness or adherence to a particular set of “rules.” The rules are formally known as forms, as in the
normal forms. There are quite a few normal forms that have been theorized and postulated, but I’ll
focus on the primary six that are commonly known. I’ll start with First Normal Form (1NF), which
eliminates data redundancy (such as a name being stored in two separate places), and continue
through to Fifth Normal Form (5NF), which deals with the decomposition of ternary relationships.
(One of the normal forms I’ll present isn’t numbered; it’s named for the people who devised it.) Each
level of normalization indicates an increasing degree of adherence to the recognized standards of
database design. As you increase the degree of normalization of your data, you’ll naturally tend to
create an increasing number of tables of decreasing width (fewer columns).


In this chapter, I’ll start out by addressing two fundamental questions:

• Why normalize?: I’ll take a detailed look at the numerous reasons why you should normalize
your data. The bottom line is that you should normalize to increase the efficiency of, and
protect the integrity of, your relational data.
• How far should you normalize?: This is always a contentious issue. Normalization tends to
optimize your database for efficient storage and updates, rather than querying. It dramatically
reduces the propensity for introducing update anomalies (different records displaying different
values for the same piece of data; a short sketch after this list illustrates the problem), but it
increases the complexity of your queries, because you might be forced to collect data from
many different tables.
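To illustrate the update anomaly mentioned in the second bullet, consider this small sketch; the
CustomerOrder table and its data are hypothetical, invented only to show the problem of a customer
name being repeated on every order row:

--Hypothetical denormalized table: the customer name is repeated on every order row
CREATE TABLE dbo.CustomerOrder
(
    OrderId      int          NOT NULL PRIMARY KEY,
    CustomerId   int          NOT NULL,
    CustomerName varchar(100) NOT NULL, --duplicated fact
    OrderDate    datetime     NOT NULL
);

INSERT INTO dbo.CustomerOrder (OrderId, CustomerId, CustomerName, OrderDate)
VALUES (1, 10, 'Bennett Cerf', '20080105');
INSERT INTO dbo.CustomerOrder (OrderId, CustomerId, CustomerName, OrderDate)
VALUES (2, 10, 'Bennett Cerf', '20080211');

--A change applied to only one of the rows leaves the same customer with two names
UPDATE dbo.CustomerOrder
SET    CustomerName = 'B. Cerf'
WHERE  OrderId = 1;

--The same CustomerId now returns two different names
SELECT DISTINCT CustomerId, CustomerName
FROM   dbo.CustomerOrder;

In a normalized design, the name would be stored once in a customer table, so the question of which
copy is correct could never arise.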

My answers to these questions will be followed by a look at each of the normal forms in turn,
explaining with clear examples the requirements of each one, the programming anomalies they
help you avoid, and the telltale signs that your relational data is flouting that particular normal
form. It might seem out of place to show programming anomalies at this point, since the first four
chapters of the book are specifically aligned to preprogramming design, but it helps show the
programming-minded reader how having data in a given normal form makes the tables easier to
work with in SQL. I'll then wrap up with an overview of some normalization best practices.

Why Normalize?
Before getting into the mechanics of the normalization process, I'll cover some of the things that
normalization, done correctly, will do for you. In the following sections, I'll discuss reasons that
might not be obvious, even after finishing the sections on how to normalize, such as the following:

• Eliminating duplicated data, which otherwise increases the chance that one copy won't match
another when you need it
• Avoiding unnecessary coding needed to keep duplicated data in sync
• Keeping tables thin, increasing the number of values that will fit on an 8K physical database
page (which will be discussed in more detail in Chapter 9) and decreasing the number of
reads that will be needed to read data from a table
• Maximizing the use of clustered indexes, allowing for more efficient data access and joins
• Lowering the number of indexes per table, because indexes are costly to maintain

Many of these are implementation issues or even pertain more to physical modeling (how data
is laid out on disk). Since this is a professional book, I’m assuming you have some knowledge of
such things. If not, later chapters of the book will give overviews of these issues and direct you to
additional reading on the subject.
It is also true that normalization has some negative effects on performance for some
operations. This fact is the basis of many arguments over how far to normalize. However, the
costs are nearly always outweighed by the benefit of ensuring that data is not corrupted by
operations that seem correct but aren't, because of poor design decisions.

Eliminating Duplicated Data


Any piece of data that occurs more than once in the database is an error waiting to happen. No
doubt you’ve been beaten by this once or twice in your life: your name is stored in multiple places,
then one version gets modified and the other doesn’t, and suddenly you have more than one name
where before there was just one.
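As a sketch of how this shows up in a design (the table names here are hypothetical), picture a
schema in which the name is copied onto a second table, and the query you eventually have to write
just to find the rows that no longer agree:

--Hypothetical design in which the same fact is stored in two tables
CREATE TABLE dbo.Customer
(
    CustomerId int         NOT NULL PRIMARY KEY,
    FirstName  varchar(50) NOT NULL,
    LastName   varchar(50) NOT NULL
);

CREATE TABLE dbo.ShippingLabel
(
    ShippingLabelId int         NOT NULL PRIMARY KEY,
    CustomerId      int         NOT NULL REFERENCES dbo.Customer (CustomerId),
    FirstName       varchar(50) NOT NULL, --copied from Customer
    LastName        varchar(50) NOT NULL  --copied from Customer
);

--Once the two copies drift apart, a query like this is needed just to find the damage
SELECT c.CustomerId, c.LastName AS customerLastName, s.LastName AS labelLastName
FROM   dbo.Customer AS c
         JOIN dbo.ShippingLabel AS s
             ON s.CustomerId = c.CustomerId
WHERE  c.FirstName <> s.FirstName
  OR   c.LastName <> s.LastName;

Fixing the mismatches once they exist is the hard part; storing the name in one place means the code
to keep copies in sync, and queries like the one above, never need to be written.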
