
Rules of Data Normalization

1NF Eliminate Repeating Groups - Make a separate table for each set of related
attributes, and give each table a primary key.
2NF Eliminate Redundant Data - If an attribute depends on only part of a multi-valued
key, remove it to a separate table.
3NF Eliminate Columns Not Dependent On Key - If attributes do not contribute to a
description of the key, remove them to a separate table.
BCNF Boyce-Codd Normal Form - If there are non-trivial dependencies between
candidate key attributes, separate them out into distinct tables.
4NF Isolate Independent Multiple Relationships - No table may contain two or more
1:n or n:m relationships that are not directly related.
5NF Isolate Semantically Related Multiple Relationships - There may be practical
constraints on information that justify separating logically related many-to-many
relationships.
ONF Optimal Normal Form - a model limited to only simple (elemental) facts, as
expressed in Object Role Modeling (ORM) notation.
DKNF Domain-Key Normal Form - a model free from all modification anomalies.
Important Note!
All normal forms are additive, in that if a model is in 3rd normal form, it is by
definition also in 2nd and 1st.

1. Eliminate Repeating Groups

In the original member list, each member name is followed by any databases that the
member has experience with. Some might know many, and others might not know any.
To answer the question, "Who knows DB2?" we need to perform an awkward scan of the
list looking for references to DB2. This is inefficient and an extremely untidy way to
store information.

Moving the known databases into a separate table helps a lot. Separating the repeating
groups of databases from the member information results in first normal form. The
MemberID in the database table matches the primary key in the member table, providing
a foreign key for relating the two tables with a join operation. Now we can answer the
question by looking in the database table for "DB2" and getting the list of members.
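
Below is a minimal sketch of this first-normal-form result as Python/SQLite DDL. Only
MemberID and DatabaseID come from the article; the remaining column names (Name,
DatabaseName, WhereLearned, SkillLevel) are illustrative assumptions rather than the
article's actual figures.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- One row per member; MemberID is the primary key.
        CREATE TABLE Member (
            MemberID INTEGER PRIMARY KEY,
            Name     TEXT NOT NULL
        );
        -- The repeating group of known databases, moved to its own table.
        -- MemberID is a foreign key back to Member.
        CREATE TABLE "Database" (
            MemberID     INTEGER NOT NULL REFERENCES Member(MemberID),
            DatabaseID   INTEGER NOT NULL,
            DatabaseName TEXT    NOT NULL,
            WhereLearned TEXT,
            SkillLevel   TEXT,
            PRIMARY KEY (MemberID, DatabaseID)
        );
    """)
    con.executemany("INSERT INTO Member VALUES (?, ?)", [(1, "Bill"), (2, "Mary")])
    con.executemany('INSERT INTO "Database" VALUES (?, ?, ?, ?, ?)',
                    [(1, 10, "DB2", "on the job", "expert"),
                     (2, 10, "DB2", "college", "novice")])

    # "Who knows DB2?" is now a simple join instead of a scan of a text list.
    print(con.execute("""
        SELECT m.Name FROM Member m
        JOIN "Database" d ON d.MemberID = m.MemberID
        WHERE d.DatabaseName = 'DB2'
    """).fetchall())
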
2. Eliminate Redundant Data

In the Database Table, the primary key is made up of the MemberID and the DatabaseID.
This makes sense for attributes like "Where Learned" and "Skill Level",
since they will be different for every member/database combination. But the database
name depends only on the DatabaseID. The same database name will appear redundantly
every time its associated ID appears in the Database Table.

Suppose you want to reclassify a database - give it a different DatabaseID. The change
has to be made for every member that lists that database! If you miss some, you'll have
several members with the same database under different IDs. This is an update anomaly.

Or suppose the last member listing a particular database leaves the group. His records
will be removed from the system, and the database will not be stored anywhere! This is a
delete anomaly. To avoid these problems, we need second normal form.

To achieve this, separate the attributes depending on both parts of the key from those
depending only on the DatabaseID. This results in two tables: "Database" which gives the
name for each DatabaseID, and "MemberDatabase" which lists the databases for each
member.

Now we can reclassify a database in a single operation: look up the DatabaseID in the
"Database" table and change its. name. The result will instantly be available throughout
the application.
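
A hedged sketch of that second-normal-form split follows, carrying over the assumed
column names from the earlier sketch: DatabaseName now lives only in the "Database"
table, while WhereLearned and SkillLevel stay with the full (MemberID, DatabaseID) key.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Attributes that depend only on DatabaseID are stored once, here.
        CREATE TABLE "Database" (
            DatabaseID   INTEGER PRIMARY KEY,
            DatabaseName TEXT NOT NULL
        );
        -- Attributes that depend on the whole (MemberID, DatabaseID) key.
        CREATE TABLE MemberDatabase (
            MemberID     INTEGER NOT NULL,
            DatabaseID   INTEGER NOT NULL REFERENCES "Database"(DatabaseID),
            WhereLearned TEXT,
            SkillLevel   TEXT,
            PRIMARY KEY (MemberID, DatabaseID)
        );
    """)
    con.execute('INSERT INTO "Database" VALUES (10, ?)', ("DB2",))
    con.executemany("INSERT INTO MemberDatabase VALUES (?, ?, ?, ?)",
                    [(1, 10, "on the job", "expert"),
                     (2, 10, "college", "novice")])

    # Reclassifying or renaming a database is now a single-row update, instantly
    # visible to every member row that references DatabaseID 10.
    con.execute('UPDATE "Database" SET DatabaseName = ? WHERE DatabaseID = ?',
                ("DB2 UDB", 10))
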
3. Eliminate Columns Not Dependent On Key

The Member table satisfies first normal form - it contains no repeating groups. It satisfies
second normal form, since it doesn't have a multi-valued (composite) key. But the key is MemberID,
and the company name and location describe only a company, not a member. To achieve
third normal form, they must be moved into a separate table. Since they describe a
company, CompanyCode becomes the key of the new "Company" table.

The motivation for this is the same as for second normal form: we want to avoid update and
delete anomalies. For example, suppose no members from IBM were currently stored
in the database. With the previous design, there would be no record of IBM's existence, even
though 20 past members were from IBM!
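
A minimal sketch of the third-normal-form result is shown below; CompanyName and
CompanyLocation are assumed names for the article's "company name and location"
attributes, and the sample row is purely illustrative.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Company facts depend on CompanyCode, not on any particular member.
        CREATE TABLE Company (
            CompanyCode     TEXT PRIMARY KEY,
            CompanyName     TEXT NOT NULL,
            CompanyLocation TEXT
        );
        CREATE TABLE Member (
            MemberID    INTEGER PRIMARY KEY,
            Name        TEXT NOT NULL,
            CompanyCode TEXT REFERENCES Company(CompanyCode)
        );
    """)
    # IBM remains on record even while no current member works there, so
    # deleting the last IBM member no longer erases the company itself.
    con.execute("INSERT INTO Company VALUES ('IBM', 'IBM Corporation', 'Armonk, NY')")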

BCNF. Boyce-Codd Normal Form

Boyce-Codd Normal Form states mathematically that:


A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then
X is a candidate key for R.
BCNF covers the specific situations that 3NF misses: dependencies among attributes that
belong to candidate keys. Typically, any relation that is in 3NF is also in BCNF. However,
a 3NF relation may fail to be in BCNF when (a) there are multiple candidate keys, (b) those
keys are composite (made up of more than one attribute), and (c) the keys share common
attributes.

Basically, a humorous way to remember BCNF is that all functional dependencies are:
"The key, the whole key, and nothing but the key, so help me Codd."

4. Isolate Independent Multiple Relationships

This applies primarily to key-only associative tables, and appears as a ternary
relationship that has incorrectly merged two distinct, independent relationships.

The way this situation starts is with a business request like the one shown below. This could
be any two M:M relationships from a single entity. For instance, a member could know
many software tools, and a software tool may be used by many members. Also, a member
could have recommended many books, and a book could be recommended by many
members.

Initial business request

To reach 4th normal form, the two M:M relationships should be resolved separately. But if
we were to combine them into a single associative table, it might look right at first (it is in
3rd normal form). This combined table is shown below, and it violates 4th normal form.
Incorrect solution

To get a picture of what is wrong, look at some sample data, shown below. The first few
records look right: Bill knows ERWin and recommends the ERWin Bible for
everyone to read. But something is wrong with Mary and Steve. Mary didn't recommend
a book, and Steve doesn't know any software tools. Our solution has forced us to do
strange things like create dummy records in both Book and Software to allow the record
in the association, since it is a key-only table.

Sample data from incorrect solution

The correct solution, to cause the model to be in 4th normal form, is to ensure that all
M:M relationships are resolved independently if they are indeed independent, as shown
below.

Correct 4th normal form
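
As a sketch of the correct resolution shown above, the SQLite schema below keeps the two
relationships in separate association tables. The names MemberSoftware and MemberBook
are assumptions; the article's figure may label them differently.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Member   (MemberID   INTEGER PRIMARY KEY, Name  TEXT);
        CREATE TABLE Software (SoftwareID INTEGER PRIMARY KEY, Title TEXT);
        CREATE TABLE Book     (BookID     INTEGER PRIMARY KEY, Title TEXT);

        -- The two independent M:M relationships, resolved independently.
        CREATE TABLE MemberSoftware (
            MemberID   INTEGER REFERENCES Member(MemberID),
            SoftwareID INTEGER REFERENCES Software(SoftwareID),
            PRIMARY KEY (MemberID, SoftwareID)
        );
        CREATE TABLE MemberBook (
            MemberID INTEGER REFERENCES Member(MemberID),
            BookID   INTEGER REFERENCES Book(BookID),
            PRIMARY KEY (MemberID, BookID)
        );
    """)
    con.executemany("INSERT INTO Member VALUES (?, ?)",
                    [(1, "Bill"), (2, "Mary"), (3, "Steve")])
    con.execute("INSERT INTO Software VALUES (1, 'ERWin')")
    con.execute("INSERT INTO Book VALUES (1, 'ERWin Bible')")
    # Mary can know a tool without recommending a book, and Steve can recommend
    # a book without knowing a tool -- no dummy rows are needed anywhere.
    con.executemany("INSERT INTO MemberSoftware VALUES (?, ?)", [(1, 1), (2, 1)])
    con.executemany("INSERT INTO MemberBook VALUES (?, ?)", [(1, 1), (3, 1)])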

NOTE! This is not to say that ALL ternary associations are invalid. The above situation
made it obvious that Books and Software were independently linked to Members. If,
however, there were distinct links between all three, such that we would be stating that
"Bill recommends the ERWin Bible as a reference for ERWin", then separating the
relationship into two separate associations would be incorrect. In that case, we would lose
the distinct information about the 3-way relationship.

5. Isolate Semantically Related Multiple Relationships

OK, now let's modify the original business diagram and add a link between the books and
the software tools, indicating which books deal with which software tools, as shown
below.

Initial business request

This makes sense after the discussion on Rule 4, and again we may be tempted to resolve
the multiple M:M relationships into a single association, which would now violate 5th
normal form. The ternary association looks identical to the one shown in the 4th normal
form example, and is also going to have trouble displaying the information correctly.
This time we would have even more trouble because we can't show the relationships
between books and software unless we have a member to link to, or we have to add our
favorite dummy member record to allow the record in the association table.

Incorrect solution

The solution, as before, is to ensure that all M:M relationships that are independent are
resolved independently, resulting in the model shown below. Now information about
members and books, members and software, and books and software is all stored
independently, even though they are all very much semantically related. It is very
tempting in many situations to combine the multiple M:M relationships because they are
so similar. Within complex business discussions, the lines can become blurred and the
correct solution not so obvious.

Correct 5th normal form
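
A compact sketch of the 5th-normal-form result, with assumed table names: each pairwise
many-to-many relationship gets its own key-only association table, so a book/software
link needs no member row at all.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE MemberSoftware (MemberID INTEGER, SoftwareID INTEGER,
                                     PRIMARY KEY (MemberID, SoftwareID));
        CREATE TABLE MemberBook     (MemberID INTEGER, BookID     INTEGER,
                                     PRIMARY KEY (MemberID, BookID));
        CREATE TABLE BookSoftware   (BookID   INTEGER, SoftwareID INTEGER,
                                     PRIMARY KEY (BookID, SoftwareID));
    """)
    # "The ERWin Bible (BookID 1) covers ERWin (SoftwareID 1)" is recorded
    # without any member involved -- no dummy member record required.
    con.execute("INSERT INTO BookSoftware VALUES (1, 1)")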

6. Optimal Normal Form

At this point, we have done all we can with Entity-Relationship Diagrams (ERD). Most
people will stop here because this is usually pretty good. However, another modeling
style called Object Role Modeling (ORM) can display relationships that cannot be
expressed in ERD, so there are normal forms beyond 5th. Optimal Normal Form (ONF)
is defined as a model limited to only simple (elemental) facts, as expressed in ORM.

7. Domain-Key Normal Form

This level of normalization is simply a model taken to the point where there are no
opportunities for modification anomalies.

 "if every constraint on the relation is a logical consequence of the definition of keys
and domains"
 Constraint "a rule governing static values of attributes"
 Key "unique identifier of a tuple"
 Domain "description of an attribute’s allowed values"

1. A relation in DK/NF has no modification anomalies, and conversely.
2. DK/NF is the ultimate normal form; there is no higher normal form related to
modification anomalies.
3. Definition: A relation is in DK/NF if every constraint on the relation is a logical
consequence of the definition of keys and domains.
4. A constraint is any rule governing static values of attributes that is precise enough
that it can be ascertained whether or not it holds.
5. Examples: edit rules, intra-relation and inter-relation constraints, functional and
multi-valued dependencies.
6. This does not include constraints on changes in data values or time-dependent
constraints.
7. Key: the unique identifier of a tuple.
8. Domain: a physical and logical description of an attribute's allowed values.
9. The physical description is the format of an attribute.
10. The logical description is a further restriction on the values the attribute is
allowed to take.
11. Logical consequence: find a constraint on keys and/or domains which, if it is
enforced, means that the desired constraint is also enforced.
12. Bottom line on DK/NF: if every table has a single theme, then all functional
dependencies will be logical consequences of keys. All data value constraints can
then be expressed as domain constraints.
13. Practical consequence: since keys are enforced by the DBMS and domains are
enforced by edit checks on data input, all modification anomalies can be avoided
by just these two simple measures.
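
A minimal sketch of that practical consequence in SQLite: the only declared constraints
are a key and a domain check. The table, columns, and allowed values below are
illustrative assumptions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE MemberSkill (
            MemberID   INTEGER,
            DatabaseID INTEGER,
            -- domain constraint: SkillLevel must come from a fixed set of values
            SkillLevel TEXT CHECK (SkillLevel IN ('novice', 'intermediate', 'expert')),
            -- key constraint: one row per member/database combination
            PRIMARY KEY (MemberID, DatabaseID)
        )
    """)
    # Any row that satisfies the key and the domain is legal; no additional
    # business rule hides in the table, so no modification anomalies arise.
    con.execute("INSERT INTO MemberSkill VALUES (1, 10, 'expert')")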



Last Updated: June 13, 2005 at 10:12

******************

Pattern: Denormalization

Abstract

Denormalization is a technique to move from higher to lower normal forms of
database modeling in order to speed up database access. You may apply
Denormalization in the process of deriving a physical data model from a logical
form.
Context

The analysis you have started, concentrating on important views only, shows that
an Order View is one of the most heavily used views in the database. Reading an
Order and the OrderItems costs a join statement and several physical page
accesses. Writing it costs several update statements. Further statistical analysis,
again concentrating on the important parts of orders, yields that 90 percent of
Orders have no more than 5 positions.

Problem

How can you manage to read and write most physical views with a single
database page access when you have a parent/child (Order/OrderItem) relation?

Forces

• Time vs. space: Database optimization is mostly a question of time versus
space tradeoffs. Normalized logical data models are optimized for minimum
redundancy and avoidance of update anomalies; they are not optimized for
minimum access time, since time does not play a role in the normalization
process. A 3NF or higher normalized data model can be accessed with
minimally complex code if the domain reflects the relational calculus and
the logical data model based on it. Normalized data models are usually
easier to understand than data models that reflect considerations of
physical optimizations.

Solution

Fill up the Order entity database page with OrderItems until you reach the next
physical page limit. Physical pages are usually chunks of 2 or 4K depending on
the database implementation.

The number of embedded OrderItems should be at or above the number that covers
the order size in 80-95% of cases.
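
A sketch of one possible physical layout follows, assuming (per the Context above) that
five embedded item slots cover roughly 90 percent of Orders; every table and column name
here is illustrative, not taken from the pattern.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Order header plus the first five OrderItems in the same row, so the
        -- common case is served by a single physical page access.
        CREATE TABLE OrderWithItems (
            OrderID    INTEGER PRIMARY KEY,
            CustomerID INTEGER NOT NULL,
            Item1Product TEXT, Item1Qty INTEGER,
            Item2Product TEXT, Item2Qty INTEGER,
            Item3Product TEXT, Item3Qty INTEGER,
            Item4Product TEXT, Item4Qty INTEGER,
            Item5Product TEXT, Item5Qty INTEGER
        );
        -- Overflow rows for the minority of Orders with more than five items
        -- (see the OverflowTable pattern under Related Patterns).
        CREATE TABLE OrderItemOverflow (
            OrderID  INTEGER REFERENCES OrderWithItems(OrderID),
            Position INTEGER,
            Product  TEXT,
            Qty      INTEGER,
            PRIMARY KEY (OrderID, Position)
        );
    """)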

Consequences

• Time: You will now be able to access most Orders with a single database
access.
• Space: You need more disk space in the database, depending on alignment
rules. If you are lucky and the database system begins a new page with
every new Order, the extra space is effectively free. As you have to deal
with the 5-20% or so of Orders that do not fit into the combined
Order/OrderItem record, the code will become more complex.
• Queries: The database will also become harder to use for ad hoc queries,
as the physical structure no longer exactly reflects the logical data
model.

Related Patterns
Use an OverflowTable to deal with the other 5-20% of Orders whose OrderItems
do not fit into the combined record. Use PhysicalViews to encapsulate the code
that is needed to handle the cases of page overflow.

Other Online Articles

• DENORMALIZATION GUIDELINES by Craig S. Mullins (The Data Administration
Newsletter)
• Responsible Denormalization: How to break the rules and get away with it
(SQL Magazine)

The Data Administration Newsletter (TDAN.com)


Robert S. Seiner - Publisher

THE DENORMALIZATION SURVIVAL GUIDE - PART 1


Steve Hoberman - Steve Hoberman & Associates, LLC

This is the first of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve
Hoberman's book, the Data Modeler's Workbench (Order this book through Amazon.com Today!). This
first article focuses on the dangers of denormalization and introduces the Denormalization Survival Guide,
which is a question-and-answer approach to applying denormalization to our data models. The second
article will discuss the questions and answers that comprise this guide when modeling a data mart.

The Dangers of Denormalization


In her book entitled, Handbook of Relational Database Design, Barbara von Halle has a great definition for
denormalization: “Denormalization is the process whereby, after defining a stable, fully normalized data
structure, you selectively introduce duplicate data to facilitate specific performance requirements.”
Denormalization is the process of combining data elements from different entities. By doing so, we lose the
benefits of normalization and, therefore, reintroduce redundancy into the design. This extra redundancy
can help improve retrieval time performance. Reducing retrieval time is the primary reason for
denormalizing.

My favorite word in von Halle's definition is “selectively”. We have to be very careful and selective where
we introduce denormalization because it can come with a huge price even though it can decrease retrieval
time. The price for denormalizing can take the form of these bleak situations:

• Update, delete, and insert performance can suffer. When we repeat a data element in two or
more tables, we can usually retrieve the values within this data element much more quickly.
However, if we have to change the value in this data element, we need to change it in every table
where it resides. If Bob Jones appears in five different tables, and Bob would prefer to be called
“Robert”, we will need to change “Bob Jones” to “Robert Jones” in all five tables, which takes
longer than making this change to just one table.
• Sometimes even read performance can suffer. We denormalize to increase read or retrieval
performance. Yet if too many data elements are denormalized into a single entity, each record
length can get very large and there is the potential that a single record might span a database
block size, which is the length of contiguous memory defined within the database. If a record is
longer than a block size, it could mean that retrieval time will take much longer because now some
of the information the user requests will be in one block, and the rest of the information could be in
a different part of the disk, taking significantly more time to retrieve. A Shipment entity I’ve
encountered recently suffered from this problem.
• You may end up with too much redundant data. Let's say the CUSTOMER LAST NAME data
element takes up 30 characters. Repeating this data element three times means we are now using
90 instead of 30 characters. In a table with a small number of records, or with duplicate data
elements with a fairly short length, this extra storage space will not be substantial. However, in
tables with millions of rows, every character could require megabytes of additional space.
• It may mask lack of understanding. The performance and storage implications of denormalizing
are very database- and technology-specific. Not fully understanding the data elements within a
design, however, is more of a functional and business concern, with potentially much worse
consequences. We should never denormalize without first normalizing. When we normalize, we
increase our understanding of the relationships between the data elements. We need this
understanding in order to know where to denormalize. If we just go straight to a denormalized
design, we could make very poor design decisions that could require complete system rewrites
soon after going into production. I once reviewed the design for an online phone directory, where
all of the data elements for the entire design were denormalized into a single table. On the surface,
the table looked like it was properly analyzed and contained a fairly accurate primary key.
However, I started grilling the designer with specific questions about his online phone directory
design:

“What if an employee has two home phone numbers?”

“How can we store more than one email address for the same employee?”

“Can two employees share the same work phone number?”

After receiving a blank stare from the designer, I realized that denormalization was applied before
fully normalizing, and therefore, there was a significant lack of understanding of the relationships
between the data elements.

• It might introduce data quality problems. By having the same data element multiple times in our
design, we substantially increase opportunities for data quality issues. If we update Bob's first
name from Bob to Robert in 4 out of 5 of the places his name occurred, we have potentially
created a data quality issue.

Being aware of these potential dangers of denormalization encourages us to make denormalization
decisions very selectively. We need to have a full understanding of the pros and cons of each opportunity
we have to denormalize. This is where the Denormalization Survival Guide becomes a very important tool.
The Denormalization Survival Guide will help us make the right denormalization decisions, so that our
designs can survive the test of time and minimize the chances of these bleak situations occurring.

What Is the Denormalization Survival Guide?


The Denormalization Survival Guide is a question-and-answer approach to determining where to
denormalize your logical data model. The Survival Guide contains a series of questions in several different
categories that need to be asked for each relationship on our model. There are point values associated
with answers to each question. By adding up these points, we can determine whether to denormalize each
specific relationship. If our score is 10 or more after summing the individual scores, we will denormalize the
relationship. If our score is less than 10, we will keep the relationship normalized and intact. When you are
done asking these questions for each relationship, you will have a physical data model at the appropriate
level of denormalization.
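
A sketch of the scoring mechanics only: the real questions and point values appear in the
second article, so the question names and numbers below are purely hypothetical
placeholders.

    DENORMALIZE_THRESHOLD = 10

    def decide(relationship, answer_points):
        """Sum the points awarded for one relationship and apply the threshold."""
        score = sum(answer_points.values())
        return score, ("denormalize" if score >= DENORMALIZE_THRESHOLD
                       else "keep normalized")

    # Hypothetical answers for a Customer-Account relationship.
    answers = {"type of relationship": 5,
               "participation ratio": 4,
               "parent entity size": 2,
               "rate of change comparison": -1}
    print(decide("Customer-Account", answers))   # (10, 'denormalize')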

There are two main purposes to using the Denormalization Survival Guide:

• Maximizing the benefits of denormalization and minimizing the costs. Whenever we
denormalize, we potentially gain something but also potentially lose something. The purpose of the
Denormalization Survival Guide is to make sure we maximize our gains while minimizing our losses.
Therefore, if using the guide dictates that we should denormalize Account into the Customer entity,
it will be because the benefits of denormalization outweigh the costs.
• Providing consistent criteria for denormalization. We are applying the same set of questions
and point values to each relationship. Therefore, there is a consistency that we will use in deciding
when to denormalize. If two relationships have similar answers, either both will be denormalized or
both will remain normalized. This consistency leads to greater reuse and understanding. In
addition, because we will start to see the same structures in multiple places, we will more easily be
able to learn and understand new structures. For example, if Customer and Account are
denormalized into a single entity, we will be able to understand this denormalized Customer
Account entity in other places it occurs.

Summary
This first article focused on the dangers of denormalization and introduced the Denormalization Survival
Guide. The next article in this series will discuss the questions and answers that comprise the guide,
including these questions when determining whether to denormalize a relationship in a data mart:

• What type of relationship?
• What is the participation ratio?
• How many data elements are in the parent entity?
• What is the usage ratio?
• Is the parent entity a placeholder?
• What is the rate of change comparison?
Steve Hoberman is an expert in the fields of data modeling and data
warehousing. He is currently the Lead Data Warehouse Developer for
Mars, Inc. He has been data modeling since 1990 across industries as
diverse as telecommunications, finance, and manufacturing. Steve
speaks regularly for the Data Warehousing Institute. He is the author of
The Data Modeler's Workbench: Tools and Techniques for Analysis
and Design. Order this book through Amazon.com Today!

Steve specializes in data modeling training, design strategy, and in
creating techniques to improve the data modeling process and
deliverables. He enjoys reviewing data models and is the founder of
Design Challenges, a discussion group which tackles complex data
modeling scenarios. To learn more about his data model reviews and to
add your email address to the Design Challenge distribution list,
please visit his web site at www.stevehoberman.com. He can be
reached at me@stevehoberman.com or 973-586-8765.


