You are on page 1of 21

Syllabus Focus: Unit 2 Module 1 Content 9

Specific Objective 9: Explain the concept of normalization;

Content: Definition of normalisation; attribute redundancy and anomalies; normal forms: including First
Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF); keys: primary, Foreign
and Composite (or compound or concatenated); Partial and Non-key Dependencies; relationships, use of
entity-relationship diagrams (ERD).

explain the concept of Definition of normalisation; attribute redundancy and


normalization; anomalies; normal forms: including first normal form
(1NF), second normal form (2NF), third normal form
(3NF); keys: primary, foreign and composite (or
compound or concatenated); partial and non-key
dependencies; relationships, use of entity-relationship
diagrams (ERD).

Database Normalisation, sometimes referred to as Canonical Synthesis, is a technique for designing


Relational Database tables to minimize duplication of information and, in so doing, to safeguard the
Database against certain types of logical or structural problems, namely Data Anomalies. For example,
when multiple instances of a given piece of information occur in a table, the possibility exists that these
instances will not be kept consistent when the data within the table is updated, leading to a loss of data
integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its
structure reflects the basic assumptions for when multiple instances of the same information should be
represented by a single instance only.

Higher degrees of normalization typically involve more tables and create the need for a larger number of
joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in
database applications involving many isolated transactions (e.g. an automated teller machine), while
less normalized tables tend to be used in database applications that need to map complex relationships
between data entities and data attributes (e.g. a reporting application, or a full-text search application).

Database theory describes a table's degree of normalization in terms of normal forms of successively
higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second
normal form (2NF) as well; but the reverse is not necessarily the case.

Problems addressed by Normalisation

Although the Normal Forms are often defined informally in terms of the characteristics of tables, rigorous
definitions of the normal forms are concerned with the characteristics of mathematical constructs known
as relations. Whenever information is represented relationally, it is meaningful to consider the extent to
which the representation is normalised.

CAPE NOTES Unit 2 Module 1 Content 9 1


A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and
from anomalies involving data operations. In such a table:

 The same information can be


expressed on multiple records;
therefore updates to the table
may result in logical
inconsistencies. For example,
each record in an "Employees'
Skills" table might contain an
Employee ID, Employee
Address, and Skill; thus a
change of address for a particular employee will potentially need to be applied to multiple records
(one for each of his skills). If the update is not carried through successfully—if, that is, the
employee's address is updated on some records but not others—then the table is left in an
inconsistent state. Specifically, the table provides conflicting answers to the question of what this
particular employee's address is. This phenomenon is known as an Update Anomaly.
 There are circumstances in which certain facts cannot
be recorded at all. For example, each record in a
"Faculty and Their Courses" table might contain a
Faculty ID, Faculty Name, Faculty Hire Date, and
Course Code—thus we can record the details of any
faculty member who teaches at least one course, but
we cannot record the details of a newly-hired faculty
member who has not yet been assigned to teach any
courses. This phenomenon is known as an Insertion
Anomaly.
 There are circumstances in which the deletion of data
representing certain facts necessitates the deletion of
data representing completely different facts. The
"Faculty and Their Courses" table described in the
previous example suffers from this type of anomaly,
for if a faculty member temporarily ceases to be
assigned to any courses, we must delete the last of the
records on which that faculty member appears. This
phenomenon is known as a Deletion Anomaly.

Ideally, a relational database table should be designed in such


a way as to exclude the possibility of Update, Insertion, and
Deletion Anomalies. The Normal Forms of Relational
Database theory provides guidelines for deciding whether a particular design will be vulnerable to such
anomalies. It is possible to correct an unnormalised design so as to make it adhere to the demands of
the Normal Forms: this is called normalization. Removal of redundancies of the tables will lead to
several tables, with Referential Integrity restrictions between them.

CAPE NOTES Unit 2 Module 1 Content 9 2


Normalisation typically involves decomposing an unnormalized table into two or more tables that
were they are to be combined (joined), would convey exactly the same information as the original
table.

Referential Integrity in a Relational Database is consistency


between coupled tables. Referential Integrity is usually
enforced by the combination of a primary key or candidate
key (alternate key) and a foreign key. For Referential
Integrity to hold, any field in a table that is declared a foreign
key can contain only values from a parent table's primary key or
a candidate key. For instance, deleting a record that contains a
value referred to by a foreign key in another table would break
referential integrity. The Relational Database
Management System (RDBMS) enforces referential
integrity, normally either by deleting the foreign key rows as
well to maintain integrity, or by returning an error and not
performing the delete. Which method is used would be
determined by the referential integrity constraint, as defined in
the Data Dictionary.

Background to Normalisation

Functional Dependency: Attribute B has a functional


dependency on attribute A i.e. A → B if, for each value of
attribute A, there is exactly one value of attribute B. If value of A is repeating in tuples then value of B
will also repeat. In our example, Employee Address has a functional dependency on Employee ID,
because a particular Employee ID value corresponds to one and only one Employee Address value. (Note
that the reverse need not be true: several employees could live at the same address and therefore one
Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not
functionally dependent on Employee Address.)
An attribute may be functionally dependent
either on a single attribute or on a combination
of attributes. It is not possible to determine
the extent to which a design is normalized
without understanding what functional
dependencies apply to the attributes within
its tables; understanding this, in turn,
requires knowledge of the problem domain.
For example, an Employer may require certain
employees to split their time between two locations, such as Kingston and Spanish Town, and therefore
want to allow Employees to have more than one Employee Address. In this case, Employee Address
would no longer be functionally dependent on Employee ID.

 Primary key: Most DBMSs require a table to be defined as having a single unique key, rather
than a number of possible unique keys. A primary key is a key which the database designer has
designated for this purpose.

CAPE NOTES Unit 2 Module 1 Content 9 3


 Super key: A super key is an attribute or set of attributes that uniquely identifies rows within a
table; in other words, two distinct rows are always guaranteed to have distinct super keys.
{Employee ID, Employee Address, Skill} would be a super key for the "Employees' Skills" table;
{Employee ID, Skill} would also be a super key.
 Candidate key: A candidate key is a minimal super key, that is, a super key for which we can
say that no proper subset of it is also a super key. {Employee Id, Skill} would be a candidate key
for the "Employees' Skills" table.
 Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate
key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.

http://en.wikipedia.org/wiki/Database_normalization

NORMAL FORMS

The Normal Forms (abbrev. NF) of Relational Database theory provide criteria for determining a table's
degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to
a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form"
(HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower
than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its
HNF.

The Normal Forms are applicable to individual tables; to say that an entire database is in Normal Form n
is to say that all of its tables are in Normal Form n.

First Normal Form

First normal form (1NF or Minimal Form) is a normal form used in database normalization. A
relational database table that adheres to 1NF is one that meets a certain minimum set of criteria. These
criteria are basically concerned with ensuring that the table is a faithful representation of a relation and
that it is free of repeating groups.

The concept of a "repeating group" is, however, understood in different ways by different theorists. As a
consequence, there is not universal agreement as to which features would disqualify a table from being in
1NF. Most notably, 1NF as defined by some authors (for example, Ramez Elmasri and Shamkant B.
Navathe, following the precedent established by Edgar F. Codd) excludes relation-valued attributes
(tables within tables); whereas 1NF as defined by other authors (for example, Chris Date) permits them.

According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to some
relation", which means, specifically, that it satisfies the following five conditions:

1. There's no top-to-bottom ordering to the rows.


2. There's no left-to-right ordering to the columns.
3. There are no duplicate rows.
4. Every row-and-column intersection contains exactly one value from the applicable domain (and
nothing else).
5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden
timestamps].

CAPE NOTES Unit 2 Module 1 Content 9 4


Violation of any of these conditions would mean that the table is not strictly relational, and therefore that
it is not in 1NF.

Examples of tables (or views) that would not meet this definition of 1NF are:

 A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in
violation of condition 3.
 A table whose definition mandates that results be returned in a particular order, so that the row-
ordering is an intrinsic and meaningful aspect of the view. This violates condition 1. The
tuples in true relations are not ordered with respect to each other.
 A table with at least one null able attribute. A nullable attribute would be in violation of
condition 4, which requires every field to contain exactly one value from its column's domain. It
should be noted, however, that this aspect of condition 4 is controversial. It marks an important
departure from Codd's original vision of the relational model, which made explicit provision
for nulls.

Date's fourth condition, which expresses "what most people think of as the defining feature of 1NF", is
concerned with repeating groups. The following example illustrates how a database design might
incorporate repeating groups, in violation of 1NF.

Example 1: Domains and values

Suppose novice designer wishes to CUSTOMER


record the names and telephone Customer ID First Name Surname Telephone Number
numbers of customers. He defines a 123 Robert Ingram 555-861-2025
customer table which looks like this: 456 Jane Wright 555-403-1659
789 Maria Fernandez 555-808-9633
The designer then becomes aware of
a requirement to record multiple
telephone numbers for some CUSTOMER
customers. He reasons that the Customer ID First Name Surname Telephone Number
simplest way of doing this is to allow 123 Robert Ingram 555-861-2025
the "Telephone Number" field in any 456 Jane Wright 555-403-1659
given record to contain more than 555-776-4100
one value: 789 Maria Fernandez 555-808-9633

Assuming, however, that the Telephone Number column is defined on some Telephone Number-like
domain (e.g. the domain of strings 12 characters in length), the representation above is not in 1NF. 1NF
(and, for that matter, the RDBMS) prohibits a field from containing more than one value from its
column's domain.

CAPE NOTES Unit 2 Module 1 Content 9 5


Example 2: Repeating groups across columns

The designer might attempt to get around this restriction by defining multiple Telephone Number
columns:

CUSTOMER
Customer ID First Name Surname Tel. No. 1 Tel. No. 2 Tel. No. 3
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659 555-776-4100
789 Maria Fernandez 555-808-9633

This representation, however, makes use of null able columns, and therefore does not conform to Date's
definition of 1NF. Even if the table is taken that null able columns are allowed, the design is not in
keeping with the spirit of 1NF. Tel. No. 1, Tel. No. 2., and Tel. No. 3. share exactly the same domain and
exactly the same meaning; the splitting of Telephone Number into three headings is artificial and causes
logical problems. These problems include:

 Difficulty in querying the table. Answering such questions as "Which customers have telephone
number X?" and "Which pairs of customers share a telephone number?" is awkward.
 Inability to enforce uniqueness of Customer-to-Telephone Number links through the RDBMS.
Customer 789 might mistakenly be given a Tel. No. 2 value that is exactly the same as her Tel. No. 1
value.
 Restriction of the number of telephone numbers per customer to three. If a customer with four
telephone numbers comes along, we are constrained to record only three and leave the fourth
unrecorded. This means that the database design is imposing constraints on the business process,
rather than (as should ideally be the case) vice-versa.

Example 3: Repeating groups within columns

The designer might, alternatively, retain the single Telephone Number column but alter its domain,
making it a string of sufficient length to accommodate multiple telephone numbers:

CUSTOMER
Customer ID First Name Surname Telephone Number
123 Robert Ingram 555-861-2025
456 Jane Wright 555-403-1659, 555-776-4100
789 Maria Fernandez 555-808-9633

This design is not consistent with 1NF, and presents several design issues. The Telephone Number
heading becomes semantically woolly, as it can now represent either a telephone number, a list of
telephone numbers, or indeed anything at all. A query such as "Which pairs of customers share a
telephone number?" is more difficult to formulate, given the necessity to cater for lists of telephone
numbers as well as individual telephone numbers. Meaningful constraints on telephone numbers are also
very difficult to define in the RDBMS with this design.

CAPE NOTES Unit 2 Module 1 Content 9 6


A design that complies with 1NF

A design that is unambiguously in 1NF makes use of two tables: a Customer Name table and a Customer
Telephone Number table.

CUSTOMER NAME CUSTOMER TELEPHONE NUMBER


Customer First Customer ID Telephone Number
ID Name123 555-861-2025
123 Robert
456 555-403-1659
456 Jane 456 555-776-4100
789 Maria789 555-808-9633

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone
Number link appears on its own record.

Newcomers to database design sometimes suppose that normalisation proceeds in an iterative fashion, i.e.
a 1NF design are first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of
how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt;
furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher"
normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the
designer, because 3NF tables usually need no modification to meet the requirements of these higher
normal forms.

Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These Normal
Forms have been summarized as requiring that all non-key attributes be dependent on "the key, the whole
key and nothing but the key".

A table is in first normal form (1NF) if and only if it represents a relation. Given that database tables
embody a relation-like form, the defining characteristic of one in First Normal Form is that it does not
allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents
duplicate rows) and without any null able columns is in 1NF.

By redefining 1NF to exclude null able columns in 1NF, no level of normalization can ever be achieved
unless all null able columns are completely eliminated from the entire database.

One requirement of a relation is that every table contains exactly one value for each attribute. This is
sometimes expressed as "no repeating groups". While that statement itself is axiomatic, experts disagree
about what qualifies as a "repeating group", in particular whether a value may be a relation value; thus the
precise definition of 1NF is the subject of some controversy. Notwithstanding, this theoretical uncertainty
applies to relations, not tables. Table manifestations are intrinsically free of variable repeating groups
because they are structurally constrained to the same number of columns in all rows.

Put at its simplest; when applying 1NF to a database, every record must be the same length. This means
that each record has the same number of fields, and none of them contain a null value.

CAPE NOTES Unit 2 Module 1 Content 9 7


Second Normal Form

The criteria for second normal form (2NF) are:

 The table must be in 1NF.


 None of the non-prime attributes of the table are functionally dependent on a part (proper subset) of a
candidate key; in other words, all functional dependencies of non-prime attributes on candidate keys
are full functional dependencies. For example, consider an "Employees' Skills" table whose attributes
are Employee ID, Employee Name, and Skill; and suppose that the combination of Employee ID and
Skill uniquely identifies records within the table. Given that Employee Name depends on only one of
those attributes – namely, Employee ID – the table is not in 2NF.
 In simple terms, a table is 2NF if it is in 1NF and all fields are dependent on the whole of the primary
key, or a relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each
candidate key of the relation.
 Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key consists of
just one attribute – then we can say immediately that the table is in 2NF.
 All columns must be a fact about the entire key, and not a subset of the key.

Consider a table describing employees' skills:

EMPLOYEES' SKILLS
Employee Skill Current Work Location
Jones Typing 114 Main Street
Jones Shorthand 114 Main Street
Jones Whittling 114 Main Street
Roberts Light Cleaning 73 Industrial Way
Ellis Alchemy 73 Industrial Way
Ellis Juggling 73 Industrial Way
Harrison Light Cleaning 73 Industrial Way

The table's only candidate key is {Employee, Skill}.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely
Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are
represented: we are told three times that Jones works at 114 Main Street, and twice that Ellis works at 73
Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example,
possible to update Jones' work location on his "Typing" and "Shorthand" records and not update his
"Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones'
current work location?"

CAPE NOTES Unit 2 Module 1 Content 9 8


A 2NF alternative to this design would represent the same information in two tables:

EMPLOYEES EMPLOYEES' SKILLS


Employee Current Work Employee
Location
Jones Typing
Jones 114 MainJones
Street Shorthand
Roberts 73 Industrial
Jones Whittling
Way Roberts Light Cleaning
Ellis 73 Industrial
Ellis Alchemy
Way Ellis Juggling
Harrison 73 Industrial
Harrison Light Cleaning
Way

Update anomalies cannot occur in these tables, which are both in 2NF.

Not all 2NF tables are free from update anomalies, however. An example of a 2NF table which suffers
from update anomalies is:

TOURNAMENT WINNERS
Tournament Year Winner Winner Date of Birth
Des Moines Masters 1998 Chip Masterson 14 March 1977
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and
not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple
records. This problem is addressed by third normal form (3NF).

Third Normal Form

A table for which there are no partial functional dependencies on the primary key is typically, but not
always, in 2NF. In addition to the primary key, the table may contain other candidate keys; it is necessary
to establish that no non-prime attributes have part-key dependencies on any of these candidate keys.

Multiple candidate keys occur in the following table:

ELECTRIC TOOTHBRUSH MODELS


Manufacturer Model Model Full Name Manufacturer Country
Forte X-Prime Forte X-Prime Italy
Forte Ultraclean Forte Ultraclean Italy
Dent-o-Fresh EZBrush Dent-o-Fresh EZBrush USA
Kobayashi ST-60 Kobayashi ST-60 Japan

CAPE NOTES Unit 2 Module 1 Content 9 9


Hoch Toothmaster Hoch Toothmaster Germany
Hoch Contender Hoch Contender Germany

Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF.
{Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset
of it: Manufacturer.

Third Normal Form

The Third Normal Form (3NF) is a normal form used in database normalization. 3NF was originally
defined by E.F. Codd in 1971. Codd's definition states that a table is in 3NF if and only if both of the
following conditions hold:

 The relation R (table) is in second normal form (2NF)


 Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every key of
R.

An example of a 2NF table that fails to meet the requirements of 3NF is:

TOURNAMENT WINNERS
Tournament Year Winner Winner Date of Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977

The only candidate key is {Tournament, Year}.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent
on {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is
functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing
to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

TOURNAMENT WINNERS PLAYER DATES OF BIRTH


Tournament Year Winner Player Date of Birth
Indiana Chip Masterson 14 March 1977
1998 Al Fredrickson
Invitational Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson Bob Albertson 28 September 1968
Des Moines
1999 Al Fredrickson
Masters
Indiana
1999 Chip Masterson
Invitational

CAPE NOTES Unit 2 Module 1 Content 9 10


Update anomalies cannot occur in these tables, which are both in 3NF.

Additional Reading

An Introduction to Database Normalization

By Mike Hillyer

Table of contents:
 Introduction
 Mike's Bookstore
 First Normal Form
 Defining Relationships
 Second Normal Form
 Third Normal Form
 Joining Tables
 Conclusion
 Resources

Introduction

Over the years I, like many of you, have had the experience of taking over responsibility for an existing
application and its associated schema (sometimes frustratingly woven together as part of a Microsoft
Access solution).

The most challenging of these to maintain and rework suffer from what one author described as the
Spreadsheet Syndrome: a tendency for the developer to lump every possible piece of information into as
few table as possible, often into a single table.

A schema that suffers from the Spreadsheet Syndrome is subject to data redundancies, data anomalies, and
various inefficiencies. The cure for Spreadsheet Syndrome is database normalization.

Database normalization is a process by which an existing schema is modified to bring its component
tables into compliance with a series of progressive normal forms. The concept of database normalization
was first introduced by Edgar Frank Codd in his paper A Relational Model of Data for Large Shared Data
Banks, section 4.

The goal of database normalization is to ensure that every non-key column in every table is directly
dependent on the key, the whole key and nothing but the key and with this goal come benefits in the form
of reduced redundancies, fewer anomalies, and improved efficiencies. While normalization is not the be-
all and end-all of good design, a normalized schema provides a good starting point for further
development.

This article will take a practical look at database normalization, focusing on the first three of seven
generally recognized normal forms. Additional resources that look at the theory of database normalization
and the additional normal forms can be found in the Resources section at the end of this article.

CAPE NOTES Unit 2 Module 1 Content 9 11


Note:
This article has been updated from an original version published in 2003.
Mike's Bookstore

Let's say you were looking to start an online bookstore. You would need to track certain information about
the books available to your site viewers, such as:

 Title
 Author
 Author Biography
 ISBN
 Price
 Subject
 Number of Pages
 Publisher
 Publisher Address
 Description
 Review

 Reviewer Name

Let's start by adding the book that coined the term ?Spreadsheet Syndrome?. Because this book has two
authors, we are going to need to accommodate both in our table. Lets take a look at a typical approach

Table 1. Two Books

Title Author Bio ISBN Subject Pages Publisher


Chad Russell is a programmer
Beginning Chad and network administrator who
MySQL,
MySQL Database Russell, owns his own Internet hosting
1590593324 Database 520 Apress
Design and Jon company., Jon Stephens is a
Design
Optimization Stephens member of the MySQL AB
documentation team.

Lets take a look at some issues involved in this design:

First, this table is subject to several anomalies: we cannot list publishers or authors without having a book
because the ISBN is a primary key which cannot be NULL (referred to as an insertion anomaly).
Similarly, we cannot delete a book without losing information on the authors and publisher (a deletion
anomaly). Finally, when updating information, such as an author's name, we must change the data in every
row, potentially corrupting data (an update anomaly).

Note:
Normalization is a part of relational theory, which requires that each relation (AKA table) has a
primary key. As a result, this article assumes that all tables have primary keys, without which a
table cannot even be considered to be in first normal form.

CAPE NOTES Unit 2 Module 1 Content 9 12


Second, this table is not very efficient with storage. Lets imagine for a second that our publisher is
extremely busy and managed to produce 5000 books for our database. Across 5000 rows we would need
to store information such as a publisher name, address, phone number, URL, contact email, etc. All that
information repeated over 5000 rows is a serious waste of storage resources.

Third, this design does not protect data consistency. Lets once again imagine that Jon Stephens has written
20 books. Someone has had to type his name into the database 20 times, and it is possible that his name
will be misspelled at least once (i.e.. John Stevens instead of Jon Stephens). Our data is now in an
inconsistent state, and anyone searching for a book by author name will find some of the results missing.
This also contributes to the update anomalies mentioned earlier.

First Normal Form

The normalization process involves getting our data to conform to progressive normal forms, and a higher
level of normalization cannot be achieved unless the previous levels have been satisfied (though many
experienced designers can create normalized tables directly without iterating through the lower forms).
The first normal form (or 1NF) requires that the values in each column of a table are atomic. By atomic
we mean that there are no sets of values within a column.

In our example table, we have a set of values in our author and subject columns. With more than one value
in a single column, it is difficult to search for all books on a given subject or by a specific author. In
addition, the author names themselves are non-atomic: first name and last name are in fact different
values. Without separating first and last names it becomes difficult to sort on last name.

One method for bringing a table into first normal form is to separate the entities contained in the table into
separate tables. In our case this would result in Book, Author, Subject and Publisher tables.

Table 2. Book Table

ISBN Title Pages


1590593324 Beginning MySQL Database Design and Optimization 520

Table 3. Author Table

Author_ID First_Name Last_name


1 Chad Russell
2 Jon Stephens
3 Mike Hillyer

Table 4. Subject Table

Subject_ID Name
1 MySQL
2 Database Design

Table 5. Publisher Table

CAPE NOTES Unit 2 Module 1 Content 9 13


Publisher_ID Name Address City State Zip
1 Apress 2560 Ninth Street, Station 219 Berkeley California 94710
Note

The Author, Subject, and Publisher tables use what is known as a surrogate primary key -- an
artificial primary key used when a natural primary key is either unavailable or impractical. In the
case of author we cannot use the combination of first and last name as a primary key because there
is no guarantee that each author's name will be unique, and we cannot assume to have the author's
government ID number (such as SIN or SSN), so we use a surrogate key.

Some developers use surrogate primary keys as a rule, others use them only in the absence of a
natural candidate for the primary key. From a performance point of view, an integer used as a
surrogate primary key can often provide better performance in a join than a composite primary key
across several columns. However, when using a surrogate primary key it is still important to create
a UNIQUE key to ensure that duplicate records are not created inadvertently (but some would
argue that if you need a UNIQUE key it would be better to stick to a composite primary key).

By separating the data into different tables according to the entities each piece of data represents, we can
now overcome some of the anomalies mentioned earlier: we can add authors who have not yet written
books, we can delete books without losing author or publisher information, and information such as author
names are only recoded once, preventing potential inconsistencies when updating.

Depending on your point of view, the Publisher table may or may not meet the 1NF requirements because
of the Address column: on the one hand it represents a single address, on the other hand it is a
concatenation of a building number, street number, and street name.

The decision on whether to further break down the address will depend on how you intend to use the data:
if you need to query all publishers on a given street, you may want to have separate columns. If you only
need the address for mailings, having a single address column should be acceptable (but keep potential
future needs in mind).

Defining Relationships

As you can see, while our data is now split up, relationships between the tables have not been defined.
There are various types of relationships that can exist between two tables:

 One to (Zero or) One


 One to (Zero or) Many
 Many to Many

The relationship between the Book table and the Author table is a many-to-many relationship: A book can
have more than one author, and an author can write more than one book. To represent a many-to-many
relationship in a relational database we need a third table to serve as a link between the two. By naming
the table appropriately, it becomes instantly clear which tables it connects in a many-to-many relationship
(in the following example, between the Book and the Author table).

Table 6. Book_Author Table

CAPE NOTES Unit 2 Module 1 Content 9 14


ISBN Author_ID
1590593324 1
1590593324 2

Similarly, the Subject table also has a many-to-many relationship with the Book table, as a book can cover
multiple subjects, and a subject can be explained by multiple books:

Table 7. Book_Subject Table

ISBN Subject_ID
1590593324 1
1590593324 2

As you can see, we now have established the relationships between the Book, Author, and Subject tables.
A book can have an unlimited number of authors, and can refer to an unlimited number of subjects. We
can also easily search for books by a given author or referring to a given subject.

The case of a one-to-many relationship exists between the Book table and the Publisher table. A given
book has only one publisher (for our purposes), and a publisher will publish many books. When we have a
one-to-many relationship, we place a foreign key in the table representing the ?many?, pointing to the
primary key of the table representing the ?one?. Here is the new Book table:

Table 8. Book Table

ISBN Title Pages Publisher_ID


1590593324 Beginning MySQL Database Design and Optimization 520 1

Since the Book table represents the ?many? portion of our one-to-many relationship, we have placed the
primary key value of the Publisher as in aPublisher_ID column as a foreign key.

In the tables above the values stored refer to primary key values from the Book, Author, Subject and
Publisher tables. Columns in a table that refer to primary keys from another table are known as foreign
keys, and serve the purpose of defining data relationships.

In database systems (DBMS) which support referential integrity constraints, such as the InnoDB storage
engine for MySQL, defining a column as a foreign key will allow the DBMS to enforce the relationships
you define. For example, with foreign keys defined, the InnoDB storage engine will not allow you to
insert a row into the Book_Subject table unless the book and subject in question already exist in the Book
and Subject tables or if you're inserting NULL values. Such systems will also prevent the deletion of
books from the book table that have ?child? entries in the Book_Subject or Book_Author tables.

Second Normal Form

Where the First Normal Form deals with atomicity of data, the Second Normal Form (or 2NF) deals with
relationships between composite key columns and non-key columns. As stated earlier, the normal forms
are progressive, so to achieve Second Normal Form, your tables must already be in First Normal Form.

CAPE NOTES Unit 2 Module 1 Content 9 15


The second normal form (or 2NF) any non-key columns must depend on the entire primary key. In the
case of a composite primary key, this means that a non-key column cannot depend on only part of the
composite key.

Let's introduce a Review table as an example.

Table 9. Review Table

ISBN Author_ID Summary Author_URL


1590593324 3 A great book! http://www.openwin.org

In this situation, the URL for the author of the review depends on the Author_ID, and not to the
combination of Author_ID and ISBN, which form the composite primary key. To bring the Review table
into compliance with 2NF, the Author_URL must be moved to the Author table.

Third Normal Form

Third Normal Form (3NF) requires that all columns depend directly on the primary key. Tables violate the
Third Normal Form when one column depends on another column, which in turn depends on the primary
key (a transitive dependency).

One way to identify transitive dependencies is to look at your table and see if any columns would require
updating if another column in the table was updated. If such a column exists, it probably violates 3NF.

In the Publisher table the City and State fields are really dependent on the Zip column and not the
Publisher_ID. To bring this table into compliance with Third Normal Form, we would need a table based
on zip code:

Table 10. Zip Table

Zip City State


94710 Berkeley California

In addition, you may wish to instead have separate City and State tables, with the City_ID in the Zip table
and the State_ID in the City table.

A complete normalization of tables is desirable, but you may find that in practice that full normalization
can introduce complexity to your design and application. More tables often means more JOIN operations,
and in most database management systems (DBMSs) such JOIN operations can be costly, leading to
decreased performance. The key lies in finding a balance where the first three normal forms are generally
met without creating an exceedingly complicated schema.

CAPE NOTES Unit 2 Module 1 Content 9 16


Joining Tables

With our tables now separated by entity, we join the tables together in our SELECT queries and other
statements to retrieve and manipulate related data. When joining tables, there are a variety of JOIN
syntaxes available, but typically developers use the INNER JOIN and OUTER JOIN syntaxes.

An INNER JOIN query returns one row for each pair or matching rows in the tables being joined. Take
our Author and Book_Author tables as an example:

mysql> SELECT First_Name, Last_Name, ISBN


-> FROM Author INNER JOIN Book_Author ON Author.Author_ID = Book_Author.Author_ID;
+------------+-----------+------------+
| First_Name | Last_Name | ISBN |
+------------+-----------+------------+
| Chad | Russell | 1590593324 |
| Jon | Stephens | 1590593324 |
+------------+-----------+------------+
2 rows in set (0.05 sec)

The third author in the Author table is missing because there are no corresponding rows in the
Book_Author table. When we need at least one row in the result set for every row in a given table,
regardless of matching rows, we use an OUTER JOIN query.

There are three variations of the OUTER JOIN syntax: LEFT OUTER JOIN, RIGHT OUTER JOIN and
FULL OUTER JOIN. The syntax used determines which table will be fully represented. A LEFT OUTER
JOIN returns one row for each row in the table specified on the left side of the LEFT OUTER JOIN
clause. The opposite is true for the RIGHT OUTER JOIN clause. A FULL OUTER JOIN returns one row
for each row in both tables.

In each case, a row of NULL values is substituted when a matching row is not present. The following is an
example of a LEFT OUTER JOIN:

mysql> SELECT First_Name, Last_Name, ISBN


-> FROM Author LEFT OUTER JOIN Book_Author ON Author.Author_ID =
Book_Author.Author_ID;
+------------+-----------+------------+
| First_Name | Last_Name | ISBN |
+------------+-----------+------------+
| Chad | Russell | 1590593324 |
| Jon | Stephens | 1590593324 |
| Mike | Hillyer | NULL |
+------------+-----------+------------+
3 rows in set (0.00 sec)

The third author is returned in this example, with a NULL value for the ISBN column, indicating that
there are no matching rows in the Book_Author table.

CAPE NOTES Unit 2 Module 1 Content 9 17


Conclusion

Through the process of database normalization we bring our schema's tables into conformance with
progressive normal forms. As a result our tables each represent a single entity (a book, an author, a
subject, etc) and we benefit from decreased redundancy, fewer anomalies and improved efficiency.

Resources

The following resources were either used in the development of this article or are considered to be of
interest by the author.

 A Relational Model of Data for Large Shared Data Banks - E.F. Codd
 Database Normalization - Wikipedia
 A Simple Guide to Five Normal Forms in Relational Database Theory - William Kent
 Normal Form Definitions and Examples
 MySQL Database Design and Optimization - Jon Stephens & Chad Russell

http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

CAPE NOTES Unit 2 Module 1 Content 9 18


Entity-Relationship Diagram (ERD)

The purpose of Entity Relationship Diagram (ERD) is to organize information into manageable record in a database.
ERD illustrate the logical structure of database. They are used to:
 Identify the data that must be captured, stored and retrieved for the particular application
 Identify the data required to derive and report on the performance measures that an organization
should be monitoring.

ERDs have three different components:


 Entities
 Attributes
 Relationship
Entity
Entities are people, places, things and events. In short, anything you need to store data about. Entities also
represent collection of things. For example a STUDENT entity might represent a collection of all the
things that can be said about a student. Entities are represented on the ERD by labeled boxes.

STUDENT

Attributes are the properties or characteristics of an entity. These are the smallest unit of data that can be
described in a meaningful manner. For example STUDENT entity may have attributes such as:

STUDENT
StudentNumber
FirstName
Surname STUDENT
TelephoneNumber StudentNumber
DOB FirstName
Form Surname
TelephoneNumber
DOB
Form

The relationship is the interaction between the entities. A relationship may be represented by a diamond
shape, or more simply, by the line connecting the entities. In either case, verbs are used to label the
relationships.
The cardinality defines the relationship between the entities in terms of numbers. An entity may
be optional: for example, a sales rep could have no customers or could have one or many customers;
or mandatory: for example, there must be at least one product listed in an order. There are several different

CAPE NOTES Unit 2 Module 1 Content 9 19


types of cardinality notation; crow's foot notation, used here, is a common one. In crow's foot notation, a
single bar indicates one, a double bar indicates one and only one (for example, a single instance of a
product can only be stored in one warehouse), a circle indicates zero, and a crow's foot indicates many.
The three main cardinal relationships are: one-to-one, expressed as 1:1; one-to-many, expressed as 1:M;
and many-to-many, expressed as M:N.

adapted from http://federated-ins-claim-erd.wikispaces.com/Symbols

The steps involved in creating an ERD are:


 Identify the entities.
 Determine all significant interactions.
 Analyze the nature of the interactions.
 Draw the ERD.

EXAMPLE of an ERD

WOD 1: Warehouse
This WOD was a simple ERD diagram:

CAPE NOTES Unit 2 Module 1 Content 9 20


This diagram shows that there is a many to many relationship between tag and product which means
there can be many tags for one product and many products for a single tag. Product has a many to one
relationship with stock item which means that there can be only one product associated with a stock
item but there can be many stock items with the same product. Finally, there is a many to one
relationship between stock item and warehouse meaning there can only be one warehouse per stock
item but a warehouse can have many stock items.
Taken from: https://ekomiyama.wordpress.com/2013/11/, retrieved on July 27, 2015

CAPE NOTES Unit 2 Module 1 Content 9 21

You might also like