You are on page 1of 9

Functional Dependencies

and
Normalization
Chapter 14
Designing a database
So far, we have designed our base tables from a E-R diagram or by common sense
We still need some formal measure of why one group of attributes is a better base
table than another group.
This chapter discusses some of the theory of what a good design is
Gathering the data
Identify the data requirements of the users.
This sounds simple, and it CAN be, but usually isn't because users rarely can
clearly describe what they need.
In addition, the data processing staff, who are quite knowledgeable how the system
functions, often have tunnel vision, which inhibits creative thought.

Normalization helps make this process objective


In many respects, normalization is simply applied common sense.
It is a technique for reviewing the entity/attribute list to insure that attributes are
stored "where they belong"
We did not have any formal measure of why one grouping of attributes into a relation
may be better than another.
We will be talking about the logical level or second level of the architecture. i.e
designing base tables.

Informal discussion of some of the criteria for good and bad relational
schema.
1. The meaning (semantics) of attributes
2. Redundant Information and Update Anomalies
3. Avoiding NULL values
4. Avoiding generation of spurious tuples
1. The meaning (semantics) of attributes

When we group attributes to form a relation, we assume that a certain meaning is


associated with the attributes.
This meaning helps us understand how the attributes in a tuple are related to each
other.
In general, the easier it is to explain the semantics of the relation, the better the relation
schema design.
GUIDELINE 1: Design a relation schema in such a way that it is easy to
explain its meaning.

Other design possibilities


(or why some organizations do not work as well as others)
What is the semantics of these relations?
EMP_DEPT
ename SSN bdate address dnumber dname dmgrssn

EMP_PROJ
SSN PNUMBER hours ename pname plocation

Why are these poor designs?


These relations also have clear meanings.
There is nothing logically wrong with these two relations,
but they mix attributes from distinct real-world entities.
They may be used as views, but they cause several problems when used as base tables.
What are the problems?
They take up more storage space because of data redundancy.
Another serious problem with using relations like this as base tables is the problem of
update anomalies.
anomalies are the result of a poorly designed database.
There are three types of Update Anomalies
1. Insertion anomalies
Must be careful that DNAME & DMGRSSN agree with DNUM
Difficult to insert a new department that has no employees yet
2. Deletion anomalies
If we delete the last employee of a dept, we eliminate the information on that
department
3. Modification anomalies
If dept 4 gets a new manager, then all the employees who work in dept 4's tuples
must be updated.

2. Redundant Information and Update Anomalies


GUIDELINE 2: Design the base tables so no insertion, deletion, or
modification anomalies occur in the relations.

The second guideline is consistent with and in a way a restatement of the first.
We begin to see a need for a more formal approach to help us evaluate whether our
design meets these guidelines.
3. Avoiding NULL values
In some base tables we may group many attributes together into a "fat" relation
If many of the attributes do not apply to all tuples in the relation, we end up with many
NULL's in the relation.
This can be wasteful of storage space, and may also lead to problems with
understanding of the meaning of the attributes.
NULL's can have multiple interpretations
The attribute does not apply to this tuple
The attribute value for this tuple is unknown
The value is known but absent (has not been recorded yet)

Avoiding NULL values


GUIDELINE 3: Avoid placing attributes in a base table whose values may
be NULL, as far a possible.
For example, if only 10% of employees have individual offices, do not put an attribute
OFF_NUM in the EMPLOYEE relation.
Solution: Create a new relation EMP_OFF that includes only tuples for employees
who have offices.

Avoid generation of spurious tuples


Look at the relations EMP_LOC and EMP_PROJ1
A tuple in EMP_LOC means that an employee works on some project at plocation
A tuple in EMP_PROJ1 means that an employee with SSN works on project for
hours/week, the name and location of the project are also kept

Using EMP_LOC and EMP_PROJ1 as base tables


This is a very bad schema design
If we do a natural join of these tables having as join conditions:
EMP_LOC.plocation = EMP_PROJ1.plocation
We get many spurious tuples
A spurious tuple has wrong information; information that is not valid
Why is this happening?
PLOCATION is not a PK in either table, so it cannot be a FK

Functional Dependency
This is a more formal, more objective design criteria.
Functional dependency is the single most important concept in relational
database design.
A FD is a property of the meaning or semantics of the attributes of a relation.
Before I give a formal definition, lets look at dependency informally first.
Figure out which attributes are dependent on other attributes.

Formal Definition of Functional Dependency


Given a relation R and attributes X and Y,
we say Y is functionally dependent on X if and only if
for each X value there is precisely one Y value associated with it.
i.e. X functionally determines Y
this is written like this: X -> Y
What are the functional Dependencies of the EMP_DEPT and the EMP_PROJ
relations

The FD of EMP_DEPT and the EMP_PROJ relations


SSN -> {ename, bdate, address, dnumber}
dnumber -> {dname, dmgrssn}

{SSN, pnumber} ->{hours}


SSN -> {ename}
pnumber -> {pname, plocation}

Full functional dependency


If the X is composite, then there is a possibility that Y is not fully functionally
dependent on X.
To be fully functionally dependent, Y must not be dependent on any proper subset
of X.
{SSN,PNUMBER} -> ENAME is not fully functionally dependent because ENAME
is dependent on SSN, a subset of {SSN, PNUMBER}.

Property of functional dependencies


A FD is a property of the relation schema, not of a particular extension or instance
of the relation
It is actually a constraint on the schema
The DB designer uses their understanding of the meaning of the attributes to
specify the functional dependency
These FD must hold for all instances of the DB at all times
The set of FD on a relation schema
Let the letter F denotes the set of FDs on a relation
Example for the EMP_DEPT relation
F = {SSN -> {ename, bdate, address, dnumber
dnumber -> {dname, dmgrssn}}
These are the FD that are semantically obvious
Are they the only FD?
There are many other FD that can be inferred or deduced from the semantic
obvious ones
Example
SSN-> {dname, dmgrssn} or even
SSN -> SSN, or dnumber -> dname

Inferred FDs
Suppose F is the set of functional dependencies that are specified on relational schema
R
Typically, the schema designer specifies the FD that are semantically obvious
But there are other depenencies that can be inferred or deduced from the FDs in F
The closure F+ of F is the set of all functional dependencies that can be inferred from
F

Properties of functional dependencies


The inference rules are a set of rules that can be used to infer all the possible FDs from
F
Some of these FD are trivial
A FD is trivial if and only if the right-hand side (the dependent) is a subset of the
left -hand side (the determinant).
One set of FD can be equivalent to another set
What are the functional dependencies

Emp Emp Emp Dept Dept dept Skill Skill Skill Skill
ID Name Phone Name Phone Mgr ID Name date Level

Semantics
An employee works for a single department, but may have multiple skills
Skills have an ID, and a name; an employee takes a test at a certain date to establish
a skill level
Definition: Determinate
any attribute on which some other attribute is dependent
Do you agree with these FD?
empID {empName, empPhone, deptName}
What about empPhone {empID}
deptName{deptPhone, deptMgr}
skillID{skillName}
What about skillName {skillID} ?
empID, skillID{skillDate, skillLevel}
If you had a different set of FD, were they equivalent?

How functional dependencies are used


FD are used as information about the semantics of the relational schema
This, in turn, is used to determine the normal form of a relation
Normal forms are used to analyze how good a design is
To determine the normal form three things are needed
1) a set of FDs for each relation
2) a designated primary key
3) tests or conditions for each normal form
So, next we look at the test or conditions for the normal forms

Normalization overview
Originally, E.F. Codd proposed three normal forms, called first, second, and third
normal forms
1NF, 2NF and 3NF are based on the functional dependencies among the
attributes of a relation
Later a stronger definition of 3NF was proposed by Boyce and Codd and is known
as Boyce-Codd normal form.
Later, 4th and 5th normal forms were proposed, based on other properties
In the normalization process, you start with a universal relation (one table with all the
attributes) and functional dependencies
Then you apply the normal form restrictions, and decompose the tables as they
specify.

Normalization of data
This is a process of analyzing the relation schemas based on functional dependencies
(FD) and PK to achieve two desirable properties
1) Minimizing data redundancy
2) Minimizing insertion, deletion and update anomalies
Using normal forms
When a test fails, the relation violating that test must be decomposed
All these normal forms were based on FD among the attributes of a relation
Disclaimers to use of normal forms:
Normal forms when considered by themselves do not guarantee good db design.
Sometimes the best design may not be the highest normal form, for performance
reasons.

Review of terminology of keys


SUPERKEY - any set of attributes that make a tuple unique
KEY - a minimum superkey
CANDIDATE KEY - if a relation has more than one key, each is a candidate key
PRIMARY KEY - an arbitrary candidate key
SECONDARY KEY - a candidate key that is not the PK
PRIME ATTRIBUTE - a member of any key
NONPRIME or NONKEY ATTRIBUTE - not a member of any candidate key
First Normal Form
1NF is now generally considered to be part of the formal definition of a relation. It
states that:
the domains of attributes must include only atomic values
the value of any attribute in a tuple must be a single value from the domain of that
attribute
all attributes are dependent on the PK
Look at this schema for departments:
dname dnumber dmgrssn dlocations

Normalizing the relation into 1NF


The problem with this table is that department location is multivalued.
What are the possible solutions?
Have the primary key be composite {dnumber, dlocation}
Problems with this?
If you know the possible number of locations, make attributes for three locations
Problem with this?
Make a new table with dname and dlocation
This is the preferred solution

Second Normal Form


The second normal form is based on the concept of a full functional dependency.
If a dependency is not a full functional dependency, it is called a partial
dependency
This can occur only if the determinate is composite
A functional dependency X -> Y is a full functional dependency if removal of any
attribute from X means that the dependency does not hold any more
A functional dependency is a partial dependency if there is some attribute A in X that
can be removed from X and the dependency still holds.

Definition for 2NF


A relation is in 2NF if:
every nonkey attribute is fully functionally dependent on the primary key.
If a relation is not in 2NF, it can be normalized into a number of 2NF relations
This involves decomposing the table
Problem: Normalize the EMP_PROJ table
Procedure:
Figure out the dependencies, noting the determinates
A determinate is any attribute(s) on which some other attribute(s) are dependent
Put each determinate in a table by itself, and
Include in each table the attributes that are dependent on that determinate
EMP_PROJ
SSN PNUMBER hours ename pname plocation

Third Normal Form


Look at the ENP_DEPT relation
Is it in 1NF? In 2NF?
Is there still a problem? What is the problem?
EMP_DEPT
ename SSN bdate address dnumber dname dmgrssn

The problem with the EMP_DEPT table


DNAME depends on DNUM, not the primary key.
This is called a transitive dependency.
A functional dependency X-> Y is a transitive dependency if
there is a set of attributes Z of the relation and both X -> Z and Z ->Y hold
Definition for 3NF
A relation is in 3NF if every nonkey attribute is:
Fully functionally dependent on the primary key (i.e. in 2NF).
Nontransitively dependent on the primary key.

We can normalize EMP_DEPT by decomposing it into two 3NF relations.


Intuitively, we see that the two results represent independent entity facts.
A natural join will recover the original relation.
Identify FD of this relation
property_ID county_nam Lot# area price tax_rate
e
Semantics
Lot#s are unique only within each county
Property_ID#s are unique across counties for the entire state
The tax rate is fixed for a given county(does not vary lot by lot)
The price of a lot is determined by its area regardless of which county it is in
(assume that this is the price for the lot for tax purposes
What is the PK? Normalize it into 3NF.
Generalizing the definition of 2NF and 3NF
To assure we do not have update anomalies we need to extend this definition to
include all candidate key rather than just PK
For 2NF we disallowed partial dependencies on any key
For 3NF we disallowed transitive dependencies on any key
Simpler definition for 3NF: Every nonprime attribute must be
fully functionally dependent on every key
nontransitively dependent on every key

Boyce-Codd Normal Form


This is a simpler, yet stricter form of 3NF.
There are a few cases where a DB is in 3NF and not in BCNF.
Example: define another FD on the Lots DB
All the lots in the DB come from only two counties
In one county, all lots are >= one acre
In the other county, all lots are < one acre
The FD is area -> county_name
Is this still in 3NF
Definition of BDNF
A relation is in BCNF if and only if every determinate is a candidate key
To be in 3NF but not in BCNF, must have a composite candidate key, one of whose
members is determined by a non prime attribute
Any relation not in BCNF can be decomposed into an equivalent set of BCNF
relations

Summary
In general, it is best to have a relational schema in BCNF
If that is not possible, 3NF will do
2NF and 1NF are not considered good relation schema designs.
They allow too much data redundancy which leads to update anomalies
Summary of Normal forms
A relation is in 1NF the domains of attributes must include only atomic
values
A relation is in 2NF if every nonkey attribute is fully functionally
dependent on a candidate key.

A relation is in 3NF if every nonkey attribute is in 2NF and nontransitively


dependent on a candidate key

A relation is in BCNF if and only if every determinate is a candidate key