
DBMS

MODULE-IV

NORMALIZATION OF DATABASE TABLES

DATABASE TABLES AND NORMALIZATION:


The database table is the basic building block in the database design process. Consequently, the table's
structure is of great interest. Ideally, the database design process explored in Entity Relationship (ER) Modeling
yields good table structures. Yet it is possible to create poor table structures even in a good database design. So how
do you recognize a poor table structure, and how do you produce a good table? The answer to both questions
involves normalization.

Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby
reducing the likelihood of data anomalies. The normalization process involves assigning attributes to tables based on
the concepts of the relational database model.

Normalization works through a series of stages called normal forms. The first three stages are described as first
normal form (1NF), second normal form (2NF), and third normal form (3NF). From a structural point of view, 2NF
is better than 1NF, and 3NF is better than 2NF. For most purposes in business database design, 3NF is as high as
you need to go in the normalization process. However, you will discover in Section 5.3 that properly designed 3NF
structures also meet the requirements of fourth normal form (4NF).

Although normalization is a very important database design ingredient, you should not assume that the highest level
of normalization is always the most desirable. Generally, the higher the normal form, the more relational join
operations required to produce a specified output and the more resources required by the database system to respond
to end-user queries. A successful design must also consider end-user demand for fast performance. Therefore, you
will occasionally be expected to denormalize some portions of a database design in order to meet performance
requirements. Denormalization produces a lower normal form; that is, a 3NF will be converted to a 2NF through
denormalization. However, the price you pay for increased performance through denormalization is greater data
redundancy.

THE NEED FOR NORMALIZATION:


Storing the same information redundantly, that is, in more than one place within a database, can lead to several
problems:

Redundant Storage: The same information is stored repeatedly.


Update Anomalies: If one copy of such repeated data is updated, an inconsistency is created unless all copies are
similarly updated.
Insertion Anomalies: It may not be possible to store certain information unless some other, unrelated, information
is stored as well.
Deletion Anomalies: It may not be possible to delete certain information without losing some other, unrelated,
information as well.

Normalization of data can be considered a process of analyzing the given relation schemas based on their FDs and
primary keys to achieve the desirable properties of (1) minimizing redundancy and (2) minimizing the insertion,
deletion, and update anomalies.

SCHEMA REFINEMENT:

Problems Caused by Redundancy:


Storing the same information redundantly, that is, in more than one place within a database, leads to the problems
already listed above: redundant storage, update anomalies, insertion anomalies, and deletion anomalies.

Consider a relation obtained by translating a variant of the Hourly_Emps entity set:

Hourly_Emps(ssn, name, lot, rating, hourly_wages, hours_worked)

The key for Hourly_Emps is ssn. In addition, suppose that the hourly_wages attribute is determined by the rating
attribute. That is, for a given rating value, there is only one permissible hourly_wages value. This IC (integrity
constraint) is an example of a functional dependency. It leads to possible redundancy in the relation Hourly_Emps,
as illustrated in Figure 19.1.

If the same value appears in the rating column of two tuples, the IC tells us that the same value must appear
in the hourly_wages column as well. This redundancy has the same negative consequences as before:

Redundant Storage: The rating value 8 corresponds to the hourly wage 10, and this association is repeated three
times.
Update Anomalies: The hourly_wages in the first tuple could be updated without making a similar change in the
second tuple.
Insertion Anomalies: We cannot insert a tuple for an employee unless we know the hourly wage for the employee's
rating value.
Deletion Anomalies: If we delete all tuples with a given rating value (e.g., we delete the tuples for Smethurst and
Guldu), we lose the association between that rating value and its hourly_wages value.

Decompositions:
A decomposition of a relation schema R consists of replacing the relation schema by two (or more) relation
schemas that each contain a subset of the attributes of R and together include all attributes in R. Intuitively, we want
to store the information in any given instance of R by storing projections of the instance.

We can decompose Hourly_Emps into two relations:


Hourly_Emps2(ssn, name, lot, rating, hours_worked)
Wages(rating, hourly_wages)


The instances of these relations corresponding to the instance of the Hourly_Emps relation in Figure 19.1 are
shown in Figure 19.2. Note that we can easily record the hourly wage for any rating simply by adding a tuple to
Wages, even if no employee with that rating appears in the current instance of Hourly_Emps. Changing the wage
associated with a rating involves updating a single Wages tuple. This is more efficient than updating several tuples
(as in the original design), and it eliminates the potential for inconsistency.

FUNCTIONAL DEPENDENCIES:

A functional dependency, denoted by X -> Y, between two sets of attributes X and Y that are subsets of R
specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that, for any two
tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].

This means that the values of the Y component of a tuple in r depend on, or are determined by, the values of
the X component; alternatively, the values of the X component of a tuple uniquely (or functionally) determine the
values of the Y component. We also say that there is a functional dependency from X to Y, or that Y is functionally
dependent on X. The abbreviation for functional dependency is FD or f.d. The set of attributes X is called the left-
hand side of the FD, and Y is called the right-hand side.

Figure 19.3 illustrates the meaning of the FD AB -> C by showing an instance that satisfies this dependency. The
first two tuples show that an FD is not the same as a key constraint: although the FD is not violated, AB is clearly
not a key for the relation. The third and fourth tuples illustrate that if two tuples differ in either the A field or the B
field, they can differ in the C field without violating the FD. On the other hand, if we add a tuple (a1, b1, c2, d1) to
the instance shown in this figure, the resulting instance would violate the FD; to see this violation, compare the first
tuple in the figure with the new tuple.
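
The FD definition above can be checked mechanically. The following is a minimal Python sketch (not part of the original notes) that tests whether an FD X -> Y holds on a small relation instance represented as a list of dictionaries; the attribute names A, B, C, D and the sample tuples are illustrative assumptions in the spirit of Figure 19.3, which is not reproduced here.

from itertools import combinations

def fd_holds(rows, X, Y):
    # The FD X -> Y holds iff any two tuples that agree on X also agree on Y.
    for t1, t2 in combinations(rows, 2):
        if all(t1[a] == t2[a] for a in X) and not all(t1[a] == t2[a] for a in Y):
            return False
    return True

# Illustrative instance over attributes A, B, C, D (made-up values).
r = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b1", "C": "c1", "D": "d2"},  # same A, B -> same C: OK
    {"A": "a1", "B": "b2", "C": "c2", "D": "d1"},
    {"A": "a2", "B": "b1", "C": "c3", "D": "d1"},
]

print(fd_holds(r, X={"A", "B"}, Y={"C"}))   # True: AB -> C is satisfied
r.append({"A": "a1", "B": "b1", "C": "c2", "D": "d1"})
print(fd_holds(r, X={"A", "B"}, Y={"C"}))   # False: the new tuple violates AB -> C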


REASONING ABOUT FDS:

Given a set of FDs over a relation schema R, typically several additional FDs hold over R whenever all of
the given FDs hold. As an example, consider:

Workers(ssn, name, lot, did, since)

We know that ssn -> did holds, since ssn is the key, and the FD did -> lot is given to hold. Therefore, in any legal
instance of Workers, if two tuples have the same ssn value, they must have the same did value (from the first FD),
and because they have the same did value, they must also have the same lot value (from the second FD). Therefore,
the FD ssn -> lot also holds on Workers.

We say that an FD f is implied by a given set F of FDs if f holds on every relation instance that satisfies all
dependencies in F; that is, f holds whenever all FDs in F hold. Note that it is not sufficient for f to hold on some
instance that satisfies all dependencies in F; rather, f must hold on every instance that satisfies all dependencies in F.

Closure of a Set of FDs:

The set of all FDs implied by a given set F of FDs is called the closure of F, denoted F+.
- In order to check for the presence of redundancy and anomalies, we need to ascertain the possible presence of
other FDs implied by those stated explicitly. This means that we have to calculate the closure F+.
- This may be done using the Armstrong Axioms, which may be stated as follows: letting
X, Y, Z denote sets of attributes of a relation R,

• Reflexivity: If X ⊇ Y (i.e. X contains Y) then X->Y. (This rule really generates only
trivial FDs).

• Augmentation: If X->Y, then XZ->YZ for any Z.

• Transitivity: If X->Y and Y->Z, then, X->Z.

It may be proven that

(1) Armstrong's axioms are sound, i.e., they generate only FDs in F+.
(2) Armstrong's axioms are complete, i.e., they generate all the FDs in F+.

It is convenient to add the following additional rules, which can be derived from Armstrong's axioms:

• Union: If X->Y and X->Z, then X->YZ.

• Decomposition: If X->YZ, then X->Y and X->Z.

Note that these axioms do not imply that you may 'cancel' attributes appearing on both sides. Thus, if AB -> BC,
then you may not conclude that A -> B.

Examples Of Application Of Armstrong Axioms (1):

Consider the relation ABC with FDs {(i) A->B and (ii) B->C}

1. From Reflexivity we get all the trivial FDs which are of the form
X->Y, where Y ⊆X, X ⊆ABC and Y ⊆ABC.

2. Applying transitivity to (i) and (ii) we get A->C.

3. From augmentation we get

AC -> BC, AB -> AC, AB -> CB.


Thus the closure of the set F of given FDs is (apart from trivial FDs):
F+ = {A->B, B->C, A->C, AC->BC, AB->AC, AB->CB}

Examples Of Application Of Armstrong Axioms (2):

Consider a relation Contracts, with attributes C, S, J, D, P, Q, and V, characterized by the set of FDs
{ (i) C -> CSJDPQV, (ii) JP -> C, (iii) SD -> P }.

1. From (ii), (i), and transitivity we get (iv) JP -> CSJDPQV.

2. From (iii) and augmentation we get (v) SDJ -> JP.
3. From (v), (iv), and transitivity we get (vi) SDJ -> CSJDPQV.
4. From (i) and decomposition we can get
C -> S, C -> J, C -> D, C -> P, C -> Q, C -> V.

Attribute Closure:
Constructing the closure of a set of FDs may be fairly laborious. It may be avoided when one wishes to check what
the possible right-hand sides Y of an FD X -> Y are, for a given X, by means of the following algorithm, which
calculates the so-called attribute closure, denoted X+, of a set X = {A1, A2, ..., An} of attributes with respect to
the set F of FDs.

1. Let X be a set of attributes that eventually will become the closure. First we initialize X to be {A1, A2, … ,An}.
2. We repeatedly search for some FD B1 B2 …Bm ->C such that all of B1 B2 … Bm are in the set of attributes X,
but C is not. We then add C to the set X.
3. Repeat step 2 as many times as necessary until no more new attributes can be added to X.
4. The final set X is the correct value of {A1, A2, … ,An}+.

Example Of Attribute Closure Computation:


Given the previous Contracts relation characterized by the FDs
(i) C ->CSJDPQV, (ii) JP->C, (iii) SD->P

Suppose we wish to get the attribute closure of JP, i.e. (JP)+


1. Initialize the closure (X)+ as {J, P}.

2. (i) does not satisfy the requirement that its left-hand side be contained in the current set X = {J, P}.
(ii) does, therefore we set (X)+ = (X)+ U {C} = {J, P, C}.
(iii) does not.
We now repeat step 2:

2. (i) now does satisfy the requirement, since its left-hand side C is contained in the current set,
therefore we set (X)+ = (X)+ U {C, S, J, D, P, Q, V} = {J, P, C, S, D, Q, V}.
(ii) and (iii) add nothing new. Repeating step 2 does not change (X)+.
Therefore we stop, having obtained (JP)+ = {J, P, C, S, D, Q, V}, i.e. (JP)+ = CSJDPQV.
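
The attribute-closure algorithm above is easy to implement. Below is a minimal Python sketch (an illustration, not part of the original notes); the representation of FDs as (left-hand side, right-hand side) pairs of attribute sets is an assumption made for the example, and the usage lines recompute (JP)+ for the Contracts FDs so the result can be compared with the hand calculation.

def attribute_closure(X, fds):
    # X is a set of attribute names; fds is a list of (lhs, rhs) pairs of sets.
    # Repeatedly add the right-hand side of any FD whose left-hand side is
    # already contained in the closure, until nothing changes (step 2 above).
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# The Contracts FDs used in the worked example above:
# (i) C -> CSJDPQV, (ii) JP -> C, (iii) SD -> P
contracts_fds = [
    (set("C"), set("CSJDPQV")),
    (set("JP"), set("C")),
    (set("SD"), set("P")),
]

print(sorted(attribute_closure(set("JP"), contracts_fds)))
# ['C', 'D', 'J', 'P', 'Q', 'S', 'V']  -- i.e. (JP)+ = CSJDPQV, as computed by hand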

THE NORMALIZATION PROCESS:


In this section, you learn how to use normalization to produce a set of normalized tables to store the data
that will be used to generate the required information. The objective of normalization is to ensure that each table
conforms to the concept of well-formed relations, that is, tables that have the following characteristics:

- Each table represents a single subject. For example, a course table will contain only data that directly pertains
to courses. Similarly, a student table will contain only student data.


- No data item will be unnecessarily stored in more than one table (in short, tables have minimum controlled
redundancy). The reason for this requirement is to ensure that the data are updated in only one place.

- All nonprime attributes in a table are dependent on the primary key—the entire primary key and nothing but the
primary key. The reason for this requirement is to ensure that the data are uniquely identifiable by a primary key
value.

- Each table is void of insertion, update, or deletion anomalies. This is to ensure the integrity and consistency of the
data.

To accomplish this objective, the normalization process takes you through the steps that lead to successively higher
normal forms. The most common normal forms and their basic characteristics are listed in Table 5.2.

Definitions of Keys and Attributes Participating in Keys:


A superkey of a relation schema R = {A1, A2, ... , An} is a set of attributes S ⊆ R with the property that no two
tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]. A key K is a superkey with the additional
property that removal of any attribute from K will cause K not to be a superkey any more.
The difference between a key and a superkey is that a key has to be minimal; that is, if we have a key K = {A1, A2,
..., Ak} of R, then K – {Ai} is not a key of R for any Ai, 1 ≤ i ≤ k. For example, {Ssn} is a key for EMPLOYEE,
whereas {Ssn}, {Ssn, Ename}, {Ssn, Ename, Bdate}, and any set of attributes that includes Ssn are all superkeys.
If a relation schema has more than one key, each is called a candidate key. One of the candidate keys is arbitrarily
designated to be the primary key, and the others are called secondary keys.
An attribute of relation schema R is called a prime attribute of R if it is a member of some candidate key of R. An
attribute is called nonprime if it is not a prime attribute—that is, if it is not a member of any candidate key.
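
Since a superkey is exactly a set of attributes whose closure contains every attribute of R, and a key is a minimal superkey, both tests can be sketched on top of the attribute-closure algorithm shown earlier. The Python fragment below is only an illustrative sketch; the small EMPLOYEE fragment (Ssn, Ename, Bdate) and its single FD are assumptions chosen to mirror the example in the text.

def closure(X, fds):
    # Attribute closure X+ of X with respect to fds (same algorithm as above).
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(K, all_attrs, fds):
    # K is a superkey iff its closure contains every attribute of R.
    return closure(K, fds) >= set(all_attrs)

def is_candidate_key(K, all_attrs, fds):
    # A key is a superkey from which no attribute can be removed.
    if not is_superkey(K, all_attrs, fds):
        return False
    return all(not is_superkey(K - {a}, all_attrs, fds) for a in K)

# Hypothetical EMPLOYEE fragment: Ssn determines the other attributes.
attrs = {"Ssn", "Ename", "Bdate"}
fds = [({"Ssn"}, {"Ename", "Bdate"})]

print(is_superkey({"Ssn", "Ename"}, attrs, fds))       # True  (a superkey)
print(is_candidate_key({"Ssn", "Ename"}, attrs, fds))  # False (not minimal)
print(is_candidate_key({"Ssn"}, attrs, fds))           # True  (a candidate key)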

First Normal Form:

It states that the domain of an attribute must include only atomic (simple, indivisible) values and that the
value of any attribute in a tuple must be a single value from the domain of that attribute. Hence, 1NF disallows
having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple. In other
words, 1NF disallows relations within relations or relations as attribute values within tuples. The only attribute
values permitted by 1NF are single atomic (or indivisible) values.

Consider the DEPARTMENT relation schema shown in Figure 15.9(a), whose primary key is Dnumber, and
suppose we assume that each department can have a number of locations. The DEPARTMENT schema and a
sample relation state are shown in Figure 15.9(b). As we can see, this is not in 1NF because Dlocations is not an
atomic attribute, as illustrated by the first tuple in Figure 15.9(b). There is more than one way to handle the
Dlocations attribute; the relation converted into 1NF is shown in Figure 15.9(c).
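
As a rough illustration of such a conversion, the Python sketch below (not from the text) flattens a set-valued Dlocations attribute into atomic values by repeating the remaining attributes once per location; the sample department names and locations are made up.

# Non-1NF view of DEPARTMENT: Dlocations holds a *set* of values per tuple.
department = [
    {"Dname": "Research",       "Dnumber": 5, "Dlocations": ["Bellaire", "Sugarland", "Houston"]},
    {"Dname": "Administration", "Dnumber": 4, "Dlocations": ["Stafford"]},
]

# 1NF version: one tuple per (department, location), so every value is atomic.
department_1nf = [
    {"Dname": d["Dname"], "Dnumber": d["Dnumber"], "Dlocation": loc}
    for d in department
    for loc in d["Dlocations"]
]

for row in department_1nf:
    print(row)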


Second Normal Form:


Definition: A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally dependent on the
primary key of R.

The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the primary
key. If the primary key contains a single attribute, the test need not be applied at all. The EMP_PROJ relation in
Figure 15.10 is in 1NF but is not in 2NF. The nonprime attribute Ename violates 2NF because of FD2, as do the
nonprime attributes Pname and Plocation because of FD3. The functional dependencies FD2 and FD3 make Ename,
Pname, and Plocation partially dependent on the primary key {Ssn, Pnumber} of EMP_PROJ, thus violating the
2NF test.

If a relation schema is not in 2NF, it can be second normalized or 2NF normalized into a number of 2NF relations in
which nonprime attributes are associated only with the part of the primary key on which they are fully functionally
dependent. Therefore, the functional dependencies FD1, FD2, and FD3 in Figure 15.10 lead to the decomposition of
EMP_PROJ into the three relation schemas EP1, EP2, and EP3 shown in Figure 15.11, each of which is in 2NF.
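
Figure 15.11 is not reproduced here. Assuming the usual textbook EMP_PROJ schema, where FD1 is {Ssn, Pnumber} -> Hours, FD2 is Ssn -> Ename, and FD3 is Pnumber -> {Pname, Plocation}, the 2NF decomposition is along these lines:

EP1(Ssn, Pnumber, Hours)
EP2(Ssn, Ename)
EP3(Pnumber, Pname, Plocation)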

Figure 15.10


Third Normal Form:

Third normal form (3NF) is based on the concept of transitive dependency. A functional dependency X -> Y in a
relation schema R is a transitive dependency if there exists a set of attributes Z in R that is neither a candidate key
nor a subset of any key of R, and both X -> Z and Z -> Y hold.

Definition: According to Codd’s original definition, a relation schema R is in 3NF if it satisfies 2NF and no
nonprime attribute of R is transitively dependent on the primary key. A relation schema R is in third normal form
(3NF) if, whenever a nontrivial functional dependency X->A holds in R, either (a) X is a superkey of R, or (b) A is a
prime attribute of R.

Figure 15.5

The dependency Ssn->Dmgr_ssn is transitive through Dnumber in EMP_DEPT in Figure 15.5, because both the
dependencies Ssn ->Dnumber and Dnumber ->Dmgr_ssn hold and Dnumber is neither a key itself nor a subset of
the key of EMP_DEPT. Intuitively, we can see that the dependency of Dmgr_ssn on Dnumber is undesirable in
EMP_DEPT since Dnumber is not a key of EMP_DEPT.
The relation schema EMP_DEPT in Figure 15.5 is in 2NF, since no partial dependencies on a key exist. However,
EMP_DEPT is not in 3NF because of the transitive dependency of Dmgr_ssn (and also Dname) on Ssn via
Dnumber. We can normalize EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2
shown in Figure 15.6.
Intuitively, we see that ED1 and ED2 represent independent entity facts about employees and departments. A
NATURAL JOIN operation on ED1 and ED2 will recover the original relation EMP_DEPT without generating
spurious tuples.


Figure 15.6
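
The contents of Figure 15.6 are not reproduced here. Assuming the usual textbook EMP_DEPT schema (Ename, Ssn, Bdate, Address, Dnumber, Dname, Dmgr_ssn), the two 3NF relation schemas are along these lines:

ED1(Ename, Ssn, Bdate, Address, Dnumber)
ED2(Dnumber, Dname, Dmgr_ssn)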


HIGHER-LEVEL NORMAL FORMS:


Tables in 3NF will perform suitably in business transactional databases. However, there are occasions
when higher normal forms are useful. In this section, you learn about a special case of 3NF, known as Boyce-Codd
normal form (BCNF), and about fourth normal form (4NF).

Boyce-Codd Normal Form:

Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was found to be stricter
than 3NF. That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.

Definition: A relation schema R is in BCNF if whenever a nontrivial functional dependency X->A holds in R, then
X is a superkey of R.


The formal definition of BCNF differs from the definition of 3NF in that condition (b) of 3NF, which allows A to be
prime, is absent from BCNF. That makes BCNF a stronger normal form than 3NF. In our example, FD5 violates
BCNF in LOTS1A because Area is not a superkey of LOTS1A. Note that FD5 satisfies 3NF in LOTS1A because
County_name is a prime attribute (condition b), but this condition does not exist in the definition of BCNF. We can
decompose LOTS1A into two BCNF relations LOTS1AX and LOTS1AY, shown in Figure 15.13(a). This
decomposition loses the functional dependency FD2 because its attributes no longer coexist in the same relation
after decomposition.
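
For reference (the LOTS1A figures are not reproduced here), the example assumed above is the standard textbook one: LOTS1A(Property_id#, County_name, Lot#, Area), with FD5: Area -> County_name and FD2: {County_name, Lot#} -> {Property_id#, Area}. Under that assumption, the BCNF decomposition of Figure 15.13(a) is along these lines:

LOTS1AX(Property_id#, Area, Lot#)
LOTS1AY(Area, County_name)

Note that the lost dependency FD2 can no longer be checked within a single relation after this decomposition.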

Multivalued Dependencies:

If we have a nontrivial MVD in a relation, we may have to repeat values redundantly in the tuples. In the
EMP relation of Figure 15.15(a), the values ‘X’ and ‘Y’ of Pname are repeated with each value of Dname (or, by
symmetry, the values ‘John’ and ‘Anna’ of Dname are repeated with each value of Pname). This redundancy is
clearly undesirable. However, the EMP schema is in BCNF because no functional dependencies hold in EMP.
Therefore, we need to define a fourth normal form that is stronger than BCNF and disallows relation schemas such
as EMP. Notice that relations containing nontrivial MVDs tend to be all-key relations—that is, their key is all their
attributes taken together.


Fourth Normal Form:


We now present the definition of fourth normal form (4NF), which is violated when a relation has
undesirable multivalued dependencies, and which can hence be used to identify and decompose such relations.
A relation R is in fourth normal form (4NF) if and only if the following conditions are satisfied simultaneously:
• R is already in 3NF or BCNF.
• R contains no nontrivial multivalued dependencies (an example decomposition is sketched below).
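
As an illustration (Figure 15.15 is not reproduced here), the EMP relation discussed above can be assumed to have the schema EMP(Ename, Pname, Dname) with the nontrivial MVDs Ename ->> Pname and Ename ->> Dname. Decomposing it as

EMP_PROJECTS(Ename, Pname)
EMP_DEPENDENTS(Ename, Dname)

removes the nontrivial MVDs, and both resulting relations are in 4NF.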


Fifth Normal Form:


A relation is said to be in 5NF, if and only if,

• It is in 4NF.
• If we can decompose the table further to eliminate redundancy and anomalies, then when we re-join the
decomposed tables by means of candidate keys, we should not lose the original data, nor should any new records
arise. In simple words, joining two or more decomposed tables should neither lose records nor create new records.
• There should be no join dependency in the relation (other than those implied by the candidate keys).


Figure 15.15

DECOMPOSITION:

Decomposition is a tool that allows us to eliminate redundancy. However, it is important to check that
decomposition does not introduce new problems. In particular, we should check whether decomposition allows us to
recover the original relation, and whether it allows us to check integrity constraints efficiently.

Properties Of Decompositions:

1. Lossless-Join Decomposition:
Let R be a relation schema and let F be a set of FDs over R. A decomposition of R into two schemas with
attribute sets X and Y is said to be a lossless-join decomposition with respect to F if, for every instance r of R that
satisfies the dependencies in F, ∏x (r) ⋈ ∏y (r) = r. In other words, we can recover the original relation from the
decomposed relations.

From the definition it is easy to see that r is always a subset of the natural join of the decomposed relations. If we
take projections of a relation and recombine them using natural join, we may obtain some tuples that were not in the
original relation; a lossless-join decomposition is one for which this never happens.
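
The lossless-join property for the Hourly_Emps decomposition can be illustrated with a small sketch. The Python fragment below is only an illustration (the helper functions and the sample tuples are assumptions, not Figure 19.1 itself): it projects an instance onto the two decomposed schemas and checks that the natural join recovers exactly the original tuples, which works here because rating -> hourly_wages holds.

def project(rows, attrs):
    # pi_attrs(r): keep only the named attributes, eliminating duplicates.
    seen = []
    for t in rows:
        p = {a: t[a] for a in attrs}
        if p not in seen:
            seen.append(p)
    return seen

def natural_join(r1, r2):
    # Combine tuples of r1 and r2 that agree on all common attributes.
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [
        {**t1, **t2}
        for t1 in r1
        for t2 in r2
        if all(t1[a] == t2[a] for a in common)
    ]

# Hourly_Emps-style instance (values are illustrative only).
hourly_emps = [
    {"ssn": "111", "rating": 8, "hourly_wages": 10},
    {"ssn": "222", "rating": 8, "hourly_wages": 10},
    {"ssn": "333", "rating": 5, "hourly_wages": 7},
]

emps  = project(hourly_emps, ["ssn", "rating"])
wages = project(hourly_emps, ["rating", "hourly_wages"])

# Because rating -> hourly_wages holds, the join recovers exactly the original.
recovered = natural_join(emps, wages)
print(sorted(recovered, key=lambda t: t["ssn"]) == hourly_emps)   # True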


2. Dependency-Preserving Decomposition:
The decomposition of a relation schema R with FDs F into schemas with attribute sets X and Y is
dependency-preserving if (Fx U Fy)+ = F+, where Fx denotes the projection of F onto X, that is, the FDs in F+ that
involve only attributes in X (and similarly for Fy). That is, if we take the dependencies in Fx and Fy and compute the
closure of their union, we get back all dependencies in the closure of F. Therefore, we need to enforce only the
dependencies in Fx and Fy; all FDs in F+ are then sure to be satisfied. To enforce Fx, we need to examine only
relation X (on inserts to that relation). To enforce Fy, we need to examine only relation Y.

DENORMALIZATION:
It’s important to remember that the optimal relational database implementation requires that all tables be at least in
third normal form (3NF). A good relational DBMS excels at managing normalized relations; that is, relations void of
any unnecessary redundancies that might cause data anomalies. Although the creation of normalized relations is an
important database design goal, it is only one of many such goals. Good database design also considers processing
(or reporting) requirements and processing speed. The problem with normalization is that as tables are decomposed
to conform to normalization requirements, the number of database tables expands. Therefore, in order to generate
information, data must be put together from various tables. Joining a large number of tables takes additional
input/output (I/O) operations and processing logic, thereby reducing system speed. Most relational database systems
are able to handle joins very efficiently. However, rare and occasional circumstances may allow some degree of
denormalization so processing speed can be increased.

Keep in mind that the advantage of higher processing speed must be carefully weighed against the disadvantage of
data anomalies. On the other hand, some anomalies are of only theoretical interest. For example, should people in a
real-world database environment worry that a ZIP_CODE determines CITY in a CUSTOMER table whose primary
key is the customer number? Is it really practical to produce a separate table for


ZIP (ZIP_CODE, CITY)


to eliminate a transitive dependency from the CUSTOMER table? (Perhaps your answer to that question changes if
you are in the business of producing mailing lists.) As explained earlier, the problem with denormalized relations
and redundant data is that the data integrity could be compromised due to the possibility of data anomalies (insert,
update, and deletion anomalies.) The advice is simple: use common sense during the normalization process.
Furthermore, the database design process could, in some cases, introduce some small degree of redundant data in the
model (as seen in the previous example). This, in effect, creates “denormalized” relations.

A more comprehensive example of the need for denormalization due to reporting requirements is the case of a
faculty evaluation report in which each row lists the scores obtained during the last four semesters taught. See
Figure 5.17.

Although this report seems simple enough, the problem arises from the fact that the data are stored in a normalized
table in which each row represents a different score for a given faculty in a given semester. See Figure 5.18.


The difficulty of transposing multirow data to multicolumnar data is compounded by the fact that the last four
semesters taught are not necessarily the same for all faculty members (some might have taken sabbaticals, some
might have had research appointments, some might be new faculty with only two semesters on the job, etc.) To
generate this report, the two tables you see in Figure 5.18 were used. The EVALDATA table is the master data table
containing the evaluation scores for each faculty member for each semester taught; this table is normalized. The
FACHIST table contains the last four data points—that is, evaluation score and semester—for each faculty member.
The FACHIST table is a temporary denormalized table created from the EVALDATA table via a series of queries.
(The FACHIST table is the basis for the report shown in Figure 5.17.)
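
The transposition from multirow to multicolumnar data described above can be sketched as follows. The Python snippet is only an illustration: the EVALDATA/FACHIST names follow the text, but the faculty names, semesters, and scores are invented, and the real temporary table would be built with a series of queries rather than application code.

from collections import defaultdict

# EVALDATA-style rows: one row per (faculty, semester) evaluation score.
evaldata = [
    {"faculty": "Adams", "semester": "2019-1", "score": 3.8},
    {"faculty": "Adams", "semester": "2019-2", "score": 4.1},
    {"faculty": "Adams", "semester": "2020-1", "score": 3.9},
    {"faculty": "Adams", "semester": "2020-2", "score": 4.3},
    {"faculty": "Adams", "semester": "2021-1", "score": 4.0},
    {"faculty": "Baker", "semester": "2020-2", "score": 3.5},
    {"faculty": "Baker", "semester": "2021-1", "score": 3.7},
]

# Build temporary, denormalized FACHIST-style rows: the last four
# (semester, score) pairs per faculty member, transposed into columns.
by_faculty = defaultdict(list)
for row in sorted(evaldata, key=lambda r: r["semester"]):
    by_faculty[row["faculty"]].append((row["semester"], row["score"]))

fachist = []
for faculty, history in by_faculty.items():
    record = {"faculty": faculty}
    # Not every faculty member has four semesters; missing slots stay None.
    last_four = history[-4:]
    for i in range(4):
        sem, score = last_four[i] if i < len(last_four) else (None, None)
        record[f"sem{i + 1}"], record[f"score{i + 1}"] = sem, score
    fachist.append(record)

for rec in fachist:
    print(rec)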

As seen in the faculty evaluation report, the conflicts between design efficiency, information requirements, and
performance are often resolved through compromises that may include denormalization. In this case and assuming
there is enough storage space, the designer’s choices could be narrowed down to:

- Store the data in a permanent denormalized table. This is not the recommended solution, because the denormalized
table is subject to data anomalies (insert, update, and delete.) This solution is viable only if performance is an issue.

-Create a temporary denormalized table from the permanent normalized table(s). Because the denormalized table
exists only as long as it takes to generate the report, it disappears after the report is produced. Therefore, there are no
data anomaly problems. This solution is practical only if performance is not an issue and there are no other viable
processing options.
