
Normalization and its Types

Normalization is a database design technique that organizes tables in a manner that reduces
redundancy and dependency of data. It divides larger tables into smaller tables and links them
using relationships.

Normalization is a process of organizing the data in a database to avoid data redundancy and the
insertion, update and deletion anomalies. An anomaly is something that deviates from normal
behavior.

Normalization is a method to remove all these anomalies and bring the database to a
consistent state.

Anomalies in DBMS

There are three types of anomalies that occur when the database is not normalized. These are –
Insertion, update and deletion anomaly. Let’s take an example to understand this.

Example:

Employee table

emp_id emp_name emp_address emp_dept

101 Rick Delhi D001

101 Rick Delhi D002

123 Maggie Agra D890

166 Glenn Chennai D900

166 Glenn Chennai D004

The above table is not normalized. We will see the problems that we face when a table is not
normalized.

Update anomaly: In the above table we have two rows for employee Rick, as he belongs to two
departments of the company. If we want to update Rick's address, we have to update it in both
rows, or the data will become inconsistent. If the correct address is updated in one row but not
in the other, then as per the database Rick would have two different addresses.

Insert anomaly: Suppose a new employee joins the company who is under training and not
currently assigned to any department. We would not be able to insert the data into the table if
the emp_dept field doesn't allow nulls.

Delete anomaly: Suppose at some point the company closes department D890. Deleting the rows
that have emp_dept as D890 would also delete the information of employee Maggie, since she is
assigned only to this department.

To overcome these anomalies we need to normalize the data.
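
The update and delete anomalies described above can be reproduced in a few lines of Python. This is only a minimal sketch, assuming the Employee table is held as a list of dictionaries; the helper name addresses_by_employee and the new address "Mumbai" are illustrative assumptions and not part of the original example.

```python
from collections import defaultdict

# Employee table from the example above (not normalized).
employee = [
    {"emp_id": 101, "emp_name": "Rick", "emp_address": "Delhi", "emp_dept": "D001"},
    {"emp_id": 101, "emp_name": "Rick", "emp_address": "Delhi", "emp_dept": "D002"},
    {"emp_id": 123, "emp_name": "Maggie", "emp_address": "Agra", "emp_dept": "D890"},
    {"emp_id": 166, "emp_name": "Glenn", "emp_address": "Chennai", "emp_dept": "D900"},
    {"emp_id": 166, "emp_name": "Glenn", "emp_address": "Chennai", "emp_dept": "D004"},
]


def addresses_by_employee(rows):
    """Map each emp_id to the set of addresses stored for it (hypothetical helper)."""
    found = defaultdict(set)
    for row in rows:
        found[row["emp_id"]].add(row["emp_address"])
    return dict(found)


# Update anomaly: change Rick's address in only one of his two rows.
employee[0]["emp_address"] = "Mumbai"  # hypothetical new address
inconsistent = {
    emp_id: addrs
    for emp_id, addrs in addresses_by_employee(employee).items()
    if len(addrs) > 1
}
print(inconsistent)  # emp_id 101 now maps to two different addresses

# Delete anomaly: closing department D890 removes every trace of Maggie.
remaining = [row for row in employee if row["emp_dept"] != "D890"]
print(any(row["emp_name"] == "Maggie" for row in remaining))
```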

Types of Normal Forms

Here are the most commonly used normal forms:


 First normal form(1NF)
 Second normal form(2NF)
 Third normal form(3NF)
 Boyce & Codd normal form (BCNF)
 Fourth normal form(4NF)
 Fifth normal form(5NF)

Normal Form    Description

1NF    A relation is in 1NF if every attribute contains only atomic values.

2NF    A relation is in 2NF if it is in 1NF and does not contain any partial dependency.

3NF    A relation is in 3NF if it is in 2NF and no transitive dependency exists.

4NF    A relation is in 4NF if it is in Boyce-Codd normal form and has no multi-valued
       dependency.

5NF    A relation is in 5NF if it is in 4NF, does not contain any join dependency, and
       joining is lossless.

Decomposition using Functional Dependencies:

First Normal Form (1NF)

First Normal Form requires that all the attributes in a relation have atomic domains. The
values in an atomic domain are indivisible units, i.e., each attribute must contain only a
single value.

We re-arrange the relation (table) as below, to convert it to First Normal Form.

Second Normal Form (2NF)

o In the 2NF, the relation must be in 1NF.

o In the second normal form, the relation must not contain any partial dependency.

Before we learn about the second normal form, we need to understand the following −

 Prime attribute − An attribute which is a part of a candidate key is known as a prime
attribute.
 Non-prime attribute − An attribute which is not a part of any candidate key is said to be a
non-prime attribute.

Example:

TEACHER table : (one teacher takes multiple subjects)

TEACHER_ID SUBJECT TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38

In the given table, the candidate key is { TEACHER_ID, SUBJECT }.

We cannot take TEACHER_ID alone as a primary key, as it contains non-unique values (e.g.,
TEACHER_ID 25 and 83 each appear in two rows).

In the given table, the non-prime attribute TEACHER_AGE depends on TEACHER_ID alone, which is a
proper subset of the candidate key. TEACHER_AGE is therefore partially dependent on the candidate
key, which violates the rule for 2NF.

To convert the given table into 2NF, we decompose it into two tables:

TEACHER_DETAIL table:

TEACHER_ID TEACHER_AGE

25 30

47 35

83 38

TEACHER_SUBJECT table:

TEACHER_ID SUBJECT

25 Chemistry

25 Biology

47 English

83 Math

83 Computer
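
The decomposition above can be checked mechanically. The following is a minimal sketch, assuming the TEACHER table is stored as a list of tuples; the helper name project is an illustrative assumption. It projects the table onto the two smaller schemas and joins them back on TEACHER_ID to confirm that no row is lost or invented.

```python
# TEACHER table from the example: (TEACHER_ID, SUBJECT, TEACHER_AGE)
teacher = [
    (25, "Chemistry", 30),
    (25, "Biology", 30),
    (47, "English", 35),
    (83, "Math", 38),
    (83, "Computer", 38),
]


def project(rows, indexes):
    """Relational projection: keep the given columns and drop duplicate rows."""
    seen = []
    for row in rows:
        picked = tuple(row[i] for i in indexes)
        if picked not in seen:
            seen.append(picked)
    return seen


teacher_detail = project(teacher, [0, 2])   # (TEACHER_ID, TEACHER_AGE)
teacher_subject = project(teacher, [0, 1])  # (TEACHER_ID, SUBJECT)

# Join the two tables back together on TEACHER_ID.
rejoined = [
    (tid, subject, age)
    for (tid, subject) in teacher_subject
    for (tid2, age) in teacher_detail
    if tid == tid2
]

print(teacher_detail)                      # each age is now stored once per teacher
print(sorted(rejoined) == sorted(teacher))
```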

Second Normal Form in detail:

For a table to be in the Second Normal Form, it must satisfy two conditions:

1. The table should be in the First Normal Form.


2. There should be no Partial Dependency.

What is Partial Dependency? First let's understand what is Dependency in a table?

What is Dependency?
Let's take an example of a Student table with
columns student_id, name, reg_no, branch and address.

student_id name reg_no branch address

In this table, student_id is the primary key and will be unique for every row, hence we can
use student_id to fetch any row of data from this table

Even for a case, where student names are same, if we know the student_id we can easily fetch the
correct record.

student_id name reg_no branch Address

10 Akon 07-WY CSE Kerala

11 Akon 08-WY IT Gujarat

Hence we can say that a Primary Key for a table is a single column or a group of
columns (a composite key) which can uniquely identify each record in the table.

I can ask for the branch name of the student with student_id 10, and I can get it. Similarly, if I ask for
the name of the student with student_id 10 or 11, I will get it. So all I need is student_id, and every other
column depends on it, or can be fetched using it.

This is Dependency and we also call it Functional Dependency.

What is Partial Dependency?


Now that we know what dependency is, we are in a better state to understand what partial
dependency is.
For a simple table like Student, a single column like student_id can uniquely identify all the
records in a table.

But this is not true all the time. So now let's extend our example to see if more than 1 column
together can act as a primary key.

Let's create another table for Subject, which will have subject_id and subject_name fields
and subject_id will be the primary key.

subject_id subject_name

1 Java

2 C++

3 Php

Now we have a Student table with student information and another table Subject for storing
subject information.

Let's create another table Score, to store the marks obtained by students in the respective
subjects. We will also be saving name of the teacher who teaches that subject along with marks.

score_id student_id subject_id marks Teacher

1 10 1 70 Java Teacher

2 10 2 75 C++ Teacher

3 11 1 80 Java Teacher

In the Score table we save the student_id to know which student the marks belong to,
and the subject_id to know which subject the marks are for.

Together, student_id + subject_id forms a Candidate Key for this table, which can be
the Primary key.

Confused about how this combination can be a primary key?

If I ask you to get me the marks of the student with student_id 10, can you get them from this table? No,
because you don't know for which subject. And if I give you only a subject_id, you would not know for
which student. Hence we need student_id + subject_id to uniquely identify any row.

But where is Partial Dependency?

Now if you look at the Score table, we have a column named teacher which depends only on
the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
As we just discussed, the primary key for this table is a composition of two columns,
student_id and subject_id, but the teacher's name depends only on the subject, hence on
the subject_id, and has nothing to do with student_id.

This is Partial Dependency, where an attribute in a table depends on only a part of the primary
key and not on the whole key.

How to remove Partial Dependency?

There can be many different solutions for this, but our objective is to remove teacher's name from
Score table.

The simplest solution is to remove columns teacher from Score table and add it to the Subject
table. Hence, the Subject table will become:

subject_id subject_name Teacher

1 Java Java Teacher

2 C++ C++ Teacher

3 Php Php Teacher

And our Score table is now in the second normal form, with no partial dependency.

score_id student_id subject_id Marks

1 10 1 70

2 10 2 75

3 11 1 80

Quick Recap on 2NF:

1. For a table to be in the Second Normal form, it should be in the First Normal form and it
should not have Partial Dependency.
2. Partial Dependency exists, when for a composite primary key, any attribute in the table
depends only on a part of the primary key and not on the complete primary key.
3. To remove Partial dependency, we can divide the table, remove the attribute which is
causing partial dependency, and move it to some other table where it fits in well.
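
The recap can be turned into a small check. This is a minimal sketch, assuming the functional dependencies of the Score table are written down by hand as (left side, right side) pairs; the function name partial_dependencies is an illustrative assumption. It only inspects the dependencies as listed and does not derive implied ones.

```python
def partial_dependencies(primary_key, fds):
    """Return the listed FDs whose left side is a proper subset of the
    composite primary key and whose right side has a non-key attribute."""
    key = frozenset(primary_key)
    found = []
    for lhs, rhs in fds:
        lhs, rhs = frozenset(lhs), frozenset(rhs)
        non_key = rhs - key
        if lhs < key and non_key:  # "<" means proper subset for frozensets
            found.append((sorted(lhs), sorted(non_key)))
    return found


# Score table before the fix: key is student_id + subject_id.
score_key = {"student_id", "subject_id"}
score_fds = [
    ({"student_id", "subject_id"}, {"marks"}),
    ({"subject_id"}, {"teacher"}),  # teacher depends on the subject only
]

print(partial_dependencies(score_key, score_fds))

# After moving teacher to the Subject table, nothing is reported.
print(partial_dependencies(score_key, score_fds[:1]))
```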

Third Normal Form

A database is in third normal form if it satisfies the following conditions:

 It is in second normal form


 There is no transitive functional dependency

By transitive functional dependency, we mean we have the following relationships in the table:
A is functionally dependent on B, and B is functionally dependent on C. In this case, A is
transitively dependent on C via B.

score table:
score_id student_id subject_id marks exam_name total_marks

What is Transitive Dependency?

With exam_name and total_marks added to our Score table, it saves more data now. Primary key
for our Score table is a composite key, which means it's made up of two attributes or columns
→ student_id + subject_id.

Our new column exam_name depends on both student and subject. So we can say
that exam_name is dependent on both student_id and subject_id.

The column total_marks depends on exam_name as with exam type the total score changes. For
example, practicals are of less marks while theory exams are of more marks.

But exam_name is just another column in the score table. It is not a primary key or even a part of
the primary key, yet total_marks depends on it.

This is Transitive Dependency: a non-prime attribute depends on other non-prime
attributes rather than depending upon the primary key.

How to remove Transitive Dependency?

Take out the columns exam_name and total_marks from Score table and put them in
an Exam table and use the exam_id (which is a primary key) wherever required.

Score Table: In 3rd Normal Form


score_id student_id subject_id marks exam_id

The new Exam table


exam_id exam_name total_marks

1 Workshop 200

2 Mains 70

3 Practicals 30

Advantage of removing Transitive Dependency


The advantage of removing transitive dependency is,
 Amount of data duplication is reduced.
 Data integrity achieved.

3NF made simple:

For a relation to be in Third Normal Form, it must be in Second Normal Form and the following
must be satisfied −
 No non-prime attribute is transitively dependent on the key.

We find that in the above Student_detail relation, Stu_ID is the key and the only prime attribute.
We find that City can be identified by Stu_ID as well as by Zip. City is a non-prime attribute which
depends on (can be identified using) another non-prime attribute, Zip.

Additionally,

Stu_ID → Zip
Zip → City

therefore, Stu_ID → Zip → City, so there exists transitive dependency.

To bring this relation into third normal form, we break the relation into two relations,
Student_Detail (Stu_ID, Stu_Name, Zip) and ZipCodes (Zip, City).

Boyce-Codd Normal Form (BCNF)

Even when a database is in 3rd Normal Form, anomalies can still result if it has more
than one Candidate Key.

A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF.

Boyce-Codd Normal Form (BCNF) is a stricter extension of Third Normal Form. BCNF
states that

 For any non-trivial dependency A → B, A should be a super key.

ie., (Prime attribute--> Non-Prime attribute)

In the above, Stu_ID is the super-key in the relation Student_Detail and Zip is the super-key in the
relation ZipCodes. So,

Stu_ID → Stu_Name, Zip

and

Zip → City

which confirms that both the relations are in BCNF.

BCNF in detail:
college enrolment table:
student_id subject professor

101 Java Adam

101 C++ Martin

102 Java Allin

103 C# Joe

104 Java Jack

In the table above:


 One student can enrol for multiple subjects.
 One professor can teach one subject.
 One subject is taken by multiple professors

In the table above, {student_id, subject} together form the primary key.

But there is also a dependency between subject and professor, where the subject depends on
the professor, since one professor teaches only one subject.

This table is therefore not in Boyce-Codd Normal Form, as a non-prime attribute determines a
prime attribute.

Why this table is not in BCNF?

There is a dependency, professor → subject. (ie. non-prime attribute → prime attribute)

which is not allowed by BCNF.
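
The BCNF rule (the left side of every non-trivial dependency must be a super key) can be tested with attribute closure. The following is a minimal sketch, assuming the dependencies of the enrolment table are listed by hand; the function names closure and bcnf_violations are illustrative assumptions.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under the listed FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Listed non-trivial FDs whose left side is not a super key."""
    bad = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        if not trivial and closure(lhs, fds) != set(all_attrs):
            bad.append((lhs, rhs))
    return bad


attrs = {"student_id", "subject", "professor"}
fds = [
    (("student_id", "subject"), ("professor",)),
    (("professor",), ("subject",)),
]
violations = bcnf_violations(attrs, fds)
print(violations)  # professor -> subject: professor is not a super key

# A table holding only (professor, subject) has professor as its key.
print(bcnf_violations({"professor", "subject"}, [(("professor",), ("subject",))]))
```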

How to satisfy BCNF?

To make this relation satisfy BCNF, we will decompose this table into two tables, student table
and professor table.

Student Table
student_id p_id

101 1

101 2

102 3

103 4

104 5

And, Professor Table


p_id professor subject

1 Adam Java

2 Martin C++

3 Allin Java

4 Joe C#

5 Jack Java

Decomposition Using Multivalued Dependencies:

4NF (Fourth Normal Form) Rules

If no database table instance contains two or more independent, multivalued facts
describing the relevant entity, then the table is in 4th Normal Form.

Fourth Normal Form (4NF)


A table is said to be in the Fourth Normal Form when,
1. It is in the Boyce-Codd Normal Form.
2. And, it doesn't have Multi-Valued Dependency.

What is Multi-valued Dependency?

A table is said to have multi-valued dependency if the following conditions are true:

1. For a dependency A → B, if for a single value of A multiple values of B exist, then the
table may have multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
3. And, for a relation R(A,B,C), if there is a multi-valued dependency between A and B,
then B and C should be independent of each other.

If all these conditions are true for any relation (table), it is said to have multi-valued dependency.

college enrolment table:

s_id Course hobby

1 Science Cricket

1 Maths Hockey

2 C# Cricket

2 Php Hockey

The student with s_id 1 has opted for two courses, Science and Maths, and has two
hobbies, Cricket and Hockey.

You must be thinking what problem this can lead to, right?

Well, the two records for the student with s_id 1 will give rise to two more records, as shown below,
because the student has two hobbies, and each hobby must be recorded along with each of the
courses.

s_id Course hobby

1 Science Cricket

1 Maths Hockey

1 Science Hockey

1 Maths Cricket
And, in the table above, there is no relationship between the columns course and hobby. They
are independent of each other.

So there is multi-value dependency, which leads to un-necessary repetition of data and other
anomalies as well.

How to satisfy 4th Normal Form?

To make the above relation satisfy the 4th normal form, we can decompose the table into 2 tables.

CourseOpted Table
s_id course

1 Science

1 Maths
2 C#

2 Php

And, Hobbies Table


s_id hobby

1 Cricket

1 Hockey

2 Cricket

2 Hockey

Now this relation satisfies the fourth normal form.
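
To see why the two separate tables are preferable, the sketch below joins them back on s_id and counts the rows a single combined table would need. It is only an illustration, assuming both tables are lists of tuples; the function name join_on_s_id and the extra hobby "Chess" are assumptions made for the example.

```python
course_opted = [(1, "Science"), (1, "Maths"), (2, "C#"), (2, "Php")]
hobbies = [(1, "Cricket"), (1, "Hockey"), (2, "Cricket"), (2, "Hockey")]


def join_on_s_id(courses, hobby_rows):
    """Natural join of the two tables on s_id."""
    return [
        (s, course, hobby)
        for (s, course) in courses
        for (s2, hobby) in hobby_rows
        if s == s2
    ]


combined = join_on_s_id(course_opted, hobbies)
print(len(combined))  # every course is paired with every hobby of the same student

# Hypothetical new hobby for student 1: one new row in the Hobbies table,
# but one new row per course in the single combined table.
more_hobbies = hobbies + [(1, "Chess")]
print(len(join_on_s_id(course_opted, more_hobbies)) - len(combined))
```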

A table can also have functional dependency along with multi-valued dependency. In that case,
the functionally dependent columns are moved in a separate table and the multi-valued dependent
columns are moved to separate tables.

5NF (Fifth Normal Form) Rules

A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of
smaller tables without loss of data.

Functional Dependency (FD) Theory:

 Functional Dependency is a relationship that exists between multiple attributes of a
relation.
 This concept is given by E. F. Codd.
 It is a type of constraint existing between various attributes of a relation.
 It is used to define various normal forms.

A          It is a determinant set.
B          It is a dependent attribute.
{A → B}    A functionally determines B.
           B is functionally dependent on A.

Advantages of Functional Dependency:

 Functional Dependency avoids data redundancy where same data should not be repeated
at multiple locations in same database.
 It helps in identifying bad designs.

For example:

Suppose we have a student table with attributes: (Stu_Id, Stu_Name, Stu_Age)

Here Stu_Id attribute uniquely identifies the Stu_Name attribute of student table because if we
know the student id we can tell the student name associated with it. This is known as functional
dependency and can be written as Stu_Id->Stu_Name (or) in words we can say Stu_Name is
functionally dependent on Stu_Id.

Formally:

If column A of a table uniquely identifies column B of the same table, then this can be represented as
A->B (Attribute B is functionally dependent on attribute A)
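
The definition can be tested against actual rows. This is a minimal sketch, assuming the table is a list of dictionaries; the function name fd_holds and the sample rows are hypothetical. Note that data can only show that a dependency is violated; rows that happen to agree do not prove the dependency holds in general.

```python
def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on the lhs columns but differ on the rhs columns."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left-hand value
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical sample rows for the student table (Stu_Id, Stu_Name, Stu_Age).
students = [
    {"Stu_Id": 1, "Stu_Name": "Akon", "Stu_Age": 20},
    {"Stu_Id": 2, "Stu_Name": "Akon", "Stu_Age": 21},
    {"Stu_Id": 3, "Stu_Name": "Bkon", "Stu_Age": 20},
]

print(fd_holds(students, ["Stu_Id"], ["Stu_Name"]))  # Stu_Id -> Stu_Name
print(fd_holds(students, ["Stu_Name"], ["Stu_Id"]))  # two students share a name
```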

Types of Functional Dependencies

1. Trivial functional dependency


2. Non-trivial functional dependency
3. Multivalued dependency
4. Transitive dependency

Trivial Functional Dependency:

Trivial − If A holds B {A → B}, where B is a subset of A, then it is called a Trivial Functional
Dependency. A trivial functional dependency always holds.
[A trivial dependency is of little importance, as its result is already contained in what was
given. For example, somebody goes to the accounts section and gives their reg. no. and name,
and the accountant replies with "your name, your fee detail".]

Non-Trivial − If A holds B {A → B}, where B is not a subset of A, then it is called a Non-Trivial
Functional Dependency.
[A non-trivial dependency is of much importance, as its result is not just a subset of what was
given. For example, somebody goes to the accounts section and gives their reg. no. and name,
and the accountant replies with only the "your fee" detail.]

Completely Non-Trivial − If A holds B {A → B}, where A intersect B = ∅, it is called a Completely
Non-Trivial Functional Dependency.

1. Trivial functional dependency:


If a functional dependency X->Y holds true where Y is a subset of X then this
dependency is called trivial Functional dependency.

Symbolically:

A ->B is trivial functional dependency if B is a subset of A.

The following dependencies are also trivial: A->A & B->B

Example:

{Student_Id, Student_Name} -> Student_Id is a trivial functional dependency as
Student_Id is a subset of {Student_Id, Student_Name}.

Also, Student_Id -> Student_Id & Student_Name -> Student_Name are trivial
dependencies too.

2. Non-trivial functional dependency

If a functional dependency X->Y holds true where Y is not a subset of X then this
dependency is called non trivial Functional dependency.

For example:

An employee table with three attributes: (emp_id, emp_name, emp_address)

The following functional dependencies are non-trivial:

emp_id -> emp_name (emp_name is not a subset of emp_id)

emp_id -> emp_address (emp_address is not a subset of emp_id)

Completely non trivial FD:

If a FD X->Y holds true where X intersection Y is null then this dependency is said to be
completely non trivial function dependency.

Eg:

Stu-id Name
1 X
2 Y
3 Z

stu-id ∩ name = ∅
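
The three cases can be told apart with plain set operations. The following is a minimal sketch; the function name classify_fd is an illustrative assumption. A completely non-trivial dependency is a special case of a non-trivial one, so the function reports the most specific label that applies.

```python
def classify_fd(lhs, rhs):
    """Classify X -> Y by comparing the attribute sets X and Y."""
    lhs, rhs = set(lhs), set(rhs)
    if rhs <= lhs:
        return "trivial"                 # Y is a subset of X
    if lhs & rhs:
        return "non-trivial"             # Y is not a subset of X, but they overlap
    return "completely non-trivial"      # X and Y share no attribute


print(classify_fd({"Student_Id", "Student_Name"}, {"Student_Id"}))
print(classify_fd({"emp_id"}, {"emp_name"}))
# Hypothetical attribute names A, B, C used to show the overlapping case.
print(classify_fd({"A", "B"}, {"B", "C"}))
```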

3. Multivalued dependency:

Multivalued dependency occurs when there are more than one independent multivalued
attributes in a table.

For example: Consider a bike manufacturing company which produces two colors (Black
and Red) of each model every year.

bike_model manuf_year color

M1001 2007 Black

M1001 2007 Red

M2012 2008 Black

M2012 2008 Red

M2222 2009 Black

M2222 2009 Red

Here columns manuf_year and color are independent of each other but both dependent on
bike_model. In this case these two columns are said to be multivalued dependent on
bike_model.

These dependencies can be represented like this:


bike_model ->> manuf_year
bike_model ->> color

The symbol "->> " denotes multi-valued dependency
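
A multi-valued dependency X ->> Y can also be checked on a concrete table: whenever two rows agree on X, the row that combines the Y value of the first with the remaining columns of the second must be present too. The sketch below assumes the bike table is a list of tuples; the function name mvd_holds and the extra 2008 row are hypothetical additions for illustration.

```python
def mvd_holds(rows, columns, lhs, rhs):
    """Check X ->> Y on the given rows (X = lhs columns, Y = rhs columns)."""
    rest = [c for c in columns if c not in lhs and c not in rhs]

    def pick(row, cols):
        return tuple(row[columns.index(c)] for c in cols)

    present = {(pick(r, lhs), pick(r, rhs), pick(r, rest)) for r in rows}
    # For every pair of rows agreeing on X, the "swapped" row must exist.
    return all(
        (x1, y1, z2) in present
        for (x1, y1, _) in present
        for (x2, _, z2) in present
        if x1 == x2
    )


columns = ["bike_model", "manuf_year", "color"]
bikes = [
    ("M1001", 2007, "Black"), ("M1001", 2007, "Red"),
    ("M2012", 2008, "Black"), ("M2012", 2008, "Red"),
    ("M2222", 2009, "Black"), ("M2222", 2009, "Red"),
]
print(mvd_holds(bikes, columns, ["bike_model"], ["color"]))

# Hypothetical extra row: M1001 in 2008 is recorded in Black only,
# so year and color are no longer independent for that model.
incomplete = bikes + [("M1001", 2008, "Black")]
print(mvd_holds(incomplete, columns, ["bike_model"], ["color"]))
```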

4. Transitive dependency:

A functional dependency is said to be transitive if it is indirectly formed by two functional
dependencies.

For example, X -> Z is a transitive dependency if the following three conditions hold true:
 X -> Y
 Y does not -> X
 Y -> Z
Note: A transitive dependency can only occur in a relation of three or more attributes. This
dependency helps us in normalizing the database to 3NF (3rd Normal Form).

Example:

Book Author Author_age

Game of Thrones George R. R. Martin 66

Harry Potter J. K. Rowling 49

Dying of the Light George R. R. Martin 66

{Book} ->{Author} (if we know the book, we know the author name)
{Author} does not ->{Book}
{Author} -> {Author_age}

Therefore as per the rule of transitive dependency: {Book} -> {Author_age} should
hold, that makes sense because if we know the book name we can know the author’s age.

Armstrong's Axioms

Introduction to Axioms Rules / Rule of Inference

 Armstrong's Axioms is a set of rules.


 It provides a simple technique for reasoning about functional dependencies.
 It was developed by William W. Armstrong in 1974.
 It is used to infer all the functional dependencies on a relational database.

Type of Axioms Rules


A. Primary Rules

Rule 1 Reflexivity
If A is a set of attributes and B is a subset of A, then A holds B. {A → B}
Rule 2 Augmentation
If A holds B and C is a set of attributes, then AC holds BC. {AC → BC}
It means that adding the same attributes to both sides does not change the basic dependency.
Rule 3 Transitivity
If A holds B and B holds C, then A holds C.
If {A → B} and {B → C}, then {A → C}

B. Secondary Rules

Rule 1 Union
If A holds B and A holds C, then A holds BC.
If {A → B} and {A → C}, then {A → BC}
Rule 2 Decomposition
If A holds BC, then A holds B and A holds C.
If {A → BC}, then {A → B} and {A → C}
Rule 3 Pseudo Transitivity
If A holds B and BC holds D, then AC holds D.
If {A → B} and {BC → D}, then {AC → D}

Example: Derive further dependencies using the axioms:

Consider relation E = (P, Q, R, S, T, U) having the following set of Functional Dependencies (FD):
P → Q
P → R
QR → S
Q → T
QR → U
PR → U

Derive the following dependencies from this set using the axioms:


1. P → T
2. PR → S
3. QR → SU
4. PR → SU

Solution:

1. P → T
In the above FD set, P → Q and Q → T
So, Using Transitive Rule: If {A → B} and {B → C}, then {A → C}
∴ If P → Q and Q → T, then P → T.
P→T

2. PR → S
In the above FD set, P → Q
As, QR → S
So, Using Pseudo Transitivity Rule: If{A → B} and {BC → D}, then {AC → D}
∴ If P → Q and QR → S, then PR → S.
PR → S

3. QR → SU
In above FD set, QR → S and QR → U
So, Using Union Rule: If{A → B} and {A → C}, then {A → BC}
∴ If QR → S and QR → U, then QR → SU.
QR → SU

4. PR → SU
From step 2, PR → S, and PR → U is in the above FD set.
So, Using Union Rule: If {A → B} and {A → C}, then {A → BC}
∴ If PR → S and PR → U, then PR → SU.
PR → SU
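
The four derivations can be cross-checked by computing attribute closures instead of applying the axioms by hand. This is a minimal sketch, assuming every attribute is a single letter so that a string such as "QR" stands for the set {Q, R}; the function names closure and follows are illustrative assumptions.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under the listed FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# FD set of relation E = (P, Q, R, S, T, U) from the example.
fds = [("P", "Q"), ("P", "R"), ("QR", "S"), ("Q", "T"), ("QR", "U"), ("PR", "U")]


def follows(lhs, rhs):
    """True if lhs -> rhs is implied by the FD set."""
    return set(rhs) <= closure(lhs, fds)


for lhs, rhs in [("P", "T"), ("PR", "S"), ("QR", "SU"), ("PR", "SU")]:
    print(lhs, "->", rhs, follows(lhs, rhs))

# A dependency that is not implied, for contrast: Q alone does not give S.
print("Q -> S", follows("Q", "S"))
```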

How to find functional dependencies for a relation?

A functional dependency exists in a relation where an attribute is dependent on another attribute of
the relation.

Example:
Consider the STUDENT relation given in Table 1.

 We know that STUD_NO is unique for each student. So STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE, STUD_NO->STUD_COUNTRY
and STUD_NO->STUD_AGE will all be true.
 Similarly, STUD_STATE->STUD_COUNTRY will be true as if two records have same
STUD_STATE, they will have same STUD_COUNTRY as well.
 For relation STUDENT_COURSE, COURSE_NO->COURSE_NAME will be true as two
records with same COURSE_NO will have same COURSE_NAME.

Functional Dependency Set:

Functional Dependency set or FD set of a relation is the set of all FDs present in the relation.

Example:

FD set for relation STUDENT shown in table 1 is:

{ STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE,
STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY,
STUD_NO -> STUD_AGE,
STUD_STATE->STUD_COUNTRY }

Attribute Closure:

Attribute closure of an attribute set can be defined as set of attributes which can be functionally
determined from it.

To find attribute closure of an attribute set:


 Add elements of attribute set to the result set.
 Recursively add elements to the result set which can be functionally determined from the
elements of the result set.

Example:

Using FD set of table 1, attribute closure can be determined as:

(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}

(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
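
The two steps for finding an attribute closure translate directly into a loop that keeps adding attributes until nothing changes. The following is a minimal sketch, assuming the FD set of the STUDENT relation is written as (left side, right side) tuples; the function name closure is an illustrative assumption.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the given functional dependencies."""
    result = set(attrs)            # step 1: start with the attribute set itself
    changed = True
    while changed:                 # step 2: repeat until no new attribute is added
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already determined, so is the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


student_fds = [
    (("STUD_NO",), ("STUD_NAME",)),
    (("STUD_NO",), ("STUD_PHONE",)),
    (("STUD_NO",), ("STUD_STATE",)),
    (("STUD_NO",), ("STUD_COUNTRY",)),
    (("STUD_NO",), ("STUD_AGE",)),
    (("STUD_STATE",), ("STUD_COUNTRY",)),
]

print(sorted(closure(("STUD_NO",), student_fds)))
print(sorted(closure(("STUD_STATE",), student_fds)))
```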

Example

Imagine the following list of FDs. We are going to calculate the closure of A from these
dependencies.

1. A → B
2. B → C
3. AB → D

The closure of A is built up as follows:

a) A → A (by Armstrong's reflexivity)
b) A → AB (as A → B)
c) A → ABD (as AB → D)
d) A → ABCD (as B → C)

The closure is therefore (A)+ = ABCD. By calculating the closure of A, we have validated that A
is a candidate key, as its closure contains every attribute in the relation.

Finding the candidate key, super key from the given FDs

R(A,B,C,D)

A→B
B→C
C→A

Solution:
Draw a dependency diagram with one node for each attribute (A, B, C, D) and one edge for each
functional dependency.

No edge is incident on D, so D is an essential attribute: it must be part of every key. Find the
closure of attribute D:

(D)+ = D

D alone is not a key, so combine each of the other attributes with D and find the closure of the
resulting attribute set:

(AD)+ = ABCD
(BD)+ = ABCD
(CD)+ = ABCD

Each of these combinations determines all the attributes of the relation, and none can be reduced
further, hence AD, BD and CD are candidate keys (i.e., minimal sets of attributes that determine
all the attributes of the relation).

If we could find all attributes with (ABD) combination, it is a super key, but not the candidate
key.
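
For a small relation, the candidate keys can also be found by brute force: try attribute sets from smallest to largest and keep those whose closure covers the whole relation. This is only a sketch, assuming single-letter attribute names; the function names closure and candidate_keys are illustrative assumptions, and the approach is exponential, so it is suitable for small examples only.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by attrs under the listed FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attrs, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    attrs = sorted(attrs)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key: a super key, but not minimal
            if closure(combo, fds) == set(attrs):
                keys.append(combo)
    return ["".join(k) for k in keys]


fds = [("A", "B"), ("B", "C"), ("C", "A")]
print(candidate_keys("ABCD", fds))
print(closure("ABD", fds) == set("ABCD"))  # ABD is a super key, yet not in the list
```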

Canonical or Minimal:

A set of functional dependencies cannot be reduced any further if it has the following properties:

1. The right-hand side of every functional dependency holds only one attribute.
2. The left-hand side of every functional dependency cannot be reduced; removing an attribute
from it would change the content of the set.
3. No functional dependency can be removed without changing the content of the set.

A set of functional dependencies with the above three properties is called Canonical or
Minimal.

Canonical cover

It is an irreducible form of FDs or canonical form (ie., minimal set of FDs)


Eg:

R(w,x,y,z)

x → z
wz → xy
y → wxz

A canonical cover has no redundancy (i.e., no redundant rule exists).

example for redundant rule:


A→C
B→D
C→D
D → AB

Here redundancy exists as B and C implies same attribute D (if removal of either of the rule will
not affect the FD, then it is said to be in canonical form)

Hence, remove redundancy or minimization of the above FDs is:

A→C
B→D
D → AB

Ex problem:

R(w,x,y,z)

x→z
wz → xy
y → wxz

i. Check right hand side for redundancy. Use the decomposition rule to find out the right hand
side redundancy

x→z
wz → x
wz → y
y→w
y→x
y→z

ii. Checking for redundant rule

First rule: x → z

Closure of x with all rules: (x)+ = xz
Closure of x with x → z hidden: (x)+ = x

The two closures differ, hence the rule is not redundant. So, we keep the rule x → z.

Second rule: wz → x

Closure of wz with all rules: (wz)+ = wzxy
Closure of wz with wz → x hidden: (wz)+ = wzyx

The two closures are the same, hence the rule is redundant. So, we remove the rule wz → x.

Third rule: wz → y

Closure of wz with the remaining rules: (wz)+ = wzyx
Closure of wz with wz → y hidden: (wz)+ = wz

Since wz → x has already been removed, no other rule applies to wz. The two closures differ,
hence the rule is not redundant. So, we keep the rule wz → y.

Fourth rule: y → w

Closure of y with the remaining rules: (y)+ = ywxz
Closure of y with y → w hidden: (y)+ = yxz

The two closures differ, hence the rule is not redundant. So, we keep the rule y → w.

Fifth rule: y → x

Closure of y with the remaining rules: (y)+ = yxwz
Closure of y with y → x hidden: (y)+ = ywz

The two closures differ, hence the rule is not redundant. So, we keep the rule y → x.

Sixth rule: y → z

Closure of y with the remaining rules: (y)+ = yzwx
Closure of y with y → z hidden: (y)+ = ywxz

The two closures are the same, hence the rule is redundant. So, we remove the rule y → z.

The final canonical form is:

x→z
wz → y
y→w
y→x

Applying the union rule to y → w and y → x gives a more compact form of the minimal set of FDs:

x→z
wz → y
y → wx
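
Steps i and ii of the worked problem can be automated. The sketch below assumes single-letter attribute names and processes the rules in the same order as above; the function names closure and remove_redundant are illustrative assumptions. It does not try to shrink left-hand sides, which is not needed for this example, and the result can depend on the order in which rules are examined.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under the listed FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def remove_redundant(fds):
    # Step i: split every right-hand side into single attributes.
    work = [(lhs, attr) for lhs, rhs in fds for attr in rhs]
    # Step ii: hide each rule in turn; drop it if the other rules still imply it.
    i = 0
    while i < len(work):
        lhs, attr = work[i]
        others = work[:i] + work[i + 1:]
        if attr in closure(lhs, others):
            work = others          # redundant: remove and re-check the same position
        else:
            i += 1                 # not redundant: keep and move on
    return work


fds = [("x", "z"), ("wz", "xy"), ("y", "wxz")]
print(remove_redundant(fds))
```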

Database Decomposition

What is decomposition?

 Decomposition is the process of breaking down in parts or elements.


 It replaces a relation with a collection of smaller relations.
 It breaks the table into multiple tables in a database.
 It should always be lossless, because it confirms that the information in the original
relation can be accurately reconstructed based on the decomposed relations.
 If there is no proper decomposition of the relation, then it may lead to problems like loss
of information.

Properties of Decomposition:

Following are the properties of Decomposition,

1. Lossless Decomposition
2. Dependency Preservation
3. Lack of Data Redundancy

1. Lossless Decomposition

 Decomposition must be lossless. It means that the information should not get lost from the
relation that is decomposed.
 It gives a guarantee that the join will result in the same relation as it was decomposed.

Example:

Let 'E' be the relational schema with instance 'e', decomposed into E1, E2, E3, . . . . En
with instances e1, e2, e3, . . . . en. If e1 ⋈ e2 ⋈ e3 . . . . ⋈ en = e, then it is called a 'Lossless Join
Decomposition'.
 In other words, if the natural join of all the decomposed relations gives back the
original relation, then it is said to be a lossless join decomposition.

Example: <Employee_Department> Table

Eid Ename Age City Salary Deptid DeptName


E001 ABC 29 Pune 20000 D001 Finance
E002 PQR 30 Pune 30000 D002 Production
E003 LMN 25 Mumbai 5000 D003 Sales
E004 XYZ 24 Mumbai 4000 D004 Marketing
E005 STU 32 Bangalore 25000 D005 Human Resource

 Decompose the above relation into two relations to check whether a decomposition is
lossless or lossy.
 Now, we have decomposed the relation that is Employee and Department.

Relation 1 : <Employee> Table

Eid Ename Age City Salary


E001 ABC 29 Pune 20000
E002 PQR 30 Pune 30000
E003 LMN 25 Mumbai 5000
E004 XYZ 24 Mumbai 4000
E005 STU 32 Bangalore 25000

 Employee Schema contains (Eid, Ename, Age, City, Salary).

Relation 2 : <Department> Table

Deptid Eid DeptName


D001 E001 Finance
D002 E002 Production
D003 E003 Sales
D004 E004 Marketing
D005 E005 Human Resource

 Department Schema contains (Deptid, Eid, DeptName).


 So, the above decomposition is a Lossless Join Decomposition, because the two relations
share the common field 'Eid', which is a key of the Employee relation, and therefore the join
reproduces the original rows.
 Now apply natural join on the decomposed relations.

Employee ⋈ Department

Eid Ename Age City Salary Deptid DeptName


E001 ABC 29 Pune 20000 D001 Finance
E002 PQR 30 Pune 30000 D002 Production
E003 LMN 25 Mumbai 5000 D003 Sales
E004 XYZ 24 Mumbai 4000 D004 Marketing
E005 STU 32 Bangalore 25000 D005 Human Resource

Hence, the decomposition is Lossless Join Decomposition.
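
The same check can be run in code: project the original table onto the two schemas, join the projections, and compare the result with the original. This is a minimal sketch, assuming rows are stored as dictionaries; the helper names project, natural_join and as_set are illustrative assumptions. It also shows what happens when Eid is left out of the Department table.

```python
columns = ["Eid", "Ename", "Age", "City", "Salary", "Deptid", "DeptName"]
data = [
    ("E001", "ABC", 29, "Pune", 20000, "D001", "Finance"),
    ("E002", "PQR", 30, "Pune", 30000, "D002", "Production"),
    ("E003", "LMN", 25, "Mumbai", 5000, "D003", "Sales"),
    ("E004", "XYZ", 24, "Mumbai", 4000, "D004", "Marketing"),
    ("E005", "STU", 32, "Bangalore", 25000, "D005", "Human Resource"),
]
original = [dict(zip(columns, row)) for row in data]


def project(rows, cols):
    """Keep only the given columns and drop duplicate rows."""
    projected = []
    for row in rows:
        picked = {c: row[c] for c in cols}
        if picked not in projected:
            projected.append(picked)
    return projected


def natural_join(left, right):
    """Combine rows that agree on every column the two tables share."""
    joined = []
    for l in left:
        for r in right:
            common = set(l) & set(r)
            if all(l[c] == r[c] for c in common):
                joined.append({**l, **r})
    return joined


def as_set(rows):
    """Order-independent view of a table, for comparison."""
    return {tuple(sorted(row.items())) for row in rows}


employee = project(original, ["Eid", "Ename", "Age", "City", "Salary"])
department = project(original, ["Deptid", "Eid", "DeptName"])
lossless = as_set(natural_join(employee, department)) == as_set(original)

# Without Eid there is no common column, so every pairing is produced.
department_without_eid = project(original, ["Deptid", "DeptName"])
spurious = natural_join(employee, department_without_eid)
print(lossless, len(spurious))
```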

Lossy Join Decomposition:

 If the <Employee> table contains (Eid, Ename, Age, City, Salary) and <Department> table
contains (Deptid and DeptName), then it is not possible to join the two tables or relations,
because there is no common column between them. And it becomes Lossy Join
Decomposition.

2. Dependency Preservation

 Dependency is an important constraint on the database.


 Every dependency must be satisfied by at least one decomposed table.

 If {A → B} holds, then B is functionally dependent on A, and the dependency is easiest to
check when both sets are in the same relation.
 A decomposition has this property only if the functional dependencies are maintained.
 This property allows updates to be checked without computing the natural join of the
decomposed relations.

3. Lack of Data Redundancy

 Data Redundancy is also known as a Repetition of Information.


 The proper decomposition should not suffer from any data redundancy.
 The careless decomposition may cause a problem with the data.
 The lack of data redundancy property may be achieved by Normalization process.

GATE Question:

How to find Candidate Keys and Super Keys using Attribute Closure?
 If the attribute closure of an attribute set contains all attributes of the relation, the
attribute set is a super key of the relation.
 If, in addition, no proper subset of this attribute set can functionally determine all
attributes of the relation, the set is a candidate key as well.

For Example, using FD set of table 1,

(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY, STUD_AGE}

(STUD_NO, STUD_NAME) will be a super key but not a candidate key, because the closure of its
subset (STUD_NO) already equals all attributes of the relation. So, STUD_NO will be a candidate key.

GATE Question:

Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set of functional
dependencies {{E, F} -> {G}, {F} -> {I, J}, {E, H} -> {K, L}, {K} -> {M}, {L} -> {N}} on R.
What is the key for R?
A. {E, F}
B. {E, F, H}
C. {E, F, H, K, L}
D. {E}

Answer:

Finding attribute closure of all given options, we get:


{E,F}+ = {EFGIJ}
{E,F,H}+ = {EFHGIJKLMN}
{E,F,H,K,L}+ = {EFHGIJKLMN}
{E}+ = {E}
{EFH}+ and {EFHKL}+ both result in the set of all attributes, but only EFH is minimal. So EFH
is the candidate key.

So the correct option is (B).
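
The closures in this answer can be checked mechanically. The following is a minimal sketch of the usual attribute-closure loop, assuming the relation's ten attributes are E through N and representing each FD as a (left side, right side) pair of sets; the function name closure is chosen for illustration.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is covered and it adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


R = set("EFGHIJKLMN")
fds = [
    (set("EF"), set("G")),
    (set("F"), set("IJ")),
    (set("EH"), set("KL")),
    (set("K"), set("M")),
    (set("L"), set("N")),
]

for option in ("EF", "EFH", "EFHKL", "E"):
    c = closure(option, fds)
    print(option, "->", "".join(sorted(c)), "| covers R:", c == R)
```

Only EFH and EFHKL cover R, and EFH is the smaller of the two, which matches option (B).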

How to check whether an FD can be derived from a given FD set?


To check whether an FD A->B can be derived from an FD set F,
1. Find (A)+ using FD set F.
2. If B is subset of (A)+, then A->B is true else not true.

GATE Question:
In a schema with attributes A, B, C, D and E following set of functional dependencies are given
{A -> B, A -> C, CD -> E, B -> D, E -> A}
Which of the following functional dependencies is NOT implied by the above set?

A. CD -> AC
B. BD -> CD
C. BC -> CD
D. AC -> BC

Answer:

Using FD set given in question,


(CD)+ = {CDEAB}, which means CD -> AC holds true.
(BD)+ = {BD}, which does not contain C, so BD -> CD cannot hold. So this FD is not implied by the FD set.

So (B) is the required option.


Others can be checked in the same way.
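
As a sketch of the two-step check described above, the code below tests all four options. The helper names closure and implies are assumptions, and attribute sets are written as strings of single-letter attribute names.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) strings of attribute letters."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from fds, i.e. rhs is a subset of (lhs)+."""
    return set(rhs) <= closure(lhs, fds)


fds = [("A", "B"), ("A", "C"), ("CD", "E"), ("B", "D"), ("E", "A")]

for lhs, rhs in [("CD", "AC"), ("BD", "CD"), ("BC", "CD"), ("AC", "BC")]:
    print(f"{lhs} -> {rhs}:", implies(fds, lhs, rhs))
```

Only BD -> CD is reported as not implied, in agreement with option (B).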

Prime and non-prime attributes


Attributes which are part of any candidate key of a relation are called prime attributes; all
others are non-prime attributes. For example, STUD_NO in the STUDENT relation is a prime
attribute, and the others are non-prime attributes.

GATE Question:
Consider a relation scheme R = (A, B, C, D, E, H) on which the following functional
dependencies hold: {A–>B, BC–> D, E–>C, D–>A}.
What are the candidate keys of R?
(a) AE, BE
(b) AE, BE, DE
(c) AEH, BEH, BCH
(d) AEH, BEH, DEH

Answer:
(AE)+ = {ABECD}, which is not the set of all attributes (H is missing). So AE is not a candidate
key. Hence options (a) and (b) are wrong.
(AEH)+ = {ABCDEH}
(BEH)+ = {BEHCDA}
(DEH)+ = {DEHABC}
(BCH)+ = {BCHDA}, which is not the set of all attributes (E is missing). So BCH is not a candidate
key. Hence option (c) is wrong.

So the correct answer is (d).
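
For small schemas like this one, the candidate keys can also be found by brute force: try attribute subsets in increasing size and keep those whose closure is the whole relation and which contain no smaller key. This is only a sketch with hypothetical helper names, and it is exponential in the number of attributes.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) strings of attribute letters."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Return all minimal attribute sets whose closure equals `attrs`."""
    attrs = set(attrs)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            subset = set(combo)
            # A set containing an already-found key is a super key, not minimal.
            if any(key <= subset for key in keys):
                continue
            if closure(subset, fds) == attrs:
                keys.append(subset)
    return keys


fds = [("A", "B"), ("BC", "D"), ("E", "C"), ("D", "A")]
keys = ["".join(sorted(k)) for k in candidate_keys("ABCDEH", fds)]
print(keys)
```

E and H never appear on the right-hand side of any FD, so every key must contain both, which is why all three keys found have the form xEH.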

Primary File Organization

 A file is partitioned into fixed-length storage units called blocks, which are the
units of both storage allocation and data transfer from/to the secondary storage
(HDD).
 Most DBMSs use block sizes of 4 to 8 kilobytes by default.
 Many DBMSs allow the block size to be specified when a DB instance is created.

The database is stored as a collection of files.

 Each file is a sequence of records.


 A record is a sequence of fields.

Two possible approaches:

• Record size is fixed

• Record size is variable

Fixed-Length Records:

Even if the records are not really of fixed length (e.g. they contain varchar fields), we
assume that each field occupies its maximum size.

Record access is simple: Record i starts at byte n*(i-1), where n is the size of each
record (e.g. 53 bytes).

Problem 1: Unless the block size is a multiple of n, the last record in a block crosses the
block boundary

 Requires two block accesses


 Modification: leave the fractional record at the end of the block unused.
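
The offset formula and the "leave the fractional record unused" modification can be worked through numerically. In this sketch the 53-byte record size comes from the example above, while the 4096-byte block size and the function names are assumptions.

```python
RECORD_SIZE = 53    # n, the record size from the example
BLOCK_SIZE = 4096   # assumed 4 KB block


def record_offset(i, n=RECORD_SIZE):
    """Byte offset of record i (1-based) when records are packed back to back."""
    return n * (i - 1)


def locate(i, n=RECORD_SIZE, block_size=BLOCK_SIZE):
    """(block number, offset inside block) when no record may cross a block boundary."""
    per_block = block_size // n
    block, slot = divmod(i - 1, per_block)
    return block, slot * n


per_block = BLOCK_SIZE // RECORD_SIZE            # whole records that fit in one block
unused = BLOCK_SIZE - per_block * RECORD_SIZE    # bytes left unused at the end of each block

print(record_offset(1), record_offset(3))
print(per_block, unused)
print(locate(77), locate(78))
```

With these numbers each block holds 77 records and leaves 15 bytes unused, and record 78 becomes the first record of the second block instead of straddling the boundary.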

Problem 2: What to do when a record ( i ) is deleted?

Possible solutions:

 shift records i + 1, . . ., n to i, . . . , n – 1
 move record n to i
 do not move records, but link all free records on a free list
Free Lists (Linked):

 Store the address of the first deleted record in the file header.
 Each deleted record in turn stores the address of the next deleted record, so the free
records form a linked list.
 These stored addresses can be thought of as pointers, since they “point” to the location of a
record.
 For efficiency, reuse the space for normal attributes in the free records to store the pointers.
(No pointers are stored in in-use records!)
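
A toy in-memory sketch of the free-list idea. The class and field names are hypothetical, slot indexes stand in for record addresses, and deleting the same slot twice is not guarded against.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeSlot:
    """Marker stored in a deleted slot; reuses the slot's space for a pointer."""
    next_free: Optional[int]


class FixedLengthFile:
    """List of fixed-length record slots with a linked free list."""

    def __init__(self):
        self.slots = []
        self.free_head = None  # "file header": index of the first deleted slot

    def insert(self, record):
        if self.free_head is None:
            # No deleted slot available: append at the end of the file.
            self.slots.append(record)
            return len(self.slots) - 1
        slot = self.free_head
        self.free_head = self.slots[slot].next_free  # unlink the reused slot
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        # The freed slot points to the previous head and becomes the new head.
        self.slots[slot] = FreeSlot(next_free=self.free_head)
        self.free_head = slot


f = FixedLengthFile()
for name in ("a", "b", "c"):
    f.insert(name)
f.delete(1)
f.delete(0)
reused = [f.insert("d"), f.insert("e"), f.insert("f")]
print(reused, f.slots)
```

Deleted slots are reused in last-deleted-first order, and the file only grows once the free list is empty.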

Variable-Length Records:

Variable-length records arise in several ways:

 Storage of multiple record types in a file.

E.g. the records represent tuples from different tables

 Record types that allow variable lengths for one or more fields such as strings (varchar)
 Record types that allow repeating fields (used in some older data models).

Internal representation of variable-length records

 Attributes are stored in order.
 Variable-length attributes are represented by a fixed-size pair (offset, length), with the
actual data stored after all fixed-length attributes.
 Null values are represented by a null-value bitmap.

Representation of variable-length records inside a block: Slotted page structure

A slotted page has a header which contains:

 The number of record entries


 The end of free space in the block
 The location and size of each record

Records are allocated contiguously in the page/block, starting from the end.

Records can be moved around within the page/block to keep them contiguous (no
empty space between them)

o The header entry is updated on every move.
o Because of this, outside pointers should not point directly to the record but to its header
entry.
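
The slotted page structure can be sketched as follows. This is a simplified illustration under stated assumptions: the header is kept as a Python list instead of being serialized into the block, so the space it would occupy is not accounted for, and all names are hypothetical.

```python
class SlottedPage:
    """Toy slotted page: records grow from the end of the block towards the front."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []        # header entries: (offset, length), or None if deleted
        self.free_end = size   # end of free space = start of the record area

    def insert(self, record: bytes) -> int:
        if len(record) > self.free_end:
            raise ValueError("not enough free space in page")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1  # callers keep the slot number, not the offset

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        offset, length = self.slots[slot]
        # Move every record stored below the deleted one up by `length` bytes
        # so that the records stay contiguous.
        self.data[self.free_end + length:offset + length] = self.data[self.free_end:offset]
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0] < offset:
                self.slots[i] = (entry[0] + length, entry[1])
        self.slots[slot] = None
        self.free_end += length


page = SlottedPage(64)
a = page.insert(b"alpha")
b = page.insert(b"be")
c = page.insert(b"gamma")
page.delete(b)
print(page.read(a), page.read(c), page.slots, page.free_end)
```

After the delete, the record in slot c has physically moved, yet reading it through its slot number still works, which is why outside pointers refer to the header entry.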

File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
File Organization to organize file records −

1. Heap File Organization

When a file is created using Heap File Organization, the Operating System allocates memory area
to that file without any further accounting details. File records can be placed anywhere in that
memory area. It is the responsibility of the software to manage the records. Heap File does not
support any ordering, sequencing, or indexing on its own.

2. Sequential File Organization

Every file record contains a data field (attribute) to uniquely identify that record. In sequential file
organization, records are placed in the file in some sequential order based on the unique key field
or search key. Practically, it is not possible to store all the records sequentially in physical form.

3. Hash File Organization

Hash File Organization uses Hash function computation on some fields of the records. The output
of the hash function determines the location of disk block where the records are to be placed.

4. Clustered File Organization

Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block, that is, the ordering of records
is not based on primary key or search key.

Indexing
Indexing is a way to optimize performance of a database by minimizing the number of disk
accesses required when a query is processed.

An index or database index is a data structure which is used to quickly locate and access the data
in a database table.

Indexes are created using some database columns.

 The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data can
be accessed quickly (Note that the data may or may not be stored in sorted order).
 The second column is the Data Reference which contains a set of pointers holding the
address of the disk block where that particular key value can be found.

There are two kinds of indices:

1. Ordered indices: Indices are based on a sorted ordering of the values.


2. Hash indices: Indices are based on the values being distributed uniformly across a range
of buckets. The bucket to which a value is assigned is determined by a function called a
hash function.

Neither technique is best in every case; the choice depends on the database application in
which it is used. Each technique is evaluated on the basis of the following factors:

 Access Types: e.g. value based search, range access, etc.


 Access Time: Time to find particular data element or set of elements.
 Insertion Time: Time taken to find the appropriate space and insert a new data.
 Deletion Time: Time taken to find an item and delete it as well as update the index
structure.
 Space Overhead: Additional space required by the index.

Indexing Methods

1. Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. If the IDs start with 1, 2, 3, ... and so on, and we have to search for the employee
with ID 543:

o With no index, the DBMS has to scan the disk block from the start until it reaches 543. It
reads the record after reading 543*10 = 5430 bytes.
o With an index (assuming each index entry is 2 bytes long), the DBMS searches the index and
reads the record after reading 542*2 = 1084 bytes, far fewer than in the previous case.

2. Primary Index

o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. Primary keys are unique to each record, so there is a 1:1 relation
between index entries and records.
o As primary keys are stored in sorted order, the searching operation is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.

2.1 Dense index

o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is same as the number of records in the
main table.
o It needs more space to store index record itself. The index records have the search key and
a pointer to the actual record on the disk.

2.2 Sparse index

o An index record appears only for some of the items in the data file. Each index record
points to a block.
o Instead of pointing to every record in the main table, the index points to records in the
main table at intervals; the records in between are found by scanning from the nearest
indexed record.
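
The difference between the two lookups can be sketched in a few lines. The data file below is hypothetical: three sorted blocks of (key, payload) records, with the sparse index holding only the first key of each block.

```python
from bisect import bisect_right

# Hypothetical data file, sorted on the search key and split into blocks.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Dense index: one entry per search key -> (block number, position in block).
dense = {
    key: (b, p)
    for b, block in enumerate(blocks)
    for p, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0][0] for block in blocks]


def dense_lookup(key):
    loc = dense.get(key)
    if loc is None:
        return None
    b, p = loc
    return blocks[b][p][1]


def sparse_lookup(key):
    # Find the last index entry <= key, then scan that block sequentially.
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, payload in blocks[b]:
        if k == key:
            return payload
    return None


print(len(dense), len(sparse_keys))
print(dense_lookup(50), sparse_lookup(50), sparse_lookup(55))
```

The dense index needs nine entries and the sparse index only three, at the cost of a short scan inside the block.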

3. Clustering Index

o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for
these groups.

Example: suppose a company contains several employees in each department. Suppose we use a
clustering index, where all employees which belong to the same Dept_ID are considered within a
single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-unique
key.

The previous scheme is a little confusing because one disk block is shared by records which
belong to different clusters. Using a separate disk block for each cluster is considered the
better technique.

4. Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast. The actual data
is then read from secondary memory using the address obtained from the mapping. If the
mapping grows too large, fetching the address itself becomes slower, and the sparse index is no
longer efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, another level of indexing is introduced to reduce the size of the mapping.
In this method, large ranges of the column values are selected initially so that the first-level
mapping stays small. Each range is then further divided into smaller ranges. The first-level
mapping is stored in primary memory, so that address fetches are fast. The second-level mapping
and the actual data are stored in secondary memory (hard disk).

For example:

o If you want to find the record of roll 111 in the diagram, the search first looks in the
first-level index for the highest entry which is smaller than or equal to 111. It gets 100 at
this level.
o Then, in the second-level index, it again looks for the highest entry which is smaller than or
equal to 111 and gets 110. Using the address stored with 110, it goes to the data block and
searches each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also
done in the same manner.
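
The two-level search can be sketched with Python's bisect module. The index contents below are hypothetical and only chosen so that roll 111 resolves through the entries 100 and 110 as in the walk-through; the block labels are made up for illustration.

```python
from bisect import bisect_right


def floor_entry(entries, key):
    """Return the (k, target) pair with the largest k <= key from sorted entries, or None."""
    keys = [k for k, _ in entries]
    i = bisect_right(keys, key) - 1
    return entries[i] if i >= 0 else None


# Hypothetical data blocks on disk, keyed by a made-up block label.
data_blocks = {
    "B100": [100, 101, 105],
    "B110": [110, 111, 114],
    "B120": [120, 125],
    "B200": [200, 203],
    "B210": [210, 215],
}
# Second-level index (on disk): finer ranges pointing to data blocks.
second_level = {
    "S100": [(100, "B100"), (110, "B110"), (120, "B120")],
    "S200": [(200, "B200"), (210, "B210")],
}
# First-level index (in primary memory): coarse ranges pointing to second-level blocks.
first_level = [(100, "S100"), (200, "S200")]


def lookup(roll):
    """Return (first-level key, second-level key, roll) if the record exists, else None."""
    top = floor_entry(first_level, roll)
    if top is None:
        return None
    mid = floor_entry(second_level[top[1]], roll)
    if mid is None:
        return None
    # Sequential scan of the data block for the requested roll number.
    return (top[0], mid[0], roll) if roll in data_blocks[mid[1]] else None


print(lookup(111), lookup(112), lookup(50))
```

Only the small first-level list has to stay in memory; each lookup then touches one second-level block and one data block.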
