Normalization and Functional Dependency - File Organization - Unit 4
Normalization is a database design technique that organizes tables in a manner that reduces
redundancy and dependency of data. It divides larger tables into smaller tables and links them
using relationships.
Normalization is a method to remove the anomalies described below and bring the database to a
consistent state.
Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized: insertion,
update and deletion anomalies. Let's take an example to understand this.
Example:
Employee table
emp_id emp_name emp_address emp_dept
166 Glenn Chennai D004
The above table is not normalized. We will see the problems that we face when a table is not
normalized.
Update anomaly: In the above table we have two rows for employee Rick, as he belongs to two
departments of the company. If we want to update the address of Rick, we have to update it in
both rows or the data will become inconsistent. If the correct address gets updated in one
department's row but not the other, then as per the database Rick would have two different
addresses, which is not correct and leads to inconsistent data.
Insert anomaly: Suppose a new employee joins the company, who is under training and
currently not assigned to any department then we would not be able to insert the data into the
table if emp_dept field doesn’t allow nulls.
Delete anomaly: Suppose, if at a point of time the company closes the department D890 then
deleting the rows that are having emp_dept as D890 would also delete the information of
employee Maggie since she is assigned only to this department.
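These anomalies can be sketched directly in Python with the unnormalized table held as plain rows. Rick's and Maggie's ids, addresses and departments below are assumptions for illustration (only Glenn's row survives in the table above):

```python
# Unnormalized Employee table as plain rows; one row per (employee, department).
rows = [
    {"emp_id": 101, "emp_name": "Rick",   "emp_address": "Delhi",   "emp_dept": "D001"},
    {"emp_id": 101, "emp_name": "Rick",   "emp_address": "Delhi",   "emp_dept": "D002"},
    {"emp_id": 166, "emp_name": "Glenn",  "emp_address": "Chennai", "emp_dept": "D004"},
    {"emp_id": 144, "emp_name": "Maggie", "emp_address": "Agra",    "emp_dept": "D890"},
]

# Update anomaly: changing Rick's address in only one row leaves two addresses.
rows[0]["emp_address"] = "Mumbai"
addresses = {r["emp_address"] for r in rows if r["emp_name"] == "Rick"}

# Delete anomaly: closing department D890 also erases all knowledge of Maggie.
rows = [r for r in rows if r["emp_dept"] != "D890"]
maggie_known = any(r["emp_name"] == "Maggie" for r in rows)

print(addresses, maggie_known)  # Rick now has two addresses; Maggie is gone
```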
Normal Form Description
1NF A relation will be in 1NF if it contains only atomic values, i.e., each attribute
holds a single value.
2NF A relation will be in 2NF if it is in 1NF and the relation does not contain
any partial dependency.
3NF A relation will be in 3NF if it is in 2NF and no non-prime attribute is
transitively dependent on any candidate key.
BCNF A relation will be in BCNF if it is in 3NF and, for every functional
dependency X → Y, X is a super key.
4NF A relation will be in 4NF if it is in Boyce-Codd normal form and has no
multi-valued dependency.
5NF A relation is in 5NF if it is in 4NF, does not contain any join
dependency, and joining is lossless.
First Normal Form requires that all the attributes in a relation have atomic domains. The
values in an atomic domain are indivisible units, i.e., each attribute must
contain only a single value.
Before we learn about the second normal form, we need to understand the following −
Example:
TEACHER_ID SUBJECT TEACHER_AGE
25 Chemistry 30
25 Biology 30
47 English 35
83 Math 38
83 Computer 38
We cannot take TEACHER_ID as the primary key, as it contains non-unique values (e.g., teacher_ids
25 and 83 each appear twice).
In the given table, the non-prime attribute TEACHER_AGE is dependent on TEACHER_ID alone,
which is a proper subset of the candidate key {TEACHER_ID, SUBJECT}. This partial dependency
on the candidate key is why the table violates the rule for 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
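A quick check that this decomposition is lossless: the natural join of TEACHER_DETAIL and TEACHER_SUBJECT on TEACHER_ID rebuilds the original table exactly. A small Python sketch (with teacher 47's subject English, as in the original table):

```python
# TEACHER_DETAIL: TEACHER_ID -> TEACHER_AGE; TEACHER_SUBJECT: (TEACHER_ID, SUBJECT).
teacher_detail = {25: 30, 47: 35, 83: 38}
teacher_subject = [(25, "Chemistry"), (25, "Biology"), (47, "English"),
                   (83, "Math"), (83, "Computer")]

# Natural join on the common attribute TEACHER_ID.
joined = [(tid, subj, teacher_detail[tid]) for tid, subj in teacher_subject]

original = [(25, "Chemistry", 30), (25, "Biology", 30), (47, "English", 35),
            (83, "Math", 38), (83, "Computer", 38)]
print(sorted(joined) == sorted(original))  # True: no information was lost
```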
For a table to be in the Second Normal Form, it must satisfy two conditions:
1. It should be in the First Normal Form.
2. It should not have Partial Dependency.
What is Dependency?
Let's take an example of a Student table with
columns student_id, name, reg_no, branch and address.
In this table, student_id is the primary key and will be unique for every row; hence we can
use student_id to fetch any row of data from this table.
Even in a case where two students' names are the same, if we know the student_id we can easily
fetch the correct record.
Hence we can say a primary key for a table is a single column or a group of
columns (composite/candidate key) which can uniquely identify each record in the table.
I can ask for the branch name of the student with student_id 10, and I can get it. Similarly, if I
ask for the name of the student with student_id 10 or 11, I will get it. So all I need is student_id,
and every other column depends on it, or can be fetched using it.
But this is not true all the time. So now let's extend our example to see if more than 1 column
together can act as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields
and subject_id will be the primary key.
subject_id subject_name
1 Java
2 C++
3 Php
Now we have a Student table with student information and another table Subject for storing
subject information.
Let's create another table Score, to store the marks obtained by students in the respective
subjects. We will also be saving name of the teacher who teaches that subject along with marks.
score_id student_id subject_id marks teacher
1 10 1 70 Java Teacher
2 10 2 75 C++ Teacher
3 11 1 80 Java Teacher
In the score table we are saving the student_id to know whose marks these are,
and the subject_id to know which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key for this table, which can be
the Primary key.
Now if you look at the Score table, we have a column named teacher which is only dependent on
the subject; for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
Now, as we just discussed, the primary key for this table is a composition of two columns,
student_id & subject_id, but the teacher's name depends only on the subject (hence
the subject_id) and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary
key and not on the whole key.
There can be many different solutions for this, but our objective is to remove teacher's name from
Score table.
The simplest solution is to remove the column teacher from the Score table and add it to the
Subject table. Hence, the Subject table will become:
subject_id subject_name teacher
1 Java Java Teacher
2 C++ C++ Teacher
3 Php Php Teacher
And our Score table is now in the second normal form, with no partial dependency.
score_id student_id subject_id marks
1 10 1 70
2 10 2 75
3 11 1 80
Quick Recap on 2NF:
1. For a table to be in the Second Normal form, it should be in the First Normal form and it
should not have Partial Dependency.
2. Partial Dependency exists, when for a composite primary key, any attribute in the table
depends only on a part of the primary key and not on the complete primary key.
3. To remove Partial dependency, we can divide the table, remove the attribute which is
causing partial dependency, and move it to some other table where it fits in well.
By transitive functional dependency, we mean we have the following relationships in the table:
A is functionally dependent on B, and B is functionally dependent on C. In this case, C is
transitively dependent on A via B.
score table:
score_id student_id subject_id marks exam_name total_marks
With exam_name and total_marks added, our Score table stores more data now. The primary key
for our Score table is a composite key, which means it's made up of two attributes or columns
→ student_id + subject_id.
Our new column exam_name depends on both student and subject. So we can say
that exam_name is dependent on both student_id and subject_id.
The column total_marks depends on exam_name, as the total score changes with the exam type.
For example, practical exams carry fewer total marks while theory exams carry more.
But, exam_name is just another column in the score table. It is not a primary key or even a part of
the primary key, but total_marks depends on it.
Take out the columns exam_name and total_marks from Score table and put them in
an Exam table and use the exam_id (which is a primary key) wherever required.
exam_id exam_name total_marks
1 Workshop 200
2 Mains 70
3 Practicals 30
For a relation to be in Third Normal Form, it must be in Second Normal Form and the following
must be satisfied −
No non-prime attribute is transitively dependent on any candidate key.
We find that in the above Student_detail relation, Stu_ID is the key and the only prime attribute.
City can be identified by Stu_ID as well as by Zip. City is a non-prime attribute which
depends on (can be identified using) another non-prime attribute, Zip.
Additionally,
Stu_ID → Zip
Zip → City
To bring this relation into third normal form, we break the relation into two relations as follows −
Student_Detail (Stu_ID, Stu_name, Zip)
ZipCodes (Zip, City)
Even when a database is in 3rd Normal Form, anomalies can still result if it has more
than one candidate key.
A 3NF table, which does not have multiple overlapping candidate keys is said to be in BCNF.
Boyce-Codd Normal Form (BCNF) is an extension of Third Normal Form on strict terms. BCNF
states that for every functional dependency X → Y that holds in a relation, X must be a super key.
In the above, Stu_ID is the super-key in the relation Student_Detail and Zip is the super-key in the
relation ZipCodes.
So, both decomposed relations satisfy BCNF.
BCNF in detail:
college enrolment table:
student_id subject professor
101 Java Adam
101 C++ Martin
102 Java Allin
103 C# Joe
104 Java Jack
In the table above, {student_id, subject} together form the primary key.
There also exists a dependency between subject and professor: each professor teaches only one
subject, so professor determines subject (while one subject may be taught by several professors).
This table is not in Boyce-Codd Normal Form, because the determinant of this dependency, the
non-prime attribute professor, is not a super key (a non-prime attribute determines a prime attribute).
To make this relation satisfy BCNF, we will decompose this table into two tables: a student table
and a professor table.
Student Table
student_id p_id
101 1
101 2
102 3
103 4
104 5
Professor Table
p_id professor subject
1 Adam Java
2 Martin C++
3 Allin Java
4 Joe C#
5 Jack Java
If a database table instance does not contain two or more independent multi-valued facts
describing the relevant entity, then it is in 4th Normal Form.
A table is said to have a multi-valued dependency if the following conditions are true:
1. For a dependency A → B, if for a single value of A multiple values of B exist, then the
table may have a multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
3. And, for a relation R(A,B,C), if there is a multi-valued dependency between A and B,
then B and C should be independent of each other.
If all these conditions are true for a relation (table), it is said to have a multi-valued dependency.
college enrolment table:
s_id course hobby
1 Science Cricket
1 Maths Hockey
2 C# Cricket
2 Php Hockey
student with s_id 1 has opted for two courses, Science and Maths, and has two
hobbies, Cricket and Hockey.
You must be thinking what problem this can lead to, right?
Well the two records for student with s_id 1, will give rise to two more records, as shown below,
because for one student, two hobbies exists, hence along with both the courses, these hobbies
should be specified.
s_id course hobby
1 Science Cricket
1 Maths Hockey
1 Science Hockey
1 Maths Cricket
And, in the table above, there is no relationship between the columns course and hobby. They
are independent of each other.
So there is a multi-valued dependency, which leads to unnecessary repetition of data and other
anomalies as well.
To make the above relation satisfy the 4th normal form, we can decompose the table into 2 tables.
CourseOpted Table
s_id course
1 Science
1 Maths
2 C#
2 Php
Hobbies Table
s_id hobby
1 Cricket
1 Hockey
2 Cricket
2 Hockey
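Joining the two decomposed tables back on s_id regenerates exactly the course-hobby combinations the multi-valued dependency forced into the single table, which shows the decomposition is lossless. A small Python sketch:

```python
# 4NF tables after decomposition.
course_opted = [(1, "Science"), (1, "Maths"), (2, "C#"), (2, "Php")]
hobbies = [(1, "Cricket"), (1, "Hockey"), (2, "Cricket"), (2, "Hockey")]

# Natural join on s_id: every course pairs with every hobby per student.
joined = [(s, c, h) for s, c in course_opted for t, h in hobbies if s == t]
per_student_1 = sorted((c, h) for s, c, h in joined if s == 1)
print(per_student_1)
# [('Maths', 'Cricket'), ('Maths', 'Hockey'), ('Science', 'Cricket'), ('Science', 'Hockey')]
```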
A table can also have functional dependency along with multi-valued dependency. In that case,
the functionally dependent columns are moved to a separate table and the multi-valued dependent
columns are moved to separate tables.
A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of
smaller tables without loss of data.
For a functional dependency {A → B}:
A is the determinant set.
B is the dependent attribute.
We say A functionally determines B, or B is functionally dependent on A.
Functional dependency analysis helps avoid data redundancy: the same data should not be
repeated at multiple locations in the same database.
It also helps in identifying bad designs.
For example:
Here the Stu_Id attribute uniquely identifies the Stu_Name attribute of the student table, because
if we know the student id we can tell the student name associated with it. This is known as a
functional dependency and can be written as Stu_Id -> Stu_Name; in words, Stu_Name is
functionally dependent on Stu_Id.
Formally:
If column A of a table uniquely identifies column B of the same table, then it can be represented
as A -> B (attribute B is functionally dependent on attribute A).
Symbolically: A -> B
1. Trivial functional dependency:
If a functional dependency X -> Y holds true where Y is a subset of X, then it is called a trivial
functional dependency.
Example: {Student_Id, Student_Name} -> Student_Id is trivial. Also, Student_Id ->
Student_Id & Student_Name -> Student_Name are trivial dependencies too.
2. Non-trivial functional dependency:
If a functional dependency X -> Y holds true where Y is not a subset of X, then this
dependency is called a non-trivial functional dependency.
For example: Stu_Id -> {Stu_Id, Stu_Name} is non-trivial, since {Stu_Id, Stu_Name} is not a
subset of {Stu_Id}.
If an FD X -> Y holds true where X ∩ Y is null (empty), then this dependency is said to be a
completely non-trivial functional dependency.
Eg:
Stu-id Name
1 X
2 Y
3 Z
stu-id ∩ name = ∅
3. Multivalued dependency:
Multivalued dependency occurs when there are two or more independent multi-valued
attributes in a table.
For example: Consider a bike manufacture company, which produces two colors (Black
and Red) in each model every year.
Here columns manuf_year and color are independent of each other but both dependent on
bike_model. In this case these two columns are said to be multivalued dependent on
bike_model.
4. Transitive dependency:
For e.g.
X -> Z is a transitive dependency if the following three functional dependencies hold true:
X -> Y
Y does not determine X (Y -/-> X)
Y -> Z
Note: A transitive dependency can only occur in a relation of three or more attributes. This
dependency helps us in normalizing the database into 3NF (3rd Normal Form).
Example:
{Book} ->{Author} (if we know the book, we know the author name)
{Author} does not ->{Book}
{Author} -> {Author_age}
Therefore as per the rule of transitive dependency: {Book} -> {Author_age} should
hold, that makes sense because if we know the book name we can know the author’s age.
Armstrong's Axioms
Rule 1 Reflexivity
If A is a set of attributes and B is a subset of A, then A holds B. { A → B }
Rule 2 Augmentation
If A holds B and C is a set of attributes, then AC holds BC. {AC → BC}
It means that adding attributes to both sides of a dependency does not change the basic
dependency.
Rule 3 Transitivity
If A holds B and B holds C, then A holds C.
If {A → B} and {B → C}, then {A → C}
B. Secondary Rules
Rule 1 Union
If A holds B and A holds C, then A holds BC.
If {A → B} and {A → C}, then {A → BC}
Rule 2 Decomposition
If A holds BC, then A holds B and A holds C.
If {A → BC}, then {A → B} and {A → C}
Rule 3 Pseudo Transitivity
If A holds B and BC holds D, then AC holds D.
If {A → B} and {BC → D}, then {AC → D}
Example problem: Given the FD set {P → Q, Q → T, QR → S, QR → U}, prove that the
following functional dependencies hold: P → T, PR → S, QR → SU and PR → SU.
Solution:
1. P → T
In the above FD set, P → Q and Q → T
So, Using Transitive Rule: If {A → B} and {B → C}, then {A → C}
∴ If P → Q and Q → T, then P → T.
P→T
2. PR → S
In the above FD set, P → Q
As, QR → S
So, Using Pseudo Transitivity Rule: If{A → B} and {BC → D}, then {AC → D}
∴ If P → Q and QR → S, then PR → S.
PR → S
3. QR → SU
In above FD set, QR → S and QR → U
So, Using Union Rule: If{A → B} and {A → C}, then {A → BC}
∴ If QR → S and QR → U, then QR → SU.
QR → SU
4. PR → SU
From P → Q, augmentation gives PR → QR; with QR → U, transitivity gives PR → U.
PR → S was shown in step 2.
So, Using Union Rule: If {A → B} and {A → C}, then {A → BC}
∴ If PR → S and PR → U, then PR → SU.
PR → SU
How to find functional dependencies for a relation?
Example:
Consider the STUDENT relation given in Table 1.
Functional Dependency set or FD set of a relation is the set of all FDs present in the relation.
Example:
{ STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE,
STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY,
STUD_NO -> STUD_AGE,
STUD_STATE->STUD_COUNTRY }
Attribute Closure:
Attribute closure of an attribute set can be defined as set of attributes which can be functionally
determined from it.
Example:
Imagine the following list of FDs. We are going to calculate a closure for A from this
relationship.
1. A → B
2. B → C
3. AB → D
The closure is therefore (A)+ = {A, B, C, D}: A → B gives B, B → C gives C, and since A and B
are now both in the closure, AB → D gives D. By calculating the closure of A, we have validated
that A is a good candidate key, as its closure contains every attribute in the relationship.
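The hand calculation above can be automated; a minimal Python sketch (not from the notes) of the standard attribute-closure algorithm:

```python
# Attribute closure: repeatedly add the right-hand side of any FD whose
# left-hand side is already contained in the result set.
def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if set(lhs) <= result for a in rhs}
        if added <= result:
            return result
        result |= added

fds = [("A", "B"), ("B", "C"), ("AB", "D")]
print(sorted(closure("A", fds)))  # ['A', 'B', 'C', 'D']
```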
Finding the candidate key, super key from the given FDs
R(A,B,C,D)
A→B
B→C
C→A
Solution:
Draw the edge diagram: A → B → C → A, with D isolated.
No edge is incident on D (D appears on no right-hand side), so D is an essential attribute: it
must be part of every key. Find the closure of attribute D:
(D)+ = D
Combine each of the other attributes with D and find the closure over each attribute set:
(AD)+ = ABCD
(BD)+ = ABCD
(CD)+ = ABCD
Using the above combinations we can find all the attributes of the relation, hence AD, BD and
CD are candidate keys (i.e., using a minimal set of attributes, we can find all the attributes of the
relation).
If we can find all attributes with the (ABD) combination, it is a super key, but not a candidate
key, since its proper subset AD already determines all attributes.
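The whole candidate-key search can be sketched in Python, with the closure computation inlined (the "essential" set here means attributes appearing on no right-hand side, which must be in every key):

```python
from itertools import combinations

def closure(attrs, fds):
    # Standard attribute-closure computation.
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if set(lhs) <= result for a in rhs}
        if added <= result:
            return result
        result |= added

fds = [("A", "B"), ("B", "C"), ("C", "A")]
all_attrs = set("ABCD")

# D never appears on a right-hand side, so it must be in every key.
essential = all_attrs - {a for _, rhs in fds for a in rhs}     # {'D'}

# Combine each remaining attribute with D and test whether the closure
# covers the whole relation.
keys = sorted("".join(sorted(set(c) | essential))
              for c in combinations(sorted(all_attrs - essential), 1)
              if closure(set(c) | essential, fds) == all_attrs)
print(keys)  # ['AD', 'BD', 'CD']
```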
Canonical or Minimal:
A functional dependency set cannot be reduced further if it has the following properties:
1. The right-hand side of every functional dependency holds only one attribute.
2. The left-hand side of every functional dependency cannot be reduced without changing the
content of the set.
3. Removing any functional dependency changes the content of the set.
A set of functional dependencies with the above three properties is called canonical or
minimal.
Canonical cover
Ex: R(w,x,y,z)
x → z
wz → xy
y → wxz
Here redundancy exists when different rules imply the same attribute; if removal of such a rule
does not affect the closure of the FD set, the rule is redundant, and a set with no redundant rules
is said to be in canonical form (e.g., a set such as {A → C, B → D, D → AB} is checked the
same way).
Ex problem:
R(w,x,y,z)
x→z
wz → xy
y → wxz
i. Split the right-hand sides. Use the decomposition rule so that each FD has a single attribute
on its right-hand side:
x→z
wz → x
wz → y
y→w
y→x
y→z
First rule: x → z
Find the closure of x with the rule and without it. With the rule, (x)+ = xz; without it, (x)+ = x.
The two closures imply different attributes, hence the rule is not redundant, so we keep the rule
x → z.
Second rule: wz → x
Without the rule, (wz)+ = wzyx (via wz → y and y → x); with it, (wz)+ = wzxy as well.
Both closures imply the same attributes, hence the rule is redundant, so we remove the rule
wz → x.
Third rule: wz → y
Without the rule, (wz)+ = wz; with it, (wz)+ = wzyx.
The closures differ, hence the rule is not redundant, so we keep the rule wz → y.
Fourth rule: y → w
Without the rule, (y)+ = yxz; with it, (y)+ = ywxz.
The closures differ, hence the rule is not redundant, so we keep the rule y → w.
Fifth rule: y → x
Without the rule, (y)+ = ywz; with it, (y)+ = ywxz.
The closures differ, hence the rule is not redundant, so we keep the rule y → x.
Sixth rule: y → z
Without the rule, (y)+ = ywxz (z still follows from y → x and x → z); with it, (y)+ = ywxz.
Both closures imply the same attributes, hence the rule is redundant, so we remove the rule
y → z.
The remaining set is:
x→z
wz → y
y→w
y→x
Further applying the axioms (union rule), we can generate the minimal set of FDs:
x→z
wz → y
y → wx
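The rule-by-rule redundancy check above can be sketched in Python (assuming the right-hand sides have already been decomposed to single attributes):

```python
def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if set(lhs) <= result for a in rhs}
        if added <= result:
            return result
        result |= added

# FDs with right-hand sides already split to single attributes (step i).
fds = [("x", "z"), ("wz", "x"), ("wz", "y"),
       ("y", "w"), ("y", "x"), ("y", "z")]

minimal = list(fds)
for fd in fds:
    rest = [f for f in minimal if f != fd]
    # fd is redundant if its right side is still derivable without it.
    if set(fd[1]) <= closure(fd[0], rest):
        minimal = rest

print(minimal)  # [('x', 'z'), ('wz', 'y'), ('y', 'w'), ('y', 'x')]
```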
Database Decomposition
What is decomposition?
Decomposition is the process of breaking a relation into two or more smaller relations in order to
remove redundancy and anomalies; the original relation should be recoverable by joining the
decomposed relations.
Properties of Decomposition:
Following are the properties of Decomposition,
1. Lossless Decomposition
2. Dependency Preservation
3. Lack of Data Redundancy
1. Lossless Decomposition
Decomposition must be lossless. It means that the information should not get lost from the
relation that is decomposed.
It gives a guarantee that the join will result in the same relation as it was decomposed.
Example:
Let 'E' be the relational schema, with instance 'e', decomposed into E1, E2, E3, . . ., En, with
instances e1, e2, e3, . . ., en. If e1 ⋈ e2 ⋈ e3 . . . ⋈ en = e, then it is called a 'Lossless Join
Decomposition'.
In other words, if the natural join of all the decompositions gives the original relation, the
decomposition is said to be a lossless join decomposition.
Decompose the above relation into two relations to check whether a decomposition is
lossless or lossy.
Now we have decomposed the relation into Employee and Department.
The Employee schema contains (Eid, Ename, Age, City, Salary) and the Department schema
contains (Deptid, DeptName).
Consider the join Employee ⋈ Department:
Since the <Employee> table contains (Eid, Ename, Age, City, Salary) and the <Department>
table contains (Deptid, DeptName), it is not possible to join the two relations, because there is
no common column between them. The decomposition is therefore a lossy join decomposition.
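Heath's theorem gives a quick test for a binary decomposition: it is lossless when the common attributes functionally determine all attributes of one of the parts. A Python sketch, using single letters as assumed abbreviations for the Employee/Department attributes:

```python
def closure(attrs, fds):
    result = set(attrs)
    while True:
        added = {a for lhs, rhs in fds if set(lhs) <= result for a in rhs}
        if added <= result:
            return result
        result |= added

def lossless(r1, r2, fds):
    # Binary decomposition is lossless iff the common attributes
    # functionally determine all of r1 or all of r2 (Heath's theorem).
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure

# E=Eid, N=Ename, A=Age, C=City, S=Salary, D=Deptid, T=DeptName (assumed).
fds = [("E", "NACS"), ("D", "T")]
employee, department = set("ENACS"), set("DT")

print(lossless(employee, department, fds))          # False: no common column
print(lossless(employee | {"D"}, department, fds))  # True: Deptid is shared
```

Keeping Deptid in the Employee relation is exactly what makes the second decomposition lossless.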
2. Dependency Preservation
A decomposition is dependency preserving when every functional dependency {A → B} of the
original relation can be checked within a single decomposed relation; checking a dependency is
easy when both attribute sets lie in the same relation.
This property can only be achieved by maintaining the functional dependencies during
decomposition, and it allows updates to be checked without computing the natural join of the
database relations.
How to find Candidate Keys and Super Keys using Attribute Closure?
If attribute closure of an attribute set contains all attributes of relation, the attribute set will
be super key of the relation.
If no subset of this attribute set can functionally determine all attributes of the relation, the
set will be candidate key as well.
GATE Question:
In a schema with attributes A, B, C, D and E following set of functional dependencies are given
{A -> B, A -> C, CD -> E, B -> D, E -> A}
Which of the following functional dependencies is NOT implied by the above set?
A. CD -> AC
B. BD -> CD
C. BC -> CD
D. AC -> BC
Answer:
B. (BD)+ = {B, D}, which does not contain C, so BD -> CD is not implied. The other options
follow from attribute closures: (CD)+ = {C, D, E, A, B}, (BC)+ = {B, C, D, E, A} and
(AC)+ = {A, C, B, D, E}.
GATE Question:
Consider a relation scheme R = (A, B, C, D, E, H) on which the following functional
dependencies hold: {A–>B, BC–> D, E–>C, D–>A}.
What are the candidate keys of R?
(a) AE, BE
(b) AE, BE, DE
(c) AEH, BEH, BCH
(d) AEH, BEH, DEH
Answer:
(AE)+ = {ABECD} which is not set of all attributes. So AE is not a candidate key. Hence option
A and B are wrong.
(AEH)+ = {ABCDEH}
(BEH)+ = {BEHCDA}
(BCH)+ = {BCHDA} which is not set of all attributes. So BCH is not a candidate key. Hence
option C is wrong.
So correct answer is D.
A file is partitioned into fixed-length storage units called blocks, which are the
units of both storage allocation and data transfer from/to the secondary storage
(HDD).
Most DBMSs use block sizes of 4 to 8 kilobytes by default
Many DBMSs allow the block size to be specified when a DB instance is created.
Two possible approaches:
Fixed-Length Records:
Even if the records are not really of fixed length (e.g. varchar below), we assume that
each field has max. size:
Record access is simple: Record i starts at byte n*(i-1), where n is the size of each
record (e.g. 53 bytes).
Problem 1: Unless the block size is a multiple of n, the last record in a block crosses the
block boundary.
Problem 2: When record i is deleted, its slot must be reclaimed.
Possible solutions:
shift records i + 1, . . ., n to i, . . . , n – 1
move record n to i
do not move records, but link all free records on a free list
Free Lists (Linked):
Store the address of the first deleted record in the file header.
Can think of these stored addresses as pointers since they “point” to the location of a
record.
For efficiency, reuse the space for normal attributes in the free records to store pointers.
(No pointers stored in in-use records!)
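The free-list idea can be sketched in Python; this toy `RecordFile` class is an assumption for illustration (not a real DBMS structure) that stores the head pointer where a file header would, and reuses freed slots to hold the links:

```python
# Toy fixed-length record file: free_head plays the role of the file header
# entry, and each free slot stores the index of the next free slot.
FREE_END = -1

class RecordFile:
    def __init__(self):
        self.slots = []              # a record (str) or an int free-list link
        self.free_head = FREE_END

    def insert(self, record):
        if self.free_head == FREE_END:
            self.slots.append(record)        # no free slot: grow the file
            return len(self.slots) - 1
        i = self.free_head                   # reuse the first free slot
        self.free_head = self.slots[i]       # unlink it from the free list
        self.slots[i] = record
        return i

    def delete(self, i):
        self.slots[i] = self.free_head       # slot i now points to the old head
        self.free_head = i

f = RecordFile()
for r in ["rec0", "rec1", "rec2"]:
    f.insert(r)
f.delete(1)
print(f.insert("rec3"))  # 1 -- the freed slot is reused; no records move
```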
Variable-Length Records:
Record types that allow variable lengths for one or more fields such as strings (varchar)
Record types that allow repeating fields (used in some older data models).
Records are allocated contiguously in the page/block, starting from the end.
Records can be moved around within the page/block to keep them contiguous (no
empty space between them)
File Organization
File Organization defines how file records are mapped onto disk blocks. We have four types of
file organization to organize file records: heap (unordered) file organization, sequential file
organization, hash file organization and clustered file organization.
Indexing
Indexing is a way to optimize performance of a database by minimizing the number of disk
accesses required when a query is processed.
An index or database index is a data structure which is used to quickly locate and access the data
in a database table.
The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data can
be accessed quickly (Note that the data may or may not be stored in sorted order).
The second column is the Data Reference which contains a set of pointers holding the
address of the disk block where that particular key value can be found.
There is no absolute comparison between the indexing techniques that follow; the best choice
depends on the database application to which indexing is applied.
Indexing Methods
1. Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. If their IDs start with 1, 2, 3, ... and so on, and we have to search for the employee
with ID 543:
o In the case of a database with no index, we have to search the disk blocks from the start
until we reach 543. The DBMS will read the record after reading 543*10 = 5430 bytes.
o In the case of an index, we will search using the index and (assuming 2-byte index entries)
the DBMS will read the record after reading 542*2 = 1084 bytes, which is far less than in
the previous case.
2. Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. These primary keys are unique to each record and contain 1:1 relation
between the records.
o As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
o The primary index can be classified into two types: dense index and sparse index.
2.1 Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is same as the number of records in the
main table.
o It needs more space to store index record itself. The index records have the search key and
a pointer to the actual record on the disk.
2.2 Sparse index
o In the data file, index record appears only for a few items. Each item points to a block.
o In this, instead of pointing to each record in the main table, the index points to the records
in the main table in a gap.
3. Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for
these group.
Example: suppose a company contains several employees in each department. Suppose we use a
clustering index, where all employees which belong to the same Dept_ID are considered within a
single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-unique
key.
The previous scheme is a little confusing, because one disk block is shared by records which
belong to different clusters. Using a separate disk block for each cluster is the better technique.
4. Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast; secondary
memory is then searched for the actual data based on the address obtained from the mapping. If
the mapping size grows, fetching the address itself becomes slower, and the sparse index is no
longer efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In
this method, the huge range for the columns is selected initially so that the mapping size of the
first level becomes small. Then each range is further divided into smaller ranges. The mapping of
the first level is stored in the primary memory, so that address fetch is faster. The mapping of the
second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, it will search for the highest
entry which is smaller than or equal to 111 in the first-level index. It will get 100 at this
level.
o Then in the second-level index, it again finds the largest entry smaller than or equal to
111 and gets 110. Now, using the address 110, it goes to the data block and scans each
record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is done
in the same manner.
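The two-level lookup described above can be sketched with binary search over sorted key lists; the boundary values below are assumptions matching the roll-number example:

```python
from bisect import bisect_right

first_level = [1, 100, 200]               # first-level range boundaries
second_level = {100: [100, 110, 120]}     # second-level entries per range
data_blocks = {110: [110, 111, 112]}      # records stored in each block

def lookup(key):
    # Largest first-level entry <= key, then largest second-level entry <= key.
    lo1 = first_level[bisect_right(first_level, key) - 1]
    lo2 = second_level[lo1][bisect_right(second_level[lo1], key) - 1]
    # Finally, scan the data block record by record.
    return key in data_blocks[lo2]

print(lookup(111))  # True: 111 is found via 100 -> 110 -> block scan
```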