
A Thesis Report

on
Automated Database Normalization Up-To Third Normal Form
Using Data Dictionary

For Partial Fulfillment of the Requirements for the Degree of Master of Science in
Computer Science Awarded by Pokhara University

Submitted by:

Ashok Chand
MSc.CS
Roll No: 14531

Under the Supervision of


Assoc. Prof. Roshan Chitrakar, PhD

DEPARTMENT OF GRADUATE STUDIES

NEPAL COLLEGE OF
INFORMATION TECHNOLOGY
Balkumari, Lalitpur, Nepal

September 2018
DECLARATION

I hereby declare that the work done in the thesis entitled "Automated Database
Normalization Up-To Third Normal Form Using Data Dictionary", submitted to
Nepal College of Information Technology, Pokhara University, is my original work
performed in partial fulfillment of the requirements for the degree of Master of Science in
Computer Science (MSc.CS).

……………………….
Ashok Chand
Date: 6 September 2018
ACKNOWLEDGEMENT

First of all, I would like to express my sincere gratitude to my respected supervisor, Assoc.
Prof. Roshan Chitrakar, PhD, for his advice, support, guidance and valuable time for
discussion, which provided ideas and impetus in each and every phase of my thesis work.

I would like to thank the respected Mr. Saroj Shakya, who gave me full support from
proposing the topic to this stage; I will always be grateful for his insight and
guidance. I would like to specially thank Mr. Sanjeev Kumar Pandey for his valuable
suggestions, guidance and support at every stage of my thesis work. I also thank Mr.
Madan Kadariya for all his suggestions in the initial phase of my dissertation.

My sincere admiration goes to Mr. Shashidhar Ram Joshi, Mr. Niranjan Khakurel,
Mr. Kumar Pudashaini, and Mr. Sanjay Kushwaha for providing me with such broad
knowledge and inspiration over these two years. I would also like to thank
all the teachers and staff members of Nepal College of Information Technology.

I would like to specially thank all my friends for their support in each and every
challenging step of this thesis work.

Finally, I would like to express my profound gratitude to my dearest parents, sister and
brothers, who have supported and encouraged me in every moment of my life to achieve my
goal.

Ashok Chand
September 2018

ABSTRACT

Database Normalization is a mechanism which reduces the redundancy in the tables of a
database. Normalization splits a larger table into several smaller tables and defines the
relationships between them. The smaller tables are considerably less redundant in comparison
to the bigger table. The objective of Normalization is to isolate data, so that additions,
deletions, and modifications of a field can be made in just one table and then propagated
through the related tables using the defined relationships between them. Without
Normalization, it is always a difficult task to handle and update the database without facing
data loss. In this research, a new method has been proposed for database normalization using
a data dictionary. Algorithms were designed for database normalization using a data
dictionary for graph representation. In this method, the graph is stored in the form of the
incoming and outgoing attributes of each node; this representation is called the INOUT Data
Dictionary. A simple algorithm is then used to find the transitive dependencies (TDs) and
remove them. After the TDs are removed, the database tables are in Third Normal Form (3NF).

Keywords: Database, Database Normalization, Functional dependency, Automated database normalization, dependency matrix.

TABLE OF CONTENTS

1. INTRODUCTION .......................................................................................................... 1
1.1 Database Normalization ............................................................................................ 1
1.2 Anomalies in Databases ............................................................................................ 1
1.3 Normal forms ............................................................................................................ 2
1.3.1. First Normal form ............................................................................................. 2
1.3.2. Second Normal form ......................................................................................... 3
1.3.3. Third Normal form ............................................................................................ 4
1.4 Motivation ................................................................................................................. 6
2. PROBLEM STATEMENT ............................................................................................. 7
3. OBJECTIVES ................................................................................................................. 8
4. LITERATURE REVIEW ............................................................................................... 9
4.1. Dependency Graph Diagram.................................................................................... 9
4.2. Dependency Matrix ................................................................................................ 10
4.3. Directed Graph Matrix ........................................................................................... 11
4.4. Closure dependency ............................................................................................... 12
5. METHODOLOGY ....................................................................................................... 15
5.1 Research Model ...................................................................................................... 15
5.1.1 Literature Review............................................................................................. 15
5.1.2 Problem Formulation: ...................................................................................... 16
5.1.3 Design of Algorithms:...................................................................................... 16
Working Model ............................................................................................................. 18
5.1.4 Implementation: ............................................................................................... 18
5.1.5 Analysis............................................................................................................ 18
5.2. Data Analysis ......................................................................................................... 19
6. EXPERIMENTS ........................................................................................................... 20
6.1. Tools and Environment .......................................................................................... 20
6.1.1. Matplotlib........................................................................................................ 20
6.1.2. Networkx......................................................................................................... 21
6.2. Validation Testing - workflow ............................................................................... 21
7. RESULTS AND DISCUSSION ................................................................................... 23
7.1. Output Analysis ..................................................................................................... 23
7.2. Findings and Discussion ........................................................................................ 27
8. VALIDATION OF SYSTEM ....................................................................................... 30
8.1. Time Complexity ................................................................................................... 30
8.2. Space Complexity .................................................................................................. 30
9. CONCLUSION AND FUTURE WORKS ................................................................... 32
10. REFERENCES AND BIBLIOGRAPHY ................................................................... 33
11. APPENDIX (SOURCE CODE) ................................................................................. 35

LIST OF TABLES

Table 1: Manager-Employee Table 2


Table 2: Manager-Employee Table in 1NF 2
Table 3: Customer-info Table in 1NF 3
Table 4: Customer-info(1) in 2NF 3
Table 5: Customer-info(2) in 2NF 4
Table 6: Order-info Table 5
Table 7: Order-info Table in 3NF 6
Table 8: Dependency Matrix(DM) for Example 1 in [1] 11
Table 9: 2NF Table of Example 1 24
Table 10: 3NF Table-1 of Example 1 25
Table 11: 3NF Table-2 of Example 1 25
Table 12: 2NF Table-1 of Example 2 26
Table 13: 2NF Table-2 of Example 2 26
Table 14: 3NF Table-1 of Example 2 27
Table 15: 3NF Table-2 of Example 2 27
Table 16: Experimental data table 28
Table 17: Time and Space Complexity comparison 31

LIST OF FIGURES

Figure 1: Flowchart for getting Transitive dependencies 9

Figure 2: Dependency Graph for Example 1 in [1] 10

Figure 3: Algorithm flowchart for DG Matrix 12

Figure 4: Dependency graph diagrams 13

Figure 5: Spanning Tree Graph (STG) 14

Figure 6: Flow diagram for the research 15

Figure 7: Flowchart Getting Transitive Dependencies from INOUT Dictionary 17

Figure 8: Working flowchart of proposed model 18

Figure 9: Validation testing workflow 21

Figure 10: Directed Graph for Second Normal form of Example 1 23

Figure 11: Directed Graph for Third Normal form of Example 1 24

Figure 12: Directed Graph for Second Normal form of Example 2 26

Figure 13: Directed Graph for Third Normal form of Example 2 27

Figure 14: Time Comparison of different Transitive Dependencies 29

1. INTRODUCTION

1.1 Database Normalization

Database Normalization is a mechanism which reduces the redundancy in the tables of a
database. Normalization splits a larger table into several smaller tables and defines the
relationships between them. The smaller tables are considerably less redundant in comparison
to the bigger table. The objective of Normalization is to isolate data, so that additions,
deletions, and modifications of a field can be made in just one table and then propagated
through the related tables using the defined relationships between them. Without
Normalization, it is always a difficult task to handle and update the database without facing
data loss. Normally, three kinds of anomalies are faced: update, insertion and deletion. In the
case of an update, for example, the same field must be updated in each and every table that
stores it in order to maintain the consistency of the database.

1.2 Anomalies in Databases

The three types of anomalies are described here: Update, Insertion and Deletion anomalies.
An insertion anomaly is a failure to place information about a new database entry into all the
places in the database where information about that new entry needs to be stored. In a
properly normalized database, information about a new entry needs to be inserted into only
one place in the database; in an inadequately normalized database, information about a new
entry may need to be inserted into more than one place, and, human fallibility being what
it is, some of the needed additional insertions may be missed.

A deletion anomaly is a failure to remove information about an existing database entry when
it is time to remove that entry. In a properly normalized database, information about an old,
to-be-removed entry needs to be deleted from only one place in the database; in an
inadequately normalized database, information about that old entry may need to be deleted
from more than one place.

An update anomaly involves modifications that may be additions, deletions, or both. Thus
"update anomalies" can be either of the kinds discussed above.

All three kinds of anomalies are highly undesirable, since their occurrence constitutes
corruption of the database. Properly normalized databases are much less susceptible to
corruption than non-normalized databases.

1.3 Normal forms

The first three normal forms are described in this section: 1NF, 2NF and 3NF.

1.3.1. First Normal form

Definition: A relation is said to be in First Normal Form (1NF) if and only if each attribute
of the relation is atomic and a primary key is defined for every row. More simply, to be in
1NF, each row must be a unique tuple.
Example: The following table is NOT in First Normal Form:
Table 1: Manager-Employee Table

Manager Employees

Jim Susan, Rob, Beth

Mary Alice, John, Asim

Renee Mike

Joe Alan, Tim

Here is an alternative option that is in 1NF.


Table 2: Manager-Employee Table in 1NF

Manager Employee

Jim Susan

Jim Rob

Jim Beth

Mary Alice

Mary John

Mary Asim

Renee Mike

Joe Alan

Joe Tim

1.3.2. Second Normal form

Definition: In order to be in Second Normal Form, a relation must first fulfill the
requirements to be in First Normal Form. Additionally, each non-key attribute in the
relation must be fully functionally dependent upon the entire primary key.
Example: The following relation is in First Normal Form, but not Second Normal Form:
Table 3: Customer-info Table in 1NF

Order # Customer Contact Person Total

1 Acme Widgets John Doe $134.23

2 ABC Corporation Fred Flintstone $521.24

3 Acme Widgets John Doe $1042.42

4 Acme Widgets John Doe $928.53

In the table above, the order number serves as the primary key. Notice that the customer
and total amount are dependent upon the order number -- this data is specific to each order.
However, the contact person is dependent upon the customer, not the order number. One way
to resolve this is to create two tables:

Table 4: Customer-info(1) in 2NF

Customer Contact Person

Acme Widgets John Doe

ABC Corporation Fred Flintstone

Table 5: Customer-info(2) in 2NF

Order # Customer Total

1 Acme Widgets $134.23

2 ABC Corporation $521.24

3 Acme Widgets $1042.42

4 Acme Widgets $928.53

The creation of two separate tables eliminates the dependency problem experienced in the
previous case. In the first table, contact person is dependent upon the primary key –
customer name. The second table only includes the information unique to each order.
Someone interested in the contact person for each order could obtain this information by
performing a JOIN operation.
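
The same idea can be shown outside SQL. The following is a small illustrative Python sketch (plain dictionaries and tuples, not part of the thesis implementation) that reconstructs the contact person for each order by joining Table 5 with Table 4:

contacts = {"Acme Widgets": "John Doe", "ABC Corporation": "Fred Flintstone"}   # Table 4
orders = [(1, "Acme Widgets", 134.23), (2, "ABC Corporation", 521.24),
          (3, "Acme Widgets", 1042.42), (4, "Acme Widgets", 928.53)]             # Table 5
for order_no, customer, total in orders:
    # each joined row carries the contact person without storing it per order
    print order_no, customer, contacts[customer], total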

1.3.3. Third Normal form

For the third normal form, the following criteria need to be fulfilled:

• Meet the requirements of 1NF and 2NF.

• Remove columns that are not directly dependent upon the primary key (for example, columns that can be derived from other columns).

Imagine that I have a table of widget orders:

Table 6: Order-info Table

Order Number   Customer Number   Unit Price   Quantity   Total
1 241 $10 2 $20

2 842 $9 20 $180

3 919 $19 1 $19

4 919 $12 10 $120

Our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are
there any duplicative columns? No. Do we have a primary key? Yes, the order number.
Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to
multiple rows? No, so we also satisfy the requirements of 2NF.

Now, are all of the columns fully dependent upon the primary key? The customer number
varies with the order number and it doesn't appear to depend upon any of the other fields.
What about the unit price? This field could be dependent upon the customer number in a
situation where we charged each customer a set price. However, looking at the data above,
it appears we sometimes charge the same customer different prices. Therefore, the unit
price is fully dependent upon the order number. The quantity of items also varies from
order to order, so we're OK there.

What about the total? It looks like we might be in trouble here. The total can be derived by
multiplying the unit price by the quantity; therefore, it is not fully dependent upon the
primary key. We must remove it from the table to comply with the third normal form:

Table 7: Order-info Table in 3NF

Order Number Customer Number Unit Price Quantity

1 241 $10 2

2 842 $9 20

3 919 $19 1

4 919 $12 10

Now our table is in 3NF. But, what about the total? This is a derived field and it's best not
to store it in the database at all. We can simply compute it "on the fly" when performing
database queries. For example, we might have previously used this query to retrieve order
numbers and totals:

SELECT OrderNumber, Total


FROM WidgetOrders

We can now use the following query:

SELECT OrderNumber, UnitPrice * Quantity AS Total
FROM WidgetOrders

to achieve the same results without violating the normalization rules.

1.4 Motivation
Database Normalization has always been a field of interest for set theorists and computer
scientists. With the growth of data in every field, optimization of data representation is a
must, and Database Normalization is, in some sense, a form of optimization: data redundancy
is handled by normalization, and it is a good approach to handling the anomalies which occur
in databases.
The automation of Database Normalization has not been studied much and is an interesting
field in which to do research. Transitive dependency is one of the key notions needed to
translate a table into Third Normal Form.

2. PROBLEM STATEMENT

Various methods for automated database normalization can be found in the literature. The
method proposed here uses a data dictionary for database normalization. Whether it is
practical to use a data dictionary for this purpose is the main research concern of this work.

3. OBJECTIVES

The main objectives of this research are:


a. Implementation of Database Normalization Module using Data Dictionary.
b. Analysis of the time consumed for different sorts of transitive dependencies: linear,
circular and hybrid transitive dependencies.

4. LITERATURE REVIEW

For the detection of transitive dependencies, Amir Hassan Bahmani et al. have presented a
method in [1]. According to them, the following procedure yields the transitive
dependencies:

Figure 1: Flowchart for getting Transitive dependencies

4.1. Dependency Graph Diagram


Functional dependencies can be used to monitor all the relations between the different
attributes of a table. These dependencies can be represented graphically. In these graphs,
the arrow is the most important symbol used. Besides, in this way of representing the
relationship graph, a (dotted) horizontal line separates simple keys
(i.e., attributes) from composite keys (i.e., keys composed of more than one attribute). A
dependency graph is generated using the following rules:
1. Each attribute of the table is encircled and all attributes of the table are drawn at the
lowest level (i.e., bottom) of the graph.
2. A horizontal line is drawn on top of all attributes.
3. Each composite key (if any) is encircled and all composite keys are drawn on top of the
horizontal line.
4. All functional dependency arrows are drawn.
5. All reflexivity rule dependencies are drawn using dotted arrows (e.g., AB → A, AB → B).
The example 1 in [1] is considered here:
Fds= {A → BCD, C → DE, EF → DG, D → G}

Figure 2 is the Graphical representation of the dependencies

Figure 2: Dependency Graph for Example 1 in [1]

4.2. Dependency Matrix


The dependency matrix is generated as follows:
i. Define a matrix DM[n][m], where n = the number of determinant keys and m = the number
of simple keys.
ii. Suppose that β ⊆ α, γ ⊄ α, and γ, β ∈ {simple key set} and α ∈ {determinant key set}.
iii. Establish the DM elements as follows:
if α → β ==> DM[α][β] = 2
if α → γ ==> DM[α][γ] = 1
else DM[α][γ] = 0
The dependency matrix for the above example is shown below:

Table 8: Dependency Matrix(DM) for Example 1 in [1]
A B C D E F G

A 2 1 1 1 0 0 0

C 0 0 2 1 1 0 0

D 0 0 0 2 0 0 1

EF 0 0 0 1 2 2 1
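
To make this construction concrete, the following is a minimal Python sketch (an illustration of the rules above, not code from [1]) that builds such a matrix from FDs written as "LHS-->RHS" strings:

fds = ["A-->BCD", "C-->DE", "EF-->DG", "D-->G"]
determinants = [fd.split("-->")[0] for fd in fds]                          # rows: determinant keys
simple_keys = sorted(set("".join(fd.replace("-->", "") for fd in fds)))    # columns: simple keys

# DM[alpha][beta] = 2 if beta is part of alpha, 1 if alpha determines beta, else 0
DM = {a: {b: 0 for b in simple_keys} for a in determinants}
for fd in fds:
    lhs, rhs = fd.split("-->")
    for attr in lhs:                  # reflexive part: attributes of the key itself
        DM[lhs][attr] = 2
    for attr in rhs:                  # attributes functionally determined by the key
        if DM[lhs][attr] == 0:
            DM[lhs][attr] = 1
for a in determinants:
    print a, [DM[a][b] for b in simple_keys]   # prints the rows of the matrix (cf. Table 8)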

4.3. Directed Graph Matrix


The Directed Graph (DG) matrix for determinant keys is used to represent all possible
direct dependencies between determinant keys. The DG is an n×n matrix, where n is the
number of determinant keys. The process of determining the elements of this matrix is as
follows.
The elements of the DG matrix are initially set to zero. Starting from the first row of the
dependency matrix DM, the matrix is investigated in a row-major fashion. Suppose we
are investigating the row corresponding to determinant key x. If all simple keys that x is
composed of depend on a determinant key other than x, then x also depends on that
determinant key (Armstrong's augmentation rule). The dependency of a simple key on a
determinant key is represented by a non-zero entry in the DM matrix.
The Algorithm for DG matrix adopted from [1] is as follows:

Figure 3: Algorithm flowchart for DG Matrix
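
As a rough illustration of this rule, the following Python sketch builds on the DM dictionary from the previous example (it is not the implementation from [1], and the orientation of the DG matrix is an assumption made for illustration):

# Determinant key x depends on determinant key y when every simple attribute
# of x has a non-zero entry in DM[y] (Armstrong's augmentation rule).
DG = {x: {y: 0 for y in determinants} for x in determinants}
for x in determinants:
    for y in determinants:
        if x != y and all(DM[y][attr] != 0 for attr in x):
            DG[y][x] = 1   # y determines every attribute of x, so an edge y --> x is recorded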

4.4. Closure dependency


The pseudo-code for the dependency closure is presented below. The symbols used have their
usual meanings as before:
Dependency-closure ()
{
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            if (i!=j && DG[i][j]!=-1) {
                for (k=0; k<m; k++)
                    if (DM[j][k]!=0 && DM[j][k]!=2)
                        DM[i][k]=j;
            }
}

Another research paper on database normalization, authored by Chetneti Srisa-an, presents a
method in [2]. According to this paper, a complete automated relational database
normalization method is presented which first produces a directed graph and a spanning tree,
and then proceeds with generating the 2NF and 3NF normal forms.
The paper uses two structures, the Functional Dependency Graph (DG) and the Spanning Tree
Graph (STG), to manipulate dependencies among the attributes of a relation.
The example 1 in [2] is considered here:
Fds= {A → BCD, C → DE, EF → DG, D → G}

Figure 4: Dependency graph diagrams

After applying the DG and STG, the database attributes form a forest, where every individual
tree represents a table which is in third normal form. The same example from [2] is considered here:
Fds= {A → BCD, C → DE, EF → DG, D → G}

Figure 5: Spanning Tree Graph (STG)

5. METHODOLOGY

5.1 Research Model

This research is, in some sense, based on earlier research, and it is also exploratory research.
The literature on Database Normalization based on depth-first search was considered for the
study; therefore, I can say that this research builds on earlier research. On the other hand, an
attempt has been made to use a data dictionary for representing the nodes of the graph, which
is the exploratory part of this research. Exploratory research is useful when a researcher has a
limited amount of experience with or knowledge about a research issue. It ensures that a more
rigorous, more conclusive future study will not begin with an inadequate understanding of the
nature of the problem at hand. Usually, exploratory research provides greater understanding
of a concept or crystallizes a problem. Exploratory research is initial research conducted to
clarify and define the nature of a problem [3]. The research model for the overall work is
presented in the figure below:

Figure 6: Flow diagram for the research

5.1.1 Literature Review

A literature review is simply a summary of what existing scholarship knows about a
particular topic. It is always based on secondary sources – that is, what other people have
already written on the subject; it is not concerned with discovering new knowledge or
information. As such, it is a prelude to further research, a digest of scholarly opinion [4]. In
this research, the literature review covered spanning-tree-based normalization and depth-first
search for finding transitive dependencies.

5.1.2 Problem Formulation:

The next part of the research is problem formulation. After the literature survey, an
attempt has been made in this thesis to propose the detection of transitive dependencies
using a data dictionary. This proposal was the exploratory part of the overall research because
the result was uncertain: the research might have failed, since nothing was predefined about
the data dictionary representation of a graph.

5.1.3 Design of Algorithms:

Algorithms were designed for database normalization using a data dictionary for graph
representation. The time complexity of these algorithms is quadratic in nature. The
algorithms are presented below.
In this method, the graph is stored in the form of the incoming and outgoing attributes of
each node.
Algorithm 1: INOUT Data Dictionary
Let us consider the example Fds = {A → BCD, C → DE, EF → DG, D → G}.
These Fds will be stored in the form of incoming and outgoing attributes as follows:
1: A: {in: Φ; out: B}, A: {in: Φ; out: C}, A: {in: Φ; out: D}
2: B: {in: A}
3: C: {in: A; out: D}, C: {in: A; out: E}
4: D: {in: C; out: G}, D: {in: A; out: G}
5: E: {in: C; out: Φ}
6: F: {in: Φ; out: Φ}
7: EF: {in: Φ, out: D}, EF: {in: Φ, out: G}
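
A minimal Python sketch of how such an INOUT dictionary could be populated from the FDs is given below (the exact layout, with one list of incoming and one list of outgoing attributes per key, is an assumption made for illustration; the implementation in the Appendix uses an edge-list variant of the same idea):

def build_inout(fds):
    inout = {}
    for fd in fds:
        lhs, rhs = fd.split("-->")
        inout.setdefault(lhs, {"in": [], "out": []})
        for attr in rhs:
            inout.setdefault(attr, {"in": [], "out": []})
            inout[lhs]["out"].append(attr)   # lhs determines attr
            inout[attr]["in"].append(lhs)    # attr is determined by lhs
    return inout

inout = build_inout(["A-->BCD", "C-->DE", "EF-->DG", "D-->G"])
print inout["D"]   # D's incoming determinants and outgoing attributes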

After the Fds are stored, a simple algorithm is used to get the transitive dependencies.
The algorithm is presented below:

Figure 7: Flowchart Getting Transitive Dependencies from INOUT Dictionary

Next, all the FDs are traced from the given input and stored in a variable called ALL_FD.
The following method is then applied to remove the TDs and bring the database tables into
3NF (a concrete Python sketch follows the algorithm):
Algorithm 2: Identify and remove transitive dependencies
1: Normal = []
2: for i in ALL_FD:
3:     temp = i
4:     for j in TD:
5:         if j in temp:
6:             remove j.out from temp
7:     add temp to Normal
8: Output: Normal
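
A simplified Python sketch of this step is shown below. It assumes the build_inout dictionary sketched above and an ALL_FD structure mapping each determinant key to the attributes stored with it; both structures are assumptions made for illustration, not the thesis implementation:

def transitive_dependencies(inout):
    # a TD x -> y -> z exists whenever y has an incoming determinant x and an outgoing attribute z
    tds = []
    for y, links in inout.items():
        for x in links["in"]:
            for z in links["out"]:
                tds.append((x, y, z))
    return tds

def remove_tds(all_fd, tds):
    # drop every transitively dependent attribute z from the table keyed on x;
    # the pair (y, z) then forms its own 3NF table (this mirrors Algorithm 2)
    normal = []
    for key, attrs in all_fd.items():
        keep = [a for a in attrs if not any(x == key and z == a for (x, y, z) in tds)]
        normal.append((key, keep))
    return normal

Under these assumptions, a chain such as A → C → D is reported as the triple (A, C, D), and D is then dropped from the table keyed on A, as in Example 1 of Section 7.1.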

Working Model

Figure 8: Working flowchart of proposed model

5.1.4 Implementation:

The implementation phase was carried out by writing computer programs in the Python
language. The details of the environment and tools are described in the next section.

5.1.5 Analysis

In the analysis phase, the run time of the computer model prepared for this research is
analyzed for different inputs.

5.2. Data Analysis
Data analysis was done on the run time of three different types of transitive dependencies:
linear transitive dependencies, circular transitive dependencies and hybrid transitive
dependencies.
A linear transitive dependency is a TD of the form A → B → C.
A circular dependency is of the form A → B → C → A, and a tree-type transitive
dependency is of the form A → B → C, B → D.
A hybrid transitive dependency is a TD in which the dependencies form a circle as well as
a line, for example A → B, B → C, C → D, C → A.
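
For illustration, FD sets of the three categories could look like the following (these are illustrative inputs only; the exact FD sets used in the timing experiments are not listed here):

linear_fds   = ["A-->B", "B-->C"]                    # A -> B -> C
circular_fds = ["A-->B", "B-->C", "C-->A"]           # A -> B -> C -> A
hybrid_fds   = ["A-->B", "B-->C", "C-->D", "C-->A"]  # a circle plus a line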

6. EXPERIMENTS

In the experiment on the proposed model, the functional dependencies (FDs) of a relation are
first needed as input. These FDs are used by the INOUT data dictionary algorithm. A simple
algorithm is then used to find the partial dependencies and remove them; after the partial
dependencies are removed, the database tables are in 2NF. Next, a simple algorithm is used to
find the transitive dependencies and remove these TDs; after the TDs are removed, the
database tables are in 3NF, which is the required output.
Program execution proceeds as follows (a usage sketch is given after this list):
1. Give functional dependency inputs.
2. Run SecondNormalform Module
3. Obtain second normal form graph
4. This second normal form graph is input for ThirdNormalform module
5. Run ThirdNormalform module
6. Obtain third normal form graph
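
A possible way to run this pipeline is sketched below (the module names follow the Appendix; the exact invocation is an assumption, since the FD input is edited directly in the source):

# 1. Edit the Fds list at the top of SecondNormalform.py, e.g.:
#    Fds = ["A-->BCD", "C-->DE", "EF-->DG", "D-->G"]
# 2. Draw the second normal form graph (this imports and runs SecondNormalform):
#    python GraphVisualization.py
# 3. Draw the third normal form graph (this imports ThirdNormalform, which in
#    turn imports SecondNormalform):
#    python GraphVisualizationThirdNormalform.py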

6.1. Tools and Environment


This research was done on a machine having 4 GB of RAM and an Intel i5 processor. The
operating system used was Windows 7, and the programming language was Python 2.7.13.
Two additional Python modules, NetworkX and Matplotlib, were also used. These two
modules are described below:

6.1.1. Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms. Matplotlib can
be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web
application servers, and four graphical user interface toolkits. Matplotlib tries to make easy
things easy and hard things possible. You can generate plots, histograms, power spectra,
bar charts, error charts, scatter plots, etc., with just a few lines of code [5].

6.1.2. Networkx

NetworkX is a Python library for studying graphs and networks. NetworkX is free software
released under the new BSD license [6]. NetworkX is suitable for operation on large real-
world graphs, e.g., graphs in excess of 10 million nodes and 100 million edges [7]. Due to
its reliance on a pure-Python "dictionary of dictionaries" data structure, NetworkX is a
reasonably efficient, very scalable, highly portable framework for network and social
network analysis [8].
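
As a brief illustration of how NetworkX is used in this work (a minimal sketch; the full code is in the Appendix):

import networkx as nx
import matplotlib.pyplot as plt

g = nx.DiGraph()
for lhs, rhs in [("A", "B"), ("A", "C"), ("C", "D")]:   # edges of a normalized-form graph
    g.add_edge(lhs, rhs)
nx.draw(g, with_labels=True, node_size=1000)
plt.show()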

6.2. Validation Testing - workflow

The Software under test is evaluated during this type of testing.

Figure 9: Validation testing workflow

Unit Testing:
Unit testing is a level of software testing where individual units/components of the software
are tested. The purpose is to validate that each unit of the software performs as designed. A
unit is the smallest testable part of any software.
This system has mainly four units: SecondNormalform, ThirdNormalform,
GraphVisualization for the second normal form and GraphVisualization for the third normal
form. At first, the second normal form unit was developed and tested to ensure it works
properly; then the third normal form unit was developed and tested. The graph visualization
modules for the second and third normal forms were also developed and tested.
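
A hedged sketch of what a unit test for the SecondNormalform unit could look like is given below (the unittest harness is an assumption; the expected value is the Example 1 output reported in Section 7.1):

import unittest

class TestSecondNormalform(unittest.TestCase):
    def test_example_1(self):
        # importing the module runs it with Fds = ["A-->BC", "BC-->D", "C-->D"]
        import SecondNormalform
        expected = [(('A', 'B', 1), ('A', 'C', 1), ('C', 'D', 1))]
        self.assertEqual(SecondNormalform.final_tables, expected)

if __name__ == "__main__":
    unittest.main()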

Integration Testing:
Integration testing is a level of software testing where individual units are combined and
tested as a group. The purpose of this level of testing is to expose faults in the interaction
between integrated units. In this system, two or more units were combined and tested
together: the SecondNormalform unit was combined with the GraphVisualization unit for the
second normal form and the combined model was tested, and the ThirdNormalform unit was
combined with the GraphVisualizationThirdNormalform unit and that combined model was tested.

System Testing:
System testing is a level of software testing where the complete, integrated software is tested.
The purpose of this test is to evaluate the system's compliance with the specified requirements.

7. RESULTS AND DISCUSSION

7.1. Output Analysis


The output of the algorithms (whose Python implementation is listed in the Appendix) and
the computational processes are presented here. A few functional dependencies are
considered and their computational process is shown:
Example 1:
Functional Dependency: Fds= ["A → BC","BC → D","C → D"]
Outputs of the SecondNormalform module:
• Output: [['A', 'B', 1], ['A', 'C', 1], ['BC', 'B', 0], ['BC', 'C', 0], ['BC', 'D', 1], ['C', 'D',
1]]. Here ['X', 'Y', 1] represents an edge: 'X' is one node, 'Y' is another node, and
the value 1 states that 'X' and 'Y' are connected, while 0 states that they are not.
• Output: [['A', 'B', 1], ['A', 'C', 1], ['BC', 'B', 0], ['BC', 'C', 0], ['BC', 'D', 0.5], ['C', 'D',
1]]. Here ['BC', 'D', 0.5], which has 0.5 as the third attribute, indicates that D has a
partial dependency on BC. Thus, by the flag of 0.5, the partial dependency is
identified and removed.
• Output: [(('A', 'B', 1), ('A', 'C', 1), ('C', 'D', 1))], which is the directed graph of the
attributes after removal of the partial dependencies. The directed graph of functional
dependencies is used to create the table of the second normal form.

Figure 10: Directed Graph for Second Normal form of Example 1

The Table created from the above graph will be as follows:

Table 9: 2NF Table of Example 1


A[Primary key] B C D

Hence, the partial dependency of D on BC has been removed by making A the primary key
(the starting node of the directed graph is taken as the primary key). However, there is still a
transitive dependency between A and D (A → C → D), which will be removed in the third
normal form. The ThirdNormalform module identifies the transitive dependencies (if any
exist) in the graph presented above and removes them. After removal of the transitive
dependency, the obtained figure is as follows:

Figure 11: Directed Graph for Third Normal form of Example 1

Here we have a forest of graphs. We choose the graphs from this forest in such a way that the
union of all the nodes of the graphs gives all the attributes present in the FDs and each graph
has at least one member in common with another graph.

So we will choose (A → B, A → C) and (C → D). Therefore, we will have two tables as

follows:

Table 10: 3NF Table-1 of Example 1


A[Primary key] B C[foreign key]

Table 11: 3NF Table-2 of Example 1

C[Primary key] D

Example 2:
Functional Dependency: Fds= ["A → BCD", "C → D", "EF → DG", "D → G"]
Outputs of the SecondNormalform module:
• Output: [['A', 'B', 1], ['A', 'C', 1], ['A', 'D', 1], ['C', 'D', 1], ['EF', 'E', 0], ['EF', 'F', 0],
['EF', 'D', 1], ['EF', 'G', 1], ['D', 'G', 1]]
• Output: [['A', 'B', 1], ['A', 'C', 1], ['A', 'D', 1], ['C', 'D', 1], ['EF', 'E', 0], ['EF', 'F', 0],
['EF', 'D', 1], ['EF', 'G', 1], ['D', 'G', 1]]
• Output: [(('A', 'B', 1), ('A', 'C', 1), ('C', 'D', 1), ('A', 'D', 1), ('D', 'G', 1)), (('EF', 'D',
1), ('D', 'G', 1), ('EF', 'G', 1))], which is the directed graph of the attributes after
removal of the partial dependencies. The directed graph of functional dependencies is
used to create the tables of the second normal form. The corresponding figure is presented below.

Figure 12: Directed Graph for Second Normal form of Example 2

There will be two tables in second normal form for this example. A will be the primary
key for Table 1 and EF will be the primary key for Table 2. D will act as a foreign key in
Table 1. The tables are listed below:

Table 12: 2NF Table-1 of Example 2


A[primary key] B D[Foreign Key] C G

Table 13: 2NF Table-2 of Example 2


E[Primary key] F[Primary Key] D G

There are several transitive dependencies in the above two tables, which will be removed
by the ThirdNormalform module to bring these tables into third normal form. The
directed graph created after applying this module is shown in the figure below:

Figure 13: Directed Graph for Third Normal form of Example 2

Tables for third normal form will be as follows:

Table 14: 3NF Table-1 of Example 2


A [Pk] B C[Fk] D

Table 15: 3NF Table-2 of Example 2


E[Pk] F[Pk] C D G

7.2. Findings and Discussion


The experiment was done over different lengths of linear, circular and hybrid TDs.
Most of the time, circular dependencies were processed faster than linear and hybrid TDs,
and hybrid TDs took the maximum time in most of the cases. The data of 12 experiments,
done over FD lengths 6 to 17, are shown in the table below:

Table 16: Experimental data table

Length of FD    Time taken by Linear TD (microseconds)    Time taken by Circular TD (microseconds)    Time taken by Hybrid TD (microseconds)

6 9 7 12

7 10 8 15

8 12 10 17

9 12 11 18

10 15 10 17

11 14 12 17

12 17 14 19

13 16 15 17

14 20 18 24

15 19 17 21

16 21 20 25

17 23 19 25

Figure 14: Time Comparison of different Transitive Dependencies

8. VALIDATION OF SYSTEM

Validation was done by comparing the method of Amir Hassan Bahmani et al. with the proposed model.

8.1. Time Complexity


The loop nesting in the Dependency-closure module is the deepest, with three nested loops,
as presented by Amir Hassan et al. in the following algorithm:
Dependency-closure ()
{
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            if (i!=j && Path[i][j]!=-1) {
                for (k=0; k<m; k++)
                    if (DM[j][k]!=0 && DM[j][k]!=2)
                        DM[i][k]=j;
            }
}
It is clear from this that, in the worst case, the statement inside the loop for (k=0; k<m; k++)
executes n*n*m times, which is of the order O(n³) when m is of the same order as n.

When we go through the algorithms presented in this thesis, however, we find that none of
the modules and sub-modules has a complexity greater than O(n²). This follows from the
analysis done over the source code presented in the Appendix.

8.2. Space Complexity

Amir Hassan Bahmani et al. have used a dependency matrix and a directed graph matrix.
The dependency matrix requires n*m units of space, where n is the number of determinant
keys and m is the number of simple keys. Similarly, the representation of the directed graph
matrix requires n*n units of space. Therefore, the total space required is (m+n)*n units.

The model presented here in this thesis (Data dictionary model) uses a linear array which
consists of information about a particular key (for both determinant as well as simple keys).
The information is all about in and out degree of the graph formed through the functional
dependencies of the keys. If the maximum limit for in and out degree for a graph formed
by functional dependencies is C (where C is a constant), then total space required for the
model used in this thesis is m*C.
Theoretically, the value of C is not a universal constant; it changes with each new database
table. In other words, C can be assumed to be a constant for a particular database, but when
switching to another database the value of C does not remain the same. Generally, the value
of C is less than n, because it is very rare for a determinant key to functionally determine all
of the simple keys. Moreover, in real database scenarios the number of determinant keys is
generally smaller than the number of simple keys, because determinant keys (fundamentally
primary keys or composite keys) involve fewer attributes than the overall attributes present in
a database table that are not part of any determinant key.
From the above discussion, it can be concluded that C < n in practical cases of database
design, which implies that m*C < m*n < (m+n)*n.
Therefore, it can be concluded that the space complexity of the model presented in this thesis
is better than that of Amir Hassan Bahmani et al.

Table 17: Time and Space Complexity comparison


                    Time complexity                            Space complexity
                    Amir Hassan Bahmani    Proposed Model      Amir Hassan Bahmani    Proposed Model
DM                  O(n²)                  O(n²)               O(n²)                  O(n*c)
DG                  O(n²)                  O(n²)               O(n²)                  O(n*c)
2NF                 O(n²)                  O(n²)               O(n²)                  O(n*c)
3NF                 O(n³)                  O(n²)               O(n²)                  O(n*c)

where DM = dependency matrix, DG = directed graph matrix, and c is a constant.

9. CONCLUSION AND FUTURE WORKS

In this research, a data dictionary has been used to automate the database normalization
process. The attributes of the tables in a database are represented as variables and their
functional dependencies are provided by the user. This research covers database
normalization up to the third normal form. Tests have been made for different sorts of
transitive dependencies, termed linear, circular and hybrid dependencies. The performance of
the module has been analyzed on the basis of these types of transitive dependencies, and the
findings are that, on average, circular TDs took the minimum time and hybrid TDs took the
maximum time.

This research was bounded within the third normal form. Further extensions can be made by
automating the normalization process for BCNF, 4NF, 5NF and so on. It is strongly
recommended to use the Python NetworkX module to obtain optimal results when working
with the graphs.

10. REFERENCES AND BIBLIOGRAPHY

[1] A. H. Bahmani, M. Naghibzadeh and B. Bahmani, "Automatic database normalization
and primary key generation," Canadian Conference on Electrical and Computer
Engineering, June 2008. https://ieeexplore.ieee.org/document/4564486/

[2] C. Srisa-an, "Applying spanning tree graph theory for automatic database
normalization," 2014. https://waset.org/publications/9998331/applying-spanning-tree-graph-theory-for-automatic-database-normalization

[3] "A note on exploratory research," aweshkar, Vol. XVII, Issue 1, March 2014, WeSchool.

[4] www.kent.ac.uk/learning, fetched on Tuesday, June 12, 2018.

[5] https://matplotlib.org/, fetched on Tuesday, June 12, 2018.

[6] A. Hagberg, "NetworkX first public release (NX-0.2)," Python-announce-list mailing
list, 12 April 2005.

[7] A. Hagberg and D. Conway, "Hacking social networks using the Python programming
language (Module II – Why do SNA in NetworkX)," Sunbelt 2010: International
Network for Social Network Analysis.

[8] A. A. Hagberg, D. A. Schult and P. J. Swart, "Exploring network structure, dynamics,
and function using NetworkX," Proceedings of the 7th Python in Science Conference
(SciPy 2008), G. Varoquaux, T. Vaught and J. Millman (Eds.), pp. 11–15, 2008.

[9] T. Connolly and C. Begg, Database Systems: A Practical Approach to Design,
Implementation, and Management, Third edition, Pearson Education, 2005.

[10] C. J. Date, An Introduction to Database Systems, Seventh edition, Addison-Wesley, 2000.

[11] S. Kolahi, "Dependency-preserving normalization of relational and XML data,"
Journal of Computer and System Sciences, Vol. 73(4), pp. 636–647, 2007.

[12] https://en.wikipedia.org/wiki/Database_normalization

[13] A. Yazici and Z. Karakaya, "Normalizing relational database schemas using
Mathematica," LNCS, Vol. 3992, pp. 375–382, Springer-Verlag, 2006.

[14] H. Kung and T. Case, "Traditional and alternative database normalization techniques:
their impacts on IS/IT students' perceptions and performance," International Journal
of Information Technology Education, Vol. 1, No. 1, pp. 53–76, 2004.

[15] M. Arenas and L. Libkin, "An information-theoretic approach to normal forms for
relational and XML data," Journal of the ACM (JACM), Vol. 52(2), pp. 246–283, 2005.

11. APPENDIX (SOURCE CODE)

Module: SecondNormalform
Fds= ["A-->BC", "BC-->D", "C-->D"]
#Fds= ["AB-->CDEF","A-->EF"]
#Fds= ["A-->BCD", "C-->DE", "EF-->DG", "D-->G"]
#Fds= ["A-->BCD","D-->E","F-->E","E-->G","FE-->G"]
#Fds= ["A-->BCDEF","BCD-->GH","D-->H"]
#Fds= ["A-->BD","B-->C","B-->D","D-->E"]
#Fds= ["A-->BCD","BCD-->E","B-->E"]
def intersection(x,y):
list1= []
list2= []
for i in x[0]:
list1.append(i)
for i in x[1]:
list1.append(i)

for i in y[0]:
list2.append(i)
for i in y[1]:
list2.append(i)
if set(list1).intersection(set(list2))== set(list2):
return True
graph= []
directed_graph= [] # (Node, Node, weight)
for i in Fds:
X= i[0:i.index(">")-3+1]
Y= i[i.index(">")+1:len(i)]
if len(X)>1:

35
for j in X:
directed_graph.append([X,j,0])
for j in Y:
directed_graph.append([X,j,1])
print directed_graph
for i in range(len(directed_graph)):
for j in range(len(directed_graph)):
if i!=j and directed_graph[i][2]!= 0 and directed_graph[j][2]!= 0:
if intersection(directed_graph[i], directed_graph[j]):
directed_graph[i][2]= 0.5
print directed_graph
final_ans= []
for i in directed_graph:
if i[2]==1:
final_ans.append(i)
starting_elements= []
for i in final_ans:
xx= i[0]
count= 0
for j in final_ans:
if xx== j[1]:
break
else:
count= count+1
if count==len(final_ans):
starting_elements.append(i[0])
starting_elements= list(set(starting_elements))
tables= []
for i in starting_elements:
temp= []

36
for j in final_ans:
if i== j[0]:
temp.append(j)
for k in final_ans:
if j[1]== k[0]:
temp.append(k)
tables.append(temp)
final_tables= []
for i in tables:
xx= []
for j in i:
xx.append(tuple(j))
final_tables.append(tuple(xx))
list_of_elems= []
for i in final_ans:
list_of_elems.append(i[0])
list_of_elems.append(i[1])
list_of_elems= list(set(list_of_elems))
print final_tables

Module: ThirdNormalform

from SecondNormalform import final_ans  # final_ans holds the list of 2NF edges
import itertools

table = []
# Strip the weight flag so each edge becomes a plain (node, node) tuple
for i in range(len(final_ans)):
    final_ans[i].remove(1)
    final_ans[i] = tuple(final_ans[i])

# A transitive path is a pair of edges (x -> y, y -> z)
Transitive_paths = []
for i in final_ans:
    for j in final_ans:
        if i[1] == j[0]:
            Transitive_paths.append([i, j])

# Candidate edge subsets, from the largest down to pairs
subsets = []
for i in range(len(final_ans), 1, -1):
    subsets.append(list(set(itertools.combinations(final_ans, i))))

Third_normal = []
#print subsets
print
#print Transitive_paths
for i in subsets:
    #print "i=", i
    temp = []
    for j in i:
        var = []
        for k in Transitive_paths:
            # print "k=", k
            # print "j=", j
            Bool = set(k) < set(j) or set(k) == set(j)
            var.append(Bool)
        # print "var=", var
        if not (True in var):
            # keep only subsets that contain no complete transitive path
            temp.append(j)
    # print "temp=", temp
    Third_normal.append(temp)

# Stop once the chosen subsets cover every edge of the 2NF graph
union_ = set([])
for i in Third_normal:
    for j in i:
        union_ = ((union_).union(set(j)))
    if union_.intersection(set(final_ans)) == set((final_ans)):
        break

Module: GraphVisualization

from SecondNormalform import final_tables

import networkx as nx
import math
import matplotlib.pyplot as plt

print final_tables
G = []
for i in range(len(final_tables)):
    G.append(nx.DiGraph())
import random
position = {}
y1 = 1
y2 = 10
# Add the edges of every 2NF table to its own directed graph
for graph in range(len(G)):
    edges = []
    for p in final_tables[graph]:
        edges.append((p[0], p[1]))
    for i in edges:
        G[graph].add_edge(i[0], i[1])
# Draw each graph in its own figure window
for r in range(len(G)):
    position = nx.spring_layout(G[r], k=5 / math.sqrt(G[r].order()))
    plt.figure(r)
    nx.draw(G[r], node_size=1000, pos=position, cmap=plt.get_cmap('jet'),
            with_labels=True)
plt.show()

Module: GraphVisualizationThirdNormalform

from ThirdNormalform import i as final_tables

import networkx as nx
import math
import matplotlib.pyplot as plt

print final_tables
G = []
for i in range(len(final_tables)):
    G.append(nx.DiGraph())
import random
position = {}
y1 = 1
y2 = 10
# Add the edges of every 3NF table to its own directed graph
for graph in range(len(G)):
    edges = []
    for p in final_tables[graph]:
        edges.append((p[0], p[1]))
    for i in edges:
        G[graph].add_edge(i[0], i[1])
# Draw each graph in its own figure window
for r in range(len(G)):
    position = nx.spring_layout(G[r], k=5 / math.sqrt(G[r].order()))
    plt.figure(r)
    nx.draw(G[r], node_size=1000, pos=position, cmap=plt.get_cmap('jet'),
            with_labels=True)
plt.show()
