The idea of a functional dependency is to define one field as an anchor from which one can
always find a single value for another field.
As another example, suppose that a company assigned each employee a unique employee
number. Each employee has a number and a name. Names might be the same for two
different employees, but their employee numbers would always be different and unique
because the company defined them that way. It would be inconsistent in the database if there
were two occurrences of the same employee number with different names.
Or
EmpNo → Name.
The expression SSN → Name is read "SSN defines Name" or "SSN implies Name."
Let us look at some sample data for the second FD.
EmpNo Name
101 Anil
102 Ravi
103 Naveen
104 Nitin
105 Nitin
Wait a minute… You have two people named Nitin! Is this a problem with FDs? Not at all. You
expect that Name will not be unique, and it is commonplace for two people to have the
same name. However, no two people have the same EmpNo, and for each EmpNo there is
exactly one Name.
Let us look at a more interesting example:
EmpNo Job Name
101 President Kaitlyn
104 Programmer Fred
103 Designer Beryl
103 Programmer Beryl
Is there a problem here? No. We have the FD EmpNo → Name. This means that every
time we find 104, we find the name Fred. Just because something is on the left-hand side
(LHS) of an FD does not imply that you have a key or that it will be unique in the database.
The FD X → Y only means that for every occurrence of X you will get the same value of Y.
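The check described above can be sketched in a few lines of Python. This is only an illustration: the helper name `fd_holds` and the dictionary layout of the rows are assumptions made for this sketch, not part of the original text. Note also that an FD is a rule about every valid state of the database, so a check on sample data can refute an FD but can never prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, each LHS value appears with one RHS value only."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first RHS value seen for this LHS value
        # and returns the stored one on every later occurrence.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample data copied from the EmpNo / Job / Name table above.
employees = [
    {"EmpNo": 101, "Job": "President", "Name": "Kaitlyn"},
    {"EmpNo": 104, "Job": "Programmer", "Name": "Fred"},
    {"EmpNo": 103, "Job": "Designer", "Name": "Beryl"},
    {"EmpNo": 103, "Job": "Programmer", "Name": "Beryl"},
]

empno_defines_name = fd_holds(employees, ["EmpNo"], ["Name"])  # True
empno_defines_job = fd_holds(employees, ["EmpNo"], ["Job"])    # False: 103 has two jobs
```

EmpNo 103 appears twice, so EmpNo is not unique in this table, yet EmpNo → Name still holds because both rows carry the same Name.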
Let us now consider a new functional dependency in our example. Suppose that Job →
Salary. In this database, everyone who holds a job title has the same salary. Again, adding
an attribute to the previous example, we might see this:
Do we see a contradiction to our known FDs? No. Every time we find an EmpNo, we find
the same Name; every time we find a Job title, we find the same Salary.
Let us now consider another example. We will go back to the SSN → Name example and add
a couple more attributes.
SSN Name School Location
101 ANIL DAV PUNE
102 RAVI MSU VADODARA
103 MADHAVI DPS DELHI
104 BRIJESH MSU VADODARA
105 WASIM DAV PUNE
106 NIKHIL DAV PUNE
However, this is a problem. We cannot have PUNE AND JODHPUR in the same row because
we also defined School → Location. So in creating our counterexample, we came upon a
contradiction to our defined FDs.
Hence, the row with PUNE AND JODHPUR is bogus. If you had tried to create a new location
like this:
SSN Name School Location
106 NIKHIL DAV PUNE
106 NIKHIL DPS DELHI
You violate the FD SSN → School; again, a bogus row was created. By being unable to
provide a counterexample, you have demonstrated that the transitivity rule holds. You may
prove the transitivity rule more formally (see Elmasri and Navathe, 2000, p. 479).
There are other inference rules for functional dependencies. We will state them and give an
example, leaving formal proofs to the interested reader (see Elmasri and Navathe, 2000).
If X → Y, then XZ → Y. You might call this rule, "more information is not really needed, but it
doesn't hurt." Suppose we use the same data as before with Names and Cities, and define
the FD Name → City. Now, suppose we add a column, Shoe Size:
Name City Shoe Size
AMIT PUNE 10
RAVI DELHI 6
NITIN JODHPUR 3
Now, I claim that because Name → City, it follows that Name + Shoe Size → City (i.e., we
augmented Name with Shoe Size). Will there be a contradiction here, ever? No. Because we
defined Name → City, Name plus more information will always identify the unique City for that
individual. We can always add information to the LHS of an FD and still have the FD be true.
The Decomposition Rule
The decomposition rule says that if it is given that X → YZ (that is, X defines both Y and Z),
then X → Y and X → Z. Again, an example:
The union rule is the reverse of the decomposition rule in that if X → Y and X → Z, then X →
YZ. The same example of Name, City, and Shoe Size illustrates the rule. If we found
independently or were given that Name → City and given that Name → Shoe Size, we can
immediately write Name → City, Shoe Size.
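As a quick sanity check, the augmentation, decomposition, and union rules can be tried against the Name, City, and Shoe Size sample data. The sketch below is only illustrative: the helper `fd_holds` and the column name `ShoeSize` are assumptions introduced here, and passing on three sample rows shows consistency with the rules, not a proof of them.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in these rows, each LHS value appears with one RHS value only."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample data copied from the Name / City / Shoe Size table above.
people = [
    {"Name": "AMIT", "City": "PUNE", "ShoeSize": 10},
    {"Name": "RAVI", "City": "DELHI", "ShoeSize": 6},
    {"Name": "NITIN", "City": "JODHPUR", "ShoeSize": 3},
]

# Given: Name -> City
given = fd_holds(people, ["Name"], ["City"])

# Augmentation: Name + ShoeSize -> City
augmented = fd_holds(people, ["Name", "ShoeSize"], ["City"])

# Union: Name -> City and Name -> ShoeSize give Name -> City, ShoeSize
united = fd_holds(people, ["Name"], ["City", "ShoeSize"])

# Decomposition: Name -> City, ShoeSize gives each FD separately
decomposed = (fd_holds(people, ["Name"], ["City"])
              and fd_holds(people, ["Name"], ["ShoeSize"]))
```

All four checks come out true on this data, which is what the rules predict.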
The main reason we identify the FDs and inference rules is to be able to find keys and
develop normal forms for relational databases. In any relational table, we want to find out
which attribute(s), if any, will identify the rest of the attributes. An attribute that will identify
all the other attributes in a row is called a "candidate key." A "key" means a "unique identifier"
for a row of information.
Hence, if an attribute or some combination of attributes will always identify all the other attributes
in a row, it is a "candidate" to be "named" a key. To give an example, consider the following:
SSN Name School Location
101 ANIL DAV PUNE
102 RAVI MSU VADODARA
103 MADHAVI DPS DELHI
104 BRIJESH MSU VADODARA
105 WASIM DAV PUNE
106 NIKHIL DAV PUNE
Now suppose I define the following FDs:
SSN → Name
SSN → School
School → Location
What I want is the fewest number of attributes I can find to identify all the rest — hopefully
only one attribute. I know that SSN looks like a candidate, but can I rely on SSN to identify
all the attributes? Put another way, can I show that SSN "defines" all attributes in the
relation? I know that SSN defines Name and School because that is given. I know that I have
the following transitive set of FDs:
SSN → School
School → Location
Therefore, by the transitive rule, I can say that SSN → Location. I have derived the three FDs I
need. Adding the reflexive rule, I can then use the union rule:
SSN → SSN, Name, School, Location
This says that given any SSN, I can find a unique value for each of the other fields for that
SSN. SSN therefore is a candidate key for this relation.
In FD theory, once we find all the FDs that an attribute defines, we have found the closure of the
attribute(s). In our example, the closure of SSN is all the attributes in the relation. Finding a
candidate key is the finding of a closure of an attribute or a set of attributes that defines
all the other attributes.
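The closure computation described above can be sketched as a small fixed-point loop: keep applying any FD whose left-hand side is already covered until nothing new is added. The function name `closure` and the tuple representation of FDs are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Return every attribute defined by `attrs` under the given FDs."""
    result = set(attrs)  # reflexive rule: attributes define themselves
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the LHS is already covered, its RHS is covered too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# The three FDs defined for the SSN / Name / School / Location relation.
fds = [
    (("SSN",), ("Name",)),
    (("SSN",), ("School",)),
    (("School",), ("Location",)),
]
relation = {"SSN", "Name", "School", "Location"}

ssn_closure = closure({"SSN"}, fds)
ssn_is_candidate_key = ssn_closure == relation   # True
school_closure = closure({"School"}, fds)        # {"School", "Location"}
```

Here the closure of SSN is the whole relation, so SSN identifies every other attribute. Because SSN is a single attribute, no smaller set could do the same job. School, by contrast, reaches only Location, so it is not a candidate key.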
Normalization
Normalization is the process of organizing data in a database. This includes creating tables
and establishing relationships between those tables according to rules designed both to
protect the data and to make the database more flexible by eliminating two factors:
redundancy and
inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in
more than one place must be changed, the data must be changed in exactly the same way
in all locations. Inconsistent dependencies can make data difficult to access; the path to find
the data may be missing.
Normalization is the analysis of functional dependencies between attributes. It is the
process of decomposing relations with anomalies to produce well-structured relations.
A well-structured relation contains minimal redundancy and allows insertion, modification,
and deletion without errors or inconsistencies.
Normalization is a formal process for deciding which attributes should be grouped
together in a relation. It is the primary tool to validate and improve a logical design so that it
satisfies certain constraints that avoid unnecessary duplication of data.
Normalization theory is based on the concept of normal forms. A relational table is said to be
in a particular normal form if it satisfies a certain set of constraints. There are currently five
normal forms that have been defined.
Normalization should remove redundancy but not at the expense of data integrity. In
general, the normalization process generates many simple entity specifications from a few
semantically complex entity specifications. Here, entity specification refers to the declaration of
an entity's attributes.
Purpose of Normalization
Normalization allows us to minimize insert, update, and delete anomalies and helps maintain data
consistency in the database.
1. To avoid redundancy by storing each fact within the database only once
2. To put data into a form that can accommodate change more accurately
3. To avoid certain updating “anomalies”
4. To facilitate the enforcement of data constraints
5. To avoid unnecessary coding. Extra programming in triggers and stored procedures can be
required to handle non-normalized data, and this in turn can impair performance
significantly.
Steps in Normalization
A table is in first normal form (1NF) if and only if all columns contain only atomic values; that
is, there are no repeating groups (columns) within a row. Note that all entries in a
field must be of the same kind and each field must have a unique name, but the order of the
fields (columns) is irrelevant.
A relation is in first normal form if every field contains only atomic values, that is, not lists or
sets.
Example
Manager Employees
Fatma Sayed, Tariq
Abdulaziz Tafla, Mohammed
Ali Sarai, Miriam

Manager Employee
Fatma Sayed
Fatma Tariq
Abdulaziz Tafla
Abdulaziz Mohammed
Ali Sarai
Ali Miriam

ID Employee ManagerID
1 Sayed 7
2 Tariq 7
3 Tafla 8
4 Mohammed 8
5 Sarai 9
6 Miriam 9
7 Fatma
8 Abdulaziz
9 Ali
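The step from the first table (a list of employees in one field) to the second (one employee per row) can be sketched as follows. The function name `to_first_normal_form` and the dictionary layout are assumptions made for this illustration; the data is copied from the tables above.

```python
# Not in 1NF: the Employees field holds a comma-separated list.
unnormalized = [
    {"Manager": "Fatma", "Employees": "Sayed, Tariq"},
    {"Manager": "Abdulaziz", "Employees": "Tafla, Mohammed"},
    {"Manager": "Ali", "Employees": "Sarai, Miriam"},
]


def to_first_normal_form(rows):
    """Split each list-valued Employees field into one atomic row per employee."""
    flat = []
    for row in rows:
        for employee in row["Employees"].split(","):
            flat.append({"Manager": row["Manager"], "Employee": employee.strip()})
    return flat


flat_rows = to_first_normal_form(unnormalized)
for row in flat_rows:
    print(row["Manager"], row["Employee"])
```

Three rows with two employees each become six rows, and every field now holds a single value.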
A table is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is
fully dependent on the primary key.
Example
This data is in 1NF: all fields are atomic and CustID serves as the primary key.
But let’s pay attention to the City, State, and Zip fields:
• There are 2 rows of repeating data: one for Chicago, and one for St. Paul.
• Both have the same city, state, and zip code.
Customer Table
CustID FirstName LastName Address Zip
1 Bob Smith 123 Main St. 12345
2 John Brown 555 2nd Ave. 54355
3 Sandy Jessop 4256 James St. 43555
4 Maria Hernandez 4599 Columbia V5N 1M0
5 Gameil Hintz 569 Summit St. 54355
6 James Richardson 12 Cameron Bay S4T 2V8
7 Shiela Green 12 Michigan Ave. 43555
8 Ian Sampson 56 Manitoba St. M5W 9N7
9 Ed Rodgers 15 Athol St. S4T 2V9
• We see that we can actually save 2 rows in the Zip Code table by removing these
redundancies: 9 customer records only need 7 Zip code records.
• Zip code becomes a foreign key in the customer table linked to the primary key in
the Zip code table
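The row count claimed above can be checked with a short sketch. The tuple layout is an assumption made for this illustration, and only the Zip column is extracted because the City and State values do not appear in the Customer table shown here.

```python
# Rows copied from the Customer table above:
# (CustID, FirstName, LastName, Address, Zip)
customers = [
    (1, "Bob", "Smith", "123 Main St.", "12345"),
    (2, "John", "Brown", "555 2nd Ave.", "54355"),
    (3, "Sandy", "Jessop", "4256 James St.", "43555"),
    (4, "Maria", "Hernandez", "4599 Columbia", "V5N 1M0"),
    (5, "Gameil", "Hintz", "569 Summit St.", "54355"),
    (6, "James", "Richardson", "12 Cameron Bay", "S4T 2V8"),
    (7, "Shiela", "Green", "12 Michigan Ave.", "43555"),
    (8, "Ian", "Sampson", "56 Manitoba St.", "M5W 9N7"),
    (9, "Ed", "Rodgers", "15 Athol St.", "S4T 2V9"),
]

# The Zip code table needs one row per distinct Zip value (its primary key).
zip_codes = sorted({row[4] for row in customers})

print(len(customers), "customer rows ->", len(zip_codes), "zip code rows")

# Every customer Zip must match a key in the Zip code table (foreign key check).
all_linked = all(row[4] in zip_codes for row in customers)
```

Zip codes 54355 and 43555 each occur twice, which is why 9 customer rows need only 7 Zip code rows.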
Advantages of 2NF
Hence, in 2NF:
Data is in 1NF
Subsets of data in multiple columns are moved to a new table
These new tables are related using foreign keys
To be in Third Normal Form (3NF) the relation must be in 2NF and no transitive dependencies
may exist within the relation. A transitive dependency is when an attribute is indirectly
functionally dependent on the key (that is, the dependency is through another nonkey
attribute).
In this table:
• CustomerID and ProdID depend on the OrderID and no other column (good)
• Stated another way, “If you know the OrderID, you know the CustID and the
ProdID”
So: OrderID → CustID, ProdID
But there are some fields that are not dependent on OrderID:
• Total is the simple product of Price*Quantity. As such, it has a transitive
dependency on Price and Quantity.
• Because it is a calculated value, it doesn’t need to be included at all.
Also, we can see that Price isn’t really dependent on ProdID, or OrderID. Customer 1001
bought AB-111 for $50 (in order 1) and for $75 (in order 7), while 1002 spent $60 for each
item in order 2.
Maybe price is dependent on the ProdID and Quantity: The more you buy of a given
product the cheaper that product becomes!
So we ask the business manager and she tells us that this is the case.
By splitting out the table, we can quickly adjust our price table to match a competitor or to
reflect price changes from our suppliers.
• It is in 2NF
• It has no transitive dependencies
A transitive dependency exists when one attribute (or field) is determined by
another non-key attribute (or field)
We remove fields with a transitive dependency to a new table and link them
by a foreign key.