
Functional Dependency

A functional dependency is a relationship of one attribute or field in a record to another. In a
database, we often have the case where one field defines the other.
For example, we can say that Social Security Number (SSN) defines a name. What does this
mean? It means that if I have a database with SSNs and names, and if I know someone's
SSN, then I can find their name.
Further, because we used the word "defines," we are saying that for every SSN we will have
one and only one name. We will say that we have defined name as being functionally
dependent on SSN.

The idea of a functional dependency is to define one field as an anchor from which one can
always find a single value for another field.

As another example, suppose that a company assigned each employee a unique employee
number. Each employee has a number and a name. Names might be the same for two
different employees, but their employee numbers would always be different and unique
because the company defined them that way. It would be inconsistent in the database if there
were two occurrences of the same employee number with different names.

We write a functional dependency (FD) with an arrow:

SSN → Name

or

EmpNo → Name

The expression SSN → Name is read "SSN defines Name" or "SSN implies Name."
Let us look at some sample data for the second FD.

EmpNo  Name
101    Anil
102    Ravi
103    Naveen
104    Nitin
105    Nitin

Wait a minute…. You have two people named Nitin! Is this a problem with FDs? Not at all. You
expect that Name will not be unique and it is commonplace for two people to have the
same name. However, no two people have the same EmpNo and for each EmpNo, there is
a Name.
Let us look at a more interesting example:

EmpNo  Job         Name
101    President   Kaitlyn
104    Programmer  Fred
103    Designer    Beryl
103    Programmer  Beryl

Is there a problem here? No. We have the FD EmpNo → Name. This means that every
time we find 104, we find the name Fred, and every time we find 103, we find the name
Beryl, even though 103 appears twice. Just because something is on the left-hand side
(LHS) of an FD does not imply that it is a key or that it will be unique in the database.
The FD X → Y only means that for every occurrence of X you will get the same value of Y.

Let us now consider a new functional dependency in our example. Suppose that Job →
Salary. In this database, everyone who holds a given job title has the same salary. Adding
an attribute to the previous example, we might see this:

EmpNo  Job         Name     Salary
101    President   Kaitlyn  50
104    Programmer  Fred     30
103    Designer    Beryl    35
103    Programmer  Beryl    30

Do we see a contradiction to our known FDs? No. Every time we find an EmpNo, we find
the same Name; every time we find a Job title, we find the same Salary.
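
The check we just did by eye can be sketched in a few lines of Python. This is only an illustration: the helper name fd_holds and the dictionary layout are our own choices, not part of FD theory. Note also that a check like this can only show that the sample rows do not violate an FD; the FD itself is a rule we define, not something proven from four rows.

```python
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Return True if each combination of lhs values appears with one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        seen[key].add(tuple(row[a] for a in rhs))
    return all(len(values) == 1 for values in seen.values())

# Sample rows copied from the table above.
employees = [
    {"EmpNo": 101, "Job": "President",  "Name": "Kaitlyn", "Salary": 50},
    {"EmpNo": 104, "Job": "Programmer", "Name": "Fred",    "Salary": 30},
    {"EmpNo": 103, "Job": "Designer",   "Name": "Beryl",   "Salary": 35},
    {"EmpNo": 103, "Job": "Programmer", "Name": "Beryl",   "Salary": 30},
]

print(fd_holds(employees, ["EmpNo"], ["Name"]))  # True
print(fd_holds(employees, ["Job"], ["Salary"]))  # True
print(fd_holds(employees, ["EmpNo"], ["Job"]))   # False: 103 holds two jobs
```

The last line shows why EmpNo is not a key in this table: it defines Name, but it does not define Job.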

Let us now consider another example. We will go back to the SSN → Name example and add
a couple more attributes.

SSN  Name     School  Location
101  ANIL     DAV     PUNE
102  RAVI     MSU     VADODARA
103  MADHAVI  DPS     DELHI
104  BRIJESH  MSU     VADODARA
105  WASIM    DAV     PUNE
106  NIKHIL   DAV     PUNE

Here, we will define two FDs:

SSN → Name and
School → Location.

Further, we will define this FD: SSN → School.
First, have we violated any FDs with our data? Because all SSNs are unique, there cannot
be an FD violation of SSN → Name. Why?
Because an FD X → Y says that given some value for X, you always get the same Y. Because
the X's are unique, each value of X appears in only one row, so it cannot be paired with two
different values of Y. The same reasoning applies to SSN → School.
How about our second FD, School → Location? There are only three schools in the example
and you may note that for every school, there is only one location, so there is no FD violation.

Now, we want to point out something interesting. If we define a functional dependency X → Y
and we define a functional dependency Y → Z, then we know by inference that X → Z. Here, we
defined SSN → School. We also defined School → Location, so we can infer that SSN →
Location although that FD was not originally mentioned. The inference we have illustrated is
called the transitivity rule of FD inference. Here is the transitivity rule restated:

Given X → Y
Given Y → Z
Then X → Z

To see that the FD SSN → Location is true in our data, you can note that given any value of
SSN, you always find a unique location for that person.
Another way to demonstrate that the transitivity rule is true is to try to invent a row where it is
not true and then see if you violate any of the defined FDs.
We defined these FDs:

Given: SSN → Name
       SSN → School
       School → Location

We are claiming by inference using the transitivity rule that SSN → Location. Suppose that
we add another row with the same SSN and try a different location:

SSN  Name    School  Location
106  NIKHIL  DAV     PUNE
106  NIKHIL  ?       JODHPUR

Now, we have satisfied SSN → Name but violated SSN → Location. Can we do this? We have
no value for School in the new row, but we know that School = "DAV" as defined by SSN → School, so
we would have the following rows:

SSN  Name    School  Location
106  NIKHIL  DAV     PUNE
106  NIKHIL  DAV     JODHPUR

However, this is a problem. We cannot have both PUNE and JODHPUR for the school DAV because
we also defined School → Location. So in creating our counterexample, we came upon a
contradiction to our defined FDs.
Hence, the row with JODHPUR is bogus. If you had tried to create a new location
by changing the school as well, like this:

SSN  Name    School  Location
106  NIKHIL  DAV     PUNE
106  NIKHIL  DPS     DELHI

you would violate the FD SSN → School; again, a bogus row was created. By being unable to
provide a counterexample, you have demonstrated that the transitivity rule holds. You may
prove the transitivity rule more formally (see Elmasri and Navathe, 2000, p. 479).
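
The failed counterexample can also be replayed mechanically. The sketch below is illustrative only; the helper names violations and broken_fds are hypothetical. It adds each of the two attempted rows to the six-row table and reports which of the defined FDs the result breaks.

```python
def violations(rows, lhs, rhs):
    """Return the lhs values that appear with more than one rhs value."""
    seen = {}
    bad = set()
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen and returns it on later visits.
        if seen.setdefault(key, value) != value:
            bad.add(key)
    return bad

COLUMNS = ("SSN", "Name", "School", "Location")
base = [dict(zip(COLUMNS, r)) for r in [
    (101, "ANIL", "DAV", "PUNE"),
    (102, "RAVI", "MSU", "VADODARA"),
    (103, "MADHAVI", "DPS", "DELHI"),
    (104, "BRIJESH", "MSU", "VADODARA"),
    (105, "WASIM", "DAV", "PUNE"),
    (106, "NIKHIL", "DAV", "PUNE"),
]]

# The three FDs we defined.
defined = [(["SSN"], ["Name"]), (["SSN"], ["School"]), (["School"], ["Location"])]

def broken_fds(rows):
    """List the defined FDs that the given rows violate."""
    return [(lhs, rhs) for lhs, rhs in defined if violations(rows, lhs, rhs)]

attempt_1 = base + [dict(zip(COLUMNS, (106, "NIKHIL", "DAV", "JODHPUR")))]
attempt_2 = base + [dict(zip(COLUMNS, (106, "NIKHIL", "DPS", "DELHI")))]

print(broken_fds(base))       # []
print(broken_fds(attempt_1))  # [(['School'], ['Location'])]
print(broken_fds(attempt_2))  # [(['SSN'], ['School'])]
```

Either way we try to give SSN 106 a second location, one of the defined FDs breaks, which is what the transitivity rule predicts.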

There are other inference rules for functional dependencies. We will state them and give an
example, leaving formal proofs to the interested reader (see Elmasri and Navathe, 2000).

The Reflexive Rule

If X is a composite, composed of A and B, then X → A and X → B.

Example: X = Name, City. Then we are saying that X → Name and X → City.

Name   City
AMIT   PUNE
RAVI   DELHI
NITIN  JODHPUR

The rule, which seems quite obvious, says that if I give you the combination <Ravi, Delhi>, you
can answer both questions: What is this person's Name? What is this person's City? While this
rule seems obvious enough, it is necessary to derive other functional dependencies.

The Augmentation Rule

If X → Y, then XZ → Y. You might call this rule, "more information is not really needed, but it
doesn't hurt." Suppose we use the same data as before with Names and Cities, and define
the FD Name → City. Now, suppose we add a column, Shoe Size:

Name   City     Shoe Size
AMIT   PUNE     10
RAVI   DELHI    6
NITIN  JODHPUR  3

Now, I claim that because Name → City, Name + Shoe Size → City (i.e., we augmented
Name with Shoe Size). Will there be a contradiction here, ever? No. Because we defined
Name → City, Name plus more information will always identify the unique City for that
individual. We can always add information to the LHS of an FD and still have the FD be true.

The Decomposition Rule

The decomposition rule says that if it is given that X → YZ (that is, X defines both Y and Z),
then X → Y and X → Z. Again, an example:

Name   City     Shoe Size
AMIT   PUNE     10
RAVI   DELHI    6
NITIN  JODHPUR  3

Suppose I define Name → City, Shoe Size. This means for every occurrence of Name, I have a
unique value of City and a unique value of Shoe Size. The rule says that given Name → City
and Shoe Size together, then Name → City and Name → Shoe Size. A partial proof using the
reflexive rule would be:

1. Name → City, Shoe Size (given)
2. City, Shoe Size → City (by the reflexive rule)
3. Name → City (using steps 1 and 2 and the transitivity rule)

The Union Rule

The union rule is the reverse of the decomposition rule in that if X → Y and X → Z, then X →
YZ. The same example of Name, City, and Shoe Size illustrates the rule. If we found
independently or were given that Name → City and given that Name → Shoe Size, we can
immediately write Name → City, Shoe Size.
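
To make the rules concrete, here is a small sketch that treats an FD as a pair of attribute sets and writes each rule as a function. The FD class and the function names are our own illustration, not a standard library. The transitive function insists that the right side of the first FD equals the left side of the second, exactly as the rule is stated above.

```python
from typing import FrozenSet, NamedTuple

class FD(NamedTuple):
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]

def fd(lhs, rhs):
    """Build an FD from two lists of attribute names."""
    return FD(frozenset(lhs), frozenset(rhs))

def reflexive(x, part):
    """X -> part, valid only when part is a subset of X."""
    x, part = frozenset(x), frozenset(part)
    if not part <= x:
        raise ValueError("part must be a subset of X")
    return FD(x, part)

def augment(f, extra):
    """From X -> Y derive XZ -> Y."""
    return FD(f.lhs | frozenset(extra), f.rhs)

def transitive(f, g):
    """From X -> Y and Y -> Z derive X -> Z."""
    if f.rhs != g.lhs:
        raise ValueError("right side of f must equal left side of g")
    return FD(f.lhs, g.rhs)

def decompose(f):
    """From X -> YZ derive X -> Y and X -> Z (one FD per attribute)."""
    return [FD(f.lhs, frozenset([a])) for a in sorted(f.rhs)]

def union(f, g):
    """From X -> Y and X -> Z derive X -> YZ."""
    if f.lhs != g.lhs:
        raise ValueError("both FDs must have the same left side")
    return FD(f.lhs, f.rhs | g.rhs)

# The partial proof of the decomposition rule, step by step.
step1 = fd(["Name"], ["City", "Shoe Size"])         # given
step2 = reflexive(["City", "Shoe Size"], ["City"])  # reflexive rule
step3 = transitive(step1, step2)                    # Name -> City

print(step3 == fd(["Name"], ["City"]))  # True
print(union(fd(["Name"], ["City"]), fd(["Name"], ["Shoe Size"])) == step1)  # True
```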

Keys and FDs

The main reason we identify the FDs and inference rules is to be able to find keys and
develop normal forms for relational databases. In any relational table, we want to find out
which attribute(s), if any, will identify the rest of the attributes. An attribute that will identify
all the other attributes in a row is called a "candidate key." A "key" means a "unique identifier"
for a row of information.
Hence, if an attribute or some combination of attributes will always identify all the other attributes
in a row, it is a "candidate" to be "named" a key. To give an example, consider the following:

SSN  Name     School  Location
101  ANIL     DAV     PUNE
102  RAVI     MSU     VADODARA
103  MADHAVI  DPS     DELHI
104  BRIJESH  MSU     VADODARA
105  WASIM    DAV     PUNE
106  NIKHIL   DAV     PUNE

Now suppose I define the following FDs:

SSN → Name
SSN → School
School → Location

What I want is the fewest number of attributes I can find to identify all the rest, hopefully
only one attribute. I know that SSN looks like a candidate, but can I rely on SSN to identify
all the attributes? Put another way, can I show that SSN "defines" all attributes in the
relation? I know that SSN defines Name and School because that is given. I know that I have
the following transitive set of FDs:

SSN → School
School → Location

Therefore, by the transitive rule, I can say that SSN → Location. I have derived the three FDs I
need. Adding the reflexive rule, I can then use the union rule:

SSN → Name (given)
SSN → School (given)
SSN → Location (derived by the transitive rule)
SSN → SSN (reflexive rule (obvious))
SSN → SSN, Name, School, Location (union rule)

This says that given any SSN, I can find a unique value for each of the other fields for that
SSN. SSN therefore is a candidate key for this relation.

In FD theory, once we find all the FDs that an attribute defines, we have found the closure of the
attribute(s). In our example, the closure of SSN is all the attributes in the relation. Finding a
candidate key is the finding of a closure of an attribute or a set of attributes that defines
all the other attributes.
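
Finding a closure is mechanical enough to sketch in code. The function below is a minimal illustration (the name closure and the representation of FDs as pairs of sets are our own choices): it keeps adding the right side of any FD whose left side is already covered, until nothing new can be added.

```python
def closure(attrs, fds):
    """Return every attribute that the given attributes define under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

relation = {"SSN", "Name", "School", "Location"}
fds = [
    ({"SSN"}, {"Name"}),
    ({"SSN"}, {"School"}),
    ({"School"}, {"Location"}),
]

print(sorted(closure({"SSN"}, fds)))      # ['Location', 'Name', 'SSN', 'School']
print(closure({"SSN"}, fds) == relation)  # True: SSN is a candidate key
print(sorted(closure({"School"}, fds)))   # ['Location', 'School']
```

The closure of School stops at School and Location, so School alone cannot be a key for this relation.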
Normalization
Normalization is the process of organizing data in a database. This includes creating tables
and establishing relationships between those tables according to rules designed both to
protect the data and to make the database more flexible by eliminating two factors:
redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in
more than one place must be changed, the data must be changed in exactly the same way
in all locations. Inconsistent dependencies can make data difficult to access; the path to find
the data may be missing.
Normalization is the analysis of functional dependencies between attributes. It is the
process of decomposing relations with anomalies to produce well-structured relations.
A well-structured relation contains minimal redundancy and allows insertion, modification,
and deletion without errors or inconsistencies.
Normalization is a formal process for deciding which attributes should be grouped
together in a relation. It is the primary tool to validate and improve a logical design so that it
satisfies certain constraints that avoid unnecessary duplication of data.
Normalization theory is based on the concept of normal forms. A relational table is said to be
in a particular normal form if it satisfies a certain set of constraints. Five numbered
normal forms (1NF through 5NF) are commonly described, along with variants such as
Boyce-Codd normal form (BCNF).
Normalization should remove redundancy but not at the expense of data integrity. In
general, the normalization process generates many simple entity specifications from a few
semantically complex entity specifications. Here, entity specification refers to the declaration of
an entity's attributes.

Purpose of Normalization
Normalization allows us to minimize insert, update, and delete anomalies and helps maintain data
consistency in the database. Its purposes are:
1. To avoid redundancy by storing each fact within the database only once
2. To put data into a form that can more accurately accommodate change
3. To avoid certain updating “anomalies”
4. To facilitate the enforcement of data constraints
5. To avoid unnecessary coding. Extra programming in triggers and stored procedures can be
required to handle non-normalized data, and this in turn can impair performance
significantly.
Steps in Normalization

First Normal Form (1NF)

A table is in first normal form (1NF) if and only if all columns contain only atomic values, that
is, values that are not lists or sets, and there are no repeating groups (columns) within a row.
Note that all entries in a field must be of the same kind and each field must have a unique name,
but the order of the fields (columns) is irrelevant.
Example

Manager    Employees
Fatma      Sayed, Tariq
Abdulaziz  Tafla, Mohammed
Ali        Sarai, Miriam

This data has some problems:

• The Employees column is not atomic.
   o A column must be atomic, meaning that it can only hold a single item of
     data. This column holds more than one employee name.
• Data that is not atomic means:
   o We can’t easily sort the data
   o We can’t easily search or index the data
   o We can’t easily change the data
   o We can’t easily reference the data in other tables
• Breaking the Employees column into more than one column doesn’t solve our
  problems:
   o The data may look atomic, but only because we have many identical
     columns storing a single piece of data instead of a single column storing
     many pieces of data.

Manager    Employee1  Employee2
Fatma      Sayed      Tariq
Abdulaziz  Tafla      Mohammed
Ali        Sarai      Miriam

   o We still can’t easily sort, search, or index our employees.
   o What if a manager has more than 2 employees, say 10 or 100
     employees? We’d need to add columns to our database just for
     these cases.
   o It is still hard to reference our employees in other tables.

• 1NF means that we must:
   o Eliminate duplicate columns from the same table, and
   o Create a separate table for each group of related data,
     each with a unique row identifier (primary key)

So, making our columns atomic…

• By breaking each tuple of our table into an entry for each employee,
  we have made our data atomic.

Manager    Employee
Fatma      Sayed
Fatma      Tariq
Abdulaziz  Tafla
Abdulaziz  Mohammed
Ali        Sarai
Ali        Miriam

• Of course, there may come a day when we hire a second employee or
  manager with the same name. To avoid ambiguity, let’s use an employee ID
  instead of their name.

ID  Employee   ManagerID
1   Sayed      7
2   Tariq      7
3   Tafla      8
4   Mohammed   8
5   Sarai      9
6   Miriam     9
7   Fatma
8   Abdulaziz
9   Ali
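
The two steps above (one row per employee, then IDs instead of names) can be sketched as follows. This is only an illustration: the variable names are hypothetical, and looking people up by name works here only because every name in the sample happens to be unique.

```python
# Non-atomic input: one row per manager, employees packed into one string.
raw = [
    ("Fatma", "Sayed, Tariq"),
    ("Abdulaziz", "Tafla, Mohammed"),
    ("Ali", "Sarai, Miriam"),
]

# Step 1: one (manager, employee) row per employee, so every value is atomic.
pairs = [
    (manager, employee.strip())
    for manager, employees in raw
    for employee in employees.split(",")
]

# Step 2: replace names with IDs. Employees are numbered first, then managers,
# only so that the result matches the table above.
people = [employee for _, employee in pairs] + [manager for manager, _ in raw]
ids = {name: number for number, name in enumerate(people, start=1)}
manager_of = {employee: manager for manager, employee in pairs}

# Managers have no entry in manager_of, so their ManagerID is None.
staff = [(ids[name], name, ids.get(manager_of.get(name))) for name in people]

for row in staff:
    print(row)
```

The first row printed is (1, 'Sayed', 7) and the last is (9, 'Ali', None), matching the ID table.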

Second Normal Form (2NF)

A table is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is
fully dependent on the primary key, that is, on the whole key and not just on part of a
composite key.

• A database in 2NF must also be in 1NF:
   o Data must be atomic
   o Every row (or tuple) must have a unique primary key
• Plus:
   o Subsets of data that apply to multiple rows (repeating data) are moved to
     separate tables

Example

CustID  FirstName  LastName    Address           City       State  Zip
1       Bob        Smith       123 Main St.      Tucson     AZ     12345
2       John       Brown       555 2nd Ave.      St. Paul   MN     54355
3       Sandy      Jessop      4256 James St.    Chicago    IL     43555
4       Maria      Hernandez   4599 Columbia     Vancouver  BC     V5N 1M0
5       Gameil     Hintz       569 Summit St.    St. Paul   MN     54355
6       James      Richardson  12 Cameron Bay    Regina     SK     S4T 2V8
7       Shiela     Green       12 Michigan Ave.  Chicago    IL     43555
8       Ian        Sampson     56 Manitoba St.   Winnipeg   MB     M5W 9N7
9       Ed         Rodgers     15 Athol St.      Regina     SK     S4T 2V9

This data is in 1NF: all fields are atomic and CustID serves as the primary key.

• But let’s pay attention to the City, State, and Zip fields:
   o Two combinations are repeated: one for Chicago and one for St. Paul.
   o Each repeated pair of rows has the same city, state, and zip code.

City       State  Zip
Tucson     AZ     12345
St. Paul   MN     54355
Chicago    IL     43555
Vancouver  BC     V5N 1M0
St. Paul   MN     54355
Regina     SK     S4T 2V8
Chicago    IL     43555
Winnipeg   MB     M5W 9N7
Regina     SK     S4T 2V9

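To be in 2NF under the informal rule above, this repeating data must be in its own table.
Strictly speaking, CustID is a single-attribute key, so no nonkey attribute can depend on only
part of it; Zip → City, State is a dependency on a nonkey attribute, which is the kind of
dependency 3NF removes. Either way, the fix is the same.

So:
• Let’s create a Zip code table that maps Zip codes to their City and State.
• Note that Canadian postal codes are different: the same city and province can
have many different postal codes.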

Customer Table

CustID  FirstName  LastName    Address           Zip
1       Bob        Smith       123 Main St.      12345
2       John       Brown       555 2nd Ave.      54355
3       Sandy      Jessop      4256 James St.    43555
4       Maria      Hernandez   4599 Columbia     V5N 1M0
5       Gameil     Hintz       569 Summit St.    54355
6       James      Richardson  12 Cameron Bay    S4T 2V8
7       Shiela     Green       12 Michigan Ave.  43555
8       Ian        Sampson     56 Manitoba St.   M5W 9N7
9       Ed         Rodgers     15 Athol St.      S4T 2V9

Zip code Table

Zip      City       State
12345    Tucson     AZ
54355    St. Paul   MN
43555    Chicago    IL
V5N 1M0  Vancouver  BC
S4T 2V8  Regina     SK
M5W 9N7  Winnipeg   MB
S4T 2V9  Regina     SK

• We see that we can actually save 2 rows in the Zip Code table by removing these
redundancies: 9 customer records only need 7 Zip code records.
• Zip code becomes a foreign key in the customer table linked to the primary key in
the Zip code table
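
The split into two tables can be sketched as follows. This is an illustration only, with hypothetical variable names. The loop also refuses to continue if one Zip code ever shows up with two different City and State pairs, because the split relies on Zip → City, State. Joining the two tables back together reproduces the original rows, so no information is lost.

```python
# Original 1NF rows: (CustID, FirstName, LastName, Address, City, State, Zip).
customers = [
    (1, "Bob", "Smith", "123 Main St.", "Tucson", "AZ", "12345"),
    (2, "John", "Brown", "555 2nd Ave.", "St. Paul", "MN", "54355"),
    (3, "Sandy", "Jessop", "4256 James St.", "Chicago", "IL", "43555"),
    (4, "Maria", "Hernandez", "4599 Columbia", "Vancouver", "BC", "V5N 1M0"),
    (5, "Gameil", "Hintz", "569 Summit St.", "St. Paul", "MN", "54355"),
    (6, "James", "Richardson", "12 Cameron Bay", "Regina", "SK", "S4T 2V8"),
    (7, "Shiela", "Green", "12 Michigan Ave.", "Chicago", "IL", "43555"),
    (8, "Ian", "Sampson", "56 Manitoba St.", "Winnipeg", "MB", "M5W 9N7"),
    (9, "Ed", "Rodgers", "15 Athol St.", "Regina", "SK", "S4T 2V9"),
]

zip_table = {}       # Zip -> (City, State); Zip is the primary key
customer_table = []  # (CustID, FirstName, LastName, Address, Zip)

for cust_id, first, last, address, city, state, zip_code in customers:
    # Store each Zip once; a second, different value would break Zip -> City, State.
    if zip_table.setdefault(zip_code, (city, state)) != (city, state):
        raise ValueError(f"Zip {zip_code} maps to two different places")
    customer_table.append((cust_id, first, last, address, zip_code))

print(len(customer_table), len(zip_table))  # 9 7

# Join back through the foreign key to rebuild the original rows.
rebuilt = [c[:4] + zip_table[c[4]] + (c[4],) for c in customer_table]
print(rebuilt == customers)  # True
```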

Advantages of 2NF

Saves space in the database by reducing redundancies


If a customer calls, you can just ask them for their Zip code and you’ll know their city and
state! (No more spelling mistakes)
If a City name changes, we only need to make one change to the database.

Hence in 2 NF

Data is in 1NF
Subsets of data in multiple columns are moved to a new table
These new tables are related using foreign keys

Third Normal Form (3NF)

To be in Third Normal Form (3NF) the relation must be in 2NF and no transitive dependencies
may exist within the relation. A transitive dependency is when an attribute is indirectly
functionally dependent on the key (that is, the dependency is through another nonkey
attribute).

• To be in 3NF, a database must be:
   o In 2NF
   o Free of transitive dependencies: every nonkey column must depend directly on the
     primary key

OrderID  CustID  ProdID  Price  Quantity  Total
1        1001    AB-111  50     1,000     50,000
2        1002    AB-111  60     500       30,000
3        1001    ZA-245  35     100       3,500
4        1003    MB-153  82     25        2,050
5        1004    ZA-245  42     10        420
6        1002    ZA-245  40     50        2,000
7        1001    AB-111  75     100       7,500

• In this table:
   o CustID and ProdID depend on the OrderID and no other column (good)
   o Stated another way, “If you know the OrderID, you know the CustID and the
     ProdID”
• So: OrderID → CustID, ProdID

But there are some fields that are not determined directly by OrderID:
• Total is the simple product of Price * Quantity. As such, it has a transitive
  dependency on OrderID through Price and Quantity.
• Because it is a calculated value, it doesn’t need to be stored at all.

• Also, we can see that Price isn’t determined by ProdID alone, and it doesn’t depend
  directly on OrderID either. Customer 1001 bought AB-111 for $50 (in order 1) and for
  $75 (in order 7), while 1002 paid $60 for each item in order 2.

• Maybe Price depends on ProdID and Quantity together: the more you buy of a given
  product, the cheaper that product becomes!
• So we ask the business manager and she tells us that this is the case.

• We say that Price has a transitive dependency through ProdID and Quantity.
   o This means that Price isn’t determined directly by the OrderID. It is determined by
     the size (or quantity) of the order and, of course, by what is ordered.

• Let’s diagram the dependencies:

   OrderID → CustID, ProdID, Quantity
   ProdID, Quantity → Price
   Price, Quantity → Total

• We can see that all fields are dependent on OrderID, the primary key (white lines).
• But Total is also determined by Price and Quantity (yellow lines).
   o This is a derived field
     (Price x Quantity = Total)
   o We can save a lot of space by getting rid of it altogether and just calculating the total
     when we need it
• Price is also determined by both ProdID and Quantity rather than directly by the primary
  key (red lines). This is called a transitive dependency. We must get rid of transitive
  dependencies to have 3NF.

• We do this by moving the transitive dependency into a second table…

• By splitting out the table, we can quickly adjust our price table to match a competitor, or
  when our suppliers change their prices.

• The second table is our pricing list.

• Think of Quantity as a range:
   o AB-111: 1-100, 101-500, 501 and more
   o ZA-245: 1-10, 11-50, 51 and more
• The primary key for this second table is a composite of ProdID and Quantity.

We’re now in 3NF!

• We can also quickly figure out what price to offer our customers for any quantity they
  want.
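
Here is a sketch of how the pricing list and the trimmed order table might work together. The price attached to each range is inferred by matching the order table against the quantity ranges above. The single tier for MB-153 is an assumption, since only one order for that product is shown, and the names PRICE_LIST, unit_price, and order_total are hypothetical. Price and Total are no longer stored with the order; both are looked up or calculated when needed.

```python
# Pricing list keyed by ProdID; each entry is (minimum quantity, unit price).
PRICE_LIST = {
    "AB-111": [(1, 75), (101, 60), (501, 50)],
    "ZA-245": [(1, 42), (11, 40), (51, 35)],
    "MB-153": [(1, 82)],  # assumption: one tier, inferred from order 4 only
}

def unit_price(prod_id, quantity):
    """Return the unit price for the range that the quantity falls into."""
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    price = None
    for minimum, tier_price in PRICE_LIST[prod_id]:
        if quantity >= minimum:
            price = tier_price  # tiers are listed in ascending order
    return price

# Order table in 3NF: (OrderID, CustID, ProdID, Quantity).
orders = [
    (1, 1001, "AB-111", 1000),
    (2, 1002, "AB-111", 500),
    (3, 1001, "ZA-245", 100),
    (4, 1003, "MB-153", 25),
    (5, 1004, "ZA-245", 10),
    (6, 1002, "ZA-245", 50),
    (7, 1001, "AB-111", 100),
]

def order_total(order):
    """Calculate Total on demand instead of storing it."""
    _, _, prod_id, quantity = order
    return unit_price(prod_id, quantity) * quantity

totals = [order_total(order) for order in orders]
print(totals)  # [50000, 30000, 3500, 2050, 420, 2000, 7500]
```

The calculated totals agree with the Total column of the original table, which confirms that nothing was lost by dropping Price and Total from the order table.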

Hence, a database is in 3NF if:

• It is in 2NF
• It has no transitive dependencies
   o A transitive dependency exists when one attribute (or field) is determined by
     another non-key attribute (or field)
   o We remove fields with a transitive dependency to a new table and link them
     by a foreign key.
