You are on page 1of 39

Agenda

• Data Normalization – Additional Material


• Homewrok 3
• Kaggle Datasets
• https://www.kaggle.com/datasets
• Exam # 1 Date

1
Normalization
Normalization
Process for evaluating and correcting table structures to
minimize data redundancies .
“Reduces data anomalies”

Normalization is a process that “improves” a database


design by generating relations that are of higher normal
forms.

The objective of normalization:


“to create relations where every dependency is on the key,
the whole key, and nothing but the key”.

2
Normalization
There is a sequence to normal forms:
1NF is considered the weakest,
2NF is stronger than 1NF,
3NF is stronger than 2NF, and
BCNF is considered the strongest

Also,
any relation that is in BCNF, is in 3NF;
any relation in 3NF is in 2NF; and
any relation in 2NF is in 1NF.

3
Normalization

1NF a relation in BCNF, is also


in 3NF
2NF a relation in 3NF is also in
2NF
3NF
a relation in 2NF is also in
1NF
BCNF

4
Normalization
We consider a relation in BCNF to be fully normalized.

The benefit of higher normal forms is that update semantics for


the affected data are simplified.

This means that applications required to maintain the database


are simpler.

A design that has a lower normal form than another design has
more redundancy. Uncontrolled redundancy can lead to data
integrity problems.

First we introduce the concept of functional dependency

5
Functional Dependencies
Functional Dependencies
We say an attribute, B, has a functional dependency on
another attribute, A, if for any two records, which have
the same value for A, then the values for B in these two
records must be the same. We illustrate this as:
AB
Example: Suppose we keep track of employee email
addresses, and we only track one email address for each
employee. Suppose each employee is identified by their
unique employee number. We say there is a functional
dependency of email address on employee number:

employee number  email address


6
Functional Dependencies
EmpNum EmpEmail EmpFname EmpLname
123 jdoe@abc.com John Doe
456 psmith@abc.com Peter Smith
555 alee1@abc.com Alan Lee
633 pdoe@abc.com Peter Doe
787 alee2@abc.com Alan Lee

If EmpNum is the PK then the FDs:


EmpNum  EmpEmail
EmpNum  EmpFname
EmpNum  EmpLname
must exist.

7
Functional Dependencies
EmpNum  EmpEmail
EmpNum  EmpFname 3 different ways
EmpNum  EmpLname you might see FDs
depicted
EmpEmail
EmpNum EmpFname

EmpLname

EmpNum EmpEmail EmpFname EmpLname

8
Determinant
Functional Dependency

EmpNum  EmpEmail

Attribute on the LHS is known as the determinant


• EmpNum is a determinant of EmpEmail

9
• Product table (Prodcut_code, price, type)
What is the determinant?

10
Transitive dependency
Transitive dependency
Consider attributes A, B, and C, and where
A  B and B  C.
Functional dependencies are transitive, which
means that we also have the functional dependency
AC
We say that C is transitively dependent on A
through B.

11
Transitive dependency
EmpNum  DeptNum

EmpNum EmpEmail DeptNum DeptNname

DeptNum  DeptName

EmpNum EmpEmail DeptNum DeptNname

DeptName is transitively dependent on EmpNum via DeptNum


EmpNum  DeptName

12
• Table A (Student_id, Student_name, Prof_id,
Prof_name)
– How many functional dependencies are there in
the relation (table)? What are they?
– Where is the “transitive dependency”?

May 2012 91.2814 13


Partial dependency
A partial dependency exists when an attribute B is
functionally dependent on an attribute A, and A is a
component of a multipart candidate key.

InvNum LineNum Qty InvDate

Candidate keys: {InvNum, LineNum} InvDate is


partially dependent on {InvNum, LineNum} as
InvNum is a determinant of InvDate and InvNum is
part of a candidate key
14
15
First Normal Form
First Normal Form
We say a relation is in 1NF if all values stored in the
relation are single-valued and atomic.

1NF places restrictions on the structure of relations.


Values must be simple.

16
First Normal Form
The following in not in 1NF

EmpNum EmpPhone EmpDegrees


123 233-9876
333 233-1231 BA, BSc, PhD
679 233-1231 BSc, MSc

EmpDegrees is a multi-valued field:


employee 679 has two degrees: BSc and MSc
employee 333 has three degrees: BA, BSc, PhD

17
Second Normal Form
Second Normal Form
A relation is in 2NF if it is in 1NF, and every non-key attribute
is fully dependent on each candidate key. (That is, we don’t
have any partial functional dependency.)

• 2NF (and 3NF) both involve the concepts of key and


non-key attributes.
• A key attribute is any attribute that is part of a key;
any attribute that is not a key attribute, is a non-key
attribute.
•A relation in 2NF will not have any partial dependencies

18
Second Normal Form
Consider this InvLine table (in 1NF):
InvNum LineNum ProdNum Qty InvDate
InvNum, LineNum ProdNum, Qty
There are two
candidate keys.
InvDate is the only
non-key attribute, and
InvNum InvDate it is dependent on
InvLine is not 2NF since there is a partial InvNum
dependency of InvDate on InvNum
InvLine is
only in 1NF

19
Second Normal Form
InvLine
InvNum LineNum ProdNum Qty InvDate
The above relation has redundancies: the invoice date is
repeated on each invoice line.
We can improve the database by decomposing the relation
into two relations:
InvNum LineNum ProdNum Qty

InvNum InvDate

Question: What is the highest normal form for these


relations? 2NF? 3NF? BCNF?
20
2NF, but not in 3NF:

EmployeeDept
ename ssn bdate address dnumber dname

since dnumber is not a candidate key and we have:

dnumber  dname.

22
Third Normal Form
Third Normal Form
• A relation is in 3NF if the relation is in 1NF and all
determinants of non-key attributes are candidate keys
That is, for any functional dependency: X  Y, where Y is
a non-key attribute (or a set of non-key attributes), X is a
candidate key.
• This definition of 3NF differs from BCNF only in the
specification of non-key attributes - 3NF
• A relation in 3NF will not have any transitive dependencies
of non-key attribute on a candidate key through another
non-key attribute.

23
Third Normal Form
Consider this Employee relation Candidate keys
are? …

EmpNum EmpName DeptNum DeptName

EmpName, DeptNum, and DeptName are non-key attributes.


DeptNum determines DeptName, a non-key attribute, and
DeptNum is not a candidate key.

Is the relation in 3NF? … no Is the relation in BCNF? … no


Is the relation in 2NF? … yes

24
Third Normal Form
EmpNum EmpName DeptNum DeptName

We correct the situation by decomposing the original relation


into two 3NF relations. Note the decomposition is lossless.

EmpNum EmpName DeptNum DeptNum DeptName

Verify these two relations are in 3NF.

25
Patient relation example
Patient # Surgeon # Surg. date Patient Name Patient Addr Surgeon Surgery Postop drug
Drug side effects

Gallstone
s removal;
Jan 1, 15 New St. Beth Little Kidney
145 1995; June New York, Michael stones Penicillin, rash
1111 311 12, 1995 John White NY Diamond removal none- none

Eye
Charles Cataract
Apr 5, Field removal
243 1994 May 10 Main St. Patricia Thrombos Tetracyclin Fever
1234 467 10, 1995 Mary Jones Rye, NY Gold is removal e none none
Dogwood
Lane Open
Jan 8, Harrison, David Heart Cephalosp
2345 189 1996 Charles Brown NY Rosen Surgery orin none
55 Boston
Post Road,
Nov 5, Chester, Cholecyst
4876 145 1995 Hal Kane CN Beth Little ectomy Demicillin none
Blind Brook Gallstone
May 10, Mamaronec s
5123 145 1995 Paul Kosher k, NY Beth Little Removal none none
Eye
Cornea
Replacem
Apr 5, Hilton Road ent Eye
1994 Dec Larchmont, Charles cataract Tetracyclin
6845 243 15, 1984 Ann Hood NY Field removal e Fever

26
Patient #
Unnormalized Relation
Surgeon # Surg. date Patient Name Patient Addr Surgeon Surgery Postop drug
Drug side effects

Gallstone
s removal;
Jan 1, 15 New St. Beth Little Kidney
145 1995; June New York, Michael stones Penicillin, rash
1111 311 12, 1995 John White NY Diamond removal none- none

Eye
Charles Cataract
Apr 5, Field removal
243 1994 May 10 Main St. Patricia Thrombos Tetracyclin Fever
1234 467 10, 1995 Mary Jones Rye, NY Gold is removal e none none
Dogwood
Lane Open
Jan 8, Harrison, David Heart Cephalosp
2345 189 1996 Charles Brown NY Rosen Surgery orin none
55 Boston
Post Road,
Nov 5, Chester, Cholecyst
4876 145 1995 Hal Kane CN Beth Little ectomy Demicillin none
Blind Brook Gallstone
May 10, Mamaronec s
5123 145 1995 Paul Kosher k, NY Beth Little Removal none none
Eye
Cornea
Replacem
Apr 5, Hilton Road ent Eye
1994 Dec Larchmont, Charles cataract Tetracyclin
6845 243 15, 1984 Ann Hood NY Field removal e Fever
First Normal Form
• To move to First Normal Form a relation must
contain only atomic values at each row and
column.
– No repeating groups
– A column or set of columns is called a Candidate
Key when its values can uniquely identify the row
in the relation.
First Normal Form
Patient # Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name Surgery Drug adminSide Effects

15 New St.
New York, Gallstone
1111 145 01-Jan-95 John White NY Beth Little s removal Penicillin rash
15 New St. Kidney
New York, Michael stones
1111 311 12-Jun-95 John White NY Diamond removal none none
Eye
10 Main St. Cataract Tetracyclin
1234 243 05-Apr-94 Mary Jones Rye, NY Charles Field removal e Fever

10 Main St. Thrombos


1234 467 10-May-95 Mary Jones Rye, NY Patricia Gold is removal none none
Dogwood
Lane Open
Charles Harrison, Heart Cephalosp
2345 189 08-Jan-96 Brown NY David Rosen Surgery orin none

55 Boston
Post Road,
Chester, Cholecyst
4876 145 05-Nov-95 Hal Kane CN Beth Little ectomy Demicillin none

Blind Brook Gallstone


Mamaronec s
5123 145 10-May-95 Paul Kosher k, NY Beth Little Removal none none
Eye
Hilton Road Cornea
Larchmont, Replacem Tetracyclin
6845 243 05-Apr-94 Ann Hood NY Charles Field ent e Fever

Hilton Road Eye


Larchmont, cataract
6845 243 15-Dec-84 Ann Hood NY Charles Field removal none none
1NF Storage Anomalies
• Insertion: A new patient has not yet undergone surgery --
hence no surgeon # -- Since surgeon # is part of the key we
can’t insert.
• Insertion: If a surgeon is newly hired and hasn’t operated yet -
- there will be no way to include that person in the database.
• Update: If a patient comes in for a new procedure, and has
moved, we need to change multiple address entries.
• Deletion (type 1): Deleting a patient record may also delete all
info about a surgeon.
• Deletion (type 2): When there are functional dependencies
(like side effects and drug) changing one item eliminates other
information.
Second Normal Form
• A relation is said to be in Second Normal Form
when every nonkey attribute is fully
functionally dependent on the primary key.
– That is, every nonkey attribute needs the full
primary key for unique identification
Second Normal Form
Patient # Patient Name Patient Address
15 New St. New
1111 John White York, NY
10 Main St. Rye,
1234 Mary Jones NY
Charles Dogwood Lane
2345 Brown Harrison, NY
55 Boston Post
4876 Hal Kane Road, Chester,
Blind Brook
5123 Paul Kosher Mamaroneck, NY
Hilton Road
6845 Ann Hood Larchmont, NY
Second Normal Form
Surgeon # Surgeon Name

145 Beth Little

189 David Rosen

243 Charles Field

311 Michael Diamond

467 Patricia Gold


Second Normal Form
Patient # Surgeon # Surgery Date Surgery Drug Admin Side Effects
Gallstones
1111 145 01-Jan-95 removal
Kidney Penicillin rash
stones
1111 311 12-Jun-95 removal none none
Eye Cataract
1234 243 05-Apr-94 removal Tetracycline Fever
Thrombosis
1234 467 10-May-95 removal none none
Open Heart Cephalospori
2345 189 08-Jan-96 Surgery n none
Cholecystect
4876 145 05-Nov-95 omy Demicillin none
Gallstones
5123 145 10-May-95 Removal none none
Eye cataract
6845 243 15-Dec-84 removal none none
Eye Cornea
6845 243 05-Apr-94 Replacement Tetracycline Fever
1NF Storage Anomalies Removed

• Insertion: Can now enter new patients without


surgery.
• Insertion: Can now enter Surgeons who haven’t
operated.
• Deletion (type 1): If Charles Brown dies the
corresponding tuples from Patient and Surgery tables
can be deleted without losing information on David
Rosen.
• Update: If John White comes in for third time, and
has moved, we only need to change the Patient table
2NF Storage Anomalies
• Insertion: Cannot enter the fact that a particular drug
has a particular side effect unless it is given to a
patient.
• Deletion: If John White receives some other drug
because of the penicillin rash, and a new drug and
side effect are entered, we lose the information that
penicillin can cause a rash
• Update: If drug side effects change (a new formula)
we have to update multiple occurrences of side
effects.
Third Normal Form
• A relation is said to be in Third Normal Form if there
is no transitive functional dependency between
nonkey attributes
– When one nonkey attribute can be determined with one
or more nonkey attributes there is said to be a transitive
functional dependency.
• The side effect column in the Surgery table is
determined by the drug administered
– Side effect is transitively functionally dependent on drug
so Surgery is not 3NF
Third Normal Form
Patient # Surgeon # Surgery Date Surgery Drug Admin

1111 145 01-Jan-95 Gallstones removal Penicillin


Kidney stones
1111 311 12-Jun-95 removal none

1234 243 05-Apr-94 Eye Cataract removal Tetracycline

1234 467 10-May-95 Thrombosis removal none

2345 189 08-Jan-96 Open Heart Surgery Cephalosporin

4876 145 05-Nov-95 Cholecystectomy Demicillin

5123 145 10-May-95 Gallstones Removal none

6845 243 15-Dec-84 Eye cataract removal none


Eye Cornea
6845 243 05-Apr-94 Replacement Tetracycline
Third Normal Form

Drug Admin Side Effects


Cephalosporin none
Demicillin none
none none
Penicillin rash
Tetracycline Fever
2NF Storage Anomalies Removed

• Insertion: We can now enter the fact that a


particular drug has a particular side effect in
the Drug relation.
• Deletion: If John White recieves some other
drug as a result of the rash from penicillin, but
the information on penicillin and rash is
maintained.
• Update: The side effects for each drug appear
only once.

You might also like