
Chapter 9:

Normalization
• Part 1: A Simple Example
• Part 2: Another Example & The
Formal Stuff

A Problem: Keeping Track of Invoices (cont’d)

Suppose we have some invoices that we may or may not want to refer to later…

A Problem: Keeping Track of Invoices (cont’d)

Fig. 9.1

Could store these in an Excel file but, as seen, might run into problems with complex
questions relating to the data:
1. How many 4” bolts did Frankenstein Parts order in 2002?
2. What items were sold on a certain date?

Solution: A Normalized Database


• First Normal Form (NF1):
No Repeating Elements or Groups of Elements
• In Fig. 9.1, rows 2, 3, 4 represent invoice 125, which in
DB terms is a single tuple
• In NF1 want to get rid of repeating elements, which
are:
– column H2 to H4, column J2 to J4, column K2 to K4 etc
– these contain lists of values, and these are hated by NF1
– NF1 wants atomicity: each attribute is simple & indivisible
– the repeating data for invoice 125 is cells: H2-M2, H3-M3, H4-
M4
• Can satisfy NF1, simply by separating each item in
these lists into its own row (See Fig. 9.2).
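
The row-splitting step above can be sketched in Python. This is only an illustration: the invoice record, its values and the helper name to_first_normal_form are assumptions made up for the example, not data taken from Fig. 9.1.

```python
# Hypothetical spreadsheet-style record: one invoice whose item cells hold lists.
invoice = {
    "order_id": 125,
    "order_date": "2002-09-13",
    "customer_id": 56,
    "item_id": [563, 851, 652],
    "item_qty": [4, 32, 5],
}


def to_first_normal_form(record, list_columns):
    """Return one row per list position, copying the scalar columns into each row."""
    n = len(record[list_columns[0]])
    if any(len(record[c]) != n for c in list_columns):
        raise ValueError("repeating groups must all have the same length")
    rows = []
    for i in range(n):
        # Scalar (non-repeating) columns are repeated on every row.
        row = {k: v for k, v in record.items() if k not in list_columns}
        # Each repeating column contributes exactly one atomic value.
        for c in list_columns:
            row[c] = record[c][i]
        rows.append(row)
    return rows


rows = to_first_normal_form(invoice, ["item_id", "item_qty"])
for row in rows:
    print(row)
```

Every value in the result is atomic, at the cost of repeating the order and customer columns on each row, which is the extra data the next slide mentions.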

Solution: NF1 Cont’d

Fig. 9.2

But we were trying to reduce & simplify, and now have introduced more data!
No matter, this will be addressed later (with NF3)

Solution: NF1 Cont’d


• Have only done half of NF1. NF1 addresses:
1. Row of data can’t have repeat groups of similar data (atomicity) ✓
2. Each row of data must have a unique identifier (or Primary Key)
• In order to look at 2., have to convert Fig 9.2 into a
RDBMS (see the orders table in MS Access Fig. 9.3)
Fig. 9.3

• As can be seen, no one column ids each row, so have to
use two together: order_id & item_id
• Together the concatenated primary key ids each row
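
A minimal sketch of a concatenated primary key, using Python's built-in sqlite3 module rather than MS Access. Only three of the orders columns are included to keep it short, and the inserted values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)   -- concatenated primary key
    )
""")
conn.execute("INSERT INTO orders VALUES (125, 563, 4)")
conn.execute("INSERT INTO orders VALUES (125, 851, 32)")   # same order, different item: allowed
conn.execute("INSERT INTO orders VALUES (126, 563, 500)")  # same item, different order: allowed

try:
    # Repeats an existing (order_id, item_id) pair, so the key rejects it.
    conn.execute("INSERT INTO orders VALUES (125, 563, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(duplicate_rejected, row_count)
```

Neither column is unique on its own; only the pair is.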

Solution: NF1 Cont’d
• The underlying structure of the orders table can be represented as Fig. 9.4
• Identify the columns that make up the primary key with the PK notation.
• Fig. 9.4 begins the Entity Relationship Diagram (or ERD).
• DB schema now satisfies the 2 requirements of NF1: atomicity
& uniqueness. Thus it meets the most basic criteria of a relational db.

Fig. 9.4
orders
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
item_id(PK)
item_description
item_qty
item_price
item_total_price
order_total_price

Solution: NF2
• Second Normal Form (NF2):
No Partial Dependencies on a Concatenated Key
• Next have to test each table for partial dependencies on a
concatenated key
• Means that for a table with a concatenated primary key, each
column that is not part of the primary key must depend upon the
entire concatenated key for its existence.
• If a column depends upon only 1 part of the concatenated key,
then entire table has failed NF2 & must create another table to fix
it.
• For each column must ask the question:
– Can this column exist without one or the other part of the
concatenated primary key?
– If answer is “yes” – even once – table fails NF2

Solution: NF2 Cont’d
• Refer to Fig. 9.4 again to recall orders table structure.
• Recall the meaning of the two columns in the primary key:
– order_id ids invoice this item comes from.
– item_id is the inventory item's unique identifier.
Can think of it as a part number.
• Don't analyze these columns (since they are part of the primary key).
• Instead consider the remaining columns...

Fig. 9.4
orders
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
item_id(PK)
item_description
item_qty
item_price
item_total_price
order_total_price

Solution: NF2 Cont’d


• order_date is the date on which the order was made.
– relies on order_id; an order date has to have an order,
otherwise it is only a date
– can an order date exist without an item_id? yes: order_date
relies on order_id, not item_id (a specific order doesn’t have
to have a specific item)
– so order_date fails NF2 ✗
• customer_id is ID of the customer who placed the order
– does it rely on order_id? No: a customer can exist without
placing any orders.
– does it rely on item_id? No (same reason).
– customer_id does not rely on either member of the PK
– What to do? NF3 will come to the rescue here, hence ? for all
the rest of the customer_* columns

Solution: NF2 Cont’d
• item_description is next column not itself part of PK. It is
the plain-language description of the inventory item.
– relies on item_id, but can it exist without an order_id?
– Yes! An inventory item (& "description") could sit on a shelf, and
never be purchased... It can exist independent of an order.
– item_description fails the test. ✗
• item_qty is no. of items purchased on a particular
invoice.
– can it exist without an item_id? No: can't have "amount of
nothing"
– can it exist without an order_id? No: a quantity purchased with
an invoice is meaningless without an invoice.
– So this column does not violate NF2
– item_qty depends on both parts of our concatenated PK. ✓

Solution: NF2 Cont’d


• item_price is similar to item_description. It depends on
the item_id but not on order_id, so it does violate NF2. ✗
• item_total_price is tricky:
– seems to depend on both order_id & item_id, so passes NF2.
– but it is a derived value: it is item_qty times item_price.
– so, in fact, it doesn’t belong in the db at all.
– can easily be reconstructed outside of db; to include it would be
redundant (and could quite possibly introduce corruption).
– therefore can discard it
• order_total_price, the sum of all the item_total_price
fields for a particular order, is another derived value.
– can discard this field too for the same reason as
item_total_price
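
A short sketch of why the two derived columns need not be stored: both can be recomputed in a query whenever they are wanted. The table name order_lines and all values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines "
    "(order_id INTEGER, item_id INTEGER, item_qty INTEGER, item_price REAL)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    [(125, 563, 4, 0.5), (125, 851, 32, 0.25), (126, 563, 10, 0.5)],
)

# item_total_price: quantity times price, computed per line.
item_totals = conn.execute(
    "SELECT order_id, item_id, item_qty * item_price "
    "FROM order_lines ORDER BY order_id, item_id"
).fetchall()

# order_total_price: the sum of those line totals, computed per order.
order_totals = dict(
    conn.execute(
        "SELECT order_id, SUM(item_qty * item_price) "
        "FROM order_lines GROUP BY order_id"
    ).fetchall()
)

print(item_totals)
print(order_totals)
```

Because nothing derived is stored, a change to item_qty or item_price cannot leave a stale total behind.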

Solution: NF2 Cont’d

Fig. 9.4             Fig. 9.4 (New)

orders               orders
order_id(PK)         order_id(PK)
order_date           order_date ✗
customer_id          customer_id ?
customer_name        customer_name ?
customer_address     customer_address ?
customer_city        customer_city ?
customer_state       customer_state ?
item_id(PK)          item_id(PK)
item_description     item_description ✗
item_qty             item_qty ✓
item_price           item_price ✗
item_total_price     item_total_price
order_total_price    order_total_price

Solution: NF2 Cont’d


• What to do with a table that fails NF2, as this one has?
– First take out the second half of the concatenated PK (item_id)
& put it in its own table.
– All columns that depend on item_id - whether in whole or in
part - follow it into the new table, order_items (see Fig. 9.5).
– The other fields — those that rely on just the first half of the PK
(order_id) and those we aren't sure about — stay where they
are.

Fig. 9.5
orders               order_items
order_id(PK)         order_id(PK)
order_date           item_id(PK)
customer_id          item_description
customer_name        item_qty
customer_address     item_price
customer_city
customer_state

Solution: NF2 Cont’d
• things to notice about Fig. 9.5:
1. have brought a copy of order_id to the order_items table to
allow each order_item to "remember" which order it is a part
of.
2. orders table has fewer rows than before & no longer has a
concatenated PK. PK consists of a single column, order_id.
3. order_items table does have a concatenated primary key.
• The crow's feet in Fig. 9.5 mean:
– each order can be associated with any number of order-items, but at
least one;
– each order-item is associated with one order, and only one.

Fig. 9.5
orders               order_items
order_id(PK)         order_id(PK)
order_date           item_id(PK)
customer_id          item_description
customer_name        item_qty
customer_address     item_price
customer_city
customer_state

Solution: NF2 Phase II


• Remember, NF2 only applies to tables with a
concatenated PK. Now that orders has a single-column PK,
it has passed NF2.
• order_items, however, still has a concatenated PK.
– have to pass it thro NF2 analysis again to see if it passes.
– ask the same question we did before:
– Can this column exist without one or the other part of the
concatenated PK?
• Fig. 9.6 shows order_items table structure.
• item_description relies on item_id, but
not order_id, so this again fails NF2 ✗
• item_qty relies on both parts of PK,
does not violate NF2 ✓
• item_price relies on item_id but not on order_id, so it
does violate NF2 ✗

Fig. 9.6
order_items
order_id(PK)
item_id(PK)
item_description
item_qty
item_price

Solution: NF2 Phase II Cont’d
Fig. 9.6             Fig. 9.6 (New)
order_items          order_items
order_id(PK)         order_id(PK)
item_id(PK)          item_id(PK)
item_description     item_description ✗
item_qty             item_qty ✓
item_price           item_price ✗

• On first pass thro NF2 test, lost all fields relying on item_id & put
them into new table. This time, only taking fields failing the test:
ie item_qty stays. What's different this time?
• First pass, removed item_id key from orders altogether cos of
the 1:M relationship between orders & order-items.
– Therefore item_qty field had to follow item_id into the new table.
• Second pass, item_id wasn’t taken from order-items table cos of
the M:1 relationship between order-items & items.
– Therefore, since item_qty does not violate NF2 this time, it is
permitted to stay in the table with the two PK parts that it relies on.

Solution: NF2 Phase II Cont’d


• The crow's feet in Fig. 9.7 mean:
– each item can be associated with any number of lines on any number
of invoices, including zero;
– each order-item is associated with one item, and only one.
– These two lines are examples of 1:M relationships.
• This 3-table structure is how to express an M:N relationship:
– Each order can have many items; each item can belong to many
orders.
• Notes:
– Didn’t bring a copy of order_id column into new table cos individual
items needn’t know the orders they are part of, as order_items
remembers this relationship via the order_id & item_id columns. Taken
together these columns comprise the PK of order_items, but taken
separately they are FKs to rows in other tables.
– New table does not have a concatenated PK, so it passes NF2.
Fig. 9.7
orders               order_items          items
order_id(PK)         order_id(PK)         item_id(PK)
order_date           item_id(PK)          item_description
customer_id          item_qty             item_price
customer_name
customer_address
customer_city
customer_state
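
The 3-table structure can be tried out with a small sqlite3 sketch. The customer columns are left out for brevity, and the item names, prices and quantities are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE items  (item_id INTEGER PRIMARY KEY, item_description TEXT, item_price REAL);
    CREATE TABLE order_items (
        order_id INTEGER,
        item_id  INTEGER,
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO orders VALUES (125, '2002-09-13'), (126, '2002-09-14');
    INSERT INTO items VALUES (563, 'bolt', 0.50), (851, 'nut', 0.25), (900, 'washer', 0.10);
    INSERT INTO order_items VALUES (125, 563, 4), (125, 851, 32), (126, 563, 500);
""")

# One order can have many items...
items_per_order = conn.execute(
    "SELECT order_id, COUNT(*) FROM order_items GROUP BY order_id ORDER BY order_id"
).fetchall()

# ...and one item can appear on many orders.
orders_for_bolt = [
    r[0]
    for r in conn.execute(
        "SELECT order_id FROM order_items WHERE item_id = 563 ORDER BY order_id"
    )
]

# An item may also sit on the shelf with no order lines at all.
never_ordered = [
    r[0]
    for r in conn.execute(
        "SELECT i.item_description FROM items i "
        "LEFT JOIN order_items oi ON oi.item_id = i.item_id "
        "WHERE oi.order_id IS NULL"
    )
]

print(items_per_order, orders_for_bolt, never_ordered)
```

The order_items table carries the M:N link, so item_description and item_price are stored once per item however many times the item is ordered.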

Solution: NF3
• Third Normal Form (NF3):
No Dependencies on Non-Key Attributes
• Can return to repeating Customer info problem. As db stands, if
customer places >1 order have to input customer's contact info
again cos there are columns in orders that rely on "non-key
attributes".
• To understand this, consider order_date. Can it exist independent
of order_id?
– No!: an "order date" is meaningless without an order.
– order_date depends on a key attribute (order_id is "key attribute"
because it is table’s PK).
• What about customer_name — can it exist on its own, outside of
the orders table?
– Yes. It is meaningful to talk about a customer name without referring
to an order or invoice.

Solution: NF3 Cont’d


• Same goes for customer_address, customer_city, &
customer_state. These 4 columns actually rely on customer_id,
which is not a key in this table (it is a non-key attribute).
• These fields belong in their own table customers, with
customer_id as PK (see Fig 9.8).
• However, notice in Fig 9.8 that relationship has been severed btw
orders table and the Customer data that used to inhabit it.
orders               order_items          items
order_id(PK)         order_id(PK)         item_id(PK)
customer_id(FK)      item_id(PK)          item_description
order_date           item_qty             item_price

Fig. 9.8

customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state

Solution: NF3 Cont’d
• Restore relationship by creating a foreign key (indicated by (FK))
in orders
– As we know, an FK is a column that points to the PK in another table.
– Fig 9.9 describes this relationship, and shows our completed ERD.
• Relationship between orders & customers may be expressed in
this way:
– each order is made by one, and only one customer;
– each customer can make any number of orders, including zero
orders               order_items          items
order_id(PK)         order_id(PK)         item_id(PK)
customer_id(FK)      item_id(PK)          item_description
order_date           item_qty             item_price

Fig. 9.9

customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state

Solution: NF3 Cont’d


• Last point to note:
– order_id and item_id columns in order_items perform a dual
purpose: not only do they function as the (concatenated) PK for
order_items, they also individually serve as FKs to the orders
table and items table respectively.
– This is shown in Fig. 9.10

orders               order_items          items
order_id(PK)         order_id(PK, FK)     item_id(PK)
customer_id(FK)      item_id(PK, FK)      item_description
order_date           item_qty             item_price

Fig. 9.10

customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state
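
A possible rendering of the completed ERD as an SQLite schema, with the foreign keys declared. This is a sketch under assumptions: the address columns are omitted, all rows are made up, and SQLite only enforces FKs once PRAGMA foreign_keys is switched on. It ends by answering the kind of question that was awkward in the spreadsheet (question 1 on the Fig. 9.1 slide).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK constraints unless enabled
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_description TEXT, item_price REAL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT
    );
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        item_id  INTEGER NOT NULL REFERENCES items(item_id),
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO customers VALUES (1, 'Frankenstein Parts'), (2, 'Other Customer');
    INSERT INTO items VALUES (10, '4 inch bolt', 0.50), (11, 'hex nut', 0.10);
    INSERT INTO orders VALUES (125, 1, '2002-09-13'), (126, 1, '2002-11-02'),
                              (127, 2, '2002-11-03'), (128, 1, '2003-01-05');
    INSERT INTO order_items VALUES (125, 10, 4), (125, 11, 8), (126, 10, 6),
                                   (127, 10, 100), (128, 10, 50);
""")

# An order that points at a customer who does not exist is refused by the FK.
try:
    conn.execute("INSERT INTO orders VALUES (999, 42, '2002-01-01')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# How many 4 inch bolts did Frankenstein Parts order in 2002?
bolts_2002 = conn.execute("""
    SELECT SUM(oi.item_qty)
    FROM order_items oi
    JOIN orders o    ON o.order_id = oi.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN items i     ON i.item_id = oi.item_id
    WHERE c.customer_name = 'Frankenstein Parts'
      AND i.item_description = '4 inch bolt'
      AND substr(o.order_date, 1, 4) = '2002'
""").fetchone()[0]

print(orphan_rejected, bolts_2002)
```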

Normalisation cont’d

Introduction to Database Design

• As we have seen, an important part of database design
is deciding on a suitable logical structure or schema to
implement ... called database design.
• Considering supplier parts example (S,P,SP)
there is a feeling of correctness.
• Normalisation theory is a
formalism of simple ideas with a
practical application
in logical database schema design.
• Normalisation theory should allow us to
recognise relations with undesirable
properties, tell us what is "wrong" & how to "correct" it.

S
S#    SName    Status    City
S1    Smith    20        Paris
S2    Jones    10        Paris
S3    Blake    30        Rome

P
P#    PName    Colour    Weight    City
P1    Nut      Red       12        London
P2    Bolt     Green     17        Paris
P3    Screw    Blue      27        Rome
P4    Screw    Red       14        London

SP
S#    P#    QTY
S1    P1    300
S1    P2    200
S1    P3    400
S2    P1    300
S2    P2    400
S3    P2    200
Intro to Database Design Cont’d
• Normalisation theory is built around normal forms - each normal
form has a set of satisfiable criteria.
• Normal forms exist in a hierarchy:
– 1NF -> 2NF -> 3NF -> BCNF -> 4NF -> PJ/NF (5NF)
• Codd defined 1NF, 2NF, 3NF in 1972.
• 3NF had inadequacies so revised in ‘74 by Boyce/Codd (BCNF).
• 1977 Fagin defined 4NF, 1979 defined 5NF.
• 6NF,7NF ?... dependencies theory suggests there may be higher
NFs but not practicable in database environment.
• DB designers should aim for higher NFs but this is not law - just
recommended as normalisation simply provides guidelines for
database design.
• There are often good reasons for not using normalisation theory.

Introduction to Database Design Cont’d


• In order to describe the various normal forms we must
first introduce some definitions:
• Functional Dependency
– Given relation R, attribute Y of R is functionally dependent on X
of R, R.X -> R.Y, or R.X functionally determines R.Y ...
– ... iff each R.X value has associated with it precisely one R.Y
value, where X and/or Y may be composite.
– R.X called the determinant, R.Y called the dependent
• S.SNAME, S.STATUS and S.CITY are each functionally
dependent on S.S#
• If R.X is a candidate key or if R.X is the primary key,
then all R.Y must be functionally dependent on R.X
• In SP we have a composite primary key so
SP.(S#,P#) -> SP.QTY
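
The definition can be checked mechanically against sample data. The sketch below uses the S and SP rows from the supplier-parts example; the helper name holds is an assumption. Note that a sample can only refute a dependency: an FD is a statement about every legal value of the relation, so passing the check on one sample does not prove it.

```python
def holds(rows, x, y):
    """Return True if X -> Y holds in rows: equal X values always come with equal Y values."""
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in x)
        y_val = tuple(row[a] for a in y)
        # Remember the first Y value seen for this X value; any later mismatch breaks the FD.
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


S = [
    {"S#": "S1", "SName": "Smith", "Status": 20, "City": "Paris"},
    {"S#": "S2", "SName": "Jones", "Status": 10, "City": "Paris"},
    {"S#": "S3", "SName": "Blake", "Status": 30, "City": "Rome"},
]
SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S1", "P#": "P3", "QTY": 400},
    {"S#": "S2", "P#": "P1", "QTY": 300},
    {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200},
]

print(holds(S, ["S#"], ["SName", "Status", "City"]))  # S.S# -> the other S attributes
print(holds(SP, ["S#", "P#"], ["QTY"]))               # SP.(S#,P#) -> SP.QTY
print(holds(SP, ["S#"], ["QTY"]))                     # S# alone does not determine QTY
print(holds(S, ["City"], ["S#"]))                     # two suppliers share Paris
```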

Introduction to Database Design Cont’d
• There is no requirement in the definition of functional
dependence that R.X be a candidate key, thus:
R.X -> R.Y iff whenever 2 tuples of R agree on their X value, they
also agree on their Y value.
– R.Y is fully functionally dependent on R.X ….
– …. iff it is functionally dependent on R.X & not functionally
dependent on any proper subset of R.X
– Example:
S.(S#,STATUS) -> S.CITY is true but not full functional
dependence as S.S# -> S.CITY
– If R.X -> R.Y but not fully then R.X must be composite
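
Full functional dependence can be tested the same way: the dependency must hold, and it must stop holding for every proper subset of the determinant. A minimal sketch, assuming the helper names holds and fully_dependent and reusing the S and SP sample rows (sample data can refute, not prove, a dependency):

```python
from itertools import combinations


def holds(rows, x, y):
    """Return True if X -> Y holds in rows: equal X values always come with equal Y values."""
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in x)
        y_val = tuple(row[a] for a in y)
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


def fully_dependent(rows, x, y):
    """Return True if X -> Y holds and no proper, non-empty subset of X determines Y."""
    if not holds(rows, x, y):
        return False
    for size in range(1, len(x)):
        for subset in combinations(x, size):
            if holds(rows, list(subset), y):
                return False
    return True


S = [
    {"S#": "S1", "Status": 20, "City": "Paris"},
    {"S#": "S2", "Status": 10, "City": "Paris"},
    {"S#": "S3", "Status": 30, "City": "Rome"},
]
SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S1", "P#": "P3", "QTY": 400},
    {"S#": "S2", "P#": "P1", "QTY": 300},
    {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200},
]

print(holds(S, ["S#", "Status"], ["City"]))            # the FD itself holds
print(fully_dependent(S, ["S#", "Status"], ["City"]))  # but not fully: S# alone is enough
print(fully_dependent(SP, ["S#", "P#"], ["QTY"]))      # neither S# nor P# alone determines QTY
```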

Normalisation: Example 2
• Given the report in Fig 9.11, need to put it in a tidy DB.
• Problems with current form:
– PROJ_NUM is supposed to be PK or part of PK but contains nulls.
Maybe PROJ_NUM+EMP_NUM will define each row.
– The table entries contain inconsistencies (e.g. JOB_CLASS
“Elect. Engineer” could be “EE” or “E. Eng” or others)

Fig. 9.11

Normalisation: Example 2 Cont’d
• Further problems with current form:
– The table has data redundancies leading to the following
anomalies:
1. Update Anomalies: Modifying (e.g.) JOB_CLASS for Employee 105 requires
lots of alterations (one for each row with EMP_NUM=105).
2. Insertion Anomalies: To complete a row definition, a new employee must
be given a project; if not yet assigned, this must be assumed to complete
the employee tuple.
3. Deletion Anomalies: If employee 103 quits, every row with EMP_NUM=103
must be deleted with the potential loss of other data.
– Inefficiency: If a large number of new employees are hired, a
lot of redundant/unassigned data must be assumed and input.
– Integrity: Possible data integrity problems may arise out of the
above.

Example 2: Conversion to NF1


• So… Problems with Fig. 9.11:
– Data cannot be as shown in Fig. 9.11 cos have to be able to
identify all tuples with a PK.
– PROJ_NUM cannot be PK in Fig. 9.11 cos of nulls
– Cannot have the repeating groups shown in Fig. 9.11 so have
to alter table to remove them.
• Step 1. Eliminate the repeating groups
– Eliminate the null values.
– Now have Fig. 9.12

Fig. 9.12

Example 2: Conversion to NF1 Cont’d
• Step 2. Identify the Primary Key
– Layout in Fig. 9.12 is only a cosmetic change – need a PK to
uniquely identify all tuples.
– This may be seen to be PROJ_NUM+EMP_NUM
• Step 3. Identify all dependencies
– The identification of the PK means already have the following:
PROJ_NUM, EMP_NUM -> PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS

Fig. 9.12

Example 2: Conversion to NF1 Cont’d


• Step 3. Cont’d
– But there are additional dependencies:
1. The project number determines the project name:
PROJ_NUM -> PROJ_NAME
2. If know employee number, also know their name, job classification and
their charge per hour:
EMP_NUM -> EMP_NAME, JOB_CLASS, CHG_HOUR
3. Also knowing job classification means also know the charge per hour:
JOB_CLASS -> CHG_HOUR
– These dependencies are shown in the Dependency Diagram in
Fig. 9.13
– Dependency Diagrams are useful for getting an overall view of
relationships among attributes.
Fig. 9.13 (dependency diagram; arrows are labelled Normal, Partial and Transitive)
PROJ_NUM  PROJ_NAME  EMP_NUM  EMP_NAME  JOB_CLASS  CHG_HOUR  HOURS

Example 2: Conversion to NF1 Cont’d
• Looking at Fig. 9.13, can see that:
1. PK attributes are bold, underlined and a different colour.
2. Arrows above (blue) denote desirable FDs (those based on PK)
3. Arrows below the diagram (red and green) are less desirable:
a) Partial Dependencies: dependencies based on part of composite PK
– Need only know PROJ_NUM to know PROJ_NAME, so PROJ_NAME is only
dependent on part of the PK.
– Need only know EMP_NUM to find the EMP_NAME, JOB_CLASS,
CHG_HOUR.
b) Transitive Dependencies: Dependency of 1 non-prime attribute on another
– From Fig. 9.13, can see that CHG_HOUR is dependent on JOB_CLASS
– Neither of these is part of PK (i.e. a Prime Attribute).

Fig. 9.13 (dependency diagram; arrows are labelled Normal, Partial and Transitive)
PROJ_NUM  PROJ_NAME  EMP_NUM  EMP_NAME  JOB_CLASS  CHG_HOUR  HOURS
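
The three kinds of arrow can be told apart by comparing each determinant with the PK. A rough sketch, assuming the dependencies listed in Step 3; the function name classify and the fallback label "other" are not from the slides.

```python
PRIMARY_KEY = frozenset({"PROJ_NUM", "EMP_NUM"})

# (determinant, dependents) pairs as listed in Step 3.
DEPENDENCIES = [
    ({"PROJ_NUM", "EMP_NUM"}, ["PROJ_NAME", "EMP_NAME", "JOB_CLASS", "CHG_HOUR", "HOURS"]),
    ({"PROJ_NUM"}, ["PROJ_NAME"]),
    ({"EMP_NUM"}, ["EMP_NAME", "JOB_CLASS", "CHG_HOUR"]),
    ({"JOB_CLASS"}, ["CHG_HOUR"]),
]


def classify(determinant, key=PRIMARY_KEY):
    """Label a dependency by how its determinant relates to the primary key."""
    determinant = frozenset(determinant)
    if determinant == key:
        return "normal"      # based on the whole PK
    if determinant < key:
        return "partial"     # based on only part of a composite PK
    if determinant.isdisjoint(key):
        return "transitive"  # one non-prime attribute determines another
    return "other"


labels = [classify(det) for det, _ in DEPENDENCIES]
for (det, deps), label in zip(DEPENDENCIES, labels):
    print(f"{', '.join(sorted(det))} -> {', '.join(deps)}: {label}")
```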

Example 2: Conversion to NF1 Cont’d

• Properties of NF1: A table in NF1 must have:


1. All key attributes defined
2. No repeating groups in the table (i.e. each row/column entry
must have only one value)
• Problem with Fig. 9.13 is the partial dependencies.
• These can be eliminated with NF2

Example 2: Conversion to NF2
• Step 1. Identify all key components:
PROJ_NUM
EMP_NUM
PROJ_NUM, EMP_NUM
– Each component becomes the key of a new table.
– Three new tables project, employee, assign
• Step 2. Identify the dependent attributes
– Use Fig. 9.13 to determine which attributes are dependent on
which others, using the arrows in the dependency diagram
project(PROJ_NUM, PROJ_NAME)
employee(EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)
assign(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
– Results are shown in Fig. 9.14
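
The split into three tables amounts to three relational projections of the NF1 table. A small sketch of that idea follows; the sample rows are invented (they are not the contents of Fig. 9.12) and the helper name projection is an assumption. The last step rejoins the pieces to confirm that nothing was lost.

```python
COLUMNS = ("PROJ_NUM", "PROJ_NAME", "EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR", "HOURS")

# Hypothetical NF1 rows, one per (project, employee) assignment.
rows = [
    (1, "Alpha", 101, "Ann", "Engineer", 80.0, 10.0),
    (1, "Alpha", 102, "Bob", "Analyst", 60.0, 5.0),
    (2, "Beta", 101, "Ann", "Engineer", 80.0, 7.5),
    (2, "Beta", 103, "Cy", "Engineer", 80.0, 2.0),
]


def projection(rows, wanted):
    """Relational projection: keep the wanted columns and drop duplicate tuples."""
    idx = [COLUMNS.index(c) for c in wanted]
    return sorted({tuple(r[i] for i in idx) for r in rows})


project_table = projection(rows, ["PROJ_NUM", "PROJ_NAME"])
employee_table = projection(rows, ["EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"])
assign_table = projection(rows, ["PROJ_NUM", "EMP_NUM", "HOURS"])  # HOURS plays the role of ASSIGN_HOURS

# Rejoin the three tables on their keys to rebuild the original rows.
rejoined = sorted(
    (p, pname, e, ename, job, chg, hours)
    for (p, e, hours) in assign_table
    for (p2, pname) in project_table if p2 == p
    for (e2, ename, job, chg) in employee_table if e2 == e
)

print(project_table)
print(employee_table)
print(assign_table)
print(rejoined == sorted(rows))
```

Each project name and each employee's details now appear once, instead of once per assignment.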

Example 2: Conversion to NF2 Cont’d


• At this point, most anomalies discussed above have been
eliminated e.g. if want to add/change/delete a project record,
only need to alter 1 row of project
• So a table is in NF2 iff
1. It is in NF1
And
2. It has no partial dependencies (can still have transitive
dependencies)
• Fig. 9.14 still has a transitive dependency which can generate
anomalies e.g. if charge per hour changes for a job classification
held by many employees, that change must be made for all
(leading to possible update anomalies)
• Resolve transitive dependencies in NF3

Fig. 9.14
project:   PROJ_NUM  PROJ_NAME
employee:  EMP_NUM  EMP_NAME  JOB_CLASS  CHG_HOUR
assign:    PROJ_NUM  EMP_NUM  ASSIGN_HOURS

Example 2: Conversion to NF3
• Step 1. Identify each new determinant
– For each transitive dependency, write its determinant as a PK for
a new table (recall: determinant is any attribute whose value
determines other values within a row).
– If have 3 transitive dependencies, have 3 different determinants
– Here only have one: JOB_CLASS
• Step 2. Identify the dependent attributes
– Identify the attributes dependent on each determinant identified
in Step 1. Here, have
JOB_CLASS -> CHG_HOUR
– Name the table to reflect its contents & function, here JOB is ok
• Step 3. Remove dependent attributes from transitive
dependencies
– In each table that has a transitive dependency, remove the
attributes that depend on the new determinant
– JOB_CLASS remains in the employee table as FK

Example 2: Conversion to NF3


• Final dependency diagram is shown in Fig. 9.15
Fig. 9.15
project:   PROJ_NUM  PROJ_NAME
employee:  EMP_NUM  EMP_NAME  JOB_CLASS
job:       JOB_CLASS  CHG_HOUR
assign:    PROJ_NUM  EMP_NUM  ASSIGN_HOURS

• Or 4 Tables:
project(PROJ_NUM, PROJ_NAME)
assign(EMP_NUM, PROJ_NUM, ASSIGN_HOURS)
employee(EMP_NUM, EMP_NAME, JOB_CLASS)
job(JOB_CLASS, CHG_HOUR)
• A table is in NF3 iff
– It is in NF2
And
– It contains no transitive dependencies.
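
With the transitive dependency moved into the job table, a change to the charge per hour touches a single row. A brief sqlite3 sketch with made-up employees and rates, showing only the employee and job tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (JOB_CLASS TEXT PRIMARY KEY, CHG_HOUR REAL);
    CREATE TABLE employee (
        EMP_NUM   INTEGER PRIMARY KEY,
        EMP_NAME  TEXT,
        JOB_CLASS TEXT REFERENCES job(JOB_CLASS)   -- stays in employee as FK
    );
    INSERT INTO job VALUES ('Engineer', 80.0), ('Analyst', 60.0);
    INSERT INTO employee VALUES (101, 'Ann', 'Engineer'), (102, 'Bob', 'Analyst'),
                                (103, 'Cy', 'Engineer');
""")

# The rate for a job classification is stored once, so one row changes.
cur = conn.execute("UPDATE job SET CHG_HOUR = 85.0 WHERE JOB_CLASS = 'Engineer'")
rows_changed = cur.rowcount

# Every employee in that classification sees the new rate through the join.
rates = conn.execute(
    "SELECT e.EMP_NUM, j.CHG_HOUR FROM employee e "
    "JOIN job j ON j.JOB_CLASS = e.JOB_CLASS ORDER BY e.EMP_NUM"
).fetchall()

print(rows_changed, rates)
```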
