
The Relational Data Model, Normalisation and effective Database Design
By Tony Marston
30th September 2004
Amended 12th August 2005
What is a database?
The Hierarchical Data Model
The Network Data Model
The Relational Data Model
- The Relation
- Keys
- Relationships
- Relational Joins
- Lossless Joins
- Determinant and Dependent
- Functional Dependencies (FD)
- Transitive Dependencies (TD)
- Multi-Valued Dependencies (MVD)
- Join Dependencies (JD)
- Modification Anomalies
Types of Relational Join
- Inner Join
- Natural Join
- Left [Outer] Join
- Right [Outer] Join
- Full [Outer] Join
- Self Join
- Cross Join
Entity-Relationship Diagram (ERD)
Data Normalisation
- 1st Normal Form
- 2nd Normal Form
- 3rd Normal Form
- Boyce-Codd Normal Form
- 4th Normal Form
- 5th (Projection-Join) Normal Form
- 6th (Domain-Key) Normal Form
- Compound Fields
- Summary Fields
- Summary Tables
- Optional Attributes that exist as a group
Personal Guidelines
- Database Names
- Table Names
- Field Names
- Primary Keys
- Foreign Keys
- Generating Unique ids
- The choice between upper and lower case
- Field names should identify their content
- The naming of Foreign Keys
Amendment History
I have been designing and building applications, including the databases used by
those applications, for several decades now. I have seen similar problems
approached by different designs, and this has given me the opportunity to
evaluate the effectiveness of one design over another in providing solutions to
those problems.
It may not seem obvious to a lot of people, but the design of the database is the
heart of any system. If the design is wrong then the whole application will be
wrong, either in effectiveness or performance, or even both. No amount of clever
coding can compensate for a bad database design. Sometimes when building an
application I may encounter a problem which can only be solved effectively by
changing the database rather than by changing the code, so change the
database is what I do. I may have to try several different designs before I find
one that provides the most benefits and the least number of disadvantages, but
that is what prototyping is all about.
The biggest problem I have encountered in all these years is where the database
design and software development are handled by different teams. The database
designers build something according to their rules, and they then expect the
developers to write code around this design. This approach is often fraught with
disaster as the database designers often have little or no development
experience, so they have little or no understanding of how the development
language can use that design to achieve the expected results. This happened on
a project I worked on in the 1990s, and every time that we, the developers, hit a
problem the response from the database designers was always the same: "Our
design is perfect, so you will have to just code around it." So code around it we
did, and not only were we not happy with the result, neither were the users as the
entire system ran like a pig with a wooden leg.
In this article I will provide you with some tips on how I go about designing a
database in the hope that you may learn from my experience. Note that I do not
use any expensive modelling tools, just the Mark I Brain.
What is a database?
This may seem a pretty fundamental question, but unless you know what a
database consists of you may find it difficult to build one that can be used
effectively. Here is a simple definition of a database:
A database is a collection of information that is organised so that it can easily be
accessed, managed, and updated.
A database engine may comply with a combination of any of the following:
• The database is a collection of tables, files or datasets.
• Each table is a collection of fields, columns or data items.
• One or more columns in each table may be selected as the primary key.
• There may be additional unique keys or non-unique indexes to assist in
data retrieval.
• Columns may be fixed length or variable length.
• Records may be fixed length or variable length.
• Table and column names may be restricted in length (8, 16 or 32 characters).
• Table and column names may be case-sensitive.
Over the years there have been several different ways of constructing databases,
amongst which have been the following:
• The Hierarchical Data Model
• The Network Data Model
• The Relational Data Model
Although I will give a brief summary of the first two, the bulk of this document is
concerned with The Relational Data Model as it is the most prevalent today.
The Hierarchical Data Model
The Hierarchical Data Model structures data in a tree of records, with each
record having one parent record and many children. It can be represented as
follows:
Figure 1 - The Hierarchical Data Model
A hierarchical database consists of the following:
1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are called root
segments.
4. The parent of node n is a node directly above n and connected to n by a
branch.
5. Each node (with the exception of the root) has exactly one parent.
6. The child of node n is the node directly below n and connected to n by a
branch.
7. One parent may have many children.
By introducing data redundancy, complex network structures can also be
represented as hierarchical databases. This redundancy is eliminated in physical
implementation by including a 'logical child'. The logical child contains no data
but uses a set of pointers to direct the database management system to the
physical child in which the data is actually stored. Associated with a logical child
are a physical parent and a logical parent. The logical parent provides an
alternative (and possibly more efficient) path to retrieve logical child information.
The Network Data Model
The Network Data Model uses a lattice structure in which a record can have
many parents as well as many children. It can be represented as follows:
Figure 2 - The Network Data Model
Like the Hierarchical Data Model, the Network Data Model also consists of
nodes and branches, but a child may have multiple parents within the network
structure instead of being restricted to just one.
I have worked with both hierarchical and network databases, and they both
suffered from the following deficiencies (when compared with relational
databases):
• Access to the database was not via SQL query strings, but by a specific
set of APIs, typically for FIND, CREATE, READ, UPDATE and DELETE.
• Each API would only access a single table (dataset), so it was not
possible to implement a JOIN which would return data from several tables.
• It was not possible to provide a variable WHERE clause. The only
selection mechanisms available were to:
   o read all entries (a full table scan).
   o read a single entry using a specific primary key.
   o read all entries on a child table which were associated with a
   selected entry on a parent table.
Any further filtering had to be done within the application code.
• It was not possible to provide an ORDER BY clause. Data was presented
in the order in which it existed in the database. This mechanism could be
tuned by specifying sort criteria to be used when each record was inserted,
but this had several disadvantages:
   o Only a single sort sequence could be defined for each path (link to
   a parent), so all records retrieved on that path would be provided in that
   sequence.
   o It could make inserts rather slow when attempting to insert into the
   middle of a large collection, or where a table had multiple paths each
   with its own set of sort criteria.
The Relational Data Model
The Relational Data Model has the relation at its heart, but then a whole series of
rules governing keys, relationships, joins, functional dependencies, transitive
dependencies, multi-valued dependencies, and modification anomalies.
The Relation
The Relation is the basic element in a relational data model.
Figure 3 - Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of
the same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in
that column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur
multiple times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent
this).
11. Cells must be single-valued (but can be variable length). Single valued
means the following:
   o Cannot contain multiple values such as 'A1,B2,C3'.
   o Cannot contain combined values such as 'ABC-XYZ' where 'ABC'
   means one thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
• R = the name of the relation.
• (A,B,C, ...) = the attributes within the relation.
• A = the attribute(s) which form the primary key.
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies
a row. A candidate key must possess the following properties:
   o Unique identification - For every row the value of the key must
   uniquely identify that row.
   o Non redundancy - No attribute in the key can be discarded without
   destroying the property of unique identification.
4. A primary key is the candidate key which is selected as the principal
unique identifier. Every relation must contain a primary key. The primary
key is usually the key selected to identify a row when the database is
physically implemented. For example, a part number is selected instead of
a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A
superkey differs from a candidate key in that it does not require the non
redundancy property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as
a non key attribute in one relation and as a primary key attribute in another
relation. I say usually because it is possible for a foreign key to also be the
whole or part of a primary key:
   o A many-to-many relationship can only be implemented by
   introducing an intersection or link table which then becomes the child in
   two one-to-many relationships. The intersection table therefore has a
   foreign key for each of its parents, and its primary key is a composite of
   both foreign keys.
   o A one-to-one relationship requires that the child table has no more
   than one occurrence for each parent, which can only be enforced by
   letting the foreign key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an
obvious meaning to the user or the data. For example, a semantic primary
key for a COUNTRY entity might contain the value 'USA' for the occurrence
describing the United States of America. The value 'USA' has meaning to
the user.
8. A technical or surrogate or artificial key is a key for which the possible
values have no obvious meaning to the user or the data. These are used
instead of semantic keys for any of the following reasons:
   o When the value in a semantic key is likely to be changed by the
   user, or can have duplicates. For example, on a PERSON table it is
   unwise to use PERSON_NAME as the key as it is possible to have
   more than one person with the same name, or the name may change
   such as through marriage.
   o When none of the existing attributes can be used to guarantee
   uniqueness. In this case adding an attribute whose value is generated
   by the system, e.g. from a sequence of numbers, is the only way to
   provide a unique value. Typical examples would be ORDER_ID and
   INVOICE_ID. The value '12345' has no meaning to the user as it
   conveys nothing about the entity to which it relates.
9. A key functionally determines the other attributes in the row, thus it is
always a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an
index which does not allow duplicate entries.
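The difference between a superkey and a candidate key can be checked mechanically against a set of rows. What follows is only a minimal Python sketch: the function names and the PART sample rows are hypothetical, invented for illustration. Bear in mind that sample data can only show what holds for those particular rows, whereas a real key is a rule about every row the table may ever contain.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """True if no two rows share the same values for attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def candidate_keys(rows, all_attrs):
    """Superkeys from which no attribute can be discarded (non redundancy)."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            # A set that contains a smaller key is redundant, so skip it.
            if any(set(k) <= set(attrs) for k in keys):
                continue
            if is_superkey(rows, attrs):
                keys.append(attrs)
    return keys

# Hypothetical sample rows for a PART table.
parts = [
    {"part_no": 1, "description": "bolt", "colour": "grey"},
    {"part_no": 2, "description": "nut", "colour": "grey"},
    {"part_no": 3, "description": "bolt", "colour": "black"},
]

# (part_no, colour) identifies every row, but colour can be discarded,
# so it is a superkey without being a candidate key.
print(is_superkey(parts, ("part_no", "colour")))
print(candidate_keys(parts, ("part_no", "description", "colour")))
```

In these three rows the pair (description, colour) also happens to be unique, so the sketch reports it alongside part_no. Choosing part_no as the primary key is then a design decision, in line with rule 4 above.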
Relationships
One table (relation) may be linked with another in what is known as a
relationship. Relationships may be built into the database structure to facilitate
the operation of relational joins at runtime.
1. A relationship is between two tables in what is known as a one-to-many
or parent-child or master-detail relationship where an occurrence on the
'one' or 'parent' or 'master' table may have any number of associated
occurrences on the 'many' or 'child' or 'detail' table. To achieve this the
child table must contain fields which link back to the primary key on the
parent table. These fields on the child table are known as a foreign key,
and the parent table is referred to as the foreign table (from the viewpoint
of the child).
2. It is possible for a record on the parent table to exist without
corresponding records on the child table, but it should not be possible for
an entry on the child table to exist without a corresponding entry on the
parent table.
3. A child record without a corresponding parent record is known as an
orphan.
4. It is possible for a table to be related to itself. For this to be possible it
needs a foreign key which points back to the primary key. Note that these
two keys cannot be comprised of exactly the same fields otherwise the
record could only ever point to itself.
5. A table may be the subject of any number of relationships, and it may be
the parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate
key, but if this were changed it could result in the link to the child table
being broken.
7. Some database engines allow relationships to be managed by rules
known as referential integrity or foreign key constraints. These will
prevent entries on child tables from being created if the foreign key does
not exist on the parent table, or will deal with entries on child tables when
the entry on the parent table is updated or deleted.
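Rules 2, 3 and 7 can be seen in action with a small sketch using Python's built-in sqlite3 module. The CUSTOMER and ORDERS tables and their columns are hypothetical names chosen for this illustration, and the behaviour shown is that of SQLite with its default foreign key action; other engines, or other ON DELETE settings, may deal with the child entries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer (customer_id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Smith')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # parent exists, so accepted

def try_sql(sql):
    """Run a statement and report whether the engine accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# A child without a parent would be an orphan.
orphan_insert = try_sql("INSERT INTO orders VALUES (101, 2)")
# Deleting a parent which still has a child would also create an orphan.
parent_delete = try_sql("DELETE FROM customer WHERE customer_id = 1")
print(orphan_insert, parent_delete)
```

Both statements are refused, so the database itself guarantees that no orphan can exist, rather than relying on every piece of application code to get it right.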
Relational Joins
The join operator is used to combine data from two or more relations (tables) in
order to satisfy a particular query. Two relations may be joined when they share
at least one common attribute. The join is implemented by considering each row
in an instance of each relation. A row in relation R1 is joined to a row in relation
R2 when the value of the common attribute(s) is equal in the two relations. The
join of two relations is often called a binary join.
The join of two relations creates a new relation. The notation 'R1 x R2' indicates
the join of relations R1 and R2. For example, consider the following:
Relation R1
A  B  C
1  5  3
2  4  5
3  3  5
9  3  3
1  6  5
5  4  3
2  7  5
Relation R2
B  D  E
4  7  4
6  2  3
5  7  3
7  2  3
3  2  2
Note that the instances of relation R1 and R2 contain the same data values for
attribute B. Data normalisation is concerned with decomposing a relation (e.g.
R(A,B,C,D,E)) into smaller relations (e.g. R1 and R2). The data values for
attribute B in this context will be identical in R1 and R2. The instances of R1 and
R2 are projections of the instances of R(A,B,C,D,E) onto the attributes (A,B,C)
and (B,D,E) respectively. A projection will not eliminate data values - duplicate
rows are removed, but this will not remove a data value from any attribute.
The join of relations R1 and R2 is possible because B is a common attribute. The
result of the join is:
Relation R1 x R2
A  B  C  D  E
1  5  3  7  3
2  4  5  7  4
3  3  5  2  2
9  3  3  2  2
1  6  5  2  3
5  4  3  7  4
2  7  5  2  3
The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the
row (4 7 4) from relation R2. The two rows were joined since each contained the
same value for the common attribute B. The row (2 4 5) was not joined to the row
(6 2 3) since the values of the common attribute (4 and 6) are not the same.
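The mechanics of this join can be sketched in a few lines of Python. This is an illustration of the idea rather than of how any database engine implements it; the helper names `relation` and `natural_join` are made up for the example, and the rows are those of R1 and R2 above.

```python
def relation(attrs, rows):
    """Build a relation as a list of {attribute: value} dictionaries."""
    return [dict(zip(attrs, row)) for row in rows]

def natural_join(r1, r2):
    """Join two relations on every attribute name they have in common."""
    if not r1 or not r2:
        return []
    common = [a for a in r1[0] if a in r2[0]]
    result = []
    for row1 in r1:
        for row2 in r2:
            # Rows are joined only when all common attributes are equal.
            if all(row1[a] == row2[a] for a in common):
                result.append({**row1, **row2})
    return result

r1 = relation("ABC", [(1, 5, 3), (2, 4, 5), (3, 3, 5), (9, 3, 3),
                      (1, 6, 5), (5, 4, 3), (2, 7, 5)])
r2 = relation("BDE", [(4, 7, 4), (6, 2, 3), (5, 7, 3), (7, 2, 3), (3, 2, 2)])

joined = [tuple(row[a] for a in "ABCDE") for row in natural_join(r1, r2)]
for row in joined:
    print(row)
```

Because the comparison is made on every shared attribute name, the same function also covers relations which have more than one attribute in common.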
The relations joined in the preceding example shared exactly one common
attribute. However, relations may share multiple common attributes. All of these
common attributes must be used in creating a join. For example, the instances of
relations R1 and R2 in the following example are joined using the common
attributes B and C:
Before the join:
Relation R1
A  B  C
4  1  4
3  1  4
5  1  2
2  7  1
Relation R2
B  C  D
1  4  9
1  4  2
1  2  1
7  1  2
7  1  3
After the join:
Relation R1 x R2
A  B  C  D
4  1  4  9
4  1  4  2
3  1  4  9
3  1  4  2
5  1  2  1
2  7  1  2
2  7  1  3
The row (4 1 4 9) was formed by joining the row (4 1 4) from relation R1 to the
row (1 4 9) from relation R2. The join was created since the common set of
attributes (B and C) contained identical values (1 and 4). The row (4 1 4) from R1
was not joined to the row (1 2 1) from R2 since the common attributes did not
share identical values - (1 4) in R1 and (1 2) in R2.
The join operation provides a method for reconstructing a relation that was
decomposed into two relations during the normalisation process. The join of two
rows, however, can create a new row that was not a member of the original
relation. Thus invalid information can be created during the join process.
Lossless Joins
A set of relations satisfies the lossless join property if the instances can be joined
without creating invalid data (i.e. new rows). The term lossless join may be
somewhat confusing. A join that is not lossless will contain extra, invalid rows. A
join that is lossless will not contain extra, invalid rows. Thus the term gainless
join might be more appropriate.
To give an example of incorrect information created by an invalid join let us take
the following data structure:
R(student, course, instructor, hour, room, grade)
Assuming that only one section of a class is offered during a semester we can
define the following functional dependencies:
1. (HOUR, ROOM) → COURSE
2. (COURSE, STUDENT) → GRADE
3. (INSTRUCTOR, HOUR) → ROOM
4. (COURSE) → INSTRUCTOR
5. (HOUR, STUDENT) → ROOM
Take the following sample data:
STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM  GRADE
Smith    Math 1   Jenkins     8:00  100   A
Jones    English  Goldman     8:00  200   B
Brown    English  Goldman     8:00  200   C
Green    Algebra  Jenkins     9:00  400   A
The following four relations, each in 4th normal form, can be generated from the
given and implied dependencies:
• R1(STUDENT, HOUR, COURSE)
• R2(STUDENT, COURSE, GRADE)
• R3(COURSE, INSTRUCTOR)
• R4(INSTRUCTOR, HOUR, ROOM)
Note that the dependencies (HOUR, ROOM) → COURSE and (HOUR,
STUDENT) → ROOM are not explicitly represented in the preceding
decomposition. The goal is to develop relations in 4th normal form that can be
joined to answer any ad hoc inquiries correctly. This goal can be achieved
without representing every functional dependency as a relation. Furthermore,
several sets of relations may satisfy the goal.
The preceding sets of relations can be populated as follows:
Relation R1
STUDENT  HOUR  COURSE
Smith    8:00  Math 1
Jones    8:00  English
Brown    8:00  English
Green    9:00  Algebra
Relation R2
STUDENT  COURSE   GRADE
Smith    Math 1   A
Jones    English  B
Brown    English  C
Green    Algebra  A
Relation R3
COURSE   INSTRUCTOR
Math 1   Jenkins
English  Goldman
Algebra  Jenkins
Relation R4
INSTRUCTOR  HOUR  ROOM
Jenkins     8:00  100
Goldman     8:00  200
Jenkins     9:00  400
Now suppose that a list of courses with their corresponding room numbers is
required. Relations R1 and R4 contain the necessary information and can be
joined using the attribute HOUR. The result of this join is:
R1 x R4
STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM
Smith    Math 1   Jenkins     8:00  100
Smith    Math 1   Goldman     8:00  200   *
Jones    English  Jenkins     8:00  100   *
Jones    English  Goldman     8:00  200
Brown    English  Jenkins     8:00  100   *
Brown    English  Goldman     8:00  200
Green    Algebra  Jenkins     9:00  400
This join creates the following invalid information (in the rows marked *):
• Smith, Jones, and Brown take the same class at the same time from two
different instructors in two different rooms.
• Jenkins (the Maths teacher) teaches English.
• Goldman (the English teacher) teaches Maths.
• Both instructors teach different courses at the same time.
Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result
would be:
R3 x R4
COURSE   INSTRUCTOR  HOUR  ROOM
Math 1   Jenkins     8:00  100
Math 1   Jenkins     9:00  400
English  Goldman     8:00  200
Algebra  Jenkins     8:00  100
Algebra  Jenkins     9:00  400
This join creates the following invalid information:
• Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and
9:00.
A correct sequence is to join R1 and R3 (using COURSE) and then join the
resulting relation with R4 (using both INSTRUCTOR and HOUR). The result
would be:
R1 x R3
STUDENT  COURSE   INSTRUCTOR  HOUR
Smith    Math 1   Jenkins     8:00
Jones    English  Goldman     8:00
Brown    English  Goldman     8:00
Green    Algebra  Jenkins     9:00
(R1 x R3) x R4
STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM
Smith    Math 1   Jenkins     8:00  100
Jones    English  Goldman     8:00  200
Brown    English  Goldman     8:00  200
Green    Algebra  Jenkins     9:00  400
Extracting the COURSE and ROOM attributes (and eliminating the duplicate row
produced for the English course) would yield the desired result:
COURSE   ROOM
Math 1   100
English  200
Algebra  400
The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the
lossless (gainless?) join property.
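The difference between the invalid and the correct join sequence can be verified with a short Python sketch. It rebuilds R1, R3 and R4 as projections of the sample data and counts the rows in each join which never existed in the original relation. The helper names (`project`, `natural_join`, `spurious_rows`) are assumptions made for this example only.

```python
ATTRS = ["STUDENT", "COURSE", "INSTRUCTOR", "HOUR", "ROOM", "GRADE"]
DATA = [
    ("Smith", "Math 1", "Jenkins", "8:00", 100, "A"),
    ("Jones", "English", "Goldman", "8:00", 200, "B"),
    ("Brown", "English", "Goldman", "8:00", 200, "C"),
    ("Green", "Algebra", "Jenkins", "9:00", 400, "A"),
]
original = [dict(zip(ATTRS, row)) for row in DATA]

def project(rel, attrs):
    """Keep only the named attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def natural_join(r1, r2):
    """Join two relations on every attribute name they have in common."""
    common = [a for a in r1[0] if a in r2[0]]
    return [{**x, **y} for x in r1 for y in r2
            if all(x[a] == y[a] for a in common)]

def spurious_rows(joined, source):
    """Rows of the join which cannot be found in the source relation."""
    attrs = list(joined[0])
    valid = {tuple(r[a] for a in attrs) for r in source}
    return [r for r in joined if tuple(r[a] for a in attrs) not in valid]

r1 = project(original, ["STUDENT", "HOUR", "COURSE"])
r3 = project(original, ["COURSE", "INSTRUCTOR"])
r4 = project(original, ["INSTRUCTOR", "HOUR", "ROOM"])

bad = natural_join(r1, r4)                      # joined on HOUR only
good = natural_join(natural_join(r1, r3), r4)   # (R1 x R3) x R4

print(len(bad), len(spurious_rows(bad, original)))
print(len(good), len(spurious_rows(good, original)))
print(project(good, ["COURSE", "ROOM"]))
```

The first join returns seven rows, three of which are invalid, while the second returns only the four rows of the original data, from which the list of courses and rooms can safely be extracted.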
A relational database is in 4th normal form when the lossless join property can be
used to answer unanticipated queries. However, the choice of joins must be
evaluated carefully. Many different sequences of joins will recreate an instance of
a relation. Some sequences are more desirable since they result in the creation
of less invalid data during the join operation.
Suppose that a relation is decomposed using functional dependencies and
multi-valued dependencies. Then at least one sequence of joins on the resulting
relations exists that recreates the original instance with no invalid data created
during any of the join operations.
For example, suppose that a list of grades by room number is desired. This
question, which was probably not anticipated during database design, can be
answered without creating invalid data by either of the following two join
sequences:
R1 x R3
(R1 x R3) x R2
((R1 x R3) x R2) x R4
or
R1 x R3
(R1 x R3) x R4
((R1 x R3) x R4) x R2
The required information is contained within relations R2 and R4, but these
relations cannot be joined directly. In this case the solution requires joining all 4
relations.
The database may require a 'lossless join' relation, which is constructed to
assure that any ad hoc inquiry can be answered with relational operators. This
relation may contain attributes that are not logically related to each other. This
occurs because the relation must serve as a bridge between the other relations
in the database. For example, the lossless join relation will contain all attributes
that appear only on the left side of a functional dependency. Other attributes may
also be required, however, in developing the lossless join relation.
Consider relational schema R(A, B, C, D), A → B and C → D. Relations R1(A, B)
and R2(C, D) are in 4th normal form. A third relation R3(A, C), however, is
required to satisfy the lossless join property. This relation can be used to join
attributes B and D. This is accomplished by joining relations R1 and R3 and then
joining the result to relation R2. No invalid data is created during these joins. The
relation R3(A, C) is the lossless join relation for this database design.
A relation is usually developed by combining attributes about a particular subject
or entity. The lossless join relation, however, is developed to represent a
relationship among various relations. The lossless join relation may be difficult to
populate initially and difficult to maintain - a result of including attributes that are
not logically associated with each other.
The attributes within a lossless join relation often contain multi-valued
dependencies. Consideration of 4th normal form is important in this situation. The
lossless join relation can sometimes be decomposed into smaller relations by
eliminating the multi-valued dependencies. These smaller relations are easier to
populate and maintain.
Determinant and Dependent
The terms determinant and dependent can be described as follows:
1. The expression X → Y means 'if I know the value of X, then I can obtain
the value of Y' (in a table or somewhere).
2. In the expression X → Y, X is the determinant and Y is the dependent.
3. The value X determines the value of Y.
4. The value Y depends on the value of X.
Functional Dependencies (FD)
A functional dependency can be described as follows:
1. An attribute is functionally dependent if its value is determined by another
attribute.
2. That is, if we know the value of one (or several) data items, then we can
find the value of another (or several).
3. Functional dependencies are expressed as X → Y, where X is the
determinant and Y is the functionally dependent attribute.
4. If A → (B,C) then A → B and A → C.
5. If (A,B) → C, then it is not necessarily true that A → C and B → C.
6. If A → B and B → A, then A and B are in a 1-1 relationship.
7. If A → B then for A there can only ever be one value for B.
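Whether a set of rows is consistent with a functional dependency can be tested by checking that the same determinant value never appears with two different dependent values. The Python sketch below does this against the student sample data from the Lossless Joins section. The function name `fd_holds` is an assumption for this example, and a test on sample rows can only disprove a dependency, never prove that it is a genuine business rule.

```python
def fd_holds(rows, determinant, dependent):
    """True if equal determinant values always come with equal dependent values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in determinant)
        y = tuple(row[a] for a in dependent)
        # setdefault stores y the first time x is met, then returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True

ATTRS = ["STUDENT", "COURSE", "INSTRUCTOR", "HOUR", "ROOM", "GRADE"]
rows = [dict(zip(ATTRS, r)) for r in [
    ("Smith", "Math 1", "Jenkins", "8:00", 100, "A"),
    ("Jones", "English", "Goldman", "8:00", 200, "B"),
    ("Brown", "English", "Goldman", "8:00", 200, "C"),
    ("Green", "Algebra", "Jenkins", "9:00", 400, "A"),
]]

print(fd_holds(rows, ["COURSE"], ["INSTRUCTOR"]))        # COURSE -> INSTRUCTOR
print(fd_holds(rows, ["INSTRUCTOR"], ["COURSE"]))        # Jenkins teaches two courses
print(fd_holds(rows, ["COURSE", "STUDENT"], ["GRADE"]))  # composite determinant
print(fd_holds(rows, ["COURSE"], ["GRADE"]))             # English has grades B and C
```

The last two checks illustrate rule 5: (COURSE, STUDENT) → GRADE holds, yet COURSE → GRADE on its own does not.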
Transitive Dependencies (TD)
A transitive dependency can be described as follows:
1. An attribute is transitively dependent if its value is determined by another
attribute which is not a key.
2. If X → Y and X is not a key then this is a transitive dependency.
3. A transitive dependency exists when A → B and B → C, so that A determines C only indirectly, through B.
Multi-Valued Dependencies (MVD)
A multi-valued dependency can be described as follows:
1. A table involves a multi-valued dependency if it may contain multiple
values for an entity.
2. A multi-valued dependency may arise as a result of enforcing 1st normal
form.
3. X →→ Y, i.e. X multi-determines Y, when for each value of X we can have
more than one value of Y.
4. If A →→ B and A →→ C then we have a single attribute A which
multi-determines two other independent attributes, B and C.
5. If A →→ (B,C) then we have an attribute A which multi-determines a set of
associated attributes, B and C.
Join Dependencies (JD)
A join dependency can be described as follows:
1. If a table can be decomposed into three or more smaller tables, it must be
capable of being joined again on common keys to form the original table.
Modification Anomalies
A major objective of data normalisation is to avoid modification anomalies. These
come in two flavours:
1. An insertion anomaly is a failure to place information about a new
database entry into all the places in the database where information about
that new entry needs to be stored. In a properly normalized database,
information about a new entry needs to be inserted into only one place in
the database. In an inadequately normalized database, information about a
new entry may need to be inserted into more than one place, and, human
fallibility being what it is, some of the needed additional insertions may be
missed.
2. A deletion anomaly is a failure to remove information about an existing
database entry when it is time to remove that entry. In a properly
normalized database, information about an old, to-be-gotten-rid-of entry
needs to be deleted from only one place in the database. In an
inadequately normalized database, information about that old entry may
need to be deleted from more than one place, and, human fallibility being
what it is, some of the needed additional deletions may be missed.
An update of a database involves modifications that may be additions, deletions,
or both. Thus 'update anomalies' can be either of the kinds of anomalies
discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence
constitutes corruption of the database. Properly normalised databases are much
less susceptible to corruption than are unnormalised databases.
Types of Relational Join
A JOIN is a method of creating a result set that combines rows from two or more
tables (relations). When comparing the contents of two tables the following
conditions may occur:
• Every row in one relation has a match in the other relation.
• Relation R1 contains rows that have no match in relation R2.
• Relation R2 contains rows that have no match in relation R1.
INNER joins contain only matches. OUTER joins may contain mismatches as
well.
Inner Join
This is sometimes known as a simple join. It returns all rows from both tables
where there is a match. If there are rows in R1 which do not have matches in R2,
those rows will not be listed. There are two possible ways of specifying this type
of join:
SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field;
SELECT * FROM R1 INNER JOIN R2 ON R1.r1_field = R2.r2_field;
If the fields to be matched have the same names in both tables then the ON
condition, as in:
ON R1.fieldname = R2.fieldname
ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)
can be replaced by the shorter USING condition, as in:
USING (fieldname)
USING (field1, field2)
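That these forms are interchangeable can be confirmed with a small experiment. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables, r1 and r2, which share a column named id; the table contents are invented purely for illustration, and the SQL is shown as SQLite accepts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (id INTEGER, colour TEXT);
    CREATE TABLE r2 (id INTEGER, size TEXT);
    INSERT INTO r1 VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO r2 VALUES (2, 'small'), (3, 'large'), (4, 'medium');
""")

# The join condition written in the WHERE clause.
where_form = conn.execute(
    "SELECT r1.id, colour, size FROM r1, r2 WHERE r1.id = r2.id ORDER BY r1.id"
).fetchall()
# The same condition written with INNER JOIN ... ON.
on_form = conn.execute(
    "SELECT r1.id, colour, size FROM r1 INNER JOIN r2 ON r1.id = r2.id ORDER BY r1.id"
).fetchall()
# The shorter USING form, possible because the column has the same name in both tables.
using_form = conn.execute(
    "SELECT id, colour, size FROM r1 INNER JOIN r2 USING (id) ORDER BY id"
).fetchall()

print(where_form)
print(where_form == on_form == using_form)
```

Rows 1 (only in r1) and 4 (only in r2) have no match, so they do not appear in any of the three results.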
Natural Join
A natural join is based on all columns in the two tables that have the same name.
It is semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING
clause that names all columns that exist in both tables.
SELECT * FROM R1 NATURAL JOIN R2
The alternative is a keyed join which includes an ON or USING condition.
Left [Outer] Join
Returns all the rows from R1 even if there are no matches in R2. If there are no
matches in R2 then the R2 values will be shown as null.
SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field
Right [Outer] Join
Returns all the rows from R2 even if there are no matches in R1. If there are no
matches in R1 then the R1 values will be shown as null.
SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field
Full [Outer] Join
Returns all the rows from both tables even if there are no matches in one of the
tables. If there are no matches in one of the tables then its values will be shown
as null.
SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field
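The effect of an outer join on unmatched rows is easiest to see with real data. The following sketch again uses Python's sqlite3 module with two hypothetical tables, r1 and r2. Since not every SQLite version accepts RIGHT and FULL joins, the right join is shown here by the equivalent technique of swapping the two tables in a LEFT join, which is an assumption of this example rather than anything required by SQL itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (id INTEGER, colour TEXT);
    CREATE TABLE r2 (id INTEGER, size TEXT);
    INSERT INTO r1 VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO r2 VALUES (2, 'small'), (3, 'large'), (4, 'medium');
""")

# Every row of r1 is kept; the r2 column is null where there is no match.
left_rows = conn.execute(
    "SELECT r1.id, colour, size FROM r1 LEFT OUTER JOIN r2 "
    "ON r1.id = r2.id ORDER BY r1.id"
).fetchall()

# Every row of r2 is kept: the result of 'r1 RIGHT JOIN r2',
# written as a LEFT join with the tables swapped.
right_rows = conn.execute(
    "SELECT r2.id, colour, size FROM r2 LEFT OUTER JOIN r1 "
    "ON r1.id = r2.id ORDER BY r2.id"
).fetchall()

print(left_rows)
print(right_rows)
```

Python reports the null values as None. A full outer join would contain the rows of both results, that is ids 1 to 4, with nulls on whichever side has no match.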
Self Join
This joins a table to itself. This table appears twice in the FROM clause and is
followed by table aliases that qualify column names in the join condition.
SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field
Cross Join
This type of join is rarely used as it does not have a join condition, so every row
of R1 is joined to every row of R2. For example, if both tables contain 100 rows
the result will be 10,000 rows. This is sometimes known as a cartesian product
and can be specified in either one of the following ways:
SELECT * FROM R1 CROSS JOIN R2
SELECT * FROM R1, R2
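The arithmetic is easy to confirm. This short sketch, using Python's sqlite3 module and two hypothetical single-column tables of 100 rows each, counts the rows produced by both ways of writing the cross join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r1 (a INTEGER)")
conn.execute("CREATE TABLE r2 (b INTEGER)")
conn.executemany("INSERT INTO r1 VALUES (?)", [(n,) for n in range(100)])
conn.executemany("INSERT INTO r2 VALUES (?)", [(n,) for n in range(100)])

# With no join condition every row of r1 is paired with every row of r2.
explicit = conn.execute("SELECT COUNT(*) FROM r1 CROSS JOIN r2").fetchone()[0]
implicit = conn.execute("SELECT COUNT(*) FROM r1, r2").fetchone()[0]
print(explicit, implicit)
```

Both queries report 100 x 100 = 10,000 rows, which is why a forgotten join condition on large tables can be so expensive.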
1ntity"Relationshi$ Diagram (1RD)
An entity/relationship diagram 2.165 is a data modeling techni,ue that creates a
graphical representation of the entities, and the relationships beteen entities,
ithin an information system! Any .1 diagram has an e,uivalent relational table,
and any relational table has an e,uivalent .1 diagram! .1 diagramming is an
invaluable aid to engineers in the design, optimiHation, and debugging of
database programs!
• #he entity is a person, ob%ect, place or event for hich data is collected! It
is e,uivalent to a database table! An entity can be defined by means of its
properties, called attributes! 8or e$ample, the 0=S#)*.1 entity may have
attributes for such things as name, address and telephone number!
• #he relationship is the interaction beteen the entities! It can be described
using a verb such as(
o A customer places an order!
o A sales rep serves a customer!
o A order contains a product!
o A arehouse stores a product!
In an entity/relationship diagram entities are rendered as rectangles, and
relationships are portrayed as lines connecting the rectangles! )ne ay of
indicating hich is the 7one7 or 7parent7 and hich is the 7many7 or 7child7 in the
relationship is to use an arrohead, as in figure 4!
8igure 4 / )ne/to/*any relationship using arrohead notation
#his can produce an .16 as shon in figure 5(
8igure 5 / .16 ith arrohead notation
Another method is to replace the arrohead ith a crosfoot, as shon in figure
8igure 4 / )ne/to/*any relationship using crosfoot notation
The relating line can be enhanced to indicate cardinality, which defines the
relationship between the entities in terms of numbers. An entity may be optional
(zero or more) or it may be mandatory (one or more).
• A single bar indicates one.
• A double bar indicates one and only one.
• A circle indicates zero.
• A crowsfoot or arrowhead indicates many.
As well as using lines and circles the cardinality can be expressed using
numbers, as in:
• One-to-One expressed as 1:1
• Zero-to-Many expressed as 0:M
• One-to-Many expressed as 1:M
• Many-to-Many expressed as N:M
This can produce an ERD as shown in figure 7:
Figure 7 - ERD with crowsfoot notation and cardinality
In plain language the relationships can be expressed as follows:
• 1 instance of a SALES REP serves 1 to many CUSTOMERS
• 1 instance of a CUSTOMER places 1 to many ORDERS
• 1 instance of an ORDER lists 1 to many PRODUCTS
• 1 instance of a WAREHOUSE stores 0 to many PRODUCTS
In order to determine if a particular design is correct here is a simple test that I
use:
1. Take the written rules and construct a diagram.
2. Take the diagram and try to reconstruct the written rules.
If the output from step (2) is not the same as the input to step (1) then something
is wrong. If the model allows a situation to exist which is not allowed in the real
world then this could lead to serious problems. The model must be an accurate
representation of the real world in order to be effective. If any ambiguities are
allowed to creep in they could have disastrous consequences.
We have now completed the logical data model, but before we can construct the
physical database there are several steps that must take place:
• Assign attributes (properties or values) to all the entities. After all, a table
without any columns will be of little use to anyone.
• Refine the model using a process known as 'normalisation'. This ensures
that each attribute is in the right place. During this process it may be
necessary to create new tables and new relationships.
Data Normalisation
Relational database theory, and the principles of normalisation, were first
constructed by people with a strong mathematical background. They wrote about
databases using terminology which was not easily understood outside those
mathematical circles. Below is an attempt to provide understandable
explanations.
Data normalisation is a set of rules and techniques concerned with:
• Identifying relationships among attributes.
• Combining attributes to form relations.
• Combining relations to form a database.
It follows a set of rules worked out by E F Codd in 1970. A normalised relational
database provides several benefits:
• Elimination of redundant data storage.
• Close modeling of real world entities, processes, and their relationships.
• Structuring of data so that the model is flexible.
Because the principles of normalisation were first written using the same
terminology as was used to define the relational data model this led some people
to think that normalisation is difficult. Nothing could be more untrue. The
principles of normalisation are simple, common sense ideas that are easy to
[...]
Although there are numerous steps in the normalisation process - 1NF, 2NF,
3NF, BCNF, 4NF, 5NF and DKNF - a lot of database designers often find it
unnecessary to go beyond 3rd Normal Form. This does not mean that those
higher forms are unimportant, just that the circumstances for which they were
designed often do not exist within a particular database. However, all database
designers should be aware of all the forms of normalisation so that they may be
in a better position to detect when a particular rule of normalisation is broken and
then decide if it is necessary to take appropriate action.
The guidelines for developing relations in 3rd Normal Form can be summarised
as follows:
1. Define the attributes.
2. Group logically related attributes into relations.
3. Identify candidate keys for each relation.
4. Select a primary key for each relation.
5. Identify and remove repeating groups.
6. Combine relations with identical keys (1st normal form).
7. Identify all functional dependencies.
8. Decompose relations such that each non key attribute is dependent on all
the attributes in the key.
9. Combine relations with identical primary keys (2nd normal form).
10. Identify all transitive dependencies.
o Check relations for dependencies of one non key attribute with
another non key attribute.
o Check for dependencies within each primary key (i.e. dependencies
of one attribute in the key on other attributes within the key).
11. Decompose relations such that there are no transitive dependencies.
12. Combine relations with identical primary keys (3rd normal form) if there
are no transitive dependencies.
1st Normal Form
A table is in first normal form if all the key attributes have been defined and it
contains no repeating groups.
Taking the ORDER entity in figure 7 as an example we could end up with a set of
attributes like this:
order_id  customer_id  product1  product2  product3
123       456          abc1      def1      ghi1
456       789          abc2
This structure creates the following problems:
• Order 123 has no room for more than 3 products.
• Order 456 has wasted space for product2 and product3.
In order to create a table that is in first normal form we must extract the repeating
groups and place them in a separate table, which I shall call ORDER_LINE.
order_id  customer_id
123       456
456       789
I have removed 'product1', 'product2' and 'product3', so there are no repeating
groups.
order_id  product
123       abc1
123       def1
123       ghi1
456       abc2
Each row contains one product for one order, so this allows an order to contain
any number of products.
This results in a new version of the ERD, as shown in figure 8:
Figure 8 - ERD with ORDER and ORDER_LINE
The new relationships can be expressed as follows:
• 1 instance of an ORDER has 1 to many ORDER LINES
• 1 instance of a PRODUCT has 0 to many ORDER LINES
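The ORDER and ORDER_LINE split can be tried out with a minimal sketch such as the one below. It uses Python's built-in sqlite3 module purely as a convenient test bed; the composite primary key on ORDER_LINE and the extra product 'jkl1' are assumptions made for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "order" is quoted because ORDER is a reserved word in SQL
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
    );
    -- one row per product per order (key choice is an assumption)
    CREATE TABLE order_line (
        order_id INTEGER NOT NULL REFERENCES "order"(order_id),
        product  TEXT    NOT NULL,
        PRIMARY KEY (order_id, product)
    );
    INSERT INTO "order" VALUES (123, 456), (456, 789);
    INSERT INTO order_line VALUES
        (123, 'abc1'), (123, 'def1'), (123, 'ghi1'), (456, 'abc2');
""")

# A fourth product on order 123 needs no change to the table structure.
conn.execute("INSERT INTO order_line VALUES (123, 'jkl1')")

lines_per_order = dict(conn.execute(
    "SELECT order_id, COUNT(*) FROM order_line GROUP BY order_id"))
print(lines_per_order)
```

Order 123 now holds four products and order 456 wastes no space on empty product columns.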
2nd Normal Form
A table is in second normal form (2NF) if and only if it is in 1NF and every non
key attribute is fully functionally dependent on the whole of the primary key (i.e.
there are no partial dependencies).
1. Anomalies can occur when attributes are dependent on only part of a
multi-attribute (composite) key.
2. A relation is in second normal form when all non-key attributes are
dependent on the whole key. That is, no attribute is dependent on only a
part of the key.
3. Any relation having a key with a single attribute is in second normal form.
Take the following table structure as an example:
order(order_id, cust, cust_address, cust_contact, order_date, ...)
Here we should realise that cust_address and cust_contact are functionally
dependent on cust but not on order_id, therefore they are not dependent on
the whole key. To make this table 2NF these attributes must be removed and
placed somewhere else.
3rd Normal Form
A table is in third normal form (3NF) if and only if it is in 2NF and every non key
attribute is non transitively dependent on the primary key (i.e. there are no
transitive dependencies).
1. Anomalies can occur when a relation contains one or more transitive
dependencies.
2. A relation is in 3NF when it is in 2NF and has no transitive dependencies.
3. A relation is in 3NF when 'All non-key attributes are dependent on the key,
the whole key and nothing but the key'.
Take the following table structure as an example:
order(order_id, cust, cust_address, cust_contact, order_date, ...)
Here we should realise that cust_address and cust_contact are functionally
dependent on cust which is not a key. To make this table 3NF these attributes
must be removed and placed somewhere else.
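One way to picture 'placed somewhere else' is a separate CUSTOMER table keyed on cust. The sketch below assumes that table name, and the sample values ('acme', the addresses and the dates) are invented for illustration; sqlite3 is used only as a test bed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- attributes that depend on cust now live with cust
    CREATE TABLE customer (
        cust         TEXT PRIMARY KEY,
        cust_address TEXT,
        cust_contact TEXT
    );
    CREATE TABLE "order" (
        order_id   INTEGER PRIMARY KEY,
        cust       TEXT NOT NULL REFERENCES customer(cust),
        order_date TEXT
    );
    INSERT INTO customer VALUES ('acme', '1 High St', 'Ann');
    INSERT INTO "order" VALUES (1, 'acme', '2005-08-01'), (2, 'acme', '2005-08-12');
""")

# A change of address touches one row, however many orders exist.
cur = conn.execute(
    "UPDATE customer SET cust_address = '2 Low Rd' WHERE cust = 'acme'")
rows_changed = cur.rowcount

# Every order sees the new address through the join.
addresses = conn.execute("""
    SELECT o.order_id, c.cust_address
    FROM "order" AS o JOIN customer AS c ON c.cust = o.cust
    ORDER BY o.order_id
""").fetchall()
print(rows_changed, addresses)
```

Had the address been held on every order row, the same change would have required one update per order, with the risk of missing some.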
You must also note the use of calculated or derived fields. Take the example
where a table contains PRICE, QUANTITY and EXTENDED_PRICE where
EXTENDED_PRICE is calculated as QUANTITY multiplied by PRICE. As one of
these values can be calculated from the other two then it need not be held in the
database table. Do not assume that it is safe to drop any one of the three fields
as a difference in the number of decimal places between the various fields could
lead to different results due to rounding errors. For example, take the following
fields:
• AMOUNT - a monetary value in home currency, to 2 decimal places.
• EXCH_RATE - exchange rate, to 9 decimal places.
• CURRENCY_AMOUNT - amount expressed in foreign currency,
calculated as AMOUNT multiplied by EXCH_RATE.
If you were to drop EXCH_RATE could it be calculated back to its original 9
decimal places?
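The question can be answered with a little arithmetic. The sketch below uses Python's decimal module; the sample values and the assumption that CURRENCY_AMOUNT is stored rounded to 2 decimal places are mine, chosen only to illustrate the point.

```python
from decimal import Decimal, ROUND_HALF_UP

amount = Decimal("100.00")            # AMOUNT, 2 decimal places
exch_rate = Decimal("1.234567891")    # EXCH_RATE, 9 decimal places

# CURRENCY_AMOUNT as it would be stored (assumed: rounded to 2 places)
currency_amount = (amount * exch_rate).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)

# Attempt to recover the exchange rate from the two remaining fields
recovered_rate = (currency_amount / amount).quantize(
    Decimal("0.000000001"), rounding=ROUND_HALF_UP)

print(currency_amount)   # 123.46
print(recovered_rate)    # 1.234600000
print(recovered_rate == exch_rate)
```

The recovered rate agrees with the original to only four decimal places, so in this case EXCH_RATE is not a field that can safely be dropped.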
Reaching 3NF is adequate for most practical needs, but there may be
circumstances which would benefit from further normalisation.
Boyce-Codd Normal Form
A table is in Boyce-Codd normal form (BCNF) if and only if it is in 3NF and every
determinant is a candidate key.
1. Anomalies can occur in relations in 3NF if there is a composite key in
which part of that key has a determinant which is not itself a candidate key.
2. This can be expressed as R(A,B,C), C -> A where:
o The relation contains attributes A, B and C.
o A and B form a candidate key.
o C is the determinant for A (A is functionally dependent on C).
o C is not part of any key.
3. Anomalies can also occur where a relation contains several candidate
keys where:
o The keys contain more than one attribute (they are composite).
o An attribute is common to more than one key.
Take the following table structure as an example:
schedule(campus, course, class, time, room/bldg)
Take the following sample data:
campus  course       class  time         room/bldg
East    English 101  1      8:00-9:00    212 AYE
East    English 101  2      10:00-11:00  305 RFK
West    English 101  3      8:00-9:00    102 PPR
Note that no two buildings on any of the university campuses have the same
name, thus ROOM/BLDG -> CAMPUS. As the determinant is not a candidate key
this table is NOT in Boyce-Codd normal form.
This table should be decomposed into the following relations:
R1(course, class, room/bldg, time)
R2(room/bldg, campus)
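As a rough check that nothing is lost by this decomposition, the sketch below loads the sample data into R1 and R2 and joins them back together. The column name room_bldg (in place of room/bldg) and the choice of (course, class) as the key of R1 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r2 (
        room_bldg TEXT PRIMARY KEY,   -- the determinant is now a key
        campus    TEXT NOT NULL
    );
    CREATE TABLE r1 (
        course    TEXT    NOT NULL,
        class     INTEGER NOT NULL,
        room_bldg TEXT    NOT NULL REFERENCES r2(room_bldg),
        time      TEXT    NOT NULL,
        PRIMARY KEY (course, class)   -- assumed key
    );
    INSERT INTO r2 VALUES
        ('212 AYE', 'East'), ('305 RFK', 'East'), ('102 PPR', 'West');
    INSERT INTO r1 VALUES
        ('English 101', 1, '212 AYE', '8:00-9:00'),
        ('English 101', 2, '305 RFK', '10:00-11:00'),
        ('English 101', 3, '102 PPR', '8:00-9:00');
""")

# Joining on room_bldg rebuilds the original schedule rows.
schedule = conn.execute("""
    SELECT r2.campus, r1.course, r1.class, r1.time, r1.room_bldg
    FROM r1 JOIN r2 ON r2.room_bldg = r1.room_bldg
    ORDER BY r1.class
""").fetchall()
for row in schedule:
    print(row)
```

The campus of each building is now recorded once in R2 instead of once per class held there.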
As another example take the following structure:
enrol(student#, s_name, course#, c_name, date_enrolled)
This table has the following candidate keys:
• (student#, course#)
• (student#, c_name)
• (s_name, course#) - this assumes that s_name is a unique identifier
• (s_name, c_name) - this assumes that c_name is a unique identifier
The relation is in 3NF but not in BCNF because of the following dependencies:
• student# -> s_name
• course# -> c_name
4th Normal Form
A table is in fourth normal form (4NF) if and only if it is in BCNF and contains no
more than one multi-valued dependency.
1. Anomalies can occur in relations in BCNF if there is more than one
multi-valued dependency.
2. If A ->> B and A ->> C but B and C are unrelated, ie A ->> (B,C) is false, then
we have more than one multi-valued dependency.
3. A relation is in 4NF when it is in BCNF and has no more than one
multi-valued dependency.
Take the following table structure as an example:
info(employee#, skills, hobbies)
Take the following sample data:
employee#  skills       hobbies
1          Programming  Golf
1          Programming  Bowling
1          Analysis     Golf
1          Analysis     Bowling
2          Analysis     Golf
2          Analysis     Gardening
2          Management   Golf
2          Management   Gardening
This table is difficult to maintain since adding a new hobby requires multiple new
rows corresponding to each skill. This problem is created by the pair of
multi-valued dependencies EMPLOYEE# ->> SKILLS and EMPLOYEE# ->> HOBBIES.
A much better alternative would be to decompose INFO into two relations:
skills(employee#, skill)
hobbies(employee#, hobby)
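The maintenance saving can be seen by counting rows. In the sketch below the two decomposed tables hold the sample data, and the original INFO rows are rebuilt by a join; the column name employee (for employee#) and the new hobby 'Chess' are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE skills  (employee INTEGER, skill TEXT,
                          PRIMARY KEY (employee, skill));
    CREATE TABLE hobbies (employee INTEGER, hobby TEXT,
                          PRIMARY KEY (employee, hobby));
    INSERT INTO skills VALUES
        (1, 'Programming'), (1, 'Analysis'),
        (2, 'Analysis'), (2, 'Management');
    INSERT INTO hobbies VALUES
        (1, 'Golf'), (1, 'Bowling'), (2, 'Golf'), (2, 'Gardening');
""")

def info_rows():
    """Rebuild the rows of the original INFO table by joining on employee."""
    return conn.execute("""
        SELECT s.employee, s.skill, h.hobby
        FROM skills AS s JOIN hobbies AS h ON h.employee = s.employee
    """).fetchall()

before = len(info_rows())

# One new hobby for employee 1 is a single row in the decomposed design ...
conn.execute("INSERT INTO hobbies VALUES (1, 'Chess')")

# ... but it corresponds to one INFO row per skill that employee 1 has.
after = len(info_rows())
print(before, after)
```

The single insert into hobbies stands in for the two rows (one per skill) that the undecomposed INFO table would have needed.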
5th (Projection-Join) Normal Form
A table is in fifth normal form (5NF) or Projection-Join Normal Form (PJNF) if it is
in 4NF and it cannot have a lossless decomposition into any number of smaller
tables.
Another way of expressing this is:
... and each join dependency is a consequence of the candidate keys.
Yet another way of expressing this is:
... and there are no pairwise cyclical dependencies in the primary key comprised
of three or more attributes.
• Anomalies can occur in relations in 4NF if the primary key has three or
more fields.
• 5NF is based on the concept of join dependence - if a relation cannot be
decomposed any further then it is in 5NF.
• Pairwise cyclical dependency means that:
o You always need to know two values (pairwise).
o For any one you must know the other two (cyclical).
Take the following table structure as an example:
buying(buyer, vendor, item)
This is used to track buyers, what they buy, and from whom they buy.
Take the following sample data:
buyer  vendor         item
Sally  Liz Claiborne  Blouses
Mary   Liz Claiborne  Blouses
Sally  Jordach        Jeans
Mary   Jordach        Jeans
Sally  Jordach        Sneakers
The question is, what do you do if Claiborne starts to sell Jeans? How many
records must you create to record this fact?
The problem is there are pairwise cyclical dependencies in the primary key. That
is, in order to determine the item you must know the buyer and vendor, and to
determine the vendor you must know the buyer and the item, and finally to know
the buyer you must know the vendor and the item.
The solution is to break this one table into three tables; Buyer-Vendor,
Buyer-Item, and Vendor-Item.
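The three-table solution can be explored with the sketch below. It assumes the business rule implied by the join dependency (a buyer buys an item from a vendor whenever the buyer deals with that vendor, the buyer buys that item, and the vendor sells that item); the table names are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buyer_vendor (buyer TEXT, vendor TEXT, PRIMARY KEY (buyer, vendor));
    CREATE TABLE buyer_item   (buyer TEXT, item TEXT,   PRIMARY KEY (buyer, item));
    CREATE TABLE vendor_item  (vendor TEXT, item TEXT,  PRIMARY KEY (vendor, item));
    INSERT INTO buyer_vendor VALUES
        ('Sally', 'Liz Claiborne'), ('Mary', 'Liz Claiborne'),
        ('Sally', 'Jordach'), ('Mary', 'Jordach');
    INSERT INTO buyer_item VALUES
        ('Sally', 'Blouses'), ('Mary', 'Blouses'),
        ('Sally', 'Jeans'), ('Mary', 'Jeans'), ('Sally', 'Sneakers');
    INSERT INTO vendor_item VALUES
        ('Liz Claiborne', 'Blouses'), ('Jordach', 'Jeans'), ('Jordach', 'Sneakers');
""")

def buying():
    """Rebuild the BUYING rows by joining the three projections."""
    return conn.execute("""
        SELECT bv.buyer, bv.vendor, bi.item
        FROM buyer_vendor AS bv
        JOIN buyer_item  AS bi ON bi.buyer = bv.buyer
        JOIN vendor_item AS vi ON vi.vendor = bv.vendor AND vi.item = bi.item
        ORDER BY bv.buyer, bv.vendor, bi.item
    """).fetchall()

before = buying()   # the five rows of the sample data

# Claiborne starts to sell Jeans: one new row records the fact.
conn.execute("INSERT INTO vendor_item VALUES ('Liz Claiborne', 'Jeans')")
after = buying()
print(len(before), len(after))
```

With the single BUYING table the same fact would have needed one new row for every buyer who deals with Claiborne and buys Jeans; here it is one row in Vendor-Item.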
6th (Domain-Key) Normal Form
A table is in sixth normal form (6NF) or Domain-Key normal form (DKNF) if it is in
5NF and if all constraints and dependencies that should hold on the relation can
be enforced simply by enforcing the domain constraints and the key constraints
specified on the relation.
Another way of expressing this is:
... if every constraint on the table is a logical consequence of the definition of
keys and domains.
1. A domain constraint (better called an attribute constraint) is simply a
constraint to the effect that a given attribute A of R takes its values from some
given domain D.
2. A key constraint is simply a constraint to the effect that a given set A,
B, ..., C of R constitutes a key for R.
This standard was proposed by Ron Fagin in 1981, but interestingly enough he
made no note of multi-valued dependencies, join dependencies, or functional
dependencies in his paper and did not demonstrate how to achieve DKNF.
However, he did manage to demonstrate that DKNF is often impossible to
achieve.
If relation R is in DKNF, then it is sufficient to enforce the domain and key
constraints for R, and all constraints on R will be enforced automatically.
Enforcing those domain and key constraints is, of course, very simple (most
DBMS products do it already). To be specific, enforcing domain constraints just
means checking that attribute values are always values from the applicable
domain (i.e., values of the right type); enforcing key constraints just means
checking that key values are unique.
Unfortunately lots of relations are not in DKNF in the first place. For example,
suppose there's a constraint on R to the effect that R must contain at least ten
tuples. Then that constraint is certainly not a consequence of the domain and key
constraints that apply to R, and so R is not in DKNF. The sad fact is, not all
relations can be reduced to DKNF; nor do we know the answer to the question
"Exactly when can a relation be so reduced?"
Denormalisation is the process of modifying a perfectly normalised database
design for performance reasons. Denormalisation is a natural and necessary part
of database design, but must follow proper normalisation. Here are a few words
from C J Date on denormalisation:
The general idea of normalization...is that the database designer should aim for
relations in the "ultimate" normal form (5NF). However, this recommendation
should not be construed as law. Sometimes there are good reasons for flouting
the principles of normalization.... The only hard requirement is that relations be in
at least first normal form. Indeed, this is as good a place as any to make the point
that database design can be an extremely complex task.... Normalization theory
is a useful aid in the process, but it is not a panacea; anyone designing a
database is certainly advised to be familiar with the basic techniques of
normalization...but we do not mean to suggest that the design should necessarily
be based on normalization principles alone.
C.J. Date
An Introduction to Database Systems
Pages 528-529
In the 1970s and 1980s when computer hardware was bulky, expensive and slow
it was often considered necessary to denormalise the data in order to achieve
acceptable performance, but this performance boost often came with a cost
(refer to Modification Anomalies). By comparison, computer hardware in the 21st
century is extremely compact, extremely cheap and extremely fast. When this is
coupled with the enhanced performance from today's DBMS engines the
performance from a normalised database is often acceptable, therefore there is
less need for any denormalisation.
However, under certain conditions denormalisation can be perfectly acceptable.
Take the following table as an example:
Company          City      State  Zip
Acme Widgets     New York  NY     10169
ABC Corporation  Miami     FL     33196
XYZ Inc          Columbia  MD     21046
This table is NOT in 3rd normal form because the city and state are dependent
upon the ZIP code. To place this table in 3NF, two separate tables would be
created - one containing the company name and ZIP code and the other
containing city, state, ZIP code pairings.
This may seem overly complex for daily applications and indeed it may be.
Database designers should always keep in mind the tradeoffs between higher
level normal forms and the resource issues that complexity creates.
Deliberate denormalisation is commonplace when you're optimizing performance.
If you continuously draw data from a related table, it may make sense to
duplicate the data redundantly. Denormalisation always makes your system
potentially less efficient and flexible, so denormalise as needed, but not
[...]
There are techniques for improving performance that involve storing redundant or
calculated data. Some of these techniques break the rules of normalisation,
others do not. Sometimes real world requirements justify breaking the rules.
Intelligently and consciously breaking the rules of normalisation for performance
purposes is an accepted practice, and should only be done when the benefits of
the change justify breaking the rule.
Compound Fields
A compound field is a field whose value is the combination of two or more fields
in the same record. The cost of using compound fields is the space they occupy
and the code needed to maintain them. (Compound fields typically violate 2NF or
3NF.)
For example, if your database has a table with addresses including city and
state, you can create a compound field (call it City_State) that is made up of the
concatenation of the city and state fields. Sorts and queries on City_State are
much faster than the same sort or query using the two source fields - sometimes
even 40 times faster.
The downside of compound fields for the developer is that you have to write code
to make sure that the City_State field is updated whenever either the city or the
state field value changes. This is not difficult to do, but it is important that there
are no 'leaks', or situations where the source data changes and, through some
oversight, the compound field value is not updated.
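One way of closing such 'leaks' is to let the database maintain the compound field itself. The sketch below does this with two SQLite triggers; the table layout, the trigger names and the ', ' separator are assumptions for illustration, and other database engines will have their own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        city       TEXT NOT NULL,
        state      TEXT NOT NULL,
        city_state TEXT               -- the compound field
    );
    -- fill in the compound field when a row is created
    CREATE TRIGGER address_ai AFTER INSERT ON address
    BEGIN
        UPDATE address SET city_state = NEW.city || ', ' || NEW.state
        WHERE address_id = NEW.address_id;
    END;
    -- refresh it whenever either source field changes
    CREATE TRIGGER address_au AFTER UPDATE OF city, state ON address
    BEGIN
        UPDATE address SET city_state = NEW.city || ', ' || NEW.state
        WHERE address_id = NEW.address_id;
    END;
    INSERT INTO address (city, state) VALUES ('Miami', 'FL');
""")

# The application changes only the source field ...
conn.execute("UPDATE address SET city = 'Tampa' WHERE address_id = 1")

# ... and the compound field follows without any application code.
city_state = conn.execute(
    "SELECT city_state FROM address WHERE address_id = 1").fetchone()[0]
print(city_state)
```

Because the update happens inside the database, it applies to every program that changes the table, not just the ones whose authors remembered.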
Summary Fields
A summary field is a field in a one table record whose value is based on data in
related-many table records. Summary fields eliminate repetitive and
time-consuming cross-table calculations and make calculated results directly available
for end-user queries, sorts, and reports without new programming. One-table
fields that summarise values in multiple related records are a powerful
optimization tool. Imagine tracking invoices without maintaining the invoice total!
Summary fields like this do not violate the rules of normalisation. Normalisation is
often misconceived as forbidding the storage of calculated values, leading people
to avoid appropriate summary fields.
There are two costs to consider when contemplating using a summary field: the
coding time required to maintain accurate data and the space required to store
the summary field.
Some typical summary fields which you may encounter in an accounting system
are:
• For an INVOICE the invoice amount is the total of the amounts on all
INVOICE_LINE records for that invoice.
• For an ACCOUNT the account balance will be the sum total of the
amounts on all INVOICE and PAYMENT records for that account.
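The invoice total is perhaps the simplest summary field to demonstrate. The sketch below keeps invoice_amount in step with the INVOICE_LINE records by means of SQLite triggers; the column names, trigger names and whole-number amounts are assumptions, and the same maintenance could equally be done in program code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        invoice_id     INTEGER PRIMARY KEY,
        invoice_amount INTEGER NOT NULL DEFAULT 0   -- the summary field
    );
    CREATE TABLE invoice_line (
        invoice_id INTEGER NOT NULL REFERENCES invoice(invoice_id),
        line_no    INTEGER NOT NULL,
        amount     INTEGER NOT NULL,
        PRIMARY KEY (invoice_id, line_no)
    );
    CREATE TRIGGER invoice_line_ai AFTER INSERT ON invoice_line
    BEGIN
        UPDATE invoice SET invoice_amount = invoice_amount + NEW.amount
        WHERE invoice_id = NEW.invoice_id;
    END;
    CREATE TRIGGER invoice_line_au AFTER UPDATE OF amount ON invoice_line
    BEGIN
        UPDATE invoice SET invoice_amount = invoice_amount - OLD.amount + NEW.amount
        WHERE invoice_id = NEW.invoice_id;
    END;
    CREATE TRIGGER invoice_line_ad AFTER DELETE ON invoice_line
    BEGIN
        UPDATE invoice SET invoice_amount = invoice_amount - OLD.amount
        WHERE invoice_id = OLD.invoice_id;
    END;
    INSERT INTO invoice (invoice_id) VALUES (1);
    INSERT INTO invoice_line VALUES (1, 1, 100), (1, 2, 250), (1, 3, 50);
    DELETE FROM invoice_line WHERE invoice_id = 1 AND line_no = 3;
    UPDATE invoice_line SET amount = 200 WHERE invoice_id = 1 AND line_no = 2;
""")

# The stored total can be read directly ...
stored_total = conn.execute(
    "SELECT invoice_amount FROM invoice WHERE invoice_id = 1").fetchone()[0]
# ... and should always agree with the cross-table calculation it replaces.
recomputed_total = conn.execute(
    "SELECT SUM(amount) FROM invoice_line WHERE invoice_id = 1").fetchone()[0]
print(stored_total, recomputed_total)
```

Comparing the stored value with a fresh SUM, as the last two queries do, is also a cheap way of auditing a summary field for errors.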
Summary Tables
A summary table is a table whose records summarise large amounts of related
data or the results of a series of calculations. The entire table is maintained to
optimise reporting, querying, and generating cross-table selections. Summary
tables contain derived data from multiple records and do not necessarily violate
the rules of normalisation. People often overlook summary tables based on the
misconception that derived data is necessarily denormalised.
In order for a summary table to be useful it needs to be accurate. This means
you need to update summary records whenever source records change. This
task can be taken care of in the program code, or in a database trigger
(preferred), or in a batch process. You must also make sure to update summary
records if you change source data in your code. Keeping the data valid requires
extra work and introduces the possibility of coding errors, so you should factor
this cost in when deciding if you are going to use this technique.
Optional Attributes that exist as a group
As mentioned in the guidelines for developing relations in 3rd normal form all
relations which share the same primary key are supposed to be combined into
the same table. However, there are circumstances where it is perfectly valid to
ignore this rule. Take the following example which I encountered in 1984:
• A finance company gives loans to customers, and a record is kept of each
customer's repayments.
• If a customer does not meet a scheduled repayment then his account
goes into arrears and special action needs to be taken.
• Of the total customer base about 5% are in arrears at any one time.
This means that with 100,000 customers there will be roughly 5,000 in arrears. If
the arrears data is held on the same record as the basic customer data (both
sets of data have customer_id as the primary key) then it requires searching
through all 100,000 records to locate those which are in arrears. This is not very
efficient. One method tried was to create an index on account_status which
identified whether the account was in arrears or not, but the improvement (due to
the speed of the hardware and the limitations of the database engine) was
[...]
A solution in these circumstances is to extract all the attributes which deal with
arrears and put them in a separate table. Thus if there are 5,000 customers in
arrears you can reference a table which contains only 5,000 records. As the
arrears data is subordinate to the customer data the arrears table must be the
'child' in the relationship with the customer 'parent'. It would be possible to give
the arrears table a different primary key as well as the foreign key to the
customer table, but this would allow the customer-arrears relationship to be
one-to-many instead of one-to-one. To enforce this constraint the foreign key and
the primary key should be exactly the same.
This situation can be expressed using the following structure:
R (K, A, B, C, X, Y, Z) where:
1. Attribute K is the primary key.
2. Attributes (A B C) exist all the time.
3. Attributes (X Y Z) exist some of the time (but always as a group under the
same circumstances).
4. Attributes (X Y Z) require special processing.
After denormalising the result is two separate relations, as follows:
• R1 (K, A, B, C)
• R2 (K, X, Y, Z) where K is also the foreign key to R1
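A minimal sketch of the R1/R2 arrangement follows, using customer and arrears tables. The table name customer_arrears, the column arrears_amount and the sample rows are assumptions; the point being shown is that making the foreign key the primary key keeps the relationship one-to-one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- the primary key is also the foreign key, so at most one
    -- arrears row can exist for any customer
    CREATE TABLE customer_arrears (
        customer_id    INTEGER PRIMARY KEY REFERENCES customer(customer_id),
        arrears_amount INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO customer_arrears VALUES (2, 150);
""")

# A second arrears row for the same customer is refused by the key.
try:
    conn.execute("INSERT INTO customer_arrears VALUES (2, 75)")
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

# Finding customers in arrears reads only the small table, then joins.
in_arrears = conn.execute("""
    SELECT c.customer_id, c.name, a.arrears_amount
    FROM customer_arrears AS a
    JOIN customer AS c ON c.customer_id = a.customer_id
""").fetchall()
print(second_insert_rejected, in_arrears)
```

The query that drives the special processing starts from the arrears table, so its cost depends on the number of customers in arrears rather than on the size of the customer base.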
Personal Guidelines
Even if you obey all the preceding rules it is still possible to produce a database
design that causes problems during development. I have come across many
different implementation tips and techniques over the years, and some that have
worked in one database system have been successfully carried forward into a
new database system. Some tips, on the other hand, may only be applicable to a
particular database system.
For particular options and limitations you must refer to your database manual.
Database Names
1. Database names should be short and meaningful, such as 'products',
'purchasing' and 'sales'.
o Short, but not too short, as in 'prod' or 'purch'.
o Meaningful but not verbose, as in 'the database used to store
product details'.
2. Do not waste time using a prefix such as 'db' to identify database names.
The SQL syntax analyser has the intelligence to work that out for itself - so
should you.
3. If your DBMS allows a mixture of upper and lowercase names, and it is
case sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and
lower case).
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes,
mistakes can lead to disasters.
4. If a database name contains more than one word, such as in 'sales orders'
and 'purchase orders', decide how to deal with it:
o Separate the words with a single space, as in 'sales orders' (note
that some DBMSs do not allow embedded spaces, while most
languages will require such names to be enclosed in quotes).
o Separate the words with an underscore, as in 'sales_orders' (my
preference - see The choice between upper and lower case).
o Separate the words with a hyphen, as in 'sales-orders'.
o Use camel caps, as in 'SalesOrders'.
Again, be consistent.
5. Rather than putting all the tables into a single database it may be better to
create separate databases for each logically related set of tables. This may
help with security, archiving, replication, etc.
Table Names
1. Table names should be short and meaningful, such as 'part', 'customer'
and 'invoice'.
o Short, but not too short.
o Meaningful, but not verbose.
2. Do not waste time using a prefix such as 'tbl' to identify table names. The
SQL syntax analyser has the intelligence to work that out for itself - so
should you.
3. Table names should be in the singular (e.g. 'customer' not 'customers').
The fact that a table may contain multiple entries is irrelevant - any
multiplicity can be derived from the existence of one-to-many relationships.
4. If your DBMS allows a mixture of upper and lowercase names, and it is
case sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and
lower case).
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes,
mistakes can lead to disasters.
5. If a table name contains more than one word, such as in 'sales order' and
'purchase order', decide how to deal with it:
o Separate the words with a single space, as in 'sales order' (note
that some DBMSs do not allow embedded spaces, while most
languages will require such names to be enclosed in quotes).
o Separate the words with an underscore, as in 'sales_order' (my
preference - see The choice between upper and lower case).
o Separate the words with a hyphen, as in 'sales-order'.
o Use camel caps, as in 'SalesOrder'.
Again, be consistent.
6. Be careful if the same table name is used in more than one database - it
may lead to confusion.
Field Names
1. Field names should be short and meaningful, such as 'part_name' and
'customer_name'.
o Short, but not too short, such as in 'ptnam'.
o Meaningful, but not verbose, such as 'the name of the part'.
2. Do not waste time using a prefix such as 'col' or 'fld' to identify column/field
names. The SQL syntax analyser has the intelligence to work that out for
itself - so should you.
3. If your DBMS allows a mixture of upper and lowercase names, and it is
case sensitive, it is better to stick to a standard naming convention such as:
o All uppercase.
o All lowercase (my preference - see The choice between upper and
lower case).
o Leading uppercase, remainder lowercase.
Inconsistencies may lead to confusion, confusion may lead to mistakes,
mistakes can lead to disasters.
4. If a field name contains more than one word, such as in 'part name' and
'customer name', decide how to deal with it:
o Separate the words with a single space, as in 'part name' (note that
some DBMSs do not allow embedded spaces, while most languages
will require such names to be enclosed in quotes).
o Separate the words with an underscore, as in 'part_name' (my
preference - see The choice between upper and lower case).
o Separate the words with a hyphen, as in 'part-name'.
o Use camel caps, as in 'PartName'.
Again, be consistent.
5. Common words in field names may be abbreviated, but be consistent.
o Do not allow a mixture of abbreviations, such as 'no', 'num' and 'nbr'
for 'number'.
o Publish a list of standard abbreviations and enforce it.
6. Although field names must be unique within a table, it is possible to use
the same name on multiple tables even if they are unrelated, or they do not
share the same set of possible values. It is recommended that this practice
should be avoided, for reasons described in Field names should identify
their content and The naming of Foreign Keys.
Primary Keys
1. It is recommended that the primary key of an entity should be constructed
from the table name with a suffix of '_ID'. This makes it easy to identify the
primary key in a long list of field names.
2. Do not waste time using a prefix such as 'pk' to identify primary key fields.
This has absolutely no meaning to any database engine or any application.
3. Avoid using generic names for all primary keys. It may seem a clever idea
to use the name 'ID' for every primary key field, but this causes problems:
o It causes the same name to appear on multiple tables with totally
different contexts. The string ID='ABC123' is extremely vague as it
gives no idea of the entity being referenced. Is it an invoice id, customer
id, or what?
o It also causes a problem with foreign keys.
4. There is no rule that says a primary key must consist of a single attribute -
both simple and composite keys are allowed - so don't waste time creating
artificial keys.
5. Avoid the unnecessary use of technical keys. If a table already contains a
satisfactory unique identifier, whether composite or simple, there is no need
to create another one. Although the use of a technical key can be justified
in certain circumstances, it takes intelligence to know when those
circumstances are right. The indiscriminate use of technical keys shows a
distinct lack of intelligence. For further views on this subject please refer to
Technical Keys - Their Uses and Abuses.
Foreign Keys
1. It is recommended that where a foreign key is required you use the
same name as that of the associated primary key on the foreign table. It is
a requirement of a relational join that two relations can only be joined when
they share at least one common attribute, and this should be taken to mean
the attribute name(s) as well as the value(s). Thus where the 'customer'
and 'invoice' tables are joined in a parent-child relationship the following will
result:
o The primary key of 'customer' will be 'customer_id'.
o The primary key of 'invoice' will be 'invoice_id'.
o The foreign key which joins 'invoice' to 'customer' will be
'customer_id'.
2. For MySQL users this means that the shortened version of the join
condition may be used:
o Short: A LEFT JOIN B USING (a,b,c)
o Long: A LEFT JOIN B ON (A.a=B.a AND A.b=B.b AND A.c=B.c)
3. The only exception to this naming recommendation should be where a
table contains more than one foreign key to the same parent table, in which
case the names must be changed to avoid duplicates. In this situation I
would simply add a meaningful suffix to each name to identify the usage,
such as:
o To signify movement I would use 'location_id_from' and
'location_id_to'.
o To signify positions in a hierarchy I would use 'node_id_snr' and
'node_id_jnr'.
o To signify replacement I would use 'part_id_old' and 'part_id_new'.
4. Do not waste time using a prefix such as 'fk' to identify foreign key fields.
This has absolutely no meaning to any database engine or any application.
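The short and long forms of the join condition can be compared directly. The sketch below runs both against the 'customer' and 'invoice' tables; it uses SQLite rather than MySQL (SQLite also accepts the USING form), and the customer_name column and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT
    );
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id)  -- same name as the parent key
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'XYZ');
    INSERT INTO invoice VALUES (10, 1), (11, 1);
""")

# Short form: possible only because both tables call the column customer_id.
short_form = conn.execute("""
    SELECT customer_name, invoice_id
    FROM customer LEFT JOIN invoice USING (customer_id)
    ORDER BY customer_name, invoice_id
""").fetchall()

# Long form: the same join with the condition spelled out.
long_form = conn.execute("""
    SELECT customer_name, invoice_id
    FROM customer LEFT JOIN invoice
      ON (customer.customer_id = invoice.customer_id)
    ORDER BY customer_name, invoice_id
""").fetchall()
print(short_form == long_form, short_form)
```

Had the foreign key on 'invoice' been given a different name from the primary key of 'customer', only the long form would have been available.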
;enerating <ni=ue ids
?here a technical primary &ey is used a mechanism is re,uired that ill generate
ne and uni,ue values! Such &eys are usually numeric, so there are several
methods available(
1. Some database engines will maintain a set of sequence numbers for you
which can be referenced using code such as:
    SELECT <seq_name>.NEXTVAL FROM DUAL
Using such a sequence is a two-step procedure:
o Access the sequence to obtain a value.
o Use the supplied value on an INSERT statement.
It is sometimes possible to access the sequence directly from an INSERT
statement, as in the following:
    INSERT INTO tablename (col1,col2,...) VALUES (<seq_name>.NEXTVAL,...)
If the number just used needs to be retrieved so that it can be passed back
to the application this can be done with the following:
    SELECT <seq_name>.CURRVAL FROM DUAL
I have used this method, but a disadvantage that I have found is that the
DBMS has no knowledge of what primary key is linked to which sequence,
so it is possible to insert a record with a key not obtained from the
sequence and thus cause the two to become unsynchronised. The next
time the sequence is used it could therefore generate a value which already
exists as a key and therefore cause an INSERT error.
2. Some database engines will allow you to specify a numeric field as
'auto-increment', and on an INSERT they will automatically generate the next
available number (provided that no value is provided for that field in the first
place). This is better than the previous method because:
o The sequence is tied directly to a particular database table and is
not a separate object, thus it is impossible to become unsynchronised.
o It is not necessary to access the sequence then use the returned
value on an INSERT statement - just leave the field empty and the
DBMS will fill in the value automatically.
3. While the previous methods have their merits, they both have a common
failing in that they are non-standard extensions to the SQL standard,
therefore they are not available in all SQL-compliant database engines.
This becomes an important factor if it is ever decided to switch to another
database engine. A truly portable method which uses a standard technique
and can therefore be used in any SQL-compliant database is to use an
SQL statement similar to the following to obtain a unique key for a table:
    SELECT max(table_id) FROM <tablename>
    table_id = table_id+1
Some people seem to think that this method is inefficient as it requires a full
table search, but they are missing the fact that table_id is a primary key,
therefore the values are held within an index. The SELECT max(...)
statement will automatically be optimised to go straight to the last value in
the index, therefore the result is obtained with almost no overhead. This
would not be the case if I used SELECT count(...) as this would have to
physically count the number of entries. Another reason for not using
SELECT count(...) is that if records were to be deleted then the record
count would be out of step with the highest current value.
4. The Radicore development framework has separate data access objects
for each DBMS to which it can connect. This means that the different code
for dealing with auto_increment keys can be contained within each object,
so is totally transparent to the application. All that is necessary is that the
key be identified as 'auto_increment' in the Data Dictionary and the
database object will take care of all the necessary processing.
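The difference between the max() and count() approaches, and the auto-increment
alternative, can be sketched as follows. This is only an illustration under
stated assumptions: it uses Python's built-in sqlite3 module as a stand-in
DBMS, the 'customer' and 'invoice' tables and the helper names next_id() and
insert_customer() are hypothetical, and it assumes a single writer (with
concurrent writers the read-then-insert step would need a lock or a retry on
a duplicate key error).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT)"
)
# SQLite's flavour of an auto-increment key, used for comparison below.
conn.execute(
    "CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id INTEGER)"
)


def next_id(conn, table, key):
    """Return max(key)+1 for the table, or 1 when the table is empty."""
    # table and key are trusted identifiers here, never user input.
    row = conn.execute(f"SELECT max({key}) FROM {table}").fetchone()
    return (row[0] or 0) + 1


def insert_customer(conn, name):
    """Obtain the next key, then use it on the INSERT (single writer assumed)."""
    with conn:  # commits on success, rolls back if the INSERT fails
        new_id = next_id(conn, "customer", "customer_id")
        conn.execute(
            "INSERT INTO customer (customer_id, customer_name) VALUES (?, ?)",
            (new_id, name),
        )
    return new_id


ids = [insert_customer(conn, name) for name in ("Smith", "Jones", "Brown")]

# Delete a record from the middle: count(*) now lags behind max(customer_id).
with conn:
    conn.execute("DELETE FROM customer WHERE customer_id = 2")
row_count = conn.execute("SELECT count(*) FROM customer").fetchone()[0]

# count(*)+1 would be 3, which already exists; max()+1 gives a free value.
after_delete_id = insert_customer(conn, "Green")

# Auto-increment: leave the key out and ask the driver what was generated.
cur = conn.execute("INSERT INTO invoice (customer_id) VALUES (?)", (1,))
auto_id = cur.lastrowid
```

After the delete the table holds two rows, so a key built from count(*) would
collide with the existing key 3, whereas the max() approach moves on to 4.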
Some people disagree with my ideas, but usually because they have limited
experience and only know what they have been taught. What I have stated here
is the result of decades of experience using various database systems with
various languages. This is what I have learned, and goes beyond what I have
been taught. There are valid reasons for some of the preferences I have stated in
this document, and it may prove beneficial to state these in more detail.
The choice between upper and lower case
When I first started programming in the 1970s all coding was input via punched
cards, not a VDU (that's a Visual Display Unit to the uninitiated), and there was
no such thing as lowercase as the computer used a 6-bit character instead of an
8-bit byte and did not have enough room to deal with both lower and uppercase
characters. CONSEQUENTLY EVERYTHING HAD TO BE IN UPPER CASE.
When I progressed to a system where both cases were possible neither the
operating system nor the programming language cared which was used - they
were both case-insensitive. By common consent all the programmers preferred
to use lowercase for everything. The use of uppercase was considered TO BE
THE EQUIVALENT OF SHOUTING and was discouraged, except where
something important needed to stand out.
Until the last few years all the operating systems, database systems,
programming languages and text editors that I have used have been
case-insensitive. The UNIX operating system and its derivatives are
case-sensitive (for God's sake WHY??). The PHP programming language is
case-sensitive in certain areas.
I do not like systems which are case-sensitive for the following reasons:
• I have been working for 30 years with systems which have been
case-insensitive and I see no justification in making the switch.
• Case does not make a difference in any spoken language, so why should
it make a difference in any computer language?
• When I am merrily hammering away at the keyboard I do not like all those
pauses where I have to reach for the shift key. It tends to interrupt my train of
thought, and I do not like to be interrupted with trivialities.
• To my knowledge there is no database system which is case-sensitive, so
when I am writing code to access a database I do not like to be told which
case to use.
• With the growing trend of being able to speak to a computer instead of
using a keyboard, how frustrating will it become if you have to specify that
particular words and letters are in upper or lower case?
That is why my preference is for all database, table and field names to be in
lowercase as it works the same for both case-sensitive and case-insensitive
systems, so I don't get suddenly caught out when the software decides to get
picky. This also means that I use underscore separators instead of those ugly
CamelCaps (i.e. 'field_name' instead of 'FieldName').
This topic is discussed in more detail in Case Sensitive Software is EVIL.
The use of unique and non-unique field names
Some people think that my habit of including the table name inside a field name
(as in CUSTOMER.CUSTOMER_ID) introduces a level of redundancy and is
therefore wrong. I consider this view to be too narrow as it does not cater for all
the different circumstances I have encountered over the years.
Field names should identify their content
Over many years I have come to adopt a fairly straightforward convention with
the naming of fields:
• Field names should give some idea of their content.
If I see several tables which all contain field names such as ID and
DESCRIPTION it makes me want to reach for the rubber gloves, disinfectant
and scrubbing brush. A field named ID simply says that it contains an identity,
but the identity of what? A field named DESCRIPTION simply says that it
contains a description, but the description of what?
One of the first database systems which I used did not allow field definitions
to be included within the table definitions inside the schema. Instead all the
fields were defined in one area, and the table definitions simply listed the
fields which they contained. This meant that a field was defined with one set
of attributes (type and size) and those attributes could not be changed at the
table level. Thus ID could not be C10 in one table and C30 in another. The
only time we had fields with the same name existing on more than one table
was where there was a logical relationship between records which had the
same values in those fields.
Because of this it became standard practice to have unique field names on
each table using the table name as a prefix, such as table_id and
table_desc. One of the benefits of this approach was that we could build
standard code which provided the correct label and help text based on
nothing more than the field name itself.
When performing SQL JOINs between tables which have common field names
that are unrelated you have to give each field a unique alias name before you
can access its content. If each of these fields were given a unique name to
begin with then this step would not be necessary.
• Fields with the same context should have the same name.
If primary keys are named table_id instead of just id it then becomes
possible, when naming foreign key fields on related tables, to use the same
name for both the primary key and the foreign key. This makes it easier for a
human being to recognise certain fields for what they are - anything ending in
_id is a key field, and if the table prefix is not the current table then it is a
foreign key to that table. This is what we called "self-documenting field
names".
In some circumstances it may not be possible to use the same name. This
happens when the same field needs to appear more than once in the same
table. In this case I would start with the same basic name and add a suffix for
further identification, such as having location_id_FROM and
location_id_TO to identify movements from one location to another, or
having node_id_SNR and node_id_JNR to identify the senior and junior
nodes in a hierarchical relationship.
• Fields with different context should have different names.
It is not just primary key fields which should have unique names instead of
sharing the common name of id. Non-key fields should follow the same
convention for the same reasons. For example, if the CUSTOMER table has
a STATUS field with one set of values and the INVOICE table has a STATUS
field with another set of values then you should resist the temptation to give
the two different fields the same common name of STATUS. They should be
given proper names such as CUST_STATUS and INV_STATUS. They are
different fields with different meanings, therefore they deserve to have
different names.
The breaking of this simple rule caused problems in one of the short-lived
new-fangled languages that I used many years ago. This tool was built on the
assumption that fields with the same name that existed on more than one
table implied a relationship between those tables. If you tried to perform a join
between two tables this software would look for field names which existed on
both tables and automatically perform a natural join using those fields. This
caused our programs not to find the right records when we performed a join,
and the only way we could fix it was to give different names to unrelated
fields.
Those conventions arose out of experience, to avoid certain problems which
were encountered with certain languages. Every time I see these conventions
broken I do not have to wait long before I see the same problems reappearing.
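The natural join problem described above can be reproduced with a short
sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for
the tool in question, and the helper invoices_found() and the sample data are
purely hypothetical. The same query is run twice: once where both tables have
an unrelated field called 'status', and once where the fields are named
'cust_status' and 'inv_status'.

```python
import sqlite3


def invoices_found(cust_status_col, inv_status_col):
    """Build two small tables, then NATURAL JOIN invoice to customer.

    The column names are trusted identifiers chosen by the caller.
    Returns the invoice ids that the join manages to find.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                               {cust_status_col} TEXT);
        CREATE TABLE invoice  (invoice_id INTEGER PRIMARY KEY,
                               customer_id INTEGER,
                               {inv_status_col} TEXT);
        INSERT INTO customer VALUES (1, 'ACTIVE');
        INSERT INTO invoice  VALUES (10, 1, 'PAID');
        INSERT INTO invoice  VALUES (11, 1, 'UNPAID');
    """)
    # A natural join matches on EVERY column name the two tables share.
    rows = conn.execute(
        "SELECT invoice_id FROM invoice NATURAL JOIN customer ORDER BY invoice_id"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Shared name: the join also compares status values, so nothing matches.
shared_name = invoices_found("status", "status")

# Distinct names: only customer_id is shared, so both invoices are found.
distinct_names = invoices_found("cust_status", "inv_status")
```

With the shared name the join silently compares 'PAID' and 'UNPAID' against
'ACTIVE' and returns nothing; with distinct names both invoices come back.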
The naming of Foreign Keys
In any relationship the foreign key field(s) on the child/junior table are linked with
the primary key field(s) on the parent/senior table. These related fields do not
have to have the same name as it is still possible to perform a join, as shown in
the following example:
    SELECT field1, field2, field3
    FROM first_table
    LEFT [OUTER] JOIN second_table
    ON (first_table.keyfield = second_table.foreign_keyfield)
However, if the fields have the same name then it is possible to replace the ON
expression with a shorter USING expression, as in the following example:
    SELECT field1, field2, field3
    FROM first_table
    LEFT [OUTER] JOIN second_table
    USING (field1)
This feature is available in popular databases such as MySQL, PostgreSQL and
Oracle, so it just goes to show that using identical field names is a recognised
practice that has its benefits.
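As a quick check that the two forms are interchangeable when the names match,
here is a small sketch. It assumes SQLite via Python's sqlite3 module (which
also accepts USING) and hypothetical 'customer' and 'invoice' tables that
share the field name 'customer_id'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE invoice  (invoice_id INTEGER PRIMARY KEY,
                           customer_id INTEGER,
                           invoice_total REAL);
    INSERT INTO customer VALUES (1, 'Smith');
    INSERT INTO customer VALUES (2, 'Jones');
    INSERT INTO invoice  VALUES (10, 1, 99.5);
""")

# Long form: the join condition names both tables explicitly.
on_rows = conn.execute("""
    SELECT customer.customer_id, customer_name, invoice_id
    FROM customer
    LEFT OUTER JOIN invoice ON (customer.customer_id = invoice.customer_id)
    ORDER BY customer.customer_id
""").fetchall()

# Short form: possible only because both tables call the field customer_id.
using_rows = conn.execute("""
    SELECT customer_id, customer_name, invoice_id
    FROM customer
    LEFT OUTER JOIN invoice USING (customer_id)
    ORDER BY customer_id
""").fetchall()
```

Both queries return the same rows, including the customer without an invoice.
In the USING form the shared field can also be selected without a table
qualifier.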
Not only does the use of identical names have an advantage when performing
joins in an SQL query, it also has advantages when simulating joins in your
software. By this I mean where the reading of the two tables is performed in
separate operations. It is possible to perform this using standard code with the
following logic:
• Operation (1) performs the following after each database row has been
read:
o Identify the field(s) which constitute the primary key for the first
table.
o Extract the values for those fields from the current row.
o Construct a string in the format field1='value1'
o Pass this string to the next operation.
• Operation (2) performs the following:
o Use the string passed down from the previous operation as the
WHERE clause in a SELECT statement.
o Execute the query on the second table.
o Return the result back to the previous operation.
It is possible to perform these functions using standard code that never has to be
customised for any particular database table. I should know as I have done it in
two completely different languages. The only time that manual intervention (i.e.
extra code) is required is where the field names are not exactly the same, which
forces operation (2) to convert primary_key_field='value' to
foreign_key_field='value' before it can execute the query. Experienced
programmers should instantly recognise that the need for extra code incurs its
own overhead:
• The time taken to actually write this extra code.
• The time taken to test that the right code has been put in the right place.
• The time taken to amend this code should there be any database changes
in the future.
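The two operations described above might be sketched like this. It is not the
code from either of the languages I mentioned, just a minimal illustration
under assumptions: SQLite via Python's sqlite3 module, hypothetical helper
names (primary_key_fields, build_where, read_children), and placeholders with
a parameter list instead of a quoted field1='value1' string. The optional
rename argument stands for the extra code that is only needed when the foreign
key has a different name.

```python
import sqlite3


def primary_key_fields(conn, table):
    """Identify the field(s) which constitute the primary key of a table."""
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    keyed = sorted((col["pk"], col["name"]) for col in info if col["pk"] > 0)
    return [name for _, name in keyed]


def build_where(row, key_fields, rename=None):
    """Operation (1): turn the key values of the current row into a WHERE clause.

    rename maps primary key names to foreign key names and is only needed
    when the two names differ.
    """
    rename = rename or {}
    clause = " AND ".join(f"{rename.get(f, f)} = ?" for f in key_fields)
    params = [row[f] for f in key_fields]
    return clause, params


def read_children(conn, table, clause, params):
    """Operation (2): use the clause passed down to query the second table."""
    return conn.execute(f"SELECT * FROM {table} WHERE {clause}", params).fetchall()


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by field name
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE invoice  (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Smith');
    INSERT INTO customer VALUES (2, 'Jones');
    INSERT INTO invoice  VALUES (10, 1);
    INSERT INTO invoice  VALUES (11, 1);
""")

keys = primary_key_fields(conn, "customer")
results = {}
for customer in conn.execute("SELECT * FROM customer").fetchall():
    clause, params = build_where(customer, keys)
    invoices = read_children(conn, "invoice", clause, params)
    results[customer["customer_id"]] = sorted(inv["invoice_id"] for inv in invoices)

# Only needed when the names differ, e.g. a foreign key called cust_ref.
renamed = build_where({"customer_id": 1}, ["customer_id"], {"customer_id": "cust_ref"})
```

Nothing in the loop mentions 'customer_id' by name; the same code would work
for any pair of tables whose key fields share a name.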
The only occasion where fields with the same name are not possible is when a
table contains multiple versions of that field. This is where I would add a suffix to
give some extra meaning. For example:
• In a table which records movements or ranges I would have
<table>_ID_FROM and <table>_ID_TO.
• In a table which records a senior-to-junior hierarchy I would have
<table>_ID_SNR and <table>_ID_JNR.
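A small sketch of the suffix convention follows, assuming SQLite via Python's
sqlite3 module and a hypothetical 'movement' table with two foreign keys to
the same 'location' table. Because the names can no longer be identical, each
join has to use an alias and an explicit ON condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (location_id INTEGER PRIMARY KEY, location_name TEXT);
    CREATE TABLE movement (
        movement_id      INTEGER PRIMARY KEY,
        location_id_from INTEGER REFERENCES location (location_id),
        location_id_to   INTEGER REFERENCES location (location_id)
    );
    INSERT INTO location VALUES (1, 'Warehouse');
    INSERT INTO location VALUES (2, 'Shop');
    INSERT INTO movement VALUES (100, 1, 2);
""")

# The parent table is joined twice, once per foreign key, under two aliases.
moves = conn.execute("""
    SELECT movement_id, loc_from.location_name, loc_to.location_name
    FROM movement
    LEFT JOIN location AS loc_from
           ON (loc_from.location_id = movement.location_id_from)
    LEFT JOIN location AS loc_to
           ON (loc_to.location_id = movement.location_id_to)
""").fetchall()
```

The suffix still leaves the basic name visible, so it remains obvious that
both fields are foreign keys to 'location'.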
My view of field names can be summed up as follows:
• Fields with the same context should have the same name.
• Fields with different context should have different names.
• Key fields, whether primary or foreign, should be in the format
<table>_id.
• Duplicate foreign keys should be in the format <table>_id_<suffix>.
© Tony Marston
30th September 2004
Amendment history:
12 August 2005  Added a new section for Comments. Also added a new section for
                Types of Relational Join.
29 May 2005     Added comment about using prefixes for database, table and field