
Logical Database design

Earlier we saw how to convert an unorganized text description of information requirements into a
conceptual design, by the use of ER diagrams. The advantage of ER diagrams is that they force you to
identify data requirements that are implicitly known, but not explicitly written down in the original
description. Here we will see how to convert this ER into a logical design (this will be defined below)
of a relational database. The logical model is also called a Relational Model.

We shall represent a relation as a table with columns and rows. Each column of the table has a name,
or attribute. Each row is called a tuple.

Domain: a set of atomic values that an attribute can take

Attribute: name of a column in a particular table (all data is stored in tables). Each attribute Ai must
have a domain, dom(Ai).

Relational Schema: The design of one table, containing the name of the table (i.e. the name of the
relation), and the names of all the columns, or attributes.
Example: STUDENT( Name, SID, Age, GPA)

Degree of a Relation: the number of attributes in the relation's schema.

Tuple, t, of R( A1, A2, A3, …, An): an ORDERED set of values, < v1, v2, v3, …, vn>, where each
vi is a value from dom( Ai).

Relation Instance, r( R): a set of tuples; thus, r( R) = { t1, t2, t3, …, tm}

Example instance of the schema STUDENT, with the domains of two of its attributes:

dom( Name) = character string, max length 100.
dom( GPA) = [0.0, 12.0]

Name            SID       Age  GPA
CHAN Kin Ho     99001122  22   11.19
LAM Wai Kin     99012233  21   10.2
MAN Ko Yee      99234567  21   7.5
LEE Chi Cheung  99888899  21   8.9
Alvin LAM       99000011  22   9.8

Each column is one attribute; each row is one tuple. This instance of schema STUDENT has 5 tuples;
for example, t3 = <MAN Ko Yee, 99234567, 21, 7.5>.


NOTES:
1. The tuples in an instance of a relation are not considered to be ordered ⇒ putting the rows in a
different sequence does not change the table.
2. Once the schema, R( A1, A2, A3, …, An) is defined, the values, vi, in each tuple, t, must be
ordered as t = <v1, v2, v3, …, vn>
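
The two notes above can be illustrated with a minimal Python sketch. It models a relation instance as a set of named tuples, so the order of the rows is irrelevant while the order of the values inside each tuple is fixed by the schema. The checks for dom( SID) and dom( Age) are assumptions made for illustration; only dom( Name) and dom( GPA) are given in these notes.

```python
from collections import namedtuple

# Schema STUDENT( Name, SID, Age, GPA): the attribute order is fixed by the schema.
Student = namedtuple("Student", ["Name", "SID", "Age", "GPA"])

# One membership test per attribute domain.
# The SID and Age tests are hypothetical; the notes only define dom(Name) and dom(GPA).
DOMAINS = {
    "Name": lambda v: isinstance(v, str) and len(v) <= 100,
    "SID": lambda v: isinstance(v, str),
    "Age": lambda v: isinstance(v, int),
    "GPA": lambda v: isinstance(v, float) and 0.0 <= v <= 12.0,
}

def in_domain(t):
    """Return True if every value of tuple t lies in the domain of its attribute."""
    return all(check(getattr(t, attr)) for attr, check in DOMAINS.items())

rows = [
    Student("CHAN Kin Ho", "99001122", 22, 11.19),
    Student("LAM Wai Kin", "99012233", 21, 10.2),
    Student("MAN Ko Yee", "99234567", 21, 7.5),
    Student("LEE Chi Cheung", "99888899", 21, 8.9),
    Student("Alvin LAM", "99000011", 22, 9.8),
]

r1 = set(rows)            # a relation instance is a SET of tuples
r2 = set(reversed(rows))  # same tuples, entered in a different sequence
degree = len(Student._fields)

print(r1 == r2, degree, all(in_domain(t) for t in r1))  # prints: True 4 True
```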

Integrity Constraints

Each relational schema must satisfy the following four types of constraints.

A. Domain constraints
The value of each attribute Ai must be an atomic value from dom( Ai).
The attribute Name in the example above is a BAD DESIGN, because it is not atomic: sometimes we may
want to search for a person using only their last name.

B. Key Constraints

Superkey of R: A set of attributes, SK, of R such that no two tuples in any valid relational instance,
r( R), will have the same value for SK. Therefore, for any two distinct tuples, t1 and t2 in r( R),
t1[ SK] != t2[SK].

Key of R: A minimal superkey. That is, a superkey, K, of R such that the removal of ANY attribute
from K will result in a set of attributes that are not a superkey.

Example CAR( State, LicensePlateNo, VehicleID, Model, Year, Manufacturer)

This schema has two keys:


K1 = { State, LicensePlateNo}
K2 = { VehicleID }

Both K1 and K2 are superkeys.


K3 = { VehicleID, Manufacturer} is a superkey, but not a key (Why?).
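
The definitions of superkey and key can be tried out with a short Python sketch. Note the limitation: a key is a property of the schema (it must hold for every valid instance), whereas the code below can only test one instance. The rows and the function names are hypothetical, chosen for illustration.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """True if no two rows of this instance agree on all attributes in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def is_key(rows, attrs):
    """True if attrs is a superkey and removing any one attribute breaks that."""
    if not is_superkey(rows, attrs):
        return False
    # Every superset of a superkey is a superkey, so it is enough to test
    # the subsets obtained by dropping exactly one attribute.
    smaller = combinations(attrs, len(attrs) - 1)
    return not any(is_superkey(rows, list(sub)) for sub in smaller)

# Hypothetical instance of CAR (values invented for illustration).
cars = [
    {"State": "TX", "LicensePlateNo": "ABC-123", "VehicleID": "V1",
     "Model": "Civic", "Year": 2004, "Manufacturer": "Honda"},
    {"State": "CA", "LicensePlateNo": "ABC-123", "VehicleID": "V2",
     "Model": "Civic", "Year": 2004, "Manufacturer": "Honda"},
    {"State": "TX", "LicensePlateNo": "XYZ-789", "VehicleID": "V3",
     "Model": "Focus", "Year": 2003, "Manufacturer": "Ford"},
]

k1 = ["State", "LicensePlateNo"]
k2 = ["VehicleID"]
k3 = ["VehicleID", "Manufacturer"]
print(is_key(cars, k1), is_key(cars, k2), is_superkey(cars, k3), is_key(cars, k3))
```

In this instance K3 is a superkey but not a key, because dropping Manufacturer leaves { VehicleID}, which is still a superkey.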

If a relation has more than one key, we can select any one (arbitrarily) to be the primary key.
Primary Key attributes are underlined in the schema:

CAR( State, LicensePlateNo, VehicleID, Model, Year, Manufacturer)


C. Entity Integrity Constraints

The primary key attribute, PK, of any relational schema R in a database cannot have null values in
any tuple. In other words, for each table in a DB, there must be a key; for each key, every row in the
table must have non-null values. This is because PK is used to identify the individual tuples.
Mathematically, t[PK] != NULL for any tuple t ∈ r( R).

D. Referential Integrity Constraints

Referential integrity constraints are used to specify the relationships between two relations in a
database.

Consider a referencing relation, R1, and a referenced relation, R2. Tuples in the referencing relation,
R1, have attributes FK (called foreign key attributes) that reference the primary key attributes of the
referenced relation, R2. A tuple, t1, in R1 is said to reference a tuple, t2, in R2 if t1[FK] = t2[PK].

A referential integrity constraint can be displayed in a relational database schema as a directed arc
from the referencing (foreign) key to the referenced (primary) key. Examples are shown in the figure
below:

EMPLOYEE ENo Name Address DeptNo SupENo

DEPT Dno DName Locn MgrENo
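
The entity and referential integrity constraints can be checked on concrete instances with a sketch like the one below. It assumes, as the attribute names suggest, that DeptNo references DEPT.Dno and that SupENo and MgrENo reference EMPLOYEE.ENo. It also assumes that a foreign key may be null (an employee without a supervisor references nobody). The data and the function names are hypothetical.

```python
def entity_integrity_violations(rows, pk):
    """Rows in which some primary key attribute is null (None)."""
    return [row for row in rows if any(row[a] is None for a in pk)]

def referential_integrity_violations(referencing, fk, referenced, pk):
    """Rows of the referencing relation whose FK value matches no PK value."""
    pk_values = {tuple(row[a] for a in pk) for row in referenced}
    bad = []
    for row in referencing:
        value = tuple(row[a] for a in fk)
        if any(v is None for v in value):
            continue  # assumption: a null foreign key references nothing
        if value not in pk_values:
            bad.append(row)
    return bad

# Hypothetical instances (values invented for illustration).
dept = [
    {"Dno": 1, "DName": "Research", "Locn": "Kowloon", "MgrENo": 10},
    {"Dno": 2, "DName": "Admin", "Locn": "Central", "MgrENo": 11},
]
employee = [
    {"ENo": 10, "Name": "A", "Address": "x", "DeptNo": 1, "SupENo": None},
    {"ENo": 11, "Name": "B", "Address": "y", "DeptNo": 2, "SupENo": 10},
    {"ENo": 12, "Name": "C", "Address": "z", "DeptNo": 9, "SupENo": 10},  # no Dept 9
]

bad_pk = entity_integrity_violations(employee, ["ENo"])
bad_dept_fk = referential_integrity_violations(employee, ["DeptNo"], dept, ["Dno"])
bad_sup_fk = referential_integrity_violations(employee, ["SupENo"], employee, ["ENo"])
bad_mgr_fk = referential_integrity_violations(dept, ["MgrENo"], employee, ["ENo"])
print(len(bad_pk), [r["ENo"] for r in bad_dept_fk], len(bad_sup_fk), len(bad_mgr_fk))
```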


ER-to-Relational Mapping

Now we are ready to lay down some informal methods to help us create the Relational schemas from
our ER models. These will be described in the following steps:

1. For each regular entity, E, in the ER model, create a relation R that includes all the simple
attributes of E. Select the primary key for E, and mark it.

2. For each weak entity type, W, in the ER model, with the Owner entity type, E, create a relation R
with all attributes of W as attributes of R, plus the primary key of E. [Note: if there are identical
tuples in W which share the same owner tuple, then we need to create an additional index
attribute in R.]

3. For each binary relationship type, R, in the ER model, identify the participating entity types, S and T.

• For a 1:1 relationship between S and T
Choose one relation, say S. Include the primary key of T as a foreign key of S.

• For a 1:N relationship between S and T
Let S be the entity on the N side of the relationship. Include the primary key of T as a foreign
key in S.

• For an M:N relationship between S and T
Create a new relation, P, to represent R. Include the primary keys of both S and T as foreign
keys of P.

4. For each multi-valued attribute A, create a new relation, R, that includes all attributes
corresponding to A, plus the primary key attribute, K, of the relation that represents the entity
type/relationship type that has A as an attribute.

5. For each n-ary relationship type, n > 2, create a new relation S. Include as foreign key attributes
in S the primary keys of the relations representing each of the participating entity types. Also
include any simple attributes of the n-ary relationship type as attributes of S.
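
Step 3 of the mapping procedure can be sketched in Python as follows. The dictionary layout for a relation schema, the function name and the example entity types are all hypothetical. The sketch also assumes that attribute names do not clash, and it takes the combination of both foreign keys as the primary key of the new relation P, which is a common choice that the steps above do not state.

```python
import copy

def map_binary_relationship(name, s, t, cardinality):
    """Map a binary relationship type between S and T to relation schemas.

    Each schema is a dict with keys "name", "attrs", "pk" and "fks".
    For "1:N", s must be the entity on the N side.
    """
    s, t = copy.deepcopy(s), copy.deepcopy(t)  # leave the inputs untouched
    if cardinality in ("1:1", "1:N"):
        # Include the primary key of T as a foreign key of S.
        s["attrs"] += t["pk"]
        s["fks"].append((tuple(t["pk"]), t["name"]))
        return [s, t]
    if cardinality == "M:N":
        # Create a new relation P holding the primary keys of both S and T.
        p = {
            "name": name,
            "attrs": s["pk"] + t["pk"],
            "pk": s["pk"] + t["pk"],  # assumption: combined FKs form the PK
            "fks": [(tuple(s["pk"]), s["name"]), (tuple(t["pk"]), t["name"])],
        }
        return [s, t, p]
    raise ValueError("cardinality must be '1:1', '1:N' or 'M:N'")

# Hypothetical entity types.
employee = {"name": "EMPLOYEE", "attrs": ["ENo", "Name"], "pk": ["ENo"], "fks": []}
dept = {"name": "DEPT", "attrs": ["DNo", "DName"], "pk": ["DNo"], "fks": []}
project = {"name": "PROJECT", "attrs": ["PNo", "PName"], "pk": ["PNo"], "fks": []}

works_for = map_binary_relationship("WORKS_FOR", employee, dept, "1:N")
works_on = map_binary_relationship("WORKS_ON", employee, project, "M:N")
for schema in works_for + works_on[2:]:
    print(schema["name"], schema["attrs"], schema["fks"])
```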

Formal Relational Database Design

We have seen the informal procedure to map the ER model into a logical model (relational database
schema). How do we know that this is a good schema?

We shall now study the formal theory which can help us to understand (a) what is a good relational
database design, (b) why it is good, and (c) how do we design good relational database schemas.

To do so, we shall learn the concepts of Functional dependencies and Normal forms.

We shall use our EMPLOYEE-DEPARTMENT-PROJECTS example. To give us some motivation,
let us first see what kinds of problems can arise if we fail to design the database properly.

1. Redundant Information in Tuples and Update Anomalies

Consider a (poorly) designed database where the Employee and Department information is stored in
one table, and the Employee and Project information in another table:

EMPLOYEE_DEPT
SSN LName Address BDate DNumber DName MgrSSN

EMPLOYEE_PROJ

SSN PNumber Hours LName PName PLocation

Problems:

(a) Information is stored redundantly -- repeated information ⇒ wasted storage.
For example, if 5 employees work for Department number 4, then the department name and
manager's SSN for Department 4 are stored 5 times in the table.

(b) Insertion anomalies. When we enter the record for a new employee, we must specify ALL data
fields for his department correctly. For example, if a new employee joins Dept 5, then we must ensure
that the data entered for Dept 5 in the new record is consistent with the data for Dept 5 in all earlier
records of other employees of Dept 5.

(c) Deletion Anomalies. If a dept has one employee working in it, and we delete the information of
this employee, then the information of the department is also lost. We may not want this to happen.

(d) Modification Anomalies. If we modify a value, we must make the entire table consistent very
carefully. For example, if an employee changes departments from Dept 5 to Dept 4, then his entire
record must be changed, not just his DNumber field. If a department changes its manager, the entire
table must be scanned and modified.

In other bad DB designs, several other problems can occur. The most important ones include the
following:

2. Null values in Tuples


Consider a DB design: STUDENT( SID, Name, Phone, Email, SocietyName, MembershipNo)
which is used to store student information. For all students who did not join any society, the last two
attributes will contain NULL values. If there are many such students, then this will be considered as a
poor DB design, and it would be better to store information in two relational schemas: STUDENT(
SID, Name, Phone, Email), MEMBERSHIP( SID, SocietyName, MembershipNo).
In general, we should design tables which have the fewest NULL value entries. If an attribute has
NULL value very frequently in a table, it should be placed in a separate table (along with the primary
key).

3. Spurious (false) Tuples


To avoid anomalies and null values in tuples, most DBs must be designed to store data by using
several tables. However, at any given time, we may need some information that is partly contained in
two or more different tables. When we combine the information contained in two separate tables, in
other words, when we “JOIN” the two tables, some care must be taken, otherwise we may get some
misleading (or FALSE) information.
I will illustrate the problem of spurious tuples with the example of a poorly designed Projects-Parts-
Suppliers database:

PROJECT_PARTS
ProjectNo  PartNo
Proj1      P1
Proj2      P1
Proj2      P2

SUPPLIER_PARTS
SupplierNo  PartNo  Qty
S1          P1      10
S2          P2      25
S2          P1      20

Suppose we want to know: Which suppliers supplied the parts for Proj2? Obviously this
information can only be obtained by somehow combining the information in the two tables (why ?).
The common link between the two tables is the PartNo (assume that PartNo uniquely identifies the
part). From PROJECT_PARTS, we can easily see that Proj2 uses P1 and P2. P2 is supplied by S2
(from tuple-2 of SUPPLIER_PARTS). However, which supplier supplied P1 to Proj2? If we match
the PartNo in tuple-2 of PROJECT_PARTS with PartNo in SUPPLIER_PARTS, then we get two
matches: tuple-1 ⇒ S1, and tuple-3 ⇒ S2. However, from this, we cannot conclude that both S1 and
S2 supplied part P1 to Proj2 (why ?).

In the above example, we say that JOINING the information of the two tables created some false
information; in other words, if we use the above DB design to store data, some information that we
know is actually getting lost. JOINs that lose information in this way are called LOSSY JOINS.

A good database design must guarantee LOSSLESS JOIN operations on its tables.
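
The Projects-Parts-Suppliers example can be reproduced with a minimal Python sketch of a natural join (match tuples on all attributes that the two tables share). The function name is an illustrative choice; the data is the example given earlier in this section.

```python
def natural_join(r, s):
    """Combine every pair of tuples that agree on all common attributes."""
    result = []
    for a in r:
        for b in s:
            common = set(a) & set(b)
            if all(a[c] == b[c] for c in common):
                result.append({**a, **b})
    return result

project_parts = [
    {"ProjectNo": "Proj1", "PartNo": "P1"},
    {"ProjectNo": "Proj2", "PartNo": "P1"},
    {"ProjectNo": "Proj2", "PartNo": "P2"},
]
supplier_parts = [
    {"SupplierNo": "S1", "PartNo": "P1", "Qty": 10},
    {"SupplierNo": "S2", "PartNo": "P2", "Qty": 25},
    {"SupplierNo": "S2", "PartNo": "P1", "Qty": 20},
]

joined = natural_join(project_parts, supplier_parts)
proj2 = sorted((t["PartNo"], t["SupplierNo"])
               for t in joined if t["ProjectNo"] == "Proj2")
print(len(joined), proj2)
# prints: 5 [('P1', 'S1'), ('P1', 'S2'), ('P2', 'S2')]
```

The join pairs Proj2 with both S1 and S2 for part P1, although the two tables never recorded which supplier delivered P1 to which project. At least some of the 5 joined tuples may therefore be spurious, and the design gives no way to tell which ones.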

The above examples are obviously simple; however, a real DB may have tens of tables, with many
attributes in each. Some tables may have been created as an extension of a previously existing DB. How
can we guarantee that the DB will not have any possibility of such errors?

There is a logical process that allows us to design good databases: it is called normalization. We now
study normalization.

Functional Dependencies

Normalization depends on identifying logical relationships between data, called Functional
dependencies (FDs), and Keys. FDs are constraints between the attributes of relations in the universe
of discourse.

Definition:

A set of attributes, X, functionally determines a set of attributes, Y, if the value of X determines a
unique value for Y.

This is written as: X → Y.

The implication of the statement X → Y is that, for any two tuples, t1 and t2,
if t1[ X] = t2[ X], then t1[ Y] = t2[ Y].

Examples of FD's:

{SSN} → {Employee name}

{Employee SSN, Project Number} → {Hours per week}

If K is a key of R, then K functionally determines all attributes of R.
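
The tuple-based definition of an FD translates directly into a check on a relation instance. The sketch below is illustrative only: an FD is a constraint on every valid instance, so a single instance can refute an FD but can never prove it. The rows and the function name are hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """True if any two rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, and returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical instance (values invented for illustration).
emp_proj = [
    {"SSN": "1123", "PNumber": "P1", "Hours": 10, "EName": "Smith"},
    {"SSN": "1123", "PNumber": "P2", "Hours": 5, "EName": "Smith"},
    {"SSN": "3312", "PNumber": "P2", "Hours": 10, "EName": "Doe"},
]

print(fd_holds(emp_proj, ["SSN"], ["EName"]))             # SSN -> EName
print(fd_holds(emp_proj, ["SSN", "PNumber"], ["Hours"]))  # {SSN, PNumber} -> Hours
print(fd_holds(emp_proj, ["SSN"], ["Hours"]))             # refuted by rows 1 and 2
```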

Inference Rules for FD's


Given a set of FDs, we can infer other FDs that will also be true. Such inference requires proof, based
on the use of the above definition of FD, and some logical reasoning.

Armstrong's Inference Rules:

A1. (Reflexive). If Y ⊆ X, then X → Y

A2. (Augmentation). If X → Y, then XZ → YZ
(Note: XZ denotes X ∪ Z, the union of the sets of attributes X and Z).

A3. (Transitive). If X → Y and Y → Z, then X → Z

Using A1, A2 and A3, we can also prove some additional useful inference rules:

A4. (Decomposition). If X → YZ, then X → Y and X → Z

A5. (Union). If X → Y and X → Z, then X → YZ

A6. (Pseudotransitive). If X → Y and WY → Z, then WX → Z

Equivalence of sets of FDs: Two sets of FDs, F and G, are said to be equivalent if every FD in F can
be inferred from G, and every FD in G can be inferred from F.
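
In practice, inference is usually mechanized with the attribute closure X+, the set of all attributes determined by X under a given set of FDs. The closure algorithm is a standard technique that these notes do not introduce; it is used here only as a sketch of how "can be inferred from" and equivalence could be tested. X → Y can be inferred from F exactly when Y ⊆ X+. The function names are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure: all attributes determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be inferred from fds."""
    return set(rhs) <= closure(lhs, fds)

def equivalent(f, g):
    """True if every FD of f follows from g, and every FD of g follows from f."""
    return (all(implies(g, lhs, rhs) for lhs, rhs in f)
            and all(implies(f, lhs, rhs) for lhs, rhs in g))

F = [({"SSN"}, {"DNumber"}), ({"DNumber"}, {"MgrSSN"})]
G = F + [({"SSN"}, {"MgrSSN"})]   # adds an FD that follows by transitivity (A3)
H = [({"SSN"}, {"DNumber"})]      # drops DNumber -> MgrSSN

print(sorted(closure({"SSN"}, F)))      # prints: ['DNumber', 'MgrSSN', 'SSN']
print(implies(F, {"SSN"}, {"MgrSSN"}))  # prints: True
print(equivalent(F, G), equivalent(F, H))  # prints: True False
```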

Minimal set of FDs: A set of FDs, F, is minimal if it satisfies the following conditions.
(a) Every dependency in F has a single attribute for its RHS (right hand side).
(b) We cannot remove any dependency from F, and have a set of dependencies equivalent to F.
(c) We cannot replace any dependency, X → A, in F with a dependency, Y → A, where Y ⊂ X, and
still have a set of dependencies equivalent to F.

In general, we know the following to be true: (1) Every set of FDs has an equivalent minimal set; and
(2) There can be several equivalent minimal sets.
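
One common way to compute an equivalent minimal set is sketched below: split the right hand sides to satisfy (a), shrink the left hand sides to satisfy (c), then drop redundant dependencies to satisfy (b). This procedure is a standard one and is not taken from these notes; the function names are illustrative. Because several minimal sets can exist, the result depends on the order in which attributes and FDs are examined.

```python
def closure(attrs, fds):
    """Attribute closure under fds, a list of (lhs_set, single_rhs_attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, a in fds:
            if lhs <= result and a not in result:
                result.add(a)
                changed = True
    return result

def minimal_cover(fds):
    """Return one minimal set equivalent to fds, a list of (lhs_set, rhs_set)."""
    # (a) One attribute on every right hand side.
    g = []
    for lhs, rhs in fds:
        for a in sorted(rhs):
            fd = (frozenset(lhs), a)
            if fd not in g:
                g.append(fd)
    # (c) Drop attributes from a left hand side while the FD still follows.
    for i in range(len(g)):
        lhs, a = g[i]
        for b in sorted(lhs):
            smaller = lhs - {b}
            if smaller and a in closure(smaller, g):
                lhs = smaller
                g[i] = (lhs, a)
    # (b) Drop every FD that follows from the remaining ones.
    i = 0
    while i < len(g):
        lhs, a = g[i]
        rest = g[:i] + g[i + 1:]
        if a in closure(lhs, rest):
            g = rest
        else:
            i += 1
    return g

F = [
    ({"SSN"}, {"DNumber", "MgrSSN"}),
    ({"DNumber"}, {"MgrSSN"}),
    ({"SSN", "DNumber"}, {"MgrSSN"}),
]
cover = minimal_cover(F)
for lhs, a in cover:
    print(sorted(lhs), "->", a)
```

For this F the sketch keeps only SSN → DNumber and DNumber → MgrSSN; the other dependencies follow from these two by transitivity and augmentation.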

The First Normal Form (1NF)


A relational schema is in 1NF if it does not contain any composite attributes, multi-valued attributes,
or nested relations.

It is possible to convert any schema that is not in 1NF into one or more schemas that are in 1NF:

STUDENT_COURSES (not in 1NF: Name is composite, Courses is multi-valued)

SID   Name        SemYr    Courses
0401  John Smith  Fall 05  ie110, ie215
0402  Jane Doe    Fall 05  ie110, ie317

STUDENT_COURSES_1NF (in 1NF)

SID   Lname  Fname  Sem   Yr  Course
0401  Smith  John   Fall  05  ie110
0401  Smith  John   Fall  05  ie215
0402  Doe    Jane   Fall  05  ie110
0402  Doe    Jane   Fall  05  ie317

EMPLOYEE_PROJECTS (not in 1NF: Projects is a composite attribute that repeats for each employee)

                     Projects
SSN   Lname  Fname   ProjNo  Hours
1123  Smith  John    P1      10
                     P2      5
3312  Doe    Jane    P2      10
                     P3      5

EMPLOYEE (in 1NF)

SSN   Lname  Fname
1123  Smith  John
3312  Doe    Jane

EMP_PROJECTS (in 1NF)

SSN   ProjNo  Hours
1123  P1      10
1123  P2      5
3312  P2      10
3312  P3      5
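
The conversion of STUDENT_COURSES to 1NF can be sketched in Python: split each composite value into atomic parts, and emit one row per value of the multi-valued attribute. The sketch assumes that Name is always "Fname Lname" and SemYr is always "Sem Yr", which holds for the example rows but is far too simple for real names. The function name is hypothetical.

```python
def to_1nf(rows):
    """Flatten STUDENT_COURSES rows into STUDENT_COURSES_1NF rows."""
    flat = []
    for row in rows:
        # Assumption: composite values have exactly two space-separated parts.
        fname, lname = row["Name"].split(" ", 1)
        sem, yr = row["SemYr"].split(" ", 1)
        # One output row per value of the multi-valued attribute Courses.
        for course in row["Courses"]:
            flat.append({"SID": row["SID"], "Lname": lname, "Fname": fname,
                         "Sem": sem, "Yr": yr, "Course": course})
    return flat

student_courses = [
    {"SID": "0401", "Name": "John Smith", "SemYr": "Fall 05",
     "Courses": ["ie110", "ie215"]},
    {"SID": "0402", "Name": "Jane Doe", "SemYr": "Fall 05",
     "Courses": ["ie110", "ie317"]},
]

student_courses_1nf = to_1nf(student_courses)
for row in student_courses_1nf:
    print(row)
```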

The Second Normal Form (2NF)

2NF uses the concepts of FDs and the primary key.

Definitions:

Prime Attribute: An attribute that is a member of the primary key.

Full functional Dependency: An FD, Y → Z, where the removal of ANY attribute from Y means that
the FD will not hold true any more.

Examples:

{SSN, PNumber} → Hours is a Full FD, since neither {SSN} → Hours, nor {PNumber} → Hours is true.

{SSN, PNumber} → EName is NOT a Full FD, since {SSN} → EName is also true.

Second Normal Form: A relational schema, R, is in 2NF if every non-prime attribute A in R is fully
functionally dependent on the primary key.

Consider a DB storing information about employees, projects, and which projects each employee works
on. The schema EMP_PROJ below attempts to store all this data in one table; the FDs that hold in the
schema are listed under it.

You can easily check that a table constructed with this schema will result in many anomalies. We can
use the definition of 2NF above to break this schema into a set of three schemas which are in 2NF.

EMP_PROJ
SSN PNumber Hours EName PName PLocation

FD1: { SSN, PNumber} → Hours
FD2: SSN → EName
FD3: PNumber → { PName, PLocation}

EMP_PROJ1
SSN PNumber Hours

EMP_PROJ2
SSN EName

EMP_PROJ3
PNumber PName PLocation
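
The 2NF test can be sketched as a search for partial dependencies: a non-prime attribute that is already determined by a proper subset of the primary key. The sketch assumes the FDs implied by the decomposition above ({SSN, PNumber} → Hours, SSN → EName, PNumber → {PName, PLocation}) and uses the attribute closure, a standard technique that these notes do not introduce. The function names are illustrative.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def partial_dependencies(attributes, key, fds):
    """Pairs (proper subset of key, non-prime attribute it determines)."""
    non_prime = [a for a in attributes if a not in key]
    found = []
    for size in range(1, len(key)):            # proper, non-empty subsets only
        for subset in combinations(key, size):
            determined = closure(subset, fds)
            for a in non_prime:
                if a in determined:
                    found.append((subset, a))
    return found

fds = [
    ({"SSN", "PNumber"}, {"Hours"}),
    ({"SSN"}, {"EName"}),
    ({"PNumber"}, {"PName", "PLocation"}),
]

emp_proj = ["SSN", "PNumber", "Hours", "EName", "PName", "PLocation"]
violations = partial_dependencies(emp_proj, ("SSN", "PNumber"), fds)
print(violations)

# The decomposed schemas have no partial dependencies left.
ok1 = partial_dependencies(["SSN", "PNumber", "Hours"], ("SSN", "PNumber"), fds[:1])
ok2 = partial_dependencies(["SSN", "EName"], ("SSN",), fds[1:2])
ok3 = partial_dependencies(["PNumber", "PName", "PLocation"], ("PNumber",), fds[2:])
print(ok1, ok2, ok3)
```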

The Third Normal Form ( 3NF)


The third normal form ensures that a schema design does not contain any transitive FD’s.

Transitive Functional Dependency: an FD, Y → Z, that can be derived from two FDs, Y → X and
X → Z.

Examples:

SSN → MgrSSN is a transitive dependency, because SSN → DNumber, and DNumber → MgrSSN.

SSN → LName is NOT a transitive dependency, since there is no attribute X, such that SSN → X,
and X → LName.

Third Normal Form: a relational schema is in 3NF if it is in 2NF, and no non-prime attribute A in R
is transitively dependent on the primary key.

Below is an example of a schema that is not in 3NF, and how to decompose it into a pair of 3NF
relational schemas.

EMP_DEPT
SSN EName Address DNo DName MgrSSN

EMP_DEPT1
SSN EName Address DNo

EMP_DEPT2
DNo DName MgrSSN
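
A quick way to gain confidence in this decomposition is to project a sample EMP_DEPT instance onto the two new schemas and join the projections again. For the instance below the join returns exactly the original tuples, with none lost and none spurious. This is a check on one hypothetical instance, not a proof of the lossless join property; the rows and the function names are invented for illustration.

```python
def project(rows, attrs):
    """Projection onto attrs, with duplicate tuples removed."""
    result = []
    for row in rows:
        t = {a: row[a] for a in attrs}
        if t not in result:
            result.append(t)
    return result

def natural_join(r, s):
    """Combine every pair of tuples that agree on all common attributes."""
    result = []
    for a in r:
        for b in s:
            common = set(a) & set(b)
            if all(a[c] == b[c] for c in common):
                result.append({**a, **b})
    return result

def same_relation(r, s):
    """Compare two instances as sets of tuples, ignoring row order."""
    def as_set(rows):
        return {frozenset(row.items()) for row in rows}
    return as_set(r) == as_set(s)

# Hypothetical instance of EMP_DEPT.
emp_dept = [
    {"SSN": "1123", "EName": "Smith", "Address": "a1", "DNo": 5,
     "DName": "Research", "MgrSSN": "3312"},
    {"SSN": "3312", "EName": "Doe", "Address": "a2", "DNo": 5,
     "DName": "Research", "MgrSSN": "3312"},
    {"SSN": "4455", "EName": "Lee", "Address": "a3", "DNo": 4,
     "DName": "Admin", "MgrSSN": "4455"},
]

emp_dept1 = project(emp_dept, ["SSN", "EName", "Address", "DNo"])
emp_dept2 = project(emp_dept, ["DNo", "DName", "MgrSSN"])
rejoined = natural_join(emp_dept1, emp_dept2)

print(len(emp_dept1), len(emp_dept2), same_relation(rejoined, emp_dept))
# prints: 3 2 True
```

The department data for Dept 5 is now stored once in EMP_DEPT2 instead of once per employee, which removes the redundancy behind the update anomalies discussed earlier.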

For most applications, achieving 3NF as defined above will provide an acceptable logical design.
However, note that the entire process above is dependent on the (arbitrary) selection of primary keys
from all possible candidate keys. There is a broader definition of 3NF which considers the existence
of multiple candidate keys.

General Normal Form Definitions


The general 1NF definition is identical to the one before.

A relational schema is in 2NF if it is in 1NF, and if every non-prime attribute A in R is fully
functionally dependent on every key of R.

The general 3NF definition requires our earlier definition of superkeys ( a superkey of R is a set of
attributes which contain a key of R).

A relational schema R is in 3NF if, whenever an FD X → A holds in R, then
either X is a superkey of R, or A is a prime attribute of R.

Example:
Consider a simplified database for Government records on land sales in different districts.
Each lot of land has a unique ID number for the territory. Within each district, each lot is also
assigned a lot#, which is unique in the district.
Further, the tax rate to be paid is determined by the district.
Finally, the government controls the cost of land to be uniform across the territory, so that the price
can be determined by the area of the lot.

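LOTS (1NF)
PropertyID District Lot# Area Price TaxRate

FD1: PropertyID → { District, Lot#, Area, Price, TaxRate}
FD2: { District, Lot#} → { PropertyID, Area, Price, TaxRate}
FD3: District → TaxRate
FD4: Area → Price

LOTS1 (2NF)
PropertyID District Lot# Area Price
FDs: FD1, FD2, FD4

LOTS2 (2NF)
District TaxRate
FDs: FD3

LOTS1A (3NF)
PropertyID District Lot# Area
FDs: FD1, FD2

LOTS1B (3NF)
Area Price
FDs: FD4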

Notice: there are two candidate keys: { PropertyID} and { District, Lot#}.

2NF:
TaxRate is partially dependent upon the candidate key { District, Lot#}, due to FD3.
Hence we break LOTS into two schemas, LOTS1 and LOTS2, to achieve 2NF, as shown.

3NF:
LOTS2 is already in 3NF according to the general definition. However, LOTS1 violates the general
definition of 3NF because of FD4, Area → Price: (a) Area is not a superkey of LOTS1, AND
(b) Price is not a prime attribute of LOTS1.

Therefore we further break LOTS1 into LOTS1A and LOTS1B to achieve 3NF.
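
The general 3NF test can be sketched in Python for the LOTS example. The sketch finds the candidate keys by brute force (reasonable only for small schemas), collects the prime attributes, and then reports every listed FD X → A where X is not a superkey and A is not prime. It relies on the attribute closure, a standard technique that these notes do not introduce. The FDs used are the ones described in the example: PropertyID and { District, Lot#} each determine all other attributes of LOTS1, and Area → Price. The function names are illustrative.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            s = set(subset)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(s, fds) == set(attributes):
                keys.append(s)
    return keys

def violations_3nf(attributes, fds):
    """Listed FDs X -> A where X is not a superkey and A is not prime."""
    prime = set().union(*candidate_keys(attributes, fds))
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) == set(attributes):
            continue  # X is a superkey
        for a in sorted(set(rhs) - set(lhs)):
            if a not in prime:
                bad.append((set(lhs), a))
    return bad

lots1 = ["PropertyID", "District", "Lot#", "Area", "Price"]
fd1 = ({"PropertyID"}, {"District", "Lot#", "Area", "Price"})
fd2 = ({"District", "Lot#"}, {"PropertyID", "Area", "Price"})
fd4 = ({"Area"}, {"Price"})

keys = candidate_keys(lots1, [fd1, fd2, fd4])
bad_lots1 = violations_3nf(lots1, [fd1, fd2, fd4])
print(keys, bad_lots1)

# After the decomposition, neither schema has a violating FD.
lots1a = ["PropertyID", "District", "Lot#", "Area"]
fd1a = ({"PropertyID"}, {"District", "Lot#", "Area"})
fd2a = ({"District", "Lot#"}, {"PropertyID", "Area"})
bad_lots1a = violations_3nf(lots1a, [fd1a, fd2a])
bad_lots1b = violations_3nf(["Area", "Price"], [fd4])
print(bad_lots1a, bad_lots1b)
```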

Acknowledgements: The notes have been based (with minor modifications) on lecture notes of Prof Navathe, as
distributed by Dr. Kamal Karlapalem for COMP231, HKUST.
