Earlier we saw how to convert an unorganized text description of information requirements into a
conceptual design, by the use of ER diagrams. The advantage of ER diagrams is that they force you to
identify data requirements that are implicitly known, but not explicitly written down in the original
description. Here we will see how to convert this ER model into a logical design (defined below)
of a relational database. The logical model is also called a Relational Model.
We shall represent a relation as a table with columns and rows. Each column of the table has a name,
or attribute. Each row is called a tuple.
Attribute: name of a column in a particular table (all data is stored in tables). Each attribute Ai must
have a domain, dom(Ai).
Relational Schema: The design of one table, containing the name of the table (i.e. the name of the
relation), and the names of all the columns, or attributes.
Example: STUDENT( Name, SID, Age, GPA)
Tuple, t, of R( A1, A2, A3, …, An): an ORDERED set of values, < v1, v2, v3, …, vn>, where each
vi is a value from dom( Ai).
Relation Instance, r( R): a set of tuples; thus, r( R) = { t1, t2, t3, …, tm}
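The definitions above can be sketched in a few lines of Python. This is only an illustration: the domains chosen for STUDENT and the sample rows are assumptions, since the notes do not specify them.

```python
# Hypothetical domains for STUDENT(Name, SID, Age, GPA); not given in the notes.
SCHEMA = ("Name", "SID", "Age", "GPA")
DOMAINS = {
    "Name": lambda v: isinstance(v, str),
    "SID": lambda v: isinstance(v, int) and v > 0,
    "Age": lambda v: isinstance(v, int) and 0 < v < 150,
    "GPA": lambda v: isinstance(v, float) and 0.0 <= v <= 4.0,
}


def is_valid_tuple(t):
    """True if t has one value per attribute and each value lies in dom(Ai)."""
    return len(t) == len(SCHEMA) and all(DOMAINS[a](v) for a, v in zip(SCHEMA, t))


# A relation instance r(R) is a SET of tuples, so a repeated tuple is stored once.
r = {
    ("Alice Wong", 1001, 20, 3.5),
    ("Bob Chan", 1002, 22, 3.1),
    ("Alice Wong", 1001, 20, 3.5),  # duplicate of the first tuple
}
```

Here r contains two tuples, not three, and a tuple such as ("X", 1, 20, 5.0) is rejected because 5.0 lies outside the assumed GPA domain.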
Integrity Constraints
Each relational schema must satisfy the following four types of constraints.
A. Domain constraints
The value of each attribute Ai must be an atomic value from dom( Ai) for that attribute.
The attribute Name in the example is a BAD DESIGN because it is not atomic: sometimes we may want to
search for a person using only their last name.
B. Key Constraints
Superkey of R: A set of attributes, SK, of R such that no two tuples in any valid relational instance,
r( R), will have the same value for SK. Therefore, for any two distinct tuples, t1 and t2 in r( R),
t1[ SK] != t2[SK].
Key of R: A minimal superkey. That is, a superkey, K, of R such that the removal of ANY attribute
from K will result in a set of attributes that are not a superkey.
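A minimal sketch of these two definitions, assuming a relation instance is held as a list of Python dicts (the function names and sample data are hypothetical). Note that a superkey is defined over ALL valid instances, so a check against one instance can refute a candidate but can never prove it.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two rows in THIS instance agree on all attributes in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_key(rows, attrs):
    """True if attrs is a superkey and dropping any one attribute breaks that."""
    if not is_superkey(rows, attrs):
        return False
    # Checking subsets with one attribute removed is enough, because every
    # superset of a superkey is itself a superkey.
    return not any(is_superkey(rows, sub) for sub in combinations(attrs, len(attrs) - 1))


students = [
    {"Name": "Alice", "SID": 1, "Age": 20, "GPA": 3.5},
    {"Name": "Bob", "SID": 2, "Age": 20, "GPA": 3.5},
    {"Name": "Alice", "SID": 3, "Age": 21, "GPA": 3.0},
]
```

With this data, ("SID",) is a key and ("SID", "Name") is a superkey but not a key. The pair ("Name", "Age") also passes is_key on this tiny instance, which shows why keys must be decided from the meaning of the data and not from one sample.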
If a relation has more than one key, we can select any one (arbitrarily) to be the primary key.
Primary Key attributes are underlined in the schema:
C. Entity Integrity Constraints
The primary key attribute, PK, of any relational schema R in a database cannot have null values in
any tuple. In other words, every table in a DB must have a primary key, and every row in the table
must have non-null values for it. This is because PK is used to identify the individual tuples.
Mathematically, t[PK] != NULL for any tuple t ∈ r( R).
D. Referential Integrity Constraints
Referential integrity constraints are used to specify the relationships between two relations in a
database.
Consider a referencing relation, R1, and a referenced relation, R2. Tuples in the referencing relation,
R1, have attributes FK (called foreign key attributes) that reference the primary key attributes of the
referenced relation, R2. A tuple, t1, in R1 is said to reference a tuple, t2, in R2 if t1[FK] = t2[PK].
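The referencing condition t1[FK] = t2[PK] can be checked mechanically. The sketch below is an illustration with hypothetical data; it also assumes that a NULL (None) foreign key is allowed and simply references nothing, which the notes do not state.

```python
def dangling_references(referencing, fk, referenced, pk):
    """Return rows of the referencing relation whose FK matches no PK value.

    Assumption: a foreign key containing None references nothing and is allowed.
    """
    pk_values = {tuple(row[a] for a in pk) for row in referenced}
    bad = []
    for row in referencing:
        value = tuple(row[a] for a in fk)
        if any(v is None for v in value):
            continue  # NULL foreign key: no reference to check
        if value not in pk_values:
            bad.append(row)
    return bad


departments = [
    {"DNumber": 4, "DName": "Admin"},
    {"DNumber": 5, "DName": "Research"},
]
employees = [
    {"SSN": "111", "DNumber": 5},
    {"SSN": "222", "DNumber": 7},     # no department 7 exists
    {"SSN": "333", "DNumber": None},  # not assigned to any department
]

violations = dangling_references(employees, ["DNumber"], departments, ["DNumber"])
```

Only the employee with SSN "222" is reported, because department 7 does not appear in the referenced relation.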
A referential integrity constraint can be displayed in a relational database schema as a directed arc
from the referencing (foreign) key to the referenced (primary) key. Examples are shown in the figure
below:
Now we are ready to lay down some informal methods to help us create the Relational schemas from
our ER models. These will be described in the following steps:
1. For each regular entity, E, in the ER model, create a relation R that includes all the simple
attributes of E. Select the primary key for E, and mark it.
2. For each weak entity type, W, in the ER model, with the Owner entity type, E, create a relation R
with all attributes of W as attributes of R, plus the primary key of E. [Note: if there are identical
tuples in W which share the same owner tuple, then we need to create an additional index
attribute in W.]
3. For each binary relation type, R, in the ER model, identify the participating entity types, S and T.
4. For each multi-valued attribute A, create a new relation, R, that includes all attributes
corresponding to A, plus the primary key attribute, K, of the relation that represents the entity
type/relationship type that has A as an attribute.
5. For each n-ary relationship type, n > 2, create a new relation S. Include as foreign key attributes
in S the primary keys of the relations representing each of the participating entity types. Also
include any simple attributes of the n-ary relationship type as attributes of S.
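As an illustration of step 4 only, the sketch below splits a multi-valued attribute into its own relation. The DEPARTMENT entity, its Locations attribute and the function name are hypothetical examples, not taken from the notes.

```python
def split_multivalued(entities, key, multi_attr, value_name):
    """Map an entity with one multi-valued attribute to two relations.

    Returns (base, values): base keeps the single-valued attributes, and
    values holds one row per (key, value) pair of the multi-valued attribute.
    """
    base, values = [], []
    for e in entities:
        base.append({a: v for a, v in e.items() if a != multi_attr})
        for v in e[multi_attr]:
            values.append({key: e[key], value_name: v})
    return base, values


# Hypothetical entity instances with a multi-valued attribute "Locations".
departments = [
    {"DNumber": 5, "DName": "Research", "Locations": ["Bellaire", "Houston"]},
    {"DNumber": 4, "DName": "Admin", "Locations": ["Stafford"]},
]

department, dept_locations = split_multivalued(
    departments, "DNumber", "Locations", "DLocation"
)
```

The new relation dept_locations carries the primary key DNumber of the owning relation as a foreign key, and its own key is the combination of DNumber and DLocation.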
Formal Relational Database Design
We have seen the informal procedure to map the ER model into a logical model (relational database
schema). How do we know that this is a good schema?
We shall now study the formal theory which can help us to understand (a) what a good relational
database design is, (b) why it is good, and (c) how to design good relational database schemas.
To do so, we shall learn the concepts of Functional dependencies and Normal forms.
Consider a (poorly) designed database where the Employee and Department information is planned in
one table, and the Employee and Project information in another table:
EMPLOYEE_DEPT( SSN, LName, Address, BDate, DNumber, DName, MgrSSN)
EMPLOYEE_PROJ
Problems:
(c) Deletion Anomalies. If a dept has one employee working in it, and we delete the information of
this employee, then the information of the department is also lost. We may not want this to happen.
(d) Modification Anomalies. If we modify a value, we must take care to keep the entire table
consistent. For example, if an employee changes departments from Dept 5 to Dept 4, then every
department field in his record (DNumber, DName and MgrSSN) must be changed, not just his DNumber
field. If a department changes its manager, the entire table must be scanned and every row for that
department modified.
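Both anomalies can be reproduced on a small instance of EMPLOYEE_DEPT. The rows below are made-up sample data and only some of the columns are shown; this is a sketch, not part of the original example.

```python
# Hypothetical rows of EMPLOYEE_DEPT (Address and BDate omitted for brevity).
employee_dept = [
    {"SSN": "111", "LName": "Smith", "DNumber": 5, "DName": "Research", "MgrSSN": "333"},
    {"SSN": "222", "LName": "Wong", "DNumber": 5, "DName": "Research", "MgrSSN": "333"},
    {"SSN": "444", "LName": "Zelaya", "DNumber": 4, "DName": "Admin", "MgrSSN": "555"},
]


def departments(rows):
    """Department facts that can still be read off the combined table."""
    return {(r["DNumber"], r["DName"], r["MgrSSN"]) for r in rows}


# Deletion anomaly: removing the only employee of Dept 4 loses Dept 4 itself.
after_delete = [r for r in employee_dept if r["SSN"] != "444"]
lost = departments(employee_dept) - departments(after_delete)


def change_manager(rows, dnumber, new_mgr):
    """Modification anomaly: one logical change touches every row of the dept."""
    updated, touched = [], 0
    for r in rows:
        if r["DNumber"] == dnumber:
            r = {**r, "MgrSSN": new_mgr}
            touched += 1
        updated.append(r)
    return updated, touched


updated_rows, rows_touched = change_manager(employee_dept, 5, "222")
```

Deleting one employee loses the fact (4, "Admin", "555"), and changing one manager requires two row updates. With a separate department table, each of these would affect exactly one row.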
In other bad DB designs, several other problems can occur. The most important ones include the
following:
Suppose we are interested to know: Which suppliers supplied the parts for Proj2? Obviously this
information can only be obtained by somehow combining the information in the two tables (why ?).
The common link between the two tables is the PartNo (assume that PartNo uniquely identifies the
part). From PROJECT_PARTS, we can easily see that Proj2 uses P1 and P2. P2 is supplied by S2
(from tuple-2 of SUPPLIER_PARTS). However, which supplier supplied P1 to Proj2? If we match
the PartNo in tuple-2 of PROJECT_PARTS with PartNo in SUPPLIER_PARTS, then we get two
matches: tuple-1 → S1, and tuple-3 → S2. However, from this, we cannot conclude that both S1 and
S2 supplied parts P1 to Proj2 (why ?).
In the above example, we say that JOINING the information of the two tables created some false
information; in other words, if we use the above DB design to store data, some information that we
know is actually getting lost. We call JOINs that lose information in this way LOSSY JOINS.
A good database design must guarantee LOSSLESS JOIN operations on its tables.
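A lossy join can be demonstrated directly. In the sketch below the "true" supply facts are hypothetical, chosen only to be consistent with the description above (Proj2 uses P1 and P2; P1 is supplied by S1 and S2; P2 is supplied by S2).

```python
# Assumed ground truth: (supplier, part, project) facts we actually know.
supplies = {
    ("S1", "P1", "Proj1"),
    ("S2", "P2", "Proj2"),
    ("S2", "P1", "Proj2"),
}

# The two-table design stores only these projections of the facts.
supplier_parts = {(s, p) for s, p, _ in supplies}
project_parts = {(j, p) for _, p, j in supplies}

# Natural join of the two tables on PartNo.
joined = {
    (s, p, j)
    for s, p in supplier_parts
    for j, p2 in project_parts
    if p == p2
}

# Tuples produced by the join that were never true.
spurious = joined - supplies
```

The join returns five tuples although only three facts were stored. The two extra tuples, (S1, P1, Proj2) and (S2, P1, Proj1), are false information, and from the two tables alone there is no way to tell them apart from the true ones.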
The above examples are obviously simple; however, a real DB may have tens of tables, with many
attributes in each. Some tables may have been created as an extension of a previously existing DB. How
can we guarantee that the DB will not have any possibility of such errors?
There is a logical process that allows us to design good databases: it is called normalization. We now
study normalization.
Functional Dependencies
The implication of the statement X → Y is that, for any two tuples, t1 and t2,
if t1[ X] = t2[ X], then t1[ Y] = t2[ Y].
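The condition above translates directly into a check on one relation instance. This is a minimal sketch with made-up rows; as with keys, an instance can show that an FD does NOT hold, but an FD that happens to hold in one instance is not thereby proven for the schema.

```python
def fd_holds(rows, X, Y):
    """True if every pair of rows that agrees on X also agrees on Y."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        # setdefault stores y for a new x, or returns the y seen earlier.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical rows recording who works how many hours on which project.
works_on = [
    {"SSN": "111", "PNumber": 1, "Hours": 32.5, "EName": "Smith"},
    {"SSN": "111", "PNumber": 2, "Hours": 7.5, "EName": "Smith"},
    {"SSN": "222", "PNumber": 1, "Hours": 20.0, "EName": "Wong"},
]
```

On this data fd_holds(works_on, ["SSN"], ["EName"]) is true, while fd_holds(works_on, ["SSN"], ["Hours"]) is false because employee 111 has two different Hours values.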
Examples of FD's:
Using A1, A2 and A3, we can also prove some additional useful inference rules:
Minimal set of FDs: A set of FDs, F, is minimal if it satisfies the following conditions.
(a) Every dependency in F has a single attribute for its RHS (right hand side).
(b) We cannot remove any dependency from F, and have a set of dependencies equivalent to F.
(c) We cannot replace any dependency, X → A, in F with a dependency, Y → A, where Y ⊂ X, and
still have a set of dependencies equivalent to F.
In general, we know the following to be true: (1) Every set of FDs has an equivalent minimal set; and
(2) There can be several equivalent minimal sets.
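One common way to obtain an equivalent minimal set is to enforce conditions (a), (c) and (b) in that order, using attribute closure to test equivalence. The sketch below is an illustration under that approach; the helper names and the example FDs are assumptions, and a different processing order may yield a different (equally valid) minimal set.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def minimal_cover(fds):
    """Return one minimal set equivalent to fds, a list of (lhs, rhs) set pairs."""
    # (a) Give every dependency a single attribute on its right-hand side.
    work = [(frozenset(lhs), frozenset([a])) for lhs, rhs in fds for a in rhs]
    # (c) Drop left-hand-side attributes that are not needed.
    for i, (lhs, rhs) in enumerate(work):
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            # This sketch never reduces a left-hand side to the empty set.
            if reduced and rhs <= closure(reduced, work):
                lhs = reduced
                work[i] = (lhs, rhs)
    work = list(dict.fromkeys(work))  # remove duplicates, keep order
    # (b) Drop dependencies that follow from the remaining ones.
    result = list(work)
    for fd in work:
        rest = [g for g in result if g != fd]
        if fd[1] <= closure(fd[0], rest):
            result = rest
    return result


# Example set F = { B → A, D → A, AB → D } (hypothetical).
F = [({"B"}, {"A"}), ({"D"}, {"A"}), ({"A", "B"}, {"D"})]
cover = minimal_cover(F)
```

For this F, AB → D is first reduced to B → D, after which B → A becomes redundant (it follows from B → D and D → A), leaving { D → A, B → D }.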
It is possible to convert any schema that is not in 1NF into one or more schemas that are in 1NF:
[Figure: 1NF conversion examples for composite and multi-valued attributes; schemas shown: STUDENT_COURSES, STUDENT_COURSES_1NF, EMPLOYEE, EMP_PROJECTS]
Second Normal Form (2NF). 2NF uses the concepts of FDs and the primary key.
Definitions:
Full functional Dependency: An FD, Y → Z, where the removal of ANY attribute from Y means that
the FD will not hold true any more.
Examples:
{SSN, PNumber} → Hours is a Full FD, since neither {SSN} → Hours, nor {PNumber} → Hours is true.
{SSN, PNumber} → EName is NOT a Full FD, since {SSN} → EName is also true.
Second Normal Form: A relational schema, R, is in 2NF if every non-prime attribute A in R is fully
functionally dependent on the primary key.
Consider a DB storing information about employees, projects, and what project each employee works
on. The schema EMP_PROJ below attempts to store all this data in one table. The arrows denote the
FD’s in the schema.
You can easily check that a table constructed with this schema will result in many anomalies. We can
use the definition of 2NF above to break this schema into a set of three schemas which are in 2NF.
EMP_PROJ( SSN, PNumber, Hours, EName, PName, PLocation)
EMP_PROJ1( SSN, PNumber, Hours)
EMP_PROJ2( SSN, EName)
EMP_PROJ3( PNumber, PName, PLocation)
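The decomposition of EMP_PROJ can be tried out on a small instance. The rows and helper functions below are hypothetical; the sketch projects the table onto the three 2NF schemas and checks that joining them back gives exactly the original rows.

```python
def project(rows, attrs):
    """Projection onto attrs, with duplicate tuples removed."""
    unique = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(unique)]


def natural_join(left, right):
    """Combine rows that agree on every attribute the two relations share."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                out.append({**l, **r})
    return out


def as_set(rows):
    """Order-independent view of a relation, for comparison."""
    return {frozenset(r.items()) for r in rows}


# Hypothetical instance of EMP_PROJ.
emp_proj = [
    {"SSN": "111", "PNumber": 1, "Hours": 32.5, "EName": "Smith",
     "PName": "ProductX", "PLocation": "Bellaire"},
    {"SSN": "111", "PNumber": 2, "Hours": 7.5, "EName": "Smith",
     "PName": "ProductY", "PLocation": "Sugarland"},
    {"SSN": "222", "PNumber": 1, "Hours": 20.0, "EName": "Wong",
     "PName": "ProductX", "PLocation": "Bellaire"},
]

emp_proj1 = project(emp_proj, ["SSN", "PNumber", "Hours"])
emp_proj2 = project(emp_proj, ["SSN", "EName"])
emp_proj3 = project(emp_proj, ["PNumber", "PName", "PLocation"])

rejoined = natural_join(natural_join(emp_proj1, emp_proj2), emp_proj3)
```

Each employee name and each project description is now stored once, and as_set(rejoined) equals as_set(emp_proj), so no information is lost or invented by this decomposition.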
Examples:
SSN → MgrSSN is a transitive dependency, because SSN → DNumber, and DNumber → MgrSSN.
SSN → LName is NOT a transitive dependency, since there is no attribute X, such that SSN → X,
and X → LName.
Third Normal Form: a relational schema is in 3NF if it is in 2NF, and no non-prime attribute A in R
is transitively dependent on the primary key.
Below is an example of a schema that is not in 3NF, and how to decompose it into a pair of 3NF
relational schemas.
EMP_DEPT( SSN, EName, Address, DNo, DName, MgrSSN)
EMP_DEPT1( SSN, EName, Address, DNo)
EMP_DEPT2( DNo, DName, MgrSSN)
For most applications, achieving 3NF as defined above will provide an acceptable logical design.
However, note that the entire process above is dependent on the (arbitrary) selection of primary keys
from all possible candidate keys. There is a broader definition of 3NF which considers the existence
of multiple candidate keys.
The general 3NF definition requires our earlier definition of superkeys (a superkey of R is a set of
attributes which contains a key of R).
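[Figure: normalization of LOTS]
1NF: LOTS( PropertyID, District, Lot#, Area, Price, TaxRate), with dependencies FD1, FD2, FD3, FD4
2NF: LOTS1( PropertyID, District, Lot#, Area, Price), with dependencies FD1, FD2, FD4; and LOTS2( District, TaxRate)
3NF: LOTS1A( PropertyID, District, Lot#, Area), with dependencies FD1, FD2; and LOTS1B( Area, Price)
Here FD3 is District → TaxRate and FD4 is Area → Price.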
Notice: there are two candidate keys: { PropertyID} and { District, Lot#}.
2NF:
TaxRate is partially dependent upon the candidate key { District, Lot#}, due to FD3.
Hence we break this schema into two parts to achieve 2NF, as shown.
3NF:
LOTS2 is already in 3NF according to the general definition. However, LOTS1 violates the general
definition of 3NF because of FD4, Area → Price: (a) Area is not a superkey of LOTS1;
AND, (b) Price is not a prime attribute of LOTS1.
Therefore we further break LOTS1 into LOTS1A and LOTS1B to achieve 3NF.
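The argument above can be checked mechanically. The sketch below tests the general 3NF condition (for each FD X → A, either X is a superkey or A is a prime attribute) against the FDs given for each schema. It assumes that FD1 and FD2 are the dependencies by which the two candidate keys, { PropertyID} and { District, Lot#}, determine all other attributes; the function names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(s, fds) >= attributes:
                keys.append(s)
    return keys


def violations_3nf(attributes, fds):
    """FDs X -> A (from the given list) with X not a superkey and A not prime."""
    prime = set().union(*candidate_keys(attributes, fds))
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= attributes:
            continue  # lhs is a superkey
        bad.extend((lhs, a) for a in sorted(rhs - lhs) if a not in prime)
    return bad


LOTS1 = {"PropertyID", "District", "Lot#", "Area", "Price"}
FD1 = (frozenset({"PropertyID"}), frozenset({"District", "Lot#", "Area", "Price"}))  # assumed
FD2 = (frozenset({"District", "Lot#"}), frozenset({"PropertyID", "Area", "Price"}))  # assumed
FD4 = (frozenset({"Area"}), frozenset({"Price"}))

lots1_keys = candidate_keys(LOTS1, [FD1, FD2, FD4])
lots1_bad = violations_3nf(LOTS1, [FD1, FD2, FD4])
```

Under these assumptions the candidate keys of LOTS1 come out as { PropertyID} and { District, Lot#}, and the only violation reported is Area → Price, which is exactly the dependency that LOTS1B isolates.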
Acknowledgements: The notes have been based (with minor modifications) on lecture notes of Prof Navathe, as
distributed by Dr. Kamal Karlapalem for COMP231, HKUST.