The notion of a data model is one of the most fundamental in the study of database
systems.
This section covers some basic terminology and the most important data models.
These limits are not weaknesses, but strengths. They make it possible for
programmers to describe data operations at a very high level, yet have the database
implement the operations efficiently.
3. Constraints on the data
Constraints describe limitations on the data.
e.g: (simple) "a day of the week is an integer between 1 and 7"
e.g: "a movie has at least one title"
Very complex constraints can be put on the data; see chapter 7.
e.g:
title, year, length, genre
Gone With The Wind, 1939, 231, drama
Star Wars, 1977, 124, sciFi
Wayne's World, 1992, 95, comedy
The *operations* normally associated with the relational model form the 'relational
algebra' (see section 2.4).
The operations are table oriented, e.g: we can ask for all rows of a table that
have a specific value in a specific column.
A brief example of constraints: we could ensure that the 'genre' values are drawn
from a fixed set, or we could ensure that there aren't two movies with the same name
(the latter is incorrect wrt the real world domain of movie making).
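A minimal Python sketch of the three ingredients just described (structure, a table-oriented operation, constraints), using the example rows. The set representation and the name ALLOWED_GENRES are assumptions made for illustration, not anything the textbook prescribes.

```python
# Each tuple of the Movies relation as (title, year, length, genre).
movies = {
    ("Gone With The Wind", 1939, 231, "drama"),
    ("Star Wars", 1977, 124, "sciFi"),
    ("Wayne's World", 1992, 95, "comedy"),
}

ALLOWED_GENRES = {"drama", "sciFi", "comedy"}  # assumed fixed set of genres

# Table-oriented operation: all rows with a specific value in a specific column.
dramas = {row for row in movies if row[3] == "drama"}

# Constraint 1: every genre value is drawn from the fixed set.
genres_ok = all(row[3] in ALLOWED_GENRES for row in movies)

# Constraint 2: no two movies share a title.
titles = [row[0] for row in movies]
titles_unique = len(titles) == len(set(titles))
```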
<Movies>
<Movie title="Gone With the Wind">
<Year>1939</Year>
<Length>231</Length>
<Genre>drama</Genre>
</Movie>
There are many other data models that have been associated with databases.
E.g: a modern trend is to add object oriented features to the relational model.
There are two effects of OO on relations
There are database models of the purely object kind; see section 4.9.
There are several other models that have fallen into disuse.
e.g: the hierarchical model, a tree oriented model, like the semistructured model.
Another model is the network model, a graph oriented, physical level model. This is
like the hierarchical model, but unlike it, the network model is not restricted to trees.
Together (KEY) these limitations turn into features, which allow us to implement
languages *like* SQL that allow programmers to express themselves at a very high
level (_of abstraction). And since SQL has a limited number of operations, we can
optimize them to run fast.
2.2.1 Attributes
The columns of a relation are named by attributes.
In the above example, the attributes are title, year, length, genre
Usually an attribute describes the meaning of entries in the columns.
2.2.2 Schemas
The name of a relation and the set of attributes for the relation are called the
'schema' of that relation.
We represent the schema as the relation name followed by a parenthesized list of
its attributes.
The attributes of a relation are a set, not a list, so to write the schema down we
have to declare a 'standard order'.
However for the relation above, we could take the given order as 'standard'.
2.2.3 Tuples
The rows of a relation, other than the header row containing the attributes, are
called 'tuples'.
A tuple has one 'component' for each attribute of the schema.
e.g: The first of three tuples in the relation above has the components "Gone With
The Wind", 1939, 231, drama, for the attributes, title, year, length, genre
respectively.
2.2.4 Domains
The relational model demands that each component of each tuple be atomic, i.e. it
must be of some elementary type like integer or string.
Components can*not* be compound - like structures, sets, lists, arrays, or any other
type that can be broken down.
Each attribute has a 'domain' (which seems to be a synonym for 'elementary type').
We can include the domain of each attribute in the representation of the schema.
E.g
A set of tuples for a given relation is called an 'instance' of that relation.
(thought short cut - a frozen relation is an 'instance')
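A small Python sketch of a schema with domains, and of checking that a tuple has one atomic component of the right type per attribute. The helper name conforms and the mapping of domains to Python types are assumptions for illustration.

```python
# Schema of Movies as (attribute, domain) pairs in a chosen 'standard order'.
schema = [("title", str), ("year", int), ("length", int), ("genre", str)]

def conforms(row, schema):
    """True if row has one component per attribute, each of that attribute's domain."""
    return len(row) == len(schema) and all(
        type(value) is domain for value, (_, domain) in zip(row, schema)
    )

good = ("Star Wars", 1977, 124, "sciFi")
compound = ("Star Wars", 1977, [124], "sciFi")  # a list is not atomic
too_short = ("Star Wars", 1977)
```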
The Relational Model allows many types of constraints. These are discussed in
Chapter 7.
A set of attributes forms a key (a constraint on a relation) if we do not allow two
tuples in a relation instance to have the same values for all attributes of the
key.
e.g
so there cannot be two tuples for this relation which have the exact same values
for both title *and* year.
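The key condition can be sketched as a check over a relation instance. This is only an illustration: satisfies_key is a hypothetical helper, and the rows (including their lengths and genres) are sample values chosen for the example.

```python
def satisfies_key(rows, key_positions):
    """True if no two rows agree on every position listed in key_positions."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        if key in seen:
            return False
        seen.add(key)
    return True

# Sample rows of (title, year, length, genre); two movies share a title.
movies = [
    ("Star Wars", 1977, 124, "sciFi"),
    ("King Kong", 1933, 100, "adventure"),
    ("King Kong", 2005, 187, "adventure"),
]

title_year_is_key = satisfies_key(movies, (0, 1))  # {title, year}
title_alone_is_key = satisfies_key(movies, (0,))   # {title}
```

In this instance {title, year} passes while title alone does not, which is why the pair is used as the key.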
Movies(
*title* : string,
*year* : integer,
length : integer,
genre : string,
studioName : string,
producerC# : integer
)
MovieStar(
*name* : string,
address : string,
gender : char,
birthdate : date
)
StarsIn(
*movieTitle* : string,
*movieYear* : integer,
*starName* : string
)
MovieExec(
name : string,
address : string,
*cert#* : integer,
netWorth : integer
)
Studio(
*name* : string,
address : string,
presC# : integer
)
Comments:
The relation Movies has a key consisting of the attributes title and year.
studioName is the name of the studio that owns the movie.
producerC# is an integer that represents the producer (see MovieExec
comments).
The relation MovieStar has the name attribute as the key; we use here the
convenient fiction that movie star names are unique. A more conventional approach,
like using SS number or assigning each individual a unique number, would also work.
The relation StarsIn connects stars to movies they act in, and movies to their
stars.
Note that the key consists of all three attributes. Also, two of the
attributes together are actually the Movies relation's key, though we use different
names here. (you need an explicit statement of this being a foreign key etc)
Studio represents movie studios. The key is the studio name (we assume two
movie studios don't have the same name).
We assume the movie studio has a president who is a movie executive, and so the
MovieExec relation's key is present in this relation, to identify the individual
who is the studio's president.
SQL is the principal language used to describe and manipulate relational databases.
Current standard = SQL 99.
Two aspects to SQL
1. The Data Definition sublanguage - for describing database schemas
2. The Data Manipulation sublanguage - for querying and modifying databases
In this section - how to declare tables (the first type, not the second or third).
The CREATE TABLE statement declares the schema for a stored relation. It gives the
name of the table, its attributes, and their data types. It also allows us to declare
a key, or even several keys, as well as other constraints and indexes.
2. BIT(n) and BIT VARYING(n) denote bit strings of fixed and varying length
(up to n) respectively.
5. FLOAT or REAL (they are synonyms) for floating point numbers. Also
DECIMAL(6, 2) etc.
The simplest way to declare a relation is to use the keywords CREATE TABLE followed
by the name of a relation, and a parenthesized, comma separated list of the
attribute names and their types.
e.g:
e.g:
DROP TABLE R;
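Declaring and dropping a table can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite is used only because it needs no setup, the column types are plausible choices for the MovieStar schema rather than the textbook's declaration, and SQLite treats declared types more loosely than the SQL standard does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# CREATE TABLE: relation name, then a parenthesized list of attributes and types.
conn.execute("""
    CREATE TABLE MovieStar (
        name      CHAR(30),
        address   VARCHAR(255),
        gender    CHAR(1),
        birthdate DATE
    )
""")

# Column names as SQLite recorded them, in declaration order.
columns = [row[1] for row in conn.execute("PRAGMA table_info(MovieStar)")]

# DROP TABLE removes the relation and its schema.
conn.execute("DROP TABLE MovieStar")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```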
To modify an existing relation, we *start with* a statement that begins with "ALTER
TABLE $NAME_OF_RELATION".
We have several options, the most important of which are
a. ADD followed by an attribute name and its data type
b. DROP followed by an attribute name.
e.g: after adding a phone attribute to MovieStar, every tuple of MovieStar has the
phone attribute, but existing tuples will have the special value NULL for it.
ALTER TABLE MovieStar DROP birthdate;
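A sketch of the ADD option using Python's sqlite3 module, showing that existing tuples get NULL (None in Python) for the new attribute. The CHAR(16) type for phone and the sample row are assumptions. The DROP option is left out because some older SQLite versions do not support dropping a column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MovieStar (name CHAR(30), address VARCHAR(255))")
conn.execute(
    "INSERT INTO MovieStar VALUES ('Carrie Fisher', '123 Maple St., Hollywood')"
)

# Add a new attribute to the existing relation.
conn.execute("ALTER TABLE MovieStar ADD phone CHAR(16)")

# The tuple inserted before the change now has NULL in the phone component.
rows = conn.execute("SELECT name, phone FROM MovieStar").fetchall()
```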
e.g:
There are two kinds of declarations to indicate 'keyness' - PRIMARY KEY, or UNIQUE
(more below).
There are two ways to declare an attribute, or a set of attributes, to be a key in
the CREATE TABLE statement
1. We may declare *one attribute* to be a key when that attribute is declared in
the schema
Example:
2. We may add to the *list of items declared in the schema* (schema so far has
been only a list of attributes) an additional declaration that states that a
specific attribute, or a set of attributes, is a key.
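The second way (a separate key declaration in the list of schema items) can be sketched with Python's sqlite3 module. The column types and the inserted rows are assumptions for illustration; the second and third rows are made up to exercise the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Movies (
        title  CHAR(100),
        year   INT,
        length INT,
        genre  CHAR(10),
        PRIMARY KEY (title, year)
    )
""")
conn.execute("INSERT INTO Movies VALUES ('Star Wars', 1977, 124, 'sciFi')")

# A second tuple with the same (title, year) violates the key and is rejected.
try:
    conn.execute("INSERT INTO Movies VALUES ('Star Wars', 1977, 121, 'sciFi')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Same title with a different year is a different key value, so it is accepted.
conn.execute("INSERT INTO Movies VALUES ('Star Wars', 1999, 133, 'sciFi')")
count = conn.execute("SELECT COUNT(*) FROM Movies").fetchone()[0]
```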
- the data manipulation aspect of the relational model. A data model is not just a
structure. There needs to be a way to modify and query data.
We learn an algebra - a relational algebra - that consists of several ways to
construct new relations from existing relations.
When the given relations are data, the new relations can be answers to queries on
that data.
RelAlg is not used as a query language in real life databases, but the 'real' query
language, SQL, incorporates relational algebra.
Many SQL programs are 'syntactically sugared' relational algebra expressions. When
an RDBMS handles SQL queries, the first step is to transform the SQL query into
relational algebra, or an equivalent representation.
E.g: compute whether the number of tuples in a relation is even or odd. (_ there
isn't a way to do this in the relational algebra described here)
But, by restricting what we can say or do in our query language, we get two huge
advantages.
1. ease of programming (_ because *everything* Turing computable isn't possible in
the language we use)
2. ability of the compiler to produce highly optimized code. (_ again because the
language to be compiled is smaller/simpler)
In relational algebra
1. the operands are (a) variables that stand for relations (b) constants that
are finite relations
In the next section we examine the operations of relalg.
Example:
Let R =
name, address, gender, birthdate
Carrie Fisher, "123 Maple St., Hollywood", F, 9/9/99
Mark Hamill, "456 Oak Rd., Brentwood", M, 8/8/88
Let S =
name, address, gender, birthdate
Carrie Fisher, "123 Maple St., Hollywood", F, 9/9/99
Harrison Ford, "789 Palm Dr., Beverly Hills", M, 7/7/77
then R union S =
Carrie Fisher, "123 Maple St., Hollywood", F, 9/9/99
Mark Hamill, "456 Oak Rd., Brentwood", M, 8/8/88
Harrison Ford, "789 Palm Dr., Beverly Hills", M, 7/7/77
R intersection S =
Carrie Fisher, "123 Maple St., Hollywood", F, 9/9/99
R difference S =
Mark Hamill, "456 Oak Rd., Brentwood", M, 8/8/88
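Since relations are sets of tuples, the three set operations map directly onto Python's set operators. A minimal sketch with the example data; treating the whole address as one string component is an assumption about how the rows are meant to be read.

```python
carrie = ("Carrie Fisher", "123 Maple St., Hollywood", "F", "9/9/99")
mark = ("Mark Hamill", "456 Oak Rd., Brentwood", "M", "8/8/88")
harrison = ("Harrison Ford", "789 Palm Dr., Beverly Hills", "M", "7/7/77")

# R and S must have the same schema for these operations to make sense.
R = {carrie, mark}
S = {carrie, harrison}

union = R | S          # tuples in R or S (duplicates appear once)
intersection = R & S   # tuples in both R and S
difference = R - S     # tuples in R but not in S
```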
2.4.5 Projection
The projection operator (greek pi) produces from the relation R a new relation that
has only some of R's columns.
The value of the expression pi_A1, A2, ..., A_n (R) is a relation that has only
the columns of R for the attributes A1, A2, ..., A_n.
The schema for the resulting value (a relation) is the set of attributes
{A1, A2, ..., A_n}, which we conventionally show in the order A1, ..., A_n.
Example 2.9
pi_title, year, length(Movies) gives us the relation
title,year, length
Star Wars, 1977, 124
Galaxy Quest, 1999, 104
Wayne's World, 1992, 95
Example
pi_genre(Movies) gives us
genre
sciFi
comedy
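Projection can be sketched in a few lines of Python. The helper name project and the dict-per-tuple representation are assumptions; the genre values for the three movies are chosen to agree with the two results shown. Using a set for the result is what makes the duplicate 'comedy' disappear.

```python
def project(rows, attrs):
    """Keep only the listed attributes; the set removes duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in rows}

movies = [
    {"title": "Star Wars", "year": 1977, "length": 124, "genre": "sciFi"},
    {"title": "Galaxy Quest", "year": 1999, "length": 104, "genre": "comedy"},
    {"title": "Wayne's World", "year": 1992, "length": 95, "genre": "comedy"},
]

title_year_length = project(movies, ["title", "year", "length"])
genres = project(movies, ["genre"])
```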
2.4.6 Selection
The selection operator, applied to a relation R, produces a new relation with the
subset of R's tuples that satisfy a condition C (of type boolean) involving R's
attributes (and constants).
then
gives
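A minimal Python sketch of selection, assuming tuples are dicts and the condition C is passed as a function. The two conditions used here are illustrative choices, not necessarily the ones in the textbook's example.

```python
def select(rows, condition):
    """Return the tuples of the relation for which the condition holds."""
    return [row for row in rows if condition(row)]

movies = [
    {"title": "Star Wars", "year": 1977, "length": 124, "genre": "sciFi"},
    {"title": "Galaxy Quest", "year": 1999, "length": 104, "genre": "comedy"},
    {"title": "Wayne's World", "year": 1992, "length": 95, "genre": "comedy"},
]

# Condition on one attribute and a constant.
long_movies = select(movies, lambda m: m["length"] >= 100)

# Conditions can be combined with AND, OR, NOT.
long_comedies = select(
    movies, lambda m: m["length"] >= 100 and m["genre"] == "comedy"
)
```

The result keeps the full schema of the input; only the set of tuples shrinks.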
Essentially the same for relations, but the elements are tuples. Tuples can have
more than one component.
The result of pairing one tuple from R with one from S is a longer tuple, with a
component for each attribute of R and each attribute of S.
Conventionally, the attributes of R precede the attributes of S in the result
tuple.
If R and S have an attribute A in common, we use R.A and S.A in the schema of the
result.
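The product, including the R.A / S.A naming rule, might be sketched like this. The helper name product and the dict-per-tuple representation are assumptions; the data reuses the small R(A,B) and S(B,C,D) relations from the join examples in these notes.

```python
def product(r_rows, s_rows, r_name="R", s_name="S"):
    """Pair every tuple of R with every tuple of S.

    Attributes that occur in both relations are prefixed with the relation name.
    """
    result = []
    for r in r_rows:
        for s in s_rows:
            row = {}
            for attr, value in r.items():   # R's attributes come first
                row[f"{r_name}.{attr}" if attr in s else attr] = value
            for attr, value in s.items():
                row[f"{s_name}.{attr}" if attr in r else attr] = value
            result.append(row)
    return result

R = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
S = [{"B": 2, "C": 5, "D": 6}, {"B": 4, "C": 7, "D": 8}, {"B": 9, "C": 10, "D": 11}]

pairs = product(R, S)
```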
The simplest such is that we join a tuple from R and a tuple from S (into a new
tuple of the resulting relation) only when the tuples agree on the common attributes
(the attributes that appear in the schemas of both R and S; the tuples must have
identical values for *those* attributes).
More precisely, let A1, A2, ..., A_n be the common attributes of R and S.
Then, a tuple from R and a tuple from S are joined (to form a tuple in R nj S)
iff the values of A1 ... A_n match in both tuples.
let relation R be
A,B
1,2
3,4
let relation S be
B,C,D
2,5,6
4,7,8
9,10,11
then R nj S is
A,B,C,D
1,2,5,6
3,4,7,8
Second example: let R be
A,B,C
1,2,3
6,7,8
9,7,8
Let S be
B,C,D
2,3,4
2,3,5
7,8,10
Here the common attributes are B *and* C.
so R nj S
has tuples
A,B,C,D
1,2,3,4
1,2,3,5
6,7,8,10
9,7,8,10
Each tuple in R is joined to each tuple in S where the common attribute values
match.
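The natural join can be sketched as a nested loop over both relations. The helper name natural_join and the dict-per-tuple representation are assumptions; the data is the second example (common attributes B and C).

```python
def natural_join(r_rows, s_rows):
    """Join tuples of R and S that agree on every attribute they share."""
    result = []
    for r in r_rows:
        for s in s_rows:
            common = r.keys() & s.keys()
            if all(r[a] == s[a] for a in common):
                # Shared attributes appear once; R's attributes come first.
                result.append({**r, **s})
    return result

R = [{"A": 1, "B": 2, "C": 3}, {"A": 6, "B": 7, "C": 8}, {"A": 9, "B": 7, "C": 8}]
S = [{"B": 2, "C": 3, "D": 4}, {"B": 2, "C": 3, "D": 5}, {"B": 7, "C": 8, "D": 10}]

joined = [tuple(row.values()) for row in natural_join(R, S)]
```

If R and S share no attributes, the condition is vacuously true for every pair and the result is the full product.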
The natural join combines tuples from R and S on *one* specific condition - the
equality of shared attribute values.
It is sometimes necessary to combine tuples based on *other* conditions.
For this purpose we have the 'theta join', in which theta represents an arbitrary
condition. We'll use C instead, and use the 'bowtie' notation (in the textbook) of
the natural join with C as a subscript indicating the condition to be satisfied.
let R be
A,B,C
1,2,3
6,7,8
9,7,8
Let S be
B,C,D
2,3,4
2,3,5
7,8,10
The result of R theta_join S with condition A < D is
A, R.B, R.C, S.B, S.C, D (note: relation namespaced common attributes)
1,2,3,2,3,4
1,2,3,2,3,5
1,2,3,7,8,10
6,7,8,7,8,10
9,7,8,7,8,10
Note: In the case of a theta join there is no guarantee that shared attributes will
agree in value in the combined tuple (_ so we have to list them separately, with the
name of the relation prefixed, e.g R.C, S.C etc)
Example: R theta_join S with condition A < D AND R.B != S.B
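Both theta joins on this data can be worked through with a short Python sketch. The helper name theta_join is hypothetical, and tuples are plain Python tuples in schema order: R is (A, B, C), S is (B, C, D).

```python
def theta_join(r_rows, s_rows, condition):
    """Concatenate each pair (r, s) from the product for which condition(r, s) holds."""
    return [r + s for r in r_rows for s in s_rows if condition(r, s)]

R = [(1, 2, 3), (6, 7, 8), (9, 7, 8)]       # schema (A, B, C)
S = [(2, 3, 4), (2, 3, 5), (7, 8, 10)]      # schema (B, C, D)

# Condition A < D: r[0] is A, s[2] is D.
a_less_d = theta_join(R, S, lambda r, s: r[0] < s[2])

# Condition A < D AND R.B != S.B: r[1] is R.B, s[0] is S.B.
a_less_d_and_b_differs = theta_join(
    R, S, lambda r, s: r[0] < s[2] and r[1] != s[0]
)
```

With the second condition only one tuple survives, because it is the only pair from the first result whose B components differ.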
Basic idea: algebraic operations can be composed, the output of one operation
feeding into the input of another.
Parentheses group operators.
example 2.17
we want the title and year of movies produced by Fox that are at least 100 minutes
long
one option
a) *select* movies with studioName = Fox
b) *select* movies with length >= 100
c) compute the intersection of the results of a and b
d) project the result of c onto title and year
we could also do
a) select movies with studioName = Fox *AND* length >= 100
b) project result onto title and year
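The two plans can be compared on a tiny instance. This is a sketch only: the rows below are made up (not real movie data), the schema is cut down to (title, year, length, studioName), and select and project_title_year are hypothetical helpers.

```python
# Made-up rows of Movies(title, year, length, studioName); not real data.
movies = {
    ("Alpha", 2001, 120, "Fox"),
    ("Beta", 2002, 90, "Fox"),
    ("Gamma", 2003, 130, "Disney"),
}

def select(rows, condition):
    return {row for row in rows if condition(row)}

def project_title_year(rows):
    return {(title, year) for title, year, _, _ in rows}

# Plan 1: two selections, an intersection, then a projection.
fox = select(movies, lambda m: m[3] == "Fox")
long_enough = select(movies, lambda m: m[2] >= 100)
plan1 = project_title_year(fox & long_enough)

# Plan 2: one selection with a combined condition, then a projection.
plan2 = project_title_year(
    select(movies, lambda m: m[3] == "Fox" and m[2] >= 100)
)
```

Both expressions denote the same relation, which is what lets a query optimizer pick whichever form is cheaper to evaluate.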
the book has a weird notation so i'm using rename (rel_name, target_name,
attributes)
so rename (R, S, A1, ..., A_n) gives a relation S that has the same tuples as R, but
with the attributes renamed (in order) to A1 thru A_n
if we want to keep the attribute names intact we do rename (R, S)
rename (S, S, X, Y, Z) renames the three attributes of S (from say A, B, C) to X, Y, Z
no concrete example.
R intersect S = R - (R - S)
R theta-join(condition C) S = select_C (R X S)
The 'core' or 'base' operations which cannot be written in terms of others are -
union, difference, selection, projection, (cross) product, renaming.
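Both identities can be checked on small made-up relations with Python sets. The data and the condition (first component less than second) are assumptions chosen only to exercise the identities.

```python
# Made-up relations over the same two-attribute schema.
R = {(1, 2), (3, 4), (5, 6)}
S = {(3, 4), (5, 6), (7, 8)}

# R intersect S = R - (R - S)
intersection_via_difference = R - (R - S)

# Theta join = selection over the product, on two one-attribute relations.
U = {(1,), (5,)}
V = {(3,), (4,)}
product = {u + v for u in U for v in V}
theta_via_select = {row for row in product if row[0] < row[1]}
theta_direct = {u + v for u in U for v in V if u[0] < v[0]}
```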
so
R(t,y,l,g,s,p) := select (Movies, length >= 100)
S(t,y,l,g,s,p) := select (Movies, studioName = 'Fox')
T(t,y,l,g,s,p) := R intersection S
Answer (title, year) := project(T, t, y)
basically every interior node in a tree has its own variable, on which operators
higher up on the tree operate.
a2. R intersection S
maximum = min(m, n) (every tuple of the smaller relation also appears in the larger)
minimum = 0 (no tuples in common)
b. R natural join S
maximum = m * n (e.g. no attributes in common, so the join is the product)
minimum = 0 (have common attributes, which have no equal values
in R, S)
The third important aspect of the relational model = the ability to restrict the
data that may be stored in the database.
So far we have seen only one kind of constraint, that of one or more attributes
acting as a key.
There are many more kinds of constraints
e.g: 'referential integrity constraints' - the value of one column of a
relation must appear in another column of the relation or a column of another
relation.
(here we use relalg but in chapter 7 we see how SQL can express the same
constraints)
(ex: 2.4.8)
The semijoin of R and S is the set of tuples t in R s.t. there is at least one tuple
u in S that agrees with t on all attributes common to R and S.
a bit abstract, so try with the data for the natural join
let relation R be
A,B
1,2
3,4
let relation S be
B,C,D
2,5,6
4,7,8
9,10,11
then R semijoin S is (note that the common attribute here is B)
A,B
1,2
3,4
S semijoin R would be
B,C,D
2,5,6
4,7,8
(this seems like a natural join but only the tuples in R are in the result, there
is no 'join')
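The semijoin can be sketched as a filter on R. The helper name semijoin and the dict-per-tuple representation are assumptions; the data is the example just worked through.

```python
def semijoin(r_rows, s_rows):
    """Tuples of R that agree with at least one tuple of S on all common attributes."""
    return [
        r for r in r_rows
        if any(all(r[a] == s[a] for a in r.keys() & s.keys()) for s in s_rows)
    ]

R = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
S = [{"B": 2, "C": 5, "D": 6}, {"B": 4, "C": 7, "D": 8}, {"B": 9, "C": 10, "D": 11}]

r_semi_s = semijoin(R, S)
s_semi_r = semijoin(S, R)
```

The result always has the schema of the left operand, which is why the operation is not symmetric.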
(ex:2.4.10 )
A relation R has attributes A1, A2, ..., A_n, B1, B2, ..., B_m.
Let S be a relation with schema B1, B2, ..., B_m. Iow, S's attributes are a subset of
R's.
R quotient S is the set of tuples t over A1 ... A_n (i.e. the non-S attributes of R)
such that for every tuple s in S, the tuple ts (t followed by s) is a tuple of R.
(fair enough, but I suspect 'quotient' is clearer in terms of relations, see Set
Theory book)
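A small Python sketch of the quotient, assuming tuples are plain Python tuples whose first n components are the A attributes and whose remaining components line up with S's attributes. The helper name quotient and the data are made up for illustration.

```python
def quotient(r_rows, s_rows, n):
    """R / S: the A-parts t such that t + s is in R for every s in S."""
    r_set = set(r_rows)
    candidates = {row[:n] for row in r_set}   # every A-part that occurs in R
    return {t for t in candidates if all(t + s in r_set for s in s_rows)}

# Made-up instance: R has schema (A, B), S has schema (B).
R = {(1, "x"), (1, "y"), (2, "x")}
S = {("x",), ("y",)}

result = quotient(R, S, n=1)
```

Here A = 1 is paired in R with every tuple of S, while A = 2 is missing the pairing with "y", so only (1,) is in the quotient.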
third aspect of a data model == constraints on the model (_ first two: structure,
operations)
So far we only saw one kind of constraint, a set of attributes of a relation acting
as a key.
now we *also* look at 'referential integrity' constraints - iow, a value appearing
in a column of one relation must also appear in another column of the same (??!!)
or another relation.
(some confused writing here, but the key idea seems to be that there are often 'two
ways' to express a constraint. as far as I can see the differences involve using
set containment vs using equality to the empty set)
Movies(
*title* : string,
*year* : integer,
length : integer,
genre : string,
studioName : string,
producerC# : integer
)
MovieStar(
*name* : string,
address : string,
gender : char,
birthdate : date
)
StarsIn(
*movieTitle* : string,
*movieYear* : integer,
*starName* : string
)
MovieExec(
name : string,
address : string,
*cert#* : integer,
netWorth : integer
)
Studio(
*name* : string,
address : string,
presC# : integer
)
project_A (R) subsetOf project_B (S). (_ so B in S *can* have values not in R.A but
every value in R.A must be in S.B)
or with alternative notation
project_A (R) - project_B (S) = emptyset (the empty relation)
Example 2.21
we would expect that every value of producerC# in Movies also appears as the cert#
of some executive (tuple) in MovieExec (_ otherwise there would be a producer who is
not a MovieExec. also there can be movie execs who are not producers).
Example 2.22
A referential constraint where the 'value' involved is represented by more than one
attribute.
Any movie mentioned in the 'StarsIn' relation must appear in the Movies relation.
The key difference here is that movies are identified (uniquely, so via the primary
key) by year *and* title, so we use a subset of *pairs* to express this constraint
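The multi-attribute referential integrity constraint can be sketched with Python sets, in both notations (containment, and difference equal to the empty set). The variable names and the cut-down rows are assumptions for illustration.

```python
# Projection of Movies onto (title, year).
movie_keys = {("Star Wars", 1977), ("Wayne's World", 1992)}

# StarsIn(movieTitle, movieYear, starName)
stars_in = {
    ("Star Wars", 1977, "Carrie Fisher"),
    ("Star Wars", 1977, "Mark Hamill"),
}

def referenced_pairs(stars_in_rows):
    """Projection of StarsIn onto (movieTitle, movieYear)."""
    return {(title, year) for title, year, _ in stars_in_rows}

# Notation 1: containment.
holds = referenced_pairs(stars_in) <= movie_keys

# Notation 2: the difference must be the empty set.
dangling = referenced_pairs(stars_in) - movie_keys

# Adding a StarsIn tuple for a movie that is not in Movies breaks the constraint.
broken = stars_in | {("Galaxy Quest", 1999, "Tim Allen")}
broken_dangling = referenced_pairs(broken) - movie_keys
```

The second form is handy in practice because the difference lists exactly the offending pairs.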
(For now assuming that we are concerned only with the address attribute, given name
is a key)
Let us rename the MovieStar relation to get two new 'names' MS1, MS2
and then
There are many kinds of constraints that can be expressed with relational algebra,
which are used for restricting database contents.
gender must be either 'M' or 'F' on relation MovieStar becomes
select_(gender != 'M' AND gender != 'F') (MovieStar) = emptyset
to be a movie studio president, you need a net worth of at least 10 million.
given
MovieExec (name, address, cert#, netWorth)
Studio(name, address, presC#)
or
step 1. Select (netWorth >= 10,000,000) MovieExec, then
step 2. Project (cert#) [Select (netWorth >= 10,000,000) MovieExec]
step 3. Project (presC#) Studio subsetOf (2)
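The three steps can be sketched in Python as a check over two relation instances. Everything below (names, addresses, certificate numbers, amounts, and the helper presidents_are_rich) is made up for illustration.

```python
# Made-up tuples of MovieExec(name, address, cert#, netWorth).
movie_exec = [
    ("Exec One", "1 First St", 101, 25_000_000),
    ("Exec Two", "2 Second St", 102, 3_000_000),
]
# Made-up tuples of Studio(name, address, presC#).
studio = [("Studio A", "10 Main St", 101)]

def presidents_are_rich(studio_rows, exec_rows, minimum=10_000_000):
    # Steps 1 and 2: select execs with enough net worth, project onto cert#.
    rich_certs = {cert for _, _, cert, net_worth in exec_rows if net_worth >= minimum}
    # Step 3: the projection of Studio onto presC# must be contained in that set.
    pres_certs = {pres for _, _, pres in studio_rows}
    return pres_certs <= rich_certs

ok = presidents_are_rich(studio, movie_exec)
violated = presidents_are_rich(studio + [("Studio B", "20 Main St", 102)], movie_exec)
```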