
Relational Model
- Relations = tables
- Attributes = columns
- Tuples = rows
- Domain = type
- Schema = Relation + Attributes
No duplicate tuples
Tuple order doesn't matter
Attribute order doesn't matter
Types
- char(n) fixed length
- varchar(n) variable length
- integer 32 bit
- decimal(5,2) 999.99
- real, double 32 bit, 64 bit
- date 2002-01-15
- time 13:50:00
- timestamp 2002-01-15 13:50:00
Creating Table
CREATE TABLE <name> (
<attr> <type> [NOT NULL],
...,
PRIMARY KEY(<attr>, ...))
Loading Data
LOAD DATA INFILE <datafile> INTO TABLE <table>
Terms
Data model = graph/tree model
Schema = table design/structure
Instance = data
Database Construction Steps
1. domain analysis
2. database design: E/R model, database design theory
3. table creation: DDL
4. load
5. query and update: DML
Relational Algebra
SELECT: $\sigma_C(R)$ rows
PROJECT: $\pi_A(R)$ columns
CROSS PRODUCT: $R \times S$
NATURAL JOIN: $R \bowtie S$
Enforces equality on common attributes
THETA JOIN: $R \bowtie_C T = \sigma_C(R \times T)$
RENAME: $\rho_S(R)$ = Rename R to S
UNION: Same attributes
DIFFERENCE: Same attributes
INTERSECT: Same attributes
DIVISION: R / S. Find all A values in R such that the values appear with all B values in S
$R / S = \pi_A(R) - \pi_A((\pi_A(R) \times S) - R)$
Result: A has a1
SQL
General Statement
SELECT A1, ..., An
FROM R1, ..., Rm
WHERE C
SQL does not remove duplicates
DISTINCT
Removes duplicates in results
SELECT
SELECT *: all attributes
SELECT is projection
WHERE conditions
LIKE wildcards:
%: any length string
_: one character
INTERSECT, UNION, EXCEPT
Follow set semantics so duplicates removed
Add ALL to keep duplicates
IN, NOT IN (R)
Set comparisons
> ALL, < SOME
Comparators: <, <=, >, >=, =, <>
EXISTS
SELECT name FROM Student S WHERE
EXISTS (SELECT * FROM Enroll E
WHERE S.sid = E.sid AND dept = 'CS');
Often more efficient than IN
Subqueries in FROM
SELECT name FROM (SELECT name, age
FROM Student) S
WHERE age > 17;
Subquery must be named
Aggregates
SUM, AVG, COUNT, MIN, MAX, COUNT(*)
GROUP BY: SELECT age, AVG(GPA) FROM
Student GROUP BY age;
Find # classes each student taking: SELECT sid,
COUNT(*) FROM Enroll GROUP BY sid
If want to keep entries with 0 count, UNION
with (SELECT sid, 0 FROM Student WHERE
sid NOT IN (SELECT sid FROM Enroll));
NULL values ignored except COUNT(*)
HAVING
Find student taking 2 or more classes: SELECT
sid FROM Enroll GROUP BY sid HAVING
COUNT(*) >= 2;
Can be rewritten without HAVING
OUTER JOIN
Dangling tuples padded with NULL
R <LEFT/RIGHT/FULL> OUTER JOIN S ON
R.A = S.A
LEFT keeps R and pads S attributes
RIGHT keeps S and pads R attributes
FULL keeps both R and S
SELECT S.sid, COUNT(cnum) FROM Student S
LEFT OUTER JOIN Enroll E ON S.sid =
E.sid GROUP BY S.sid;
Three-valued logic
True, False, and Unknown
Comparisons with NULL are Unknown
SQL only returns True tuples
AND: U & T = U, U & F = F, U & U = U
OR: U | T = T, U | F = U, U | U = U
Data Modification
INSERT INTO R VALUES (v1, ...);
DELETE FROM R WHERE C;
UPDATE R SET A = <value> WHERE C;
Database Integrity
Data validity enforcement
Domain: GPA is real
Constraints: on violation, give error and abort the statement
Key Constraints
One primary key per table
Referential Integrity

Means referenced value always exists


Referencing attribute or foreign key
Referenced attribute
Foreign key can be NULL; when it is, no
constraint checking
Referenced attribute must be PRIMARY KEY or UNIQUE
Referenced attribute may be omitted if it has the same
name as the referencing attribute
RI violation by inserting into the referencing table is
not allowed; the system rejects the statement
ON DELETE/UPDATE [SET NULL,
CASCADE, SET DEFAULT]
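The referential-integrity rules above can be tried out with Python's built-in sqlite3 module. This is only a sketch under assumptions: the Student/Enroll columns and the sample rows are made up for illustration, and SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

def referential_integrity_demo():
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign-key checking off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE Enroll ("
        " sid INTEGER REFERENCES Student(sid) ON DELETE CASCADE,"
        " cnum TEXT)"
    )
    conn.execute("INSERT INTO Student VALUES (1, 'Ann')")
    conn.execute("INSERT INTO Enroll VALUES (1, 'CS143')")
    # A NULL foreign key is accepted without any check.
    conn.execute("INSERT INTO Enroll VALUES (NULL, 'CS111')")

    # Inserting a dangling reference into the referencing table is rejected.
    try:
        conn.execute("INSERT INTO Enroll VALUES (99, 'CS180')")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    # Deleting the referenced tuple cascades to the referencing tuples.
    conn.execute("DELETE FROM Student WHERE sid = 1")
    remaining = conn.execute(
        "SELECT sid, cnum FROM Enroll ORDER BY cnum"
    ).fetchall()
    conn.close()
    return rejected, remaining

if __name__ == "__main__":
    print(referential_integrity_demo())
```

Swapping `ON DELETE CASCADE` for `ON DELETE SET NULL` would keep the Enroll row and null out its sid instead.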
CHECK
Constraint checked when tuple is updated (or
inserted)
In SQL92, can be complex
SQL Assertion
Entire set of relations or database
CREATE ASSERTION <name> CHECK
(<condition>)
Trigger
Event-Condition-Action
CREATE TRIGGER <name> <event>
<referencing clause> // optional
WHEN (<condition>) // optional
<action>
<event>
[BEFORE/AFTER] [INSERT / DELETE /
(UPDATE OF A1, ..., An)] ON R
<action>
Multiple statements: BEGIN ... END
<referencing clause>
REFERENCING OLD/NEW TABLE/ROW AS
<var>, ...
FOR EACH ROW
FOR EACH STATEMENT (default)
CREATE TRIGGER T
AFTER INSERT ON Laptop
REFERENCING NEW ROW AS nrow
FOR EACH ROW
WHEN (nrow.weight > 5)
BEGIN
  UPDATE Laptop SET weight = NULL
  WHERE model = nrow.model;
END;
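A runnable approximation of the Laptop trigger, using Python's sqlite3 module. SQLite's dialect differs from the SQL-standard syntax in these notes: it has no REFERENCING clause and exposes the inserted row as `NEW`. The Laptop column types and sample rows are assumptions made for the sketch.

```python
import sqlite3

def laptop_trigger_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Laptop (model TEXT PRIMARY KEY, weight REAL)")
    # Event: AFTER INSERT; condition: WHEN; action: the BEGIN ... END body.
    conn.execute(
        """
        CREATE TRIGGER T
        AFTER INSERT ON Laptop
        FOR EACH ROW
        WHEN NEW.weight > 5
        BEGIN
            UPDATE Laptop SET weight = NULL WHERE model = NEW.model;
        END
        """
    )
    conn.execute("INSERT INTO Laptop VALUES ('light', 3.2)")
    conn.execute("INSERT INTO Laptop VALUES ('heavy', 7.5)")
    rows = conn.execute(
        "SELECT model, weight FROM Laptop ORDER BY model"
    ).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(laptop_trigger_demo())
```

Only the row that satisfies the WHEN condition has its weight reset to NULL; the lighter laptop is stored unchanged.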
Views
CREATE VIEW <name> AS <query>
Can be created on top of other views
Tuples created on the fly using real tables
Used to save complicated queries
Virtual database: views
Conceptual database: tables
Physical database: pages on disk
Modifying Views
Possible, but must make sense
SELECT on single table T without DISTINCT
Subqueries in WHERE must not refer to T
No aggregation
Attributes of T not projected in view allowed to
be NULL or default
CHECK OPTION
CREATE VIEW <name> AS <query> WITH CHECK OPTION
Check INSERT/UPDATE to ensure the new tuple is
still in the view
Reject the statement if it is not
Dropping Views

DROP VIEW <name> [CASCADE | RESTRICT]


For dropping a view in between 2 tables:
CASCADE: destroys above (default)
RESTRICT: drop statement fails if object is
referenced
Materialized Views
Precomputed views
Pros: More efficient to compute; faster
Cons: Takes up space; update costs
Authorization
Granting Privileges
GRANT <privileges> ON <R> TO <users>
[ WITH GRANT OPTION ];
<privileges>: SELECT, INSERT (separated by
commas, or ALL privileges)
<R>: table, view
<users>: list of users/groups, or PUBLIC
Managing Privileges
DBA (owns all)
WITH GRANT OPTION
- Users can grant the same or fewer privileges

Authorization Graph

Nodes: users
Edges: granted privileges
Revoking Privileges

REVOKE <privileges> ON <R> FROM <users>
[ CASCADE | RESTRICT ];
REVOKE GRANT OPTION FOR <privileges>
ON <table> FROM <users> [ CASCADE | RESTRICT ];
Restrict is default
Views and privileges
To create a view, user needs SELECT privilege
on all underlying base tables
To grant privilege on view, user has to have
GRANT OPTION on the base tables of view
When privileges for base tables are revoked, the
views on top are automatically dropped
Disks and Files
Terms: Boom, Head, Sector, Spindle, Track,
Platter, Cylinder
Each platter has: track, cylinder, sector (=block,
page)
Access Time
Seek time + rotational delay + transfer time
Seek Time
Time to move disk head between tracks
e.g. track to track ~ 1ms, average 10 ms, full ~
20 ms

Rotational Delay
Typical: 1000-15000 RPM
Q: For 6k RPM, avg rotational delay?
6000 rot/min * 1 min/60 sec = 100 rot/sec, so one rotation takes 10 ms
Best = 0, Worst = 10, Avg = 5 (ms)

Transfer Time

How long to read one block


e.g. 6K RPM, 10000 sect/track, 1KB/sect
10ms/rot * 1/10000 = 0.001 ms
Transfer Rate
1KB/15.001ms = 67 KB/s
Burst Transfer Rate
(RPM/60)*(sect/track)*(bytes/sect)
Sequential vs Random I/O
Read 3 seq blocks = 3*.001ms = .003 ms
3 rand. blocks = .001ms (1st) +
2*(10ms+5ms+.001ms) = 30.003 ms
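The disk arithmetic above can be checked with a few lines of Python. This is a sketch that plugs in the example's figures (6K RPM, 10000 sectors/track, 1 KB/sector, 10 ms average seek); the function and key names are made up for illustration. As in the notes, the first block of the random-read case is charged transfer time only.

```python
def rotation_ms(rpm):
    """Time for one full rotation, in milliseconds."""
    return 60_000 / rpm

def disk_example(rpm=6000, sectors_per_track=10_000, kb_per_sector=1, avg_seek_ms=10):
    rot = rotation_ms(rpm)                      # one rotation
    avg_rot_delay = rot / 2                     # on average, half a rotation
    transfer = rot / sectors_per_track          # time to read one sector (block)
    random_block = avg_seek_ms + avg_rot_delay + transfer
    return {
        "rotation_ms": rot,
        "avg_rotational_delay_ms": avg_rot_delay,
        "transfer_ms": transfer,
        "random_block_ms": random_block,
        # KB delivered per second when every block costs a full random access
        "transfer_rate_kb_s": kb_per_sector / (random_block / 1000),
        # (RPM/60) * (sect/track) * (KB/sect)
        "burst_rate_kb_s": (rpm / 60) * sectors_per_track * kb_per_sector,
        "three_sequential_ms": 3 * transfer,
        "three_random_ms": transfer + 2 * random_block,
    }

if __name__ == "__main__":
    for name, value in disk_example().items():
        print(f"{name}: {value:.3f}")
```

The gap between the two three-block figures is the reason a DBMS tries to lay related blocks out sequentially.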
RAID
Redundant Array of Independent Disks

Potentially high throughput


Potential reliability issues
RAID 0: striping only (no redundancy)
RAID 1: mirroring
RAID 5: striping + parity block
Data Modification
Byte level not allowed; only by blocks
Abstraction by OS
Sequential blocks: don't need to worry about
head, cylinder, sector
Access to non-adj blocks = rand. IO
Access to adjacent blocks = seq I/O
Buffers, Buffer pool
Cache for disk blocks to avoid future read and
hide disk latency
Spanned vs Unspanned
Unspanned: extra space wasted
Guarantees full tuple inside one block
Worst case wastes about 50% of each block
Spanned: a tuple may continue in a later block
Variable-Length Tuples
Reserved Space
Reserve max space for each tuple
Problem: waste of space
Variable Length Space
Pack tuples tightly
End of a record by EOR (mark at end)
Or read length at the beginning
Slotted Page
Array of pointers to point to tuples
Long Tuples
Spanning or splitting tuple
Sequential File: tuples ordered by a certain
attribute
Sequencing Tuples: 2 options
Rearrange or linked list
Overflow page if insert when full
o PCTFREE in DBMS
Index
Primary (Clustering) Index: index on search key
Search key is not necessarily the primary key

Dense
(key, pointer) pair for every record
Why Use Dense Index?
100 mil records (900B/rec), 4B search key, 4B
ptr, 4KB block, unspanned
For table:
4096/900 = 4 records/blk
100 mil tuples/(4 records/blk) = 25 mil blocks
25 mil blocks * 4KB/blocks = 100 GB
For index:
8 bytes/entry, 4096/8 = 512 entr/block
100M/512 = 195313 blks
195313*4KB = 781 MB
Can store index into RAM for no disk IO
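The sizing argument above, reproduced as a small Python sketch. The function and parameter names are invented for illustration, and sizes use decimal units (1 MB = 1000 KB) to match the figures in the notes.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def dense_index_sizing(records=100_000_000, record_bytes=900,
                       key_bytes=4, ptr_bytes=4, block_bytes=4096):
    block_kb = block_bytes // 1024
    # Unspanned: a record never crosses a block boundary, so round down.
    records_per_block = block_bytes // record_bytes
    table_blocks = ceil_div(records, records_per_block)
    # Dense index: one (key, pointer) entry per record.
    entries_per_block = block_bytes // (key_bytes + ptr_bytes)
    index_blocks = ceil_div(records, entries_per_block)
    return {
        "records_per_block": records_per_block,
        "table_blocks": table_blocks,
        "table_kb": table_blocks * block_kb,
        "entries_per_block": entries_per_block,
        "index_blocks": index_blocks,
        "index_kb": index_blocks * block_kb,
    }

if __name__ == "__main__":
    s = dense_index_sizing()
    print(f"table: {s['table_blocks']:,} blocks, about {s['table_kb'] / 1e6:.0f} GB")
    print(f"index: {s['index_blocks']:,} blocks, about {s['index_kb'] / 1e3:.0f} MB")
```

The index is over a hundred times smaller than the table, which is why it can plausibly be kept in RAM.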
Sparse, Primary Index
(key, pointer) pair for every block
Points to first record in block
Multi-level Index
Sparse (2nd lvl) -> 1st lvl -> sequential
Secondary (non-clustering) Index
Tuples in table not ordered by index search key
1st level always dense, sparse from 2nd level
Insertion
Overflow (a new bucket)
Redistribute

Traditional Index
Pros: simple, sequential blocks
Cons: bad for updates, ugly over time
B+ Tree
Pros: suitable for updates, balanced, min space
usage guarantee
Cons: non-sequential index blocks
Leaf Node
Left of # pts to tuples, right pts to next leaf
Non-leaf Node
Pointer left of # points to the lower-level node with smaller keys
(right side holds keys greater than or equal to #);
at least half of the pointers are used

Insertion
Simple: Insert as next
Leaf overflow: split the leaf, add the new key, then copy
the middle key up to the parent
Non-leaf overflow: split the non-leaf node and move the
middle key up to its parent (repeat up to the root
if needed)
New root: if the root splits, the middle key becomes the
new root, parent of the two split nodes
Number of Ptrs/Key for B+ Tree

Common Queries
Check Constraint
Q: Check that CS class has to be >3 units
A: CHECK(dept <> 'CS' OR unit > 3)
Cross product itself
Q: Sensor (date, time, temp, humidity) stores
temp every few hours. Get highest
temperature of each day
A: SELECT date, MAX(temp) FROM Sensor
GROUP BY date;

$\pi_{date,temp}(Sensor) - \pi_{R1.date,R1.temp}(\sigma_{R1.date = R2.date \wedge R1.temp < R2.temp}(\rho_{R1}(Sensor) \times \rho_{R2}(Sensor)))$
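Both formulations of the daily-maximum query (GROUP BY, and "all readings minus those beaten by another reading on the same day") can be compared in Python's sqlite3 module. This is a sketch with invented sample rows; the temp column is spelled `temperature` here only to steer clear of SQLite's TEMP keyword.

```python
import sqlite3

def daily_max_two_ways():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Sensor (date TEXT, time TEXT, temperature REAL, humidity REAL)"
    )
    conn.executemany("INSERT INTO Sensor VALUES (?, ?, ?, ?)", [
        ("2002-01-15", "06:00:00", 11.0, 0.60),
        ("2002-01-15", "12:00:00", 19.5, 0.45),
        ("2002-01-15", "18:00:00", 15.0, 0.50),
        ("2002-01-16", "06:00:00", 9.0, 0.70),
        ("2002-01-16", "12:00:00", 17.0, 0.55),
    ])
    # Aggregate formulation.
    grouped = conn.execute(
        "SELECT date, MAX(temperature) FROM Sensor GROUP BY date ORDER BY date"
    ).fetchall()
    # Cross product of Sensor with itself: drop every reading that is
    # lower than some other reading taken on the same date.
    self_join = conn.execute(
        """
        SELECT date, temperature FROM Sensor
        EXCEPT
        SELECT R1.date, R1.temperature FROM Sensor R1, Sensor R2
        WHERE R1.date = R2.date AND R1.temperature < R2.temperature
        ORDER BY date
        """
    ).fetchall()
    conn.close()
    return grouped, self_join

if __name__ == "__main__":
    print(daily_max_two_ways())
```

The self-join version is what the relational algebra expression computes, since basic relational algebra has no aggregate operator.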
Complicated Queries
Q: Find names of such companies that all
employees have salaries > $100000
A: SELECT company-name FROM Company C
WHERE 100000 < ALL (SELECT salary
FROM Work W WHERE C.company-name
= W.company-name);

$\pi_{companyname}(Company) - \pi_{companyname}(\sigma_{salary \le 100000}(Work))$
Q: Find names of employees whose total salary
is higher than those of all employees living
in Los Angeles
A: SELECT person-name FROM Work
GROUP BY person-name
HAVING SUM(salary) > ALL
(SELECT SUM(salary) FROM Work, Employee
WHERE Work.person-name = Employee.person-name
AND city = 'Los Angeles'
GROUP BY Work.person-name)

Q: Find names of managers whose total salary is
greater than at least one employee managed
A: SELECT manager-name FROM Manage M,
(SELECT person-name, SUM(salary) total-salary
FROM Work GROUP BY person-name) S1
WHERE M.manager-name = S1.person-name
AND S1.total-salary > SOME
(SELECT total-salary FROM (SELECT
person-name, SUM(salary) total-salary
FROM Work GROUP BY person-name) S2
WHERE S2.person-name = M.person-name)
Join
|R| = 1,000 tuples; |S| = 10,000 tuples
10 tuples/block
bR = 100 blocks; bS = 1,000 blocks
Memory buffer M = 22 blocks
Nested Loop
Naive: $b_R + |R| \cdot b_S$
100 + 1000 * 1000 = 1,000,100 disk I/Os
Block-nested: $b_R + b_R \cdot b_S$
100 + 100 * 1000 = 100,100 disk I/Os
Block-nested and read maximum blocks of R: $b_R + \lceil b_R/(M-2) \rceil \cdot b_S$
100 + 5 * 1000 = 5,100 disk I/Os
Sort-Merge
Sort Stage:
Sort both tables R and S
$2 b_R (\lceil \log_{M-1}(b_R/M) \rceil + 1) + 2 b_S (\lceil \log_{M-1}(b_S/M) \rceil + 1)$
Uses 400 + 6000 = 6,400 disk I/Os
Merge Stage:
Sequentially read R and S blocks one at a time
$b_R + b_S$
Uses 1,100 disk I/Os
Total = 7,500 disk I/Os
Hash
Hashing Stage:
Read each table and hash into buckets
Bucket size: $b_R/(M-1)$
$2 (b_R + b_S) \lceil \log_{M-1}(b_R/(M-2)) \rceil$
Uses 2,200 disk I/Os
Join Stage:
Sequentially read R and S blocks one at a time
Uses one block for output buffer
$b_R + b_S$
Uses 1,100 disk I/Os
Total = 3,300 disk I/Os
Index
Sequentially read R blocks one at a time and probe the index on S
$b_R + |R| (C + J)$
C = average index lookup cost
o Probability
o e.g. Leaf nodes not in memory divided by
total leaf nodes
J = matching tuples in S for every R tuple
Uses 1,115 to 10,580 disk I/Os
Summary of join algorithms
Nested-loop ok for small relations
Hash join usually best for equi-join
o If relations not sorted and no index
Merge join for sorted relations
o Sort merge good for non-equi-join
Consider index join if index exists
To pick the best, DBMS maintains statistics on data
Statistics Collection
DB2
o RUNSTATS ON TABLE <userid>.<table>
AND INDEXES ALL
Oracle
o ANALYZE TABLE <table> COMPUTE
STATISTICS
o ANALYZE TABLE <table> ESTIMATE
STATISTICS (cheaper)
Irrelevant in MySQL; only rule-based
optimization
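The join cost formulas can be checked against the running example (|R| = 1,000, bR = 100, bS = 1,000, M = 22) with a short Python sketch. The function names are made up, and the ceilings on the logarithms are an assumption; with them the formulas reproduce every I/O count quoted in the notes. Index join is left out because its C and J values are not given.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def passes(runs, fan_in):
    """Smallest k with fan_in**k >= runs, i.e. ceil(log_fan_in(runs))."""
    k, capacity = 0, 1
    while capacity < runs:
        capacity *= fan_in
        k += 1
    return k

def join_costs(tuples_r, b_r, b_s, m):
    read_both = b_r + b_s
    # Sort stage: 2b * (ceil(log_{M-1}(b/M)) + 1) for each relation.
    sort_stage = sum(
        2 * b * (passes(ceil_div(b, m), m - 1) + 1) for b in (b_r, b_s)
    )
    # Hashing stage: 2(bR + bS) * ceil(log_{M-1}(bR/(M-2))).
    hashing_stage = 2 * read_both * passes(ceil_div(b_r, m - 2), m - 1)
    return {
        "naive_nested_loop": b_r + tuples_r * b_s,
        "block_nested_loop": b_r + b_r * b_s,
        "block_nested_max_r": b_r + ceil_div(b_r, m - 2) * b_s,
        "sort_merge": sort_stage + read_both,
        "hash": hashing_stage + read_both,
    }

if __name__ == "__main__":
    for name, cost in join_costs(1_000, 100, 1_000, 22).items():
        print(f"{name}: {cost:,} disk I/Os")
```

Rerunning with a smaller M shows how quickly the buffer-dependent algorithms degrade, while the naive nested loop stays the same.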

E/R Model
Entity: thing or object
Rectangle
Attribute: property of entities
Ellipse
Key: set of attributes to uniquely identify entity

All entity sets need key


Underline

Relationship: connection between entities


Can have attributes
Diamond
Roles: labels on edges in E/R
More than one relationship
e.g. Coder and tester for partner
Subclasses: ISA relationship
Generalization: Subclass -> Superclass
Specialization: Superclass -> Subclass

Subclass inherits all superclass attr


TOTAL SPECIALIZATION: double lines =
entity always one of the subclasses
Weak Entity Set: Entity set without unique keys
Double rectangle/diamond
Discriminator: set of attributes in WES that
are part of key = dashed underline
Owner Entity Set: entity set providing part
of key
Identifying Relationship: relationship
between WES and OES
Always double edge between weak entity
and identifying relationship
Cardinality
No Arrow = many
Arrow = one
ONE-TO-ONE: Each entity in E1 is related to at
most one entity in E2 and vice versa
MANY-TO-ONE: Each entity in E1 is related to
at most one entity in E2 (converse = ONE-TO-MANY)
MANY-TO-MANY: Each entity in E1 may be
related to 0 or more entities in E2 and vice
versa
TOTAL PARTICIPATION: an entity participates
in the relationship at least once
E/R to Relation
STRONG ENTITY SET: one table with all
attributes
RELATIONSHIP SET: one table with keys from
the linked ES and its own attributes
o Key:
- One to one: key of either side
- Many to one: key of the many side
- Many to many: keys of both sides


o Rename attributes if conflict
WEAK ENTITY SET: one table with its own
attributes and keys from owner ES
SUBCLASS: three approaches
o e.g. Student, ForeignStudent, HonorStudent
o Preferred: One table for each subclass with
all its attributes plus key from its superclass

i.e. Student, FStudent, HStudent


o One big relation with all attributes with null
values for missing attributes

i.e. Student
o One table for every subtree (including the
root) with all its attributes plus all
inherited attributes

i.e. Student, FStudent, HStudent, FHStudent
Normalization Theory
Motivation for normalization:
Update anomaly: only some info updated
Insertion anomaly: some information cannot be
represented, have to use many NULLs
Deletion anomaly: deleting too much
information in a single tuple
Functional dependency: $X \to Y$ means no two tuples
can have the same X but different Y
Closure X+: set of all attributes logically implied
by X. If X+ is all attributes of the table, X is a (super)key.
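The closure X+ can be computed by repeatedly applying the FDs until nothing new is added. A minimal Python sketch follows; the relation R(A, B, C, D) and its FDs in the usage lines are invented for illustration.

```python
def closure(attrs, fds):
    """Return X+ for the attribute set attrs under fds.

    fds is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If X+ already covers the left side, the right side follows.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, all_attrs, fds):
    """X is a (super)key when X+ contains every attribute of the table."""
    return closure(attrs, fds) == set(all_attrs)

if __name__ == "__main__":
    # Hypothetical relation R(A, B, C, D) with A -> B and B -> C.
    fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
    print(sorted(closure({"A"}, fds)))
    print(is_superkey({"A"}, "ABCD", fds), is_superkey({"A", "D"}, "ABCD", fds))
```

Here A alone is not a key because D is never derived, while {A, D} is.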
BCNF
Eliminate redundancy using FDs

FD $X \to Y$ will lead to redundancy if X does not
contain a key of the table
Table R is in BCNF iff for every non-trivial
$X \to Y$, X contains a key
If X doesn't, split table into R(X+) and R1(X, Z), where Z is the attributes not in X+
MVD
Definition:
$X \twoheadrightarrow Y$: if u[X] = v[X], then a tuple w exists such that:
1. w[X] = u[X] = v[X]
2. w[Y] = u[Y]
3. w[Z] = v[Z] where Z is all
attributes except X and Y
Complementation rule: if $X \twoheadrightarrow Y$, then $X \twoheadrightarrow Z$
4NF
For every nontrivial FD $X \to Y$ or MVD $X \twoheadrightarrow Y$,
X contains a key
After getting into BCNF:
If non-trivial $X \twoheadrightarrow Y$ and X does not contain key:
decompose R into R(X,Y) R1(X,Z)
repeat
Transactions
ACID
Atomicity: All or nothing operation, either all
finished or not at all
Consistency: If database was in a consistent
state, then it remains in a consistent state
Isolation: End result is the same as when
transactions are run in isolation
Durability: Results from committed transactions
are never lost

More SQL Examples


Temps(sensorID, time, temp)
Q: Test whether $time \to temp$ holds (temperature
readings must be the same for each time)
A: SELECT time FROM Temps GROUP BY
time HAVING COUNT(DISTINCT temp) > 1
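The FD test query can be exercised with Python's sqlite3 module. This sketch uses invented readings, and the temp column is spelled `temperature` only to steer clear of SQLite's TEMP keyword. An empty result means the dependency holds on the current instance.

```python
import sqlite3

def fd_violations(rows):
    """Return the times at which time -> temperature is violated."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Temps (sensorID INTEGER, time TEXT, temperature REAL)")
    conn.executemany("INSERT INTO Temps VALUES (?, ?, ?)", rows)
    violating = conn.execute(
        """
        SELECT time FROM Temps
        GROUP BY time
        HAVING COUNT(DISTINCT temperature) > 1
        ORDER BY time
        """
    ).fetchall()
    conn.close()
    return [t for (t,) in violating]

if __name__ == "__main__":
    readings = [
        (1, "13:50:00", 20.0),
        (2, "13:50:00", 20.0),   # same time, same temperature: fine
        (1, "14:00:00", 21.0),
        (2, "14:00:00", 22.5),   # same time, different temperature: violation
    ]
    print(fd_violations(readings))
```

The same GROUP BY ... HAVING COUNT(DISTINCT ...) pattern tests any FD by grouping on the left-hand side and counting the right-hand side.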
