Traditional ?
Structured data (relational), centralized, disk-based, OLTP workloads, ACID
COMP7104 - DASC7104 2
What is a database
A database is a collection of data, typically
related and describing activities of an
organization in terms of entities (real-world
objects) and relationships between them.
What is an RDBMS?
A big piece of software that allows us to efficiently
manage a large database and allows it to
persist over long periods of time
• Quick / robust / safe / simple access
RDBMS
Example: IMDB
• The Internet Movie Database
• http://www.imdb.com
Tables (or relations)
[Figure: example tables, including Actor and Casting]
Queries – SQL (declarative!)
SELECT *
FROM Movie

SELECT count(*)
FROM Actor

SELECT count(*)
FROM Movie
WHERE Title LIKE 'Star %'

SELECT *
FROM Movie
LIMIT 50

SELECT *
FROM Actor
WHERE FirstName = 'Clint'

SELECT *
FROM Movie M, Actor A, Casting C
WHERE A.FirstName = 'Clint'
and M.MovieID = C.MovieID
and A.ActorID = C.ActorID
and M.Year = 1995
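The last query can be tried end to end with a minimal sketch like the one below. It uses Python's built-in `sqlite3` module purely for convenience. The column lists are an assumption pieced together from the names the queries mention (Title, Year, FirstName, the ID columns) plus LastName, and the handful of rows is toy data, not the real IMDB content.

```python
import sqlite3

# In-memory database with an assumed, simplified IMDB-like schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie   (MovieID INTEGER PRIMARY KEY, Title TEXT, Year INTEGER);
    CREATE TABLE Actor   (ActorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Casting (MovieID INTEGER, ActorID INTEGER);

    -- Toy rows for illustration only.
    INSERT INTO Movie VALUES (1, 'The Bridges of Madison County', 1995),
                             (2, 'Unforgiven', 1992),
                             (3, 'Star Wars', 1977);
    INSERT INTO Actor VALUES (10, 'Clint', 'Eastwood'),
                             (11, 'Mark', 'Hamill');
    INSERT INTO Casting VALUES (1, 10), (2, 10), (3, 11);
""")

# Declarative: we say WHAT we want, not HOW the three tables are combined.
rows = conn.execute("""
    SELECT M.Title, A.FirstName, A.LastName
    FROM Movie M, Actor A, Casting C
    WHERE A.FirstName = 'Clint'
      AND M.MovieID = C.MovieID
      AND A.ActorID = C.ActorID
      AND M.Year = 1995
""").fetchall()

star_count = conn.execute(
    "SELECT count(*) FROM Movie WHERE Title LIKE 'Star %'"
).fetchone()[0]

print(rows)
print(star_count)
```

Nothing in the query text says which table is read first or which join method is used; that choice is left to the system.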
How can we evaluate the query?
Classical query execution
• Index-based selection
• Hash-join
• Merge-join
• Index-join
Classical query optimizations
• Pushing selections down
• Join reorder
Classical statistics
• Table cardinalities
• # distinct values
• Histograms
[Figure: two alternative query plans shown side by side ("vs."), each combining two joins (⨝) with the selections σFirstName='Clint' and σYear=1995]
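To make "pushing selections down" concrete, here is a small sketch that evaluates the same query in two ways over in-memory lists and counts the tuples produced by the joins. The relations, their contents, and the function names are all hypothetical, and the joins are naive nested loops rather than any of the join algorithms listed above.

```python
# Toy relations (hypothetical data).
actors = [(10, "Clint"), (11, "Mark"), (12, "Meryl")]      # (ActorID, FirstName)
movies = [(1, 1995), (2, 1992), (3, 1977), (4, 1995)]      # (MovieID, Year)
casting = [(1, 10), (1, 12), (2, 10), (3, 11), (4, 11)]    # (MovieID, ActorID)


def plan_selections_last():
    """Join everything first, apply both selections at the very end."""
    ac = [(a, c) for a in actors for c in casting if a[0] == c[1]]
    acm = [(a, c, m) for (a, c) in ac for m in movies if c[0] == m[0]]
    result = [t for t in acm if t[0][1] == "Clint" and t[2][1] == 1995]
    return result, len(ac) + len(acm)


def plan_selections_first():
    """Apply each selection on its base relation, then join."""
    clint = [a for a in actors if a[1] == "Clint"]
    in_1995 = [m for m in movies if m[1] == 1995]
    ac = [(a, c) for a in clint for c in casting if a[0] == c[1]]
    acm = [(a, c, m) for (a, c) in ac for m in in_1995 if c[0] == m[0]]
    return acm, len(ac) + len(acm)


result_last, join_tuples_last = plan_selections_last()
result_first, join_tuples_first = plan_selections_first()

print(result_last == result_first)          # same answer either way
print(join_tuples_last, join_tuples_first)  # tuples produced by the joins
```

Both plans return the same answer, but on this toy data the joins produce 10 tuples when the selections come last and 3 when they are pushed down.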
Query optimization
• Rule-based query optimization, heuristics
• Dynamic programming cost-based query
optimization
Role of an RDBMS
DML
1. Creation and storage of data
2. Queries and updates
DDL
3. Change / evolution of structure
4. Concurrency control (multiple users): transactions
5. Recovery: ACID
6. Data integrity and security: Grant, Revoke, roles
7. Reliability (99.99999%)
Data independence
Physical schema
• Storage as files, row vs. column store, indexes
Data independence
Application programs are insulated from changes
in the way the data is structured and stored
• Key property of a DBMS
– Logical independence
– Physical independence
Logical data independence
Users are isolated from changes in the logical structure of the data
AmericanActor(ActorID:String, FirstName:String, LastName:String)
• A “view” Actor can still be obtained from the new relations above
by merging them
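One way to picture this is the sketch below, written with Python's `sqlite3` module. Only AmericanActor and its three attributes come from the slide; the second relation, here called OtherActor, is a hypothetical stand-in for whatever other relations Actor was split into, and the rows are toy data. The view is built with UNION ALL, which is one possible reading of "merging".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New logical structure: Actor has been split into several relations.
    CREATE TABLE AmericanActor (ActorID TEXT, FirstName TEXT, LastName TEXT);
    CREATE TABLE OtherActor    (ActorID TEXT, FirstName TEXT, LastName TEXT);  -- hypothetical name

    INSERT INTO AmericanActor VALUES ('a1', 'Clint', 'Eastwood');
    INSERT INTO OtherActor    VALUES ('a2', 'Michelle', 'Yeoh');

    -- The old relation survives as a view that merges the new ones.
    CREATE VIEW Actor AS
        SELECT ActorID, FirstName, LastName FROM AmericanActor
        UNION ALL
        SELECT ActorID, FirstName, LastName FROM OtherActor;
""")

# An application query written against the old schema still runs unchanged.
clints = conn.execute(
    "SELECT LastName FROM Actor WHERE FirstName = 'Clint'"
).fetchall()
actor_count = conn.execute("SELECT count(*) FROM Actor").fetchone()[0]

print(clints, actor_count)
```

The application keeps querying Actor and does not need to know that the underlying tables changed.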
(Actor,Movie*)*
(Movie,Actor*)*
Physical data independence (3)
Data model: relational model
A data model is a collection of high-level data description
constructs (for structure, operations, constraints) that
hide many low-level storage details.
In this course:
– structured data: all elements have a fixed format
(relational, nested relational, semi-structured)
– relational model: tables
Other important data models
• Semi-structured data model (XML, JSON)
– some structure but not fixed
– hierarchically nested tagged elements in a tree structure
• Nested relational model: nested tables
• Graph model
• Unstructured data: text, image, audio, video
Applications on DBMS
Any compute service that maintains state today is an
application on top of some kind of DBMS
– Uber
– Cathay Pacific
– Amazon
– HSBC
– SCMP
Applications want something from the DBMS
[Figure: several applications, each sending SQL to the Database Management System]
Concurrency control and recovery
• Concurrency control (RDBMS’s Transaction Manager)
– Correct and fast data access in the presence of concurrent
work by many users
– Disorderly processing that provides the illusion of order!
Why run multiple transactions concurrently?
Example
UPDATE Budget
SET balance = balance + 200
WHERE uid = 2
UPDATE Budget
SET balance = balance + 300
WHERE uid = 3
We would like to treat each group of instructions as an atomic unit!
What is a transaction?
• Sequence of many actions considered to be one atomic
unit of work (one logical unit)
• Usage:
1. Begin transaction
2. Set of SQL statements
3. End transaction
• Examples:
– Transfer balance between accounts
– Book a flight, a hotel, and a car together on Expedia
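The begin / statements / end pattern and the balance-transfer example can be sketched as follows with Python's `sqlite3` module. The Budget table with uid and balance follows the earlier example; the CHECK constraint, the starting balances, and the `transfer` helper are assumptions added here so that a failure can be provoked. The `with conn:` block commits if its body succeeds and rolls back if an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Budget (uid INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"   # assumed rule: no negative balances
)
conn.execute("INSERT INTO Budget VALUES (2, 100), (3, 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit of work."""
    try:
        with conn:  # end of block = commit; exception = rollback
            conn.execute(
                "UPDATE Budget SET balance = balance + ? WHERE uid = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Budget SET balance = balance - ? WHERE uid = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT uid, balance FROM Budget ORDER BY uid"))


first = transfer(conn, 2, 3, 60)    # both updates succeed and are committed
second = transfer(conn, 2, 3, 60)   # the debit violates the CHECK constraint
final = balances(conn)

print(first, second, final)
```

In the second call the credit to uid 3 has already been applied when the debit fails, yet the rollback undoes it, so no money appears from nowhere: either both updates happen or neither does.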
Transaction model in RDBMS
• Transaction
WHAT WE’LL LEARN ABOUT RDBMS
RDBMS anatomy
We will unpack a database system and explore its modular design.
SQL Client
Database Management System:
• Query Parsing & Optimization
• Relational Operators
• Files and Index Management
• Buffer Management
• Disk Space Management
[Figure: a slotted page with a page header and a sample record ("Bob Harmon"); disk pages 1 to 6; buffer frames holding some of those pages]
We’ll learn how to index stored data
HW2: Indexing
We’ll learn about query execution
• Simple closed set of operators
– σ (selection)
– Π (projection)
– ρ (renaming)
– ⋈ (join)
• Combined together via iterators into a data flow
– Iterator
– Materialization
– Vectorization
[Figure: query plan in which B+Tree IndexedScans on Actor, Casting and Movie feed two Indexed Nested-Loop Joins (⨝), with on-the-fly Select Operators σFirstName='Clint' and σYear=1995; the operators are connected by iterators]
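A rough sketch of the iterator style, using Python generators: each operator pulls rows from its child one at a time, so the operators compose into a data flow without materializing intermediate tables. The operator functions, the dictionary-based rows, and the toy data are all hypothetical, and the join is a plain nested-loop join, not the indexed nested-loop join over B+Trees from the plan in the figure.

```python
def scan(table):
    """Leaf operator: yield the rows of a stored table one by one."""
    for row in table:
        yield row


def select(child, predicate):
    """σ: pass through only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    """Π: keep only the named columns of each row."""
    for row in child:
        yield {c: row[c] for c in columns}


def nested_loop_join(outer, inner_table, key):
    """⋈: for each outer row, loop over the inner table and match on `key`."""
    for o in outer:
        for i in inner_table:
            if o[key] == i[key]:
                yield {**o, **i}


# Toy data (hypothetical).
actor = [{"ActorID": 10, "FirstName": "Clint"},
         {"ActorID": 11, "FirstName": "Mark"}]
casting = [{"MovieID": 1, "ActorID": 10},
           {"MovieID": 2, "ActorID": 10},
           {"MovieID": 3, "ActorID": 11}]
movie = [{"MovieID": 1, "Title": "The Bridges of Madison County", "Year": 1995},
         {"MovieID": 2, "Title": "Unforgiven", "Year": 1992},
         {"MovieID": 3, "Title": "Star Wars", "Year": 1977}]

# Operators composed into one pipeline; nothing runs until rows are pulled.
plan = project(
    select(
        nested_loop_join(
            nested_loop_join(
                select(scan(actor), lambda r: r["FirstName"] == "Clint"),
                casting, "ActorID"),
            movie, "MovieID"),
        lambda r: r["Year"] == 1995),
    ["Title"])

titles = list(plan)
print(titles)
```

Because every operator takes rows in and yields rows out, any operator can sit on top of any other, which is what "closed set of operators" buys us.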
We’ll bridge the WHAT with the HOW
• Query optimization!
• Three stages
– Plan space
– Cost estimation
– Search algorithm
PA1: Query optimisation (Oracle)
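The three stages can be illustrated with a deliberately tiny optimizer sketch. Everything numeric here is made up for illustration: the cardinalities (assumed to be already reduced by the selections), the join selectivities, and the cost formula (the sum of estimated intermediate result sizes). The search is a brute-force enumeration of left-deep join orders, not the dynamic programming approach mentioned earlier.

```python
from itertools import permutations

# Hypothetical estimated cardinalities, after applying the selections.
CARD = {"Actor": 50, "Casting": 100_000, "Movie": 2_000}

# Hypothetical join selectivities; pairs without an entry have no join
# predicate, so joining them is a cross product (selectivity 1.0).
JOIN_SEL = {
    frozenset({"Actor", "Casting"}): 1 / 20_000,
    frozenset({"Casting", "Movie"}): 1 / 30_000,
}


def estimate_cost(order):
    """Cost estimation: sum of estimated intermediate result sizes."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        # Apply every join predicate linking rel to an already-joined relation.
        for other in joined:
            size *= JOIN_SEL.get(frozenset({rel, other}), 1.0)
        joined.append(rel)
        cost += size
    return cost


plans = list(permutations(CARD))        # plan space: all left-deep join orders
best = min(plans, key=estimate_cost)    # search algorithm: exhaustive

for p in plans:
    print(p, round(estimate_cost(p), 1))
print("chosen:", best)
```

With these made-up numbers, the orders that join Actor with Casting first are far cheaper than the one that starts with the Actor × Movie cross product, which is the kind of difference a cost-based optimizer is meant to detect.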
We’ll reason about transaction
ordering and concurrency control
• Correct (ideal): serially-ordered
• Desire: interleave to maximize performance
• Risk: disorder may lead to data anomalies
• Allowable orders: (conflict) serializability
• Implementation: (Strict) 2PL
PA2: Transactions (Oracle)
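As a small illustration of "allowable orders", the sketch below tests a schedule for conflict serializability with the standard precedence-graph check: add an edge Ti → Tj whenever an operation of Ti comes before a conflicting operation of Tj (same item, different transactions, at least one write), then check that the graph has no cycle. The schedule encoding as (transaction, operation, item) triples and the function names are assumptions made for this example.

```python
def precedence_edges(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations, Ti's first."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges


def is_conflict_serializable(schedule):
    """True if and only if the precedence graph is acyclic."""
    edges = precedence_edges(schedule)
    remaining = {t for t, _, _ in schedule}
    while remaining:
        # Transactions with no incoming edge from a remaining transaction.
        free = {n for n in remaining
                if not any(dst == n and src in remaining for src, dst in edges)}
        if not free:
            return False  # every remaining node has a predecessor: a cycle
        remaining -= free
    return True


# Interleaved, but equivalent to running T1 and then T2.
good = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A"),
        ("T1", "R", "B"), ("T1", "W", "B"), ("T2", "R", "B"), ("T2", "W", "B")]

# T2 writes A between T1's read and write of A: T1 -> T2 and T2 -> T1.
bad = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

good_ok = is_conflict_serializable(good)
bad_ok = is_conflict_serializable(bad)
print(good_ok, bad_ok)
```

This only checks a given schedule after the fact; a protocol such as (Strict) 2PL is what keeps a running system from producing the bad interleavings in the first place.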
We’ll learn about recovery
[Figure: DB and RAM]
We’ll relay some key messages
• Query optimization is good (and omnipresent): most of
today’s popular systems have a query optimizer of
some kind
• Declarative languages are good: SQL in DBMS, SQL
on Big Data (more on that later)
• Schema is good
• Secondary indexes are good
But we’ll also chart and explore the
limitations of RDBMS
Relational databases have been around for decades
and they are very well designed for…
• … structured data
• … frequent, concurrent reads and writes, integrity (OLTP)
What’s changed since RDBMS inception?
• Dropping cost of disks (more on that soon)
– Cheaper to store everything than to figure out what
we really need!
• Types of data collected
– From data that’s obviously valuable to data whose
value is less apparent
• Rise of social media and user-generated content
– Large increase in data volume, need for data analytics
• Growing maturity of data mining techniques
– Demonstrates value of data analytics