
Overview of traditional RDBMS

Traditional ?
Structured data (relational), centralized, disk-based, OLTP workloads, ACID

COMP7104 - DASC7104 2
What is a database
A database is a collection of data, typically
related and describing activities of an
organization in terms of entities (real-world
objects) and relationships between them.

What is an RDBMS
A big piece of software that allows us to manage
efficiently a large database and allows it to
persist over long periods of time
• Quick / robust / safe / simple access

Examples: DB2 (IBM), SQL Server (MS), Oracle, Postgres, MySQL, etc.

RDBMS

Example: IMDB
• The Internet Movie Database
• http://www.imdb.com

Entities: Actors (1.5M), Movies (1.8M), Directors

Relationships: who played where, who directed what, …

Tables (or relations)

Actor
ActorID  FirstName  LastName
45601    Clint      Eastwood
...

Casting
ActorID  MovieID
45601    10230
...

Movie
MovieID  Title      Year
10230    Star Wars  1977
...

Queries – SQL (declarative!)
SELECT *
FROM Movie

SELECT count(*)
FROM Actor

SELECT count(*)
FROM Movie
WHERE Title LIKE 'Star %'

SELECT *
FROM Movie
LIMIT 50

SELECT *
FROM Actor
WHERE FirstName = 'Clint'

SELECT MovieID, count(*)
FROM Casting
GROUP BY MovieID

SELECT min(Year)
FROM Movie

Queries – SQL

SELECT *
FROM Movie M, Actor A, Casting C
WHERE A.FirstName = 'Clint'
and M.MovieID = C.MovieID
and A.ActorID=C.ActorID
and M.Year=1995

This query has selections and joins
• 1.8M actors, 11M castings, 1.5M movies
• How do we make it fast?
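To make the query concrete, here is a minimal sketch that runs it with Python's built-in `sqlite3` module on a toy database. The three rows per table and the titles "Movie A" and "Movie B" are made-up placeholders, not IMDB data, and SQLite simply stands in for any RDBMS.

```python
import sqlite3

# Toy in-memory database; all rows are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actor   (ActorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Movie   (MovieID INTEGER PRIMARY KEY, Title TEXT, Year INTEGER);
    CREATE TABLE Casting (ActorID INTEGER, MovieID INTEGER);
    INSERT INTO Actor   VALUES (1, 'Clint', 'Eastwood'), (2, 'Other', 'Person');
    INSERT INTO Movie   VALUES (10, 'Movie A', 1995), (11, 'Movie B', 1992);
    INSERT INTO Casting VALUES (1, 10), (2, 10), (1, 11);
""")

# Same shape as the slide's query: two selections and two joins.
rows = conn.execute("""
    SELECT M.Title, A.LastName
    FROM Movie M, Actor A, Casting C
    WHERE A.FirstName = 'Clint'
      AND M.MovieID = C.MovieID
      AND A.ActorID = C.ActorID
      AND M.Year = 1995
""").fetchall()
print(rows)
```

The query only states which rows are wanted; how the joins are ordered and executed is left to the system.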

How can we evaluate the query?

Classical query execution
• Index-based selection
• Hash-join
• Merge-join
• Index-join

Classical query optimizations
• Pushing selections down
• Join reorder

Classical statistics
• Table cardinalities
• # distinct values
• Histograms

[Figure: two alternative plans ("vs.") over Actor, Casting and Movie, combining joins (⨝) with the selections σFirstName='Clint' and σYear=1995 placed at different positions in the tree]
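The effect of pushing selections down can be sketched in plain Python. This is an illustration under assumed toy data, not how an RDBMS implements it: both functions return the same answer, but filtering before joining produces a much smaller intermediate result.

```python
# Hypothetical toy relations: (ActorID, FirstName), (MovieID, Year), (ActorID, MovieID).
actors = [(1, "Clint"), (2, "Other"), (3, "Clint")]
movies = [(10, 1995), (11, 1992), (12, 1995)]
casting = [(1, 10), (1, 11), (2, 10), (3, 12), (2, 12)]

def join_then_select():
    # Join everything first, then apply both selections on the joined rows.
    joined = [(a, c, m) for a in actors for c in casting if a[0] == c[0]
              for m in movies if m[0] == c[1]]
    result = [(a[0], m[0]) for a, c, m in joined
              if a[1] == "Clint" and m[1] == 1995]
    return sorted(result), len(joined)

def select_then_join():
    # Apply the selections first ("pushed down"), then join the smaller inputs.
    clints = [a for a in actors if a[1] == "Clint"]
    movies_1995 = [m for m in movies if m[1] == 1995]
    joined = [(a, c, m) for a in clints for c in casting if a[0] == c[0]
              for m in movies_1995 if m[0] == c[1]]
    return sorted((a[0], m[0]) for a, c, m in joined), len(joined)

late_result, late_size = join_then_select()
early_result, early_size = select_then_join()
print(late_result == early_result, late_size, early_size)
```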

Query optimization
• Rule-based query optimization, heuristics
• Dynamic programming cost-based query optimization

Role of an RDBMS
1. Creation and storage of data
2. Queries and updates (DML)
3. Change / evolution of structure (DDL)
4. Concurrency control (multiple users): transactions
5. Recovery: ACID
6. Data integrity and security: Grant, Revoke, roles
7. Reliability (99.99999%)
8. Reduced application development time
9. Efficiency (thousands of queries per second)

… and behind the curtains, our focus...

Data independence

Physical schema
• Storage as files, row vs. column store, indexes

Data independence
Application programs are insulated from changes
in the way the data is structured and stored
• Key property of a DBMS
– Logical independence
– Physical independence

Logical data independence
Users isolated from changes in the logical structure of data

• Initially one table for actors:
Actor(ActorID: String, FirstName: String, LastName: String)

• Then divided into two relations:
AmericanActor(ActorID: String, FirstName: String, LastName: String)
WorldActor(ActorID: String, FirstName: String, LastName: String)

• A “view” Actor can still be obtained from the new relations by merging them

• Users/applications querying the view Actor get the same answer as before
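One way to sketch this is with a SQL view defined as the union of the two new relations, run here through Python's `sqlite3`. The use of `UNION ALL` and the second actor row are assumptions made for illustration; the slide only says the relations are merged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AmericanActor (ActorID TEXT, FirstName TEXT, LastName TEXT);
    CREATE TABLE WorldActor    (ActorID TEXT, FirstName TEXT, LastName TEXT);
    INSERT INTO AmericanActor VALUES ('45601', 'Clint', 'Eastwood');
    INSERT INTO WorldActor    VALUES ('90001', 'Example', 'Person');  -- hypothetical row

    -- The view restores the old logical table name on top of the new schema.
    CREATE VIEW Actor AS
        SELECT ActorID, FirstName, LastName FROM AmericanActor
        UNION ALL
        SELECT ActorID, FirstName, LastName FROM WorldActor;
""")

# An application written against the old Actor table runs unchanged.
rows = conn.execute("SELECT FirstName FROM Actor ORDER BY ActorID").fetchall()
print(rows)
```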


Physical data independence (1)
Physical data independence: applications should
be isolated from
• changes to the physical organization
– add/drop index
– different storage organization

(Actor,Movie*)*
(Movie,Actor*)*

(Movie*, Casting*, Actor*)


Physical data independence (2)
The logical schema insulates users from changes in
physical storage details
• how the data is stored on disk
• the file structure
• the choice of indexes
The application remains unaltered
• Performance may be affected by such changes!
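A small experiment with Python's `sqlite3` can illustrate this: adding an index leaves the query text and its answer unchanged, while the access path chosen by the system changes. The table contents and the index name `idx_actor_firstname` are hypothetical, and the exact wording of `EXPLAIN QUERY PLAN` output varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Actor (ActorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)")
# 1000 synthetic rows; every 100th actor is named 'Clint'.
conn.executemany(
    "INSERT INTO Actor VALUES (?, ?, ?)",
    [(i, "Clint" if i % 100 == 0 else f"First{i}", f"Last{i}") for i in range(1, 1001)],
)

query = "SELECT ActorID FROM Actor WHERE FirstName = 'Clint' ORDER BY ActorID"

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the human-readable step.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

result_before, plan_before = conn.execute(query).fetchall(), plan(query)

# Physical change only: the application query above is not touched.
conn.execute("CREATE INDEX idx_actor_firstname ON Actor(FirstName)")

result_after, plan_after = conn.execute(query).fetchall(), plan(query)
print(plan_before)
print(plan_after)
```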

Physical data independence (3)

Query processor → translates WHAT into HOW
• SQL = WHAT we want = declarative
• Relational algebra = HOW to get it = algorithm
• RDBMS are about translating WHAT into HOW


SQL – Structured Query Language

Language for computing on relations
• that separates the WHAT from the HOW
• and enables the system to choose the best HOW given the data and its layout

HW1: SQL

Data model: relational model
A data model is a collection of high-level data description
constructs (for structure, operations, constraints) that
hide many low-level storage details.

In this course:
– structured data: all elements have a fixed format
(relational, nested relational, semi-structured)
– relational model: tables

Other important data models
• Semi-structured data model (XML, JSON)
– some structure but not fixed
– hierarchically nested tagged-elements in tree
structure
• Nested relational model: nested tables
• Graph model
• Unstructured data: text, image, audio, video

Applications on DBMS
Any compute service that maintains state today is an
application on top of some kind of DBMS
– Uber
– Cathay Pacific
– Amazon
– HSBC
– SCMP

Applications want something from the DBMS

• Queries and updates


• Real applications are composed of many sorts of
statements being generated by user behaviors
• Many users work with the database at the same time

[Figure: many SQL statements arriving concurrently at the Database Management System]

Concurrency control and recovery
• Concurrency control (RDBMS’s Transaction Manager)
– Correct and fast data access in the presence of concurrent
work by many users
– Disorderly processing that provides the illusion of order!

• Recovery (RDBMS’s Recovery Manager)
– Ensures database is fault tolerant, and not corrupted by
software, system, or media failure
– Storage (hard disk) guarantees for mission-critical data

Why run multiple transactions concurrently?

Throughput: increased processor and disk utilization leads to more transactions per second (TPS) completed
– Single core: e.g., one transaction using the CPU, while another is reading from or writing to disk
– Multi-core: ideally, throughput scales with the number of processors
Latency: multiple transactions can run at the same time, so (with ample resources) one transaction’s latency need not depend on another, unrelated transaction (hopefully)

Example

First group of instructions:

UPDATE Budget
SET balance = balance - 500
WHERE uid = 1

UPDATE Budget
SET balance = balance + 200
WHERE uid = 2

UPDATE Budget
SET balance = balance + 300
WHERE uid = 3

Second group of instructions:

SELECT sum(balance)
FROM Budget

Would like to treat each group of instructions as an atomic unit!

What could go wrong?
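One thing that can go wrong is a failure after the first update: 500 has left uid 1 but has not yet arrived anywhere, so the sum is wrong. The sketch below wraps the three updates in one transaction using Python's `sqlite3`; the starting balances and the `fail_midway` switch are assumptions added for illustration. It shows atomicity on a single connection only, not the interleaving with a concurrent `SELECT sum(balance)`.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN / COMMIT / ROLLBACK below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Budget (uid INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO Budget VALUES (?, ?)", [(1, 1000), (2, 0), (3, 0)])

def total():
    return conn.execute("SELECT sum(balance) FROM Budget").fetchone()[0]

def transfer(fail_midway=False):
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE Budget SET balance = balance - 500 WHERE uid = 1")
        if fail_midway:
            raise RuntimeError("simulated crash between updates")
        conn.execute("UPDATE Budget SET balance = balance + 200 WHERE uid = 2")
        conn.execute("UPDATE Budget SET balance = balance + 300 WHERE uid = 3")
        conn.execute("COMMIT")
    except RuntimeError:
        # Undo the partial work so no money disappears.
        conn.execute("ROLLBACK")

before = total()
transfer(fail_midway=True)
after_failed = total()
transfer()
after_ok = total()
balances = conn.execute("SELECT balance FROM Budget ORDER BY uid").fetchall()
print(before, after_failed, after_ok, balances)
```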


Transactions
• Major component of database systems
• Critical for most applications
• Turing Awards to database researchers:
– Charles Bachman (1973) for pioneering early DBMS, including IDS
– Edgar Codd (1981) for inventing relational DBs
– Jim Gray (1998) for inventing transaction processing
– Michael Stonebraker (2014) for pioneering relational DBMSs, including Ingres and Postgres

What is a transaction?
• Sequence of many actions considered to be one atomic
unit of work (one logical unit)
• Usage:
1. Begin transaction
2. Set of SQL statements
3. End transaction
• Examples:
– Transfer balance between accounts
– Book a flight, a hotel, and a car together on Expedia

Transaction model in RDBMS
• Transaction
– sequence of reads and writes of database objects
– batch of work that must commit or abort as an atomic unit
• RDBMS’s transaction manager controls execution of transactions
• Program logic is invisible to the RDBMS
– The DBMS only sees data read/written from/to the DB
– Arbitrary computation possible on data fetched from the DB


ACID guarantees
• Atomicity: All actions in the transaction happen, or none happen (Recovery Manager)

• Consistency: If the DB starts out consistent, it ends up consistent at the end of the transaction (Transaction Manager)

• Isolation: Execution of each transaction is isolated from that of other transactions (Transaction Manager)

• Durability: If a transaction commits, its effects persist (Recovery Manager)

Quick quiz: which is which?
Properties:
• Atomicity
• Consistency
• Isolation
• Durability
Descriptions:
1. Maintain integrity constraints
2. All or nothing
3. Committed data survives failures
4. No worry of race conditions

WHAT WE’LL LEARN ABOUT RDBMS

RDBMS anatomy
We will unpack a database system and explore its modular design.

Layers, from top to bottom:
SQL Client
Database Management System
• Query Parsing & Optimization
• Relational Operators
• Files and Index Management
• Buffer Management
• Disk Space Management
Database (storage)

[Figure markers: "Completed" near the top of the stack, "You are here" near the bottom]

Abstraction at each level
Database Management System
• Query Parsing & Optimization: What → How
• Relational Operators: How → Dataflow on files
• Files and Index Management: Files → Blocks in memory
• Buffer Management: Memory blocks → Disk pages
• Disk Space Management: Pages on disk → Bytes
Database (storage)

[Figure marker: "You are here" near the bottom of the stack]

Each level hides the complexity of the next.

We’ll study data layout and indexes
[Figure: data layout, from record to page to file]
• Record: Bob, Harmon, M, 32, 94703 with field types Varchar, Varchar, Char, Int, Int, and its byte representation with a header
• Table of records:
SSN  Last Name  First Name  Age  Salary
123  Adams      Elmo        31   $400
443  Grouch     Oscar       32   $300
244  Oz         Bert        55   $140
134  Sanders    Ernie       55   $400
• Slotted page with a page header
• File made of Page 1 to Page 6, with pages shown in frames

• Knowledge of data and access patterns can affect choice of data-layout and caching strategies
• Choice → Challenges → Motivated query optimization

We’ll learn how to index stored data

HW2: Indexing

We’ll learn about query execution
• Simple closed set of operators
– σ (selection)
– Π (projection)
– ρ (renaming)
– ⋈ (join)
• Combined together via iterators into a data flow
– Iterator
– Materialization
– Vectorization

[Figure: example plan in which B+Tree indexed scans (iterators) over Actor, Casting and Movie feed on-the-fly select operators σFirstName='Clint' and σYear=1995, combined by indexed nested-loop joins (⨝)]
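The iterator idea can be sketched with Python generators: each operator pulls rows from its child one at a time, so operators compose into a data flow without materializing intermediate tables. This is a simplified sketch; it uses plain list scans and a naive nested-loop join instead of the B+Tree indexed scans of the figure, and the rows are hypothetical.

```python
def scan(table):
    # Leaf operator: produce the stored rows one by one.
    for row in table:
        yield row

def select(child, predicate):
    # σ: pass through only the rows that satisfy the predicate ("on the fly").
    for row in child:
        if predicate(row):
            yield row

def nested_loop_join(outer, inner_table, condition):
    # ⋈: for each outer row, rescan the inner table and emit matching pairs.
    for o in outer:
        for i in scan(inner_table):
            if condition(o, i):
                yield o + i

def project(child, columns):
    # Π: keep only the requested column positions.
    for row in child:
        yield tuple(row[c] for c in columns)

# Hypothetical toy relations.
actor = [(1, "Clint", "Eastwood"), (2, "Other", "Person")]   # ActorID, FirstName, LastName
casting = [(1, 10), (2, 10), (1, 11)]                        # ActorID, MovieID
movie = [(10, "Movie A", 1995), (11, "Movie B", 1992)]       # MovieID, Title, Year

clints = select(scan(actor), lambda a: a[1] == "Clint")
with_casting = nested_loop_join(clints, casting, lambda a, c: a[0] == c[0])
with_movie = nested_loop_join(with_casting, movie, lambda ac, m: ac[4] == m[0])
in_1995 = select(with_movie, lambda row: row[7] == 1995)
plan = project(in_1995, [2, 6])   # LastName, Title

result = list(plan)   # rows are only computed when the root is consumed
print(result)
```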

We’ll bridge the WHAT with the HOW
• Query optimization!
• Three stages
– Plan space
– Cost estimation
– Search algorithm
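The three stages can be sketched on the three-table example. Everything numeric below apart from the table cardinalities is an assumption: the selection selectivities, the join selectivities (1 / size of the referenced table), and the cost function (sum of intermediate result sizes). The search is a brute-force enumeration of left-deep join orders, which stands in for, and is much simpler than, a dynamic-programming optimizer.

```python
from itertools import permutations

# Statistics: table cardinalities as quoted for the example query.
card = {"Actor": 1_800_000, "Casting": 11_000_000, "Movie": 1_500_000}

# Assumed selectivities of FirstName='Clint' and Year=1995 (made up).
filtered = {
    "Actor": card["Actor"] * 0.001,
    "Casting": card["Casting"],
    "Movie": card["Movie"] * 0.01,
}

# Assumed join selectivities; pairs without a join predicate are cross products.
join_sel = {
    frozenset(("Actor", "Casting")): 1 / card["Actor"],
    frozenset(("Casting", "Movie")): 1 / card["Movie"],
}

def plan_cost(order):
    # Cost estimation: sum of estimated intermediate result sizes.
    joined = [order[0]]
    size = filtered[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= filtered[table]
        for prev in joined:
            size *= join_sel.get(frozenset((prev, table)), 1.0)
        joined.append(table)
        cost += size
    return cost

# Plan space: all left-deep orders. Search algorithm: exhaustive minimum.
best = min(permutations(card), key=plan_cost)
worst = max(permutations(card), key=plan_cost)
print(best, round(plan_cost(best)))
print(worst, round(plan_cost(worst)))
```

Under these assumptions, orders that join Actor and Movie first are by far the most expensive, because those two tables share no join predicate and the first step is a cross product.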

PA1: Query optimisation (Oracle)

We’ll reason about transaction ordering and concurrency control
• Correct (ideal): serially-ordered
• Desire: interleave to maximize performance
• Risk: disorder may lead to data anomalies
• Allowable orders: (conflict) serializability
• Implementation: (Strict) 2PL
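Conflict serializability can be tested with a precedence graph: add an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj (same item, different transactions, at least one write), and accept the schedule if the graph has no cycle. The sketch below assumes a schedule encoded as `(transaction, operation, item)` triples, which is a representation chosen here for illustration.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) with op in {'R', 'W'}, in execution order."""
    # Build precedence edges from every pair of conflicting operations.
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))

    # Cycle check: repeatedly remove transactions with no incoming edge
    # from a remaining transaction. If none can be removed, there is a cycle.
    nodes = {t for t, _, _ in schedule}
    while nodes:
        sources = {n for n in nodes
                   if not any(v == n and u in nodes for u, v in edges)}
        if not sources:
            return False
        nodes -= sources
    return True

# T1 finishes its work on A before T2 touches it: equivalent to T1 then T2.
ok_schedule = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
# T2 writes A between T1's read and write: edges T1 -> T2 and T2 -> T1.
bad_schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

print(conflict_serializable(ok_schedule), conflict_serializable(bad_schedule))
```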

PA2: Transactions (Oracle)

We’ll learn about recovery

• Write-Ahead Logging (WAL)
• ARIES

[Figure: DB and RAM, with the labels LSNs, pageLSNs and flushedLSN]
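The write-ahead rule behind these labels can be sketched as follows: a modified page may be written to disk only once the log is flushed at least up to that page's pageLSN. The class below is a toy in-memory model with hypothetical names (`MiniWAL`, `flush_page`, and so on); it performs no real I/O and omits everything else ARIES does.

```python
class MiniWAL:
    """Toy model of the write-ahead rule; dictionaries stand in for disk and RAM."""

    def __init__(self):
        self.log = []          # log records: (lsn, page_id, new_value)
        self.flushed_lsn = 0   # largest LSN assumed to be on stable storage
        self.page_lsn = {}     # page_id -> LSN of the latest update to that page
        self.buffer = {}       # page_id -> value in RAM
        self.disk = {}         # page_id -> value on disk

    def update(self, page_id, value):
        # Log first, then modify the page in the buffer and stamp its pageLSN.
        lsn = len(self.log) + 1
        self.log.append((lsn, page_id, value))
        self.buffer[page_id] = value
        self.page_lsn[page_id] = lsn
        return lsn

    def flush_log(self, up_to_lsn):
        # Pretend to force the log to stable storage up to the given LSN.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def flush_page(self, page_id):
        # WAL rule: the log must be flushed up to pageLSN before the page is written.
        if self.page_lsn[page_id] > self.flushed_lsn:
            self.flush_log(self.page_lsn[page_id])
        self.disk[page_id] = self.buffer[page_id]

wal = MiniWAL()
lsn = wal.update("P1", "new value")
flushed_before = wal.flushed_lsn
wal.flush_page("P1")
print(lsn, flushed_before, wal.flushed_lsn, wal.disk)
```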

We’ll relay some key messages
• Query optimization is good (omnipresent): most of today’s popular systems have a query optimizer of some kind
• Declarative languages are good: SQL in DBMS, SQL
on Big Data (more on that later)
• Schema is good
• Secondary indexes are good

But we’ll also chart and explore the limitations of RDBMS
Relational databases have been around for decades and they are very well designed for…
• … structured data
• … frequent, concurrent reads/writes, integrity (OLTP)

Their core concepts were also designed decades ago
– Sequential access to disk is slow, RAM is scarce
– Data / schema normalization
• Removing redundancy / duplication and optimizing storage
– Originally designed for single machines
• Scaling usually means buying a new and “bigger” machine

What’s changed since RDBMS inception?
• Dropping cost of disks (more on that soon)
– Cheaper to store everything than to figure out what we really need!
• Types of data collected
– From data that’s obviously valuable to data whose
value is less apparent
• Rise of social media and user-generated content
– Large increase in data volume, need for data analytics
• Growing maturity of data mining techniques
– Demonstrates value of data analytics