Textbook Chapter 33
March 19, 2019
Today
• OLTP vs OLAP
• Data Warehouses
• Extract, Transform, Load
• Multidimensional Data Models
• SQL:
• GROUPING SETS
• Window Functions
SELECT
Posts.Id,
Users.Id
FROM Posts
JOIN Users
ON Users.Id =
Posts.OwnerUserId
• Data Extraction
• Take data from one (or many) source(s).
• Data Transformation
• Modify/transform the data into the required format.
• Almost always requires cleaning of data to:
• Remove bad data/characters
• Make lists of values consistent
• Standardize data formatting
• Data Load
• Load the transformed data into your new system.
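As a rough illustration of the three ETL steps above, here is a minimal Python sketch using in-memory SQLite databases. The table names (`raw_users`, `users`), the columns, the alias mapping, and the sample rows are all hypothetical and exist only for the demonstration; a real pipeline would need cleaning rules specific to its sources.

```python
import sqlite3

# Hypothetical mapping used to make the list of country values consistent.
COUNTRY_ALIASES = {"uk": "United Kingdom", "u.k.": "United Kingdom", "canada": "Canada"}

def extract(source):
    # Extract: pull raw rows out of the (hypothetical) source table.
    return source.execute("SELECT name, country, joined FROM raw_users").fetchall()

def transform(rows):
    cleaned = []
    for name, country, joined in rows:
        if name is None or not name.strip():
            continue  # remove bad data: rows without a usable name
        # Remove bad characters: drop anything non-printable, then trim whitespace.
        name = "".join(ch for ch in name if ch.isprintable()).strip()
        country = COUNTRY_ALIASES.get(country.strip().lower(), country.strip())
        day, month, year = joined.split("/")  # assumed source format: DD/MM/YYYY
        cleaned.append((name, country, f"{year}-{month}-{day}"))  # standardize to ISO dates
    return cleaned

def load(target, rows):
    # Load: write the transformed rows into the new system.
    target.execute("CREATE TABLE users (name TEXT, country TEXT, joined TEXT)")
    target.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    target.commit()

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_users (name TEXT, country TEXT, joined TEXT)")
source.executemany("INSERT INTO raw_users VALUES (?, ?, ?)", [
    ("Ada\x07 ", "UK", "19/03/2019"),   # stray control character, inconsistent country
    ("", "Canada", "01/01/2018"),       # empty name: dropped during transformation
    ("Bob", " canada", "02/01/2018"),
])

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
loaded = target.execute("SELECT name, country, joined FROM users ORDER BY name").fetchall()
```

Keeping the three steps as separate functions makes it possible to test the cleaning rules without touching either database.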
SELECT
…
FROM Thread
JOIN Category
ON Category.CategoryID = Thread.CategoryID
JOIN Post
ON Post.ThreadID = Thread.ThreadID
JOIN User AS PostAuthor
ON Post.AuthorUserID = PostAuthor.UserID
JOIN Rating
ON Rating.PostID = Post.PostID
LEFT JOIN ThreadTag
ON ThreadTag.ThreadID = Thread.ThreadID
LEFT JOIN Tag
ON Tag.TagID = ThreadTag.TagID
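import sqlite3

conn = sqlite3.connect("typechart.db")
cur = conn.cursor()

def fetch_types():
    # Extract: read the Attacking column from the tmp table.
    cur.execute("SELECT Attacking FROM tmp")
    results = cur.fetchall()
    return results

def transform():
    # Fetch the rows that the transformation step will work on.
    types = fetch_types()

transform()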
• You can structure the data in a “flatter” way that removes the need for complex JOINs or is more accessible to end-users of the OLAP system
• Every dimensional model is composed of
• One table with many foreign keys making up a composite PK, called the fact
table.
• Many tables, each with a simple surrogate PK, called dimension tables.
• This forms a “star-like” structure, and so is called a star schema.
• Unnormalized
• One join to get any dimension we might be interested in
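The star schema described above can be sketched in a few lines of Python with an in-memory SQLite database. The table and column names (`fact_sales`, `dim_country`, `dim_date`) and the sample rows are hypothetical, chosen only to show the shape: one fact table whose foreign keys form a composite primary key, and dimension tables with simple surrogate keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: each has a simple surrogate primary key.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: the foreign keys to the dimensions make up the composite primary key.
CREATE TABLE fact_sales (
    country_id INTEGER REFERENCES dim_country(country_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    PRIMARY KEY (country_id, date_id)
);

INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'France');
INSERT INTO dim_date    VALUES (1, 2018, 12), (2, 2019, 1);
INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 5), (2, 2, 7);
""")

# Any dimension is exactly one join away from the fact table.
by_country = conn.execute("""
    SELECT c.name, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_country AS c ON c.country_id = f.country_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Swapping `dim_country` for `dim_date` in the query gives totals by year or month with the same single join.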
• Traditionally, in a 2D table you would have two axes, and then values.
• OLAP cubes extend this to many more dimensions
SELECT
"Country",
date_part('year', "InvoiceDate") AS "Year",
SUM("Quantity")
FROM "Sales"
GROUP BY "Country", "Year"
We still need:
• Subtotals for country, year
• Grand total
SELECT
"Country",
date_part('year', "InvoiceDate") AS "Year",
SUM("Quantity")
FROM "Sales"
GROUP BY GROUPING SETS (
(),
("Country"),
("Year"),
("Country", "Year")
)
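One way to see what the GROUPING SETS query returns is to write it out as a UNION ALL of ordinary GROUP BY queries, one per grouping set, with NULL standing in for every column that is not grouped. The sketch below does this in Python against SQLite, which does not support GROUPING SETS itself; the lowercase `sales` table and its three rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, year INTEGER, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Canada", 2018, 10), ("Canada", 2019, 5), ("France", 2019, 7),
])

# One SELECT per grouping set; ungrouped columns are reported as NULL.
rows = conn.execute("""
    SELECT NULL AS country, NULL AS year, SUM(quantity) FROM sales         -- ()
    UNION ALL
    SELECT country, NULL, SUM(quantity) FROM sales GROUP BY country        -- (country)
    UNION ALL
    SELECT NULL, year, SUM(quantity) FROM sales GROUP BY year              -- (year)
    UNION ALL
    SELECT country, year, SUM(quantity) FROM sales GROUP BY country, year  -- (country, year)
""").fetchall()

grand_total = [r for r in rows if r[0] is None and r[1] is None]
country_subtotals = [r for r in rows if r[0] is not None and r[1] is None]
```

The row with NULL in both columns is the grand total, and rows with a single NULL are the subtotals listed under "We still need".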
In General
• In general…
CUBE ( a, b, c )
Is equivalent to
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)
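Put differently, CUBE produces every subset of its column list. A small Python sketch makes the count explicit; the helper name `cube` is hypothetical and only enumerates the grouping sets, it does not run any query.

```python
from itertools import combinations

def cube(*columns):
    """Return every grouping set CUBE would generate: all subsets of the columns."""
    sets = []
    # Walk from the full column list down to the empty set (the grand total).
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets

grouping_sets = cube("a", "b", "c")
```

For n columns this yields 2^n grouping sets, so CUBE over many columns grows quickly.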
ROLLUP ( e1, e2, e3, ... )
Is equivalent to
GROUPING SETS (
( e1, e2, e3, ... ),
...
( e1, e2 ),
( e1 ),
( )
)
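ROLLUP, by contrast, keeps only the prefixes of its column list, which suits hierarchies such as year, month, day. The helper name `rollup` below is hypothetical and, as a sketch, only lists the grouping sets rather than executing SQL.

```python
def rollup(*columns):
    """Return the grouping sets ROLLUP would generate: every prefix of the column list."""
    # Slicing the argument tuple from full length down to zero gives each prefix in turn.
    return [columns[:n] for n in range(len(columns), -1, -1)]

grouping_sets = rollup("e1", "e2", "e3")
```

For n columns this gives n + 1 grouping sets, ending with the empty set for the grand total.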
Name         Description
DENSE_RANK   Compute the rank for a row in an ordered set of rows with no gaps in rank values.
RANK         Assign a rank to each row within the partition of the result set.
FIRST_VALUE  Get the value of the first row in a specified window frame.
LAG          Provide access to a row at a given physical offset that comes before the current row.
LAST_VALUE   Get the value of the last row in a specified window frame.
LEAD         Provide access to a row at a given physical offset that follows the current row.
ROW_NUMBER   Assign a sequential integer starting from one to each row within the current partition.
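The difference between the three numbering functions is easiest to see on data with a tie. The following Python sketch runs them side by side in SQLite; it assumes the SQLite library bundled with Python is version 3.25 or later (the first release with window functions), and the `scores` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("ann", 90), ("bob", 90), ("cat", 80), ("dan", 70)])

rows = conn.execute("""
    SELECT player,
           score,
           -- player is added as a tie-breaker so the numbering is deterministic
           ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,
           RANK()       OVER (ORDER BY score DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)         AS dense_rnk
    FROM scores
    ORDER BY score DESC, player
""").fetchall()
```

With ann and bob tied on 90, RANK gives cat rank 3 (leaving a gap after the tie), DENSE_RANK gives cat rank 2, and ROW_NUMBER simply counts 1 to 4.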
SELECT
movement.lognavigationid,
movement.posx,
movement.posy,
movement.posz
FROM movement
SELECT
movement.lognavigationid,
movement.posx,
LAG(movement.posx, 1) OVER (
PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posx,
movement.posy,
LAG(movement.posy, 1) OVER (
PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posy,
movement.posz,
LAG(movement.posz, 1) OVER (
PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posz
FROM movement
SELECT
lognavigationid,
SQRT((m.posx-m.prev_posx)^2 + (m.posy-m.prev_posy)^2 + (m.posz-m.prev_posz)^2) AS distance
FROM
(
SELECT
movement.logmovementid,
movement.lognavigationid,
movement.posx,
LAG(movement.posx, 1)
OVER (PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posx,
movement.posy,
LAG(movement.posy, 1)
OVER (PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posy,
movement.posz,
LAG(movement.posz, 1)
OVER (PARTITION BY lognavigationid ORDER BY logmovementid ASC) AS prev_posz
FROM movement
) AS m
SELECT
lognavigationid,
SUM(SQRT(
(m.posx-m.prev_posx)^2
+ (m.posy-m.prev_posy)^2
+ (m.posz-m.prev_posz)^2)) AS distance
FROM
(
…
) AS m
GROUP BY lognavigationid
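To try the LAG-based distance query on a small scale, the sketch below rebuilds it in Python against an in-memory SQLite database. Several things here are assumptions rather than part of the original queries: the sample coordinates are invented, squaring is written as a multiplication because `^` is not exponentiation in SQLite, a `SQRT` function is registered from Python because not every SQLite build ships one, and SQLite 3.25 or later is required for window functions.

```python
import math
import sqlite3

def safe_sqrt(value):
    # LAG yields NULL for the first row of each partition; pass NULL through unchanged.
    return None if value is None else math.sqrt(value)

conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, safe_sqrt)
conn.execute("""
    CREATE TABLE movement (
        logmovementid INTEGER PRIMARY KEY,
        lognavigationid INTEGER,
        posx REAL, posy REAL, posz REAL
    )
""")
conn.executemany("INSERT INTO movement VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 0, 0, 0), (2, 1, 3, 4, 0), (3, 1, 3, 4, 12),  # navigation 1: steps of 5 and 12
    (4, 2, 1, 1, 1), (5, 2, 1, 1, 3),                    # navigation 2: one step of 2
])

totals = conn.execute("""
    SELECT lognavigationid,
           SUM(SQRT((posx - prev_posx) * (posx - prev_posx)
                  + (posy - prev_posy) * (posy - prev_posy)
                  + (posz - prev_posz) * (posz - prev_posz))) AS distance
    FROM (
        SELECT lognavigationid, posx, posy, posz,
               LAG(posx, 1) OVER w AS prev_posx,
               LAG(posy, 1) OVER w AS prev_posy,
               LAG(posz, 1) OVER w AS prev_posz
        FROM movement
        WINDOW w AS (PARTITION BY lognavigationid ORDER BY logmovementid ASC)
    ) AS m
    GROUP BY lognavigationid
    ORDER BY lognavigationid
""").fetchall()
```

The first movement of each navigation has no previous position, so its distance is NULL and SUM skips it; only the steps between consecutive positions are added up.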
SELECT
P."condition",
SUM(SQRT(
(m.posx-m.prev_posx)^2
+ (m.posy-m.prev_posy)^2
+ (m.posz-m.prev_posz)^2)) AS distance
FROM
(
...
) AS M
JOIN log_navigation LN ON LN.lognavigationid = M.lognavigationid
JOIN participant P ON P.participantid = LN.participantid
GROUP BY P."condition"
Group 0 travelled a greater distance (more error in completing the navigation task)