You are on page 1of 60

DS 5110 – Lecture 5

SQL Part III

Roi Yehoshua
Agenda
 Accessing SQL from Python
 Data warehouses and lakes
 OLAP
 Window functions

2 Roi Yehoshua, 2024


Accessing SQL from a Programming Languages
 To access SQL from a programming language you need a database driver
 A driver converts application commands into a format that the DBMS can understand
 Typical commands include
 Connection commands (e.g., open or close a connection to the database)
 SQL commands
 Handling cursors, which allow you to traverse through result sets
 Transaction management commands
 Types of drivers
 Native drivers – provided by specific database vendors
 ODBC drivers – provide a standard interface to communicate with databases
 JDBC drivers – specific to Java-based applications

3 Roi Yehoshua, 2024


ODBC
 Open Database Connectivity (ODBC) is a standard API for accessing DBMS
 Independent of any specific DBMS or operating system
 ODBC drivers exist for most of the commercial databases

4 Roi Yehoshua, 2024


MySQL Connector/Python
 Python Database API (DB-API) defines a standard interface for accessing relational
databases from Python programs
 Different packages in Python implement this interface for different databases
 e.g., mysql.connector for MySQL, sqlite3 for SQLite, pyodbc for ODBC
 mysql.connector is the recommended driver for interacting with MySQL
 Developed by the MySQL group at Oracle
 Install it via pip
pip install mysql-connector-python

 If you’ve installed Python with the Anaconda distribution, you should already have it

5 Roi Yehoshua, 2024


MySQL Connector/Python
 The general workflow of a Python program that interacts with a MySQL database:
 Connect to the MySQL server
 Execute a SQL query
 Fetch the results
 Inform the database if any table changes were made (by committing the changes)
 Close the connection to the server

6 Roi Yehoshua, 2024


Establishing a Connection with MySQL Server
 To connect to a MySQL server, call the connect() function in mysql.connector module
 This function gets 4 parameters: host, user, password and database name
 It returns a MySQLConnection object

 You should always close the connection in the end by calling conn.close()
 You should never hard-code your login credentials directly in a Python script

7 Roi Yehoshua, 2024


The Cursor Object
 In order to execute SQL queries in Python, you need a cursor object
 A cursor object allows you to traverse over database records
 To create a cursor, use the cursor() method of your connection object
 Then you execute a SQL query by calling cursor.execute(query)
 If the query returns rows, you can retrieve them using one of cursor’s fetch methods:
 fetchall() - retrieves all the rows from the result as a list of tuples
 fetchone() - retrieves the next row of the result as a tuple
 Returns None if no more rows are available
 fetchmany(n) - retrieves the next n rows from the result as a list of tuples (n defaults to 1)
 Returns an empty list if no more rows are available

8 Roi Yehoshua, 2024


Reading Records from a Table
 The following example selects all the records from the instructor table:

• The result variable holds the


records returned from.fetchall()
• It’s a list of tuples representing
individual records from the table

9 Roi Yehoshua, 2024


Reading Records from a Table
 To get specific attributes in each row, specify their indexes in the returned tuples
 For example, to print only the instructor names:

10 Roi Yehoshua, 2024


Cursor as an Iterator
 To process the rows in the result one at a time, you can use the cursor as an iterator:

11 Roi Yehoshua, 2024


More Complex Queries
 You can make your select queries as complex as you want using the same methods
 Example: Find the courses that were taken by the highest number of students

12 Roi Yehoshua, 2024


Cursor Properties
 The cursor has a few useful properties that provide information about the result set
 column_names – returns the list containing the column names of the result set
 rowcount – the number of rows in the result set

13 Roi Yehoshua, 2024


SQL Injection Attack
 A common attack that enables execution of malicious SQL statements in the DB
 By insertion (“injection”) of a SQL query via the input data from the client
 Suppose you write a script that checks if a given username exists in the users table
 You construct the following SQL query:
username = # read from the input
query = "SELECT * FROM users WHERE username = '" + username + "'"

 If the user, instead of entering their username, enters:

 Then the query becomes:


query = "SELECT * FROM users WHERE username = '' or 1 = 1 --'"

 The where clause is now always true and the entire users table is returned
14 Roi Yehoshua, 2024
SQL Injection Attack
 Most databases allow execution of multiple SQL statements separated by semicolon
 This allows the hacker to inject whole SQL statements into the query
 For example, the hacker could enter the following string:

 The resulting query would be:


query = "SELECT * FROM users WHERE username = ''; drop table users; --"

 This query would result in deleting the entire users table!

 Solution: use parameterized queries whenever user input is involved in the query

15 Roi Yehoshua, 2024


SQL Injection Attack
 Example:

16 Roi Yehoshua, 2024


Parameterized Query
 A parameterized query is a query that uses placeholders (%) for attribute values
 Strings passed to the placeholders are correctly escaped by the library at runtime
 e.g., each quotation mark is doubled
 You can pass a parameterized query to cursor.execute() as follows:

17 Roi Yehoshua, 2024


Parameterized Query
 If a user tries to sneak in some problematic characters, the resulting statement will
cause no harm since each quotation mark will be doubled:
select * from instructor where name = 'X'' or ''Y'' = ''Y'
 This query will return an empty set

18 Roi Yehoshua, 2024


Inserting New Records
 To insert data, pass the insert into command to the cursor’s execute() method
 For example, to add a new instructor:

 You must call conn.commit() at the end, otherwise your changes will be lost!
 Unless you turn on automatic commits by setting conn.autocommit = True

19 Roi Yehoshua, 2024


Inserting New Records
 Verifying that the new record was added to the table:

20 Roi Yehoshua, 2024


Inserting New Records
 If the values to be inserted come from an external source (e.g., the user), again you
should use parameters in the SQL statement:

21 Roi Yehoshua, 2024


Inserting a Bulk of Records
 You can insert multiple records using the executemany() method
 It accepts two parameters:
 A query that contains placeholders for the records that need to be inserted
 A list that contains all records that you wish to insert

22 Roi Yehoshua, 2024


Update and Delete
 Updating and deleting work the same way, just pass the SQL to cursor.execute()
 You can use the cursor rowcount attribute to check how many records were affected:

23 Roi Yehoshua, 2024


Calling Stored Procedures from Python
 To call a stored procedure use the callproc() method of the Cursor object
cursor.callproc(procedure_name, args=())
 Then, you can call the stored_results() method to get an iterator with the result set
 The rows in the result can be read by calling the fetchall() method
 Example:

24 Roi Yehoshua, 2024


Loading Data into a DataFrame
 Pandas provides a read_sql() method that reads an SQL query into a DataFrame

 For example, loading the instructor table into a DataFrame:

25 Roi Yehoshua, 2024


Loading Data into a DataFrame
 You can use the params argument to pass a list of parameters to the query
 For example, the following displays all the students that took the course CS-319

26 Roi Yehoshua, 2024


Data Warehouse
 A specialized type of database designed for analysis of large volumes of data
 Integrates data from various sources and arranges them in order to optimize queries
 Stores both current and historical data
 Data is usually added or deleted, but not updated
 OLAP (Online Analytical Processing) tools allow analysis of multi-dimensional data
 Popular data warehouses: Snowflake, Amazon Redshift, Google BigQuery, Teradata

27 Roi Yehoshua, 2024


OLTP vs. OLAP

28 Roi Yehoshua, 2024


Multidimensional Data
 A data warehouse contains two primary types of tables:
 Fact tables store quantitative data (facts) about a business process
 For example, a table that stores daily sales:
sales(date, product_id, store_id, total_sales_amount, total_units_sold)
 Each row in the table represents a combination of dimensions (date, product and store) and
the corresponding measures (total sales amount and units sold)
 Dimension tables store descriptive attributes (dimensions) related to the data
 For example, a product dimension table would include details about each product
 e.g., id, name, category, price
 Dimension tables are typically denormalized to make the design simpler
 Data modeled as dimension + measures attributes is called multidimensional data

29 Roi Yehoshua, 2024


Star Schema
 The resulting schema is a star schema
 The fact table is connected to multiple peripheral tables (dimension tables)
 Advantages
 Queries are simpler and run faster compared to normalized structures (less joins)
 Flexible design that allows to add new dimensions without making extensive changes

30 Roi Yehoshua, 2024


Snowflake Schema
 A variant of star schema with multiple levels of dimension tables
 The dimension tables are usually normalized
 Reduces data redundancy but increases the number of tables and complexity of joins

31 Roi Yehoshua, 2024


Data Lakes
 Storage systems that can store vast amounts of data in its native format
 Including structured data (e.g., databases) and unstructured data (e.g., images, text)
 Require less upfront effort, but more effort during querying
 Use “schema-on-read” approach instead of “schema-on-write”
 You store the data as-is and define the schema only when you need to read/process it
 Usually built on top of distributed storage systems such as Hadoop
 which can scale out by adding more nodes to the system
 Integrate well with big data tools such as Apache Spark
 Popular data lakes:
 Amazon S3
 Azure Data Lake Storage
 Google Cloud Storage
32 Roi Yehoshua, 2024
Data Lakes

Source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for

33 Roi Yehoshua, 2024


OLAP (Online Analytical Processing)
 Set of tools that provide analysis of data from multiple dimensions
 Allows users to interactively analyze and explore data from different points of view
 Used in BI (Business Intelligence) to uncover insights and patterns that are not
immediately apparent with traditional, two-dimensional databases
 Types of OLAP
 MOLAP (Multidimensional OLAP): Uses multidimensional databases
 ROLAP (Relational OLAP): Uses relational databases to store data and a multidimensional
model at the front end
 HOLAP (Hybrid OLAP): Combines the features of both MOLAP and ROLAP

34 Roi Yehoshua, 2024


Data Cube
 The primary unit of storage and analysis in OLAP
 A multi-dimensional grid of data
 Each dimension represents a different aspect of the data
 Actual data points (measures) reside within the cube
 Example: sales_data (date, item_type, location, sales_amount)

35 Roi Yehoshua, 2024


OLAP Operations
 The main OLAP operations
 Roll-up (aggregation)
 Drill-down (de-aggregation)
 Slice
 Dice
 Pivot (rotating)

36 Roi Yehoshua, 2024


Roll-Up
 Aggregate the data by ascending the level in the dimension hierarchy
 Example: aggregate sales by countries instead of cities

37 Roi Yehoshua, 2024


Roll-Up in SQL
 Can be achieved by using GROUP BY
 For example, aggregating daily sales to monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS monthly_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date);

date item_type location sale_amount year month total_sales


2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2023 1 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290

38 Roi Yehoshua, 2024


The ROLLUP Operator
 Enhances the grouping capability of your queries by providing subtotals for
hierarchical combinations of columns in the GROUP BY
 For example, using ROLLUP to get yearly and monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date) WITH ROLLUP;

date item_type location sale_amount year month total_sales


2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2022 NULL 1670
2022-02-03 Clothing New York 320 2023 1 770
2023-01-05 Electronics Los Angeles 480 2023 NULL 770
2023-01-10 Clothing Los Angeles 290 NULL NULL 2440

39 Roi Yehoshua, 2024


Drill-Down
 Break down the data by descending the level in the dimension hierarchy
 Example: aggregate the sales by months instead of quarters

40 Roi Yehoshua, 2024


Drill-Down in SQL
 Just add another level to the GROUP BY
 For example, breaking down monthly sales by location:
SELECT YEAR(date) AS year, MONTH(date) AS month, location, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location;

date item_type location sale_amount year month location total_sales


2022-01-01 Electronics New York 500 2022 1 New York 800
2022-01-02 Clothing New York 300 2022 2 New York 870
2022-02-01 Electronics New York 550 2023 1 Los Angeles 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290

41 Roi Yehoshua, 2024


Slice
 View a slice of the cube by fixing some of the dimensions
 Example: viewing sales for a specific quarter

42 Roi Yehoshua, 2024


Slice in SQL
 Use WHERE to fix one of the dimensions and group by the other dimensions
 Example: viewing sales for the year 2022
SELECT item_type, location, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY item_type, location;

43 Roi Yehoshua, 2024


Dice
 Select two or more dimensions to get a sub-cube
 Example: viewing sales for a specific location, quarter, and item type

44 Roi Yehoshua, 2024


Dice in SQL
 Use WHERE to fix two or more dimensions and group by the other dimensions
 Example: viewing sales for the year 2022 in Chicago
SELECT item_type, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022 AND location = 'Chicago'
GROUP BY item_type;

45 Roi Yehoshua, 2024


Pivot
 Rotate the data axes to view the data from a different perspective
 Example: show the item types in the columns instead of the rows

46 Roi Yehoshua, 2024


Pivot in SQL
 SQL doesn’t directly support the pivot operation like some specialized OLAP systems
 You can achieve a pivot using conditional aggregation
 For example, pivoting monthly sales for different item types:
SELECT MONTH(date) AS Month,
SUM(CASE WHEN item_type = 'Electronics' THEN sale_amount ELSE 0 END) AS electronics_sales,
SUM(CASE WHEN item_type = 'Clothing' THEN sale_amount ELSE 0 END) AS clothing_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY MONTH(date);

47 Roi Yehoshua, 2024


Other OLAP Operations
 Split
 Break out one dimension into two separate dimensions for finer granularity
 Drill across
 Navigate from one cube to another cube that shares one or more dimensions
 Drill-through
 Access the detailed data that constitutes a certain higher-level aggregate value

48 Roi Yehoshua, 2024


The CUBE Operator
 An extension to the GROUP BY clause
 Generates a result set that represents all combinations of all aggregations
 In contrast to ROLLUP that represents aggregations from the most detailed level to the total
 For each grouping, the result contains NULL for attributes not present in the
grouping
 Supported by many relational databases (but not MySQL )

49 Roi Yehoshua, 2024


The CUBE Operator
date location sale_amount year month location total_sales
2022 1 New York 800
2022-01-01 New York 500
2022 2 Los Angeles 870
2022-01-02 New York 300
2022 1 NULL 800
2022-02-01 Los Angeles 550 2022 2 NULL 870
2022-02-03 Los Angeles 320 NULL 1 New York 800
NULL 2 Los Angeles 870
SELECT YEAR(date) AS year, MONTH(date) AS month, 2022 NULL New York 800
location, SUM(sale_amount) AS total_sales 2022 NULL Los Angeles 870
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location WITH CUBE; NULL NULL New York 800
NULL NULL Los Angeles 870
2022 NULL NULL 1670
NULL 1 NULL 800
NULL 2 NULL 870
NULL NULL NULL 2440

50 Roi Yehoshua, 2024


Class Exercise
 Compute the total number of student enrollments in each department’s courses
over the years, with subtotals by year and a grand total across all years

51 Roi Yehoshua, 2024


Solution

52 Roi Yehoshua, 2024


Window Functions in SQL
 Perform calculations across a set of table rows that are related to the current row
 e.g., calculating running totals / moving average
 Return a single value for each row from the query
 Unlike aggregate functions which return a single value for each group
 General syntax:
<window_function>(<arguments>) OVER (
[PARTITION BY <partition_expression(s)>]
[ORDER BY <order_expression(s)>]
[frame_specification]
)

 The OVER clause defines the rows on which the window function operates
 An optional PARTITION BY clause divides the rows into groups
 The ORDER BY clause determines the order of the rows within each partition
 Frame specification defines which rows are included in the frame (subset of the current row)
53 Roi Yehoshua, 2024
Example: Moving Average
 The AVG() function can be used as a window function to get a moving average
 For example, assume that we have a table with daily sales figures
 We can calculate a 3-day moving average for sales as follows:

54 Roi Yehoshua, 2024


Frame Specifications
 Define the set of rows included in the window relative to the current row
 Has two components:
 Frame start: where the frame starts relative to the current row
 Frame end: where the frame ends relative to the current row
 ROWS, RANGE, or GROUPS: determine how the frame boundaries are defined
 ROWS: Defines the frame by a specific number of rows
 RANGE: Defines the frame by value range
 GROUPS: Used with ordered set aggregate functions
 Frame boundary:
 UNBOUNDED PRECEDING/FOLLOWING: The frame starts/ends at the first row of the partition
 N PRECEDING/FOLLOWING: The frame starts/ends N rows before the current row
 CURRENT ROW: The frame starts or ends at the current row

55 Roi Yehoshua, 2024


Common Window Functions
 Ranking functions
 RANK() assigns a rank to each row based on a specified column’s values
 DENSE_RANK() is similar to RANK() but without gaps between rank values for rows with the
same rank
 ROW_NUMBER() assigns a unique sequential integer to rows
 Aggregate functions
 Almost all aggregate functions (e.g., SUM() and AVG()) can be used as window functions
 Navigation functions
 LEAD() returns a value from a subsequent row
 LAG() returns a value from a preceding row
 FIRST_VALUE() returns a value from the first row of the window frame
 LAST_VALUE() returns a value from the last row of the window frame

56 Roi Yehoshua, 2024


Rank Function
 The RANK() function assigns a rank to each row of a partition or the entire result set
 It should be used with ORDER BY to sort the rows into the desired order
 For example, we can rank the students based on their total credits:

57 Roi Yehoshua, 2024


PARTITION BY
 PARTITION BY divides the result set into partitions and the window function is
applied to each partition independently
 For example, we can rank students by the total credits within each department:

58 Roi Yehoshua, 2024


Class Exercise
 For each course in the Fall 2017 semester rank the students based on their grades
 If two students have the same grade, they should get the same rank
 Expected columns in the result: ‘course_id’, ‘student_id’, ‘grade’, ‘student_rank’

59 Roi Yehoshua, 2024


Solution

60 Roi Yehoshua, 2024

You might also like