You are on page 1of 62

DS 5110 – Lecture 1

Introduction

Roi Yehoshua
Agenda
 Course objectives
 Intro to database management systems
 Database architecture
 The relational model
 MySQL installation

2 Roi Yehoshua, 2023


DS 5110 Course Objectives
 Understand basic database concepts
 Understand the structure and operation of a relational database system
 Use ER models to design a database system
 Improve the database design by normalization
 Learn to formulate SQL queries on the data
 Integrate SQL queries within Python applications
 Learn how to work with semi-structured data, including XML, JSON and HTML
 Understand NoSQL databases
 Learn to work with MongoDB
 Describe big data tools and techniques
 Learn to use Apache Spark
3 Roi Yehoshua, 2023
Course Schedule
 Published on Canvas

4 Roi Yehoshua, 2023


Your Responsibilities
 Read the materials before class
 Attend the lectures, please be on time
 Ask questions, participate in interactive discussion in class
 Come to office hours
 Submit assignments on time
 Help others

5 Roi Yehoshua, 2023


Homework Assignments
 4 problem sets including both theoretical questions and programming exercises
 Assignments are due at 11:59PM on the specified date
 Submission via Canvas
 Please submit both PDF and source files individually (no zip)
 Up to one day late is allowed, penalized by 10%

6 Roi Yehoshua, 2023


Final Project
 Build an application that uses a DBMS including:
 Database design
 Data modeling and normalization
 Application that interacts with the database
 Data analysis and visualization
 Team size: 1-3 members

 Timeline
 Project proposal: mid class
 Presentation: on the last lecture of the semester
 Final report: ~1 week after semester ends

7 Roi Yehoshua, 2023


Grading
 Assignments – 20%
 Midterm exam – 20%
 Final exam – 20%
 Final project – 40%

 Class participation is used to adjust grade upwards (at the discretion of the
instructor)

8 Roi Yehoshua, 2023


Academic Integrity
 Homework is done individually!
 Rules
 Can discuss with colleagues or instructor
 Can post and answer questions on Piazza
 Code cannot be shared with colleagues
 Cannot use code from the internet without explicit permission from the instructor
 Any cheating will automatically result in grade F and report to the university
administration
 http://www.northeastern.edu/osccr/academic-integrity-policy/

9 Roi Yehoshua, 2023


Textbook

Abraham Silberschatz, Henry F. Korth, S. Sudarshan.


Database System Concepts (7th Edition). McGraw-Hill,
2020.

Course website:
https://www.db-book.com/

10 Roi Yehoshua, 2023


What is Data?
 Collection of data objects and their attributes

Attributes

Objects

11 Roi Yehoshua, 2023


Facets of Data
 There are many different types of data
 Each requires its own tools and techniques
 Main categories of data:
 Data matrix (e.g., spreadsheets)
 Relational data (tables in a database)
 Text documents (natural language)
 Graph-based (e.g., social networks, world wide web)
 Sequential data (e.g., genome sequences)
 Multimedia (audio, video, images)
 Streamed data (e.g., changing stock prices, stream of Twitter tweets)

12 Roi Yehoshua, 2023


Data Growth
 The amount of raw data is increasing exponentially
 Data is eating the world: prediction is that 163 ZB will be created in 2025

1 KB (Kilobyte) 103 bytes


1 MB (Megabyte) 106 bytes
1 GB (Gigabyte) 109 bytes
1 TB (Terabyte) 1012 bytes
1 PB (Petabyte) 1015 bytes
1 EB (Exabyte) 1018 bytes
1 ZB (Zettabyte) 1021 bytes
1 YB (Yottabyte) 1024 bytes

13 Roi Yehoshua, 2023


Database and DBMS
 A database is a an organized collection of structured data stored in a computer
 A database management system (DBMS) is software designed to store, retrieve,
define, and manage data in a database
 DBMS provide users with an abstract view of the data
 By hiding certain details of how the data are stored and maintained
 A database system includes the database, DBMS and associated applications

14 Roi Yehoshua, 2023


Database Systems Examples
 Enterprise information
 Sales: customers, products, purchases
 Accounting: payments, receipts, account balances, assets
 Human resources: information about employees, salaries, payroll taxes, benefits
 Manufacturing: management of production, inventory, orders, supply chain
 Banking and finance
 Customer information, accounts, loans, and banking transactions
 Sales and purchases of financial instruments (e.g., stocks and bonds)
 Universities: student information, course registration, grades
 Airlines: reservations and schedule information
 Telecommunication: records of calls, texts, and data usage
 Social media: users, connections between users, posts made by users, etc.
15 Roi Yehoshua, 2023
File Systems vs. Databases
 A file system stores data in a collection of flat files grouped into directories
 Databases are built on top of the file system
 Advantages of DBMS over file systems
 Data redundancy and inconsistency: DBMS controls redundancy by maintaining a single
repository of data that is defined once and accessed by many users
 In a file system, the data is scatted in various files
 Data concurrency: DBMS provides mechanisms to control concurrent access to the same data
 Data access and searching: DBMS provides built-in efficient searching operations
 In a file system, a different program has to be written for every search operation
 Data integrity: DBMS allows the user to define constraints on the data
 Data security: DBMS allows to control which users can access which part of the data
 Supports transactions and atomic updates

16 Roi Yehoshua, 2023


Database Levels of Abstraction
 Physical level: describes how a record (e.g., instructor) is actually stored
 Logical level: describes what data are stored in the database, and what relationships
exist among those data
 View level: describes only part of the database that is needed for the application
 The system may provide many views for the same database
 Views can also hide information (such as an employee’s salary) for security purposes

17 Roi Yehoshua, 2023


Schemas and Instances
 Similar to types and variables in programming languages
 The overall design of the database is called the database schema
 Analogous to a variable declaration with its associated type
 Logical schema – the overall logical structure of the database
 e.g., the database contains information about customers and accounts in a bank and their relations
 Physical schema – the overall physical structure of the database
 Database instance – the actual content of the database at a particular point in time
 Analogous to the value of a variable

18 Roi Yehoshua, 2023


DBMS Architecture
 A DBMS consists of two major components
 Storage manager
 Query processor

19 Roi Yehoshua, 2023


Storage Manager
 Responsible for storage, retrieval and updates of data in the database
 The storage manager components include:
 File manager - manages the allocation of space on disk storage
 Buffer manager - decides which data needs to be cached in main memory
 Authorization and integrity manager - checks the authorization of users to access data
 Transaction manager - ensures that the database remains in a consistent state despite
transaction failures
 Implements several data structures
 Data files - stores the database itself
 Data dictionary - stores metadata about the structure of the database
 e.g., the database schema, integrity constraints and authorization
 Indices - provides fast access to data items

20 Roi Yehoshua, 2023


The Query Processor
 A query is a statement requesting the retrieval of information from the database
 The query processor translates queries into sequences of physical actions
 Includes the following components
 Query parser – translates query statements into an execution plan
 Query optimizer – picks the lowest cost execution plan for a given query
 Query evaluation engine – executes the low-level instructions in the execution plan

21 Roi Yehoshua, 2023


Database Users

22 Roi Yehoshua, 2023


Database Administrator (DBA)
 A DBA is a person who has central control over the system is called a DBA
 Functions of a DBA include:

23 Roi Yehoshua, 2023


Popular Database Systems
 A 2021 survey of 25,000 data scientists and ML engineers done by Kaggle

https://www.kaggle.com/kaggle-survey-2021

24 Roi Yehoshua, 2023


Data Models
 A collection of tools for describing the data set
 Various types of data models
 The relational model
 Entity-relationship models (used mainly for design)
 Semi-structured data model (e.g., JSON and XML)
 Document-based models
 Graph-based models

25 Roi Yehoshua, 2023


The Relational Model
 The most widely used data model
 Introduced by Edgar F. Codd at IBM in 1970
 Data is represented by a collection of tables (also called relations)
 Each table contains records (rows) of a particular type
 Each record type defines a fixed number of fields (columns)

Ted Codd
Turing Award 1981

26 Roi Yehoshua, 2023


Relational Schemas
 Relational schemas are denoted by R(A1, A2, …, An )
 A1, A2, …, An are the names of the attributes
 Attribute values are normally required to be atomic (i.e., indivisible)
 A relation instance r defined over schema R is denoted by r(R)
 For example, a schema R(A, B) may have the following instance r:
A B
1 4
1 5
3 7

 An element t of relation r is called a tuple and is represented by a row in a table


 Order of tuples is irrelevant (tuples may be stored in an arbitrary order)

27 Roi Yehoshua, 2023


Relational Schemas
 Example for a relation schema: instructor (ID, name, dept_name, salary)
 We often use the same name (e.g., instructor) to refer both to the schema and the instance

attributes
(or columns)

tuples
(or rows)

28 Roi Yehoshua, 2023


Relational Schemas
 Tuples in different relations are "linked" using keys
Columns

Rows

Relationship
established by keys

29 Roi Yehoshua, 2023


Keys
 Let K  R be a subset of the attributes in schema R
 K is a super key of R if it is sufficient to identify a unique tuple in a relation r(R)
 That is, if t1 and t2 are in r and t1  t2, then t1.K  t2.K
 Example: {ID} and {ID, name} are both super keys of instructor
 Super key K is a candidate key if K is minimal (no proper subset of K is a superkey)
 Example: {ID} is a candidate key for instructor
 One of the candidate keys is selected to be the primary key
 The primary key should be chosen such that its values are never/rarely changed
 The primary key attributes are underlined in the schema definition
instructor (ID, name, dept_name, salary)
classroom (building, room_number, capacity)
section (course_id, sec_id, semester, year, building, room_number)

30 Roi Yehoshua, 2023


Foreign Keys
 A foreign key is an attribute (or set of attributes) that refers to an attribute of unique
values in another table (usually its primary key)
 The table with the foreign key is called the child table (or referencing table)
 The table with the primary key is called the parent table (or referenced table)
 The foreign key constraint ensures that a value inserted to the foreign key column
must be one of the values contained in the primary key of the parent table

31 Roi Yehoshua, 2023


Full Schema of the University Database

32 Roi Yehoshua, 2023


Schema Diagrams
 A database schema can be depicted using a schema diagram
 The foreign keys are represented by arrows connecting the two tables

33 Roi Yehoshua, 2023


Relational Algebra
 Relational algebra is a procedural query language introduced by Edgar F. Codd
 Consists of a set of mathematical operators defined on relations
 Provides a theoretical foundation for relational query languages such as SQL

34 Roi Yehoshua, 2023


SQL
 Structured Query Language (SQL) is the standard relational databases language
 A set-based, declarative programming language
SQL Commands

DDL DML DQL DCL TCL


(Data Definition (Data Manipulation (Data Query (Data Control (Transaction Control
Language) Language) Language) Language) Language)

CREATE INSERT SELECT GRANT COMMIT


ALTER UPDATE REVOKE ROLLBACK
DROP DELETE SAVEPOINT
RENAME MERGE SET TRANSACTION
TRUNCATE CALL
EXPLAIN PLAN
LOCK TABLE
35 Roi Yehoshua, 2023
SQL Queries
 A SQL query takes as input several tables (possibly one) and returns a results table
 Basic structure of SQL queries
select A1, A2, ..., An
from r1, r2, ..., rm
where P

 Ai represents an attribute (column in a table)


 ri represents a relation (table)
 P is a predicate

36 Roi Yehoshua, 2023


SQL Queries
 For example, an SQL query that finds the names of all instructors and the courses
that they teach

select name, course_id


from instructor, teaches
where instructor.ID = teaches.ID

37 Roi Yehoshua, 2023


Database Applications
 SQL is not as powerful as procedural programming languages like Java or Python
 It also doesn’t support actions such as input from users or output to displays
 Application programs written in a host language (like Python) can include
embedded SQL statements
 These statements are sent to the DB via a data API (application program interface)

38 Roi Yehoshua, 2023


Database Applications
 Database applications are usually partitioned into two or three layers (tiers)

39 Roi Yehoshua, 2023


NoSQL Databases
 Non-tabular databases that provide flexible schemas
 Allow related data to be nested within a single data structure
 Scale easily with large amounts of data and high loads

40 Roi Yehoshua, 2023


NoSQL Database Types

Store data in Store data in nodes


documents similar to and edges
JSON

Store data tables, Each item contains


rows, and dynamic keys and values
columns

41 Roi Yehoshua, 2023


Big Data
 Data sets that are too large or complex for traditional data techniques to deal with
 The characteristics of big data are often referred to as the 3 Vs:
 Volume – The amount of data is much larger than what traditional databases can handle
 Velocity – Data arrives at a very fast rate, and needs to be analyzed very quickly
 Many applications need to process data items as they arrive (streaming data systems)
 Variety – Data comes in variety of types, such as text, audio and video

42 Roi Yehoshua, 2023


Big Data
 A new generation of technologies and frameworks that deal with big data
 Distributed file systems
 Distributed programming (MapReduce)
 NoSQL databases
 Analytic tools
 Cloud services

43 Roi Yehoshua, 2023


MySQL
 For this course, I recommend that you use MySQL database
 Used by many popular websites, including Facebook, Flickr, Twitter and YouTube
 MySQL is offered under two editions:
 Open source MySQL Community Server
 The proprietary Enterprise Server (owned by Oracle)
 Can run on various platforms UNIX, Linux, Windows, etc.
 Easy to master in comparison with other database software like Oracle Database or
Microsoft SQL Server

44 Roi Yehoshua, 2023


MySQL Installation
 Download MySQL Community Server from https://dev.mysql.com/downloads/mysql/
 MySQL comes preinstalled on most linux distributions, although the version may not be the latest

45 Roi Yehoshua, 2023


MySQL Installation
 Run the setup with the default settings
 Remember the password you set for the root user

46 Roi Yehoshua, 2023


MySQL Tools
 After the installation, you will have two programming tools to work with MySQL:
 MySQL Shell, where you can type SQL commands directly in the console
 A graphical user interface tool called MySQL Workbench

47 Roi Yehoshua, 2023


MySQL Tools
 After the installation, you will have two programming tools to work with MySQL:
 MySQL Shell, where you can type SQL commands directly in the console
 A graphical user interface tool called MySQL Workbench
 To run the MySQL Shell open a command prompt
 Then type mysqlsh -u root -p --sql
 Enter the password for your root user
 Then you can enter SQL commands directly in the shell
 For Mac users, use the following link
 https://www.bytebase.com/blog/how-to-install-mysql-shell-on-macos

48 Roi Yehoshua, 2023


MySQL Shell

49 Roi Yehoshua, 2023


MySQL Shell
 Running a SQL query:

50 Roi Yehoshua, 2023


MySQL Tools
 Start MySQL WorkBench and click on the default connection

51 Roi Yehoshua, 2023


Create a New Schema
 Click the Create New Schema button in the toolbar

 Call the schema university

 Click Apply

52 Roi Yehoshua, 2023


Creating a New Database
 You will see the newly created database on the left side under the Schemas tab
 Double-click on the database to select it
 All your SQL statements will now be directed to this database
 Alternatively, you can type USE [database] before your SQL statement

53 Roi Yehoshua, 2023


Load the Sample Database
 Download the scripts for creating the university database from the book’s web site
 DDL.sql creates the schema
 smallRelationsInsertFile.sql loads data into the tables
 To run SQL scripts choose File > Open SQL Script

54 Roi Yehoshua, 2023


Load the Sample Database
 Double-click on the university schema and then click on the execute button

55 Roi Yehoshua, 2023


Load the Sample Database
 Before running smallRelationsInsertFile put into comments the delete commands by
wrapping them around /* and */
 MySql by default has the safe-updates option set. This means that you can't update or delete
records without specifying a key in the where clause.

56 Roi Yehoshua, 2023


Load the Sample Database
 Right click on the university schema and choose Refresh All
 You will now be able to see the tables that have been created

57 Roi Yehoshua, 2023


Load the Sample Database
 To view a table’s data, click on a table name and then click the grid button

58 Roi Yehoshua, 2023


Run SQL Query
 To run an SQL query, click the Create New SQL button in the toolbar
 Type your query in the editor and then click Execute

59 Roi Yehoshua, 2023


Generating Schema Diagrams in MySQL
 To create a diagram from an existing database you can use the reverse engineering
option in MySQL Workbench

 Provide connection details to your database and click Next


 Select the schema that you want to reverse engineer

60 Roi Yehoshua, 2023


Generating Schema Diagrams in MySQL
 When the process ends, you’ll get a diagram with all the tables and views

61 Roi Yehoshua, 2023


Final Notes
 I publish numerous articles on data science and machine learning on Medium
 https://medium.com/@roiyeho

62 Roi Yehoshua, 2023

You might also like