Course Introduction
Lecture 1
Agenda for today
Course Components
• Databases
• Relational and NoSQL databases
• Statistics
• Review
• Types of data
• Probability
• Statistical tests
• Regression
Data is at the core of every industry.
“The world’s most valuable resource is no longer oil, but data.”
The Economist
May, 2017
With the rise of 4K video, medical images, IoT, digital
information, AI and analytics, the data explosion is accelerating.
New, mostly unstructured data sources emerge constantly, creating an expanding data ecosystem for every
organization.
125 billion Internet-connected devices by 2030
80% unstructured data
90% of all data was created in the last 2 years
[Chart: projected exabytes of data by year, 2009 to 2020]
The growing imperative of Business Data
Analytics have emerged for years from Transactional, Structured data…
…to massive Interactive, Unstructured content
80% is unstructured
[Figure: data sources shown are Documents, Web Pages, Sales transactions, Cameras, Databases, Text Messages, Emails]
What Happens in a Social Internet Minute when it’s now Business critical?
The AI Market
Data vs. Information
DATA IS A COLLECTION OF FACTS
These facts are independent of overall meaning
INFORMATION
No vegetarians are coming to the wedding
Smoking can lower your chances of getting Alzheimer’s
Categorizing Databases
BY USERS
SINGLE USER
Supports one ‘user’ (could be a program)
Example – desktop database
MULTI-USER
Sometimes called ‘workgroup’ or ‘enterprise’ depending on scale
Example – USC course schedule
BY LOCALITY
CENTRALIZED
Usually owned and maintained by the organization using the database
Example – SIS (Student Information System) at USC
DISTRIBUTED
Database is decentralized and data is distributed (sometimes redundantly)
Example – Domain name servers
Blockchain? Bitcoin?
BY USAGE
OPERATIONAL DATABASES
Maintains data relevant to moment-to-moment operations
Optimized for fast data manipulation (inserting and editing data)
Example – single Walmart store database
ANALYTICAL DATABASES
Contains historical data to observe trends and make business decisions
Optimized for fast data processing (access and computation)
Example – Walmart historical sales database
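The difference between the two usage types can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the values are made up, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

# One in-memory database stands in for both workloads (an assumption for brevity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, item TEXT, amount REAL)"
)

# Operational workload: many small writes that record or correct single events.
insert = "INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)"
conn.execute(insert, ("store_1", "milk", 3.50))
conn.execute(insert, ("store_1", "bread", 2.25))
conn.execute(insert, ("store_2", "milk", 3.75))
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (2.50, 2))  # edit one row

# Analytical workload: read many rows and compute a summary over them.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

The writes touch one row at a time, while the summary query scans the whole table, which is why the two kinds of database are tuned differently.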
DATABASE TIMELINE
Today -> popular for data transmission (as a delimited file or XML or JSON)
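As a rough illustration of the transmission formats named above, the same record can be written as a delimited (CSV) file and as JSON using Python's standard library. The record and its field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical record to transmit.
record = {"client": "Ada", "appointment": "2024-01-15", "paid": True}

# Delimited file: a header row followed by one comma-separated row per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

# JSON: field names travel with every record and basic types are preserved.
json_text = json.dumps(record)

# Reading both back shows one practical difference between the formats.
rows = list(csv.DictReader(io.StringIO(csv_text)))
decoded = json.loads(json_text)
print(rows[0]["paid"], decoded["paid"])
```

After the round trip the CSV value for `paid` comes back as text, while the JSON value is still a boolean.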
HIERARCHICAL MODEL
Every parent has one or many children of a certain type
Example – ‘Client’ can have one or more ‘Appointments’
[Diagram: CLIENTS (parent) with children APPOINTMENTS and PAYMENTS]
PROS
Fast access
Ensures referential integrity
CONS
Whole hierarchy must be satisfied to enter data
Can only retrieve data from root
Data redundancy
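A minimal sketch of the hierarchical idea, using nested Python structures rather than any real hierarchical database product. The client names, dates, and the `appointments_on` helper are invented for illustration.

```python
# Each client (parent) owns its own appointments and payments (children).
hierarchy = {
    "clients": [
        {
            "name": "Ada",
            "appointments": [{"date": "2024-01-15"}, {"date": "2024-02-01"}],
            "payments": [{"amount": 40}],
        },
        {
            "name": "Grace",
            "appointments": [{"date": "2024-01-15"}],  # same date stored again
            "payments": [],
        },
    ]
}


def appointments_on(tree, date):
    """Find clients with an appointment on a date by walking down from the root."""
    names = []
    for client in tree["clients"]:
        for appointment in client["appointments"]:
            if appointment["date"] == date:
                names.append(client["name"])
    return names


found = appointments_on(hierarchy, "2024-01-15")
print(found)
```

There is no way to reach an appointment without passing through its client, and the shared date is stored once per client, which mirrors the "retrieve from root" and "data redundancy" points in the list of cons.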
NETWORK MODEL
Uses inverted tree structure to represent owner/member relationships through structures with multiple links allowed
[Diagram: CLIENTS 'Schedule' APPOINTMENTS and 'Make' PAYMENTS]
PROS
Complex (real-world) relationships are better modeled
Any data can be accessed through structures
CONS
Users must be familiar with network layout to retrieve data
Changing structures is difficult and involves reworking whole network
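One way to picture owner/member links is a small in-memory graph. This is a sketch, not how any particular network database stores data: the `Record` class, the record names, and the `billed_by` link are assumptions added for illustration; `schedules` and `makes` follow the Schedule and Make labels in the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    kind: str
    name: str
    # Named sets owned by this record: set name -> list of member records.
    members: dict = field(default_factory=dict)

    def link(self, set_name, member):
        """Add a member record to one of this owner's named sets."""
        self.members.setdefault(set_name, []).append(member)


client = Record("client", "Ada")
appointment = Record("appointment", "2024-01-15")
payment = Record("payment", "invoice-1")

client.link("schedules", appointment)
client.link("makes", payment)
appointment.link("billed_by", payment)  # hypothetical second owner of the payment

# The same payment is reachable along two different paths.
via_client = client.members["makes"][0]
via_appointment = client.members["schedules"][0].members["billed_by"][0]
print(via_client is via_appointment)
```

Both paths only work if the caller already knows the set names and how the records are wired together, which is the layout-familiarity drawback listed under CONS.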
RELATIONAL MODEL
Every relation can be thought of as a table
Every table has a column (or set of columns), called the primary key, that uniquely identifies each row in the table
By sharing the identifying field, data can be related
PROS
Implementation independence
Easy data retrieval
CONS
Computationally expensive
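The idea of relating tables through a shared identifying field can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clients (
        client_id INTEGER PRIMARY KEY,   -- uniquely identifies each client row
        name      TEXT
    );
    CREATE TABLE appointments (
        appointment_id INTEGER PRIMARY KEY,
        client_id      INTEGER REFERENCES clients(client_id),  -- shared field
        date           TEXT
    );
    INSERT INTO clients VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO appointments VALUES
        (10, 1, '2024-01-15'),
        (11, 1, '2024-02-01'),
        (12, 2, '2024-01-15');
    """
)

# Relate the two tables by matching the shared client_id values.
rows = conn.execute(
    """
    SELECT c.name, a.date
    FROM clients AS c
    JOIN appointments AS a ON a.client_id = c.client_id
    ORDER BY a.appointment_id
    """
).fetchall()
print(rows)
```

The query states what data is wanted, not how the tables are stored or traversed; the join that makes this possible is also where much of the computational cost comes from.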
OBJECT ORIENTED MODEL
Overlays object oriented functionality on traditional relational databases
Uses object relational mapping (ORM) middleware between consumer programs and database structures
PROS
Allows programmers to develop database programs without knowing SQL
CONS
There is a high cost: financial expense, computation time, external dependencies, maintenance overhead
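To show roughly what the mapping layer does, here is a toy hand-written mapper built on sqlite3. It is not a real ORM library; the `Client` and `ClientRepository` names and the `clients` table are assumptions made for this sketch.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Client:
    client_id: int
    name: str


class ClientRepository:
    """Toy mapper: converts between Client objects and rows of a clients table."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS clients "
            "(client_id INTEGER PRIMARY KEY, name TEXT)"
        )

    def add(self, client):
        # Object -> row
        self.conn.execute(
            "INSERT INTO clients (client_id, name) VALUES (?, ?)",
            (client.client_id, client.name),
        )

    def get(self, client_id):
        # Row -> object (or None when no row matches)
        row = self.conn.execute(
            "SELECT client_id, name FROM clients WHERE client_id = ?",
            (client_id,),
        ).fetchone()
        return Client(*row) if row else None


# The calling code works only with objects; the SQL stays inside the mapper.
repo = ClientRepository(sqlite3.connect(":memory:"))
repo.add(Client(1, "Ada"))
loaded = repo.get(1)
print(loaded)
```

The extra layer is also a small-scale picture of the costs listed above: more code to maintain and one more translation step on every read and write.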
POST RELATIONAL MODEL
Handles volume, velocity and variety of data (3V’s)
Uses key-value storage of data instead of structured tables, allowing for sparse unstructured data
Not always an appropriate replacement for relational database
PROS
Enabled the ‘Big Data’ revolution
CONS
Can involve a large overhead
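Key-value storage can be pictured with a plain Python dictionary. This is a sketch of the storage idea only, not of any specific NoSQL product; the keys and fields are hypothetical.

```python
# Keys map to values of any shape; there is no fixed set of columns.
store = {}

store["client:1"] = {"name": "Ada", "email": "ada@example.com"}
store["client:2"] = {"name": "Grace", "tags": ["vip"], "notes": "prefers mornings"}

# Lookups go by key. A field that was never stored is simply absent,
# rather than being an empty cell in a table.
email_1 = store["client:1"].get("email")
email_2 = store["client:2"].get("email")
print(email_1, email_2)
```

Each value carries only the fields it needs, which suits sparse data. The trade-off is that questions spanning many records, such as finding every client with a given tag, need extra work that a relational query handles directly.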
Databases vs. Spreadsheets
DATABASES ARE LIKE SPREADSHEETS
Designed for easy data entry, manipulation, and retrieval
Maintain data structure and organization