
ITP 249

Introduction to Data Analytics

Course Introduction
Lecture 1
Agenda for today

• Adding the Class Policy (Roll Call)


• Introductions
 Me, You
 Syllabus Review
 Why Data Analytics is so Hot
• Guest Speaker – Lauren Marker, KPMG
• Lecture 1 – Data versus Information
• Add students to waiting list
ABOUT ME
Naz Nageer
Adjunct Professor, Working Professional
Information Technology Program
Viterbi School of Engineering
University of Southern California
Los Angeles
 40 years in the IT profession as a provider and consumer of technology solutions
 Started my career as a Database Programmer
 Currently Employed by IBM, Previously Oracle
 Also worked at Warner and Fox
 Have been teaching for two and a half years in the Applied Analytics minor.
 It is now the largest Minor at USC (surpassing Cinematic Arts)
ABOUT YOU
What made you want to take this class?
What was the highlight of your summer break?
What is your College / major?
What do you hope to learn in this class?
ABOUT CLASS
The Syllabus …
Course Description
Data is now an integral part of our lives, and to be successful in today’s business landscape we need to
be able to leverage data to make critical business decisions. This course will teach students how to use
data to make those decisions confidently.
Learning Objectives
After completing this course, students will be able to:
• Use Excel, SQL, NoSQL, and leading industry tools
• Collect, clean, and analyze data from multiple sources
• Pose questions, collect relevant data, analyze data, interpret data and provide insights
• Present data-driven insights using data visualization and dashboards
• Tell compelling stories with data

Course Components
• Databases
• Relational and NoSQL databases
• Statistics
• Review
• Types of data
• Probability
• Statistical tests
• Regression
Data is at the core of every industry.
“The world’s most valuable resource is no longer oil, but data.”

The Economist
May, 2017

With the rise of 4K video, medical images, IoT, digital
information, AI and analytics, the data explosion is accelerating.

New, mostly unstructured data sources emerge constantly, creating an expanding data ecosystem for every
organization.

125 billion Internet-connected devices by 2030
80% unstructured data
90% of all data was created in the last 2 years
[Chart: projected data volume in exabytes, 2009 to 2020]
The growing imperative of Business Data
Analytics have emerged for years from Transactional, Structured data…
Sales transactions
Databases

…to massive Interactive, Unstructured content
Documents
Web Pages
Cameras
Text Messages
Emails

80% is Unstructured
What Happens in a Social Internet Minute when it’s now Business critical?
The AI Market

© Copyright IBM Corporation 2018


Tools
Excel
Access
SQL
MySQL
MySQL Workbench
NoSQL
MongoDB
SAS

Data vs. Information
DATA IS A COLLECTION OF FACTS
These facts are independent of overall meaning

INFORMATION IS MEANING DERIVED FROM DATA


Information is extracted by processing data – organizing, analyzing, summarizing
database \ˈdā-tə-ˌbās, ˈda-\ noun: a specialized
structure that allows computer-based systems to
store, manage, and retrieve data very quickly
DATA
48 wedding guests order chicken dinners
52 wedding guests order steak dinners
100 people RSVP’d for the dinner

INFORMATION
No vegetarians are coming to the wedding

DATA
1 in 8 Americans have Alzheimer’s by age 85
The average life expectancy of an American smoker is 66

INFORMATION
Smoking can lower your chances of getting Alzheimer’s
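
The wedding example can be worked through in a few lines of Python. This is only an illustrative sketch: the variable names are made up for this example, and the "processing" step is a simple sum and comparison of the three facts on the slide.

```python
# Data: independent facts taken from the wedding example.
orders = {"chicken": 48, "steak": 52}
rsvp_count = 100

# Processing: summarize the orders and compare them with the RSVP total.
meat_orders = sum(orders.values())
guests_without_meat = rsvp_count - meat_orders

# Information: meaning derived from the data.
no_vegetarians = guests_without_meat == 0

print(f"{meat_orders} of {rsvp_count} guests ordered a meat dinner")
if no_vegetarians:
    print("No vegetarians are coming to the wedding")
else:
    print(f"{guests_without_meat} guests may be vegetarian")
```

None of the three facts says anything about vegetarians on its own; the conclusion only appears once the facts are combined.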
Categorizing Databases
BY USERS

SINGLE USER
Supports one ‘user’ (could be a program)
Example – desktop database

MULTI-USER
Sometimes called ‘workgroup’ or ‘enterprise’ depending on scale
Example – USC course schedule
BY LOCALITY

CENTRALIZED
Usually owned and maintained by the organization using the database
Example – SIS (Student Information System) at USC

DISTRIBUTED
Database is decentralized and data is distributed (sometimes redundantly)
Example – Domain name servers
Blockchain? Bitcoin?
BY USAGE

OPERATIONAL DATABASES
Maintains data relevant to moment-to-moment operations
Optimized for fast data manipulation (inserting and editing data)
Example – single Walmart store database

ANALYTICAL DATABASES
Contains historical data to observe trends and make business decisions
Optimized for fast data processing (access and computation)
Example – Walmart historical sales database
DATABASE TIMELINE

1960s: Hierarchical, Networked
1970s: Relational
1980s: Object Oriented
2000+: Post Relational


Database Models
FILE SYSTEM MODEL
Sometimes called ‘flat model’, referring to a ‘flat file’ that contains data
Does not indicate relationships – just maintains records
Example – an address book

Today: popular for data transmission (as a delimited file, XML, or JSON)
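
As a rough sketch of the flat model, the address book below is a delimited (CSV) file held in a string, read into records, and re-serialized as JSON for transmission. The names, phone numbers, and column headings are invented for illustration; only Python's standard `csv`, `io`, and `json` modules are used.

```python
import csv
import io
import json

# A hypothetical address book as a flat, comma-delimited file.
flat_file = (
    "name,phone,city\n"
    "Ada Lopez,555-0100,Los Angeles\n"
    "Ben Kim,555-0101,Pasadena\n"
)

# Each line becomes one record; nothing links one record to another.
records = list(csv.DictReader(io.StringIO(flat_file)))

# The same records serialized as JSON, a common format for data transmission.
as_json = json.dumps(records, indent=2)
print(as_json)
```

Note that the file only maintains records: nothing in it says how one entry relates to another.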
HIERARCHICAL MODEL
Every parent has one or many children of a certain type
Example – ‘Client’ can have one or more ‘Appointments’

[Diagram: CLIENTS is the parent of APPOINTMENTS and PAYMENTS]

PROS
Fast access
Ensures referential integrity

CONS
Whole hierarchy must be satisfied to enter data
Can only retrieve data from root
Data redundancy
NETWORK MODEL
Uses inverted tree structure to represent owner/member relationships through structures with
multiple links allowed

[Diagram: CLIENTS ‘Schedule’ APPOINTMENTS; CLIENTS ‘Make’ PAYMENTS]

PROS
Complex (real-world) relationships are better modeled
Any data can be accessed through structures

CONS
Users must be familiar with network layout to retrieve data
Changing structures is difficult and involves reworking whole network
RELATIONAL MODEL
Every relation can be thought of as a table
Every table has a column that uniquely identifies each row in the table
By sharing the identifying field, data can be related

PROS
Implementation independence
Easy data retrieval

CONS
Computationally expensive
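
A minimal sketch of the relational idea, using Python's built-in `sqlite3` module with an in-memory database. The table names echo the Clients/Appointments example from the earlier slides, but the columns and rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.executescript("""
    -- client_id uniquely identifies each row in clients.
    CREATE TABLE clients (
        client_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- appointments shares the identifying field client_id.
    CREATE TABLE appointments (
        appointment_id INTEGER PRIMARY KEY,
        client_id      INTEGER NOT NULL REFERENCES clients(client_id),
        day            TEXT NOT NULL
    );
    INSERT INTO clients VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO appointments VALUES
        (10, 1, '2024-09-03'),
        (11, 1, '2024-09-10'),
        (12, 2, '2024-09-04');
""")

# Relate the two tables through the shared client_id column.
rows = conn.execute("""
    SELECT c.name, a.day
    FROM clients AS c
    JOIN appointments AS a ON a.client_id = c.client_id
    ORDER BY a.day
""").fetchall()

for name, day in rows:
    print(name, day)
```

The join is where the "computationally expensive" con comes from: the database has to match rows across tables at query time rather than follow pre-built links.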
OBJECT ORIENTED MODEL
Overlays object oriented functionality on traditional relational databases
Uses object relational mapping (ORM) middleware between consumer programs and database
structures

PROS
Allows programmers to develop database programs without knowing SQL

CONS
There is a high cost – financial expense, computation time, external dependencies, maintenance
overhead
POST RELATIONAL MODEL
Handles volume, velocity and variety of data (3V’s)
Uses key-value storage of data instead of structured tables, allowing for sparse unstructured
data
Not always an appropriate replacement for relational database
PROS
Enabled the ‘Big Data’ revolution

CONS
Can involve a large overhead
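
Key-value storage can be approximated with a plain Python dictionary. This is a toy illustration only, not how MongoDB or any other real product is implemented; the keys and fields are hypothetical.

```python
# A toy key-value store: each key maps to a free-form record.
store = {}

# Records do not have to share the same fields (sparse data).
store["user:1"] = {"name": "Ada", "email": "ada@example.com"}
store["user:2"] = {"name": "Ben", "tags": ["vip"], "last_login": "2024-09-01"}

# Retrieval is by key, not by querying a structured table.
print(store["user:2"]["tags"])

# The set of fields seen across all records; no record contains all of them.
fields = sorted({field for record in store.values() for field in record})
print(fields)
```

A relational table would need a column for every field and would store empty values wherever a record lacks one.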
Databases vs. Spreadsheets
DATABASES ARE LIKE SPREADSHEETS
Designed for easy data entry, manipulation, and retrieval
Maintain data structure and organization

DATABASES ARE NOT LIKE SPREADSHEETS


Designed to deliver data to many users simultaneously
Designed for speedy access
Can enforce very strict data rules
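
To make "very strict data rules" concrete, here is a hedged sketch using Python's built-in `sqlite3` module. The `students` table, its columns, and the 0 to 20 unit limit are invented for this example; the point is that the database itself rejects rows that break the declared rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        email      TEXT    NOT NULL UNIQUE,
        units      INTEGER NOT NULL CHECK (units BETWEEN 0 AND 20)
    )
""")
conn.execute("INSERT INTO students VALUES (1, 'ada@example.edu', 16)")

# Hypothetical rows that each break one rule.
bad_rows = [
    (2, "ada@example.edu", 12),  # duplicate email violates UNIQUE
    (3, "ben@example.edu", 99),  # units outside the CHECK range
    (4, None, 8),                # missing email violates NOT NULL
]

rejected = []
for row in bad_rows:
    try:
        conn.execute("INSERT INTO students VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        rejected.append(row[0])
        print(f"Row {row[0]} rejected: {err}")

count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(f"{count} valid row stored")
```

A spreadsheet would typically accept all three bad rows unless someone had set up validation on each cell range by hand.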
SPREADSHEETS ARE GOOD AT ORGANIZING DATA
Organized into rows and columns
Data in spreadsheets can have types (numbers, text, results of formulas)
Data can be checked and interrelated (with macros or other programming)

DATABASES ARE BETTER AT ORGANIZING DATA


Data is organized (usually into rows and columns)
Data in the database has types (numbers, text, images, results of formulas)
Data can be checked and interrelated (by how database is defined)
SPREADSHEETS ARE GOOD AT RETRIEVING DATA
Data can be retrieved via cell lookup
Using tools (like PivotTables) data can be analyzed

DATABASES ARE BETTER AT RETRIEVING DATA


Data can be queried and subsets can be retrieved
Using SQL compatible tools and programs data can be analyzed
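
A short sketch of querying and analyzing with SQL, again using Python's built-in `sqlite3` module. The `sales` table and its rows are made-up sample data, not figures from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("LA", "pens", 12.0),
        ("LA", "paper", 30.0),
        ("SF", "pens", 8.0),
        ("SF", "ink", 45.0),
    ],
)

# Retrieve a subset: only LA sales above 10, smallest first.
subset = conn.execute(
    "SELECT item, amount FROM sales WHERE store = ? AND amount > ? ORDER BY amount",
    ("LA", 10),
).fetchall()
print(subset)

# Analyze: total sales per store, similar in spirit to a PivotTable.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```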
SPREADSHEETS ARE GOOD AT MANIPULATING DATA
Data can be changed with the user interface (UI)
Data can sometimes be changed with tools

DATABASES ARE BETTER AT MANIPULATING DATA


Databases provide a manipulation language (SQL) to change data
With SQL other programs can change data in a database
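
The sketch below shows a program changing data through SQL rather than through a user interface. It uses Python's built-in `sqlite3` module; the `inventory` table and its contents are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (item TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("pens", 100), ("paper", 40), ("ink", 0)],
)

# 'with conn' commits both changes together, or rolls both back on error.
with conn:
    # Record a sale of 5 pens.
    conn.execute(
        "UPDATE inventory SET quantity = quantity - ? WHERE item = ?",
        (5, "pens"),
    )
    # Drop items that are out of stock.
    conn.execute("DELETE FROM inventory WHERE quantity = 0")

remaining = conn.execute(
    "SELECT item, quantity FROM inventory ORDER BY item"
).fetchall()
print(remaining)
```

Any program that can send SQL to the database can make the same changes, which is what lets many applications share one source of data.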
Summary
The goal of this course is to learn how to use database systems not just as systems of record but as
systems of information, i.e. repositories of data that can be analyzed to derive information,
knowledge and perhaps wisdom!
Secondly, we will review basic statistical techniques for use in applied analytics.
Hands-on

Please bring your laptops to class each time


We will use the Viterbi Virtual Lab (aka VDI aka myDesktop)

Follow the instructions here to install:


https://viterbiit.usc.edu/instructional-support/

For some assignments, you may need to connect through the VPN:


https://tmyh120u4ne1mjc7u1ptxu17-wpengine.netdna-ssl.com/wp-content/uploads/2017/06/82116.pdf

Homework 1 will be assigned in Week 2
