Data Analytics Program - Introduction To Data Analytics - Lesson 1

DATA ANALYTICS PROGRAM
Learning Schedule
1. Introduction to Data Analytics
2. Business Context for Data Analytics
3. Statistical Analysis of Data
4. Programming Fundamentals (R/Python)
5. SQL - Structured Query Language
6. Business Analytics with Excel
7. Data Analytics with R/Python
8. Data Visualization
9. Holistics/Bigquery/Tableau
10. Predictive Analytics 1 - Machine Learning
11. Predictive Analytics 2 - Deep Learning
12. Data Analytics Capstone Project

Introduction to DATA ANALYTICS
Charts: “Data Analytics”, past 5 years, Worldwide and Vietnam
INTRODUCTION TO DATA ANALYTICS lessons
01 What should we do with Data?
1. Data Analysis vs Data Analytics
2. Understanding Data
3. Data issues
4. Data formatting
5. Data Blending
02 Data Analytics Basics
1. Think like a Data Scientist
2. Do you need all that data?
3. Importance of Segmentation of your Analytics
4. Know the difference between your data & your metrics
5. Can your data be trusted?
6. Pitfalls of data-driven decisions
7. Why it’s so hard for us to communicate uncertainty
03 Data Analytics Framework
1. Approach Frameworks
2. CRISP-DM Frameworks
3. Strategic Roadmap
04 Getting Insights from Data
1. Data Analytics Process
2. Data Quality
3. Descriptive
4. Predictive
5. Prescriptive
6. Semantic
Lesson 1
WHAT SHOULD WE DO WITH DATA?
Lesson 1: What should we do with Data?
Data Analysis vs Data Analytics
DATA: Data, in the information age, are a large set of bits encoding numbers, texts, images, sounds, videos, and so on.
ANALYTICS: The science that analyzes crude data to extract useful knowledge (patterns) from them. Analytics gives you insights & attempts to answer questions.
ANALYSIS: Analysis provides you with information & raises questions.
Source: A General Introduction to Data Analytics, Wiley & ChartMogul

Data Analysis vs Data Analytics - Example
5 months ago, Bank ABC's loan portfolio decreased by a total of 10.200 bio. VND due to attrition
(Ending Loan portfolio = Beginning Loan + New loan – Attrition – Maturity)
Top 4 reasons for attrition in the bank:
(1) Dissatisfaction with services (50%)
(2) Lower rates at other banks (30%)
(3) Change to another loan package within the bank (10%)
(4) Death (10%)
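The loan portfolio identity in the example above can be written as a small function. This is a minimal sketch; the function name and all figures are hypothetical and chosen only to illustrate the arithmetic, not taken from Bank ABC.

```python
def ending_loan_portfolio(beginning, new_loans, attrition, maturity):
    """Ending loan portfolio = beginning loan + new loans - attrition - maturity."""
    return beginning + new_loans - attrition - maturity

# Hypothetical figures (same currency unit for every argument).
ending = ending_loan_portfolio(beginning=1000, new_loans=200,
                               attrition=150, maturity=80)
print(ending)
```

Analysis would report the attrition figure itself; analytics would go on to ask why the attrition happened, as in the list of reasons above.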
Understanding Data – Categories of Data
Understanding Data – Data Sources
Computer files Database Web-based

Understanding Data – Importance of Data Types
Understanding Data – Data Types
String: String data can be declared in a number of different ways depending on the character set required and the anticipated length of the string: any kind of characters, alphanumeric, including symbols.
Numeric: Numeric data are numbers, which can be whole numbers, such as integers, or numbers with decimal places (Byte, Integer, Fixed Decimal, Float, Double).
Date/time: Date/time contains a specific date, or a combination of both date and time.
Boolean: The Boolean type is sometimes also called a logical type and is a conditional flag representing either true or false.
Other types: Images, Maps, Report objects, Sound.
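To illustrate why data types matter, the following sketch converts a raw record whose fields all arrive as strings into typed values. The field names and values are hypothetical, and the date format is an assumption.

```python
from datetime import datetime

# Hypothetical raw record: every field arrives as a string.
raw = {"name": "Bank ABC", "age": "42", "rate": "7.25",
       "opened": "2021-03-15 09:30", "active": "true"}

record = {
    "name": raw["name"],                      # string
    "age": int(raw["age"]),                   # numeric: integer
    "rate": float(raw["rate"]),               # numeric: float
    "opened": datetime.strptime(raw["opened"], "%Y-%m-%d %H:%M"),  # date/time
    "active": raw["active"].lower() == "true",                     # Boolean
}

# The same "+" behaves differently depending on the type.
as_text = raw["age"] + "1"      # string concatenation
as_number = record["age"] + 1   # integer addition
print(as_text, as_number)
```

With the wrong type, an operation can succeed silently and still give a meaningless result, which is why field types are worth checking before any calculation.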
Understanding Data – Data Types Exercise
Data Issues – Types of Data Issues
Data Issues: Dirty Data, Missing Data, Outliers
Data Issues – Dirty Data
Dirty data contains errors of some kind, or is in a format that is unfriendly or unusable
Data Issues – Dirty Data: Parsing Data (Example)
Data Issues – Dirty Data: Extra Characters
Extra characters can be currency symbols, number signs, and so on. We need to remove these before changing between field types
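Removing extra characters before a type change can be sketched as follows. This assumes a comma is a thousands separator and a period is the decimal point; data using the opposite convention would need a different rule. The function name and sample values are hypothetical.

```python
import re

def to_number(text):
    """Drop everything except digits, minus sign and decimal point, then convert."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)

# Hypothetical values with currency symbols, number signs and stray text.
prices = ["$1,250.50", "#300", " 75 USD"]
numbers = [to_number(p) for p in prices]
print(numbers)
```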
Data Issues – Dirty Data: Extra Characters (Example)
Data Issues – Dirty Data: Duplicate Data - Example
Duplicate records can end up in your dataset because of a manual mistake or some kind of program error. The remedy is de-duping.
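De-duping can be sketched as keeping the first occurrence of each record, identified by a chosen set of key fields. The records and the `dedupe` helper are hypothetical.

```python
# Hypothetical customer records; the third row repeats the first.
records = [
    {"customer_id": 1, "name": "An"},
    {"customer_id": 2, "name": "Binh"},
    {"customer_id": 1, "name": "An"},
]

def dedupe(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

deduped = dedupe(records, ["customer_id", "name"])
print(len(records), "->", len(deduped))
```

Which fields make up the key is a judgment call: too few fields can merge distinct records, and too many can let near-duplicates through.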
Data Issues – Missing data
Missing data: gaps in data
Blank/empty cells (CSV), Null values (database), N/A (program)
BIAS in statistics refers to the tendency of an analysis to either over- or underestimate the values of that specific field or parameter
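A small sketch of how missing values can bias an estimate. The income figures are hypothetical, and the scenario (the two highest values go missing) is an assumption chosen to produce a downward bias.

```python
from statistics import mean

true_incomes = [40, 55, 60, 70, 75]      # hypothetical "real" values
recorded = [40, 55, None, 70, None]      # two of them went missing

true_mean = mean(true_incomes)

# Treating blanks as 0 drags the average down.
as_zero = mean(v if v is not None else 0 for v in recorded)

# Ignoring blanks is still biased here, because the missing values
# are not a random subset of the data.
observed_only = mean(v for v in recorded if v is not None)

print(true_mean, as_zero, observed_only)
```

Both estimates fall below the true mean, so both underestimate the field.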
Data Issues – Missing data (Example)
Real Data
Downward BIAS
Data Issues – Solutions for Missing data
SOLUTIONS
1. Deleting Missing Data
2. Imputation
3. Advanced methods
Data Issues – Missing data: Deleting Missing Data
Deleting missing data is often the default method because of its simplicity. No decisions need to be made that might confuse the data: you just get rid of records where there are missing values.
However, you should make sure that deleting missing data doesn't have adverse effects on your analysis. For example, if a particular demographic tended to leave a response blank in a survey, then removing records with blank entries will mean that a part of the population is underrepresented.
One of the downsides is that eliminating missing data reduces the size of the dataset (Ex: cost).
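Deleting records with missing values can be sketched as below. The rows are hypothetical; the point is to check how much of the dataset is lost before committing to deletion.

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 45, "income": None},
    {"age": 29, "income": 39000},
]

# Keep only records in which every field is present.
complete = [r for r in rows if all(v is not None for v in r.values())]

dropped_share = 1 - len(complete) / len(rows)
print(len(complete), dropped_share)
```

Here half of the records disappear, which shows how quickly deletion can shrink a dataset.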
(Example)
Red colors: Age & Income are Strings => Check in Meta Data
Effect of Deletion on Model (Example)
Raw data vs. data after deleting missing records
Data Issues – Missing data: Imputation
In statistics, imputation is the process of substituting values in the data where values are missing (when we impute values, we are making them up). We are creating fake data in order to develop a model that makes sense and is as close to reality as we can get it.
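One simple form of imputation is to replace each missing value with the mean of the observed values. This is a sketch of that one technique, not necessarily the method used in the slide examples; the ages are hypothetical.

```python
from statistics import mean

ages = [34, None, 45, 29, None, 40]   # hypothetical field with gaps

observed = [a for a in ages if a is not None]
fill_value = mean(observed)           # mean of the values we do have

# Substitute the fill value wherever the original is missing.
imputed = [a if a is not None else fill_value for a in ages]
print(imputed)
```

Mean imputation keeps the dataset size and the field's average, but it understates the spread of the data, since every filled-in value sits exactly at the mean.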
Data Issues – Missing data: Imputation (Example)
Data Issues – Missing data: Advanced methods
If your business and results could be significantly off when using a simpler method, you might want to explore these options:
• Missing values aren’t actually replaced; they’re handled within the modeling process itself
• Blend models together
Data Issues – Missing data: Selecting the method
Which methodology might be the best approach depends on:
1. How much data is really missing? (>=80%)
2. How is the missing data distributed across the dataset? (e.g. missing in 2 of 10 predictor variables)
3. Are those specific variables actually significant to our analysis and model-making process?
4. Is the missing data numeric or categorical?
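The first two questions can be answered by measuring the share of missing values per field. In this sketch the rows are hypothetical, and reading the 80% figure as a cut-off for "mostly missing" is an assumption.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "income": None, "segment": "A"},
    {"age": None, "income": None, "segment": "B"},
    {"age": 45, "income": None, "segment": None},
    {"age": 29, "income": 39000, "segment": "A"},
    {"age": 51, "income": None, "segment": "B"},
]

def missing_share(rows):
    """Fraction of rows in which each field is missing."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

shares = missing_share(rows)

# Assumed cut-off: a field missing in 80% or more of the rows.
mostly_missing = [c for c, share in shares.items() if share >= 0.8]
print(shares, mostly_missing)
```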
Data Issues – Outliers
Identifying outliers in the data helps us understand how vulnerable our model would be to a small
set of observations.
Data Issues – Outliers: Identify
Identifying outliers more methodically rather than simply eyeballing them
Violin Plot: shows the volume of the distribution
Others: z-scores or standard deviations
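A z-score check can be sketched as follows. The ages are hypothetical, and the cut-off of 2 standard deviations is an assumption; other cut-offs are also used in practice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [25, 31, 28, 35, 30, 27, 33, 29, 32, 299]   # hypothetical ages

# Flag values more than 2 standard deviations from the mean (assumed cut-off).
flagged = [v for v, z in zip(ages, z_scores(ages)) if abs(z) > 2]
print(flagged)
```

Note that an extreme value inflates the standard deviation it is measured against, so z-scores can under-flag outliers in small samples.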
Data Issues – Outliers: Identify
If a value lies more than 1.5 times the INTERQUARTILE RANGE (IQR) below the first quartile or above the third quartile of a data set, then it can be considered an OUTLIER
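The interquartile range rule can be sketched with the standard library. The ages are hypothetical, and quartiles are computed with the default method of `statistics.quantiles`; other quartile conventions may give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [25, 31, 28, 35, 30, 27, 33, 29, 32, 299]   # hypothetical ages
outliers = iqr_outliers(ages)
print(outliers)
```

Because quartiles are barely affected by a single extreme value, this rule is less sensitive to the outlier itself than a check based on the mean and standard deviation.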
Data Issues – Outliers: Identify – Example with Excel
Add-ins in Excel (Real Statistics Using Excel): http://www.real-statistics.com/free-download/
Data Issues – Outliers: Dealing with outliers
1 & 2. ERRORS (Ex: Age: 299)
1. Try to go back to the original source to determine the correct data
2. Delete the record from the dataset
3. Don’t have obvious errors, but we aren’t certain whether the data is accurate or not

Data Issues – Outliers: Effect of outliers & Dealing with outliers
If a value could be correct and is just abnormal, then the analysis and modeling process SHOULD INCLUDE that data. That said, it is legitimate to create models without the data as well to compare results, but it should be noted which models do and don’t contain the outliers (2 options: include outliers vs exclude outliers).
If the outliers didn’t change the results and the regression line retained its original slope, then it can be legitimate to remove that observation.
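Comparing a model with and without an outlier can be sketched by fitting a least-squares slope both ways. The ages and incomes are hypothetical, and the last point is deliberately extreme.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical data; the last observation is the outlier.
ages = [25, 30, 35, 40, 45, 90]
incomes = [50, 42, 55, 47, 51, 200]

with_outlier = slope(ages, incomes)
without_outlier = slope(ages[:-1], incomes[:-1])
print(round(with_outlier, 2), round(without_outlier, 2))
```

If the two slopes are close, the outlier is not driving the result. Here they differ widely, so any model built on this data should state whether the outlier was included.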
Data Issues – Outliers: Dealing with outliers
4. Truncation: where we know that a certain value can only be below a given maximum and yet a value is reported above that.
NOTE: We see here that age and income are fairly random, with no association between how old a person is and how much income they have. But the outlier creates the slope of the line just by being present: without outlier 1 (row 10), we have a steep positive slope, but without outliers 2 (row 14) and 3 (row 15), we have a negative slope.
In other words, without the outlier we wouldn’t really be able to draw a legitimate line at all; the presence of the outlier is what creates the model effect. In cases such as this, we should definitely remove the outlier and investigate other predictor variables.
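One common reading of truncation is to cap any value reported above the known maximum at that maximum. This sketch assumes that reading; the scores and the 0 to 100 scale are hypothetical.

```python
def truncate(values, maximum):
    """Cap every value at the known maximum."""
    return [min(v, maximum) for v in values]

scores = [72, 88, 100, 104, 95]   # hypothetical scores on a 0-100 scale
capped = truncate(scores, maximum=100)
print(capped)
```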
Data Formatting
• How to identify when your data needs to be formatted.

• How to massage data into the correct format
• How to aggregate it to the form required
1. Transposing
2. Aggregating Data
3. Cross Tabulation
Data Formatting - Transposing
Data Formatting – Transposing - Example
Data Formatting - Aggregating Data
Data Formatting - Aggregating Data - Example
Data Formatting - Cross Tabulation
Data Formatting - Cross Tabulation - Example
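The three formatting operations (transposing, aggregating, cross tabulation) can be sketched on a small hypothetical sales table. The field names and figures are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "North", "product": "Loan", "amount": 100},
    {"region": "North", "product": "Card", "amount": 40},
    {"region": "South", "product": "Loan", "amount": 70},
    {"region": "North", "product": "Loan", "amount": 30},
]

# Transposing: swap the rows and columns of a small table.
table = [["region", "North", "South"],
         ["amount", 170, 70]]
transposed = [list(column) for column in zip(*table)]

# Aggregating: total amount per region.
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["amount"]

# Cross tabulation: regions as rows, products as columns, summed amounts as cells.
crosstab = defaultdict(lambda: defaultdict(int))
for row in sales:
    crosstab[row["region"]][row["product"]] += row["amount"]

print(transposed, dict(totals), crosstab["North"]["Loan"])
```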
Data Blending
Data may come from different places, and as a result, it will all need to be stitched together into one data file
Data Blending – Unions
Unioning allows you to take multiple datasets and deal with them as one
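A union can be sketched as stacking datasets that share the same columns. The monthly datasets and the `union` helper are hypothetical; requiring identical columns is a design choice of this sketch.

```python
# Hypothetical monthly datasets with the same columns.
january = [{"customer_id": 1, "amount": 100}, {"customer_id": 2, "amount": 50}]
february = [{"customer_id": 3, "amount": 75}]

def union(*datasets):
    """Stack datasets row-wise, insisting that all rows share the same columns."""
    columns = set(datasets[0][0].keys())
    combined = []
    for data in datasets:
        for row in data:
            if set(row.keys()) != columns:
                raise ValueError("datasets must share the same columns")
            combined.append(row)
    return combined

all_rows = union(january, february)
print(len(all_rows))
```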
Data Blending – Joining Datasets
Data Blending – Fuzzy Matching
Fuzzy Matching will enable you to join 2 data sets
together where a regular join may fail. The Fuzzy
Match identifies records with similar string values
in specified fields.
Fuzzy Matching uses algorithms to score how

similar 2 words or phrases are.
Fuzzy Matching Algorithms

Jaro: The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions.
Levenshtein: The Levenshtein algorithm counts the
number of edits (insertions, deletions, or
substitutions) needed to convert one string to the
other.
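The Levenshtein count described above can be sketched with the standard dynamic-programming approach, keeping only one previous row of the edit table at a time.

```python
def levenshtein(a, b):
    """Number of insertions, deletions or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A smaller count means the strings are more similar; a count of 0 means they are identical.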
Data Blending – Fuzzy Matching - Example
It looks at these words and calculates a closeness-of-match score based on the similarity of these words.
The match threshold is the minimum score that the fuzzy matching must achieve for a pair to be considered a match
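Scoring and thresholding can be sketched with `difflib.SequenceMatcher` from the Python standard library. Note that its similarity ratio is neither Jaro nor Levenshtein; it stands in here only to show how a match threshold works. The names, the 0.8 threshold and the `fuzzy_join` helper are hypothetical.

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Similarity between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_join(left, right, threshold=0.8):
    """Pair each left value with its best right match if the score meets the threshold."""
    matches = []
    for name in left:
        best = max(right, key=lambda candidate: match_score(name, candidate))
        if match_score(name, best) >= threshold:
            matches.append((name, best))
    return matches

# Hypothetical names from two systems that a regular join would fail to match.
crm = ["Jon Smith", "Acme Corp.", "Nguyen Van An"]
billing = ["John Smith", "ACME Corp", "Tran Thi B"]
pairs = fuzzy_join(crm, billing)
print(pairs)
```

Raising the threshold gives fewer but safer matches; lowering it catches more variants at the risk of false matches.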
Data Blending – Spatial Matching
Types of Spatial Data
All of these location data examples are represented by points, lines, or polygons
Points: A point, also referred to as a centroid, is in the form of a latitude and longitude which we use to pinpoint its exact location.
Lines: A line is a string of latitude and longitude locations.
Polygons: Polygons are made up of a series of longitude and latitude coordinates defining all of the vertices of a region.
Data Blending – Spatial Blending
There aren’t fields that can be used to join them together
Gray area: to find out how many customers fall within a store trade area, match them to the area and assign a store number to them
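Assigning customers to stores by location can be sketched as follows. This assumes each trade area is a circle of fixed radius around the store, whereas a real trade area may be an arbitrary polygon. The store IDs, customer IDs, coordinates and 5 km radius are all hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical store and customer points as (latitude, longitude).
stores = {"S01": (10.7769, 106.7009), "S02": (21.0278, 105.8342)}
customers = {"C1": (10.7800, 106.6950),
             "C2": (21.0300, 105.8500),
             "C3": (16.0544, 108.2022)}

def assign_store(customers, stores, radius_km=5.0):
    """Give each customer the nearest store number, or None if outside every trade area."""
    assignment = {}
    for cid, (clat, clon) in customers.items():
        nearest, distance = min(
            ((sid, haversine_km(clat, clon, slat, slon))
             for sid, (slat, slon) in stores.items()),
            key=lambda pair: pair[1])
        assignment[cid] = nearest if distance <= radius_km else None
    return assignment

assigned = assign_store(customers, stores)
print(assigned)
```

Once each customer carries a store number, the two datasets can be joined on that field like any other key.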
Data Blending – Spatial Blending - Example
Customer Information
Spatial Data
LESSON 1: WHAT SHOULD WE DO WITH DATA?
THANK YOU
