
DATA ANALYTICS PROGRAM
Learning Schedule

1. Introduction to Data Analytics
2. Business Context for Data Analytics
3. Statistical Analysis of Data
4. Programming Fundamentals (R/Python)
5. SQL - Structured Query Language
6. Business Analytics with Excel
7. Data Analytics with R/Python
8. Data Visualization
9. Holistics/BigQuery/Tableau
10. Predictive Analytics 1 - Machine Learning
11. Predictive Analytics 2 - Deep Learning
12. Data Analytics Capstone Project


Introduction to DATA ANALYTICS

[Figure: “Data Analytics”, past 5 years: Worldwide and Vietnam]
INTRODUCTION TO DATA ANALYTICS lessons

01 What should we do with Data?
1. Data Analysis vs Data Analytics
2. Understanding Data
3. Data issues
4. Data formatting
5. Data Blending

02 Data Analytics Basics
1. Think like a Data Scientist
2. Do you need all that data?
3. Importance of Segmentation of your Analytics
4. Know the difference between your data & your metrics
5. Can your data be trusted?
6. Pitfalls of data-driven decisions
7. Why it’s so hard for us to communicate uncertainty

03 Data Analytics Framework
1. Approach Frameworks
2. CRISP-DM Frameworks
3. Strategic Roadmap

04 Getting Insights from Data
1. Data Analytics Process
2. Data Quality
3. Descriptive
4. Predictive
5. Prescriptive
6. Semantic
Lesson 1
WHAT SHOULD WE DO WITH DATA?
Lesson 1: What should we do with Data?
Data Analysis vs Data Analytics

DATA: Data, in the information age, are a large set of bits encoding numbers, texts, images, sounds, videos, and so on.

ANALYTICS: The science that analyzes crude data to extract useful knowledge (patterns) from them. Analytics gives you insights & attempts to answer questions.

ANALYSIS: Analysis provides you with information & raises questions.

Source: A General Introduction to Data Analytics, Wiley & ChartMogul


Data Analysis vs Data Analytics - Example

5 months ago, Bank ABC lost a total of 10.200 billion VND of its loan portfolio to attrition.
(Ending Loan portfolio = Beginning Loan + New loan – Attrition – Maturity)

Top 4 reasons for attrition at the bank:
(1) Dissatisfaction with services (50%)
(2) Lower rates at other banks (30%)
(3) Switching to another loan package within the bank (10%)
(4) Death (10%)
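The roll-forward formula and the split of attrition by reason can be checked with a few lines of arithmetic. This is only a sketch: the function name and all amounts below are hypothetical, chosen for illustration, and only the formula and the percentage shares come from the slide.

```python
def ending_loan_portfolio(beginning, new_loans, attrition, maturity):
    """Ending Loan portfolio = Beginning Loan + New loan - Attrition - Maturity."""
    return beginning + new_loans - attrition - maturity


# Hypothetical amounts (billion VND), NOT figures from Bank ABC.
ending = ending_loan_portfolio(beginning=100.0, new_loans=20.0,
                               attrition=10.0, maturity=5.0)

# Shares of attrition by reason, in percent, as listed on the slide.
reason_share_pct = {
    "Dissatisfaction with services": 50,
    "Lower rates at other banks": 30,
    "Switching to another loan package": 10,
    "Death": 10,
}

# Split the (hypothetical) attrition amount across the reasons.
attrition_by_reason = {
    reason: 10.0 * pct / 100 for reason, pct in reason_share_pct.items()
}

print(ending)
print(attrition_by_reason)
```

The first number only describes what happened to the portfolio; the breakdown by reason is what starts to explain why.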
Understanding Data – Categories of Data
Understanding Data – Data Sources

Computer files Database Web-based


Understanding Data – Importance of Data Types
Understanding Data – Data Types

String: String data can be declared in a number of different ways depending on the character set required and the anticipated length of the string. It can hold any kind of characters: alphanumeric, including symbols.

Numeric: Numeric data are numbers, which can be whole numbers, such as integers, or numbers with decimal places. Types: Byte, Integer, Fixed Decimal, Float, Double.

Date/time: Date/time contains a specific date, or a combination of both date and time.

Boolean: The Boolean type is sometimes also called a logical type and is a conditional flag representing either true or false.

Other: Images, Maps, Report objects, Sound.
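To see why the declared type matters, here is a small sketch that converts one record from all-string values into typed values. The field names and values are hypothetical, and the mapping to Python types (`int`, `Decimal`, `date`, `bool`) is one possible choice rather than the only one.

```python
from datetime import datetime
from decimal import Decimal

# A hypothetical record as it might arrive from a text file: everything is a string.
raw = {"name": "Lan", "age": "34", "balance": "1250.75",
       "joined": "2021-03-15", "active": "true"}

typed = {
    "name": raw["name"],                                            # String
    "age": int(raw["age"]),                                         # Numeric: Integer
    "balance": Decimal(raw["balance"]),                             # Numeric: Fixed Decimal
    "joined": datetime.strptime(raw["joined"], "%Y-%m-%d").date(),  # Date
    "active": raw["active"].lower() == "true",                      # Boolean
}

# The same "+" means different things depending on the type.
print(raw["age"] + "1")    # string concatenation
print(typed["age"] + 1)    # integer addition
```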
Understanding Data – Data Types Exercise
Data Issues – Types of Data Issues

Data Issues:
1. Dirty data
2. Missing data
3. Outliers
Data Issues – Dirty Data

Dirty data contains errors of some kind, or is in a format that’s unfriendly or unusable.
Data Issues – Dirty Data: Parsing Data (Example)
Data Issues – Dirty Data: Extra Characters

Extra characters can be currency symbols, number signs, and so on. We need to remove these before converting a field from one type to another (for example, from string to numeric).
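A minimal sketch of stripping extra characters before a type conversion. It assumes "." is the decimal separator and "," is only a thousands separator; the helper name `to_number` and the sample values are hypothetical.

```python
import re


def to_number(text):
    """Drop everything except digits, the decimal point and a minus sign,
    then convert the cleaned string to a float."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)


prices = ["$1,200.50", "#45", " 300 USD"]
numbers = [to_number(p) for p in prices]
print(numbers)
```

Data that uses "," as the decimal separator would need a different rule, so the cleaning step should be checked against the source format first.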
Data Issues – Dirty Data: Extra Characters (Example)
Data Issues – Dirty Data: Duplicate Data - Example

Duplicate records can end up in your dataset because of a manual mistake or some kind of program error => de-duping
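De-duping can be sketched as keeping the first record seen for each key. Which fields identify a duplicate is an assumption that has to be made per dataset; the records and the `dedupe` helper below are hypothetical.

```python
def dedupe(records, key_fields):
    """Keep the first record for each combination of key_fields, in original order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


customers = [
    {"id": 1, "name": "An"},
    {"id": 2, "name": "Binh"},
    {"id": 1, "name": "An"},   # entered twice by mistake
]
clean = dedupe(customers, key_fields=["id", "name"])
print(clean)
```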
Data Issues – Missing data

Missing data: gaps in the data

Blank/empty cells (CSV) | Null value (Database) | N/A (program)

BIAS in statistics refers to the tendency of an analysis to either overestimate or underestimate the value of a specific field or parameter.
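A small worked example of bias caused by missing data. The incomes below are hypothetical; the point is only that when the missing value is not random (here the highest earner did not answer), the estimate is pulled in one direction.

```python
from statistics import mean

# Hypothetical "real" incomes, and the same data with the highest value missing.
real_incomes = [30, 40, 50, 60, 120]
reported_incomes = [30, 40, 50, 60, None]

observed = [x for x in reported_incomes if x is not None]

real_mean = mean(real_incomes)
observed_mean = mean(observed)
bias = observed_mean - real_mean   # negative value = downward bias

print(real_mean, observed_mean, bias)
```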
Data Issues – Missing data (Example)

Real Data

Downward BIAS
Data Issues – Solutions for Missing data

SOLUTIONS
1. Deleting Missing Data
2. Imputation
3. Advanced methods
Data Issues – Missing data: Deleting Missing Data

Deleting Missing Data


Deleting missing data is often the default method because of its simplicity. No decisions need to be made that might confuse the data. You just get rid of records where there are missing values.

However, you should make sure that deleting missing data doesn't have adverse effects on your analysis. For example, if a particular demographic tended to leave a response blank in a survey, then removing records with blank entries will mean that part of the population is underrepresented.

One of the downsides is that eliminating missing data reduces the size of the dataset (Ex: cost).
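The underrepresentation effect can be sketched by deleting incomplete records and comparing each group's share before and after. The survey rows and the `group_share` helper are hypothetical.

```python
from collections import Counter

# Hypothetical survey: group B tends to leave "income" blank.
rows = [
    {"group": "A", "income": 50},
    {"group": "A", "income": 60},
    {"group": "A", "income": 55},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
    {"group": "B", "income": 40},
]

# Deletion: keep only records with no missing values.
complete = [r for r in rows if all(v is not None for v in r.values())]


def group_share(data):
    """Fraction of records that belong to each group."""
    counts = Counter(r["group"] for r in data)
    return {group: n / len(data) for group, n in counts.items()}


print(group_share(rows))      # share of each group before deletion
print(group_share(complete))  # share of each group after deletion
```

In this made-up data, group B drops from half of the records to a quarter, and the dataset shrinks from six records to four.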
Data Issues – Missing data: Deleting Missing Data
(Example)

Red highlights: Age & Income are stored as strings => check the metadata
Data Issues – Missing data: Deleting Missing Data
Effect of Deletion on Model

Raw data Deleted missing data


Data Issues – Missing data: Deleting Missing Data
Effect of Deletion on Model (Example)
Data Issues – Missing data: Imputation

Imputation
In statistics, imputation is the process of substituting values in the data where values are missing. When we impute values, we are making them up: we create fake data in order to develop a model that makes sense and is as close to reality as we can get it.
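A minimal sketch of simple imputation, assuming a numeric field and a fill value taken from the observed data (mean or median). The `impute` helper and the ages are hypothetical; the slides do not prescribe a specific fill rule.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


ages = [25, 30, None, 45, None, 40]
print(impute(ages, "mean"))
print(impute(ages, "median"))
```

Every imputed value is a guess, so it is worth keeping a flag for which records were filled in.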
Data Issues – Missing data: Imputation (Example)
Data Issues – Missing data: Advanced methods

If using a simpler method could throw your business and results significantly off, you might want to explore these options:

• Missing values aren’t actually replaced; they’re handled within the modeling process itself
• Blend models together
Data Issues – Missing data: Selecting the method
Which method is the best approach depends on:
1. How much data is really missing? (>=80%)
2. How is the missing data distributed across the dataset? (e.g. 2 of 10 predictor variables have missing values)
3. Are those specific variables actually significant to our analysis and model-making process?
4. Is the missing data numeric or categorical?
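The first two questions can be answered by measuring the share of missing values per variable. The sketch below treats `None` and empty strings as missing; the rows are hypothetical, and reading the 80% figure as a flagging threshold is an assumption made here, not a rule stated on the slide.

```python
def missing_share(rows):
    """Fraction of missing values (None or empty string) for each column."""
    columns = rows[0].keys()
    n = len(rows)
    return {
        col: sum(1 for r in rows if r[col] in (None, "")) / n
        for col in columns
    }


rows = [
    {"age": 25,   "income": None, "city": "Hanoi"},
    {"age": None, "income": None, "city": "Hue"},
    {"age": 41,   "income": 70,   "city": "Danang"},
    {"age": 33,   "income": "",   "city": "Hanoi"},
    {"age": 52,   "income": None, "city": "Hue"},
]

share = missing_share(rows)

# Assumed reading of the 80% figure: flag variables that are mostly empty.
mostly_missing = [col for col, s in share.items() if s >= 0.8]

print(share)
print(mostly_missing)
```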
Data Issues – Outliers

Identifying outliers in the data helps us understand how vulnerable our model would be to a small
set of observations.
Data Issues – Outliers: Identify
Identify outliers methodically rather than by simply eyeballing them:
Violin Plot: shows the volume of the distribution
Others: z-scores or standard deviations
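A z-score check can be sketched as follows. The ages and the cut-off of 2.5 are hypothetical choices. Note that in a very small sample a single extreme value inflates the standard deviation, so its own z-score stays modest and a cut-off of 3 can miss it.

```python
from statistics import mean, stdev


def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


ages = [22, 25, 27, 30, 31, 33, 35, 38, 40, 299]
threshold = 2.5   # assumed cut-off, tune for your data

flagged = [v for v, z in zip(ages, z_scores(ages)) if abs(z) > threshold]
print(flagged)
```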

If a value lies more than 1.5 times the INTERQUARTILE RANGE below the first quartile or above the third quartile of a data set, then it can be considered an OUTLIER
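The interquartile range rule can be sketched with the standard library. The ages are hypothetical, and `statistics.quantiles` is used with its default method; other tools compute quartiles slightly differently, so the fences may not match Excel exactly.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


ages = [22, 25, 27, 30, 31, 33, 35, 38, 40, 299]
low, high = iqr_fences(ages)
outliers = [v for v in ages if v < low or v > high]

print(low, high)
print(outliers)
```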
Data Issues – Outliers: Identify – Example with Excel

Add-ins in Excel (Real Statistics Using Excel): http://www.real-statistics.com/free-download/
Data Issues – Outliers: Dealing with outliers

1 & 2. ERRORS (Ex: Age: 299)
1. Try to go back to the original source to determine the correct data
2. Delete the record from the dataset

3. The data doesn’t have obvious errors, but we aren’t certain whether it is accurate or not


Data Issues – Outliers:
Effect of outliers & Dealing with outliers

If a value could be correct but is just abnormal, then the analysis and modeling process SHOULD INCLUDE that data. That said, it is legitimate to create models without the data as well to compare results, but it should be noted which models do and don’t contain the outliers (2 options: include outliers vs exclude outliers).

If the outliers didn’t change the results and the regression line retained its original slope, then it can be legitimate to remove that observation.
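Comparing the two options (include outliers vs exclude outliers) can be sketched by fitting a least-squares line both ways and looking at the slope. The age and income values are hypothetical, and the `slope` helper implements ordinary simple linear regression, which is assumed here as the model.

```python
def slope(xs, ys):
    """Least-squares slope of y on x for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: income is roughly flat across ages...
ages = [20, 30, 40, 50, 60]
incomes = [30, 32, 29, 31, 30]

# ...plus one abnormal observation.
outlier_age, outlier_income = 25, 90

slope_without = slope(ages, incomes)
slope_with = slope(ages + [outlier_age], incomes + [outlier_income])

print(slope_without, slope_with)
```

If the two slopes are close, the outlier has little influence on the model. Here a single point changes the slope by more than an order of magnitude, so both models should be reported and labelled.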
Data Issues – Outliers: Dealing with outliers

4. Truncation
Where we know that a certain value can only be below a given maximum and yet a value is reported above that.

NOTE: We see here that age and income are fairly random, with no association between how old a person is and how much income they have. But the outlier creates the slope of the line just by being present: without outlier 1 (row 10), we have a steep positive slope, but without outliers 2 (row 14) and 3 (row 15), we have a negative slope.
In other words, without the outlier we wouldn’t really be able to draw a legitimate line at all; the presence of the outlier is what creates the model effect. In cases such as this, we should definitely remove the outlier and investigate other predictor variables.
Data Formatting

• How to identify when your data needs to be formatted.


• How to massage data into the correct format
• How to aggregate it to the form required

1. Transposing
2. Aggregating Data
3. Cross Tabulation
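The three formatting operations can be sketched on a tiny dataset. All table contents and field names below are hypothetical; a spreadsheet or a dataframe library would normally do this work.

```python
from collections import Counter, defaultdict

# 1. Transposing: rows become columns.
table = [
    ["region", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 7, 9],
]
transposed = [list(column) for column in zip(*table)]

# 2. Aggregating: total amount per region.
sales = [
    {"region": "North", "product": "A", "amount": 10},
    {"region": "North", "product": "B", "amount": 5},
    {"region": "South", "product": "A", "amount": 7},
    {"region": "North", "product": "A", "amount": 3},
]
totals = defaultdict(int)
for sale in sales:
    totals[sale["region"]] += sale["amount"]

# 3. Cross tabulation: number of sales for each region/product pair.
crosstab = Counter((sale["region"], sale["product"]) for sale in sales)

print(transposed)
print(dict(totals))
print(dict(crosstab))
```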
Data Formatting - Transposing
Data Formatting – Transposing - Example
Data Formatting - Aggregating Data
Data Formatting - Aggregating Data - Example
Data Formatting - Cross Tabulation
Data Formatting - Cross Tabulation - Example
Data Blending

Data may come from different places, and as a result, it’ll all need to be stitched together into one data file
Data Blending – Unions

Unioning allows you to take multiple datasets and deal with them as one
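A union can be sketched as stacking the rows of several datasets that share the same columns. The monthly datasets and the `union` helper are hypothetical; the column check is an assumption about what a safe union should require.

```python
def union(*datasets):
    """Stack rows from several datasets that have identical column names."""
    expected = set(datasets[0][0].keys())
    combined = []
    for dataset in datasets:
        for row in dataset:
            if set(row.keys()) != expected:
                raise ValueError("datasets must have the same columns")
            combined.append(row)
    return combined


january = [{"id": 1, "amount": 10}, {"id": 2, "amount": 15}]
february = [{"id": 3, "amount": 20}]

all_sales = union(january, february)
print(all_sales)
```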
Data Blending – Joining Datasets
Data Blending – Fuzzy Matching
Fuzzy Matching will enable you to join 2 data sets together where a regular join may fail. The Fuzzy Match identifies records with similar string values in specified fields.

Fuzzy Matching uses algorithms to score how similar 2 words or phrases are.

Fuzzy Matching Algorithms

Jaro: The Jaro algorithm measures the characters two strings have in common, counting characters as matching only when they are no more than half the length of the longer string apart, with consideration for transpositions.
Levenshtein: The Levenshtein algorithm counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other.
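The Levenshtein count of edits can be sketched with the usual dynamic-programming table, kept to two rows here. This is a textbook formulation for illustration, not the implementation of any particular fuzzy-matching tool.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))      # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("Jon Smith", "John Smith"))
```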
Data Blending – Fuzzy Matching - Example

It looks at these words and calculates a closeness-of-match score based on their similarity.

The match threshold is the minimum score a pair must achieve in fuzzy matching to be considered a match
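A match threshold can be sketched with any similarity score between 0 and 1. The example below uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in score; it is neither Jaro nor Levenshtein. The names and the threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a, b, threshold=0.8):
    """A pair is a match only if its score reaches the threshold."""
    return similarity(a, b) >= threshold


pairs = [("Jon Smith", "John Smith"), ("Jon Smith", "Mary Jones")]
for left, right in pairs:
    print(left, "|", right, "|", round(similarity(left, right), 2),
          "|", is_match(left, right))
```

Raising the threshold gives fewer false matches but misses more true ones, so it is usually tuned against a sample of manually checked pairs.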
Data Blending – Spatial Matching

Types of Spatial Data


All of these location data examples are represented by points, lines, or polygons.

Points: A point, also referred to as a centroid, is in the form of a latitude and longitude which we use to pinpoint its exact location.

Lines: A line is a string of latitude and longitude locations.

Polygons: Polygons are made up of a series of longitude and latitude coordinates defining all of the vertices of a region.
Data Blending – Spatial Blending

There aren’t fields that can be used to join the two datasets together.

Gray area: to find how many customers fall within a store trade area, match each customer to the area and assign a store number to them.
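Assigning a store number to each customer can be sketched as a point-in-polygon test. The version below uses the ray-casting rule on flat (x, y) coordinates, which ignores the curvature of the earth; that is an assumption that only holds for small areas. Store numbers, trade areas and customer locations are all hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                           # crossing lies to the right
                inside = not inside
    return inside


# Hypothetical trade areas: store number -> polygon vertices.
trade_areas = {
    101: [(0, 0), (4, 0), (4, 4), (0, 4)],
    102: [(5, 0), (9, 0), (9, 4), (5, 4)],
}
customers = {"c1": (1, 1), "c2": (6, 2), "c3": (20, 20)}

assigned = {}
for customer_id, location in customers.items():
    assigned[customer_id] = next(
        (store for store, area in trade_areas.items()
         if point_in_polygon(location, area)),
        None,   # customer falls outside every trade area
    )

print(assigned)
```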
Data Blending – Spatial Blending - Example

Customer Information

Spatial Data
LESSON 1: WHAT SHOULD WE DO WITH DATA?

THANK YOU
