Week-2 BA Data Preprocessing

Business Analytics
Week-2
Business Analytics
A data-driven decision-making approach that uses statistical and quantitative analysis, IT, and management science, along with data mining & fact-based data, to measure past business performance and to guide an organization in business planning and effective decision making.
BA: Broad View with associated
knowledge areas
We already understand this…

• Data used as an asset.
• Collection & analysis of massive data.
• Integral part of business & commitment to data-driven decisions.
• Informed, automated, optimized decisions.
• Gain insight, drive planning.
• BI architecture to provide the above…
Provides the ability to analyze…
• What is happening and why did something happen?
• Will it happen again? What will happen?
• What will happen if we make changes to some of the inputs?
• What is the data telling us that we were not able to see before?

It’s more about anticipated future needs!!!
Types of Analytics
• Descriptive Analytics tells you what happened in the past.
• Diagnostic Analytics helps you understand why something happened in the past.
• Predictive Analytics predicts what is most likely to happen in the future.
• Prescriptive Analytics recommends actions you can take to affect those outcomes.
Tools of Descriptive & Predictive Analytics
The spectrum of BA
What Does Data Mining Do?
How Does it Work?
• DM extracts patterns from data
• Pattern?
  A mathematical (numeric and/or symbolic) relationship among data items
• Types of patterns
  – Association
  – Prediction
  – Cluster (segmentation)
  – Sequential (or time series) relationships
What is Analytics & Data Mining?
Data mining is used to find patterns and relationships in data. (EDA = Exploratory Data Analysis)
Patterns can be analyzed via 2 types of models:
Descriptive : Describe patterns and create meaningful
subgroups or clusters.
Predictive : Forecast explicit values, based upon
patterns in known results.
Data -> Information -> Knowledge -> Understanding -> Wisdom !!!
Predictive Modeling
Sample dataset 1:
4 attributes and an outcome (play?)
Prediction methods
Challenge: Learn how the attributes of given/known
data relate to the outcome.
Discover relationships between input attributes and a
target attribute (outcome or class label) in the given
data-set.
Objective: form a description that can be used to predict unseen cases.
For new input data (X), you can then predict the outcome for that data.
Unseen example: sunny, cool, high, false… Play or not?
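The prediction task can be made concrete with a small Python sketch. It uses naive Bayes, which is only one of many possible prediction methods and is chosen here purely for illustration; the slides do not say which method is intended. The 14 rows are the classic weather/play example and are an assumption, since the slide's own table is not reproduced in this text. The names ROWS and predict are hypothetical.

```python
from collections import Counter

# Assumed training data (classic weather example, not taken from the slide):
# (outlook, temperature, humidity, windy) -> play
ROWS = [
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("rainy", "mild", "high", "false"), "yes"),
    (("rainy", "cool", "normal", "false"), "yes"),
    (("rainy", "cool", "normal", "true"), "no"),
    (("overcast", "cool", "normal", "true"), "yes"),
    (("sunny", "mild", "high", "false"), "no"),
    (("sunny", "cool", "normal", "false"), "yes"),
    (("rainy", "mild", "normal", "false"), "yes"),
    (("sunny", "mild", "normal", "true"), "yes"),
    (("overcast", "mild", "high", "true"), "yes"),
    (("overcast", "hot", "normal", "false"), "yes"),
    (("rainy", "mild", "high", "true"), "no"),
]


def predict(case):
    """Return the outcome with the highest naive Bayes score for `case`."""
    class_counts = Counter(label for _, label in ROWS)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(ROWS)  # prior: P(outcome)
        for position, value in enumerate(case):
            matches = sum(
                1
                for attrs, row_label in ROWS
                if row_label == label and attrs[position] == value
            )
            score *= matches / n_label  # P(attribute value | outcome)
        scores[label] = score
    return max(scores, key=scores.get)


# The unseen case from the slide: sunny, cool, high, false
print(predict(("sunny", "cool", "high", "false")))
```

With these assumed rows the sketch scores "no" slightly higher than "yes" for the unseen example. A different dataset or a different method could give a different answer.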
Sample dataset 2:
No outcome, just items
Association Rule Discovery:
Application
A rule discovered from sales data may be:
Bottled Water → Eggs
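A rule like this is usually judged by how often it holds in the data. The sketch below computes two common measures, support and confidence, for the rule on a handful of hypothetical transactions. The baskets, the function name rule_stats, and the choice of these two measures are assumptions for illustration, not part of the original slide.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total          # share of baskets containing both sides
    confidence = n_both / n_antecedent  # share of antecedent baskets that also hold the consequent
    return support, confidence


# Hypothetical shopping baskets (not real sales data)
baskets = [
    {"bottled water", "eggs", "bread"},
    {"bottled water", "eggs"},
    {"bottled water", "milk"},
    {"eggs", "milk"},
    {"bottled water", "eggs", "milk"},
]

support, confidence = rule_stats(baskets, {"bottled water"}, {"eggs"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In these made-up baskets, three of five contain both items and three of the four bottled-water baskets also contain eggs.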
Sample dataset 3:
Data about some entities
There is no outcome and only input data
is available.
The aim is now to find relationships, similarities and associations in the input.
Question: Is this A or B? uses classification algorithms
For example:
Which brings in more customers: a $5 coupon or
a 25% discount?
Is this a spam email? Yes or No?
Is the person sick? Yes or No?
Classification
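Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: data on previous customers (age, salary, profession, location, customer type) is used to train a classifier, which produces decision rules (e.g. Salary > 5 L, Prof. = Exec) that label a customer good or bad; the rules are then applied to the new applicant’s data.]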
Question: How much? or How many? uses regression algorithms
Regression algorithms make numerical predictions, such as:
What will the temperature be next Tuesday?
What will my fourth quarter sales be?
They help answer any question that asks for a
number.
Regression example
I plan to spend 63 million Euros on advertising in year 10.
What will be my sales?
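A question like this can be answered with a simple linear regression of sales on advertising. The sketch below fits a least-squares line and evaluates it at 63. The five data points are hypothetical (the slide's actual figures are not reproduced in this text), and fit_line is an illustrative helper name.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical past years: advertising spend and sales, both in million Euros
advertising = [10, 20, 30, 40, 50]
sales = [32, 48, 75, 88, 112]

slope, intercept = fit_line(advertising, sales)
forecast = intercept + slope * 63  # planned spend from the slide
print(f"sales = {intercept:.1f} + {slope:.2f} * advertising")
print(f"forecast for 63 million: {forecast:.1f}")
```

Note that 63 lies outside the range of the assumed data, so the forecast is an extrapolation and should be treated with caution.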
Question: How is this organized? uses clustering algorithms
Sometimes you want to understand the structure of a data set: how is this organized?
Here you don’t have examples that you already know outcomes for.
Clustering separates data into natural "clumps or groups" for easier interpretation.
Clustering student data
Data Mining Technique: Clustering
In this case, three different groups (classes) of items were found among all of the items in the data set.
Clustering: Application
Market Segmentation:
Goal: subdivide a market into distinct
subsets of customers.
Approach:
Collect different attributes of customers based
on their geographical and lifestyle related
information.
Find clusters of similar customers.
Observe buying patterns of customers in same
cluster vs. those from different clusters.
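The "find clusters of similar customers" step could be sketched with k-means, one common clustering algorithm; the slides do not name a specific algorithm, so this choice is an assumption. The customer attributes (age, monthly spend), the data values, and the starting centroids are all hypothetical.

```python
from math import dist
from statistics import fmean


def k_means(points, centroids, max_iter=100):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid
        new_centroids = [
            tuple(fmean(coord) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments are stable, stop
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (age, monthly spend)
customers = [(22, 30), (25, 35), (24, 28), (48, 90), (52, 95), (50, 88)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

In practice attributes are usually rescaled before clustering so that no single attribute dominates the distance, and the number of clusters has to be chosen by the analyst.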
Some applications
• Financial Analytics
• Europcar, the leading rental car company in Europe, uses forecasting
models, simulation and optimization to predict demand, assess risk,
and optimize the use of its fleet. Its models are implemented via a
decision support system used in nine countries in Europe and have
led to higher utilization of its fleet, decreased costs, and increased
profitability.
• HR Analytics
• Google has analyzed substantial data on their own employees to
determine the characteristics of great leaders, to assess factors that
contribute to productivity, and to evaluate potential new hires.
Google also uses predictive analytics to continually update their
forecast of future employee turnover and retention.
• Marketing Analytics
• Turner Broadcasting System Inc. uses forecasting and optimization
models to create more-targeted audiences and to better schedule
commercials for its advertising partners. The use of these models
has led to an increase in Turner year-over-year advertising revenue
of 186% and, at the same time, dramatically increased sales for the
advertisers. Those advertisers that chose to benchmark found an
increase in sales of $118 million.
• Web Analytics
• Has huge implications for promoting and selling products and
services via the Internet. Leading companies apply descriptive and
advanced analytics to data collected in online experiments to
determine the best way to configure web sites, position ads, and
utilize social networks for the promotion of products and services.
• And many more in healthcare, sports, supply chain, Govt sectors.

EXERCISE
Data from USA locations: 9 variables from 300+
metropolitan areas.
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
CRISP-DM
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining.
The CRISP-DM methodology provides a structured approach to planning a data mining project.
Understanding your data

Data Preprocessing
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes & Properties
There are different types of attributes
– Nominal
  • Examples: ID numbers, eye color, zip codes
– Ordinal
  • Examples: rankings (e.g., grades), height in {tall, medium, short}
– Scale
  • Numbers, qty, cost, counts, dates etc.

The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition/sub: + -
– Multiplication/div: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Scale: can perform arithmetic operations
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
  • Often represented as integer variables, e.g. 12, 67, 2020 etc.
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Note: binary attributes (0, 1) are a special case of
discrete attributes
Continuous Attribute
– Has real numbers as attribute values, e.g. 3.45, 19.5
– Examples: temperature, height, or weight.
– Typically represented as floating-point (decimal)
variables.
Introduction to Data Preprocessing
Data in the real world is dirty
– Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Missing Data
Outliers
Missing Data
• Missing data are defined as “not available
values” in a certain row/column.
• Can be anything from missing sequence,
incomplete feature, files missing,
information incomplete, data entry error
etc.
Missing data → Misleading outcome/decisions
1. Missing values Techniques
Ignore the record:
– This is usually done when the class label is missing. It works poorly when the percentage of missing values per attribute varies considerably.
Use the attribute mean to fill in the missing value:
– For example, take the average value of the income attribute and use it to replace the missing value for income.
Use the attribute median to fill in the missing value:
– The main difference between the mean and the median is that the median is robust to noise.
Use classification rules to fill in the missing value:
– Try to predict the missing value by using some prediction algorithm.
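Mean and median imputation can be sketched in a few lines of Python. The incomes below reuse the Taxable Income values (in thousands) from the example table, with two entries set to missing purely for illustration; impute is a hypothetical helper name.

```python
from statistics import fmean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = fmean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]


# Taxable income in thousands; the two None entries are assumed to be missing
incomes = [125, 100, None, 120, 95, 60, 220, None, 75, 90]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
print(mean_filled[2], median_filled[2])
```

The mean fill (110.625) is pulled upward by the single large income of 220, while the median fill (97.5) is not, which is the robustness to noise mentioned above.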
Deletion of missing data
Imputing missing values
Could fill in missing values with 0 or some arbitrary value as well.
Exercise:
Fill in missing values using
the:
– A) mean
– B) median
2. Outliers or Anomalies
An observation far away from other observations.
In statistics, outliers are data points that don’t
belong to a certain population.
– An observation that diverges from otherwise well-
structured data.
Do you see the outlier in this list:
– [20, 24, 22, 19, 29, 18, 4300, 30, 18]
Applications: Outlier analysis
We now have smart watches and wristbands that can detect our heartbeats every few minutes.
Detecting anomalies in the heartbeat data can help in predicting heart diseases.
Anomalies in traffic patterns can help in predicting accidents.
Outlier analysis can also be used to identify bottlenecks in network infrastructure and traffic between servers.
Other uses include intrusion detection in cyber security, fraud detection for credit cards, & fault detection in safety-critical systems.
Effect of outliers in data
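One way to see the effect is to compare summary statistics with and without the outlier. The sketch below uses the list shown earlier, [20, 24, 22, 19, 29, 18, 4300, 30, 18]; the variable names are illustrative.

```python
from statistics import fmean, median

data = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
without_outlier = [x for x in data if x != 4300]

# The mean is dragged far away from the bulk of the data by a single value
print(f"mean with outlier:      {fmean(data):.2f}")
print(f"mean without outlier:   {fmean(without_outlier):.2f}")

# The median barely moves
print(f"median with outlier:    {median(data)}")
print(f"median without outlier: {median(without_outlier)}")
```

The mean jumps from 22.5 to about 497.78 because of one value, while the median only moves from 21 to 22.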

2. Outlier Analysis: Method 1
A commonly adopted definition is based on the distance between each data point and the mean.
Usually, any observation more than three standard deviations above or below the mean is considered an outlier.
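The three-standard-deviation rule can be sketched as follows. The readings are hypothetical, the sample standard deviation is used (an assumption, since the slides do not say which variant is meant), and three_sigma_outliers is an illustrative name.

```python
from statistics import fmean, stdev


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = fmean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    lower, upper = mean - k * sd, mean + k * sd
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings: 20 ordinary values and one extreme value
readings = [68, 70, 72, 71, 69, 70, 73, 67, 70, 71,
            69, 72, 70, 68, 71, 70, 69, 72, 70, 71, 200]
print(three_sigma_outliers(readings))

# Small-sample caveat: the nine-value list shown earlier
small = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
print(three_sigma_outliers(small))
```

The first call flags 200. The second call flags nothing, even though 4300 is an obvious outlier: the extreme value inflates both the mean and the standard deviation, and with only nine points no value can lie more than about 2.67 sample standard deviations from the mean. For small samples the IQR rule is often the safer choice.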
2. Outlier Analysis: Method 2

Boxplots: A graphical depiction of numerical data
through their quantiles.
Lower and upper whiskers are boundaries of the
data distribution.
– Any data points that show above or below the whiskers,
can be considered outliers or anomalous.
Boxplot Anatomy
IQR Rule
Anything above Q3+(1.5 x IQR) or below Q1-(1.5 x IQR) is an outlier
Exercise
Descriptive statistics are provided for 20 data values.
Answer the following:
1. Using St Dev only, what are the values above
or below which a data point will be considered
an outlier?
2. Using IQR rule, what are the values above or
below which a data point will be considered
an outlier?
3. Is 97.6 an outlier? Is 89.4 an outlier?
IQR for outlier identification
Interquartile Range (IQR) is important because it is used to define the outliers.
It is the difference between the third quartile and the first quartile (IQR = Q3 − Q1).
Outliers in this case are defined as the observations that are below (Q1 − 1.5 × IQR), the boxplot lower whisker, or above (Q3 + 1.5 × IQR), the boxplot upper whisker.
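The IQR rule can be sketched in Python using the list shown earlier. Quartiles are computed with statistics.quantiles and its default method; other tools use slightly different quartile conventions, so the fences may differ a little elsewhere. The name iqr_fences is illustrative.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(f"fences: [{lower}, {upper}]")
print(f"outliers: {outliers}")
```

With this quartile convention Q1 = 18.5 and Q3 = 29.5, so IQR = 11 and the fences are 2.0 and 46.0; only 4300 falls outside them.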
