Week-2 BA Data Preprocessing
Week-2
Business Analytics
Types of Analytics
• Descriptive Analytics tells you what happened in the past.
• Diagnostic Analytics helps you understand why something happened in the past.
• Predictive Analytics predicts what is most likely to happen in the future.
• Prescriptive Analytics recommends actions you can take to affect those outcomes.
Tools of Descriptive & Predictive Analytics
The spectrum of BA
What Does Data Mining Do?
How Does it Work?
DM extracts patterns from data
Pattern?
A mathematical (numeric and/or symbolic)
relationship among data items
Types of patterns
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships
Data -> Information -> Knowledge -> Understanding -> Wisdom !!!
Predictive Modeling
Sample dataset 1:
4 attributes and an outcome (play?)
Prediction methods
Challenge: Learn how the attributes of given/known
data relate to the outcome.
Discover relationships between input attributes and a
target attribute (outcome or class label) in the given
data-set.
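One very simple way to "discover relationships between input attributes and a target attribute" is a one-rule (1R-style) learner: for each attribute, predict the majority outcome for each of its values and keep the attribute that makes the fewest mistakes. The sketch below is only an illustration of that idea. The rows are toy data in the style of a four-attribute "play?" table and are an assumption, not a copy of the slide's dataset; the names `ROWS`, `ATTRIBUTES`, and `one_rule` are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative toy rows (assumption: NOT copied from the slide's table).
# Columns: outlook, temperature, humidity, windy, play (the outcome).
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_rule(rows, attr_index):
    """Map each attribute value to its majority outcome and count training errors."""
    outcomes_by_value = defaultdict(Counter)
    for row in rows:
        outcomes_by_value[row[attr_index]][row[-1]] += 1

    rule = {}
    errors = 0
    for value, counts in outcomes_by_value.items():
        majority, majority_count = counts.most_common(1)[0]
        rule[value] = majority
        # Every row that disagrees with the majority outcome is an error.
        errors += sum(counts.values()) - majority_count
    return rule, errors


# Build one rule per attribute, then keep the attribute with the fewest errors.
candidates = {name: one_rule(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = min(candidates, key=lambda name: candidates[name][1])
best_rule, best_errors = candidates[best_attribute]

print(best_attribute, best_rule, best_errors)
```

Real prediction methods (decision trees, regression, neural networks) are far richer, but they answer the same question: which input attributes best explain the outcome?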
Sample dataset 2:
No outcome, just items
Association Rule Discovery:
Application
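Association rules are usually scored with two measures: support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a single rule. The baskets are invented for illustration and the function names are hypothetical.

```python
# Hypothetical market baskets (assumption: invented for illustration only).
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = count_containing(antecedent | consequent, baskets)
    return both / count_containing(antecedent, baskets)


# Score the candidate rule {diapers} -> {beer}.
rule_support = support({"diapers", "beer"}, BASKETS)
rule_confidence = confidence({"diapers"}, {"beer"}, BASKETS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

An association-rule miner such as Apriori searches for all rules whose support and confidence exceed chosen thresholds; this sketch only scores one rule.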
Sample dataset 3:
Data about some entities
There is no outcome and only input data
is available.
For example:
Which brings in more customers: a $5 coupon or
a 25% discount?
Is this a spam email? Yes or No?
Is the person sick? Yes or No?
Classification
In this case, three different groups (classes) of items were found among all of the items in the data set.
Clustering: Application
Market Segmentation:
Goal: subdivide a market into distinct
subsets of customers.
Approach:
Collect different attributes of customers based
on their geographical and lifestyle related
information.
Find clusters of similar customers.
Observe buying patterns of customers in same
cluster vs. those from different clusters.
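The "find clusters of similar customers" step can be sketched with k-means, one common clustering algorithm (the slides do not name a specific method, so this choice is an assumption). The customer points, the two features, and the deterministic "first k points" initialization are all hypothetical and chosen only to keep the example small and repeatable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Plain k-means; starts from the first k points so runs are repeatable."""
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (monthly visits, average basket size), already scaled.
customers = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

Once each customer has a cluster label, buying patterns can be compared within a cluster and across clusters, as the approach above describes.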
Some applications
• Financial Analytics
• Europcar, the leading rental car company in Europe, uses forecasting
models, simulation and optimization to predict demand, assess risk,
and optimize the use of its fleet. Its models are implemented via a
decision support system used in nine countries in Europe and have
led to higher utilization of its fleet, decreased costs, and increased
profitability.
• HR Analytics
• Google has analyzed substantial data on their own employees to
determine the characteristics of great leaders, to assess factors that
contribute to productivity, and to evaluate potential new hires.
Google also uses predictive analytics to continually update their
forecast of future employee turnover and retention.
• Marketing Analytics
• Turner Broadcasting System Inc. uses forecasting and optimization
models to create more-targeted audiences and to better schedule
commercials for its advertising partners. The use of these models
has led to an increase in Turner's year-over-year advertising revenue
of 186% and, at the same time, dramatically increased sales for the
advertisers. Those advertisers that chose to benchmark found an
increase in sales of $118 million.
• Web Analytics
• Web analytics has huge implications for promoting and selling products and
services via the Internet. Leading companies apply descriptive and
advanced analytics to data collected in online experiments to
determine the best way to configure web sites, position ads, and
utilize social networks for the promotion of products and services.
EXERCISE
Data from USA locations: 9 variables from 300+
metropolitan areas.
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
CRISP-DM
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining.
The CRISP-DM methodology provides a structured approach to planning a data mining project.
Continuous Attribute
– Has real numbers as attribute values, e.g. 3.45, 19.5
– Examples: temperature, height, or weight.
– Typically represented as floating-point (decimal)
variables.
Missing Data
Outliers
Missing Data
• Missing data are values that are “not available” in a given row or column.
• Causes range from gaps in a sequence and incomplete features to missing
files, incomplete information, and data entry errors.
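Two common ways of handling missing values are dropping incomplete rows and filling gaps with the column mean. The sketch below shows both on a tiny table; the records, column names, and helper functions are hypothetical, and `None` is assumed to be the marker for a "not available" value.

```python
from statistics import mean

# Hypothetical records; None marks a "not available" value (assumption).
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 50000},
    {"age": 35, "income": None},
    {"age": 30, "income": 48000},
]


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [row for row in rows if None not in row.values()]


def impute_mean(rows, column):
    """Return new rows with missing entries in column replaced by the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


complete_rows = drop_incomplete(records)
imputed = impute_mean(impute_mean(records, "age"), "income")

print(len(complete_rows), "complete rows out of", len(records))
print(imputed)
```

Dropping rows is simple but discards data; mean imputation keeps every row but shrinks the column's variance, so the right choice depends on how much is missing and why.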
1. Missing values Techniques
2. Outliers or Anomalies
– An observation far away from other observations.
– In statistics, outliers are data points that don’t belong to a certain population.
– An observation that diverges from otherwise well-structured data.
Do you see the outlier in this list:
– [20, 24, 22, 19, 29, 18, 4300, 30, 18]
Applications: Outlier analysis
IQR Rule
Anything above Q3 + (1.5 × IQR) or below Q1 - (1.5 × IQR) is an outlier, where IQR = Q3 - Q1 is the interquartile range.
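The IQR rule can be checked against the list shown earlier, [20, 24, 22, 19, 29, 18, 4300, 30, 18]. The sketch below uses Python's standard `statistics.quantiles` to get the quartiles. Quartile conventions differ between tools, so the exact fence values may vary slightly elsewhere; the function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles

values = [20, 24, 22, 19, 29, 18, 4300, 30, 18]


def iqr_outliers(data):
    """Return the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _median, q3 = quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


outliers = iqr_outliers(values)
print(outliers)  # 4300 is the only value flagged
```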
Exercise