
Data Preparation and Exploration

DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero

Know Your Data!

“Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work.”

Daniel B. Wright


About Data


Acquiring Data
• Data acquisition may or may not be your concern within your organization
  • In large organizations, there may be teams devoted to extracting relevant information from the data warehouse
  • In smaller organizations, that work may fall to the data miner

• We rarely have an issue with too little data
  • Big Data refers to situations where datasets are so large they cannot be stored or analyzed using traditional methods
  • In instances where additional data would be helpful, it can often be acquired from operational systems or third parties

• Data is accessed in a variety of ways
  • Directly in the data warehouse (rare)
  • Extracted to a database management system (DBMS), e.g., Microsoft Access
  • Extracted to Microsoft Excel (very common)


Data Structure
• Data is almost always organized in tabular/matrix format

• Rows
• Tuples
• Observations

• Columns
• Variables
• Dimensions
• Features
• Inputs/Targets


Common Data Types – Non-Numeric Data


• Nominal (categorical) – The value is a name that identifies a specific category
  • Values are called levels in this context
  • Region is a commonly occurring nominal variable in business data
  • In this example, “Color” vs. “Black and White” simply identify types of movies

• Ordinal – The value identifies a category but is also associated with a rank (i.e., the data can be ordered)
  • Classic example is position in a race (1st, 2nd, 3rd, etc.)
  • Here, budget category generally indicates how much money was spent
• All ordinal data is nominal data (diagram: the ordinal values form a subset of the nominal values)
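To make the nominal/ordinal distinction concrete, here is a minimal Python sketch of encoding an ordinal variable as integer ranks that preserve its order. The budget levels and movie list are made-up illustrations, not from a real dataset:

```python
# Hypothetical ordinal variable: budget category, which has a natural order.
# Mapping levels to integers preserves that order for modeling.
budget_rank = {"Low": 1, "Medium": 2, "High": 3}

movies = ["High", "Low", "Medium", "Low"]
encoded = [budget_rank[b] for b in movies]
print(encoded)  # [3, 1, 2, 1]

# A purely nominal variable (e.g., region) has no such ordering, so assigning
# it ranks like this would impose structure that is not really there.
```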

Common Data Types – Numeric Data


• Interval – Data has meaningful intervals between measurements. Ratios are meaningless. There is no value that represents “nullity” (i.e., there’s no true 0).
  • Classic example is temperature
  • Is a movie rated 4 twice as good as a movie rated 2?

• Ratio – Data has meaningful intervals between measurements. Ratios make sense. There is a value that represents “nullity” (i.e., there’s a true 0).
  • Numeric business data is often ratio
  • e.g., a movie that grossed $60M made twice as much as a movie that grossed $30M


• All ratio data is interval data (diagram: the ratio values form a subset of the interval values)

Common Data Types - Identifiers


• Datasets commonly employ identifiers to distinguish between observations

• Identifiers are often the primary key in the database from which the dataset was drawn

• Identifiers typically have no predictive value in models… they are only there to help you navigate the data


Common Data Types - Text


• Text also frequently appears in business data

• Can often be distinguished from nominal data by the number of levels present

• Text is often not useful in the development of predictive models unless those models are combined with additional text mining techniques


Variable Types
• Each column in your dataset represents a potential variable that may be included in your model

• Variables may be of two types:
  • Independent variable (input) – A variable whose variation does not depend on any other variable (x)
  • Dependent variable (target) – A variable whose variation does (we hope) depend on other variables (y)


Data Preprocessing


Data Preprocessing
The data contained in modern data warehouses often has significant data quality issues

• Accuracy – Do the data accurately represent what they are intended to represent?

We have a customer record but the income field reflects household, rather than individual income as expected

• Completeness – Do we have all of the data necessary?

We have a customer record but there is no value in the income field

• Timeliness – Was the data collected recently enough to still be useful?

We have a customer record but the value of the income field was collected 20 years ago

• Believability – Can the data be trusted?

We have a customer record but the value of the income field is $5B

• Interpretability – Do we really understand what the data shows?

We have a customer record but the value of the income field has been scaled several times and we are not really sure what it means


Data Quality

• Data quality has consistently been shown to be a critical factor in the successful use
of BI within organizations

• Quality depends on the intended use of the data


• Quality costs time and money
• You are looking for sufficient, rather than optimal, quality

• Some preprocessing tasks related to improving data quality are often completed
before the analyst receives the data, others are completed after


Preprocessing Tasks - Overview


• Data cleaning – Dealing with missing values and smoothing noisy data

• Data integration – Ensuring that the incorporation of data from multiple sources has not introduced inconsistencies into the data

• Data reduction – Identifying a smaller subset of the data that can produce the same (or similar) analytical results


Data Cleaning
• Missing data approaches
• Ignore the tuple – Skip it; can result in significant data loss in sparse data sets

• Manual correction – Fix it; unrealistic in most scenarios

• Global constant – Use a placeholder; can get confused with actual data

• Central tendency – Use the mean or median; can alter the variation in the data

• Class-based central tendency – Use the mean or median associated with the class to which this record belongs

• Most probable value – Estimate it using regression, decision tree, etc.

• Noisy data approaches

• Binning – Sort the data and smooth each value using its neighbors (bin mean, median, or boundaries)

• Regression – Use predicted rather than actual values

• Outlier analysis – Identify and exclude “odd” records
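Two of the missing-data approaches above can be sketched in a few lines of plain Python. The records below are invented for illustration; in practice a library such as pandas would typically do this work:

```python
from statistics import mean

# Made-up customer records with some missing income values (None).
records = [
    {"segment": "A", "income": 50.0},
    {"segment": "A", "income": None},
    {"segment": "B", "income": 90.0},
    {"segment": "B", "income": 110.0},
    {"segment": "B", "income": None},
]

# Central tendency: replace every missing income with the overall mean.
known = [r["income"] for r in records if r["income"] is not None]
overall = mean(known)  # (50 + 90 + 110) / 3 = 83.33...

# Class-based central tendency: use the mean of the record's own segment,
# which respects differences between classes.
def segment_mean(seg):
    vals = [r["income"] for r in records
            if r["segment"] == seg and r["income"] is not None]
    return mean(vals)

imputed = [r["income"] if r["income"] is not None
           else segment_mean(r["segment"]) for r in records]
print(imputed)  # [50.0, 50.0, 90.0, 110.0, 100.0]
```

Note how the class-based fill gives segment B's missing value 100.0 rather than the overall mean, preserving more of the between-class variation.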


Data Integration
• Entity Identification
• How do we match records in one data source with those in another?

• Does prd_id = product_id?

• Are values contained in the sources in common units?

• Redundancy
• Can a given field be derived from others within the data set?

• Can introduce statistical issues if used in the same model

• Duplication
• Can result from data redundancy in underlying data sources

• Can inappropriately increase the significance of relationships

• Data value conflict
  • When data source A indicates price is 24.99 and data source B indicates price is 19.99, which is correct?
  • What are the rules that govern conflict resolution?
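One possible conflict-resolution rule, sketched with invented product records and a “most recent update wins” policy (the field names and dates are hypothetical):

```python
# Two sources disagree on the price of the same product.
source_a = {"prd_1": {"price": 24.99, "updated": "2024-01-10"}}
source_b = {"prd_1": {"price": 19.99, "updated": "2024-03-02"}}

merged = {}
for source in (source_a, source_b):
    for key, rec in source.items():
        # Rule: keep whichever record carries the later update date.
        # (ISO-format date strings compare correctly as plain strings.)
        if key not in merged or rec["updated"] > merged[key]["updated"]:
            merged[key] = rec

print(merged["prd_1"]["price"])  # 19.99
```

The point is not the specific rule but that the rule is explicit and applied consistently; “trust source A” or “take the average” are equally valid policies if documented.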



Data Reduction
• The data we work with is often BIG and its size may inhibit our ability to work with it
  • Acquisition
  • Storage
  • Modeling

• If we think about data in terms of a matrix, we may have a lot of
  • Rows
  • Columns
  • Both rows and columns


Data Reduction – Reducing Rows


Reduction may be achieved in a number of ways

• Aggregation in the data warehouse – Increasing the granularity of the data cube will reduce the number of observations present in the resulting data

• Sampling – Selecting representative observations for analysis while discarding the bulk of the data
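A minimal sampling sketch using Python's standard library; the 10,000 rows and the 5% sampling rate are arbitrary choices for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

rows = list(range(10_000))          # stand-in for 10,000 observations
sample = random.sample(rows, 500)   # keep a 5% simple random sample

print(len(sample))       # 500
print(len(set(sample)))  # 500 -- random.sample draws without replacement
```

Simple random sampling is the baseline; stratified sampling is often preferable when a rare class must stay represented in the reduced data.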


Data Reduction – Reducing Columns


Column reduction (dimensionality reduction) can be more complicated and involve more work

• Manual Feature Selection – The data analyst can examine the available dimensions and exclude those they feel are not useful for modeling purposes

• Feature Selection based on Objective Function – A modeling approach is used to identify the features that appear to have the most influence on the dependent variable

• Feature Extraction – Maps high-dimensional data onto a lower-dimensional subspace (i.e., combines variables)
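As a rough illustration of filter-style feature selection, columns can be ranked on a simple statistic such as variance and the flattest ones dropped. The columns below are made up, and real pipelines would typically use a library such as scikit-learn:

```python
from statistics import pvariance

# Hypothetical columns; "constant" never varies, so it carries no signal.
columns = {
    "income":   [40, 85, 60, 120, 95],
    "age":      [25, 40, 33, 58, 47],
    "constant": [1, 1, 1, 1, 1],
}

# Compute the population variance of each column and keep those that vary.
variances = {name: pvariance(vals) for name, vals in columns.items()}
kept = [name for name, v in variances.items() if v > 0]
print(sorted(kept))  # ['age', 'income']
```

Variance is only a crude proxy; an objective-function approach would instead score each feature by its measured influence on the dependent variable.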


Data Exploration


Data Exploration
• Having a sound understanding of the data you employ in models is critical

• What does each variable represent?

• How was it measured?

• From whom was it obtained?

• How is it related to the business domain?

• If predicting, is it available at the time the prediction needs to be made?

• A lack of understanding on the part of the modeler will result in poor model performance and/or nonsensical model parameters

• In addition to a general understanding of the data, understanding their statistical properties is also important


Exploratory Data Analysis (EDA)


• EDA is an approach that attempts to develop an understanding of the data to facilitate
the selection of the best possible models.

• The seminal work is Exploratory Data Analysis (Tukey 1977)

• A nice summary may be found at http://www.itl.nist.gov/div898/handbook/index.htm

• The approach is designed to:

1. Maximize insight into a data set

2. Uncover underlying structure

3. Extract important variables

4. Detect outliers and anomalies

5. Test underlying assumptions

6. Develop parsimonious models

7. Determine optimal factor settings


Important Summary Statistics


• Measures of Central Tendency

• Mean
  • Sum a group of numbers and divide by the number of observations: x̄ = (Σx)/n
  • Represents central tendency but is not robust
  • Given x = 5, 2, 7, 2, 8:  mean = (5 + 2 + 7 + 2 + 8)/5 = 4.8

• Median
  • Order a group of numbers and select the middle number: x_ordered = 2, 2, 5, 7, 8, so median = 5
  • Also represents central tendency but is robust to the presence of outliers (i.e., it does not change much when a few outliers are present)
  • If the number of observations is even, the median is the average of the two middle numbers
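The slide's numbers can be checked with Python's standard statistics module; the outlier lines at the end are an added illustration of robustness:

```python
from statistics import mean, median

# The slide's example data.
x = [5, 2, 7, 2, 8]
print(mean(x))    # 4.8
print(median(x))  # 5  (sorted: 2, 2, 5, 7, 8 -> middle value)

# The median barely moves when an extreme value appears; the mean does.
x_outlier = x + [1000]
print(median(x_outlier))  # 6.0 (even count: average of middle values 5 and 7)
print(mean(x_outlier))    # 170.66..., dragged far from the bulk of the data
```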



Important Visualizations
• Histogram
• Graphical representation of the distribution of numeric data

• Bins are constructed and the number of observations that fall within each bin is represented on a bar graph

• Important for verifying that model assumptions are not violated

• Box Plot
• Graphical representation of data through quartiles

• Bottom and top of the box represent the first and third quartiles; the middle bar (or sometimes a dot) represents the median; whisker conventions vary

• Excellent for assessing distribution and identifying outliers

• Scatter Plot
• Graphical representation of two or more variables in relation to one another

• Each variable is plotted on one axis using Cartesian coordinates

• Good for detecting relationships between variables
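The bin-counting behind a histogram can be sketched by hand; the values and bin edges below are arbitrary, and in practice a plotting library such as matplotlib would both count and draw:

```python
# Synthetic data and five equal-width bins covering [0, 5).
values = [1.2, 1.9, 2.4, 2.5, 3.1, 3.8, 4.0, 4.9]
lo, hi, n_bins = 0.0, 5.0, 5
width = (hi - lo) / n_bins  # 1.0

counts = [0] * n_bins
for v in values:
    # Map each value to its bin index; clamp so v == hi lands in the last bin.
    i = min(int((v - lo) / width), n_bins - 1)
    counts[i] += 1

print(counts)  # [0, 2, 2, 2, 2] -- the bar heights a histogram would draw
```

Bin width matters: too few bins hide structure, too many turn the plot into noise, which is why checking a couple of widths is good EDA practice.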



Histogram

(Figure: histogram of normally distributed data)


Box Plot


Scatter Plot
