
Chapter 3

Data Mining & Warehousing


Data Preprocessing
Topics
The nature of real world data and data formats
 Data quality issues
 Preparing data for analysis/preprocessing
The nature of real world data and data
formats
The Data Set (input to DM process)
 Components of the input:
 Concepts: kinds of things that can
be learned
Aim: intelligible and operational
concept description
 Instances: the individual,
independent examples of a concept
 Attributes: measuring aspects of an
instance
We apply data analytics to this input data
Cont...
Thus a dataset can be seen as a collection of data objects/examples and their attributes, representing a concept.
 An attribute is a property or characteristic of an object
▶ Examples: eye color of a person, temperature, etc.
▶ An attribute is also known as a variable, field, characteristic, or feature
 An object is also known as a record, point, case, sample, entity, or instance
What’s an attribute?
 Each instance is described by a fixed predefined set of
features, its “attributes”
 But: number of attributes may vary in practice
 Possible attribute types (“levels of measurement”)
▶Nominal, ordinal, interval and ratio
 Nominal attributes are also called “categorical”,
”enumerated”, or “discrete”
 Ordinal attributes are sometimes called “numeric”, or “continuous”
▶ But: “continuous” implies mathematical continuity
Attribute types: Summary
Nominal, e.g. eye color=brown, blue, …
▶ only equality tests
▶ important special case: Boolean (True/False)
Ordinal, e.g. grade = K, 1, 2, ..., 12
Continuous (numeric), e.g. year
Types of data sets formats
Record
▶ Relational , Data Matrix, Document Data, Transaction
Data
 Graphs and networks
▶ World Wide Web, Social networks, Molecular
Structures
Ordered
▶ Spatial Data, Temporal Data, Sequential Data,
Genetic Sequence Data
Data Quality
 What kinds of data quality problems?
 How can we detect problems with the data?
 What can we do about these problems?
Cont...
Examples (basic types) of data quality problems:
▶ Noise and outliers
▶ Missing values
▶ Duplicate data
Data Quality Measures
 Before feeding data to a DW/DM/DA process, we have to ensure the quality of the data.
 A well-accepted set of multidimensional data quality measures is the following:
▶ Accuracy
▶ Completeness
▶ Consistency
▶ Timeliness
▶ Believability
▶ Interpretability
 Most real-world data are incomplete, inconsistent, noisy, redundant, …
Data is often of low quality
 Business analytics/data mining requires collecting large amounts of data (available in warehouses or databases) to achieve the intended objective.
 BUT in addition to being heterogeneous and distributed, real-world data is dirty and of low quality.
 Why?
▶ You didn’t collect it yourself!
▶ It probably was created for some other use, and then
you came along wanting to integrate it
▶ People make mistakes (typos)
▶ People are busy (“this is good enough”) and do not systematically organize the data using structured formats
Noise
 Noise refers to the modification of original values
▶ Examples:
▶ attribute values that are invalid or incorrect, e.g. typographical errors
▶ distortion of a person’s voice when talking on a poor phone line, or “snow” on a television screen
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
 misleading data that do not fit the bulk of the data/facts
Missing Values

Attribute values may be absent and need to be replaced with estimates
 Reasons for missing values
▶ Information is not collected
(e.g., people decline to give their age and weight)
▶ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
(e.g., victim age is not applicable to property-damage severity)
Duplicate Data
 A data set may include data objects that are duplicates, or almost duplicates, of one another
 This is a major issue when merging data from heterogeneous sources
▶ Examples:
Date of birth and age at the same time
Same person with multiple email addresses
Data Preprocessing (preparing the data for
experiment/analytics)
What is Data Preprocessing?
 “Every data analysis task starts by gathering, characterizing, and cleaning a new, unfamiliar data set...”
 More than 80% of researchers working on mining projects spend 40%–60% of their time on cleaning and preparing the data [Kalashnikov+Mehrotra2005]
 Data preprocessing refers to the processing of the various data elements to prepare them for the analytics operation
 Any activity performed prior to mining the data to extract knowledge from it is called data preprocessing
Should we start processing right away?
 Before starting data preprocessing, it is advisable to get an overall picture of the data we have, giving a high-level summary such as:
▶ General properties of the data
▶ The magnitude of duplicates, missing values, etc.
▶ Which data values should be considered noise or outliers
 This can be done with the help of descriptive data summarization
▶ A minimum requirement is a description of the data set in terms of the number of instances, the features (with their nature), and the attribute values, especially for the target class, along with an explanation of them
▶ Even the total data size in MB, GB, etc.
▶ Graphic displays of basic summaries
Now then
▶ You now have a good understanding of the general nature of the data
▶ You can move on to selecting and carrying out the basic preprocessing tasks
▶ Note that not all preprocessing tasks may be applicable to the data at hand; you may not need some of them at all
FORMS (MAJOR TASKS) OF DATA
PREPROCESSING (HIGH LEVEL)
Major Tasks in Data Preprocessing
▶ Data cleaning: to get rid of bad data
– smooth noisy data, fill in missing values, identify or remove outliers, and resolve duplication/inconsistencies
▶ Data integration
– Integration of data from multiple databases, data warehouses, or files
▶ Data reduction
– Dimensionality reduction/ feature selection
– Numerosity/size reduction
– Data compression (usually for multimedia data mining)
▶ Data transformation
– Normalization
– Discretization and/or Concept hierarchy generation
Data Cleaning: Incomplete (Missing) Data
• Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– e.g., Occupation=“ ” (missing data)
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data

ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa
2  | Ministry of Finance        | Addis Ababa |
3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa

• What’s wrong here? A missing required field (State for record 2); a database-level constraint (e.g., NOT NULL) could catch this at entry time
Data Cleaning: How to Handle Missing (Incomplete) Data? - Imputation
Imputation is the process of replacing missing data with substituted values (case deletion vs. imputation)
 Case deletion
▶ Generally ignore the tuple: usually done when the class label is missing
 Imputation
▶ A few of the well-known approaches to dealing with missing data include: hot-deck (last observation carried forward), mean imputation, regression imputation, and global-constant imputation
Cont...
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
▶ Note that this constant may form an apparently interesting pattern for the data mining task and so mislead the decision process.
 Hot-deck imputation (“last observation carried forward”) involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset.
▶ The technique then finds the first missing value and uses the cell value immediately prior to the missing data to impute it.
Cont...
 Use the attribute mean to fill in the missing value
▶ For example, suppose that the average income of All
Electronics customers is $56,000. Use this value to replace
the missing value for income.
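As an illustration, here is a minimal pandas sketch (not from the original slides) of the three imputation strategies just described; the column names and values are invented for the example:

```python
import numpy as np
import pandas as pd

# Toy data with missing values; the column names and values are invented.
df = pd.DataFrame({
    "city":   ["Addis Ababa", None, "Bahir Dar", "Addis Ababa"],
    "income": [56000, np.nan, 48000, np.nan],
})

# 1) Global constant: fill a missing categorical value with "unknown"
df["city_filled"] = df["city"].fillna("unknown")

# 2) Mean imputation: fill missing income with the attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Hot-deck ("last observation carried forward"): sort on some variable,
#    then carry the value immediately prior to each missing cell forward
df_sorted = df.sort_values("city")
df_sorted["income_hotdeck"] = df_sorted["income"].ffill()

print(df)
print(df_sorted[["city", "income", "income_hotdeck"]])
```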
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
• Typographical errors corrupt data
– e.g., ‘green’ written as ‘rgeen’
• Incorrect attribute values may be due to
– faulty data collection instruments (e.g.: OCR)
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention

Data Cleaning: How to catch Noisy Data
• Manually check all data : tedious + infeasible?
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data
• Use, say Numerical constraints to Catch Corrupt Data
• Weight can’t be negative
• People can’t have more than 2 parents
• Salary can’t be less than Birr 300
• Use statistical techniques to Catch Corrupt Data
– Check for outliers (the case of the 8-meter-tall man)
– Check for correlated outliers using n-grams (“pregnant males”)
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
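A small, hypothetical Python/pandas sketch of such constraint checks (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy records; the column names and values are hypothetical.
df = pd.DataFrame({
    "weight_kg": [72, -5, 64],
    "salary":    [4500, 250, 8000],
    "sex":       ["M", "F", "M"],
    "pregnant":  [True, False, True],
})

# Numerical constraints
bad_weight = df["weight_kg"] < 0            # weight can't be negative
bad_salary = df["salary"] < 300             # salary can't be less than Birr 300

# Correlated (logical) constraint: "pregnant males"
bad_combo = (df["sex"] == "M") & df["pregnant"]

# Flag suspect rows for inspection rather than silently dropping them
print(df[bad_weight | bad_salary | bad_combo])
```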
Data Cleaning: Redundancy
• Duplicate or redundant data is data problems which
require data cleaning
• What’s wrong here?

ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa
2  | Ministry of Finance        | Addis Ababa | Addis Ababa
3  | Ministry of Finance        | Addis Ababa | Addis Ababa

• How to clean it: manually or automatically (e.g., using XML tools, DB constraints, …)?
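For example, a minimal pandas sketch (an illustration, not the only way) of automatic duplicate detection, mirroring the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 2, 3],
    "Name":  ["Ministry of Transportation", "Ministry of Finance", "Ministry of Finance"],
    "City":  ["Addis Ababa", "Addis Ababa", "Addis Ababa"],
    "State": ["Addis Ababa", "Addis Ababa", "Addis Ababa"],
})

# Flag exact duplicates on the descriptive columns (ignoring the surrogate ID)
dupes = df.duplicated(subset=["Name", "City", "State"], keep="first")
print(df[dupes])                                            # the row with ID 3

# Keep only the first occurrence of each duplicated record
clean = df.drop_duplicates(subset=["Name", "City", "State"], keep="first")
```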
Data Integration
• Data integration combines data from multiple sources
(database, data warehouse, files & sometimes from
non-electronic sources) into a coherent store
• Because of the use of different sources, data that is fine on its own may become problematic when we want to integrate it.
• Some of the issues are:
– Different formats and structures
– Conflicting and redundant data
– Data at different levels
Data Integration: Formats
• Not everyone uses the same format. Do you agree?
– Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Dates are especially problematic:
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
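As an illustration, a hedged Python sketch that normalizes such date variants to one ISO format using the standard dateutil parser (the sample strings are taken from the list above):

```python
from dateutil import parser

raw_dates = ["12/19/97", "19/12/97", "19/12/1997",
             "19-12-97", "Dec 19, 1997", "19 December 1997"]

# dayfirst=False matches a US-style source; truly ambiguous strings
# (e.g. "03/07/97") still need an agreed, documented convention per source.
iso_dates = [parser.parse(d, dayfirst=False).date().isoformat() for d in raw_dates]
print(iso_dates)   # each variant is expected to map to '1997-12-19'
```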
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or names,
which is also the problem of Lack of standardization / naming
conventions. e.g.,
– Age=“42” vs. Birthday=“03/07/1997”
– Some use “1,2,3” for rating; others “A, B, C”
• Discrepancy between duplicate records

ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | Addis Ababa region
2  | Ministry of Finance        | Addis Ababa | Addis Ababa administration
3  | Office of Foreign Affairs  | Addis Ababa | Addis Ababa regional administration
Data Integration: different structure
What’s wrong here? No data type constraints (one source stores the ID as the word “Two”), and the column order differs across sources

ID | Name                       | City        | State
1  | Ministry of Transportation | Addis Ababa | AA

ID  | Name                | City        | State
Two | Ministry of Finance | Addis Ababa | AA

Name                      | ID | City        | State
Office of Foreign Affairs | 3  | Addis Ababa | AA
Data Integration: Data that Moves
• Be careful of taking snapshots of a moving target
• Example: Let’s say you want to store the price of a shoe
in France, and the price of a shoe in Italy
– You can’t store it all in the same currency (say, US$) because
the exchange rate changes
– Price in foreign currency stays the same
– Must keep the data in foreign currency and use the current
exchange rate to convert
• The same needs to be done for ‘Age’
– It is better to store ‘Date of Birth’ than ‘Age’
Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate categories
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore age ranges because you aren’t sure
• Make educated guess based on imported data (e.g.,
assume that # people of age 25-35 are average # of
people of age 20-30 & 30-40)
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
American vs. British units
• weight measurement: KG or pound
• Height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
– Information source #2 says that Alex lives in Mekele
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
Handling Redundancy in Data Integration
• Redundant data occur often when integration of multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Nominal Data)
• Χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are
related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                         Male      Female      Sum (row)
Like science fiction     250 (90)  200 (360)   450
Not like science fiction 50 (210)  1000 (840)  1050
Sum (col.)               300       1200        1500

• Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and gender are correlated in the group
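The same statistic can be checked programmatically; a short sketch (assuming scipy is available) reproduces both the expected counts and the χ² value:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # does not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)            # [[ 90. 360.] [210. 840.]]
print(round(chi2, 2))      # ≈ 507.94, matching the manual value above up to rounding
```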
Covariance
• Covariance is similar to correlation:

  Cov(p, q) = E((p − p̄)(q − q̄)) = (1/n) Σᵢ (pᵢ − p̄)(qᵢ − q̄)

  and the correlation coefficient is r(p, q) = Cov(p, q) / (σp σq),

  where n is the number of tuples, p̄ and q̄ are the respective means of p and q, and σp and σq are the respective standard deviations of p and q.
• It can be simplified in computation as

  Cov(p, q) = E(p·q) − p̄·q̄

• Positive covariance: if Cov(p, q) > 0, then p and q tend to be directly related.
• Negative covariance: if Cov(p, q) < 0, then p and q are inversely related.
• Independence: if p and q are independent, then Cov(p, q) = 0 (the converse does not hold in general).
Example: Co-Variance
• Suppose two stocks A and B have the following values in
one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
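A quick numerical check of this example in Python (assuming numpy is available):

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

cov = (A * B).mean() - A.mean() * B.mean()   # E(A·B) − E(A)·E(B)
print(cov)                                   # 4.0 -> A and B tend to rise together

# np.cov uses the n−1 (sample) denominator by default, so the population
# version needs bias=True: np.cov(A, B, bias=True)[0, 1] also gives 4.0
```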
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the dataset that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction,
• Select best attributes or remove unimportant attributes
– Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
– Data compression
• A technique that reduces the size of large files so that the smaller files take less storage space and are faster to transfer over a network or the Internet
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
• Dimensionality reduction
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Method: attribute subset selection
– One of the methods to reduce the dimensionality of data is selecting the best attributes
– Helps to avoid redundant attributes : that contain duplicate
information in one or more other attributes
• E.g., purchase price of a product & the amount of sales tax paid
– Helps to avoid Irrelevant attributes: that contain no information
that is useful for the data mining task at hand
• E.g., is students' ID relevant to predict students' GPA?
Heuristic Search in Attribute Selection
• Given M attributes, there are 2^M possible attribute combinations
• Commonly used heuristic attribute selection methods:
– Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
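A minimal, generic sketch of best step-wise (forward) selection; the score() function here is a placeholder for whatever evaluation measure is used in practice (e.g., cross-validated accuracy of a model trained on the candidate subset):

```python
def forward_select(all_features, score, max_features=None):
    """Greedy best step-wise (forward) feature selection."""
    selected, remaining = [], list(all_features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining attribute and keep the best single addition
        top_score, top_feat = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best_score:          # no attribute improves the subset: stop
            break
        best_score = top_score
        selected.append(top_feat)
        remaining.remove(top_feat)
    return selected
```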
Data Reduction: Numerosity Reduction
• Different methods can be used, including Clustering and
sampling
• Clustering
– Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
– There are many choices of clustering definitions and clustering
algorithms
• Sampling
– obtaining a small sample s to represent the whole data set N
– Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
– Key principle: Choose a representative subset of the data using
suitable sampling technique
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
– Simple random sampling may have very poor performance in
the presence of skew
• Stratified sampling:
– Develop adaptive sampling methods, e.g., stratified sampling;
which partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
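For illustration, a pandas sketch of these sampling schemes; the “label” column is a hypothetical stratification variable:

```python
import pandas as pd

# Skewed toy data: 90 objects of class 'a', 10 of class 'b'
df = pd.DataFrame({"label": ["a"] * 90 + ["b"] * 10, "x": range(100)})

srswor = df.sample(n=20, replace=False, random_state=0)   # without replacement
srswr  = df.sample(n=20, replace=True,  random_state=0)   # with replacement

# Stratified sampling: draw the same fraction from each partition (stratum),
# so the skewed 90/10 class ratio is preserved in the sample
strat = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
print(strat["label"].value_counts())   # 18 objects of 'a', 2 of 'b'
```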
Sampling: With or without Replacement
[Figure: from the raw data, SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement) each draw a sample]
Sampling: Cluster or Stratified Sampling
[Figure: raw data partitioned into clusters/strata, with a sample drawn from each partition]
Data Reduction: Data Compression
• Two types of data compression:
 Lossless compression is a compression technique that does not lose any data in the compression process; it can typically reduce a file to about half its original size
 Lossy compression: To reduce a file significantly beyond 50%, one
must use lossy compression. Lossy compression will strip some of
the redundant data in a file. Because of this data loss, only certain
applications are fit for lossy compression, like images, audio, and
video.

[Figure: with lossless compression, the original data is reduced to compressed data and can be fully restored; with lossy compression, only an approximation of the original data can be recovered]
Data Reduction: Data Compression
• Lossless and lossy compression have become part of our
every day vocabulary largely due to the popularity of
– MP3 music files.
• Compare it with WAV files
– JPEG image files
• Compare it with GIF or
– MP4 or MPEG video files
• Compare it with AVI files

• These formats make the resulting files much smaller, so that several dozen music, image, and/or video files can fit on, for example, a single compact disc or a mobile phone.
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range of
values
• min-max normalization
• z-score normalization

– Discretization: Reduce data size by dividing the range of a continuous attribute into intervals. Interval labels can then be used to replace actual data values
• Discretization can be performed recursively on an attribute using
method such as
– Binning: divide values into intervals (equal-width or equal-depth)
– Concept hierarchy climbing: organizes concepts (i.e., attribute values) hierarchically, like a lattice structure.
Normalization
• Min-max normalization (to the new range [new_minA, new_maxA]):

  v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

  – Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μA) / σA

  – Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
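A small Python sketch reproducing both normalizations for the income example above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Rescale v from [min_a, max_a] to [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Number of standard deviations v lies from the mean
    return (v - mean_a) / std_a

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
```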
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
– This is the most straightforward, but outliers may
dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing
approximately same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Cont...
Equal-width (distance) partitioning:
▶ It divides the range into N intervals of equal size: a uniform grid
▶ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
▶ Given the data set 24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25
▶ Determine the number of bins: N (say 3)
▶ Determine the range R = Max(B) − Min(A); for the above data R = 34 − 4 = 30
▶ W = R/3 = 30/3 = 10, giving bin boundaries X1 = 14, X2 = 24, X3 = 34
▶ First sort the data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
▶ Therefore Bin 1 = 4, 8, 9; Bin 2 = 15, 21, 21; Bin 3 = 24, 25, 26, 28, 29, 34
• outliers may dominate presentation
• Skewed data may not be handled well.
Cont...
Equal-depth (frequency) partitioning:
▶ It divides the range into N intervals, each containing approximately the same number of samples
▶ Given the data set 24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25
▶ Determine the number of bins: N (say 3)
▶ Determine the number of values (F) in the data set (12)
▶ Determine the number of samples per bin: F/N (12/3 = 4)
▶ Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 and place F/N elements in order into each bin
▶ Therefore Bin 1 = 4, 8, 9, 15; Bin 2 = 21, 21, 24, 25; Bin 3 = 26, 28, 29, 34
▶ Good data scaling
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
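A short Python sketch of equal-width and equal-depth binning plus smoothing by bin means, using the price data from the worked examples above:

```python
prices = sorted([24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25])
n_bins = 3

# Equal-width: cut the range [4, 34] into 3 intervals of width (34 - 4)/3 = 10
width = (max(prices) - min(prices)) / n_bins
edges = [min(prices) + i * width for i in range(1, n_bins)]    # [14.0, 24.0]
equal_width = [[], [], []]
for v in prices:
    equal_width[sum(v >= e for e in edges)].append(v)
print(equal_width)        # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]

# Equal-depth: 12 sorted values / 3 bins = 4 values per bin
depth = len(prices) // n_bins
equal_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]
print(equal_depth)        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: replace each value by its (rounded) bin mean
smoothed = [[round(sum(b) / len(b))] * len(b) for b in equal_depth]
print(smoothed)           # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```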
Binning
• Attribute values (for one attribute e.g., age):
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning, for a bin width of e.g. n = 10:
– Bin 1: 0, 4             [−∞, 10) bin
– Bin 2: 12, 16, 16, 18   [10, 20) bin
– Bin 3: 24, 26, 28       [20, +∞) bin
– −∞ denotes negative infinity, +∞ positive infinity
• Equi-frequency binning, for a bin density of e.g. n = 3:
– Bin 1: 0, 4, 12         [−∞, 14) bin
– Bin 2: 16, 16, 18       [14, 21) bin
– Bin 3: 24, 26, 28       [21, +∞) bin
Concept Hierarchy Generation
• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
– Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
• A concept hierarchy can also be formed automatically by analyzing the number of distinct values, e.g., for the set of attributes {street, city, state, country}
[Figure: location hierarchy, from lowest to highest level: street → sub city → city → region or state → country]
• For numeric data, use discretization methods.
