P. 1
02 Data Preprocessing

02 Data Preprocessing

|Views: 1,072|Likes:
Published by Praveen Reddy

More info:

Published by: Praveen Reddy on Apr 07, 2011
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PPT, PDF, TXT or read online from Scribd
See more
See less






  • This chapter covers
  • Why Is Data Preprocessing Important?
  • Measures of Data Quality
  • Major Tasks in Data Preprocessing
  • Measuring the central tendency
  • Measuring the dispersion
  • 1. Missing Data
  • How to Handle Missing Data?
  • 2. Noisy Data
  • How to Handle Noisy Data?
  • Simple Discretization Methods: Binning
  • Binning Methods -- Examples
  • Cluster Analysis
  • Data integration issues:
  • Why Redundancy Causes Problems?
  • Handling Redundancy in Data Integration
  • Correlation Analysis (Numerical Data)
  • Correlation Analysis (Categorical Data)
  • Chi-Square Calculation: An Example
  • Data Transformation: Normalization
  • 1. Data Cube Aggregation
  • 2. Attribute selection
  • 3. Data Compression or dimensionality reduction
  • Data Compression
  • Principal Component Analysis
  • Entropy-Based Discretization
  • Interval Merge by G2
  • Analysis
  • Segmentation by Natural Partitioning
  • Example of 3-4-5 Rule
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation



This chapter covers
Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation

1. 2. 3. 4. 5. 6.


The real-world data are highly susceptible to noise, incompleteness and inconsistency. 

Data in the real world is dirty 

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  e.g., occupation= noisy: containing errors or outliers  e.g., Salary= -10 inconsistent: containing discrepancies in codes or names  e.g., Age= 42 Birthday= 03/07/1997  e.g., Was rating 1,2,3 , now rating A, B, C  e.g., discrepancy between duplicate records  


Why Is Data Dirty? 4 .

 Human/hardware/software problems Noisy data (incorrect values) may come from  Faulty data collection instruments  Human or computer error at data entry  Errors in data transmission  Contd 5 . Incomplete data may come from  Not applicable data value when collected  Different considerations between the time when the data was collected and when it is analyzed.

g. modify some linked data) Duplicate records also need data cleaning  6 . Inconsistent data may come from  Different data sources  Functional dependency violation (e..

and transformation comprises the majority of the work of building a data warehouse 7 .Why Is Data Preprocessing Important?  Noisy Data No quality data.g. cleaning. duplicate or missing data may cause incorrect or even misleading statistics.. no quality mining results!  Lack of Data Quality Quality decisions must be based on quality data  Lack of Quality in Query Results Lack of Quality Information e.  Lack of Quality in Mining  Data warehouse needs consistent integration of quality data Lack of Quality Decision Making Data extraction.

Measures of Data Quality  Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility        8 .

Part of data reduction but with particular importance. It merges data from multiple sources into a coherent data store. Data integration 3. or clustering. Data transformation 4. These are like normalizations. Data cleaning Purpose It can be applied to remove noise and correct inconsistencies in the data. especially for numerical data 2.Major Tasks in Data Preprocessing Technique 1. eliminating redundant features. It can reduce the data size by aggregating. Data reduction 5. such as a data warehouse. Data discretization 9 .

10 . These can improve the overall quality of the patterns mined and the time required for the actual mining.Data preprocessing techniques are applied before mining.

users would like to learn about data characteristics regarding both central tendency and dispersion of the data. 11 .For data preprocessing to be successful. it is essential to have an overall picture of the data. For many preprocessing tasks.

mode and midrange. count(). sum(). For example. 12 .Measuring the central tendency A distributive measure is a measure that can be computed for a give data set by partitioning the data into smaller subsets. and then merging the results in order to arrive at the measure s value for the entire data set. An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. For example. median. mean.

For example. range. scatter plot. 13 .Measuring the dispersion The degree to which numerical data tend to spread is called the dispersion or variance of data. SD. QD. quantile plot.

and inconsistent. Data cleaning routines attempt  to fill in missing values  to smooth out noise while identifying outliers  to correct inconsistencies in the data  to resolve redundancy caused by data integration. noisy.Real-world data tend to be incomplete. 14 .

such as customer income in sales data  Missing data may be due to      equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry not register history or changes of the data  Missing data may need to be inferred 15 .1..g. Missing Data  Data is not always available  E. many tuples have no recorded value for several attributes.

This method is not very effective.How to Handle Missing Data? The methods are as follows: Ignore the tuple This is usually done when the class label is missing. unless the tuple contains several attributes with missing values. Use global constant to fill in the missing value Use the attribute mean to fill in the missing value Use the most probable value to fill in the missing value 16 . Fill in the missing values manually It is time consuming. Not possible for large data sets with many missing values.

17 .2. Noisy Data What is noise? It is random error or variance in a measured variable.

Why noise occurs?  Incorrect attribute values may due to      faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention 18 .

3. etc. Binning How applied First sort data and partition into (equalfrequency) bins Then one can smooth by bin means.How to Handle Noisy Data? Technique 1. Combined computer and human inspection 19 . bin median. Detect and remove outliers Detect suspicious values and check by human. 2. smooth by bin boundaries. Regression Smooth by fitting the data into regression functions. Clustering 4.

but outliers may dominate presentation Skewed data is not handled well  Equal-depth (frequency) partitioning  Divides the range into N intervals.   The most straightforward. the width of intervals will be: W = (B A)/N. each containing approximately same number of samples   Good data scaling Managing categorical attributes can be tricky 20 .Simple Discretization Methods: Binning  Equal-width (distance) partitioning   Divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute.

34 * Smoothing by bin means: .Bin 3: 29. 24. 23 .Bin 3: 26. 34 21 . 23. 25 . 34 * Partition into equal-frequency (equi-depth) bins: . 15. 4.Bin 1: 4. 8. 29. 25. 4. 29 * Smoothing by bin boundaries: . 21. 29. 23.Bin 2: 21.Bin 3: 26. 29. 15 . 26. 28. 24.Bin 1: 9. 9.Bin 2: 21. 29. 21. 25.Examples Sorted data for price (in dollars): 4.Bin 1: 4. 9. 21. 28.Binning Methods -. 9. 9. 21. 26. 15 . 9 . 8.Bin 2: 23. 25 . 26.

y Y1 Y1¶ y=x+1 X1 x 22 .Regression Data can be smoothed by fitting the data to a function such as regression.

where similar values are organized into groups. 23 .Cluster Analysis Outliers may be detected by clustering. Values that fall outside of the set of clusters may be considered outliers. or clusters.

Data integration combines data from multiple sources into a coherent data store. 24 . These sources may include multiple databases. or flat files. data cubes.

British units  Contd 25 .g.Data integration issues:  Entity identification problem:  Identify real world entities from multiple data sources. different scales. Bill Clinton = William Clinton  Detecting and resolving data value conflicts  For the same real world entity. e.. metric vs. e.g. attribute values from different sources are different Possible reasons: different representations..

Some redundancies can be detected by correlation analysis. Duplication Detection and resolution of data value conflicts 26 .Data integration issues: Schema integration and object matching can be tricky Entity identification problem How can equivalent real-world entities from multiple data stores be matched up? Redundancy An attribute may be redundant if it can be derived from another attribute or set of attributes.

it is better first to normalize them before integrating data from multiple sources. If redundant variables are numeric. ENAME.Why Redundancy Causes Problems? Consider the following tables: EMP(ENO. PF.PF In the same way ITEM-PRICE will be determined by local taxes. BASIC. 27 . DA. PAY) -> PAY = BASIC + DA . DA. PAY) -> PAY = BASIC + DA EMPLOYEE(ENO. BASIC. ENAME. which vary from area to area.

g. annual revenue   Redundant attributes may be able to be detected by correlation analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality  28 ..Handling Redundancy in Data Integration  Redundant data occur often when integration of multiple databases  Object identification: The same attribute or object may have different names in different databases Derivable data: One attribute may be a derived attribute in another table. e.

rA. A and B are positively correlated (A s values increase as B s). rA.B = 0: independent. B ! § ( A  A )( B  B ) ! § ( AB )  n A B ( n  1)W A W B ( n  1)W A W B where n is the number of tuples. A and B are the respective standard deviation of A and B. A and B are the respective means of A and B.B < 0: negatively correlated  29 . The higher.Correlation Analysis (Numerical Data) Correlation coefficient (also called Pearson s product moment coefficient)  rA .B > 0.  If rA. the stronger correlation. and (AB) is the sum of the AB cross-product.

Correlation Analysis (Categorical Data)  G2 (chi-square) test (Observed  Expected ) 2 G2 ! § Expected   The larger the G2 value. the more likely the variables are related The cells that contribute the most to the G2 value are those whose actual count is very different from the expected count Correlation does not imply causality    # of hospitals and # of car-theft in a city are correlated Both are causally linked to the third variable: population 30 .

Chi-Square Calculation: An Example Play chess Like science fiction Not like science fiction Sum(col.93 90 210 360 840 2  It shows that like_science_fiction and play_chess are correlated in the group 31 .) 250(90) 50(210) 300 Not play chess 200(360) 1000(840) 1200 Sum (row) 450 1050 1500  G2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) ( 250  90 ) 2 (50  210 ) 2 ( 200  360 ) 2 (1000  840 ) 2 G !    ! 507 .

Here. the data are transformed or consolidated into forms appropriate for mining. 32 .

Data transformation can involve the following:  Smoothing: remove noise from data through binning. regression and clustering Aggregation: summarization. data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small. specified range       min-max normalization z-score normalization normalization by decimal scaling  Attribute/feature construction  New attributes constructed from the given ones 33 .

= 16.000.225 16.000  12.600  12.000 to $98.0.0  0)  0 ! 0.000 normalized to [0. 73.000 (1.000.000  Normalization by decimal scaling v v' ! j 10 Where j is the smallest integer such that Max(| ¶|) < 1 34 .600  54.Data Transformation: Normalization  Min-max normalization: to [new_minA. Let income range $12. Let = 54.000 ! 1. new_maxA] v' !  v  minA (new _ maxA  new _ minA)  new _ minA maxA  minA Ex. Then $73. Then 73.600 is mapped to 98.0].000 1.716  Z-score normalization ( : mean. : standard deviation): v  QA v'! WA  Ex.

35 .

Why data reduction?   A database/data warehouse may store terabytes of data Complex data analysis/mining may take a very long time to run on the complete data set 36 .

yet closely maintains the integrity of the original data. mining on the reduced data set should be more efficient yet produce the same analytical results. 37 . That is.Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume.

2.Data reduction strategies 1. 4.. fit data into models 38 . 3. Data cube aggregation: Attribute subset selection Data Compression or dimensionality reduction Numerosity reduction e.g.

The base cuboid should correspond to an individual entity of interest. The cube created at the lowest level of abstraction is referred to as the base cuboid. 39 . The lowest level should be useful for analysis.1. such as sales or customer. A cube at the highest level of abstraction is apex cuboid. Data Cube Aggregation Data cubes store multidimensional aggregated information.

Each higher level of abstraction further reduces the resulting data size. When replying to data mining requests. 40 .Data cubes created for varying levels of abstraction are referred to as cuboids. the smallest available cuboid relevant to the given task should be used.

2. which assume that the attributes are independent of one another. Attribute selection Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes or dimensions. The goal here is to find a minimum set of attributes such that the resulting probability distribution of data classes is as close as possible to the original distribution obtained using all attributes. The best attributes are typically determined using tests of statistical significance. 41 .

Stepwise backward elimination 3.Basic heuristic methods of attribute subset selection include the following techniques: 1. Decision tree induction 42 . Combination of forward and backward elimination 4. Stepwise forward selection 2.

Data Compression or dimensionality reduction Here. Wavelet transforms  Discrete wavelet transform  Discrete Fourier transform  Hierarchical pyramid algorithm 2. Principal component analysis (PCA) 43 .  Lossy compression  Lossless compression Two lossy compression techniques: 1.3. the data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.

Data Compression Original Data lossless Compressed Data Original Data Approximated 44 .

with progressive refinement  Sometimes small fragments of signal can be reconstructed without reconstructing the whole Time sequence is not audio  Typically short and vary slowly with time 45 .Examples    String compression  There are extensive theories and well-tuned algorithms  Typically lossless  But only limited manipulation is possible without expansion Audio/video compression  Typically lossy compression.

Dimensionality Reduction:Wavelet Transformation Haar2  Daubechie4 Discrete wavelet transform (DWT): linear signal processing. multiresolutional analysis Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients Similar to discrete Fourier transform (DFT). localized in space Method:        Length. until reaches the desired length 46 . resulting in two set of data of length L/2 Applies two functions recursively. but better lossy compression. must be an integer power of 2 (padding with 0 s. when necessary) Each transform has 2 functions: smoothing. difference Applies to pairs of data. L.

e...e.e.. the size of the data can be reduced by eliminating the weak components. find k n orthogonal vectors (principal components) that can be best used to represent data Steps      Normalize input data: Each attribute falls within the same range Compute k orthonormal (unit) vectors. it is possible to reconstruct a good approximation of the original data   Works for numeric data only Used when the number of dimensions is large 47 . i. those with low variance. principal components Each input data (vector) is a linear combination of the k principal component vectors The principal components are sorted in order of decreasing significance or strength Since the components are sorted. i. (i.Dimensionality Reduction: Principal Component Analysis (PCA)   Given N data vectors from n-dimensions. using the strongest principal components.

Principal Component Analysis X2 Y1 Y2 X1 48 .

Outliers may also be stored.4. instead of the actual data. clustering and sampling. a model is used to estimate the data parameters need to be stored. Numerosity reduction The techniques of numerosity reduction can be applied to reduce the data volume by choosing alternative. Nonparametric methods for storing reduced representations of the data include histograms. These techniques may be parametric or nonparametric. For parametric methods. smaller forms of data representation. 49 .

Data Reduction Method (1): Regression and Log-Linear Models  Linear regression: Data are modeled to fit a straight line  Often uses the least-square method to fit the line  Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector  Log-linear model: approximates discrete multidimensional probability distributions 50 .

Data Reduction Method (1): Regress Analysis and Log-Linear Models    Linear regression: Y = w X + b  Two regression coefficients. Multiple regression: Y = b0 + b1 X1 + b2 X2. . X1. X2. w and b. specify the line and are to be estimated by using the data at hand  Using the least squares criterion to the known values of Y1.  Many nonlinear functions can be transformed into the above Log-linear models:  The multi-way table of joint probabilities is approximated by a product of lower-order tables  Probability: p(a. b. Y2. c. d) = Eab FacGad Hbcd . .

Data Reduction Method (2): Histograms Divide data into buckets and store average (sum) for each bucket Partitioning rules:     40 35 30 25 20 15 10 5 0 10000 30000 50000 70000 90000  Equal-width: equal bucket range Equal-frequency (or equal-depth) V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents) MaxDiff: set bucket boundary between each pair for pairs have the 1 largest differences  52 .

g. and store cluster representation (e..Data Reduction Method (3): Clustering  Partition data set into clusters based on similarity. centroid and diameter) only   Can be very effective if data is clustered but not if data is smeared Can have hierarchical clustering and be stored in multi-dimensional index tree structures  There are many choices of clustering definitions and clustering algorithms 53 .

Data Reduction Method (4): Sampling 

Sampling: obtaining a small sample s to represent the whole data set N Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data Choose a representative subset of the data  

Simple random sampling may have very poor performance in the presence of skew Stratified sampling:  

Develop adaptive sampling methods 

Approximate the percentage of each class (or subpopulation of interest) in the overall database Used in conjunction with skewed data  

Note: Sampling may not reduce database I/Os (page at a time)


Sampling: with or without Replacement

Raw Data


Sampling: Cluster or Stratified Sampling
Raw Data Cluster/Stratified Sample


g. profession values from an ordered set. Reduce data size by discretization Prepare for further analysis 57 . e.. integer or real numbers Continuous Discretization:     Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical attributes. e. military or academic rank real numbers.. e.g.Three types of attributes:    Nominal Ordinal values from an unordered set. color..g.

middle-aged.Discretization  Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Supervised vs. merge (bottom-up) Discretization can be performed recursively on an attribute     Concept hierarchy formation  Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young. or senior) 58 . unsupervised Split (top-down) vs.

and then repeats recursively on the resulting intervals.  Bottom-up discretization or merging: Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values. the process uses the class information. the process starts by first finding one or a few points (split or cut points) to split the entire attribute range. 2. Concept hierarchies are useful for mining at multiple levels of abstraction. known as concept hierarchy. 59 . Unsupervised discretization:  Top-down discretization or splitting: Here.Discretization techniques can be categorized based on how the discretization is performed. Supervised discretization: Here. 1.

unsupervised    Entropy-based discretization: supervised. unsupervised 60 . unsupervised. unsupervised  Clustering analysis (covered above)  Either top-down split or bottom-up merge. top-down split Interval merging by G2 Analysis: unsupervised. bottom-up merge Segmentation by natural partitioning: top-down split.  Histogram analysis (covered above)  Top-down split.Typical methods: All the methods can be applied recursively  Binning (covered above)  Top-down split.

the entropy of S1 is m Entropy ( S 1 ) !  § p i log 2 ( p i ) i !1 where pi is the probability of class i in S1  The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization The process is recursively applied to partitions obtained until some stopping criterion is met Such a boundary may reduce data size and improve classification accuracy 61   . the information gain after partitioning is I (S . if S is partitioned into two intervals S1 and S2 using boundary T.Entropy-Based Discretization  Given a set of samples S.T ) !  | S1 | |S | Entropy ( S 1)  2 Entropy ( S 2 ) |S | |S | Entropy is calculated based on class distribution of the samples in the set. Given m classes.

See also Liu et al.Interval Merge by G2 Analysis   Merging-based (bottom-up) vs. A is considered to be one interval G2 tests are performed for every pair of adjacent intervals Adjacent intervals with the least G2 values are merged together. since low G2 values for a pair indicate similar class distributions This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level. DMKD 2002]     Initially. max-interval.)  62 . etc. max inconsistency. splitting-based methods Merge: Find the best neighboring intervals and merge them to form larger intervals recursively ChiMerge [Kerber AAAI 1992. each distinct value of a numerical attr.

or 8 distinct values at the most significant digit. 7 or 9 distinct values at the most significant digit. partition the range into 3 equi-width intervals If it covers 2.  If an interval covers 3. 5. 6. partition the range into 5 intervals   63 .Segmentation by Natural Partitioning  A simply 3-4-5 rule can be used to segment numeric data into relatively uniform. or 10 distinct values at the most significant digit. 4. natural intervals. partition the range into 4 intervals If it covers 1.

000) (-$1.000 $4.000) ($1.700 Max High=$2.Example of 3-4-5 Rule count Step 1: Step 2: Step 3: -$351 Min msd=1.200) ($1.000 .600 ($1.$2. 000) ($2.400) ($1.000 $1.e. 000) ($2.838 High(i.000 .000) ($1.e.000 .$2.000 (-$1.000) ($3.000 profit $1.000) ($1.800 $1.400 $1.000) (-$400 .000) ($600 $800) April 7.000) Step 4: (-$400 -$5.$1.800) $2.0) (0 -$ 1.$2.000 $3.000 -$159 Low (i.$5.200 $1.000 .000) ($1. 2011 Data Mining: Concepts and Techniques 64 .0) (-$400 -$300) (-$300 -$200) (-$200 -$100) (-$100 0) (0 $200) ($200 $400) ($400 $600) (0 . 5%-tile) Low=-$1.000 $5.600) ($800 $1.000) ($4. 95%-0 tile) $4.000 .

or senior). The high level concepts are useful for data generalization. Concept hierarchies can be used to reduce the data by collecting and replacing lowlevel concepts (such as numerical values for age) with higher-level concepts (such as youth. rather than during mining. 65 . middle-aged.A concept hierarchy for a numerical attribute defines a discretization of the attribute. The discretization techniques and concept hierarchies are applied before data mining as a preprocessing step.

state.g.Concept Hierarchy Generation for Categorical Data  Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts  street < city < state < country {Urbana. Champaign. for a set of attributes: {street..g. city. country} 66 . not others  Specification of a hierarchy for a set of values by explicit data grouping   Specification of only a partial set of attributes   Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values  E.. Chicago} < Illinois E. only street < city.

year country province_or_ state city street 15 distinct values 365 distinct values 3567 distinct values 674..Automatic Concept Hierarchy Generation  Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set   The attribute with the most distinct values is placed at the lowest level of the hierarchy Exceptions. quarter. weekday.g.339 distinct values 67 . month. e.

April 7. 2011 Data Mining: Concepts and Techniques 68 .

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->