
- This chapter covers
- Why Is Data Preprocessing Important?
- Measures of Data Quality
- Major Tasks in Data Preprocessing
- Measuring the central tendency
- Measuring the dispersion
- 1. Missing Data
- How to Handle Missing Data?
- 2. Noisy Data
- How to Handle Noisy Data?
- Simple Discretization Methods: Binning
- Binning Methods -- Examples
- Cluster Analysis
- Data integration issues:
- Why Redundancy Causes Problems?
- Handling Redundancy in Data Integration
- Correlation Analysis (Numerical Data)
- Correlation Analysis (Categorical Data)
- Chi-Square Calculation: An Example
- Data Transformation: Normalization
- 1. Data Cube Aggregation
- 2. Attribute selection
- 3. Data Compression or dimensionality reduction
- Data Compression
- Principal Component Analysis
- Entropy-Based Discretization
- Interval Merge by χ² Analysis
- Segmentation by Natural Partitioning
- Example of 3-4-5 Rule
- Concept Hierarchy Generation for Categorical Data
- Automatic Concept Hierarchy Generation


**This chapter covers**

1. Why preprocess the data?
2. Descriptive data summarization
3. Data cleaning
4. Data integration and transformation
5. Data reduction
6. Discretization and concept hierarchy generation

Real-world data are highly susceptible to noise, incompleteness, and inconsistency. Data in the real world is dirty:

- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation = ""
- noisy: containing errors or outliers; e.g., Salary = -10
- inconsistent: containing discrepancies in codes or names; e.g., Age = 42 but Birthday = 03/07/1997; e.g., a rating that was 1, 2, 3 is now A, B, C; e.g., discrepancies between duplicate records

**Why Is Data Dirty?**

Incomplete data may come from:
- "not applicable" data values at collection time
- different considerations between the time the data was collected and the time it is analyzed
- human, hardware, or software problems

Noisy data (incorrect values) may come from:
- faulty data collection instruments
- human or computer error at data entry
- errors in data transmission

Inconsistent data may come from:
- different data sources
- functional dependency violations (e.g., modifying some linked data)

Duplicate records also need data cleaning.

**Why Is Data Preprocessing Important?**

No quality data, no quality mining results! Quality decisions must be based on quality data: duplicate or missing data, for example, may cause incorrect or even misleading statistics. A data warehouse needs consistent integration of quality data, and data extraction, cleaning, and transformation comprise the majority of the work of building one. A lack of data quality cascades: poor data quality leads to poor query results, poor information, poor mining, and ultimately poor decision making.

**Measures of Data Quality**

- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

**Major Tasks in Data Preprocessing**

1. Data cleaning: can be applied to remove noise and correct inconsistencies in the data.
2. Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse.
3. Data transformation: operations such as normalization.
4. Data reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.
5. Data discretization: part of data reduction, but of particular importance, especially for numerical data.

Data preprocessing techniques are applied before mining. They can improve the overall quality of the patterns mined and reduce the time required for the actual mining.

For data preprocessing to be successful, it is essential to have an overall picture of the data. For many preprocessing tasks, users would like to learn about data characteristics regarding both the central tendency and the dispersion of the data.

**Measuring the central tendency**

A distributive measure can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the entire data set; examples are count() and sum(). An algebraic measure can be computed by applying an algebraic function to one or more distributive measures; examples are the mean (sum()/count()) and the midrange. The median and mode are further measures of central tendency.

**Measuring the dispersion**

The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. Common measures include the range, the quartile deviation (QD), and the standard deviation (SD); useful graphical aids include the quantile plot and the scatter plot.
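As an illustration of these measures, here is a small Python sketch using only the standard library; the sample data is hypothetical, not from the slides:

```python
import statistics

# Hypothetical numeric attribute (e.g., prices in dollars).
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Central tendency.
mean = statistics.mean(data)            # algebraic: sum()/count()
median = statistics.median(data)        # middle value(s) of the sorted data
mode = statistics.mode(data)            # most frequent value
midrange = (min(data) + max(data)) / 2  # average of the two extremes

# Dispersion.
data_range = max(data) - min(data)
std_dev = statistics.pstdev(data)       # population standard deviation
```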

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, correct inconsistencies in the data, and resolve redundancy caused by data integration.

**1. Missing Data**

Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to:
- equipment malfunction
- data inconsistent with other recorded data, and thus deleted
- data not entered due to misunderstanding
- certain data not considered important at the time of entry
- failure to register history or changes of the data

Missing data may need to be inferred.

**How to Handle Missing Data?**

The methods are as follows:
- Ignore the tuple: usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values, and is not feasible for large data sets with many missing values.
- Fill in the missing values manually: time consuming.
- Use a global constant to fill in the missing value.
- Use the attribute mean to fill in the missing value.
- Use the most probable value to fill in the missing value.
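A minimal sketch of the attribute-mean strategy; the record layout and values are hypothetical, not from the slides:

```python
# Hypothetical customer records; None marks a missing income value.
records = [
    {"name": "a", "income": 30000},
    {"name": "b", "income": None},
    {"name": "c", "income": 50000},
]

# Use the attribute mean to fill in the missing value.
known = [r["income"] for r in records if r["income"] is not None]
attr_mean = sum(known) / len(known)

for r in records:
    if r["income"] is None:
        r["income"] = attr_mean  # b's income becomes the mean, 40000.0
```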

**2. Noisy Data**

What is noise? It is random error or variance in a measured variable.

Why does noise occur? Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

**How to Handle Noisy Data?**

1. Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, or bin boundaries.
2. Regression: smooth by fitting the data into regression functions.
3. Clustering: detect and remove outliers.
4. Combined computer and human inspection: detect suspicious values and have a human check them.

**Simple Discretization Methods: Binning**

Equal-width (distance) partitioning divides the range into N intervals of equal size (a uniform grid): if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N. It is the most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well.

Equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples. It gives good data scaling, but managing categorical attributes can be tricky.

**Binning Methods: Examples**

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
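The worked example above can be reproduced with a short Python sketch; the helper functions are illustrative, not from the slides. Note that the bin means 22.75 and 29.25 round to 23 and 29, matching the slide:

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace each value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the nearer of its bin's two boundary values."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```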

**Regression**

Data can be smoothed by fitting the data to a function, as in regression. (Figure: data points in the x-y plane fitted by the line y = x + 1.)

**Cluster Analysis**

Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.

Data integration combines data from multiple sources into a coherent data store. These sources may include multiple databases, data cubes, or flat files.

**Data integration issues**

- Entity identification problem: identify real-world entities from multiple data sources; e.g., Bill Clinton = William Clinton.
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons include different representations and different scales, e.g., metric vs. British units.

Further data integration issues:

- Schema integration and object matching can be tricky (the entity identification problem): how can equivalent real-world entities from multiple data stores be matched up?
- Redundancy: an attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis.
- Duplication.
- Detection and resolution of data value conflicts.

**Why Redundancy Causes Problems**

Consider the following tables:

EMP(ENO, ENAME, BASIC, DA, PAY), where PAY = BASIC + DA
EMPLOYEE(ENO, ENAME, BASIC, DA, PF, PAY), where PAY = BASIC + DA - PF

In the same way, ITEM-PRICE will be determined by local taxes, which vary from area to area. If redundant variables are numeric, it is better to normalize them first, before integrating data from multiple sources.

**Handling Redundancy in Data Integration**

Redundant data occur often when multiple databases are integrated:
- Object identification: the same attribute or object may have different names in different databases.
- Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue.

Redundant attributes may be detected by correlation analysis. Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.

**Correlation Analysis (Numerical Data)**

The correlation coefficient (also called Pearson's product-moment coefficient):

$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.

If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation. If r_{A,B} = 0, they are independent; if r_{A,B} < 0, they are negatively correlated.
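A direct transcription of the formula into Python (standard library only); the function name is illustrative:

```python
import math

def pearson_r(a, b):
    """r_{A,B} = sum((A - mean_A)(B - mean_B)) / ((n - 1) * sd_A * sd_B),
    with sd computed using the (n - 1) denominator as in the formula above."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cross = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return cross / ((n - 1) * sd_a * sd_b)
```

Perfectly correlated attributes give r = 1, and perfectly anti-correlated ones give r = -1.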

**Correlation Analysis (Categorical Data)**

The χ² (chi-square) test:

$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$

The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count differs most from the expected count. Note that correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable, the population.

**Chi-Square Calculation: An Example**

|  | Play chess | Not play chess | Sum (row) |
| --- | --- | --- | --- |
| Like science fiction | 250 (90) | 200 (360) | 450 |
| Not like science fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |

The numbers in parentheses are expected counts, calculated from the data distribution in the two categories:

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

This shows that like_science_fiction and play_chess are correlated in the group.
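The calculation above can be checked with a short sketch (function name illustrative):

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows.
    Expected count for a cell = row_total * col_total / grand_total."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    grand = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / grand
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# The table from the slide: rows = like / not like science fiction,
# columns = play chess / not play chess.
table = [[250, 200], [50, 1000]]
chi2 = chi_square(table)  # about 507.94 (the slide truncates to 507.93)
```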

Here, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:
- Smoothing: remove noise from data, through binning, regression, and clustering.
- Aggregation: summarization, data cube construction.
- Generalization: concept hierarchy climbing.
- Normalization: scale values to fall within a small, specified range, via min-max normalization, z-score normalization, or normalization by decimal scaling.
- Attribute/feature construction: new attributes constructed from the given ones.

**Data Transformation: Normalization**

Min-max normalization to [new_min_A, new_max_A]:

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

Example: let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) (1.0 - 0) + 0 = 0.716.

Z-score normalization (μ: mean, σ: standard deviation):

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

Example: let μ = 54,000 and σ = 16,000. Then 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.

Normalization by decimal scaling:

$$ v' = \frac{v}{10^j} $$

where j is the smallest integer such that max(|v'|) < 1.
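The three normalizations as Python one-liners, reproducing the slide's two worked examples (function names illustrative):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization."""
    return (v - mu) / sigma

def decimal_scale(v, j):
    """Decimal scaling; j must make max(|v'|) < 1 over the whole attribute."""
    return v / 10 ** j

income = 73600
mm = min_max(income, 12000, 98000)  # about 0.716
z = z_score(income, 54000, 16000)   # 1.225
```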


Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis or mining may take a very long time to run on the complete data set.

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same analytical results.

Data reduction strategies:
1. Data cube aggregation
2. Attribute subset selection
3. Data compression or dimensionality reduction
4. Numerosity reduction, e.g., fitting the data into models

**1. Data Cube Aggregation**

Data cubes store multidimensional aggregated information. The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer, and its lowest level should be useful for analysis. The cube at the highest level of abstraction is the apex cuboid.

Data cubes created for varying levels of abstraction are referred to as cuboids. Each higher level of abstraction further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.

**2. Attribute selection**

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes or dimensions. The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. The best attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another.

Basic heuristic methods of attribute subset selection include the following techniques:
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction

**3. Data Compression or dimensionality reduction**

Here, data encodings or transformations are applied so as to obtain a reduced or compressed representation of the original data. Compression may be lossless or lossy. Two lossy compression techniques:
1. Wavelet transforms (the discrete wavelet transform, related to the discrete Fourier transform and implemented by a hierarchical pyramid algorithm)
2. Principal component analysis (PCA)

**Data Compression**

(Figure: with lossless compression, the original data can be recovered exactly from the compressed data; with lossy compression, only an approximation of the original data can be recovered.)

Examples:
- String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion.
- Audio/video compression: typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
- Time sequences (which are not audio): typically short, and varying slowly with time.

**Dimensionality Reduction: Wavelet Transformation**

The discrete wavelet transform (DWT) is a linear signal processing, multiresolutional analysis technique (e.g., the Haar-2 and Daubechies-4 wavelets). Compressed approximation: store only a small fraction of the strongest wavelet coefficients. It is similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.

Method:
- The input length L must be an integer power of 2 (pad with 0s when necessary).
- Each transform has two functions: smoothing and differencing.
- The two functions are applied to pairs of data points, resulting in two sets of data of length L/2.
- The two functions are applied recursively until the desired length is reached.

**Dimensionality Reduction: Principal Component Analysis (PCA)**

Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.

Steps:
- Normalize the input data, so that each attribute falls within the same range.
- Compute k orthonormal (unit) vectors, i.e., the principal components.
- Each input data vector is a linear combination of the k principal component vectors.
- The principal components are sorted in order of decreasing significance or strength.
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components (those with low variance); using the strongest principal components, it is possible to reconstruct a good approximation of the original data.

PCA works for numeric data only, and is used when the number of dimensions is large.

(Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane.)
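As a from-scratch illustration of the 2-D case (standard library only; the data points are hypothetical), the first principal component can be found by eigendecomposition of the 2x2 covariance matrix:

```python
import math

def pca_2d(points):
    """First principal component of 2-D data: center the data, build the 2x2
    covariance matrix, and take the unit eigenvector of its largest eigenvalue."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] via the quadratic formula.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # Corresponding eigenvector; axis-aligned fallback when sxy is ~0.
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

# Points lying near the line y = x: the first component points roughly along (1, 1).
direction, variance = pca_2d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)])
```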

**4. Numerosity reduction**

Numerosity reduction techniques reduce the data volume by choosing alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data (outliers may also be stored). Nonparametric methods store reduced representations of the data, and include histograms, clustering, and sampling.

**Data Reduction Method (1): Regression and Log-Linear Models**

- Linear regression: data are modeled to fit a straight line, often using the least-squares method to fit the line.
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
- Log-linear model: approximates discrete multidimensional probability distributions.

In more detail:

- Linear regression: Y = wX + b. The two regression coefficients, w and b, specify the line and are estimated from the data at hand, by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = αab βac γad δbcd.
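The least-squares estimates for w and b have a closed form; a minimal sketch (function name and data are illustrative):

```python
def linear_fit(xs, ys):
    """Least-squares estimates of w and b in Y = wX + b:
    w = sum((x - mx)(y - my)) / sum((x - mx)^2), b = my - w * mx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

# Points lying exactly on y = x + 1 recover w = 1 and b = 1.
w, b = linear_fit([0, 1, 2, 3], [1, 2, 3, 4])
```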

**Data Reduction Method (2): Histograms**

Divide the data into buckets and store the average (or sum) for each bucket. Partitioning rules:
- Equal-width: equal bucket range.
- Equal-frequency (or equal-depth): each bucket holds roughly the same number of values.
- V-optimal: the histogram with the least variance (where histogram variance is a weighted sum over the original values that each bucket represents).
- MaxDiff: set bucket boundaries between the pairs of adjacent values with the largest differences.

(Figure: an equal-width histogram over values ranging from 10,000 to 90,000.)
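A sketch of the equal-width rule, storing count and sum per bucket as described above (function name and data are illustrative):

```python
def equal_width_histogram(values, n_buckets):
    """Equal-width buckets over [min, max]; store count and sum per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [{"lo": lo + i * width, "hi": lo + (i + 1) * width,
                "count": 0, "sum": 0} for i in range(n_buckets)]
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the max value
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
hist = equal_width_histogram(prices, 3)  # buckets of width 10: [4,14), [14,24), [24,34]
```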

**Data Reduction Method (3): Clustering**

Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter). This can be very effective if the data is clustered, but not if the data is smeared. Clustering can be hierarchical, and the result can be stored in multi-dimensional index tree structures. There are many choices of clustering definitions and clustering algorithms.

**Data Reduction Method (4): Sampling**

Sampling obtains a small sample s to represent the whole data set N, allowing a mining algorithm to run with complexity that is potentially sub-linear in the size of the data. The key is to choose a representative subset of the data:

- Simple random sampling may perform very poorly in the presence of skew.
- Stratified sampling approximates the percentage of each class (or subpopulation of interest) in the overall database, and is used in conjunction with skewed data.
- Adaptive sampling methods can also be developed.

Note: sampling may not reduce database I/Os, which happen a page at a time.

**Sampling: with or without Replacement**

(Figure: drawing a sample from the raw data, either with or without replacement.)

**Sampling: Cluster or Stratified Sampling**

(Figure: raw data partitioned into groups, from which a cluster or stratified sample is drawn.)
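The sampling schemes above can be sketched with the standard library's `random` module; the strata and sampling rate are hypothetical:

```python
import random

random.seed(42)  # fixed seed for a reproducible illustration
data = list(range(100))

# Simple random sampling without replacement.
without_repl = random.sample(data, 10)

# Simple random sampling with replacement: an item may be drawn twice.
with_repl = [random.choice(data) for _ in range(10)]

# Stratified sampling: sample each stratum in proportion to its size,
# so a rare subpopulation keeps its share of the sample.
strata = {"common": [x for x in data if x < 80],
          "rare": [x for x in data if x >= 80]}
rate = 0.1
stratified = [x for group in strata.values()
              for x in random.sample(group, int(len(group) * rate))]
```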

Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession.
- Ordinal: values from an ordered set, e.g., military or academic rank.
- Continuous: real numbers.

Discretization divides the range of a continuous attribute into intervals. It is useful because some classification algorithms only accept categorical attributes, it reduces the data size, and it prepares the data for further analysis.

**Discretization**

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values. Discretization may be supervised or unsupervised, split (top-down) or merge (bottom-up), and can be performed recursively on an attribute.

Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior).

Discretization techniques can be categorized based on how the discretization is performed:

1. Supervised vs. unsupervised: supervised discretization uses the class information; unsupervised discretization does not.
2. Top-down discretization (splitting) starts by finding one or a few points (split points or cut points) to split the entire attribute range, and then repeats recursively on the resulting intervals. Bottom-up discretization (merging) instead merges neighboring intervals recursively.

Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of its values, known as a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.

Typical methods (all can be applied recursively):
- Binning (covered above): top-down split, unsupervised.
- Histogram analysis (covered above): top-down split, unsupervised.
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised.
- Entropy-based discretization: supervised, top-down split.
- Interval merging by χ² analysis: supervised, bottom-up merge.
- Segmentation by natural partitioning: top-down split, unsupervised.

**Entropy-Based Discretization**

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information (entropy) after partitioning is

$$ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) $$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$$ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 (p_i) $$

where pi is the probability of class i in S1. The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained, until some stopping criterion is met. Such a boundary may reduce the data size and improve classification accuracy.
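The boundary search above can be sketched directly from the formulas (function names and data are illustrative):

```python
import math

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def best_split(values, labels):
    """Try each boundary between adjacent sorted values and return the one
    minimizing I(S,T) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for k in range(1, n):
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [c for _, c in pairs[:k]]
        right = [c for _, c in pairs[k:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

# Hypothetical attribute whose classes separate cleanly around 10.5:
t, info = best_split([1, 2, 3, 18, 20, 25], ["a", "a", "a", "b", "b", "b"])
```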

**Interval Merge by χ² Analysis**

This is a merging-based (bottom-up) method, as opposed to splitting-based methods: find the best neighboring intervals and merge them to form larger intervals, recursively. An example is ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]:
- Initially, each distinct value of a numerical attribute A is considered to be one interval.
- χ² tests are performed for every pair of adjacent intervals.
- Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions.
- The merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, a maximum number of intervals, or a maximum inconsistency).

**Segmentation by Natural Partitioning**

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.

**Example of 3-4-5 Rule**

Suppose profit values range from Min = -$351 to Max = $4,700.

- Step 1: To keep outliers from dominating, take the 5th and 95th percentiles: Low = -$159 and High = $1,838. The most significant digit gives msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
- Step 2: The interval (-$1,000, $2,000] covers 3 distinct values at the msd, so partition it into 3 equi-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000].
- Step 3: Adjust the boundaries to cover Min and Max: since Min = -$351 > -$1,000, shrink the first interval to (-$400, 0]; since Max = $4,700 > $2,000, add a new interval ($2,000, $5,000].
- Step 4: Recursively partition each interval: (-$400, 0] into 4 sub-intervals of width $100; (0, $1,000] into 5 sub-intervals of width $200; ($1,000, $2,000] into 5 sub-intervals of width $200; ($2,000, $5,000] into 3 sub-intervals of width $1,000.

(April 7, 2011, Data Mining: Concepts and Techniques)

A concept hierarchy for a numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for age) with higher-level concepts (such as youth, middle-aged, or senior); the high-level concepts are useful for data generalization. Discretization techniques and concept hierarchies are applied before data mining, as a preprocessing step, rather than during mining.

**Concept Hierarchy Generation for Categorical Data**

- Specification of a partial or total ordering of attributes explicitly at the schema level, by users or experts: e.g., street < city < state < country.
- Specification of a hierarchy for a set of values by explicit data grouping: e.g., {Urbana, Champaign, Chicago} < Illinois.
- Specification of only a partial set of attributes: e.g., only street < city, not the others.
- Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values: e.g., for the attribute set {street, city, state, country}.

**Automatic Concept Hierarchy Generation**

Some hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. There are exceptions, e.g., weekday, month, quarter, year.

- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
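The heuristic above, applied to the slide's distinct-value counts, amounts to sorting attributes by count:

```python
# Distinct-value counts from the slide above.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Most distinct values -> lowest hierarchy level.
levels = sorted(distinct_counts, key=distinct_counts.get)  # top level first
hierarchy = " < ".join(reversed(levels))
```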

2011 Data Mining: Concepts and Techniques 68 .April 7.
