84 views

Uploaded by api-27259648

- International Journal of Engineering Research and Development (IJERD)
- Modified SPIHT Algorithm for Wavelet Packet Image Coding
- [IJET-V1I6P2] Authors:Imran khan , Asst. Prof K.Suresh , Asst.Prof Miss Ranjana Batham
- Cont2DWT
- CCTV SYSTEMS Video Compression
- Along Binary Basic Built in Behavior Resulting
- 10. Mul Tech
- IJERA(www.ijera.com)International Journal of Engineering Research and Applications
- Digital Image Processing
- Blind Key Steganography Based on Multilevel Wavelet and CSF
- task 5 and 6
- Steganographic Method Image Security Based on Optimal Pixel Adjustment Process and Integer Wavelet Transform
- 50af7f09e80f94.88367740
- Chapters
- SDMCET-EE--V. R. Sheelavant.pdf
- Information and Coding Theory ECE533 - Lec-2 Source Coding v4.0
- Introduction to Orthogonal Transforms
- 10.1.1.323.9065
- Electrocardiogram Denoised Signal by Discrete Wavelet Transform and Continuous Wavelet Transform
- Fourier vs Wavelet

You are on page 1of 55

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 1

Why Data Preprocessing?

incomplete: lacking attribute values, lacking

only aggregate data

noisy: containing errors or outliers

or names

No quality data, no quality mining results!

Quality decisions must be based on quality

data

Data warehouse needs consistent integration

of quality data

October 16, 2008 Data Mining: Concepts and Techniques 2

Multi-Dimensional Measure of Data

Quality

Accuracy

Completeness

Consistency

Timeliness

Believability

Value added

Interpretability

Accessibility

Broad categories:

intrinsic, contextual, representational, and

accessibility.

October 16, 2008 Data Mining: Concepts and Techniques 3

Major Tasks in Data

Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or

remove outliers, and resolve inconsistencies

Data integration

Integration of multiple databases, data cubes, or files

Data transformation

Normalization and aggregation

Data reduction

Obtains reduced representation in volume but produces

the same or similar analytical results

Data discretization

Part of data reduction but with particular importance,

especially for numerical data

October 16, 2008 Data Mining: Concepts and Techniques 4

Forms of data

preprocessing

Chapter 3: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 6

Data Cleaning

Fill in missing values

Identify outliers and smooth out noisy

data

Correct inconsistent data

Missing Data

Data is not always available

E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the

time of entry

not register history or changes of the data

Missing data may need to be inferred.

How to Handle Missing

Data?

Ignore the tuple: usually done when class label is missing

(assuming the tasks in classification—not effective when the

percentage of missing values per attribute varies

considerably.

Fill in the missing value manually: tedious + infeasible?

Use a global constant to fill in the missing value: e.g.,

“unknown”, a new class?!

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples belonging to the

same class to fill in the missing value: smarter

Use the most probable value to fill in the missing value:

October 16, 2008 Data Mining: Concepts and Techniques 9

Noisy Data

variable

Incorrect attribute values may due to

technology limitation

duplicate records

incomplete data

October16,inconsistent

2008 data

Data Mining: Concepts and Techniques 10

How to Handle Noisy Data?

Binning method:

first sort data and partition into (equi-depth)

bins

then one can smooth by bin means, smooth by

Clustering

Regression

functions

October 16, 2008 Data Mining: Concepts and Techniques 11

Simple Discretization Methods:

Binning

Equal-width (distance) partitioning:

It divides the range into N intervals of equal

if A and B are the lowest and highest values of

(B-A)/N.

The most straightforward

It divides the range into N intervals, each

samples

October

16, 2008 Data Mining: Concepts and Techniques 12

Binning Methods for Data

Smoothing

• Sorted data for price (in dollars):

• 13,15,16,

16,19,20,20,21,22,22,25,25,25,33,35,35,35,35,36,40,45,

• 46,52,70

• * Partition into (equi-depth) bins:

- Bin 1: 13,15,16

- Bin 2: 16,19,20

- Bin 3: 20,21,22

- Bin 4: 22,25,25

- Bin 5: 25,25,33

- Bin 6: 33,35,35

- Bin 7 35,35,36

- Bin 8: 40, 45,46

- Bin 9: 52,70

*

October 16, 2008 Data Mining: Concepts and Techniques 13

Smoothing by bin means:

- Bin 1: 14,14,14

- Bin 2: 18,18,18

- Bin 3: 21,21,21

- Bin 4: 24,24,24

- Bin 5: 27,27,27

- Bin 6: 34,34,34

- Bin 7: 35,35,35

- Bin 8: 43,43,43

- Bin 9: 61,62

Smoothing by bin medians:

- Bin 1: 15,15,15

- Bin 2: 19,19,19

- Bin 3: 21,21,21

- Bin 4: 25,25,25

- Bin 5: 25,25,25

- Bin 6: 35,35,35

- Bin 7: 35,35,35

- Bin 8: 45,45,45

- Bin 9: 61,61

Smoothing by bin boundaries:

- Bin 1: 13,16,16

- Bin 2: 16,20,20

- Bin 3: 20,22,22

- Bin 4: 22,25,25

- Bin 5: 25,25,33

- Bin 6: 33,35,35

- Bin 7: 35,35,36

- Bin 8: 40,46,46

- Bin 9: 52,70

Inconsistent Data

have different names in different databases.

Cluster Analysis

Regression

y

Y1

Y1’ y=x+1

X1 x

Chapter 3: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 20

Data Integration

Data integration:

combines data from multiple sources into a

coherent store

Schema integration

integrate metadata from different sources

id ≡ B.cust-#

Detecting and resolving data value conflicts

for the same real world entity, attribute values

possible reasons: different representations,

October 16, 2008 Data Mining: Concepts and Techniques 21

Data in Data

Integration

Redundant data occur often when integration of

multiple databases

The same attribute may have different names

in different databases

One attribute may be a “derived” attribute in

another table, e.g., annual revenue

Redundant data may be able to be detected by

correlational analysis

Careful integration of the data from multiple

sources may help reduce/avoid redundancies and

inconsistencies and improve mining speed and

Octoberquality

16, 2008 Data Mining: Concepts and Techniques 22

Data

Transformation

Aggregation: summarization, data cube

construction

Generalization: concept hierarchy climbing

Normalization: scaled to fall within a small,

specified range

min-max normalization

z-score normalization

normalization by decimal scaling

Attribute/feature construction

New attributes

October 16, 2008 constructed

Data Mining: from the given

Concepts and Techniques 23

Data Transformation:

Normalization

min-max normalization

v − minA

v' = (new _ maxA − new _ minA) + new _ minA

maxA − minA

z-score normalization

v −meanA

v' =

stand _ devA

normalization by decimal scaling

v

v' = j Where j is the smallest integer such that Max(| v ' |)<1

10

Chapter 3: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 25

Data Reduction

Strategies

Warehouse may store terabytes of data: Complex

data analysis/mining may take a very long time to

run on the complete data set

Data reduction

Obtains a reduced representation of the data set

the same (or almost the same) analytical results

Data reduction strategies

Data cube aggregation

Dimensionality reduction

Numerosity reduction

Data Cube Aggregation

The lowest level of a data cube

the aggregated data for an individual entity of

interest

e.g., a customer in a phone calling data

warehouse.

Multiple levels of aggregation in data cubes

Further reduce the size of data to deal with

Reference appropriate levels

Use the smallest representation which is enough

to solve the task

October

16, 2008 Data Mining: Concepts and Techniques 27

Dimensionality Reduction

Feature selection (i.e., attribute subset selection):

Select a minimum set of features such that the

the values for those features is as close as

possible to the original distribution given the

values of all features

reduce # of patterns in the patterns, easier to

understand

Heuristic methods (due to exponential # of choices):

step-wise forward selection

Example of Decision Tree Induction

{A1, A2, A3, A4, A5, A6}

A4 ?

A1? A6?

Heuristic Feature Selection

Methods

There are 2d possible sub-features of d features

Several heuristic feature selection methods:

Best single features under the feature

significance tests.

Best step-wise feature selection:

...

Step-wise feature elimination:

Repeatedly eliminate the worst feature

Best combined feature selection and

elimination:

October 16, 2008 Data Mining: Concepts and Techniques 30

Data Compression

String compression

There are extensive theories and well-tuned

algorithms

Typically lossless

expansion

Audio/video compression

Typically lossy compression, with progressive

refinement

Sometimes small fragments of signal can be

Time sequence is not audio

October 16,

2008 Data Mining: Concepts and Techniques 31

Data Compression

Data

lossless

s s y

lo

Original Data

Approximated

Wavelet Transforms

Haar2 Daubechie4

Discrete wavelet transform (DWT): linear signal

processing

Compressed approximation: store only a small

fraction of the strongest of the wavelet coefficients

Similar to discrete Fourier transform (DFT), but

better lossy compression, localized in space

Method:

Length, L, must be an integer power of 2 (padding with 0s,

when necessary)

Each transform has 2 functions: smoothing, difference

Applies to pairs of data, resulting in two set of data of

length L/2

October 16,

2008 Data Mining: Concepts and Techniques 33

Principal Component Analysis

<= k orthogonal vectors that can be best used

to represent data

The original data set is reduced to one

consisting of N data vectors on c principal

components (reduced dimensions)

Each data vector is a linear combination of the c

principal component vectors

Works for numeric data only

Used when the number of dimensions is large

October 16, 2008 Data Mining: Concepts and Techniques 34

Principal Component Analysis

X2

Y1

Y2

X1

Numerosity Reduction

Parametric methods

Assume the data fits some model, estimate

model parameters, store only the parameters,

and discard the data (except possible outliers)

Log-linear models: obtain value at a point in

m-D space as the product on appropriate

marginal subspaces

Non-parametric methods

Do not assume models

Major families: histograms, clustering,

sampling

October 16, 2008 Data Mining: Concepts and Techniques 36

Regression and Log-Linear

Models

straight line

Often uses the least-square method to fit the

line

Multiple regression: allows a response variable Y

to be modeled as a linear function of

multidimensional feature vector

Log-linear model:

October 16, 2008

approximates discrete

Data Mining: Concepts and Techniques 37

Regress Analysis and Log-

Linear Models

Linear regression: Y = α + β X

Two parameters , α and β specify the line and

using the least squares criterion to the known

Multiple regression: Y = b0 + b1 X1 + b2 X2.

Many nonlinear functions can be transformed

Log-linear models:

The multi-way table of joint probabilities is

tables.

Probability: p(a, b, c, d) = αab βacχad δbcd

October 16, 2008 Data Mining: Concepts and Techniques

Histograms

A popular data 40

reduction technique 35

Divide data into 30

buckets and store

25

average (sum) for

each bucket 20

Can be constructed 15

optimally in one 10

dimension using

5

dynamic

programming 0

Related to

90000

20000

40000

70000

50000

10000

30000

60000

80000

100000

quantization

Clustering

cluster representation only

Can be very effective if data is clustered but not

if data is “smeared”

Can have hierarchical clustering and be stored in

multi-dimensional index tree structures

There are many choices of clustering definitions

and clustering algorithms, further detailed in

Chapter 8

October 16, 2008 Data Mining: Concepts and Techniques 40

Sampling

is potentially sub-linear to the size of the data

Choose a representative subset of the data

Develop adaptive sampling methods

Stratified sampling:

database

Used in conjunction with skewed data

October 16, 2008 Data Mining: Concepts and Techniques 41

Sampling

W O R

SRS le random

i m p h o u t

( s e wi t

l

samp ment)

p l a c e

re

SRSW

R

Raw Data

Sampling

Hierarchical Reduction

Use multi-resolution structure with different

degrees of reduction

Hierarchical clustering is often performed but tends

“clusters”

Parametric methods are usually not amenable to

hierarchical representation

Hierarchical aggregation

Each partition can be considered as a bucket

October 16, 2008 a hierarchical histogram

Mining: Concepts and Techniques 44

Chapter 3: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 45

Discretization

Three types of attributes:

Nominal — values from an unordered set

Discretization:

☛ divide the range of a continuous attribute into

intervals

Some classification algorithms only accept

categorical attributes.

Reduce data size by discretization

Discretization and Concept

hierachy

Discretization

reduce the number of values for a given

continuous attribute by dividing the range of

the attribute into intervals. Interval labels can

then be used to replace actual data values.

Concept hierarchies

reduce the data by collecting and replacing

low level concepts (such as numeric values for

the attribute age) by higher level concepts

(such as young, middle-aged, or senior).

Discretization and concept

hierarchy generation for numeric

data

Entropy-based discretization

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into

two intervals S1 and S2 using boundary T, the

entropy after partitioning

|S | is | S |

E (S , T ) = 1 Ent ( ) + 2 Ent ( )

|S| S1 | S | S2

The boundary that minimizes the entropy function

over all possible boundaries is selected as a binary

discretization.

The process is recursively applied to partitions

obtained until some

Ent ( S stopping

) − E (T , S )criterion

>δ is met, e.g.,

and improve classification accuracy

Segmentation by natural

partitioning

into

relatively uniform, “natural” intervals.

* If an interval covers 3, 6, 7 or 9 distinct values at

the most significant digit, partition the range

into 3 equi-width intervals

* If it covers 2, 4, or 8 distinct values at the most

significant digit, partition the range into 4

intervals

* If it covers 1, 5,

October 16, 2008

or 10 distinct values at the most

Data Mining: Concepts and Techniques 50

Example of 3-4-5 rule

count

Min Low (i.e, 5%-tile) High(i.e, 95%-0 tile) Max

Step 2: msd=1,000 Low=-$1,000 High=$2,000

(-$1,000 - $2,000)

Step 3:

(-$4000 -$5,000)

Step 4:

(-$400 - 0) (0 - $1,000) ($1,000 - $2, 000)

(0 -

($1,000 -

(-$400 - $200)

$1,200) ($2,000 -

-$300) $3,000)

($200 -

($1,200 -

$400)

(-$300 - $1,400)

($3,000 -

-$200)

($400 - ($1,400 - $4,000)

(-$200 - $600) $1,600) ($4,000 -

-$100) ($600 - ($1,600 - $5,000)

$800) ($800 - ($1,800 -

$1,800)

(-$100 - $1,000) $2,000)

0)

Concept hierarchy generation for

categorical data

explicitly at the schema level by users or experts

Specification of a portion of a hierarchy by

explicit data grouping

Specification of a set of attributes, but not of

their partial ordering

Specification of only a partial set of attributes

Specification of a set of

attributes

Concept hierarchy can be automatically

generated based on the number of distinct

values per attribute in the given attribute set.

The attribute with the most distinct values is

placed at the lowest level of the hierarchy.

values

city 3567 distinct values

Chapter 3: Data Preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy

generation

Summary

October 16, 2008 Data Mining: Concepts and Techniques 54

Summary

warehousing and mining

Data preparation includes

Data cleaning and data integration

Data reduction and feature selection

Discretization

A lot a methods have been developed but still an

active area of research

October 16, 2008 Data Mining: Concepts and Techniques 55

- International Journal of Engineering Research and Development (IJERD)Uploaded byIJERD
- Modified SPIHT Algorithm for Wavelet Packet Image CodingUploaded byKvsakhil Kvsakhil
- [IJET-V1I6P2] Authors:Imran khan , Asst. Prof K.Suresh , Asst.Prof Miss Ranjana BathamUploaded byInternational Journal of Engineering and Techniques
- Cont2DWTUploaded byWilverAnderson
- CCTV SYSTEMS Video CompressionUploaded bymuddog
- Along Binary Basic Built in Behavior ResultingUploaded bysfofoby
- 10. Mul TechUploaded byshitterbabyboobsdick
- IJERA(www.ijera.com)International Journal of Engineering Research and ApplicationsUploaded byAnonymous 7VPPkWS8O
- Digital Image ProcessingUploaded bykainsu12
- Blind Key Steganography Based on Multilevel Wavelet and CSFUploaded bywww.irjes.com
- task 5 and 6Uploaded byapi-297332338
- Steganographic Method Image Security Based on Optimal Pixel Adjustment Process and Integer Wavelet TransformUploaded bySarayu Sannajaji
- 50af7f09e80f94.88367740Uploaded byVinod Thete
- ChaptersUploaded byRichurajan
- SDMCET-EE--V. R. Sheelavant.pdfUploaded byJulio Ferreira
- Information and Coding Theory ECE533 - Lec-2 Source Coding v4.0Uploaded byNikesh Bajaj
- Introduction to Orthogonal TransformsUploaded byEdwin J Montufar
- 10.1.1.323.9065Uploaded byDũng Cù Việt
- Electrocardiogram Denoised Signal by Discrete Wavelet Transform and Continuous Wavelet TransformUploaded byAI Coordinator - CSC Journals
- Fourier vs WaveletUploaded byarwachinschool
- 2009-Time-Frequency Feature Extraction From Stress and Emotio Clasification in SpeechUploaded byhord72
- SSTD11Uploaded bydiana hertanti
- Lossless Compression and Information Hiding in Images Using SteganographyUploaded bySreeni Champ
- The Research on Unmanned Aerial Vehicle Remote Sensing and Its ApplicationsUploaded bygauravsn143
- Main Prjt DocUploaded bymgitecetech
- Philip E. Pace-Detecting and Classifying Low Probability of Intercept Radar (2009).pdfUploaded byanji.guvvala
- JPEG SpecificationUploaded byvikassingh86
- 09 TMT4073 Digital Image ProcessingUploaded bySusie Chai
- IRJET-A NOVEL FINGERPRINT COMPRESSION STANDARD BASED ON SPARSE REPRESENTATIONUploaded byIRJET Journal
- Structural Health Monitoring 2011 Wei Fan 83 111Uploaded bypaulkohan

- Application Format 126 GDOCUploaded byapi-27259648
- Application Format 126 GDOCUploaded byapi-27259648
- MC1702Uploaded byapi-27259648
- Unit 2 FinalUploaded byapi-27259648
- CTS_Final_MCAUploaded byapi-27259648
- EC1307 Microprocessor and Application LabUploaded byapi-27259648
- DS 82C55Uploaded byilluminatissm
- coprocessor1Uploaded byapi-27259648
- jaeUploaded byapi-27259648
- Programmable Keyboard-Display Interface - 8279Uploaded byapi-27259648
- CA-no_standing_arrearUploaded byapi-27259648
- Campus_data_format_HCL.xlsMCAUploaded byapi-27259648
- CTS_Topper_shopper_MCAUploaded byapi-27259648
- Virtual MemoryUploaded byapi-27259648
- JAVA_QTN2Uploaded byapi-27259648
- Chap_6Uploaded byapi-27259648
- Flash MaskUploaded byapi-27259648
- Unix 4 and 5Uploaded byapi-3836654
- Chap_9Uploaded byapi-27259648
- flashformUploaded byapi-27259648
- HCL__CTS.xlsMCAUploaded byapi-27259648
- Chap_8Uploaded byapi-27259648
- chap_7Uploaded byapi-27259648
- Creating night effectsUploaded byapi-27259648
- Creating own brush toolUploaded byapi-27259648
- PRINT1Uploaded byapi-27259648
- PRINT2Uploaded byapi-27259648
- flashUploaded byapi-27259648
- IdlUploaded byapi-3836654
- Time table-AUploaded byapi-27259648

- Content Based Image Retrieval using Dominant Color and Texture featuresUploaded byijcsis
- plantdisease.pdfUploaded byJoyce Ann Martes
- A Survey of Feature Selection and Feature Extraction Techniques in Machine LearningUploaded byTaufiqAriWDamanik
- PSO-GA Algorithm Based Cyberbullying DetectionUploaded byInternational Journal of Innovative Science and Research Technology
- Neural NetworkUploaded byJournalNX - a Multidisciplinary Peer Reviewed Journal
- Face Recognition Using Eigenfaces and Neural NetworksUploaded byatulnandwal
- Image Similarity Measure Using Color Histogram, Color Coherence Vector, And Sobel MethodUploaded byIjsrnet Editorial
- Thesis Proposal - FPGA-Based Face Recognition System, By Poie - Nov 12, 2009Uploaded byPoie
- p387Uploaded byJonathan Hodges
- A Frequency Based ApprochUploaded bySajeev Ram
- Face Detection based on Skin Color in Image by Neural NetworksUploaded bythaleswary
- A Fuzzy Self-Constructing Feature ClusteringUploaded byNagesh Dhangar
- HWCRUploaded byNikhil Sai Nethi
- Vehicle License Plate SegmentationUploaded byJames Yu
- Thesis FinalUploaded byLouiseBundgaard
- Audiovisual Speech Recognition with Missing or Unreliable DataUploaded byd.kolossa660
- Comparison of Face Recognition Approach through MPCA Plus LDA and MPCA Plus LPPUploaded byijcsis
- Using Genetic Algorithms for Data Mining Optimization in an Educational Web-Based SystemUploaded byvasu
- Face Recognition 2Uploaded byNora Habeb
- 06_Robust Speech Recognition Using Perceptual Wavelet DenoisingUploaded bykashidshubhangi
- IRJET- Design and Analysis of Compact Circularly Polarized Fractal Boundaray Circuar Patch Antenna for Wi-Max ApplicationsUploaded byIRJET Journal
- arabic ocr a surveyUploaded byapi-3754855
- International Journal of Image Processing (IJIP) ,Volume (4): Issue (5)Uploaded byAI Coordinator - CSC Journals
- 2018 - Su - SPLATNet Sparse Lattice Networks for Point Cloud ProcessingUploaded bysnt_388857437
- -Handwriting RecognitionUploaded byJournalNX - a Multidisciplinary Peer Reviewed Journal
- A Colour Code Algorithm for Signature RecognitionUploaded byJulietta Cardenas Muñoz
- dami07Uploaded byMohan Krishna Mannava
- SMART OFFICE SURVEILLANCE ROBOT USING FACE RECOGNITIONUploaded byTJPRC Publications
- REportUploaded byNurul Aida Amira Johari
- Scribe Identification in Medieval English ManuscriptsUploaded bytweety492