Anil Chhangani
Associate Professor
CSC - 504
Information
Data: a stream of raw facts representing things or events that have happened.
Information: data that has been processed to make it meaningful and useful.
Data + Meaning = Information
Knowledge: knowing what to do with data and information requires knowledge.
Units of data
Kilobyte (KB) = 1,000 bytes
Megabyte (MB) = 1,000 kilobytes
Gigabyte (GB) = 1,000 megabytes
Terabyte (TB) = 1,000 gigabytes
Petabyte (PB) = 1,000 terabytes
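To make the arithmetic in the table concrete, here is a small Python sketch (the helper name to_unit is made up for illustration) that converts a raw byte count into any of the decimal units above:

```python
def to_unit(num_bytes, unit):
    """Convert a raw byte count into the named decimal (power-of-1,000) unit."""
    factors = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}
    return num_bytes / factors[unit]

print(to_unit(2_500_000_000, "GB"))  # 2.5 gigabytes
```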
What is Data Mining?
Data mining is searching for knowledge (interesting patterns) in data.
Data mining functionalities include:
Characterization
Discrimination
Association and correlation analysis
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis
Characterization / Descriptions
Data can be associated with classes or concepts, e.g.:
classes of items – computers, printers
concepts of customers – bigSpenders, budgetSpenders
Descriptions can be derived via data characterization – summarizing the general characteristics of a target class, e.g. summarizing the characteristics of customers who spend more than Rs 10,000 a year.
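As a rough illustration of this kind of characterization, the sketch below uses pandas to summarize a toy customer table; the table and its column names (annual_spend, age, city) are invented for the example and are not from the lecture:

```python
import pandas as pd

# Toy customer data (made up for illustration).
customers = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "age": [34, 41, 29, 52],
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "annual_spend": [15000, 8000, 22000, 12500],
})

# Target class: customers who spend more than Rs 10,000 a year.
big_spenders = customers[customers["annual_spend"] > 10000]

# Data characterization: summarize the general characteristics of the class.
print(big_spenders[["age", "annual_spend"]].describe())
print(big_spenders["city"].value_counts())
```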
Knowledge Discovery from Data (KDD) Process
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: multiple data sources may be combined
3. Data selection: data relevant to the analysis task are retrieved from the database
4. Data transformation: data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations
5. Data mining: an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation: to identify interesting patterns representing knowledge, based on interestingness measures
7. Knowledge presentation: visualization and knowledge representation techniques are used to present mined knowledge to users
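The seven steps can be traced end to end on a toy table. The following sketch is only an illustration under assumed inputs: the tables, column names, and the Rs 10,000 threshold are invented, and a simple aggregation stands in for a real mining algorithm:

```python
import pandas as pd

# Toy transaction data (made up for illustration).
sales = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Asha", None, "Meena"],
    "amount":   [5000, 12000, 4000, 3000, 8000],
    "region":   ["W", "W", "W", "E", "E"],
})

# 1. Data cleaning: remove noisy / inconsistent records (here, missing names).
clean = sales.dropna(subset=["customer"])

# 2. Data integration: combine multiple sources (a second toy table here).
profiles = pd.DataFrame({"customer": ["Asha", "Ravi", "Meena"],
                         "age": [34, 41, 29]})
merged = clean.merge(profiles, on="customer", how="left")

# 3. Data selection: keep only the attributes relevant to the analysis task.
selected = merged[["customer", "amount"]]

# 4. Data transformation: consolidate via summary / aggregation operations.
per_customer = selected.groupby("customer", as_index=False)["amount"].sum()

# 5. Data mining: apply a (very simple) method to extract a pattern.
per_customer["big_spender"] = per_customer["amount"] > 10000

# 6. Pattern evaluation: keep only the "interesting" patterns.
interesting = per_customer[per_customer["big_spender"]]

# 7. Knowledge presentation: present the mined knowledge to the user.
print(interesting.to_string(index=False))
```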
What Is an Attribute?
An attribute is a data field representing a characteristic or feature of a data object.
Types of attributes: Nominal, Binary, Ordinal, Numeric
Nominal example: possible values for hair color are black, brown, blond, red, pink, gray, and white.
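One way to see the four attribute types side by side is a small pandas table with one column of each kind; apart from the hair-color values quoted above, the columns and data below are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame({
    "hair_color": ["black", "brown", "blond", "red"],          # nominal
    "is_smoker": [0, 1, 0, 0],                                  # binary
    "grade": pd.Categorical(["B", "A", "C", "A"],
                            categories=["C", "B", "A"],
                            ordered=True),                      # ordinal
    "age": [25, 32, 47, 19],                                    # numeric
})
print(people.dtypes)
```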
The dispersion of the data: that is, how are the data spread out? The most common data dispersion measures (computed together in the sketch after the definitions below) are the:
Range
Quartiles
Interquartile range
Five-number summary and boxplots
Variance and standard deviation
For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean (a trimmed mean).
The median separates the higher half of a data set from the lower half.
The mode of a set of data is the value that occurs most frequently in the set.
The third quartile, Q3, is the 75th percentile: it cuts off the lowest 75% (or highest 25%) of the data.
IQR = Q3 - Q1
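The measures defined above can be computed together in a few lines. The sketch below uses NumPy, SciPy, and the standard library on a made-up list of values (the data are not from the lecture):

```python
import numpy as np
from scipy import stats
from statistics import mode

# Toy data (made up for illustration).
x = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 30, 35, 70])

# Range: difference between the largest and smallest values.
print("range:", x.max() - x.min())

# Quartiles and interquartile range.
q1, q2, q3 = np.percentile(x, [25, 50, 75])
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", q3 - q1)

# Five-number summary: min, Q1, median, Q3, max (the basis of a boxplot).
print("five-number summary:", x.min(), q1, q2, q3, x.max())

# Variance and standard deviation.
print("variance:", x.var(), "std dev:", x.std())

# Trimmed mean: drop the top and bottom 2% of values before averaging.
print("trimmed mean:", stats.trim_mean(x, 0.02))

# Mode: the most frequently occurring value.
print("mode:", mode(x))
```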
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
Interquartile range: to find the first, second, and third quartiles:
1. Arrange the N data values into a sorted array.
For a sorted list of 12 salary values, for example, the quartiles are the 3rd, 6th, and 9th values; therefore Q1 = $47,000 and Q3 = $63,000.
Binning: in smoothing by bin means, each value in a bin is replaced by the mean value of the bin. In the example below, the data for price are first sorted and then partitioned into equal-frequency bins of size 3:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
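A minimal Python sketch of this step, applying equal-frequency binning of size 3 and smoothing by bin means to the sorted price values listed above:

```python
# The sorted price data from the slide.
price = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
                33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

bin_size = 3
smoothed = []
for i in range(0, len(price), bin_size):
    # Take one equal-frequency bin of size 3.
    bin_values = price[i:i + bin_size]
    # Smoothing by bin means: replace each value with the mean of its bin.
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)
```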
Data Preprocessing
Major Tasks in Data Preprocessing (fig. taken from Han-Kamber's book, Data Mining: Concepts and Techniques):
Data cleaning
Data integration
Data reduction
Data transformation
Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data
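As a small sketch of such cleaning routines, the example below (with an invented table and column names) fills a missing value with the column mean and corrects an inconsistent label using pandas:

```python
import pandas as pd

# Toy table with a missing value and an inconsistent label (made up).
df = pd.DataFrame({
    "age": [25, None, 47, 19],
    "city": ["Mumbai", "mumbai", "Delhi", "Pune"],
})

# Fill the missing age with the column mean (one common strategy).
df["age"] = df["age"].fillna(df["age"].mean())

# Correct an inconsistency: normalise the city labels to one spelling.
df["city"] = df["city"].str.title()

print(df)
```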