
Dr. Anil Chhangani
Associate Professor

CSC - 504
Data Warehousing & Mining


Data
A stream of raw facts representing things or events that have happened

Information
Data that has been processed to make it meaningful & useful
Data + Meaning = Information

Knowledge
Knowing what to do with Data & Information requires knowledge
Unit
Kilobytes (KB) 1,000 bytes
Megabytes (MB) 1,000 Kilobytes
Gigabytes (GB) 1,000 Megabytes
Terabytes (TB) 1,000 Gigabytes
Petabytes (PB) 1,000 Terabytes
Exabytes (EB) 1,000 Petabytes
Zettabytes (ZB) 1,000 Exabytes
Yottabytes (YB) 1,000 Zettabytes
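Each unit is 1,000 times the previous one, so converting a raw byte count is repeated division. A minimal Python sketch (the helper human_readable is our own illustration, not part of the course material):

UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    # Divide by 1,000 until the value fits the current unit
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000.0

print(human_readable(2_500_000_000_000))   # -> 2.50 TB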


Why Data Mining?
• The explosive growth of data: from terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)
• Data rich but information poor!
• Data mining: analysis of massive data sets


What is Data Mining?
- searching for knowledge (interesting patterns) in data
- analysing large amounts of data stored in a data warehouse for useful information, using statistical tools to find patterns that are otherwise hidden
- an essential step in the process of knowledge discovery
Data Mining Task Primitives
- A user has a data mining task in mind (a form of data analysis)
- The task can be specified in the form of a data mining query
- A data mining query is defined in terms of data mining task primitives
- These primitives allow the user to communicate with the DM system during discovery in order to
  - direct the mining process
  - examine the findings from different angles or depths
Data Mining Task Primitives
The data mining primitives specify the following:
The set of task-relevant data to be mined:
- specifies the portions of the database or the set of data in which the user is interested
- includes the database attributes or data warehouse dimensions of interest
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined:
Specifies the DM functions to be performed, such as
Characterization
Discrimination
Association or correlation analysis
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process:
Useful for
- guiding the knowledge discovery process
- evaluating the patterns found
Concept hierarchies allow data to be mined at multiple levels of abstraction
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process:
Concept hierarchies allow data to be mined at multiple levels of abstraction
E.g., a concept hierarchy for the attribute age: the root represents the most general abstraction level, denoted as all
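As an illustration of how such a hierarchy lets mining work at a coarser level, here is a minimal Python sketch; the level names and age cut-offs are our own assumptions, not taken from the slides:

def age_concept(age: int) -> str:
    # Hierarchy: all -> {youth, middle_aged, senior} -> raw age values
    # (the cut-offs below are illustrative assumptions)
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"

print(age_concept(25))   # -> youth: mining can now work at this coarser level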
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process
The interestingness measures and thresholds for pattern evaluation:
Used
- to guide the mining process, or
- to evaluate the discovered patterns
Different kinds of knowledge may have different interestingness measures
E.g., interestingness measures for association rules include support and confidence
Rules whose support and confidence values are below user-specified thresholds are considered uninteresting
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process
The interestingness measures and thresholds for pattern evaluation:
For example, to find out which products are frequently purchased together (within the same transaction)
An example of such a rule, mined from an electronics store's transactional database, is

buys(X, “computer”) -> buys(X, “software”) [support = 1%, confidence = 50%]

- X is a variable representing a customer
- A confidence of 50% means that if a customer buys a computer, there is a 50% chance that the customer will buy software as well
- A support of 1% means that 1% of all the transactions under analysis show that computer and software are purchased together
This association rule involves a single attribute or predicate (i.e., buys), hence it is referred to as a single-dimensional association rule
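A minimal Python sketch of how support and confidence could be computed over a transaction database; the toy transactions are invented, chosen so the rule's confidence comes out to 50% as in the example:

transactions = [
    {"computer", "software"},   # invented toy data
    {"computer"},
    {"printer", "software"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item in itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent | antecedent)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"computer", "software"}, transactions))       # 0.33... (rule support)
print(confidence({"computer"}, {"software"}, transactions))  # 0.5 -> 50% confidence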
Why Data Mining
What is Data Mining
Data Mining Task Primitives
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process
The interestingness measures and thresholds for pattern evaluation
The expected representation for visualizing the discovered patterns
The kind of knowledge to be mined:
Specifies the DM functions to be performed:
Characterization
Discrimination
Association or correlation analysis
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis
Characterization / Descriptions
Data can be associated with classes or concepts
E.g.
classes of items: computers, printers
concepts of customers: bigSpenders, budgetSpenders
Characterization / Descriptions
Descriptions can be derived via
Data characterization – summarizing the general characteristics of a target class
E.g. summarizing the characteristics of customers who spend more than Rs 10,000 a year
The result can be a general profile of such customers: 40-50 years old, employed, with excellent credit ratings
Data discrimination – comparing the target class with one or a set of comparative classes
E.g. compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same period
Descriptions can also be derived via both of the above


Mining Frequent Patterns, Associations & Correlations
Frequent itemset: a set of items that frequently appear together, e.g. milk and bread
Frequent subsequence: a pattern in which customers tend to purchase product A, followed by a purchase of product B
Mining Frequent Patterns, Associations & Correlations
Association analysis: find frequent patterns
E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
If a customer buys a computer, there is a 50% chance that he/she will buy software
1% of all of the transactions under analysis showed that computer and software are purchased together
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold
Mining Frequent Patterns, Associations & Correlations
Correlation Analysis: additional analysis to find statistical
correlations between associated pairs
Classification and Prediction
Classification:
The process of finding a model that describes & distinguishes data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown
The derived model is based on the analysis of a set of training data (data objects whose class label is known)
The model can be represented as classification (IF-THEN) rules, decision trees, neural networks, etc.
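A minimal sketch, assuming scikit-learn is available: a decision-tree classifier learned from labeled training data and used to predict the class of an object whose label is unknown. The features, labels, and values are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Training data (class labels known): features are [age, income]
X_train = [[25, 30_000], [45, 90_000], [35, 60_000], [50, 120_000]]
y_train = ["budgetSpender", "bigSpender", "budgetSpender", "bigSpender"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the class label of an object whose label is unknown
print(model.predict([[40, 100_000]]))   # e.g. ['bigSpender']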
Classification and Prediction
Prediction:
Predict missing or unavailable numerical data values or a class label for some data
Cluster Analysis
The class label is unknown (unsupervised): group data to form new classes
Clusters of objects are formed based on the principle of maximizing intra-class similarity and minimizing inter-class similarity
E.g. identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
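A minimal sketch, again assuming scikit-learn: k-means groups unlabeled data into clusters; no class labels are supplied. The data are invented:

from sklearn.cluster import KMeans

# Unlabeled customer data, e.g. [store visits, spend]; no class labels given
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignment per customer, e.g. [1 1 1 0 0 0]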
Evolution Analysis
Describes and models regularities or trends for objects whose behavior changes over time
E.g. identify stock evolution regularities for overall stocks and for the stocks of particular companies
Major Issues in Data Mining
Issues in data mining research are grouped into five groups:
Mining methodology
User interaction
Efficiency and scalability
Diversity of data types
Data mining and society
Major Issues in Data Mining
Mining methodology: various aspects of mining methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining
User interaction: various aspects of it are
Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results
Major Issues in Data Mining
Efficiency and scalability: various aspects of it are
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Major Issues in Data Mining
Diversity of database types: various aspects of it are
Handling complex types of data
Mining dynamic, networked, and global data repositories
Major Issues in Data Mining
Data mining and society: various aspects of it are
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Outlier Analysis
Data that do not comply with the general behavior or model of the data
Outliers are usually discarded as noise or exceptions
Useful for fraud detection
E.g. detect purchases of extremely large amounts
Knowledge Discovery from Data (KDD) Process
The KDD process is an iterative sequence of the following steps:
1. Data cleaning – remove noise and inconsistent data
2. Data integration – multiple data sources may be combined
3. Data selection – data relevant to the analysis task are retrieved from the database
4. Data transformation – data are transformed and consolidated into forms appropriate for mining, by performing summary or aggregation operations
5. Data mining – an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation – identify interesting patterns representing knowledge based on interestingness measures
7. Knowledge presentation – visualization and knowledge representation techniques are used to present mined knowledge to users
What Is an Attribute
An attribute is a data field, representing a characteristic or feature of a data object
The nouns attribute, dimension, feature, and variable are often used interchangeably
Dimension is commonly used in data warehousing
Machine learning literature tends to use the term feature
Statisticians prefer the term variable
Data mining and database professionals commonly use the term attribute
Attribute
Attributes describing a customer object can include, for example, customer ID, name, and address
Observed values for a given attribute are known as observations
A set of attributes used to describe a given object is called an attribute vector (or feature vector)
The distribution of data involving one attribute (or variable) is called univariate
A bivariate distribution involves two attributes, and so on
Types of Attribute
The type of an attribute is determined by the set of possible values the attribute can have
The types of attributes are:
Nominal
Binary
Ordinal
Numeric
Types of Attribute : Nominal
Nominal means “relating to names”
Values of a nominal attribute are symbols or names of things
Each value represents some kind of category, code, or state
Nominal attributes are also referred to as categorical
Values do not have any meaningful order
Values are also known as enumerations
Types of Attribute : Nominal
Example: hair color is an attribute describing person objects
Possible values for hair color are black, brown, blond, red, pink, gray, and white
Another example of a nominal attribute is occupation, with the values professor, dentist, programmer, farmer, and so on
It is possible to represent symbols or “names” with numbers. With hair color, for instance, we can assign a code of 0 for black, 1 for brown, and so on
Types of Attribute : Nominal
Nominal attribute values do not have any meaningful order about them and are not quantitative; hence it makes no sense to find the mean (average) value or median (middle) value for such an attribute, given a set of objects
One thing that is of interest, however, is the attribute's most commonly occurring value
This value, known as the mode, is one of the measures of central tendency
Types of Attribute : Binary
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where
0 typically means that the attribute is absent, and
1 means that it is present
Binary attributes are referred to as Boolean if the two states correspond to true and false
Types of Attribute : Binary
Example: suppose a patient undergoes a medical test that has two possible outcomes
The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight
This means there is no preference on which outcome should be coded as 0 or 1
An example could be the attribute gender having the states male and female
Types of Attribute : Binary
A binary attribute is asymmetric if the outcomes of the states are not equally important
E.g. the positive and negative outcomes of a medical test
By convention, we code the most important outcome, which is usually the rarest one, by 1 (positive) and the other by 0
Types of Attribute : Ordinal
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known
Ordinal attributes may be obtained from the discretization of numeric quantities by splitting the value range into a finite number of ordered categories
The central tendency of an ordinal attribute can be represented by its mode and its median, but the mean cannot be defined
Types of Attribute : Ordinal
Nominal, binary, and ordinal attributes are qualitative
That is, they describe a feature of an object without giving an actual size or quantity
The values of such qualitative attributes are typically words representing categories
Types of Attribute : Ordinal
Example: suppose that drink size corresponds to the size of drinks available at a fast-food restaurant
The attribute has 3 possible values: small, medium, and large
The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium
Other examples of ordinal attributes include grade (e.g., A+, A, A-, B+, and so on) and professional rank
Types of Attribute : Ordinal
Example: professional ranks can be enumerated in a sequential order, for example, assistant, associate, and full for professors
Ordinal attributes are useful for registering subjective assessments of qualities that cannot be measured objectively
Ordinal attributes are often used in surveys for ratings
E.g. customer satisfaction may have the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied
Types of Attribute : Numeric
A numeric attribute is quantitative
It is a measurable quantity, represented in integer or real values
Numeric attributes can be interval-scaled or ratio-scaled
Interval-scaled attributes are measured on a scale of equal-size units
The values of interval-scaled attributes have order and can be positive, 0, or negative
In addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values
Types of Attribute : Numeric - Interval-scaled
Example: a temperature attribute is interval-scaled
Consider the outdoor temperature values for a number of different days, where each day is an object
By ordering the values, we obtain a ranking of the objects with respect to temperature
We can also quantify the difference between values: a temperature of 20 degrees is five degrees higher than a temperature of 15 degrees
Calendar dates are another example. For instance, the years 2002 and 2022 are twenty years apart
Because such attributes are numeric, we can compute their mean, median, and mode measures of central tendency
Types of Attribute : Numeric - Ratio-scaled
A ratio-scaled attribute is a numeric attribute with an inherent zero-point
If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value
In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode
Types of Attribute : Numeric - Ratio-scaled
Examples: count attributes such as years of experience (e.g., the objects are employees) and number of words (e.g., the objects are documents), and monetary quantities (e.g., you are 100 times richer with Rs 1000 than with Rs 10)
Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your data
Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers
There are three areas of basic statistical descriptions
Measures of central tendency, which measure the location of the middle or center of a data distribution
We discuss the mean, median, mode, and midrange
Statistical Descriptions of Data
The dispersion of the data: that is, how are the data spread out? The most common data dispersion measures are the
Range
Quartiles
Interquartile range
Five-number summary and boxplots
Variance and standard deviation
These measures are useful for identifying outliers
Statistical Descriptions of Data
Graphic displays of basic statistical descriptions allow us to visually inspect our data
Most statistical or graphical data presentation software packages include bar charts, pie charts, and line graphs
Other popular displays of data summaries and distributions include quantile plots, quantile-quantile plots, histograms, and scatter plots
Measure of Central Tendency
Mean, Median, and Midrange
Measures of central tendency include the mean, median, mode, and midrange
The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean
For a set of N values x1, x2, ..., xN, the mean is
mean = (x1 + x2 + ... + xN) / N
Measure of Central Tendency
Mean, Median, and Midrange
Example (mean): suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Using the equation, mean = (30 + 36 + ... + 110) / 12 = 696 / 12 = 58
Thus, the mean salary is $58,000
Measure of Central Tendency
Mean, Median, and Midrange
Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data
A major problem with the mean is its sensitivity to extreme (e.g., outlier) values
Even a small number of extreme values can corrupt the mean
For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers
To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean
Measure of Central Tendency
Mean, Median, and Midrange
The trimmed mean is the mean obtained after chopping off values at the high and low extremes
For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean
Avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information
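A minimal Python sketch of a trimmed mean (our own helper, not from the slides), applied to the salary data used earlier:

def trimmed_mean(values, p):
    # Mean after removing the lowest and highest fraction p of values
    data = sorted(values)
    k = int(len(data) * p)              # number of values to drop at each end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(sum(salary) / len(salary))    # plain mean: 58.0
print(trimmed_mean(salary, 0.10))   # drops 30 and 110 -> 55.6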
Measure of Central Tendency
Mean, Median, and Midrange
For skewed (asymmetric) data, a better measure of the center of data is the median
The median is the middle value in a set of ordered data values
The median separates the higher half of a data set from the lower half
Suppose a data set of N values for an attribute X is sorted in increasing order
If N is odd, then the median is the middle value of the ordered set
Measure of Central Tendency
Mean, Median, and Midrange
If N is even, then there are two middlemost values, and the median is taken as their average
Example (median): suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The median can be any value within the two middlemost values of 52 and 56; by convention we take their average as the median, that is, (52 + 56) / 2 = 54
Thus, the median is $54,000
Measure of Central Tendency
Mean, Median, and Midrange
The mode is another measure of central tendency
The mode for a set of data is the value that occurs most frequently in the set
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal
In general, a data set with two or more modes is multimodal
If each data value occurs only once, then there is no mode
Measure of Central Tendency
Mean, Median, and Midrange
Example (mode): suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The two modes are $52,000 and $70,000
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:
mean - mode ≈ 3 × (mean - median)
Measure of Central Tendency
Mean, Median, and Midrange
The midrange can also be used to assess the central tendency of a numeric data set
It is the average of the largest and smallest values in the set
Example (midrange): suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Midrange = (30,000 + 110,000) / 2 = $70,000
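All four measures can be verified in a few lines of Python using the standard library (statistics.multimode requires Python 3.8+):

from statistics import mean, median, multimode

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salary))                      # 58   -> $58,000
print(median(salary))                    # 54.0 -> $54,000 (average of 52 and 56)
print(multimode(salary))                 # [52, 70] -> bimodal
print((min(salary) + max(salary)) / 2)   # 70.0 -> midrange $70,000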
Measure of Central Tendency
Mean, Median, and Midrange
In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value
Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median
Measure of Central Tendency
Mean, Median, and Midrange
• If mean = median = mode, the shape of the distribution is symmetric
• If mode < median < mean, the shape of the distribution trails to the right, i.e., is positively skewed
• If mean < median < mode, the shape of the distribution trails to the left, i.e., is negatively skewed
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
To assess the dispersion or spread of numeric data, the measures include the range, quantiles, quartiles, percentiles, and the interquartile range
The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers
Variance and standard deviation also indicate the spread of a data distribution
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
Let x1, x2, ..., xN be a set of observations for some numeric attribute, X
The range of the set is the difference between the largest (max()) and smallest (min()) values
Suppose the data for attribute X are sorted in increasing numeric order
We can pick certain data points so as to split the data distribution into equal-size consecutive sets; these data points are called quantiles
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
The 4-quantiles are the 3 data points that split the data distribution into 4 equal parts; each part represents one-fourth of the data distribution. They are referred to as quartiles
The 100-quantiles are referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets
The median, quartiles, and percentiles are the most widely used forms of quantiles
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
The quartiles give an indication of a distribution's center, spread, and shape
The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data
The third quartile, Q3, is the 75th percentile; it cuts off the lowest 75% (or highest 25%) of the data
The second quartile is the 50th percentile; as the median, it gives the center of the data distribution
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data
This distance is called the Interquartile Range (IQR) and is defined as
IQR = Q3 - Q1
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
Interquartile Range: to find the first, second, and third quartiles:
1. Arrange the N data values into an array
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
* Use N instead of N + 1 if N is even
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
Interquartile Range: suppose we have the following values for salary (in thousands of dollars), in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Since N = 12 is even, the quartiles are the 3rd, 6th, and 9th values: Q1 = $47,000 and Q3 = $63,000
Interquartile Range IQR = 63 - 47 = $16,000
By this positional rule the sixth value, $52,000, serves as the median (the averaging convention used earlier gives $54,000)
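A minimal Python sketch of this positional rule (our own helper), verified against the salary data:

def quartiles(values):
    # Q_k = value at 1-based position k(N + 1)/4; use N instead if N is even
    data = sorted(values)
    n = len(data)
    base = n if n % 2 == 0 else n + 1
    return tuple(data[k * base // 4 - 1] for k in (1, 2, 3))

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salary)
print(q1, q2, q3, "IQR =", q3 - q1)   # 47 52 63 IQR = 16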


Measuring Dispersion of Data
Five-Number Summary, Boxplots, and Outliers:
The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations
The five-number summary is written in the order
Minimum, Q1, Median (Q2), Q3, Maximum
Measuring Dispersion of Data
Five-Number Summary, Boxplots, and Outliers:
Boxplots are a popular way of visualizing a distribution
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range
The median is marked by a line within the box
Two lines (called whiskers) outside the box extend to the smallest (minimum) and largest (maximum) observations
Outliers: a common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile
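A minimal Python sketch of this rule of thumb (our own helper); the fences fall at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR:

def iqr_outliers(values, q1, q3):
    # Flag values more than 1.5 * IQR below Q1 or above Q3
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Branch 1 of the boxplot example below: Q1 = 60, Q3 = 100, fences 0 and 160
prices = [60, 80, 100, 175, 202]       # invented sample consistent with it
print(iqr_outliers(prices, 60, 100))   # -> [175, 202]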
Measuring Dispersion of Data
Five-Number Summary, Boxplots, and Outliers:
Boxplot example: consider boxplots for the unit price data for items sold at four branches of AllElectronics during a given period
For branch 1, the median price of items sold is $80, Q1 is $60, and Q3 is $100
Two outlying observations for this branch would be plotted individually, as their values of $175 and $202 lie more than 1.5 × IQR (here IQR = $40) above Q3
Measuring Dispersion of Data
Example: suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
Measuring Dispersion of Data
Solution:
N = 27
Sum = 809
Mean = 809 / 27 ≈ 29.96
Median (14th value) = 25
Mode = 25, 35 (bimodal)
Midrange = (13 + 70) / 2 = 41.5
Q1 (7th value) { (N + 1) / 4 } = 20
Q2 (14th value) { 2(N + 1) / 4 } = 25
Q3 (21st value) { 3(N + 1) / 4 } = 35
Five-number summary: 13, 20, 25, 35, 70
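For part (f), a minimal sketch assuming matplotlib is available; with the default 1.5 × IQR whiskers, the value 70 is drawn individually as a suspected outlier (upper fence = 35 + 1.5 × 15 = 57.5):

import matplotlib.pyplot as plt

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
       35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

plt.boxplot(age)     # box from Q1 to Q3, median line, whiskers, outliers
plt.ylabel("age")
plt.show()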
Measuring Dispersion of Data
Binning: binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it
The sorted values are distributed into a number of “buckets,” or bins
Because binning methods consult the neighborhood of values, they perform local smoothing
For example, the sorted data for price (in dollars) 4, 8, 15, 21, 21, 24, 25, 28, 34 are first partitioned into equal-frequency bins of size 3:
Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
Measuring Dispersion of Data
Binning: in smoothing by bin means, each value in a bin is replaced by the mean value of the bin
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9
Therefore, each original value in this bin is replaced by the value 9
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries
Measuring Dispersion of Data
Binning:
Each bin value is then replaced by the closest boundary value
In general, the larger the width, the greater the effect of the smoothing
Alternatively, bins may be equal width, where the interval range of values in each bin is constant
Binning is also used as a discretization technique and is discussed further under data preprocessing
Binning: MU Question
Suppose a group of sales price records has been sorted as follows:
6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232
Partition them into 3 bins by the equal-frequency partitioning method. Perform data smoothing by bin means.
Solution: partition into 3 bins, each of size 4
Bin 1: 6, 9, 12, 13      Bin 1 mean = (6 + 9 + 12 + 13) / 4 = 10
Bin 2: 15, 25, 50, 70    Bin 2 mean = 40
Bin 3: 72, 92, 204, 232  Bin 3 mean = 150
Smoothing by bin means
Bin 1: 10, 10, 10, 10
Bin 2: 40, 40, 40, 40
Bin 3: 150, 150, 150, 150
Smoothing by bin boundaries
Bin 1: 6, 6, 13, 13
Bin 2: 15, 15, 70, 70
Bin 3: 72, 72, 232, 232
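A minimal Python sketch reproducing both smoothings (our own code, not from the slides):

prices = [6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232]

size = len(prices) // 3                                # 3 bins of 4 values
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: replace each value by its bin's mean
print([[sum(b) / len(b)] * len(b) for b in bins])
# -> [[10.0, 10.0, 10.0, 10.0], [40.0, ...], [150.0, ...]]

# Smoothing by bin boundaries: replace each value by the closer of min/max
print([[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins])
# -> [[6, 6, 13, 13], [15, 15, 70, 70], [72, 72, 232, 232]]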
Data Preprocessing
Why preprocess the data?
Data have quality if they satisfy the requirements of the intended use
There are many factors comprising data quality, including:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses
Data Preprocessing
Major tasks in data preprocessing:
Data Cleaning
Data Integration
Data Reduction
Data Transformation
(Figure from Han and Kamber, Data Mining: Concepts and Techniques)
Data Preprocessing
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data
Data cleaning is usually performed as an iterative two-step process consisting of
Discrepancy detection and
Data transformation
Data Preprocessing
Data integration combines data from multiple sources to form a coherent data store
The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration
Data Preprocessing
Data reduction techniques obtain a reduced representation of the data while minimizing the loss of information content
These include methods of
Dimensionality reduction
Numerosity reduction
Data compression
Data Preprocessing
Major tasks in data preprocessing: data reduction
Dimensionality reduction reduces the number of random variables or attributes under consideration
Numerosity reduction methods use parametric or nonparametric models to obtain smaller representations of the original data
Parametric models store only the model parameters instead of the actual data
Nonparametric methods include histograms, clustering, sampling, and data cube aggregation
Data compression methods apply transformations to obtain a reduced or “compressed” representation of the original data
Data Preprocessing
Data transformation routines convert the data into appropriate forms for mining
For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0
Other examples are
Data discretization
Concept hierarchy generation
Data discretization transforms numeric data by mapping values to interval or concept labels
Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis
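As an illustration of normalization, a minimal Python sketch of min-max scaling to [0.0, 1.0]; the income values are invented:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value from [min, max] to [new_min, new_max]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

income = [12_000, 73_600, 98_000]   # invented values
print(min_max_normalize(income))    # [0.0, 0.716..., 1.0]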
