
Data Mining and Analytics

Introduction
Data Mining

• Data mining refers to extracting or “mining” knowledge from large amounts of data
• It is also termed Knowledge Discovery from Data (KDD)
• Mostly, data mining is viewed as an essential step in the process of knowledge discovery
• Knowledge discovery, shown in Figure 1, consists of the following steps:

o Data cleaning
o Data Integration
o Data Selection
o Data Transformation
o Data Mining
o Pattern Evaluation
o Knowledge Presentation
Figure 1: Data mining as a step in the process of knowledge discovery
• The first four steps are different forms of data preprocessing, where the data are prepared for mining
• The data mining step interacts with the user or a knowledge base
• Interesting patterns are presented to the user and can be stored as new knowledge in a separate knowledge base
Kinds of data

• There are a number of data repositories from which data can be taken
• These data repositories include relational databases, data warehouses, transactional databases, spatial databases, time-series databases, multimedia databases, legacy databases, and the WWW
Relational database

• It is a collection of tables, each of which is assigned a unique name
• Each table has a set of attributes (columns or fields) and stores a large set of tuples (records or rows)
Data warehouse
• It is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site
• Data warehouses are constructed through a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
• Data warehouse systems are well suited for on-line analytical processing, or OLAP
• OLAP operations such as drill-down and roll-up allow the user to view the data at different levels of summarization
Spatial databases

• In addition to the usual data, a spatial database stores geographic information such as maps and global or regional positioning data
• Such spatial databases present new challenges to data mining algorithms
Multimedia databases

• Include video, image, audio, and text media; they can be stored in object-oriented databases or simply in a file system
• Multimedia data are characterized by high dimensionality, which makes data mining even more challenging
• Data mining from multimedia repositories may require computer vision, image interpretation, computer graphics, and natural language processing methodologies
World Wide Web (WWW)
• The most heterogeneous and dynamic repository available
• A large number of authors and publishers are continuously contributing to its growth, and a large number of users access it daily
• Data on the WWW are organized as interconnected documents
• These documents may contain video, text, audio, or even raw data
• The WWW comprises three major components: the content of the web, the structure of the web, and the usage of the web
• The content of the web comprises the documents available
• The structure of the web comprises the relationships between documents
• The usage of the web describes how and when resources are accessed
Data Mining Functionalities – What
kind of patterns can be mined?

• Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks
• Two categories: Descriptive and Predictive
• “Descriptive” mining tasks characterize the
general properties of the data in the database
• “Predictive” mining tasks perform inference
on the current data in order to make
predictions
Concept/Class Description: Characterization and Discrimination
• Data characterization: summarizes the data of the class under study (also called the target class)
• Data discrimination: compares the target class with one or a set of comparative classes (also called the contrasting classes)
• Examples for each in textbook
Mining Frequent Patterns, Associations, and
Correlations
Frequent patterns:
Patterns that occur frequently in data.

The kinds of frequent patterns:
• Frequent itemsets
• Frequent sequential patterns
• Frequent structured patterns

Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
Association Analysis

• Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions
• An example association rule mined from the AllElectronics transactional database is:

  buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

• where X is a variable representing a customer
• A confidence, or certainty, of 50% means that if
a customer buys a computer, there is a 50% chance that
she will buy software as well
• A 1% support means that 1% of all of the
transactions under analysis showed that computer and
software were purchased together
• This association rule involves a single attribute
or predicate (i.e., buys) that repeats
• Association rules that contain a single predicate
are referred to as single-dimensional association rules
• Association rules that contain more than one predicate
are referred to as multi-dimensional association rules
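The support and confidence of such a rule can be computed directly by counting transactions. A minimal sketch on hypothetical transaction data:

```python
transactions = [                      # hypothetical AllElectronics-style transactions
    {"computer", "software", "printer"},
    {"computer", "camera"},
    {"software"},
    {"computer", "software"},
    {"printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)    # fraction of all transactions containing both items
confidence = both / computer          # fraction of computer buyers who also bought software
print(support, confidence)
```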
Classification and Prediction

• Classification: construct models (functions) that describe and distinguish classes or concepts, in order to predict the class of objects whose class label is unknown

Example:
• In weather problem the play or don’t play
judgment
• In contact lenses problem the lens
recommendation
• Derived model can be represented in various forms:
 Decision tree
 Neural network
 If Then rules

• Instead of predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics, based on previous sales data. This is an example of prediction
• Numeric prediction is a variant of classification learning
in which the outcome is a numeric value rather than a
category
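A minimal classification sketch, assuming a tiny hand-made version of the weather data and using a decision tree, one of the model forms listed above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# hand-encoded attributes: outlook (0=sunny, 1=overcast, 2=rainy), windy (0/1)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "no", "yes", "yes", "yes", "no"]          # class label: play?

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(model, feature_names=["outlook", "windy"]))   # the derived model as rules
print(model.predict([[2, 0]]))                                  # class for an unseen day
```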
Cluster Analysis

Clustering:
• Grouping similar instances into clusters

• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
• Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters
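A minimal clustering sketch on illustrative 2-D points, using k-means as one possible grouping method:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],     # one dense group
                   [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])    # another dense group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assignment of each object
print(km.cluster_centers_)   # representative centroid of each cluster
```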
Outlier Analysis

• In some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones
• The analysis of outlier data is referred to as
outlier mining
Evolution Analysis

• Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time
• Examples include stock market analysis and inventory control
Are all the patterns interesting?
Data mining may generate thousands of patterns.
Not all of them are interesting.
• What makes a pattern interesting?
Ans: A pattern is interesting if it is:
1) easily understood by humans
2) valid on new or test data with some degree of certainty
3) potentially useful
4) novel
5) one that validates some hypothesis that a user seeks to confirm
• Can a data mining system generate all of the interesting patterns?
Ans: this refers to the completeness of a data mining algorithm; generating the complete set of patterns is often unrealistic and inefficient
• Do we need to find all of the interesting patterns?
Ans: association rule mining is an example, where user-specified constraints and minimum thresholds (such as support) focus the search
• Can a data mining system generate only interesting patterns?
Ans: this is an optimization problem; ideally, the system would generate only the patterns that are interesting to the user
Data Mining Task Primitives

The data mining task primitives specify the following:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measures
• Knowledge presentation and visualization
techniques to be used for displaying the
discovered patterns
Issues in data mining

1) Mining methodology and user interaction issues:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation
2) Performance issues:

• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining algorithms
3) Issues related to diversity of database types:

• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information systems
Descriptive Data Summarization
• To learn data characteristics better
Central Tendency and Dispersion of Data

• Measures of central tendency include mean, median, mode, and midrange
• Measures of data dispersion include quartiles, interquartile range (IQR), and variance
• Mean (algebraic measure) (sample vs. population):
  – Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  – Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  – Trimmed mean: the mean obtained after chopping off extreme values
• Median: a holistic measure
  – Middle value if there is an odd number of values; otherwise the average of the middle two values
  – Estimated by interpolation (for grouped data): $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$
• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula: $mean - mode \approx 3 \times (mean - median)$
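A small numeric sketch of these central-tendency measures on illustrative values (NumPy and SciPy assumed available):

```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 4, 5, 7, 9, 30])        # illustrative values; 30 is an extreme value
w = np.array([1, 1, 2, 2, 1, 1, 1])          # illustrative weights

print(np.mean(x))                            # arithmetic mean
print(np.average(x, weights=w))              # weighted arithmetic mean
print(stats.trim_mean(x, 0.2))               # trimmed mean: chop 20% at each end
print(np.median(x))                          # median
print(stats.mode(x, keepdims=False).mode)    # mode (most frequent value)
print((x.min() + x.max()) / 2)               # midrange
```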
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and
negatively skewed data
Dispersion of Data Measurement

• Quartiles, outliers and boxplots
  – Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  – Inter-quartile range: IQR = Q3 – Q1
  – Five-number summary: min, Q1, median, Q3, max
  – Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend outward, and outliers are plotted individually
  – Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  – Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  – Standard deviation s (or σ) is the square root of the variance s² (or σ²)
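A quick numeric check of the variance and standard deviation formulas on illustrative values:

```python
import numpy as np

x = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])     # illustrative sample

print(np.var(x, ddof=1), np.std(x, ddof=1))          # sample variance s^2 and std dev s
print(np.var(x, ddof=0), np.std(x, ddof=0))          # population variance sigma^2 and sigma
# the "scalable" one-pass form of the population variance:
print((x ** 2).sum() / len(x) - x.mean() ** 2)       # equals np.var(x, ddof=0)
```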


Properties of Normal Distribution Curve

• The normal (distribution) curve
  – From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  – From μ–2σ to μ+2σ: contains about 95% of the measurements
  – From μ–3σ to μ+3σ: contains about 99.7% of the measurements
Boxplot Analysis

• The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order:
  Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the
height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to Minimum and
Maximum
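A minimal sketch of the five-number summary and the 1.5 × IQR outlier rule on illustrative values:

```python
import numpy as np

x = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21, 95])   # illustrative data; 95 looks suspicious

q1, med, q3 = np.percentile(x, [25, 50, 75])
print(x.min(), q1, med, q3, x.max())                  # Minimum, Q1, Median, Q3, Maximum
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)                                       # values flagged by the 1.5 * IQR rule
# import matplotlib.pyplot as plt; plt.boxplot(x); plt.show()  # draws the boxplot itself
```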
Visualization of Data Dispersion: Boxplot Analysis
Histogram Analysis

• A histogram for an attribute A partitions the distribution of A into disjoint subsets, or buckets
• Typically, the width of each bucket is uniform
• Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket
• If A is numeric, the term histogram is preferred
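A small equal-width bucketing sketch on illustrative age values:

```python
import numpy as np

ages = np.array([23, 25, 27, 31, 34, 36, 41, 44, 52, 58, 61, 67])   # illustrative values
counts, edges = np.histogram(ages, bins=5)        # 5 equal-width buckets
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {c}")           # each bucket's range and count
```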
Quantile Plot

• A quantile plot is a simple and effective way to have a first look at a univariate data distribution
• First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences)
• Second, it plots quantile information
Quantile-Quantile (Q-Q) Plot

• A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another
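A minimal q-q plot sketch comparing the quantiles of two illustrative samples:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
branch1 = rng.normal(100, 15, 200)     # illustrative unit prices at branch 1
branch2 = rng.normal(110, 15, 200)     # illustrative unit prices at branch 2

q = np.linspace(0.01, 0.99, 50)
plt.scatter(np.quantile(branch1, q), np.quantile(branch2, q))   # paired quantiles
lims = [60, 160]
plt.plot(lims, lims, "--")             # y = x reference; points off this line indicate a shift
plt.xlabel("branch 1 quantiles"); plt.ylabel("branch 2 quantiles")
plt.show()
```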
Scatter plot

• A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numerical attributes
• To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane
Loess Curve

• A loess curve is another important exploratory graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• The word loess is short for “local regression”
Positively and Negatively Correlated Data
Not Correlated Data
Data Preprocessing
• Data are preprocessed in order to deal with noisy, incomplete, and inconsistent data
• Techniques for preprocessing are as follows:
 Data cleaning, in order to remove noise, fill in missing values, and correct inconsistencies in the data

 Missing values can be handled by the following methods (a small fill sketch follows the list):
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use the attribute mean to fill in the missing value
5) Use the attribute mean for all samples belonging to the same class as the given tuple
6) Use the most probable value to fill in the missing value
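A minimal sketch of fills 3), 4) and 5) from the list above, using pandas on a hypothetical income column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [45.0, np.nan, 52.0, np.nan, 61.0],
                   "class":  ["A", "A", "B", "B", "B"]})

df["global_fill"] = df["income"].fillna(0)                          # 3) a global constant
df["mean_fill"]   = df["income"].fillna(df["income"].mean())        # 4) the attribute mean
df["class_fill"]  = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))                # 5) mean within the same class
print(df)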
 Noisy data can be smoothed by the following methods (a binning sketch follows the list):
1) Binning
• First sort the data and partition it into (equal-frequency) bins
• Then smooth by bin means, bin medians, or bin boundaries, etc.
2) Regression
• Linear regression involves finding the “best” line to fit
two attributes, so that one attribute can be used to
predict the other
• Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved
3) Clustering
• Outliers may be detected by clustering, where similar
values are organized into groups, or “clusters”
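A short sketch of smoothing by bin means with equal-frequency bins (method 1 above), on illustrative sorted prices:

```python
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))     # data sorted first
bins = np.array_split(prices, 3)                                    # 3 equal-frequency bins
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # each value replaced by the mean of its bin
```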
Data Integration
• Data integration is merging the data from different stores
• Some issues to be considered while data integration are as
following:
1) Entity identification problem
• Identify real-world entities from multiple data sources

2) Redundancy
• Redundant data often occur when multiple databases are integrated
• Redundant attributes may be detected by correlation analysis (a sketch follows this list)

3) Detection and resolution of data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• This may be due to differences in representation, scaling, or encoding
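A minimal correlation-analysis sketch for redundancy detection, on two hypothetical attributes that carry nearly the same information:

```python
import numpy as np

annual_income  = np.array([40.0, 55.0, 63.0, 72.0, 90.0])
monthly_income = np.array([3.3, 4.6, 5.2, 6.0, 7.5])    # nearly the same information

r = np.corrcoef(annual_income, monthly_income)[0, 1]     # Pearson correlation coefficient
print(r)   # a value close to +1 or -1 suggests one attribute is redundant given the other
```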
Data Transformation
• The data are transformed or consolidated into forms
appropriate for mining. It involves the following:
1) Smoothing: Remove noise from data
2) Aggregation: Summarization, data cube construction
3) Generalization of the data: Concept hierarchy
4) Normalization: attribute values are scaled to fall within a small, specified range (a sketch follows this list)
• Min-max normalization: $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
• Z-score normalization: $v' = \frac{v - \mu_A}{\sigma_A}$
• Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that $Max(|v'|) < 1$
5) Attribute construction: New attributes constructed from
the given ones
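A minimal sketch of the three normalization methods above, on illustrative values:

```python
import numpy as np

v = np.array([12_000.0, 35_000.0, 58_000.0, 73_600.0, 98_000.0])    # illustrative values

# min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```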

Data Reduction:
Data reduction techniques can be applied to obtain a reduced
representation of the dataset that is much smaller in volume,
yet closely maintains the integrity of the original data
• Various strategies used for data reduction are as follows:
1) Data cube aggregation
2) Attribute subset selection: it includes the following
techniques:
• Stepwise forward selection
• Stepwise backward elimination
• Combination of forward selection and backward
elimination
• Decision tree induction
3) Dimensionality reduction: two effective methods are as follows:
• Wavelet Transforms
  – The length, L, of the data must be an integer power of 2 (padding with 0s when necessary)
  – Each transform has 2 functions: smoothing and difference
  – Applied to pairs of data, resulting in two sets of data of length L/2
  – The two functions are applied recursively, until the desired length is reached
• Principal Component Analysis
- Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent data
- Works for numeric data only
- Used when the number of dimensions is large
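A minimal PCA sketch on randomly generated data, using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                           # 100 data vectors, n = 5 dimensions
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)      # make one dimension nearly redundant

pca = PCA(n_components=2)                               # keep k = 2 orthogonal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                                  # (100, 2)
print(pca.explained_variance_ratio_)                    # variance preserved by each component
```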
4) Numerosity reduction: some of the techniques in numerosity reduction are
as follows:
• Regression and Log-Linear Models
  In linear regression, the data are modeled to fit a straight line: y = wx + b
  Multiple linear regression allows a response variable, y, to be modeled as a linear function of two or more predictor variables
  Log-linear models approximate discrete multidimensional probability distributions

• Histograms
  Divide the data into buckets and store the average (or sum) for each bucket
  Partitioning rules:
  – Equal-width: equal bucket range
  – Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
  – V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that each bucket represents)
  – MaxDiff: set bucket boundaries between each pair of adjacent values whose differences are among the β–1 largest
• Clustering
- Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
- Can be very effective if data is clustered but not if data is “smeared”

• Sampling
Obtaining a small sample s to represent the whole data set N
- Simple random sample without replacement (SRSWOR) of size s
- Simple random sample with replacement (SRSWR) of size s
- Cluster sample
- Stratified sample
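A minimal sampling sketch (SRSWOR and SRSWR) over illustrative tuple identifiers:

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.arange(1000)     # stands in for the whole data set of tuples
s = 10                  # desired sample size

srswor = rng.choice(N, size=s, replace=False)   # simple random sample WITHOUT replacement
srswr  = rng.choice(N, size=s, replace=True)    # simple random sample WITH replacement
print(srswor)
print(srswr)
```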
Discretization and concept hierarchy generation

– Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
– Interval labels can then be used to replace actual data values
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
Binning (covered above)
• Top-down split, unsupervised,
Histogram analysis (covered above)
• Top-down split, unsupervised
Clustering analysis (covered above)
• Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
• The information of a split of set S into S1 and S2 at boundary T is:
  $I(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$
• Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is:
  $Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
  where $p_i$ is the probability of class i in S1
• The boundary T that minimizes I(S, T) is selected, and the process may be applied recursively
Interval merging by χ² analysis: supervised, bottom-up merge
• Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
• ChiMerge (uses class information, hence supervised)
  – Initially, each distinct value of a numerical attribute A is considered to be one interval
  – χ² tests are performed for every pair of adjacent intervals
  – Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  – This merge process proceeds recursively until a predefined stopping criterion is met
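A minimal sketch of the χ² statistic ChiMerge evaluates for one pair of adjacent intervals, with illustrative class counts and SciPy assumed available:

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows = two adjacent intervals, columns = class counts observed in each interval
interval_a = [10, 2]    # e.g. 10 samples of class "yes", 2 of class "no"
interval_b = [9, 3]

chi2, p, dof, expected = chi2_contingency(np.array([interval_a, interval_b]), correction=False)
print(chi2)   # a low value means similar class distributions, so the pair is a merge candidate
```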
Discretization by Intuitive Partitioning

• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
  – If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  – If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  – If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Suppose the profits at different AllElectronics branches range as follows (the 5th and 95th percentiles are used as Low and High so that extreme values do not dominate the partitioning):
• Step 1: Min = –$351, Low (5th percentile) = –$159, High (95th percentile) = $1,838, Max = $4,700
• Step 2: msd = $1,000, so Low is rounded down to –$1,000 and High is rounded up to $2,000, giving the interval (–$1,000 … $2,000)
• Step 3: this interval covers 3 distinct values at the most significant digit, so it is partitioned into 3 equi-width intervals: (–$1,000 … $0], ($0 … $1,000], ($1,000 … $2,000]
• Step 4: the boundaries are adjusted to cover Min and Max: the first interval shrinks to (–$400 … $0] and a new interval ($2,000 … $5,000] is added, giving (–$400 … $0], ($0 … $1,000], ($1,000 … $2,000], ($2,000 … $5,000]
• Step 5: each interval is partitioned recursively by the same rule, e.g. (–$400 … $0] into 4 sub-intervals of width $100, ($0 … $1,000] and ($1,000 … $2,000] into 5 sub-intervals of width $200 each, and ($2,000 … $5,000] into 3 sub-intervals of width $1,000
Concept Hierarchy Generation for Categorical Data

• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
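A small sketch of the automatic-generation idea: attributes with more distinct values are placed at lower levels of the hierarchy (hypothetical location data):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA", "Canada"],
    "state":   ["IL", "IL", "IL", "CA", "ON"],
    "city":    ["Chicago", "Chicago", "Urbana", "LA", "Toronto"],
    "street":  ["Main St", "Oak St", "Green St", "Sunset Blvd", "King St"],
})

# fewer distinct values -> higher level of the hierarchy
order = df.nunique().sort_values().index.tolist()
print(" < ".join(reversed(order)))   # street < city < state < country
```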
Summary of Unit - I

• Data mining as a step in knowledge discovery
• Introduction to Data Mining – Kinds of Data
• Data mining Functionalities – Interesting
Patterns
• Data mining task primitives
• Issues in data mining
• Data preprocessing Techniques
