There are many different methods used to perform data mining tasks. These
techniques not only require specific types of data structures, but also imply
certain types of algorithmic approaches.
Parametric models describe the relationship between input and output through
the use of algebraic equations where some parameters are not specified.
1. Point Estimation
3. Bayes Theorem
Point Estimation
The bias of an estimator is the difference between the expected value of the
estimator and the actual value:

Bias = E[θ̂] − θ

where θ̂ is the estimator and θ is the value being estimated. An unbiased
estimator is one whose bias is 0. While point estimators for small data sets may
actually be unbiased, for larger database applications we would expect that most
estimators are biased.
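As an illustration (the data and the choice of estimator here are our own, not from the notes), the maximum-likelihood variance estimator, which divides by n, is biased low, while the version dividing by n − 1 is unbiased:

```python
import random

def biased_var(sample):
    # maximum-likelihood variance estimate: divides by n (biased low)
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def unbiased_var(sample):
    # Bessel-corrected estimate: divides by n - 1 (unbiased)
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

# Estimate the bias empirically: average each estimator over many
# small samples drawn from a population with known variance 1.0.
random.seed(0)
trials = [[random.gauss(0, 1) for _ in range(5)] for _ in range(20000)]
avg_biased = sum(biased_var(t) for t in trials) / len(trials)
avg_unbiased = sum(unbiased_var(t) for t in trials) / len(trials)
# avg_biased comes out near 0.8 (= (n-1)/n for n = 5), avg_unbiased near 1.0
```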
DATA MINING UNIT-2-5 NOTES
The root mean square error (RMSE) may also be used to estimate error or as
another measure of an estimator's accuracy; it is the square root of the mean of
the squared differences between the estimated and actual values.
There are many basic concepts that provide an abstraction and summarization of
the data as a whole. The basic well-known statistical concepts such as mean,
variance, standard deviation, median, and mode are simple models of the
underlying population.
There are also many well-known techniques to display the structure of the data
graphically.
The box in the center of the figure shows the range between the first and third
quartiles, and the line in the box shows the median (the second quartile). The
lines extending from either end of the box reach the values that are a distance of
1.5 times the interquartile range from the first and third quartiles, respectively.
Outliers are shown as points beyond these values.
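The quartile and fence computation described above can be sketched in a few lines (the sample data is hypothetical):

```python
def quartiles(values):
    # simple quartile computation by splitting the sorted data at the median
    s = sorted(values)
    n = len(s)
    def median(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2
    q2 = median(s)
    q1 = median(s[: n // 2])
    q3 = median(s[(n + 1) // 2 :])
    return q1, q2, q3

def box_plot_outliers(values):
    # points beyond 1.5 * IQR from the first and third quartiles
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [2, 3, 3, 4, 5, 5, 6, 7, 30]
# here Q1 = 3, Q3 = 6.5, so the fences are -2.25 and 11.75, flagging 30
```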
Bayes Theorem
Bayes theorem states:

P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

Here P(h1 | xi) is called the posterior probability, while P(h1) is the prior
probability associated with hypothesis h1. P(xi) is the probability of the
occurrence of data value xi, and P(xi | h1) is the conditional probability that,
given a hypothesis, the tuple satisfies it.
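A small worked example of Bayes theorem (the probability values are hypothetical):

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    # Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x)
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: P(h1) = 0.3, P(x | h1) = 0.5, and the evidence
# P(x) computed by total probability over two hypotheses,
# with P(h2) = 0.7 and P(x | h2) = 0.2
p_x = 0.5 * 0.3 + 0.2 * 0.7          # = 0.29
p_h1_given_x = posterior(0.3, 0.5, p_x)   # = 0.15 / 0.29, about 0.517
```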
Hypothesis Testing
Hypothesis testing attempts to find a model that explains the observed data by
first creating a hypothesis and then testing the hypothesis against the data.
One technique to perform hypothesis testing is based on the use of the
chi-squared statistic.
Assuming that O represents the observed data and E is the expected values based
on the hypothesis, the chi-squared statistic, χ², is defined as:

χ² = Σ (O − E)² / E
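A minimal sketch of the chi-squared statistic, applied to hypothetical die-roll counts against a uniform hypothesis:

```python
def chi_squared(observed, expected):
    # chi-squared statistic: sum of (O - E)^2 / E over all cells
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 60 die rolls; a fair die expects 10 of each face
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
stat = chi_squared(observed, expected)   # (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0
```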
Linear regression models the output as a linear function of the inputs:
y = c0 + c1 x1 + … + cn xn. Here there are n input variables, which are called
predictors or regressors; one output variable (the variable being predicted),
which is called the response; and n + 1 constants, c0, …, cn, which are chosen
during the modeling process to match the input examples (or sample). This is
sometimes called multiple linear regression because there is more than one
predictor.
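A minimal sketch of multiple linear regression via the normal equations; the data here is hypothetical and generated without noise, so the recovered constants match the generating ones:

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = c0 + c1*x1 + ... + cn*xn.

    xs: list of input rows (each a list of n predictor values)
    ys: list of observed responses
    Returns the n + 1 constants [c0, c1, ..., cn].
    """
    # design matrix with a leading 1 for the intercept term
    X = [[1.0] + list(row) for row in xs]
    k = len(X[0])
    # normal equations: (X^T X) c = X^T y
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * ys[r] for r in range(len(X))) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for col in range(i, k):
                A[r][col] -= f * A[i][col]
            b[r] -= f * b[i]
    # back substitution
    c = [0.0] * k
    for i in reversed(range(k)):
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, k))) / A[i][i]
    return c

# Data generated (hypothetically) from y = 1 + 2*x1 + 3*x2, with no noise
xs = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
ys = [1 + 2 * a + 3 * b for a, b in xs]
coeffs = fit_linear(xs, ys)   # close to [1.0, 2.0, 3.0]
```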
SIMILARITY MEASURES
The idea of similarity measures can be abstracted and applied to more general
classification problems. The difficulty lies in how the similarity measures are
defined and applied to the items in the database.
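Two commonly used similarity measures, sketched in plain Python (the choice of Euclidean distance and cosine similarity as examples is ours):

```python
import math

def euclidean_distance(a, b):
    # a distance, not a similarity: smaller values mean more similar items
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # 1.0 when the vectors point the same way, 0.0 when orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```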
DEFINITION 3.3. A decision tree (DT) is a tree where the root and each
internal node is labeled with a question. The arcs emanating from each node
represent each possible answer to the associated question. Each leaf node
represents a prediction of a solution to the problem under consideration.
A female who is 1.95 m in height is considered tall, while a male of the same
height may not be considered tall. Also, a child 10 years of age may be
considered tall at a height of only 1.5 m.
NEURAL NETWORKS
GENETIC ALGORITHMS
Genetic algorithms are examples of evolutionary computing methods and
are optimization-type algorithms.
The basis for evolutionary computing algorithms is biological evolution,
where over time evolution produces the best or "fittest" individuals.
Chromosomes, which are DNA strings, provide the abstract model for a living
organism. Subsections of the chromosomes, which are called genes, are used to
define different traits of the individual.
During reproduction, genes from the parents are combined to produce the
genes for the child.
Crossover is achieved by interchanging the last three bits of the two strings.
In part (b) the center three bits are interchanged. Figure 3.9 shows single and
multiple crossover points. There are many variations of the crossover approach,
including determining crossover points randomly.
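The single- and multiple-point crossover operations described above can be sketched as string manipulations (the bit strings and crossover points are illustrative):

```python
def single_point_crossover(parent1, parent2, point):
    # swap everything after the crossover point between the two strings
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def two_point_crossover(parent1, parent2, p1, p2):
    # swap only the segment between the two crossover points
    child1 = parent1[:p1] + parent2[p1:p2] + parent1[p2:]
    child2 = parent2[:p1] + parent1[p1:p2] + parent2[p2:]
    return child1, child2

# interchanging the last three bits of two 6-bit strings
a, b = single_point_crossover("000000", "111111", 3)   # "000111", "111000"
```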
Classification
Classification is perhaps the most familiar and most popular data mining
technique.
Issues in Classification
Missing Data. Missing data values cause problems during both the training
phase and the classification process itself. Missing values in the training data must
be handled and may produce an inaccurate result.
Bayesian Classification
Assuming that the contributions of all attributes are independent and that
each contributes equally to the classification problem, a simple classification
scheme called naive Bayes classification has been proposed, based on
Bayes rule of conditional probability.
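A minimal naive Bayes classifier over categorical attributes, with hypothetical training tuples (no smoothing is applied, so a zero count zeroes the whole product):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    # estimate P(class) and P(attribute value | class) from simple counts
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attr index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def classify(row, class_counts, value_counts):
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / total                          # prior P(c)
        for i, v in enumerate(row):
            score *= value_counts[(i, c)][v] / n   # P(v | c), independence assumed
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data: (outlook, windy) tuples with a class label
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "play", "stay"]
cc, vc = train_naive_bayes(rows, labels)
```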
K Nearest Neighbors
One common classification scheme based on the use of distance measures
is that of the K nearest neighbors (KNN). The KNN technique assumes that the
entire training set includes not only the data in the set but also the desired
classification for each item. When a classification is to be made for a new
item, its distance to each item in the training set must be determined. Only
the K closest entries in the training set are considered further. The new item is
then placed in the class that contains the most items from this set of K closest
items. For example, with K = 3, the three closest items in the training set to a
new item t are found, and t is placed in the class to which most of these three
belong.
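A sketch of KNN under Euclidean distance, with hypothetical training points:

```python
import math
from collections import Counter

def knn_classify(training, new_item, k=3):
    # training: list of (point, class) pairs; classify by majority vote
    # among the k nearest points under Euclidean distance
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(training, key=lambda pc: dist(pc[0], new_item))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two well-separated classes
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
```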
There have been many decision tree algorithms. We illustrate the tree-building
phase in the simplistic DTBuild Algorithm 4.3. Attributes in the database
schema that will be used to label nodes in the tree and around which the divisions
will take place are called the splitting attributes. The predicates by which the arcs
in the tree are labeled are called the splitting predicates.
The following issues are faced by most DT algorithms:
• Choosing splitting attributes: Which attributes are used as splitting
attributes impacts the performance of applying the built DT.
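One common way to choose a splitting attribute is information gain, the reduction in entropy produced by splitting on that attribute; a sketch with hypothetical tuples:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class distribution
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # reduction in entropy achieved by splitting the tuples on one attribute
    base = entropy(labels)
    n = len(rows)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    return base - sum(len(part) / n * entropy(part) for part in split.values())

# Hypothetical tuples (outlook, windy); outlook predicts the class perfectly,
# windy tells us nothing, so outlook should be chosen as the splitting attribute
rows = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["yes", "yes", "no", "no"]
```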
CART
Classification and regression trees (CART) is a technique that generates
a binary decision tree. As with ID3, entropy is used as a measure to choose the
best splitting attribute and criterion.
The splitting is performed around what is determined to be the best
split point. At each step, an exhaustive search is used to determine the best
split, where "best" is defined by

Φ(s/t) = 2 P_L P_R Σ_j | P(C_j | t_L) − P(C_j | t_R) |

This formula is evaluated at the current node, t, and for each possible splitting
attribute and criterion, s. Here L and R are used to indicate the left and right
subtrees of the current node in the tree; P_L and P_R are the probabilities that
a tuple in the training set will be on the left or right side of the tree, and
P(C_j | t_L) and P(C_j | t_R) are the probabilities that a tuple in the left or
right subtree belongs to class C_j.
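One form of the CART split measure scores a candidate binary split as 2 · P_L · P_R · Σ_j |P(C_j | t_L) − P(C_j | t_R)|; a sketch scoring splits over hypothetical class labels:

```python
def split_goodness(left_labels, right_labels, classes):
    # 2 * P_L * P_R * sum over classes of |P(C | left) - P(C | right)|
    n = len(left_labels) + len(right_labels)
    p_l = len(left_labels) / n
    p_r = len(right_labels) / n
    diff = sum(abs(left_labels.count(c) / len(left_labels)
                   - right_labels.count(c) / len(right_labels))
               for c in classes)
    return 2 * p_l * p_r * diff

# A split that separates two classes perfectly scores 2 * 0.5 * 0.5 * 2 = 1.0;
# a split that leaves both sides evenly mixed scores 0.0
perfect = split_goodness(["A", "A"], ["B", "B"], ["A", "B"])
mixed = split_goodness(["A", "B"], ["A", "B"], ["A", "B"])
```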
NEURAL NETWORK-BASED ALGORITHMS
With neural networks (NNs), just as with decision trees, a model
representing how to classify any given database tuple is constructed. The
activation functions typically are sigmoidal. When a tuple must be classified,
certain attribute values from that tuple are input into the directed graph at the
corresponding source nodes.
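A forward pass through a tiny network with sigmoidal activations; the weights here are hand-picked (hypothetically) so that the network computes XOR, rather than learned:

```python
import math

def sigmoid(x):
    # typical sigmoidal activation, squashing any input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    # a node outputs the activation of the weighted sum of its inputs
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_neuron):
    # hidden_layer: list of (weights, bias); output_neuron: (weights, bias)
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w, b = output_neuron
    return neuron(hidden, w, b)

# Hand-picked weights: the two hidden nodes act as OR and NAND detectors,
# and the output node combines them into XOR
hidden_layer = [([10, 10], -5), ([-10, -10], 15)]
output_neuron = ([10, 10], -15)
```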
NN Supervised Learning
Back propagation.
OUTLIERS
Outliers are sample points with values much different from those
of the remaining set of data. Outliers may represent errors in the data
(perhaps a malfunctioning sensor recorded an incorrect data value) or could be
correct data values that are simply much different from the remaining data.
Outlier detection, or outlier mining, is the process of identifying outliers in
a set of data. Clustering or other data mining algorithms may then choose to
remove or treat these values differently.
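A simple z-score approach to outlier mining (the data and the 2.5-standard-deviation threshold are illustrative choices, not from the notes):

```python
import math

def z_score_outliers(values, threshold=3.0):
    # flag points more than `threshold` standard deviations from the mean
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical readings, perhaps from a malfunctioning sensor
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
flagged = z_score_outliers(data, threshold=2.5)
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason robust fences such as the 1.5 IQR box plot rule are often preferred.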
K-Means Clustering
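K-means clustering partitions the data into K clusters by repeatedly assigning each point to the nearest of K centroids and then moving each centroid to the mean of its assigned points, until the centroids stop changing; a minimal sketch (the initialization rule and the data are illustrative choices):

```python
import math

def kmeans(points, k, iters=100):
    # assign each point to its nearest centroid, recompute the centroids
    # as cluster means, and repeat until the centroids stabilize
    centroids = [list(p) for p in points[:k]]   # simple initial choice
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new = [[sum(xs) / len(xs) for xs in zip(*cl)] if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

# Hypothetical points forming two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
```

K-means is sensitive to the initial centroids; taking the first k points, as above, is the simplest option, and random restarts are a common refinement.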