Data mining algorithms: Prediction


Prediction is essentially the same as classification, except that in prediction the
outcomes lie in the future.
The prediction task estimates the possible values of missing or future data. Prediction
involves developing a model from the available data; this model is then used to
predict future values for a new data set of interest. For example, a model can predict
the income of an employee based on education, experience and other demographic
factors such as place of residence, gender etc.
Examples of prediction tasks in business and research include:
• Predicting the value of a stock three months into the future.
• Predicting the percentage increase in traffic deaths next year if the speed
limit is raised.
• Predicting the winner of this fall’s baseball World Series based on a
comparison of team statistics.
How does prediction actually work? First, a model is built using existing data. The
existing data set is divided into two subsets: one is called the training set and the other
is called the test set. The training set is used to build the model and the associated rules.
Once the model is built and the rules are defined, the test set is used to evaluate it. Note
that the class labels of the test set records are already known; they are fed into the model
to test its accuracy. Accuracy, which we will discuss in detail in the following slides,
depends on many factors, such as the model itself and the selection and sizes of the
training and test data. The accuracy thus gives the confidence level to which the rules
can be trusted.
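
As a rough sketch of this train/test workflow (assuming Python with scikit-learn; the
dataset and model choice below are illustrative, not from these notes):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy labeled dataset standing in for real business data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Divide the existing data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build the model (and its associated rules) from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The test labels are already known; comparing them with the model's
# predictions measures the accuracy, i.e. the confidence level.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))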

Statistical (Bayesian) classification

Bayes’ Theorem (or Bayes’ Rule) is named after Reverend Thomas Bayes. It
describes the probability of an event based on prior knowledge of conditions that
might be related to that event. It also covers conditional probability problems. For
example: there are 3 bags, each containing some white marbles and some black
marbles. If a white marble is drawn at random, what is the probability that it came
from the first bag? In cases like this, we use Bayes’ Theorem: the probability of
occurrence of a particular event is calculated based on other conditions, which is
also called conditional probability.


Let E1, E2, E3, …, En be mutually exclusive and exhaustive events associated
with a random experiment, and let A be an event that occurs with some Ei. Then,

P(Ei|A) = P(Ei) P(A|Ei) / [ P(E1) P(A|E1) + P(E2) P(A|E2) + … + P(En) P(A|En) ]

Example 1

One of two boxes contains 4 red balls and 2 green balls and the second box contains
4 green and 2 red balls. By design, the probabilities of selecting box 1 or box 2 at random
are 1/3 for box 1 and 2/3 for box 2.

A box is selected at random and a ball is selected at random from it.

a) Given that the ball selected is red, what is the probability it was selected from the first
box?

b) Given that the ball selected is red, what is the probability it was selected from the
second box?
c) Compare the results in parts a) and b) and explain the answer.

Solution to Example 1

Let us call the first box B1 and the second box B2. Let event E1 be "select box 1",
event E2 be "select box 2", and event R be "select a red ball".

The probabilities of selecting one of the two boxes are given (above) by

P(E1)=1/3 and P(E2)=2/3

The conditional probability that a selected ball is red given that it is selected from box 1
is given by

P(R|E1) = 4/6 = 2/3 , since 4 of the 6 balls in box 1 are red

The conditional probability that a selected ball is red given that it is selected
from box 2 is given by

P(R|E2) = 2/6 = 1/3 , since 2 of the 6 balls in box 2 are red


a. We need the conditional probability that the ball was selected from box 1 given
that it is red, which is given by Bayes' theorem:

P(E1|R) = P(E1) P(R|E1) / [ P(E1) P(R|E1) + P(E2) P(R|E2) ]
        = (1/3)(2/3) / [ (1/3)(2/3) + (2/3)(1/3) ] = (2/9) / (4/9) = 1/2

b. We need the conditional probability that the ball was selected from box 2 given
that it is red, which is given by Bayes' theorem:

P(E2|R) = P(E2) P(R|E2) / [ P(E1) P(R|E1) + P(E2) P(R|E2) ]
        = (2/3)(1/3) / (4/9) = (2/9) / (4/9) = 1/2

c. The two probabilities calculated in parts a) and b) are equal. Although there are
twice as many red balls in box 1 as in box 2, the probabilities calculated above are
equal because the probability of selecting box 2 is twice as high as the probability
of selecting box 1. Bayes' theorem takes all the information into consideration.
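
The same computation in Python (a minimal sketch; the variable names are ours):

# Priors on the boxes and likelihoods of drawing red, from Example 1.
p_e1, p_e2 = 1/3, 2/3          # P(E1), P(E2)
p_r_e1, p_r_e2 = 2/3, 1/3      # P(R|E1), P(R|E2)

# Total probability of drawing a red ball.
p_r = p_e1 * p_r_e1 + p_e2 * p_r_e2

# Bayes' theorem for parts a) and b).
print("P(E1|R) =", p_e1 * p_r_e1 / p_r)   # 0.5
print("P(E2|R) =", p_e2 * p_r_e2 / p_r)   # 0.5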

Example 2

1% of a population have a certain disease and the remaining 99% are free from
this disease. A test is used to detect this disease. This test is positive in 95% of the
people with the disease and is also (falsely) positive in 2% of the people free from the
disease. If a person, selected at random from this population, has tested positive, what
is the probability that she/he has the disease?

Solution to Example 2

Let D be the event "have the disease" and FD be the event "free from the disease".
Let TP be the event that the "test is positive".


The probability that a person has the disease given that he/she has tested positive is
given by Bayes' theorem:

P(D|TP) = P(D) P(TP|D) / [ P(D) P(TP|D) + P(FD) P(TP|FD) ]
        = (0.01 × 0.95) / (0.01 × 0.95 + 0.99 × 0.02)
        = 0.0095 / 0.0293 ≈ 0.32

Note that even when a person tests positive, that does not mean that he/she has the
disease; this is because the proportion of disease-free people (99%) is much higher than
the proportion of those who have the disease (1%).
Let us clarify the results obtained above using some concrete numbers.
Suppose that 1000 people are tested.
Disease free: 99% × 1000 = 990, of whom 2% × 990 = 19.8 ≈ 20 test positive.
People with the disease: 1% × 1000 = 10, of whom 95% × 10 = 9.5 ≈ 10 test positive.
The fraction of those who test positive and do have the disease is:
9.5 / (19.8 + 9.5) ≈ 0.32, which is the probability P(D|TP) computed above.
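
The same computation in Python (a minimal sketch; the variable names are ours):

# Bayes' theorem for the disease test in Example 2.
p_d, p_fd = 0.01, 0.99         # P(D), P(FD)
p_tp_d, p_tp_fd = 0.95, 0.02   # P(TP|D), P(TP|FD)

# Total probability of testing positive.
p_tp = p_d * p_tp_d + p_fd * p_tp_fd

print("P(D|TP) =", p_d * p_tp_d / p_tp)   # about 0.324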

H.W.:
Three factories produce light bulbs to supply the market. Factory A produces 20% of
the bulbs, factory B produces 50% and factory C produces 30%. 2% of the bulbs
produced in factory A, 1% of the bulbs produced in factory B and 3% of the bulbs
produced in factory C are defective. A bulb is selected at random in the market and found
to be defective. What is the probability that this bulb was produced by factory B?

Bayesian networks

Bayesian networks are a type of probabilistic graphical model that uses Bayesian
inference for probability computations. Bayesian networks aim to model conditional
dependence, and thereby potentially causation, by representing conditional dependencies
as edges in a directed graph. Through these relationships, one can efficiently conduct
inference on the random variables in the graph through the use of factors.

Consider this example:


In this figure, we have an alarm ‘A’, a node installed in the house of a person ‘gfg’.
It can be triggered by two causes, burglary ‘B’ and fire ‘F’, which are the parent
nodes of the alarm node. The alarm node is in turn the parent of two person nodes,
‘P1’ and ‘P2’.

Upon hearing the alarm, ‘P1’ and ‘P2’ are each supposed to call the person ‘gfg’.
But there are a few caveats in this case: sometimes ‘P1’ may forget to call ‘gfg’
even after hearing the alarm, as he has a tendency to forget things quickly.
Similarly, ‘P2’ sometimes fails to call ‘gfg’, as he is only able to hear the alarm
from a certain distance.
Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true (P2 has called
‘gfg’) given that the alarm ‘A’ rang, but no burglary ‘B’ and no fire ‘F’ occurred.
=> P(P1, P2, A, ~B, ~F) [where P1, P2 and A are ‘true’ events and ‘~B’ and ‘~F’ are
‘false’ events]
[Note: The values mentioned below are not calculated or computed; they are given as
observed values.]
Burglary ‘B’ –
• P(B=T) = 0.001 (‘B’ is true, i.e. a burglary has occurred)
• P(B=F) = 0.999 (‘B’ is false, i.e. no burglary has occurred)
Fire ‘F’ –
• P(F=T) = 0.002 (‘F’ is true, i.e. a fire has occurred)
• P(F=F) = 0.998 (‘F’ is false, i.e. no fire has occurred)
Alarm ‘A’ –

B   F   P(A=T)   P(A=F)
T   T   0.95     0.05
T   F   0.94     0.06
F   T   0.29     0.71
F   F   0.001    0.999


• The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e. it may or may not have rung). It has
two parent nodes, burglary ‘B’ and fire ‘F’, which can be ‘true’ or ‘false’ (i.e. may or
may not have occurred) depending upon different conditions.
Person ‘P1’ –

A   P(P1=T)   P(P1=F)
T   0.95      0.05
F   0.05      0.95

• The person ‘P1’ node can be ‘true’ or ‘false’ (i.e. he may or may not have called the
person ‘gfg’). It has one parent node, the alarm ‘A’, which can be ‘true’ or ‘false’
(i.e. may or may not have rung upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –

A   P(P2=T)   P(P2=F)
T   0.80      0.20
F   0.01      0.99

• The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. he may or may not have called the
person ‘gfg’). It has one parent node, the alarm ‘A’, which can be ‘true’ or ‘false’
(i.e. may or may not have rung upon burglary ‘B’ or fire ‘F’).

Solution: Using the conditional probability tables above –

With respect to the question P(P1, P2, A, ~B, ~F), we need the probability of ‘P1’,
which we find with regard to its parent node, the alarm ‘A’. To get the probability of
‘P2’, we likewise find it with regard to its parent node, the alarm ‘A’.
We find the probability of the alarm ‘A’ node with regard to ‘~B’ and ‘~F’, since
burglary ‘B’ and fire ‘F’ are the parent nodes of alarm ‘A’.
From the probability tables, we can deduce:
P(P1, P2, A, ~B, ~F)
= P(P1|A) × P(P2|A) × P(A|~B, ~F) × P(~B) × P(~F)
= 0.95 × 0.80 × 0.001 × 0.999 × 0.998
≈ 0.00076
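
The same factorization can be checked in plain Python (a minimal sketch; the dictionary
layout is ours, not a library API):

# Conditional probability tables from the example above.
p_b = {True: 0.001, False: 0.999}                  # P(B)
p_f = {True: 0.002, False: 0.998}                  # P(F)
p_a_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, F)
p_p1_true = {True: 0.95, False: 0.05}              # P(P1=T | A)
p_p2_true = {True: 0.80, False: 0.01}              # P(P2=T | A)

# Joint probability P(P1, P2, A, ~B, ~F) via the network's factorization.
joint = (p_p1_true[True] * p_p2_true[True] *
         p_a_true[(False, False)] * p_b[False] * p_f[False])
print(joint)   # about 0.00076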


Instance-based methods (nearest neighbor)


Machine learning systems categorized as instance-based learning are systems that
learn the training examples by heart and then generalize to new instances based on some
similarity measure. It is called instance-based because it builds its hypotheses from the
training instances themselves. It is also known as memory-based learning or lazy
learning. The time complexity of such an algorithm depends upon the size of the training
data: the worst-case time complexity of answering a single query is O(n), where n is the
number of training instances.
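
A minimal sketch of a 1-nearest-neighbor query illustrates this O(n) scan over the
stored instances (assuming numeric feature vectors; all names are ours):

import math

def nearest_neighbor(query, instances):
    # Return the label of the stored instance closest to the query.
    # This scans all n memorized training instances, hence O(n) per query.
    best_label, best_dist = None, math.inf
    for features, label in instances:
        dist = math.dist(query, features)   # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# "Training" just memorizes the instances (lazy / memory-based learning).
training_data = [((1.0, 1.0), "spam"), ((5.0, 5.0), "ham")]
print(nearest_neighbor((1.2, 0.9), training_data))   # spam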

For example, if we were to create a spam filter with an instance-based learning
algorithm, then instead of just flagging emails that are already marked as spam, our
spam filter would be programmed to also flag emails that are very similar to them. This
requires a measure of resemblance between two emails. A similarity measure between
two emails could be having the same sender, the repetitive use of the same keywords,
or something else; one possible measure is sketched below.
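
For instance, the overlap between the keyword sets of two emails can serve as a
similarity measure (a sketch under our own assumptions; Jaccard similarity is just one
possible choice):

def jaccard_similarity(email_a, email_b):
    # Fraction of shared words: 0 means disjoint, 1 means identical word sets.
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

known_spam = "win a free prize now"
incoming = "claim your free prize now"
print(jaccard_similarity(known_spam, incoming))   # 3 shared / 7 total ≈ 0.43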
Advantages:
1. Instead of estimating the target function over the entire instance space, local
approximations can be made to the target function for each query.
2. This algorithm can adapt easily to new data, which is collected as we go.

Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query
involves building a local model from scratch.

Some of the instance-based learning algorithms are:


1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)


Linear models
Generalized linear models (GLMs) can be used to construct models for
regression and classification problems by using the type of distribution which best
describes the data or the labels given for training the model. Below are some types
of datasets and the corresponding distributions which help in constructing
the model for a particular type of data (the term data here refers to
the output data, i.e. the labels of the dataset):
1. Binary classification data – Bernoulli distribution
2. Real-valued data – Gaussian distribution
3. Count data – Poisson distribution
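
A brief sketch of how the distribution choice appears in practice (assuming Python
with statsmodels; the synthetic count data below is illustrative only):

import numpy as np
import statsmodels.api as sm

# Synthetic count-valued labels: Poisson responses driven by one feature.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 1.0 * x))

# Fit a GLM whose family matches the label distribution (Poisson for counts);
# swapping in Binomial() or Gaussian() covers the other two cases above.
X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)   # estimates should be near the true values (0.5, 1.0)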
