Introduction to Probability
Syllabus
Introduction to probability theory, fundamental concepts in probability, axioms of probability, application of simple probability rules, Bayes' theorem, random variables, Probability Mass Function and Cumulative Distribution Function (CDF) of a random variable, Binomial distribution, Normal distribution, Poisson distribution, Geometric distribution, Uniform distribution, Exponential distribution, Chi-square distribution.
Introduction to Probability Theory
• Probability theory is concerned with the study of random phenomena. Such phenomena are characterized by the fact that their future behaviour is not predictable in a deterministic fashion. The role of probability theory is to model the behavior of a system or algorithm assuming the given probability assignments and distributions.
• Probability was developed to analyze the games of chance. It is the mathematical modeling of the phenomenon of chance or randomness. The measure of the likelihood of a statement being true is called the probability of the statement.
• The probability of an event is defined as the number of favorable outcomes divided by the total number of possible outcomes.
Classical Definition of Probability
1. Computing Probability Using the Classical Method
• If an experiment has n equally likely simple events and if the number of ways that an event E can occur is m, then the probability of E, P(E), is
P(E) = Number of ways that E can occur / Number of possible outcomes = m/n
• Equivalently, if S is the sample space of this experiment, then
P(E) = N(E)/N(S)
• If Ē denotes the event of non-occurrence of E, then the number of cases favorable to Ē is n − m and hence the probability of Ē is :
P(Ē) = (n − m)/n = 1 − m/n = 1 − P(E)  ⇒  P(E) + P(Ē) = 1
• Here m is a non-negative integer and n is a positive integer with m ≤ n. This is called mathematical or a priori probability.
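The classical definition can be checked by direct enumeration. Below is a minimal Python sketch; the two-dice event used here is a made-up illustration, not an example from the text.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """P(E) = m/n : favourable outcomes over equally likely outcomes."""
    favourable = [s for s in sample_space if event(s)]
    return Fraction(len(favourable), len(sample_space))

# Sample space for rolling two dice: 36 equally likely outcomes.
dice = list(product(range(1, 7), repeat=2))

# Event E: the two dice sum to 7 -> m = 6 ways, so P(E) = 6/36 = 1/6.
p_seven = classical_probability(dice, lambda s: sum(s) == 7)
print(p_seven)                 # 1/6

# Complement rule: P(E) + P(not E) = 1.
p_not_seven = classical_probability(dice, lambda s: sum(s) != 7)
print(p_seven + p_not_seven)   # 1
```

Using `Fraction` keeps the ratio m/n exact instead of a rounded decimal.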
Random Experiment
• In random experiments we are not able to control the values of certain variables, so the results will vary from one performance of the experiment to the next even though most of the conditions are the same. These experiments are called random experiments.
TECHNICAL PUBLICATIONS® - An up thrust for knowledge
• A random experiment is defined as an experiment whose possible outcomes are known before the experiment is performed, but which outcome is going to happen in a particular trial is not known in advance.
• Example : If we toss a die, the result of the experiment is that it will come up with one of the numbers in the set {1, 2, 3, 4, 5, 6}.
A random variable is simply an expression whose value is the outcome of a
particular experiment.
Sample Space
The totality of the possible outcomes of a random experiment is called the sample
space of the experiment and it will be denoted by letter 'S’.
There will be more than one sample space that can describe outcomes of an
experiment, but there is usually only one that will provide the most information.
The sample space is not determined completely by the experiment. It is partially
determined by the purpose for which the experiment is carried out.
Example 1 : If the experiment consists of flipping two coins, then the sample space
consists of the following points:
S = {(T, T), (T, H), (H, T), (H, H)}
The outcome will be (T, T) if both coins are tails, (T, H) if the first coin is tails and the second heads, (H, T) if the first is heads and the second tails, and (H, H) if both coins are heads.
Example 2 : A die is rolled once. We let X denote the outcome of this experiment.
Then the sample space for this experiment is the 6-element set
S = {1, 2, 3, 4, 5, 6}
It is convenient to classify sample spaces according to the number of elements they contain. If a sample space has a finite number of points, it is called a finite sample space. If it has as many points as there are natural numbers 1, 2, 3, ..., it is called a countably infinite sample space. If it has as many points as there are in some interval on the x axis, such as 0 ≤ x ≤ 1, it is called a non-countably infinite sample space.
A sample space that is finite or countably infinite is often called a discrete sample space, while one that is non-countably infinite is called a non-discrete sample space.
The result of a trial in a random experiment is called an outcome.
Event
An event is simply a collection of certain sample points.
• An event is a subset A of the sample space S, i.e., it is a set of possible outcomes. If the outcome of an experiment is an element of A, we say that the event A has occurred. An event consisting of a single point of S is called a simple or elementary event.
• A simple event is any single outcome from a probability experiment. Each simple event is denoted eᵢ.
* A single performance of the experiment is known as a trial.
• As particular events, we have S itself, which is the sure or certain event since an element of S must occur, and the empty set ∅, which is called the impossible event because an element of ∅ cannot occur.
* An unusual event is an event that has a low probability of occurring.
• Independent events : Let E₁, E₂ be two events. Then E₁, E₂ are independent events if
P(E₁ ∩ E₂) = P(E₁) P(E₂)
• Mutually exclusive events : E₁, E₂, ..., Eₙ are said to be mutually exclusive if Eᵢ ∩ Eⱼ = ∅ for i ≠ j. If E₁ and E₂ are independent and mutually exclusive, then either P(E₁) = 0 or P(E₂) = 0. If E₁ and E₂ are independent events, then E₁ᶜ and E₂ᶜ are also independent.
• Mutually exclusive events are sometimes called disjoint events. If two events are mutually exclusive, then it is not possible for both events to occur on the same trial. If two events are mutually exclusive then they do not share any outcomes.
Example : 1. In throwing a die, all 6 possible cases are mutually exclusive.
2. In tossing a coin, the events of head turning up and tail turning up are mutually exclusive.
• Complementary events : Two events of a sample space whose intersection is ∅ and whose union is the entire sample space are called complementary events. If E is an event of sample space S, its complement is denoted by E' or Ē.
• Equally likely events : Two or more events are said to be equally likely if the chances of their happening in a trial are equal. For example, in drawing a card from a pack, any card may be obtained; in this trial, all 52 cases are equally likely.
• Dependent events : If the happening of an event in the first trial influences the happening of an event in the successive trial, then the events are said to be dependent events.
• Exhaustive events : The total number of all possible events of a random experiment is called exhaustive events. For example, in tossing a coin, there are two exhaustive elementary events, i.e. head and tail.
Algebra of Events
• Fig. 3.1.1 shows the relation between two sets.
De Morgan's law :
• The useful relationships between the three basic operations of forming unions, intersections and complements are known as De Morgan's laws.
• The complement of a union (intersection) of two sets A and B equals the intersection (union) of the complements of A and B. Thus
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
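De Morgan's laws can be verified for any concrete sets by brute force. A small illustrative Python check; the universe U and the sets A, B are arbitrary choices, not data from the text.

```python
# Illustrative check of De Morgan's laws on a small universe.
U = set(range(10))          # universal set
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(S):
    return U - S

# (A ∪ B)' = A' ∩ B'
assert complement(A | B) == complement(A) & complement(B)
# (A ∩ B)' = A' ∪ B'
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example")
```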
Example : Prove that P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Solution : Let A and B be any two events. We write A ∪ B as the union of the mutually exclusive events A ∩ Bᶜ, A ∩ B and Aᶜ ∩ B :
A ∪ B = (A ∩ Bᶜ) ∪ (A ∩ B) ∪ (Aᶜ ∩ B)   ...(1)
By axiom 3 :
P(A ∪ B) = P(A ∩ Bᶜ) + P(A ∩ B) + P(Aᶜ ∩ B)   ...(2)
Also A = (A ∩ Bᶜ) ∪ (A ∩ B) and B = (Aᶜ ∩ B) ∪ (A ∩ B), so again by axiom 3 :
P(A) = P(A ∩ Bᶜ) + P(A ∩ B)   ...(3)
P(B) = P(Aᶜ ∩ B) + P(A ∩ B)   ...(4)
Adding equations (3) and (4), we get
P(A) + P(B) = P(A ∩ Bᶜ) + P(Aᶜ ∩ B) + 2P(A ∩ B)   ...(5)
From equations (2) and (5),
P(A) + P(B) = P(A ∪ B) + P(A ∩ B)
Hence P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Example : Given P(A), P(B) and P(A ∪ B), determine i) P(A|B) ii) P(B|A).
Solution : By definition,
P(A|B) = P(A ∩ B) / P(B)  and  P(B|A) = P(A ∩ B) / P(A)
First calculate P(A ∩ B) from the given data :
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
The required conditional probabilities, as well as quantities such as P(Aᶜ ∩ B) = P(B) − P(A ∩ B) and P(A ∩ Bᶜ) = P(A) − P(A ∩ B), then follow by substitution.
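As a quick numeric sketch of this procedure, assuming some hypothetical values for P(A), P(B) and P(A ∪ B) (these numbers are illustrative only, not taken from the text):

```python
from fractions import Fraction

# Hypothetical given data (for illustration only).
P_A = Fraction(1, 2)
P_B = Fraction(1, 3)
P_AuB = Fraction(2, 3)             # P(A ∪ B)

# Addition rule rearranged: P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
P_AnB = P_A + P_B - P_AuB          # = 1/6
P_A_given_B = P_AnB / P_B          # = 1/2
P_B_given_A = P_AnB / P_A          # = 1/3
print(P_AnB, P_A_given_B, P_B_given_A)   # 1/6 1/2 1/3
```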
Axioms of Probability
• The subject of probability is based on three commonsense rules, known as axioms.
• One way of defining the probability of an event is in terms of its relative frequency. Suppose an experiment with sample space S is repeatedly performed under exactly the same conditions. For each event E of the sample space, we define n(E) to be the number of times in the first n repetitions of the experiment that the event E occurs. Then P(E), the probability of the event E, is defined by
P(E) = lim (n→∞) n(E)/n
• The theory of probability starts with the assumption that probabilities can be assigned so as to satisfy the following three basic axioms of probability.
• Suppose we have a sample space S. If S is discrete, all subsets correspond to events and conversely; if S is non-discrete, only special subsets correspond to events.
• To each event A in the class C of events, we associate a real number P(A). P is called a probability function, and P(A) the probability of the event A, if the following axioms are satisfied.
Axiom 1 : For every event A in class C, P(A) ≥ 0.
Axiom 2 : For the certain event S in the class C, P(S) = 1.
Axiom 3 : For any number of mutually exclusive events A₁, A₂, A₃, ... in C,
P(A₁ ∪ A₂ ∪ A₃ ∪ ...) = P(A₁) + P(A₂) + P(A₃) + ...
In particular, for two mutually exclusive events A₁ and A₂,
P(A₁ ∪ A₂) = P(A₁) + P(A₂)
• The first axiom says that the long-run frequency of any event is never negative. The second axiom says that the outcome of an experiment is always in the sample space S, so the sure event has probability 1. The third axiom says that the probability of a union of mutually exclusive events is the sum of their probabilities.
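The relative-frequency definition above can be illustrated by simulation. A small sketch; the die-rolling event and the number of trials are arbitrary choices for illustration.

```python
import random

random.seed(42)

def relative_frequency(event, trials):
    """Estimate P(E) as n(E)/n over repeated die rolls."""
    hits = sum(1 for _ in range(trials) if event(random.randint(1, 6)))
    return hits / trials

# Long-run frequency of rolling an even number approaches P = 1/2.
estimate = relative_frequency(lambda face: face % 2 == 0, 100_000)
print(estimate)   # close to 0.5
```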
Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will both occur is found by multiplying the two probabilities. Thus for two independent events, the special rule of multiplication is
P(A and B) = P(A) P(B)
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is
P(A and B) = P(A) P(B|A)
• The probability P(A ∩ B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
and hence
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B)
i.e. the probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram portrays outcomes that are mutually exclusive.
Application of Simple Probability Rules
• Market basket analysis is solved by using association rule learning with the help of joint probability and conditional probability.
• Market basket analysis is just one form of frequent pattern mining; the frequent pattern association rules in frequent mining can be classified in this way.
• Market basket analysis is an example of frequent itemset mining. The purpose of market basket analysis is to determine what products customers purchase together. It takes its name from the idea of customers throwing all their purchases into a shopping cart (a "market basket") during grocery shopping.
• Market basket analysis is a technique which identifies the strength of association between pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence is when two or more things take place together.
• Market basket analysis creates If-Then scenario rules, for example : if item A is purchased then item B is likely to be purchased.
• The rules are probabilistic in nature or, in other words, they are derived from the frequencies of co-occurrence in the observations.
• Support is the proportion of baskets that contain the items of interest. The rules can be used in pricing strategies, product placement, and various types of cross-selling strategies.
• Market basket analysis takes data at transaction level, which lists all items bought by a customer in a single purchase.
• The technique determines relationships of what products were purchased with which other product(s). These relationships are then used to build profiles containing if-then rules of the items purchased.
Association Rule Learning
• Association rule learning is also called association rule mining.
• In a retail context, association rule learning is a method for discovering relationships that exist in frequently purchased items. Association rules express a relationship of the form X → Y (that is, X implies Y).
• Association rule mining can be viewed as a two-step process :
1. Find all frequent item sets : By definition, each of these item sets will occur at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent item sets : By definition, these rules must satisfy minimum support and minimum confidence.
• The association rule X → Y means that transactions containing items from set X tend to contain items from set Y.
• Association rules show attribute value conditions that occur frequently together in a given data set. A typical example of association rule mining is market basket analysis.
• Data is collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records.
• Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together.
• They could use this data for adjusting store layouts, for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns.
• Association rules provide information of this type in the form of if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.
• In addition to the antecedent (if) and the consequent (then), an association rule has two numbers that express the degree of uncertainty about the rule.
• In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in common).
• The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule.
a) Support count : The support count of an itemset A, denoted A.count, in a dataset T is the number of transactions in T that contain A. Assume T has n transactions. Then
Support = (A ∪ B).count / n
b) Confidence : The rule holds in T with confidence conf, the percentage of transactions that contain A which also contain B :
Confidence = Conf = P(B|A) = (A ∪ B).count / A.count
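A minimal sketch of these two measures on a toy transaction list. The item names and transactions are made up for illustration; they are not data from the text.

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def count(itemset):
    """Support count: number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    # support(A -> B) = (A ∪ B).count / n
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    # confidence(A -> B) = (A ∪ B).count / A.count
    return count(antecedent | consequent) / count(antecedent)

print(support({"bread"}, {"milk"}))      # 3/5 = 0.6
print(confidence({"bread"}, {"milk"}))   # 3/4 = 0.75
```

Here the rule {bread} → {milk} has support 0.6 (3 of 5 baskets contain both) and confidence 0.75 (3 of the 4 baskets containing bread also contain milk).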
Bayes’ Theorem
Bayes’ theorem is a method to revise the probability of an event given additional
information. Bayes's theorem calculates a conditional probability called a posterior
or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• Bayes' theorem gives a relation between P(A|B) and P(B|A). An important application of Bayes' theorem is that it gives a rule for how to update or revise the strengths of evidence-based beliefs in light of new evidence a posteriori.
A prior probability is an initial probability value originally obtained before any
additional information is obtained.
A posterior probability is a probability value that has been revised by using additional information that is later obtained.
Suppose that B₁, B₂, B₃, ..., Bₙ partition the outcomes of an experiment and that A is another event. For any number k with 1 ≤ k ≤ n we have the formula :
P(Bₖ|A) = P(A|Bₖ) P(Bₖ) / Σᵢ P(A|Bᵢ) P(Bᵢ)
Example : A mechanical factory production line is manufacturing bolts using three machines, A, B and C. Of the total output, machine A is responsible for 25 %, machine B for 35 % and machine C for 40 %. 5 % of the bolts from machine A are defective, 4 % from machine B and 2 % from machine C. A bolt is chosen at random from the production line and found to be defective. What is the probability that it came from : i. machine A ii. machine B iii. machine C ?
Solution : Let
D = {bolt is defective},
A = {bolt is from machine A},
B = {bolt is from machine B},
C = {bolt is from machine C}.
Given data : P(A) = 0.25, P(B) = 0.35, P(C) = 0.4,
P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02.
From Bayes' theorem :
P(A|D) = P(D|A) P(A) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
= (0.05 × 0.25) / (0.05 × 0.25 + 0.04 × 0.35 + 0.02 × 0.4)
= 0.0125 / (0.0125 + 0.014 + 0.008) = 0.0125 / 0.0345 = 0.3623
P(B|D) = P(D|B) P(B) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
= 0.014 / 0.0345 = 0.4058
P(C|D) = P(D|C) P(C) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
= 0.008 / 0.0345 = 0.2319
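The arithmetic in this example can be checked programmatically. A short sketch that recomputes the posteriors:

```python
# Recomputing the defective-bolt example with Bayes' theorem.
priors = {"A": 0.25, "B": 0.35, "C": 0.40}         # P(machine)
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}   # P(D | machine)

# Total probability of drawing a defective bolt.
p_defective = sum(priors[m] * defect_rates[m] for m in priors)

# Posterior P(machine | D) for each machine.
posteriors = {m: priors[m] * defect_rates[m] / p_defective for m in priors}

print(round(p_defective, 4))                        # 0.0345
print({m: round(p, 4) for m, p in posteriors.items()})
```

The posteriors necessarily sum to 1, which is a useful sanity check on the hand computation.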
Example : At a certain university, 4 % of men are over 6 feet tall and 1 % of women are over 6 feet tall. The total student population is divided in the ratio 3 : 2 in favour of women. If a student is selected at random from among all those over six feet tall, what is the probability that the student is a woman ?
Solution : Let us assume the following :
M = {Student is Male},
F = {Student is Female},
T = {Student is over 6 feet tall}.
Given data :
P(M) = 2/5, P(F) = 3/5, P(T|M) = 4/100, P(T|F) = 1/100.
We require P(F|T). Using Bayes' theorem we have :
P(F|T) = P(T|F) P(F) / [P(T|F) P(F) + P(T|M) P(M)]
= (1/100 × 3/5) / (1/100 × 3/5 + 4/100 × 2/5)
= (3/500) / (3/500 + 8/500) = 3/11
Example : A pair of dice is rolled. If a sum of 9 has appeared, find the probability that one of the dice shows 3.
Solution : Let A = the event that the sum is 9, and B = the event that one of the dice shows 3.
Exhaustive cases = 6² = 36
Favorable cases of the event A = (3, 6), (6, 3), (4, 5), (5, 4).
So P(A) = 4/36 = 1/9
Favorable cases for the event A ∩ B = (3, 6), (6, 3)
Hence P(A ∩ B) = 2/36 = 1/18
But P(A ∩ B) = P(A) × P(B|A), so
P(B|A) = P(A ∩ B) / P(A) = (1/18) / (1/9) = 1/2
Random Variables
• A random variable is a set of possible values from a random experiment.
• A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables : discrete and continuous.
• Whenever you run an experiment (flip a coin, roll a die, pick a card), you assign a number to represent the value of the outcome that you get. This assignment is called a random variable.
• A random variable is a variable X that assigns a real number x for each and every outcome of a random experiment. If S is the sample space containing all the n outcomes {e₁, e₂, e₃, ..., eᵢ, ..., eₙ} of a random experiment and X is a random variable defined as a function X(e) on S, then for every outcome eᵢ (where i = 1, 2, 3, ..., n) in S the random variable X(eᵢ) will assign a real value xᵢ.
• The advantage of random variables is that the user can define certain probability functions that make it both convenient and easy to compute the probabilities of various events.
• A random variable is a numerically valued variable which takes on different values with given probabilities.
• Examples :
1. The return on an investment in a one-year period.
2. The price of an equity.
3. The number of customers entering a store.
4. The sales volume of a store on a particular day.
5. The turnover rate at your organization next year.
Discrete Random Variables
• If we can find a way to list all possible outcomes for a random variable and assign probabilities to each one, we have a discrete random variable.
• The random variable is called a discrete random variable if it is defined over a sample space having a finite or a countably infinite number of sample points. In this case, the random variable takes on discrete values and it is possible to enumerate all the values it may assume.
• A discrete random variable can only have a specific (or finite) number of numerical values.
• We can also have infinite discrete random variables if we think about things we know have an estimated number. Think about the number of stars in the universe : we know that there is not a specific number that we have a way to count, so this is an example of an infinite discrete random variable.
• The mean of any discrete random variable is an average of the possible outcomes, with each outcome weighted by its probability.
• Examples :
1. Total of roll of two dice : 2, 3, ..., 12
2. Number of desktops sold : 0, 1, ...
3. Customer count : 0, 1, ...
Continuous Random Variable
• In the case of a sample space having an uncountably infinite number of sample points, the associated random variable is called a continuous random variable, with its values distributed over one or more continuous intervals on the real line.
• A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile.
• A continuous random variable has a continuous range of values. It cannot be produced from a discrete sample space because of our requirement that all random variables be single-valued functions of all sample space points.
• A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values. The probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.
• Both types of random variables are important in science and engineering. A mixed random variable is one for which some of its values are discrete and some are continuous.
Probability Mass Function and Cumulative Distribution Function of a Random Variable
• For a discrete random variable, the probability that a random variable X takes a particular value xᵢ, P(X = xᵢ), is called the probability mass function P(xᵢ).
• A probability mass function is a function that maps each outcome of a random experiment to a probability.
• Let X be a discrete random variable with range R_X = {x₁, x₂, x₃, ...} (finite or countably infinite). The function P_X(xₖ) = P(X = xₖ), for k = 1, 2, 3, ..., is called the Probability Mass Function (PMF) of X.
• Thus the PMF is a probability measure that gives us probabilities of the possible values for a random variable.
• The behavior of a random variable is characterized by its probability distribution, that is, by the way probabilities are distributed over the values it assumes. A probability mass function and a cumulative distribution function are two ways to characterize this distribution for a discrete random variable.
• They are equivalent in the sense that the knowledge of either one completely specifies the random variable. The corresponding functions for a continuous random variable are the probability distribution function, defined in the same way as in the case of a discrete random variable, and the probability density function.
• If X is a random variable, then the cumulative distribution function F(x) is defined by
F(x) = P(X ≤ x)
• The mean μ is given by
μ = Σⱼ xⱼ f(xⱼ)   for a discrete distribution
μ = ∫ x f(x) dx   for a continuous distribution
and the variance σ² (sigma square) by
σ² = Σⱼ (xⱼ − μ)² f(xⱼ)   for a discrete distribution
σ² = ∫ (x − μ)² f(x) dx   for a continuous distribution
• The mean (μ) is also denoted by E(X) and is called the expectation of X. It gives the average value of X to be expected in many trials.
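A short sketch computing μ and σ² from a PMF using the formulas above; the PMF values here are a hypothetical example, not data from the text.

```python
# Mean and variance of a discrete random variable from its PMF,
# mirroring μ = Σ x f(x) and σ² = Σ (x − μ)² f(x).
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}   # hypothetical distribution

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(round(mean, 4), round(variance, 4))   # 1.7 0.81
```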
• For a normal random variable X, the variance is Var(X) = E[(X − E[X])²]; substituting z = (x − μ)/σ and integrating gives Var(X) = σ².
• Example : Compute the variance of the sample data 75, 80, 82, 87, 96.
μ = (75 + 80 + 82 + 87 + 96)/5 = 420/5 = 84
σ² = [(x₁ − μ)² + (x₂ − μ)² + ... + (xₙ − μ)²]/n
= [(75 − 84)² + (80 − 84)² + (82 − 84)² + (87 − 84)² + (96 − 84)²]/5
= [(−9)² + (−4)² + (−2)² + (3)² + (12)²]/5
= (81 + 16 + 4 + 9 + 144)/5 = 254/5 = 50.8
σ = √σ² = √50.8 = 7.1274
Example : X is a normal variate with mean 30 and standard deviation 5. Find the probabilities that a) 26 ≤ X ≤ 40 b) X ≥ 45.
Solution : Given data : mean μ = 30, standard deviation σ = 5.
a) Here x₁ = 26 and x₂ = 40, so z₁ = (26 − 30)/5 = −0.8 and z₂ = (40 − 30)/5 = 2.
P(26 ≤ X ≤ 40) = Φ(2) − Φ(−0.8) = 0.9772 − 0.2119 = 0.7653
b) For x = 45, z = (45 − 30)/5 = 3, so P(X ≥ 45) = 1 − Φ(3) = 1 − 0.99865 = 0.00135
Example : A random variable X has the following probability distribution :
X : 0, 1, 2, 3, 4, 5, 6, 7
P(X) : 0, K, 2K, 2K, 3K, K², 2K², 7K² + K
i) Find K. ii) Evaluate P(X < 6), P(X ≥ 6) and P(0 < X < 5). iii) If P(X ≤ k) > 1/2, find the minimum value of k. iv) Determine the distribution function of X. v) Mean vi) Variance.
Solution : i) Since the probabilities must sum to 1,
K + 2K + 2K + 3K + K² + 2K² + 7K² + K = 1
10K² + 9K − 1 = 0
(K + 1)(10K − 1) = 0
K + 1 = 0 or 10K − 1 = 0
K = −1 or K = 1/10
We discard K = −1. Therefore K = 1/10 = 0.1.
ii) P(X < 6) = P(X = 0) + P(X = 1) + P(X = 2) + ... + P(X = 5)
= 0 + K + 2K + 2K + 3K + K²
= 0 + 0.1 + 2(0.1) + 2(0.1) + 3(0.1) + (0.1)²
= 0.1 + 0.2 + 0.2 + 0.3 + 0.01
P(X < 6) = 0.81
P(X ≥ 6) = 1 − P(X < 6) = 1 − 0.81 = 0.19
P(0 < X < 5) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = K + 2K + 2K + 3K = 8K = 0.8
iii) Minimum value of k such that P(X ≤ k) > 1/2 :
P(X ≤ 0) = 0
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0 + K = 0.1
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0 + K + 2K = 3K = 0.3
P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0 + K + 2K + 2K = 5K = 0.5
P(X ≤ 4) = 0 + K + 2K + 2K + 3K = 8K = 0.8
The condition is P(X ≤ k) > 1/2, and the smallest k satisfying it is k = 4.
iv) Distribution function of X : F(x) = P(X ≤ x)
x : 0, 1, 2, 3, 4, 5, 6, 7
F(x) : 0, 0.1, 0.3, 0.5, 0.8, 0.81, 0.83, 1
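The solution can be verified numerically. A short sketch using exact rational arithmetic, so that the cumulative comparison against 1/2 is not disturbed by floating-point error:

```python
from fractions import Fraction

# Tabulated distribution: P(X) = 0, K, 2K, 2K, 3K, K², 2K², 7K² + K for x = 0..7.
K = Fraction(1, 10)
pmf = [0, K, 2*K, 2*K, 3*K, K**2, 2*K**2, 7*K**2 + K]

print(sum(pmf))            # 1  (probabilities sum to one)

p_less_6 = sum(pmf[:6])
print(p_less_6)            # 81/100
print(1 - p_less_6)        # 19/100

# Cumulative distribution F(k) = P(X <= k); smallest k with F(k) > 1/2.
cdf = [sum(pmf[:k + 1]) for k in range(8)]
k_min = next(k for k, F in enumerate(cdf) if F > Fraction(1, 2))
print(k_min)               # 4
```

Note that F(3) equals exactly 1/2, which is why the strict inequality picks k = 4; a floating-point version of the same check could silently get this wrong.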
Binomial Distribution
• 'Binomial' means 'two numbers'.
• Outcomes of health research are often measured by whether or not they have occurred; for example, recovered from disease, admitted to hospital, died, etc.
• The binomial distribution occurs in games of chance, quality inspection, opinion polls, medicine and so on.
• Such data can be modelled by assuming that the number of events 'n' has a binomial distribution with a fixed probability of event p. The binomial distribution is the distribution for a series of Bernoulli trials.
• The binomial distribution is written as B(n, p) where n is the total number of events and p is the probability of an event.
• Properties of the binomial distribution :
1. The experiment consists of n identical trials.
2. Each trial has only two outcomes.
3. The probability of one outcome is p and that of the other is q = 1 − p.
4. The trials are independent.
5. We are interested in x, the number of successes observed during the n trials.
• Trials satisfying the above properties are called Bernoulli trials.
• The probability function of X is
P(X = x) = ⁿCₓ pˣ qⁿ⁻ˣ,  x = 0, 1, ..., n
and 0 otherwise. The distribution of X with this probability function is called the binomial distribution or Bernoulli distribution.
• The mean μ (mu) and variance of the binomial distribution with parameters (n, p) are given by
μ = E(X) = Σₓ x P(X = x) = np
For the variance,
σ² = E(X²) − μ² = Σₓ x² ⁿCₓ pˣ qⁿ⁻ˣ − μ²
= n(n − 1)p² Σₓ ⁿ⁻²Cₓ₋₂ pˣ⁻² qⁿ⁻ˣ + np − n²p²
= n(n − 1)p² (p + q)ⁿ⁻² + np − n²p²
= n(n − 1)p² + np − n²p²
= n²p² − np² + np − n²p²
= np − np²
= np(1 − p)
σ² = npq
• The standard deviation (σ) of the binomial distribution is √(npq).
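The results μ = np and σ² = npq can be confirmed by computing the moments directly from the binomial PMF. A short sketch; the values n = 10, p = 0.3 are arbitrary choices.

```python
from math import comb

# Checking μ = np and σ² = npq for a binomial B(n, p) directly from the PMF.
n, p = 10, 0.3
q = 1 - p

pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(mean, 6), round(var, 6))   # 3.0 2.1
```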
Example : A random variable X has the distribution
x : 0, 1, 2, 3, 4, 5
P(X = x) : 0.004, 0.041, 0.165, 0.329, 0.329, 0.132
Find the mean value of the distribution.
Solution :
μ = Σ x P(X = x) = 0 × (0.004) + 1 × (0.041) + 2 × (0.165) + 3 × (0.329) + 4 × (0.329) + 5 × (0.132)
= 0 + 0.041 + 0.33 + 0.987 + 1.316 + 0.66
= 3.334
Example : Find the probability of getting 3 to 6 heads inclusive in 10 tosses of a fair coin by using a) the binomial distribution b) the normal approximation to the binomial distribution.
Solution : a) Binomial distribution : with n = 10 and p = 0.5,
P(3 ≤ X ≤ 6) = P(3) + P(4) + P(5) + P(6)
= 0.1172 + 0.2051 + 0.2461 + 0.2051
= 0.7734
b) Normal approximation to the binomial distribution : treat the data as continuous, with μ = np = 5 and σ = √(npq) = 1.581. Applying the continuity correction,
P(3 ≤ X ≤ 6) ≈ P(2.5 < X < 6.5) = P(−1.58 < Z < 0.95) = 0.7718
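Both parts of this coin-toss example can be reproduced numerically. A short sketch in which the standard normal CDF is built from `math.erf`:

```python
from math import comb, erf, sqrt

n, p = 10, 0.5

# a) Exact binomial probability of 3 to 6 heads inclusive.
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(3, 7))
print(round(exact, 4))        # 0.7734

# b) Normal approximation with continuity correction: P(2.5 < X < 6.5).
mu, sigma = n * p, sqrt(n * p * (1 - p))
def Phi(z):                   # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))
approx = Phi((6.5 - mu) / sigma) - Phi((2.5 - mu) / sigma)
print(round(approx, 4))
```

The approximation lands within about 0.002 of the exact value here, which is typical for n = 10 with p = 0.5.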
Normal Distribution
• The normal distribution was discovered in 1733 by de Moivre as an approximation to the binomial distribution when the number of trials is large. It was derived in 1809 by Gauss.
• The normal distribution is also called the Gauss distribution.
• The normal distribution describes a special class of distributions that are symmetric and can be described by the distribution mean μ and the standard deviation σ (or variance σ²).
• Its importance rests on the Central Limit Theorem, which states that the sum of a large number of independent random variables (binomial, Poisson, etc.) will approximate a normal distribution. For example : human height is determined by a large number of factors, both genetic and environmental, which are additive in their effects. Thus, it follows a normal distribution.
• A continuous random variable X is said to be normally distributed with mean μ and variance σ² if its probability density function is
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
• Note that f(x) is not the same as P(x).
Properties of the normal distribution :
• The distribution is symmetric about its mean.
• The shape of the distribution is determined by the standard deviation : a large value of SD reduces the height and increases the spread of the curve, while a small value of SD increases the height and reduces the spread of the curve.
• Almost all of the distribution lies within 3 standard deviations of the mean.
• The total area under the curve is 1.
• The curve extends indefinitely in both directions, approaching, but never touching, the horizontal axis as it does so.
Facts about normally distributed variables and normal-curve areas :
• If we know the mean and the standard deviation of a normally distributed variable, we know its distribution and associated normal curve. The mean and standard deviation are the normal distribution's sufficient statistics : they completely specify the variable's distribution.
• The probability that a normally distributed variable assumes a value between a and b is equal to the area under the curve between a and b.
• Calculating the probability that a normal random variable lies within some interval theoretically requires the area under the curve between the two end points of the interval, i.e. an integration for each and every different normal curve. Due to the complication of this calculation and the frequency with which it is needed, a standardized way has been derived.
• Standard normal distribution : the normal distribution with mean = 0 and standard deviation = 1.
• Standard normal random variable : the normal random variable with the standard normal distribution is called the standard normal random variable.
• Normal approximation of the binomial probability distribution : recall the binomial discrete distribution
P(x) = ⁿCₓ pˣ qⁿ⁻ˣ,  where ⁿCₓ = n! / (x!(n − x)!)
• In certain circumstances the continuous normal distribution is a good approximation to the binomial distribution. When the probability (p) of the binomial distribution is near zero or 1, or n (the number of trials) is small, the binomial distribution will be nonsymmetrical and the normal will not give a good approximation.
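A small sketch of the density formula, checking the symmetry and unit-area properties numerically; the parameters μ = 30 and σ = 5 are arbitrary choices.

```python
from math import exp, pi, sqrt

# Density of the normal distribution, f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).
def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The curve is symmetric about the mean ...
print(normal_pdf(28, mu=30, sigma=5) == normal_pdf(32, mu=30, sigma=5))  # True

# ... and the total area under it is 1 (numerical check by Riemann sum
# over μ ± 6σ, where essentially all of the mass lies).
area = sum(normal_pdf(30 + k * 0.01, mu=30, sigma=5) * 0.01
           for k in range(-3000, 3000))
print(round(area, 4))   # 1.0
```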
Poisson Distribution
• The Poisson distribution is named after its inventor Siméon Poisson, a French mathematician. He found that if we have a rare event (i.e. p is very small) and we know the expected or mean number μ of occurrences, the probabilities of 0, 1, 2, ... events are given by :
P(x) = e^(−μ) μˣ / x!
• Poisson distribution : a distribution of the number of rare events that occur in a unit of time, distance, space and so on.
• Examples :
1. Number of insurance claims in a unit of time.
2. Number of accidents on a ten-mile highway.
3. Number of airplane crashes in a triangle area.
• When there is a large number of trials but a small probability of success, the binomial calculation becomes impractical. Example : the number of deaths from horse kicks in the army in different years. The mean number of successes in n trials is μ = np.
• If we substitute μ/n for p and let n tend to infinity, the binomial distribution becomes the Poisson distribution :
P(x) = e^(−μ) μˣ / x!
• The Poisson distribution is applied where random events in space or time are expected to occur. Deviation from the Poisson distribution may indicate some degree of non-randomness in the events under study.
• Example : 64 deaths in 20 years from thousands of soldiers.
• If a mean or average probability of an event happening per unit of time / per page / per mile cycled etc. is given, and you are asked to calculate the probability of n events happening in a given time / number of pages / number of miles, then the Poisson distribution is used.
• If, on the other hand, an exact probability of an event happening is given, or implied, in the question, and you are asked to calculate the probability of this event happening k times out of n, then the binomial distribution must be used.
Example : Given data : n = 1000, p = 1/500.
μ = np = 1000 × (1/500) = 2
The probability of exactly 3 occurrences is
P(3) = e^(−μ) μ³ / 3! = e^(−2) 2³ / 3! = 0.18
Properties of Poisson's distribution :
• If the number of trials 'n' is large and the probability of success 'p' is very small, binomial probabilities are approximated by Poisson's distribution.
• μ = np remains constant as n → ∞ and p → 0.
• The probability for a Poisson-distributed random variable in n trials is given as
P(X = x) = e^(−μ) μˣ / x!
• Poisson's distribution is applicable when the events do not occur as the outcomes of a definite number of trials.
• Poisson's distribution is applied to events whose probabilities of occurrence are extremely rare, but which have a large number of independent opportunities for occurrence.
• The mean of Poisson's distribution is μ = np and its variance is σ² = np.
Example : A random variable X has a Poisson distribution with a mean of 3. Find P(1 ≤ X ≤ 3).
Solution : Here P(1 ≤ X ≤ 3) is given by
P(1 ≤ X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3)
Taking λ = 3 and k = 1, 2, 3 we get
P(1 ≤ X ≤ 3) = (3¹ e⁻³)/1! + (3² e⁻³)/2! + (3³ e⁻³)/3!
= 3e⁻³ + (9/2)e⁻³ + (9/2)e⁻³ = 12e⁻³
= 0.59744
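A short sketch reproducing this example directly from the Poisson PMF:

```python
from math import exp, factorial

# Poisson PMF: P(X = k) = e^(−λ) λ^k / k!
def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

# Worked example: λ = 3, P(1 ≤ X ≤ 3) = 12 e⁻³.
p = sum(poisson_pmf(k, 3) for k in (1, 2, 3))
print(round(p, 5))   # 0.59744
```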
Geometric Distribution
• The geometric distribution represents the number of trials up to and including the first success in a series of Bernoulli trials. This discrete probability distribution is represented by the probability mass function f(x) = (1 − p)^(x−1) p.
• Instead of counting the number of successes, we can also count the number of trials until a success is obtained. That is, we shall let the random variable X represent the number of trials needed to obtain the first success.
• In this situation, the number of trials will not be fixed. But if the trials are independent, only two outcomes are available for each trial and the probability of a success is still constant, then the random variable will have a geometric distribution.
• In a geometric distribution, if p is the probability of a success and x is the number of trials to obtain the first success, then the following formulas apply :
P(x) = p(1 − p)^(x−1)
• If X is a geometric random variable with parameter p, then
Mean μ = E(X) = 1/p
Variance σ² = V(X) = (1 − p)/p²
Assumptions for the geometric distribution are as follows :
1. There are two possible outcomes for each trial (success or failure).
2. The trials are independent.
3. The probability of success is the same for each trial.
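The pmf, mean, and variance formulas above can be sketched in Python (the value p = 0.25 is illustrative, not from the text) :

```python
def geometric_pmf(x, p):
    """P(X = x): first success occurs on trial x (x = 1, 2, ...)."""
    return (1 - p) ** (x - 1) * p

p = 0.25
mean = 1 / p                 # E(X) = 1/p
variance = (1 - p) / p**2    # V(X) = (1-p)/p^2

# the pmf sums to (almost) 1 over a long horizon of trials
total = sum(geometric_pmf(x, p) for x in range(1, 200))
```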
Uniform Distribution
• A uniform distribution, also called a rectangular distribution, is a probability distribution that has constant probability.
• This distribution is defined by two parameters, a and b :
1. a is the minimum.
2. b is the maximum.
3. The distribution is written as U(a, b).
• Its probability density function and cumulative distribution function are
f(x) = 1/(b - a) for x ∈ [a, b], and 0 otherwise
F(x) = (x - a)/(b - a) for x ∈ [a, b]
• Mean and variance of uniform distribution :
The mean value of a continuous random variable is given by equation as,
m_x = ∫ x f(x) dx = ∫ₐᵇ x/(b - a) dx
so that
m_x = (a + b)/2 and variance σ² = (b - a)²/12

Exponential Distribution
• The exponential distribution is a continuous distribution. It shows up frequently in queueing systems where we're interested in the time between events.
• The memoryless property means that a given probability distribution is independent of its history. Any time may be marked down as time zero.
• Let X be exponentially distributed with parameter λ. Suppose we know X > t. What is the probability that X is also greater than some value s + t ? That is, we want to know
P(X > s + t | X > t)
• For example, suppose that jobs in our system have exponentially distributed service times. If we have a job that's been running for one hour, what's the probability that it will continue to run for more than two hours ?
• From the definition of conditional probability, we have
P(X > s + t | X > t) = P(X > s + t and X > t) / P(X > t)
If X > s + t then X > t is redundant, so we can simplify the numerator.
P(X > s + t | X > t) = P(X > s + t) / P(X > t)
Using the CCDF of the exponential distribution, P(X > x) = e^(-λx),
P(X > s + t | X > t) = e^(-λ(s+t)) / e^(-λt)
The e^(-λt) terms cancel, giving the surprising result
P(X > s + t | X > t) = e^(-λs) = P(X > s)
• The memoryless property is an important property that simplifies calculations associated with conditional probabilities. The geometric distribution is the only discrete probability distribution that has the memoryless property.
• The exponential distribution is memoryless because the past has no bearing on its future behaviour. Every instant is like the beginning of a new random period, which has the same distribution regardless of how much time has already elapsed. The exponential is the only memoryless continuous random variable.
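The cancellation above can be verified numerically. In this Python sketch, the rate λ and the times s and t are illustrative values :

```python
from math import exp, isclose

lam, s, t = 0.5, 2.0, 3.0

def ccdf(x, lam):
    """P(X > x) for an exponential random variable with rate lam."""
    return exp(-lam * x)

# P(X > s+t | X > t) computed from the definition of conditional probability
conditional = ccdf(s + t, lam) / ccdf(t, lam)
print(isclose(conditional, ccdf(s, lam)))  # True: memorylessness
```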
Parameters of Continuous Distributions
Continuous distributions are defined using the following three parameters :
1. Scale parameter : It defines the range of the continuous distribution. The larger the scale parameter value, the larger the spread of the distribution.
2. Shape parameter : The shape parameter defines the shape of the probability distribution. Changes to the value of the shape parameter will change the shape of the distribution.
3. Location parameter : The location parameter locates (or shifts) the distribution on the horizontal axis.
TECHNICAL PUBLICATIONS® - An up thrust for knowledge

• The exponential probability density function is
f(x) = λe^(-λx) for x ≥ 0
     = 0 for x < 0
• High P-value : High probability that test statistic > observed test statistic. Do not reject null hypothesis.
• Low P-value : Low probability that test statistic > observed test statistic. Reject null hypothesis.
Example : A firm manufacturing rivets wants to limit variations in their length as much as possible. Test whether the standard deviation of rivet length exceeds 0.145.
Solution : Given data : n = 10, σ₀ = 0.145
The ten measured lengths are 2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93.
x̄ = (2.15 + 1.99 + 2.05 + 2.12 + 2.17 + 2.01 + 1.98 + 2.03 + 2.25 + 1.93)/10 = 2.068

x | x - x̄ | (x - x̄)²
2.15 | 0.082 | 0.006724
1.99 | -0.078 | 0.006084
2.05 | -0.018 | 0.000324
2.12 | 0.052 | 0.002704
2.17 | 0.102 | 0.010404
2.01 | -0.058 | 0.003364
1.98 | -0.088 | 0.007744
2.03 | -0.038 | 0.001444
2.25 | 0.182 | 0.033124
1.93 | -0.138 | 0.019044
Σ(x - x̄)² = 0.09096

s² = Σ(x - x̄)²/n = 0.09096/10 = 0.009096
Steps :
1. Null hypothesis (H₀) : σ ≤ 0.145
2. Alternative hypothesis (H₁) : σ > 0.145
3. Level of significance α = 0.05
4. Test statistic :
χ² = ns²/σ₀² = (10 × 0.009096)/(0.145)² = 4.33
5. Since 4.33 < χ²₀.₀₅,₉ = 16.919, the null hypothesis is accepted : the standard deviation does not exceed 0.145.
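The arithmetic in the table above can be reproduced in Python (the ten lengths are taken from the example) :

```python
lengths = [2.15, 1.99, 2.05, 2.12, 2.17, 2.01, 1.98, 2.03, 2.25, 1.93]
n = len(lengths)
mean = sum(lengths) / n                      # 2.068
ss = sum((x - mean) ** 2 for x in lengths)   # sum of squared deviations, 0.09096
s2 = ss / n                                  # 0.009096
chi2 = n * s2 / 0.145 ** 2                   # test statistic
print(round(chi2, 2))  # 4.33
```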
Chi-square Goodness of Fit Test
• The chi-square goodness-of-fit test is applied to binned data.
• The chi-square goodness-of-fit test can be applied to discrete distributions such as the binomial and the Poisson.
• The goodness of fit test begins by hypothesizing that the distribution of a variable behaves in a particular manner. For example, a retailer may wish to know whether customers arrive in equal numbers on each day of the week. A hypothesis of equal numbers of customers on each day could then be the null hypothesis.
• The data form a frequency distribution with k categories into which the data have been grouped. The frequencies of occurrence of the variable, for each category, are called the observed values.
• The way the chi-square goodness of fit test works is to determine how many cases would be expected in each category if the sample data were distributed according to the claim. The total of the expected number of cases in each category is made equal to the total of the observed number of cases in each category.
• The chi-square goodness of fit test is a non-parametric test that is used to find out how the observed value of a given phenomenon is significantly different from the expected value. In this test, the term goodness of fit is used to compare the observed sample distribution with the expected probability distribution.
• The chi-square goodness of fit test determines how well a theoretical distribution (such as normal, binomial or Poisson) fits the empirical distribution.
• In this test, the sample data is divided into intervals. Then the numbers of points that fall into each interval are compared with the expected numbers of points in each interval, computed from the probability distribution function.
• Suppose that the values x₁, x₂, ..., x_k of a sample of size n have occurred with frequencies (Of)₁, (Of)₂, ..., (Of)_k respectively, where (Of) stands for observed frequency and Σᵢ (Of)ᵢ = N. The test proceeds as follows :
1. State null hypothesized proportions for each category (pᵢ). The alternative is that at least one of the proportions is different from that specified in the null.
2. Calculate the expected count for each cell as npᵢ.
3. Calculate the χ² statistic :
χ² = Σ (observed - expected)² / expected
4. Compute the p-value as the proportion above the χ² statistic for either a randomization distribution or a χ² distribution with df = (number of categories - 1), if expected counts are all > 5.
5. Interpret the p-value in context.
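Steps 1-3 above can be sketched in Python. The observed counts below are hypothetical, standing in for the days-of-the-week illustration :

```python
# hypothetical observed customer counts for 5 weekdays
observed = [30, 14, 34, 45, 57]
n = sum(observed)                        # total count N = 180
k = len(observed)
p = [1 / k] * k                          # H0: equal proportions on each day
expected = [n * pi for pi in p]          # expected count n * p_i = 36 each
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1                               # degrees of freedom
```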
Chi-square Test for Independence of Attributes
• The chi-square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables.
• The frequency of each category for one nominal variable is compared across the categories of the second nominal variable.
• The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable.
• For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can be used to examine this relationship.
• The null hypothesis for this test is that there is no relationship between gender and empathy.
• The alternative hypothesis is that there is a relationship between gender and empathy (e.g., there are more high-empathy females than high-empathy males).
• This test is also known as the chi-square test of association. It uses a contingency table to analyze the data.
• A contingency table is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in the columns.
• Each variable must have two or more categories. Each cell reflects the total count for a specific pair of categories from the sample of N observations.
• For a 2 × 2 contingency table, let a, b, c and d be the observed frequencies in the four cells, with R₁ and R₂ the row totals and C₁ and C₂ the column totals, such that R₁ + R₂ = C₁ + C₂ = N, the total frequency.
• χ² of a 2 × 2 contingency table can be directly found using the shortcut formula :
χ² = N(ad - bc)² / ((a + b)(c + d)(a + c)(b + d))
Strength and Limitation of Chi-Square Test
• It is easier to compute than some statistics.
• It makes no assumptions about the distribution of the population.
no asi3-47
Introduction to Probability
7 methods are under devetopm
discs are made by each metl
with liquid
rent for making discs of a superconducting
thod, and they are checked for superconductivity
Method
; Mi M2) M3) M4. Total
_Superconducters 31 2
He | 0
Failures Ts Pat.
ao 50 50 50 50 200
Significance difference between the proportions of
superconductors under different
0.05 level using #
hi-square test,
lull hypothesis (Hy) : the proportions of semiconductors are equal.
ternative hypothesis (H,) : the proportions of semiconductors are not equal.
putations of expected frequencies (E,) is
Observed frequency (Of) _Expected frequencies (EA)
(Omit = 31 (E11 = G0 X 120 ) / 200 = 30
(Ofi2 = 42 (E12 = (60 X 120 ) / 200 = 30
(Offs = 2 (BE)13 = (50 X 120 ) / 200 = 30
Onis = 25 (Bi)14 = (60 X 120 ) / 200 = 30
(O21 = 19 (E921 = (50 X 80) / 200 = 20
(Ofna = 8 (E22 = (50 X 80) / 200 = 20
(O23 = 28 (E823 = (50 X 80) / 200 = 20
(on2s = 25 _2024 = 60X90) / 200= 20 |
Determination of degree of freedom = (ma - 1) (n- 1) = (2-1) 4-1) =3
Chi-square statistic
_ [cong -@5)°
ae ee
TECHNICAL PUBLICATIONS® - An up thrust for knowledgeData Science 3-48 Introduction to Probab,
(31-30)? (42~30)? 30)? (25-30)?
30 30
2 2
(19-20)? (B= 20)? , (28-20)? | (25-20)
er ae a ee
= 1950
Step 6: Calculated chi-square test is greater than the tabulated value (7.815). }
null hypothesis. Therefore, the proportions of superconductors are not equal.
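The expected frequencies and the chi-square statistic of this example can be reproduced in Python :

```python
observed = [[31, 42, 22, 25],   # superconductors
            [19,  8, 28, 25]]   # failures
row_totals = [sum(r) for r in observed]          # [120, 80]
col_totals = [sum(c) for c in zip(*observed)]    # [50, 50, 50, 50]
n = sum(row_totals)                              # 200

chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(4))
df = (2 - 1) * (4 - 1)
print(round(chi2, 2), df)  # 19.5 3
```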
Student's t-Distribution
• When the sample values come from a normal distribution, the exact distribution of 't' was worked out by W. S. Gossett. He called it a t-distribution.
• Unfortunately, there is not one t-distribution. There are different t-distributions for each different value of n. If n = 7 there is a certain t-distribution, but for another value of n the t-distribution is a little different. We say that the variable t has a t-distribution with n - 1 degrees of freedom.
• Suppose a simple random sample of size n is drawn from a population. If the population from which the sample is taken follows a normal distribution, the distribution of the random variable
t = (x̄ - μ₀) / (s/√n)
follows Student's t-distribution with n - 1 degrees of freedom. The sample mean is x̄ and the sample standard deviation is s.
• The degrees of freedom are the number of free choices left after a sample statistic such as x̄ is calculated. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size :
d.f. = n - 1
• The t-test assumes that the population is normal, although this assumption can be relaxed, and that the sample was drawn from the population of interest. Based on the comparison of the calculated 't' value with the theoretical value, we conclude whether to accept or reject the null hypothesis.

Fig. 3.13.1 : Curve of Student's t-distribution
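The t statistic defined above can be computed directly; this sketch uses the figures from the worked example later in this section (n = 6, x̄ = 58392, s = 648, μ₀ = 58000) :

```python
from math import sqrt

n, xbar, s, mu0 = 6, 58392, 648, 58000
t = (xbar - mu0) / (s / sqrt(n))   # one-sample t statistic
df = n - 1                         # degrees of freedom
print(round(t, 3), df)  # 1.482 5
```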
Properties of Student's t-Distribution
1. The t-distribution is different for different degrees of freedom.
2. The t-distribution is centered at 0 and symmetric about 0.
3. The total area under the curve is 1. The area to the left of 0 is 1/2 and the area to the right of 0 is 1/2.
4. As the magnitude of t increases, the graph approaches, but never equals, 0.
5. The area in the tails of the t-distribution is larger than the area in the tails of the standard normal distribution, because we use s as an estimate of σ, thereby introducing further variability into the t statistic.
6. The shape of the t-distribution is dependent on the sample size n.
7. As the sample size n increases, the distribution becomes approximately normal : the density curve of t gets closer to the standard normal density curve. This result occurs because, as the sample size n increases, s becomes a better estimate of σ.
8. The standard deviation of the t-distribution is greater than 1.
9. The mean, median and mode of the t-distribution are equal to zero.
Critical values for various degrees of freedom are tabulated; note how the t-distribution approaches the normal distribution as n grows :

n | Degrees of freedom
6 | 5
16 | 15
31 | 30
101 | 100
1001 | 1000
Normal | "Infinite"

Example : State whether the standard normal distribution, the t-distribution or neither should be used in each case :
1. n = 50, the distribution is skewed, s = 2.5
2. n = 25, the distribution is skewed, s = 52.9
3. n = 25, the distribution is normal, σ = 4.12
Solution :
1. The normal distribution would be used because the sample size is 50.
2. Neither distribution would be used because n < 30 and the distribution is skewed.
3. The normal distribution would be used because, although n < 30, the population standard deviation σ is known and the distribution is normal.

Example : Given data : Sample size n = 6, sample mean x̄ = 58392, standard deviation s = 648. Here n < 30, so the t-distribution is used.
Degrees of freedom = n - 1 = 6 - 1 = 5
Null hypothesis (H₀) : μ = 58000
Alternative hypothesis (H₁) : μ ≠ 58000
Level of significance α = 0.05; a two-tailed test is used.
t = (x̄ - μ)/(s/√n) = (58392 - 58000)/(648/√6) = 392/264.545 = 1.482
Since |t| = 1.482 < t_(α/2) = 3.365, the null hypothesis is accepted.

Example : A sample of 100 iron bars is said to be drawn from a large number of bars whose lengths are normally distributed with mean 4 feet and S.D. 0.6 ft. If the sample mean is 4.2 ft, can the sample be regarded as a truly random sample ?
Solution : Given data : Sample size n = 100, sample mean x̄ = 4.2, μ = 4, S.D. σ = 0.6
Null hypothesis (H₀) : the sample can be regarded as a truly random sample.
Alternative hypothesis (H₁) : the sample cannot be so regarded.
z = (x̄ - μ)/(σ/√n) = (4.2 - 4)/(0.6/√100) = 0.2/0.06 = 3.333
Calculated |z| = 3.333 is greater than the tabulated value 1.96 (level of significance is 5 %).
The null hypothesis is rejected : the sample cannot be regarded as a truly random sample.

F-Distribution
• In probability theory and statistics, the F-distribution is a continuous probability distribution. It is also known as Snedecor's F-distribution or the Fisher-Snedecor distribution.
• A random variate of the F-distribution arises as the ratio of two chi-squared variates : if U₁ and U₂ have chi-square distributions with d₁ and d₂ degrees of freedom respectively, and U₁ and U₂ are independent, then F = (U₁/d₁)/(U₂/d₂).
• It frequently arises as the null distribution of a test statistic.
• The probability density function of an F(d₁, d₂) distributed random variable is given by
f(x) = √( (d₁x)^d₁ d₂^d₂ / (d₁x + d₂)^(d₁+d₂) ) / ( x B(d₁/2, d₂/2) )
for real x ≥ 0, where d₁ and d₂ are positive integers, and B is the beta function.
• An F random variable is defined as a ratio of two independent chi-square random variables.
Basic Properties of F-distributions
1. The total area under an F-curve equals 1.
2. An F-curve is only defined for x > 0.
3. An F-curve has value 0 at x = 0, is positive for x > 0, extends indefinitely to the right, and approaches 0 as x → ∞.
4. An F-curve is right-skewed.
Testing the equality of two variances
1. To test the assumption of equal variances that was made in using the t-test.
2. When there is interest in actually comparing the variance of two populations.
• Assume we repeatedly select random samples of size n from two normal populations. Consider the distribution of the ratio of the two sample variances :
F = s₁²/s₂²
The distribution formed in this manner approximates an F distribution with the following degrees of freedom : ν₁ = n₁ - 1 and ν₂ = n₂ - 1.
• The F table gives the critical values of the F-distribution, which depend on the degrees of freedom.
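The variance ratio above can be sketched in Python; the two samples are illustrative values, not data from the text :

```python
def sample_variance(xs):
    """Unbiased sample variance with n - 1 in the denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

x1 = [4.1, 5.2, 6.3, 5.5, 4.9]
x2 = [3.9, 4.0, 4.3, 4.1]
F = sample_variance(x1) / sample_variance(x2)   # variance-ratio statistic
v1, v2 = len(x1) - 1, len(x2) - 1               # degrees of freedom
```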
Example : The following are measurements of the heat-producing capacity (in calories per ton) of the coal produced by two mines :
• The P-value is calculated using the t distribution.
Two-sample confidence intervals
• In addition to two-sample t-tests, we can also use the t distribution to construct confidence intervals for the mean difference. When σ₁ and σ₂ are unknown, we can form the following 100·C % confidence interval for the mean difference μ₁ - μ₂ :
(x̄₁ - x̄₂) ± t* √(s₁²/n₁ + s₂²/n₂)
• The critical value t* is calculated from a t distribution with k degrees of freedom, where k is equal to the smaller of (n₁ - 1) and (n₂ - 1).
Summary of two-sample tests
• Two independent samples with known σ₁ and σ₂ : Use the two-sample Z-test with P-values calculated using the standard normal distribution.
• Two independent samples with unknown σ₁ and σ₂ : Use the two-sample t-test with P-values calculated using the t distribution with degrees of freedom equal to the smaller of n₁ - 1 and n₂ - 1.
• Two independent samples with unknown σ₁ and σ₂, assumed equal : Use the two-sample t-test with the pooled variance estimator. The P-values are calculated using the t distribution with n₁ + n₂ - 2 degrees of freedom.
• Two samples that are matched pairs : First calculate the differences for each pair, then use the usual one-sample t-test on these differences.
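The confidence interval formula above can be sketched in Python. The summary statistics are illustrative, and the critical value t* = 2.262 (95 % confidence, df = 9) is taken from a t table :

```python
from math import sqrt

# illustrative summary statistics (not from the text)
xbar1, s1, n1 = 10.4, 2.1, 12
xbar2, s2, n2 = 8.9, 1.8, 10

k = min(n1 - 1, n2 - 1)      # conservative degrees of freedom -> 9
t_star = 2.262               # t table value for 95 % confidence, df = 9
se = sqrt(s1**2 / n1 + s2**2 / n2)
lo, hi = (xbar1 - xbar2) - t_star * se, (xbar1 - xbar2) + t_star * se
```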
Two Sample t Test with Unknown Variances
• Suppose we are comparing two populations that have different means but equal standard deviations. We want to infer about the difference between the means when the standard deviation is unknown. If we assume that both populations have the same standard deviation, we obtain two estimates s₁² and s₂² of the common variance.
• A convenient way to combine these two estimates gives a more informative estimator. The pooled estimator of the variance is
s_p² = ((n₁ - 1)s₁² + (n₂ - 1)s₂²) / (n₁ + n₂ - 2)
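The pooled estimator can be sketched as a one-line function; the variances and sample sizes below are illustrative :

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimator: ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

sp2 = pooled_variance(4.0, 10, 9.0, 16)   # (9*4 + 15*9) / 24
```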
Sampling and Estimation
Syllabus
Introduction to sampling, Population parameters and sample statistics, Probability and non-probability sampling, Sampling distribution, Central Limit Theorem (CLT), Sample size estimation for mean of the population, Estimation of population parameters, Methods of estimation, Estimation of parameters using method of moments, Estimation of parameters using maximum likelihood estimation.

Contents
Introduction to Sampling
Probability and Non-Probability Sampling
Central Limit Theorem (CLT)
Sample Size Estimation for Mean of the Population
Population Parameters
Maximum Likelihood Estimation
Introduction to Sampling
• A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.
• A sample is a subset of a population : the smaller group, the part of the population of interest, that we actually examine in order to gather the information.
• A sample is a smaller collection of units from a population used to determine truths about that population.
• A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, the researcher should carefully and completely define the population, including a description of the members to be included.
• Example : The population for a study of infant health might be all children born in the UK in the 1980s. The sample might be all babies born on 7th May in any of those years.
• A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about. A population is a collection of objects; it may be finite or infinite according to the number of objects in the population.
• In order to make generalizations about a population, a sample that is meant to be representative of the population is studied.
• A sample statistic gives information about the corresponding population parameter. For example, a sample mean gives information about the population mean μ.
• When such measures as the mean, median, mode, variance and standard deviation of a population distribution are computed, they are referred to as parameters. A parameter can be simply defined as a summary characteristic of a population distribution.
Probabilistic and Non-Probability Sampling
• Two general approaches to sampling are used :
1. Systematic random sample
2. Stratified random sample
• When the population is heterogeneous and consists of a number of groups, or strata, some of which may differ from the larger group, stratified sampling can give better results than other methods.
• Some definitions :
Sampled population : The population from which the sample is drawn.
Frame : A list of the sampling units (e.g., a telephone book or city directory) from which the sample is drawn.
Parameter : A numerical characteristic of the population (e.g., total annual GDP or exports, % of votes for the Liberal party in a federal election).
Statistic : A numerical characteristic of a sample (e.g., the monthly unemployment rate).
Sampling distribution : The probability distribution of the statistic.
N : The size of the population, or the number of elements in the population.
n : The size of the sample, or the number of elements in the sample.
Simple random sample : A sample of size n selected in a manner that each possible sample of size n has the same probability of being selected.
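A simple random sample, as defined above, can be drawn with the standard library; the population of 100 labeled units is illustrative :

```python
import random

population = list(range(1, 101))        # N = 100 labeled units
random.seed(42)                         # reproducible illustration
sample = random.sample(population, 10)  # simple random sample, n = 10
print(sorted(sample))
```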
attribute_list ← attribute_list - splitting_attribute;
for (each outcome j of splitting criterion)
    let Dⱼ be the set of data tuples in D satisfying outcome j;
    if Dⱼ is empty then attach a leaf labeled with the majority class in D to node N;
    else attach the node returned by Generate_decision_tree(Dⱼ, attribute_list) to node N;
end of for loop
return N;
• Decision tree generation consists of two phases : Tree construction and pruning.
• In the tree construction phase, all the training examples are at the root. Partition the examples recursively based on selected attributes.
• In the tree pruning phase, identify and remove branches that reflect noise or outliers.
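The recursive partitioning idea in the pseudocode above can be sketched in Python. The attribute-selection step is deliberately naive (it just takes the next attribute), and the toy dataset is hypothetical :

```python
from collections import Counter

def majority_class(rows):
    """Most common class label among tuples (features..., label)."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def generate_decision_tree(rows, attributes):
    """Tiny sketch of recursive partitioning on categorical attributes."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1 or not attributes:
        return majority_class(rows)               # leaf node
    attr = attributes[0]                          # naive splitting-attribute choice
    rest = attributes[1:]                         # attribute_list minus splitting attribute
    node = {}
    for value in {row[attr] for row in rows}:     # each outcome j of the split
        subset = [row for row in rows if row[attr] == value]
        node[(attr, value)] = generate_decision_tree(subset, rest)
    return node

# hypothetical toy data: (outlook, windy, play)
data = [("sunny", True, "no"), ("sunny", False, "no"),
        ("rain", True, "no"), ("rain", False, "yes"),
        ("overcast", True, "yes"), ("overcast", False, "yes")]
tree = generate_decision_tree(data, [0, 1])
```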