Basic Statistics
EduPristine
EduPristine www.edupristine.com
Agenda
Introduction
Data
Basic Statistics
3. Basic Statistics
I. Probability
II. Random variables
3.a. Probability
Probability is a numerical way of describing how likely something is to happen.
One of the fundamental methods of calculating probability is by using set theory.
A set is defined as a collection of objects and each individual object is called an element of that set.
Example: in the credit card data, the distinct numbers of credit cards owned form a set:
# Cards = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
The numbers on a die form a set:
Die = {1, 2, 3, 4, 5, 6}
The sample space (S) is the set of all possible outcomes that might be observed for an event/experiment.
If every element of the sample space is equally likely, then we can define the probability of event A as:
P(A) = (# elements in A)/(# elements in sample space)
e.g. P(# Cards = 1) = (# of customers having 1 card)/(Total number of customers) = 100/1000 = 0.10 = 10%
e.g. Probability of rolling an even number on a die
Sample space (S) = {1, 2, 3, 4, 5, 6}
Event (A) = {2, 4, 6}
P(A) = 3/6 = 0.5 = 50%
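The counting definition of probability can be sketched in a few lines of Python. This is a minimal illustration that assumes equally likely outcomes; the function name `classical_probability` is made up for this example.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = (# elements in A) / (# elements in S), assuming equally likely outcomes."""
    sample_space = set(sample_space)
    event = set(event) & sample_space  # ignore anything outside the sample space
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}
even = {x for x in die if x % 2 == 0}  # the event "roll an even number"

p_even = classical_probability(even, die)
print(p_even, float(p_even))  # 1/2 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to compare with the hand calculation 3/6.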
Venn diagrams
Basic operations on Venn diagrams, including intersection (A ∩ B)
Conditional probability
Bayes' theorem
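3.b. Random variables
I. Definition
II. Discrete and continuous random variables
[Figure: a random variable maps outcomes of the sample space to values x1, x2, x3, ... in the set of real numbers]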
So, if w is an element of the sample space S (i.e. w is one of the possible outcomes of the
experiment concerned) and the number x is associated with this outcome, then X(w) = x .
Convention:
Denote random variable by capital letter X
Denote the outcome or possible values by small letter x i.e. X(w) = x
[Figure: sample space S of individual balls b1, b2, ..., b8, with the values 0.10, 0.15 and 0.20]
Probabilities:
Probabilities are defined on events (subsets of the sample space S).
So what is meant by P(X = x)?
Suppose the sample space consists of eight outcomes {s1, s2, s3, s4, s5, s6, s7, s8}.
Let the outcomes in
E1 = {s1, s2, s3} be associated with the number x1
E2 = {s4, s5} be associated with the number x2
E3 = {s6, s7, s8} be associated with the number x3
Then P(X = x1) means P(E1)
P(X = x2) means P(E2)
P(X = x3) means P(E3)
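The link between events and P(X = x) can be made concrete with a short Python sketch. The slides do not give the probabilities of the eight outcomes, so the example below assumes, purely for illustration, that they are equally likely.

```python
from fractions import Fraction

# Assumption for illustration: the eight outcomes are equally likely.
outcomes = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
p_outcome = {s: Fraction(1, len(outcomes)) for s in outcomes}

# The random variable X maps each outcome to (a label for) its value.
X = {"s1": "x1", "s2": "x1", "s3": "x1",
     "s4": "x2", "s5": "x2",
     "s6": "x3", "s7": "x3", "s8": "x3"}

def prob_X_equals(x):
    """P(X = x) is the probability of the event {s : X(s) = x}."""
    return sum((p for s, p in p_outcome.items() if X[s] == x), Fraction(0))

pf = {x: prob_X_equals(x) for x in ("x1", "x2", "x3")}
print(pf)
```

Under the equal-likelihood assumption this gives P(X = x1) = 3/8, P(X = x2) = 1/4 and P(X = x3) = 3/8, which are just P(E1), P(E2) and P(E3).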
The function fX(x) = P(X = x), for each x in the range of X, is the probability function (PF) of X.
It specifies how the total probability of 1 is divided up amongst the possible values of X.
Thus, it gives the probability distribution of X.
It is also known as the probability mass function (PMF) or, more loosely, the probability distribution function.
A function qualifies as the probability function of a discrete random variable if it meets the following requirements:
$f_X(x) \ge 0$ for all x within the range of X
$\sum_x f_X(x) = 1$
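The two requirements on a probability function are easy to check mechanically. The sketch below uses hypothetical probability functions (the numbers are invented for illustration) and a made-up helper name, `is_probability_function`.

```python
import math

def is_probability_function(pf, tol=1e-9):
    """Check both requirements: every value is >= 0 and the values sum to 1."""
    values = list(pf.values())
    non_negative = all(v >= 0 for v in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# Hypothetical examples, for illustration only.
valid = {0: 0.2, 1: 0.5, 2: 0.3}
negative_value = {0: 0.6, 1: 0.6, 2: -0.2}   # sums to 1 but has a negative value
wrong_total = {0: 0.5, 1: 0.4}               # non-negative but sums to 0.9

print(is_probability_function(valid))
print(is_probability_function(negative_value))
print(is_probability_function(wrong_total))
```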
[Figure: the random variable X(bi) = x maps each ball b1, b2, ..., b8 to its weight in kg in the set of real numbers: x1 = 0.10, x2 = 0.15, x3 = 0.20]
First define the range or interval over which the probability is to be determined.
Say it is (a, b).
The associated probability is written P(a < X < b) or P(a ≤ X ≤ b).
It is the area under the curve of the probability density function (PDF) from a to b.
So probabilities can be evaluated by integrating the PDF fX(x).
This relationship defines the PDF.
Mathematically,
$P(a < X < b) = \int_a^b f_X(x)\,dx$
$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
For a continuous random variable, FX(x) is a continuous, non-decreasing function, defined for all real values of x.
$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$
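To illustrate how probabilities for a continuous random variable come from integrating the PDF, here is a small numerical sketch. The density used (an exponential with rate 0.5) and the helper names `integrate`, `pdf` and `cdf` are assumptions chosen for the example, not taken from the slides.

```python
import math

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

RATE = 0.5  # assumed rate of the example exponential density

def pdf(x):
    """Assumed example density: f(x) = RATE * exp(-RATE * x) for x > 0, else 0."""
    return RATE * math.exp(-RATE * x) if x > 0 else 0.0

def cdf(x):
    """F_X(x): area under the PDF from the lower end of the support (0) up to x."""
    return integrate(pdf, 0.0, x) if x > 0 else 0.0

p_1_to_3 = integrate(pdf, 1.0, 3.0)      # P(1 < X < 3) as an area under the PDF
via_cdf = cdf(3.0) - cdf(1.0)            # the same probability as F(3) - F(1)
total = integrate(pdf, 0.0, 60.0)        # close to 1; the tail beyond 60 is negligible

print(round(p_1_to_3, 5), round(via_cdf, 5), round(total, 5))
```

For this density the exact answer is exp(-0.5) - exp(-1.5), so the numerical result can be checked against a closed form.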
Mean:
E[X] is a measure of central location
For the discrete case it is calculated as $E[X] = \sum_i x_i P_i$ or $E[X] = \sum_x x\, f_X(x)$
Variance:
$Var[X] = E[(X - E[X])^2]$
$Var[X] = E[X^2] - (E[X])^2$
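The mean and both forms of the variance can be computed directly from a probability function. The sketch below reuses the ball weights 0.10, 0.15 and 0.20 kg; the probabilities attached to them (3/8, 2/8, 3/8) are hypothetical and chosen only for illustration.

```python
# Hypothetical probability function: weight in kg -> assumed probability.
pf = {0.10: 3 / 8, 0.15: 2 / 8, 0.20: 3 / 8}

# E[X] = sum of x * f(x)
mean = sum(x * p for x, p in pf.items())

# Var[X] from the definition, E[(X - E[X])^2]
var_definition = sum((x - mean) ** 2 * p for x, p in pf.items())

# Var[X] from the shortcut, E[X^2] - (E[X])^2
var_shortcut = sum(x * x * p for x, p in pf.items()) - mean ** 2

print(round(mean, 6), round(var_definition, 6), round(var_shortcut, 6))
```

Both variance formulas give the same number (up to floating-point rounding), which is a quick sanity check on either calculation.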
[Figure: weight as a random variable, taking the values x1 = 0.10, x2 = 0.15, x3 = 0.20 in the set of real numbers (weights in kg)]
Discrete distributions:
Uniform
Bernoulli
Binomial
Poisson
Negative Binomial
Discrete uniform distribution on {1, 2, ..., k}
Expected values:
Mean, $\mu = (k + 1)/2$
Variance, $\sigma^2 = (k^2 - 1)/12$
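The mean (k + 1)/2 and variance (k^2 - 1)/12 can be checked by brute-force enumeration, assuming they refer to a discrete uniform variable taking the values 1, 2, ..., k with equal probability. The helper name `uniform_moments` is made up for this sketch.

```python
from fractions import Fraction

def uniform_moments(k):
    """Mean and variance of a variable uniform on 1..k, computed by enumeration."""
    p = Fraction(1, k)                      # each value has probability 1/k
    values = range(1, k + 1)
    mean = sum(x * p for x in values)
    variance = sum(x * x * p for x in values) - mean ** 2
    return mean, variance

k = 6  # e.g. a fair die
mean, variance = uniform_moments(k)
print(mean, variance)  # 7/2 35/12
```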
Bernoulli distribution
P({s}) = p, P({f}) = 1 - p, 0 < p < 1
X(s) = 1, X(f) = 0
$f_X(x) = p^x (1 - p)^{1 - x}$, x = 0, 1; 0 < p < 1
Expected values:
Mean, $\mu = p$
Variance, $\sigma^2 = p(1 - p)$
Examples:
Tossing a coin. Head corresponds to success and Tail corresponds to failure.
Defaulting on a home loan. Default corresponds to success and Non-default corresponds to failure.
Auto insurance policy. No claim corresponds to success and Claim corresponds to failure.
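A short sketch of the success/failure setup, assuming the standard Bernoulli probability function p^x (1 - p)^(1 - x) for x in {0, 1}. The value p = 0.3 is hypothetical.

```python
def bernoulli_pf(x, p):
    """Probability function p^x * (1 - p)^(1 - x) for x in {0, 1}; zero elsewhere."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)

p = 0.3  # hypothetical probability of success

mean = sum(x * bernoulli_pf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pf(x, p) for x in (0, 1))

print(round(mean, 6), round(variance, 6))  # 0.3 0.21
```

The enumeration reproduces the mean p and the variance p(1 - p).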
[Figure: Bernoulli probability function, with bars for s (X = 1) and f (X = 0) on a probability axis from 0.00 to 0.50]
Binomial distribution
Two-parameter distribution with parameters n and p
$P(X = k) = {}^{n}C_{k}\, p^k (1 - p)^{n - k}$ for k = 0, 1, 2, ..., n,
where
${}^{n}C_{k} = \frac{n!}{(n - k)!\,k!}$
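The combination count nCk = n!/((n-k)! k!) is available in Python as `math.comb`. The sketch below plugs it into the standard binomial probability function, which is assumed here to be the two-parameter distribution meant; n = 10 and p = 0.3 are hypothetical values.

```python
import math

def binomial_pf(k, n, p):
    """P(X = k) = nCk * p^k * (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical parameters
probs = [binomial_pf(k, n, p) for k in range(n + 1)]

# nCk agrees with the factorial formula n! / ((n - k)! * k!)
n_choose_3 = math.factorial(n) // (math.factorial(n - 3) * math.factorial(3))

print(math.comb(n, 3), n_choose_3)
print(round(sum(probs), 10))  # the probabilities sum to 1
```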
Continuous distributions:
Uniform
Normal
Lognormal
Exponential
Gamma
Chi-square
t-distribution
F-distribution
$f_X(x) = 1/(b - a)$, a < x < b
Expected values:
Mean, $\mu = (a + b)/2$
Variance, $\sigma^2 = (b - a)^2/12$
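Gamma distribution
$f_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}$, x > 0
Denoted as X ~ Gamma($\alpha$, $\lambda$)
Expected values:
Mean, $\mu = \alpha/\lambda$
Variance, $\sigma^2 = \alpha/\lambda^2$
Special cases:
Exponential distribution when $\alpha = 1$: $f_X(x) = \lambda e^{-\lambda x}$, x > 0
Example:
Used to predict claim amounts in auto insurance.
Used to predict loss amounts in bank loan defaults.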
Gamma distribution: probability at X = 1, ..., 20
X     Ga(2, 3)   Ga(1, 4)   Ga(20, 0.5)
1     7.96%      19.47%     0.00%
2     11.41%     15.16%     0.00%
3     12.26%     11.81%     0.00%
4     11.72%     9.20%      0.08%
5     10.49%     7.16%      0.75%
6     9.02%      5.58%      3.23%
7     7.54%      4.34%      8.17%
8     6.18%      3.38%      13.98%
9     4.98%      2.63%      17.73%
10    3.96%      2.05%      17.77%
11    3.12%      1.60%      14.71%
12    2.44%      1.24%      10.40%
13    1.90%      0.97%      6.44%
14    1.46%      0.75%      3.56%
15    1.12%      0.59%      1.79%
16    0.86%      0.46%      0.82%
17    0.65%      0.36%      0.35%
18    0.50%      0.28%      0.14%
19    0.37%      0.22%      0.05%
20    0.28%      0.17%      0.02%
[Figure: Gamma distribution, probability (0% to 20%) against X (1 to 20) for Ga(2, 3), Ga(1, 4) and Ga(20, 0.5)]
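The tabulated gamma values can be reproduced with a few lines of Python. The sketch below writes the density with a shape and a rate parameter; to match the tabulated numbers, the second parameter in the labels Ga(2, 3), Ga(1, 4) and Ga(20, 0.5) has to be read as a scale (rate = 1/scale). That reading is an inference from the numbers, not something the slides state.

```python
import math

def gamma_pdf(x, shape, rate):
    """Gamma density with the given shape and rate, for x > 0 (zero otherwise)."""
    if x <= 0:
        return 0.0
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

# Assumption: the second label parameter is a scale, so rate = 1 / scale.
labels = [("Ga(2, 3)", 2, 3), ("Ga(1, 4)", 1, 4), ("Ga(20, 0.5)", 20, 0.5)]

for label, shape, scale in labels:
    row = [gamma_pdf(x, shape, 1 / scale) for x in (1, 2, 3, 10)]
    print(label, [f"{v:.2%}" for v in row])
```

With shape 1 the density reduces to the exponential case, which is why the Ga(1, 4) column decays steadily from its first value.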
Normal distribution
Denoted as X ~ $N(\mu, \sigma^2)$
It provides good approximations to various other distributions (Central Limit Theorem)
The transformation $z = (x - \mu)/\sigma$ has the N(0, 1) distribution.
The probability is then found by looking up z in the standard probability table for the N(0, 1) distribution.
Normal distribution: density at x = -5, ..., 5
x     N(0, 1)   N(0, 1.6)
-5    0.0%      0.2%
-4    0.0%      1.1%
-3    0.4%      4.3%
-2    5.4%      11.4%
-1    24.2%     20.5%
0     39.9%     24.9%
1     24.2%     20.5%
2     5.4%      11.4%
3     0.4%      4.3%
4     0.0%      1.1%
5     0.0%      0.2%
[Figure: normal density curves for N(0, 1) and N(0, 1.6), x from -5 to 5, vertical axis 0% to 45%]
Solution (X ~ N(25, 36)):
(i) P(X < 28) = P(Z < (28-25)/sqrt(36)) = P(Z < 3/6) = 0.69146
(ii) P(X > 30) = P(Z > 0.833) = 1 - P(Z < 0.833) = 1 - 0.79758 = 0.20242
(iii) P(X < 20) = P(Z < -0.833) = 1 - P(Z < 0.833) = 1 - 0.79758 = 0.20242
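The table lookups in this solution can be cross-checked with `statistics.NormalDist` from the Python standard library. The sketch assumes, as the standardization (28-25)/sqrt(36) implies, a mean of 25 and a standard deviation of 6. Values computed without rounding z to 0.833 can differ from the table-based figures in the fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal, N(0, 1)
X = NormalDist(mu=25, sigma=6)   # assumed from the solution: variance 36, so sigma = 6

p_i = Z.cdf((28 - 25) / 6)         # P(X < 28) via standardization
p_ii = 1 - Z.cdf((30 - 25) / 6)    # P(X > 30) = 1 - P(Z < 0.833...)
p_iii = Z.cdf((20 - 25) / 6)       # P(X < 20) = P(Z < -0.833...)

print(round(p_i, 5), round(p_ii, 5), round(p_iii, 5))
print(round(X.cdf(28), 5))         # same as p_i, without standardizing by hand
```

Parts (ii) and (iii) agree because the normal density is symmetric about its mean.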
Lognormal distribution
The random variable X is bounded below at zero and is used to model variables that take only positive values.
Defined by two parameters $\mu$ and $\sigma^2$ and denoted as X ~ $\log N(\mu, \sigma^2)$
Probability density function:
$f_X(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$, 0 < x
Expected values:
Mean, $E[X] = \exp(\mu + \sigma^2/2)$
Variance, $var(X) = \exp(2\mu + \sigma^2)\,(\exp(\sigma^2) - 1)$
Example:
Used to predict claim amounts in auto insurance.
Used to predict loss amounts in bank loan defaults.
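A minimal sketch of the lognormal density and moment formulas, assuming the standard parameterization in which ln X is normal with mean mu and variance sigma squared. The function names are made up for this example, and logN(0, 1) is used as the test case.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of X when ln X ~ N(mu, sigma^2); zero for x <= 0."""
    if x <= 0:
        return 0.0
    exponent = -((math.log(x) - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))

def lognormal_mean(mu, sigma):
    """E[X] = exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)

def lognormal_variance(mu, sigma):
    """var(X) = exp(2 mu + sigma^2) * (exp(sigma^2) - 1)."""
    return math.exp(2 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1)

mu, sigma = 0.0, 1.0
print(round(lognormal_pdf(1.0, mu, sigma), 5))
print(round(lognormal_mean(mu, sigma), 5))
print(round(lognormal_variance(mu, sigma), 5))
```

For logN(0, 1) the mean is exp(1/2) and the variance is e(e - 1), both noticeably larger than the median of 1 because of the long right tail.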
[Figure: probability density function of logN(0, 1), x from 0 to 20, vertical axis 0.00 to 0.45]
Central Limit Theorem
Definition:
If X1, X2, ..., Xn is a sequence of independent, identically distributed (iid) random variables with finite mean $\mu$ and finite (non-zero) variance $\sigma^2$, then the distribution of $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ approaches the standard normal distribution, N(0, 1), as $n \to \infty$.
$\mu$ is the mean of the population from which X1, X2, ..., Xn have been drawn.
$\bar{X}$ is the sample mean, calculated as $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
For large n, $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ and $(\sum X_i - n\mu)/\sqrt{n\sigma^2}$ have approximately the N(0, 1) distribution
OR
$\bar{X} \sim N(\mu, \sigma^2/n)$
$\sum X_i \sim N(n\mu, n\sigma^2)$
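The Central Limit Theorem can be illustrated with a small simulation. The sketch below assumes a Uniform(0, 1) population (mean 1/2, variance 1/12), which is a choice made for this example only, and standardizes the sample mean of n = 50 draws many times.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
n, reps = 50, 2000       # sample size and number of repeated samples

# Assumed population for illustration: Uniform(0, 1).
mu, sigma = 0.5, math.sqrt(1 / 12)

z_values = []
for _ in range(reps):
    sample_mean = statistics.fmean(rng.random() for _ in range(n))
    # Standardize the sample mean: (sample mean - mu) / (sigma / sqrt(n))
    z_values.append((sample_mean - mu) / (sigma / math.sqrt(n)))

z_mean = statistics.fmean(z_values)
z_sd = statistics.stdev(z_values)
share_within_196 = sum(abs(z) < 1.96 for z in z_values) / reps

print(round(z_mean, 3), round(z_sd, 3), round(share_within_196, 3))
```

If the approximation holds, the standardized means should have a mean near 0, a standard deviation near 1, and roughly 95% of them should fall within 1.96 of zero, even though the underlying population is not normal.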
Solution:
We have $\mu = 40$, $\sigma = 12$, n = 50.
Introduction
Example:
State the distribution of $(\bar{X} - 100)/(S/\sqrt{5})$ for a random sample of 5 values taken from a $N(100, \sigma^2)$ population. What is the probability that this quantity will exceed 1.533?
Solution:
Distribution: $(\bar{X} - 100)/(S/\sqrt{5}) \sim t_4$
From the t-distribution table, the probability that this quantity will exceed 1.533 is 10%.
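The table value can be checked numerically. Python's standard library has no t-distribution, so the sketch below integrates the standard Student t density with a simple midpoint rule; the function names `t_pdf` and `t_upper_tail` are made up for this example.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20_000):
    """P(T > t) for t >= 0: 0.5 minus the midpoint-rule area of the density over [0, t]."""
    h = t / steps
    area = h * sum(t_pdf((i + 0.5) * h, df) for i in range(steps))
    return 0.5 - area

p = t_upper_tail(1.533, df=4)
print(round(p, 4))
```

The result is very close to 0.10, in line with the 10% read from the table.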
Solution
1. State the hypothesis about $\sigma_1^2$ and $\sigma_2^2$, where
$\sigma_1^2$ = variance of earnings for the chemical industry
$\sigma_2^2$ = variance of earnings for the petroleum industry
2. Select the appropriate test statistic: $F = s_1^2/s_2^2$
3. Specify the level of significance: 5% here
4. State the decision rule regarding the hypothesis: reject H0 if F > 1.74
5. Collect the sample and calculate the sample statistics:
Using the information provided, the F-statistic can be computed as:
$F = s_1^2/s_2^2 = 3.50^2/3.00^2 = 12.25/9.00 = 1.3611 < 1.74$ (hence there is insufficient evidence to reject H0)
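The arithmetic of the F-test takes only a few lines. The sketch below uses the sample standard deviations $3.50 and $3.00 and the critical value 1.74 from the decision rule; the critical value is taken as given rather than recomputed, since the degrees of freedom are not shown here.

```python
s1 = 3.50          # sample standard deviation of earnings, first industry (dollars)
s2 = 3.00          # sample standard deviation of earnings, second industry (dollars)
f_critical = 1.74  # critical value quoted in the decision rule (taken as given)

f_stat = s1 ** 2 / s2 ** 2       # ratio of the sample variances
reject_h0 = f_stat > f_critical  # decision rule: reject H0 if F > 1.74

print(round(f_stat, 4), reject_h0)  # 1.3611 False
```

Because the statistic is below the critical value, the decision is not to reject H0.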
Conclusion
The term one-tailed signifies that all z-values that would cause Sam to reject H0 are in just one tail of the sampling distribution.
μ -> Population mean
H0: μ ≤ $19,000
[Figure: sampling distribution of the sample mean centered on $19,000, with the critical value (Xc) marked in the right tail]
Sample mean values greater than $19,000 (that is, x̄-values on the right-hand side of the sampling distribution centered on μ = $19,000) suggest that H0 may be false.
More importantly, the farther to the right x̄ is, the stronger the evidence against H0.
Reject H0 if the sample mean exceeds Xc.
The standard deviation for the sample of 100 households is $4,000. The standard error of the mean (s_x̄) is given by:
s_x̄ = s/√n = $4,000/√100 = $400
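The value of the test statistic is simply the z-value corresponding to x̄ = $20,000:
Z = (x̄ - μ)/s_x̄ = ($20,000 - $19,000)/$400 = 2.5
[Figure: sampling distribution centered on μ = $19,000 (Z = 0), with α = 0.05 in the right tail separating the "Do not reject H0" and "Reject H0" regions on the X and Z axes]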
Critical value: Xc = μ + Zα × s_x̄ = $19,000 + 1.645 × $400 = $19,658
The sample mean x̄ = $20,000 corresponds to Z = 2.5
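The one-tailed test can be reproduced end to end with `statistics.NormalDist`. This is a sketch built from the figures in the example ($19,000 hypothesized mean, $20,000 sample mean, $4,000 sample standard deviation, 100 households, 5% significance); the variable names are made up for illustration.

```python
import math
from statistics import NormalDist

mu0 = 19_000        # hypothesized population mean under H0
x_bar = 20_000      # observed sample mean
s, n = 4_000, 100   # sample standard deviation and sample size
alpha = 0.05        # significance level

std_error = s / math.sqrt(n)                 # standard error of the mean
z = (x_bar - mu0) / std_error                # test statistic
z_alpha = NormalDist().inv_cdf(1 - alpha)    # one-tailed critical z, about 1.645
x_critical = mu0 + z_alpha * std_error       # critical value on the dollar scale
p_value = 1 - NormalDist().cdf(z)            # area to the right of z
reject_h0 = z > z_alpha

print(std_error, z, round(z_alpha, 3), round(x_critical), round(p_value, 5), reject_h0)
```

The test statistic of 2.5 exceeds the critical z of about 1.645 (equivalently, $20,000 exceeds about $19,658), so H0 is rejected at the 5% level.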
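Outcomes when H0 is actually true:
Decision "H0 is true": correct decision, with probability equal to the confidence level, 1 - α
Decision "H0 is false": incorrect decision, with probability equal to the significance level, α
α -> significance level
1 - α -> confidence level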
[Figure: sampling distribution centered on μ = $19,000 (Z = 0) with α = 0.05; p-value = 0.00621 marked in the "Reject H0" tail]
What if Sam surveyed the market and found that student behavior is estimated to be:
They would find the training too expensive if their household income is < US$19,000, and hence would not have the buying power for the course?
They would perceive the training to be of inferior quality if their household income is > US$19,000, and hence not buy the training?
How would the decision criteria change? What should be the testing strategy?
Hint: From the question wording, infer two-tailed testing
Appropriately modify the significance value and other parameters
Use the Z-test
Appropriate change in the decision-making and testing process:
Students will not attend the course if:
The household income is > $19,000 and the students perceive the course to be inferior
The household income is < $19,000
This becomes a two-tailed test, wherein the student will join the course only when the household income lies between particular boundaries, i.e. the household income should be neither very high nor very low
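Lower critical value: μ - Zα/2 × s_x̄ = $19,000 - 1.96 × $400 = $18,216
[Figure: two-tailed test on the sampling distribution centered on μ = $19,000 (Z = 0), with α/2 = 0.025 in each tail; "Reject H0" regions in both tails and "Do not reject H0" between them]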
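For the two-tailed version, the significance level is split equally between the two tails. The sketch below assumes the same standard error ($400) and 5% significance level as the one-tailed example, and reuses the $20,000 sample mean as an assumption. The upper boundary follows by symmetry and is not stated in the slides.

```python
from statistics import NormalDist

mu0 = 19_000       # hypothesized population mean under H0
std_error = 400    # standard error of the mean, from the one-tailed example
alpha = 0.05       # total significance level, split as alpha/2 per tail
x_bar = 20_000     # assumed sample mean, reused from the one-tailed example

z_half = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
lower = mu0 - z_half * std_error               # lower critical value
upper = mu0 + z_half * std_error               # upper critical value (by symmetry)

z = (x_bar - mu0) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
reject_h0 = not (lower <= x_bar <= upper)

print(round(lower), round(upper), round(p_value, 5), reject_h0)
```

H0 is retained only when the sample mean falls between the two boundaries; a sample mean of $20,000 lies above the upper boundary, so under these assumptions H0 would be rejected.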
Thank you!
EduPristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191