
Basic Statistics Training

Decision Analytics Group


Training Structure

• Getting Started
• Measures of Central Tendency
• Measures of Dispersion
• Skewness & Kurtosis
• Probability Function
• Cumulative Probability Function
• Probability Density Function
• Some Important Discrete Distributions
• Normal Distribution
• Standard Normal Distribution
• Other Continuous Distributions
• Sampling Distribution
• Hypothesis Testing
• Type I and Type II Errors
• Critical Region
• Single-Tailed Test
• P-Value
• Significance Tests
• Continuous Distributions (related to Significance Tests)

© ExlService Holdings, Inc. 2006-2011 Confidential 2


Getting Started

Statistics is the science of collecting, describing, and interpreting data.

Sample:
A subset of a population.

Random Variable:
A function that maps simple events (outcomes in the sample space) to real numbers.

Sample Statistic:
A function of the sample values is called a statistic.



Measures of Central Tendency

Mean - the average of a set of observations.

Median - the value of the variable that divides the ordered set of observations into two equal parts. The median is thus a positional average.

Mode - the value that occurs most frequently in a set of observations, and around which the other items of the set cluster densely.

Example -
For the set of observations 12, 15, 11, 11, 7, 13:

Mean = (12 + 15 + 11 + 11 + 7 + 13) / 6 = 11.5
Median = middle of the sorted values 7, 11, 11, 12, 13, 15 = (11 + 12) / 2 = 11.5
Mode = 11 (the most frequent value)
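The deck's code examples use SAS; as an illustrative aside, the three measures above can be checked with Python's standard library:

```python
from statistics import mean, median, mode

data = [12, 15, 11, 11, 7, 13]

avg = mean(data)      # (12 + 15 + 11 + 11 + 7 + 13) / 6 = 11.5
mid = median(data)    # middle of the sorted values: (11 + 12) / 2 = 11.5
top = mode(data)      # most frequent value: 11
```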



Measures of Dispersion
Range - the difference between the two extreme observations of the variable.

Inter-Quartile Range - the difference between the first and the third quartile.

Variance - the mean of the squared deviations of a random variable from its mean: Var(X) = E[(X − µ)²].

Standard Deviation - the square root of the variance.

Coefficient of Variation - the ratio of the standard deviation to the mean. It is used for comparing the variability of two series.
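As a quick sketch (again in Python rather than the deck's SAS), the dispersion measures for the earlier set of observations can be computed directly; `pvariance`/`pstdev` are the population versions matching the definitions above:

```python
import statistics

data = [12, 15, 11, 11, 7, 13]          # same observations as before

rng = max(data) - min(data)             # range: 15 - 7 = 8
var = statistics.pvariance(data)        # population variance E[(X - mu)^2]
sd  = statistics.pstdev(data)           # population standard deviation
cv  = sd / statistics.mean(data)        # coefficient of variation
q   = statistics.quantiles(data, n=4)   # quartiles; q[2] - q[0] is the IQR
```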



Skewness and Kurtosis
Skewness is a measure of the asymmetry of a probability distribution.

For a negative skew: Mean < Mode
For a positive skew: Mean > Mode

Kurtosis is the degree of peakedness of a distribution.

For a mesokurtic curve, Kurtosis = 3
For a leptokurtic curve, Kurtosis > 3
For a platykurtic curve, Kurtosis < 3



SAS Command and SAS OUTPUT

SAS COMMAND:

data Rand;
  do i = 1 to 500;
    /* draw gamma (shape = 4, scaled) and standard normal random numbers */
    Gamma_RN  = rangam(int(datetime()), 4) / 0.5;
    Normal_RN = rannor(int(datetime()));
    output;
  end;
run;

proc univariate data = Rand;
  var Gamma_RN Normal_RN;
  histogram;
run;



SAS Command and SAS OUTPUT

Moments
N                500          Sum Weights       500
Mean             8.27075664   Sum Observations  4135.37832
Std Deviation    4.22705493   Variance          17.8679934
Skewness         1.26774815   Kurtosis          2.3000518
Uncorrected SS   43118.8364   Corrected SS      8916.12869
Coeff Variation  51.1084429   Std Error Mean    0.18903964

Quantiles (Definition 5)
Quantile     Estimate
100% Max     27.94164
99%          21.54010
95%          17.04148
90%          13.81142
75% Q3       10.36902
50% Median    7.55599
25% Q1        5.34146
10%           3.56934
5%            2.94742
1%            1.98987
0% Min        1.30674

Basic Statistical Measures
Location             Variability
Mean    8.270757     Std Deviation        4.22705
Median  7.555995     Variance             17.86799
Mode    .            Range                26.63490
                     Interquartile Range  5.02756



Probability function

A rule that assigns probabilities to the values of a random variable is called the probability function or probability mass function (pmf). It is denoted by

f(x) = P(X = x)

This is also called the probability distribution of the discrete random variable. The probability at a point X = x corresponds to the event A such that

A = {s ∈ S : X(s) = x}

Example:
We toss three coins and observe the number of heads. The random variable X is the number of heads observed and may take integer values 0 to 3.
Sample space = {TTT, HHT, HTH, THH, HTT, THT, TTH, HHH}

Probability Distribution
X    probability
0    f(0) = P(X = 0) = 1/8
1    f(1) = P(X = 1) = 3/8
2    f(2) = P(X = 2) = 3/8
3    f(3) = P(X = 3) = 1/8

Properties are satisfied:
1. 0 ≤ P(xᵢ) ≤ 1 for each xᵢ
2. The sum of P(x) over all x equals 1
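The pmf in this example can be reproduced by brute-force enumeration of the sample space; an illustrative Python sketch:

```python
from itertools import product

# all 8 equally likely outcomes of tossing three coins
outcomes = list(product("HT", repeat=3))

# pmf of X = number of heads observed
pmf = {x: sum(1 for o in outcomes if o.count("H") == x) / len(outcomes)
       for x in range(4)}
```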



Cumulative Probability Distribution
A function F that gives the probability that X ≤ xᵢ for each value xᵢ assumed by the random variable X is called the cdf of X, where xᵢ > xᵢ₋₁:

F(xᵢ) = P(X ≤ xᵢ) = Σₖ₌₁ⁱ P(X = xₖ)

Properties of the cdf:
1. 0 ≤ F(xᵢ) ≤ 1
2. P(X = xᵢ) = F(xᵢ) − F(xᵢ₋₁)
Going back to the previous example, here x₁ = 0, x₂ = 1, x₃ = 2, x₄ = 3.

Cumulative Probability Distribution
X        probability
x₁ = 0   F(0) = P(X ≤ 0) = 1/8
x₂ = 1   F(1) = P(X ≤ 1) = 4/8
x₃ = 2   F(2) = P(X ≤ 2) = 7/8
x₄ = 3   F(3) = P(X ≤ 3) = 8/8 = 1

[Step chart of F(x) rising from 1/8 to 1 over x = 0, 1, 2, 3]
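Accumulating the pmf gives the cdf, and property 2 recovers a point probability from consecutive cdf values; a minimal Python sketch of the table above:

```python
# pmf of the three-coin example
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

# accumulate the pmf to obtain the cdf F(x) = P(X <= x)
cdf, running = {}, 0.0
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

# property 2: a point probability is recovered as F(x_i) - F(x_{i-1})
p2 = cdf[2] - cdf[1]    # = P(X = 2) = 3/8
```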
Probability Density Function
From a real-life viewpoint, since exact measurements are impossible, it is reasonable to ask for the probability of an observation lying in a range such as [0.4, 0.6] rather than taking an exact value like 0.5. When a random variable is continuous, we calculate probabilities by using its probability density function f(x), as follows.

Given any values a, b with a ≤ b:

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

In other words, the probability of X taking values between a and b is given by the area under the graph of f between a and b.
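The "area under the curve" idea can be made concrete with numerical integration. The density below, f(x) = 2x on [0, 1], is a hypothetical example chosen for illustration (not from the deck); for it, P(0.4 ≤ X ≤ 0.6) = 0.36 − 0.16 = 0.2:

```python
def density(x):
    # hypothetical density f(x) = 2x on [0, 1], zero elsewhere
    return 2 * x if 0 <= x <= 1 else 0

def prob(a, b, n=100_000):
    # midpoint-rule approximation of the integral of f from a to b
    h = (b - a) / n
    return sum(density(a + (i + 0.5) * h) for i in range(n)) * h

p = prob(0.4, 0.6)   # area under f between 0.4 and 0.6, about 0.2
```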
[Figure 3: hypothetical relative frequency density distribution of a large population, with the area between X = a and X = b shaded]

The graph of the probability density function is called the density curve.



Some Important Discrete Distributions

Discrete uniform distribution

• When a pmf is constant on the space, we say that the distribution is uniform over that space.

• Example: If X has a discrete uniform distribution on S = {1, 2, 3, 4, 5, 6}, its pmf is

f(x) = P(X = x) = 1/6

• We can generalize this by letting X have a discrete uniform distribution over the first m positive integers, so that its pmf is

f(x) = 1/m,  x = 1, 2, 3, …, m



Some Important Discrete Distributions

Binomial distribution | Geometric distribution

[Chart: probability mass function of the Binomial distribution with parameters n = 10 and p = 0.5]
[Chart: pmf and cdf of the Geometric distribution with probability of a head 0.7, plotted against the trial number at which the first tail appears]

A Bernoulli experiment is a random experiment in which the outcome can be classified in only one of two mutually exclusive and exhaustive ways, say, success or failure. The Binomial distribution is the sum of n independent Bernoulli random variables: it counts the number of "successes" in n trials. The Multinomial distribution is a generalization of the binomial distribution.

If X denotes the number of trials needed to obtain the first success in a sequence of Bernoulli trials, then X follows the Geometric distribution.
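The two pmfs just described can be written down directly from their definitions; an illustrative Python sketch (not part of the original SAS material):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p): k successes in n Bernoulli trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(k, p):
    # P(X = k) for X ~ Geometric(p): first success on trial k, k = 1, 2, ...
    return (1 - p)**(k - 1) * p
```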



Some Important Discrete Distributions

Negative binomial distribution | Poisson distribution

[Chart: probability mass function of the Poisson distribution for λ = 0.1, 0.5, 1, 2.5, and 5]

A generalization of the geometric distribution in which the random variable is the number of Bernoulli trials required to obtain r successes.

If X is the number of events occurring randomly in a unit interval with a mean rate of λ per unit interval, then X is said to follow the Poisson distribution with parameter λ.
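These two pmfs can likewise be sketched from their definitions (illustrative Python, not from the deck); note that the negative binomial with r = 1 reduces to the geometric distribution:

```python
from math import comb, exp, factorial

def nbinom_pmf(k, r, p):
    # P(X = k): the k-th trial yields the r-th success, k = r, r + 1, ...
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam): k events in a unit interval, mean rate lam
    return exp(-lam) * lam**k / factorial(k)
```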



Normal Distribution
A large number of variables have a relative frequency distribution which approximates this shape: unimodal and approximately symmetrical.

[Figure 1: relative frequency distribution of a measured variable over roughly 166 to 200]

Suppose we measured the variable more accurately; then, for a larger sample from the population, we would get a relative frequency distribution as shown below.

[Figure 2: larger sample and more accurate measurement, giving a smoother relative frequency distribution over roughly 160 to 200]



Normal Distribution …Contd
A mathematical model which describes approximately the shape of many relative frequency distributions is the Normal distribution, shown below.

Two important areas under the normal curve are emphasized:
• Approximately 95% of the data lie within two standard deviations of the mean, i.e. in the interval (µ − 2σ, µ + 2σ).
• Nearly all (about 99.7%) of the values lie within three standard deviations of the mean.



Standard Normal Distribution
Normal distributions can be transformed to the standard normal distribution by the formula:

Z = (X − µ) / σ

where X is a random variable, µ is the mean, and σ is the standard deviation of the original normal distribution. Applying the formula produces a transformed distribution with a mean of zero and a standard deviation of one.

Standard normal distribution

[Figure 5: the standard normal density curve over x = −4 to 4]

Approximately 95% of the data lie within two units of 0, i.e. in the interval (−2, 2).
Nearly all (99.7%) of the values lie within three units of 0, i.e. in the interval (−3, 3).
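The standardizing transformation can be applied to a whole data set; a short illustrative sketch (the data values are made up for demonstration):

```python
import statistics

def standardize(xs):
    # z = (x - mu) / sigma: transform to mean 0 and standard deviation 1
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardize([2, 4, 6, 8])   # hypothetical data
```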



Other Continuous Distributions
Exponential distribution

• If X denotes the waiting time until the first change occurs when observing a Poisson process in which the mean number of changes in the unit interval is λ, then X has an exponential distribution.
• If we observe any process composed of events that occur at random times (say lightning strikes, coal mine accidents, earthquakes, fires, etc.), the times between these events will be exponentially distributed. The probability of occurrence of the next event is independent of the occurrence time of the past event.

Gamma distribution

• If X denotes the waiting time until the k-th change in the Poisson process, then X has a gamma distribution with parameters k and θ = 1/λ.
• For a fixed θ, as k increases the probability curve moves to the right; the same is true for increasing θ with fixed k. So if the mean number of changes per unit interval decreases, the waiting time to observe k changes can be expected to increase.
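The link between the exponential waiting time and the Poisson process can be checked numerically: "no event by time x" has probability exp(−λx), which is both 1 minus the exponential cdf and the Poisson pmf at k = 0 with mean λx. An illustrative sketch (the values of λ and x are arbitrary):

```python
from math import exp

def exp_cdf(x, lam):
    # P(waiting time to first change <= x), for rate lam per unit interval
    return 1 - exp(-lam * x)

lam, x = 2.0, 1.5   # hypothetical rate and time
# survival probability 1 - F(x) equals the Poisson P(0 events in [0, x])
survival = 1 - exp_cdf(x, lam)
```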



Sampling Distribution

The distribution of values of a sample statistic obtained from repeated samples, all of the same size and all drawn from the same population, is called the sampling distribution of the statistic. The sampling distribution is simply the probability distribution of the statistic.

Example:
A box contains 10 counters. The colour of a counter can be yellow, green, red, or blue: there are 2 yellow, 2 green, 3 red, and 3 blue counters in the box. Each colour is associated with a "score" written on the counter. A person is asked to draw 2 counters from the box. Determine the sampling distribution of the mean score.
All the possible ways in which two counters can be drawn:



[Tables: frequency distribution and mean of scores obtained in each sample]

Sampling Distribution of the Mean


If a random sample of size n is drawn from a population with mean µ and standard deviation σ, then the sampling distribution of the mean has mean µ and standard deviation σ/√n.

Standard Error: the standard deviation of the sampling distribution of a statistic is called its standard error.
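The σ/√n result can be seen by simulation: draw many samples of size n and look at the spread of the sample means. A hedged Python sketch with made-up population parameters (µ = 50, σ = 10, n = 25, so the standard error is 10/√25 = 2):

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 50, 10, 25          # hypothetical population and sample size

# draw many samples of size n from a Normal(mu, sigma) population
# and record each sample mean
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20_000)
]

# the sample means centre on mu, with spread close to the
# standard error sigma / sqrt(n) = 10 / 5 = 2
```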



HYPOTHESIS TESTING
Statistical inference:
Making conjectures about the distribution of a random variable, or about its parameters, on the basis of a sample is called statistical inference.

There can be an infinite number of functions of sample values, called statistics, that may be proposed as estimators. The best estimator is one which falls nearest to the true value of the parameter. We can estimate in two ways:

Point estimation: a single value or point.
Interval estimation: a range of values, or interval, in which the true value of the parameter may be expected to lie with some definite probability or degree of confidence. This interval is called a confidence interval.

Level of Confidence:
The probability that the sample yields an interval that includes the parameter being estimated is the level of confidence for that interval. It is denoted by (1 − α), where α is the probability that the parameter lies outside the confidence interval. It is often expressed as a percentage.
e.g. For α = 0.05, the confidence coefficient = 100(1 − 0.05)% = 95%.
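A 95% confidence interval for a mean can be sketched with the usual large-sample formula mean ± z·s/√n, where z = 1.96 corresponds to α = 0.05; the sample values below are hypothetical:

```python
from math import sqrt
import statistics

def mean_ci(xs, z=1.96):
    # approximate 100(1 - alpha)% confidence interval for the mean;
    # z = 1.96 corresponds to alpha = 0.05, i.e. 95% confidence
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / sqrt(len(xs))   # estimated standard error
    return m - z * se, m + z * se

lo, hi = mean_ci([10, 12, 11, 13, 9, 10, 12, 11])   # hypothetical sample
```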

Statistical test: setting up of hypotheses → choice of a suitable test statistic → conclusion



Statistical Hypothesis Test: a process by which a decision is made between two opposing hypotheses.
Example: the process by which an engineer decides, on the basis of sample data, whether the true average lifetime of a certain kind of tire is 22000 miles.

Statistical Hypothesis: any assumption about a parameter value, or about the form of the probability distribution, which can be tested statistically with the help of the observations.
• Null Hypothesis: the hypothesis that we test. It is denoted by H0.
• Alternative Hypothesis: a statement that the population parameter has a value different, in some way, from the value given in the null hypothesis.

Test Statistic: a statistic used in making the decision about H0. Choose an appropriate test statistic and compute its value from the sample. Test statistics follow known distributions.

Test Criteria: the value of the test statistic is used to draw appropriate conclusions about H0. Methods used: 1. the classical approach; 2. the p-value approach.
Type I and Type II Errors

If we observe that the average lifetime of the tires in the sample is 22000, then we do not have sufficient evidence to reject the null hypothesis.

But since we are not sure of the population, we might be committing an error. More precisely, it may be that µ ≠ 22000 for the population, and µ = 22000 held only for that sample.

There are two types of errors that can arise:



Critical Region (Rejection region)
The set of values for the test statistic that will cause us to reject the null hypothesis in favor of the alternative
hypothesis.

Consider a situation in which we want to test the null hypothesis H0: µ = 22000 against the alternative hypothesis H1: µ ≠ 22000.

Since it appears reasonable to accept the null hypothesis when our estimate of µ is close to 22000, and to reject it when our estimate is much larger or much smaller than 22000, it is logical to let the critical region consist of both tails of the sampling distribution of the test statistic. Such a test is referred to as a two-tailed test.



Single- tailed test
On the other hand, if we are testing the null hypothesis H0: µ = 22000 against the alternative hypothesis H1: µ < 22000, it would seem reasonable to reject H0 only when the estimate of µ is much smaller than 22000. In this case the critical region consists only of the left tail.

Likewise, in testing H0: µ = 22000 against the alternative hypothesis H1: µ > 22000, it would seem reasonable to reject H0 only when the estimate of µ is much larger than 22000. In this case the critical region consists only of the right tail.



P-value
The usual tables provide the critical values (e.g. z_α for the standard normal distribution) only for a few values of α. Mainly for this reason, it has been the custom to base tests of statistical hypotheses almost exclusively on a level of significance, usually α = 0.05 or α = 0.01.

Consider the critical region for the one-tailed test H1: µ > µ0. Instead of comparing the observed value of X with the boundary of the critical region (the critical value z_α), we compare the tail area beyond the observed value with α: we reject the null hypothesis if that tail area is less than or equal to α. This tail area is referred to as the p-value, or the observed level of significance, corresponding to x, the observed value of X. In fact, it is the probability P(X ≥ x) when the null hypothesis is true.
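A one-sided p-value for the tire example can be sketched with the normal cdf built from `math.erf`; the sample mean (22500), σ (1500), and n (36) below are made-up numbers for illustration, as the deck does not give them:

```python
from math import erf, sqrt

def z_test_pvalue(xbar, mu0, sigma, n):
    # one-sided p-value P(Xbar >= xbar) under H0: mu = mu0,
    # for a normal population with known standard deviation sigma
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))   # 1 - Phi(z)

# hypothetical tire data: xbar = 22500, sigma = 1500, n = 36, so z = 2
p = z_test_pvalue(22500, 22000, 1500, 36)
# reject H0 at level alpha if p <= alpha
```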



Significance Tests

Tests for Single Mean
• Normal distribution
• t-distribution

Tests for Difference of Means
• Normal distribution
• t-distribution

Test for Variance
• Chi-square distribution

Tests for Independence of Attributes
• Chi-square distribution

Test for Equality of Variances
• F-distribution

Test for Equality of Several Means
• F-distribution (ANOVA)



Do I use the Normal distribution or the t-distribution to test these hypotheses?



Continuous Distributions (Related to Significance Tests)

Standard Normal and t-distribution

[Chart: standard normal and t density curves over x = −4 to 4]

• "Degrees of freedom" is a parameter that identifies a t-distribution; thus we get a separate distribution for each number of degrees of freedom.
• The t-distribution approaches the normal distribution as the number of degrees of freedom increases.

Is n large?
Samples of size n > 30 may be considered large enough for the central limit theorem to hold if the sample data are unimodal, nearly symmetrical, short-tailed, and without outliers. Samples that are not symmetrical require larger sample sizes, with 50 often sufficing except for extremely skewed samples.



Continuous Distributions (Related to Significance Tests)
As the number of degrees of freedom increases, the distributions tend toward normal distribution

Chi-square distribution
• χ² is non-negative in value.
• χ² is not symmetrical.
• χ² forms a family of distributions: a separate distribution for each number of degrees of freedom.

F-distribution
• F is non-negative in value: it is zero or positively valued.
• F is not symmetrical; it is skewed to the right.
• F forms a family of distributions: a separate distribution for each pair of degrees of freedom.

