
Basic Statistics Training

Decision Analytics Group


Training Structure

• Getting Started
• Measures of Central Tendency
• Measures of Dispersion
• Skewness & Kurtosis
• Probability Function
• Cumulative Probability Function
• Probability Density Function
• Some Important Discrete Distributions
• Normal Distribution
• Standard Normal Distribution
• Other Continuous Distributions
• Sampling Distribution
• Hypothesis Testing
• Type I and Type II Errors
• Critical Region
• Single-Tailed Test
• P-Value
• Significance Tests
• Continuous Distributions (related to Significance Tests)

© ExlService Holdings, Inc. 2006-2011 Confidential 2


Getting Started

Statistics is the science of collecting, describing, and interpreting data.

Sample:
A subset of a population.

Random Variable:
A function that maps simple events (outcomes in the sample space) to real numbers.

Sample Statistic:
A function of the sample values is called a statistic.



Measures of Central Tendency

Mean - the average of a set of observations.

Median - the value of the variable that divides the ordered set of observations into two equal parts. The median is thus a positional average.

Mode - the value that occurs most frequently in a set of observations, and around which the other items of the set cluster densely.

Example -
For the set of observations 12, 15, 11, 11, 7, 13:

Mean = (12 + 15 + 11 + 11 + 7 + 13) / 6 = 11.5
Median = middle of the sorted values 7, 11, 11, 12, 13, 15 = (11 + 12) / 2 = 11.5
Mode = 11 (the most frequent value)
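The deck's code examples use SAS; as an illustrative aside, the three measures above can be checked with Python's standard library:

```python
from statistics import mean, median, mode

data = [12, 15, 11, 11, 7, 13]

avg = mean(data)      # (12 + 15 + 11 + 11 + 7 + 13) / 6 = 11.5
mid = median(data)    # middle of the sorted values: (11 + 12) / 2 = 11.5
top = mode(data)      # most frequent value: 11
```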



Measures of Dispersion
Range - the difference between the two extreme observations of the variable.

Inter-Quartile Range - the difference between the first and the third quartile.

Variance - the mean of the squared deviations of a random variable from its mean: Var(X) = E[(X − µ)²].

Standard Deviation - the square root of the variance.

Coefficient of Variation - the ratio of the standard deviation to the mean. It is used for comparing the variability of two series.
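As a quick sketch (again in Python rather than the deck's SAS), the dispersion measures for the earlier set of observations can be computed directly; `pvariance`/`pstdev` are the population versions matching the definitions above:

```python
import statistics

data = [12, 15, 11, 11, 7, 13]          # same observations as before

rng = max(data) - min(data)             # range: 15 - 7 = 8
var = statistics.pvariance(data)        # population variance E[(X - mu)^2]
sd  = statistics.pstdev(data)           # population standard deviation
cv  = sd / statistics.mean(data)        # coefficient of variation
q   = statistics.quantiles(data, n=4)   # quartiles; q[2] - q[0] is the IQR
```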



Skewness and Kurtosis
Skewness is a measure of the asymmetry of a probability distribution.

For a negative skew: Mean < Mode
For a positive skew: Mean > Mode

Kurtosis is the degree of peakedness of a distribution.

For a mesokurtic curve, Kurtosis = 3
For a leptokurtic curve, Kurtosis > 3
For a platykurtic curve, Kurtosis < 3



SAS Command and SAS OUTPUT

SAS COMMAND:

data Rand;
  do i = 1 to 500;
    /* draw gamma (shape = 4, scaled) and standard normal random numbers */
    Gamma_RN  = rangam(int(datetime()), 4) / 0.5;
    Normal_RN = rannor(int(datetime()));
    output;
  end;
run;

proc univariate data = Rand;
  var Gamma_RN Normal_RN;
  histogram;
run;



SAS Command and SAS OUTPUT

Moments
N                500          Sum Weights       500
Mean             8.27075664   Sum Observations  4135.37832
Std Deviation    4.22705493   Variance          17.8679934
Skewness         1.26774815   Kurtosis          2.3000518
Uncorrected SS   43118.8364   Corrected SS      8916.12869
Coeff Variation  51.1084429   Std Error Mean    0.18903964

Quantiles (Definition 5)
Quantile     Estimate
100% Max     27.94164
99%          21.54010
95%          17.04148
90%          13.81142
75% Q3       10.36902
50% Median    7.55599
25% Q1        5.34146
10%           3.56934
5%            2.94742
1%            1.98987
0% Min        1.30674

Basic Statistical Measures
Location             Variability
Mean    8.270757     Std Deviation        4.22705
Median  7.555995     Variance             17.86799
Mode    .            Range                26.63490
                     Interquartile Range  5.02756



Probability function

A rule that assigns probabilities to the values of a random variable is called the probability function or probability mass function (pmf). It is denoted by

f(x) = P(X = x)

This is also called the probability distribution of the discrete random variable. The probability at a point X = x corresponds to the event A such that

A = {s ∈ S : X(s) = x}

Example:
We toss three coins and observe the number of heads. The random variable X is the number of heads observed and may take integer values 0 to 3.
Sample space = {TTT, HHT, HTH, THH, HTT, THT, TTH, HHH}

Probability Distribution
X    probability
0    f(0) = P(X = 0) = 1/8
1    f(1) = P(X = 1) = 3/8
2    f(2) = P(X = 2) = 3/8
3    f(3) = P(X = 3) = 1/8

Properties are satisfied:
1. 0 ≤ P(xᵢ) ≤ 1 for each xᵢ
2. The sum of P(x) over all x equals 1
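The pmf in this example can be reproduced by brute-force enumeration of the sample space; an illustrative Python sketch:

```python
from itertools import product

# all 8 equally likely outcomes of tossing three coins
outcomes = list(product("HT", repeat=3))

# pmf of X = number of heads observed
pmf = {x: sum(1 for o in outcomes if o.count("H") == x) / len(outcomes)
       for x in range(4)}
```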



Cumulative Probability Distribution
A function F that gives the probability that X ≤ xᵢ for each value xᵢ assumed by the random variable X is called the cdf of X, where xᵢ > xᵢ₋₁:

F(xᵢ) = P(X ≤ xᵢ) = Σₖ₌₁ⁱ P(X = xₖ)

Properties of the cdf:
1. 0 ≤ F(xᵢ) ≤ 1
2. P(X = xᵢ) = F(xᵢ) − F(xᵢ₋₁)
Going back to the previous example, here x₁ = 0, x₂ = 1, x₃ = 2, x₄ = 3.

Cumulative Probability Distribution
X        probability
x₁ = 0   F(0) = P(X ≤ 0) = 1/8
x₂ = 1   F(1) = P(X ≤ 1) = 4/8
x₃ = 2   F(2) = P(X ≤ 2) = 7/8
x₄ = 3   F(3) = P(X ≤ 3) = 8/8 = 1

[Step chart of F(x) rising from 1/8 to 1 over x = 0, 1, 2, 3]
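Accumulating the pmf gives the cdf, and property 2 recovers a point probability from consecutive cdf values; a minimal Python sketch of the table above:

```python
# pmf of the three-coin example
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

# accumulate the pmf to obtain the cdf F(x) = P(X <= x)
cdf, running = {}, 0.0
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

# property 2: a point probability is recovered as F(x_i) - F(x_{i-1})
p2 = cdf[2] - cdf[1]    # = P(X = 2) = 3/8
```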
Probability Density Function
From a real-life viewpoint, since exact measurements are impossible, it is reasonable to ask for the probability of an observation lying in a range such as [0.4, 0.6] rather than taking an exact value like 0.5. When a random variable is continuous, we calculate probabilities by using its probability density function f(x), as follows.

Given any values a, b with a ≤ b:

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

In other words, the probability of X taking values between a and b is given by the area under the graph of f between a and b.
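The "area under the curve" idea can be made concrete with numerical integration. The density below, f(x) = 2x on [0, 1], is a hypothetical example chosen for illustration (not from the deck); for it, P(0.4 ≤ X ≤ 0.6) = 0.36 − 0.16 = 0.2:

```python
def density(x):
    # hypothetical density f(x) = 2x on [0, 1], zero elsewhere
    return 2 * x if 0 <= x <= 1 else 0

def prob(a, b, n=100_000):
    # midpoint-rule approximation of the integral of f from a to b
    h = (b - a) / n
    return sum(density(a + (i + 0.5) * h) for i in range(n)) * h

p = prob(0.4, 0.6)   # area under f between 0.4 and 0.6, about 0.2
```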
[Figure 3: hypothetical relative frequency density distribution of a large population, with the area between X = a and X = b shaded]

The graph of the probability density function is called the density curve.



Some Important Discrete Distributions

Discrete uniform distribution

• When a pmf is constant on the space, we say that the distribution is uniform over that space.

• Example: If X has a discrete uniform distribution on S = {1, 2, 3, 4, 5, 6}, its pmf is

f(x) = P(X = x) = 1/6

• We can generalize this by letting X have a discrete uniform distribution over the first m positive integers, so that its pmf is

f(x) = 1/m,  x = 1, 2, 3, …, m



Some Important Discrete Distributions

Binomial distribution | Geometric distribution

[Chart: probability mass function of the Binomial distribution with parameters n = 10 and p = 0.5]
[Chart: pmf and cdf of the Geometric distribution with probability of a head 0.7, plotted against the trial number at which the first tail appears]

A Bernoulli experiment is a random experiment in which the outcome can be classified in only one of two mutually exclusive and exhaustive ways, say, success or failure. The Binomial distribution is the sum of n independent Bernoulli random variables: it counts the number of "successes" in n trials. The Multinomial distribution is a generalization of the binomial distribution.

If X denotes the number of trials needed to obtain the first success in a sequence of Bernoulli trials, then X follows the Geometric distribution.
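The two pmfs just described can be written down directly from their definitions; an illustrative Python sketch (not part of the original SAS material):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p): k successes in n Bernoulli trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(k, p):
    # P(X = k) for X ~ Geometric(p): first success on trial k, k = 1, 2, ...
    return (1 - p)**(k - 1) * p
```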



Some Important Discrete Distributions

Negative binomial distribution | Poisson distribution

[Chart: probability mass function of the Poisson distribution for λ = 0.1, 0.5, 1, 2.5, and 5]

A generalization of the geometric distribution in which the random variable is the number of Bernoulli trials required to obtain r successes.

If X is the number of events occurring randomly in a unit interval with a mean rate of λ per unit interval, then X is said to follow the Poisson distribution with parameter λ.
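These two pmfs can likewise be sketched from their definitions (illustrative Python, not from the deck); note that the negative binomial with r = 1 reduces to the geometric distribution:

```python
from math import comb, exp, factorial

def nbinom_pmf(k, r, p):
    # P(X = k): the k-th trial yields the r-th success, k = r, r + 1, ...
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam): k events in a unit interval, mean rate lam
    return exp(-lam) * lam**k / factorial(k)
```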



Normal Distribution
A large number of variables have a relative frequency distribution which approximates this shape: unimodal and approximately symmetrical.

[Figure 1: relative frequency distribution of a measured variable over roughly 166 to 200]

Suppose we measured the variable more accurately; then, for a larger sample from the population, we would get a relative frequency distribution as shown below.

[Figure 2: larger sample and more accurate measurement, giving a smoother relative frequency distribution over roughly 160 to 200]



Normal Distribution …Contd
A mathematical model which describes approximately the shape of many relative frequency distributions is the Normal distribution, shown below.

Two important areas under the normal curve are emphasized:
• Approximately 95% of the data lie within two standard deviations of the mean, i.e. in the interval (µ − 2σ, µ + 2σ).
• Nearly all (about 99.7%) of the values lie within three standard deviations of the mean.



Standard Normal Distribution
Normal distributions can be transformed to the standard normal distribution by the formula:

Z = (X − µ) / σ

where X is a random variable, µ is the mean, and σ is the standard deviation of the original normal distribution. Applying the formula produces a transformed distribution with a mean of zero and a standard deviation of one.

Standard normal distribution

[Figure 5: the standard normal density curve over x = −4 to 4]

Approximately 95% of the data lie within two units of 0, i.e. in the interval (−2, 2).
Nearly all (99.7%) of the values lie within three units of 0, i.e. in the interval (−3, 3).
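The standardizing transformation can be applied to a whole data set; a short illustrative sketch (the data values are made up for demonstration):

```python
import statistics

def standardize(xs):
    # z = (x - mu) / sigma: transform to mean 0 and standard deviation 1
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardize([2, 4, 6, 8])   # hypothetical data
```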



Other Continuous Distributions
Exponential distribution

• If X denotes the waiting time until the first change occurs when observing a Poisson process in which the mean number of changes in the unit interval is λ, then X has an exponential distribution.
• If we observe any process composed of events that occur at random times (say lightning strikes, coal mine accidents, earthquakes, fires, etc.), the times between these events will be exponentially distributed. The probability of occurrence of the next event is independent of the occurrence time of the past event.

Gamma distribution

• If X denotes the waiting time until the k-th change in the Poisson process, then X has a gamma distribution with parameters k and θ = 1/λ.
• For a fixed θ, as k increases the probability curve moves to the right; the same is true for increasing θ with fixed k. So if the mean number of changes per unit interval decreases, the waiting time to observe k changes can be expected to increase.
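The link between the exponential waiting time and the Poisson process can be checked numerically: "no event by time x" has probability exp(−λx), which is both 1 minus the exponential cdf and the Poisson pmf at k = 0 with mean λx. An illustrative sketch (the values of λ and x are arbitrary):

```python
from math import exp

def exp_cdf(x, lam):
    # P(waiting time to first change <= x), for rate lam per unit interval
    return 1 - exp(-lam * x)

lam, x = 2.0, 1.5   # hypothetical rate and time
# survival probability 1 - F(x) equals the Poisson P(0 events in [0, x])
survival = 1 - exp_cdf(x, lam)
```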



Sampling Distribution

The distribution of values of a sample statistic obtained from repeated samples, all of the same size and all drawn from the same population, is called the sampling distribution of the statistic. The sampling distribution is simply the probability distribution of the statistic.

Example:
A box contains 10 counters. The colour of a counter can be yellow, green, red, or blue: there are 2 yellow, 2 green, 3 red, and 3 blue counters in the box. Each colour is associated with a "score" written on the counter. A person is asked to draw 2 counters from the box. Determine the sampling distribution of the mean score.
All the possible ways in which two counters can be drawn:



[Tables: frequency distribution and mean of scores obtained in each sample]

Sampling Distribution of the Mean


If a random sample of size n is drawn from a population with mean µ and standard deviation σ, then the sampling distribution of the mean has mean µ and standard deviation σ/√n.

Standard Error: the standard deviation of the sampling distribution of a statistic is called its standard error.
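The σ/√n result can be seen by simulation: draw many samples of size n and look at the spread of the sample means. A hedged Python sketch with made-up population parameters (µ = 50, σ = 10, n = 25, so the standard error is 10/√25 = 2):

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 50, 10, 25          # hypothetical population and sample size

# draw many samples of size n from a Normal(mu, sigma) population
# and record each sample mean
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20_000)
]

# the sample means centre on mu, with spread close to the
# standard error sigma / sqrt(n) = 10 / 5 = 2
```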



HYPOTHESIS TESTING
Statistical inference:
Making conjectures about the distribution of a random variable, or about its parameters, on the basis of a sample is called statistical inference.

There can be an infinite number of functions of sample values, called statistics, that may be proposed as estimators. The best estimator is one which falls nearest to the true value of the parameter. We can estimate in two ways:

Point estimation: a single value or point.
Interval estimation: a range of values, or interval, in which the true value of the parameter may be expected to lie with some definite probability or degree of confidence. This interval is called a confidence interval.

Level of Confidence:
The probability that the sample yields an interval that includes the parameter being estimated is the level of confidence for that interval. It is denoted by (1 − α), where α is the probability that the parameter lies outside the confidence interval. It is often expressed as a percentage.
e.g. For α = 0.05, the confidence coefficient = 100(1 − 0.05)% = 95%.
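A 95% confidence interval for a mean can be sketched with the usual large-sample formula mean ± z·s/√n, where z = 1.96 corresponds to α = 0.05; the sample values below are hypothetical:

```python
from math import sqrt
import statistics

def mean_ci(xs, z=1.96):
    # approximate 100(1 - alpha)% confidence interval for the mean;
    # z = 1.96 corresponds to alpha = 0.05, i.e. 95% confidence
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / sqrt(len(xs))   # estimated standard error
    return m - z * se, m + z * se

lo, hi = mean_ci([10, 12, 11, 13, 9, 10, 12, 11])   # hypothetical sample
```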

Statistical test: setting up of hypotheses → choice of a suitable test statistic → conclusion



Statistical Hypothesis Test: a process by which a decision is made between two opposing hypotheses.
Example: the process by which an engineer decides, on the basis of sample data, whether the true average lifetime of a certain kind of tire is 22000 miles.

Statistical Hypothesis: any assumption about a parameter value, or about the form of the probability distribution, which can be tested statistically with the help of the observations.
• Null Hypothesis: the hypothesis that we test. It is denoted by H0.
• Alternative Hypothesis: a statement that the population parameter has a value different, in some way, from the value given in the null hypothesis.

Test Statistic: a statistic used in making the decision about H0. Choose an appropriate test statistic and compute its value from the sample. Test statistics follow known distributions.

Test Criteria: the value of the test statistic is used to draw appropriate conclusions about H0. Methods used: 1. the classical approach; 2. the p-value approach.
Type I and Type II Errors

If we observe that the average lifetime of the tires in the sample is 22000, then we do not have sufficient evidence to reject the null hypothesis.

But since we are not sure of the population, we might be committing an error. More precisely, it may be that µ ≠ 22000 for the population, and µ = 22000 held only for that sample.

There are two types of errors that can arise:



Critical Region (Rejection region)
The set of values for the test statistic that will cause us to reject the null hypothesis in favor of the alternative
hypothesis.

Consider a situation in which we want to test the null hypothesis H0: µ = 22000 against the alternative hypothesis H1: µ ≠ 22000.

Since it appears reasonable to accept the null hypothesis when our estimate of µ is close to 22000, and to reject it when our estimate is much larger or much smaller than 22000, it is logical to let the critical region consist of both tails of the sampling distribution of the test statistic. Such a test is referred to as a two-tailed test.



Single- tailed test
On the other hand, if we are testing the null hypothesis H0: µ = 22000 against the alternative hypothesis H1: µ < 22000, it would seem reasonable to reject H0 only when the estimate of µ is much smaller than 22000. In this case the critical region consists only of the left tail.

Likewise, in testing H0: µ = 22000 against the alternative hypothesis H1: µ > 22000, it would seem reasonable to reject H0 only when the estimate of µ is much larger than 22000. In this case the critical region consists only of the right tail.



P-value
The usual tables provide the critical values (e.g. z_α for the standard normal distribution) only for a few values of α. Mainly for this reason, it has been the custom to base tests of statistical hypotheses almost exclusively on a level of significance, usually α = 0.05 or α = 0.01.

Consider the critical region for the one-tailed test H1: µ > µ0. Instead of comparing the observed value of X with the boundary of the critical region (the critical value z_α), we compare the tail area beyond the observed value with α: we reject the null hypothesis if that tail area is less than or equal to α. This tail area is referred to as the p-value, or the observed level of significance, corresponding to x, the observed value of X. In fact, it is the probability P(X ≥ x) when the null hypothesis is true.
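A one-sided p-value for the tire example can be sketched with the normal cdf built from `math.erf`; the sample mean (22500), σ (1500), and n (36) below are made-up numbers for illustration, as the deck does not give them:

```python
from math import erf, sqrt

def z_test_pvalue(xbar, mu0, sigma, n):
    # one-sided p-value P(Xbar >= xbar) under H0: mu = mu0,
    # for a normal population with known standard deviation sigma
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))   # 1 - Phi(z)

# hypothetical tire data: xbar = 22500, sigma = 1500, n = 36, so z = 2
p = z_test_pvalue(22500, 22000, 1500, 36)
# reject H0 at level alpha if p <= alpha
```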



Significance Tests

Tests for Single Mean
• Normal distribution
• t-distribution

Tests for Difference of Means
• Normal distribution
• t-distribution

Test for Variance
• Chi-square distribution

Tests for Independence of Attributes
• Chi-square distribution

Test for Equality of Variances
• F-distribution

Test for Equality of Several Means
• F-distribution (ANOVA)



Do I use the Normal distribution or the t-distribution to test these hypotheses?



Continuous Distributions (Related to Significance Tests)

Standard Normal and t-distribution

[Chart: standard normal and t density curves over x = −4 to 4]

• "Degrees of freedom" is a parameter that identifies a t-distribution; thus we get a separate distribution for each number of degrees of freedom.
• The t-distribution approaches the normal distribution as the number of degrees of freedom increases.

Is n large?
Samples of size n > 30 may be considered large enough for the central limit theorem to hold if the sample data are unimodal, nearly symmetrical, short-tailed, and without outliers. Samples that are not symmetrical require larger sample sizes, with 50 often sufficing except for extremely skewed samples.



Continuous Distributions (Related to Significance Tests)
As the number of degrees of freedom increases, the distributions tend toward normal distribution

Chi-square distribution
• χ² is non-negative in value.
• χ² is not symmetrical.
• χ² forms a family of distributions: a separate distribution for each number of degrees of freedom.

F-distribution
• F is non-negative in value: it is zero or positively valued.
• F is not symmetrical; it is skewed to the right.
• F forms a family of distributions: a separate distribution for each pair of degrees of freedom.

