Unit 8 - Stats and Probab

UNIT SNAPSHOT
UGC NET MANAGEMENT

Unit VIII
PART I - STATISTICS & PROBABILITY

Statistics
The word “Statistics” is derived from the Latin word ‘Status’. In common use it
refers to a group of numbers or figures that represent some information of
human interest.
According to A.L. Bowley, “Statistics are numerical statements of facts in any
department of enquiry placed in relation to each other.”
According to Croxton and Cowden, “Statistics may be defined as the collection,
presentation, analysis, and interpretation of numerical data.”
➢ The important characteristics of statistics are given below:

1. Statistics are aggregates of facts.
2. Statistics are numerically expressed.
3. Statistics are affected to a marked extent by multiplicity of causes.
4. Statistics are enumerated or estimated according to a reasonable standard of
accuracy.
5. Statistics are collected for a predetermined purpose.
6. Statistics are collected in a systematic manner.
7. Statistics must be comparable to each other.
➢ Functions of Statistics
The functions of statistics may be enumerated as follows:
(i) To present facts in a definite form

(ii) To simplify unwieldy and complex data
(iii) To use it as a technique for making comparisons
(iv) To enlarge individual experience
(v) To provide guidance in the formulation of policies
(vi) To enable measurement of the magnitude of a phenomenon
➢ Main Limitations of Statistics
1. Qualitative Aspect Ignored
2. It does not deal with individual items
3. It does not depict entire story of phenomenon
4. It is liable to be misused
5. Results are true only on average
6. Too many methods to study problems
7. Statistical results are not always beyond doubt
➢ Measures of Central Tendency

According to Prof Bowley, “Measures of central tendency (averages) are statistical
constants which enable us to comprehend in a single effort the significance of the
whole.”
The main objectives of measures of central tendency are:
1) To condense data in a single value.
2) To facilitate comparisons between data.
 Averages provide us the gist and give a bird’s eye view of the huge mass of
unwieldy numerical data.
 Averages are the typical values around which other items of the distribution
congregate.
• This value lies between the two extreme observations of the distribution and
gives us an idea about the concentration of the values in the central part of
the distribution.
• That is why averages are called measures of central tendency.
 Averages are also called measures of location since they enable us to locate
the position or place of the distribution in question.
➢ Essentials of a Good Average
Since an average represents the statistical data and is used for purposes of
comparison, it must possess the following properties.
1. It must be rigidly defined and not left to the mere estimation of the observer. If
the definition is rigid, the computed value of the average obtained by different
persons shall be similar.
2. The average must be based upon all values given in the distribution.
3. It should be easily understood. The average should possess simple and obvious
properties. It should not be too abstract for the common people.
4. It should be capable of being calculated with reasonable care and rapidity.
5. It should be stable and unaffected by sampling fluctuations.
6. It should be capable of further algebraic manipulation.
7. It should not be unduly affected by extreme values.
➢ Methods of Measuring Central Tendency
Different methods of measuring “Central Tendency” provide us with different
kinds of averages.
The following are the main types of averages that are commonly used:
1. Mean
(i) Arithmetic mean
(ii) Weighted mean
(iii) Geometric mean
(iv) Harmonic mean
2. Median
3. Mode
1. Arithmetic Mean:
• The mean is the arithmetic average, and it is probably the measure of central
tendency that you are most familiar with.
 Calculating the mean is very simple.
 Add up all of the values and divide by the number of observations in the
dataset.
 The calculation of the mean incorporates all values in the data. If you change
any value, the mean changes.
Mathematical Properties of the Arithmetic Mean:
1. The sum of the deviation of a given set of individual observations from the
arithmetic mean is always zero.
2. The sum of squares of deviations of a set of observations is the minimum
when deviations are taken from the arithmetic average.
3. If each value of a variable X is increased or decreased or multiplied by a
constant k, the arithmetic mean also increases or decreases or multiplies by
the same constant.
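4. If we are given the arithmetic mean and number of items of two or more
groups, we can compute the combined average of these groups by applying the
following formula:
Combined mean: \bar{X}_{12} = (n_1 \bar{X}_1 + n_2 \bar{X}_2) / (n_1 + n_2)
where \bar{X}_1, \bar{X}_2 are the group means and n_1, n_2 are the group sizes.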
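The properties listed above can be checked numerically. The following is a minimal Python sketch on made-up data; the helper names `arithmetic_mean` and `combined_mean` are hypothetical, and the combined mean is assumed to be the size-weighted average of the group means.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


def combined_mean(means, sizes):
    """Combined mean of several groups from their means and sizes."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)


data = [4, 8, 15, 16, 23, 42]          # illustrative values only
mean = arithmetic_mean(data)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in data)

# Property 3: adding a constant to every value shifts the mean by it.
shifted_mean = arithmetic_mean([x + 10 for x in data])

# Property 4: the combined mean of two sub-groups equals the overall mean.
group_a, group_b = data[:2], data[2:]
pooled = combined_mean(
    [arithmetic_mean(group_a), arithmetic_mean(group_b)],
    [len(group_a), len(group_b)],
)
print(mean, deviation_sum, shifted_mean, pooled)
```

Splitting the data differently should leave `pooled` unchanged, because each group mean is weighted by its size.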
Merits of Arithmetic mean:
• Simplicity
• Certainty
• Based on all values.
• Algebraic treatment possible.
• Basis of comparison
• Accuracy test possible.
• No scope for estimated value
Demerits of Arithmetic mean:
• Effect of extreme values.
• Mean value may not figure in the series.
• Unsuitability.
• Misleading conclusions.
• Cannot be used in case of qualitative phenomenon.
• Gets distorted by extreme value of the series.
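Formulae for calculating the arithmetic mean:
Types of Series | Direct Method | Shortcut Method | Step Deviation Method
Individual Series | \bar{X} = \sum X / N | \bar{X} = A + \sum d / N | \bar{X} = A + (\sum d' / N) \times i
Discrete Series | \bar{X} = \sum fX / N | \bar{X} = A + \sum fd / N | \bar{X} = A + (\sum fd' / N) \times i
Continuous Series | \bar{X} = \sum fm / N | \bar{X} = A + \sum fd / N | \bar{X} = A + (\sum fd' / N) \times i
where A = assumed mean, d = deviation from the assumed mean (X - A, or m - A
for a continuous series), d' = d / i, i = common factor (class interval),
m = mid-point of the class, f = frequency, and N = number of items (\sum f for
a frequency distribution).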
2. Weighted Average
 A weighted average is a type of average where each observation in the data set
is multiplied by a predetermined weight before calculation.
 In calculating a simple average (arithmetic mean) all observations are treated
equally and assigned equal weight.
 A weighted average assigns weights that determine the relative importance of
each data point.
Why use a weighted average?
• It takes into account the relative importance of data points when calculating
an average, thereby making it more descriptive than a simple average.
• It smooths out data, which can improve accuracy.
• It is often used in finance to calculate the cost basis of stock portfolios,
and in inventory accounting and valuation.
Weighted Mean: \bar{X}_w = \sum wX / \sum w, where w is the weight attached to each value X.
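As an illustration of the cost-basis use mentioned above, here is a minimal Python sketch. The prices and share counts are made up, and `weighted_mean` is a hypothetical helper that assumes the usual definition: sum of weight times value, divided by the sum of weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


prices = [100.0, 120.0, 90.0]   # hypothetical purchase prices per share
shares = [10, 30, 60]           # hypothetical number of shares bought at each price

avg_cost = weighted_mean(prices, shares)    # cost basis per share
simple_avg = sum(prices) / len(prices)      # treats every price equally
print(avg_cost, simple_avg)
```

The weighted figure is lower than the simple average here because most shares were bought at the lowest price.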
3. Geometric Mean:
 A geometric mean is a mean or average which shows the central tendency of a set
of numbers by using the product of their values.
 For a set of n observations, a geometric mean is the nth root of their product.
 The geometric mean G.M., for a set of numbers x1, x2, … , xn is given as
G.M. = (x_1 \cdot x_2 \cdots x_n)^{1/n}
or, G.M. = (\prod_{i=1}^{n} x_i)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n}.
The geometric mean of two numbers, say x and y, is the square root of their
product x × y. For three numbers, it is the cube root of their product, i.e.,
(x y z)^{1/3}.
Relation between Geometric Mean and Logarithms
In order to make our calculation easy and less time consuming we use the concept
of logarithms in the calculation of geometric means.
Since G.M. = (x_1 \cdot x_2 \cdots x_n)^{1/n},
taking logs on both sides, we have
\log G.M. = (1/n) \log(x_1 \cdot x_2 \cdots x_n)
or, \log G.M. = (1/n) (\log x_1 + \log x_2 + \cdots + \log x_n)
or, \log G.M. = (1/n) \sum_{i=1}^{n} \log x_i
or, G.M. = antilog[(1/n) \sum_{i=1}^{n} \log x_i]
For a grouped frequency distribution, the geometric mean G.M. is
G.M. = (x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n})^{1/N}, where N = \sum_{i=1}^{n} f_i
Taking logarithms on both sides, we get
\log G.M. = (1/N) (f_1 \log x_1 + f_2 \log x_2 + \cdots + f_n \log x_n) = (1/N) \sum_{i=1}^{n} f_i \log x_i.
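The logarithmic route above translates directly into code. This is a minimal Python sketch, assuming natural logarithms (any base gives the same result once the antilog is taken); `geometric_mean` is a hypothetical helper that accepts optional frequencies for the grouped case.

```python
import math


def geometric_mean(values, freqs=None):
    """Antilog of the (frequency-weighted) mean of the logs."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    total = sum(freqs)                      # N = sum of frequencies
    log_mean = sum(f * math.log(x) for x, f in zip(values, freqs)) / total
    return math.exp(log_mean)               # antilog step


gm_simple = geometric_mean([2, 8])              # square root of 16
gm_grouped = geometric_mean([2, 8], [3, 1])     # (2^3 * 8)^(1/4)
print(gm_simple, gm_grouped)
```

Working in logs avoids multiplying many values together, which is the same convenience the hand calculation relies on.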
➢ Properties of Geometric Means

1. The logarithm of geometric mean is the arithmetic mean of the logarithms of
given values
2. If all the observations of a variable are equal to the same constant, say
K > 0, then the G.M. of the observations is also K
3. The geometric mean of the ratio of two variables is the ratio of the geometric
means of the two variables
4. The geometric mean of the product of two variables is the product of their
geometric means
➢ Geometric Mean of a Combined Group
Suppose G_1 and G_2 are the geometric means of two series of sizes n_1 and n_2
respectively. The geometric mean G of the combined group is:
\log G = (n_1 \log G_1 + n_2 \log G_2) / (n_1 + n_2)
or, G = antilog[(n_1 \log G_1 + n_2 \log G_2) / (n_1 + n_2)]
In general, for k groups with geometric means G_i and sizes n_i, i = 1 to k, we have
G = antilog[(n_1 \log G_1 + n_2 \log G_2 + \cdots + n_k \log G_k) / (n_1 + n_2 + \cdots + n_k)]
➢ Specific uses of G.M.:
The geometric Mean has certain specific uses, some of them are:
1. It is used in the construction of index numbers.

2. It is also helpful in finding out the compound rates of change such as the rate
of growth of population in a country.
3. It is suitable where the data are expressed in terms of rates, ratios and
percentage.
4. It is quite useful in computing the average rates of depreciation or
appreciation.
5. It is most suitable when large weights are to be assigned to small items and
small weights to large items.
Advantages of Geometric Mean
• A geometric mean is based upon all the observations
• It is rigidly defined
• The fluctuations of the observations do not affect the geometric mean
• It gives more weight to small items
Disadvantages of Geometric Mean
• A geometric mean is not easily understandable by a non-mathematical person
• If any of the observations is zero, the geometric mean becomes zero
• If any of the observations is negative, the geometric mean becomes imaginary
4. Harmonic Mean
• A simple way to define a harmonic mean is to call it the reciprocal of the
arithmetic mean of the reciprocals of the observations.
• The most important condition for it is that none of the observations should
be zero.
 A harmonic mean is used in averaging of ratios.
 The most common examples of ratios are that of speed and time, cost and unit
of material, work and time etc.
 The harmonic mean (H.M.) of n observations is
H.M. = 1 / [(1/n) \sum_{i=1}^{n} (1/x_i)] = n / \sum_{i=1}^{n} (1/x_i)
In the case of a frequency distribution, the harmonic mean is given by
H.M. = 1 / [(1/N) \sum_{i=1}^{n} (f_i / x_i)], where N = \sum_{i=1}^{n} f_i
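Averaging speeds over equal distances is a common use of the formula above. Here is a minimal Python sketch with made-up speeds; `harmonic_mean` is a hypothetical helper that takes optional frequencies for the grouped case.

```python
def harmonic_mean(values, freqs=None):
    """N divided by the sum of f / x, i.e. the reciprocal of the mean reciprocal."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when an observation is zero")
    total = sum(freqs)
    return total / sum(f / x for x, f in zip(values, freqs))


# Hypothetical trip: the same distance covered at 40 km/h and then at 60 km/h.
average_speed = harmonic_mean([40, 60])
print(average_speed)
```

The result is below the simple average of 50 km/h because more time is spent travelling at the slower speed.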
➢ Properties of Harmonic Mean
1. If all the observations of a variable are equal to the same constant, say k,
then the harmonic mean of the observations is also k
2. For positive observations, the harmonic mean is never greater than the
geometric mean, which in turn is never greater than the arithmetic mean
(H.M. ≤ G.M. ≤ A.M.)
Advantages of Harmonic Mean
• A harmonic mean is rigidly defined
• It is based upon all the observations
• The fluctuations of the observations do not affect the harmonic mean
• More weight is given to smaller items
Disadvantages of Harmonic Mean
• Not easily understandable
• Difficult to compute
Relationship between Arithmetic Mean, Harmonic Mean, and Geometric Mean of Two
Numbers
For two numbers x and y, let x, a, y be a sequence of three numbers.
• If x, a, y is an arithmetic progression then 'a' is called the arithmetic mean.
• If x, a, y is a geometric progression then 'a' is called the geometric mean.
• If x, a, y is a harmonic progression then 'a' is called the harmonic mean.
Let AM = arithmetic mean,
GM = geometric mean,
and HM = harmonic mean.
The relationship between the three is given by the formula
AM × HM = GM^2
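The relationship can be verified for any pair of positive numbers. This is a minimal Python sketch; the values x = 4 and y = 9 are chosen only for illustration.

```python
import math

x, y = 4.0, 9.0                 # illustrative positive numbers

am = (x + y) / 2                # arithmetic mean
gm = math.sqrt(x * y)           # geometric mean
hm = 2 * x * y / (x + y)        # harmonic mean of two numbers

product = am * hm               # should equal gm squared
print(am, gm, hm, product)
```

For this pair the three means also satisfy HM ≤ GM ≤ AM.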
(II) Median
➢ Median is the middle value of the series when arranged in order of the magnitude.
➢ When a series is divided into more than two parts, the dividing values are called
Partition values.
How to calculate the median?
The very first thing to be done with raw data is to arrange them in ascending or
descending order.
In Layman’s terms:
For an odd number of items: Median = the middle number
For example: 5, 8, 10, 6, 15
Arrange in ascending order: 5, 6, 8, 10, 15
As we have 5 numbers, the middle number is the 3rd number, which can also
be calculated as
{(n+1)/2}th number = (5+1)/2 = 6/2 = 3rd number, which is 8
So the median is 8
For an even number of items: there is no value exactly in the middle of the
series. In such a situation the median is conventionally taken to be halfway
between the two middle items.
For example: 19, 8, 9, 6, 12, 5
Ascending order: 5, 6, 8, 9, 12, 19
Here we have 6 numbers, so (n+1)/2 = (6+1)/2 = 3.5
So find the average of the 3rd and 4th numbers = (8+9)/2 = 8.5
So 8.5 is the median
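The two cases can be sketched in a few lines of Python. The `median` function is a hypothetical helper, applied here to the five values 5, 6, 8, 10, 15 and to the six values of the even example.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th item
    return (ordered[mid - 1] + ordered[mid]) / 2  # halfway between the middle pair


odd_median = median([5, 8, 10, 6, 15])
even_median = median([19, 8, 9, 6, 12, 5])
print(odd_median, even_median)
```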
For grouped data, the median is calculated using the formula:
Median = l + [(N/2 - c) / f] \times h
where
l = lower limit of the median class,
c = cumulative frequency of the class preceding the median class,
f = frequency of the median class,
h = size of the median class interval,
N = total number of observations, i.e. the sum of frequencies
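A minimal Python sketch of the grouped-data calculation follows. It assumes the standard interpolation formula, Median = l + ((N/2 - c) / f) × h, with the symbols defined above and equal class widths; the frequency table and the name `grouped_median` are hypothetical.

```python
def grouped_median(lower_limits, width, freqs):
    """Interpolated median for classes of equal width."""
    n_total = sum(freqs)
    if n_total == 0:
        raise ValueError("no observations")
    half = n_total / 2
    cumulative = 0                      # c: cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:      # first class whose cumulative count reaches N/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]
grouped = grouped_median(lower_limits, 10, freqs)   # median class is 20-30
print(round(grouped, 2))
```

Here N = 40, so N/2 = 20 falls in the 20-30 class, with c = 13, f = 12 and h = 10.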
Related Positional Measures:
The median divides the series into two equal parts.
Similarly there are certain other measures which divide the series into certain equal
parts
➢ Quartiles:
Quartiles are the measures which divide the data into four equal parts; each part
contains an equal number of observations.
➢ There are three quartiles.
➢ If a statistical series is divided into four equal parts, the end value of each part is
called a quartile and denoted by ‘Q’.
1. The lower half of a data set is the set of all values that are to the left of the
median value when the data has been put into increasing order.
2. The upper half of a data set is the set of all values that are to the right of the
median value when the data has been put into increasing order.
1. The first quartile, denoted by Q1, is the median of the lower half of the data
set. This means that about 25% of the numbers in the data set lie below Q1 and
about 75% lie above Q1.
2. The second quartile also called median and denoted by Q2, has 50% of the
items below it and 50% of the items above it.
3. The third quartile, denoted by Q3, is the median of the upper half of the data
set. This means that about 75% of the numbers in the data set lie below Q3 and
about 25% lie above Q3.
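The "median of each half" description above can be sketched as follows. This is a minimal Python example on made-up data; it assumes the middle value is left out of both halves when the number of items is odd. Other conventions, such as taking the ((N+1)/4)th item, can give slightly different quartiles.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole set and upper half."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return _median(lower), _median(ordered), _median(upper)


q1, q2, q3 = quartiles([3, 5, 7, 8, 9, 11, 13, 15])   # illustrative data
print(q1, q2, q3)
```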
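Formulae for calculating the median and partition values:
Measure | Individual Series | Discrete Series | Continuous Series
Median | Size of ((N+1)/2)th item | Size of ((N+1)/2)th item | Size of (N/2)th item; Median = l + [(N/2 - c) / f] \times h
First Quartile | Size of ((N+1)/4)th item | Size of ((N+1)/4)th item | Size of (N/4)th item; Q_1 = l + [(N/4 - c) / f] \times h
Third Quartile | Size of (3(N+1)/4)th item | Size of (3(N+1)/4)th item | Size of (3N/4)th item; Q_3 = l + [(3N/4 - c) / f] \times h
Here l, c, f and h refer to the class in which the measure lies.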
➢ Deciles
Deciles divide the series into ten equal parts and are generally denoted by D.
➢ There are nine deciles, D1, D2, …, D9, which are called the first decile,
second decile and so on.
➢ Percentiles
Percentiles divide the series into hundred equal parts and are generally denoted
by P. There are ninety-nine percentiles, P1, P2, …, P99.
Merits of Median:
(i) Simple measure of central tendency.
(ii) It is not affected by extreme observations.
(iii) Possible even when data is incomplete.
(iv) Median can be determined by graphic presentation of data.
(v) It has a definite value.
(vi) Simple to calculate and understand.
(vii) It is a positional value, not a calculated value.
Demerits of Median:
(i) Not based on all the items in the series, as it indicates the value of the
middle items.
(ii) Not suitable for algebraic treatment.
(iii) Arranging the data in ascending order takes much time.
(iv) Affected by fluctuations of items.
(v) It cannot be computed exactly where the number of items in a series is even.
III. Mode
• Mode is that value of the variable which occurs or repeats itself the maximum
number of times.
• The mode is the most “fashionable” size in the sense that it is the most common
and typical, and is defined by Zizek as “the value occurring most frequently in a
series of items and around which the other items are distributed most densely.”
• In the words of Croxton and Cowden, the mode of a distribution is the value at
the point where the items tend to be most heavily concentrated.
• According to A.M. Tuttle, mode is the value which has the greatest frequency
density in its immediate neighbourhood.
• In the case of individual observations, the mode is that value which is repeated
the maximum number of times in the series. The mode can also be denoted by the
letter ‘Z’.
Mode for continuous series:
Mode = L + [(f_1 - f_0) / (2 f_1 - f_0 - f_2)] \times i
where
L = lower limit of the class where the mode lies (the modal class),
i = class interval,
f_0 = frequency of the class preceding the modal class,
f_1 = frequency of the modal class,
f_2 = frequency of the class succeeding the modal class.
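A minimal Python sketch of the grouped-data mode follows. It assumes the standard interpolation formula, Mode = L + ((f1 - f0) / (2·f1 - f0 - f2)) × i, with the symbols defined above, equal class widths and a single modal class; the frequency table and the name `grouped_mode` are hypothetical.

```python
def grouped_mode(lower_limits, width, freqs):
    """Interpolated mode for classes of equal width with one modal class."""
    k = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # preceding class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0       # succeeding class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is ill-defined for this frequency pattern")
    return lower_limits[k] + (f1 - f0) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
mode_value = grouped_mode([0, 10, 20, 30, 40], 10, [5, 8, 12, 10, 5])
print(round(mode_value, 2))
```

With f1 = 12, f0 = 8 and f2 = 10, the mode lies in the 20-30 class, pulled toward the neighbouring class with the larger frequency.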
Merits of mode:
(i) Simple and popular measure of central tendency.
(ii) It can be located graphically with the help of a histogram.
(iii) Less effect of marginal values.
(iv) No need of knowing all the items of the series.
(v) It is the most representative value in the given series.
(vi) It is less affected by extreme values.
Demerits of mode:
(i) It is an uncertain measure.
(ii) It is not capable of algebraic treatment.
(iii) Procedure of grouping is complex.
(iv) It is not based on all observations.
(v) For bi-modal and tri-modal series, it is difficult to calculate.
(vi) Its value is not based on each and every item of the series.
(vii) If items are identical, it is difficult to identify the modal value.
Relation among mean, median and mode:
For a moderately skewed distribution, the empirical relation is
Mode = 3 Median – 2 Mean
DISPERSION
• According to Dr. Bowley, “dispersion is the measure of the variation between
items.”
• Dispersion refers to the variation of the items around an average.
• Measures of dispersion measure how spread out a set of data is.
➢ Objectives of Dispersion
a) To determine the reliability of an average.
b) To compare the variability of two or more series.
c) It serves the basis of other statistical measures such as correlation etc.
d) It serves the basis of statistical quality control.
➢ Properties of good measure of Dispersion

a) It should be easy to understand.
b) Easy to calculate.
c) Rigidly defined
d) Based on all observations.
e) Should not be unduly affected by extreme values.
➢ Classification of Measures of Dispersion
Absolute Measures | Relative Measures
Range | Coefficient of Range
Quartile Deviation | Coefficient of Quartile Deviation
Standard Deviation | Coefficient of Standard Deviation
Mean Deviation | Coefficient of Mean Deviation
➢ Range:
• It is the simplest method of studying dispersion. Range is the difference
between the largest value and the smallest value of a series.
• While computing range, we do not take into account the frequencies of
different groups.
• If X_max and X_min are the two extreme observations then
Range = X_max – X_min
Co-efficient of Range = (X_max – X_min) / (X_max + X_min)
Merits of Range
1. It is simple to understand and easy to calculate.
2. It is widely used in statistical quality control.
Demerits of Range
1. It is affected by extreme values in the series.
2. It cannot be calculated in case of open end series.
3. It is not based on all items.
➢ Quartile Deviation
✓ ‘Quartile Deviation’ takes into account only the values of the ‘Upper
quartile’ (Q3) and the ‘Lower quartile’ (Q1).
✓ Inter-quartile range is the difference between the Upper Quartile (Q3) and
the Lower Quartile (Q1).
✓ Quartile deviation is half of the inter-quartile range, so it is also called
the ‘semi-inter-quartile range’.
✓ ‘Quartile Deviation’ can be obtained as:
Inter-quartile range = Q3 – Q1
Quartile deviation (semi-inter-quartile range) = (Q3 – Q1) / 2
Co-efficient of quartile deviation = (Q3 – Q1) / (Q3 + Q1)
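The range and quartile-based measures can be computed side by side. This is a minimal Python sketch on made-up data; it takes the quartile deviation as half the inter-quartile range, as stated above, and the quartile values Q1 = 6 and Q3 = 12 are assumed for illustration.

```python
def range_measures(values):
    """Return (range, coefficient of range)."""
    lo, hi = min(values), max(values)
    return hi - lo, (hi - lo) / (hi + lo)


def quartile_measures(q1, q3):
    """Return (inter-quartile range, quartile deviation, coefficient of Q.D.)."""
    iqr = q3 - q1
    return iqr, iqr / 2, iqr / (q3 + q1)


data = [3, 5, 7, 8, 9, 11, 13, 15]                 # illustrative data
value_range, coeff_range = range_measures(data)
iqr, qd, coeff_qd = quartile_measures(6, 12)       # assumed Q1 = 6, Q3 = 12
print(value_range, round(coeff_range, 3), iqr, qd, round(coeff_qd, 3))
```

The two coefficients are unit-free ratios, which is what makes them suitable for comparing series measured in different units.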
Merits of Q.D
1. Easy to compute
2. Less affected by extreme values.
3. Can be computed in open ended series.
4. All the drawbacks of Range are overcome by quartile deviation.
Demerits of Q.D
1. Not based on all observations
2. It ignores 50% of the data
3. It is influenced by change in sample and suffers from instability.
➢ Mean Deviation:
Mean deviation (also called average deviation) is defined as the value obtained
by taking the average of the deviations of the various items from a measure of
central tendency (Mean, Median or Mode), ignoring negative signs, i.e. using
absolute deviations: M.D. = \sum |d| / N.
Merits of Mean Deviation
1. Based on all observations.
2. It is less affected by extreme values.
3. Simple to understand and easy to calculate.
4. It is a good index of score density at the middle of the distribution.
5. Quartiles are useful in indicating the skewness of a distribution
Demerits of Mean Deviation
1. It ignores ± signs in deviations.
2. It is difficult to compute when deviations come in fractions.
3. M.D. and its co-efficient taken from X, M and Z often differ.
➢ Standard Deviation:
“Standard deviation or S.D. is the square root of the mean of the squared deviations of
the individual scores from the mean of the distribution.”
Standard deviation is calculated as the square root of average of squared deviations
taken from actual mean.
✓ It is denoted by a Greek letter sigma, σ.
✓ It is also called root mean square deviation.
✓ The square of standard deviation is called ‘variance’. It is denoted by σ².
➢ Properties of Standard Deviation:

1. If each variate value is increased by the same constant value, the value of S.D.
of the distribution remains unchanged
2. When a constant value is subtracted from each variate, the value of S.D. of the
new distribution remains unchanged
3. If each observed value is multiplied by a constant value, S.D. of the new
observations will also be multiplied by the same constant
4. If each observed value is divided by a constant value, S.D. of the new
observations will also be divided by the same constant.
5. Thus, to conclude:
SD is independent of change of origin (addition, subtraction),
but
SD is dependent on change of scale (multiplication, division).
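These properties are easy to confirm numerically. The following is a minimal Python sketch on made-up data, using the population form of the standard deviation; `population_sd` is a hypothetical helper.

```python
import math


def population_sd(values):
    """Square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]                        # illustrative data, mean 5

sd = population_sd(data)
sd_shifted = population_sd([x + 10 for x in data])     # change of origin
sd_scaled = population_sd([3 * x for x in data])       # change of scale
print(sd, sd_shifted, sd_scaled)
```

Adding 10 to every value leaves the result unchanged, while multiplying every value by 3 triples it.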
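Standard deviation for a sample: s = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}
Standard deviation for a population: \sigma = \sqrt{\sum (x - \mu)^2 / N}
The standard deviation for ungrouped data is defined as
\sigma = \sqrt{\sum d^2 / N}
where d = deviation of individual scores from the mean
(some authors use ‘x’ as the deviation of individual scores from the mean);
\sum = sum total of;
N = total number of cases.
Computation of S.D. (Grouped data):
\sigma = \sqrt{\sum f d^2 / N}, where f = frequency of each class, d = deviation
of the class mid-point from the mean, and N = \sum f.
Co-efficient of Standard deviation = S.D. / Mean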
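The difference between the sample and population versions can be seen with Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n - 1. This is a small sketch on made-up data, treating the same numbers first as a whole population and then as a sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]              # illustrative data, mean 5

population_sd = statistics.pstdev(data)      # divides by N
sample_sd = statistics.stdev(data)           # divides by n - 1

# Coefficient of standard deviation = S.D. / Mean
coeff_sd = population_sd / statistics.mean(data)
print(population_sd, round(sample_sd, 4), coeff_sd)
```

The sample figure is always a little larger, and the gap shrinks as the number of observations grows.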
Merits of Standard Deviation
i. Rigidly defined and its value is always definite.
ii. Based on all observations
iii. Takes algebraic signs into consideration
iv. Amenable to further algebraic treatment
v. It is less affected by fluctuations of sampling.
vi. It provides a standard unit of measure that possesses comparable meaning
from one test to another.
vii. Moreover, the normal curve is directly related to S.D.
Demerits of Standard Deviation
i. Difficult to understand and compute.
ii. Affected by extreme items.
➢ Uses of S.D:
(i) When the most accurate, reliable and stable measure of variability is wanted.
(ii) When more weight is to be given to extreme deviations from the mean.
(iii) When coefficient of correlation and other statistics are subsequently computed.
(iv) When measures of reliability are computed.
(v) When scores are to be properly interpreted with reference to the normal curve.
(vi) When standard scores are to be computed.
(vii) When we want to test the significance of the difference between two statistics.
(viii) When coefficient of variation, variance, etc. are calculated.
➢ Coefficient of Dispersion
• Whenever we want to compare the variability of two series which differ widely
in their averages, or whose units of measurement are different, we need to
calculate the coefficients of dispersion along with the measures of dispersion.
• A coefficient of dispersion (C.D.) can be based on each of the different
measures of dispersion (range, quartile deviation, mean deviation, standard
deviation).
 The coefficient of variation (C.V.) is 100 times the coefficient of dispersion based
on standard deviation.
C.V. = 100 x (S.D. / Mean)
 CV gives the percentage which σ is of the test mean. It is thus a ratio which
is independent of the units of measurement.
 CV is restricted in its use owing to certain ambiguities in its interpretation. It is
defensible when used with ratio scales—scales in which the units are equal and
there is a true zero or reference point.
• Two cases arise in the use of CV with ratio scales:
(1) When units are dissimilar, and
(2) When M’s (means) are unequal, the units of the scale being the same.
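The second case, unequal means on the same scale, can be illustrated with a short Python sketch. The two series are made up and share the same standard deviation; `coefficient_of_variation` is a hypothetical helper applying C.V. = 100 × (S.D. / Mean) with the population standard deviation.

```python
import statistics


def coefficient_of_variation(values):
    """C.V. = 100 * (S.D. / Mean), using the population S.D."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)


series_a = [2, 4, 4, 4, 5, 5, 7, 9]                   # mean 5, S.D. 2
series_b = [102, 104, 104, 104, 105, 105, 107, 109]   # mean 105, S.D. 2

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
print(round(cv_a, 2), round(cv_b, 2))
```

Both series have the same absolute spread, but relative to its mean the first is far more variable.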
➢ Types of Distributions
Bernoulli Distribution
 A Bernoulli distribution has only two possible outcomes, namely 1 (success) and
0 (failure), and a single trial.
 So the random variable X which has a Bernoulli distribution can take value 1 with
the probability of success, say p, and the value 0 with the probability of failure,
say q or 1-p.
• For example, when an unbiased coin is tossed,
✓ the occurrence of a head denotes success, and
✓ the occurrence of a tail denotes failure.
 Probability of getting a head = Probability of getting a tail = 0.5 since there are
only two possible outcomes.
 The probability mass function is given by:
P(X = x) = p^x (1-p)^{1-x}, where x ∈ {0, 1}.
✓ It can also be written as
P(X = x) = 1 - p for x = 0, and P(X = x) = p for x = 1.
• The probabilities of success and failure need not be equal.
✓ For instance, consider the result of a fight between me and the Undertaker.
He is pretty much certain to win, so in this case the probability of my
success is 0.15 while that of my failure is 0.85.
• Here, the probability of success (p) is not the same as the probability of
failure (q).
The expected value of a random variable X from a Bernoulli distribution is found
as follows:
E(X) = 1*p + 0*(1-p) = p
E(X) = p
The variance of a random variable from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
Var(X) = p(1-p)
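The expectation and variance can be recomputed directly from the two possible outcomes. This is a minimal Python sketch using p = 0.15 from the example above; `bernoulli_pmf` is a hypothetical helper for P(X = x) = p^x (1 - p)^(1 - x).

```python
def bernoulli_pmf(x, p):
    """P(X = x) for x in {0, 1}; zero for any other value."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)


p = 0.15    # probability of success from the example above

mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))              # E(X)
second_moment = sum(x * x * bernoulli_pmf(x, p) for x in (0, 1)) # E(X^2)
variance = second_moment - mean ** 2                             # E(X^2) - [E(X)]^2
print(mean, variance)
```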
Binomial Distribution
• A distribution where only two outcomes are possible in each trial, such as
success or failure, gain or loss, win or lose, and where the probability of
success (and hence of failure) is the same for all the trials is called a
Binomial Distribution.
 The outcomes need not be equally likely.
✓ For example, if the probability of success in an experiment is 0.2 then the

probability of failure can be easily computed as q = 1 – 0.2 = 0.8.
 The properties of a Binomial Distribution are:
1. Each trial is independent.
2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are
identical.)
The mathematical representation of the binomial distribution (probability mass
function) is given by:
P(X = x) = \binom{n}{x} p^x q^{n-x}, x = 0, 1, …, n, where q = 1 - p
The parameters of the binomial distribution are n and p.
The mean of a binomial distribution is given by:
Mean = µ = n*p
The variance of a binomial distribution is given by:
Variance = Var(X) = n*p*q
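The mean and variance formulas can be checked against the probability mass function itself. This is a minimal Python sketch, assuming the standard binomial form C(n, x) · p^x · q^(n - x) and using a fair coin tossed 20 times as the example; `binomial_pmf` is a hypothetical helper.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 20, 0.5      # 20 tosses of a fair coin, success = tail

probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(probabilities)                                          # should be 1
mean = sum(x * pr for x, pr in enumerate(probabilities))            # should be n*p
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))  # n*p*q
p_ten_tails = binomial_pmf(10, n, p)
print(round(total, 6), mean, variance, round(p_ten_tails, 4))
```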
Examples of binomial experiments
• Tossing a coin 20 times to see how many tails occur.
• Asking 200 people if they watch ABC news.
• Rolling a die to see if a 5 appears.
Examples which aren't binomial experiments
• Rolling a die until a 6 appears (not a fixed number of trials)
• Asking 20 people how old they are (not two outcomes)
• Drawing 5 cards from a deck for a poker hand (done without replacement, so
not independent)
Normal distribution :
• The normal distribution describes the behaviour of a great many naturally
occurring phenomena.
• The sum of a large number of (small) independent random variables often turns
out to be approximately normally distributed, which contributes to its
widespread application.
• A distribution is known as a normal distribution if it has the following
characteristics:
1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line
x=μ.
3. The total area under the curve is 1.
4. The mean divides the curve into 2 equal parts
5. Its quartile deviation, Q.D. = 2/3 σ (approximately)
6. Its mean deviation, M.D. = 4/5 σ (approximately)
7. The X axis is an asymptote to the curve (the curve approaches the axis ever
more closely in both directions but never touches it)
8. It is unimodal distribution
9. The area under the curve within the central limits is as under:
Limits | Area %
µ ± σ | 68.2
µ ± 1.96σ | 95
µ ± 2σ | 95.4
µ ± 3σ | 99.7
• A normal distribution differs markedly from a binomial distribution.
• However, as the number of trials approaches infinity, the shape of the
binomial distribution becomes quite similar to that of the normal distribution.
• The PDF of a random variable X following a normal distribution is given by:
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}, -∞ < x < ∞
The parameters of the normal distribution are µ (mean) and σ (standard
deviation).
The mean and variance of a random variable X which is normally distributed are
given by:
Mean = E(X) = µ
Variance = Var(X) = σ²
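The areas within the central limits can be reproduced in Python. This is a minimal sketch that assumes the standard normal density, f(x) = exp(-(x - µ)² / (2σ²)) / (σ√(2π)), and uses the identity that the area within µ ± kσ equals erf(k / √2); the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and S.D. sigma."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))


def area_within(k):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2, 3):
    print(f"mu ± {k} sigma: {100 * area_within(k):.2f}%")

peak = normal_pdf(0, 0, 1)    # height of the standard normal curve at its mean
```

The printed percentages should agree with the table of limits and areas to the precision quoted there.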
Poisson distribution
• The Poisson distribution is a discrete probability distribution with a single
parameter ‘m’.
• It is obtained as a limiting case of the binomial distribution when the number
of trials is very large.
• If the probability of success “p” is small and the number of trials “n” is
large, the binomial distribution can be approximated by the Poisson
distribution.
• As m increases, the distribution shifts to the right.
• All Poisson distributions are skewed to the right. This is the reason why the
Poisson distribution has been called the probability distribution of rare
events.
➢ Assumptions of Poisson distribution:
A distribution is called a Poisson distribution when the following assumptions
are valid:
1. Any successful event should not influence the outcome of another successful
event.
2. The probability of success over an interval depends only on its length: it is
the same for any two intervals of equal length.
3. The probability of success in an interval approaches zero as the interval
becomes smaller.
If a distribution satisfies the above assumptions, then it is a Poisson
distribution.
Some examples are
1. The number of emergency calls recorded at a hospital in a day.
2. The number of thefts reported in an area on a day.
3. The number of customers arriving at a salon in an hour.
4. The number of suicides reported in a particular city.
5. The number of printing errors at each page of the book.
Let X be the number of events occurring in a given interval. X is then called a
Poisson random variable, and the probability distribution of X is called the
Poisson distribution.
The probability distribution of X following a Poisson distribution is given by:
P(X = x) = e^{-\mu} \mu^x / x!, x = 0, 1, 2, …
The mean µ is the parameter of this distribution.
µ is also defined as λ times the length of the interval, where λ is the rate at
which events occur.
The mean and variance of X following a Poisson distribution:
Mean = E(X) = µ
Variance = Var(X) = µ
Standard deviation = SD(X) = √µ
(For an interval of unit length, µ = λ.)
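The link between the binomial and Poisson distributions can be seen numerically. This is a minimal Python sketch, assuming the standard Poisson form P(X = x) = e^(-µ) µ^x / x! and a made-up case with n = 1000 trials and p = 0.003, so that µ = np = 3; the function names are hypothetical.

```python
import math


def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) * mu^x / x!"""
    return math.exp(-mu) * mu ** x / math.factorial(x)


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 1000, 0.003      # many trials, small probability of success
mu = n * p              # Poisson mean used for the approximation

exact = binomial_pmf(2, n, p)
approx = poisson_pmf(2, mu)

# Mean and variance recovered from the pmf (terms beyond x = 60 are negligible).
mean = sum(x * poisson_pmf(x, mu) for x in range(60))
variance = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in range(60))
print(round(exact, 5), round(approx, 5), round(mean, 6), round(variance, 6))
```

The two probabilities agree to about three decimal places, and both the mean and the variance come out equal to µ.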
Exponential Distribution
• The exponential distribution is widely used for survival analysis.
• It is used to model lifetimes, from the expected life of a machine to the
expected life of a human.
• Consider calls arriving at a call center. What about the interval of time
between the calls?
• Here, the exponential distribution comes to our rescue: it models the interval
of time between the calls.
Other examples are:
1. Length of time between metro arrivals,
2. Length of time between arrivals at a gas station
3. The life of an Air Conditioner
A random variable X is said to have an exponential distribution if its PDF is:
f(x) = λ e^{-λx} for x ≥ 0, and f(x) = 0 for x < 0
 Here λ > 0 is the parameter of the distribution, often called the rate
parameter.
 The distribution is supported on the interval [0, ∞).
 If a random variable X has this distribution, we write X ~ Exp(λ).
For survival analysis, λ is called the failure rate of a device at any time t, given that
it has survived up to t.
Mean and Variance of a random variable X following an exponential distribution:
Mean = E(X) = 1/λ
Variance = Var(X) = (1/λ)²
Standard Deviation = SD(X) = 1/λ
The standard deviation is equal to the mean.
• The greater the rate, the faster the curve drops, and
• the lower the rate, the flatter the curve.
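The role of the rate parameter can be sketched in Python. This minimal example assumes the standard exponential density f(x) = λ·e^(-λx) for x ≥ 0 and its cumulative form 1 - e^(-λx); the rate of 0.5 calls per minute is made up, and the function names are hypothetical.

```python
import math


def exp_pdf(x, rate):
    """Density of the exponential distribution; zero for negative x."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def exp_cdf(x, rate):
    """P(X <= x), the probability that the waiting time is at most x."""
    return 1 - math.exp(-rate * x) if x >= 0 else 0.0


rate = 0.5                      # hypothetical: 0.5 calls per minute
mean_wait = 1 / rate            # mean = 1 / lambda
sd_wait = 1 / rate              # standard deviation equals the mean

p_within_mean = exp_cdf(mean_wait, rate)    # chance the next call arrives within 2 minutes
print(mean_wait, sd_wait, round(p_within_mean, 4))

# A higher rate gives a curve that starts higher and drops faster.
print(exp_pdf(0, 2.0), exp_pdf(1, 2.0), exp_pdf(0, 0.5), exp_pdf(1, 0.5))
```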
Data collection:
 Collection of data is the first and most important stage in any statistical survey.
The method for collection of data depends upon various factors such as
objective, scope, nature of investigation and availability of resources.
Sources of Data
There are two sources of data in Statistics.
1. Statistical sources refer to data that are collected for some official purposes and
include censuses and officially conducted surveys.
2. Non-statistical sources refer to the data that are collected for other administrative
purposes or for the private sector.
✓ Statistical Survey: A statistical survey is normally conducted using a
sample. It is also called a sample survey. It is the method of collecting sample
data and analyzing it using statistical methods. This is done to make
estimations about population characteristics.
✓ Census: In contrast to a sample survey, a census is based on all items of the
population, which are then analyzed. Data collection happens for a specific
reference period. For example, the Census of India is conducted every 10
years. Other censuses are conducted roughly every 5-10 years. Data is collected
using questionnaires that may be mailed to the respondents. Responses can also
be collected over other modes of communication like the telephone.
✓ Register: Registers are basically storehouses of statistical information from
which data can be collected and analysis can be made. Registers tend to be
detailed and extensive. It is beneficial to use data from registers as it is
reliable. Two or more registers can be linked together based on common
information for even more relevant data collection.
Types of Data
 There are two types of data – primary data and secondary data.
1. Primary data is the data collected for the first time keeping in view the
objective of the survey. Interviews, questionnaires and telephone/mail surveys
are all ways of collecting primary data.
2. Secondary data is any information that is used for the current investigation but
is obtained from data that has been collected and used by some other
agency or person in a separate investigation or survey.
 Both primary and secondary data may be collected either by census or by
sampling methods. Based on how accurate data is required for statistical
surveys, appropriate methods can be adopted.
Let’s learn data collection in detail:
➢ Primary data
 Primary data is the one, which is collected by the investigator for the purpose
of a specific inquiry or study.
 Such data is original in character and is generated by a survey conducted by
individuals or a research institution or any organization.
 Such data is likely to be more reliable. However, the cost of collecting it is
much higher.
 Primary data is collected by either a census method or a sampling method.
Collection of primary data is done by a suitable method from the following:
1. Direct personal observation: In the direct personal observation method, the
investigator collects data by having direct contact with the units of
investigation. The accuracy of data depends upon the ability, training and
attitude of the investigator.
2. Indirect oral interview:
 Indirect oral interview is used when the area to be covered is large.
 The investigator collects the data from a third party or a witness or the
head of an institution.
 This method is generally used by the police department in cases related to
enquiries on the cause of fires, thefts or murders.
 Enquiry committees appointed by governments use this method to get
people’s views and every possible detail regarding the enquiry.
 This method suits best when direct sources do not exist or cannot be relied
upon or would be unwilling to take part in the
3. Information through agencies:
 This method of collecting information through local agencies or
correspondents is generally adopted by newspapers and television
channels.
 Local agents are appointed in different parts of the area under investigation.
 They send the desired information at regular intervals. This method is used
where the area to be covered is very large and periodic information is
required.
4. Information through mailed questionnaires:
 Under this method, information is collected through questionnaires.
 The questionnaires are filled with questions pertaining to the investigation.
 They are sent to the respondents with a covering letter soliciting
cooperation from the respondents (respondents are the people who respond
to questions in the questionnaire).
 The respondents are asked to give correct information and to mail the
questionnaire back.
5. Information through a schedule filled by investigators:
 Information can be collected through schedules filled by investigators
through personal contact. In order to get reliable information, the
investigator should be well trained, tactful, unbiased and hard working.
 A schedule is suitable for an extensive area of investigation through
investigator’s personal contact.
 The problem of non-response is minimized. There is a difference between a
schedule and a questionnaire.
 A schedule is a form that the investigator fills personally, while surveying
the units or individuals from the sample (respondent).
 A questionnaire is a form sent (usually mailed) by an investigator to
respondents. The respondent has to fill it and then send it back to the
investigator.
➢ Secondary data:
 Any information, that is used for the current investigation but is obtained from
some data, which has been collected and used by some other agency or
person in a separate investigation, or survey, is known as secondary data.
 They are available in a published or unpublished form.
 In published form, secondary data is available in research papers, newspapers,
magazines, government publications, international publications, and websites.
 Secondary data is collected for different purposes. Therefore, care should be
exercised while using it.
 The accuracy, reliability, objectives and scope of secondary data should be
examined thoroughly before use.
 Secondary data may be collected either by census or by sampling methods.
1. Published sources: The various sources of published data are:
• Reports and official publications of international and national
organizations as well as central and state governments
• Publications of several local bodies such as municipal corporations and
district boards
• Financial and economic journals
• Annual reports of various companies
• Publications brought out by research agencies and research scholars
• Some of the journals (both academic and non-academic) are published at
regular intervals like yearly, monthly, weekly whereas, other
publications are more ad hoc.
• Internet is a powerful source of secondary data, which can be accessed at
any time for any further analysis of the study.
2. Unpublished sources: Unpublished data such as records maintained by
various government and private offices, studies made by research
institutions and scholars can also be used where necessary.
Although the use of secondary data is economical in terms of expense, time and
manpower requirement, the researcher must be careful in choosing such secondary
data.
Secondary data must possess the following characteristics:
1) Reliability of data
2) Suitability of the data
3) Adequacy of data
BASIS FOR COMPARISON | PRIMARY DATA | SECONDARY DATA
Meaning | Primary data refers to the first hand data gathered by the researcher himself. | Secondary data means data collected by someone else earlier.
Data | Real time data | Past data
Process | Very involved | Quick and easy
Source | Surveys, observations, experiments, questionnaire, personal interview, etc. | Government publications, websites, books, journal articles, internal records etc.
Cost effectiveness | Expensive | Economical
Collection time | Long | Short
Specific | Always specific to the researcher's needs. | May or may not be specific to the researcher's need.
Available in | Crude form | Refined form
Accuracy and Reliability | More | Relatively less
➢ Questionnaire design
Questionnaire design is the process of designing the format and questions in the
survey instrument that will be used to collect data about a particular phenomenon.
In designing a questionnaire, all the various stages of survey design and
implementation should be considered.
Elements of Questionnaire design:

These include the following nine elements:
a. determination of goals, objectives, and research questions;

b. definition of key concepts;
c. generation of hypotheses and proposed relationships;
d. choice of survey mode (mail, telephone, face-to-face, Internet);
e. question construction;
f. sampling;
g. questionnaire administration and data collection;
h. data summarization and analysis;
i. conclusions and communication of results.
Points to be considered before forming the Questionnaire design
1. Initial considerations
i. Type of information required
ii. Type/nature of respondents
iii. Type and method by which survey is to be undertaken
2. Question content
i. Relevance of a question
ii. Clarity of a question
iii. Avoid ambiguous, leading, double-barrelled questions
iv. Ability and willingness of a respondent to answer the questions
3. Question phrasing
i. Style appropriate to target population
ii. Short, Clear and unambiguous questions
iii. Avoid biased words and leading questions
iv. Avoid negative questions
v. Discourage guessing
vi. Do not take anything for granted on the part of the respondents
4. Types of questions
A. Closed ended questions
i. Dichotomous
ii. Multiple choice (4 to 5 options; neutral point)
iii. Likert scale (Agree or disagree)
iv. Semantic differential (scale connecting bipolar words)
v. Importance scale (importance of some attribute)
vi. Rating scale (Excellent to poor)
B. Open ended questions

i. Completely unstructured
ii. Word association (first word that comes to mind …)
iii. Sentence completion
iv. Story completion
v. Picture completion (filling balloons)
vi. Thematic Apperception Test (relate story to picture)
5. Question sequence
i. Logical order
ii. Avoid questions which suggest answers to later questions (bias)
6. Questionnaire layout
i. Good quality paper
ii. As short as possible (20-30 questions)
iii. Use lines, boxes, pictures, etc.
iv. Instructions kept to a minimum but user-friendly
v. Purpose of survey explained at the beginning and guarantee of
confidentiality
vi. What is to be done with the completed questionnaire?
7. Pre-test, revision and final version of questionnaire
i. Uncover faults
ii. Misprints
iii. Grammatical mistakes
iv. Relevance of questions
v. Expected range of answers
➢ Essentials of a good questionnaire
Success of this method of collection of data depends mainly on proper drafting of the
questionnaire. You have to keep the following points in mind while preparing a
questionnaire:
1. The questionnaire should begin with an effort to awaken the respondents’
interest. Important target questions should be asked in the middle of the
opinion survey.
2. The respondent should not take much time in completing the questionnaire. It
should be small and not lengthy.
3. The questions asked should be well structured and unambiguous.
4. The questions asked should be in a proper logical sequence.
5. Questions should be unbiased. The questions in the questionnaire should not
disturb the privacy of the respondents.
6. The questionnaire should not require much writing work.
7. Necessary instructions and a glossary should be given in the covering letter.
8. Questions involving technical jargon and mathematical calculations
should be avoided.
9. All the questions related to personal information (name, income, phone,
address etc) of the respondents should be either optional or asked in the last
section of the questionnaire.
10. A pilot test should be conducted to detect weaknesses in the questionnaire
designed.
➢ Steps in Questionnaire Design:
The task of composing a questionnaire may be considered more an art than a science.
It needs a great deal of experience, expertise, and creativity.
1. Determine the Data to be collected
2. Determine the Method to be used for Data Collection
3. Evaluate the Contents of the Question
4. Decide on Type of Questions and Response Format
5. Decide on Wording of Questions
6. Decide on Questionnaire Structure or Physical Format
7. Pretest, Review and Final Draft
➢ Statistical Survey
A Statistical Survey is a scientific process of collection and analysis of numerical
data.
Surveys differ from each other with regard to their purpose, field of study, scope, and
the source of information. The standard criteria for any statistical study are:
• relevance
• timeliness
• accuracy of data gathered
Surveys are used by businesses to:
• Assess the level of their customer satisfaction
• Find out what products their customers choose
• Determine which section of the population is buying their products
➢ Stages of Statistical surveys
Statistical surveys involve two stages namely –
1. Planning and
2. Execution.
Planning: A properly planned investigation can lead to the best results with the
least cost and time. There are five steps involved in planning the survey.
Steps involved in planning phase:
1. Identify the nature of the problem
2. State the objectives of investigation
3. Define the scope of the investigation
4. Identify the type of data
5. Organize the investigation
Execution phase:
 In Execution phase, controlled methods should be adopted at every stage of
survey to check the accuracy, coverage, methods of measurements, analysis
and interpretation.
 The collected data should be edited, classified, tabulated and presented in the
form of diagrams and graphs.
 The data should be carefully and systematically analyzed and interpreted.
Sampling – Concept, Process and Techniques
➢ Sampling:
The process of selecting a number of individuals for a study in such a way that the
individuals represent the larger group from which they were selected.
• Sample: A sample is “a smaller (but hopefully representative) collection of
units from a population used to determine truths about that population”
• Sampling Frame: A list of all elements or other units containing the elements
in a population. The sampling frame must be representative of the population
• Population: The larger group from which individuals are selected to participate
in a study
• Target population: A set of elements larger than or different from the
population sampled and to which the researcher would like to generalize study
findings.
Process of sampling: The sampling process comprises several stages:
1. Defining the population of concern
2. Specifying a sampling frame, a set of items or events possible to measure
3. Specifying a sampling method for selecting items or events from the frame
4. Determining the sample size
5. Implementing the sampling plan
6. Sampling and data collecting
7. Reviewing the sampling process
➢ Types of sample:
Probability (Random) Samples | Non-Probability Samples
Simple random sample | Convenience sample
Systematic random sample | Purposive sample
Stratified random sample | Quota
Cluster sample |
1. Simple random sample:
✓ It is applicable when population is small, homogeneous & readily
available
✓ All subsets of the frame of a given size are given an equal probability of
selection.
✓ Each element of the frame thus has an equal probability of selection.
✓ It provides for the greatest number of possible samples. This is done by
assigning a number to each unit in the sampling frame.
✓ A table of random numbers or a lottery system is used to determine which
units are to be selected.
Pros:
 Estimates are easy to calculate.
 Simple random sampling is always an EPS (equal probability of selection) design,
but not all EPS designs are simple random sampling.
Cons:
 If the sampling frame is large, this method is impracticable.
 Minority subgroups of interest in the population may not be present in the
sample in sufficient numbers for study.
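The lottery-style selection described above can be sketched in a few lines. This is only an illustration: the frame of 100 numbered units, the sample size and the seed are assumptions, and `simple_random_sample` is a hypothetical helper name.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each subset of size n equally likely."""
    rng = random.Random(seed)      # a seed only makes the draw repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: units numbered 1 to 100
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=42)
print(sorted(sample))
```

Because the draw is made without replacement from the numbered frame, no unit can be selected twice.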
2. Systematic random sample:
 It is applicable when the given population is logically homogeneous.
 Systematic sampling relies on arranging the target population according to
some ordering scheme and then selecting elements at regular intervals through
that ordered list.
 It involves a random start and then proceeds with the selection of every kth
element from then onwards. In this case, k = (population size / sample size).
 It is important that the starting point is not automatically the first in the list, but
is instead randomly chosen from within the first to the kth element in the list.
 In a systematic sample, after you decide the sample size, arrange the elements
of the population in some order and select terms at regular intervals from the
list.
 A simple example would be to select every 10th name from the telephone
directory (an 'every 10th' sample, also referred to as 'sampling with a skip of
10').
ADVANTAGES:
 Sample easy to select
 Suitable sampling frame can be identified easily
 Sample evenly spread over entire reference population
DISADVANTAGES:
 The possible weakness of the method that may compromise the randomness of the
sample is an inherent periodicity of the list, i.e. the sample may be biased.
 Difficult to assess precision of estimate from one survey.
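The random start and fixed skip described above can be sketched as follows. The frame, sample size and function name are assumptions made for illustration, and k is taken as the integer part of population size / sample size.

```python
import random

def systematic_sample(frame, sample_size, seed=None):
    """Pick a random start within the first k units, then every kth unit."""
    k = len(frame) // sample_size          # sampling interval (skip)
    if k < 1:
        raise ValueError("sample size cannot exceed the frame size")
    rng = random.Random(seed)
    start = rng.randrange(k)               # random position among the first k units
    return [frame[start + i * k] for i in range(sample_size)]

# Hypothetical ordered list of 100 names, represented here by numbers
frame = list(range(1, 101))
print(systematic_sample(frame, 10, seed=3))   # an 'every 10th' sample
```

If the ordered list has a hidden cycle whose length matches k, every selected unit falls at the same point in the cycle, which is the periodicity risk noted in the disadvantages.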
3. Stratified random sample
 It is applicable when we can divide our population into groups according to
characteristics of importance for the research.
 The population is divided into two or more groups called strata, according to
some criterion, such as geographic location, grade level, age, or income, and
subsamples are randomly selected from each stratum.
 Every unit in a stratum has the same chance of being selected.
 Adequate representation of minority subgroups of interest can be ensured by
stratification and by varying the sampling fraction between strata as required.
ADVANTAGES:
 More accurate sample
 Can be used for both proportional and non-proportional samples
 Representation of subgroups in the sample
DISADVANTAGES:
 Identification of all members of the population can be difficult
 Identifying members of all subgroups can be difficult.
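A proportional stratified sample, where the same sampling fraction is applied in every stratum, might look like the sketch below. The strata names, their sizes and the 10% fraction are hypothetical, and `stratified_sample` is an illustrative helper, not a standard library function.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random subsample of the given fraction from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        # At least one unit per non-empty stratum, never more than the stratum holds
        n = min(len(units), max(1, round(len(units) * fraction)))
        sample[name] = rng.sample(units, n)
    return sample

# Hypothetical population split by location
strata = {"urban": list(range(60)), "rural": list(range(100, 140))}
sample = stratified_sample(strata, 0.10, seed=1)
print({name: len(units) for name, units in sample.items()})
```

A non-proportional design would simply pass a different fraction for each stratum, for example to over-sample a small minority subgroup.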
4. Cluster Sampling:
 Cluster sampling is an example of 'two-stage sampling'.
 It is the process of randomly selecting intact groups, not individuals, within
the defined population sharing similar characteristics.
 Clusters are locations within which an intact group of members of the
population can be found.
 Examples: neighborhoods, school districts, schools, classrooms, etc.
➢ Selection process
 First stage: a sample of areas is chosen.
 Second stage: a sample of respondents within those areas is selected.
 Population divided into clusters of homogeneous units, usually based on
geographical contiguity.
 Sampling units are groups rather than individuals.
 A sample of such clusters is then selected.
 All units from the selected clusters are studied.
There are two types of cluster sampling methods:
1. One-stage sampling: All of the elements within selected clusters are included
in the sample.
2. Two-stage sampling: A subset of elements within selected clusters is
randomly selected for inclusion in the sample.
[Figure: one-stage sampling and two-stage sampling]
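The difference between the one-stage and two-stage methods can be sketched in code. The classroom clusters and the function name below are hypothetical; leaving `units_per_cluster` unset gives one-stage sampling, and setting it gives two-stage sampling.

```python
import random

def cluster_sample(clusters, n_clusters, units_per_cluster=None, seed=None):
    """Randomly select whole clusters, then keep all or a random subset of their units."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # first stage: pick clusters
    sample = {}
    for name in chosen:
        units = clusters[name]
        if units_per_cluster is None:
            sample[name] = list(units)                  # one-stage: every unit is studied
        else:
            size = min(units_per_cluster, len(units))
            sample[name] = rng.sample(units, size)      # two-stage: random subset of units
    return sample

# Hypothetical clusters: four classrooms of three students each
clusters = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}
print(cluster_sample(clusters, 2, seed=1))                       # one-stage
print(cluster_sample(clusters, 2, units_per_cluster=2, seed=1))  # two-stage
```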
5. Multi-Stage Sampling
 It is the combination of two or more of the methods described above.
 Population is divided into multiple clusters and then these clusters are further
divided and grouped into various sub groups (strata) based on similarity.
 One or more clusters can be randomly selected from each stratum. This
process continues until the cluster can’t be divided anymore.
 For example, a country can be divided into states, cities, and urban and rural
areas, and all the areas with similar characteristics can be merged together to
form a stratum.
Non- probability sampling:
1. Convenience Sampling
 The process of including whoever happens to be available at the time, that is,
readily available and convenient.
 Sometimes also known as grab, opportunity, accidental or haphazard sampling.
 The researcher using such a sample cannot scientifically make generalizations
about the total population from this sample because it would not be
representative enough.
 For example, if the interviewer were to conduct a survey at a shopping center
early in the morning on a given day, the people that he/she could interview
would be limited to those present there at that time. Their views would not
represent those of other members of society in the area, as they might if the
survey were conducted at different times of day and several times per week.
 This type of sampling is most useful for pilot testing.
 In social science research, snowball sampling is a similar technique, where
existing study subjects are used to recruit more subjects into the sample.
Advantages:
 The sample is created quickly without adding any additional burden on the
available resources.
Disadvantages:
 Difficulty in determining how much of the effect (dependent variable) results
from the cause (independent variable)
2. Purposive sample:
 The researcher chooses the sample based on who they think would be
appropriate for the study.
 This is used primarily when there is a limited number of people that have
expertise in the area being researched
 It is the process whereby the researcher selects a sample based on experience
or knowledge of the group to be sampled
 It is also called “judgment” sampling
Advantages:
 Judgment sampling is less time consuming than other sampling techniques
 Judgment sampling allows researchers to go directly to their target population
of interest.
Disadvantages:
 Judgment sampling is prone to researcher bias.
 Potential for inaccuracy in the researcher’s criteria and resulting sample
selections
3. Quota Sampling
 Quota sampling is the non-probability equivalent of stratified sampling that we
discussed earlier.
 It starts with characterizing the population based on certain desired features and
assigns a quota to each subset of the population.
 The population is first segmented into mutually exclusive sub-groups, just as in
stratified sampling.
 Then judgment is used to select subjects or units from each segment based on a
specified proportion.
 For example, an interviewer may be told to sample 200 females and 300 males
between the age of 45 and 60.
 It is this second step which makes the technique one of non-probability
sampling.
 In quota sampling the selection of the sample is non-random.
Advantages:
 This process can be extended to cover several characteristics and varying
degrees of complexity.
Disadvantages:
 People who are less accessible (more difficult to contact, more reluctant to
participate) are under-represented
4. Snowball Sampling
 Just as the snowball rolls and gathers mass, the sample constructed in this way
will grow in size as you move through the process of conducting a survey.
 In this technique, you rely on your initial respondents to refer you to the next
respondents whom you may connect with for the purpose of your survey.
 Snowball sampling can be useful when you need the sample to reflect certain
features that are difficult to find.
 To conduct a survey of people who go jogging in a certain park every morning,
for example, snowball sampling would be a quick, accurate way to create the
sample.
Advantages:
 The costs associated with this method are significantly lower, and you will end
up with a sample that is very relevant to your study.
Disadvantages:
 The clear downside of this approach is that you may restrict yourself to only a
small, largely homogenous section of the population.
➢ Hypothesis Testing
A hypothesis is an assumption that is tested to check whether the inference
drawn from the sample data holds true for the entire population or not.
➢ Hypothesis Testing Procedure
The following steps are followed in hypothesis testing:
1. Set up a Hypothesis:
 The first step is to establish the hypothesis to be tested.
 The statistical hypothesis is an assumption about the value of some unknown
parameter, and the hypothesis provides some numerical value or range of
values for the parameter.
 Here two hypotheses about the population are constructed: the Null
Hypothesis and the Alternative Hypothesis.
 The Null Hypothesis, denoted by H0, asserts that there is no true difference
between the sample statistic and the population parameter, and that any
difference is accidental, caused by fluctuations in sampling.
Thus, a null hypothesis states that
H0: there is no difference between the assumed and actual value of the
parameter.
 The alternative hypothesis, denoted by H1, is the other hypothesis about the
population, which stands true if the null hypothesis is rejected.
 Thus, if we reject H0 then the alternative hypothesis H1 gets accepted.
HYPOTHESIS TESTING
Null hypothesis, H0 | Alternative hypothesis, HA
State the hypothesized value of the parameter before sampling | All possible alternatives other than the null hypothesis.
The assumption we wish to test (or the assumption we are trying to reject). |
E.g. population mean µ = 20 | E.g. µ ≠ 20, µ > 20, µ < 20
There is no difference between coke and diet coke | There is a difference between coke and diet coke
2. Set up a Suitable Significance Level:
 Once the hypothesis about the population is constructed, the researcher has to
decide the level of significance, i.e. the probability of rejecting the null
hypothesis when it is actually true, against which the null hypothesis is
accepted or rejected.
 The significance level is denoted by ‘α’ and is usually defined before the
samples are drawn such that results obtained do not influence the choice.
 In practice, we either take 5% or 1% level of significance.
3. Determining a Suitable Test Statistic:
 After the hypothesis is constructed, and the significance level is decided upon,
the next step is to determine a suitable test statistic and its distribution.
 Most test statistics assume the following form:
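A commonly used general form of a test statistic is shown below. It is given as an assumption, since the expression itself is not reproduced in these notes, and the wording of each term is illustrative.

```latex
\text{Test statistic} =
\frac{\text{sample statistic} - \text{hypothesized value of the parameter}}
     {\text{standard error of the sample statistic}}
```

For example, when testing a single mean with known population standard deviation σ, the sample statistic is the sample mean, the hypothesized value is µ0, and the standard error is σ/√n.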
4. Determining the Critical Region:
 Before the samples are drawn, it must be decided which values of the test
statistic will lead to the acceptance of H0 and which will lead to its rejection.
 The values that lead to rejection of H0 are called the critical region.
5. Performing Computations:
 Once the critical region is identified, we compute several values for the random
sample of size ‘n.’
 Then we will apply the formula of the test statistic as shown in step (3) to
check whether the sample results fall in the acceptance region or the rejection
region.
6. Decision-making:
 Once all the steps are performed, the statistical conclusions can be drawn, and
the management can take decisions.
 The decision involves either accepting the null hypothesis or rejecting it.
 The decision that the null hypothesis is accepted or rejected depends on
whether the computed value falls in the acceptance region or the rejection
region.
Thus, to test the hypothesis, it is necessary to follow these steps systematically so that
the results obtained are accurate and do not suffer from either of the statistical errors,
viz. Type-I error and Type-II error.
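The six steps can be traced in a short sketch. It assumes a two-tailed Z-test for a single mean with a known population standard deviation; the hypothesized mean of 20 echoes the example above, while the sample mean, standard deviation and sample size are made-up numbers.

```python
import math
from statistics import NormalDist

def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test of H0: mu = mu0 against HA: mu != mu0 (sigma known)."""
    # Step 3: test statistic = (sample statistic - hypothesized value) / standard error
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Step 4: critical region is |z| > z_crit for the chosen significance level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Steps 5 and 6: compare the computed value with the critical value and decide
    reject_h0 = abs(z) > z_crit
    return z, z_crit, reject_h0

# Steps 1 and 2: H0: mu = 20, HA: mu != 20, alpha = 5% (sample figures are hypothetical)
z, z_crit, reject_h0 = z_test_mean(sample_mean=21.2, mu0=20, sigma=3, n=36, alpha=0.05)
print(round(z, 2), round(z_crit, 2), reject_h0)
```

Running the same data with alpha = 0.01 widens the acceptance region, so a result that is significant at 5% need not be significant at 1%.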
➢ Type I and Type II Errors

Type I error refers to the situation when we reject the null hypothesis when it is
true (H0 is wrongly rejected).
For example
H0: there is no difference between the two drugs on average.
 Type I error will occur if we conclude that the two drugs produce different
effects when actually there isn’t a difference.
 The probability of making a Type I error when the null hypothesis is true as an
equality is called the level of significance.
 Applications of hypothesis testing that only control the Type I error are often
called significance tests.
 Prob(Type I error) = significance level = α
Type II error
 Type II error refers to the situation when we accept the null hypothesis when
it is false.
 H0: there is no difference between the two drugs on average. Type II error will
occur if we conclude that the two drugs produce the same effect when actually
there is a difference.
 Prob(Type II error) = β
 It is difficult to control for the probability of making a Type II error.
 Statisticians avoid the risk of making a Type II error by using “do not reject
H0” and not “accept H0”.
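The two error probabilities can be made concrete with a small calculation. The sketch below assumes an upper-tailed Z-test with known standard deviation; all numbers are hypothetical, and `type_ii_error` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha=0.05):
    """beta for an upper-tailed z-test of H0: mu = mu0 against HA: mu > mu0."""
    se = sigma / math.sqrt(n)
    # H0 is rejected when the sample mean exceeds this cutoff (Type I risk = alpha)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta = chance the sample mean stays below the cutoff when the true mean is mu_true
    return NormalDist(mu_true, se).cdf(cutoff)

# Hypothetical setting: H0: mu = 20, true mean 21, sigma = 3
for n in (9, 36, 100):
    print(n, round(type_ii_error(20, 21, 3, n), 3))
```

For a fixed α, β shrinks as the sample size grows or as the true mean moves further from the hypothesized value, which is why β is harder to pin down than α: it depends on a true value that is unknown.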
➢ One-Tailed Test and Two-Tailed Test
Two-tailed test
 A two-tailed test will reject the null hypothesis if the sample mean is
significantly higher or lower than the hypothesized mean.
 Appropriate when H0 : µ = µ0 and HA: µ ≠ µ0
One-tailed test
 A one-sided test is a statistical hypothesis test in which the values for which we
can reject the null hypothesis, H0, are located entirely in one tail of the
probability distribution.
 A lower-tailed test will reject the null hypothesis if the sample mean is
significantly lower than the hypothesized mean.
 Appropriate when H0 : µ = µ0 and HA: µ < µ0
 An upper-tailed test will reject the null hypothesis if the sample
mean is significantly higher than the hypothesized mean.
 Appropriate when H0 : µ = µ0 and HA: µ > µ0
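The three cases differ only in which tail area is counted. A minimal sketch, assuming the test statistic follows the standard normal distribution; the function name and the tail labels are illustrative.

```python
from statistics import NormalDist

def p_value(z, tail):
    """Tail probability of a standard normal test statistic z."""
    std_normal = NormalDist()
    if tail == "lower":            # HA: mu < mu0
        return std_normal.cdf(z)
    if tail == "upper":            # HA: mu > mu0
        return 1 - std_normal.cdf(z)
    if tail == "two":              # HA: mu != mu0
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

# The same statistic gives a different verdict depending on the alternative
z = 1.8   # hypothetical computed value
for tail in ("lower", "upper", "two"):
    print(tail, round(p_value(z, tail), 4))
```

H0 is rejected when the p-value is below the chosen α, so the choice of HA has to be made before looking at the data.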
T test:
 The t-statistic was introduced by W.S. Gosset under the pen name “Student”.
 He developed the t-test around 1905 for dealing with small samples in brewing
quality control; it was published in 1908.
 The t-test is used to compare two samples to determine if they came from the
same population.
Conditions for T-test:

1. Limited sample size (n < 30)
2. Variables are approximately normally distributed
3. The sample observations are random and independent.
4. The population standard deviation is unknown
5. If the standard deviation is known, it is best to use the Z-test
Application of T Test
1. Test of Hypothesis about population (One sample t-Test)
2. Difference between the 2 means in case of independent samples (Independent
samples t-Test)
3. Difference between the 2 means in case of dependent samples
(Correlated/Paired/Repeated Measures t-test)
Degrees of Freedom and t test

 Degrees of freedom describe the number of scores in a sample that are free
to vary.
 degrees of freedom = df = n-1
 The larger the degrees of freedom, the more closely the t-distribution
approximates the normal distribution.
 The curve does not touch the X axis.
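[Figure: decision chart for choosing a test, based on whether the population
parameters are known, how many samples there are, and how the samples are related.
Outcomes: Z-Test, One Sample T Test, Independent Samples T Test, Dependent Samples
T Test]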
One sample t-test:
H0 : µ = µ0
Test if a sample mean for a variable differs significantly from a given
population mean (a known value µ0)
Unpaired- or independent samples- t-test:

H0 : µ1 = µ2
Test if the population means estimated by 2 independent samples differ
significantly (e.g. group of male and group of females)
Paired- or dependent- samples t-test:

H0 : µ1 = µ2
Test if the population means estimated by dependent samples differ significantly
(e.g. mean of pre and post treatment for same set of patients)
Test Statistics for T Test:
One sample T Test Independent Sample T Test Paired Sample T Test
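The three statistics can be sketched in code. The forms used are the usual textbook ones and are an assumption here: one sample, t = (x̄ − µ0)/(s/√n) with df = n − 1; independent samples with a pooled variance and df = n1 + n2 − 2; paired samples, a one-sample test on the differences. Function names and data are illustrative.

```python
import math
from statistics import mean, stdev, variance

def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with df = n - 1."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    return t, n - 1

def independent_t(a, b):
    """Pooled-variance t for two independent samples, df = n1 + n2 - 2."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def paired_t(first, second):
    """One-sample t-test on the pairwise differences, df = n - 1."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(first, second)]
    return one_sample_t(diffs, 0)

# Hypothetical pre- and post-treatment scores for the same patients
print(paired_t([5, 7, 9], [4, 5, 6]))
```

The computed t is then compared with the table value of the t-distribution at the stated degrees of freedom.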
➢ Z test
 Given by Prof. Fisher
 The Z-test is applied to compare sample and population means to know if
there’s a significant difference between them.
 Z-Test is used when the coefficient of correlation is not zero
 The most familiar Z-tests are location tests
 The Z-test is based on the standard Normal Distribution
 The Z test is also called the Standard Normal Deviate Test, Standard Normal Test,
Approximate Test and Large Sample Test
➢ Conditions of Z Test
1. Data points should be independent from each other
2. Z-test is preferable when n is greater than 30
3. The variances of the samples should be the same
4. Population variance is known
5. All individuals must be selected at random from the population
6. All individuals must have equal chance of being selected
➢ Application of Z Test
1. Test of significance for single mean
2. Test of significance for difference of means
3. Test of significance for difference of standard deviation (s.d.)
4. Testing a Claim about a Proportion
5. Testing difference of Two proportions
➢ Conditions for acceptance/rejection of null hypothesis
1. If the Table value > Calculated value, we accept the Null Hypothesis
2. If the Table value < Calculated value, we Reject the Null Hypothesis
➢ Table Values
Level of significance | 0.10 | 0.05 | 0.01 | 0.005
1 Tailed Test | ±1.28 | ±1.645 | ±2.33 | ±2.58
2 Tailed Test | ±1.645 | ±1.96 | ±2.58 | ±2.81
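The table values come from the standard normal distribution and can be reproduced as a check. A minimal sketch using Python's standard library; `critical_z` is an illustrative helper name.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Critical value of the standard normal for a 1- or 2-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha equally between the two tails
    return NormalDist().inv_cdf(1 - alpha / tails)

for alpha in (0.10, 0.05, 0.01, 0.005):
    one_tailed = round(critical_z(alpha, 1), 3)
    two_tailed = round(critical_z(alpha, 2), 3)
    print(alpha, one_tailed, two_tailed)
```

Note that the two-tailed value at level α is the same as the one-tailed value at level α/2, which is why 1.645 and 2.58 appear in both rows.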
Test of significance for single mean
Test of significance for difference of means
Test of significance for difference of standard deviation (s.d.)
Testing a Claim about a Proportion
Testing difference of Two proportions
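The usual large-sample test statistics for these five applications are sketched below. They are standard textbook forms given as an assumption, not taken from these notes; the symbols (x̄ for a sample mean, s for a sample s.d., p for a sample proportion, P0 for the claimed proportion, and a pooled proportion for the last test) are introduced here for illustration.

```latex
\begin{align*}
\text{Single mean:} \quad
  & z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \\
\text{Difference of means:} \quad
  & z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \\
\text{Difference of s.d.:} \quad
  & z = \frac{s_1 - s_2}{\sqrt{\dfrac{\sigma_1^2}{2 n_1} + \dfrac{\sigma_2^2}{2 n_2}}} \\
\text{Single proportion:} \quad
  & z = \frac{p - P_0}{\sqrt{P_0 (1 - P_0) / n}} \\
\text{Two proportions:} \quad
  & z = \frac{p_1 - p_2}{\sqrt{\hat{P} (1 - \hat{P}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}},
  \qquad \hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}
\end{align*}
```

Each has the same shape: the difference between what the sample shows and what H0 claims, divided by the standard error of that difference.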
➢ T test v/s Z Test

 Z-test is a statistical hypothesis test that follows a normal distribution while T-test
follows a Student’s T-distribution.
 A T-test is appropriate when handling small samples (n < 30) while a Z-test is
appropriate when handling moderate to large samples (n > 30).
 T-test is more adaptable than Z-test since Z-test will often require certain
conditions to be reliable.
 Additionally, T-test has many variants to suit different needs. T-tests are more
commonly used than Z-tests.
 Z-tests are preferred to T-tests when standard deviations are known.
➢ ANOVA Test:
 Analysis of Variance (ANOVA) is a parametric statistical technique used to
compare the means of datasets.
 This technique was invented by R.A. Fisher, in 1920 and is thus often referred to
as Fisher’s ANOVA, as well.
 It is similar in application to techniques such as the t-test and z-test, in that it is
used to compare means and the relative variance between them.
 However, analysis of variance (ANOVA) is best applied where more than 2
populations or samples are meant to be compared.
 In the variance-ratio (F) test, the larger variance is placed in the numerator and the
smaller variance in the denominator, so the value of F is always greater than or equal to 1.
 F can never be negative, because variances are squared quantities.
 F = larger variance / smaller variance = s1² / s2²
 F-tests mainly arise when models have been fitted to data using least squares.
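A minimal sketch of the variance ratio, assuming two samples and the sample variance with an n − 1 divisor (the function name `f_ratio` and the data are illustrative):

```python
from statistics import variance

def f_ratio(sample_1, sample_2):
    """Variance ratio with the larger sample variance in the numerator."""
    v1, v2 = variance(sample_1), variance(sample_2)
    if min(v1, v2) == 0:
        raise ValueError("the smaller variance is zero; the ratio is undefined")
    return max(v1, v2) / min(v1, v2)

a = [1, 2, 3, 4, 5]      # sample variance 2.5
b = [2, 4, 6, 8, 10]     # sample variance 10
print(f_ratio(a, b))     # same result whichever sample is passed first
```

Because the larger variance always goes on top, the result cannot fall below 1 and does not depend on the order of the arguments.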
➢ Assumptions of ANOVA test

1. The population must be close to a normal distribution. (Normality)
2. Samples must be independent.
3. Population variances must be equal. (Homogeneity)
4. Groups must have equal sample sizes.
➢ Types of ANOVA
One-way analysis: When we are comparing three or more groups based on one
factor variable, it is said to be a one-way analysis of variance (ANOVA).
For example, we may test whether or not the mean output of three workers is the same
based on the working hours of the three workers.
Two-way analysis: When there are two factor variables, it is said to be a two-way
analysis of variance (ANOVA).
For example, based on working conditions and working hours, we can test
whether or not the mean output of three workers is the same.
➢ Steps in ANOVA
1. Define the null and alternative hypotheses
2. State alpha
3. Calculate degrees of freedom
4. State decision rule
5. Calculate test statistic
• Calculate variance between samples
• Calculate variance within the samples
• Calculate ratio F
• If F is significant, perform post hoc test
6. State results & conclusion
• Critical value is looked up from the F table
• If calculated F value > critical value, H0 is rejected; otherwise it is accepted
N - Total observations (total sample size)
K - Number of groups
SSb - Sum of squares between the groups
SSw - Sum of squares within the groups
MSSw - Mean sum of squares within the groups
MSSb - Mean sum of squares between the groups
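The calculation step can be sketched as below for a one-way layout. It assumes the standard definitions MSSb = SSb / (K − 1), MSSw = SSw / (N − K) and F = MSSb / MSSw; the function name and the three small groups are hypothetical.

```python
def one_way_anova(groups):
    """Return SSb, SSw, MSSb, MSSw, F and the degrees of freedom for a list of groups."""
    k = len(groups)                                   # K: number of groups
    n = sum(len(g) for g in groups)                   # N: total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means around the grand mean, weighted by size.
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: observations around their own group mean.
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mss_b = ss_b / (k - 1)
    mss_w = ss_w / (n - k)
    return {"SSb": ss_b, "SSw": ss_w, "MSSb": mss_b, "MSSw": mss_w,
            "F": mss_b / mss_w, "df": (k - 1, n - k)}

result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])   # hypothetical outputs
print(result)
```

The calculated F is then compared with the F-table value for (K − 1, N − K) degrees of freedom at the chosen alpha.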
➢ Post Hoc Analysis in ANOVA

➢ If we reject the null hypothesis, all we know is that there is a difference somewhere
among the groups, but we don’t know where the differences are.
➢ Additional tests called post hoc tests can be done to determine where the differences
lie.
➢ The difference may be between the first and second groups, the second and third, or
between all of them.
➢ Chi-Square Test
✓ The chi-square test is an important test amongst the several tests of significance;
it was developed by the statistician Karl Pearson in 1900.
✓ It is a non-parametric test.
✓ It measures the differences between what is observed (Oi) and what is
expected (Ei): χ² = Σ (Oi − Ei)² / Ei
✓ It is denoted by the sign χ².
✓ The distributions are positively skewed. The research hypothesis for the chi-square
is always a one-tailed test.
✓ As the number of degrees of freedom increases, the χ² distribution becomes
more symmetrical.
➢ Conditions for the application of the χ² test

1. All the observations must be independent.
2. All the events must be mutually exclusive.
3. The data must be in the form of frequencies.
4. The frequency data must have a precise numerical value and must be
organized into categories or groups.
5. Observations recorded and used are collected on a random basis.
6. No group should contain very few items, say fewer than 10.
7. The overall number of items must also be reasonably large. It should normally
be at least 50.
➢ Determining the Degrees of Freedom
✓ For a single classification with n classes, df = n - 1. If there are two classes,
three classes, and four classes, the degrees of freedom would be 2-1, 3-1, and 4-1,
respectively.
✓ In a contingency table, df = (r – 1)(c – 1), where r = the number of rows and
c = the number of columns.
✓ 2×3 contingency table: d.f. = (2-1)(3-1) = 2.
✓ 3×4 contingency table: d.f. = (3-1)(4-1) = 6.
➢ Types of Chi-Square Test:
[Diagram: types of chi-square test]
Parametric: Test of comparing variance
Non-parametric: Test of goodness of fit, Test of independence, Test of homogeneity
(with Yates’s correction factor applied where required)
Test of comparing variance:  A chi-square test (Snedecor and Cochran, 1983) can be used
to test if the variance of a population is equal to a specified value. This test can be
either a two-sided test or a one-sided test.
Goodness of fit:  In the chi-square goodness of fit test, the observed sample
distribution is compared with the expected probability distribution.
Test of independence:  This test enables us to explain whether or not two attributes
are associated.
Yates’s correction factor:  When the degree of freedom is 1, i.e. in a 2×2 contingency
table, and N < 50, adjust χ² by Yates’s correction factor.
Test of homogeneity:  This test determines if two or more populations (or
subgroups of a population) have the same distribution of a single categorical variable.
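Yates’s correction is usually stated as subtracting 0.5 from each absolute difference |O − E| before squaring. The sketch below assumes that form; the 2×2 table (N = 20) is made up for illustration.

```python
def yates_chi_square(table):
    """Yates-corrected chi-square for a 2x2 table of observed frequencies."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("Yates's correction applies to a 2x2 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Shrink each absolute deviation by 0.5 before squaring.
            chi2 += (abs(table[i][j] - expected) - 0.5) ** 2 / expected
    return chi2

print(yates_chi_square([[8, 2], [1, 9]]))   # hypothetical 2x2 table, N = 20 < 50
```

The corrected value is smaller than the uncorrected χ², which makes the test more conservative for small samples.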
➢ Decision rule:
➢ If χ² (calculated) > χ² (tabulated), then the null hypothesis is rejected; otherwise
it is accepted.
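The decision rule can be illustrated with a small test of independence. This is a sketch under stated assumptions: expected frequencies are taken as (row total × column total) / N, the 2×3 table is hypothetical, and 5.991 is the tabulated χ² for 2 degrees of freedom at the 0.05 level.

```python
def chi_square_independence(observed):
    """Return (chi-square, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

observed = [[10, 20, 30],
            [20, 20, 20]]                  # hypothetical 2x3 table
chi2, df = chi_square_independence(observed)
tabulated = 5.991                          # chi-square table, df = 2, alpha = 0.05
print(chi2, df, "reject H0" if chi2 > tabulated else "accept H0")
```

Here the calculated value is below the tabulated one, so the null hypothesis of independence is accepted at the 5% level.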
➢ Correlation:
 The degree of relationship between the variables under consideration is measured
through correlation analysis.
 The measure of correlation is called the correlation coefficient.
 The degree of relationship is expressed by a coefficient that ranges from −1 to +1
(−1 ≤ r ≤ +1).
 The direction of change is indicated by the sign.
 Correlation analysis enables us to have an idea about the degree & direction
of the relationship between the two variables under study.
 Correlation is a statistical tool that helps to measure and analyze the degree of
relationship between two variables.
 Correlation analysis deals with the association between two or more variables.
➢ Types of Correlation:
On the basis of Degree of Correlation
1. Positive Correlation: The correlation is said to be positive if the values of the
two variables change in the same direction. As X increases, Y increases; as X
decreases, Y decreases.
Ex. Expenses & sales, height & weight.
2. Negative Correlation: The correlation is said to be negative when the values of
the variables change in opposite directions. As X increases, Y decreases; as X
decreases, Y increases. Ex. Price & quantity demanded.
3. No correlation: When a change in one variable is not accompanied by any
systematic change in the other variable, there is said to be no correlation
between the two.
[Diagram: types of correlation]
On the basis of degree of correlation: Positive correlation, Negative correlation
On the basis of number of variables: Simple correlation, Partial correlation, Multiple correlation
On the basis of linearity: Linear correlation, Non-linear correlation
➢ On the basis of number of variables

Simple, Partial and Multiple Correlations:
 Whether the correlation is simple, partial or multiple depends on the number of
variables studied.
 The correlation is said to be simple when only two variables are studied.
 The correlation is either multiple or partial when three or more variables are
studied.
 Multiple Correlation: The correlation is said to be multiple when three or more
variables are studied simultaneously.
 For example, if we want to study the relationship between the yield of wheat per acre
and both the amount of fertilizer used and the rainfall, then it is a problem of
multiple correlation.
 Partial Correlation: In the case of partial correlation we study more than two
variables, but consider only the relationship between two of them, while the effect
of the other influencing variable is kept constant.
 For example, in the above case, if we study the relationship between the yield and
the fertilizer used during periods when a certain average temperature existed, then
it is a problem of partial correlation.
➢ On the basis of linearity
1. Linear Correlation: The correlation is said to be linear when the amount of
change in one variable tends to bear a constant ratio to the amount of change in
the other variable.
For example, from the values of two variables given below, it is clear that the ratio of
change between the variables is the same:
X: 10 20 30 40 50
Y: 20 40 60 80 100
2. Non-Linear Correlation: The correlation is called non-linear or curvilinear when
the amount of change in one variable does not bear a constant ratio to the amount
of change in the other variable. For example, if the amount of fertilizer is
doubled, the yield of wheat would not necessarily be doubled.
Methods / Measures of correlation:

1. Scatter Diagram
2. Karl Pearson’s Coefficient of Correlation
3. Spearman’s Rank Correlation Coefficient
➢ Scatter Diagram:
 A scatter diagram is a graph of observed plotted points, where each point
represents the values of X & Y as a coordinate.
 It portrays the relationship between these two variables graphically.
 If the points trend upward from left to right, the diagram shows positive
correlation.
 Similarly, if the points trend downward from left to right, it shows negative
correlation.
 The closeness of the points to a straight line indicates the degree of correlation.
➢ Karl Pearson's Coefficient of Correlation
 Pearson’s ‘r’ is the most common correlation coefficient.
 Karl Pearson’s Coefficient of Correlation is denoted by ‘r’, with −1 ≤ r ≤ +1.
 The coefficient of correlation ‘r’ measures the degree of linear relationship
between two variables, say X & Y.
 The degree of correlation is expressed by the value of the coefficient.
 The direction of change is indicated by the sign (−ve) or (+ve).
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}$$
where x = X − X̄ and y = Y − Ȳ are the deviations from the respective means.
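As a sketch, the formula can be computed from deviations taken about the means. The function name `pearson_r` is an illustrative choice, and the data are the X and Y series from the linear-correlation example in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), with x and y as deviations from the means."""
    if len(xs) != len(ys):
        raise ValueError("both series need the same number of observations")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]
print(pearson_r(X, Y))         # perfectly linear data give r = +1
print(pearson_r(X, Y[::-1]))   # reversing Y gives r = -1
```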
➢ Properties of Coefficient of Correlation
 The value of the coefficient of correlation (r) always lies between −1 and +1.
 Such as: r = +1, perfect positive correlation
 r = −1, perfect negative correlation
 r = 0, no correlation
 The coefficient of correlation is independent of the origin and scale.
 By origin, it means that subtracting any constant from the given values of X
and Y leaves the value of “r” unchanged.
 By scale, it means there is no effect on the value of “r” if the values of X and Y are
divided or multiplied by any positive constant.
 The coefficient of correlation is the geometric mean of the two regression
coefficients.
 The coefficient of correlation is zero when the variables X and Y are
independent. However, the converse is not true.
Probable Error of Correlation Coefficient:

The Probable Error of Correlation Coefficient helps in determining the accuracy and
reliability of the value of the coefficient, in so far as it depends on random
sampling.
 The probable error of the correlation coefficient can be obtained by applying the
following formula:
$$P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{N}}$$
 r = coefficient of correlation, N = number of observations
 Probable Error is used to:
1. Interpret the value of ‘r’
 If r < P.E., then it is not at all significant (no correlation)
 If r > 6 P.E., then r is highly significant
 If P.E. < r < 6 P.E., then we cannot say anything about the significance of r
2. Construct limits within which the correlation in the population is expected
to lie.
 By adding and subtracting the value of P.E. from the value of ‘r’, we get the
upper limit and the lower limit, respectively, within which the population
coefficient of correlation is expected to lie. Symbolically, it can be expressed as
$$\rho = r \pm P.E.$$
where ρ (rho) denotes the correlation in a population.
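A short sketch of both uses, assuming the commonly quoted formula P.E. = 0.6745 (1 − r²) / √N and the r < P.E. and r > 6 P.E. thresholds. The function names and the sample values of r and N are illustrative.

```python
from math import sqrt

def probable_error(r, n):
    """P.E. = 0.6745 * (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def interpret_r(r, n):
    """Classify r against its probable error and return the limits r - P.E., r + P.E."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        verdict = "not significant"
    elif abs(r) > 6 * pe:
        verdict = "highly significant"
    else:
        verdict = "inconclusive"
    return verdict, (r - pe, r + pe)

print(interpret_r(0.8, 25))   # hypothetical r = 0.8 from N = 25 pairs
print(interpret_r(0.1, 9))    # hypothetical weak correlation from N = 9 pairs
```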
Conditions under which Probable error is used:

The probable Error can be used only when the following three conditions are fulfilled:
1. The data must approximate to the bell-shaped curve, i.e. a normal frequency
curve.
2. The statistical measure for which the probable error is computed must have been
calculated from a sample.
3. The sample items must be selected in an unbiased manner and must be
independent of each other.
Thus, the probable error is calculated to check the reliability of the value of
coefficient calculated from the random sampling.
Spearman’s Rank Correlation Coefficient

 Spearman’s Rank Correlation Coefficient is a non-parametric statistical
measure used to study the strength of association between two ranked
variables.
 This method is applied to ordinal data, i.e. values that can be arranged in
order, one after the other, so that a rank can be given to each.
 When the variables under study are not capable of quantitative measurement but
can be arranged in serial order, Pearson’s correlation coefficient cannot be used;
in such cases Spearman’s rank correlation can be used.
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$
 R = Rank correlation coefficient
 D = Difference of rank between paired items in the two series
 N = Total number of observations
 The value of R lies between −1 and +1, such as:
 R = +1, there is complete agreement in the order of ranks, and they move in the
same direction.
 R = −1, there is complete disagreement in the order of ranks, i.e. they run in
opposite directions.
 R = 0, there is no association in the ranks.
➢ Types of problems:
1. Where actual Ranks are assigned
2. Where ranks are not assigned
3. Equal Ranks or Tie in Ranks or where ranks are repeated
1. Where actual ranks are assigned:
The following steps are used to calculate the correlation coefficient:
1. Calculate the difference between the ranks (R1 − R2), denoted by D.
2. Square these differences to remove the negative signs and obtain their sum,
∑D².
3. Substitute the values obtained in the formula.
2. Where ranks are not assigned:
 In case the ranks are not given, ranks may be assigned by taking either the
highest value or the lowest value as 1. Whichever criterion is chosen, the same
method should be applied to all the variables.
3. Equal Ranks or Tie in Ranks or when ranks are repeated:
 In case the same rank applies to two or more entities, the ranks are
assigned on an average basis.
 For example, if two individuals are ranked equal at third position, then each is
given the rank (3+4)/2 = 3.5
 The formula to calculate the rank correlation coefficient when there is a tie in the
ranks is:
$$R = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{N^3 - N}$$
Where m = number of items whose ranks are common (one correction term for each group
of tied items).
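The three cases can be handled by one sketch: ranks are assigned with the highest value as 1, tied values share the average rank, and a correction term (m³ − m)/12 is added for every group of m tied items. The formula used is the commonly quoted tie-corrected form, and the function names and data are illustrative.

```python
from collections import Counter

def average_ranks(values):
    """Rank with the highest value as 1; tied values share the average rank."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_rank(xs, ys):
    """R = 1 - 6 * (sum D^2 + tie corrections) / (N^3 - N)."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # One (m^3 - m) / 12 term per group of m tied items; m = 1 contributes 0.
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values())
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

marks_a = [50, 40, 30, 30, 20]    # hypothetical scores with one tie
marks_b = [5, 4, 3, 2, 1]
print(average_ranks(marks_a))     # the tied items share rank (3 + 4) / 2 = 3.5
print(spearman_rank(marks_a, marks_b))
```

Without ties the correction is zero and the expression reduces to the basic formula with ∑D² alone.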
➢ Regression
 Regression analysis is the scientific technique for making predictions.
 M.M. Blair has described regression analysis as a mathematical measure of
the average relationship between two or more variables in terms of the original
units of the data.
 Regression Analysis: Regression analysis is a statistical tool used to
determine the probable change in one variable for a given amount of change
in another. It is also used to get a measure of the error involved in using the
regression line as a basis for estimation.
 It estimates the values of the dependent variable from the values of the
independent variable. This means the value of the unknown variable can be
estimated from the known value of another variable.
➢ Regression Line:
 The regression line shows how the variables are related to each other on average.
 The regression line is the single line that best fits the data, i.e. the line drawn
through the plotted points in such a manner that the total (squared) distance from
the points to the line is the smallest.
There are two regression lines, each with its own equation:

 Regression line of Y on X: This gives the most probable values of Y from the
given values of X.
 Regression line of X on Y: This gives the most probable values of X from
the given values of Y.
➢ Regression Coefficient
 The constant ‘b’ in the regression equation (Ye = a + bX) is called the
Regression Coefficient.
 It determines the slope of the line, i.e. the change in the value of Y
corresponding to a unit change in X, and therefore it is also called the
“Slope Coefficient.”
 The correlation coefficient is the geometric mean of the two regression
coefficients.
 r² = byx × bxy
 r = ±√(byx × bxy), with r taking the same sign as the regression coefficients
 The value of the coefficient of correlation cannot exceed unity, i.e. 1, so
byx × bxy ≤ 1
 The signs of both regression coefficients will be the same, i.e. they will be
either both positive or both negative.
 It is an absolute measure
 The average value of the two regression coefficients will be greater than or equal
to the value of the correlation coefficient.
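These properties can be checked numerically. The sketch below assumes the usual least-squares definitions byx = Σxy / Σx² and bxy = Σxy / Σy² (x and y being deviations from the means) and a = Ȳ − byx·X̄; the function name and the data are illustrative.

```python
from math import copysign, sqrt

def regression_coefficients(xs, ys):
    """Return byx, bxy, the intercept a of Y on X, and r derived from the two slopes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    b_yx = sum_xy / sum_x2            # slope of the line of Y on X
    b_xy = sum_xy / sum_y2            # slope of the line of X on Y
    a = mean_y - b_yx * mean_x        # intercept in Ye = a + bX
    # r is the geometric mean of the two slopes and carries their common sign.
    r = copysign(sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, a, r

X = [1, 2, 3, 4, 5]                   # hypothetical data
Y = [2, 4, 5, 4, 5]
b_yx, b_xy, a, r = regression_coefficients(X, Y)
print(b_yx, b_xy, r)
print("Estimated Y at X = 6:", a + b_yx * 6)
```

For these data the product of the two coefficients stays below 1 and their average is not less than r, as the properties state.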

