
The main objective of statistical analysis is to represent the data by a single value
that shows where the data are concentrated. Such a value is called the central value, and
it makes comparison between two or more series easier than comparison of the raw data.
Quantitative data, whether organized or unorganized, tend to concentrate at certain
values, usually somewhere in the centre of the distribution. The measures employed to
describe this tendency are called measures of central tendency. Constructing a frequency
distribution of the raw data is the first step towards condensing large data into a
compact form; the next is to condense the data into a single value, called an average.
In most data the average is the centre around which the values are concentrated, which is
why it is called a measure of central tendency. The values of the data cluster around the
average, and it carries important properties of the data. In that sense, it is
representative of the distribution. Two statisticians, Yule and Kendall, laid down the
following requirements for an ideal average:

1. It should be rigidly defined.
2. Its computation should be based on all observations.
3. It should lend itself for algebraic treatment.
4. It should be least affected by extreme observations.
5. It should be easy to calculate and simple to understand.
6. It should not be affected by fluctuations of sampling.

The common measures of central tendency are:

1. Arithmetic mean (average)
2. Median
3. Mode
4. Quartiles
5. Geometric mean
6. Harmonic mean
7. Weighted mean

1. Arithmetic mean (AM): It is the best-known and most widely used measure of central
tendency. It is the sum of all observations divided by the number of observations.

Mean = (Sum of all observations) / (Number of observations)

Symbolically, if $X_1, X_2, \ldots, X_N$ are the values of a variable, the mean for
ungrouped data is computed by the formula

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

where
$\sum$ is read as sigma (summation)
$\bar{X}$ = the mean of the values
$X_i$ = values of the variable
$N$ = number of values

For grouped data (discrete frequency distribution),

Mean = (Sum of each value multiplied by its frequency) / (Total frequency)

Symbolically, if $X_1, X_2, \ldots, X_k$ are the values of a variable and
$f_1, f_2, \ldots, f_k$ are their corresponding frequencies, the mean is computed by the
formula

$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_k X_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum f_i X_i}{\sum f_i} = \frac{1}{N}\sum f_i X_i$$

where $N = \sum f_i$ is the total frequency.
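As a minimal sketch of the two formulas above, the following Python computes the mean of raw values and of a discrete frequency distribution. The function names and the sample data are illustrative assumptions, not part of the notes.

```python
def mean_ungrouped(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


def mean_discrete(values, freqs):
    """Sum of f_i * X_i divided by the total frequency N = sum of f_i."""
    total_frequency = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / total_frequency


# Hypothetical data, chosen only for illustration.
raw = [4, 8, 6, 2, 10]
values = [1, 2, 3, 4]
freqs = [2, 3, 4, 1]

print(mean_ungrouped(raw))           # 6.0
print(mean_discrete(values, freqs))  # 2.4
```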

If computing the mean of a discrete frequency distribution directly is laborious and time
consuming, the mean can be calculated from an assumed mean by the formula

$$\bar{X} = A + \frac{\sum f_i\, d_i}{N}$$

where
$A$ = assumed mean
$d_i = X_i - A$ = deviation of $X_i$ from the assumed mean
$f_i$ = frequencies
$N = \sum f_i$ = total frequency

Continuous frequency distribution: In this method, we assume that all observations which
fall in a given class are located at the mid-point of that class. This assumption holds
good only when the number of observations is large.

Under this assumption we take $X_1, X_2, \ldots$ as the mid-values of the class intervals
and calculate the arithmetic mean as

$$\bar{X} = \frac{\sum f_i X_i}{N}, \quad \text{where } N = \sum f_i$$

Computation procedure :

Step I : Write all class intervals serially in the first column and the
corresponding frequencies in the second column.

Step II : Obtain the mid-value of each class interval by adding its lower
and upper limits and dividing the result by 2; put these values in the
third column.

Step III: Multiply each f by the corresponding X and write the product
in the fourth column. The sum of this column gives ∑ f X.

Thus $\bar{X}$ = (Sum of fourth column) / (Sum of second column)

If the values of the variable are large, the calculation can be simplified by using the
short-cut method.

Symbolically, $\bar{X} = A + \bar{d}$

Step – I Choose any value from the data as the assumed mean (A).
Step – II Take the difference between each mid-value and the assumed mean; this is the
deviation $d = X - A$.
Step – III Multiply each d by the corresponding f.
Step – IV Calculate $\bar{d}$ by using the formula $\bar{d} = \sum f d / \sum f$.
Step – V Use the formula $\bar{X} = A + \bar{d}$ to find the mean of the original data.
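The direct procedure and the short-cut method for class intervals can be sketched as follows. This is only an illustration: the function names, the class intervals and the frequencies are assumptions made for the example. Both routes should give the same mean whatever assumed mean is chosen.

```python
def grouped_mean_direct(classes, freqs):
    """Mean from class intervals using mid-values: sum(f * x) / sum(f)."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    return sum(f * x for x, f in zip(mids, freqs)) / sum(freqs)


def grouped_mean_shortcut(classes, freqs, assumed_mean):
    """Mean as A + d_bar, where d = mid-value - A and d_bar = sum(f * d) / sum(f)."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    d_bar = sum(f * (x - assumed_mean) for x, f in zip(mids, freqs)) / sum(freqs)
    return assumed_mean + d_bar


# Hypothetical class intervals (lower, upper) and their frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 5, 8, 4]

print(grouped_mean_direct(classes, freqs))        # 21.5
print(grouped_mean_shortcut(classes, freqs, 25))  # 21.5
```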

Merits of Arithmetic mean (AM)

1 It is rigidly defined.
2 It is easy to calculate and understand.
3 It is based upon all the observations.
4 It is capable of further mathematical treatment.
5 It is least affected by sampling fluctuations.

Demerits of AM :-
1. It can be used only for quantitative data; the mean cannot be calculated for
qualitative data like caste, religion and sex.
2. It is unduly affected by extreme observations.
3. It cannot be calculated when the frequency distribution has open-end classes.
4. Sometimes the AM may not be an observation in the data.
5. It cannot be determined graphically.

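Combined mean: If two groups of sizes $n_1$ and $n_2$ have means $\bar{x}_1$ and
$\bar{x}_2$ respectively, the mean of the combined group is

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

where $n_1, n_2$ = sizes of groups 1 and 2, and
$\bar{x}_1, \bar{x}_2$ = means of the groups of sizes $n_1$ and $n_2$ respectively.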
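Assuming the formula intended here is the usual combined mean, (n1 * mean1 + n2 * mean2) / (n1 + n2), a small sketch is given below. The function name and the figures are hypothetical.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Mean of two groups pooled together, weighting each group mean by its size."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# Hypothetical groups: 10 observations with mean 50 and 30 observations with mean 60.
print(combined_mean(10, 50.0, 30, 60.0))  # 57.5
```

The result lies closer to 60 than to 50 because the second group is three times as large.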

Median:-

The mean is unduly affected by extreme observations and cannot be calculated for
distributions with open-end classes or for qualitative variables like honesty, sex,
religion etc. In such cases we use other measures of central tendency, such as the median.

Definition:-
Median may be defined as the central value of a variable when the values are
arranged in order of magnitude i.e., either in ascending order or in the descending order.
The median divides the series into two equal parts, 50% of the observations will be
smaller than the median while 50% of the observations will be larger than it.

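For ungrouped data with n observations arranged in order of magnitude,

Median = the value of the $\left(\frac{n+1}{2}\right)$th observation, if n is odd,

or

Median = $\frac{1}{2}\left[\text{value of the } \left(\frac{n}{2}\right)\text{th observation} + \text{value of the } \left(\frac{n}{2}+1\right)\text{th observation}\right]$, if n is even.

For grouped data,

$$\text{Median} = L_1 + \frac{\frac{N}{2} - cf}{f} \times h$$

where
$L_1$ = lower limit of the median class
$L_2$ = upper limit of the median class
$f$ = frequency of the median class
$cf$ = cumulative frequency of the pre-median class
$h = L_2 - L_1$ = class width
$N$ = total frequency

The median can also be obtained graphically from the ogive.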
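A minimal sketch of the median for ungrouped and grouped data follows. It assumes the standard grouped formula L1 + ((N/2 - cf) / f) * h, with the median class taken as the first class whose cumulative frequency reaches N/2. The function names and data are illustrative only.

```python
def median_ungrouped(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # ((n + 1) / 2)th observation, 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def median_grouped(classes, freqs):
    """L1 + ((N/2 - cf) / f) * h for the first class whose cumulative frequency reaches N/2."""
    total = sum(freqs)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= total / 2:
            return lower + (total / 2 - cumulative) / f * (upper - lower)
        cumulative += f


# Hypothetical data.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 5, 8, 4]

print(median_ungrouped([7, 3, 9, 1, 5]))  # 5
print(median_ungrouped([7, 3, 9, 1]))     # 5.0
print(median_grouped(classes, freqs))     # 22.5
```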

Merits of median:
(1) It is easy to understand and easy to calculate.
(2) It can be computed for a distribution with open-end classes.
(3) It is not affected by extreme observations.
(4) It is applicable to quantitative as well as qualitative data.
(5) It can be determined graphically.

Demerits:
(1) It is not based on all the observations, hence it is not a proper representative.
(3) It is not capable of further mathematical treatment.

Mode: The mode is the value of a variable that occurs most frequently in a series.

(1) Ungrouped data: In this case the mode is obtained by inspection. For given data, the
mode may or may not exist, and even if it exists, it is not necessarily unique.

(2) Grouped data: The mode can be obtained by using the formula

$$\text{Mode} = L + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

where
$L$ = lower boundary of the modal class
$f_m$ = frequency of the modal class
$f_1$ = frequency of the pre-modal class
$f_2$ = frequency of the post-modal class
$h$ = width of the modal class
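The grouped-data formula can be sketched in Python as below. Two choices here are assumptions of the sketch rather than statements from the notes: if several classes share the highest frequency the first one is taken as the modal class, and a missing neighbouring class is treated as having frequency 0. The names and data are hypothetical.

```python
def mode_grouped(classes, freqs):
    """L + (fm - f1) / (2*fm - f1 - f2) * h for the class with the highest frequency."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])  # index of the modal class
    lower, upper = classes[m]
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0               # pre-modal class frequency
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0  # post-modal class frequency
    return lower + (fm - f1) / (2 * fm - f1 - f2) * (upper - lower)


# Hypothetical distribution: the modal class is 20-30 with frequency 8.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 5, 8, 4]

print(mode_grouped(classes, freqs))  # 20 + (3 / 7) * 10, roughly 24.29
```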

(3) The mode can be obtained graphically by plotting a histogram of the given
distribution.

As compared with the mean and median, the mode has very limited utility.

Merits:-

1) Applicable to both qualitative and quantitative data.
2) Not affected by extreme observations.
3) Can be determined even when the distribution has open-end classes.
4) Can be obtained graphically.

Demerits:-

i. It is not based on all the observations.
ii. It is not capable of further mathematical treatment.
iii. It is not rigidly defined.
iv. The calculation of the mode is laborious and time consuming.

There is an empirical relation between the mean, median and mode:

Mean - Mode = 3 (Mean - Median)

4. Quartiles: The values which divide the given data into four equal parts when the
observations are arranged in order of magnitude are known as quartiles. There are three
quartiles, Q1, Q2 and Q3. Q1 is known as the lower or first quartile; 25% of the
observations of the distribution lie below it and consequently 75% of the observations
are greater than it. The second quartile, Q2, is the median. Q3, the third quartile, has
75% of the observations below it and 25% above it.

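For ungrouped data,

Q1 = value of the $\left(\frac{n+1}{4}\right)$th observation of the arranged data, if n is odd;

Q1 = $\frac{1}{2}\left[\left(\frac{n}{4}\right)\text{th} + \left(\frac{n}{4}+1\right)\text{th observation}\right]$ of the arranged data, if n is even.

For grouped data, the formula for determining the first quartile is

$$Q_1 = l + \frac{k - c.f.}{f} \times h$$

where $Q_1$ = first quartile, $l$ = lower limit of the first quartile class,
$c.f.$ = cumulative frequency of the class preceding the first quartile class,
$f$ = frequency of the first quartile class, $h$ = class width of the first quartile
class, and $k = N/4$, where $N$ = total frequency.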
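A sketch of the grouped-data quartile formula is given below. The notes state it only for Q1 with k = N/4; extending it to Q2 and Q3 with k = 2N/4 and k = 3N/4 is the standard generalisation and is an assumption here, as are the function name, the reading of the leading term as the lower limit of the quartile class, and the sample data.

```python
def quartile_grouped(classes, freqs, j):
    """j-th quartile (j = 1, 2, 3): l + (k - c.f.) / f * h, with k = j * N / 4."""
    total = sum(freqs)
    k = j * total / 4
    cumulative = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= k:
            return lower + (k - cumulative) / f * (upper - lower)
        cumulative += f


# Hypothetical distribution with total frequency N = 20.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 5, 8, 4]

q1 = quartile_grouped(classes, freqs, 1)
q2 = quartile_grouped(classes, freqs, 2)  # same value as the grouped median
q3 = quartile_grouped(classes, freqs, 3)
print(q1, q2, q3)
```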

5. Geometric mean: It is defined as the nth root of the product of the n values in a
series. It is used when the data contain a few extremely large or small values.

Symbolically, the G.M. of the values $X_1, X_2, \ldots, X_N$ is

$$G = (X_1 \cdot X_2 \cdots X_N)^{1/N}$$

If the values are 3, 9 and 27, the G.M. is computed as $G = (3 \times 9 \times 27)^{1/3} = 9$.
When the series consists of more than three numbers, it is difficult to extract the root.
That is why logarithms are employed:

$$\log G = \frac{\log X_1 + \log X_2 + \cdots + \log X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} \log X_i$$

or

$$G = \text{antilog}\left[\frac{1}{N}\sum_{i=1}^{N} \log X_i\right]$$

For a discrete series,

$$G = \text{antilog}\left[\frac{1}{N}\sum f_i \log X_i\right], \quad N = \sum f_i$$
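The logarithmic route to the geometric mean can be sketched as follows. Base-10 logarithms are used on the assumption that the "antilog" in the notes refers to common log tables; any base gives the same result. The function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Antilog of the arithmetic mean of the logs of the values."""
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log


def geometric_mean_discrete(values, freqs):
    """Antilog of sum(f * log x) / N for a discrete series, with N = sum(f)."""
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / sum(freqs)
    return 10 ** mean_log


# The example from the text: the G.M. of 3, 9 and 27 is 9 (up to rounding error).
print(geometric_mean([3, 9, 27]))
```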

(6) Harmonic mean: It is the reciprocal of the arithmetic mean of the reciprocals of the
observations.

For ungrouped data the H.M. is given by the formula

$$H = \frac{N}{\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_N}} = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}$$
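As a small illustration of the definition (the reciprocal of the arithmetic mean of the reciprocals), the sketch below computes N divided by the sum of 1/X. The function name and the data are assumptions, and all values must be non-zero.

```python
def harmonic_mean(values):
    """Number of observations divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)


# Hypothetical data: 3 / (1/1 + 1/2 + 1/4) = 3 / 1.75 = 12/7.
print(harmonic_mean([1, 2, 4]))
```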

(7) Weighted mean - The Weighted mean X

MEASURES OF DISPERSION

As already discussed, the whole data set is represented by a single value known as the
average. An average alone cannot describe the data completely: there may be two or more
data sets with the same mean that are nevertheless not identical.

Because the observations are not uniform, it is necessary to study their variation.
The variation is also known as dispersion. It gives information on how the individual
observations are scattered or dispersed about the mean.
Deviation = observation - Mean
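Different Measures of Dispersion :
(i) Range : $A - B$, where A is the largest and B the smallest observation
(ii) Quartile deviation : $\frac{Q_3 - Q_1}{2}$
(iii) Coefficient of quartile deviation : $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
(iv) Mean deviation about the mean : $MD = \frac{\sum |x - \bar{x}|}{n}$ (for a frequency distribution, $\frac{\sum f\,|x - \bar{x}|}{N}$ with $N = \sum f$)
(v) Standard deviation
(vi) Variance
(vii) Coefficient of variation

Coefficient of mean deviation about the mean $= \frac{\text{MD about mean}}{\text{mean}} = \frac{\sum |x - \bar{x}| / n}{\bar{x}}$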
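The simpler measures in this list can be sketched as below. The sketch assumes the usual readings: range as largest minus smallest observation, and mean deviation as the average absolute deviation from the mean. The function names and the data are illustrative.

```python
def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def mean_deviation(values):
    """Average of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


def coefficient_of_mean_deviation(values):
    """Mean deviation about the mean divided by the mean."""
    mean = sum(values) / len(values)
    return mean_deviation(values) / mean


# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(data_range(data))                     # 7
print(mean_deviation(data))                 # 1.5
print(coefficient_of_mean_deviation(data))  # 0.3
```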
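Standard deviation : The positive square root of the arithmetic mean of the squares of
the deviations taken from the mean, denoted by σ:

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

When the population mean is not known, we take the sample mean as an estimate of the
population mean. In this case, only (n-1) of the deviations are independent. Therefore,
when there are n observations in the data, the divisor is n-1. In statistical language,
n-1 is called the degrees of freedom.

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$

On simplification, the formula with divisor n becomes
$\sigma^2 = \frac{1}{n}\left(\sum x^2 - n\bar{x}^2\right)$; with divisor n-1, the factor
$\frac{1}{n}$ is replaced by $\frac{1}{n-1}$.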
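When the observations are large in size, applying the formula for the S.D. directly is
laborious, and a short-cut method may be used.
I- Decide on an assumed mean 'a'.
II- Obtain the deviations d = x - a.
III- Compute the mean of the deviations, $\bar{d}$.
IV- Apply the formula

$$\sigma = \sqrt{\frac{\sum d^2 - n\bar{d}^2}{n-1}}$$

For grouped data, with step deviations $d = (x - a)/h$, class width h, $n = \sum f$ and
$\bar{d} = \sum f d / n$,

$$\sigma = h \times \sqrt{\frac{\sum f d^2 - n\bar{d}^2}{n-1}}$$

Variance : The square of the standard deviation of a set of observations is called the
variance and is denoted by $\sigma^2$.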
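A minimal sketch of the standard deviation with divisor n - 1, computed both directly and by the short-cut method with an assumed mean, is given below. The function names, the data and the assumed mean of 4 are hypothetical. The two results agree because subtracting a constant from every observation does not change the spread.

```python
import math


def sample_sd(values):
    """Square root of the sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sample_sd_shortcut(values, assumed_mean):
    """Same quantity from deviations d = x - a: sqrt((sum(d^2) - n * d_bar^2) / (n - 1))."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    d_bar = sum(d) / n
    return math.sqrt((sum(di ** 2 for di in d) - n * d_bar ** 2) / (n - 1))


# Hypothetical data: the sum of squared deviations from the mean (5) is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = sample_sd(data)
variance = sd ** 2  # the variance is the square of the standard deviation
print(sd, sample_sd_shortcut(data, 4), variance)
```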
Merits of Standard deviation :
(i) It is rigidly defined.
(ii) It is based upon all observations.
(iii) It does not ignore the algebraic sign of deviation.
(iv) It is capable of further treatment.
(v) It is not much affected by sampling fluctuation.
Demerits of Standard deviation :
(i) It is difficult to understand and calculate.
(ii) It cannot be calculated for qualitative data.
(iii) It is unduly affected by extreme deviations.
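Coefficient of variation :
For comparing the variability of two frequency distributions, a relative measure of
dispersion known as the coefficient of variation is used. It is always expressed as a
percentage.

$$CV = \frac{\sigma}{\bar{x}} \times 100$$

where σ is the standard deviation and $\bar{x}$ is the mean.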
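To illustrate why a relative measure is needed, the sketch below compares two hypothetical series that have the same standard deviation but very different means. It uses `statistics.stdev` (divisor n - 1) from the Python standard library; the choice of divisor and the function name are assumptions of this example.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical series: both have standard deviation 2, but the means are 12 and 102.
series_a = [10, 12, 14]
series_b = [100, 102, 104]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
print(cv_a, cv_b)  # series_a is relatively more variable
```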

SUMMARY :
1. The standard deviation or variance is never negative.
2. When all observations are equal, the standard deviation is zero.
3. When all the observations in the data are increased or decreased by a constant, the
standard deviation remains the same.
4. When each of the observations is multiplied by a constant K, the standard deviation
is |K| times the standard deviation of the original data.
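These properties can be checked numerically. The sketch below uses `statistics.pstdev` (divisor n) on hypothetical data; a negative multiplier is chosen deliberately to show that the standard deviation scales by the absolute value of the constant.

```python
import statistics

# Hypothetical data.
data = [3, 7, 8, 12, 15]
sd = statistics.pstdev(data)

shifted = [x + 10 for x in data]  # every observation increased by a constant
scaled = [-2 * x for x in data]   # every observation multiplied by K = -2

print(statistics.pstdev([5, 5, 5]))  # equal observations give zero
print(sd, statistics.pstdev(shifted), statistics.pstdev(scaled))
```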

CORRELATION & REGRESSION

In statistics, the data often relate to two variables; such data form a bivariate
distribution. One of the variables is denoted by 'x' and the other by 'y', and the
observations are paired as (x, y). Examples are blood pressure and weight, age of wife
and husband, pulse rate and temperature, and height of father and sons.
We are interested in whether there is a mutual relation between the two variables under
consideration. This joint relation is called correlation. Two variables are said to be
correlated when a change in the value of one variable is accompanied by a corresponding
change in the value of the other variable. To study correlation, there must be a logical
relationship between the two variables.
Positive Correlation :
If an increase in the value of one variable is accompanied by an increase in the value of
the other, or a decrease in one by a decrease in the other, the correlation between the
two variables is said to be positive. In other words, the values of the two variables
change in the same direction, e.g. temperature and pulse rate are positively correlated.
Negative Correlation :
If an increase in the value of one variable is accompanied by a decrease in the value of
the other and vice versa, the correlation is said to be negative. The values of the two
variables change in opposite directions.
The simplest way to study correlation is the graphical method. Take the n paired
observations (X1, Y1), ..., (Xn, Yn) and plot them as points on graph paper. The
resulting diagram of scattered points is known as a scatter diagram.
Correlation Coefficient :
Prof. Karl Pearson suggested a measure of the degree of correlation, the correlation
coefficient, denoted by $r_{xy}$. It is also called the product moment correlation
coefficient. It is calculated by the formula

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}}$$

or

$$r = \frac{\frac{1}{n}\sum xy - \bar{x}\bar{y}}{\sqrt{\frac{1}{n}\sum x^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum y^2 - \bar{y}^2}}$$
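The first form of the product moment formula, (n ∑xy - ∑x ∑y) divided by the product of the two square roots, translates directly into code. The sketch below is illustrative only: the function name and the paired data are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(r)  # 30 / sqrt(1500), roughly 0.77
```

A perfectly linear increasing relation gives r = +1 and a perfectly linear decreasing relation gives r = -1.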
Properties of Correlation Coefficient :
(i) It always lies between -1 and +1; symbolically, -1 ≤ r ≤ +1.
(ii) r is a pure number, i.e. a unitless quantity.
(iii) Two independent variables are uncorrelated: when x and y are independent,
r = 0.
(iv) The absolute value of the correlation coefficient r is independent of change of
origin and scale.
RANK CORRELATION :
The rank correlation coefficient is given by the formula

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where n = number of paired observations
d = difference between the respective ranks.
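A short sketch of the rank correlation coefficient is shown below. It uses the standard Spearman form with the factor 6 in the numerator, 1 - 6 ∑d² / (n(n² - 1)), and assumes the observations have already been ranked with no ties. The function name and the ranks are hypothetical.

```python
def spearman_rs(ranks_x, ranks_y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between paired ranks."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks given to five items by two judges.
ranks_x = [1, 2, 3, 4, 5]
ranks_y = [2, 1, 4, 3, 5]

print(spearman_rs(ranks_x, ranks_y))  # 0.8
```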
LINEAR REGRESSION :
The term regression was first used by the British biometrician Galton; it literally means
stepping back towards the average. Regression analysis is a mathematical measure of the
average relationship between two or more variables in terms of the original units of the
data. In regression analysis there are two types of variables. The variable whose value
is to be predicted is called the dependent variable, and the variable which is used for
prediction is called the independent variable. The independent variable is also known as
the regressor, predictor or explanatory variable, while the dependent variable is also
known as the regressed or explained variable.
Y= a + bx
LINE OF REGRESSION :
If the variables in a bivariate distribution are related, the points in the scatter
diagram will cluster round some curve, called the curve of regression. If the curve is a
straight line, it is called the line of regression and there is said to be linear
regression between the two variables. The line of regression is the line which gives the
best estimate of the value of one variable for any specific value of the other variable.
Thus the line of regression is the "line of best fit" and is obtained by the principle of
least squares.

A linear equation has the form
Y= a + bx
and plots as a straight line, where a and b are constants.
Mathematically, a is the y-intercept and
b is the slope of the line.
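The notes do not state how a and b are found, so the sketch below uses the standard least-squares estimates as an assumption: b is the sum of products of deviations divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x. The function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line Y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx           # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(a, b)       # intercept about 2.2, slope about 0.6
print(a + b * 6)  # predicted Y for x = 6
```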

| Correlation Coefficient | Regression Coefficient |
| --- | --- |
| Summarises the degree of relationship between two variables. | Summarises the nature of relationship between two variables. |
| Pairs of observations of two variables are selected at random. | The values of one variable are selected at random by fixing the value of the other variable. |
| Applied to those cases where there is no direction of dependency. | Applied to those cases where there is a direction of dependency. |
| Cause & effect relationship between two variables is not clear: x may be the cause of y, y may be the cause of x, or the correlation may be due to chance. | One variable is dependent & another is independent. |
