The Mean
The mean, also known as the arithmetic average, is the sum of all values divided by the total
number of values.
Let X be a variable which takes the values $x_1, x_2, x_3, \dots, x_n$ in a sample of size n drawn from a
population of size N (n < N). The arithmetic mean (A.M.) of the set of observations is the sum of all
values in the series divided by the number of items in the series.
$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{(for raw data)}$$
Example: The following table gives the wages paid to 125 workers in a factory. Calculate the
arithmetic mean of the wages.
Wages (in birr): 200 210 220 230 240 250 260
No. of workers: 5 15 32 42 15 12 4
Solution:
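The worked solution does not appear in the text at this point. As a minimal sketch, assuming the usual frequency form of the mean, $\bar{x} = \sum f_i x_i / \sum f_i$ (the helper name `frequency_mean` is hypothetical), the calculation can be checked in Python:

```python
def frequency_mean(values, freqs):
    """Mean of a simple frequency distribution: sum(f * x) / sum(f)."""
    total = sum(f * x for x, f in zip(values, freqs))
    return total / sum(freqs)

wages = [200, 210, 220, 230, 240, 250, 260]  # wages in birr
workers = [5, 15, 32, 42, 15, 12, 4]         # number of workers

print(sum(workers))                    # 125
print(frequency_mean(wages, workers))  # 227.92
```

Under this assumption the mean wage is 28,490 / 125 = 227.92 birr.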
Example: The following table gives the marks of 58 students in introduction to Statistics.
Calculate the average marks of this group.
Solution:
$\sum f_i = 4 + 8 + 11 + \dots + 2 = 58$
1. The sum of the deviations of a set of items from their mean is always zero, i.e.
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
2. The effect of transforming the original series on the mean:
a) If a constant k is added to (or subtracted from) every observation, the new mean
will be the old mean plus (or minus) k, respectively.
b) If every observation is multiplied by a constant k, the new mean will be
k × old mean.
3. If $\bar{x}_1$ is the mean of $n_1$ observations,
$\bar{x}_2$ is the mean of $n_2$ observations,
...,
and $\bar{x}_k$ is the mean of $n_k$ observations,
then the mean of all the observations in all groups, often called the combined mean, is given
by:
$$\bar{x}_c = \frac{\bar{x}_1 n_1 + \bar{x}_2 n_2 + \dots + \bar{x}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{x}_i n_i}{\sum_{i=1}^{k} n_i}$$
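Properties 1 and 2 in the list above can be checked numerically. The following is a small sketch; the data values and the constant `k` are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean of raw data: sum of values / number of values."""
    return sum(values) / len(values)

data = [4, 8, 15, 16, 23, 42]  # hypothetical observations
k = 3                          # hypothetical constant
m = mean(data)                 # 18.0

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - m for x in data)

# Property 2a: adding k to every observation adds k to the mean.
shifted_mean = mean([x + k for x in data])

# Property 2b: multiplying every observation by k multiplies the mean by k.
scaled_mean = mean([x * k for x in data])

print(deviation_sum, shifted_mean, scaled_mean)  # 0.0 21.0 54.0
```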
Example: In a class there are 30 females and 70 males. If the females averaged 60 in an examination
and the males averaged 72, find the mean for the entire class.
Solution:
Females: $\bar{x}_1 = 60$, $n_1 = 30$
Males: $\bar{x}_2 = 72$, $n_2 = 70$
$$\bar{x}_c = \frac{\bar{x}_1 n_1 + \bar{x}_2 n_2}{n_1 + n_2} = \frac{(60 \times 30) + (72 \times 70)}{30 + 70} = 68.40$$
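The combined-mean formula translates directly into code. This is an illustrative sketch; the function name `combined_mean` is an assumption, not part of the notes.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups: sum(mean_i * n_i) / sum(n_i)."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

# Females: mean 60, n = 30; males: mean 72, n = 70.
print(combined_mean([60, 72], [30, 70]))  # 68.4
```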
Suppose we compute the mean of n observations and later find that one, two or
more of the observations were wrongly copied down. It is then required to compute the corrected
mean by replacing the wrong observations with the correct ones. In other words, to correct
the mean we first find the corrected ∑x: we subtract the wrong
items from the incorrect ∑x and add the correct items. Finally, dividing the corrected ∑x
by the number of observations gives the corrected mean.
Example: The average mark of 80 students was found to be 40. Later, it was discovered that a
score of 54 was misread as 84. Find the corrected mean of the 80 students.
Solution: The incorrect $\sum x = 80 \times 40 = 3200$, so the corrected $\sum x = 3200 - 84 + 54 = 3170$.
Therefore the corrected mean is $\bar{x} = \frac{\text{corrected } \sum x}{N} = \frac{3170}{80} = 39.625$.
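The correction procedure described above can be sketched as a short function. The name `corrected_mean` and its parameters are hypothetical; the steps follow the text: rebuild ∑x, remove the wrong items, add the correct ones, divide by n.

```python
def corrected_mean(old_mean, n, wrong, correct):
    """Correct a mean after some observations were miscopied."""
    total = old_mean * n                       # incorrect sum of x
    total = total - sum(wrong) + sum(correct)  # corrected sum of x
    return total / n

# 80 students, reported mean 40, a score of 54 was misread as 84.
print(corrected_mean(40, 80, wrong=[84], correct=[54]))  # 39.625
```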
Exercise 1: The mean of a set of 100 observations was found to be 40. But by mistake, a value of
50 was taken in place of 40 for one observation. Re-calculate the correct mean.
B. Weighted Mean:
One of the limitations of the arithmetic mean is that it gives equal importance (weight) to all the
items in the Series.
$$\bar{x}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$
where $\bar{x}_w$ is the weighted mean, $W_i$ are the weights attached to the values of
the variable and $X_i$ are the values of the variable.
Example: Suppose a student has secured the following marks in three tests: Mid-term test = 30,
Laboratory = 25 and Final exam = 20. The simple arithmetic mean will be (30 + 25 + 20)/3 = 25.
However, this will be wrong if the three tests carry different weights on the basis of their relative
PREPARED BY: ABDULMENAN M. (MSc) Page 4 of 21
importance. Assume that the weights assigned to the three tests are 2, 3 and 5 points, respectively. On the
basis of this information, we can now calculate the weighted mean as
$$\bar{x}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i} = \frac{W_1 X_1 + W_2 X_2 + W_3 X_3}{W_1 + W_2 + W_3} = \frac{60 + 75 + 100}{2 + 3 + 5} = 23.5 \text{ marks.}$$
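A minimal sketch of the weighted mean, using the marks and weights from the example (the function name `weighted_mean` is an assumption):

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(X_i * W_i) / sum(W_i)."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

marks = [30, 25, 20]   # mid-term, laboratory, final exam
weights = [2, 3, 5]

print(sum(marks) / len(marks))        # 25.0 (simple mean)
print(weighted_mean(marks, weights))  # 23.5
```

The weighted mean is lower than the simple mean here because the lowest mark carries the largest weight.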
The geometric mean is the nth root of the product of n positive values. If $X_1, X_2, \dots, X_n$ are n
positive values, then their geometric mean is
$G.M = (X_1 X_2 \cdots X_n)^{1/n}$.
The geometric mean is usually used for averaging rates of change, ratios, percentage
distributions, logarithmic distributions and so on.
When the number of observations is more than two, extracting the nth root directly may be tedious;
in that case the calculation can be simplified by taking common logarithms (base ten).
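$G.M = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$. Taking logarithms of both sides:
$$\log(G.M) = \frac{1}{n}\log(x_1 x_2 \cdots x_n) = \frac{1}{n}(\log x_1 + \log x_2 + \dots + \log x_n) = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
$$G.M = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$$
This shows that the logarithm of the G.M is the mean of the logarithms of the individual observations.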
Example: The ratios of prices in 1999 to those in 2000 for 4 commodities were 0.9, 1.25, 1.75
and 0.85. Find the average price ratio by means of the geometric mean.
Solution: $G.M = \text{antilog}\left(\frac{\sum \log x_i}{n}\right) = \text{antilog}\left(\frac{\log 0.9 + \log 1.25 + \log 1.75 + \log 0.85}{4}\right) = 1.14$
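The logarithm route to the geometric mean can be sketched as follows. The function name is an assumption; base-ten logarithms are used to match the antilog formula.

```python
import math

def geometric_mean(values):
    """G.M = antilog of the mean of the base-10 logarithms."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log  # antilog

ratios = [0.9, 1.25, 1.75, 0.85]
print(round(geometric_mean(ratios), 2))  # 1.14
```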
Geometric mean for ungrouped and grouped frequency distributions:
In the case of ungrouped data, the geometric mean is obtained by
$$G.M = \sqrt[n]{x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}} = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{k} f_i \log x_i\right),$$
where $n = \sum f_i$ and k is the number of distinct values.
For a continuous frequency distribution,
$$G.M = \sqrt[n]{m_1^{f_1} m_2^{f_2} \cdots m_k^{f_k}} = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{k} f_i \log m_i\right),$$
where $n = \sum f_i$, k is the number of classes and $m_i$ is the midpoint (class mark) of the ith class.
Properties of geometric mean:
Its calculation is not easy.
It involves all observations during computation.
It may not be defined even if a single observation is negative.
Example: Find the harmonic mean of the values 2, 3 and 6.
$$H.M = \frac{3}{\frac{1}{2} + \frac{1}{3} + \frac{1}{6}} = 3$$
For simple frequency data, the harmonic mean is calculated by using the following formula:
$$H.M = \text{Reciprocal of } \frac{\sum \left(\frac{f_i}{x_i}\right)}{n} = \frac{n}{\sum \left(\frac{f_i}{x_i}\right)},$$
where n is the total number of observations.
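A short sketch of the harmonic mean, covering both raw values and simple frequency data. The function name and the optional `freqs` parameter are assumptions made for illustration.

```python
def harmonic_mean(values, freqs=None):
    """H.M = n / sum(f_i / x_i); with no frequencies, every f_i is 1."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)  # total number of observations
    return n / sum(f / x for x, f in zip(values, freqs))

print(round(harmonic_mean([2, 3, 6]), 6))  # 3.0
```

The result is rounded only to hide floating-point noise from adding 1/2, 1/3 and 1/6.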
For any set of positive observations, the A.M, G.M and H.M are related to each other by
$$A.M \ge G.M \ge H.M$$
Note:
The sign of '=' holds if and only if all the observations are identical.
For two positive observations, $(G.M)^2 = A.M \times H.M$.
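The ordering of the three means can be checked with Python's standard `statistics` module (its `geometric_mean` needs Python 3.8 or later). The pair `[4, 9]` is a hypothetical example; the product identity is checked here only for two values, where it is known to hold.

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

values = [2, 3, 6]
am, gm, hm = mean(values), geometric_mean(values), harmonic_mean(values)
print(am >= gm >= hm)  # True

pair = [4, 9]  # hypothetical pair of positive values
lhs = geometric_mean(pair) ** 2
rhs = mean(pair) * harmonic_mean(pair)
print(math.isclose(lhs, rhs))  # True
```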
Median:
Median is defined as the value of the middle item (or the mean of the values of the two middle
items) when the data are arranged in an ascending or descending order of magnitude. If there are
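an odd number of items in the array, the median is the middle number. If there is an even number
of items, the median is the average of the two middle numbers.
$$\text{Median} = \begin{cases} \left(\frac{n+1}{2}\right)^{\text{th}} \text{ element}, & \text{if } n \text{ is odd} \\ \dfrac{\left(\frac{n}{2}\right)^{\text{th}} \text{ element} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ element}}{2}, & \text{if } n \text{ is even} \end{cases}$$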
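The odd and even cases can be sketched in a few lines. The sample lists are hypothetical; Python lists are 0-based, so the 1-based positions in the definition are shifted by one.

```python
def median(values):
    """Median of raw data, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th element, 1-based
    # average of the (n/2)th and (n/2 + 1)th elements, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 3, 9, 1, 5]))      # 5
print(median([7, 3, 9, 1, 5, 11]))  # 6.0
```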
In the case of a continuous frequency distribution, we first locate the median class by cumulating
the frequencies until the $\left(\frac{N}{2}\right)^{\text{th}}$ point is reached. Finally, the median is calculated with the help of
the following formula:
$$\text{Median} = L_{Cb} + \left(\frac{\frac{N}{2} - Cf}{f}\right) w$$
where $L_{Cb}$ is the lower class boundary of the median class, Cf = less-than
cumulative frequency of the class preceding (one before) the median class, f is the frequency of
the median class and w is the class width.
Remark: The median class is the class with the smallest cumulative frequency (less-than type)
greater than or equal to $\frac{N}{2}$.
In order to calculate the median in this case, based on the provided cumulative frequency, the median is the
value of the $\frac{N}{2} = \frac{143}{2} = 71.5$th item, which lies in the class (1,200-1,400). Thus (1,200-1,400) is the
median class. For determining the median in this class, we use the interpolation formula as follows:
$$\text{Median} = L_{Cb} + \left(\frac{\frac{N}{2} - Cf}{f_{mc}}\right) w = 1200 + \left(\frac{71.5 - 43}{30}\right) \times 200 = 1390$$
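The interpolation step for the grouped median can be sketched as below. The function and parameter names are hypothetical, and the inputs are the values quoted in the example (lower boundary 1200, N = 143, Cf = 43, f = 30, w = 200).

```python
def grouped_median(lcb, n, cf, f, w):
    """Median = Lcb + ((N/2 - Cf) / f) * w for a continuous distribution."""
    # Multiply before dividing to keep the floating-point result exact here.
    return lcb + (n / 2 - cf) * w / f

print(grouped_median(lcb=1200, n=143, cf=43, f=30, w=200))  # 1390.0
```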
$$\text{Mode} = \hat{X} = l_0 + \left(\frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)}\right) w$$
where $l_0$ is the lower value of the class in which the mode lies (the modal class), $f_1$ is the frequency of
the modal class, $f_0$ is the frequency of the class preceding the modal class, $f_2$ is the frequency
of the class succeeding the modal class and w is the class width of the modal class.
$$\text{Mode} = 60 + \left(\frac{12 - 8}{(12 - 8) + (12 - 9)}\right) \times 10 = 60 + \left(\frac{4}{4 + 3}\right) \times 10 = 65.7$$
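As an illustrative sketch, the grouped-mode formula can be evaluated with the figures used in the calculation above (the function name `grouped_mode` is an assumption):

```python
def grouped_mode(l0, f0, f1, f2, w):
    """Mode = l0 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * w."""
    return l0 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * w

# Modal class starts at 60 with frequency 12; neighbours have 8 and 9.
print(round(grouped_mode(l0=60, f0=8, f1=12, f2=9, w=10), 1))  # 65.7
```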
Advantage of mode:
The mode is not affected by the extreme value in the distribution.
The mode value can be calculated for open-ended frequency distribution.
It is the only measure of central tendency that can be used for qualitative data, for
example in describing the opinions of people about a certain phenomenon.
Disadvantage of mode:
Mode is not a rigidly defined measure, as there are several methods for calculating its
value.
It is difficult to locate modal class in the case of multi-modal frequency distribution.
Mode is not suitable for algebraic manipulations.
When data set contains more than one mode, such values are difficult to interpret and
compare.
Measures of location (positional measures): They tell where a specific data value falls within the
data set, or its relative position in comparison with other data values. Quantiles are measures which
divide a given set of data into equal subdivisions; they are obtained by the same procedure as
the median, but the data must be arranged in increasing order. The most commonly used ones
are quartiles, deciles and percentiles. Because these measures depend on their positions in the
distribution, quartiles, deciles and percentiles are collectively called quantiles.
Quartiles: Quartiles are measures which divide the ordered data into four equal parts. They are usually
denoted by Q1, Q2 and Q3 and are obtained after arranging the data in increasing order. They are known,
respectively, as the first (lower) quartile, the value below which 25% of the observations lie;
the second quartile, the value below (or above) which 50% of the observations lie;
and the third (upper) quartile, the value below which 75% of the arranged
items lie (or above which 25% lie).
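For ungrouped data, the ith quartile is the value of the item which is at the
$i\left(\frac{n+1}{4}\right)^{\text{th}}$ position. That is, $Q_1$ is the $\left(\frac{n+1}{4}\right)^{\text{th}}$ item, $Q_2$ is the $2\left(\frac{n+1}{4}\right)^{\text{th}}$ item
and $Q_3$ is the $3\left(\frac{n+1}{4}\right)^{\text{th}}$ item.
For grouped data or a continuous frequency distribution, quartiles can be obtained by using
$$Q_i = L_o + \left(\frac{\frac{i \times n}{4} - cf}{f_{Q_i}}\right) w, \quad \text{for } i = 1, 2, 3,$$
where n = the sum of the frequencies of all classes = $\sum f_i$, $L_o$ = the lower class boundary of
the ith quartile class, Cf = the cumulative frequency of the class before the ith quartile class,
$f_{Q_i}$ = the frequency of the ith quartile class and w = the class width.
Note: To find the ith quartile class, compute $\frac{i \times n}{4}$ and search for the minimum less-than cumulative
frequency greater than or equal to this value; the class corresponding to this cumulative
frequency is the ith quartile class.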
Deciles: Deciles are measures which divide a given ordered data set into ten equal parts, each
containing an equal number of elements. There are nine dividing points, denoted by
D1, D2, D3, …, D9 and called the first, the second, …, the ninth decile, respectively.
For ungrouped data, the ith decile is the value of the item which is at the $i\left(\frac{n+1}{10}\right)^{\text{th}}$
position.
For grouped data or a continuous frequency distribution, deciles can be obtained by using
$$D_i = L_o + \left(\frac{\frac{i \times n}{10} - cf}{f_{D_i}}\right) w, \quad \text{for } i = 1, 2, 3, \dots, 9,$$
where n = the sum of the frequencies of all classes = $\sum f_i$, $L_o$ is the lower class boundary of
the ith decile class, Cf is the cumulative frequency of the class before the ith decile class,
$f_{D_i}$ is the frequency of the ith decile class and w is the class width.
Note: To find the ith decile class, compute $\frac{i \times n}{10}$ and search for the minimum less-than cumulative
frequency greater than or equal to this value.
Percentiles: Percentiles are measures having 99 points which divide a given ordered data set into
100 equal parts, each consisting of an equal number of elements. They are denoted by P1, P2, …, P99
and known as the 1st, 2nd, …, 99th percentiles, respectively.
For ungrouped data, the ith percentile is the value of the item at the $i\left(\frac{n+1}{100}\right)^{\text{th}}$ position.
For grouped data or a continuous frequency distribution, percentiles can be obtained by using
$$P_i = L_o + \left(\frac{\frac{i \times n}{100} - cf}{f_{P_i}}\right) w, \quad \text{for } i = 1, 2, 3, \dots, 99.$$
Note: To find the ith percentile class, compute $\frac{i \times n}{100}$ and search for the minimum less-than cumulative
frequency greater than or equal to this value; the class corresponding to this cumulative frequency is the ith
percentile class.
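For ungrouped data, the position rule i(n + 1)/k (with k = 4, 10 or 100) can be sketched as below. The data are hypothetical. The notes do not say how to handle a fractional position; as an assumption, the standard library's `statistics.quantiles` with `method="exclusive"` is used here, which is based on the same (n + 1) rule and interpolates linearly between neighbouring items.

```python
from statistics import quantiles

def quantile_position(i, n, parts):
    """1-based position of the ith quantile: i * (n + 1) / parts."""
    return i * (n + 1) / parts

data = sorted([14, 2, 10, 6, 4, 12, 8])  # hypothetical observations, n = 7

positions = [quantile_position(i, len(data), 4) for i in (1, 2, 3)]
print(positions)                                 # [2.0, 4.0, 6.0]
print(quantiles(data, n=4, method="exclusive"))  # [4.0, 8.0, 12.0]
```

With n = 7 the quartile positions are whole numbers, so Q1, Q2 and Q3 are simply the 2nd, 4th and 6th ordered items.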
Class interval   <5   5–10   10–15   15–20   20–25   25–30   30–35   35–40
Frequency         2    5      7       13      21      16      8       3
Compute,
Median = $L_{cb} + \left(\frac{\frac{n}{2} - Cf}{f_{mc}}\right) w$. To find the median class, compute $\frac{n}{2} = \frac{75}{2} = 37.5$.
Median = $20 + \left(\frac{37.5 - 27}{21}\right) \times 5 = 22.5$. Thus 50% of the companies earned an annual profit of
22.5 thousand birr or less.
Note that the 2nd quartile in the above example is equal to the median value of the profit earned
by the 75 companies.
$P_{72} = 25 + \left(\frac{54 - 48}{16}\right) \times 5 = 26.875$, where $\frac{72 \times 75}{100} = 54$. It shows that 72% of the companies earn a profit
of 26.875 thousand birr or less.
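$D_2 = L_o + \left(\frac{\frac{2 \times n}{10} - cf}{f_{D_2}}\right) w$. To find the 2nd decile class, compute $\frac{2 \times n}{10} = \frac{2 \times 75}{10} = 15$.
$D_2 = 15 + \left(\frac{15 - 14}{13}\right) \times 5 = 15.385$.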
13
n 75
D). To find 1st quartile class, compute 18.75
4 4
Q1 L o
n 4 cf w 18.75 14
= 15 5 16.827
f Q1 13
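The four results for this table can be reproduced with one general routine, since the median, quartiles, deciles and percentiles all use the same interpolation. This is a sketch under stated assumptions: the function name is hypothetical, and the open class "<5" is treated as 0–5 so that every class has width 5.

```python
from itertools import accumulate

def grouped_quantile(lower_bounds, freqs, width, i, parts):
    """ith quantile of `parts` equal parts: Lo + ((i*n/parts - cf) / f) * w."""
    n = sum(freqs)
    target = i * n / parts
    cum = list(accumulate(freqs))  # less-than cumulative frequencies
    # Quantile class: first class whose cumulative frequency >= target.
    idx = next(j for j, c in enumerate(cum) if c >= target)
    cf_before = cum[idx - 1] if idx > 0 else 0
    return lower_bounds[idx] + (target - cf_before) / freqs[idx] * width

lower = [0, 5, 10, 15, 20, 25, 30, 35]  # "<5" taken as 0-5 (assumption)
freqs = [2, 5, 7, 13, 21, 16, 8, 3]     # n = 75

print(grouped_quantile(lower, freqs, 5, 1, 2))              # 22.5   (median)
print(grouped_quantile(lower, freqs, 5, 72, 100))           # 26.875 (P72)
print(round(grouped_quantile(lower, freqs, 5, 2, 10), 3))   # 15.385 (D2)
print(round(grouped_quantile(lower, freqs, 5, 1, 4), 3))    # 16.827 (Q1)
```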
CHAPTER FOUR
In addition to knowing the average, you must know how the data values are dispersed.
That is,
Do the data values cluster around the mean?
Are they spread more evenly throughout the distribution?
The measures that determine the spread of the data values are called measures of variation, or
measures of dispersion. These measures include the range, variance, and standard deviation. The
scatter or variation of observations from their average is called dispersion.
The measures of dispersion which are expressed in terms of the original unit of a series are termed
as absolute measures. Such measures are not suitable for comparing the variability of two
distributions which are expressed in different units of measurement and different average size.
Range
Range is the simplest measure of dispersion; it is the highest value minus the lowest value. The
symbol R is used for the range. Range takes only the maximum and minimum values into account
and not all the values. Hence it is a very unstable or unreliable indicator of the amount of deviation.
For grouped data, Range = largest class limit minus smallest class limit, or midpoint of the last class
minus midpoint of the first class, or upper class boundary of the last class minus lower class boundary
of the first class.
Advantage of range:
It is useful for knowing only the extent of the extreme dispersion under “ordinary” conditions.
It is easy to calculate and simple to understand.
It can be used to measure a symmetric and nearly continuous series.
Disadvantage of range:
Quartile deviation:
Quartiles are the points which divide the array into four equal parts.
Inter Quartile Range: It is the difference between the 3rd and 1st quartiles, and it is a better indicator of
absolute variability than the range.
I.Q.R = $Q_3 - Q_1$
Quartile Deviation (semi-inter quartile range) is half of the inter quartile range: Q.D = $\frac{Q_3 - Q_1}{2}$
The size of the quartile deviation gives an indication of the uniformity. If Q.D is small, it
denotes large uniformity. Thus, a coefficient of quartile deviation is used for comparing
uniformity or variation in different distributions.
As compared to range, it is considered a superior measure of dispersion.
Like the range, it fails to cover all items in the distribution.
It is not influenced by the extreme values in a distribution.
Coefficient of quartile deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
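The three quartile-based measures can be computed together. In this sketch the function name and the quartile values Q1 = 20 and Q3 = 30 are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def quartile_measures(q1, q3):
    """Return (inter quartile range, quartile deviation, coefficient of Q.D)."""
    iqr = q3 - q1
    return iqr, iqr / 2, iqr / (q3 + q1)

print(quartile_measures(20, 30))  # (10, 5.0, 0.2)
```

The coefficient is unit free, which is what makes it usable for comparing distributions measured in different units.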
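The sample variance for raw data can be obtained as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$
which is an unbiased estimator of the population variance. The computing formula for the variance can be simplified as
$$S^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$
For grouped data,
$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1},$$
where $m_i$ is the midpoint value of the ith class interval, $f_i$ is its frequency and $n = \sum f_i$. The corresponding computing formula is
$$S^2 = \frac{\sum_{i=1}^{k} f_i m_i^2 - \frac{\left(\sum_{i=1}^{k} f_i m_i\right)^2}{n}}{\sum_{i=1}^{k} f_i - 1}$$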
Properties of Variance
o The variance and standard deviation of a data set can never be negative.
o If every element in the distribution is multiplied by a constant C, the new variance is
$S^2_{new} = C^2 S^2_{old}$.
o When a constant c is added to all measurements of the distribution, the variance doesn't
change.
o The variance of a constant measured n times is zero.
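The definitional and computing forms of the sample variance, and the properties listed above, can be checked with a short sketch. The data and the constant `C` are hypothetical; `math.isclose` is used because floating-point results may differ in the last digit.

```python
import math

def sample_variance(values):
    """Definition: sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computing formula: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

data = [2, 4, 4, 5, 7, 8]  # hypothetical observations
C = 3                      # hypothetical constant
s2 = sample_variance(data)

print(s2)                                                   # 4.8
print(math.isclose(s2, sample_variance_shortcut(data)))     # True
# Multiplying by C scales the variance by C squared.
print(math.isclose(sample_variance([C * x for x in data]), C**2 * s2))  # True
# Adding a constant leaves the variance unchanged.
print(math.isclose(sample_variance([x + C for x in data]), s2))         # True
# A constant measured n times has zero variance.
print(sample_variance([5, 5, 5, 5]))                        # 0.0
```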
Standard Deviation
The standard deviation is defined as the square root of the mean of the squared deviations of
individual values from their mean. Finding the square root of the variance puts the standard
deviation in the same units as the raw data.
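$$S.D = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$
For a sample,
$$S = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}},$$
where X = individual value, $\bar{X}$ = sample mean and n = sample size.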
Its advantage over variance is that it is in the same unit as the variable under consideration.
It is a measure of average variation in the set of data.
Example: Calculate the S.D for the following grouped frequency distribution.
Class intervals   1–3   3–5   5–7   7–9   9–11   11–13   13–15   Total
Frequency (fi)    1     9     25    35    17     10      3       100
Solution:
$$S^2 = \frac{\sum f_i m_i^2 - \frac{\left(\sum f_i m_i\right)^2}{n}}{\sum f_i - 1} = \frac{7016 - \frac{(800)^2}{100}}{99} = 6.22$$
$$S = \sqrt{S^2} = \sqrt{6.22} = 2.49$$
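The intermediate sums in this solution (∑fm = 800 and ∑fm² = 7016) can be reproduced from the class midpoints. A minimal sketch, with a hypothetical function name:

```python
import math

def grouped_variance(midpoints, freqs):
    """S^2 = (sum(f*m^2) - (sum(f*m))^2 / n) / (n - 1), with n = sum(f)."""
    n = sum(freqs)
    sum_fm = sum(f * m for m, f in zip(midpoints, freqs))
    sum_fm2 = sum(f * m * m for m, f in zip(midpoints, freqs))
    return (sum_fm2 - sum_fm ** 2 / n) / (n - 1)

midpoints = [2, 4, 6, 8, 10, 12, 14]  # midpoints of 1-3, 3-5, ..., 13-15
freqs = [1, 9, 25, 35, 17, 10, 3]

s2 = grouped_variance(midpoints, freqs)
print(round(s2, 2), round(math.sqrt(s2), 2))  # 6.22 2.49
```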
As previously stated, variances and standard deviations can be used to determine the spread
of the data. If the variance or standard deviation is large, the data are more dispersed. This
information is useful in comparing two (or more) data sets to determine which is more
(most) variable.
The measures of variance and standard deviation are used to determine the consistency of
a variable. For example, in the manufacture of fittings, such as nuts and bolts, the variation
in the diameters must be small, or the parts will not fit together.
The variance and standard deviation are used to determine the number of data values that
fall within a specified interval in a distribution. For example, Chebyshev’s theorem.
Finally, the variance and standard deviation are used quite often in inferential statistics.
Coefficient of variation (CV): The CV is a unit-free measure. It is always expressed as a percentage.
$CV = \frac{SD}{\text{Mean}} \times 100\%$. The CV will be small if the variation is small. Of the two groups, the one with
the smaller CV is said to be more consistent.
Example: The mean for the number of pages of a sample of women’s fitness magazines is 132,
with a variance of 23; the mean for the number of advertisements of a sample of women’s fitness
magazines is 182, with a variance of 62. Compare the variations.
CV for pages = $\frac{\sqrt{23}}{132} \times 100\% = 3.6\%$ and CV for advertisements = $\frac{\sqrt{62}}{182} \times 100\% = 4.3\%$
The number of advertisements is more variable than the number of pages since the coefficient of
variation is larger for advertisements.
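The comparison can be sketched in code. Note that the example gives variances, so the standard deviation is taken as the square root first (the function name is an assumption):

```python
import math

def coefficient_of_variation(sd, mean):
    """CV = (SD / mean) * 100, expressed as a percentage."""
    return sd / mean * 100

cv_pages = coefficient_of_variation(math.sqrt(23), 132)
cv_ads = coefficient_of_variation(math.sqrt(62), 182)
print(round(cv_pages, 1), round(cv_ads, 1))  # 3.6 4.3
```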
This measures the deviation of an individual observation from the mean of all
observations in units of standard deviation and is termed the Z-score: $Z = \frac{X - \bar{X}}{S}$.
A standard score or z score tells how many standard deviations a data value is above or
below the mean for a specific distribution of values. If a standard score is zero, then the
data value is the same as the mean.
A comparison of a relative standard similar to both groups can be made.
Student   Economics   Accounting   Total
A         84          75           159
B         74          85           159
Average mark for Accounting is 50 with standard deviation of 11 and average mark for economics
is 60 with standard deviation 13. Whose performance is better A or B?
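Z score for A:
Economics: $\frac{84 - 60}{13} = 1.846$
Accounting: $\frac{75 - 50}{11} = 2.273$
Z score for B:
Economics: $\frac{74 - 60}{13} = 1.077$
Accounting: $\frac{85 - 50}{11} = 3.182$
Since B's total standard score (1.077 + 3.182 = 4.259) exceeds A's (1.846 + 2.273 = 4.119), B's performance is better relative to the group.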
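The standard scores can be recomputed with a short sketch. The dictionary layout and names are hypothetical, and comparing students by the sum of their z-scores is one reasonable reading of the question rather than a method stated in the notes.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd

subjects = {"Economics": (60, 13), "Accounting": (50, 11)}  # (mean, SD)
marks = {
    "A": {"Economics": 84, "Accounting": 75},
    "B": {"Economics": 74, "Accounting": 85},
}

totals = {}
for student, scores in marks.items():
    zs = {s: z_score(x, *subjects[s]) for s, x in scores.items()}
    totals[student] = sum(zs.values())
    print(student, {s: round(z, 3) for s, z in zs.items()}, round(totals[student], 3))
# A {'Economics': 1.846, 'Accounting': 2.273} 4.119
# B {'Economics': 1.077, 'Accounting': 3.182} 4.259
```

Both students have the same raw total of 159, yet the standardised totals differ, which is the point of the comparison.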
Chebyshev’s Theorem: The proportion of values from a data set that will fall within k standard
deviations of the mean will be at least $1 - \frac{1}{k^2}$, where k is a number greater than 1 (k is not necessarily an
integer).
Example: The mean price of houses in a certain neighborhood is $50,000, and the standard
deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
Solution: Chebyshev’s theorem states that at least three-fourths, or 75%, of the data values will fall within
2 standard deviations of the mean. Thus,
$50,000 + 2($10,000) = $70,000 and $50,000 − 2($10,000) = $30,000.
Hence, at least 75% of all homes sold in the area will have a price in the range from $30,000 to $70,000.
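Chebyshev's bound and the matching interval can be sketched as two small functions (both names are assumptions):

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, sd, k):
    """Interval covering at least chebyshev_bound(k) of the data."""
    return mean - k * sd, mean + k * sd

print(chebyshev_bound(2))                     # 0.75
print(chebyshev_interval(50_000, 10_000, 2))  # (30000, 70000)
```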
Measures of shape
We have seen that averages and measures of dispersion can help in describing the frequency
distribution. However, they are not sufficient to describe the nature of the distribution. For this
purpose, we use skewness and kurtosis, commonly known as measures of shape. Frequency
distributions can assume many shapes. The three most important shapes are positively skewed,
symmetric, and negatively skewed.
Skewness:
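A distribution is positively skewed if it is skewed to the right side, i.e., when mean > median > mode, and negatively skewed if it is skewed to the left side, i.e., when mode > median > mean.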
When a distribution is extremely skewed, the value of the mean will be pulled toward the
tail, but the majority of the data values will be greater than the mean or less than the mean
(depending on which way the data are skewed); hence, the median rather than the mean is
a more appropriate measure of central tendency.
For distributions which are bell shaped and moderately skewed, we have an approximate
relationship between the A.M, median and mode: Mean − Mode ≈ 3(Mean − Median).
For a symmetrical distribution Sk = 0. If the distribution is negatively skewed, then the value of Sk
is negative, and if it is positively skewed then Sk is positive. The range for values of Sk is from
−3 to 3.
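The formula for Sk does not appear in this part of the notes. As a hedged sketch, assume Sk is Pearson's second coefficient of skewness, 3(mean − median)/S, which is consistent with the stated range of −3 to 3. The data sets are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Assumed form of Sk: 3 * (mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

print(pearson_skewness([1, 2, 3, 4, 5]))       # 0.0 (symmetric)
print(pearson_skewness([1, 2, 3, 4, 20]) > 0)  # True (long right tail)
print(pearson_skewness([1, 17, 18, 19, 20]) < 0)  # True (long left tail)
```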