Reference Books:
1. “Theory and Problems of Statistics”, 4th Edition, Schaum’s Outline Series.
2. “Applied Statistics for Civil and Environmental Engineers”, 2nd Ed., by Nathabandu T. Kottegoda and Renzo Rosso,
Blackwell Publishers, 2008.
Descriptive Statistics: Descriptive statistics comprises the statistical procedures used to organize, summarize and describe data. The data
may be collected from either a sample or a population, but the results describe only the group that is being studied. Frequency
distributions, measures of central tendency (mean, median, and mode), and graphs such as pie charts and bar charts are all examples
of descriptive statistics.
Inferential Statistics: Inferential statistics is concerned with making predictions or inferences about a population from observations
and analysis of a sample. Regression analysis, tests of hypotheses and significance, and analysis of variance are examples of inferential
statistics.
Example 1: Thirty batteries were tested to determine how long they would last. The results, to the nearest minute, were recorded as:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400, 381, 399, 415, 428, 422, 396, 372, 410,
419, 386, 390
Construct a frequency distribution table. Also construct a histogram and an ogive.
ANS. Range = xmax - xmin = 431 - 363 = 68

Class limits (C.L)   Classes    Class boundaries (C.B)   Tally       Frequency (f)   Cumulative frequency (c.f)
360-369              360-370    359.5-369.5              ||          2               2
370-379              370-380    369.5-379.5              |||         3               5
380-389              380-390    379.5-389.5              |||| |      5               10
390-399              390-400    389.5-399.5              |||| |||    7               17
400-409              400-410    399.5-409.5              |||| |      5               22
410-419              410-420    409.5-419.5              ||||        4               26
420-429              420-430    419.5-429.5              |||         3               29
430-439              430-440    429.5-439.5              |           1               30
Total                                                                30

Alternative classes starting from the minimum value 363 (given): 363-372, 373-382, 383-392, 393-402, 403-412, 413-422, 423-432
Note: with classes written as 360-370, 370-380, ..., an observation equal to an upper class limit is counted in the next class interval.
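The tallying in Example 1 can be checked with a short Python sketch. The function name frequency_table and its parameters are illustrative choices, not part of the handout, and the sketch assumes integer observations and equal class widths.

```python
from itertools import accumulate

# Battery lifetimes (minutes) from Example 1.
lifetimes = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
             392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
             399, 415, 428, 422, 396, 372, 410, 419, 386, 390]


def frequency_table(data, start, width, n_classes):
    """Return rows (lower limit, upper limit, frequency, cumulative frequency).

    Assumes integer data and classes start..start+width-1, and so on.
    """
    freqs = [0] * n_classes
    for x in data:
        idx = (x - start) // width          # index of the class containing x
        if not 0 <= idx < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        freqs[idx] += 1
    cums = list(accumulate(freqs))          # running totals give the c.f column
    return [(start + i * width, start + (i + 1) * width - 1, f, c)
            for i, (f, c) in enumerate(zip(freqs, cums))]


table = frequency_table(lifetimes, start=360, width=10, n_classes=8)
for lower, upper, f, cf in table:
    print(f"{lower}-{upper}  f={f:2d}  c.f={cf:2d}")
```

The cumulative frequencies in the last column are the heights used when plotting the ogive against the upper class boundaries.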
Handout 1 (Complete) (2)
Exercise 1 : The bureau of labor statistics has sampled 30 communities nationwide and compiled prices in each community at the beginning
and end of August in order to find out approximately how the Consumer Price Index has changed during August. The percentage changes in
prices for the 30 communities are as follows: Ref. Ex. 2.19 “Statistics for Management” 7th by Levin Rubin
0.7 0.4 0.3 0.2 0.1 0.1 0.3 0.7 0.0 0.4
0.1 0.5 0.2 0.3 1.0 0.3 0.0 0.2 0.5 0.1
0.5 0.3 0.1 0.5 0.4 0.0 0.2 0.3 0.5 0.4
Construct a frequency distribution using four equal-sized classes, with the minimum value as the first lower class limit.
Example 2: The weights of 40 male students at a university are recorded to the nearest pound. Construct a frequency distribution.

Data (sorted, reading down the columns):
119 135 138 144 146 150 156 164
125 135 140 144 147 150 157 165
126 135 140 145 147 152 158 168
128 136 142 145 148 153 161 173
132 138 142 146 149 154 163 176

Classes    Mid points   Tally             Frequency   Cf
119-127    122.5        |||               3           3
127-135    130.5        ||                2           5
135-143                 |||| |||| ||      10          15
143-151                 |||| |||| ||||    12          27
151-159                 |||| ||           6           33
159-167                 ||||              4           37
167-175                 ||                2           39
175-183                 |                 1           40
A Company manufactures metal rods in different lengths. The table given below shows information of a day’s production of the company.
Book: “Theory and Problems of Statistics” 4th Edition by Schaum’s Series; Practice Problems: 2.27, 2.28, 2.29
Measures of Location
Median
- If n is odd, then the median is the middle value.
- If n is even, the median is the average of the two middle values.
Mode
The mode is the value that is repeated most often in the data set.
e.g. The ages in years of the cars worked on by the Village Auto Haus last week:
5 6 3 6 11 7 9 10 2 4 10 6 2 1 5. The mode in this case is 6.
Examples (2)
A computing student received the following grades in the subjects of his first semester in 2007:
Y = [6; 7; 6; 8; 5; 7; 6; 9; 10; 6]   Mode = 6 (unimodal)
1, 2, 3, 4, 5, 6, 6, 7, 7   The modes are 6 and 7 (bimodal)
2, 3, 4, 2, 3, 4, 7, 8   The modes are 2, 3 and 4 (multimodal)
2, 3, 4, 5, 6, 7, 8   no mode
2, 2, 3, 3, 4, 4, 5, 5   no mode
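The five cases above can be reproduced with a small Python sketch. The helper name modes is hypothetical, and the "no mode" rule (all distinct values occurring equally often) is the convention read off the examples rather than a universal definition.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or [] when the data have no mode.

    Convention taken from the examples: if there are several distinct
    values and all of them occur equally often, there is no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([6, 7, 6, 8, 5, 7, 6, 9, 10, 6]))   # one mode: unimodal
print(modes([1, 2, 3, 4, 5, 6, 6, 7, 7]))       # two modes: bimodal
print(modes([2, 2, 3, 3, 4, 4, 5, 5]))          # empty list: no mode
```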
Exercise 1.
A semi-commercial test plant produced the following daily outputs in tonnes/day:
1.3 2.5 1.8 1.4 3.2 1.9 1.3 2.8 1.1 1.7
1.4 3.0 1.6 1.2 2.3 2.9 1.1 1.7 2.0 1.4
Find the mode.
(ref . McCoursey Chap 4 )
Other Measures of Location
The other measures of location discussed here are quartiles, deciles and percentiles.
Quartiles
Quartiles divide the distribution into four groups, separated by Q1, Q2 and Q3. Note that Q1 is the same as the 25th percentile; Q2 is the
same as the 50th percentile, or the median; Q3 corresponds to the 75th percentile.
For Q1, first check whether $n/4$ is an integer or a non-integer.
If $n/4$ is not an integer, then $Q_1 = \left(\left[\tfrac{n}{4}\right] + 1\right)$th item in the data.
If $n/4$ is an integer, then $Q_1$ = average of the $\left(\tfrac{n}{4}\right)$th and $\left(\tfrac{n}{4} + 1\right)$th items.
Similarly, for Q2 and Q3 we check whether $2n/4$ and $3n/4$ are integers or non-integers, and then find Q2 and Q3 in the same way as
we did for Q1.
Deciles
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, etc.
For D7, first check whether $7n/10$ is an integer or a non-integer.
If $7n/10$ is not an integer, then $D_7 = \left(\left[\tfrac{7n}{10}\right] + 1\right)$th item in the data.
If $7n/10$ is an integer, then $D_7$ = average of the $\left(\tfrac{7n}{10}\right)$th and $\left(\tfrac{7n}{10} + 1\right)$th items.
Similarly, for D2 and D3 we check whether $2n/10$ and $3n/10$ are integers or non-integers, and then find D2 and D3 in the same way as
we did for D7.
Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.
Percentiles divide the data set into 100 equal groups.
Percentiles are symbolized by P1, P2, P3, . . . , P99.
For instance, for P27, first check whether $27n/100$ is an integer or a non-integer.
If $27n/100$ is not an integer, then $P_{27} = \left(\left[\tfrac{27n}{100}\right] + 1\right)$th item in the data.
If $27n/100$ is an integer, then $P_{27}$ = average of the $\left(\tfrac{27n}{100}\right)$th and $\left(\tfrac{27n}{100} + 1\right)$th items.
Similarly, for P25 and P30 we check whether $25n/100$ and $30n/100$ are integers or non-integers, and then find P25 and
P30 in the same way as we did for P27.
Note that
The median is the point with 50% of the data below it (i.e. 2n/4 or n/2)
Q1 is the point with 25% of the data below it (i.e. n/4)
Q3 is the point with 75% of the data below it (i.e. 3n/4)
D7 is the point with 70% of the data below it (i.e. 7n/10)
P45 is the point with 45% of the data below it (i.e. 45n/100)
Note:
For all of these measures, first we arrange our data in ascending order.
Examples (3)
The breaking strength of test pieces of a certain alloy is given as under
X: 95 103 97 130 96 73 78 95 89 68
82 79 69 67 83 108 94 87 93 117
Find quartiles, deciles and percentiles.
Solution
Arranged form of the data:
67, 68, 69, 73, 78, 79, 82, 83, 87, 89, 93, 94, 95, 95, 96, 97, 103, 108, 117, 130
Data II (n = 15): 67, 68, 69, 73, 78, 79, 82, 83, 87, 89, 93, 94, 95, 95, 96
For Q3: 3n/4 = 11.25 (non-integer), so Q3 = ([11.25] + 1)th item = (11 + 1)th item = 12th item in the data = 94
For P35: 35n/100 = 5.25 (non-integer), so P35 = ([5.25] + 1)th item = (5 + 1)th item = 6th item in the data = 79
Note that the bracket [x] denotes the greatest integer that is less than or equal to x.
This means [3] = 3, [7] = 7, [37] = 37
and [3.1] = 3, [3.9] = 3, [3.999999] = 3
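The positional rule for quartiles, deciles and percentiles of ungrouped data can be sketched in Python as follows. The function name positional_quantile is an illustrative assumption; the rule itself is the one stated above, with integer arithmetic used to decide whether kn/parts is an integer.

```python
def positional_quantile(data, k, parts):
    """k-th quantile when the data are divided into `parts` groups.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    xs = sorted(data)                        # arrange in ascending order first
    whole, remainder = divmod(k * len(xs), parts)   # whole = [k*n/parts]
    if remainder:                            # k*n/parts is not an integer
        return xs[whole]                     # ([k*n/parts] + 1)th item (0-based index)
    # k*n/parts is an integer: average that item and the next one
    return (xs[whole - 1] + xs[whole]) / 2


data_ii = [67, 68, 69, 73, 78, 79, 82, 83, 87, 89, 93, 94, 95, 95, 96]
print(positional_quantile(data_ii, 3, 4))      # Q3 of Data II
print(positional_quantile(data_ii, 35, 100))   # P35 of Data II
```

Note that spreadsheet and library quantile functions often interpolate between items, so their results need not agree with this rule.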
Examples (4)
Consider the following grouped data (a frequency distribution):
Classes     10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89
Frequency   6      7      8      10     12     9      7      4
Calculate the median, first quartile, 7th decile and 45th percentile. Also calculate the mode.
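One possible way to work Examples (4) is sketched below in Python. It assumes the usual interpolation formulas for grouped data, which are not written out in this handout: quantile = l + (h/f)(kn/parts - C) and mode = l + h(fm - f1)/((fm - f1) + (fm - f2)), where l is the lower class boundary, h the class width, f the class frequency, C the cumulative frequency before the class, and f1, f2 the frequencies of the neighbouring classes. The function names and the 0.5 boundary adjustment are also assumptions.

```python
# Class limits and frequencies from Examples (4).
classes = [(10, 19), (20, 29), (30, 39), (40, 49),
           (50, 59), (60, 69), (70, 79), (80, 89)]
freqs = [6, 7, 8, 10, 12, 9, 7, 4]


def grouped_quantile(classes, freqs, k, parts):
    """Interpolated k-th quantile (assumed formula l + h/f * (k*n/parts - C))."""
    n = sum(freqs)
    target = k * n / parts
    width = classes[1][0] - classes[0][0]    # assumes equal class widths
    cum_before = 0
    for (lower, _), f in zip(classes, freqs):
        if cum_before + f >= target:         # first class whose c.f reaches the target
            boundary = lower - 0.5           # lower class boundary (unit gaps assumed)
            return boundary + width * (target - cum_before) / f
        cum_before += f
    raise ValueError("target position exceeds total frequency")


def grouped_mode(classes, freqs):
    """Interpolated mode of the modal class (assumed textbook formula)."""
    i = max(range(len(freqs)), key=freqs.__getitem__)
    f_m = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0
    f_next = freqs[i + 1] if i + 1 < len(freqs) else 0
    width = classes[1][0] - classes[0][0]
    boundary = classes[i][0] - 0.5
    return boundary + width * (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next))


median = grouped_quantile(classes, freqs, 2, 4)
q1 = grouped_quantile(classes, freqs, 1, 4)
d7 = grouped_quantile(classes, freqs, 7, 10)
p45 = grouped_quantile(classes, freqs, 45, 100)
mode = grouped_mode(classes, freqs)
print(median, q1, d7, p45, mode)
```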
Quartiles, Deciles and Percentiles with the help of an Ogive. Graphically, all three quartiles can be read from the ogive.
Similarly, we can find the deciles and percentiles using the ogive. In this example we have n = 83.
Practice Problems, Chap 3 Solved problems: 3.19, 3.26, 3.35, 3.37, 3.40, 3.44, 3.45
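Time 2 = $\frac{100}{50}$ = 2 hours to return
Hence the total time is 4.5 hours, and the total distance driven is 200 miles. The average speed is
$$\text{Rate} = \frac{\text{distance}}{\text{time}} = \frac{200}{4.5} = 44.44 \text{ miles per hour}$$
This value can also be found by using the harmonic mean:
$$\text{H.M} = \frac{2}{\frac{1}{40} + \frac{1}{50}} = 44.44 \qquad \left(\frac{1}{\text{H.M}} = \frac{\frac{1}{40} + \frac{1}{50}}{2}\right)$$
Definition: the harmonic mean of two values a and b is
$$\text{H.M} = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$
The harmonic mean is the reciprocal of the mean of the reciprocals, i.e.
$$\text{H.M} = \text{reciprocal of } \frac{\sum (1/x_i)}{n} = \frac{n}{\sum (1/x_i)}$$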
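As a quick check of the definition, the following Python sketch computes the harmonic mean of the two speeds directly and compares it with the standard library's statistics.harmonic_mean. The function name harmonic_mean_manual is illustrative; the sketch assumes positive values and equal distances on each leg of the trip.

```python
from statistics import harmonic_mean


def harmonic_mean_manual(values):
    """n divided by the sum of reciprocals (all values must be positive)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)


speeds = [40, 50]                       # miles per hour, out and back
average_speed = harmonic_mean_manual(speeds)
print(round(average_speed, 2))          # average speed over the round trip
print(round(harmonic_mean(speeds), 2))  # same value from the standard library
```

The harmonic mean is the right average here because the same distance is covered at each speed; with equal times instead, the arithmetic mean would apply.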
Exercise 2.
Using the harmonic mean, find each of these.
(a). A salesperson drives 300 miles round trip at 30 miles per hour going to Chicago and 45 miles per hour returning home. Find the average
miles per hour.
(b). A bus driver drives the 50 miles to West Chester at 40 miles per hour and returns driving 25 miles per hour. Find the average miles per
hour.
(c). A carpenter buys $500 worth of nails at $50 per pound and $500 worth of nails at $10 per pound. Find the average cost of 1 pound of
nails.
Dispersion Measures
The main measures are
(1) Range = $x_{max} - x_{min} = x_n - x_0$
co-efficient of dispersion = $\frac{x_n - x_0}{x_n + x_0}$
(2) Inter-quartile Range
IQR = $Q_3 - Q_1$
Q.D = $\frac{Q_3 - Q_1}{2}$, called the quartile deviation
co-efficient of quartile deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
which is used for comparing the variation in two or more sets of data.
(3) Mean Deviation = M.D = $\frac{\sum |x_i - \bar{x}|}{n}$, also called an absolute measure
The absolute values are needed because $\frac{\sum (x_i - \bar{x})}{n} = 0$, since $\sum (x_i - \bar{x}) = 0$: the sum of the deviations from the mean is always zero.
(4) Population Variance and Standard Deviation
The variance is the mean of the squared deviations from the mean.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$ (for ungrouped data)
alternative formula,
$$\sigma^2 = \frac{\sum X_i^2}{N} - \left(\frac{\sum X_i}{N}\right)^2$$ (for ungrouped data)
The positive square root of the variance is called the standard deviation. Symbolically,
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$ (for ungrouped data)
Note: S.D is always non-negative. For the data 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 every deviation from the mean is zero, so S.D = 0.
The standard deviation is a very important concept that serves as a basic measure of variability. A smaller value of the
standard deviation indicates that most of the observations in the data are close to the mean while a larger value implies that the
observations are scattered widely about the mean.
It is an absolute measure of dispersion. Its relative measure, called the coefficient of standard deviation, is defined as
Coefficient of S.D. = $\frac{\text{Standard Deviation}}{\text{Mean}}$
Coefficient of variation = $\frac{\text{Standard Deviation}}{\text{Mean}} \times 100$
These measures are used to compare consistency, reliability and stability.
(5) Sample Variance and Standard Deviation
In most cases, the expression $\sum (x_i - \bar{x})^2 / n$ does not give the best estimate of the population variance. Therefore, instead of
dividing by n, find the variance of the sample by dividing by n - 1, giving a slightly larger value and an unbiased estimate of the
population variance. The formula for the sample variance, denoted by $s^2$, is
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
and the standard deviation of a sample (denoted by s) is
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Note: (i) $\bar{x}$ is an unbiased estimate of the population mean.
(ii) $\sum (x_i - \bar{x})^2 / n$ is a biased estimate of the population variance $\sigma^2$.
(iii) The division by n - 1 makes $s^2$ an unbiased estimator of the population variance.
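The difference between dividing by n and by n - 1 can be seen in a small Python sketch. The data values and the function name variance are hypothetical, chosen only for illustration; the results are compared with statistics.pvariance and statistics.variance from the standard library.

```python
import statistics


def variance(values, sample=False):
    """Mean squared deviation: divisor n (population) or n - 1 (sample)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations
print(variance(data))                      # divisor n
print(variance(data, sample=True))         # divisor n - 1, slightly larger
print(statistics.pvariance(data), statistics.variance(data))
```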
Examples (5)
The breaking strength of test pieces of a certain alloy is given as under
X: 95 103 97 130 96 73 78 95 89 68
82 79 69 67 83 108 94 87 93 117
Calculate the average breaking strength of the alloy and the standard deviation.
X      X^2      |X - mean|   (X - mean)^2
67     4489     23.15        535.92
68     4624     22.15        490.62
69     4761     21.15        447.32
73     5329     17.15        294.12
78     6084     12.15        147.62
79     6241     11.15        124.32
82     6724     8.15         66.42
83     6889     7.15         51.12
87     7569     3.15         9.92
89     7921     1.15         1.32
93     8649     2.85         8.12
94     8836     3.85         14.82
95     9025     4.85         23.52
95     9025     4.85         23.52
96     9216     5.85         34.22
97     9409     6.85         46.92
103    10609    12.85        165.12
108    11664    17.85        318.62
117    13689    26.85        720.92
130    16900    39.85        1588.02
Total: 1803  167653  253     5112.55

$$\text{Mean} = \frac{\sum X}{n} = \frac{1803}{20} = 90.15$$ (remember $(\sum X)^2 \neq \sum X^2$)
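The column totals of this example, and the measures built from them, can be verified with the Python sketch below. It treats the 20 test pieces as a population (divisor n), which is an assumption; dividing by n - 1 instead would give the sample values. Variable names are illustrative.

```python
import math

# Breaking strengths from Examples (5).
strengths = [95, 103, 97, 130, 96, 73, 78, 95, 89, 68,
             82, 79, 69, 67, 83, 108, 94, 87, 93, 117]

n = len(strengths)
total = sum(strengths)                     # sum of X
total_sq = sum(x * x for x in strengths)   # sum of X^2
mean = total / n

# Mean deviation: average absolute deviation from the mean.
mean_deviation = sum(abs(x - mean) for x in strengths) / n

# Variance from the definition and from the alternative (shortcut) formula.
variance_definition = sum((x - mean) ** 2 for x in strengths) / n
variance_shortcut = total_sq / n - (total / n) ** 2

std_dev = math.sqrt(variance_definition)
print(mean, mean_deviation, variance_definition, std_dev)
```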
C.V = $\frac{s}{\bar{x}} \times 100 = \frac{5}{87} \times 100 = 5.7\%$ (sales)
C.V = $\frac{s}{\bar{x}} \times 100 = \frac{773}{5225} \times 100 = 14.8\%$ (commissions)
Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.
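The comparison can be reproduced with a few lines of Python. The function name coefficient_of_variation is illustrative, and the standard deviations and means (5 and 87 for sales, 773 and 5225 for commissions) are the figures appearing in the computation above.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)
more_variable = "commissions" if cv_commissions > cv_sales else "sales"
print(round(cv_sales, 1), round(cv_commissions, 1), more_variable)
```

Because the coefficient of variation is unit-free, it allows the two quantities to be compared even though they are measured on different scales.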
Examples (7)
Heights of 18-year-old males have a bell-shaped distribution with mean 69.6 inches and standard deviation 1.4 inches.
(a) About what proportion of all such men are between 68.2 and 71 inches tall?
(b) What interval centered on the mean should contain about 95% of all such men?
Solution (a)
$\bar{x} - ks = 68.2 \Rightarrow 69.6 - k(1.4) = 68.2 \Rightarrow k = 1$
$\bar{x} + ks = 71 \Rightarrow 69.6 + k(1.4) = 71 \Rightarrow k = 1$
The limits form the 1-S.D interval about the mean $\bar{x}$, so by the Empirical Rule the interval contains approximately 68% of the data.
Solution (b)
By the Empirical Rule, the shortest interval containing approximately 95% of the data is $\bar{x} \pm 2s$:
$\bar{x} - 2s = 69.6 - 2(1.4) = 66.8$
$\bar{x} + 2s = 69.6 + 2(1.4) = 72.4$
So the interval (66.8, 72.4) contains approximately 95% of the data values.
Difficulty:
Alternatively, suppose that in Part (a) the limits are 68.2 and 72.4:
$\bar{x} - ks = 68.2 \Rightarrow 69.6 - k(1.4) = 68.2 \Rightarrow k = 1$, corresponding percentage 68%
$\bar{x} + ks = 72.4 \Rightarrow 69.6 + k(1.4) = 72.4 \Rightarrow k = 2$, corresponding percentage 95%
Because the distribution is symmetric, half of each percentage lies on the relevant side of the mean:
Required percentage = average of 68% and 95% = (68/2 + 95/2)% = 81.5% approx.
Note: If the value of k is different from 1, 2 or 3, the percentage is bounded using Chebyshev's theorem: at least $(1 - 1/k^2) \times 100\%$ of the data lie within k standard deviations of the mean, for k > 1.
Examples (8)
A sample of size n = 50 has mean x = 28 and standard deviation s = 3. Without knowing anything else about the sample, what
can be said about the number of observations that lie in the interval (22, 34)? What can be said about the number of observations that
lie outside that interval?
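We are given $\bar{x} = 28$ and $s = 3$, so
$\bar{x} - ks = 22 \Rightarrow 28 - 3k = 22 \Rightarrow k = 2$
$\bar{x} + ks = 34 \Rightarrow 28 + 3k = 34 \Rightarrow k = 2$
Then at least $(1 - 1/k^2) \times 100\% = (1 - 1/4) \times 100\% = 75\%$ of the 50 values, i.e. 37.5, so at least 38 observations lie in the given interval.
Therefore at most 12 observations fall outside the interval.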
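Both rules used in Examples (7) and (8) can be sketched in Python. The names chebyshev_fraction, EMPIRICAL and empirical_percentage are illustrative; the 99.7% figure for k = 3 is the standard Empirical Rule value and is an addition here, since the handout quotes only 68% and 95%.

```python
import math


def chebyshev_fraction(k):
    """Chebyshev lower bound 1 - 1/k^2 on the fraction within k S.D.s (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs k > 1")
    return 1 - 1 / k ** 2


# Examples (8): n = 50, mean 28, s = 3, interval (22, 34).
mean, sd, n = 28, 3, 50
k = (34 - mean) / sd
at_least_inside = math.ceil(chebyshev_fraction(k) * n)   # round 37.5 up
at_most_outside = n - at_least_inside

# Empirical Rule percentages for bell-shaped data (k = 3 value assumed).
EMPIRICAL = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_percentage(k_below, k_above):
    """Approximate percentage between mean - k_below*s and mean + k_above*s."""
    return EMPIRICAL[k_below] / 2 + EMPIRICAL[k_above] / 2


print(at_least_inside, at_most_outside)
print(empirical_percentage(1, 2))   # limits 68.2 and 72.4 in Examples (7)
```

Chebyshev's bound holds for any distribution, whereas the Empirical Rule percentages apply only to bell-shaped data.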
Exercise
The mean of a distribution is 20 and the standard deviation is 2. Use Chebyshev’s theorem.
a. At least what percentage of the values will fall between 10 and 30?
b. At least what percentage of the values will fall between 12 and 28? (Bluman ch. 3)
Exercise
The Energy Information Administration reported that the mean retail price per gallon of regular grade gasoline was $2.30
(Energy Information Administration, February 27, 2006). Suppose that the standard deviation was $.10 and that the retail price per
gallon has a bell shaped distribution.
a. What percentage of regular grade gasoline sold between $2.20 and $2.40 per gallon?
b. What percentage of regular grade gasoline sold between $2.20 and $2.50 per gallon?
c. What percentage of regular grade gasoline sold for more than $2.50 per gallon?
(prob. 3.30, Sweeny Chap 3 )
Moments and Moment Ratios
Moments
Moments are the arithmetic means of the powers to which the deviations are raised. The mean of the first power of the
deviation from mean is the first moment about mean. The mean of the second power of the deviation from mean is the second
moment about mean and so on….
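(i) First four moments about the mean are:
For ungrouped data:
$$m_1 = \frac{\sum (x_i - \bar{x})}{n} = 0$$
$$m_2 = \frac{\sum (x_i - \bar{x})^2}{n}$$
$$m_3 = \frac{\sum (x_i - \bar{x})^3}{n}$$
$$m_4 = \frac{\sum (x_i - \bar{x})^4}{n}$$
For grouped data:
$$m_r = \frac{\sum f_i (x_i - \bar{x})^r}{\sum f_i}, \quad r = 1, 2, 3, 4$$
(ii) First four moments about any arbitrary value ‘a’ (about x = a), denoted by $m'_r$, are:
For ungrouped data:
$$m'_1 = \frac{\sum (x_i - a)}{n}$$
$$m'_2 = \frac{\sum (x_i - a)^2}{n}$$
$$m'_3 = \frac{\sum (x_i - a)^3}{n}$$
$$m'_4 = \frac{\sum (x_i - a)^4}{n}$$
For grouped data:
$$m'_1 = \frac{\sum f_i (x_i - a)}{\sum f_i}$$
$$m'_2 = \frac{\sum f_i (x_i - a)^2}{\sum f_i}$$
$$m'_3 = \frac{\sum f_i (x_i - a)^3}{\sum f_i}$$
$$m'_4 = \frac{\sum f_i (x_i - a)^4}{\sum f_i}$$
The moments about the mean are obtained from the moments about a as follows:
$$m_1 = m'_1 - m'_1 = 0$$
$$m_2 = m'_2 - (m'_1)^2$$
$$m_3 = m'_3 - 3 m'_2 m'_1 + 2 (m'_1)^3$$
$$m_4 = m'_4 - 4 m'_3 m'_1 + 6 m'_2 (m'_1)^2 - 3 (m'_1)^4$$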
Verification: (optional)
Examples (9)
The first three moments of a distribution about the value 2 of the variable are 1, 16, -40. Show that the mean is 3, the
variance is 15 and m3 = -86. Also show that first three moments about x = 0 are 3, 24, 76.
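Solution
The given moments about x = 2 (denoted by $m'_r$) are
$$m'_1 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2) = 1$$
$$m'_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^2 = 16$$
$$m'_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^3 = -40$$
Moments about the mean:
$$m_1 = m'_1 - m'_1 = 0$$
$$m_2 = m'_2 - (m'_1)^2 = 16 - 1 = 15 \quad \text{(the variance)}$$
$$m_3 = m'_3 - 3 m'_2 m'_1 + 2 (m'_1)^3 = -40 - 3(16)(1) + 2(1)^3 = -86$$
Mean:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - 2) = \frac{1}{n}\sum_{i=1}^{n} x_i - 2 = 1 \Rightarrow \frac{1}{n}\sum_{i=1}^{n} x_i = 2 + 1 = 3$$
When a = 0, the first three moments (about the origin) are
$$\frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \frac{1}{n}\sum_{i=1}^{n} x_i^2, \qquad \frac{1}{n}\sum_{i=1}^{n} x_i^3$$
First moment about the origin:
$$\frac{1}{n}\sum_{i=1}^{n} x_i = 3$$
Second moment about the origin:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i^2 + 4 - 4x_i) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{4}{n}\sum_{i=1}^{n} x_i + 4 = 16$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^2 = 16 + \frac{4}{n}\sum_{i=1}^{n} x_i - 4 = 16 + 4(3) - 4 = 24$$
Third moment about the origin:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^3 = \frac{1}{n}\sum_{i=1}^{n} x_i^3 - \frac{6}{n}\sum_{i=1}^{n} x_i^2 + \frac{12}{n}\sum_{i=1}^{n} x_i - 8 = -40$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^3 - 6(24) + 12(3) - 8 = -40$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^3 = (6 \times 24) - (12 \times 3) + 8 - 40 = 76$$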
Examples (10)
First four moments of a distribution about the value 1.5 of a variable are 1, 17, 10 and 40. Calculate its coefficient of variation
and first four moments about origin.
Examples (11)
The first four moments of a distribution about x = 2 are 1, 2.5, 5.5 and 16. Calculate the first four moments about the mean
and about the origin.
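Solution
The moments about x = 2 are $m'_1 = 1$, $m'_2 = 2.5$, $m'_3 = 5.5$ and $m'_4 = 16$.
The moments about the mean are
$$m_1 = 0$$
$$m_2 = m'_2 - (m'_1)^2 = 2.5 - (1)^2 = 1.5$$
$$m_3 = m'_3 - 3 m'_2 m'_1 + 2 (m'_1)^3 = 5.5 - 3(2.5)(1) + 2(1)^3 = 0$$
$$m_4 = m'_4 - 4 m'_3 m'_1 + 6 m'_2 (m'_1)^2 - 3 (m'_1)^4 = 16 - 4(5.5)(1) + 6(2.5)(1)^2 - 3(1)^4 = 6$$
Moments about the origin are defined as:
$$\frac{\sum f_i x_i}{n}, \qquad \frac{\sum f_i x_i^2}{n}, \qquad \frac{\sum f_i x_i^3}{n}, \qquad \frac{\sum f_i x_i^4}{n}$$
We are given the moments about x = 2.
First moment about the origin:
$$m'_1 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2) = \frac{1}{n}\sum_{i=1}^{n} x_i - 2 = 1 \Rightarrow \frac{1}{n}\sum_{i=1}^{n} x_i = 2 + 1 = 3$$
Second moment about the origin:
$$m'_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i^2 + 4 - 4x_i) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{4}{n}\sum_{i=1}^{n} x_i + 4 = 2.5$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^2 - 4(3) + 4 = 2.5 \Rightarrow \frac{1}{n}\sum_{i=1}^{n} x_i^2 = 2.5 + 12 - 4 = 10.5$$
Third moment about the origin:
$$m'_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^3 = \frac{1}{n}\sum_{i=1}^{n} x_i^3 - \frac{6}{n}\sum_{i=1}^{n} x_i^2 + \frac{12}{n}\sum_{i=1}^{n} x_i - 8 = 5.5$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^3 - 6(10.5) + 12(3) - 8 = 5.5 \Rightarrow \frac{1}{n}\sum_{i=1}^{n} x_i^3 = 40.5$$
Fourth moment about the origin:
$$m'_4 = \frac{1}{n}\sum_{i=1}^{n}(x_i - 2)^4 = 16$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^4 - \frac{8}{n}\sum_{i=1}^{n} x_i^3 + \frac{24}{n}\sum_{i=1}^{n} x_i^2 - \frac{32}{n}\sum_{i=1}^{n} x_i + 16 = 16$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^4 - 8(40.5) + 24(10.5) - 32(3) + 16 = 16$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i^4 = 168$$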
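The conversions worked by hand in Examples (9) and (11) can be checked with a Python sketch. Instead of the explicit formulas for m2, m3 and m4, it uses the general binomial expansion of (x - b)^r = ((x - a) + (a - b))^r, which reproduces those formulas as special cases. The function names shift_moments and central_moments are illustrative.

```python
from math import comb


def shift_moments(moments_about_a, a, b):
    """Convert moments about x = a into moments about x = b.

    moments_about_a[r - 1] is the r-th moment about a.
    Uses (x - b)^r = sum_j C(r, j) (x - a)^j (a - b)^(r - j).
    """
    d = a - b
    m = [1.0] + list(moments_about_a)     # the zeroth moment is always 1
    return [sum(comb(r, j) * m[j] * d ** (r - j) for j in range(r + 1))
            for r in range(1, len(m))]


def central_moments(moments_about_a, a):
    """Return (mean, moments about the mean)."""
    mean = a + moments_about_a[0]         # mean = a + first moment about a
    return mean, shift_moments(moments_about_a, a, mean)


# Examples (9): moments about x = 2 are 1, 16, -40.
mean9, central9 = central_moments([1, 16, -40], a=2)
origin9 = shift_moments([1, 16, -40], a=2, b=0)

# Examples (11): moments about x = 2 are 1, 2.5, 5.5, 16.
mean11, central11 = central_moments([1, 2.5, 5.5, 16], a=2)
origin11 = shift_moments([1, 2.5, 5.5, 16], a=2, b=0)
print(mean9, central9, origin9)
print(mean11, central11, origin11)
```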
(2) Skewness
If a curve is symmetrical, the values below the mean and the values above the mean deviate from it in the same way. This is called
symmetry.
Skewness is the degree of asymmetry (departure from symmetry) of a distribution.
If the frequency curve of a distribution has a longer tail to the left of the central maximum than to the right, the distribution
is said to be skewed to the left or to have negative skewness. If the longer tail is to the right, the distribution is said to be skewed
to the right or to have positive skewness.
Moment coefficient of skewness = $b_1 = \frac{m_3}{m_2^{3/2}}$
Kurtosis
Kurtosis is the degree of peakedness of a distribution. A distribution having a relatively high peak is called Lepto-Kurtic, whereas a
distribution that is flat-topped is called Platy-Kurtic. A frequency curve which is neither very high-peaked nor very flat-topped is called
Meso-Kurtic; the normal curve is Meso-Kurtic.
Moment coefficient of kurtosis = $b_2 = \frac{m_4}{m_2^2}$
For a normal (Meso-Kurtic) distribution, b2 = 3; for Lepto-Kurtic, b2 > 3; and for Platy-Kurtic, b2 < 3.
Another measure of Kurtosis is:
Percentile coefficient of Kurtosis = $k = \frac{Q.D}{P_{90} - P_{10}}$
where Q.D = quartile deviation = $\frac{Q_3 - Q_1}{2}$
Examples (12)
The second moments about the mean of two distributions are 9 and 16, while the fourth moments about mean are 230 and
780 resp. Which of the distribution is (i) Lepto-Kurtic (ii) Meso-Kurtic and (iii) Platy-Kurtic.
Examples (13)
The 4th moment about the mean of a symmetrical distribution is 243. What should the value of the standard deviation be in order
that the distribution may be normal?
Solution
$m_4 = 243$; for the distribution to be normal, $b_2 = 3$.
Now $b_2 = \frac{m_4}{m_2^2}$, so
$$3 = \frac{243}{m_2^2} \Rightarrow m_2^2 = \frac{243}{3} = 81 \Rightarrow m_2 = 9$$
Variance = 9 (because variance = $m_2$), so the standard deviation is $\sigma = \sqrt{m_2} = \sqrt{9} = 3$.
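The kurtosis classification can be sketched in Python using b2 = m4 / m2^2. The function names are illustrative. The sketch applies the rule to the moments quoted in Examples (12) and repeats the calculation of Examples (13).

```python
import math


def kurtosis_coefficient(m2, m4):
    """Moment coefficient of kurtosis b2 = m4 / m2^2."""
    return m4 / m2 ** 2


def classify(b2):
    """Name the shape according to whether b2 equals, exceeds or is below 3."""
    if math.isclose(b2, 3.0, abs_tol=1e-9):
        return "Meso-Kurtic"
    return "Lepto-Kurtic" if b2 > 3 else "Platy-Kurtic"


# Examples (12): second moments 9 and 16, fourth moments 230 and 780.
b2_first = kurtosis_coefficient(9, 230)
b2_second = kurtosis_coefficient(16, 780)
print(round(b2_first, 3), classify(b2_first))
print(round(b2_second, 3), classify(b2_second))

# Examples (13): m4 = 243 and b2 = 3 give m2 = sqrt(243 / 3).
m2_normal = math.sqrt(243 / 3)
sd_normal = math.sqrt(m2_normal)
print(m2_normal, sd_normal)
```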
Instructions:
First of all, generate 100 data values using the formula =RANDBETWEEN(100, 500)
Enter the formula in cell A2 and drag it downward to A101, so that the data occupy the range A2:A101 used in the formulas below.
Construct class intervals like 300-310, 310-320, …
Construct the frequency of each class with a command such as:
=COUNTIF(A2:A101,">=300")-COUNTIF(A2:A101,">=310")
and the total frequency: =SUM(G3:G17)
mean =AVERAGE(A2:A101)
mode =MODE(A2:A101)
Q1 =QUARTILE(A2:A101,1)
Median = Q2 =MEDIAN(A2:A101)
D7 =PERCENTILE(A2:A101,0.7)
P75 =PERCENTILE(A2:A101,0.75)
Min. value =MIN(A2:A101)
Max. Value =MAX(A2:A101)
G.M =GEOMEAN(A2:A101)
H.M =HARMEAN(A2:A101)
Skewness =SKEW(A2:A101)
Kurtosis =KURT(A2:A101)
Variance =VAR(A2:A101)
S.D =STDEV(A2:A101)
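For readers without a spreadsheet, roughly the same steps can be sketched in Python with the standard library. The fixed seed, the variable names and the helper class_frequency are assumptions made for illustration. statistics.variance and statistics.stdev use the n - 1 divisor, and the "inclusive" quantile method is assumed here to correspond to the spreadsheet's QUARTILE interpolation. Skewness and kurtosis are omitted because the standard library has no direct equivalent.

```python
import random
import statistics

rng = random.Random(0)                               # fixed seed (assumption) for repeatable output
data = [rng.randint(100, 500) for _ in range(100)]   # like =RANDBETWEEN(100, 500)


def class_frequency(values, lower, upper):
    """COUNTIF(">=lower") - COUNTIF(">=upper"): count of lower <= x < upper."""
    return sum(v >= lower for v in values) - sum(v >= upper for v in values)


# Frequencies for classes 100-110, 110-120, ..., 500-510.
frequencies = {(lo, lo + 10): class_frequency(data, lo, lo + 10)
               for lo in range(100, 510, 10)}

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "modes": statistics.multimode(data),
    "quartiles": statistics.quantiles(data, n=4, method="inclusive"),
    "min": min(data),
    "max": max(data),
    "G.M": statistics.geometric_mean(data),
    "H.M": statistics.harmonic_mean(data),
    "variance": statistics.variance(data),   # n - 1 divisor
    "S.D": statistics.stdev(data),           # n - 1 divisor
}
for name, value in summary.items():
    print(name, value)
```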