# Probability and Statistics: First Week

Text Book: E. Kreyszig, Advanced Engineering Mathematics, John Wiley & Sons, Inc., 9th edition, 2006.

Reference Books:
M.R. Spiegel, J. Schiller and R.A. Srinivasan, Schaum's Easy Outlines of Probability and Statistics, McGraw–Hill, 2001.
R.E. Walpole, R.H. Myers, S.L. Myers and K. Ye, Probability & Statistics for Engineers & Scientists, Pearson Education, Inc., 8th edition, 2007.

Outlines for week one:
Graphical Representation of Data: Stem–and–Leaf Plot, Histogram, Boxplot
Arithmetic Mean
Standard Deviation
Variance

1 Introduction to Statistics

Statistics is a word with a variety of meanings. First, the word statistics refers to numerical facts systematically arranged. For instance, we have statistics of prices, statistics of road accidents, statistics of crimes, statistics of births, etc. In all these examples, the word statistics denotes a set of numerical data in the respective fields. These all come under the heading of descriptive statistics, in which items are counted or measured and the results are combined in various ways to give useful results. That type of statistics certainly has its use in engineering.

Secondly, the word statistics is defined as a discipline that includes procedures and techniques used to collect, process and analyze numerical data to make inferences and to reach decisions in the face of uncertainty. It should be kept in mind that uncertainty does not imply ignorance; it refers to the incompleteness and instability of the available data.

Thirdly, the word statistics denotes numerical quantities calculated from sample observations; a single quantity that has been so calculated is called a statistic. The mean of a sample, for instance, is a statistic. The word statistics is plural when used in this sense.

Another type of statistics will engage our attention to a much greater extent, that is, inferential statistics or statistical inference. For example, it is often not practical to measure all the items produced by a process. Instead, we take a sample and measure the quantities of interest on that sample. We then infer something about all the items from our knowledge of the sample.

2 Data Representation

Data can be represented numerically or graphically in various ways. For example, our daily newspaper may contain tables of stock prices and money exchange rates, curves or bar charts illustrating economical or political developments, or pie charts showing how your tax dollar is spent. And there are numerous other representations of data for special purposes.

Today we discuss the use of standard representations of data in statistics. (For these, software packages, Maple or Mathematica may be helpful.) We explain corresponding concepts and methods with the help of examples. Consider the following data:

89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89.    (1)

These are n = 14 measurements of the tensile strength of steel sheet in kg/mm², recorded in the order obtained and rounded to integer values. To see what is going on, we sort these data, that is, we order them by size:

78, 81, 83, 84, 86, 87, 87, 89, 89, 89, 89, 90, 91, 99.    (2)

2.1 Graphic Representation of Data

We shall now discuss standard graphic representations used in statistics for obtaining information on properties of data.

Stem–and–Leaf Plot

In 1977, John Tukey introduced a technique known as the stem–and–leaf plot. This technique offers a quick and novel way of simultaneously sorting and displaying data sets, where each number in the data set is divided into two parts, a stem and a leaf. A stem is the leading digit(s) of each number and is used in sorting, while a leaf is the rest of the number, that is, the trailing digit(s).

For (1) the plot is shown in Fig. 1. The numbers in (1) range from 78 to 99; see (2). We divide these numbers into 5 groups, 75–79, 80–84, 85–89, 90–94, 95–99. The integers in the tens position of the groups are 7, 8, 8, 9, 9. These form the stem in Fig. 1. The first leaf is 8 (representing 78). The second leaf is 134 (representing 81, 83, 84), and so on.

The number of times a value occurs is called its absolute frequency. Thus 78 has absolute frequency 1, the value 89 has absolute frequency 4, etc. The column to the extreme left in Fig. 1 shows the cumulative absolute frequencies, that is, the sum of the absolute frequencies of the values up to the line of the leaf. Thus, the number 4 in the second line on the left shows that (1) has 4 values up to and including 84. The number 11 in the next line shows that there are 11 values not exceeding 89, etc. Dividing the cumulative absolute frequencies by n (= 14 in Fig. 1) gives the cumulative relative frequencies.
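The construction of a stem–and–leaf plot can be sketched in a few lines of Python. This is only an illustration for two-digit integer data such as (1): the function name `stem_and_leaf` and the parameter `width` (the size of each group, 5 here) are assumptions made for this sketch, not part of the text book.

```python
from collections import defaultdict

# Data set (1): tensile strength of steel sheet in kg/mm^2.
data = [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]


def stem_and_leaf(values, width=5):
    """Return rows (cumulative frequency, stem, leaves) for integer data.

    Each row covers `width` consecutive integers (75-79, 80-84, ...).
    The stem is the tens digit of the group, each leaf is a units digit.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // width].append(v)

    rows, cumulative = [], 0
    for key in range(min(groups), max(groups) + 1):
        members = groups.get(key, [])      # empty groups give an empty leaf
        cumulative += len(members)         # cumulative absolute frequency
        stem = (key * width) // 10
        leaves = "".join(str(v % 10) for v in members)
        rows.append((cumulative, stem, leaves))
    return rows


rows = stem_and_leaf(data)
for cumulative, stem, leaves in rows:
    print(f"{cumulative:>3} | {stem} | {leaves}")
```

For the data (1) the rows carry the cumulative absolute frequencies 1, 4, 11, 13, 14 and the leaves 8, 134, 6779999, 01, 9, in agreement with the description of Fig. 1.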

Histogram

A histogram is a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies.

[Figure: histogram of a small example set of values]

For large sets of data, histograms are better in displaying the distribution of data than stem–and–leaf plots. The principle is explained in Fig. 2. The bases of the rectangles in Fig. 2 are the x–intervals (known as class intervals) 74.5–79.5, 79.5–84.5, 84.5–89.5, 89.5–94.5, 94.5–99.5, whose midpoints (known as class marks) are x = 77, 82, 87, 92, 97, respectively. The height of a rectangle with class mark x is the relative class frequency $f_{\mathrm{rel}}(x)$, defined as the number of data values in that class interval, divided by n (= 14 in our case). Hence the areas of the rectangles are proportional to these relative frequencies, so that histograms give a good impression of the distribution of data.

Center and Spread of Data: Median, Quartiles

As a center of the location of data values we can simply take the median, the data value that falls in the middle when the values are ordered. In (2) we have 14 values. The seventh of them is 87, the eighth is 89, and we split the difference, obtaining the median 88. (In general, we would get a fraction.)

The variability of the data values can be measured by the range $R = x_{\max} - x_{\min}$, the largest minus the smallest data values, R = 99 − 78 = 21 in (2).

Better information gives the interquartile range $\mathrm{IQR} = q_U - q_L$. Here the upper quartile $q_U$ is the middle value among the data values above the median. The lower quartile $q_L$ is the middle value among the data values below the median. Thus in (2) we have $q_U = 89$ (the fourth value from the end), $q_L = 84$ (the fourth value from the beginning), and IQR = 89 − 84 = 5. The median is also called the middle quartile and is denoted by $q_M$. The rule of "splitting the difference" (just applied to the middle quartile) is equally well used for the other quartiles if necessary.

Boxplot

The boxplot of (1) in Fig. 3 is obtained from the five numbers $x_{\min}$, $q_L$, $q_M$, $q_U$, $x_{\max}$ just determined. The box extends from $q_L$ to $q_U$. Hence it has the height IQR. The position of the median in the box shows that the data distribution is not symmetric. The two lines extend from the box to $x_{\min}$ below and to $x_{\max}$ above. Hence they mark the range R.
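The relative class frequencies and the five numbers behind the boxplot can be checked with a short Python sketch. The helper names (`relative_class_frequencies`, `middle`, `five_number_summary`) are illustrative assumptions. The quartile rule coded here is one reading of the definition above (middle value of the data below and above the median position); other texts and software packages use slightly different quartile conventions, so results may differ for other data sets.

```python
from fractions import Fraction

# Data set (1) from the text.
data = [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]


def relative_class_frequencies(values, edges):
    """Relative frequency of each class interval [edges[i], edges[i+1])."""
    n = len(values)
    counts = [
        sum(1 for v in values if lo <= v < hi)
        for lo, hi in zip(edges, edges[1:])
    ]
    return [Fraction(c, n) for c in counts]


def middle(sorted_values):
    """Middle value of a sorted list, splitting the difference if needed."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (x_min, q_L, q_M, q_U, x_max).

    Assumption: for odd n the median itself is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return s[0], middle(lower), middle(s), middle(upper), s[-1]


freqs = relative_class_frequencies(data, [74.5, 79.5, 84.5, 89.5, 94.5, 99.5])
x_min, q_l, q_m, q_u, x_max = five_number_summary(data)
print([str(f) for f in freqs])          # class frequencies 1, 3, 7, 2, 1 over 14
print(x_min, q_l, q_m, q_u, x_max)      # 78 84 88.0 89 99
print("R =", x_max - x_min, "IQR =", q_u - q_l)
```

For the data (1) this reproduces the values found by counting: median 88, quartiles 84 and 89, range 21 and interquartile range 5.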

Boxplots are particularly suitable for making comparisons. For example, Fig. 3 shows boxplots of the data sets (1) and

91, 89, 93, 91, 87, 94, 92, 85, 91, 90, 96, 93, 89    (3)

(consisting of n = 13 values). Ordering gives

85, 87, 89, 89, 90, 91, 91, 91, 92, 93, 93, 94, 96    (4)

(tensile strength, as before). From the plot we immediately see that the box of (3) is shorter than the box of (1) (indicating the higher quality of the steel sheets!) and that $q_M$ is located in the middle of the box (showing the more symmetric form of the distribution). Finally, $x_{\max}$ is closer to $q_U$ for (3) than it is for (1), a fact that we shall discuss later.

For plotting the box of (3) we took from (4) the values $x_{\min} = 85$, $q_L = 89$, $q_M = 91$, $q_U = 93$, $x_{\max} = 96$.

3 Mean, Standard Deviation, Variance

Medians and quartiles are easily obtained by ordering and counting, practically without calculation. But they do not give full information on data: you can change data values to some extent without changing the median. Similarly for the quartiles.

The average size of the data values can be measured in a more refined way by the mean

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n). \qquad (5)$$

This is the arithmetic mean of the data values, obtained by taking their sum and dividing by the data size n. Thus in (1),

$$\bar{x} = \frac{1}{14}\,(89 + 84 + \cdots + 89) = \frac{611}{7} \approx 87.3. \qquad (6)$$

Every data value contributes, and changing one of them will change the mean.

The variability of the data values can be measured in a more refined way by the standard deviation s or by its square, the variance

$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar{x})^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]. \qquad (7)$$

Thus, to obtain the variance of the data, take the difference $x_j - \bar{x}$ of each data value from the mean, square it, take the sum of these n squares, and divide it by n − 1. To get the standard deviation s, take the square root of $s^2$.
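As a numerical cross-check of the definitions (5) and (7), the following Python sketch evaluates them for the data (1) with exact fractions. The function names `mean` and `sample_variance` are chosen for this illustration only.

```python
from fractions import Fraction
from math import sqrt

# Data set (1) from the text.
data = [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]


def mean(values):
    """Arithmetic mean (5): sum of the values divided by n."""
    return Fraction(sum(values), len(values))


def sample_variance(values):
    """Variance (7): sum of squared deviations from the mean over n - 1."""
    x_bar = mean(values)
    squares = sum((Fraction(v) - x_bar) ** 2 for v in values)
    return squares / (len(values) - 1)


x_bar = mean(data)
s2 = sample_variance(data)
s = sqrt(s2)                      # standard deviation as a float

print(x_bar, float(x_bar))        # 611/7, about 87.3
print(s2, float(s2))              # 176/7, about 25.14
print(s)                          # about 5.014
```

Working with `Fraction` keeps the intermediate results exact, so the mean 611/7 and the variance 176/7 appear as fractions before they are rounded.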

For example, using $\bar{x} = 611/7$, we get for the data (1) the variance

$$s^2 = \frac{1}{13}\left[\left(89 - \frac{611}{7}\right)^2 + \left(84 - \frac{611}{7}\right)^2 + \cdots + \left(89 - \frac{611}{7}\right)^2\right] = \frac{176}{7} \approx 25.14. \qquad (8)$$

Hence the standard deviation is $s = \sqrt{176/7} \approx 5.014$. Note that the standard deviation has the same dimension as the data values (kg/mm², see at the beginning), which is an advantage. On the other hand, the variance is preferable to the standard deviation in developing statistical methods.

Problem Set 24.1

Represent the data by a stem–and–leaf plot, a histogram, and a boxplot:

Question 3.1. 20 21 20 19 20 19 21 19

Solution. Sorting the data we have 19, 19, 19, 20, 20, 20, 21, 21. Hence $q_L = 19$, $q_M = 20$ and $q_U = 20.5$.

Question 3.2. 7 6 4 0 7 1 2 4 6 6

Solution. Sorting the data we have 0, 1, 2, 4, 4, 6, 6, 6, 7, 7. Hence $q_L = 2$, $q_M = 5$ and $q_U = 6$.

Question 3.3. 56 58 54 33 41 30 44 37 51 46 56 38 38 49 39

Solution. Sorting the data we have 30, 33, 37, 38, 38, 39, 41, 44, 46, 49, 51, 54, 56, 56, 58. Hence $q_L = 38$, $q_M = 44$ and $q_U = 52.5$. (Here n = 15 is odd and the median is counted in both halves, so that $q_U$ is the mean of 51 and 54.)

Question 3.4. [data list illegible in the source]

Solution. [sorted list and quartiles illegible in the source, except for $q_M = 11.55$]

Question 3.5. [data list illegible in the source]

Solution. [sorted list and quartiles illegible in the source]

Question 3.6. -0.52 0.11 -0.48 0.94 0.24 -0.19 -0.55

Solution. Sorting the data we have −0.55, −0.52, −0.48, −0.19, 0.11, 0.24, 0.94. Hence $q_L = -0.5$, $q_M = -0.19$ and $q_U = 0.175$ (the median being counted in both halves).

Question 3.7. Reaction time [sec] of an automatic switch [data list illegible in the source]

Solution. [sorted list and quartiles illegible in the source]

Question 3.8. Carbon content [%] of coal
89 90 89 84 80 88 90 89 88 90 85 87 86 82 85 76 89 87 86 86

Solution. Sorting the data we have 76, 80, 82, 84, 85, 85, 86, 86, 86, 87, 87, 88, 88, 89, 89, 89, 89, 90, 90, 90. Hence $q_L = 85$, $q_M = 87$ and $q_U = 89$.

Question 3.9. Weight of filled bottles [g] in an automatic filling process
403 399 398 401 400 401 401

Solution. Sorting the data we have 398, 399, 400, 401, 401, 401, 403. Hence $q_L = 399.5$, $q_M = 401$ and $q_U = 401$.

Question 3.10. Gasoline consumption [gallons per mile] of six cars of the same model
14.0 14.5 13.5 14.0 14.5 14.0

Solution. Sorting the data we have 13.5, 14.0, 14.0, 14.0, 14.5, 14.5. Hence $q_L = 14.0$, $q_M = 14.0$ and $q_U = 14.5$.

Mean and Standard Deviation. Find the mean and compare it with the median. Find the standard deviation and compare it with the interquartile range.

Question 3.11. 20 21 20 19 20 19 21 19

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{159}{8} = 19.8750, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 0.6964.$$

Question 3.12. 7 6 4 0 7 1 2 4 6 6

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{43}{10} = 4.3, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 6.4556.$$

Question 3.13. [data list of n = 15 values illegible in the source]

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{1057.3}{15} = 70.4867, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 1.09552.$$

Question 3.14. -0.52 0.11 -0.48 0.94 0.24 -0.19 -0.55

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{-0.45}{7} = -0.0643, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 0.2940.$$

Question 3.15. Weight of filled bottles [g] in an automatic filling process
403 399 398 401 400 401 401

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{2803}{7} = 400.4286, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 2.6190.$$

Question 3.16. 5 22 7 23 6. Why is $|\bar{x} - q_M|$ so large?

Solution.

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} = \frac{63}{5} = 12.6, \qquad s^2 = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} = 82.3.$$

Also, here $q_M = 7$, which implies that $|\bar{x} - q_M| = |12.6 - 7| = 5.6$.

Question 3.17. Construct the simplest possible data with $\bar{x} = 100$ but $q_M = 0$.

Question 3.18. Prove that $\bar{x}$ must always lie between the smallest and the largest data values.

Solution. Let $x_{\text{small}} = \min\{x_j : j = 1, \cdots, n\}$ and $x_{\text{large}} = \max\{x_j : j = 1, \cdots, n\}$. Now,

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} \le \frac{\sum_{j=1}^{n} x_{\text{large}}}{n} = \frac{n\, x_{\text{large}}}{n} = x_{\text{large}}.$$

Also,

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n} \ge \frac{\sum_{j=1}^{n} x_{\text{small}}}{n} = \frac{n\, x_{\text{small}}}{n} = x_{\text{small}}.$$

Consequently, $x_{\text{small}} \le \bar{x} \le x_{\text{large}}$.

Question 3.19. Writing Project. Average and Spread. Compare $q_M$, IQR and $\bar{x}$, s, illustrating the advantages and disadvantages with examples of your own.