1.1 Introduction
For us to have an understanding of what the subject of statistics is all about, we need to
introduce some terminology. First, we will explain what we mean by the subject of statistics.
Definition 1.1
Statistics is a collection of methods for planning experiments, obtaining numerical
facts, which we call data, and then organizing, summarizing, presenting, analyzing,
interpreting and drawing conclusions based on the data.
The subject of statistics is divided into two broad areas which are descriptive
statistics and inferential statistics. These classifications are shown in Fig. 1.1
[Fig. 1.1: Statistics divides into two branches.]
Descriptive statistics includes: collecting, organizing, summarizing, and presenting data.
Inferential statistics includes: making inferences, hypothesis testing, determining
relationships, and making predictions.
-ments, and so on) to be studied. The collection is complete in the sense that it
includes all subjects to be studied.
A census is the collection of data from every element in a population.
A sample is a sub-collection of elements drawn from a population (a subset of the
population).
Definition 1.3
A parameter is a numerical measurement describing some characteristic of a
population.
A statistic is a numerical measurement describing some characteristic of a sample.
[Diagram: a population is described by parameters; a sample is described by statistics.]
Some data sets consist of numbers (such as weights) and others are non-numerical (such as
eye colors). The terms qualitative data and quantitative data are often used to distinguish
between these types.
Definition 1.4
Quantitative data consist of numbers representing counts or measurements.
Qualitative (or categorical or attribute) data can be separated into distinct
categories that are distinguished by some nonnumeric characteristics.
For example, the incomes of college graduates are quantitative data, while the genders
(male/female) of college graduates are qualitative data.
The quantitative data can be described by distinguishing between discrete and continuous
types.
Discrete data can take only specific values (the number of possible values is either
a finite number or a countable number), such as 0, 1, 2, 3, . . .
Continuous (numerical) data can take all values in a given range. They result from
infinitely many possible values that correspond to some continuous scale that covers
a range of values without gaps, interruptions, or jumps. For example, the amount of
milk that a cow produces could be 2.3415 gallons a day.
TABLE 1.1. Body Mass Index for a Sample of 80 Adults
27.4 23.5 21.9 28.6 20.3 22.4 20.8 25.0
31.0 30.9 30.2 27.3 24.3 35.9 26.5 35.8
34.2 27.4 24.7 22.7 26.5 30.0 28.2 25.4
28.9 25.9 36.6 22.7 30.1 26.2 18.3 27.3
25.7 22.3 25.4 27.3 30.4 27.4 30.8 23.0
37.1 26.3 21.3 23.1 24.5 24.1 27.6 25.7
24.8 37.8 22.9 22.3 22.8 19.8 21.5 26.3
34.9 28.8 24.2 32.6 24.3 26.9 33.6 35.5
27.5 31.8 27.1 29.5 30.9 23.3 24.8 29.8
25.9 23.4 23.1 38.8 28.7 28.4 28.3 26.4
In constructing a frequency table for grouped data, we first determine a set of class
intervals that cover the range of the data (i.e., include all the observed values). The class
intervals are usually arranged from lowest numbers at the top of the table to highest
numbers at the bottom of the table and are defined so as not to overlap. We then tally the
number of observations that fall in each interval and present that number as a frequency,
called a class frequency. Some frequency tables include a column that represents the
frequency as a percentage of the total number of observations; this column is called the
relative frequency percentage. The completed frequency table provides a frequency
distribution.
Although not required, a good first step in constructing a frequency table is to rearrange
the data table by sorting the data in ascending order: place the smallest number in the
first row of the leftmost column and continue in increasing order going down that column.
After the first column is completed, the procedure is continued starting in the first row
of the second column, and so on, until the largest observation appears in the bottom row
of the rightmost column. We call the arranged table
an ordered array. It is much easier to tally the observations for a frequency table from such
an ordered array of data than it is from the original data table. Table 1.2 provides a
rearrangement of the body mass index data
as an ordered array. In Table 1.2, by inspection we find that the lowest and highest values
are 18.3 and 38.8, respectively. We will use these numbers to help us create equally
spaced intervals for tabulating frequencies of data. Although the number of intervals that
one may choose for a frequency distribution is arbitrary, the actual number should depend
on the range of the data (the range R is the difference between the largest and the smallest
observation in the data set) and the number of cases (number of observations = N). For a
data set of 50 to 150 observations, the number chosen usually ranges from about five to ten.
In the present example, the range of the data is 38.8 – 18.3 = 20.5.
TABLE 1.2. Body Mass Index Data for a Sample of 80 Adults:
Ordered Array (Sorted in Ascending Order)
18.3 22.7 24.1 25.4 26.5 27.6 30.0 33.6
19.8 22.7 24.2 25.7 26.9 28.2 30.1 34.2
20.3 22.8 24.3 25.7 27.1 28.3 30.2 34.9
20.8 22.9 24.3 25.9 27.3 28.4 30.4 35.5
21.3 23.0 24.5 25.9 27.3 28.6 30.8 35.8
21.5 23.1 24.7 26.2 27.3 28.7 30.9 35.9
21.9 23.1 24.8 26.3 27.4 28.8 30.9 36.6
22.3 23.3 24.8 26.3 27.4 28.9 31.0 37.1
22.3 23.4 25.0 26.4 27.4 29.5 31.8 37.8
22.4 23.5 25.4 26.5 27.5 29.8 32.6 38.8
Suppose we divide the data set into seven intervals. Then, we have 20.5 ÷ 7 = 2.93, which
rounds to 3.0. Consequently, the intervals will have a width of three. These seven intervals
are as follows:
1. 18.0 – 20.9
2. 21.0 – 23.9
3. 24.0 – 26.9
4. 27.0 – 29.9
5. 30.0 – 32.9
6. 33.0 – 35.9
7. 36.0 – 38.9
We have:
The total number of observations = N = 80
The number of classes = K = 7
The size of each class (width) = C = W = 3
Relative frequency = Frequency / N
TABLE 1.1. Body Mass Index (BMI)
Class Interval     Frequency   Relative    Relative
for BMI Levels     (f)         Frequency   Frequency (%)
18.0–20.9          4           0.050       5.0
21.0–23.9          16          0.200       20.0
24.0–26.9          22          0.275       27.5
27.0–29.9          18          0.225       22.5
30.0–32.9          10          0.125       12.5
33.0–35.9          6           0.075       7.5
36.0–38.9          4           0.050       5.0
Total              80          1.000       100.0
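The tallying procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frequency_table` and its parameters are assumptions introduced here, the data are the 80 values of the raw BMI table, and the classes are the seven intervals of width 3 starting at 18.0.

```python
# BMI values copied from the raw data table (sample of 80 adults).
bmi = [
    27.4, 23.5, 21.9, 28.6, 20.3, 22.4, 20.8, 25.0,
    31.0, 30.9, 30.2, 27.3, 24.3, 35.9, 26.5, 35.8,
    34.2, 27.4, 24.7, 22.7, 26.5, 30.0, 28.2, 25.4,
    28.9, 25.9, 36.6, 22.7, 30.1, 26.2, 18.3, 27.3,
    25.7, 22.3, 25.4, 27.3, 30.4, 27.4, 30.8, 23.0,
    37.1, 26.3, 21.3, 23.1, 24.5, 24.1, 27.6, 25.7,
    24.8, 37.8, 22.9, 22.3, 22.8, 19.8, 21.5, 26.3,
    34.9, 28.8, 24.2, 32.6, 24.3, 26.9, 33.6, 35.5,
    27.5, 31.8, 27.1, 29.5, 30.9, 23.3, 24.8, 29.8,
    25.9, 23.4, 23.1, 38.8, 28.7, 28.4, 28.3, 26.4,
]


def frequency_table(values, lower=18.0, width=3.0, k=7):
    """Tally values into k equal-width classes starting at `lower`.

    Hypothetical helper: assumes data recorded to one decimal place and
    every value lying inside the k classes.
    """
    counts = [0] * k
    for v in values:
        # Work in tenths so that boundary values such as 30.0 are
        # classified without floating-point surprises.
        index = (round(v * 10) - round(lower * 10)) // round(width * 10)
        counts[index] += 1
    n = len(values)
    rows = []
    for i, f in enumerate(counts):
        lo = lower + i * width
        hi = lo + width - 0.1
        rows.append((f"{lo:.1f}-{hi:.1f}", f, f / n))  # relative frequency = f / N
    return rows


table = frequency_table(bmi)
for label, f, rel in table:
    print(f"{label:>10}  {f:>3}  {rel:.3f}  {100 * rel:.1f}%")
```

Working in tenths is a design choice made for this sketch; with data recorded to more decimal places the class boundaries would have to be handled differently.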
[Figure: histogram of the BMI frequency distribution. Horizontal axis: BMI (18 to 39); vertical axis: Frequency.]
In constructing a histogram, the values of the variable under consideration make up the
horizontal axis, while the vertical axis has as its scale the frequency (or relative frequency if
desired) of occurrence. Above each class interval on the horizontal axis a rectangular bar,
or cell as it is sometimes called, is erected so that its height corresponds to the respective
frequency. The cells of a histogram must be joined and, to accomplish this, we must take
into account the true limits of the class intervals to prevent gaps from occurring between the
cells of our graph. We take the true limits for each of the class intervals to be as shown in
Table 1.2.
If we draw a graph using these class limits as the base of our rectangles, no gaps will result,
and we will have the histogram shown in Fig 1.1.
[Figure: histogram of the BMI data drawn with the true class limits. Vertical axis: Frequency.]
1.5 Statistical Measures of Data
1.5.1 Measures of Central Tendency
Measures of central tendency are numbers that tell us where the majority of values
in the distribution (or sample) are located. This section will cover the following measures
of central tendency: arithmetic mean, median, and mode. These measures also are called
measures of location. In contrast to measures of central tendency, measures of dispersion
inform us about the spread of values in a distribution. Section 1.5.2 will present measures
of dispersion.
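\[
\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k}
        = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}
        = \frac{1}{N} \sum_{i=1}^{k} x_i f_i \qquad (1.2)
\]
where $N = \sum_{i=1}^{k} f_i$ is the sum of all frequencies, i.e., the total number of observations in the data.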
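Example 1.1
For each boy in a sample of 50 boys, the time, correct to the nearest second, for which
he could hold his breath is recorded. The results are represented in the following table:
Value                18 – 28   29 – 39   40 – 50   51 – 61
Number of Children   5         10        25        10
Taking the midpoint of each class as x_i, the computation is arranged as follows:
Class     Midpoint (x_i)   Frequency (f_i)   f_i x_i
18 - 28   23               5                 115
29 - 39   34               10                340
40 - 50   45               25                1125
51 - 61   56               10                560
Total                      50                2140
Hence, by (1.2), the mean is 2140 / 50 = 42.8 seconds.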
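The grouped-data mean of formula (1.2) can be checked with a short Python sketch. The function name `grouped_mean` is an assumption made for illustration; as in the example, each class is represented by its midpoint.

```python
# Class limits and frequencies from Example 1.1.
classes = [(18, 28), (29, 39), (40, 50), (51, 61)]
freqs = [5, 10, 25, 10]


def grouped_mean(classes, freqs):
    """Mean of grouped data, using class midpoints as the x_i in (1.2)."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    total = sum(f * x for f, x in zip(freqs, midpoints))  # sum of f_i * x_i
    return total / sum(freqs)                             # divide by N


mean_time = grouped_mean(classes, freqs)
print(mean_time)
```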
1.5.1.2 Median
The median of a data set is the value of the observation that divides the ordered
dataset in half. Essentially, the median is the observation whose value defines the midpoint
of a distribution; i.e., half of the data fall above the median and half below.
Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the median is defined as follows:
The sample median is
(1) The $\left(\frac{n+1}{2}\right)$th observation if n is odd.
(2) The average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations if n is even.
Example 1.2
Each of 10 children in the second grade was given a reading aptitude test. The scores
were as follows: 95 86 78 90 62 73 89 92 84 76. Determine the median.
Solution
First we must arrange the scores in order of magnitude.
62 73 76 78 84 86 89 90 92 95
Because there are an even number of measurements, the median is the average of the two
midpoint scores.
\[ \text{Median} = \frac{84 + 86}{2} = 85 \]
The median tells us that half of the observations are less than 85 and half of the
observations are greater than 85.
Example 1.3
Consider the following data set, 7 35 5 9 8 3 10 12 8, which consists of
white-blood counts taken on admission of all patients entering one of the Alexandria
University hospitals on a given day. Compute the median white-blood count.
Solution
First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35. Because n = 9 is odd, the sample
median is given by the fifth ordered observation, which equals 8, or 8000 on the original scale.
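The definition of the sample median translates directly into code. The following is a minimal sketch (the function name `sample_median` is an assumption), applied to the data of Examples 1.2 and 1.3.

```python
def sample_median(values):
    """Median by the odd/even rule given in the definition above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th ordered observation (index mid, 0-based)
        return ordered[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th ordered observations
    return (ordered[mid - 1] + ordered[mid]) / 2


reading_scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]   # Example 1.2
white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]          # Example 1.3

print(sample_median(reading_scores))
print(sample_median(white_blood_counts))
```

Note that the extreme value 35 in Example 1.3 has no effect on the median.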
1.5.1.3 Mode
The mode is defined as the observation in the sample which occurs most
frequently if there is such an observation.
If each observation occurs the same number of times, then there is no mode.
If two or more observations occur the same number of times (and more frequently
than any of the other observations), then there is more than one mode, and the sample
is said to be multimodal.
If there is only one mode the sample is said to be unimodal.
For example, if the sample is 24, 29, 26, 31, 29, 34, 25, 29 then the mode is 29.
For the sample 46, 47, 47, 43, 48, 45, 43, 49 there are two modes, namely 43 and 47
(bimodal).
For the sample 14, 16, 21, 19, 18, 24, 17 there is no mode.
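The three cases above (one mode, several modes, no mode) can be sketched as follows. The function name `sample_modes` is an assumption, and so is the convention of returning an empty list when every distinct observation occurs equally often.

```python
from collections import Counter


def sample_modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values are equally frequent
    # (and there is more than one distinct value), there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(sample_modes([24, 29, 26, 31, 29, 34, 25, 29]))   # unimodal sample
print(sample_modes([46, 47, 47, 43, 48, 45, 43, 49]))   # bimodal sample
print(sample_modes([14, 16, 21, 19, 18, 24, 17]))       # no mode
```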
72 74 75 77 78 79 82 85 86 90 93 94
For n=12, the median position is (12+1)/2=6.5, so that the median is (79+82)/2 = 80.5. The
first quartile Q1 is the median of the first 6 values, the third quartile Q3 is the median of the 6
values above the median. Thus, Q1 = (75+77)/2=76 and Q3 = (86+90)/2=88.
However, if n is odd, it is more accurate to modify the definition by replacing "less
than" by "less than or equal " and "greater than" by "greater than or equal ". As an example,
the median of the following observations 66 73 74 79 82 86 88 90 94 is
82, while Q1 = 74 and Q3 = 88.
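The quartile rule used in these two examples (median of each half, with the median itself counted in both halves when n is odd) can be sketched as below. The function names are assumptions; note also that statistical software often uses other interpolation rules, so its quartiles may differ slightly from the ones computed here.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule in the text."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0:
        lower, upper = ordered[:half], ordered[half:]
    else:
        # n odd: "less than or equal" / "greater than or equal",
        # so the median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


even_sample = [72, 74, 75, 77, 78, 79, 82, 85, 86, 90, 93, 94]
odd_sample = [66, 73, 74, 79, 82, 86, 88, 90, 94]

print(quartiles(even_sample))
print(quartiles(odd_sample))
```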
1.5.2.1 Range
The range of a set of observations is the difference between the largest and smallest
numbers in the set. For example, the ranges of the above two groups are, respectively,
168 - 162 = 6 cm and 180 - 150 = 30 cm.
The range is a poor measure of variation, particularly, if the size of the sample is
large. The objection to the range is that it does not make use of all the observations in the
sample but uses only two extreme values.
Whereas the range covers all the values in a sample, a similar measure of variation
covers (more or less) the middle 50%. It is the interquartile range Q3 - Q1.
measures of dispersion. It is based upon squared deviations from the mean of a set of values.
So the variance of a sample of n observations $x_1, x_2, \ldots, x_n$ (denoted by $S^2$) is defined as the
mean square deviation and is given by
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (1.4) \]
The sum of squares is divided by (n-1) rather than n for theoretical reasons, but if n > 35
there is, practically, no difference in the definitions.
Example 1.5
As an illustration for computing the variance, consider the sample 5, 7, 8, 12 and 18.
The mean of this sample is given by
\[ \bar{x} = \frac{1}{5}(5 + 7 + 8 + 12 + 18) = 10 \]
Hence
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
    = \frac{1}{4}\left[ (5-10)^2 + (7-10)^2 + (8-10)^2 + (12-10)^2 + (18-10)^2 \right]
    = \frac{106}{4} = 26.5
\]
If the number of observations is large, the computation necessary to find $S^2$ from
formula (1.4) is rather laborious, especially if the mean is not an integer. There is
another formula for computing $S^2$ which is equivalent to (1.4):
\[ S^2 = \frac{1}{n-1}\left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right] \qquad (1.5) \]
The standard deviation (denoted by S.D.) is defined as the square root of the variance, i.e.
\[ \text{S.D.} = S = \sqrt{\text{variance}} \]
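As a check that the defining formula (1.4) and the computational formula (1.5) agree, here is a small Python sketch applied to the sample of Example 1.5. The function names are assumptions introduced for illustration.

```python
import math


def variance_definition(xs):
    """Sample variance from (1.4): squared deviations from the mean over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_computational(xs):
    """Sample variance from (1.5): (sum of x_i^2 - n * mean^2) over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


sample = [5, 7, 8, 12, 18]
s2_def = variance_definition(sample)
s2_comp = variance_computational(sample)
sd = math.sqrt(s2_def)          # standard deviation S

print(s2_def, s2_comp, round(sd, 3))
```

In floating-point arithmetic the computational form can lose accuracy when the mean is large relative to the spread, so the defining form is usually the safer choice in code.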
Now, for the grouped data in a frequency distribution, as in the case of the mean, a
similar formula applies. If $x_1, x_2, \ldots, x_k$ occur with frequencies $f_1, f_2, \ldots, f_k$ respectively,
the variance can be written as
\[ S^2 = \frac{1}{N-1}\left[ \sum_{i=1}^{k} x_i^2 f_i - N\bar{x}^2 \right] \qquad (1.6) \]
where $N = \sum_{i=1}^{k} f_i$.
Let us now illustrate the computation of the variance and standard deviation by the
computational formula (1.6), using the data of table 1.6. The variance is then given by
\[ S^2 = \frac{1}{50-1}\left[ 11844 - 50\,(15.0)^2 \right] = 12.12 \]
The standard deviation is
\[ S = \sqrt{12.12} = 3.48 \]
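The grouped-data variance can be sketched in Python as follows. Because table 1.6 is not reproduced here, the sketch uses the grouped data of Example 1.1 as an illustrative input (an assumption made for this example only), and then separately re-checks the arithmetic above from the summary values N = 50, sum of x_i^2 f_i = 11844 and mean 15.0. The function name `grouped_variance` is hypothetical.

```python
import math


def grouped_variance(midpoints, freqs):
    """Variance of grouped data: (sum of x_i^2 f_i - N * mean^2) / (N - 1)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    sum_sq = sum(f * x * x for x, f in zip(midpoints, freqs))
    return (sum_sq - n * mean ** 2) / (n - 1)


# Illustrative input: class midpoints and frequencies of Example 1.1.
s2_breath = grouped_variance([23, 34, 45, 56], [5, 10, 25, 10])
sd_breath = math.sqrt(s2_breath)

# Re-check of the worked computation from its summary values.
s2_check = (11844 - 50 * 15.0 ** 2) / (50 - 1)
sd_check = math.sqrt(round(s2_check, 2))

print(round(s2_breath, 2), round(sd_breath, 2))
print(round(s2_check, 2), round(sd_check, 2))
```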
To avoid use of the mode, we can employ the empirical formula (1.6),
\[
\text{Skewness} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}}
                = \frac{3\,(\bar{x} - \text{Median})}{S}
\]
The above two measures are called, respectively, Pearson’s first and second coefficients of
skewness.
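Pearson's second coefficient of skewness can be computed with the Python standard library. This is a sketch only: the function name is an assumption, and the sample of Example 1.5 is reused purely as illustrative input.

```python
import math
import statistics


def pearson_second_skewness(xs):
    """3 * (mean - median) / S, with S the sample standard deviation."""
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    s = statistics.stdev(xs)      # uses the n - 1 divisor, as in (1.4)
    return 3 * (mean - median) / s


sample = [5, 7, 8, 12, 18]        # mean 10, median 8, S^2 = 26.5
skew = pearson_second_skewness(sample)
print(round(skew, 3))
```

A positive value, as here, indicates that the mean lies above the median, which corresponds to positive skewness.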
Figure 1.4. Distribution curves with positive skewness, negative skewness, and zero skewness.
1.5.3.2 The Boxplot
In descriptive statistics, a boxplot is a convenient way of graphically depicting
groups of numerical data through their quartiles. Boxplots may also have lines extending
vertically from boxes (whiskers) indicating variability outside the upper and lower
quartiles, hence the terms box-and-whisker plot and box-and-whisker
diagram. Outliers may be plotted as individual points.
Boxplots are useful for revealing the center of the data, the spread of the data, the
distribution of the data and the presence of outliers. To construct a boxplot we first obtain
the minimum value, the maximum value, and the quartiles.
Boxplots display differences between populations without making any assumptions
about the underlying statistical distribution: they are non-parametric. The spacings between
the different parts of the box help in indicating the degree of dispersion and skewness in
the data, and in identifying outliers. Boxplots can be drawn either horizontally or vertically.
Definition
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line
extending from the minimum value to the maximum value, and a box with lines at the first
quartile, Q1; the median; and the third quartile, Q3.
Medians and quartiles are not very sensitive to extreme values (outliers). So boxplots,
which use medians and quartiles, also have the advantage of not being as sensitive to
extreme values as other devices based on the mean and standard deviation.
As an example, let us produce the box plot of the BMI data given in Table 1.1.
Figure 1.5. Boxplot of BMI (vertical axis: BMI, from 20 to 40; the median, 26.5, is marked inside the box).
As another example, let us produce the box plot of the following 14 data points:
1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1
Figure 1.6. Boxplot of C4 (the 14 data points above).
The resulting box plot is presented in Figure 1.6. Observe that the end points of the
whiskers are 1 for the smallest value and 11.5 for the largest value. The end values of the
box are 2 for the first quartile and 9 for the third quartile. The median, 7, is marked inside
the box.
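The five values needed to draw this box plot (minimum, Q1, median, Q3, maximum) can be reproduced with a short sketch. The function name `five_number_summary` is an assumption, and the quartiles are taken as medians of the lower and upper halves of the ordered data, which is one of several conventions in use.

```python
import statistics

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a box plot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # For odd n the median is counted in both halves.
    lower = ordered[:half] if n % 2 == 0 else ordered[:half + 1]
    upper = ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


summary = five_number_summary(data)
print(summary)
```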
EXERCISES
[1] Fill-in-the-Blank
i. The measure of central location that uses all of the observations in its
calculation is called ……….…
ii. The value that occurs most often in a set of data is called the ………..
iii. The ………. isn't always found.
iv. The ………… is an absolute measure of dispersion.
v. Range is ………. between minimum and maximum value of the data.
vi. The ………. is affected by the extreme values.
vii. The weekly sales from a sample of ten computer stores yielded a mean of
$25,900; a median $25,000 and a mode of $24,500. Then the shape of the
distribution is …………
[2] For each of the data sets in the following exercises compute (a) the mean, (b) the median,
(c) the mode, (d) the range, (e) the standard deviation, (f) the coefficient of variation, and
(g) the interquartile range. Treat each data set as a sample. For those exercises for which
you think it would be appropriate, construct a box plot and discuss the usefulness in
understanding the nature of the data that this device provides. For each exercise select the
measure of central tendency that you think would be most appropriate for describing the
data. Give reasons to justify your choice.
iii- In a pilot study, a researcher wanted to gain more insight into the psychosocial
consequences for children of a parent with cancer. For the study, 14 families participated
in semistructured interviews and completed standardized questionnaires. Below is the age
of the sick parent with cancer (in years) for the 14 families.
37 48 53 46 42 49 44 38 32 32 51 51 48 41
[3] The following table shows the number of hours 45 hospital patients slept following the
administration of a certain anesthetic.
7 1 0 1 2 4 8 7 3 8 5 12 11 3 8 1 1 13 10 4 4 5 5
8 7 7 3 2 38 13 1 7 17 3 4 5 53 1 17 10 4 7 7 11 8
(a) From these data construct:
A frequency distribution and a relative frequency distribution, a histogram and a
frequency polygon.
(b) Describe these data relative to symmetry and skewness.
(c) Compute the sample mean, sample median, sample range, sample standard deviation and
coefficient of variation.
(d) Construct a box plot and discuss the usefulness in understanding the nature of the data that
this device provides.
[4] The following are the number of babies born during a year in 60 community hospitals.
30 55 27 45 56 48 45 49 32 57 47 56
37 55 52 34 54 42 32 59 35 46 24 57
32 26 40 28 53 54 29 42 42 54 53 59
39 56 59 58 49 53 30 53 21 34 28 50
52 57 43 46 54 31 22 31 24 24 57 29
(a) From these data construct:
A frequency distribution and a relative frequency distribution
A histogram and a frequency polygon.
(b) Describe these data relative to symmetry and skewness.
(c) Compute the sample mean, sample median, sample range, sample standard deviation and
coefficient of variation.
(d) Plot the box plot, do you have the same conclusion in part (b)?
[5] How are the mean and median affected when it is known that, for a group of 10 students scoring an
average of 60 marks, the best paper was wrongly marked 80 instead of 75?
iv – The standard deviation is
a. 10.17 b. 103.51 c. 1.017
v – The coefficient of variation is
a. 19.22% b. 1.922 c. 195.5%
[7] The ages of 40 people in a class (to nearest year) are as follows:
Age 18 19 20 21 22
Number of Students: 2 8 16 10 4
Find the median, mode, mean and standard deviation.
[8] The following table shows the age distribution of a sample of diabetic patients.
Age Frequency Relative Frequency
10 - 14 4 0.08
15 - 19 6 -
20 - 24 - 0.24
25 - 29 - 0.32
30 - 34 10 -
35 - 39 - -
a- Complete the blank cells in the table
b- Calculate the mean and CV
[9] Circle the correct answer from each of the following multiple choice questions
1. Given the following frequency distribution:
xi -10 -5 0 5 10
fi 1 4 5 9 6
the mean is
a. 3.0 b. 15.0 c. 0.0 d. Otherwise
2. For the frequency distribution given in question 1, the median is
a. 0.0 b. 2.5 c. 5.0 d. Otherwise
3. The following table shows the scores on an intelligence test by a group of 50 students:
Scores 70 - 74 75 – 79 80 - 84 85 - 89 90 - 94
No of children 2 3 20 15 10
the mean is
a. 79.5 b. 84.5 c. 84.8 d. Otherwise
4. For the frequency distribution given in question 3, the median is
a. 84.5 b. 84.6 c. 84.7 d. Otherwise
5. The mean grade of 40 male students in a Math exam is 80, while the mean grade of 20 female
students in the same Math exam is 65. Then the mean grade of all the 60 students is
a. 72.5 b. 75 c. 145 d. Otherwise
6. The observations 2, 4, 5, x, 7, 9, 12, y are arranged in ascending order. If the median is 6 and
the mean is 7, then the values of x and y are respectively,
a. 6 & 12 b. 5 & 14 c. 5 & 12 d. Otherwise
7. The following table shows the scores on an intelligence test by a group of 25 students:
12. Eight people are in a room. Their mean age is 40, and their median age is 36. A 62 year-old
man leaves the room, and a 30 year-old woman enters, the mean age of the people in the
room now is
a. 36 b. 44 c. 50 d. Otherwise.
[11] For 50 male children born in a certain hospital, the mean and standard deviation of their
weights are 3.5 kg and 5.2 kg respectively. For 40 female children born in the same hospital,
these are 3.2 kg and 4.8 kg respectively. Find the mean and standard deviation of the
combined group of 90 children.
[12] The Blood Levels (mg/dl) of 50 Subjects are given in the following table
4.9 14.4 3.9 2.5 7.6 0.5 3.9 7.7 8.0 6.5
5.5 3.9 1.0 5.0 4.2 4.1 6.9 2.9 11.5 2.8
7.6 0.7 8.9 2.2 4.0 1.5 10.2 1.1 10.6 2.0
2.3 9.8 6.3 6.1 5.4 6.7 3.2 1.6 6.1 9.0
9.5 4.3 4.8 5.7 4.8 2.1 2.7 3.5 8.2 4.4
(a) Construct a frequency distribution with class intervals 0.5 – 2.4, 2.5 -4.4, …. and
plot the histogram.
(b) Draw an estimate of the distribution curve and discuss the skewness of the distribution.
(c) Compute the sample mean, median, standard deviation and coefficient of variation.
(d) Plot the box plot, do you have the same conclusion in part (b)?
[13] In a study of physical endurance levels of male college freshmen, the following composite
endurance scores based on several exercise routines were collected.
254 281 192 260 212 179 225 179 181 149
182 210 235 239 258 166 159 223 186 190
180 188 135 233 220 204 219 211 245 151
198 190 151 157 204 238 205 229 191 200
222 187 134 193 264 288 214 227 190 212
165 194 206 193 218 198 241 149 164 225
265 222 264 249 175 205 252 210 178 159
220 201 203 172 234 198 173 187 189 237
272 195 227 230 168 232 217 249 196 223
232 191 175 236 152 258 155 215 197 210
214 278 252 283 205 184 172 228 193 130
218 213 172 159 203 212 132 197 206 198
a- From these data construct a frequency distribution and a histogram.
b- For these data compute the following descriptive measures: mean, median, standard deviation,
coefficient of variation, range, first and third quartiles.
c- Describe these data relative to symmetry and skewness.