Organizing and Summarizing Data: Statistics

Chapter 1
ORGANIZING AND SUMMARIZING DATA
1.1 Introduction
For us to have an understanding of what the subject of statistics is all about, we need to
introduce some terminology. First, we will explain what we mean by the subject of statistics.
Definition 1.1
Statistics is a collection of methods for planning experiments, obtaining data, and
then organizing, summarizing, presenting, analyzing, interpreting and drawing conclusions
based on the data.
The subject of statistics is divided into two broad areas: descriptive
statistics and inferential statistics. These classifications are shown in Fig. 1.1.

                          Statistics
Descriptive Statistics              Inferential Statistics
Includes:                           Includes:
• Collecting                        • Making inferences
• Organizing                        • Hypothesis testing
• Summarizing                       • Determining relationships
• Presenting data                   • Making predictions
Fig. 1.1: Breakdown of the subject of statistics

In statistics we commonly use the terms population and sample. Because these terms are
central to our study, we define them now.
Definition 1.2
• A population is the complete collection of all elements (scores, people, measurements,
  and so on) to be studied. The collection is complete in the sense that it
  includes all subjects to be studied.
• A census is the collection of data from every element in a population.
• A sample is a sub-collection of elements drawn from a population (a subset of the
  population).
1.2 The Nature of Data

Data are observations (such as measurements, genders, and survey responses) that
have been collected. Data are sometimes used to find statistics. An isolated list of lifeless
numbers might appear to be a data set awaiting some statistical manipulation, but the
effective use of statistics requires that we know the context of the data, how the data were
obtained, and the population from which the data were obtained. Now, we will introduce
some key terms.
Definition 1.3
• A parameter is a numerical measurement describing some characteristic of a
  population.
• A statistic is a numerical measurement describing some characteristic of a sample.

Population → described by → Parameters
Sample → described by → Statistics
Some data sets consist of numbers (such as weights) and others are non-numerical (such as
eye colors). The terms qualitative data and quantitative data are often used to distinguish
between these types.
Definition 1.4
• Quantitative data consist of numbers representing counts or measurements.
• Qualitative (or categorical or attribute) data can be separated into distinct
  categories that are distinguished by some nonnumeric characteristics.
For example, the incomes of college graduates are quantitative data, while the genders
(male/female) of college graduates are qualitative data.
Quantitative data can be further described by distinguishing between discrete and continuous
types.
• Discrete data can take only specific values: the number of possible values is either
  finite or countable, for example 0, 1, 2, 3, . . .
• Continuous data can take any value in a given range: continuous (numerical) data
  result from infinitely many possible values that correspond to some continuous
  scale that covers a range of values without gaps, interruptions, or jumps. For
  example, the amount of milk that a cow produces could be 2.3415 gallons a day.
1.3 Computers and Statistical Calculations

The relatively recent widespread use of computers has had a tremendous
impact on scientific research in general and statistical analysis in particular.
Canned computer programs are available for performing most of the descriptive
and inferential statistical procedures that the average investigator is likely to
need. Some widely used "packages" of statistical procedures are:
(1) Minitab,
(2) SPSS (Statistical Package for the Social Sciences),
(3) R,
(4) SAS, and
(5) Statgraph.
Statistical programs differ with respect to their input requirements, their
output formats, and the specific calculations they will perform.
1.4 Grouped Data (The Frequency Distribution)

A frequency table provides one of the most convenient ways to summarize or display
grouped data. Before we construct such a table, let us consider the following numerical
data. Table 1.1 lists 80 values of body mass index data from the 1998 National Health
Interview Survey. The body mass index (BMI) is defined as [Weight (in
kilograms)/Height (in meters) squared]. According to established standards, a BMI from
19 to less than 25 is considered healthy; a BMI from 25 to less than 30 is regarded as
overweight; a BMI greater than or equal to 30 is defined as obese. Table 1.1 arranges the
numbers in the order in which they were collected.
TABLE 1.1. Body Mass Index for a Sample of 80 Adults
27.4 23.5 21.9 28.6 20.3 22.4 20.8 25.0
31.0 30.9 30.2 27.3 24.3 35.9 26.5 35.8
34.2 27.4 24.7 22.7 26.5 30.0 28.2 25.4
28.9 25.9 36.6 22.7 30.1 26.2 18.3 27.3
25.7 22.3 25.4 27.3 30.4 27.4 30.8 23.0
37.1 26.3 21.3 23.1 24.5 24.1 27.6 25.7
24.8 37.8 22.9 22.3 22.8 19.8 21.5 26.3
34.9 28.8 24.2 32.6 24.3 26.9 33.6 35.5
27.5 31.8 27.1 29.5 30.9 23.3 24.8 29.8
25.9 23.4 23.1 38.8 28.7 28.4 28.3 26.4
In constructing a frequency table for grouped data, we first determine a set of class
intervals that cover the range of the data (i.e., include all the observed values). The class
intervals are usually arranged from lowest numbers at the top of the table to highest
numbers at the bottom of the table and are defined so as not to overlap. We then tally the
number of observations that fall in each interval and present that number as a frequency,
called a class frequency. Some frequency tables include a column that represents the
frequency as a percentage of the total number of observations; this column is called the
relative frequency percentage. The completed frequency table provides a frequency
distribution.
Although not required, a good first step in constructing a frequency table is to rearrange
the data table by sorting the data in ascending order. The smallest number is placed in the
first row of the leftmost column, and the numbers then increase going down the first column.
After the first column is completed, the procedure continues at the top of the second
column, and so on until the largest observation appears in the bottom row of the rightmost
column. We call the arranged table an ordered array. It is much easier to tally the
observations for a frequency table from such an ordered array of data than it is from the
original data table. Table 1.2 provides a rearrangement of the body mass index data
as an ordered array.
TABLE 1.2. Body Mass Index Data for a Sample of 80 Adults:
Ordered Array (Sorted in Ascending Order)
18.3 22.7 24.1 25.4 26.5 27.6 30.0 33.6
19.8 22.7 24.2 25.7 26.9 28.2 30.1 34.2
20.3 22.8 24.3 25.7 27.1 28.3 30.2 34.9
20.8 22.9 24.3 25.9 27.3 28.4 30.4 35.5
21.3 23.0 24.5 25.9 27.3 28.6 30.8 35.8
21.5 23.1 24.7 26.2 27.3 28.7 30.9 35.9
21.9 23.1 24.8 26.3 27.4 28.8 30.9 36.6
22.3 23.3 24.8 26.3 27.4 28.9 31.0 37.1
22.3 23.4 25.0 26.4 27.4 29.5 31.8 37.8
22.4 23.5 25.4 26.5 27.5 29.8 32.6 38.8
In Table 1.2, by inspection we find that the lowest and highest values
are 18.3 and 38.8, respectively. We will use these numbers to help us create equally
spaced intervals for tabulating frequencies of data. Although the number of intervals that
one may choose for a frequency distribution is arbitrary, the actual number should depend
on the range of the data (R, the difference between the largest and the smallest
observation in the data set) and the number of cases (the number of observations, N).
For a data set of 50 to 150 observations, the number chosen usually ranges from about
five to ten. In the present example, the range of the data is 38.8 – 18.3 = 20.5.
Suppose we divide the data set into seven intervals. Then, we have 20.5 ÷ 7 = 2.93, which
rounds to 3.0. Consequently, the intervals will have a width of three. These seven intervals
are as follows:
1. 18.0 – 20.9
2. 21.0 – 23.9
3. 24.0 – 26.9
4. 27.0 – 29.9
5. 30.0 – 32.9
6. 33.0 – 35.9
7. 36.0 – 38.9
We have:
The total number of observations = N = 80
The number of classes = K = 7
The size of each class (width) = C = W = 3
Relative frequency = Frequency / N
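The tallying procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the names `bmi`, `FIRST_LIMIT`, `WIDTH`, `K` and `class_index` are assumptions introduced here, and the classes are the seven intervals of width 3 starting at 18.0.

```python
from collections import Counter

# BMI values from Table 1.1, in the order they were collected.
bmi = [
    27.4, 23.5, 21.9, 28.6, 20.3, 22.4, 20.8, 25.0,
    31.0, 30.9, 30.2, 27.3, 24.3, 35.9, 26.5, 35.8,
    34.2, 27.4, 24.7, 22.7, 26.5, 30.0, 28.2, 25.4,
    28.9, 25.9, 36.6, 22.7, 30.1, 26.2, 18.3, 27.3,
    25.7, 22.3, 25.4, 27.3, 30.4, 27.4, 30.8, 23.0,
    37.1, 26.3, 21.3, 23.1, 24.5, 24.1, 27.6, 25.7,
    24.8, 37.8, 22.9, 22.3, 22.8, 19.8, 21.5, 26.3,
    34.9, 28.8, 24.2, 32.6, 24.3, 26.9, 33.6, 35.5,
    27.5, 31.8, 27.1, 29.5, 30.9, 23.3, 24.8, 29.8,
    25.9, 23.4, 23.1, 38.8, 28.7, 28.4, 28.3, 26.4,
]

FIRST_LIMIT = 18.0  # lower limit of the first class (assumed name)
WIDTH = 3.0         # class width W
K = 7               # number of classes

def class_index(x):
    """Return the 0-based class number of a value recorded to one decimal."""
    # Work in integer tenths so floating-point error cannot move a value
    # across a class limit.
    return (round(x * 10) - round(FIRST_LIMIT * 10)) // round(WIDTH * 10)

counts = Counter(class_index(x) for x in bmi)
n = len(bmi)
frequencies = [counts.get(i, 0) for i in range(K)]
relative = [f / n for f in frequencies]

# Print class interval, frequency, relative frequency and relative % frequency.
for i, (f, r) in enumerate(zip(frequencies, relative)):
    lo = FIRST_LIMIT + i * WIDTH
    print(f"{lo:.1f}-{lo + WIDTH - 0.1:.1f}  {f:3d}  {r:.3f}  {100 * r:.1f}%")
```

The list `frequencies` holds the class frequencies and `relative` the relative frequencies, so the same two lists could feed a histogram or a frequency polygon.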
1.4.1 Frequency Histograms

We may display a frequency distribution (or a relative frequency distribution)
graphically in the form of a histogram as shown in Fig 1.1.
TABLE 1.1. Body Mass Index (BMI)
Class Interval      Frequency   Relative    Relative %
for BMI Levels      (f)         Frequency   Frequency
18.0–20.9             4         0.050         5.0
21.0–23.9            16         0.200        20.0
24.0–26.9            22         0.275        27.5
27.0–29.9            18         0.225        22.5
30.0–32.9            10         0.125        12.5
33.0–35.9             6         0.075         7.5
36.0–38.9             4         0.050         5.0
Total                80         1.000       100.0
Table 1.2. Class Boundaries of the Data of Table 1.1

Class Limits        Class        Frequency
for BMI Levels      Mid-point    (f)
18.0 – 21.0         19.5          4
21.0 – 24.0         22.5         16
24.0 – 27.0         25.5         22
27.0 – 30.0         28.5         18
30.0 – 33.0         31.5         10
33.0 – 36.0         34.5          6
36.0 – 39.0         37.5          4
Figure 1.1 Histogram of the BMI Levels of 80 Adults
[Histogram of BMI. Horizontal axis: BMI, with class boundaries 18, 21, 24, 27, 30, 33, 36, 39. Vertical axis: Frequency. Bar heights: 4, 16, 22, 18, 10, 6, 4.]
In constructing a histogram, the values of the variable under consideration make up the
horizontal axis, while the vertical axis has as its scale the frequency (or relative frequency if
desired) of occurrence. Above each class interval on the horizontal axis a rectangular bar,
or cell as it is sometimes called, is erected so that its height corresponds to the respective
frequency. The cells of a histogram must be joined and, to accomplish this, we must take
into account the true limits of the class intervals to prevent gaps from occurring between the
cells of our graph. The true limits for each of the class intervals are taken to be those shown
in the class boundaries table (Table 1.2).
If we draw a graph using these true limits as the bases of our rectangles, no gaps will result,
and we will have the histogram shown in Fig 1.1.
1.4.2 Frequency Polygons

To draw a frequency polygon, we first place a dot above the midpoint of each class
interval represented on the horizontal axis of a graph like the one shown in Fig 1.1. The height
above the horizontal axis of a given dot corresponds to the frequency of the relevant class
interval. Connecting the dots by straight lines produces the frequency polygon. Fig 1.2 is the
frequency polygon for the data in Table 1.1.
Note that the polygon is brought down to the horizontal axis at the ends at points that
would be the midpoints if there were an additional cell at each end of the corresponding
histogram. This allows for the total area to be enclosed. The total area under the frequency
polygon is equal to the area under the histogram.
Figure 1.2 Frequency Polygon of the BMI Levels of 80 Adults
[Frequency polygon. Horizontal axis: Mid-point (19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5). Vertical axis: Frequency.]
1.5 Statistical Measures of Data
1.5.1 Measures of Central Tendency
Measures of central tendency are numbers that tell us where the majority of values
in the distribution (or sample) are located. This section will cover the following measures
of central tendency: arithmetic mean, median, and mode. These measures also are called
measures of location. In contrast to measures of central tendency, measures of dispersion
inform us about the spread of values in a distribution. Section 1.5.2 will present measures
of dispersion.
1.5.1.1 The Arithmetic Mean

The arithmetic mean (ungrouped data), or simply the mean or average, is the sum
of the individual values in a data set divided by the number of values in the data set. We
can compute a mean of both a finite population and a sample. For the mean of a finite
population (denoted by μ), we sum the individual observations in the entire population and
divide by the population size. When data are based on a sample, to calculate the sample
mean (denoted by $\bar{x}$) we sum the individual observations in the sample and divide by the
number of elements in the sample, n. The sample mean is the sample analog to the mean
of a finite population.
Let $x_1, x_2, \ldots, x_n$ be a sample of observations; then its mean is given by
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1.1} \]
For example, the mean of a sample consisting of the data 3, 8, 6, 14, 0, -4, 0, 12, -7, 0 and
-10 is given by
\[ \bar{x} = \frac{1}{11}\bigl(3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)\bigr) = \frac{22}{11} = 2 \]
The Mean Computed from Grouped Data:

For large data sets (e.g., more than about 20 observations when performing
calculations by hand), summing the individual numbers may be impractical, so we use
grouped data. First, the data need to be placed in a frequency table, as illustrated in Section
1.4. We then apply Formula 1.2, which specifies that the midpoint of each class interval
is multiplied by the frequency of observation in that class. So if we denote the class
midpoints by $x_1, x_2, \ldots, x_k$ (where k is the number of classes) and the corresponding
frequencies by $f_1, f_2, \ldots, f_k$, then the mean is
\[ \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{1}{N}\sum_{i=1}^{k} x_i f_i \tag{1.2} \]
where $N = \sum_{i=1}^{k} f_i$ is the sum of all frequencies, i.e. the total number of observations in the data.
Example 1.1
For each boy in a sample of 50 boys, the time (correct to the nearest second) for which
he could hold his breath was recorded. The results are presented in the following table:
Time (seconds)        18 – 28   29 – 39   40 – 50   51 – 61
Number of children       5        10        25        10
Calculate the mean.

Solution
When we compute the mean from grouped data, it is convenient to prepare a work
table such as Table 1.3.
Table 1.3 Work Table for Computing the Mean from
the Grouped Data of Example 1.1
Class       Class midpoint   Class frequency
interval    x_i              f_i               x_i f_i
18 - 28     23                5                  115
29 - 39     34               10                  340
40 - 50     45               25                 1125
51 - 61     56               10                  560
Total                        50                 2140
We may now compute the mean.
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{k} x_i f_i = \frac{2140}{50} = 42.8 \]
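The work table can be reproduced with a short Python sketch of Formula (1.2). The function name `grouped_mean` and the variable names are assumptions made for this illustration; the class limits and frequencies are those of Example 1.1.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum of x_i * f_i divided by N = sum of f_i."""
    n = sum(freqs)
    return sum(x * f for x, f in zip(midpoints, freqs)) / n

# Class limits and frequencies from Example 1.1.
classes = [(18, 28), (29, 39), (40, 50), (51, 61)]
freqs = [5, 10, 25, 10]

# The class midpoint is the average of the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

mean = grouped_mean(midpoints, freqs)
print(midpoints)  # midpoints 23, 34, 45, 56
print(mean)       # grouped mean, 2140 / 50
```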
1.5.1.2 Median
The median of a data set is the value of the observation that divides the ordered
dataset in half. Essentially, the median is the observation whose value defines the midpoint
of a distribution; i.e., half of the data fall above the median and half below.
Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the sample median is defined as follows:
(1) the $\left(\frac{n+1}{2}\right)$th observation if n is odd;
(2) the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations if n is even.
Example 1.2
Each of 10 children in the second grade was given a reading aptitude test. The scores
were as follows: 95 86 78 90 62 73 89 92 84 76. Determine the median.
Solution
First we must arrange the scores in order of magnitude.
62 73 76 78 84 86 89 90 92 95
Because there are an even number of measurements, the median is the average of the two
midpoint scores.
\[ \text{Median} = \frac{84 + 86}{2} = 85 \]
The median tells us that half of the observations are less than 85 and half of the
observations are greater than 85.
Example 1.3
Consider the following data set, 7 35 5 9 8 3 10 12 8, which consists of
white blood cell counts (in thousands) taken on admission of all patients entering one of the
Alexandria University hospitals on a given day. Compute the median white blood cell count.
Solution
First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35. Because n = 9 is odd, the sample
median is given by the fifth observation in the ordered sample, which equals 8, or 8000 on the
original scale.
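The two-case definition of the sample median can be sketched as follows. This is an illustrative implementation; the function name `median` is chosen here, and the data are those of Examples 1.2 and 1.3.

```python
def median(values):
    """Sample median following the odd/even rule given in the text."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation (index mid, counting from 0)
        return s[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th observations
    return (s[mid - 1] + s[mid]) / 2

scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]  # Example 1.2, n = 10
counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]            # Example 1.3, n = 9

print(median(scores))  # average of 84 and 86
print(median(counts))  # fifth ordered observation
```

Note that the extreme value 35 in the second sample has no effect on the median, which is one reason the median is often preferred for skewed data.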
1.5.1.3 Mode
The mode is defined as the observation in the sample which occurs most
frequently, if there is such an observation.
• If each observation occurs the same number of times, then there is no mode.
• If two or more observations occur the same number of times (and more frequently
  than any of the other observations), then there is more than one mode, and the sample
  is said to be multimodal.
• If there is only one mode, the sample is said to be unimodal.
For example, if the sample is 24, 29, 26, 31, 29, 34, 25, 29, then the mode is 29.
For the sample 46, 47, 47, 43, 48, 45, 43, 49, there are two modes, namely 43 and 47
(bimodal).
For the sample 14, 16, 21, 19, 18, 24, 17, there is no mode.
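The three rules above can be sketched in Python with a frequency count. The function name `modes` is an assumption of this sketch, as is the choice to return an empty list when there is no mode and to treat a sample with a single distinct value as having that value as its mode.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or [] if every value is equally frequent."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # Rule 1: all distinct values occur equally often, so there is no mode.
    # (Assumption: a sample with only one distinct value keeps that value.)
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    # Rules 2 and 3: one mode (unimodal) or several (multimodal).
    return sorted(v for v, c in counts.items() if c == top)

print(modes([24, 29, 26, 31, 29, 34, 25, 29]))   # unimodal sample
print(modes([46, 47, 47, 43, 48, 45, 43, 49]))   # bimodal sample
print(modes([14, 16, 21, 19, 18, 24, 17]))       # no mode
```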
1.5.1.4 Empirical Relation between Mean, Median and Mode

For unimodal frequency curves which are moderately skewed (asymmetrical), we
have the empirical relation
Mean - Mode = 3 (Mean - Median) (1.3)
1.5.1.5 Quartiles, Deciles, and Percentiles

If a set of data is arranged in order of magnitude, the middle value (or arithmetic
mean of the two middle values) which divides the set into two equal parts is the median. By
extending this idea we can think of those values which divide the set into four equal parts.
These values, denoted by Q1, Q2 and Q3, are called the first, second and third quartiles
respectively, the value Q2 being equal to the median.
Similarly the values which divide the data into ten equal parts are called deciles and
are denoted by D1 , D2 ,..., D9 , while the values dividing the data into one hundred equal
parts are called percentiles and are denoted by P1 , P2 , ..., P99 . The 5th decile and 50th
percentile correspond to the median. The 25th and 75th percentiles correspond to the first
and third quartiles respectively.
To calculate the quartiles, first locate the median in the ordered list of observations.
The first quartile Q1 is the median of all values less than the median of the whole set of data,
and the third quartile Q3 is the median of all values greater than the median of the whole set
of data.
Example 1.4
The following are the scores of twelve students on a statistics test:
82, 90, 75, 77, 94, 72, 93, 74, 78, 86, 85, 79
Find the median and the two quartiles.
Solution
Arranging the data (n = 12) according to size, we get
72 74 75 77 78 79 82 85 86 90 93 94
For n = 12, the median position is (12+1)/2 = 6.5, so that the median is (79+82)/2 = 80.5. The
first quartile Q1 is the median of the 6 values below the median, and the third quartile Q3 is
the median of the 6 values above the median. Thus, Q1 = (75+77)/2 = 76 and Q3 = (86+90)/2 = 88.
However, if n is odd, it is more accurate to modify the definition by replacing "less
than" by "less than or equal " and "greater than" by "greater than or equal ". As an example,
the median of the following observations 66 73 74 79 82 86 88 90 94 is
82, while Q1 = 74 and Q3 = 88.
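The quartile rule used in this section (median of each half, with the median included in both halves when n is odd) can be sketched as below. Statistical packages use several different quartile conventions, so results from software may differ slightly; the function names here are assumptions of this sketch.

```python
def median_sorted(s):
    """Median of an already sorted list."""
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from the text."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    if n % 2 == 0:
        lower, upper = s[:half], s[half:]
    else:
        # n odd: "less than or equal" / "greater than or equal", so the
        # median itself belongs to both halves.
        lower, upper = s[:half + 1], s[half:]
    return median_sorted(lower), median_sorted(s), median_sorted(upper)

print(quartiles([82, 90, 75, 77, 94, 72, 93, 74, 78, 86, 85, 79]))  # Example 1.4
print(quartiles([66, 73, 74, 79, 82, 86, 88, 90, 94]))              # odd n
```

The interquartile range is then simply Q3 - Q1, for instance 88 - 76 = 12 for Example 1.4.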
1.5.2 Measures of Dispersion

Knowing the average of a distribution in no way tells us whether or not the figures in
the distribution are clustered closely together or well spread out.
For example, there could be two groups, each of five men. The heights of the men in
the two groups are respectively:
First group : 162 , 164 , 165 , 166 , 168 cm.
Second group : 150 , 160 , 165 , 170 , 180 cm.
Both groups have a mean height of 165 cm, but the dispersion in heights is much greater in
one than the other.
It would clearly be useful to find some way of measuring this dispersion and
expressing it as a single figure. Such measures are called measures of dispersion (or
variation) and the most important of these are the following:
1.5.2.1 Range
The range of a set of observations is the difference between the largest and smallest
numbers in the set. For example, the ranges of the above two groups are, respectively,
168 - 162 = 6 cm and 180 - 150 = 30 cm.
The range is a poor measure of variation, particularly, if the size of the sample is
large. The objection to the range is that it does not make use of all the observations in the
sample but uses only two extreme values.
Whereas the range covers all the values in a sample, a similar measure of variation
covers (more or less) the middle 50%. It is the interquartile range Q3 - Q1.
1.5.2.2 Variance and Standard Deviation.

The standard deviation (or its square, the variance) is by far the most important of the
measures of dispersion. It is based upon squared deviations from the mean of a set of values.
So the variance of a sample of n observations $x_1, x_2, \ldots, x_n$ (denoted by $S^2$) is defined as the
mean square deviation and is given by
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \tag{1.4} \]
The sum of squares is divided by (n-1) rather than n for theoretical reasons, but if n > 35
there is, practically, no difference in the definitions.
Example 1.5
As an illustration for computing the variance, consider the sample 5, 7, 8, 12 and 18.
The mean of this sample is given by
\[ \bar{x} = \frac{1}{5}(5 + 7 + 8 + 12 + 18) = 10 \]
Hence
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \frac{1}{4}\left[(5-10)^2 + (7-10)^2 + (8-10)^2 + (12-10)^2 + (18-10)^2\right] = \frac{106}{4} = 26.5 \]
If the number of observations is large, the computation necessary to find $S^2$ from
formula (1.4) is rather laborious, especially if the mean is not an integral value. There is
another formula for computing $S^2$ which is equivalent to (1.4):
\[ S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2\right) \tag{1.5} \]
The standard deviation (denoted by S.D.) is defined as the square root of the variance, i.e.
\[ \text{S.D.} = S = \sqrt{\text{variance}} \]
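The equivalence of the definitional formula (1.4) and the computational formula (1.5) can be checked numerically on the sample of Example 1.5. The function names below are assumptions made for this sketch.

```python
import math

def sample_variance(values):
    """Formula (1.4): sum of squared deviations from the mean, over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Formula (1.5): (sum of x_i^2 minus n * mean^2), over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)

data = [5, 7, 8, 12, 18]            # sample from Example 1.5
s2 = sample_variance(data)          # 106 / 4
s2_short = sample_variance_shortcut(data)
sd = math.sqrt(s2)                  # standard deviation S

print(s2, s2_short, round(sd, 3))
```

With real-valued data the shortcut formula can lose precision when the mean is large relative to the spread, so formula (1.4) is usually the safer choice in software.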
Now, for the grouped data in a frequency distribution, as in the case of the mean, a
similar formula applies. If $x_1, x_2, \ldots, x_k$ occur with frequencies $f_1, f_2, \ldots, f_k$ respectively,
the variance can be written as
\[ S^2 = \frac{1}{N-1}\left(\sum_{i=1}^{k} x_i^2 f_i - N\,\bar{x}^2\right) \tag{1.6} \]
where $N = \sum_{i=1}^{k} f_i$.
Let us now illustrate the computation of the variance and standard deviation by the
computational formula (1.6), using the data of Table 1.6, for which N = 50,
$\sum x_i^2 f_i = 11844$ and $\bar{x} = 15.0$. The variance is then given by
\[ S^2 = \frac{1}{50-1}\left(11844 - 50\,(15.0)^2\right) = 12.12 \]
The standard deviation is
\[ S = \sqrt{12.12} = 3.48 \]
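As a sketch of how formula (1.6) is applied, the code below first re-checks the arithmetic printed above and then, as an additional illustration that is not in the original text, applies the same formula to the grouped data of Example 1.1. The function name `grouped_variance` is an assumption.

```python
import math

def grouped_variance(midpoints, freqs):
    """Formula (1.6): (sum of x_i^2 f_i minus N * mean^2) / (N - 1)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
    sum_sq = sum(x * x * f for x, f in zip(midpoints, freqs))
    return (sum_sq - n * mean ** 2) / (n - 1)

# Check of the printed arithmetic: N = 50, sum of x^2 f = 11844, mean = 15.0.
s2_text = (11844 - 50 * 15.0 ** 2) / (50 - 1)
print(round(s2_text, 2), round(math.sqrt(s2_text), 2))

# Illustration on Example 1.1 (class midpoints and frequencies).
midpoints = [23, 34, 45, 56]
freqs = [5, 10, 25, 10]
s2 = grouped_variance(midpoints, freqs)
sd = math.sqrt(s2)
print(round(s2, 2), round(sd, 2))
```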
1.5.2.3 Coefficient of Variation

It sometimes happens that we need to compare the variability of two or more sets of
figures. For example, a variation or dispersion of 10 inches in measuring a distance of 100
feet is quite different in effect from the same variation of 10 inches in a distance of 20 feet.
Such figures are not directly comparable, since they relate to sets of figures of quite
different orders of size and, in other cases, may also be in different units.
However, we could obtain some idea of the degree of dispersion if we could find the
size of a variation as compared with the average of the figures it was derived from i.e.
calculate the standard deviation as a percentage of the mean. This measure is called the
coefficient of variation (or coefficient of variability), abbreviated CV, and is defined as
\[ CV = \frac{S}{\bar{x}} \cdot 100\% \]
The coefficient of variation expresses sample variability relative to the mean of the
sample (and is on rare occasions referred to as the "relative standard deviation"). Since S and
$\bar{x}$ have identical units, CV has no units at all, a fact emphasizing that it is a relative measure,
divorced from the actual magnitude or units of measurement of the data. A disadvantage of
the coefficient of variation is that it fails to be useful when $\bar{x}$ is close to zero.
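As an illustration, the coefficient of variation can be computed for the two groups of heights introduced at the start of Section 1.5.2. The sketch uses `statistics.mean` and `statistics.stdev` from the Python standard library (the latter divides by n - 1, matching formula (1.4)); the function name `coefficient_of_variation` is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """CV = S / mean * 100, in percent; not meaningful if the mean is near 0."""
    return statistics.stdev(values) / statistics.mean(values) * 100

group1 = [162, 164, 165, 166, 168]  # heights in cm, first group
group2 = [150, 160, 165, 170, 180]  # heights in cm, second group

cv1 = coefficient_of_variation(group1)
cv2 = coefficient_of_variation(group2)
print(f"CV group 1: {cv1:.2f}%")
print(f"CV group 2: {cv2:.2f}%")
```

Both groups have the same mean (165 cm), so here the ratio of the two CVs equals the ratio of the two standard deviations; the CV becomes more informative when the means differ.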
1.5.3 Measures of Symmetry

1.5.3.1 Skewness
Skewness is the degree of asymmetry, or departure from symmetry. If the frequency
curve (smoothed frequency polygon) of a distribution has a longer "tail" to the right of the
central maximum than to the left, the distribution is said to be skewed to the right or to have
positive skewness. If the reverse is true it is said to be skewed to the left or to have negative
skewness (see Fig. 1.4).
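One measure of skewness uses the mode:
\[ \text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{\bar{x} - \text{Mode}}{S} \]
To avoid use of the mode, we can employ the empirical formula (1.3),
\[ \text{Skewness} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3\,(\bar{x} - \text{Median})}{S} \]
The above two measures are called, respectively, Pearson's first and second coefficients of
skewness.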
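Both coefficients are easy to compute once the mean, median, mode and standard deviation are known. The sketch below applies them to the white blood cell counts of Example 1.3, a sample with one large value in the right tail; the function names are assumptions, and `statistics.mode` is used on the understanding that this sample has a single mode.

```python
import statistics

def pearson_first(values):
    """Pearson's first coefficient: (mean - mode) / S."""
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (statistics.mean(values) - statistics.mode(values)) / s

def pearson_second(values):
    """Pearson's second coefficient: 3 * (mean - median) / S."""
    s = statistics.stdev(values)
    return 3 * (statistics.mean(values) - statistics.median(values)) / s

counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]  # data of Example 1.3

sk1 = pearson_first(counts)
sk2 = pearson_second(counts)
print(round(sk1, 3), round(sk2, 3))  # both positive: skewed to the right
```

The two coefficients agree only approximately in general, because relation (1.3) is empirical; in this sample the mode and median coincide, so the second coefficient is exactly three times the first.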
Figure 1.4 Positive, negative and zero skewness
[Three frequency curves. Positive skewness: mode < median < mean. Negative skewness: mean < median < mode. Zero skewness: mean = median = mode.]
1.5.3.2 The Boxplot
In descriptive statistics, a boxplot is a convenient way of graphically depicting
groups of numerical data through their quartiles. Boxplots may also have lines extending
vertically from boxes (whiskers) indicating variability outside the upper and lower
quartiles, hence the terms box-and-whisker plot and box-and-whisker
diagram. Outliers may be plotted as individual points.
Boxplots are useful for revealing the center of the data, the spread of the data, the
distribution of the data and the presence of outliers. To construct a boxplot we first obtain
the minimum value, the maximum value, and the quartiles.
Boxplots display differences between populations without making any assumptions
about the underlying statistical distribution: they are non-parametric. The spacings between
the different parts of the box help in indicating the degree of dispersion and skewness in
the data, and in identifying outliers. Boxplots can be drawn either horizontally or vertically.
Definition
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line
extending from the minimum value to the maximum value, and a box with lines at the first
quartile, Q1; the median; and the third quartile, Q3.
Medians and quartiles are not very sensitive to extreme values (outliers). So boxplots,
which use medians and quartiles, also have the advantage of not being as sensitive to
extreme values as other devices based on the mean and standard deviation.
As an example, let us produce the box plot of the BMI data given in Table 1.1.
Figure 1.5 Boxplot of BMI
[Boxplot. Vertical axis: BMI, scale 20 to 40. The median, 26.5, is marked inside the box.]
As another example, let us produce the box plot of the following 14 data points:
1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1
Figure 1.6 Boxplot of the 14 data points
[Boxplot. Vertical axis: C4 (the column holding the data), scale 2 to 12.]
The resulting box plot is presented in Figure 1.6. Observe that the end points of the
whiskers are 1 for the smallest value and 11.5 for the largest value. The end values of the
box are 2 for the first quartile and 9 for the third quartile. The median, 7, is marked inside
the box.
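The five values needed to draw a boxplot (minimum, Q1, median, Q3, maximum) can be obtained with a short sketch. It uses `statistics.median` from the Python standard library and the median-of-halves quartile rule of Section 1.5.1.5; the function name `five_number_summary` is an assumption of this illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    if n % 2 == 0:
        lower, upper = s[:half], s[half:]
    else:
        # n odd: include the median in both halves, as in the text.
        lower, upper = s[:half + 1], s[half:]
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)
print("interquartile range:", q3 - q1)
```

The whisker ends, box ends and median line described for Figure 1.6 correspond to these five numbers.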
EXERCISES
[1] Fill-in-the-Blank
i. The measure of central location that uses all of the observations in its
calculation is called ……….…
ii. The value that occurs most often in a set of data is called the ………..
iii. The ………. isn't always found.
iv. The ………… is an absolute measure of dispersion.
v. The range is the ………. between the minimum and maximum values of the data.
vi. The ………. is affected by the extreme values.
vii. The weekly sales from a sample of ten computer stores yielded a mean of
$25,900; a median $25,000 and a mode of $24,500. Then the shape of the
distribution is …………
[2] For each of the data sets in the following exercises compute (a) the mean, (b) the median,
(c) the mode, (d) the range, (e) the standard deviation, (f) the coefficient of variation, and
(g) the interquartile range. Treat each data set as a sample. For those exercises for which
you think it would be appropriate, construct a box plot and discuss the usefulness in
understanding the nature of the data that this device provides. For each exercise select the
measure of central tendency that you think would be most appropriate for describing the
data. Give reasons to justify your choice.
i- A researcher performed a 4-year retrospective review of 102 women undergoing radical
hysterectomy for cervical or endometrial cancer. Catheter-associated urinary tract infection
was observed in 12 of the subjects. Below are the numbers of postoperative days until
diagnosis of the infection for each subject experiencing an infection.
16 10 49 15 6 15 8 19 11 22 13 17
ii- A researcher evaluated the duration of benefit derived from the use of noninvasive positive
pressure ventilation by patients with amyotrophic lateral sclerosis on symptoms, quality of
life, and survival. One of the variables of interest is partial pressure of arterial carbon
dioxide (PaCO2). The values below (mm Hg) reflect the result of baseline testing on 30
subjects as established by arterial blood gas analyses.
40.0 47.0 34 42.0 54.0 48.0 53.6 56.9 58.0 45.0
54.5 54.0 43 44.3 53.9 41.8 33.0 43.1 52.4 37.9
34.5 40.1 33 59.9 62.6 54.1 45.7 40.6 56.6 59.0
iii- In a pilot study, a researcher wanted to gain more insight into the psychosocial
consequences for children of a parent with cancer. For the study, 14 families participated
in semistructured interviews and completed standardized questionnaires. Below is the age
of the sick parent with cancer (in years) for the 14 families.
37 48 53 46 42 49 44 38 32 32 51 51 48 41
[3] The following table shows the number of hours 45 hospital patients slept following the
administration of a certain anesthetic.
7 1 0 1 2 4 8 7 3 8 5 12 11 3 8 1 1 13 10 4 4 5 5
8 7 7 3 2 38 13 1 7 17 3 4 5 53 1 17 10 4 7 7 11 8
(a) From these data construct:
A frequency distribution and a relative frequency distribution, a histogram and a
frequency polygon.
(b) Describe these data relative to symmetry and skewness.
(c) Compute the sample mean, sample median, sample range, sample standard deviation and
coefficient of variation.
(d) Construct a box plot and discuss the usefulness in understanding the nature of the data that
this device provides.
[4] The following are the number of babies born during a year in 60 community hospitals.
30 55 27 45 56 48 45 49 32 57 47 56
37 55 52 34 54 42 32 59 35 46 24 57
32 26 40 28 53 54 29 42 42 54 53 59
39 56 59 58 49 53 30 53 21 34 28 50
52 57 43 46 54 31 22 31 24 24 57 29
(a) From these data construct:
A frequency distribution and a relative frequency distribution
A histogram and a frequency polygon.
(b) Describe these data relative to symmetry and skewness.
(c) Compute the sample mean, sample median, sample range, sample standard deviation and
coefficient of variation.
(d) Plot the box plot, do you have the same conclusion in part (b)?
[5] How are the mean and median affected when it is known that, for a group of 10 students scoring an
average of 60 marks, the best paper was wrongly marked 80 instead of 75?
[6] For the frequency distribution table shown below:

Intervals Frequencies
30 – 39 6
40 – 49 9
50 – 59 25
60 – 69 7
70 - 79 3
i– The boundaries of the third class are:
a. 50-60 b. 49.5-59.5 c. 50.5-59.5 d. none of them
ii– The class midpoint for class 60 - 69 is
a. 54.5 b. 64.5 c. 49.8
iii– The mean is
a. 54.5 b. 52.9 c. 49.5
iv – The standard deviation is
a. 10.17 b. 103.51 c. 1.017
v – The coefficient of variation is
a. 19.22% b. 1.922 c. 195.5%
[7] The ages of 40 people in a class (to nearest year) are as follows:
Age 18 19 20 21 22
Number of Students: 2 8 16 10 4
Find the median, mode, mean and standard deviation.
[8] The following table shows the age distribution of a sample of diabetic patients.
Age Frequency Relative Frequency
10 - 14 4 0.08
15 - 19 6 -
20 - 24 - 0.24
25 - 29 - 0.32
30 - 34 10 -
35 - 39 - -
a- Complete the blank cells in the table
b- Calculate the mean and CV
[9] Circle the correct answer from each of the following multiple choice questions
1. Given the following frequency distribution:
xi -10 -5 0 5 10
fi 1 4 5 9 6
the mean is
a. 3.0 b. 15.0 c. 0.0 d. Otherwise
2. For the frequency distribution given in question 1, the median is
a. 0.0 b. 2.5 c. 5.0 d. Otherwise
3. The following table shows the scores on an intelligence test by a group of 50 students:
Scores 70 - 74 75 – 79 80 - 84 85 - 89 90 - 94
No of children 2 3 20 15 10
the mean is
a. 79.5 b. 84.5 c. 84.8 d. Otherwise
4. For the frequency distribution given in question 3, the median is
a. 84.5 b. 84.6 c. 84.7 d. Otherwise
5. The mean grade of 40 male students in a Math exam is 80, while the mean grade of 20 female
students in the same Math exam is 65. Then the mean grade of all the 60 students is
a. 72.5 b. 75 c. 145 d. Otherwise
6. The observations 2, 4, 5, x, 7, 9, 12, y are arranged in ascending order. If the median is 6 and
the mean is 7, then the values of x and y are respectively,
a. 6 & 12 b. 5 & 14 c. 5 & 12 d. Otherwise
7. The following table shows the speeds of a group of 25 ships:
Speed (knots) 5 - 9 10 - 14 15 - 19 20 - 24
No of ships 6 10 7 2
the mean is
a. 11.6 b. 13.0 c. 81.25 d. Otherwise
8. Ten people are in a room. Their mean age is 40, and their median age is 35. A 50 year-old man
leaves the room, and a 70 year-old woman enters. What is the median age of the people in the
room now?
a. 35 b. 38 c. 40 d. 42 e. Otherwise.
9. Which of the following sets of four numbers has the largest possible standard deviation?
a. 4, 4, 4, 4 b. 4, 5, 6, 7 c. 4, 6, 8, 10 d. 8, 9, 9, 10.
10. The variance of 10 measurements of people's height (in inches) is computed to be 25. The
units for the variance of 25 are:
a. inches b. square root inches c. inches squared d. no units.
11. Which of the following central measurements is the most affected by extreme values?
a. Mean b. Median c. Mode d. Otherwise
12. Eight people are in a room. Their mean age is 40, and their median age is 36. A 62-year-old
man leaves the room, and a 30-year-old woman enters. The mean age of the people in the
room now is
a. 36 b. 44 c. 50 d. Otherwise.
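Several questions in exercise [9] ask for the mean of grouped data. A minimal sketch of that computation is given here, assuming each class is represented by its midpoint (lower limit plus upper limit, divided by 2); grouped_mean is a hypothetical name, and the data are the ship speeds from question 7.

```python
def grouped_mean(class_limits, freqs):
    """Mean of grouped data, treating every observation as its class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(freqs)
    return sum(m * f for m, f in zip(midpoints, freqs)) / n

speed_classes = [(5, 9), (10, 14), (15, 19), (20, 24)]  # knots
ship_counts = [6, 10, 7, 2]
mean_speed = grouped_mean(speed_classes, ship_counts)
print(mean_speed)
```

The same function applies to the score table in question 3 by passing its class limits and frequencies.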
[11] For 50 male children born in a certain hospital, the mean and standard deviation of their
weights are 3.5 kg and 5.2 kg respectively. For 40 female children born in the same hospital,
these are 3.2 kg and 4.8 kg respectively. Find the mean and standard deviation of the
combined group of 90 children.
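Combining two groups, as in exercise [11], needs only each group's size, mean and standard deviation. The sketch below assumes sample standard deviations (divisor n - 1): the total sum of squares is the within-group part plus the part due to each group mean differing from the combined mean. The name combine_groups is hypothetical.

```python
from math import sqrt

def combine_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and sample SD of two groups from their summaries."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group sums of squares (assumption: SDs use divisor n - 1).
    within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    # Between-group part: shift of each group mean from the combined mean.
    between = n1 * (mean1 - mean) ** 2 + n2 * (mean2 - mean) ** 2
    sd = sqrt((within + between) / (n - 1))
    return mean, sd

combined_mean, combined_sd = combine_groups(50, 3.5, 5.2, 40, 3.2, 4.8)
print(combined_mean, combined_sd)
```

If the given standard deviations use divisor n instead, the within-group terms become n1 * sd1 ** 2 and n2 * sd2 ** 2, and the final divisor becomes n.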
[12] The blood levels (mg/dl) of 50 subjects are given in the following table.
4.9 14.4 3.9 2.5 7.6 0.5 3.9 7.7 8.0 6.5
5.5 3.9 1.0 5.0 4.2 4.1 6.9 2.9 11.5 2.8
7.6 0.7 8.9 2.2 4.0 1.5 10.2 1.1 10.6 2.0
2.3 9.8 6.3 6.1 5.4 6.7 3.2 1.6 6.1 9.0
9.5 4.3 4.8 5.7 4.8 2.1 2.7 3.5 8.2 4.4
(a) Construct a frequency distribution with class intervals 0.5 - 2.4, 2.5 - 4.4, ... and
plot the histogram.
(b) Draw an estimate of the distribution curve and discuss the skewness of the distribution.
(c) Compute the sample mean, median, standard deviation and coefficient of variation.
(d) Draw the box plot. Do you reach the same conclusion as in part (b)?
[13] In a study of physical endurance levels of male college freshmen, the following composite
endurance scores based on several exercise routines were collected.
254 281 192 260 212 179 225 179 181 149
182 210 235 239 258 166 159 223 186 190
180 188 135 233 220 204 219 211 245 151
198 190 151 157 204 238 205 229 191 200
222 187 134 193 264 288 214 227 190 212
165 194 206 193 218 198 241 149 164 225
265 222 264 249 175 205 252 210 178 159
220 201 203 172 234 198 173 187 189 237
272 195 227 230 168 232 217 249 196 223
232 191 175 236 152 258 155 215 197 210
214 278 252 283 205 184 172 228 193 130
218 213 172 159 203 212 132 197 206 198
a- From these data construct a frequency distribution and a histogram.
b- For these data compute the following descriptive measures: mean, median, standard deviation,
coefficient of variation, range, first and third quartiles.
c- Describe these data relative to symmetry and skewness.
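The descriptive measures in part (b) of exercise [13] can be checked with Python's standard statistics module. This is only a sketch: it uses the first row of the endurance scores to stay short, and it assumes the sample standard deviation. Quartile conventions differ between textbooks, so the values from statistics.quantiles (default "exclusive" method) may not match a hand computation that follows another rule.

```python
import statistics

# First row of the endurance scores; replace with all 120 values for the exercise.
scores = [254, 281, 192, 260, 212, 179, 225, 179, 181, 149]

mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)              # sample SD, divisor n - 1
cv = 100 * sd / mean                       # coefficient of variation, in percent
data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # three quartile cut points

print(mean, median, sd, cv, data_range, q1, q3)
```

Comparing the mean with the median, and the distances from the median to each quartile, gives a first numerical indication of skewness for part (c).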
