
CH 1

Uploaded by

zeyadyasser027
Copyright
© © All Rights Reserved

Chapter 1

ORGANIZING AND SUMMARIZING DATA

1.1 Introduction
For us to have an understanding of what the subject of statistics is all about, we need
to introduce some terminology. First, we will explain what we mean by the subject of
statistics.

Definition 1.1
Statistics is a collection of methods for planning experiments, obtaining numerical
facts, which we call data, and then organizing, summarizing, presenting, analyzing,
interpreting, and drawing conclusions based on the data.
The subject of statistics is divided into two broad areas which are descriptive
statistics and inferential statistics. These classifications are shown in Fig. 1.1.

Statistics
▪ Descriptive Statistics. Includes:
  • Collecting
  • Organizing
  • Summarizing
  • Presenting
▪ Inferential Statistics. Includes:
  • Making inferences
  • Hypothesis testing
  • Determining relationships
  • Making predictions

Fig. 1.1: Breakdown of the subject of statistics


In statistics we commonly use the terms population and sample. Because these terms are
central to our study, we define them now.
Definition 1.2
▪ A population is the complete collection of all elements (scores, people, measurements,
and so on) to be studied. The collection is complete in the sense that it includes all
subjects to be studied.

▪ A census is the collection of data from every element in a population.
▪ A sample is a sub-collection of elements drawn from a population (a subset of the
population).

1.2 The Nature of Data


Data are observations (such as measurements, genders, and survey responses) that
have been collected. Data are sometimes used to find statistics. An isolated list of lifeless
numbers might appear to be a data set awaiting some statistical manipulation, but the
effective use of statistics requires that we know the context of the data, how the data were
obtained, and the population from which the data were obtained. Now, we will introduce
some key terms.

Definition 1.3
▪ A parameter is a numerical measurement describing some characteristic of a
population.
▪ A statistic is a numerical measurement describing some characteristic of a sample.

Population: described by parameters
Sample: described by statistics

Some data sets consist of numbers (such as weights) and others are non-numerical (such as
eye colors). The terms qualitative data and quantitative data are often used to distinguish
between these types.

Definition 1.4
▪ Quantitative data consist of numbers representing counts or measurements.
▪ Qualitative (or categorical or attribute) data can be separated into distinct
categories that are distinguished by some nonnumeric characteristic.
For example, the incomes of college graduates are quantitative data, while the genders
(male/female) of college graduates are qualitative data.
Quantitative data can be further described by distinguishing between discrete and
continuous types.
▪ Discrete data can take only specific values (the number of possible values is
either finite or countable), such as 0, 1, 2, 3, . . .
▪ Continuous (numerical) data result from infinitely many possible values that
correspond to some continuous scale covering a range of values without gaps,
interruptions, or jumps. For example, the amount of milk that a cow produces could be
2.3415 gallons a day.

1.3 Computers and Statistical Calculations


The relatively recent widespread use of computers has had a
tremendous impact on scientific research in general and statistical analysis in
particular. Some widely used "packages" of statistical procedures are:
(1) Minitab
(2) SPSS (Statistical Package for the Social Sciences)
(3) R
(4) SAS
(5) Statgraph

Statistical programs differ with respect to their input requirements, their
output formats, and the specific calculations they will perform.

1.4 Grouped Data (The Frequency Distribution)


A frequency table provides one of the most convenient ways to summarize or
display grouped data. Before we construct such a table, let us consider the following
numerical data. Table 1.1 lists 80 values of body mass index data from the 1998
National Health Interview Survey.
TABLE 1.1. Body Mass Index for a Sample of 80 Adults
27.4 23.5 21.9 28.6 20.3 22.4 20.8 25.0
31.0 30.9 30.2 27.3 24.3 35.9 26.5 35.8
34.2 27.4 24.7 22.7 26.5 30.0 28.2 25.4
28.9 25.9 36.6 22.7 30.1 26.2 18.3 27.3
25.7 22.3 25.4 27.3 30.4 27.4 30.8 23.0
37.1 26.3 21.3 23.1 24.5 24.1 27.6 25.7
24.8 37.8 22.9 22.3 22.8 19.8 21.5 26.3
34.9 28.8 24.2 32.6 24.3 26.9 33.6 35.5
27.5 31.8 27.1 29.5 30.9 23.3 24.8 29.8
25.9 23.4 23.1 38.8 28.7 28.4 28.3 26.4

The body mass index (BMI) is defined as weight (in kilograms) divided by the square of
height (in meters). According to established standards, a BMI from 19 to less than 25 is
considered healthy; a BMI from 25 to less than 30 is regarded as overweight; a BMI
greater than or equal to 30 is defined as obese. Table 1.1 lists the numbers in the
order in which they were collected.
In constructing a frequency table for grouped data, we first determine a set of class
intervals that cover the range of the data (i.e., include all the observed values). The class
intervals are usually arranged from lowest numbers at the top of the table to highest
numbers at the bottom of the table and are defined so as not to overlap. We then tally
the number of observations that fall in each interval and present that number as a
frequency, called a class frequency. Some frequency tables include a column that
represents the frequency as a percentage of the total number of observations; this
column is called the relative frequency percentage. The completed frequency table
provides a frequency distribution.
Although not required, a good first step in constructing a frequency table is to rearrange
the data table, placing the smallest number in the first row of the leftmost column and
then continuing to arrange the numbers in increasing order going down the first column.
(We can accomplish this procedure by sorting the data in ascending order.) After the
first column is completed, the procedure continues from the top of the second column,
and so on, until the largest observation appears in the rightmost column of the bottom
row. We call the arranged table an ordered array. It is much easier to tally the
observations for a frequency table from such an ordered array of data than it is from the
original data table.
Table 1.2 provides a rearrangement of the body mass index data as an ordered array. In
Table 1.2, by inspection we find that the lowest and highest values are 18.3 and 38.8,
respectively.
TABLE 1.2. Body Mass Index Data for a Sample of 80 Adults:
Ordered Array (Sorted in Ascending Order)
18.3 22.7 24.1 25.4 26.5 27.6 30 33.6
19.8 22.7 24.2 25.7 26.9 28.2 30.1 34.2
20.3 22.8 24.3 25.7 27.1 28.3 30.2 34.9
20.8 22.9 24.3 25.9 27.3 28.4 30.4 35.5
21.3 23 24.5 25.9 27.3 28.6 30.8 35.8
21.5 23.1 24.7 26.2 27.3 28.7 30.9 35.9
21.9 23.1 24.8 26.3 27.4 28.8 30.9 36.6
22.3 23.3 24.8 26.3 27.4 28.9 31 37.1
22.3 23.4 25 26.4 27.4 29.5 31.8 37.8
22.4 23.5 25.4 26.5 27.5 29.8 32.6 38.8

We will use these numbers to help us create equally spaced intervals for tabulating
frequencies of data. Although the number of intervals that one may choose for a
frequency distribution is arbitrary, the actual number should depend on the range of the
data (R = range, is the difference between the smallest and the largest observation in the
data set) and the number of cases (number of observations N). For a data set of 50 to
150 observations, the number chosen usually ranges from about five to ten. In the
present example, the range of the data is 38.8 – 18.3 = 20.5.
Suppose we divide the data set into seven intervals. Then, we have 20.5 ÷ 7 = 2.93,
which rounds to 3.0. Consequently, the intervals will have a width of 3. These seven
intervals are as follows:
1. 18.0 – 20.9
2. 21.0 – 23.9
3. 24.0 – 26.9
4. 27.0 – 29.9
5. 30.0 – 32.9
6. 33.0 – 35.9
7. 36.0 – 38.9
We have:
The total number of observations = N = 80
The number of classes = K = 7
The size of class (width) = C = 3
Relative frequency = Frequency / N
TABLE 1.1. Body Mass Index (BMI)
Class Interval for   Frequency   Relative    Relative
BMI Levels           (f)         Frequency   Frequency (%)
18.0–20.9             4          0.050         5.0
21.0–23.9            16          0.200        20.0
24.0–26.9            22          0.275        27.5
27.0–29.9            18          0.225        22.5
30.0–32.9            10          0.125        12.5
33.0–35.9             6          0.075         7.5
36.0–38.9             4          0.050         5.0
Total                80          1.000       100.0
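
The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration, not part of the original worked example: the helper name frequency_table and its parameters are hypothetical, and the classes are treated as half-open intervals starting at 18 with width 3, which gives the same counts as the class limits 18.0–20.9, 21.0–23.9, and so on for data recorded to one decimal place.

```python
# BMI values from the raw data table, in the order collected.
bmi = [
    27.4, 23.5, 21.9, 28.6, 20.3, 22.4, 20.8, 25.0,
    31.0, 30.9, 30.2, 27.3, 24.3, 35.9, 26.5, 35.8,
    34.2, 27.4, 24.7, 22.7, 26.5, 30.0, 28.2, 25.4,
    28.9, 25.9, 36.6, 22.7, 30.1, 26.2, 18.3, 27.3,
    25.7, 22.3, 25.4, 27.3, 30.4, 27.4, 30.8, 23.0,
    37.1, 26.3, 21.3, 23.1, 24.5, 24.1, 27.6, 25.7,
    24.8, 37.8, 22.9, 22.3, 22.8, 19.8, 21.5, 26.3,
    34.9, 28.8, 24.2, 32.6, 24.3, 26.9, 33.6, 35.5,
    27.5, 31.8, 27.1, 29.5, 30.9, 23.3, 24.8, 29.8,
    25.9, 23.4, 23.1, 38.8, 28.7, 28.4, 28.3, 26.4,
]


def frequency_table(values, start, width, k):
    """Count the values in k classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        i = int((v - start) // width)  # index of the class containing v
        if not 0 <= i < k:
            raise ValueError(f"{v} lies outside the class intervals")
        counts[i] += 1
    return counts


data_range = max(bmi) - min(bmi)   # R = 38.8 - 18.3
width = round(data_range / 7)      # 20.5 / 7 = 2.93, which rounds to 3
counts = frequency_table(bmi, start=18.0, width=width, k=7)

n = len(bmi)
for i, f in enumerate(counts):
    lo = 18.0 + i * width
    print(f"{lo:.1f}-{lo + width - 0.1:.1f}  f={f:2d}  "
          f"rel={f / n:.3f}  pct={100 * f / n:.1f}")
```

The printed rows reproduce the frequency, relative frequency and percentage columns of the table.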

1.4.1 Frequency Histograms


We may display a frequency distribution (or a relative frequency distribution)
graphically in the form of a histogram as shown in Fig 1.1.
In constructing a histogram, the values of the variable under consideration make
up the horizontal axis, while the vertical axis has as its scale the frequency (or relative
frequency if desired) of occurrence. Above each class interval on the horizontal axis a
rectangular bar, or cell, as it is sometimes called, is erected so that its height corresponds to
the respective frequency. The cells of a histogram must be joined and, to accomplish this,
we must take into account the cut points (18, 21, 24, …) as shown in Table 1.2.
Table 1.2. Class Boundaries of the Data of Table 1.1
Cut points of    Class       Frequency
BMI Levels       Mid-point   (f)
[18 – 21)        19.5         4
[21 – 24)        22.5        16
[24 – 27)        25.5        22
[27 – 30)        28.5        18
[30 – 33)        31.5        10
[33 – 36)        34.5         6
[36 – 39)        37.5         4

If we draw a graph using these cut points as the bases of our rectangles, no gaps will
result, and we will have the histogram shown in Fig 1.1.
Figure 1.1 Histogram of the BMI Levels of 80 Adults
[Histogram of BMI: horizontal axis BMI with cut points 18, 21, 24, 27, 30, 33, 36, 39; vertical axis Frequency.]

1.4.2 Frequency Polygons


To draw a frequency polygon, we first place a dot above the midpoint of each
class interval represented on the horizontal axis of graph like the one shown in Fig 1.1.
The height above the horizontal axis of a given dot corresponds to the frequency of the
relevant class interval. Connecting the dots by straight lines produces the frequency
polygon. Fig 1.2 is the frequency polygon for the data in Table 1.1.
Note that the polygon is brought down to the horizontal axis at the ends at points
that would be the midpoints if there were an additional cell at each end of the
corresponding histogram. This allows for the total area to be enclosed. The total area under
the frequency polygon is equal to the area under the histogram.
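
The equality of the two areas can be checked numerically. The sketch below is an illustration only: it takes the class midpoints and frequencies of the BMI frequency table, adds the two zero-frequency end points (one class width beyond the first and last midpoints), and compares the histogram area with the polygon area computed by the trapezoid rule. The variable names are hypothetical.

```python
midpoints = [19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5]
freqs = [4, 16, 22, 18, 10, 6, 4]
width = 3.0

# Close the polygon with zero-frequency points one class width beyond each end.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + freqs + [0]

# Histogram area: each bar contributes width * frequency.
hist_area = sum(width * f for f in freqs)

# Polygon area: sum of the trapezoids between consecutive vertices.
poly_area = sum(
    (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
    for i in range(len(xs) - 1)
)

print(xs[0], xs[-1])         # end points of the polygon on the horizontal axis
print(hist_area, poly_area)  # the two areas
```

Both areas equal the class width times the total frequency, which is why the equality holds whenever all classes have the same width.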

Figure 1.2 Frequency Polygon of the BMI Levels of 80 Adults
[Plot of Frequency against Mid-point: horizontal axis Mid-point from 16.5 to 40.5; vertical axis Frequency.]

1.5 Statistical Measures of Data


1.5.1 Measures of Central Tendency
Measures of central tendency are numbers that tell us where the majority of
values in the distribution (or sample) are located. This section will cover the following
measures of central tendency: arithmetic mean, median, and mode. These measures also
are called measures of location. In contrast to measures of central tendency, measures
of dispersion inform us about the spread of values in a distribution. Section 1.5.2 will
present measures of dispersion.

1.5.1.1 The Arithmetic Mean


The arithmetic mean (ungrouped data), or simply the mean or average, is the sum
of the individual values in a data set divided by the number of values in the data set. We
can compute a mean of both a finite population and a sample. For the mean of a finite
population (denoted by μ), we sum the individual observations in the entire population
and divide by the population size. When data are based on a sample, to calculate the
sample mean (denoted by $\bar{x}$) we sum the individual observations in the sample and
divide by the number of elements in the sample, n. The sample mean is the sample
analog to the mean of a finite population.
Let $x_1, x_2, \ldots, x_n$ be a sample of observations; then its mean is given by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.1)$$
For example, the mean of a sample consisting of the data 3, 8, 6, 14, 0, -4, 0, 12, -7, 0
and -10 is given by
$$\bar{x} = \frac{1}{11}\bigl(3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)\bigr) = 2$$

The Mean Computed from Grouped Data:


For large data sets (e.g., more than about 20 observations when performing
calculations by hand), summing the individual numbers may be impractical, so we use
grouped data. First, the data need to be placed in a frequency table, as illustrated in
Section 1.4. We then apply Formula 1.2, in which the midpoint of each class
interval is multiplied by the frequency of observations in that class. So if we denote the
class midpoints by $x_1, x_2, \ldots, x_k$ (where k is the number of classes) and the corresponding
frequencies by $f_1, f_2, \ldots, f_k$, then the mean is
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{1}{N}\sum_{i=1}^{k} x_i f_i \qquad (1.2)$$
where $N = \sum_{i=1}^{k} f_i$ is the sum of all frequencies, i.e. the total number of observations in the data.

Example 1.1
For each boy in a sample of 50 boys, the time, correct to the nearest second, for which
he could hold his breath was recorded. The results are presented in the following table:

Value 18 – 28 29 – 39 40 – 50 51 – 61
Number of Children 5 10 25 10
Calculate the mean.
Solution
When we compute the mean from grouped data, it is convenient to prepare a work
table such as Table 1.3. We may now compute the mean:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{k} x_i f_i = \frac{2140}{50} = 42.8$$

Table 1.3 Work Table for Computing the Mean from the Grouped Data of Example 1.1
Class Interval   Midpoint $x_i$   Frequency $f_i$   $x_i f_i$
18 - 28          23                5                 115
29 - 39          34               10                 340
40 - 50          45               25                1125
51 - 61          56               10                 560
Total                             50                2140
(This mean can also be obtained more easily in MINITAB or with the table mode of a calculator.)
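
As an alternative to a statistical package, the work table of Example 1.1 can be reproduced with a short Python sketch. The variable names are hypothetical; the computation is formula (1.2) applied to the class midpoints and frequencies of the example.

```python
# Class limits and frequencies from Example 1.1.
classes = [(18, 28), (29, 39), (40, 50), (51, 61)]
freqs = [5, 10, 25, 10]

# Midpoint of each class interval: (lower limit + upper limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

n_total = sum(freqs)                                   # N
products = [x * f for x, f in zip(midpoints, freqs)]   # x_i * f_i column
grouped_mean = sum(products) / n_total                 # formula (1.2)

print(midpoints, products, grouped_mean)
```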

1.5.1.2 Median
The median of a data set is the value of the observation that divides the ordered
dataset in half. Essentially, the median is the observation whose value defines the
midpoint of a distribution; i.e., half of the data fall above the median and half below.
Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the median is defined as follows:
The sample median is
(1) the $\left(\frac{n+1}{2}\right)$th observation if n is odd;
(2) the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations if n is even.
Example 1.2
Each of 10 children in the second grade was given a reading aptitude test. The
scores were as follows: 95 86 78 90 62 73 89 92 84 76. Determine the median.
Solution
First we must arrange the scores in order of magnitude.
62 73 76 78 84 86 89 90 92 95
Because there is an even number of measurements, the median is the average of the two
middle scores:
$$\text{Median} = \frac{84 + 86}{2} = 85$$
The median tells us that half of the observations are less than 85 and half of the
observations are greater than 85.

Example 1.3
Consider the following data set, 7 35 5 9 8 3 10 12 8, which consists of
white-blood counts (in thousands) taken on admission of all patients entering one of the
Alexandria University hospitals on a given day. Compute the median white-blood count.

Solution
First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35. Because n = 9 is odd, the
sample median is the fifth observation in the ordered sample, which equals 8, or 8000 on the
original scale.
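
The two-case definition of the sample median translates directly into code. The function below is a minimal sketch (the name sample_median is hypothetical) that is checked against the data of Examples 1.2 and 1.3; Python's standard library offers statistics.median for the same purpose.

```python
def sample_median(values):
    """Median by the odd/even rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th observation, counting from 1
        return ordered[mid]
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


reading_scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]   # Example 1.2
white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]         # Example 1.3

print(sample_median(reading_scores))
print(sample_median(white_blood_counts))
```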

1.5.1.3 Mode
The mode is defined as the observation in the sample which occurs most
frequently if there is such an observation.
▪ If each observation occurs the same number of times, then there is no mode.
▪ If two or more observations occur the same number of times (and more frequently
than any of the other observations), then there is more than one mode, and the
sample is said to be multimodal.
▪ If there is only one mode the sample is said to be unimodal.
For example, if the sample is 24, 29, 26, 31, 29, 34, 25, 29 then the mode is 29.
For the sample: 46, 47, 47, 43, 48, 45, 43, 49, then there are two modes namely 43 and
47 (bimodal).
For the sample: 14, 16, 21, 19, 18, 24, 17 then there is no mode.
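
The three cases above can be sketched as a small function that returns a list of modes: one value for a unimodal sample, several for a multimodal sample, and an empty list when every value occurs equally often. The function name sample_modes is hypothetical, and treating a sample made of a single repeated value as having that value for its mode is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter


def sample_modes(values):
    """Return the sorted list of modes of a non-empty sample ([] if there is none)."""
    counts = Counter(values)
    top = max(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(sample_modes([24, 29, 26, 31, 29, 34, 25, 29]))
print(sample_modes([46, 47, 47, 43, 48, 45, 43, 49]))
print(sample_modes([14, 16, 21, 19, 18, 24, 17]))
```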

1.5.1.4 Empirical Relation between Mean, Median and Mode


For unimodal frequency curves which are moderately skewed (asymmetrical), we
have the empirical relation

Mean - Mode = 3 ( Mean - Median ) (1.3)
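
Solving relation (1.3) for the mode gives a form that is convenient when the mode is hard to read off the data. The rearrangement below follows directly from (1.3) and, like (1.3) itself, is only an approximation for moderately skewed unimodal distributions.

```latex
\begin{aligned}
\text{Mean} - \text{Mode} &= 3\,(\text{Mean} - \text{Median}) \\
\text{Mode} &= \text{Mean} - 3\,\text{Mean} + 3\,\text{Median} \\
            &= 3\,\text{Median} - 2\,\text{Mean}
\end{aligned}
```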


1.5.1.5 Quartiles, Deciles, and Percentiles
If a set of data is arranged in order of magnitude, the middle value (or arithmetic
mean of the two middle values) which divides the set into two equal parts is the median.
By extending this idea we can think of the values which divide the set into four equal
parts. These values, denoted by Q1, Q2 and Q3, are called the first, second and third
quartiles respectively, the value Q2 being equal to the median.
Similarly the values which divide the data into ten equal parts are called deciles
and are denoted by D1 , D2 ,..., D9 , while the values dividing the data into one hundred
equal parts are called percentiles and are denoted by P1 , P2 , ..., P99 . The 5th decile and
50th percentile correspond to the median. The 25th and 75th percentiles correspond to the
first and third quartiles respectively.

To calculate the quartiles, first locate the median in the ordered list of
observations. The first quartile Q1 is the median of all values less than the median of the
whole set of data, and the third quartile Q3 is the median of all values greater than the
median of the whole set of data.
Example 1.4
The following are the scores of twelve students on a statistics test:
82, 90, 75, 77, 94, 72, 93, 74, 78, 86, 85, 79
Find the median and the two quartiles.
Solution
Arranging the data (n=12 ) according to size, we get
72 74 75 77 78 79 82 85 86 90 93 94
For n=12, the median position is (12+1)/2=6.5, so that the median is (79+82)/2 = 80.5. The
first quartile Q1 is the median of the first 6 values, the third quartile Q3 is the median of the
6 values above the median. Thus, Q1 = (75+77)/2=76 and Q3 = (86+90)/2=88.
However, if n is odd, it is more accurate to modify the definition by replacing
"less than" by "less than or equal " and "greater than" by "greater than or equal ". As an
example, the median of the following observations 66 73 74 79 82 86 88
90 94 is 82, while Q1 = 74 and Q3 = 88.
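
The quartile rule used in this section can be sketched as follows. The function names are hypothetical, and the split is done by position in the ordered list: for even n the lower and upper halves each hold n/2 values, and for odd n the median is included in both halves, as in the modified definition above. Statistical packages often use other interpolation rules, so their quartiles may differ slightly from these.

```python
def median_of(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule of this section."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0:
        lower, upper = ordered[:half], ordered[half:]
    else:
        # n odd: the median itself belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    return median_of(lower), median_of(ordered), median_of(upper)


print(quartiles([82, 90, 75, 77, 94, 72, 93, 74, 78, 86, 85, 79]))  # Example 1.4
print(quartiles([66, 73, 74, 79, 82, 86, 88, 90, 94]))              # odd-n example
```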

1.5.2 Measures of Dispersion


Knowing the average of a distribution in no way tells us whether or not the figures
in the distribution are clustered closely together or well spread out.
For example, there could be two groups, each of five men. The heights of the men
in the two groups are respectively:
First group : 162 , 164 , 165 , 166 , 168 cm.
Second group : 150 , 160 , 165 , 170 , 180 cm.
Both groups have a mean height of 165 cm, but the dispersion in heights is much greater in
the second group than in the first.
It would clearly be useful to find some way of measuring this dispersion and
expressing it as a single figure. Such measures are called measures of dispersion (or
variation) and the most important of these are the following:
1.5.2.1 Range
The range of a set of observations is the difference between the largest and
smallest numbers in the set. For example, the ranges of the above two groups are,
respectively,
168 - 162 = 6 cm and 180 - 150 = 30 cm.

The range is a poor measure of variation, particularly if the size of the sample is
large. The objection to the range is that it does not make use of all the observations in the
sample but uses only the two extreme values.
Whereas the range covers all the values in a sample, a similar measure of variation
covers (more or less) the middle 50%. It is the interquartile range Q3 - Q1.

1.5.2.2 Variance and Standard Deviation.


The standard deviation (or its square, the variance) is by far the most important of the
measures of dispersion. It is based upon squared deviations from the mean of a set of
values.
The variance of a sample of n observations $x_1, x_2, \ldots, x_n$ (denoted by $S^2$) is defined as
the mean square deviation and is given by
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (1.4)$$
The sum of squares is divided by (n-1) rather than n for theoretical reasons, but if n > 35
there is, practically, no difference in the definitions.
Example 1.5
As an illustration for computing the variance, consider the sample 5, 7, 8, 12 and
18. The mean of this sample is given by
$$\bar{x} = \frac{1}{5}(5 + 7 + 8 + 12 + 18) = 10$$
Hence
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{4}\left[(5-10)^2 + (7-10)^2 + (8-10)^2 + (12-10)^2 + (18-10)^2\right] = \frac{106}{4} = 26.5$$

If the number of observations is large, the computation necessary to find $S^2$ from
formula (1.4) is rather laborious, especially if the mean is not an integer. There is
another formula for computing $S^2$ which is equivalent to (1.4):
$$S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2\right) \qquad (1.5)$$
The standard deviation (denoted by S.D.) is defined as the square root of the variance, i.e.
$$\text{S.D.} = S = \sqrt{\text{variance}}$$
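
The equivalence of formulas (1.4) and (1.5) can be checked on the sample of Example 1.5. The sketch below uses hypothetical function names; both functions divide by n - 1, as in the definitions above.

```python
import math


def sample_variance(values):
    """Formula (1.4): sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Formula (1.5): (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


sample = [5, 7, 8, 12, 18]                 # Example 1.5
s2 = sample_variance(sample)
s2_shortcut = sample_variance_shortcut(sample)
sd = math.sqrt(s2)                         # standard deviation S

print(s2, s2_shortcut, round(sd, 3))
```

For data with a large mean and a small spread, formula (1.5) subtracts two nearly equal numbers and can lose precision in floating-point arithmetic, so formula (1.4) is usually the safer choice on a computer.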

Now, for the grouped data in a frequency distribution, as in the case of the mean, a
similar formula applies. If $x_1, x_2, \ldots, x_k$ occur with frequencies $f_1, f_2, \ldots, f_k$
respectively, the variance can be written as
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{k} (x_i - \bar{x})^2 f_i = \frac{1}{N-1}\left(\sum_{i=1}^{k} x_i^2 f_i - N\,\bar{x}^2\right) \qquad (1.6)$$
where $N = \sum_{i=1}^{k} f_i$.
Let us now illustrate the computation of the variance and standard deviation by
the computational formula (1.6), using the data of Table 1.3, for which
$\sum x_i^2 f_i = 23^2(5) + 34^2(10) + 45^2(25) + 56^2(10) = 96190$, N = 50 and $\bar{x} = 42.8$.
The variance is then given by
$$S^2 = \frac{1}{50-1}\left(96190 - 50\,(42.8)^2\right) = \frac{4598}{49} \approx 93.84$$
The standard deviation is
$$S = \sqrt{93.84} \approx 9.69$$
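
As a check on the arithmetic, formula (1.6) can be applied in Python to the class midpoints and frequencies of Table 1.3 (the breath-holding data of Example 1.1). This is a sketch with hypothetical variable names, not output from any statistical package.

```python
import math

# Class midpoints and frequencies from Table 1.3.
midpoints = [23, 34, 45, 56]
freqs = [5, 10, 25, 10]

n_total = sum(freqs)                                          # N
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n_total
sum_sq = sum(x * x * f for x, f in zip(midpoints, freqs))     # sum of x_i^2 * f_i

# Computational form of formula (1.6).
variance = (sum_sq - n_total * mean ** 2) / (n_total - 1)
std_dev = math.sqrt(variance)

print(sum_sq, round(variance, 2), round(std_dev, 2))
```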

1.5.2.3 Coefficient of Variation


It sometimes happens that we need to compare the variability of two or more sets
of figures. For example, a variation or dispersion of 10 inches in measuring a distance of
100 feet is quite different in effect from the same variation of 10 inches in a distance of
20 feet. Measures of dispersion are clearly not comparable when, first, they are in different
units and, second, they relate to sets of figures of quite different orders of size.
However, we could obtain some idea of the degree of dispersion if we could find
the size of a variation as compared with the average of the figures it was derived from, i.e.
calculate the standard deviation as a percentage of the mean. This measure is called the
coefficient of variation (or coefficient of variability), abbreviated C.V., and is defined
as
$$CV = \frac{S}{\bar{x}} \cdot 100\%$$
The coefficient of variation expresses sample variability relative to the mean of the sample
(and is on rare occasion referred to as the "relative standard deviation"). Since S and $\bar{x}$
have identical units, CV has no units at all, a fact emphasizing that it is a relative measure,
divorced from the actual magnitude or units of measurement of the data. A disadvantage
of the coefficient of variation is that it fails to be useful when $\bar{x}$ is close to zero.
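
To illustrate, the coefficient of variation can be computed for the two groups of heights introduced at the start of Section 1.5.2. The sketch below relies on Python's statistics module, whose stdev function divides by n - 1 as in formula (1.4); the function name coefficient_of_variation is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (S / mean) * 100, with S the sample standard deviation (n - 1 divisor)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


group_1 = [162, 164, 165, 166, 168]   # heights in cm, first group
group_2 = [150, 160, 165, 170, 180]   # heights in cm, second group

cv_1 = coefficient_of_variation(group_1)
cv_2 = coefficient_of_variation(group_2)

print(round(cv_1, 2), round(cv_2, 2))
```

Both groups have the same mean, so the larger CV of the second group simply reflects its larger standard deviation.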

1.5.3 Measures of Symmetry
1.5.3.1 Skewness
Skewness is the degree of asymmetry, or departure from symmetry. If the
frequency curve (smoothed frequency polygon) of a distribution has a longer "tail" to the
right of the central maximum than to the left, the distribution is said to be skewed to the
right or to have positive skewness. If the reverse is true it is said to be skewed to the left or
to have negative skewness (see Fig. 1.4). This leads to the definition
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{\bar{x} - \text{Mode}}{S}$$
To avoid use of the mode, we can employ the empirical formula (1.3):
$$\text{Skewness} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3\,(\bar{x} - \text{Median})}{S}$$
The above two measures are called, respectively, Pearson's first and second coefficients
of skewness.
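
Pearson's second coefficient is easy to compute because it needs only the mean, the median and the standard deviation. The sketch below applies it to the white-blood counts of Example 1.3, where the single large value 35 pulls the mean above the median; the function name is hypothetical, and the sample standard deviation (n - 1 divisor) is assumed for S.

```python
import statistics


def pearson_second_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / S."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]   # Example 1.3
skew = pearson_second_skewness(white_blood_counts)

print(round(skew, 3))   # positive: the sample is skewed to the right
```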

Figure 1.4
+ve skewness (skewed to right): mode < median < mean
-ve skewness (skewed to left): mean < median < mode
0 skewness (symmetric): mean = median = mode

1.5.3.2 The Boxplot


In descriptive statistics, a boxplot is a convenient way of graphically depicting
groups of numerical data through their quartiles. Boxplots may also have lines extending
vertically from the boxes (whiskers) indicating variability outside the upper and lower
quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.
Outliers may be plotted as individual points.
Boxplots are useful for revealing the center of the data, the spread of the data, the
distribution of the data, and the presence of outliers. To construct a boxplot we first
obtain the minimum value, the maximum value, and the quartiles.
Boxplots display differences between populations without making any
assumptions about the underlying statistical distribution: they are non-parametric. The
spacings between the different parts of the box help indicate the degree of
dispersion and skewness in the data, and identify outliers. Boxplots can be drawn
either horizontally or vertically.

Figure 1.5 Box plot of the BMI data.
[Boxplot of BMI: vertical axis BMI, scale from 20 to 40.]

Definition
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a
line extending from the minimum value to the maximum value, and a box with lines at the
first quartile, Q1; the median; and the third quartile, Q3.
Medians and quartiles are not very sensitive to extreme values (outliers). So boxplots,
which use medians and quartiles, also have the advantage of not being as sensitive to
extreme values as other devices based on the mean and standard deviation.
As an example, let us produce the box plot of the 14 data points that were used as an
illustration: 1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1
The resulting box plot is presented in Figure 1.5. Observe that the end points of the
whiskers are 1 for the minimal value, and 11.5 for the largest value. The end values of
the box are 9 for the third quartile and 2 for the first quartile. The median 7 is marked
inside the box.

[Boxplot of C1: Min = 1, Q1 = 2, Median = 7, Q3 = 9, Max = 11.5.]
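
The five values needed to draw this box plot can be computed with the median-of-halves rule of Section 1.5.1.5. The sketch below uses hypothetical function names and only produces the numbers; drawing the plot itself is left to a statistical package.

```python
def median_of(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # For odd n the median is included in both halves.
    lower = ordered[:half] if n % 2 == 0 else ordered[:half + 1]
    upper = ordered[half:]
    return (ordered[0], median_of(lower), median_of(ordered),
            median_of(upper), ordered[-1])


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
print(five_number_summary(data))
```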

EXERCISES
[1] Fill-in-the-Blank
i. The measure of central location that uses all of the observations in its calculation is
called ……….…
ii. The value that occurs most often in a set of data is called the ………..
iii. The ………. isn't always found.
iv. The ………… is an absolute measure of dispersion.
v. Range is ………. between minimum and maximum value of the data.
vi. The ………. is affected by the extreme values.
vii. The weekly sales from a sample of ten computer stores yielded a mean of $25,900; a
median $25,000 and a mode of $24,500. Then the shape of the distribution is …………
[2] The mean number of car sales over a 3-month period is 87, and the standard
deviation is 15. The mean of the commissions is $5225, and the standard deviation is
$773. Compare the variations of the two.

[3] Consider the following data, which specifies the "life" of 40 similar car batteries recorded
to the nearest tenth of a year. The batteries are guaranteed to last 3 years.
5.4 3.4 2.5 3.3 4.7 4.1 2.7 4.3 3.9 4.9
3.1 3.8 3.5 3.1 3.4 3.7 3.2 4.5 4.2 2.6
3.3 3.6 4.4 2.6 3.2 3.8 2.9 3.2 3.7 3.1
3.9 3.7 3.1 3.3 4.1 3.0 3.0 4.9 3.4 3.5

(a) Set up a relative frequency distribution in 6 classes.


(b) Construct a relative frequency histogram, draw the relative frequency polygon and
discuss the skewness of the distribution.
(c) Compute the sample mean, sample median, and sample standard deviation.
(d) Construct the box plot, do you have the same conclusion in part (b)?

[4] The following table shows the number of hours that 45 hospital patients slept following the
administration of a certain anesthetic.
7 1 0 1 2 4 8 7 3 8 5 12 11 3 8 1 1 13 10 4 4 5 5
8 7 7 3 2 38 13 1 7 17 3 4 5 53 1 17 10 4 7 7 11 8
(a) From these data construct:
A frequency distribution and a relative frequency distribution.
A histogram and a frequency polygon.
(b) Describe these data relative to symmetry and skewness.
(c) Compute the sample mean, sample median, sample range, sample standard deviation and
coefficient of variation.
(d) Construct a box plot and discuss the usefulness in understanding the nature of the data
that this device provides.
[5] The amount of radiation received at a greenhouse plays a key role in determining
the rate of photosynthesis. The accompanying observations on incoming solar

radiation were read from a graph in an article.
7.1 7.4 7.7 8.4 8.5 8.8
9.0 9.1 10.0 10.1 10.2 10.6
10.7 9.2 10.8 10.9 11.1 11.2
11.4 11.9 11.9 12.2 12.9 8.9
(a) Set up a frequency distribution in 6 classes.
(b) Plot a frequency histogram, draw an estimate of the distribution curve and discuss the
skewness of the distribution.
(c) Calculate the mean, median, interquartile range, standard deviation, coefficient
of variation and skewness.
[6] How are the mean and median affected when it is known that, for a group of 10 students
scoring an average of 60 marks, the best paper was wrongly marked 80 instead of 75?

[7] For the frequency distribution table shown below:


Intervals Frequencies
30 – 39 6
40 – 49 9
50 – 59 25
60 – 69 7
70 - 79 3
i– The boundaries of the third class are:
a. 50-60 b. 49.5-59.5 c. 50.5-59.5 d. none of them
ii– The class midpoint for class 60 - 69 is
a. 54.5 b. 64.5 c. 49.8
iii– The mean is
a. 54.5 b. 52.9 c. 49.5
iv – The standard deviation is
a. 10.17 b. 103.51 c. 1.017
v – The coefficient of variation is
a. 19.22% b. 1.922 c. 195.5%

[8] The following table shows the age distribution of a sample of diabetic patients.
Age Frequency Relative Frequency
10 - 14 4 0.08
15 - 19 6 -
20 - 24 - 0.24
25 - 29 - 0.32
30 - 34 10 -
35 - 39 - -

a- Complete the blank cells in the table
b- Calculate the mean and CV
[9] The ages of 40 people in a class (to nearest year) are as follows:
Age 18 19 20 21 22
Number of Students: 2 8 16 10 4
Find the median, mode, mean and standard deviation.

[10] Circle the correct answer from each of the following multiple choice questions
1. Given the following frequency distribution:
xi -10 -5 0 5 10
fi 1 4 5 9 6
the mean is
a. 3.0 b. 15.0 c. 0.0 d. Otherwise
2. For the frequency distribution given in question 1, the median is
a. 0.0 b. 2.5 c. 5.0 d. Otherwise
3. The following table shows the scores on an intelligence test by a group of 50 students:

Scores 70 - 74 75 – 79 80 - 84 85 - 89 90 - 94
No of children 2 3 20 15 10
the mean is
a. 79.5 b. 84.5 c. 84.8 d. Otherwise
4. For the frequency distribution given in question 3, the median is
a. 84.5 b. 84.6 c. 84.7 d. Otherwise
5. The mean grade of 40 male students in a Math exam is 80, while the mean grade of 20
female students in the same Math exam is 65. Then the mean grade of all the 60 students is
a. 72.5 b. 75 c. 145 d. Otherwise
6. The observations 2, 4, 5, x, 7, 9, 12, y are arranged in ascending order. If the median is 6
and the mean is 7, then the values of x and y are respectively,
a. 6 & 12 b. 5 & 14 c. 5 & 12 d. Otherwise

7. Ten people are in a room. Their mean age is 40, and their median age is 35. A 50-year-old
man leaves the room, and a 70-year-old woman enters. What is the median age of the people
in the room now?
a. 35 b. 38 c. 40 d. 42 e. Otherwise.
8. Which of the following sets of four numbers has the largest possible standard deviation?
a. 4, 4, 4, 4 b. 4, 5, 6, 7 c. 4, 6, 8, 10 d. 8, 9, 9, 10.
9. The variance of 10 measurements of people's height (in inches) is computed to be 25. The

units for the variance of 25 are:
a. inches b. square root inches c. inches squared d. no units.
10. Which of the following central measurements is the most affected by extreme values?
a. Mean b. Median c. Mode d. Otherwise

11. Nine people are in a room. Their mean age is 40, and their median age is 36. A
62-year-old man leaves the room, and a 35-year-old woman enters. The mean age of the
people in the room now is
a. 36 b. 37 c. 43 d. Otherwise.
[11] For 60 male children born in a certain hospital, the mean and standard deviation of
their weights are 3.40 kg and 5.10 kg respectively. For 40 female children born in
the same hospital, the mean and standard deviation of their weights are 3.15 kg and
4.85 kg respectively. Find the mean and standard deviation of the weights of the combined
group of 100 children.
(Ans.: mean = 3.3 and S.D. ≈ 5.0 or 4.978)
[12] The following data represent the lifetimes (in hours) of a sample of 40 transistors:
112 121 126 108 141 104 136 134 110 124
121 118 143 116 108 122 127 140 132 152
113 117 126 130 134 120 131 133 135 130
118 125 151 147 137 140 132 119 136 128
(a) Set up a frequency distribution.
(b) Plot a frequency histogram, draw an estimate of the distribution curve and discuss the
skewness of the distribution.
(c) Calculate the mean, median, interquartile range, standard deviation, coefficient
of variation and skewness.
[13] An experiment measuring the percent shrinkage on drying of 50 clay specimens
produced the following data:
18.2 21.2 23.1 18.5 15.6 19.3 18.5 19.3 21.2 13.9
20.8 19.4 15.4 21.2 13.5 20.5 19.0 17.6 22.3 18.4
16.4 18.7 18.2 19.6 15.3 21.2 20.4 21.4 20.3 20.1
16.6 23.9 17.6 17.8 20.2 19.6 20.6 14.8 19.7 20.5
17.4 23.6 17.5 20.3 16.6 18.0 20.8 15.8 23.1 17.0

(a) Compute the sample mean, median, and mode.


(b) Compute the sample variance.
(c) Set up a frequency distribution with class intervals 13.5 - 14.9, 15.0 – 16.4, ….. and
draw the resulting histogram.
(d) For the grouped data acting as if each of the data points in an interval was actually
located at the midpoint of that interval, compute the sample mean and sample

variance and compare these with the results obtained in parts (a) and (b). Why do
they differ?
(e) Plot the box plot; comment on the skewness of these data.

[14] The average particulate concentration, in micrograms per cubic meter, was
measured in a petrochemical complex at 36 randomly chosen times, with the
following concentrations resulting:
5 18 15 7 23 220 130 85 103 25 80 7 24 6 13 65 37 25
24 65 82 95 77 15 70 110 44 28 33 81 29 14 45 92 17 53
(a) Represent the data in a histogram.
(b) Is the histogram approximately normal?

[15] The lengths of power failures, in minutes, are recorded in the following table.
22 18 135 15 90 78 69 98 102 83 55 28 121 120 13 22 124 112 96
70 66 74 89 103 24 21 112 21 40 98 87 132 115 21 28 43 3750 118
158 74 78 83 93 95
(a) Find the sample mean and sample median of the power-failure times.
(b) Find the sample standard deviation and the coefficient of variation of the power
failure times.



