
What is Statistics?

Statistics is the art of learning from data. It is


concerned with the collection of data, its
subsequent description, and its analysis, which
often leads to the drawing of conclusions.

Statistical analysis begins with a given set of data:


For instance, the government regularly collects
and publicizes data concerning the
unemployment rate, the gross domestic product,
the rate of inflation etc.
Data sources can be divided into three categories:
I. Internal sources,
II. External sources, and
III. Surveys and experiments

Often data come from internal sources, such as a company’s


personnel files or accounting records. A police department might
use data that exist in its records to analyse changes in the nature
of crimes over a period of time.

All needed data may not be available from internal sources.


Hence, to obtain data we may have to depend on sources outside
the company, called external sources. Data obtained from external
sources may be primary or secondary data. Data obtained from
the organization that originally collected them are called primary
data. If we obtain data from the Bureau of Labour Statistics that
were collected by this organization, then these are primary data.
Data obtained from a source that did not originally collect them
are called secondary data.
Surveys and Experiments
Sometimes the data we need may not be available
from internal or external sources. In such cases, we
may have to obtain data by conducting our own
survey or experiment.
Survey In a survey, data are collected from the
members of a population or sample in such a way that
we have no particular control over the factors that
may affect the characteristic of interest or the results
of the survey.
Experiment In an experiment, data are collected from
members of a population or sample in such a way that
we have some control over the factors that may affect
the characteristic of interest or the results of the
experiment.
Variable A variable is a characteristic under study
that assumes different values for different
elements. In contrast to a variable, the value of a
constant is fixed.
Types of variables
I. Quantitative Variables
A variable that can be measured numerically is
called a quantitative variable. The data collected
on a quantitative variable are called quantitative
data.
Quantitative variables may be classified as either
discrete variables or continuous variables.
Discrete Variable A variable whose values are
countable is called a discrete variable. In other words,
a discrete variable can assume only certain values with
no intermediate values.
Continuous Variable A variable that can assume any
numerical value over a certain interval or intervals is
called a continuous variable.

II. Qualitative or Categorical Variable


A variable that cannot assume a numerical value but
can be classified into two or more nonnumeric
categories is called a qualitative or categorical
variable. The data collected on such a variable are
called qualitative data.
Raw Data
Data recorded in the sequence in which they are
collected and before they are processed or ranked are
called raw data.
Level of measurement
When we observe and record a variable, it has
characteristics that influence the type of statistical
analysis that we can perform on it. These
characteristics are referred to as the level of
measurement of the variable.
Four levels of measurement have been identified:
• Nominal
• Ordinal
• Interval
• Ratio
Nominal level measurement uses symbols to classify
observations into mutually exclusive and exhaustive
categories.
OR
The nominal level of measurement occurs when the
observations do not have a meaningful numeric value.
Nominal variables classify or categorize the
observations into discrete categories.
Example: In a survey of teachers, sex was determined
by a question. Observations were sorted into two
mutually exclusive and exhaustive categories, male
and female. Observations could be labelled with the
letters M and F, or the numerals 0 and 1
Ordinal variables are used to represent
observations that can be categorized and rank
ordered.
E.g. Class rank, order of finishing a 100m race,
cumulative grade point aggregate (CGPA).
The values of ordinal variables can be:
• compared to see if they are equal or not.
• compared to see if one is larger or smaller than
another.
Interval variables represent observations that can be
categorized, rank ordered, and have a unit of
measure. A unit of measure implies that the
difference between any two successive values is
identical.
E.g. Shoe size, IQ scores, temperature in degrees Fahrenheit
With an interval scaled variable, the value 0 does not
represent the complete absence of the variable.
The values of interval variables can be:
• compared to see if they are equal or not.
• compared to see if one is larger or smaller than
another.
• added or subtracted
Ratio variables represent observations that can be
categorized, rank ordered, have a unit of measure
and have a true zero.
• The true zero implies that a value of zero
represents the complete absence of the variable
E.g. Weight, reaction time
The values of ratio variables can be:
• compared to see if they are equal or not
• compared to see if one is larger or smaller than
another
• added or subtracted
• multiplied or divided
Frequency distribution
A frequency distribution for qualitative data lists all
categories and the number of elements that belong to each of the
categories.
Example
A sample of 30 employees from large companies was selected, and these
employees were asked how stressful their jobs were. The responses of
these employees are recorded below, where very represents very
stressful, somewhat means somewhat stressful, and none stands for not
stressful at all.

somewhat none somewhat very very none
very somewhat somewhat very somewhat
somewhat very somewhat none very none
somewhat somewhat very somewhat somewhat
very none somewhat very very somewhat
none somewhat

Construct a frequency distribution table for these data.


Solution
Stress on Job    Frequency
Very             10
Somewhat         14
None              6

Relative frequency and percentages


The relative frequency of a category is obtained by
dividing the frequency of that category by the sum of
all frequencies. Thus, the relative frequency shows
what fractional part or proportion of the total
frequency belongs to the corresponding category. A
relative frequency distribution lists the relative
frequencies for all categories.
Relative frequency of a category =
Frequency of that category/ Sum of all frequencies

• The percentage for a category is obtained by multiplying


the relative frequency of that category by 100. A
percentage distribution lists the percentages for all
categories.
Percentage = (Relative frequency) x 100
Relative frequency and percentage of the above example
Stress on Job    Frequency    Relative frequency    Percentage
Very             10           10/30 = 0.333         33.3
Somewhat         14           14/30 = 0.467         46.7
None              6            6/30 = 0.200         20.0
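The table above can be reproduced with a short Python sketch. It tallies the 30 responses from the example, then derives the relative frequency and percentage for each category. The variable names are illustrative only.

```python
from collections import Counter

# Responses of the 30 employees, as listed in the example.
responses = (
    "somewhat none somewhat very very none "
    "very somewhat somewhat very somewhat "
    "somewhat very somewhat none very none "
    "somewhat somewhat very somewhat somewhat "
    "very none somewhat very very somewhat "
    "none somewhat"
).split()

counts = Counter(responses)      # frequency of each category
total = sum(counts.values())     # sum of all frequencies

# One row per category: (frequency, relative frequency, percentage).
table = {
    category: (f, f / total, 100 * f / total)
    for category, f in counts.items()
}

for category in ("very", "somewhat", "none"):
    f, rel, pct = table[category]
    print(f"{category:<10}{f:>4}{rel:>8.3f}{pct:>7.1f}")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on any hand-built table.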
Frequency Distribution for Quantitative Data
A frequency distribution for quantitative data lists all the
classes and the number of values that belong to each class.
Data presented in the form of a frequency distribution are
called grouped data

• Data that list individual values are called ungrouped data.

• Class Boundary The class boundary is given by the


midpoint of the upper limit of one class and the lower
limit of the next class.

• The difference between the two boundaries of a class


gives the class width. The class width is also called the
class size.
Class width = Upper boundary - Lower boundary
• The class midpoint or mark is obtained by
dividing the sum of the two limits (or the two
boundaries) of a class by 2.
Class midpoint or mark = (Lower limit + Upper limit)/2

Example: Weekly Earnings of 100 Employees of a Company

Weekly Earnings    Number of        Class               Class    Class
(Class Limits)     Employees (f)    Boundaries          Width    Midpoint
 801 - 1000         9                800.5 - 1000.5     200       900.5
1001 - 1200        22               1000.5 - 1200.5     200      1100.5
1201 - 1400        39               1200.5 - 1400.5     200      1300.5
1401 - 1600        15               1400.5 - 1600.5     200      1500.5
1601 - 1800         9               1600.5 - 1800.5     200      1700.5
1801 - 2000         6               1800.5 - 2000.5     200      1900.5
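As a minimal sketch, the boundaries, widths and midpoints in this table can be derived from the class limits alone. The helper name `describe` is hypothetical, and the code assumes equal gaps between consecutive classes, as in this example.

```python
# Class limits from the weekly earnings table.
limits = [(801, 1000), (1001, 1200), (1201, 1400),
          (1401, 1600), (1601, 1800), (1801, 2000)]

# Gap between the upper limit of one class and the lower limit of the next
# (1 for these classes); each boundary sits halfway across that gap.
gap = limits[1][0] - limits[0][1]

def describe(lower, upper):
    """Return (boundaries, width, midpoint) for one class."""
    lower_b = lower - gap / 2
    upper_b = upper + gap / 2
    width = upper_b - lower_b          # upper boundary - lower boundary
    midpoint = (lower + upper) / 2     # (lower limit + upper limit) / 2
    return (lower_b, upper_b), width, midpoint

rows = [describe(lo, hi) for lo, hi in limits]
for (lo, hi), (bounds, width, mid) in zip(limits, rows):
    print(lo, hi, bounds, width, mid)
```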


Constructing Frequency Distribution Tables
When constructing a frequency distribution table,
we need to make the following three major
decisions
• Number of Classes
Usually the number of classes for a frequency
distribution table varies from 5 to 20, depending
mainly on the number of observations in the data
set. It is preferable to have more classes as the size
of a data set increases. The decision about the
number of classes is arbitrarily made by the data
organizer.
One rule to help decide on the number of classes
is Sturges' rule:
c = 1 + 3.3 log n
where c is the number of classes, n is the
number of observations in the data set, and log is
the base-10 logarithm. The value of log n can be
obtained by using a calculator.
For a data set with n = 30 observations,
c = 1 + 3.3 log 30
c = 5.87
We can conveniently say that the expected
number of classes is either 5 or 6.
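The calculation can be checked with a few lines of Python. This is only a sketch of the rule as stated above (base-10 logarithm, coefficient 3.3); the function name is an assumption.

```python
import math

def sturges_classes(n):
    """Suggested number of classes, c = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)

c = sturges_classes(30)
print(round(c, 2))   # 5.87, so 5 or 6 classes is a reasonable choice
```

The result is a guideline, not a requirement; the final number of classes is still chosen by the data organizer.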
• Class Width
Although it is not uncommon to have classes of
different sizes, most of the time it is preferable to
have the same width for all classes. To determine
the class width when all classes are the same size,
first find the difference between the largest and the
smallest values in the data. Then, the approximate
width of a class is obtained by dividing this
difference by the number of desired classes.
Approximate class width = (Largest value - Smallest
value) / Number of classes
That is, Range/c
Usually this approximate class width is rounded to a
convenient number, which is then used as the class
width. Note that rounding this number may slightly
change the number of classes initially intended.
Example
The following data give the total number of iPods®
sold by a mail order company on each of 30 days.
Construct a frequency distribution table.
8 25 11 15 29 22 10 5 17
21 22 13 26 16 18 12 9 26
20 16 23 14 19 23 20 16 27
16 21 14
Minimum value is 5, and the maximum value is 29
Approximate width of each class = (29 - 5) / 5 = 4.8
Now we round this approximate width to a convenient
number, say 5. The lower limit of the first class can be
taken as 5 or any number less than 5. Suppose we take 5
as the lower limit of the first class. Then our classes will
be:
5–9
10 – 14
15 – 19
20 – 24
25 – 29
Frequency Distribution for the Data on iPods Sold
iPods Sold    f
5–9           3
10–14         6
15–19         8
20–24         8
25–29         5
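The grouping steps for the iPods data can be sketched in Python as follows. The choices of 5 classes, a width of 5 and a first lower limit of 5 are taken from the example; the variable names are illustrative.

```python
# Number of iPods sold on each of the 30 days.
ipods = [8, 25, 11, 15, 29, 22, 10, 5, 17,
         21, 22, 13, 26, 16, 18, 12, 9, 26,
         20, 16, 23, 14, 19, 23, 20, 16, 27,
         16, 21, 14]

num_classes = 5
approx_width = (max(ipods) - min(ipods)) / num_classes   # (29 - 5) / 5 = 4.8
width = 5          # approximate width rounded to a convenient number
first_lower = 5    # lower limit of the first class

# Class limits: 5-9, 10-14, 15-19, 20-24, 25-29.
classes = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
           for i in range(num_classes)]

# Count how many values fall inside each class (limits inclusive).
freq = {c: sum(c[0] <= x <= c[1] for x in ipods) for c in classes}

for (lo, hi), f in freq.items():
    print(f"{lo}-{hi}: {f}")
```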
Relative frequency and percentage could also be
calculated for the grouped data.
iPods Sold    f    Relative frequency    Percentage
5–9           3    3/30                  10.0
10–14         6    6/30                  20.0
15–19         8    8/30                  26.7
20–24         8    8/30                  26.7
25–29         5    5/30                  16.7


Cumulative Frequency Distribution A cumulative
frequency distribution gives the total number of
values that fall below the upper boundary of each
class.
iPods Sold    f    Relative frequency    Percentage    Class boundary    Cumulative frequency
5–9           3    3/30                  10.0           4.5 – 9.5        3
10–14         6    6/30                  20.0           9.5 – 14.5       3 + 6 = 9
15–19         8    8/30                  26.7          14.5 – 19.5       9 + 8 = 17
20–24         8    8/30                  26.7          19.5 – 24.5       17 + 8 = 25
25–29         5    5/30                  16.7          24.5 – 29.5       25 + 5 = 30


• The cumulative relative frequencies are
obtained by dividing the cumulative frequencies
by the total number of observations in the data
set.
• The cumulative percentages are obtained by
multiplying the cumulative relative frequencies
by 100.
iPods Sold    f    Relative     Percentage    Class           Cumulative     Cumulative relative    Cumulative
                   frequency                  boundary        frequency      frequency              percentage
5–9           3    3/30         10.0           4.5 – 9.5      3              3/30 = 0.100            10.0
10–14         6    6/30         20.0           9.5 – 14.5     3 + 6 = 9      9/30 = 0.300            30.0
15–19         8    8/30         26.7          14.5 – 19.5     9 + 8 = 17     17/30 = 0.567           56.7
20–24         8    8/30         26.7          19.5 – 24.5     17 + 8 = 25    25/30 = 0.833           83.3
25–29         5    5/30         16.7          24.5 – 29.5     25 + 5 = 30    30/30 = 1.000          100.0
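The cumulative columns are running totals, so they can be sketched with `itertools.accumulate`. The class frequencies are those of the iPods example; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
freq = [3, 6, 8, 8, 5]
n = sum(freq)                              # total number of observations

cum_freq = list(accumulate(freq))          # running total of the frequencies
cum_rel = [cf / n for cf in cum_freq]      # cumulative relative frequencies
cum_pct = [100 * r for r in cum_rel]       # cumulative percentages

for c, cf, cr, cp in zip(classes, cum_freq, cum_rel, cum_pct):
    print(f"{c:<7}{cf:>4}{cr:>8.3f}{cp:>8.1f}")
```

The last cumulative frequency must equal the number of observations, and the last cumulative percentage must be 100.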


Assignment
Attempt the two questions
1. The following data give the amounts spent on video rentals (in
dollars) during 2009 by 30 households randomly selected from
those who rented videos in 2009.
595 24 6 100 100 40 622 405 90
55 155 760 405 90 205 70 180 88
808 100 240 127 83 310 350 160 22
111 70 15
a. Construct a frequency distribution table. Take $1 as the lower
limit of the first class and $200 as the width of each class.
b. Calculate the relative frequencies and percentages for all
classes.
c. What percentage of the households in this sample spent more
than $400 on video rentals
in 2009?
2. The following data give the numbers of orders
received for a sample of 30 hours at the Timesaver
Mail Order Company.
34 44 31 52 41 47 38 35 32 39
28 24 46 41 49 53 57 33 27 37
30 27 45 38 34 46 36 30 47 50
a. Construct a frequency distribution table. Take 23 as
the lower limit of the first class and 7 as the width of
each class.
b. Calculate the relative frequencies and percentages
for all classes.
c. For what percentage of the hours in this sample was
the number of orders more than 36?
Graphical presentation of qualitative data
All of us have heard the adage “a picture is worth a
thousand words.” A graphic display can reveal
at a glance the main characteristics of a data set.
The bar graph and the pie chart are two types of
graphs that are commonly used to display
qualitative data.
Bar Graph or Bar Chart
A graph made of bars whose heights represent the
frequencies of respective categories is called a bar
graph.
To construct a bar graph (also called a bar chart),
we mark the various categories on the horizontal
axis. Note that all categories are represented by
intervals of the same width. We mark the
frequencies on the vertical axis. Then we draw one
bar for each category such that the height of the
bar represents the frequency of the corresponding
category. We leave a small gap between adjacent
bars.
Sometimes a bar graph is constructed by marking
the categories on the vertical axis and the
frequencies on the horizontal axis.
[Bar charts: "Stress on the job", frequency (0 to 16) for the categories Very, Somewhat and None, drawn once with vertical bars and once with horizontal bars]

Figure 1: Bar chart for the stress on job example


Other graphical representations of qualitative data
Pareto Charts
Definition:
A Pareto chart is used to represent a frequency distribution for a
categorical variable, and the frequencies are displayed by the
heights of vertical bars, which are arranged in order from highest to
lowest.

A Pareto chart is a bar chart used to separate the “vital few” from the “trivial many.”
These charts are based on the Pareto Principle, which states that 20
percent of the problems have 80 percent of the impact. The 20
percent of the problems are the “vital few” and the remaining
problems are the “trivial many.” A Pareto chart can help you:
• Separate the few major problems from the many possible problems
so you can focus your improvement efforts.
• Arrange data according to priority or importance.
• Determine which problems are most important,
using data, not perception.
Steps
• Arrange the data from the largest to smallest
according to frequency.
• Draw and label the x and y axes.
• Draw the bars corresponding to the frequencies.
Example
Twenty-five army inductees were given a blood
test to determine their blood type.

Raw Data:
A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,A,O,O,O,AB,AB,
A,O,B,A
Pareto chart analysis for counts
Frequency Cum.Freq. Percentage Cum.Percent.
O 9 9 36 36
B 7 16 28 64
A 5 21 20 84
AB 4 25 16 100
[Figure: Pareto Chart for Blood group; axes: Frequency (0 to 25) and Cumulative Percentage (0% to 100%)]
Interpretation

The Pareto chart brings immediate focus to
which categories are part of the “vital few” and
thus should receive attention first. This chart
shows that blood groups O and B are the most
frequently occurring blood types among the 25 people
tested; together they account for 64 percent of the sample.
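The Pareto table for the blood type data can be rebuilt with the following Python sketch: count each category, order the counts from largest to smallest, then accumulate. Only the raw data come from the example; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Raw blood type data from the example.
blood = "A,B,B,AB,O,O,O,B,AB,B,B,B,O,A,O,A,O,O,O,AB,AB,A,O,B,A".split(",")
n = len(blood)

# most_common() returns (category, frequency) pairs, largest frequency first.
ordered = Counter(blood).most_common()
groups = [g for g, _ in ordered]
freqs = [f for _, f in ordered]

cum_freqs = list(accumulate(freqs))
pcts = [100 * f / n for f in freqs]
cum_pcts = [100 * cf / n for cf in cum_freqs]

for row in zip(groups, freqs, cum_freqs, pcts, cum_pcts):
    print(*row)
```

Plotting `freqs` as bars in this order, with `cum_pcts` as a line, gives the usual Pareto chart.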
Pie Chart
A circle divided into portions that represent the
relative frequencies or percentages of a population
or a sample belonging to different categories is
called a pie chart.

As we know, a circle contains 360 degrees. To


construct a pie chart, we multiply 360 by the
relative frequency of each category to obtain the
degree measure or size of the angle for the
corresponding category.
[Pie chart: "Stress on the job"; Very 33%, Somewhat 47%, None 20%]

Figure 2: Pie chart for the stress on job example.
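The angle of each slice follows from multiplying 360 by the relative frequency. A minimal Python sketch for the stress on job frequencies (variable names are illustrative):

```python
# Frequencies from the stress on job example.
freq = {"Very": 10, "Somewhat": 14, "None": 6}
total = sum(freq.values())

# Angle of each slice = 360 x relative frequency of the category.
angles = {category: 360 * f / total for category, f in freq.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:6.1f} degrees")
```

The angles are 120, 168 and 72 degrees, which together make up the full 360 degrees of the circle.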


Graphing grouped data
Grouped (quantitative) data can be displayed in a
histogram or a polygon. We describe how to
construct such graphs. We can also draw a pie chart to
display the percentage distribution for a quantitative
data set.
Histogram
A histogram is a graph in which classes are marked on
the horizontal axis and the frequencies, relative
frequencies, or percentages are marked on the vertical
axis. The frequencies, relative frequencies, or
percentages are represented by the heights of the bars.
In a histogram, the bars are drawn adjacent to each
other.
Figure 3: Histogram for the iPods sold example
Polygon
A graph formed by joining the midpoints of the tops of
successive bars in a histogram with straight lines is
called a polygon. A polygon is another device that can
be used to present quantitative data in graphic form.

To draw a frequency polygon, we first mark a dot


above the midpoint of each class at a height equal to
the frequency of that class. This is the same as marking
the midpoint at the top of each bar in a histogram.
Next we mark two more classes, one at each end, and
mark their midpoints. Note that these two classes have
zero frequencies. In the last step, we join the adjacent
dots with straight lines. The resulting line graph is
called a frequency polygon or simply a polygon.
A polygon with relative frequencies marked on the
vertical axis is called a relative frequency polygon.
Similarly, a polygon with percentages marked on
the vertical axis is called a percentage polygon.
[Figure: Frequency polygon; vertical axis: frequency]

Figure 4: Frequency polygon for the iPods sold example


Shapes of Histograms
A histogram can assume any one of a large number of shapes.
The most common of these shapes are
1. Symmetric
2. Skewed
3. Uniform or rectangular

A symmetric histogram is identical on both sides of its central


point.

A skewed histogram is non-symmetric. For a skewed histogram,


the tail on one side is longer than the tail on the other side.
- A skewed-to-the-right histogram has a longer tail on the right
side.
- A skewed-to-the-left histogram has a longer tail on the left side.

A uniform or rectangular histogram has the same frequency for


each class.
Ogives
When plotted on a diagram, the cumulative
frequencies give a curve that is called an ogive
(pronounced o-jive ).
Ogive
An ogive is a curve drawn for the cumulative
frequency distribution by joining with straight lines
the dots marked above the upper boundaries of
classes at heights equal to the cumulative
frequencies of respective classes.
To draw the ogive for the iPods example, the variable,
which is total iPods sold, is marked on the horizontal axis
and the cumulative frequencies on the vertical
axis. Then the dots are marked above the upper
boundaries of the various classes at heights equal
to the corresponding cumulative frequencies. The
ogive is obtained by joining consecutive points
with straight lines.
Note: the ogive starts at the lower boundary of the
first class, where the cumulative frequency is zero,
and ends at the upper boundary of the last class.
[Figure: Ogive; horizontal axis: class boundaries up to 29.5, vertical axis: cumulative frequency (0 to 35)]

Figure 5: Ogive for the iPods sold example


One advantage of an ogive is that it can be used to
approximate the cumulative frequency for any interval.
For example, we can use Figure 5 to find the
number of days for which 17 or fewer iPods were sold.
First, draw a vertical line from 17 on the horizontal axis
up to the ogive. Then draw a horizontal line from the
point where this line intersects the ogive to the vertical
axis. The point where it meets the vertical axis gives the
cumulative frequency of the interval 5 to 17.
In the figure, this cumulative frequency is
(approximately) 13. Therefore, 17 or fewer iPods were
sold on approximately 13 days.
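Reading a value off the ogive is linear interpolation between the plotted points. The sketch below assumes the ogive of the iPods example, with a starting point of zero at the lower boundary of the first class; the function name `ogive_value` is hypothetical.

```python
# Points of the ogive: class boundaries and cumulative frequencies.
boundaries = [4.5, 9.5, 14.5, 19.5, 24.5, 29.5]
cum_freq = [0, 3, 9, 17, 25, 30]

def ogive_value(x):
    """Approximate cumulative frequency at x by linear interpolation."""
    if x <= boundaries[0]:
        return 0
    if x >= boundaries[-1]:
        return cum_freq[-1]
    for i in range(1, len(boundaries)):
        if x <= boundaries[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum_freq[i - 1], cum_freq[i]
            # Straight line between the two neighbouring dots.
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

print(ogive_value(17))   # 13.0
```

The value is an approximation because the ogive assumes the observations are spread evenly within each class.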
We can draw an ogive for cumulative relative
frequency and cumulative percentage distributions
the same way as we did for the cumulative frequency
distribution.
Stem-and-Leaf Display
In a stem-and-leaf display of quantitative data,
each value is divided into two portions—a stem
and a leaf. The leaves for each stem are shown
separately in a display.
Example
The following are the scores of 30 college students
on a statistics test. Construct a stem-and-leaf
display
75 52 80 96 65 79 71 87 93 95
69 72 81 61 76 86 79 68 50 92
83 84 77 64 71 87 72 92 57 98
To construct a stem-and-leaf display for these
scores, we split each score into two parts.
The first part contains the first digit, which is called
the stem.
The second part contains the second digit, which is
called the leaf.
Thus, for the score of the first student, which is 75,
7 is the stem and 5 is the leaf.
For the score of the second student, which is 52,
the stem is 5 and the leaf is 2.
You continue till you get the stem and leaf for the
score of the last student.
Stem-and-leaf display for test scores.
5 | 2 0 7
6 | 5 9 1 8 4
7 | 5 9 1 2 6 9 7 1 2
8 | 0 7 1 6 3 4 7
9 | 6 3 5 2 2 8
Ranked stem-and-leaf display for test scores.
5 | 0 2 7
6 | 1 4 5 8 9
7 | 1 1 2 2 5 6 7 9 9
8 | 0 1 3 4 6 7 7
9 | 2 2 3 5 6 8
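The same displays can be built in Python by splitting each two-digit score with `divmod`. This is a sketch for two-digit values only; other data would need a different split between stem and leaf.

```python
from collections import defaultdict

# Scores of the 30 students, in the order given in the example.
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95,
          69, 72, 81, 61, 76, 86, 79, 68, 50, 92,
          83, 84, 77, 64, 71, 87, 72, 92, 57, 98]

stems = defaultdict(list)
for score in scores:
    stem, leaf = divmod(score, 10)   # first digit is the stem, second the leaf
    stems[stem].append(leaf)

# Leaves in the order encountered, and in ranked (sorted) order.
display = {s: "".join(map(str, leaves)) for s, leaves in sorted(stems.items())}
ranked = {s: "".join(map(str, sorted(leaves))) for s, leaves in sorted(stems.items())}

for s in display:
    print(s, "|", display[s], "   ", s, "|", ranked[s])
```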
Example
The following data give the monthly rents paid by a
sample of 30 households selected from a small
town. Construct a stem-and-leaf display for these
data.
880 1081 721 1075 1023 775 1235 750 965 960
1210 985 1231 932 850 825 1000 915 1191 1035
1151 630 1175 952 1100 1140 750 1140 1370 1280
6 30
7 75 50 21 50
8 80 25 50
9 32 52 15 60 85 65
10 23 81 35 75 00
11 91 51 40 75 40 00
12 10 31 35 80
13 70
Box plot
A recently created graphic display, called a boxplot,
highlights the summary information in the
quartiles. The center half of the data, from the first
to the third quartile, is represented by a
rectangle (box) with the median indicated by a bar.
A line extends from Q3 to the maximum value and
another from Q1 to the minimum.
The construction of a box-and-whisker plot
(sometimes called, simply, a boxplot) makes use of
the quartiles of a data set and may be
accomplished by following these five steps:
1. Represent the variable of interest on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way
that the left end of the box aligns with the first quartile and the
right end of the box aligns with the third quartile.
3. Divide the box into two parts by a vertical line that aligns with
the median.
4. Draw a horizontal line called a whisker from the left end of the
box to a point that aligns with the smallest measurement in the
data set.
5. Draw another horizontal line, or whisker, from the right end of
the box to a point that aligns with the largest measurement in
the data set.
Examination of a box-and-whisker plot for a set of data reveals
information regarding the amount of spread, location of
concentration, and symmetry of the data.
Example: In an epidemiological study, the total
organochlorines and PCB's present in milk samples
were recorded from 40 donors in Colorado.
(Source: Pesticides Monitoring Journal, June 1973.)
The measurements were ordered from lowest to
highest. For the data set, construct a box plot.

27 43 52 53 53 53 61 63 63 65
68 70 72 75 83 95 96 97 101 105
110 115 115 115 115 126 127 134 145 152
153 182 190 197 197 282 322 322 342 521
Summary statistics for these data:
Min.     1st Qu.    Median    Mean      3rd Qu.    Max.
27.00    65.75      107.50    133.90    152.75     521.00
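The summary values can be checked with Python's `statistics` module. Quartile conventions differ between textbooks and software; the sketch below assumes the (n + 1)p positioning rule, which `statistics.quantiles` applies when `method="exclusive"` and which reproduces the quartiles shown here.

```python
import statistics

# Total organochlorines and PCBs in the 40 milk samples (already ordered).
milk = [27, 43, 52, 53, 53, 53, 61, 63, 63, 65,
        68, 70, 72, 75, 83, 95, 96, 97, 101, 105,
        110, 115, 115, 115, 115, 126, 127, 134, 145, 152,
        153, 182, 190, 197, 197, 282, 322, 322, 342, 521]

# Quartiles with the (n + 1)p positioning rule.
q1, median, q3 = statistics.quantiles(milk, n=4, method="exclusive")
mean = statistics.mean(milk)

print(min(milk), q1, median, round(mean, 2), q3, max(milk))
```

The box of the box plot runs from `q1` to `q3`, the bar inside it sits at `median`, and the whiskers reach to the minimum and maximum.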
Outliers or Extreme Values
Values that are very small or very large relative to
the majority of the values in a data set are called
outliers or extreme values.
An outlier is an observation whose value, x, either
exceeds the value of the third quartile by a
magnitude greater than 1.5(IQR) or is less than the
value of the first quartile by a magnitude greater
than 1.5(IQR).
That is, an observation of x > Q3 + 1.5(IQR) or an
observation of x < Q1 - 1.5(IQR) is called an outlier.
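As an illustration, the rule can be applied to the milk sample data from the box plot example, taking Q1 = 65.75 and Q3 = 152.75 from its summary. The variable names are illustrative.

```python
# Milk sample data and quartiles from the box plot example.
milk = [27, 43, 52, 53, 53, 53, 61, 63, 63, 65,
        68, 70, 72, 75, 83, 95, 96, 97, 101, 105,
        110, 115, 115, 115, 115, 126, 127, 134, 145, 152,
        153, 182, 190, 197, 197, 282, 322, 322, 342, 521]
q1, q3 = 65.75, 152.75

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are outliers
upper_fence = q3 + 1.5 * iqr     # values above this are outliers

outliers = [x for x in milk if x < lower_fence or x > upper_fence]
print(iqr, lower_fence, upper_fence, outliers)
```

For these data the IQR is 87, the fences are -64.75 and 283.25, and the four largest measurements (322, 322, 342 and 521) are flagged as outliers.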
Assignment
1. Evans et al. examined the effect of velocity on ground
reaction forces (GRF) in dogs with lameness from a
torn cranial cruciate ligament. The dogs were walked
and trotted over a force platform, and the GRF was
recorded during a certain phase of their performance.
The table below contains 20 measurements of force
where each value shown is the mean of five force
measurements per dog when trotting. Construct a
boxplot for the data set.

14.6 24.3 24.9 27 27.2 27.4 28.2 28.8 29.9 30.7

31.5 31.6 32.3 32.8 33.3 33.6 34.3 36.9 38.3 44


2. The following data give the times served (in
months) by 35 prison inmates who were released
recently. Prepare a stem-and-leaf display for these
data.

37 6 20 5 25 30 24 10 12 20

24 8 26 15 13 22 72 80 96 33

84 86 70 40 92 36 28 90 36 32

72 45 38 18 9
3. The following table gives the frequency distribution of the
number of credit cards possessed by 80 adults.
Number of Credit Cards    Number of Adults
0 – 3                     18
4 – 7                     26
8 – 11                    22
12 – 15                   11
16 – 19                    3
a. Prepare a cumulative frequency distribution.
b. Calculate the cumulative relative frequencies and
cumulative percentages for all classes.
c. Find the percentage of these adults who possess 7 or fewer
credit cards.
d. Draw an ogive for the cumulative percentage distribution.
e. Using the ogive, find the percentage of adults who possess
10 or fewer credit cards.
4. Nixon Corporation manufactures computer monitors. The
following data are the numbers of computer monitors produced
at the company for a sample of 30 days.

24 32 27 23 33 33 29 25 23 28

21 26 31 22 27 33 27 23 28 29

31 35 34 22 26 28 23 35 31 27

a. Construct a frequency distribution table using the classes
21–23, 24–26, 27–29, 30–32, and 33–35.
b. Calculate the relative frequencies and percentages for all
classes.
c. Construct a histogram and a polygon for the percentage
distribution.
d. For what percentage of the days is the number of computer
monitors produced in the interval
27–29?
Other types of Bar charts
A. Grouped bar charts
Grouped bar charts are a way of showing
information about different sub-groups of the
main categories.
A separate bar represents each of the sub-groups
and these are usually coloured or shaded
differently to distinguish between them.
Grouped bar charts can be used to show several
sub-groups of each category but care needs to be
taken to ensure that the chart does not contain
too much information making it complicated to
read and interpret.
Consider the example of a grouped bar chart for
agricultural produce in some parts of Nigeria.
Agricultural produce in some parts of Nigeria

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr

East 20 25 90 20

West 30 35 40 30

North 45 45 50 40

[Grouped bar chart: separate bars for East, West and North in each quarter (1st Qtr to 4th Qtr); vertical axis from 0 to 100]
B. Stacked bar charts
Stacked bar charts are similar to grouped bar charts
in that they are used to display information about
the sub-groups that make up the different
categories.
In stacked bar charts the bars representing the
sub-groups are placed on top of each other to
make a single column, or side by side to make a
single bar. The overall height or length of the bar
shows the total size of the category whilst
different colours or shadings are used to indicate
the relative contribution of the different sub-
groups.
Consider the example of a stacked bar chart for
agricultural produce in some parts of Nigeria.
Agricultural produce in some parts of Nigeria

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr

East 20 25 90 20

West 30 35 40 30

North 45 45 50 40

[Stacked bar chart: East, West and North stacked into a single bar for each quarter (1st Qtr to 4th Qtr); vertical axis from 0 to 200]
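The height of each stacked bar is the sum of its sub-groups, and each segment's share of the bar is its relative contribution. A small Python sketch using the agricultural produce table (variable names are illustrative):

```python
quarters = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]
produce = {
    "East":  [20, 25, 90, 20],
    "West":  [30, 35, 40, 30],
    "North": [45, 45, 50, 40],
}

# Overall height of each stacked bar: sum of the sub-groups in that quarter.
totals = [sum(values) for values in zip(*produce.values())]

# Relative contribution (in percent) of each region to each quarter's bar.
shares = {
    region: [100 * v / t for v, t in zip(values, totals)]
    for region, values in produce.items()
}

for quarter, total in zip(quarters, totals):
    print(quarter, total)
```

The totals are 95, 105, 180 and 90, which explains why a stacked chart of these data needs a taller vertical axis than the grouped chart.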
DESCRIPTIVE STATISTICS:
MEASURES OF CENTRAL TENDENCY
A descriptive measure computed from the data of
a sample is called a statistic.
A descriptive measure computed from the data of
a population is called a parameter.

We often represent a data set by numerical


summary measures, usually called the typical
values. A measure of central tendency gives the
center of a histogram or a frequency distribution
curve.
In each of the measures of central tendency, of
which we discuss three, we have a single value that
is considered to be typical of the set of data as a
whole.
Measures of central tendency convey information
regarding the average value of a set of values. As
we will see, the word average can be defined in
different ways.
The three most commonly used measures of
central tendency are
the mean;
the median and
the mode
Arithmetic Mean
The most familiar measure of central tendency is
the arithmetic mean. It is the descriptive measure
most people have in mind when they speak of the
“average.” The adjective arithmetic distinguishes
this mean from other means that can be
computed.
We shall refer to the arithmetic mean simply as the
mean. The mean is obtained by adding all the
values in a population or sample and dividing by
the number of values that are added.
Calculating Mean for Ungrouped Data
The mean for ungrouped data is obtained by
dividing the sum of all values by the number of
values in the data set. Thus,

Mean for population data: $\mu = \dfrac{\sum x}{N}$

Mean for sample data: $\bar{x} = \dfrac{\sum x}{n}$

where $\sum x$ is the sum of all values, $N$ is the
population size, $n$ is the sample size, $\mu$ is the
population mean, and $\bar{x}$ is the sample mean.
Example
The total sales (rounded to billions of dollars) of six U.S.
companies for 2008.
Company Total Sales (billions of dollars)
General Motors 149
Wal-Mart Stores 406
General Electric 183
Citigroup 107
Exxon Mobil 426
Verizon Communication 97
Find the 2008 mean sales for these six companies.

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{1368}{6} = 228$

Thus, the mean 2008 sales of these six companies was 228 billion dollars.
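The same calculation in Python, as a minimal sketch (the dictionary layout is only one convenient way to hold the table):

```python
# 2008 total sales in billions of dollars, from the table above.
sales = {
    "General Motors": 149,
    "Wal-Mart Stores": 406,
    "General Electric": 183,
    "Citigroup": 107,
    "Exxon Mobil": 426,
    "Verizon Communication": 97,
}

total = sum(sales.values())   # sum of all values
n = len(sales)                # number of values
mean = total / n

print(total, n, mean)         # 1368 6 228.0
```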
Properties of the Mean
The arithmetic mean possesses certain properties,
some desirable and some not so desirable. These
properties include the following:
1. Uniqueness. For a given set of data there is one and
only one arithmetic mean.
2. Simplicity. The arithmetic mean is easily understood
and easy to compute.
3. Since each and every value in a set of data enters
into the computation of the mean, it is affected by
each value. Extreme values, therefore, have an
influence on the mean and, in some cases, can so
distort it that it becomes undesirable as a measure of
central tendency.
Class Activity
1. The following are the ages (in years) of all eight employees
of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
2. The following data give the 2006–07 team salaries for 20
teams of the English Premier League, arguably the best-
known soccer league in the world. The salaries are given in
the order in which the teams finished during the 2006–07
season. The salaries are in millions of British pounds (note
that the approximate value of 1 British pound was $1.95
during the 2006–07 season, so the team salaries range from
$34.3 million to $259 million). (Source: BBC, May 28, 2008.)
92.3 132.8 77.6 89.7 43.8 38.4 30.7
29.8 36.9 36.7 43.2 38.3 62.5 36.4
44.2 35.2 27.5 22.4 34.3 17.6
Find the mean for these data.
Assignment
1. Johnson et al. performed a retrospective review of 50 fetuses
that underwent open fetal myelomeningocele closure. The data
below show the gestational age in weeks of the 50 fetuses
undergoing the procedure.
25 25 26 27 29 29 29 30 30 31
32 32 32 33 33 33 33 34 34 34
35 35 35 35 35 35 35 35 35 36
36 36 36 36 36 36 36 36 36 36
36 36 36 36 36 36 36 36 37 37
(a) Construct a stem-and-leaf plot for these gestational ages.
(b) Based on the stem-and-leaf plot, what one word would you
use to describe the nature of the data?
(c) Why do you think the stem-and-leaf plot looks the way it
does?
(d) Compute the mean.
2. The purpose of a study by Tam et al. was to investigate the
wheelchair maneuvering in individuals with lower-level
spinal cord injury (SCI) and healthy controls. Subjects used a
modified wheelchair to incorporate a rigid seat surface to
facilitate the specified experimental measurements.
Interface pressure measurement was recorded by using a
high-resolution pressure-sensitive mat with a spatial
resolution of 4 sensors per square centimeter taped on the
rigid seat support. During static sitting conditions, average
pressures were recorded under the ischial tuberosities. The
data for measurements of the left ischial tuberosity (in mm
Hg) for the SCI and control group are shown below.
Control: 131, 115, 124, 131, 122, 117, 88, 114, 150, 169.
SCI: 60, 150, 130, 180, 163, 130, 121, 119, 130, 148.
Find the mean for the controls and the SCI group.
Other Means
We must not think that the arithmetic mean is the
only important mean. The geometric mean and
harmonic mean are both important in some areas of
engineering.
The geometric mean is defined as the nth root of
the product of n observations:
$$\bar{x} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$$
or, in terms of frequencies,
$$\bar{x} = \sqrt[\sum f]{x_1^{f_1} x_2^{f_2} x_3^{f_3} \cdots x_n^{f_n}}$$
The harmonic mean of a set of
data values is the reciprocal of the arithmetic mean
of the reciprocals of the data values:
$$\bar{x} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
or, in terms of frequencies,
$$\bar{x} = \frac{\sum f}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_n}{x_n}}$$
Example
Obtain the geometric and harmonic means from
the data set I.
Data Set I: 8 16 30 18 22
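Geometric mean
$$\bar{x} = \sqrt[5]{8 \times 16 \times 30 \times 18 \times 22} = \sqrt[5]{1520640} \approx 17.2348$$

Harmonic mean
$$\bar{x} = \frac{5}{\frac{1}{8} + \frac{1}{16} + \frac{1}{30} + \frac{1}{18} + \frac{1}{22}} = \frac{5}{0.321843} \approx 15.5355$$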
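The two definitions above can be checked numerically. The following is a minimal sketch, assuming all observations are positive (both means are undefined otherwise); the function names are illustrative and not part of the original notes. Small differences in the last digits can arise from rounding intermediate values by hand.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive observations."""
    n = len(values)
    return math.prod(values) ** (1.0 / n)


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return n / sum(1.0 / x for x in values)


# Data Set I from the example above
data_set_1 = [8, 16, 30, 18, 22]
gm = geometric_mean(data_set_1)
hm = harmonic_mean(data_set_1)
print(f"Geometric mean: {gm:.4f}")
print(f"Harmonic mean:  {hm:.4f}")
```

Python's standard library also provides `statistics.geometric_mean` and `statistics.harmonic_mean`, which can serve as a cross-check.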
Median
If all the items with which we are concerned are sorted in
order of increasing magnitude (size), from the smallest to
the largest, then the median is the middle item.
As is obvious from the definition of the median, it divides a
ranked data set into two equal parts. The calculation of the
median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the
median.
Note that if the number of observations in a data set is odd,
then the median is given by the value of the middle term in
the ranked data.
However, if the number of observations is even, then the
median is given by the average of the values of the two
middle terms.
Example
The following data give the prices (in thousands of
dollars) of seven houses selected from all houses sold
last month in a city.
312 257 421 289 526 374 497
Find the median.
First, we rank the given data in increasing order as
follows:
257 289 312 374 421 497 526
Since there are seven homes in this data set and the
middle term is the fourth term, the median is given by
the value of the fourth term in the ranked data.
257 289 312 374 421 497 526
Thus, the median price of a house is 374, or $374,000.
Example
The following table gives the 2008 profits (rounded to billions of dollars)
of 12 companies selected from all over the world.
Company          2008 Profits (billions of dollars)
Merck & Co        8
IBM              12
Unilever          7
Microsoft        17
Petrobras        14
Exxon Mobil      45
Lukoil           10
AT&T             13
Nestlé           17
Vodafone         13
Deutsche Bank     9
China Mobile     11

Find the median for these data.


First we rank the given profits as follows:
7 8 9 10 11 12 13 13 14 17 17 45
There are 12 values in this data set. Because there is
an even number of values in the data set, the median
is given by the average of the two middle values. The
two middle values are the sixth and seventh in the
foregoing list of data, and these two values are 12 and
13. The median, which is given by the average of these
two values, is calculated as follows.
7 8 9 10 11 12 13 13 14 17 17 45
$$\text{Median} = \frac{12 + 13}{2} = \frac{25}{2} = 12.5$$
Thus, the median profit of these 12 companies is
$12.5 billion.
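The two-step procedure (rank, then take the middle term or the average of the two middle terms) can be sketched in a few lines. This is an illustrative implementation, not the only way to do it; the function name `median` is chosen here for convenience.

```python
def median(values):
    """Median of a non-empty list: rank the data, then pick the middle."""
    ranked = sorted(values)          # step 1: rank in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd count: the single middle term
        return ranked[mid]
    # even count: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


# Data from the two examples above
house_prices = [312, 257, 421, 289, 526, 374, 497]          # thousands of dollars
profits = [8, 12, 7, 17, 14, 45, 10, 13, 17, 13, 9, 11]     # billions of dollars

print(median(house_prices))
print(median(profits))
```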
Properties of the Median
Properties of the median include the following:
1. Uniqueness. As is true with the mean, there is only
one median for a given set of data.
2. Simplicity. The median is easy to calculate.
3. It is not as drastically affected by extreme values as
is the mean.

The Mode
The mode of a set of values is that value which occurs
most frequently.
If all the values are different there is no mode; on the
other hand, a set of values may have more than one
mode.
Example
The following data give the speeds (in miles per
hour) of eight cars that were stopped on I-95 for
speeding violations.
77 82 74 81 79 84 74 78
Find the mode.
In this data set, 74 occurs twice, and each of the
remaining values occurs only once. Because 74
occurs with the highest frequency, it is the mode.
Therefore,
Mode is 74 miles per hour
A data set with only one value occurring with the
highest frequency has only one mode. The data
set in this case is called unimodal.

A data set with two values that occur with the
same (highest) frequency has two modes. The
distribution, in this case, is said to be bimodal.

If more than two values in a data set occur with
the same (highest) frequency, then the data set
contains more than two modes and it is said to be
multimodal.
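Because a data set may have no mode, one mode, or several, it is convenient to return a list of modes. The sketch below follows the convention stated above that a data set in which every value occurs once has no mode; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    An empty list is returned when every value occurs only once
    (no mode) or when there is no data.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Speeds (miles per hour) from the example above
speeds = [77, 82, 74, 81, 79, 84, 74, 78]
print(modes(speeds))            # one mode: unimodal
print(modes([1, 1, 2, 2, 3]))   # two modes: bimodal
print(modes([1, 2, 3]))         # all values different: no mode
```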
Class activity
1. A brochure from the department of public safety in a northern
state recommends that motorists should carry 12 items
(flashlights, blankets, and so forth) in their vehicles for
emergency use while driving in winter. The following data give
the number of items out of these 12 that were carried in their
vehicles by 15 randomly selected motorists.
5 3 7 8 0 1 0
5 1 21 7 6 7 1 19
Find the means (arithmetic, geometric and harmonic), median,
and mode for these data. Are the values of these summary
measures population parameters or sample statistics? Explain.
2. Nixon Corporation manufactures computer monitors. The
following data are the numbers of computer monitors produced
at the company for a sample of 10 days.
24 32 27 23 35 33 29 40 23 28
Calculate the mean, median, and mode for these data.
Measures of dispersion for Ungrouped data
The measures of central tendency, such as the mean,
median, and mode, do not reveal the whole picture of the
distribution of a data set. Two data sets with the same
mean may have completely different spreads.
The variation among the values of observations for one
data set may be much larger or smaller than for the other
data set.
Note that the words dispersion, spread, and variation have
the same meaning.
Consider the following two data sets on the ages (in years)
of all workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same,
40 years. If we do not know the ages of individual workers at
these two companies and are told only that the mean age of the
workers at both companies is the same, we may deduce that the
workers at these two companies have a similar age distribution.

As we can observe, however, the variation in the workers’ ages
for each of these two companies is very different. The ages of the
workers at the second company have a much larger variation
than the ages of the workers at the first company.

Thus, the mean, median, or mode by itself is usually not a
sufficient measure to reveal the shape of the distribution of a
data set. We also need a measure that can provide some
information about the variation among data values. The
measures that help us learn about the spread of a data set are
called the measures of dispersion. The measures of central
tendency and dispersion taken together give a better picture of
a data set than the measures of central tendency alone.
Under the measure of dispersion we shall discuss
the following:
Range, Mean deviation, Variance, Standard
deviation, Coefficient of variation and
Interquartile Range.
Range
The range is the difference between the largest
and smallest values in a set of observations. If we
denote the range by R, we compute it as follows:
R = Largest value - Smallest value
Example
The following table gives the total areas in square miles of the
four western South-Central states of the United States.
State        Total Area (square miles)
Arkansas      53,182
Louisiana     49,651
Oklahoma      69,903
Texas        267,277
Find the range for this data set.
The maximum total area for a state in this data set is 267,277
square miles, and the smallest area is 49,651 square miles.
Therefore,
Range = Largest value - Smallest value
= 267,277 - 49,651 = 217,626 square miles
Thus, the total areas of these four states are spread over a range
of 217,626 square miles.
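The range is a one-line computation. The sketch below applies it to the state areas and, for comparison, to the two companies' ages given earlier, which share the same mean but differ widely in spread. The helper name `value_range` is illustrative.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    values = list(values)
    return max(values) - min(values)


# Total areas (square miles) from the example above
areas = {
    "Arkansas": 53182,
    "Louisiana": 49651,
    "Oklahoma": 69903,
    "Texas": 267277,
}
print(value_range(areas.values()))

# Ages of workers at the two companies discussed earlier
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]
print(value_range(company_1), value_range(company_2))
```

Since the range uses only the two extreme values, a single unusually large observation (Texas here) dominates the result.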
Mean Deviation from the Mean
The mean deviation from the mean, defined as
$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}, \quad \text{where } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$
is useless because it is always zero.
Example: Consider the following data set:
12, 14, 25, 44, 30, 22.
Obtain the mean deviation.
$$\bar{x} = \frac{\sum x}{n} = \frac{147}{6} = 24.5$$
$x$     $x - \bar{x}$
12      -12.5
14      -10.5
25        0.5
44       19.5
30        5.5
22       -2.5
The sum of the deviations $x - \bar{x}$ equals zero, that is,
(-12.5)+(-10.5)+(0.5)+(19.5)+(5.5)+(-2.5) = 0.
Therefore, the mean deviation equals zero: $\frac{0}{6} = 0$.
Mean Absolute Deviation from the Mean
However, the mean absolute deviation from the
mean, defined as
$$\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n},$$
is used frequently by engineers to show the
variability of their data, although it is usually not
the best choice. Its advantage is that it is simpler
to calculate than the main alternative, the
standard deviation.
Its disadvantage is that it is not simply related to
the parameters of theoretical distributions. For
that reason its routine use is not recommended.
Now, let us obtain the mean absolute deviation for
the above data set.
The sum of $|x_i - \bar{x}|$ equals 51, that is,
(12.5)+(10.5)+(0.5)+(19.5)+(5.5)+(2.5) = 51
Therefore, the mean absolute deviation equals
$\frac{51}{6} = 8.5$.
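The contrast between the two quantities is easy to see numerically: the signed deviations cancel, while the absolute deviations do not. A minimal sketch, with illustrative function names, using the data set from the example:

```python
def mean(values):
    return sum(values) / len(values)


def mean_deviation(values):
    """Average of the signed deviations from the mean (always zero)."""
    x_bar = mean(values)
    return sum(x - x_bar for x in values) / len(values)


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    x_bar = mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)


data = [12, 14, 25, 44, 30, 22]
print(mean(data))
print(mean_deviation(data))
print(mean_absolute_deviation(data))
```

With other data the signed version may print a tiny nonzero number because of floating-point rounding; mathematically it is exactly zero.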
Variance and Standard Deviation
The standard deviation is the most used measure
of dispersion. The value of the standard deviation
tells how closely the values of a data set are
clustered around the mean.
In general, a lower value of the standard deviation
for a data set indicates that the values of that data
set are spread over a relatively smaller range
around the mean. In contrast, a larger value of the
standard deviation for a data set indicates that the
values of that data set are spread over a relatively
larger range around the mean.
The standard deviation is obtained by taking the
positive square root of the variance.
$$SD = \sqrt{\text{Variance}}$$
The variance calculated for population data is
denoted by $\sigma^2$ (read as sigma squared), and the
variance calculated for sample data is denoted by
$s^2$.
Consequently, the standard deviation calculated
for population data is denoted by $\sigma$ and the
standard deviation calculated for sample data is
denoted by $s$.
Following are what we will call the basic formulas that
are used to calculate the variance,
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\sigma^2$ is the population variance and $s^2$ is the
sample variance.
The quantity $x - \mu$ or $x - \bar{x}$ in the above formulas is
called the deviation of the x value from the mean. The
sum of the deviations of the x values from the mean is
always zero.
Likewise, the following are what we will call the
basic formulas that are used to calculate the
standard deviation,
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where $\sigma$ is the population standard deviation and
$s$ is the sample standard deviation.
For example, suppose the midterm scores of a
sample of four students are 82, 95, 67, and 92,
respectively. Then, the mean score for these four
students is
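$$\bar{x} = \frac{336}{4} = 84$$
and the sample variance and standard deviation are
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(-2)^2 + (11)^2 + (-17)^2 + (8)^2}{4 - 1} = \frac{478}{3} \approx 159.33$$
$$SD = \sqrt{s^2} = \sqrt{159.33} \approx 12.62$$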
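The basic formulas translate directly into code. The sketch below implements both the sample version (divisor n - 1) and the population version (divisor N) and applies the sample version to the four midterm scores; the function names are illustrative.

```python
import math


def sample_variance(values):
    """Basic formula: sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Basic formula: sum of squared deviations divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


# Midterm scores of the sample of four students
scores = [82, 95, 67, 92]
s2 = sample_variance(scores)
s = math.sqrt(s2)          # standard deviation: positive square root
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```

For the same data the population formula always gives a smaller value than the sample formula, because it divides the same sum by N rather than n - 1.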
Short-Cut Formulas for the Variance and
Standard Deviation for Ungrouped Data
The standard deviation is obtained by taking the
positive square root of the variance.
$$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \frac{\sum x^2 - N\mu^2}{N}$$
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
The following table gives the 2008 market values
(rounded to billions of dollars) of five international
companies.
Company Market Value (billions of dollars)
PepsiCo 75
Google 107
PetroChina 271
Johnson & Johnson 138
Intel 71
Find the variance and standard deviation for these
data.
$$\sum x = 662, \qquad \sum x^2 = 114600$$
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{114600 - \frac{(662)^2}{5}}{5 - 1} = \frac{114600 - 87648.8}{4} = 6737.80$$
$$s = \sqrt{s^2} = \sqrt{6737.80} \approx 82.08$$
Thus, the variance of the market values of these five
companies is 6737.80 (in squared billions of dollars), and
the standard deviation is $82.08 billion.
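The short-cut computation can be reproduced in code and cross-checked against the standard library. This is a minimal sketch with an illustrative function name; note that with floating-point data whose mean is large relative to its spread, the short-cut form can lose precision, so the basic formula is often the safer choice in software.

```python
import math
import statistics


def sample_variance_shortcut(values):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# 2008 market values (billions of dollars) from the table above
market_values = [75, 107, 271, 138, 71]
s2 = sample_variance_shortcut(market_values)
s = math.sqrt(s2)
print(f"sum x = {sum(market_values)}, sum x^2 = {sum(x * x for x in market_values)}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")

# Cross-check: statistics.variance uses the n - 1 divisor
print(math.isclose(s2, statistics.variance(market_values)))
```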
Observation
• The values of the variance and the standard deviation
are never negative. That is, the numerator in the
formula for the variance should never produce a
negative value. Usually the values of the variance and
standard deviation are positive, but if a data set has
no variation, then the variance and standard deviation
are both zero.
• The measurement units of variance are always the
square of the measurement units of the original data.
This is so because the original values are squared to
calculate the variance.
Coefficient of variation CV
One disadvantage of the standard deviation as a
measure of dispersion is that it is a measure of
absolute variability and not of relative variability.
Sometimes we may need to compare the variability of
two different data sets that have different units of
measurement. In such cases a measure of relative
variability is needed, and the coefficient of variation is
one such measure. The coefficient of variation, denoted by CV,
expresses the standard deviation as a percentage of the
mean and is computed as follows:
For population data: $$CV = \frac{\sigma}{\mu} \times 100\%$$
For sample data: $$CV = \frac{s}{\bar{x}} \times 100\%$$
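A short sketch of the CV computation follows. The function name and the `population` flag are illustrative choices, and the midterm scores from the earlier variance example are reused here purely as sample input. The CV is only meaningful when the mean is not zero.

```python
import statistics


def coefficient_of_variation(values, population=False):
    """Standard deviation expressed as a percentage of the mean.

    Uses the sample standard deviation (n - 1 divisor) by default and
    the population standard deviation when population=True.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    if population:
        sd = statistics.pstdev(values)
    else:
        sd = statistics.stdev(values)
    return sd / mean * 100.0


scores = [82, 95, 67, 92]   # midterm scores used earlier
cv = coefficient_of_variation(scores)
print(f"CV = {cv:.2f}%")
```

Because the CV is unit-free, values computed for data sets measured in different units can be compared directly.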
Class Activity
1. The following are the numbers of babies born during
a year in 48 community hospitals:
30 55 27 45 56 48 45 49 32 57 47
56 37 55 52 34 54 42 32 59 35 46
24 57 32 26 40 28 53 54 29 42 42
54 53 59 39 56 59 58 49 53 30 53
21 34 28 50
Obtain the variance, standard deviation, and
coefficient of variation for these data.
2. The SAT scores of 100 students have a mean of 975
and a standard deviation of 105. The GPAs of the
same 100 students have a mean of 3.16 and a
standard deviation of 22. Is the relative variation in
SAT scores larger or smaller than that in GPAs?
3. Butz et al. (A-10) evaluated the duration of benefit
derived from the use of noninvasive positive-pressure
ventilation by patients with amyotrophic lateral sclerosis on
symptoms, quality of life, and survival. One of the variables
of interest is partial pressure of arterial carbon dioxide
(PaCO2). The values below (mm Hg) reflect the result of
baseline testing on 30 subjects as established by arterial
blood gas analyses.
40.0 47.0 34.0 42.0 54.0 48.0 53.6 56.9 58.0 45.0
54.5 54.0 43.0 44.3 53.9 41.8 33.0 43.1 52.4 37.9
34.5 40.1 33.0 59.9 62.6 54.1 45.7 40.6 56.6 59.0
Compute (a) the mean, (b) the median, (c) the mode, (d) the
range, (e) the variance, (f) the standard deviation, (g) the
coefficient of variation.
