
CHAPTER ONE

1. BASIC CONCEPTS, METHODS OF DATA COLLECTION AND PRESENTATION


1.1. Introduction
1.1.1. Definitions and Classification of Statistics
Definition
Statistics can be defined in two senses:
a) Statistics in its plural sense: Statistics refers to numerical facts, figures or quantitative
information describing aspects of social and economic phenomena. In this sense statistics are the
raw data themselves, like statistics of births, statistics of deaths, statistics of imports and
exports, etc.
b) Statistics in its singular sense: Statistics is a branch of scientific method that deals with the
planning and design of data collection and with the organization, presentation, analysis and
interpretation of data and the drawing of conclusions based on the data.
Classification
Statistics can be divided into two broad areas.
1. Descriptive Statistics is concerned with summarizing or describing important features of
the available data without going beyond the data themselves. It is concerned with summary
calculations, graphs, charts and tables.
2. Inferential Statistics is a method used to generalize from a sample to a population. It
involves the use of data from samples to make inferences about the population from which the
samples are drawn.
For example, the average income of all families (the population) in Ethiopia can be
estimated from figures obtained from a few hundred (the sample) families. Such generalizations
require statistical techniques based on probability theory.
1.1.2. Stages in statistical investigation

The stages or steps in any statistical investigation are


1. Collection of data: The process of measuring, gathering and assembling the raw data upon
which the statistical investigation is to be based. Data can be collected in a variety of ways.
For example, one of the most common methods is the survey, which can itself be carried out in
different ways, such as by questionnaire or interview.
2. Organization of data: Summarization of data in some meaningful way. Organization of data
may involve editing, coding and classification of the collected data.
3. Presentation of data: In this stage the collected and organized data are presented in
some systematic order, with the help of tables, diagrams and graphs, to facilitate statistical
analysis.
4. Analysis of data: The process of extracting numerical descriptions of the data, mainly through
elementary mathematical operations (such as computing the mean or the standard deviation).
5. Interpretation of data: This involves giving meaning to the analyzed data and drawing
conclusions. Statistical techniques based on probability theory are required.

1.1.3. Definitions of some terms

A (statistical) population: is the complete set of possible measurements for which inferences
are to be made. The population represents the target of an investigation, and the objective of the
investigation is to draw conclusions about the population; hence we sometimes call it the target
population.
Examples
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.
The population could be finite or infinite (an imaginary collection of units).
There are two ways of investigation: census and sample survey.
Census: a complete enumeration of the population. In most real problems a census cannot be
carried out, hence we take a sample.
Sample: A sample from a population is the set of measurements that are actually collected in the
course of an investigation. It is a subset of the population and should be selected using some
pre-defined sampling technique in such a way that it represents the population well.
Parameter: Characteristic or measure obtained from a population.
Statistic: Characteristic or measure obtained from a sample.
Sampling: The process or method of sample selection from the population.
Sample size: The number of elements or observations to be included in the sample.
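To make the distinction between a parameter and a statistic concrete, the following is a minimal Python sketch. The population values are hypothetical numbers invented for illustration (they are not data from these notes), and simple random sampling through the standard library's `random.sample` is assumed as the sampling technique.

```python
import random
from statistics import mean

# Hypothetical population: monthly incomes of 1,000 households,
# invented for illustration only.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(1000, 9000) for _ in range(1000)]

# Parameter: a measure computed from the whole population (a census).
population_mean = mean(population)

# Sampling: select the sample by simple random sampling without replacement.
sample_size = 50
sample = rng.sample(population, sample_size)

# Statistic: the same measure computed from the sample only.
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

The two printed values will generally differ a little; that gap is the sampling fluctuation that inferential statistics tries to quantify.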
1.1.4. Applications, Uses and Limitations of statistics.
Applications of statistics:
• Statistics is applied in almost all fields of human endeavor.
• Almost all human beings encounter and use numerical facts in their daily lives.
• It is applicable in processes such as the invention of certain drugs and the assessment of the
extent of environmental pollution.
• It is used in industry, especially in the area of quality control.
Uses of statistics
The main function of statistics is to enlarge our knowledge of complex phenomena. Some uses of
statistics:
• Presenting facts in a definite and precise form.
• Data reduction.
• Measuring the magnitude of variations in data.
• Furnishing a technique of comparison.
• Estimating unknown population characteristics.
• Formulating and testing hypotheses.
• Studying the relationship between two or more variables.
• Forecasting future events.

Limitations of statistics
As a science, statistics has its own limitations.
Some of the limitations:
• It deals only with aggregates of facts and not with individual data items.
• Statistical data are only approximately, and not mathematically, correct.
• Statistics can easily be misused and therefore should be used by experts.
1.1.5. Types of variables and measurement scales

Variable: It is an attribute or characteristic that can assume different values.


Variables are divided into two types: qualitative and quantitative variables.
1. Qualitative variables are nonnumeric variables and cannot be measured.
Examples: gender, religious affiliation, and state of birth.
2. Quantitative variables are numerical variables and can be measured. Examples: balance
in checking account, number of children in family.
Note that quantitative variables are either discrete or continuous.
Discrete variable: It assumes a finite or countable number of possible values. It is usually
obtained by counting. Examples: number of children in a family, number of cars at a traffic
light.
Continuous variable: It can assume any value within a defined range. Continuous variables
are usually obtained by measuring. Examples: weight in kg, height, time, air pressure in a tire.
Measurement scales
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order, distance
and fixed zero.
Order
The property of order exists when an object that has more of the attribute than another object, is
given a bigger number by the rule system.
Distance
The property of distance is concerned with the relationship of differences between objects. If a
measurement system possesses the property of distance it means that the unit of measurement
means the same thing throughout the scale of numbers. More precisely, an equal difference
between two numbers reflects an equal difference in the "real world" between the objects that
were assigned the numbers.
Fixed zero (true zero)
True zero is related to the property of absolute absence of characteristic under consideration.
The property of fixed zero (true zero) is necessary for ratios between numbers to be meaningful.
Scale types
Four levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and
ratio. Each possesses different properties of measurement systems.

Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated above.
• Level of measurement which classifies data into mutually exclusive, all-inclusive
categories in which no order or ranking can be imposed on the data.
• No arithmetic or relational operations can be applied.
Examples:
• Sex (male or female)
• Marital status (married, single, widowed, divorced)
• Country code
• Regional differentiation of Ethiopia
Ordinal Scales
Ordinal scales are measurement systems that possess the property of order, but not the property
of distance. The property of fixed zero is not important if the property of distance is not satisfied.
• Level of measurement which classifies data into categories that can be ranked.
Differences between the ranks are not meaningful.
• Arithmetic operations are not applicable but relational operations are applicable.
• Ordering is the sole property of the ordinal scale.
Examples: Rating scales (excellent, very good, good, fair, poor), military status.
Interval Scales
Interval scales are measurement systems that possess the properties of order and distance, but
not the property of fixed zero.
• Level of measurement which classifies data that can be ranked and for which differences are
meaningful. However, there is no meaningful zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Examples: Temperature in degrees Celsius (°C) or Fahrenheit (°F);
your score on an individual intelligence test as a measure of your intelligence.
A temperature of 0°C does not mean that there is no temperature. Furthermore, a temperature of
30°C in town X on a specific day is not twice as warm as 15°C on another day in the same
town, because the zero point of the Celsius scale is arbitrary.
Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed
zero. The added power of a fixed zero allows ratios of numbers to be meaningfully interpreted;
e.g. one can say that the ratio of one person's height to another person's height is 1.32, a
statement that is not possible with interval scales.
• Level of measurement which classifies data that can be ranked, for which differences are
meaningful, and which has a true zero. True ratios exist between the different units of
measure.
• All arithmetic and relational operations are applicable.
Examples: Weight, Height, Number of students, Age
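The difference between the interval and ratio scales can be checked numerically. The sketch below is an illustration, not part of the original notes: it converts the two Celsius temperatures from the interval-scale example to kelvin, a temperature scale with a true zero (the offset 273.15 is the standard conversion constant). Differences survive the change of scale, ratios do not.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to kelvin (ratio scale, true zero)."""
    return c + 273.15

warm, cool = 30.0, 15.0

# Differences are meaningful on an interval scale: they agree in both units.
diff_celsius = warm - cool
diff_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios are not: the apparent "2.0" depends on where the arbitrary zero sits.
ratio_celsius = warm / cool
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

On the kelvin scale the ratio is only about 1.05, so 30°C is far from being "twice as warm" as 15°C.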

Exercises: Classify the following different measurement systems into one of the four types of
scales.
1. Your checking account number as a name for your account.
2. Your checking account balance as a measure of the amount of money you have in that account
3. Your score on the first statistics test as a measure of your knowledge of statistics
4. A response to the statement "Abortion is a woman's right" where "Strongly Disagree" = 1,
"Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a measure of
attitude toward abortion.
5. Times for swimmers to complete a 50-meter race
6. Months of the year Meskerm, Tikimit…
7. Socioeconomic status of a family when classified as low, middle and upper classes.
8. Blood type of individuals, A, B, AB and O.
9. Pollen counts provided as numbers between 1 and 10, where 1 implies there is almost no pollen
and 10 that it is rampant, but for which the values do not represent an actual count of grains of
pollen.
10. Region numbers of Ethiopia
11. The number of students in a college
12. The net wages of a group of workers

CHAPTER TWO

2.1 Methods of Data Collection and Presentation

2.1.1 Methods of data collection


The statistical data may be classified under two categories, depending upon the sources – (1)
Primary data (2) Secondary data.
Primary Data: are those data, which are collected by the investigator himself for the purpose of
a specific inquiry or study. Such data are original in character and are mostly generated by
surveys conducted by individuals or research institutions.
Secondary Data: When an investigator uses data, which have already been collected by others,
such data are called "Secondary Data".
The secondary data can be obtained from journals, reports, government publications,
publications of professionals and research organizations.

According to the role of time, data are classified into cross-section and time series data.
Cross-section data are a set of observations taken at one point in time, while time series data
are a set of observations collected for a sequence of times, usually at equal intervals, which may
be weekly, monthly, quarterly, yearly, etc.

Before any statistical work can be done data must be collected. Depending on the type of
variable and the objective of the study different data collection methods can be employed. In the
collection of data we have to be systematic. If data are collected haphazardly, it will be difficult
to answer our research questions in a conclusive way.
Various data collection techniques can be used, such as:
• Observation
• Using available information
• Interview (face-to-face/telephone interviews)
• Focus group discussions (FGD)
• Questionnaire (mailed and self-administered questionnaire)
• Other data collection techniques – life histories, case studies, etc.
i) Observation – It includes all methods from simple visual observation to the use of
sophisticated equipment or facilities, such as radiographic and X-ray machines and
microscopes.
An observation guide should be prepared prior to data collection.
Advantages: Gives relatively more detailed, accurate and context-related information.
Disadvantages: It is subject to the observer's own biases, prejudices and desires, and it needs
more resources and skilled human power when sophisticated machines are used.
ii) Interview
Could be face to face /telephone interview
Advantage:
- suitable for use with illiterates
- permits clarifications of questions

- higher response rate than self-administered questionnaire

Disadvantage:
- presence of interviewer can influence the response
- more costly than self-administered questionnaire
iii) Questionnaire (Mailed and self-administered questionnaire)
A questionnaire is a list of questions arranged in a predetermined sequence for a predetermined
purpose.
Self-administered questionnaires: under this method, the questionnaire is distributed by hand to
the respondents. The use of self-administered questionnaires is simpler and cheaper; such
questionnaires can be administered to many persons simultaneously (e.g. to a class of students).
Mailed Questionnaire Method
- The questionnaires are sent by post to the informants.
Limitations of questionnaire:
• The method can be used only if the respondents are educated.
• The response rates tend to be relatively low.
• Informants may not return the completed questionnaire, and even if they do, they
may have filled it in incorrectly.
• It may not give the investigator a chance to explain the questions or to ask supplementary
and follow-up questions.
Types of questions used in a questionnaire
Depending on how questions are asked and recorded we can distinguish two major possibilities:
open-ended questions and closed-ended questions.
a) Open-ended questions: Open-ended questions permit free responses that should be recorded
in the respondent’s own words. The respondent is not given any possible answers to choose
from. Such questions are useful to obtain information on:
• Facts with which the researcher is not very familiar
• Opinions, attitudes and suggestions of informants, or sensitive issues
b) Closed-ended questions: Closed questions offer a list of possible options or answers from
which the respondents must choose. When designing closed questions one should try to:
• Offer a list of options that are exhaustive and mutually exclusive
• Keep the number of options as few as possible.
2.1.2 Methods of Data Presentation

The data collected in a survey is called raw data. In most cases, useful information is not
immediately evident from the mass of unsorted data. Collected data need to be organized in such
a way as to condense the information they contain in a way that will show patterns of variation
clearly. Precise methods of analysis can be decided upon only when the characteristics of the
data are understood. For this purpose, different techniques of data organization
and presentation, such as ordered arrays, tables and diagrams, are used.
Statistical Tables
A statistical table is an orderly and systematic presentation of data in rows and columns. Rows
are horizontal and columns are vertical arrangements. The use of tables for organizing, for
example, qualitative data involves grouping the data into mutually exclusive categories of the
variable and counting the number of occurrences (frequency) in each category.
The simple frequency table is used when the individual observations involve only a single
variable, whereas cross tabulation is used to obtain the frequency distribution of one variable
by the subsets of another variable.
Examples:
Simple or one-way table
Table 1: Immunization status of 210 children in a certain Woreda
Immunization status     Number of children   Percent (%)
Not immunized           75                   35.7
Partially immunized     57                   27.1
Fully immunized         78                   37.2
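The percent column of a simple frequency table can be reproduced from the counts alone. The following Python sketch uses the counts of Table 1; the variable names are illustrative choices, not part of the notes.

```python
# Counts from Table 1 (immunization status of 210 children).
counts = {
    "Not immunized": 75,
    "Partially immunized": 57,
    "Fully immunized": 78,
}

total = sum(counts.values())

# Percent = (category frequency / total frequency) * 100
percents = {status: 100 * n / total for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status:<22}{n:>5}{percents[status]:>8.1f}")
print(f"{'Total':<22}{total:>5}{sum(percents.values()):>8.1f}")
```

Rounded to one decimal place the three shares come to 35.7, 27.1 and 37.1, which add up to 99.9; Table 1 reports 37.2 for the last category, presumably adjusted so that the column totals 100.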

Two-way table: This table shows two characteristics and is formed when either the row or the
column is divided into two or more parts.
Table 2: Immunization status by marital status of the women of childbearing age in a town.
                  Immunization status
Marital status    Immunized    Non-immunized    Total
Single            58           177              235
Married           156          294              450
Divorced          10           18               28
Widowed           7            7                14
Total             231          496              727
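The marginal totals of a two-way table are just row sums and column sums of the cell counts. A minimal Python sketch using the cell counts of Table 2 (the dictionary layout is an illustrative choice):

```python
# Cell counts from Table 2: (immunized, non-immunized) per marital status.
table = {
    "Single":   (58, 177),
    "Married":  (156, 294),
    "Divorced": (10, 18),
    "Widowed":  (7, 7),
}

# Row totals: sum across the columns of each row.
row_totals = {status: sum(cells) for status, cells in table.items()}

# Column totals: sum down each column.
col_totals = [sum(col) for col in zip(*table.values())]

# The grand total can be reached from either margin.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals, grand_total)
```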

Frequency distributions
For data to be more easily appreciated and to draw quick comparisons, it is often useful to
arrange the data in the form of a table, or in one of a number of different graphical forms.
Frequency: is the number of times a certain value of the variable is repeated in the given data.
It is the number of observations belonging to a given value or group.
Frequency distribution: is a table which contains the values and the corresponding frequencies.
From the definition, a frequency distribution has two parts, namely- the values of the variables
on the one hand and the number of observations (frequency) corresponding to the values of the
variables on the other.
Array (ordered array): is a serial arrangement of numerical data in an ascending or descending
order.
Types of frequency distribution
There are two types of frequency distributions: categorical (qualitative) and numerical
(quantitative).
1. Categorical frequency distribution: Here data are classified according to non-
numerical categories. To construct a categorical frequency distribution, the categories
contained in the frequency distribution must be mutually exclusive and exhaustive. In
other words, an element must be counted in one and only one category.

Example: Seniors of a high school were interviewed on their plan after completing high school.
The following data give plans of 548 seniors of a high school.

Seniors' plan                                 Number of seniors
Plan to attend college                        240
May attend college                            146
Plan to or may attend a vocational school     57
Will not attend any school                    105
Total                                         548
2. Numerical frequency distribution: In this frequency distribution, data are classified
according to numerical size. Numerical frequency distributions are either discrete or
continuous according to whether the variable is discrete or continuous.
Continuous grouped frequency distribution:
Example: 10,392 persons were surveyed by a social scientist who wants to study the age of
persons arrested in a country. We can construct a continuous frequency distribution for this data,
since age is a continuous variable. In connection with large sets of data, a good overall picture
and sufficient information can often be conveyed by grouping the data into a number of class
intervals as shown below.
Age (years)     Number of persons
Under 18        1,748
18-24           3,325
25-34           3,149
35-44           1,323
45-54           512
55 and over     335
Total           10,392

This kind of frequency distribution is called a grouped frequency distribution. Frequency
distributions present data in a relatively compact form, give a good overall picture, and contain
information that is adequate for many purposes, but there are usually some things which can be
determined only from the original data. For instance, the above grouped frequency distribution
cannot tell how many of the arrested persons are 19 years old, or how many are over 62.
Some terminologies used in a continuous grouped frequency distribution
Class frequency (f): refers to the number of observations belonging to a class.
Class limits: are the lowest (called lower class limit, LCL) and highest (called upper class
limit, UCL) values that can be included in a class.
Unit of measurement (U): the distance between two possible consecutive measures. It is usually
taken as 1, 0.1, 0.01, 0.001, etc.
Class boundaries: are the values that fall halfway between the class limits of adjacent classes.
The boundaries have one more decimal place than the raw data and therefore do not appear in
the data. Each class has a lower class boundary (LCB) and an upper class boundary (UCB), where
UCB = UCL + ½*U and LCB = LCL – ½*U.

Class mark (class midpoint, mi): is the value located halfway between the lower and upper
class limits of that class. The class mark of the ith class, denoted by mi, is
mi = ½*(LCL + UCL) = ½*(LCB + UCB).
Class width (class size-w): is the difference between the upper and lower class boundaries of the
class, that is, w = UCB – LCB. It is also the difference between the lower limits of any two
consecutive classes or the difference between any two consecutive class marks.
Cumulative frequencies: when the frequencies of two or more classes are added up, the totals
are called cumulative frequencies. These frequencies help us to find the total number
of items whose values are less than or greater than some value.
More than cumulative frequency: it is the total frequency of all values greater than or equal to
the lower class boundary of a given class.
Less than Cumulative frequency: it is the total frequency of all values less than or equal to the
upper class boundary of a given class.
Relative frequency: it is the frequency of each value or class divided by the total frequency
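The definitions of class boundaries, class mark and class width translate directly into a few lines of code. The sketch below is illustrative: the function name `class_summary` and the sample class (limits 10 and 14, whole-number data so U = 1) are assumptions chosen for the demonstration.

```python
def class_summary(lcl, ucl, unit=1):
    """Return boundaries, class mark and width for one class.

    lcl, ucl : lower and upper class limits
    unit     : unit of measurement U (distance between consecutive measures)
    """
    lcb = lcl - unit / 2      # LCB = LCL - U/2
    ucb = ucl + unit / 2      # UCB = UCL + U/2
    mark = (lcl + ucl) / 2    # equals (LCB + UCB) / 2
    width = ucb - lcb         # w = UCB - LCB
    return {"LCB": lcb, "UCB": ucb, "mark": mark, "width": width}

# A class with limits 10 and 14 for data recorded in whole units (U = 1).
print(class_summary(10, 14))
```

Note that the width comes out as 5, not 14 – 10 = 4, because it is measured between boundaries rather than between limits.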
Steps in the construction of a grouped continuous frequency distribution:
• Determine the number of classes to use, preferably between 5 and 20. The approximate
number of classes (K) can be obtained from Sturges' formula, given by:
K = 1 + 3.322×log(n), where log is the base-10 logarithm and n is the number of observations.
• Determine the class size (class width) as:
W = (Maximum value – Minimum value)/K = Range/K.
• Pick a suitable starting point less than or equal to the minimum value. The starting point
is called the lower limit of the first class. Continue to add the class width to this lower
limit to get the rest of the lower limits.
• To find the upper limit of the first class, subtract U from the lower limit of the second
class. Then continue to add the class width to this upper limit to find the rest of the upper
limits.
• Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units
to the upper limits.
• Find the frequency and relative frequency of each class.
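The first four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, K is rounded to the nearest whole number, and the width is rounded up to a whole number, which suits data recorded in whole units (U = 1). Classes are added until the maximum value is covered, so the final count may differ slightly from K.

```python
import math

def sturges_classes(n):
    """Approximate number of classes: K = 1 + 3.322 * log10(n), rounded."""
    return round(1 + 3.322 * math.log10(n))

def class_limits(minimum, maximum, n, unit=1):
    """Return a list of (LCL, UCL) pairs, starting at the minimum value."""
    k = sturges_classes(n)
    width = math.ceil((maximum - minimum) / k)  # round Range/K up
    limits = []
    lower = minimum
    # Keep adding classes until the maximum value is covered.
    while not limits or limits[-1][1] < maximum:
        limits.append((lower, lower + width - unit))  # UCL = next LCL - U
        lower += width
    return limits

# For instance: 80 whole-number observations ranging from 10 to 44.
print(sturges_classes(80))
print(class_limits(10, 44, n=80))
```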

Example: Construct a grouped frequency distribution of the following data on the amount of time
(in hours) that 80 college students devoted to leisure activities during a typical school week:
23 24 18 14 20 24 24 26 23 21 16 15 19 20 22 14 13 20 19 27 29 22 38
28 34 44 23 19 21 31 16 28 19 18 12 27 15 21 25 16 30 17 22 29 29 18
25 20 16 11 17 12 15 24 25 21 22 17 18 15 21 20 23 18 17 15 16 26 23
22 11 16 18 20 23 19 17 15 20 10
Solution:
Maximum value = 44 and Minimum value = 10, so Range = 44 – 10 = 34.
Using the above formula: K = 1 + 3.322 × log(80) = 7.32 ≈ 7 classes.
Class width: W = Range/K = 34/7 = 4.857 ≈ 5.
Let 10 be the lower limit of the first class. That is, LCL1 = 10, LCL2 = 10 + W = 10 + 5 = 15, etc.

10, 15, 20, 25, 30, 35, and 40 are the lower class limits.
Find the upper class limits; e.g. the first upper class limit is UCL1 = LCL2 – U = 15 – 1 = 14,
and UCL2 = UCL1 + W = 14 + 5 = 19, etc.
14, 19, 24, 29, 34, 39, 44 are the upper class limits.

Time spent (hours)    Frequency
10-14                 8
15-19                 28
20-24                 27
25-29                 12
30-34                 3
35-39                 1
40-44                 1

The class boundaries are calculated by: UCB = UCL + ½*U and LCB = LCL – ½*U.
Example: Determine the class boundaries for the above frequency distribution.
With U = 1: UCB1 = UCL1 + ½*U = 14 + 0.5 = 14.5 and LCB1 = LCL1 – ½*U = 10 – 0.5 = 9.5, etc.
The class marks are calculated as: m1 = ½*(UCL1 + LCL1) = ½*(UCB1 + LCB1) = 12,
m2 = ½*(UCL2 + LCL2) = 17, etc.
So, the complete frequency distribution table with cumulative frequencies is as follows.
Class limit   Class boundary   Class mark (mi)   Frequency (fi)   Relative frequency   Less than cf   More than cf
10 – 14       9.5 – 14.5       12                8                0.1                  8              80
15 – 19       14.5 – 19.5      17                28               0.35                 36             72
20 – 24       19.5 – 24.5      22                27               0.3375               63             44
25 – 29       24.5 – 29.5      27                12               0.15                 75             17
30 – 34       29.5 – 34.5      32                3                0.0375               78             5
35 – 39       34.5 – 39.5      37                1                0.0125               79             2
40 – 44       39.5 – 44.5      42                1                0.0125               80             1
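The whole table can be rebuilt from the raw observations, which is a useful check on hand tallies. The Python sketch below uses the 80 leisure-time values of the example, reading the run-together entry in the first data row as 22 and 38 (the reading under which the count comes to 80). Variable names are illustrative.

```python
from itertools import accumulate

# The 80 leisure-time observations (hours) from the example.
data = [
    23, 24, 18, 14, 20, 24, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 20, 19, 27, 29, 22, 38,
    28, 34, 44, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16, 30, 17, 22, 29, 29, 18,
    25, 20, 16, 11, 17, 12, 15, 24, 25, 21, 22, 17, 18, 15, 21, 20, 23, 18, 17, 15, 16, 26, 23,
    22, 11, 16, 18, 20, 23, 19, 17, 15, 20, 10,
]

limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
unit = 1  # data are recorded to the nearest hour

# Class frequency: number of observations lying between the class limits.
freq = [sum(lcl <= x <= ucl for x in data) for lcl, ucl in limits]
n = sum(freq)

rel_freq = [f / n for f in freq]
less_than_cf = list(accumulate(freq))              # running total from the first class
more_than_cf = list(accumulate(freq[::-1]))[::-1]  # running total from the last class

print("limits  boundaries  mark    f   rel.f   <cf   >cf")
for (lcl, ucl), f, rf, lcf, mcf in zip(limits, freq, rel_freq, less_than_cf, more_than_cf):
    lcb, ucb = lcl - unit / 2, ucl + unit / 2
    mark = (lcl + ucl) / 2
    print(f"{lcl}-{ucl}   {lcb}-{ucb}   {mark:>5}  {f:>3}  {rf:.4f}  {lcf:>4}  {mcf:>4}")
```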

Diagrammatic and graphical presentation of Data


An appropriately drawn graph or diagram allows readers to obtain rapidly an overall grasp of the
data presented. The relationship between numbers of various magnitudes can usually be seen
more quickly and easily from a graph or diagram than from a table.
• Bar charts and pie charts are commonly used diagrammatic presentations for qualitative
data.
• Histograms, frequency polygons and ogive curves are graphical presentations of
quantitative continuous data.
Type of Diagrams
1) Bar Chart:
There are different types of bar charts; the most important ones are the simple bar chart,
component bar chart and multiple bar chart.

a) Simple bar chart: It is a one-dimensional chart in which the bar represents the whole of
the magnitude. The height or length of each bar indicates the size (frequency) of the
figure represented.
Consider the data on immunization status of children (Table 1)

[Figure: simple bar chart of number of children by immunization status: not immunized 75,
partially immunized 57, fully immunized 78]
Fig. 1 Immunization status

b) Component Bar chart: Bars are sub-divided into component parts of the figure. These
sorts of diagrams are constructed when each total is built up from two or more
component figures. This is done by dividing the bars into parts representing the
components and shading them accordingly.
Consider the data on immunization status of women by marital status (Table 2).
[Figure: component bar chart; one bar per marital status (single, married, divorced, widowed),
each divided into immunized and non-immunized parts]
Fig. 2. Immunization status by marital status of women 15-49 years

c) Multiple bar charts: In this type of chart the component figures are shown as separate
bars adjoining each other. The height of each bar represents the actual frequency of the
component figure. It is used when the distributional pattern of more than one variable is to be
depicted and comparisons of each component are desired.

Example of multiple bar chart: consider the data on immunization status of women by marital
status (Table 2).
[Figure: multiple bar chart; for each marital status (single, married, divorced, widowed) an
immunized bar and a non-immunized bar are drawn side by side]
Fig. 3. Immunization status by marital status of women 15-49 years

2) Pie chart: A pie chart is a circle representing categorical data, divided into sectors whose
central angles share the full 360° in proportion to the amount associated with each category. The
share of each category can be expressed either as a percentage or as an angle. That is,
degree of central angle of a category = (amount of the category / total amount) * 360°
proportion of a category = (frequency of the category / total frequency) * 100%
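As a quick illustration of the two formulas, the sketch below computes the central angle and the percentage share for the immunization counts of Table 1. Variable names are illustrative.

```python
# Counts from Table 1 (immunization status of 210 children).
counts = {"Not immunized": 75, "Partially immunized": 57, "Fully immunized": 78}
total = sum(counts.values())

# Central angle = (amount of the category / total amount) * 360 degrees
angles = {status: n / total * 360 for status, n in counts.items()}

# Proportion = (frequency of the category / total frequency) * 100 %
shares = {status: n / total * 100 for status, n in counts.items()}

for status in counts:
    print(f"{status:<22}{angles[status]:>7.1f} deg{shares[status]:>7.1f} %")
```

The three angles necessarily add up to 360°, just as the three percentages add up to 100%.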

[Figure: pie chart with sectors NI (not immunized) 36%, PI (partially immunized) 27% and
FI (fully immunized) 37%]
Fig. 4. Immunization status of children
Type of Graphs
The following are the most commonly used graphical presentations of data.
1) Histograms: A histogram is the graph of the frequency distribution of continuous
measurement variables. It is constructed on the basis of the following principles:
a) The horizontal axis is a continuous scale running from one extreme end of the distribution to
the other. It should be labeled with the name of the variable and the units of measurement.
b) For each class in the distribution a vertical rectangle is drawn with (i) its base on the
horizontal axis, extending from the lower class boundary of the class to the upper class boundary,
so that there is never any gap between the rectangles of the histogram; and (ii) the width of its
base determined by the width of the class interval. If a distribution with unequal class
intervals is to be presented by means of a histogram, it is necessary to make an adjustment for
the varying widths of the class intervals.

Example: Consider the data on time (in hours) that 80 college students devoted to leisure
activities during a typical school week. Draw the histogram

2) Frequency Polygon: If we join the midpoints of the tops of the adjacent rectangles of the
histogram with line segments, a frequency polygon is obtained. When the polygon is continued to
the X-axis just outside the range of the data (at the midpoints of the empty classes next to the
first and last classes), the total area under the polygon will be equal to the total area under
the histogram.
Example: Consider the above data on time spent on leisure activities.
[Figure: frequency polygon; X-axis: time spent (hours) from 0 to 45, Y-axis: frequency from 0 to
30, with plotted frequencies 8, 28, 27, 12, 3, 1, 1]
Fig 5: Frequency polygon curve on time spent for leisure activities by students
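The points of the frequency polygon, and the claim that its area matches that of the histogram, can be checked with a short Python sketch. It uses the class marks and frequencies of the leisure-time distribution; the trapezoid-rule area calculation is an illustrative addition, not a step from the notes.

```python
# Class marks and frequencies from the leisure-time distribution.
marks = [12, 17, 22, 27, 32, 37, 42]
freqs = [8, 28, 27, 12, 3, 1, 1]
width = 5  # common class width

# Close the polygon on the X-axis by adding one empty class at each end.
xs = [marks[0] - width] + marks + [marks[-1] + width]
ys = [0] + freqs + [0]
polygon = list(zip(xs, ys))
print(polygon)

# Area under the polygon (sum of trapezoids) versus area of the histogram.
polygon_area = sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(polygon, polygon[1:]))
histogram_area = width * sum(freqs)
print(polygon_area, histogram_area)
```

Both areas come to 400 (class width 5 times total frequency 80), which is why the polygon has to be brought down to the axis at 7 and 47 rather than stopped at the first and last class marks.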

3) Ogive or Cumulative Frequency Curve: When the cumulative frequencies of a distribution
are graphed, the resulting curve is called an ogive. Ogives are of two types, namely the “less
than” ogive and the “more than” ogive.
Less than Ogive: in this case the “less than” cumulative frequencies are plotted against the upper
class boundaries of their respective classes, and adjacent points are joined by line segments.
More than Ogive: in this case the “more than” cumulative frequencies, scaled on the Y-axis, are
plotted against the lower class boundaries of their respective classes, scaled on the X-axis, and
adjacent points are joined by line segments.
Example: Consider the above data on time spent on leisure activities.
[Figure: “less than” and “more than” ogives; X-axis: class boundaries from 9.5 to 44.5,
Y-axis: cumulative frequency from 0 to 90]
Fig 7: Cumulative frequency curve for amount of time college students devoted to leisure
activities
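The points that make up the two ogives can be generated from the class boundaries and frequencies. The Python sketch below is illustrative; it starts the "less than" curve at zero on the first lower boundary and ends the "more than" curve at zero on the last upper boundary.

```python
from itertools import accumulate

# Class boundaries and frequencies of the leisure-time distribution.
boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5]
freqs = [8, 28, 27, 12, 3, 1, 1]
n = sum(freqs)

# Number of observations below each boundary (0 below the first one).
below = [0] + list(accumulate(freqs))

# "Less than" ogive: cumulative frequency against upper class boundaries.
less_than = list(zip(boundaries, below))

# "More than" ogive: cumulative frequency against lower class boundaries.
more_than = list(zip(boundaries, [n - b for b in below]))

print(less_than)
print(more_than)
```

At every boundary the two cumulative frequencies add up to the total number of observations, 80.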

CHAPTER THREE
3. SUMMARIZING OF DATA
3.1. Measures of Central Tendency
When we want to make comparison between groups of numbers it is good to have a single value
that is considered to be a good representative of each group. This single value is called the
average of the group. Averages are also called measures of central tendency.
Objectives
Since the number of sample points is frequently large and it is easy to lose track of the overall
picture by looking at all the data at once, the data must be summarized as briefly as possible.
Some objectives of measuring central tendency:
• To comprehend (understand) the data easily.
• To facilitate comparison.
• To make further statistical analysis.

The Summation Notation


Let X1, X2, X3, …, Xn be a number of measurements, where n is the total number of observations
and Xi is the i-th observation.
The symbol $\sum_{i=1}^{n} X_i$ (read as “the sum of Xi where i runs from 1 to n”) is mathematical
shorthand for X1+X2+X3+...+Xn. That is,
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$$

Example: Suppose the following were scores made on the first homework assignment for five
students in the class: 5, 7, 7, 6, and 8.
$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$$

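Properties of Summation
$$\sum_{i=1}^{n} k = nk, \quad \text{where } k \text{ is any constant}$$
$$\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i, \quad \text{where } k \text{ is any constant}$$
$$\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i, \quad \text{where } a \text{ and } b \text{ are constants}$$
$$\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$$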

Example: Consider the following data and determine the sums below.

Xi    5  7  7  6  8
Yi    6  7  8  7  8

a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 36$
c) $\sum_{i=1}^{5} 10 = 10 \times 5 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = \sum_{i=1}^{5} X_i + \sum_{i=1}^{5} Y_i = 69$
e) $\sum_{i=1}^{5} (X_i - Y_i) = -3$
f) $\sum_{i=1}^{5} X_i Y_i = 241$
g) $\sum_{i=1}^{5} X_i^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 1188$
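Each of the sums in this example can be verified with Python's built-in `sum`, which plays the role of the summation sign. The variable names below are illustrative; the constants a = 2 and b = 3 in the last check are arbitrary values chosen only to exercise the property for a + bXi.

```python
X = [5, 7, 7, 6, 8]
Y = [6, 7, 8, 7, 8]
n = len(X)

sum_x = sum(X)                                     # a) sum of Xi
sum_y = sum(Y)                                     # b) sum of Yi
sum_const = sum(10 for _ in range(n))              # c) sum of a constant = n * k
sum_x_plus_y = sum(x + y for x, y in zip(X, Y))    # d) sum of (Xi + Yi)
sum_x_minus_y = sum(x - y for x, y in zip(X, Y))   # e) sum of (Xi - Yi)
sum_xy = sum(x * y for x, y in zip(X, Y))          # f) sum of Xi * Yi
sum_x_sq = sum(x ** 2 for x in X)                  # g) sum of Xi squared
prod_of_sums = sum_x * sum_y                       # h) (sum of Xi) * (sum of Yi)

print(sum_x, sum_y, sum_const, sum_x_plus_y,
      sum_x_minus_y, sum_xy, sum_x_sq, prod_of_sums)

# Property check with arbitrary constants: sum(a + b*Xi) = n*a + b*sum(Xi)
a, b = 2, 3
print(sum(a + b * x for x in X) == n * a + b * sum_x)
```

Comparing f) with h), and g) with the square of a), shows that the sum of products is not the product of sums.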

Important characteristics of a good average (Measures of Central Tendency)


1. It should be easy to calculate and understand.
2. It should be based on all the observations during computation.
3. It should be rigidly defined.
4. It should be representative of the data. If it is computed from a sample, the sample should be random enough to represent the population accurately.
5. It should have sampling stability; that is, it should not be affected much by sampling fluctuations.
6. It should not be unduly affected by extreme values, that is, by a few very small or very large items in the data.
Now we will discuss the various measures of central tendency.
Types of measures of central tendency
The different measures of central tendency are the Mean (Arithmetic, Geometric and Harmonic),
the Mode, the Median.
3.1.1.Mathematical Average
The Arithmetic Mean:
It is defined as the sum of the magnitude of the items divided by the number of items.
Suppose $X_1, X_2, X_3, \dots, X_n$ are n observed values in a sample of size n. Then the arithmetic mean of the sample, denoted by $\bar{X}$, is given by:
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$.
If we take an entire population, the mean is denoted by $\mu$ and is given by:
$\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$, where N stands for the total number of observations in the population.

Example: Suppose the sample consists of birth weights (in grams) of live born infants at a
private hospital in a certain city during a 1-week period. These sample birth weights are:
3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609,
3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834.
Then find arithmetic mean for the sample birth weights.
Solution: $\bar{X} = \frac{1}{20}\sum_{i=1}^{20} X_i = \frac{1}{20}(3265 + 3323 + \dots + 2834) = \frac{63338}{20} = 3166.9$ grams.

If X is a variable taking the values $X_1, X_2, \dots, X_k$ with frequencies $f_1, f_2, \dots, f_k$ respectively, then its arithmetic mean is given by:
$\bar{X} = \frac{X_1 f_1 + X_2 f_2 + \dots + X_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} X_i f_i}{\sum_{i=1}^{k} f_i}$.

Example: Suppose the X values are 3, 5, 4, 2, 7 and 6 with corresponding frequencies of 2, 1, 3, 2, 1 and 1 respectively. Find the mean of the data.
Xi              3   5   4   2   7   6
Frequency, fi   2   1   3   2   1   1

Solution: $\bar{X} = \frac{3(2) + 5(1) + 4(3) + 2(2) + 7(1) + 6(1)}{2 + 1 + 3 + 2 + 1 + 1} = \frac{40}{10} = 4$.

Mean for Grouped Data


This method is applicable where the entire range of observations has been grouped into a continuous frequency distribution. In such cases the mean of the distribution is computed as:
$\bar{X} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$, where
 k is number of classes,
 mi is the midpoint of the ith class and
 fi is the ith class frequency.

Example: Calculate the mean for grouped data on the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week given below:
Time spent (hours) Frequency
10 – 14 8
15 – 19 28
20 – 24 27
25 – 29 12
30 – 34 3
35 – 39 1
40 - 44 1
Solution:
 First find the class marks (midpoints)
 Find the product of frequency and class marks
 Find mean using the formula.
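The class marks of the distribution are: 12, 17, 22, 27, 32, 37, 42.
Then the mean of the data is computed as:
$\bar{X} = \frac{\sum_{i=1}^{7} m_i f_i}{\sum_{i=1}^{7} f_i} = \frac{12(8) + 17(28) + 22(27) + 27(12) + 32(3) + 37(1) + 42(1)}{8 + 28 + 27 + 12 + 3 + 1 + 1} = \frac{1665}{80} \approx 20.8$ hours.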
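The three steps of the solution (class marks, products, division by the total frequency) can be sketched in Python as follows. The function name `grouped_mean` and the list names are illustrative assumptions, not part of the original notes.

```python
def grouped_mean(class_limits, freqs):
    """Mean of a grouped frequency distribution, using class midpoints."""
    # Step 1: class marks (midpoints) of each class.
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    # Step 2: sum of the products of frequency and class mark.
    total = sum(m * f for m, f in zip(midpoints, freqs))
    # Step 3: divide by the total frequency.
    return total / sum(freqs)


# Leisure-time data from the example (class limits in hours).
leisure_classes = [(10, 14), (15, 19), (20, 24), (25, 29),
                   (30, 34), (35, 39), (40, 44)]
leisure_freqs = [8, 28, 27, 12, 3, 1, 1]

mean_hours = grouped_mean(leisure_classes, leisure_freqs)
print(round(mean_hours, 2))
```

Because every observation in a class is replaced by the class mark, the result is an approximation to the mean of the raw data.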

Special Properties of the Arithmetic Mean


1) The sum of the deviations about the mean is zero, i.e., $\sum (X_i - \bar{X}) = 0$.
2) The sum of the squared deviations from the arithmetic mean is less than the sum of squared deviations about any other value, i.e., $\sum (X_i - \bar{X})^2 < \sum (X_i - A)^2$ for any $A \neq \bar{X}$.
3) If $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$ are the means of k groups of a variable measured in the same unit, based on $n_1, n_2, n_3, \dots, n_k$ observations respectively, then the mean of all the observations in all groups, often called the combined mean, is given by
$\bar{X}_c = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_k \bar{X}_k}{n_1 + n_2 + \dots + n_k}$

Example: The mean final exam mark of one class of 50 students is 30, and the mean mark of another class of 100 students in the same final exam is 40. What is the mean mark of all 150 students?
Solution: $\bar{X}_c = \frac{50(30) + 100(40)}{50 + 100} = \frac{5500}{150} \approx 36.7$.

4) If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using:
Correct mean = wrong mean + (correct value − wrong value) / n,
where n = number of observations.

Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight had been misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Solution: Correct mean = 65 + (80 − 40)/10 = 65 + 4 = 69 kg.
5) The effect of transforming original series on the mean.
a) If a constant k is added to / subtracted from/ every observation then the new mean
will be the old mean ± k respectively.
b) If every observation is multiplied by a constant k, then the new mean will be
k*old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new
set?
New mean = 500+10 =510
b. If each of the numbers in the set are multiplied by -5, then what will be the mean of the
new set? New mean = -5*500= -2500
Example: The mean of n observations $X_1, X_2, \dots, X_n$ is known to be 12. A new set of observations is obtained by the linear transformation $Y_i = 2X_i - 0.5$ (i = 1, 2, …, n). What will be the mean of the new set of observations?
Solution: New mean = 2*old mean – 0.5 = 2*12 – 0.5 = 23.5.
Advantages of arithmetic mean
 It is based on all values
 It is easy to calculate and simple to understand

 It is suitable for further mathematical treatment.
 It is stable average, i.e. it is not affected by fluctuations of sampling to some extent.
Disadvantages of arithmetic mean
 It is affected by extreme observations.
 It cannot be used in the case of open end classes.
 It cannot be determined by the method of inspection.
 It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty,
beauty.
 Sometimes it leads to wrong conclusion if the details of the data from which it is obtained are
not available.
Weighted Mean
In computing the arithmetic mean we gave equal importance to each observation. When averaging quantities, however, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described. To give the quantities being averaged their proper degree of importance, we assign them relative importance values called weights and then calculate a weighted mean.
In general, the weighted mean $\bar{X}_w$ of a set of values $X_1, X_2, \dots, X_n$, whose relative importance is expressed numerically by a corresponding set of weights $W_1, W_2, \dots, W_n$, is given by:
$\bar{X}_w = \frac{X_1 W_1 + X_2 W_2 + \dots + X_n W_n}{W_1 + W_2 + \dots + W_n} = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$.

Example: A student obtained results of 60, 75, 63, 59, and 55 in English, Biology, Mathematics, Physics and Chemistry examinations respectively. Find the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
Solution: $\bar{X}_w$ = (60*1 + 75*2 + 63*1 + 59*3 + 55*3)/(1+2+1+3+3) = 615/10 = 61.5.
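A minimal Python sketch of the weighted mean formula, applied to the exam marks above, is given here. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(X_i * W_i) / sum(W_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Marks in English, Biology, Mathematics, Physics and Chemistry.
marks = [60, 75, 63, 59, 55]
subject_weights = [1, 2, 1, 3, 3]

student_mean = weighted_mean(marks, subject_weights)
print(student_mean)
```

With all weights equal, the same function returns the ordinary arithmetic mean.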

The Geometric Mean

The geometric mean of a set of n observations is the nth root of their product. The geometric mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by G.M and given by:
$G.M = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$
Taking the logarithms of both sides:
$\log(G.M) = \log\left((X_1 \cdot X_2 \cdots X_n)^{1/n}\right) = \frac{1}{n}\log(X_1 \cdot X_2 \cdots X_n) = \frac{1}{n}(\log X_1 + \log X_2 + \dots + \log X_n)$
$\Rightarrow \log(G.M) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$
That is, the logarithm of the G.M of a set of observations is the arithmetic mean of their logarithms.
$\Rightarrow G.M = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right)$
Example: Find the G.M of the numbers 2, 4, 8.
Solution: $G.M = \sqrt[n]{X_1 \cdot X_2 \cdots X_n} = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$
Remark: The Geometric Mean is useful and appropriate for finding averages of ratios.
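The logarithmic form of the geometric mean translates directly into code. The sketch below uses natural logarithms (any base gives the same result) and assumes all observations are positive; the function name `geometric_mean` is illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as the antilog of the mean of the logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive observations")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)  # antilog for natural logarithms


gm = geometric_mean([2, 4, 8])
print(round(gm, 6))
```

Working with logarithms avoids forming the full product, which can overflow for long series of large values.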

The Harmonic Mean


The harmonic mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by H.M and given by:
$H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$. This is called the simple harmonic mean.

In the case of a frequency distribution:
$H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}$, where $n = \sum_{i=1}^{k} f_i$.

If observations $X_1, X_2, \dots, X_n$ have weights $W_1, W_2, \dots, W_n$ respectively, then their harmonic mean is given by
$H.M = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$. This is called the weighted harmonic mean.

Remark: The Harmonic Mean is useful and appropriate in finding average speeds and average
rates.
Example: A cyclist pedals from his house to his college at a speed of 10 km/hr and back from the college to his house at 15 km/hr. Find the average speed.

Solution: Here the distance is the same in both directions, so the simple H.M is appropriate for this problem.
$X_1$ = 10 km/hr, $X_2$ = 15 km/hr
$H.M = \frac{2}{\frac{1}{10} + \frac{1}{15}} = 12$ km/hr
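To see why the harmonic mean is the right average here, the sketch below computes it and cross-checks it against total distance divided by total time. The one-way distance of 30 km is a hypothetical value chosen only for the check; any positive distance gives the same average speed.

```python
def harmonic_mean(values):
    """Simple harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1 / v for v in values)


speeds = [10, 15]  # km/hr, outbound and return
hm = harmonic_mean(speeds)

# Cross-check with a hypothetical one-way distance (not given in the example).
one_way_km = 30
total_time_hr = one_way_km / speeds[0] + one_way_km / speeds[1]
avg_speed = 2 * one_way_km / total_time_hr

print(round(hm, 6), round(avg_speed, 6))
```

The arithmetic mean of the two speeds, 12.5 km/hr, would overstate the average because more time is spent travelling at the slower speed.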

The mode
The mode is the value of the observation that occurs with the greatest frequency. A particular
disadvantage is that, with a small number of observations, there may be no mode. In addition,
sometimes, there may be more than one mode such as when dealing with a bimodal (two-peak)
distribution.
Example: Find the modal values for the following data:
(a) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg).
(b) 10, 10, 9, 9, 8, 12, 15, 5 (modal value = 9 and 10). Hence, it is possible for a frequency
distribution to have more than one mode.
Note: Distributions with one mode are called unimodal, those with two modes are called
bimodal, and those with more than two modes are called multimodal.
Modal Value for Grouped data
To find the modal value for a grouped (continuous) frequency distribution, first find the modal class, which is the class with the highest frequency. Then compute the modal value using the formula:
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$, where
$L_{mo}$ = lower class boundary of the modal class;
w = the class width of the modal class;
$\Delta_1 = f_{mo} - f_1$;
$\Delta_2 = f_{mo} - f_3$;
$f_{mo}$ = frequency of the modal class;
$f_1$ = frequency of the class immediately preceding the modal class;
$f_3$ = frequency of the class immediately succeeding the modal class.
Note: The modal class is a class with the highest frequency.
Example: Consider the following grouped quantitative data. Calculate the modal value of the
data.
Class limit Class boundary Frequency

6 – 11 5.5 – 11.5 2
12 – 17 11.5 – 17.5 2
18 – 23 17.5 – 23.5 7
24 – 29 23.5 – 29.5 4
30 – 35 29.5 – 35.5 3
36 – 41 35.5 – 41.5 2

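Solution: (17.5 – 23.5) is the modal class.
$L_{mo}$ = 17.5, w = 6, $\Delta_1 = f_{mo} - f_1 = 7 - 2 = 5$, $\Delta_2 = f_{mo} - f_3 = 7 - 4 = 3$
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w = 17.5 + \left(\frac{5}{5 + 3}\right) \times 6 = 17.5 + 3.75 = 21.25$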
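The grouped-mode formula can be sketched in Python as below. The function name `grouped_mode` is illustrative. Two assumptions are made that the notes do not state: the modal class is taken to be the first class with the highest frequency, and a missing neighbouring class (when the modal class is the first or last class) is treated as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """Mode of a grouped distribution from class boundaries and frequencies."""
    # Index of the modal class (first class with the highest frequency).
    i = max(range(len(freqs)), key=lambda j: freqs[j])
    f_mo = freqs[i]
    # Assumption: a missing neighbour counts as frequency 0.
    f_prev = freqs[i - 1] if i > 0 else 0
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0
    delta1 = f_mo - f_prev
    delta2 = f_mo - f_next
    lower, upper = boundaries[i]
    width = upper - lower
    return lower + delta1 / (delta1 + delta2) * width


# Class boundaries and frequencies from the example.
class_boundaries = [(5.5, 11.5), (11.5, 17.5), (17.5, 23.5),
                    (23.5, 29.5), (29.5, 35.5), (35.5, 41.5)]
class_freqs = [2, 2, 7, 4, 3, 2]

mode_value = grouped_mode(class_boundaries, class_freqs)
print(mode_value)
```

If the modal class and both of its neighbours have equal frequencies, the denominator is zero and the formula does not apply; the sketch does not handle that case.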

Merits and Demerits of Mode


Merits:
 It is not affected by extreme observations.
 Easy to calculate and simple to understand.
 It can be calculated for distribution with open end class
Demerits:

 It is not rigidly defined.
 It is not based on all observations
 It is not suitable for further mathematical treatment.
 It is not stable average (it is affected by fluctuations of sampling to some extent).
3.1.2 Positional Average
The Median
An alternative measure of location, perhaps second in popularity to the arithmetic mean, is the
median. In a distribution, median is the value of the variable which divides it in to two equal
halves. In an ordered series of data median is an observation lying exactly in the middle of the
series. It is the middle most value in the sense that the number of values less than the median is
equal to the number of values greater than it.
Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the sample median for ungrouped data is defined as:
(1) The $\left(\frac{n+1}{2}\right)^{th}$ observation if n is odd.
(2) The average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ observations if n is even.

Example: Find the median of the following numbers.


(a) 6, 2, 8, 9, 4 (b) 5, 2, 1, 8, 3,7, 8, 9.
Solution: a) Ascending order: 2, 4, 6, 8, 9 (n = 5)
Median = $\left(\frac{5+1}{2}\right)^{th}$ value = 3rd value = 6
b) Ascending order: 1, 2, 3, 5, 7, 8, 8, 9 (n = 8)
Median = (4th value + 5th value)/2 = (5 + 7)/2 = 6
Median for Grouped Data
For a grouped (continuous) frequency distribution, the median is calculated as:
$\text{Median} = L + \frac{\left(\frac{n}{2} - cf\right)}{f} \times w$, where
L = lower class boundary of the median class
w = width of the median class
n = total frequency of the sample
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class.
The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{n}{2}$.

Example: Find the median for the following distribution

Class limit Frequency Cumulative freq.(less than type)

40 – 44 7 7
45 – 49 10 17
50 – 54 22 39
55 – 59 15 54
60 – 64 12 66
65 – 69 6 72
70 – 74 3 75
$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5.
Therefore, 50 – 54 is the median class. L = 49.5, n = 75, w = 5, cf = 17, f = 22
Hence, Median = $L + \frac{\left(\frac{n}{2} - cf\right)}{f} \times w = 49.5 + \frac{(37.5 - 17) \times 5}{22} = 54.16$
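The search for the median class and the interpolation step can be sketched in Python as follows. The function name `grouped_median` is illustrative, and the sketch assumes all classes have the same width, as in this example.

```python
def grouped_median(lower_boundaries, width, freqs):
    """Median of a grouped distribution with equal class widths."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    target = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_boundaries, freqs):
        # The median class is the first class whose cumulative
        # frequency reaches n/2.
        if cf + f >= target:
            return lower + (target - cf) / f * width
        cf += f
    raise ValueError("median class not found")


# Lower class boundaries and frequencies from the example.
lowers = [39.5, 44.5, 49.5, 54.5, 59.5, 64.5, 69.5]
freqs = [7, 10, 22, 15, 12, 6, 3]

median_value = grouped_median(lowers, 5, freqs)
print(round(median_value, 2))
```

The interpolation assumes observations are spread evenly within the median class, which is the assumption behind the formula itself.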
Note:
 Median is a positional average and hence not influenced by extreme observations.
 Median can be calculated in the case of open end intervals.
 Median can be located even if the data are incomplete.
Other measures of locations (Quantiles: quartiles, deciles, percentiles)

When a distribution is arranged in order of magnitude of its items, the median is the value of the middle term. Other measures that depend on their positions in the distribution, namely quartiles, deciles, and percentiles, are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution into four equal parts. The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3, often called the first, the second and the third quartile respectively.
Q1 is a value such that 25% of the items are less than or equal to it. Q2 has 50% of the items with values less than or equal to it, and Q3 has 75% of the items with values less than or equal to it.
The kth quartile $Q_k$ for ungrouped data is the value of the item in the $\left(\frac{k(n+1)}{4}\right)^{th}$ position, where k = 1, 2, 3 and n is the total number of observations.
The three quartiles for grouped data can be computed as follows:
• Calculate $\frac{kn}{4}$ and search for the minimum cumulative frequency which is greater than or equal to $\frac{kn}{4}$, k = 1, 2, 3.
• The class corresponding to this cumulative frequency is the kth quartile class. This is the class where Qk lies.
• Thus, $Q_k = L + \frac{w\left(\frac{kn}{4} - cf\right)}{f}$, k = 1, 2, 3, where
L = lower class boundary of the kth quartile class
n = the total number of observations
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth quartile class
w = the class width of the quartile class and
f = frequency of the kth quartile class
Deciles: Deciles are measures that divide the frequency distribution into ten equal parts. The values of the variable corresponding to these divisions are denoted D1, D2, …, D9, often called the first, the second, …, the ninth decile respectively.

To find $D_k$ (k = 1, 2, …, 9), we count $\frac{kn}{10}$ observations, beginning from the lowest class.

For grouped data we have the following formula:
$D_k = L + \frac{w\left(\frac{kn}{10} - cf\right)}{f}$, k = 1, 2, 3, …, 9, where
L = lower class boundary of the kth decile class
n = the total number of observations
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth decile class
w = the class width of the decile class
f = frequency of the kth decile class

Percentiles: Percentiles are measures that divide the frequency distribution into a hundred equal parts. The values of the variable corresponding to these divisions are denoted P1, P2, …, P99, often called the first, the second, …, the ninety-ninth percentile respectively.

To find $P_k$ (k = 1, 2, …, 99), we count $\frac{kn}{100}$ observations, beginning from the lowest class.
For grouped data we have the following formula:
$P_k = L + \frac{w\left(\frac{kn}{100} - cf\right)}{f}$, k = 1, 2, 3, …, 99, where
L = lower class boundary of the kth percentile class
n = the total number of observations
cf = the less than cumulative frequency corresponding to the class immediately preceding the kth percentile class
w = the class width of the percentile class
f = frequency of the kth percentile class
Note: To compute quantiles, we first sort the data in ascending order.
Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and $D_i = P_{10i}$, i = 1, 2, 3, …, 9.
Example: Consider the following distribution.
Calculate: a) all quartiles b) the 7th decile c) the 90th percentile.
Class limit Frequency Cumulative freq.(less than type)
141 – 150 17 17
151 – 160 29 46
161 – 170 42 88
171 – 180 72 160
181 – 190 84 244
191 – 200 107 351
201 – 210 49 400
211 – 220 34 434
221 – 230 31 465
231 – 240 16 481
241 – 250 12 493
Solution: a) Quartiles
Q1: Determine the class containing the first quartile.
$\frac{n}{4} = \frac{493}{4} = 123.25$. Hence, 171 – 180 is the class containing the first quartile.
L = 170.5, n = 493, w = 10, cf = 88, f = 72
$Q_1 = L + \frac{w\left(\frac{n}{4} - cf\right)}{f} = 170.5 + \frac{10(123.25 - 88)}{72} = 175.40$
Q2: Determine the class containing the second quartile.
$\frac{2n}{4} = 246.5$. Hence, 191 – 200 is the class containing the second quartile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$Q_2 = L + \frac{w\left(\frac{2n}{4} - cf\right)}{f} = 190.5 + \frac{10(246.5 - 244)}{107} = 190.73$
Q3: Determine the class containing the third quartile.
$\frac{3n}{4} = 369.75$. Hence, 201 – 210 is the class containing the third quartile.
L = 200.5, n = 493, w = 10, cf = 351, f = 49
$Q_3 = L + \frac{w\left(\frac{3n}{4} - cf\right)}{f} = 200.5 + \frac{10(369.75 - 351)}{49} = 204.33$
b) D7: Determine the class containing the 7th decile.
$\frac{7n}{10} = 345.1$. Hence, 191 – 200 is the class containing the seventh decile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$D_7 = L + \frac{w\left(\frac{7n}{10} - cf\right)}{f} = 190.5 + \frac{10(345.1 - 244)}{107} = 199.95$
c) P90: Determine the class containing the 90th percentile.
$\frac{90n}{100} = 443.7$. Hence, 221 – 230 is the class containing the 90th percentile.
L = 220.5, n = 493, w = 10, cf = 434, f = 31
$P_{90} = L + \frac{w\left(\frac{90n}{100} - cf\right)}{f} = 220.5 + \frac{10(443.7 - 434)}{31} = 223.63$
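Quartiles, deciles and percentiles for grouped data all use the same formula with a different divisor (4, 10 or 100), so one function can cover all three. The sketch below recomputes the quantities asked for in this example; the function name `grouped_quantile` and its `parts` parameter are illustrative, and equal class widths are assumed.

```python
def grouped_quantile(lower_boundaries, width, freqs, k, parts):
    """k-th quantile when the distribution is divided into `parts` equal parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    target = k * n / parts
    cf = 0  # cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        # The quantile class is the first class whose cumulative
        # frequency reaches k*n/parts.
        if cf + f >= target:
            return lower + width * (target - cf) / f
        cf += f
    raise ValueError("quantile class not found")


# Lower class boundaries (140.5, 150.5, ..., 240.5) and frequencies.
lowers = [140.5 + 10 * j for j in range(11)]
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

q1 = grouped_quantile(lowers, 10, freqs, 1, 4)
q2 = grouped_quantile(lowers, 10, freqs, 2, 4)
q3 = grouped_quantile(lowers, 10, freqs, 3, 4)
d7 = grouped_quantile(lowers, 10, freqs, 7, 10)
p90 = grouped_quantile(lowers, 10, freqs, 90, 100)

for name, value in [("Q1", q1), ("Q2", q2), ("Q3", q3), ("D7", d7), ("P90", p90)]:
    print(name, round(value, 2))
```

Calling the function with k = 2 and parts = 4, k = 5 and parts = 10, or k = 50 and parts = 100 gives the same value, the median.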
3.2. Measures of variation (dispersion)
3.2.1. Introduction
Measures of central tendency help us describe a set of data by a single number or typical value. However, they do not provide any information about the extent to which the values differ from one another or from the average value. Hence, to increase our understanding of the pattern of the data, we must also measure its dispersion, which indicates the degree to which the numerical data tend to spread about an average value. The scatter or spread of items of a distribution is known as dispersion or variation. The measures of dispersion also enable us to compare several samples with similar averages.
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
Set 3: 50 50 50 50 50 50 50 50
The three data sets have a mean of 50, but obviously set 1 is more “spread out” than set 2 and set
3 has no variability.
Objectives
The general object of measuring dispersion is to obtain a single summary figure which
adequately exhibits whether the distribution is compact or spread out.
• To judge the reliability of measures of central tendency
• To control variability itself.
• To compare two or more groups of numbers in terms of their variability.
• To make further statistical analysis.

Absolute and Relative Measures of Dispersion


The measures of dispersion which are expressed in terms of the original unit of a series are
termed as absolute measures. Such measures are not suitable for comparing the variability of two
distributions which are expressed in different units of measurement and different average size.
Relative measures of dispersions are a ratio or percentage of a measure of absolute dispersion to
an appropriate measure of central tendency and are thus pure numbers independent of the units
of measurement. For comparing the variability of two distributions (even if they are not
measured in the same unit), we compute the relative measure of dispersion instead of absolute
measures of dispersion.

Types of Measures of Dispersion


Various measures of dispersion are in use. Absolute measures are useful for comparing variation in two or more distributions whose units of measurement are the same. The most commonly used measures of dispersion are:
1) Range and Relative Range
2) Quartile Deviation and Coefficient of Quartile Deviation

3) Mean Deviation and Coefficient of Mean Deviation
4) Standard Deviation and Coefficient of Variation.
3.2.1. Absolute measure of variation

The Range (R)


The range is the largest value minus the smallest value in a data set. The range is greatly affected
by extreme values. Range = largest value – smallest value.
The following two distributions have the same range, 13, yet appear to differ greatly in the
amount of variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45

Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.

Merits and Demerits of range


Merits:
• It is rigidly defined.
• It is easy to calculate and simple to understand.
Demerits:
• It is not based on all observation.
• It is highly affected by extreme observations.
• It is affected by fluctuation in sampling.
• It cannot be computed in the case of open end distribution.
• It is very sensitive to the size of the sample.
Relative Range (RR)
It is also sometimes called the coefficient of range and is given by:
$RR = \frac{\text{Highest value} - \text{Lowest value}}{\text{Highest value} + \text{Lowest value}}$
Example:
1. Find the relative range of the above two distribution. (Exercise!)
2. If the range and relative range of a series are 4 and 0.25 respectively. Then what is the value of:
a) Smallest observation (Ans. 6)
b) Largest observation (Ans. 10)

The Variance and Standard Deviation
The variance
The variance is the "average squared deviation from the mean": it measures the average of the squared deviations of the observations from the mean.
Suppose we have a population of N observations, say $X_1, X_2, X_3, \dots, X_N$. Then we define the population variance as:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N}$
Most of the time, however, we have a sample of n observations, say $X_1, X_2, X_3, \dots, X_n$, from the population of N. Then we define the sample variance as:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$, or $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$, or $S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}$
This measure of variation is widely used to show the scatter of the individual measurements around the mean of all the measurements in a given distribution. Its disadvantage is that the units of the variance are the square of the units of the original observations. The easiest way around this difficulty is to use the square root of the variance, called the standard deviation, as the measure of variability.
Standard deviation
The population and the sample standard deviations, denoted by σ and S respectively, are defined as:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, where $\mu$ is the population mean
$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$, where $\bar{X}$ is the sample mean
For frequency distribution data, the population and sample variances are given as:
$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N}$, where $N = \sum f_i$
$S^2 = \frac{\sum f_i (x_i - \bar{X})^2}{n - 1}$, where $n = \sum f_i$
Variance and Standard Deviation for Grouped Data


The sample variance for a grouped frequency distribution is given by
$S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1}$, where $n = \sum f_i$ and $m_i$ = midpoint of the ith class

Example: The areas of surfaces sprayable with DDT from a sample of 15 houses are as follows (in m²): 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145. Find the variance and standard deviation.
Solution: The mean of the sample is $\bar{X} = 125$. Then
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(101 - 125)^2 + (105 - 125)^2 + \dots + (145 - 125)^2}{14} = \frac{2502}{14} = 178.71$
Hence, the standard deviation is $S = \sqrt{178.71} = 13.37$.
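The same figures can be obtained with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor of the sample formulas. The sketch also evaluates the shortcut formula with the $n(n-1)$ denominator to show that the definitions agree; the variable names are illustrative.

```python
import statistics

# Sprayable surface areas (m^2) for the 15 sampled houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

mean_area = statistics.mean(areas)
sample_var = statistics.variance(areas)   # divides by n - 1
sample_sd = statistics.stdev(areas)       # square root of the sample variance

# Shortcut formula: (n * sum(X^2) - (sum X)^2) / (n * (n - 1)).
n = len(areas)
shortcut_var = (n * sum(x * x for x in areas) - sum(areas) ** 2) / (n * (n - 1))

print(mean_area, round(sample_var, 2), round(sample_sd, 2), round(shortcut_var, 2))
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by N instead.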
Example: Find the variance and standard deviation of the following grouped sample data.
Class    Frequency
40-44    7
45-49    10
50-54    22
55-59    15
60-64    12
65-69    6
70-74    3
Sample mean, $\bar{X}$ = 55, n = 75

$m_i$ (midpoint)            42     47    52    57   62    67    72    Total
$f_i (m_i - \bar{X})^2$     1183   640   198   60   588   864   867   4400

Then $S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1} = \frac{4400}{74} = 59.46$
and $S = \sqrt{59.46} = 7.71$
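A short Python sketch of the grouped-data calculation follows. It computes the mean from the midpoints rather than taking it as given; the function name `grouped_variance` is illustrative.

```python
import math


def grouped_variance(midpoints, freqs):
    """Sample variance of a grouped distribution (n - 1 divisor)."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return squared_dev / (n - 1)


# Class midpoints and frequencies from the example.
mids = [42, 47, 52, 57, 62, 67, 72]
freqs = [7, 10, 22, 15, 12, 6, 3]

var_grouped = grouped_variance(mids, freqs)
sd_grouped = math.sqrt(var_grouped)
print(round(var_grouped, 2), round(sd_grouped, 2))
```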

Note:
If the standard deviation of X1, X2, ….., Xn is S, then the standard deviation of
a) X1+ k, X2+k, …, Xn+k will also be S (where k =constant)
b) kX1, kX2, …, kXn will be |k|S.
c) c + kX1, c + kX2, …, c + kXn will be |k|S (c and k are constants)

Example 1: The standard deviation of n observations X1, X2, ..., Xn is known to be 3. A new set of observations is obtained by the linear transformation Yi = 2Xi – 0.5 (i = 1, 2, …, n). What will be the standard deviation of the new set of observations?
Solution: New standard deviation = |k|S = 2*3 = 6
Example 2: The mean and the standard deviation of a set of numbers are respectively 500 and
10.
a) If 10 is added to each of the numbers in the set, then what will be the variance and
standard deviation of the new set?
b) If each of the numbers in the set are multiplied by -5, then what will be the variance and
standard deviation of the new set?

Solutions: a) The variance and standard deviation will remain the same: the variance is 10² = 100 and the standard deviation is 10.
b) New standard deviation = |k|S = 5*10 = 50, so the new variance = 50² = 2500.
3.2.2 Relative measure of variation

Coefficient of Variation (CV)


The coefficient of variation (CV) is defined by
$CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$, that is, $CV = \frac{S}{\bar{X}} \times 100\%$.
The coefficient of variation is most useful in comparing the variability of several different
samples, each with different means. This is because a higher variability is usually expected when
the mean increases, and the CV is a measure that accounts for this variability.
CV is a relative measure free from unit of measurement.

Example: An analysis of the weekly wages paid (in Birr) to workers in two firms A and B belonging to the same industry gives the following results. In which firm are the wages more variable?

Value        Firm A   Firm B
Mean wage    56       72
Variance     100      121

Solution: $C.V_A = \frac{S_A}{\bar{X}_A} \times 100\% = \frac{10}{56} \times 100\% = 17.86\%$ and
$C.V_B = \frac{S_B}{\bar{X}_B} \times 100\% = \frac{11}{72} \times 100\% = 15.28\%$.
Since $C.V_A > C.V_B$, there is greater variability in individual wages in firm A.
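The comparison of the two firms can be sketched in Python as below. The function takes the mean and the variance, since the table gives variances rather than standard deviations; the name `coefficient_of_variation` is illustrative.

```python
import math


def coefficient_of_variation(mean, variance):
    """CV in percent: standard deviation divided by the mean, times 100."""
    return math.sqrt(variance) / mean * 100


cv_a = coefficient_of_variation(56, 100)   # firm A
cv_b = coefficient_of_variation(72, 121)   # firm B

more_variable = "A" if cv_a > cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_variable)
```

Firm B has the larger variance in absolute terms, yet firm A is more variable relative to its mean, which is the point of using a relative measure.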

The standard Score (Z-score):


It is the number of standard deviations that a given value X is below or above the mean.
The standard score of any value Xi is defined as
$Z_i = \frac{X_i - \text{mean}}{\text{standard deviation}}$, that is, $Z_i = \frac{X_i - \bar{X}}{S}$ (for sample data sets).
Values above the mean have positive Z-scores and values below the mean have negative Z-scores. Z-scores are generally meaningless by themselves unless they are compared to the distribution or to scores from some reference group.
Note: A Z-score less than -2 or greater than 2 is considered an unusually low or high value.

Example 1: Two sections were given introduction to statistics examinations. The following
information was given.

Value Section 1 Section 2


Mean 78 90
Standard deviation 6 5

Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Solution: $Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{90 - 78}{6} = 2$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{95 - 90}{5} = 1$
Student A performed better relative to his section because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
Example 2: Two groups of people were trained to perform a certain task and tested to find out
which group is faster to learn the task. For the two groups the following information was given:
Value Group one Group two
Mean 10.4 min 11.9 min
Stan.dev. 1.2 min 1.3 min
Relatively speaking:
a) Which group is more consistent(less variable) in its performance?
b) Suppose a person A from group one takes 9.2 minutes while person B
from Group two takes 9.3 minutes, who was faster in performing the
task? Why?
Solutions:
a) Use the coefficient of variation.
$CV_1 = \frac{S_1}{\bar{X}_1} \times 100\% = \frac{1.2}{10.4} \times 100\% = 11.54\%$
$CV_2 = \frac{S_2}{\bar{X}_2} \times 100\% = \frac{1.3}{11.9} \times 100\% = 10.92\%$
Since $CV_2 < CV_1$, group 2 is more consistent (less variable).
b) Calculate the standard scores of A and B.
$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$
Relatively speaking, person B is faster because the time taken by person B is two standard deviations shorter than the average time taken by group 2, while the time taken by person A is only one standard deviation shorter than the average time taken by group 1.
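The standard scores in this example can be checked with the short sketch below. The function name `z_score` is illustrative. Because the task is timed, a lower (more negative) Z-score means a relatively faster performance.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = z_score(9.2, 10.4, 1.2)   # person A, group one
z_b = z_score(9.3, 11.9, 1.3)   # person B, group two

# For completion times, the smaller Z-score is the relatively faster person.
relatively_faster = "A" if z_a < z_b else "B"
print(round(z_a, 2), round(z_b, 2), relatively_faster)
```

In absolute terms person A's time (9.2 minutes) is shorter; the Z-scores compare each person with their own group.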

REVIEW EXERCISES
1. A company was experiencing a chronic weld defect problem with a water outlet tube
assembly. Each assembly manufactured is leak tested in a water tank. Data were collected
on a gap between the flange and the pipe for 6 assemblies that leaked and 6 good
assemblies that passed the leak test.
Leaker .290, .104, .207, .145, .104, .124
i. Calculate the sample mean x̄.
ii. Calculate the sample standard deviation S.
2. The following are the numbers of minutes that a person had to wait for the bus to work on 15 working days: 10, 1, 13, 9, 5, 9, 2, 10, 3, 8, 6, 17, 2, 10, 15. Find
a) the mean;
b) the median;
c) the sample variance s².
3. In three recent years, the price of copper was 69.6, 66.8 and 66.3 cents per pound, and the price of bituminous coal was 19.43, 19.82 and 22.40 dollars per short ton. Which of the two sets of prices is relatively more variable?
4. For each of the following distributions, decide whether it is possible to find the mean and
whether it is possible to find the median. Explain your answers.
a.
Grade Frequency
40-49 5
50-59 18
60-69 27
70-79 15
80-89 6

b.
IQ Frequency
Less than 90 3
90-99 14
100-109 22
110-119 19
More than 119 7

Find the first and third quartiles Q1 and Q3 for grouped data.

5. The average annual salaries paid to top-level management in three companies are
$94,000, $102,000, and $99,000. If the respective numbers of top-level executives in
these companies are 4, 15, and 11, find the average salary paid to these 30 executives.
6. In a nuclear engineering class there are 22 juniors, 18 seniors, and 10 graduate students.
If the juniors averaged 71 in the midterm examination, the seniors averaged 78, and the
graduate students averaged 89, what is the mean for the entire class?
7. If an instructor counts the final examination in a course four times as much as each 1-
hour examination, what is the weighted average grade of a student who received grades
of 69, 75, 56, and 72 in four 1-hour examinations and a final examination grade of 78?
