
1. INTRODUCTION TO STATISTICS
Statistics profoundly affects society today. Every day, we are confronted with some
form of statistical information through newspapers, magazines, and other forms of
communication. Statistical tables, survey results, and the language of probability are
used with increasing frequency by the media. Such statistical information has become
highly influential in our lives.

Statistics can be considered as numerical statements of facts which are highly
convenient and meaningful forms of communication. The subject of statistics is not a
new discipline; it is as old as human society itself. The sphere of its utility, however,
was for a long time very restricted. The word statistics is derived from the Latin word
status, which means a political state or government. It was originally applied in
connection with kings and monarchs collecting data on their citizenry pertaining to
state wealth, taxes collected, population and so on. Thus, the scope of statistics in
ancient times was primarily limited to the collection of demographic, property and
wealth data of a country by governments for framing military and fiscal policies.

Nowadays, statistics is used in almost every field of study, such as natural science,
social science, engineering, medicine and agriculture. Improvements in computer
technology make it easier than ever to use statistical methods and to manipulate
massive amounts of data.

Definition and Classification of Statistics


Statistics as a subject (field of study): Statistics is defined as the science of
collecting, organizing, presenting, analyzing and interpreting numerical data in order
to make decisions on the basis of such analysis. (Singular sense)

Statistics as numerical data: Statistics is defined as aggregates of numerically
expressed facts (figures) collected in a systematic manner for a predetermined
purpose. (Plural sense)
In this course, we shall be mainly concerned with statistics as a subject, that is, as a
field of study.
Classification of Statistics: Statistics is broadly divided into two categories based on
how the collected data are used.
i. Descriptive Statistics: is concerned with summary calculations, graphs, charts and
tables. It deals with describing data without attempting to infer anything that goes
beyond the given set of data. It consists of the collection, organization, summarization
and presentation of data in a convenient and informative way.
Examples:
• The amount of medication in blood pressure pills.
• The starting salaries for Mathematics and Statistics students in different
organizations.

ii. Inferential Statistics: Descriptive statistics describes the data set being
analyzed, but does not allow us to draw any conclusions or make any inferences
about the data, other than visual “It looks like ...” type statements. Inferential
statistics deals with making inferences and/or drawing conclusions about a population
based on data obtained from a limited sample of observations. It consists of
performing hypothesis tests, determining relationships among variables and making
predictions. For example, the average income of all families (the population) in
Ethiopia can be estimated from figures obtained from a few hundred families (the
sample).
Examples:
a) From past figures, it has been predicted that 90% of registered voters will vote in the
November election.
b) The average age of a student in Hawassa University is 19.1 years.
c) To determine the most effective dose of a new medication (on the basis of tests
performed with volunteer patients from selected hospitals).
d) To compare the effectiveness of two reducing diets (based on the weight losses of
persons who were taking the diets).
e) There is a relationship between smoking tobacco and an increased risk of developing
cancer.
Stages of Statistical Investigation
The area of statistics incorporates the following five stages. These are collection,
organization, presentation, analysis and interpretation of data.
Proper collection of data:
Data can be collected in a variety of ways; one of the most common is the survey.
Surveys can be conducted in different ways; three of the most common are the
telephone survey, the mailed questionnaire and the personal interview.
Organization of data: If an investigator has collected data through a survey, it is
necessary to edit these data in order to correct any apparent inconsistencies,
ambiguities, and recording errors.
Presentation of data: the organized data can now be presented in the form of tables
or diagrams or graphs. This presentation in an orderly manner facilitates the
understanding as well as analysis of data.
Analysis of data: the basic purpose of data analysis is to make it useful for certain
conclusions. This analysis may simply be a critical observation of data to draw some
meaningful conclusions about it or it may involve highly complex and sophisticated
mathematical techniques.
Interpretation of data: Interpretation means drawing conclusions from the data
which form the basis of decision making. Correct interpretation requires a high degree
of skill and experience and is necessary in order to draw valid conclusions.

Definition of some basic terms
Population: It is the totality of things under consideration. The population represents
the target of an investigation, and the objective of the investigation is to draw
conclusions about the population; hence we sometimes call it the target population.
Examples:
• All university students of Ethiopia
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc
The population could be finite or infinite (an imaginary collection of units).
There are two ways of investigation: census and sample survey.
Census: a complete enumeration of the population. In most real problems a census
cannot be carried out, hence we take a sample.
Sample: A group of subjects selected from a population. A sample from a population
is the set of measurements that are actually collected in the course of an investigation.
For example, if we want to study the income pattern of lecturers at Hawassa
University and there are 3000 lecturers, then we may take a random sample of only
250 lecturers out of this entire population of 3000 for the purpose of study. Then this
number of 250 lecturers constitutes a sample.

In practice, we rarely conduct a census; instead we conduct a sample survey.

Parameter: is a descriptive measure of a population, or summary value calculated
from a population. Examples: average, range, proportion, variance.
Statistic: is a descriptive measure of a sample, or summary value calculated from a
sample. In the previous example, a summary measure that describes a characteristic
of the sample, such as the average income of the 250 lecturers, is known as a statistic.
Sampling: The process or method of selecting a sample from the population.
Sample size: The number of elements or observations to be included in the sample.
Variable: It is an item of interest that can take on many different numerical values.
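The terms above can be illustrated with the lecturer example. The following is a minimal sketch in Python, assuming a made-up population of 3000 income values (the figures are generated at random and are not real data); it only shows how a parameter and a statistic differ.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: monthly incomes of all 3000 lecturers (made-up values).
population = [random.gauss(12000, 2500) for _ in range(3000)]

# Sampling: select 250 lecturers at random, without replacement.
sample_size = 250
sample = random.sample(population, sample_size)

parameter = statistics.mean(population)  # population mean (a parameter)
statistic = statistics.mean(sample)      # sample mean (a statistic)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

In a real investigation the parameter is usually unknown, and the statistic computed from the sample is used to estimate it.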
Applications, Uses and Limitation of Statistics
Applications of Statistics
Statistics can be applied in any field of study which seeks quantitative evidence. For
instance, engineering, economics, natural science, etc.
a) Engineering: Statistics has wide application in engineering.
• To compare the breaking strength of two types of materials.
• To determine the probability of reliability of a product.
• To control the quality of products in a given production process.
• To compare the improvement of yield due to certain additives
(fertilizers, herbicides (weedicides), etc.).
b) Economics: Statistics are widely used in economics study and research.
• To measure and forecast Gross National Product (GNP)

• Statistical analyses of population growth, unemployment figures, rural
or urban population shifts and so on influence much of the economic
policy making
• Financial statistics are necessary in the fields of money and banking
including consumer savings and credit availability.
c) Statistics and research: there is hardly any advanced research going on
without the use of statistics in one form or another. Statistics are used
extensively in medical, pharmaceutical and agricultural research.

Uses of Statistics
Today the field of statistics is recognized as a highly useful tool in the decision-making
process of managers in modern business, industry and frequently changing technology. It
has many functions in everyday activities. The following are some uses of statistics:
• Statistics condenses and summarizes complex data: the original set of
data (raw data) is normally voluminous and disorganized unless it is
summarized and expressed in few numerical values.
• Statistics facilitates comparison of data: measures obtained from
different sets of data can be compared to draw conclusions about those sets.
Statistical values such as averages, percentages, ratios, etc, are the tools
that can be used for the purpose of comparing sets of data.
• Statistics helps in predicting future trends: statistics is extremely useful
for analyzing the past and present data and predicting some future trends.
• Statistics influences the policies of government: statistical study results
in the areas of taxation, on unemployment rate, on the performance of
every sort of military equipment, etc, may convince a government to
review its policies and plans with the view to meet national needs and
aspirations.
• Statistical methods are very helpful in formulating and testing hypotheses
and in developing new theories.

Limitations of Statistics
The field of statistics, though widely used in all areas of human knowledge and
widely applied in a variety of disciplines such as business, economics and research,
has its own limitations. Some of these limitations are:

a) It does not deal with individual values: as discussed earlier, statistics deals
with aggregates of values. For example, the population size of a country for
some given year does not help us in comparative studies unless it can be
compared with that of other countries or with a set standard.
b) It does not deal with qualitative characteristics directly: statistics is not
applicable to qualitative characteristics such as beauty, honesty, poverty,
standard of living and so on, since these cannot be expressed in quantitative
terms. These characteristics, however, can be dealt with statistically if
quantitative values can be assigned to them using a logical criterion.
c) Statistical conclusions are not universally true: since statistics is not an
exact science, as is the case with natural sciences, the statistical conclusions
are true only under certain assumptions. Also, the field deals extensively with
the laws of probability which at best are educated guesses. For example, if we
toss a coin 10 times where the chances of a head or a tail are 1:1, we cannot
say with certainty that there will be 5 heads and 5 tails. Thus the statistical
laws are only approximations.
d) It is susceptible to misuse: The famous statement that “figures don’t lie, but
liars can figure” is a testimony to the misuse of statistics. Inaccurate or
incomplete figures can be manipulated to get desired inferences. For example,
suppose the number of car accidents caused in a city in a particular year by
women drivers is 10, while the number caused by men drivers is 40. Concluding
from these figures alone that women drivers are safe drivers would be misleading,
because the numbers of women and men who drive are not taken into account.
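The coin-toss example in item (c) can be checked numerically. The following is a small Python sketch, assuming a fair coin and independent tosses; it computes the probability of each possible number of heads in 10 tosses.

```python
from math import comb

n = 10  # number of tosses

# For a fair coin, P(exactly k heads) = C(n, k) / 2**n.
probs = {k: comb(n, k) / 2**n for k in range(n + 1)}

p_five = probs[5]
print(f"P(exactly 5 heads in 10 tosses) = {p_five:.4f}")
```

Under these assumptions the probability of exactly 5 heads is 252/1024, about 0.246, so the "expected" outcome of 5 heads and 5 tails occurs in only about one quarter of such experiments.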

TYPES OF VARIABLES AND MEASUREMENT SCALES

Types of Variables
A variable is a characteristic of an object that can have different possible values. Age,
height, IQ and so on are all variables since their values can change when applied to
different people. For example, Mr. X is a variable since X can represent anybody. A
variable may be qualitative or quantitative in nature.
These two types of variables are:
a) Quantitative variables: are variables that can be quantified or can have
numerical values. Examples: height, area, income, temperature, etc. Consider
the following questions: “How many rooms are there in your house?” or “How
many children are there in the family?” The answers would be numerical values.
b) Qualitative variables: are variables that cannot be quantified directly.
Examples: color, beauty, sex, location, political affiliation, and so on. Consider
the question “Are you currently unemployed?” The answer would fall in the category
of either yes or no. Qualitative variables are also called categorical variables.
Hence we have two types of data: quantitative and qualitative data.

Quantitative variables can be further classified as
• Discrete variables, and
• Continuous variables
a) Discrete variables are variables whose values are counts. Examples:
• number of students attending a conference
• number of households (family size)
• Number of pages of a book
• number of eggs in the refrigerator, etc
b) Continuous variables are variables that can have any value within an interval.
Examples:
• height of models in a beauty contest
• weight of people joining a diet program
• lengths of steel bars in a given production term, etc.
Scales of Measurement
Proper knowledge about the nature and type of data to be dealt with is essential in
order to specify and apply the proper statistical method for their analysis and
inference. Measurement is the assignment of numbers to objects or events in a systematic
fashion. It is a functional mapping from the set of objects {Oi} to the set of real
numbers {M(Oi)}.
A measurement scale refers to the property of the values assigned to the data based on
the properties of order, distance and fixed zero. Four levels of measurement scales are
commonly distinguished: nominal, ordinal, interval, and ratio, each of which possesses
different properties of measurement systems.

Nominal scale: “Nominal” comes from the Latin word for “name”. This scale measures the
presence or absence of a characteristic of the data. The values of a nominal attribute
are just different names, i.e., nominal attributes provide only enough information to
distinguish one object from another. The data are qualities with no ranking or ordering
and no numerical or quantitative value; they consist of names, labels and categories.
This is a scale for grouping individuals into different categories.
Examples: Car colors for a certain model are: red, silver, blue and black.
• In this scale, one value is different from the other.
• Arithmetic operations (+, -, *, /) are impossible, and comparison (<, >) is impossible.
Ordinal scale: measures the presence or absence of a characteristic of the data using
an implied ranking or order. “Ordinal” comes from the Latin word for “order”.
• Data can be arranged in some order, but the differences between the data values are
meaningless.
• Data consisting of an ordering or ranking of measurements are said to be on an
ordinal scale of measurement. That is, the values of an ordinal scale provide
enough information to order objects (<, >).
• One value is different from and greater/better/less than the other.
• Arithmetic operations (+, -, *, /) are impossible; comparison is possible.
Consider the following examples about ordinal scale:
a. Man A weighs more than man B
b. Ethiopian athletes got 1st and 2nd ranks in the 10,000m women’s final
in Beijing.
c. Of 17 fishing reels rated: 6 were rated good quality, 4 were rated better
quality, and 7 were rated best quality.
d. Out of a high school class of 319, Walter ranked 4th, June ranked
12th, and Jim ranked 20th.

Interval scale: Data values can be ranked and the differences between data values
are meaningful. However, there is no intrinsic zero, or starting point, and the ratios of
data values are meaningless. Note: Celsius and Fahrenheit temperature readings have
no meaningful zero, and ratios are meaningless.
In this measurement scale:
• One value is different from, and better/greater than, another by a certain amount of
difference.
• It is possible to add and subtract. For example: 37°C − 35°C = 2°C, 45°C − 43°C = 2°C.
• Multiplication and division are not meaningful. For example, 40°C = 2(20°C), but
this does not imply that an object at 40°C is twice as hot as an object
at 20°C.
• Interval scale data convey better information than nominal and ordinal scale data.
• Most common examples are: IQ, temperature, calendar dates, etc.
Consider the following examples:
a. The years in which Democrats won presidential elections.
b. Body temperature in degrees Celsius (or Fahrenheit).
c. Building A was built in 1284, Building B in 1492 and Building C in 5 B.C.
Ratio scale: Similar to interval, except there is a true zero, or starting point, and the
ratios of data values have meaning.
• Best type of data: can add, subtract, multiply and divide. For ratio variables,
both differences and ratios are meaningful.
• One value is different from, and larger/taller/better/less than, the other by a certain
amount of difference and by so many times.
• This measurement scale provides better information than the interval scale of
measurement.
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length,
electric current.
Consider some more examples:
a. Core temperatures of stars measured in kelvins.
b. Time elapsed between the deposit of a check and the clearance of that check.
c. Length of the Nile River.
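The difference between the interval and ratio scales can be seen with the 40°C and 20°C readings used earlier. The sketch below is illustrative only; the helper name celsius_to_kelvin is introduced here for the example and uses the standard offset of 273.15.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius reading (interval scale) to kelvins (ratio scale)."""
    return c + 273.15

hot_c, cold_c = 40.0, 20.0

# Ratio of the Celsius readings: arithmetically 2, but not physically meaningful.
naive_ratio = hot_c / cold_c

# Ratio of the same temperatures on the Kelvin scale, which has a true zero.
kelvin_ratio = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cold_c)

# Differences are meaningful on both scales and agree.
difference_c = hot_c - cold_c
difference_k = celsius_to_kelvin(hot_c) - celsius_to_kelvin(cold_c)

print(f"ratio of Celsius readings: {naive_ratio:.3f}")
print(f"ratio in kelvins:          {kelvin_ratio:.3f}")
print(f"difference: {difference_c:.1f} degrees C, {difference_k:.1f} K")
```

The Kelvin ratio is about 1.07, not 2, which is why "twice as hot" cannot be read off Celsius values, while the difference of 20 degrees is the same on both scales.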

2. Methods of Data Collection and Presentation


Methods of Data Collection
Data: are the values (measurements or observations) that the variables can assume.
Variables whose values are determined by chance are called random variables.
Not every aggregate of numbers can be called statistical data. We say an aggregate of
numbers is statistical data when the numbers are
• Comparable
• Meaningful and
• Collected for a well defined objective
Raw data: are collected data, which have not been organized numerically.
Examples: 25, 10, 32, 18, 6, 93, 4.

The required data can be obtained from either a primary source or a secondary source.
Primary source: is a source of data that supplies first-hand information for the
immediate purpose.
Primary data: data you collect to answer your question; that is, data measured or
collected by the investigator or the user directly from the source, or data originally
collected for the immediate purpose.
Two activities involved: planning and measuring.
a) Planning:
• Identify source and elements of the data.
• Decide whether to consider sample or census.
• If sampling is preferred, decide on sample size, selection method, etc.
• Decide measurement procedure.
• Set up the necessary organizational structure.
b) Measuring: there are different options.
• Focus Group
• Telephone Interview
• Mail Questionnaires
• Door-to-Door Survey
• Mall Intercept
• New Product Registration
• Personal Interview and
• Experiments are some of the sources for collecting the primary data.
Secondary source: individuals or agencies which supply data originally
collected for other purposes by them or by others. Usually these are published or
unpublished materials, records, reports, etc.

Secondary data: data collected from a secondary source, i.e. by other people for other
purposes. Data gathered or compiled from published and unpublished sources or files.
When our source is secondary data, check that:
• The type and objective of the situation are clear.
• The purpose for which the data were collected is compatible with the present
problem.
• The nature and classification of the data are appropriate to our problem.
• There are no biases or misreporting in the published data.
Note: Data which are primary for one may be secondary for the other.
Methods of Data Presentation
Having collected and edited the data, the next important step is to organize it, that is,
to present it in a readily comprehensible, condensed form that helps us draw
inferences from it. It is also necessary that like items be separated from unlike ones.
The presentation of data is broadly classified into the following two categories:
• Tabular presentation
• Diagrammatic and Graphic presentation.
The process of arranging data into classes or categories according to similarities
is technically called classification. It eliminates inconsistency and also brings out the
points of similarity and/or dissimilarity of collected items/data. It is necessary because
it would not be possible to draw inferences and conclusions from a large set of
collected raw data that has not been classified.
Frequency Distributions
Frequency: the number of times a certain value or set of values occurs in a
specific group.

A frequency distribution is a table that presents data according to some criteria with
the corresponding number of items falling in each class (i.e. with the corresponding
frequencies). We see at a glance the shape of the distribution, the range of variation,
and any clustering of the values. By presenting a frequency distribution in relative
form, i.e., as percentages, we convert to the familiar base of 100 and make it easy to
compare the distribution of cases between different variables and/or different samples,
each of which may involve different total numbers of cases.

Generally, there are three basic types of frequency distributions: categorical,
ungrouped and grouped frequency distributions.
1. Categorical frequency distribution
– the data are usually qualitative
– the scales of measurements for the data are usually nominal or ordinal
For instance, data on blood types of people, political affiliation, economic status (low,
medium and high) and religious affiliation are presented by categorical frequency
distributions.

Example: Last year, thirty students took the Stat 100 course and their grades were as
follows. Construct an appropriate frequency distribution for these data.

Table 2.1: Grades of students

B  B  C  B  A  C
D  C  C  C  B  B
B  A  B  C  D  C
A  F  B  F  C  A
B  C  C  A  C  D

Solution:
There are five kinds of grades: A, B, C, D and F which may be used as the classes for
constructing the distribution. The procedure for constructing a frequency distribution
for categorical data is given below.

STEP 1. Construct a table as shown below.
STEP 2. Tally the data and place the results in column (II).
STEP 3. Count the tallies and put the results in column (III).
STEP 4. Calculate the percentage of frequencies in each class and put the results in
column (IV):
Percent = (f / n) × 100
where f = frequency of the class and n = total number of observations.

Class   Tally           Frequency   Percent
(I)     (II)            (III)       (IV)
A       /////           5           16.7
B       ///// ////      9           30.0
C       ///// ///// /   11          36.7
D       ///             3           10.0
F       //              2           6.7
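The tallying and percentage steps for the grade data can also be carried out by a short program. The following is a minimal sketch in Python using the standard library; the variable names are chosen for this illustration.

```python
from collections import Counter

# Grades of the thirty students, row by row as in Table 2.1.
grades = (
    "B B C B A C "
    "D C C C B B "
    "B A B C D C "
    "A F B F C A "
    "B C C A C D"
).split()

n = len(grades)         # total number of observations
freq = Counter(grades)  # frequency of each class (grade)

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for cls in sorted(freq):
    percent = freq[cls] / n * 100  # Percent = (f / n) x 100
    print(f"{cls:<6}{freq[cls]:>10}{percent:>9.1f}")
```

The printed frequencies and percentages agree with columns (III) and (IV) of the table.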

2. Ungrouped frequency distribution
An ungrouped frequency distribution is a table of all potential raw score values that
could possibly occur in the data along with their corresponding frequencies. It is often
constructed for a small set of data or a discrete variable.
Constructing an ungrouped frequency distribution
To construct an ungrouped frequency distribution, first find the smallest and the
largest raw scores in the collected data. Then make a columnar table of all potential
raw score values arranged in order of magnitude with the number of times a
particular value is repeated, i.e., the frequency of that value. To facilitate counting,
tallies can be used.
Example: The following data are the ages in years of 20 women who attended health
education last year: 30, 41, 39, 41, 32, 29, 35, 31, 30, 36, 33, 36, 32, 42, 30, 35, 37,
32, 30, and 41. Construct a frequency distribution for these data.
Solution:
STEP 1. Find the range: Range = Maximum observation − Minimum observation
= 42 − 29 = 13
STEP 2. Construct a table, tally the data and complete the frequency column. The
frequency distribution becomes as follows.
Age        29  30    31  32   33  35  36  37  39  41   42
Tally      /   ////  /   ///  /   //  //  /   /   ///  /
Frequency  1   4     1   3    1   2   2   1   1   3    1
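The same ungrouped distribution can be obtained programmatically. This is a small Python sketch; like the table above, it lists only the ages that actually occur in the data.

```python
from collections import Counter

ages = [30, 41, 39, 41, 32, 29, 35, 31, 30, 36,
        33, 36, 32, 42, 30, 35, 37, 32, 30, 41]

# STEP 1: range = maximum observation - minimum observation
data_range = max(ages) - min(ages)

# STEP 2: count how many times each age occurs
freq = Counter(ages)

print("Range:", data_range)
for age in sorted(freq):
    print(age, "/" * freq[age], freq[age])  # value, tally marks, frequency
```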
3. Grouped frequency distribution
STEP 1. Determine the unit of measurement, U.
STEP 2. Find the maximum (Max) and the minimum (Min) observations, and then
compute their range, R: R = Max − Min
STEP 3. Fix the number of classes desired (k). There are two ways to fix k:
• Fix k arbitrarily between 6 and 20, or
• Use Sturges' formula: k = 1 + 3.322 log10 N, where N is the total frequency,
and round this value of k up to get an integer.
STEP 4. Find the class width (W) by dividing the range by the number of classes and
rounding the result up to get an integer value: W = R / k
STEP 5. Pick a suitable starting point less than or equal to the minimum value. This
starting point is the lower limit of the first class. Continue to add the class
width to this lower limit to get the rest of the lower limits.
STEP 6. Find the upper class limits. To find the upper class limit of the first class, subtract
one unit of measurement from the lower limit of the second class. Then continue to
add the class width to this upper limit to get the rest of the upper limits.
STEP 7. Compute the class boundaries as: LCB = LCL − (1/2)U and UCB = UCL + (1/2)U,
where LCL = lower class limit, UCL = upper class limit, LCB = lower class
boundary and UCB = upper class boundary. The class boundaries are also halfway
between the upper limit of one class and the lower limit of the next class.
STEP 8. Tally the data.
STEP 9. Find the frequencies.
STEP 10. (If necessary) Find the cumulative frequencies (more than and less than types).
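Steps 3 and 4 can be sketched as two small helper functions. The function names below are introduced for this illustration only, the constant 3.322 is the commonly quoted value in Sturges' formula, and the width calculation assumes a unit of measurement of 1.

```python
from math import ceil, log10

def sturges_classes(n):
    """Number of classes from Sturges' formula, rounded up to an integer."""
    return ceil(1 + 3.322 * log10(n))

def class_width(data_range, k):
    """Class width: range divided by the number of classes, rounded up."""
    return ceil(data_range / k)

k = sturges_classes(40)  # for example, N = 40 observations
w = class_width(39, 8)   # for example, a range of 39 split into 8 classes
print("classes suggested by Sturges' formula:", k)
print("class width:", w)
```

For N = 40 the formula gives 1 + 3.322 × 1.602 ≈ 6.32, which rounds up to 7 classes; a range of 39 split into 8 classes gives 39/8 = 4.875, which rounds up to a width of 5.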
Example: The number of hours 40 employees spent on their job during the last 7 days
is given below.
Table: Number of hours employees spend
62 50 35 36 31 43 43 43
41 31 65 30 41 58 49 41
37 62 27 47 65 50 45 48
27 53 40 29 63 34 44 32
58 61 38 41 26 50 47 37
Construct a suitable frequency distribution for these data using 8 classes.
Solution:
1. Unit of measurement: U = 1 hour
2. Max = 65, Min = 26, so that R = 65 − 26 = 39
3. It is already determined to construct a frequency distribution having 8 classes.
4. Class width: W = 39/8 = 4.875 ≈ 5
5. Starting point = 26 = lower limit of the first class. Hence the lower class
limits become: 26 31 36 41 46 51 56 61
6. Upper limit of the first class = 31 − 1 = 30. Hence the upper class limits become:
30 35 40 45 50 55 60 65
The lower and the upper class limits (Steps 5 and 6) can be written as follows.
Class limits: 26-30 31-35 36-40 41-45 46-50 51-55 56-60 61-65
7. By subtracting 0.5 units of measurement from the lower class limits and by
adding 0.5 units of measurement to the upper class limits, we can get lower and
upper class boundaries as follows.
Class boundaries: 25.5-30.5 30.5-35.5 35.5-40.5 40.5-45.5 45.5-50.5 50.5-55.5 55.5-60.5 60.5-65.5

STEPS 8, 9 and 10 are displayed in the following table.

Class limits  Class boundaries  Tally        Frequency  Cumulative        Cumulative
                                                        frequency         frequency
                                                        (less than type)  (more than type)
26 – 30       25.5 – 30.5       /////        5          5                 40
31 – 35       30.5 – 35.5       /////        5          10                35
36 – 40       35.5 – 40.5       /////        5          15                30
41 – 45       40.5 – 45.5       ///// ////   9          24                25
46 – 50       45.5 – 50.5       ///// //     7          31                16
51 – 55       50.5 – 55.5       /            1          32                9
56 – 60       55.5 – 60.5       //           2          34                8
61 – 65       60.5 – 65.5       ///// /      6          40                6
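The whole worked example can be reproduced with a short program. The sketch below follows the steps of the procedure for the 40 observations, assuming 8 classes and a unit of measurement of 1 hour as in the solution; the variable names are chosen for this illustration.

```python
from itertools import accumulate
from math import ceil

hours = [62, 50, 35, 36, 31, 43, 43, 43,
         41, 31, 65, 30, 41, 58, 49, 41,
         37, 62, 27, 47, 65, 50, 45, 48,
         27, 53, 40, 29, 63, 34, 44, 32,
         58, 61, 38, 41, 26, 50, 47, 37]

k = 8  # number of classes, fixed by the example
U = 1  # unit of measurement (whole hours)

R = max(hours) - min(hours)  # range
W = ceil(R / k)              # class width, rounded up

lower = [min(hours) + i * W for i in range(k)]  # lower class limits
upper = [lcl + W - U for lcl in lower]          # upper class limits

# Frequency of each class: observations between the limits, inclusive.
freq = [sum(lcl <= x <= ucl for x in hours) for lcl, ucl in zip(lower, upper)]

less_than = list(accumulate(freq))                  # cumulative, less than type
more_than = list(accumulate(reversed(freq)))[::-1]  # cumulative, more than type

for lcl, ucl, f, lt, mt in zip(lower, upper, freq, less_than, more_than):
    # class limits, class boundaries, frequency, cumulative frequencies
    print(f"{lcl}-{ucl}  {lcl - U / 2}-{ucl + U / 2}  {f:>2}  {lt:>2}  {mt:>2}")
```

The frequencies and both cumulative columns printed by this sketch match the table above.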

Diagrammatic presentation of data: Bar charts, Pie-chart, Cartograms


The most convenient and popular way of describing data is graphical
presentation. It is easier to understand and interpret data when they are presented
graphically than when they are described in words or in a frequency table. A graph can
present data in a simple and clear way and can illustrate the important aspects of the
data. This leads to better analysis and presentation of the data.
What is it?
Graphs are pictorial representations of the relationships between two (or more)
variables and are an important part of descriptive statistics. Different types of graphs
can be used for illustration purposes depending on the type of variable (nominal,
ordinal, or interval) and the issues of interest.
When to Use It?
Graphs can be used any time one wants to visually summarize the relationships
between variables, especially if the data set is large or unmanageable. They are
routinely used in reports to underscore a particular statement about a data set and to
enhance readability. Graphs can appeal to visual memory in ways that mere tallies,
tables, or frequency distributions cannot. However, if not used carefully, graphs can
misrepresent relationships between variables or encourage inaccurate conclusions.
Importance:
• They have greater attraction.
• They facilitate comparison.
• They are easily understandable.
The three most commonly used diagrammatic presentations for discrete as well as
qualitative data are: • Pie charts, • Pictograms, • Bar charts
Pie chart
A pie chart is a circle that is divided into sections or wedges according to the
percentage of frequencies in each category of the distribution. The angle of the sector
is obtained using:

Angle of sector = (Value of the part / The whole quantity) × 360°

Example
The following data are the blood types of 50 volunteers at a blood plasma donation
clinic: O A O AB A A O O B A O A AB B O O O A B A A O A A
O B A O AB A O O A B A A A O B O O A O A B O AB A O B
a) Organize this data using a categorical frequency distribution
b) Present the data using both a pie and a bar chart.
Solution
a) The classes of the frequency distribution are A, B, O, AB. Count the number of
donors for each of the blood types.
Table: the number of donors by blood types.
Blood type   Frequency   Percent
A            19          38
B            8           16
O            19          38
AB           4           8
Total        50          100
b) Pie chart
Find the percentage of donors for each blood type. In order to find the angle of
the sector for each blood type, multiply the corresponding percentage by 360°
and divide by 100.

Figure: Pie-chart of the data on blood types of donors (sectors: A, B, O, AB).
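The sector angles for the blood-type data follow directly from the angle formula. The following is a minimal Python sketch using the frequencies from the table; it computes the percentage and the angle of each sector but does not draw the chart.

```python
# Frequencies of the blood types of the 50 donors.
freq = {"A": 19, "B": 8, "O": 19, "AB": 4}

total = sum(freq.values())  # the whole quantity

# Angle of sector = (value of the part / the whole quantity) x 360 degrees
angles = {blood_type: f / total * 360 for blood_type, f in freq.items()}

for blood_type, f in freq.items():
    percent = f / total * 100
    print(f"{blood_type:<3} {f:>3} {percent:>5.1f}% {angles[blood_type]:>6.1f} degrees")
```

The angles are 136.8° for types A and O, 57.6° for type B and 28.8° for type AB, which add up to 360°.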


Pictogram
In these diagrams, we represent data by means of picture symbols. We decide
on a suitable picture to represent a definite number of units in which the variable is
measured.
Example: The following table shows the orange production of a plantation for the
production years 1990-1993. Represent the data by a pictogram.
Table: Orange production from 1990 to 1993
Production year   1990   1991   1992   1993
Amount (in kg)    3000   3850   3500   5000
Solution:
Figure: Pictogram of the data on orange production from 1990 to 1993.
Bar Charts
Bar graphs are commonly used to show the number or proportion of nominal or
ordinal data which possess a particular attribute. They depict the frequency of each
category of data points as a bar rising vertically from the horizontal axis. Bar graphs
most often represent the number of observations in a given category, such as the
number of people in a sample falling into a given income or ethnic group. They can also
be used to show the proportion of such data points. Bar graphs are especially good for
showing how nominal data change over time. They are useful for comparing
aggregates over time or space. Bars can be drawn either vertically or horizontally.

There are different types of bar charts. The most common are: simple bar chart,
component or sub-divided bar chart, and multiple bar chart.
Simple Bar Chart: is used to display data on one variable; the bars are thick lines
(narrow rectangles) having the same breadth.
Example: Suppose that the following were the gross revenues (in $100,000.00) for
company XYZ for the years 1989, 1990 and 1991. Draw the bar graph for these data.
Table: Gross revenue of companies
Year Revenue
1989 110
1990 95
1991 65
Solution:
The bar diagram for this data can be constructed as follows with the revenues
represented on the vertical axis and the years represented on the horizontal axis.

Figure: Simple bar chart of revenue (vertical axis: Sum of Revenue) by year (horizontal axis: Year).

Component Bar chart
When there is a desire to show how a total (or aggregate) is divided into its
component parts, we use a component bar chart. The bars represent the total value of a
variable, with each total broken into its component parts; different colors or
designs are used for identification.
Example: Construct a sub-divided bar chart for the four types of products in relation
to the opinion of consumers purchasing the given products as given below:
Table: Opinion of consumers purchasing the given products
Products Definitely Probably Unsure No
Product 1 50% 40% 10% 2%
Product 2 60% 30% 12% 15%
Product 3 70% 45% 8% 8%
Product 4 60% 35% 5% 20%
Figure: Component bar chart of consumer opinion (vertical axis: Opinion of Consumers, 0% to 100%; horizontal axis: Products 1 to 4; components: Definitely, Probably, Unsure, No).

Multiple Bar charts


These are used to display data on more than one variable.
They are used for comparing different variables at the same time.
Example: Construct a multiple bar chart for the 3 types of expenditures in dollars
for a family of four for the years 1, 2, 3 and 4 as given below:
Table: Expenditures in dollars for families
Year Food Education Other Total
1 3000 2000 3000 8000
2 3500 3000 4000 10500
3 4000 3500 5000 12500
4 5000 5000 6000 16000

Figure: Multiple bar chart for Expenditures in dollars for families (vertical axis: Types of expenditures; horizontal axis: Year; bars: Food, Education, Other, Total).

Graphical presentation of data: Histogram, and Frequency Polygon
The histogram, frequency polygon and cumulative frequency graph (ogive) are the most
commonly applied graphical representations for continuous data.
Procedures for constructing statistical graphs:
• Draw and label the X and Y axes.
• Choose a suitable scale for the (cumulative) frequencies and label it on the Y axis.
• Represent the class boundaries for the histogram or ogive, or the mid points for the
frequency polygon, on the X axis.
• Plot the points.
• Draw the bars or lines to connect the points.
i. Histogram
Histograms, or column bar charts, are a common way of presenting the frequencies of a
number of classes. Commonly used graphical presentation methods also include
the frequency polygon and ogive. A histogram portrays a frequency
distribution table for further statistical use. A frequency distribution table is a
tabulation of the n measurements into k mutually exclusive classes showing the
number of observations in each. In a histogram the bars are drawn with the classes
marked on the x axis and the class frequencies on the y axis. The histogram is
constructed by creating x-axis units of equal size, and these should correspond to the
classes of the frequency table. A histogram usually shows the number of observations in a specific
range.
Example: Construct a histogram for the frequency distribution of the time spent by
the automobile workers.
Table: Time in minutes spent by automobile workers
Time (in minute) Class mark Number of workers
15.5- 21.5 18.5 3
21.5-27.5 24.5 6
27.5-33.5 30.5 8
33.5-39.5 36.5 4
39.5-45.5 42.5 3
45.5-51.5 48.5 1

Figure: Histogram of the data on the number of minutes spent by the automobile workers (vertical axis: Frequency; horizontal axis: Time, marked at the class marks 18.5 to 48.5).

Table: Data given to present a Histogram

Class Boundaries   Class Mark   Frequency   Cumulative Frequency (less than type)   Cumulative Frequency (more than type)
5.5 – 11.5 8.5 2 2 20
11.5 – 17.5 14.5 2 4 18
17.5 – 23.5 20.5 7 11 16
23.5 – 29.5 26.5 4 15 9
29.5 – 35.5 32.5 3 18 5
35.5 – 41.5 38.5 2 20 2
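The two cumulative frequency columns of the table above can be reproduced with running totals. The following is a minimal Python sketch; the variable names are illustrative and not part of the original notes.

```python
from itertools import accumulate

# Class frequencies from the table above (classes 5.5-11.5 up to 35.5-41.5).
frequencies = [2, 2, 7, 4, 3, 2]

# "Less than" type: running total starting from the first class.
less_than = list(accumulate(frequencies))

# "More than" type: running total starting from the last class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)  # [2, 4, 11, 15, 18, 20]
print(more_than)  # [20, 18, 16, 9, 5, 2]
```

The last "less than" value and the first "more than" value both equal the total number of observations, which is a quick check on the table.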
ii. Frequency Polygon
A frequency polygon is a line graph drawn by taking the frequencies of the classes
along the vertical axis and their respective class marks along the horizontal axis. The
plotted points are then joined by straight line segments.

Example: Construct a frequency polygon for the frequency distribution of the time
spent by the automobile workers that we have seen in example 2.9.

Figure: Number of minutes spent by the automobile workers.


iii. Cumulative Frequency Polygon (Ogive)
It is a graph obtained by plotting the cumulative frequencies of a distribution against
the boundaries used to form the cumulative frequencies.
Example: Construct an ogive for the time spent by the automobile workers.

3. Measures of Central Tendency

Introduction
Measures of central tendency give us information about the location of the center
of the distribution of data values. A single value that describes the characteristics of
the entire mass of data is called a measure of central tendency. In this unit we will briefly
discuss three measures of central tendency: the mean, median and mode.
Suppose we have a random sample of n values of some measurement X. The values
are $X_1, X_2, \ldots, X_n$. We want to summarize the information contained in this sample as
regards an "average" level. An average is a numerical value that indicates the middle
point or central region of the raw data. It summarizes the data mathematically so that
appropriate comparisons can be made.

Objectives of measuring central tendency

The objectives of measuring central tendency are:
• To get a single value that represents (describes) the characteristics of the entire data
• To summarize (reduce) the volume of the data
• To facilitate comparison within one group or between groups of data
• To enable further statistical analysis

The Summation Notation (∑): Let a data set consist of a number of observations,
represented by $x_1, x_2, \ldots, x_n$, where n denotes the number of observations in the data and
$x_i$ is the i-th observation. Then $\sum_{i=1}^{n} x_i$ is the sum of all the observations.

For instance, a data set consisting of six measurements 21, 13, 54, 46, 32 and 37 is
represented by $x_1, x_2, x_3, x_4, x_5$ and $x_6$, where $x_1 = 21$, $x_2 = 13$, $x_3 = 54$, $x_4 = 46$,
$x_5 = 32$ and $x_6 = 37$.

Their sum becomes $\sum_{i=1}^{6} x_i = 21 + 13 + 54 + 46 + 32 + 37 = 203$.

Similarly, $x_1^2 + x_2^2 + \cdots + x_n^2 = \sum_{i=1}^{n} x_i^2$.

Some Properties of the Summation Notation

1. $\sum_{i=1}^{n} c = n \cdot c$, where c is a constant number.
2. $\sum_{i=1}^{n} b\,x_i = b \sum_{i=1}^{n} x_i$, where b is a constant number.
3. $\sum_{i=1}^{n} (a + b\,x_i) = n \cdot a + b \sum_{i=1}^{n} x_i$, where a and b are constant numbers.
4. $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
5. $\sum_{i=1}^{n} x_i y_i \neq \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i$
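The summation properties can be checked numerically on the six measurements used above. This is only an illustrative sketch: the second series `y` and the constants `a`, `b` and `c` are hypothetical values chosen for the check, not data from the notes.

```python
# The six measurements from the example above.
x = [21, 13, 54, 46, 32, 37]
# Hypothetical second series, used only to illustrate properties 4 and 5.
y = [1, 2, 3, 4, 5, 6]
# Arbitrary constants chosen for illustration.
a, b, c = 2, 3, 7
n = len(x)

total = sum(x)                           # sum of x_i
sum_of_squares = sum(v ** 2 for v in x)  # sum of x_i squared

prop1 = sum(c for _ in range(n)) == n * c
prop2 = sum(b * v for v in x) == b * sum(x)
prop3 = sum(a + b * v for v in x) == n * a + b * sum(x)
prop4 = sum(u + v for u, v in zip(x, y)) == sum(x) + sum(y)
# Property 5: the sum of products generally differs from the product of sums.
prop5 = sum(u * v for u, v in zip(x, y)) != sum(x) * sum(y)

print(total, prop1, prop2, prop3, prop4, prop5)
```

A numerical check of this kind does not prove the properties, but it is a quick way to catch a misapplied rule.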

Important Characteristics of a Good Average: A typical average should possess the
following:
• It should be rigidly defined.
• It should be based on all observations under investigation.
• It should be affected as little as possible by extreme observations.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by fluctuations of sampling.
• It should be easy to calculate and simple to understand.
i. Mean
The mean of a sample is determined by summing up all the data values of the sample
and dividing this sum by the total number of data values.
There are three types of means, each suitable for a particular type of data. These
are
1. Arithmetic Mean
2. Geometric Mean
3. Harmonic Mean
Arithmetic Mean
Arithmetic mean is defined as the sum of the measurements of the items divided by
the total number of items. It is usually denoted by $\bar{x}$.

Suppose $x_1, x_2, \ldots, x_n$ are n observed values in a sample of size n from a population
of size N, n < N. Then the arithmetic mean of the sample, denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

If we take an entire population, the mean is denoted by µ and is given by:

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example: Consider the samples given for the two routes as follows:
For route A, the sample values are: 38 41 34 46 36 30 37 36 32 40
For route B the sample values are: 28 32 36 30 31 37 36 80 32 34 31
Find the arithmetic mean for the two routes.
Solution:
For route A, the sample values were: 38 41 34 46 36 30 37 36 32 40.
$$\bar{x} = \frac{38 + 41 + 34 + 46 + 36 + 30 + 37 + 36 + 32 + 40}{10} = \frac{370}{10} = 37.0 \text{ minutes}$$
For route B, the sample values were: 28 32 36 30 31 37 36 80 32 34 31.
$$\bar{x} = \frac{28 + 32 + 36 + 30 + 31 + 37 + 36 + 80 + 32 + 34 + 31}{11} = \frac{407}{11} = 37.0 \text{ minutes}$$
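The two route means can be verified with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
from statistics import mean

route_a = [38, 41, 34, 46, 36, 30, 37, 36, 32, 40]
route_b = [28, 32, 36, 30, 31, 37, 36, 80, 32, 34, 31]

mean_a = mean(route_a)  # 370 / 10
mean_b = mean(route_b)  # 407 / 11

print(mean_a, mean_b)
```

Both routes have the same mean even though route B contains the unusually large value 80, so the mean alone does not reveal that difference.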

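When the numbers $x_1, x_2, \ldots, x_k$ occur with frequencies $f_1, f_2, \ldots, f_k$,
respectively, then the mean can be expressed in a more compact form as

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$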

Example
Calculate the arithmetic mean of the sample of numbers of people living in twelve
houses in a street: 3 2 4 4 3 1 2 5 3 3 8 1.
$$\bar{x} = \frac{3 + 2 + 4 + 4 + 3 + 1 + 2 + 5 + 3 + 3 + 8 + 1}{12} = \frac{39}{12} = 3.25$$

The data above can be written in a frequency table, as follows.


Value, x 1 2 3 4 5 8
Frequency, f 2 2 4 2 1 1

The formula for the arithmetic mean for data of this type is

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

In this case we have:

$$\bar{x} = \frac{2 \times 1 + 2 \times 2 + 4 \times 3 + 2 \times 4 + 1 \times 5 + 1 \times 8}{2 + 2 + 4 + 2 + 1 + 1} = \frac{2 + 4 + 12 + 8 + 5 + 8}{12} = \frac{39}{12} = 3.25$$

The mean number of people living in the twelve houses in the street is 3.25.
Arithmetic Mean for Grouped Frequency Distribution

If data are given in the form of a continuous frequency distribution, the sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = \frac{f_1 m_1 + f_2 m_2 + \cdots + f_k m_k}{f_1 + f_2 + \cdots + f_k}$$

Where $m_i$ is the class mark of the i-th class, i = 1, 2, …, k;
$f_i$ is the frequency of the i-th class and k = the number of classes.

Note that $\sum_{i=1}^{k} f_i = n$ = the total number of observations.
Example: The following table gives the height (inches) of 100 students in a college.
Class Interval (CI) Frequency (f)
60 - 62 5
62 – 64 18
64 – 66 42
66 – 68 20
68 – 70 8
70 – 72 7
Total 100

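Calculate the mean.

Solution: The class marks are 61, 63, 65, 67, 69 and 71, and $\sum_{i=1}^{k} f_i = n = 100$, so

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = \frac{5(61) + 18(63) + 42(65) + 20(67) + 8(69) + 7(71)}{100} = \frac{6558}{100} = 65.58$$

The mean height of the students is 65.58 inches.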
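The grouped-mean calculation can be sketched in Python by deriving each class mark as the midpoint of its class limits. The tuple layout and variable names below are illustrative assumptions.

```python
# (lower limit, upper limit, frequency) for each height class in the table above.
classes = [
    (60, 62, 5),
    (62, 64, 18),
    (64, 66, 42),
    (66, 68, 20),
    (68, 70, 8),
    (70, 72, 7),
]

n = sum(f for _, _, f in classes)
# Class mark m_i is the midpoint of the class; weight it by the frequency f_i.
weighted_sum = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)
```

Because every observation in a class is represented by its class mark, the result is an approximation to the mean of the raw heights.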


Properties of the Arithmetic Mean
• The sum of the deviations of the items from their arithmetic mean is zero. This
means that the algebraic sum of the deviations of a set of numbers $x_1, x_2, \ldots, x_n$ from
their mean $\bar{x}$ is zero. That is, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
• The sum of the squares of the deviations of a set of observations from any
number, say A, is minimum when $A = \bar{x}$. That is, $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$.
• When a set of observations is divided into k groups and $\bar{x}_1$ is the mean of $n_1$
observations of group 1, $\bar{x}_2$ is the mean of $n_2$ observations of group 2, …, $\bar{x}_k$ is
the mean of $n_k$ observations of group k, then the combined mean, denoted by $\bar{x}_c$,
of all observations taken together is given by

$$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k}$$

• If the mean of $x_1, x_2, \ldots, x_n$ is $\bar{x}$, then
a) the mean of $x_1 \pm k, x_2 \pm k, \ldots, x_n \pm k$ will be $\bar{x} \pm k$.
b) the mean of $k x_1, k x_2, \ldots, k x_n$ will be $k\bar{x}$.
Example
Last year there were three sections taking Stat 200 course. At the end of the semester,
the three sections got average marks of 80, 83 and 76. There were 28, 32 and 35
students in each section respectively. Find the mean mark for all the students.
Solution:

$$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{28(80) + 32(83) + 35(76)}{28 + 32 + 35} = \frac{7556}{95} = 79.54$$

Weighted Arithmetic Mean


In finding the arithmetic mean, all items were assumed to be of equal importance (each
value in the data set has equal weight). When the observations have different weights,
we use a weighted average. Weights are assigned to each item in proportion to its
relative importance.

If $x_1, x_2, \ldots, x_k$ represent the values of the items and $w_1, w_2, \ldots, w_k$ are the corresponding
weights, then the weighted mean $\bar{x}_w$ is given by

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_k x_k}{w_1 + w_2 + \cdots + w_k} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

Example: A student's final marks in Mathematics, Physics, Chemistry and Biology
are respectively 82, 80, 90 and 70. If the respective credits received for these courses
are 3, 5, 3 and 1, determine the approximate average mark the student has got for one
course.
Solution
We use a weighted arithmetic mean, with the weight associated with each course taken
as the number of credits received for the corresponding course.

Therefore $$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{(3 \times 82) + (5 \times 80) + (3 \times 90) + (1 \times 70)}{3 + 5 + 3 + 1} = \frac{986}{12} = 82.17$$

The average mark of the student for one course is approximately 82.
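A weighted mean is straightforward to compute in code. The sketch below uses a hypothetical helper, `weighted_mean`, and applies it to the course marks above and to the section averages from the earlier combined-mean example, since the combined mean is a weighted mean with group sizes as weights.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Course marks weighted by credits.
marks = [82, 80, 90, 70]
credits = [3, 5, 3, 1]
course_average = weighted_mean(marks, credits)          # 986 / 12

# Section averages weighted by section sizes (the combined mean).
section_means = [80, 83, 76]
section_sizes = [28, 32, 35]
combined = weighted_mean(section_means, section_sizes)  # 7556 / 95

print(round(course_average, 2), round(combined, 2))
```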


Geometric Mean
The geometric mean, like the arithmetic mean, is a calculated average. It is used when
observed values are measured as ratios, percentages, proportions, indices or growth
rates.
The geometric mean, G.M., of a series of numbers $x_1, x_2, \ldots, x_n$ is defined as
$$GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$

Properties of geometric mean


a. It is not easy to calculate.
b. It involves all observations during computation.
c. It may not be defined if even a single observation is negative.
d. If the value of one observation is zero, its value becomes zero.
Remark:
1) When the observed values $x_1, x_2, \ldots, x_k$ have the corresponding frequencies
$f_1, f_2, \ldots, f_k$ respectively, the geometric mean is obtained by

$$GM = \sqrt[n]{x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}}, \quad \text{where } n = \sum_{i=1}^{k} f_i$$

2) The above formula can also be used when the frequency distribution is
grouped (continuous), with the class marks of the class intervals taken as $m_i$:

$$GM = \sqrt[n]{m_1^{f_1} \cdot m_2^{f_2} \cdots m_k^{f_k}}, \quad \text{where } n = \sum_{i=1}^{k} f_i$$

Example: Compute the geometric mean of the following values: 2, 8, 6, 4, 10, 6, 8, 4


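Solution: The values 2 and 10 occur once each and the values 4, 6 and 8 occur twice each, so n = 8.

$$GM = \sqrt[n]{x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}} = \sqrt[8]{2^1 \cdot 4^2 \cdot 6^2 \cdot 8^2 \cdot 10^1}$$

$$GM = \sqrt[8]{2 \times 16 \times 36 \times 64 \times 10} = \sqrt[8]{737280} = 5.41$$

The geometric mean for the given data is 5.41.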
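The same geometric mean can be computed directly from the eight raw values. The sketch below shows two equivalent routes, the n-th root of the product and the exponential of the mean logarithm; the variable names are illustrative.

```python
import math

values = [2, 8, 6, 4, 10, 6, 8, 4]
n = len(values)

# Route 1: n-th root of the product of all observations.
product = math.prod(values)
gm_root = product ** (1 / n)

# Route 2: exponential of the arithmetic mean of the logarithms.
# This avoids forming a very large product when there are many observations.
gm_log = math.exp(sum(math.log(v) for v in values) / n)

print(product, round(gm_root, 2))
```

The logarithmic route also makes the restriction to positive observations explicit, since the logarithm is undefined for zero and negative values.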


Harmonic Mean
It is a suitable measure of central tendency when the data pertain to speed, rate and time.
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals
of the individual observations. If $x_1, x_2, \ldots, x_n$ are n observations, then the harmonic mean
can be represented by the following formula:

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \cdots + \frac{1}{x_n}}$$

If the data are arranged in the form of a frequency distribution,

$$HM = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} \frac{f_i}{x_i}} = \frac{f_1 + \cdots + f_k}{\frac{f_1}{x_1} + \cdots + \frac{f_k}{x_k}}$$
Properties of harmonic mean


i. It is based on all observations in a distribution.
ii. It is used in situations where a small weight is given to larger
observations and a larger weight to smaller observations.
iii. It is difficult to calculate and understand.
iv. It is an appropriate measure of central tendency in situations where the data are
ratios, speeds or rates.
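Example: A motorist travels 480 km on each of 3 days. She travels for 10 hours at a rate of
48 km/hr on the 1st day, for 12 hours at a rate of 40 km/hr on the 2nd day and for 15 hours at a rate
of 32 km/hr on the 3rd day. What is her average speed?

Solution: Since the same distance is covered each day, the average speed is the harmonic mean of the three speeds:

$$HM = \frac{3}{\frac{1}{48} + \frac{1}{40} + \frac{1}{32}} = 38.92 \text{ km/hr}$$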
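A quick way to check the motorist example is to compare the harmonic mean of the three speeds with total distance divided by total time. The sketch below uses `statistics.harmonic_mean` from the Python standard library; the variable names are illustrative.

```python
from statistics import harmonic_mean

speeds = [48, 40, 32]  # km/hr on days 1, 2 and 3
hours = [10, 12, 15]   # hours driven on each day

hm = harmonic_mean(speeds)

# Distance covered each day: speed multiplied by time.
distances = [s * h for s, h in zip(speeds, hours)]
# Average speed over the whole trip: total distance / total time = 1440 / 37.
overall_speed = sum(distances) / sum(hours)

print(distances, round(hm, 2), round(overall_speed, 2))
```

The two results agree because the distance is the same on each day; with unequal distances, a distance-weighted harmonic mean would be needed instead.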

Note: Whenever the frequency distribution is grouped (continuous), the class marks of the
class intervals are taken as $m_i$ and the above formula can be used as

$$HM = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{m_i}}, \quad \text{where } n = \sum_{i=1}^{k} f_i \text{ and } m_i \text{ is the class mark of the i-th class.}$$

Relations among different means


1. If all the observations are positive, the three means satisfy the
relationship: $\bar{x} \ge GM \ge HM$
2. For two observations, $\bar{x} \times HM = GM^2$
3. $\bar{x} = GM = HM$ if all observations are positive and have equal magnitude.

ii. Median
The median of a set of items (numbers) arranged in order of magnitude (i.e. in an
array form) is the middle value or the arithmetic mean of the two middle values. We
shall denote the median of $x_1, x_2, \ldots, x_n$ by $\tilde{x}$.

Median for Ungrouped Frequency Distribution


We arrange the sample in ascending order of the variable of interest. Then the median
is the middle value (if the sample size n is odd) or the average of the two middle
values (if the sample size n is even).
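For ungrouped data the median is obtained by

$$\tilde{x} = \begin{cases} \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ item}, & \text{if the number of items, } n, \text{ is odd} \\[2ex] \dfrac{\left(\dfrac{n}{2}\right)^{\text{th}} \text{ item} + \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ item}}{2}, & \text{if the number of items, } n, \text{ is even} \end{cases}$$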

Example: Compute the medians for the following two sets of data
A: 38 41 34 46 36 30 37 36 32 40.
B: 28 32 36 30 31 37 36 80 32 34 31
Solution
For A, the sample times in ascending order are: 30 32 34 36 36 37 38 40 41 46.
The two middle values are 36 and 37. So the median time is
$$\tilde{x} = \frac{36 + 37}{2} = 36.5 \text{ minutes.}$$
For B, the sample times in ascending order are: 28 30 31 31 32 32 34 36 36 37 80.
The middle (6th) value is 32. So the median time is $\tilde{x}$ = 32 minutes.
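Both medians can be confirmed with `statistics.median`, which sorts the data and averages the two middle values when the sample size is even. The variable names in this sketch are illustrative.

```python
from statistics import median

route_a = [38, 41, 34, 46, 36, 30, 37, 36, 32, 40]      # n = 10 (even)
route_b = [28, 32, 36, 30, 31, 37, 36, 80, 32, 34, 31]  # n = 11 (odd)

median_a = median(route_a)  # average of the 5th and 6th sorted values
median_b = median(route_b)  # the 6th sorted value

print(sorted(route_a), median_a)
print(sorted(route_b), median_b)
```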

The median is easy to calculate for small samples and is not affected by an "outlier", that is, an
atypical value in the sample, e.g. the value 80 in the sample for route B. It is valid for
ranked (ordinal) as well as interval/ratio (numeric) data.
Median for Grouped Frequency Distribution
For grouped data the median, obtained by the interpolation method, is given by

$$\tilde{x} = L_{med} + \left(\frac{\frac{n}{2} - F}{f_{med}}\right) W$$

Where $L_{med}$ = lower class boundary of the median class
F = sum of frequencies of all classes lower than the median class (in other words, it
is the cumulative frequency immediately preceding the median class)
$f_{med}$ = frequency of the median class and
W = class width
The median class is the class with the smallest cumulative frequency greater than or equal
to $\frac{n}{2}$.
Example: Calculate the median for the following frequency distribution.

Class Boundary   Class Frequency   Less than Cum. Freq (CF)
0–5 5 5
5 – 10 8 13
10 – 15 10 23
15 – 20 8 31
20 – 25 5 36
Solution:
To obtain the median class we divide 36 by 2. That is, $\frac{n}{2} = \frac{36}{2} = 18$.
The smallest cumulative frequency greater than or equal to 18 is 23. Hence the median class is
the 3rd class, which is 10 – 15.
Therefore the median is found with $L_{med}$ = 10 (lower class boundary of the 3rd class),
F = 13 (sum of frequencies of all classes lower than the 3rd class),
$f_{med}$ = 10 (frequency of the 3rd class) and W = 5 (class width):

$$\tilde{x} = 10 + 5\left(\frac{18 - 13}{10}\right) = 10 + 2.5 = 12.5$$
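The interpolation method for the grouped median can be sketched as a short Python function. The function name and the representation of classes as (lower boundary, upper boundary) pairs are assumptions made for illustration.

```python
def grouped_median(boundaries, frequencies):
    """Interpolated median for classes given as (lower, upper) boundaries."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the total frequency must be positive")
    cumulative = 0  # F: cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, frequencies):
        if cumulative + f >= n / 2:
            # First class whose cumulative frequency reaches n/2: the median class.
            width = upper - lower
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f

boundaries = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25)]
frequencies = [5, 8, 10, 8, 5]

result = grouped_median(boundaries, frequencies)
print(result)  # 12.5
```

Here the median class is 10 to 15, with 13 observations below it, so the function interpolates halfway into a class of width 5.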

iii. Mode
The mode or the modal value is the most frequently occurring score/observation in a
series and denoted by x̂ . A data set may not have a mode or may have more than one
mode. A distribution is called a bimodal distribution if it has two data values that
appear with the greatest frequency. If a distribution has more than two modes, then
the distribution is multimodal. If a distribution has no modes, then the distribution is
non-modal.
Example: Consider the following data:
Data X: 3, 4, 6, 12, 31, 8, 9, 8 the Mode ( x̂ ) = 8
Data Y: 6, 8, 12, 13, 11, 12, 6 the Mode ( x̂ ) = 6 and 12
Data Z: 2, 6, 3, 5, 7, 8, 12, 11 No Mode
Note that in some samples there may be more than one mode or there may not be a
mode. The mode is not a suitable measure of central tendency in these cases.
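The three data sets above can be checked with `statistics.multimode` from the Python standard library. Note that `multimode` returns every value when all values occur equally often, so the hypothetical helper `modes_or_none` below maps that case to an empty list to match the "no mode" convention used in these notes.

```python
from statistics import multimode

def modes_or_none(data):
    """Return the list of modes, or [] when all values occur equally often."""
    found = multimode(data)
    # If every distinct value is returned, no value stands out as a mode.
    return [] if len(found) == len(set(data)) else found

data_x = [3, 4, 6, 12, 31, 8, 9, 8]
data_y = [6, 8, 12, 13, 11, 12, 6]
data_z = [2, 6, 3, 5, 7, 8, 12, 11]

modes_x = modes_or_none(data_x)  # one mode
modes_y = modes_or_none(data_y)  # two modes (bimodal)
modes_z = modes_or_none(data_z)  # no mode

print(modes_x, modes_y, modes_z)  # [8] [6, 12] []
```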

Mode for Grouped Frequency Distribution
For grouped data, the mode is found by the following formula:
$$\hat{x} = L_{mod} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) W$$
Where $L_{mod}$ = lower class boundary of the modal class
$\Delta_1 = f - f_1$, $\Delta_2 = f - f_2$
f = frequency of the modal class
$f_1$ = frequency of the class immediately preceding the modal class
$f_2$ = frequency of the class immediately following the modal class
W = the class width
The modal class is the class with the highest frequency in the distribution.
Example: Find the mode for the frequency distribution of the birth weight (in kilogram)
of 30 children given below.

Weight 1.9-2.3 2.3-2.7 2.7-3.1 3.1-3.5 3.5-3.9 3.9-4.3

No. of children 5 5 9 4 4 3

Solution:

2.7 – 3.1 is the modal class since it has the highest frequency, so $L_{mod}$ = 2.7 and W = 0.4.

$\Delta_1 = 9 - 5 = 4$ and $\Delta_2 = 9 - 4 = 5$

$$\hat{x} = 2.7 + \left(\frac{4}{4 + 5}\right) \times 0.4 = 2.878$$
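The grouped-mode formula can be sketched in Python as follows. The function name and argument order are illustrative, and the sketch assumes the modal class is neither the first nor the last class, so that both neighbouring frequencies exist.

```python
def grouped_mode(lower_boundary, width, f_modal, f_before, f_after):
    """Mode of grouped data: L + delta1 / (delta1 + delta2) * W."""
    delta1 = f_modal - f_before
    delta2 = f_modal - f_after
    return lower_boundary + delta1 / (delta1 + delta2) * width

# Birth-weight classes: lower boundaries and frequencies from the table above.
lower_boundaries = [1.9, 2.3, 2.7, 3.1, 3.5, 3.9]
frequencies = [5, 5, 9, 4, 4, 3]
class_width = 0.4

# The modal class is the class with the highest frequency (assumed interior).
i = frequencies.index(max(frequencies))
mode = grouped_mode(lower_boundaries[i], class_width,
                    frequencies[i], frequencies[i - 1], frequencies[i + 1])

print(round(mode, 3))  # 2.878
```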

The Relationship of the Mean, Median and Mode

• In the case of a symmetrical distribution, the mean, median and mode coincide.
That is, mean = median = mode. However, for a moderately asymmetrical
(non-symmetrical) distribution, the mean and mode lie on the two ends and the
median lies between them, and they have the following important empirical
relationship: Mean − Mode = 3(Mean − Median) ⇒ $\bar{x} - \hat{x} = 3(\bar{x} - \tilde{x})$

Example In a moderately asymmetrical distribution, the mean and the median are 20
and 25 respectively. What is the mode of the distribution?
Solution:
Mode = 3median – 2mean = 3(25) – 2(20) = 75 – 40
= 35, hence the mode of the distribution will be 35.

4. MEASURES OF VARIATION (DISPERSION)


Introduction and objectives of measuring variation
We have seen that averages are representatives of a frequency distribution. But they
fail to give a complete picture of the distribution. They do not tell anything about the
spread or dispersion of observations within the distribution. Suppose that we have the
distribution of yield (kg per plot) of two rice varieties from 5 plots each.

Variety 1: 45 42 42 41 40
Variety 2: 54 48 42 33 30
The mean yield of both varieties is about 42 kg (42 kg for variety 1 and 41.4 kg for
variety 2). The mean yield of variety 1 is close to the
values in this variety. On the other hand, the mean yield of variety 2 is not close to the
values in variety 2. The mean doesn't tell us how close the observations are to each
other. This example suggests that a measure of central tendency alone is not sufficient
to describe a frequency distribution. Therefore, we should have a measure of spreads
of observations. There are different measures of dispersion.
Objectives of measuring variation
• To describe dispersion (variability) in a data set.
• To compare the spread in two or more distributions.
• To determine the reliability of an average.
Note: The desirable properties of a good measure of variation are almost identical to
those of a good measure of central tendency.
Absolute and relative measures
Absolute measures of dispersion are expressed in the same unit of measurement in
which the original data are given. These values may be used to compare the variation
in two distributions provided that the variables are in the same units and of the same
average size. In case the two sets of data are expressed in different units, however,
such as quintals of sugar versus tonnes of sugarcane, or if the average sizes are very
different such as manager’s salary versus worker’s salary, the absolute measures of
dispersion are not comparable. In such cases measures of relative dispersion should be
used. A measure of relative dispersion is the ratio of a measure of absolute dispersion
to an appropriate measure of central tendency. It is a unitless measure.
Types of Measures of Variation
i. The range and relative range
Range (R) is defined as the difference between the maximum and minimum
observations in a set of data.

Range = Maximum value − Minimum value

It is the crudest absolute measure of variation. It is widely used in the construction of
quality control charts and the description of daily temperature.

Properties of range
• It is affected by extreme values.
• It does not take into account all observations.
• It is easy to calculate and simple to understand.
• It does not tell anything about the distribution of values in the set of data
relative to some measure of central tendency.

Relative range (RR) is defined as

$$RR = \frac{\text{Range}}{\text{Max. value} + \text{Min. value}}$$
ii. The mean deviation and coefficient of mean deviation
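Mean deviation (MD) is the average of the absolute deviations taken from a central
value, generally the mean or median.

$$MD_{\bar{X}} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| f_i}{n}, \qquad MD_{\tilde{X}} = \frac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n} = \frac{\sum_{i=1}^{k} |X_i - \tilde{X}| f_i}{n}$$

In each case the first form applies to raw data and the second to a frequency distribution with k distinct values.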
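Example: Calculate the mean deviation about the mean and about the median of the
following scores of students in a certain test: 6, 7, 7, 10, 10.

Solution: The mean is $\bar{X} = 8$ and the median is $\tilde{X} = 7$.

$$MD_{\bar{X}} = \frac{\sum_{i=1}^{k} |X_i - \bar{X}| f_i}{n} = \frac{|6 - 8| + |7 - 8| \times 2 + |10 - 8| \times 2}{5} = \frac{8}{5} = 1.6$$

$$MD_{\tilde{X}} = \frac{\sum_{i=1}^{k} |X_i - \tilde{X}| f_i}{n} = \frac{|6 - 7| + |7 - 7| \times 2 + |10 - 7| \times 2}{5} = \frac{7}{5} = 1.4$$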
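Both mean deviations can be computed with one small helper that takes the central value as an argument. The function name `mean_deviation` is illustrative.

```python
from statistics import mean, median

def mean_deviation(data, center):
    """Average absolute deviation of the data from a chosen central value."""
    return sum(abs(x - center) for x in data) / len(data)

scores = [6, 7, 7, 10, 10]

md_mean = mean_deviation(scores, mean(scores))      # about the mean (8)
md_median = mean_deviation(scores, median(scores))  # about the median (7)

print(md_mean, md_median)
```

For this data set the mean deviation about the median is the smaller of the two.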

Note: In the case of grouped data, the mid-point of each class interval is treated as $X_i$ and we
can use the above formula. Besides, $MD_{\bar{X}} \ge MD_{\tilde{X}}$.

Properties of mean deviation
• It is relatively simple to understand as compared to standard deviation.
• Its computation is simple.
• It is less affected by extreme values than standard deviation.
• It is better than the range and QD since it is based on all observations.
• It is not suitable for further statistical treatment.
iii. Variance, standard deviation and coefficient of variation
The variance is the average of the squares of the distance each value is from the mean.
The symbol for the population variance is $\sigma^2$. Let $x_1, x_2, \ldots, x_N$ be the measurements on
N population units. Then the population variance is given by the formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}$$

Where µ is the population mean and N is the population size.
Let $x_1, x_2, \ldots, x_n$ be the measurements on n sample units. Then the sample variance is
denoted by $S^2$, and its formula is

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

Where $\bar{x}$ is the sample mean and n is the sample size.

Standard deviation, denoted by σ or S, is the square root of the variance. That is,
the population standard deviation is $\sigma = \sqrt{\sigma^2}$ and the sample standard deviation is $S = \sqrt{S^2}$.
Example: For a newly created position, a manager interviewed the following
numbers of applicants each day over a five-day period: 16, 19, 15, 15, and 14. Find
the variance and standard deviation.

Solution: $$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{16 + 19 + 15 + 15 + 14}{5} = \frac{79}{5} = 15.8$$

$$\Rightarrow S^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{(16 - 15.8)^2 + \cdots + (14 - 15.8)^2}{4} = \frac{14.8}{4} = 3.7$$

Or $$S^2 = \frac{\sum_{i=1}^{5} x_i^2 - \frac{\left(\sum_{i=1}^{5} x_i\right)^2}{5}}{5 - 1} = \frac{(16^2 + \cdots + 14^2) - \frac{(16 + \cdots + 14)^2}{5}}{4} = \frac{1263 - \frac{6241}{5}}{4} = \frac{14.8}{4} = 3.7$$

The standard deviation is $S = \sqrt{3.7} \approx 1.92$.
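The two forms of the sample variance formula can be compared numerically, and both can be checked against `statistics.variance`, which also divides by n − 1. The variable names in this sketch are illustrative.

```python
from statistics import mean, stdev, variance

applicants = [16, 19, 15, 15, 14]
n = len(applicants)
x_bar = mean(applicants)

# Definitional form: sum of squared deviations from the mean over (n - 1).
definitional = sum((x - x_bar) ** 2 for x in applicants) / (n - 1)

# Computational form: (sum of squares - (sum)^2 / n) over (n - 1).
shortcut = (sum(x ** 2 for x in applicants) - sum(applicants) ** 2 / n) / (n - 1)

# Library results for the sample variance and sample standard deviation.
s2 = variance(applicants)
s = stdev(applicants)

print(round(definitional, 4), round(shortcut, 4), round(s2, 4), round(s, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead of n − 1.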
• For a grouped frequency distribution, the formula for the variance is

$$S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}}{n - 1}$$

Where k is the number of classes, $x_i$ is the class mark of class i and $n = \sum_{i=1}^{k} f_i$.

Properties of variance
• The unit of measurement of the variance is the square of the unit of
measurement of the observed values. This is one of its limitations.
• The variance gives more weight to extreme values as compared to those which
are near to the mean value, because the difference is squared in the variance.
• It is based on all observations in the data set.

Properties of standard deviation
• Standard deviation is considered to be the best measure of dispersion and is
used widely.
• There is, however, one difficulty with it. If the unit of measurement of
variables of two series is not the same, then their variability cannot be
compared by comparing the values of standard deviation.

Uses of the variance and standard deviation
• The variance and standard deviation can be used to determine the spread of
data, the consistency of a variable and the proportion of data values that fall
within a specified interval in a distribution.
• If the variance or standard deviation is large, the data are more dispersed. This
information is useful in comparing two or more data sets to determine which is
more (most) variable.
• Finally, the variance and standard deviation are used quite often in inferential
statistics.
Coefficient of variation (CV)
The standard deviation is an absolute measure of dispersion. The corresponding
relative measure is known as the coefficient of variation (CV).
Coefficient of variation is used in such problems where we want to compare the
variability of two or more different series. Coefficient of variation is the ratio of the
standard deviation to the arithmetic mean, usually expressed in percent:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
A distribution having less coefficient of variation is said to be less variable or more
consistent or more uniform or more homogeneous.

Example: Last semester, the students of two departments, A and B took Stat 276
course. At the end of the semester, the following information was recorded.
Dept A Dept B
Mean score 79 64
SD 23 11
Compare the relative dispersion of the two departments.
$$CV_A = \frac{s_A}{\bar{x}_A} \times 100\% = \frac{23}{79} \times 100 = 29.11\%, \qquad CV_B = \frac{s_B}{\bar{x}_B} \times 100\% = \frac{11}{64} \times 100 = 17.19\%$$
Since $CV_A > CV_B$, the variation in department A is greater. Or, in department B the
distribution of the marks is more uniform (consistent).
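The comparison of the two departments can be reproduced with a one-line helper. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean_value):
    """Return the coefficient of variation in percent: sd / mean * 100."""
    return sd / mean_value * 100

cv_a = coefficient_of_variation(23, 79)  # Dept A, about 29.11%
cv_b = coefficient_of_variation(11, 64)  # Dept B, exactly 17.1875%

more_variable = "A" if cv_a > cv_b else "B"
print(cv_a, cv_b, more_variable)
```

Because the coefficient of variation is unitless, the same helper can compare series measured in different units, such as weights in kg and heights in cm.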
Exercise: The mean weight of 20 children was found to be 30 kg with variance of
16kg2 and their mean height was 150 cm with variance of 25cm2. Compare the
variability of weight and height of these children.
iv. The standard scores
A standard score is a measure that describes the relative position of a single score in
the entire distribution of scores in terms of the mean and standard deviation. It also
gives us the number of standard deviations a particular observation lies above or below
the mean.

Standard score: $Z = \frac{x - \mu}{\sigma}$ or $Z = \frac{x - \bar{X}}{S}$

Where x is the value of the observation, µ and σ are the mean and standard
deviation of the population, and $\bar{X}$ and S are the mean and standard deviation of the
sample, respectively.

Interpretation:
• If Z is negative, the observation lies below the mean.
• If Z is positive, the observation lies above the mean.
• If Z is zero, the observation equals the mean.

Example: Two sections were given an exam in a course. The average score was 72
with standard deviation of 6 for section 1 and 85 with standard deviation of 5 for
section 2. Student A from section 1 scored 84 and student B from section 2 scored 90.
Who performed better relative to his/her group?
Solution:
Section 1: $\bar{X}_A = 72$, $S_A = 6$ and the score of student A from section 1 is $x_A = 84$.
Section 2: $\bar{X}_B = 85$, $S_B = 5$ and the score of student B from section 2 is $x_B = 90$.

Z-score of student A: $Z_A = \frac{x_A - \bar{X}_A}{S_A} = \frac{84 - 72}{6} = 2$

Z-score of student B: $Z_B = \frac{x_B - \bar{X}_B}{S_B} = \frac{90 - 85}{5} = 1$

From these two standard scores, we can conclude that student A has performed better
relative to his/her section because his/her score is two standard deviations
above the mean score of section 1, while the score of student B is only one standard
deviation above the mean score of section 2.
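The two standard scores in this example can be computed with a short helper. The function name `z_score` is illustrative.

```python
def z_score(x, mean_value, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean_value) / sd

z_a = z_score(84, 72, 6)  # student A, section 1
z_b = z_score(90, 85, 5)  # student B, section 2

better = "A" if z_a > z_b else "B"
print(z_a, z_b, better)  # 2.0 1.0 A
```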
Exercise: A student scored 65 on a calculus test that had a mean of 50 and a standard
deviation of 10; she scored 30 on an algebra test with a mean of 25 and a standard
deviation of 5. Compare her relative positions on each test.

v. Skewness and kurtosis
• Skewness refers to lack of symmetry in a distribution. If a distribution is not
symmetrical we call it a skewed distribution. Note that for a symmetrical and
unimodal distribution: Mean = median = mode

Measure of skewness:
The Pearsonian coefficient of skewness (Pcsk) is defined as: $\alpha_3 = \frac{\bar{x} - \hat{x}}{s}$

Interpretation:
• If $\alpha_3 < 0$, the distribution is negatively skewed.
• If $\alpha_3 > 0$, the distribution is positively skewed.
• If $\alpha_3 = 0$, the distribution is symmetrical.

In moderately skewed distributions: Mode = mean − 3(mean − median) ⇒ $\alpha_3 = \frac{3(\bar{x} - \tilde{x})}{s}$
Note: In a negatively skewed distribution larger values are more frequent than smaller
values. In a positively skewed distribution smaller values are more frequent than
larger values.
Exercise: The mean, mode and s.d. of a frequency distribution are 70.2, 73.6 and
6.4, respectively. What can one state about its skewness?

• Kurtosis refers to the degree of peakedness of a distribution. When the values of a
distribution are closely bunched around the mode in such a way that the peak of
the distribution becomes relatively high, the distribution is said to be leptokurtic.
If it is flat topped we call it platykurtic. A distribution which is neither highly
peaked nor flat topped is known as a mesokurtic distribution (normal).
