
Name of the Course : Foundation Course on Basics of Business Statistics


Topics Covered

Topic 1 An Introduction to Business Statistics

Topic 2 Data Collection , Population vs sample, Variables, Scales of Measurement

Topic 3 Data representation techniques- Tabular

Topic 4 Data representation techniques- Graphical

Topic 5 Measures of Central Tendency- Mean

Topic 6 Measures of Central Tendency- Median

Topic 7 Measures of Central Tendency- Mode

Topic 8 Measures of Dispersion / Variation- Range, Quartile Deviation

Topic 9 Measures of Dispersion - Mean Deviation

Topic 10 Measures of Dispersion - Standard Deviation

Topic 11 Skewness

Topic 12 Moments and Kurtosis


Topic 1 Topic : An Introduction to Business Statistics


1. Introduction
2. Meaning and Definitions of Statistics
3. Types of Data and Data Sources
4. Types of Statistics
5. Managerial Applications of Statistics
6. Importance of Statistics in different fields
7. Limitations of statistics
8. Self-Test Questions

1. Introduction

For a common person, ‘Statistics’ means numerical information expressed in quantitative
terms. This information may correspond to objects, subjects, activities, phenomena, or regions
of space. In fact, data have no limits as to their reference, coverage, and scope. At the macro
level, the data may include gross national product, share of agriculture, manufacturing, and
services in GDP (Gross Domestic Product). At micro level, individual firms, big or small,
produce extensive statistics on their operations. For example, the annual reports of companies
exhibit a variety of data on sales, production, expenditure, inventories, capital employed, and
other activities.

A student knows statistics more intimately as a subject of study like economics, mathematics,
chemistry, physics, and others. It is a discipline, which scientifically deals with data, and is
often described as the science of data. In dealing with statistics as data, statistics has developed
appropriate methods of collecting, presenting, summarizing, and analysing data, and thus
consists of a body of these methods.



2. Meaning and Definitions of Statistics

The word ‘statistics’ is derived from the Latin word ‘status’ or the Italian word ‘statista’,
both of which mean a political state or a government. Shakespeare used the word ‘statist’
in his play Hamlet (1602). In the past, statistics was used mainly by rulers. Its application
was very limited, but rulers and kings needed information about the lands, agriculture,
commerce, and population of their states to assess their military potential, their wealth,
taxation, and other aspects of government.

The word ‘statistics’ is used in two senses: plural and singular. In the plural sense, it refers to
a set of figures or data. In the singular sense, statistics refers to the whole body of tools that are
used to collect data, organise and interpret them and, finally, to draw conclusions from them.

Defining Statistics – The branch of mathematics concerned with collection, classification,
analysis, and interpretation of numerical facts, for drawing inferences on the basis of their
quantifiable likelihood (probability).

A.L. Bowley has defined statistics as: (i) statistics is the science of counting, (ii) Statistics may
rightly be called the science of averages, and (iii) statistics is the science of measurement of
social organism regarded as a whole in all its manifestations.

According to Prof. Horace Secrist, Statistics is the aggregate of facts, affected to a marked
extent by multiplicity of causes, numerically expressed, enumerated or estimated according to
reasonable standards of accuracy, collected in a systematic manner for a pre-determined
purpose, and placed in relation to each other.

From the above definitions, we can highlight the major characteristics of statistics as follows:

(i) Statistics are the aggregates of facts. It means a single figure is not statistics. For example,
national income of a country for a single year is not statistics but the same for two or more
years is statistics.


(ii) Statistics are affected by a number of factors. For example, sale of a product depends on a
number of factors such as its price, quality, competition, the income of the consumers, and so
on.

(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to erroneous
conclusions. Hence, it is necessary that conclusions must be based on accurate figures.

(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard
manner, they will not be reliable and will lead to misleading conclusions.

(v) Statistics must be collected for a pre-determined purpose.

(vi) Last but not the least, Statistics should be placed in relation to each other. If one collects
data unrelated to each other, then such data will be confusing and will not lead to any logical
conclusions. Data should be comparable over time and over space.


3. Types of Data and Data Sources

Statistical data refer to those features of a problem that can be measured, quantified, counted,
or classified. Any object, subject, phenomenon, or activity that generates data through this
process is termed a variable. A variable is one that shows a degree of variation / variability
when successive measurements are recorded. In statistics, data are classified into two broad
categories: quantitative data and qualitative data. This classification is based on the kind of
characteristics that are measured.

Quantitative data are those that can be quantified in definite units of measurement. These refer to
characteristics whose successive measurements yield quantifiable observations. Quantitative
data can be further categorized as continuous and discrete data. A variable may be a continuous
variable or a discrete variable.

(i) Continuous data represent the numerical values of a continuous variable. A continuous
variable is one that can take any value between any two points on a line segment, thus
representing an interval of values. The values are quite precise and close to each other,
yet different. Examples include characteristics such as weight, length, height,
thickness, velocity, temperature, and tensile strength. The data recorded on such
characteristics are called continuous data.

(ii) Discrete data are the values assumed by a discrete variable. A discrete variable is
one whose outcomes are measured in fixed numbers. Such data are essentially count data.
Examples include the number of customers visiting a departmental store every day, the
number of flights incoming at an airport, and the number of defective items in a consignment
received for sale.

Qualitative data refer to qualitative characteristics of a subject or an object. A characteristic
is qualitative in nature when its observations are defined and noted in terms of the presence or
absence of a certain attribute in discrete numbers. These data are further classified as nominal
and rank data.

(i) Nominal data are the outcome of classification into two or more categories of items
or units comprising a sample or a population according to some quality
characteristic. For example - Classification of students according to sex (as males
and females), classification of workers according to skill (as skilled, semi-skilled,
and unskilled), and of employees according to the level of education (as
matriculates, undergraduates, and post-graduates). On the basis of classification,
each item is assigned to a particular class, and the items belonging to each class
are then counted. The count data so obtained are called nominal data.

(ii) Rank data are the result of assigning ranks to specify order in terms of the integers
1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a test,
a contest, a competition, an interview, or a show. The candidates appearing in an
interview, for example, may be assigned ranks in integers ranging from 1 to n,
depending on their performance in the interview. Ranks so assigned can be viewed
as the continuous values of a variable.
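
The two kinds of qualitative data described above can be illustrated with a minimal Python sketch. The skill labels, candidate names, and interview scores are invented for illustration and are not taken from the course material.

```python
from collections import Counter

# Hypothetical skill labels for eight workers (illustrative values only).
skills = ["skilled", "unskilled", "semi-skilled", "skilled",
          "skilled", "unskilled", "semi-skilled", "skilled"]

# Nominal data: count how many items fall into each class.
nominal_counts = Counter(skills)

# Hypothetical interview scores for four candidates.
scores = {"Asha": 72, "Bala": 88, "Chitra": 65, "Dev": 91}

# Rank data: rank 1 goes to the best performer, rank n to the weakest.
ordered = sorted(scores, key=scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(nominal_counts)
print(ranks)
```

Note that the class counts carry no order among the classes, whereas the ranks carry order but say nothing about how far apart two candidates' performances were.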

Data as per data sources could be of two types, viz., secondary and primary.


(i) Secondary data: Data that already exist in some form, published or unpublished,
in an identifiable secondary source. They are generally available from published source(s),
though not necessarily in the form actually required.

(ii) Primary data: Those data which do not already exist in any form, and thus need to
be collected for the first time from the primary source(s). By their very nature, these
data require fresh and first-time collection covering the whole population or a
sample drawn from it.


4. Types of Statistics

The two main branches of statistics are descriptive statistics and inferential statistics. Both of
these are employed in scientific analysis of data and both are equally important for the student
of statistics.

Descriptive Statistics

Descriptive statistics deals with the collection and presentation of data. This is usually the first
part of a statistical analysis. It is usually not as simple as it sounds, and the statistician needs to
be careful in designing experiments, choosing the right focus group, and avoiding the biases that
can so easily creep into the experiment.

Different areas of study require different kinds of analysis using descriptive statistics. For
example, a physicist studying turbulence in the laboratory needs the average quantities that
vary over small intervals of time. The nature of this problem requires that physical quantities
be averaged from a host of data collected through the experiment.

Inferential Statistics

Inferential statistics, also known as inductive statistics, consists of methods that are used for
drawing inferences, or making broad generalizations, about a totality of observations on the
basis of knowledge about a part of that totality. The totality of observations about which an
inference may be drawn, or a generalization made, is called a population or a universe. The
part of totality, which is observed for data collection and analysis to gain knowledge about the
population, is called a sample.

The desired information about a given population of interest may also be collected
by observing all the units comprising the population. This total coverage is called a census.
Getting the desired value for the population through census is not always feasible and practical
for various reasons. Apart from time and money considerations making the census operations
prohibitive, observing each individual unit of the population with reference to any data
characteristic may at times involve even destructive testing. In such cases, obviously, the only
recourse available is to employ the partial or incomplete information gathered through a sample
for the purpose. This is precisely what inferential statistics does.


5. Managerial Applications of Statistics

• Actuarial science is the discipline that applies mathematical and statistical methods to
assess risk in the insurance and finance industries.
• Astrostatistics is the discipline that applies statistical analysis to the understanding of
astronomical data.
• Biostatistics is a branch of biology that studies biological phenomena and observations
by means of statistical analysis, and includes medical statistics.
• Business analytics is a rapidly developing business process that applies statistical
methods to data sets (often very large) to develop new insights and understanding of
business performance and opportunities.
• Chemometrics is the science of relating measurements made on a chemical system or
process to the state of the system via application of mathematical or statistical methods.
• Demography is the statistical study of all populations. It can be a very general science
that can be applied to any kind of dynamic population, that is, one that changes over
time or space.
• Econometrics is a branch of economics that applies statistical methods to the empirical
study of economic theories and relationships.


• Environmental statistics is the application of statistical methods to environmental
science. Weather, climate, air and water quality are included, as are studies of plant and
animal populations.
• Geostatistics is a branch of geography that deals with the analysis of data from
disciplines such as petroleum geology, hydrogeology, hydrology, meteorology,
oceanography, geochemistry, geography.
• Operations research (or Operational Research) is an interdisciplinary branch of
applied mathematics and formal science that uses methods such as mathematical
modeling, statistics, and algorithms to arrive at optimal or near optimal solutions to
complex problems.
• Population ecology is a sub-field of ecology that deals with the dynamics of species
populations and how these populations interact with the environment.
• Quality control reviews the factors involved in manufacturing and production; it can
make use of statistical sampling of product items to aid decisions in process control or
in accepting deliveries.
• Quantitative psychology is the science of statistically explaining and changing mental
processes and behaviors in humans.
• Statistical finance, an area of econophysics, is an empirical attempt to shift finance
from its normative roots to a positivist framework using exemplars from statistical
physics with an emphasis on emergent or collective properties of financial markets.
• Statistical mechanics is the application of probability theory, which includes
mathematical tools for dealing with large populations, to the field of mechanics, which
is concerned with the motion of particles or objects when subjected to a force.
• Statistical physics is one of the fundamental theories of physics, and uses methods of
probability theory in solving physical problems.

6. Importance of Statistics in Different Fields

Statistics plays a vital role in every field of human activity. It holds a central position
in almost every field, such as Industry, Commerce, Trade, Physics, Chemistry, Economics,
Mathematics, Biology, Botany, Psychology, and Astronomy, so the application of statistics is
very wide.


a. In Business:
Statistics helps a businessman to plan production according to the tastes of the customers. The
quality of the products can also be checked more efficiently by using statistical methods. So
all the activities of the businessman are based on statistical information. He can make correct
decisions about the location of the business, marketing of the products, financial resources, etc.

b. In Economics:
Economics largely depends upon statistics. National income accounts are multipurpose
indicators for the economists and administrators. Statistical methods are used for preparation
of these accounts. In economic research, statistical methods are used for collecting and analysing
the data and testing hypotheses. The relationship between supply and demand is studied by
statistical methods; imports and exports, the inflation rate, and per capita income are
problems which require a good knowledge of statistics.

c. In Mathematics
Statistics is a branch of applied mathematics. A large number of statistical methods like
probability, averages, dispersions, and estimation are used in mathematics, and different
techniques of pure mathematics like integration, differentiation and algebra are used in
statistics.

d. In Banking
The banks work on the principle that all the people who deposit their money with the banks do
not withdraw it at the same time. The bank earns profits out of these deposits by lending to
others on interest. The bankers use statistical approaches based on probability to estimate the
numbers of depositors and their claims for a certain day.

e. In State Management (Administration):

Statistical data are now widely used in taking all administrative decisions. Suppose the
government wants to revise the pay scales of employees in view of an increase in the living
cost, statistical methods will be used to determine the rise in the cost of living. Preparation of
federal and provincial government budgets mainly depends upon statistics because it helps in
estimating the expected expenditures and revenue from different sources. So statistics are the
eyes of administration of the state.


f. In Accounting and Auditing
The correction of the values of current assets is made on the basis of the purchasing power of
money or its current value. In auditing, sampling techniques are commonly used. An
auditor determines the sample size of the book to be audited on the basis of error.

g. In Natural and Social Sciences:

Statistical methods are commonly used for analyzing experimental results and testing their
significance in Biology, Physics, Chemistry, Mathematics, Meteorology, Research chambers
of commerce, Sociology, Business, Public Administration, Communication and Information
Technology etc…

h. In Astronomy:
Astronomy is one of the oldest branches of statistical study; it deals with the measurement of
distances, sizes, masses and densities of heavenly bodies by means of observations. During these
measurements errors are unavoidable, so the most probable measurements are found by using
statistical methods.


7. Limitations of Statistics

i. There are certain phenomena or concepts where statistics cannot be used. This is
because these phenomena or concepts are not amenable to measurement. For example,
beauty, intelligence, and courage cannot be quantified. Statistics has no place in all such
cases where quantification is not possible.
ii. Statistics reveal the average behaviour, the normal or the general trend. Applying
the 'average' concept to an individual or a particular situation may lead to
a wrong conclusion and sometimes may be disastrous. For example, one may be
misguided when told that the average depth of a river from one bank to the other is four
feet, when there may be some points in between where its depth is far more than four
feet.
iii. Since statistics are collected for a particular purpose, such data may not be relevant or
useful in other situations or cases.
iv. Statistics are not 100 per cent precise, as Mathematics or Accountancy is.


v. In statistical surveys, sampling is generally used as it is not physically possible to cover
all the units or elements comprising the universe. The results may not be appropriate as
far as the universe is concerned. Moreover, different surveys based on the same size of
sample but different sample units may yield different results.
vi. A major limitation of statistics is that it does not reveal everything pertaining to a certain
phenomenon. There is some background information that statistics does not cover.
Similarly, there are some other aspects related to the problem on hand, which are also
not covered.
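
The river example in limitation (ii) can be sketched in a few lines of Python. The depth readings below are invented for illustration; only the average of four feet comes from the text.

```python
from statistics import mean

# Hypothetical depth readings (in feet) taken at equal steps from one bank to the other.
depths_ft = [1, 2, 4, 9, 5, 2, 5]

average_depth = mean(depths_ft)   # the single summary figure that gets reported
deepest_point = max(depths_ft)    # the detail that the average hides

print(f"Average depth: {average_depth} ft")
print(f"Deepest point: {deepest_point} ft")
```

With these readings the average is four feet, yet one point is more than twice as deep, which is exactly why an average should not be applied to an individual case.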

Apart from the limitations of statistics mentioned above, there are misuses of it.
a. Sources of data not given: At times, the source of data is not given. In the absence
of the source, the reader does not know how far the data are reliable. Further, if he
wants to refer to the original source, he is unable to do so.
b. Defective data: Another misuse is that sometimes one gives defective data. This may
be done knowingly in order to defend one's position or to prove a particular point.
c. Unrepresentative sample: The sample may turn out to be unrepresentative of the
universe. One may choose a sample just on the basis of convenience. He may collect
the desired information from either his friends or nearby respondents in his
neighbourhood even though such respondents do not constitute a representative
sample.
d. Inadequate sample: One may conduct a survey based on an extremely inadequate
sample. For example, in a city we may find that there are 1,00,000 households. When
we have to conduct a household survey, we may take a sample of merely 100
households comprising only 0.1 per cent of the universe. A survey based on such a
small sample may not yield the right information.
e. Confusion of correlation and causation: In statistics, one often has to
examine the relationship between two variables. A close relationship between the two
variables may not establish a cause-and-effect relationship in the sense that one
variable is the cause and the other is the effect. Correlation should be taken as a
measure of the degree of association rather than as evidence of a causal relationship.

8. Self-Test Questions


1. Define Statistics. Explain its types, and importance to trade, commerce and business.
2. “Statistics is all-pervading”. Elucidate this statement.
3. Write a note on the scope and limitations of Statistics.
4. What are the major limitations of Statistics? Explain with suitable examples.
5. Distinguish between descriptive Statistics and inferential Statistics.


Topic 2 Topic : Data Collection, Population vs sample, Variables, Scales of Measurement

1. Data Collection
2. Population vs sample
3. Variables
4. Scales of Measurement
5. Self-Test Questions

1. Data Collection

Data is various kinds of information formatted in a particular way. In Statistics, data collection
is a process of gathering information from all the relevant sources to find a solution to the
research problem. It helps to evaluate the outcome of the problem. The data collection methods
allow a person to conclude an answer to the relevant question. Most of the organizations use
data collection methods to make assumptions about future probabilities and trends. Once the
data is collected, it is necessary to undergo the data organization process. Depending on the
type of data, the data collection method is divided into two categories namely –
• Primary Data Collection methods
• Secondary Data Collection methods

Primary Data Collection Methods

Primary data or raw data is a type of information that is obtained directly from the first-hand
source through experiments, surveys or observations. The primary data collection method is
further classified into two types. They are
• Quantitative Data Collection Methods
• Qualitative Data Collection Methods

i. Quantitative Data Collection Methods

It is based on mathematical calculations using various formats like close-ended
questions, correlation and regression methods, mean, median or mode measures. This method
is cheaper than qualitative data collection methods and it can be applied in a short duration of
time.

ii. Qualitative Data Collection Methods
It does not involve any mathematical calculations. This method is closely associated with
elements that are not quantifiable. This qualitative data collection method includes interviews,
questionnaires, observations, case studies, etc. There are several methods to collect this type of
data. They are -

Observation Method - Observation method is used when the study relates to behavioural
science. This method is planned systematically. It is subject to many controls and checks. The
different types of observations are:
• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation
Interview Method - The method of collecting data in terms of verbal responses. It is achieved
in two ways, such as
• Personal Interview – In this method, a person known as an interviewer is required to
ask questions face to face to the other person. The personal interview can be structured
or unstructured, direct investigation, focused conversation, etc.
• Telephonic Interview – In this method, an interviewer obtains information by
contacting people on the telephone to ask the questions or views, verbally.
Questionnaire Method - In this method, a set of questions is mailed to the respondents.
They should read, reply and subsequently return the questionnaire. The questions are printed
in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance such as colour, quality of the paper to attract
the attention of the respondent
Schedules - This method is similar to the questionnaire method with a slight difference:
enumerators are specially appointed for the purpose of filling in the schedules. The enumerator
explains the aims and objects of the investigation and may remove any misunderstandings
that have come up. Enumerators should be trained to perform their job with hard work and patience.

Secondary Data Collection Methods

Secondary data is data collected by someone other than the actual user. It means that the
information is already available, and someone analyses it. The secondary data includes
magazines, newspapers, books, journals, etc. It may be either published data or unpublished
data.
Published data are available in various resources including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies, etc.

2. Population vs. Sample

A population is the entire group that you want to draw conclusions about. A sample is the
specific group that you will collect data from. The size of the sample is always less than the
total size of the population. In research, a population doesn’t always refer to people. It can
mean a group containing elements of anything you want to study, such as objects, events,
organizations, countries, species, organisms, etc.
Collecting data from a population
Populations are used when your research question requires, or when you have access to, data
from every member of the population. Usually, it is only straightforward to collect data from
a whole population when it is small, accessible and cooperative.
Example: A high school administrator wants to analyze the final exam scores of all graduating
seniors to see if there is a trend. Since they are only interested in applying their findings to the
graduating seniors in this high school, they use the whole population dataset.


For larger and more dispersed populations, it is often difficult or impossible to collect data
from every individual. For example, every 10 years, the Indian government aims to count every
person living in the country using the Census. This data is used to distribute funding across the
nation or for other policy making decisions.

Collecting data from a sample

When your population is large in size, geographically dispersed, or difficult to contact, it’s
necessary to use a sample. With statistical analysis, you can use sample data to make estimates
or test hypotheses about population data.
Example: You want to study political attitudes in young people. Your population is the 300,000
undergraduate students in the undergraduate colleges in Delhi. Because it’s not practical to
collect data from all of them, you use a sample of 300 undergraduate volunteers from three
universities who meet your inclusion criteria. This is the group who will complete your online
survey.

Ideally, a sample should be randomly selected and representative of the population.

Using probability sampling methods (such as simple random sampling or stratified sampling)
reduces the risk of sampling bias and enhances both internal and external validity.
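
The two probability sampling methods named above can be sketched as follows. This is a minimal illustration using only Python's standard library; the sampling frame of 300 students in three colleges, the helper name `stratified_sample`, and the 10 per cent sampling fraction are all assumptions made for the example.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 300 students across three colleges (illustrative only).
colleges = ["A"] * 150 + ["B"] * 100 + ["C"] * 50
population = [(f"student_{i}", college) for i, college in enumerate(colleges)]

# Simple random sampling: every student has the same chance of selection.
simple_sample = random.sample(population, k=30)

def stratified_sample(units, fraction):
    """Draw the same fraction from each stratum (here, each college)."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[1], []).append(unit)
    chosen = []
    for members in strata.values():
        chosen.extend(random.sample(members, k=round(len(members) * fraction)))
    return chosen

strat_sample = stratified_sample(population, fraction=0.10)
strat_counts = Counter(college for _, college in strat_sample)
print(strat_counts)
```

Both samples contain 30 students, but only the stratified one guarantees that each college appears in proportion to its size; the simple random sample matches those proportions only on average.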

Reasons for sampling

• Necessity: Sometimes it’s simply not possible to study the whole population due to its
size or inaccessibility.
• Practicality: It’s easier and more efficient to collect data from a sample.
• Cost-effectiveness: There are fewer participant, laboratory, equipment, and
researcher costs involved.
• Manageability: Storing and running statistical analyses on smaller datasets is easier
and more reliable.
Population parameter vs. sample statistic
When you collect data from a population or a sample, there are various measurements and
numbers you can calculate from the data. A parameter is a measure that describes the whole
population. A statistic is a measure that describes the sample. You can use estimation
or hypothesis testing to estimate how likely it is that a sample statistic differs from the
population parameter.
Example: In your study of students’ political attitudes, you ask your survey participants to rate
themselves on a scale from 1, very liberal, to 7, very conservative. You find that most of your
sample identifies as liberal – the mean rating on the political attitudes scale is 3.2. You can use
this statistic, the sample mean of 3.2, to make an educated guess about the population parameter –
that is, to infer the mean political attitude rating of all undergraduate students in Delhi.
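
As a minimal sketch of how such a sample statistic is obtained, the Python snippet below computes a sample mean. The ten ratings are invented so that their mean equals the 3.2 quoted in the example; they are not real survey data.

```python
from statistics import mean

# Hypothetical self-ratings from ten survey participants
# (1 = very liberal, 7 = very conservative); illustrative values only.
sample_ratings = [2, 3, 4, 3, 2, 5, 3, 4, 3, 3]

# The sample mean is a statistic: it describes only the sample.
sample_mean = mean(sample_ratings)

# It also serves as a point estimate of the unknown population parameter.
print(f"Sample mean (estimate of the population mean): {sample_mean}")
```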

Sampling error
A sampling error is the difference between a population parameter and a sample statistic. In
your study, the sampling error is the difference between the mean political attitude rating of
your sample and the true mean political attitude rating of all undergraduate students in Delhi.

Sampling errors happen even when you use a randomly selected sample. This is because
random samples are not identical to the population in terms of numerical measures
like means and standard deviations.

Because the aim of scientific research is to generalize findings from the sample to the
population, you want the sampling error to be low. You can reduce sampling error by increasing
the sample size.
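
The effect of sample size on sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions: the population of 10,000 ratings is generated artificially, and the helper `average_sampling_error` and the choice of 200 repeated samples are hypothetical choices for the illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 ratings on the 1-7 scale (illustrative only).
population = [random.randint(1, 7) for _ in range(10_000)]
population_mean = mean(population)  # the parameter, known here only because we simulated it

def average_sampling_error(sample_size, repeats=200):
    """Mean absolute gap between the sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, k=sample_size)
        gaps.append(abs(mean(sample) - population_mean))
    return mean(gaps)

error_small = average_sampling_error(10)
error_large = average_sampling_error(1000)
print(f"Average error with n=10:   {error_small:.3f}")
print(f"Average error with n=1000: {error_large:.3f}")
```

Every random sample still misses the population mean by some amount, but the typical miss shrinks noticeably as the sample grows.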

3. Variables
In statistical research, a variable is defined as an attribute of an object of study. Data is a
specific measurement of a variable – it is the value you record in your data sheet. Data is
generally divided into two categories:
• Quantitative data represents amounts
• Categorical data represents groupings
A variable that contains quantitative data is a quantitative variable; a variable that contains
categorical data is a categorical variable. Each of these types of variables can be broken down
into further types.
Quantitative variables - When you collect quantitative data, the numbers you record represent
real amounts that can be added, subtracted, divided, etc. There are two types of quantitative
variables: discrete and continuous.


Categorical variables - Categorical variables represent groupings of some kind. They are
sometimes recorded as numbers, but the numbers represent categories rather than actual
amounts of things.

Type of variable            What does the data represent?       Examples

Discrete variables          Counts of individual items or       • Number of students in a class
(aka integer variables)     values.                             • Number of different tree species in a forest

Continuous variables        Measurements of continuous or       • Distance
(aka ratio variables)       non-finite values.                  • Volume
                                                                • Age

There are three types of categorical variables: binary, nominal, and ordinal variables.
Type of variable            What does the data represent?       Examples

Binary variables            Yes or no outcomes.                 • Heads/tails in a coin flip
(aka dichotomous variables)                                     • Win/lose in a football game

Nominal variables           Groups with no rank or              • Species names
                            order between them.                 • Colors
                                                                • Brands

Ordinal variables           Groups that are ranked              • Finishing place in a race
                            in a specific order.                • Rating scale responses in a survey, such as Likert scales


4. Levels and Scales of Measurement

Levels of measurement, also called scales of measurement, tell you how
precisely variables are recorded. In scientific research, a variable is anything that can take on
different values across your data set (e.g., height or test scores). There are 4 levels of
measurement:
• Nominal: the data can only be categorized
• Ordinal: the data can be categorized and ranked
• Interval: the data can be categorized, ranked, and evenly spaced
• Ratio: the data can be categorized, ranked, evenly spaced, and has a natural zero.

Nominal Level - The first level of measurement is nominal level of measurement. In this level
of measurement, the numbers in the variable are used only to classify the data. In this level of
measurement, words, letters, and alpha-numeric symbols can be used. Suppose there are data
about people belonging to three different gender categories. In this case, the person belonging
to the female gender could be classified as F, the person belonging to the male gender could be
classified as M, and a transgender person could be classified as T. This way of assigning
classifications is the nominal level of measurement.

Ordinal Level - The second level of measurement is the ordinal level of measurement. This level
of measurement depicts some ordered relationship among the variable’s observations. Suppose a
student scores the highest grade of 100 in the class. In this case, he would be assigned the first
rank. Then, another classmate scores the second highest grade of 92; she would be assigned the
second rank. A third student scores 81 and would be assigned the third rank, and so on. The
ordinal level of measurement indicates an ordering of the measurements, but not the size of the
gaps between them.

Interval Level - The third level of measurement is the interval level of measurement. The
interval level of measurement not only classifies and orders the measurements, but it also
specifies that the distances between adjacent points on the scale are equal along the whole
scale, from low to high. For example, if anxiety is measured on an interval scale, the difference
between scores of 10 and 11 is the same as the difference between scores of 40 and 41. A popular
example of this level of measurement is temperature in degrees centigrade (Celsius), where, for
example, the distance between 94 °C and 96 °C is the same as the distance between 100 °C and
102 °C.

Ratio Level - The fourth level of measurement is the ratio level of measurement. In this level
of measurement, the observations, in addition to having equal intervals, have a true (absolute)
zero point. This true zero distinguishes the ratio level from the other levels, although its other
properties are the same as those of the interval level: the divisions between the points on the
scale are equally spaced. Because zero means the absence of the quantity, ratios of values are
meaningful (for example, a weight of 20 kg is twice a weight of 10 kg).

Going from lowest to highest, the 4 levels of measurement are cumulative. This means that
they each take on the properties of lower levels and add new properties.

Nominal level
You can categorize your data by labelling them in mutually exclusive groups, but there is no
order between the categories.
Examples of nominal scales:
• City of birth
• Gender
• Ethnicity
• Car brands
• Marital status

Ordinal level
You can categorize and rank your data in an order, but you cannot say anything about the
intervals between the rankings. Although you can rank the top 5 Olympic medallists, this scale
does not tell you how close or far apart they are in number of wins.
Examples of ordinal scales:
• Top 5 Olympic medallists
• Language ability (e.g., beginner, intermediate, fluent)
• Likert-type questions (e.g., very dissatisfied to very satisfied)

Interval level
You can categorize, rank, and infer equal intervals between neighboring data points, but
there is no true zero point. The difference between any two adjacent temperatures is the same:
one degree. But zero degrees is defined differently depending on the scale; it doesn’t mean an
absolute absence of temperature. The same is true for test scores and personality inventories.
A zero on a test is arbitrary; it does not mean that the test-taker has an absolute lack of the
trait being measured.
Examples of interval scales:
• Test scores (e.g., IQ or exams)
• Personality inventories
• Temperature in Fahrenheit or Celsius

Ratio level
You can categorize, rank, and infer equal intervals between neighboring data points, and
there is a true zero point. A true zero means there is an absence of the variable of interest. In
ratio scales, zero does mean an absolute lack of the variable. For example, in the Kelvin
temperature scale, there are no negative degrees of temperature; zero means an absolute lack
of thermal energy.
Examples of ratio scales:
• Height
• Age
• Weight
• Temperature in Kelvin

Self-Test Questions
1. Why are samples used in Research ? When are populations used in research ?
2. What is the difference between a statistic and parameter ?
3. What is sampling error ?
4. What are the four levels of measurement?
5. Why do levels of measurement matter ?


Topic 3 & 4 Topic : Data representation techniques

1. Data Representations
2. Frequency Distribution
3. Self-Test Questions

1. Data Representation

We collect both qualitative data and quantitative data. However, long lists of data points can
be difficult to interpret. Data representations are graphics that display and summarize data and
help us to understand the data's meaning. Data representations can help us answer the following
questions:
• How much of the data falls within a specified category or range of values?
• What is a typical value of the data?
• How much spread is in the data?
• Is there a trend in the data over time?
• Is there a relationship between two variables?

Qualitative Data Representation

A variety of data representations can be used to communicate qualitative (also called

categorical) data. Example: Delhi University surveys 600 incoming students about which
world language they want to study. Here, the qualitative variable is the language that the
students want to study and the categories are the particular languages chosen (Spanish, French,
Mandarin and Latin).

A table summarizes the data using rows and columns. Each column contains data for a single
variable, and a basic table contains one column for the qualitative variable and one for the
quantitative variable. Each row contains a category of the qualitative variable and the
corresponding value of the quantitative variable. For example:

Language     No. of students
Spanish      200
French       80
Mandarin     280
Latin        40

A vertical bar chart lists the categories of the qualitative variable along a horizontal axis and
uses the heights of the bars on the vertical axis to show the values of the quantitative variable.
A horizontal bar chart lists the categories along the vertical axis and uses the lengths of the
bars on the horizontal axis to show the values of the quantitative variable. This display draws
attention to how the categories rank according to the amount of data within each.

A pictograph is like a horizontal bar chart but uses pictures instead of the lengths of bars to
represent the values of the quantitative variable. Each picture represents a certain quantity, and
each category can have multiple pictures. Pictographs are visually interesting, but require us to
use the legend to convert the number of pictures to quantitative values.

• A circle graph (or pie chart) is a circle that is divided into as many sections as there
are categories of the qualitative variable. The area of each section represents, for each
category, the value of the quantitative data as a fraction of the sum of values. The
fractions sum to 1. Sometimes the section labels include both the category and the
associated value or percent value for that category.
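
As a rough illustration of how the sections of a circle graph are sized, the sketch below turns the language survey counts from the table above into fractions and slice angles. The variable names are only illustrative.

```python
# Counts from the language survey table in the text.
language_counts = {"Spanish": 200, "French": 80, "Mandarin": 280, "Latin": 40}

total = sum(language_counts.values())  # 600 students in all

# Fraction of the circle for each category, and the matching slice angle in degrees.
fractions = {lang: count / total for lang, count in language_counts.items()}
angles = {lang: 360 * share for lang, share in fractions.items()}

for lang in language_counts:
    print(f"{lang:<9} fraction {fractions[lang]:.3f}  angle {angles[lang]:6.1f} degrees")
```

The fractions add up to 1 and the angles to 360 degrees, which is a quick check that no category was left out.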

Quantitative Data Representation

Data displays for quantitative data are typically oriented along two numerical axes and relate
two quantitative variables. Displays of quantitative data help us understand the shape and
spread of the data.
Example: Ms. Bakshi asks her homeroom students how long it typically takes them to get to
school (in minutes) and records their responses in the following list: 5, 25, 10, 20, 10, 15, 35 ,
10, 5, 20. Here, one quantitative variable is students' typical travel time to school.
A variety of data representations can be used to communicate quantitative data.

• Dotplots use one dot for each data point. The dots are plotted above their corresponding
values on a number line. The number of dots above each specific value represents the count
of that value. Dotplots show the value of each data point and are practical for small data
sets. Each dot represents the typical travel time to school for one student.

• Histograms divide the horizontal axis into equal-sized intervals and use the heights of the
bars to show the count or percent of data within each interval. By convention, each interval
includes the lower boundary but not the upper one. Histograms show only totals for the
intervals, not specific data points. The height of each bar represents the number of students
having a typical travel time within the corresponding interval.
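
The counting behind a histogram can be sketched as follows for Ms. Bakshi's travel times. The choice of 10-minute intervals starting at 0 is an assumption made for illustration; the text does not fix the interval width.

```python
travel_times = [5, 25, 10, 20, 10, 15, 35, 10, 5, 20]  # minutes, from the text

def histogram_counts(data, start, width, n_bins):
    """Count the values in each interval [lower, upper): lower included, upper excluded."""
    counts = []
    for i in range(n_bins):
        lower = start + i * width
        upper = lower + width
        counts.append(((lower, upper), sum(1 for x in data if lower <= x < upper)))
    return counts

# Assumed binning: four 10-minute intervals covering 0 to 40 minutes.
bins = histogram_counts(travel_times, start=0, width=10, n_bins=4)
for (lower, upper), count in bins:
    print(f"{lower:>2}-{upper:<2} {'#' * count} ({count})")
```

A travel time of exactly 10 minutes is counted in the 10-20 interval, not in 0-10, which follows the lower-boundary convention described above.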

2. Frequency Distribution

The frequency of a value is the number of times it occurs in a dataset. A frequency
distribution is the pattern of frequencies of a variable: the number of times each possible
value of the variable occurs in a dataset. Frequency distributions are depicted using graphs
and frequency tables.
Example: In the 2022 Winter Olympics, Team USA won 25 medals. This frequency table
gives the medals’ values (gold, silver, and bronze) and frequencies:

Medal     Frequency
Gold      8
Silver    10
Bronze    7


Types of frequency distributions
There are four types of frequency distributions:
o Ungrouped frequency distributions: The number of observations of each value of a
variable. You can use this type of frequency distribution for categorical variables.
o Grouped frequency distributions: The number of observations of each class
interval of a variable. Class intervals are ordered groupings of a variable’s values. You
can use this type of frequency distribution for quantitative variables.
o Relative frequency distributions: The proportion of observations of each value or class
interval of a variable. You can use this type of frequency distribution for any type of
variable when you’re more interested in comparing frequencies than the actual number
of observations.
o Cumulative frequency distributions: The sum of the frequencies less than or equal to
each value or class interval of a variable. You can use this type of frequency distribution
for ordinal or quantitative variables when you want to understand how often
observations fall below certain values.

How to make a frequency table

Frequency distributions are often displayed using frequency tables. A frequency table is an
effective way to summarize or organize a dataset. It’s usually composed of two columns:
• The values or class intervals
• Their frequencies
The method for making a frequency table differs between the four types of frequency
distributions.

How to make an ungrouped frequency table

1. Create a table with two columns and as many rows as there are values of the
variable. Label the first column using the variable name and label the second column
“Frequency.” Enter the values in the first column.
o For ordinal variables, the values should be ordered from smallest to largest in
the table rows.
o For nominal variables, the values can be in any order in the table. You may
wish to order them alphabetically or in some other logical order.


2. Count the frequencies. The frequencies are the number of times each value occurs. Enter
the frequencies in the second column of the table beside their corresponding values.
o Especially if your dataset is large, it may help to count the frequencies
by tallying. Add a third column called “Tally.” As you read the observations,
make a tick mark in the appropriate row of the tally column for each
observation. Count the tally marks to determine the frequency.
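
The tallying steps above can be sketched with Python's collections.Counter. The list of observations is rebuilt from the medal frequencies given earlier in the text purely for illustration; in practice you would start from the raw observations.

```python
from collections import Counter

# Raw observations, reconstructed here from the medal table in the text.
observations = ["Gold"] * 8 + ["Silver"] * 10 + ["Bronze"] * 7

# Counter does the tallying: one count per distinct value.
frequency = Counter(observations)

# Print the two-column frequency table in a chosen logical order.
order = ["Gold", "Silver", "Bronze"]
print(f"{'Medal':<8}Frequency")
for medal in order:
    print(f"{medal:<8}{frequency[medal]}")
```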

How to make a grouped frequency table

1. Divide the variable into class intervals. Below is one method to divide a variable into
class intervals. Different methods will give different answers, and there’s no agreement on
the best method to calculate class intervals.
o Calculate the range. Subtract the lowest value in the dataset from the highest.
o Decide the class interval width. There are no firm rules on how to choose the
width, but a common rule of thumb is:

width = range / √(sample size)

You can round this value to a whole number or a number that’s convenient to add (such as
a multiple of 10).
o Calculate the class intervals. Each interval is defined by a lower limit and upper limit.
Observations in a class interval are greater than or equal to the lower limit and less than
the upper limit:

lower limit ≤ x < upper limit

The lower limit of the first interval is the lowest value in the dataset. Add the class interval
width to find the upper limit of the first interval, which is also the lower limit of the second
interval. Keep adding the interval width to calculate more class intervals until you exceed the
highest value.

2. Create a table with two columns and as many rows as there are class intervals. Label the
first column using the variable name and label the second column “Frequency.” Enter the
class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in each class
interval. You can count by tallying if you find it helpful. Enter the frequencies in the second
column of the table beside their corresponding class intervals.
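
A minimal sketch of these steps, applied to the travel-time data from the earlier example. It assumes the square-root rule of thumb (width = range / √n) for a suggested width and then rounds it to a convenient 10; both the function names and the rounding choice are illustrative.

```python
import math

travel_times = [5, 25, 10, 20, 10, 15, 35, 10, 5, 20]  # minutes, from the text

def suggested_width(data):
    """Rule-of-thumb width: range divided by the square root of the sample size."""
    return (max(data) - min(data)) / math.sqrt(len(data))

def grouped_frequency_table(data, width):
    """Build (lower, upper, frequency) rows; each interval is lower <= x < upper."""
    lower, highest = min(data), max(data)
    table = []
    while lower <= highest:  # keep adding intervals until the highest value is covered
        upper = lower + width
        table.append((lower, upper, sum(1 for x in data if lower <= x < upper)))
        lower = upper
    return table

print(f"Suggested width: {suggested_width(travel_times):.2f}")  # rounded to 10 below
table = grouped_frequency_table(travel_times, width=10)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

Because the upper limit is excluded, the highest value (35) needs one more interval, 35-45, which is why the loop continues until the lower limit passes the maximum.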


How to make a relative frequency table

1. Create an ungrouped or grouped frequency table.

2. Add a third column to the table for the relative frequencies. To calculate the relative
frequencies, divide each frequency by the sample size. The sample size is the sum of
the frequencies.
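
For instance, the relative frequencies for the medal table given earlier could be computed as in this short sketch:

```python
# Frequencies from the medal table in the text.
medal_frequency = {"Gold": 8, "Silver": 10, "Bronze": 7}

# The sample size is the sum of the frequencies.
sample_size = sum(medal_frequency.values())

# Relative frequency = frequency / sample size.
relative_frequency = {m: f / sample_size for m, f in medal_frequency.items()}

for medal, share in relative_frequency.items():
    print(f"{medal:<8}{medal_frequency[medal]:>3}  {share:.2f}")
```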

How to make a cumulative frequency table

1. Create an ungrouped or grouped frequency table for an ordinal or quantitative
variable. Cumulative frequencies don’t make sense for nominal variables because the
values have no order: one value isn’t more than or less than another value.
2. Add a third column to the table for the cumulative frequencies. The cumulative
frequency is the number of observations less than or equal to a certain value or class
interval. To calculate the cumulative frequencies, add each frequency to the frequencies in
the previous rows.
3. Optional: If you want to calculate the cumulative relative frequency, add another
column and divide each cumulative frequency by the sample size.
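
A small sketch of the running total, using the medal frequencies from the earlier table. Treating the medals as an ordinal variable ordered bronze < silver < gold is an assumption made here for illustration.

```python
from itertools import accumulate

# Values ordered from lowest to highest (assumed ordering of the medal categories).
values = ["Bronze", "Silver", "Gold"]
frequencies = [7, 10, 8]

# Each cumulative frequency adds the current frequency to those in the previous rows.
cumulative = list(accumulate(frequencies))

# Optional step: cumulative relative frequency = cumulative frequency / sample size.
sample_size = cumulative[-1]
cumulative_relative = [cf / sample_size for cf in cumulative]

for value, f, cf, crf in zip(values, frequencies, cumulative, cumulative_relative):
    print(f"{value:<8}{f:>3}{cf:>4}  {crf:.2f}")
```

The last cumulative frequency always equals the sample size, so the last cumulative relative frequency is 1.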

How to graph a frequency distribution

Pie charts, bar charts, and histograms are all ways of graphing frequency distributions. The best
choice depends on the type of variable and what you’re trying to communicate.

Pie chart
A pie chart is a graph that shows the relative frequency distribution of a nominal variable. A
pie chart is a circle that’s divided into one slice for each value. The size of the slices shows
their relative frequency. This type of graph can be a good choice when you want to emphasize
that one variable is especially frequent or infrequent, or you want to present the overall
composition of a variable.
A disadvantage of pie charts is that it’s difficult to see small differences between frequencies.
As a result, it’s also not a good option if you want to compare the frequencies of different
values.

Bar chart
A bar chart is a graph that shows the frequency or relative frequency distribution of
a categorical variable (nominal or ordinal). The y-axis of the bars shows the frequencies or
relative frequencies, and the x-axis shows the values. Each value is represented by a bar, and
the length or height of the bar shows the frequency of the value. A bar chart is a good choice
when you want to compare the frequencies of different values. It’s much easier to compare the
heights of bars than the angles of pie chart slices.

Histogram
A histogram is a graph that shows the frequency or relative frequency distribution of
a quantitative variable. It looks similar to a bar chart.

The continuous variable is grouped into interval classes, just like a grouped frequency table.
The y-axis of the bars shows the frequencies or relative frequencies, and the x-axis shows the
interval classes. Each interval class is represented by a bar, and the height of the bar shows the
frequency or relative frequency of the interval class.

Although bar charts and histograms are similar, there are important differences:

                    Bar Chart                      Histogram
Type of variable    Categorical                    Quantitative
Value grouping      Ungrouped values               Grouped (interval classes)
Bar spacing         Can be a space between bars    Never a space between bars
Bar order           Can be in any order            Can only be ordered from lowest to highest
A histogram is an effective visual summary of several important characteristics of a variable.
At a glance, you can see a variable’s central tendency and variability, as well as
what probability distribution it appears to follow, such as a normal, Poisson, or uniform
distribution.
An Ogive is a graph of a cumulative frequency distribution. It shows the data values on the
horizontal axis and the cumulative frequencies, the cumulative relative frequencies, or the
cumulative per cent frequencies on the vertical axis.
Cumulative frequency is defined as the sum of all the previous frequencies up to the current
point. An Ogive helps to find how many observations, or what share of the data, fall within a
given range of values.
Create the Ogive by plotting the point corresponding to the cumulative frequency of each class
interval. Statisticians commonly use the Ogive to present data pictorially. It helps in estimating
the number of observations which are less than or equal to a particular value.


Ogive Graph
The graphs of the frequency distribution are frequency graphs that are used to exhibit the
characteristics of discrete and continuous data. Such figures are more appealing to the eye than
the tabulated data. It helps us to facilitate the comparative study of two or more frequency
distributions. We can relate the shape and pattern of the two frequency distributions.
The two methods of Ogives are:
• Less than Ogive
• Greater than or more than Ogive

The graph given above represents less than and the greater than Ogive curve. The rising curve
(Rose Curve) represents the less than Ogive, and the falling curve (Purple Curve) represents
the greater than Ogive.

Less than Ogive

The frequencies of all preceding classes are added to the frequency of a class. This series is
called the less than cumulative series. It is constructed by adding the first-class frequency to
the second-class frequency and then to the third class frequency and so on. The downward
cumulation results in the less than cumulative series.

Greater than or More than Ogive

The frequencies of the succeeding classes are added to the frequency of a class. This series is
called the more than or greater than cumulative series. It can be constructed by starting from
the total frequency and subtracting the first class frequency from it, then the second class
frequency from that result, and so on. The upward cumulation results in the greater than or
more than cumulative series.

Uses of Ogive Curve

Ogive graphs, or cumulative frequency graphs, are used to find the median of a given set
of data. If both the less than and the greater than cumulative frequency curves are drawn on the
same graph, we can easily find the median value. The x-coordinate of the point at which the two
curves intersect gives the median value. Apart from finding the median, Ogives
are used in computing the percentiles of the data set values.

Question 1:
Construct the more than cumulative frequency table and draw the Ogive for the data given below.

Marks        1-10   11-20   21-30   31-40   41-50   51-60   61-70   71-80
Frequency    3      8       12      14      10      6       5       2

“More than” Cumulative Frequency Table:

Marks Frequency More than Cumulative Frequency

More than 1 3 60
More than 11 8 57
More than 21 12 49
More than 31 14 37
More than 41 10 23
More than 51 6 13
More than 61 5 7
More than 71 2 2


“Less than Ogive” can be prepared accordingly
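
As a cross-check on the table above, both cumulative series can be computed with a running sum. This is only a sketch; the list names are illustrative.

```python
from itertools import accumulate

classes = ["1-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]
frequencies = [3, 8, 12, 14, 10, 6, 5, 2]  # from Question 1

# Less than series: running total from the first class downward.
less_than = list(accumulate(frequencies))

# More than series: running total from the last class upward, then put back in class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for cls, f, lt, mt in zip(classes, frequencies, less_than, more_than):
    print(f"{cls:<6}{f:>3}{lt:>4}{mt:>4}")
```

Plotting the less_than values against the upper class limits and the more_than values against the lower class limits gives the two Ogives.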

Self-Test Questions
1. Why are samples used in Research ? When are populations used in research ?
2. What is the difference between a statistic and parameter ?
3. What is sampling error ?
4. What are the four levels of measurement?
5. Why do levels of measurement matter ?
6. What type of data can be described by a frequency distribution ?
7. What is the difference between relative frequency and probability ?


Topic 5,6 & 7 Topic : Measures of Central Tendency – Mean, Median and Mode

1. Introduction
2. Arithmetic Mean
3. Median
4. Mode
5. Relationships of the Mean, Median and Mode
6. The Best Measure of Central Tendency
7. Self-Test Questions


1. Introduction

The description of statistical data may be quite elaborate or quite brief depending on two
factors: the nature of data and the purpose for which the same data have been collected. While
describing data statistically or verbally, one must ensure that the description is neither too brief
nor too lengthy. The measures of central tendency enable us to compare two or more
distributions pertaining to the same time period or within the same distribution over time.

The term central tendency refers to the "middle" value or perhaps a typical value of the data,
and is measured using the mean, median, or mode. Each of these measures is calculated
differently, and the one that is best to use depends upon the situation. In the study of a
population with respect to one characteristic in which we are interested, we may get a large
number of observations. It is not possible to grasp any idea about the characteristic when we
look at all the observations. So it is better to get one number for one group. That number must
be a good representative of all the observations, to give a clear picture of that characteristic.
Such a representative number can be a central value for all these observations. This central
value is called a measure of central tendency, an average, or a measure of location. There are
five averages. Among them, the mean, median and mode are called simple averages, and the
other two averages, the geometric mean and the harmonic mean, are called special averages.


2. Arithmetic Mean

The arithmetic mean is the most common measure of central tendency. It is simply the sum of
the numbers divided by the number of numbers. The symbol "μ" is used for the mean of a
population. The symbol "M" is used for the mean of a sample. The formula for μ is:

μ = ΣX/N

where ΣX is the sum of all the numbers in the population and
N is the number of numbers in the population.

The formula for M is essentially identical:

M = ΣX/N

where ΣX is the sum of all the numbers in the sample and
N is the number of numbers in the sample.

Calculate the mean for pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8
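
The calculation asked for here follows M = ΣX/N directly. A minimal sketch:

```python
# pH levels of soil from the exercise in the text.
ph_levels = [6.8, 6.6, 5.2, 5.6, 5.8]

sum_x = sum(ph_levels)      # ΣX, the sum of the observations (30.0 up to rounding)
n = len(ph_levels)          # N, the number of observations (5)
mean_ph = sum_x / n         # M = ΣX / N

print(f"Sum = {sum_x:.1f}, N = {n}, Mean = {mean_ph:.1f}")
```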

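Grouped Data - The mean for grouped data is obtained from the following formula:

Mean = Σfx / n

where x = the mid-point of the individual class
f = the frequency of the individual class
n = the sum of the frequencies or total frequencies in a sample.

Short-cut method

Mean = A + (Σfd / n) × c

where A = an assumed (arbitrary) mean
d = (x - A)/c, the deviation of each value or mid-point from A in units of the class width
c = the class width (c = 1 when the values are not grouped into classes).

Example: Given the following frequency distribution, calculate the arithmetic mean.

Marks                 64   63   62   61   60   59
Number of Students    8    18   12   9    7    6

Taking the assumed mean A = 62:

x        f     fx      d = x - A    fd
64       8     512     2            16
63       18    1134    1            18
62       12    744     0            0
61       9     549     -1           -9
60       7     420     -2           -14
59       6     354     -3           -18
Total    60    3713                 -7

Direct method: Mean = Σfx / n = 3713/60 = 61.88
Short-cut method: Mean = A + Σfd / n = 62 + (-7/60) = 61.88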

Example: For the frequency distribution of seed yield per plot given in the table, calculate the
mean yield per plot.

Taking the assumed mean A = 94.5 and the class width c = 20:

Yield per plot (in g)    No. of plots (f)    Mid-point (x)    d = (x - A)/c    fd
64.5-84.5                3                   74.5             -1               -3
84.5-104.5               5                   94.5             0                0
104.5-124.5              7                   114.5            1                7
124.5-144.5              20                  134.5            2                40
Total                    35                                                    44

The mean yield per plot is:
Direct method: Mean = Σfx / n = 4187.5/35 = 119.64 g
Shortcut method: Mean = A + (Σfd / n) × c = 94.5 + (44/35) × 20 = 119.64 g
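
The two methods can be compared with a short sketch on the seed-yield data. It assumes the usual forms Mean = Σfx/n for the direct method and Mean = A + (Σfd/n) × c with d = (x - A)/c for the shortcut method; the function names are illustrative.

```python
def grouped_mean_direct(midpoints, frequencies):
    """Direct method: sum of f*x divided by the total frequency n."""
    n = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / n

def grouped_mean_shortcut(midpoints, frequencies, assumed_mean, class_width):
    """Shortcut method: A + (sum of f*d / n) * c, where d = (x - A) / c."""
    n = sum(frequencies)
    sum_fd = sum(f * (x - assumed_mean) / class_width
                 for x, f in zip(midpoints, frequencies))
    return assumed_mean + (sum_fd / n) * class_width

# Mid-points and frequencies from the seed-yield table in the text.
midpoints = [74.5, 94.5, 114.5, 134.5]
plots = [3, 5, 7, 20]

direct = grouped_mean_direct(midpoints, plots)
shortcut = grouped_mean_shortcut(midpoints, plots, assumed_mean=94.5, class_width=20)
print(f"Direct: {direct:.2f} g, Shortcut: {shortcut:.2f} g")
```

Both methods give the same mean whatever value of A is chosen; the shortcut only makes the hand arithmetic smaller.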

Merits and demerits of Arithmetic mean

Merits
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate even if some of the details of the data are lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.

Demerits
1. It cannot be obtained by inspection nor located through a frequency graph.
2. It cannot be used in the study of qualitative phenomena not capable of numerical
measurement, e.g. intelligence, beauty, honesty etc.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions, if the details of the data from which it is
computed are not given.

Example: The mean of the following frequency distribution was found to be 1.46.

No. of Accidents    No. of Days (frequency)
0                   46
1                   ?
2                   ?
3                   25
4                   10
5                   5
Total               200 days

Calculate the missing frequencies.
Solution: Here we are given the total number of frequencies and the arithmetic mean. We have
to determine the two frequencies that are missing. Let us assume that the frequency against 1
accident is x and against 2 accidents is y. If we can establish two simultaneous equations, then
we can easily find the values of x and y.

Mean = [(0 × 46) + (1 × x) + (2 × y) + (3 × 25) + (4 × 10) + (5 × 5)] / 200

1.46 = ( x + 2y + 140 ) / 200
x + 2y + 140 = (200) (1.46)
x + 2y = 152 (i)
x + y=200- {46+25 + 10 +5}
x + y = 200 – 86
x + y = 114 (ii)

Now subtracting equation (ii) from equation (i), we get y=38

Substituting the value of y = 38 in equation (ii) above, x + 38 = 114
Therefore, x = 114 - 38 = 76
Hence, the missing frequencies are:
Against accident 1 : 76
Against accident 2 : 38
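
The same solution can be checked numerically. The sketch below sets up the two equations from the known rows of the table and solves them by subtraction, as in the worked solution; the variable names are illustrative.

```python
# Known rows of the table: number of accidents -> number of days.
known = {0: 46, 3: 25, 4: 10, 5: 5}
total_days = 200
mean_accidents = 1.46

known_days = sum(known.values())                         # 86
known_weighted = sum(k * f for k, f in known.items())    # 140

# Equation (ii): x + y = total days minus the known days.
remaining_days = total_days - known_days                 # 114
# Equation (i): x + 2y = mean * total days minus the known weighted sum.
remaining_weighted = mean_accidents * total_days - known_weighted  # 152

# Subtract (ii) from (i) to get y, then substitute back for x.
y = round(remaining_weighted - remaining_days)
x = round(remaining_days - y)
print(f"x = {x}, y = {y}")
```

The round calls only remove floating-point noise from multiplying 1.46 by 200; the frequencies themselves are whole numbers.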



3. Median

The median is the middlemost item that divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than that item.

Ungrouped or Raw data

Arrange the given values in ascending order. If the number of values is odd, the median is the
middle value. If the number of values is even, the median is the mean of the middle two values.
By formula:
When n is odd, the median is the ((n+1)/2)th value
When n is even, the median is the average of the (n/2)th and (n/2 + 1)th values

Example - If the weights of sorghum ear heads are 45, 60, 48, 100, 65 gms, calculate the
median.
Solution: Here n = 5
First arrange the values in ascending order: 45, 48, 60, 65, 100
Median = ((n+1)/2)th value = ((5+1)/2)th value = 3rd value = 60 g

Example - If the weights of sorghum ear-heads are 5, 48, 60, 65, 65, 100 gms, calculate the median.
Solution: Here n = 6
Median = average of (6/2)th and (6/2 + 1)th value = average of 3rd and 4th value
Median = (60+65)/2 = 62.5 g
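
The rule for raw data can be written as a small function. This is a sketch of the odd/even rule described above, checked against the two sorghum examples.

```python
def median(values):
    """Median of raw data: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n+1)/2)th value (list index mid)
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n/2)th and (n/2 + 1)th values

print(median([45, 60, 48, 100, 65]))       # odd number of values
print(median([5, 48, 60, 65, 65, 100]))    # even number of values
```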

Grouped data
In a grouped distribution, values are associated with frequencies. Grouping can be in the form
of a discrete frequency distribution or a continuous frequency distribution. Whatever may be
the type of distribution, cumulative frequencies have to be calculated to know the total number
of items.


Cumulative frequency (cf)
Cumulative frequency of each class is the sum of the frequency of the class and the frequencies
of the previous classes, i.e. adding the frequencies successively, so that the last cumulative
frequency gives the total number of items.

Discrete Series
Step 1: Find cumulative frequencies.
Step 2: Find (n+1)/2.
Step 3: In the cumulative frequencies, find the value equal to or just greater than (n+1)/2.
Step 4: The corresponding value of x is the median.

Example: The following data pertain to the number of insects per plant. Find the median
number of insects per plant.

Number of insects per plant (x)    1   2   3   4   5    6    7   8   9   10   11   12
No. of plants (f)                  1   3   5   6   10   13   9   5   3   2    2    1

Solution: Form the cumulative frequency table.
Solution Form the cumulative frequency table
x f cf
1 1 1
2 3 4
3 5 9
4 6 15
5 10 25
6 13 38
7 9 47
8 5 52
9 3 55
10 2 57
11 2 59
12 1 60

Median = size of the ((n+1)/2)th item = size of the 30.5th item

Here the number of observations is even (n = 60). Therefore median = average of the (n/2)th
item and the (n/2 + 1)th item
= (30th item + 31st item) / 2 = (6+6)/2 = 6
Hence the median is 6 insects per plant.
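
A minimal sketch of the discrete-series procedure, using the insect data above. The helper item_value is a hypothetical name; it looks up which x value holds a given item position by scanning the cumulative frequencies.

```python
from itertools import accumulate

def item_value(position, values, cumulative):
    """Return the x value whose cumulative frequency first reaches the given item position."""
    for x, cf in zip(values, cumulative):
        if cf >= position:
            return x

def discrete_median(values, frequencies):
    """Median of a discrete frequency distribution (values sorted in ascending order)."""
    cumulative = list(accumulate(frequencies))
    n = cumulative[-1]                      # total number of items
    if n % 2 == 1:
        return item_value((n + 1) // 2, values, cumulative)
    # Even n: average the (n/2)th and (n/2 + 1)th items.
    lower = item_value(n // 2, values, cumulative)
    upper = item_value(n // 2 + 1, values, cumulative)
    return (lower + upper) / 2

insects = list(range(1, 13))                          # x = 1 to 12
plants = [1, 3, 5, 6, 10, 13, 9, 5, 3, 2, 2, 1]       # f, from the text
print(discrete_median(insects, plants))
```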


Continuous Series
The steps given below are followed for the calculation of median in continuous series.

Step 1: Find cumulative frequencies.
Step 2: Find n/2.
Step 3: In the cumulative frequencies, find the first value greater than n/2. The corresponding
class interval is called the median class. Then apply the formula

Median = l + [(n/2 - m) / f] × c

where l = lower limit of the median class
m = cumulative frequency preceding the median class
c = width of the class
f = frequency in the median class
n = total frequency.

Example: For the frequency distribution of weights of sorghum ear-heads given in the table
below, calculate the median.

Weights of ear heads (in g)    No. of ear heads (f)    Less than class    Cumulative frequency (m)
60-80                          22                      <80                22
80-100                         38                      <100               60
100-120                        45                      <120               105
120-140                        35                      <140               140
140-160                        24                      <160               164
Total                          164

n/2 = 164/2 = 82

It lies between 60 and 105. Corresponding to 60 the less than class is 100, and corresponding
to 105 the less than class is 120. Therefore the median class is 100-120. Its lower limit is 100.
Here l = 100, n = 164, f = 45, c = 20, m = 60

Median = 100 + [(82-60)/45 * 20] = 109.78 g
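
The worked example can be reproduced with a short sketch that locates the median class and then interpolates inside it, matching the calculation 100 + [(82-60)/45 * 20]. The function name and argument layout are illustrative.

```python
from itertools import accumulate

def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous series with equal class widths."""
    cumulative = list(accumulate(frequencies))
    n = cumulative[-1]
    half = n / 2
    for i, cf in enumerate(cumulative):
        if cf > half:                            # first cumulative frequency greater than n/2
            l = lower_limits[i]                  # lower limit of the median class
            m = cumulative[i - 1] if i > 0 else 0    # cumulative frequency before that class
            f = frequencies[i]                   # frequency of the median class
            return l + (half - m) / f * width

# Sorghum ear-head weights from the text: classes 60-80, 80-100, ..., 140-160.
lower_limits = [60, 80, 100, 120, 140]
ear_heads = [22, 38, 45, 35, 24]
median_weight = grouped_median(lower_limits, width=20, frequencies=ear_heads)
print(f"Median = {median_weight:.2f} g")
```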

Merits of Median
1. Median is not influenced by extreme values because it is a positional average.
2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.

Demerits of Median
1. A slight change in the series may bring drastic change in median value.
2. In case of even number of items or continuous series, median is an estimated value
other than any value in the series.
3. It is not suitable for further mathematical treatment except its use in calculating mean
deviation.
4. It does not take into account all the observations


4. Mode
The mode refers to the value in a distribution that occurs most frequently. It is an actual
value, which has the highest concentration of items in and around it. It shows the centre of
concentration of the frequencies in and around a given value. Therefore, it is preferred where
the purpose is to know the point of the highest concentration. It is, thus, a positional measure.

It is very important in agriculture, for example to find the typical height of a crop variety, the
most common source of irrigation in a region, or the most disease-prone paddy variety. Thus
the mode is an important measure in the case of qualitative data.

Computation of the mode

Ungrouped or Raw Data
For ungrouped data or a series of individual observations, the mode is often found by mere
inspection.
Example: Find the mode for the following seed weights: 2, 7, 10, 15, 10, 17, 8, 10, 2 gms
Mode = 10 gms
In some cases the mode may be absent, while in some cases there may be more than one mode.
Example: (1) 12, 10, 15, 24, 30 (no mode)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
The modal values are 7 and 10, as both occur 3 times each.
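
Finding the mode by inspection amounts to counting how often each value occurs. The sketch below returns every modal value, and an empty list when no value repeats; treating "every value occurs once" as "no mode" follows the example above and is a convention, not a universal rule.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                       # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 7, 10, 15, 10, 17, 8, 10, 2]))             # one mode
print(modes([12, 10, 15, 24, 30]))                         # no mode
print(modes([7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10]))    # two modes
```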

Grouped Data
For Discrete distribution, see the highest frequency and corresponding value of x is mode.

Example: Find the mode for the following

Weight of sorghum in gms (x)    No. of ear heads (f)
50                              4
65                              6
75                              16
80                              8
95                              7
100                             4

The maximum frequency is 16. The corresponding x value is 75.
Mode = 75 gms.

Continuous distribution
Locate the highest frequency. The class corresponding to that frequency is called the modal class.
Then apply the formula.

For the frequency distribution of weights of sorghum ear-heads given in table below.
Calculate the mode




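5. Relationships of the Mean, Median and Mode

Figure 1: Symmetrical distribution
Figure 2: Distribution skewed to the right
Figure 3: Distribution skewed to the left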

When a distribution is symmetrical, the mean, median and mode are the same, as is shown in
Figure 1.
When a distribution is skewed to the right, mean > median > mode. Generally, income
distribution is skewed to the right, where a large number of families have relatively low income
and a small number of families have extremely high income. In such a case, the mean is pulled
up by the extreme high incomes, and the relation among these three measures is as shown in
Figure 2.
When a distribution is skewed to the left, mode > median > mean. This is because here the
mean is pulled down below the median by extremely low values. This is shown in Figure 3.
Given the mean and median of a unimodal distribution, we can determine whether it is skewed
to the right or left. When mean > median, it is skewed to the right; when median > mean, it is
skewed to the left. It may be noted that, in the skewed cases described here, the median lies
between the mean and the mode.
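
A small numerical illustration of the right-skewed case. The income figures below are hypothetical values invented for this sketch, and skew_direction is an illustrative helper that applies the mean-versus-median rule stated above.

```python
from statistics import mean, median, mode

# Hypothetical incomes (in thousands): many low values and one very high value.
incomes = [20, 20, 25, 30, 35, 40, 200]

avg = mean(incomes)       # pulled up by the extreme high income
mid = median(incomes)     # middle value of the ordered data
most = mode(incomes)      # most frequent value

def skew_direction(mean_value, median_value):
    """Apply the rule of thumb: mean > median suggests right skew, mean < median left skew."""
    if mean_value > median_value:
        return "right"
    if mean_value < median_value:
        return "left"
    return "symmetrical"

print(f"mean = {avg:.2f}, median = {mid}, mode = {most}")
print("Skewed to the", skew_direction(avg, mid))
```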



6. The Best Measure of Central Tendency

At this stage, one may ask which of these three measures of central tendency is the best.
There is no simple answer to this question, because these three measures are based upon
different concepts. The arithmetic mean is the sum of the values divided by the total number
of observations in the series. The median is the value of the middle observation that divides the
series into two equal parts. The mode is the value around which the observations tend to
concentrate. As such, the use of a particular measure will largely depend on the purpose of the
study and the nature of the data. For example, when we are interested in knowing consumers’
preferences for different brands of television sets or different kinds of advertising, the choice
should go in favour of the mode. The use of the mean and median would not be proper. However,
the median can sometimes be used in the case of qualitative data when such data can be arranged
in an ascending or descending order. Let us take another example. Suppose we invite
applications for a certain vacancy in our company. A large number of candidates apply for that
post. We are now interested to know which age or age group has the largest concentration
of applicants. Here, obviously, the mode will be the most appropriate choice. The arithmetic
mean may not be appropriate as it may be influenced by some extreme values. However, the
mean happens to be the most commonly used measure of central tendency.


1. What are the desiderata (requirements) of a good average? Compare the mean, the
median and the mode in the light of these desiderata. Why are averages called
measures of central tendency?
4. "Every average has its own peculiar characteristics. It is difficult to say which average
is the best." Explain with examples.
5. What do you understand by 'Central Tendency'? Under what conditions is the median
more suitable than other measures of central tendency?
2. The average monthly salary paid to all employees in a company was Rs 8,000. The
average monthly salaries paid to male and female employees of the company were Rs
10,600 and Rs 7,500 respectively. Find out the percentages of males and females
employed by the company.


Topic 8,9 & 10 Topic : Measures of Dispersion / Variation- Range, Quartile
Deviation, Mean Deviation, Standard Deviation

1. Introduction
2. Meaning and Definition of Dispersion
3. Significance and Properties of Measuring Variation
4. Measures of Dispersion
5. Range
6. Interquartile Range or Quartile Deviation
7. Mean Deviation
8. Standard Deviation
9. Lorenz Curve
10. Self Test Questions


The measures of central tendency do not indicate the extent of dispersion or variability in a
distribution. Measuring dispersion takes us one step further in understanding the pattern of
the data. Further, a high degree of uniformity (i.e. a low degree of dispersion) is a desirable
quality. If a business faces a high degree of variability in its raw material, it may not find
mass production economical. Suppose an investor is looking for a suitable equity share for
investment. While examining the movement of share prices, he should avoid those shares that
fluctuate widely, having sometimes very high prices and at other times going very low. Such
extreme fluctuations mean that there is a high risk in the investment in shares. The investor
should, therefore, prefer those shares where risk is not so high.


Some important definitions of dispersion are:
1. "Dispersion is the measure of the variation of the items." -A.L. Bowley
2. "The degree to which numerical data tend to spread about an average value is called
the variation or dispersion of the data." -Spiegel
3. "Dispersion or spread is the degree of the scatter or variation of the variable about a
central value." -Brooks & Dick
4. "The measurement of the scatterness of the mass of figures in a series about an
average is called measure of variation or dispersion." -Simpson & Kafka

Dispersion (also known as scatter, spread or variation) measures the extent to which the items
vary from some central value. An average is more meaningful when it is examined in the light
of dispersion. For example, if the average wage of the workers of factory A is Rs. 3885 and
that of factory B Rs. 3900, we cannot necessarily conclude that the workers of factory B are
better off because in factory B there may be much greater dispersion in the distribution of
wages. The study can be explained from the following example:

         Series A    Series B    Series C
         100         100         1
         100         105         489
         100         102         2
         100         103         3
         100         90          5
Total    500         500         500
Mean     100         100         100
Since arithmetic mean is the same in all three series, one is likely to conclude that these series
are alike in nature. But a close examination shall reveal that distributions differ widely from
one another.
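
The comparison of the three series can be reproduced with a short sketch. The figures are those of the table above; the population standard deviation (pstdev) is used here only as one convenient yardstick of spread and is an assumption of this illustration, since the text has not yet introduced a specific measure at this point.

```python
from statistics import mean, pstdev

series = {
    "A": [100, 100, 100, 100, 100],
    "B": [100, 105, 102, 103, 90],
    "C": [1, 489, 2, 3, 5],
}

# For each series: (mean, range, population standard deviation).
summary = {
    name: (mean(values), max(values) - min(values), pstdev(values))
    for name, values in series.items()
}

for name, (m, rng, sd) in summary.items():
    print(f"Series {name}: mean={m}, range={rng}, sd={sd:.2f}")
```

All three means equal 100, while the spread is zero for Series A, small for Series B and very large for Series C.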

In series A, each and every item is perfectly represented by the arithmetic mean or in other
words none of the items of series A deviates from the arithmetic mean and hence there is no
dispersion. In series B, only one item is perfectly represented by the arithmetic mean and the
other items vary but the variation is very small as compared to series C. In series C, not a single
item is represented by the arithmetic mean and the items vary widely from one another. In
series C, dispersion is much greater compared to series B. Similarly, we may have two groups
of labourers with the same mean salary and yet their distributions may differ widely. The mean
salary may not be so important a characteristic as the variation of the items from the mean. The
three figures below represent frequency distributions with some of the characteristics


i. Measures of variation point out as to how far an average is representative of the mass.
When dispersion is small, the average is a typical value in the sense that it closely
represents the individual value and it is reliable in the sense that it is a good estimate of
the average in the corresponding universe. On the other hand, when dispersion is large,
the average is not so typical, and unless the sample is very large, the average may be
quite unreliable.
ii. Another purpose of measuring dispersion is to determine the nature and cause of variation
in order to control the variation itself. In industrial production, efficient operation
requires control of quality variation, the causes of which are sought through inspection;
measuring variation is therefore basic to the control of its causes.
iii. Measures of dispersion enable a comparison to be made of two or more series with regard
to their variability.
iv. Many powerful analytical tools in statistics, such as correlation analysis, the testing of
hypotheses, analysis of variance, statistical quality control and regression analysis, are
based on measures of variation of one kind or another.

A good measure of dispersion should possess the following properties

− It should be simple to understand.
− It should be easy to compute.
− It should be rigidly defined.
− It should be based on each and every item of the distribution.
− It should be amenable to further algebraic treatment.
− It should have sampling stability.
− Extreme items should not unduly affect it.



There are five measures of dispersion: Range, Inter-quartile range or Quartile Deviation, Mean
deviation, Standard Deviation, and Lorenz curve. Among them, the first four are mathematical
methods and the last one is the graphical method.

The simplest measure of dispersion is the range, which is the difference between the maximum
value and the minimum value of data.

Example : Find the range for the following three sets of data:
Set 1: 5 15 15 5 15 5 15 15 15 15
Set 2: 8 7 15 11 12 5 13 11 15 9
Set 3: 5 5 5 5 5 5 5 5 5 15
Solution: In each of these three sets, the highest number is 15 and the lowest number is 5.
Since the range is the difference between the maximum value and the minimum value of the
data, it is 10 in each case. But the range fails to give any idea about the dispersal or spread of
the series between the highest and the lowest value. This becomes evident from the above data.
In a frequency distribution, range is calculated by taking the difference between the upper limit
of the highest class and the lower limit of the lowest class.
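
A minimal sketch of the range calculation follows, using Set 1 and Set 2 from the example and the rule for a frequency distribution stated above. The class limits in the grouped case (20-40 up to 100-120) are illustrative values chosen for this sketch, and the standard deviation is printed only to show that two sets with the same range can still be spread differently.

```python
from statistics import pstdev

set_1 = [5, 15, 15, 5, 15, 5, 15, 15, 15, 15]
set_2 = [8, 7, 15, 11, 12, 5, 13, 11, 15, 9]


def value_range(data):
    """Range of raw data: largest value minus smallest value."""
    return max(data) - min(data)


def grouped_range(classes):
    """Range of a frequency distribution given (lower, upper) class limits."""
    return max(upper for _, upper in classes) - min(lower for lower, _ in classes)


print(value_range(set_1), value_range(set_2))  # 10 10
print(round(pstdev(set_1), 2), round(pstdev(set_2), 2))  # the spreads differ

# Illustrative class limits; the frequencies play no part in the range.
classes = [(20, 40), (40, 60), (60, 80), (80, 100), (100, 120)]
print(grouped_range(classes))  # 100
```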

Example : Find the range for the following frequency distribution:

Size of Item    Frequency
20-40           7
40-60           11
60-80           30
80-100          17
100-120         5
Total           70

Upper limit of the highest class is 120
Lower limit of the lowest class is 20.
Range is 120 - 20 = 100.


Note that the range is not influenced by the frequencies.
Symbolically, the range is calculated by the formula
Range = L - S,
where L is the largest value and
S is the smallest value in a distribution.

Coefficient of range = (L-S)/ (L+S).

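For the sets of data in the first example above, where L = 15 and S = 5, the coefficient of
range = (15 - 5) / (15 + 5) = 0.5. For the frequency distribution, where L = 120 and S = 20,
it is (120 - 20) / (120 + 20) = 0.71 approx.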

Example : Calculate the coefficient of range separately for the two sets of data
given below:
Set 1: 8 10 20 9 15 10 13 28
Set 2: 30 35 42 50 32 49 39 33
Solution: It can be seen that the range in both the sets of data is the same:
Set 1: 28 - 8 = 20
Set 2: 50 - 30 = 20
Coefficient of range in Set 1 = (28-8) / (28+8) = 0.56
Coefficient of range in Set 2 = (50-30) / (50+30) = 0.25
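
The same calculation as a short sketch; coefficient_of_range is a helper name chosen for this illustration and simply applies the formula (L - S) / (L + S) given above.

```python
def coefficient_of_range(data):
    """(L - S) / (L + S), where L is the largest and S the smallest value."""
    largest, smallest = max(data), min(data)
    return (largest - smallest) / (largest + smallest)


set_1 = [8, 10, 20, 9, 15, 10, 13, 28]
set_2 = [30, 35, 42, 50, 32, 49, 39, 33]

coef_1 = coefficient_of_range(set_1)
coef_2 = coefficient_of_range(set_2)
print(round(coef_1, 2), round(coef_2, 2))  # 0.56 0.25
```

Although both sets have a range of 20, the relative measure is larger for Set 1 because its values are smaller in magnitude.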

There are some limitations of range, which are as follows:
− It is based only on two items and does not cover all the items in a distribution.
− It is subject to wide fluctuations from sample to sample based on the same population.
− It fails to give any idea about the pattern of distribution. (Refer examples above)
− Finally, in the case of open-ended distributions, it is not possible to compute the range.


The quartiles divide the distribution into four parts. There are three quartiles. The second
quartile divides the distribution into two halves and therefore is the same as the median. The
first (lower) quartile (Q1) marks off the first one-fourth, and the third (upper) quartile (Q3)
marks off the first three-fourths. It may be noted that the second quartile is the value of the
median and of the 50th percentile.

Raw or ungrouped data

First arrange the given data in increasing order and use the formulae for Q1 and Q3; the
quartile deviation, Q.D, is then given by
Q.D = (Q3 - Q1) / 2
Where Q1 = [(n+1)/4]th item
Q3 = 3[(n+1)/4]th item
Example : Compute quartiles for the data given below (grains/panicles)
25, 18, 30, 8, 15, 5, 10, 35, 40, 45
Solution: Arranged in increasing order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
Q1 = [(n+1)/4]th item = [(10+1)/4]th item = 2.75th item
= 2nd item + 0.75(3rd item - 2nd item)
= 8 + 0.75(10-8)
= 9.5
Q3 = 3[(n+1)/4]th item = 3[(10+1)/4]th item = 8.25th item
= 8th item + 0.25(9th item - 8th item)
= 35 + 0.25(40-35)
= 36.25
Q.D = (Q3 - Q1) / 2 = (36.25 - 9.5) / 2 = 13.375
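
The positional rule for raw data can be sketched as follows. The helper names item_at and quartiles are inventions of this illustration; the sketch assumes at least three observations so that both quartile positions fall inside the data.

```python
def item_at(sorted_data, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    whole = int(position)
    fraction = position - whole
    lower = sorted_data[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_data[whole]
    return lower + fraction * (upper - lower)


def quartiles(data):
    """Q1 and Q3 as the (n+1)/4-th and 3(n+1)/4-th items of the ordered data."""
    ordered = sorted(data)
    n = len(ordered)
    return item_at(ordered, (n + 1) / 4), item_at(ordered, 3 * (n + 1) / 4)


grains = [25, 18, 30, 8, 15, 5, 10, 35, 40, 45]
q1, q3 = quartiles(grains)
quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)  # 9.5 36.25 13.375
```

Other conventions for locating quartiles exist (spreadsheets and statistical packages often use different positions), so results from software may differ slightly from this hand method.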

Discrete Series
Step 1: Find the cumulative frequencies.
Step 2: Find (n+1)/4.
Step 3: In the cumulative frequencies, find the value just greater than (n+1)/4; the
corresponding value of x is Q1.
Step 4: Find 3(n+1)/4.
Step 5: In the cumulative frequencies, find the value just greater than 3(n+1)/4; the
corresponding value of x is Q3.
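
As an illustration of these steps, the sketch below applies them to the sorghum ear-head distribution used earlier for the mode (this pairing of data and method is a choice made here, not a worked example from the course). The comparison uses "greater than or equal to", on the assumption that a cumulative frequency exactly equal to the position still contains that item.

```python
from itertools import accumulate


def discrete_quartile(values, freqs, position):
    """Return the x whose cumulative frequency first reaches the position."""
    for x, cumulative in zip(values, accumulate(freqs)):
        if cumulative >= position:
            return x
    raise ValueError("position exceeds the total frequency")


weights = [50, 65, 75, 80, 95, 100]   # x, in gms
frequencies = [4, 6, 16, 8, 7, 4]     # f
n = sum(frequencies)                  # 45

q1 = discrete_quartile(weights, frequencies, (n + 1) / 4)      # position 11.5
q3 = discrete_quartile(weights, frequencies, 3 * (n + 1) / 4)  # position 34.5
quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)  # 75 95 10.0
```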

Continuous series
Step 1: Find the cumulative frequencies.
Step 2: Find n/4.
Step 3: In the cumulative frequencies, find the value just greater than n/4; the corresponding
class interval is called the first quartile class.
Step 4: Find 3n/4. In the cumulative frequencies, find the value just greater than 3n/4; the
corresponding class interval is called the third quartile class. Then apply the respective formulae:

Q1 = l1 + [(n/4 - m1) / f1] × c1
Q3 = l3 + [(3n/4 - m3) / f3] × c3

Where l1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
c1 = width of the first quartile class
m1 = c.f. preceding the first quartile class
l3 = lower limit of the 3rd quartile class
f3 = frequency of the 3rd quartile class
c3 = width of the 3rd quartile class
m3 = c.f. preceding the 3rd quartile class
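
A sketch of the continuous-series procedure is given below. It assumes the usual interpolation formula, quartile = l + [(k·n/4 - m) / f] × c, with l, f, c and m as defined above, and it reuses the frequency distribution from the range example (classes 20-40 to 100-120) purely for illustration.

```python
def grouped_quartile(classes, freqs, k):
    """k-th quartile (k = 1, 2 or 3) of a continuous frequency distribution.

    classes: list of (lower limit, upper limit); freqs: class frequencies.
    """
    n = sum(freqs)
    target = k * n / 4
    preceding_cf = 0  # cumulative frequency before the current class (m)
    for (lower, upper), f in zip(classes, freqs):
        if preceding_cf + f >= target:
            width = upper - lower
            return lower + (target - preceding_cf) / f * width
        preceding_cf += f
    raise ValueError("empty distribution")


classes = [(20, 40), (40, 60), (60, 80), (80, 100), (100, 120)]
frequencies = [7, 11, 30, 17, 5]

q1 = grouped_quartile(classes, frequencies, 1)  # quartile class 40-60
q3 = grouped_quartile(classes, frequencies, 3)  # quartile class 80-100
quartile_deviation = (q3 - q1) / 2
print(round(q1, 2), round(q3, 2), round(quartile_deviation, 2))
```

Here n/4 = 17.5 falls in the class 40-60 and 3n/4 = 52.5 in the class 80-100, so Q1 = 40 + (10.5/11) × 20 and Q3 = 80 + (4.5/17) × 20.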


− As compared to range, it is considered a superior measure of dispersion.
− In the case of open-ended distribution, it is quite suitable.
− Since it is not influenced by the extreme values in a distribution, it is particularly
suitable in highly skewed or erratic distributions.
− Like the range, it fails to cover all the items in a distribution.
− It is not amenable to mathematical manipulation.
− It varies widely from sample to sample based on the same population.
− Since it is a positional average, it is not considered as a measure of dispersion. It merely
shows a distance on scale and not a scatter around an average.

The mean deviation is also known as the average deviation. As the name implies, it is the
average of the absolute amounts by which the individual items deviate from the mean. Since the
sum of the positive deviations from the mean is equal to the sum of the negative deviations,
the signed deviations would cancel out; therefore, while computing the mean deviation, we
ignore positive and negative signs. Symbolically,
MD = Σ|x| / n
Where MD = mean deviation,
|x| = deviation of an item from the mean ignoring positive and negative signs,
n = the total number of observations.

Example :
Size of Item    Frequency
2-4             20
4-6             40
6-8             30
8-10            10

Size of Item    Mid-point (m)    Frequency (f)    fm     d = m - mean    f|d|
2-4             3                20               60     -2.6            52
4-6             5                40               200    -0.6            24
6-8             7                30               210    1.4             42
8-10            9                10               90     3.4             34
Total                            100              560                    152

Mean = Σfm / Σf = 560/100 = 5.6
MD = Σf|d| / n = 152/100 = 1.52
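
The table can be reproduced with a few lines of Python. This is a sketch for grouped data in which each class is represented by its mid-point, so MD is computed as Σf|d| / Σf; the variable names are chosen for this illustration.

```python
midpoints = [3, 5, 7, 9]          # mid-points of 2-4, 4-6, 6-8, 8-10
frequencies = [20, 40, 30, 10]

n = sum(frequencies)
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of f * |m - mean|, ignoring the sign of each deviation.
total_abs_deviation = sum(f * abs(m - mean) for f, m in zip(frequencies, midpoints))
mean_deviation = total_abs_deviation / n

print(round(mean, 2), round(total_abs_deviation, 2), round(mean_deviation, 2))
# 5.6 152.0 1.52
```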


o A major advantage of mean deviation is that it is simple to understand and easy to compute.
o It takes into consideration each and every item in the distribution. As a result, a change in
the value of any item will have its effect on the magnitude of mean deviation.
o The values of extreme items have less effect on the value of the mean deviation.
o As deviations are taken from a central value, it is possible to have meaningful comparisons
of the formation of different distributions.


o It is not capable of further algebraic treatment
o At times it may fail to give accurate results. The mean deviation gives best results when
deviations are taken from the median instead of from the mean.
o But in a series, which has wide variations in the items, median is not a satisfactory measure.
o Strictly on mathematical considerations, the method is wrong as it ignores the algebraic
signs when the deviations are taken from the mean.

The standard deviation is similar to the mean deviation in that here too the deviations are
measured from the mean. At the same time, the standard deviation is preferred to the mean
deviation, the quartile deviation or the range because it has desirable mathematical properties.
The standard deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek
letter sigma). The formula is easy: it is the square root of the variance. So now you ask, "What
is the variance?"
Variance: The variance is defined as the average of the squared differences from the mean.
X        X-μ              (X-μ)²
20       20-18 = 2        4
15       15-18 = -3       9
19       19-18 = 1        1
24       24-18 = 6        36
16       16-18 = -2       4
14       14-18 = -4       16
108      Total            70
Mean = 108/6 = 18

The second column shows the deviations from the mean. The third or the last column
shows the squared deviations, the sum of which is 70. The arithmetic mean of the
squared deviations is:
Σ(X-μ)² / N = 70/6 = 11.67 approx.

This mean of the squared deviations is known as the variance.
Symbolically, Var(X) = Σ(X-μ)² / N
σ² = Σ(X-μ)² / N
Where σ² (called sigma squared) is used to denote the variance.

Although the variance is a measure of dispersion, the unit of its measurement is (points)². If a
distribution relates to income of families then the variance is in (Rs)² and not rupees. Similarly,
if another distribution pertains to marks of students, then the unit of variance is (marks)². To
overcome this inadequacy, the square root of variance is taken, which yields a better measure
of dispersion known as the standard deviation.
Referring to the example above, we take the square root of the variance:
SD or σ = √11.67 = 3.42 points
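
The worked example can be checked step by step with the sketch below, which treats the six values as a whole population (divisor N), as the text does.

```python
from math import sqrt

values = [20, 15, 19, 24, 16, 14]

mu = sum(values) / len(values)                 # 108 / 6 = 18
squared_deviations = [(x - mu) ** 2 for x in values]
variance = sum(squared_deviations) / len(values)
sd = sqrt(variance)

print(mu, sum(squared_deviations), round(variance, 2), round(sd, 2))
# 18.0 70.0 11.67 3.42
```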

Example : The following distribution relates to marks obtained by students in an examination:

Marks      Number of Students
0-10       1
10-20      3
20-30      6
30-40      10
40-50      12
50-60      11
60-70      6
70-80      3
80-90      2
90-100     1

Taking 55 as the assumed origin and dividing the deviations by the class interval of 10:

Marks      Frequency (f)    Mid-point (m)    d' = (m - 55)/10    fd'     fd'²
0-10       1                5                -5                  -5      25
10-20      3                15               -4                  -12     48
20-30      6                25               -3                  -18     54
30-40      10               35               -2                  -20     40
40-50      12               45               -1                  -12     12
50-60      11               55               0                   0       0
60-70      6                65               1                   6       6
70-80      3                75               2                   6       12
80-90      2                85               3                   6       18
90-100     1                95               4                   4       16
Total      55                                                    -45     231

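The standard deviation is calculated by using the following formula:

σ = C × √[ Σfidi² / N - (Σfidi / N)² ]

Where C is the class interval, fi is the frequency of the ith class, di is the deviation of the
ith item from an assumed origin (measured in class-interval units, the column d' above), and N
is the total number of observations. Applying this formula:

σ = 10 × √[ 231/55 - (-45/55)² ] = 10 × √(4.2 - 0.669) = 10 × √3.531
SD or σ = 18.8 marks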
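
The step-deviation calculation can be sketched as below. It assumes the standard short-cut formula σ = C × √[Σfd'²/N - (Σfd'/N)²], with 55 as the assumed origin and C = 10, which is what the table's d' column implies.

```python
from math import sqrt

midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
frequencies = [1, 3, 6, 10, 12, 11, 6, 3, 2, 1]
assumed_origin = 55   # mid-point of the class 50-60
class_width = 10      # C

# Step deviations d' = (m - assumed origin) / C
steps = [(m - assumed_origin) / class_width for m in midpoints]

n = sum(frequencies)
sum_fd = sum(f * d for f, d in zip(frequencies, steps))
sum_fd2 = sum(f * d * d for f, d in zip(frequencies, steps))

sd = class_width * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
print(n, sum_fd, sum_fd2, round(sd, 1))  # 55 -45.0 231.0 18.8
```

The choice of assumed origin does not affect the result; it only keeps the arithmetic small when the work is done by hand.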


The standard deviation is the average amount of variability in your dataset. It tells you, on
average, how far each value lies from the mean. A high standard deviation means that values
are generally far from the mean, while a low standard deviation indicates that values are
clustered close to the mean.

The standard deviation is a frequently used measure of dispersion. It enables us to determine
how far individual items in a distribution deviate from its mean. In a symmetrical, bell-shaped
curve:

▪ About 68 percent of the values in the population fall within ±1 standard deviation of
the mean.
▪ About 95 percent of the values fall within ±2 standard deviations of the mean.
▪ About 99.7 percent of the values fall within ±3 standard deviations of the mean.

What does standard deviation tell you?

Standard deviation is a useful measure of spread for normal distributions. In normal
distributions, data is symmetrically distributed with no skew. Most values cluster around a
central region, with values tapering off as they go further away from the center. The standard
deviation tells you how spread out from the center of the distribution your data is on average.
Many scientific variables follow normal distributions, including height, standardized test
scores, or job satisfaction ratings. When you have the standard deviations of different samples,
you can compare their distributions using statistical tests to make inferences about the larger
populations they came from.

Example: Comparing different standard deviations

You collect data on job satisfaction ratings from three groups of employees using simple
random sampling. The mean (M) ratings are the same for each group: it is the value on the
x-axis when the curve is at its peak. However, their standard deviations (SD) differ from each
other.
The standard deviation reflects the dispersion of the distribution. The curve with the lowest
standard deviation has a high peak and a small spread, while the curve with the highest standard
deviation is flatter and more widely spread.



The standard deviation and the mean together can tell you where most of the values in
your frequency distribution lie if they follow a normal distribution. The empirical rule, or the
68-95-99.7 rule, tells you where your values lie:
• Around 68% of scores are within 1 standard deviation of the mean,
• Around 95% of scores are within 2 standard deviations of the mean,
• Around 99.7% of scores are within 3 standard deviations of the mean.

The empirical rule is a quick way to get an overview of your data and check for any outliers or
extreme values that don’t follow this pattern.

Example : Standard deviation in a normal distribution

You administer a memory recall test to a group of students. The data follows a normal
distribution with a mean score of 50 and a standard deviation of 10. Following the empirical rule:
• Around 68% of scores are between 40 and 60.
• Around 95% of scores are between 30 and 70.
• Around 99.7% of scores are between 20 and 80.
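
The intervals in this example can be verified numerically. The sketch below uses NormalDist from Python's standard statistics module; share_within is a helper name invented for the illustration.

```python
from statistics import NormalDist

scores = NormalDist(mu=50, sigma=10)


def share_within(dist, k):
    """Interval mean ± k standard deviations and the share of values inside it."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    low, high, share = share_within(scores, k)
    print(f"within {k} SD: {low:.0f} to {high:.0f}, about {share:.1%}")
```

The exact shares are roughly 68.3%, 95.4% and 99.7%, which the empirical rule rounds to 68, 95 and 99.7.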


Standard deviation formulas for populations and samples - Different formulas are used for
calculating standard deviations depending on whether you have collected data from a whole
population or a sample.

Population standard deviation - When you have collected data from every member of
the population that you’re interested in, you can get an exact value for population standard
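deviation. The population standard deviation formula is:
σ = √[ Σ(X - μ)² / N ]
where X is each value, μ is the population mean and N is the number of values in the population.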

Sample standard deviation

When you collect data from a sample, the sample standard deviation is used to make estimates
or inferences about the population standard deviation.

With samples, we use n – 1 in the formula because using n would give us a biased estimate that
consistently underestimates variability. The sample standard deviation would tend to be lower
than the real standard deviation of the population.
Reducing the sample n to n – 1 makes the standard deviation artificially large, giving you a
conservative estimate of variability.
While this is not an unbiased estimate, it is a less biased estimate of standard deviation: it is
better to overestimate rather than underestimate variability in samples.

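The sample standard deviation formula is:
s = √[ Σ(x - x̄)² / (n - 1) ]
where x is each value, x̄ is the sample mean and n is the number of values in the sample.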
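
The effect of dividing by n - 1 instead of n can be seen with the six values from the earlier variance example. This sketch uses pstdev (divisor N) and stdev (divisor n - 1) from Python's standard statistics module; treating the same six values first as a population and then as a sample is done only for comparison.

```python
from statistics import pstdev, stdev

values = [20, 15, 19, 24, 16, 14]

population_sd = pstdev(values)  # square root of 70 / 6
sample_sd = stdev(values)       # square root of 70 / 5

print(round(population_sd, 2), round(sample_sd, 2))  # 3.42 3.74
```

The sample figure is always the larger of the two, and the gap narrows as the number of observations grows.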



This measure of dispersion is graphical. It is known as the Lorenz curve, named after Dr. Max
Lorenz. It is generally used to show the extent of concentration of income and wealth. The
steps involved in plotting the Lorenz curve are:
o Convert the frequency distribution into a cumulative frequency table.
o Calculate the cumulative percentage for each item, taking the total as equal to 100.
o Choose a suitable scale and plot the cumulative percentages of the persons and income.
o Use the horizontal X-axis to depict percentages of persons and the vertical Y-axis
to depict percentages of income.
o Show the line of equal distribution, which joins the origin (0 on both axes) with the
point where both cumulative percentages equal 100.
o Compare the plotted curve with the straight line of equal distribution. If the Lorenz
curve is close to the line of equal distribution, it implies that the dispersion is small.
If, on the contrary, the Lorenz curve is farther away from the line of equal distribution,
it implies that the dispersion is greater.
The Lorenz curve is a simple graphical device to show the disparities of distribution in any
phenomenon. It is used in business and economics to represent inequalities in income, wealth,
production, savings, and so on. The Figure below shows two Lorenz curves by way of
illustration. The straight line AB is a line of equal distribution, whereas AEB shows complete
inequality. Curve ACB and curve ADB are the Lorenz curves.

As curve ACB is nearer to the line of equal distribution, it shows a more equitable distribution
of income than curve ADB. Assuming that these two curves are for the same company, this may
be interpreted in a different manner. Prior to taxation, the curve ADB showed greater inequality
in the income of its employees. After taxation, the company's data resulted in curve ACB,
which is closer to the line of equal distribution. In other words, as a result of taxation, the
inequality has reduced.
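
The plotting steps can be sketched numerically without drawing the graph. The income figures below are hypothetical values invented for this illustration, and max_gap (the largest vertical distance between the line of equal distribution and the curve) is only a rough, assumed way of saying which curve lies closer to that line.

```python
from itertools import accumulate


def lorenz_points(incomes):
    """Cumulative % of persons (x) against cumulative % of income (y)."""
    ordered = sorted(incomes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    for i, running in enumerate(accumulate(ordered), start=1):
        points.append((100 * i / n, 100 * running / total))
    return points


def max_gap(points):
    """Largest vertical distance between the line y = x and the curve."""
    return max(x - y for x, y in points)


before_tax = [10, 20, 30, 60, 180]   # hypothetical incomes of five employees
after_tax = [10, 19, 27, 48, 108]    # hypothetical incomes after taxation

curve_before = lorenz_points(before_tax)
curve_after = lorenz_points(after_tax)
print(round(max_gap(curve_before), 1), round(max_gap(curve_after), 1))
```

With these figures the after-tax curve lies closer to the line of equal distribution, which corresponds to curve ACB in the discussion above.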


1. What do you mean by dispersion? What are the different measures of dispersion?
2. “Variability is not an important factor because even though the outcome is more certain,
you still have an equal chance of falling either above or below the median. Therefore, on
an average, the outcome will be the same.” Do you agree with this statement? Give
reasons for your answer.
3. Why is the standard deviation the most widely used measure of dispersion? Explain.


Topic 11 & 12 Topic : Skewness, Moments and Kurtosis

1. Skewness: Meaning and Definitions
2. Tests of Skewness
3. Measures of Skewness
4. Moments
5. Kurtosis
6. Self Test Questions


Lack of symmetry is called skewness. If a distribution is not symmetrical, it is called a
skewed distribution. In a skewed distribution, the mean, median and mode differ in value and
one tail is longer than the other. The skewness may be positive or negative.

Symmetrical Distribution - In a symmetrical distribution the values of mean, median and
mode coincide. The spread of the frequencies is the same on both sides of the centre point of
the curve.

Positively skewed distribution: If the frequency curve has a longer tail to the right, the
distribution is known as a positively skewed distribution and Mean > Median > Mode.

Negatively skewed distribution: If the frequency curve has a longer tail to the left, the
distribution is known as a negatively skewed distribution and Mean < Median < Mode.

In order to ascertain whether a distribution is skewed or not the following tests may be applied.
Skewness is present if:
i. The values of mean, median and mode do not coincide.
ii. When the data are plotted on a graph they do not give the normal bell shaped form i.e.
when cut along a vertical line through the centre the two halves are not equal.
iii. The sum of the positive deviations from the median is not equal to the sum of the
negative deviations.
iv. Quartiles are not equidistant from the median.
v. Frequencies are not equally distributed at points of equal deviation from the mode.

On the contrary, when skewness is absent, i.e. in case of a symmetrical distribution, the
following conditions are satisfied:
i. The values of mean, median and mode coincide.
ii. Data when plotted on a graph give the normal bell-shaped form.
iii. Sum of the positive deviations from the median is equal to the sum of the negative
deviations.
iv. Quartiles are equidistant from the median.
v. Frequencies are equally distributed at points of equal deviations from the mode.


The measures of skewness are: (i) Karl Pearson's measure, (ii) Bowley’s measure, (iii) Kelly’s
measure, and (iv) Moment’s measure


The formula for measuring skewness as given by Karl Pearson is as follows:
Skewness = Mean - Mode
Coefficient of skewness = (Mean - Mode) / Standard deviation

In case the mode is indeterminate, it is estimated from the empirical relation
Mode = 3 Median - 2 Mean
and the coefficient of skewness becomes:

Coefficient of skewness = [Mean - (3 Median - 2 Mean)] / Standard deviation
= 3(Mean - Median) / Standard deviation

The direction of skewness is determined by ascertaining whether the mean is greater than the
mode or less than the mode. If it is greater than the mode, then skewness is positive. But when
the mean is less than the mode, it is negative. The difference between the mean and mode
indicates the extent of departure from symmetry. It is measured in standard deviation units,
which provide a measure independent of the unit of measurement. The value of the coefficient of
skewness is zero when the distribution is symmetrical. Normally, this coefficient of skewness
lies between -1 and +1. If the mean is greater than the mode, then the coefficient of skewness
will be positive, otherwise negative.
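
Both forms of Karl Pearson's coefficient can be sketched as follows, assuming the deviation is divided by the standard deviation (the text describes the measure as being "in standard deviation units"). The function names are inventions of this illustration, the population standard deviation is used, and the first data set is hypothetical; the second is the grains data from the quartile example.

```python
from statistics import mean, median, mode, pstdev


def pearson_skewness_mode(data):
    """(Mean - Mode) / standard deviation."""
    return (mean(data) - mode(data)) / pstdev(data)


def pearson_skewness_median(data):
    """3 (Mean - Median) / standard deviation, used when the mode is unclear."""
    return 3 * (mean(data) - median(data)) / pstdev(data)


with_clear_mode = [2, 3, 3, 3, 4, 5, 8]               # hypothetical values
grains = [5, 8, 10, 15, 18, 25, 30, 35, 40, 45]       # no repeated value

sk_mode = pearson_skewness_mode(with_clear_mode)
sk_median = pearson_skewness_median(grains)
print(round(sk_mode, 2), round(sk_median, 2))
```

Both coefficients come out positive, indicating a longer tail to the right; a perfectly symmetrical data set gives zero.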

Bowley's Measure
Bowley developed a measure of skewness, which is based on quartile values. The formula for
measuring skewness is:

Skewness = (Q3 + Q1 - 2M) / (Q3 - Q1)

Where Q3 and Q1 are the upper and lower quartiles and M is the median. The value of this
skewness varies between +1 and -1.
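
A small sketch of Bowley's measure, assuming its usual form (Q3 + Q1 - 2M) / (Q3 - Q1). The quartiles are those obtained earlier for the grains data (Q1 = 9.5, Q3 = 36.25); the median of 21.5 is the average of the 5th and 6th ordered items (18 and 25).

```python
def bowley_skewness(q1, median, q3):
    """(Q3 + Q1 - 2M) / (Q3 - Q1); zero when the quartiles are equidistant."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


sk_bowley = bowley_skewness(q1=9.5, median=21.5, q3=36.25)
print(round(sk_bowley, 3))  # 0.103
```

The small positive value says that Q3 lies slightly farther from the median than Q1 does, that is, a mild positive skew.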


Kelly's Measure
Kelly's measure of skewness is given in terms of percentiles (P) and deciles (D). Kelly's
absolute measure of skewness (Sk) is:
Sk = P90 + P10 - 2 × P50
Sk = D9 + D1 - 2 × D5
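
The decile form can be sketched with the quantiles function from Python's standard statistics module. Its "exclusive" method places the i-th decile at position i(n+1)/10, which matches the (n+1)-based positions used for quartiles earlier; applying it to the grains data is a choice made for this illustration.

```python
from statistics import quantiles


def kelly_skewness(data):
    """Absolute measure D9 + D1 - 2 * D5 (equivalently P90 + P10 - 2 * P50)."""
    deciles = quantiles(data, n=10, method="exclusive")  # D1 ... D9
    d1, d5, d9 = deciles[0], deciles[4], deciles[8]
    return d9 + d1 - 2 * d5


grains = [25, 18, 30, 8, 15, 5, 10, 35, 40, 45]
sk_kelly = kelly_skewness(grains)
print(round(sk_kelly, 2))  # 6.8
```

Here D1 = 5.3, D5 = 21.5 and D9 = 44.5. Being an absolute measure, the result is in the units of the data (grains per panicle) and is not bounded by ±1.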

In Statistics, Moments are popularly used to describe the characteristic of a distribution. The
utility of moments lies in the sense that they indicate different aspects of a given distribution.
Thus, by using moments, we can measure the central tendency of a series, dispersion or
variability, skewness and the peakedness of the curve.

It may be noted that the first central moment is zero, that is, μ1 = 0.
The second central moment is μ2 = σ², the variance.
The third central moment μ3 is used to measure skewness.
The fourth central moment gives an idea about the Kurtosis.
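
These statements can be checked numerically. The sketch assumes the usual definition of the r-th central moment, μr = Σ(X - mean)^r / N, which is not written out in the text above, and applies it to the six values from the variance example.

```python
def central_moment(data, r):
    """r-th central moment: average of (x - mean) ** r over all values."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** r for x in data) / len(data)


values = [20, 15, 19, 24, 16, 14]

m1 = central_moment(values, 1)  # always zero
m2 = central_moment(values, 2)  # the variance, 70 / 6
m3 = central_moment(values, 3)  # its sign indicates the direction of skewness
m4 = central_moment(values, 4)  # used in the coefficient of kurtosis

print(m1, round(m2, 2), round(m3, 2), round(m4, 2))
```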


Kurtosis is another measure of the shape of a frequency curve. It is a Greek word,
which means bulginess. While skewness signifies the extent of asymmetry, kurtosis
measures the degree of peakedness of a frequency distribution. Karl Pearson classified
curves into three types on the basis of the shape of their peaks. These are mesokurtic,
leptokurtic and platykurtic. These three types of curves are -


• Mesokurtic: Distributions that are moderate in breadth and curves with a medium
peaked height.
• Leptokurtic: More values in the distribution tails and more values close to the mean
(i.e. sharply peaked with heavy tails)
• Platykurtic: Fewer values in the tails and fewer values close to the mean (i.e. the
curve has a flat peak and has more dispersed scores with lighter tails).

The coefficient of kurtosis as given by Karl Pearson is β2 = μ4 / μ2².

In case of a normal distribution, that is, a mesokurtic curve, the value of β2 = 3.
If β2 turns out to be > 3, the curve is called a leptokurtic curve and is more peaked than the
normal curve. Again, when β2 < 3, the curve is called a platykurtic curve and is less peaked
than the normal curve.
The measure of kurtosis is very helpful in the selection of an appropriate average. For example,
for normal distribution, mean is most appropriate; for a leptokurtic distribution, median is most
appropriate; and for platykurtic distribution, the quartile range is most appropriate.
Another measure of kurtosis is based on both quartiles and percentiles and is given by the
following formula:
K = Q / (P90-P10)
Where K = kurtosis, Q = ½ (Q3 – Q1) is the semi-interquartile range; P90 is 90th percentile
and P10 is the 10th percentile. This is also known as the percentile coefficient of kurtosis. In
case of the normal distribution, the value of K is 0.263.
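
Both measures of kurtosis can be sketched as below. The function names are chosen for this illustration; β2 is computed from central moments defined as averages of powers of deviations from the mean (an assumption stated here because the text does not write the definition out), and the normal-curve value of K is checked with NormalDist from the standard statistics module.

```python
from statistics import NormalDist


def beta_2(data):
    """Pearson's coefficient of kurtosis: mu4 / mu2 squared."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m4 / m2 ** 2


def percentile_kurtosis(q1, q3, p10, p90):
    """K = Q / (P90 - P10), where Q = (Q3 - Q1) / 2."""
    return 0.5 * (q3 - q1) / (p90 - p10)


values = [20, 15, 19, 24, 16, 14]
b2 = beta_2(values)  # below 3, so this small data set is platykurtic

z = NormalDist()  # standard normal curve
k_normal = percentile_kurtosis(
    z.inv_cdf(0.25), z.inv_cdf(0.75), z.inv_cdf(0.10), z.inv_cdf(0.90)
)
print(round(b2, 2), round(k_normal, 3))  # 2.04 0.263
```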

1. Define skewness and Dispersion.
2. Define Kurtosis and Moments.
3. What are the different measures of skewness? Which one is most often used?
4. "Measures of dispersion and skewness are complementary to one another in
understanding a frequency distribution." Elucidate the statement.

