
Probability and Statistics

By
Dr.B.V.Dhandra
Professor
Christ (Deemed to be) University,
Bangalore
&
Dr.Dibu
Assistant Professor
Christ (Deemed to be) University,
Bangalore
Christ (Deemed to be) University, Bengaluru
MCA-132-Isem: Probability and Statistics

Total Teaching Hours for Semester:60 Syllabus No of Lecture Hours/Week:4

Max Marks:100 Credits:4

Course Objectives/Course Description  

The main aim of this course is to provide grounding knowledge of statistical methods for data analytics.
Data summarization, probability, random variables with their properties, and distribution functions are included.
Sampling distributions and their applications in hypothesis testing, along with advanced statistical methods such as ANOVA
and correlation and regression analysis, are also covered.
Learning Outcomes

After completion of this course, students will be able to:

 CO1: Understand how to summarize and present the data using exploratory data analysis
CO2: Demonstrate the distribution functions of data and important characteristics

CO3: Infer the sampling distributions and their applications in hypothesis testing

CO4: Identify the relationship between the variables and model the same
Unit-1 Teaching Hours:10

Exploratory Data Analysis  

Definition of Statistics, applications, data types and measurements, graphical representation of data using histogram, line
diagram, bar diagram, time series plots; measures of central tendency and dispersion; coefficient of skewness and kurtosis and
their practical importance.

Unit-2 Teaching Hours:15

Probability and Random Variables  

Random experiment, sample space and events. Definitions of probability, addition and multiplication rules of probability,
conditional probability and some numerical examples; Random variables: Definition, types of random variables, pmf and pdf
of random variables; Mathematical expectation: mean, variance, covariance, mgf and cgf of a random variable(s); Probability
distributions: Binomial, Poisson and Normal distributions with their important characteristics.
Unit-3 Teaching Hours:10

Sampling Distributions  

Concepts of population, sample, parameter, statistic, and sampling distribution of a statistic; Sampling distribution of standard
statistics like sample mean, variance, proportions, etc.; t, F and Chi-square distributions with statistical properties.

Unit-4 Teaching Hours:15

Testing of Hypothesis  

Statistical hypotheses: simple and composite; Statistical tests, Critical region, Type I and Type II errors; Testing of hypothesis – null
and alternative hypothesis, level of significance. Test of significance using t, F and Chi-Square distributions (large sample case).
Concept of interval estimation and confidence interval construction for standard population parameters like, mean, variance,
difference of means, proportions (only large sample case).
Unit-5 Teaching Hours:10

Advanced Statistical Methods  

Analysis of one-way and two-way classifications with examples, analysis and statistical inference; Correlation and
regression analysis, properties and their statistical significance.

Text Books And Reference Books:

1. Gupta S.C & Kapoor V.K, Fundamentals of Mathematical statistics, SultanChand & sons, 2009. 

2. Douglas C Montgomery, George C Runger, Applied Statistics and Probability for Engineers, Wiley student edition, 2004.
Exploratory Data Analyses:
Data scientists widely use EDA to understand datasets for decision-
making and data cleaning processes. EDA reveals crucial information
about the data, such as hidden patterns, outliers, variance, covariance,
and correlations between features. This information is essential for
designing hypotheses and creating better-performing models.
TYPES OF EXPLORATORY DATA ANALYSIS:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis, as it uses
just one variable to study the data. The standard goal of univariate
non-graphical EDA is to understand the underlying sample distribution of the data and make
observations about the population. Outlier detection is also part of the
analysis. The characteristics of population distribution include:
Types of EDA:
• Generally, EDA falls into two categories:
• 1. Univariate.
• 2. Multivariate.
• The univariate analysis involves analyzing one feature, such as
summarizing and finding the feature patterns.
• The multivariate analysis technique shows the relationship
between two or more features using cross-tabulation or statistics.
• The following figure shows the further subdivision of EDA based on
data to analyze and methods such as numerical or graphical
methods.
Figure: Types of EDA
Performing EDA:

• In this section we cover various ways of performing EDA using the Titanic
dataset taken from Kaggle.
Getting Data:
• The Titanic dataset is downloaded from Kaggle to local drive and
then loaded into pandas DataFrame using read_csv() method.
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow_data_validation as tfdv

import warnings
warnings.filterwarnings('ignore')

# import data into pandas
df = pd.read_csv("./Dataset/train.csv")

# check first 5 records
df.head(n=5)

• Data Set:
• The dataset consists of 11 columns. The dataset column information is shown below.
Figure: Data set information
Data cleaning and manipulation
• In this step of the process, the main focus is on data cleaning and
data manipulation. For any data-centric project, data reliability is of
high importance in order to get the reliable output. A reliable dataset
has 5 essential features:
Handling Missing Values
• Data cleaning and manipulation focus on handling the missing value and converting
categorical features to numeric. It leads to a reliable dataset that is ready for EDA.
• Handling missing data:
• The first step is to get the information on columns that have missing data.
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

• Fig: Columns with missing values
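The first step described above (finding which columns have missing data) can also be sketched with plain pandas. This is a minimal sketch, not the slides' AutoViz approach: the small `toy` frame is a hypothetical stand-in for the Titanic data, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic DataFrame `df`
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count null entries per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to the real data, the same two lines would be run on `df` instead of `toy`.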
Handling Age Column Null Value:
• There are two ways to handle missing values in the Age column:
• Replacing the null value of Age by mean.
• Considering Sex and Pclass columns, replace the age column's missing value with the mean
based on sub-groups.
# Age
# Missing age data in different passenger classes
df[df.Age.isnull()].Pclass.groupby(df.Pclass).count()

# Output
# Pclass
# 1     30
# 2     11
# 3    136
Mean Calculation
# mean calculation
mean_1_male = df.loc[(df.Age.notnull()) & (df.Pclass == 1) & (df.Sex == 'male')].Age.mean()
mean_2_male = df.loc[(df.Age.notnull()) & (df.Pclass == 2) & (df.Sex == 'male')].Age.mean()
mean_3_male = df.loc[(df.Age.notnull()) & (df.Pclass == 3) & (df.Sex == 'male')].Age.mean()

mean_1_female = df.loc[(df.Age.notnull()) & (df.Pclass == 1) & (df.Sex == 'female')].Age.mean()
mean_2_female = df.loc[(df.Age.notnull()) & (df.Pclass == 2) & (df.Sex == 'female')].Age.mean()
mean_3_female = df.loc[(df.Age.notnull()) & (df.Pclass == 3) & (df.Sex == 'female')].Age.mean()
Function for Age Data Cleaning:
# function for age data cleaning
def replaceAge(dataframe):
    # fill null values with the mean of the matching (Pclass, Sex) sub-group
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==1) & (dataframe['Sex']=='male')), mean_1_male, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==2) & (dataframe['Sex']=='male')), mean_2_male, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==3) & (dataframe['Sex']=='male')), mean_3_male, dataframe['Age'])

    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==1) & (dataframe['Sex']=='female')), mean_1_female, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==2) & (dataframe['Sex']=='female')), mean_2_female, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==3) & (dataframe['Sex']=='female')), mean_3_female, dataframe['Age'])

    return dataframe
# cleaned data
df_Cleaned = replaceAge(df)

# now all age data is not null
df_Cleaned.isnull().sum()
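The sub-group mean replacement above can be written more compactly with `groupby(...).transform("mean")`. The following is a minimal sketch rather than the slides' own code: the function name `fill_age_by_group` and the toy data are hypothetical, and the sketch assumes the same `Pclass`, `Sex` and `Age` column names.

```python
import numpy as np
import pandas as pd

def fill_age_by_group(dataframe):
    """Fill missing Age with the mean Age of the same (Pclass, Sex) sub-group."""
    out = dataframe.copy()  # leave the caller's frame untouched
    # transform("mean") returns one value per row: the mean of that row's group
    group_mean = out.groupby(["Pclass", "Sex"])["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(group_mean)
    return out

# Hypothetical toy data with one missing age in each sub-group
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})
filled = fill_age_by_group(toy)
print(filled)
```

Unlike `replaceAge`, this version does not need the six precomputed means and works for any number of sub-groups, at the cost of being less explicit for a first reading.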
2. Handling Cabin column null value
• In this case missing values are replaced with X
#Output#
• No missing value
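The slides do not show the code for the Cabin step. A minimal sketch, assuming the replacement is done with pandas `fillna` and using a hypothetical toy column in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin values; NaN marks a missing cabin
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# Replace every missing Cabin value with the placeholder "X"
toy["Cabin"] = toy["Cabin"].fillna("X")

remaining_missing = int(toy["Cabin"].isnull().sum())
print(remaining_missing)
```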
STATISTICS

The word statistics means different things to different people. For a layman, ‘Statistics’
means numerical information expressed in quantitative terms.

For college students, statistics are the list of grades in different courses, OGPA, CGPA, etc.
Each of these people is using the word statistics correctly, yet each uses it in a slightly
different way and for a somewhat different purpose.
TABLE 1: MARKS IN STATISTICS OF 250 CANDIDATES:


32 47 41 51 41 30 39 18 48 53
54 32 31 46 15 37 32 56 42 48
38 26 50 40 38 42 35 22 62 51
44 21 45 31 37 41 44 18 37 47
68 41 30 52 52 60 42 38 38 34
41 53 48 21 28 49 42 36 41 29
30 33 31 35 29 37 38 40 32 49
43 32 24 38 38 22 41 50 17 46
46 50 26 15 23 42 25 52 38 46
41 38 40 37 40 48 45 30 28 31
40 33 42 36 51 42 56 44 35 38
31 51 45 41 50 53 50 32 45 48
40 43 40 34 34 44 38 58 49 28
40 45 19 24 34 47 37 33 37 36
36 32 61 30 44 43 50 31 38 45
46 40 32 34 44 54 35 39 31 48
48 50 43 55 43 39 41 48 53 34
32 31 42 34 34 32 33 24 43 39
40 50 27 47 34 44 34 33 47 42
17 42 57 35 38 17 33 46 36 23
48 50 31 58 33 4* 26 29 36* 37
47 55 57 37 41 54 42 45 47 43
37 52 47 46 44 50 44 38 42 19
52 45 23 41 47 33 42 24 48 39
48 44 60 38 38 44 38 43 40 48
•STATISTICS: The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making
more effective decisions.

•As the definition suggests, the first step in investigating a problem is to collect relevant data. The data must then be
organized in some way and perhaps presented in a chart. Only after the data have been organized are we
able to analyze and interpret them. Here are some examples of the need for data collection:

•Research analysts for Merrill Lynch evaluate many facets of a particular stock before making a “buy” or “sell”
recommendation. They collect the past sales data of the company and estimate future earnings. Other factors,
such as the projected worldwide demand for the company’s products, the strength of the competition, and the
effect of the new union–management contract, are also considered before making a recommendation
• The marketing department at Colgate-Palmolive Co., a manufacturer of soap products,
has the responsibility of making recommendations regarding the potential profitability
of a newly developed group of face soaps having fruit smells, such as grape, orange,
and pineapple. Before making a final decision, the marketers will test it in several
markets. That is, they may advertise and sell it in Topeka, Kansas, and Tampa, Florida.
On the basis of test marketing in these two regions, Colgate-Palmolive will make a
decision whether to market the soaps in the entire country.
• Managers must make decisions about the quality of their product or service. For example,
customers call software companies for technical advice when they are not able to resolve an
issue regarding the software. One measure of the quality of customer service is the time a
customer must wait for a technical consultant to answer the call. A software company might
set a target of one minute as the typical response time. The company would then collect and
analyze data on the response time. Does the typical response time differ by day of the week
or time of day? If the response times are increasing, managers might decide to increase the
number of technical consultants at particular times of the day or week.
Definition of Statistics:
“The science of statistics is essentially a branch of applied mathematics and may be regarded
as mathematics applied to observational data”.
-R. A. Fisher

“Statistics is a science of estimates and probabilities” – Boddington.

Types of Statistics:
There are two major categories of statistics: descriptive statistics and inferential statistics.
• Descriptive statistics is the branch of statistics that involves the collection, organization,
summarization, and display of data.

• Inferential statistics is the branch of statistics that involves drawing conclusions about the
population using sample data.
• A basic tool in the study of inferential statistics is probability.
Nature of Statistics:
Statistics is Science as well as an Art:
Statistics as a Science: Statistics is classified as a science because of the characteristics
stated below:
• It is a systematic method of processing data to get the information
(knowledge) needed for decision making.
• Its methods and procedures are definite and well organized.
• It analyses the cause-and-effect relationship among variables.
• Its study is according to some rules and dynamism.
Statistics as an Art: Statistics is considered an art because it
provides methods, but choosing appropriate statistical methods for
data analysis and making wise decisions based on them is an art. Also, the
application of statistical methods requires skill and experience on the part of the
investigator.
Functions of statistics:
• To express facts and statements numerically or quantitatively.
• To condense/simplify complex facts.
• To serve as a technique for making comparisons.
• To establish the association and relationship between different groups.
• To estimate present facts and forecast the future.
• To test hypotheses.
• To formulate policies and measure their impact.
Aims of statistics: The aims and objectives of statistics are
• To study the population.
• To study the variation and its causes.
• To study the methods for reducing/summarizing data.
Scope/ Applications of Statistics:

• Statistics plays an important role in our daily life. It is useful in almost all
sciences, such as the social sciences, biology, psychology, education, economics,
business management, agricultural sciences, information technology,
etc.

• There is no upper limit for its applications

• The statistical methods can be and are being used by both educated
and uneducated people. In many instances we use sample data to make
inferences about the entire population.
• Statistics is used in administration by the Government for solving various
problems. Ex: price control, birth-death rate estimation, farming policies,
assessment of pay, and preparation of the budget.
• Statistics is indispensable in planning and in making decisions regarding
exports, imports, production, etc. Statistics serves as the foundation of the
superstructure of planning.
• Statistical methods are applied in market research to analyse the demand and
supply of manufactured products and to fix their prices.
• Bankers, stock exchange brokers, insurance companies, etc. make extensive
use of statistical data. Insurance companies make use of statistics of mortality,
life premium rates, etc.
• In medical sciences, statistical tools are widely used. Ex: to test the efficiency of a new drug or
medicine.

• They are used to study variable characteristics such as blood pressure (BP), pulse rate, Hb %, and the action of drugs on individuals; to
determine the association between diseases and different attributes, such as smoking and cancer; and to
compare different drugs or dosages on living beings under different conditions.

• Agricultural economists use forecasting procedures to estimate the demand and supply of food,
exports and imports, and production.

Limitations of Statistics:
• Statistics does not study qualitative phenomena, i.e. it studies only quantitative phenomena.
• Statistics does not study an individual or single observation; it deals only with an aggregate or group of
objects/individuals.
• Statistical laws are not exact laws; they are only approximations.
• Statistics is liable to be misused.
• Statistical conclusions are valid only on average, i.e. statistical results are not 100 per cent correct.
• Statistics does not reveal the entire information. Since statistics are collected for a particular purpose, such
data may not be relevant or useful in other situations or cases.
Types of data according to source: There are two types of data
• Primary data

• Secondary data.

Primary data:

The data collected by the investigator himself/herself for a specific purpose by actual
observation, measurement, or count is called primary data. Primary data are those which are
collected for the first time, primarily for a particular study. They are always in raw
form and original in character. Primary data are more reliable than secondary data.

These types of data need the application of statistical methods for the purpose of analysis and
interpretation.
Methods of collection of primary data

Primary data is collected in any one of the following methods


• Direct personal interviews.

• Indirect oral interviews

• Information from correspondents.

• Mailed questionnaire method.

• Schedules sent through enumerators.

• Telephonic Interviews, etc...


Secondary data: The data which are compiled from the records of others are called
secondary data. The data collected by an individual or his agents are primary data
for him and secondary data for all others. Secondary data are those which have
gone through statistical treatment.

When statistical methods are applied to primary data, they become
secondary data. They are in the shape of finished products. Secondary data
are less expensive, but they may not give all the necessary information.
Secondary data can be compiled either from published sources or unpublished
sources.
Sources of published data
• Official publications of the central, state and local governments.
• Reports of committees and commissions.
• Publications brought about by research workers and educational
associations.
• Trade and technical journals.
• Report and publications of trade associations, chambers of
commerce, bank etc.
• Official publications of foreign governments or international bodies
like U.N.O, UNESCO etc.
Primary data | Secondary data
The data collected by the investigator himself/herself for a specific purpose is called primary data. | The data which are compiled from the records of others is called secondary data.
Primary data are those data which are collected from the primary sources. | Secondary data are those data which are collected from the secondary sources.
Primary data are original because the investigator himself collects them. | Secondary data are not original, since the investigator makes use of other agencies.
If these data are collected accurately and systematically, their suitability will be very positive. | These might or might not suit the objects of enquiry.
The collection of primary data is more expensive because they are not readily available. | The collection of secondary data is comparatively less expensive because they are readily available.
It takes more time to collect the data. | It takes less time to collect the data.
There is no great need of precaution while using these data. | These should be used with great care and caution.
More reliable & accurate. | Less reliable & accurate.
Primary data are in the shape of raw material. | Secondary data are usually in the shape of readymade/finished products.
Possibility of personal prejudice. | Possibility of lesser degree of personal prejudice.
Grouped data: When the data vary over a wide range, the data values are sorted and
grouped into class intervals in order to reduce the number of scoring categories
to a manageable level.
Individual values of the original data are not retained. Ex: 0-10, 11-20, 21-30

Ungrouped data: Data values are not grouped into class intervals;
they are kept in their original form.
Ex: 2, 4, 12, 0, 3, 54, etc.
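To make the difference concrete, the ungrouped values from the example can be turned into grouped data. This is a minimal sketch using only the Python standard library. The helper `class_label` is a hypothetical name, and it assumes non-negative integer values and the class intervals 0-10, 11-20, 21-30, and so on, as in the example.

```python
from collections import Counter

def class_label(value, width=10):
    """Return the class interval (0-10, 11-20, 21-30, ...) containing a non-negative integer."""
    if value <= width:
        return f"0-{width}"
    lower = ((value - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

ungrouped = [2, 4, 12, 0, 3, 54]

# Grouped data: only the class frequencies are kept, not the individual values
grouped = Counter(class_label(v) for v in ungrouped)
print(dict(grouped))
```

Once grouped, the original values 2, 4, 0 and 3 can no longer be told apart; only the count of their shared class survives, which is the loss of detail the definition describes.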
• Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio

• In statistics, there are four data measurement scales: nominal, ordinal, interval and ratio. These are
simply ways to sub-categorize different types of data.
Nominal

• Let’s start with the easiest one to understand.  Nominal scales are used for labeling variables, without
any quantitative value.  “Nominal” scales could simply be called “labels.”  Here are some examples,
below.  Notice that all of these scales are mutually exclusive (no overlap) and none of them have any
numerical significance.  A good way to remember all of this is that “nominal” sounds a lot like “name”
and nominal scales are kind of like “names” or labels.

• What is your Gender?: Male/Female;

• What is Hair color?: brown, black, gray, other.

• Where do you live?: North India, South India, West Bengal etc
• A nominal scale with only two categories (e.g. male/female) is called “dichotomous.”
• Other sub-types of nominal data are “nominal with order” (like “cold, warm, hot, very
hot”) and nominal without order (like “male/female”).
Ordinal
• Ordinal scales are typically measures of non-numeric concepts like satisfaction,
happiness, discomfort, etc.
• “Ordinal” is easy to remember because it sounds like “order” and that’s the key to
remember with “ordinal scales”–it is the order that matters, but that’s all you really get
from these.
• Advanced note: The best way to determine central tendency on a set of ordinal data is to
use the mode or median; a purist will tell you that the mean cannot be defined from an
ordinal set.
•Interval
•Interval scales are numeric scales in which we know both the order and the exact differences
between the values.  The classic example of an interval scale is Celsius temperature because the
difference between each value is the same.  For example, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.
•Ratio
•Ratio scales are the ultimate nirvana when it comes to data measurement scales because they tell
us about the order, they tell us the exact value between units, AND they also have an absolute
zero–which allows for a wide range of both descriptive and inferential statistics to be applied.  At
the risk of repeating myself, everything above about interval data applies to ratio scales, plus ratio
scales have a clear definition of zero.  Good examples of ratio variables include height, weight, and
duration.
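The four scales differ in which summary measures make sense. The sketch below illustrates this with the standard `statistics` module: the mode for nominal data, the median for ordinal data, the mean for interval data, and a ratio for ratio data. All data values are hypothetical and chosen only for illustration.

```python
import statistics

# Nominal: labels only, so the mode is the usable measure of central tendency
hair_colour = ["brown", "black", "black", "gray", "black"]
nominal_mode = statistics.mode(hair_colour)

# Ordinal: order matters, so rank the categories and take the median rank
order = ["low", "medium", "high"]
satisfaction = ["low", "medium", "medium", "high", "high", "high", "low"]
ranks = [order.index(answer) for answer in satisfaction]
ordinal_median = order[statistics.median_low(ranks)]

# Interval: differences are meaningful, so the mean can be computed
celsius = [50, 60, 70, 80]
interval_mean = statistics.mean(celsius)

# Ratio: a true zero exists, so "twice as heavy" is a meaningful statement
weights_kg = [40.0, 80.0]
weight_ratio = weights_kg[1] / weights_kg[0]

print(nominal_mode, ordinal_median, interval_mean, weight_ratio)
```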
Variable:
A variable is a quantitative or qualitative characteristic
that varies from observation to observation in the same group, so that
measuring it can yield more than one value.
Ex: Daily temperature, Yield of a crop, Nitrogen in soil, height, Color, sex.
Observations (Variate):
The specific numerical values assigned to the variables are called
observations.
Ex: yield of a crop is 30 kg.
Types of Variables
• Quantitative Variable (Data): Continuous Variable (Data), Discrete Variable (Data)
• Qualitative Variable (Data)

Quantitative Variable & Qualitative Variable
Quantitative Variable:

A quantitative variable is a variable which is normally expressed numerically
because it differs in degree rather than kind among elementary units.
Ex: Plant height, plant weight, length, number of seeds per pod, leaf dry weight,
etc.
Qualitative Variable:
A variable that is normally not expressed numerically because it differs in
kind rather than degree among elementary units. The term is more or less
synonymous with categorical variable.
Some examples are :
1. Hair color, religion, political affiliation, nationality, and social class.
2. Intelligence, beauty, taste, flavor, fragrance, skin colour, honesty, hard
work, etc.
Attributes:
Qualitative variables are termed attributes. They are qualitatively
distinct characteristics such as healthy or diseased, positive or
negative. The term is often applied to designate characteristics that are
not easily expressed in numerical terms.

• A continuous variable is a variable which can assume any
value (integers as well as fractions) in a given range. A continuous
variable has an uncountably infinite number of
possible values within a range.

• If the data are measured on a continuous variable, then the data
obtained are continuous data.
• Ex: Height of a plant, weight of a seed, rainfall, temperature,
humidity, marks of students, income of an individual, etc.
Discrete (Discontinuous) variable and discrete data:
• A discrete variable is a variable which assumes only some specified values, i.e. only whole
numbers (integers), in a given range. A discrete variable can assume
only a finite or, at most, countable number of possible values.
• The number of children will be 2 children or 3 children, but not 2.37
children, so “number of children” is a discrete variable.
• If the data are measured on a discrete variable, then the data obtained are
discrete data.
• Ex: Number of leaves on a plant, number of seeds in a pod, number of
students, number of insects or pests, etc.
Population:
The aggregate or totality of all possible objects possessing specified
characteristics which is under investigation is called a population. A population
consists of all the items or individuals about which you want to reach
conclusions.
A population is a collection or well-defined set of individuals/objects/items that
describes some phenomenon of interest in the underlying study.
Ex:
• Total number of students studying in a school or college,
• Total number of books in a library,
• Total number of houses in a village or town.
In statistics, the data set that is the target group of our interest is called a
population. Notice that a statistical population does not refer to people as in our
everyday usage of the term; it refers to a collection of data.

Parameter:
• A parameter is a numerical constant which is measured to describe a
characteristic of a population. OR
• A parameter is a numerical description of a population characteristic.
• Generally, parameters are unknown constants; they are estimated from
sample data.
Ex:
Population mean (denoted as µ), population standard deviation (σ), population ratio,
population percentage, population correlation coefficient (ρ), etc.
Sample:
A small portion selected from the population under consideration or fraction of the
population is known as sample.

Statistic:
A statistic is a numerical quantity that is measured to describe a characteristic of a sample.

OR
A statistic is a numerical description of a sample characteristic.
Ex:
Sample mean, sample standard deviation (s), sample ratio, sample proportion, etc.
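The distinction between a parameter and a statistic can be sketched in a few lines. The population here is hypothetical (250 randomly generated marks, not the values of Table 1); the names `mu`, `sigma`, `x_bar` and `s` follow the usual notation for the population and sample quantities.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: marks of 250 candidates
population = [random.randint(15, 68) for _ in range(250)]

# Parameters describe the whole population (usually unknown in practice)
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics describe a sample drawn from that population
sample = random.sample(population, 25)  # simple random sample, without replacement
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # sample standard deviation (n - 1 divisor)

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s = {s:.2f}")
```

Drawing a different sample changes `x_bar` and `s` but leaves `mu` and `sigma` unchanged, which is why statistics are used to estimate parameters.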
Nature of data: Different types of data can be collected for different
purposes. The data can be collected in connection with time, with geographical location, or
with both time and location. The following are the three types of data:

1. Time series data 2. Spatial data 3. Spatio-temporal data

Time series data: A collection of numerical values collected and arranged
over a sequence of time periods. The data might have been collected at regular
intervals of time.
Ex: Year-wise rainfall in Karnataka, prices of milk over different months.
Spatial Data:
If the data collected are connected with a place, they are termed spatial data.
Ex: District-wise rainfall in Karnataka, prices of milk in four metropolitan cities.

Spatio-Temporal Data:
If the data collected are connected to time as well as place, they are known as spatio-temporal data.

Ex:
Both year-wise and district-wise rainfall in Karnataka, monthly prices of milk over different
cities.
CLASSIFICATION AND TABULATION
Introduction
Raw or ungrouped data are always in an unorganized form and need to be organized and presented in a meaningful
and readily comprehensible form in order to facilitate further statistical analysis. Therefore, it is essential for an
investigator to condense a mass of data into a more comprehensible and digestible form.
Definition:
Classification is the process by which individual items of data are arranged in different groups or classes according to
common characteristics, resemblance, or similarity possessed by the individual items of the variable under study.
Ex: 1) Letters in the post office are classified according to their destinations, viz. Delhi, Chennai, Bangalore,
Mumbai, etc.
2) The human population can be divided into two groups of males and females, or into two groups of educated and
uneducated persons.
3) Plants can be arranged according to their different heights.
Objectives /Advantages/ Role of Classification:

• It condenses the mass/bulk data in an easily understandable form.


• It eliminates unnecessary details.
• It gives an orderly arrangement of the items of the data.
• It facilitates comparison and highlights the significant aspect of data.
• It enables one to get a mental picture of the information and helps in
drawing inferences.
• It helps in the tabulation and statistical analysis.
Types of classification:
Statistical data are classified in respect of their characteristics. Broadly
there are four basic types of classification namely
• Chronological classification or Temporal
• Geographical classification (or) Spatial Classification
• Qualitative classification
• Quantitative classification
(1). Chronological classification:
In chronological classification, the collected data are arranged according to the
order of time, expressed in days, weeks, months, years, etc. The data are
generally classified in ascending order of time.
Ex: Data related to daily temperature records, monthly prices of vegetables,
exports and imports of India for different years.

Total Food grain production of India for different time periods.

Year Production (million tonnes)


2005-06 208.60
2006-07 217.28
2007-08 230.78
2008-09 234.47
(2). Geographical classification:
In this type of classification, the data are classified according to geographical
region or geographical location (area) such as District, State, Countries, City-
Village, Urban-Rural, etc...
Ex: The production of paddy in different states in India, production of wheat in
different countries etc...
State-wise classification of production of food grains in India:
State Production (in tonnes)
Orissa 3,00,000
A.P 2,50,000
U.P 22,00,000
Assam 10,000
(3). Qualitative classification:
In this type of classification, data are classified on the basis of attributes or quality
characteristics like sex, literacy, religion, employment, social status, nationality,
occupation, etc.

Ex: If the population is to be classified with respect to one attribute, say sex, then we can
classify it into males and females. Similarly, it can also be classified into
‘employed’ or ‘unemployed’ on the basis of another attribute, ‘employment’.
Qualitative classification can be of two types as follows
(i) Simple classification (ii) Manifold classification
i) Simple classification or Dichotomous Classification:
When the classification is done with respect to only one attribute, it
is called simple classification. If the attribute is dichotomous (two
outcomes) in nature, two classes are formed. This type of classification is
called dichotomous classification.
Ex: A population can be divided into two classes according to sex (male and
female) or income (poor and rich).
ii) Manifold classification:
The classification where two or more attributes are considered and
several classes are formed is called a manifold classification.
Ex: If we classify a population simultaneously with respect to two attributes,
Sex and Education, then the population is first classified into ‘males’ and
‘females’. Each of these classes may then be further classified into
‘educated’ and ‘uneducated’.
Still the classification may be further extended by considering other
attributes like income status etc.
Quantitative classification:

In quantitative classification the data are classified according to quantitative
characteristics that can be measured numerically such as height, weight,
production, income, marks secured by the students, age, land holding etc.
Ex: Students of a college may be classified according to their height as given in
the table.
Height(in cm) No of students
100-125 20
125-150 25
150-175 40
175-200 15
TABULATION
Meaning & Definition:
A table is a systematic arrangement of data in columns and rows.

Tabulation may be defined as the systematic arrangement of classified
numerical data in rows and/or columns according to certain
characteristics. It expresses the data in a concise and attractive form which
can be easily understood and used to compare numerical figures, and an
investigator is quickly able to locate the desired information and chief
characteristics.
Objectives /Advantages/ Role of Tabulation:
• It simplifies complex data to enable us to understand easily.
• It facilitates comparison of related facts.
• It facilitates computation of various statistical measures like averages,
dispersion, correlation etc...
• It presents facts in minimum possible space, and unnecessary
repetitions & explanations are avoided. Moreover, the needed
information can be easily located.
• Tabulated data are good for references, and they make it easier to
present the information in the form of graphs and diagrams.
Disadvantage of Tabulation:
• The arrangement of data in rows and columns becomes difficult if the
person does not have the required knowledge.
• A table lacks description of the nature of the data, and not every kind of data can be put
in a table.
• Table figures/data can be misinterpreted.
Ideal Characteristics/ Requirements of a Good Table:
• A good statistical table is such that it summarizes the total information in an easily
accessible form in minimum possible space.
• A table should be formed in keeping with the objects of statistical enquiry.
• A table should be easily understandable and self-explanatory in nature.
• A table should be formed so as to suit the size of the paper.
• If the figures in the table are large, they should be suitably rounded or approximated.
The units of measurements too should be specified.
• The arrangements of rows and columns should be in a logical and systematic order.
This arrangement may be alphabetical, chronological or according to size.
• The rows and columns are separated by single, double or thick lines to
represent various classes and sub-classes used.
• The averages or totals of different rows should be given at the right of
the table and that of columns at the bottom of the table. Totals for
every sub-class too should be mentioned.
• Necessary footnotes and source notes should be given at the bottom of
table
• In case it is not possible to accommodate all the information in a single
table, it is better to have two or more related tables.
Parts or components of a good Table:
The making of a compact table is itself an art. It should contain all the
information needed within the smallest possible space.

An ideal statistical table should consist of the following main parts:
1. Table number
2. Title of the table
3. Head notes
4. Captions or column headings
5. Stubs or row designation
6. Body of the table
7. Footnotes
8. Sources of data
[Figure: layout of a statistical table, showing the stub heading, the caption with its sub-heads and column heads, the stub entries, the body, the row totals, the column totals and the grand total]
Manifold (Multi-way) table:
[Table layout: the rows are the States (UP, MP), each split by Status (Rich, Poor, Sub-total); the columns give the Population for Male (Educated, Uneducated, Sub-total), Female (Educated, Uneducated, Sub-total) and Total (Educated, Uneducated, Total). The cells are left blank.]
Types of Tabulation:

Tables may be broadly classified into three categories.

• On the basis of number of characteristics used/Construction:
1) Simple tables 2) Complex tables
• On the basis of object/purpose:
1) General purpose/Reference tables 2) Special purpose/Summary tables
• On the basis of originality:
1) Primary or original tables 2) Derived tables

I On the basis of number of characteristics used/Construction:

II On the basis of object/purpose:
• General tables: General purpose tables are sometimes termed reference tables or
information tables. These tables provide information for general use or reference.
They usually contain detailed information and are not constructed for a specific
discussion. These tables are also termed master tables.
Ex: The detailed tables prepared in census reports belong to this class.
• Special purpose tables: Special purpose tables, also known as summary tables,
provide information for a particular discussion. These tables are constructed or derived
from the general purpose tables. They are useful for analytical and
comparative studies involving the study of relationships among variables.
Ex: Analytical statistics like ratios, percentages, index numbers, etc. are
incorporated in these tables.
III On the basis of originality: According to the nature of originality of the data

• Primary or original tables: These tables contain statistical facts in their original
form. Figures in these types of tables are not rounded, but are original, actual and
absolute in nature.
Ex: Time series data recorded on rainfall, food grain production etc.
• Derived tables: These tables contain totals, ratios, percentages, etc. derived from
original tables. They express the information derived from original tables.
Ex: Trend values, seasonal values, cyclical variation data.
FREQUENCY DISTRIBUTIONS
Introduction:
Frequency is the number of times a given value of an observation or character
or a particular type of event has appeared/repeated/occurred in the data set.

Frequency distribution is simply a table in which the data are grouped into
different classes on the basis of common characteristics and the numbers of
cases which fall in each class are counted and recorded. That table shows the
frequency of occurrence of different value of an observation or character of a
single variable.
Types of frequency distribution:
1. Simple frequency distribution:
• Raw Series/individual series/ungrouped data: Raw data have not been
manipulated or treated in any way beyond their original measurement. As such,
they will not be arranged or organized in any meaningful manner. Series of
individual observations is a simple listing of items of each observation. If marks
of 10 students in statistics of a class are given individually, it will form a series of
individual observations. In raw series, each observation has frequency of one.
Ex: Marks of Students: 55, 73, 60, 41, 60, 61, 75, 73, 58, 80.
• Discrete frequency distribution: In a discrete series, the data are presented in
such a way that exact measurements of units are indicated. There is a definite
difference between the values of different groups of items. Each class is
distinct and separate from the other classes, so there is discontinuity from one class to
another. In a discrete frequency distribution, we count the number
of times each value of the variable occurs in the data. This is facilitated through the
technique of tally bars. Ex: The number of children in 15 families is given by 1, 5,
2, 4, 3, 2, 3, 1, 1, 0, 2, 2, 3, 4, 2.
Children (No.s) (x)   Tally    Frequency (f)
0                     |        1
1                     |||      3
2                     |||||    5
3                     |||      3
4                     ||       2
5                     |        1
Total                          15
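The tally-bar count above can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and `collections.Counter` from the standard library stands in for the manual tally.

```python
from collections import Counter

# Number of children in the 15 families from the example above.
children = [1, 5, 2, 4, 3, 2, 3, 1, 1, 0, 2, 2, 3, 4, 2]

# Counter records how many times each distinct value occurs.
frequency = Counter(children)

print("x  f")
for value in sorted(frequency):
    print(f"{value}  {frequency[value]}")
print("Total", sum(frequency.values()))
```

Each distinct value of x becomes one row of the discrete frequency distribution, and the frequencies add up to the number of families.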
Continuous (grouped) frequency distribution:
When the range of the data is too large, or the data are measured on a continuous
variable which can take any fractional value, the data must be condensed by putting
them into smaller groups or classes called “Class-Intervals”. The number of items
which fall in a class-interval is called its “Class frequency”. The presentation of
the data in continuous classes with the corresponding frequencies is known as a
continuous/grouped frequency distribution.
Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93,
56, 74.
Class-Interval (C.I.)   Tally    Frequency (f)
0-25                    ||       2
25-50                   |||      3
50-75                   |||||    5
75-100                  |||||    5
Total                            15
Types of continuous class intervals: There are three methods
of forming class intervals, namely
i) Exclusive method (Class-Intervals)
ii) Inclusive method (Class-Intervals)
iii) Open-end classes
Exclusive method: In an exclusive method, the class intervals are fixed in such a
way that upper limit of one class becomes the lower limit of the next immediate
class. Moreover, an item equal to the upper limit of a class would be excluded
from that class and included in the next class.
Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87,
93, 56, 74.
Class-Interval (C.I.)   Tally    Frequency (f)
0-25                    ||       2
25-50                   |||      3
50-75                   |||||    5
75-100                  |||||    5
Total                            15
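The exclusive rule can be sketched in Python as follows. The function name and the half-open test `lower <= x < upper` are illustrative choices, not part of the original notes; they encode the rule that an item equal to the upper limit moves to the next class.

```python
# Marks scored by the 15 students in the example.
marks = [55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93, 56, 74]

# Exclusive classes: the upper limit of one class is the lower limit of the next.
classes = [(0, 25), (25, 50), (50, 75), (75, 100)]

def exclusive_frequencies(values, intervals):
    """Count values with lower <= x < upper for each (lower, upper) class."""
    return [sum(1 for x in values if lower <= x < upper)
            for lower, upper in intervals]

exclusive_counts = exclusive_frequencies(marks, classes)
for (lower, upper), f in zip(classes, exclusive_counts):
    print(f"{lower}-{upper}: {f}")
```

Under this rule the mark 75 is counted in the class 75-100, not in 50-75. A value exactly equal to the last upper limit (100) would need separate handling in this sketch.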
Inclusive method: In this method, the observations which are equal to the upper as
well as the lower limit of a class are included in that particular class. It should be
clear that the upper limit of one class and the lower limit of the immediate next class are
different.
Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93,
56, 74.
Class-Interval (C.I.)   Tally     Frequency (f)
0-25                    ||        2
26-50                   |||       3
51-75                   ||||||    6
76-100                  ||||      4
Total                             15
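For comparison, the inclusive rule can be sketched the same way. Again the function name is illustrative; the closed test `lower <= x <= upper` encodes the rule that both limits belong to the class.

```python
# Marks scored by the 15 students in the example.
marks = [55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93, 56, 74]

# Inclusive classes: both limits belong to the class, so limits do not repeat.
inclusive_classes = [(0, 25), (26, 50), (51, 75), (76, 100)]

def inclusive_frequencies(values, intervals):
    """Count values with lower <= x <= upper for each (lower, upper) class."""
    return [sum(1 for x in values if lower <= x <= upper)
            for lower, upper in intervals]

inclusive_counts = inclusive_frequencies(marks, inclusive_classes)
for (lower, upper), f in zip(inclusive_classes, inclusive_counts):
    print(f"{lower}-{upper}: {f}")
```

Here the mark 75 stays in the class 51-75, which is the only difference from the exclusive grouping of the same marks.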
Open-End classes: In this type of class interval, the lower limit of the first class
interval or the upper limit of the last class interval or both are not specified or
not given. The necessity of open end classes arises in a number of practical
situations, particularly relating to economic, agriculture and medical data when
there are few very high values or few very low values which are far apart from
the majority of observations.
The lower limit of first class can be obtained by subtracting magnitude of next
class from the upper limit of the open class. The upper limit of last class can be
obtained by adding magnitude of previous class to the lower limit of the open
class.
< 20      Below 20     Less than 20    0-20
20-40     20-40        20-40           20-40
40-60     40-60        40-60           40-60
60-80     60-80        60-80           60-80
> 80      80 – over    80 and Above    80-100
Difference between Exclusive and Inclusive Class-Intervals:
Exclusive Method:
• An observation equal to the upper limit of a class is excluded from that class and is included in the
immediate next class.
• The upper limit of one class and the lower limit of the immediate next class are the same.
• There is no gap between the upper limit of one class and the lower limit of the next class.
• This method is useful for both integer and fractional variables like age, height, weight etc.
• There is no need to convert it to the inclusive method prior to calculation.
Inclusive Method:
• An observation equal to either the upper or the lower limit of a class is counted (included) in the
same class.
• The upper limit of one class and the lower limit of the immediate next class are different.
• There is a gap between the upper limit of one class and the lower limit of the next class.
• This method is useful where the variable takes only integral values like members in a family,
number of workers in a factory etc. It cannot be used with fractional values like age, height, weight
etc.
• For simplification in calculation it is necessary to change it to the exclusive method.
Relative frequency distribution:
It is the fraction or proportion of the total number of items belonging to each class.

Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93,
56, 74.
Class-Interval (C.I.)   Tally    Frequency (f)   Relative Frequency
0-25                    ||       2               2/15 = 0.1333
25-50                   |||      3               3/15 = 0.2000
50-75                   |||||    5               5/15 = 0.3333
75-100                  |||||    5               5/15 = 0.3333
Total                            15              15/15 = 1.0000
Percentage frequency distribution:
The percentage frequency is calculated by multiplying the relative frequency by 100.
In a percentage frequency distribution, we convert the actual frequencies
into percentages.

• Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87,
93, 56, 74.
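The relative and percentage frequencies for these marks can be checked with a short Python sketch. It assumes the exclusive classes 0-25, 25-50, 50-75 and 75-100 used in the earlier examples; the variable names are illustrative only.

```python
# Marks scored by the 15 students in the example.
marks = [55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93, 56, 74]

# Assumed exclusive classes: lower limit included, upper limit excluded.
classes = [(0, 25), (25, 50), (50, 75), (75, 100)]

counts = [sum(1 for x in marks if lo <= x < hi) for lo, hi in classes]
total = sum(counts)

relative = [f / total for f in counts]      # proportion of items in each class
percentage = [100 * r for r in relative]    # relative frequency multiplied by 100

for (lo, hi), f, r, p in zip(classes, counts, relative, percentage):
    print(f"{lo}-{hi}: f={f}  relative={r:.4f}  percentage={p:.2f}%")
```

The relative frequencies add up to 1 and the percentage frequencies add up to 100, which is a quick check on any such table.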
Cumulative Frequency distribution:
Cumulative frequency distribution is a running total of the frequency values. It is
constructed by adding the frequency of the first class interval to the frequency
of the second class interval, then adding that total to the frequency in the third
class interval, and continuing until the final total appears opposite the last
class interval, which will be the total frequency. Cumulative frequency is used
to determine the number of observations that lie above (or below) a particular
value in a data set.
Cumulative percentage frequency distribution:
Instead of cumulative frequencies, if we give cumulative percentages, the
distribution is called a cumulative percentage frequency distribution. We can
form this table either by converting the frequencies into percentages and then
cumulating them, or by converting the given cumulative frequencies into percentages.
Ex: Marks scored by 15 students: 55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93,
56, 74.
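A running total of this kind is easy to sketch in Python. The example below assumes the exclusive classes 0-25, 25-50, 50-75 and 75-100 from the earlier examples and uses `itertools.accumulate` from the standard library; all names are illustrative.

```python
from itertools import accumulate

# Marks scored by the 15 students in the example.
marks = [55, 82, 45, 18, 29, 42, 62, 72, 83, 15, 75, 87, 93, 56, 74]

# Assumed exclusive classes: lower limit included, upper limit excluded.
classes = [(0, 25), (25, 50), (50, 75), (75, 100)]
frequencies = [sum(1 for x in marks if lo <= x < hi) for lo, hi in classes]

# Running total of the class frequencies ("less than" cumulative frequency).
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Second route described above: convert cumulative frequencies to percentages.
cumulative_percentage = [100 * c / total for c in cumulative]

for (lo, hi), f, c, p in zip(classes, frequencies, cumulative, cumulative_percentage):
    print(f"{lo}-{hi}: f={f}  cumulative={c}  cumulative %={p:.2f}")
```

The last cumulative frequency equals the total number of observations, and the last cumulative percentage is 100.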
DIAGRAMMATIC REPRESENTATION
Introduction:
One of the most convincing and appealing ways in which statistical results may be
presented is through diagrams and graphs. Just one diagram can represent
given data more effectively than a thousand words. Moreover, even a layman who
has nothing to do with numbers can understand diagrams. Evidence of this
can be found in newspapers, magazines, journals, advertisements, etc.
A diagram is a representation of data in a visual form, which highlights their
basic facts and relationships. The types of diagrams are:
1. Line diagram
2. Simple bar diagram
3. Sub-divided bar diagram
4. Percentage bar diagram
5. Multiple bar diagram
1. Line diagram:
A line diagram is used where there are many items to be shown and there is
not much difference in their values. Such a diagram is prepared by drawing a
vertical line for each item according to the scale.
• The distance between lines is kept uniform.
• Line diagram makes comparison easy, but it is less attractive.
2. Simple Bar Diagram:
It is the simplest among the bar diagram and is generally used for comparison of
two or more items of single variable or a simple classification of data.
For example data related to export, import, population, production, profit, sale,
etc... for different time periods or region.

• Simple bars can be drawn vertically or horizontally, with equal width.
• The heights of bars are proportional to the volume or magnitude of the
characteristics.
• All bars stand on the same base line.
• The bars are separated from each other by equal interval.
• To make the diagram attractive, the bars can be coloured.
Ex: Population in different states

Population (million)
Year   UP      AP      MH
1951   63.22   31.25   29.98

Fig 2: Simple bar diagram showing population in different states
3. Sub-divided bar diagram:
If we have multi character data for different attributes, we use subdivided or
component bar diagram. In a sub-divided bar diagram, the bar is sub-divided into
various parts in proportion to the values given in the data and the whole bar
represent the total. Such diagram shows total as well as various components of
total. Such diagrams are also called component bar diagrams.

• Here, instead of placing the bars for each component side by side we may place
these one on top of the other.
• The sub divisions are distinguished by different colours or crossings or dotting.
• An index or key showing the various components represented by colors, shades,
dots, crossing, etc... should be given.
4. Percentage bar diagram or Percentage sub-divided bar diagram:
This is another form of component bar diagram. Sometimes the volumes or
values of the different attributes differ greatly. In such cases the sub-divided
bar diagram can’t be used for making meaningful comparisons, so the
components of the attributes are reduced to percentages.
Here the components are not the actual values but are converted into percentages
of the whole. The main difference between the sub-divided bar diagram and the
percentage bar diagram is that in the former the bars are of
different heights since their totals may be different, whereas in the latter
the bars are of equal height since each bar represents 100
percent. In the case of data having sub-divisions, the percentage bar diagram will be
more appealing than the sub-divided bar diagram.
• Different components are converted to percentages using the following formula:
Percentage = (Actual value of the component / Total of all components) × 100
5. Multiple or Compound bar diagram:
This type of diagram is used to facilitate the comparison of two or more sets of
interrelated phenomenon over a number of years or regions.
• Multiple bar diagram is simply the extension of simple bar diagram.
• Bars are constructed side by side to represent the set of values for comparison.
• The different bars for period or related phenomenon are placed together.
• After providing some space, another set of bars for next time period or
phenomenon are drawn.
• In order to distinguish bars, different colour or crossings or dotting, etc... may be
used
• Same type of marking or colouring should be done under each attribute.
• An index or foot note has to be prepared to identify the meaning of different
colours or dotting or crossing.
• Ex: Population under different states. (Double bar diagram)
Population (million)
Year   UP      AP      MH
1961   73.75   35.98   33.65
1951   63.22   31.25   29.98

Fig 4: Multiple bar diagram indicating population of different states over the years
Pie-Diagram or Angular Diagram:
Pie diagrams are very popular diagrams used to represent both the
total magnitude and its different components or sectors. The circle
represents the total magnitude of the variable. The various components
of the total are represented proportionately by the various segments.
Addition of these segments gives the complete circle. Such a
component circular diagram is known as a Pie or Angular diagram.

While making comparisons, pie diagrams should be used on a
percentage basis and not on an absolute basis.
Procedure for Construction of Pie Diagram
1)Convert each component of the total into the corresponding angle in degrees. The degree
(angle) of any component can be calculated by the following formula:
Angle = (Component value / Total value) × 360°
Angles are taken to the nearest integral values.
2)Using a compass draw a circle of any convenient radius. (Convenient in the
sense that it looks neither too small nor too big on the paper.)
3)Using a protractor divide the circle in to sectors whose angles have been
calculated in step-1. Sectors are to be in the order of the given items.
4)Various component parts represented by different sector can be distinguished
by using different shades, designs or colours.
5)These sectors can be distinguished by their labels, either inside (if possible) or
just outside the circle with proper identification.
CROPS       AREA (ha)   Angle (degrees)
Cereals     3940        214°
Oil seeds   1165        63°
Pulses      464         25°
Cotton      249         13°
Others      822         45°
Total       6640        360°
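Step 1 of the procedure can be sketched in Python using the crop areas from the table. The sketch assumes each angle is the component's share of the total multiplied by 360 degrees, and it prints the unrounded angles; rounding to whole degrees is left to the reader, since a value such as 13.5 can be rounded either way so that the angles still total 360.

```python
# Crop areas (ha) from the table above.
areas = {"Cereals": 3940, "Oil seeds": 1165, "Pulses": 464,
         "Cotton": 249, "Others": 822}

total_area = sum(areas.values())

# Angle of each sector = (component / total) * 360 degrees.
angles = {crop: area / total_area * 360 for crop, area in areas.items()}

for crop, angle in angles.items():
    print(f"{crop:10s} {areas[crop]:5d} ha  {angle:7.2f} degrees")
print("Sum of angles:", round(sum(angles.values()), 6))
```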


Pictogram and Cartogram:
Pictogram:
The technique of presenting the data through picture is called as pictogram. In
this method the magnitude of the particular phenomenon, being studied, is
drawn. The sizes of the pictures are kept proportional to the values of different
magnitude to be presented.
Cartogram:
In this technique, statistical facts are presented through maps accompanied by
various types of diagrammatic presentation. They are generally used to present
the facts according to geographical regions. Population and its other constituents
like birth, death, growth, density, production, imports, exports, and several other
facts can be presented on maps with certain colours, dots, crosses, points etc.
Advantage/Significance of diagrams:
• They are attractive and impressive.
• They make data simple and understandable.
• They make comparison possible.
• They save time and labour.
• They have universal utility.
• They give more information.
• They have a great memorizing effect.
Demerits (or) limitations:
• Diagrams give only an approximate presentation of quantity.
• Minute differences in values cannot be represented properly in
diagrams.
• Large differences in values spoil the look of the diagram, and it is
impossible to show a wide gap.
• Some of the diagrams can be drawn by experts only, e.g. pie chart.
• Different scales portray different pictures to laymen.
• Similar characters are required for comparison.
• They are of no utility to an expert for further statistical analysis.
GRAPHICAL REPRESENTATION OF DATA
A graph is a visual form of presentation of statistical data, which shows the
relationship between two or more sets of figures. A graph is more attractive
than a table of figure. Even a common man can understand the message of
data from the graph. Comparisons can be made between two or more
phenomena very easily with the help of a graph.

From the statistical point of view, graphic presentation of data is more
appropriate and accurate than the diagrammatic representation of the data.
Diagrams are limited to visual presentation of categorical and geographical
data and fail to present effectively the data relating to time series and
frequency distributions. In such cases, graphs prove to be very useful.
Histogram:
The histogram is the most popular and widely used graph for presentation of frequency
distributions. In a histogram, data are plotted as a series of rectangles or bars. The
height of each rectangle or bar represents the frequency of the class interval
and its width represents the size of the class interval. The area covered by the
histogram is proportional to the total frequency represented. Each rectangle is
formed adjacent to the other so as to give a continuous picture. The histogram is also
called a staircase or block diagram. There are as many rectangles as there are classes.
Class intervals are shown on the X-axis and the frequencies on the Y-axis.
Ex: Systolic Blood Pressure (BP) in mmHg of people
Systolic BP   No. of persons
100-109       7
110-119       16
120-129       19
130-139       31
140-149       41
150-159       23
160-169       10
170-179       3
Fig 7.3: Systolic Blood Pressure (BP) in mmHg of people
Frequency Polygon:
A frequency polygon is another way of graphical presentation of a frequency
distribution; it can be drawn with the help of a histogram or mid-points.
If we mark the midpoints of the top horizontal sides of the rectangles in a
histogram and join them by straight lines using a scale, the figure so formed is
called a frequency polygon (using histogram). This is done under the assumption
that the frequencies in a class interval are evenly distributed throughout the class.

Alternatively, the frequencies of the classes are plotted as dots against the mid-points of the
class intervals. The adjacent dots are then joined by straight lines using a scale.
The resulting graph is known as a frequency polygon (using mid-points or without
histogram).
The area of the polygon is equal to the area of the histogram, because the area left
outside is just equal to the area included in it.
[Figure: Frequency polygon]

Comparison of Histogram and Frequency Polygon:
Histogram:
• A histogram is two-dimensional.
• A histogram is a bar graph.
• Only one histogram can be plotted on the same axes.
• A histogram is drawn only for a continuous frequency distribution.
Frequency Polygon:
• A frequency polygon is multi-dimensional.
• A frequency polygon is a line graph.
• Several frequency polygons can be plotted on the same axes.
• A frequency polygon can be drawn for both discrete and continuous frequency distributions.
Frequency Curve:
Similar to frequency polygon, frequency curve can be drawn with the
help of histogram or mid-points. Frequency curve is obtained by joining
the mid-points of the tops of the rectangles in a histogram by smooth
hand curve or free hand curve (Using Histogram).

The frequencies of the classes are pointed by dots against the mid-
points of each class. The adjacent dots are then joined by smooth hand
curve or free hand curve. The resulting graph is known as frequency
curve (Using mid-points or without histogram).
Frequency Curve
Ogives or Cumulative Frequency Curve:
For a set of observations, we know how to construct a frequency
distribution. In some cases we may require the number of
observations less than a given value or more than a given value. This is
obtained by accumulating (adding) the frequencies up to (or above)
the given value. This accumulated frequency is called the cumulative
frequency. A table listing these cumulative frequencies is
called a cumulative frequency table. The curve obtained by plotting
cumulative frequencies is called a cumulative frequency curve or an
ogive.
There are two methods of constructing ogive namely:
i) The ‘less than ogive’ method.
ii) The ‘more than ogive’ method.
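The cumulative frequencies behind the two ogives can be sketched in Python. The example reuses the systolic blood pressure frequencies from the histogram example; the variable names are illustrative, and only the cumulative tables are computed, not the plotted curves.

```python
from itertools import accumulate

# Class intervals and frequencies from the systolic BP example.
classes = ["100-109", "110-119", "120-129", "130-139",
           "140-149", "150-159", "160-169", "170-179"]
frequencies = [7, 16, 19, 31, 41, 23, 10, 3]

# 'Less than' ogive: accumulate the frequencies from the first class downwards.
less_than = list(accumulate(frequencies))

# 'More than' ogive: accumulate from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print("Class     f   less-than  more-than")
for c, f, lt, mt in zip(classes, frequencies, less_than, more_than):
    print(f"{c}  {f:3d}  {lt:9d}  {mt:9d}")
```

The 'less than' values are plotted against the upper limits of the classes and the 'more than' values against the lower limits.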
Line Graph (one variable)
Line Graph (two variables)
MEASURES OF CENTRAL TENDENCY
Definition:
“A measure of central tendency is a typical value around which other figures
congregate.”
Objective and function of Central Tendency
• To provide a single value that represents and describes the
characteristic of entire group.
• To facilitate comparison between and within groups.
• To draw a conclusion about population from sample data.
• To form a basis for statistical analysis.
Essential characteristics/Properties/Pre-requisite for a good or an ideal
Average:
• It should be easy to understand and simple to compute.
• It should be rigidly defined.
• Its calculation should be based on all the
items/observations in the data set.
• It should be capable of further algebraic treatment
(mathematical manipulation).
• It should be least affected by sampling fluctuation.
• It should not be much affected by extreme values.
• It should be helpful in further statistical analysis.
Types of Average

Mathematical Average:
1. Arithmetic Mean or Mean
   i) Simple Arithmetic Mean
   ii) Weighted Arithmetic Mean
   iii) Combined Mean
2. Geometric Mean
3. Harmonic Mean
Positional Average:
1. Median
2. Mode
3. Quantiles
   i) Quartiles
   ii) Deciles
   iii) Percentiles
Commercial Average:
1. Moving Average
2. Progressive Average
3. Composite Average
Computation of Simple Arithmetic Mean:
i) For raw data/individual series/ungrouped data:
Mean = X̄ = ΣXi / N, where N is the number of observations.
ii) For frequency distribution data:
1) Discrete frequency distribution (Ungrouped frequency distribution) data:
2) Continuous frequency distribution (Grouped frequency distribution) data:
Mean = X̄ = Σ fi Xi / N, where N = Σ fi, for both ungrouped frequency data and continuous
(grouped) frequency distributions.

Note: For a continuous frequency distribution, Xi is the middle value
corresponding to the i-th class interval of the frequency distribution.
Merits of Arithmetic Mean:
• It is simplest and most widely used average.
• It is easy to understand and easy to calculate.
• It is rigidly defined.
• Its calculation is based on all the observations.
• It is suitable for further mathematical treatment.
• It is least affected by the fluctuations of sampling.
• If the number of items is sufficiently large, it is more accurate
and more reliable.
• It is a calculated value and is not based on its position in the
series.
• It provides a good basis for comparison.
Demerits of Arithmetic Mean:
• It cannot be obtained by inspection nor can be located graphically.
• It cannot be used to study qualitative phenomenon such as
intelligence, beauty, honesty etc.
• It is very much affected by extreme values.
• It cannot be calculated for open-end classes.
• The A. M. computed may not be the actual item in the series
• Its value can’t be determined if one or more number of observations
are missing in the series.
• Sometimes the A.M. gives absurd results, e.g. the number of children per family
can’t be a fraction.
Uses of Arithmetic Mean
• Arithmetic Mean is used to compare two or more series with respect
to certain character.
• It is commonly & widely used average in calculating Average cost of
production, Average cost of cultivation, Average cost of yield per
hectare etc...
• It is used in calculating standard deviation and coefficient of variation.
• It is used in calculating correlation co-efficient, regression co-
efficient.
• It is also used in testing of hypothesis and finding confidence limit.
Examples:
Example 1.
(a) Find the arithmetic mean of the following frequency
distribution:
x: 1 2 3 4 5 6 7
f: 5 9 12 17 14 10 6
(b) Calculate the arithmetic mean of the marks from the following
table:
Marks : 0-10 10-20 20-30 30-40 40-50 50-60
No. of students : 12 18 27 20 17 6
Solutions:
(a)
x  :  1   2   3   4   5   6   7   Total
f  :  5   9  12  17  14  10   6    73
fx :  5  18  36  68  70  60  42   299
Therefore,
Mean = Σfx/N = 299/73 = 4.0959
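Both parts of Example 1 can be checked with a short Python sketch of the formula Mean = Σfx/N. The helper name `frequency_mean` is illustrative; for part (b) the class mid-points are used as the x values, as the note on continuous frequency distributions requires.

```python
def frequency_mean(values, frequencies):
    """Arithmetic mean of a frequency distribution: sum(f * x) / sum(f)."""
    n = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / n

# Part (a): discrete frequency distribution.
x = [1, 2, 3, 4, 5, 6, 7]
f = [5, 9, 12, 17, 14, 10, 6]
mean_a = frequency_mean(x, f)

# Part (b): continuous frequency distribution, using class mid-points.
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
students = [12, 18, 27, 20, 17, 6]
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
mean_b = frequency_mean(midpoints, students)

print(f"(a) mean = {mean_a:.4f}")
print(f"(b) mean = {mean_b:.2f}")
```

For part (b) the mid-points are 5, 15, ..., 55, Σfx = 2800 and N = 100, so the mean of the marks is 28.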
Arithmetic Mean= Xbar = A+ h Σ f*d/N
= 28 + 8(-25)/77
= 28 – 200/77
= 28 – 2.597
= 25.403
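The step-deviation formula X̄ = A + h Σfd/N used in this calculation can be sketched in Python. The data behind the figures A = 28, h = 8 and N = 77 are not shown here, so the sketch instead applies the formula to the marks table of Example 1(b), with an assumed mean A = 25 and class width h = 10, and takes d = (x − A)/h for each class mid-point x. These choices are assumptions for illustration only.

```python
# Marks table from Example 1(b): class mid-points and frequencies.
midpoints = [5, 15, 25, 35, 45, 55]
frequencies = [12, 18, 27, 20, 17, 6]

A = 25   # assumed mean (any convenient mid-point; an illustrative choice)
h = 10   # common class width

# Step deviations d = (x - A) / h for each class mid-point.
d = [(x - A) / h for x in midpoints]

N = sum(frequencies)
sum_fd = sum(f * di for f, di in zip(frequencies, d))

# Step-deviation formula: mean = A + h * sum(f * d) / N.
mean = A + h * sum_fd / N

print("sum of f*d =", sum_fd)
print("mean =", mean)
```

The result agrees with the direct calculation Σfx/N for the same table, which is the point of the method: it only shortens the arithmetic.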
Mathematical Properties of the Arithmetic Mean :
1. The sum of the deviations of the individual items from the arithmetic mean is
always zero, i.e. Σ (Xi − X̄) = 0.

2. The sum of the squared deviations of the individual items from the arithmetic
mean is always minimum, i.e. Σ (Xi − X̄)² ≤ Σ (Xi − A)² for any value A.

3. The Standard Error of the A.M. is less than that of any other measure of central
tendency.
4. The arithmetic mean depends on change of both origin and scale
(i.e. if each value of a variable X is increased, decreased, multiplied or
divided by a constant value k, the arithmetic mean of the new series is also
increased, decreased, multiplied or divided by the same constant k).
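Properties 1, 2 and 4 can be checked numerically with a small Python sketch. It reuses the marks of the 10 students from the raw-series example; the constants `k` and `other_value` are arbitrary illustrative choices, and floating-point results are compared within a small tolerance.

```python
from statistics import mean

# Marks of 10 students from the raw-series example.
x = [55, 73, 60, 41, 60, 61, 75, 73, 58, 80]
x_bar = mean(x)

# Property 1: deviations from the mean sum to zero (up to rounding error).
sum_of_deviations = sum(xi - x_bar for xi in x)

# Property 2: squared deviations are smallest when taken about the mean.
def sum_sq_dev(values, a):
    """Sum of squared deviations of the values about the point a."""
    return sum((v - a) ** 2 for v in values)

other_value = 60  # arbitrary comparison point
about_mean = sum_sq_dev(x, x_bar)
about_other = sum_sq_dev(x, other_value)

# Property 4: change of origin and scale carries over to the mean.
k = 5  # arbitrary constant
shifted_mean = mean(xi + k for xi in x)
scaled_mean = mean(xi * k for xi in x)

print("mean =", x_bar)
print("sum of deviations =", sum_of_deviations)
print("squared deviations about mean / about 60:", about_mean, about_other)
print("mean after adding k / multiplying by k:", shifted_mean, scaled_mean)
```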
Uses of the weighted mean:
1. Construction of index numbers.
2. Comparison of results of two or more groups where the number of items differs in
each group.
3. Computation of standardized death and birth rates.
4. When values of items are given in percentages or proportions.
Weighted Arithmetic Mean
• Example 3. The average salary of male employees in a firm was
Rs.520 and that of females was Rs.420. The mean salary of all the
employees was Rs.500. Find the percentage of male and female
employees.
• Solution. Let n1 and n2 denote respectively the number of male and
female employees in the concern, and x̄1 and x̄2 denote respectively
their average salary (in rupees). Let x̄ denote the average salary of
all the workers in the firm.
• We are given that x̄1 = 520, x̄2 = 420 and x̄ = 500. Since
x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2), this
implies 500 (n1 + n2) = 520 n1 + 420 n2.
• Hence 80 n2 = 20 n1, i.e. n1 : n2 = 4 : 1.
• Percentage of male employees = (4/5) × 100 = 80% and percentage of
female employees = (1/5) × 100 = 20%.
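The relation 500 (n1 + n2) = 520 n1 + 420 n2 can also be solved numerically. The sketch below writes p for the proportion of male employees, n1/(n1 + n2), so that the combined mean is p × 520 + (1 − p) × 420; the variable names are illustrative.

```python
mean_male = 520     # average salary of male employees (Rs.)
mean_female = 420   # average salary of female employees (Rs.)
mean_all = 500      # average salary of all employees (Rs.)

# Combined mean: mean_all = p * mean_male + (1 - p) * mean_female,
# where p = n1 / (n1 + n2) is the proportion of male employees.
male_share = (mean_all - mean_female) / (mean_male - mean_female)
female_share = 1 - male_share

print(f"Male employees:   {100 * male_share:.0f}%")
print(f"Female employees: {100 * female_share:.0f}%")
```

The combined mean lies closer to the male average because the male group carries the larger weight.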
Continuous frequency distribution (Grouped frequency distribution)
data:
If X1, X2, X3, ..., Xn are the mid-points of the n class intervals with their corresponding frequencies f1, f2,
f3, ..., fn, then the geometric mean (GM) is defined as
GM = (X1^f1 × X2^f2 × ... × Xn^fn)^(1/N),
where N = Σ fi.
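In practice the product of powers is usually evaluated through logarithms, since GM = exp(Σ fi ln Xi / N). The sketch below takes that route; the function name is illustrative and the mid-points and frequencies are hypothetical values chosen only to show the calculation.

```python
import math

def grouped_geometric_mean(midpoints, frequencies):
    """Geometric mean of grouped data: exp(sum(f * ln(x)) / sum(f)).

    All mid-points must be positive, otherwise the logarithm is undefined.
    """
    n = sum(frequencies)
    log_sum = sum(f * math.log(x) for x, f in zip(midpoints, frequencies))
    return math.exp(log_sum / n)

# Hypothetical grouped data (illustration only, not from the notes).
midpoints = [5, 15, 25]
frequencies = [2, 3, 5]

gm = grouped_geometric_mean(midpoints, frequencies)
print(f"GM = {gm:.4f}")
```

Working with logarithms avoids the very large intermediate products that the direct formula produces when the frequencies are large.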
Merits of Geometric mean:
• It is rigidly defined.
• It is based on all observations.
• It is capable of further mathematical treatment.
• It is not affected much by the fluctuations of sampling.
• Unlike AM, it is not affected much by the presence of
extreme values.
• It is very suitable for averaging ratios, rates and
percentages.
Demerits of Geometric mean:
• Its calculation is not as simple as that of the A.M., and it is not easy to
understand.
• The GM may not be an actual value of the series.
• It can’t be determined graphically or by inspection.
• It cannot be used when the values are negative because if any one
observation is negative, G.M. becomes meaningless or doesn’t
exist.
• It cannot be used when the values are zero, because if any one
observation is zero, G. M. becomes zero.
• It cannot be calculated for open-end classes.
Uses of G. M:
1.It is used in the construction of index numbers.
2.It is also helpful in finding out the compound rates of change such as
the rate of growth of population in a country, average rates of change,
average rate of interest etc..
3.It is suitable where the data are expressed in terms of rates, ratios
and percentage.
4.It is most suitable when the observations of smaller values are given
more weightage or importance.
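Harmonic Mean (H.M.):
• If X1, X2, ..., Xn are the observations, let Y1 = 1/X1, ..., Yn = 1/Xn.
• The mean of Y is the mean of the reciprocals of the observations, i.e. the
sum of the Y’s divided by n. Next take the inverse of the mean of Y:
H.M. = n / Σ (1/Xi).
• If the observations have frequencies f1, f2, ..., fn, then
H.M. = N / Σ (fi/Xi), where N = Σ fi.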
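The reciprocal-of-the-mean-of-reciprocals rule can be sketched in Python as follows. The function name is illustrative, the optional frequencies are an assumption covering the frequency-distribution case, and the two speeds are hypothetical figures used only to show the calculation.

```python
def harmonic_mean(values, frequencies=None):
    """Harmonic mean: N / sum(f / x); every value must be non-zero."""
    if frequencies is None:
        frequencies = [1] * len(values)   # raw data: each value counted once
    n = sum(frequencies)
    return n / sum(f / x for x, f in zip(values, frequencies))

# Hypothetical example: equal distances covered at 60 km/h and 40 km/h.
speeds = [60, 40]
average_speed = harmonic_mean(speeds)
print(f"Average speed = {average_speed:.1f} km/h")
```

For equal distances the harmonic mean (48 km/h here) gives the true average speed, whereas the arithmetic mean of the two speeds (50 km/h) overstates it.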
Merits of H.M.:
• It is rigidly defined.
• It is based on all items in the series.
• It is amenable to further algebraic treatment.
• It is not affected much by the fluctuations of sampling.
• Unlike AM, it is not affected much by the presence of extreme
values.
• It is the most suitable average when it is desired to give greater
weight to smaller observations and less weight to the larger ones.
Demerits of H.M:
• It is not easily understood and it is difficult to compute.
• It is only a summary figure and may not be an actual item in the
series.
• Its calculation is not possible if the value of one or more
items is missing or zero.
• Its calculation is not possible if the series contains both negative
and positive observations.
• It gives greater importance to small items and is therefore useful
only when small items have to be given greater weightage.
• It cannot be determined graphically or by inspection.
• It cannot be calculated for open-end classes.
Uses of H. M.:
H.M. is of greater significance in cases where prices are expressed in
quantities (units per price). H.M. is also used in averaging time, speed, distance,
quantity, etc.; for example, to find the average speed of travel,
the average time taken to travel, or the average distance travelled.
Positional Averages:
These averages are based on the position of the observations in a series arranged in
ascending or descending order. Ex: Median, Mode, Quartiles, Deciles, Percentiles.

1) Median:
• Median is the middle most value of the series of the data when the observations are
arranged in ascending or descending order.
• The median is that value of the variate which divides the group into two equal parts,
one part comprising all values greater than middle value, and the other all values less
than middle value.
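Computation of median:
i. For raw data/individual series/ungrouped data: arrange the n observations in order of magnitude. If n is odd, the median is the ((n+1)/2)th observation; if n is even, it is the mean of the (n/2)th and (n/2 + 1)th observations.
ii. For frequency distribution data:
(a) Discrete frequency distribution (Ungrouped frequency distribution)
data: the median is the value of x corresponding to the cumulative frequency just greater than N/2, where N is the total frequency.

(b) Continuous frequency distribution (Grouped frequency distribution)
data:
Median = l + [N/2 – cf] x h/f
where l = lower limit of the median class, N = total frequency, cf = cumulative frequency of the class preceding the median class, f = frequency of the median class and h = width of the median class.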
Graphic method for Location of median:
• Median can be located with the help of the cumulative frequency curve or ‘ogive’. The procedure for locating
the median in grouped data is as follows:
• Step1: The class boundaries (exclusive classes, with no gaps between consecutive classes) are
represented on the horizontal axis (x-axis).
• Step2: The cumulative frequency corresponding to each class is plotted on the vertical axis (y-axis) against the
upper limit of the class interval (or against the variate value in the case of a discrete series).
• Step3: The curve obtained on joining the points by means of freehand drawing is called the ‘ogive’. The ogive so
drawn may be either (i) a less than ogive or (ii) a more than ogive.
• Step4: The value of N/2 is marked on the y-axis, where N is the total frequency.
• Step5: A horizontal straight line is drawn from the point N/2 on the y-axis parallel to the x-axis to meet the ogive.
• Step6: A vertical straight line is drawn from the point of intersection perpendicular to the horizontal axis. The point where it meets the x-axis gives the median.
Fig: Graphic method for location of median
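The graphic procedure can be imitated numerically. The sketch below assumes a less than ogive whose plotted points are joined by straight line segments (an approximation of the freehand curve) and reads off the x-value at height N/2. The function name and the data are hypothetical.

```python
def median_from_ogive(upper_bounds, cum_freqs, lower_start):
    """Read off the x-value where the less-than ogive reaches N/2."""
    # Ogive points: (lower boundary of first class, 0), then
    # (upper boundary of each class, cumulative frequency up to that class)
    xs = [lower_start] + list(upper_bounds)
    ys = [0] + list(cum_freqs)
    target = ys[-1] / 2  # N/2 marked on the y-axis
    for i in range(len(xs) - 1):
        x0, y0, x1, y1 = xs[i], ys[i], xs[i + 1], ys[i + 1]
        if y0 <= target <= y1 and y1 > y0:
            # Linear interpolation along the segment that crosses N/2
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("cumulative frequencies must increase up to N")

# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 6, 2
ogive_median = median_from_ogive([10, 20, 30], [2, 8, 10], lower_start=0)
print(ogive_median)  # 15.0
```

With straight segments this gives the same value as the interpolation formula Median = l + [N/2 – cf] x h/f.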
Merits of Median:
• It is easily understood and is easy to calculate.
• It is rigidly defined.
• It can be located merely by inspection.
• It is not at all affected by extreme values.
• It can be calculated for distributions with open-end classes.
• Median is the only average to be used to study qualitative data where
the items are scored or ranked.
Demerits of Median:
• In case of even number of observations median cannot be determined
exactly. We merely estimate it by taking the mean of two middle terms.
• It is not based on all the observations.
• It is not amenable to algebraic treatment.
• As compared with the mean, it is affected more by fluctuations of sampling.
• If importance needs to be given to small or big items in the series, then
the median is not a suitable average.
Uses of Median
• Median is the only average to be used while dealing with qualitative
data which cannot be measured quantitatively but can be arranged in
ascending or descending order.
• Ex: To find the average honesty, average intelligence, average beauty,
etc. among a group of people.
• It is used for determining the typical value in problems concerning
wages and distribution of wealth.
• Median is useful in distributions where open-end classes are given.
Mode:
• The mode is the value in a distribution which occurs most frequently or
repeatedly.
• It is an actual value, which has the highest concentration of items in and
around it or predominant in the series.
• In case of discrete frequency distribution mode is the value of x
corresponding to maximum frequency.
Computation of mode:
For raw data/individual-series/ungrouped data:
• Mode is the value of the variable (observation) which occurs maximum
number of times.
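For frequency distribution data:
• Discrete frequency distribution (Ungrouped frequency distribution) data:
In case of a discrete frequency distribution, the mode is the value of the x variable
corresponding to the maximum frequency.
• Continuous frequency distribution (Grouped frequency distribution) data:
Mode = l + (f1 – f0) x h/(2f1 – f0 – f2)
where l = lower limit of the modal class, h = width of the modal class, f1 = frequency of the modal class, f0 = frequency of the preceding class and f2 = frequency of the succeeding class.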
Graphic method for location of mode:
Steps:
• Draw a histogram of the given distribution.
• Join the top right corner of the highest rectangle (modal class rectangle)
by a straight line to the top right corner of the preceding rectangle.
Similarly, join the top left corner of the highest rectangle to the top
left corner of the succeeding rectangle (the one on its right).
• From the point of intersection of these two diagonal lines, draw a
perpendicular to the x-axis.
• The value read on the x-axis at the foot of this perpendicular gives the mode.
Fig 6.3: Graphic method for location of mode
Merits of Mode:
• It is easy to calculate and in some cases it can be located by mere
inspection
• Mode is not at all affected by extreme values.
• It can be calculated for open-end classes.
• It is usually an actual value of an important part of the series.
• Mode can be conveniently located even if the frequency distribution
has class intervals of unequal magnitude provided the modal class
and the classes preceding and succeeding it are of the same
magnitude.
Demerits of mode:
• Mode is ill defined. It is not always possible to find a clearly defined
mode.
• It is not based on all observations.
• It is not capable of further mathematical treatment.
• As compared with mean, mode is affected to a greater extent by
fluctuations of sampling.
• It is unsuitable in cases where relative importance of items has to be
considered.
Remarks:
In some cases, we may come across distributions with two modes. Such
distributions are called bi-modal. If a distribution has more than two modes, it is
said to be multimodal.

Uses of Mode:
Mode is most commonly used in business forecasting, for example in manufacturing
units and the garment industry, to find the ideal size. Ex: in the
manufacture of readymade garments, to find the average (modal) size of track suits,
dresses, shoes, etc.
Partition Values:
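Partition values are the values of the variable which divide the total number of
observations into a number of equal parts when the observations are arranged in order of magnitude.

Ex: Median, Quartiles, Deciles, Percentiles.

• Median: Median is only one value, which divides the whole series into two equal
parts.
• Quartiles: Quartiles are three in number and divide the whole series into four equal
parts. They are represented by Q1, Q2, Q3 respectively.
• Deciles: Deciles are nine in number and divide the whole series into ten equal
parts. They are represented by D1, D2, ..., D9.
• Percentiles: Percentiles are ninety-nine in number and divide the whole series into hundred equal
parts. They are represented by P1, P2, ..., P99.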
Some important relations and results:
1. Relation between A.M., G.M. & H.M.: A.M. ≥ G.M. ≥ H.M. (for positive observations).
2. For two values, (G.M.)² = A.M. x H.M., i.e. the G.M. of the A.M. & H.M. is equal to the G.M. of the two values.
3. A.M. of the first “n” natural numbers 1, 2, 3, ..., n is (n+1)/2.
4. Weighted A.M. of the first “n” natural numbers 1, 2, 3, ..., n with
corresponding weights 1, 2, 3, ..., n is (2n+1)/3.
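These results can be checked numerically. The short sketch below uses two hypothetical positive values (4 and 9) and n = 10; it is only a spot check of the stated relations, not a proof.

```python
import math

a, b = 4.0, 9.0  # two hypothetical positive values
am = (a + b) / 2
gm = math.sqrt(a * b)
hm = 2 / (1 / a + 1 / b)
print(am >= gm >= hm)                   # A.M. >= G.M. >= H.M.
print(math.isclose(gm ** 2, am * hm))   # (G.M.)^2 = A.M. x H.M. for two values

n = 10
naturals = list(range(1, n + 1))
simple_am = sum(naturals) / n                                # expected (n + 1) / 2
weighted_am = sum(i * i for i in naturals) / sum(naturals)   # weights 1..n, expected (2n + 1) / 3
print(simple_am, weighted_am)
```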
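Formula for Mode for Grouped Frequency Distribution
• In a grouped frequency distribution, unlike ungrouped data,
it is impossible to determine the mode by looking at the
frequencies. Here, we can only locate a class with the
maximum frequency, called the modal class. The mode is a
value that lies in the modal class and is calculated using the
formula:
Mode = l + (f1 – f0) x h/(2f1 – f0 – f2)
where l is the lower limit of the modal class, h is the class width, f1 is the frequency of the modal class, and f0 and f2 are the frequencies of the preceding and succeeding classes.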
Example-1
• Calculate the mean, median and mode of the following frequency distribution:

• Calculation of mean:

Class                  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  Total
Frequency (f)              7     14     13     12     20     11     15      8    100
Mid-point of CI (X)       15     25     35     45     55     65     75     85      -
fX                       105    350    455    540   1100    715   1125    680   5070

• Mean = ΣfX/N = 5070/100 = 50.70
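The mean calculation for Example-1 can be reproduced with a few lines of Python. The variable names are illustrative; the mid-points and frequencies are those of the table.

```python
midpoints = [15, 25, 35, 45, 55, 65, 75, 85]   # mid-points X of the class intervals
freqs = [7, 14, 13, 12, 20, 11, 15, 8]         # frequencies f

total_freq = sum(freqs)                                 # N
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))   # sum of fX
grouped_mean = sum_fx / total_freq
print(total_freq, sum_fx, grouped_mean)  # 100 5070 50.7
```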


Calculation of Median:

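• N/2 = 100/2 = 50. The cumulative frequencies are 7, 21, 34, 46, 66, 77, 92, 100, so N/2 lies in the class interval 50-60.
• Hence L = 50, the lower limit of the median class.
• f = 20, the frequency of the median class.
• cf = 46, the cumulative frequency of the class preceding the median class.
• h = 10, the length of the class interval.
• Median = L + [N/2 – cf] x h/f
• = 50 + [50 – 46] x 10/20
• = 50 + 4/2
• = 52
• Median of the distribution is 52
• The quartiles, deciles and percentiles are obtained in the same way from the class containing the required position: Qi = L + [iN/4 – cf] x h/f (i = 1, 2, 3), Di = L + [iN/10 – cf] x h/f (i = 1, ..., 9), Pi = L + [iN/100 – cf] x h/f (i = 1, ..., 99)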
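The median and the other partition values all follow one pattern, L + [pN – cf] x h/f, where p is the required fraction (1/2 for the median, i/4 for Qi, and so on). The sketch below applies it to the Example-1 table; the function name is hypothetical and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """Return L + (fraction * N - cf) * h / f for the class holding position fraction * N."""
    total_freq = sum(freqs)
    position = fraction * total_freq
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= position:
            return lower + (position - cf) * width / f
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

lower_limits = [10, 20, 30, 40, 50, 60, 70, 80]
freqs = [7, 14, 13, 12, 20, 11, 15, 8]

median = grouped_quantile(lower_limits, freqs, 10, 0.5)
q1 = grouped_quantile(lower_limits, freqs, 10, 0.25)
q3 = grouped_quantile(lower_limits, freqs, 10, 0.75)
print(median, round(q1, 2), round(q3, 2))  # 52.0 33.08 68.18
```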
Calculation of Mode:

• From the given table, the highest frequency = 20
• This value lies in the interval 50-60. Thus, it is the modal class.
• Modal class = 50 – 60
• l = Lower limit of the modal class = 50
• h = Size of the class interval (assuming all class sizes to be equal)
= 10
• f1 = Frequency of the modal class = 20
• f0 = Frequency of the class preceding the modal class = 12
• f2 = Frequency of the class succeeding the modal class = 11
Formula for mode is

• Mode = l + (f1 – f0) x h/(2f1 – f0 – f2)
• = 50 + [8 x 10/(40 – 23)]
• = 50 + 80/17
• = 50 + 4.706
• = 54.706
• Thus, the mode of the distribution is 54.706
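The mode calculation above can be sketched in Python as follows. The function name is hypothetical; it assumes equal class widths and a single modal class, and it treats a missing neighbouring class as having frequency zero.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = l + (f1 - f0) * h / (2*f1 - f0 - f2) for the modal class."""
    k = freqs.index(max(freqs))                      # position of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                # class preceding the modal class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0   # class succeeding the modal class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is not defined by this formula (flat frequencies)")
    return lower_limits[k] + (f1 - f0) * width / denominator

lower_limits = [10, 20, 30, 40, 50, 60, 70, 80]
freqs = [7, 14, 13, 12, 20, 11, 15, 8]
mode = grouped_mode(lower_limits, freqs, 10)
print(round(mode, 3))  # 54.706
```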
Examples
• Industrial engineers periodically conduct “work measurement” analyses to
determine the time used to produce a single unit of output. At a large processing
plant, the total number of man-hours required per day to perform a certain task
was recorded for 50 days. This information will be used in a work measurement
analysis. The total man-hours required for each of the 50 days are listed below.
[Min=88, Max=150]
Questions:
a. Construct a frequency distribution.
b. Construct the histogram, frequency polygon and pie chart.
c. Draw the frequency curve, cumulative frequency curve.
d. Compute the mean, the median, and the mode of the data set.
e. Compute the quartiles, deciles and percentiles.
Class intervals: 85-95, 95-105, 105-115, 115-125, 125-135, 135-145, 145-155
Measures of Dispersion

Introduction
Measures of central tendency, viz. Mean, Median, Mode, etc., indicate the central position of
a series. They indicate the general magnitude of the data but fail to reveal all the peculiarities
and characteristics of the series. For example,
• Series A: 20, 20, 20 ΣX = 60, A.M. = 20
• Series B: 5, 10, 45 ΣX = 60, A.M. = 20
• Series C: 17, 19, 24 ΣX = 60, A.M. = 20
Hence, measures of central tendency fail to reveal the degree of spread or the extent
of the variability in the individual items of the distribution. This can be explained by certain other
measures, known as ‘Measures of Dispersion’ or ‘Variation or Deviation’. The simplest meaning
that can be attached to the word ‘dispersion’ is a lack of uniformity in the sizes or quantities
of the items of a group.
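The point of the three series can be seen numerically: the means agree, but the spread does not. The sketch below uses Python's statistics module; the population standard deviation (divisor n) is used here as one possible measure of spread.

```python
import statistics

series = {"A": [20, 20, 20], "B": [5, 10, 45], "C": [17, 19, 24]}
summary = {}
for name, values in series.items():
    mean = statistics.mean(values)
    value_range = max(values) - min(values)   # largest minus smallest value
    sd = statistics.pstdev(values)            # population S.D. (divisor n)
    summary[name] = (mean, value_range, sd)
    print(name, mean, value_range, round(sd, 3))
```

All three means equal 20, while the range and the S.D. are zero for Series A, largest for Series B and small for Series C.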
Definition:
“Dispersion is the extent to which the magnitudes or quantities of individual items differ, the degree of diversity.”
The dispersion or spread of the data is the degree of the scatter or variation of the variable about the central value.
Properties/Characteristics/Pre-requisite of a Good Measure of Dispersion
1. It should be simple to understand and easy to compute.
2. It should be rigidly defined.
3. It should be based on each individual item of the distribution.
4. It should be capable of further algebraic treatment.
5. It should have less sampling fluctuation.
6. It should not be unduly affected by the extreme items.
7. It should be helpful for further statistical analysis.
Significance of measures of dispersion:
▪ Dispersion helps to measure the reliability of central tendency i.e. dispersion
enables us to know whether an average is really representative of the series.
▪ To know the nature of variation and its causes in order to control the variation.
▪ To make a comparative study of the variability of two or more series by
computing the relative dispersion.
▪ Measures of dispersion provide the basis for studying correlation, regression,
analysis of variance, testing of hypothesis, statistical quality control, etc.
▪ Measures of dispersion are complements of the measures of central tendency.
Both together provide better tool to compare different distributions.
Types of Dispersion: Two types
1) Absolute measure of dispersion
2) Relative measures of dispersion.
• Absolute measure of dispersion:
Absolute measures of dispersion are expressed in the same units in which the original data are
expressed/measured. For example, if the yield of food grains is measured in quintals, the absolute
dispersion will also be expressed in quintals. The only difficulty is that if two or more series are
expressed in different units, the series cannot be compared on the basis of absolute dispersion.
• Relative or Coefficient of dispersion:
‘Relative’ or ‘Coefficient of dispersion’ is the ratio or the percentage of measure of absolute dispersion
to an appropriate average. Relative measures of dispersion are free from units of measurements of the
observation. They are pure numbers. The basic advantage of this measure is that two or more series can
be compared with each other despite the fact they are expressed in different units.
The measures of dispersion, with the corresponding relative measures in brackets, are:
1. Range (Coefficient of Range)
2. Quartile Deviation (Q.D.) (Coefficient of Quartile Deviation)
3. Mean Deviation (M.D.) (Coefficient of Mean Deviation)
4. Standard Deviation (S.D.) (Coefficient of Variation)
5. Variance
Range:
It is the simplest method of studying dispersion. Range is the difference between
the Largest (Highest) value and the Smallest (Lowest) value in the given series.
While computing range, we do not take into account frequencies of different
groups.
i) For raw data/Individual series/ungrouped data:
Range (R) = L – S
where L = largest value
S = smallest value
ii) Frequency distribution data:
• Discrete frequency distribution (Ungrouped frequency distribution) data:
Range (R) = L – S
where L = largest value of the x variable
S = smallest value of the x variable
• Continuous frequency distribution (Grouped frequency distribution) data:
Range (R) = L – S
where L = upper boundary of the highest class
S = lower boundary of the lowest class.
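A minimal sketch of the range for raw and grouped data is given below. The function names are hypothetical. The coefficient of range is taken here as (L – S)/(L + S), the usual textbook definition of the relative measure; the formula is not stated in these notes, so treat it as an assumption.

```python
def value_range(values):
    """Range R = L - S for raw data or a discrete frequency distribution."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Relative measure, assumed to be (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)

def grouped_range(class_boundaries):
    """Upper boundary of the highest class minus lower boundary of the lowest class."""
    lowers = [lower for lower, _ in class_boundaries]
    uppers = [upper for _, upper in class_boundaries]
    return max(uppers) - min(lowers)

raw = [5, 10, 45]  # hypothetical raw data
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
print(value_range(raw), coefficient_of_range(raw), grouped_range(classes))  # 40 0.8 80
```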
Merits of Range:
• Range is the simplest method of studying dispersion.
• It is simple to understand and easy to calculate.
• It is rigidly defined.
• It is useful in frequency distributions where only the two extreme
observations are considered and the middle items are not given any
importance.
• In certain types of problems like quality control, weather forecasts,
share price analysis, etc..., range is most widely used.
• It gives a picture of the data in that it includes the broad limits within
which all the items fall.
Demerits of Range:
• It is affected greatly by sampling fluctuations. Its values are never
stable and vary from sample to sample.
• It is very much affected by the extreme items.
• It is based on only two extreme observations.
• It cannot be calculated from open-end class intervals.
• It is not suitable for mathematical treatment.
• It is a very rarely used measure.
• Range is very sensitive to size of the sample.
Uses of Range:
• Range is used for constructing quality control charts.
• In weather forecasts, it gives the maximum and minimum levels of temperature, rainfall, etc.
• It is used in studying variation in money rates, share prices, exchange
rates, gold prices, etc.
Quartile Deviation (Q.D.):
Quartile Deviation is half of the difference between the third quartile (Q3) and
the first quartile (Q1), i.e.

QD = (Q3 – Q1)/2
The range between the first quartile (Q1) and the third quartile (Q3) is called the inter-
quartile range (IQR), i.e. IQR = Q3 – Q1.
Computation of Q.D.:
i) For raw data/Individual series/ ungrouped data:
ii) Frequency distribution data:
1) Discrete frequency distribution (Ungrouped frequency distribution) data:
2) Continuous frequency distribution (Grouped frequency distribution) data:
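For raw data, the quartile deviation can be sketched with statistics.quantiles from the Python standard library. Its default "exclusive" method places the i-th quartile at position i(n+1)/4, which is one common textbook convention; other conventions give slightly different quartiles. The data are hypothetical, and the coefficient of Q.D. is assumed to be (Q3 – Q1)/(Q3 + Q1).

```python
import statistics

data = [10, 20, 30, 40, 50, 60, 70]   # hypothetical raw data, already sorted

# Three cut points Q1, Q2, Q3 (default method: "exclusive")
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                      # inter-quartile range
qd = (q3 - q1) / 2                 # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)   # assumed relative measure
print(q1, q3, iqr, qd, coeff_qd)   # 20.0 60.0 40.0 20.0 0.5
```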
Merits of Q. D.:
• It is simple to understand and easy to calculate.
• It is rigidly defined.
• It is not affected by the extreme values.
• In the case of open-ended distribution, it is most suitable.
• Since it is not influenced by the extreme values in a
distribution, it is particularly suitable in highly skewed
distribution.
Demerits of Q. D.:
• It is not based on all the items. It is based on two positional values
Q1 and Q3 and ignores the extreme 50% of the items.
• It is not amenable to further mathematical treatment.
• It is affected by sampling fluctuations.
• Since it is a positional average, it is not considered as a measure of
dispersion. It merely shows a distance on scale and not a scatter
around an average.
Mean Deviation (M.D.):
The range and quartile deviation are not based on all observations.
They are positional measures of dispersion. They do not show any
scatter of the observations from an average. The mean deviation is a
measure of dispersion based on all items in a distribution.
Definition:
“Mean deviation is the arithmetic mean of the absolute deviations of
a series computed from any measure of central tendency; i.e., the
mean, median or mode, all the deviations are taken as positive”.
“Mean deviation is the average amount of scatter of the items in a
distribution from either the mean or the median, ignoring the signs of
the deviations”.
• For raw data, M.D. about an average A = (1/n) Σ|Xi – A|; for frequency data, M.D. = (1/N) Σ fi |Xi – A|, where A is the mean, median or mode and N = Σ fi.
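A short sketch of the mean deviation about a chosen average is shown below. The function name and the data are hypothetical.

```python
import statistics

def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from the chosen average."""
    return sum(abs(x - center) for x in values) / len(values)

data = [5, 10, 45]  # hypothetical raw data
md_about_mean = mean_deviation(data, statistics.mean(data))       # centre = 20
md_about_median = mean_deviation(data, statistics.median(data))   # centre = 10
print(round(md_about_mean, 2), round(md_about_median, 2))  # 16.67 13.33
```

In this example the deviation taken about the median is the smaller of the two.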
Merits of M. D.:
• It is simple to understand and easy to compute.
• It is rigidly defined.
• It is based on all items of the series.
• It is not much affected by the fluctuations of sampling.
• It is less affected by the extreme items.
• It is flexible, because it can be calculated from any average.
Demerits of M. D.:
• It is not a very accurate measure of dispersion.
• It is not suitable for further mathematical calculation.
• It is illogical and mathematically unsound to assume all negative signs as
positive signs.
• Because the method is not mathematically sound, the results obtained by this
method are not reliable.
• It is rarely used in sociological studies.
Uses of M.D.:
• It is very useful when working with small samples.
• It is useful in studying the distribution of personal wealth in
communities or nations, weather forecasting and business cycles.
Remarks:
• 1) Mean deviation is minimum (least) when it is calculated from the
median rather than from the mean or mode.
• 2) Mean ± 15/2 M.D. includes about 99% of observations.
• 3) Range covers 100% of observations.
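Variance:
• Variance is the arithmetic mean of the squared deviations of the observations from their arithmetic mean: σ² = (1/n) Σ(Xi – X̄)².
• Ex: for the series 10, 20, 30 the mean is 20 and the deviations are –10, 0, 10, so the variance = (100 + 0 + 100)/3 = 200/3 = 66.67.
• For the series 0, 10, 20 (each value reduced by 10) the mean is 10 and the deviations are again –10, 0, 10, so the variance is still 66.67.
• For the series 1, 2, 3 (each value divided by 10) the mean is 2 and the deviations are –1, 0, 1, so the variance = 2/3, i.e. 66.67/10².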
Remarks:
1) Variance is independent of change of origin but not of scale.
• {Change of Origin: If all values in the series are increased or decreased by a
constant, the variance will remain the same.
• Change of Scale: If all values in the series are multiplied or divided by a constant
(k), then the variance will be multiplied or divided by the square of that constant (k²).}
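The effect of a change of origin and a change of scale on the variance can be checked with a small sketch. It uses statistics.pvariance (divisor n) on the series 10, 20, 30; the shift of 10 and the scale factor of 10 are chosen only for illustration.

```python
import math
import statistics

base = [10.0, 20.0, 30.0]
shifted = [x - 10 for x in base]   # change of origin: subtract a constant
scaled = [x / 10 for x in base]    # change of scale: divide by k = 10

var_base = statistics.pvariance(base)        # 200/3
var_shifted = statistics.pvariance(shifted)  # unchanged by the shift
var_scaled = statistics.pvariance(scaled)    # divided by k^2 = 100

print(round(var_base, 2), round(var_shifted, 2), round(var_scaled, 4))
```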
Merits of Variance:
• It is easy to understand and easy to calculate.
• It is rigidly defined.
• Its value based on all the observations.
• It is possible for further algebraic treatment.
• It is less affected by the fluctuations of sampling.
• As it is based on arithmetic mean, it has all the merits of arithmetic
mean.
• Variance is most informative among the measures of dispersions.
Demerits of Variance:
• The unit of expression of variance is not the same as that of the observations,
because variance is expressed in squared deviations. Ex: if the observations are
measured in metres (or in kg), then the variance will be in square metres (or in kg²).
• It can’t be determined for open-end class intervals.
• It is affected by extreme values
• As it is an absolute measure of variability, it cannot be used for the purpose of
comparison.
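Standard Deviation (S.D.):
• The standard deviation (σ) is the positive square root of the variance.
Computation of S.D.:
i) For raw data/Individual series/ungrouped data:
σ = √[(1/n) Σ(Xi – X̄)²]
ii) Frequency distribution data:
• Discrete frequency distribution (Ungrouped frequency distribution)
data:
σ = √[(1/N) Σ fi (Xi – X̄)²], where N = Σ fi
• Continuous frequency distribution (Grouped frequency distribution)
data:
σ = √[(1/N) Σ fi (Xi – X̄)²], where Xi is the mid-point of the i-th class interval and N = Σ fi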
Mathematical properties of standard deviation (σ)
• S.D. of the first n natural numbers, viz. 1, 2, 3, ..., n, is σ = √[(n² – 1)/12].

• The sum of the squared deviations of the individual items from the arithmetic
mean is always minimum, i.e. Σ(Xi – X̄)² ≤ Σ(Xi – A)² for any value A.
• S.D. is independent of change of origin but not of scale.
• {Change of Origin: If all values in the series are increased or decreased by a
constant, the standard deviation will remain the same.
• Change of Scale: If all values in the series are multiplied or divided by a
constant, then the standard deviation will be multiplied or divided by that
constant.}
• S.D. ≥ M.D. from Mean.
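Two of these properties can be spot-checked for the first n natural numbers. The sketch below takes n = 10 as an arbitrary choice and uses the population standard deviation (divisor n).

```python
import math
import statistics

n = 10
naturals = list(range(1, n + 1))

sd_direct = statistics.pstdev(naturals)     # computed from the data
sd_formula = math.sqrt((n ** 2 - 1) / 12)   # closed form for 1, 2, ..., n

mean = statistics.mean(naturals)
md_from_mean = sum(abs(x - mean) for x in naturals) / n   # mean deviation about the mean

print(round(sd_direct, 4), round(sd_formula, 4), md_from_mean)  # 2.8723 2.8723 2.5
```

The two S.D. values agree, and the S.D. is larger than the mean deviation from the mean.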
Merits of S. D:
• It is easy to understand.
• It is rigidly defined.
• Its value based on all the observations
• It is possible for further algebraic treatment.
• It is less affected by the fluctuations of sampling and hence stable.
• As it is based on arithmetic mean, it has all the merits of
arithmetic mean.
• It is the most important, stable and widely used measure of
dispersion.
• It is the basis for calculating several other statistical measures, such as the
coefficient of variation, coefficient of correlation, coefficient
of regression, standard error, etc.
Demerits of S. D.:
• It is difficult to compute.
• It assigns more weights to extreme items and less weights to items that are
nearer to mean because the values are squared up.
• It can’t be determined for open-end class intervals.
• As it is an absolute measure of variability, it cannot be used for the purpose
of comparison.
Uses of S. D.:
• It is the most important, stable and widely used measure of dispersion.
• It is very useful in knowing the variation of different series in making the
test of significance of various parameters.
• It is used in computing area under standard normal curve.
• It is used in calculating several statistical measures, such as the coefficient of
variation, coefficient of correlation, coefficient of regression, standard
error, etc.
Coefficient of Variation (C.V.):
• C.V. = (S.D./Mean) x 100 = (σ/X̄) x 100
Remarks:
• Generally, the coefficient of variation is used to compare two or more series. If the
coefficient of variation (C.V.) is greater for series-I than for series-II,
it indicates that the population (or sample) of series-I is more variable, less stable,
less uniform, less consistent and less homogeneous. If the C.V. is smaller for series-I
than for series-II, it indicates that the population (or sample) of series-I
is less variable, more stable, more uniform, more consistent and more
homogeneous.
• All relative measures of dispersion are dependent on change of origin but
independent of change of scale.
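The comparison of two series by C.V. can be sketched as follows, taking C.V. = (S.D./Mean) x 100 with the population S.D. The two series and the function name are hypothetical.

```python
import math
import statistics

def coefficient_of_variation(values):
    """C.V. = (S.D. / Mean) x 100, using the population S.D. (divisor n)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

series_1 = [5, 10, 45]
series_2 = [17, 19, 24]
cv_1 = coefficient_of_variation(series_1)
cv_2 = coefficient_of_variation(series_2)
print(round(cv_1, 2), round(cv_2, 2))   # series_1 is the more variable one

# Change of scale leaves the C.V. unchanged; change of origin does not
cv_scaled = coefficient_of_variation([3 * x for x in series_1])
cv_shifted = coefficient_of_variation([x + 100 for x in series_1])
print(math.isclose(cv_scaled, cv_1), cv_shifted < cv_1)
```

Here series_2 has the smaller C.V., so it is the more consistent of the two.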

Relationship between Q.D., M.D. & S.D. (approximately, for a normal distribution): 6 Q.D. = 5 M.D. = 4 S.D.,
so that S.D. > M.D. > Q.D.
