Content Reviewer (CR) Prof. Aslam Mahmood Jawaharlal Nehru University New
Delhi
Module ID QT 3
(1) E-Contents
In any exercise in data collection and analysis, once socially meaningful concepts have been converted into numerical “data” and the information has been collected, either through census enumeration or through sample surveys, the next step is to put the mass of collected data into a systematic and manageable form. We cannot refer to the raw data directly in a text or report, because it is too lengthy and is recorded in no particular order. The data set therefore needs to be transformed into a systematic and manageable form.
Tabulation of the raw data into a concise form is, therefore, an important step in most statistical analyses. Tabulation serves a dual purpose: it condenses the data into a manageable form and, at the same time, arranges it in a systematic order.
Tabulation
Once the data is arranged in a frequency table, it becomes much easier to handle in further statistical analysis, and it can also be referred to easily anywhere in the text. For this purpose the raw data is transformed into a “Frequency Distribution Table”.
There are two types of frequency distributions: (a) Ungrouped and (b) Grouped.
In an ungrouped frequency distribution each class consists of a single fixed value of the variable. It is used for data which is discontinuous (discrete) by nature and cannot occur in fractions, such as the size of a family, the number of schools, or the number of floods in a river in a year. The range of discontinuous data is generally not very large. An ungrouped frequency distribution table may look like the one given below:
Size of family (X)    Number of families (f)
1                     2
2                     14
3                     22
4                     24
5                     18
6                     14
7                     6
Total                 100
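As a minimal sketch of how such a table is built, the following Python snippet tallies a short list of family sizes. The list `family_sizes` and the function name `ungrouped_frequency` are hypothetical; the values are not the data behind the table above.

```python
from collections import Counter

# Hypothetical raw observations: size of each surveyed family (whole numbers only).
family_sizes = [3, 4, 2, 4, 5, 3, 6, 4, 1, 3, 5, 4]

def ungrouped_frequency(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)
    return sorted(counts.items())

table = ungrouped_frequency(family_sizes)
for size, freq in table:
    print(f"{size}\t{freq}")
print(f"Total\t{sum(freq for _, freq in table)}")
```

Each distinct value forms its own class, which is workable only because the variable takes a small number of whole-number values.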
Most of the time we have to handle data which is continuous by nature, such as rainfall, agricultural production or income. Such data can also occur in fractions, and its range is usually large. In such cases, instead of single fixed values of the variable, the classes are formed as ranges of values, and the number of observations falling in each class, known as its frequency, is tabulated. A hypothetical frequency distribution table of grouped data on the daily rainfall of an area over the 90 days of a season may look like the one given below:
In the above frequency table, the values of the variable are tabulated in smaller groups of values, which are known as classes. Every class has two values known as class limits: the lower class limit and the upper class limit. The difference between the upper limit and the lower limit of any class is known as the class interval. In the present case the first class has a lower limit of 20.0 mm, an upper limit of 30.0 mm and a class interval of 10.0 mm. In the second class the lower limit is 30.0 mm and the upper limit is 40.0 mm, and so on. All the class intervals of the above frequency distribution table are equal. We notice that the upper limit of every class becomes the lower limit of the next class, so a value equal to that limit must not be counted in both places. The convention is that any value less than the upper limit is included in the class itself, while a value equal to the upper limit of a class goes to the next class, where it is the lower limit. So in every class the lower limit is included in the class but the upper limit is not.
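The convention can be sketched in a few lines of Python. The class limits (first class 20.0 to 30.0 mm, interval 10.0 mm) are taken from the text; the function name `class_of` and the sample readings are hypothetical.

```python
# Class limits from the text: first class 20.0-30.0 mm, equal intervals of 10.0 mm.
LOWEST_LIMIT = 20.0
CLASS_INTERVAL = 10.0

def class_of(value, lowest=LOWEST_LIMIT, width=CLASS_INTERVAL):
    """Return the (lower, upper) limits of the class containing value.

    The lower limit is included in the class; the upper limit is not.
    """
    if value < lowest:
        raise ValueError("value lies below the first class")
    index = int((value - lowest) // width)
    lower = lowest + index * width
    return lower, lower + width

# Hypothetical daily rainfall readings in mm, chosen to sit on and near class limits.
for reading in [20.0, 29.9, 30.0, 39.9, 40.0]:
    print(reading, "->", class_of(reading))
```

A reading of exactly 30.0 mm is placed in the class 30.0 to 40.0, not in 20.0 to 30.0, so no observation is counted twice.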
In a grouped frequency distribution table the number of classes and the class interval are both very important, and they are related to each other. If the class intervals are large, the number of classes will be small. On the contrary, if the class intervals are small, the number of classes will increase.
A good frequency distribution table maintains a balance between the two. A very large number of classes loses the advantage of summarising the data. A very small number of classes, such as 2, 3 or 4, results in a significant loss of information.
There are several suggestions regarding the number of classes. One such suggestion, traditionally referred to in textbooks, is that the number of classes of a frequency distribution table, k, should be determined by the formula:
Even when the class interval obtained in this way is not a round number, class intervals in multiples of five or ten are preferred for practical reasons.
The difference between the upper limit and the lower limit of a class is known as the class interval, which may or may not be equal for all the classes. Class intervals are commonly of equal size. In some cases, however, equal class intervals are not suitable. For example, the tabulation of urban settlements in India, whose size varies from below 5000 population to 12442000 (the population of Mumbai, the highest, in 2011), uses unequal class intervals because of the range of variation in the data. For a range of 12437000, if we use equal class intervals of 5000 each we require 12437000/5000 = 2487.4, that is 2488 classes after rounding up. This is as cumbersome as the data itself and brings no simplification in handling or interpretation. On the other hand, if we take about 10 classes with a class interval of 10,00,000 (one million) population each, we lose most of the detail, as the very first class, from below 5000 to 10,00,000 (towns below one million), will contain 7882 of the total 7935 towns in India in 2011 (99.3 %). This is as bad as having no information.
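The arithmetic behind this argument can be checked with a short Python sketch. All figures are those quoted in the text; the function name `classes_needed` is hypothetical.

```python
import math

smallest = 5_000        # lowest town size mentioned in the text
largest = 12_442_000    # population of Mumbai in 2011, as quoted in the text
data_range = largest - smallest

def classes_needed(data_range, class_interval):
    """Number of equal classes needed to cover the range, rounded up."""
    return math.ceil(data_range / class_interval)

# Equal class intervals of 5000 population each.
print("range:", data_range)
print("classes of 5000:", classes_needed(data_range, 5_000))

# Share of all towns that would fall in a first class ending at 10,00,000.
share_first_class = 7882 / 7935 * 100
print("first class share (%):", round(share_first_class, 1))
```

The two results pull in opposite directions: a small interval gives far too many classes, while a large interval puts almost every town in a single class.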
In such cases, where the range of the data is too large, for example the population of towns, the income of individuals in a society or the land holdings of farmers, we are forced to use unequal class intervals. The class intervals are kept small at the lower values and become larger and larger as we proceed to the higher values. This reflects the fact that small differences cannot be ignored at the lower end, but the same differences are less important at higher values, where only larger differences matter. Thus the Census of India classifies towns into unequal class intervals as given below:
With unequal class intervals the number of classes is generally small, as each class represents a category of the data and there should not be so many categories that they cause confusion. For example, in the census classification of the towns of India, the class intervals correspond to six well-recognized classes of towns. What matters most in such cases is the researcher's understanding of how to convert the data into meaningful categories.
Example
The following example shows how a small set of raw data is converted into a frequency distribution table and then into a “Histogram”. It uses a hypothetical set of data on the production of wheat in 100 plots of one hectare each in an area, given in the table below.
Production of wheat in quintals (00 Kg) per plot of one hectare
20.3 20.2 19.8 20.1 21.0 20.9 20.2 19.9 19.6 19.2
20.3 21.1 19.7 19.1 18.3 18.1 17.9 20.7 20.0 19.4
18.3 18.0 17.0 17.2 22.3 20.7 21.3 18.9 19.7 21.0
21.1 19.8 18.5 18.2 22.1 21.1 18.1 19.3 19.9 19.7
18.8 18.9 16.9 20.1 20.3 18.1 17.6 19.4 20.3 21.1
20.2 22.1 18.7 19.5 20.1 23.0 22.9 22.8 22.8 22.5
20.9 20.4 20.1 20.6 20.9 18.0 20.3 18.1 19.7 18.2
18.3 17.1 20.2 23.0 20.1 18.9 18.3 21.2 17.3 17.6
19.3 19.0 21.3 22.1 19.9 18.8 21.1 23.1 23.6 23.1
20.1 19.8 19.7 18.3 17.1 18.3 19.0 20.1 20.1 18.9
The range of the data is quite small: the maximum value is 23.6 and the minimum is 16.9, so the range is 23.6 – 16.9 = 6.7. If we choose 10 classes, every class would have an interval of 0.67 (00) kg per hectare. An interval of 0.67 is not as conveniently understood as an interval of 1 (00) kg, which is close to it. Secondly, 10 classes appear to be rather many when the number of plots is only 100. Thus a class interval of 1 (00) kg is easily understood and gives eight classes, which is adequate for the purpose of making a histogram.
Starting with a lower class limit of 16.0, so that the minimum value of 16.9 lies in the first class, we form the classes as given in the frequency table below:
Frequency Table
Production of Wheat in (00) Kg
in 100 plots of Size one Hectare
Production (00 Kg)    Number of plots (f)
16 - 17               1
17 - 18               8
18 - 19               22
19 - 20               21
20 - 21               25
21 - 22               10
22 - 23               8
23 - 24               5
Total                 100
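The tally can be reproduced directly from the 100 raw values with a short Python sketch. The function name `grouped_frequency` is hypothetical; the class limits follow the convention that the lower limit is included and the upper limit is excluded.

```python
# Production of wheat (quintals per hectare) for the 100 plots listed above.
production = [
    20.3, 20.2, 19.8, 20.1, 21.0, 20.9, 20.2, 19.9, 19.6, 19.2,
    20.3, 21.1, 19.7, 19.1, 18.3, 18.1, 17.9, 20.7, 20.0, 19.4,
    18.3, 18.0, 17.0, 17.2, 22.3, 20.7, 21.3, 18.9, 19.7, 21.0,
    21.1, 19.8, 18.5, 18.2, 22.1, 21.1, 18.1, 19.3, 19.9, 19.7,
    18.8, 18.9, 16.9, 20.1, 20.3, 18.1, 17.6, 19.4, 20.3, 21.1,
    20.2, 22.1, 18.7, 19.5, 20.1, 23.0, 22.9, 22.8, 22.8, 22.5,
    20.9, 20.4, 20.1, 20.6, 20.9, 18.0, 20.3, 18.1, 19.7, 18.2,
    18.3, 17.1, 20.2, 23.0, 20.1, 18.9, 18.3, 21.2, 17.3, 17.6,
    19.3, 19.0, 21.3, 22.1, 19.9, 18.8, 21.1, 23.1, 23.6, 23.1,
    20.1, 19.8, 19.7, 18.3, 17.1, 18.3, 19.0, 20.1, 20.1, 18.9,
]

LOWEST_LIMIT = 16.0    # lower limit of the first class
CLASS_INTERVAL = 1.0   # one quintal, i.e. 1 (00) kg

def grouped_frequency(values, lowest, width):
    """Count values per class; lower limit included, upper limit excluded."""
    n_classes = int((max(values) - lowest) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - lowest) // width)] += 1
    return counts

data_range = round(max(production) - min(production), 1)
frequencies = grouped_frequency(production, LOWEST_LIMIT, CLASS_INTERVAL)

print("minimum:", min(production), "maximum:", max(production), "range:", data_range)
for i, f in enumerate(frequencies):
    lower = LOWEST_LIMIT + i * CLASS_INTERVAL
    print(f"{lower:.0f} - {lower + CLASS_INTERVAL:.0f}\t{f}")
print(f"Total\t{sum(frequencies)}")
```

Counting by machine is a useful check on a hand tally, since a single misplaced observation changes two class frequencies while leaving the total unchanged.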
Histogram
Distribution of Equal Class Intervals
A frequency distribution table arranges the data in an ordered form, which helps us understand the distributional properties of the data much better than the raw data does. For example, after transferring the data into a frequency distribution, we can easily see how many observations lie around the middle values and how many lie on either side. We can also see the inequalities in the distribution and other socially important characteristics of the data. These characteristics become more visible if we plot the distribution of the data as a “Histogram”.
A histogram is a set of rectangles, one for each class of the frequency distribution. The base of each rectangle is equal to the class interval of the class, and its height is equal to the frequency of the class.
Taking the wheat production data of the 100 one-hectare plots given in the above table, we prepare the “Histogram” shown in the figure given below. The first rectangle has a base from 16.0 to 17.0, the second rectangle has a base from 17.0 to 18.0, and so on until the last rectangle, whose base is the class interval of the last class, 23.0 to 24.0. The height of the first rectangle is equal to the frequency of the first class, i.e. 1, the height of the second rectangle is equal to the frequency of the second class, which is 8, and so on until the last class, with a height equal to 5.
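The construction can be sketched in Python without a plotting library. The frequencies are those obtained by tallying the 100 raw wheat values into classes of 1 (00) kg starting at 16; the function name `histogram_rectangles` and the sideways text rendering are illustrative choices, not part of the original figure.

```python
# Class lower limits (quintals) and frequencies tallied from the 100 raw values.
lower_limits = [16, 17, 18, 19, 20, 21, 22, 23]
frequencies = [1, 8, 22, 21, 25, 10, 8, 5]
CLASS_INTERVAL = 1

def histogram_rectangles(lower_limits, frequencies, width):
    """Return one (left, right, height) triple per class."""
    return [(lo, lo + width, f) for lo, f in zip(lower_limits, frequencies)]

rectangles = histogram_rectangles(lower_limits, frequencies, CLASS_INTERVAL)
for left, right, height in rectangles:
    # Draw each bar sideways: one '#' per observation in the class.
    print(f"{left:>2} - {right:<2} | {'#' * height} {height}")
```

Because every class has the same interval, the bar lengths alone show the shape of the distribution.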
A histogram can also be converted into a “Frequency Polygon” by joining the middle points of the upper sides of the bars. To show the pattern of change as a gradual process, the polygon can in turn be converted into a smooth curve, which is known as a “Frequency Distribution Curve” or simply a frequency curve. Such a frequency curve for the data on the production of wheat is also shown below along with the histogram.
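The vertices of a frequency polygon are simply the class mid-points paired with the class frequencies. A minimal Python sketch follows; the function name `frequency_polygon` is hypothetical, and the frequencies are those obtained by tallying the 100 raw wheat values.

```python
# Class lower limits (quintals) and frequencies tallied from the 100 raw values.
lower_limits = [16, 17, 18, 19, 20, 21, 22, 23]
frequencies = [1, 8, 22, 21, 25, 10, 8, 5]
CLASS_INTERVAL = 1

def frequency_polygon(lower_limits, frequencies, width):
    """Return (mid-point, frequency) vertices, one per class."""
    return [(lo + width / 2, f) for lo, f in zip(lower_limits, frequencies)]

polygon = frequency_polygon(lower_limits, frequencies, CLASS_INTERVAL)
for midpoint, f in polygon:
    print(midpoint, f)
```

Joining these points with straight lines gives the polygon; smoothing the line gives the frequency curve.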
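When the class intervals are unequal, a histogram drawn directly from the class frequencies gives a distorted image, because wider classes collect more observations simply by being wider. Thus, before making a histogram we have to find the frequency density of each class, as shown below.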
Now we can prepare a histogram, taking the first class as 0-500 with a frequency of 200. As it is the lowest class, with an interval of Rs. 500, its frequency is not divided: the class interval of Rs. 500 is taken as the standard unit, and all other classes are expressed in units of this standard interval. The second class also has an interval of Rs. 500, so it is equivalent to one standard unit. The third class has an interval of Rs. 1000, which is twice the standard interval. The fourth class has an interval of Rs. 3000, which is six times the standard interval, and the last class has an interval of Rs. 5000, which is 10 times the standard interval. Column no. 4 of the above table gives the class interval of each class in units of the first class interval. Column no. 5 gives the frequency density of each class, that is, its frequency per standard class interval of Rs. 500.
The histogram will now show the first class of 0-500 with a frequency of 200. The second class corresponds to 500-1000. The third class corresponds to two standard classes, 1000-1500 and 1500-2000, each with a frequency of 20. The fourth class corresponds to six standard classes, 2000-2500, 2500-3000, 3000-3500, 3500-4000, 4000-4500 and 4500-5000, each with a frequency of 10. Lastly, the class 5000-10000 corresponds to 10 standard classes of interval 500, starting from 5000-5500 and ending with 9500-10000, each with a frequency of 5.
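The frequency density calculation described above can be sketched as follows. The class totals of 40, 60 and 50 are assumptions back-calculated from the per-unit frequencies quoted in the text (2 × 20, 6 × 10 and 10 × 5); the class 500-1000 is left out because its frequency is not stated in the text. The function name `frequency_density` is hypothetical.

```python
STANDARD_INTERVAL = 500  # Rs.; the smallest class interval is the standard unit

# (lower limit, upper limit, frequency). The totals 40, 60 and 50 are
# back-calculated from the text; the 500-1000 class is omitted (frequency not stated).
income_classes = [
    (0, 500, 200),
    (1000, 2000, 40),
    (2000, 5000, 60),
    (5000, 10000, 50),
]

def frequency_density(lower, upper, frequency, standard=STANDARD_INTERVAL):
    """Frequency per standard class interval."""
    units = (upper - lower) / standard  # class interval in standard units
    return frequency / units

densities = [frequency_density(lo, up, f) for lo, up, f in income_classes]
for (lo, up, f), d in zip(income_classes, densities):
    units = (up - lo) // STANDARD_INTERVAL
    print(f"{lo}-{up}: interval = {units} unit(s), frequency = {f}, density = {d:g}")
```

The height of each rectangle in the histogram is the density, so that the area of the rectangle, rather than its height, represents the class frequency.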
A histogram of the above distribution of unequal class intervals will be as given below.
[Figure: Income Distribution. Histogram of the number of persons (vertical axis, scale up to 250) against income in Rs. (horizontal axis).]
• Frequency curves play an important role in statistical analysis. They help us understand the process through which the data is generated.
• A usual process in which neither very high nor very low values are favoured will generate a symmetrical curve. Examples are the annual rainfall of an area over a period of time, the height of children in a given age group and the agricultural productivity of plots in an adjoining area. A symmetrical curve is such that if it is folded at the middle, one half overlaps the other half.
• On the contrary, due to certain natural or social factors, the values in some distributions are not symmetrical, and we get a curve which is “Asymmetric” or “Skewed”. The values are distributed unequally, either on the higher side or on the lower side. The distribution of agricultural land holdings, the distribution of income and the district-wise proportion of urban population will show curves elongated to the right-hand side, which are known as “positively skewed”. The proportion of rural population to total population in different districts will give a curve elongated to the left-hand side, which is known as “negatively skewed”.
• Death rates by age in a population will give a “U-shaped curve”, as mortality is higher at the beginning and at the end of life and lowest in the middle ages. Shapes of symmetric and skewed curves are also given below:
Comparison of Frequency Distributions
• Any research enquiry begins with observations of the real-world situation around us and its comparison under different geographical situations. After we collect data about the real world and summarise it with the help of frequency tabulation, different types of graphs give us only a preliminary understanding of its comparative position under different geographical conditions, as graphs are not very accurate. For an accurate and meaningful comparison we need some numerical measures of the distribution. There are several such meaningful measures of any distribution, known as “Descriptive Statistics”. Some of the commonly used measures are given below:
• Measures of Central Tendency.
• Measures of Dispersion.
• Measures of Skewness.
• The first two, i.e. measures of central tendency and measures of dispersion, are very important parameters of any distribution, as they are used extensively in the theory of sampling, in inference and in many other places. Measures of skewness, however, are used relatively less frequently.
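To illustrate what such measures look like in practice, the sketch below computes one measure of each kind for a small, hypothetical list of incomes. The choice of measures (mean and median, standard deviation, moment coefficient of skewness) is an assumption made for illustration; the text above does not name specific measures, and the function name `moment_skewness` is hypothetical.

```python
import statistics

# Hypothetical monthly incomes (Rs.) with a long right-hand tail.
incomes = [300, 350, 400, 400, 450, 500, 600, 800, 1500, 4000]

def moment_skewness(values):
    """Moment coefficient of skewness: mean cubed deviation divided by sd cubed."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)

# Central tendency
print("mean:", statistics.mean(incomes))
print("median:", statistics.median(incomes))
# Dispersion
print("standard deviation:", round(statistics.pstdev(incomes), 1))
# Skewness: a positive value indicates a tail towards the higher values
print("skewness:", round(moment_skewness(incomes), 2))
```

For a positively skewed distribution such as this one, the mean lies above the median and the skewness coefficient is positive; for a symmetrical distribution the coefficient is zero.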