
Editing of Data

After collection of data the next step in a statistical investigation is the scrutiny of the collected
information. This is technically called ‘Editing of data’. It is a necessary step as in most cases the
collected data contain various types of mistakes and errors. The word error is used in a specialized
sense in Statistics. It does not mean the same thing as mistake. Mistake in Statistics means a wrong
calculation or use of inappropriate method in the collection or analysis of data. Error, on the other
hand, means “the difference between the true value and the estimated value”, and is called statistical
error.
As a matter of fact, editing involves a careful scrutiny of the completed questionnaires and/or
schedules. Editing is done to ensure that the data are accurate, consistent with other facts gathered,
uniformly entered, and as complete as possible.

Classification of data
Statistical data are classified on the basis of the characteristics after taking into account the
nature, scope, and purpose of an investigation. Generally, data are classified on the
following four bases:
Geographical Classification:
In geographical classification, data are classified on the basis of geographical or locational
differences such as cities, districts, or villages between various elements of the data set. Such a
classification is also known as spatial classification. Geographical classifications are generally listed
in alphabetical order. The following is an example of a geographical distribution:

City                                  Mumbai   Kolkata   Delhi   Chennai
Population density (per square km)    654      685       423     205
Chronological Classification:
When data are classified on the basis of time, the classification is known as chronological
classification. Such classifications are also called time series because data are usually listed in
chronological order starting with the earliest period. The following example would give an idea of
chronological classification:

Year                  1941   1951   1961   1971   1981   1991   2001   2011
Population (crore)    31.9   36.9   43.9   54.7   75.6   85.9   98.6   125.03

Qualitative Classification:
In qualitative classification, data are classified on the basis of descriptive characteristics or on the
basis of attributes like sex, literacy, region, caste, or education, which cannot be quantified. This is
done in two ways:
(i) Simple classification: In this type of classification, each class is subdivided into two sub-classes
and only one attribute is studied such as: male and female; blind and not blind, educated and
uneducated, and so on.
Population
    Male
        Educated
        Uneducated
    Female
        Educated
        Uneducated

(ii) Manifold classification: In this type of classification, a class is subdivided into more than two
sub-classes which may be sub-divided further.

Quantitative Classification:
In this classification, data are classified on the basis of some characteristics which can be
measured such as height, weight, income, expenditure, production, or sales. Quantitative variables
can be divided into the following two types. The term variable refers to any quantity or attribute
whose value varies from one investigation to another.
(i) Continuous variable is the one that can take any value within the range of numbers. Thus the
height or weight of individuals can be of any value within the limits. In such a case, data are
obtained by measurement.
(ii) Discrete (also called discontinuous) variable is the one whose values change by steps or jumps
and cannot assume a fractional value. The number of children in a family, number of workers
(or employees), number of students in a class, are a few examples of a discrete variable. In such a
case data are obtained by counting.

Frequency Distribution
A frequency distribution divides observations in the data set into conveniently established,
numerically ordered classes (groups or categories). The number of observations in each class is
referred to as the frequency, denoted by f. For example, the number of overtime hours worked in
each of 30 weeks was recorded as follows:
94 89 88 89 90 94 92 88 87 85
88 93 94 93 94 93 92 88 94 90
93 84 93 84 91 93 85 91 89 95
The frequency distribution of the number of hours of overtime is shown in the following table.
Number of Overtime Hours    Tally     Number of Weeks (Frequency)
84                          ||        2
85                          ||        2
86                          -         0
87                          |         1
88                          ||||      4
89                          |||       3
90                          ||        2
91                          ||        2
92                          ||        2
93                          ||||||    6
94                          |||||     5
95                          |         1
Total                                 30
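The tally above can also be produced mechanically. The following is a minimal Python sketch, assuming the 30 values listed above are the weekly overtime hours; the variable names are illustrative and not part of the original notes.

```python
from collections import Counter

# Weekly overtime hours, copied from the data listed above (30 weeks).
overtime_hours = [
    94, 89, 88, 89, 90, 94, 92, 88, 87, 85,
    88, 93, 94, 93, 94, 93, 92, 88, 94, 90,
    93, 84, 93, 84, 91, 93, 85, 91, 89, 95,
]

# Count how often each value occurs. A Counter returns 0 for absent values.
frequency = Counter(overtime_hours)

# One row per possible value, including values that never occur (such as 86).
for hours in range(min(overtime_hours), max(overtime_hours) + 1):
    f = frequency[hours]
    print(f"{hours:>3}  {'|' * f:<6}  {f}")

print("Total", sum(frequency.values()))
```

Listing every value between the minimum and the maximum, rather than only the observed ones, keeps empty rows visible in the table.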
Constructing a Frequency Distribution

As the number of observations obtained gets larger, the method discussed above to condense
the data becomes quite difficult and time-consuming. Thus, to further condense the data into
frequency distribution tables, the following steps should be taken:

1. Decide the number of class intervals:


The decision on the number of class groupings depends largely on the judgment of the
individual investigator and/or the range that will be used to group the data, although there are
certain guidelines that can be used. As a general rule, a frequency distribution should have at
least five class intervals (groups), but not more than fifteen. The following two rules are often
used to decide approximate number of classes in a frequency distribution:
(i) If K represents the number of classes and N the total number of observations, then K is the
smallest exponent of 2 such that $2^K \ge N$.
For the overtime data above we have N = 30 observations. Applying this rule gives
$2^3 = 8\ (< 30)$; $2^4 = 16\ (< 30)$; $2^5 = 32\ (> 30)$
Thus, we may choose K = 5 as the number of classes.
(ii) According to Sturges' rule, the number of classes can be determined by the formula
$K = 1 + 3.322 \log_{10} N$
where K is the number of classes and $\log_{10} N$ is the base-10 logarithm of the total number of
observations. Applying this rule to the same data we get
$K = 1 + 3.322 \log_{10} 30 = 1 + 3.322\,(1.4771) = 5.907 \cong 6$
2. Determine the width of class intervals
When constructing the frequency distribution, it is desirable that the width of each class
interval should be equal in size. The size (or width) of each class interval can be determined by
first taking the difference between the largest and smallest numerical values in the data set and
then dividing it by the number of class intervals desired.

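Width of class interval: $h = \dfrac{\text{Largest numerical value} - \text{Smallest numerical value}}{\text{Number of classes desired}}$

For the overtime data, with five classes, the width rounded up to the next whole number is

$h = \dfrac{95 - 84}{5} = \dfrac{11}{5} = 2.2 \cong 3$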
3. Determine Class Limits (Boundaries)
The limits of each class interval should be clearly defined so that each observation
(element) of the data set belongs to one and only one class. Each class has two limits: a lower
limit and an upper limit. The usual practice is to let the lower limit of the first class be a
convenient number slightly below or equal to the lowest value in the data set.
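The three steps above can be sketched in Python for the overtime data. This is an illustration under stated assumptions, not a prescribed procedure: Sturges' rule is taken with base-10 logarithms and the constant 3.322 (approximately 1 / log10 2), the width is rounded up to a whole number, and the first lower limit is set one unit below the smallest value so that five inclusive classes cover the data.

```python
import math

overtime_hours = [
    94, 89, 88, 89, 90, 94, 92, 88, 87, 85,
    88, 93, 94, 93, 94, 93, 92, 88, 94, 90,
    93, 84, 93, 84, 91, 93, 85, 91, 89, 95,
]
n = len(overtime_hours)

# Step 1, rule (i): smallest K such that 2**K >= N.
k_power = 0
while 2 ** k_power < n:
    k_power += 1

# Step 1, rule (ii): Sturges' rule (assumed form: 1 + 3.322 * log10(N)).
k_sturges = 1 + 3.322 * math.log10(n)

# Step 2: width = range / number of classes, rounded up to a whole number.
num_classes = k_power
low, high = min(overtime_hours), max(overtime_hours)
width = math.ceil((high - low) / num_classes)

# Step 3: inclusive class limits. Starting one unit below the minimum is an
# assumption made here so that num_classes classes cover every observation.
first_lower = low - 1
classes = [
    (first_lower + i * width, first_lower + (i + 1) * width - 1)
    for i in range(num_classes)
]

# Each observation falls in exactly one class because the limits do not overlap.
grouped = {
    (lower, upper): sum(lower <= x <= upper for x in overtime_hours)
    for lower, upper in classes
}

print("K from 2**K >= N:", k_power)
print("K from Sturges' rule:", round(k_sturges, 3))
print("Class width:", width)
for (lower, upper), f in grouped.items():
    print(f"{lower}-{upper}: {f}")
```

A different starting limit (for example 84, the smallest value itself) gives a different but equally valid grouping; the choice is left to the investigator's judgment, as noted in step 1.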

Bivariate Frequency Distribution


The frequency distributions discussed so far involved only one variable and are therefore called
univariate frequency distributions. In case the data involve two variables (such as profit and
expenditure on advertisements of a group of companies, income and expenditure of a group of
individuals, supply and demand of a commodity, etc.), the frequency distribution obtained as a
result of cross classification is called a bivariate frequency distribution. It can be summarized in the
form of a two-way (bivariate) frequency table and the values of each variable are grouped into
various classes (not necessarily same for each variable) in the same way as for univariate
distributions.
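A two-way table of this kind can be sketched as follows. The (income, expenditure) pairs and the class limits below are invented purely for illustration; they do not come from these notes. Exclusive classes (lower limit included, upper limit excluded) are assumed.

```python
from collections import Counter

# Hypothetical (income, expenditure) pairs in Rs. thousand, for illustration only.
pairs = [
    (12, 8), (18, 15), (25, 12), (14, 9), (27, 22),
    (22, 18), (16, 14), (29, 24), (11, 7), (24, 16),
]

# Exclusive classes: lower <= value < upper. The classes need not be the
# same for the two variables.
income_classes = [(10, 20), (20, 30)]
expense_classes = [(5, 15), (15, 25)]


def class_of(value, classes):
    """Return the (lower, upper) class that contains value."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    raise ValueError(f"{value} does not fall in any class")


# Cross-classify: each pair contributes one count to one cell of the table.
table = Counter(
    (class_of(income, income_classes), class_of(expense, expense_classes))
    for income, expense in pairs
)

# Print the two-way table with row totals.
print("Income \\ Expenditure", *[f"{lo}-{hi}" for lo, hi in expense_classes], "Total")
for row in income_classes:
    cells = [table[(row, col)] for col in expense_classes]
    print(f"{row[0]}-{row[1]}", *cells, sum(cells))
```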
Assignment 1:
The following set of numbers represents mutual fund prices reported at the end of a week for
selected 40 nationally sold funds.
10 17 15 22 11 16 19 24 29 18
25 26 32 14 17 20 23 27 30 12
15 18 24 36 18 15 21 28 33 38
34 13 10 16 20 22 29 29 23 31
Arrange these prices into a frequency distribution having a suitable number of classes.

Assignment 2:
A computer company received a rush order for as many home computers as could be shipped
during a 6-week period. Company records provide the following daily shipments:
22 65 65 67 55 50 65
77 73 30 62 54 48 65
79 60 63 45 51 68 79
83 33 41 49 28 55 61
65 75 55 75 39 87 45
50 66 65 59 25 35 53
Group these daily shipments into a frequency distribution having the suitable number of classes.

Assignment 3:
Following are the number of items of similar type produced in a factory during the last 50 days.
21 22 17 23 27 15 16 22 15 23
24 25 36 19 14 21 24 25 14 18
20 31 22 19 18 20 21 20 36 18
21 20 31 22 19 18 20 20 24 35
25 26 19 32 22 26 25 26 27 22
Arrange these observations into a frequency distribution with both inclusive and exclusive class
intervals choosing a suitable number of classes.

Assignment 4:
Following are the number of two wheelers sold by a dealer during eight weeks of six working days
each.
13 19 22 14 13 16 19 21
23 11 27 25 17 17 13 20
23 17 26 20 24 15 20 21
23 17 29 17 19 14 20 20
10 22 18 25 16 23 19 20
21 17 18 24 21 20 19 26
(a) Group these figures into a table having the classes 10–12, 13–15, 16–18, . . ., and 28–30.
(b) Convert the distribution into a corresponding percentage frequency distribution and
also a percentage cumulative frequency distribution.
Tabulation of Data
‘Tabulation’ is the systematic arrangement of data in rows and columns. Tabulation is the final
stage in collection and compilation of data, and is a sort of stepping-stone to the analysis and
interpretation of figures. The importance of proper tabulation is very great because if the tabulation
of data is not satisfactory its analysis will not only be difficult but defective also.
Parts of a Table
Presenting data in a tabular form is an art. A statistical table should contain all the requisite
information in a limited space but without any loss of clarity. There are variations in practice.
A blank model table is given below:

Types of Tables
The classification of tables depends on various aspects: objectives and scope of investigation,
nature of data (primary or secondary) for investigation, extent of data coverage, and so on. The
different types of tables used in statistical investigations are as follows:
Diagrammatic presentation of data
Diagrams play an important role in statistical data presentation. Diagrammatic data presentation
allows us to understand the data in an easier manner. According to P. Maslov, ‘Diagrams are drawn
for two purposes (i) to permit the investigator to grasp the essence of the phenomenon he is
observing, and (ii) to permit others to see the results at a glance, i.e., for the purpose of
popularization.’

General Rules for Drawing Diagrams


To draw useful inferences from graphical presentation of data, it is important to understand
how diagrams are prepared and how they should be interpreted. Although it is said that ‘one picture is
worth a thousand words’, a diagram neither proves nor disproves a particular fact, nor is it suitable for
further analysis of data. However, if diagrams are properly drawn, they highlight the different
characteristics of data. The following general guidelines are taken into consideration while
preparing diagrams:
Title: Each diagram should have a suitable title. It may be given either at the top of the diagram or
below it. The title must convey the main theme which the diagram intends to portray.
Size: The size and portion of each component of a diagram should be such that all the relevant
characteristics of the data are properly displayed and can be easily understood.
Proportion of length and breadth: An appropriate proportion between the length and breadth of
the diagram should be maintained. As such there are no fixed rules about the ratio of length to width.
However, a ratio of 1.414 (long side) : 1 (short side) suggested by Lutz in his book Graphic
Presentation may be adopted as a general rule.
Proper scale: There are again no fixed rules for selection of scale. The diagram should neither be
too small nor too large. The scale for the diagram should be decided after taking into consideration
the magnitude of data and the size of the paper on which it is to be drawn. The scale showing the
values, as far as possible, should be in even numbers or in multiples of 5, 10, 20, and so on. The scale
should specify the size of the unit and the nature of data it represents, for example, ‘millions of
tonnes’, ‘Rs. thousand’, and the like. The scale adopted should be indicated on both vertical and
horizontal axes if different scales are used. Otherwise, it can be indicated at some suitable place on
the graph paper.
Footnotes and source note: To clarify or elucidate any points which need further explanation but
cannot be shown in the graph, footnotes are given at the bottom of the diagrams.
Index: A brief index explaining the different types of lines, shades, designs, or colours used in the
construction of the diagram should be given to understand its contents.
Simplicity: Diagrams should be prepared in such a way that they can be understood easily. To keep
it simple, too much information should not be loaded in a single diagram as it may create confusion.
Thus, if the data are large, then it is advisable to prepare more than one diagram, each depicting
some identified characteristic of the same data.
Types of Diagrams
There are a variety of diagrams used to represent statistical data. Different types of diagrams,
used to describe sets of data, are divided into the following categories:
• Dimensional diagrams
(i) One dimensional diagrams: They are in the shape of vertical or horizontal lines or bars.
The lengths of the lines or bars are in proportion to the different figures they represent
(ii) Two-dimensional diagrams: They are in the shape of rectangles, squares or circles. The
areas of squares, rectangles or circles are in proportion to the size of items which they
represent.
(iii) Three dimensional diagrams: They are in the shape of cubes, blocks, or cylinders. Here
the volumes of cubes, blocks or cylinders are in proportion to given values.
• Pictograms or Ideographs: Here the figures are represented by pictures. The size or the number
of pictures is in proportion to the given figures.
• Cartographs or Statistical maps: Here maps are drawn and the figures representing the
phenomena at various places are shown by signs or symbols.

One dimensional diagrams


As has been said earlier in these diagrams only the length of the bars or lines is taken into
account. Since only one dimension of the figure is taken into account these diagrams are known as
One-dimensional diagrams. The bars which are drawn can be of any width or thickness. It has no
effect on the diagram. However, the thickness should not be too much as otherwise bars would
appear like rectangles and give a misleading impression. Such diagrams are also known as Bar
Diagrams. The various types of bar diagrams or bar charts are:
(i) Simple bar charts
(ii) Multiple bar charts
(iii) Subdivided bar charts
(iv) Deviation bar charts
(v) Paired bar charts
(vi) Sliding bar charts
(vii) Relative frequency bar charts
(viii) Percentage bar charts
In simple bar diagrams, one bar represents only one figure and as such there will be as many bars
as the number of figures. Such diagrams represent only one particular type of data. Multiple bar
diagrams are prepared on the basis of simple bar diagrams. These diagrams represent more than
one type of data at a time. The third type of bar diagrams are those in which each bar is divided into
certain parts and each part represents a particular phenomenon.
Simple bar charts
Bar charts are used to represent only one characteristic of data and there will be as many bars
as number of observations. For example, the data obtained on the production of oil seeds in a
particular year can be represented by such bars. Each bar would represent the yield of a particular
oil seed in that year. Since the bars are of the same width and only the length varies, the
relationship among them can be easily established.
Example: The data on the production of oil seeds in a particular year are presented in the
following table. Represent these data by a suitable bar chart.
Oil seed      Yield (million tonnes)    Percentage of production (%)
Ground nut    5.8                       43.03
Rapeseed      3.3                       24.48
Coconut       1.18                      8.75
Cotton        2.2                       16.32
Soyabean      1                         7.42
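The percentage column can be checked, and a rough text version of the bar chart drawn, with a short Python sketch. The scale of five characters per million tonnes is an arbitrary choice made here for display.

```python
# Yields in million tonnes, copied from the table above.
yields = {
    "Ground nut": 5.8,
    "Rapeseed": 3.3,
    "Coconut": 1.18,
    "Cotton": 2.2,
    "Soyabean": 1.0,
}

total = sum(yields.values())

# Percentage share of each oil seed in the total yield, to two decimals.
shares = {seed: round(100 * y / total, 2) for seed, y in yields.items()}

# Simple bar chart: equal-width bars whose lengths are proportional to yield.
# The factor 5 (characters per million tonnes) is an illustrative scale.
for seed, y in yields.items():
    bar = "#" * round(y * 5)
    print(f"{seed:<11} {bar:<30} {y:>5} ({shares[seed]}%)")
```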

Multiple Bar Charts


A multiple bar chart is also known as grouped (or compound) bar chart. Such charts are useful
for direct comparison between two or more sets of data. The technique of drawing such a chart is
the same as that of a single bar chart, with the difference that each set of data is represented in different
shades or colours on the same scale. An index explaining shades or colours must be given.
Example: The data on fund flow (in Rs. crore) of an International Airport Authority during
financial years 2001–02 to 2003–04 are given below:
Years                  2001-02   2002-03   2003-04
Non-traffic revenue    40.00     50.75     70.25
Traffic revenue        70.25     80.75     110.00
Profit before tax      40.15     50.50     80.25
Represent this data by a suitable bar chart.
Subdivided Bar Chart
Subdivided bar charts are suitable for expressing information in terms of ratios or percentages.
For example, net per capita availability of food grains, results of a college faculty-wise in last few
years, and so on. While constructing these charts the various components in each bar should be in
the same order to avoid confusion. Different shades must be used to represent various ratio values
but the shade of each component should remain the same in all the other bars. An index of the
shades should be given with the diagram.
A common arrangement while making these charts is that of presenting each bar in order
of magnitude from the largest component at the base of the bar to the smallest at the end. Since
the different components of the bars do not start on the same scale, the individual bars are to be
studied properly for their mutual comparisons.
Example: The data on sales (Rs. in million) of a company are given below:
Years     2005   2006   2007
Export    1.4    1.8    2.29
Home      1.6    2.7    2.9
Total     3.0    4.5    5.18
Represent this data by a suitable bar chart.
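To draw a subdivided bar, each component has to start where the previous one ends. The sketch below computes those start and end points for the sales data, stacking the larger component (Home) at the base in every year. Heights are computed from the two components, so for 2007 the bar reaches 5.19, which differs slightly from the listed total of 5.18. The function name `segments` is illustrative.

```python
# Sales in Rs. million, copied from the table above.
sales = {
    2005: {"Export": 1.4, "Home": 1.6},
    2006: {"Export": 1.8, "Home": 2.7},
    2007: {"Export": 2.29, "Home": 2.9},
}

# Same order in every bar; Home is the larger component in all three years.
order = ["Home", "Export"]


def segments(components, order):
    """Return (name, start, end) for each component stacked in the given order."""
    result = []
    base = 0.0
    for name in order:
        top = base + components[name]
        result.append((name, base, top))
        base = top
    return result


for year, components in sales.items():
    parts = segments(components, order)
    described = ", ".join(f"{name} {start:.2f} to {end:.2f}" for name, start, end in parts)
    print(f"{year}: {described}")
```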
Line diagram:
Sometimes only lines are drawn for comparison of given variable values. Such lines are not thick
and their number is sufficiently large. The different measurements to be shown should not have too
much difference, so that the lines may not show too much dissimilarity in their heights. Such charts
are used to economize space, especially when the number of observations is large. The lines may be
either vertical or horizontal depending upon the type of variable: numerical or categorical.
Example: An advertising company kept an account of response letters received each day over a
period of 50 days. The observations were:
0 2 1 1 1 2 0 0 1 0 1 0 0 1 0 1 1 0
2 0 0 2 0 1 0 1 0 1 0 3 1 0 1 0 1 0
2 5 1 2 0 0 0 0 5 0 1 1 2 0
Construct a frequency table and draw a line chart (or diagram) to present the data.
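A minimal sketch of the frequency table and a text-only line chart for these 50 observations follows. Each horizontal run of dashes stands in for one line of the diagram, with its length equal to the frequency.

```python
from collections import Counter

# Response letters received per day over 50 days, copied from the data above.
letters = [
    0, 2, 1, 1, 1, 2, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
    2, 0, 0, 2, 0, 1, 0, 1, 0, 1, 0, 3, 1, 0, 1, 0, 1, 0,
    2, 5, 1, 2, 0, 0, 0, 0, 5, 0, 1, 1, 2, 0,
]

frequency = Counter(letters)

# Frequency table plus one thin horizontal line per value of the variable.
print("Letters  Days")
for value in range(max(letters) + 1):
    f = frequency[value]
    print(f"{value:>7}  {f:>4}  {'-' * f}")
```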
Pie Diagram
These diagrams are normally used to show the total number of observations of different types in
the data set on a percentage basis rather than on an absolute basis through a circle. Usually, the
largest percentage portion of data in a pie diagram is shown first at 12 o'clock position on the circle,
whereas the other observations (in per cent) are shown in clockwise succession in descending
order of magnitude.
The steps to draw a pie diagram are summarized below:
(i) Convert the various observations (in per cent) in the data set into corresponding degrees in
the circle by multiplying each by 3.6 (360 ÷ 100).
(ii) Draw a circle of appropriate size with a compass.
(iii) Draw points on the circle according to the size of each portion of the data with the help of a
protractor and join each of these points to the center of the circle.
The pie chart has two distinct advantages: (i) it is aesthetically pleasing and (ii) it shows that
the total for all categories or slices of the pie adds to 100%.
Example: The data below show the market share (in per cent) by revenue of the following
companies in a particular year. Draw a pie diagram for these data.
Company                  Market share (%)
Batata–BPL               30
Hutchison–Essar          26
Bharti–Sing Tel          19
Modi Dista Com           12
Escorts-First Pacific    5
Reliance                 3
RPG                      2
Srinivas                 2
Shyam                    1
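Step (i) of the procedure, converting percentages into degrees, can be sketched for this example as follows. The slices are ordered from largest to smallest and laid out clockwise from the 12 o'clock position, as described above; angles are measured in degrees clockwise from 12 o'clock, which is a convention chosen here.

```python
# Market share in per cent, copied from the example above.
market_share = {
    "Batata–BPL": 30,
    "Hutchison–Essar": 26,
    "Bharti–Sing Tel": 19,
    "Modi Dista Com": 12,
    "Escorts-First Pacific": 5,
    "Reliance": 3,
    "RPG": 2,
    "Srinivas": 2,
    "Shyam": 1,
}

# Each per cent corresponds to 360 / 100 = 3.6 degrees of the circle.
sweep = {company: pct * 360 / 100 for company, pct in market_share.items()}

# Start and end angle of each slice, largest share first.
slices = {}
start = 0.0
for company, pct in sorted(market_share.items(), key=lambda item: item[1], reverse=True):
    end = start + sweep[company]
    slices[company] = (start, end)
    start = end

for company, (begin, end) in slices.items():
    print(f"{company:<22} {sweep[company]:>6.1f} degrees  ({begin:.1f} to {end:.1f})")
```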
Histograms
Histograms are the most useful and common graphs for displaying continuous data. A histogram
represents a frequency distribution as a vertical bar chart:

• If the class intervals chosen for the frequency distribution are equal in size, then the bars in
the histogram are drawn equal in width, and the height of each bar then represents the
frequency of the corresponding interval. In this text we always use class intervals that are
equal in size.
• If the class intervals chosen for the frequency distribution are not equal in size, then the bars
of the bar chart will vary in width, (representing the class interval), and height (representing
the frequency density). The area under the bar will be equal to the frequency of the
corresponding interval. This then gives a visual representation of the number of data points
in each interval.
Histograms represent the frequency distribution of continuous data and so are drawn with no
gaps between the bars. Bar charts used to represent categorical data are drawn with gaps
between the bars and are not histograms.
A histogram can show the shape of the distribution, spread or variability, central location of the
data and any unusual observations such as outliers. Histograms are very sensitive to the selection
of class intervals. In practice we prepare several histograms and select the one that represents the
data the best; that is, the one that highlights the key features of the data most clearly.
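For unequal class intervals, the bar height is the frequency density, so that the area of each bar equals the class frequency. A small sketch with invented classes and frequencies (not taken from these notes) shows the calculation:

```python
# Hypothetical unequal classes (lower, upper) with their frequencies,
# invented for illustration only.
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 12, 16, 8]

rows = []
for (lower, upper), f in zip(classes, frequencies):
    width = upper - lower
    density = f / width          # bar height for unequal class widths
    area = density * width       # bar area, which recovers the frequency
    rows.append((lower, upper, width, f, density, area))

print("Class    Width  Frequency  Density  Area")
for lower, upper, width, f, density, area in rows:
    print(f"{lower:>2}-{upper:<3}  {width:>5}  {f:>9}  {density:>7.2f}  {area:>4.0f}")
```

Here the class 20-40 has the highest frequency (16) but not the tallest bar, because its width is twice that of 10-20; plotting raw frequencies as heights would overstate it.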

The above data can be converted into the following frequency distribution table as

A histogram of the per capita GDP data corresponding to the frequency distribution in table is
From the histogram we see that the GDP values are between $0 and $100 000. Most of the GDP
values fall between $20 000 and $40 000. The histogram is not symmetrical; that is, the right half is
not the mirror image of the left half. Most of the data lie below $40 000, with only a few values above
this. We call such data right skewed or positively skewed. Two countries have rather large per capita
GDP values. These are away from the bulk of the data and are considered to be outliers.
In statistics, it is often useful to determine whether data are approximately normally distributed
(bell shaped) as we can see by examining the histogram.

Frequency polygons
A frequency polygon is formed by marking the mid-point at the top of each bar of the histogram and
then joining these dots by a series of straight lines. The frequency polygon is formed as a closed
figure with the horizontal axis; therefore straight lines are drawn from the mid-point of
the top of the first and the last rectangles to the mid-point, on the horizontal axis, of the
next outlying interval with zero frequency.
A frequency polygon can also be converted back into a histogram by drawing vertical lines
from the bounds of the classes shown on the horizontal axis, and then connecting them with
horizontal lines at the heights of the polygon at each mid-point.
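The vertices of a frequency polygon can be listed with a short sketch. The classes and frequencies below are invented for illustration, equal class widths are assumed, and the helper name `polygon_points` is hypothetical.

```python
def polygon_points(classes, frequencies):
    """Return the (x, y) vertices of a frequency polygon.

    classes: list of (lower, upper) boundaries of equal width.
    frequencies: class frequencies in the same order.
    The polygon is closed by adding the mid-points of the zero-frequency
    classes just before the first class and just after the last class.
    """
    width = classes[0][1] - classes[0][0]
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    closing_left = (midpoints[0] - width, 0)
    closing_right = (midpoints[-1] + width, 0)
    return [closing_left] + list(zip(midpoints, frequencies)) + [closing_right]


# Hypothetical grouped data, for illustration only.
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [4, 9, 6, 1]

points = polygon_points(classes, frequencies)
for x, y in points:
    print(f"mid-point {x:>5}: frequency {y}")
```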

Frequency Curve
It is described as a smooth frequency polygon as shown in the following figure. A frequency curve
is described in terms of its (i) symmetry (skewness) and (ii) degree of peakedness (kurtosis). Two
or more frequency distributions can also be compared by superimposing their frequency curves,
provided the width of their class intervals and the total number of frequencies are equal for the
given distributions. Even if the distributions to be compared differ in terms of total frequencies, they
still can be compared by drawing per cent frequency curves where the vertical axis measures the
per cent class frequencies and not the absolute frequencies.
Cumulative Frequency Distribution (Ogive)
It enables us to see how many observations lie above or below certain values rather than merely
recording the number of observations within intervals.
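As an illustration, the cumulative frequencies for the overtime-hours distribution given earlier in these notes can be computed as below. Both the "less than or equal to" and the "or more" forms are shown, together with the percentage cumulative frequency; the variable names are illustrative.

```python
from itertools import accumulate

# Ungrouped frequency distribution of overtime hours (from the earlier table).
values = list(range(84, 96))
frequencies = [2, 2, 0, 1, 4, 3, 2, 2, 2, 6, 5, 1]

# Number of weeks with overtime less than or equal to each value.
less_or_equal = list(accumulate(frequencies))
total = less_or_equal[-1]

# Number of weeks with overtime equal to or more than each value.
or_more = [total - c + f for c, f in zip(less_or_equal, frequencies)]

# Percentage cumulative frequency (less-than-or-equal form).
percent_cumulative = [round(100 * c / total, 1) for c in less_or_equal]

print("Hours  f  <=  >=  % <=")
for v, f, le, ge, pc in zip(values, frequencies, less_or_equal, or_more, percent_cumulative):
    print(f"{v:>5} {f:>2} {le:>3} {ge:>3} {pc:>5}")
```

Plotting `less_or_equal` (or `or_more`) against `values` and joining the points gives the corresponding ogive.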
Stem-and-Leaf Diagram
It is a graphical display of the numerical values in the data set and separates these values into
leading digits (or stem) and trailing digits (or leaves). The steps required to construct a
stem-and-leaf diagram are as follows:
1. Divide each numerical value between the ones and the tens place. The number to the left is
the stem and the number to the right is the leaf. The stem contains all but the last of the
displayed digits of a numerical value. As with histograms, it is reasonable to have between 6
and 15 stems (each stem defines an interval of values). The stem should define equally spaced
intervals. Stems are located along the vertical axis.
Sometimes numerical values in the data set are truncated or rounded off. For example, the
number 15.69 is truncated to 15.6 but it is rounded off to 15.7.
2. List the stems in a column with a vertical line to their right.
3. For each numerical value, attach a leaf to the appropriate stem in the same row (horizontal
axis). A leaf is the last of the displayed digits of a number. It is standard, but not mandatory,
to put the leaves in increasing order at each stem value.
4. Provide a key to stem and leaf coding so that actual numerical value can be re-created, if
necessary.
Note: If all the numerical values are three-digit integers, then to form a stem-and-leaf diagram,
two approaches are followed:
1. Use the hundreds column as the stems and the tens column as the leaves and ignore the
units column.
2. Use the hundreds column as the stems and the tens column as the leaves after rounding off
the units column.
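The four construction steps can be sketched for two-digit data. The values used are the ten get-ready times (in minutes) from the box plot example later in these notes; with so few values there are fewer stems than the 6 to 15 suggested above, which is acceptable for a demonstration.

```python
from collections import defaultdict

# Times to get ready (minutes), from the box plot example in these notes.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# Step 1: split each value between the tens place (stem) and the ones place (leaf).
stems = defaultdict(list)
for t in sorted(times):          # sorting first puts leaves in increasing order
    stem, leaf = divmod(t, 10)
    stems[stem].append(leaf)

# Steps 2 and 3: list the stems in a column and attach the leaves to each stem.
# Stems with no leaves are still printed so that the intervals stay equally spaced.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")

# Step 4: a key so that the actual values can be re-created.
print("Key: 3 | 9 means 39")
```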
Box Plot
A box-and-whisker plot, sometimes called a boxplot, is a diagram that uses five summary
measures (the first and third quartiles, along with the median and the two most extreme values not
deemed outliers) to depict a distribution graphically. The plot is constructed by using a box to
represent the middle 50% of the data and lines to indicate the remaining 50%. This box begins at
the first quartile and extends to the third quartile. These box endpoints (Q1 and Q3) are referred to
as the hinges of the box. A line within the box represents the location of the median. From the first
and third quartiles, lines referred to as whiskers are extended out from the box towards the
outermost data values that are not deemed outliers. If there are no outliers in the data, the whiskers
extend to the minimum and maximum values. Box-and-whisker plots may be drawn vertically or horizontally.

Example: If you knew the typical time it takes you to get ready in the morning, you might be able
to better plan your morning and minimize any excessive lateness (or earliness) going to your
destination. You first define the time to get ready as the time (rounded to the nearest minute)
from when you get out of bed to when you leave your home. Then, you collect the times shown
below for 10 consecutive workdays:
Day 1 2 3 4 5 6 7 8 9 10
Time (in min) 39 29 43 52 39 44 40 31 44 35
To further analyze the sample of 10 times to get ready in the morning, you can construct a boxplot
displayed as
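The five summary measures behind such a boxplot can be computed directly from the ten times. Two assumptions are made in this sketch, since the notes do not fix them: the quartiles are taken as the medians of the lower and upper halves of the sorted data (one of several conventions in use), and values more than 1.5 times the interquartile range beyond the quartiles are treated as outliers (a common rule of thumb).

```python
from statistics import median

# Times to get ready (minutes) for 10 consecutive workdays, from the table above.
times = sorted([39, 29, 43, 52, 39, 44, 40, 31, 44, 35])
n = len(times)

# Quartiles as medians of the lower and upper halves (assumed convention).
half = n // 2
lower_half = times[:half]
upper_half = times[half:] if n % 2 == 0 else times[half + 1:]
q1 = median(lower_half)
q2 = median(times)
q3 = median(upper_half)

# Outlier fences at 1.5 * IQR beyond the quartiles (assumed rule of thumb).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
inside = [t for t in times if lower_fence <= t <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Whisker low :", whisker_low)
print("Q1          :", q1)
print("Median      :", q2)
print("Q3          :", q3)
print("Whisker high:", whisker_high)
print("Outliers    :", outliers)
```

Under these assumptions there are no outliers, so the whiskers reach the minimum and maximum times.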
Scatter Plot
A scatter plot can explore the possible relationship between two numerical variables by plotting the
data of one numerical variable on the horizontal, or X, axis and the data of a second numerical
variable on the vertical, or Y, axis. For example, a marketing analyst could study the effectiveness of
advertising by comparing advertising expenses and sales revenues of 50 stores. Using a scatter plot,
a point is plotted on the two-dimensional graph for each store, using the X axis to represent
advertising expenses and the Y axis to represent sales revenues.
Example: To explore the possible relationship between the revenues generated by a team and the
value of a team, you can create a scatter plot.

For each team, you plot the revenues on the X axis and the values on the Y axis. The following
diagram shows the scatter plot for these two variables.
Assignment:
The following data represent the gross income, expenditure (in Rs. lakh), and net profit (in Rs.
lakh) during the years 1999–2002. Represent these data using a diagram or chart of your choice.
1999–2000 2000–2001 2001–2002
Gross income 570 592 632
Gross Expenditure 510 560 610
Net income 560 532 522
Assignment:
The following data are the monthly telephone bills for 38 households. Construct a stem-and-leaf
plot of the data.
137.9 136.4 142.6 146.0 145.4 145.3 143.5 137.3 132.2 141.8
145.1 144.9 143.7 144.9 144.9 136.9 139.1 137.8 135.8 136.6
144.6 140.3 143.2 133.2 139.2 140.5 142.9 138.8 144.6 145.4
146.7 144.0 142.4 132.8 145.9 141.6 138.9 139.8