After the data have been collected, the next step in a statistical investigation is the scrutiny of
the collected information. This is technically called ‘editing of data’. It is a necessary step because
in most cases the collected data contain various types of mistakes and errors. The word error is used
in a specialized sense in Statistics; it does not mean the same thing as mistake. A mistake in Statistics
means a wrong calculation or the use of an inappropriate method in the collection or analysis of data.
An error, on the other hand, means “the difference between the true value and the estimated value”,
and is called statistical error.
Editing involves a careful scrutiny of the completed questionnaires and/or schedules. Editing is
done to ensure that the data are accurate, consistent with other facts gathered, uniformly entered,
and as complete as possible.
Classification of data
Statistical data are classified on the basis of their characteristics, after taking into account the
nature, scope, and purpose of an investigation. Generally, data are classified on the following four
bases:
Geographical Classification:
In geographical classification, data are classified on the basis of geographical or locational
differences between the elements of the data set, such as cities, districts, or villages. Such a
classification is also known as spatial classification. Geographical classifications are generally listed
in alphabetical order. The following is an example of a geographical distribution:
Qualitative Classification:
In qualitative classification, data are classified on the basis of descriptive characteristics or on the
basis of attributes like sex, literacy, region, caste, or education, which cannot be quantified. This is
done in two ways:
(i) Simple classification: In this type of classification, each class is subdivided into two sub-classes
and only one attribute is studied such as: male and female; blind and not blind, educated and
uneducated, and so on.
Example: Population is divided into two sub-classes, Male and Female.
(ii) Manifold classification: In this type of classification, a class is subdivided into more than two
sub-classes which may be sub-divided further.
Quantitative Classification:
In this classification, data are classified on the basis of some characteristics which can be
measured such as height, weight, income, expenditure, production, or sales. The term variable
refers to any quantity or attribute whose value varies from one observation to another. Quantitative
variables can be divided into the following two types.
(i) A continuous variable is one that can take any value within a given range of numbers. Thus the
height or weight of individuals can be of any value within the limits. In such a case, data are
obtained by measurement.
(ii) A discrete (also called discontinuous) variable is one whose values change by steps or jumps
and cannot assume a fractional value. The number of children in a family, the number of workers
(or employees), and the number of students in a class are a few examples of a discrete variable. In
such a case, data are obtained by counting.
Frequency Distribution
A frequency distribution divides observations in the data set into conveniently established,
numerically ordered classes (groups or categories). The number of observations in each class is
referred to as the frequency, denoted by f. For example, the numbers of overtime hours recorded
in 30 weeks are listed below.
94 89 88 89 90 94 92 88 87 85
88 93 94 93 94 93 92 88 94 90
93 84 93 84 91 93 85 91 89 95
The frequency distribution of the number of hours of overtime is shown in the following table.
Number of Overtime Hours   Tally     Number of Weeks (Frequency)
84                         ||        2
85                         ||        2
86                                   0
87                         |         1
88                         ||||      4
89                         |||       3
90                         ||        2
91                         ||        2
92                         ||        2
93                         ||||||    6
94                         |||||     5
95                         |         1
Total                                30
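The tally above can be reproduced mechanically. The following is a minimal sketch in Python, assuming the 30 overtime values listed earlier; the function name frequency_table is chosen here only for illustration.

```python
from collections import Counter

# Overtime hours for 30 weeks, copied from the data listed in the text.
overtime_hours = [
    94, 89, 88, 89, 90, 94, 92, 88, 87, 85,
    88, 93, 94, 93, 94, 93, 92, 88, 94, 90,
    93, 84, 93, 84, 91, 93, 85, 91, 89, 95,
]


def frequency_table(values):
    """Return (value, frequency) pairs for every integer from min to max.

    Values that never occur (such as 86 here) are kept with frequency 0.
    """
    counts = Counter(values)
    return [(v, counts[v]) for v in range(min(values), max(values) + 1)]


table = frequency_table(overtime_hours)
for value, f in table:
    # One tally stroke per observation, then the frequency itself.
    print(f"{value:>5}  {'|' * f:<8} {f}")
print(f"Total  {'':<8} {sum(f for _, f in table)}")
```

Counting by machine is a useful check on a hand tally, because a missed or doubled stroke shows up as a total different from the number of observations.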
Constructing a Frequency Distribution
As the number of observations obtained gets larger, the method discussed above to condense
the data becomes quite difficult and time-consuming. Thus, to further condense the data into
frequency distribution tables, the following steps should be taken:
Assignment 2:
A computer company received a rush order for as many home computers as could be shipped
during a 6-week period. Company records provide the following daily shipments:
22 65 65 67 55 50 65
77 73 30 62 54 48 65
79 60 63 45 51 68 79
83 33 41 49 28 55 61
65 75 55 75 39 87 45
50 66 65 59 25 35 53
Group these daily shipments into a frequency distribution having a suitable number of classes.
Assignment 3:
The following are the numbers of items of a similar type produced in a factory during the last 50 days.
21 22 17 23 27 15 16 22 15 23
24 25 36 19 14 21 24 25 14 18
20 31 22 19 18 20 21 20 36 18
21 20 31 22 19 18 20 20 24 35
25 26 19 32 22 26 25 26 27 22
Arrange these observations into a frequency distribution with both inclusive and exclusive class
intervals, choosing a suitable number of classes.
Assignment 4:
The following are the numbers of two-wheelers sold by a dealer during eight weeks of six working
days each.
13 19 22 14 13 16 19 21
23 11 27 25 17 17 13 20
23 17 26 20 24 15 20 21
23 17 29 17 19 14 20 20
10 22 18 25 16 23 19 20
21 17 18 24 21 20 19 26
(a) Group these figures into a table having the classes 10–12, 13–15, 16–18, ..., and 28–30.
(b) Convert the distribution into a corresponding percentage frequency distribution and
also a percentage cumulative frequency distribution.
Tabulation of Data
‘Tabulation’ is the systematic arrangement of data in rows and columns. Tabulation is the final
stage in the collection and compilation of data, and is a stepping-stone to the analysis and
interpretation of figures. Proper tabulation is very important because, if the tabulation of data is
not satisfactory, its analysis will be not only difficult but also defective.
Parts of a Table
Presenting data in a tabular form is an art. A statistical table should contain all the requisite
information in a limited space but without any loss of clarity. There are variations in practice.
A blank model table is given below:
Types of Tables
The classification of tables depends on various aspects: objectives and scope of investigation,
nature of data (primary or secondary) for investigation, extent of data coverage, and so on. The
different types of tables used in statistical investigations are as follows:
Diagrammatic presentation of data
Diagrams play an important role in the presentation of statistical data. Diagrammatic presentation
makes the data easier to understand. According to P. Maslov, ‘Diagrams are drawn for two
purposes: (i) to permit the investigator to grasp the essence of the phenomenon he is
observing, and (ii) to permit others to see the results at a glance, i.e., for the purpose of
popularization.’
Histograms
Histograms are the most useful and common graphs for displaying continuous data. A histogram
represents a frequency distribution as a vertical bar chart:
• If the class intervals chosen for the frequency distribution are equal in size, then the bars in
the histogram are drawn equal in width, and the height of each bar then represents the
frequency of the corresponding interval. In this text we always use class intervals that are
equal in size.
• If the class intervals chosen for the frequency distribution are not equal in size, then the bars
will vary in width (representing the class interval) and in height (representing the frequency
density, that is, the frequency divided by the class width). The area of each bar will then be
equal to the frequency of the corresponding interval. This gives a visual representation of the
number of data points in each interval.
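The relation between bar height and bar area for unequal class intervals can be checked with a short calculation. The sketch below is in Python; the class limits and frequencies are hypothetical, chosen only to illustrate the arithmetic, and frequency_density is an illustrative name.

```python
# Hypothetical classes of unequal width: (lower limit, upper limit, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]


def frequency_density(lower, upper, frequency):
    """Height of the histogram bar: frequency per unit of class width."""
    return frequency / (upper - lower)


bars = []
for lower, upper, f in classes:
    height = frequency_density(lower, upper, f)
    area = height * (upper - lower)  # width times height gives back the frequency
    bars.append((lower, upper, f, height, area))
    print(f"{lower:>3}-{upper:<3} f={f:<3} height={height:.2f} area={area:.1f}")
```

In this made-up example the class 20-40 has the largest frequency but not the tallest bar, because its 16 observations are spread over a width of 20.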
Histograms represent the frequency distribution of continuous data and so are drawn with no
gaps between the bars. Bar charts used to represent categorical data are drawn with gaps
between the bars and are not histograms.
A histogram can show the shape of the distribution, spread or variability, central location of the
data and any unusual observations such as outliers. Histograms are very sensitive to the selection
of class intervals. In practice we prepare several histograms and select the one that represents the
data the best; that is, the one that highlights the key features of the data most clearly.
The above data can be converted into the following frequency distribution table.
A histogram of the per capita GDP data corresponding to the frequency distribution in the table is
shown below.
From the histogram we see that the GDP values are between $0 and $100 000. Most of the GDP
values fall between $20 000 and $40 000. The histogram is not symmetrical; that is, the right half is
not the mirror image of the left half. Most of the data lie below $40 000, with only a few values above
this. We call such data right skewed or positively skewed. Two countries have rather large per capita
GDP values. These are away from the bulk of the data and are considered to be outliers.
In statistics, it is often useful to determine whether data are approximately normally distributed
(bell shaped), which we can judge by examining the histogram.
Frequency polygons
A frequency polygon is formed by marking the mid-point of the top of each bar of a histogram and
then joining these points by a series of straight lines. A frequency polygon is drawn as a closed
figure with the horizontal axis; therefore, straight lines are drawn from the mid-points of the tops
of the first and the last rectangles to the mid-points, on the horizontal axis, of the adjacent
outlying intervals with zero frequency.
A frequency polygon can also be converted back into a histogram by drawing vertical lines
from the bounds of the classes shown on the horizontal axis, and then connecting them with
horizontal lines at the heights of the polygon at each mid-point.
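The points of a frequency polygon, including the two closing points on the horizontal axis, can be computed directly from the classes. The following Python sketch assumes equal class widths; the three classes are hypothetical and polygon_points is an illustrative name.

```python
def polygon_points(classes):
    """Return the (x, y) vertices of a frequency polygon.

    classes: list of (lower limit, upper limit, frequency) with equal widths.
    The polygon is closed by adding the mid-points of the empty classes
    just before the first class and just after the last one.
    """
    width = classes[0][1] - classes[0][0]
    points = [((lower + upper) / 2, f) for lower, upper, f in classes]
    first_mid = points[0][0]
    last_mid = points[-1][0]
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]


# Hypothetical distribution with classes 10-20, 20-30, 30-40.
example_classes = [(10, 20, 4), (20, 30, 9), (30, 40, 6)]
vertices = polygon_points(example_classes)
for x, y in vertices:
    print(x, y)
```

Joining these vertices in order with straight lines gives the polygon; the first and last vertices lie on the horizontal axis.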
Frequency Curve
A frequency curve is a smoothed frequency polygon, as shown in the following figure. A frequency
curve is described in terms of its (i) symmetry (skewness) and (ii) degree of peakedness (kurtosis).
Two or more frequency distributions can be compared by superimposing their frequency curves,
provided the widths of their class intervals and their total frequencies are equal. Even if the
distributions to be compared differ in total frequency, they can still be compared by drawing per
cent frequency curves, in which the vertical axis measures the per cent class frequencies and not
the absolute frequencies.
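The conversion to per cent frequencies is a one-line calculation. The Python sketch below uses two hypothetical distributions over the same classes but with different totals (20 and 50 observations); percent_frequencies is an illustrative name.

```python
def percent_frequencies(frequencies):
    """Convert absolute class frequencies into per cent of the total."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]


# Hypothetical class frequencies for two samples of different size.
small_sample = [4, 10, 6]     # total 20
large_sample = [10, 25, 15]   # total 50

small_pct = percent_frequencies(small_sample)
large_pct = percent_frequencies(large_sample)
print(small_pct)
print(large_pct)
```

On the absolute scale the two curves would differ in height everywhere; on the per cent scale they coincide, which shows that the two samples have the same shape.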
Cumulative Frequency Distribution (Ogive)
A cumulative frequency distribution enables us to see how many observations lie above or below
certain values, rather than merely recording the number of observations within intervals.
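Both kinds of cumulative frequency can be obtained from an ordinary frequency distribution by running sums. The Python sketch below assumes the overtime-hours frequencies tabulated earlier (values 84 to 95); the variable names are illustrative.

```python
from itertools import accumulate

# Overtime hours and their frequencies, taken from the frequency table in the text.
values = list(range(84, 96))
frequencies = [2, 2, 0, 1, 4, 3, 2, 2, 2, 6, 5, 1]

# "Less than or equal to" type: running total from the smallest value upwards.
less_or_equal = list(accumulate(frequencies))
total = less_or_equal[-1]

# "Or more" type: observations at this value or above it.
or_more = [total - c + f for c, f in zip(less_or_equal, frequencies)]

for v, f, below, above in zip(values, frequencies, less_or_equal, or_more):
    print(f"{v:>3}  f={f}  <= {below:>2}  >= {above:>2}")
```

Plotting either cumulative column against the values gives the corresponding ogive.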
Stem-and-Leaf Diagram
It is a graphical display of the numerical values in the data set and separates these values into
leading digits (or stems) and trailing digits (or leaves). The steps required to construct a
stem-and-leaf diagram are as follows:
1. Divide each numerical value between the ones and the tens place. The number to the left is
the stem and the number to the right is the leaf. The stem contains all but the last of the
displayed digits of a numerical value. As with histograms, it is reasonable to have between 6
and 15 stems (each stem defines an interval of values). The stems should define equally spaced
intervals. Stems are located along the vertical axis.
Sometimes numerical values in the data set are truncated or rounded off. For example, the
number 15.69 is truncated to 15.6 but it is rounded off to 15.7.
2. List the stems in a column with a vertical line to their right.
3. For each numerical value, attach a leaf to the appropriate stem in the same row (horizontal
axis). A leaf is the last of the displayed digits of a number. It is standard, but not mandatory,
to put the leaves in increasing order at each stem value.
4. Provide a key to the stem-and-leaf coding so that the actual numerical values can be re-created,
if necessary.
Note: If all the numerical values are three-digit integers, then to form a stem-and-leaf diagram,
two approaches are followed:
1. Use the hundreds column as the stems and the tens column as the leaves and ignore the
units column.
2. Use the hundreds column as the stems and the tens column as the leaves after rounding off
the units column.
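The four construction steps can be sketched in Python for two-digit integers. As sample input, the sketch reuses the 42 daily shipments listed under Assignment 2; the function name stem_and_leaf is illustrative, and three-digit or decimal data would first need the truncation or rounding described in the note above.

```python
from collections import defaultdict

# Daily shipments from Assignment 2, used here only as sample input.
shipments = [
    22, 65, 65, 67, 55, 50, 65,
    77, 73, 30, 62, 54, 48, 65,
    79, 60, 63, 45, 51, 68, 79,
    83, 33, 41, 49, 28, 55, 61,
    65, 75, 55, 75, 39, 87, 45,
    50, 66, 65, 59, 25, 35, 53,
]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # split between the tens and the ones place
        groups[stem].append(leaf)
    # Keep every stem from smallest to largest so the intervals stay equally spaced.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}


plot = stem_and_leaf(shipments)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 2 | 5 means 25")
```

Sorting before splitting puts the leaves in increasing order at each stem, which is the usual (though not mandatory) convention.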
Box Plot
A box-and-whisker plot, sometimes called a boxplot, is a diagram that uses five summary
measures (the first and third quartiles, along with the median and the two most extreme values not
deemed outliers) to depict a distribution graphically. The plot is constructed by using a box to
represent the middle 50% of the data and lines to indicate the remaining 50%. This box begins at
the first quartile and extends to the third quartile. These box endpoints (Q1 and Q3) are referred to
as the hinges of the box. A line within the box represents the location of the median. From the first
and third quartiles, lines referred to as whiskers are extended out from the box towards the
outermost data values that are not deemed outliers. If there are no outliers in the data, the whiskers
extend to the minimum and maximum values. Box-and-whisker plots may be drawn vertically or
horizontally.
Example: If you knew the typical time it takes you to get ready in the morning, you might be able
to better plan your morning and minimize any excessive lateness (or earliness) in reaching your
destination. You first define the time to get ready as the time (rounded to the nearest minute)
from when you get out of bed to when you leave your home. Then, you collect the times shown
below for 10 consecutive workdays (in minutes):
Day 1 2 3 4 5 6 7 8 9 10
Time (in min) 39 29 43 52 39 44 40 31 44 35
To further analyze the sample of 10 times to get ready in the morning, you can construct a boxplot.
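The five summary measures behind a boxplot of these 10 times can be computed as follows. This is a sketch under two assumptions that the text does not state: quartiles are taken with the default ("exclusive") method of Python's statistics.quantiles, and a value is deemed an outlier if it lies more than 1.5 times the interquartile range beyond a hinge, which is a common convention. Other quartile rules can give a slightly different Q1 for the same data.

```python
import statistics

# Times to get ready (in minutes) for 10 consecutive workdays, from the table in the text.
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]


def box_plot_summary(values, fence_factor=1.5):
    """Return the quantities needed to draw a box-and-whisker plot.

    fence_factor=1.5 is an assumed convention for deciding which values
    count as outliers; it is not taken from the text.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower_fence = q1 - fence_factor * iqr
    upper_fence = q3 + fence_factor * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "lower_whisker": inside[0],   # smallest value not deemed an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": inside[-1],  # largest value not deemed an outlier
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


summary = box_plot_summary(times)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions no time falls outside the fences, so the whiskers run to the minimum and maximum of the sample.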
Scatter Plot
A scatter plot can be used to explore the possible relationship between two numerical variables by
plotting the data of one numerical variable on the horizontal, or X, axis and the data of a second
numerical variable on the vertical, or Y, axis. For example, a marketing analyst could study the
effectiveness of advertising by comparing advertising expenses and sales revenues of 50 stores.
Using a scatter plot, a point is plotted on the two-dimensional graph for each store, using the X axis
to represent advertising expenses and the Y axis to represent sales revenues.
Example: To explore the possible relationship between the revenues generated by a team and the
value of a team, you can create a scatter plot.
For each team, you plot the revenues on the X axis and the values on the Y axis. The following
diagram shows the scatter plot for these two variables.
Assignment:
The following data represent the gross income, expenditure (in Rs. lakh), and net profit (in Rs.
lakh) during the years 1999–2002. Construct a diagram or chart of your choice to present these data.
Item                 1999–2000   2000–2001   2001–2002
Gross income         570         592         632
Gross Expenditure    510         560         610
Net income           560         532         522
Assignment:
The following data are the monthly telephone bills for 38 households. Construct a stem-and-leaf
plot of the data.
137.9 136.4 142.6 146.0 145.4 145.3 143.5 137.3 132.2 141.8
145.1 144.9 143.7 144.9 144.9 136.9 139.1 137.8 135.8 136.6
144.6 140.3 143.2 133.2 139.2 140.5 142.9 138.8 144.6 145.4
146.7 144.0 142.4 132.8 145.9 141.6 138.9 139.8