
Statistical Data
Types of Statistical Data: Numerical, Categorical, and Ordinal
Numerical data
Categorical data
Data Editing
Editing methods
Interactive Editing
Selective Editing
Macro Editing
Aggregation Method
Distribution Method
Automatic Editing
Frequency Distribution
Univariate frequency tables
Construction of Frequency Distributions
Joint Frequency Distributions
Applications
Continuous Series
Discrete Series
Frequency Curve
Diagrammatic Representation
Diagrams
Merits
Types of Diagrams
Multiple bar diagram
Component bar diagram
Percentage bar diagram
Pie chart / Pie Diagram
Mean
Median
Mode

Statistical Data
Statistics is the branch of mathematics that deals with the collection, organization, analysis, and
interpretation of numerical data.

Types of Statistical Data: Numerical, Categorical, and Ordinal

When working with statistics, it's important to recognize the different types of data: numerical
(discrete and continuous), categorical, and ordinal. Data are the actual pieces of information that
you collect through your study. For example, if you ask five of your friends how many pets they
own, they might give you the following data: 0, 2, 1, 4, 18. (The fifth friend might count each of
her aquarium fish as a separate pet.) Not all data are numbers; let's say you also record the
gender of each of your friends, getting the following data: male, male, female, male, female.

Most data fall into one of two groups: numerical or categorical.

Numerical data
These data have meaning as a measurement, such as a person's height, weight, IQ, or blood
pressure; or they're a count, such as the number of stock shares a person owns, how many teeth a
dog has, or how many pages you can read of your favorite book before you fall asleep.
(Statisticians also call numerical data quantitative data.)

Numerical data can be further broken into two types: discrete and continuous.

Discrete data represent items that can be counted; they take on possible values that can be listed
out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to
infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes
on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes
on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads).
Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).

Continuous data represent measurements; their possible values cannot be counted and can only
be described using intervals on the real number line. For example, the exact amount of gas
purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to
20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41,
or 8.414863 gallons, or any possible number from 0 to 20. In this way, continuous data can be
thought of as being uncountably infinite. For ease of recordkeeping, statisticians usually pick
some point in the number to round off. Another example would be that the lifetime of a C battery
can be anywhere from 0 hours to an infinite number of hours (if it lasts forever), technically,
with all possible values in between. Granted, you don't expect a battery to last more than a few
hundred hours, but no one can put a cap on how long it can go (remember the Energizer Bunny?).

Categorical data
Categorical data represent characteristics such as a person's gender, marital status, hometown,
or the types of movies they like. Categorical data can take on numerical values (such as 1
indicating male and 2 indicating female), but those numbers don't have mathematical meaning.
You couldn't add them together, for example. (Other names for categorical data are qualitative
data, or Yes/No data.)

Ordinal data mixes numerical and categorical data. The data fall into categories, but the
numbers placed on the categories have meaning. For example, rating a restaurant on a scale from
0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical,
where the groups are ordered when graphs and charts are made. However, unlike categorical
data, the numbers do have mathematical meaning. For example, if you survey 100 people and
ask them to rate a restaurant on a scale from 0 to 4, taking the average of the 100 responses will
have meaning. This would not be the case with categorical data.

Data Editing
Data editing is defined as the process involving the review and adjustment of collected survey
data. The purpose is to control the quality of the collected data. Data editing can be performed
manually, with the assistance of a computer or a combination of both.

Editing methods
Interactive Editing
The term interactive editing is commonly used for modern computer-assisted manual editing.
Most interactive data editing tools applied at National Statistical Institutes (NSIs) allow one to
check the specified edits during or after data entry, and if necessary to correct erroneous data
immediately. Several approaches can be followed to correct erroneous data:

Re-contact the respondent
Compare the respondent's data to their data from the previous year
Compare the respondent's data to data from similar respondents
Use the subject matter knowledge of the human editor

Interactive editing is a standard way to edit data. It can be used to edit both categorical and
continuous data. Interactive editing reduces the time frame needed to complete the cyclical
process of review and adjustment.

Selective Editing
Selective editing is an umbrella term for several methods of identifying influential errors and
outliers. Selective editing techniques aim to apply interactive editing to a well-chosen subset of
the records, so that the limited time and resources available for interactive editing are
allocated to those records where it has the most effect on the quality of the final estimates of
publication figures. In selective editing, data is split into two streams:

The critical stream

The non-critical stream

The critical stream consists of records that are more likely to contain influential errors. These
critical records are edited in a traditional interactive manner. The records in the non-critical
stream, which are unlikely to contain influential errors, are not edited in a computer-assisted
manner.

Macro Editing
There are two methods of macro editing.

Aggregation Method
This method is followed in almost every statistical agency before publication: verifying whether the
figures to be published seem plausible. This is accomplished by comparing quantities in publication
tables with the same quantities in previous publications. If an unusual value is observed, a
micro-editing procedure is applied to the individual records and fields contributing to the
suspicious quantity.

Distribution Method
The available data are used to characterize the distribution of the variables, and then all
individual values are compared with that distribution. Records containing values that could be
considered uncommon (given the distribution) are candidates for further inspection and possibly
for editing.
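
As a rough illustration of the distribution method, the sketch below characterizes a variable by its quartiles and flags values outside the 1.5 × IQR fences as candidates for inspection. The fence rule, the function name flag_uncommon and the turnover figures are assumptions made for this example; the text does not prescribe a particular rule for what counts as uncommon.

```python
import statistics

def flag_uncommon(values, k=1.5):
    """Return the values lying outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical reported turnover figures; one value looks out of line.
turnover = [102, 98, 110, 95, 105, 99, 101, 950, 97, 103]
suspects = flag_uncommon(turnover)
print(suspects)
```

Flagged values are only candidates: whether they are errors still has to be decided by inspection.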

Automatic Editing
In automatic editing, records are edited by a computer without human intervention. Prior
knowledge about the values of a single variable or a combination of variables can be formulated
as a set of edit rules that specify or constrain the admissible values.
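
To make the idea of edit rules concrete, here is a minimal sketch in which each rule is a named condition that an admissible record must satisfy. The field names (age, hours, marital_status) and the three rules are hypothetical, and the sketch only detects violations; how a record is then corrected is outside its scope.

```python
# Each edit rule is a (description, predicate) pair; a record passes a rule
# when the predicate returns True. Rules and field names are hypothetical.
EDIT_RULES = [
    ("age must be between 0 and 120", lambda r: 0 <= r["age"] <= 120),
    ("hours worked per week must be between 0 and 168",
     lambda r: 0 <= r["hours"] <= 168),
    ("a child under 15 cannot be married",
     lambda r: not (r["age"] < 15 and r["marital_status"] == "married")),
]

def failed_edits(record):
    """Return the descriptions of all edit rules that the record violates."""
    return [desc for desc, rule in EDIT_RULES if not rule(record)]

records = [
    {"age": 34, "hours": 40, "marital_status": "married"},
    {"age": 12, "hours": 0, "marital_status": "married"},
    {"age": 150, "hours": 200, "marital_status": "single"},
]
for rec in records:
    print(rec, "->", failed_edits(rec))
```

The first two rules constrain a single variable, while the third constrains a combination of variables, matching the two kinds of prior knowledge mentioned above.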

Frequency Distribution
In statistics, a frequency distribution is a table that displays the frequency of various outcomes
in a sample. Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table summarizes the distribution of
values in the sample.

Univariate frequency tables

An example of a univariate (i.e., single-variable) frequency table follows. It gives the frequency
of each response to a survey question.

Rank  Degree of agreement  Number
1     Strongly agree       20
2     Agree somewhat       30
3     Not sure             20
4     Disagree somewhat    15
5     Strongly disagree    15

A different tabulation scheme aggregates values into bins such that each bin encompasses a
range of values. For example, the heights of the students in a class could be organized into the
following frequency table.

Height range        Number of students  Cumulative number
less than 5.0 feet  25                  25
5.0-5.5 feet        35                  60
5.5-6.0 feet        20                  80
6.0-6.5 feet        20                  100
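
The cumulative column is just a running total of the class counts. A small sketch that recomputes it from the counts in the height table (the variable names are chosen for this example):

```python
from itertools import accumulate

classes = ["less than 5.0 feet", "5.0-5.5 feet", "5.5-6.0 feet", "6.0-6.5 feet"]
counts = [25, 35, 20, 20]

# Running total of the class counts gives the cumulative number.
cumulative = list(accumulate(counts))

for label, n, c in zip(classes, counts, cumulative):
    print(f"{label:<20}{n:>5}{c:>6}")
```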

A frequency distribution shows a summarized grouping of data divided into mutually exclusive
classes and the number of occurrences in each class. It is a way of organizing raw data, for
example to show the results of an election, the income of people in a certain region, sales of a
product within a certain period, or student loan amounts of graduates. Some of the graphs that
can be used with frequency distributions are histograms, line charts, bar charts and pie charts.
Frequency distributions are used for both qualitative and quantitative data.

Construction of Frequency Distributions

Decide on the number of classes. Too many or too few classes might not reveal the basic shape of
the data set, and such a frequency distribution will be difficult to interpret. The maximum number
of classes may be determined by the formula:

Number of classes = C = 1 + 3.3 log10(n)   or   C = sqrt(n) (approximately)

where n is the total number of observations in the data.

Calculate the range of the data (Range = Max - Min) by finding the minimum and maximum data
values. The range will be used to determine the class interval or class width.

Decide on the width of the class, denoted by h and obtained by h = Range / Number of classes.

Generally the class interval or class width is the same for all classes. The classes all taken
together must cover at least the distance from the lowest value (minimum) in the data set up to
the highest (maximum) value. Equal class intervals are preferred in a frequency distribution, but
unequal class intervals may be necessary in certain situations to avoid a large number of empty,
or almost empty, classes.

Decide the individual class limits and select a suitable starting point for the first class. The
starting point is arbitrary; it may be less than or equal to the minimum value. Usually it is
placed before the minimum value in such a way that the midpoint (the average of the lower and
upper class limits of the first class) is properly placed.

Take an observation and mark a vertical bar (|) against the class to which it belongs. A running
tally is kept until the last observation. Tallies are grouped in fives, so each completed group
indicates a count of five.

Find the frequencies, relative frequency, cumulative frequency etc. as required.[2]
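
The construction steps can be sketched in code as follows. This is a minimal sketch under stated assumptions: the number of classes is taken from the rule C = 1 + 3.3 log10(n), rounded up; the first class starts exactly at the minimum; the data are not all identical; and the exam scores are hypothetical.

```python
import math

def frequency_distribution(data):
    """Group data into equal-width classes and tally the observations."""
    n = len(data)
    num_classes = math.ceil(1 + 3.3 * math.log10(n))  # assumed rule for C
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes                    # h = range / C
    counts = [0] * num_classes
    for x in data:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((x - lo) / width), num_classes - 1)
        counts[idx] += 1
    table, running = [], 0
    for i, f in enumerate(counts):
        running += f
        # (lower limit, upper limit, frequency, relative freq., cumulative freq.)
        table.append((lo + i * width, lo + (i + 1) * width, f, f / n, running))
    return table

# Hypothetical exam scores.
scores = [52, 67, 71, 45, 88, 93, 60, 74, 79, 81, 66, 58, 70, 85, 77, 63]
table = frequency_distribution(scores)
for lower, upper, f, rel, cum in table:
    print(f"{lower:6.1f} to {upper:6.1f}  f={f}  rel={rel:.3f}  cum={cum}")
```

Each class here includes its lower limit and excludes its upper limit, except the last class, which also includes the maximum.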

Joint Frequency Distributions

Bivariate joint frequency distributions are often presented as (two-way) contingency tables:

Two-way contingency table with marginal frequencies

        Dance  Sports  TV  Total
Men     2      10      8   20
Women   16     6       8   30
Total   18     16      16  50

The total row and total column report the marginal frequencies or marginal distribution, while
the body of the table reports the joint frequencies.
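
The marginal frequencies follow from the joint frequencies by summing across rows and down columns. A short sketch using the body of the table above (the dictionary layout is an assumption of this example):

```python
# Joint frequencies: the body of the contingency table.
joint = {
    "Men":   {"Dance": 2,  "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6,  "TV": 8},
}

# Row totals: marginal distribution of the row variable.
row_totals = {group: sum(cells.values()) for group, cells in joint.items()}

# Column totals: marginal distribution of the column variable.
col_totals = {}
for cells in joint.values():
    for activity, count in cells.items():
        col_totals[activity] = col_totals.get(activity, 0) + count

grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```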

Applications
Managing and operating on frequency-tabulated data is much simpler than operating on raw data.
There are simple algorithms to calculate the median, mean, standard deviation, etc. from these
tables.
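
One possible form of such algorithms is sketched below for a table that maps each value to its frequency. The input is the survey table from earlier, with the ranks 1 to 5 treated as numbers; the function names are chosen for this example, and the standard deviation shown is the population version (dividing by n).

```python
import math

def table_mean(freq):
    """Mean of the data summarized by a {value: frequency} table."""
    n = sum(freq.values())
    return sum(x * f for x, f in freq.items()) / n

def table_std(freq):
    """Population standard deviation from a {value: frequency} table."""
    n = sum(freq.values())
    m = table_mean(freq)
    return math.sqrt(sum(f * (x - m) ** 2 for x, f in freq.items()) / n)

def table_median(freq):
    """Median from a {value: frequency} table, using cumulative counts."""
    n = sum(freq.values())
    # 1-based positions of the middle observation(s); equal when n is odd.
    targets = [(n + 1) // 2, (n + 2) // 2]
    found, running = [], 0
    for x in sorted(freq):
        running += freq[x]
        while len(found) < 2 and running >= targets[len(found)]:
            found.append(x)
    return sum(found) / 2

survey = {1: 20, 2: 30, 3: 20, 4: 15, 5: 15}
print(table_mean(survey), table_median(survey), table_std(survey))
```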

Statistical hypothesis testing is founded on the assessment of differences and similarities between
frequency distributions. This assessment involves measures of central tendency or averages, such
as the mean and median, and measures of variability or statistical dispersion, such as the standard
deviation or variance.

A frequency distribution is said to be skewed when it is asymmetric, which typically shows up as
a difference between its mean and median. The kurtosis of a frequency distribution is the
concentration of scores at the mean, or how peaked the distribution appears if depicted
graphically, for example, in a histogram. If the distribution is more peaked than the normal
distribution it is said to be leptokurtic; if less peaked it is said to be platykurtic.
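
One common way to put numbers on these ideas is through central moments. The sketch below uses the moment-based skewness and excess kurtosis; this choice of definition is an assumption, since the text does not say which measure is meant, and both data sets are hypothetical. With this definition a symmetric distribution has skewness 0, and positive or negative excess kurtosis corresponds to leptokurtic or platykurtic respectively.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment over variance squared, minus 3 (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```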

Letter frequency distributions are also used in frequency analysis to crack codes, and to compare
the relative frequencies of letters in different languages.
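
A letter frequency distribution is straightforward to compute. The sketch below counts letters in a sample sentence and converts the counts to relative frequencies; the function name and the sample text are made up for this example.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # most_common() orders the letters from most to least frequent.
    return {letter: count / total for letter, count in counts.most_common()}

sample = "Frequency analysis studies how often each letter occurs"
freqs = letter_frequencies(sample)
for letter, share in list(freqs.items())[:5]:
    print(letter, round(share, 3))
```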

Continuous Series
The series dealing with a continuous variable is called a continuous series. A continuous
variable is one that can assume any conceivable fractional value within the range of
possibilities, such as income, weight, profit, or length.

Discrete Series
The series dealing with a discrete variable is called a discrete series. A discrete variable
refers to a characteristic that cannot be expressed in fractions; it is a fractionless variable.
For example, there will be either one employee or two; there cannot be 1.5 employees. The number
of times each quantity or amount occurs is shown in front of that quantity or amount, and these
numbers are known as frequencies. By frequency, we mean the number of times a particular observed
value occurs in the universe.

Frequency Curve
A frequency curve is obtained by joining the points of a frequency polygon with a freehand
smoothed curve. Unlike the frequency polygon, where the points are joined by straight lines, the
points are joined freehand in order to get a smoothed frequency curve. It is used to remove the
ruggedness of the polygon and to present it in a good form or shape. We smooth only the
angularities of the polygon, without making any basic change in the shape of the curve. In this
case also the curve begins and ends at the base line, as in the case of the polygon. The area
under the curve must remain almost the same as in the case of the polygon.

Diagrammatic Representation
Diagrams are various geometrical shapes such as bars, circles, etc. Diagrams are based on scale
but are not confined to points or lines. They are more attractive and easier to understand than
graphs.

Merits

1. Most of the people are attracted by diagrams.

2. Technical Knowledge or education is not necessary.

3. Time and effort required are less.

4. Diagrams show the data in proper perspective.

5. Diagrams leave a lasting impression.

6. Language is not a barrier.

7. Widely used tool.

Demerits (or) limitations

1. Diagrams are approximations.

2. Minute differences in values cannot be represented properly in diagrams.

3. Large differences in values spoil the look of the diagram.

4. Some diagrams can be drawn by experts only, e.g. pie charts.

5. Different scales portray different pictures to laymen.

Types of Diagrams
The important diagrams are:

Simple bar diagram
Multiple bar diagram
Component bar diagram
Percentage bar diagram
Pie chart
Statistical maps or cartograms

In all the diagrams and graphs, the groups or classes are represented on the x-axis and the
volumes or frequencies are represented on the y-axis.

Simple Bar diagram

If the classification is based on attributes and the attributes are to be compared with respect
to a single character, we use a simple bar diagram. Examples:

1. The area under different crops in a state.
2. The food grain production of different years.
3. The yield performance of different varieties of a crop.
4. The effect of different treatments, etc.

Simple bar diagrams consist of vertical bars of equal width. The heights of these bars are
proportional to the volume or magnitude of the attribute. All bars stand on the same baseline.
The bars are separated from each other by equal intervals. The bars may be coloured or marked.


The cropping pattern in Tamil Nadu in the year 1974-75 was as follows.

Crops     Area (in 1,000 hectares)
Cereals   3940
Oilseeds  1165
Pulses    464
Cotton    249
Others    822

The simple bar diagram for this data is given below.
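
The proportionality rule behind the bars can be shown with a small text rendering of the same data. This is only a sketch: it draws horizontal bars of '#' characters rather than the vertical bars described above, and the scale of 50 characters for the largest bar is an arbitrary choice for this example.

```python
crops = {"Cereals": 3940, "Oilseeds": 1165, "Pulses": 464, "Cotton": 249, "Others": 822}

def bar_lengths(data, max_length=50):
    """Scale each value so that the largest bar has max_length characters."""
    largest = max(data.values())
    return {name: round(value / largest * max_length) for name, value in data.items()}

lengths = bar_lengths(crops)
for name, length in lengths.items():
    # Equal bar thickness (one line each), lengths proportional to the area.
    print(f"{name:<9}{'#' * length} {crops[name]}")
```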

Multiple bar diagram

If the data is classified by attributes and two or more characters or groups are to be compared
within each attribute, we use multiple bar diagrams. If only two characters are to be compared
within each attribute, the resulting bar diagram is known as a double bar diagram.

The multiple bar diagram is simply an extension of the simple bar diagram. For each attribute,
two or more bars representing separate characters or groups are placed side by side. Each bar
within an attribute is marked or coloured differently in order to distinguish them. The same type
of marking or colouring should be used under each attribute. A footnote has to be given
explaining the markings or colourings.


Draw a multiple bar diagram for the following data, which represent agricultural production for
the period from 2000-2004.

Year  Food grains (tonnes)  Vegetables (tonnes)  Others (tonnes)
2000  100                   30                   10
2001  120                   40                   15
2002  130                   45                   25
2003  150                   50                   25

Component bar diagram

This is also called a subdivided bar diagram. Instead of placing the bars for each component side
by side, we may place them one on top of the other. This results in a component bar diagram.


Draw a component bar diagram for the following data

Percentage bar diagram
Sometimes the volumes of different attributes differ so greatly that, to make meaningful
comparisons, the attributes are reduced to percentages. In that case each attribute will have 100
as its maximum volume. This sort of component bar chart is known as a percentage bar diagram.


Draw a Percentage bar diagram for the following data

Using the formula

Percentage = (Actual value / Total of the actual values) × 100

The above table is converted.
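
The conversion to percentages can be sketched as follows. Because the table for this example is not reproduced here, the sketch borrows the production figures from the multiple bar diagram example; that substitution, and the rule of dividing each component by its year's total and multiplying by 100, are assumptions of this illustration.

```python
# Production in tonnes, taken from the multiple bar diagram example.
production = {
    2000: {"Food grains": 100, "Vegetables": 30, "Others": 10},
    2001: {"Food grains": 120, "Vegetables": 40, "Others": 15},
    2002: {"Food grains": 130, "Vegetables": 45, "Others": 25},
    2003: {"Food grains": 150, "Vegetables": 50, "Others": 25},
}

def to_percentages(components):
    """Express each component as a percentage of the attribute's total."""
    total = sum(components.values())
    return {name: value / total * 100 for name, value in components.items()}

percentages = {year: to_percentages(c) for year, c in production.items()}
for year, shares in percentages.items():
    print(year, {name: round(p, 2) for name, p in shares.items()})
```

After conversion every year's components add up to 100, so all bars in the percentage bar diagram have the same height.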

Pie chart / Pie Diagram
A pie diagram is a circular diagram. It may be used in place of bar diagrams. It consists of one
or more circles which are divided into a number of sectors. The construction of a pie diagram
involves the following steps.

Step 1:

Whenever one set of actual values or percentages is given, find the corresponding angles in
degrees using the following formula:

Angle = (Actual value / Total of the actual values) × 360°
(or) Angle = (Percentage value / 100) × 360°

Step 2:

Find the radius using the area of the circle, πr², where the value of π is 22/7 or 3.14.
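
The two steps can be sketched in code. The sketch assumes the usual proportional rule, angle = value / total × 360 degrees, and a circle whose area πr² equals the total being represented; it reuses the Tamil Nadu cropping pattern figures from the simple bar diagram example, since the land area table for this section is not reproduced here.

```python
import math

crops = {"Cereals": 3940, "Oilseeds": 1165, "Pulses": 464, "Cotton": 249, "Others": 822}

def sector_angles(data):
    """Step 1: angle of each sector in degrees, proportional to its value."""
    total = sum(data.values())
    return {name: value / total * 360 for name, value in data.items()}

def radius_for_area(area):
    """Step 2: radius r of a circle whose area (pi * r ** 2) equals area."""
    return math.sqrt(area / math.pi)

angles = sector_angles(crops)
for name, angle in angles.items():
    print(f"{name:<9}{angle:7.2f} degrees")
print("radius:", round(radius_for_area(sum(crops.values())), 2))
```

In practice the radius is rescaled to fit the page; what matters when several circles are compared is that their areas stay proportional to their totals.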


Given the cultivable land area in four southern states of India, construct a pie diagram for the
following data.

The table value becomes

Mean
The mean is the average of all numbers and is sometimes called the arithmetic
mean. To calculate the mean, add together all of the numbers in a set and then
divide the sum by the total count of numbers. For example, in a data center rack,
five servers consume 100 watts, 98 watts, 105 watts, 90 watts and 102 watts of
power, respectively. The mean power use of that rack is calculated as
(100 + 98 + 105 + 90 + 102) W / 5 servers = 99 W per server.
Intelligent power distribution units report the mean power utilization of the rack to
systems management software.
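
The same calculation as a short sketch, using the five power readings from the example (the variable names are chosen for illustration):

```python
import statistics

watts = [100, 98, 105, 90, 102]  # power draw of the five servers, in watts

# Add the readings together and divide by how many there are.
mean_power = sum(watts) / len(watts)

print(mean_power)              # 99.0
print(statistics.mean(watts))  # the standard library gives the same value
```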

In the data center, means and medians are often tracked over time to spot trends,
which inform capacity planning or power cost predictions.

Median
The statistical median is the middle number in a sequence of numbers. To find the
median, organize the numbers in order by size; the number in the middle is the
median. For the five servers in the rack, arrange the power consumption figures
from lowest to highest: 90 W, 98 W, 100 W, 102 W and 105 W. The median power
consumption of the rack is 100 W. If there is an even number of values, average
the two middle numbers. For example, if the rack had a sixth server that used
110 W, the new number set would be 90 W, 98 W, 100 W, 102 W, 105 W and 110 W.
Find the median by averaging the two middle numbers: (100 + 102)/2 = 101 W.
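
A minimal sketch of this procedure, covering both the odd and the even case (the function name median is chosen for this example):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

five_servers = [100, 98, 105, 90, 102]
six_servers = five_servers + [110]

print(median(five_servers))  # 100
print(median(six_servers))   # 101.0
```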

Mode
The mode is the number that occurs most often within a set of numbers. For the
server power consumption examples above, there is no mode because each element
is different. But suppose the administrator measured the power consumption of an
entire network operations center (NOC) and the set of numbers is 90 W, 104 W, 98
W, 98 W, 105 W, 92 W, 102 W, 100 W, 110 W, 98 W, 210 W and 115 W. The
mode is 98 W since that power consumption measurement occurs most often
among the 12 servers. The mode helps identify the most common or frequent
occurrence of a characteristic. It is possible to have two modes (bimodal), three
modes (trimodal) or more modes within larger sets of numbers.
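
The sketch below finds the mode or modes by counting occurrences. Following the text, it reports no mode when every value occurs only once; that convention and the function name modes are choices made for this example.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values differ, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

rack = [100, 98, 105, 90, 102]
noc = [90, 104, 98, 98, 105, 92, 102, 100, 110, 98, 210, 115]

print(modes(rack))             # []
print(modes(noc))              # [98]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal set
```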
