Numerical data
Categorical data
Data Editing
Editing methods
Interactive Editing
Selective Editing
Macro Editing
Aggregation Method
Distribution Method
Automatic Editing
Frequency Distribution
Univariate frequency tables
Applications
Continuous Series
Discrete Series
Frequency Curve
Diagrammatic Representation
Diagrams
Merits
Types of Diagrams
Mean
Median
Mode
Statistical Data
Statistics is the branch of mathematics that deals with the collection, organization, analysis, and
interpretation of numerical data.
Numerical data
These data have meaning as a measurement, such as a person's height, weight, IQ, or blood
pressure; or they're a count, such as the number of stock shares a person owns, how many teeth a
dog has, or how many pages you can read of your favorite book before you fall asleep.
(Statisticians also call numerical data quantitative data.)
Numerical data can be further broken into two types: discrete and continuous.
Discrete data represent items that can be counted; they take on possible values that can be listed
out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to
infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes
on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes
on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads).
Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).
Continuous data represent measurements; their possible values cannot be counted and can only
be described using intervals on the real number line. For example, the exact amount of gas
purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to
20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41,
or 8.414863 gallons, or any possible number from 0 to 20. In this way, continuous data can be
thought of as being uncountably infinite. For ease of recordkeeping, statisticians usually pick
a decimal place at which to round off. Another example is the lifetime of a C battery, which
can be anywhere from 0 hours to an infinite number of hours (if it lasts forever), technically,
with all possible values in between. Granted, you don't expect a battery to last more than a few
hundred hours, but no one can put a cap on how long it can go (remember the Energizer
Bunny?).
Categorical data
Categorical data represent characteristics such as a person's gender, marital status, hometown,
or the types of movies they like. Categorical data can take on numerical values (such as 1
indicating male and 2 indicating female), but those numbers don't have mathematical meaning.
You couldn't add them together, for example. (Other names for categorical data are qualitative
data, or Yes/No data.)
Ordinal data mix numerical and categorical data. The data fall into categories, but the
numbers placed on the categories have meaning. For example, rating a restaurant on a scale from
0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical,
where the groups are ordered when graphs and charts are made. However, unlike categorical
data, the numbers do have mathematical meaning. For example, if you survey 100 people and
ask them to rate a restaurant on a scale from 0 to 4, taking the average of the 100 responses will
have meaning. This would not be the case with categorical data.
Data Editing
Data editing is defined as the process involving the review and adjustment of collected survey
data. The purpose is to control the quality of the collected data. Data editing can be performed
manually, with the assistance of a computer, or by a combination of both.
Editing methods
Interactive Editing
The term interactive editing is commonly used for modern computer-assisted manual editing.
Most interactive data editing tools applied at National Statistical Institutes (NSIs) allow one to
check the specified edits during or after data entry, and if necessary to correct erroneous data
immediately. Several approaches can be followed to correct erroneous data:
Selective Editing
Selective editing is an umbrella term for several methods of identifying influential
errors and outliers. Selective editing techniques aim to apply interactive editing to a well-chosen
subset of the records, such that the limited time and resources available for interactive editing are
allocated to those records where it has the most effect on the quality of the final estimates of
publication figures. In selective editing, data is split into two streams:
The critical stream consists of records that are more likely to contain influential errors. These
critical records are edited in a traditional interactive manner. The records in the non-critical
stream, which are unlikely to contain influential errors, are not edited in a computer-assisted
manner.
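The split into two streams can be illustrated with a small Python sketch. The source does not say how records are judged "more likely to contain influential errors"; the score used here (sampling weight times the gap between the reported and an expected value), the field names, and the threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    reported: float   # value reported by the respondent
    expected: float   # anticipated value, e.g. from a previous period (assumption)
    weight: float     # sampling weight of the record (assumption)


def influence_score(rec):
    """Hypothetical score: how much a possible error could move the final estimate."""
    return rec.weight * abs(rec.reported - rec.expected)


def split_streams(records, threshold):
    """Send high-scoring records to the critical stream, the rest to the non-critical one."""
    critical = [r for r in records if influence_score(r) >= threshold]
    non_critical = [r for r in records if influence_score(r) < threshold]
    return critical, non_critical


records = [
    Record("A", reported=120.0, expected=118.0, weight=10.0),
    Record("B", reported=9500.0, expected=950.0, weight=5.0),
    Record("C", reported=40.0, expected=41.0, weight=2.0),
]
critical, non_critical = split_streams(records, threshold=1000.0)
print([r.record_id for r in critical])      # records for interactive editing
print([r.record_id for r in non_critical])  # records left out of interactive editing
```

Only record B, whose reported value is far from what was expected, would be routed to interactive editing under these assumed settings.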
Macro Editing
There are two methods of macro editing.
Aggregation Method
This method is followed in almost every statistical agency before publication: verifying whether figures
to be published seem plausible. This is accomplished by comparing quantities in publication tables with
the same quantities in previous publications. If an unusual value is observed, a micro-editing procedure is
applied to the individual records and fields contributing to the suspicious quantity.
Distribution Method
The available data are used to characterize the distribution of the variables; all individual values
are then compared with that distribution. Records containing values that could be considered
uncommon (given the distribution) are candidates for further inspection and possibly for editing.
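As a rough illustration of the distribution method, the sketch below flags values that lie far outside the bulk of the data. The source does not specify how "uncommon" is decided; the interquartile-range fences used here (with the conventional multiplier 1.5) are one common choice and are an assumption, as are the function name and the sample figures.

```python
import statistics


def flag_uncommon(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule, not the source's)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical reported values for one variable; 980 looks like a misplaced digit.
turnover = [102, 98, 105, 97, 101, 99, 103, 100, 980]
suspicious = flag_uncommon(turnover)
print(suspicious)
```

Flagged values are only candidates for inspection; whether they are actually edited remains a separate decision, as the text notes.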
Automatic Editing
In automatic editing, records are edited by a computer without human intervention. Prior
knowledge on the values of a single variable or a combination of variables can be formulated as
a set of edit rules which specify or constrain the admissible values.
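A minimal Python sketch of edit rules follows. Each rule constrains the admissible values of one variable or a combination of variables. The field names and the specific limits are hypothetical, and the sketch only detects violations; how a violated record is then corrected is outside what the source describes.

```python
# Each edit rule maps a description to a check that returns True for admissible records.
EDIT_RULES = {
    "age is between 0 and 120":
        lambda r: 0 <= r["age"] <= 120,
    "hours worked per week is between 0 and 168":
        lambda r: 0 <= r["hours_per_week"] <= 168,
    "a person under 15 is not recorded as married":
        lambda r: not (r["age"] < 15 and r["marital_status"] == "married"),
}


def failed_edits(record):
    """Return the descriptions of all edit rules that the record violates."""
    return [name for name, rule in EDIT_RULES.items() if not rule(record)]


ok = {"age": 34, "hours_per_week": 40, "marital_status": "married"}
bad = {"age": 12, "hours_per_week": 200, "marital_status": "married"}
print(failed_edits(ok))
print(failed_edits(bad))
```

The first two rules constrain a single variable; the third constrains a combination of two variables.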
Frequency Distribution
In statistics, a frequency distribution is a table that displays the frequency of various outcomes
in a sample. Each entry in the table contains the frequency or count of the occurrences of values
within a particular group or interval, and in this way, the table summarizes the distribution of
values in the sample.
Rank  Degree of agreement  Number
1     Strongly agree       20
2     Agree somewhat       30
3     Not sure             20
4     Disagree somewhat    15
5     Strongly disagree    15
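A table like the one above can be produced by counting how often each response occurs. The sketch below rebuilds the raw responses from the counts in the table and tallies them with Python's `collections.Counter`; the relative-frequency column is an addition for illustration.

```python
from collections import Counter

# Raw survey responses, reconstructed from the counts shown in the table.
responses = (
    ["Strongly agree"] * 20
    + ["Agree somewhat"] * 30
    + ["Not sure"] * 20
    + ["Disagree somewhat"] * 15
    + ["Strongly disagree"] * 15
)

freq = Counter(responses)    # frequency of each outcome
total = sum(freq.values())   # number of observations in the sample

for category, count in freq.items():
    print(f"{category:<18} {count:>3} {count / total:>6.1%}")
```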
A different tabulation scheme aggregates values into bins such that each bin encompasses a
range of values. For example, the heights of the students in a class could be organized into the
following frequency table.
5.0-5.5 feet  35  60
5.5-6.0 feet  20  80
or
where n is the total number of observations in the data.
Calculate the range of the data (Range = Max - Min) by finding the minimum and maximum data
values. The range will be used to determine the class interval or class width.
Generally the class interval or class width is the same for all classes. The classes all taken
together must cover at least the distance from the lowest value (minimum) in the data set up to
the highest (maximum) value. Also note that equal class intervals are preferred in frequency
distribution, while unequal class intervals may be necessary in certain situations to avoid a large
number of empty, or almost empty classes.
Decide the individual class limits and select a suitable starting point for the first class. The
starting point is arbitrary; it may be less than or equal to the minimum value. Usually it is placed
before the minimum value in such a way that the midpoint (the average of the lower and upper class
limits of the first class) is properly placed.
Take an observation and mark a vertical bar (|) for the class to which it belongs. A running tally
is kept until the last observation. Each completed bundle of tally marks indicates a count of five.
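The construction steps above can be sketched in Python. The data set and the function name are hypothetical, the number of classes is passed in as a parameter, and the sketch assumes integer data with equal class widths and the first class starting at the minimum value.

```python
def frequency_distribution(data, num_classes):
    """Group integer data into equal-width classes and count each class."""
    lowest, highest = min(data), max(data)
    data_range = highest - lowest               # Range = Max - Min
    width = data_range // num_classes + 1       # wide enough to cover the maximum
    counts = [0] * num_classes
    for value in data:                          # running tally, one observation at a time
        counts[(value - lowest) // width] += 1
    return [
        (lowest + i * width, lowest + (i + 1) * width, counts[i])
        for i in range(num_classes)
    ]


# Hypothetical exam scores for 15 students.
scores = [52, 67, 71, 58, 90, 84, 76, 63, 69, 75, 88, 95, 61, 73, 79]
table = frequency_distribution(scores, num_classes=5)

for lower, upper, count in table:
    # Each class contains values from lower (inclusive) up to upper (exclusive).
    print(f"{lower:>3} to under {upper:<3} {'|' * count} ({count})")
```

Here the range is 43, so five classes of width 9 cover every observation from 52 to 95.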
Men 2 10 8 20
Women 16 6 8 30
Total 18 16 16 50
The total row and total column report the marginal frequencies or marginal distribution, while
the body of the table reports the joint frequencies.
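The relationship between joint and marginal frequencies can be checked directly from the numbers in the table. In this Python sketch the three columns are left unlabelled because their headings do not appear in the text.

```python
# Joint frequencies from the body of the table (columns are unlabelled in the source).
joint = {
    "Men": [2, 10, 8],
    "Women": [16, 6, 8],
}

# Marginal frequencies: sum across each row and down each column.
row_totals = {group: sum(counts) for group, counts in joint.items()}
column_totals = [sum(column) for column in zip(*joint.values())]
grand_total = sum(row_totals.values())

print(row_totals)      # total column of the table
print(column_totals)   # total row of the table
print(grand_total)
```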
Applications
Managing and operating on frequency-tabulated data is much simpler than operating on raw data.
There are simple algorithms to calculate median, mean, standard deviation etc. from these tables.
Statistical hypothesis testing is founded on the assessment of differences and similarities between
frequency distributions. This assessment involves measures of central tendency or averages, such
as the mean and median, and measures of variability or statistical dispersion, such as the standard
deviation or variance.
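As an illustration of working from a frequency table rather than raw data, the sketch below computes the mean, median, and standard deviation from (value, frequency) pairs. It reuses the rank counts from the agreement table earlier in this document; the function names are hypothetical, and the population form of the variance (dividing by n) is an assumption.

```python
import math

# (value, frequency) pairs, sorted by value: ranks 1-5 from the agreement table.
table = [(1, 20), (2, 30), (3, 20), (4, 15), (5, 15)]

n = sum(f for _, f in table)
mean = sum(x * f for x, f in table) / n
variance = sum(f * (x - mean) ** 2 for x, f in table) / n   # population variance
std_dev = math.sqrt(variance)


def value_at(position):
    """Return the value at a 1-based position in the sorted data, using cumulative counts."""
    cumulative = 0
    for x, f in table:
        cumulative += f
        if position <= cumulative:
            return x
    raise ValueError("position exceeds the number of observations")


if n % 2 == 0:
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2
else:
    median = value_at((n + 1) // 2)

print(mean, median, round(std_dev, 3))
```

No raw observation is ever listed individually; the five table rows stand in for all 100 responses.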
A frequency distribution is said to be skewed when its mean and median differ, or more generally
when it is asymmetric. The kurtosis of a frequency distribution is the concentration of
scores at the mean, or how peaked the distribution appears if depicted graphically, for example,
in a histogram. If the distribution is more peaked than the normal distribution it is said to be
leptokurtic; if less peaked it is said to be platykurtic.
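Skewness and kurtosis can be quantified in several ways. The sketch below uses the standard moment-based coefficients, which the source does not name, so treat the formulas and function names as an assumed illustration: skewness is the third central moment divided by the 1.5 power of the second, and excess kurtosis is the fourth central moment divided by the square of the second, minus 3 (the value for a normal distribution).

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)**k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)


def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    # Positive: leptokurtic; negative: platykurtic; zero matches the normal distribution.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [90, 104, 98, 98, 105, 92, 102, 100, 110, 98, 210, 115]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric sample has zero skewness, while the second sample, which contains one very large value, is skewed to the right.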
Letter frequency distributions are also used in frequency analysis to crack codes; they describe
the relative frequency of letters in different languages.
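A letter frequency distribution is straightforward to compute. The following Python sketch counts letters in a text and converts the counts to relative frequencies; the function name and sample text are illustrative only.

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case, digits, spaces, and punctuation."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.most_common()}


sample = "Frequency analysis compares letter counts in a message with those of a language."
for letter, share in list(letter_frequencies(sample).items())[:5]:
    print(letter, f"{share:.1%}")
```

In frequency analysis, a distribution like this one, taken from an enciphered message, is compared with the known distribution for the suspected language.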
Continuous Series
A series dealing with a continuous variable is called a continuous series. A continuous
variable is one that can assume any conceivable fractional value within a range. In other
words, a continuous variable is capable of assuming every conceivable fractional value within
the range of possibilities, as with income, weight, profit, or length.
Discrete Series
A series dealing with a discrete variable is called a discrete series. A discrete variable is a
characteristic that cannot be expressed in fractions; it is a fractionless variable. There
will be either one employee or two; there cannot be 1.5 employees. The number of times each
quantity or amount occurs is shown beside it, and these counts are known as
frequencies. By frequency, we mean the number of times a particular observed value occurs
in the universe.
Frequency Curve
A frequency curve is obtained by joining the points of a frequency polygon with a freehand smoothed
curve. Unlike the frequency polygon, where the points are joined by straight lines, the points are
joined freehand to get a smoothed frequency curve. It is used to remove
the ruggedness of the polygon and to present it in a good form or shape. We smooth only the
angularities of the polygon, without making any basic change in the shape of the curve. In
this case also the curve begins and ends at the base line, as in the case of the polygon. The area
under the curve must remain almost the same as in the case of the polygon.
Diagrammatic Representation
Diagrams
Diagrams are various geometrical shapes such as bars, circles, etc. Diagrams are
based on scale but are not confined to points or lines. They are more attractive and easier
Merits
1. Most of the people are attracted by diagrams.
2. Minute differences in values cannot be represented properly in diagrams.
4. Some diagrams can be drawn only by experts, e.g. the pie chart.
Types of Diagrams
The important diagrams are
In all the diagrams and graphs, the groups or classes are represented on the x-axis
Example
Simple bar diagrams: A simple bar diagram consists of vertical bars of equal width. The heights
of these bars are proportional to the volume or magnitude of the attribute. All bars stand on the
same baseline. The bars are separated from each other by equal intervals. The bars may
be coloured or marked.
Example
The cropping pattern in Tamil Nadu in the year 1974-75 was as follows.
Crop      Area (in 1,000 hectares)
Cereals   3940
Oilseeds  1165
Pulses    464
Cotton    249
Others    822
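The proportionality between bar size and magnitude can be sketched in a few lines of Python using the crop areas above. To stay self-contained, the sketch prints horizontal text bars rather than drawing vertical ones; the maximum bar length of 40 characters is an arbitrary assumption.

```python
# Crop areas in 1,000 hectares, taken from the table above.
crops = {"Cereals": 3940, "Oilseeds": 1165, "Pulses": 464, "Cotton": 249, "Others": 822}

MAX_BAR = 40                     # length of the longest bar, in characters (assumption)
largest = max(crops.values())

# Each bar length is proportional to the magnitude it represents.
bar_lengths = {crop: round(area / largest * MAX_BAR) for crop, area in crops.items()}

for crop, length in bar_lengths.items():
    print(f"{crop:<9}{'#' * length} {crops[crop]}")
```

All bars share the same starting column, which plays the role of the common baseline.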
be compared within each attribute we use multiple bar diagrams. If only two characters
are to be compared within each attribute, then the resultant bar diagram used is known as
The multiple bar diagram is simply the extension of simple bar diagram. For each
attribute two or more bars representing separate characters or groups are to be placed side
by side. Each bar within an attribute will be marked or coloured differently in order to
distinguish them. Same type of marking or colouring should be done under each attribute.
Example
Draw a multiple bar diagram for the following data which represented agricultural
component side by side we may place these one on top of the other. This will result in a
Example:
Percentage bar diagram
Sometimes the volumes of different attributes differ greatly. To make
comparisons meaningful, the attributes are reduced to percentages. In that case
each attribute will have 100 as its maximum volume. This sort of component bar chart is
Example:
Draw a Percentage bar diagram for the following data
Pie chart / Pie Diagram
Pie diagram is a circular diagram. It may be used in place of bar diagrams. It
consists of one or more circles which are divided into a number of sectors. In the
Step 1:
Whenever one set of actual value or percentage are given, find the corresponding
Step 2:
Example
Given the cultivable land area in four southern states of India. Construct a pie diagram for
The table value becomes
Mean
The mean is the average of all numbers and is sometimes called the arithmetic
mean. To calculate mean, add together all of the numbers in a set and then divide
the sum by the total count of numbers. For example, in a data center rack,
five servers consume 100 watts, 98 watts, 105 watts, 90 watts and 102 watts of
power, respectively. The mean power use of that rack is calculated as (100 + 98 +
105 + 90 + 102) W / 5 servers = 99 W per server.
Intelligent power distribution units report the mean power utilization of the rack to
systems management software.
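The same calculation in Python, as a minimal sketch using the five power readings from the example (the variable names are illustrative):

```python
import statistics

# Power draw of the five servers in the rack, in watts.
power_watts = [100, 98, 105, 90, 102]

# Add all the values together, then divide by how many there are.
mean_power = sum(power_watts) / len(power_watts)
print(mean_power)  # 99.0

# The standard library gives the same result.
print(statistics.mean(power_watts))
```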
Median
In the data center, means and medians are often tracked over time to spot trends,
which inform capacity planning or power cost predictions. The statistical median is
the middle number in a sequence of numbers. To find the median, organize each
number in order by size; the number in the middle is the median. For the five
servers in the rack, arrange the power consumption figures from lowest to highest:
90 W, 98 W, 100 W, 102 W and 105 W. The median power consumption of the
rack is 100 W. If the set contains an even number of values, average the two middle numbers.
For example, if the rack had a sixth server that used 110 W, the new number set
would be 90 W, 98 W, 100 W, 102 W, 105 W and 110 W. Find the median by
averaging the two middle numbers: (100 + 102)/2 = 101 W.
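Both cases can be reproduced with a short Python sketch. The helper function is written out to mirror the steps in the text; `statistics.median` from the standard library does the same job.

```python
import statistics


def median(values):
    """Sort the values; take the middle one, or average the two middle ones."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


five_servers = [100, 98, 105, 90, 102]
six_servers = five_servers + [110]

print(median(five_servers))  # 100
print(median(six_servers))   # 101.0
print(statistics.median(six_servers))
```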
Mode
The mode is the number that occurs most often within a set of numbers. For the
server power consumption examples above, there is no mode because each element
is different. But suppose the administrator measured the power consumption of an
entire network operations center (NOC) and the set of numbers is 90 W, 104 W, 98
W, 98 W, 105 W, 92 W, 102 W, 100 W, 110 W, 98 W, 210 W and 115 W. The
mode is 98 W since that power consumption measurement occurs most often
amongst the 12 servers. Mode helps identify the most common or frequent
occurrence of a characteristic. It is possible to have two modes (bimodal), three
modes (trimodal) or more modes within larger sets of numbers.
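A small Python sketch of finding the mode or modes follows. The function name is illustrative. It returns an empty list when every value occurs only once, matching the "no mode" convention used in the text; note that Python's own `statistics.multimode` would instead return every value in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []          # every element is different: no mode
    return [value for value, count in counts.items() if count == highest]


rack = [100, 98, 105, 90, 102]
noc = [90, 104, 98, 98, 105, 92, 102, 100, 110, 98, 210, 115]

print(modes(rack))             # []
print(modes(noc))              # [98]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal set
```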