Introduction to Statistical
Analysis
Why study statistics?
1. Data are everywhere
2. Statistical techniques are used to make
many decisions that affect our lives
3. No matter what your career, you will make
professional decisions that involve data. An
understanding of statistical methods will
help you make these decisions effectively
Data
Data are numbers which can be
measurements or can be obtained by
counting
Facts about something that can be used in
calculating, reasoning, or planning.
Information expressed as numbers for use
especially in a computer.
Nature of data
It may be noted that different types of data can be
collected for different purposes. The data can be
collected in connection with time or geographical
location or in connection with time and location.
The following are the three types of data:
1. Time series data.
2. Spatial data
3. Spatio-temporal data
Data are numbers, numbers contain
information, and the purpose of statistics
is to investigate and evaluate the nature
and meaning of this information.
Statistics
The science of collecting, organizing,
presenting, analyzing, and interpreting
data to assist in making more effective
decisions.
Statistical analysis – used to manipulate,
summarize, and investigate data, so that
useful decision-making information results.
Types of statistics
Descriptive statistics – Methods of
organizing, summarizing, and presenting
data in an informative way
Inferential statistics – The methods used
to determine something about a population
on the basis of a sample
Population – The entire set of individuals or
objects of interest or the measurements
obtained from all individuals or objects of
interest
Sample – A portion, or part, of the population
of interest
Descriptive Statistics
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Summarize data
e.g., Sample mean $\bar{X} = \frac{\sum X_i}{n}$
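The sample mean formula can be sketched in a few lines of Python. This is a minimal illustration only: the function name `sample_mean` and the weights are hypothetical and do not come from the slides.

```python
def sample_mean(values):
    """Return the sum of the observations divided by their number n."""
    n = len(values)
    if n == 0:
        raise ValueError("sample mean is undefined for an empty sample")
    return sum(values) / n


# Hypothetical weights in kg, made up purely for illustration.
weights_kg = [68.0, 72.5, 70.0, 65.5, 74.0]
print(sample_mean(weights_kg))  # 70.0
```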
Inferential Statistics
Estimation
e.g., Estimate the population
mean weight using the
sample mean weight
Hypothesis testing
e.g., Test the claim that the
population mean weight is
70 kg
Inference is the process of drawing conclusions or making
decisions about a population based on sample results
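The two tasks above, estimation and hypothesis testing, might look as follows in Python. This is a sketch under stated assumptions: the ten weights are invented for illustration, and the one-sample t-test at the 5% level is one common way to test a claim about a mean, not a method named in the slides.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of 10 weights in kg, made up for illustration only.
sample_weights = [68.2, 71.5, 69.8, 72.1, 70.4, 67.9, 73.0, 70.9, 69.3, 71.2]

# Estimation: the sample mean is a point estimate of the population mean.
n = len(sample_weights)
x_bar = mean(sample_weights)

# Hypothesis testing (assumed method: one-sample t-test of the claim
# that the population mean weight is 70 kg).
claimed_mean = 70.0
standard_error = stdev(sample_weights) / math.sqrt(n)
t_statistic = (x_bar - claimed_mean) / standard_error

# Two-sided critical value of t for n - 1 = 9 degrees of freedom at the 5% level.
T_CRITICAL = 2.262
reject_claim = abs(t_statistic) > T_CRITICAL

print(f"estimated mean = {x_bar:.2f} kg, t = {t_statistic:.3f}, reject = {reject_claim}")
```

With this made-up sample the test statistic stays well inside the critical value, so the claim of 70 kg would not be rejected.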
Descriptive vs. Inferential
I) Descriptive: A tennis player wants to find his average score for the past 20
games.
Inferential: A tennis player wants to estimate his chance of winning an upcoming
tournament based on his current season average and the averages of the competing
tennis players.
II) Descriptive: A politician wants to know the exact percentage of votes cast for
him in the last general election.
Inferential: Based on an opinion poll, a politician would like to estimate his
chance for re-election in the upcoming election.
III) Descriptive: Aamir wants to describe the variation in his four test scores in
statistics.
Inferential: Based on the first four test scores, Aamir would like to predict the
variation in his final statistics test scores.
IV) Descriptive: Mrs. Rashid wants to determine the average weekly amount she
spent on groceries in the past 3 months.
Inferential: Based on last year’s grocery bills, Mrs. Rashid would like to predict
the average amount she will spend on groceries for the upcoming year.
Statistical Data
Statistical data are usually obtained by
counting or measuring items. Most data
can be put into the following categories:
Qualitative - data are measurements that
each fall into one of several categories.
(hair color, ethnic groups and other
attributes of the population)
Quantitative - data are observations that
are measured on a numerical scale
(distance traveled to college, number of
children in a family, etc.)
Numerical scale of
measurement:
Nominal – the grouping of observations into
mutually exclusive qualitative categories is said to
constitute a nominal scale, e.g. students are
classified as male and female. The numbers 1 and 2
may also be used to identify these two categories,
but only as labels with no numerical meaning.
Ordinal – contains more information than the
nominal scale. It consists of distinct categories
in which order is implied.
Values in one category are larger or smaller than
values in other categories (e.g. rating-excellent,
good, fair, poor). Numbers 1,2,3,4 etc. are also
used to indicate ranks.
Interval – a measurement scale possessing a
constant interval size. e.g. temperature, IQ
score
Ratio – a special kind of interval scale
where the scale of measurement has a true
zero point as its origin. The ratio scale is used
to measure weight, volume, length, distance,
money, etc. The key to differentiating the interval
and ratio scales is that the zero point is
meaningful for the ratio scale.
Observation and Variables
• An observation often means any sort of
numerical recording of information,
e.g. height or weight, heads or tails or an
answer to a question such as yes
or no.
• A characteristic that varies with an
individual or an object, is called a variable.
e.g. age, height, weight etc.
• A variable is referred to as a constant
when it takes only one value.
Discrete and Continuous
variable
A discrete variable is one that can take
only a discrete set of integers or whole
numbers. e.g. the number of chairs, the
number of deaths in an accident, the
income of an individual, etc.
A continuous variable can take on any
value within a given interval e.g. age of a
person, the height of a plant, the
temperature at a place, etc.
Data presentation
Frequency Distribution
A frequency distribution is a method of
classifying data into classes or intervals in such a
way that the number of observations in each class
can be determined. The number in a class is called
the class frequency and is denoted by ‘f’. This
method provides a way of reviewing a set of
numbers without actually having to consider the
individual numbers, and it can be very useful
when dealing with large amounts of data.
The procedure of constructing a frequency
distribution for a given set of data depends on
the type of data involved i.e. continuous,
discrete or qualitative
Construction of a Frequency
Distribution
There are no hard and fast rules to construct a
frequency distribution; however some basic
guidelines must be observed.
i) Appropriate number of classes in a frequency
distribution
The number of classes denoted by C, depends on the
situation and the amount of data. There are no hard
and fast rules regarding the number of classes to use
and the choice is arbitrary. It is generally accepted
that the number of classes should be between 5 and
20, depending on the amount of data.
A useful suggestion regarding the number of classes
is given by Sturges' rule. The rule is:
C = 3.3 log10(n) + 1
where C denotes the number of classes, n is the
number of observations, and the logarithm is taken to
base 10. For example, if there are 25 observations in
a data set, then
C = 3.3 log10(25) + 1 = 3.3 (1.3979) + 1 = 5.61,
which is rounded to 6 classes.
ii) Find the lowest value and the highest value in the
data.
iii) Find the range: the range is obtained by subtracting
the lowest value from the highest value, R = X_L - X_S,
where X_L is the largest and X_S the smallest value.
iv) Divide the range by the number of classes to
find the class width or class interval h. In case of
fractional results, the next higher whole number is
usually taken as the class interval.
v) Determine the value at which the lowest interval
should begin. It should ordinarily be a multiple
of the class interval.
vi) Determine the remaining class-limits and class
boundaries by adding the class interval repeatedly.
The lowest class should be placed at the top and
the rest should follow according to size.
Sometimes, the highest class is placed at the top.
vii)Using the tally system, enter the raw data in
the appropriate class intervals. It is customary
for convenience in counting to place the first four
bars or strokes vertically and fifth one diagonally
so as to have a set of five. Sometimes for a
smaller data set, the actual values can be
written against each class instead of tally bars.
viii) Convert each tally to a frequency (f).
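Steps i) to iv) can be sketched in Python as below. The function names are hypothetical, and rounding the Sturges value up to the next whole number is an assumption; the slides only show the final rounded figure.

```python
import math


def sturges_classes(n):
    """Suggested number of classes, C = 3.3 * log10(n) + 1, rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(3.3 * math.log10(n) + 1)


def class_width(values, num_classes):
    """Range (highest minus lowest) divided by the number of classes.

    A fractional result is raised to the next higher whole number.
    """
    data_range = max(values) - min(values)
    return math.ceil(data_range / num_classes)


print(sturges_classes(25))  # 6, as in the 25-observation case
```

Both functions only suggest starting values; the guidelines leave the final choice of the number of classes and the class interval to judgment.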
Example:
The following data give the index numbers of 100 commodities in a
certain year. Make a frequency distribution.
91 120 138 96 99 113 97 94 119 111
118 83 91 86 71 119 123 87 151 117
87 116 134 90 61 141 104 115 125 79
119 124 112 145 96 114 114 106 113 89
110 111 75 106 153 63 107 96 100 96
81 101 104 108 147 133 100 109 104 110
143 77 109 138 113 86 121 86 136 117
99 95 90 100 104 79 68 88 116 101
144 127 101 128 102 105 106 122 76 78
73 147 127 129 140 120 129 77 108 109
Solution:
Step 1: We first find the range R. As the maximum value is 153 and the
minimum value is 61, the range is
$R = X_L - X_S = 153 - 61 = 92$
Step 2: We next decide the number of classes. Suppose we decide to take
C = 10 classes. Then the class interval is
$h = \frac{R}{C} = \frac{92}{10} = 9.2 \approx 10$
Typically, the value of R/C is rounded up to the next value determined
by the precision of measurement to produce a convenient value.
Step 3: Next we decide to locate the lower limit of the first class at 60. With
this choice, the class limits will be 60-69, 70-79, 80-89, ….
Step 4: To determine the frequency of each class we use either an entry table
(for a small data set) or a tally column. If a piece of data falls in a class,
we record a tally mark (I) in the tally column corresponding to that
class.
The frequency distribution is then constructed as follows:

Classes          Tally                       Frequency   Class          Midpoint
(Index Number)                               (f)         Boundaries
60-69            III                          3          59.5-69.5       64.5
70-79            IIII/ IIII                   9          69.5-79.5       74.5
80-89            IIII/ IIII                   9          79.5-89.5       84.5
90-99            IIII/ IIII/ III             13          89.5-99.5       94.5
100-109          IIII/ IIII/ IIII/ IIII/ I   21          99.5-109.5     104.5
110-119          IIII/ IIII/ IIII/ IIII      19          109.5-119.5    114.5
120-129          IIII/ IIII/ II              12          119.5-129.5    124.5
130-139          IIII/                        5          129.5-139.5    134.5
140-149          IIII/ II                     7          139.5-149.5    144.5
150-159          II                           2          149.5-159.5    154.5
Total                                       100
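The table can be checked with a short Python sketch that counts the 100 index numbers into classes of width 10 starting at 60. The helper `frequency_distribution` is a hypothetical name; it assumes whole-number data, so the class boundaries lie half a unit outside the class limits.

```python
from collections import Counter

index_numbers = [
    91, 120, 138, 96, 99, 113, 97, 94, 119, 111,
    118, 83, 91, 86, 71, 119, 123, 87, 151, 117,
    87, 116, 134, 90, 61, 141, 104, 115, 125, 79,
    119, 124, 112, 145, 96, 114, 114, 106, 113, 89,
    110, 111, 75, 106, 153, 63, 107, 96, 100, 96,
    81, 101, 104, 108, 147, 133, 100, 109, 104, 110,
    143, 77, 109, 138, 113, 86, 121, 86, 136, 117,
    99, 95, 90, 100, 104, 79, 68, 88, 116, 101,
    144, 127, 101, 128, 102, 105, 106, 122, 76, 78,
    73, 147, 127, 129, 140, 120, 129, 77, 108, 109,
]


def frequency_distribution(values, start, width):
    """Group whole-number data into classes of equal width beginning at start."""
    if min(values) < start:
        raise ValueError("start must not exceed the lowest value")
    # Class number 0 holds start..start+width-1, class 1 the next width values, etc.
    counts = Counter((v - start) // width for v in values)
    rows = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        upper = lower + width - 1
        rows.append({
            "class": (lower, upper),
            "frequency": counts[k],
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
        })
    return rows


table = frequency_distribution(index_numbers, start=60, width=10)
for row in table:
    print(row["class"], row["frequency"], row["boundaries"], row["midpoint"])
```

The counting replaces the manual tally: each value is assigned a class number by integer division, and the count per class number is the class frequency f.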
Relative Frequency: It is sometimes useful to express each value
or class in a frequency table as a fraction or a percentage of the
total number of measurements. The relative frequency for a
measurement or class is found by dividing the frequency, f, of the
measurement by the total number of measurements, n.
Cumulative Frequency: The cumulative frequency of a class is the sum
of its own frequency and the frequencies of all preceding classes of the
frequency distribution.
Class Interval      Frequency   Midpoint   Relative    Cumulative
                                           Frequency   Frequency
6.30–under 6.50         1         6.40       .025          1
6.50–under 6.70         2         6.60       .050          3
6.70–under 6.90         7         6.80       .175         10
6.90–under 7.10        10         7.00       .250         20
7.10–under 7.30        13         7.20       .325         33
7.30–under 7.50         6         7.40       .150         39
7.50–under 7.70         1         7.60       .025         40
Total                  40                    1.00
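The relative and cumulative frequency columns follow directly from the class frequencies. A minimal Python sketch, using the frequencies from the table and hypothetical variable names:

```python
from itertools import accumulate

# Class frequencies taken from the table, lowest class first.
frequencies = [1, 2, 7, 10, 13, 6, 1]

n = sum(frequencies)                        # total number of measurements
relative = [f / n for f in frequencies]     # each frequency divided by n
cumulative = list(accumulate(frequencies))  # running total of the frequencies

for f, rel, cum in zip(frequencies, relative, cumulative):
    print(f"{f:>3} {rel:.3f} {cum:>3}")
```

The relative frequencies sum to 1, and the last cumulative frequency equals n, which gives a quick check on any hand-built table.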
Exercise:
The following data represent the IQ scores of 60 students. Make a
frequency distribution of IQ scores.
106 107 76 82 109 107 115 93
187 95 123 125 111 92 86 70 126
130 68 82 129 139 119 115 128 100
186 84 99 113 204 111 141 136 118
123 90 115 98 110 185 78 162 178
140 152 173 80 146 158 194 148 90
107 181 131 184 75 104 110