INTRODUCTION TO STATISTICS
1.1. INTRODUCTION
Statistics is concerned with scientific methods for collecting, organizing, summarizing,
presenting and analyzing data, as well as with drawing valid conclusions and making reasonable
decisions on the basis of such analysis. In a narrower sense, the term statistics is used to denote
the data themselves or numbers derived from the data, such as averages. Thus we speak of
employment statistics, accident statistics, etc.
Decision makers make better decisions when they use all available information in an effective
and meaningful way. The primary role of statistics is to provide decision makers with methods
for obtaining and analyzing information to help make these decisions. Statistics is used to answer
long-range planning questions, such as when and where to locate facilities to handle future sales.
1.1.1. DEFINITIONS OF STATISTICS
Statistics has been defined differently by various authors. Some of the definitions are extremely
narrow. This is understandable, since statistics has developed over the past several decades and, in
the earlier days, its role was confined to a limited sphere. Some of these definitions are given
below.
W.I. King has defined statistics as the method of judging collective, natural or social phenomena
from the results obtained by analysis or enumeration or collection of estimates.
Croxton and Cowden: Statistics or statistical methods may be defined as the collection,
presentation, analysis and interpretation of numerical data.
Lovett: Statistics is a science that deals with the collection, classification and tabulation of numerical
facts as a basis for the explanation, description and comparison of phenomena.
A definition, which seems to be more comprehensive, is given by Secrist. He defined statistics
as: “aggregate of facts, affected to a marked extent by multiplicity of causes, numerically
expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a
systematic manner for a pre-determined purpose, and placed in relation to each other.”
It may be emphasized that this definition highlights a few major characteristics of statistics.
These are given below.
1. Statistics are aggregates of facts. This means a single figure is not statistics. For example,
national income of a country for a single year is not statistics but the same for two or
more years is.
2. Statistics are affected by a number of factors. For example, the sale of a product depends on a
number of factors such as its price, quality, competition, the income of consumers, and so
on.
3. Statistics must be reasonably accurate. Wrong figures, if analyzed, will lead to erroneous
conclusions. Hence, conclusions must be based on accurate figures.
4. Statistics must be collected in a systematic manner. If data are collected in a haphazard
manner, they will not be reliable and will lead to misleading conclusions.
5. Finally, statistics should be placed in relation to each other. If one collects data that are unrelated
to each other, such data will be confusing and will not lead to any logical
conclusions. Data should be comparable over time and space.
1.1.2. IMPORTANCE OF STATISTICS IN BUSINESS
There is an increasing realization of the importance of statistics in various quarters. This is
reflected in the increasing use of statistics in the government, industry, business, agriculture,
mining, transport, education, medicine and so on. As we are concerned here with the use of statistics
in business and industry, the description given below is confined to these areas only. There are
three major functions in any business enterprise in which statistical methods are useful.
1. The planning function. This may relate either to special projects or to the recurring
activities of the firm over a specified period.
2. The setting of standards. This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured products, norms for daily output, and so
forth.
3. The function of control. This involves comparison of actual production achieved against
the norm or target set earlier. In case the production has fallen short of the target, it suggests
remedial measures so that such a deficiency does not occur again.
1.2. Descriptive Statistics
1.2.1. Statistical Data
Statistical data are the basic input needed to make an effective decision in a particular situation.
The main reasons for collecting data are:
Statistical data are the outcome of a continuous process of measuring, counting, and/or
observing. These may pertain to several aspects of a phenomenon (or a problem) which are
measurable, quantifiable, countable or classifiable. The researcher collects and analyzes data
about the characteristics of the given population. The characteristics which one intends to
investigate and analyze are termed variables. Variables are of two kinds: those which are
expressed in numerical terms and those which are not. Sex,
religion, and language are a few examples of non-numerical variables, while age, weight, height, and
distance are examples of numerical variables. Numerical variables are classified into two
categories:
i) Discrete variables – which can take only certain fixed values, usually integers. For
example, the number of cards and the number of employees in an organization are
discrete variables.
ii) Continuous variables – which can take any numerical value within a range. Measurements of height,
weight, or length, in centimeters/inches or grams/kilograms, are a few examples of
continuous variables.
Remark: Discrete data are numerical measurements that arise from a process of counting, while
continuous data are numerical measurements that arise from a process of measuring.
Sources of Data
The choice of a data collection method from a particular source depends on the facilities
available, the extent of accuracy required in analysis, the expertise of the investigator, the time
span of the study, and the amount of money and other resources required for data collection.
i) Primary sources
Individuals, focus groups, and/or panels of respondents specifically decided upon and set up by
the investigator for data collection are examples of primary data sources. Any one or a
combination of the following methods can be chosen to collect primary data:
Observation: In observation studies, the investigator does not ask questions to seek
clarifications on certain issues. Instead he/she records the behavior, as it occurs, of an event in
which he/she is interested. Sometimes mechanical devices such as cameras and tape recorders are also used
to record the desired data.
Interviewing: Interviews can be conducted either face-to-face or over the telephone. Such
interviews provide an opportunity to establish a rapport with the interviewee and help extract
valuable information. Direct interviews are expensive and time-consuming if a big sample of
respondents is to be personally interviewed. Interviewers' biases can also get in the way. Such
interviews should be conducted at the exploratory stages of research to handle concepts and
situational factors.
Questionnaire: It is a formalized set of questions for extracting information from the target
respondents. The form of the questions should correspond to the form of the required
information. The three general forms of questions are dichotomous (yes/no response type),
multiple choice, and open-ended. A questionnaire can be administered personally or mailed to
the respondents. It is an efficient method of collecting primary data when the investigator knows
exactly what is required and how to measure the variables of interest.
ii) Secondary sources
External secondary data sources: The external secondary data sources include government
publications, non-government publications, various syndicate services such as Operations
Research Group (ORG), and international organizations.
Internal secondary data sources: The data generated within an organization in the process of
routine business activities are referred to as internal secondary data. Financial accounts,
production quality control, and sales records are examples of such data.
The raw data can be organized in a data array and frequency distribution. Such an arrangement
enables us to see quickly some of the characteristics of the data we have collected. When a raw
data set is arranged in rank order, from the smallest to the largest observation or vice-versa, the
ordered sequence obtained is called an ordered array.
Frequency distribution
Frequency distribution divides observations in the data set into conveniently established,
numerically ordered classes (groups or categories). The number of observations in each class is
referred to as frequency, denoted as f.
Advantages: The following are a few advantages of grouping and summarizing raw data in this
compact form:
i) The data are expressed in a more compact form. One can get a deeper insight into
the salient characteristics of the data at the very first glance.
ii) One can quickly note the pattern of distribution of observations falling in various
classes.
iii) It permits the use of more complex statistical techniques which help reveal certain
other obscure and hidden characteristics of the data.
Disadvantages: A frequency distribution suffers from some disadvantages as stated below:
A) Deciding the number of classes: The decision on the number of class groupings depends
largely on the judgment of the individual investigator and/or the range that will be used to
group the data. As a general rule, a frequency distribution should have at least five class
intervals (groups), but not more than fifteen. The following two rules are often used to decide
the approximate number of classes in a frequency distribution.
i) If K represents the number of classes and N the total number of observations, then
the value of K will be the smallest exponent of the number 2 such that $2^K \ge N$.
Let N = 30 observations. If we apply this rule, then we have $2^5 = 32 \ge 30$, so K = 5.
ii) According to Sturges' rule, the number of classes can be determined by the formula
$K = 1 + 3.322 \log_{10} N$
where K is the number of classes and $\log_{10} N$ is the base-10 logarithm of the total number of observations.
Applying this rule to the above example,
$K = 1 + 3.322 \log_{10} 30 = 1 + 3.322\,(1.4771) \approx 5.9$,
which again points to roughly 5 to 6 classes.
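The two rules for choosing the number of classes can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the second function assumes the commonly quoted form of Sturges' rule, K = 1 + 3.322 log10 N, with a base-10 logarithm.

```python
import math


def classes_power_of_two(n_obs: int) -> int:
    """Rule (i): smallest exponent K such that 2**K >= n_obs."""
    k = 0
    while 2 ** k < n_obs:
        k += 1
    return k


def classes_sturges(n_obs: int) -> float:
    """Rule (ii): Sturges' rule, assuming the constant 3.322 and log base 10."""
    return 1 + 3.322 * math.log10(n_obs)


n_obs = 30
k_rule_one = classes_power_of_two(n_obs)
k_rule_two = classes_sturges(n_obs)
print(k_rule_one, round(k_rule_two, 2))
```

Both rules give only a starting point; the final number of classes remains a matter of judgment, as noted above.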
B) Obtaining the width of class intervals: When constructing the frequency distribution it is
desirable that the width of each class interval should be equal in size. The size (or width) of
each class interval can be determined by first taking the difference between the largest and
smallest numerical values in the data set and then dividing it by the number of class intervals
desired:
$h = \dfrac{\text{largest value} - \text{smallest value}}{\text{number of class intervals}}$
For example, if the largest numerical value is 95 and the smallest numerical value of the
observations is 84, using the above formula with 5 classes desired, the width of the class intervals
is approximated as:
$h = \dfrac{95 - 84}{5} = 2.2$
For convenience, the selected width (or interval) of each class is rounded up to 3.
C) Establishing class limits (boundaries): The limits of each class interval should be clearly
defined so that each observation (element) of the data set belongs to one and only one class.
Each class has two limits, a lower limit and an upper limit. The usual practice is to let the lower
limit of the first class be a convenient number slightly below or equal to the lowest value in the
data set.
The following illustration makes the concepts discussed above concrete.
Table 1.1: Raw data pertaining to total overtime hours worked by machinists
94 89 88 89 90 94 92 88 87 85
88 93 94 93 94 93 92 88 94 90
93 84 93 84 91 93 85 91 89 95
Table 1.2 reorganizes the data given in Table 1.1 above in ascending order.
Table 1.2: Ordered array of total overtime hours worked by machinists
84 84 85 85 87 88 88 88
88 89 89 89 90 90 91 91
92 92 93 93 93 93 93 93
94 94 94 94 94 95
The frequency distribution of the number of hours of overtime given in Table 1.1 is shown in
Table 1.3.
Table 1.3: Array and Tallies
Number of overtime hours    Tally      Number of weeks (frequency)
84                          ||         2
85                          ||         2
86                          -          0
87                          |          1
88                          ||||       4
89                          |||        3
90                          ||         2
91                          ||         2
92                          ||         2
93                          ||||| |    6
94                          |||||      5
95                          |          1
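The ordered array and the tally table can be reproduced from the raw data of Table 1.1 with a short Python sketch. The variable names are illustrative; `Counter` from the standard library does the counting.

```python
from collections import Counter

# Raw overtime hours from Table 1.1
overtime_hours = [
    94, 89, 88, 89, 90, 94, 92, 88, 87, 85,
    88, 93, 94, 93, 94, 93, 92, 88, 94, 90,
    93, 84, 93, 84, 91, 93, 85, 91, 89, 95,
]

ordered_array = sorted(overtime_hours)   # ascending ordered array
frequency = Counter(overtime_hours)      # value -> number of occurrences

for hours in range(min(overtime_hours), max(overtime_hours) + 1):
    count = frequency[hours]  # Counter returns 0 for a value that never occurs
    print(f"{hours:>3}  {'|' * count:<6}  {count}")
```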
In Table 1.2, we may take the lower limit of the first class as 82 and the upper class limit as 85.
Thus the class would be written as 82 – 85. This class interval includes all overtime hours
ranging from 82 up to but not including 85 hours. The various other classes can be written as:
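As a sketch of how such a grouped frequency distribution could be built, the Python below assumes five classes of width 3 starting at 82, with each class including its lower limit and excluding its upper limit as described above. The class list and variable names are assumptions made for illustration, not a table taken from the source.

```python
# Raw overtime hours from Table 1.1
overtime_hours = [
    94, 89, 88, 89, 90, 94, 92, 88, 87, 85,
    88, 93, 94, 93, 94, 93, 92, 88, 94, 90,
    93, 84, 93, 84, 91, 93, 85, 91, 89, 95,
]

lower_limit = 82   # lower limit of the first class
width = 3          # class width, rounded up from 2.2
n_classes = 5      # assumed number of classes

class_frequency = [0] * n_classes
for hours in overtime_hours:
    # Integer division maps each value to the class [lo, lo + width)
    index = (hours - lower_limit) // width
    class_frequency[index] += 1

for i, f in enumerate(class_frequency):
    lo = lower_limit + i * width
    print(f"{lo} - {lo + width}: {f}")
```

Because every observation falls into exactly one class, the class frequencies add up to the 30 observations.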
Graphical Presentation of Data
It has already been discussed that one of the important functions of statistics is to present
complex and unorganized (raw) data in such a manner that they would easily be understandable.
There are a variety of diagrams used to represent statistical data. Different types of diagrams,
used to describe sets of data, are divided into the following categories:
1) Dimensional Diagrams
i) One-dimensional diagrams such as histograms, frequency polygons, and pie charts.
ii) Two-dimensional diagrams such as rectangles, squares, or circles.
iii) Three-dimensional diagrams such as cylinders and cubes.
2) Pictograms or Ideographs
A pictogram conveys its meaning through its pictorial resemblance to a physical object.
Statistical maps are used to show the difference in values (frequency of an event, probability of
an event etc.) between different geographical regions in geo-spatial analysis.
One-Dimensional Diagrams
These diagrams are the most useful, simple, and popular in the diagrammatic presentation of
frequency distributions. They provide a useful and quick understanding of the shape of
the distribution and its characteristics.
These diagrams are called one-dimensional diagrams because only the length (height) of the bar (not
the width) is taken into consideration.
The one – dimensional diagrams (charts) used for graphical presentation of data sets are as
follows:
Histogram
Frequency polygon
Frequency curve
Cumulative frequency distribution (Ogive)
Pie diagram
Histograms (Bar Diagrams)
Bar graphs are probably the most commonly used graphs, and ones you are already familiar with.
Two key points are worth stating:
1. Heights can be frequency or relative frequency
2. Bars must not touch
Using a favorite-color example, we can make both frequency and relative frequency bar
graphs.
Favorite color    Frequency    Relative frequency
Blue              10           10/26 ≈ 0.38
Red               3            3/26 ≈ 0.12
Orange            1            1/26 ≈ 0.04
Yellow            3            3/26 ≈ 0.12
Green             5            5/26 ≈ 0.19
Pink              3            3/26 ≈ 0.12
Purple            1            1/26 ≈ 0.04
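The relative frequencies in the table are each frequency divided by the total count. A minimal Python sketch, with illustrative variable names:

```python
# Frequencies from the favorite-color table
color_frequency = {
    "Blue": 10, "Red": 3, "Orange": 1, "Yellow": 3,
    "Green": 5, "Pink": 3, "Purple": 1,
}

total = sum(color_frequency.values())
relative_frequency = {
    color: count / total for color, count in color_frequency.items()
}

for color, share in relative_frequency.items():
    print(f"{color:<8} {color_frequency[color]:>3} {share:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.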
Pie Charts
Like bar graphs, pie charts are very common.
Pie charts:
1. Should always include the relative frequency
2. Should also include labels, either directly or as a legend
Using the data from our previous color example, we get the following pie chart:
[Figure: pie chart of the favorite-color data]
Frequency Polygon
A frequency polygon is drawn by plotting a point above each class midpoint and connecting the
points with straight lines. (Class midpoints are found by averaging successive lower class limits.)
To illustrate the idea, let's look at the following example of average commute times.
Average commute    Midpoint    Frequency
16-17.9            17          1
18-19.9            19          2
20-21.9            21          1
22-23.9            23          6
24-25.9            25          2
26-27.9            27          1
28-29.9            29          1
30-31.9            31          1
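The midpoints in the table, and the points that the frequency polygon connects, can be computed as follows. This is a sketch with illustrative names; it assumes equal class widths, so the midpoint of the last class uses the lower limit the next class would have.

```python
# Lower class limits and frequencies from the commute table
lower_limits = [16, 18, 20, 22, 24, 26, 28, 30]
frequencies = [1, 2, 1, 6, 2, 1, 1, 1]

# Equal class width, taken from two successive lower limits
width = lower_limits[1] - lower_limits[0]

# Midpoint = average of a lower limit and the next lower limit
midpoints = [(lo + (lo + width)) / 2 for lo in lower_limits]

# Points to plot and join with straight lines
polygon_points = list(zip(midpoints, frequencies))
print(polygon_points)
```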
Cumulative Frequency Distribution (Ogive)
The ogive graph shows both the less-than and the greater-than curves. The rising curve
represents the less-than ogive, and the falling curve represents the greater-than ogive.
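The cumulative frequencies behind the two ogives can be sketched in Python. The commute-time frequencies from the frequency polygon example are reused here purely as an assumption for illustration; the data behind the original graph are not given.

```python
from itertools import accumulate

# Class frequencies reused from the commute-time example (an assumption)
frequencies = [1, 2, 1, 6, 2, 1, 1, 1]

# Less-than ogive: running total from the first class upward (rising curve)
less_than = list(accumulate(frequencies))

# Greater-than ogive: running total from the last class downward (falling curve)
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)
print(more_than)
```

The less-than values are plotted against upper class limits and the greater-than values against lower class limits.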
1.3. MEASURES OF CENTRAL TENDENCY
The term central tendency was coined because observations (numerical values) in most data sets
show a distinct tendency to group or cluster around a value of an observation located somewhere
in the middle.
The measures of central tendency that will be discussed here are the arithmetic mean, the median,
and the mode. Before discussing the arithmetic mean or any other mean, the question arises: why should
we use such a mean?
The answer is that there are two main objectives of using a mean. The first is to get a single value
that indicates the characteristic of the entire data. For instance, when we talk of the per capita income
of a country, it gives a broad idea of the standard of living of the people in that country. The
second reason for using a mean is to facilitate comparisons of data.
1.3.1. The Arithmetic Mean
The arithmetic mean is obtained by adding all the observations and dividing the sum by the
number of observations. Suppose we have the following observations: 10, 15, 30, 7, 42, 79 and
83.
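These are seven observations. Symbolically, the arithmetic mean, also called simply the mean, is
$\bar{X} = \dfrac{\sum X}{n}$, where X denotes the sample observations and n is the number of observations.
$\bar{X} = \dfrac{10 + 15 + 30 + 7 + 42 + 79 + 83}{7} = \dfrac{266}{7} = 38$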
It may be noted that the Greek letter µ is used to denote the mean of the population and N to
denote the total number of observations in a population. Thus, the population mean is $\mu = \dfrac{\sum X}{N}$.
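The calculation for the seven observations can be checked with a few lines of Python (variable names are illustrative):

```python
# The seven observations from the arithmetic mean example
observations = [10, 15, 30, 7, 42, 79, 83]

# Arithmetic mean: sum of the observations divided by their number
mean = sum(observations) / len(observations)
print(mean)
```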
Ungrouped Data: Weighted Case
In case of ungrouped data where weights are involved, our approach for calculating arithmetic
mean will be different from the one used earlier.
Example 1
Suppose a student has secured the following marks in three tests
Mid –term test 30
Laboratory 25
Final 20
The simple arithmetic mean will be $\dfrac{30 + 25 + 20}{3} = 25$.
However, this will be wrong if the three tests carry different weights on the basis of their
relative importance. Assume that the weights assigned to the three tests are:
Mid – term test 2 Points
Laboratory 3 Points
Final 5 Points
Solution
On the basis of this information, we can now calculate a weighted mean as shown below:
Calculation of weighted mean
Type of test    Relative weight (w)    Marks (x)    wx
Mid-term        2                      30           60
Laboratory      3                      25           75
Final           5                      20           100
Total           Σw = 10                             Σwx = 235
$\bar{X}_w = \dfrac{\sum w_i x_i}{\sum w_i} = \dfrac{60 + 75 + 100}{2 + 3 + 5} = \dfrac{235}{10} = 23.5$ marks.
Example 2
An investor is fond of investing in equity shares. During a period of falling prices in the stock
exchange, a stock is sold at birr 120 per share on one day, birr 105 on the next and birr 90 on the
third day. The investor purchased 50 shares on the first day, 80 shares on the second day and
100 shares on the third day. What average price per share did the investor pay?
Day      Price per share in birr (X)    No. of shares purchased (W)    Amount paid (WX)
1        120                            50                             6,000
2        105                            80                             8,400
3        90                             100                            9,000
Total    -                              230                            23,400
Weighted average $= \dfrac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \dfrac{6000 + 8400 + 9000}{50 + 80 + 100} = \dfrac{23400}{230} \approx$ birr 101.7
Thus, the investor paid an average price of birr 101.7 per share.
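Both weighted-mean examples follow the same pattern: multiply each value by its weight, add the products, and divide by the sum of the weights. A minimal Python sketch, where `weighted_mean` is a hypothetical helper name:

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Example 1: test marks weighted by relative importance
test_mean = weighted_mean([30, 25, 20], [2, 3, 5])

# Example 2: share prices weighted by the number of shares bought
share_price = weighted_mean([120, 105, 90], [50, 80, 100])

print(test_mean, round(share_price, 1))
```

With equal weights the function reduces to the simple arithmetic mean.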
1.3.2. The Median
The median of a set of numbers arranged in order of magnitude (i.e., in an array) is either the
middle value or the arithmetic mean of the two middle values.
The median can be calculated for both ungrouped and grouped data sets.
Ungrouped data: In this case the data are arranged in either ascending or descending order of
magnitude.
(i) If the number of observations (n) is an odd number, then the median (Med) is
the numerical value at position $\frac{n+1}{2}$ in the ordered observations:
Med = value of the $\left(\frac{n+1}{2}\right)$th observation.
(ii) If the number of observations (n) is an even number, then the median is defined as the
arithmetic mean of the numerical values of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations in the data
array, that is:
$\text{Med} = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}$, where $x_{(k)}$ denotes the kth observation in the ordered array.
Example: Calculate the median of the following data that relates to the service time (in minutes)
per customer for seven customers at railway reservation counter.
Observations in the data array: 1 2 3 4 5 6 7
Service time (in minutes): 3 3.5 3.8 4 4.5 5 5.5
Median = value of the $\left(\frac{n+1}{2}\right)$th observation in the data array
= value of the $\left(\frac{7+1}{2}\right)$th = 4th observation in the data array, which is 4. Thus the median service time is 4
minutes per customer.
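The odd and even cases of the median for ungrouped data can be sketched as follows. The function is an illustrative helper; note that Python lists are indexed from 0, so the kth observation sits at index k - 1.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation is at index n // 2
        return ordered[middle]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


service_times = [3, 3.5, 3.8, 4, 4.5, 5, 5.5]
print(median(service_times))
```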
1.3.3. Mode
The mode is that value of an observation which occurs most frequently in the data set, that is, the
point (or class mark) with the highest frequency.
In the case of grouped data, the following formula is used for calculating mode.
$\text{Mode} = L + \dfrac{(f_{mo} - f_{mo-1})\,h}{2 f_{mo} - f_{mo-1} - f_{mo+1}}$
where L = lower limit of the modal class interval,
$f_{mo}$ = frequency of the modal class interval,
$f_{mo-1}$ = frequency of the class preceding the modal class interval,
$f_{mo+1}$ = frequency of the class following the modal class interval,
h = width of the modal class interval.
Example: The data below show the sales volume of an item per day over a 20-day period.
Class interval (sales volume)    53-56    57-60    61-64    65-68    69-72    72 and above
Number of days (frequency)       2        4        5        4        4        1
Solution: Since the largest frequency corresponds to the class interval 61-64, it is the
modal class. Then we have
L = 61, $f_{mo}$ = 5, $f_{mo-1}$ = 4, $f_{mo+1}$ = 4 and h = 3. Thus
$\text{Mo} = L + \dfrac{(f_{mo} - f_{mo-1})\,h}{2 f_{mo} - f_{mo-1} - f_{mo+1}} = 61 + \dfrac{(5 - 4) \times 3}{10 - 4 - 4} = 61 + 1.5 = 62.5$
Hence, the modal sale is 62.5 units.
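The grouped-data mode formula translates directly into a small function. This is a sketch; the function and parameter names are illustrative, and the width h = 3 is taken as given in the worked example.

```python
def grouped_mode(lower_limit, f_modal, f_before, f_after, width):
    """Mode of grouped data: L + (f_mo - f_mo-1) * h / (2 f_mo - f_mo-1 - f_mo+1)."""
    denominator = 2 * f_modal - f_before - f_after
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f_mo equals f_mo-1 + f_mo+1")
    return lower_limit + (f_modal - f_before) * width / denominator


# Values from the sales example: L = 61, f_mo = 5, f_mo-1 = 4, f_mo+1 = 4, h = 3
modal_sales = grouped_mode(61, 5, 4, 4, 3)
print(modal_sales)
```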
$\text{Var}(X) = \dfrac{\sum (X - \mu)^2}{N}$
It is also written as $\sigma^2 = \dfrac{\sum (X_i - \mu)^2}{N}$,
where $\sigma^2$ (called sigma squared) is used to denote the variance.
Although the variance is a measure of dispersion, the unit of its measurement is (points)². If a
distribution relates to the income of families, then the unit of variance is (Rs)². Similarly, if another
distribution pertains to marks of students, then the unit of variance is (marks)². To overcome this
inadequacy, the square root of the variance is taken, which yields a better measure of dispersion
known as the standard deviation.
Taking our earlier example of individual observations, we take the square root of the variance:
SD, or σ, = 3.42 points
Symbolically,
$\sigma = \sqrt{\dfrac{\sum (X_i - \mu)^2}{N}}$
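The population variance and standard deviation formulas can be sketched in Python as below. The data are the seven observations from the arithmetic mean section, reused here only as an assumption for illustration; they are not the example that produced the value 3.42 above.

```python
import math


def population_variance(values):
    """Mean of squared deviations from the population mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Illustrative data: the seven observations from the arithmetic mean section
observations = [10, 15, 30, 7, 42, 79, 83]

variance = population_variance(observations)
standard_deviation = math.sqrt(variance)  # same unit as the observations
print(round(variance, 2), round(standard_deviation, 2))
```

The variance is in squared units, while the standard deviation is back in the original unit of measurement.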
These observations are also known as extreme observations. The presence of extreme
observations on the right-hand side of a distribution makes it positively skewed, and the three
averages, viz., mean, median and mode, will no longer be equal. We shall in fact have Mean >
Median > Mode when a distribution is positively skewed. On the other hand, the presence of
extreme observations on the left-hand side of a distribution makes it negatively skewed, and the
relationship between mean, median and mode is Mean < Median < Mode. In Fig. 6.2 we depict
the shapes of positively skewed and negatively skewed distributions.
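The ordering of the three averages can be seen on a small made-up data set. The numbers below are hypothetical and chosen only to have a few extreme values on one side; the second list mirrors the first to move the extreme values to the left.

```python
from statistics import mean, median, mode

# Hypothetical data with extreme values on the right (positively skewed)
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

# Mirror image of the same data: extreme values on the left (negatively skewed)
left_skewed = [15 - x for x in right_skewed]

print(mean(right_skewed), median(right_skewed), mode(right_skewed))
print(mean(left_skewed), median(left_skewed), mode(left_skewed))
```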