UNIT ONE
USES OF DATA ANALYSIS AND COURSE OVERVIEW
The word statistics can also be defined in both a plural and a singular sense.
In the plural sense, statistics means numerical data.
In the singular sense, it means the statistical method: the theory and techniques
used for collecting, analyzing and drawing inferences from numerical data.
i. Descriptive Statistics
Data are collected for some purpose, but collected data provide little information until they are
processed. Data need to be organized and summarized before they can be used to support decisions,
and this is the task of descriptive statistics.
Descriptive statistics is therefore the part of statistics concerned with arranging, summarizing and
presenting a set of data in such a way that the meaningful essentials of the data can be extracted
and grasped easily.
The data can be presented using tools such as graphs, tables and summary measures (mean, median, mode, etc.).
ii. Inferential Statistics
Statisticians are often interested in obtaining information about a total collection of elements, which is called
the population. But a population is often too large to examine in full. Inferential
statistics is therefore the part of statistics concerned with drawing conclusions about a population
from a sample taken from that population.
Statistical Variable
A variable is a characteristic under study that assumes different values for different elements.
For example, consider the 2002 profits of three international companies.
Company     2002 profits (billions)
Midrock     22
Sunshine    18
PSCO        10
Here, the 2002 profit is a variable. Other examples: income, height, weight, cars sales and so on.
Data
The numerical values represented by any variable are called Data.
For example, 22, 18 and 10 in the above table are data. The singular form is datum.
1) Variables that can be measured numerically are called quantitative variables. The corresponding
data are called quantitative data. For example: time, income, gross sales, price, height,
weight, number of accidents on a road per day.
2) Variables that cannot be measured numerically but can be divided into different categories
are called qualitative or categorical variables. The corresponding data are called qualitative data. For
example, the status of an undergraduate college student is a qualitative variable, since a student
can fall into any one of four categories:
1st 2nd 3rd 4th
Freshman Sophomore Junior Senior
Other examples: gender of a person, hair color, etc.
i. Secondary Data
In statistical studies, one must first check availability of prior studies related to the topic of
interest and whether these are relevant for the present purpose. Data types which are already
collected and recorded by another body for another purpose are called secondary data.
The common sources of secondary data are governmental publications, journals and reports,
publication of research organizations and different books.
These types of data help in saving time and expenses of the study and unnecessary
duplication of efforts.
There are two ways that can be used in order to collect primary data.
a) Questionnaire and interview, and
b) Experiment and observation
Interview
Using secondary data is simpler than collecting primary data. However, primary data
are in most cases more reliable and better suited to the study at hand, because they are original data whose
collection is designed and conducted by the investigator to suit the present purpose.
Note also that in many studies primary and secondary data can be used
together, especially when the available secondary data are accurate enough but incomplete and
the rest can be supplemented by collecting primary data.
For instance, if the aim is to collect data on the monthly income of employees, the company's
payroll record can supply the income that comes from basic salary, while income from other sources can
be obtained by interviewing the employees themselves.
1) Nominal Scale
Nominal scales are measurement systems that possess none of the three properties (order, distance
and fixed zero).
Example:
Political preference ( Republican, Democrat, or Other)
Sex ( Male, or Female)
Marital status ( Married, Single, Widowed, Divorced)
2) Ordinal Scale
Ordinal scales are measurement systems that possess the property of order, but not the property
of distance. The property of fixed zero is not important if the property of distance is not satisfied.
This level of measurement classifies data into categories that can be ranked, but precise differences
between the ranks do not exist.
Arithmetic operations are not applicable, but relational operations are applicable.
Ordering is the sole property of the ordinal scale.
Example:
Letter grades (A, B, C, D, F)
Rating scales (Excellent, Very Good, Good, Fair, Poor)
3) Interval Scales
Interval scales are measurement systems that possess the properties of order and distance but not
the property of fixed zero.
This level of measurement classifies data that can be ranked and for which differences are meaningful.
However, there is no meaningful zero, so ratios are meaningless.
All arithmetic operations except division are applicable.
Relational operations are also possible.
Example:
IQ
Temperature in °F
4) Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed
zero. The added power of a fixed zero allows ratios of numbers to be meaningfully interpreted; e.g.
the ratio of Ananya's height to Eyosia's height is 1:32, whereas this is not possible with interval
scales.
This level of measurement classifies data that can be ranked, for which differences are meaningful and
there is a true zero. True ratios exist between the different units of measure.
All arithmetic and relational operations are applicable.
Example:
Weight, Height, Number of Students, Age etc.
Uses of statistics
Normally, knowingly or unknowingly we use statistics almost in the day to day activities of our
lives. When you want to compare yourselves with your classmates, for example, you use statistics.
We study statistics, however, mainly because we are involved in decision making.
Statistics aids our decision making because it:
Provides the models that are needed to study situations involving uncertainties,
Eases the identification and determination of functional relationships among variables,
Enables us to predict the likelihood of something happening,
Serves as a source of sufficient information for effective decisions,
Presents facts in a definite and precise form,
Studies the relationship between two or more variables,
Condenses and summarizes a mass of data into a few presentable, understandable and precise figures.
Misuses of statistics
Some of the possible ways where statistics can be misused are:
They can be used for the wrong purpose, that is, for purposes different from the
one for which they were collected.
They can be collected incorrectly, for example when the collection is biased.
They can be analyzed carelessly so that the results obtained are misleading.
The improper use of statistical tools by unscrupulous people with an improper statistical bent of
mind has led to public distrust in statistics. By this we mean that the public loses its belief, faith
and confidence in the science of statistics and starts condemning it. Irresponsible,
inexperienced and dishonest persons who use statistical data and statistical techniques to fulfill their
selfish motives have discredited the science of statistics and prompted some very interesting comments:
♠ An ounce of truth will produce tons of statistics.
♠ Statistics can prove anything.
♠ Figures do not lie. Liars figure.
♠ Statistics is an unreliable science.
♠ There are three types of lies: lies, damned lies and statistics, wicked in the order of their
naming; and so on.
Some of the reasons for the above remarks may be enumerated as follows:
a. Figures are innocent and believable, and the facts based on them are psychologically more
convincing. But it is a pity that figures do not have the label of quality on their face.
b. Arguments are put forward to establish certain results which are not true by making use of
inaccurate figures or by using incomplete data, thus distorting the truth.
c. Though accurate, the figures might be molded and manipulated by dishonest persons to
conceal the truth and present a wrong and distorted picture of the facts to the public for
personal and selfish motives.
Hence, if statistics and its tools are misused, the fault does not lie with the science of statistics.
Rather, it is the people who misuse it who are to be blamed. Utmost care and precaution should be
taken in the interpretation of statistical data in all its manifestations. “Statistics should not be used
as a blind man uses a lamp post for support instead of illumination”.
Thus the use of statistics by experts who are well experienced and skilled in the analysis and
interpretation of statistical data for drawing correct and valid inferences very much improves the
chance of mass popularity of this important science.
UNIT TWO
METHODS OF DATA COLLECTION AND SAMPLING TECHNIQUE
Before we study the methods of data collection, it is important to define two important terms in
statistics – Population and Sample.
In statistical language, the population is the total set of elements or items under investigation, whereas a sample
is a part or subset of this population.
For instance, if a researcher is interested in studying the performance of male and female students in
PSCO, all students of the college constitute the population. If you select some number of female and
male students from among them, this collection, which is a subset of the population, is a sample.
In statistics, the sample taken from the population must approximately represent the characteristics
of the population.
In general, we have two methods of data collection: Sample survey and Census survey
1) Census survey
A survey that includes every member of the population is called a Census. In the process of data
collection, data are gathered from all elements that we are interested to study.
2) Sample survey
The method of collecting data from a portion of the population is called a sample survey. The purpose of
conducting a sample survey is to make decisions about the corresponding population.
It is important that the results obtained from a sample survey closely match the results that we would
obtain by conducting a census. Otherwise, decisions derived from the sample survey will not apply to the
corresponding population. That is, such a sample is not a representative sample.
Demerits (disadvantages)
It requires a great many enumerators and a great deal of time and money. It is practically beyond the reach of most researchers.
The census method is useless when results are urgently required.
The element of bias gets larger and larger as the number of observations increases.
In practice, it is sometimes not possible to examine every item in the population, for example in
destructive testing (explosives) and in medical testing (drug effectiveness).
They are not laws in the strict sense of the term; rather, they are only tendencies which operate universally.
The Law of Statistical Regularity
This law may be stated as follows: “On an average the sample chosen at random from the universe
will have the same composition and characteristics as the universe (population.)”.
For example, if one intends to make a study of average weight of students of a college, it is not
necessary to take weight of all students. A few students may be selected at random from all the
classes, their weights taken and average weight of the college students in general may be inferred.
But before the results of the sample can be applied to the population, two conditions must be met:
Firstly, the sample should be random, that is, every item in the population has an equal
chance of being included in the sample.
Secondly, the sample should be sufficiently representative.
In statistics, there is a basic principle that the larger the number of items, the more reliable are the
results obtained from them, because it is then possible to avoid the influence of abnormal items on the
average. The larger the size of the sample, the more reliable is the result, because the sampling error is
inversely proportional to the square root of the number of items in the sample, i.e. sampling error $\propto 1/\sqrt{n}$.
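The inverse-square-root relationship can be illustrated with a few lines of Python. This is only a sketch: the function name `relative_sampling_error` is hypothetical, and it returns a factor proportional to the sampling error rather than the error itself, since the constant of proportionality depends on the data.

```python
import math


def relative_sampling_error(n: int) -> float:
    """Return a factor proportional to the sampling error for sample size n.

    Only the 1/sqrt(n) shape is modelled here; the constant of
    proportionality is not specified in the text and is left out.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)


# Quadrupling the sample size halves the factor each time.
for n in (25, 100, 400):
    print(n, relative_sampling_error(n))
```

Going from 25 to 100 items, for instance, cuts the factor from 0.2 to 0.1, so a fourfold increase in sample size is needed to halve the sampling error.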
Sampling Techniques
Sampling techniques are the different techniques of collecting data (information) from a portion of a
population. The major sampling techniques may be grouped in to:
Sampling techniques: I. Probability sampling; II. Non-probability sampling
I. Probability Sampling
All probability samples are based on a chance selection procedure, i.e. every element of the population has a
known non-zero probability of selection. This eliminates the bias inherent in non-probability sampling
procedures, because the probability sampling process is random. Random refers to the procedure for selecting
the sample: a procedure whose outcome cannot be predicted because it depends on chance.
Selection of the sample based on the theory of probability is also known as
random selection, and probability sampling is sometimes also known as random sampling.
Demerits:
It does not use the knowledge of the population that researchers may have. It gives larger errors than
stratified sampling for the same sample size.
In simple random sampling, respondents may be widely dispersed, hence the cost is higher.
2. Systematic Sampling
A sampling procedure in which an initial starting point is selected by a random process and then every kth
item on the list is selected.
Suppose that the N units in the population are arranged in some systematic order and serially
numbered from 1 to N, and that we want to draw a sample of size n from it such that
N = nk, i.e. k = N/n, where k is usually called the sampling interval.
Systematic sampling consists of selecting one unit at random from the first k units, numbered from 1 to k,
and then selecting every kth unit in succession. Thus, if the first unit selected at random is the ith
unit, the systematic sample of size n will consist of the units numbered
i, i + k, i + 2k, …, i + (n-1)k.
The random number i is called the random start, and its value in fact determines the whole sample.
As an example, suppose that we want to select 50 voters from a list of voters containing 1,000
names arranged systematically. Here
n = 50; N = 1,000; k = N/n = 1,000/50 = 20
We select any number from 1 to 20 at random, and the corresponding voter in the list is selected. Suppose
the selected number is 6. Then the systematic sample will consist of the 50 voters in the list at serial numbers
6, 26, 46, 66, …, 966, 986.
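The voter example can be reproduced with a short Python sketch. The function name `systematic_sample` and its parameters are illustrative assumptions, and the sketch assumes that N is an exact multiple of n, as in the example.

```python
import random


def systematic_sample(population_size, sample_size, start=None, rng=None):
    """Return the serial numbers picked by systematic sampling.

    Assumes population_size is an exact multiple of sample_size, so that
    the sampling interval k = N / n is a whole number.
    """
    if population_size % sample_size != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = population_size // sample_size          # sampling interval
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)               # random start i in 1..k
    if not 1 <= start <= k:
        raise ValueError("random start must lie between 1 and k")
    # Units i, i + k, i + 2k, ..., i + (n - 1)k
    return [start + j * k for j in range(sample_size)]


voters = systematic_sample(1000, 50, start=6)
print(voters[:4], voters[-2:])
```

With a random start of 6 and an interval of 20, the first serial numbers are 6, 26, 46, 66 and the last two are 966 and 986. Leaving `start` unset draws the random start from 1 to k.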
Merits
Simple to draw sample and easy to check. It has moderate cost.
Demerits
If sampling interval is related to a periodic ordering of the population, it may introduce increased
variability.
3. Stratified Sampling
A probability sampling procedure in which sub-samples are drawn from within different strata
that are more or less equal on some characteristic.
The first step, choosing strata on the basis of existing information, is the same for both stratified and
quota sampling. However, the processes of selecting sampling units (elements) within the stratum differ
substantially. In stratified sampling, a sub-sample is drawn using simple random sampling within each
stratum. This is not true of quota sampling.
OSU July, 2018 Page 12
Business Statistics Handout
The reason for taking a stratified sample is to have a more efficient sample than could be taken on the
basis of simple random sampling.
Another reason for taking a stratified sample is the assurance that the sample will accurately reflect the
population on the basis of the criterion or criteria used for stratification.
Merits
It assures representation of all groups in a sample.
Characteristics of each stratum can be estimated and comparisons made.
Further it reduces variability for same sample size.
Demerits
It requires accurate information on proportion in each stratum.
If stratified lists are not already available they can be costly to prepare.
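A minimal sketch of stratified sampling follows, assuming proportional allocation (the same sampling fraction in every stratum) and a simple random sample within each stratum. The function name, the strata and the member labels are all hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum.

    strata maps a stratum name to the list of its members. Proportional
    allocation is an assumption of this sketch, not the only option.
    """
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        size = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, size)   # SRS within the stratum
    return sample


# Hypothetical college with 40 female and 60 male students.
strata = {
    "female": [f"F{i}" for i in range(1, 41)],
    "male": [f"M{i}" for i in range(1, 61)],
}
chosen = stratified_sample(strata, fraction=0.10, seed=1)
print({name: len(members) for name, members in chosen.items()})
```

Because each stratum is sampled separately, both groups are guaranteed to appear in the sample in proportion to their share of the population.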
4. Cluster Sampling
An economically efficient sampling technique in which the primary sampling unit is not the
individual element in the population but a larger cluster of elements; the clusters are selected randomly.
The area sample is the most popular type of cluster sample.
A grocery researcher, for example, may randomly choose several geographic areas as the primary
sampling units and then interview all, or a sample, of the grocery stores within those geographic
clusters. Interviews are confined to these clusters; no interviews occur in other clusters.
Cluster samples are frequently used when no lists of the sample population are available.
Merits
If clusters are geographically defined, it yields the lowest field cost.
It requires a listing of all clusters, but of individuals only within the selected clusters.
It can estimate characteristics of clusters as well as of the population.
Demerits
It introduces larger error for a comparable sample size than other probability samples.
The researcher must be able to assign each population member to a unique cluster; otherwise duplication or
omission of individuals results.
Merits
It is useful for certain types of forecasting, where the sample is guaranteed to meet a specific objective.
Moreover, it has moderate cost and average use.
Demerits
It introduces bias due to experts' beliefs, and this may make the sample unrepresentative.
This is because not every element in the population has a chance of being included in the sample.
3. Quota Sampling.
It is a non-probability sampling procedure in which the researcher classifies the population by pertinent properties,
determines the desired proportion of the sample from each class, and fixes quotas for each interviewer.
Suppose a firm wishes to investigate consumers who currently own videotape recorders. The
researcher wishes to ensure that each brand of recorder is proportionately included in the sample.
The purpose of quota sampling is to ensure that the various subgroups in a population are represented
on pertinent sample characteristics to the exact extent that the investigators desire.
Stratified sampling, a probability sampling procedure, also has this objective, and it should not be
confused with quota sampling. In quota sampling, the interviewer has a quota to achieve.
Merits
It introduces some stratification of population and requires no list of population.
It has moderate cost and it is used very extensively.
One can finish data collection in a very short period of time.
Demerits
It introduces bias through the researcher's classification of subjects.
Further, non-random selection within classes means that the error relative to the population cannot be estimated.
Sampling Error
It is the difference between the value of a sample statistic obtained from a sample and the value of the
corresponding population parameter obtained from the population. It is important to remember that a
sampling error occurs because of chance.
Now suppose that, when we select the above-mentioned sample, we mistakenly record the second salary as
2900 instead of 2000. As a result, we calculate the sample mean as:
$$\bar{X} = \frac{2900 + 3500 + 3500}{3} = 3300$$
Consequently, the difference between the sample mean and the population mean is: $\bar{X} - \mu = 3300 - 2800 = 500$
This difference does not represent the sampling error alone. As we calculated earlier, only 200 of this
difference is due to sampling error. The remaining portion, 500 - 200 = 300 birr, represents
non-sampling error, because it arose from the error we made in recording the second salary in the sample.
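The split of the total difference into its two parts can be checked numerically. In this sketch the correctly recorded sample is taken to be 2000, 3500 and 3500 birr, which is an assumption inferred from the stated sampling error of 200 and population mean of 2800; the order of the salaries is also assumed.

```python
from statistics import mean

population_mean = 2800

# Assumed sample values, inferred from the figures quoted in the text.
correct_sample = [2000, 3500, 3500]
recorded_sample = [2900, 3500, 3500]   # 2000 mis-recorded as 2900

sampling_error = mean(correct_sample) - population_mean      # chance alone
total_error = mean(recorded_sample) - population_mean        # chance + mistake
non_sampling_error = total_error - sampling_error            # the mistake

print(sampling_error, total_error, non_sampling_error)
```

The recording mistake of 900 birr on one of three salaries shifts the sample mean by 900 / 3 = 300 birr, which is exactly the non-sampling error.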
A statistical table is an orderly and logical arrangement of data into rows and columns, and it attempts to
present voluminous and heterogeneous data in a condensed and homogeneous form. But before
tabulating the data, a systematic arrangement of the raw data into different homogeneous classes
is generally necessary, to sort out the relevant and significant features from the irrelevant and insignificant ones.
This process of arranging the data into groups or classes according to resemblances and similarities is
technically called classification. Thus, classification emphasizes the arrangement of the data into
different classes, which are determined depending upon the nature, objectives and scope of the enquiry.
For instance, the students registered at Public Service College of Oromiya during academic
year 2005 E.C. may be classified on the basis of any of the following criteria.
i. Different faculties:
- Agribusiness
- Human Resource
- Accounting
- Law
ii. Sex
iii. Age
iv. The zone to which they belong
v. Religion
vi. Heights or weights
Thus the same set of data can be classified into different groups or classes in a number of ways based on
any recognizable physical, social or mental characteristic which exhibits variation among the different
elements of the given data.
Bases of Classification
The bases or criteria with respect to which the data are classified depend primarily on the objectives
and the purpose of the inquiry. Generally, the data can be classified on the following four bases:
Geographical classification
Chronological classification
Qualitative classification
Quantitative classification
Frequency Distribution
Definitions:
Raw data: Is recorded information in its original collected form, whether it is counts or
measurements.
Frequency: is the number of values in a specific class of the distribution.
Frequency distribution: is a summarized presentation of the values of a variable arranged in order of
magnitude either individually (in case of discrete variable) or in to classes (in case of continuous
variable) or into categories (in case of qualitative data).
There are three basic types of frequency distributions:
1) Categorical frequency distribution
This is used for data that can be placed in specific categories such as nominal or ordinal data.
Example: marital status of 60 adults classified as single, married, divorced and widowed is given as:
Marital Status Single Married Divorced Widowed Total
Number of adults 25 20 8 7 60
Step -1
A better presentation of the above raw data would be to arrange them in ascending or descending
order of magnitude, which is called the 'arraying' of the data. However, this presentation (arraying),
though better than the raw data, does not reduce the volume of the data.
Step-2
A much better way of representing the data is to express them in the form of a discrete or
ungrouped frequency distribution, where we count the number of times each value of the variable
(marks in the above illustration) occurs in the data. This is facilitated through the technique of
Tally-Marks or Tally-Bars, as explained below.
Table 2: Marks of 200 Students
Marks Tally Bars Frequency Marks Tally Bars Frequency
15 || 2 42 ||||| ||||| |||| 14
16 || 2 43 ||||| 5
17 ||||| 5 44 ||| 3
18 | 1 45 ||||| 5
19 | 1 46 ||||| | 6
20 | 1 47 |||| 4
21 || 2 48 ||||| | 6
22 || 2 49 |||| 4
23 || 2 50 ||||| 5
24 || 2 51 |||| 4
25 || 2 52 |||| 4
26 || 2 53 ||||| ||| 8
27 ||| 3 54 ||| 3
28 || 2 55 ||| 3
29 ||| 3 57 ||| 3
30 ||| 3 58 | 1
31 |||| 4 59 ||| 3
32 ||||| 5 60 | 1
33 ||||| || 7 61 || 2
34 ||||| || 7 62 | 1
35 ||||| 5 63 || 2
36 ||| 3 64 || 2
37 ||||| || 7 65 ||| 3
38 ||||| ||| 8 67 | 1
39 ||||| |||| 9 70 || 2
40 ||||| | 6 75 | 1
41 ||||| || 7 78 | 1
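Counting with tally bars corresponds to building a frequency counter. The sketch below uses a short, hypothetical list of raw marks (the original 200 raw marks are not reproduced here), chosen so that the counts for 15 to 18 match the first rows of Table 2.

```python
from collections import Counter

# Hypothetical raw marks, used only to illustrate the tallying step.
marks = [15, 17, 16, 17, 15, 18, 17, 16, 17, 17]

frequency = Counter(marks)

# Print one row per distinct mark: value, tally bars, frequency.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>3}  {'|' * count:<6} {count}")
```

Each distinct mark gets one row, so the ungrouped frequency distribution is as long as the number of distinct values in the data.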
Step -3: Arranging the data into groups
If the identity of the units about whom a particular information is collected is not relevant nor is the
order in which the observations occur, then the first real step of condensation consists in classifying
the data into different classes by dividing the entire range of the values of the variable into a suitable
number of groups called classes and then recording the number of observation in each group (class).
In order to construct groups or classes for the data, follow these steps:
Find the largest and smallest values: in our case 78 and 15.
Compute the range (the difference between the two values): in our case 78 – 15 = 63.
Determine the number of classes, usually between 5 and 20. Sturges' rule, $k = 1 + 3.322 \log_{10} N$,
where k is the number of classes and N is the total number of observations, is a common guide;
for N = 200 it gives k ≈ 8.6. The table below uses K = 13 classes.
Find the class width (W) by dividing the range by the number of classes and
rounding up, not off. In our case W = R/K = 63/13 ≈ 4.85, which rounds up to 5.
Take the minimum value as the lower class limit of the first class and then add the width to find the
rest of the lower limits. To find the upper limit of the first class, count 5 values in the first class
and take the last one as the upper limit, then add the width to find the rest of the upper limits.
Find the frequencies for each class.
Marks of 200 Students
Marks Frequency Marks Frequency
(X) (f) (X) (f)
15-19 11 50-54 24
20-24 9 55-59 10
25-29 12 60-64 8
30-34 26 65-69 4
35-39 32 70-74 2
40-44 35 75-79 2
45-49 25
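The class-construction steps can be sketched in Python. The helper names below are hypothetical, and the limits assume whole-number marks, so each class of width 5 runs from its lower limit to lower limit + 4. Note that Sturges' rule as written gives about 9 classes for N = 200, while the table uses 13 classes of width 5; the sketch reproduces the table's limits by passing 13 explicitly.

```python
import math


def sturges_classes(n_observations):
    """Number of classes suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n_observations))


def class_width(smallest, largest, n_classes):
    """Range divided by the number of classes, rounded up (not off)."""
    return math.ceil((largest - smallest) / n_classes)


def class_limits(smallest, width, n_classes):
    """Lower and upper limits for integer-valued data."""
    return [
        (smallest + i * width, smallest + (i + 1) * width - 1)
        for i in range(n_classes)
    ]


suggested = sturges_classes(200)
width = class_width(15, 78, 13)
limits = class_limits(15, width, 13)
print(suggested, width, limits[0], limits[-1])
```

The first class comes out as 15-19 and the last as 75-79, matching the grouped table of marks.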
Remark
In case of open end classes, it is customary to estimate the class mark or mid-value for the first class
with reference to the succeeding class (i.e.2nd class). In other words, we assume that the magnitude
of the first class is same as that of second class.
Similarly the mid-value of the last class is determined with reference to the preceding class i.e., last
but one class. This assumption will, of course, introduce some error in the calculation of further
statistical measures (averages, dispersion, etc.).
Cumulative Frequency Distribution
A frequency distribution simply tells us how frequently a particular value of the variable
occurs. However, if we want to know the total number of observations with a value 'less than' or
'more than' a particular value of the variable, this frequency table fails to furnish the information.
This information can be obtained very conveniently from the 'cumulative frequency distribution',
which is obtained by successively adding the frequencies of the values of the variable (classes)
according to a certain law.
The laws used are of the 'less than' and 'more than' type, giving rise to the 'less than cumulative frequency
distribution' and the 'more than cumulative frequency distribution'.
Let us consider the following distribution of marks of 70 students in a test:
Marks No of Students
30 – 35 5
35 – 40 10
40 – 45 15
45 – 50 30
50 – 55 5
55 - 60 5
Total 70
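Both cumulative distributions for this table can be built by running sums, as the following sketch shows. The variable names are illustrative; 'less than' totals are read against each class's upper limit and 'more than' totals against each class's lower limit.

```python
from itertools import accumulate

classes = [(30, 35), (35, 40), (40, 45), (45, 50), (50, 55), (55, 60)]
frequencies = [5, 10, 15, 30, 5, 5]

# 'Less than' type: add frequencies from the first class downwards.
less_than = list(accumulate(frequencies))

# 'More than' type: add frequencies from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lt, mt in zip(classes, less_than, more_than):
    print(f"less than {upper}: {lt:>2}    more than {lower}: {mt:>2}")
```

For example, 30 students scored less than 45, and 40 students scored 45 or more.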
1) Bar Graphs
A bar graph is a graphical presentation which plots the successive values with their frequencies
using bars. All bars in the bar graph have equal width.
2) Frequency Polygon
A frequency polygon is a graphical presentation which plots the successive values of a data set with
their frequencies and connects the plotted points with straight lines.
Example: The above data could be represented by a frequency polygon as:
A polygon is a closed figure. In figure 2, to make it a closed figure we have to add two values,
one below the lowest value and one above the highest value, with zero frequencies; for example, in the above
graph we add the lower value 35 and the upper value 80 with zero frequencies.
3) Pie chart
It is used to plot relative frequencies when the data are non-numerical: a circle is sliced into distinct
sectors, and the area of each sector represents the relative frequency of the corresponding item.
If the relative frequency of a data value is $f/n$, then its sector takes up the fraction $f/n$ of the
full angle of the circle; i.e. the angle of the sector is $\frac{f}{n} \times 360°$, where 360° is the angle
at the center of the circle.
Example:
Items Expenditure
Food 160 birr
Clothes 80 birr
House rent 120 birr
Education 40 birr
Total 400 birr
Solution:
To express this data in a pie chart, first determine the angle of each sector as a proportion of the full circle.
Items        Expenditure   Angle of each sector in degrees
Food         160 birr      (160/400) x 360° = 144°
Clothes      80 birr       (80/400) x 360° = 72°
House rent   120 birr      (120/400) x 360° = 108°
Education    40 birr       (40/400) x 360° = 36°
Total        400 birr      (400/400) x 360° = 360°
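The sector angles follow directly from the rule angle = (f / n) x 360°. A small sketch, with an illustrative dictionary holding the expenditure figures from the example:

```python
expenditure = {"Food": 160, "Clothes": 80, "House rent": 120, "Education": 40}

total = sum(expenditure.values())

# Angle of each sector: share of the total, scaled to the 360 degrees of a circle.
angles = {item: amount * 360 / total for item, amount in expenditure.items()}

for item, angle in angles.items():
    print(f"{item:<11} {angle:>5.0f} degrees")
print("Total", sum(angles.values()))
```

The four angles always add up to 360°, which is a quick check that no category was left out.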
Using this table, it could be possible to represent the data in a pie chart as follows:
4) Histograms
Histogram is a bar graph with the bars placed adjacent to each other. The vertical axis of a
histogram can represent either the class frequency in a frequency histogram or relative class
frequency in a relative frequency histogram.
Example: The following table shows the distribution of the lifetimes of 485 radio tubes.
Lifetime (in hours)   No. of tubes
300 – 400             60
400 – 500             40
500 – 600             80
600 – 700             120
700 – 800             60
800 – 900             80
900 – 1000            45
Represent the above frequency distribution by Histogram
UNIT THREE
DESCRIPTIVE STATISTICS AND STATISTICAL SUMMARIZATION
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Example: What is the average monthly income of the 10 students in a class, given below?
300 1200 800 500 750 2000 1500 1800 350 600
Solution:
$$\bar{x} = \frac{300 + 1200 + 800 + 500 + 750 + 2000 + 1500 + 1800 + 350 + 600}{10} = \frac{9800}{10} = 980$$
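The same calculation in Python, as a quick check of the arithmetic (the variable name `incomes` is illustrative):

```python
from statistics import mean

incomes = [300, 1200, 800, 500, 750, 2000, 1500, 1800, 350, 600]

total_income = sum(incomes)        # numerator of the mean
average_income = mean(incomes)     # total divided by the number of students

print(total_income, len(incomes), average_income)
```

The total is 9800 over 10 students, giving an average monthly income of 980.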
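$$\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$$
where $\sum f_i$ is the total number of observations.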
Example:
Suppose the following data represent the number of patients served in a clinic per day
Number of patients Number of days
1 1
5 3
10 5
12 2
15 4
Find the average number of patients served in the given clinic per day.
Solution: The sample mean for data arranged in a frequency table is computed as
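$$\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
where $f_i$ represents the number of days and $x_i$ the number of patients.
$$\bar{X} = \frac{1 \times 1 + 5 \times 3 + 10 \times 5 + 12 \times 2 + 15 \times 4}{1 + 3 + 5 + 2 + 4} = \frac{1 + 15 + 50 + 24 + 60}{15} = \frac{150}{15} = 10$$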
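The frequency-table mean can be sketched as a small function. The name `frequency_mean` is hypothetical; it multiplies each value by its frequency and divides by the total frequency.

```python
def frequency_mean(values, frequencies):
    """Mean of data given as (value, frequency) pairs."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / sum(frequencies)


patients = [1, 5, 10, 12, 15]   # x_i: number of patients served in a day
days = [1, 3, 5, 2, 4]          # f_i: number of days with that many patients

average_patients = frequency_mean(patients, days)
print(average_patients)
```

Over the 15 days, 150 patients were served in total, so the clinic served 10 patients per day on average.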
Sometimes the values in a data set have different importance, and as a result we may attach a
different weight ($w_i$) to each. In this case, the sample mean is said to be a weighted average ($\bar{x}_w$) and is given as:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$
where $w_i$ is the weight of $x_i$.
Example: Suppose a student takes four courses: Statistics, Economics, Mathematics and Basic English.
Course Title Credit hours Grade obtained Scale
Statistics 4 B 3
Economics 3 C 2
Mathematics 3 A 4
Basic English 3 C 2
Find the Grade Point Average (GPA)
Solution:
$$\text{GPA} = \frac{4 \times 3 + 3 \times 2 + 3 \times 4 + 3 \times 2}{4 + 3 + 3 + 3} = \frac{12 + 6 + 12 + 6}{13} = \frac{36}{13} \approx 2.77$$
Grand mean ($\bar{X}_G$)
Suppose we have two distinct samples of sizes $n_1$ and $n_2$. If the sample mean of the first sample is
$\bar{X}_1$ and that of the second is $\bar{X}_2$, then the sample mean of the combined sample of size $n_1 + n_2$ is
called the Grand Mean.
The Grand Mean is denoted by $\bar{X}_G$ and is given by:
$$\bar{X}_G = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
Example: The following is a frequency table of the ages of a sample of students in a certain college.
Age value Frequency
20 5
22 10
24 12
26 18
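Solution:
The harmonic mean is given by:
$$\bar{X}_{H.M} = \frac{n}{\sum \frac{f_i}{X_i}} = \frac{n}{\frac{f_1}{X_1} + \frac{f_2}{X_2} + \cdots + \frac{f_n}{X_n}} = \frac{45}{\frac{5}{20} + \frac{10}{22} + \frac{12}{24} + \frac{18}{26}} \approx \frac{45}{0.25 + 0.455 + 0.5 + 0.692} = \frac{45}{1.897} \approx 23.7$$
where $n = \sum f_i = 45$. For comparison, the arithmetic mean of the same data is:
$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{20 \times 5 + 22 \times 10 + 24 \times 12 + 26 \times 18}{5 + 10 + 12 + 18} = \frac{1076}{45} \approx 23.9$$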
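Both means for the age data can be verified with a short sketch. The function names are hypothetical. Carrying the reciprocals at full precision gives a harmonic mean of roughly 23.72 (rounding each term to two decimals first pushes the result up slightly).

```python
def harmonic_mean(values, frequencies):
    """Total frequency divided by the sum of f_i / x_i."""
    return sum(frequencies) / sum(f / x for x, f in zip(values, frequencies))


def arithmetic_mean(values, frequencies):
    """Sum of f_i * x_i divided by the total frequency."""
    return sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)


ages = [20, 22, 24, 26]
counts = [5, 10, 12, 18]

hm = harmonic_mean(ages, counts)
am = arithmetic_mean(ages, counts)
print(round(hm, 2), round(am, 2))
```

The harmonic mean comes out slightly below the arithmetic mean, as expected for positive values that are not all equal.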
Suppose we have a sample of n data whose values are given by $X_1, X_2, \ldots, X_n$. The Geometric
Mean of these observations is given by:
$$\bar{X}_{G.M} = \sqrt[n]{X_1 \cdot X_2 \cdot X_3 \cdots X_n}$$
where n is the number of observations.
Example: Suppose the values of a data set are 2, 3 and 36. Find the geometric mean.
Solution:
$$\bar{X}_{G.M} = \sqrt[3]{2 \times 3 \times 36} = \sqrt[3]{216} = 6$$
When the data are arranged in a frequency table, we can compute the geometric mean as follows.
Suppose we have a sample whose values are given by $X_1, X_2, \ldots, X_n$ with frequencies $f_1, f_2, \ldots, f_n$.
The geometric mean of these observations is given by:
$$\bar{X}_{G.M} = \left( X_1^{f_1} \cdot X_2^{f_2} \cdots X_n^{f_n} \right)^{1 / \sum f_i}$$
where $\sum f_i$ is the total number of observations.
Example: Suppose the values of a data set are given in a frequency table as follows.
Value frequency
1 2
2 3
3 1
4 4
Solution: The geometric mean $\bar{X}_{G.M}$ is given by:
$$\bar{X}_{G.M} = \sqrt[10]{1^2 \cdot 2^3 \cdot 3^1 \cdot 4^4} = \sqrt[10]{6144} \approx 2.39$$
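Both geometric-mean examples can be checked with one helper. The function below is a sketch with a hypothetical name; it works with logarithms, which is equivalent to taking the root of the product but avoids very large intermediate products.

```python
import math


def geometric_mean(values, frequencies=None):
    """Geometric mean of positive values, optionally with frequencies."""
    if frequencies is None:
        frequencies = [1] * len(values)
    n = sum(frequencies)                       # total number of observations
    log_sum = sum(f * math.log(x) for x, f in zip(values, frequencies))
    return math.exp(log_sum / n)               # n-th root of the product


simple_gm = geometric_mean([2, 3, 36])
table_gm = geometric_mean([1, 2, 3, 4], [2, 3, 1, 4])
print(round(simple_gm, 2), round(table_gm, 2))
```

The first call reproduces the cube root of 216 and the second the tenth root of 6144.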
Relationship among Arithmetic Mean, Harmonic Mean and Geometric Mean
Suppose a sample of data has two values X1 and X2.
The Arithmetic mean ($\bar{X}_{A.M}$) of these two observations is given by: $\bar{X}_{A.M} = \frac{X_1 + X_2}{2}$
The Harmonic mean ($\bar{X}_{H.M}$) of these two observations is given by: $\bar{X}_{H.M} = \frac{2}{\frac{1}{X_1} + \frac{1}{X_2}} = \frac{2X_1X_2}{X_1 + X_2}$
The Geometric mean ($\bar{X}_{G.M}$) of these two observations is given by: $\bar{X}_{G.M} = \sqrt{X_1 X_2}$
Since $\bar{X}_{A.M} \cdot \bar{X}_{H.M} = X_1 X_2$, it follows that $\sqrt{X_1 X_2} = \sqrt{\bar{X}_{A.M} \cdot \bar{X}_{H.M}}$
Remark:
If the two observations X1 and X2 are equal and positive, then $\bar{X}_{G.M} = \bar{X}_{H.M} = \bar{X}_{A.M}$
In general, for positive observations the relationship among the three means is $\bar{X}_{H.M} \le \bar{X}_{G.M} \le \bar{X}_{A.M}$
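The relationship can be checked numerically for any two positive values. The sketch below uses the hypothetical pair 4 and 9 (not taken from the handout) and confirms that the harmonic mean is the smallest, the arithmetic mean the largest, and that the squared geometric mean equals the product of the other two.

```python
import math

def three_means(x1, x2):
    """Return (arithmetic, harmonic, geometric) mean of two positive values."""
    am = (x1 + x2) / 2
    hm = 2 * x1 * x2 / (x1 + x2)
    gm = math.sqrt(x1 * x2)
    return am, hm, gm

am, hm, gm = three_means(4, 9)   # hypothetical sample values
print(am, gm)                    # 6.5 6.0
print(hm <= gm <= am)            # True
```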
$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i}$$
Where Xi represents the class mark of each class interval
fi represents the corresponding frequency of each class interval.
Class mark (w) is the average value of the lower and upper class limits of a given class, or equivalently
the average value of the lower and upper class boundaries of a given class, i.e.
Class mark of ith class: $w_i = \frac{LCL_i + UCL_i}{2}$
Where: LCLi represents the lower class limit of the ith class
UCLi represents the upper class limit of the ith class
Or
Class mark of ith class: $w_i = \frac{LCB_i + UCB_i}{2}$
Where: LCBi represents the lower class boundary of the ith class
UCBi represents the upper class boundary of the ith class
Example: Consider the class interval given for final results in statistics. Compute the sample mean
Class interval Frequency
40 – 49 5
50 – 59 4
60 – 69 5
70 – 79 8
80 – 89 7
90 – 99 1
Solution
To compute the sample mean, first we have to find the class mark for each class. To find the class
mark, as indicated above, take the average of the lower and upper class limits. Using this formula the
class mark for each class is given in the table below.
Class interval Class mark (wi) Frequency (fi) (Xi )( fi)
40 – 49 44.5 5 222.5
50 – 59 54.5 4 218
60 – 69 64.5 5 322.5
70 – 79 74.5 8 596
80 – 89 84.5 7 591.5
90 – 99 94.5 1 94.5
Total 30 2045
OSU July, 2018 Page 32
Business Statistics Handout
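Then the sample mean is given by:
$$\bar{X} = \frac{\sum X_i f_i}{\sum f_i} = \frac{5\times44.5 + 4\times54.5 + 5\times64.5 + 8\times74.5 + 7\times84.5 + 1\times94.5}{5 + 4 + 5 + 8 + 7 + 1} = \frac{222.5 + 218 + 322.5 + 596 + 591.5 + 94.5}{30} = \frac{2045}{30} = 68.17$$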
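The grouped-data mean above can be reproduced in Python. This is a sketch under the handout's convention that the class mark is the average of the lower and upper class limits; the variable and function names are illustrative.

```python
# Class limits and frequencies from the final-results table.
classes = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [5, 4, 5, 8, 7, 1]

def grouped_mean(class_limits, frequencies):
    """Mean of grouped data: sum(class mark * frequency) / sum(frequency)."""
    marks = [(low + high) / 2 for low, high in class_limits]   # class marks
    total = sum(w * f for w, f in zip(marks, frequencies))
    return total / sum(frequencies)

print(round(grouped_mean(classes, freqs), 2))  # 68.17
```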
3.1.2 The Median
Sample median ($\tilde{X}$)
A statistic which is used to indicate the center of a data set but which is not affected by extreme values is
a sample median. It is the middle value when the data are ranked or arranged from the smallest to the
largest.
If the number of data values is odd, then the sample median is the middle value, i.e. the median is the
value corresponding to the $\left(\frac{n+1}{2}\right)^{th}$ item, where n is the total number of observations.
If the number of data values is even, then the sample median is the average of the two middle values,
i.e. the median is the average of the values corresponding to the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n+2}{2}\right)^{th}$ items.
Example: Find the sample median for the data representing number of items sold by a grocery in 5 days.
10, 28, 5, 12, 30
Solution:
To find the median, first arrange the data in increasing order as follows: 5, 10, 12, 28, 30
Since the sample size is 5 (which is odd), the sample median is the 3rd smallest value. That is, the
median number of items sold in the five days is 12.
This median can also be obtained by taking the $\left(\frac{n+1}{2}\right)^{th} = \left(\frac{5+1}{2}\right)^{th}$ item, which is the 3rd item (12).
Example: The following data represent the number of patients served in a certain clinic for 10 days
5, 13, 20, 2, 6, 18, 9, 15, 7, 18
Solution:
To find the median, first arrange the data in increasing order
2, 5, 6, 7, 9, 13, 15, 18, 18, 20
Since the sample size is 10 (which is even), the median is the average of the two middle values. Thus,
the median is the average value of 9 and 13, which is $\frac{9 + 13}{2} = 11$.
This median can also be obtained by taking the average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n+2}{2}\right)^{th}$ items, i.e. the
median is the average of the $\left(\frac{10}{2}\right)^{th}$ = 5th item, which is 9, and the $\left(\frac{10+2}{2}\right)^{th}$ = 6th item, which is 13.
Thus, Median = $\frac{9 + 13}{2} = 11$
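The odd and even cases can be sketched in one small Python function. The function name is illustrative; the standard library's `statistics.median` gives the same results.

```python
def sample_median(data):
    """Middle value of the sorted data, or the average of the two middle values."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                       # odd n: the ((n + 1) / 2)th item
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # even n: average of (n/2)th and (n/2 + 1)th items

print(sample_median([10, 28, 5, 12, 30]))                    # 12
print(sample_median([5, 13, 20, 2, 6, 18, 9, 15, 7, 18]))    # 11.0
```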
Example: Compute the median for the class interval given for final results in statistics.
Class interval Frequency
40 – 49 5
50 – 59 4
60 – 69 5
70 – 79 8
80 – 89 7
90 - 99 1
Solution:
To find the median, first we have to convert the class limits in the above table into class boundaries,
which are indicated in the table below.
Class boundaries    Frequency    Cumulative frequency
39.5 – 49.5         5            5
49.5 – 59.5         4            9
59.5 – 69.5         5            14
69.5 – 79.5         8            22
79.5 – 89.5         7            29
89.5 – 99.5         1            30
The median class is the class which contains the median value, i.e. the class which contains the
value corresponding to the $\left(\frac{n+1}{2}\right)^{th}$ item. Hence, in the above table, the median class is the class which
contains the $\left(\frac{30+1}{2}\right)^{th} = \left(\frac{31}{2}\right)^{th}$ = 15.5th value.
Therefore, the class which contains the 15.5th value is 69.5 – 79.5, which is the median class.
Properties of median
Unlike the sample mean, which uses all the data values, the median uses only one or two middle values.
The median is not affected by extreme values. That means, even if there are a few extreme values (i.e. very
small or very large values), the median value will not be affected.
Remark:
It is possible to have no mode if all observations occur an equal number of times. For example, in
the following frequency distribution table, there is no mode, since all values occur an equal number of
times.
Value 1 2 3 4
Frequency 5 5 5 5
It is possible to have one mode if one observation occurs most frequently in the data set. For
example, in the following frequency distribution table there is only one mode which is the value 3
because it occurs most frequently in the data set. In this case, the data set is called unimodal
Value 1 2 3 4 5
Frequency 3 2 6 5 1
It is possible to have more than one mode, if two or more observations occur most frequently in the
data set. If only two values occur most frequently in the data set, then the data set is called
bimodal. While if more than two values occur most frequently in the data set, then the data set is
called multimodal.
2) Mean Deviation
The difference between a number in a data set and the mean of the data is called a deviation; it shows
how much a value varies from the mean. Deviation = X – A
Mean deviation is the average of the absolute deviations taken from a measure of central tendency,
usually the mean or the median.
Let X1, X2, X3, …, Xn be n observed values; then
$$M.D = \frac{\sum |X_i - A|}{n}$$
where A is a measure of central tendency.
Solution:
To find the mean deviation from the mean, first let's compute the sample mean as follows:
$$\bar{X} = \frac{\sum f_i X_i}{n} = \frac{f_1x_1 + f_2x_2 + f_3x_3 + f_4x_4}{f_1 + f_2 + f_3 + f_4} = \frac{20\times3 + 22\times4 + 24\times1 + 26\times2}{3 + 4 + 1 + 2} = \frac{60 + 88 + 24 + 52}{10} = \frac{224}{10} = 22.4$$
Now, the mean deviation from the mean is given by
$$M.D = \frac{\sum f_i |X_i - \bar{X}|}{n} = \frac{3|20 - 22.4| + 4|22 - 22.4| + 1|24 - 22.4| + 2|26 - 22.4|}{3 + 4 + 1 + 2} = \frac{3(2.4) + 4(0.4) + 1(1.6) + 2(3.6)}{10} = \frac{17.6}{10} = 1.76$$
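The mean deviation for this frequency table can be verified with a short Python sketch (names are illustrative). With absolute deviations the result is 1.76; averaging the squared deviations instead gives 4.64, which is the variance of the data rather than the mean deviation.

```python
values = [20, 22, 24, 26]
freqs = [3, 4, 1, 2]

def frequency_mean(xs, fs):
    """Mean of a frequency table: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

def mean_deviation(xs, fs):
    """Average absolute deviation from the mean, weighted by frequency."""
    centre = frequency_mean(xs, fs)
    return sum(f * abs(x - centre) for x, f in zip(xs, fs)) / sum(fs)

print(round(frequency_mean(values, freqs), 2))   # 22.4
print(round(mean_deviation(values, freqs), 2))   # 1.76
```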
3) Standard Deviation
Standard deviation is a measure of dispersion that considers all values in a data set. We can compute
standard deviation for population values or for sample values.
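The standard deviation for a population is called the population standard deviation (σ) and the standard
deviation for a sample is called the sample standard deviation (s).
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
Where: Xi is the ith population value, μ is the population mean and N is the population size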
1) Quartiles /Q/
It is a measure of location which divides the data set into four equal parts. Depending on the value of
i (i = 1, 2, 3), we have
Q1 - 1st quartile Q2 - 2nd quartile Q3 - 3rd quartile
Quartile (Qi): for i = 1, 2, 3, is the value corresponding to the $\left(\frac{in + 2}{4}\right)^{th}$ item, where n is the total
number of observations.
First Quartile (Q1) - is a point which divides the data set where 25% of the observations lie below it
and 75% of the observations lie above it. It is the value corresponding to the $\left(\frac{n + 2}{4}\right)^{th}$ item.
2) Deciles /Di/
Decile is also a measure of location which divides the data set into ten equal parts. Depending on
the value of i (i = 1, 2, 3, 4, 5, 6, 7, 8, 9), we have
D1 - 1st decile D4 - 4th decile D7 - 7th decile
D2 - 2nd decile D5 - 5th decile D8 - 8th decile
D3 - 3rd decile D6 - 6th decile D9 - 9th decile
3) Percentile /Pi/
Percentile is a measure of location which divides the data set into 100 equal parts. Depending on the
value of i (i = 1, 2, 3, 4, …, 99), we have
P1 - 1st percentile P2 - 2nd percentile P3 - 3rd percentile P4 - 4th percentile … P99 - 99th percentile
25th percentile (P25): is a point which divides the data set where (25/100)th = (1/4)th of the
observations lie below it and (75/100)th = (3/4)th of the observations lie above it. You will observe
that P25 is equal to Q1, where (1/4)th of the observations lie below it and (3/4)th of the observations lie
above it.
50th percentile (P50): is a point which divides the data set where (50/100)th (half) of the observations
lie below it and (50/100)th (half) of the observations lie above it.
Note that P50 is equal to Q2, which is also equal to D5, because all of them indicate a point which divides the
data set where half of the observations lie below it and half of the observations lie above it.
Example
Find P50, and P75 for the observations which shows the number of items sold for 10 days in a shop
12, 18,15,10,8,16,4,9,18,5
Solution
First arrange the data set in an increasing order
4,5,8,9,10,12,15,16,18,18
Total number of observations is 10
P50 = The value corresponding to the $\left(\frac{50 \times 10 + 50}{100}\right)^{th}$ item = The value corresponding to the $\left(\frac{550}{100}\right)^{th}$ item
= The value corresponding to the 5.5th item
= The value corresponding to the 5th item + 0.5 (6th item – 5th item)
= 10 + 0.5 (12 – 10) = 10 + 1 = 11, which is equal to D5 and Q2
P75 = The value corresponding to the $\left(\frac{75 \times 10 + 50}{100}\right)^{th}$ item = The value corresponding to the $\left(\frac{800}{100}\right)^{th}$ item
= The value corresponding to the 8th item = 16
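The positional rule used in this example, item number (i·n + 50)/100 with linear interpolation between neighbouring items, can be sketched in Python as follows. The function name is illustrative, and clamping positions that fall outside the data to the smallest or largest value is an assumption made here, not something the handout states.

```python
def percentile(data, i):
    """i-th percentile using the ((i*n + 50) / 100)th-item rule with interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = (i * n + 50) / 100       # 1-based position of the item
    k = int(pos)                   # whole part of the position
    frac = pos - k                 # fractional part of the position
    if k < 1:                      # assumption: clamp to the smallest value
        return xs[0]
    if k >= n:                     # assumption: clamp to the largest value
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

items_sold = [12, 18, 15, 10, 8, 16, 4, 9, 18, 5]
print(percentile(items_sold, 50))   # 11.0
print(percentile(items_sold, 75))   # 16.0
```

Other statistical packages use different positional conventions, so their percentiles for the same data may differ slightly.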
Remark:
From the above discussion, you can observe the following relationships
Q1 = P25, Q2 = D5 = P50, Q3 = P75
D1 = P10, D2 = P20, …, D9 = P90
The quartile class is the class which contains the respective quartile, i.e. the class containing the
$\left(\frac{i(n+1)}{4}\right)^{th}$ value.
The first quartile class is the class which contains the value corresponding to the $\left(\frac{1(n+1)}{4}\right)^{th}$ item = the
value corresponding to the $\left(\frac{31}{4}\right)^{th}$ item = the value corresponding to the 7.75th item, which is 49.5 – 59.5.
The lower class boundary of the first quartile class is 49.5. The frequency of the first quartile class
is 4, the class width is 10, the sum of frequencies for classes below the first quartile class is 5 and the
total number of observations is 30.
Using this information, the first quartile is given by:
$$Q_1 = 49.5 + \left(\frac{\frac{30}{4} - 5}{4}\right) \times 10 = 49.5 + \frac{(7.5 - 5) \times 10}{4} = 49.5 + \frac{2.5 \times 10}{4} = 49.5 + 6.25 = 55.75$$
The second quartile class is the class which contains the value corresponding to the $\left(\frac{2(n+1)}{4}\right)^{th}$ item = the
value corresponding to the $\left(\frac{31}{2}\right)^{th}$ item = the value corresponding to the 15.5th item, which is 69.5 – 79.5.
The lower class boundary of the second quartile class is 69.5. The frequency of the second quartile class is
8, and the sum of frequencies for classes below the second quartile class is 14.
Example: Find D2, and D5 for the class interval given for final results in statistics,
Class interval frequency Cumulative frequency
39.5 - 49.5 5 5
49.5 – 59.5 4 9
59.5 - 69.5 5 14
69.5 - 79.5 8 22
79.5 – 89.5 7 29
89.5 – 99.5 1 30
Solution
The decile class is the class which contains the respective decile, i.e. it is the class which contains the
value corresponding to the $\left(\frac{i(n+1)}{10}\right)^{th}$ item.
The second decile (D2) class is the class which contains the value corresponding to the $\left(\frac{2(n+1)}{10}\right)^{th}$ item =
the value corresponding to the $\left(\frac{n+1}{5}\right)^{th}$ item = the value corresponding to the $\left(\frac{31}{5}\right)^{th}$ item = the value
corresponding to the 6.2th item, which is 49.5 – 59.5.
The lower class boundary of the second decile class is 49.5. The frequency of the second decile class is 4
and the sum of the frequencies of all classes below the second decile class is 5.
Using this information, the second decile is given by:
$$D_2 = 49.5 + \left(\frac{\frac{2 \times 30}{10} - 5}{4}\right) \times 10 = 49.5 + \frac{(6 - 5) \times 10}{4} = 49.5 + \frac{10}{4} = 49.5 + 2.5 = 52$$
Similarly, the value of D5 lies in the class interval which contains the value corresponding to the
$\left(\frac{5(30+1)}{10}\right)^{th}$ item, which is the value corresponding to the 15.5th item.
Solution:
The percentile class is the class which contains the respective percentile, i.e. it is the class which
contains the value corresponding to the $\left(\frac{i(n+1)}{100}\right)^{th}$ item.
The 25th percentile (P25) class is the class which contains the value corresponding to the $\left(\frac{25(n+1)}{100}\right)^{th}$ item,
which is the value corresponding to the 7.75th item. The class which contains the value corresponding to the
7.75th item is 49.5 – 59.5 and the lower class boundary is 49.5. The frequency of the given class is 4
and the sum of frequencies of all classes below the given class is 5.
Using this information, the 25th percentile (P25) is given by
$$P_{25} = 49.5 + \left(\frac{\frac{25 \times 30}{100} - 5}{4}\right) \times 10 = 49.5 + \frac{(7.5 - 5) \times 10}{4} = 49.5 + \frac{2.5 \times 10}{4} = 49.5 + 6.25$$
= 55.75, which is equal to the first quartile.
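The quartile, decile and percentile computations for grouped data all follow one pattern: locate the class containing the (i(n+1)/parts)th item, then interpolate inside it with lower boundary + ((i·n/parts − cumulative frequency below) / class frequency) × class width. A minimal Python sketch of that pattern, with illustrative names, is:

```python
boundaries = [(39.5, 49.5), (49.5, 59.5), (59.5, 69.5),
              (69.5, 79.5), (79.5, 89.5), (89.5, 99.5)]
freqs = [5, 4, 5, 8, 7, 1]

def grouped_quantile(class_boundaries, frequencies, i, parts):
    """i-th quantile when the data are split into `parts` equal parts
    (parts = 4 for quartiles, 10 for deciles, 100 for percentiles)."""
    n = sum(frequencies)
    position = i * (n + 1) / parts     # item number used to locate the class
    below = 0                          # cumulative frequency below the current class
    for (lower, upper), f in zip(class_boundaries, frequencies):
        if below + f >= position:
            return lower + (i * n / parts - below) / f * (upper - lower)
        below += f
    raise ValueError("position lies outside the table")

print(grouped_quantile(boundaries, freqs, 1, 4))     # Q1  = 55.75
print(grouped_quantile(boundaries, freqs, 2, 10))    # D2  = 52.0
print(grouped_quantile(boundaries, freqs, 25, 100))  # P25 = 55.75
```

Applying the same function with i = 2 and parts = 4 gives the second quartile, 69.5 + ((15 − 14)/8) × 10 = 70.75.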
(The entries are the probabilities that a random variable having the standard normal distribution will
take on a value between 0 and z.)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
UNIT FOUR
PROBABILITY AND PROBABILITY DISTRIBUTION
Introduction
There are few things in our lives that are absolutely certain. This uncertainty makes life challenging and
interesting. How boring it would be to know in advance everything that was going to happen to us; life would
be so much less fun if the mystery were taken out. We wouldn't need elections because the winners and
losers would be known beforehand. We wouldn't need much of a stock market because everyone would
know what stock prices would be tomorrow and every day after that. We wouldn't need sports events
because we would already know the outcomes.
The idea behind probabilities is to try to quantify these uncertainties. Uncertainty means that a variety of
outcomes are possible. We can better understand this uncertainty and be more prepared for the possibilities
if we use probabilities to describe which outcomes are likely and which are unlikely.
1) Random experiment
A random experiment is a kind of experiment in which the outcome varies from one performance of the
experiment to the next even though the conditions are the same.
Example:
Toss a coin: the result of the experiment is either Tail (T) or Head (H).
Toss a die: the result of the experiment is one of the numbers in the set {1, 2, 3, 4, 5, 6}.
Example:
In a random experiment of tossing a coin, the possible outcomes are Head or Tail: S = {H, T}
In a random experiment of tossing a die, the possible outcomes are: S = {1, 2, 3, 4, 5, 6}
Note: If a sample space has a finite number of sample points, then it is called a finite sample space.
Otherwise, it is called an infinite sample space.
3) Event /E/
An event is a subset of the sample space.
8) Independent Events
Events are said to be independent of each other if the happening of any one of them is not affected by the
happening of any of the others. For example:
In tossing a die repeatedly, the event of getting '5' in the 1st throw is independent of getting '5'
in the second, third or subsequent throws.
Similarly, drawing balls from an urn gives independent events if the draws are made with
replacement, i.e. if the ball drawn in the 1st draw is replaced, then the resulting draws will be
independent.
4.2. Counting Techniques
In the study of what is possible there is a problem of determining the number of ways in which things
can happen. For this purpose we need counting techniques that include the multiplication of choices,
permutations and combinations
Example: In how many different ways can 4 different Instructors be introduced to the student?
Solution: There are 4! = 4x3x2x1 = 24 ways in which they can be introduced
4.2.3. Permutations
Suppose there are n distinct objects and we want to arrange r of these objects in a line. There are n
ways of choosing the 1st object, n – 1 ways of choosing the 2nd object, and so on; finally
there are n – (r – 1) ways of choosing the rth object.
Applying the fundamental principle of counting, a permutation of n different objects taken r at a
time, denoted by nPr, is an ordered arrangement of only r of the n objects.
The number of different permutations of n different objects taken r at a time (without repetition) is given by:
$${}_nP_r = n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!}$$
Example: Find the number of permutations of four letters a, b, c, d, if we take only two of the four letters
Example: In how many ways can 6 people be seated at a round table if they can sit anywhere?
Solution: There are 5! = 5x4x3x2x1 = 120 ways of arranging the 6 people in a circle.
The number of permutations of n objects of which n1 are of one kind, n2 are of a second kind, …,
nk are of a kth kind, and n1 + n2 + … + nk = n, is given by: $\frac{n!}{n_1!\, n_2! \cdots n_k!}$
Example: 4 red marbles, 2 white marbles and 3 blue marbles are arranged in a row. If all the marbles of
the same color are not distinguishable from each other, how many different arrangements are possible?
Solution: The number of ways of arranging the 9 marbles is given by: $\frac{9!}{4!\,2!\,3!} = 1260$
4.2.4. Combination
A combination of n different objects taken r at a time, denoted by nCr, is a selection of only r
of the n objects without any regard to the order of arrangement. In a permutation, by contrast, we are
interested in the order of arrangement of the objects.
For example: abc and bca are the same combination; however, they are different permutations.
The number of different combinations of r objects selected from n distinct objects (without
repetition) is
$${}_nC_r = \frac{n!}{r!(n-r)!} = \frac{{}_nP_r}{r!}, \quad \text{for } r = 0, 1, 2, \ldots, n$$
Example: In how many ways can a committee of 3 people be chosen out of 7 people?
Solution: There are $ {}_7C_3 = \frac{7!}{3!\,4!} = 35$ ways of forming a committee of 3 people chosen out of 7 people.
The number of different combinations of n objects selected from n distinct objects is
$${}_nC_n = \frac{n!}{n!(n-n)!} = 1$$
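The counting results in this section can be checked with Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later). The sketch below reuses the numbers from the examples above.

```python
from math import comb, factorial, perm

print(factorial(4))     # 24   ways to introduce 4 instructors
print(perm(4, 2))       # 12   permutations of a, b, c, d taken two at a time
print(factorial(6 - 1)) # 120  circular arrangements of 6 people, (n - 1)!

# Permutations of 9 marbles with 4, 2 and 3 indistinguishable ones.
print(factorial(9) // (factorial(4) * factorial(2) * factorial(3)))  # 1260

print(comb(7, 3))       # 35   committees of 3 chosen from 7 people
print(comb(5, 5))       # 1    nCn is always 1 (n = 5 is an arbitrary choice)
```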
Example: Consider an experiment of rolling a die once. What is the probability of getting?
a) a '2' b) an odd number c) a '1' or a '3' and d) a '7'
Solution: In rolling a die once, the possible outcomes are: 1, 2, 3, 4, 5, 6. That is, the sample space is
S={1,2,3,4,5,6} , n(s) = 6
a) E = {2}, the number of favorable cases to the event E is 1, so P(E) = 1/6
b) E = {1, 3, 5}, the number of favorable cases to the event E is 3, so P(E) = 3/6 = 1/2
c) E = {1, 3}, the number of favorable cases to the event E is 2, so P(E) = 2/6 = 1/3
d) In rolling a die, it is impossible to get a '7'. Hence the event is E = { }, so P(E) = 0/6 = 0
Example: Two items are chosen at random from a box containing 4 defective and 6 non-defective items.
What is the probability that:
a) Both are defective b) Both are non-defective and c) One is defective and the other is none defective
Solution: There are a total of 4 + 6 = 10 items. We can choose two items out of 10 in
$${}_{10}C_2 = \frac{10!}{2!(10-2)!} = \frac{10!}{2!\,8!} = 45 \text{ ways}$$
There are a total of 45 possible equally likely outcomes.
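Given the 45 equally likely pairs, the three requested probabilities follow from counting the favorable pairs in each case. The sketch below works this out under the classical (equally likely outcomes) approach; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

defective, good = 4, 6
total_pairs = comb(defective + good, 2)           # 45 ways to pick 2 of 10 items

p_both_defective = Fraction(comb(defective, 2), total_pairs)
p_both_good = Fraction(comb(good, 2), total_pairs)
p_one_of_each = Fraction(comb(defective, 1) * comb(good, 1), total_pairs)

print(p_both_defective)   # 2/15
print(p_both_good)        # 1/3
print(p_one_of_each)      # 8/15
```

The three cases are mutually exclusive and cover every possible pair, so the probabilities add up to 1.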
If an unbiased coin is tossed at random, then the classical probability gives the probability of a head as
½. Thus if we toss a coin 20 times, then classical probability suggests we should have 10 heads. In
practice, this may not generally be true.
As a result, in 20 throws of a coin, we may get no head at all, or 1 or 2 heads. However, the empirical
probability suggests that if a coin is tossed a large number of times, say 500 times, we should on
average expect 50% heads and 50% tails.
Thus empirical probability approaches classical probability as the number of trials becomes large, i.e.
$$P(A) = \lim_{n \to \infty} \frac{h}{n}$$
where h is the number of times event A occurs in n trials.
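The idea that the relative frequency h/n settles near the classical probability as n grows can be illustrated with a small simulation. This is only a sketch: it uses Python's pseudo-random generator as a stand-in for a fair coin, and the seed and sample sizes are arbitrary choices.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return h / n, the share of heads."""
    rng = random.Random(seed)                  # fixed seed so runs are repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses

# Small samples can be far from 0.5; large samples are typically close to it.
for n in (20, 500, 100_000):
    print(n, relative_frequency_of_heads(n))
```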
Both the classical and the frequency approaches have serious drawbacks: first, because the words "equally
likely" are vague, and second, because the "large number" involved is vague. Because of these
difficulties, mathematicians have been led to an axiomatic approach to probability.
Solution:
a) Define the following events:
A: The event that an applicant is accepted in firm X. P (A) = 0.3 = 30%
B: The event that an applicant is accepted in firm Y. P (B) = 100% - 60 % = 40% = 0.4
Since the chance of being rejected in firm Y is 60%, his chance of being accepted in firm Y is 100-60
= 40% = 0.4. Since an applicant can be accepted in one and only one of the two firms, events A and B
cannot both occur at the same time. i.e. A and B are mutually exclusive. Therefore:
P (A ∪ B) = P (A) + P (B) = 0.3 + 0.4 = 0.7 = 70% chance of being accepted in one of the firms.
Conditional Probability
Let E1 and E2 be two events and P(E1) > 0. The probability that event E2 will occur given that
event E1 has already occurred, denoted by P(E2/E1), is given by:
$$P(E_2 / E_1) = \frac{P(E_1 \cap E_2)}{P(E_1)}$$
Similarly, $P(E_1 / E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$, if P(E2) > 0
Example: Suppose a single die is tossed once.
A. Find the probability that a single toss of a die will result in a number less than 4.
B. Find the probability that a single toss of a die will result a number less than 4 given that the toss
resulted in an odd number.
Solution
a. Let the event E denote a number less than 4, i.e. E = {1, 2, 3}, and the sample space is S =
{1, 2, 3, 4, 5, 6}. Hence, the probability that event E will occur is given by: $P(E) = \frac{n(E)}{n(S)} = \frac{3}{6} = \frac{1}{2}$
b. Here we have two events: event E1 denotes a number less than 4, i.e. E1 = {1, 2, 3}, and event E2 denotes
odd numbers, i.e. E2 = {1, 3, 5}. Now it is required to find the probability that E1 will occur given that
E2 has already occurred, i.e. P(E1/E2).
Using the formula given for conditional probability, the probability is given as: $P(E_1 / E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$
But $P(E_1 \cap E_2) = \frac{n(E_1 \cap E_2)}{n(S)} = \frac{2}{6} = \frac{1}{3}$, and $P(E_2) = \frac{n(E_2)}{n(S)} = \frac{3}{6} = \frac{1}{2}$
Thus $P(E_1 / E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)} = \frac{1/3}{1/2} = \frac{1}{3} \times \frac{2}{1} = \frac{2}{3}$
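The die example can be reproduced by treating events as Python sets and counting outcomes. This is a minimal sketch for equally likely outcomes; the helper names are illustrative.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability n(E) / n(S) for equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

def conditional(event, given):
    """P(event / given) = P(event and given) / P(given); requires P(given) > 0."""
    return prob(event & given) / prob(given)

less_than_4 = {1, 2, 3}
odd = {1, 3, 5}

print(prob(less_than_4))              # 1/2
print(conditional(less_than_4, odd))  # 2/3
```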
Multiplication rule for two dependent events
The probability of simultaneous happening of two events A and B is given by
$$P(A \cap B) = P(A) \cdot P(B / A), \quad P(A) > 0$$
$$P(A \cap B) = P(B) \cdot P(A / B), \quad P(B) > 0$$
In the population, the values of the variable may be distributed according to some definite probability
law which can be expressed mathematically and the corresponding probability distribution is known as
theoretical probability distribution.
Such probability laws may be based on a priori considerations or a posteriori inferences. These
distributions are based on expectations formed on the basis of previous experience. Theoretical distributions
also enable us to fit a mathematical model or a function of the form p(x) to the given data.
It is the result of an experiment or process which has only two possible outcomes.
If we toss a fair coin n times, where n is fixed and finite, the outcome of any trial is one of the
mutually exclusive events head (success) and tail (failure). Furthermore, all the trials are independent,
because the result of any throw of a coin does not affect and is not affected by the result of other throws.
Moreover, the probability of success (head) in any trial is ½ and the probability of failure (tail) in any
trial is also ½, which are constant for each trial.
Binomial Formula
Consider an experiment in which P = probability of success and q = 1 – P is the probability of failure.
We make the following assumptions:
The number of trials (n) is fixed.
The probability of success (P) is the same for each trial.
The trials are independent.
The sum of the probabilities is unity, i.e. P + q = 1.
Under these assumptions, the probability that the event will occur r times in n trials, where n ≥ r, is given by:
$$P(X = r) = {}_nC_r \, P^r q^{n-r}$$
This discrete probability distribution is called the Binomial distribution. X denotes a random variable for
the number of successes in n trials, which can take the values 0, 1, 2, …, n, since in n trials we may
get no success (all failures), one success, two successes, …, or all n successes. We are
interested in finding the corresponding probabilities of 0, 1, 2, …, n successes.
Solution: Define the random variable X as the number of heads obtained. Then:
o P = probability of getting a head in a single toss of a coin = 0.5
o n = the number of trials (of times the experiment is done) = 4
The probability distribution of X is:
d) No head means X = 0. Hence the required probability is $P(X = 0) = {}_4C_0 (0.5)^0 (0.5)^4 = 0.0625$
e) At least one head means X ≥ 1, then the required probability is:
P(X≥ 1) = P(x= 1) + P(X= 2) + P(X = 3) + P( X = 4) = 0.2500 + 0.3750 +0.2500 + 0.0625 = 0.9375
Or
P(X≥1) = P(at least one head) = 1– P (No head ) = 1 – P(X = 0 ) = 1–0.0625 = 0.9375
Example 2: A national advertising agency estimates that only 40% of all new products introduced in a
certain country succeed. Out of 8 new products that were recently introduced, what is the probability that:
a) At most 5 succeed and b) At least 7 succeed.
Solution: Define the random variable X as the number of new products that succeed.
o P = the probability that a new product succeeds in a single release = 40% = 0.4
o n = number of new products released = 8
a) P(X ≤ 5) = 0.9502. This is from the binomial table with n = 8, P = 0.4, and r = 5
Thus we can say that it is highly likely that no more than 5 will succeed.
b) P(X ≥ 7) = 1 – P(X < 7) = 1 – P(X ≤ 6) = 1 – 0.9915 = 0.0085. This is from the binomial table with n = 8,
P = 0.4 and r = 6. Thus, there is almost no chance that 7 or more will succeed out of 8.
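Instead of reading a binomial table, the probabilities in both examples can be computed directly from the binomial formula nCr · P^r · q^(n−r). The sketch below uses illustrative function names; direct computation gives about 0.9502 for P(X ≤ 5) and 0.0085 for P(X ≥ 7) when n = 8 and P = 0.4.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r): probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binomial_cdf(r, n, p):
    """P(X <= r): cumulative probability, as listed in a binomial table."""
    return sum(binomial_pmf(k, n, p) for k in range(r + 1))

# Coin example: no head in 4 tosses, and at least one head.
print(binomial_pmf(0, 4, 0.5))                 # 0.0625
print(1 - binomial_pmf(0, 4, 0.5))             # 0.9375

# New-product example: n = 8, P = 0.4.
print(round(binomial_cdf(5, 8, 0.4), 4))       # 0.9502
print(round(1 - binomial_cdf(6, 8, 0.4), 4))   # 0.0085
```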
4.5.2. The Normal Distribution
In this section, we will examine a very important continuous probability distribution known as the normal
probability distribution. The normal probability distribution has the following characteristics:
The graph of the normal probability distribution has a single peak at the center of the distribution. The
mean, median and mode, which in a normal distribution are equal, are all located at the peak.
Therefore, exactly one-half, or 50%, of the area is to the left of the center of the distribution and exactly
one-half of the area is to the right of it.
A normal probability distribution is symmetrical about its mean. If you were to "fold" the
probability distribution along its central value, the two halves would be identical.
The normal curve falls off smoothly in a "bell shape" and the two tails of the probability distribution
extend indefinitely in either direction. In theory, the curve never actually touches the X-axis.
For example, if a normal probability distribution has a mean of 20 and standard deviation of 4, then
about 68% of the values are between 16 and 24, found by 20 ± 1(4); about 95% of the values are
between 12 and 28, found by 20 ± 2(4); and virtually all the values are between 8 and 32, found by 20 ± 3(4).
There are many normal probability distributions, one for each pair of values of the mean and standard
deviation. This makes the normal probability distribution very versatile in describing many different real-
world situations, but it would be very difficult to provide tables for each such distribution.
An efficient method for overcoming this difficulty is transforming a variable into a standard normal
variable. This method is called standardizing the distribution.
$$Z = \frac{X - \mu}{\sigma}$$
Where: Z = the standardized value, or Z-value
X = any observation of interest
μ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
The value of Z actually follows a normal probability distribution with a mean of zero and standard
deviation of one unit.
This probability distribution is called Standard Normal Probability Distribution. Thus, we can
convert any normal distribution to the standard normal distribution by using the above formula.
Example: The ages of patient admitted to H hospital are normally distributed with a mean of 60 years and
standard deviation of 12 years. Find the Z-value (standardized value) for a patient (a) aged 78? (b) aged 45
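Applying Z = (X − μ)/σ to the two patients is a one-line calculation. The sketch below uses an illustrative function name and the mean and standard deviation stated in the example.

```python
def z_value(x, mean, std_dev):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

print(z_value(78, 60, 12))   # 1.5
print(z_value(45, 60, 12))   # -1.25
```

A positive Z-value means the observation lies above the mean; a negative one means it lies below.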
The area under the Standard Normal Curve is equal to one: the area to the left of Z = 0 is equal to 0.5
and the area to the right of Z = 0 is also equal to 0.5.
The Standard Normal Distribution is symmetric about its mean; that is, the area bounded by Z = –a
and Z = 0 is equal to the area bounded by Z = 0 and Z = a, where a is any real number.
Example: Find the area under the standard normal curve bounded by Z = 0 and:
(a) Z = 0.45 (b) Z = 2.83 (c) Z = –0.60 (d) Z = –1.76
To find the area bounded by Z = 0 and Z = 0.45, look up the value opposite 0.4 and under 0.05 in the
standard normal distribution table. From the table, the area is 0.1736.
If we look up the value opposite 2.8 and under 0.03, we obtain the area bounded by Z = 0 and Z = 2.83.
This value is 0.4977.
The area bounded by Z = –0.60 and Z = 0 is equal to the area bounded by Z = 0 and Z = 0.60 because of
the symmetry of the normal distribution. Thus, we look up the value opposite 0.6 and under 0.00, that is
0.2257.
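The table entries can also be computed rather than looked up. The area between 0 and z under the standard normal curve equals 0.5 · erf(|z| / √2), where erf is the error function from Python's `math` module. The function name below is illustrative.

```python
import math

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z (symmetric in z)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

for z in (0.45, 2.83, -0.60, -1.76):
    print(z, round(area_zero_to_z(z), 4))
# 0.45 0.1736
# 2.83 0.4977
# -0.6 0.2257
# -1.76 0.4608
```

Taking the absolute value of z is what encodes the symmetry argument: the area between –a and 0 equals the area between 0 and a.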
UNIT FIVE
ESTIMATION AND HYPOTHESIS TESTING
Introduction
Statistical inference is based on estimation and hypothesis testing. In both estimation and hypothesis
testing, we shall be making inferences about characteristics of populations from information contained in
samples. To calculate the exact proportion or the exact mean would be an impossible goal. Even so, we
will be able to make an estimate, make a statement about the error that will probably accompany this
estimate, and implement some controls to avoid as much of the error as possible.
2) Interval Estimate:
It is a range of values to estimate population parameters. It indicates the error in two ways: by the
extent of its range and by the probability of the true population parameter lying within the range.
In this case, the department head would say something like; “I estimate that the true enrollment in this
course in the fall will be between 330 and 380 and that it is very likely that the exact enrollment will
fall within this interval.” She has a better idea of the reliability of her estimate.
If the course is taught in sections of about 100 students each, and if she had tentatively scheduled five
sections, then on the basis of her estimate, she can now cancel one of those sections and offer an
elective instead.
A) Point Estimates
When a parameter is being estimated, the estimate can be either a single number, in which case the
estimate is called a "point estimate", or it can be a range of scores, in which case the estimate is called an
"interval estimate". Confidence intervals are used for interval estimates.
Point estimates are used as parts of other statistical calculations. For example, a point estimate of the
standard deviation is used in the calculation of a confidence interval for μ. Point estimates of
parameters are often used in the formulas for significance testing.
1) The sample mean x̄ is the best estimator of the population mean µ. It is unbiased, consistent, the most
efficient estimator, and, as long as the sample is sufficiently large, its sampling distribution can be
approximated by the normal distribution.
Let us look at a medical supplies company that produces disposable hypodermic syringes. Each syringe is
wrapped in a sterile package and then jumble packed in a large corrugated carton. Jumble packing causes
the cartons to contain differing numbers of syringes. Because the syringes are sold on a per unit basis, the
company needs an estimate of the number of syringes per carton for billing purposes. We take
a sample of 35 cartons at random and record the number of syringes in each carton. The sample mean is:
$\bar{x} = \frac{\sum x}{n} = \frac{3570}{35} = 102$ syringes
Thus, using the sample mean as our estimator, the point estimate of the population mean µ is 102 syringes
per carton. The manufacturing price of a disposable hypodermic syringe is quite small (about 25 ), so both
the buyer and the seller would accept the use of this point estimate as the basis for billing, and the
manufacturer can save the time and expense of counting each syringe that goes into a carton.
Table 1: Results of a sample of 35 cartons of hypodermic syringes (syringes per carton)
101  103  112   98   97   93
105  100  100   93   94   97
 97  100   97  110  103   99
 93   98  106  112  105  100
114   97  110   98  112   99
Table 2: Calculation of sample variance and standard deviation for syringes per carton
x (needles per carton)    x²       Sample mean x̄    x - x̄    (x - x̄)²
101                       10201    102              -1        1
105                       11025    102               3        9
 97                        9409    102              -5       25
 93                        8649    102              -9       81
114                       12996    102              12      144
103                       10609    102               1        1
100                       10000    102              -2        4
100                       10000    102              -2        4
 98                        9604    102              -4       16
 97                        9409    102              -5       25
112                       12544    102              10      100
110                       12100    102               8       64
 97                        9409    102              -5       25
106                       11236    102               4       16
110                       12100    102               8       64
 98                        9604    102              -4       16
 93                        8649    102              -9       81
110                       12100    102               8       64
112                       12544    102              10      100
 98                        9604    102              -4       16
 97                        9409    102              -5       25
 94                        8836    102              -8       64
103                       10609    102               1        1
105                       11025    102               3        9
112                       12544    102              10      100
 93                        8649    102              -9       81
 97                        9409    102              -5       25
 99                        9801    102              -3        9
100                       10000    102              -2        4
 99                        9801    102              -3        9
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = 36.12$
$s^2 = \frac{\sum x^2}{n - 1} - \frac{n\bar{x}^2}{n - 1} = 36.12$
B) Interval Estimates
An interval estimate describes a range of values within which a population parameter is likely to lie.
Suppose that the marketing research director needs an estimate of the average life, in months, of the car batteries
his company manufactures. We select a random sample of 200 batteries, record the car owners' names and
addresses as listed in store records, and interview them about the battery life they have experienced.
Our sample of 200 users has a mean battery life of 36 months. If we use the point estimate of the
sample mean x̄ as the best estimator of the population mean μ, we would report that the mean life
of the company's batteries is 36 months.
But the director also asks for a statement about the uncertainty that is likely to accompany this
estimate, that is, a statement about the range within which the unknown population mean is likely to
lie. To provide such a statement we need to find the standard error of the mean.
Standard error of the mean: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
where σ = standard deviation of the population and n = sample size
Probability of the true population parameter falling within the interval estimate:
To begin to solve this problem, we should review the relevant properties of the normal probability distribution.
Fortunately, we can apply these properties to the standard error of the mean and make the following statement
about the range of values used to make an interval estimate for our battery problem.
The probability is 0.955 that the mean of a sample of size 200 will be within 2 standard errors of the
population mean μ, and hence μ is within 2 standard errors of 95.5 percent of all the sample means. Theoretically, if we
select 1000 samples at random from a given population and then construct an interval of 2 standard
errors around the mean of each of these samples, about 955 of these intervals will include the population
mean. Similarly, the probability is 0.683 that the mean of the sample will be within 1 standard error of the
population mean, and so forth. This theoretical concept is basic to our study of interval construction and
statistical inference.
Now we can report to the director that our best estimate of the life of the company's batteries is 36
months, and we are 68.3 percent confident that the life lies in the interval from 35.293 to 36.707
months ($36 \pm 1\sigma_{\bar{x}}$).
Similarly, we are 95.5 percent confident that the life falls within the interval of 34.586 to 37.414
months ($36 \pm 2\sigma_{\bar{x}}$), and we are 99.7 percent confident that battery life falls within the interval of
33.879 to 38.121 months ($36 \pm 3\sigma_{\bar{x}}$).
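The three intervals above can be reproduced with a short sketch. The standard error of 0.707 months is an assumption read off the quoted intervals (half the width of the 68.3 percent interval), and the function name is illustrative only.

```python
# Sketch: interval estimates k standard errors around a sample mean.
# The standard error (0.707 months) is inferred from the intervals in the text.

def interval_estimate(sample_mean, standard_error, k):
    """Return (lower, upper) limits k standard errors around the sample mean."""
    margin = k * standard_error
    return sample_mean - margin, sample_mean + margin

sample_mean = 36.0      # mean battery life in months (from the text)
standard_error = 0.707  # assumed: implied by the 68.3 percent interval

# Confidence levels for 1, 2 and 3 standard errors, as quoted in the text
for k, confidence in [(1, 68.3), (2, 95.5), (3, 99.7)]:
    lower, upper = interval_estimate(sample_mean, standard_error, k)
    print(f"{confidence}% confident: {lower:.3f} to {upper:.3f} months")
```

Widening the interval from 1 to 3 standard errors raises the confidence but makes the statement less precise.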
Example: For a population with a known variance of 185, a sample of 64 individuals leads to 217
as an estimate of the mean.
A) Find the standard error of the mean
B) Establish an interval estimate that should include the population mean 68.3 percent of the time.
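A worked sketch of this example, using only the figures given above (variance 185, n = 64, sample mean 217) and the rule that 68.3 percent corresponds to one standard error on either side of the mean:

```latex
% A) Standard error of the mean
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{185}}{\sqrt{64}} \approx \frac{13.60}{8} \approx 1.70

% B) Interval that should include the population mean 68.3 percent of the time
\bar{x} \pm 1\,\sigma_{\bar{x}} = 217 \pm 1.70 \quad\Rightarrow\quad 215.30 \text{ to } 218.70
```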
Suitable Unsuitable
H0: The machine packs more than 500 nails on average into each box.
H1: The machine packs 500 nails or fewer on average into each box.
H0: The machine packs 500 nails or fewer on average into each box.
H1: The machine packs 501 nails or more on average into each box.
H0: The machine packs fewer than 500 nails on average into each box.
H1: The machine packs more than 500 nails on average into each box.
B) Significance Level
When we test the hypotheses, we can never be 100% certain of our conclusions. We can only be confident
to a certain level - hopefully a high one. Typically we construct our test so that we will be 95% certain that
the conclusion we draw is a correct one. This is called a 95% confidence level, or a 5% significance level.
Other figures which are quite common are the 99% confidence level (1% significance level) or 90%
confidence level (10% significance level). In each case, the percentage indicates how confident we are
that our conclusion is correct.
The higher the confidence level (99% is higher than 95%), the more certain we are, but the less
likely it is that our test data will pass the test!
C) Sampling
The art of sampling means taking a small number of a population of items and testing them, and then
drawing a conclusion about the population as a whole.
For instance, if you wanted to estimate how many hours of television people in Ethiopia watched on
average, you couldn't possibly ask them all, so you would ask a sample of people & then draw a
conclusion based on what they said. Clearly the larger the sample, the more representative the results.
Let's illustrate this with an example. Suppose we record the number of hours of television watched by each of 20 people by spying on them through their letter boxes:
4 7 1 0 1 2 5 2 4 3
1 6 4 1 6 2 3 2 3 0
The mean value of all those figures is found, as you would expect, by adding them all together and
dividing by 20. It comes to 2.85.
Would the accuracy be improved if we choose 5 items out of the 20 for each sample rather than 3
items? Yes, as 5 items is a larger percentage of the population (25%) than 3 items (15%).
In general, if we chose a larger sample size (6 numbers, 7 numbers, 8 numbers per sample) then the
sample means would get closer to the true mean and their standard deviation would go down.
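The effect of sample size can be checked directly on the 20 values listed above. The sketch below treats those 20 values as the whole population and enumerates every possible sample of a given size; the function name and the choice of sizes 3, 5 and 8 are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev

# The 20 observations listed in the text, treated here as the whole population
hours = [4, 7, 1, 0, 1, 2, 5, 2, 4, 3,
         1, 6, 4, 1, 6, 2, 3, 2, 3, 0]

def spread_of_sample_means(data, sample_size):
    """Standard deviation of the means of every possible sample of this size."""
    sample_means = [sum(sample) / sample_size
                    for sample in combinations(data, sample_size)]
    return pstdev(sample_means)

print(f"Mean of all 20 values: {mean(hours):.2f}")
for size in (3, 5, 8):
    spread = spread_of_sample_means(hours, size)
    print(f"Sample size {size}: spread of sample means = {spread:.3f}")
```

The spread of the sample means shrinks as the sample size grows, which is the behaviour the standard error formalises.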
D) Standard Error
When carrying out hypothesis testing on samples we use a measure called the Standard Error. This is
based on the standard deviation of a population, but takes into account the size of the sample
which we draw from the population and on which we base any conclusions.
To get the standard error (S.E.) we divide the standard deviation s by the square root of the number of
items in the sample, n:
Standard Error (S.E.) = s / √n
E) Critical Region
The sort of hypotheses that we are going to test will involve comparing the mean of a sample of items
against a true mean for a population. This true mean applies to a whole population (too many to
count), although it may be only a claim (i.e. someone may tell us what the mean of the population is,
and we may want to test it).
Either way, the symbol that we use for the true mean is μ (the Greek letter "mu", equivalent to our
letter "m" - "m" for mean) and the mean of the sample of items will be called x̄. We define a
critical region around the true mean, and then we see if the sample mean lies within that region.
Firstly, decide on a significance level. We normally choose a 5% significance level (a 95% confidence
level), which means that we will be 95% certain of drawing the correct conclusion, although there will be
a 5% chance that we will have made the wrong decision (even if we do the mathematics correctly).
Look at the hypotheses carefully. Do they imply that something will be different to the mean value,
or do they imply that it will be higher or lower?
OSU July, 2018 Page 67
Business Statistics Handout
If the crucial word is "different" (or a word that means the same thing) then we call the test a "two-tailed
test", i.e. any item which is substantially different from the mean in either direction counts as
"different".
However, if the hypotheses use words like "taller", "longer", "better" (or "shorter", "worse", "less
efficient" for that matter) then it is a "one tailed test". For instance, if we want to know whether the
machine packs significantly more than 500 nails into each box or not, then a box containing 497
certainly wouldn't provide any evidence to support the hypothesis!
If the test is a two-tailed test, then the critical region has an upper limit and a lower limit, with the
true mean exactly in the middle. The distance from the mean to each limit is the standard error (not
the standard deviation in this case) multiplied by a certain number which will depend on what
significance level we are using.
In the case of a 5% significance level (95% confidence level), the critical number is 1.96. This is the
same as for the 95% confidence interval that is part of the theory of the Normal Distribution, although
it is wrong to think of the critical region as a 95% confidence interval.
How could it contain 95% of the items in the population when it is based on the standard error, which
in turn depends on the size of our sample? If we altered the number of items in the sample, then the
size of the critical region would also change!
For instance, if the mean were 100 and the standard error was 8, then we would multiply 8 by
1.96 (to give 15.68). The lower limit of the critical region would then be 100 - 15.68 = 84.32,
and the upper limit would be 100 + 15.68 = 115.68.
It's a different matter if the test is a one tailed test. In this case, the critical region only has one limit:
If the test is a right-tailed test (we are testing whether the sample mean is significantly higher,
better, heavier etc.) then there is no lower limit, and the upper limit is the true mean plus the
standard error multiplied by a special number (1.64 for 5% significance level).
If the test is a left-tailed test (we are testing whether the sample mean is significantly lower,
worse, lighter etc.) then there is no upper limit, and the lower limit is the mean minus the
standard error multiplied by the same special number.
The critical regions for a one-tailed test (both right-tailed and left-tailed versions) at a 95% confidence
level are:
Right-tailed test: critical region = up to μ + 1.64 S.E.
Left-tailed test: critical region = μ - 1.64 S.E. upwards
Therefore, the critical region marks the range of values in which we can be fairly certain that the true
mean, from which our sample was taken, lies.
For instance, if we have calculated that the critical region at a 95% confidence level is between 10
and 20, then we can be 95% confident that the true mean lies within that region.
Similarly, if the critical region is one-tailed at the 1% significance level, with a lower limit at 25
and no upper limit, then we can be 99% confident that the true mean is greater than 25.
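The limits described above can be sketched as a small helper. The function name and its parameters are illustrative assumptions; the critical numbers 1.96 and 1.64 are the ones quoted in the text for a 5% significance level.

```python
import math

def critical_region(centre, standard_error, critical_number, tail="two"):
    """Return the (lower, upper) limits of a critical region.

    tail is "two", "right" or "left"; an open side is reported as infinity.
    """
    margin = critical_number * standard_error
    if tail == "two":
        return centre - margin, centre + margin
    if tail == "right":
        return -math.inf, centre + margin
    if tail == "left":
        return centre - margin, math.inf
    raise ValueError("tail must be 'two', 'right' or 'left'")

# Two-tailed example from the text: mean 100, standard error 8, 5% level
lower, upper = critical_region(100, 8, 1.96)
print(f"Two-tailed: {lower:.2f} to {upper:.2f}")  # Two-tailed: 84.32 to 115.68

# One-tailed versions at the 5% level use 1.64 instead of 1.96
print(critical_region(100, 8, 1.64, tail="right"))
print(critical_region(100, 8, 1.64, tail="left"))
```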
Solution
1. What are the hypotheses?
The company claims that the mean length of the cotton is 250m. We think it may be something different
from that. The two hypotheses should therefore be:
H0: "The mean length of the cotton is 250m per reel."
H1: "The mean length of the cotton is something other than 250m per reel."
2. Decide what type of test it is, and what significance level is required
Clearly we want to see whether the mean length is 250m or different from 250m, so it is a two-tailed test.
We follow convention and choose a 5% significance level (95% confidence level).
3. Calculate the standard error.
This is fairly straightforward: simply divide the standard deviation by the square root of the number
of items in the sample, i.e.
Standard error = 14 / √30 = 14 / 5.48 = 2.56
4. Calculate the critical region.
Since the test is a two-tailed test, we will need a symmetrical critical region:
Lower limit = 243 - 1.96 x 2.56 = 237.98 meters
Upper limit = 243 + 1.96 x 2.56 = 248.02 meters
5. Compare the company's claim for the true mean (μ) with the critical region.
If it is inside the critical region, then accept H0 and reject H1.
If it is outside the critical region, accept H1 and reject H0.
In this case, the claimed true mean is 250m, which is outside the critical region (just!) This means that we
can be 95% certain that H0 is wrong, and that H1 is correct. We accept H1 at the 5% significance level.
This means that we can conclude that the company's claim (250 meters on average on every reel) is
probably wrong, although there probably isn't enough evidence to go to court!
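The five steps of this solution can be sketched in a few lines. The inputs (sample mean 243, standard deviation 14, sample size 30, claimed mean 250) are taken from the steps above; the variable names are illustrative.

```python
import math

sample_mean = 243.0     # metres, the centre of the critical region in step 4
sample_sd = 14.0        # standard deviation used in step 3
n = 30                  # number of reels in the sample
claimed_mean = 250.0    # the company's claim (H0)
critical_number = 1.96  # two-tailed test at the 5% significance level

# Step 3: standard error
standard_error = sample_sd / math.sqrt(n)

# Step 4: symmetrical critical region around the sample mean
lower = sample_mean - critical_number * standard_error
upper = sample_mean + critical_number * standard_error

# Step 5: is the claimed mean inside the critical region?
claim_is_compatible = lower <= claimed_mean <= upper

print(f"Standard error: {standard_error:.2f}")
print(f"Critical region: {lower:.2f} to {upper:.2f} metres")
print("Accept H0" if claim_is_compatible else "Reject H0, accept H1")
```

Because the sketch does not round the standard error to 2.56 before multiplying, its limits differ from the hand calculation by about 0.01 metres; the conclusion is the same.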
Illustration 2:
A company has a machine that manufactures light bulbs with a mean lifetime of 5000 hours and a
standard deviation of 160 hours. The company is considering buying a new machine which promises to
make light bulbs which last significantly longer than those produced by the old machine. A sample of
200 bulbs from the new machine is tested and found to have a mean lifetime of 5020 hours. Does the
new machine produce longer-lasting bulbs?
Here we are not being asked to test whether the true mean has a certain value - we are told the value of
the true mean (5000 hours) and just have to accept that. Instead, we are being asked whether the
sample is compatible with that true mean, or whether it is substantially larger.
However, the same method can be applied. In both this and the previous question, we are asked
whether the sample is compatible with the true mean. In the previous question, it was the "true" mean
What does this critical region mean exactly? Well, we can be 99% certain that the mean of the lifetimes of
the bulbs produced by the new machine will lie within this region. There is only a 1% chance that the
mean will be less than 4993.65.
If the established mean lies within this region, then it is compatible with the mean of the bulbs from
the new machine - i.e. the mean lifetimes produced by both machines could well be the same (no
significant difference)
5. Compare the established value of the true mean with the critical region.
5000 is not smaller than 4993.65, so the established mean is within the critical region. This means that
we can accept H0 and reject H1.
The mean lifetime of the bulbs in the sample is not significantly higher than 5000 at the 1%
significance level, and so the machine does not produce significantly longer-lasting bulbs.
Our advice to the company would be to stick with the machine that they've already got!
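A sketch of this one-tailed test follows. The intermediate steps are not shown in the text, so the critical number is an assumption: it is taken from the standard normal distribution for a one-tailed 1% test (about 2.33), and the variable names are illustrative.

```python
import math
from statistics import NormalDist

established_mean = 5000.0  # hours, old machine (from the text)
population_sd = 160.0      # hours (from the text)
n = 200                    # bulbs sampled from the new machine
sample_mean = 5020.0       # hours (from the text)
significance = 0.01        # 1% significance level, one-tailed

standard_error = population_sd / math.sqrt(n)

# Assumed critical number for a one-tailed 1% test (about 2.33)
critical_number = NormalDist().inv_cdf(1 - significance)

# Left-open region: only a lower limit, no upper limit
lower_limit = sample_mean - critical_number * standard_error

# H0 is accepted if the established mean is not below the lower limit
accept_h0 = established_mean >= lower_limit

print(f"Standard error: {standard_error:.2f} hours")
print(f"Lower limit of critical region: {lower_limit:.2f} hours")
print("Accept H0" if accept_h0 else "Reject H0, accept H1")
```

The lower limit comes out within a few hundredths of the 4993.65 quoted above; the small gap comes from rounding the critical number and the standard error by hand.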
If you have a true population mean μ, and a true standard deviation σ, then the 95% confidence
interval is calculated as follows: μ ± 1.96σ. This region will contain 95% of all the items in the
population.
If you have a sample of n items with a mean of x̄ & standard deviation of s, then the 95% critical region is calculated as: x̄ ± 1.96 S.E., where
the standard error, S.E., is s / √n.
In this case, we can be 95% certain that the true mean of the population from which this sample
was taken lies within this region.
The same is true for other degrees of certainty (e.g. 99% or 90%) except that the critical number is not
1.96 (it is 2.58 for 99%, 1.64 for 90% etc.) Note that these numbers change again when you are
considering a one-tailed test instead of a two-tailed test.
UNIT SIX
SIMPLE CORRELATION AND REGRESSION
6.1. Simple Correlation
Suppose we have two variables X = (X1, X2, X3, …, Xn) and Y = (Y1, Y2, Y3, …, Yn). When higher values
of X are associated with higher values of Y and lower values of X are associated with lower values of
Y, then the correlation is said to be positive or direct.
Example:
- Income and expenditure
- Height and weight
- Number of hours spent in studying and score obtained
- Distance covered and fuel consumed by car
When higher values of X are associated with lower values of Y and lower values of X are associated
with higher values of Y, then the correlation is said to be negative or inverse.
Example:
- Demand and supply
- Income and proportion of income spent on food
The correlation between X and Y may be one of the following:
- Perfect positive (correlation coefficient r = 1)
- Positive (r between 0 and 1)
- No correlation (r = 0)
- Negative (r between -1 and 0)
- Perfect negative (r = -1)
The presence of correlation between two variables may be due to three reasons:
1) One variable being the cause of the other. The cause is called subject or independent variable,
while the effect is called dependent variable.
2) Both variables being the result of a common cause. That is the correlation that exists between
two variables is due to their being related to some third force.
Example: Let X1 = ESLCE result
Y1 = the rate of surviving in the university
Y2 = the rate of getting a scholarship
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise, Y1 & Y2 have positive correlation,
but they are not directly related; they are related to each other via X1.
3) Chance:
The correlation that arises by chance is called spurious correlation.
Example: - Price of teff in Addis Ababa and grade of students in USA
- Weight of individuals in Addis Ababa and income of individuals in Harar.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any likelihood of
any relationship existing between variables under study.
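The correlation coefficient between X and Y, denoted by 'r', is given by:
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - n\bar{X}^2)(\sum Y^2 - n\bar{Y}^2)}}$
or, equivalently,
$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$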
Example:
Calculate the simple correlation between mid-semester and final exam scores of 10 students (both out of 50)
Student 1 2 3 4 5 6 7 8 9 10
Mid exam(X) 31 23 41 32 29 33 28 31 31 33
Final exam(Y) 31 29 34 35 25 35 33 42 31 34
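Solution: n = 10, $\bar{X} = 31.2$, $\bar{Y} = 32.9$, $\bar{X}^2 = 973.4$, $\bar{Y}^2 = 1082.4$,
$\sum XY = 10331$, $\sum X^2 = 9920$, $\sum Y^2 = 11003$
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - n\bar{X}^2)(\sum Y^2 - n\bar{Y}^2)}} = \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))(11003 - 10(1082.4))}} = 0.363$
This means mid-semester exam and final exam scores have a weak positive correlation.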
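The calculation above can be checked with a short sketch that applies the sums-based form of the formula to the exam scores in the table. The function name is illustrative.

```python
import math

# Exam scores from the table (both out of 50)
mid_exam = [31, 23, 41, 32, 29, 33, 28, 31, 31, 33]
final_exam = [31, 29, 34, 35, 25, 35, 33, 42, 31, 34]

def pearson_r(x, y):
    """Correlation coefficient from the raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(mid_exam, final_exam)
print(f"r = {r:.3f}")  # r = 0.363
```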
Exercise 1: A researcher who is concerned about the consumption rate of individuals took a sample of 10
individuals & observed their consumption & income (both in tens of Birr) for one month as shown below.
Individual Income (x) Consumption (y)
1 15 15
2 35 30
3 42 30
4 60 50
5 72 48
6 128 100
7 98 93
8 35 33
9 15 14
10 50 50
(a) Compute the coefficient of correlation and interpret.
(b) Find the least squares line of consumption on income.
(c) Estimate the consumption of an individual whose income is 200 Birr
The above formula and procedure are applicable only to quantitative data. When we have qualitative
data like efficiency, honesty, intelligence, etc., we calculate what is called Spearman's rank
correlation coefficient as follows:
Step i) Rank the different items in X and Y
Step ii) Find the difference of the ranks in each pair and denote it by D_i
Step iii) Compute the coefficient:
$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$
Where r_s = coefficient of rank correlation
D = the difference between paired ranks
n = the number of pairs
Example: Aster and Chaltu were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipsticks A B C D E F G
Aster 2 1 4 3 5 7 6
Chaltu 1 3 2 4 5 6 7
Solution
Lipstick       A   B   C   D   E   F   G   Total
RX             2   1   4   3   5   7   6
RY             1   3   2   4   5   6   7
D = RX - RY    1  -2   2  -1   0   1  -1
D²             1   4   4   1   0   1   1   12
$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$
Yes, there is a positive correlation.
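The rank correlation above can be reproduced with a minimal sketch. It assumes the inputs are already ranks with no ties, as in this example; the function name is illustrative.

```python
# Ranks given by the two ladies to lipsticks A to G (from the table)
aster = [2, 1, 4, 3, 5, 7, 6]
chaltu = [1, 3, 2, 4, 5, 6, 7]

def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's coefficient for two lists of untied ranks."""
    n = len(rank_x)
    # Sum of squared differences between paired ranks
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

r_s = spearman_rank_correlation(aster, chaltu)
print(f"r_s = {r_s:.3f}")  # r_s = 0.786
```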
This illustration involves a complex relationship between one unknown variable (the selling price of a
house) and a collection of other variables. Here, what we need is finding a suitable mathematical
relationship between the one unknown variable, called the dependent variable, and the group, of other
known quantities, called independent variables. One methodology for handling this type of problem is
called Regression Analysis.
In regression, we can have only one dependent variable. But, we can have more than one independent
variable. The situation where we have only one independent variable is called simple regression. If
In regression analysis, we shall develop an estimating equation –i.e., a mathematical formula that
relates dependent variable to the independent variable. Then, after we have learned the pattern of this
relationship, we can apply correlation analysis to determine the strength of that relationship, i.e. how
well the estimating equation actually describes the relationship.
The resulting set of points on the above XY – plane is called scatter diagram. A scatter diagram can
give us two types of information. Visually, we can look for patterns which indicate that the variables
are related. Then, if the variables are related, we can see what kind of line, or estimating equation,
describes this relationship.
The relationship between two variables can be direct or inverse. If the dependent variable increases
as the independent variable increases, then the relationship is said to be direct/positive.
For instance, we expect the sales of a company to increase as the advertising budget increases.
Hence, the relationship b/n these two variables (sales & advertising expense) is expected to be direct.
Relationships can also be inverse/negative. In such cases the dependent variable decreases as the
independent variable increases.
If the relationship between two variables X & Y is linear, we express this as:
Y = α + βX
Here, Y represents the individual values of actual observed points. But as can be seen from Figure 1, all
points do not lie on the fitted line. So, we should begin to use Ŷ to symbolize the individual values of
the estimated points, i.e. those points that lie on the estimating line.
Accordingly, we shall write the equation for the estimating line as:
Ŷ = a + bX
The estimating line will have a good fit if it minimizes the error between the estimated points on the
line and the actual observed points that were used to draw it.
The best fitting line is the line for which the sum of squared errors (SSE) is the minimum. By applying differential calculus
to the SSE, the slope of the best fitting line becomes:
$b = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$ ………………………..... (1)
And the Y-intercept becomes:
$a = \bar{Y} - b\bar{X}$ …………………………….(2)
Example: The following table shows the number of items produced (X) & the cost incurred in producing
them (Y) (in Birr).
Number of items produced (X)    4    5    6    8    9
Cost (Y)                       15   18   18   20   22
(a) Find the equation of the least squares line treating cost as the dependent variable.
(b) Identify the slope and the Y-intercept and interpret them
(c) Estimate the cost of producing 7 items
X     Y     XY     X²
4     15    60     16
5     18    90     25
6     18    108    36
8     20    160    64
9     22    198    81
ΣX = 32   ΣY = 93   ΣXY = 616   ΣX² = 222
a) Since we have 5 pairs of observations, n = 5. The slope b is computed as:
$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = \frac{5(616) - (32)(93)}{5(222) - (32)^2} = \frac{104}{86} = 1.21$
To compute the Y-intercept a, first we need to find the average values of X and Y.
$\bar{X} = \frac{\sum X}{n} = \frac{32}{5} = 6.4$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{93}{5} = 18.6$
Thus, the Y-intercept is computed as: $a = \bar{Y} - b\bar{X} = 18.6 - 1.21(6.4) = 10.86$
Therefore, the equation of the least squares line is:
Ŷ = a + bX, that is, Ŷ = 10.86 + 1.21X
(b) The Y-intercept is a = 10.86. It can be obtained by substituting X = 0 in the equation. This value
tells us that, even if no item is produced, there will be a fixed cost of 10.86 Birr (such as insurance
cost, maintenance cost etc). The slope is b = 1.21. This figure indicates that for a unit change in the
number of items produced, the cost changes by 1.21 Birr. It is the marginal cost.
(c) The cost of producing 7 items is estimated as: Ŷ = 10.86 + 1.21(7) = 19.33 Birr
One of the mathematical properties of a line fitted by the method of least squares is that the individual
positive and negative errors add up to zero. For the above problem this property is displayed below.
X    Y     Ŷ = 10.856 + 1.21X              e = Y - Ŷ
4    15    10.856 + 1.21(4) = 15.696      -0.696
5    18    10.856 + 1.21(5) = 16.906       1.094
6    18    10.856 + 1.21(6) = 18.116      -0.116
8    20    10.856 + 1.21(8) = 20.536      -0.536
9    22    10.856 + 1.21(9) = 21.746       0.254
Σe = Σ(Y - Ŷ) = 0
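The slope, intercept, prediction and zero-sum property of the errors can all be checked with a short sketch using the data from the example. The function name is illustrative; the sketch keeps the slope unrounded, so intermediate values differ slightly from the hand calculation, which rounds b to 1.21.

```python
# Data from the example: items produced (X) and cost in Birr (Y)
items = [4, 5, 6, 8, 9]
cost = [15, 18, 18, 20, 22]

def least_squares_line(x, y):
    """Return (intercept a, slope b) of the least squares line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

a, b = least_squares_line(items, cost)
print(f"Estimated line: Y = {a:.2f} + {b:.2f}X")
print(f"Estimated cost of 7 items: {a + b * 7:.2f} Birr")

# The positive and negative errors cancel out for a least squares line
residuals = [y - (a + b * x) for x, y in zip(items, cost)]
print(f"Sum of errors: {sum(residuals):.6f}")
```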
Coefficient of Determination
Another measure of the goodness-of-fit of the regression line is the coefficient of determination, which is
the square of the correlation coefficient. It lies between 0 and 1, inclusive.
Coefficient of Determination = r²
An r² close to 1 indicates a strong correlation between X and Y, while an r² close to 0 means there is
little correlation between these two variables.
The total variation in the dependent variable (Y) can be divided into two: Explained variation and
unexplained variation.
Explained variation is the change in the dependent variable (Y) explained by changes in the
independent variable (X). The proportion of explained variation is:
r² × 100%
Unexplained variation is the variation in the dependent variable (Y) due to chance, excluded
variables, etc. The proportion of unexplained variation is:
(1 - r²) × 100%
Example: The following data show the monthly amount of money spent on advertising (X) (in thousands
of Birr) and the number of passengers (Y) of a certain airline in five randomly selected months.
Advertising expense (X) 10 12 8 17 10
Number of passengers (Y) 15 17 13 23 17
(a) Compute the coefficient of determination
(b) Find the proportion of explained variation and interpret.
(c) Find the proportion of unexplained variation and interpret.
Hence, the coefficient of determination is: r² = (0.9725)² = 0.9458. This figure indicates that there is
a strong correlation between advertising expense and number of passengers.
(b) The proportion of explained variation is: r² × 100% = 0.9458 × 100% = 94.58%. Thus, we can
conclude that 94.58% of the change in the number of passengers is explained by changes in the
amount of money spent on advertising.
(c) The proportion of unexplained variation is: (1 - r²) × 100% = (1 - 0.9458) × 100% = (0.0542)
× 100% = 5.42%. Thus, 5.42% of the change in the number of passengers is explained by
variables other than advertising expense (such as ticket price, plane safety, etc.).
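The figures in this solution can be verified with a short sketch that computes r from the advertising data and then squares it. The function name is illustrative.

```python
import math

# Data from the example
advertising = [10, 12, 8, 17, 10]  # thousands of Birr (X)
passengers = [15, 17, 13, 23, 17]  # number of passengers (Y)

def correlation_coefficient(x, y):
    """Correlation coefficient r from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation_coefficient(advertising, passengers)
r_squared = r ** 2

print(f"r = {r:.4f}, r squared = {r_squared:.4f}")
print(f"Explained variation:   {r_squared * 100:.2f}%")
print(f"Unexplained variation: {(1 - r_squared) * 100:.2f}%")
```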
(The entries are the probabilities that a random variable having the
standard normal distribution will take on a value between 0 and z.)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
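Entries of this table can be reproduced, or looked up for values of z not listed, with Python's standard library. This is a minimal sketch; the function name is illustrative and is not part of the handout.

```python
from statistics import NormalDist

def area_between_zero_and(z):
    """Area under the standard normal curve between 0 and z.

    By symmetry, a negative z gives the same area as its positive counterpart.
    """
    return NormalDist().cdf(abs(z)) - 0.5

# Reproduce a few table entries (row value plus column value gives z)
for z in (0.60, 1.00, 1.96, 3.00):
    print(f"z = {z:.2f}: area = {area_between_zero_and(z):.4f}")

# Symmetry: the area between -0.60 and 0 equals the area between 0 and 0.60
print(f"z = -0.60: area = {area_between_zero_and(-0.60):.4f}")
```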