Basic One

Introduction to Statistics
College of Computing And Informatics

Department of Statistics
COURSE TITLE
INTRODUCTION TO STATISTICS
BY:-
Million Wesenu
1 / 83
CHAPTER ONE
Definintion of Statistics
The term statistics is derived from the Latin word status, meaning
state, and historically statistics referred to the display of facts and
figures relating to the demography of states or countries. Generally, it
can be defined in two senses: plural (as statistical data) and singular
(as statistical methods).
Plural sense: Statistics are collection of facts (figures). This
meaning of the word is widely used when reference is made to facts
and figures on sales, employment or unemployment, accident,
weather, death, education, e.t.c
1 / 83
Cont’d
Examples Sales Statistics, Employment Statistics,labor

statistics,e.t.c.
in this sense the word Statistics serves simply as data.
But not all numerical data are statistics. In order for the
numerical data to be identified as statistics, it must possess
certain identifiable characteristics.
Some of these characteristics are described as follows:
2 / 83
a. Statistics are aggregate of facts

b. Statistics, generally, are not the outcome of a single cause
but affected by multiple causes.
c. Statistics are numerically expressed.
d. Statistical data are collected in a systematic manner for
predetermined purpose.
e. Statistics are enumerated or estimated according to
reasonable standard of accuracy.
3 / 83
Singular sense of Statistics
Singular Sense:Statistics is the science that deals with the methods

of data collection, organization, presentation, analysis and
interpretation of data.
It refers the subject area that is concerned with extracting relevant
information from available data with the aim to make sound
decisions.
According to this meaning, statistics is concerned with the
development and application of methods and techniques for
collecting, organizing, presenting, analyzing and interpreting
statistical data.
4 / 83
Classification of Statistics
Based on the scope of the decision, statistics can be classified into
two; Descriptive and Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize
and summarize masses of data. It is concerned with describing or
summarizing the most important features of the data.
It deals only the characteristics of the collected data without
going beyond it.
That is, this part deals with only describing the data collected
without going any further: that is without attempting to
infer(conclude) anything that goes beyond the data themselves.
Examples of Descriptive statistics
a. The average age of the students in this class is 21 years. b.
Of the students enrolled in HU in this year 74% are male and
26% are female.
5 / 83
Classiffication..............Con’d
Inferential Statistics is the methods used to find out something
about a population, based on the sample.
It is concerned with drawing statistically valid conclusions about
the characteristics of the population based on information
obtained from sample.
In this form of statistical analysis, inferential statistics is linked
with probability theory in order to generalize the results of the
sample to the population.
Performing hypothesis testing, determining relationships between
variables and making predictions are also inferential statistics.
Examples of inferential statistics
At least 5% of the killings reported last year in city X were due
to tourists.
The chance of winning the Ethiopian National Lottery in any
6 / 83
Applications of Statistics
In this modern time, statistical information plays a very important

role in a wide range of fields. Today, statistics is applied in almost all
fields of human endeavor.
In Scientific Research: Statistics is used as a tool in a
scientific research. Statistical formulas and concepts are applied
on a data which are results of an experiment.
In Quality Control: Statistical methods help to check whether
a product satisfies a given standard.(especially, in industry)
For Decision Making: Statistics helps to enhance the power
of decision making in the face of uncertainty by providing
sufficient information.
7 / 83
Application————Con’d
In Agriculture: Experiments are designed and analyzed using

statistical procedures.
In Public Health and Medicine: Statistical methods are used
for computation and interpretation of birth and death rates.
In Economics: for modeling functional relationships between or
among variables.
In Natural and Social Sciences, Business, Planning, Behavior
Sciences, e.t.c.
8 / 83
Uses of Statistics
Condenses and summarizes masses of data and presents facts in

numerical and definite form.
Facilitates comparison: statistical devises such as averages,
percentages, ratios, e.t.c. are used for this purpose.
Formulating and testing hypothesis: For instance, hypothesis like
whether a new medicine is effective in curing a disease, whether
there is an association between variables can be tested using
statistical tools.
Forecasting: Statistical methods help in studying past data and
predicting future trends.
9 / 83
Limitations of Statistics
It cannot deal with a single observation; rather it deals

aggregate of facts.
Statistical methods are not applicable to qualitative character
i.e. it deals with quantitative characteristics.
Statistical results are true on average; i.e. for the majority of
case. Laws of statistics are not universally true like the laws of
physics, chemistry and mathematics.
Statistics are liable to be misused or misinterpreted:This
may be due to incomplete information, inadequate and faulty
procedures during data collection and sample selection and
mainly due to ignorance (lack of knowledge).
10 / 83
VARIABLE
Variable is any phenomenon or an attribute that can assume

different values.
The most important single distinguishing feature of a variable is
that it varies; that is, it can take on different values.
Based on the values that variables assume, variables can be
classified as: namely Qualitative and Quantitative variables.
1. Qualitative variables: A qualitative variable has values that
are intrinsically non-numerical (categorical). Example: Gender,
Religion, Color of automobile, etc.
11 / 83
Variable......................Con’d
2. Quantitative variables: A quantitative variable has values that

are intrinsically numerical. Example: Height, Family size, Weight, etc.
Quantitative variable can be classified int two
Discrete variable: takes whole number values and consists of
distinct recognizable individual elements that can be counted. It is a
variable that assumes a finite or countable number of possible values.
These values are obtained by counting (0, 1, 2, ...). Example: Family
size, Number of children in a family, number of cars at the traffic
light.
12 / 83
........Cont’d
Continuous variable: takes any value including decimals. Such
a variable can theoretically assume an infinite number of possible
values. These values are obtained by measuring. Example:
Height, Weight, Time, Temperature, etc.
Classify each of the following as qualitative and
quantitative and if it is quantitative classify as discrete
and continuous.
1. Color of automobiles in a dealers show room.
2. Number of seats in a movie theater.
3. Classification of patients based on nursing care needed
(complete, partial or safer).
4. Number of tomatoes on each plant on a field.
5. Weight of newly born babies.
13 / 83
Scales of Measurement /levels of measurement
The level of measurement is one way in which variables can be

classified. Broadly, this relates to the level of information content
implicit in the set of values and how each value may be interpreted
(mathematically) relative to other values on the variable - an issue
which dictates how the variable can be used and interpreted in
statistical analysis.
We need scales of measurement

a. shows the information contained in the value of a variable.
b. shows also that what mathematical operations and what
statistical analysis are permissible to be done on the values of the
variable.
14 / 83
The four types of scales available in statistical

analysis are:
1. Nominal Scales of variables are those qualitative variables which
show category of individuals.
They reflect classification in to categories where there is no
particular order or qualitative difference to the labels.
Numbers may be assigned to the variables simply for coding
purposes. It is not possible to compare individual basing on the
numbers assigned to them.
The only mathematical operation permissible on these variables
is counting.
They Have mutually exclusive (non-overlapping) and exhaustive
categories.
No ranking or order between (among) the values of the variable.
Example: Gender (Male, Female),Ethnicity (White, Black,
15 / 83
....................... CONT’D
2.Ordinal Scales of variables are also those qualitative variables

whose values can be ordered and ranked.
Ranking and counting are the only mathematical operations to
be done on the values of the variables.
But there is no precise difference between the values (categories)
of the variable.
Example: Academic Rank (BSc, MSc, PhD), Grade Scores (A,
B, C, D, F), Strength (Very Weak, Week, Strong, Very Strong),
Health Status (Very Sick, Sick, Cured), Economic Status (Lower
Class, Middle Class, Higher Class), etc.
16 / 83
..........................CONT’D
3. Interval Scales of variables are those quantitative variables when
the value of the variables is zero it does not show absence of the
characteristics i.e. there is no true zero.
Zero indicates lower than empty.
For example, for temperature measured in degrees Celsius, the
difference between 5 and 10 is treated the same as the difference
between 10 and 15..
However, we cannot say that 20 is twice as hot as 10., i.e. the
ratio between two different values has no quantitative meaning.
This is because there is no absolute zero on the Celsius scale; 0.
not imply no heat
All mathematical operation are allowed/permissible except
division.
17 / 83
........................CONT’D
4. Ratio Scales of variables are those quantitative variables when

the values of the variables are zero, it shows absence of the
characteristics.
Zero indicates absence of the characteristics.

All mathematical operations are allowed to be operated on the
values of the variables.
Example: weight, height, time, income,salary, consumption,
price, Age, etc.
18 / 83
CHAPTER TWO
Method of Data Collection, Organization and Presentation
Method of Data Collection

Method of Data Organization
Method of Data Presentation
19 / 83
Method of Data collection

Types of Data
Research results or findings reveal informations that are obviously
an output of properly and carefully collected relevant data.
Based on the source of utilization, data can be classified into
two: Primary Data and Secondary Data.
Primary data are data collected for the first time either
through direct observation or by enquiring individuals.
Secondary data are data obtained from published or
unpublished sources like newspapers, journals, official records,
e.t.c.
Based on the role of time, data can be classified as
Cross-sectional and Time series.
20 / 83
Types of data cont”d

Cross-Sectional and Time Series data
Cross-sectional Data: is a set of observations taken at a point of time.
Time series Data: is a set of observations collected for a sequence of
time usually at equal intervals.
Methods of Data Collection
The first and foremost task in statistical investigation is data
collection.
Before data collection, four important points should be considered.
These are the purpose of data collection (why we need to collect
data), the data to be collected (what kind of data to be collected),
the source of data (where we can get the data) and the methods of
data collection (how can we collect this data).
Primary data can be collected through experimental methods in
laboratory in natural sciences and through survey method in social
sciences.
The survey methods of data collection are personal interview,
telephone interview, mailed questionnaire and personal observation.21 / 83
Method of Data Organization
The most convenient way of organizing data is to construct a

frequency distribution. Frequency distribution is the organization
of raw data in table form, using classes and frequencies
Definition of some basic terms
Class: is a description of a group of similar numbers in a data set.
Frequency: is the number of times a variable value is repeated.
Class frequency: the number of observations belonging to a certain

class
22 / 83
CONT”D
There are three types of frequency distributions; categorical,

ungrouped (discrete or frequency array) and grouped (continuous)
frequency distributions.
1. Categorical FD:-a FD in which the data is qualitative i.e. either nominal or
ordinal.
Each category of the variable represents a single class and the number of times
each category repeats represents the frequency of that class (category).
For Example:Construct FD for the following letter grade of 20 students A

BCCCCBBADACCABFCCAB
23 / 83
Organization of Data .......................CONT”D
2. Ungrouped FD (Frequency Array)

A FD of numerical data (quantitative) in which each value of a variable
represents a single class (i.e. the values of the variable are not grouped)
and the number of times each value repeats represents the frequency of
that class.
For Example consider the Number of children for 21 families to construct

appropriate frequency distribution 2 3 5 4 3 3 2 3 1 0 4 3 2 2 1 1 1 4 2 2 2
24 / 83
3. Grouped (Continuous) Frequency Distribution
A frequency of numerical data in which several values of a

variable are grouped into one class.
The number of observations belonging to the class is the

frequency of the class.
25 / 83
Basic Terms
Class Limits: the lowest and highest values that can be
included in a class are called class limits. The lowest values are
called lower class limits and the highest values are called upper
class limits.
Class Boundaries: are class limits when there is no gap

between the UCL of the first class and the LCL of the second
class. The lowest values are called lower class boundaries and
the highest values are called upper class boundaries.
Class Width: the difference between UCB and LCB of a class.
It is also the difference between the lower limits of two
consecutive classes or it is the difference between upper limits of
two consecutive classes.
26 / 83
Besic Terms ........................Cont”D
Relative Frequency: The absolute FD is a summary table in

which the original data is condensed into groups and their
27 / 83
Besic Terms ............................CONT”D

The relative frequency distribution can be formed by dividing the
frequency in each class of the frequency distribution by the total
number of observations.
It can be converted in to a percentage frequency distribution by
simply multiplying each relative frequency by 100.
It tell us the actual number (percentage) of units in each class,
it does not tell us directly the total number (percentage) of units
that lie below or above the specified values of the classes.
Cumulative Frequency distribution. A cumulative frequency
distribution displays the total number of observations above
(below) a certain value.
When the interest of the investigator focuses on the number of
items below a specified value, then this specified value is the
upper boundary of the class. It is known as less than cumulative
28 / 83
Consider the following table for illustration
29 / 83
Construction of Grouped Frequency Distribution
1 Arrange the data in an array form (increasing or decreasing

order).
2 Find the Unit of Measurement (U). U is the smallest difference
between any two distinct values of the data.
3 Find the Range(R):R is the maximum numerical difference in the
data set, i.e. the difference between the largest and the smallest
values of the variable.
4 Determine the number of classes (K) using Sturges Rule.
K=1+3.322logN where N is the total number of observations.
5 Specify the class width(W): It is the ratio of Range(R) to
number of classes(K).
30 / 83
The Steps....................................... Cont”D

6. Put the smallest value of the data set as the LCL of the first class.
To obtain the LCL of the second class add the class width W to the
LCL of the first class. Continue adding until you get K classes.
Let X be the smallest observation. LCL1=X
Example: Mark of 50 students out of 40. 16 21 26 24 11 17 25

26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28 25 7 17 22 28
19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24 Construct
grouped FD. 31 / 83
SOLUTION
32 / 83
Properties of Classes (Class Boundaries)
Complete and non-overlapping.

Clear and properly set
Standardized
Continuous
Advantages
It condenses a large mass of data in to a comparatively small
table.
It attracts the attention of even a layman and gives him an
insight into the nature of the distribution.
It helps for further statistical analysis, like central tendency,
scatter, and symmetry of the data.
33 / 83
Data Presentation
Graphic Display of Data

Bar Chart:It is the simplest and most commonly used
diagrammatic representation of a frequency distribution.
It is the most common presentation for nominal, categorical or
discrete data.
It uses a serious of separated and equally spaced bars.
The heights of the bars represent the frequency or relative
frequency of the classes.
But the width of the bars has no meaning; however, all the bars
should be the same width to avoid distortion.
And also the bars are separated by constant distance.
34 / 83
Simple bar chart, Component bar chart and

Multiple bar chart
35 / 83
Bar chart.......................CONT”D
Component Bar Chart: is used when there is a desire to show a
total or aggregate is divided into its component parts. The bars
represent total value of a variable with each total broken into its
component parts and different colors are used for identification.
36 / 83
Bar chart .............................CONT’D

Multiple Bar Chart: is used to display data on more than one
variable. In the multiple bars diagram two or more sets of
inter-related data are interpreted.
37 / 83
Pie Chart
Pie chart is popularly used in practice to show percentage break down
of data.
It is a circle representing a set of data by dividing the circle into
sectors proportional to the number of items in the categories or it is a
circle representing the total, cut into slices in proportional to the size
of the parts that make up the total.
38 / 83
Histogram
Histogram is the most common graphical presentation of a

frequency distribution for numerical data.
It uses a series of adjacent bars in which the width of each bar
represents the class width and the heights represent the
frequency or relative frequency of the class.
It is used for grouped data in which the class boundaries are
marked on the X axis and the frequencies are marked along the
Y axis.
For Example: In the following, the heights of 45 female students
at Haramaya University are recorded to the nearest inch.
Construct a histogram by hand.
39 / 83
Histogram example................cont”d
67 67 64 64 74 61 68 71 69 61 65 64 62 63 59 70 66 66 63 59 64 67
70 65 66 66 56 65 67 69 64 67 68 67 67 65 74 64 62 68 65 65 65 66
67
40 / 83
Frequency Polygon and Cummulative Frequency

Curves(Ogive)
41 / 83
Cont”D
42 / 83
SUMMARY
In this chapter, there are two sources of data (primary and
secondary). Primary data is the one which is collected by the
investigator himself for the purpose of a specific inquiry or study
while secondary data is the data which has already been
collected by others.
The most common methods of data collection for survey, called
questionnaire,interview (telephone and personnal)
The most convenient way of organizing numerical data is to
construct a frequency distribution which is the organization of
raw data in table form, using classes and frequencies. There are
three types of frequency distributions; categorical, ungrouped
and grouped frequency distribution.
The most common diagrammatic and graphical data
presentations were used.
43 / 83
CHAPTER SIX
Statistical Inference
The term statistical inference deals with the collection of data
on a relatively small number of cases so as to form conclusion
about the general population from which the sample was taken
The sample from the population is used to provide the estimate

of the population parameters (point and interval estimations).
The most important objective of statistical analysis is to draw

inferences about the population using sample information.
The process of generalizing from sample to the population is

known as Statistical Inference.
44 / 83
CONT’D
1 A summary measure that describes any given characteristic of
the population is known as Parameter. Eg: population mean
(µ), population variance (δ 2 ), population standard deviation (δ),
population proportion (P), population moments are parameters.
2 The summary measure that describes the characteristic of the
sample is known as Statistic.Eg: sample mean , sample variance
, sample standard deviation (S), sample proportion (p), sample
moments(r) are Statistics.
3 Statistical inference generally takes one of the two forms,
namely, estimation of the population parameter and
testing of hypothesis.
4 Estimation:-It is the process of calculating the population
parameters based on the corresponding characteristics from a
random sample estimate. Because it is not possible to make
45 / 83
There are two types of estimation:-

Point estimation
Interval estimation
Point Estimation:It is the process of obtaining a single sample
value (point estimate) that is used to estimate the desired
population parameter. The estimator is known as point
estimator.
The best estimator should be highly reliable and have such
desirable properties as unbiasedness, consistency, efficiency and
sufficiency. These criteria are described as follows:
Unbiasedness: An estimator is a random variable since it is

always a function of the sample values. The expected value of
the sample statistic is considered to be an unbiased estimator if
it equals the population parameter which is being estimated. .
46 / 83
CONTINUED
Consistency: It refers to the effect of sample size on the
accuracy of the estimator. A statistic is said to be consistent
estimator of the population parameter if it approaches the
parameter as the sample size increases.
Efficiency: An estimator is considered to be efficient if its value
remains stable from sample to sample. The best estimator would
be the one which would have the least variance from sample to
sample. From the three point estimators of central tendency,
namely the mean, median and mode, the mean is considered the
least variant and hence a better estimator.
Sufficiency: An estimator is said to be sufficient if it uses all
the information about the population parameter contained in the
sample.
47 / 83
Point estimation for population mean and

population proportion
From a single sample, we can calculate a sample statistic to estimate
a single parameter (point estimate) .That means a point estimate for
a population mean is given by:
48 / 83
Interval estimation
Point estimator has some drawbacks.

First a point estimator from the sample may not exactly locate
the population parameter (i.e. the value of the point estimator is
not likely to be exactly equal to the value of the parameter)
resulting in some margin of uncertainty. If the sample value is
different from the population value, the point estimator does not
indicate the extent of the possible error.
Secondly a point estimate does not specify as to how confident
we can be that the estimate is close to the parameter it is
estimating. That is we cannot attach any degree of confidence
to such an estimate as to what extent it is closer to the value of
the parameter. .
49 / 83
continued
Because of these limitations of point estimation, interval

estimation is considered desirable.
The interval estimation involves the determination of an interval
(a range of values) within which the population parameter must
lie with a specified degree of confidence. It is the construction of
an interval on both sides of the point estimate within which we
can reasonably confident that the true parameter will lie.
It is the same as the confidence interval estimate.
In a point estimation, the value of the sample statistic will vary
from sample to sample. Therefore, simply to obtain an estimate
of a single value of a parameter is not generally
acceptable.Rather
50 / 83
continued
1 We need to take into account the sample to sample variation of

the statistic.
2 We need to also a measure of how precise our estimate is likely
to be.
A confidence interval defines an interval within which the
population parameters likely to fall (interval).
Interval Estimation for the Population Mean (µ)
51 / 83
For Example
Haramaya University wishes to estimate the average age of students

who graduate with B.Sc. degree. A random sample of 625 graduating
students showed that the average age was 24 with a standard
deviation of 5 years. Construct the 95% confidence interval for the
true average age of all such graduating students at the university and
interpret it.
52 / 83
Characteristics of interval estimation
It provides arrange of values

It takes consideration sample to sample variation of sample
statistic.
It is based on the observation from one sample.
It gives information about closeness to unknown parameters.
Interval estimation expresses the inherent uncertainty in any
study by expressing lower and upper boundaries for anticipated
true underlying population parameter.
If the population standard deviation is known or the sample size
is relatively large (greater than 30) then, the confidence interval
for sample mean is
53 / 83
continued
54 / 83
continued
If the population standard deviation is unknown and the sample size
is small (< 30) ,then the confidence interval for the sample is
55 / 83
Example for population mean
1. A Simple Random Sample of 16 apparently healthy subjects

yielded the following values of urine excreted (milligram per day).
0.007, 0.03, 0.025 ,0.008 ,0.03 ,0.038 ,0.007 ,0.005 ,0.032 ,0.04
,0.009 ,0.014 ,0.011 ,0.022 ,0.009 ,and ,0.008,then
a) Compute point estimate of population mean
b) Compute 95% interval estimate of population mean
56 / 83
SOLUTION
57 / 83
Example for population proportion
2. In survey of 300 automobile drivers in one city, 123 reported that

they wear seat belts regularly.
a) Estimate the seat belt rate of the city.
b) Construct 95% CI for the rate
58 / 83
SOLUTION
HOW WE INTERPRET THIS RESULT?
59 / 83
Hypothesis testing on population mean and

proportion
1 statistical hypothesis is a conjecture (an assumption) about a

population parameter which may or may not be true.
2 Hypothesis testing is a statistical procedure which leads to
take a decision about such an assumption for the population
parameter being correct or not, by using data obtained from the
sample.
3 In hypothesis testing, the researcher must define the population
under study, state the particular hypothesis that will be checked,
give the significance level, select sample from the population,
perform calculations required for statistical test and reach
conclusion.
60 / 83
There are two types of statistical hypotheses.
Null Hypothesis (H0):- is a statistical hypothesis that states

there is no difference between a parameter and a specific value
or hypothesized value. H0:µ=µ0 where µ is the population mean
and µ0 is the hypothesized mean
Alternative Hypothesis (H1):- is a statistical hypothesis that
states there exists a difference between a parameter and a
specific value or hypothesized value. H1:µ 6= µ0 H1:µ < µ0 H1:
µ > µ0
Errors in Hypothesis Testing

Type-I error
Type-II error
61 / 83
CONTINUED
1 Type I error: is an error occurred if one rejects the null

hypothesis which is actually true. OR
2 It is an error occurred if one fail to rejects the alternative
hypothesis which is actually false.
3 The maximum probability of committing type I error is called the
level of significance and denoted by α (alpha).
4 Type II error: is an error occurred if one failed to reject the
null hypothesis which is actually false.
5 It is an error occurred if one reject the alternative hypothesis
which is actually true.
62 / 83
Steps in hypothesis testing
1 Stating the null and alternative hypothesis.

2 Fix the level of significance (α ).
3 Selecting an appropriate test statistics and calculate its value.
4 Determine the critical value or tabulated value.
5 Compare the test statistics with the tabulated value and draw
the statistical conclusion.
Hypothesis testing for the population mean
There are three cases to be considered while testing any one of the
above hypothesis.
63 / 83
CONTINUED
Case 1 :if the population standard deviation (δ) is known and n
is large(n >30), then the appropriate test of statistic:
Case 2: if the population standard deviation (δ) is unknown and

n is large(n>30) , then the appropriate test of statistic is:
Case 3: if the population standard deviation (δ) is unknown and

n is small (n < 30), then the appropriate test of statistic is:
64 / 83
Decision For Case 1-3 of above
65 / 83
For Example
1. The mean life of a sample of 200 tires taken from the lot is found
to be 40,000kms. Past experience shows that the standard deviation
for life of tires in the lot is 3200kms.
A. Construct a 95% confidence interval for the mean life of tire in
the lot is expected to lie?
B. Is it reasonable to suppose the mean life of tires in the lot as
41,000kms? (At 5% level of significance
66 / 83
SOLUTION
67 / 83
CONTINUED FOR .....Question B
68 / 83
Another Example
2. A soap manufacturing company was distributing a particular type

of brand through a large number of retails soaps .Before a heavy
advertizing movement, the mean sales per weak per shop was 140
dozens. After the movement, a sample of 49 shops was taken and the
mean sales were found to be 147 dozens with SD 16.
A. Construct a 95% confidence interval for the mean sales of soap
manufacturing company?
B. Can you consider the advertisement effective?
69 / 83
SOLUTION 2A
70 / 83
SOLUTION 2B
71 / 83
HOME WORK
1. A research reports that the average salary of the sociologist is
more than $42000. A sample of 30 sociologists has a mean salary of
$43260. Test the reports claim. Assume the population standard
deviation is $5230.
2. A national magazine claims that the average college students
watches less television than the general public. The national average
is 29.4 hours per week, with a standard deviation 2 hours. A random
sample of 25 college students has a mean of 27 hours. Test the
claim. Assume normality.
3. A merchant believes that the average age of customers who
purchase a certain brand of wears is 13 years of age. A random
sample of 35 customers had an average age of 15.6 years. At
α=0.01, should this conjecture be rejected. The standard deviation
of the population is 1 year.
72 / 83
CHAPTER SEVEN
7. CORRELATION AND SIMPLE LINEAR REGRESSION

7.1 Correlation Analysis
Correlation is a mathematical to ol desired towards meas uring

the degree of linear relationship (degree of asso ciation) b
etween the variables.
Correlation that involves only two variables is called simple
correlation and which involves more than two variables is called
multiple correlations.
Covariance is a measure of the joint variation in two variables,
i.e. it measures the way in which the values of the two variables
vary together.
73 / 83
Continued
1 If the covariance is zero, there is no linear relationship between

the two variables.
2 If it is negative, there is an indirect linear relationship between
them.
3 If the covariance is positive, there is a direct linear relationship
between the variables
Pearson’s Coefficient of Correlation

Pearson’s coefficient of correlation (r) is used to measure the
strength of the linear relationship between two variables
74 / 83
continued
75 / 83
7.2 Simple Linear Regression
Regression is defined as the estimation or prediction of the

unknown value of one variable from the known values of one or
more variables.
It is also functional relationship between two or more variables.
The variable whose values are to be estimated or predicted is
known as dependent or explained variable while the variable
which are used in determining the value of the dependent
variable are called independent or predictor variables.
The regression study that involves only two variables is called
simple regression and the regression analysis that studies more
than two variables is called multiple regression.
76 / 83
continued
Regression Equation: is a mathematical equation that defines
the relationship between two variables. Regression of y on x is
given by
To estimate the regression coefficients (α̂ and β̂), the procedure

is minimizing the sum of the squares of the errors.
77 / 83
Continued
Let the estimated model be yhat=α̂+ β̂x
78 / 83
Coefficient of Determination
The coefficient of determination tells how well the estimated

model fits the data.
For simple linear regression (two variables case), it is defined as
the square of the sample correlation coefficient, and denoted by
r2.
Hence r2 measures the proportion or percentage of the variation
in the dependent variable explained by the independent variable.
r2 is a nonnegative quantity which lies in the limits 0 and 1. If it
approaches to 1, it means a good fit and if it approaches 0, no
relationship between the variables
79 / 83
continued
Example 1 : A researcher wants to find out if there is a relationship

between the heights of sons and the heights of their fathers. In other
words, do taller fathers have taller sons? The researcher took a
random sample of 6 fathers and their 6 sons. Their height in inches is
given below in an ordered array
80 / 83
SOLUTION For a
estimated parameters.
81 / 83
SOLUTION For b
82 / 83
Compute coefficient of determination and interpret
83 / 83

Basic One

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Basic One

Uploaded by

Copyright:

Available Formats

Introduction to Statistics

College of Computing And Informatics

Examples Sales Statistics, Employment Statistics,labor

a. Statistics are aggregate of facts

Singular sense of Statistics

Singular Sense:Statistics is the science that deals with the methods

In this modern time, statistical information plays a very important

In Agriculture: Experiments are designed and analyzed using

Condenses and summarizes masses of data and presents facts in

It cannot deal with a single observation; rather it deals

Variable is any phenomenon or an attribute that can assume

2. Quantitative variables: A quantitative variable has values that

Scales of Measurement /levels of measurement

The level of measurement is one way in which variables can be

We need scales of measurement

The four types of scales available in statistical

2.Ordinal Scales of variables are also those qualitative variables

4. Ratio Scales of variables are those quantitative variables when

Zero indicates absence of the characteristics.

Method of Data Collection, Organization and Presentation

Method of Data Collection

Method of Data collection

Types of data cont”d

Method of Data Organization

The most convenient way of organizing data is to construct a

Frequency: is the number of times a variable value is repeated.

Class frequency: the number of observations belonging to a certain

There are three types of frequency distributions; categorical,

For Example:Construct FD for the following letter grade of 20 students A

Organization of Data .......................CONT”D

2. Ungrouped FD (Frequency Array)

For Example consider the Number of children for 21 families to construct

3. Grouped (Continuous) Frequency Distribution

A frequency of numerical data in which several values of a

The number of observations belonging to the class is the

Class Boundaries: are class limits when there is no gap

Besic Terms ........................Cont”D

Relative Frequency: The absolute FD is a summary table in

Besic Terms ............................CONT”D

Consider the following table for illustration

Construction of Grouped Frequency Distribution

1 Arrange the data in an array form (increasing or decreasing

The Steps....................................... Cont”D

Example: Mark of 50 students out of 40. 16 21 26 24 11 17 25

Properties of Classes (Class Boundaries)

Complete and non-overlapping.

Graphic Display of Data

Simple bar chart, Component bar chart and

Bar chart .............................CONT’D

Histogram is the most common graphical presentation of a

Frequency Polygon and Cummulative Frequency

The sample from the population is used to provide the estimate

The most important objective of statistical analysis is to draw

The process of generalizing from sample to the population is

There are two types of estimation:-

Unbiasedness: An estimator is a random variable since it is

Point estimation for population mean and

Point estimator has some drawbacks.

Because of these limitations of point estimation, interval

1 We need to take into account the sample to sample variation of

Haramaya University wishes to estimate the average age of students

Characteristics of interval estimation

It provides arrange of values