
STATISTICS FOR M&E (HEME 3707), PROC (HEPL 3101), PM, HRM, SM,
ICT, ENTREPRENEURSHIP

Lecture 1
Introduction to Statistics
Outline

◼ Introduction
◼ Numerical summaries
◼ Probability
◼ Distribution theory and applications
Introduction
What is Statistics

◼ In its simplest form, as a plural noun, statistics
are collections of numerical data, such
as employment statistics, accident statistics,
population statistics, births and deaths, income
and expenditure, exports and imports, etc.
◼ As a singular noun, the purpose of statistics is
to develop and apply methodology for
extracting useful information and
knowledge from both experimental and survey
data.
Major activities in statistics

◼ Collection of data
◼ Presentation of data
◼ Analysis of data
◼ Interpretation of analyzed data
Why a Manager Needs to Know
About Statistics

◼ To Know How to Properly Present Information


◼ To Know How to Draw Conclusions about
Populations Based on Sample Information
◼ To Know How to Improve Processes
◼ To Know How to Obtain Reliable Forecasts
Examples
◼ A shoe factory will be interested in the most common
shoe sizes in order to make a decision on the
production process.
◼ The Ministry of Education will be interested in the trend
in the number of pupils starting each level of education
in order to make decisions related to building of
schools, training of teachers, etc.
Task: The latest sales data have just come in, and your
boss wants you to prepare a report for management on
places where the company could improve its business.
What should you look for?
What should you not look for?
Uses of Statistics
◼ To present the data in a concise and definite form : Statistics
helps in classifying and tabulating raw data for processing and
further tabulation for end users.
◼ To make it easy to understand complex and large data :
This is done by presenting the data in the form of tables, graphs,
diagrams etc., or by condensing the data with the help of means,
dispersion etc.
◼ For comparison : Tables and measures of means and dispersion can
help in comparing different sets of data.
◼ In forming policies : It helps in forming policies like a production
schedule, based on the relevant sales figures. It is used in
forecasting future demands.
◼ In measuring the magnitude of a phenomenon : Statistics
makes it possible to put numbers on the population of a country, its
industrial growth, its agricultural growth and its educational level.
Limitations of Statistics
◼ In most statistical investigations, we use
samples to represent a population. These may
not always represent the population adequately
◼ Results based on data with strong departures
from the assumptions such as normality will be
less reliable than results from data that meet
the assumptions of a statistical test.
◼ It is possible to make mistakes by ignoring
some key statistical principles, e.g. correlation
does not imply causation.
Limitations of Statistics cont…
◼ In hypothesis testing, the p-value or "probability value" is the
probability of obtaining results at least as extreme as those observed,
assuming the null hypothesis is true. It is not the probability that the
null hypothesis is true. For example, when comparing means of several
groups, the p-value is the probability of seeing differences at least as
large as those observed if no difference exists in the population (that is,
if they arise only by chance). We then use the reverse logic that if such
differences would occur by chance so seldom (typically when p < 5% or 0.05),
real differences probably exist in the population. This has serious
implications on what you say about the hypothesis you accept:
◼ Failing to reject a null hypothesis does not mean that the groups are the
same or that there is no relationship. It is just that the evidence in the
sample is not strong enough to support the opposite.
◼ By accepting an alternative hypothesis at the 5% level of significance
you are saying that, if there were really no difference in the population,
only about 5 out of 100 similar surveys would be expected to show a
difference this large by chance. It does not mean that 95 out of 100
surveys would show a difference.
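To make the meaning of a p-value concrete, the sketch below runs an exact permutation test on two small groups. The scores and the names group_a, group_b and permutation_p_value are made up for illustration and are not part of the lecture. The p-value is the share of all possible regroupings of the eight scores whose difference in means is at least as large as the one actually observed, which is what "by chance alone" means here.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    count = 0
    # Try every way of choosing which scores are labelled "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

# Hypothetical scores for two groups of four people each.
group_a = [12, 15, 14, 16]
group_b = [10, 9, 11, 13]
p_value = permutation_p_value(group_a, group_b)
print(round(p_value, 4))   # 0.0571
```

Here 4 of the 70 possible regroupings are at least as extreme as the observed split, so p is about 0.057. That is slightly above 0.05, so at the 5% level the evidence would not be judged strong enough to reject the null hypothesis.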
Operational Definitions

◼ Variable: A variable is a characteristic of an
item or individual. It is simply something that
varies, or doesn't always have the same value,
as you move from one subject to another, such as
date of birth, age, marks or district.
◼ Data: Data are the different values associated
with a variable.
Classification of Data

Data
◼ Categorical (Qualitative)
◼ Numerical (Quantitative)
   ◼ Discrete
   ◼ Continuous
Classification of Data cont…
1) Qualitative refers to variables whose values fall into
groups or categories. They are also called categorical
variables because the data they carry describes categories
(e.g. Marital status, Gender, Religious affiliation, Type of
car owned). They can further be classified as;
(a) Nominal variables: Variables whose categories are
just names with no 'natural ordering', e.g. gender, color,
district, marital status, etc.; or
(b) Ordinal variables: Variables whose categories have
a 'natural ordering', e.g. education level, degree
classification, etc. In a variable such as performance, the
category 'Excellent' is better than the category 'Very good',
which is better than 'Good'.
Classification of Data cont…
2) Quantitative: Numerical variables (e.g. number of
students, age, weight, distance etc). They can further be
classified as;
(a) Discrete variables (Interval): can only assume
certain values and there are usually gaps between values, e.g.
the number of bedrooms in a house, the number of
children in a family, etc. In most cases they arise from
counting and their ratios do not make sense.
(b) Continuous variables (Ratio): can assume any
value within a specific range, e.g. The time to cook ugali,
Height of a tree, Your age, e.t.c. In most cases, such data
arises from measurements.
Data Sources
◼ Print or Electronic
◼ Observation
◼ Survey
◼ Experimentation
Types of Statistics
◼ Descriptive Statistics: is a field that focuses on describing
different characteristics of the data rather than trying to infer
something from it. It is a body of methods of organizing,
summarizing, and presenting sample data in an informative
way. E.g. If voting were held today, 60% of the
electorate would not vote for their current MP. This describes
the number of respondents out of every 100 persons who
were interviewed.
◼ Inferential Statistics: body of methods which tries to infer
or reach conclusions about the population based on the
scientifically sampled data. The calculated summaries from
the sample are used for estimation, prediction, or
generalization about a population from which the sample was
taken. E.g. The JKUAT accounting department normally
checks a sample of the payment vouchers in order to judge the
accuracy of all the payment vouchers.
Population & Sample
◼ Finite populations: A sample provides information
about a population when it is too difficult or
expensive to make measurements from the whole
population.
◼ We often want to find information about a
particular group of individuals (people, fields, trees,
bottles of beer or some other collection of items).
This target group is called the population.
◼ Collecting measurements from every item in the
population is called a census. A census is rarely
feasible, because of the cost and time involved.
Population & Sample cont…
◼ Simple random sample: We can usually obtain
sufficiently accurate information by only collecting
information from a selection of units from the
population - a sample. Although a sample gives less
accurate information than a census, the savings in cost
and time often outweigh this.
◼ The simplest way to select a representative sample is a
simple random sample. In it, each unit has the same
chance of being selected and some random mechanism
is used to determine whether any particular unit is
included in the sample.
Effect of sample size
◼ Bigger samples mean more stable and reliable
information about the underlying population.
◼ As the sample size is increased, the sampling
error becomes smaller.
◼ When a sample is used to estimate a population
characteristic, an error is usually involved.
Sampling error is caused by random selection of
the sample from the population.
◼ The difference between an estimate and the
population value being estimated is called its
sampling error.
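The effect of sample size on sampling error can be seen in a small simulation. This is a minimal sketch: the population of ages, the seed and the helper average_sampling_error are all assumptions made up for illustration. It draws repeated simple random samples of different sizes and records how far, on average, the sample mean lands from the population mean.

```python
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 ages between 18 and 65.
population = [rng.randint(18, 65) for _ in range(10_000)]
population_mean = mean(population)

def average_sampling_error(sample_size, repeats=200):
    """Average absolute gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        # Simple random sample: every unit has the same chance of selection.
        sample = rng.sample(population, sample_size)
        errors.append(abs(mean(sample) - population_mean))
    return mean(errors)

errors_by_size = {n: average_sampling_error(n) for n in (10, 100, 1000)}
for n, err in errors_by_size.items():
    print(n, round(err, 3))
```

The printed errors should shrink as the sample size grows from 10 to 100 to 1000, in line with the bullets above.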
Illustration of Sampling Error
Numerical Summaries and
Probability Distributions
Measures of Central Tendency

◼ Descriptive statistics
◼ Describe the middle characteristics of the
data (distribution of scores); represent scores
in a distribution around which other scores
seem to center
◼ Most widely used statistics
◼ mean, median, and mode
Mean
The arithmetic average of a distribution of scores; most
generally used measure of central tendency.
Characteristics
◼ Most sensitive of all measures of central tendency
◼ Most appropriate measure of central tendency to use
for ratio data (may be used on interval data)
◼ Considers all information about the data and is used to
perform other statistical calculations
◼ Influenced by extreme scores, especially if the
distribution is small
Symbols Used to Calculate Mean
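In the usual notation the mean is written x̄ = Ʃx / N, where Ʃx is the sum of the scores and N is the number of scores. The short Python sketch below uses made-up scores to show the calculation and how a single extreme score pulls the mean.

```python
def mean(scores):
    """Arithmetic mean: the sum of the scores divided by how many there are."""
    return sum(scores) / len(scores)

scores = [4, 5, 6, 5, 5]          # hypothetical test scores
with_outlier = scores + [35]      # the same scores plus one extreme value

print(mean(scores))               # 5.0
print(mean(with_outlier))         # 10.0
```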
Median

Score that represents the exact middle of the
distribution; the fiftieth percentile; the score
that 50% of the scores are above and 50% of
the scores are below.
Characteristics
◼ Not affected by extreme scores.
◼ A measure of position.
◼ Not used for additional statistical calculations.
◼ Represented by Mdn or P50.
Steps in Calculation of Median
1. Arrange the scores in ascending
order.
2. Multiply N by 0.50 to locate the middle of the distribution.
3. If the number of scores is odd, P50
is the middle score of the
distribution.
4. If the number of scores is even, P50
is the arithmetic average of the two
middle scores of the distribution.
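The steps above can be sketched in Python as follows. The function name and the example scores are illustrative only.

```python
def median(scores):
    """Median (P50) following the four steps above."""
    ordered = sorted(scores)          # step 1: ascending order
    n = len(ordered)
    half = n // 2                     # step 2: N x 0.50, rounded down
    if n % 2 == 1:                    # step 3: odd N, take the middle score
        return ordered[half]
    # step 4: even N, average the two middle scores
    return (ordered[half - 1] + ordered[half]) / 2

print(median([7, 3, 9]))       # 7
print(median([7, 3, 9, 5]))    # 6.0
```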
Median Estimator
Median Estimator cont…
Mode

Score that occurs most frequently; a distribution may
have more than one mode.

Characteristics
Least used measure of central tendency.
Not used for additional statistics.
Not affected by extreme scores.
Example

◼ In {6, 3, 9, 6, 6, 5, 9, 3} the mode is 6, as it
occurs most often.
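A minimal Python sketch of finding the mode by counting how often each score occurs. The function name modes is an illustrative choice; it returns a list because a distribution may have more than one mode.

```python
from collections import Counter

def modes(scores):
    """Return every score that shares the highest frequency."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(score for score, c in counts.items() if c == top)

print(modes([6, 3, 9, 6, 6, 5, 9, 3]))   # [6]
print(modes([1, 1, 2, 2, 3]))            # [1, 2]
```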
Which Measure of Central Tendency is Best for
Interpretation of Test Results?

◼ Mean, median, and mode are the same for a
normal distribution, but test scores often will not
follow a normal curve.
◼ The farther away from the mean and median
the mode is, the less normal the distribution.
◼ The mean and median are both useful
measures.
◼ In most testing, the mean is the most reliable
and useful measure of central tendency; it is
also used in many other statistical procedures.
Numerical Summaries
SET       A    B    C    D
          8    8    8    4
          8    8    6   12
          8    6    7    8
          8   10    9   16
          8    8   10    0
Mean      8    8    8    8
Mode      8    8    -    -
Median    8    8    8    8
Two data sets may have common means, medians and modes and identical
frequencies in the modal class, yet they may differ widely in their spread of
values about the measures of central tendencies.
Meaning: Although the measures of central tendency provide useful
information about the data, how well they represent the data depends
largely on the extent to which the data are dispersed.
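The point can be checked directly. The sketch below recomputes the mean and median of the four sets and adds the range (highest value minus lowest value), one of the measures of dispersion, to show how differently the sets are spread. The dictionary layout is an illustrative choice.

```python
from statistics import mean, median

data_sets = {
    "A": [8, 8, 8, 8, 8],
    "B": [8, 8, 6, 10, 8],
    "C": [8, 6, 7, 9, 10],
    "D": [4, 12, 8, 16, 0],
}

summary = {}
for name, values in data_sets.items():
    summary[name] = {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),   # highest minus lowest
    }
    print(name, summary[name])
```

All four sets share a mean and median of 8, while the ranges are 0, 4, 4 and 16.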
Numerical Summaries cont…
◼ From the previous example, it is not
possible to differentiate the four sets in
the absence of the raw data. Other
measures are required to make the
comparison.
◼ We can examine the dispersion or spread,
which is the degree of scatter or variation
of the variable about the central value.
There are various measures of dispersion.
Example
◼ Consider two individuals of different ages: a
1 year old child and a 99 year old.
◼ What is the average age of the two? (50
years)
◼ Is this answer sufficiently informative?
◼ More informative response:
◼ Mean age is 50 years, with the ages
ranging between 1 and 99 years.


Measures of Variability

◼ To provide a more meaningful interpretation of
data, you need to know how the scores spread.
◼ Variability - the spread, or scatter, of scores;
the terms dispersion and deviation are often used
◼ With the measures of variability, you can
determine the amount that the scores spread,
or deviate, from the measures of central
tendency.
◼ Descriptive statistics; reported with measures of
central tendency
Measures of Dispersion
◼ Range (R)
◼ Inter-quartile range (IQR), Semi-inter-quartile
range (SIQR) and Quartile deviation (QD)
◼ Mean absolute deviation (MAD)
◼ Variance (s²); and
◼ Standard deviation (s)
Range
Determined by subtracting the lowest score
from the highest score; represents only the
extreme scores.

Characteristics
1. Dependent on the two extreme scores.
2. Least useful measure of variability.

Formula: R = Hx - Lx
Quartile Deviation
Sometimes called the semi-interquartile range; is the spread of the
middle 50% of the scores around the median. Extreme
scores will not affect the quartile deviation.
Characteristics
1. Uses the 75th and 25th percentiles; the difference
between these two percentiles is referred to as the
interquartile range.
2. Indicates the amount that needs to be added to, and
subtracted from, the median to include the middle
50% of the scores.
3. Usually not used in additional statistical calculations.
Quartile Deviation

Symbols
Q = quartile deviation
Q1 = 25th percentile or first quartile (P25) =
score in which 25% of scores are below and
75% of scores are above
Q3 = 75th percentile or third quartile (P75) =
score in which 75% of scores are below and
25% of scores are above
Steps for Calculation of Q3
1. Arrange scores in ascending order.
2. Multiply N by .75 to find 75% of the distribution.
3. Count up from the bottom score to the number
determined in step 2.
Approximation and interpolation
may be required.
Steps for Calculation of Q1
1. Multiply N by .25 to find 25% of the distribution.
2. Count up from the bottom score to the number
determined in step 1.
To Calculate Q
Substitute values in formula: Q = (Q3 - Q1) / 2
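A hedged Python sketch of the calculation, taking the quartile deviation as half the interquartile range (the semi-interquartile range). The percentile helper uses linear interpolation between ordered scores, which is one common convention among several; the counting method in the steps above, and different software packages, can give slightly different quartiles for small data sets. The scores are used purely as an illustration.

```python
def percentile(scores, p):
    """Percentile by linear interpolation between ordered scores.

    This is one convention among several, chosen here for illustration.
    """
    ordered = sorted(scores)
    position = p * (len(ordered) - 1)   # fractional index into the list
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return ordered[-1]
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

scores = [43, 69, 57, 30, 93, 26, 82, 75]
q1 = percentile(scores, 0.25)
q3 = percentile(scores, 0.75)
iqr = q3 - q1          # interquartile range
q = iqr / 2            # quartile deviation (semi-interquartile range)
print(q1, q3, iqr, q)  # 39.75 76.75 37.0 18.5
```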
Quartiles

Q1 = 25%
Q2 = 50%
Q3 = 75%
Q4 = 100%
Q2 - Q1 = range of scores below median
Q3 - Q2 = range of scores above median
Example

◼ Find the 50th percentile for the following data:
43, 69, 57, 30, 93, 26, 82, 75
◼ Arranged in ascending order: 26, 30, 43, 57, 69, 75, 82, 93
◼ 50th percentile = (57+69)/2 = 63
Illustration
◼ Range
◼ Interquartile Range
The range and IQR each involve only two values while
ignoring the negative sign.
Mean absolute deviation (MAD)
◼ The mean absolute deviation of a dataset is
the average distance between each data point
and the mean. It gives us an idea about the
variability in a dataset.

◼ Example: John enjoys posting pictures of his
dog online. Here's how many "likes" the
past 6 pictures each received:
◼ 10, 15, 15, 17, 18, 21
◼ Find the mean absolute deviation.
MAD
Data Points   Distance from Mean   Absolute distance
10            10-16 = -6           6
15            15-16 = -1           1
15            15-16 = -1           1
17            17-16 = 1            1
18            18-16 = 2            2
21            21-16 = 5            5
Mean = 16                          SUM = 16

MAD = 16/6 = 2.67; Meaning: On average each picture was
about 3 likes away from the mean.
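The same calculation as a short Python sketch. The function name is illustrative; the data are the six "likes" counts from the example.

```python
def mean_absolute_deviation(values):
    """Average absolute distance between each value and the mean."""
    center = sum(values) / len(values)
    distances = [abs(v - center) for v in values]
    return sum(distances) / len(values)

likes = [10, 15, 15, 17, 18, 21]
mad = mean_absolute_deviation(likes)
print(round(mad, 2))   # 2.67
```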
Exercise: The following shows the marks scored by Kevin in 4
consecutive tests: 3, 15, 21, 13.
Task: Find the MAD for the data set.
Variance
◼ In calculating the MAD, the negative signs were
ignored.
◼ However, we know that the square of a number
is always positive. We can therefore attempt to
use the average of squared deviation from the
mean.
◼ This defines the variance of the data, denoted by
s², as:
s² = Σ(x − x̄)² / n
Variance Continued

◼ For samples, the sum of squared deviations is divided by n-1 rather than n to correct for sample bias
◼ The sample variance depends on the calculated sample mean, so
we have one constraint; hence the degrees of freedom are n-1
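A minimal Python sketch contrasting the two divisors, using the values 3, 8 and 4. The function names are illustrative.

```python
def sum_squared_deviations(values):
    """Sum of squared distances between each value and the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def population_variance(values):
    # Divide by n: the values are treated as the whole population.
    return sum_squared_deviations(values) / len(values)

def sample_variance(values):
    # Divide by n - 1: one degree of freedom is used up by the sample mean.
    return sum_squared_deviations(values) / (len(values) - 1)

values = [3, 8, 4]
print(round(population_variance(values), 2))   # 4.67
print(sample_variance(values))                 # 7.0
```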
Variance Continued
◼ Consider the set of values: 3, 8, and 4 whose
mean is 5.
◼ s² = Σ(x − x̄)² / n
◼ = {(3−5)² + (8−5)² + (4−5)²}/3 = 14/3 = 4.67

◼ NOTE: This approach of computing variance is
tedious because the mean has to be calculated
first. It becomes even more tedious if the mean is
a decimal number. An easier computational
formula is the expanded form, which we quote
directly as:
s² = Ʃfx²/Ʃf − (Ʃfx/Ʃf)²
◼ Where Ʃf = n
Example
◼ The following table gives marks obtained by
some students in a CAT
Marks       8   13   16   19   21   25
Frequency   3    4    7    9    8    6

◼ Sample size n = Ʃf = 37
◼ Sum of values Ʃfx = 677
◼ Sum of squares Ʃfx² = 13187
◼ Sample mean x̅ = Ʃfx/n = 677/37 = 18.297
◼ Population mean = 18.297
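The sums for the grouped data can be reproduced with a few lines of Python. This is a sketch with illustrative variable names; it also applies the expanded formula Ʃfx²/n − (Ʃfx/n)² for the variance, once dividing by n and once rescaled to the n − 1 divisor.

```python
from math import sqrt

marks       = [8, 13, 16, 19, 21, 25]
frequencies = [3, 4, 7, 9, 8, 6]

n       = sum(frequencies)                                     # sum of f
sum_fx  = sum(f * x for x, f in zip(marks, frequencies))       # sum of f*x
sum_fx2 = sum(f * x * x for x, f in zip(marks, frequencies))   # sum of f*x^2

mean_marks = sum_fx / n
var_n  = sum_fx2 / n - mean_marks ** 2   # divisor n (population form)
var_n1 = var_n * n / (n - 1)             # divisor n - 1 (sample form)
sd_n1  = sqrt(var_n1)

print(n, sum_fx, sum_fx2)                  # 37 677 13187
print(round(mean_marks, 3))                # 18.297
print(round(var_n, 2), round(var_n1, 2))   # 21.61 22.21
print(round(sd_n1, 2))                     # 4.71
```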
Standard Deviation
◼ Most useful and sophisticated measure of variability.
◼ Describes the scatter of scores around the mean.
◼ Is a more stable measure of variability than the range
or quartile deviation because it depends on the weight
of each score in the distribution.
◼ Lowercase Greek letter sigma (σ) is used to indicate
the standard deviation of a population; letter s is used
to indicate the standard deviation of a sample.
◼ Since you generally will be working with small samples,
the formula for determining the standard deviation will
include (N - 1) rather than N.
Characteristics of Standard Deviation

1) Is the square root of the variance, which is the
average of the squared deviations from the mean.
Population variance is represented as σ² and the
sample variance is represented as s².
2) Is applicable to interval and ratio data, includes all
scores, and is the most reliable measure of
variability.
3) Is used with the mean. In a normal distribution, one
standard deviation added to the mean and one
standard deviation subtracted from the mean includes
the middle 68.26% of the scores.
Characteristics of Standard Deviation
4. With most data, a relatively small standard deviation
indicates that the group being tested has little variability
(performed homogeneously). A relatively large standard
deviation indicates the group has much variability (performed
heterogeneously).
5. Is used to perform other statistical calculations.
Symbols used to determine the standard deviation:
s = standard deviation    X = individual score
X̄ = mean                  N = number of scores
Ʃ = sum of
d = deviation score (X - X̄)
Distribution theory and
applications
Sample & Population Variance
Marks 8 13 16 19 21 25
Frequency 3 4 7 9 8 6

◼ Suppose the above example is considered a
sample (grouped data). Then, dividing by n - 1:
Sample Variance = 22.21

◼ Suppose the above example is considered as
the population (grouped data). Then, dividing by n:
Population Variance = 21.61


Interpretation and Uses of the Standard
Deviation
◼ In most distributions the vast majority of the data lies
within three standard deviations of the mean.
◼ Chebyshev's (Tchebysheff's) theorem: For any set of observations,
x1, x2, . . ., xn, at least 1 − 1/k² of the values will lie
within k standard deviations of the mean, where k ≥ 1.
◼ Empirical Rule: For any symmetrical, bell-shaped
distribution;
◼ Approximately 68% of the observations will lie within ±1σ of
the mean (μ);
◼ Approximately 95% of the observations will lie within ±2σ of
the mean (μ);
◼ Approximately 99.7% within ±3σ of the mean (μ)
◼ The coefficient of variation is the ratio of the standard
deviation to the arithmetic mean, expressed as a percentage:
◼ CV = (Standard deviation /Mean) ×100%
◼ This coefficient can be used to compare the variation of two
sets of data measured in different units (e.g. miles and meters).
The smaller the CV, the less the data vary relative to their mean.
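The sketch below puts numbers on the three ideas above: the Chebyshev lower bound, the share of a normal distribution within k standard deviations (the basis of the Empirical Rule), and the coefficient of variation. The function names and the CV example values are illustrative assumptions; NormalDist comes from Python's standard statistics module.

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations (any data set)."""
    return 1 - 1 / k ** 2

def normal_share(k):
    """Share of a normal distribution within k standard deviations."""
    z = NormalDist()               # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean

for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 4), round(normal_share(k), 4))
# 1 0.0 0.6827
# 2 0.75 0.9545
# 3 0.8889 0.9973

# Hypothetical data set with standard deviation 5 and mean 50.
print(coefficient_of_variation(5, 50))   # 10.0
```

The Chebyshev bound holds for any data, so it is much weaker than the normal-distribution figures.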
Probability distributions

◼ Using the earlier example, suppose we now
divide each frequency value by n = Ʃf, and
proceed to get the mean and the variance in
the same way. The corresponding distribution
table is

Marks                  8        13       16       19       21       25
Frequency (f)          3        4        7        9        8        6
Probability p = f/Ʃf   0.0811   0.1081   0.1892   0.2432   0.2162   0.1622
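A small Python sketch of the scaling step, dividing each frequency by the total. Variable names are illustrative.

```python
marks       = [8, 13, 16, 19, 21, 25]
frequencies = [3, 4, 7, 9, 8, 6]

n = sum(frequencies)   # total number of students, 37
probabilities = {x: f / n for x, f in zip(marks, frequencies)}

for x, p in probabilities.items():
    print(x, round(p, 4))

# The scaled frequencies always add up to 1.
print(round(sum(probabilities.values()), 10))   # 1.0
```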
Probability Cont…
◼ The scaled frequencies are no longer frequencies but
probabilities.
◼ A probability is a numerical measure of
how likely an event is to occur.
◼ The idea is to assign a value between 0 and 1 to an event
such that the magnitude gives the likelihood of the
event occurring.
◼ A value close to 0: Event unlikely (Value of zero implies
impossible event)
◼ A value close to 1: Event very likely
◼ Close to 1/2: may or may not occur (50:50 chance of
occurring)

◼ From the table it is clear that a randomly picked
student from this class is most likely to have obtained 19
marks, because that value corresponds to the highest
probability of 0.2432.
◼ Notice also that the sum of all probabilities is 1. This
table would actually be referred to as a probability
distribution table for the marks (X) which is taken to be
a random variable. From this table we can answer the
following questions;

◼ What is the probability that a randomly picked student
from this class scored:
◼ More than 19 marks?
◼ Solution: This can be written mathematically as
follows
P(X > 19) = P (X = 21)+P(X = 25)
= 0.2162+0.1622
= 0.3784

◼ What is the probability that a randomly picked student
from this class scored:
◼ Less than 17.5 marks?
◼ Solution: This can be written mathematically as
follows
P(X < 17.5) = P(X = 8)+P(X = 13)+P(X = 16)
= 0.0811+0.1081+0.1892
= 0.3784

◼ What is the probability that a randomly picked student
from this class scored:
◼ between 13 and 20 marks exclusive?
◼ Solution: This can be written mathematically as
follows
P(13 < X < 20) = P(X = 16)+P(X = 19)
= 0.1892+0.2432
= 0.4324
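The three questions above all follow the same pattern: add up P(X = x) over the marks that satisfy a condition. A minimal Python sketch, with the helper prob as an illustrative name and exact fractions used to avoid rounding drift:

```python
from fractions import Fraction

marks       = [8, 13, 16, 19, 21, 25]
frequencies = [3, 4, 7, 9, 8, 6]
n = sum(frequencies)
dist = {x: Fraction(f, n) for x, f in zip(marks, frequencies)}

def prob(condition):
    """Add up P(X = x) over every mark x that satisfies the condition."""
    return sum((p for x, p in dist.items() if condition(x)), Fraction(0))

p_more_than_19   = prob(lambda x: x > 19)
p_less_than_17_5 = prob(lambda x: x < 17.5)
p_between_13_20  = prob(lambda x: 13 < x < 20)

print(round(float(p_more_than_19), 4))     # 0.3784
print(round(float(p_less_than_17_5), 4))   # 0.3784
print(round(float(p_between_13_20), 4))    # 0.4324
```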

◼ What is the expected mark (this is actually the
mean)?
◼ Solution: We will follow the same procedure as for getting
the mean
◼ Sum of probabilities = Ʃp = 1
◼ Sum of values, Ʃxp = 8×0.0811 + 13×0.1081 + . . . + 25×0.1622
◼ Expected value (x̅) = Ʃxp / Ʃp = Ʃxp = 18.2973
◼ Notice that the denominator is no longer necessary in
the formula, because Ʃp = 1.
Probability cont…
◼ Note: This is the same answer you obtained while working with
frequencies
◼ If P(X = x) = p(x) is the probability distribution of a random
variable X, then
0 ≤ p(x) ≤ 1 for every x
and Ʃ p(x) = 1, summing over all x

◼ The expected value, denoted by E(X), is just the mean of the
realizations x of X (uppercase denotes the variable and lowercase
denotes a realized/sample value). If p(x1), p(x2), . . ., p(xn) are the
probabilities of the various outcomes of X, then the mean of X or
expected value of X is given by
◼ E(X) = x1p(x1) + x2p(x2) + · · · + xnp(xn) = Ʃ xp(x), summing over all x
Probability cont…
◼ To get the standard deviation we take the
square root of the variance.
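For a random variable, the variance can be computed from expected values with the standard identity Var(X) = E(X²) − [E(X)]². The Python sketch below applies it to the marks distribution; the variable names are illustrative.

```python
from math import sqrt

marks       = [8, 13, 16, 19, 21, 25]
frequencies = [3, 4, 7, 9, 8, 6]
n = sum(frequencies)
probabilities = [f / n for f in frequencies]

expected    = sum(x * p for x, p in zip(marks, probabilities))       # E(X)
expected_sq = sum(x * x * p for x, p in zip(marks, probabilities))   # E(X^2)
variance    = expected_sq - expected ** 2    # Var(X) = E(X^2) - [E(X)]^2
std_dev     = sqrt(variance)

print(round(expected, 4))   # 18.2973
print(round(variance, 2))   # 21.61
print(round(std_dev, 2))    # 4.65
```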
Exercise 1

◼ The table below gives a probability
distribution of a discrete random variable X.
Given that P(X < 150) = 0.6, find the values
of a and b, and hence calculate E(X) and the
standard deviation of X.

X         40   80     120    150   200
P(X=x)    a    0.20   0.23   b     0.15


Solutions
X         40   80     120    150   200
P(X=x)    a    0.20   0.23   b     0.15
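One way to work the exercise, sketched in Python with illustrative variable names. It uses two facts: P(X < 150) = a + 0.20 + 0.23 = 0.6, and all the probabilities must add up to 1. Exact fractions are used so the results are not affected by floating-point rounding.

```python
from fractions import Fraction
from math import sqrt

xs = [40, 80, 120, 150, 200]
known = {80: Fraction("0.20"), 120: Fraction("0.23"), 200: Fraction("0.15")}

# P(X < 150) = a + 0.20 + 0.23 = 0.6, so a is the remainder.
a = Fraction("0.6") - known[80] - known[120]
# All probabilities sum to 1, so b is whatever is left over.
b = 1 - a - sum(known.values())

probs = {40: a, 150: b, **known}
e_x  = sum(x * probs[x] for x in xs)        # E(X)
e_x2 = sum(x * x * probs[x] for x in xs)    # E(X^2)
variance = e_x2 - e_x ** 2                  # Var(X) = E(X^2) - [E(X)]^2
sd = sqrt(variance)

print(float(a), float(b))    # 0.17 0.25
print(float(e_x))            # 117.9
print(float(variance))       # 2588.59
print(round(sd, 2))          # 50.88
```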


Exercise 2
1. Find the mean and the variance of the
following data
50, 64, 50, 40, 104, 36, 80, 72

X        300      450      600      750      900     1050     1200     1350
P(X=x)   0.0986   0.1127   0.1268   0.1549   0.169   0.1268   0.1127   0.0986

2. From the table above, compute the following:
◼ P(600 < X < 1050); E(X); E(X²); variance of X; standard deviation S
