Lecture 1
Introduction to Statistics
Outline
◼ Introduction
◼ Numerical summaries
◼ Probability
◼ Distribution theory and applications
Introduction
What is Statistics
◼ Collection of data
◼ Presentation of data
◼ Analysis of data
◼ Interpretation of analyzed data
Why a Manager Needs to Know
About Statistics
You can say that if 100 similar surveys were done, 95 of them would
show a difference (that is, only 5 out of 100 surveys would be expected
NOT to differ).
Operational Definitions
Data
◼ Categorical (Qualitative)
◼ Numerical (Quantitative): Discrete or Continuous
Classification of Data cont…
1) Qualitative: variables whose values fall into
groups or categories. They are also called categorical
variables because the data they carry describe categories
(e.g. marital status, gender, religious affiliation, type of
car owned). They can further be classified as:
(a) Nominal variables: variables whose categories are
just names with no 'natural ordering', e.g. gender, color,
district, marital status, etc.
(b) Ordinal variables: variables whose categories have
a 'natural ordering', e.g. education level, degree
classifications, etc. In a variable such as performance, the
category 'Excellent' is better than the category 'Very good',
which is better than 'Good'.
Classification of Data cont…
2) Quantitative: numerical variables (e.g. number of
students, age, weight, distance, etc.). They can further be
classified as:
(a) Discrete variables: can only assume certain values,
and there are usually gaps between the values, e.g.
the number of bedrooms in a house, the number of
children in a family, etc. In most cases they arise from
counting.
(b) Continuous variables: can assume any
value within a specific range, e.g. the time to cook ugali,
the height of a tree, your age, etc. In most cases, such data
arise from measurements.
Data Sources
◼ Print or electronic
◼ Observation
◼ Survey
◼ Experimentation
Types of Statistics
◼ Descriptive Statistics: a field that focuses on describing
different characteristics of the data rather than trying to infer
something from it. It is a body of methods for organizing,
summarizing, and presenting sample data in an informative
way. E.g. if voting were held today, 60% of the
electorate would not vote for their current MP. This describes
the number of respondents out of every 100 persons who
were interviewed.
◼ Inferential Statistics: a body of methods that tries to infer
or reach conclusions about the population based on
scientifically sampled data. The summaries calculated from
the sample are used for estimation, prediction, or
generalization about the population from which the sample was
taken. E.g. the JKUAT accounting department normally
selects a sample of the payment vouchers to check the
accuracy of all the payment vouchers.
Population & Sample
◼ Finite populations: A sample provides information
about a population when it is too difficult or
expensive to make measurements from the whole
population.
◼ We often want to find information about a
particular group of individuals (people, fields, trees,
bottles of beer or some other collection of items).
This target group is called the population.
◼ Collecting measurements from every item in the
population is called a census. A census is rarely
feasible, because of the cost and time involved.
Population & Sample cont…
◼ Simple random sample: We can usually obtain
sufficiently accurate information by only collecting
information from a selection of units from the
population - a sample. Although a sample gives less
accurate information than a census, the savings in cost
and time often outweigh this.
◼ The simplest way to select a representative sample is a
simple random sample. In it, each unit has the same
chance of being selected and some random mechanism
is used to determine whether any particular unit is
included in the sample.
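The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the voucher numbers, the sample size of 20, and the helper name `simple_random_sample` are assumptions made for the example, not part of the lecture.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)          # the "random mechanism"
    return rng.sample(list(population), n)

# Hypothetical population: payment vouchers numbered 1 to 500.
vouchers = range(1, 501)
sample = simple_random_sample(vouchers, 20, seed=1)
print(sorted(sample))
```

Passing a seed only makes the draw repeatable for teaching purposes; in practice the seed would normally be left unset.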
Effect of sample size
◼ Bigger samples mean more stable and reliable
information about the underlying population.
◼ As the sample size is increased, the sampling
error becomes smaller.
◼ When a sample is used to estimate a population
characteristic, an error is usually involved.
Sampling error is caused by random selection of
the sample from the population.
◼ The difference between an estimate and the
population value being estimated is called its
sampling error.
Illustration of Sampling Error
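A small simulation can make the effect of sample size concrete. The sketch below assumes a made-up population of 10,000 values (normal, mean 50, standard deviation 10) and compares the average absolute sampling error of the sample mean for samples of 10 and of 1,000 units. The function name and all numbers are illustrative assumptions.

```python
import random
import statistics

def mean_abs_sampling_error(population, n, repeats=500, seed=0):
    """Average |sample mean - population mean| over repeated random samples."""
    rng = random.Random(seed)
    mu = statistics.fmean(population)
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)   # simple random sample of size n
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

# Hypothetical population of 10,000 measurements.
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]

small = mean_abs_sampling_error(population, 10)
large = mean_abs_sampling_error(population, 1000)
print(f"n=10: {small:.3f}   n=1000: {large:.3f}")
```

The larger sample should give a clearly smaller average error, in line with the bullets above.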
Numerical Summaries and
Probability Distributions
Measures of Central Tendency
◼ Descriptive statistics
◼ Describe the middle characteristics of the
data (distribution of scores); represent scores
in a distribution around which other scores
seem to center
◼ Most widely used statistics
◼ mean, median, and mode
Mean
The arithmetic average of a distribution of scores; most
generally used measure of central tendency.
Characteristics
◼ Most sensitive of all measures of central tendency
Mode
The score that occurs most frequently in a distribution.
Characteristics
Least used measure of central tendency.
Not used for additional statistics.
Not affected by extreme scores.
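To see how the three measures react to an extreme score, the sketch below uses Python's standard `statistics` module on a small set of scores invented for illustration, then adds one extreme value.

```python
import statistics

# Hypothetical scores, then the same scores with one extreme value added.
scores = [2, 3, 3, 4, 5]
with_outlier = scores + [40]

mean_before = statistics.mean(scores)           # 17 / 5
mean_after = statistics.mean(with_outlier)      # 57 / 6
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)
mode_before = statistics.mode(scores)
mode_after = statistics.mode(with_outlier)

print("mean  ", mean_before, "->", mean_after)
print("median", median_before, "->", median_after)
print("mode  ", mode_before, "->", mode_after)
```

In this example the mean moves from 3.4 to 9.5, the median only from 3 to 3.5, and the mode stays at 3.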
Example
Range
Characteristics
1. Dependent on the two extreme scores.
2. Least useful measure of variability.
Formula: R = Hx - Lx, where Hx is the highest score and Lx is the lowest score.
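As a quick check of the formula, the sketch below computes R for a small set of invented scores and shows how a single extreme score changes it. The helper name `score_range` is an assumption for the example.

```python
def score_range(scores):
    """R = Hx - Lx: highest score minus lowest score."""
    return max(scores) - min(scores)

# Hypothetical scores; the second list adds one extreme score.
r_plain = score_range([2, 3, 3, 4, 5])
r_extreme = score_range([2, 3, 3, 4, 5, 40])
print(r_plain, r_extreme)
```

Here R goes from 3 to 38 because of one score, which is why the range depends entirely on the two extremes.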
Quartile Deviation
Sometimes called the semi-interquartile range; it is the spread of the
middle 50% of the scores around the median. Extreme
scores will not affect the quartile deviation.
Characteristics
1. Uses the 75th and 25th percentiles; the difference between
these two percentiles is referred to as the interquartile
range.
2. Indicates the amount that needs to be added to, and
subtracted from, the median to include the middle
50% of the scores.
3. Usually not used in additional statistical calculations.
Quartile Deviation
Symbols
Q = quartile deviation
Q1 = 25th percentile or first quartile (P25) =
score below which 25% of scores fall and
above which 75% of scores fall
Q3 = 75th percentile or third quartile (P75) =
score below which 75% of scores fall and
above which 25% of scores fall
Steps for Calculation of Q3
1. Arrange scores in ascending order.
2. Multiply N by .75 to find 75% of the distribution.
3. Count up from the bottom score to the number
determined in step 2.
Approximation and interpolation
may be required.
Steps for Calculation of Q1
1. Multiply N by .25 to find 25% of the distribution.
2. Count up from the bottom score to the number
determined in step 1.
To Calculate Q
Substitute values in formula: Q = (Q3 - Q1) / 2
(Q3 - Q1 alone is the interquartile range; Q is half of it.)
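The calculation can be sketched with Python's standard `statistics.quantiles`. The scores are invented for illustration, and the `inclusive` interpolation method is just one of several quartile conventions, so a hand count following the steps above may differ slightly on other data.

```python
import statistics

# Hypothetical scores, already in ascending order.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1      # interquartile range
qd = iqr / 2       # quartile deviation (semi-interquartile range)

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, Q={qd}")
```

For these scores Q1 = 3, Q3 = 7 and Q = 2, so the median 5 plus or minus 2 spans the middle 50% of the scores.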
Quartiles
Q1 = 25%
Q2 = 50%
Q3 = 75%
Q4 = 100%
Q2 - Q1 = range of scores below median
Q3 - Q2 = range of scores above median
Example
◼ For samples, the sum of squared deviations from the mean is divided by n-1 to correct for sample bias
◼ Variance depends on the calculated sample mean, therefore
we have one constraint, hence the degrees of freedom are n-1
Variance Continued
◼ Consider the set of values 3, 8, and 4, whose
mean is 5.
◼ S² = Σ(x - x̄)² / n
◼ = {(3-5)² + (8-5)² + (4-5)²}/3 = 14/3 ≈ 4.67
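The worked example divides by n. The sketch below reproduces that figure and also shows the n-1 version used for samples, so the two can be compared on the same three values. The function name `variance` and its `sample` flag are assumptions made for this example.

```python
def variance(values, sample=False):
    """Sum of squared deviations from the mean, divided by n (or n-1 for samples)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return squared_deviations / divisor

values = [3, 8, 4]
var_n = variance(values)                       # divides by n = 3
var_n_minus_1 = variance(values, sample=True)  # divides by n - 1 = 2

print(round(var_n, 2), var_n_minus_1)
```

Dividing by n gives 14/3, about 4.67, as in the example; dividing by n-1 gives 14/2 = 7.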
Frequency 3 4 7 9 8 6
X; S