
Statistics

Statistics is concerned with
scientific methods for collecting,
organizing, summarizing,
presenting, and analyzing data, as
well as drawing valid conclusions
and making reasonable decisions on
the basis of such analysis.
Statistics

(From SRTC) The term statistics has both a plural and a
singular sense. In its plural sense, the word statistics
refers to numerical facts that are systematically collected
and analyzed. For instance, readers of the business section
of a newspaper would think of statistics as the consumer
price index, the returns of a particular stock, the peso to
dollar exchange rate, etc. In its singular sense, the word
statistics refers to the scientific discipline consisting of
theory and methods for processing numerical information
that one can use when making decisions in the face of
uncertainty.
Universe, Variable, Population, and
Sample
The universe is the set of all entities under
study. A variable is an attribute of interest
that can be observed for each entity in the
universe. The population is the set of all
possible values of the variable, while a
sample is a subset of the population.
A parameter is a property descriptive of the
population. Other authors define a
parameter as a numerical measurement
describing some characteristic of a
population. The term estimate refers to a
property of a sample drawn at random from
a population. The sample value is
presumed to be an estimate of the
corresponding population parameter.
Types of Data and Measurement Scales

Data
  Nonmetric or Qualitative: Nominal Scale, Ordinal Scale
  Metric or Quantitative: Interval Scale, Ratio Scale

Copyright © 2010 Pearson Education, Inc., publishing as Prentice-Hall.
Measurement Scales

Nonmetric
  Nominal – size of number is not related to the amount of the
  characteristic being measured; the categories are unordered.
  Ordinal – larger numbers indicate more (or less) of the
  characteristic measured, but not how much more (or less).
Metric
  Interval – contains ordinal properties, and in addition, there
  are equal differences between scale points; no inherent
  starting point.
  Ratio – contains interval scale properties, and in addition,
  there is a natural zero point.

The Mean (Arithmetic Average)
The mean is defined to be the sum of
the data values divided by the total
number of values.
We will compute two means: one for
the sample and one for a finite
population of values.
The Mean (Arithmetic Average)
The mean, in most cases, is not an
actual data value.
The Sample Mean

The symbol \( \bar{X} \) represents the sample mean.
\( \bar{X} \) is read as "X-bar". The Greek symbol
\( \Sigma \) is read as "sigma" and it means "to sum".

\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum X}{n} \]
The Mean (Arithmetic Average)
The number of hours per day spent by a random sample of ten factory workers in
assembling a certain product was recorded as follows: 5, 8, 4, 2, 2,
2, 2, 5, 3, and 4. Find the arithmetic mean.

Solution:

\[ \bar{x} = \frac{\sum x}{n} = \frac{5 + 8 + 4 + 2 + 2 + 2 + 2 + 5 + 3 + 4}{10} = \frac{37}{10} = 3.7 \text{ hours} \]

This result shows that, on the average, the 10 factory workers spent 3.7 hours
a day assembling the product.
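The same computation can be sketched in a few lines of Python. This is only an illustration of "sum of the values divided by the number of values"; the variable names `hours` and `sample_mean` are made up for this example.

```python
# Hours spent by the ten factory workers (data from the example above).
hours = [5, 8, 4, 2, 2, 2, 2, 5, 3, 4]

# Sample mean: sum of the data values divided by the number of values.
sample_mean = sum(hours) / len(hours)

print(sum(hours))    # 37
print(sample_mean)   # 3.7
```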
The Population Mean

The Greek symbol \( \mu \) represents the population
mean. The symbol \( \mu \) is read as "mu".
\( N \) is the size of the finite population.

\[ \mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum X}{N} \]
The Population Mean - Example

A small company consists of the owner, the manager,
the salesperson, and two technicians. The salaries are
listed as $50,000, $20,000, $12,000, $9,000, and $9,000,
respectively. (Assume this is the population.)
Then the population mean will be

\[ \mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \$20{,}000. \]
The Median
When a data set is ordered, it is called a data
array.

The median is defined to be the midpoint of
the data array.

The symbol used to denote the median is MD.
The Median - Example
The prices (in pesos) of a certain brand
of cereal are 180, 201, 220, 191, 219,
209, and 186. Find the median.
Arrange the data in order and select
the middle point.
The Median - Example
Data array: 180, 186, 191, 201, 209,
219, 220.
The median, MD = 201.
The Median
In the previous example, there was an
odd number of values in the data set.
In this case it is easy to select the
middle number in the data array.
The Median
When there is an even number of
values in the data set, the median is
obtained by taking the average of the
two middle numbers.
The Median - Example
Six customers purchased the following
number of magazines: 1, 7, 3, 2, 3, 4.
Find the median.
Arrange the data in order and compute the
middle point.
Data array: 1, 2, 3, 3, 4, 7.
The median, MD = (3 + 3)/2 = 3.
The Median - Example
The daily wages (in US dollars) of ten
employees are: 18, 24, 20, 35,
19, 23, 26, 23, 19, 20. Find the
median.
Arrange the data in order and compute
the middle point.
The Median - Example
Data array: 18, 19, 19, 20, 20, 23, 23,
24, 26, 35.
The median,
MD = (20 + 23)/2 = 21.5.
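The odd and even cases can be sketched together in Python. The helper name `median_from_array` is hypothetical; it simply sorts the data into a data array and picks the middle value, or the average of the two middle values, as described in the slides.

```python
def median_from_array(values):
    """Median via the ordered data array (illustrative helper)."""
    data_array = sorted(values)          # arrange the data in order
    n = len(data_array)
    mid = n // 2
    if n % 2 == 1:                       # odd count: the single middle value
        return data_array[mid]
    # even count: average of the two middle values
    return (data_array[mid - 1] + data_array[mid]) / 2

prices = [180, 201, 220, 191, 219, 209, 186]
magazines = [1, 7, 3, 2, 3, 4]
wages = [18, 24, 20, 35, 19, 23, 26, 23, 19, 20]

print(median_from_array(prices))     # 201
print(median_from_array(magazines))  # 3.0
print(median_from_array(wages))      # 21.5
```

Python's standard library also offers `statistics.median`, which gives the same results for these three data sets.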
The Mode
The mode is defined to be the value that
occurs most often in a data set.

A data set can have more than one mode.

A data set is said to have no mode if all
values occur with equal frequency.
The Mode - Example
The following data represent the duration
(in days) of the delivery of raw materials
to a manufacturing plant. Find the mode.
Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8,
10, 14, 11, 8, 14, 11.
Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9,
10, 10, 11, 11, 14, 14, 14. Mode = 8.
The Mode - Example
Six employees in a fast food restaurant were
tested on their customer attending time. The time,
in minutes, is given below. Find the mode.

Data set: 2, 3, 5, 7, 8, 10.

There is no mode since each data value occurs
equally with a frequency of one.
The Mode - Example
Eleven different automobiles were tested at a
speed of 15 mph for stopping distances. The
distance, in feet, is given below. Find the
mode.
Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26,
26.
There are two modes (bimodal). The values
are 18 and 24. Why? Each occurs three times,
more often than any other value.
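The three mode examples can be checked with a short Python sketch. The function name `modes` is hypothetical, and returning an empty list when every value occurs equally often is an assumption made here to mirror the "no mode" convention used in these slides (Python's own `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention from the slides: if all values occur with equal
    # frequency, the data set is said to have no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

delivery_days = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
attending_times = [2, 3, 5, 7, 8, 10]
stopping_distances = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]

print(modes(delivery_days))       # [8]
print(modes(attending_times))     # []
print(modes(stopping_distances))  # [18, 24]
```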
Distribution Shapes
Frequency distributions can assume
many shapes.
The three most important shapes are
positively skewed, symmetrical, and
negatively skewed.
Skewness
Skewness is a measure of symmetry, or
more precisely, the lack of symmetry.
A symmetric distribution looks the same to
the left and right of the center point.
An asymmetric distribution is either
negatively skewed or positively skewed.
Skewness
The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.

Negative values for the skewness indicate data that are skewed
left and positive values for the skewness indicate data that are
skewed right.

By skewed left, we mean that the left tail is long relative to the
right tail. Similarly, skewed right means the right tail is long
relative to the left tail. If the data are multimodal, then this
may affect the sign of the skewness.
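Positively Skewed
[Figure: positively skewed distribution curve, X and Y axes]
Mode < Median < Mean
Symmetrical
[Figure: symmetrical distribution curve, X and Y axes]
Mean = Median = Mode
Negatively Skewed
[Figure: negatively skewed distribution curve, X axis]
Mean < Median < Mode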
Formula for Solving Skewness

Formula for skewness:

\[ g_1 = \frac{\sum (y - \bar{y})^3 / n}{s^3} \]

Or

\[ \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3} \]

Excel calculates the skewness of a sample using the second formula.
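Both skewness formulas can be sketched in Python as follows. The function names are hypothetical. One assumption is labeled in the code: for the first formula, s is taken with divisor n, which is a common convention but is not stated in the slides; for the second (Excel-style) formula, s is the sample standard deviation with divisor n - 1.

```python
from math import sqrt

def skewness_g1(values):
    """First formula: (sum of cubed deviations / n) / s^3.
    Assumption: s is computed with divisor n."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / n)
    return (sum((v - m) ** 3 for v in values) / n) / s ** 3

def skewness_excel(values):
    """Second formula: n * sum of cubed deviations / ((n-1)(n-2) s^3),
    with s the sample standard deviation (divisor n - 1)."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return n * sum((v - m) ** 3 for v in values) / ((n - 1) * (n - 2) * s ** 3)

# Hours data from the mean example: a long right tail (the value 8),
# so both measures come out positive.
hours = [5, 8, 4, 2, 2, 2, 2, 5, 3, 4]
print(skewness_g1(hours) > 0, skewness_excel(hours) > 0)  # True True
```

Under these conventions the two values differ only by a factor that depends on n, so they always agree in sign.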
Skewness and Kurtosis
Kurtosis is the sharpness of the peak of a frequency distribution
curve (i.e., mesokurtic, platykurtic, or leptokurtic).

It is the measure of the peak of a distribution and indicates how
high the distribution is around the mean.

It is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.

Distributions of data and probability distributions are not all the
same shape.
Skewness and Kurtosis
The peak of a mesokurtic distribution is neither high
nor low; rather, it is considered to be a baseline for
the two other classifications.

A platykurtic distribution has a peak that is lower than
that of a mesokurtic one; platy means broad.

A leptokurtic distribution has a peak that is thin and
tall; lepto means skinny.
Formula for Kurtosis

Excel calculates the kurtosis of a sample S as follows:

\[ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4} - \frac{3(n-1)^2}{(n-2)(n-3)} \]

This formula requires that \( n > 3 \).
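The Excel-style kurtosis formula can be sketched in Python as below. The function name `kurtosis_excel` is hypothetical, and s is assumed to be the sample standard deviation (divisor n - 1). The result is an excess kurtosis: values near zero suggest a roughly mesokurtic shape, positive values heavier tails, and negative values lighter tails.

```python
from math import sqrt

def kurtosis_excel(values):
    """Sample excess kurtosis following the Excel-style formula."""
    n = len(values)
    if n <= 3:
        raise ValueError("the formula requires n > 3")
    m = sum(values) / n
    # Assumption: s is the sample standard deviation (divisor n - 1).
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    fourth = sum(((v - m) / s) ** 4 for v in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction

# Hours data from the mean example.
hours = [5, 8, 4, 2, 2, 2, 2, 5, 3, 4]
print(round(kurtosis_excel(hours), 4))
```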


Measures of Variation - Range
The range is defined to be the highest
value minus the lowest value. The
symbol R is used for the range.
R = highest value – lowest value.
Extremely large or extremely small data
values can drastically affect the range.
Measures of Variation - Range
Characteristics of the Range

1. Simple, easy to compute and easy-to-understand measure.

2. It uses only the extreme values. It fails to communicate any
information about the clustering or the lack of clustering of the
values between the extremes.

3. A weakness of the range is that an outlier can greatly alter its
value.

4. It cannot be approximated from open-ended frequency
distributions.
Measures of Variation - Range
5. It is unreliable when computed from a frequency
distribution table with gaps or zero frequencies.

6. It is not tractable mathematically.

7. It tends to be smaller in small samples than in large
samples.

8. It is used chiefly in production control and in reporting
stock prices, interest rates, etc.
Measures of Variation –
Population Variance

The variance is the average of the squares of the
distance each value is from the mean.
The symbol for the population variance is
\( \sigma^2 \) (\( \sigma \) is the Greek lowercase letter sigma).

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N}, \text{ where} \]

X = individual value
\( \mu \) = population mean
N = population size
Measures of Variation –
Population Standard Deviation

The standard deviation is the square
root of the variance.

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
Measures of Variation –
Population Standard Deviation
The standard deviation is the positive square root of the variance and
measures, on the average, the dispersion of each
observation from the mean.

Most important measure of variation

Shows variation about the mean

Has the same units as the original data

It is never negative (it is zero only when all values are equal)
Measures of Variation –
Population Standard Deviation
Remarks:
1. If there is a large amount of variation in the data
set, then on the average, the data values will be far
from the mean. Hence, the standard deviation will
be large.

2. If there is only a small amount of variation in the
data set, then on the average, the data values will be
close to the mean. Hence, the standard deviation
will be small.
Measures of Variation –
Standard Deviation
Advantages:

It is the most widely used measure of dispersion. It is
based on all the items and is rigidly defined.

It is of great significance for testing the reliability of
measures calculated from samples, the difference
between such measures, and in comparing the extent
of fluctuation in two or more samples.
Measures of Variation –
Standard Deviation
Disadvantages:

The standard deviation is sensitive to the
presence of extreme values.

It is not easy to calculate by hand.
Comparing Standard Deviations
[Figure: three dot plots, Data A, Data B, and Data C, each on an axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
Measures of Variation -
Example
Consider the following data to constitute
the population: 10, 60, 50, 30, 40, 20.
Find the mean and variance.
The mean \( \mu \) = (10 + 60 + 50 + 30 + 40 +
20)/6 = 210/6 = 35.
The variance \( \sigma^2 \) = 1750/6 = 291.67.
See next slide for computations.
Measures of Variation -
Example

  X    X - μ    (X - μ)²
 10     -25       625
 60     +25       625
 50     +15       225
 30      -5        25
 40      +5        25
 20     -15       225
210               1750   (column totals)
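The computation table can be reproduced with a short Python sketch. The variable names are made up for this illustration; the data are treated as a whole population, so the divisor is N.

```python
population = [10, 60, 50, 30, 40, 20]

N = len(population)
mu = sum(population) / N                      # population mean

# One table row per value: X, X - mu, (X - mu)^2
rows = [(x, x - mu, (x - mu) ** 2) for x in population]
for x, dev, sq in rows:
    print(x, dev, sq)

sum_sq = sum(sq for _, _, sq in rows)         # total of squared deviations
variance = sum_sq / N                         # population variance
std_dev = variance ** 0.5                     # population standard deviation

print(mu)                   # 35.0
print(sum_sq)               # 1750.0
print(round(variance, 2))   # 291.67
print(round(std_dev, 2))    # 17.08
```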
Measures of Variation – Sample
Variance
The sample variance is an unbiased estimator of the
population variance: a statistic whose expected value
equals the population variance. It is denoted by \( s^2 \), where

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \text{ and} \]

\( \bar{x} \) = sample mean
n = sample size
Measures of Variation – Sample
Standard Deviation
The sample standard deviation is the
square root of the sample variance.

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
Shortcut Formula for the
Sample Variance and the
Standard Deviation

\[ s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} \]

\[ s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}} \]
Sample Variance - Example
Find the variance and standard
deviation for the following sample: 16,
19, 15, 15, 14.
\( \sum X \) = 16 + 19 + 15 + 15 + 14 = 79.
\( \sum X^2 \) = 16² + 19² + 15² + 15² + 14²
= 1263.
Sample Variance - Example

\[ s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} = \frac{1263 - (79)^2 / 5}{4} = 3.7 \]

\[ s = \sqrt{3.7} \approx 1.9 \]
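As a check on the arithmetic, the shortcut formula can be sketched in Python and compared with the definitional formula (sum of squared deviations from the sample mean, divided by n - 1). Variable names are illustrative only.

```python
from math import sqrt

sample = [16, 19, 15, 15, 14]
n = len(sample)

sum_x = sum(sample)                    # 79
sum_x2 = sum(x * x for x in sample)    # 1263

# Shortcut formula: (sum of X^2 - (sum of X)^2 / n) / (n - 1)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definitional formula, for comparison
x_bar = sum_x / n
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

s = sqrt(s2_shortcut)

print(round(s2_shortcut, 2))    # 3.7
print(round(s2_definition, 2))  # 3.7
print(round(s, 2))              # 1.92
```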
Standard Error of the Mean

The standard error of the mean estimates the variability
between sample means that you would obtain if you took
multiple samples from the same population.

The standard error of the mean estimates the variability
between samples, whereas the standard deviation
measures the variability within a single sample.
Standard Error of the Mean
For example, you have a mean delivery time of 3.80 days
with a standard deviation of 1.43 days based on a random
sample of 312 delivery times.

The numbers yield a standard error of the mean of 0.08
days (1.43 divided by the square root of 312).

Had you taken multiple random samples of the same size
from the same population, the standard deviation of those
different sample means would be around 0.08 days.
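The delivery-time figure can be reproduced with a small Python sketch. The function name `standard_error` is hypothetical; it divides the sample standard deviation by the square root of the sample size, as in the example above.

```python
from math import sqrt

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return std_dev / sqrt(n)

se = standard_error(1.43, 312)
print(round(se, 2))   # 0.08

# A larger sample with the same standard deviation gives a smaller standard error.
print(standard_error(1.43, 1248) < se)   # True
```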
Standard Error of the Mean
Use the standard error of the mean to determine how
precisely the mean of the sample estimates the population
mean. Lower values of the standard error of the mean
indicate more precise estimates of the population mean.

Usually, a larger standard deviation will result in a larger
standard error of the mean and a less precise estimate.

A larger sample size will result in a smaller standard error of
the mean and a more precise estimate.
Standard Error of the Mean
(From Investopedia)

A standard error is the standard deviation of the
sampling distribution of a statistic.

In statistics, a sample mean deviates from the
actual mean of a population; the standard error
measures the typical size of this deviation.
Confidence Interval (Interval Estimate)
A confidence interval for a parameter is an interval
of numbers within which we expect the true value
of the population parameter to be contained.
How confident are we that the true population
average is in the shaded area? We are 95%
confident. This is the level of confidence.
How many standard errors away from the mean
must we go to be 95% confident? From –z to z there
is 95% of the normal curve.
Confidence Interval
The likelihood that our confidence interval will contain the
population parameter is called the confidence level.

For example, how confident are we that our confidence
interval of 23-28 years of age contains the mean age of our
population?

If this range of ages was calculated with a 95% confidence
level, we could say that we are 95% confident that the
mean age of our population is between 23 and 28 years.
General Format for a Confidence Interval

point estimate ± margin of error

The margin of error is a multiple of the standard error (SE), i.e., the standard
deviation of the sampling distribution.
In the case of the sample mean, the central limit theorem assures us that
there is approximately a 68% chance for the sample mean to be within one
standard error of its expected value, and about a 95% chance for the sample
mean to be within two standard errors of the population mean. Such
results enable us to attach approximately 68% confidence to covering the
population mean in an interval of the form:

\[ \text{sample mean} \pm \text{SE of the mean} = \bar{X} \pm \frac{\sigma}{\sqrt{n}} \]

And about 95% confidence to covering the population mean in an interval of
the form:

\[ \text{sample mean} \pm 2\,\text{SE of the mean} = \bar{X} \pm 2\frac{\sigma}{\sqrt{n}} \]

An interval estimate \( \bar{X} \pm 2\sigma/\sqrt{n} \) for the mean has an attached 95% confidence
level, in the sense that in about 19 out of 20 sampling experiments, we
would expect the resulting interval estimate to contain the true value of the
parameter. Thus, we also call this interval estimate the 95%
confidence interval for the mean.
Further Illustrations of Confidence
Interval
Confidence Interval Estimation of the Population Mean
Case 1: σ is unknown and n ≥ 30 (rather large sample)

\[ \bar{X} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} \]

where \( z_{\alpha/2} \) is the percentile of the standard normal distribution that has an area of \( \alpha/2 \) to
its right.
Example: The mean and standard deviation for the quality point indices (QPI) of
a random sample of 36 Ateneo college sophomores are 2.6 and 0.3, respectively.
Find the 95% and 99% confidence intervals for the mean of the entire population
batch.
Soln: The 95% confidence interval for the mean QPI is

\[ 2.6 \pm 1.96 \frac{0.3}{\sqrt{36}}, \]

or equivalently, from 2.50 to 2.70.

The 99% confidence interval for the mean QPI is

\[ 2.6 \pm 2.575 \frac{0.3}{\sqrt{36}}, \]

or equivalently, from 2.47 to 2.73.
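The QPI intervals can be reproduced with a short Python sketch. The function name `z_confidence_interval` is hypothetical. Instead of reading 1.96 and 2.575 from a table, the sketch obtains the z value from the standard normal distribution with `statistics.NormalDist` (Python 3.8 or later), so the critical values differ slightly in the later decimals.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, s, n, confidence):
    """Large-sample interval: sample mean +/- z * s / sqrt(n)."""
    alpha = 1 - confidence
    # z value with area alpha/2 to its right under the standard normal curve
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low95, high95 = z_confidence_interval(2.6, 0.3, 36, 0.95)
low99, high99 = z_confidence_interval(2.6, 0.3, 36, 0.99)

print(round(low95, 2), round(high95, 2))   # 2.5 2.7
print(round(low99, 2), round(high99, 2))   # 2.47 2.73
```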
Case 2: σ is unknown and n is less than 30, data distribution is normal (or
approximately normal)

\[ \bar{X} \pm T_{\alpha/2} \frac{s}{\sqrt{n}} \]

where \( T_{\alpha/2} \) is the percentile of the Student's T distribution with v = n-1 degrees
of freedom that has an area of \( \alpha/2 \) to its right.
Example: The contents of 8 similar bottles of acetic acid are 110, 112, 111, 109, 107,
113, 110, and 109 milliliters. Find a 95% confidence interval for the mean of
all such bottles, assuming an approximate normal distribution for the
population of the acetic acid contents.
Verify!

\[ 110.125 \pm 2.365 \frac{1.89}{\sqrt{8}}, \]

or equivalently, from 108.545 to 111.705.
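One way to carry out the "Verify!" step is the Python sketch below. The critical value 2.365 is taken from the example itself (a t-table value for 7 degrees of freedom) rather than computed, since the Python standard library has no Student's T quantile function. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

contents = [110, 112, 111, 109, 107, 113, 110, 109]   # milliliters

n = len(contents)
x_bar = mean(contents)    # sample mean
s = stdev(contents)       # sample standard deviation (divisor n - 1)

t_crit = 2.365            # assumed from a t table: 95% confidence, n - 1 = 7 d.f.
margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(x_bar)              # 110.125
print(round(s, 2))        # 1.89
print(round(lower, 2), round(upper, 2))   # 108.55 111.7
```

The example rounds s to 1.89 before computing the margin, so its endpoints (108.545 and 111.705) differ from the unrounded computation in the third decimal place.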
Measures of Position -
Percentiles
Percentiles divide the distribution into
100 groups.
The Pk percentile is defined to be that
numerical value such that at most k% of
the values are smaller than Pk and at
most (100 – k)% are larger than Pk in an
ordered data set.
Measures of Position -
Percentiles
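The percentile corresponding to a given
value (X) is computed by using the
formula:

\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\% \]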
Percentiles - Examples
A teacher gives a 20-point test to 10
students. Find the percentile rank of a score
of 12. Scores: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Ordered set: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Percentile = [(6 + 0.5)/10](100%) = 65th
percentile. Student did better than 65% of the
class.
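The percentile rank computation in this example can be sketched in Python. The function name `percentile_rank` is hypothetical; it follows the pattern used in the example, (number of values below X + 0.5) divided by the total number of values, times 100.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (count of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))   # 65.0
```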
Deciles and Quartiles
Deciles divide the data set into 10
groups.
Deciles are denoted by D1, D2, …, D9
with the corresponding percentiles
being P10, P20, …, P90
Quartiles divide the data set into 4
groups.
Deciles and Quartiles
Quartiles are denoted by Q1, Q2, and Q3
with the corresponding percentiles
being P25, P50, and P75.
The median is the same as P50 or Q2.
Outliers and the Interquartile
Range
An outlier is an extremely high or an
extremely low data value when
compared with the rest of the data
values.
The Interquartile Range,
IQR = Q3 – Q1.
Outliers and the Interquartile
Range
To determine whether a data value can be
considered as an outlier:
Step 1: Compute Q1 and Q3.
Step 2: Find the IQR = Q3 – Q1.
Step 3: Compute (1.5)(IQR).
Step 4: Compute Q1 – (1.5)(IQR) and
Q3 + (1.5)(IQR).
Outliers and the Interquartile
Range
To determine whether a data value can be
considered as an outlier:
Step 5: Compare the data value (say X)
with Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR).
If X < Q1 – (1.5)(IQR) or
if X > Q3 + (1.5)(IQR), then X is considered
an outlier.
Outliers and the Interquartile
Range - Example
Given the data set 5, 6, 12, 13, 15, 18, 22, 50,
can the value of 50 be considered as an outlier?
Q1 = 9, Q3 = 20, IQR = 11. Verify.
(1.5)(IQR) = (1.5)(11) = 16.5.
9 – 16.5 = – 7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range – 7.5 to
36.5, hence 50 is an outlier.
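Steps 1 to 5 can be sketched in Python as follows. The helper names are hypothetical. The quartile convention is an assumption chosen to reproduce Q1 = 9 and Q3 = 20 from the example: Q1 and Q3 are the medians of the lower and upper halves of the ordered data (the middle value is left out of both halves when the count is odd). Other conventions, including the default of `statistics.quantiles`, can give different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(upper)

def iqr_outliers(values):
    """Return (outliers, (lower fence, upper fence)) using the 1.5 * IQR rule."""
    q1, q3 = quartiles_by_halves(values)      # Step 1
    iqr = q3 - q1                             # Step 2
    step = 1.5 * iqr                          # Step 3
    low, high = q1 - step, q3 + step          # Step 4
    # Step 5: a value outside the fences is considered an outlier
    return [v for v in values if v < low or v > high], (low, high)

data = [5, 6, 12, 13, 15, 18, 22, 50]
print(quartiles_by_halves(data))   # (9.0, 20.0)
print(iqr_outliers(data))          # ([50], (-7.5, 36.5))
```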
Methods of Presenting Data

Textual Method – used when there are only a few
observations or pieces of information to be presented. We
simply describe the observations using words.
Tabular Method – used when many
observations have been collected. We present the data in
a so-called statistical table.
Graphical Method – the pictorial way of
presenting data.
