
DESCRIPTIVE STATISTICS
PART 2

- Measures of Location
- Measures of Variability
- Measures of Shape
MEASURES OF CENTRAL TENDENCY (location)
A measure of central tendency gives a single value that acts as a
representative average of the values of all the outcomes of your
experiment. Three parameters that measure the center of the
distribution are of interest: the population mean, the population
median and the population mode.

Central Tendency refers to the Middle of the Distribution
A. THE MEAN
For Ungrouped Data:
Let $x_1, x_2, x_3, \ldots, x_n$ be n observations of a random variable X. The sample mean,
denoted by $\bar{x}$, is the arithmetic average of these values. That is,

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$ for the population mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$ for the sample mean
For Grouped Data:

$$\mu \text{ or } \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

Where: $f_i$ is the frequency of class interval i
$x_i$ is the class midpoint of class interval i
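The grouped-data formula is a frequency-weighted average of the class midpoints. A minimal Python sketch follows; the function name `grouped_mean` is hypothetical, and the numbers are the frequencies and class marks from the final-examination table in Example 3 of this section.

```python
def grouped_mean(freqs, midpoints):
    """Return sum(f_i * x_i) / sum(f_i) for a frequency distribution."""
    if len(freqs) != len(midpoints):
        raise ValueError("freqs and midpoints must have the same length")
    total = sum(freqs)
    if total == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for f, x in zip(freqs, midpoints)) / total


# Frequencies and class marks taken from the Example 3 table.
exam_freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]
exam_midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]

exam_mean = grouped_mean(exam_freqs, exam_midpoints)
print(exam_mean)
```

The same function serves for a population or a sample, since the two formulas differ only in the symbol used for the result.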
B. THE MEDIAN
For Ungrouped Data:
Let $x_1, x_2, x_3, \ldots, x_n$ be sample observations arranged in order from smallest to largest.
The sample median for this collection is the middle observation if n is odd. If n is even,
the sample median is the average of the two middle observations.

$$\tilde{x} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2)+1}}{2} & \text{if } n \text{ is even} \end{cases}$$
For Grouped Data:
When the data are grouped into a frequency distribution, the median is
obtained by finding the cell that contains the middle observation and then
interpolating within the cell.

$$\tilde{x} = Lb_i + \frac{\frac{n}{2} - {<}cf_{i-1}}{f_i}\,(\text{class size})$$

or, working down from the upper boundary,

$$\tilde{x} = Ub_i - \frac{\frac{n}{2} - {>}cf_{i-1}}{f_i}\,(\text{class size})$$

where:
$Lb_i$ = lower class boundary of the interpolated interval
$Ub_i$ = upper class boundary of the interpolated interval
${<}cf_{i-1}$ = less than cumulative frequency of the class before the interpolated
interval
${>}cf_{i-1}$ = greater than cumulative frequency of the class before the interpolated
interval when counting down from the highest class (the next higher class)
$f_i$ = frequency of the interpolated interval
i = interpolated interval
n = number of data points
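The interpolation step can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `grouped_median` is hypothetical, all classes are assumed to share one class size, and the data are the class boundaries and frequencies from the final-examination table in Example 3.

```python
def grouped_median(lower_boundaries, freqs, class_size):
    """Interpolated median: Lb + ((n/2 - <cf) / f) * class_size."""
    n = sum(freqs)
    half = n / 2
    cum_before = 0  # less-than cumulative frequency of the classes already passed
    for lb, f in zip(lower_boundaries, freqs):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if f > 0 and cum_before + f >= half:
            return lb + (half - cum_before) / f * class_size
        cum_before += f
    raise ValueError("the frequency distribution is empty")


# Lower class boundaries and frequencies from the Example 3 table (class size 10).
exam_lower = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
exam_freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]

exam_median = grouped_median(exam_lower, exam_freqs, class_size=10)
print(round(exam_median, 2))
```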
C. THE MODE
The last measure of central tendency is the mode. For a finite population, the
population mode is the value of X that occurs most often. The mode of a
sample is the value that occurs most often in the sample. The drawback to
this measure is that there might not be a unique mode: there might be no single
number that occurs more often than any other. For this reason, the mode is
not a particularly useful descriptive measure.
When the data are grouped into a frequency distribution, the midpoint of
the cell with the highest frequency is the mode, since this point represents
the highest point (greatest frequency).
For Grouped Data:

$$\text{Mode} = L_B + \frac{d_1}{d_1 + d_2}\,(\text{class size})$$

$L_B$ = lower boundary of the modal class
Modal class = the category containing the highest frequency
$d_1$ = difference between the frequency of the modal class and the frequency of the class above it
when the scores are arranged from lowest to highest (the next lower class)
$d_2$ = difference between the frequency of the modal class and the frequency of the class below it
when the scores are arranged from lowest to highest (the next higher class)
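A short Python sketch of the grouped-mode formula is given below. The function name `grouped_modes` is hypothetical. Two choices in it are assumptions rather than part of the formula above: a missing neighbouring class is treated as having frequency 0, and when $d_1 + d_2 = 0$ the class midpoint is returned. Because more than one class can share the highest frequency, the function returns a list. The data are from the final-examination table in Example 3.

```python
def grouped_modes(lower_boundaries, freqs, class_size):
    """Interpolated mode, L_B + d1 / (d1 + d2) * class_size, for each modal class."""
    highest = max(freqs)
    modes = []
    for i, (lb, f) in enumerate(zip(lower_boundaries, freqs)):
        if f != highest:
            continue
        # Assumption: a class with no neighbour on one side sees frequency 0 there.
        lower_neighbour = freqs[i - 1] if i > 0 else 0
        upper_neighbour = freqs[i + 1] if i < len(freqs) - 1 else 0
        d1 = f - lower_neighbour
        d2 = f - upper_neighbour
        if d1 + d2 == 0:
            # Assumption: fall back to the class midpoint when both differences vanish.
            modes.append(lb + class_size / 2)
        else:
            modes.append(lb + d1 / (d1 + d2) * class_size)
    return modes


exam_lower = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
exam_freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]

exam_modes = grouped_modes(exam_lower, exam_freqs, class_size=10)
print(exam_modes)
```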
EXAMPLES:
1. A high school teacher at a small private school assigns trigonometry practice problems to be
worked via the net. Students must use a password to access the problems and the time of log-in
and log-off are automatically recorded for the teacher. At the end of the week, the teacher
examines the amount of time each student spent working the assigned problems. The data is
provided below in minutes.
Data: 15 28 25 48 22 43 49 34 22 33 27 25 22 20 39

$$\text{Mean: } \bar{x} = \frac{15 + 28 + 25 + 48 + 22 + 43 + 49 + 34 + 22 + 33 + 27 + 25 + 22 + 20 + 39}{15} = \frac{452}{15}$$
Mean = 30.13

Median: arrange the data from smallest to largest:
15 20 22 22 22 25 25 27 28 33 34 39 43 48 49
$\tilde{x}$ = 27 (the 8th of the 15 ordered values)

Mode: $\hat{x}$ = 22 (it occurs three times, more often than any other value)
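The three results above can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Minutes spent by each of the 15 students (data from Example 1).
minutes = [15, 28, 25, 48, 22, 43, 49, 34, 22, 33, 27, 25, 22, 20, 39]

mean_minutes = statistics.mean(minutes)
median_minutes = statistics.median(minutes)
# multimode returns every value tied for the highest count, so a
# non-unique mode shows up as a list with more than one entry.
mode_list = statistics.multimode(minutes)

print(round(mean_minutes, 2), median_minutes, mode_list)
```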
2. The number of television viewing hours per household
and the prime viewing times are two factors that affect
television advertising income. A random sample of 30
households in a particular viewing area produced the
following estimates of viewing hours per household.

 3.0    6.0    7.0   15.0   12.0    6.1
 6.5    8.0    4.0    5.0    6.0    7.3
 5.0   12.0    1.0    3.5    3.0    5.4
 7.5    5.0   10.0    8.0    3.5    8.3
 9.0    2.0    6.5    1.0    5.0    8.5

Find the mean, median and mode.


3. The frequency table below represents the final examination scores for
a statistics course. Find the population mean, the population
median and the population mode.

Class Interval   Frequency   Class mark   Cumulative Frequency (<CF)
10 – 19              3          14.5              3
20 – 29              2          24.5              5
30 – 39              3          34.5              8
40 – 49              4          44.5             12
50 – 59              5          54.5             17
60 – 69             11          64.5             28
70 – 79             14          74.5             42
80 – 89             14          84.5             56
90 – 99              4          94.5             60

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i}$$

$$\text{Mean} = \frac{(3)(14.5) + (2)(24.5) + (3)(34.5) + (4)(44.5) + (5)(54.5) + (11)(64.5) + (14)(74.5) + (14)(84.5) + (4)(94.5)}{3 + 2 + 3 + 4 + 5 + 11 + 14 + 14 + 4} = \frac{3960}{60}$$
Mean = 66

The median class is 70 – 79, the first class whose cumulative frequency (42) reaches n/2 = 30.

$$\text{Median} = Lb + \frac{\frac{n}{2} - {<}cf_{i-1}}{f_i}\,(\text{class size})$$

$$\text{Median} = 69.5 + \frac{\frac{60}{2} - 28}{14}\,(10)$$
Median = 70.93

Mode = Class mark with the highest frequency
Mode = 74.5 and 84.5 (two classes share the highest frequency, 14)

Using the interpolation formula for each modal class:

Modal Class 1 (70 – 79): $d_1 = 14 - 11 = 3$, $d_2 = 14 - 14 = 0$
$$\text{Mode} = 69.5 + \frac{3}{3 + 0}\,(10) = 79.5$$

Modal Class 2 (80 – 89): $d_1 = 14 - 14 = 0$, $d_2 = 14 - 4 = 10$
$$\text{Mode} = 79.5 + \frac{0}{0 + 10}\,(10) = 79.5$$
4. Find the sample mean, sample median and sample mode.

CLASS INTERVAL   FREQUENCY   CLASS BOUNDARY   CLASS MARK   <CF   >CF
5-9                  4         4.5-9.5            7           4   100
10-14                8         9.5-14.5          12          12    96
15-19               17        14.5-19.5          17          29    88
20-24               26        19.5-24.5          22          55    71
25-29               20        24.5-29.5          27          75    45
30-34               15        29.5-34.5          32          90    25
35-39               10        34.5-39.5          37         100    10

Summary of when to use the mean, median and mode

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
Piece of Advice
MEASURES OF VARIABILITY
Refers to the extent of scatter or dispersion
around the zone of central tendency
Variability is about the Spread
A. RANGE
One measure of variation is the range, which has the advantage of being
very easy to compute. The range, R, of a set of n measurements is defined as the
difference between the largest and smallest measurements.
Formula:
Range = Highest score – Lowest Score or R = (H – L)
B. VARIANCE and STANDARD DEVIATION
The variance of a population of N measurements is defined to be the average of the
squares of the deviations of the measurements about their mean μ. The
population variance is denoted by σ² and is given by the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$ For ungrouped data

$$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{\sum_{i=1}^{k} f_i}$$ For grouped data
The variance of a sample of n measurements is defined to be the sum of the
squared deviations of the measurements about their mean $\bar{x}$ divided by (n-1).
The sample variance is denoted by s² and is given by the formula

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$ For ungrouped data

$$s^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\left( \sum_{i=1}^{k} f_i \right) - 1}$$ For grouped data

The standard deviation, in essence, represents the “average amount of
variability” in a set of measures, using the mean as a reference point. Strictly
speaking, the standard deviation is the positive square root of the variance,
that is, of the average of the squared deviations about the mean. The standard
deviation is basically a measure of how far each score, on the average, is
from the mean.
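The variance formulas above differ only in the divisor (N for a population, n - 1 for a sample) and in whether each squared deviation is weighted by a class frequency. A minimal Python sketch follows; the function names `variance` and `grouped_variance` and the `sample` flag are hypothetical. It uses the 15 trigonometry practice times and the final-examination frequency table that appear in the examples of this section.

```python
import math


def variance(values, sample=False):
    """Ungrouped variance: divide by n - 1 for a sample, by N for a population."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


def grouped_variance(freqs, midpoints, sample=False):
    """Grouped variance: squared deviations of class midpoints weighted by frequency."""
    total = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / total
    squared_deviations = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
    return squared_deviations / (total - 1 if sample else total)


# Ungrouped sample: minutes spent on the trigonometry problems.
minutes = [15, 28, 25, 48, 22, 43, 49, 34, 22, 33, 27, 25, 22, 20, 39]
s_squared = variance(minutes, sample=True)
s = math.sqrt(s_squared)

# Grouped population: final-examination frequency table.
exam_freqs = [3, 2, 3, 4, 5, 11, 14, 14, 4]
exam_midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
sigma_squared = grouped_variance(exam_freqs, exam_midpoints)
sigma = math.sqrt(sigma_squared)

print(s_squared, s, sigma_squared, sigma)
```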
1. A high school teacher at a small private school assigns trigonometry practice problems to be
worked via the net. Students must use a password to access the problems and the time of log-in
and log-off are automatically recorded for the teacher. At the end of the week, the teacher
examines the amount of time each student spent working the assigned problems. The data is
provided below in minutes.
Data 15 28 25 48 22 43 49 34
22 33 27 25 22 20 39
Calculate the range, variance and standard deviation.
Range = HV – LV
= 49 – 15
= 34

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$ (sample variance)

$$s^2 = \frac{\begin{aligned} &(15-30.13)^2 + (28-30.13)^2 + (25-30.13)^2 + (48-30.13)^2 + (22-30.13)^2 \\ &+ (43-30.13)^2 + (49-30.13)^2 + (34-30.13)^2 + (22-30.13)^2 + (33-30.13)^2 \\ &+ (27-30.13)^2 + (25-30.13)^2 + (22-30.13)^2 + (20-30.13)^2 + (39-30.13)^2 \end{aligned}}{15 - 1}$$
= 109.9809524

$$s = \sqrt{s^2}$$ (sample standard deviation)
= 10.48718038
2. The frequency table below represents the final examination scores for a
statistics course. Find the population range, population variance and
population standard deviation.

Class Interval   Frequency   Class mark   Cumulative Frequency
10 – 19              3          14.5              3
20 – 29              2          24.5              5
30 – 39              3          34.5              8
40 – 49              4          44.5             12
50 – 59              5          54.5             17
60 – 69             11          64.5             28
70 – 79             14          74.5             42
80 – 89             14          84.5             56
90 – 99              4          94.5             60

Range = Highest Upper Class Boundary – Lowest Lower Class Boundary
= 99.5 – 9.5
= 90

$$\sigma^2 = \frac{\sum f (x - \mu)^2}{\sum f}$$

$$\sigma^2 = \frac{\begin{aligned} &3(14.5-66)^2 + 2(24.5-66)^2 + 3(34.5-66)^2 + 4(44.5-66)^2 + 5(54.5-66)^2 \\ &+ 11(64.5-66)^2 + 14(74.5-66)^2 + 14(84.5-66)^2 + 4(94.5-66)^2 \end{aligned}}{60} = \frac{25965}{60}$$
= 432.75

$$\sigma = \sqrt{432.75}$$
= 20.80264406 or 20.80
Influence of Distribution Shape
Measures of Shape
- refer to the visual characteristics of a certain distribution.
- knowledge of the shape of the distribution can help in concluding whether
  the distribution is normal or not

Two (2) Principal Measures of Shape
- SKEWNESS
- KURTOSIS
Skewness
refers to the symmetry of a distribution. A distribution which is not
symmetric with respect to its mean can be termed either positively skewed
or negatively skewed.

Kurtosis
refers to the flatness or peakedness of a particular distribution.
Skewness

$$SK = \frac{\sum \left[ (X_i - \mu) / \sigma \right]^3}{N}$$

where:
$X_i$ - individual reading
σ - standard deviation
μ - mean
N - population size

SK = 0: Symmetric (Normal)
SK > 0: Positively Skewed
SK < 0: Negatively Skewed
Skewness relating to central tendency
negative skew: The left tail is longer than the right tail. It
has relatively few low values. The distribution is said to
be left-skewed or "skewed to the left". Example
(observations): 1, 1000, 1001, 1002, 1003.

positive skew: The right tail is longer than the left tail. It has
relatively few high values. The distribution is said to be
right-skewed or "skewed to the right". Example
(observations): 1, 2, 3, 4, 100.

The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
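The sign of SK for the two sets of example observations can be checked with a small Python sketch of the population formula $SK = \sum[(X_i - \mu)/\sigma]^3 / N$. The function name `skewness` is hypothetical, and the symmetric set 1, 2, 3, 4, 5 is an added illustration rather than data from these notes.

```python
import math


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


left_skewed = [1, 1000, 1001, 1002, 1003]   # one unusually low value
right_skewed = [1, 2, 3, 4, 100]            # one unusually high value
symmetric = [1, 2, 3, 4, 5]                 # illustrative symmetric data

print(skewness(left_skewed), skewness(right_skewed), skewness(symmetric))
```

A single extreme value on one side is enough to pull SK away from zero in that direction, which is why the two examples have opposite signs.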
Kurtosis

$$k = \frac{\sum \left[ (X_i - \mu) / \sigma \right]^4}{N}$$

where:
$X_i$ - individual reading
σ - standard deviation
μ - mean
N - population size

k = 3: MesoKurtic (Normal)
k > 3: LeptoKurtic
k < 3: PlatyKurtic
A platykurtic data set has a flatter peak around its mean and thinner
tails than a normal distribution. The flatness results from the data being less
concentrated around the mean, due to large variations within observations.

Mesokurtic is the term used when the kurtosis of a distribution is similar,
or identical, to the kurtosis of a normally distributed data set.

Leptokurtic distributions have higher peaks around the mean than
normal distributions, together with thick tails on both sides. These peaks
result from the data being highly concentrated around the mean, due to
lower variations within observations.
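The kurtosis formula and the three categories can be sketched in Python as follows. The names `kurtosis` and `classify_kurtosis` are hypothetical, and the tolerance used to decide when k counts as "equal to 3" is an assumption added here, since an exact value of 3 rarely occurs in real data. The first data set is an added illustration; the second is the right-skewed example from the skewness discussion.

```python
import math


def kurtosis(values):
    """Population kurtosis: the mean of the fourth powers of the standardized deviations."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return sum(((x - mu) / sigma) ** 4 for x in values) / n


def classify_kurtosis(k, tolerance=0.05):
    """Label k relative to 3; the tolerance is an illustrative assumption."""
    if abs(k - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"


evenly_spread = [1, 2, 3, 4, 5]      # illustrative data with no pronounced peak
with_outlier = [1, 2, 3, 4, 100]     # most values close together, one far away

for data in (evenly_spread, with_outlier):
    k = kurtosis(data)
    print(data, round(k, 4), classify_kurtosis(k))
```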
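SUMMARY:
CENTRAL TENDENCY

UNGROUPED DATA
MEAN: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$ (population), $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ (sample)
MEDIAN: $\tilde{x} = x_{(n+1)/2}$ (n ODD), $\tilde{x} = \dfrac{x_{(n/2)} + x_{(n/2)+1}}{2}$ (n EVEN)
MODE: value of X that occurs most often

GROUPED DATA
MEAN: $\mu \text{ or } \bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
MEDIAN: $\tilde{x} = Lb_i + \dfrac{\frac{n}{2} - {<}cf_{i-1}}{f_i}\,(CS)$, where CS is the class size
MODE: the midpoint of the cell with the highest frequency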

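VARIABILITY

UNGROUPED DATA
RANGE: R = HV - LV
VARIANCE:
Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
STANDARD DEVIATION: $\sigma = \sqrt{\sigma^2}$ (population), $s = \sqrt{s^2}$ (sample)

GROUPED DATA
RANGE: R = Highest Upper Class Boundary – Lowest Lower Class Boundary
VARIANCE:
Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{\sum_{i=1}^{k} f_i}$
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\left( \sum_{i=1}^{k} f_i \right) - 1}$
STANDARD DEVIATION: $\sigma = \sqrt{\sigma^2}$ (population), $s = \sqrt{s^2}$ (sample)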
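SUMMARY:
SHAPE

SKEWNESS
SK = 0, normal
SK > 0, positively skewed
SK < 0, negatively skewed

UNGROUPED DATA
POPULATION: $SK = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^3}{N \sigma^3}$
SAMPLE: $SK = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$

GROUPED DATA
POPULATION: $SK = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^3}{\left( \sum_{i=1}^{k} f_i \right) \sigma^3}$
SAMPLE: $SK = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^3}{\left( \sum_{i=1}^{k} f_i \right) s^3}$

KURTOSIS
K = 3, mesokurtic
K > 3, leptokurtic
K < 3, platykurtic

UNGROUPED DATA
POPULATION: $K = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^4}{N \sigma^4}$
SAMPLE: $K = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}$

GROUPED DATA
POPULATION: $K = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^4}{\left( \sum_{i=1}^{k} f_i \right) \sigma^4}$
SAMPLE: $K = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^4}{\left( \sum_{i=1}^{k} f_i \right) s^4}$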
PRACTICAL SIGNIFICANCE OF THE STANDARD DEVIATION

A. TCHEBYSHEFF’S (CHEBYSHEV) THEOREM
Tchebysheff’s theorem applies to any set of measurements
and can be used to describe either a sample or a
population. The idea involved in this theorem is as follows.
An interval is constructed by measuring a distance kσ
on either side of the mean μ. The theorem is
true for any number we choose for k, as long as it is greater than or
equal to 1. Then at least 1 – (1/k²) of the total number of n
measurements lie within the constructed interval.
The theorem states that:
- At least none of the measurements lie in the
interval μ-σ to μ+σ.
- At least ¾ of the measurements lie in the
interval μ-2σ to μ+2σ.
- At least 8/9 of the measurements lie in the
interval μ-3σ to μ+3σ.
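The bound 1 – (1/k²) is easy to tabulate. The Python sketch below is illustrative only; the function names `chebyshev_lower_bound` and `k_for_interval` are hypothetical, and the interval helper assumes an interval that is symmetric about the mean.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the theorem is stated here for k >= 1")
    return 1 - 1 / k ** 2


def k_for_interval(mu, sigma, upper_limit):
    """Number of standard deviations from the mean to the upper limit of a symmetric interval."""
    return (upper_limit - mu) / sigma


for k in (1, 2, 3):
    print(k, chebyshev_lower_bound(k))

# A symmetric interval of 6 units on either side of a mean of 1000 with sigma = 5.
k = k_for_interval(mu=1000, sigma=5, upper_limit=1006)
print(k, round(chebyshev_lower_bound(k), 4))
```

The bound holds for any distribution, which is why it is much weaker than the percentages quoted for bell-shaped data.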
B. EMPIRICAL RULE
Another rule helpful in interpreting a value for a
standard deviation is the Empirical rule, which
applies to a data set having a distribution that is
approximately bell-shaped. The empirical rule is
often stated in abbreviated form, sometimes
called the 68-95-99.7 rule: about 68% of the
measurements lie within one standard deviation of
the mean, about 95% within two, and about 99.7%
within three.
Examples:
1. Let X be the number of screws delivered to a box by an automatic filling
device. Assume µ = 1,000, σ² = 25. There are problems with too many
(giving away free product) or too few (potentially irritated customers)
screws in a box.
a) How many σ-units to the right of µ is 1009?
b) What X value is 2.6 σ-units to the left of µ?
c) Use Chebyshev’s inequality to find a bound on P[994 < X < 1006].

Since σ² = 25, σ = 5.

a) Limit(X) = µ ± kσ
$$k = \frac{\text{limit} - \mu}{\sigma} = \frac{1009 - 1000}{5} = 1.8$$
Then X = 1009 is 1.8 standard deviation units to the right of µ.

b) Limit(X) = µ ± kσ
X = 1000 – 2.6(5) = 987
Then X = 987 is 2.6 standard deviation units to the left of µ.

c) To solve for k:
$$k = \frac{\text{limit} - \mu}{\sigma} = \frac{994 - 1000}{5} = -1.2$$
Area = 1 – 1/k²
Area = 1 – 1/(-1.2)²
= 0.3056
P(994 < X < 1006) ≥ 0.3056
2. The mean life of a certain brand of auto batteries is 44
months with a standard deviation of three months. Assume
that the lives of all auto batteries of this brand have a bell-
shaped distribution. Using the empirical rule, find the
percentage of auto batteries of this brand that have a life
of
a. 41 to 47 months b. 41 to 50 months c. 35 to 53 months

[Figure: bell curve with the axis marked at 35, 38, 41, 44, 47, 50 and 53 months. Areas between consecutive marks, from left to right: 0.0235, 0.135, 0.34, 0.34, 0.135, 0.0235, with 0.0015 in each tail beyond 35 and 53.]

a) About 68% (0.34 + 0.34) of battery lives fall between 41 and 47 months

b) About 81.5% (0.34 + 0.34 + 0.135) of battery lives fall between 41 and 50 months

c) About 99.7% of battery lives fall between 35 and 53 months
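The empirical-rule percentages are rounded values for a normal distribution. As a rough check, the Python sketch below assumes that battery life is exactly normal with mean 44 and standard deviation 3, which is stronger than the "bell-shaped" assumption in the problem, and uses `statistics.NormalDist` from the standard library (Python 3.8 or later). The helper name `percent_between` is hypothetical.

```python
from statistics import NormalDist

# Assumption: battery life follows an exact normal distribution.
battery_life = NormalDist(mu=44, sigma=3)


def percent_between(dist, low, high):
    """Percentage of the distribution lying between low and high."""
    return 100 * (dist.cdf(high) - dist.cdf(low))


part_a = percent_between(battery_life, 41, 47)   # within 1 sigma on each side
part_b = percent_between(battery_life, 41, 50)   # 1 sigma below to 2 sigma above
part_c = percent_between(battery_life, 35, 53)   # within 3 sigma on each side

print(round(part_a, 1), round(part_b, 1), round(part_c, 1))
```

The exact normal percentages differ from the empirical-rule answers only by a few tenths of a percentage point.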
