
SUMMARIZING DATA SETS - STATPROB FASILKOM UI

SUMMARIZING DATA SETS


1. MEASURES OF CENTRAL TENDENCY

2. MEASURES OF VARIATION
3. MEASURES OF POSITION

1. MEASURES OF CENTRAL TENDENCY
¡ Where are the majority of scores concentrated?
¡ Statistics that are used for describing the center of a set of data values:

MEAN MEDIAN MODE


Mean
¡ Mean is the arithmetic average of the scores in a distribution.

¡ Population Mean
$$\mu = \frac{\sum x_i}{N}$$
• $\mu$ is the mean of the population
• $N$ is the size of the population

¡ Sample Mean
$$\bar{x} = \frac{\sum x_i}{n}$$
• $\bar{x}$ is the mean of the sample
• $n$ is the size of the sample


Sample Mean
¡ Let $x_1, x_2, x_3, \ldots, x_n$ be the $n$ numerical values of our data set. The sample mean, denoted by $\bar{x}$, is defined by
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$


Sample Mean (2)


¡ Modified data: multiply each value by a constant $a$ and add a constant $b$, i.e., $y_i = a x_i + b$.

¡ The constants $a$ and $b$ carry over to the mean of the modified data: $\bar{y} = a\bar{x} + b$.

¡ This can simplify the calculation of the mean.


Example
¡ Find the sample mean of the following scores (The winning scores in the U.S. Masters golf
tournament 1999-2008).
{280, 278, 272, 276, 281, 279, 276, 281, 289, 280}
¡ It is easier to first subtract 280 from these values: $y_i = x_i - 280$
{0, −2, −8, −4, 1, −1, −4, 1, 9, 0}
¡ The mean of the $y_i$'s is easy to determine: $\bar{y} = -0.8$
¡ So the mean of the original data is
$$\bar{x} = \bar{y} + 280 = 279.2$$
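
The shift trick in this example can be checked numerically. The following is a minimal sketch in Python; the variable names are illustrative and not part of the slides.

```python
# Winning scores in the U.S. Masters, 1999-2008 (data from the example above)
scores = [280, 278, 272, 276, 281, 279, 276, 281, 289, 280]

shift = 280                                  # constant subtracted from every score
shifted = [x - shift for x in scores]        # y_i = x_i - 280
mean_shifted = sum(shifted) / len(shifted)   # mean of the shifted values
mean_original = mean_shifted + shift         # undo the shift to recover the mean of x
direct_mean = sum(scores) / len(scores)      # mean computed without the shift

print(mean_shifted, mean_original, direct_mean)
```

Both routes should agree up to floating-point rounding, which is the point of the example: subtracting a constant changes the mean by exactly that constant.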


Exercise
¡ Find the sample mean of the following Statprob Scores.
{90, 87, 85, 92, 90, 86, 98, 95, 91, 81}

¡ Mean = 89.5


Mean in Class Intervals


¡ Mean for data that are grouped into class intervals (in a grouped frequency table), where the sums run over the $n$ class intervals:

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i m_i}{\sum_{i=1}^{n} f_i}$$

¡ $m_i$ = midpoint of the $i$-th interval.
¡ $f_i$ = frequency of the $i$-th interval.


Properties of the Mean


1. The sum of deviations of all scores from the mean is zero.

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

¡ Verify it using the Statprob scores:
¡ {90, 87, 85, 92, 90, 86, 98, 95, 91, 81}
¡ Mean $\bar{x} = 89.5$


Properties of the Mean (2)


2. The sum of squared deviations from the mean is no larger than the sum of squared deviations from any other value $c$.

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - c)^2, \quad c \in \mathbb{R}$$

¡ Verify it using the Statprob scores:
¡ {90, 87, 85, 92, 90, 86, 98, 95, 91, 81}
¡ Mean = 89.5


Properties of the Mean (3)

$x_i$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$   $(x_i - 95)^2$
90      0.5                 0.25                  25
87      -2.5                6.25                  64
85      -4.5                20.25                 100
92      2.5                 6.25                  9
90      0.5                 0.25                  25
86      -3.5                12.25                 81
98      8.5                 72.25                 9
95      5.5                 30.25                 0
91      1.5                 2.25                  16
81      -8.5                72.25                 196
∑       0                   222.5                 525
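
The column totals in this table can be reproduced with a few lines of Python. This is only a sketch; the helper `sum_sq_dev` is a name introduced here for illustration, and 95 is the comparison value used in the table's last column.

```python
scores = [90, 87, 85, 92, 90, 86, 98, 95, 91, 81]
mean = sum(scores) / len(scores)          # 89.5

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from a reference value c."""
    return sum((x - c) ** 2 for x in data)

sum_dev = sum(x - mean for x in scores)   # property 1: deviations sum to zero
ss_mean = sum_sq_dev(scores, mean)        # squared deviations around the mean
ss_95 = sum_sq_dev(scores, 95)            # squared deviations around 95

print(sum_dev, ss_mean, ss_95)
```

Trying other values in place of 95 gives a sum of squares that is never smaller than the one around the mean, which is what property 2 states.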

Sample Median
¡ Order the values of a data set of size $n$ from smallest to largest.
¡ If $n$ is odd, the sample median is the value in position $(n + 1)/2$.
¡ If $n$ is even, the sample median is the average of the values in positions $n/2$ and $n/2 + 1$.

¡ The median is also the second quartile.

¡ Example
¡ {3, 6, 12, 18, 19, 21, 23} → median = 4th datum = 18.
¡ {3, 6, 12, 18, 19, 21, 23, 25} → median = (18 + 19) / 2 = 18.5
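
The odd/even rule above can be sketched as a small Python function. The function name `sample_median` is an illustrative choice, and positions are converted from the slide's 1-based convention to Python's 0-based indexing.

```python
def sample_median(values):
    """Sample median following the odd/even position rule from the slide."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # odd n: value in position (n + 1) / 2 (1-based)
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values in positions n/2 and n/2 + 1 (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(sample_median([3, 6, 12, 18, 19, 21, 23]))
print(sample_median([3, 6, 12, 18, 19, 21, 23, 25]))
```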


Grouped Data [Hinkle, 2003]

Sample Median (2)


¡ For a grouped frequency table:
$$\text{Mdn} = ll + \left(\frac{n(0.50) - cf}{f_i}\right)(w)$$
¡ $ll$: lower exact limit of the interval containing the $n(0.50)$-th score
¡ $n$: total number of scores
¡ $cf$: cumulative frequency of scores below the interval containing the $n(0.50)$-th score
¡ $f_i$: frequency of scores in the interval containing the $n(0.50)$-th score
¡ $w$: width of the class interval

For the left-end-inclusion convention [Ross, 2009], the lower limit of an interval is its left bound.

Exercise: Sample Median

$$\text{Mdn} = ll + \left(\frac{n(0.50) - cf}{f_i}\right)(w)$$

$$\text{Mdn} = 44.5 + \left(\frac{90 - 50}{42}\right)(5) = 49.26$$


Mean vs Median
¡ The mean is highly sensitive to outliers!

¡ Suppose we have a data set consisting of the weights of 4 persons:

{60, 70, 80, 990}
¡ The mean of this sample is

$$\frac{60 + 70 + 80 + 990}{4} = 300$$
¡ The mean 300 fails to present a realistic picture of the major part of the data. 990 seems to be an outlier!


Mean vs Median (2)


¡ We need another statistic → the median.

¡ For the data set consisting of the weights of 4 persons:

{60, 70, 80, 990}
¡ The median of this sample is

$$\frac{70 + 80}{2} = 75$$
¡ In this case, 3 observations out of 4 lie between 60 and 80, so the median is a good statistic here.
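
The contrast between the two statistics can be reproduced with Python's standard `statistics` module. This is a small sketch using the weights from the slide; the variable names are illustrative.

```python
import statistics

weights = [60, 70, 80, 990]

mean_w = statistics.mean(weights)      # pulled upward by the value 990
median_w = statistics.median(weights)  # average of the two middle values

# Without the suspected outlier, the two statistics agree closely
trimmed = [w for w in weights if w != 990]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_w, median_w, mean_trimmed, median_trimmed)
```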


Sample Mode
¡ Mode is the most frequent score in a distribution.

¡ Calculate the frequency distribution:

Score   f
783     6
785     4
786     2
788     2
789     2
790     2
791     3
792     2

¡ 783 is the most frequent score (6 times), so the mode of the data is 783.
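
Reading the mode off a frequency table can be sketched with `collections.Counter` from the Python standard library. The dictionary below simply restates the table; the variable names are illustrative.

```python
from collections import Counter

# Frequency distribution from the table: score -> frequency
freq = Counter({783: 6, 785: 4, 786: 2, 788: 2, 789: 2, 790: 2, 791: 3, 792: 2})

# most_common(1) returns the (score, frequency) pair with the highest frequency
mode, count = freq.most_common(1)[0]
print(mode, count)
```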

Sample Mode
¡ When data are grouped into class intervals [Hinkle, 2003], the mode is found through the modal interval (the interval with the highest frequency), and the midpoint of this interval is taken as the mode.

¡ Example: the modal interval is 45-49, so the mode of the data is 47.


Mode, Median, and Mean in Normal Distribution

[Figure: normal distribution curve with the mode, median, and mean marked at the same central point]


Mode, Median, and Mean in Skewed Distribution


Mode, Median, and Mean in Bimodal Distribution

2. MEASURES OF VARIABILITY
¡ How widely are scores spread throughout the distribution?
¡ These statistics measure how much our variables vary from the mean.
¡ The measures of variation to be discussed:

RANGE   MEAN DEVIATION   VARIANCE   STANDARD DEVIATION


Range
¡ Range is the number of units on the scale of measurement that include the highest and lowest values.
Range = (highest score – lowest score) + 1 unit

¡ Example (ordered data):

Distribution 1: 11 16 18 … 31 37
Distribution 2: 18 19 21 … 26 29

¡ Compute the range of both distributions!

¡ Range of Distribution 1 = 37 − 11 + 1 = 27
¡ Range of Distribution 2 = 29 − 18 + 1 = 12


Mean Deviation
¡ Deviation score is the difference between the given score and the mean.

$$DS_i = (x_i - \bar{x})$$

¡ Mean deviation (MD) is the average of the absolute values of the deviation scores.

$$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} = \frac{\sum_{i=1}^{n} |DS_i|}{n}$$

¡ A larger MD shows a greater variation.
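
As an illustration of the MD formula, the sketch below applies it to the Statprob scores used earlier in these slides. The function name `mean_deviation` is introduced here for illustration only.

```python
def mean_deviation(values):
    """Average of the absolute deviation scores |x_i - mean|."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

scores = [90, 87, 85, 92, 90, 86, 98, 95, 91, 81]
md = mean_deviation(scores)
print(md)
```

A data set in which every value is identical would give an MD of zero, the smallest possible variation.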


Variance
¡ Squared deviations are used instead of absolute values.

¡ Variance is the average of the squared deviations around the mean.
¡ Population variance $\sigma^2$:

$$\sigma^2 = \frac{SS}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

¡ Sample variance $s^2$:

$$s^2 = \frac{SS}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

$SS$: sum of squares


Sample Variance
¡ The sample variance, call it $s^2$, of the data set $x_1, x_2, x_3, \ldots, x_n$ is defined by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$


Sample Variance (2)


¡ Modified data: multiply each value by a constant $a$ and add a constant $b$, i.e., $y_i = a x_i + b$.

¡ The additive constant $b$ does not change the variance, while the multiplicative constant $a$ scales it: $s_y^2 = a^2 s_x^2$.

¡ This can simplify the calculation of the variance.
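
How a linear change of the data affects the sample variance can be checked numerically. The sketch below uses the golf scores from the earlier example and `statistics.variance` (which divides by n − 1); the constants `a = 2` and `b = 3` are arbitrary values chosen for illustration.

```python
import math
import statistics

x = [280, 278, 272, 276, 281, 279, 276, 281, 289, 280]

shifted = [xi - 280 for xi in x]        # only an additive constant (b = -280)
a, b = 2, 3                             # hypothetical constants for illustration
scaled = [a * xi + b for xi in x]       # y_i = a * x_i + b

var_x = statistics.variance(x)          # sample variance, divisor n - 1
var_shifted = statistics.variance(shifted)
var_scaled = statistics.variance(scaled)

print(math.isclose(var_shifted, var_x))          # shifting leaves the variance unchanged
print(math.isclose(var_scaled, a ** 2 * var_x))  # scaling multiplies it by a squared
```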


Sample Variance (3)

[Figure: two distributions, one labeled "Low variance" and one labeled "High variance"]


Grouped Data [Hinkle, 2003]

Sample Variance for Grouped Data

$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}$$

¡ $f_i$: frequency of the $i$-th interval
¡ $m_i$: midpoint of the $i$-th interval
¡ $k$: number of intervals
¡ $n$: total number of scores ($n = \sum f_i$)


Exercise
¡ Compute the sample variance

$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}$$


Standard Deviation
¡ Standard deviation is the square root of the variance.

¡ Symbols:
¡ Standard deviation of a population, $\sigma$:
$$\sigma = \sqrt{\sigma^2}$$
¡ Standard deviation of a sample, $s$:
$$s = \sqrt{s^2}$$


Example
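¡ Compute the standard deviation of the following data ($n = 7$):

$i$   $X_i$   $X_i - \bar{X}$   $(X_i - \bar{X})^2$
1     9       3                 9
2     12      6                 36
3     7       1                 1
4     5       -1                1
5     2       -4                16
6     3       -3                9
7     4       -2                4
∑     42      0                 76

¡ Total: $\sum X_i = 42$
¡ Mean: $\bar{X} = \frac{42}{7} = 6$
¡ Variance: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{76}{6} = 12.67$
¡ Standard deviation: $s = \sqrt{s^2} = \sqrt{12.67} = 3.56$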
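
The worked example can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) divisor. This is a sketch; the variable names are illustrative.

```python
import statistics

data = [9, 12, 7, 5, 2, 3, 4]

mean = statistics.mean(data)                 # 42 / 7
ss = sum((x - mean) ** 2 for x in data)      # sum of squared deviations
variance = statistics.variance(data)         # SS / (n - 1)
std_dev = statistics.stdev(data)             # square root of the sample variance

print(mean, ss, round(variance, 2), round(std_dev, 2))
```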


Exercise
¡ Compute mean, mean deviation, and variance of the following grouped frequency table!

Class Interval Frequency


0-2 3
3-5 6
6-8 6
9-11 4
12-14 1
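
One way to check an answer to this exercise is the sketch below, which applies the grouped-data formulas for the mean and the sample variance from the earlier slides. The slides give no grouped formula for the mean deviation, so the frequency-weighted version used here is an assumption made by analogy with the ungrouped definition.

```python
# (lower bound, upper bound, frequency) for each class interval in the table
table = [(0, 2, 3), (3, 5, 6), (6, 8, 6), (9, 11, 4), (12, 14, 1)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in table]   # m_i
freqs = [f for _, _, f in table]                     # f_i
n = sum(freqs)                                       # total number of scores

grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
grouped_var = sum(f * (m - grouped_mean) ** 2
                  for f, m in zip(freqs, midpoints)) / (n - 1)
# Assumption: grouped mean deviation = sum of f_i * |m_i - mean| divided by n
grouped_md = sum(f * abs(m - grouped_mean)
                 for f, m in zip(freqs, midpoints)) / n

print(n, grouped_mean, grouped_md, grouped_var)
```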

3. MEASURES OF POSITION
¡ Where can you find a certain score in reference to the other scores?
¡ Statistics to give context or frame of reference, i.e., relative position of a score among other scores.
¡ Some statistics for this problem:

PERCENTILE   PERCENTILE RANK


Sample Percentile
¡ The sample 100p percentile is that data value such that:
¡ 100p percent of the data are less than or equal to it
¡ 100(1 − p) percent are greater than or equal to it
¡ If two data values satisfy this condition, then the sample 100p percentile is the average of these two values.
¡ Writing convention
¡ Sample 100p percentile = $P_{100p}$
¡ Sample 25th percentile = $P_{25}$


Sample Percentile (2)


¡ To determine the sample 100p percentile of a data set of size $n$, we need to determine the data value such that:
¡ At least $np$ of the values are less than or equal to it.
¡ At least $n(1 - p)$ of the values are greater than or equal to it.
¡ First, you need to arrange the data in increasing order!


Sample Percentile for Non-Grouped Data


¡ To determine the sample 100p percentile of data of size $n$:

1. Arrange the data in order (lowest to highest).
2. Compute $np$.
3. Test:
a. If $np$ is not a whole number, round it up to the next whole number; the value in that position is the percentile.
b. If $np$ is a whole number, compute the average of the values in positions $np$ and $np + 1$.
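
The three steps above can be sketched as a Python function. The name `sample_percentile` and the small tolerance used to decide whether np is a whole number (a guard against floating-point rounding) are choices made for this illustration; the data set is the one used in the quartile example of these slides.

```python
import math

def sample_percentile(values, p):
    """Sample 100p percentile for non-grouped data, with 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)                      # step 1: order the data
    n = len(ordered)
    position = n * p                              # step 2: compute np
    nearest = round(position)
    if 1 <= nearest < n and math.isclose(position, nearest, abs_tol=1e-9):
        # step 3b: np is whole, average positions np and np + 1 (1-based)
        return (ordered[nearest - 1] + ordered[nearest]) / 2
    # step 3a: np is not whole, round up and take that position (1-based)
    return ordered[math.ceil(position) - 1]

data = [17.11, 6.6, 6.59, 11.06, 2.78, 6.96, 3.79, 4.3]
for p in (0.25, 0.50, 0.75, 0.70):
    print(p, sample_percentile(data, p))
```

Other textbooks and software packages use different percentile conventions (for example, interpolation between neighbors), so results from library functions may differ slightly from this rule.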


Sample Percentile for Non-Grouped Data (2)


¡ If $n = 22$, determine the position of the 80th percentile!

¡ What we can conclude:
¡ At least $np = 22(0.8) = 17.6$ of the values are less than or equal to it.
¡ At least $n(1 - p) = 22(0.2) = 4.4$ of the values are greater than or equal to it.
¡ So the 18th smallest value satisfies both conditions.
¡ This is the sample 80th percentile: $P_{80}$ = 18th value.
¡ If $np$ is an integer (e.g., 18), then both values in positions $np$ and $np + 1$ satisfy both conditions, and so the sample 100p percentile is the average of these values:
$$P_{100p} = \frac{x_{(np)} + x_{(np+1)}}{2}$$
where $x_{(k)}$ denotes the $k$-th smallest value.


Sample Percentile for Non-Grouped Data


¡ First quartile ($Q_1$): the sample 25th percentile.
¡ Second quartile ($Q_2$): the sample 50th percentile → sample median.
¡ Third quartile ($Q_3$): the sample 75th percentile.
¡ Interquartile range ($IQR$): $Q_3 - Q_1$.


Example
¡ Determine first, second, and third quartile, as well as P70 of the following data set !
{17.11, 6.6, 6.59, 11.06, 2.78, 6.96, 3.79, 4.3}
¡ Ordered data set:
{2.78, 3.79, 4.3, 6.59, 6.6, 6.96, 11.06, 17.11}

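¡ $P_{25}$: $np = 8(0.25) = 2$, so $P_{25} = \frac{3.79 + 4.3}{2} = 4.045$
¡ $P_{50}$: $np = 8(0.50) = 4$, so $P_{50} = \frac{6.59 + 6.6}{2} = 6.595$
¡ $P_{75}$: $np = 8(0.75) = 6$, so $P_{75} = \frac{6.96 + 11.06}{2} = 9.01$
¡ $P_{70}$: $np = 8(0.70) = 5.6$, rounded up to position 6, so $P_{70} = 6.96$

Interquartile Range ($IQR$) = $Q_3 - Q_1$ = 9.01 – 4.045 = 4.965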



Grouped Data [Hinkle, 2003]

Sample Percentile for Grouped Data


¡ For a grouped frequency table:
$$X\text{th percentile} = P_X = ll + \left(\frac{np - cf}{f_i}\right)(w)$$
¡ $ll$: lower exact limit of the interval containing the $np$-th score
¡ $n$: total number of scores
¡ $p$: proportion corresponding to the desired percentile
¡ $cf$: cumulative frequency of scores below the interval containing the $np$-th score
¡ $f_i$: frequency of scores in the interval containing the $np$-th score
¡ $w$: width of the class interval

For the left-end-inclusion convention [Ross, 2009], the lower limit of an interval is its left bound.

Example
¡ Find the 34th percentile!

$$X\text{th percentile} = P_X = ll + \left(\frac{np - cf}{f_i}\right)(w)$$

$$P_{34} = 44.5 + \left(\frac{180(0.34) - 50}{42}\right)(5) = 45.83$$
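
The grouped-data percentile formula translates directly into a short function. In this sketch the function and parameter names are illustrative; the numbers plugged in are the ones shown in this example and in the grouped median exercise (ll = 44.5, n = 180, cf = 50, f_i = 42, w = 5).

```python
def grouped_percentile(ll, n, p, cf, f_i, w):
    """P_X = ll + ((n * p - cf) / f_i) * w for a grouped frequency table.

    ll  : lower exact limit of the interval containing the np-th score
    n   : total number of scores
    p   : proportion for the desired percentile (0.34 for the 34th)
    cf  : cumulative frequency below that interval
    f_i : frequency within that interval
    w   : width of the class interval
    """
    return ll + ((n * p - cf) / f_i) * w

p34 = grouped_percentile(ll=44.5, n=180, p=0.34, cf=50, f_i=42, w=5)
median = grouped_percentile(ll=44.5, n=180, p=0.50, cf=50, f_i=42, w=5)
print(round(p34, 2), round(median, 2))
```

The caller is responsible for locating the correct interval first; the function only performs the interpolation inside it.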


Percentile Rank
¡ Percentile rank of a score is the percent of scores less than or equal to that score.

¡ Suppose you got 65 on the final exam of this course. You want to know what percent of
students scored lower.
¡ Writing convention
¡ Percentile rank of score 65 = $PR_{65}$


Percentile Rank for Non-Grouped Data

$$PR = \frac{CF' + (0.5)F}{N} \times 100$$

CF' = the count of all scores less than the score of interest
F = frequency of the score of interest
N = number of scores


Percentile Rank for Non-Grouped Data (2)


¡ Find percentile rank of a score of 12 from the following dataset
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
¡ Ordered data set:

2, 3, 5, 6, 8, 10, 12, 15, 18, 20

¡ Percentile rank:
$$PR_{12} = \frac{6 + (0.5)(1)}{10} \times 100 = 65$$
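
The calculation above can be sketched in Python using the quantities CF', F, and N defined for non-grouped data. The function name `percentile_rank` is an illustrative choice.

```python
def percentile_rank(values, score):
    """Percentile rank = (CF' + 0.5 * F) / N * 100 for non-grouped data."""
    cf_below = sum(1 for v in values if v < score)   # CF': scores below the score
    f = sum(1 for v in values if v == score)         # F: frequency of the score
    n = len(values)                                  # N: number of scores
    return 100 * (cf_below + 0.5 * f) / n

data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
pr_12 = percentile_rank(data, 12)
print(pr_12)
```

Sorting is not needed here because counting the scores below the score of interest gives the same CF' as reading it from the ordered list.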


Grouped Data [Hinkle, 2003]

Percentile Rank for Grouped Data


¡ Percentile rank in grouped data:
$$PR_X = \left(\frac{cf + \frac{X - ll}{w} f_i}{n}\right)(100)$$
¡ $PR_X$ = percentile rank of score $X$
¡ $cf$ = cumulative frequency of scores below the interval containing the percentile point
¡ $ll$ = exact lower limit of the interval containing the percentile point
¡ $w$ = width of the class interval
¡ $f_i$ = frequency of scores in the interval containing the percentile point
¡ $n$ = total number of scores

For the left-end-inclusion convention [Ross, 2009], the lower limit of an interval is its left bound.

Percentile Rank
¡ Find percentile rank of score 61 !

$$PR_X = \left(\frac{cf + \frac{X - ll}{w} f_i}{n}\right)(100)$$

$$PR_{61} = \left(\frac{159 + \frac{61 - 59.5}{5}(15)}{180}\right)(100) = 90.83$$
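
The grouped percentile-rank formula can be sketched the same way. The function and parameter names below are illustrative; the numbers are the ones substituted in this example (cf = 159, ll = 59.5, w = 5, f_i = 15, n = 180).

```python
def grouped_percentile_rank(x, ll, w, cf, f_i, n):
    """PR_X = ((cf + ((x - ll) / w) * f_i) / n) * 100 for grouped data."""
    within_interval = ((x - ll) / w) * f_i   # estimated scores below x inside the interval
    return (cf + within_interval) / n * 100

pr_61 = grouped_percentile_rank(x=61, ll=59.5, w=5, cf=159, f_i=15, n=180)
print(round(pr_61, 2))
```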


Ogive & Percentile


¡ An ogive (cumulative frequency curve) can be used to find percentiles and percentile ranks.


Percentile Rank as an Ordinal Scale


¡ Position of percentiles for a normal distribution:
¡ In the middle, a difference of 6 raw-score points (45-51) is equivalent to a difference of 20 percentile points.
¡ In the tails, the opposite phenomenon occurs: a large raw-score difference corresponds to only a few percentile points.
¡ Percentile rank is therefore an ordinal scale.

Percentile Rank as an Ordinal Scale


¡ Recall the criteria of an ordinal scale:
¡ Mutually exclusive → yes
¡ Some logical order → yes
¡ The raw-score differences between two pairs of percentiles that are equally far apart in percentile points may not be the same → ordinal
¡ Percentiles should be used only for describing points in a distribution (relative position/rank in a distribution), NOT for making comparisons across distributions.


Box Plot
¡ A box plot is a straight line segment stretching from the smallest to the largest data value, with a "box" drawn over it that spans the first to the third quartile.

¡ Quartiles of the example data set {2.78, 3.79, 4.3, 6.59, 6.6, 6.96, 11.06, 17.11}:
¡ $P_{25}$: $np = 8(0.25) = 2$, so $P_{25} = \frac{3.79 + 4.3}{2} = 4.045$
¡ $P_{50}$: $np = 8(0.50) = 4$, so $P_{50} = \frac{6.59 + 6.6}{2} = 6.595$
¡ $P_{75}$: $np = 8(0.75) = 6$, so $P_{75} = \frac{6.96 + 11.06}{2} = 9.01$

Interquartile Range ($IQR$) = $Q_3 - Q_1$ = 9.01 – 4.045 = 4.965



Box Plot (2)


¡ Box plot of the example data set, marked at these five values:

Min     Q1      Q2      Q3      Max
2.78    4.045   6.595   9.01    17.11


Outliers
¡ An outlier is an unusual score in a distribution that may warrant special consideration.

¡ Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.
¡ An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in
a medical test could indicate presence of an illness in the patient.


Outliers & Box Plot


¡ 5 important numbers on the box plot (from top to bottom):
¡ $RUB$ (reasonable upper boundary): $RUB = Q_3 + 1.5(IQR)$
¡ $Q_3$ (third quartile)
¡ Median ($Q_2$)
¡ $Q_1$ (first quartile)
¡ $RLB$ (reasonable lower boundary): $RLB = Q_1 - 1.5(IQR)$

¡ Outliers are all scores above the $RUB$ or below the $RLB$.
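
Applying these boundaries to the data set from the quartile example gives a concrete check. The sketch below takes Q1 = 4.045 and Q3 = 9.01 directly from that example rather than recomputing them; the variable names are illustrative.

```python
data = [2.78, 3.79, 4.3, 6.59, 6.6, 6.96, 11.06, 17.11]

q1, q3 = 4.045, 9.01          # quartiles taken from the earlier example
iqr = q3 - q1                 # interquartile range

rub = q3 + 1.5 * iqr          # reasonable upper boundary
rlb = q1 - 1.5 * iqr          # reasonable lower boundary

# Outliers are the scores above the RUB or below the RLB
outliers = [x for x in data if x > rub or x < rlb]
print(round(rlb, 4), round(rub, 4), outliers)
```

With these boundaries only the largest value, 17.11, falls outside the reasonable range.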