
Subtopics

❑ Representation of Data
▪ Histogram
▪ Cumulative Frequency Curve (Ogive – “less than”)
▪ Stem and leaf display

❑ Measures of Central tendency


▪ Mean
▪ Mode
▪ Median

2
Introduction to Statistics
Statistics may mean any of the following:
1) Numerical facts
Example: Total students enrolled in UTAR over
the years 2000-2003.
2) Measures based on sample data
Example: A sample mean is known as a statistic.
3) Field or discipline of study
- Concerned with scientific techniques used for
collecting, organizing, summarizing, presenting and
analyzing data; drawing valid conclusions and
making decisions based on such analysis.
3
Types of Statistics
Descriptive statistics
Consists of methods for organizing, summarizing, presenting and analyzing data by using tables, graphs, and summary measures.

Inferential statistics
Consists of methods that use sample results to help make decisions or predictions about a population. Also called inductive statistics.
Population Versus Samples
Population
• Consists of all elements (individuals, items or objects) whose characteristics are being studied.
• Census: Collecting information from every member of the target population.
• Statistical measures obtained from population data are called parameters.

Sample
• A portion of the population selected for study.
• Sample survey: Collecting information from a portion of the target population.
• Statistical measures obtained from sample data are called sample statistics.
Figure 1.1 Population and Sample
(Source: Prem Mann, Introductory Statistics, 7/E. Copyright © 2010 John Wiley & Sons. All rights reserved.)
Why is a Sample used?

Most of the time we cannot study the entire population, so we must use a sample as a guide because:
✔It would take too much time to study the entire
population
✔It would cost too much money to study the entire
population
✔It might not be possible to identify all the members of
the population

7
Types of Variables
Discrete
data

Example 1.1
1. Explain whether each of the following constitutes a
population or a sample.

(a) Scores of all students in a statistics class


(b) Yield of potatoes per acre for 10 pieces of land
(c) Weekly salaries of all employees of a company
(d) Cattle owned by 100 farmers in Kedah.
(e) Numbers of computers sold during the past week at
all computer stores in Los Angeles

11
Example 1.2
Based on the following statements, determine whether the
data obtained is discrete or continuous data.
(a) Age of a person.
(b) Result obtained when a fair die is thrown.
(c) Time (in minutes) taken to run 100 meters.
(d) Average monthly expenditure (in RM) on household
goods.
(e) Number of robberies reported per day.
(f) Diameter (in nearest cm) of a tennis ball.

12
Example 1.2 (solution)
(a) Age of a person. Continuous data
(b) Result obtained when a fair die is thrown. Discrete data
(c) Time taken (in minutes) to run 100 meters. Continuous data
(d) Average monthly expenditure (in RM) on household goods. Continuous data
(e) Number of robberies reported per day. Discrete data
(f) Diameter (to the nearest cm) of a tennis ball. Discrete data (the recorded values are whole numbers of centimetres)
Presentation Of Data
Raw data
Data recorded in the sequence in which they are collected
and before they are processed or ranked

(a) Qualitative data: collected on a qualitative variable (nonnumeric categories)
(b) Quantitative data: collected on a quantitative variable (measured numerically)

15
Organizing and Graphing
- Data are organized into tables and displayed in graphs and charts.

Qualitative data
a) Frequency distribution
b) Relative frequency and percentage distributions

Quantitative data
a) Stem and leaf displays
b) Frequency distribution
c) Relative frequency and percentage distributions
d) Graphing frequency distribution:
   - Histogram
   - Shape of histogram
e) Cumulative frequency distribution
f) Graphing cumulative frequency distribution:
   - Ogive or cumulative frequency curve/polygon
16
Quantitative
data

17
Stem plots (Stem and
leaf diagram)

18
Stem and Leaf Plots
• A simple graph for quantitative data
• Uses the actual numerical values of each data point.

– Divide each measurement into two parts: the stem and the leaf.
– List the stems in a column, with a vertical line to their
right.
– For each measurement, record the leaf portion in the
same row as its matching stem.
– Order the leaves from lowest to highest in each stem.
– Provide a key to your coding.
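The steps above can be sketched in Python. This is only an illustration under stated assumptions: the names `stem_and_leaf` and `render` and the `leaf_unit` parameter are made up for this sketch, values are assumed to be non-negative whole numbers, and the data are the 24 test marks of Example 1.3.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Group whole numbers by stem (value // leaf_unit) and sort the leaves."""
    rows = defaultdict(list)
    for value in values:
        rows[value // leaf_unit].append(value % leaf_unit)
    # Stems in ascending order, leaves ordered from lowest to highest.
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}

def render(rows, leaf_digits=1):
    """One text line per stem, with a vertical bar between stem and leaves."""
    return "\n".join(
        f"{stem} | " + " ".join(f"{leaf:0{leaf_digits}d}" for leaf in leaves)
        for stem, leaves in rows.items()
    )

marks = [52, 65, 53, 42, 85, 57, 76, 69, 44, 57, 60, 67,
         65, 74, 72, 53, 81, 68, 51, 62, 87, 56, 90, 70]
print(render(stem_and_leaf(marks)))
print("Key: 4|2 = 42 marks")
```

For three-digit data with two-digit leaves, such as the electricity bills of Example 1.4, the same sketch would be called with `leaf_unit=100` and `leaf_digits=2`.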
19
Example 1.3
The marks of 24 students on a statistics test:

52 65 53 42 85 57 76 69 44 57 60 67
65 74 72 53 81 68 51 62 87 56 90 70

Draw a stem plot for the marks of these students.

20
Example 1.3 (solution)
52 65 53 42 85 57 76 69 44 57 60 67
65 74 72 53 81 68 51 62 87 56 90 70

42 44
52 53 57 57 53 51 56
65 69 60 67 65 68 62
76 74 72 70
85 81 87
90

21
Example 1.3 (solution)
Stem and Leaf display:
Unordered leaves:
Stem | Leaf
4 | 2 4
5 | 2 3 3 7 1 7 6
6 | 5 5 8 9 2 0 7
7 | 4 2 6 0
8 | 5 1 7
9 | 0

Ordered leaves:
Stem | Leaf
4 | 2 4
5 | 1 2 3 3 6 7 7
6 | 0 2 5 5 7 8 9
7 | 0 2 4 6
8 | 1 5 7
9 | 0

Key: 4|2 = 42 marks

22
Example 1.4
The monthly electricity bill (in RM) paid by a sample of 24
households selected from a city :

192 235 253 302 455 157 156 549 244 257 350 447
155 154 352 353 410 148 151 502 247 356 246 244

Draw a stem plot for the data.

23
Example 1.4 (solution)

192 235 253 302 455 157 156 549 244 257 350 447
155 154 352 353 410 148 151 502 247 356 246 244

192 157 156 155 154 148 151


235 253 244 257 247 246 244
302 350 352 353 356
455 447 410
549 502

24
Example 1.4 (solution)
Stem and Leaf display:
Unordered leaves:
Stem | Leaf
1 | 92 55 54 57 48 56 51
2 | 35 53 44 47 57 46 44
3 | 52 02 53 56 50
4 | 55 10 47
5 | 49 02

Ordered leaves:
Stem | Leaf
1 | 48 51 54 55 56 57 92
2 | 35 44 44 46 47 53 57
3 | 02 50 52 53 56
4 | 10 47 55
5 | 02 49

Key: 4|55 = RM 455


Example 1.5
The heights of students (to the nearest cm) in a class are given below:

152 145 153 142 155 157 156 149 144 157 150 147
155 154 152 153 151 148 151 152 147 156 146 144

Draw a stem plot for the heights of these students using a class interval of 2 cm.

26
Example 1.5 (solution)
Smallest value = 142
Highest value = 157.

If a class interval of 2 cm is used, the following class intervals are obtained:

142-143, 144-145, 146-147, 148-149,
150-151, 152-153, 154-155, 156-157.

27
Example 1.5 (solution)
152 145 153 142 155 157 156 149 144 157 150 147
155 154 152 153 151 148 151 152 147 156 146 144

142-143 : 142
144-145 : 145, 144, 144
146-147 : 147, 147, 146
148-149 : 149, 148
150-151 : 150, 151, 151
152-153 : 152, 153, 152, 153, 152
154-155 : 155, 155, 154
156-157 : 157, 156, 157, 156

28
Example 1.5 (solution)
Stem and Leaf display:
Unordered leaves:
Stem | Leaf
14 | 2
14 | 5 4 4
14 | 7 7 6
14 | 8 9
15 | 0 1 1
15 | 2 3 2 3 2
15 | 5 5 4
15 | 7 6 7 6

Ordered leaves:
Stem | Leaf
14 | 2
14 | 4 4 5
14 | 6 7 7
14 | 8 9
15 | 0 1 1
15 | 2 2 2 3 3
15 | 4 5 5
15 | 6 6 7 7

Key: 14|2 = 142 cm
Example 1.6
For the data in Example 1.5, draw a stem and leaf
diagram using the following class intervals:

142-144, 145-147, 148-150,


151-153, 154-156, 157-159.

30
Example 1.6 (solution)
The interval 148-150 cannot be represented by the ‘stem’
14 because the ‘tens’ digit changes in this interval.

Therefore, the lower limit of each class (142, 145, 148, 151, 154 and 157) is used as the ‘stem’. The ‘leaf’ is the value that is added to the ‘stem’ to recover the measurement.

31
Example 1.6 (solution)
152 145 153 142 155 157 156 149 144 157 150 147
155 154 152 153 151 148 151 152 147 156 146 144

142-144 : 142, 144, 144


145-147 : 145, 147, 147, 146
148-150 : 149, 150, 148
151-153 : 152, 153, 152, 153, 151, 151, 152
154-156 : 155, 156, 155, 154, 156
157-159 : 157, 157

32
Example 1.6 (solution)
Stem and Leaf display:
Stem | Leaf
142 | 0 2 2
145 | 0 1 2 2
148 | 0 1 2
151 | 0 0 1 1 1 2 2
154 | 0 1 1 2 2
157 | 0 0

Key: 142|2 means 142 + 2 = 144 cm


33
Ungrouped data
and grouped data

34
Ungrouped data
1. Raw Data

Example:
A survey on the number of male children in 20 families:
1 4 2 0 2 3 3 2 1 4
5 2 1 2 0 1 2 3 1 2

The above data is called raw data.

35
Ungrouped data
2. Array
• An arrangement of quantitative raw data in
ascending or descending order.

Example:
Number of male children in 20 families in ascending
order:
0 0 1 1 1 1 1 2 2 2
2 2 2 2 3 3 3 4 4 5

The above data is called an array.


36
Summary of Ungrouped data and
Grouped data

37
Ungrouped data (frequency table):
Number of male children | Number of families
0 | 10
1 | 12
2 | 8
3 | 6
4 | 3
5 | 1

Grouped data (grouped frequency table):
Height (cm) | Number of children
100 - < 105 | 10
105 - < 110 | 12
110 - < 115 | 8
115 - < 120 | 6
120 - < 125 | 3
Relative Frequency and Percentage
The relative frequencies and percentages for a
quantitative data set are obtained as follows:
Relative frequency of a category
= (frequency of that category) / (sum of all frequencies) = $\dfrac{f}{\sum f}$

Percentage = relative frequency × 100%
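As a small illustration of the two formulas, the sketch below computes relative frequencies and percentages for the employee ages of Example 1.7 (b). The function name `relative_frequencies` and the dictionary layout are assumptions made for this sketch.

```python
def relative_frequencies(freqs):
    """Relative frequency of each class: f divided by the sum of all f."""
    total = sum(freqs)
    return [f / total for f in freqs]

# Ages of all 50 employees of a company (Example 1.7 (b)).
ages = {"18 to <30": 12, "30 to <42": 19, "42 to <54": 14, "54 to <66": 5}
rel = relative_frequencies(list(ages.values()))

for label, f, r in zip(ages, ages.values(), rel):
    print(f"{label}: f = {f}, relative frequency = {r:.2f}, percentage = {r * 100:.0f}%")
```

The relative frequencies always sum to 1 and the percentages to 100%, which is a quick check on the table.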

39
Grouped frequency distribution: Class limits
and Class Boundaries
• Class limits: The smallest and largest possible
measurements in each class, that is, the lower and upper
limits of each class.
• Class boundaries: The dividing lines between successive
classes.
• Class Boundary (discrete data): Given by the midpoint
of the upper limit of one class and the lower limit of the
next class.
• Class Boundary (continuous data): Corresponds to the upper limit of one class or the lower limit of the next class.
Frequency distribution: Class width
• The difference between the two boundaries of a class
gives the class width. The class width is also called the
class size or class interval.
Class Width = Upper boundary – Lower boundary
• The class width can also be determined by finding the
difference between the lower limit of the next class and
the lower limit of the class.
Class Width =
Lower limit of the next class – Lower limit of the class

41
Frequency distribution:
Class midpoint
• The midpoint of each class is called class midpoint or
class mark. It lies half-way between the class limits or the
class boundaries.
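The definitions of class boundary, class width and class midpoint can be checked with a short sketch. The function name `class_summary` and its arguments are hypothetical, and the sketch assumes that the gap between consecutive classes is the same on both sides of the class (1 for whole-number limits such as 110 – 129, 0 for classes written as "a to less than b").

```python
def class_summary(lower_limit, upper_limit, next_lower_limit):
    """Boundaries, width and midpoint of one class, given its limits."""
    # Gap between this class and the next one: 1 for 110-129 / 130-149,
    # 0 for continuous classes such as 25-<30 / 30-<35.
    gap = next_lower_limit - upper_limit
    lower_boundary = lower_limit - gap / 2
    upper_boundary = upper_limit + gap / 2
    return {
        "boundaries": (lower_boundary, upper_boundary),
        "width": upper_boundary - lower_boundary,
        "midpoint": (lower_limit + upper_limit) / 2,
    }

print(class_summary(110, 129, 130))  # discrete classes, e.g. weekly sales in units
print(class_summary(25, 30, 30))     # continuous classes, e.g. age in years
```

The width obtained from the boundaries agrees with the second rule above: lower limit of the next class minus lower limit of the class (130 − 110 = 20 and 30 − 25 = 5).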

42
Example 1.7 (a) (Discrete Data)

43
Example 1.7 (b) (Continuous Data)

The following table gives the frequency


distribution of ages for all 50 employees of a
company.

Age No. of Employees


18 to less than 30 12
30 to less than 42 19
42 to less than 54 14
54 to less than 66 5

44
Example 1.7 (b) (Continuous Data)

Age | Class boundaries | Class width, c | Midpoint, m | Frequency, f | Relative frequency | Percentage (%)
18 – 30 | 18 – 30 | 12 | 24 | 12 | 12/50 = 0.24 | 24
30 – 42 | 30 – 42 | 12 | 36 | 19 | 19/50 = 0.38 | 38
42 – 54 | 42 – 54 | 12 | 48 | 14 | 14/50 = 0.28 | 28
54 – 66 | 54 – 66 | 12 | 60 | 5 | 5/50 = 0.10 | 10
Sum | | | | $\sum f$ = 50 | 1.00 | 100%
Summary of Class Boundaries
Continuous grouped data, x | Equivalent continuous class boundaries
0 − 20 | 0 – 20
20 − 40 | 20 – 40

Discrete grouped data, x | Equivalent discrete class boundaries
1 − 20 | 0.5 – 20.5
21 − 40 | 20.5 – 40.5
Example 1.8
The table below shows the class boundaries, class width and
class midpoints for the grouped frequency distribution of
weekly sales in units.
Sales (units), class limits | Class boundaries | Class width, c | Class midpoint, m

110 – 129
130 – 149
150 – 169
170 – 189
190 – 209

47
Example 1.9
The table below shows the class boundaries, class width and
class midpoints for the grouped frequency distribution of
age of employees in years.
Age (years), class limits | Class boundaries | Class width, c | Class midpoint, m
25 – < 30
30 – < 35
35 – < 40
40 – < 45
45 – < 50

48
Histogram
▪ A diagrammatic presentation of a frequency
distribution.
▪ Histogram for Frequency Distribution
- All bars are of same width.
- Height of every bar is proportional to the frequency
of the corresponding class.

▪ A histogram is a graph in which class boundaries are marked on the horizontal (x) axis and the frequencies, relative frequencies, or percentages are marked on the vertical (y) axis.
Example 1.10
The table below shows the times (in minutes) taken by 60
students to complete a model in a competition.
Time (minutes) | 30–<35 | 35–<40 | 40–<45 | 45–<50 | 50–<55 | 55–<60 | 60–<65
Frequency | 2 | 17 | 18 | 13 | 6 | 1 | 3
Illustrate the above data with a histogram.

50
Example 1.10 (solution)
Time (minutes), class boundaries | Frequency, f
30–35 | 2
35–40 | 17
40–45 | 18
45–50 | 13
50–55 | 6
55–60 | 1
60–65 | 3
Note: Since all classes have the same width, a 2-column table showing class boundaries and frequency is all that is required for drawing the histogram.
Example 1.10 (solution)
Histogram: Times (in minutes) taken by 60 students to complete a model in a competition
[Histogram not reproduced. Horizontal axis: class boundaries 30, 35, 40, 45, 50, 55, 60, 65.]
“Less than” Cumulative
Frequency Curve
▪ A graphical presentation of a “less than” cumulative
frequency distribution is called a “less than” ogive or
“less than” cumulative frequency curve/polygon.

▪ A “less than” ogive is a graph showing the cumulative frequency less than the upper class boundary of a class plotted against the upper class boundary of the class.

53
“Less than” Cumulative
Frequency Curve
An ogive for the cumulative frequency distribution can be
presented in two forms:

• As a smooth curve – by drawing a smooth curve passing through the dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of the respective classes.

• As a polyline between points – by joining with straight lines the dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of the respective classes.
Example 1.11
Construct ‘less than’ cumulative frequency distribution and
state upper class boundaries.
Marks Number of students (Frequency)
40 – 49 2
50 – 59 10
60 – 69 18
70 – 79 13
80 – 89 5
90 – 99 2
Total 50
55
Example 1.11 (solution)
Discrete variable:

Class boundaries | Upper class boundary | ‘Less than’ cumulative frequency
- | < 39.5 | 0
39.5 – 49.5 | < 49.5 | 2 (0+2)
49.5 – 59.5 | < 59.5 | 12 (2+10)
59.5 – 69.5 | < 69.5 | 30 (12+18)
69.5 – 79.5 | < 79.5 | 43 (30+13)
79.5 – 89.5 | < 89.5 | 48 (43+5)
89.5 – 99.5 | < 99.5 | 50 (48+2)
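The running totals in a "less than" cumulative frequency table can be reproduced with a short sketch. The function name `less_than_cumulative` is an assumption made for this illustration; the data are the marks of Example 1.11, and the returned pairs are the points that would be plotted on the ogive.

```python
from itertools import accumulate

def less_than_cumulative(lowest_boundary, upper_boundaries, freqs):
    """Pairs (upper class boundary, cumulative frequency below that boundary)."""
    points = [(lowest_boundary, 0)]          # nothing lies below the first boundary
    points += zip(upper_boundaries, accumulate(freqs))
    return points

upper = [49.5, 59.5, 69.5, 79.5, 89.5, 99.5]
freq = [2, 10, 18, 13, 5, 2]

for boundary, cumulative in less_than_cumulative(39.5, upper, freq):
    print(f"< {boundary}: {cumulative}")
```

The last cumulative frequency must equal the total frequency (50 here), which is a useful check before drawing the curve.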
56
Example 1.12
Construct ‘less than’ cumulative frequency distribution and
state upper class boundaries for the completion times taken
by all 120 workers to complete a standard task in a factory.

Completion time (minutes) Number of workers


10 – less than 12 9
12 – less than 14 29
14 – less than 16 42
16 – less than 18 26
18 – less than 20 14

57
Example 1.12 (solution)
Continuous variable :

Class boundaries | Upper class boundary | ‘Less than’ cumulative frequency
- | < 10 | 0
10 - 12 | < 12 | 9 (0+9)
12 - 14 | < 14 | 38 (9+29)
14 - 16 | < 16 | 80 (38+42)
16 - 18 | < 18 | 106 (80+26)
18 - 20 | < 20 | 120 (106+14)
Example 1.13
Construct a “less than” cumulative frequency distribution with upper class boundaries and draw a cumulative frequency curve based on the following information.
The weights of 20 students (to the nearest kg):
Weight (kg) | 60 – 62 | 63 – 65 | 66 – 68 | 69 – 71 | 72 – 74
Number of students | 3 | 4 | 5 | 6 | 2
Estimate from the ogive,
(i) the total number of students of weight less than 67 kg.
(ii) the value of x, if 20% of the students were of weight x kg or more.
(iii) the number of students of weight 64 kg or more.
(iv) the minimum weight of the heaviest 10% of the students in the group.
Example 1.13 (solution)
The table for “less than” cumulative frequency distribution
and the upper class boundaries.
Class boundaries | Upper class boundary | Cumulative frequency
- | < 59.5 | 0
59.5 – 62.5 | < 62.5 | 3
62.5 – 65.5 | < 65.5 | 7
65.5 – 68.5 | < 68.5 | 12
68.5 – 71.5 | < 71.5 | 18
71.5 – 74.5 | < 74.5 | 20
Example 1.13 (solution)
“Less than” ogive for the weight of 20 students (nearest kg)
[Ogive not reproduced. Vertical axis: Cumulative Frequency, 0 to 20. Horizontal axis: Weight (kg), marked at 59.5, 62.5, 65.5, 68.5, 71.5, 74.5, 77.5. Readings marked on the graph: (i) 9, (ii) 70, (iii), (iv) 71.5 kg.]
Example 1.13 (solution)
From the ogive:
(i) Number of students with weight less than 67 kg = 9
(ii) Position of x = (80/100) × 20 = 16th, so x = 70
(iii) 5 students weigh less than 64 kg, therefore 20 − 5 = 15 students weigh 64 kg or more.
(iv) The heaviest 10% of the students: (10/100) × 20 = 2 students (the last 2 students).
Position on cumulative frequency = 20 − 2 = 18th
The heaviest 2 students weigh at least 71.5 kg.
Example 1.14
The table below shows the bonuses given out to 250 employees of a factory
in one particular year.
Bonus (RM) Number of employees
40 ≤ x < 50 3
50 ≤ x < 60 8
60 ≤ x < 80 27
80 ≤ x < 100 75
100 ≤ x < 120 79
120 ≤ x < 150 44
150 ≤ x < 200 14
i. Construct a “less than” cumulative percentage distribution with upper class boundaries and draw a cumulative percentage ogive.
ii. Estimate the number of employees receiving at least RM115 as bonus.
Ans: (ii) 80 employees
Example 1.14 (solution)
(i) Bonuses (RM) for 250 employees
[Cumulative percentage table and ogive not reproduced.]

(ii) The number of employees receiving at least RM115 as bonus: 80 employees
Measure Of Central Tendency

69
Measure of Central Tendency

• A measure of central location for a data set that can be used as a summary value for that data set.

• There are three measures of central location:

1) Mean 2) Median 3) Mode

70
Mean
Mean is the average of values. It is also known as
arithmetic mean.

▪ The mean does not necessarily correspond to one of the values in the original data.
▪ The mean is influenced by extreme values/outliers (values that are very small or very large relative to the majority of the values in a data set).
▪ The mean is therefore not a suitable summary for a data set that contains extreme values.

71
Mean of Ungrouped data

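Mean for population data: $\mu = \dfrac{\sum x}{N}$

Mean for sample data: $\bar{x} = \dfrac{\sum x}{n}$

Where,
$\sum x$ = sum of all values, $N$ = population size, $n$ = sample size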

72
Example 1.15
Find the mean of the set of the numbers
{ 12,18, 13, 10, 6, 23, 16}.

Solution: Mean = (12 + 18 + 13 + 10 + 6 + 23 + 16) / 7 = 98 / 7 = 14

73
Mean of Ungrouped data with frequency

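Mean for population data: $\mu = \dfrac{\sum fx}{\sum f}$

Mean for sample data: $\bar{x} = \dfrac{\sum fx}{\sum f}$

Where,
$x$ = value, $f$ = frequency of that value, $\sum f$ = total number of observations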

74
Mean of Grouped data

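Mean for population data: $\mu = \dfrac{\sum fm}{\sum f}$

Mean for sample data: $\bar{x} = \dfrac{\sum fm}{\sum f}$

Where,
$m$ = midpoint of the class, $f$ = frequency of the class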

75
Example 1.16

x 4 5 7 10 11 15 17
f 3 12 23 10 14 8 2

Calculate the mean for the sample data in the frequency


table above.

Ans: 8.9028
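A quick way to check the stated answer is to compute the frequency-weighted mean directly. The sketch below assumes the formula mean = Σfx / Σf; the function name `mean_from_frequencies` is made up for this illustration.

```python
def mean_from_frequencies(values, freqs):
    """Mean of ungrouped data with frequencies: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

x = [4, 5, 7, 10, 11, 15, 17]
f = [3, 12, 23, 10, 14, 8, 2]

print(round(mean_from_frequencies(x, f), 4))
```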

76
Example 1.16 (solution)
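x | f | fx
4 | 3 | 12
5 | 12 | 60
7 | 23 | 161
10 | 10 | 100
11 | 14 | 154
15 | 8 | 120
17 | 2 | 34
Sum | $\sum f$ = 72 | $\sum fx$ = 641
Example 1.16 (solution)

Mean, $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{641}{72} = 8.9028$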
78
Example 1.17
Calculate the sample mean of the following grouped
frequency distribution and interpret the value.
Sales (units) Frequency, f
1–4 5
5–8 13
9 – 12 31
13 – 16 19
17 – 20 8
21 – 24 4
Ans: 11.7 units 79
Example 1.17 (solution)
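Sales (units) | Mid-point, m | f | fm
1–4 | 2.5 | 5 | 12.5
5–8 | 6.5 | 13 | 84.5
9 – 12 | 10.5 | 31 | 325.5
13 – 16 | 14.5 | 19 | 275.5
17 – 20 | 18.5 | 8 | 148.0
21 – 24 | 22.5 | 4 | 90.0
Sum | | $\sum f$ = 80 | $\sum fm$ = 936.0

Example 1.17 (solution)
Mean, $\bar{x} = \dfrac{\sum fm}{\sum f} = \dfrac{936}{80} = 11.7$ units

Interpretation: The average number of sales is 11.7 units.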
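The grouped-data calculation can be sketched as follows, with each class represented by its midpoint. This assumes the formula mean = Σfm / Σf; the function name `grouped_mean` and the tuple layout for class limits are assumptions made for this sketch.

```python
def grouped_mean(class_limits, freqs):
    """Estimate the mean of grouped data, using the class midpoint m for every class."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    return sum(f * m for m, f in zip(midpoints, freqs)) / sum(freqs)

sales_limits = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24)]
sales_freq = [5, 13, 31, 19, 8, 4]

print(grouped_mean(sales_limits, sales_freq))
```

Because the individual values inside each class are unknown, the result is an estimate of the mean rather than its exact value.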
81
Median
Median is the middle value of a data set after the
data is arranged in ascending or descending order.
- Median is not influenced by extreme values.
- Being the middle value implies that 50% of
the observations will be less than
the median and 50% of them will be more
than the median.

82
Ungrouped data:

Median position = $\dfrac{n+1}{2}$-th observation

where n = number of observations

Grouped data (Grouped frequency distribution):

Median position = $\dfrac{n}{2}$-th observation, where $n = \sum f$

83
Example 1.18 (Median for Ungrouped data)

Find the median for the set of data shown below and
interpret the value.

75, 67, 48, 66, 89, 51, 70

Ans: 67
84
Example 1.18 (solution)
75, 67, 48, 66, 89, 51, 70

Arrange the numbers in ascending order, which is


48, 51, 66, 67, 70, 75, 89

Median position = (7 + 1)/2 = 4th
Median = 67

Interpretation:
About 50% of the data is less than 67 and about 50% of
the data is more than 67.
85
Example 1.19 (Median for Ungrouped data)
Find the median for the following data :

65, 75, 20, 63, 42, 51, 39, 25

Ans: 46.5 86
Example 1.19 (solution)
Arrange the numbers in ascending order, that is

20, 25, 39, 42, 51, 63, 65, 75

Median position = (8 + 1)/2 = 4.5th

Median = (42 + 51)/2 = 46.5
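The median-position rule for ungrouped data can be sketched in a few lines. The function name `median_with_position` is an assumption made for this illustration; when the position ends in .5, the two neighbouring observations are averaged.

```python
import math

def median_with_position(values):
    """Return (median position, median) using position = (n + 1) / 2."""
    data = sorted(values)
    position = (len(data) + 1) / 2             # 1-based position in the sorted data
    lower = data[math.floor(position) - 1]
    upper = data[math.ceil(position) - 1]      # same element when position is whole
    return position, (lower + upper) / 2

print(median_with_position([75, 67, 48, 66, 89, 51, 70]))      # Example 1.18
print(median_with_position([65, 75, 20, 63, 42, 51, 39, 25]))  # Example 1.19
```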

87
Median for Grouped data (Multi-value
grouping)
The median is the $\dfrac{n}{2}$-th observation and it can be estimated by using the following steps.

1. Find the median class.

2. Determine the total frequency before the median class.

3. Use the method of proportion to calculate the median.


88
In general, by proportion:

$$\text{Median } M = L_M + \left( \frac{\frac{1}{2}\sum f - \sum f_{M-1}}{f_M} \right) \times c$$

where $L_M$ = lower boundary of median class,
$\sum f_{M-1}$ = cumulative frequency before median class,
$f_M$ = frequency of median class,
$c$ = width of median class.

89
Example 1.20
Find the median for the data in the following grouped
frequency distribution and interpret the value.

Number of
defectives 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16
(units)
Number of
weeks 5 8 22 18 36 29 20 12
(Frequency)

Ans: 9.72 units


90
Example 1.20 (solution)
Number of defectives, class boundaries | Frequency, f | Cumulative frequency, F
0.5 – 2.5 | 5 | 5
2.5 – 4.5 | 8 | 13
4.5 – 6.5 | 22 | 35
6.5 – 8.5 | 18 | 53
8.5 – 10.5 | 36 | 89
10.5 – 12.5 | 29 | 118
12.5 – 14.5 | 20 | 138
14.5 – 16.5 | 12 | 150
Example 1.20 (solution)
n = 150
Median position = 150/2 = 75th observation

Median class boundaries: 8.5 – 10.5

Lower boundary of median class, $L_M$ = 8.5
Cumulative frequency before median class, $\sum f_{M-1}$ = 53
Frequency of median class, $f_M$ = 36
Width of median class, c = 2

92
Example 1.20 (solution)

Median, $M = 8.5 + \left( \dfrac{75 - 53}{36} \right) \times 2 = 9.72$ units

Interpretation:
In about 50% of the weeks the number of defectives is less than 9.72 units, and in about 50% of the weeks it is more than 9.72 units.
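The three steps for the grouped median (find the median class, take the cumulative frequency before it, then apply proportion) can be sketched as below. The function name `grouped_median` and the list-of-tuples layout are assumptions made for this illustration; the data are those of Example 1.20.

```python
def grouped_median(boundaries, freqs):
    """Median of grouped data: L_M + ((n/2 - F_before) / f_M) * c."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    position = n / 2
    before = 0                                   # cumulative frequency before the class
    for (lower, upper), f in zip(boundaries, freqs):
        if before + f >= position:               # first class that reaches the position
            return lower + (position - before) / f * (upper - lower)
        before += f

defect_boundaries = [(0.5, 2.5), (2.5, 4.5), (4.5, 6.5), (6.5, 8.5),
                     (8.5, 10.5), (10.5, 12.5), (12.5, 14.5), (14.5, 16.5)]
defect_freq = [5, 8, 22, 18, 36, 29, 20, 12]

print(round(grouped_median(defect_boundaries, defect_freq), 2))
```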
Example 1.21
Find the median for the data in the following grouped
frequency distribution.

Ans: 21.51 minutes

Example 1.21 (solution)

Total frequency, $\sum f$ = 284
Median position = (284/2)th = 142nd observation
Median class boundaries: 20 – 25
Lower boundary, $L_M$ = 20
Cumulative frequency before the median class, $\sum f_{M-1}$ = 123
Frequency of the median class, $f_M$ = 63
Width of the median class, c = 25 − 20 = 5

Median, $M = 20 + \left( \dfrac{142 - 123}{63} \right) \times 5 = 21.51$ minutes
97
Example 1.22
Determining Median Using Ogive
100 earthworms were collected from a garden. The length
(to the nearest mm) of the earthworms is recorded as shown
in the table below.
Length (mm) | 95–109 | 110–124 | 125–139 | 140–154 | 155–169 | 170–184 | 185–199 | 200–214
Number of earthworms | 2 | 8 | 17 | 26 | 24 | 16 | 6 | 1

Draw a ‘less than’ cumulative frequency curve for the information. Estimate the median length of the worms.
Ans: 153 mm
Example 1.22 (solution)
Length of earthworms (mm) Cumulative
(upper class boundary) frequency, F
< 94.5 0
< 109.5 2
< 124.5 10
< 139.5 27
< 154.5 53
< 169.5 77
< 184.5 93
< 199.5 99
< 214.5 100
99
Example 1.22 (solution)

Median position = n/2 = 100/2 = 50th

100
Example 1.22 (solution)
“Less than” cumulative curve: Length of 100 earthworms

Median = 153 mm
101
Mode of Ungrouped data
Mode is the value of the observation that occurs most frequently.

• The mode may not exist and, if it does, it may not be unique.
• Example: Mode of ungrouped data
• Raw data: 2, 4, 9, 8, 8, 5, 3
– The mode is 8, which occurs twice (unimodal)
• Raw data: 2, 2, 9, 8, 8, 5, 3
– Two modes: 8 and 2 (bimodal)
• Raw data: 2, 4, 9, 8, 5, 3
– No mode (each value occurs only once).
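The three cases (one mode, two modes, no mode) can be reproduced with a small sketch. The function name `modes` is an assumption made for this illustration, and returning an empty list when every value occurs only once is a convention chosen here to match the "no mode" case.

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency; empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 9, 8, 8, 5, 3]))   # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))   # two modes
print(modes([2, 4, 9, 8, 5, 3]))      # no mode
```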
Mode of Grouped data
An estimate of the mode can be obtained from
the modal class (The class which has the largest
standard frequency).

There are 2 methods:


1) Using Calculation
2) Using histogram

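$$\text{Mode} = L_m + \left( \frac{f_m - f_b}{(f_m - f_b) + (f_m - f_a)} \right) \times c$$

where
$f_m$ = frequency of the modal class,
$f_b$ = frequency of the class before the modal class,
$f_a$ = frequency of the class after the modal class,
$L_m$ = lower boundary of the modal class,
$c$ = width of the modal class.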

105
Example 1.23
The marks obtained by 134 students in an examination
are recorded in the following table.
Marks | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70-79 | 80-89
Frequency | 22 | 18 | 22 | 24 | 14 | 14 | 20
(a) Draw a histogram to represent the above
information.
(b) Estimate the mode
(i) from the histogram;
(ii) using the formula.
(c) Interpret the mode.
106
Example 1.23 (solution)
(a)
Marks (class boundaries) | Frequency, f
19.5 – 29.5 22
29.5 – 39.5 18
39.5 – 49.5 22
49.5 – 59.5 24
59.5 – 69.5 14
69.5 – 79.5 14
79.5 – 89.5 20
107
Example 1.23 (solution)
Histogram: Marks obtained by 134 students in an examination
[Histogram not reproduced. Horizontal axis: class boundaries 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5.]

(b)(i) Estimated mode = 52.5 marks
Example 1.23 (solution)
(b)(ii) Calculation Method
Modal class boundaries = 49.5 – 59.5
Lower boundary of modal class, Lm = 49.5
Frequency of modal class, fm = 24
Frequency before modal class, fb = 22
Frequency after modal class, fa = 14
Width of modal class, c = 59.5 – 49.5 =10

109
Example 1.23 (solution)

c) Interpretation

Most of the 134 students scored about 52 marks in the examination.

110
Example 1.24
The frequency table below shows the mass of mangoes (in gm) collected from the farm of Mr Nazri Adam during the mango season. Class a – b denotes the interval a ≤ mass < b.
Mass (gm) | 100 – 125 | 125 – 150 | 150 – 175 | 175 – 200 | 200 – 225
Number of mangoes | 28 | 75 | 42 | 26 | 10

Calculate the mode of the mass of the mangoes.

Ans: 139.688 gm
Example 1.24 (solution)

Modal class boundaries = 125 – 150

Lower boundary of modal class, $L_m$ = 125
Frequency density of modal class, $f_m$ = 75
Frequency density before modal class, $f_b$ = 28
Frequency density after modal class, $f_a$ = 42
Width of modal class, c = 150 – 125 = 25

Mode $= 125 + \left( \dfrac{75 - 28}{(75 - 28) + (75 - 42)} \right) \times 25 = 139.688$ gm
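The calculation method can be sketched as follows, assuming the usual proportion formula Mode = Lm + (fm − fb) / ((fm − fb) + (fm − fa)) × c, which reproduces the stated answer of 139.688 gm. The function name `grouped_mode` is hypothetical, and the sketch assumes equal class widths and a single modal class.

```python
def grouped_mode(boundaries, freqs):
    """Estimate the mode from the modal class and its two neighbouring classes."""
    i = max(range(len(freqs)), key=freqs.__getitem__)      # index of the modal class
    f_m = freqs[i]
    f_b = freqs[i - 1] if i > 0 else 0                     # class before (0 if none)
    f_a = freqs[i + 1] if i + 1 < len(freqs) else 0        # class after (0 if none)
    lower, upper = boundaries[i]
    return lower + (f_m - f_b) / ((f_m - f_b) + (f_m - f_a)) * (upper - lower)

mango_boundaries = [(100, 125), (125, 150), (150, 175), (175, 200), (200, 225)]
mango_freq = [28, 75, 42, 26, 10]

print(grouped_mode(mango_boundaries, mango_freq))
```

With unequal class widths the frequencies would first have to be converted to frequency densities, which this sketch does not do.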

112
Subtopics
❑ Measures of Dispersion
▪ Range
▪ Variance
▪ Standard Deviation
❑ Measures of Position
▪ Quartiles
❑ Box Plots
2
Measures Of Dispersion
• Types of measurement which provide
information on the spread or variability of a
set of data.

• There are
1) Range
2) Variance
3) Standard Deviation

3
Measures Of Dispersion
EXAMPLE: The following are two data sets on the ages of all workers in each of two small companies.
Company 1 : 47 38 35 40 36 45 39
Company 2 : 70 33 18 52 27
The mean age of workers in both companies is the same, 40 years. If we are not given the ages of the individual workers and are only told that the mean ages are the same, we may deduce that the workers in the two companies have similar age distributions. But, as we can observe, the variation in the workers’ ages is very different for the two companies. As illustrated in the diagram, the ages of the workers in the second company have a much larger variation than the ages of the workers in the first company.
Measures Of Dispersion
Company 1 : 35 36 38 39 40 45 47

Company 2 : 18 27 33 52 70

Conclusion:
Two data sets can have the same measures of central tendency,
and yet they can still be very different on the variability of values.
A measure of dispersion is used to describe such differences quantitatively.
Range for Ungrouped data

Range = Largest observation − Smallest observation
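As a brief illustration, the sketch below applies the range formula to the ages of the workers in the two companies from the earlier example. The function name `value_range` is an assumption made for this sketch.

```python
def value_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)

company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    print(name, "mean =", sum(ages) / len(ages), "range =", value_range(ages))
```

Both companies have the same mean age, but the ranges differ, which is the difference in spread that the mean alone does not show.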

6
Range
Disadvantage :
1) Range is not a good measure of dispersion
for a data set that contains outliers.
2) Its calculation is based on two values only;
the largest and the smallest. All other
values in a data set are ignored. Thus,
range is not a very satisfactory measure of
dispersion.

7
Standard Deviation
The value of standard deviation tells how closely
the values of a data set are clustered around the
mean.
Population standard deviation = $\sigma$
Sample standard deviation = s

• Standard deviation for ungrouped data


• Standard deviation for grouped data

8
Variance
Variance is the square of standard
deviation.
Population variance = $\sigma^2$

Sample variance = $s^2$

• Variance for ungrouped data


• Variance for grouped data
9
Population Variance and Standard
Deviation For Ungrouped Data

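$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2, \qquad \sigma = \sqrt{\sigma^2}$$

Where,
$\mu$ = population mean of data
N = total number of observations (population size)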

10
Sample Variance and Standard Deviation
For Ungrouped Data

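$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2}$$

Where,
$\bar{x}$ = sample mean of data
n = total number of observations (sample size)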
Example 1.25
Find the variance and the standard deviation
for the following set of sample data.

{ 4, 5 , 6 , 7 , 8 , 9 ,10}

12
Example 1.25 (using original formula)
Solution

$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
4 | -3 | 9
5 | -2 | 4
6 | -1 | 1
7 | 0 | 0
8 | 1 | 1
9 | 2 | 4
10 | 3 | 9
$\sum x = 49$ | | $\sum (x - \bar{x})^2 = 28$
Example 1.25 Solution

Mean, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{49}{7} = 7$

Sample variance, $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{28}{6} = 4.667$

Sample standard deviation, $s = \sqrt{s^2} = \sqrt{4.667} = 2.160$
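The original formula and the alternative (shortcut) formula should give the same sample variance. The sketch below checks this for the data of Example 1.25; the two function names are assumptions made for this illustration, and Python's standard `statistics` module is used only as a cross-check.

```python
import math
import statistics

def sample_variance_definition(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [4, 5, 6, 7, 8, 9, 10]
s2 = sample_variance_definition(data)
print(round(s2, 3), round(sample_variance_shortcut(data), 3), round(math.sqrt(s2), 3))
print(round(statistics.variance(data), 3))   # library cross-check
```

For population data the divisor is N instead of n − 1; `statistics.pstdev` applies that version, for example to the data of Example 1.26.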

14
Example 1.25 (using alternative formula)
Solution
$x$ | $x^2$
4 | 16
5 | 25
6 | 36
7 | 49
8 | 64
9 | 81
10 | 100
$\sum x = 49$ | $\sum x^2 = 371$
Example 1.25 Solution

Sample variance, $s^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \dfrac{371 - \frac{(49)^2}{7}}{6} = 4.667$

Sample standard deviation, $s = \sqrt{4.667} = 2.160$
Example 1.26
Find the standard deviation for the
population data.

3, 5, 6, 4, 6, 5, 6, 8, 5

17
Example 1.26 Solution
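Number of observations, N = 9

$x$ | $x^2$
3 | 9
5 | 25
6 | 36
4 | 16
6 | 36
5 | 25
6 | 36
8 | 64
5 | 25
$\sum x = 48$ | $\sum x^2 = 272$

Standard deviation,
$$\sigma = \sqrt{\frac{\sum x^2}{N} - \left( \frac{\sum x}{N} \right)^2} = \sqrt{\frac{272}{9} - \left( \frac{48}{9} \right)^2} = 1.333$$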
Population Variance For Grouped
Data (Frequency Distribution)

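$$\sigma^2 = \frac{\sum f (m - \mu)^2}{\sum f} = \frac{\sum fm^2}{\sum f} - \left( \frac{\sum fm}{\sum f} \right)^2$$

where, $\mu$ = population mean of the data
m = midpoint of the class
f = frequency of the class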

19
Sample Variance and Standard
Deviation For Grouped Data

$$s^2 = \frac{\sum fm^2 - \frac{(\sum fm)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2}$$

Where,
m = midpoint of the class
$\bar{x}$ = mean of the data
f = frequency of the class
n = $\sum f$
Example 1.27(a)
The grouped frequency distribution below shows the number of sales made by all the salespersons of a company in one particular month. Find the mean and standard deviation.
Sales (units) Frequency
0–9 5
10 – 19 13
20 – 29 23
30 – 39 31
40 – 49 16
21
Example 1.27(a) Solution

Sales (units) | Mid-point, m | Frequency, f | fm | fm²
0 - 9 | 4.5 | 5 | 22.5 | 101.25
10 - 19 | 14.5 | 13 | 188.5 | 2733.25
20 - 29 | 24.5 | 23 | 563.5 | 13805.75
30 - 39 | 34.5 | 31 | 1069.5 | 36897.75
40 - 49 | 44.5 | 16 | 712 | 31684
Sum | | $\sum f$ = 88 | $\sum fm$ = 2556 | $\sum fm^2$ = 85222


Example 1.27(a) Solution

Mean , =  fm
f
2556
=
88
= 29.05 units
Example 1.27(a) Solution

Standard deviation,
$$\sigma = \sqrt{\frac{\sum fm^2}{\sum f} - \left( \frac{\sum fm}{\sum f} \right)^2} = \sqrt{\frac{85222}{88} - \left( \frac{2556}{88} \right)^2} = 11.17 \text{ units}$$
Example 1.27(b)
The following data give the frequency distribution of the
number of orders received each day during a sample
period of 50 days at the office of a mail-order company.

Number of Orders Number of Days


10−12 4
13−15 12
16−18 20
19−21 14
Calculate the variance and standard deviation.
Example 1.27(b) Solution

Number of Orders | Class midpoint, m | Number of Days, f | fm | fm²
10−12 | 11 | 4 | 44 | 484
13−15 | 14 | 12 | 168 | 2352
16−18 | 17 | 20 | 340 | 5780
19−21 | 20 | 14 | 280 | 5600
Sum | | $\sum f$ = 50 | $\sum fm$ = 832 | $\sum fm^2$ = 14216

$$s^2 = \frac{\sum fm^2 - \frac{(\sum fm)^2}{n}}{n - 1} = \frac{14216 - \frac{(832)^2}{50}}{49} = 7.582$$

$s = \sqrt{s^2} = \sqrt{7.582} = 2.754$
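Both grouped-data examples can be checked with one short sketch. The function name `grouped_variance` and its `sample` flag are assumptions made for this illustration: `sample=True` uses the divisor n − 1 (as for the sample of 50 days), while `sample=False` uses the population form (as for all the salespersons of a company).

```python
import math

def grouped_variance(midpoints, freqs, sample=True):
    """Variance of grouped data from class midpoints m and frequencies f."""
    n = sum(freqs)
    sum_fm = sum(f * m for m, f in zip(midpoints, freqs))
    sum_fm2 = sum(f * m * m for m, f in zip(midpoints, freqs))
    if sample:
        return (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
    return sum_fm2 / n - (sum_fm / n) ** 2

# Example 1.27(b): sample of 50 days
orders_var = grouped_variance([11, 14, 17, 20], [4, 12, 20, 14], sample=True)
print(round(orders_var, 3), round(math.sqrt(orders_var), 3))

# Example 1.27(a): population of 88 salespersons
sales_var = grouped_variance([4.5, 14.5, 24.5, 34.5, 44.5], [5, 13, 23, 31, 16], sample=False)
print(round(math.sqrt(sales_var), 2))
```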
Formula List

27
Data Distribution:
Symmetry and Skewness
• If a distribution is represented by a histogram
or a frequency curve, we can see the general
shape of its distribution and the relationship
between the mean, median and mode.

• There are 3 general shapes:


1) Symmetrical distribution
2) Positively skewed distribution
3) Negatively skewed distribution
28
Symmetrical Distribution
(bell shaped)
• Also known as normal distribution

Value of averages: Mode = Median = Mean

29
Positively Skewed Distribution
(Skewed to the right)

Value of averages: Mode < Median < Mean


30
Negatively Skewed Distribution
(Skewed to the left)

Value of averages: Mode > Median >Mean

31
Measures of
Location

32
What is a measure of position?
• A measure of position determines the position of a single value in relation to other values in a sample or a population data set.

• The three commonly used measures of position are quartiles, percentiles, and percentile rank.

33
Quartiles
Quartiles divide a set of data (arranged in
ascending or descending order) into 4 equal
parts.

34
Interquartile Range and Semi-Interquartile Range (or Quartile Deviation)
Interquartile Range (IQR)
= Third Quartile − First Quartile
= $Q_3 - Q_1$

Semi-interquartile range (or quartile deviation)
= $\dfrac{1}{2}(Q_3 - Q_1)$
Example 1.28 (Quartiles For Ungrouped Data)

(a) Find the values of the three quartiles. Where does the number of
car thefts of 40,197 fall in relation to these quartiles?
(b) Find the interquartile range.
Example 1.28 Solution

(a) Rank the data in increasing order.

Position of median = (12 + 1)/2 = 6.5th

Values less than median: 11,669 13,435 14,413 18,103 18,215 21,088
Values greater than median: 26,343 29,920 33,956 40,197 40,769 42,082

Q1 = (14,413 + 18,103)/2 = 16,258
Q2 = (21,088 + 26,343)/2 = 23,715.5 (also the median)
Q3 = (33,956 + 40,197)/2 = 37,076.5

By looking at the position of 40,197, this value lies in the top 25% of the car thefts.
Example 1.28 Solution

(b) The interquartile range is given by the difference between the values of the third and the first quartiles. Thus,
IQR = Interquartile Range
= Q3 − Q1
= 37,076.5 −16, 258
= 20,818.5 car thefts

38
Example 1.29
The following are the ages of nine employees of
an insurance company:
47 28 39 51 33 37 59 24 33

(a) Find the values of the three quartiles. Where does the age of 28 fall in relation to the ages of these employees?

(b) Find the interquartile range.

Example 1.29 Solution

(a) Rank the data in increasing order.

$$\text{Position of median} = \frac{9 + 1}{2} = 5\text{th}$$

Values less than median: 24 28 33 33
Median: 37
Values greater than median: 39 47 51 59

$$Q_1 = \frac{28 + 33}{2} = 30.5 \qquad Q_2 = 37 \ \text{(also the median)} \qquad Q_3 = \frac{47 + 51}{2} = 49$$

Thus the values of the three quartiles are
Q1 = 30.5 years Q2 = 37 years Q3 = 49 years

The age of 28 falls in the lowest 25% of the ages.


Example 1.29 Solution

(b) The interquartile range is

IQR = Interquartile range
= Q3 − Q1
= 49 − 30.5
= 18.5 years
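The method used in Examples 1.28 and 1.29 (find the median, then take the middle value of each half) can be sketched in Python. This is a minimal sketch of that median-of-halves convention only; statistical software often uses other interpolation rules and may give slightly different quartiles. The function name quartiles is an assumption made for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values less than the median position
    upper = data[half + (n % 2):]    # values greater than the median position
    return median(lower), median(data), median(upper)

# Example 1.29: ages of nine employees
ages = [47, 28, 39, 51, 33, 37, 59, 24, 33]
q1, q2, q3 = quartiles(ages)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)

# Example 1.28: ranked car-theft figures from the solution
thefts = [11669, 13435, 14413, 18103, 18215, 21088,
          26343, 29920, 33956, 40197, 40769, 42082]
print(quartiles(thefts))
```

When n is odd the median itself is left out of both halves, which is how the solution to Example 1.29 splits the nine ages.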

Quartiles For Ungrouped Frequency Distribution (single-value grouping)
1. Construct the cumulative frequency.
2. Median position = $\dfrac{n + 1}{2}$; locate the median.
3. Q1 = Middle value of the first 50%.
4. Q3 = Middle value of the second 50%.

Example 1.30
Number of fishes 0 1 2 3 4 5
Frequency 1 5 8 7 3 1
The above data shows the number of fishes
reared in each of 25 houses along Green Road.
Find the median and semi inter-quartile range for
the data.

Ans: Q1 = 1.5; Q2 = 2; Q3 = 3; Semi-IQR = 0.75

Example 1.30 Solution

Number of fishes, x    Frequency, f    Cumulative frequency, F
0                      1               1
1                      5               6
2                      8               14
3                      7               21
4                      3               24
5                      1               25
Example 1.30 Solution

$$\text{Median position} = \tfrac{1}{2}(25 + 1) = 13\text{th}$$

Median = 2 fishes

Example 1.30 Solution

There are 12 values before the median.

$$Q_1 \text{ position} = \tfrac{1}{2}(12 + 1) = 6.5\text{th}$$

$$\text{First quartile, } Q_1 = \tfrac{1}{2}(x_6 + x_7) = \tfrac{1}{2}(1 + 2) = 1.5 \text{ fishes}$$
Example 1.30 Solution

Calculate the 6.5th value after the median.

$$Q_3 \text{ position} = 13 + 6.5 = 19.5\text{th}$$

$$\text{Third quartile, } Q_3 = \tfrac{1}{2}(x_{19} + x_{20}) = \tfrac{1}{2}(3 + 3) = 3 \text{ fishes}$$
Example 1.30 Solution

$$\text{Semi-interquartile range} = \tfrac{1}{2}(Q_3 - Q_1) = \tfrac{1}{2}(3 - 1.5) = 0.75 \text{ fishes}$$
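The steps of Example 1.30 (build the cumulative frequency, compute the positions, then look up the values at those positions) can be sketched in Python. This is a minimal sketch that assumes the x values are listed in ascending order; the helper names kth_value, value_at and quartiles_from_table are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def kth_value(xs, cum, k):
    """k-th smallest observation (1-indexed), located via cumulative frequencies."""
    return xs[bisect_left(cum, k)]

def value_at(xs, cum, position):
    """Value at a whole or half position; 6.5 averages the 6th and 7th values."""
    lo = int(position)
    hi = lo if position == lo else lo + 1
    return (kth_value(xs, cum, lo) + kth_value(xs, cum, hi)) / 2

def quartiles_from_table(xs, freqs):
    """Return (Q1, Q2, Q3) for a single-value frequency table."""
    cum = list(accumulate(freqs))     # cumulative frequency, F
    n = cum[-1]
    below = n // 2                    # number of values before the median
    q1_pos = (below + 1) / 2          # middle of the first half
    q2_pos = (n + 1) / 2              # median position
    q3_pos = (n - below) + q1_pos     # same distance into the second half
    return tuple(value_at(xs, cum, p) for p in (q1_pos, q2_pos, q3_pos))

fishes = [0, 1, 2, 3, 4, 5]
houses = [1, 5, 8, 7, 3, 1]
q1, q2, q3 = quartiles_from_table(fishes, houses)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "Semi-IQR =", (q3 - q1) / 2)
```

For the 25 houses the positions come out as 6.5, 13 and 19.5, the same positions used in the worked solution.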

Quartiles for Grouped Frequency
Distribution
For a grouped frequency distribution with total frequency $\sum f$:

$$\text{First quartile, } Q_1 = \left(\tfrac{1}{4}\sum f\right)\text{th value}$$

$$\text{Third quartile, } Q_3 = \left(\tfrac{3}{4}\sum f\right)\text{th value}$$
Quartiles For Grouped Frequency
Distribution
Determine the class boundaries and compute the
cumulative frequency for each class. Locate the
classes that contain the quartiles by computing
their positions. Determine the quartiles using
formulae or graphically.
(a) Median position = $\tfrac{1}{2}\sum f$
(b) Q1 position = $\tfrac{1}{4}\sum f$
(c) Q3 position = $\tfrac{3}{4}\sum f$
Formula For First Quartile
1 
4 (  f ) −  f Q1 −1 
Q1 = LQ1 +  c
 fQ1 
 
LQ1 = lower class boundary of the first quartile class
 fQ −1 = cumulative frequency before the first quartile
1

class
fQ = frequency of the first quartile class
1

c = width of the first quartile class


51
Formula For Third Quartile
1 
4 (  f ) −  f Q3 −1 
Q3 = LQ3 +  c
 fQ3 
 
LQ3 = lower class boundary of the third quartile class
 fQ −1 = cumulative frequency before the third quartile
3

class
fQ = frequency of the third quartile class
3

c = width of the third quartile class


52
Example 1.31
The following table shows the height distribution
for a group of students. Find the first quartile,
third quartile and semi-interquartile range.
Height (cm)   150 – 155   155 – 160   160 – 165   165 – 170   170 – 175   175 – 180
Frequency     15          32          68          52          24          12

Example 1.31 Solution

Height (cm), class boundaries    Frequency    Cumulative frequency
150 – 155                        15           15
155 – 160                        32           47
160 – 165                        68           115
165 – 170                        52           167
170 – 175                        24           191
175 – 180                        12           203

Example 1.31 Solution

$$Q_1 \text{ position} = \left(\frac{203}{4}\right)\text{th} = 50.75\text{th}$$

Q1 class boundaries = 160 – 165.
Lower boundary of 1st quartile class, $L_{Q_1} = 160$
Cumulative frequency before 1st quartile class, $\sum f_{Q_1 - 1} = 47$
Frequency of 1st quartile class, $f_{Q_1} = 68$
Width of 1st quartile class, c = 165 − 160 = 5

Example 1.31 Solution

1 
 4 (  f ) −  fQ1 −1 
1st quartile, Q1 = LQ +  c
1
 fQ1 
 
1 
 4 ( 203) − 47 
= 160 +   ( 5)
 68 
 
= 160.28 cm
56
Example 1.31 Solution
$$Q_3 \text{ position} = \tfrac{3}{4}(203)\text{th} = 152.25\text{th}$$

Q3 class boundaries = 165 – 170.
Lower boundary of 3rd quartile class, $L_{Q_3} = 165$
Cumulative frequency before 3rd quartile class, $\sum f_{Q_3 - 1} = 115$
Frequency of 3rd quartile class, $f_{Q_3} = 52$
Width of 3rd quartile class, c = 5

Example 1.31 Solution

1 
 4 (  f ) −  fQ3 −1 
3rd quartile, Q3 = LQ3 +  c
 fQ3 
 
3 
4 (203) − 115 
= 165 +   (5)
 52 
 
= 168.58cm
58
Example 1.31 Solution

$$\text{Semi-interquartile range} = \tfrac{1}{2}(Q_3 - Q_1) = \tfrac{1}{2}(168.58 - 160.28) = 4.15 \text{ cm}$$
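The grouped-data formula can be sketched in Python to check the arithmetic of Example 1.31. This is a minimal sketch of the interpolation formula above, with fraction = 1/4 for Q1 and 3/4 for Q3; the function name grouped_quartile and its argument layout are assumptions made for illustration.

```python
def grouped_quartile(boundaries, freqs, fraction):
    """Interpolate the value at position fraction * (total frequency).

    boundaries: list of (lower, upper) class boundaries in ascending order
    freqs:      frequency of each class
    """
    position = fraction * sum(freqs)
    cum_before = 0                       # cumulative frequency before the class
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cum_before + f >= position:
            width = upper - lower        # c in the formula
            return lower + (position - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must lie between 0 and 1")

classes = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175), (175, 180)]
freqs = [15, 32, 68, 52, 24, 12]

q1 = grouped_quartile(classes, freqs, 1 / 4)
q3 = grouped_quartile(classes, freqs, 3 / 4)
print(round(q1, 2), round(q3, 2), round((q3 - q1) / 2, 2))
```

The loop stops at the first class whose cumulative frequency reaches the required position, which is the "locate the quartile class" step of the method.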
Find Quartiles graphically
(use ogive)
The median and quartiles can be read directly from the cumulative frequency curve (ogive).

Example 1.32
The table below shows the distribution of the mass of
babies (in kg) for babies born in a hospital from January
to June. Draw an ogive to show the frequency
distribution. From your ogive, find the first quartile and
third quartile of the mass of the babies.

Mass (kg)   0.0 – 1.0   1.0 – 2.0   2.0 – 3.0   3.0 – 4.0   4.0 – 5.0   5.0 – 6.0
Number      12          233         442         185         96          32

Example 1.32 Solution

Mass (kg), upper boundary    Cumulative frequency
< 0.0                        0
< 1.0                        12
< 2.0                        245
< 3.0                        687
< 4.0                        872
< 5.0                        968
< 6.0                        1000
Example 1.32 Solution

[Figure: “Less than” ogive of the mass of babies (kg) against the cumulative number of babies (0 to 1000). The 250th and 750th positions are read across to the curve, giving Q1 = 2.0 kg and Q3 = 3.3 kg.]
Example 1.32 Solution

$$Q_1 \text{ position} = \tfrac{1}{4}(1000) = 250\text{th}$$

$$Q_3 \text{ position} = \tfrac{3}{4}(1000) = 750\text{th}$$

From the ogive:
First quartile, Q1 = 2 kg
Third quartile, Q3 = 3.3 kg
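Reading a value off the ogive can be imitated numerically. The sketch below assumes the plotted points are joined by straight line segments, which only approximates a hand-drawn smooth curve, so the results agree with the graphical readings to about one decimal place. The function names ogive_points and read_ogive are hypothetical.

```python
from itertools import accumulate

def ogive_points(boundaries, freqs):
    """Pair each class boundary with the cumulative frequency below it."""
    return list(zip(boundaries, [0, *accumulate(freqs)]))

def read_ogive(points, position):
    """Find the x value at which the cumulative frequency reaches `position`."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 < position <= c1:
            # Linear interpolation along the segment that crosses `position`.
            return x0 + (position - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("position is outside the range of the ogive")

boundaries = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
babies = [12, 233, 442, 185, 96, 32]
points = ogive_points(boundaries, babies)

q1 = read_ogive(points, 250)
q3 = read_ogive(points, 750)
print(round(q1, 1), round(q3, 1))
```

The points produced by ogive_points are the same (upper boundary, cumulative frequency) pairs listed in the cumulative frequency table.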
Box-plot
A box-plot shows the spread of a distribution and helps to detect outliers by using the 5-number summary:

1) smallest value,
2) largest value,
3) first quartile,
4) third quartile and
5) median.

It can be displayed horizontally or vertically.


Box-plot
[Figure: box-plot displayed horizontally on a scale from 0 to 60, labelled with the smallest value, 1st quartile, median, 3rd quartile and largest value.]
[Figure: box-plots displayed vertically.]

Box-plot

• The ‘box’ runs from Q1 to Q3 and contains the middle 50% of the data.

• A ‘whisker’ runs from the box to the smallest value, and another runs from the box to the largest value.

Together the ‘whiskers’ display the range of the data.

Constructing a Box-Plot
• Calculate Q1, the median, Q3 and IQR.
• Draw a horizontal line to represent the scale of
measurement.
• Draw a box using Q1, the median, Q3.

Constructing a Box-Plot
• Isolate outliers by calculating:
Lower fence: Q1 − 1.5 IQR
Upper fence: Q3 + 1.5 IQR
• Measurements beyond the upper or lower fence are outliers and are marked (*).

Constructing a Box-Plot
• Draw “whiskers” connecting the largest and
smallest measurements that are NOT outliers
to the box.

Box-plot
Boxplot for 3 types of distribution :

1) Symmetrical distribution

2) Positively skewed distribution

3) Negatively skewed distribution

Box-plot For Symmetrical
Distribution

‘whisker’ : same length
Median : centre of the box
Box-plot For
Positively Skewed Distribution

‘whisker’ : left side shorter than right side
Median : nearer to 1st quartile
Box-plot For
Negatively Skewed Distribution

‘whisker’ : left side longer than right side
Median : nearer to 3rd quartile
Use of box-plot to identify ‘outliers’

Sometimes, values which are unusually small or large occur in a set of data.

The unusual values occur probably because of an error in recording the data.

‘Outliers’ : points that lie more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile.
Outliers

[Figure: box-plot with a boundary drawn 1.5 (Q3 − Q1) below Q1 and another 1.5 (Q3 − Q1) above Q3. Each whisker ends at the last value inside its boundary; points beyond a boundary are marked * as ‘outliers’.]

Example 1.33
Given the amount of sodium in 8 brands of cheese:
260 290 300 320 330 340 340 520

Draw the boxplot based on the data.

Q1 = 295 m = 325 Q3 = 340
IQR = 340 – 295 = 45
Lower fence = 295 – 1.5(45) = 227.5
Upper fence = 340 + 1.5(45) = 407.5
Outlier : x = 520
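The fence calculation can be sketched in Python using the quartiles stated in the example. This is a minimal sketch; the function names fences and split_outliers are hypothetical, and Q1 and Q3 are passed in rather than computed.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) at k times the IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, q1, q3):
    """Separate the values inside the fences from the outliers."""
    low, high = fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inside, outliers

sodium = [260, 290, 300, 320, 330, 340, 340, 520]
low, high = fences(295, 340)
inside, outliers = split_outliers(sodium, 295, 340)

print("fences:", low, high)
print("outliers:", outliers)
print("whiskers end at:", min(inside), "and", max(inside))
```

The whiskers are drawn to the smallest and largest values that are not outliers, here 260 and 340, so the upper whisker has zero length because the largest non-outlier equals Q3.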

Example 1.33 Solution

Box plot : Amount of sodium in 8 brands of cheese.
[Figure: box-plot with Q1, m and Q3 marked on the box and the outlier marked *.]

Example 1.34
The following data shows a summary of the marks for
Mathematics and Science for students in a class.

Subjects       Minimum    Maximum    Median    First quartile    Third quartile
Mathematics    10         90         60        45                70
Science        35         85         60        48                72

Draw two boxplots for this data and comment on the distribution of marks for Mathematics and Science.
Example 1.34 Solution

Box plot: Marks for Mathematics and Science for students in a class.
[Figure: two box-plots, one for Mathematics and one for Science, drawn on a scale from 0 to 100.]

The distribution of marks for Mathematics is skewed to the left whereas the distribution of marks for Science is symmetrical.
Example 1.35
The following stem plot shows the maximum
temperature for each day from 1st August to 23rd August
in a town. Draw a boxplot and use your boxplot to
identify the ‘outliers’.
Stem | Leaf
7 | 6 7
7 | 0 2 2 3
6 | 5 7 8 8 8 9 9
6 | 2 3 3 4 4 4 4 4
5 | 9
5 | 1
Key : 5 | 9 means 59°F
Example 1.35 Solution

Number of observations, n = 23

Median position = 24/2 = 12th

Median, Q2 = 67°F

Q1 position = 12/2 = 6th (in the values < median)

First quartile, Q1 = 64°F

Example 1.35 Solution

Q3 position = 6th (in the values > median)

Third quartile, Q3 = 70°F

Upper boundary = Q3 + 1.5(Q3 – Q1)
= 70 + 1.5(70 – 64) = 79°F
Lower boundary = Q1 – 1.5(Q3 – Q1)
= 64 – 1.5(70 – 64) = 55°F
Therefore, the outlier is 51°F.
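The whole of Example 1.35 can be traced in Python, starting from the stem plot. This is a minimal sketch that assumes each value is 10 × stem + leaf (as the key indicates) and uses the same median-of-halves quartiles as the worked solution; the names stem_values and boxplot_summary are hypothetical.

```python
from statistics import median

# Rows of the stem plot as (stem, leaves), copied from the example.
rows = [
    (7, [6, 7]),
    (7, [0, 2, 2, 3]),
    (6, [5, 7, 8, 8, 8, 9, 9]),
    (6, [2, 3, 3, 4, 4, 4, 4, 4]),
    (5, [9]),
    (5, [1]),
]

def stem_values(rows):
    """Expand a stem plot into a sorted list of values (10 * stem + leaf)."""
    return sorted(10 * stem + leaf for stem, leaves in rows for leaf in leaves)

def boxplot_summary(data):
    """Return Q1, median, Q3, the two boundaries and any outliers."""
    n = len(data)
    half = n // 2
    q1 = median(data[:half])             # middle of the values below the median
    q2 = median(data)
    q3 = median(data[half + n % 2:])     # middle of the values above the median
    lower = q1 - 1.5 * (q3 - q1)
    upper = q3 + 1.5 * (q3 - q1)
    outliers = [x for x in data if x < lower or x > upper]
    return q1, q2, q3, lower, upper, outliers

temps = stem_values(rows)
print(len(temps), boxplot_summary(temps))
```

The only value outside the boundaries is 51, so the lower whisker would stop at 59, the smallest temperature that is not an outlier.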
Example 1.35 Solution

n = 23   Q3 = 70°F   Upper boundary = 79°F
Q1 = 64°F   IQR = 6°F   Lower boundary = 55°F

Box plot: Maximum temperature for each day from 1st August to 23rd August in a town
[Figure: box-plot on a temperature (°F) scale from 50 to 90, with the outlier and the boundaries marked and the values 55, 59, 64, 67, 77 and 79 labelled.]
4th Industrial Revolution (4th IR)
✓ The fourth industrial revolution is the fusion of the real world with the virtual world. The digital revolution is marked by technology that takes advantage of Big Data and Artificial Intelligence (AI) to nurture automatic learning systems.

✓ Artificial intelligence uses data, mathematics and statistics to create intelligent machines.

✓ Big Data is the collection and analysis of data sets that are complex in terms of volume and variety, and in some cases incorporate the velocity at which they are collected.
4th Industrial Revolution (4th IR)
✓ Statistical methods are applied to analyse large and complex data sets with intelligent algorithms in order to spot patterns, understand the relationships between data, predict future outcomes and make decisions.

✓ The links below provide an overview of big data and the role of statisticians in understanding and advancing big data.
• https://www.researchgate.net/publication/284045063_Statistical_Perspectives_on_Big_Data
• http://higherlogicdownload.s3.amazonaws.com/AMSTAT/UploadedImages/49ecf7cf-cb26-4c1b-8380-3dea3b7d8a9d/BigDataOnePager.pdf
