
CHAPTER 2

DESCRIPTIVE
STATISTICS
LECTURER: DR RUZANITA MAT RANI

PREPARED BY:

HAZIYAH BINTI MD JASMIN
CONTENTS
2.0 Introduction to Descriptive Statistics
2.1 Organizing & Graphing Qualitative and Quantitative Data
2.1.1 Tabular form (Frequency distribution & Cross Tabulation)
2.1.2 Graphical form (Pie chart, Bar chart, Stem and Leaf Plot, Histogram and Boxplot)
2.0
Introduction to Descriptive Statistics
➢Descriptive Statistics are used to describe the basic features of the data in a study.

➢They provide simple summaries about the sample and the measures.

➢With descriptive statistics you are simply describing what is or what the data shows.

➢Descriptive statistics therefore enable us to present the data in a more meaningful way, which allows
simpler interpretation of the data.
➢For example, if we had the results of 100 pieces of students' coursework, we may be interested in the
overall performance of those students. We would also be interested in the distribution or spread of the
marks. Descriptive statistics allow us to do this.
2.1
Organizing & Graphing Qualitative
and Quantitative Data
Data Presentation
• Tabular form
    1. Frequency distribution
    2. Cross tabulation
• Graphical form
    ◦ Qualitative data
        1. Pie chart
        2. Bar chart
    ◦ Quantitative data
        1. Stem and leaf plot
        2. Histogram
        3. Boxplot
2.1.1
Tabular form
1) Frequency Distribution Table
❖ A frequency distribution is a table consisting of columns and rows that lists the data values and their
frequencies.
❖ Frequency is the number of times a value occurs.
❖ A column may consist of categories of data. A class is a category into which qualitative data can be
classified. Class frequency is the number of observations that fall in a particular class.
❖ For example: A car dealer in KL made the following sales, by type of car, in January 2005:

Car model (CLASS)    Number of cars (CLASS FREQUENCY)
Waja                 66
Wira                 50
Saga                 39
Gen-2                25
Total                180

Here the car model is the class (a qualitative variable) and the number of cars is the class frequency.
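As a rough illustration, the tallying behind a frequency distribution can be sketched in Python. The raw list below is rebuilt from the car dealer table (one entry per car sold), and the relative frequency column is an addition that does not appear in the table.

```python
from collections import Counter

# Raw observations rebuilt from the car dealer table (one entry per car sold).
sales = ["Waja"] * 66 + ["Wira"] * 50 + ["Saga"] * 39 + ["Gen-2"] * 25

# Tally each class: Counter maps class -> class frequency.
frequency = Counter(sales)
total = sum(frequency.values())

# Relative frequency (%) of each class; an extra column not shown in the table.
relative = {model: 100 * count / total for model, count in frequency.items()}

for model, count in frequency.most_common():
    print(f"{model:<6} {count:>4} {relative[model]:6.2f}%")
print(f"{'Total':<6} {total:>4}")
```

The same tally works for any qualitative variable, such as the blood types in Example 1.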
EXAMPLE 1
25 army inductees were given a blood test to determine their blood type. The data set and its
frequency table are as follows:

Conclusion: More people have type ‘O’ blood than any other type.
2.1.1
Tabular form
2) CROSS TABULATION TABLE (Cross Tab)
❖Also known as “Contingency Table”
❖Illustrated by Row x Column dimension.
❖It is often desirable to examine the categorical responses in terms of Two Qualitative Variables
simultaneously.
                        Variable 2
Variable 1     1             2             ...   c             Total Column, j
1              O11           O12           ...   O1c           Total Column, 1
2              O21           O22           ...   O2c           Total Column, 2
...            ...           ...           ...   ...           ...
r              Or1           Or2           ...   Orc           Total Column, j
Total Row, i   Total Row, 1  Total Row, 2  ...   Total Row, i  Grand Total
Example 2
A car manufacturer might be interested to know whether colour preference for a car is
independent of gender. In this case, the two variables are GENDER and COLOUR.

❖ The table shows that men prefer black or red cars while women prefer red cars. Men
dislike blue cars while women dislike green cars.
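A cross tabulation can be built by counting pairs of categories. The sketch below uses a small hypothetical set of (gender, colour) responses, since the counts behind Example 2 are not reproduced in the text. The names `responses`, `cells`, `row_totals` and `col_totals` are illustrative only.

```python
from collections import Counter

# Hypothetical survey responses (gender, colour); NOT the data from Example 2.
responses = [
    ("Male", "Black"), ("Male", "Red"), ("Male", "Black"), ("Male", "Green"),
    ("Female", "Red"), ("Female", "Red"), ("Female", "Blue"), ("Female", "Black"),
]

cells = Counter(responses)                      # observed count for each cell
row_totals = Counter(g for g, _ in responses)   # one total per gender
col_totals = Counter(c for _, c in responses)   # one total per colour
genders = sorted(row_totals)
colours = sorted(col_totals)

# Print the r x c table with its marginal totals and the grand total.
print(f"{'':<8}" + "".join(f"{c:>7}" for c in colours) + f"{'Total':>7}")
for g in genders:
    row = "".join(f"{cells[(g, c)]:>7}" for c in colours)
    print(f"{g:<8}{row}{row_totals[g]:>7}")
print(f"{'Total':<8}" + "".join(f"{col_totals[c]:>7}" for c in colours)
      + f"{len(responses):>7}")
```

Each cell holds the number of respondents who fall into both categories at once, and the margins add up to the grand total.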
2.1.2
Graphical form (Qualitative data)
[Figure: A bar chart of No of Army (y-axis) against Blood Type (x-axis), with A = 5, B = 7, O = 9 and AB = 4.]
[Figure: A pie chart of Blood Type, with A 20%, B 28%, O 36% and AB 16%.]

2.1.2
Graphical form (Quantitative data)
A. STEM AND LEAF PLOT
❑ A basic tool for determining the skewness of data.
❑ Widely used in exploratory data analysis; it separates data entries
into leading digits and trailing digits.
❑ Suitable for small data sets.

The steps to construct a stem and leaf plot manually are as follows:
Step 1: Arrange the data values from the smallest value to the largest value.
Step 2: The leading digit(s) become the stem and the trailing digit becomes the leaf.
Step 3: Display the data.
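The three steps can be sketched in Python as follows. This is a minimal illustration that assumes non-negative whole numbers with a one-digit leaf. The helper name `stem_and_leaf` is hypothetical, and the scores are read off the JUNE'19 stem and leaf plot that appears later in this chapter.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers, leaves in ascending order."""
    plot = defaultdict(list)
    for v in sorted(values):          # Step 1: arrange in ascending order
        stem, leaf = divmod(v, 10)    # Step 2: leading digit(s) and trailing digit
        plot[stem].append(leaf)
    return dict(plot)

# The 15 test scores read off the JUNE'19 stem and leaf plot.
scores = [43, 52, 54, 58, 62, 63, 65, 67, 68, 68, 69, 71, 75, 78, 89]

plot = stem_and_leaf(scores)
for stem in range(min(plot), max(plot) + 1):   # Step 3: display, keeping empty stems
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Sorting first yields an ordered stem plot; skipping the sort would give the unordered version.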
EXAMPLE 3
At an outpatient testing centre, the number of cardiograms performed each day for 20 days is
shown below.

Construct a stem and leaf plot for the above data.

Solution:

STEP 1: Arrange the data values in ascending order.


STEP 2: The leading digit(s) become the stem and the trailing digit becomes the leaf. For example, the value 02 splits as:
LEADING DIGIT (STEM)    TRAILING DIGIT (LEAF)
0                       2

STEP 3: Display the data.

Ordered Stem Plot Unordered Stem Plot


Conclusion:

1. For 20 days, each day the number of patients receiving cardiograms was between 31 and 36.

2. Minimum number of patients is 2 while maximum number of patients is 57.


2.1.2
Graphical form (Quantitative data)
B. HISTOGRAM
❖ A histogram is a graphical representation of the frequency distribution in which bars represent frequencies.
❖ The histogram is constructed by using the class boundaries and frequencies of the classes.
❖ Y-axis: Class frequency, X-axis: Class boundary
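To show how class boundaries and class frequencies fit together, here is a small Python sketch that counts observations per class and prints a text histogram. The class boundaries are an assumption chosen for illustration (the text does not give any), and the scores come from the JUNE'19 stem and leaf plot later in this chapter.

```python
# Test scores taken from the JUNE'19 stem and leaf plot later in this chapter.
scores = [43, 52, 54, 58, 62, 63, 65, 67, 68, 68, 69, 71, 75, 78, 89]

# Assumed class boundaries (not given in the text): 39.5-49.5, 49.5-59.5, ...
boundaries = [39.5, 49.5, 59.5, 69.5, 79.5, 89.5]

# Class frequency = number of observations between consecutive boundaries.
frequencies = [
    sum(lower < x < upper for x in scores)
    for lower, upper in zip(boundaries, boundaries[1:])
]

# Text rendering: each row is one class, bar length is the class frequency.
for (lower, upper), f in zip(zip(boundaries, boundaries[1:]), frequencies):
    print(f"{lower:>5} - {upper:<5} | {'#' * f} ({f})")
```

Because the boundaries fall on half-units, no whole-number score can sit exactly on a boundary, so every observation belongs to exactly one class.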
2.1.2
Graphical form (Quantitative data)
C. BOXPLOT (Box and Whisker Plot)
❖ A box-and-whisker plot provides a useful graphical representation of data using the minimum (Xmin),
maximum (Xmax), first quartile (Q1), median/second quartile (Q2) and third quartile (Q3).
❖ Widely used as an initial tool to identify the skewness/distribution of the data, such as:
1. Normally distributed/No skew
2. Positively skewed/Skew to the right/Right skewed
3. Negatively skewed/Skew to the left/Left skewed
Types of Boxplot
How to construct Box-and-Whisker plot?
1. 1st quartile (Q1) – Positional value such that 25% of the observations are smaller and 75% are larger.
Its position is given by the following formula.

$Q_1 = \left(\frac{n+1}{4}\right)^{\text{th}}$ value of the ordered observations

2. Median (Q2) – The middle value of an ordered array of data

◦ First, arrange the data in ascending or descending order.
◦ If there is an odd no. of observations in the data, the median is the middle value.
◦ If there is an even no. of observations in the data, the median is the average of the two middle
numbers.
3. 3rd quartile (Q3) – Positional value such that 75% of the observations are smaller and 25% are larger.
Its position is given by the following formula.

$Q_3 = \left(\frac{3(n+1)}{4}\right)^{\text{th}}$ value of the ordered observations

The rules for obtaining the quartile values are as follows:

a) If the positioning point is an integer, the numerical observation at the positioning point is chosen for the quartile.

b) If the positioning point is halfway between two integers, the average of their corresponding observations is selected.
c) If the positioning point is neither an integer nor a value halfway between two integers, round it off to the nearest integer and select the
numerical value of the corresponding observation.
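The positional formulas and rules a) to c) can be sketched in Python as below. This is an illustration only: the helper names `value_at` and `five_number_summary` are hypothetical, the data set is assumed to have at least two observations, and the hours of study are borrowed from Example 6 later in this chapter.

```python
def value_at(ordered, position):
    """Apply rules a) to c) to a 1-based positioning point."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:                       # a) integer position
        return ordered[whole - 1]
    if fraction == 0.5:                     # b) halfway between two integers
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[round(position) - 1]     # c) round to the nearest integer

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot is drawn from."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at(ordered, (n + 1) / 4)
    q2 = value_at(ordered, (n + 1) / 2)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]

# Hours of study from Example 6 (n = 9).
hours = [12, 7, 11, 7, 16, 16, 22, 18, 25]
print(five_number_summary(hours))
```

With n = 9 the Q1 and Q3 positions are 2.5 and 7.5, so rule b) applies to both, while the median position 5 falls under rule a).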
EXAMPLE 4
The 3-year annual returns of 14 low-risk funds arranged in ascending order are given as follows.
Find the min, max, Q1, Q3 and median. Then construct the box-and-whisker plot.

Solution: Minimum value = 9.77 and maximum value = 38.16


Comments: Skew to the right/Positively skewed
Exercise Chapter 2- Part 1
CONTENTS
2.2 Numerical Descriptive Measures
2.2.1 Measures of Central Tendency

2.2.2 Measures of Dispersion/Variation

2.2.3 Coefficient of Variation (CV)

2.2.4 Measures of Skewness

2.2.5 Measures of Position


2.2
Numerical Descriptive Measures
Descriptive Measures
• Measures of Central Tendency: Mean, Median, Mode
• Measures of Dispersion: Range, Variance & Standard Deviation
• Coefficient of Variation
• Measures of Skewness: Pearson Coefficient of Skewness
• Measures of Position

*NOTES: Please round off any answer that involves calculation to 3 decimal places.
2.2.1
Measures of Central Tendency
EXAMPLE 5
A company has five departments. The numbers of workers in the five departments are 24, 13, 19,
26, and 11 respectively. What is the mean number of workers in a department? Interpret the value.

Solution:
$\mu = \frac{\sum X}{N} = \frac{24+13+19+26+11}{5} = 18.6$

Interpretation:

On average, each of the five departments has 18.6 ≈ 19 workers.
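The same calculation can be checked with a few lines of Python. This is only a quick sketch using the Example 5 figures; the variable names are illustrative.

```python
workers = [24, 13, 19, 26, 11]       # the five departments in Example 5

# Population mean: sum of all values divided by the number of values N.
mu = sum(workers) / len(workers)
print(mu)
```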


EXAMPLE 6
The following data give the hours spent studying per week by all nine students. Determine the median
and comment on its value.
12 7 11 7 16
16 22 18 25

Solution: 1st step - Arrange the data from the smallest to largest value (increasing order)

7 7 11 12 16 16 18 22 25

2nd step – Find the position of middle term by using formula



Interpretation:
50% of the students spend more than 16 hours studying per week and another 50% of the
students spend less than 16 hours studying per week.
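The two steps can be traced in Python. The position formula is not reproduced in the text, so this sketch assumes the usual (n + 1)/2 rule, which is consistent with the quartile positions used elsewhere in this chapter and with the median of 16 quoted in the interpretation.

```python
from statistics import median

hours = [12, 7, 11, 7, 16, 16, 22, 18, 25]   # Example 6 data

ordered = sorted(hours)            # 1st step: arrange in increasing order
n = len(ordered)
position = (n + 1) / 2             # 2nd step: position of the middle term (assumed formula)

if position.is_integer():          # odd n: a single middle value
    med = ordered[int(position) - 1]
else:                              # even n: average the two middle values
    lower = int(position)
    med = (ordered[lower - 1] + ordered[lower]) / 2

print(position, med)
```

The result agrees with `statistics.median`, which applies the same odd/even rule.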
EXAMPLE 7
The marks of nine students in a mathematics test with a maximum possible mark of 50 are given
below. Find the mode and comment on its value.

Solution: Find the value that occurs most often in the data set.
◦ NOTES: A data set can have more than one mode, or no mode at all.

Interpretation: Majority/most of the students got 35/50 marks
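The note about multiple modes and no mode can be illustrated with a short Python sketch. The marks for Example 7 are not reproduced in the text, so the data from Examples 6 and 5 are used instead. The helper `modes` is hypothetical and assumes a non-empty data set.

```python
from collections import Counter

def modes(data):
    """Return every most-frequent value, or [] when no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([12, 7, 11, 7, 16, 16, 22, 18, 25]))   # Example 6 hours: two modes
print(modes([24, 13, 19, 26, 11]))                 # Example 5 workers: no mode
```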


Relationship between Mean,
Median and Mode
2.2.2
Measures of Dispersion/Variation
Range
Interpretation: Defined as the largest value – smallest value.
Formula: Range = Max value – Min value
Example: Refer to the data in Example 6, Range = 25 – 7 = 18

Variance & Standard Deviation
Interpretation:
• Useful for measuring variability.
• The standard deviation (Std. Dev) tells how closely the values of a data set are clustered around the mean.
• A lower value indicates that the data set values are spread over a relatively smaller range around the mean.
• A larger value shows that the data set values are spread over a relatively larger range around the mean.
• The variance is the square of the standard deviation.
Formula:
Population Variance: $\sigma^2 = \frac{1}{N}\left[\sum x^2 - \frac{(\sum x)^2}{N}\right]$
Sample Variance: $s^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]$
Population Std. Dev: $\sigma = \sqrt{\frac{1}{N}\left[\sum x^2 - \frac{(\sum x)^2}{N}\right]}$
Sample Std. Dev: $s = \sqrt{\frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]}$
EXAMPLE 8
The following data give the total wealth of five of the wealthiest people in the world.
Calculate the standard deviation and variance.
46.5 18.0 16.0 7.8 7.2

Solution:
x        x²
46.5     2162.25
18       324
16       256
7.8      60.84
7.2      51.84
Σx = 95.5    Σx² = 2854.93

Sample Std. Dev, $s = \sqrt{\frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]} = \sqrt{\frac{1}{5-1}\left[2854.93 - \frac{(95.5)^2}{5}\right]} = 16.054$

Sample Variance, $s^2 = (16.054)^2 = 257.731$
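The shortcut formula can be verified numerically. The sketch below recomputes Σx and Σx² for the Example 8 data and cross-checks the result against `statistics.stdev`; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

wealth = [46.5, 18.0, 16.0, 7.8, 7.2]   # Example 8 data

n = len(wealth)
sum_x = sum(wealth)                      # sum of x
sum_x2 = sum(x * x for x in wealth)      # sum of x squared

# Sample variance via the shortcut formula, then its square root.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = sqrt(variance)

print(round(std_dev, 3), round(variance, 3))
```

Computed this way, the variance is 1030.88/4 = 257.720. The worked solution's 257.731 comes from squaring the already rounded standard deviation 16.054, so the two differ slightly in the second decimal place.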
2.2.3
Coefficient of Variation (CV)

% of CV Conclusion
Large % of CV ✓ More dispersed
✓ Less consistent/Less reliable
Small % of CV ✓ Less dispersed
✓ More consistent/More reliable
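The CV formula itself is not reproduced in the text. The sketch below assumes the usual definition, CV = (standard deviation / mean) × 100%, and applies it to the Providence Bank summary figures quoted in the DEC'19 past year question later in this chapter. The function name is hypothetical.

```python
def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: (standard deviation / mean) x 100 (assumed definition)."""
    return std_dev / mean * 100

# Providence Bank summary figures quoted in the DEC'19 past year question.
cv_pb = coefficient_of_variation(mean=5.6, std_dev=1.692)
print(round(cv_pb, 3))
```

Because the CV is a unit-free percentage, it allows data sets with different means or units to be compared; the set with the smaller CV is the more consistent one.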
EXAMPLE 9
A study was conducted to determine the performance of students from various classes of
Sekolah Menengah Taman Meru Jati. The students’ CGPA measurements are tabulated
as follows:

Using the most suitable measurement, which class is more consistent in their performance?
Solution:

Melati Lily Mawar

Conclusion:
Class is more consistent in their performance since it has the smallest
percentage value of CV.
2.2.4
Measures of Skewness
Pcs value    Description
Positive     Skewed to the right/Positively skewed
Negative     Skewed to the left/Negatively skewed
Zero         Normal
EXAMPLE 10
Calculate the skewness for the data in Example 9 and comment on the shape of the
distribution.

Solution:
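CLASS MELATI

Formula:
$\text{Pcs (Skewness)} = \frac{\text{Mean} - \text{Mode}}{\text{Std. Deviation}}$   or   $\text{Pcs (Skewness)} = \frac{3(\text{Mean} - \text{Median})}{\text{Std. Deviation}}$

Calculation:
$\text{Skewness} = \frac{3.12 - 2.12}{\sqrt{0.2755}} = 1.905$   or   $\text{Skewness} = \frac{3(3.12 - 3.10)}{\sqrt{0.2755}} = 0.114$

Conclusion on the shape of the distribution: Skew to the right (both values are positive).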
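Both versions of Pearson's coefficient can be checked in Python. The printed results 1.905 and 0.114 are reproduced only when 0.2755 is treated as the variance of Class Melati, so this sketch takes its square root as the standard deviation; that reading is an assumption. The function names are hypothetical.

```python
from math import sqrt

def pcs_mode(mean, mode, std_dev):
    """Pearson's coefficient of skewness based on the mode."""
    return (mean - mode) / std_dev

def pcs_median(mean, median, std_dev):
    """Pearson's coefficient of skewness based on the median."""
    return 3 * (mean - median) / std_dev

def shape(pcs):
    """Translate the sign of Pcs into the description used in the table."""
    if pcs > 0:
        return "Skew to the right"
    if pcs < 0:
        return "Skew to the left"
    return "Normal"

# Class Melati figures; 0.2755 is assumed to be the variance.
std_dev = sqrt(0.2755)
by_mode = pcs_mode(3.12, 2.12, std_dev)
by_median = pcs_median(3.12, 3.10, std_dev)

print(round(by_mode, 3), shape(by_mode))
print(round(by_median, 3), shape(by_median))
```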
CLASS LILY

Formula:
$\text{Pcs (Skewness)} = \frac{\text{Mean} - \text{Mode}}{\text{Std. Deviation}}$   or   $\text{Pcs (Skewness)} = \frac{3(\text{Mean} - \text{Median})}{\text{Std. Deviation}}$

Calculation:
Skewness =          or   Skewness =

Conclusion on the shape of the distribution:

CLASS MAWAR

Formula:
$\text{Pcs (Skewness)} = \frac{\text{Mean} - \text{Mode}}{\text{Std. Deviation}}$   or   $\text{Pcs (Skewness)} = \frac{3(\text{Mean} - \text{Median})}{\text{Std. Deviation}}$

Calculation:
Skewness =          or   Skewness =

Conclusion on the shape of the distribution:
2.2.5
Measures of Position
Past Year Question
DEC’19 – Question 3
Waiting times (in minutes) of customers at the Providence Bank (PB) and the Valley Bank (VB) are given
as follows.
PB 3.2 4.4 4.8 5.2 6.2 6.7 7.5 8.0 6.5 9.0 5.1 3.3 5.2 4.0 4.9
VB 5.8 5.6 5.7 5.8 6.1 6.4 6.7 6.7 6.7 6.8 6.5 7.0 6.9 6.5 6.4

a) Calculate the mean and standard deviation for Valley Bank waiting times.
(3 marks)
b) From the above box-and-whisker plots of waiting times for both Providence and Valley banks,
comment on the shape of distribution for each plot.
(2 marks)
c) The mean and standard deviation for Providence Bank are 5.6 and 1.692 respectively. Determine
which bank shows more consistent waiting times.
(3 marks)
Past Year Question
JUNE’19 – Question 2
The stem and leaf plot below represents the Mathematics test scores (out of 100) of 15 randomly selected
students in Class Angle.
Stem | Leaf
4 | 3
5 | 2 4 8
6 | 2 3 5 7 8 8 9
7 | 1 5 8
8 | 9
Key: 4|3 means 43
a) Calculate the mean and standard deviation for the test score.
(4 marks)
b) Interpret the mean obtained in a).
(1 mark)
c) The summary statistics of the test score for Mathematics and Biology students in Class
Angle are summarized in the following table. Using an appropriate measure, determine
which distribution of the test score between the subjects is more dispersed.

Descriptive statistics
               N     Mean      Std. Deviation
Mathematics    15    65.4667   11.1859
Biology        17    66.9412   17.5623

(3 marks)
Past Year Question
DEC’18 – Question 2
The descriptive statistics for the life span (in years) of brand AA washing machine are summarized as below.

a) Calculate the coefficient of skewness. Hence, comment on the shape of the distribution.

(2 marks)

b) Explain the meaning of the mode value.

(1 mark)

c) Given that the mean and variance of the life span (in years) of brand BB are 7.1 and 12.3 respectively, use an
appropriate measurement to determine which brand has a more consistent life span.

(4 marks)
Past Year Question
JULY’17 – Question 2
The plot given below represents the time (in minutes) taken by each child in Class A to solve the
same problem.
Time to solve problem
5 | 7 9
6 | 1 4 8
7 | 0 2 2 3 4 4 4 6 7 9
8 | 2 5 8 9 9
9 | 1 3 3 6
Key: 5 | 7 means 5.7
a) What is the name of the plot above?
b) How many children are there in Class A?
c) State the mode time taken for the children in Class A to solve the problem. (3 marks)
d) The statistics for the time taken by children in Class A and Class B to solve the problem are
summarized in the following table. Using an appropriate measure, determine which class has
a more consistent time in solving the problem.
Descriptive statistics
           N     Mean     Std. Deviation
Class A    24    7.733    1.131
Class B    27    7.196    1.223

(5 marks)
END OF CHAPTER 2
