
Exploratory Data Analysis: A Brief Introduction
Michael B. Frondoza, M.Sc.
michael.frondoza@g.msuiit.edu.ph

August 20, 2020

DEPARTMENT of MATHEMATICS and STATISTICS
COLLEGE OF SCIENCE AND MATHEMATICS, MSU-ILIGAN INSTITUTE OF TECHNOLOGY
A Commission on Higher Education - Center of Excellence
The following discussion is a derivative of the lecture on Exploratory Data
Analysis presented by Prof. Patrick Meyer of the University of Virginia,
found at https://www.youtube.com/watch?v=zHcQPKP6NpM

MB Frondoza, Exploratory Data Analysis, August 2020

Exploratory Data Analysis refers to a set of procedures for producing
descriptive and graphical summaries of the data.

Benefit
* examine data as they are
* without assumptions

Learning Objectives:
1. Identify types of data
2. Select appropriate descriptive statistics
3. Choose the correct type of plot

We will be using data from the 2012 PISA mathematics exam and background
questions. The data involve 24,144 examinees from sixty-four different
countries.

Types of data

Categorical data
1. nominal
2. ordinal
Continuous data
3. interval
4. ratio

Nominal
* discrete units
* no inherent ordering

Ordinal
* discrete units
* inherent ordering
* distance between units is not the same

Interval
* discrete units
* inherent ordering
* distance between units is the same
* no absolute zero

Ratio
* discrete units
* inherent ordering
* distance between units is the same
* has absolute zero

Which of the variables in the table are nominal?

Which of the remaining variables are ordinal?

What about Books? Is it ordinal, interval or ratio?

Let us look at the survey question for the variable “Books”.

* The distance between response categories is not the same, so the
variable “Books” is ordinal.
* If the distance were the same and the scale had an absolute zero, then
the variable “Books” would be ratio.

The last variable in the table is what type of data?


* Test scores are commonly assumed to be interval
* PISA uses a scaling methodology that helps ensure that the scale
is actually interval

Why do we need to know the type of data contained in a variable?

* Statistical methods are designed to work with certain types of data
* Many of the methods used to analyse continuous data are not the same
as the methods used to analyse categorical data
* Not knowing the type of data can lead to the wrong analysis
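As a rough illustration of the points above, the sketch below pairs each level of measurement with a summary statistic that is commonly considered suitable for it. The variable names and values are hypothetical, invented for this example, and are not taken from the PISA data.

```python
import statistics

# Hypothetical responses, invented for illustration only.
country = ["PH", "US", "PH", "JP", "PH"]   # nominal: labels with no order
books = [1, 2, 2, 3, 5]                    # ordinal: ordered category codes
score = [420.5, 515.0, 388.2, 610.7]       # interval: equal-distance scale

# Nominal data: only counting makes sense, so report the most common label.
print(statistics.mode(country))

# Ordinal data: order is meaningful, so the median is usable. A mean would
# treat the unequal distances between categories as if they were equal.
print(statistics.median(books))

# Interval (or ratio) data: distances are equal, so the mean is meaningful.
print(statistics.mean(score))
```

Taking the mean of `country` would fail outright, and taking the mean of `books` would run but give a number that is hard to interpret, which is the kind of wrong analysis the slide warns about.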

Before we proceed, we will need the following concept:
A percentile is a measure used in statistics indicating the value below
which a given percentage of observations in a group of observations falls.

For example, the 20th percentile is the value (or score) below which 20%
of the observations may be found.

As another example, if your score is at the 99th percentile among those
who took the same exam, then 99% of the examinees scored below you.
Equivalently, your score is in the top 1%.
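The idea of "percentage of observations below a value" can be sketched in a few lines. The function name `percent_below` and the list of scores are hypothetical, chosen only to mirror the 99th-percentile example.

```python
def percent_below(scores, value):
    """Return the percentage of scores strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical exam: 100 examinees with scores 1, 2, ..., 100.
scores = list(range(1, 101))

# The top scorer has 99 of the 100 examinees below them.
print(percent_below(scores, 100))  # 99.0
```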

For example, given the ordered list {5, 15, 20, 25, 35, 40, 50, 65, 80, 90}:

10th percentile (P10) is 6
25th percentile (P25) is 18.75
50th percentile (P50) is 37.5
75th percentile (P75) is 68.75
90th percentile (P90) is 89

Calculator:
https://www.socscistatistics.com/descriptive/percentile/default.aspx
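The slide does not state which percentile convention it uses. The listed values are consistent with linear interpolation at rank p(n + 1)/100, so the sketch below assumes that convention. The function name `percentile` is hypothetical, and other conventions (and other calculators) can give slightly different answers.

```python
def percentile(sorted_data, p):
    """Percentile p (an integer 0-100) of an already sorted list.

    Assumed convention: linear interpolation at the 1-based rank
    p * (n + 1) / 100, clamped to the smallest and largest values.
    """
    n = len(sorted_data)
    # Split the rank into whole and fractional parts using integer
    # arithmetic, which avoids floating-point noise in the rank.
    lower, rem = divmod(p * (n + 1), 100)
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + (rem / 100) * (above - below)

data = [5, 15, 20, 25, 35, 40, 50, 65, 80, 90]
for p in (10, 25, 50, 75, 90):
    print(p, percentile(data, p))
```

For instance, P25 has rank 25(11)/100 = 2.75, which lies three quarters of the way from the 2nd value (15) to the 3rd value (20), giving 18.75.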

Note:

First Quartile (Q1) is the 25th percentile

Second Quartile (Q2) is the 50th percentile or Median

Third Quartile (Q3) is the 75th percentile
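A short sketch of the three quartiles, using Python's standard `statistics` module on the ordered list from the percentile example. The default `method="exclusive"` of `statistics.quantiles` is assumed here; other quartile conventions may give different values.

```python
import statistics

data = [5, 15, 20, 25, 35, 40, 50, 65, 80, 90]

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)

# The second quartile coincides with the median.
print(q2 == statistics.median(data))
```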

Bar Chart
A bar chart (or bar graph) is a chart or graph that presents categorical
data with rectangular bars with heights or lengths proportional to the
values that they represent.

source: http://mathisfun.com/data/bar-graphs.html
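The quantity a bar chart displays is a count (or other value) per category. The following sketch tallies a hypothetical nominal variable and prints a crude text version of the bars. The responses are invented for illustration.

```python
from collections import Counter

# Hypothetical nominal responses, invented for illustration.
responses = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Count how many times each category occurs.
counts = Counter(responses)

# One bar per category, with length proportional to its count.
for category, count in counts.most_common():
    print(f"{category:<8}{'#' * count} ({count})")
```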

Pie Chart
A pie chart (or a circle chart) is a circular statistical graphic that is
divided into slices to illustrate numerical proportion. In a pie chart, the arc
length of each slice (and consequently its central angle and area) is
proportional to the quantity it represents.

source: http://mathisfun.com/data/pie-charts.html
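Because each slice is proportional to its quantity, the central angle of a slice is 360 degrees times that category's share of the total. A minimal sketch, with hypothetical category counts:

```python
# Hypothetical category counts, invented for illustration.
counts = {"A": 10, "B": 30, "C": 60}

total = sum(counts.values())

# Central angle of each slice, in degrees.
angles = {label: 360 * value / total for label, value in counts.items()}

for label, angle in angles.items():
    print(label, angle)
```

The angles always sum to 360 degrees, whatever the counts are.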
Histogram
A histogram is an approximate representation of the distribution of
numerical data.
To construct a histogram, the first step is to “bin” (or “bucket”) the range
of values and then count how many values fall into each interval. The bins
are usually specified as consecutive, non-overlapping intervals of a variable.
The bins (intervals) must be adjacent, and are often (but are not required
to be) of equal size.
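The binning step can be sketched as follows. The function name `bin_counts`, the starting point 0, and the bin width 25 are assumptions made for this example. The data are the ordered list used in the percentile example.

```python
def bin_counts(values, start, width, n_bins):
    """Count the values in each of n_bins equal-width, adjacent bins.

    Bin i covers the half-open interval
    [start + i * width, start + (i + 1) * width).
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:       # ignore values outside the binned range
            counts[i] += 1
    return counts

data = [5, 15, 20, 25, 35, 40, 50, 65, 80, 90]

# Four bins: [0, 25), [25, 50), [50, 75), [75, 100)
counts = bin_counts(data, start=0, width=25, n_bins=4)
print(counts)  # [3, 3, 2, 2]
```

The heights of the histogram bars are these counts. Choosing a different bin width changes the shape of the picture, which is why a histogram is only an approximate representation of the distribution.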

source: http://mathisfun.com/data/histograms.html

Box plot
A box plot (or boxplot) is a method for graphically depicting groups of
numerical data through their quartiles.
Box plots may also have lines extending from the boxes (whiskers)
indicating variability outside the upper and lower quartiles. Outliers may be
plotted as individual points.
Outliers are points that lie more than 1.5 times the interquartile range
(IQR = Q3 − Q1) beyond the first or third quartile. These are values
greater than Q3 + 1.5(IQR) or less than Q1 − 1.5(IQR).

Box plot without outliers
For example, given an ordered list {5, 15, 20, 25, 35, 40, 50, 65, 80, 90}.

source: http://www.alcula.com/calculators/statistics/box-plot/
Box plot with outliers
For example, given an ordered list {−60, 15, 20, 25, 35, 40, 50, 65, 80, 150}.

source: http://www.alcula.com/calculators/statistics/box-plot/
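The outlier rule can be checked against the two example lists above. This sketch assumes the default quartile method of Python's `statistics.quantiles`. The calculator cited as the source may use a different quartile convention, so its box edges can differ slightly. The function name `find_outliers` is hypothetical.

```python
import statistics

def find_outliers(data):
    """Return values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

first_list = [5, 15, 20, 25, 35, 40, 50, 65, 80, 90]
second_list = [-60, 15, 20, 25, 35, 40, 50, 65, 80, 150]

print(find_outliers(first_list))   # []
print(find_outliers(second_list))  # [-60, 150]
```

Under this convention both lists have Q1 = 18.75 and Q3 = 68.75, so IQR = 50 and the fences are −56.25 and 143.75. Only −60 and 150 fall outside them.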
John Wilder Tukey (June 16, 1915 - July 26, 2000)
* earned a Ph.D. in Mathematics from Princeton University in 1939
* Professor at Princeton University
* credited with the development of exploratory data analysis
* invented several plots, including the box plot and the stem-and-leaf plot
* famous for coining the terms “bit” and “software”
