
Exploratory Data Analysis (Descriptive Statistics)

Copyright © 2023, Department of Biostatistics, Epidemiology, and Population Health


Faculty of Medicine, Public Health, and Nursing
Universitas Gadjah Mada
Yogyakarta, Indonesia
Contents
Unit 2: Descriptive Statistics
Course Objectives
Learning Objectives
Introduction
Measures of Central Tendency
Measures of Variability
Data Summary
Data Presentation
Introduction of Statistical Analysis Tools
Pre-Class Exercise
Exercise in Class

Course Objectives
To understand the use of statistical tests in research and to interpret results
from a statistical test.
Learning Objectives
1. Differentiate between the various types of variables/scales of measurement
2. Describe data location, spread, and distribution (mean, median, mode,
frequency, proportion, maximum, minimum, percentiles/quartiles,
standard deviation, interquartile range)
3. Explain the features of normal distribution and non-normal distribution
4. Effectively summarize data and display data in tables and graphs
5. Interpret the descriptive summary and tabular/graphical data
visualization correctly
6. Perform descriptive analysis in Jamovi

Introduction
Descriptive statistics refers to the different methods applied to summarize and
present data in a form that will be more informative but also easier to
understand. Data collected for a study often consist of hundreds or even
thousands of subjects from whom the information is obtained. In order to
analyze this numerical data, it is necessary to organize the data systematically
and describe them comprehensively.

To describe a mass of numerical data and explore patterns within the dataset,
graphical methods or summary measures such as mean, median, mode
(central tendency), and dispersion or variability of our data can be used. These
measures are generally referred to as descriptive statistics. We can either
explore the distribution of one variable at a time, or explore the relationship
between two variables. Exploratory data analysis can also be used to check for
data errors and to check the assumptions of more complex statistical tests.

The type of measures and graphs to be used depends on the type of
variable/scale of measurement.

Distributions of Numerical/Quantitative Variables

Data collected from the field are rarely homogeneous, so understanding and
describing the distribution of the data is important.
We can describe the shape, center, and spread of the distribution of a numerical
variable.
The shape of a distribution can be described by its symmetry/skewness and its
number of peaks (modality). In terms of symmetry, a distribution can be:
a. Symmetric: the left side of the distribution mirrors the right side
when viewed from the median.
b. Skewed, either to the right or to the left. A distribution skewed to
the right (positively skewed) has extreme high values that stretch the
tail of the curve to the right. A distribution skewed to the left
(negatively skewed), a less common type, has extreme low values that
stretch the tail of the curve to the left.

A unimodal distribution has one mode around which the observations are
concentrated, while a bimodal distribution has two modes. Data can also be
uniformly distributed (no mode).

Measures of Central Tendency (for numerical variables)


1. Arithmetic Mean (for interval or ratio data)
The arithmetic mean or mean of a set of data (aggregate) is equal to their
total or sum divided by the number of observations.
\[ \text{Mean} = \frac{\text{Sum of observations}}{\text{Number of observations}} \]

Mean is the average or ‘typical’ value of a set of observations.


2. Median
The alternative measure of central tendency is the median. Equal numbers
of observations lie below and above this value. The median is a value that
is situated in the middle of an array.
How to calculate the median?
a. List/order the data from the lowest to the highest (make an array)
b. Identify the middle most observation and the value of this observation is
taken to be the median
Alternatively, we can find the position of the median using the following formula:
\[ \text{Position of the median} = \left(\frac{\text{Number of observations} + 1}{2}\right)\text{th observation} \]
For example, we want to calculate the median from 9 children's data. In this
case, the median will be the middlemost observation since the total number
of observations is odd.
7.0 7.6 7.7 8.1 9.3 9.9 11.2 11.4 12.9

\[ \text{Median} = \left(\frac{\text{Number of observations} + 1}{2}\right)\text{th} = \left(\frac{9 + 1}{2}\right)\text{th} = 5\text{th observation} = 9.3 \]
If the number of observations (n) is even, the median is the average of the
\( (n/2) \)th and \( (n/2 + 1) \)th observations in the ordered array.
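The position rule can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` is hypothetical, and the even-numbered case simply reuses the first eight of the nine values above.

```python
from statistics import median

def median_by_position(values):
    """Median via the position rule: sort, then pick the middle value(s)."""
    ordered = sorted(values)              # step a: make an ordered array
    n = len(ordered)
    if n % 2 == 1:                        # odd n: the ((n + 1) / 2)th observation
        return ordered[(n + 1) // 2 - 1]  # minus 1 because Python counts from 0
    lower = ordered[n // 2 - 1]           # even n: the (n / 2)th observation
    upper = ordered[n // 2]               # ... and the (n / 2 + 1)th observation
    return (lower + upper) / 2

children = [7.0, 7.6, 7.7, 8.1, 9.3, 9.9, 11.2, 11.4, 12.9]  # the nine values above
print(median_by_position(children))       # odd n: the 5th ordered value
print(median_by_position(children[:-1]))  # even n: average of the 4th and 5th values
print(median(children))                   # cross-check with the standard library
```

With the first eight values only, the two middle observations are 8.1 and 9.3, so the median is their average, about 8.7.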

The mean is very sensitive to extreme values (outliers), while the median is
largely unaffected by them. The mean is best suited to data with a symmetric
distribution and no extreme values, while the median is usually more appropriate
for data with a skewed distribution and/or extreme values.
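A small sketch can make this sensitivity concrete. It reuses the nine values from the median example and replaces the largest one with a made-up extreme value (129.0, imagined here as a data-entry error); that value is an assumption for illustration only.

```python
from statistics import mean, median

values = [7.0, 7.6, 7.7, 8.1, 9.3, 9.9, 11.2, 11.4, 12.9]
with_outlier = values[:-1] + [129.0]  # hypothetical error: 12.9 entered as 129.0

# Compare both measures of center before and after adding the outlier
print(round(mean(values), 2), median(values))
print(round(mean(with_outlier), 2), median(with_outlier))
```

The mean more than doubles (from about 9.46 to about 22.36), while the median stays at 9.3 in both cases.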

3. Mode
The mode is the value that appears most frequently in the dataset. In other words,
the mode is the value with the highest frequency.
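As a rough illustration, the mode can be found with Python's standard library. Both datasets below are made up for this sketch; the second has two equally frequent values, so two modes are returned.

```python
from statistics import multimode

# Hypothetical number of clinic visits for ten patients
visits = [1, 2, 2, 3, 3, 3, 4, 4, 5, 3]
print(multimode(visits))   # one value has the highest frequency

# Hypothetical data in which two values are tied for the highest frequency
bimodal = [1, 1, 1, 2, 3, 4, 4, 4]
print(multimode(bimodal))  # both tied values are returned
```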

Measures of Spread/Variability (for numerical variables)


The measures of spread provide more information about the distribution of the
data.
1. Range
The range is the simplest measure of variability: it is the difference
between the highest and the lowest observed values. So, we just
subtract the lowest value from the highest value. A large range
suggests that the data have large variability. However, a single extreme value
can be enough to make the range large, which is why we should look back at the
data whenever the range seems large.
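The following sketch, using made-up ages, shows how a single extreme value can inflate the range even though the rest of the data are unchanged.

```python
ages = [2, 3, 3, 4, 5, 5, 6, 7, 8]  # hypothetical ages in years
data_range = max(ages) - min(ages)  # highest value minus lowest value

ages_with_extreme = ages + [45]     # the same data plus one unusually large value
extreme_range = max(ages_with_extreme) - min(ages_with_extreme)

print(data_range, extreme_range)
```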
2. IQR (Interquartile Range)
Another type of range is the interquartile range. This IQR is the difference
between the 3rd quartile (75th percentile) and the 1st quartile (25th
percentile). IQR is generally used as the measure of spread when median
is the measure of central tendency.
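A minimal sketch of the IQR calculation, reusing the nine values from the median example. Statistical packages use slightly different rules for locating quartiles, so the `method="inclusive"` option chosen here is an assumption and other software may report slightly different quartiles for the same data.

```python
from statistics import quantiles

values = [7.0, 7.6, 7.7, 8.1, 9.3, 9.9, 11.2, 11.4, 12.9]

# n=4 splits the ordered data into quarters and returns Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # 3rd quartile minus 1st quartile

print(q1, q2, q3, round(iqr, 2))
```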
3. Standard Deviation and Variance
The most common measure of the spread of a distribution is the standard
deviation. It measures the spread around the mean, so it
should be used only when the mean is employed as our measure of
centrality (that is, when the data are approximately normally distributed).
Two terms are used for this measure of spread around the mean:
a. Variance. Variance measures variability taking the mean as
the reference point. It is based on the deviation of each
individual observation from the mean (how far it is from the mean).
Because negative and positive deviations cancel each other out when
summed, the deviations are squared. Conceptually, variance is the
average squared deviation from the mean.
b. Standard Deviation. Standard deviation is the square root of the
variance. Unlike the variance, the standard deviation measures
variation in the original units of measurement, so it is the most
commonly used measure of spread.

\[ \text{Standard deviation} = s = \sqrt{\text{variance}} = \sqrt{s^2} \]
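The steps behind the variance and standard deviation can be sketched as follows, reusing the nine values from the median example. One assumption is made that the text does not state: the sum of squared deviations is divided by n − 1, the usual convention for a sample, which is also what `statistics.variance` uses.

```python
from math import sqrt
from statistics import mean, stdev, variance

values = [7.0, 7.6, 7.7, 8.1, 9.3, 9.9, 11.2, 11.4, 12.9]

x_bar = mean(values)
deviations = [x - x_bar for x in values]  # distance of each value from the mean
print(round(sum(deviations), 10))         # raw deviations cancel out (sum is essentially zero)

squared = [d ** 2 for d in deviations]    # squaring removes the signs
s2 = sum(squared) / (len(values) - 1)     # sample variance (assumed divisor: n - 1)
s = sqrt(s2)                              # standard deviation, in the original units

print(round(s2, 3), round(s, 3))
print(round(variance(values), 3), round(stdev(values), 3))  # library cross-check
```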

4. Percentile
Percentiles split a set of ordered data into hundredths. The most commonly used
are the 25th, 50th (the median), and 75th percentiles. For example, if 70 mmHg is
the 25th percentile of the distribution of diastolic blood pressure in a sample, it
means that 25% of the sample have a diastolic blood pressure below 70
mmHg and 75% have a diastolic blood pressure above 70 mmHg.
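The interpretation of a percentile can be checked on a small dataset. The diastolic blood pressure values below are invented for this sketch, and the quartile rule (`method="inclusive"`) is an assumption; with small samples the share below the 25th percentile is only approximately 25%.

```python
from statistics import quantiles

# Hypothetical diastolic blood pressure readings (mmHg) for 17 people
dbp = [62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94]

p25, p50, p75 = quantiles(dbp, n=4, method="inclusive")

# Proportion of the sample with a reading strictly below the 25th percentile
share_below_p25 = sum(x < p25 for x in dbp) / len(dbp)

print(p25, p50, p75, round(share_below_p25, 2))
```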

To summarize, the mean and standard deviation should be used for relatively
symmetric distributions with no outliers, while the minimum, 1st quartile,
median, 3rd quartile, and maximum (the five-number summary) can be used in all
other cases.

Data Summary
1. Frequency Distribution
The frequency distribution is an arrangement of numerical data according
to size or group. For discrete data, the frequency distribution is simply a tally
of the observations in each category. For continuous data, a frequency
distribution can be made by constructing classes (intervals) and counting the
number of observations that fall in each class. Categorical variables can also
be summarized as frequencies and proportions.
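Both kinds of tally can be sketched in Python. The severity labels, the ages, and the class width of 5 years are all made up for illustration.

```python
from collections import Counter

# Categorical data: tally each category, then convert counts to proportions
severity = ["mild", "moderate", "mild", "severe", "mild", "moderate"]  # hypothetical
counts = Counter(severity)
proportions = {level: count / len(severity) for level, count in counts.items()}
print(counts, proportions)

# Numerical data: build classes of equal width and count observations per class
ages = [1, 3, 4, 6, 7, 7, 9, 11, 12, 14]  # hypothetical ages in whole years
width = 5                                 # assumed class width
class_counts = Counter((age // width) * width for age in ages)
for lower in sorted(class_counts):
    print(f"{lower}-{lower + width - 1}: {class_counts[lower]}")
```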

Table 1. Example of Frequency Distribution Table

2. Contingency table
Contingency tables (also called crosstabs or two-way tables) are used in
statistics to summarize the relationship between categorical
variables. A contingency table is a special type of frequency distribution
table in which two variables are shown simultaneously. In the simplest case,
each of the two variables has two categories. Epidemiological studies often
use this kind of contingency table (also called a two-by-two table) to compare
exposure status vs. disease/case status.
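A two-by-two table is just a count of each combination of the two variables. The ten exposure/disease records below are invented for this sketch.

```python
from collections import Counter

# Hypothetical (exposure, disease status) records for ten subjects
records = (
    [("exposed", "case")] * 4
    + [("exposed", "non-case")] * 1
    + [("unexposed", "case")] * 2
    + [("unexposed", "non-case")] * 3
)

table = Counter(records)  # one count per (exposure, status) cell

print(f"{'':<10} {'case':>5} {'non-case':>9}")
for exposure in ("exposed", "unexposed"):
    row = [table[(exposure, status)] for status in ("case", "non-case")]
    print(f"{exposure:<10} {row[0]:>5} {row[1]:>9}")
```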

Table 2. Example of Contingency Table

Data Presentation
There are various ways to present and summarize the data. The most
common way is to make a graph.
a. Stem and leaf plot
This is a quick way to picture the shape of a distribution while including the
actual numerical values in the graph. A stem and leaf plot works best for
a small number of observations. If the number is large (a hundred,
for example), the plot becomes difficult to read. A stem and leaf plot
is obtained by sorting the observations into rows according to their leading
digit.
Figure 1. Example of Stem and Leaf Plot

From the stem and leaf plot, we can see these important features:
1) The location of the center of the distribution
2) The overall shape of the distribution. Is the shape skewed or
symmetric?
3) Deviations from a smooth distribution shape. These may be gaps in
the distribution or outliers. An outlier is an individual observation that falls
outside the overall pattern of the data.
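The construction of a stem and leaf plot can be sketched for two-digit whole numbers, taking the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the ages are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative whole numbers (stem = tens, leaf = units)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # split into leading digit(s) and last digit
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep empty stems so gaps stay visible
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [12, 15, 21, 23, 23, 27, 30, 34, 35, 38, 41, 62]  # hypothetical ages in years
print("\n".join(stem_and_leaf(ages)))
```

In this made-up example the empty row for stem 5 shows a gap, and the single value on stem 6 stands apart from the rest as a possible outlier.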

b. Box and Whisker Plot


The box and whisker plot is also known as a box plot. It is drawn as a
long thin box that stretches from a low percentile to a high percentile. Most
commonly the box runs from the 25th to the 75th percentile, so that its length
is the IQR, but other low percentiles (5% or 10%) and high percentiles (90% or
95%) can be used, depending on which percentiles are of interest. The box is
crossed at the median. Then we draw a “whisker” from each end of the box
toward the extreme values of the range. With a box plot, we can
easily identify extreme values (outliers), the lowest and highest values, the
upper percentile, the median, the lower percentile, and the IQR (together, the
five-number summary). Box plots are especially useful when presented side by
side to compare the distributions of several groups.

Figure 2. Example of Box-and-Whisker Plots
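The numbers behind a box plot can be computed directly. The sketch below assumes one common convention that is not stated in the text: the box runs from Q1 to Q3, and points more than 1.5 × IQR beyond the box are flagged as outliers, with the whiskers ending at the most extreme values that are not outliers. The function name and the ages are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the assumed 1.5 x IQR rule."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
    high_fence = q3 + 1.5 * iqr  # values above this are flagged as outliers
    inside = [x for x in values if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in values if x < low_fence or x > high_fence],
    }

ages = [2, 3, 3, 4, 5, 5, 6, 7, 8, 45]  # hypothetical ages with one extreme value
print(box_plot_summary(ages))
```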

c. Histogram
A histogram is a better choice for presenting large volumes of data. In a
histogram, the data are simplified by grouping them into intervals (bins) and then
displaying them as a series of columns. The height of each column is proportional
to the number of observations or individuals falling into that interval (the
interval is plotted on the x-axis and the number of observations in each
interval (frequency) on the y-axis). For nominal or ordinal data, the groups
are the categories themselves. For continuous data, the groups can be
made using the same steps as for constructing a frequency distribution.

Figure 3. Example of Histogram
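The binning step behind a histogram can be sketched as a text plot, where each `#` stands for one observation. The function name `text_histogram`, the ages, and the bin width of 5 are assumptions made for illustration.

```python
def text_histogram(values, bin_width):
    """Return one line per bin; each '#' represents one observation in that bin."""
    start = (min(values) // bin_width) * bin_width  # lower edge of the first bin
    lines = []
    while start <= max(values):
        # Count values in the half-open interval [start, start + bin_width)
        count = sum(start <= v < start + bin_width for v in values)
        lines.append(f"[{start}, {start + bin_width}): {'#' * count}")
        start += bin_width
    return lines

ages = [1, 3, 4, 6, 7, 7, 9, 11, 12, 14]  # hypothetical ages in whole years
print("\n".join(text_histogram(ages, bin_width=5)))
```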

d. Chart (Bar and Pie)
A pie or bar chart is also one alternative for data presentation. The pie
chart is more appropriate for a single variable and a small number of
categories while the bar chart can be used for more than 1 variable and
more than 5 categories.

Figure 4. Example of Pie Chart

Figure 5. Example of Bar Chart

e. Scatter plot

The scatter plot deals with two continuous variables. In a scatter plot, we
put one variable on the horizontal line (x-axis) and the other on the vertical
line (y-axis). Then, we plot each observation according to its values on the
x- and y-axes. When we start to examine the relationship between two variables,
the scatter plot also helps us to locate outliers (if there are any).

Figure 6. Example of Scatter plot

Introduction of Statistical Analysis Tools


There are several statistical tools that can be used in data analysis. Each tool has its
strengths and weaknesses. The following table shows the features of each tool to
help us choose the most appropriate one for our needs.

Table 3. Comparison of features among various statistical tools.

| Features | STATA | SPSS | SAS | R Software | Jamovi |
|---|---|---|---|---|---|
| Data extension | .dta | .sav / .por | .sas7bcat / .sas#bcat / .xpt | .Rdata | .omv / .Rdata |
| User interface | Programming / Point-and-click | Mostly point-and-click | Programming | Programming | Point-and-click |
| Data manipulation | Very strong | Moderate | Very strong | Very strong | Moderate |
| Data analysis | Powerful | Powerful | Powerful/Versatile | Powerful/Versatile | Powerful |
| Cost | Affordable (perpetual license, renew only when upgrading) | Expensive (renewal when upgrading, long-term license) | Expensive (yearly renewal) | Open source | Open source (based on R software) |
| Program extension | .do (do-files) | .sps (syntax file) | .sas | .R (R script file) | .omv |
| Output extension | .log (text file) / .smcl (Stata formatted log file) | .spo (SPSS output file) | Various formats | .txt (log files) / .Rmd (R markdown file in html/pdf/doc format) | .omv |

Exercise
Pre-Class Exercise
Please read the following paper:

Purnama, I., Widjajanto, P. H., & Damayanti, W. (2021). Influence of initial treatment
delay on overall survival and event-free survival in childhood acute
lymphoblastic leukemia. Paediatrica Indonesiana, 61(4), 217–22.
https://doi.org/10.14238/pi61.4.2021.217-22

Based on the paper, please fill in the table below. The first row has been filled in
for you as a guide.
You have to do this exercise before the practical session. Please give the printout
of your completed assignment to the tutor when you enter the practical session room.
Failure to submit your assignment means that you will be denied entry to the practical
session room. Your attendance will then be marked as absent.
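| No | Variable name | Variable type | Scale | Unit/Group |
|---|---|---|---|---|
| 1 | Sex | Categorical | Nominal | Male & female |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |
| 6 | | | | |
| 7 | | | | |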

In-Class Exercise
Exercise 1

In basic descriptive statistics, students should be able to describe the measures
that tell us the central value of a variable and the spread of the data around that
central value (variability/spread). We will be using fictitious leukemia data (can be
found in Gamel). Using that data, we will answer the following questions:

a. What is the variable type for age (this refers to the age at diagnosis)?

b. How do we describe age?

c. Please describe age!

● Open the dataset in Jamovi. Choose the menu (3 bars icon on upper left) → Open → This PC → browse to the leukemia_dataset.omv location → click leukemia_dataset.omv

● Choose Analyses → Exploration → Descriptives

● Choose “Age at diagnosis” and press the arrow button next to the Variables list. Open the Statistics menu (below the variables selection area) → choose N, Missing, Mean, Median, Percentiles, Std. Deviation, Variance, Range, and IQR.

● Open the Plots menu → Choose Histogram and Density

● Interpret the result!


d. Can we describe age using frequency distribution?

e. Which of the following pictures best describes the age distribution?

f. What do we need to see age variability?

g. What are the range, interquartile range, variance, and standard deviation of age?

h. Which of the following pictures has good variability?

i. Can we use the mean and median for the “Gender” variable?

j. What can we use for describing the “Gender” variable?

● The Gender variable is not labeled yet (if you import the data from a Microsoft Excel file, none of the variables are labeled). Assign labels by double-clicking on the variable name → change the values in the Level box (use the codebook sheet in leukocyte_dataset.xlsx as guidance).

● Choose Analyses → Exploration → Descriptives
● Choose Gender → press the arrow button next to the “variable(s)” list
● Choose “Frequency tables” between Variables List and Statistics menu. In the
Statistics menu only choose N and Missing

● Interpret the result!

Exercise 2

Below is a table about age and leukocyte count data. It contains the mean and
standard deviation. The students are asked to help the principal investigator
determine which variable is more varied between age and leukocyte count.

| | Age in years | Leukocyte count |
|---|---|---|
| Mean | 6.12 | 61,912 |
| Standard Deviation | 4.44 | 115,147 |

Exercise 3

One of the common ways to visualize quantitative data is through box plots. Box
plots are useful for identifying outliers and for comparing distributions. You are asked
to construct a box plot graph for a single variable with a certain command using the
software. Please create a box plot about Age at diagnosis:

a. Box plot – leukocyte counts


● The extremes (min and max)
● The quartiles (Q1, Q2, Q3)
● IQR
b. Please identify the letter-coded
part of the box plot on the left!

Steps in Jamovi

● Choose Analyses → Exploration → Descriptives
● Choose “Age at diagnosis” and press the arrow button next to the Variables
list.
● Open the Plots menu → choose Box plot

● Interpret the result!

Exercise 4

There are some common ways to visualize quantitative data, such as diagrams,
histograms, and/or box plots. You can construct a graph for a single variable with a
certain command using the software. Please create a graph and interpret the result:

1) Pie diagram – Gender and Classification


○ Make sure that “JJStatsPlot” module is installed
○ Choose JJStatsPlot → choose Pie Charts

○ Choose Gender → press the arrow button next to the Dependent Variable
○ Interpret the result
○ Repeat the steps for Classification (choose the “Classification” variable
instead of “Gender”)

2) Box plot - Age between men and women – See the different shapes of the
distributions
○ Choose Analyses → Exploration → Descriptives
○ Choose “Age at diagnosis” and press the arrow button next to the
Variables list. Choose “Gender” and press the arrow button next to
Split by
○ Open Plots menu and choose Box plot

○ Interpret the result!

Exercise 5

Now, we will move to scatter plot using Leukocyte Count

● Choose Analyses → Exploration → Scatterplot

● Choose “Leukocytes count” → press the arrow next to the “Y-Axis” box. Choose “Age at diagnosis” → press the arrow next to the “X-Axis” box.

● Interpret the result!
