Descriptive Statistics: Outline

EPHD-310 Basic Biostat, Dr. Jaffa
Descriptive Statistics
Outline:
1. Variables and types of variables.
2. Data representation.
3. Measures of location.
4. Measures of spread.
5. Coefficient of variation.
6. Grouped/Recoded data.
7. Graphic Methods.
8. Learning Outcomes Covered in Lecture 1
Dr. Jaffa Lecture1 Descriptive 1

Statistics
Variables and Types of Variables

Types of random variables:
1. Quantitative: data are numeric.
2. Qualitative: data are nonnumeric and organized into
categories.

Variables and Types of Variables

Quantitative data could be:
• Discrete: numeric data that take a countable set of separate values, typically whole-number counts (e.g., number of pregnancies).
• Continuous: numeric data that can take any value within an interval, so the number of possible values is infinite (e.g., age, weight, height, systolic blood pressure).
Qualitative data could mainly be:
• Nominal: data organized into categories that cannot be ordered (e.g., gender: male, female; race: African, Caucasian, Asian, Hispanic).
• Ordinal: data organized into categories that can be ordered (e.g., status of disease: normal, mild, moderate, severe; grading scale: excellent, very good, good, pass, fail).
Data Representation
• Consider a sample of data x1, …, xn drawn from some population p, where:
x1 corresponds to the first sample point
xn corresponds to the nth sample point.
• Before drawing conclusions about the population p from this sample, the data need to be described in a concise and informative way.
• Data can be represented in numerical or graphical form.

Data Representation
• Numerical representation of data entails:
1) Measures of location
2) Measures of spread
• Graphic methods for displaying data encompass:
1) Bar graphs
2) Histograms
3) Stem-and-leaf plots
4) Box plots

Measures of Location
• A measure of location is a type of measure used to summarize data by defining the center or middle of the sample.
• Measures of location are also referred to as “Measures of Central Tendency”.
• There are different types of measures of location, but the two most commonly used are:
1) Arithmetic mean
2) Median

Measures of Location:
Arithmetic Mean
• The arithmetic mean is usually referred to as the average of the sample and is denoted by $\bar{x}$.
• The arithmetic mean is the sum of all observations divided by the number of observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Arithmetic Mean
• Limitation:
- The arithmetic mean is very sensitive to extreme values.
- When extreme values are present, the mean is shifted towards them and may not be representative of the majority of the sample points.

Median
• Consider a sample of n data points x1, …, xn drawn from some population p and ordered from smallest to largest, where:
x(1) corresponds to the smallest sample point
x(n) corresponds to the largest sample point.
• The sample median is the center point of the sample: 50% of the observations are above it and 50% are below it.

Median
• The sample median is:
(1) the $\left(\frac{n+1}{2}\right)$th observation of the ordered sample if n is odd;
(2) the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations of the ordered sample if n is even.
• Strength: the sample median is less sensitive to extreme values.
• Limitation: because it is determined mainly by the value of the middle point(s), the median is less sensitive to the actual values of the remaining data points.

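The odd/even rule for the sample median can be sketched in a few lines of Python. This is an illustration only: the function name `sample_median` and the small test samples are assumptions made for this sketch, not part of the lecture.

```python
def sample_median(values):
    """Median by the odd/even rule: the middle point, or the mean of the two middle points."""
    ordered = sorted(values)          # x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th ordered point, which is index mid counting from 0
        return ordered[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th ordered points
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 3, 9]))       # 7
print(sample_median([7, 3, 9, 4]))    # 5.5
```

Sorting inside the function means the caller does not need to order the sample first.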
• Example 1:
Assume the weights in pounds of 5 people were collected randomly and ordered from lowest to highest:
x(1) = 130; x(2) = 138; x(3) = 138; x(4) = 140; x(5) = 220.
- Arithmetic mean:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \frac{130 + 138 + 138 + 140 + 220}{5} = 153.2$$
- Median: since n = 5 (odd), the median is the value of the third ordered observation, x(3) = 138. The median of this sample is 138.
130; 138; [138]; 140; 220: two observations lie above the middle observation and two lie below it.
• Example 2:
Assume the weights in pounds of 6 people were collected randomly and ordered from lowest to highest:
x(1) = 120; x(2) = 124; x(3) = 126; x(4) = 128; x(5) = 398; x(6) = 399.
- Arithmetic mean:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{n} = \frac{120 + 124 + 126 + 128 + 398 + 399}{6} \approx 215.8$$
The mean of 215.8 is shifted towards the extreme values (398 and 399 in this example).
- Median: since n = 6 (even), the median is the average of the third and fourth ordered observations.
120; 124; [126; 128]; 398; 399: two observations lie above 126 and 128, and two lie below them.
- Median = (126 + 128)/2 = 127, which is not sensitive to the extreme values 398 and 399.

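To see how differently the two measures react to extreme values, the Example 2 weights can be summarized with and without the two largest observations. This is a minimal sketch using Python's standard `statistics` module. Dropping 398 and 399 is done here purely for illustration and is not a recommended way of handling such values.

```python
from statistics import mean, median

weights = [120, 124, 126, 128, 398, 399]   # Example 2 data, in pounds
trimmed = weights[:4]                      # same sample without 398 and 399 (illustration only)

print(round(mean(weights), 1), median(weights))   # 215.8 127.0
print(mean(trimmed), median(trimmed))             # 124.5 125.0
```

Removing the two extreme values moves the mean by about 91 pounds but moves the median by only 2 pounds.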
Measures of Spread
• Measures of spread, also referred to as “Measures of Dispersion”, are used to describe the variability in a sample.
• Many samples can be well described by the combination of a measure of location and a measure of spread.
• Several different measures of spread can be used to describe the variability:
- Range
- Percentiles (or quantiles)
- Variance and standard deviation

Measures of Spread:
Range
• The range is the difference between the largest and
smallest observations: Range = x(n)- x(1)
• Limitation: range is very sensitive to extreme values.
• Example:
Assume you have the following observations ordered
from lowest to highest:
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5)= 185.
Range = x(5)- x(1) = 185 – 5 = 180
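
The range calculation, and its sensitivity to a single extreme value, can be checked with a short Python sketch. The helper name `sample_range` is an assumption made for this illustration.

```python
def sample_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


observations = [5, 10, 28, 64, 185]
print(sample_range(observations))        # 180
print(sample_range(observations[:-1]))   # 59, once the largest value (185) is left out
```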

Measures of Spread:
Percentiles (or quantiles )
• The pth percentile is the value Vp such that p percent of the ordered sample points are less than or equal to Vp.
• Note: the sample should be ordered from lowest to highest.
• Percentiles are less sensitive to extreme values than the range.

Measures of Spread:
• Computing percentiles by hand is not required, but I will show you the formulas for computing percentiles and their application for your information.
• You are NOT responsible for computing percentiles by hand; you only need to know what a percentile means and how to generate percentiles in SPSS.

Measures of Spread:
Formula for computing percentiles (you are NOT responsible for it; it is just for your general information):
• The pth percentile is defined by:
(1) The (k+1)th point of the ordered sample if np/100 is not an integer (where k is the integer part of np/100). Recall that an integer is a whole number, with no fractional part.
(2) The average of the (np/100)th and (np/100 + 1)th points of the ordered sample if np/100 is an integer.
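The two-case percentile rule can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `percentile` is hypothetical, p is assumed to be a whole number strictly between 0 and 100, and the result follows the hand rule above rather than whatever method a statistical package uses, so software output may differ slightly.

```python
def percentile(values, p):
    """Return the pth percentile of values using the two-case rule on n*p/100.

    Assumes p is a whole number with 0 < p < 100 (an assumption of this sketch).
    """
    ordered = sorted(values)            # x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need a non-empty sample and 0 < p < 100")
    k, remainder = divmod(n * p, 100)   # k is the integer part of n*p/100
    if remainder != 0:
        # Rule (1): the (k+1)th ordered point, which is index k counting from 0
        return ordered[k]
    # Rule (2): average of the kth and (k+1)th ordered points
    return (ordered[k - 1] + ordered[k]) / 2


data = [5, 10, 28, 64, 185]
print(percentile(data, 60))   # 46.0
print(percentile(data, 75))   # 64
```

Using `divmod` on whole numbers avoids floating-point trouble when deciding whether np/100 is an integer.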
Measures of spread: Percentiles (or quantiles)

Example: The following observations are ordered from lowest to highest:
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5) = 185.
Compute the 60th percentile:

• n = 5, p = 60. Since np/100 = 5*60/100 = 3 is an integer, use rule (2): the pth percentile is the average of the (np/100)th and (np/100 + 1)th points of the ordered sample.
• The 60th percentile is the average of the third and fourth ordered observations.
• 60th percentile = (x(3) + x(4))/2 = (28 + 64)/2 = 46

Measures of Spread: Percentiles (or quantiles)

Example: (continued)
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5) = 185.
Compute the 75th percentile:

• n = 5, p = 75. Since np/100 = 5*75/100 = 3.75 is not an integer (with k = 3), use rule (1): the pth percentile is the (k+1)th point of the ordered sample, where k is the integer part of np/100.
• The 75th percentile is the 4th ordered observation.
• 75th percentile = x(4) = 64

Measures of Spread:
• The upper and lower quartiles are, respectively, the 75th and 25th percentiles of the sample. These are commonly used in the literature.
• Interquartile range = upper quartile - lower quartile
• The median can be thought of as the 50th percentile, since 50% of the data fall below it.

Measures of Spread:
Variance and Standard Deviation
• The variance is a measure that summarizes the deviations between the individual sample points and the arithmetic mean (with the center of the sample being defined as the arithmetic mean).
• The sample variance is defined as follows:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

Measures of Spread:
• Note that the sum of all deviations from the mean is zero, i.e. $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
• The sample standard deviation is the square root of the sample variance:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$
• The mean and standard deviation are the most widely used measures of location and spread in the literature.

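The zero-sum property of the deviations follows directly from the definition of the arithmetic mean. A short derivation, using only the symbols already defined in the lecture:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

Because the raw deviations always cancel in this way, they cannot measure spread by themselves, which is one way to see why the variance uses squared deviations.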
Measures of Spread:
Example:
Compute the sample variance and standard deviation for the
following observations ordered from lowest to highest:
x(1) = 9; x(2) = 10; x(3) = 12; x(4) = 14; x(5)= 16.

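Measures of Spread:
Example: (continued)
- Sample mean: $\bar{x} = (9 + 10 + 12 + 14 + 16)/5 = 12.2$
- Sample variance: $s^2 = \frac{(9-12.2)^2 + (10-12.2)^2 + (12-12.2)^2 + (14-12.2)^2 + (16-12.2)^2}{5-1} = \frac{32.8}{4} = 8.2$
- Sample standard deviation: $s = \sqrt{s^2} = \sqrt{8.2} \approx 2.9$
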
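The variance and standard deviation for this example (observations 9, 10, 12, 14, 16) can be verified step by step in Python. This is a minimal sketch that follows the definitions term by term. The variable names are illustrative, and `statistics.variance` from the standard library is used only as a cross-check because it also divides by n - 1.

```python
from math import sqrt
from statistics import variance

observations = [9, 10, 12, 14, 16]
n = len(observations)

x_bar = sum(observations) / n                              # arithmetic mean
squared_devs = [(x - x_bar) ** 2 for x in observations]    # squared deviations from the mean
s2 = sum(squared_devs) / (n - 1)                           # sample variance (divide by n - 1)
s = sqrt(s2)                                               # sample standard deviation

print(round(s2, 1), round(s, 1))                           # 8.2 2.9
print(abs(variance(observations) - s2) < 1e-9)             # True
```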
SPSS Output for Measures of Location and Spread
Selected data summary from a published article by Jaffa et al.
As you can see, the mean and standard deviation are both used to summarize the data numerically.

Graphic Methods
• Graphic methods are used to display data in a graphical format.
• The purpose of using graphs is to give a quick overall impression of the data, which is difficult to obtain with numeric measures.
• What impression can we get from a graphical representation? You can visually assess the spread of the data, the shape of the distribution (symmetrical or skewed), and the presence of outliers.
• There are various types of graphical display; the most commonly used methods for continuous data are:
- Histograms
- Box plots
Graphic Methods:
Histograms
SPSS generated histogram corresponding to hours slept example

Graphic Methods:
Box-and-whisker Plot
• A box-and-whisker plot (simply referred to as a box plot) presents the median, upper quartile, and lower quartile of the sample.
• Recall: the upper and lower quartiles are approximately the 75th and 25th percentiles of the sample.
• A box plot can also be used to help detect extreme values and outliers, and to describe the skewness of a distribution.

Graphic Methods:
Box plot for hours slept generated by SPSS
[Figure labels: upper quartile, median, lower quartile]

Graphic Methods:
Box Plots
• If the distribution is symmetric, then the upper and lower whiskers have equal length.
• A distribution is positively skewed, or skewed to the right, if the upper whisker is longer than the lower whisker.
• A distribution is negatively skewed, or skewed to the left, if the lower whisker is longer than the upper whisker.
[Figure labels on each box plot: upper whisker, lower whisker]

Other Assessments for Skewness

[Figures not reproduced: right (positive) skewness; left (negative) skewness; symmetrical, no skewness]

Graphic Methods:
Box Plots
• A box plot can also be used to help detect outliers and extreme values.
• In the next two slides I will show you, for your general information, the formulas used to detect outliers and extreme values, with an application of these formulas.
• You are NOT responsible for detecting outliers and extreme values by hand using these formulas.
• You are responsible only for knowing how to identify outliers and extreme values using SPSS, as I showed in Box Plots.

Graphic Methods:
Box Plots
• A box plot can also be used to help detect outliers and extreme values (you are NOT responsible for the material below; it is just for your information).
• An outlier is a value x such that:
(1) x > upper quartile + 1.5 * (upper quartile - lower quartile)
or
(2) x < lower quartile - 1.5 * (upper quartile - lower quartile)
• An extreme outlying value is a value x such that:
(1) x > upper quartile + 3.0 * (upper quartile - lower quartile)
or
(2) x < lower quartile - 3.0 * (upper quartile - lower quartile)
Graphic Methods:
Box Plots
• Assess whether there are outliers or extreme outliers in the following sample: 16, 10, 49, 15, 6, 15, 8, 19, 11, 22, 13, 17
• Ordered sample: 6, 8, 10, 11, 13, 15, 15, 16, 17, 19, 22, 49
• Upper quartile = 75th percentile = 18
• Lower quartile = 25th percentile = 10.5
- 49 is an extreme outlier (in the positive direction), since
49 > upper quartile + 3.0 * (upper quartile - lower quartile) = 18 + 3.0 * (18 - 10.5) = 40.5
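The outlier and extreme-value cutoffs can be applied to the whole sample with a short Python sketch. The quartiles 10.5 and 18 are taken as given from the worked example rather than recomputed. The function name `classify` and the label "typical" are assumptions made for this illustration, and SPSS may compute its quartiles slightly differently.

```python
def classify(x, lower_q, upper_q):
    """Label x using the 1.5 * IQR (outlier) and 3.0 * IQR (extreme) cutoffs."""
    iqr = upper_q - lower_q
    if x > upper_q + 3.0 * iqr or x < lower_q - 3.0 * iqr:
        return "extreme"
    if x > upper_q + 1.5 * iqr or x < lower_q - 1.5 * iqr:
        return "outlier"
    return "typical"   # hypothetical label for values inside both cutoffs


sample = [16, 10, 49, 15, 6, 15, 8, 19, 11, 22, 13, 17]
lower_q, upper_q = 10.5, 18   # quartiles as computed in the worked example

labels = {x: classify(x, lower_q, upper_q) for x in sample}
flagged = {x: label for x, label in labels.items() if label != "typical"}
print(flagged)   # {49: 'extreme'}
```

The extreme cutoffs are tested first because any value beyond 3.0 * IQR is also beyond 1.5 * IQR.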
Graphic Methods:
Box Plots
SPSS-generated box plot for these data:
[Figure annotation: the labeled point is the 3rd observation of the data as entered into SPSS. It has the value 49 and is considered an extreme outlier.]

Graphic Methods:
An example of a box plot generated by SPSS
[Figure labels: extreme value, outlier, outlier]

Pie Chart:
SPSS Example of a Pie Chart

General Comments
• When data have a finite number of possible values, i.e. categorical or discrete data such as gender (male/female) or marital status (single, married, divorced), pie charts and bar graphs of the different values are appropriate methods for presenting the data.
• However, when the data to be presented have an infinite number of possible values (continuous data), a pie chart becomes cumbersome, and histograms or box plots give a better presentation.
• What can we do with outliers/extreme values? If the value comes from a data-entry error, correct it if possible or otherwise delete it. If it is a correct value, keep it; you can conduct the analysis with and without it to check its effect.
General Comments
• When the distribution of the data is skewed, most of the time you will notice a discrepancy between the value of the mean and that of the median.
• In this situation, report the median as the measure of central tendency rather than the mean, or report both.

Data Grouping and Recoding

• Data can be grouped to form larger groups or to change your data from a continuous scale to categorical data (groups).
• Example: suppose you have data on yearly income for 10 individuals, in dollars: 24000, 32500, 35000, 37000, 40000, 45000, 47000, 48000, 49000, 52000.
• You can create groups as follows: group 1: 20000 to 29000; group 2: 29001 to 30000; group 3: 30001 to 39000; group 4: 39001 to 40000; group 5: 40001 to 50000; group 6: more than 50000.
• You can also reduce these to three groups: group 1: 20000 to 40000; group 2: 40001 to 50000; group 3: more than 50000.
• This grouping can be achieved in SPSS using what we refer to as “recoding”.
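The three-group recoding of income can be sketched in Python to show the idea outside SPSS. The function name `income_group` is hypothetical. Returning `None` for incomes below 20000 is an assumption of this sketch, since the lecture's groups do not cover that case.

```python
from collections import Counter


def income_group(income):
    """Recode a yearly income (dollars) into the three groups from the example."""
    if 20000 <= income <= 40000:
        return 1
    if 40000 < income <= 50000:
        return 2
    if income > 50000:
        return 3
    return None   # below 20000: not covered by the lecture's groups (assumption)


incomes = [24000, 32500, 35000, 37000, 40000, 45000, 47000, 48000, 49000, 52000]
groups = [income_group(x) for x in incomes]
print(Counter(groups))   # Counter({1: 5, 2: 4, 3: 1})
```

The second condition uses `40000 < income` rather than `40001 <= income` so that a non-whole-dollar value such as 40000.50 still falls into a group.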
Data Grouping and Recoding
• Note that you can change your data from a continuous to a categorical scale: for example, you can change age in years to age groups.
• But you cannot change a categorical variable into a continuous variable. For example, if the data given to you contain only age groups, then for every individual you know only the age group that person is in, not the exact age. If John's age is between 20 and 25, you cannot tell his exact age.

Course Learning Outcomes Covered in Lecture 1

• Apply the appropriate descriptive techniques commonly used
to summarize public health data.
• Apply ethical principles to data management and analysis.
(Note that honoring ethical conduct is the essence of all our
analysis and data reporting).
