
EPHD-310 Basic Biostat Dr. Jaffa

Descriptive Statistics
Outline:
1. Variables and types of variables.
2. Data representation.
3. Measures of location.
4. Measures of spread.
5. Coefficient of variation.
6. Grouped/Recoded data.
7. Graphic Methods.
8. Learning Outcomes Covered in Lecture 1

Dr. Jaffa Lecture1 Descriptive Statistics

Variables and Types of Variables


Types of random variables:
1. Quantitative: data are numeric.
2. Qualitative: data are nonnumeric and organized into
categories.


Variables and Types of Variables


Quantitative data can be:
• Discrete: numeric data that take a finite (or countable) number of possible values (e.g. number of pregnancies).
• Continuous: numeric data that can take any value within a range, so the number of possible values is infinite (e.g. age, weight, height, systolic blood pressure).
Qualitative data are mainly:
• Nominal: data organized into categories that cannot be ordered (e.g. gender: male, female; race: African, Caucasian, Asian, Hispanic).
• Ordinal: data organized into categories that can be ordered (e.g. status of disease: normal, mild, moderate, severe; grading scale: excellent, very good, good, pass, fail).

Data Representation

• Consider a sample of data x1,…,xn drawn from some population p, where:
x1 corresponds to the first sample point
xn corresponds to the nth sample point.

• Before drawing conclusions about the population p from this sample, the data need to be described in some concise and informative way.

• Data can be represented in numerical or graphical form.


Data Representation

• Numerical representation of data entails:
1) Measures of location
2) Measures of spread

• Graphic methods for displaying data encompass:
1) Bar graphs
2) Histograms
3) Stem-and-leaf plots
4) Box plots


Measures of Location
• A measure of location summarizes data by defining the center or middle of the sample.

• Measures of location are also referred to as “Measures of Central Tendency”.

• There are different types of measures of location, but the two most commonly used are:
1) Arithmetic mean
2) Median


Measures of Location:
Arithmetic Mean

• The arithmetic mean is usually referred to as the average of the sample and is denoted by $\bar{x}$.

• The arithmetic mean is the sum of all observations divided by the number of observations:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$


Measures of Location:
Arithmetic Mean

• Limitation:
- The arithmetic mean is very sensitive to extreme values.
- When extreme values are present, it may not be representative of the majority of the sample points.
- The arithmetic mean is shifted towards extreme values.


Measures of Location:
Median

• Consider a sample of n data points x1,…,xn drawn from some population p, and order the observations from smallest to largest, so that:
x(1) corresponds to the smallest sample point
x(n) corresponds to the largest sample point.
• The sample median is the center point of the sample: 50% of the observations are above it and 50% are below it.


Measures of Location:
Median
• The sample median is:
(1) The $\left(\frac{n+1}{2}\right)$th largest observation if n is odd
(2) The average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations if n is even.

• Strength: the sample median is less sensitive to extreme values than the mean.
• Limitation: the median is less sensitive to the actual values of the data points, since it is determined mainly by the value of the middle point(s).


Measures of Location
• Example 1:
Assume the weights in pounds for 5 people were collected randomly and ordered from lowest to highest as follows:
x(1) = 130; x(2) = 138; x(3) = 138; x(4) = 140; x(5) = 220.

- Arithmetic mean:
$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \frac{130 + 138 + 138 + 140 + 220}{5} = 153.2$
- Median: since n = 5 (odd), the median is the value of the third observation, x(3) = 138. The median of this sample is 138.

130; 138; 138; 140; 220.

Two observations come after the middle value x(3) = 138 and two come before it.

Measures of Location

• Example 2:
Assume the weights in pounds for 6 people were collected randomly and ordered from lowest to highest as follows:
x(1) = 120; x(2) = 124; x(3) = 126; x(4) = 128; x(5) = 398; x(6) = 399

- Arithmetic mean:
$\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{n} = \frac{120 + 124 + 126 + 128 + 398 + 399}{6} = 215.8$

The mean of 215.8 is shifted towards the extreme values (398 and 399 in this example).


Measures of Location
• Example 2: (continued)

- Median: since n = 6 (even), the median is the average of the third and fourth observations.
120; 124; 126; 128; 398; 399

Two observations lie above 126 and 128, and two lie below them.

- Median = (126 + 128)/2 = 127, which is not sensitive to the extreme values 398 and 399.
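The two weight examples can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module (not SPSS, which the course uses); the variable names are made up for illustration.

```python
from statistics import mean, median

# Weights (pounds) from Example 1 and Example 2.
example_1 = [130, 138, 138, 140, 220]
example_2 = [120, 124, 126, 128, 398, 399]

for label, weights in (("Example 1", example_1), ("Example 2", example_2)):
    # The mean is pulled towards the extreme weights; the median is not.
    print(f"{label}: mean = {mean(weights):.1f}, median = {median(weights)}")
```

In both samples the mean sits well above the median, which is the effect of the one or two very large weights.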


Measures of Spread

• Measures of spread, also referred to as “Measures of Dispersion”, are used to describe the variability in a sample.

• Many samples can be well described by the combination of a measure of location and a measure of spread.

• Several different measures of spread can be used to describe the variability:
- Range
- Percentiles (or quantiles)
- Variance and standard deviation.


Measures of Spread:
Range
• The range is the difference between the largest and
smallest observations: Range = x(n)- x(1)

• Limitation: range is very sensitive to extreme values.

• Example:
Assume you have the following observations ordered
from lowest to highest:
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5)= 185.
Range = x(5)- x(1) = 185 – 5 = 180
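As a quick illustration of how strongly the range depends on a single extreme value, here is a small Python sketch. The helper name `sample_range` is hypothetical; the data are the five observations above.

```python
def sample_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

observations = [5, 10, 28, 64, 185]

# Range of the full sample, then the range once the largest value is dropped.
full_range = sample_range(observations)
range_without_largest = sample_range(observations[:-1])
print(full_range, range_without_largest)
```

Dropping the single value 185 changes the range from 180 to 59, which is why the range is described as very sensitive to extreme values.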


Measures of Spread:
Percentiles (or quantiles )

• The pth percentile is the value Vp such that p percent of the ordered sample points are less than or equal to Vp.

• Note: the sample should be ordered from lowest to highest.

• Percentiles are less sensitive to extreme values than the range.


Measures of Spread:
Percentiles (or quantiles )

• Computing percentiles by hand is not required, but I will show you the formulas for computing percentiles and their application just for your information.

• You are NOT responsible for computing percentiles by hand; you just need to know what a percentile means, and you should learn how to generate percentiles in SPSS.


Measures of Spread:
Percentiles (or quantiles )

Formula for computing percentiles (you are NOT responsible for it; it is just for your general information):

• The pth percentile is defined by:

(1) The (k+1)th largest sample point if np/100 is not an integer (where k is the integer part of np/100). Recall that an integer is a whole number, with no decimal part.

(2) The average of the (np/100)th and (np/100 + 1)th largest observations if np/100 is an integer.

Here “kth largest” refers to position k in the sample ordered from lowest to highest, i.e. x(k).

Measures of spread: Percentiles (or quantiles)


Example: Consider the following observations ordered from lowest to highest:
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5) = 185.

Compute the 60th percentile:

• n = 5, p = 60. Since np/100 = 5*60/100 = 3 is an integer, use rule (2), which states:
pth percentile = average of the (np/100)th and (np/100 + 1)th largest observations if np/100 is an integer.
• The 60th percentile is the average of the third and fourth largest observations.
• 60th percentile = (x(3) + x(4)) / 2 = (28 + 64)/2 = 46


Measures of Spread: Percentiles (or quantiles)


Example: (continued)
x(1) = 5; x(2) = 10; x(3) = 28; x(4) = 64; x(5) = 185.

Compute the 75th percentile:

• n = 5, p = 75. Since np/100 = 5*75/100 = 3.75 is not an integer (with k = 3), use rule (1), which states:
pth percentile = (k+1)th largest sample point if np/100 is not an integer (where k is the integer part of np/100).
• The 75th percentile is the 4th largest observation.
• 75th percentile = x(4) = 64
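For those who like to see the rule as code, here is a minimal Python sketch of the two-case percentile rule used in these examples. The function name `percentile` and the restriction to whole-number p are assumptions made for this illustration; statistical software (including SPSS) may use a slightly different rule and so give slightly different values.

```python
def percentile(sample, p):
    """pth percentile by the two-case rule; p is an integer with 0 < p < 100."""
    if not (isinstance(p, int) and 0 < p < 100):
        raise ValueError("p must be an integer strictly between 0 and 100")
    if not sample:
        raise ValueError("sample must not be empty")
    xs = sorted(sample)                 # order from lowest to highest
    n = len(xs)
    k, remainder = divmod(n * p, 100)   # k is the integer part of np/100
    if remainder != 0:
        # np/100 is not an integer: take the (k+1)th ordered point,
        # which sits at 0-based index k.
        return xs[k]
    # np/100 is an integer k: average the kth and (k+1)th ordered points.
    return (xs[k - 1] + xs[k]) / 2

observations = [5, 10, 28, 64, 185]
p60 = percentile(observations, 60)
p75 = percentile(observations, 75)
print(p60, p75)
```

Working with `divmod(n * p, 100)` keeps everything in whole numbers, so the test for "np/100 is an integer" does not depend on floating-point rounding.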


Measures of Spread:
Percentiles (or quantiles )

• The upper and lower quartiles are, respectively, the 75th and 25th percentiles of the sample. These are commonly used in the literature.

• Interquartile range = upper quartile - lower quartile

• The median can be thought of as the 50th percentile, since 50% of the data fall below it.


Measures of Spread:
Variance and Standard Deviation

• The variance is a measure that summarizes the deviations between the individual sample points and the arithmetic mean (with the center of the sample being defined as the arithmetic mean).

• The sample variance is defined as follows:

$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$


Measures of Spread:
Variance and Standard Deviation
• Note that the sum of all deviations from the mean is zero; i.e. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• The sample standard deviation is the square root of the sample variance:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

• The mean and standard deviation are the most widely used measures of location and spread in the literature.
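For your information, the fact that the deviations from the mean sum to zero follows directly from the definition of the arithmetic mean (sum of observations = n times the mean):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

This is why the variance uses squared deviations: the raw deviations always cancel out, so they cannot measure spread on their own.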


Measures of Spread:
Variance and Standard Deviation
Example:
Compute the sample variance and standard deviation for the
following observations ordered from lowest to highest:
x(1) = 9; x(2) = 10; x(3) = 12; x(4) = 14; x(5)= 16.


Measures of Spread:
Variance and Standard Deviation
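Example: (continued)
- Sample mean: $\bar{x} = \frac{9 + 10 + 12 + 14 + 16}{5} = 12.2$
- Sample variance: $s^2 = \frac{(9 - 12.2)^2 + (10 - 12.2)^2 + (12 - 12.2)^2 + (14 - 12.2)^2 + (16 - 12.2)^2}{5 - 1} = \frac{32.8}{4} = 8.2$

- Sample standard deviation: $s = \sqrt{s^2} = \sqrt{8.2} \approx 2.9$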

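The variance example (observations 9, 10, 12, 14, 16) can be reproduced with a short Python sketch. The function `sample_variance` is a hypothetical helper written from the definition with n - 1 in the denominator; the standard library functions `statistics.variance` and `statistics.stdev` are used only as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

observations = [9, 10, 12, 14, 16]
s2 = sample_variance(observations)
s = sqrt(s2)
print(f"variance = {s2:.1f}, standard deviation = {s:.1f}")

# Cross-check against the standard library (same n - 1 definition).
print(f"library: {variance(observations):.1f}, {stdev(observations):.1f}")
```

Both approaches should agree with the hand computation: a variance of 8.2 and a standard deviation of about 2.9.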

SPSS Output for Measures of Location and Spread


Selected data summary from a published article by Jaffa et al.


Selected data summary chosen from a published article by Jaffa et al.

As you can see, the mean and standard deviation are both used to summarize the data numerically.


Graphic Methods
• Graphic methods are used to display data in a graphical format.

• The purpose of using graphs is to give a quick overall impression of the data, which is difficult to obtain with numeric measures.

• What impressions can we get from a graphical representation? You can visually assess the spread of the data, the distribution (symmetrical or skewed), and the presence of outliers.

• There are various types of graphical display; the most commonly used methods for continuous data are:
- Histograms
- Box plots

Graphic Methods:
Histograms
SPSS generated histogram corresponding to hours slept example


Graphic Methods:
Box-and-whisker Plot
• A box-and-whisker plot (simply referred to as a box plot) presents the median, upper quartile, and lower quartile of the sample.

• Recall: the upper and lower quartiles are approximately the 75th and 25th percentiles of the sample.

• A box plot can also be used to help detect extreme values and outliers, and to describe the skewness of a distribution.


Graphic Methods:
Box Plot for hours slept generated by SPSS

Labels on the plot: upper quartile, median, lower quartile.


Graphic Methods:
Box Plots
• If the distribution is symmetric, then the upper and lower whiskers have approximately equal length.

Labels on the plot: upper whisker, lower whisker.


Graphic Methods:
Box Plots
• A distribution is positively skewed, or skewed to the right, if the upper whisker is longer than the lower whisker.

Labels on the plot: upper whisker, lower whisker.


Graphic Methods:
Box Plots
• A distribution is negatively skewed, or skewed to the left, if the lower whisker is longer than the upper whisker.

Labels on the plot: upper whisker, lower whisker.


Other Assessments for Skewness


Other Assessments for skewness

Right/positive skewness


Other Assessments for skewness

Left/negative skewness


Other Assessments for skewness

Symmetrical: no skewness


Graphic Methods:
Box Plots
• A box plot can also be used to help detect outliers and extreme values.

• In the next two slides I will show you, for your general information, the formulas used to detect outliers and extreme values, with an application of these formulas.

• But you are NOT responsible for detecting outliers and extreme values by hand using these formulas.

• You are responsible only for knowing how to identify outliers and extreme values using SPSS, as I showed in Box Plots.


Graphic Methods:
Box Plots
• A box plot can also be used to help detect outliers and extreme values (you are NOT responsible for the formulas below; they are just for your information).

• An observation x is an outlier if:
(1) x > upper quartile + 1.5 * (upper quartile - lower quartile)
or
(2) x < lower quartile - 1.5 * (upper quartile - lower quartile)

• An observation x is an extreme outlying value if:
(1) x > upper quartile + 3.0 * (upper quartile - lower quartile)
or
(2) x < lower quartile - 3.0 * (upper quartile - lower quartile)

Graphic Methods:
Box Plots
• Assess whether there are outliers or extreme outliers in the following sample: 16, 10, 49, 15, 6, 15, 8, 19, 11, 22, 13, 17

• Ordered sample: 6, 8, 10, 11, 13, 15, 15, 16, 17, 19, 22, 49

• Upper quartile = 75th percentile = 18

• Lower quartile = 25th percentile = 10.5

- Threshold for an extreme outlier in the positive direction: upper quartile + 3.0 * (upper quartile - lower quartile) = 18 + 3.0*(18 - 10.5) = 40.5
- Since 49 is greater than 40.5, it is an extreme outlier in the positive direction.
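The outlier check for the sample 16, 10, 49, 15, 6, 15, 8, 19, 11, 22, 13, 17 can be sketched in Python as follows. The helper names `percentile` and `flag_outliers` are hypothetical, and the quartiles are computed with the two-case percentile rule from the percentile slides; other software may compute quartiles a little differently, so its thresholds can differ slightly.

```python
def percentile(sample, p):
    """pth percentile by the two-case rule (integer p, 0 < p < 100)."""
    xs = sorted(sample)
    n = len(xs)
    k, remainder = divmod(n * p, 100)
    if remainder != 0:
        return xs[k]                    # (k+1)th ordered point
    return (xs[k - 1] + xs[k]) / 2      # average of kth and (k+1)th points

def flag_outliers(sample):
    """Return (lower quartile, upper quartile, list of flagged values)."""
    q1, q3 = percentile(sample, 25), percentile(sample, 75)
    iqr = q3 - q1                       # upper quartile - lower quartile
    flagged = []
    for x in sample:
        if x > q3 + 3.0 * iqr or x < q1 - 3.0 * iqr:
            flagged.append((x, "extreme outlier"))
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            flagged.append((x, "outlier"))
    return q1, q3, flagged

sample = [16, 10, 49, 15, 6, 15, 8, 19, 11, 22, 13, 17]
q1, q3, flagged = flag_outliers(sample)
print(q1, q3, flagged)
```

The extreme-outlier test is checked first because any value beyond the 3.0 threshold is also beyond the 1.5 threshold.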


Graphic Methods:
Box Plots
The SPSS-generated box plot for these data is as follows:

Annotation on the plot: the labelled point is the 3rd observation of the data as entered into SPSS. This 3rd observation has the value 49 and is considered an extreme outlier.


Graphic Methods:
An example of a box plot generated by SPSS

Labels on the plot: one extreme value and two outliers.


Pie Chart:
SPSS Example of a Pie Chart


General Comments

• When data have a finite number of possible values (i.e. discrete or categorical data such as gender: male/female; marital status: single, married, divorced), pie charts and bar graphs for the different values (e.g. males/females) are appropriate methods for presenting the data.

• However, when the data to be presented have infinitely many possible values, a pie chart becomes cumbersome, and a better presentation would be a histogram or a box plot.

• What can we do with outliers/extreme values? If the value came from a data-entry error, either correct it if possible or delete it. You can conduct the analysis with and without the value to check its effect. Keep it if it is the correct value.

General Comments

• When the distribution of the data is skewed, most of the time you will notice a discrepancy between the value of the mean and that of the median.

• In this situation, report the median as the measure of central tendency rather than the mean, or report both.


Data Grouping and Recoding


• Data can be grouped to form larger groups or to change your data from a continuous scale to categorical data (groups).

• Example: suppose you have data on income per year for 10 individuals, in dollars: 24000, 32500, 35000, 37000, 40000, 45000, 47000, 48000, 49000, 52000.

• You can create groups as follows: group 1: 20000 to 29000; group 2: 29001 to 30000; group 3: 30001 to 39000; group 4: 39001 to 40000; group 5: 40001 to 50000; group 6: more than 50000.

• You can also reduce these to three groups: group 1: 20000 to 40000; group 2: 40001 to 50000; group 3: more than 50000.
• This grouping can be achieved in SPSS using what we refer to as “recoding”.
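To make the idea of recoding concrete outside SPSS, here is a small Python sketch that maps the ten incomes into the three reduced groups (20000 to 40000; 40001 to 50000; more than 50000). The function name `income_group` is hypothetical, and treating incomes below 20000 as an error is an assumption, since the slide does not define a group for them.

```python
from collections import Counter

def income_group(income):
    """Recode a yearly income (dollars) into group 1, 2 or 3."""
    if income < 20000:
        # Assumption: no group is defined below 20000, so flag it.
        raise ValueError("income below the lowest group boundary")
    if income <= 40000:
        return 1        # 20000 to 40000
    if income <= 50000:
        return 2        # 40001 to 50000
    return 3            # more than 50000

incomes = [24000, 32500, 35000, 37000, 40000,
           45000, 47000, 48000, 49000, 52000]
groups = [income_group(x) for x in incomes]
counts = Counter(groups)    # number of individuals per group
print(groups)
print(dict(counts))
```

After recoding, only the group number is kept for each person, so the exact income can no longer be recovered from the recoded variable.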

Data Grouping and Recoding

• Note that you can change your data from a continuous to a categorical scale: for example, you can change age in years to age groups.

• But you cannot change a categorical variable into a continuous variable. For example, suppose the data given to you contain only age groups; that is, for every individual you know only the age group that person is in, not the exact age. If John's age is between 20 and 25, you cannot tell his exact age.


Course Learning Outcomes Covered in Lecture 1


• Apply the appropriate descriptive techniques commonly used
to summarize public health data.
• Apply ethical principles to data management and analysis.
(Note that honoring ethical conduct is the essence of all our
analysis and data reporting).
