Summarizing Data

Descriptive Statistics
InterQuartile Range (IQR)

When a data set has outliers or extreme values, we summarize a typical value using
the median rather than the mean, and we often summarize variability using a statistic
called the interquartile range, which is the difference between the third and first
quartiles. The first quartile, denoted Q1, is the value in the data set that holds 25%
of the values below it. The third quartile, denoted Q3, is the value in the data set that holds 25%
of the values above it. The quartiles can be determined following the same approach that we
used to determine the median, but we now consider each half of the data set separately. The
interquartile range is defined as follows:
Interquartile Range = Q3 - Q1
With an Even Sample Size:

For the sample (n=10), the median diastolic blood pressure is 71 (50% of the values
are above 71, and 50% are below). The quartiles can be determined in the same way
we determined the median, except that we consider each half of the data set separately.
Figure 9 - Interquartile Range with Even Sample Size
There are 5 values below the median (the lower half); the middle value is 64, which is
the first quartile. There are 5 values above the median (the upper half); the middle
value is 77, which is the third quartile. The interquartile range is 77 - 64 = 13; it is
the range of the middle 50% of the data.
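The split-halves procedure described above can be sketched in Python. The ten readings below are an assumption: the text does not list the individual values (they appear only in Figure 9), so illustrative values consistent with the quoted minimum (62), maximum (81), median (71), and quartiles (64 and 77) are used. The helper name `quartiles_by_halves` is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) by taking the median of each half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (skips the middle value when n is odd)
    return median(lower), median(ordered), median(upper)


# Illustrative diastolic blood pressures (assumed, not taken from the text)
dbp = [62, 63, 64, 67, 70, 72, 76, 77, 81, 81]

q1, med, q3 = quartiles_by_halves(dbp)
iqr = q3 - q1
print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}")
```

Statistical packages implement several different quartile conventions, so software output may differ slightly from this hand method.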
With an Odd Sample Size:

When the sample size is odd, the median and quartiles are determined in the same way.
Suppose that, in the previous example, the lowest value (62) were excluded, so that the
sample size was n=9. The median and quartiles are indicated below.
Figure 10 - Interquartile Range with Odd Sample Size
When the sample size is 9, the median is the middle number, 72. The quartiles are
determined in the same way, looking at the lower and upper halves, respectively. There
are 4 values in the lower half; the first quartile is the mean of the 2 middle values in
the lower half ((64+64)/2=64). The same approach is used in the upper half to determine
the third quartile ((77+81)/2=79).
Outliers and Tukey Fences:

When there are no outliers in a sample, the mean and standard deviation are used to
summarize a typical value and the variability in the sample, respectively. When there are
outliers in a sample, the median and interquartile range are used to summarize a typical value
and the variability in the sample, respectively.
Tukey Fences
There are several methods for determining outliers in a sample. A very popular method is based on the following:
Outliers are values below Q1 - 1.5(Q3 - Q1) or above Q3 + 1.5(Q3 - Q1) or, equivalently, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.
These are referred to as Tukey fences [6]. For the diastolic blood pressures, the lower limit is 64 - 1.5(77 - 64) = 44.5 and
the upper limit is 77 + 1.5(77 - 64) = 96.5. The diastolic blood pressures range from 62 to 81. Therefore, there are no outliers.
The best summary of a typical diastolic blood pressure is the mean (in this case 71.3), and the best summary of variability is
given by the standard deviation (s=7.2).
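As a rough illustration, the fence calculation for the diastolic blood pressures can be written as a short Python sketch. The function names `tukey_fences` and `has_outliers` are hypothetical; the quartiles (64 and 77) and the observed range (62 to 81) are the values quoted in the text.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(minimum, maximum, lower, upper):
    """True if the observed range extends beyond either fence."""
    return minimum < lower or maximum > upper


lower, upper = tukey_fences(64, 77)
print(lower, upper)

# Observed diastolic range from the text: 62 to 81
outliers_present = has_outliers(62, 81, lower, upper)
print("Outliers present:", outliers_present)
```

Checking only the minimum and maximum is enough to decide whether any outlier exists, because a value beyond a fence implies that the extreme on that side is beyond it too.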
Table 13 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables in the subsample of n=10 participants who attended the
seventh examination of the Framingham Offspring Study.
Table 13 - Summary Statistics on n=10 Participants
Characteristic              Mean      Standard Deviation   Median
Systolic Blood Pressure     121.2     11.1                 122.5
Diastolic Blood Pressure    71.3      7.2                  71.0
Total Serum Cholesterol     202.3     37.7                 206.5
Weight                      176.0     33.0                 169.5
Height                      67.175    4.205                69.375
Body Mass Index             27.26     3.10                 26.60
Table 14 displays the observed minimum and maximum values along with the limits to
determine outliers using the quartile rule for each of the variables in the subsample of n=10
participants. Are there outliers in any of the variables? Which statistics are most appropriate to
summarize the average or typical value and the dispersion?
Table 14 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants
Characteristic              Minimum   Maximum   Lower Limit (1)   Upper Limit (2)
Systolic Blood Pressure     105       141       92                148
Diastolic Blood Pressure    62        81        44.5              96.5
Total Serum Cholesterol     150       275       67                323
Weight                      138       235       68.5              288.5
Height                      60.75     72.00     52.5              80.5
Body Mass Index             22.8      31.9      17.85             36.65
(1) Determined by Q1-1.5(Q3-Q1)
(2) Determined by Q3+1.5(Q3-Q1)
Since there are no suspected outliers in the subsample of n=10 participants, the mean and
standard deviation are the most appropriate statistics to summarize average values and
dispersion, respectively, of each of these characteristics.
The Full Framingham Cohort

For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to
illustrate calculations of summary statistics and determination of outliers. For your interest,
Table 15 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables displayed in Table 13 in the full sample (n=3,539) of
participants who attended the seventh examination of the Framingham Offspring Study.
Table 15 - Summary Statistics on Sample of (n=3,539) Participants
Characteristic              Mean      Standard Deviation (s)   Median    Q
Systolic Blood Pressure     127.3     19.0                     125.0     11
Diastolic Blood Pressure    74.0      9.9                      74.0      67
Total Serum Cholesterol     200.3     36.8                     198.0     17
Weight                      174.4     38.7                     170.0     14
Height                      65.957    3.749                    65.750    63.
Body Mass Index             28.15     5.32                     27.40     24
True or false: based solely on a comparison of the means and medians in Table 15 above, there is
evidence that one or more characteristics had values that were outliers.
Table 16 displays the observed minimum and maximum values along with the limits to
determine outliers using the quartile rule for each of the variables in the full sample (n=3,539).
Table 16 - Limits for Assessing Outliers in Characteristics Presented in Table 15
                                                Tukey Fences
Characteristic              Minimum   Maximum   Lower Limit (1)   Upper Limit (2)
Systolic Blood Pressure     81.0      216.0     78                174
Diastolic Blood Pressure    41.0      114.0     47.5              99.5
Total Serum Cholesterol     83.0      357.0     103               295
Weight                      90.0      375.0     68.0              276.0
Height                      55.00     78.75     54.4              77.4
Body Mass Index             15.8      64.0      15.05             40.25
(1) Determined by Q1-1.5(Q3-Q1)
(2) Determined by Q3+1.5(Q3-Q1)
Are there outliers in any of the variables? Which statistics are most appropriate to summarize
the average or typical values and the dispersion of each variable?
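One way to work through this question is to compare each observed minimum and maximum in Table 16 with the corresponding Tukey fence. The sketch below copies the values from Table 16; the names `table16` and `flag_outliers` are hypothetical.

```python
# (minimum, maximum, lower limit, upper limit), copied from Table 16
table16 = {
    "Systolic Blood Pressure": (81.0, 216.0, 78, 174),
    "Diastolic Blood Pressure": (41.0, 114.0, 47.5, 99.5),
    "Total Serum Cholesterol": (83.0, 357.0, 103, 295),
    "Weight": (90.0, 375.0, 68.0, 276.0),
    "Height": (55.00, 78.75, 54.4, 77.4),
    "Body Mass Index": (15.8, 64.0, 15.05, 40.25),
}


def flag_outliers(minimum, maximum, lower, upper):
    """Report whether the extremes fall outside the lower and upper fences."""
    return {"low": minimum < lower, "high": maximum > upper}


flags = {name: flag_outliers(*row) for name, row in table16.items()}

for name, result in flags.items():
    print(f"{name}: low outliers={result['low']}, high outliers={result['high']}")
```

Under these limits, every characteristic has at least one value beyond a fence, so by the rule stated earlier in the text the median and interquartile range would be the more appropriate summaries. The minimum and maximum show only that outliers exist on a given side, not how many there are.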
Content ©2016. All Rights Reserved.

Date last modified: May 17, 2016.
Created by Lisa Sullivan, PhD and Wayne W. LaMorte, MD, PhD, MPH,
Boston University School of Public Health
Interpret the key results for Descriptive Statistics
Complete the following steps to interpret descriptive statistics. Key output includes N, the
mean, the median, the standard deviation, and several graphs.
Step 1: Describe the size of your sample

Use N to know how many observations are in your sample. Minitab does not include
missing values in this count.
You should collect a medium to large sample of data. Samples that have at least 20
observations are often adequate to represent the distribution of your data. However, to
better represent the distribution with a histogram, some practitioners recommend that you
have at least 50 observations. Larger samples also provide more precise estimates of the
process parameters, such as the mean and standard deviation.
Statistics
Variable    N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque     68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000
Key Result: N
In these results, you have 68 observations.
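The link between sample size and precision can be illustrated with the standard error of the mean, which is the sample standard deviation divided by the square root of N. The sketch below uses N = 68 and StDev = 6.422 from the output above; the larger sample size is a hypothetical value added only for comparison.

```python
import math


def standard_error(stdev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return stdev / math.sqrt(n)


se_68 = standard_error(6.422, 68)       # sample size from the Minitab output
se_272 = standard_error(6.422, 4 * 68)  # hypothetical: four times as many observations

print(f"SE with N=68:  {se_68:.4f}")
print(f"SE with N=272: {se_272:.4f}")
```

With the same standard deviation, quadrupling the sample size halves the standard error, which is one sense in which larger samples give more precise estimates of the mean.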
Step 2: Describe the center of your data

Use the mean to describe the sample with a single value that represents the center of the
data. Many statistical analyses use the mean as a standard measure of the center of the
distribution of the data.
The median and the mean both measure central tendency. But unusual values, called
outliers, affect the median less than they affect the mean. When you have unusual values,
you can compare the mean and the median to decide which is the better measure to use. If
your data are symmetric, the mean and median are similar.
Statistics
Variable    N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque     68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000
Key Results: Mean and Median

In these results, the mean torque that is required to remove a toothpaste cap is 21.265, and the median
torque is 20. The data appear to be skewed to the right, which explains why the mean is greater than the
median.
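To see why the median is less sensitive to unusual values than the mean, consider a small made-up data set. The numbers below are hypothetical and are not the torque data.

```python
from statistics import mean, median

# Hypothetical symmetric sample
base = [14, 16, 18, 20, 22, 24, 26]

# The same sample with one unusually large value added
with_outlier = base + [90]

print("Without outlier: mean =", mean(base), "median =", median(base))
print("With outlier:    mean =", mean(with_outlier), "median =", median(with_outlier))
```

In this sketch a single extreme value shifts the mean from 20 to 28.75, while the median moves only from 20 to 21. A mean noticeably larger than the median is therefore a hint of right skew or high outliers.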
Step 3: Describe the spread of your data
Use the standard deviation to determine how spread out the data are from the mean.
A higher standard deviation value indicates greater spread in the data.
Statistics
Variable    N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque     68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000
Key Result: StDev

In these results, the standard deviation is 6.422. With normal data, nearly all of the observations (about 99.7%)
fall within 3 standard deviations on each side of the mean.
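As a quick check of the 3-standard-deviation guideline, the interval can be computed from the reported mean (21.265) and standard deviation (6.422) and compared with the observed minimum (10) and maximum (37). The coverage figure assumes a normal distribution, which the text notes may not hold for these right-skewed data.

```python
from statistics import NormalDist

mean_torque = 21.265
stdev_torque = 6.422

# Interval covering 3 standard deviations on each side of the mean
low = mean_torque - 3 * stdev_torque
high = mean_torque + 3 * stdev_torque

# Share of a normal distribution that lies within 3 standard deviations
coverage = NormalDist().cdf(3) - NormalDist().cdf(-3)

# Observed extremes from the Minitab output
within = low <= 10.0 and 37.0 <= high

print(f"Interval: {low:.3f} to {high:.3f}")
print(f"Normal coverage: {coverage:.4f}")
print("Observed range inside interval:", within)
```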
Step 4: Assess the shape and spread of your data distribution
Use the histogram, the individual value plot, and the boxplot to assess the shape and spread
of the data, and to identify any potential outliers.
Examine the shape of your data to determine whether your data appear to be skewed
When data are skewed, the majority of the data are located on the high or low side of the
graph. Often, skewness is easiest to detect with a histogram or boxplot.
Figures: histogram of right-skewed data; histogram of left-skewed data
The histogram with right-skewed data shows wait times. Most of the wait times are relatively short, and
only a few wait times are long. The histogram with left-skewed data shows failure time data. A few items
fail immediately, and many more items fail later.
Determine how much your data varies

Assess the spread of the points to determine how much your sample varies. The greater the
variation in the sample, the more the points will be spread out from the center of the data.
This individual value plot shows that the data on the right have more variation than the data on the left.
Look for multi-modal data

Multi-modal data have multiple peaks, also called modes. Multi-modal data often indicate
that important variables are not yet accounted for.
If you have additional information that allows you to classify the observations into groups,
you can create a group variable with this information. Then, you can create the graph with
groups to determine whether the group variable accounts for the peaks in the data.
Figures: simple histogram; histogram with groups
For example, a manager at a bank collects wait time data and creates a simple histogram. The histogram
appears to have two peaks. After further investigation, the manager determines that the wait time for
customers who are cashing checks is shorter than the wait time for customers who are applying for home
equity loans. The manager adds a group variable for customer task, and then creates a histogram with
groups.
Look for outliers

Outliers, which are data values that are far away from other data values, can strongly affect
the results of your analysis. Often, outliers are easiest to identify on a boxplot.
On a boxplot, asterisks (*) denote outliers.
Step 5: Compare data from different groups

If you have a Group variable, you can use it to analyze your data by group or by group
level.
Statistics
Variable  Machine   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    1        36   0  18.6667   0.7325  4.3948  10.0000  15.2500  17.0000  21.7500  30.0000
          2        32   0   24.188    1.258   7.119   14.000   17.500   24.000   31.000   37.000
In these results, the summary statistics are calculated separately by machine. You can easily see the
differences in the center and spread of the data for each machine. For example, Machine 1 has a lower
mean torque and less variation than Machine 2. To determine whether the difference in means is
significant, you can perform a 2-sample t-test.
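For a sense of what the 2-sample t-test computes, the test statistic can be sketched directly from the summary statistics above. This example uses the unequal-variance (Welch) form, which is an assumption on our part since the text does not say which form is intended. It relies only on the Python standard library, so it stops at the t statistic and degrees of freedom rather than a p-value.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    v1 = sd1 ** 2 / n1   # squared standard error of mean 1
    v2 = sd2 ** 2 / n2   # squared standard error of mean 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Summary statistics for Machine 1 and Machine 2 from the output above
t_stat, df = welch_t(18.6667, 4.3948, 36, 24.188, 7.119, 32)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A t statistic this far from zero suggests that the difference in mean torque is unlikely to be due to chance alone, but a formal conclusion requires the p-value from a statistics package.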