Summarizing Data

Descriptive Statistics

InterQuartile Range (IQR)


When a data set has outliers or extreme values, we summarize a typical value using
the median as opposed to the mean. When a data set has outliers, variability is often
summarized by a statistic called the interquartile range, which is the difference between the
first and third quartiles. The first quartile, denoted Q1, is the value in the data set that holds 25%
of the values below it. The third quartile, denoted Q3, is the value in the data set that holds 25%
of the values above it. The quartiles can be determined following the same approach that we
used to determine the median, but we now consider each half of the data set separately. The
interquartile range is defined as follows:

Interquartile Range = Q3-Q1

With an Even Sample Size:


For the sample (n=10) the median diastolic blood pressure
is 71 (50% of the values are above 71, and 50% are
below). The quartiles can be determined in the same way
we determined the median, except we consider each half
of the data set separately.

Figure 9 - Interquartile Range with Even Sample Size

There are 5 values below the median (lower half); the middle value, 64, is the first quartile. There are 5
values above the median (upper half); the middle value, 77, is the third quartile. The interquartile range
is 77 - 64 = 13; it is the range of the middle 50% of the data.
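
The median-of-halves procedure described above can be sketched in Python. The ten values below are an assumption: the text does not list the raw data, so they were chosen only to reproduce the reported median (71), quartiles (64 and 77), and range (62 to 81). The function name `quartiles_by_halves` is likewise illustrative.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) by taking the median of each half of the sorted data.

    For an odd sample size the overall median is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

# Assumed illustrative values (not listed in the text), chosen to match
# the reported summary statistics for diastolic blood pressure.
dbp = [62, 63, 64, 67, 70, 72, 76, 77, 81, 81]

q1, med, q3 = quartiles_by_halves(dbp)
iqr = q3 - q1
print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}")
```

Other quartile conventions exist (statistical packages often interpolate between values), so results from software may differ slightly from this hand method.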


With an Odd Sample Size:


When the sample size is odd, the median and quartiles
are determined in the same way. Suppose in the previous
example, the lowest value (62) were excluded, and the
sample size was n=9. The median and quartiles are
indicated below.

Figure 10 - Interquartile Range with Odd Sample Size

When the sample size is 9, the median is the middle number, 72. The quartiles are determined in the same
way, looking at the lower and upper halves, respectively. There are 4 values in the lower half; the first
quartile is the mean of the 2 middle values in the lower half ((64+64)/2=64). The same approach is used in
the upper half to determine the third quartile ((77+81)/2=79).

Outliers and Tukey Fences:


When there are no outliers in a sample, the mean and standard deviation are used to
summarize a typical value and the variability in the sample, respectively. When there are
outliers in a sample, the median and interquartile range are used to summarize a typical value
and the variability in the sample, respectively.

Tukey Fences

There are several methods for determining outliers in a sample. A very popular method is based on the following:

Outliers are values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1) or, equivalently, values below Q1-1.5 IQR or above Q3+1.5 IQR.

These are referred to as Tukey fences [6]. For the diastolic blood pressures, the lower limit is 64 - 1.5(77-64) = 44.5 and
the upper limit is 77 + 1.5(77-64) = 96.5. The diastolic blood pressures range from 62 to 81. Therefore, there are no
outliers. The best summary of a typical diastolic blood pressure is the mean (in this case 71.3), and the best summary of
variability is given by the standard deviation (s=7.2).

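
As a minimal sketch, the fence arithmetic for the diastolic blood pressures can be checked in Python. The quartiles, minimum, and maximum are the values quoted in the text; the function name `tukey_fences` and the parameter `k` are illustrative choices.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and observed range for diastolic blood pressure, as quoted in the text.
lower, upper = tukey_fences(64, 77)
observed_min, observed_max = 62, 81

# If the extremes fall inside the fences, no value can be an outlier.
has_outliers = observed_min < lower or observed_max > upper
print(lower, upper, has_outliers)
```

Comparing only the minimum and maximum to the fences is enough to decide whether any outliers exist, although it does not say how many there are.
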
Table 13 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables in the subsample of n=10 participants who attended the
seventh examination of the Framingham Offspring Study.

Table 13 - Summary Statistics on n=10 Participants

Characteristic              Mean      Standard Deviation   Median
Systolic Blood Pressure     121.2     11.1                 122.5
Diastolic Blood Pressure    71.3      7.2                  71.0
Total Serum Cholesterol     202.3     37.7                 206.5
Weight                      176.0     33.0                 169.5
Height                      67.175    4.205                69.375
Body Mass Index             27.26     3.10                 26.60

Table 14 displays the observed minimum and maximum values along with the limits to
determine outliers using the quartile rule for each of the variables in the subsample of n=10
participants. Are there outliers in any of the variables? Which statistics are most appropriate to
summarize the average or typical value and the dispersion?

Table 14 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants

Characteristic              Minimum   Maximum   Lower Limit (1)   Upper Limit (2)
Systolic Blood Pressure     105       141       92                148
Diastolic Blood Pressure    62        81        44.5              96.5
Total Serum Cholesterol     150       275       67                323
Weight                      138       235       68.5              288.5
Height                      60.75     72.00     52.5              80.5
Body Mass Index             22.8      31.9      17.85             36.65

(1) Determined by Q1-1.5(Q3-Q1)
(2) Determined by Q3+1.5(Q3-Q1)

Since there are no suspected outliers in the subsample of n=10 participants, the mean and
standard deviation are the most appropriate statistics to summarize average values and
dispersion, respectively, of each of these characteristics.

The Full Framingham Cohort


For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to
illustrate calculations of summary statistics and determination of outliers. For your interest,
Table 15 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variables displayed in Table 13 in the full sample (n=3,539) of
participants who attended the seventh examination of the Framingham Offspring Study.

Table 15 - Summary Statistics on Sample of (n=3,539) Participants

Characteristic              Mean      Standard Deviation (s)   Median    Q
Systolic Blood Pressure     127.3     19.0                     125.0     11
Diastolic Blood Pressure    74.0      9.9                      74.0      67
Total Serum Cholesterol     200.3     36.8                     198.0     17
Weight                      174.4     38.7                     170.0     14
Height                      65.957    3.749                    65.750    63.
Body Mass Index             28.15     5.32                     27.40     24

True or false: based solely on a comparison of the means and medians in Table 15 above, there is evidence
that one or more characteristics have values that are outliers.

Table 16 displays the observed minimum and maximum values along with the limits to
determine outliers using the quartile rule for each of the variables in the full sample (n=3,539).

Table 16 - Limits for Assessing Outliers in Characteristics Presented in Table 15

                                                  Tukey Fences
Characteristic              Minimum   Maximum   Lower Limit (1)   Upper Limit (2)
Systolic Blood Pressure     81.0      216.0     78                174
Diastolic Blood Pressure    41.0      114.0     47.5              99.5
Total Serum Cholesterol     83.0      357.0     103               295
Weight                      90.0      375.0     68.0              276.0
Height                      55.00     78.75     54.4              77.4
Body Mass Index             15.8      64.0      15.05             40.25

(1) Determined by Q1-1.5(Q3-Q1)
(2) Determined by Q3+1.5(Q3-Q1)

Are there outliers in any of the variables? Which statistics are appropriate to summarize the average or
typical values and the dispersion of each variable?
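
One way to work through this question is to compare each variable's observed minimum and maximum with its Tukey fences. The sketch below copies the figures from Table 16; the dictionary layout and the helper name `beyond_fences` are illustrative assumptions.

```python
# (minimum, maximum, lower limit, upper limit) for each variable, copied from Table 16.
table16 = {
    "Systolic Blood Pressure": (81.0, 216.0, 78, 174),
    "Diastolic Blood Pressure": (41.0, 114.0, 47.5, 99.5),
    "Total Serum Cholesterol": (83.0, 357.0, 103, 295),
    "Weight": (90.0, 375.0, 68.0, 276.0),
    "Height": (55.00, 78.75, 54.4, 77.4),
    "Body Mass Index": (15.8, 64.0, 15.05, 40.25),
}

def beyond_fences(minimum, maximum, lower, upper):
    """Report whether the extremes fall outside the lower or upper fence."""
    return {"low": minimum < lower, "high": maximum > upper}

flags = {name: beyond_fences(*row) for name, row in table16.items()}

for name, flag in flags.items():
    sides = [side for side, outside in flag.items() if outside]
    print(f"{name}: {'outliers on ' + ' and '.join(sides) + ' side' if sides else 'no outliers'}")
```

Because only the extremes are checked, this shows whether a variable has at least one outlier, not how many observations lie beyond the fences.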


Content ©2016. All Rights Reserved.


Date last modified: May 17, 2016.
Created by Lisa Sullivan, PhD and Wayne W. LaMorte, MD, PhD, MPH,

Boston University School of Public Health

Interpret the key results for Descriptive Statistics

Complete the following steps to interpret descriptive statistics. Key output includes N, the
mean, the median, the standard deviation, and several graphs.

In This Topic

- Step 1: Describe the size of your sample
- Step 2: Describe the center of your data
- Step 3: Describe the spread of your data
- Step 4: Assess the shape and spread of your data distribution
- Step 5. Compare data from different groups

Step 1: Describe the size of your sample


Use N to know how many observations are in your sample. Minitab does not include
missing values in this count.
You should collect a medium to large sample of data. Samples that have at least 20
observations are often adequate to represent the distribution of your data. However, to
better represent the distribution with a histogram, some practitioners recommend that you
have at least 50 observations. Larger samples also provide more precise estimates of the
process parameters, such as the mean and standard deviation.

Statistics

Variable   N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Result: N
In these results, you have 68 observations.
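
The link between sample size and precision can be made concrete with the standard error of the mean, which is the standard deviation divided by the square root of N. The sketch below uses the StDev and N reported for Torque; the function name `standard_error` is illustrative.

```python
import math

def standard_error(stdev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return stdev / math.sqrt(n)

# StDev and N as reported for Torque in the Statistics table.
se = standard_error(6.422, 68)
print(round(se, 4))

# Holding the standard deviation fixed, four times the sample size halves the standard error.
se_larger_sample = standard_error(6.422, 4 * 68)
print(round(se_larger_sample, 4))
```

The first value agrees with the SE Mean reported in the table, up to rounding.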

Step 2: Describe the center of your data


Use the mean to describe the sample with a single value that represents the center of the
data. Many statistical analyses use the mean as a standard measure of the center of the
distribution of the data.

The median and the mean both measure central tendency. But unusual values, called
outliers, affect the median less than they affect the mean. When you have unusual values,
you can compare the mean and the median to decide which is the better measure to use. If
your data are symmetric, the mean and median are similar.
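
A small illustration of how one unusual value moves the mean more than the median. The numbers are hypothetical and are not taken from the torque data.

```python
from statistics import mean, median

# Hypothetical symmetric sample: mean and median coincide.
values = [18, 19, 20, 21, 22]
print(mean(values), median(values))

# Adding one unusually large value pulls the mean up far more than the median.
with_outlier = values + [60]
print(round(mean(with_outlier), 2), median(with_outlier))
```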

Statistics

Variable   N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Results: Mean and Median


In these results, the mean torque that is required to remove a toothpaste cap is 21.265, and the median
torque is 20. The data appear to be skewed to the right, which explains why the mean is greater than the
median.

Step 3: Describe the spread of your data

Use the standard deviation to determine how spread out the data are from the mean. A higher standard
deviation value indicates greater spread in the data.

Statistics

Variable   N   N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    68    0  21.2647   0.7788  6.4220  10.0000  16.0000  20.0000  24.7500  37.0000

Key Result: StDev


In these results, the standard deviation is 6.422. With normal data, most of the observations are spread
within 3 standard deviations on each side of the mean.
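
To put numbers on this rule of thumb, the sketch below computes the intervals 1, 2, and 3 standard deviations either side of the mean, along with the share of a normal distribution each interval covers. The mean, standard deviation, minimum, and maximum are the reported torque values; treating the data as normal is an assumption made only for illustration.

```python
from statistics import NormalDist

# Reported summary statistics for Torque.
mean_torque, stdev_torque = 21.265, 6.422
observed_min, observed_max = 10.0, 37.0

intervals = {}
coverage = {}
for k in (1, 2, 3):
    intervals[k] = (mean_torque - k * stdev_torque, mean_torque + k * stdev_torque)
    # Share of a normal distribution lying within k standard deviations of its mean.
    coverage[k] = NormalDist().cdf(k) - NormalDist().cdf(-k)
    low, high = intervals[k]
    print(f"within {k} SD: {low:.3f} to {high:.3f} ({coverage[k]:.1%} of normal data)")
```

Every observed torque value, from 10 to 37, lies inside the 3 standard deviation interval. Because the torque data appear right-skewed, the normal percentages are only approximate here.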

Step 4: Assess the shape and spread of your data distribution

Use the histogram, the individual value plot, and the boxplot to assess the shape and spread
of the data, and to identify any potential outliers.

Examine the shape of your data to determine whether your data appear to be skewed

When data are skewed, the majority of the data are located on the high or low side of the
graph. Often, skewness is easiest to detect with a histogram or boxplot.

Figure panels: Right-skewed histogram, Left-skewed histogram

The histogram with right-skewed data shows wait times. Most of the wait times are relatively short, and
only a few wait times are long. The histogram with left-skewed data shows failure time data. A few items
fail immediately, and many more items fail later.

Determine how much your data varies


Assess the spread of the points to determine how much your sample varies. The greater the
variation in the sample, the more the points will be spread out from the center of the data.

This individual plot shows that the data on the right has more variation than the data on the left.

Look for multi-modal data


Multi-modal data have multiple peaks, also called modes. Multi-modal data often indicate
that important variables are not yet accounted for.

If you have additional information that allows you to classify the observations into groups,
you can create a group variable with this information. Then, you can create the graph with
groups to determine whether the group variable accounts for the peaks in the data.

Figure panels: Simple histogram, Histogram with groups

For example, a manager at a bank collects wait time data and creates a simple histogram. The histogram
appears to have two peaks. After further investigation, the manager determines that the wait times for
customers who are cashing checks are shorter than the wait times for customers who are applying for home
equity loans. The manager adds a group variable for customer task, and then creates a histogram with
groups.

Look for outliers


Outliers, which are data values that are far away from other data values, can strongly affect
the results of your analysis. Often, outliers are easiest to identify on a boxplot.

On a boxplot, asterisks (*) denote outliers.

Step 5. Compare data from different groups


If you have a Group variable, you can use it to analyze your data by group or by group
level.

Statistics

Variable  Machine   N  N*     Mean  SE Mean   StDev  Minimum       Q1   Median       Q3  Maximum
Torque    1        36   0  18.6667   0.7325  4.3948  10.0000  15.2500  17.0000  21.7500  30.0000
          2        32   0   24.188    1.258   7.119   14.000   17.500   24.000   31.000   37.000

In these results, the summary statistics are calculated separately by machine. You can easily see the
differences in the center and spread of the data for each machine. For example, Machine 1 has a lower
mean torque and less variation than Machine 2. To determine whether the difference in means is
significant, you can perform a 2-sample t-test.
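
As a rough sketch of that follow-up test, the t statistic can be computed directly from the summary statistics. It assumes the per-machine values read as N=36, mean 18.6667, StDev 4.3948 for Machine 1 and N=32, mean 24.188, StDev 7.119 for Machine 2. The unequal-variance (Welch) form is used here as an assumption; the text does not say which variant of the 2-sample t-test is intended, and the function name `welch_t` is illustrative.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, approximate degrees of freedom) for Welch's 2-sample t-test."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard errors of each mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics by machine (assumed reading of the table above).
t_stat, df = welch_t(18.6667, 4.3948, 36, 24.188, 7.119, 32)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which a statistics package provides; that step is left out here.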
