Handout 05 Descriptive Measures
Once we have described the distribution of a single variable with a graph or table, we then
interpret the key characteristics of the distribution.
In the example for Age of respondents, we can quickly read off from the table that the
mode is 18 years.
Age of respondents
Age in years Frequency
18 7
19 5
20 4
21 2
22 2
Total 20
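As a minimal sketch, the mode can be read off the frequency table above programmatically. The dictionary name `age_freq` is only an illustrative choice, not part of the handout.

```python
# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# The mode is the value with the highest frequency.
mode_age = max(age_freq, key=age_freq.get)
print(mode_age)  # 18
```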
Limitations on using the mode
The mode may misrepresent the center of a distribution when we have interval/ratio data
with lots of individual values
Looking at the distribution tells us that the center is somewhere in the 35-45 years range
When too many values in a distribution limit the use of the simple mode, we can either:
1. Use our eyeball judgment of what the center of the distribution is, or
2. Use an alternative measure of central tendency (i.e. median, mean) if possible, or
3. Collapse the values into appropriate class intervals and state which interval is the mode (e.g. 35-45
years)
We can now say that the mode is 35-45 years of age
Half of the distribution has scores below the median and half the distribution has scores
above the median.
The determination of the median is affected by whether we have an even or odd number
of cases.
For an odd number of rank-ordered cases, the median is the middle score
For an even number of rank-ordered cases, the median is the average of the two middle scores
Because the median requires that cases be ranked, it can only be applied to ordinal and
interval/ratio data.
The median: An example
To calculate the median we:
1. line up all the cases from youngest to oldest (i.e. we rank them)
2. since we have an even number of cases (n = 20) we identify the 10th and 11th in the line
(the middle two cases)
$Md = \frac{19 + 19}{2} = 19$ years
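The two steps above can be sketched in Python, assuming the 20 ages are those in the Age of respondents frequency table. The variable names are illustrative, and `statistics.median` is used only as a cross-check.

```python
from statistics import median

# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Step 1: line up all cases from youngest to oldest.
ranked = sorted(age for age, f in age_freq.items() for _ in range(f))
n = len(ranked)

# Step 2: n is even, so take the 10th and 11th cases (indexes 9 and 10).
middle_two = (ranked[n // 2 - 1], ranked[n // 2])
md = sum(middle_two) / 2

print(middle_two, md)  # (19, 19) 19.0
print(median(ranked))  # the standard library gives the same value
```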
Calculating the median when there are many cases
Imagine that we have a distribution with many scores.
Trying to calculate the median on such a distribution by ranking them from lowest to highest
may be very impractical.
An alternative is to construct a cumulative frequency table and read off the category or
value at which the cumulative relative frequency first exceeds 50%
Age of respondents
Age in years Frequency Cumulative percent
18 7 35%
19 5 60%
20 4 80%
21 2 90%
22 2 100%
Total 20
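A rough sketch of this cumulative-percentage approach, using the table above. The rule "first exceeds 50%" is taken from the handout; the variable names are assumptions for illustration.

```python
# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}
n = sum(age_freq.values())

cumulative = 0
median_age = None
for age in sorted(age_freq):
    cumulative += age_freq[age]
    cum_pct = 100 * cumulative / n
    print(age, age_freq[age], f"{cum_pct:.0f}%")
    # The median is the first value whose cumulative percent exceeds 50%.
    if median_age is None and cum_pct > 50:
        median_age = age

print("Median:", median_age)  # Median: 19
```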
The mean
The mean is the sum of all scores divided by the total number of cases
The formula used for calculating the mean depends on the form in which we have the
data, and whether we have a sample or a population.
$\bar{X} = \frac{\sum X_i}{n} = \frac{18 + 21 + 20 + \dots + 20}{20} = 19.35$ years
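As a quick check of this result, the mean can be computed directly from the Age of respondents frequency table. This is a sketch only; the names `age_freq` and `total` are illustrative.

```python
# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

n = sum(age_freq.values())
# Sum of all scores: each age counted once per respondent.
total = sum(age * f for age, f in age_freq.items())
mean_age = total / n

print(total, n, mean_age)  # 387 20 19.35
```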
The mean for class intervals
To calculate the mean on class interval data:
1. calculate the mid-point of each class interval
2. multiply each mid-point by the number of cases in that interval
3. sum these products
4. divide the total by the number of cases
$\bar{X} = \frac{\sum fm}{n} = \frac{179}{20} = 8.95$ years
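The four steps can be sketched as follows. The class intervals below are hypothetical (the handout's own interval table is not reproduced here), so the result is not expected to match the 8.95 years above.

```python
# Hypothetical class intervals as (lower, upper, frequency).
# These are NOT the handout's data; they only illustrate the steps.
intervals = [(1, 5, 5), (6, 10, 8), (11, 15, 5), (16, 20, 2)]

n = sum(f for _, _, f in intervals)

# Steps 1-3: mid-point of each interval, times its frequency, summed.
sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in intervals)

# Step 4: divide the total by the number of cases.
grouped_mean = sum_fm / n

print(sum_fm, n, grouped_mean)  # 180.0 20 9.0
```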
Comparing measures of central tendency
Measures of central tendency do not always ‘agree’ with each other: it is possible to have a
different value for the mode, median, and mean, even though they have all been calculated
on the same data.
The median is very stable, whereas any change in the scores will affect the mean.
For example, assume we had the following 5 scores: 12, 15, 18, 19, 22
the mean is 17.2
the median is 18
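To illustrate the point about stability, here is a small sketch. The altered list, in which the highest score 22 is replaced by 220, is a hypothetical change chosen for illustration and does not come from the handout.

```python
from statistics import mean, median

scores = [12, 15, 18, 19, 22]
# Hypothetical change: the highest score becomes an extreme value.
altered = [12, 15, 18, 19, 220]

print(mean(scores), median(scores))    # 17.2 18
print(mean(altered), median(altered))  # 56.8 18
```

The median is unchanged because the middle case is still 18, while the mean moves with the altered score.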
The mean is affected by a small number of high or low scores in a skewed distribution,
so that in such situations it is generally not relied upon as a measure of central tendency.
For example, data for annual income for a population are usually skewed to the right, with
a proportionately small number of high income earners. The mean income will include
these high values and therefore overstate the ‘typical’ income earned by the bulk of the
population.
While the mean and the median on their own refer to the center of a distribution, it is still
useful to cite the difference between the mean and the median to indicate the shape of
the distribution:
Mean greater than the median: the distribution is skewed to the right
Mean is less than the median: the distribution is skewed to the left
Mean is equal to the median: the distribution is symmetrical
Measures of dispersion (or spread)
Interquartile range: Difference between lowest and highest score for the middle 50% of cases. Very
stable measure of spread; often used in conjunction with the median
Standard deviation: Average distance scores are from the average. Has the same limits as the mean
Coefficient of relative variation: Standard deviation relative to the mean. Allows comparison of
variation among different distributions
Measures of dispersion are descriptive statistics that indicate the spread or variety of
scores in a distribution.
[Figure: chart of age in years (18 to 22); all 20 cases are aged 20]
Imagine that an ‘average’ person in this distribution (i.e. someone aged 20 years) asks each of the
19 other people what is their respective age. Every answer is 20 years: a very homogeneous group
of people
[Figure: chart of age in years (18 to 22); 4 cases at each age]
Imagine that the same ‘average’ person asks each of the 19 other people what is their respective
age. Most answers differ from 20 years. In other words there is high variation in the data: a very
heterogeneous group of people in terms of age
Describing variation in words
Annual incomes
Group A Group B
$5,000 $20,000
$6,500 $28,500
$8,000 $35,000
$55,000 $36,000
$85,000 $40,000
Both groups have the same mean income ($31,900), but we can see that there is a difference in
terms of the spread of scores around the average.
The homogeneity of Group B is evident in the values we obtain for the various measures of
dispersion we can calculate on these data.
These measures have much lower values for Group B than for Group A
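A minimal sketch comparing the two groups in the Annual incomes table. The choice of the range and the sample standard deviation as the measures shown, and the helper name `describe`, are assumptions made for illustration.

```python
from statistics import mean, stdev

# Annual incomes from the handout's table.
group_a = [5000, 6500, 8000, 55000, 85000]
group_b = [20000, 28500, 35000, 36000, 40000]

def describe(incomes):
    """Return the mean, range and sample standard deviation."""
    return {
        "mean": mean(incomes),
        "range": max(incomes) - min(incomes),
        "stdev": stdev(incomes),
    }

summary_a = describe(group_a)
summary_b = describe(group_b)
print(summary_a)
print(summary_b)
```

Both groups have a mean of $31,900, but the range is $80,000 for Group A and $20,000 for Group B, and the standard deviation is likewise larger for Group A.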
The IQR eliminates the problem of outliers/extreme scores that can be a problem with the
use of the simple range
In the example for age, we first rank cases from youngest to oldest.
We then count off from the bottom one quarter of all cases (the first quartile), and identify
the age of that person.
We then count off from the bottom three quarters of all cases (the third quartile), and
identify the age of that person:
IQR = 20 – 18 = 2 years
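The counting-off procedure can be sketched as below, assuming the 20 ages from the Age of respondents table. Library routines such as `statistics.quantiles` interpolate between cases and may give slightly different quartiles; this sketch follows the handout's simpler counting rule.

```python
import math

# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Rank cases from youngest to oldest.
ranked = sorted(age for age, f in age_freq.items() for _ in range(f))
n = len(ranked)

# Count off one quarter and three quarters of cases from the bottom.
q1 = ranked[math.ceil(n / 4) - 1]      # 5th case
q3 = ranked[math.ceil(3 * n / 4) - 1]  # 15th case
iqr = q3 - q1

print(q1, q3, iqr)  # 18 20 2
```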
IQR with lots of cases
When we have a distribution with lots of cases we can (in a manner similar to the
calculation of the median) determine the IQR from a cumulative frequency table.
The value at the upper end of the first quartile: where the cumulative percentage first
passes 25% (22 years)
The value at the upper end of the third quartile: where the cumulative percentage first
passes 75% (27 years)
Unless we have a perfectly homogeneous distribution, however, most individual scores will
not equal the mean
The difference between each individual score and the mean is called the deviation
The more heterogeneous the distribution the larger will be these deviations between the
actual scores that make up the distribution and the mean
The standard deviation is based on the size of these deviations from the mean
The standard deviation: A graphical illustration
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$ (sample)
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$ (population)
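As an illustration, the sample and population versions can be computed step by step for the 20 ages in the Age of respondents table. Applying the formulas to that table is an assumption made here for the example; `statistics.stdev` and `statistics.pstdev` serve only as cross-checks.

```python
import math
from statistics import pstdev, stdev

# Frequency table from the handout: age in years -> number of respondents.
age_freq = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}
ages = [age for age, f in age_freq.items() for _ in range(f)]

n = len(ages)
mean_age = sum(ages) / n

# Sum of squared deviations of each score from the mean.
sum_sq_dev = sum((x - mean_age) ** 2 for x in ages)

s = math.sqrt(sum_sq_dev / (n - 1))  # sample: divide by n - 1
sigma = math.sqrt(sum_sq_dev / n)    # population: divide by N

print(round(s, 2), round(sigma, 2))  # 1.35 1.31
```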
Coefficient of relative variation (CRV)
The Coefficient of Relative Variation is useful in comparing the amount of variation across
distributions (i.e. in bivariate analysis)
The standard deviation in age for another group of people is also 1.35 years
In absolute terms the two groups display the same amount of variation
But the mean age in the first group is 5 years and for the second group it is 60 years
In relative terms there is more variation around the mean in the first group compared to
the second
Group 1: $CRV = \frac{s}{\bar{X}} \times 100 = \frac{1.35}{5} \times 100 = 27\%$
Group 2: $CRV = \frac{s}{\bar{X}} \times 100 = \frac{1.35}{60} \times 100 \approx 2\%$
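The same calculation as a short sketch. The function name `crv` is an illustrative choice.

```python
def crv(std_dev, mean):
    """Coefficient of relative variation, as a percentage of the mean."""
    return std_dev / mean * 100

group_1 = crv(1.35, 5)   # about 27
group_2 = crv(1.35, 60)  # about 2.25, reported as 2%

print(round(group_1), round(group_2))  # 27 2
```

The same function works for any unit of measurement, because the units cancel when the standard deviation is divided by the mean.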
CRV: Comparing distributions measured with different units
The CRV also allows us to compare variation between distributions measured in different
units
For example, I may have a group of people whose standard deviation in terms of age is 12
years around a mean of 60 years.
The standard deviation for the distribution of their incomes is $10,000, around a mean of
$25,000.
Does this group display more variation in terms of their age or their income?
We cannot directly compare these values for the standard deviation because they are
expressed in different units of measurement
But by calculating the CRV we eliminate the units of measurement (i.e. years and dollars)
The CRV for age is 20%, but for income it is 40%: there is twice as much variation in these
people’s income, relatively speaking, as there is in age
Measures of dispersion and the shape of a distribution
The shape of the distribution of interval/ratio data affects the measures of dispersion we
use
In particular, whether a distribution is skewed or symmetrical will have the following two
effects:
1. A skewed distribution makes the standard deviation invalid, so we rely on the IQR instead. This is
similar to the way in which skewness affects the mean as a measure of central tendency; in fact the
standard deviation uses the mean in its calculation, so when we can’t use the mean we can’t use
the standard deviation.
2. With a skewed distribution some authors suggest that the CRV should be calculated with reference
to the median rather than the mean.
Measuring variation for categorical data: IQV
All the measures of spread we have looked at so far apply only to interval/ratio data
With nominal/ordinal data (i.e. ‘categorical’ data) we use a measure called the Index of
Qualitative Variation
Rather than measuring the differences between scores, it counts the number of times each
case is different to another
IQV: An example
Actual variation
Sex Frequency
Male 12
Female 8
Total 20
In this distribution for sex of students we can see that each of the 12 males is different to
each of the 8 females in terms of this variable.
Notice that because of the lower level of measurement we can say that one person is
different to another, but we can’t measure how much different.
We can compare our actual distribution to each of two extremes: no variation and
maximum variation
Minimum and maximum possible variation
No variation
Sex Frequency
Male 20
Female 0
There is absolutely no variation in this distribution. Each person is like the others in terms of
sex: there are no differences to count
Maximum variation
Sex Frequency
Male 10
Female 10
The maximum amount of variation that can be displayed by categorical data is where cases are
evenly spread across the categories
We can see that there are 100 ‘variations’ in this distribution: each of the ten males paired with
each of the ten females
For any table the maximum number of differences will be determined by the following formula,
where k is the number of categories and n is the number of cases spread across the categories:
$\text{Maximum differences} = \frac{n^2 (k - 1)}{2k} = \frac{20^2 (2 - 1)}{2 \times 2} = 100$
Actual variation
Sex Frequency
Male 12
Female 8
There are 12 x 8 = 96 differences
The IQV is the ratio of the actual differences in a distribution to the maximum number of
differences:
$IQV = \frac{\text{Observed differences}}{\text{Maximum differences}} = \frac{96}{100} = 0.96$
The IQV expresses the fact that the actual distribution sits much more closely to the maximum
possible amount of variation than it does to the minimum possible variation.
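A minimal sketch of the IQV calculation for the three tables above. It assumes the commonly used expression n²(k − 1)/(2k) for the maximum number of differences, which gives 100 for n = 20 and k = 2; the function name `iqv` is illustrative.

```python
from itertools import combinations

def iqv(frequencies):
    """Index of Qualitative Variation for a list of category frequencies."""
    n = sum(frequencies)
    k = len(frequencies)
    # Observed differences: every case paired with every case in another category.
    observed = sum(a * b for a, b in combinations(frequencies, 2))
    # Assumed formula for the maximum possible number of differences.
    maximum = n ** 2 * (k - 1) / (2 * k)
    return observed / maximum

print(iqv([20, 0]))   # 0.0  (no variation)
print(iqv([10, 10]))  # 1.0  (maximum variation)
print(iqv([12, 8]))   # 0.96 (actual distribution)
```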
Reporting results
[Figure: Hours of TV Watched per Day. Horizontal axis: Hours (0 to 24); vertical axis scaled 0 to 500.
Source: US General Social Survey, 1993; n = 1489]
From this figure a number of conclusions can be drawn about TV viewing behavior. We can see that the
distribution is mound-shaped (unimodal), and is skewed to the right by the handful of people reporting a
high amount of TV viewing. On average, survey respondents watch around 2 hours of TV per day, with the
median for this distribution equal to 2 hours. A high percentage (80%) of people watched between 1 and 4 hours
of TV, indicating that there was not a great deal of variation in TV viewing among these viewers (IQR = 2
hours). It should be noted though, that despite most cases being clustered around the average, a very small
number of cases exhibit extreme scores for amount of TV watched, in relation to the rest of the sample.
One case, for example, recorded 24 hours of TV watching. Given their extreme nature, however, such scores
may reflect unreliable responses to the survey rather than actual behavior. Respondents recording 12 or
more hours of TV viewing have therefore been excluded from further analysis.
Numerical measures and bivariate analysis
Strictly speaking measures of central tendency and dispersion are univariate descriptive
statistics in that they describe a single distribution (e.g. the mean age of students)
We can extend the use of numerical measures of average into bivariate analysis
We do this by calculating relevant measures for each group defined by the independent
variable
For example, if we were interested in whether there was a relationship between the sex
of students in the sample and age, we can calculate summary statistics for age for men
and for women separately and see if there is a difference
In this example:
1. Students’ sex must be the independent variable and age the dependent variable
2. We divide the data into males and females
3. We compare males and females in terms of some summary statistics for age
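The steps above can be sketched as follows, with age as the dependent variable as in the sex and age example. The six cases are hypothetical and serve only to show the grouping; they are not the handout's data.

```python
from statistics import mean, median

# Hypothetical cases as (sex, age in years); NOT the handout's data.
students = [
    ("Male", 18), ("Female", 19), ("Male", 20),
    ("Female", 18), ("Male", 22), ("Female", 21),
]

# Divide the data into groups defined by the independent variable.
ages_by_sex = {}
for sex, age in students:
    ages_by_sex.setdefault(sex, []).append(age)

# Calculate summary statistics for the dependent variable in each group.
summary = {
    sex: {"mean": mean(ages), "median": median(ages)}
    for sex, ages in ages_by_sex.items()
}
print(summary)
```

A difference between the groups' summary statistics suggests a relationship between the two variables in the sample.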
With or without spread information? Question the usefulness of the average if the data are very
dispersed
Add a 0 before decimal points where necessary (0.2 rather than .2)
Indicate units of measurement where relevant (e.g. P15,000 rather than 15000)
Except in tables/graphs, numerals with four digits do not usually need a comma (e.g. 7623
rather than 7,623). Numerals with more than four digits usually require a comma (e.g.
17,623 rather than 17623)
Do not begin a sentence with Arabic numerals; spell the number out in words (e.g.
‘Forty-two per cent of people commented…’ not ‘42% of people commented…’)
Except in tables/graphs spell out numbers one to ten and use digits from 11 onward, except:
– when used with units or for a numbered item (e.g. 2 metres; 10 per cent; page 2);
– as commonly used in scientific or technical textbooks;
– when digits are needed to avoid a string of hyphenated words (e.g. 24-hour day not twenty-four-
hour day);
– at the beginning of a sentence, when numbers should be spelt out