
Descriptive statistics: Numerical measures

Methods of data description


 Data description methods we have looked at so far:
 Graphs
 Tables

 Once we have described the distribution of a single variable with a graph or table, we then
interpret the key characteristics of the distribution.

 Aspects of a univariate distribution we may be interested in:


 shape
 center
 spread
 existence of outliers
Example: Age in years
[Figure: bar graph of frequency (vertical axis, 0 to 8) by age in years (horizontal axis, 18 to 22)]

Age of respondents
Age in years Frequency
18 7
19 5
20 4
21 2
22 2
Total 20

 We can describe the same distribution as either a graph or table

 We can look at the graph or table for the distribution of age and say that the center of the
distribution is around 18-19 years of age.

 That is where most of the cases seem to ‘hover’.

 Notice that we are using ‘rough judgment’ to decide where the center of the distribution sits.

 An alternative to making a simple ‘eyeball’ judgment is to use certain mathematical procedures
which calculate, under certain conditions, the center of a distribution

 These are equations that give a specific value for the center of a distribution.

 The result is that instead of using loose words, such as “the center is around 18-19 years”
we use definite numbers, such as “the center of the distribution is 18.5 years”
Numerical measures: Central tendency
Measures of central tendency indicate the ‘typical’ or ‘average’ score in a distribution

Measure | Method of calculation | Data considerations
Mode | Most frequent score | Can be used with all levels of measurement, but not useful with scales that have many values
Median | Middle score in a ranked series | Can be used with ranked data (ordinal and interval/ratio), but not useful for scales with few values
Mean | Sum of scores divided by the total | Can be used for interval/ratio data that are not skewed and do not contain outliers
The mode
 Refers to the value in a distribution that occurs most often (not the frequency with which
it occurs).

 In the example for Age of respondents, we can quickly read off from the table that the
mode is 18 years.

 The mode has certain desirable features:


 Easy to calculate
 Can be generated for any level of measurement

Age of respondents
Age in years Frequency
18 7
19 5
20 4
21 2
22 2
Total 20
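
Reading the mode off the table can also be sketched in code. This is a minimal Python illustration, not part of the original slides; the names age_frequency, mode_value and mode_frequency are assumptions chosen for the example.

```python
from collections import Counter

# Frequency table from the slide: age in years -> number of respondents
age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Expand the table back into the 20 individual scores
ages = [age for age, f in age_frequency.items() for _ in range(f)]

# The mode is the value that occurs most often, not the count itself
mode_value, mode_frequency = Counter(ages).most_common(1)[0]

print(f"mode = {mode_value} years (occurs {mode_frequency} times)")
```

Note that the code reports both the value (the mode) and how often it occurs, which keeps the two ideas separate.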
Limitations on using the mode
 The mode may misrepresent the center of a distribution when we have interval/ratio data
with lots of individual values

 Assume we have the following distribution for age in years of respondents:


12, 13, 13, 35, 37, 38, 40, 41, 42, 43, 53, 61, 65

 Looking at the distribution tells us that the center is somewhere in the 35-45 years range

 The mode, however, calculates the center to be 13 years

 When too many values in a distribution limit the use of the simple mode, we can either:
1. Use our eyeball judgment of what the center of the distribution is, or
2. Use an alternative measure of central tendency (i.e. median, mean) if possible, or
3. Collapse the values into appropriate class intervals and state which interval is the mode (e.g. 35-45
years)

 We can now say that the mode is 35-45 years of age

Age range Frequency
Less than 35 years 3
35-45 years 7
More than 45 years 3
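
Option 3 can be sketched in Python by tallying the thirteen listed ages into the three class intervals. This is only an illustration: the helper age_range and the exact interval boundaries (35 and 45 both counted in the middle interval) are assumptions.

```python
from collections import Counter

ages = [12, 13, 13, 35, 37, 38, 40, 41, 42, 43, 53, 61, 65]

def age_range(age):
    """Assign an age to one of the three class intervals (boundaries assumed)."""
    if age < 35:
        return "Less than 35 years"
    if age <= 45:
        return "35-45 years"
    return "More than 45 years"

# Mode of the raw values versus mode of the collapsed intervals
simple_mode = Counter(ages).most_common(1)[0][0]
interval_counts = Counter(age_range(a) for a in ages)
modal_interval = interval_counts.most_common(1)[0][0]

print("simple mode:", simple_mode)
print("interval counts:", dict(interval_counts))
print("modal interval:", modal_interval)
```

The counts are computed directly from the listed ages, so the modal interval does not depend on a hand tally.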
The median
 The median is the score which divides a rank-ordered series in half.

 Half of the distribution has scores below the median and half the distribution has scores
above the median.

 The determination of the median is affected by whether we have an even or odd number
of cases.
 For an odd number of rank-ordered cases, the median is the middle score
 For an even number of rank-ordered cases, the median is the average of the two middle scores

 Because the median requires that cases be ranked, it can only be applied to ordinal and
interval/ratio data.
The median: An example
To calculate the median we:

1. line up all the cases from youngest to oldest (i.e. we rank them)

2. since we have an even number of cases (n = 20) we identify the 10th and 11th in the line
(the middle two cases)

3. take the mean of the two middle scores

$$Md = \frac{19 + 19}{2} = 19 \text{ years}$$
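
The three steps can be sketched in Python using the age data from the frequency table. The variable names are assumptions for illustration; the standard library function statistics.median is used only as a cross-check.

```python
from statistics import median

age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Step 1: line up all cases from youngest to oldest
ranked = sorted(age for age, f in age_frequency.items() for _ in range(f))

# Step 2: with an even number of cases, pick the two middle cases
n = len(ranked)
middle_two = (ranked[n // 2 - 1], ranked[n // 2])  # 10th and 11th cases

# Step 3: take the mean of the two middle scores
md = sum(middle_two) / 2

print("middle two scores:", middle_two)
print("median:", md, "(library check:", median(ranked), ")")
```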
Calculating the median when there are many cases
 Imagine that we have a distribution with many scores.

 Trying to calculate the median on such a distribution by ranking them from lowest to highest
may be very impractical.

 An alternative is to construct a cumulative frequency table and read off the category or
value at which the cumulative relative frequency first exceeds 50%

 By definition this must ‘take-in’ the middle case.

Age of respondents
Age in years Frequency Cumulative percent
18 7 35%
19 5 60%
20 4 80%
21 2 90%
22 2 100%
Total 20
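
A minimal Python sketch of the cumulative frequency approach follows. It assumes the rule stated above (the first value whose cumulative percent exceeds 50%) and does not handle the edge case where the cumulative percent lands exactly on 50%, in which case the median would fall between two values.

```python
age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}
n = sum(age_frequency.values())

# Build the cumulative percent column, working up from the lowest value
cumulative = 0
cumulative_percent = {}
for age in sorted(age_frequency):
    cumulative += age_frequency[age]
    cumulative_percent[age] = 100 * cumulative / n

# The median is the first value at which the cumulative percent exceeds 50%
median_age = next(age for age, pct in cumulative_percent.items() if pct > 50)

for age, pct in cumulative_percent.items():
    print(age, f"{pct:.0f}%")
print("median:", median_age)
```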
The mean
 The mean is the sum of all scores divided by the total number of cases

 Applied only to interval/ratio data that are not skewed

 The formula used for calculating the mean depends on the form in which we have the
data, and whether we have a sample or a population.

Type of data | Formula for a population | Formula for a sample
Listed data | $\mu = \frac{\sum X_i}{N}$ | $\bar{X} = \frac{\sum X_i}{n}$
Frequency data | $\mu = \frac{\sum f X_i}{N}$ | $\bar{X} = \frac{\sum f X_i}{n}$
Class intervals | $\mu = \frac{\sum f m}{N}$ | $\bar{X} = \frac{\sum f m}{n}$
The mean for listed and frequency data
Listed data

Case Age in years
1 18
2 21
3 20
4 18
5 19
6 18
7 22
8 19
9 18
10 20
11 18
12 19
13 22
14 19
15 20
16 18
17 21
18 19
19 18
20 20

Sum the scores and divide by the total:

$$\bar{X} = \frac{\sum X_i}{n} = \frac{18 + 21 + 20 + \dots + 20}{20} = 19.35 \text{ years}$$

Frequency data

Age in years Frequency f Xi
18 7 7 x 18 = 126
19 5 5 x 19 = 95
20 4 4 x 20 = 80
21 2 2 x 21 = 42
22 2 2 x 22 = 44
Total 20 Σ f Xi = 387

1. multiply each value in the distribution by the frequency ( f ) with which it occurs;
2. sum these products ( Σ f Xi ); and
3. divide the sum by the number of cases

$$\bar{X} = \frac{\sum f X_i}{n} = \frac{(18 \times 7) + (19 \times 5) + (20 \times 4) + (21 \times 2) + (22 \times 2)}{20} = 19.35 \text{ years}$$
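
Both calculations can be checked with a short Python sketch. The variable names are assumptions for illustration; the data are the 20 listed ages and the matching frequency table.

```python
# Listed data: one age per case, in case order
listed_ages = [18, 21, 20, 18, 19, 18, 22, 19, 18, 20,
               18, 19, 22, 19, 20, 18, 21, 19, 18, 20]

# Frequency data: age in years -> frequency
age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Listed data: sum the scores and divide by the number of cases
mean_listed = sum(listed_ages) / len(listed_ages)

# Frequency data: multiply each value by its frequency, sum, divide by n
n = sum(age_frequency.values())
sum_fx = sum(f * x for x, f in age_frequency.items())
mean_frequency = sum_fx / n

print("sum of f * X:", sum_fx)
print("mean (listed):", mean_listed)
print("mean (frequency):", mean_frequency)
```

The two forms of the formula give the same answer because the frequency table is only a compact way of writing the same 20 scores.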
The mean for class intervals
To calculate the mean on class interval data:
1. calculate the mid-point of each class interval
2. multiply each mid-point by the number of cases in that interval
3. sum these products
4. divide the total by the number of cases

Class interval Mid-point, m Frequency, f Frequency times mid-point, fm
1-5 3 7 3 x 7 = 21
6-10 8 10 8 x 10 = 80
11-15 13 6 13 x 6 = 78
Total n = 23 Σfm = 179

$$\bar{X} = \frac{\sum fm}{n} = \frac{179}{23} \approx 7.78 \text{ years}$$
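
The four steps can be sketched in Python as follows. This is an illustrative sketch with assumed variable names; the number of cases is taken as the sum of the frequency column (7 + 10 + 6 = 23).

```python
# Each entry: ((lower limit, upper limit), frequency)
class_intervals = [((1, 5), 7), ((6, 10), 10), ((11, 15), 6)]

# Step 1: mid-point of each class interval
midpoints = [(lower + upper) / 2 for (lower, upper), _ in class_intervals]

# Steps 2 and 3: multiply each mid-point by its frequency and sum the products
sum_fm = sum(m * f for m, (_, f) in zip(midpoints, class_intervals))

# Step 4: divide by the number of cases (sum of the frequencies)
n = sum(f for _, f in class_intervals)
mean_grouped = sum_fm / n

print("mid-points:", midpoints)
print("sum of f * m:", sum_fm)
print("n:", n)
print("mean:", round(mean_grouped, 2))
```

Because the mid-point stands in for every score in its interval, the result is an approximation of the mean of the underlying scores.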
Comparing measures of central tendency
 Measures of central tendency do not always ‘agree’ with each other: it is possible to have a
different value for the mode, median, and mean, even though they have all been calculated
on the same data.

 The median is very stable, whereas any change in the scores will affect the mean.

 For example, assume we had the following 5 scores: 12, 15, 18, 19, 22
 the mean is 17.2
 the median is 18

 If we change 22 to 80: 12, 15, 18, 19, 80


 the median remains 18
 the mean now equals 28.8
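
The same comparison can be run in Python with the standard library statistics module. The list names are assumptions for illustration.

```python
from statistics import mean, median

original = [12, 15, 18, 19, 22]
with_outlier = [12, 15, 18, 19, 80]  # 22 changed to 80

for label, scores in (("original", original), ("with outlier", with_outlier)):
    print(label, "mean =", mean(scores), "median =", median(scores))
```

Only the highest score changed, so the middle-ranked score (and therefore the median) is untouched, while the sum of scores (and therefore the mean) rises.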
Skewed distributions and measures of central tendency
 With skewed data the mean and the median will give a different value for central
tendency.

 The mean is affected by the small number of high or low scores in a skewed distribution,
so that in such situations it is generally not relied upon as a measure of central tendency.

 For example, data for annual income for a population are usually skewed to the right, with
a proportionately small number of high income earners. The mean income will include
these high values and therefore overstate the ‘typical’ income earned by the bulk of the
population.

 While the mean and the median on their own refer to the center of a distribution, it is still
useful to cite the difference between the mean and the median to indicate the shape of
the distribution:
 Mean greater than the median: the distribution is skewed to the right
 Mean is less than the median: the distribution is skewed to the left
 Mean is equal to the median: the distribution is symmetrical
Measures of dispersion (or spread)

Measure | Method of calculation | Data considerations
Index of Qualitative Variation | Compare actual distribution with extreme heterogeneity | Used for tables where categories have no natural ordering
Range | Difference between lowest and highest score | Affected by outliers
Interquartile range | Difference between lowest and highest score for the middle 50% of cases | Very stable measure of spread; often used in conjunction with the median
Standard deviation | Average distance scores are from the average | Has the same limits as the mean
Coefficient of relative variation | Standard deviation relative to the mean | Allows comparison of variation among different distributions

 Measures of dispersion are descriptive statistics that indicate the spread or variety of
scores in a distribution.

 They indicate the level of variation in the data


The notion of variation
 Two groups have a distribution for age displayed by the graphs below
 In both cases the mean age will be 20 years: the center of the distribution is the same
 But we can also see that in another sense these distributions are very different

Respondents by age - minimum variation
[Figure: bar graph of frequency by age in years (18 to 22); all 20 respondents are aged 20]

 Imagine that an ‘average’ person in this distribution (i.e. someone aged 20 years) asks each of
the 19 other people what is their respective age

 This person will find that there is no difference between him/her and everyone else; every
other person is also average

 In other words there is no variation in the data in terms of age: a very homogeneous group of
people

Respondents by age - maximum variation
[Figure: bar graph of frequency by age in years (18 to 22); 4 respondents at each age]

 Imagine that an ‘average’ person in this distribution asks each of the 19 other people what is
their respective age

 This person will occasionally meet another average person, but more often will meet people
very different from themselves in terms of age; most people are not average

 In other words there is high variation in the data: a very heterogeneous group of people in
terms of age
Describing variation in words
Annual incomes
Group A Group B
$5,000 $20,000
$6,500 $28,500
$8,000 $35,000
$55,000 $36,000
$85,000 $40,000

 The mean for each distribution is $31,900.

 But we can see that there is a difference in terms of the spread of scores around the
average.

 We could express this difference in spread using simple words:


“Group B is more homogeneous than Group A”
“There is much less variation in income within Group B compared to Group A”
“The distribution of income is more widely spread in Group A”
Describing variation using numerical calculations
 In addition to our eyeball judgment of the spread of scores, we can also calculate
measures of dispersion to express this difference numerically

 The homogeneity of Group B is evident in the values we obtain for the various measures of
dispersion we can calculate on these data.

 These measures have much lower values for Group B than for Group A

Descriptive statistics Group A Group B


Mean $31,900 $31,900
Range $80,000 $20,000
Standard deviation $36,377 $7,829
CRV 114% 25%
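
The figures in this table can be reproduced with a short Python sketch. It assumes the sample formula for the standard deviation (n - 1 in the denominator), which is what statistics.stdev computes; the dictionary layout and names are illustrative assumptions.

```python
from statistics import mean, stdev

incomes = {
    "Group A": [5000, 6500, 8000, 55000, 85000],
    "Group B": [20000, 28500, 35000, 36000, 40000],
}

summary = {}
for group, values in incomes.items():
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    summary[group] = {
        "mean": avg,
        "range": max(values) - min(values),
        "sd": round(sd),
        "crv": round(100 * sd / avg),  # sd as a percentage of the mean
    }

for group, stats in summary.items():
    print(group, stats)
```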
The range, R
 The range is the difference between the highest and lowest values in a distribution
RA = 85,000 – 5,000 = $80,000
RB = 40,000 – 20,000 = $20,000

 The range is the simplest measure of dispersion to calculate

 Since it requires the measurement of difference between intervals on a scale, it is only
applicable to interval/ratio data

 The range can be unreliable if there are outliers in the distribution


The interquartile range (IQR)
 The interquartile range is the difference between the upper limits of the first quartile and
the third quartile.

 The IQR eliminates the problem of outliers/extreme scores that can be a problem with the
use of the simple range

 In the example for age, we first rank cases from youngest to oldest.

 We then count off from the bottom one quarter of all cases (the first quartile), and identify
the age of that person.

 We then count off from the bottom three quarters of all cases (the third quartile), and
identify the age of that person:
IQR = 20 – 18 = 2 years
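
The counting-off procedure can be sketched in Python for the 20 ages. This follows the rule described above (the age of the case one quarter and three quarters of the way up the ranking); other quartile conventions, such as those used by statistical packages, may give slightly different values. Variable names are assumptions.

```python
age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}

# Rank cases from youngest to oldest
ranked = sorted(age for age, f in age_frequency.items() for _ in range(f))
n = len(ranked)

# Count off one quarter and three quarters of the cases from the bottom
q1_age = ranked[n // 4 - 1]      # 5th case of 20
q3_age = ranked[3 * n // 4 - 1]  # 15th case of 20

iqr = q3_age - q1_age
print("first quartile:", q1_age, "third quartile:", q3_age, "IQR:", iqr)
```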
IQR with lots of cases
 When we have a distribution with lots of cases we can (in a manner similar to the
calculation of the median) determine the IQR from a cumulative frequency table.

 Assume I had 1000 cases in a survey ranging in age from 20 to 28

 The value at the upper end of the first quartile: where the cumulative percentage first
passes 25% (22 years)

 The value at the upper end of the third quartile: where the cumulative percentage first
passes 75% (27 years)

Age in years Cumulative per cent
20 18%
21 24%
22 32%
23 45%
24 51%
25 59%
26 67%
27 82%
28 100%

IQR = 27 - 22 = 5 years
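
Reading the quartiles off the cumulative table can be sketched in Python. The helper first_value_passing is a hypothetical name introduced for this illustration.

```python
# Cumulative per cent column from the table: age in years -> cumulative %
cumulative_percent = {20: 18, 21: 24, 22: 32, 23: 45, 24: 51,
                      25: 59, 26: 67, 27: 82, 28: 100}

def first_value_passing(threshold):
    """Return the first age whose cumulative percentage passes the threshold."""
    return next(age for age in sorted(cumulative_percent)
                if cumulative_percent[age] > threshold)

q1_age = first_value_passing(25)
q3_age = first_value_passing(75)
iqr = q3_age - q1_age

print("first quartile:", q1_age, "third quartile:", q3_age, "IQR:", iqr)
```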
The standard deviation
 The standard deviation captures the extent to which each score differs from the mean

 We know that the mean is a way of determining the average of a distribution

 Unless we have a perfectly homogeneous distribution, however, most individual scores will
not equal the mean

 The difference between each individual score and the mean is called the deviation

 The more heterogeneous the distribution the larger will be these deviations between the
actual scores that make up the distribution and the mean

 The standard deviation is based on the size of these deviations from the mean
The standard deviation: A graphical illustration

 The formula for the standard deviation captures the idea behind the standard deviation: the
average deviation from the average

 For the data for age, the s.d. is 1.35 years

$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} \quad \text{(sample)}$$

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \quad \text{(population)}$$
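
Both formulas can be applied to the 20 ages in a short Python sketch. The variable names are assumptions; statistics.stdev and statistics.pstdev are used only as cross-checks. On these data the sample formula gives the 1.35 years quoted above.

```python
from math import sqrt
from statistics import pstdev, stdev

age_frequency = {18: 7, 19: 5, 20: 4, 21: 2, 22: 2}
ages = [age for age, f in age_frequency.items() for _ in range(f)]

n = len(ages)
x_bar = sum(ages) / n

# Sum of squared deviations of each score from the mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in ages)

s = sqrt(sum_sq_dev / (n - 1))  # sample formula
sigma = sqrt(sum_sq_dev / n)    # population formula

print("sample s.d.:", round(s, 2), "(library:", round(stdev(ages), 2), ")")
print("population s.d.:", round(sigma, 2), "(library:", round(pstdev(ages), 2), ")")
```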
Coefficient of relative variation (CRV)
 The Coefficient of Relative Variation is useful in comparing the amount of variation across
distributions (i.e. in bivariate analysis)

 In particular it allows us to:


– compare distributions measured in the same units but which have very different means
– compare distributions measured with different units.
CRV: Comparing distributions measured in the same units
 The standard deviation in age for a group of people is 1.35 years.

 The standard deviation in age for another group of people is also 1.35 years

 In absolute terms the two groups display the same amount of variation

 But the mean age in the first group is 5 years and for the second group it is 60 years

 In relative terms there is more variation around the mean in the first group compared to
the second

 To allow us to compare variation across these two distributions we calculate the
Coefficient of Relative Variation, which is the standard deviation expressed as a
percentage of the mean

$$\text{Group 1: } CRV = \frac{s}{\bar{X}} \times 100 = \frac{1.35}{5} \times 100 = 27\%$$

$$\text{Group 2: } CRV = \frac{s}{\bar{X}} \times 100 = \frac{1.35}{60} \times 100 \approx 2\%$$
CRV: Comparing distributions measured with different units
 The CRV also allows us to compare variation between distributions measured in different
units

 For example, I may have a group of people whose standard deviation in terms of age is 12
years around a mean of 60 years.

 The standard deviation for the distribution of their incomes is $10,000, around a mean of
$25,000.

 Does this group display more variation in terms of their age or their income?

 We cannot directly compare these values for the standard deviation because they are
expressed in different units of measurement

 But by calculating the CRV we eliminate the units of measurement (i.e. years and dollars)

 The CRV for age is 20%, but for income it is 40%: there is twice as much variation in these
people’s income, relatively speaking, as there is in age
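
The CRV calculations in both comparisons can be sketched in Python. The function name crv is an assumption for illustration; it simply expresses the standard deviation as a percentage of the mean.

```python
def crv(sd, mean):
    """Coefficient of relative variation: sd as a percentage of the mean."""
    return 100 * sd / mean

# Same units, very different means (age in years)
crv_group1 = crv(1.35, 5)
crv_group2 = crv(1.35, 60)

# Different units for the same group (years versus dollars)
crv_age = crv(12, 60)
crv_income = crv(10_000, 25_000)

print("Group 1:", round(crv_group1), "%  Group 2:", round(crv_group2), "%")
print("Age:", round(crv_age), "%  Income:", round(crv_income), "%")
```

Because the units cancel in the division, the age and income results can be compared directly.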
Measures of dispersion and the shape of a distribution
 The shape of the distribution of interval/ratio data affects the measures of dispersion we
use

 In particular, whether a distribution is skewed or symmetrical will have the following two
effects:

1. A skewed distribution makes the standard deviation invalid, so we rely on the IQR instead. This is
similar to the way in which skewness affects the mean as a measure of central tendency; in fact the
standard deviation uses the mean in its calculation, so when we can’t use the mean we can’t use
the standard deviation.

2. With a skewed distribution some authors suggest that the CRV should be calculated with reference
to the median rather than the mean.
Measuring variation for categorical data: IQV
 All the measures of spread we have looked at so far apply only to interval/ratio data

 With nominal/ordinal data (i.e. ‘categorical’ data) we use a measure called the Index of
Qualitative Variation

 Rather than measuring the differences between scores, it counts the number of times each
case is different to another
IQV: An example
Actual variation
Sex Frequency
Male 12
Female 8
Total 20

 In this distribution for sex of students we can see that each of the 12 males is different to
each of the 8 females in terms of this variable.

 Therefore there are 8 x 12 = 96 differences in this distribution.

 Notice that because of the lower level of measurement we can say that one person is
different to another, but we can’t measure how much different.

 We can compare our actual distribution to each of two extremes: no variation and
maximum variation
Minimum and maximum possible variation
No variation
Sex Frequency
Male 20
Female 0

 There is absolutely no variation in this distribution. Each person is like the others in terms of
sex: there are no differences to count

Maximum variation
Sex Frequency
Male 10
Female 10

 The maximum amount of variation that can be displayed by categorical data is where cases are
evenly spread across the categories
 We can see that there are 100 ‘variations’ in this distribution: each of the ten males paired with
each of the ten females
 For any table the maximum number of differences will be determined by the following formula,
where:
 The number of categories (k)
 The number of cases spread across the categories (n)

$$\text{maximum possible differences} = \frac{n^2 (k - 1)}{2k} = \frac{20^2 (2 - 1)}{2(2)} = 100$$

Actual variation
Sex Frequency
Male 12
Female 8

 The IQV expresses the fact that the actual distribution sits much more closely to the maximum
possible amount of variation than it does to the minimum possible variation. There are
12 x 8 = 96 differences
 The IQV is the ratio of the actual differences in a distribution to the maximum number of
differences:

$$IQV = \frac{\text{Observed differences}}{\text{Maximum differences}} = \frac{96}{100} = 0.96$$
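
The IQV calculation can be sketched in Python for any list of category frequencies. This is an illustrative sketch: the function name iqv is an assumption, and observed differences are counted as the product of frequencies for every pair of categories, which reduces to 12 x 8 for a two-category table.

```python
from itertools import combinations

def iqv(frequencies):
    """Index of Qualitative Variation: observed / maximum possible differences."""
    n = sum(frequencies)
    k = len(frequencies)
    # Each case in one category differs from each case in every other category
    observed = sum(a * b for a, b in combinations(frequencies, 2))
    maximum = n ** 2 * (k - 1) / (2 * k)
    return observed / maximum

print("actual (12, 8):", iqv([12, 8]))
print("no variation (20, 0):", iqv([20, 0]))
print("maximum variation (10, 10):", iqv([10, 10]))
```

The two extreme tables give the end points of the scale, 0 and 1, with the actual distribution sitting close to the top.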
Reporting results
[Figure: Hours of TV Watched per Day. Bar graph of number of respondents (vertical axis, 0 to 500) by hours (horizontal axis, 0 to 24). Source: US General Social Survey, 1993; n = 1489]

From this figure a number of conclusions can be drawn about TV viewing behavior. We can see that the
distribution is mound-shaped (unimodal), and is skewed to the right by the handful of people reporting a
high amount of TV viewing. On average, survey respondents watch around 2 hours of TV per day, with the
median for this distribution equal to 2 hours. A high percentage (80%) of people watched between 1-4 hours
of TV, indicating that there was not a great deal of variation in TV viewing among these viewers (IQR = 2
hours). It should be noted though, that despite most cases being clustered around the average, a very small
number of cases exhibit extreme scores for amount of TV watched, in relation to the rest of the sample.
One case, for example, recorded 24 hours of TV watching. Given their extreme nature, however, such scores
may reflect unreliable responses to the survey rather than actual behavior. Respondents recording 12 or
more hours of TV viewing have therefore been excluded from further analysis.
Numerical measures and bivariate analysis
 Strictly speaking measures of central tendency and dispersion are univariate descriptive
statistics in that they describe a single distribution (e.g. the mean age of students)

 We can extend the use of numerical measures of average into bivariate analysis

 We do this by calculating relevant measures for each group defined by the independent
variable

 For example, if we were interested in whether there was a relationship between the sex
of students in the sample and age, we can calculate summary statistics for age for men
and for women separately and see if there is a difference

Group | Mean | Standard deviation | Level of analysis
All cases | 19.3 years | 1.35 years | univariate
Females | 18.5 years | 2 years | bivariate
Males | 21 years | 1 year | bivariate

There is a relationship between sex and age profile of students. Men are on average
older and display less variation in their age than women.
Measures of central tendency for bivariate analysis: How to …
 To undertake simple bivariate analysis, we:
1. Determine the independent and dependent variables (does a person’s sex affect their income or
vice versa?)
2. Split the data into groups defined by the categories of the independent variable
3. Compare these groups in terms of some descriptive statistic for the dependent variable

 In this example:
1. Students’ sex must be the independent variable and income the dependent variable
2. We divide the data into males and females
3. We compare males and females in terms of some summary statistics for income

 In SPSS this can be done a number of ways:


1. Data/Split File command followed by the Frequencies/Statistics command
2. Explore command with the independent variable as a Factor
3. Compare Means command
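
Outside SPSS, the same split-and-compare steps can be sketched in Python. The records below are hypothetical values invented purely for illustration (they are not data from these slides), and the variable names are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (sex, income) records; NOT data from the slides
records = [
    ("female", 21000), ("female", 25000), ("female", 30000),
    ("male", 20000), ("male", 28000), ("male", 33000),
]

# Step 2: split the data by the categories of the independent variable (sex)
groups = defaultdict(list)
for sex, income in records:
    groups[sex].append(income)

# Step 3: compare groups on summary statistics for the dependent variable
summary = {
    sex: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
    for sex, values in groups.items()
}

for sex, stats in summary.items():
    print(sex, "n =", stats["n"],
          "mean =", round(stats["mean"]), "sd =", round(stats["sd"]))
```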
Averages and spread: A checklist
When reading statistics ask/do the following:

 Which average: mean or median?

 If the mean: with or without outliers/extreme scores? For a skewed distribution?

 With or without spread information? Question the usefulness of the average if the data are
very dispersed

 Have the authors averaged averages or percentages?

 Have the authors made the original data available?


– If Yes, check the calculations
– If No, question validity
Reporting results: some guidelines
 Remove unnecessary decimal places (e.g. 0.2, not 0.200)

 Add a 0 before decimal points where necessary (0.2 rather than .2)

 Indicate units of measurement where relevant (e.g. P15,000 rather than 15000)

 Except in tables/graphs, numerals with four digits do not usually need a comma (e.g. 7623
rather than 7,623). Numerals with more than four digits usually require a comma (e.g.
17,623 rather than 17623)

 Do not begin a sentence with Arabic numerals; spell the number out in words (e.g.
‘Forty-two per cent of people commented…’ not ‘42% of people commented…’)

 For spans, use an en dash (e.g. 34–35 rather than 34-35)

 Except in tables/graphs spell out numbers one to ten and use digits for 11 on, except:
– when used with units or for a numbered item (e.g. 2 metres; 10 per cent; page 2);
– as commonly used in scientific or technical textbooks;
– when digits are needed to avoid a string of hyphenated words (e.g. 24-hour day not
twenty-four-hour day);
– at the beginning of a sentence, when numbers should be spelt out
