
CHAPTER 3

Descriptive Statistics

1. Introduction

A descriptive statistic is a summary statistic that quantitatively describes some feature of interest of a
data set. The subject of descriptive statistics differs from inferential statistics (which uses probability
theory: see later):

• descriptive statistics tries to describe or summarise a sample (or a population) in its own right;
• inferential statistics uses the sample data to learn about the population that the sample is from.

So descriptive statistics describes the data set in front of you, while inferential statistics uses the data
set you have to infer something about a bigger data set that yours lies within.

2. Summary Measures of data

There are several kinds of summary measures of data used in descriptive statistics: some are listed here.

• Central Tendency
– Mean
– Median
– Mode
• Variation (also called Dispersion)
– Range
– Variance
– Standard Deviation
• Other measures of shape of distribution (skewness, etc)

Next, we need a few definitions:

• The population (universe) is the collection of all things under consideration;


• A sample is a portion of the population selected for analysis;
• A parameter is a summary measure computed to describe a characteristic of the population;
• A statistic is a summary measure computed to describe a characteristic of the sample.

For example:

• The population could be all the registered voters in Ireland;


• The sample could be the portion of the population selected by a polling company to survey;
• A parameter could be the percentage of the population that actually give first preference vote
to the Green party;
• A statistic could be the percentage of the surveyed sample that say they will give first preference
vote to the Green party.

We will now learn how to calculate and interpret these descriptive summary measures for sample and
population data. These are mostly the same formulas, except for variance and standard deviation.
40 MIS10090

2.1. Indexed Summation notation. First we introduce a handy shorthand¹ which saves writing
and/or typing and also reduces clutter in formulas.
Suppose we have a set of numbers {x1 , . . . , xn }. For example, if n = 4, the set would be {x1 , x2 , x3 , x4 }.
These are fixed numbers, not variables. If the set was the set of heights (measured in metres) of four
particular students in the class, it might be {1.65, 1.53, 1.81, 1.73}. (That is, x1 = 1.65, x2 = 1.53, etc.)
The notation $\sum_{i=1}^{n} x_i$ means

• for each integer value of the index i, starting at 1 and going up to n,


add xi to the sum
In other words, $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$. “$\Sigma$” is a version of the Greek capital letter sigma, used
to indicate “sum”.
Example 3.1 (Heights). In the height example above we have n = 4, so the values of the index i run
through 1, 2, 3 and 4, so we get
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 = 1.65 + 1.53 + 1.81 + 1.73 = 6.72$$

and we could work from this to get the average height of the four students as

$$\frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{4}(x_1 + x_2 + x_3 + x_4) = \frac{1}{4}(1.65 + 1.53 + 1.81 + 1.73) = \frac{6.72}{4} = 1.68.$$
}

Note that once you have chosen a symbol like i to be the index you sum over, you cannot then use i to
stand for some other meaning in the expression (e. g., you can’t talk about i being a height).
We can extend this idea of summations further:
$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2;$$
in other words

• for each integer value of the index i, starting at 1 and going to n,


multiply $x_i$ by itself to get $x_i^2$
add $x_i^2$ to the sum
so $\sum_{i=1}^{n} x_i^2$ means the sum of the squares of the numbers $x_1, \ldots, x_n$.
For our example of four students, we get
$$\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{4} x_i^2 = 1.65^2 + 1.53^2 + 1.81^2 + 1.73^2 = 11.3324.$$

Or if we had another set {y1 , . . . , yn } of numbers, we could talk about


$$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n;$$
in other words

• for each integer value of the index i, starting at 1 and going to n,


multiply xi by yi to get xi yi
add xi yi to the sum
¹Choosing good notation is crucial in any mathematical topic such as statistics. The human eye can deal with a few
things in an expression in a formula but gets swamped if there are too many things. So we want our notation to show a
possibly complicated formula in a compact form, so we can focus on the important parts of it: we can see the wood instead
of the trees.
Data Analysis for Decision Makers 41

Suppose $y_1 = 1.5$, $y_2 = 3$, $y_3 = 2.5$, $y_4 = 2$. Then with the student heights $x_1, x_2, x_3, x_4$ above we get

$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{4} x_i y_i = 1.65 \times 1.5 + 1.53 \times 3 + 1.81 \times 2.5 + 1.73 \times 2 = 15.05.$$

And so on: we will see numerous uses of this kind of indexed summation notation during the course.
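The notes use Excel for calculations, but the indexed summation notation maps directly onto code. As an illustrative sketch (in Python, which is not part of the module's toolset), here are the three kinds of sums above, computed for the heights and the y-values from the examples:

```python
# Heights (metres) from Example 3.1 and the y-values from the text.
x = [1.65, 1.53, 1.81, 1.73]
y = [1.5, 3, 2.5, 2]
n = len(x)

# Each sum() below plays the role of the capital-sigma notation.
sum_x = sum(x[i] for i in range(n))           # sum of x_i          ~ 6.72
sum_x_sq = sum(x[i] ** 2 for i in range(n))   # sum of x_i squared  ~ 11.3324
sum_xy = sum(x[i] * y[i] for i in range(n))   # sum of x_i * y_i    ~ 15.05
```

Note that in Python the index i runs from 0 to n − 1 rather than from 1 to n; the sums are the same.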

3. Measures of Central Tendency

Data tends to cluster around a central point; measures that indicate where data is concentrated are
called measures of central tendency.

3.1. Outliers. We need to be conscious of how to treat observations that are very different from
the rest. An outlier is an extreme value that is not representative of the data set as a whole: it differs
significantly from other observations. There are different definitions of the words extreme and outlier: we
will see some below.
An outlier can be due to noise or variability in measurement, or experimental error; in the case of
experimental error we wish to exclude outliers from the data set. Note that outliers can also indicate
novel or unusual data (which is sometimes what we are looking for: e. g., in anomaly detection); or that
the data set has a heavy-tailed (highly-skewed: see below) distribution.
Outliers can occur by chance in any data set. They can cause major problems in data analysis, particu-
larly if we use measures that are not robust to outliers.

3.2. (Arithmetic) Mean or Expected Value. When we have a data set D and no reason to
assume one outcome is more likely than another, we assign them equal weight in calculating measures
of central tendency, dispersion, etc. Thus, if we are able to work with all items of the data set D (called
the population), and the population has size N , each outcome is thus given equal weight of 1/N . If we
are working with a proper subset S of the full data set (called a sample), and the sample has size n, each
outcome is thus given equal weight of 1/n.
Later, we will see that these equal weights are a special case of probabilities.
We can define the (arithmetic) mean of a numerical attribute of data using indexed summation notation
with the following formula:

Population mean: Given a finite population D, the population mean µ of a numerical attribute
is the arithmetic mean of this attribute, taken over all members of the population:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

Sample mean: If the numerical attribute is observed by sampling a subset $S = \{x_1, \ldots, x_n\}$ of
size $n$ from the population, the mean is called the sample mean $\bar{x}$, defined by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

We already had an example of this in Example 3.1.

The sample mean may be different from the population mean, particularly if the sample size n is small.
The larger the sample size, the more likely it is that the sample mean will be close to the population mean
(this is called the law of large numbers).
Both sample mean and population mean are sensitive to extreme outliers and if outliers exist the mean
may not be a good measure to use. Also, the mean can only be used for data with ratio level of
measurement.

Label               x1    x2    x3    x4    x5    x6
Sales for Staff Id  Jan   Feb   Mar   Apr   May   Jun
aa10                ···   ···   ···   ···   ···   ···
aa11                22    112   116   86    40    64
aa12                ···   ···   ···   ···   ···   ···

Table 3.1. Sales from months January–June for staff members

Example 3.2. Let's look at an example for a company selling printers.

Table 3.1 shows the printer sales per month over a 6-month period for a member of staff called Jane
whose Staff Id is “aa11”. In the first month, 22 printers are sold: we denote sales in January by $x_1 = 22$,
and so on. The average monthly sales for this member of staff over this 6-month time period is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_6}{6} = \frac{1}{6}\sum_{i=1}^{6} x_i = \frac{1}{6}(22 + 112 + 116 + 86 + 40 + 64) = 73.33 \text{ printers.}$$
}
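The same calculation can be sketched in code (Python here, purely as an illustration; the module uses Excel). The standard library's statistics.mean agrees with the formula:

```python
import statistics

sales = [22, 112, 116, 86, 40, 64]   # Jane's monthly printer sales

# Sample mean by the formula: x_bar = (1/n) * (sum of the x_i)
x_bar = sum(sales) / len(sales)

# statistics.mean computes the same quantity.
assert x_bar == statistics.mean(sales)
print(round(x_bar, 2))   # 73.33 printers
```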
Exercise 3.3. A basketball team has five players, with heights (in metres) 2.10, 1.88, 1.99, 2.05, 2.15.
Working to two decimal places, find the average height of the players on the team
(a) by hand, on paper;
(b) using Excel.

3.3. Median and Quartiles. The median of an ordered array (data set) is the middle value. If
there is an even number of values, we simply take the average of the two middle values. The median
is the 50th percentile: half of the observations lie to the left of it, and half to the right. It has the
properties:
• it is a robust measure of central tendency since it is not affected by extreme (“outlier”) values;
it is preferred to the mean if there are significant outliers or if the distribution is very skewed
(see §5.1 below)
• it may be used for ordinal data (in which values are ranked relative to each other but are not
measured according to a scale), e. g., numerical or ordinal categorical data
• in an ordered array of numbers, the median is the “middle” number:
– if n or N is odd, the median is the actual middle number;
– if n or N is even, the median is the average of the two middle numbers.
By dividing the ordered data into four quarters with equal numbers of data items, we get quartiles where
• Q1 is the value below which lies the lowest 25% in the ordered array,
• Q2 is the value that splits the data in half, i. e., the median (50% are below, 50% are above),
• Q3 is the division point between the top 25% and the rest of the ordered data.
The position of Qi is given by i(n + 1)/4 in the ordered data.
The value of Qi is the value of the item of ordered data at that position (interpolation may be required).
Note that to calculate the median or other quartiles, you must have ordered the array first in ascending
order.
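The position rule can be turned into a short function. The sketch below (Python, offered only to illustrate the rule) sorts the data, finds position i(n + 1)/4, and interpolates when that position falls between two items; note that Excel's quartile functions use a slightly different position rule, so their Q1 and Q3 can differ a little from this one.

```python
def quartile(data, i):
    """Q_i via the position rule i*(n + 1)/4 on the ordered array,
    interpolating linearly when the position is fractional."""
    xs = sorted(data)            # the array must be ordered first
    pos = i * (len(xs) + 1) / 4  # 1-based position in the ordered array
    lo = int(pos)                # whole part of the position
    frac = pos - lo              # fractional part (0 means an exact position)
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

texts = [16, 17, 11, 16, 13, 17, 21, 18, 12]   # hypothetical unordered data
print(quartile(texts, 1), quartile(texts, 2), quartile(texts, 3))
# 12.5 16 17.5
```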
Example 3.4. Let's look at a different example for the number of text messages sent by Jane over 9
working days: see Figure 3.1.
We see that the Q1 value for the number of texts sent by Jane is positioned half way between the second
and third data values in the ordered array.

Figure 3.1. Text messages sent by day. The upper part of the table shows the numbers
of texts sent per day. The lower part shows the data as an ordered array, that is, sorted
from smallest to largest. The notation (Qi ) at the top stands for the position of Qi once
the array has been ordered.

The Q1 value is $\frac{1}{2}(12 + 13) = 12.5$ texts.


This means that 25% of the time, staff member aa11 (Jane) sends 12.5 texts or fewer.
Confirm for yourselves that the median is 16 texts while the Q3 value is 19.5 texts. }

3.4. Mode. The mode is the value that occurs most often in a set of observations. It has the
properties:

• a measure of central tendency


• not affected by extreme values
• may be used for data at any level of measurement, even nominal
• there may be no mode
• there may be several modes

In the example above, we see that the mode for Jane is 16 texts: there were two days when she sent 16
texts.
Figure 3.2 shows two examples of how the mode of a distribution may behave:

• mode = 9 (left);
• the items 0, 1, 2, 3, 4, 5, 6 all occur with the same frequency (once) so there is no mode (right).

Figure 3.2. Two examples of how the mode of a distribution may behave: mode =
9 (left); no mode (right)
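In code, the properties above (no mode, several modes) need a little care. A sketch in Python, using the standard library's statistics.multimode, which returns every most-frequent value:

```python
from statistics import multimode

def modes(data):
    """All modes of the data; an empty list when every value occurs
    equally often (the "no mode" case)."""
    m = multimode(data)
    return [] if len(m) == len(set(data)) else m

print(modes([9, 9, 9, 4, 5]))        # [9]      one clear mode
print(modes([0, 1, 2, 3, 4, 5, 6]))  # []       all occur once: no mode
print(modes([1, 1, 2, 2, 3]))        # [1, 2]   two modes
```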

Exercise 3.5 (Measures of Central Tendency). For the following set of numbers:
16, 12, 21, 13, 17, 18, 16, 11, 17, 14, 19, 15
find the mean, median and mode. Do this (a) by hand; (b) (in tutorials) using Excel.

3.5. Outliers and measures of central tendency. Suppose we want to calculate the “average”
salary of ten people, and nine of them are between €25k and €30k, but the other is €180k. Then the
median will be between €25k and €30k but the mean will be between €40.5k and €45k. Here, the
median is a better reflection of the salary of a randomly sampled person than the mean; while we can
think of the median as “a typical sample”, we would be wrong to think of the mean as typical.

4. Measures of Variation / Dispersion

While the measures of central tendency tell us where the data is centred, they don’t tell us the full
story. We use Measures of Variation to bring the story a bit further: to quantify how much difference
or dispersion or spread (distance from the mean) there is in the data.
Did you ever notice how much difference there was in the heights of a group of primary school children,
even in the same class?
Important measures of variation include

• Range
• Interquartile Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
• Coefficient of Variation

4.1. Range. The range is the difference between the largest observation $x_{\max}$ and the smallest
observation $x_{\min}$:
$$\text{range} = x_{\max} - x_{\min}.$$
It has the properties:

• it is a measure of variation;
• it is sensitive to outliers as it only considers the most extreme (outlying) values;
• it doesn’t tell us anything about how the data are distributed.

The range for the number of texts sent by Jane in the example above is 22 − 11 = 11 texts.
Figure 3.3 shows two very different distributions with the same xmax and xmin, and so the same range.

Figure 3.3. Two very different distributions with the same xmax, xmin and range

4.2. Interquartile Range. The interquartile range (IQR) is the difference Q3 − Q1 between the
first and third quartiles Q1 and Q3. It gives an indication of how the data is distributed. It can also be
written as x0.75 − x0.25, that is, the difference between the 25th and 75th percentiles (note that x0.25 and
x0.75 are just alternative names for Q1 and Q3). It is also known as the midspread as it is the spread of the
middle 50% of the distribution. It has the properties:

• it is a measure of variation;
• it is not affected by extreme values (by only looking at the middle 50% we have eliminated any
extreme values: this reduces the impact of any outliers).

For example, suppose we have these data in an ordered array: 11, 12, 13, 16, 16, 17, 17, 18, 21. Then the
interquartile range Q3 − Q1 = x0.75 − x0.25 = 17.5 − 12.5 = 5.
In the earlier example, the IQR for the number of texts sent by Jane is 19.5 − 12.5 = 7 texts.
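A quick Python sketch of both measures for the ordered array of nine values above (illustration only; Excel works just as well):

```python
data = [11, 12, 13, 16, 16, 17, 17, 18, 21]   # an ordered array, n = 9

data_range = max(data) - min(data)   # x_max - x_min
# With n = 9, the position rule puts Q1 at position 2.5 and Q3 at 7.5:
q1 = (data[1] + data[2]) / 2         # halfway between 2nd and 3rd values
q3 = (data[6] + data[7]) / 2         # halfway between 7th and 8th values
iqr = q3 - q1

print(data_range, iqr)   # 10 5.0
```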

Definition 3.6. Tukey (1977) defines an outlier to be any observation outside the range
$$[Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1)], \quad \text{that is,} \quad [Q_1 - k \times \mathrm{IQR},\ Q_3 + k \times \mathrm{IQR}]$$
where k is a nonnegative constant.

Tukey proposed k = 1.5 to indicate an “outlier” and k = 3 to indicate “far out” data. We will use this
below.
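Definition 3.6 translates into a few lines of code. The sketch below (Python; the function name and the choice to pass Q1 and Q3 in directly are our own, for illustration) flags every observation outside Tukey's fences:

```python
def tukey_outliers(data, q1, q3, k=1.5):
    """Observations outside [Q1 - k*IQR, Q3 + k*IQR].
    k = 1.5 gives Tukey's "outliers"; k = 3 gives "far out" data."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# With Q1 = 12.5 and Q3 = 17.5 the k = 1.5 fences are [5.0, 25.0],
# so a stray value of 40 is flagged but the rest are not.
print(tukey_outliers([11, 12, 13, 16, 16, 17, 17, 18, 40], 12.5, 17.5))  # [40]
```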

4.3. Variance. The variance is an important measure of variation of a numerical attribute: it shows
the average squared variation about the mean. It is measured in units squared.
The variance quantifies how the individual values differ from the mean by taking the average of the
squared differences of the individual values from the mean. If we just looked at the differences, the positive
and negative differences would cancel each other out, telling us nothing about dispersion;
squaring avoids this. It is denoted by σ² (sigma squared) for the population, or s² for sample variance.
To compute it: for each possible value xi, calculate the difference between the mean µ and xi and square
that difference; then find the mean of these squared differences (in the case of sample variance, use a
factor of 1/(n − 1) rather than 1/n).
If all observed values are equally likely, then the variance can be written as

• Population variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \quad\text{where population mean } \mu = \frac{1}{N}\sum_{i=1}^{N} x_i;$$
• Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad\text{where sample mean } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Note the denominator n − 1 (one fewer than the number of data items in the sample) for sample variance
rather than the n you might expect based on the population variance formula. There is a technical
reason for this, which is outside our scope.²
You can use Excel or your calculator to calculate the variance rather than calculating it by hand. Use
Excel or your calculator to confirm that the variance for the texts sent by Jane is 14.44 texts squared.
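Both conventions can also be checked in code. A Python sketch using Jane's sales figures from Example 3.2 (the statistics module implements both: pvariance divides by N, variance by n − 1):

```python
import statistics

sales = [22, 112, 116, 86, 40, 64]   # Jane's monthly sales (Example 3.2)
n = len(sales)
x_bar = sum(sales) / n

# Population convention: divide by N; sample convention: divide by n - 1.
pop_var = sum((x - x_bar) ** 2 for x in sales) / n
sample_var = sum((x - x_bar) ** 2 for x in sales) / (n - 1)

# The library functions agree with the hand formulas.
assert abs(pop_var - statistics.pvariance(sales)) < 1e-6
assert abs(sample_var - statistics.variance(sales)) < 1e-6
```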

²If you are interested in why this is so: because $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the n numbers $x_i - \bar{x}$ (for i = 1, ..., n) are not
independent: instead of the n degrees of freedom one might expect, there are only n − 1 degrees of freedom, so n − 1 is used
in the denominator, which makes the sample variance an unbiased estimator. Taking the square root, as done in calculating
sample standard deviation below, introduces bias, so sometimes a denominator of n − 1.5 is used instead, or variations on
this which approximate for the distribution's excess kurtosis.

4.4. Standard Deviation. Variance is measured in units squared, but we would like to have a
measure of dispersion which is in the same units as the observations: to achieve this, we simply take the
square root of the variance to get the standard deviation.
The standard deviation of a numerical attribute of a dataset, denoted by σ (or s for a sample), is the square root of the
variance.

• it is the most important measure of variation


• it measures how far any given value in the population/sample might be expected to lie from
the mean
• it has the same units as the original data since it is $\sqrt{\text{variance}} = \sqrt{\text{units squared}}$

• Population standard deviation:
$$\sigma = \sqrt{\text{variance}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \quad\text{where pop. mean } \mu = \frac{1}{N}\sum_{i=1}^{N} x_i;$$
• Sample standard deviation:
$$s = \sqrt{\text{variance}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad\text{where sample mean } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

A small standard deviation says that the values tend to be close to the mean, while a large standard
deviation says that the values are spread out over a wider range.
Figure 3.4 shows how three different datasets can have the same mean but very different standard
deviations:

• the numbers in Dataset A = {11, 12, 13, 16, 16, 17, 18, 21} are dispersed across a large range:
mean x̄ = 15.5, standard deviation s = 3.338;
• the numbers in Dataset B = {14, 15, 15, 15, 16, 16, 16, 17} are dispersed across a much smaller
range: mean x̄ = 15.5, standard deviation s = 0.9258;
• Dataset C = {11, 11, 11, 12, 19, 20, 20, 20} has roughly the same range as A, but C’s data are at
extremes and we get mean x̄ = 15.5, standard deviation s = 4.57.

Figure 3.4. Three different datasets can have the same mean but different standard
deviations: Dataset A is dispersed across a large range; Dataset B is dispersed across a
smaller range; Dataset C is dispersed across roughly the same range as A, but the data
are concentrated at the extremes.
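The three datasets and the figures quoted above can be verified with the standard library (a Python sketch; statistics.stdev uses the sample formula with n − 1):

```python
from statistics import mean, stdev

A = [11, 12, 13, 16, 16, 17, 18, 21]
B = [14, 15, 15, 15, 16, 16, 16, 17]
C = [11, 11, 11, 12, 19, 20, 20, 20]

# Same mean, very different spreads.
for name, data in [("A", A), ("B", B), ("C", C)]:
    print(name, mean(data), round(stdev(data), 4))
# A 15.5 3.3381
# B 15.5 0.9258
# C 15.5 4.567
```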

Exercise 3.7. You can use Excel or your calculator to confirm that the standard deviation for the
number of texts sent by Jane is 3.8 texts.

4.5. Coefficient of Variation. The coefficient of variation is a relative measure expressed as a
percentage: it measures the scatter (sample std. dev.) relative to the mean:
$$CV = \frac{s}{\bar{x}} \times 100\%.$$

• measures relative variation
• always given as a percentage (%)
• shows variation relative to mean
• is used to compare two or more sets of data measured in different units.
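The unit-free property is easy to demonstrate in code: rescaling the data (say metres to centimetres) changes s and x̄ by the same factor, so the CV is unchanged. A Python sketch (illustration only):

```python
from statistics import mean, stdev

def cv(data):
    """Coefficient of variation: (s / x_bar) * 100, as a percentage."""
    return stdev(data) / mean(data) * 100

heights_m = [1.65, 1.53, 1.81, 1.73]        # heights in metres
heights_cm = [x * 100 for x in heights_m]   # the same heights in centimetres

# Different units, same relative variation.
print(round(cv(heights_m), 2), round(cv(heights_cm), 2))
```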

5. Other Measures of data

We will have more to say about distributions later but for now we will think of a data distribution as
plotting data items along the x-axis, then along the y-axis we plot the frequency (number of occurrences)
for each data item on the x-axis.
For discrete data (e. g., a finite data set) this is just a barchart (a discrete form of histogram). It is
useful to draw a histogram of a data set to visualise what kind of pattern the data has.
The mean tells us the “centre of gravity” of a distribution: if you cut this shape out of cardboard and
placed your finger at the mean, it would balance.
The standard deviation (and other dispersion measures) tell us how wide or how narrow is the part of
the distribution with the bulk of the weight.
We saw already that we can use Excel to draw graphs such as histograms. We will return to different
types of distributions or patterns in the chapters on probability distributions.

5.1. Shape of a distribution: Skewness. The shape of a distribution can also be measured to
see how symmetrical (or not) it is. We mention briefly the shape of a distribution and how it is related
to the positioning of the mean, median and mode.
Pearson’s measure of skewness is:
mean mode x̄ mode
= .
stdev s
An alternative, if there is no mode or many modes, is
mean median x̄ median
= .
stdev s
Both of these are dimensionless as the numerator and denominator have the same units.
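Both versions are one-liners in code. A Python sketch of the median-based alternative (the example data here is made up purely to show the sign convention):

```python
from statistics import mean, median, stdev

def pearson_skew(data):
    """Pearson's median-based skewness: (mean - median) / s.
    Positive for right skew, negative for left skew, near 0 if symmetric."""
    return (mean(data) - median(data)) / stdev(data)

right_tailed = [1, 2, 2, 3, 3, 4, 9]      # long tail to the right
left_tailed = [-x for x in right_tailed]  # its mirror image

print(pearson_skew(right_tailed) > 0)   # True
print(pearson_skew(left_tailed) < 0)    # True
```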

Figure 3.5. Negative- or left-skewed distribution (left of diagram) and
positive- or right-skewed distribution (right of diagram) (after Rodolfo Hermans,
https://commons.wikimedia.org/w/index.php?curid=4567445)

The right tail of a distribution may taper differently from the left tail. If so, we say the distribution is
skewed:
• left skew (or negative skew): the left tail is longer and the weight of the distribution is concen-
trated to the right;
• right skew (or positive skew): the right tail is longer and the weight of the distribution is
concentrated to the left.
Figure 3.6 shows how the mean, median and mode are positioned relative to each other in a left-skewed
(left), symmetric (centre) or right-skewed (right) distribution.

Figure 3.6. How the mean, median and mode are positioned relative to each other in a
left-skewed (left), symmetric (centre) or right-skewed (right) distribution

For a right-skewed distribution (as shown in Figure 3.7) the mode (most frequent value) is to the left of
the median (value with equal area each side), which is to the left of the mean (centre of gravity). This
order is reversed for a left-skewed distribution.

Figure 3.7. Comparison of the mean, median and mode of a right-skewed distribution
(after Cmglee, https://commons.wikimedia.org/w/index.php?curid=38969094)

6. Box(-and-whiskers) plot

A boxplot is a graphical way to show the centrality, dispersion and skewness of a numerical data set
using its extremes and quartiles. It can be displayed either horizontally or vertically. The spacings in
each subsection of the boxplot show the dispersion and skewness of the data.
The five numbers xmin, Q1, Q2 = median, Q3, xmax can be plotted as a boxplot or box-and-whiskers plot
to give an indication of the overall range of the data (whiskers: the extreme points xmin and xmax) as well
as the midspread (the box part in the middle, showing Q1, Q2 and Q3; its width is the IQR = Q3 − Q1).
Figure 3.8 shows example horizontal box-and-whiskers plots.
Figure 3.9 shows five vertical box-and-whiskers plots for the five experiments carried out by Michelson
(1882) which showed that the speed of light is the same in all directions: this experimentally supported
Einstein’s special theory of relativity.
We have defined the whiskers as extending to xmin and xmax ; but sometimes, as in Figure 3.9, the
whiskers only extend to the furthest observation within a certain distance of Q1 or Q3 . For example


Figure 3.8. Example horizontal box-and-whiskers plots for three artificial data sets.
Data set 1 is left-skewed (tail to the left), Data set 2 is symmetric and Data set 3 is
right-skewed (tail to the right)

Figure 3.9. Example vertical box-and-whiskers plots for the five experiments carried
out by Michelson (1882) which showed that the speed of light is the same in all directions
and so experimentally supported Einstein’s special theory of relativity

(as in Definition 3.6), we might draw the lower whisker to the smallest observation within 1.5 × IQR of
Q1 (and similarly the upper whisker to the biggest observation within 1.5 × IQR of Q3). Then all other
observations outside the whiskers are plotted as outliers, e. g., as a dot or small circle (as in Figure 3.9),
etc. Or the whiskers can stand for something else about the data set, e. g.,

• one standard deviation above and below the mean;


• the 9th and 91st percentiles;
• the 2nd and 98th percentiles.

A boxplot can be drawn either horizontally (as in Figure 3.8) or (as in Figure 3.9) vertically.

Exercise 3.8. Draw a box-and-whiskers plot for the number of texts sent by Jane.

7. Correlation

A fundamental technique of data analysis and analytics is to investigate the relationship between a pair
of variables. In some cases, it is obvious that there is a relationship. For example, we can say with
confidence that there is a positive relationship between a person’s salary and the price they pay for a
new car. This information is probably useful to e. g., a car dealer.
In other cases it is not so obvious. What, if any, is the relationship between a person’s salary and the
amount they spend on international phone calls per week? This information would probably be useful
to a phone company. If it is not obvious, we need a way to measure it based on data.
Correlation is a simple method of measuring whether two variables are related and, if so, what type of
relationship and how strong it is.
We say that two variables are correlated if, when one goes up, so does the other, or when one goes up,
the other goes down — either is equally useful to know about.
In order to carry out a correlation, we need paired data — that is, we have multiple observations, and
each observation is a pair — one value for each variable. An example is the height and weight of a
sample of adults. For each adult, we have one value of each variable. We will usually see our data set in
a table, as in Table 3.2. Plotted on paper it would look like Figure 3.10.

Table 3.2. Height and weight of a sample of adults

height (cm)   weight (kg)
168           60
170           62
155           55
183           70
172           68

7.1. Scatter Plots. A good place to start is a scatter plot. We plot the two variables x and y as
a scatter plot.
A scatter plot has two axes, one for each variable x, y, and each observation is plotted as a point — no
lines or bars. In Figure 3.10, each point represents an observation (one person).
A simple visual observation of the scatter plot is already enough to provide some information with respect
to how the two variables are related.
Exercise 3.9. What can you conclude from Figure 3.10? Is there a relationship between the two
variables? When one variable goes up, does the other one tend to go up, go down, or stay the same?

7.2. Linear Correlation. In practice, we do not always want to go to the trouble of plotting the
data and then judging the relationship visually. Sometimes it is useful to have a simple automated
formula to describe the relationship.
A big idea in modelling is to start simple and then enrich the model. Probably the simplest model we
could try to use to explain a scatter plot would be a linear model: in other words, is there a straight
line relationship that seems to “fit the plot reasonably well”?³
Recall that the equation of a line would be y = mx + c where m is the slope of the line and c is the
y-intercept (the height at which the line cuts the y-axis).
Figure 3.11 shows three scatter plots: for the left-hand and centre plots, there is an “obvious” fit to a
straight line but for the right-hand plot, there is no “obvious” straight line fit.
Other types of model could include e. g., quadratic models of the form y = ax2 +bx+c or other nonlinear
models, but we will not consider these in this course.
³This is a huge area and even to explain the term “fit the plot reasonably well” is a major topic. If interested, try
searching for the terms Statistical Learning and/or Machine Learning, especially the term Regression.

Figure 3.10. Scatter plot showing height and weight of the sample of adults.

Figure 3.11. Scatter plots: an “obvious” straight line fit (left); another “obvious”
straight line fit (centre); and no “obvious” straight line fit (right).

Question: Could we possibly work out (from a formula) if there is a linear relationship in a scatter plot
and — if there is — how strong is the relationship?
Answer: Yes: the formula is called the linear correlation between x and y.
Definition 3.10. The linear correlation between two variables x and y is called r. It is calculated as
follows, for a sample of n observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
$$r = \frac{n \cdot \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i}{\sqrt{n \cdot \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\ \sqrt{n \cdot \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$
where $\sum_{i=1}^{n} x_i$ is the sum of the x values for all observations, i = 1, ..., n, etc.

This is called Pearson’s correlation coefficient, and measures the linear correlation between the variables
x and y.
This correlation metric is symmetric: it does not matter which variable is called x (and put on the x-axis
in a scatter plot) and which is y. The value of r for the relationship between x and y is equal to the
value of r for the relationship between y and x.
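Definition 3.10 can be implemented directly from the sums. The sketch below (Python, as an illustration; you can use it to check your answer to Exercise 3.12) also demonstrates the r = 1 and r = −1 extremes on perfectly linear made-up data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient from Definition 3.10."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return num / den

# Perfectly linear data: r = 1 uphill, r = -1 downhill.
assert abs(pearson_r([1, 2, 3], [2, 4, 6]) - 1) < 1e-9
assert abs(pearson_r([1, 2, 3], [6, 4, 2]) + 1) < 1e-9

# Symmetry: r is the same whichever variable we call x.
h = [168, 170, 155, 183, 172]   # heights from Table 3.2
w = [60, 62, 55, 70, 68]        # weights from Table 3.2
assert abs(pearson_r(h, w) - pearson_r(w, h)) < 1e-9
```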
Comment 3.11. There are several different equivalent formulae for r (or, more correctly, rxy). We
mention some here in case you are interested, or in case you are confused as to why different books have
(apparently) different formulae.

In the following, let:

• n be the sample size;


• $x_i$, $y_i$ be the individual sample points indexed by i;
• $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ be the sample mean of the $x_i$, and analogously for $\bar{y}$;
• $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ be the sample standard deviation of the $x_i$, and analogously for $s_y$;
• all the sums run from $i = 1$ to $n$, i.e. $\sum_{i=1}^{n}$, to reduce clutter in the formulae.

Given a paired data sample of n pairs, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, our original formula for $r = r_{xy}$ is as
defined earlier:
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\ \sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}.$$

Rearranging this gives us another formula for r:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\ \sqrt{\sum (y_i - \bar{y})^2}}.$$

Rearranging again gives us yet another formula for r:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum x_i^2 - n\bar{x}^2}\ \sqrt{\sum y_i^2 - n\bar{y}^2}}.$$

Yet another equivalent expression for r is as follows:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right).$$
Note that $\frac{x_i - \bar{x}}{s_x}$ is called the standard score of $x_i$ (and analogously for the standard score of $y_i$).

Finally, another alternative formula for r, given in (Lind et al., 2012, (13-1), p. 372), is:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n-1)s_x s_y}$$
where n, xi , yi , x̄, ȳ, sx and sy are defined as above.
The crucial point is that whichever formula you use, they all give the same number r when calculated.
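The crucial point can also be checked numerically: a Python sketch implementing three of the formulae above and confirming they agree on the Table 3.2 data:

```python
from math import sqrt
from statistics import mean, stdev

def r_definition(x, y):
    """The original sums-based formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = (sqrt(n * sum(a * a for a in x) - sx ** 2)
           * sqrt(n * sum(b * b for b in y) - sy ** 2))
    return num / den

def r_deviations(x, y):
    """The deviations-from-the-mean formula."""
    xb, yb = mean(x), mean(y)
    num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    den = (sqrt(sum((a - xb) ** 2 for a in x))
           * sqrt(sum((b - yb) ** 2 for b in y)))
    return num / den

def r_standard_scores(x, y):
    """The average product of standard scores, with an n - 1 denominator."""
    n = len(x)
    xb, yb, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum(((a - xb) / sx) * ((b - yb) / sy) for a, b in zip(x, y)) / (n - 1)

x = [168, 170, 155, 183, 172]   # heights from Table 3.2
y = [60, 62, 55, 70, 68]        # weights from Table 3.2

# All three agree up to floating-point rounding.
assert abs(r_definition(x, y) - r_deviations(x, y)) < 1e-9
assert abs(r_deviations(x, y) - r_standard_scores(x, y)) < 1e-9
```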

7.2.1. Values that r can take. The formula returns a value in the range [−1, 1]. A value of 1 or −1
indicates a perfect linear relationship between the two variables: in a scatter plot, one could draw a
straight line that would pass through every single data point.
In general, positive values for r indicate that when the value of a variable goes up, the value of the other
variable also goes up. Negative values for r indicate that when one goes up, the other goes down. If
there is absolutely no relationship between the two variables, the value for r will be 0. Returning to
Figure 3.11, we see that

• the left-hand plot shows a perfect positive linear correlation, r = 1;
• the centre plot shows a perfect negative linear correlation, r = −1; and
• the right-hand plot shows a nonexistent linear correlation, r = 0.

Correlation is not an all-or-nothing measure: two variables can also be imperfectly correlated. In many
real-world datasets, there is a clear relationship, but unlike the r = 1 and r = −1 cases above, not a
perfect one. In those cases, r takes on intermediate values; not zero, but not 1 or −1 either:

If r is in the range . . .   then we say there is a . . .

+.70 or higher               very strong positive relationship
+.40 to +.69                 strong positive relationship
+.30 to +.39                 moderate positive relationship
+.20 to +.29                 weak positive relationship
+.01 to +.19                 no or negligible relationship
close to 0                   no relationship
−.01 to −.19                 no or negligible relationship
−.20 to −.29                 weak negative relationship
−.30 to −.39                 moderate negative relationship
−.40 to −.69                 strong negative relationship
−.70 or lower                very strong negative relationship
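The table above amounts to a simple lookup on the magnitude and sign of r. As an illustration (a sketch only, not part of any standard library), it can be written as a small Python function:

```python
def interpret_r(r):
    """Verbal label for a correlation coefficient r, per the table above."""
    strength = abs(r)
    if strength >= 0.70:
        label = "very strong"
    elif strength >= 0.40:
        label = "strong"
    elif strength >= 0.30:
        label = "moderate"
    elif strength >= 0.20:
        label = "weak"
    else:
        # covers +.01 to +.19, close to 0, and -.01 to -.19
        return "no or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} relationship"

print(interpret_r(0.85))   # very strong positive relationship
print(interpret_r(-0.25))  # weak negative relationship
```

Remember that these cut-offs are conventional descriptive labels, not hard statistical boundaries.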

Exercise 3.12. Calculate the (linear) correlation coefficient between height and weight, using the data
given in Table 3.2.

7.3. Misleading Correlations. When carrying out correlation, we may find that there is a rela-
tionship between y and x. We should not assume that relationship to be causal; that is, we should not
assume that a change in x actually causes the change in y. It might be the other way around. Or there
might be some other variable z which we have not considered, which causes related changes in both x
and y. Either would lead to a correlation between x and y. Remember the following slogan:

Correlation is not causation.

Example 3.13. Suppose we collect data on all the accidental fires in Dublin for the last ten years. We
correlate the number of fire engines at each fire (x) and the eventual damages in euro (y) at each fire.
We will probably find that there is a strong positive correlation. Can we conclude that the presence of
more fire engines causes more costly damage? }

Example 3.14. Suppose we collect data from 100 randomly-chosen adults: what is the value x of their
car, and what is their income y. (Anyone who does not own a car is omitted from the survey). We will
find a strong positive correlation. Does this suggest that if I buy a more expensive car, my salary will
go up? Why or why not? }

Example 3.15. At a particular beach, for each day over a period of time we measure the number of
shark attacks versus ice cream sales. We find a positive correlation. Does this mean ice cream at beaches
should be banned so as to reduce shark attacks?
In fact, there is a third confounding variable: higher temperatures cause more people to go to the beach,
and so on days on which the temperature is higher we get higher sales of ice cream and more shark
attacks. }

Example 3.16. Suppose that for many cities we collect two pieces of data: the number of buses in
the city, and the number of restaurants in the city. What relationship will we find? Do buses cause
restaurants? Can you suggest a third variable z which is involved here? }

A weak or zero linear correlation between two variables implies that there is a weak or nonexistent linear
relationship between them. That does not imply that there is no other relationship between them. That
is because linear correlation only measures the strength of any possible linear relationship. There might
be a strong nonlinear relationship which nevertheless results in a zero correlation. Look at Figure 3.12.
The bottom row shows scatter plots of several datasets where r = 0, despite the fact that there is a clear
relationship between the variables.
Note that it is not always possible to calculate the correlation between two variables. In the centre of
Figure 3.12, different values of x all correspond to the same value of y. In this case, the value of r is
undefined: we cannot measure the variation of y when the value of x changes.
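A concrete case of a strong nonlinear relationship with zero linear correlation is \(y = x^2\) over a range symmetric about zero. This short Python sketch (illustrative data only) demonstrates it:

```python
import math

# y = x^2 over a symmetric range: a perfect (nonlinear) relationship,
# yet the linear correlation coefficient is exactly 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
xbar = sum(x) / n  # 0 by symmetry
ybar = sum(y) / n
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
den = (math.sqrt(sum((xi - xbar) ** 2 for xi in x))
       * math.sqrt(sum((yi - ybar) ** 2 for yi in y)))
r = num / den
print(r)  # 0.0
```

Every positive deviation of x is exactly cancelled by a matching negative one, so the sum of cross-products in the numerator vanishes even though y is completely determined by x.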

Figure 3.12. More examples of correlation. The numbers given are the correlation
coefficients of the x-y datasets shown as scatter plots. Note that several very strong x-y
relationships give a zero correlation. Note that correlation is undefined in the case where
there is no variation in y, i.e. y is constant (same for x). (Image public domain, from
Wikipedia.)

Exercise 3.17. Think of it the other way around: given that correlation is a symmetric measure, how
can we measure the corresponding change in the value of x when y changes, if the value of y never
changes?

Some very different data sets can have the same correlation values, as in Figure 3.13, based on Anscombe
(1973). That is why it is always useful to plot the data and look at them.
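The quartet is easy to check for yourself. This Python sketch uses the standard published Anscombe (1973) data and computes r for each of the four x-y pairs; despite the very different scatter plots, the correlations are essentially identical:

```python
import math

def pearson(x, y):
    """Linear correlation coefficient of two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x)
                    * sum((b - ybar) ** 2 for b in y))
    return num / den

# Anscombe's quartet (standard published values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

rs = [pearson(x, y) for x, y in
      [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]
print([round(r, 3) for r in rs])  # each is approximately 0.816
```

The summary statistics agree almost exactly, yet a scatter plot of each pair reveals four completely different stories: hence the advice to always plot the data.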

7.4. Correlation and Coincidence. Correlation is a statistical relationship between two variables.
It involves multiple observations. Correlation is not the same as coincidence. A coincidence is a single
paired observation (e. g., one particular person has a large height and a large weight). It does not tell
us anything about the relationship in general. Above, we saw that correlation is not causation. Now,
we have a second slogan:

Coincidence is not correlation.

(Or: “one swallow does not make a summer”.)


Exercise 3.18. Your friend Bob is the tallest person you know, and he got the best marks in his class
in first year exams. What do you conclude from this data and why?
Exercise 3.19. 5G phone masts arrived in late 2019, around the same time as Covid. Is this evidence
that 5G electromagnetic waves cause Covid?
Professor John Cook of George Mason University points out that the character Baby Yoda also arrived
in late 2019. Is it possible that Baby Yoda (Figure 3.14) is the real cause of Covid?
How would you respond to someone who put forward either of these theories?4

In summary:
4www.theguardian.com/society/2020/nov/29/how-to-deal-with-a-conspiracy-theorist-5g-covid-plandemic-qanon.
This article has other interesting points, e. g., how would you respond to someone who says “You can prove anything with
data, it’s all lies, damned lies and statistics.”

Figure 3.13. Anscombe's Quartet: four data sets with identical statistical values (mean
and variance of x and of y, x-y correlation, and more), but very different properties.
From Wikipedia: “Anscombe’s quartet 3” by Schutz, derivative work: Avenue (talk) -
Anscombe.svg. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://en.wikipedia.org/wiki/Anscombe’s_quartet

Figure 3.14. Who is to blame for Covid: Baby Yoda (left) or 5G masts (right)?

• the coefficient of correlation measures the strength of the linear relationship between two vari-
ables;
• We cannot infer a causal relationship on the basis of a high correlation value.

8. Other considerations in data analysis

Here are some final points to be aware of (compare Tufte’s comments on data graphics).
• Data analysis is objective:
– we should report the summary measures that best meet the assumptions about the data
set;
• Data interpretation is subjective:
– it should be done in a fair, neutral and clear manner;
• Ethical considerations:
– Tell the truth about the data,
⇤ document both good and bad results;
⇤ do not use inappropriate summary measures to distort facts;

– Be honest about the methodology;


– Do not ‘cook the data’.

9. Chapter Summary

In this chapter we have considered:


• measures of central tendency of a data set (mean, median & quartiles, mode);
• measures of dispersion/spread of a data set (range, IQR, variance, standard deviation);
• shape and skewness;
• scatter plots and linear correlation of two variables;
• ethical and other considerations in data analysis.
You can see that we need to use several descriptive statistics to get a better picture of the story told
by any given set of data. Anscombe’s quartet shows us that very different datasets can have the same
summary statistics, so you need to be careful not to over-rely on summary statistics. A histogram or
other views of the shape of a distribution will give us a better picture.
You should attempt the tutorial and other exercises online, first using your calculator and then using
Excel, to practice calculating these measures.
It is really important to interpret your results — but do this fairly.
Complete the online exercises to check that you now know how to gather data by using surveys or
sourcing data from internal IT systems or from secondary data sources.
You should also be able to create graphs (e. g., histograms) using Excel and be able to calculate and
interpret descriptive statistical measures.

You might also like