
Understanding Variables

Hi everyone, my name is Dash, and I will be taking you through this unit.
Overview
• Types of variables
  • Definition of variables
  • Independent and dependent variables
  • Categorical and Numerical variables
• Summary statistics
  • Mean and standard deviation
  • Median, quartiles and interquartile range
  • Mode

These are the topics and sub-topics that we will go through in this unit.

2
Understanding Variables
- Types of Variables

3
What is a variable?
• A variable is an attribute that can be measured or labelled.

• A data set consists of individuals and variables pertaining to the individuals.

Examples of variables include temperature, blood type, race and weight, to name a few. We note that
“individuals” in this case can refer to either people or objects.
Independent and dependent variables
• In research questions that examine relationships between variables, there are
typically 2 sets of variables, namely independent variables and dependent variables.

• An independent variable is a variable that may be subject to manipulation (either
deliberately or spontaneously) in a study.

• A dependent variable is a variable which is hypothesised to change depending on how
the independent variable is manipulated in a study.

For the independent variable, note that “deliberately” means that the researcher is
directly involved in the manipulation of the variable, whereas “spontaneously” refers to the
manipulation of the variable happening outside of the researcher’s control. A dependent variable is a
variable which is hypothesised to change depending on how the independent variable is
manipulated in a study.
Examples
Research question: Do NUS students who make notes using pen and paper score better in GEA1000 than those who use laptops?
Independent variable: Method of note taking for GEA1000
Dependent variable: GEA1000 grade

Research question: Does amount of caffeine consumed per day affect the quality of sleep amongst Singaporean adults?
Independent variable: Amount of caffeine consumed per day
Dependent variable: Quality of sleep

Here are two examples.


Types of Variables
There are two main types of variables: numerical and categorical.
Categorical variables take category or label values. Each observation can be placed in
only one label, and the labels are mutually exclusive (i.e. no two labels overlap with each
other).
◦ Example : Smoking status can be a categorical variable, with two groups (smoker or
non-smoker). Education Level is another example with multiple labels.

Numerical variables take numerical values for which arithmetic operations such as
adding and averaging make sense.
◦ Age, measured in years for example, is a numerical variable. Mass (kg) and height
(m) are also numerical variables.

A point to add here is that the type of variable has nothing to do with whether it should be an
independent or dependent variable. In other words, it’s perfectly possible that independent
variables can be either numerical or categorical, and dependent variables likewise can be either
numerical or categorical.

7
Categorical variables : Ordinal vs Nominal
Ordinal
• Sometimes categories come with some natural ordering and numbers are often used to represent the ordering. We call such variables ordinal.
• For example, a happiness index can be rated from 0-10 in order of increasing happiness.
• Does this make happiness a numerical variable?

Nominal
• In other cases where there is no intrinsic ordering for the variables, we will refer to these variables as nominal.
• For example, if one were trying to collect basic information on a sample of birds, the eye colour (Blue / Brown) can be considered a nominal variable since there is no intrinsic ordering.

While categorical variables may sometimes come with a natural ordering, you must be careful that
an ordering doesn’t mean you can do arithmetic on the data. In the case of happiness index, we
cannot assume that the difference between a happiness score of 5 and 6 is the same as the
difference between 6 and 7. Hence it is very important to know that for such data, doing arithmetic
like calculating an average is not advisable. So if any comparison between different levels of
happiness is needed, one solution can be to calculate what percentage of the respondents have
responded in each of the different levels.
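To make the percentage idea concrete, here is a small Python sketch. The ratings list and the helper name `percentage_by_level` are made up for illustration and do not come from any real survey.

```python
from collections import Counter

# Hypothetical happiness ratings (0-10) from 10 respondents; not real survey data.
ratings = [7, 5, 7, 8, 5, 7, 10, 3, 8, 7]


def percentage_by_level(values):
    """Return {level: percentage of respondents who chose that level}."""
    counts = Counter(values)
    total = len(values)
    return {level: 100 * count / total for level, count in sorted(counts.items())}


percentages = percentage_by_level(ratings)
for level, pct in percentages.items():
    print(f"Level {level}: {pct:.0f}%")
```

Notice that we only count how often each label occurs. We never add or average the labels themselves, so the approach is safe for ordinal data.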
Numerical variables : Discrete vs continuous
Discrete variable: one where the possible values of the variable form a set of numbers with “gaps”.
Example: Population count

Continuous variable: one that can take on all possible numerical values in a given range or interval.
Examples: Time, length

Examples of discrete variables can be number of pets in a household, number of children in a family.
Notice that the possible values are numbers like {0, 1, 2, 3…} with “gaps” whereby numbers like 1.5
do not make sense when talking about number of pets in a household or number of children in a
family. Any numerical variable that can only take a finite number of possible values is automatically a
discrete variable.

We would like to add a disclaimer that the definition of discrete variables that we have given is
actually a little different from the formal definition but for most scenarios, the definition we give
works just as well as the formal one and we hope it gives you more intuition behind the nature of a
discrete variable.

For continuous variables, let’s take time as an example. We can see that if one were to give a range
of time from 0 to 5 seconds, all possible values in between 0 and 5, can have a meaningful
interpretation of time which includes values which are beyond the measuring capability of common
everyday instruments.

9
Putting it all together
Types of Variables

Categorical (measured in categories)
• Nominal: named category with no definite order. Example: Eye Colour
• Ordinal: values can be ordered or ranked, where distances between values are incomparable. Example: Happiness Level

Numerical (measured in numerical values)
• Discrete: one where possible values of the variable form a set of numbers with “gaps”. Example: Module Credits (MCs)
• Continuous: can take any numeric value on a given range. Example: Time
10

As a summary, this is a broad classification of variables.


Summary
• Types of variables
• Definition of variables
• Independent and
dependent variables
• Categorical
• Ordinal and Nominal variables
• Numerical variables
• Discrete and Continuous variables

11

Just a gentle reminder again that for ordinal variables, do remember that just because we may use
numbers to represent the labels, it doesn’t mean we can start doing arithmetic on the variable.

11
Understanding Variables
- Summary Statistics Part 1

12
Presentation of data

[Figure: excerpt of the data set, with one row and one column labelled]
13

This is part of a real data set with regard to COVID patients in Singapore. Data is typically presented
in the form of rows and columns where each row is information pertaining to an individual (in this
case, a patient) and each column is a variable. For example, case number 5 is a female, Chinese
person who is 56 years old and it has taken 21 days for her to recover. Looking at her education
level, she has a diploma. You can take some time to classify whether each variable in this data-set is
numerical or categorical.
The macro and the micro

What do I want to do with my data?

Micro: Get information on particular individual(s)
• Go to the data set and extract the information for the particular individual(s)

Macro: Get information on groups/population
• Data visualisation
• Summary statistics
14

Raw data is useful if we want to get information on particular individuals. But most of
the time, we also wish to get information on groups of individuals. Looking at raw data, it’s
generally very difficult to spot patterns and trends within a group or across groups. Visualisations are
a fantastic tool that serves to bring forth patterns which allow us to describe groups of individuals and
explore relationships between variables on a macro-level.

But one cannot perform calculations on visualisations so we would also need some ways that can
summarize information about groups in a data set using quantities which are commonly called
“Summary Statistics”.
Summary statistics
Summary statistics for numerical variables

Measures of central tendencies
● Mean
● Median
● Mode

Measures of dispersion
● Standard deviation
● Interquartile range

15

Summary statistics for numerical data focus on 2 main features of the data, namely the
central tendency of the data and the dispersion of the data points. We can intuitively think of
dispersion as the spread of data-points.

The central tendency (or measure of central tendency) is a central or typical value for a probability
distribution. This may be a slightly technical overdose at this juncture but for the time being we can
think of central tendency as the “center” of a collection of values for a numerical variable. The
formal definition will make more sense once we have studied distributions and probability in the
later chapters.

There are many ways of quantifying the center and dispersion of numerical variables. The 3 most
common measures of central tendency are the mean, median and mode. For the mean and the median,
we have corresponding measures of dispersion, which are the standard deviation and
interquartile range respectively. We will first establish basic theoretical properties of these
aggregations and look at these in the context of real data.
Mean and standard deviation
16
Mean

The mean of a numerical variable $x$, denoted by $\bar{x}$ (read as “x bar”), is given by the formula

$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$, which is also written in sigma notation as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$,

where $n$ denotes the number of data-points and $x_1$ to $x_n$ denote the values of the numerical variable $x$ in
the data set.

EXCEL COMMAND : “=AVERAGE”
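For those who prefer code to spreadsheets, the same formula can be sketched in Python. The values in `days_to_recover` are made up for illustration and are not taken from the data set shown in this unit.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical days-to-recover values, for illustration only.
days_to_recover = [21, 14, 17, 20, 13]
x_bar = mean(days_to_recover)
print(x_bar)
```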

17

The mean is the good old average value that we have encountered since time immemorial. The
formula for mean written in the sigma notation is just a more compact way of expressing the
formula and it’s not a must to use it.

For the data set above we can ask what is the mean number of Days to Recover for males and females.

We will leave it to you to do the calculation if you wish to.

Mean for males = 19.3

Mean for females = 15.8


Properties of Mean
• $x_1 + x_2 + \dots + x_n = n\bar{x}$. (This follows from the formula for mean in the previous slide)

• Adding a constant value to all the data points (be it positive or negative) changes the
mean by that constant value.

• Multiplying all the data points by a constant number $c$ will result in
the mean also being multiplied by $c$.

18

From the first point, we can get one interpretation of the mean as the “fair share” value if the total
was equally distributed amongst all individuals. The second and third points are basically explaining
how the mean behaves under addition and multiplication by constants.
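The three properties can be checked numerically with a short Python sketch. The data values and the constants -3 and 4 are arbitrary choices made for illustration.

```python
from statistics import mean

data = [2, 4, 9]  # hypothetical values
n = len(data)
x_bar = mean(data)

# Property 1: the total equals n times the mean.
total_matches = sum(data) == n * x_bar

# Property 2: adding a constant (here -3) to every point shifts the mean by that constant.
shifted_mean = mean([x - 3 for x in data])

# Property 3: multiplying every point by a constant (here 4) multiplies the mean by it.
scaled_mean = mean([4 * x for x in data])

print(x_bar, shifted_mean, scaled_mean, total_matches)
```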
Means in real-life scenarios

19
Whether the weather be fine
• Which were the hottest and coolest months?

• Which were the wettest months? How much rain do we typically get in a month?

• Is there any relationship between wind-speed, temperature and rainfall?

• Does the weather pattern for 2020 serve as a good prediction for how the weather will be in 2021?
20

Meteorological stations keep records of the daily weather. Common weather phenomena which are
tracked on a day-to-day basis include:

• Temperature (which consists of maximum, minimum and average temperature)
• Amount of rainfall
• Wind speed

The data-set shown above is taken from Changi Meteorological station for the year 2020.

Even a relatively straight-forward scenario such as this can generate lots of questions to which we
can try to get the answer by analysing the data. Some of these questions are more straight-forward
than others. Let’s focus on rainfall and see which were the wettest months for 2020.
Aggregating the data
After aggregating the data, we can see that May and December were the wettest months.

We can calculate the average rainfall per month, which works out to be 157.22mm. But what does
this average tell us?

21

You can see that it’s much easier to read a total of just 12 rows as compared to more than 300 rows
if we were still working with the original dataset. Furthermore, it becomes easy to calculate the
average rainfall per month which works out to 157.22mm. But let’s take a look at what the average
can and cannot tell us and to aid us in our understanding, we’ll do a simple data visualisation to
compare the rainfall across different months.
What the mean can tell us about rainfall

The bar graph shows the total rainfall for each month across the year 2020. The average rainfall
per month in 2020 is 157.22mm.

We can add the total rainfall of each month to get the total rainfall for the entire year but knowing the
mean gives us an easier way to get this quantity.

22

1) Since we know $x_1 + x_2 + \dots + x_n = n\bar{x}$, the mean can give us an easier way to calculate the total
rainfall for the year if we are interested in that quantity. The total is 12 × 157.22mm ≈ 1886.6mm.

More generally, in scenarios where you’re only interested in the total of a numerical variable across
data points, then knowing the mean and the number of data points is enough and you do not need
any further information on how the total is distributed.
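As a quick check of this arithmetic, the total can be recovered from the mean and the number of months alone. The two inputs are the figures quoted above; the variable names are our own.

```python
n_months = 12
mean_monthly_rainfall_mm = 157.22  # average rainfall per month, as quoted above

# Total = number of data points x mean; no monthly breakdown is needed.
total_rainfall_mm = n_months * mean_monthly_rainfall_mm
print(f"{total_rainfall_mm:.1f} mm")
```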

2) Based on the mean of 157.2mm, it is not possible that for every month the total rainfall was
below 157.2mm. (Pause for a moment to think why this is true before moving on). This is because if
every month had a total rainfall less than 157.2mm, then the mean itself would also be less than
157.2mm, which is not possible. A similar argument also shows that we cannot have every month
having rainfall more than 157.2mm.
What the mean can’t tell us about rainfall

23

Although the mean is the most commonly used measure of central tendency, there are plenty of
occasions where it also gets abused through mis-interpretation. Here we make it explicit with
regards to what we cannot do if we only have the mean.

Knowing the mean doesn’t tell us anything much about how the total rainfall is distributed across
the year. You can imagine there can be plenty of ways of distributing an annual rainfall of 1886.6mm
across 12 months. Indeed, for regions that have specific monsoon seasons, the bulk of the annual
rainfall is concentrated during the monsoon period.

Knowing the mean does not imply that 50% of the months have rainfall of at least 157.2mm. Indeed
looking at the values, we see that only 5 out of 12 months had a total rainfall that exceeded
157.2mm.

It doesn’t tell us anything about how many days it rained across the year. If we’re interested in the
number of days where there was rainfall (regardless of the amount), knowing the mean doesn’t tell
us anything much about this count.

More generally, the mean doesn’t give much information about the frequency of occurrence of the
values of the numerical variable. As the joke goes "There was a statistician that drowned crossing a
river... It was 3 feet deep on average." Clearly the mean doesn’t tell us as much as we often like to
think it does.
Overall mean vs means in subgroups
The bar graph shows the
average performance of
students when categorized by
school. The maximum score
attainable is 60.

Number of students
School A 349
School B 46
Total 395

24

Looking at the bar graph above, one may be tempted to say that the overall mean for the math score
across all students is the sum of the subgroup means divided by 2. However, this assumes that there
is an equal number of students in each school which is not the case as shown in the table. In such
scenarios for us to be able to calculate the overall mean, we need to account for the proportion of
students in each school.

Therefore, the overall mean in this case is $\frac{349}{395} \times 32.21 + \frac{46}{395} \times 30.72 \approx 32.03$. Such a process of
computing the overall mean is known as taking a weighted average and the numbers $\frac{349}{395}$ and $\frac{46}{395}$ are the
weights of the subgroups. This is an efficient way to compute the overall mean when we know the
means within subgroups along with the subgroup weights instead of just relying on the formula. You
can try computing the overall mean using the usual formula and you will see that it tallies.

One more feature to note here is that the overall mean of 32.03 lies between the largest and
smallest subgroup means. This is not a coincidence and it is always true that the overall mean will be
between the smallest and largest values of the subgroup means. We will touch upon this “law” in the
form of rates in Chapter 2.
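The weighted average can be sketched in Python using the school sizes and subgroup means quoted above. The dictionary layout is our own choice. Since the subgroup means are already rounded to 2 decimal places, the result may differ from the quoted 32.03 in the last digit.

```python
# Subgroup sizes and means as quoted in the text (means are rounded to 2 d.p.).
groups = {
    "School A": {"n": 349, "mean": 32.21},
    "School B": {"n": 46, "mean": 30.72},
}

total_n = sum(g["n"] for g in groups.values())

# Each subgroup mean is weighted by that subgroup's share of all students.
overall_mean = sum(g["n"] / total_n * g["mean"] for g in groups.values())

subgroup_means = [g["mean"] for g in groups.values()]
lies_between = min(subgroup_means) <= overall_mean <= max(subgroup_means)

print(round(overall_mean, 2), lies_between)
```

Because School A has far more students, the overall mean sits much closer to School A's mean than to the simple midpoint of the two subgroup means.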
Clinical trials : Means disguised as proportions
• Imagine we want to investigate the effectiveness of a new drug for treating asthma
attacks compared to an existing drug. Is it fair to say that the new drug is performing
better since there are only 200 asthma attacks for those taking the new drug as
compared to 300, for those taking the existing drug?
                        New drug    Existing drug
Number of patients      500         1000
Total asthma attacks    200         300

Proportion of asthma attacks with new drug = 0.4
Proportion of asthma attacks with existing drug = 0.3
25

The mean serves as a good metric when comparing groups of unequal sizes. For example, in the
table above, by looking at the raw numbers, we can be misled into thinking that the new drug is
more effective since there are only 200 asthma attacks as compared to the existing drug for which
there are 300. But this would be overlooking that the number of people receiving the existing drug is
1000 which is twice the size of the group receiving the new drug. In this case, calculating the
proportion of asthmatic attacks in each group helps us account for differences in group sizes.

Proportion of asthma attacks with new drug= 0.4 > Proportion of asthma attacks with existing drug =
0.3.

Therefore, for these 1500 people, it may be possible that the new drug is not as effective as the
existing drug.

** Note, at this stage we do not commit to saying that the new drug is definitely not as effective, just
purely based on the data above without any further context. This is because we need information of
how the clinical trial was designed before making any stronger claims about the effectiveness of the
new drug. We shall visit this soon in the Unit of Study Designs.
How is proportion an example of mean?
                        New drug    Existing drug
Number of patients      500         1000
Total asthma attacks    200         300

Proportion of asthma attacks with new drug = 0.4

No. of people receiving new drug = 500
Person who gets an asthma attack : 1
Person who doesn’t get an asthma attack : 0
26

Now this computation of proportion as shown above may not seem like a calculation of mean since
it doesn’t appear like we are using $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$. But in fact we are, and here is how we can
interpret it as a calculation of mean. Let’s focus on the 500 people who’re receiving the new drug.
For each person, an asthmatic attack contributes a value of 1 whilst no asthmatic attack contributes
a value of 0. Hence the values $x_1, x_2, \dots, x_{500}$ can either be 0 or 1, depending on whether that
particular person experiences an asthma attack or not, and we can use $\frac{x_1 + x_2 + \dots + x_{500}}{500}$ to calculate the
mean, which also happens to be the proportion of asthma attacks in the group.

Note that when we calculate the proportion of asthma attacks among the new drug users = 0.4, the
calculation doesn’t give us much information about how the asthma attacks are distributed amongst
the 500 people who received the new drug and this is consistent with what we know of the mean.
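Here is a minimal Python sketch of this 0/1 coding, using the counts from the table. The order in which the 1s and 0s are listed is arbitrary, and we are assuming (as the slide does) that each attack belongs to a different patient.

```python
# 1 = patient had an asthma attack, 0 = patient did not.
# Only the counts come from the table; the ordering is arbitrary.
new_drug = [1] * 200 + [0] * 300        # 500 patients
existing_drug = [1] * 300 + [0] * 700   # 1000 patients


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


p_new = mean(new_drug)            # mean of the 0/1 values = proportion with an attack
p_existing = mean(existing_drug)

print(p_new, p_existing)
```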
Sample Variance and Standard deviation
• Recall that the mean doesn’t tell us anything significant about how data is distributed
which also includes the spread of the data points.
• The standard deviation is one way (yes there are other ways) of quantifying the
“spread” of the data about the mean. The formula is derived via the variance.
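$\text{Sample Variance} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$, and $s_x = \sqrt{\text{Variance}}$,

where $s_x$ is the notation for the standard deviation of $x$, $n$ denotes the number of data-points and $x_1$ to $x_n$ denote the values of the
numerical variable $x$ in the data set. We assume this is sample data.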

27

You are not required to manually compute the standard deviation since in most situations, data sets
are too big to compute even the mean by hand let alone the standard deviation. What we will focus
on here is to get an intuition about how that formula (which you may find unfriendly to your eyes) is
actually measuring the spread of the data.

To start with the simplest example, consider a small set of data with just 5 data points. The 5 points
are {10, 10, 10, 10, 10}. Now looking at this data, we can immediately see that there is no spread of
the data. Does our formula also say the same? Well in this case the mean is also 10, therefore the
difference of each point and the mean is 0 and the variance works out to be zero and that makes the
standard deviation zero as well so the formula does coincide with our intuition.

** Some of you will remember that the formulas for population standard deviation and sample
standard deviations are slightly different. We will not concern ourselves with why the formulas are
different. We will only state that the standard deviation formula for population data is almost the
same except that we divide by n instead of n – 1.
More intuition behind the formula
• Question : Why can’t we do the following to quantify spread?
• Take the difference between each value and the mean
• Add up the differences to get the “total spread”.
• Divide by the number of points to get an “average spread”

• Answer : Try applying that idea to a simple dataset {5, 10, 15, 20, 25}
• The mean is 15. The difference between each point and the mean gives us {-10, -5, 0,
5, 10}
• What happens when you add up these differences to calculate the “total spread”?

28

When adding up the differences, we get a total of 0. What’s causing the trouble is that differences
can be positive or negative. So we cannot just simply take the difference between each value and
the mean and add them up to get the total spread.

To avoid a scenario where positive and negative numbers cancel each other out, we square the
differences to make all the numbers positive and then add them up and divide by n - 1. This is
precisely the variance whose formula we have stated in the previous section.

But squaring results in a magnification of the spread. To “compensate” for the squaring, we take the
square root. Hence, we say that the standard deviation is the square root of the variance.
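The reasoning above can be traced in Python on the same data set {5, 10, 15, 20, 25}. This is only a sketch of the steps described, with variable names of our own choosing.

```python
import math

data = [5, 10, 15, 20, 25]
n = len(data)
x_bar = sum(data) / n

# Signed differences from the mean: positives and negatives cancel out.
deviations = [x - x_bar for x in data]
total_signed = sum(deviations)

# Squaring removes the signs; dividing by n - 1 gives the sample variance.
squared = [d ** 2 for d in deviations]
variance = sum(squared) / (n - 1)

# The square root brings us back to the original units.
std_dev = math.sqrt(variance)

print(deviations, total_signed, variance, std_dev)
```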
An explicit computation of Standard deviation
Consider a simple sample data set of just 3 points. Let the points be {1, 4, 7} and
suppose we wish to find the standard deviation of this data set.
Step 1: Find the average value of the data set.
In this case the average is $\frac{1 + 4 + 7}{3} = 4$.
Step 2 : Subtract the average value from each of your data points and square the
answer.
We get $(1 - 4)^2 = 9$, $(4 - 4)^2 = 0$, $(7 - 4)^2 = 9$.

29

So let’s do an explicit computation of a standard deviation for a very simple data set consisting of
just three points {1,4,7}.
Step 1: Find the average value of the data set.
Step 2 : Subtract the average value from each of your data points and square the answer.
An explicit computation of Standard deviation

30

Step 3: Add up the results in Step 2 and divide by the number of points minus 1 to get the variance. Here, $\frac{9 + 0 + 9}{3 - 1} = 9$.
Step 4: Take square root of the variance. Here, $\sqrt{9} = 3$, so the standard deviation is 3.
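The four steps for the data set {1, 4, 7} can also be sketched in Python. We cross-check the result against `statistics.variance` and `statistics.stdev` from the standard library, which use the same sample (n − 1) formula.

```python
import math
import statistics

data = [1, 4, 7]

average = sum(data) / len(data)                      # Step 1
squared_diffs = [(x - average) ** 2 for x in data]   # Step 2
variance = sum(squared_diffs) / (len(data) - 1)      # Step 3
std_dev = math.sqrt(variance)                        # Step 4

# Cross-check with the standard library's sample (n - 1) versions.
matches_library = (
    math.isclose(variance, statistics.variance(data))
    and math.isclose(std_dev, statistics.stdev(data))
)

print(average, squared_diffs, variance, std_dev, matches_library)
```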
Properties of Standard deviation
• The standard deviation is always non-negative (i.e. it is either 0 or a positive number)
and has the same units (if any) as the numerical variable.
• Adding a constant value $c$ (positive or negative) to all the data points does not change
the standard deviation.

• Multiplying all the data-points by a constant value $c$ results in the standard deviation
being multiplied by $|c|$, where $|c|$ is the absolute value of $c$.

31

We use dot plots to help us visualize how standard deviations behave when constants are added and
multiplied to a data-set. Dot plots are a good tool to use when trying to understand how data-points
are distributed especially when the data-set is small.

When we add constant values to all the data points, we are literally “shifting” the data points by the
value which we add. From here you should be able to visualize that shifting every point by the same
amount should not change the spread of the points. Focusing on the 2 dot-plots in the middle of the
slide, the dot-plot on the right is obtained by adding 5 to all values of the dot-plot on the left. Notice
how the spread between points remains unchanged.

However, when we multiply points by a constant value, the “distance” of each point from the mean
also gets multiplied and thus we should expect a change in the spread when multiplying data points
by a constant value. Focus on the 2 dot-plots on the bottom of the slide. The dot-plot on the right is
obtained by multiplying all values of the dot plot on the left by 2. Notice how the spread between
the points is also doubled as a result.
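If you do not have the dot plots in front of you, the same behaviour can be checked numerically. The data set and the constants below are our own illustrative choices, not the values plotted on the slide.

```python
import math
from statistics import stdev

data = [1, 4, 7]   # hypothetical sample
s = stdev(data)

shifted = stdev([x + 5 for x in data])     # add 5 to every point
doubled = stdev([2 * x for x in data])     # multiply every point by 2
negated = stdev([-2 * x for x in data])    # multiply every point by -2

print(s, shifted, doubled, negated)
```

Multiplying by -2 flips the data around zero but stretches the spread by the same factor as multiplying by 2, which is why the property is stated in terms of the absolute value of c.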
Standard deviations in real-life scenarios

32
Meet the Palmer Penguins
[Cartoon of three penguins, with speech bubbles:]
“I’m the biggest among you lot!!”
“Yeah the killer whales will love you”
“Not only do I have a chinstrap, I AM the chinstrap!”

33

A group of researchers were studying penguin species in the Antarctic region and collected data on
342 different penguins coming from 3 different species in the years 2007, 2008 and 2009. For each
penguin, they made observations/measurements of:

• Flipper length
• Mass
• Bill length
• Bill depth
• Gender and location
• Species

Part of the data is shown on the right. For those who have already seen this data before, note that
the original data set consists of 344 penguins but we removed 2 points because there was no
information on any of the variables for those 2 penguins. For those variables like gender which still
have some “NA” values, we will not delete them since the other variables have valid values.
An overarching question
How different are these species
of penguins?

34

A very common practice amongst researchers would be to compare characteristics of penguins
across different species. Characteristics in this case would be referring to the physical characteristics
which have been recorded in the data.

But it doesn’t make sense to just take one penguin from each species because we want to be able to
describe each species as a group and we wish to compare the differences across groups. Therefore
we would need “typical” values for each species of penguins. This is once again, the job of summary
statistics.
Comparing mass across species
             Mean mass    Standard deviation of mass
Chinstrap    3733g        384.3g
Adelie       3701g        458.6g
Gentoo       5076g        504.1g
Overall      4201g        802.0g

Does this mean the heaviest penguin has a mass of 4201g + 802.0g = 5003g?

35

Notice again that the overall mean mass of the penguins is between the smallest and the largest
means of the species.

Now while the standard deviation is attempting to measure the spread of the data about the mean,
it does not give us the highest and lowest values of the data.

To convince yourself that it really doesn’t, take a look at the mean mass of the Gentoo species of
penguin. How can it be possible that based on the overall standard deviation, the heaviest penguin
has a mass of 4201 + 802 = 5003g but yet the Gentoo penguin has mean mass 5076g?

One has to look at the data points to find the lowest and highest values.

In general, a low standard deviation indicates that data points tend to be more clustered around the
mean while a large standard deviation indicates that data points tend to be spread further away from the mean.
Are the Adelie and Chinstrap similar?
Why is it that the Adelie and Chinstrap have almost the same mass on average but yet the
standard deviation for the Adelie species is more? Could it be due to

- Gender?
- Age? (not found in the data set)
- Location?

             Mean mass    Standard deviation of mass
Chinstrap    3733g        384.3g
Adelie       3701g        458.6g
Gentoo       5076g        504.1g
Overall      4201g        802.0g

36

Here again we can see that mean doesn’t give us the complete picture. If we never computed the
standard deviation, we would only have the information that the Chinstrap and Adelie penguins
have similar average masses. But we see that the mass of the Adelie penguin is spread more about
the mean. In other words, it is possible to find Adelie penguins which are considerably lighter and
heavier than their Chinstrap counterparts.

Again this begs the question: why is it that Adelie penguins seem to exhibit greater
variation in their mass as compared to the Chinstrap? This is very much in the spirit of the EDA process,
whereby you

1) Start off with a few questions and explore the data to try and answer those questions.
2) Then you generate information based on the data to answer the questions and in the process, ask
new questions based on the information that is presented and then attempt to find the answers to
those questions by delving deeper into the data.

What we have shown here is a minor example but we hope it gives you the idea of what exploratory
data analysis is like.
Comparing the spread b/w variables
• Let’s focus on the Adelie penguin species to define a notion that helps us
quantify the degree of spread relative to the mean.
[Figure: panels labelled Bill and Flipper]

The coefficient of variation is a way of quantifying the degree of spread,
relative to the mean.

$\text{coefficient of variation} = \frac{s_x}{\text{mean of } x}$

37

If we want to compare the spread of data for the length of flippers versus the length of the bill,
based on the raw standard deviation, one can say that the flipper length has the greater spread of
data but this would not be a fair comparison since the flipper length also has a much higher mean
value as compared to the bill length. When we want to compare spread across different variables,
it’s important to consider the spread relative to the mean.

There is a tool for this known as the coefficient of variation. It is a way of quantifying the degree of
spread, relative to the mean. The formula is $\text{coefficient of variation} = \frac{s_x}{\text{mean of } x}$. So how does one interpret
the coefficient of variation? For example, the coefficient of variation for the bill length is higher. This
means that, for a mean of only 38.8mm, the spread of the bill length is large compared to the spread
of the flipper length about a mean of 190.0 mm.
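To see how the comparison can flip once we divide by the mean, here is a small Python sketch. The measurements are made up: they were chosen only so that the means match the 38.8mm and 190.0mm quoted above, and they are NOT taken from the penguin data set. The function name `coefficient_of_variation` is also our own.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical measurements in mm, NOT taken from the penguin data set.
bill_lengths = [36.0, 38.0, 39.0, 41.0, 40.0]           # mean 38.8
flipper_lengths = [184.0, 188.0, 190.0, 193.0, 195.0]   # mean 190.0

cv_bill = coefficient_of_variation(bill_lengths)
cv_flipper = coefficient_of_variation(flipper_lengths)

print(stdev(bill_lengths), stdev(flipper_lengths))
print(cv_bill, cv_flipper)
```

In this made-up sample the flipper lengths have the larger raw standard deviation, yet the bill lengths have the larger coefficient of variation, which mirrors the point made in the text.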
Questions, Questions and more questions
• Are male penguins heavier than female penguins across all species?
• Is there a relationship between bill length and bill-depth across all species?
• Can findings in this data be generalized to all the three species of penguins?
• Do the heavier penguins come from colder locations?

38

While we have attempted to give the flavour of EDA, when presented with any sufficiently large data
set, there are many questions that can be asked with reference to the data set. Some of these
questions may be answerable, with the given data. There may be some to which you can get a partial
answer and it shouldn’t come as a surprise if there are questions for which the data may not prove
to be adequate to give any satisfactory answers. But that shouldn’t discourage you from asking those
questions and you may eventually find an answer via other resources or by gathering more data.
Summary
• Summary statistics
• Mean
• Formula for mean
• Basic properties of mean
• Means in real-life scenarios
• Standard deviation
• Formula for standard deviation via variance
• Properties of standard deviation
• Standard deviations in real-life scenarios

39

In summary, this is what we have gone through so far. We have defined the mean, we have given its
formula, and we have established certain theoretical properties of the mean. We have also seen how
means are encountered in real life scenarios. Similarly for standard deviation, we have given the
formula of the standard deviation, which is deduced via the variance, we have also established
theoretical properties of standard deviations, and we have also seen how standard deviation can
occur in real life scenarios.

39
Understanding Variables
- Summary Statistics Part 2

40
Median, quartiles and
interquartile range

41
Median
The median of a numerical variable in a data-set is the middle value of the variable after
arranging the values of the data-set in ascending/descending order.
Data arranged in ascending order based on the number of days to recover.
Median of the number of Days To Recover? (17 + 20)/2 = 18.5

EXCEL COMMAND : “=MEDIAN”

42

One point to note is that when there is an even number of values, the median is the average of the
middle 2 values. When the number of values is odd, there is only one middle value after the sorting
and this is precisely the median.

We don’t worry about the rearranging because, once again, software comes to our aid in helping us
arrange the values when the data set is large.
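
The odd/even rule above can be sketched in a few lines of Python. The recovery times here are made-up values, chosen only so that the two middle values are 17 and 20 as on the slide, and `median_by_sorting` is a hypothetical helper name used for illustration.

```python
import statistics

def median_by_sorting(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of middle two

# Hypothetical days-to-recover values (not the course data set)
days_even = [25, 9, 17, 31, 20, 12]   # six values, middle two are 17 and 20
days_odd = [25, 9, 17, 31, 12]        # five values, single middle value

print(median_by_sorting(days_even))
print(median_by_sorting(days_odd))
print(statistics.median(days_even))   # the standard library gives the same answer
```

In practice the built-in `statistics.median` (or Excel's `=MEDIAN`) does the sorting for us; the hand-written version is only there to make the rule visible.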
Properties of median
• Adding a constant value (positive or negative) to all the data points changes the
median by that constant value.

• Multiplying all the data points by a constant value 𝑐 results in the median being
multiplied by 𝑐.

43

Observe that the median as another measure of central tendency has similar properties to the mean
even though the way it is calculated is different.
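
As a quick check of the two properties of the median, here is a small sketch on made-up scores. The data and the constants `shift` and `scale` are arbitrary choices for illustration.

```python
import statistics

scores = [11, 16, 23, 30, 42]   # hypothetical data, median is 23
shift = 5                       # constant added to every data point
scale = -2                      # constant multiplying every data point

base = statistics.median(scores)
shifted = statistics.median([x + shift for x in scores])
scaled = statistics.median([scale * x for x in scores])

print(base, shifted, scaled)
```

The shifted median equals `base + shift` and the scaled median equals `scale * base`, even for a negative multiplier.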
Medians in real-life
scenarios

44
Performance in a test
• Recall that when we were discussing means, we showed this graph regarding average
performance in a test of 2 schools. Let’s focus on School B which consists of 46 students.
The bar graph shows the
average performance of
students when categorized by
school. The maximum score
attainable is 60.
Number of students
School A 349
School B 46
Total 395

45

Let’s focus on the students of School B to get an understanding of what information can be gleaned
from the median.
What the median can/can’t tell us
• Since School B only has 46 students, a dot plot is quite handy in viewing how the
scores are distributed. The median score is 30.5.
• Is there any reason why the mean is so close to the median? Does this happen all the
time?
Mean = 30.7

Median = 30.5

46

In this case, a median score of 30.5 means that 50% of the students scored below 30.5 and the
other 50% scored above 30.5. If a teacher is interested in differentiated
instruction for the students, then the median is telling him/her that half the class is either
failing or barely passing the test. Remember that the mean does not give this information.

We have calculated before that the overall mean is 30.7. Notice that it is not too different from the
median. There is a reason for this in this context.

Looking at the distribution of points on the dot plot, we see that it is roughly symmetrical about the
median. It is because of this approximate symmetry in the distribution of points, that the mean and
median don’t differ by too much. There are scenarios where the mean and median can differ
drastically from each other, but we will encounter such scenarios in subsequent units.

Similar to the mean, knowing the median alone also does not tell us about the frequency of
occurrence of scores nor does it tell us how the scores are distributed within the class.

More generally, the median of a numerical variable, does not tell us the total value, frequency of
occurrence or the distribution of data points of the numerical variable.
Overall median vs median in subgroups

Students in Class A Median = 33

Students in Class B Median = 28.5

Class A + Class B Median = 29

47

We make a disclaimer that the dot plots here have nothing to do with the 46 students from School B
that we have just encountered. This is purely to illustrate the relationship between overall median
and the medians in subgroups.

The median shares this property with the mean whereby the overall median will be in between the
smallest and the largest medians of the subgroups. However, do note that unlike the mean, where
we can take a weighted average of the subgroup means to calculate the overall mean, the median
has no such property. Here we see that Class A and Class B have the same number of students, yet
the overall median of 29 is not exactly midway between 33 and 28.5. In general, knowing the medians
of subgroups does not tell you anything about the overall median beyond the fact that it must be in
between the medians of the subgroups.
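
A small sketch can make this concrete. The class scores below are invented for illustration (they are not the dot-plot data); two different versions of Class B share the same median and size, yet give different overall medians.

```python
import statistics

class_a = [20, 30, 33, 36, 40]       # hypothetical scores, median 33
class_b = [25, 27, 28, 29, 31]       # hypothetical scores, median 28
class_b_alt = [10, 11, 28, 32, 32]   # different scores, same median 28

overall = statistics.median(class_a + class_b)
overall_alt = statistics.median(class_a + class_b_alt)

print(statistics.median(class_a), statistics.median(class_b), overall)
print(statistics.median(class_a), statistics.median(class_b_alt), overall_alt)
```

Both overall medians lie between 28 and 33, but they differ from each other and from the midpoint 30.5, so the subgroup medians and sizes alone cannot determine the overall median.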
Quartiles and InterQuartileRange
• The first quartile, usually denoted by 𝑄₁, is the 25th percentile of the data-values and the
third quartile, usually denoted by 𝑄₃, is the 75th percentile of the data-values.
(The 25th percentile is a value such that 25% of the data is less than or equal to this value. Same idea for the 75th
percentile.)

• The interquartile range is the difference between the third quartile and the first quartile.

𝐼𝑄𝑅 = 𝑄₃ − 𝑄₁

It gives us another way of quantifying the spread of the data.

48

A small IQR value means that the middle 50% of data values have a narrow spread whilst a large IQR
value indicates a large spread for the middle 50% of the values.
Similarities between IQR and S.D
• The IQR is always non-negative and this follows from the fact that 𝑄₃ is at least as large
as 𝑄₁.

• Adding a constant value 𝑐 (positive or negative) to all the data points does not change
the IQR.

• Multiplying all the data points by a constant value 𝑐 results in the IQR being multiplied
by |𝑐|.
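
These properties can be checked numerically. The sketch below uses made-up data and Python's `statistics.quantiles` with `method="inclusive"`, which is just one of several quartile conventions (an assumption here, not necessarily the rule used in the course). The helper name `iqr` is hypothetical.

```python
import math
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1, using the 'inclusive' quartile rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical data
shifted = [x + 10 for x in data]         # add a constant to every point
scaled = [-2 * x for x in data]          # multiply every point by c = -2

print(iqr(data), iqr(shifted), iqr(scaled))
```

Shifting leaves the IQR unchanged, while scaling by a negative constant multiplies it by the absolute value of that constant, so the IQR stays non-negative.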

49
Explicit computation of Quartiles and IQR
Let’s go back to the same 46 students from School B and explicitly work out the IQR
𝐼𝑄𝑅 = 37.75 − 24 = 13.75

𝑄₁ = 24, 𝑄₃ = 37.75

11, 11, 12, 13, 16, 16, 17, 20, 23, 23, 23, 24, 24, 25, 26, 26, 27, 27, 28, 29, 30, 30, 30.

50

The 25th percentile (also known as the First Quartile) can be computed via the following procedure.

1) Pick all the points below the median of the data set and arrange them in ascending order. We get
a sub-collection of the data points namely

11, 11, 12, 13, 16, 16, 17, 20, 23, 23, 23, 24, 24, 25, 26, 26, 27, 27, 28, 29, 30, 30, 30.

2) Find the median of the subcollection and this will be the 25th percentile of the whole data set.
Here we get the 1st quartile as 24.

In a similar way we can find the 75th percentile, which is also known as the 3rd Quartile by working
with all the points above the median and we get 37.75. The IQR in this context is 13.75 which means
that the middle 50% of data points have a spread of 13.75 marks.
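
The first-quartile step can be reproduced directly from the 23 lower-half scores listed above. In this sketch the third quartile is simply copied from the slide, because the upper-half scores are not listed in this section.

```python
import statistics

# Scores below the median for School B, as listed above
lower_half = [11, 11, 12, 13, 16, 16, 17, 20, 23, 23, 23, 24,
              24, 25, 26, 26, 27, 27, 28, 29, 30, 30, 30]

q1 = statistics.median(lower_half)   # median of the lower half = first quartile
q3 = 37.75                           # taken from the slide, not recomputed here
iqr = q3 - q1

print(len(lower_half), q1, iqr)
```

With 23 values the median of the lower half is the 12th value in sorted order, which is 24, matching the first quartile on the slide.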

We will visit the notion of IQR in subsequent units when we consider outliers and data visualisation
for numerical variables.

** While finding quartiles may seem like a straightforward task, software packages use at
least six different rules/algorithms for finding quartiles. The good news is that we don’t have to
worry too much about finding the “exact” value of the quartile since for large data sets, all the
different methods give pretty close answers. For small data sets the differences may be greater but
there’s usually not much need to summarize data for small data sets anyway.
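
To see two such rules disagree on a small data set, here is a sketch using Python's `statistics.quantiles`, which offers an "exclusive" and an "inclusive" method. The data are made up, and these two methods are only examples of the different conventions that software may use.

```python
import statistics

small = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical small data set

exclusive = statistics.quantiles(small, n=4, method="exclusive")
inclusive = statistics.quantiles(small, n=4, method="inclusive")

print(exclusive)   # [Q1, median, Q3] under the exclusive rule
print(inclusive)   # [Q1, median, Q3] under the inclusive rule
```

Both rules agree on the median, but they give different first and third quartiles for this eight-point data set.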
Using summary statistics appropriately
• The mean and standard deviation are a pair of summary statistics that attempt to describe the
central tendency and dispersion of the data
• The median and IQR are another pair of summary statistics that attempt to describe the
central tendency and dispersion of the data.
Ways of describing central tendency and dispersion: Mean and Standard deviation, or Median and IQR.
Which one do I use?

51

Given that we have defined 2 ways of measuring the central tendency and spread of the data, it may
occur to some to ask “So which notion of central tendency and dispersion is more useful?”

The answer in short depends on the distribution of the data points. Briefly speaking, the median is
often used in preference to the mean when the distribution of points is not symmetrical.

An example of this occurs in HDB prices. The median is typically chosen as the measure of central
tendency for HDB prices since there may be flats which are extremely expensive relative to the price
of most of the HDB flats.
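
The effect of a few extreme values can be seen in a toy example. The prices below are invented (in thousands of dollars) and are not actual HDB data.

```python
import statistics

# Hypothetical flat prices in thousands of dollars; one flat is far more expensive
prices = [400, 420, 450, 480, 500, 520, 1500]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(mean_price, median_price)
```

The single expensive flat pulls the mean well above what most flats cost, while the median stays close to a typical price.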

This will be covered in greater detail in subsequent chapters and units when we learn the idea of
skewness of a distribution.
Mode
52
Mode
Mode of a variable is the value of the variable that appears the most frequently.

Mode for Gender?

Male

Mode for Age?


56
EXCEL COMMAND : “=MODE”

53

Again, we work with part of the data set we had previously. The mode of a set of data values is the
value that appears most often. Note that whilst the mean and median apply strictly to numerical
variables, the mode can take on both numerical and categorical values. Having learnt about means
and medians, it is not difficult to see that the mode does not convey the same information as
the mean and median, and that it is not very useful when all the values are unique.
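
Here is a minimal sketch of the mode for a categorical and a numerical variable, using Python's `statistics` module. The small lists are made up for illustration and are not the data set on the slide.

```python
import statistics

genders = ["Male", "Female", "Male", "Male", "Female"]   # hypothetical categorical data
ages = [56, 43, 56, 61, 38]                              # hypothetical numerical data
unique_ids = [101, 102, 103]                             # every value occurs exactly once

print(statistics.mode(genders))          # most frequent category
print(statistics.mode(ages))             # most frequent number
print(statistics.multimode(unique_ids))  # all values tie, so every value is returned
```

When every value is unique, all values tie for "most frequent", which is why the mode says little in that case.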
Interpretation of mode as “peaks”
• For the final time, let’s go back to that dot-plot of the 46 students from School B and see
what the mode is telling us.

• When we are describing the distribution of points of a discrete variable, the mode can
be interpreted as a “peak” of the distribution. In the context of probability, a peak of the
distribution, refers to the value that has the highest probability of occurring.
• We will touch upon this idea more in later chapters when we define a Discrete Random
Variable (DRV).

54

How does one determine the mode? The mode is generally interpreted as the peak of a distribution.
If we once again look at the dot plot of the 46 students from School B, we find that the mode here is
35 and this value is interpreted as the peak of the distribution.
Applications of mode

Real estate: Real estate agents need the mode of the number of bedrooms per house so they can inform their clients on how many bedrooms they can expect to have in houses located in a particular area.

HR: Human Resource managers also use the mode of different positions in the company so that they can be aware of the most common position of employees at their company.

Healthcare and Insurance: Actuaries also calculate the mode of the age of their customers (the most commonly occurring age) to find out which age group uses their insurance the most.

55

Finally, here are some applications of mode. Mode is often used in real estate, HR, as well as health
care and insurance.
Summary
• Summary statistics
• Median
• Definition of median
• Medians in real-life scenarios
• Quartiles and interquartile range
• Definition of 1st and 3rd Quartile.
• Formula for Interquartile range
• Explicit computation of interquartile range
• Mode
• Interpretation of mode
• Applications of mode.

56

In summary, these are the concepts we have covered as part of summary statistics part 2.
We started off with the median, gave its definition and saw how medians are encountered in real-life
scenarios.
We learnt what a quartile is, what the interquartile range is, and how to compute the
interquartile range using the 1st and 3rd quartiles.
Finally, we briefly touched on the mode, how one interprets the mode, and some applications of
the mode.

Study Designs

57
Overview
• Experimental Studies
• Treatment and Control Groups
• Random Assignment
• Blinding
• Observational Studies
• Experimental vs Observational
Studies

58

Study Designs
- Experimental Studies

Hi, it’s back to me, Samuel Yeun, again. And, in this unit, we will be covering study designs.

59
Recall

60

Recall from Unit 1, we covered different types of research questions. In this unit, we will focus on
one important type – the questions that “Investigate a relationship between two variables in the
population”.

Recall

Does drinking coffee help students pass the math exam?

61

For example, in an education setting, we may have a research question: “Does drinking coffee help
students pass the math exam?”.
In this case, the variable – “Drinking/ not drinking coffee“ is what we call the independent variable.
The variable “Passing/ Failing the math exam” is what we call the dependent variable.

To answer this research question, we could conduct a study. As mentioned in the previous units,
we start with taking a sample (or census) of the students in the education setting. Once we have the
sampled subjects, we can conduct the study on them. But how should we design the study?

Types of Study Designs

Experimental Observational

62

We have a choice of two main study designs – Experimental and Observational. Both have their
advantages and are used depending on the type of research question and situation. Let us first look
at experimental studies.

Experiments
An experiment intentionally manipulates one variable in an attempt to cause an effect on
another variable.

The primary goal of an experiment is to provide evidence for a cause-and-effect relationship
between two variables.

Independent Variable

Dependent Variable

63

An experimental study, sometimes referred to as ‘a controlled experiment’, or more simply as ‘an
experiment’, intentionally manipulates one variable in an attempt to cause an effect on another
variable. The primary goal of an experiment is to provide evidence for a cause-and-effect
relationship between two variables.

Experiments
An experiment intentionally manipulates one variable in an attempt to cause an effect on
another variable.

The primary goal of an experiment is to provide evidence for a cause-and-effect relationship
between two variables.

Drinking coffee

Help students

Pass the math exam

64

Back to the research question “Does drinking coffee help students pass the math exam?”. Say that in
this scenario, we already have some suspicion that coffee has an effect on cognitive functions or
mathematical thinking; hence the reason for this research question. If so, an experiment may be a
suitable way to show whether coffee can help cause an improvement in passing rates for the math exam.
“Drinking coffee” is the independent variable of interest, and “Passing the math exam” is the
dependent variable of interest.

Experiments – Treatment and Control
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month

65

We could split the subjects such that some of them are in the “Coffee” group, and the rest are in the
“No coffee” group.
The “Coffee” group would be asked to drink exactly one cup of coffee every day, specifically in the
morning, for one month. The “No coffee” group would be asked to not drink any coffee for one
month.

Experiments – Treatment and Control
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
No coffee (Control Group): Not drink any coffee for one month

66

Researchers often formally refer to the “Coffee” group as the “Treatment group”, because it receives
the treatment – coffee in this case.
The “No coffee” group is referred to as the “Control group“.

Experiments – Treatment and Control
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
No coffee (Control Group): Not drink any coffee for one month

67

Some of us may be wondering. Why does our study need two groups? Why not save the trouble and
just have one big group?

Let’s imagine that scenario together to see what happens. What if all the subjects were placed in the
“Coffee” group and asked to drink a cup of coffee every morning for 1 month?

Experiments – Treatment and Control
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
90% passed
Coffee helps?

68

If the study shows that 90% of the subjects passed their math exam, would that be enough to
convince you that drinking coffee does help with passing the math exam? No! We have nothing to
compare our 90% with.

Experiments – Treatment and Control
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month. 90% passed
No coffee: Not drink any coffee for one month. 90% passed

69

For all we know, 90% of the people who do not drink coffee may also pass the exam,
implying that drinking coffee may not have any use in helping subjects pass the exam.

Experiments – Treatment and Control
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
No coffee (Control Group): Not drink any coffee for one month

The control group provides a baseline for comparison with the treatment group.

70

That is why we need both the treatment and control groups in the study. The control group provides
a baseline for comparison with the treatment group.

A small point to note. In this case, the treatment group receives “Coffee”, while the control group
does not. Occasionally in some other experiments though, the control group may receive a standard
or existing treatment rather than nothing.

Experiments – Treatment and Control
Antidepressant (Treatment Group)
Nicotine patch (Control Group)

The control group provides a baseline for comparison with the treatment group.

71

For instance, a smoking cessation study might analyse whether an antidepressant works better than
a nicotine patch in helping smokers to quit. The treatment group receives the antidepressant, while
the control group receives the nicotine patch. Regardless, the control group still provides a baseline
for comparison with the treatment group.

Study Designs
- Experimental Studies (Random Assignment)

72
Experiments – Study results
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
No coffee (Control Group): Not drink any coffee for one month

73

Previously, we discussed different study designs – Experimental and Observational studies. Within
Experiments, we discussed the importance of having a treatment and control group.

Experiments – Study results
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month
Pass

Fail

74

In the coffee example, subjects in the “Coffee” group were asked to drink exactly one cup of coffee
every day for one month, and subjects in the “No coffee” group were asked not to drink any coffee
for one month. At the end of that one month, the subjects would then be asked to take the math
exam. The number of passes and failures can then be recorded.

Experiments – Study results
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month

        Coffee    No coffee
Pass    900       450
Fail    100       550

90% passed (Coffee), 45% passed (No coffee)

75

Let’s imagine that the study’s results are as such in the table. In this case, we can see that from this
experiment’s results, most students who were in the “Coffee” group passed their exam (90% of
them!). Compared to that, the students in the “No coffee” group have a much poorer performance –
only 450 out of 1,000 of them passed the exam (which is 45% of them). From this table, there seems
to be some evidence to show that drinking coffee does help with passing the math exam.
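
The percentages quoted above follow directly from the table. As a small sketch (the dictionary layout and the helper name `pass_rate` are just illustrative choices):

```python
# Counts taken from the results table above
results = {
    "Coffee": {"pass": 900, "fail": 100},
    "No coffee": {"pass": 450, "fail": 550},
}

def pass_rate(counts):
    """Proportion of subjects in a group who passed the exam."""
    return counts["pass"] / (counts["pass"] + counts["fail"])

for group, counts in results.items():
    print(f"{group}: {pass_rate(counts):.0%} passed")
```

Comparing rates rather than raw counts is what allows the two groups to be compared even when they differ in size.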

Experiments
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month

        Coffee    No coffee
Pass    900       450
Fail    100       550

Coffee causes an improvement in passing the math exam!

76

But wait! From this table alone, would you feel it is enough evidence to go around telling people that
‘coffee causes an improvement in passing the math exam’? Maybe that is not wise! Why? What
issues would you be concerned about? I’ll give you a few seconds to think about it.

Experiments

Revision time

Pass the math exam

77

Here’s a thought! Would you be concerned if the subjects had drastically different amounts of
revision time for the exam? To make matters worse, what if the “Coffee” group had many subjects
with long revision time, and the “No coffee” group had many subjects with short revision time?

Experiments
Coffee (Long Revision Time): Drink exactly one cup of coffee every day for one month
No coffee (Short Revision Time): Not drink any coffee for one month

        Coffee    No coffee
Pass    900       450
Fail    100       550

78

If that was the case, maybe the reason for the “Coffee” group’s good performance in our study was
due to the amount of revision time, and not coffee intake!

Experiments
To establish a cause-and-effect relationship, we want to make sure that the independent
variable is the only factor that impacts the dependent variable.

Drinking coffee
Revision time

Pass the math exam

79

That is why when we want to establish a cause-and-effect relationship, we want to make sure that
the independent variable is the only factor that impacts the dependent variable. In this case, we
want to try to ensure that “Drinking coffee” is the only variable in the study available to impact a
student’s ability to “Pass the math exam”. We want to remove the effects of the subjects having
different amounts of revision time, and make the treatment and control groups similar in all aspects
apart from the “Drinking coffee” variable. One way of doing so is by assigning subjects with similar
revision times to both treatment and control groups.

Experiments
To establish a cause-and-effect relationship, we want to make sure that the independent
variable is the only factor that impacts the dependent variable.

Drinking coffee
Revision time

Pass the math exam IQ

Age

Other variables

80

Now let’s think even further! Even though the subjects in both groups have similar revision times to
each other, there are other factors that can affect subjects passing the math exam. The amount of
revision time is not the only variable that may be of concern! What about different initial IQ levels of
the subjects? Or different ages of the subjects? There may be many other variables of concern too!

It is almost impossible to know all of these factors!

Experiments

How do we account for the effects from all these other variables?

Random Assignment

81

So, what should we do? How do we account for the effects from all these other variables? Not to
worry! In experiments, we have a powerful method to solve our concern!

Random Assignment.

Random Assignment
Random assignment is an impartial procedure that uses chance.

82

Random assignment is an impartial procedure that uses chance.

Imagine writing down the names of every subject in the study on identical pieces of paper and
mixing them up in a box.

Random Assignment
Random assignment is an impartial procedure that uses chance.

83

Without looking, draw out a paper. The subject, whose name is on the chosen paper, will be
assigned to the treatment group. Remove that chosen paper from the box.

Random Assignment
Random assignment is an impartial procedure that uses chance.

84

Now, repeat the process until half of the papers are removed from the box. Note that at every draw,
each paper in the box has an equal chance of being picked. We call this process a “random draw
without replacement”, and it is one way of conducting random assignment.

Random Assignment
Random assignment is an impartial procedure that uses chance.

Treatment Group Control Group

85

The chosen subjects all belong to the treatment group. The remaining subjects that are not chosen
belong to the control group.
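
The box-and-paper procedure can be sketched on a computer. The subject names, the seed and the helper name `randomly_assign` are all hypothetical; `random.sample` draws without replacement, which plays the role of pulling papers out of the box.

```python
import random

def randomly_assign(subjects, seed=None):
    """Draw half of the subjects at random (without replacement) for treatment."""
    rng = random.Random(seed)
    n_treatment = len(subjects) // 2
    treatment = rng.sample(subjects, n_treatment)        # the drawn papers
    chosen = set(treatment)
    control = [s for s in subjects if s not in chosen]   # papers left in the box
    return treatment, control

# Hypothetical subject list
subjects = [f"Subject {i}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, seed=1)

print(len(treatment), len(control))
```

Every subject ends up in exactly one group, and who lands in the treatment group is decided by chance alone rather than by the researcher.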

Random Assignment
Random assignment is an impartial procedure that uses chance.
If the number of subjects is large, by the laws of probability, the treatment and control groups
will tend to be similar in all aspects.

Treatment Group Control Group

86

If the number of subjects is large, by the laws of probability, the treatment and control groups will
tend to be similar in all aspects.

Random Assignment
The goal of random assignment is to create similar treatment and control groups with respect to
• Revision time
• IQ
• Age
• Other variables

87

In particular, they will tend to have similar revision times, IQ, age, and other variables
that might influence passing the math exam.

Random Assignment – S&P Example

S&P Index
1,000 tickets

88

This is an example to illustrate how random assignment can make the treatment and control groups
similar.

The S&P Index is a stock market index of the largest U.S. publicly traded companies. A possible
interest could be the ‘percentage returns’ of the S&P companies in 2013. In this example, the
percentage returns were written on 1,000 tickets, and the percentage returns range from -4% to 4%.

Random Assignment – S&P Example

S&P Index
1,000 tickets

Randomly draw
500 tickets
(Without replacement)

89

Imagine we randomly draw 500 tickets without replacement, and assign these drawn tickets to
group A. The remaining 500 tickets left will be assigned to group B. What would you expect to
happen?

Random Assignment – S&P Example
Group A Group B

Similar distributions

90

Let’s try it out. Instead of using a box with paper tickets, a computer program was used to simulate
the draws. Computer programs use ‘pseudorandom number generators’, and thus are not truly
random. However, they still work very well.

The computer made 500 random draws without replacement. Group A’s results are on the left, and
group B’s results on the right.

As seen in the two graphs, the two groups have rather similar shapes, or what we call distributions.
The random assignment worked! It has helped to make both groups similar!
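
A simulation in the same spirit can be sketched as follows. The actual 2013 returns are not available here, so the 1,000 ticket values are simulated uniformly between -4 and 4 purely as a stand-in; the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(2013)   # arbitrary seed, so the run is repeatable

# Stand-in tickets: simulated values between -4 and 4 (not real S&P returns)
tickets = [rng.uniform(-4, 4) for _ in range(1000)]

rng.shuffle(tickets)          # mix the tickets in the "box"
group_a = tickets[:500]       # first 500 drawn without replacement
group_b = tickets[500:]       # the remaining 500

print(statistics.mean(group_a), statistics.mean(group_b))
print(statistics.stdev(group_a), statistics.stdev(group_b))
```

With groups this large, the means and standard deviations of the two groups typically come out very close to each other, which is the numerical counterpart of the two similar-looking graphs.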

Random Assignment – S&P Example
Group A Group B

The treatment and control groups can have different sizes.
As long as the groups are quite large, a
randomised assignment tends to produce two very similar groups.

91

Actually, for many different reasons, the treatment and control groups in experiments may have
different sizes. That is not a problem. As long as the groups are quite large, a
randomised assignment tends to produce two very similar groups.

Note on the term “Random”
“Random” has a much more strict meaning related to an impartial chance mechanism.
◦ “Random” does not mean “Haphazard”.

92

To end this unit, here’s a final note on the term ‘Random’. In casual conversation, we may loosely
describe an event as “random”. For example, ‘I randomly decided to buy a cup of tea today.’ What
we actually mean is that I decided to buy a cup of tea today on a whim. Within this example’s
context, you would not usually think that I took out some evenly weighted coin, flipped it, and then
decided to buy my tea based on the outcome of the coin flip.

However, in our context of an experiment, the term “random” has a much stricter meaning
related to an impartial chance mechanism. Referring back to using a box and pieces of paper to
conduct random assignment: if we did not use identically sized papers, the bigger papers may be
chosen more often than smaller papers. In this case, we will say that the draws are haphazard, not
random.

Study Designs
- Experimental Studies (Blinding)

93
Experiments
Coffee (Treatment Group): Drink exactly one cup of coffee every day for one month
No coffee (Control Group): Not drink any coffee for one month

In some cases, leaving the control group alone may cause bias!

94

So far, we have seen that a good experimental study design involves random assignment. This is
good, since the treatment and control groups will tend to be similar in all aspects. The treatment
group receives the treatment, while the control group does not receive the treatment. However, this
does not mean that we leave the control group alone. Actually, in some cases, leaving it alone may
cause bias!

Experiments
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month

Control
Group
Coffee

95

In this coffee example, subjects in the control group may feel that they are lacking the extra
advantage that coffee could give to the treatment group, and hence start revising very hard for the
math exam. This in turn could create bias in the results.

Experiments
Coffee: Drink exactly one cup of coffee every day for one month
No coffee: Not drink any coffee for one month

Coffee Choffy

96

To solve the problem, some of us may come up with a bright idea. Let’s give these control group
subjects an alternative coffee drink! For instance, we could use “Choffy” - a beverage brewed from
roasted cocoa beans with negligible caffeine, as an alternative to coffee. It has no direct effect on
the subjects passing the math exam, and lacks the additional ingredients and nutrients that coffee has.
That way, we can ease the subjects’ worries that they are lacking the extra advantage.

Experiments - Placebo
Placebo: Treatment with no active ingredients, and no effect.

Placebo Effect: The response observed when subjects receive a placebo treatment, but still show
some positive effects.

Coffee Choffy

97

Choffy is what we call a ‘Placebo’. A placebo is a treatment with no active ingredients and has no
effect.

While this seems like a good idea, there has been a considerable amount of research showing that
people who receive a treatment with no active ingredients can also show positive effects. Merely
thinking that they received some form of treatment was enough to observe some response in the
subjects, even if the treatment does nothing! This is called the placebo effect.

Blinding
Blinded subjects do not know whether they are in the treatment or control group.
◦ A placebo that is very similar to the treatment can be chosen to help make the blinding effective.
◦ The subjects are blind to the treatment to prevent their own beliefs about the treatment from affecting
the results.

??? ???
Coffee Decaffeinated
Coffee

98

Oh dear. It seems like there are so many concerns to deal with. We want the benefits of having a
placebo, but need to avoid the placebo effect causing any bias. Again, not to worry. This is where we
learn a method called “blinding” to use in our study.

Blinding the subjects will prevent them from knowing whether they are in the treatment or control
group. A placebo that is very similar to the treatment can be chosen to help make the blinding
effective. This is done to prevent their own beliefs about the treatment from affecting the results.
That way, the treatment and control groups would respond the same way to the idea of treatment.

For the coffee example, subjects in both treatment and control groups will be given a drink
every morning. However, the study will ensure that both treatment and placebo smell and taste the
same, to prevent subjects from knowing whether they are drinking coffee or choffy!

Blinding
Blinded assessors do not know whether they are assessing the treatment or control group.

??? ???
Coffee Decaffeinated
Coffee

99

That’s not all. We should also consider the side of the research team. For this example, the
researcher could have a team of assessors that help to grade the subjects’ exams. Sometimes, the
math exam questions may be difficult to grade with exactly the same standard and strictness (for
example: open ended questions). If the assessor knows he is scoring a subject from the treatment
group, he may subconsciously expect better performance, and be unintentionally more lenient. To
prevent such bias from the assessors, the study would also need to blind the assessors involved in
marking so they do not know whether they are marking the exam of a treatment or control group
subject.

Double-Blinding
An experiment is called double-blind if both subjects and assessors are blinded about the
assignment.

??? ???
Coffee Decaffeinated
Coffee

100

An experiment is called ‘double-blind’ if both subjects and assessors are blinded about the
assignment. Sometimes, it may be difficult to blind both subjects and assessors, but when done
right, double-blinding can be very helpful in dealing with the concerns mentioned earlier on, and can
make comparison between the treatment and control groups much easier.

Study Designs
- Experimental vs Observational Studies

101
Experiments
Do vaccinations help reduce the effects of the coronavirus?

("Dozens to be deliberately infected with coronavirus in UK ‘human challenge’ trials," 2020)

102

We have just finished discussing the study design of experiments. Now let’s look at an experimental
case study.

In 2020, during the Covid-19 pandemic, a Dublin-based commercial clinical-research
organization was reported to be planning an experiment to test the effectiveness of the Covid-19
vaccines. Or in other words, “Do vaccinations help reduce the effects of the coronavirus?”. The plan
was for participants to be deliberately infected with low dosages of the virus strain, and then
provided with the vaccine to test the effectiveness of the vaccine.

102
Ethical issues
Experiments are useful in providing evidence for a cause-and-effect relationship.
However, an experiment has its issues too.

103

Put yourselves in the shoes of the researchers of the study. You can probably imagine the number of
ethical issues that would first need to be ironed out before the experiment can begin.

For example, there is the issue of injecting low dosages of the virus strain into humans, and the
issue of deciding which subjects to assign to the vaccine treatment group and which to the control
group. Consent from the subjects would also be needed, and it would have to be given at the start of
the study to prevent subjects from backing out halfway.

We would not be able to discuss the details of ethical issues here. But the point is this: We have
already seen how experiments are useful in providing evidence for a cause-and-effect relationship.
However, we also want to show that while useful, an experiment has its issues too. We need to
carefully consider the research question and these factors before deciding if an experiment is a
suitable study to conduct.

103
Types of Study Designs

Experimental Observational

104

This is where ‘Observational Studies’ come in. They are an alternative approach to experiments,
especially for scenarios where ethical issues prevent the use of experiments.

104
Observational Studies
An observational study observes individuals and measures variables of interest.
However, researchers do not attempt to directly manipulate one variable to cause an effect in
another variable.
◦ Does not provide convincing evidence of a cause-and-effect relationship.

105

An observational study observes individuals and measures variables of interest. However, unlike in
an experiment, researchers in an observational study do not attempt to directly manipulate one
variable to cause an effect in another variable. Therefore, an observational study does not provide
convincing evidence of a cause-and-effect relationship.

105
Observational Studies

Is long term smoking linked to heart disease?

("Impact of smoking and smoking cessation on cardiovascular events and mortality among older adults: Meta-analysis of individual
participant data from prospective cohort studies of the chances consortium," 2015)

106

Some research questions are impractical to answer with controlled experiments. Take for example
the research question: “Is long term smoking linked to heart disease?”. If we were to conduct an
experiment, we would need to find subjects willing to smoke for a long term, to study its effect on
their heart health. That may be difficult and unethical to do.

In such a scenario, a study conducted in Europe and North America decided that an observational
study may be more suitable, where they merely record what happens to subjects who have chosen
to smoke or not to smoke.

106
Observational Studies

Is long term smoking linked to heart disease?

Note: For observational studies, while there is no actual treatment being assigned to the subjects, we still use the
terms “treatment” and “control” groups in the same way as though we are dealing with an experiment.

Smokers                        Non-smokers
Treatment (Exposure) Group     Control (Non-exposure) Group

107

Take note that for observational studies, while technically there is no actual treatment being
assigned to the subjects, we still use the terms “treatment” and “control” groups in the same way as
though we are dealing with an experiment.

In this example, the treatment group are the smokers, and the control group are the non-smokers.

Sometimes we may also use the terms “Exposure” and “Non-exposure” groups instead.

107
Experiment vs Observational Study

Experimental Studies           Observational Studies
Assigned by researcher         Assigned by subjects themselves

108

To summarize, in experiments, the subjects are assigned to the treatment and control groups by the
researcher, while in observational studies, the subjects themselves decide which group they are in.
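
The small simulation below is a rough sketch of why this difference in who does the assigning matters. Everything in it is made up for illustration: the 10,000 subjects, the yes/no "background trait", and the probabilities 0.2 and 0.6 with which subjects choose the exposure are assumptions of the sketch, not figures from any study. It compares how evenly the trait is spread across the two groups when the researcher assigns by coin flip and when subjects choose for themselves.

```python
import random

rng = random.Random(1)   # fixed seed only so the sketch is reproducible

# Hypothetical background trait for each subject (True / False).
subjects = [rng.random() < 0.5 for _ in range(10_000)]

def share_with_trait(group):
    """Proportion of a group that has the background trait."""
    return sum(group) / len(group)

# Experiment: the researcher assigns each subject by coin flip,
# without looking at the trait.
exp_treatment, exp_control = [], []
for trait in subjects:
    (exp_treatment if rng.random() < 0.5 else exp_control).append(trait)

# Observational study: subjects choose their own group. ASSUMPTION:
# subjects with the trait choose the exposure less often (0.2 vs 0.6).
obs_treatment, obs_control = [], []
for trait in subjects:
    p_choose_exposure = 0.2 if trait else 0.6
    (obs_treatment if rng.random() < p_choose_exposure else obs_control).append(trait)

exp_gap = abs(share_with_trait(exp_treatment) - share_with_trait(exp_control))
obs_gap = abs(share_with_trait(obs_treatment) - share_with_trait(obs_control))

print("Gap in trait between groups, assigned by researcher:", round(exp_gap, 3))
print("Gap in trait between groups, chosen by subjects:    ", round(obs_gap, 3))
```

Under these assumptions, the groups formed by the researcher end up looking alike on the background trait, while the self-selected groups differ on it, so any difference in outcome between the self-selected groups could be due to the trait rather than the exposure.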

108
Experiment vs Observational Study

Experimental Studies                                       Observational Studies
Assigned by researcher                                     Assigned by subjects themselves
Can provide evidence of a cause-and-effect relationship    Cannot provide evidence of a cause-and-effect relationship

Our study corroborates and expands evidence from previous studies in showing that smoking is
a strong independent risk factor of cardiovascular events and mortality …

109

As mentioned, observational studies cannot provide convincing evidence of a cause-and-effect
relationship.

Quoting from the Europe and North America observational study, notice how the study rightly only
declares that: “Our study corroborates and expands evidence from previous studies in showing that
smoking is a strong independent risk factor of cardiovascular events and mortality …”. It does not
mention anything about cause-and-effect evidence.

On the other hand, experiments can provide evidence of a cause-and-effect relationship if the study
was designed properly, ideally with randomised assignment and double-blinding.

109
Experiment vs Observational Study

Experimental Studies                                       Observational Studies
Assigned by researcher                                     Assigned by subjects themselves
Can provide evidence of a cause-and-effect relationship    Cannot provide evidence of a cause-and-effect relationship
                                                           Can provide evidence of ‘Association’

110

That does not mean observational studies are useless though! They can still provide evidence of
‘Association’. We will be covering more about the difference between association and causation in
the next chapter.

110
Final note - Generalisability

If an experiment is well-designed, can we generalise the results?

111

On a final note, let’s wrap up this chapter with one last question: ‘If an experiment is well-designed,
can we generalise the results?’. If the experiment has randomised assignment, double-blinding, and no
ethical issues of concern, are the results generalisable to everyone?

111
Final note - Generalisability

If an experiment is well-designed, can we generalise the results?

Sampling frame?
Sampling method?
Sample size?

The design of the experiment is not the only factor we look at.

112

We cannot answer it just yet. Remember that we previously covered the sampling process, including
suitable sampling frames, sampling methods, and sample sizes. We need to know this information first.
The design of the experiment is not the only factor we look at. If the aim of the study is to
generalise the results of the experiment to a more general population, it is necessary to consider the
sampling design process too.

112
Summary
• Experimental Studies
• Treatment and Control Groups
• Random Assignment
• Blinding
• Observational Studies
• Experimental vs Observational Studies

113

Summary time. In this unit, we covered 2 different types of study design: experimental and
observational. Within experimental studies, we also discussed how it is advantageous for us to try
to ensure that we design the study to include ‘treatment and control groups’, ‘random assignment’,
and ‘blinding’. While experimental studies are useful in providing evidence of a cause-and-effect
relationship, they may not be suitable for all situations. That is where observational studies come
in as an alternative to experimental studies.

113
