Professional Documents
Culture Documents
Lesson one
Is psychology really a science?
- (“Science refers to particular areas of study like biology, chemistry, and physics”) No:
science is not defined by its subject matter. You cannot decide whether something is a
science just by looking at the topic that is studied; what matters is the approach taken
to study that topic. Science is defined by the approach, not the topic
- (“science is done with sophisticated equipment and technology”) No, not always. You
can do science with paper and pencil.
- (“science has a large body of well-established findings, facts, and principles”) No, not
necessarily. Sciences vary widely in the amount of knowledge that has been acquired, as
some sciences have a much longer history than others.
- The term science refers to a particular approach to acquiring knowledge. The scientific
approach can be used to study virtually any topic.
Features of a scientific approach: How scientists approach their work
1. Empiricism: relying on evidence or observations to draw conclusions
2. Scientists test theories, research questions, hypotheses, and predictions: the theory-
data cycle. A theory is a set of statements that describe general principles; theories are
typically too broad and complex to be tested in one particular research study. A
research question is what the researcher wants to know, often a question about how one
variable might be related to another variable. A hypothesis (conceptual) is a more specific,
focused statement of what is expected in a specific situation (usually a statement about
how two variables are thought to relate to one another). A prediction (experimental)
refers to the expected result of a specific study.
3. Scientists tackle both applied and basic problems. Basic research: the researchers want
to understand something, without regard to whether knowledge will be immediately
useful or practical. The main goal is to increase knowledge. Applied research: the
researchers want to learn something that can be applied to a problem of immediate
concern; the main goal is to find a solution to a specific problem.
4. Scientists dig deeper: they almost never conduct just a single research study. Instead,
each study leads them to ask new questions that require more research.
5. Scientists make their work public so that the research findings can be examined closely by
other scientists. A key aspect of articles in scientific journals is that, before publication,
they usually go through an extensive process of peer review.
The role of induction and deduction:
- Induction involves reasoning from specific instances to a general proposition.
Example: start with research findings or observations; use them to derive a theory.
- Deduction involves reasoning from a general proposition to a specific implication of that
proposition. Example: start with a theory; use it to derive a hypothesis.
Functions of a theory:
1. Explain existing data/observations: theories summarize and integrate existing
observations, such as a set of research findings coming out of several different studies.
The theory is not just a simple listing of all the different results. Instead, a good theory
provides an integration or synthesis of many disparate observations, in a manner that
generates new insights.
2. Guide future research: theories guide future research by suggesting new hypotheses
that can be tested. Once a general theory has been formulated, it should suggest
specific hypotheses and predictions that follow from the theory, and these new
hypotheses can then be tested. If they are confirmed by a study, this is further evidence
in support of the theory; if they are not confirmed, the theory will need to be revised to
account for the inconsistent observations, or be discarded.
Features of a good theory:
1. Good theories are well supported by data
2. Good theories are falsifiable
3. Good theories have parsimony. This idea is linked to “Occam’s razor,” named after
William of Ockham, a fourteenth-century English philosopher who emphasized the
importance of simplicity, precision, and clarity of thought. If two theories do an equally
good job of accounting for the data and predicting future outcomes, then the simpler
theory (the one that uses fewer or simpler ideas) is generally preferred
Types of scientific sources:
1. Empirical journal articles: peer reviewed; report for the first time the results of an
empirical research study; provide details about the study’s method, the statistical tests
used, and the results; are written for an audience of other psychological scientists and
psychology students
2. Review journal articles: peer reviewed; provide a summary of all the published studies
that have been done in one research area; sometimes use a quantitative technique
called “meta-analysis” to combine the results of many empirical studies and provide a
number that summarizes the overall effect size
3. Chapters in edited books: not peer reviewed as rigorously as journal articles; each
chapter is written by a different contributor; usually reviews a collection of studies done
in a research area
4. Full-length scholarly books: not peer reviewed; not as common in psychology as in
some other disciplines; most likely to be found in academic libraries
Components of an empirical journal article:
- Title, Abstract, Introduction, Method, Results, Discussion
Lesson two
Variables:
- Variables are what researchers examine in their research studies
- Variables are also sometimes referred to as: factors, dimensions, qualities, attributes, or
characteristics
- A more formal definition: a variable is anything that takes on different values or levels
across a set of cases
- In a research study, a variable must assume at least two different values or levels across a
set of cases. Otherwise it would be a constant
- In an experiment the levels of a variable are often called the conditions of the
experiment.
Types of variables:
1. Qualitative variable: also known as categorical or nominal variables
- Variables that classify or categorize
- Levels differ in terms of quality or type, not in the amount of something
- Example: religious affiliation, major in university
2. Quantitative variable:
- Levels of the variable differ in terms of quantity or amount
- You can number the levels and place them in order, from less to more
- These are variables that ask: how much? how many? to what extent?
- Examples: self-esteem on rating scale, family income, current happiness
3. Discrete variables:
- There are no meaningful values of the underlying variable between the levels of the
scale
- Must be measured in whole units or categories
- Between any two adjacent values, no intermediate values are possible
- Example: number of children in family, number of errors made, courses a student takes
4. Continuous variables:
- There are meaningful values of the underlying variable in between the levels of the
scale; not limited to a certain number of values such as whole numbers
- Can be measured in whole units or fractional units
- In principle, between any two adjacent scale values, intermediate values are possible
- Example: the time it takes to complete a task, blood alcohol level
- When we measure continuous variables, they are by necessity converted into discrete
variables
- A qualitative variable is always discrete
- A quantitative variable could be either continuous or discrete
5. Manipulated variables:
- A variable in an experimental study that is intentionally varied by the researcher
- The researcher intervenes to create the different levels of the variable and assigns
participants to those levels
- Example: situation variables (room temperature, light intensity, privacy); participant
states (current mood, anxiety level, hunger level)
6. Measured variables:
- A variable whose levels are simply observed and recorded
- The researcher takes an assessment of each participant’s level on the variable without
trying to alter it
- Example: situation variables (room temperature, hunger level, how many people are in
the room); participant variables (gender identity, age, IQ, personality, socioeconomic
status)
7. Conceptual variables:
- Sometimes called a “construct”
- Must be carefully defined at the theoretical level; when researchers state their
hypotheses, they are usually stated in terms of the conceptual variables
- Example: hunger = having a need or desire for food; love = an emotion in which the
presence or thought of another person triggers arousal, desire, and a sense of caring
for that person
8. Operational definition:
- Also known as operationalizations
- Specifies precisely how the concept is measured or manipulated in a particular study
- Definition of a variable in terms of the specific procedures the researcher uses to
measure or manipulate it
- Example: hunger = the number of hours that the participant went without eating prior to
the study, or a scale rating in response to the question “How hungry are you?”;
love = a rating on a scale from -2 (strongly disagree) to +2 (strongly agree) in response to
the statement “I am in love with my current partner”
3. Causal claims: argue that one variable is associated with another variable, but go further
in arguing that one of the variables is responsible for the other variable. The
association between the variables is causal. The only type of research study that can
provide strong support for a causal claim is an experiment, that is, a study where the
causal variable is manipulated by the researcher and the other variable is measured
- A glass of wine a day may promote mental well-being
- Listening to classical music at a young age improves math ability
- Exposure to violent television increases aggressive behaviour
- Power may lead to rudeness
- Eating blueberries for breakfast can improve cognitive function
Importance of causation:
- A key goal of research studies is to draw conclusions about cause and effect; essentially,
we want to be able to explain the causes of behaviour. For this reason,
understanding causation is highly important, and many of the hypotheses that
researchers test in psychology are about the causal effects of one variable on another
variable
Criteria for causation:
1. Covariance: the study must show that the causal variable and the outcome variable are
related
2. Temporal precedence: the study must show that the causal variable came first in time,
before the outcome variable
3. Internal validity: the study must establish that no other explanations exist for the
relationship between the variables
Four types of validity:
1. Construct validity: how well a variable was measured or manipulated in the study
2. Statistical validity: the extent to which a study’s statistical conclusions are accurate and
reasonable. For example, when a study finds an association between two variables, can
we be sure that it is not just due to a chance connection in that particular sample; in
other words, is it statistically significant?
3. External validity: how well the results of a study generalize to, or represent, people or
contexts beyond those in the original study
4. Internal validity: a study’s ability to rule out alternative explanations for a causal
relationship between two variables (applies only to causal claims)
Prioritizing validities:
- Because it is not possible to achieve all validities at once
- Internal validity is a crucial consideration when we are evaluating a causal claim, but not
applicable to association or frequency claims
- External validity is crucial for frequency claims but is often not prioritized for research
testing causal claims.
Lesson three
Core principles
1. Respect for persons:
- Enabling people to make their own decisions about whether to participate in research,
free from coercion or interference
- The principle is applied primarily through informed consent, which means that potential
research participants should be provided with all information that might influence their
decision
- Participants are provided with an informed consent form
2. Concern for welfare
- Researchers must minimize the risks associated with research participation and
maximize the benefits of the research. The principle is applied primarily through a risk-
benefit analysis
- Need to consider potential risks and benefits that are likely to result from the research
- Only if the potential benefits of the study outweigh the risks involved should the
research be carried out
3. Justice:
- Researchers must treat people fairly and equitably
- They should give participants adequate compensation for participating and make sure
the benefits and risks of a research study are distributed fairly across all the participants
- Principle is applied primarily through recruitment methods that offer participation to a
diverse range of social groups
Research ethics:
- In Canada, researchers and research institutions must follow the code of ethics in the
Tri-Council Policy Statement (TCPS 2); research involving animals is governed by the
Canadian Council on Animal Care
- These ethics codes are based on the three core principles above
The REB:
- A panel of individuals that reviews and monitors all research conducted by researchers
at that institution to ensure that it abides by the TCPS2 guidelines
- Reviews and approves proposed and ongoing research involving humans
Scholarly Integrity: Ethical data analysis and reporting of results
A researcher's ethical responsibilities do not end when the study has been conducted.
Researchers are obligated to maintain integrity through the process of data analysis,
publication, and beyond.
1. Deleting data
Researchers must sometimes remove data from a data set, and it is ethical to do so.
Including data from certain participants – those who did not follow instructions, did
not take the study seriously, or gave nonsensical responses – would undermine the
validity of the study. Thus researchers often discard the data from these participants
in a "data cleaning" process before going on to analyze their data.
The problem arises if researchers' decisions about which cases to discard could be
biased by knowledge of how the study results would be affected. Data cleaning must
always take place before performing the primary data analyses. Researchers are never
permitted to discard data simply because they did not fit the hypothesis or are hard to
explain.
2. Overanalyzing data (p-hacking)
When initial planned analyses don't turn out as expected, researchers often continue
to explore their data, looking for other interesting or useful results.
The problem is that conducting many unplanned statistical tests increases the
likelihood of obtaining "false-positive" results, that is results that are really just due to
chance. The practice of conducting analysis after analysis is often called "p-hacking".
(This refers to statistical significance testing, where a result with a "probability value"
or "p-value" less than .05 is considered significant.)
The key here is that researchers should never pretend their exploratory analyses were
planned from the beginning. If they do find an unexpected but potentially important
result during data exploration, they should use it as an idea for future research, and
conduct a new study to test that idea.
3. Selective reporting (cherry picking)
The key principle here is that researchers are obligated to report all results that deal
directly with the hypothesis that the study was designed to examine. It would be
unethical to engage in "cherry picking" by selectively reporting results that support
the hypothesis but failing to mention results that did not.
4. Post-hoc theorizing (HARKing)
When researchers discover unexpected findings, they may be tempted to act as if the
unexpected finding was predicted all along. This is a problem known as "post-hoc
theorizing" or HARKing (hypothesizing after results are known; Kerr, 1998) that
clearly violates the integrity of the scientific process. Clearly, if a hypothesis is
generated only after seeing the results, there is no possibility of it being disconfirmed
by those results.
Researchers should never pretend that an unpredicted effect was predicted. Instead,
they should acknowledge that the effect was unpredicted and that it should be
interpreted cautiously until it is replicated. They could then go on to design a new
study to test the hypothesis - if the predicted effect is obtained, this time it does not
involve post-hoc theorizing.
Reforms to improve scientific integrity:
In particular, recent reforms have focused on making the entire research process more
public and transparent. These reforms encourage researchers to adopt three specific practices:
1. Full disclosure of information.
Researchers are urged to describe their studies and findings in greater detail than in
previous journal articles
2. Pre-registering methods and data analysis plans.
Researchers are encouraged to pre-register their study's method, hypotheses, and data
analysis plan online, in advance of data collection.
3. Open data and materials.
Types of measures
1. Self-report
Speaking anxiety – participants who have just given an oral presentation are asked to
rate how anxious they were feeling while giving the presentation.
2. Observational
Speaking anxiety – observers watch the speaker deliver an oral presentation and look
for markers of anxiety such as sweating, shaking, stuttering, etc.
3. Physiological
Involves measuring a bodily process that can't be observed directly; usually requires the
use of equipment to amplify, record, and analyze biological data.
Scales of Measurement
There are four different scales of measurement: categorical (also known as nominal),
ordinal, interval, ratio.
Many measures in psychology are treated as interval scales, including IQ test scores,
personality test scores, and scale ratings.
Of course, sometimes this is not possible. For example, qualitative variables can only
be measured on a categorical scale of measurement (e.g., gender, religious affiliation).
To answer this question we will need to consider two key aspects of measurement:
reliability and validity.
Scatterplots
Scatterplots are a type of graph that allow us to visualize the relation between the two
measured variables.
If there is a strong positive relationship between the variables, then the points will tend
to cluster quite closely around a line that slopes upward to the right.
Correlation coefficients
The data depicted in a scatterplot can also be summarized efficiently with a single
number called a correlation coefficient (symbolized with a small letter r) which indicates
how closely the points cluster around a line drawn through them. Correlation
coefficients can range from -1.0 to 1.0 and tell us two important things about the points
in a scatter plot:
1. The direction of the relationship (i.e., slope direction) can be either positive (sloping
upward), negative (sloping downward) or zero (not sloping up or down). If r is a
positive value the relationship is positive, and if r is a negative value the relationship
is negative.
For the purpose of assessing reliability, negative relationships would be rare and
problematic
2. The strength of the relationship
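As a concrete illustration (all numbers invented), Pearson's r can be computed directly from its standard formula; this is a generic sketch, not data from the lesson:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two variables that rise together -> r close to +1 (positive slope)
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [52, 60, 63, 71, 80]
print(round(pearson_r(hours_studied, quiz_score), 2))  # → 0.99

# Reversing one variable flips the sign (negative slope)
print(round(pearson_r(hours_studied, quiz_score[::-1]), 2))  # → -0.99
```

The magnitude of r reflects how tightly the points cluster around the line; the sign reflects the slope direction.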
There are three types of reliability:
1. Test-retest reliability
2. Internal reliability (also known as internal consistency or inter-item reliability)
3. Interrater reliability
Consider the Rosenberg Self-Esteem Scale (RSE; Rosenberg, 1965). The RSE is the most
commonly used measure of self-esteem. It consists of 10 items that are each rated on a
4-point scale.
1. Test-retest reliability
Give the RSE to the same participants on two occasions and look at the correlation
between the total RSE score at Time 1 and the total RSE score at Time 2.
If the test-retest correlation is positive and strong, there is good test-retest reliability.
2. Internal reliability
We are asking: Are all the items on a multi-item scale measuring the same construct?
We want to see that participants give consistent responses to all of the items, despite
the slightly different wordings.
For the RSE scale, we would have a large sample of participants complete the scale at
one time point. Then there are two ways to assess internal reliability:
So, for the RSE scale, we would compute 10 separate item-total correlation coefficients.
If they are all strong and positive we know that all of the items are consistently tapping
into the same construct. If there was a low item-total correlation for an item, this would
suggest that the item is not assessing the same construct as the rest of the items, and
so it should be dropped from the scale.
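The item-total approach can be sketched as follows, using invented responses and a shortened 3-item scale rather than the RSE's 10 items; each item is correlated with the sum of the remaining items (the corrected item-total correlation):

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# rows = participants, columns = items (hypothetical 1-4 ratings on a 3-item scale)
responses = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]

n_items = len(responses[0])
for i in range(n_items):
    item_scores = [row[i] for row in responses]
    rest_totals = [sum(row) - row[i] for row in responses]  # total of the other items
    print(f"item {i + 1}: item-total r = {pearson_r(item_scores, rest_totals):.2f}")
```

If every item-total correlation comes out strong and positive, the items are consistently tapping the same construct; a low value flags an item to drop.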
For the RSE scale, we could compute Cronbach's alpha as a way to know whether
there is good internal reliability.
In summary, if all item-total correlations are high, or if Cronbach's alpha is high (.70 or
higher), this is evidence that there is good internal reliability and that it makes sense to
sum all the items together to create a total score. If there is low internal reliability, the
items should not be summed to create a single total score.
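Cronbach's alpha follows a standard formula, alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. A minimal sketch with hypothetical responses:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(responses):
    """responses: list of participant rows, one column per item."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# hypothetical 1-4 ratings on a 3-item scale (rows = participants)
responses = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]
print(round(cronbach_alpha(responses), 2))  # → 0.93, above the .70 guideline
```

An alpha this high would justify summing the items into a single total score.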
3. Interrater Reliability
Interrater reliability is relevant for observational measures, where two or more raters
have scored or coded participants' responses. We want to see consistency among the
raters in order to conclude that scores on the measure are reasonably independent of
who did the rating.
Quantitative measures (e.g., rating scales):
If there is a strong positive correlation between raters, this means that when Rater 1
gave a high rating, Rater 2 also tended to give a relatively high rating. In other words,
there is consistency across the two raters.
Categorical coding:
When observers rate a categorical variable they indicate whether responses fall into
specific, pre-determined categories. In such cases, we can calculate the percentage of
responses that the raters place into the same categories. Simple percentage agreement
would be calculated as: number of agreements/(number of agreements + number of
disagreements) * 100.
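That formula can be checked with a toy example (the two raters' codes below are invented):

```python
# Hypothetical categorical codes from two raters observing the same responses.
rater1 = ["help", "ignore", "help", "help", "ignore", "help"]
rater2 = ["help", "ignore", "help", "ignore", "ignore", "help"]

agreements = sum(a == b for a, b in zip(rater1, rater2))
disagreements = len(rater1) - agreements
percent_agreement = agreements / (agreements + disagreements) * 100
print(f"{percent_agreement:.1f}%")  # → 83.3%
```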
Another statistic called kappa can also be used and this statistic has some advantages
over simple percentage agreement (e.g., it adjusts for how many agreements would be
expected due to chance). Once again, as with r, a kappa that is closer to 1.0 means that
the two raters showed stronger agreement.
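The kappa referred to here is usually Cohen's kappa, which corrects observed agreement for the agreement expected by chance from each rater's marginal proportions. A sketch with invented codes:

```python
from collections import Counter

def cohens_kappa(codes1, codes2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(codes1)
    p_o = sum(a == b for a, b in zip(codes1, codes2)) / n  # observed agreement
    c1, c2 = Counter(codes1), Counter(codes2)
    categories = set(codes1) | set(codes2)
    # expected chance agreement from each rater's marginal proportions
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in categories)
    return (p_o - p_e) / (1 - p_e)

rater1 = ["help", "ignore", "help", "help", "ignore", "help"]
rater2 = ["help", "ignore", "help", "ignore", "ignore", "help"]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.67
```

Note that kappa (0.67) is lower than the raw 83.3% agreement on the same codes, because some of that agreement would be expected by chance alone.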
The starting point for this interpretation of reliability is to note that every observed score
on a measure consists of two components: true score and measurement error.
True score is the score the participant would receive if the measure were absolutely
perfect.
Measurement error is the result of factors that distort the observed score, so that it
differs from the true score.
No measure is perfect. Every score will contain some degree of measurement error. A
measure that is reliable is one that has a larger "true score" component and smaller
"measurement error" component.
To illustrate, imagine that I have a class of students and I am trying to measure each
student's math ability by giving them a test. Each student has a certain amount of math
ability (i.e., the true score) that we are trying to capture. But there will also always be
some measurement error, as there are many factors that are not about math ability, yet
might influence the observed score on the test.
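This decomposition can be illustrated with a small simulation (all numbers invented): simulated students take a precise test and a noisy test twice each, and the test-retest correlation is high only when the error component is small relative to the true-score component:

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)
# each student's real math ability: the "true score"
true_ability = [random.gauss(70, 10) for _ in range(500)]

def give_test(error_sd):
    # observed score = true score + random measurement error
    return [t + random.gauss(0, error_sd) for t in true_ability]

# the same students tested twice with each kind of test
r_precise = pearson_r(give_test(error_sd=3), give_test(error_sd=3))
r_noisy = pearson_r(give_test(error_sd=15), give_test(error_sd=15))
print(f"precise test: r = {r_precise:.2f}")  # large true-score component
print(f"noisy test:   r = {r_noisy:.2f}")    # large error component
```

The scores on the precise test correlate strongly across the two administrations, while the noisy test's correlation is much weaker, even though the underlying abilities are identical in both cases.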
Where does measurement error come from?
There are many sources of measurement error. Anything that influences a participant’s
observed score on the variable, other than the construct we are trying to measure, is a
source of measurement error.
A question that might occur to you is this: How can we know how much of the person's observed
score is actually coming from their "true score" and how much is coming from "measurement
error"? The answer is that we can't ever really know this for sure, because we never know the
person's "true score" – if we did, we wouldn't need to administer a measure.
But we can estimate the degree to which the scores on the measure are coming from true scores
(rather than from random measurement error). To do this we look for evidence of consistency.
Ways to reduce measurement error:
1. Standardize administration: Ensure that every participant is tested under exactly the
same conditions.
2. Clarify instructions and questions: Make the instructions and items clear and easy to
understand, and pilot test the items beforehand to make sure they are clear.
3. Train observers: For observational measures, provide clear instructions that indicate
concrete behaviors to look for. Have observers practice using the rating instrument
beforehand.
4. Minimize coding errors: Take great care recording and entering data.
Example of the known-groups paradigm: The Beck Depression Inventory (BDI) is a 21-item
self-report scale with questions about symptoms of depression. In order to test the criterion validity of
the BDI using the known-groups paradigm, one could administer the BDI to a group of people
with depression and a group of individuals who were not depressed, as determined by four
psychiatrists who had conducted clinical interviews and provided diagnoses.
Convergent validity: A measure should correlate strongly with other measures of the same
construct; similarity.
Discriminant validity (aka divergent validity): A measure should correlate less strongly with
measures of different constructs; in other words, there must be differences.
More on the distinction between validity and reliability
To evaluate reliability, you compute correlations using just your measure (e.g., correlation of the
measure with itself across time, correlation among parts of your measure).
To evaluate validity, you compute correlations of your measure with other, different measures.
Here are four sources of information about existing measures that can be particularly useful:
1. Journal articles
2. Books and collections
3. Online databases
4. Commercially published scales
Scale points and labels
It is also worth noting that, when researchers use rating scales (e.g., Likert ratings, semantic
differential), they have a few additional issues to consider.
The advantage of a scale with more points (e.g., a 9-point or 11-point scale) is that it could
potentially allow for finer discrimination; that is, it could be more sensitive and capture smaller
variations among respondents.
The disadvantage of using more scale points is that people may be unable to differentiate their
responses at such a fine level of discrimination (e.g., on a 21-point scale could you really judge
whether your rating should be a 16 or a 17?) and thus such differences in scale ratings may not
be meaningful – they may reflect nothing but measurement error.
The advantage of excluding a midpoint is that this prevents participants from gravitating toward
the easy or safe option (see further discussion of the "fence sitting" response set below).
The disadvantage of excluding a midpoint is that sometimes the most accurate and honest
response is in the middle.
The advantage of labeling every scale point is that this can help to more clearly define the
meaning of each point and reduce measurement error from relying on each person's idiosyncratic
interpretations.
The disadvantage is that this can make the scale appear cluttered and cumbersome. Researchers
may find it simpler and more straightforward to label only the scale endpoints, and respondents
are usually able to use such scales without difficulty.
Unnecessary complexity:
Survey questions should be stated as clearly and simply as possible. People should be able to
understand and respond to the questions easily. Avoid vague or imprecise language, jargon, and
technical terms that people might not understand.
Leading questions:
Sometimes two questions seem to be asking the same thing, but they yield very different
responses depending on how they are worded (for example, of the two questions, How fast do
you think the car was going when it hit the other car? and How fast do you think the car was
going when it smashed into the other car?, which question do you think would lead people to
give a higher speed estimate?). If the goal of a survey is to capture respondents’ true opinions,
then questions should be worded as neutrally as possible.
Double-barreled questions:
Asking two questions in one (e.g., “Was your cell phone purchased within the last two years, and
have you downloaded the most recent updates?”). Instead, break it up into two separate questions.
Negative wording:
The more cognitively difficult a question is for people to answer, the more confusion there will
be, which can reduce the construct validity of the item. Using double negatives can make
questions more difficult for respondents to process. Negatively worded questions can reduce
construct validity by adding to the cognitive load, which may interfere with getting people’s true
opinions.
Response sets:
1. Acquiescence (yea-saying) response set: A response set in which people answer
positively (yes, strongly agree, agree) to a number of items instead of looking at each
item individually. Many people have a bias to agree with (say “yes” to) almost any
item—no matter what it states. Acquiescence can threaten construct validity because
instead of measuring the construct of interest, the survey could just be measuring the
tendency to agree or the lack of motivation to think carefully.
Solution:
There are no clear-cut solutions for the problems discussed. As discussed
in the text, researchers need to be aware that people may not be able to
report accurately about their own motivations and memories. Researchers
can't assume that self-reports will be accurate for these kinds of variables,
and so they may want to consider using other measurement approaches
instead.
BEHAVIORAL OBSERVATIONS
Another type of data that can be used to support a frequency claim (as well as association claims
or causal claims) is data from observational research (in which a researcher watches people or
animals and systematically records their actions).
1. Observer bias
When observers’ expectations influence their interpretations of participants’ behaviors
or the outcome of the research. Observers rate behaviors according to their own
expectations or hypotheses instead of rating behaviors objectively.
Example: Let’s say that we all watched a videotape of a man in his mid-20s being
interviewed. Before viewing the video, half of you were told that the man was a graduate
student and half of you were told that he was a criminal. Do you think this might
influence your opinions of this man?
2. Observer effects
When observers change the behavior of the participants to match the observer’s
expectations; also known as expectancy effects.
Example - Bright and dull rats: Psychology undergraduates were given five rats and
told to see how long it took for the rats to learn to run a maze (Rosenthal & Fode, 1963).
Each student was given a randomly selected group of rats. Half of the students were
told that their rats were bred to be “maze-bright” and half were told they were bred to be
“maze-dull.” The rats were actually all genetically similar, but the “maze-bright” rats ran
the maze faster each day with fewer mistakes, whereas the “maze-dull” rats did not
improve their performance over several days of testing. The study demonstrated that
sometimes observers’ expectations can influence the behavior of those they’re
observing.
Reflection Question
How do researchers prevent observer bias and observer effects?
1. Clear codebooks and well trained coders: Having clear rating scales or codebooks
is important (see Figure 6.9). These should indicate the specific behaviors that will be
recorded; it is best if the coders are looking for specific, concrete behaviors rather
than rating their global impressions or subjective interpretations. Also coders should
be well trained and given a chance to practice using the coding system by comparing
and discussing their practice ratings, before actually coding (independently) the study
data.
In order to assess the construct validity of a coded measure, multiple observers can be
used to determine the interrater reliability of the measures. If interrater reliability is
low, then observers might need to be retrained, a clearer coding system for behaviors
needs to be used, or both. Remember too that just because a measure is reliable, that
doesn’t mean it’s valid – but it is an important first step.
2. Masked research design (aka blind design): One way to prevent observer bias and
observer effects is to use a masked design (blind design). In a masked design, the
observers do not know to which conditions the participants have been assigned, and
they are not aware of what the study is about.
3. Reactivity
When people change their behavior in some way when they know that someone else is
watching them.
Reflection Question
What can researchers do to prevent reactivity?
1. Blend in. Make yourself less obvious by sitting in the back of the classroom or by
being in an adjacent room behind a one-way mirror in a setting that’s designed for
such observations. The goal is to make unobtrusive observations to prevent
reactivity.
2. Wait it out. Sometimes it’s a good idea to have research participants get used to your
presence by visiting a number of times so that they become more familiar with you
being around.
3. Measure the behavior’s results. Instead of actually measuring behaviors, sometimes
researchers obtain unobtrusive data by measuring traces of a behavior. For example,
if you were interested in the amount and types of junk food that college students eat
in the dorm, you could look at the wrappers in the dorm’s garbage cans.
Technique for recording behavior
Similar to response formats for self-report measures, the researcher needs to decide on a
technique for recording behavior:
1. Checklists
Researchers often use checklists to record pre-determined attributes and behaviours. This
requires having clear and specific definitions of the categories of behavior to be noted. May
involve checking whether specific behaviors occurred (yes/no) or how often they occurred.
2. Temporal measures
Temporal measures are used to record when a behaviour occurred or for how long. Two common
temporal measures are measures of duration and latency.
Duration refers to the amount of time a specific behaviour lasts. For example, researchers might
measure how long a child cries.
Latency refers to the amount of time that elapses between an event and a behaviour. For
example, reaction time is a measure of the time that elapses between the presentation of a
stimulus and a response. Inter-behaviour latency refers to the amount of time that elapses
between two (or more) occurrences of a behaviour.
3. Rating scales
Rating scales can be used to assess the quality or intensity of a behaviour. Often a 3-, 5-, or 7-
point scale is used.
Sampling decisions
Next, the researcher needs to make a sampling decision – that is, a decision about when
to make the behavioral observations. Below are four options. To make these options more
concrete, let's imagine that a researcher has video-recorded children for a 20 minute
session and wants to measure each child's smiling behavior.
1. Frequency counts
Record every occurrence of the behaviour across the entire session.
e.g., Note how many times the child smiles during the entire 20-minute period.
2. Time point sampling
Recording is done at the end of set time periods, such as every 10 seconds or every
15th minute (like freezing time and then recording whether the behavior is present at
that moment).
e.g., Stop the video after every 30 seconds and note whether the child is smiling right
then, at each of those time points.
3. Time interval sampling
Each behaviour is recorded once during successive intervals in a session. The data
record will simply show whether the behavior occurred during a specified time period
(not how many times it occurred).
e.g., stop the video after every 30 secs and note whether the child smiled at any time
during the last 30 second interval
4. Event sampling
Each time a specified behaviour takes place, observers rate aspects of it.
e.g., stop the video each time the child actually smiles, and rate aspects of the smile
(e.g., intensity, duration)
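The four sampling options above can be sketched in code. Below is a minimal Python illustration using invented smile episodes (start and end times, in seconds) for the hypothetical 20-minute recording; the data and helper names are assumptions for the example, not part of the lesson.

```python
# Hypothetical smile episodes from a 20-minute (1200-second) session.
# Each smile is a (start, end) pair in seconds -- invented data for illustration.
smiles = [(10, 14), (95, 96), (300, 310), (305, 312), (700, 701), (1100, 1105)]

def is_smiling(t):
    """Return True if the child is smiling at time point t."""
    return any(start <= t <= end for start, end in smiles)

# 1. Frequency counts: count every occurrence across the entire session.
frequency_count = len(smiles)

# 2. Time point sampling: check whether a smile is happening at each
#    fixed time point (here, every 30 seconds).
time_points = range(0, 1201, 30)
point_samples = [is_smiling(t) for t in time_points]

# 3. Time interval sampling: note whether a smile occurred at any time
#    during each successive 30-second interval (not how many times).
def smiled_during(lo, hi):
    return any(start < hi and end > lo for start, end in smiles)

interval_samples = [smiled_during(lo, lo + 30) for lo in range(0, 1200, 30)]

# 4. Event sampling: rate an aspect of each smile as it occurs
#    (here, its duration in seconds).
event_ratings = [end - start for start, end in smiles]
```

Note how the four techniques yield different data records from the same session: a single count, a series of yes/no snapshots, a series of yes/no intervals, and one rating per event.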
Lesson five
Once we have collected a set of data, we usually enter the data in a grid format, called
a data matrix. The data here are often referred to as the "raw data" as nothing has been
done yet to organize or summarize them.
Data Matrix 2. Exam scores
Descriptive statistics are statistics that provide a description or summary of the data that
has been collected. They are used to summarize the characteristics of a set of data.
These are sometimes referred to as "one variable" statistics as they are used to
describe data on a single variable (unlike other "two variable" statistics that we will
cover later in the course used to examine whether different variables are related to each
other).
The goal is to present the data we have collected in a manner that is accurate, concise,
and understandable. Of course, the most accurate presentation of data would be a
listing of the raw data itself as in the matrices above – all the information is there, with
nothing missing or distorted. But this is certainly not a very concise (especially for large
samples) or understandable presentation.
Frequency distributions
As a starting point, one thing that researchers typically do to understand their data is to
look at the distribution of scores. That is, how often did each of the possible values of
the variable occur? To do this, researchers typically create a table called a frequency
distribution. These frequency distributions can be created for variables that used any
scale of measurement (categorical, ordinal, interval, ratio).
It can be very helpful to display frequencies using a graph known as a histogram. This is
a graph that lists the possible values of the variable on one axis (usually the horizontal
x-axis) and displays the frequency of each score on the other axis (usually the vertical
y-axis).
Learning Activity
a. Simple frequency distribution
Possible values Frequency
0 0
1 1
2 2
3 2
4 3
5 5
6 7
7 5
8 3
9 2
10 0
b. Table 1. Simple frequency distribution
c. Relative frequency distribution
Possible values Frequency Relative Frequency
0 0 0.0
1 1 3.3
2 2 6.7
3 2 6.7
4 3 10.0
5 5 16.7
6 7 23.3
7 5 16.7
8 3 10.0
9 2 6.7
10 0 0.0
d. Table 2. Relative frequency distribution
e. Figure 1. Histogram
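Tables like the ones above can be generated directly from raw data. Here is a minimal Python sketch; the 30 raw scores are invented, but constructed so that their counts match the frequencies shown in Tables 1 and 2.

```python
from collections import Counter

# Hypothetical raw scores on a 0-10 scale (invented, but matching
# the frequencies in Tables 1 and 2 above; N = 30).
scores = [6, 4, 7, 5, 6, 8, 3, 6, 5, 7, 6, 9, 5, 6, 2, 7, 4, 6, 8, 5,
          1, 6, 7, 3, 9, 8, 5, 7, 4, 2]

freq = Counter(scores)
n = len(scores)

# One row per possible value: (value, frequency, relative frequency in %).
table = [(value, freq[value], round(100 * freq[value] / n, 1))
         for value in range(0, 11)]
```

For example, the score 6 occurs 7 times, giving a relative frequency of 7/30 = 23.3%, which matches Table 2.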
Grouped frequency distributions, histograms and stemplots
In cases where there are many possible values (e.g., scores that range from 1 to 100)
researchers would simplify by grouping the possible values together into intervals (e.g.,
1-10, 11-20, 21-30, etc.) to create a grouped frequency distribution, grouped frequency
histogram, or stemplot.
Learning Activity
Data Matrix 2 above contains exam scores that could range from 1 to 100 obtained from
a sample of 25 participants. For this data set, sketch out each of the following in
your study notes, and then click on my answers to see if yours are the same.
My Answers:
a. grouped frequency distribution
Possible values Frequency
1-10 0
11-20 0
21-30 0
31-40 0
41-50 1
51-60 3
61-70 7
71-80 8
81-90 5
91-100 1
b. Table 3. Grouped frequency distribution
c. grouped frequency histogram
Figure 2. Grouped Frequency Histogram
d. stemplot
Stem Leaves
0
10
20
30
40 3
50 2 5 5
60 1 2 2 2 4 5 6
70 2 2 3 5 7 7 7 8
80 2 3 5 5 6
90 2
e. Figure 3. Stemplot
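The grouped frequency distribution and stemplot above can also be built programmatically. A small Python sketch, using the 25 exam scores that can be read off the stemplot in Figure 3:

```python
from collections import defaultdict

# The 25 exam scores, read off the stemplot above (Figure 3).
scores = [43, 52, 55, 55, 61, 62, 62, 62, 64, 65, 66,
          72, 72, 73, 75, 77, 77, 77, 78, 82, 83, 85, 85, 86, 92]

# Grouped frequency distribution with intervals 1-10, 11-20, ..., 91-100.
grouped = defaultdict(int)
for s in scores:
    lo = ((s - 1) // 10) * 10 + 1   # e.g. 64 falls in the interval starting at 61
    grouped[(lo, lo + 9)] += 1

# Stemplot: stem = the tens part of the score, leaves = the ones digits.
stems = defaultdict(list)
for s in sorted(scores):
    stems[s // 10 * 10].append(s % 10)
```

The resulting counts (e.g., 7 scores in 61-70 and 8 in 71-80) match Table 3 above.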
Other frequency graphs
Frequency polygon: is very similar to a histogram but represents the frequency of
each possible value with a dot rather than a bar, and then uses lines to connect the
dots. Typically the lines are anchored at scores that were not obtained by anyone, just
beyond the range of collected data. This results in a graph that has the appearance of
a polygon.
Frequency bar graph: is a variation of the histogram that is used for measures at the
categorical level of measurement. The difference from a histogram is that a bar graph
uses a separate and distinct bar for each value. The bars don't touch. In contrast, in a
typical histogram the bars touch each other, reflecting the fact that the x-axis represents
a continuous variable. A histogram (or polygon) should only be used if the variable is a
continuous variable at the interval or ratio level of measurement. The bar graph is the
correct graph to use when the x-axis represents a categorical variable or an ordinal
variable. An example of a bar graph is shown below for the frequency of people who
reported driving different makes of cars.
The Shape of a Distribution
The Normal Distribution
You are probably already somewhat familiar with this type of distribution. It is the one
that has a bell-shaped appearance (i.e., a bell-shaped curve) and has the following
characteristics: it is symmetrical and unimodal (a single peak in the middle), the mean,
median, and mode all fall at the centre of the distribution, and frequencies taper off
toward both tails.
Other Distributions
Although the normal distribution appears very commonly, not all distributions of data will
have this shape. There are different types of distributions that sometimes occur, such as
bimodal and skewed distributions.
Bimodal distributions
These distributions have two distinct peaks.
They occur when there are two distinct response tendencies.
Example: There are two distinct groups within a sample (e.g., children and adults) that
score very differently
Example: On some polarizing issues people are either strongly in favor or strongly
opposed, with few in the middle.
Skewed distributions
These distributions are asymmetrical (not symmetrical)
Most scores fall toward one end of the distribution
Distinct tail to either the right (positive skew) or left (negative skew)
Figure 16. Negative skew due to ceiling effect
Frequency Distributions in Research Articles
It is worth mentioning that frequency distributions and frequency graphs are not usually
presented in research articles. They are typically used by researchers primarily as an
intermediary step as they prepare for further statistical analyses.
There are some exceptions. In some cases, the shape of the distribution may be
discussed in an article, especially if the distribution deviates from a normal distribution.
Expanding slightly on the textbook discussion, below are three considerations that can
help you decide whether to rely on the mean to describe central tendency.
Whether to rely on the mean depends very much on the shape of the distribution. In a
normally shaped distribution, the mean, median, and mode will be the same or very
close to one another. However, when a distribution is highly skewed, either negatively
or positively, they will be quite different.
Learning Activity
For the following questions, jot down your answers in your study notes before
comparing to my answers:
a. For the following set of twenty scores, identify the mean, median, mode: [2, 2, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11].
b. Does the median differ from the mean, and if so, why?
My Answer:
a. Mean = 5.5, Median = 5, Mode = 4 and 5.
b. The mean is higher than the median because there is a positive skew, and the mean is
increased by the extreme scores in the right tail of the distribution (whereas the
median isn't influenced by these extreme scores).
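The answer above is easy to verify with Python's standard statistics module:

```python
import statistics

# The twenty scores from the learning activity above.
scores = [2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]

mean = statistics.mean(scores)        # 5.5
median = statistics.median(scores)    # 5 (average of the 10th and 11th scores)
modes = statistics.multimode(scores)  # [4, 5] -- two modes
```

Note that the mean (5.5) sits above the median (5), reflecting the positive skew created by the extreme scores in the right tail.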
Describing Variability
Variability refers to how much the scores tend to differ from the average value.
The mean tells us what the average person scored, but it doesn't tell us how much the
scores tended to differ from that average. In some distributions most people score at the
average or close to it. Other distributions are much more spread out. A distribution has
high variability when the scores differ a lot from the average.
The two most common statistical indexes that are used to capture the amount of
variability in a set of scores are the variance (SD²) and the standard deviation (SD).
Variance – indicates the average amount that each score differs from the mean,
expressed in squared units. (It is the standard deviation squared.)
Standard Deviation – indicates the average amount that each score differs from the
mean, expressed in original scale units. (It is the square root of variance.)
The standard deviation is more commonly reported than the variance because it better
captures how far, on average, each score is from the mean. It is easier to understand
because it is expressed in the original scale units.
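A small worked example in Python, using an invented set of eight scores, shows the relationship between the variance and the standard deviation:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores, chosen for round numbers

mean = statistics.mean(scores)            # 5
# Population variance: the mean of the squared deviations from the mean.
variance = statistics.pvariance(scores)   # 4, in squared scale units
# Standard deviation: the square root of the variance, in original units.
sd = statistics.pstdev(scores)            # 2.0
```

Here the typical score sits about 2 units from the mean, which is easier to interpret than the variance of 4 squared units.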
We can easily turn any score into a z-score as long as we know the mean and the
standard deviation of the set of scores. Just use the following formula: z = (Raw Score –
Mean) / Standard Deviation.
Notice that the z-score can be either positive (indicating the score is above the mean) or
negative (indicating the score is below the mean). For instance, a z-score of -1.6
indicates that the person's score was 1.6 standard deviations below the mean. A z-
score of 2.1 indicates that the person's score was 2.1 standard deviations above the
mean.
1. Summarizes a score's relative standing.
The z-score summarizes exactly where the student's midterm score was relative to the
rest of the class.
In the first case, the person's score was 65, the mean was 40, and the standard
deviation was 12.
z = (65-40) / 12
z = 2.08
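The same calculation can be written as a small Python function (the numbers are from the example above):

```python
def z_score(raw, mean, sd):
    """z = (raw score - mean) / standard deviation."""
    return (raw - mean) / sd

z = z_score(65, 40, 12)
print(round(z, 2))   # 2.08 -- the score is about 2.08 SDs above the mean
```

A raw score below the mean (say, 30 with this mean and SD) would give a negative z-score instead.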
2. Allows us to compare scores from different distributions.
This benefit is explained in the text, with an example that compares one's score on a
midterm with one's score on the final exam. Even though these are two different
distributions of scores, you can compare z-scores to see whether one's relative
standing was better on the midterm or on the final exam.
3. Allows us to combine measures taken on different scales.
Sometimes researchers obtain several measures of the same construct, but cannot
simply average across them because the measures used different scales. For example,
if self-esteem was measured using a one item scale that ranged from 1-10, and also
using a second measure with possible scores from 1-40, you can't simply average
across the two scores because they use different units. However if the scores are all
converted into z-scores (i.e., standardized scores) this creates a common underlying
metric that is based on relative position. It is then reasonable to average the z-scores
and create an overall measure of self-esteem that combines the two scales. Thus
researchers sometimes use z-scores to combine measures that were initially taken on
different scales.
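A minimal Python sketch of this standardize-then-average approach, using invented scores on two hypothetical self-esteem scales (the data and variable names are assumptions for illustration):

```python
import statistics

# Hypothetical self-esteem scores for five people on two different scales.
scale_a = [3, 7, 5, 9, 6]        # a 1-10 scale
scale_b = [12, 30, 22, 38, 28]   # a 1-40 scale

def standardize(scores):
    """Convert raw scores to z-scores (a common metric of relative position)."""
    m, s = statistics.mean(scores), statistics.pstdev(scores)
    return [(x - m) / s for x in scores]

za, zb = standardize(scale_a), standardize(scale_b)

# Averaging each person's two z-scores gives a combined measure of self-esteem.
combined = [(a + b) / 2 for a, b in zip(za, zb)]
```

Averaging the raw scores directly would let the 1-40 scale dominate; averaging z-scores weights the two measures by relative standing instead.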
Population
A population is the entire set of cases that are of interest to the researcher. It is usually
large, containing too many cases to include all of them in the study.
Sample
A sample is the smaller set of cases included in the study. These are selected from a
population and are often intended to represent that population. That is, the goal is to
examine the results from the sample and generalize them to the larger population.
Volunteers (self-selection) – using a sample of people who volunteer to participate. This could
lead to bias through self-selection. For example, if posters were placed around campus asking
for students to volunteer their time for a short attitude survey, the resulting sample would
include only the kinds of students who notice such posters and choose to respond.
Probability Sampling Techniques
In research where external validity is vital (e.g., research making frequency claims)
probability sampling is the best option.
Indeed the key element that distinguishes Probability Sampling techniques from
Nonprobability sampling techniques is the component of "randomness". Random means
that there is no order or pattern that could be predicted in advance.
Simple random sampling
You would begin with a numbered list of every student at WLU. Such a list of all the
cases in the population is known as the "sampling frame" and is the starting point for
simple random sampling. Then you would use a random process to pick out 200
numbers from the list in a random manner, where every student has the same chance
of being selected. This could be done using a random number table, or a computerised
random number generator (e.g., the research randomizer shown in the text, or the
randomize function in Excel).
Cluster sampling
You would begin with a list of "groupings" or "clusters" which might be quite arbitrary.
This could be a list of all the degree programs at WLU (e.g., BMus, BBA, BA, BSc etc;
there are more than 100 programs at WLU). You would randomly select a subset of
those programs (let's say 10 programs).
At that point you could go ahead and contact every student in the 10 selected programs.
That would be cluster sampling.
Stratified random sampling
You would break your overall list of students into four separate categories (i.e., four
strata) and then you would randomly sample from each list.
Oversampling
My Ideas:
This would be a fairly minor variation on Stratified Random Sampling that would be
done if some of the categories are much smaller than others. For example, imagine that
your categories for stratified random sampling were International Students and
Domestic Students. International students make up quite a small proportion of the WLU
population (let's say 5% for example), and so selecting them proportional to their
population membership would leave you with a very small sample of these students. So,
instead, you may decide to sample 20% of the International students to be sure you get
a sufficient sample for this category, even though they represent a smaller proportion of
the overall population. This would be oversampling.
Systematic sampling
My Ideas:
For systematic sampling you would select every "kth" member (e.g., every 17th person)
in the population. For example, you could go through a list of all WLU students and
choose every 17th student for your sample.
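The probability sampling techniques above can be sketched in a few lines of Python. The sampling frame, the two strata, and the sample sizes below are all invented for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# A hypothetical sampling frame: a numbered list of 5,000 "students".
frame = list(range(1, 5001))

# Simple random sampling: every member has an equal chance of selection.
simple = random.sample(frame, 200)

# Systematic sampling: every kth member, starting from a random point.
k = len(frame) // 200              # k = 25
start = random.randrange(k)
systematic = frame[start::k]

# Stratified random sampling: sample from each stratum separately,
# proportional to stratum size (here, two invented strata: 95% / 5%).
domestic, international = frame[:4750], frame[4750:]
stratified = random.sample(domestic, 190) + random.sample(international, 10)

# Oversampling: deliberately sample a larger fraction of the small stratum
# so the final sample contains enough of its members to analyze.
oversampled = random.sample(domestic, 150) + random.sample(international, 50)
```

In every variant, the key ingredient is a random process applied to a complete sampling frame; the variants differ only in how the frame is organized before selection.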
Nonprobability Sampling Techniques
- Non-response biases
The response rate in a survey refers to the percentage of people selected for the
sample who actually complete the survey. Response rates may be low when
researchers are unable to contact participants or when participants do not agree to
participate.
This is another way in which systematic bias could enter a sample, and leave
researchers with a final sample that is not representative. This sort of non-response
bias would happen if those who participate in the survey are systematically different
from those who were selected for the study but did not actually participate.
The margin of error is expressed as an interval that would usually (e.g., 95% of the time)
contain the true population value. For example, you might hear: The margin of error is +
or – 3% in 95% of observations. This means that, if the researchers took a sample like
this many times, the true population value would fall within 3% of the value that was
obtained in the sample (i.e., the population value would fall between 85% and 91%) in
95% of the samples. So a low margin of error tells us that the true population value is
likely very close to the sample value that was obtained.
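For a sample proportion, the 95% margin of error is often approximated as 1.96 × sqrt(p(1 − p)/n). A small Python sketch; the sample value of 88% echoes the 85%-91% interval above, while the sample size of 450 is an assumption chosen so the margin comes out near ±3%:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p, sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# e.g., 88% of an assumed sample of 450 respondents gave a particular answer:
moe = margin_of_error(0.88, 450)        # roughly 0.03, i.e. +/- 3 percentage points

# Diminishing returns: quadrupling the sample size only halves the margin.
moe_big = margin_of_error(0.88, 1800)   # roughly 0.015
```

Notice that the formula contains n but not the population size, which is why a well-drawn sample of about 1,000 works even for a population of millions.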
1. The sample size – with larger samples there is less sampling error. However, there are
diminishing returns to increasing the sample size. For example, if you have a
sample of 2,000 participants, there would be little to be gained by adding another 1,000
participants. Consequently researchers typically seek an "economic sample" that provides
a balance between statistical accuracy and polling costs.
2. The population size – with larger populations there is greater sampling error.
But here again the impact on the margin of error is much less than people usually
think. People often mistakenly believe that if the population is very large (e.g.,
millions of people) you also need to draw a very large sample. But that is simply not
the case. A sample of about 1,000 people still does a good job if it is representative.
3. The variance of the data – with greater variability there is greater sampling error.
Sampling error (Statistical Validity) vs. Representativeness (External
Validity)
When we calculate a margin of error we are evaluating the statistical validity of the
data – in this case how close the sample value is likely to be to the true value in the
population.
This is separate from evaluating the external validity of the study. The external validity
of the study is determined by whether the sample is representative or not. For external
validity, it doesn't really matter whether you have a large sample or not. Just having a
larger sample doesn't make it more representative. What matters most for external
validity is whether the sample was drawn using a probability sampling technique. In short,
what matters most is HOW, not HOW MANY.
When is external validity a priority?
When researchers are making a frequency claim, then the external validity of the study
is usually crucial.
However, when researchers design a study to test association or causal claims (e.g.,
that one variable affects another variable), the external validity of that particular study is
not top priority.
Lesson 6
To see whether two variables are indeed correlated, we use slightly different techniques
depending on whether we are working with quantitative or categorical variables.
Table 1. Scores of ten participants on two variables
Participant X Y
1 8 7
2 2 1
3 4 3
4 1 1
5 5 6
6 7 5
7 9 6
8 6 5
9 10 9
10 3 2
It is much easier to see whether the variables are correlated if we first rearrange the scores in the
data columns, putting them in order (from lowest to highest) on the X variable, as seen in Table 2
below. We must be sure to keep the two scores from the participants paired together on the same
row. Then we can ask: As the scores are increasing on the X variable (across participants), do
we also see a tendency for scores to be increasing, systematically on the Y variable?
This appears to be the case in our example. After placing the scores on X in order (from
lowest to highest), there appears to be a systematic pattern in the scores on Y - the scores are
generally quite low at first and tend to increase as you move down through the column. That is,
as the scores on X are increasing across the participants, the scores on Y are also tending to
increase across the participants.
Table 2. The same scores, ordered from lowest to highest on X
Participant X Y
4 1 1
2 2 1
10 3 2
3 4 3
5 5 6
8 6 5
6 7 5
1 8 7
7 9 6
9 10 9
Scatter plots:
Recall that in a scatterplot, one of the variables is represented on the horizontal axis (X axis) and
the other is represented on the vertical axis (Y axis). Each point on the scatterplot shows where a
participant scored on the two variables. In our example, the points slope upward to the
right, indicating that there is a positive correlation between the two variables.
Perfect correlations
Perfect positive correlation
There is one pattern that we would call a "perfect positive correlation". We would almost never
see a correlation like this in real data, but knowing what is meant by a perfect positive correlation
can help us understand the concept of correlation. This is the strongest positive correlation
possible.
If there is a perfect positive correlation, this means that every time the X values increase by a
unit (as you move from one participant to another), the Y values also increase by a fixed amount.
When plotted in a scatterplot, then, the data points will all fall on a straight line sloping upward.
The correlation is stronger to the extent that the points in a scatter plot are closer to an
imaginary straight line sloping upward or downward. If the correlation is strong, the
points tend to fall close to the line. If the correlation is weak, there is still a systematic
slope, but the points do not cluster as closely around the straight line. The absolute
value of the correlation coefficient, r, is higher when the data points fall closer to a
straight, sloping line.
However, for this type of study (where the x-variable is categorical, such as the
children's sex), researchers would be much more likely to simply compare the mean
score on the y-variable across the two different groups (i.e., across
the two levels of the x-variable). They might also use a bar graph to present the means
visually as in the example bar graph in Figure 7 below. This allows us to examine
visually how much the means differ in the two groups. Or they could just present the
means in the written description of the results, indicating how different they were.
Figure 7. Bar graph displaying mean score on expressiveness for each group.
No matter how the means are presented, the important point is this: If the means differ
across the two groups, then the two variables are related. For instance, because the
average level of expressiveness was much higher for females (M = 8.00) than males
(M = 4.25), this indicates that there was a strong relation between the children's sex and
their emotional expressiveness. If the average scores had been the same in the two
groups, this would mean that there was no relation between the children's sex and their
emotional expressiveness.
Two Categorical Variables - Comparing frequencies/proportions
across groups
Imagine that a researcher wants to know whether there is an association between
children’s sex (male or female) and whether they display a specific type of reading
disability (yes or no).
In such cases, researchers would use a contingency table to see whether the two
variables are associated. A contingency table depicts the number of people at each
combination of the X-variable and the Y-variable.
More specifically, the rule is this: compare the proportion of the people in the entire first
column who are at the first level of the y-variable (10 out of 30 = 33.3%) with the
proportion of people in the entire second column who are at the first level of the y-
variable (20 out of 100 = 20%).
If the proportions differ there is an association between the two variables. The greater
the difference between proportions, the stronger the association.
           X Level 1   X Level 2
Y Level 1      10          20
Y Level 2      20          80
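The column-proportion rule can be applied to the contingency table above in a few lines of Python:

```python
# The contingency table from above: rows = Y levels, columns = X levels.
table = [[10, 20],   # Y level 1
         [20, 80]]   # Y level 2

# Total number of people in each column of X.
col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]   # [30, 100]

# Proportion at Y level 1 within each column of X.
p1 = table[0][0] / col_totals[0]   # 10/30  = 0.333
p2 = table[0][1] / col_totals[1]   # 20/100 = 0.200

# Different proportions -> the two variables are associated;
# the larger the difference, the stronger the association.
associated = abs(p1 - p2) > 0
```

Here 33.3% versus 20% indicates an association between the two categorical variables.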
If you were trying to find the regression line visually, you would try to draw it so that it
cuts through the points with about half the points below it and half above it. Better still,
the line can be found mathematically using regression analysis, a technique that finds
the line minimizing the squared distances between the line and all of the data
points.
An even better way to do this is to make use of the mathematical equation for the
regression line (i.e., the regression equation: y = β0 + β1x).
This regression equation might look new to you, but it is just like the standard equation
for a line [y = mx + b]. The regression equation tells us the slope of the line (β1 regression
coefficient) as well as where it passes through the y-axis (β0 regression constant). To predict a
person's score on the y variable, then, you just need to plug the person's x-score into
the equation.
Note that the slope of the regression line indicates how much the y variable changes for
every unit of change in x. In cases where there is no relation between the two variables,
the regression would be a horizontal line that falls at the mean of the y variable. In this
case, having information about a person's x-score does nothing to improve our
prediction of their y-score. The best you could do in that case is predict the mean score
on y (i.e., the most typical score for anyone).
The correlation coefficient, r, tells us how accurate our predictions would be – with
stronger correlations the predictions will be more accurate. Regression analysis is the
technique that we would use to actually make the predictions.
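A minimal sketch of finding the least-squares regression line, reusing the ten (X, Y) pairs from Table 1 earlier in the lesson:

```python
# Least-squares regression line y = b0 + b1*x for the Table 1 data:
# b1 is the slope, b0 is the intercept (where the line crosses the y-axis).
x = [8, 2, 4, 1, 5, 7, 9, 6, 10, 3]
y = [7, 1, 3, 1, 6, 5, 6, 5, 9, 2]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: sum of deviation cross-products over the sum of squared X deviations.
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx   # the line always passes through the point (mean x, mean y)

def predict(x_score):
    """Predict a person's y-score from their x-score."""
    return b0 + b1 * x_score
```

With these data the slope is about 0.83, so each one-unit increase in X predicts roughly a 0.83-unit increase in Y; if there were no relation, the slope would be zero and every prediction would simply be the mean of Y.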
Learning Activity
A researcher designed a skills test to give to job applicants in order to predict which
applicants are most likely to succeed on the job (as measured by employee
performance appraisals). The researchers then conducted a correlational study on a
sample of current employees and found that those who scored higher on the skills test
tended to have better job performance (the correlation coefficient was above zero).
1. Effect size
My Answer:
We would want to know how strong the association was between scores on
the skills test and job performance. We would be concerned if there was only
a very small correlation. If there is only a small or weak correlation, then
knowing an applicant's score on the screening test would not allow you to
predict their performance on the job with a high degree of accuracy. Generally
speaking, a very small correlation is not as important a finding as a larger
one.
2. Significance
My Answer:
We would want to know whether the correlation was statistically significant –
that is, whether an association this strong would be unlikely to have occurred
by chance alone if there were really no association in the population.
3. Outliers
My Answer:
We would want to know whether there are any outliers in the data – that is
people with rare, extreme scores (which we could see easily by examining a
scatterplot). This would be especially problematic if a participant has extreme
scores on both of the variables (not just one of them), and when the sample
size is smaller. In such cases even one or two outliers could have artificially
increased or decreased the correlation that was found. For example, in a
small sample, the correlation coefficient might be inflated a great deal due to
even one participant who had an extremely high score on the skills test along
with an extremely high score on job performance, even though there was little
association between the variables in the rest of the sample.
4. Restriction of range
My Answer:
We would want to know whether there was a full range of scores on both of
the variables. If there was not a full range of scores on one of the variables,
this could have reduced the correlation that was obtained, so that it
underestimated the true correlation between the two measures. In our
example, this could occur if applicants were hired only if they have very high
scores on the skills test. In that case there would be a very restricted range of
scores on the screening test (only high scores) rather than a full range of
scores. This would tend to reduce the correlation that was found, and would
lead us to conclude erroneously that the test was only weakly related to
performance when in fact the true correlation between these variables may
be much stronger.
5. Curvilinear association
My Answer:
We would want to know whether there was any curvilinear association in the
data, and we could look for this on a scatterplot. A curvilinear association would
mean that increases on the screening test were associated first with systematic
increases, and then with systematic decreases, in job performance (or the
reverse). For example, it could be that at the low end of the test scores,
people who scored higher on the skills test had better job performance (a
positive association) but then once they hit a certain point, further increases
in the screening test were associated with lower job performance (a negative
association). This type of curvilinear pattern is not detected by the statistics
we have discussed such as the correlation coefficient. The correlation
coefficient r is able to assess only "linear relations", not "curvilinear relations".
In this case, the correlation coefficient would be close to zero even though the
two variables are quite related – just not in a linear way.
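A quick demonstration of this limitation: for a perfect inverted-U pattern (hypothetical data), the correlation coefficient comes out to zero even though Y depends entirely on X.

```python
import math

# A perfect inverted-U (curvilinear) relation: y rises, peaks, then falls.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [-(xi ** 2) for xi in x]   # y = -x^2, invented data for illustration

n = len(x)
mx, my = sum(x) / n, sum(y) / n
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                * sum((yi - my) ** 2 for yi in y))
r = num / den
# r is 0: the linear correlation completely misses the curvilinear pattern,
# which is why inspecting a scatterplot matters.
```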
Recall that, in order to support a causal claim, a study has to satisfy these three criteria:
1. Covariance: There must be a correlation between the cause variable (X) and the
effect variable (Y).
2. Temporal precedence (the directionality problem): The causal variable (X) must
come before the effect variable (Y). If we can't be sure which came first, we can't
infer causation.
3. Internal validity (the third-variable problem): There must be no other plausible
alternative explanation for the relationship between the two variables. If there is a
third variable (Z) that could influence both X and Y independently, then we can't
infer causation.
Learning Activity
Scenario 1:
A professor at WLU has found a strong positive correlation between how often students in a
course attend lectures and their grades in the course. The professor concludes that lecture
attendance leads to higher grades in the class.
My Answer:
1. Directionality problem
Getting higher grades (e.g., on early assignments or midterms) might lead people to
go and attend more lectures.
2. Third variable problem
There are many plausible third variables. But keep in mind that to be a plausible third
variable, you must identify something that could sensibly influence BOTH of the
measured variables in the original correlation. That is, you should be able to explain
how the third variable might reasonably influence lecture attendance, and also how it
might influence course grades. Below are just a couple of possibilities.
Motivation – people vary in how motivated they are. Having higher motivation could
lead people to attend more lectures. Having higher motivation might also lead people
to get better grades (regardless of lecture attendance). In this case it could be that
lecture attendance had no causal effect on grades whatsoever (i.e., the correlation
between attendance and grades was "spurious")
Free time – people vary in how much free time they have. Having more time
available may lead people to attend more lectures. Having more time available might
also help people to get better grades regardless of lecture attendance (e.g., because of
more time to study). In this case it could be that lecture attendance had no causal
effect on grades whatsoever. (i.e., the correlation between attendance and grades was
"spurious")
When third variables are a problem for internal validity
In such cases the original correlation (between wine consumption and satisfaction) is called
a SPURIOUS CORRELATION, meaning it is misleading: the correlation does exist, but only
because of the third variable.
There can be some instances, however, where a third variable is not an internal validity problem.
This could occur if an identified third variable does not account for the initial correlation. For
example, it may be that wealth is indeed associated positively with wine consumption, and with
life satisfaction. But even so, when we look at the data closely, wealth does not account for the
original correlation between wine consumption and satisfaction. That is, the original correlation
can still be seen at each level of the third variable. For example, if you look within a group with
high wealth, there is still a correlation between wine consumption and happiness; and if you look
within a group with low wealth there is still a correlation between wine consumption and
happiness. So there is a potential third variable, but in this case it does not pose a problem for
internal validity.
The bottom line, however, is that whenever we have correlational findings we cannot know for
sure whether the findings might be due entirely to some third variable. Whenever the research
findings are correlational, we need to do a lot more digging and ask more questions before we
could be in a position to draw any causal conclusions.
1. X causes Y (causation)
2. Y causes X (reverse causation; directionality problem)
3. Z causes both X and Y (third-variable problem) (correlation between X and Y is
spurious; see figure below)
Figure 11. Z causes both X and Y (the correlation between X and Y is spurious).
In the textbook we are also introduced briefly to the idea of moderating variables.
Whenever the relationship between two variables changes depending on the level of
another variable, that other variable is called a "moderator" or a "moderating variable".
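As a small illustration of moderation (all numbers invented), the correlation between two variables can be computed separately at each level of the suspected moderator; if the correlations differ across levels, that variable moderates the relation:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# Hypothetical bodyweight / self-esteem scores, split by gender
weight_men,   esteem_men   = [60, 70, 80, 90], [2, 4, 6, 8]
weight_women, esteem_women = [60, 70, 80, 90], [8, 6, 4, 2]

r_men = pearson_r(weight_men, esteem_men)        # positive relation for men
r_women = pearson_r(weight_women, esteem_women)  # negative relation for women
print(round(r_men, 2), round(r_women, 2))
```

Here gender moderates the relation: the sign of the correlation flips across the two subgroups.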
Learning Activity
Scenario 1:
If we found that gender moderates the relation between bodyweight and self-esteem,
what might that mean?
My Answer:
It means that the association between bodyweight and self-esteem changes depending
on gender.
Lesson 8
In a longitudinal study, the researcher measures the same variable in the same
people at two or more time points – often to see how the variable of interest changes
across time. In a longitudinal correlational study, two variables are measured at each
time point. Note that this is considered a “multivariate” design rather than “bivariate”
because you end up with more than two measured variables. For example, if you looked
at two measures at four time points, you would collect 8 measured variables from each
participant.
One of the key reasons that researchers use longitudinal designs is that they can help
researchers arrive at causal conclusions – whether one variable influences another. Recall
that the bivariate correlations we looked at last lesson were good at showing covariance
but not at establishing temporal precedence or internal validity. Longitudinal studies can
help rule out the “directionality” problem and thus help the researcher to establish the
“temporal precedence” of a particular variable.
To illustrate, let’s work with a research example that was not covered in the textbook,
but has interested psychologists for decades: Does exposure to violence on TV lead
children to become more aggressive themselves? In several bivariate correlation
studies, researchers have measured both how much violent TV children watch (TV
violence) and how aggressively they behave (aggression), and they usually find a
positive association – the more violent TV that kids watch the more aggressively they
behave.
But at this point a little voice in your head should be screaming “correlation does not
indicate causation”. And you are right. This bivariate correlation establishes covariance,
but does not rule out directionality and third variable problems. So it does not support a
causal conclusion.
Figure 1. Longitudinal study examining the relation of exposure to TV violence and aggression.
Three Types of Correlations
Autocorrelations: The autocorrelation is the correlation of each variable with itself across
time, and tells us whether people who are higher on the variable at Time 1 are also
higher at Time 2. For example, the autocorrelation for aggression (r = .38) tells us that
the boys who were more aggressive at Time 1 also tended to be more aggressive at
Time 2.
Cross-lag correlations: The cross-lag correlation indicates the degree to which an earlier
measure of one variable is associated with the later measure of the other variable.
Researchers are most interested in these correlations because they can help to
establish temporal precedence. That is, the cross-lag correlations allow us to conclude
with confidence that the one variable came before the other. Depending on the pattern
of the two cross-lag correlations, we would draw different conclusions – you can test
yourself on how to interpret different outcomes of cross-lag correlations in the
Learning Activity below.
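The cross-lag logic can be sketched in Python with invented scores (five children, each measured at two time points); the helper below assumes nothing beyond the standard Pearson formula:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# Invented scores for five children at two time points
tv1  = [1, 2, 3, 4, 5]   # TV violence, Time 1
agg1 = [2, 3, 1, 5, 4]   # aggression, Time 1
tv2  = [4, 1, 3, 5, 2]   # TV violence, Time 2
agg2 = [2, 1, 3, 5, 4]   # aggression, Time 2

r_tv_to_agg = pearson_r(tv1, agg2)  # earlier TV violence -> later aggression
r_agg_to_tv = pearson_r(agg1, tv2)  # earlier aggression -> later TV violence
print(round(r_tv_to_agg, 2), round(r_agg_to_tv, 2))
```

With this invented pattern (a substantial Time 1 TV → Time 2 aggression correlation, but only a weak correlation in the other direction), the data would resemble the first interpretive outcome discussed in the Learning Activity.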
Learning Activity: Interpreting possible outcomes of cross-lag
correlations
The questions below present three possible outcomes of the cross-lag correlations: the
outcome depicted in Figure 1, and two other possibilities. For each of these outcomes,
briefly describe what it means.
Outcome 1: In Figure 1, Time 1 TV violence was significantly correlated with Time 2
aggression, whereas Time 1 aggression was not correlated with Time 2 TV violence.
How do you interpret this outcome?
My Answer:
This pattern supports the idea that watching more violent TV leads to increases in
aggressive behavior. Most importantly, it establishes temporal precedence because it
rules out the problem of reverse direction. It can’t be the case that being more
aggressive at Time 2 is what led the kids to watch more violent TV at Time 1.
Outcome 2: What if the study had instead found that Time 1 aggression was positively
correlated with Time 2 TV violence, but Time 1 TV violence was not correlated with
Time 2 aggression?
My Answer:
This pattern would have indicated that aggression came first, leading to greater
watching of TV violence later. That is, the kids who were already more aggressive at a
young age went on to have a greater preference for violent TV.
Outcome 3: What if the study found that both correlations were significant – that TV
violence at Time 1 predicted aggression at Time 2, and that aggression at Time 1
predicted TV violence at Time 2?
My Answer:
This pattern would suggest that viewing TV violence and acting aggressively are
mutually reinforcing – that they both influence each other.
Is there any way to rule out the suspected third variable (SES)? Yes, multiple
regression analysis is a way of statistically controlling for possible third variables. To
perform these analyses, you need to measure more than just the two original variables;
you also need to include a measure of the suspected third variable. More generally, the
following are the steps needed for multiple regression analyses.
1. Measure the suspected third variable.
You will need to measure more than just the two original variables; you must also
include a measure of the suspected third variable. [e.g., include a measure of SES]
2. Conduct analyses that control for the third variable.
Compute multiple regression analyses that tell you the relation between the two
original variables after “controlling for” the third variable. Conceptually you are
asking whether the original bivariate correlation appears in “subgroups” representing
the levels of the third variable [e.g., subgroups of students at different SES levels]. In
each subgroup, people have the same level of the third variable. So you are
conducting a correlation for a group of people where that third variable has been held
constant.
3. Interpret the pattern of results.
If the bivariate correlation remains within each of the “subgroups”, this indicates that
it was NOT due to the suspected third variable. This is because the correlation
appears even when that variable has been “controlled” or “held constant”. However,
if the bivariate correlation does not appear within the subgroups, this suggests that the
original bivariate correlation is explained by the suspected third variable.
Understanding what it means to control for a third variable
Let’s take a moment to consider these interpretations further, by looking at some
hypothetical data from a small sample of student participants (n = 9). As Table 1
shows, in this sample of students the original bivariate correlation between sports
involvement and academic success is very strong and significant (r = .85, p < .05). But
again, recall that there could be a third variable such as SES that explains this bivariate
correlation.
Sports involvement   Academic success
        1                   3
        2                   1
        4                   3
        4                   3
        5                   6
        8                   4
        9                   7
        9                   9
       10                   8
r = .85
Table 1. Original bivariate relation between sports involvement and academic success.
Fortunately, the researchers also included a measure of the participants’ SES with three
levels (1 = low, 2 = moderate, 3 = high). This allows the researchers to “control” for this
suspected third variable. That is, they can look at the relation between sports
involvement and academic success within “subgroups” of people who have the same
SES level. As Table 2 shows, there is no relation, or very little relation, within each of
the subgroups. So, in the analyses that control for SES (i.e., hold SES constant) there is
no longer an association between sports involvement and academic success. This
pattern of results suggests that the original bivariate correlation between sports
involvement and academic success was due to the third variable of SES levels. If
instead the correlation had been substantial at each level of SES, this would have
indicated that SES was not a likely third variable.
SES   Sports involvement   Academic success   r within subgroup
 1            1                   3
 1            2                   1                 .19
 1            4                   3
 2            4                   3
 2            5                   6                 .05
 2            8                   4
 3            9                   7
 3            9                   9                 .00
 3           10                   8
Table 2. Relation between sports involvement and academic success controlling for SES.
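The subgrouping logic can be checked directly against the data shown in Tables 1 and 2. A minimal Python sketch using the nine rows above:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# (SES, sports involvement, academic success) for the n = 9 students in Tables 1-2
rows = [(1, 1, 3), (1, 2, 1), (1, 4, 3),
        (2, 4, 3), (2, 5, 6), (2, 8, 4),
        (3, 9, 7), (3, 9, 9), (3, 10, 8)]

sports = [r[1] for r in rows]
academic = [r[2] for r in rows]
print(round(pearson_r(sports, academic), 2))   # 0.85: the original bivariate correlation

# "Controlling for" SES: compute r separately within each SES subgroup
within = {}
for ses in (1, 2, 3):
    sub = [r for r in rows if r[0] == ses]
    within[ses] = pearson_r([r[1] for r in sub], [r[2] for r in sub])
    print(ses, round(within[ses], 2))  # the association all but vanishes in each subgroup
```

The strong overall correlation (.85) disappears within the SES subgroups (.19, .05, .00), which is exactly the pattern that points to SES as the explanation for the original correlation.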
The researchers want to know whether a third variable can account for the bivariate
relationship between the two variables of interest. To answer the question, they see
what happens when they statistically control for the third variable. That is the basic logic
underlying multiple regression analysis.
Table 3. Results of a multiple regression analysis for a study examining predictors of academic
success.
Beta basics: Beta is similar to r. A positive beta indicates a positive relationship
between that predictor variable and the criterion variable when the other predictors
are statistically controlled for. A negative beta reflects a negative relationship when
the other predictors are controlled for. A beta that is zero or not significantly different
from zero suggests that there is no relationship when the other predictors are
controlled for.
Similarities between beta and r: direction (positive or negative) and strength (the closer
to -1 or +1, the stronger the relationship; the closer to zero, the weaker the relationship).
You may compare beta strengths within a single regression table, but you may not
compare beta strengths across regression tables. There are no absolute cutoffs for beta
effect sizes as we have with r and Cohen’s d cutoffs. Sometimes a regression table will
report b instead of beta; b is an unstandardized coefficient. You cannot compare two b
values even within the same table.
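For the special case of two predictors, standardized betas can be computed by hand from the three pairwise correlations. The r values below are hypothetical (they are not the Table 3 data), chosen only to show the arithmetic:

```python
# With standardized variables and two predictors, each beta follows directly
# from the three pairwise correlations. All r values here are hypothetical.
r_y1 = 0.50   # criterion with predictor 1
r_y2 = 0.40   # criterion with predictor 2
r_12 = 0.30   # predictor 1 with predictor 2

beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
print(round(beta1, 2), round(beta2, 2))  # 0.42 0.27
```

Note that beta1 (.42) is smaller than the raw correlation r_y1 (.50): controlling for a correlated predictor typically shrinks the association, which is the "controlling for" logic described above.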
Interpreting beta: Note that in Table 3 sports involvement has a beta of .39. This
positive beta, like a positive r, means that higher levels of sports involvement go with
higher levels of academic success, even when we statistically control for the other
predictor in this analysis—SES. The other beta, associated with the SES predictor
variable, is also positive. This beta means that higher SES is associated with higher
academic success, when sports involvement is controlled for.
Statistical significance of beta: In a regression table, there is usually a column
labeled p or sig, or an asterisk footnote with p values. When p is less than or equal to
.05, the beta is statistically significant. When p is greater than .05, the beta is not
significant. In Table 3, both of the betas are significant.
Research Study: Dr. Nguyen is a psychologist who studies legal decision making.
Specifically, he is curious about the factors that are irrelevant to the crime committed
that might influence the sentences juries give to defendants (known as extra-legal
factors). To study this further, he samples a group of jury-eligible adults from the
Toronto area. He provides them with the fact pattern to a particular case and allows
them to watch the closing statements from the trial. He then asks them to provide a
sentence (in months) for the defendant. In addition, he measures two legal factors (the
number of arguments made by the prosecuting attorney and the length of time the
defense attorney speaks during his or her closing argument) and two extra-legal factors
(how attractive the participants think the defendant is [higher scores indicate higher
ratings of attractiveness] and how many legal television shows the participants watch).
The data are below in Table 4.
Answer each of the questions below, before checking your answer against mine.
Criterion: Length of Sentence Provided
Predictor Variable      Beta      Significance (p)
Table 4. Results of a multiple regression analysis for a study examining factors that
influence legal decision making
My Answer: Length of sentence
Which predictor variables are related significantly to the length of sentence (when we
statistically control for the other predictors)?
Explain how the variable attractiveness of the defendant relates to criminal sentencing,
considering the direction of the relationship, statistical significance, and the strength of
the relationship compared with the other variables.
Moderator: Gender
2. Drinking wine more often is associated with greater life satisfaction but only because
wealthier people can afford to drink wine more often, and being wealthy is also
associated with greater life satisfaction.
My Answer:
Mediator: Socializing
4. Having a mentally demanding job is associated with cognitive skills in later years
because people who are highly educated take mentally demanding jobs, and people
who are highly educated have better cognitive skills.
My Answer:
Moderator: Gender
6. Having a mentally demanding job is associated with cognitive skills in later years
because cognitive challenges build lasting connections in the brain.
My Answer:
Mediator: loneliness
9. Sibling aggression is associated with poor childhood mental health only because of
parental conflict. Sibling aggression is more likely among parents who argue more,
and parents’ arguing also affects kids’ mental health.
My Answer:
Lesson 9
What is an experiment?
An experiment is a study in which researchers manipulate at least one variable
and measure another variable. In a “simple” experiment there is just one manipulated
variable; later in the course we will see experiments with more than one manipulated
variable. The goal in a simple experiment is to test whether the manipulated variable
has an effect on the measured variable(s). Thus the key factor that distinguishes an
experimental design from the correlational designs that we have looked at previously is
that the presumed causal variable is manipulated by the researcher rather than only
measured.
Experimental variables
Independent variable (IV):
This is the presumed causal variable that is manipulated by the researcher. This means
that the researcher assigns each participant to a particular level of the variable – to a
“condition” in the experiment. The different levels of the IV are the “conditions” in the
experiment. In a graph of results, the independent variable will be on the x-axis.
Control variables:
This term refers to potential variables that are held constant on purpose by the
researcher. That is, the researcher makes sure that every participant has the same level
of the variable – for example, the same room, same lighting, same time of day, same
experimenter giving the instructions, same computer keyboard for responding, etc.
As you may have noticed, this is just like testing for associations in
correlational studies with one categorical variable and one quantitative
variable. Again, we are comparing means to establish covariance – the
greater the difference between means across conditions, the stronger the
association between the IV and the DV.
Example: If the researchers find that the mean creativity score is much
higher in the alcohol condition than in the control condition, they have
established covariance. They have shown that alcohol consumption is
associated with creativity.
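A toy sketch of this comparison (creativity scores invented for illustration):

```python
from statistics import mean

# Hypothetical creativity scores (0-10) in each condition
alcohol_scores = [7, 8, 6, 9, 8]
control_scores = [4, 5, 3, 5, 4]

diff = mean(alcohol_scores) - mean(control_scores)
print(round(diff, 1))  # 3.4: the larger the gap, the stronger the IV-DV association
```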
1. Experiments Establish Temporal Precedence
My Answer:
In experiments we can be sure that the cause variable precedes the effect
variable. So we can totally rule out the possibility of reverse causation,
because the experimenter has manipulated the IV. It can’t be argued that the
participants’ standing on the IV may have been influenced by their standing
on the DV – we know that it was the experimenter that determined which level
of the IV each participant received, by assigning each participant to a level.
Recall that to make a causal claim you also need to be able to rule out third
variable explanations and be confident that it was the IV (and not some other
variable) that caused the difference in means on the DV across conditions.
You need to be sure that the difference in means across conditions was due
to the IV and only the IV. To do this, you need to be able to rule out
“confounds” or “threats to internal validity” discussed in the next section.
Example: If the experiment is done well, there will not be any differences
between the two conditions other than the level of alcohol consumption. If that
is the case, the experiment is high in internal validity and we can be sure it
was the alcohol consumption that created the differences in creativity across
the two conditions.
Design confounds
A design confound occurs in an experiment when some extraneous variable (a
variable other than the IV) varies systematically along with the IV and thus provides an
alternative explanation for the results. That is, a design confound exists when the
conditions differ systematically (i.e., differ on average) on some variable, other than the
independent variable, that might have affected the DV. If a design confound is present,
it threatens internal validity and the experiment does not support a causal claim.
Selection effects
A selection effect occurs in an experiment when the participants in one condition are
systematically different than the participants in the other condition(s). Selection effects
can arise in cases where participants are free to select which level of the IV they
receive, or in cases where the researcher assigns participants to the conditions but not
using random assignment. In such cases, it could be that different types of participants
end up in each condition. That’s a big problem when we go to interpret the results. We
cannot know whether a difference in means across conditions at the end of the study
was caused by the IV, or whether it was found because there were different types of
people in the conditions in the first place.
Fortunately, researchers can avoid selection effects by using random
assignment. Random assignment is a way of assigning participants to the different
conditions (i.e., different levels of the IV) such that every participant has an equal
chance of being assigned to each condition. If random assignment is used, there should
be no systematic differences between conditions prior to the manipulation of the
independent variable.
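A minimal Python sketch of random assignment (participant IDs and condition labels are invented; the shuffle-then-deal variant shown here also guarantees equal group sizes):

```python
import random

random.seed(42)  # seeded only so the sketch is reproducible

participants = list(range(1, 21))  # 20 hypothetical participant IDs
random.shuffle(participants)       # every ordering is equally likely

# Deal the shuffled list alternately into the two conditions
alcohol = participants[0::2]
control = participants[1::2]

print(len(alcohol), len(control))                       # 10 10
print(sorted(alcohol + control) == list(range(1, 21)))  # True: everyone assigned exactly once
```

Because the shuffle is random, each participant is equally likely to land in either condition, so pre-existing differences between people should balance out across groups on average.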
Learning Activity: Spot the confound
1. In a market research study, participants were randomly assigned to receive a 6-oz
serving of either Pepsi or Coke, and were asked to rate how much they liked the cola
on a scale from 1 to 10. To help the researchers keep track of the brand that each
participant was assigned to taste, the Pepsi was always served in a blue cup and Coke
in a red cup. The researcher also asked participants to indicate their gender, age, and
where they live.
My Answer:
IV – cola brand (Pepsi, Coke)
DV – liking of the cola (rating from 1-10)
Confound: Design confound: colour of the cup (blue, red). It might be that the
colour of the cup influences participants’ ratings of liking.
Solution: Use the same colour of cup for all participants.
2. A researcher tests whether the age of a target person (young vs. old) influences
perceived intelligence. The researcher first collects photos from the internet of several
old people and young people. Participants in the experiment are randomly assigned
to view photos of either the old people or the young people and they rate how
intelligent these people seem. Results indicate that the ratings are higher on average
for the old people than for the young people. The researcher also noticed later that
most of the old people, and none of the young people, were wearing glasses.
My Answer:
IV – age of target person (old, young)
DV – perceived intelligence
Confound: Wearing glasses or not. The conditions differ not only in the age
of the people being rated, but also whether they are wearing glasses. So we
can’t know whether it was the person’s age or glasses that made them seem
more intelligent to raters.
Solution: Choose the same number of photos where people are wearing
glasses (and the same number without glasses) in each condition.
Random assignment is a technique used to assign the sample of participants in your
experiment to the different conditions of the experiment.
Simple random assignment
There are a couple of downsides to using these “simple random assignment” procedures.
They might result in an unequal number of participants in each condition at the end of
the study, and at certain time periods during the study.
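As a concrete illustration, here is a minimal Python sketch (with hypothetical condition labels) of simple random assignment. Because each participant is assigned by an independent “coin flip”, the groups can easily end up unequal in size:

```python
import random

def simple_random_assignment(n_participants, conditions=("treatment", "control")):
    # Each participant is assigned independently, like a coin flip,
    # so equal group sizes are not guaranteed.
    return [random.choice(conditions) for _ in range(n_participants)]

assignments = simple_random_assignment(20)
print(assignments.count("treatment"), assignments.count("control"))
```

Running this repeatedly will often give splits like 12/8 or 9/11 rather than a clean 10/10.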
This matching procedure ensures that the conditions will be completely equivalent on a
variable that the researcher believes is important (i.e., the matching variable). Matching
is most useful in cases where the sample size is small, and thus random assignment
might not be as effective in equalizing the groups. Or when the researcher wants to be
absolutely sure that the groups are as equal as possible (e.g., when studying samples
that are very difficult or costly to recruit; when trying to detect small effects of an IV).
The downside of matching is that it often requires the extra step of contacting the
participants to measure the matching variable ahead of the experiment, so matching
procedures can be costly in terms of time and resources.
However, posttest-only designs can still be very powerful, given their combination of
random assignment and a manipulated variable (IV). They avoid the key disadvantage
of including a pretest which is that the pretest can create “testing effects” – that is,
participants may respond differently on the posttest than they would have if they hadn’t
first done a pretest. [We will look more closely at “testing effects” in the next lesson]
Within-groups (within-subjects)
In within-groups designs, the very same participants are exposed to all the different
conditions of the experiment. There are two variations.
Repeated measures. In a repeated measures design the participants are first exposed
to one condition and complete the dependent variable, and then are exposed to the
other condition and complete the dependent variable again. For example, participants
may first see a video of an assertive candidate and rate how much they like that
candidate, and then see a video of a submissive candidate and rate how much they like
that candidate.
Learning Activity: Types of experiments
1. A researcher ran an experiment in which he asked people to shake hands with an
experimenter (played by a female friend) and then to rate the experimenter’s
friendliness. People were randomly assigned to shake hands with her either after she
had cooled her hands under cold water or after she had warmed her hands under
warm water.
My Answer:
o IV: temperature of hands
o DV: perceived friendliness
o Type of experiment: independent groups posttest-only.
2. Participants in an experiment were presented with information about the grades of
two hypothetical students. These two students had similar grades in their most recent
term (A average) but differed in their previous grades: one student had earned
consistently strong grades every term whereas the other had low grades early on and
improved over time. Participants were asked to choose which of the two students was
most deserving of an academic scholarship.
My Answer:
o IV: grade history (consistent vs. improvement)
o DV: choice for scholarship
o Type of experiment: concurrent measures
3. Children participating in an experiment were first asked to read a passage of complex
text, and the researcher counted how many reading errors were made. Half of the
children were then given a series of literacy training exercises designed to improve
reading skills, whereas the other children did not receive these exercises. A week
later, when the children were again asked to read a passage of complex text, the
children who had received the literacy exercises made fewer errors than those who
did not.
My Answer:
o IV: training condition (exercises vs. no exercises)
o DV: reading errors
o Type of experiment: independent groups pretest/posttest
4. An experiment tested the effects of mood on memory. To manipulate participants’
current mood, the researcher asked some of the participants to visualize a recent
positive event and others to visualize a recent negative event. Participants were then
asked to describe their memories of their high school years and the researcher rated
how positive the memories were.
My Answer:
o IV: mood (positive, negative)
o DV: positivity of memories
o Type of experiment: independent groups posttest-only
5. Each participant in a study was presented with two different types of cola, and rated
their liking of each one after tasting it on a scale from 1-10.
My Answer:
o IV: type of cola (Pepsi, Coke)
o DV: liking of cola
o Type of experiment: repeated measures
Order effects
An order effect is a type of confound that can occur in within-groups designs because
one condition comes before the other condition in time. The means on the DV in the two
conditions may differ simply because of the passage of time or the sequence in which
the conditions were experienced. Order effects can be considered a “confound”
because it is not only the IV that differs across conditions, but also the sequence in
which the conditions were experienced.
Passage of time
e.g., participants may become bored, tired, hungry, or fatigued by the time of the second
condition
Practice effects
e.g., participants may get better at a task because they have done it before
Carryover effects: what happens in the first condition carries over, and
contaminates, the second condition.
e.g., participants in a taste test may rate the first cola they taste more positively than the
second because the first taste is always better and subsequent tastes are never quite
as good. The first taste lingers on and contaminates the second taste.
e.g., participants rate the submissive job candidate lower because they still have the
assertive candidate in mind, and in comparison the submissive candidate is not as
appealing (but they would have given higher ratings to the submissive candidate if they
weren’t first exposed to the assertive candidate)
Counterbalancing
Fortunately, there is a fairly straightforward solution to the problem of order effects in
repeated measures designs: counterbalancing. Counterbalancing means presenting the
levels of the independent variable to participants in different orders. For example, in an
experiment with two conditions half the participants would be randomly assigned to
order 1 (IV level 1 first, IV level 2 second) and the other half to order 2 (IV level 2 first,
IV level 1 second), as illustrated in the figure below.
So, in our example experiment, half the participants would rate the assertive video first
and the submissive video second (order 1) and half the participants would rate the
submissive video first and the assertive video second (order 2).
The main goal of counterbalancing procedures is to ensure that each condition occurs
early in the sequence as often as it occurs late in the sequence, so that effects
associated with order will be balanced out across the different conditions.
Full counterbalancing:
With full counterbalancing, every possible order is presented equally often in the
experiment. This is relatively easy if there are only 2 or 3 conditions.
Learning Activity
With more conditions, there will be a lot of possible orders. Indeed, the number of possible orders is equal to
the number of conditions “factorial”. The mathematical symbol for factorial is !. The !
symbol means that you start with the number and multiply by the next smallest number,
then by the next smallest number, and so on until you have multiplied by 1. For
example, 3! = 3 factorial = 3 x 2 x 1 = 6:
2 conditions: 2 x 1 = 2 orders
3 conditions: 3 x 2 x 1 = 6 orders
4 conditions: 4 x 3 x 2 x 1 = 24 orders
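This factorial growth is easy to verify with a short Python sketch (using hypothetical condition labels) that lists every possible order:

```python
import itertools
import math

conditions = ["A", "B", "C", "D"]  # hypothetical labels for 4 conditions
all_orders = list(itertools.permutations(conditions))

# The number of possible orders equals the number of conditions factorial
print(len(all_orders), math.factorial(len(conditions)))  # 24 24
```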
If you have an experiment with several conditions, then, it might not be feasible to use
full counterbalancing, and instead researchers would use some form of partial
counterbalancing.
Partial counterbalancing
With partial counterbalancing, only a subset of the possible orders are chosen and
presented to participants. How do you choose which subset of orders to use?
Randomized orders
One technique is to present a randomized order for every participant (i.e., select the
orders randomly from the list of all possible orders). For example, when an experiment
is administered by a computer it can easily select a new random order for each
participant. So you would not use every order, but instead would use only as many
orders as there are participants in the study.
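A minimal sketch of this approach (with hypothetical condition labels) is simply to shuffle the conditions independently for each participant:

```python
import random

conditions = ["A", "B", "C", "D"]  # hypothetical condition labels

def randomized_order(conditions):
    # Draw a fresh random order for each participant, rather than
    # using a fixed set of counterbalanced orders.
    order = list(conditions)
    random.shuffle(order)
    return order

orders = [randomized_order(conditions) for _ in range(5)]  # one order per participant
```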
Latin square
Another technique is to use a Latin square procedure. This involves selecting the subset
of orders in a special way so that they have two properties. First, every condition
appears in each ordinal position once. Second, every condition comes right before and
right after each other condition once.
In fact, if you have an even number of conditions in your experiment (e.g., 6 conditions),
the number of orders you need is equal to the number of conditions. So, for an
experiment with 6 conditions, you would need to use only 6 orders – much lower than
the 720 orders that would be needed for full counterbalancing! For an experiment with
an odd number of conditions, the number of orders you need is twice the number of
conditions. So for an experiment with 5 conditions, you would need to use 10 orders.
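One standard way to build such a set of orders is sketched below (this is one common balanced Latin square construction, not the only possible one). With an even number of conditions it yields n orders; with an odd number it adds the reversed orders for 2n in total:

```python
def balanced_latin_square(n):
    # Row i, position j of a balanced Latin square: every condition
    # appears in each ordinal position once and, for even n, comes
    # right before every other condition exactly once.
    rows = [[(i + (j // 2 + 1 if j % 2 else n - j // 2)) % n
             for j in range(n)]
            for i in range(n)]
    if n % 2:  # odd number of conditions: also use each order reversed
        rows += [list(reversed(r)) for r in rows]
    return rows

print(len(balanced_latin_square(6)))  # 6 orders (vs. 720 for full counterbalancing)
print(len(balanced_latin_square(5)))  # 10 orders
```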
My Ideas:
Advantages
1. Equivalent comparison groups. Participants in each condition are COMPLETELY
equivalent because they are the same participants and serve as their own comparison
groups. You could think of this as similar to a study using matching – but in this case
participants are “matched” in every way.
2. More power. These studies are better able to detect differences between conditions
due to a reduction of unsystematic variability (noise). We will expand on this topic in
the next lesson.
3. Fewer participants required. For example, imagine again that you are conducting an
experiment with three different conditions, and you would like to have 40 participants
in each condition. For an independent groups (between-subjects) design you would
need a total of 120 participants. For a within-groups (within-subjects) design, how
many participants would you need? The answer is only 40 participants in total –
because each participant would go through all three conditions.
Disadvantages
1. Potential for order effects. Repeated measures designs can have order effects
(however, as we discussed, these can be controlled using counterbalancing, so this is
not usually a serious disadvantage).
2. May not be practical or possible – if you can’t undo what happened in the first
condition. Example: Suppose someone has devised a new way of teaching children
how to ride a bike, called Method A. She wants to compare Method A with the older
method, Method B. Obviously, she cannot teach a group of children to ride a bike
with Method A and then return them to baseline and teach them again with Method
B. Once taught, the children are permanently changed. In such a case, a within-
groups design, with or without counterbalancing, would make no sense.
3. Demand characteristics. These occur when participants pick up on cues that let them
guess the experiment’s hypothesis. Seeing all the conditions can often let participants
guess the hypothesis, and this may change the way they act.
Construct validity: How well were the variables measured and
manipulated?
For dependent variables we would want to know whether the measures were reliable
and valid, just as we have done for all other types of research.
For manipulated variables, we want to be sure that the manipulation created the
differences on the IV that it was supposed to. As noted in the text, researchers often
include a manipulation check in their studies for this purpose. A manipulation check is
a measure that is different from the dependent variable. A manipulation check comes
after the experimental manipulation and measures where participants stand on the
independent variable. For example, if a study is trying to create different levels of
anxiety in participants (IV) to see if it affects their test performance (DV), the researcher
might include a few items that ask participants how anxious they were feeling just
before they do the test – that way the researchers can make sure the manipulation
created differences in anxiety (the IV) across the experimental conditions. Sometimes,
instead of including a manipulation check in the experiment, this is done instead in a
separate pilot study carried out before the actual experiment is conducted.
2. External validity: To whom or what can the causal claim
generalize?
We may also wish to know whether the results of the experiment can be generalized to
other people and other situations. However, we can never really establish high external
validity on the basis of a single experiment.
- If a within-groups design was used, did researchers control for order effects by
counterbalancing?
Lesson 10
1. Maturation: change on the DV caused by natural changes occurring within the
participants over the time period of the study.
Over the 8-week period, the students were becoming more comfortable and better
adjusted to their university surroundings, and it was these psychological changes
happening naturally within them that led to the lower depression scores. In other
words, they just improved on their own over time.
2. History: change on the DV is caused by external events or factors that happen during
the time period of the study. Something other than the IV happens between the pretest
and posttest.
Some event could have happened that affected many of the participants in the study,
and reduced their depression levels (for example, the government announced a
reduction in tuition fees).
3. Regression to the mean: A statistical phenomenon involving extreme scores. Occurs
when participants are selected for a study, or assigned to conditions, because they had
extreme scores. People with extreme scores on a measure tend to score more
moderately (closer to the mean) when retested.
Students were selected to be in the study because they scored especially high on the
pretest measures of depression. Consequently, their scores will tend to be lower (i.e.,
closer to the mean) when retested just due to regression toward the mean.
4. Attrition: Changes on the DV due to loss of participants during the study. Occurs
when those who drop out of the study differ systematically from those who stay.
It may be that the most depressed participants left the study, and that’s why the
average score is lower on the posttest.
5. Testing: A specific type of order effect which occurs when the very act of completing
a pretest influences responses on the posttest.
Just going through the pretest measures might have prompted participants to think
more about their state of depression, perhaps to reevaluate and see themselves as less
depressed than they initially reported, and that’s why they score lower on depression
when tested again later.
6. Instrumentation: Changes on the DV that occur because of changes over time in the
measurement “instrument” – that is, in how the DV is assessed. A researcher might
use slightly different versions of a self-report measure at the two time points. But this
applies most commonly to behavioural/observational measures for which coders may
shift their coding standards over time.
This study included an observational measure of depression and so, at least on that
measure, instrumentation could be a problem. The coders may have changed in how
they were rating participants’ depression level. They may have interpreted more
behaviours as indicative of depression early on in the study, and the very same
behaviors were given lower ratings later in the study.
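The regression-to-the-mean threat described above can be demonstrated with a small simulation (a sketch with made-up numbers, assuming each observed score is a stable true score plus independent measurement error): people selected for extreme pretest scores tend to score closer to the mean on retest even though nothing about them has changed.

```python
import random

random.seed(0)
n = 10_000
true_scores = [random.gauss(50, 10) for _ in range(n)]
# Observed score = true score + independent measurement error at each testing
pretest = [t + random.gauss(0, 10) for t in true_scores]
posttest = [t + random.gauss(0, 10) for t in true_scores]

# Select participants because they had extreme (high) pretest scores
extreme = [i for i, score in enumerate(pretest) if score > 70]
mean_pre = sum(pretest[i] for i in extreme) / len(extreme)
mean_post = sum(posttest[i] for i in extreme) / len(extreme)
print(mean_pre, mean_post)  # the retest mean falls back toward 50
```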
Six further threats – applied to a one-group pretest/posttest design
How to prevent these threats with a true experiment (comparison
groups)
Fortunately, there is a solution to these potential threats: Add a control group or
comparison group that receives a different level of the IV. In other words, conduct a
“true experiment” in which you manipulate the independent variable to create different
levels of the IV (different conditions).
In almost all cases, the influence of a confounding factor is equated across the
experimental conditions by random assignment. So the confounding factor cannot
explain why there is a difference
between the two conditions at the end of the study. For example, if maturation is leading
people to become less depressed, this would be happening in both the control and
treatment condition, resulting in reduced depression levels in both conditions. So any
difference between conditions on the post-test measure of depression would not be due
to maturation – and instead it can be interpreted as an effect of the treatment.
Similarly, random assignment to the treatment and control condition can eliminate each
of the other potential threats, as summarized below:
[Another solution is to compute the pretest scores and posttest scores with only the
final sample included; that is, remove the dropouts’ data from the pretest mean as
well.]
5. Instrumentation: Its influence is generally equated across groups by random
assignment
6. Testing: Most types of testing effects should be equated by random assignment
Combined threats
As noted in the text, sometimes in a pretest/posttest design two types of threats to
internal validity might work together.
Selection-history threat:
Suppose that the researcher did not use random assignment, but instead assigned
students at one university to the treatment group and students at another university to
the control group. However, during the course of the study, a stressful event occurs on
one of the campuses and that happens to be the campus of the control group. This
might explain why the posttest depression scores are higher in the control condition
than in the treatment condition. An external event is influencing scores in the one
condition but not in the other.
Selection-attrition threat:
Attrition is a more serious problem when it differs across conditions (sometimes called
“differential attrition”). That is, it could be that different numbers of people and different
types of people drop out of the two conditions. For example, maybe the most severely
depressed people dropped out of the treatment condition because they found the
training sessions too demanding and arduous. This might explain why posttest
depression scores were lower in that condition.
Researchers can avoid this problem by computing pretest scores and posttest scores
with only the final sample included; that is, removing dropouts’ data from the pretest
mean as well.
Example. In our example study, the participants are first given a measure of
depression, and then later are given the same measure of depression again with some
sessions in between. They might reasonably guess that the researchers are studying
how depressed they are, and whether what is happening during the sessions affects
their depression.
Solutions:
Keep the participants “blind” to their condition – this is the best solution in studies where
it is feasible.
Conceal the full purpose of study; use cover stories to disguise the study purpose.
The problem for researchers, then, is that in experiments testing the effect of a
therapeutic treatment, the placebo effect provides an alternative explanation.
Participants in a treatment condition receive not only the treatment but also beliefs or
expectations that it will affect them. So how can the researchers know whether an
effect is actually caused by therapeutic treatment itself or by the accompanying beliefs?
Note that you can also test whether there actually is any placebo effect by including
both the placebo control group and a standard control group (which does not receive
the treatment or any belief that they received it).
1. The IV manipulation (i.e., the treatment). If the experiment is well designed, this
should be the only thing creating the between-groups difference.
2. Confounds (i.e., variables that differ systematically between the groups). If the
researcher did allow confounds to creep into the experiment, these would also lead to
a systematic difference between the groups.
Statistically, the measure of how much systematic variability there is in the experiment
is the difference between means (Mean1 – Mean2).
Where does this type of unsystematic variability come from? There are actually several
sources including:
Individual differences: People truly differ from one another in ways that lead
them to have different scores on the DV.
Measurement error: Any measurement error in the DV will create within-group
variability. That is, even if two participants are truly the same on the DV, they
might get slightly different scores due to measurement error (recall that this
happens more when measures are not highly reliable).
Situation factors: Situational differences in the experimental sessions that lead
people’s scores to differ (e.g., the time of day, rooms, the level of noise,
environmental distractions, slight differences in experimenter’s behavior, etc.)
Note that in each case, these factors are leading the participants’ scores to be different
from one another but, unless they are confounded with the IV, they would not create a
difference in the means between the conditions of the experiment. They would just lead
to variation within each condition.
Effect size, d, indicates how far apart the means are in Standard Deviation units. For example, if
d = 2.2, this indicates that the mean in one condition is 2.2 standard deviations higher than the
mean in the other condition. This would be considered a large effect (See text for the values of d
that are considered small, medium, and large effects).
Effect size, d, is based on exactly the same underlying logic as the t-test – that is, it involves a
comparison of the between groups difference and the within-groups variability. Again, you
don’t need to calculate effect size d in this course, but here is the formula so you can see what
goes into it
d = (M1 – M2) / SDpooled

Figure 6. Depiction of effect size, d, equation
Note that the formula is based on two components: the between-groups difference (M1 – M2), the
within-group variability (SD pooled across the conditions). Unlike the t-test, it does not take
sample size into account.
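As a sketch (with made-up data), d can be computed from the difference between the condition means and the pooled standard deviation:

```python
import statistics

def cohens_d(group1, group2):
    # d = (M1 - M2) / SD_pooled; note that sample size never enters the formula
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return (m1 - m2) / pooled_var ** 0.5

treatment = [12, 14, 15, 13, 16]  # hypothetical DV scores
control = [10, 11, 9, 12, 10]
print(round(cohens_d(treatment, control), 2))  # about 2.61, a very large effect
```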
Figure 3. Depiction of a significant difference (frequency polygons showing the score
distributions of the control and experimental groups).
In contrast, if there is a null effect, we would see that the between-groups difference is quite
small in comparison to the within-groups variability. Another way to say this is that there is a lot
of overlap in the two distributions of scores (they fall mostly on top of each other). In such cases,
there would not be a significant difference between the means. This can be seen in the figure
below where the difference between the means is quite small in comparison to the amount of
variability within each group:
Note that the formula is based on three components: the between-groups difference (M1 – M2),
the within-groups variance (SD2 pooled across the conditions), and the number of participants in
each condition.
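A sketch of the pooled-variance t calculation (with made-up data) shows the role of sample size: with the same means and spread but more participants, t increases, whereas d would stay the same:

```python
import statistics

def independent_t(group1, group2):
    # t = (M1 - M2) / sqrt(pooled variance * (1/n1 + 1/n2));
    # unlike d, the sample sizes enter the formula directly
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

treatment = [12, 14, 15, 13, 16]  # hypothetical DV scores
control = [10, 11, 9, 12, 10]
print(round(independent_t(treatment, control), 2))          # 5 per group
print(round(independent_t(treatment * 4, control * 4), 2))  # 20 per group: larger t
```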
This is much like trying to detect the effect of an experimental manipulation – although
the independent variable may actually be having an effect (the “signal” is there) it
cannot be detected because there is too much within-groups variability (too much
background “noise”). Indeed, experimental researchers often use the terms “signal” and
“noise” when talking about the systematic and unsystematic variability in their
experiments.
Lesson 11
A simple experiment (aka a one-way design) is able to test a simple causal hypothesis
about the effect that one IV will have on a DV. Below are some examples of simple
causal hypotheses that have been tested by researchers:
A noteworthy aspect of such claims is that they are very general and unqualified. They
imply that the IV generally has the stated effect on the DV
Now, when researchers think of another variable that might alter the effect that the
original IV has on a DV, they end up generating a more complex causal hypothesis
known as an “interaction hypothesis”. An interaction hypothesis states how two (or
more) IVs affect a DV. They are called “interaction hypotheses” because they describe
how two independent variables combine or “interact” to influence a dependent variable.
When these hypotheses are tested in experiments, they lead to more precise and more
specific conclusions.
Note that an interaction hypothesis states that the effect that one IV will have on a
DV depends on some other factor. An interaction hypothesis typically takes the
following form: It states that the effect one variable (X) has on the dependent variable
(Y) depends on another variable (Z). And then it spells out the effect that X is expected
to have on Y at each level of Z.
In the context of experiments, we can say that a moderator is a variable that changes
the effect that an IV has on the DV. So a “moderation hypothesis” is actually another
term for an “interaction hypothesis”. The researcher is identifying a variable (a
moderator) that alters the effect that an IV has on the DV.
Factorial design terminology
A factorial design is a research design that examines the impact of 2 or more factors
(i.e., independent variables) simultaneously.
The number of conditions (or “cells”) is the product of the number of levels of each
factor.
2 x 2 = 4 conditions
2 x 4 = 8 conditions
3 x 2 x 4 = 24 conditions
Factor A has 2 levels (A1, A2) and Factor B has 2 levels. These factors are crossed to
create all possible combinations (A1B1, A2B1, A1B2, A2B2). So there are four
conditions in the experiment (and there are four “cells” in the table).
Below is a table or design matrix that illustrates a 2 x 2 study:

       A1      A2
B1     A1B1    A2B1
B2     A1B2    A2B2
                              Extremely     Moderately    Mildly        Nonviolent
                              Violent TV    Violent TV    Violent TV    TV
Positive Parental Response
Negative Parental Response

Table 3. 2 x 4 design with 4 different levels of TV violence and 2 parental responses
Factorial design notation
Factorial notation conveys a lot of information succinctly with numbers. For example,
when we read that the researcher used a 2 x 3 design, we know that there are two
factors, one with 2 levels and one with 3 levels, and thus 2 x 3 = 6 conditions. The
factors may be of two types:
IV x IV (two manipulated independent variables)
IV x PV (a manipulated independent variable crossed with a participant variable)
In factorial experiments researchers often call both types of variables IVs for the sake of
simplicity. However the distinction does matter when drawing conclusions about
causation – recall that only independent variables that are manipulated experimentally
can support a causal claim.
Note that each factor/IV is presented using a variable name and also the levels (variable
name: level 1, level2). Note also that a separate statement is needed to describe the DV
in the experiment.
My Answer:
The study used a 3(qualification level: high, moderate, low) x 2(behavior: friendly,
unfriendly) independent-groups design. The dependent variable was the participants’
liking of the candidate.
Study 2
A researcher believed that people may be more prone to driver error when they are
listening to the radio, and that this effect would be more pronounced under heavier
traffic conditions. He asked participants to do a driving test (using a driving simulator)
once with the radio on and once with it off. Also during each of these tests, the traffic
was heavy half of the time and was light the other half of the time.
My Answer:
The study used a 2(radio: on, off) x 2(traffic: heavy, light) within-groups design. The
dependent variable was the number of errors made in a driving test.
Study 3
A researcher tested people’s ability to recall names by first presenting participants with
the photo and name of 30 people, and then 20 minutes later asking them to name each
photo – the researcher counted how many names were correctly recalled. Participants
did the name memory test once in the morning and once in the evening. The researcher
found that old participants and middle-aged participants did better in the morning, but
young participants did better in the evening.
My Answer:
The study used a 2(time of day: morning, evening) x 3(age: young, middle-age, old)
mixed factorial design, with the time of day as a within-groups factor and age as an
independent-groups factor. The dependent variable was the number of names correctly
recalled.
Does the effect that performance level has on self-esteem differ depending on task
importance?
Thus in a study with 2 IVs, there are 3 possible “effects” to look for: A main effect of the
first IV, a main effect of the second IV, and an interaction effect.
Which means do we compare to assess these effects?
                          Success    Failure    Marginal (row) means
Important tasks             18         12         15
Unimportant tasks           10         10         10
Marginal (column) means     14         11
This main effect indicates that, overall, people reported higher self-esteem when they
succeeded at tasks than when they failed (disregarding the importance of the tasks).
The marginal mean for important tasks (M = 15) is higher than the marginal mean for
unimportant tasks (M = 10) so there is a main effect of the importance factor.
This main effect indicates that, overall, people reported higher self-esteem when they
had done important tasks than when they had done unimportant tasks (disregarding
how well they performed).
Interaction effect
To evaluate an interaction effect we need to compare condition means.
Begin by asking what effect the first factor (performance level) had when participants
were at the first level of the other factor (important tasks). In other words, how much of
a difference between means was created by success vs. failure on the task. When tasks
were important, the difference created by success vs. failure was 6 (18 – 12 = 6).
Next, ask what effect the first factor (performance level) had when participants were at
the second level of the other factor (i.e., unimportant tasks). How much of a difference
between means was created by success vs. failure on the task. When tasks were
unimportant, the difference created by success vs. failure was 0 (10-10 = 0) (i.e., there
was no effect).
Now compare those differences. If the amount of difference created is not the same,
then there is an interaction effect. In other words, there is an interaction when there is
“a difference between the differences”.
The effect that performance level had on self-esteem was different depending on
whether the tasks were important or unimportant. When tasks were important, people
had higher self-esteem when they succeeded than when they failed. When tasks were
unimportant, people had the same level of self-esteem when they succeeded as when
they failed.
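The arithmetic above can be condensed into a few lines; the cell means come straight from the example:

```python
# Cell means from the example (performance level x task importance)
means = {
    ("important", "success"): 18, ("important", "failure"): 12,
    ("unimportant", "success"): 10, ("unimportant", "failure"): 10,
}

# Effect of performance level at each level of task importance
diff_important = means[("important", "success")] - means[("important", "failure")]
diff_unimportant = means[("unimportant", "success")] - means[("unimportant", "failure")]

# An interaction is a "difference between the differences"
interaction = diff_important - diff_unimportant
print(diff_important, diff_unimportant, interaction)  # 6 0 6
```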
In this case the line for important tasks is generally higher (on average) than the line for
unimportant tasks, so there is a main effect of task importance.
Interaction effect
If the lines are parallel to each other (i.e., the slopes are the same) there is not an
interaction effect.
If the lines are NOT parallel to each other (i.e., the slopes are different) there is an
interaction effect.
In this case, the lines are not parallel so there is an interaction effect. This tells us that
the effect created by performance level was greater (i.e., the slope was steeper) for the
important tasks than for the unimportant tasks.
Bar Graphs
You can also identify the effects in a similar manner. For example, here we see the
main effect of performance level because the bars on the left are higher overall (on
average) than the bars on the right.
Statistical Tests
As with one-way designs, researchers would also typically perform a statistical test to
determine whether the difference between means is actually a significant difference (p <
.05).
The statistical analysis used for factorial designs is not the t-test that we have seen
previously, but rather the Analysis of Variance (ANOVA). This analysis is
used to identify the amount of systematic variance coming from:
Main effect of A
Main effect of B
A x B interaction
The analysis gives an “F value” for each effect that represents the amount of systematic
variance relative to unsystematic variance. In this way, the logic underlying the analysis
is very similar to the t-test. It is a test of how much systematic variance is created
relative to the amount of unsystematic variance.
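To make the F logic concrete, here is a hand computation on hypothetical raw scores (not from the text; invented so that the cell means match the running example, with three scores per cell). Each F value is the systematic variance for an effect divided by the unsystematic (within-cell) variance:

```python
# Hypothetical raw scores; cell means are 18, 12, 10, 10 as in the example.
cells = {
    ("important", "success"):   [17, 18, 19],
    ("important", "failure"):   [11, 12, 13],
    ("unimportant", "success"): [9, 10, 11],
    ("unimportant", "failure"): [9, 10, 11],
}
n = 3  # scores per cell

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for xs in cells.values() for x in xs])  # grand mean = 12.5

# Marginal means for importance (A) and performance level (B)
m_a = {a: mean(cells[(a, "success")] + cells[(a, "failure")])
       for a in ("important", "unimportant")}
m_b = {b: mean(cells[("important", b)] + cells[("unimportant", b)])
       for b in ("success", "failure")}

# Sums of squares (equal cell sizes, 2 levels per factor)
ss_a = 2 * n * sum((m - grand) ** 2 for m in m_a.values())
ss_b = 2 * n * sum((m - grand) ** 2 for m in m_b.values())
ss_cells = n * sum((mean(xs) - grand) ** 2 for xs in cells.values())
ss_axb = ss_cells - ss_a - ss_b
ss_within = sum(sum((x - mean(xs)) ** 2 for x in xs)
                for xs in cells.values())

ms_within = ss_within / (4 * (n - 1))  # df_within = 8
# df = 1 for each effect in a 2 x 2 design, so MS = SS for each effect
f_a = ss_a / ms_within      # main effect of importance
f_b = ss_b / ms_within      # main effect of performance level
f_axb = ss_axb / ms_within  # interaction
```

Larger F values mean more systematic variance relative to unsystematic variance, which is exactly the logic of the t-test extended to several effects at once.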
Example 1
         A1   A2   Marginal
B1       16   10   13
B2       14    8   11
Marginal 15    9
Table 6. Factorial design effects example 1 with generic factors
Main effect of Factor A? Yes. Points for A1 are higher (on average) than points for A2.
Main effect of Factor B? Yes. Line for B1 is higher (on average) than line for B2.
A x B interaction? Yes. Lines are not parallel. Slope of line for B1 is not the same as
slope of line for B2.
Example 4
Figure 4. Line graph with DV score on the y-axis and A1 and A2 on the x-axis.
Main effect of Factor A? Yes. Points for A1 are lower (on average) than points for A2.
Main effect of Factor B? Yes. Line for B1 is higher (on average) than line for B2.
A x B interaction? No. Lines are parallel. Slope of line for B1 is the same as slope of line
for B2.
         A1   A2   A3    Marginal
B1       20   15   10    15
B2       18   13    5    12
Marginal 19   14    7.5
Table 13. 2 x 3 design with generic factors
1. Main effect of Factor A: Compare the marginal means for Factor A. If they are not
all the same, there is a main effect.
My Answer: Yes there is a main effect of Factor A [19 vs. 14 vs. 7.5]
2. Main effect of Factor B: Compare the marginal means for Factor B, found by
averaging across the three levels of Factor A. If the marginal means for B are not the
same, there is a main effect.
My Answer: Yes there is a main effect of Factor B [15 vs. 12]
3. Interaction effect: Do it in steps.
o First ask what difference is created by moving from A1 to A2, and is this
difference the same at B1 and at B2. Here we see a difference of 5, and it is
the same, so at this point we don’t see an interaction.
o Next ask what difference is created by moving from A2 to A3, and is this
difference the same at B1 and at B2. Here we see a difference of 5 for
participants at B1 and a difference of 8 for participants at B2. These
differences are not the same, so we conclude there is an interaction effect.
If the effect created by moving across the levels of Factor A differs
depending on Factor B, there is an interaction effect – the effect of Factor
A depends on Factor B.
My Answer: Yes there is an interaction effect. [15-10=5 vs. 13-5=8]
Again, it is worth noting that you could alternatively test for the interaction by
asking: What is the difference created by Factor B (B1-B2), and is it the same at each
level of Factor A? There is a difference of 2 at A1, a difference of 2 at A2, but a
difference of 5 at A3. So the difference created by B is not the same, and we conclude
that there is an interaction effect.
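The stepwise check above can be sketched in Python for the 2 x 3 table (B1: 20, 15, 10; B2: 18, 13, 5). There is an interaction if the effect of moving across the levels of A is not the same at every level of B:

```python
table = {
    "B1": {"A1": 20, "A2": 15, "A3": 10},
    "B2": {"A1": 18, "A2": 13, "A3": 5},
}

def adjacent_differences(row):
    """Differences created by moving A1 -> A2 and A2 -> A3 within one row."""
    return [row["A1"] - row["A2"], row["A2"] - row["A3"]]

diffs_b1 = adjacent_differences(table["B1"])  # [5, 5]
diffs_b2 = adjacent_differences(table["B2"])  # [5, 8]
interaction = diffs_b1 != diffs_b2            # True: the A2 -> A3 effect differs
```

The same conclusion follows from comparing B1 - B2 at each level of A (2, 2, 5), just as the text notes.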
In a line graph of these data, there is an interaction effect because the lines are not
always parallel.
The following questions test your ability to identify main effects and interactions when
factors have more than two levels. Some of the questions use generic factors (Factor A,
Factor B) and others use the factors from our example experiment (Performance Level,
Importance). Answer each question and then check against my answers.
1. Examine the table and answer each of the questions below it.
A1 A2 A3
B1 14 12 10
B2 12 12 12
My Answer:
Main effect of Factor A? Yes. Marginal means are 13, 12, 11.
Main effect of Factor B? No. Marginal means are 12, 12.
Interaction effect? Yes. A1-A2 differences are 2, 0. A2-A3 differences are 2, 0.
B1-B2 differences are 2, 0, -2.
Sketch a line graph:
2. Examine the table and answer each of the questions below it.
                      Success  Failure
Extremely Important     18       16
Moderately Important    14       12
Slightly Important      12        7
Not Important           10        5
Table 15. Table with 2 levels of performance (success, failure) on top and 4 levels of
importance (extremely important, moderately important, slightly important, not
important) along the side.
Main effect of Performance?
Main effect of Importance?
Interaction effect?
Sketch a Line graph:
My Answer:
Main effect of Performance? Yes. Marginal means are 13.5, 10.
Main effect of Importance? Yes. Marginal means are 17, 13, 9.5, 7.5.
Interaction effect? Yes. Success-Failure differences are 2, 2, 5, 5.
Sketch a line graph:
Extensions and variations
Higher order designs: More than 2 factors
Researchers sometimes conduct “higher order” factorial designs, meaning they include
more than two factors.
To illustrate, let’s imagine that we added yet another factor to our example experiment.
The researcher wonders whether the effect of the other two factors might further
depend on the sex of the participants, and so the researcher decides to include
participant sex as another factor. Now the experiment is a 2 (performance level:
success, failure) x 2 (importance: important, unimportant) x 2 (sex: male, female)
design. Below is a table showing the design and results.
                      Success  Fail   Marginal
Female  Important       18      12      15
Female  Unimportant     16      16      16
Male    Important       18      14      16
Male    Unimportant     18      18      18
Marginal               17.5     15
Table 18. 2 x 2 x 2 design with success/failure factors, female/male factors, and
important/unimportant factors
What effects can we look for in a 2 x 2 x 2 design? Below we will first look at this with
generic factors (A, B, C), and then after that you can try applying this to the example
experiment.
Main effects: There can be one main effect of each factor (The number of possible
main effects always equals the number of factors in a study)
Main effect of Factor A
Main effect of Factor B
Main effect of Factor C
2-way interactions: These are interactions that combine two factors at a time.
A x B interaction
A x C interaction
B x C interaction
3-way interaction: Combining all three factors
A x B x C interaction
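The list of possible effects follows a simple rule: there is one effect for every non-empty subset of the factors. A short sketch that enumerates them:

```python
from itertools import combinations

def possible_effects(factors):
    """One effect per non-empty subset of factors: main effects for
    single factors, interactions for larger subsets."""
    effects = []
    for k in range(1, len(factors) + 1):
        for combo in combinations(factors, k):
            effects.append(" x ".join(combo))
    return effects

effects = possible_effects(["A", "B", "C"])
# ['A', 'B', 'C', 'A x B', 'A x C', 'B x C', 'A x B x C']
```

For three factors this gives 7 possible effects: 3 main effects, 3 two-way interactions, and 1 three-way interaction.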
Learning Activity: Possible Effects in the 2 x 2 x 2 design
Now try applying this to our example experiment. List all of the possible effects that
could occur in your study notes. Then compare with my answers below:
My Answer:
Three main effects: Performance Level, Importance, Sex
Three 2-way interactions: Performance Level x Importance, Performance Level x Sex,
Importance x Sex
One 3-way interaction: Performance Level x Importance x Sex
                      Success  Fail   Marginal
Female  Important       18      12      15
Female  Unimportant     16      16      16
Male    Important       18      14      16
Male    Unimportant     18      18      18
Marginal               17.5     15
Table 18. 2 x 2 x 2 design with success/failure factors, female/male factors, and
important/unimportant factors (repeated from above)
For each effect, I have indicated which means need to be compared. Follow these
instructions, and indicate whether each effect occurred – just write Yes or No in
your study notes. Then compare to my answers.
Main effect of performance level: Compare marginal mean for all success conditions
(average across 4 conditions) and marginal mean for all failure conditions (average
across 4 conditions)
My Answer: Yes. Success 17.5 vs. Failure 15
Main effect of importance: Compare marginal mean for all important conditions
(average across 4 conditions) and marginal mean for all unimportant conditions
(average across 4 conditions).
My Answer: Yes. Important 15.5 vs. Unimportant 17
Main effect of sex: Compare marginal mean for all female conditions (average across
4 conditions) and marginal mean for all male conditions (average across 4 conditions).
My Answer: Yes. Female 15.5 vs. Male 17
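The three main-effect comparisons above can be checked with a short sketch: each marginal mean is the average of the four cell means that share one level of one factor (cell means are taken from the table above):

```python
cells = {
    ("success", "important",   "female"): 18,
    ("failure", "important",   "female"): 12,
    ("success", "unimportant", "female"): 16,
    ("failure", "unimportant", "female"): 16,
    ("success", "important",   "male"):   18,
    ("failure", "important",   "male"):   14,
    ("success", "unimportant", "male"):   18,
    ("failure", "unimportant", "male"):   18,
}

def marginal_mean(position, level):
    """Average of the 4 cell means whose factor at `position` equals `level`."""
    vals = [m for key, m in cells.items() if key[position] == level]
    return sum(vals) / len(vals)

success, failure = marginal_mean(0, "success"), marginal_mean(0, "failure")
important, unimportant = marginal_mean(1, "important"), marginal_mean(1, "unimportant")
female, male = marginal_mean(2, "female"), marginal_mean(2, "male")
# success 17.5 vs. failure 15.0; important 15.5 vs. unimportant 17.0;
# female 15.5 vs. male 17.0
```

Because each pair of marginal means differs, all three main effects are present.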
Performance level x Importance: Create a 2(performance level) x 2(importance) table
which ignores the sex factor by averaging across it.
Success Fail
Important 18 13
Unimportant 17 17

Performance level x Sex: Create a 2(performance level) x 2(sex) table which ignores
the importance factor by averaging across it.
Success Fail
Female 17 14
Male 18 16

Importance x Sex: Create a 2(importance) x 2(sex) table which ignores the
performance level factor by averaging across it.
Important Unimportant
Female 15 16
Male 16 18
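Collapsing over a factor is just averaging the cell means on either side of it. A sketch that collapses the 2 x 2 x 2 cell means over Sex to get the 2(performance level) x 2(importance) table:

```python
cells = {
    ("success", "important",   "female"): 18,
    ("failure", "important",   "female"): 12,
    ("success", "unimportant", "female"): 16,
    ("failure", "unimportant", "female"): 16,
    ("success", "important",   "male"):   18,
    ("failure", "important",   "male"):   14,
    ("success", "unimportant", "male"):   18,
    ("failure", "unimportant", "male"):   18,
}

def collapse_over_sex(performance, importance):
    """Average the female and male cell means for one condition."""
    return (cells[(performance, importance, "female")]
            + cells[(performance, importance, "male")]) / 2

collapsed = {
    (p, i): collapse_over_sex(p, i)
    for p in ("success", "failure")
    for i in ("important", "unimportant")
}
# success/important 18.0, failure/important 13.0,
# success/unimportant 17.0, failure/unimportant 17.0
```

The collapsed means match the 2-way table above, so the Performance Level x Importance interaction can be examined while ignoring Sex.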
Go back to the full design table, and examine the Performance Level x Importance
interaction pattern at each level of Sex. Does the 2-way interaction differ across the
levels of Sex?
                      Success  Fail
Female  Important       18      12
Female  Unimportant     16      16
Male    Important       18      14
Male    Unimportant     18      18
Interaction Effects
The text describes two approaches that can be used to describe interaction effects in
words:
1. Describe and compare the effect that one factor has when you are at each level of the
other factor.
2. Use key phrases such as “especially for” and “but only for” to describe the different
effects that a factor has.
I recommend that you usually stick with the first approach, that is, describe the effect
that one factor has at each level of the other factor. As your text notes, this is the
“foolproof” way to describe an interaction effect.
Generic Template 1
“There was an interaction between Factor A and Factor B. The interaction indicates that
the effect of Factor A on the DV differs across the levels of Factor B. Specifically, when
[first level of Factor B], increases in Factor A resulted in higher scores on the DV. In
contrast, when [second level of Factor B], increases in Factor A resulted in lower scores
on the DV.”
Generic Template 2
“There was an interaction between Factor A and Factor B. The interaction indicates that
the effect of Factor A on the DV depends on Factor B. Specifically, when [first level of
Factor B], participants scored higher on the DV in the A1 condition than in the A2
condition. In contrast, when [second level of Factor B] participants’ scores on the DV did not
differ across the A1 condition and the A2 condition.” [or did not differ by the same
amount]
Learning Activity: Describing main effects and interactions in
words
For each example below, provide a verbal description of any main effects and/or
interaction effect. First state briefly whether the effect was found, and then go on to say
in words what it means.
Example 1
Success Failure
Important 16 12
Unimportant 14 10
There was a main effect of performance level. Participants had higher levels of self-
esteem when they succeeded on a task than when they failed.
There was also a main effect of importance, indicating that participants had higher levels
of self-esteem when they did important tasks than when they did unimportant tasks.
There was not an interaction between performance level and importance. Succeeding
on a task led to the same increase in self-esteem regardless of the importance of the
task.
Example 2
Success Failure
Important 18 10
Unimportant 12 10
There was a main effect of performance level. Participants had higher levels of self-
esteem when they succeeded on a task than when they failed. There was also a main
effect of importance, indicating that participants had higher levels of self-esteem when
they did important tasks than when they did unimportant tasks.
These main effects were qualified by an interaction effect. When participants did
important tasks, they had much higher self-esteem when they succeeded than when
they failed. However, when they did unimportant tasks their self-esteem was only
slightly higher when they succeeded than when they failed. Thus the effects of
performance level on self-esteem were moderated by the importance of the task.
Example 3
A study tested whether caffeine intake would improve performance on a task and
whether this would depend on the difficulty of the task. Below are the performance
scores out of 100.
No Caffeine Caffeine
Easy
Task 80 85
Difficult
Task 40 60
Table 25. 2 x 2 design with no caffeine/caffeine factors and easy task/difficult task
factors
My Answer:
There was a main effect of caffeine intake. Performance scores were higher in the
caffeine condition than in the control condition. There was also a main effect of task
difficulty indicating that performance was higher on the easy tasks than on the difficult
tasks.
There was also an interaction effect indicating that the effect of caffeine intake was
moderated by task difficulty. When participants did easy tasks, caffeine intake improved
their performance scores by 5%, but when they did difficult tasks caffeine intake
improved their scores by 20%. [Thus caffeine led to greater improvement for difficult
tasks than for easy tasks]
Example 4
                    No Alcohol  Alcohol
Not Sleep Deprived      6         10
Sleep Deprived          8         18
Table 26. 2 x 2 design with no alcohol/alcohol factors and not sleep deprived/sleep
deprived factors
My Answer:
There was a main effect of alcohol consumption, indicating that participants who
consumed alcohol made more errors than those who did not. There was also a main
effect of sleep deprivation, indicating that participants made more errors when they were
sleep deprived than when they were not.
These main effects were qualified by an interaction effect. When participants were not
sleep deprived, alcohol increased errors by 4, but when they were sleep deprived,
alcohol increased errors by 10. Thus the effect of alcohol was moderated by sleep
deprivation.
However, to address possible order effects, you would want to use some kind of
counterbalancing procedure. For instance, in our example experiment, you could
number or letter the conditions just as if there were 4 levels of a single IV. Then use the
counterbalancing procedures that we used previously for single-factor designs – either
full counterbalancing or partial counterbalancing.
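Full counterbalancing treats the four cells of the 2 x 2 design as four conditions and uses every possible order. A minimal sketch (the condition labels are illustrative):

```python
from itertools import permutations

conditions = [
    "important-success", "important-failure",
    "unimportant-success", "unimportant-failure",
]
# Every possible order of the four conditions: 4! = 24 orders
all_orders = list(permutations(conditions))
```

Partial counterbalancing would instead select a balanced subset of these 24 orders (e.g., a Latin square), which is why it needs far fewer participants.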
Note that within-groups designs require far fewer participants (see Figure 12.21 in text
for an illustration). For instance, if you want 10 participants in each condition, you would
need only 10 participants in total, because they will go through every condition.
Note that again you would want to counterbalance the order in which participants
receive the levels of the within factor. So, you need to address both the assignment of
participants to conditions (between-factor) and the order in which they will receive the
conditions (within-factor).
Use a 2-step process: First, deal with the between-subjects factor by using random
assignment of participants to groups (important group, unimportant group). Next, deal
with the repeated-measures factor by creating different orders. Do this separately for
each group. For example, randomly assign half of the important group to order 1
(success first, failure second) and half to order 2 (failure first, success second).
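The 2-step process can be sketched with a small, hypothetical helper (the function name and labels are illustrative, not from the text):

```python
import random

def assign_mixed_design(participants, groups, orders, seed=0):
    """Two-step assignment for a mixed design:
    1) randomly assign participants to between-subjects groups;
    2) within each group, assign half to each order of the within factor."""
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)                       # step 1: random assignment
    per_group = len(pool) // len(groups)
    assignment = {}
    for i, group in enumerate(groups):
        members = pool[i * per_group:(i + 1) * per_group]
        half = len(members) // 2
        for j, person in enumerate(members):
            order = orders[0] if j < half else orders[1]  # step 2: counterbalance
            assignment[person] = (group, order)
    return assignment

people = [f"P{i:02d}" for i in range(1, 21)]  # 20 participants
plan = assign_mixed_design(
    people,
    groups=("important", "unimportant"),
    orders=(("success", "failure"), ("failure", "success")),
)
# 10 participants per group; within each group, 5 receive each order
```

With 20 participants, each importance group has 10 members, and every member goes through both the success and failure conditions, which is why only 20 (not 40) participants are needed.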
If you wanted 10 participants in each condition, how many would you need? The answer
is 20. They would be randomly assigned to the importance groups; but then those
groups would go through both the success condition and the failure condition.
If there is an interaction, we end up with more specific and accurate conclusions about
the effect that an IV has (i.e., we learn about limits, we learn that the effect depends on
other things).
If there is not an interaction, we gain evidence for the generality of the effect. The
finding suggests that the effect generalizes across other factors, and thus supports the
“external validity” of the effect.
For example, the original experiment may have shown that listening to the radio while
driving led to increased driving errors. In the original study, the radio content was
music, and now the researcher wonders whether the type of content matters. So they
add type of content (music, talk) as another factor. Notice, then, that half of the study is
exactly the same as the original study, as participants are listening to music, whereas
half of the study is new. So, in addition to testing an interesting possible moderator
(type of radio content), the researcher is taking the important step of testing whether a
previous finding replicates when the study is repeated.
Module 8: REB Review explains the roles and responsibilities of a Research Ethics Board
(REB), as well as the ethics application review and approval process. You are given an option to
complete the module as a researcher (yellow button) or REB member (blue button). Please select
Module 8 Researchers (yellow button). Work through this last module and answer the two
questions below.
Questions:
What is the primary goal of an REB?
- To assess whether the research proposals they review are ethically acceptable according
to TCPS2
- Goal is to represent the interests of participants by assessing the foreseeable risks,
ethical implications, and potential benefits of each proposal they review
What is the general rule regarding REB review of study materials?
- If a participant will read/hear/view it the REB needs to read/hear/view it
Animal research ethics:
The Canadian Council on Animal Care (CCAC) publishes guidelines that promote the
ethical treatment of animals in research. This exercise will teach you about the
conditions under which animals should/should not be used in research, as well as how
animals should/should not be treated while under a researcher’s (or institution’s) care.
1. On the CCAC website homepage, scroll down to the section titled Learn About the
CCAC. This section contains four modules that describe the CCAC guidelines,
certification process, three Rs tenet, and facts and legislation regarding animal data.
Select the Three Rs Tenet module and browse through the information under the
various tabs to answer the five questions below.
Questions:
What are the three Rs to guide researchers on the ethical use of animals?
- Replacement, reduction, and refinement
What are the three main types of ethical concerns related to animal welfare?
- Need to consider whether animals are absolutely required or whether suitable
replacements can be used instead
- Must consider what numbers of animals will ensure valid results, while, at the same
time, maximizing the amount of information obtained per animal
- Must also identify any potential harm to the animals and develop ways to minimize it
What are two ways to reduce the number of animals used in research?
- By reducing the number of animal experiments
- By reducing the number of animals needed in each experiment
Inter-rater reliability is a measure of the consistency among two or more raters who are
observing and scoring the same behaviour(s) using the same instrument. If raters do
not demonstrate agreement between their scores, then we cannot know whether the
construct of interest (i.e., in our case, sociability) has been measured in an objective
manner. When observational measures use coding categories (e.g., presence or
absence of specific categories of behaviour), inter-rater reliability is often expressed as
the percentage of agreement as follows:
Percentage of Agreement = 100 × (number of agreements / (number of agreements +
number of disagreements))
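The formula above translates directly into code (the counts in the example are hypothetical):

```python
def percentage_of_agreement(agreements, disagreements):
    """Inter-rater reliability for categorical coding:
    100 * agreements / (agreements + disagreements)."""
    return 100 * agreements / (agreements + disagreements)

# Hypothetical example: two raters agree on 45 of 50 coded observations
pct = percentage_of_agreement(45, 5)  # 90.0
```

Higher percentages indicate that the raters applied the coding categories consistently.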
Most journals that publish articles in the social and behavioural sciences require
researchers to submit their manuscripts in APA style. Also, most psychology courses
you take at Laurier will require reports to be submitted in APA style. APA guidelines
serve three main purposes: (i) To help researchers write effectively, (ii) To make
research articles uniform, and (iii) To facilitate the conversion of submitted
manuscripts to published journal articles.
3. What does an APA formatted document “look like?”
An APA style document has seven required sections, presented in this order: (i) Title
page, (ii) Abstract, (iii) Introduction, (iv) Method, (v) Results, (vi) Discussion, and
(vii) References.
4. What are the general APA style formatting rules?
The general formatting rules that apply to the entire document include (but are not
limited to) the following:
o Page numbers appear at the top right-hand corner of each page.
o Running head appears at the top left-hand corner of each page.
o All margins are 1 inch.
o Headings are used to organize sections (five different levels of headings).
o Lines of text are double-spaced with no extra spaces between paragraphs
and sections.
o The typeface is 12pt Times New Roman (black).
o Two spaces follow a period at the end of a sentence, and one space
follows other punctuation.