Professional Documents
Culture Documents
Lesson one
Is psychology really a science?
- (“Science refers to particular areas of study like biology, chemistry, and physics”) No:
science is not defined by its subject matter. You cannot decide whether something is a
science just by looking at the topic that is studied; what matters is the approach taken
to study that topic. Science is defined by the approach, not the topic
- (“science is done with sophisticated equipment and technology”) No, not always. You
can do science with paper and pencil.
- (“science has a large body of well-established findings, facts, and principles”) No, not
necessarily. Sciences vary widely in the amount of knowledge that has been acquired, as
some sciences have a much longer history than others.
- The term science refers to a particular approach to acquiring knowledge. The scientific
approach can be used to study virtually any topic.
Features of a scientific approach: How scientists approach their work
1. Empiricism: relying on evidence or observations to draw conclusions
2. Scientists test theories, research questions, hypotheses, and predictions: the theory-
data cycle. A theory is a set of statements that describe general principles; theories are
typically too broad and complex to be tested in one particular research study. A
research question is what the researcher wants to know, often a question about how one
variable might be related to another variable. A hypothesis (conceptual) is a more specific,
focused statement of what is expected in a specific situation (usually a statement about
how two variables are thought to relate to one another). A prediction (experimental)
refers to the expected result of a specific study.
3. Scientists tackle both applied and basic problems. Basic research: the researchers want
to understand something, without regard to whether knowledge will be immediately
useful or practical. The main goal is to increase knowledge. Applied research: the
researchers want to learn something that can be applied to a problem of immediate
concern; the main goal is to find a solution to a specific problem.
4. Scientists dig deeper: they almost never conduct just a single research study. Instead,
each study leads them to ask new questions that require more research.
5. Scientists make their work public so that the research findings can be examined closely by
other scientists. A key aspect of articles in scientific journals is that, before publication,
they usually go through an extensive process of peer review.
The role of induction and deduction:
- Induction involves reasoning from specific instances to a general proposition.
Example: start with research findings or observations; use them to derive a theory.
- Deduction involves reasoning from a general proposition to a specific implication of that
proposition. Example: start with a theory; use it to derive a hypothesis.
Functions of a theory:
1. Explain existing data/observations: theories summarize and integrate existing
observations, such as a set of research findings coming out of several different studies.
The theory is not just a simple listing of all the different results. Instead, a good theory
provides an integration or synthesis of many disparate observations, in a manner that
generates new insights.
2. Guide future research: theories guide future research by suggesting new hypotheses
that can be tested. Once a general theory has been formulated, it should suggest
specific hypotheses and predictions that follow from the theory, and these new
hypotheses can then be tested. If they are confirmed by a study, this is further evidence
in support of the theory; if they are not confirmed, the theory will need to be revised to
account for the inconsistent observations, or be discarded.
Features of a good theory:
1. Good theories are well supported by data
2. Good theories are falsifiable
3. Good theories have parsimony. This idea is linked to “Occam’s razor,” named after
William of Ockham, a fourteenth-century English philosopher who emphasized the
importance of simplicity, precision, and clarity of thought. If two theories do an equally
good job of accounting for the data and predicting future outcomes, then the simpler
theory (the one that uses fewer or simpler ideas) is generally preferred
Types of scientific sources:
1. Empirical journal articles: peer reviewed; report for the first time the results of an
empirical research study; provide details about the study’s method, the statistical tests
used, and the results; are written for an audience of other psychological scientists and
psychology students
2. Review journal articles: peer reviewed; provide a summary of all the published studies
that have been done in one research area; sometimes use a quantitative technique
called “meta-analysis” to combine the results of many empirical studies and provide a
number that summarizes the overall effect size
3. Chapters in edited books: not peer reviewed as rigorously as journal articles; each
chapter is written by a different contributor; usually reviews a collection of studies done
in a research area
4. Full-length scholarly books: not peer reviewed; not as common in psychology as in
some other disciplines; most likely to be found in academic libraries
Components of an empirical journal article:
- Title, Abstract, Introduction, Method, Results, Discussion
Lesson two
Variables:
- Variables are what researchers examine in their research studies
- Variables are also sometimes referred to as: factors, dimensions, qualities, attributes, or
characteristics
- A more formal definition: a variable is anything that takes on different values or levels
across a set of cases
- In a research study, a variable must assume at least two different values or levels across a
set of cases. Otherwise it would be a constant
- In an experiment the levels of a variable are often called the conditions of the
experiment.
Types of variables:
1. Qualitative variable: also known as categorical or nominal variables
- Variables that classify or categorize
- Levels differ in terms of quality or type, not in the amount of something
- Example: religious affiliation, major in university
2. Quantitative variable:
- Levels of the variable differ in terms of quantity or amount
- You can number the levels and place them in order, from less to more
- These are variables that ask: how much? how many? to what extent?
- Examples: self-esteem on rating scale, family income, current happiness
3. Discrete variables:
- There are no meaningful values of the underlying variable between the levels of the
scale
- Must be measured in whole units or categories
- Between any two adjacent values, no intermediate values are possible
- Example: number of children in family, number of errors made, courses a student takes
4. Continuous variables:
- There are meaningful values of the underlying variable in between the levels of the
scale; not limited to a certain number of values such as whole numbers
- Can be measured in whole units or fractional units
- In principle, between any two adjacent scale values, intermediate values are possible
- Example: the time it takes to complete a task, blood alcohol level
- When we measure continuous variables, they are by necessity converted into discrete
variables
- A qualitative variable is always discrete
- A quantitative variable could be either continuous or discrete
5. Manipulated variables:
- A variable in an experimental study that is intentionally varied by the researcher
- The researcher intervenes to create the different levels of the variable and assigns
participants to those levels
- Example: situation variables (room temperature, light intensity, privacy); participant
states (current mood, anxiety level, hunger level)
6. Measured variables:
- A variable whose levels are simply observed and recorded
- The researcher takes an assessment of each participant’s level on the variable without
trying to alter it
- Example: situation variables (room temperature, hunger level, how many people are in
the room); participant variables (gender identity, age, IQ, personality, socioeconomic
status)
7. Conceptual variables:
- Sometimes called a “construct”
- Must be carefully defined at the theoretical level; when researchers state their
hypotheses, they are usually stated in terms of the conceptual variables
- Example: hunger = having a need or desire for food; love = an emotion in which the
presence or thought of another person triggers arousal, desire, and a sense of caring
for that person
8. Operational definition:
- Also known as operationalizations
- Specifies precisely how the concept is measured or manipulated in a particular study
- Definition of a variable in terms of the specific procedures the researcher uses to
measure or manipulate it
- Example: hunger = the number of hours that the participant went without eating prior to
the study, or a scale rating in response to the question “How hungry are you?”;
love = a rating on a scale from -2 (strongly disagree) to +2 (strongly agree) in response to
the statement “I am in love with my current partner”
3. Causal claims: argue that one variable is associated with another variable, but go further
in arguing that one of the variables is responsible for the other variable. The
association between the variables is causal. The only type of research study that can
provide strong support for a causal claim is an experiment, that is, a study where the
causal variable is manipulated by the researcher and the other variable is measured
- A glass of wine a day may promote mental well-being
- Listening to classical music at a young age improves math ability
- Exposure to violent television increases aggressive behaviour
- Power may lead to rudeness
- Eating blueberries for breakfast can improve cognitive function
Importance of causation:
- A key goal of research studies is to draw conclusions about cause and effect; essentially,
we want to be able to explain the causes of behaviour. For this reason,
understanding causation is highly important, and many of the hypotheses that
researchers test in psychology are about the causal effects of one variable on another
variable
Criteria for causation:
1. Covariance: the study must show that the causal variable and the outcome variable are
related
2. Temporal precedence: the study must show that the causal variable came first in time,
before the outcome variable
3. Internal validity: the study must establish that no other explanations exist for the
relationship between the variables
Four types of validity:
1. Construct validity: how well a variable was measured or manipulated in the study
2. Statistical validity: the extent to which a study’s statistical conclusions are accurate and
reasonable. For example, when a study finds an association between two variables, can
we be sure that it is not just due to a chance connection in that particular sample; in
other words, is it statistically significant?
3. External validity: how well the results of a study generalize to, or represent, people or
contexts beyond those in the original study
4. Internal validity: a study’s ability to rule out alternative explanations for a causal
relationship between two variables (applies only to causal claims)
Prioritizing validities:
- Because it is not possible to achieve all validities at once
- Internal validity is a crucial consideration when we are evaluating a causal claim, but not
applicable to association or frequency claims
- External validity is crucial for frequency claims but is often not prioritized for research
testing causal claims.
Lesson three
Core principles
1. Respect for persons:
- Enabling people to make their own decisions about whether to participate in research,
free from coercion or interference
- The principle is applied primarily through informed consent, which means that potential
research participants should be provided with all information that might influence their
decision
- Participants are provided with an informed consent form
2. Concern for welfare
- Researchers must minimize the risks associated with research participation and
maximize the benefits of the research. The principle is applied primarily through a risk-
benefit analysis
- Need to consider potential risks and benefits that are likely to result from the research
- Only if the potential benefits of the study outweigh the risks involved should the
research be carried out
3. Justice:
- Researchers must treat people fairly and equitably
- They should give participants adequate compensation for participating and make sure
the benefits and risks of a research study are distributed fairly across all the participants
- Principle is applied primarily through recruitment methods that offer participation to a
diverse range of social groups
Research ethics:
- In Canada, researchers and research institutions must follow the code of ethics in the
Tri-Council Policy Statement (TCPS 2); research involving animals is governed by the
Canadian Council on Animal Care
- These ethics codes are based on the three core principles above
The REB:
- A panel of individuals that reviews and monitors all research conducted by researchers
at that institution to ensure that it abides by the TCPS2 guidelines
- Reviews and approves proposed and ongoing research involving humans
Scholarly Integrity: Ethical data analysis and reporting of results
A researcher's ethical responsibilities do not end when the study has been conducted.
Researchers are obligated to maintain integrity through the process of data analysis,
publication, and beyond.
1. Deleting data
Researchers must sometimes remove data from a data set, and it is ethical to do so.
Including data from certain participants – those who did not follow instructions, did
not take the study seriously, or gave nonsensical responses – would undermine the
validity of the study. Thus researchers often discard the data from these participants
in a "data cleaning" process before going on to analyze their data.
The problem arises if researchers' decisions about which cases to discard could be
biased by knowledge of how the study results would be affected. Data cleaning must
always take place before performing the primary data analyses. Researchers are never
permitted to discard data simply because they did not fit the hypothesis or are hard to
explain.
2. Overanalyzing data (p-hacking)
When initial planned analyses don't turn out as expected, researchers often continue
to explore their data, looking for other interesting or useful results.
The problem is that conducting many unplanned statistical tests increases the
likelihood of obtaining "false-positive" results, that is results that are really just due to
chance. The practice of conducting analysis after analysis is often called "p-hacking".
(This refers to statistical significance testing, where a result with a "probability value"
or "p-value" less than .05 is considered significant.)
The key here is that researchers should never pretend their exploratory analyses were
planned from the beginning. If they do find an unexpected but potentially important
result during data exploration, they should use it as an idea for future research, and
conduct a new study to test that idea.
3. Selective reporting (cherry picking)
The key principle here is that researchers are obligated to report all results that deal
directly with the hypothesis that the study was designed to examine. It would be
unethical to engage in "cherry picking" by selectively reporting results that support
the hypothesis but failing to mention results that did not.
4. Post-hoc theorizing (HARKing)
When researchers discover unexpected findings, they may be tempted to act as if the
unexpected finding was predicted all along. This is a problem known as "post-hoc
theorizing" or HARKing (hypothesizing after results are known; Kerr, 1998) that
clearly violates the integrity of the scientific process. Clearly, if a hypothesis is
generated only after seeing the results, there is no possibility of it being disconfirmed
by those results.
Researchers should never pretend that an unpredicted effect was predicted. Instead,
they should acknowledge that the effect was unpredicted and that it should be
interpreted cautiously until it is replicated. They could then go on to design a new
study to test the hypothesis - if the predicted effect is obtained, this time it does not
involve post-hoc theorizing.
Reforms to improve scientific integrity:
In particular, recent reforms have focused on making the entire research process more
public and transparent. These reforms encourage researchers to adopt three specific practices:
1. Full disclosure of information.
Researchers are urged to describe their studies and findings in greater detail than in
previous journal articles
2. Pre-registering methods and data analysis plans.
Researchers are encouraged to pre-register their study's method, hypotheses, and data
analysis plan online, in advance of data collection.
3. Open data and materials.
Types of measures
1. Self-report
Speaking anxiety – participants who have just given an oral presentation are asked to
rate how anxious they were feeling while giving the presentation.
2. Observational
Speaking anxiety – observers watch the speaker deliver an oral presentation and look
for markers of anxiety such as sweating, shaking, stuttering, etc.
3. Physiological
Involves measuring a bodily process that can't be observed directly; usually requires the
use of equipment to amplify, record, and analyze biological data.
Scales of Measurement
There are four different scales of measurement: categorical (also known as nominal),
ordinal, interval, ratio.
Many measures in psychology are treated as interval scales, including IQ test scores,
personality test scores, and scale ratings.
Of course, sometimes this is not possible. For example, qualitative variables can only
be measured on a categorical scale of measurement (e.g., gender, religious affiliation).
To answer this question we will need to consider two key aspects of measurement:
reliability and validity.
Scatterplots
Scatterplots are a type of graph that allow us to visualize the relation between the two
measured variables.
If there is a strong positive relationship between the variables, then the points will tend
to cluster quite closely around a line that slopes upward to the right.
Correlation coefficients
The data depicted in a scatterplot can also be summarized efficiently with a single
number called a correlation coefficient (symbolized with a small letter r) which indicates
how closely the points cluster around a line drawn through them. Correlation
coefficients can range from -1.0 to 1.0 and tell us two important things about the points
in a scatter plot:
1. The direction of the relationship (i.e., slope direction) can be either positive (sloping
upward), negative (sloping downward) or zero (not sloping up or down). If r is a
positive value the relationship is positive, and if r is a negative value the relationship
is negative.
For the purpose of assessing reliability, negative relationships would be rare and
problematic
2. The strength of the relationship
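As a concrete illustration (all numbers invented), Pearson's r can be computed directly from its standard formula; this is a generic sketch, not data from the lesson:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two variables that rise together -> r close to +1 (positive slope)
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [52, 60, 63, 71, 80]
print(round(pearson_r(hours_studied, quiz_score), 2))  # → 0.99

# Reversing one variable flips the sign (negative slope)
print(round(pearson_r(hours_studied, quiz_score[::-1]), 2))  # → -0.99
```

The magnitude of r reflects how tightly the points cluster around the line; the sign reflects the slope direction.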
There are three types of reliability:
1. Test-retest reliability
2. Internal reliability (also known as internal consistency or inter-item reliability)
3. Interrater reliability
Consider the Rosenberg Self-Esteem Scale (RSE; Rosenberg, 1965). The RSE is the most
commonly used measure of self-esteem. It consists of 10 items that are each rated on a
4-point scale.
1. Test-retest reliability
Give the RSE to the same participants on two occasions and look at the correlation
between the total RSE score at Time 1 and the total RSE score at Time 2.
If the test-retest correlation is positive and strong, there is good test-retest reliability.
2. Internal reliability
We are asking: Are all the items on a multi-item scale measuring the same construct?
We want to see that participants give consistent responses to all of the items, despite
the slightly different wordings.
For the RSE scale, we would have a large sample of participants complete the scale at
one time point. Then there are two ways to assess internal reliability:
So, for the RSE scale, we would compute 10 separate item-total correlation coefficients.
If they are all strong and positive we know that all of the items are consistently tapping
into the same construct. If there was a low item-total correlation for an item, this would
suggest that the item is not assessing the same construct as the rest of the items, and
so it should be dropped from the scale.
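The item-total approach can be sketched as follows, using invented responses and a shortened 3-item scale rather than the RSE's 10 items; each item is correlated with the sum of the remaining items (the corrected item-total correlation):

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# rows = participants, columns = items (hypothetical 1-4 ratings on a 3-item scale)
responses = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]

n_items = len(responses[0])
for i in range(n_items):
    item_scores = [row[i] for row in responses]
    rest_totals = [sum(row) - row[i] for row in responses]  # total of the other items
    print(f"item {i + 1}: item-total r = {pearson_r(item_scores, rest_totals):.2f}")
```

If every item-total correlation comes out strong and positive, the items are consistently tapping the same construct; a low value flags an item to drop.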
For the RSE scale, we could compute Cronbach's alpha as a way to know whether
there is good internal reliability.
In summary, if all item-total correlations are high, or if Cronbach's alpha is high (.70 or
higher), this is evidence that there is good internal reliability and that it makes sense to
sum all the items together to create a total score. If there is low internal reliability, the
items should not be summed to create a single total score.
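Cronbach's alpha follows a standard formula, alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. A minimal sketch with hypothetical responses:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(responses):
    """responses: list of participant rows, one column per item."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# hypothetical 1-4 ratings on a 3-item scale (rows = participants)
responses = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]
print(round(cronbach_alpha(responses), 2))  # → 0.93, above the .70 guideline
```

An alpha this high would justify summing the items into a single total score.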
3. Interrater Reliability
Interrater reliability is relevant for observational measures, where two or more raters
have scored or coded participants' responses. We want to see consistency among the
raters in order to conclude that scores on the measure are reasonably independent of
who did the rating.
Quantitative measures (e.g., rating scales):
If there is a strong positive correlation between raters, this means that when Rater 1
gave a high rating, Rater 2 also tended to give a relatively high rating. In other words,
there is consistency across the two raters.
Categorical coding:
When observers rate a categorical variable they indicate whether responses fall into
specific, pre-determined categories. In such cases, we can calculate the percentage of
responses that the raters place into the same categories. Simple percentage agreement
would be calculated as: number of agreements/(number of agreements + number of
disagreements) * 100.
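That formula can be checked with a toy example (the two raters' codes below are invented):

```python
# Hypothetical categorical codes from two raters observing the same responses.
rater1 = ["help", "ignore", "help", "help", "ignore", "help"]
rater2 = ["help", "ignore", "help", "ignore", "ignore", "help"]

agreements = sum(a == b for a, b in zip(rater1, rater2))
disagreements = len(rater1) - agreements
percent_agreement = agreements / (agreements + disagreements) * 100
print(f"{percent_agreement:.1f}%")  # → 83.3%
```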
Another statistic called kappa can also be used and this statistic has some advantages
over simple percentage agreement (e.g., it adjusts for how many agreements would be
expected due to chance). Once again, as with r, a kappa that is closer to 1.0 means that
the two raters showed stronger agreement.
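The kappa referred to here is usually Cohen's kappa, which corrects observed agreement for the agreement expected by chance from each rater's marginal proportions. A sketch with invented codes:

```python
from collections import Counter

def cohens_kappa(codes1, codes2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(codes1)
    p_o = sum(a == b for a, b in zip(codes1, codes2)) / n  # observed agreement
    c1, c2 = Counter(codes1), Counter(codes2)
    categories = set(codes1) | set(codes2)
    # expected chance agreement from each rater's marginal proportions
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in categories)
    return (p_o - p_e) / (1 - p_e)

rater1 = ["help", "ignore", "help", "help", "ignore", "help"]
rater2 = ["help", "ignore", "help", "ignore", "ignore", "help"]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.67
```

Note that kappa (0.67) is lower than the raw 83.3% agreement on the same codes, because some of that agreement would be expected by chance alone.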
The starting point for this interpretation of reliability is to note that every observed score
on a measure consists of two components: true score and measurement error.
True score is the score the participant would receive if the measure were absolutely
perfect.
Measurement error is the result of factors that distort the observed score, so that it
differs from the true score.
No measure is perfect. Every score will contain some degree of measurement error. A
measure that is reliable is one that has a larger "true score" component and smaller
"measurement error" component.
To illustrate, imagine that I have a class of students and I am trying to measure each
student's math ability by giving them a test. Each student has a certain amount of math
ability (i.e., the true score) that we are trying to capture. But there will also always be
some measurement error, as there are many factors that are not about math ability, yet
might influence the observed score on the test.
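This decomposition can be illustrated with a small simulation (all numbers invented): simulated students take a precise test and a noisy test twice each, and the test-retest correlation is high only when the error component is small relative to the true-score component:

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)
# each student's real math ability: the "true score"
true_ability = [random.gauss(70, 10) for _ in range(500)]

def give_test(error_sd):
    # observed score = true score + random measurement error
    return [t + random.gauss(0, error_sd) for t in true_ability]

# the same students tested twice with each kind of test
r_precise = pearson_r(give_test(error_sd=3), give_test(error_sd=3))
r_noisy = pearson_r(give_test(error_sd=15), give_test(error_sd=15))
print(f"precise test: r = {r_precise:.2f}")  # large true-score component
print(f"noisy test:   r = {r_noisy:.2f}")    # large error component
```

The scores on the precise test correlate strongly across the two administrations, while the noisy test's correlation is much weaker, even though the underlying abilities are identical in both cases.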
Where does measurement error come from?
There are many sources of measurement error. Anything that influences a participant’s
observed score on the variable, other than the construct we are trying to measure, is a
source of measurement error.
A question that might occur to you is this: How can we know how much of the person's observed
score is actually coming from their "true score" and how much is coming from "measurement
error"? The answer is that we can't ever really know this for sure, because we never know the
person's "true score" – if we did, we wouldn't need to administer a measure.
But we can estimate the degree to which the scores on the measure are coming from true scores
(rather than from random measurement error). To do this we look for evidence of consistency.
Ways to reduce measurement error:
1. Standardize administration: Ensure that every participant is tested under exactly the
same conditions.
2. Clarify instructions and questions: Make the instructions and items clear and easy to
understand, and pilot test the items beforehand to make sure they are clear.
3. Train observers: For observational measures, provide clear instructions that indicate
concrete behaviors to look for. Have observers practice using the rating instrument
beforehand.
4. Minimize coding errors: Take great care recording and entering data.
Example of the known-groups paradigm: The Beck Depression Inventory (BDI) is a 21-item
self-report scale with questions about symptoms of depression. In order to test the criterion validity of
the BDI using the known-groups paradigm, one could administer the BDI to a group of people
with depression and a group of individuals who were not depressed, as determined by four
psychiatrists who had conducted clinical interviews and provided diagnoses.
Convergent validity: A measure should correlate strongly with other measures of the same
construct; similarity.
Discriminant validity (aka divergent validity): A measure should correlate less strongly with
measures of different constructs; in other words, there must be differences.
More on the distinction between validity and reliability
To evaluate reliability, you compute correlations using just your measure (e.g., correlation of the
measure with itself across time, correlation among parts of your measure).
To evaluate validity, you compute correlations of your measure with other, different measures.
Here are four sources of information about existing measures that can be particularly useful:
1. Journal articles
2. Books and collections
3. Online databases
4. Commercially published scales
Scale points and labels
It is also worth noting that, when researchers use rating scales (e.g., Likert ratings, semantic
differential), they have a few additional issues to consider.
The advantage of a scale with more points (e.g., a 9-point or 11-point scale) is that it could
potentially allow for finer discrimination; that is, it could be more sensitive and capture smaller
variations among respondents.
The disadvantage of using more scale points is that people may be unable to differentiate their
responses at such a fine level of discrimination (e.g., on a 21-point scale could you really judge
whether your rating should be a 16 or a 17?) and thus such differences in scale ratings may not
be meaningful – they may reflect nothing but measurement error.
The advantage of excluding a midpoint is that this prevents participants from gravitating toward
the easy or safe option (see further discussion of the "fence sitting" response set below).
The disadvantage of excluding a midpoint is that sometimes the most accurate and honest
response is in the middle.
The advantage of labeling every scale point is that this can help to more clearly define the
meaning of each point and reduce measurement error from relying on each person's idiosyncratic
interpretations.
The disadvantage is that this can make the scale appear cluttered and cumbersome. Researchers
may find it simpler and more straightforward to label only the scale endpoints, and respondents
are usually able to use such scales without difficulty.
Unnecessary complexity:
Survey questions should be stated as clearly and simply as possible. People should be able to
understand and respond to the questions easily. Avoid vague or imprecise language, jargon, and
technical terms that people might not understand.
Leading questions:
Sometimes two questions seem to be asking the same thing, but they yield very different
responses depending on how they are worded (for example, of the two questions, How fast do
you think the car was going when it hit the other car? and How fast do you think the car was
going when it smashed into the other car?, which question do you think would lead people to
give a higher speed estimate?). If the goal of a survey is to capture respondents’ true opinions,
then questions should be worded as neutrally as possible.
Double-barreled questions:
Asking two questions in one (e.g., “Was your cell phone purchased within the last two years, and
have you downloaded the most recent updates?”). Instead, break it up into two separate questions.
Negative wording:
The more cognitively difficult a question is for people to answer, the more confusion there will
be, which can reduce the construct validity of the item. Using double negatives can make
questions more difficult for respondents to process. Negatively worded questions can reduce
construct validity by adding to the cognitive load, which may interfere with getting people’s true
opinions.
Response sets:
1. Acquiescence (yea-saying) response set: A response set in which people answer
positively (yes, strongly agree, agree) to a number of items instead of looking at each
item individually. Many people have a bias to agree with (say “yes” to) almost any
item—no matter what it states. Acquiescence can threaten construct validity because
instead of measuring the construct of interest, the survey could just be measuring the
tendency to agree or the lack of motivation to think carefully.
Solution:
There are no clear-cut solutions for the problems discussed. As discussed
in the text, researchers need to be aware that people may not be able to
report accurately about their own motivations and memories. Researchers
can't assume that self-reports will be accurate for these kinds of variables,
and so they may want to consider using other measurement approaches
instead.
BEHAVIORAL OBSERVATIONS
Another type of data that can be used to support a frequency claim (as well as association claims
or causal claims) is data from observational research (in which a researcher watches people or
animals and systematically records their actions).
1. Observer bias
When observers’ expectations influence their interpretations of participants’ behaviors
or the outcome of the research. Observers rate behaviors according to their own
expectations or hypotheses instead of rating behaviors objectively.
Example: Let’s say that we all watched a videotape of a man in his mid-20s being
interviewed. Before viewing the video, half of you were told that the man was a graduate
student and half of you were told that he was a criminal. Do you think this might
influence your opinions of this man?
2. Observer effects
When observers change the behavior of the participants to match the observer’s
expectations; also known as expectancy effects.
Example - Bright and dull rats: Psychology undergraduates were given five rats and
told to see how long it took for the rats to learn to run a maze (Rosenthal & Fode, 1963).
Each student was given a randomly selected group of rats. Half of the students were
told that their rats were bred to be “maze-bright” and half were told they were bred to be
“maze-dull.” The rats were actually all genetically similar, but the “maze-bright” rats ran
the maze faster each day with fewer mistakes, whereas the “maze-dull” rats did not
improve their performance over several days of testing. The study demonstrated that
sometimes observers’ expectations can influence the behavior of those they’re
observing.
Reflection Question
How do researchers prevent observer bias and observer effects?
1. Clear codebooks and well trained coders: Having clear rating scales or codebooks
is important (see Figure 6.9). These should indicate the specific behaviors that will be
recorded; it is best if the coders are looking for specific, concrete behaviors rather
than rating their global impressions or subjective interpretations. Also coders should
be well trained and given a chance to practice using the coding system by comparing
and discussing their practice ratings, before actually coding (independently) the study
data.
In order to assess the construct validity of a coded measure, multiple observers can be
used to determine the interrater reliability of the measures. If interrater reliability is
low, then observers might need to be retrained, a clearer coding system for behaviors
needs to be used, or both. Remember too that just because a measure is reliable, that
doesn’t mean it’s valid – but it is an important first step.
2. Masked research design (aka blind design): One way to prevent observer bias and
observer effects is to use a masked design (blind design). In a masked design, the
observers do not know to which conditions the participants have been assigned, and
they are not aware of what the study is about.
3. Reactivity
When people change their behavior in some way when they know that someone else is
watching them.
Reflection Question
What can researchers do to prevent reactivity?
1. Blend in. Make yourself less obvious by sitting in the back of the classroom or by
being in an adjacent room behind a one-way mirror in a setting that’s designed for
such observations. The goal is to make unobtrusive observations to prevent
reactivity.
2. Wait it out. Sometimes it’s a good idea to have research participants get used to your
presence by visiting a number of times so that they become more familiar with you
being around.
3. Measure the behavior’s results. Instead of actually measuring behaviors, sometimes
researchers obtain unobtrusive data by measuring traces of a behavior. For example,
if you were interested in the amount and types of junk food that college students eat
in the dorm, you could look at the wrappers in the dorm’s garbage cans.
Technique for recording behavior
Similar to response formats for self-report measures, the researcher needs to decide on a
technique for recording behavior:
1. Checklists
Researchers often use checklists to record pre-determined attributes and behaviours. This
requires having clear and specific definitions of the categories of behavior to be noted. May
involve checking whether specific behaviors occurred (yes/no) or how often they occurred.
2. Temporal measures
Temporal measures are used to record when a behaviour occurred or for how long. Two common
temporal measures are measures of duration and latency.
Duration refers to the amount of time a specific behaviour lasts. For example, researchers might
measure how long a child cries.
Latency refers to the amount of time that elapses between an event and a behaviour. For
example, reaction time is a measure of the time that elapses between the presentation of a
stimulus and a response. Inter-behaviour latency refers to the amount of time that elapses
between two (or more) occurrences of a behaviour.
3. Rating scales
Rating scales can be used to assess the quality or intensity of a behaviour. Often a 3-, 5-, or 7-
point scale is used.
Sampling decisions
Next, the researcher needs to make a sampling decision – that is, a decision about when
to make the behavioral observations. Below are four options. To make these options more
concrete, let's imagine that a researcher has video-recorded children for a 20 minute
session and wants to measure each child's smiling behavior.
1. Frequency counts
Record every occurrence of the behaviour across the entire session.
e.g., Note how many times the child smiles during the entire 20-minute period.
2. Time point sampling
Recording is done at the end of set time periods, such as every 10 seconds or every
15th minute (like freezing time and then recording whether the behavior is present at
that moment).
e.g., Stop the video after every 30 seconds and note whether the child is smiling right
then, at each of those time points.
3. Time interval sampling
Each behaviour is recorded once during successive intervals in a session. The data
record will simply show whether the behavior occurred during a specified time period
(not how many times it occurred).
e.g., stop the video after every 30 secs and note whether the child smiled at any time
during the last 30 second interval
4. Event sampling
Each time a specified behaviour takes place, observers rate aspects of it.
e.g., stop the video each time the child actually smiles, and rate aspects of the smile
(e.g., intensity, duration)
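The four sampling options above can be sketched in code. Below is a minimal Python illustration using invented smile episodes (start and end times, in seconds) for the hypothetical 20-minute recording; the data and helper names are assumptions for the example, not part of the lesson.

```python
# Hypothetical smile episodes from a 20-minute (1200-second) session.
# Each smile is a (start, end) pair in seconds -- invented data for illustration.
smiles = [(10, 14), (95, 96), (300, 310), (305, 312), (700, 701), (1100, 1105)]

def is_smiling(t):
    """Return True if the child is smiling at time point t."""
    return any(start <= t <= end for start, end in smiles)

# 1. Frequency counts: count every occurrence across the entire session.
frequency_count = len(smiles)

# 2. Time point sampling: check whether a smile is happening at each
#    fixed time point (here, every 30 seconds).
time_points = range(0, 1201, 30)
point_samples = [is_smiling(t) for t in time_points]

# 3. Time interval sampling: note whether a smile occurred at any time
#    during each successive 30-second interval (not how many times).
def smiled_during(lo, hi):
    return any(start < hi and end > lo for start, end in smiles)

interval_samples = [smiled_during(lo, lo + 30) for lo in range(0, 1200, 30)]

# 4. Event sampling: rate an aspect of each smile as it occurs
#    (here, its duration in seconds).
event_ratings = [end - start for start, end in smiles]
```

Note how the four techniques yield different data records from the same session: a single count, a series of yes/no snapshots, a series of yes/no intervals, and one rating per event.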
Lesson five
Once we have collected a set of data, we usually enter the data in a grid format, called
a data matrix. The data here are often referred to as the "raw data" as nothing has been
done yet to organize or summarize them.
Data Matrix 2. Exam scores
Descriptive statistics are statistics that provide a description or summary of the data that
has been collected. They are used to summarize the characteristics of a set of data.
These are sometimes referred to as "one variable" statistics as they are used to
describe data on a single variable (unlike other "two variable" statistics that we will
cover later in the course used to examine whether different variables are related to each
other).
The goal is to present the data we have collected in a manner that is accurate, concise,
and understandable. Of course, the most accurate presentation of data would be a
listing of the raw data itself as in the matrices above – all the information is there, with
nothing missing or distorted. But this is certainly not a very concise (especially for large
samples) or understandable presentation.
Frequency distributions
As a starting point, one thing that researchers typically do to understand their data is to
look at the distribution of scores. That is, how often did each of the possible values of
the variable occur? To do this, researchers typically create a table called a frequency
distribution. These frequency distributions can be created for variables that used any
scale of measurement (categorical, ordinal, interval, ratio).
It can be very helpful to display frequencies using a graph known as a histogram. This is
a graph that lists the possible values of the variable on one axis (usually the horizontal
x-axis) and displays the frequency of each score on the other axis (usually the vertical
y-axis).
Learning Activity
a. Simple frequency distribution
Possible values Frequency
0 0
1 1
2 2
3 2
4 3
5 5
6 7
7 5
8 3
9 2
10 0
b. Table 1. Simple frequency distribution
c. Relative frequency distribution
Possible values Frequency Relative Frequency
0 0 0.0
1 1 3.3
2 2 6.7
3 2 6.7
4 3 10.0
5 5 16.7
6 7 23.3
7 5 16.7
8 3 10.0
9 2 6.7
10 0 0.0
d. Table 2. Relative frequency distribution
e. Figure 1. Histogram
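Tables like the ones above can be generated directly from raw data. Here is a minimal Python sketch; the 30 raw scores are invented, but constructed so that their counts match the frequencies shown in Tables 1 and 2.

```python
from collections import Counter

# Hypothetical raw scores on a 0-10 scale (invented, but matching
# the frequencies in Tables 1 and 2 above; N = 30).
scores = [6, 4, 7, 5, 6, 8, 3, 6, 5, 7, 6, 9, 5, 6, 2, 7, 4, 6, 8, 5,
          1, 6, 7, 3, 9, 8, 5, 7, 4, 2]

freq = Counter(scores)
n = len(scores)

# One row per possible value: (value, frequency, relative frequency in %).
table = [(value, freq[value], round(100 * freq[value] / n, 1))
         for value in range(0, 11)]
```

For example, the score 6 occurs 7 times, giving a relative frequency of 7/30 = 23.3%, which matches Table 2.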
Grouped frequency distributions, histograms and stemplots
In cases where there are many possible values (e.g., scores that range from 1 to 100)
researchers would simplify by grouping the possible values together into intervals (e.g.,
1-10, 11-20, 21-30, etc.) to create a grouped frequency distribution, grouped frequency
histogram, or stemplot.
Learning Activity
Data Matrix 2 above contains exam scores that could range from 1 to 100 obtained from
a sample of 25 participants. For this data set, sketch out each of the following in
your study notes, and then click on my answers to see if yours are the same.
My Answers:
a. grouped frequency distribution
Possible values Frequency
1-10 0
11-20 0
21-30 0
31-40 0
41-50 1
51-60 3
61-70 7
71-80 8
81-90 5
91-100 1
b. Table 3. Grouped frequency distribution
c. grouped frequency histogram
Figure 2. Grouped Frequency Histogram
d. stemplot
Stem Leaves
0
10
20
30
40 3
50 2 5 5
60 1 2 2 2 4 5 6
70 2 2 3 5 7 7 7 8
80 2 3 5 5 6
90 2
e. Figure 3. Stemplot
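The grouped frequency distribution and stemplot above can also be built programmatically. A small Python sketch, using the 25 exam scores that can be read off the stemplot in Figure 3:

```python
from collections import defaultdict

# The 25 exam scores, read off the stemplot above (Figure 3).
scores = [43, 52, 55, 55, 61, 62, 62, 62, 64, 65, 66,
          72, 72, 73, 75, 77, 77, 77, 78, 82, 83, 85, 85, 86, 92]

# Grouped frequency distribution with intervals 1-10, 11-20, ..., 91-100.
grouped = defaultdict(int)
for s in scores:
    lo = ((s - 1) // 10) * 10 + 1   # e.g. 64 falls in the interval starting at 61
    grouped[(lo, lo + 9)] += 1

# Stemplot: stem = the tens part of the score, leaves = the ones digits.
stems = defaultdict(list)
for s in sorted(scores):
    stems[s // 10 * 10].append(s % 10)
```

The resulting counts (e.g., 7 scores in 61-70 and 8 in 71-80) match Table 3 above.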
Other frequency graphs
Frequency polygon: is very similar to a histogram but represents the frequency of
each possible value with a dot rather than a bar, and then uses lines to connect the
dots. Typically the lines are anchored at scores that were not obtained by anyone, just
beyond the range of collected data. This results in a graph that has the appearance of
a polygon.
Frequency bar graph: is a variation of the histogram that is used for measures at the
categorical level of measurement. The difference from a histogram is that a bar graph
uses a separate and distinct bar for each value. The bars don't touch. In contrast, in a
typical histogram the bars touch each other, reflecting the fact that the x-axis represents
a continuous variable. A histogram (or polygon) should only be used if the variable is a
continuous variable at the interval or ratio level of measurement. The bar graph is the
correct graph to use when the x-axis represents a categorical variable or an ordinal
variable. An example of a bar graph is shown below for the frequency of people who
reported driving different makes of cars.
The Shape of a Distribution
The Normal Distribution
You are probably already somewhat familiar with this type of distribution. It is the one
that has a bell-shaped appearance (i.e., a bell-shaped curve) and has the following
characteristics: it is symmetrical and unimodal (a single peak in the middle), the mean,
median, and mode all fall at the centre of the distribution, and frequencies taper off
toward both tails.
Other Distributions
Although the normal distribution appears very commonly, not all distributions of data will
have this shape. There are different types of distributions that sometimes occur, such as
bimodal and skewed distributions.
Bimodal distributions
These distributions have two distinct peaks.
They occur when there are two distinct response tendencies.
Example: There are two distinct groups within a sample (e.g., children and adults) that
score very differently
Example: On some polarizing issues people are either strongly in favor or strongly
opposed, with few in the middle.
Skewed distributions
These distributions are asymmetrical (not symmetrical)
Most scores fall toward one end of the distribution
Distinct tail to either the right (positive skew) or left (negative skew)
Figure 16. Negative skew due to ceiling effect
Frequency Distributions in Research Articles
It is worth mentioning that frequency distributions and frequency graphs are not usually
presented in research articles. They are typically used by researchers primarily as an
intermediary step as they prepare for further statistical analyses.
There are some exceptions. In some cases, the shape of the distribution may be
discussed in an article, especially if the distribution deviates from a normal distribution.
Expanding slightly on the textbook discussion, below are three considerations that can
help you decide whether to rely on the mean to describe central tendency.
Whether to rely on the mean depends very much on the shape of the distribution. In a
normally shaped distribution, the mean, median, and mode will be the same or very
close to one another. However, when a distribution is highly skewed, either negatively
or positively, they will be quite different.
Learning Activity
For the following questions, jot down your answers in your study notes before
comparing to my answers:
a. For the following set of twenty scores, identify the mean, median, mode: [2, 2, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11].
b. Does the median differ from the mean, and if so, why?
My Answer:
a. Mean = 5.5, Median = 5, Mode = 4 and 5.
b. The mean is higher than the median because there is a positive skew, and the mean is
increased by the extreme scores in the right tail of the distribution (whereas the
median isn't influenced by these extreme scores).
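The answer above is easy to verify with Python's standard statistics module:

```python
import statistics

# The twenty scores from the learning activity above.
scores = [2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]

mean = statistics.mean(scores)        # 5.5
median = statistics.median(scores)    # 5 (average of the 10th and 11th scores)
modes = statistics.multimode(scores)  # [4, 5] -- two modes
```

Note that the mean (5.5) sits above the median (5), reflecting the positive skew created by the extreme scores in the right tail.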
Describing Variability
Variability refers to how much the scores tend to differ from the average value.
The mean tells us what the average person scored, but it doesn't tell us how much the
scores tended to differ from that average. In some distributions most people score at the
average or close to it. Other distributions are much more spread out. A distribution has
high variability when the scores differ a lot from the average.
The two most common statistical indexes that are used to capture the amount of
variability in a set of scores are the variance (SD²) and the standard deviation (SD).
Variance – indicates the average amount that each score differs from the mean,
expressed in squared units. (It is the standard deviation squared.)
Standard Deviation – indicates the average amount that each score differs from the
mean, expressed in original scale units. (It is the square root of variance.)
The standard deviation is more commonly reported than the variance because it better
captures how far, on average, each score is from the mean. It is easier to understand
because it is expressed in the original scale units.
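A small worked example in Python, using an invented set of eight scores, shows the relationship between the variance and the standard deviation:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores, chosen for round numbers

mean = statistics.mean(scores)            # 5
# Population variance: the mean of the squared deviations from the mean.
variance = statistics.pvariance(scores)   # 4, in squared scale units
# Standard deviation: the square root of the variance, in original units.
sd = statistics.pstdev(scores)            # 2.0
```

Here the typical score sits about 2 units from the mean, which is easier to interpret than the variance of 4 squared units.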
We can easily turn any score into a z-score as long as we know the mean and the
standard deviation of the set of scores. Just use the following formula: z = (Raw Score –
Mean) / Standard Deviation.
Notice that the z-score can be either positive (indicating the score is above the mean) or
negative (indicating the score is below the mean). For instance, a z-score of -1.6
indicates that the person's score was 1.6 standard deviations below the mean. A z-
score of 2.1 indicates that the person's score was 2.1 standard deviations above the
mean.
1. Summarizes a score's relative standing.
The z-score summarizes exactly where the student's midterm score was relative to the
rest of the class.
In the first case, the person's score was 65, the mean was 40, and the standard
deviation was 12.
z = (65-40) / 12
z = 2.08
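The same calculation can be written as a small Python function (the numbers are from the example above):

```python
def z_score(raw, mean, sd):
    """z = (raw score - mean) / standard deviation."""
    return (raw - mean) / sd

z = z_score(65, 40, 12)
print(round(z, 2))   # 2.08 -- the score is about 2.08 SDs above the mean
```

A raw score below the mean (say, 30 with this mean and SD) would give a negative z-score instead.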
2. Allows us to compare scores from different distributions.
This benefit is explained in the text, with an example that compares one's score on a
midterm with one's score on the final exam. Even though these are two different
distributions of scores, you can compare z-scores to see whether one's relative
standing was better on the midterm or on the final exam.
3. Allows us to combine measures taken on different scales.
Sometimes researchers obtain several measures of the same construct, but cannot
simply average across them because the measures used different scales. For example,
if self-esteem was measured using a one item scale that ranged from 1-10, and also
using a second measure with possible scores from 1-40, you can't simply average
across the two scores because they use different units. However if the scores are all
converted into z-scores (i.e., standardized scores) this creates a common underlying
metric that is based on relative position. It is then reasonable to average the z-scores
and create an overall measure of self-esteem that combines the two scales. Thus
researchers sometimes use z-scores to combine measures that were initially taken on
different scales.
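A minimal Python sketch of this standardize-then-average approach, using invented scores on two hypothetical self-esteem scales (the data and variable names are assumptions for illustration):

```python
import statistics

# Hypothetical self-esteem scores for five people on two different scales.
scale_a = [3, 7, 5, 9, 6]        # a 1-10 scale
scale_b = [12, 30, 22, 38, 28]   # a 1-40 scale

def standardize(scores):
    """Convert raw scores to z-scores (a common metric of relative position)."""
    m, s = statistics.mean(scores), statistics.pstdev(scores)
    return [(x - m) / s for x in scores]

za, zb = standardize(scale_a), standardize(scale_b)

# Averaging each person's two z-scores gives a combined measure of self-esteem.
combined = [(a + b) / 2 for a, b in zip(za, zb)]
```

Averaging the raw scores directly would let the 1-40 scale dominate; averaging z-scores weights the two measures by relative standing instead.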
Population
A population is the entire set of cases that are of interest to the researcher. It is usually
large, containing too many cases to include all of them in the study.
Sample
A sample is the smaller set of cases included in the study. These are selected from a
population and are often intended to represent that population. That is, the goal is to
examine the results from the sample and generalize them to the larger population.
Volunteers (self-selection) – using a sample of people who volunteer to participate. This could
lead to bias through self-selection. For example, if posters were placed around campus asking
for students to volunteer their time for a short attitude survey, the resulting sample would
include only the kinds of students who notice such posters and choose to respond.
Probability Sampling Techniques
In research where external validity is vital (e.g., research making frequency claims)
probability sampling is the best option.
Indeed the key element that distinguishes Probability Sampling techniques from
Nonprobability sampling techniques is the component of "randomness". Random means
that there is no order or pattern that could be predicted in advance.
Simple random sampling
You would begin with a numbered list of every student at WLU. Such a list of all the
cases in the population is known as the "sampling frame" and is the starting point for
simple random sampling. Then you would use a random process to pick out 200
numbers from the list in a random manner, where every student has the same chance
of being selected. This could be done using a random number table, or a computerised
random number generator (e.g., the research randomizer shown in the text, or the
randomize function in Excel).
Cluster sampling
You would begin with a list of "groupings" or "clusters" which might be quite arbitrary.
This could be a list of all the degree programs at WLU (e.g., BMus, BBA, BA, BSc etc;
there are more than 100 programs at WLU). You would randomly select a subset of
those programs (let's say 10 programs).
At that point you could go ahead and contact every student in the 10 selected programs.
That would be cluster sampling.
Stratified random sampling
You would break your overall list of students into four separate categories (i.e., four
strata) and then you would randomly sample from each list.
Oversampling
My Ideas:
This would be a fairly minor variation on Stratified Random Sampling that would be
done if some of the categories are much smaller than others. For example, imagine that
your categories for stratified random sampling were International Students and
Domestic Students. International students make up quite a small proportion of the WLU
population (let's say 5% for example), and so selecting them proportional to their
population membership would leave you with a very small sample of these students. So,
instead, you may decide to sample 20% of the International students to be sure you get
a sufficient sample for this category, even though they represent a smaller proportion of
the overall population. This would be oversampling.
Systematic sampling
My Ideas:
For systematic sampling you would select every "kth" member (e.g., every 17th person)
in the population. For example, you could go through a list of all WLU students and
choose every 17th student for your sample.
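The probability sampling techniques above can be sketched in a few lines of Python. The sampling frame, the two strata, and the sample sizes below are all invented for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# A hypothetical sampling frame: a numbered list of 5,000 "students".
frame = list(range(1, 5001))

# Simple random sampling: every member has an equal chance of selection.
simple = random.sample(frame, 200)

# Systematic sampling: every kth member, starting from a random point.
k = len(frame) // 200              # k = 25
start = random.randrange(k)
systematic = frame[start::k]

# Stratified random sampling: sample from each stratum separately,
# proportional to stratum size (here, two invented strata: 95% / 5%).
domestic, international = frame[:4750], frame[4750:]
stratified = random.sample(domestic, 190) + random.sample(international, 10)

# Oversampling: deliberately sample a larger fraction of the small stratum
# so the final sample contains enough of its members to analyze.
oversampled = random.sample(domestic, 150) + random.sample(international, 50)
```

In every variant, the key ingredient is a random process applied to a complete sampling frame; the variants differ only in how the frame is organized before selection.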
Nonprobability Sampling Techniques
- Non-response biases
The response rate in a survey refers to the percentage of people selected for the
sample who actually complete the survey. Response rates may be low when
researchers are unable to contact participants or when participants do not agree to
participate.
This is another way in which systematic bias could enter a sample, and leave
researchers with a final sample that is not representative. This sort of non-response
bias would happen if those who participate in the survey are systematically different
from those who were selected for the study but did not actually participate.
The margin of error is expressed as an interval that would usually (e.g., 95% of the time)
contain the true population value. For example, you might hear: The margin of error is +
or – 3% in 95% of observations. This means that, if the researchers took a sample like
this many times, the true population value would fall within 3% of the value that was
obtained in the sample (i.e., the population value would fall between 85% and 91%) in
95% of the samples. So a low margin of error tells us that the true population value is
likely very close to the sample value that was obtained.
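For a sample proportion, the 95% margin of error is often approximated as 1.96 × sqrt(p(1 − p)/n). A small Python sketch; the sample value of 88% echoes the 85%-91% interval above, while the sample size of 450 is an assumption chosen so the margin comes out near ±3%:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p, sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# e.g., 88% of an assumed sample of 450 respondents gave a particular answer:
moe = margin_of_error(0.88, 450)        # roughly 0.03, i.e. +/- 3 percentage points

# Diminishing returns: quadrupling the sample size only halves the margin.
moe_big = margin_of_error(0.88, 1800)   # roughly 0.015
```

Notice that the formula contains n but not the population size, which is why a well-drawn sample of about 1,000 works even for a population of millions.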
1. The sample size – with larger samples there is less sampling error. However, there are
diminishing returns to increasing the sample size. For example, if you have a
sample of 2,000 participants, there would be little to be gained by adding another 1,000
participants. Consequently researchers typically seek an "economic sample" that provides
a balance between statistical accuracy and polling costs.
2. The population size – with larger populations there is greater sampling error.
But here again the impact on the margin of error is much less than people usually
think. People often mistakenly believe that if the population is very large (e.g.,
millions of people) you also need to draw a very large sample. But that is simply not
the case. A sample of about 1,000 people still does a good job if it is representative.
3. The variance of the data – with greater variability there is greater sampling error.
Sampling error (Statistical Validity) vs. Representativeness (External
Validity)
When we calculate a margin of error we are evaluating the statistical validity of the
data – in this case how close the sample value is likely to be to the true value in the
population.
This is separate from evaluating the external validity of the study. The external validity
of the study is determined by whether the sample is representative or not. For external
validity, it doesn't really matter whether you have a large sample or not. Just having a
larger sample doesn't make it more representative. What matters most for external
validity is whether the sample was drawn using a probability sampling technique. In short,
what matters most is HOW, not HOW MANY.
When is external validity a priority?
When researchers are making a frequency claim, then the external validity of the study
is usually crucial.
However, when researchers design a study to test association or causal claims (e.g.,
that one variable affects another variable), the external validity of that particular study is
not top priority.
Lesson 6
To see whether two variables are indeed correlated, we use slightly different techniques
depending on whether we are working with quantitative or categorical variables.
Table 1. Scores of ten participants on two variables
Participant X Y
1 8 7
2 2 1
3 4 3
4 1 1
5 5 6
6 7 5
7 9 6
8 6 5
9 10 9
10 3 2
It is much easier to see whether the variables are correlated if we first rearrange the scores in the
data columns, putting them in order (from lowest to highest) on the X variable, as seen in Table 2
below. We must be sure to keep the two scores from the participants paired together on the same
row. Then we can ask: As the scores are increasing on the X variable (across participants), do
we also see a tendency for scores to be increasing, systematically on the Y variable?
This appears to be the case in our example. After placing the scores on X in order (from
lowest to highest), there appears to be a systematic pattern in the scores on Y - the scores are
generally quite low at first and tend to increase as you move down through the column. That is,
as the scores on X are increasing across the participants, the scores on Y are also tending to
increase across the participants.
Table 2. The same scores, ordered from lowest to highest on X
Participant X Y
4 1 1
2 2 1
10 3 2
3 4 3
5 5 6
8 6 5
6 7 5
1 8 7
7 9 6
9 10 9
Scatter plots:
Recall that in a scatterplot, one of the variables is represented on the horizontal axis (X axis) and
the other is represented on the vertical axis (Y axis). Each point on the scatterplot shows where a
participant scored on the two variables. In our example, the points slope upward to the
right, indicating that there is a positive correlation between the two variables.
Perfect correlations
Perfect positive correlation
There is one pattern that we would call a "perfect positive correlation". We would almost never
see a correlation like this in real data, but knowing what is meant by a perfect positive correlation
can help us understand the concept of correlation. This is the strongest positive correlation
possible.
If there is a perfect positive correlation, this means that every time the X values increase by a
unit (as you move from one participant to another), the Y values also increase by a fixed amount.
When plotted in a scatterplot, then, the data points will all fall on a straight line sloping upward.
The correlation is stronger to the extent that the points in a scatter plot are closer to an
imaginary straight line sloping upward or downward. If the correlation is strong, the
points tend to fall close to the line. If the correlation is weak, there is still a systematic
slope, but the points do not cluster as closely around the straight line. The absolute
value of the correlation coefficient, r, is higher when the data points fall closer to a
straight, sloping line.
However, for this type of study (where the x-variable is categorical, such as the
children's sex), researchers would be much more likely to simply compare the mean
score on the y-variable across the two different groups (i.e., across
the two levels of the x-variable). They might also use a bar graph to present the means
visually as in the example bar graph in Figure 7 below. This allows us to examine
visually how much the means differ in the two groups. Or they could just present the
means in the written description of the results, indicating how different they were.
Figure 7. Bar graph displaying mean score on expressiveness for each group.
No matter how the means are presented, the important point is this: If the means differ
across the two groups, then the two variables are related. For instance, because the
average level of expressiveness was much higher for females (M = 8.00) than males
(M = 4.25), this indicates that there was a strong relation between the children's sex and
their emotional expressiveness. If the average scores had been the same in the two
groups, this would mean that there was no relation between the children's sex and their
emotional expressiveness.
Two Categorical Variables - Comparing frequencies/proportions
across groups
Imagine that a researcher wants to know whether there is an association between
children’s sex (male or female) and whether they display a specific type of reading
disability (yes or no).
In such cases, researchers would use a contingency table to see whether the two
variables are associated. A contingency table depicts the number of people at each
combination of the X-variable and the Y-variable.
More specifically, the rule is this: compare the proportion of the people in the entire first
column who are at the first level of the y-variable (10 out of 30 = 33.3%) with the
proportion of people in the entire second column who are at the first level of the y-
variable (20 out of 100 = 20%).
If the proportions differ there is an association between the two variables. The greater
the difference between proportions, the stronger the association.
           X Level 1   X Level 2
Y Level 1      10          20
Y Level 2      20          80
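The column-proportion rule can be applied to the contingency table above in a few lines of Python:

```python
# The contingency table from above: rows = Y levels, columns = X levels.
table = [[10, 20],   # Y level 1
         [20, 80]]   # Y level 2

# Total number of people in each column of X.
col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]   # [30, 100]

# Proportion at Y level 1 within each column of X.
p1 = table[0][0] / col_totals[0]   # 10/30  = 0.333
p2 = table[0][1] / col_totals[1]   # 20/100 = 0.200

# Different proportions -> the two variables are associated;
# the larger the difference, the stronger the association.
associated = abs(p1 - p2) > 0
```

Here 33.3% versus 20% indicates an association between the two categorical variables.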
If you were trying to find the regression line visually, you would try to draw it so that it
cuts through the points with about half the points below it and half above it. Better still,
the line can be found mathematically using regression analysis, a technique that finds
the line minimizing the squared distances between the line and all of the data
points.
An even better way to do this is to make use of the mathematical equation for the
regression line (i.e., the regression equation: y = β0 + β1x).
This regression equation might look new to you, but it is just like the standard equation
for a line [y = mx + b]. The regression equation tells us the slope of the line (β1 regression
coefficient) as well as where it passes through the y-axis (β0 regression constant). To predict a
person's score on the y variable, then, you just need to plug the person's x-score into
the equation.
Note that the slope of the regression line indicates how much the y variable changes for
every unit of change in x. In cases where there is no relation between the two variables,
the regression would be a horizontal line that falls at the mean of the y variable. In this
case, having information about a person's x-score does nothing to improve our
prediction of their y-score. The best you could do in that case is predict the mean score
on y (i.e., the most typical score for anyone).
The correlation coefficient, r, tells us how accurate our predictions would be – with
stronger correlations the predictions will be more accurate. Regression analysis is the
technique that we would use to actually make the predictions.
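A minimal sketch of finding the least-squares regression line, reusing the ten (X, Y) pairs from Table 1 earlier in the lesson:

```python
# Least-squares regression line y = b0 + b1*x for the Table 1 data:
# b1 is the slope, b0 is the intercept (where the line crosses the y-axis).
x = [8, 2, 4, 1, 5, 7, 9, 6, 10, 3]
y = [7, 1, 3, 1, 6, 5, 6, 5, 9, 2]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: sum of deviation cross-products over the sum of squared X deviations.
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx   # the line always passes through the point (mean x, mean y)

def predict(x_score):
    """Predict a person's y-score from their x-score."""
    return b0 + b1 * x_score
```

With these data the slope is about 0.83, so each one-unit increase in X predicts roughly a 0.83-unit increase in Y; if there were no relation, the slope would be zero and every prediction would simply be the mean of Y.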
Learning Activity
A researcher designed a skills test to give to job applicants in order to predict which
applicants are most likely to succeed on the job (as measured by employee
performance appraisals). The researchers then conducted a correlational study on a
sample of current employees and found that those who scored higher on the skills test
tended to have better job performance (the correlation coefficient was above zero).
1. Effect size
My Answer:
We would want to know how strong the association was between scores on
the skills test and job performance. We would be concerned if there was only
a very small correlation. If there is only a small or weak correlation, then
knowing an applicant's score on the screening test would not allow you to
predict their performance on the job with a high degree of accuracy. Generally
speaking, a very small correlation is not as important a finding as a larger
one.
2. Significance
My Answer:
We would want to know whether the correlation was statistically significant –
that is, whether an association this strong would be unlikely to have occurred
by chance alone if there were really no association in the population.
3. Outliers
My Answer:
We would want to know whether there are any outliers in the data – that is
people with rare, extreme scores (which we could see easily by examining a
scatterplot). This would be especially problematic if a participant has extreme
scores on both of the variables (not just one of them), and when the sample
size is smaller. In such cases even one or two outliers could have artificially
increased or decreased the correlation that was found. For example, in a
small sample, the correlation coefficient might be inflated a great deal due to
even one participant who had an extremely high score on the skills test along
with an extremely high score on job performance, even though there was little
association between the variables in the rest of the sample.
4. Restriction of range
My Answer:
We would want to know whether there was a full range of scores on both of
the variables. If there was not a full range of scores on one of the variables,
this could have reduced the correlation that was obtained, so that it
underestimated the true correlation between the two measures. In our
example, this could occur if applicants were hired only if they have very high
scores on the skills test. In that case there would be a very restricted range of
scores on the screening test (only high scores) rather than a full range of
scores. This would tend to reduce the correlation that was found, and would
lead us to conclude erroneously that the test was only weakly related to
performance when in fact the true correlation between these variables may
be much stronger.
5. Curvilinear association
My Answer:
We would want to know whether there was any curvilinear association in the
data, and we could look for this on a scatterplot. A curvilinear association would
mean that increases on the screening test were associated first with systematic
increases, and then with systematic decreases, in job performance (or the
reverse). For example, it could be that at the low end of the test scores,
people who scored higher on the skills test had better job performance (a
positive association) but then once they hit a certain point, further increases
in the screening test were associated with lower job performance (a negative
association). This type of curvilinear pattern is not detected by the statistics
we have discussed such as the correlation coefficient. The correlation
coefficient r is able to assess only "linear relations", not "curvilinear relations".
In this case, the correlation coefficient would be close to zero even though the
two variables are quite related – just not in a linear way.
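A quick demonstration of this limitation: for a perfect inverted-U pattern (hypothetical data), the correlation coefficient comes out to zero even though Y depends entirely on X.

```python
import math

# A perfect inverted-U (curvilinear) relation: y rises, peaks, then falls.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [-(xi ** 2) for xi in x]   # y = -x^2, invented data for illustration

n = len(x)
mx, my = sum(x) / n, sum(y) / n
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                * sum((yi - my) ** 2 for yi in y))
r = num / den
# r is 0: the linear correlation completely misses the curvilinear pattern,
# which is why inspecting a scatterplot matters.
```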
Recall that, in order to support a causal claim, a study has to satisfy these three criteria:
1. Covariance: There must be a correlation between the cause variable (X) and the
effect variable (Y).
2. Temporal precedence (the directionality problem): The causal variable (X) must
come before the effect variable (Y). If we can't be sure which came first, we can't
infer causation.
3. Internal validity (the third-variable problem): There must be no other plausible
alternative explanation for the relationship between the two variables. If there is a
third variable (Z) that could influence both X and Y independently, then we can't
infer causation.
Learning Activity
Scenario 1:
A professor at WLU has found a strong positive correlation between how often students in a
course attend lectures and their grades in the course. The professor concludes that lecture
attendance leads to higher grades in the class.
My Answer:
1. Directionality problem
Getting higher grades (e.g., on early assignments or midterms) might lead people to
go and attend more lectures.
2. Third variable problem
There are many plausible third variables. But keep in mind that to be a plausible third
variable, you must identify something that could sensibly influence BOTH of the
measured variables in the original correlation. That is, you should be able to explain
how the third variable might reasonably influence lecture attendance, and also how it
might influence course grades. Below are just a couple of possibilities.
Motivation – people vary in how motivated they are. Having higher motivation could
lead people to attend more lectures. Having higher motivation might also lead people
to get better grades (regardless of lecture attendance). In this case it could be that
lecture attendance had no causal effect on grades whatsoever (i.e., the correlation
between attendance and grades was "spurious")
Free time – people vary in how much free time they have. Having more time
available may lead people to attend more lectures. Having more time available might
also help people to get better grades regardless of lecture attendance (e.g., because of
more time to study). In this case it could be that lecture attendance had no causal
effect on grades whatsoever. (i.e., the correlation between attendance and grades was
"spurious")
When third variables are a problem for internal validity
In such cases the original correlation (between wine consumption and satisfaction) is called
a SPURIOUS CORRELATION, meaning it is misleading: the correlation does exist, but only
because of the third variable.
There can be some instances, however, where a third variable is not an internal validity problem.
This could occur if an identified third variable does not account for the initial correlation. For
example, it may be that wealth is indeed associated positively with wine consumption, and with
life satisfaction. But even so, when we look at the data closely, wealth does not account for the
original correlation between wine consumption and satisfaction. That is, the original correlation
can still be seen at each level of the third variable. For example, if you look within a group with
high wealth, there is still a correlation between wine consumption and happiness; and if you look
within a group with low wealth there is still a correlation between wine consumption and
happiness. So there is a potential third variable, but in this case it does not pose a problem for
internal validity.
The bottom line, however, is that whenever we have correlational findings we cannot know for
sure whether the findings might be due entirely to some third variable. Whenever the research
findings are correlational, we need to do a lot more digging and ask more questions before we
could be in a position to draw any causal conclusions.
1. X causes Y (causation)
2. Y causes X (reverse causation; directionality problem)
3. Z causes both X and Y (third-variable problem) (correlation between X and Y is
spurious; see figure below)
Figure 11. Z causes both X and Y (the correlation between X and Y is spurious).
In the textbook we are also introduced briefly to the idea of moderating variables.
Whenever the relationship between two variables changes depending on the level of
another variable, that other variable is called a "moderator" or a "moderating variable".
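As a small illustration of moderation (all numbers invented), the correlation between two variables can be computed separately at each level of the suspected moderator; if the correlations differ across levels, that variable moderates the relation:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# Hypothetical bodyweight / self-esteem scores, split by gender
weight_men,   esteem_men   = [60, 70, 80, 90], [2, 4, 6, 8]
weight_women, esteem_women = [60, 70, 80, 90], [8, 6, 4, 2]

r_men = pearson_r(weight_men, esteem_men)        # positive relation for men
r_women = pearson_r(weight_women, esteem_women)  # negative relation for women
print(round(r_men, 2), round(r_women, 2))
```

Here gender moderates the relation: the sign of the correlation flips across the two subgroups.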
Learning Activity
Scenario 1:
If we found that gender moderates the relation between bodyweight and self-esteem,
what might that mean?
My Answer:
It means that the association between bodyweight and self-esteem changes depending
on gender.
Lesson 8
In a longitudinal study, the researcher measures the same variable in the same
people at two or more time points – often to see how the variable of interest changes
across time. In a longitudinal correlational study, two variables are measured at each
time point. Note that this is considered a “multivariate” design rather than “bivariate”
because you end up with more than two measured variables. For example, if you looked
at two measures at four time points, you would collect 8 measured variables from each
participant.
One of the key reasons that researchers use longitudinal designs is that they can help
researchers arrive at causal conclusions – whether one variable influences another. Recall
that the bivariate correlations we looked at last lesson were good at showing covariance
but not at establishing temporal precedence or internal validity. Longitudinal studies can
help rule out the “directionality” problem and thus help the researcher to establish the
“temporal precedence” of a particular variable.
To illustrate, let’s work with a research example that was not covered in the textbook,
but has interested psychologists for decades: Does exposure to violence on TV lead
children to become more aggressive themselves? In several bivariate correlation
studies, researchers have measured both how much violent TV children watch (TV
violence) and how aggressively they behave (aggression), and they usually find a
positive association – the more violent TV that kids watch the more aggressively they
behave.
But at this point a little voice in your head should be screaming “correlation does not
indicate causation”. And you are right. This bivariate correlation establishes covariance,
but does not rule out directionality and third variable problems. So it does not support a
causal conclusion.
Figure 1. Longitudinal study examining the relation of exposure to TV violence and aggression.
Three Types of Correlations
Autocorrelations: The autocorrelation is the correlation of each variable with itself across
time, and tells us whether people who are higher on the variable at Time 1 are also
higher at Time 2. For example, the autocorrelation for aggression (r = .38) tells us that
the boys who were more aggressive at Time 1 also tended to be more aggressive at
Time 2.
Cross-lag correlations: The cross-lag correlation indicates the degree to which an earlier
measure of one variable is associated with the later measure of the other variable.
Researchers are most interested in these correlations because they can help to
establish temporal precedence. That is, the cross-lag correlations allow us to conclude
with confidence that the one variable came before the other. Depending on the pattern
of the two cross-lag correlations, we would draw different conclusions – you can test
yourself on how to interpret different outcomes of cross-lag correlations in the
Learning Activity below.
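The cross-lag logic can be sketched in Python with invented scores (five children, each measured at two time points); the helper below assumes nothing beyond the standard Pearson formula:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# Invented scores for five children at two time points
tv1  = [1, 2, 3, 4, 5]   # TV violence, Time 1
agg1 = [2, 3, 1, 5, 4]   # aggression, Time 1
tv2  = [4, 1, 3, 5, 2]   # TV violence, Time 2
agg2 = [2, 1, 3, 5, 4]   # aggression, Time 2

r_tv_to_agg = pearson_r(tv1, agg2)  # earlier TV violence -> later aggression
r_agg_to_tv = pearson_r(agg1, tv2)  # earlier aggression -> later TV violence
print(round(r_tv_to_agg, 2), round(r_agg_to_tv, 2))
```

With this invented pattern (a substantial Time 1 TV → Time 2 aggression correlation, but only a weak correlation in the other direction), the data would resemble the first interpretive outcome discussed in the Learning Activity.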
Learning Activity: Interpreting possible outcomes of cross-lag
correlations
The questions below present three possible outcomes of the cross-lag correlations: the
outcome depicted in Figure 1, and two other possibilities. For each of these outcomes,
briefly describe what it means.
Outcome 1: In Figure 1, Time 1 TV violence was significantly correlated with Time 2
aggression, whereas Time 1 aggression was not correlated with Time 2 TV violence.
How do you interpret this outcome?
My Answer:
This pattern supports the idea that watching more violent TV leads to increases in
aggressive behavior. Most importantly, it establishes temporal precedence because it
rules out the problem of reverse direction. It can’t be the case that being more
aggressive at Time 2 is what led the kids to watch more violent TV at Time 1.
Outcome 2: What if the study had instead found that Time 1 aggression was positively
correlated with Time 2 TV violence, but Time 1 TV violence was not correlated with
Time 2 aggression?
My Answer:
This pattern would have indicated that aggression came first, leading to greater
watching of TV violence later. That is, the kids who were already more aggressive at a
young age went on to have a greater preference for violent TV.
Outcome 3: What if the study found that both correlations were significant – that TV
violence at Time 1 predicted aggression at Time 2, and that aggression at Time 1
predicted TV violence at Time 2?
My Answer:
This pattern would suggest that viewing TV violence and acting aggressively are
mutually reinforcing – that they both influence each other.
Is there any way to rule out the suspected third variable (SES)? Yes, multiple
regression analysis is a way of statistically controlling for possible third variables. To
perform these analyses, you need to measure more than just the two original variables;
you also need to include a measure of the suspected third variable. More generally, the
following are the steps needed for multiple regression analyses.
1. Measure the suspected third variable.
You will need to measure more than just the two original variables; you must also
include a measure of the suspected third variable. [e.g., include a measure of SES]
2. Conduct analyses that control for the third variable.
Compute multiple regression analyses that tell you the relation between the two
original variables after “controlling for” the third variable. Conceptually you are
asking whether the original bivariate correlation appears in “subgroups” representing
the levels of the third variable [e.g., subgroups of students at different SES levels]. In
each subgroup, people have the same level of the third variable. So you are
conducting a correlation for a group of people where that third variable has been held
constant.
3. Interpret the pattern of results.
If the bivariate correlation remains within each of the “subgroups”, this indicates that
it was NOT due to the suspected third variable. This is because the correlation
appears even when that variable has been “controlled” or “held constant”. However,
if the bivariate correlation does not appear within the subgroups, this suggests that the
original bivariate correlation is explained by the suspected third variable.
Understanding what it means to control for a third variable
Let’s take a moment to consider these interpretations further, by looking at some
hypothetical data from a small sample of student participants (n = 9). As Table 1
shows, in this sample of students the original bivariate correlation between sports
involvement and academic success is very strong and significant (r = .85, p < .05). But
again, recall that there could be a third variable such as SES that explains this bivariate
correlation.
Sports involvement   Academic success
        1                   3
        2                   1
        4                   3
        4                   3
        5                   6
        8                   4
        9                   7
        9                   9
       10                   8
r = .85
Table 1. Original bivariate relation between sports involvement and academic success.
Fortunately, the researchers also included a measure of the participants’ SES with three
levels (1 = low, 2 = moderate, 3 = high). This allows the researchers to “control” for this
suspected third variable. That is, they can look at the relation between sports
involvement and academic success within “subgroups” of people who have the same
SES level. As Table 2 shows, there is no relation, or very little relation, within each of
the subgroups. So, in the analyses that control for SES (i.e., hold SES constant) there is
no longer an association between sports involvement and academic success. This
pattern of results suggests that the original bivariate correlation between sports
involvement and academic success was due to the third variable of SES levels. If
instead the correlation had been substantial at each level of SES, this would have
indicated that SES was not a likely third variable.
SES   Sports involvement   Academic success   r within subgroup
 1            1                   3
 1            2                   1                 .19
 1            4                   3
 2            4                   3
 2            5                   6                 .05
 2            8                   4
 3            9                   7
 3            9                   9                 .00
 3           10                   8
Table 2. Relation between sports involvement and academic success controlling for SES.
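The subgrouping logic can be checked directly against the data shown in Tables 1 and 2. A minimal Python sketch using the nine rows above:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation from raw deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sqrt(sum((a - mx) ** 2 for a in xs)) *
                  sqrt(sum((b - my) ** 2 for b in ys)))

# (SES, sports involvement, academic success) for the n = 9 students in Tables 1-2
rows = [(1, 1, 3), (1, 2, 1), (1, 4, 3),
        (2, 4, 3), (2, 5, 6), (2, 8, 4),
        (3, 9, 7), (3, 9, 9), (3, 10, 8)]

sports = [r[1] for r in rows]
academic = [r[2] for r in rows]
print(round(pearson_r(sports, academic), 2))   # 0.85: the original bivariate correlation

# "Controlling for" SES: compute r separately within each SES subgroup
within = {}
for ses in (1, 2, 3):
    sub = [r for r in rows if r[0] == ses]
    within[ses] = pearson_r([r[1] for r in sub], [r[2] for r in sub])
    print(ses, round(within[ses], 2))  # the association all but vanishes in each subgroup
```

The strong overall correlation (.85) disappears within the SES subgroups (.19, .05, .00), which is exactly the pattern that points to SES as the explanation for the original correlation.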
The researchers want to know whether a third variable can account for the bivariate
relationship between the two variables of interest. To answer the question, they see
what happens when they statistically control for the third variable. That is the basic logic
underlying multiple regression analysis.
Table 3. Results of a multiple regression analysis for a study examining predictors of academic
success.
Beta basics: Beta is similar to r. A positive beta indicates a positive relationship
between that predictor variable and the criterion variable when the other predictors
are statistically controlled for. A negative beta reflects a negative relationship when
the other predictors are controlled for. A beta that is zero or not significantly different
from zero suggests that there is no relationship when the other predictors are
controlled for.
Similarities between beta and r: direction (positive or negative) and strength (the closer
to -1 or +1, the stronger the relationship; the closer to zero, the weaker the relationship).
You may compare beta strengths within a single regression table, but you may not
compare beta strengths across regression tables. There are no absolute cutoffs for beta
effect sizes as we have with r and Cohen’s d cutoffs. Sometimes a regression table will
report b instead of beta; b is an unstandardized coefficient. You cannot compare two b
values even within the same table.
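For the special case of two predictors, standardized betas can be computed by hand from the three pairwise correlations. The r values below are hypothetical (they are not the Table 3 data), chosen only to show the arithmetic:

```python
# With standardized variables and two predictors, each beta follows directly
# from the three pairwise correlations. All r values here are hypothetical.
r_y1 = 0.50   # criterion with predictor 1
r_y2 = 0.40   # criterion with predictor 2
r_12 = 0.30   # predictor 1 with predictor 2

beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
print(round(beta1, 2), round(beta2, 2))  # 0.42 0.27
```

Note that beta1 (.42) is smaller than the raw correlation r_y1 (.50): controlling for a correlated predictor typically shrinks the association, which is the "controlling for" logic described above.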
Interpreting beta: Note that in Table 3 sports involvement has a beta of .39. This
positive beta, like a positive r, means that higher levels of sports involvement go with
higher levels of academic success, even when we statistically control for the other
predictor in this analysis—SES. The other beta, associated with the SES predictor
variable, is also positive. This beta means that higher SES is associated with higher
academic success, when sports involvement is controlled for.
Statistical significance of beta: In a regression table, there is usually a column
labeled p or sig, or an asterisk footnote with p values. When p is less than or equal to
.05, the beta is statistically significant. When p is greater than .05, the beta is not
significant. In Table 3, both of the betas are significant.
Research Study: Dr. Nguyen is a psychologist who studies legal decision making.
Specifically, he is curious about the factors that are irrelevant to the crime committed
that might influence the sentences juries give to defendants (known as extra-legal
factors). To study this further, he samples a group of jury-eligible adults from the
Toronto area. He provides them with the fact pattern to a particular case and allows
them to watch the closing statements from the trial. He then asks them to provide a
sentence (in months) for the defendant. In addition, he measures two legal factors (the
number of arguments made by the prosecuting attorney and the length of time the
defense attorney speaks during his or her closing argument) and two extra-legal factors
(how attractive the participants think the defendant is [higher scores indicate higher
ratings of attractiveness] and how many legal television shows the participants watch).
The data are below in Table 4.
Answer each of the questions below, before checking your answer against mine.
Criterion: Length of Sentence Provided
Predictor Variable      Beta      Significance (p)
Table 4. Results of a multiple regression analysis for a study examining factors that
influence legal decision making
My Answer: Length of sentence
Which predictor variables are related significantly to the length of sentence (when we
statistically control for the other predictors)?
Explain how the variable attractiveness of the defendant relates to criminal sentencing,
considering the direction of the relationship, statistical significance, and the strength of
the relationship compared with the other variables.
Moderator: Gender
2. Drinking wine more often is associated with greater life satisfaction but only because
wealthier people can afford to drink wine more often, and being wealthy is also
associated with greater life satisfaction.
My Answer:
Mediator: Socializing
4. Having a mentally demanding job is associated with cognitive skills in later years
because people who are highly educated take mentally demanding jobs, and people
who are highly educated have better cognitive skills.
My Answer:
Moderator: Gender
6. Having a mentally demanding job is associated with cognitive skills in later years
because cognitive challenges build lasting connections in the brain.
My Answer:
Mediator: loneliness
9. Sibling aggression is associated with poor childhood mental health only because of
parental conflict. Sibling aggression is more likely among parents who argue more,
and parents’ arguing also affects kids’ mental health.
My Answer:
Lesson 9
What is an experiment?
An experiment is a study in which researchers manipulate at least one variable
and measure another variable. In a “simple” experiment there is just one manipulated
variable; later in the course we will see experiments with more than one manipulated
variable. The goal in a simple experiment is to test whether the manipulated variable
has an effect on the measured variable(s). Thus the key factor that distinguishes an
experimental design from the correlational designs that we have looked at previously is
that the presumed causal variable is manipulated by the researcher rather than only
measured.
Experimental variables
Independent variable (IV):
This is the presumed causal variable that is manipulated by the researcher. This means
that the researcher assigns each participant to a particular level of the variable – to a
“condition” in the experiment. The different levels of the IV are the “conditions” in the
experiment. In a graph of results, the independent variable will be on the x-axis.
Control variables:
This term refers to potential variables that are held constant on purpose by the
researcher. That is, the researcher makes sure that every participant has the same level
of the variable – for example, the same room, same lighting, same time of day, same
experimenter giving the instructions, same computer keyboard for responding, etc.
As you may have noticed, this is just like testing for associations in
correlational studies with one categorical variable and one quantitative
variable. Again, we are comparing means to establish covariance – the
greater the difference between means across conditions, the stronger the
association between the IV and the DV.
Example: If the researchers find that the mean creativity score is much
higher in the alcohol condition than in the control condition, they have
established covariance. They have shown that alcohol consumption is
associated with creativity.
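A toy sketch of this comparison (creativity scores invented for illustration):

```python
from statistics import mean

# Hypothetical creativity scores (0-10) in each condition
alcohol_scores = [7, 8, 6, 9, 8]
control_scores = [4, 5, 3, 5, 4]

diff = mean(alcohol_scores) - mean(control_scores)
print(round(diff, 1))  # 3.4: the larger the gap, the stronger the IV-DV association
```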
1. Experiments Establish Temporal Precedence
My Answer:
In experiments we can be sure that the cause variable precedes the effect
variable. So we can totally rule out the possibility of reverse causation,
because the experimenter has manipulated the IV. It can’t be argued that the
participants’ standing on the IV may have been influenced by their standing
on the DV – we know that it was the experimenter that determined which level
of the IV each participant received, by assigning each participant to a level.
Recall that to make a causal claim you also need to be able to rule out third
variable explanations and be confident that it was the IV (and not some other
variable) that caused the difference in means on the DV across conditions.
You need to be sure that the difference in means across conditions was due
to the IV and only the IV. To do this, you need to be able to rule out
“confounds” or “threats to internal validity” discussed in the next section.
Example: If the experiment is done well, there will not be any differences
between the two conditions other than the level of alcohol consumption. If that
is the case, the experiment is high in internal validity and we can be sure it
was the alcohol consumption that created the differences in creativity across
the two conditions.
Design confounds
A design confound occurs in an experiment when some extraneous variable (a
variable other than the IV) varies systematically along with the IV and thus provides an
alternative explanation for the results. That is, a design confound exists when the
conditions differ systematically (i.e., differ on average) on some variable, other than the
independent variable, that might have affected the DV. If a design confound is present,
it threatens internal validity and the experiment does not support a causal claim.
Selection effects
A selection effect occurs in an experiment when the participants in one condition are
systematically different than the participants in the other condition(s). Selection effects
can arise in cases where participants are free to select which level of the IV they
receive, or in cases where the researcher assigns participants to the conditions but not
using random assignment. In such cases, it could be that different types of participants
end up in each condition. That’s a big problem when we go to interpret the results. We
cannot know whether a difference in means across conditions at the end of the study
was caused by the IV, or whether it was found because there were different types of
people in the conditions in the first place.
Fortunately, researchers can avoid selection effects by using random
assignment. Random assignment is a way of assigning participants to the different
conditions (i.e., different levels of the IV) such that every participant has an equal
chance of being assigned to each condition. If random assignment is used, there should
be no systematic differences between conditions prior to the manipulation of the
independent variable.
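A minimal Python sketch of random assignment (participant IDs and condition labels are invented; the shuffle-then-deal variant shown here also guarantees equal group sizes):

```python
import random

random.seed(42)  # seeded only so the sketch is reproducible

participants = list(range(1, 21))  # 20 hypothetical participant IDs
random.shuffle(participants)       # every ordering is equally likely

# Deal the shuffled list alternately into the two conditions
alcohol = participants[0::2]
control = participants[1::2]

print(len(alcohol), len(control))                       # 10 10
print(sorted(alcohol + control) == list(range(1, 21)))  # True: everyone assigned exactly once
```

Because the shuffle is random, each participant is equally likely to land in either condition, so pre-existing differences between people should balance out across groups on average.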
Learning Activity: Spot the confound
1. In a market research study, participants were randomly assigned to receive a 6-oz
serving of either Pepsi or Coke, and were asked to rate how much they liked the cola
on a scale from 1 to 10. To help the researchers keep track of the brand that each
participant was assigned to taste, the Pepsi was always served in a blue cup and Coke
in a red cup. The researcher also asked participants to indicate their gender, age, and
where they live.
My Answer:
IV – cola brand (Pepsi, Coke)
DV – liking of the cola (rating from 1-10)
Confound: Design confound: colour of the cup (blue, red). It might be that the
colour of the cup influences participants’ ratings of liking.
Solution: Use the same colour of cup for all participants.
2. A researcher tests whether the age of a target person (young vs. old) influences
perceived intelligence. The researcher first collects photos from the internet of several
old people and young people. Participants in the experiment are randomly assigned
to view photos of either the old people or the young people and they rate how
intelligent these people seem. Results indicate that the ratings are higher on average
for the old people than for the young people. The researcher also noticed later that
most of the old people, and none of the young people, were wearing glasses.
My Answer:
IV – age of target person (old, young)
DV – perceived intelligence
Confound: Wearing glasses or not. The conditions differ not only in the age
of the people being rated, but also whether they are wearing glasses. So we
can’t know whether it was the person’s age or glasses that made them seem
more intelligent to raters.
Solution: Choose the same number of photos where people are wearing
glasses (and the same number without glasses) in each condition.
Random assignment is a technique used to assign the sample of participants in your
experiment to the different conditions of the experiment.
Simple random assignment
There are a couple of downsides to using these “simple random assignment” procedures.
They might result in an unequal number of participants in each condition at the end of
the study, and at certain time periods during the study.
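As a concrete illustration, here is a minimal Python sketch (with hypothetical condition labels) of simple random assignment. Because each participant is assigned by an independent “coin flip”, the groups can easily end up unequal in size:

```python
import random

def simple_random_assignment(n_participants, conditions=("treatment", "control")):
    # Each participant is assigned independently, like a coin flip,
    # so equal group sizes are not guaranteed.
    return [random.choice(conditions) for _ in range(n_participants)]

assignments = simple_random_assignment(20)
print(assignments.count("treatment"), assignments.count("control"))
```

Running this repeatedly will often give splits like 12/8 or 9/11 rather than a clean 10/10.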
This matching procedure ensures that the conditions will be completely equivalent on a
variable that the researcher believes is important (i.e., the matching variable). Matching
is most useful in cases where the sample size is small, and thus random assignment
might not be as effective in equalizing the groups. Or when the researcher wants to be
absolutely sure that the groups are as equal as possible (e.g., when studying samples
that are very difficult or costly to recruit; when trying to detect small effects of an IV).
The downside of matching is that it often requires the extra step of contacting the
participants to measure the matching variable ahead of the experiment, so matching
procedures can be costly in terms of time and resources.
However, posttest-only designs can still be very powerful, given their combination of
random assignment and a manipulated variable (IV). They avoid the key disadvantage
of including a pretest which is that the pretest can create “testing effects” – that is,
participants may respond differently on the posttest than they would have if they hadn’t
first done a pretest. [We will look more closely at “testing effects” in the next lesson]
Within-groups (within-subjects)
In within-groups designs, the very same participants are exposed to all the different
conditions of the experiment. There are two variations.
Repeated measures. In a repeated measures design the participants are first exposed
to one condition and complete the dependent variable, and then are exposed to the
other condition and complete the dependent variable again. For example, participants
may first see a video of an assertive candidate and rate how much they like that
candidate, and then see a video of a submissive candidate and rate how much they like
that candidate.
Learning Activity: Types of experiments
1. A researcher ran an experiment in which he asked people to shake hands with an
experimenter (played by a female friend) and then to rate the experimenter’s
friendliness. People were randomly assigned to shake hands with her either after she
had cooled her hands under cold water or after she had warmed her hands under
warm water.
My Answer:
o IV: temperature of hands
o DV: perceived friendliness
o Type of experiment: independent groups posttest-only.
2. Participants in an experiment were presented with information about the grades of
two hypothetical students. These two students had similar grades in their most recent
term (A average) but differed in their previous grades: one student had earned
consistently strong grades every term whereas the other had low grades early on and
improved over time. Participants were asked to choose which of the two students was
most deserving of an academic scholarship.
My Answer:
o IV: grade history (consistent vs. improvement)
o DV: choice for scholarship
o Type of experiment: concurrent measures
3. Children participating in an experiment were first asked to read a passage of complex
text, and the researcher counted how many reading errors were made. Half of the
children were then given a series of literacy training exercises designed to improve
reading skills, whereas the other children did not receive these exercises. A week
later, when the children were again asked to read a passage of complex text, the
children who had received the literacy exercises made fewer errors than those who
did not.
My Answer:
o IV: training condition (exercises vs. no exercises)
o DV: reading errors
o Type of experiment: independent groups pretest/posttest
4. An experiment tested the effects of mood on memory. To manipulate participants’
current mood, the researcher asked some of the participants to visualize a recent
positive event and others to visualize a recent negative event. Participants were then
asked to describe their memories of their high school years and the researcher rated
how positive the memories were.
My Answer:
o IV: mood (positive, negative)
o DV: positivity of memories
o Type of experiment: independent groups posttest-only
5. Each participant in a study was presented with two different types of cola, and rated
their liking of each one after tasting it on a scale from 1-10.
My Answer:
o IV: type of cola (Pepsi, Coke)
o DV: liking of cola
o Type of experiment: repeated measures
Order effects
An order effect is a type of confound that can occur in within-groups designs because
one condition comes before the other condition in time. The means on the DV in the two
conditions may differ simply because of the passage of time or the sequence in which
the conditions were experienced. Order effects can be considered a “confound”
because it is not only the IV that differs across conditions, but also the sequence in
which the conditions were experienced.
Passage of time
e.g., participants may become bored, tired, hungry, or fatigued by the time of the second
condition
Practice effects
e.g., participants may get better at a task because they have done it before
Carryover effects: what happens in the first condition carries over, and
contaminates, the second condition.
e.g., participants in a taste test may rate the first cola they taste more positively than the
second because the first taste is always better and subsequent tastes are never quite
as good. The first taste lingers on and contaminates the second taste.
e.g., participants rate the submissive job candidate lower because they still have the
assertive candidate in mind, and in comparison the submissive candidate is not as
appealing (but they would have given higher ratings to the submissive candidate if they
weren’t first exposed to the assertive candidate)
Counterbalancing
Fortunately, there is a fairly straightforward solution to the problem of order effects in
repeated measures designs: counterbalancing. Counterbalancing means presenting the
levels of the independent variable to participants in different orders. For example, in an
experiment with two conditions half the participants would be randomly assigned to
order 1 (IV level 1 first, IV level 2 second) and the other half to order 2 (IV level 2 first,
IV level 1 second), as illustrated in the figure below.
So, in our example experiment, half the participants would rate the assertive video first
and the submissive video second (order 1) and half the participants would rate the
submissive video first and the assertive video second (order 2).
The main goal of counterbalancing procedures is to ensure that each condition occurs
early in the sequence as often as it occurs late in the sequence, so that effects
associated with order will be balanced out across the different conditions.
Full counterbalancing:
With full counterbalancing, every possible order is presented equally often in the
experiment. This is relatively easy if there are only 2 or 3 conditions.
Learning Activity
With more conditions, there will be a lot of possible orders. Indeed, the number of possible orders is equal to
the number of conditions “factorial”. The mathematical symbol for factorial is !. The !
symbol means that you start with the number and multiply by the next smallest number,
then by the next smallest number, and so on until you have multiplied by 1. For
example, 3! = 3 factorial = 3 x 2 x 1 = 6:
2 conditions: 2 x 1 = 2 orders
3 conditions: 3 x 2 x 1 = 6 orders
4 conditions: 4 x 3 x 2 x 1 = 24 orders
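This factorial growth is easy to verify with a short Python sketch (using hypothetical condition labels) that lists every possible order:

```python
import itertools
import math

conditions = ["A", "B", "C", "D"]  # hypothetical labels for 4 conditions
all_orders = list(itertools.permutations(conditions))

# The number of possible orders equals the number of conditions factorial
print(len(all_orders), math.factorial(len(conditions)))  # 24 24
```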
If you have an experiment with several conditions, then, it might not be feasible to use
full counterbalancing, and instead researchers would use some form of partial
counterbalancing.
Partial counterbalancing
With partial counterbalancing, only a subset of the possible orders are chosen and
presented to participants. How do you choose which subset of orders to use?
Randomized orders
One technique is to present a randomized order for every participant (i.e., select the
orders randomly from the list of all possible orders). For example, when an experiment
is administered by a computer it can easily select a new random order for each
participant. So you would not use every order, but instead would use only as many
orders as there are participants in the study.
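A minimal sketch of this approach (with hypothetical condition labels) is simply to shuffle the conditions independently for each participant:

```python
import random

conditions = ["A", "B", "C", "D"]  # hypothetical condition labels

def randomized_order(conditions):
    # Draw a fresh random order for each participant, rather than
    # using a fixed set of counterbalanced orders.
    order = list(conditions)
    random.shuffle(order)
    return order

orders = [randomized_order(conditions) for _ in range(5)]  # one order per participant
```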
Latin square
Another technique is to use a Latin square procedure. This involves selecting the subset
of orders in a special way so that they have two properties. First, every condition
appears in each ordinal position once. Second, every condition comes right before and
right after each other condition once.
In fact, if you have an even number of conditions in your experiment (e.g., 6 conditions),
the number of orders you need is equal to the number of conditions. So, for an
experiment with 6 conditions, you would need to use only 6 orders – much lower than
the 720 orders that would be needed for full counterbalancing! For an experiment with
an odd number of conditions, the number of orders you need is twice the number of
conditions. So for an experiment with 5 conditions, you would need to use 10 orders.
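One standard way to build such a set of orders is sketched below (this is one common balanced Latin square construction, not the only possible one). With an even number of conditions it yields n orders; with an odd number it adds the reversed orders for 2n in total:

```python
def balanced_latin_square(n):
    # Row i, position j of a balanced Latin square: every condition
    # appears in each ordinal position once and, for even n, comes
    # right before every other condition exactly once.
    rows = [[(i + (j // 2 + 1 if j % 2 else n - j // 2)) % n
             for j in range(n)]
            for i in range(n)]
    if n % 2:  # odd number of conditions: also use each order reversed
        rows += [list(reversed(r)) for r in rows]
    return rows

print(len(balanced_latin_square(6)))  # 6 orders (vs. 720 for full counterbalancing)
print(len(balanced_latin_square(5)))  # 10 orders
```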
My Ideas:
Advantages
1. Equivalent comparison groups. Participants in each condition are COMPLETELY
equivalent because they are the same participants and serve as their own comparison
groups. You could think of this as similar to a study using matching – but in this case
participants are “matched” in every way.
2. More power. These studies are better able to detect differences between conditions
due to a reduction of unsystematic variability (noise). We will expand on this topic in
the next lesson.
3. Fewer participants required. For example, imagine again that you are conducting an
experiment with three different conditions, and you would like to have 40 participants
in each condition. For an independent groups (between-subjects) design you would
need a total of 120 participants. For a within-groups (within-subjects) design, how
many participants would you need? The answer is only 40 participants in total –
because each participant would go through all three conditions.
Disadvantages
1. Potential for order effects. Repeated measures designs can have order effects
(however, as we discussed, these can be controlled using counterbalancing, so this is
not usually a serious disadvantage).
2. May not be practical or possible – if you can’t undo what happened in the first
condition. Example: Suppose someone has devised a new way of teaching children
how to ride a bike, called Method A. She wants to compare Method A with the older
method, Method B. Obviously, she cannot teach a group of children to ride a bike
with Method A and then return them to baseline and teach them again with Method
B. Once taught, the children are permanently changed. In such a case, a within-
groups design, with or without counterbalancing, would make no sense.
3. Demand characteristics. These occur when participants pick up on cues that let them
guess the experiment’s hypothesis. Seeing all the conditions can often let participants
guess the hypothesis, and this may change the way they act.
Construct validity: How well were the variables measured and
manipulated?
For dependent variables we would want to know whether the measures were reliable
and valid, just as we have done for all other types of research.
For manipulated variables, we want to be sure that the manipulation created the
differences on the IV that it was supposed to. As noted in the text, researchers often
include a manipulation check in their studies for this purpose. A manipulation check is
a measure that is different from the dependent variable. A manipulation check comes
after the experimental manipulation and measures where participants stand on the
independent variable. For example, if a study is trying to create different levels of
anxiety in participants (IV) to see if it affects their test performance (DV), the researcher
might include a few items that ask participants how anxious they were feeling just
before they do the test – that way the researchers can make sure the manipulation
created differences in anxiety (the IV) across the experimental conditions. Sometimes,
instead of including a manipulation check in the experiment, this is done instead in a
separate pilot study carried out before the actual experiment is conducted.
2. External validity: To whom or what can the causal claim
generalize?
We may also wish to know whether the results of the experiment can be generalized to
other people and other situations. However, we can never really establish high external
validity on the basis of a single experiment.
- If a within-groups design was used, did researchers control for order effects by
counterbalancing?
Lesson 10
1. Maturation: change on the DV caused by natural changes occurring within the
participants over the time period of the study.
Over the 8-week period, the students were becoming more comfortable and better
adjusted to their university surroundings, and it was these psychological changes
happening naturally within them that led to the lower depression scores. In other
words, they just improved on their own over time.
2. History: change on the DV is caused by external events or factors that happen during
the time period of the study. Something other than the IV happens between the pretest
and posttest.
Some event could have happened that affected many of the participants in the study,
and reduced their depression levels (for example, the government announced a
reduction in tuition fees).
3. Regression to the mean: A statistical phenomenon involving extreme scores. Occurs
when participants are selected for a study, or assigned to conditions, because they had
extreme scores. People with extreme scores on a measure tend to score more
moderately (closer to the mean) when retested.
Students were selected to be in the study because they scored especially high on the
pretest measures of depression. Consequently, their scores will tend to be lower (i.e.,
closer to the mean) when retested just due to regression toward the mean.
4. Attrition: Changes on the DV due to loss of participants during the study. Occurs
when those who drop out of the study differ systematically from those who stay.
It may be that the most depressed participants left the study, and that’s why the
average score is lower on the posttest.
5. Testing: A specific type of order effect which occurs when the very act of completing
a pretest influences responses on the posttest.
Just going through the pretest measures might have prompted participants to think
more about their state of depression, perhaps to reevaluate and see themselves as less
depressed than they initially reported, and that’s why they score lower on depression
when tested again later.
6. Instrumentation: Changes on the DV that occur because of changes over time in the
measurement “instrument” – that is, in how the DV is assessed. A researcher might
use slightly different versions of a self-report measure at the two time points. But this
applies most commonly to behavioural/observational measures for which coders may
shift their coding standards over time.
This study included an observational measure of depression and so, at least on that
measure, instrumentation could be a problem. The coders may have changed in how
they were rating participants’ depression level. They may have interpreted more
behaviours as indicative of depression early on in the study, and the very same
behaviors were given lower ratings later in the study.
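The regression-to-the-mean threat described above can be demonstrated with a small simulation (a sketch with made-up numbers, assuming each observed score is a stable true score plus independent measurement error): people selected for extreme pretest scores tend to score closer to the mean on retest even though nothing about them has changed.

```python
import random

random.seed(0)
n = 10_000
true_scores = [random.gauss(50, 10) for _ in range(n)]
# Observed score = true score + independent measurement error at each testing
pretest = [t + random.gauss(0, 10) for t in true_scores]
posttest = [t + random.gauss(0, 10) for t in true_scores]

# Select participants because they had extreme (high) pretest scores
extreme = [i for i, score in enumerate(pretest) if score > 70]
mean_pre = sum(pretest[i] for i in extreme) / len(extreme)
mean_post = sum(posttest[i] for i in extreme) / len(extreme)
print(mean_pre, mean_post)  # the retest mean falls back toward 50
```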
Six further threats – applied to a one-group pretest/posttest design
How to prevent these threats with a true experiment (comparison
groups)
Fortunately, there is a solution to these potential threats: Add a control group or
comparison group that receives a different level of the IV. In other words, conduct a
“true experiment” in which you manipulate the independent variable to create different
levels of the IV (different conditions).
In almost all cases, the influence of a confounding factor is equated across the
experimental conditions by random assignment. So the confounding factor cannot
explain why there is a difference
between the two conditions at the end of the study. For example, if maturation is leading
people to become less depressed, this would be happening in both the control and
treatment condition, resulting in reduced depression levels in both conditions. So any
difference between conditions on the post-test measure of depression would not be due
to maturation – and instead it can be interpreted as an effect of the treatment.
Similarly, random assignment to the treatment and control condition can eliminate each
of the other potential threats, as summarized below:
[Another solution is to compute the pretest scores and posttest scores with only the
final sample included; that is, remove the dropouts’ data from the pretest mean as
well.]
5. Instrumentation: Its influence is generally equated across groups by random
assignment
6. Testing: Most types of testing effects should be equated by random assignment
Combined threats
As noted in the text, sometimes in a pretest/posttest design two types of threats to
internal validity might work together.
Selection-history threat:
Suppose that the researcher did not use random assignment, but instead assigned
students at one university to the treatment group and students at another university to
the control group. However, during the course of the study, a stressful event occurs on
one of the campuses and that happens to be the campus of the control group. This
might explain why the posttest depression scores are higher in the control condition
than in the treatment condition. An external event is influencing scores in the one
condition but not in the other.
Selection-attrition threat:
Attrition is a more serious problem when it differs across conditions (sometimes called
“differential attrition”). That is, it could be that different numbers of people and different
types of people drop out of the two conditions. For example, maybe the most severely
depressed people dropped out of the treatment condition because they found the
training sessions too demanding and arduous. This might explain why posttest
depression scores were lower in that condition.
Researchers can avoid this problem by computing pretest scores and posttest scores
with only the final sample included; that is, removing dropouts’ data from the pretest
mean as well.
Example. In our example study, the participants are first given a measure of
depression, and then later are given the same measure of depression again with some
sessions in between. They might reasonably guess that the researchers are studying
how depressed they are, and whether what is happening during the sessions affects
their depression.
Solutions:
Keep the participants “blind” to their condition – this is the best solution in studies where
it is feasible.
Conceal the full purpose of study; use cover stories to disguise the study purpose.
The problem for researchers, then, is that in experiments testing the effect of a
therapeutic treatment, the placebo effect provides an alternative explanation.
Participants in a treatment condition receive not only the treatment but also beliefs or
expectations that it will affect them. So how can the researchers know whether an
effect is actually caused by therapeutic treatment itself or by the accompanying beliefs?
Note that you can also test whether there actually is any placebo effect by including
both the placebo control group and a standard control group (which does not receive
the treatment or any belief that they received it).
1. The IV manipulation (i.e., the treatment). If the experiment is well designed, this
should be the only thing creating the between-groups difference.
2. Confounds (i.e., variables that differ systematically between the groups). If the
researcher did allow confounds to creep into the experiment, these would also lead to
a systematic difference between the groups.
Statistically, the measure of how much systematic variability there is in the experiment
is the difference between means (Mean1 – Mean2).
Where does this type of unsystematic variability come from? There are actually several
sources including:
Individual differences: People truly differ from one another in ways that lead
them to have different scores on the DV.
Measurement error: Any measurement error in the DV will create within-group
variability. That is, even if two participants are truly the same on the DV, they
might get slightly different scores due to measurement error (recall that this
happens more when measures are not highly reliable).
Situation factors: Situational differences in the experimental sessions that lead
people’s scores to differ (e.g., the time of day, rooms, the level of noise,
environmental distractions, slight differences in experimenter’s behavior, etc.)
Note that in each case, these factors are leading the participants’ scores to be different
from one another but, unless they are confounded with the IV, they would not create a
difference in the means between the conditions of the experiment. They would just lead
to variation within each condition.
Effect size, d, indicates how far apart the means are in Standard Deviation units. For example, if
d = 2.2, this indicates that the mean in one condition is 2.2 standard deviations higher than the
mean in the other condition. This would be considered a large effect (See text for the values of d
that are considered small, medium, and large effects).
Effect size, d, is based on exactly the same underlying logic as the t-test – that is, it involves a
comparison of the between groups difference and the within-groups variability. Again, you
don’t need to calculate effect size d in this course, but here is the formula so you can see what
goes into it
d = (M1 – M2) / SDpooled

Figure 6. Depiction of effect size, d, equation
Note that the formula is based on two components: the between-groups difference (M1 – M2), the
within-group variability (SD pooled across the conditions). Unlike the t-test, it does not take
sample size into account.
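As a sketch (with made-up data), d can be computed from the difference between the condition means and the pooled standard deviation:

```python
import statistics

def cohens_d(group1, group2):
    # d = (M1 - M2) / SD_pooled; note that sample size never enters the formula
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return (m1 - m2) / pooled_var ** 0.5

treatment = [12, 14, 15, 13, 16]  # hypothetical DV scores
control = [10, 11, 9, 12, 10]
print(round(cohens_d(treatment, control), 2))  # about 2.61, a very large effect
```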
Figure 3. Depiction of a significant difference (frequency polygons showing the score
distributions of the control and experimental groups).
In contrast, if there is a null effect, we would see that the between-groups difference is quite
small in comparison to the within-groups variability. Another way to say this is that there is a lot
of overlap in the two distributions of scores (they fall mostly on top of each other). In such cases,
there would not be a significant difference between the means. This can be seen in the figure
below where the difference between the means is quite small in comparison to the amount of
variability within each group:
Note that the formula is based on three components: the between-groups difference (M1 – M2),
the within-groups variance (SD2 pooled across the conditions), and the number of participants in
each condition.
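A sketch of the pooled-variance t calculation (with made-up data) shows the role of sample size: with the same means and spread but more participants, t increases, whereas d would stay the same:

```python
import statistics

def independent_t(group1, group2):
    # t = (M1 - M2) / sqrt(pooled variance * (1/n1 + 1/n2));
    # unlike d, the sample sizes enter the formula directly
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1) +
                  (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return (m1 - m2) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

treatment = [12, 14, 15, 13, 16]  # hypothetical DV scores
control = [10, 11, 9, 12, 10]
print(round(independent_t(treatment, control), 2))          # 5 per group
print(round(independent_t(treatment * 4, control * 4), 2))  # 20 per group: larger t
```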
This is much like trying to detect the effect of an experimental manipulation – although
the independent variable may actually be having an effect (the “signal” is there) it
cannot be detected because there is too much within-groups variability (too much
background “noise”). Indeed, experimental researchers often use the terms “signal” and
“noise” when talking about the systematic and unsystematic variability in their
experiments.
Lesson 11
A simple experiment (aka a one-way design) is able to test a simple causal hypothesis
about the effect that one IV will have on a DV. Below are some examples of simple
causal hypotheses that have been tested by researchers:
A noteworthy aspect of such claims is that they are very general and unqualified. They
imply that the IV generally has the stated effect on the DV
Now, when researchers think of another variable that might alter the effect that the
original IV has on a DV, they end up generating a more complex causal hypothesis
known as an “interaction hypothesis”. An interaction hypothesis states how two (or
more) IVs affect a DV. They are called “interaction hypotheses” because they describe
how two independent variables combine or “interact” to influence a dependent variable.
When these hypotheses are tested in experiments, they lead to more precise and more
specific conclusions.
Note that an interaction hypothesis states that the effect that one IV will have on a
DV depends on some other factor. An interaction hypothesis typically takes the
following form: It states that the effect one variable (X) has on the dependent variable
(Y) depends on another variable (Z). And then it spells out the effect that X is expected
to have on Y at each level of Z.
In the context of experiments, we can say that a moderator is a variable that changes
the effect that an IV has on the DV. So a “moderation hypothesis” is actually another
term for an “interaction hypothesis”. The researcher is identifying a variable (a
moderator) that alters the effect that an IV has on the DV.
Factorial design terminology
A factorial design is a research design that examines the impact of 2 or more factors
(i.e., independent variables) simultaneously.
The number of conditions (or “cells”) is the product of the number of levels of each
factor.
2 x 2 = 4 conditions
2 x 4 = 8 conditions
3 x 2 x 4 = 24 conditions
Factor A has 2 levels (A1, A2) and Factor B has 2 levels. These factors are crossed to
create all possible combinations (A1B1, A2B1, A1B2, A2B2). So there are four
conditions in the experiment (and there are four “cells” in the table).
Below is a table or design matrix that illustrates a 2 x 2 study:

       A1      A2
B1     A1B1    A2B1
B2     A1B2    A2B2
                              Extremely     Moderately    Mildly        Nonviolent
                              Violent TV    Violent TV    Violent TV    TV
Positive Parental Response
Negative Parental Response

Table 3. 2 x 4 design with 4 different levels of TV violence and 2 parental responses
Factorial design notation
Factorial notation conveys a lot of information succinctly with numbers. For example,
when we read that the researcher used a 2 x 3 design, we know that there are two
factors, one with 2 levels and one with 3 levels, and thus 2 x 3 = 6 conditions. The
factors may be of two types:
IV x IV (two manipulated independent variables)
IV x PV (a manipulated independent variable crossed with a participant variable)
In factorial experiments researchers often call both types of variables IVs for the sake of
simplicity. However the distinction does matter when drawing conclusions about
causation – recall that only independent variables that are manipulated experimentally
can support a causal claim.
Note that each factor/IV is presented using a variable name and also the levels (variable
name: level 1, level2). Note also that a separate statement is needed to describe the DV
in the experiment.
My Answer:
The study used a 3(qualification level: high, moderate, low) x 2(behavior: friendly,
unfriendly) independent-groups design. The dependent variable was the participants’
liking of the candidate.
Study 2
A researcher believed that people may be more prone to driver error when they are
listening to the radio, and that this effect would be more pronounced under heavier
traffic conditions. He asked participants to do a driving test (using a driving simulator)
once with the radio on and once with it off. Also during each of these tests, the traffic
was heavy half of the time and was light the other half of the time.
My Answer:
The study used a 2(radio: on, off) x 2(traffic: heavy, light) within-groups design. The
dependent variable was the number of errors made in a driving test.
Study 3
A researcher tested people’s ability to recall names by first presenting participants with
the photo and name of 30 people, and then 20 minutes later asking them to name each
photo – the researcher counted how many names were correctly recalled. Participants
did the name memory test once in the morning and once in the evening. The researcher
found that old participants and middle-aged participants did better in the morning, but
young participants did better in the evening.
My Answer:
The study used a 2(time of day: morning, evening) x 3(age: young, middle-age, old)
mixed factorial design, with the time of day as a within-groups factor and age as an
independent-groups factor. The dependent variable was the number of names correctly
recalled.
Does the effect that performance level has on self-esteem differ depending on task
importance?
Thus in a study with 2 IVs, there are 3 possible “effects” to look for: A main effect of the
first IV, a main effect of the second IV, and an interaction effect.
Which means do we compare to assess these effects?
                          Success    Failure    Marginal (row) means
Important tasks             18         12         15
Unimportant tasks           10         10         10
Marginal (column) means     14         11
This main effect indicates that, overall, people reported higher self-esteem when they
succeeded at tasks than when they failed (disregarding the importance of the tasks).
The marginal mean for important tasks (M = 15) is higher than the marginal mean for
unimportant tasks (M = 10) so there is a main effect of the importance factor.
This main effect indicates that, overall, people reported higher self-esteem when they
had done important tasks than when they had done unimportant tasks (disregarding
how well they performed).
Interaction effect
To evaluate an interaction effect we need to compare condition means.
Begin by asking what effect the first factor (performance level) had when participants
were at the first level of the other factor (important tasks). In other words, how much of
a difference between means was created by success vs. failure on the task. When tasks
were important, the difference created by success vs. failure was 6 (18 – 12 = 6).
Next, ask what effect the first factor (performance level) had when participants were at
the second level of the other factor (i.e., unimportant tasks). How much of a difference
between means was created by success vs. failure on the task. When tasks were
unimportant, the difference created by success vs. failure was 0 (10-10 = 0) (i.e., there
was no effect).
Now compare those differences. If the amount of difference created is not the same,
then there is an interaction effect. In other words, there is an interaction when there is
“a difference between the differences”.
The effect that performance level had on self-esteem was different depending on
whether the tasks were important or unimportant. When tasks were important, people
had higher self-esteem when they succeeded than when they failed. When tasks were
unimportant, people had the same level of self-esteem when they succeeded as when
they failed.
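The arithmetic above can be condensed into a few lines; the cell means come straight from the example:

```python
# Cell means from the example (performance level x task importance)
means = {
    ("important", "success"): 18, ("important", "failure"): 12,
    ("unimportant", "success"): 10, ("unimportant", "failure"): 10,
}

# Effect of performance level at each level of task importance
diff_important = means[("important", "success")] - means[("important", "failure")]
diff_unimportant = means[("unimportant", "success")] - means[("unimportant", "failure")]

# An interaction is a "difference between the differences"
interaction = diff_important - diff_unimportant
print(diff_important, diff_unimportant, interaction)  # 6 0 6
```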
In this case the line for important tasks is generally higher (on average) than the line for
unimportant tasks, so there is a main effect of task importance.
Interaction effect
If the lines are parallel to each other (i.e., the slopes are the same) there is not an
interaction effect.
If the lines are NOT parallel to each other (i.e., the slopes are different) there is an
interaction effect.
In this case, the lines are not parallel so there is an interaction effect. This tells us that
the effect created by performance level was greater (i.e., the slope was steeper) for the
important tasks than for the unimportant tasks.
Bar Graphs
You can also identify the effects in a similar manner. For example, here we see the
main effect of performance level because the bars on the left are higher overall (on
average) than the bars on the right.
Statistical Tests
As with one-way designs, researchers would also typically perform a statistical test to
determine whether the difference between means is actually a significant difference (p <
.05).
The statistical analysis used for factorial designs is not the t-test that we have seen
previously, but rather the Analysis of Variance (ANOVA). This analysis is
used to identify the amount of systematic variance coming from:
Main effect of A
Main effect of B
A x B interaction
The analysis gives an “F value” for each effect that represents the amount of systematic
variance relative to unsystematic variance. In this way, the logic underlying the analysis
is very similar to the t-test. It is a test of how much systematic variance is created
relative to the amount of unsystematic variance.
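To make the F logic concrete, here is a hand computation on hypothetical raw scores (not from the text; invented so that the cell means match the running example, with three scores per cell). Each F value is the systematic variance for an effect divided by the unsystematic (within-cell) variance:

```python
# Hypothetical raw scores; cell means are 18, 12, 10, 10 as in the example.
cells = {
    ("important", "success"):   [17, 18, 19],
    ("important", "failure"):   [11, 12, 13],
    ("unimportant", "success"): [9, 10, 11],
    ("unimportant", "failure"): [9, 10, 11],
}
n = 3  # scores per cell

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for xs in cells.values() for x in xs])  # grand mean = 12.5

# Marginal means for importance (A) and performance level (B)
m_a = {a: mean(cells[(a, "success")] + cells[(a, "failure")])
       for a in ("important", "unimportant")}
m_b = {b: mean(cells[("important", b)] + cells[("unimportant", b)])
       for b in ("success", "failure")}

# Sums of squares (equal cell sizes, 2 levels per factor)
ss_a = 2 * n * sum((m - grand) ** 2 for m in m_a.values())
ss_b = 2 * n * sum((m - grand) ** 2 for m in m_b.values())
ss_cells = n * sum((mean(xs) - grand) ** 2 for xs in cells.values())
ss_axb = ss_cells - ss_a - ss_b
ss_within = sum(sum((x - mean(xs)) ** 2 for x in xs)
                for xs in cells.values())

ms_within = ss_within / (4 * (n - 1))  # df_within = 8
# df = 1 for each effect in a 2 x 2 design, so MS = SS for each effect
f_a = ss_a / ms_within      # main effect of importance
f_b = ss_b / ms_within      # main effect of performance level
f_axb = ss_axb / ms_within  # interaction
```

Larger F values mean more systematic variance relative to unsystematic variance, which is exactly the logic of the t-test extended to several effects at once.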
Example 1
         A1   A2   Marginal
B1       16   10   13
B2       14    8   11
Marginal 15    9
Table 6. Factorial design effects example 1 with generic factors
Main effect of Factor A? Yes. Points for A1 are higher (on average) than points for A2.
Main effect of Factor B? Yes. Line for B1 is higher (on average) than line for B2.
A x B interaction? Yes. Lines are not parallel. Slope of line for B1 is not the same as
slope of line for B2.
Example 4
Figure 4. Line graph with DV score on the y-axis and A1 and A2 on the x-axis.
Main effect of Factor A? Yes. Points for A1 are lower (on average) than points for A2.
Main effect of Factor B? Yes. Line for B1 is higher (on average) than line for B2.
A x B interaction? No. Lines are parallel. Slope of line for B1 is the same as slope of line
for B2.
         A1   A2   A3    Marginal
B1       20   15   10    15
B2       18   13    5    12
Marginal 19   14    7.5
Table 13. 2 x 3 design with generic factors
1. Main effect of Factor A: Compare the marginal means for Factor A. If they are not
all the same, there is a main effect.
My Answer: Yes there is a main effect of Factor A [19 vs. 14 vs. 7.5]
2. Main effect of Factor B: Compare the marginal means for Factor B, found by
averaging across the three levels of Factor A. If the marginal means for B are not the
same, there is a main effect.
My Answer: Yes there is a main effect of Factor B [15 vs. 12]
3. Interaction effect: Do it in steps.
o First ask what difference is created by moving from A1 to A2, and is this
difference the same at B1 and at B2. Here we see a difference of 5, and it is
the same, so at this point we don’t see an interaction.
o Next ask what difference is created by moving from A2 to A3, and is this
difference the same at B1 and at B2. Here we see a difference of 5 for
participants at B1 and a difference of 8 for participants at B2. These
differences are not the same, so we conclude there is an interaction effect.
If the effect created by moving across the levels of Factor A differs
depending on Factor B, there is an interaction effect – the effect of Factor
A depends on Factor B.
My Answer: Yes there is an interaction effect. [15-10=5 vs. 13-5=8]
Again, it is worth noting that you could alternatively test for the interaction by
asking: What is the difference created by Factor B (B1-B2), and is it the same at each
level of Factor A? There is a difference of 2 at A1, a difference of 2 at A2, but a
difference of 5 at A3. So the difference created by B is not the same, and we conclude
that there is an interaction effect.
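The stepwise check above can be sketched in Python for the 2 x 3 table (B1: 20, 15, 10; B2: 18, 13, 5). There is an interaction if the effect of moving across the levels of A is not the same at every level of B:

```python
table = {
    "B1": {"A1": 20, "A2": 15, "A3": 10},
    "B2": {"A1": 18, "A2": 13, "A3": 5},
}

def adjacent_differences(row):
    """Differences created by moving A1 -> A2 and A2 -> A3 within one row."""
    return [row["A1"] - row["A2"], row["A2"] - row["A3"]]

diffs_b1 = adjacent_differences(table["B1"])  # [5, 5]
diffs_b2 = adjacent_differences(table["B2"])  # [5, 8]
interaction = diffs_b1 != diffs_b2            # True: the A2 -> A3 effect differs
```

The same conclusion follows from comparing B1 - B2 at each level of A (2, 2, 5), just as the text notes.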
In a line graph of these data, there is an interaction effect because the lines are not
always parallel.
The following questions test your ability to identify main effects and interactions when
factors have more than two levels. Some of the questions use generic factors (Factor A,
Factor B) and others use the factors from our example experiment (Performance Level,
Importance). Answer each question and then check against my answers.
1. Examine the table and answer each of the questions below it.
A1 A2 A3
B1 14 12 10
B2 12 12 12
My Answer:
Main effect of Factor A? Yes. Marginal means are 13, 12, 11.
Main effect of Factor B? No. Marginal means are 12, 12.
Interaction effect? Yes. A1-A2 differences are 2, 0. A2-A3 differences are 2, 0.
B1-B2 differences are 2, 0, -2.
Sketch a line graph:
2. Examine the table and answer each of the questions below it.
                      Success  Failure
Extremely Important     18       16
Moderately Important    14       12
Slightly Important      12        7
Not Important           10        5
Table 15. Table with 2 levels of performance (success, failure) on top and 4 levels of
importance (extremely important, moderately important, slightly important, not
important) along the side.
Main effect of Performance?
Main effect of Importance?
Interaction effect?
Sketch a Line graph:
My Answer:
Main effect of Performance? Yes. Marginal means are 13.5, 10.
Main effect of Importance? Yes. Marginal means are 17, 13, 9.5, 7.5.
Interaction effect? Yes. Success-Failure differences are 2, 2, 5, 5.
Sketch a line graph:
Extensions and variations
Higher order designs: More than 2 factors
Researchers sometimes conduct “higher order” factorial designs, meaning they include
more than two factors.
To illustrate, let’s imagine that we added yet another factor to our example experiment.
The researcher wonders whether the effect of the other two factors might further
depend on the sex of the participants, and so the researcher decides to include
participant sex as another factor. Now the experiment is a 2 (performance level:
success, failure) x 2 (importance: important, unimportant) x 2 (sex: male, female)
design. Below is a table showing the design and results.
                      Success  Fail   Marginal
Female  Important       18      12      15
Female  Unimportant     16      16      16
Male    Important       18      14      16
Male    Unimportant     18      18      18
Marginal               17.5     15
Table 18. 2 x 2 x 2 design with success/failure factors, female/male factors, and
important/unimportant factors
What effects can we look for in a 2 x 2 x 2 design? Below we will first look at this with
generic factors (A, B, C), and then after that you can try applying this to the example
experiment.
Main effects: There can be one main effect of each factor (The number of possible
main effects always equals the number of factors in a study)
Main effect of Factor A
Main effect of Factor B
Main effect of Factor C
2-way interactions: These are interactions that combine two factors at a time.
A x B interaction
A x C interaction
B x C interaction
3-way interaction: Combining all three factors
A x B x C interaction
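The list of possible effects follows a simple rule: there is one effect for every non-empty subset of the factors. A short sketch that enumerates them:

```python
from itertools import combinations

def possible_effects(factors):
    """One effect per non-empty subset of factors: main effects for
    single factors, interactions for larger subsets."""
    effects = []
    for k in range(1, len(factors) + 1):
        for combo in combinations(factors, k):
            effects.append(" x ".join(combo))
    return effects

effects = possible_effects(["A", "B", "C"])
# ['A', 'B', 'C', 'A x B', 'A x C', 'B x C', 'A x B x C']
```

For three factors this gives 7 possible effects: 3 main effects, 3 two-way interactions, and 1 three-way interaction.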
Learning Activity: Possible Effects in the 2 x 2 x 2 design
Now try applying this to our example experiment. List all of the possible effects that
could occur in your study notes. Then compare with my answers below:
My Answer:
Three main effects: Performance Level, Importance, Sex
Three 2-way interactions: Performance Level x Importance, Performance Level x Sex,
Importance x Sex
One 3-way interaction: Performance Level x Importance x Sex
                      Success  Fail   Marginal
Female  Important       18      12      15
Female  Unimportant     16      16      16
Male    Important       18      14      16
Male    Unimportant     18      18      18
Marginal               17.5     15
Table 18. 2 x 2 x 2 design with success/failure factors, female/male factors, and
important/unimportant factors (repeated from above)
For each effect, I have indicated which means need to be compared. Follow these
instructions, and indicate whether each effect occurred – just write Yes or No in
your study notes. Then compare to my answers.
Main effect of performance level: Compare marginal mean for all success conditions
(average across 4 conditions) and marginal mean for all failure conditions (average
across 4 conditions)
My Answer: Yes. Success 17.5 vs. Failure 15
Main effect of importance: Compare marginal mean for all important conditions
(average across 4 conditions) and marginal mean for all unimportant conditions
(average across 4 conditions).
My Answer: Yes. Important 15.5 vs. Unimportant 17
Main effect of sex: Compare marginal mean for all female conditions (average across
4 conditions) and marginal mean for all male conditions (average across 4 conditions).
My Answer: Yes. Female 15.5 vs. Male 17
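The three main-effect comparisons above can be checked with a short sketch: each marginal mean is the average of the four cell means that share one level of one factor (cell means are taken from the table above):

```python
cells = {
    ("success", "important",   "female"): 18,
    ("failure", "important",   "female"): 12,
    ("success", "unimportant", "female"): 16,
    ("failure", "unimportant", "female"): 16,
    ("success", "important",   "male"):   18,
    ("failure", "important",   "male"):   14,
    ("success", "unimportant", "male"):   18,
    ("failure", "unimportant", "male"):   18,
}

def marginal_mean(position, level):
    """Average of the 4 cell means whose factor at `position` equals `level`."""
    vals = [m for key, m in cells.items() if key[position] == level]
    return sum(vals) / len(vals)

success, failure = marginal_mean(0, "success"), marginal_mean(0, "failure")
important, unimportant = marginal_mean(1, "important"), marginal_mean(1, "unimportant")
female, male = marginal_mean(2, "female"), marginal_mean(2, "male")
# success 17.5 vs. failure 15.0; important 15.5 vs. unimportant 17.0;
# female 15.5 vs. male 17.0
```

Because each pair of marginal means differs, all three main effects are present.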
Performance level x Importance: Create a 2(performance level) x 2(importance) table
which ignores the sex factor by averaging across it.
Success Fail
Important 18 13
Unimportant 17 17

Performance level x Sex: Create a 2(performance level) x 2(sex) table which ignores
the importance factor by averaging across it.
Success Fail
Female 17 14
Male 18 16

Importance x Sex: Create a 2(importance) x 2(sex) table which ignores the
performance level factor by averaging across it.
Important Unimportant
Female 15 16
Male 16 18
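Collapsing over a factor is just averaging the cell means on either side of it. A sketch that collapses the 2 x 2 x 2 cell means over Sex to get the 2(performance level) x 2(importance) table:

```python
cells = {
    ("success", "important",   "female"): 18,
    ("failure", "important",   "female"): 12,
    ("success", "unimportant", "female"): 16,
    ("failure", "unimportant", "female"): 16,
    ("success", "important",   "male"):   18,
    ("failure", "important",   "male"):   14,
    ("success", "unimportant", "male"):   18,
    ("failure", "unimportant", "male"):   18,
}

def collapse_over_sex(performance, importance):
    """Average the female and male cell means for one condition."""
    return (cells[(performance, importance, "female")]
            + cells[(performance, importance, "male")]) / 2

collapsed = {
    (p, i): collapse_over_sex(p, i)
    for p in ("success", "failure")
    for i in ("important", "unimportant")
}
# success/important 18.0, failure/important 13.0,
# success/unimportant 17.0, failure/unimportant 17.0
```

The collapsed means match the 2-way table above, so the Performance Level x Importance interaction can be examined while ignoring Sex.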
Go back to the full design table, and examine the Performance Level x Importance
interaction pattern at each level of Sex. Does the 2-way interaction differ across the
levels of Sex?
                      Success  Fail
Female  Important       18      12
Female  Unimportant     16      16
Male    Important       18      14
Male    Unimportant     18      18
Interaction Effects
The text describes two approaches that can be used to describe interaction effects in
words:
1. Describe and compare the effect that one factor has when you are at each level of the
other factor.
2. Use key phrases such as “especially for” and “but only for” to describe the different
effects that a factor has.
I recommend that you usually stick with the first approach, that is, describe the effect
that one factor has at each level of the other factor. As your text notes, this is the
“foolproof” way to describe an interaction effect.
Generic Template 1
“There was an interaction between Factor A and Factor B. The interaction indicates that
the effect of Factor A on the DV differs across the levels of Factor B. Specifically, when
[first level of Factor B], increases in Factor A resulted in higher scores on the DV. In
contrast, when [second level of Factor B], increases in Factor A resulted in lower scores
on the DV.”
Generic Template 2
“There was an interaction between Factor A and Factor B. The interaction indicates that
the effect of Factor A on the DV depends on Factor B. Specifically, when [first level of
Factor B], participants scored higher on the DV in the A1 condition than in the A2
condition. In contrast, when [second level of Factor B] participants’ scores on the DV did not
differ across the A1 condition and the A2 condition.” [or did not differ by the same
amount]
Learning Activity: Describing main effects and interactions in
words
For each example below, provide a verbal description of any main effects and/or
interaction effect. First state briefly whether the effect was found, and then go on to say
in words what it means.
Example 1
Success Failure
Important 16 12
Unimportant 14 10
There was a main effect of performance level. Participants had higher levels of self-
esteem when they succeeded on a task than when they failed.
There was also a main effect of importance, indicating that participants had higher levels
of self-esteem when they did important tasks than when they did unimportant tasks.
There was not an interaction between performance level and importance. Succeeding
on a task led to the same increase in self-esteem regardless of the importance of the
task.
Example 2
Success Failure
Important 18 10
Unimportant 12 10
There was a main effect of performance level. Participants had higher levels of self-
esteem when they succeeded on a task than when they failed. There was also a main
effect of importance, indicating that participants had higher levels of self-esteem when
they did important tasks than when they did unimportant tasks.
These main effects were qualified by an interaction effect. When participants did
important tasks, they had much higher self-esteem when they succeeded than when
they failed. However, when they did unimportant tasks their self-esteem was only
slightly higher when they succeeded than when they failed. Thus the effects of
performance level on self-esteem were moderated by the importance of the task.
Example 3
A study tested whether caffeine intake would improve performance on a task and
whether this would depend on the difficulty of the task. Below are the performance
scores out of 100.
No Caffeine Caffeine
Easy
Task 80 85
Difficult
Task 40 60
Table 25. 2 x 2 design with no caffeine/caffeine factors and easy task/difficult task
factors
My Answer:
There was a main effect of caffeine intake. Performance scores were higher in the
caffeine condition than in the control condition. There was also a main effect of task
difficulty indicating that performance was higher on the easy tasks than on the difficult
tasks.
There was also an interaction effect indicating that the effect of caffeine intake was
moderated by task difficulty. When participants did easy tasks, caffeine intake improved
their performance scores by 5%, but when they did difficult tasks caffeine intake
improved their scores by 20%. [Thus caffeine led to greater improvement for difficult
tasks than for easy tasks]
Example 4
                    No Alcohol  Alcohol
Not Sleep Deprived      6         10
Sleep Deprived          8         18
Table 26. 2 x 2 design with no alcohol/alcohol factors and not sleep deprived/sleep
deprived factors
My Answer:
There was a main effect of alcohol consumption, indicating that participants who
consumed alcohol made more errors than those who did not. There was also a main
effect of sleep deprivation, indicating that participants made more errors when they were
sleep deprived than when they were not.
These main effects were qualified by an interaction effect. When participants were not
sleep deprived, alcohol increased errors by 4, but when they were sleep deprived,
alcohol increased errors by 10. Thus the effect of alcohol was moderated by sleep
deprivation.
However, to address possible order effects, you would want to use some kind of
counterbalancing procedure. For instance, in our example experiment, you could
number or letter the conditions just as if there were 4 levels of a single IV. Then use the
counterbalancing procedures that we used previously for single-factor designs – either
full counterbalancing or partial counterbalancing.
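Full counterbalancing treats the four cells of the 2 x 2 design as four conditions and uses every possible order. A minimal sketch (the condition labels are illustrative):

```python
from itertools import permutations

conditions = [
    "important-success", "important-failure",
    "unimportant-success", "unimportant-failure",
]
# Every possible order of the four conditions: 4! = 24 orders
all_orders = list(permutations(conditions))
```

Partial counterbalancing would instead select a balanced subset of these 24 orders (e.g., a Latin square), which is why it needs far fewer participants.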
Note that within-groups designs require far fewer participants (see Figure 12.21 in text
for an illustration). For instance, if you want 10 participants in each condition, you would
need only 10 participants in total, because they will go through every condition.
Note that again you would want to counterbalance the order in which participants
receive the levels of the within factor. So, you need to address both the assignment of
participants to conditions (between-factor) and the order in which they will receive the
conditions (within-factor).
Use a 2-step process: First, deal with the between-subjects factor by using random
assignment of participants to groups (important group, unimportant group). Next, deal
with the repeated-measures factor by creating different orders. Do this separately for
each group. For example, randomly assign half of the important group to order 1
(success first, failure second) and half to order 2 (failure first, success second).
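The 2-step process can be sketched with a small, hypothetical helper (the function name and labels are illustrative, not from the text):

```python
import random

def assign_mixed_design(participants, groups, orders, seed=0):
    """Two-step assignment for a mixed design:
    1) randomly assign participants to between-subjects groups;
    2) within each group, assign half to each order of the within factor."""
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)                       # step 1: random assignment
    per_group = len(pool) // len(groups)
    assignment = {}
    for i, group in enumerate(groups):
        members = pool[i * per_group:(i + 1) * per_group]
        half = len(members) // 2
        for j, person in enumerate(members):
            order = orders[0] if j < half else orders[1]  # step 2: counterbalance
            assignment[person] = (group, order)
    return assignment

people = [f"P{i:02d}" for i in range(1, 21)]  # 20 participants
plan = assign_mixed_design(
    people,
    groups=("important", "unimportant"),
    orders=(("success", "failure"), ("failure", "success")),
)
# 10 participants per group; within each group, 5 receive each order
```

With 20 participants, each importance group has 10 members, and every member goes through both the success and failure conditions, which is why only 20 (not 40) participants are needed.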
If you wanted 10 participants in each condition, how many would you need? The answer
is 20. They would be randomly assigned to the importance groups; but then those
groups would go through both the success condition and the failure condition.
If there is an interaction, we end up with more specific and accurate conclusions about
the effect that an IV has (i.e., we learn about limits, we learn that the effect depends on
other things).
If there is not an interaction, we gain evidence for the generality of the effect. The
finding suggests that the effect generalizes across other factors, and thus supports the
“external validity” of the effect.
For example, the original experiment may have shown that listening to the radio while
driving led to increased driving errors. In the original study, the radio content was
music, and now the researcher wonders whether the type of content matters. So they
add type of content (music, talk) as another factor. Notice, then, that half of the study is
exactly the same as the original study, as participants are listening to music, whereas
half of the study is new. So, in addition to testing an interesting possible moderator
(type of radio content), the researcher is taking the important step of testing whether a
previous finding replicates when the study is repeated.
Module 8: REB Review explains the roles and responsibilities of a Research Ethics Board
(REB), as well as the ethics application review and approval process. You are given an option to
complete the module as a researcher (yellow button) or REB member (blue button). Please select
Module 8 Researchers (yellow button). Work through this last module and answer the two
questions below.
Questions:
What is the primary goal of an REB?
- To assess whether the research proposals they review are ethically acceptable according
to TCPS2
- Goal is to represent the interests of participants by assessing the foreseeable risks,
ethical implications, and potential benefits of each proposal they review
What is the general rule regarding REB review of study materials?
- If a participant will read/hear/view it the REB needs to read/hear/view it
Animal research ethics:
The Canadian Council on Animal Care (CCAC) publishes guidelines that promote the
ethical treatment of animals in research. This exercise will teach you about the
conditions under which animals should/should not be used in research, as well as how
animals should/should not be treated while under a researcher’s (or institution’s) care.
1. On the CCAC website homepage, scroll down to the section titled Learn About the
CCAC. This section contains four modules that describe the CCAC guidelines,
certification process, three Rs tenet, and facts and legislation regarding animal data.
Select the Three Rs Tenet module and browse through the information under the
various tabs to answer the five questions below.
Questions:
What are the three Rs to guide researchers on the ethical use of animals?
- Replacement, reduction, and refinement
What are the three main types of ethical concerns related to animal welfare?
- Need to consider whether animals are absolutely required or whether suitable
replacements can be used instead
- Must consider what numbers of animals will ensure valid results, while, at the same
time, maximizing the amount of information obtained per animal
- Must also identify any potential harm to the animals and develop ways to minimize it
What are two ways to reduce the number of animals used in research?
- By reducing the number of animal experiments
- By reducing the number of animals needed in each experiment
Inter-rater reliability is a measure of the consistency among two or more raters who are
observing and scoring the same behaviour(s) using the same instrument. If raters do
not demonstrate agreement between their scores, then we cannot know whether the
construct of interest (i.e., in our case, sociability) has been measured in an objective
manner. When observational measures use coding categories (e.g., presence or
absence of specific categories of behaviour), inter-rater reliability is often expressed as
the percentage of agreement as follows:
Percentage of Agreement = 100 × (number of agreements / (number of agreements +
number of disagreements))
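The formula above translates directly into code (the counts in the example are hypothetical):

```python
def percentage_of_agreement(agreements, disagreements):
    """Inter-rater reliability for categorical coding:
    100 * agreements / (agreements + disagreements)."""
    return 100 * agreements / (agreements + disagreements)

# Hypothetical example: two raters agree on 45 of 50 coded observations
pct = percentage_of_agreement(45, 5)  # 90.0
```

Higher percentages indicate that the raters applied the coding categories consistently.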
Most journals that publish articles in the social and behavioural sciences require
researchers to submit their manuscripts in APA style. Also, most psychology courses
you take at Laurier will require reports to be submitted in APA style. APA guidelines
serve three main purposes: (i) To help researchers write effectively, (ii) To make
research articles uniform, and (iii) To facilitate the conversion of submitted
manuscripts to published journal articles.
3. What does an APA formatted document “look like?”
An APA style document has seven required sections, presented in this order: (i) Title
page, (ii) Abstract, (iii) Introduction, (iv) Method, (v) Results, (vi) Discussion, and
(vii) References.
4. What are the general APA style formatting rules?
The general formatting rules that apply to the entire document include (but are not
limited to) the following:
o Page numbers appear at the top right-hand corner of each page.
o Running head appears at the top left-hand corner of each page.
o All margins are 1 inch.
o Headings are used to organize sections (five different levels of headings).
o Lines of text are double-spaced with no extra spaces between paragraphs
and sections.
o The typeface is 12pt Times New Roman (black).
o Two spaces follow a period at the end of a sentence, and one space
follows other punctuation.