CRICOS Provider No. 00300K (NT/VIC) | 03286A (NSW) | RTO Provider No. 0373

PSY417: Research Methods and Practice

Week 2 – Data screening, assumptions and non-parametric models


Faculty Of Health
Dr. Rebecca Williams
SEM1 2023
Readings

• Field (2017) chapters 6 & 7

• Nimon (2012)
• This is a recommended reading: it will not be examinable, but it might help you with your learning

2
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Good data analysis habits
• Missing Values Analysis
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis: Data screening
3
Data screening: The first step

• First check your data in Excel
• Look for issues and fix them before moving to SPSS
• What issues can you see here?

4
Data screening: Checklist
• Save the original version of the document, plus subsequent versions
• Make sure all strings (i.e., words) are consistent
• Remove any formatting such as full stops
• Then replace strings with dummy codes
• All numbers should be in the same format
• Numeric ("Number" format, not scientific notation) with 2 decimal places
• Remove any commas, and replace any empty cells with discrete, improbable values (to flag them as missing)
5
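Much of the checklist above can be sketched outside SPSS before importing. A minimal illustration in Python/pandas, using hypothetical column names and made-up values that mimic common spreadsheet problems:

```python
import pandas as pd

# Hypothetical raw data with typical spreadsheet issues
df = pd.DataFrame({
    "gender": ["Male", "male.", "Female", "FEMALE"],
    "income": ["42,000", "55000", "", "1.2E4"],
})

# Make strings consistent: strip formatting such as full stops, unify case
df["gender"] = df["gender"].str.rstrip(".").str.strip().str.title()

# Replace strings with dummy codes
df["gender_code"] = df["gender"].map({"Male": 1, "Female": 2})

# Put all numbers in the same format: remove commas, coerce to numeric
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""), errors="coerce")

# Flag empty cells with a discrete, improbable value
df["income"] = df["income"].fillna(-999)
```

The cleaned frame can then be exported with `df.to_csv(...)` and opened in SPSS.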
Good data analysis habits
1. Keep a data analysis log (in a Word document, for example)

Example:

20230312
Started session 0935
Opened dataset_version1
Started screening dataset for abnormalities. Found some open-ended
responses were difficult to code (some not genuine responses).
Created a coding variable to filter them out (OE_Type) and coded 1 for
'normal', 2 for 'rounded', 3 for 'ineligible' and 4 for 'ridiculous'.
Saved new file (dataset_version2)
Ended session 0946
6
Good data analysis habits
2. Keep your data versions organized and easy to follow

Example:

dataAnalysis_20220301.csv
dataAnalysis_20220302.csv
dataAnalysis_20220303.csv
dataAnalysis_20220304.csv
7
Good data analysis habits
3. Back up every day to two different locations, including
a cloud (e.g. Dropbox, OneDrive).
• Make sure to ONLY copy files from your main folder to the
backup locations: don’t get confused and open backed up
files to work on.
• Use programs such as ‘rsync’ to copy files easily

4. Check out this link: Manage Data During Research
8
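The one-way copy in point 3 can also be scripted. A small sketch using Python's standard library (the folder paths are hypothetical; a tool such as rsync does the same job from the command line):

```python
import shutil
from pathlib import Path

def backup(main_folder: str, backup_folder: str) -> None:
    """Copy files one way only: from the main folder to the backup location."""
    src = Path(main_folder)
    dst = Path(backup_folder)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # never copy in the other direction
```

Running this daily against two backup locations implements point 3 without ever opening the backed-up files.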
Missing Values Analysis
• This is a good initial data check in SPSS

• It will determine whether there is a pattern to the missing data, or whether the missing data are random.
• Systematic missing values will strongly bias the results

• The test is called Little's MCAR (Missing Completely At Random) test
• Null hypothesis: missing data are missing completely at random
• p < 0.05 means the missing data are not missing completely at random
• It is good to retain the null hypothesis

• Imputation can be performed to fill in missing data (although this should be done sparingly)
9
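Before (or alongside) Little's MCAR test in SPSS, the amount of missingness per variable is easy to tabulate. A sketch with hypothetical data in Python/pandas (this counts missingness per variable but does not test whether it is random):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with some missing values (np.nan)
df = pd.DataFrame({
    "age":   [21, 25, np.nan, 30, 22],
    "score": [3.5, np.nan, 2.0, np.nan, 4.0],
})

n_missing = df.isna().sum()           # count of missing values per variable
pct_missing = df.isna().mean() * 100  # percentage missing per variable
```

These per-variable counts are exactly what the reporting guidelines on the next slide ask for.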
Guidelines for reporting analyses affected by
missingness
• Report the number of missing values for each variable
• Give specific reasons (if possible), and detail any excluded cases
• Report any important differences between individuals with complete
and incomplete data
e.g., categorical variables, screening variables
• Describe the analyses used to account for missingness, and the
assumptions that were made
e.g., outcome of Little’s MCAR test; adoption of imputation
method
10
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
11
Assumptions
• Every statistical test has some assumptions about your data

• These can be thought of as conditions that must be met in order to run the test

• If they are not met, no 'errors' will be output, but your results will be meaningless
12
Outliers: Not really an assumption, but can
really mess up your analysis
• Values that are very different
from all others can bias means
and increase the sum of
squared errors (and therefore,
the standard deviation and CI)

• Can be spotted visually in histograms and boxplots
13
Field (2017): Page 240
Identifying outliers quantitatively
• Boxplots are a good way to determine whether there are statistical outliers

• In SPSS, outliers are:
• Scores > the upper quartile + (1.5 x IQR), or
• Scores < the lower quartile – (1.5 x IQR)

• Reminder: IQR = 75th percentile – 25th percentile
Field (2017): Page 194
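The boxplot rule above is easy to reproduce by hand. A sketch with made-up scores (note that SPSS boxplots use Tukey's hinges, which can differ slightly from the percentile method used here):

```python
import numpy as np

scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1  # IQR = 75th percentile - 25th percentile

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]  # the extreme score, 45
```

Any score outside the [lower, upper] fences is flagged, matching the 1.5 × IQR rule on the slide.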
Notes on outliers
• Always make a note of any outliers you find and remove

• Always save an 'original' and an 'outliers removed' version of your data

• Report:
• How you determined outliers in your methods section
• How many outliers you removed and from which variables in
your results section
15
The Assumption of Normality
• Important for
1. Using ordinary least squares (OLS) to estimate
our study parameters
2. NHST: The shape of the distribution of the test
statistic (e.g. t, F)

• Strictly speaking, it is the residuals (i.e., the error) that need to be normally distributed, not the data themselves
• But we infer normality of residuals by looking at the distribution of the dependent variable data
16
What does Central Limit Theorem (CLT) have
to do with this?
• CLT states that as sample size gets larger, the sampling distribution becomes more
normally distributed...

• In other words, we don’t need to worry about the assumption of normality when
the sample size is large (>50).
17
Field (2017): Page 234
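The CLT is easy to see by simulation. A sketch drawing samples of n = 50 from a strongly skewed (exponential) population; the sample means come out far closer to symmetric than the raw data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly positively skewed population
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for samples of n = 50
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2_000)
])

def skewness(x):
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

pop_skew = skewness(population)      # clearly positive
means_skew = skewness(sample_means)  # much closer to 0
```

This is why, with a large enough sample, the sampling distribution can be treated as approximately normal even when the raw data are not.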
Testing for normality
1. P-P (probability-probability) plots convert all scores to 'expected' z-scores (if the data were normally distributed) and compare these to the actual z-scores

18
Field (2017): Page 245
Testing for normality
2. Skewness (i.e., symmetry) and kurtosis (i.e., tail-heaviness) indicate the shape of the distribution.

[Figure: example distributions illustrating skewness and kurtosis]
19
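Skewness and kurtosis statistics are available in SPSS's Explore dialog; for reference, a quick sketch of the same quantities in Python with simulated positively skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed_data = rng.exponential(size=500)

skew_stat = stats.skew(skewed_data)      # > 0 indicates positive skew
kurt_stat = stats.kurtosis(skewed_data)  # excess kurtosis: 0 for a normal distribution
```

Values near 0 on both statistics are consistent with a roughly normal shape; large positive or negative values are not.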
Testing for normality
3. The Kolmogorov-Smirnov test and the Shapiro-Wilk test compare the data to a normal distribution.

SPSS: Analyze > Descriptive Statistics > Explore (see figure to the right)

A significant test (p < 0.05) = data are significantly different from a normal distribution (i.e., data are not normally distributed).

** These should only be used on small sample sizes
20
Field (2017): Page 251
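The same tests exist outside SPSS. A sketch of the Shapiro-Wilk test using scipy with simulated data (as noted above, these tests are best reserved for small samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_data = rng.normal(size=40)       # drawn from a normal distribution
skewed_data = rng.exponential(size=40)  # drawn from a skewed distribution

w_norm, p_norm = stats.shapiro(normal_data)  # expect p > .05: retain normality
w_skew, p_skew = stats.shapiro(skewed_data)  # expect p < .05: reject normality
```

A significant result (p < 0.05) means the data differ significantly from a normal distribution.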
The Assumption of Linearity
• This means that the outcome variable can be created by adding the predictor variables:

outcome = (b0 + b1X1 + b2X2) + error

• If you have a nonlinear relationship between the outcome and predictor variables, then the model will be inaccurate
21
Example of linear and nonlinear relationships

[Figure: two plots of success on task against amount of caffeine consumed; one relationship is linear, the other nonlinear]
22
The Assumption of Homogeneity of Variance
• AKA homoscedasticity

• This means that, on average, the (squared) distance between a score and its mean is the same across groups
• In other words, equal variances

• This is very important for confidence intervals and NHST
23
Field (2017): Page 238
Testing for Homogeneity of Variance
• Levene's Test
• Tests the null hypothesis that variance is equal between groups

• If Levene's Test is significant, we can conclude that variances are not equal
• We want to retain the null hypothesis to avoid violating the assumption
• However, it only matters if the group sizes being compared are unequal
24
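Levene's test appears in SPSS's compare-means dialogs; for reference, a sketch of the same test in scipy with two simulated groups whose variances clearly differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=0, scale=1, size=30)
group_b = rng.normal(loc=0, scale=5, size=30)  # much larger variance

stat, p = stats.levene(group_a, group_b)
# A significant result (p < .05) means the equal-variance assumption is violated
```

With these groups the test should come out significant, flagging unequal variances.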
Menti recap quiz
• Menti is a fun, anonymous and non-graded online quiz that we do together
as a group

• Go to this website:
menti.com
And enter this code:

1613 0858
25
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
26
Transformations
• If normality and/or linearity are issues, you can transform your data
• This is not an issue if you apply the same transformation to every datapoint
• If you're running correlations (relationships between variables), you can transform just the problematic variable
• For comparisons between variables (e.g., t-test), all tested variables must be transformed

• The most common transformation is the log transform (log(Xi))
27
Log transform
• Good for positively skewed data (as well as unequal variances, nonlinearity, positive kurtosis)
• But might not work in all instances

• The SPSS log transform instructions (page 273) use the log transformation to base 10
• AKA the common logarithm
• If 10^3 = 10 x 10 x 10 = 1000, then log10(1000) = 3
• The "10" is often omitted, and this would be written as "log(1000)"
• There is no log of 0
28
statistics.laerd.com
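The effect of the log transform on positive skew can be checked directly. A sketch with made-up positively skewed scores (all values must be positive, since log10 is undefined at 0):

```python
import numpy as np
from scipy import stats

skewed = np.array([1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5, 22.0, 105.0])

# Apply the same transformation to every datapoint
logged = np.log10(skewed)

before = stats.skew(skewed)
after = stats.skew(logged)  # skewness is reduced after transforming
```

Comparing `before` and `after` shows the skew shrinking, though (as the slide notes) the transform will not fix every dataset.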
Other transformations
• There are other options for transforming your data
• See the table to the right

• Transformation is not the best way to deal with assumption violations

• Try bootstrapping if you can: this is the preferred method for dealing with violations of assumptions
29
Field (2017): Page 269
Bootstrapping
• When we have small sample sizes, normality is an issue because we don't know the shape of the sampling distribution

• The bootstrap makes no assumptions about the shape of the sampling distribution
• Rather, it builds its own sampling distribution by resampling our data
30
Castro (2021). A beginner’s guide to the Bootstrap. Berkeley D-lab: dlab.berkeley.edu
How Bootstrapping works
• Our dataset is resampled many (i.e., thousands of) times by drawing random samples from it

• Each datapoint can be selected more than once for a resampled dataset: this is what 'with replacement' refers to

• The resampled datasets have the same n as our original dataset

• We can now use the bootstrapped distribution to estimate the CI and standard error
31
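The resampling procedure above can be sketched in a few lines (made-up scores; 5,000 resamples):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 6.2, 4.4])

n_boot = 5_000
boot_means = np.array([
    # resample with replacement, same n as the original dataset
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

# Use the bootstrapped distribution to estimate the SE and 95% CI
se = boot_means.std()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

No assumption about the shape of the sampling distribution was needed: the distribution of `boot_means` stands in for it.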
32
Field (2017): Page 266
Bootstrapping can be performed in many SPSS analyses

1. Go to the Bootstrap option
2. Check 'Perform bootstrapping'
3. This allows for replication
4. This produces more robust results
33
Non-parametric tests
• When bootstrapping can't be applied, non-parametric tests might be useful

• These can have less power than their parametric counterparts, but only if the data do not violate the test assumptions
• In which case, you wouldn't use non-parametric tests anyway

• Non-parametric tests assess the median rather than the mean

• Good for data with outliers
• Good for small sample sizes (and non-normal data)
• Good for ordinal data
34
Some parametric tests have a non-
parametric equivalent
Parametric test                          Non-parametric test
Independent samples t-test               Wilcoxon rank-sum test / Mann-Whitney test
Repeated measures t-test                 Wilcoxon signed-rank test; McNemar's test (for nominal data)
One-way ANOVA (independent groups)       Kruskal-Wallis test (* can also look at the Jonckheere-Terpstra test)
One-way ANOVA (repeated measures)        Friedman's test
Correlation                              Spearman rank-order correlation
35
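As an example from the table above, a sketch of the Mann-Whitney test (the non-parametric counterpart of the independent samples t-test) on two small made-up groups of ordinal ratings:

```python
from scipy import stats

group_a = [3, 5, 4, 6, 7, 5, 4]    # e.g., ratings from condition A
group_b = [8, 9, 7, 10, 9, 8, 11]  # e.g., ratings from condition B

u, p = stats.mannwhitneyu(group_a, group_b)
# p < .05 would indicate the two groups differ
```

Because the test works on ranks, it is robust to outliers and suitable for ordinal data, as the slide describes.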
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
36
Data Analysis Time: Excel and SPSS

37
Today’s dataset: Speed dating
This is a reduced dataset from Fisman et al. (2006). Gender differences in mate selection: Evidence
from a speed dating experiment. The Quarterly Journal of Economics.

There are two files you need to download from Learnline (Learning materials > Lecture 2) to start:
1. SpeedDatingData.csv
2. SpeedDatingDataKey_variableKeys.doc

Open the .csv file in Excel. The Word document outlines what all the columns indicate.

Step-by-step instructions can be found:
1. In the PDF in Learning materials > Lecture 2 > Lecture02_Workshop.pdf, or
2. On PebblePad by following this link (also available on Learnline):
https://v3.pebblepad.com.au/spa/#/public/q4yHmmfrnfjwxW9h93czpfqmHZ
38
