CRICOS Provider No. 00300K (NT/VIC) | 03286A (NSW) | RTO Provider No. 0373

PSY417: Research Methods and Practice

Week 2 – Data screening, assumptions and non-parametric models


Faculty Of Health
Dr. Rebecca Williams
SEM1 2023
Readings

• Field (2017) chapters 6 & 7

• Nimon (2012)
• This is a recommended reading: it will not be examinable, but it might help you with your learning

2
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Good data analysis habits
• Missing Values Analysis
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis: Data screening
3
Data screening: The first step

• First check your data in Excel
• Look for issues and fix them before moving to SPSS
• What issues can you see here?

4
Data screening: Checklist
• Save the original version of the document, plus subsequent versions
• Make sure all strings (i.e., words) are consistent
• Remove any formatting such as full stops
• Then replace strings with dummy codes
• All numbers should be in the same format
• Numeric ("Number" format, not scientific notation) with 2 decimal places
• Remove any commas, and replace any empty cells with discrete, improbable values (to flag them as missing)
5
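Much of the checklist above can be sketched outside SPSS before importing. A minimal illustration in Python/pandas, using hypothetical column names and made-up values that mimic common spreadsheet problems:

```python
import pandas as pd

# Hypothetical raw data with typical spreadsheet issues
df = pd.DataFrame({
    "gender": ["Male", "male.", "Female", "FEMALE"],
    "income": ["42,000", "55000", "", "1.2E4"],
})

# Make strings consistent: strip formatting such as full stops, unify case
df["gender"] = df["gender"].str.rstrip(".").str.strip().str.title()

# Replace strings with dummy codes
df["gender_code"] = df["gender"].map({"Male": 1, "Female": 2})

# Put all numbers in the same format: remove commas, coerce to numeric
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""), errors="coerce")

# Flag empty cells with a discrete, improbable value
df["income"] = df["income"].fillna(-999)
```

The cleaned frame can then be exported with `df.to_csv(...)` and opened in SPSS.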
Good data analysis habits
1. Keep a data analysis log (in a Word document, for example)

Example:

20230312
Started session 0935
Opened dataset_version1
Started screening dataset for abnormalities. Found some open-ended
responses were difficult to code (some not genuine responses).
Created a coding variable to filter them out (OE_Type) and coded 1 for
'normal', 2 for 'rounded', 3 for 'ineligible' and 4 for 'ridiculous'.
Saved new file (dataset_version2)
Ended session 0946
6
Good data analysis habits
2. Keep your data versions organized and easy to follow

Example:

dataAnalysis_20220301.csv
dataAnalysis_20220302.csv
dataAnalysis_20220303.csv
dataAnalysis_20220304.csv
7
Good data analysis habits
3. Back up every day to two different locations, including
a cloud (e.g. Dropbox, OneDrive).
• Make sure to ONLY copy files from your main folder to the
backup locations: don’t get confused and open backed up
files to work on.
• Use programs such as ‘rsync’ to copy files easily

4. Check out this link: Manage Data During Research
8
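The one-way copy in point 3 can also be scripted. A small sketch using Python's standard library (the folder paths are hypothetical; a tool such as rsync does the same job from the command line):

```python
import shutil
from pathlib import Path

def backup(main_folder: str, backup_folder: str) -> None:
    """Copy files one way only: from the main folder to the backup location."""
    src = Path(main_folder)
    dst = Path(backup_folder)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # never copy in the other direction
```

Running this daily against two backup locations implements point 3 without ever opening the backed-up files.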
Missing Values Analysis
• This is a good initial data check in SPSS

• It will determine whether there is a pattern to the missing data, or whether the missing data are random.
• Systematic missing values will strongly bias the results

• The test is called Little's MCAR (Missing Completely At Random) test
• Null hypothesis: missing data are missing completely at random
• p < 0.05 means the missing data are not missing completely at random
• It is good to retain the null hypothesis

• Imputation can be performed to fill in missing data (although this should be done sparingly)
9
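Before (or alongside) Little's MCAR test in SPSS, the amount of missingness per variable is easy to tabulate. A sketch with hypothetical data in Python/pandas (this counts missingness per variable but does not test whether it is random):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with some missing values (np.nan)
df = pd.DataFrame({
    "age":   [21, 25, np.nan, 30, 22],
    "score": [3.5, np.nan, 2.0, np.nan, 4.0],
})

n_missing = df.isna().sum()           # count of missing values per variable
pct_missing = df.isna().mean() * 100  # percentage missing per variable
```

These per-variable counts are exactly what the reporting guidelines on the next slide ask for.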
Guidelines for reporting analyses affected by
missingness
• Report the number of missing values for each variable
• Give specific reasons (if possible), and detail any excluded cases
• Report any important differences between individuals with complete
and incomplete data
e.g., categorical variables, screening variables
• Describe the analyses used to account for missingness, and the
assumptions that were made
e.g., outcome of Little’s MCAR test; adoption of imputation
method
10
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
11
Assumptions
• Every statistical test has some assumptions about your data

• These can be thought of as conditions that must be met in order to run the test

• If they are not met, no 'errors' will be output, but your results will be meaningless
12
Outliers: Not really an assumption, but can
really mess up your analysis
• Values that are very different
from all others can bias means
and increase the sum of
squared errors (and therefore,
the standard deviation and CI)

• Can be spotted visually in histograms and boxplots
13
Field (2017): Page 240
Identifying outliers quantitatively
• Boxplots are a good way to determine whether there are statistical outliers

• In SPSS, outliers are:
• Scores > the upper quartile + (1.5 x IQR), or
• Scores < the lower quartile – (1.5 x IQR)

• Reminder: IQR = 75th percentile – 25th percentile
Field (2017): Page 194
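The boxplot rule above is easy to reproduce by hand. A sketch with made-up scores (note that SPSS boxplots use Tukey's hinges, which can differ slightly from the percentile method used here):

```python
import numpy as np

scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1  # IQR = 75th percentile - 25th percentile

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]  # the extreme score, 45
```

Any score outside the [lower, upper] fences is flagged, matching the 1.5 × IQR rule on the slide.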
Notes on outliers
• Always make a note of any outliers you find and remove

• Always save an 'original' and an 'outliers removed' version of your data

• Report:
• How you determined outliers in your methods section
• How many outliers you removed and from which variables in
your results section
15
The Assumption of Normality
• Important for
1. Using ordinary least squares (OLS) to estimate
our study parameters
2. NHST: The shape of the distribution of the test
statistic (e.g. t, F)

• Strictly speaking, it is the residuals (i.e., the error) that need to be normally distributed, not the data themselves
• But we infer normality of residuals by looking at the distribution of the dependent variable data
16
What does Central Limit Theorem (CLT) have
to do with this?
• CLT states that as sample size gets larger, the sampling distribution becomes more
normally distributed...

• In other words, we don’t need to worry about the assumption of normality when
the sample size is large (>50).
17
Field (2017): Page 234
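The CLT is easy to see by simulation. A sketch drawing samples of n = 50 from a strongly skewed (exponential) population; the sample means come out far closer to symmetric than the raw data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly positively skewed population
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for samples of n = 50
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2_000)
])

def skewness(x):
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

pop_skew = skewness(population)      # clearly positive
means_skew = skewness(sample_means)  # much closer to 0
```

This is why, with a large enough sample, the sampling distribution can be treated as approximately normal even when the raw data are not.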
Testing for normality
1. P-P (probability-probability) plots convert all scores to 'expected' z-scores (if the data were normally distributed) and compare these to the actual z-scores

18
Field (2017): Page 245
Testing for normality
2. Skewness (i.e., symmetry) and kurtosis (i.e., tail-heaviness) indicate the shape of the distribution.

[Figure: example distributions illustrating skewness and kurtosis]
19
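Skewness and kurtosis statistics are available in SPSS's Explore dialog; for reference, a quick sketch of the same quantities in Python with simulated positively skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed_data = rng.exponential(size=500)

skew_stat = stats.skew(skewed_data)      # > 0 indicates positive skew
kurt_stat = stats.kurtosis(skewed_data)  # excess kurtosis: 0 for a normal distribution
```

Values near 0 on both statistics are consistent with a roughly normal shape; large positive or negative values are not.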
Testing for normality
3. The Kolmogorov-Smirnov test and the Shapiro-Wilk test compare the data to a normal distribution.

SPSS: Analyze > Descriptive Statistics > Explore (see figure to the right)

A significant test (p < 0.05) = data are significantly different from a normal distribution (i.e., data are not normally distributed).

** These should only be used on small sample sizes
20
Field (2017): Page 251
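The same tests exist outside SPSS. A sketch of the Shapiro-Wilk test using scipy with simulated data (as noted above, these tests are best reserved for small samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_data = rng.normal(size=40)       # drawn from a normal distribution
skewed_data = rng.exponential(size=40)  # drawn from a skewed distribution

w_norm, p_norm = stats.shapiro(normal_data)  # expect p > .05: retain normality
w_skew, p_skew = stats.shapiro(skewed_data)  # expect p < .05: reject normality
```

A significant result (p < 0.05) means the data differ significantly from a normal distribution.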
The Assumption of Linearity
• This means that the outcome variable can be created by adding the predictor variables:

outcome = (b0 + b1X1 + b2X2) + error

• If you have a nonlinear relationship between the outcome and predictor variables, then the model will be inaccurate
21
Example of linear and nonlinear relationships

[Figure: two plots of success on task against amount of caffeine consumed; one relationship is linear, the other nonlinear]
22
The Assumption of Homogeneity of Variance
• AKA homoscedasticity

• This means that, on average, the (squared) distance between a score and its mean is the same across groups
• In other words, equal variances

• This is very important for confidence intervals and NHST
23
Field (2017): Page 238
Testing for Homogeneity of Variance
• Levene's Test
• Tests the null hypothesis that variance is equal between groups

• If Levene's Test is significant, we can conclude that variances are not equal
• We want to retain the null hypothesis to avoid violating the assumption
• However, it only matters if the group sizes being compared are unequal
24
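Levene's test appears in SPSS's compare-means dialogs; for reference, a sketch of the same test in scipy with two simulated groups whose variances clearly differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=0, scale=1, size=30)
group_b = rng.normal(loc=0, scale=5, size=30)  # much larger variance

stat, p = stats.levene(group_a, group_b)
# A significant result (p < .05) means the equal-variance assumption is violated
```

With these groups the test should come out significant, flagging unequal variances.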
Menti recap quiz
• Menti is a fun, anonymous and non-graded online quiz that we do together
as a group

• Go to this website:
menti.com
And enter this code:

1613 0858
25
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
26
Transformations
• If normality and/or linearity are issues, you can transform your data
• This is not an issue if you apply the same transformation to every datapoint
• If you're running correlations (relationships between variables), you can transform just the problematic variable
• For comparisons between variables (e.g., t-test), all tested variables must be transformed

• The most common transformation is the log transform (log(Xi))
27
Log transform
• Good for positively skewed data (as well as unequal variances, nonlinearity, positive kurtosis)
• But might not work in all instances

• The SPSS log transform instructions (page 273) use the log transformation to base 10
• AKA the common logarithm
• If 10^3 = 10 x 10 x 10 = 1000, then log10(1000) = 3
• The "10" is often omitted, and this would be written as "log(1000)"
• There is no log of 0
28
statistics.laerd.com
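The effect of the log transform on positive skew can be checked directly. A sketch with made-up positively skewed scores (all values must be positive, since log10 is undefined at 0):

```python
import numpy as np
from scipy import stats

skewed = np.array([1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5, 22.0, 105.0])

# Apply the same transformation to every datapoint
logged = np.log10(skewed)

before = stats.skew(skewed)
after = stats.skew(logged)  # skewness is reduced after transforming
```

Comparing `before` and `after` shows the skew shrinking, though (as the slide notes) the transform will not fix every dataset.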
Other transformations
• There are other options for transforming your data
• See the table to the right

• Transformation is not the best way to deal with assumption violations

• Try bootstrapping if you can: this is the preferred method for dealing with violations of assumptions
29
Field (2017): Page 269
Bootstrapping
• When we have small sample sizes, normality is an issue because we don't know the shape of the sampling distribution

• The bootstrap makes no assumptions about the shape of the sampling distribution
• Rather, it builds its own sampling distribution by resampling our data
30
Castro (2021). A beginner’s guide to the Bootstrap. Berkeley D-lab: dlab.berkeley.edu
How Bootstrapping works
• Our dataset is resampled many (i.e., thousands of) times by drawing random samples from it

• Each datapoint can be selected more than once for a resampled dataset: this is what 'with replacement' refers to

• The resampled datasets have the same n as our original dataset

• We can now use the bootstrapped distribution to estimate the CI and standard error
31
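The resampling procedure above can be sketched in a few lines (made-up scores; 5,000 resamples):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 6.2, 4.4])

n_boot = 5_000
boot_means = np.array([
    # resample with replacement, same n as the original dataset
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

# Use the bootstrapped distribution to estimate the SE and 95% CI
se = boot_means.std()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

No assumption about the shape of the sampling distribution was needed: the distribution of `boot_means` stands in for it.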
32
Field (2017): Page 266
Bootstrapping can be performed in many SPSS analyses

1. Go to the Bootstrap option
2. Check 'Perform bootstrapping'
3. This allows for replication
4. This produces more robust results
33
Non-parametric tests
• When bootstrapping can't be applied, non-parametric tests might be useful

• These can have less power than their parametric counterparts, but only if the data do not violate the test assumptions
• In which case, you wouldn't use non-parametric tests anyway

• Non-parametric tests assess the median rather than the mean

• Good for data with outliers
• Good for small sample sizes (and non-normal data)
• Good for ordinal data
34
Some parametric tests have a non-
parametric equivalent
Parametric test                          Non-parametric test
Independent samples t-test               Wilcoxon rank-sum test / Mann-Whitney test
Repeated measures t-test                 Wilcoxon signed-rank test; McNemar's test (for nominal data)
One-way ANOVA (independent groups)       Kruskal-Wallis test (* can also look at the Jonckheere-Terpstra test)
One-way ANOVA (repeated measures)        Friedman's test
Correlation                              Spearman rank-order correlation
35
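As an example from the table above, a sketch of the Mann-Whitney test (the non-parametric counterpart of the independent samples t-test) on two small made-up groups of ordinal ratings:

```python
from scipy import stats

group_a = [3, 5, 4, 6, 7, 5, 4]    # e.g., ratings from condition A
group_b = [8, 9, 7, 10, 9, 8, 11]  # e.g., ratings from condition B

u, p = stats.mannwhitneyu(group_a, group_b)
# p < .05 would indicate the two groups differ
```

Because the test works on ranks, it is robust to outliers and suitable for ordinal data, as the slide describes.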
Today’s Lecture Outline
• Data screening
• Identifying problems in datasets
• Missing Values Analysis
• Good data analysis habits
• Assumption testing
• Identifying outliers
• Normality
• Linearity
• Homogeneity of variance
• Dealing with Bias (AKA violations of assumptions)
• Transformations
• Bootstrapping
• Non-parametric tests
• Data analysis
36
Data Analysis Time: Excel and SPSS

37
Today’s dataset: Speed dating
This is a reduced dataset from Fisman et al. (2006). Gender differences in mate selection: Evidence
from a speed dating experiment. The Quarterly Journal of Economics.

There are two files you need to download from Learnline (Learning materials > Lecture 2) to start:
1. SpeedDatingData.csv
2. SpeedDatingDataKey_variableKeys.doc

Open the .csv file in Excel. The Word document outlines what all the columns indicate.

Step-by-step instructions can be found:
1. In the PDF in Learning materials > Lecture 2 > Lecture02_Workshop.pdf, or
2. On PebblePad by following this link (also available on Learnline):
https://v3.pebblepad.com.au/spa/#/public/q4yHmmfrnfjwxW9h93czpfqmHZ
38
