Module 1:
Data Analysis using SPSS
TRAINER: WAQAR AKBAR
ORGANIZER: FACULTY OF MANAGEMENT SCIENCE
SPSS – What Is It?
SPSS stands for “Statistical Package for the Social Sciences” and
was first launched in 1968.
Since SPSS was acquired by IBM in 2009, it has officially been known
as IBM SPSS Statistics, but most users still just refer to it as
“SPSS”.
SPSS – Overview: Main Features
SPSS is software for editing and analyzing all sorts of data.
These data may come from basically any source: scientific
research, a customer database, Google Analytics or even
the server log files of a website. SPSS can open all file
formats that are commonly used for structured data, such as:
◦ spreadsheets from MS Excel or OpenOffice;
◦ plain text files (.txt or .csv);
◦ relational (SQL) databases;
◦ Stata and SAS files.
SPSS Data View
◦ This sheet, called Data View, always displays our data
values.
SPSS Variable View
◦ An SPSS data file always has a second sheet called
variable view. It shows the metadata associated with the
data. Metadata is information about the meaning of
variables and data values. This is generally known as the
“codebook” but in SPSS it's called the dictionary.
SPSS Output Window
◦ After clicking OK, a new window opens up: SPSS’ Output
Viewer window. It holds a table with the requested statistics
for all the variables we chose.
Preparing a codebook
Preparing a codebook involves deciding (and documenting)
how you will go about:
◦ Defining and labelling each of the variables
◦ Assigning a number to each of the possible responses
Continuous Variable
◦ Descriptive Statistics
Missing Data
◦ Exclude cases listwise
◦ Exclude cases pairwise
◦ Replace with mean
Descriptive Statistics: Missing Values
The Exclude cases listwise option will include cases in the analysis only if
they have full data on all of the variables listed in your Variables box for that
case. A case will be totally excluded from all the analyses if it is missing even
one piece of information. This can severely, and unnecessarily, limit your
sample size.
The Exclude cases pairwise option, however, excludes the case (person) only
if they are missing the data required for the specific analysis. They will still be
included in any of the analyses for which they have the necessary
information.
The Replace with mean option, which is available in some SPSS statistical
procedures (e.g. multiple regression), calculates the mean value for the
variable and gives every missing case this value. This option should never be
used, as it can severely distort the results of your analysis, particularly if you
have a lot of missing values.
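The difference between the listwise and pairwise options can be sketched outside SPSS. The following Python fragment uses a tiny made-up dataset; the variable names and values are hypothetical, not taken from any file used in this module:

```python
# Sketch of listwise vs pairwise deletion on a tiny, made-up dataset.
# None marks a missing value, as SPSS would treat a system-missing cell.
rows = [
    {"stress": 20, "mastery": 22, "age": 34},
    {"stress": 18, "mastery": None, "age": 29},   # missing mastery
    {"stress": 25, "mastery": 19, "age": None},   # missing age
]
variables = ["stress", "mastery", "age"]

# Listwise: keep only cases with full data on every listed variable.
listwise = [r for r in rows if all(r[v] is not None for v in variables)]

# Pairwise: each analysis keeps the cases complete for *its* variables only.
def pairwise(rows, vars_needed):
    return [r for r in rows if all(r[v] is not None for v in vars_needed)]

print(len(listwise))                               # 1 case survives listwise
print(len(pairwise(rows, ["stress", "mastery"])))  # 2 cases for this pair
```

Note how listwise deletion drops two of the three cases even though each analysis pair is missing only one value each, which is exactly how it can unnecessarily shrink your sample.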
Descriptive Statistics
Assessing Normality
◦ Descriptive
◦ 5% trimmed Mean
◦ Extreme Values
◦ Skewness and Kurtosis
◦ Test of Normality
◦ K-S test & Shapiro-Wilk
◦ This assesses the normality of the distribution of scores,
◦ A non-significant value (more than 0.05) indicates normality
◦ Histogram
◦ Normal Q-Q plot (Probability plot)
◦ Detrended Normal Q-Q plots
◦ Obtained by plotting the actual deviation of the scores from the straight
line.
◦ There should be no clustering of points, with most collecting around the
zero line
◦ Outliers
◦ Histogram
◦ Boxplot
◦ Rectangle represents 50% of the cases, with the whiskers (the
lines protruding from the box) going out to the smallest and
largest values.
◦ The line inside the rectangle is the median value.
◦ Outliers appear as little circles with a number attached; they
extend more than 1.5 box-lengths from the edge of the box.
◦ Extreme points, indicated with an asterisk, are those that extend
more than three box-lengths from the edge of the box.
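The box-length rules above come down to the interquartile range (IQR). A minimal Python sketch using made-up scores; SPSS can compute quartiles slightly differently depending on settings, so treat the exact cut-offs as illustrative:

```python
import statistics

# Sketch of the boxplot outlier rule described above, on made-up scores.
# The "box" spans Q1..Q3; points beyond 1.5 box-lengths (IQR) from the box
# are outliers, and points beyond 3 box-lengths are extreme points.
scores = [21, 23, 24, 25, 26, 27, 28, 29, 30, 62]

q1, _, q3 = statistics.quantiles(scores, n=4)   # quartiles
iqr = q3 - q1                                   # one "box-length"

outliers = [x for x in scores if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
extremes = [x for x in scores if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
print(outliers, extremes)   # 62 falls beyond both fences here
```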
Descriptive Statistics
Exercise: data file Staffsurvey.sav
Cronbach’s Alpha
◦ Values above .7 are considered acceptable; however, values above .8 are preferable.
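Cronbach’s alpha is straightforward to compute from the item variances and the variance of the total score. A minimal Python sketch with made-up item data; SPSS reports the same quantity in its Reliability Analysis output:

```python
import statistics

# Minimal Cronbach's alpha: k/(k-1) * (1 - sum of item variances
# divided by the variance of the total score). Data are illustrative.
def cronbach_alpha(items):
    """items: one inner list of scores per item, same cases in each."""
    k = len(items)
    totals = [sum(case) for case in zip(*items)]        # total per person
    item_var = sum(statistics.variance(i) for i in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Three perfectly consistent items give alpha = 1.0.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))
```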
Interpretation
Step 1: Checking the information about the sample
Step 2: Determining the direction of the relationship
Step 3: Determining the strength of the relationship
small r=.10 to .29; medium r=.30 to .49; large r=.50 to 1.0
Step 4: Calculating the coefficient of determination
Step 5: Assessing the significance level
PRESENTING THE RESULTS
FROM CORRELATION
The relationship between perceived control of internal states (as
measured by the PCOISS) and perceived stress (as measured by the
Perceived Stress Scale) was investigated using Pearson product-moment
correlation coefficient. Preliminary analyses were performed to ensure
no violation of the assumptions of normality, linearity and
homoscedasticity. There was a strong, negative correlation between the
two variables, r = –.58, n = 426, p < .001, with high levels of perceived
control associated with lower levels of perceived stress.
Correlation is often used to explore the relationship among a group of
variables, rather than just two as described above. In this case, it would be
awkward to report all the individual correlation coefficients in a paragraph; it
would be better to present them in a table. One way this could be done is as
follows:
COMPARING THE CORRELATION
COEFFICIENTS FOR TWO GROUPS
Sometimes when doing correlational research you may want to
compare the strength of the correlation coefficients for two separate
groups.
◦ For example, you may want to look at the relationship between optimism
and negative affect for males and females separately.
◦ Follow procedure handout
Important:
Remember, when you have finished looking at males and females
separately you will need to turn the Split File option off. It stays in place
until you specifically turn it off.
◦ To do this, click on Data, Split File and select the first button: Analyze all
cases, do not create groups.
Exercise
Data file: sleep.sav.
1. Check the strength of the correlation between
scores on the Sleepiness and Associated Sensations
Scale (totSAS) and the Epworth Sleepiness Scale
(ess).
9. Partial Correlation
• Partial correlation is similar to Pearson product-moment
correlation, except that it allows you to control for an additional
variable.
• This is usually a variable that you suspect might be influencing
your two variables of interest. By statistically removing the
influence of this confounding variable, you can get a clearer and
more accurate indication of the relationship between your two
variables.
• Such confounding occurs when the relationship between two variables (A and
B) is influenced, at least to some extent, by a third variable (C).
This can serve to artificially inflate the size of the correlation
coefficient obtained.
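For a single control variable, the partial correlation can be computed directly from the three zero-order correlations using the standard first-order formula. In the sketch below the r_ac and r_bc values are assumed for illustration, so the result will not reproduce the survey example exactly:

```python
import math

# First-order partial correlation of A and B, controlling for C, from the
# three zero-order correlations. Inputs here are illustrative only.
def partial_corr(r_ab, r_ac, r_bc):
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

r = partial_corr(r_ab=-0.581, r_ac=0.15, r_bc=-0.15)
print(round(r, 3))   # -0.571 with these assumed inputs
```

If C is uncorrelated with both A and B, the partial correlation equals the zero-order one, which is why controlling for an irrelevant variable changes little.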
Example of research question:
After controlling for subjects' tendency to present
themselves in a positive light on self-report scales, is there
still a significant relationship between perceived control of
internal states (PCOISS) and levels of perceived stress?
File: Survey.sav
Follow Procedure Handout
Interpretation
In the top half of the table is the normal Pearson product-moment
correlation matrix between your two variables of interest not
controlling for your other variable. In this case, the correlation is –.581.
The bottom half of the table repeats the same set of correlation
analyses, but this time controlling for (removing) the effects of your
control variable (e.g. social desirability). In this case, the new partial
correlation is –.552.
PRESENTING THE RESULTS
FROM PARTIAL CORRELATION
Partial correlation was used to explore the relationship between
perceived control of internal states (as measured by the PCOISS) and
perceived stress (measured by the Perceived Stress Scale), while
controlling for scores on the Marlowe-Crowne Social Desirability Scale.
Preliminary analyses were performed to ensure no violation of the
assumptions of normality, linearity and homoscedasticity. There was a
strong, negative, partial correlation between perceived control of
internal states and perceived stress, controlling for social desirability, r =
–.55, n = 425, p < .001, with high levels of perceived control being
associated with lower levels of perceived stress. An inspection of the
zero-order correlation (r = –.58) suggested that controlling for socially
desirable responding had very little effect on the strength of the
relationship between these two variables.
Exercise
sleep.sav.
1. Check the strength of the correlation between
scores on the Sleepiness and Associated Sensations
Scale (totSAS) and the impact of sleep problems on
overall wellbeing (impact6) while controlling for
age. Compare the zero order correlation (Pearson
Correlation) and the partial correlation coefficient.
Does controlling for age make a difference?
10. Multiple Regression
• Multiple regression tells you how much of the
variance in your dependent variable can be
explained by your independent variables.
• It also gives you an indication of the relative
contribution of each independent variable. Tests
allow you to determine the statistical significance
of the results, in terms of both the model itself and
the individual independent variables.
Some of the main types of research questions that multiple regression
can be used to address are:
• how well a set of variables is able to predict a particular outcome
• which variable in a set of variables is the best predictor of an outcome
• whether a particular predictor variable is still able to predict an
outcome when the effects of another variable are controlled for (e.g.
socially desirable responding).
MAJOR TYPES OF
MULTIPLE REGRESSION
1. standard or simultaneous
2. hierarchical or sequential
3. stepwise.
Standard Multiple Regression
In standard multiple regression, all the independent (or
predictor) variables are entered into the equation
simultaneously.
Each independent variable is evaluated in terms of its
predictive power, over and above that offered by all the
other independent variables.
This is the most commonly used multiple regression
analysis. You would use this approach if you had a set of
variables (e.g. various personality scales) and wanted to
know how much variance in a dependent variable (e.g.
anxiety) they were able to explain as a group or block.
This approach would also tell you how much unique
variance in the dependent variable each of the independent
variables explained.
Hierarchical multiple
regression
In hierarchical regression (also called sequential regression), the
independent variables are entered into the model in the order specified by
the researcher based on theoretical grounds.
Variables or sets of variables are entered in steps (or blocks), with each
independent variable being assessed in terms of what it adds to the
prediction of the dependent variable after the previous variables have been
controlled for.
For example, if you wanted to know how well optimism predicts life
satisfaction, after the effect of age is controlled for, you would enter age in
Block 1 and then Optimism in Block 2.
Once all sets of variables are entered, the overall model is assessed in terms
of its ability to predict the dependent measure. The relative contribution of
each block of variables is also assessed.
Stepwise multiple regression
In stepwise regression, the researcher provides a list of independent
variables and then allows the program to select which variables it will
enter and in which order they go into the equation, based on a set of
statistical criteria.
There are three different versions of this approach: forward selection,
backward deletion and stepwise regression.
There are a number of problems with these approaches, and some
controversy in the literature concerning their use (and abuse); before
using them, see Tabachnick & Fidell (2013, p. 138).
It is important that you understand what is involved, how to choose the
appropriate variables and how to interpret the output that you receive.
ASSUMPTIONS OF
MULTIPLE REGRESSION
Sample size
◦ Stevens (1996, p. 72) recommends that ‘for social science research, about 15
participants per predictor are needed for a reliable equation’.
◦ Tabachnick and Fidell (2013, p. 123) give a formula for calculating sample size
requirements, taking into account the number of independent variables that
you wish to use: N > 50 + 8m (where m = number of independent variables).
If you have five independent variables, you will need 90 cases.
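That rule of thumb is easy to encode. The formula is Tabachnick and Fidell’s; the function name is ours:

```python
# Tabachnick and Fidell's rule of thumb for multiple regression
# sample size: N > 50 + 8m, where m = number of independent variables.
def required_n(m):
    return 50 + 8 * m

print(required_n(5))   # five predictors -> 90 cases, as in the text
```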
Adjusted R Square
◦ When a small sample is involved, the R square value in the sample tends to
be a rather optimistic overestimation of the true value in the population (see
Tabachnick & Fidell 2013).
◦ The Adjusted R square statistic ‘corrects’ this value to provide a better
estimate of the true population value.
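The correction SPSS applies is the standard adjusted R-square formula, 1 − (1 − R²)(n − 1)/(n − k − 1). A small Python sketch with made-up values shows how much more heavily a small sample is penalised:

```python
# Standard adjusted R-square correction (equivalent to what SPSS reports).
# r2 = R square, n = sample size, k = number of independent variables.
def adjusted_r_square(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R square of .47 with 4 predictors, at two sample sizes:
print(round(adjusted_r_square(0.47, 30, 4), 3))    # small n: noticeably lower
print(round(adjusted_r_square(0.47, 430, 4), 3))   # large n: barely changed
```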
ANOVA
◦ This tests the null hypothesis that multiple R in the population equals 0.
Step 3: Evaluating each of the
independent variables
Coefficients
◦ To compare the different variables it is important that you look at the
standardised coefficients, not the unstandardised ones.
◦ ‘Standardised’ means that these values for each of the different variables
have been converted to the same scale so that you can compare them.
◦ If you were interested in constructing a regression equation, you would use
the unstandardised coefficient values listed as B.
Sig.
◦ tells you whether this variable is making a statistically significant unique
contribution to the equation.
Part correlation coefficients
◦ If you square this value, you get an indication of the contribution of that
variable to the total R square.
◦ In other words, it tells you how much of the total variance in the dependent
variable is uniquely explained by that variable and how much R square would
drop if it wasn’t included in your model.
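For the two-predictor case, the part (semipartial) correlation, and hence a variable’s unique share of R square, can be obtained from zero-order correlations alone. A Python sketch with assumed correlation values, not taken from any example output here:

```python
import math

# Part (semipartial) correlation of predictor 1 with y in a two-predictor
# model, with predictor 2 partialled out of predictor 1 only.
# The three r values below are made up for illustration.
def part_corr(r_y1, r_y2, r_12):
    return (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12**2)

sr1 = part_corr(r_y1=0.60, r_y2=0.40, r_12=0.30)
print(round(sr1**2, 3))   # squared: unique R-square share for predictor 1
```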
More Practical Purposes
The beta values obtained in this analysis can also be used for other
more practical purposes than the theoretical model testing shown here.
Standardised beta values indicate the number of standard deviations
that scores in the dependent variable would change if there was a one
standard deviation unit change in the predictor.
In the current example, if we could increase Mastery scores by one
standard deviation (which is 3.97, from the Descriptive Statistics table)
the perceived stress scores would be likely to drop by .42 standard
deviation units.
Hierarchical Multiple
Regression
In hierarchical regression (also called sequential regression), the
independent variables are entered into the equation in the order
specified by the researcher based on theoretical grounds.
Variables or sets of variables are entered in steps (or blocks), with each
independent variable being assessed in terms of what it adds to the
prediction of the dependent variable, after the previous variables have
been controlled for.
For example, if you wanted to know how well optimism predicts life
satisfaction, after the effect of age is controlled for, you would enter age
in Block 1 and then Optimism in Block 2.
Once all sets of variables are entered, the overall model is assessed in
terms of its ability to predict the dependent measure. The relative
contribution of each block of variables is also assessed.
Example Question
If we control for the possible effect of age and
socially desirable responding, is this set of variables
still able to predict a significant amount of the
variance in perceived stress?
Diagnosis
Step 1: Evaluating the Model
R Square
◦ After the variables in Block 1 (social desirability) have been entered, the
overall model explains 5.7 per cent of the variance (.057 × 100). After Block 2
variables (Total Mastery, Total PCOISS) have also been included, the model as
a whole explains 47.4 per cent (.474 × 100).
R Square change
◦ The R square change value is .42. This means that Mastery and PCOISS explain
an additional 42 per cent (.42 × 100) of the variance in perceived stress,
even when the effects of age and socially desirable responding are
statistically controlled for.
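The R square change is simply the difference between the model’s R square after each block. Restating the numbers above in Python (the .057 and .474 are the values quoted in this example):

```python
# R-square change in hierarchical regression: the difference between the
# model R-square after each block. Values echo the worked example above.
r2_block1 = 0.057   # after Block 1 (social desirability) entered
r2_block2 = 0.474   # after Block 2 (Total Mastery, Total PCOISS) added

r2_change = r2_block2 - r2_block1
print(round(r2_change, 3))   # about .42, as reported in the text
```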
ANOVA
◦ Indicates that the model as a whole is statistically significant.
Step 2: Evaluating each of the
independent variables
Coefficient and Sig
◦ only two variables that make a unique statistically significant contribution
(less than .05).
◦ Remember, these beta values represent the unique contribution of each
variable, when the overlapping effects of all other variables are statistically
removed.
PRESENTING THE RESULTS FROM MULTIPLE REGRESSION
11. Factor Analysis
Two statistical measures are also generated by IBM SPSS to help assess the
factorability of the data: Bartlett’s test of sphericity (Bartlett 1954), and the
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (Kaiser 1970, 1974).
Bartlett’s test of sphericity should be significant (p < .05) for the factor
analysis to be considered appropriate.
The KMO index ranges from 0 to 1, with .6 suggested as the minimum value
for a good factor analysis (Tabachnick & Fidell 2013).
Step 2: Factor extraction
Factor extraction involves determining the smallest number of factors
that can be used to best represent the interrelationships among the set
of variables.
There are a variety of approaches that can be used to identify (extract)
the number of underlying factors or dimensions.
The most commonly used approach is principal components analysis (PCA).
There are a number of techniques that can be used to assist in the
decision concerning the number of factors to retain:
◦ Kaiser’s criterion;
◦ scree test; and
◦ parallel analysis.
Kaiser’s criterion
◦ only factors with an eigenvalue of 1.0 or more are retained for further
investigation
Scree test
◦ This involves plotting each of the eigenvalues of the factors (IBM SPSS does
this for you) and inspecting the plot to find a point at which the shape of the
curve changes direction and becomes horizontal
Parallel analysis
◦ Parallel analysis involves comparing the size of the eigenvalues with those
obtained from a randomly generated data set of the same size. Only those
eigenvalues that exceed the corresponding values from the random data set
are retained.
Step 3: Factor rotation and
interpretation
There are two main approaches to rotation, resulting in either
orthogonal (uncorrelated) or oblique (correlated) factor solutions.
Within the two broad categories of rotational approaches there are a
number of different techniques:
◦ orthogonal: Varimax, Quartimax, Equamax;
◦ oblique: Direct Oblimin, Promax.
◦ The most commonly used orthogonal approach is the Varimax method,
which attempts to minimise the number of variables that have high loadings
on each factor.
◦ The most commonly used oblique technique is Direct Oblimin.
EXAMPLE (data file: survey.sav)
Research question:
What is the underlying factor structure of the Positive and Negative
Affect Scale? Past research suggests a two-factor structure (positive
affect/negative affect). Is the structure of the scale in this study, using a
community sample, consistent with this previous research?
What you need: A set of correlated continuous variables.
What it does: Factor analysis attempts to identify a small set of factors
that represents the underlying relationships among a group of related
variables.
Interpretation of output—Part 1
Step 1
Check that the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) value
is .6 or above and that Bartlett’s Test of Sphericity is significant (p < .05).
Correlation Matrix
◦ look for correlation coefficients of .3 and above
Step 3: Screeplot
Step 4: Parallel analysis (MonteCarloPA)
◦ the number of variables you are analysing (in this case, 20);
◦ the number of participants in your sample (in this case, 435); and
◦ the number of replications (specify 100).
Pattern Matrix
◦ shows the factor loadings of each of the variables.
Structure Matrix
◦ unique to the Oblimin output, provides information about the correlation
between variables and factors.
Communalities
◦ This gives information about how much of the variance in each item is
explained. Low values (e.g. less than .3) could indicate that the item does not
fit well with the other items in its component
PRESENTING THE RESULTS FROM FACTOR ANALYSIS
The 20 items of the Positive and Negative Affect Scale (PANAS) were subjected to
principal components analysis (PCA) using SPSS version 18. Prior to performing
PCA, the suitability of the data for factor analysis was assessed. Inspection of the
correlation matrix revealed the presence of many coefficients of .3 and above.
The Kaiser-Meyer-Olkin value was .87, exceeding the recommended value of .6
(Kaiser 1970, 1974) and Bartlett’s Test of Sphericity (Bartlett 1954) reached
statistical significance, supporting the factorability of the correlation matrix.
Principal components analysis revealed the presence of four components with
eigenvalues exceeding 1, explaining 31.2%, 17%, 6.1% and 5.8% of the variance
respectively. An inspection of the screeplot revealed a clear break after the
second component. Using Catell’s (1966) scree test, it was decided to retain two
components for further investigation. This was further supported by the results
of Parallel Analysis, which showed only two components with eigenvalues
exceeding the corresponding criterion values for a randomly generated data
matrix of the same size (20 variables × 435 respondents).
The two-component solution explained a total of 48.2% of the variance,
with Component 1 contributing 31.25% and Component 2 contributing
17.0%. To aid in the interpretation of these two components, oblimin
rotation was performed. The rotated solution revealed the presence of
simple structure (Thurstone 1947), with both components showing a
number of strong loadings and all variables loading substantially on only
one component.
The interpretation of the two components was consistent with previous
research on the PANAS Scale, with positive affect items loading strongly
on Component 1 and negative affect items loading strongly on Component
2. There was a weak negative correlation between the two factors (r =
–.28). The results of this analysis support the use of the positive affect
items and the negative affect items as separate scales, as suggested by the
scale authors (Watson, Clark & Tellegen 1988).
ADDITIONAL EXERCISES
Data file: staffsurvey.sav
1. Conduct a principal components analysis with Oblimin rotation on the
ten agreement items that make up the Staff Satisfaction Survey (Q1a to
Q10a). You will see that, although two factors record eigenvalues over
1, the screeplot indicates that only one component should be retained.
Run Parallel Analysis using 523 as the number of cases and 10 as the
number of items. The results indicate only one component has an
eigenvalue that exceeds the equivalent value obtained from a random
data set. This suggests that the items of the Staff Satisfaction Scale are
assessing only one underlying dimension (factor).
12. T-tests
• independent-samples t-test, used when you want to compare the
mean scores of two different groups of people or conditions
• paired-samples t-test, used when you want to compare the mean
scores for the same group of people on two different occasions, or
when you have matched pairs.