
INTRODUCTION TO SPSS

FOR WINDOWS
Version 15.0

Summer 2007

Contents

Purpose of handout & Compatibility between different versions of SPSS
SPSS windows & menus
Getting data into SPSS & Editing data
Reading an SPSS viewer/output (.spo) file & Editing your output
Saving data as an SPSS data (.sav) file
Saving your output (statistical results and graphs)
Exporting SPSS Output
Printing your work & Exiting SPSS
Running SPSS using syntax or command language (.sps files)
Creating a new variable
Recoding or combining categories of a variable
Summarizing your data
    Frequency tables (& bar charts) for categorical variables
    Contingency tables for categorical variables
    Descriptive statistics (& histograms) for numerical variables
    Descriptive statistics (& boxplots) by groups for numerical variables
    Using the Split File option for summaries by groups
    Using the Select Cases option for summaries for a subgroup of subjects/observations
Graphing your data
    Bar chart
    Histogram & Boxplot
    Normal probability plot
    Error bar plot
    Scatter plot
    Adding a line or loess smooth to a scatter plot
    Stem-and-leaf plot
Hypothesis tests & Confidence intervals
    One sample t test & Confidence interval for a mean
    Paired t test & Confidence interval for the difference between means
    Two sample t test & Confidence interval for the difference between means
    Sign test and Wilcoxon signed rank test
    Mann Whitney U test (or Wilcoxon rank sum test)
    One-way ANOVA (Analysis of variance) & Post-hoc tests
    Kruskal-Wallis test
    One-sample binomial test
    McNemar's test
    Chi-square test for contingency tables
    Fisher's exact test
    Trend test for contingency tables/ordinal variables
    Binomial, McNemar's, Chi-square and Fisher's exact tests using summary data
    Confidence interval for a proportion
Correlation & Regression
    Pearson and Spearman rank correlation coefficient
    Linear regression
    Linear regression via ANOVA commands
    Logistic regression

Purpose of handout
SPSS for Windows provides a powerful statistical and data management system in a graphical
environment. The user interfaces make statistical analysis more accessible for casual users and
more convenient for experienced users. Most tasks can be accomplished simply by pointing and
clicking the mouse.
The objective of this handout is to get you oriented with SPSS for Windows. It teaches you how
to enter and save data in SPSS, how to edit and transform data, how to explore your data by
producing graphics and summary descriptives, and how to use pointing and clicking to run
statistical procedures. It is also intended to serve as a reference guide for SPSS procedures that
you will need to know to do your homework assignments.

Compatibility between different versions of SPSS


SPSS for Windows data files (files ending in .sav) and syntax (command) files (files ending in
.sps) are compatible between different versions of SPSS (at least, versions 11.0 or newer).
However, SPSS output/viewer files (files ending .spo) are NOT always compatible between
different versions. Usually SPSS output files created with an old version can be read by a
new version, but an output file created using a new version cannot be read by an old version.
One option for avoiding compatibility problems between different versions of SPSS is to export
your output in HTML or MS Word format. The compatibility between Windows and Mac versions
of SPSS is limited.

SPSS Windows & Menus


An overview of the SPSS windows, menus, toolbars, and dialog boxes is given in the SPSS
Tutorials under Help. You can also find information under Topics, Case Studies, Statistics
Coach, and Command & Syntax (if you are using syntax commands.)

Window Types
SPSS Data Editor. When you start an SPSS session, you usually see the Data Editor window
(otherwise you will see a Viewer window). The Data Editor displays the contents of the working
data file. There are two views in the Data Editor window: 1) Data View displays the data in a
spreadsheet format with variable names listed as column headings, and 2) Variable View
displays information about the variables in your data set. In the Data View you can edit or enter
data, and in the Variable View you can change the format of a variable, add format and variable
labels, etc.
SPSS Viewer/Output. Statistical results and graphs are displayed in the Viewer window. The
(output) Viewer window is divided into two panes. The right-hand pane contains all the
output and the left-hand pane contains a tree-structure of the results. You can use the left-hand
pane for navigating through, editing and printing your results.

Chart Editor. The Chart Editor is used to edit graphs. When you double-click on a figure or
graph, it will reappear in a Chart Editor window.
SPSS Syntax Editor. The Syntax Editor is used to create SPSS command syntax for using the
SPSS production facility. Usually you will be using the point and click facilities of SPSS, and
hence, you will not need to use the Syntax Editor. More information about the Syntax Editor and
using the SPSS syntax is given in the SPSS Help Tutorials under Working with Syntax. A few
instructions to get you started are given later in the handout in the section Running SPSS using
Syntax (or Command Language).

Menus
Data Editor Menu:
File. Use the File menu to create a new SPSS file, open an existing file, or read in spreadsheet or
database files created by other software programs (e.g., Excel).
Edit. Use the Edit menu to modify or copy data and output files.
View. Choose which buttons are available in the window or how the window should look.
Data. Use the Data menu to make changes to SPSS data files, such as merging files, transposing
variables, or creating subsets of cases for subset analysis.
Transform. Use the Transform menu to make changes to selected variables in the data file (e.g.,
to recode a variable) and to compute new variables based on existing variables.
Analyze. Use the Analyze menu to select the various statistical procedures you want to use, such
as descriptive statistics, cross-tabulation, hypothesis testing and regression analysis.
Graphs. Use the Graphs menu to display the data using bar charts, histograms, scatterplots,
boxplots, or other graphical displays. All graphs can be customized with the Chart Editor.
Utilities. Use the Utilities menu to view variable labels for each variable.
Add-ons. Information about other SPSS software.
Window. Choose which window you want to view.
Help. Index of help topics, tutorials, SPSS home page, Statistics coach, and version of SPSS.
Viewer Menu: Menu is similar to Data Editor menu, but has two additional options:
Insert. Use the Insert menu to edit your output.
Format. Use the Format menu to change the format of your output.
Chart Editor Menu: Use SPSS Help to learn more about the Chart Editor.

Toolbars
Most Windows applications provide buttons arranged along the top of a window that act as
shortcuts to executing various functions. In SPSS, you will find such buttons (icons) at the top
of the Data Editor, Viewer, Chart Editor, and Syntax windows. The icons are usually
symbolic representations of the procedure they execute when pushed. Unfortunately, their
meanings are not intuitively obvious until one has already used them. Hence, the best way to
learn these buttons is to use them and note what happens.

The Status Bar
The Status Bar runs along the bottom of a window and alerts the user to the status
of the system. Typical messages one will see are "SPSS Processor is ready" and
"Running procedure". The Status Bar will also provide up-to-date information concerning
special manipulations of the data file, such as whether only certain cases are being used in an
analysis or whether the data have been weighted according to the value of some variable.

File Types
Data Files. A file with an extension of .sav is assumed to be a data file in SPSS for Windows
format. A file with an extension of .por is a portable SPSS data file. The contents of a data file
are displayed in the Data Editor window.
Viewer (Output) Files. A file with an extension of .spo is assumed to be a Viewer file
containing statistical results and graphs.
Syntax (Command) Files. A file with an extension of .sps is assumed to be a Syntax file
containing SPSS syntax and commands.

Getting Data into SPSS & Editing Data


When you read data into SPSS or edit data, the data will be displayed in the Data Editor window.
An overview of the basic structure of an SPSS data file is given in the SPSS Help Tutorials:
1. Choose Help on the menu bar
2. Choose Tutorial
3. Choose Reading Data

Reading Data from an SPSS Data (.sav) File


To read a data file from your computer/floppy disk/flash drive that was created and saved using
SPSS (the filename should end with the suffix .sav):
1. Choose Open an existing data source
2. Double click on the filename or
3. Single click on the filename and choose OK
Or
1. Choose Cancel
2. Choose File on the menu bar
3. Choose Open
4. Choose Data...
5. Edit the directory or disk drive to indicate where the data is located.
6. Double click on the filename or
7. Single click on the filename and choose Open
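The same file can also be opened with syntax instead of the menus. The following is a minimal sketch; the path C:\mydata\chol.sav is a made-up example and should be replaced with the location of your own file.

```spss
* Open an SPSS data file (the path below is hypothetical).
GET FILE='C:\mydata\chol.sav'.
* List the variables in the file to confirm it was read as expected.
DISPLAY DICTIONARY.
```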

Reading Data from a Text Data File


To read a raw/text (ASCII) data file from your computer/floppy disk/flash drive, where the data
for each observation is on a separate line and a space is used to separate variables on the same
line (i.e., the file format is freefield; the filename should end with the suffix .dat):
1. Choose File on the menu bar
2. Choose Read Text Data
3. Choose Files of Type *.dat
4. Edit the directory or disk drive to indicate where the data is located
5. Double click on the filename or
6. Single click on the filename and choose Open
7. Follow the Import Wizard Instructions.

You can also get to the Import Wizard as follows:
1. Choose File on the menu bar
2. Choose Open
3. Choose Data...
4. Choose Files of Type *.dat
5. Edit the directory or disk drive to indicate where the data is located
6. Double click on the filename or
7. Single click on the filename and choose Open
8. Follow the Import Wizard Instructions.

Instructions on how to read a text data file in fixed format are located in SPSS Help Tutorials
under Reading Data from a Text File.
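For a freefield file of the kind described above (one observation per line, values separated by spaces), the syntax equivalent is short. This is a sketch only: the file name and the variable names id, age, and hdl are assumptions for illustration, and the names must be listed in the order in which the values appear on each line.

```spss
* Read a space-delimited text file with one case per line.
* The path and the variable names id, age and hdl are hypothetical.
DATA LIST FILE='C:\mydata\chol.dat' LIST
  /id age hdl.
EXECUTE.
```

The LIST keyword tells SPSS that each case occupies exactly one line. The Import Wizard pastes a longer GET DATA command that does a similar job.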

Reading Data from Other Types of External Files


SPSS allows you to read a variety of other types of external files, such as Excel spreadsheet files,
SAS data files, Lotus 1-2-3 spreadsheet files, and dBASE database files. To read data from other
types of external files, you follow the same steps as you would for reading an SPSS data file,
except that you specify the file type according to what package was used to create the file.
For further instruction on how to read data from other types of external files, see the SPSS for
Windows Base System User's Guide on data files or the SPSS Help Tutorials.
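As an illustration for one common case, an Excel worksheet can be read with a GET DATA command along the following lines. The path and the sheet name Sheet1 are assumptions; pasting from the Open Data dialog box will show the exact command your version of SPSS generates.

```spss
* Read an Excel worksheet, taking variable names from its first row.
* The path and the sheet name Sheet1 are hypothetical.
GET DATA
  /TYPE=XLS
  /FILE='C:\mydata\chol.xls'
  /SHEET=NAME 'Sheet1'
  /CELLRANGE=FULL
  /READNAMES=ON.
```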

Entering and Editing Data Using the Data Editor


The Data Editor provides a convenient spreadsheet-like facility for entering, editing, and
displaying the contents of your data file. A Data Editor window opens automatically when you
start an SPSS session. Instruction on using the Data Editor to enter data is given in the SPSS
Help Tutorials. Note that if you are already familiar with entering data into a different
spreadsheet program (e.g., MS Excel), you might find it easier to enter your data in the program
you are familiar with and then read the data into SPSS.
Entering Data. Basic data entry in the Data Editor is simple:
Step 1. Create a new (empty) Data Editor window. At the start of an SPSS session a new
(empty) Data Editor window opens automatically. During an SPSS session you can create a new
Data Editor window by
1. Choose File
2. Choose New
3. Choose Data
Step 2. Move the cursor to the first empty column.
Step 3. Type a value into the cell. As you type, the value appears in the cell editor at the top of
the Data Editor window. Each time you press the Enter key, the value is entered in the cell and
you move down to the next row. By entering data in a column, you automatically create a
variable and SPSS gives it the default variable name var00001.
Step 4. Choose the first cell in the next column. You can use the mouse to click on the cell or use
the arrow keys on the keyboard to move to the cell. By default, SPSS names the data in the
second column var00002.
Step 5. Repeat step 4 until you have entered all the data. If you entered an incorrect value(s) you
will need to edit your data. See the following section on Editing Data.

Editing Data. With the Data Editor, you can modify a data file in many ways. For example you
can change values or cut, copy, and paste values, or add and delete cases.
To Change a Data Value:
1. Click on a data cell. The cell value is displayed in the cell editor.
2. Type the new value. It replaces the old value in the cell editor.
3. Press the Enter key. The new value appears in the data cell.
To Cut, Copy, and Paste Data Values
1. Select (highlight) the cell value(s) you want to cut or copy.
2. Pull down the Edit box on the main menu bar.
3. Choose Cut. The selected cell values will be copied, then deleted. Or
4. Choose Copy. The selected cell values will be copied, but not deleted.
5. Select the target cell(s) (where you want to put the cut or copy values).
6. Pull down the Edit box on the main menu bar.
7. Choose Paste. The cut or copied values will be "pasted" in the target cells.
To Delete a Case (i.e., a Row of Data)
1. Click on the case number on the left side of the row. The whole row will be highlighted.
2. Pull down the Edit box on the main menu bar.
3. Choose Clear.
To Add a Case (i.e., a Row of Data)
1. Select any cell in the row below where you want to insert the new case.
2. Pull down the Data box on the main menu bar.
3. Choose Insert.
Defining Variables. The default name for new variables is the prefix var and a sequential
five-digit number (e.g., var00001, var00002, var00003). To change the name, format and other
attributes of a variable:
1. Double click on the variable name at the top of a column or,
2. Click on the Variable View tab at the bottom of Data Editor Window.
3. Edit the variable name under the column labeled Name. In SPSS 12 and later a variable
name can be up to 64 characters long (earlier versions allowed at most eight characters).
You can also specify the number of decimal places (under Decimals), assign a descriptive
name (under Label), define missing values (under Missing), define the type of variable
(under Measure; e.g., scale, ordinal, nominal), and define the values for nominal
variables (under Values).
After the data is entered (or several times during data entry), you will want to save it as an
SPSS data file. See the section on Saving Data as an SPSS Data (.sav) File.
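The attributes set in the Variable View can also be defined with syntax. The sketch below assumes two freshly entered columns, var00001 and var00002, that hold age and smoking status; the new names, the labels, and the missing-value code 999 are illustrative assumptions, not values from any particular data set.

```spss
* Give the default variables meaningful names (names are hypothetical).
RENAME VARIABLES (var00001 = age) (var00002 = smoke).
* Attach descriptive labels to the variables.
VARIABLE LABELS age 'Age in years'
  /smoke 'Smoking status'.
* Label the codes of the categorical variable.
VALUE LABELS smoke 1 'Never smoked' 2 'Former smoker' 3 'Current smoker'.
* Declare 999 as a user-missing code for age (an assumed convention).
MISSING VALUES age (999).
* Display age with no decimal places and set the measurement levels.
FORMATS age (F3.0).
VARIABLE LEVEL age (SCALE) /smoke (NOMINAL).
```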

Reading an SPSS Viewer/Output (.spo) File


Statistical results and graphs are displayed in the Viewer window. An overview of how to use
the Viewer is given in the SPSS Help Tutorials under Working with Output.
If you saved the results of the Viewer window during an earlier SPSS session, you can use the
following commands to display the Viewer (output) results in a current SPSS session. However,
SPSS output/viewer files (files ending .spo) are NOT always compatible between different
versions. Usually SPSS output files created with an older version can be read by a newer
version, but an output file created using a newer version cannot be read by an older version. One
option for avoiding compatibility problems between different versions of SPSS is to export your
output in HTML or MS Word format. The compatibility between Windows and Mac versions of
SPSS is limited.
To read a Viewer file from your computer/floppy disk/flash drive that was created and saved
using SPSS (the filename should end with the suffix .spo):
1. Choose File on the menu bar
2. Choose Open
3. Choose Output...
4. Edit the directory or disk drive to indicate where the file is located
5. Double click on the filename or
6. Single click on the filename and choose Open

Editing Your Output


Editing the statistical results and graphs in the Viewer window is beyond the scope of this
handout. Instructions on how to edit your output are given in the SPSS Help Tutorials under
Working with Output and Creating and Editing Charts.
You can use either the tree-structure in the left-hand pane or the results displayed in the
right-hand pane to select, move or delete parts of the output.
To edit a table or object (an object is a group of results) you first need to double click on the
table/object so an "editing box" appears around the table/object, and then select the value you
want to modify. An "editing box" is a ragged box outlining the table. If you only do a
single click you will get a box with straight/plain lines outlining the table. In general, to create
nice looking tables of your results it is often easier to hand enter the values into a blank MS
Word table than to edit an SPSS table/object (either in SPSS or MS Word).
To edit a chart you first need to double click on the chart so it appears in a new Chart Editor
window. After you are done editing the chart, close the window and then export the chart, for
example to a Windows metafile and then into an MS Word file.

By default in SPSS a P-value is displayed as .000 if the P-value is less than .001. You can
report the P-value as <.001 or have SPSS display more significant digits:
1. In an SPSS (output) Viewer window double click (with the left mouse button) on the table
containing the p-value you want to display differently. An "editing box" should appear
around the table.
2. Click on the p-value using the right mouse button.
3. Choose Cell Properties. (If you do not get this option, you need to double click on the table
to get the ragged box.)
4. Change the number of decimals to the desired number (default is 3).
5. Choose OK or
6. Double click on the p-value with the left mouse button and SPSS will display the p-value
with more significant digits. If the p-value is very small, the p-value will be displayed in
scientific notation (e.g., 1.745E-10 = 0.0000000001745).

Saving Data as an SPSS Data (.sav) File


To save data as a new SPSS Data file onto your computer/floppy disk/flash drive:
1. Display the Data Editor window (i.e., execute the following commands while in the Data
Editor window displaying the data you want to save.)
2. Choose File on the menu bar.
3. Choose Save As...
4. Edit the directory or disk drive to indicate where the data should be saved. SPSS will
automatically add the .sav suffix to the filename.
5. Choose Save
To save data changes in an existing SPSS data file:
1. Display the Data Editor window (i.e., execute the following commands while in the Data
Editor window displaying the data you want to save.)
2. Choose File on the menu bar
3. Choose Save
Caution. The Save command saves the modified data by overwriting the previous version of the
file.
You can save your data in other formats besides an SPSS data file (e.g., as an ASCII file, Excel
file, SAS data set). To save your data with a given format you follow the same steps as saving
data in a new SPSS data file, except that you specify the Save as Type as the desired format.
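Saving can also be done from a syntax file. The sketch below writes the working data file first as an SPSS data file and then as an Excel file; both paths are hypothetical, and the subcommands shown for the Excel export are one reasonable choice rather than the only one.

```spss
* Save the working data file as an SPSS data file (path is hypothetical).
SAVE OUTFILE='C:\mydata\chol.sav'.
* Save a copy in Excel format with variable names in the first row.
* REPLACE overwrites an existing file of the same name without asking.
SAVE TRANSLATE OUTFILE='C:\mydata\chol.xls'
  /TYPE=XLS
  /VERSION=8
  /FIELDNAMES
  /REPLACE.
```

As with the Save command on the menu, SAVE OUTFILE overwrites any existing file with the same name.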

Saving Your Output (Statistical Results and Graphs)


To save the statistical results and graphs displayed in the Viewer window as a new SPSS Output
file:
1. Display the Viewer window (i.e., execute the following commands while in the Viewer
window displaying the results you want to save.)
2. Choose File on the menu bar.
3. Choose Save As...
4. Edit the directory or disk drive to indicate where the output should be saved. SPSS will
automatically add the .spo suffix to the filename.
5. Choose Save
To save Viewer changes in an existing SPSS Output file:
1. Display the Viewer window (i.e., execute the following commands while in the Viewer
window displaying the results you want to save.)
2. Choose File on the menu bar.
3. Choose Save.
Caution. The Save command saves the modified Viewer window by overwriting the previous
version of the file.
Note that you will not be able to open SPSS output that was created with a newer version than
the version of SPSS that you are using to open the output. Hence, you may want to avoid this
problem by exporting your output in HTML or MS Word format. Also, charts often do not
export properly into an HTML or Word file. Usually you need to export charts separately into a
Windows metafile (.wmf). Sometimes the output, including charts, can be copied and pasted
directly into a Word file.


Exporting SPSS Output


Sometimes you will want to save your SPSS output in a different file format than an SPSS output
file, because you want to avoid compatibility problems between different versions of SPSS, you
want to further edit your output in a Word document, or you want to include graphs or figures in
another document file. The basic steps in exporting SPSS output to another file type are, while
in an SPSS (output) Viewer window:
1. Choose File
2. Choose Export
3. Choose what you want to export:
   Output Document: exports all the output
   Output Document (No Charts): exports only the numerical results
   Charts Only: exports only charts (i.e., graphs & figures)
   Note that charts often do not export properly into an HTML or Word file. Usually you need
   to export charts separately into a Windows metafile (.wmf).
4. Define further what you want to export:
   All Objects: this option also exports other extraneous information (rarely useful)
   All Visible Objects: use this option to export all the output.
   Selected Objects: this allows you to export only the objects you have selected in the
   Viewer window.
5. Choose the file type:
   HTML and Word/RTF are good file types for numerical results (no charts).
   Windows Metafile (.WMF) is a good file type for charts if you want to include figures in
   an MS Word document.
   Note that the file type options are dependent on what you are exporting.
6. Choose the location and file name for the output you want to export.
7. Choose OK


Printing Your Work in SPSS


To print statistical results and graphs in the Viewer window or data in the Data Editor window:
1. Display the output or data you want to print (i.e., execute the following commands while
in an output or data window).
2. Choose File on the menu bar.
3. Choose Print...
4. Choose All visible output or Selection (if you have selected parts of the output). When
printing from a data file, the options are All, Selection and Page # to Page #.
5. Choose OK

Exiting SPSS
To exit SPSS:
1. Choose File on the menu bar
2. Choose Exit SPSS
If you have made changes to the data file or the output file since the last time you saved these
files, before exiting SPSS you will be asked whether you want to save the contents of the Data
Editor window and Viewer window. If you are unsure as to whether you want to save the
contents of the data or output window, choose Cancel, then display the window(s) and if you
want to save the contents of the window, follow the instructions in this handout for saving data or
output windows. SPSS will use the overwrite method when saving the contents of the window.


Running SPSS using Syntax (or Command Language)


This handout describes how to run various statistical summaries and procedures using the
point-and-click menus in SPSS. However, it is also possible to run SPSS commands using SPSS
syntax/command language. If you are running similar analyses repeatedly, it can be more
efficient to run your analysis using SPSS syntax. How to run SPSS using the syntax/command
language is beyond the scope of this handout. Help on running SPSS using the syntax/command
language can be found in the SPSS Tutorials under Working with Syntax.
To get you started using SPSS syntax, follow the point-and-click instructions for running a
particular analysis, but select Paste instead of OK at the last step. An SPSS Syntax Editor window
will open containing the SPSS syntax for running the analysis. To run the analysis you can
choose Run on the menu bar or you can highlight the syntax you want to run, click the right
mouse button, and select Run Current. You can add more syntax to the Syntax Editor window by
using the point-and-click method, selecting Paste instead of OK at the last step. The additional
syntax will be added at the bottom of the Syntax Editor window. You can also write syntax
directly into the syntax file and/or use copy, paste and editing commands to modify the syntax.
Remember to save your syntax file before exiting SPSS. The file should end in .sps. You can
open a syntax file by selecting File on the menu bar, then Open, and then Syntax.
As an example, syntax can be used to run a two sample t test comparing HDL cholesterol (hdl)
for subjects without and with a family history of heart attack (fhha, coded 0 for no and 1 for
yes), or to create three indicator variables, neversmoke, formersmoke, and currentsmoke, for
smoking status (smoke). Note that a period (.) is used to denote the end of a command and
Execute. is sometimes required to run the syntax.
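Syntax for the two tasks described above might look like the following. This is a reconstruction from the description, not the handout's original listing, so the exact subcommands are assumptions; pasting from the dialog boxes will show what your version of SPSS generates. The variable names hdl, fhha, and smoke and their codings are taken from the text.

```spss
* Two sample t test of hdl comparing fhha = 0 with fhha = 1.
T-TEST GROUPS=fhha(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=hdl
  /CRITERIA=CI(.95).

* Create three indicator variables from smoke (1 never, 2 former, 3 current).
RECODE smoke (1=1) (ELSE=0) INTO neversmoke.
RECODE smoke (2=1) (ELSE=0) INTO formersmoke.
RECODE smoke (3=1) (ELSE=0) INTO currentsmoke.
EXECUTE.
```

Because ELSE also catches missing values, a subject with missing smoke would be coded 0 on every indicator. Adding (MISSING=SYSMIS) as the first specification in each RECODE is one way to keep such cases missing.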


Creating a New Variable


To create a new variable:
1. Display the Data Editor window (i.e., execute the following commands while in the Data
Editor window displaying the data file you want to use to create a new variable).
2. Choose Transform on the menu bar
3. Choose Compute...
4. Enter the new variable name in the Target Variable box.
5. Enter the definition of the new variable in the Numeric Expression box (e.g., SQRT(visan),
LN(age), or MEAN(age)) or
6. Select variable(s) and combine with desired arithmetic operations and/or functions.
7. Choose OK
After creating a new variable(s), you will probably want to save the new variable(s) by re-saving
your data using the Save command under File on the menu bar (see Saving Data as an SPSS
Data (.sav) File). Further instructions on creating a new variable are given in the SPSS Help
Tutorials under Modifying Data Values.
Example: Creating a (New) Transformed Variable
You can use the SPSS commands for creating a new variable to create a transformed
variable. Suppose you have a variable indicating triglyceride level, trig, and you want to
transform this variable using the natural logarithm to make the distribution less skewed
(i.e., you want to create a new variable which is the natural logarithm of triglyceride levels).

1. Display the Data Editor window
2. Choose Transform on the menu bar
3. Choose Compute...
4. Enter, say, lntrig, in the Target Variable box.
5. Enter Ln(trig) in the Numeric Expression box.
6. Choose OK

Now, a new variable, lntrig, which is the natural logarithm of trig, will be added to your
data set. Remember to save your data set before exiting SPSS (e.g., while in the SPSS
Data window, choose Save under File or click on the floppy disk icon).
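The same transformation can be written as syntax, which is roughly what the Compute dialog box pastes. The variable label is an illustrative assumption.

```spss
* Create lntrig as the natural logarithm of trig.
COMPUTE lntrig = LN(trig).
* Attach a descriptive label (the wording is hypothetical).
VARIABLE LABELS lntrig 'Natural log of triglyceride level'.
EXECUTE.
```

Subjects with a missing, zero, or negative value of trig will have a missing value for lntrig, since the logarithm is not defined for those values.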


Recoding or Combining Categories of a Variable


To recode or combine categories of a variable:
1. Display the Data Editor window (i.e., execute the following commands while in the Data
Editor window displaying the data file you want to use to recode variables).
2. Choose Transform on the menu bar
3. Choose Recode
4. Choose Into Same Variables... or Into Different Variables...
5. Select a variable to recode from the variable list on the left and then click on the arrow
located in the middle of the window. This defines the input variable.
6. If recoding into a different variable, enter the new variable name in the box under Name:,
then choose Change. This defines the output variable.
7. Choose Old and New Values...
8. Choose Value or Range under Old Value and enter old value(s).
9. Choose New Value and enter new value, then choose Add.
10. Repeat the process until all old values have been redefined.
11. Choose Continue
12. Choose OK
After creating a new variable(s), you will probably want to save the new variable(s) by re-saving
your data using the Save command under File on the menu bar (see Saving Data as an SPSS
Data (.sav) File).
Example: Recoding a Categorical Variable
You can use the commands for recoding a variable to change the coding values of a
categorical variable. You may want to change a coding value for a particular category to
modify which category SPSS uses as the referent category in a statistical procedure. For
example, suppose you want to perform linear regression using the ANOVA (or General
Linear Model) commands, and one of your independent variables is smoking status, smoke,
that is coded 1 for never smoked, 2 for former smoker and 3 for current smoker. By
default SPSS will use current smoker as the referent category because current smoker
has the largest numerical (code) value. If you want never smoked to be the referent
category you need to recode the value for never smoked to a value larger than 3.
Although you can recode the smoking status into the same variable, it is better to recode
the variable into a new/different variable, newsmoke, so you do not lose your original data
if you make an error while recoding.

1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter newsmoke as the name of the Output variable, and then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may already be selected.)
9. Enter 1 (code for never smoker)
10. Choose Value under New Value. (It may already be selected.)
11. Enter 4 (or any value greater than 3)
12. Choose Add
13. Choose All Other Values under Old Value.
14. Choose Copy Old Value(s) under New Value.
15. Choose Add
16. Choose Continue
17. Choose OK

Remember to save your data set before exiting SPSS.
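The recode above corresponds to a single RECODE command. The cross-tabulation at the end is an added check, not part of the steps in this handout; it lets you confirm that every old code maps to the new code you intended.

```spss
* Recode never smoked from 1 to 4 and copy all other values unchanged.
RECODE smoke (1=4) (ELSE=COPY) INTO newsmoke.
EXECUTE.
* Cross-tabulate old and new codes to check the recode.
CROSSTABS /TABLES=smoke BY newsmoke.
```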

Example: Creating Indicator or Dummy Variables
You can use the commands for recoding a variable to create indicator or dummy variables
in SPSS. Suppose you have a variable indicating smoking status, smoke, that is coded 1 for
never smoked, 2 for former smoker and 3 for current smoker. To create three new
indicator or dummy variables for never, former and current smoking:

1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter neversmoke as the name of the Output variable, and then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may already be selected.)
9. Enter 1 (code value for never smoker)
10. Choose Value under New Value. (It may already be selected.)
11. Enter 1 (to indicate never smoker)
12. Choose Add
13. Choose All Other Values under Old Value.
14. Choose Value under New Value.
15. Enter 0
16. Choose Add
17. Choose Continue
18. Choose OK

Now, you have created a binary indicator variable for never smoker (coded 1 if never
smoker, 0 if former or current smoker). Next, create a binary indicator variable for
former smoker.

1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter formersmoke as the name of the Output variable, and then choose Change. (Or
change (edit) never to former, and then choose Change).
7. Choose Old and New Values...
8. Choose 1 --> 1 under Old --> New and then choose Remove.
9. Choose Value under Old Value.
10. Enter 2 (code value for former smoker)
11. Choose Value under New Value.
12. Enter 1 (to indicate former smoker)
13. Choose Add
14. Choose Continue
15. Choose OK

Now, you have created a binary indicator variable for former smoker (coded 1 if former
smoker, 0 if never or current smoker). To create a binary indicator variable for current
smoker you would use similar commands to those for creating the indicator variable for
former smoker, except that now the value of 3 for smoke is coded as 1 and all other values
are coded as 0.
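The recoding logic above can also be written out in a few lines of code. The following is a minimal Python sketch, not SPSS syntax; the data values and the helper name indicator are made up for illustration, and the output variable currentsmoke is an assumed name for the third indicator.

```python
# Hypothetical smoking-status codes: 1 = never, 2 = former, 3 = current
smoke = [1, 3, 2, 1, 2, 3, 1]


def indicator(values, code):
    """Return 1 where the value equals code and 0 otherwise.

    This mirrors the two recode rules used above: code -> 1, all other values -> 0.
    """
    return [1 if v == code else 0 for v in values]


neversmoke = indicator(smoke, 1)
formersmoke = indicator(smoke, 2)
currentsmoke = indicator(smoke, 3)  # assumed name for the third indicator

# Each subject should have exactly one indicator equal to 1
for row in zip(smoke, neversmoke, formersmoke, currentsmoke):
    print(row)
```

Checking that the three indicators sum to 1 for every subject is a quick way to confirm that no category was missed.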

Example: Creating a Categorical Variable From a Numerical Variable
You can use the commands for recoding a variable to create a categorical variable from a numerical
variable (i.e., group values of the numerical variable into categories). For example, suppose you have
a variable that is the number of pack years smoked, packyrs, and you want to create a categorical
variable with four categories: 0, >0 to 10, >10 to 30, and >30 pack years smoked.

1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable packyrs as the Input
variable
6. Enter a name for the new variable,
packcat, for the Output variable, and
then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may
already be selected.)
9. Enter 0
10. Choose Value under New Value.
11. Enter 0 (to indicate 0 pack years)
12. Choose Add
13. Choose Range under Old Value.
14. Enter 0.01 and 10 in the two blank
boxes.
15. Choose Value under New Value
16. Enter 1 (to indicate >0 to 10 pack years)
17. Choose Add

18. Choose Range under Old Value.
19. Enter 10.01 and 30 in the two blank boxes.
20. Choose Value under New Value
21. Enter 2 (to indicate >10 to 30 pack years)
22. Choose Add
23. Choose Range, value through HIGHEST under Old Value.
24. Enter 30.01 in the blank box.
25. Choose Value under New Value
26. Enter 3 (to indicate >30 pack years)
27. Choose Add
28. Choose Continue
29. Choose OK

Note that you may want to use different coding values depending on which category you want to
be used as the referent category in certain statistical procedures. Remember to save your data set
before exiting SPSS.
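The grouping rule behind these steps can be sketched in code. This is a minimal Python illustration, not SPSS syntax; the function name pack_category and the sample values are hypothetical. One difference is an assumption of the sketch: it uses the intended inequalities (>0 to 10, >10 to 30, >30) directly, so a value such as 10.005 is still assigned a category, whereas ranges entered as 0.01 to 10 and 10.01 to 30 would leave it unrecoded.

```python
def pack_category(packyrs):
    """Map pack years smoked to the codes 0, 1, 2, 3 used in the example above."""
    if packyrs < 0:
        raise ValueError("pack years cannot be negative")
    if packyrs == 0:
        return 0  # 0 pack years
    if packyrs <= 10:
        return 1  # >0 to 10 pack years
    if packyrs <= 30:
        return 2  # >10 to 30 pack years
    return 3      # >30 pack years


# Hypothetical values for illustration
packyrs_values = [0, 2.5, 10, 10.01, 30, 45]
packcat = [pack_category(v) for v in packyrs_values]
print(list(zip(packyrs_values, packcat)))
```

To change the referent category, only the returned code values need to change; the cut points stay the same.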


Summarizing Your Data


Frequency Tables (& Bar Charts) for Categorical Variables. To produce frequency tables and
bar charts for categorical variables:
1. Choose Analyze from the menu bar
2. Choose Descriptive Statistics
3. Choose Frequencies
4. Variable(s): To select the variables you want from the source list on the left, highlight a
variable by pointing and clicking the mouse and then click on the arrow located in the middle
of the window. Repeat the process until you have selected all the variables you want.
5. Choose Charts (Skip to step 8 if you do not want bar charts.)
6. Choose Bar Chart(s)
7. Choose Continue
8. Choose OK

Example: Frequency table and bar chart for the categorical variable, smoking status.
Smoking status is the selected variable and Bar charts under Charts has been selected.

Frequency table and bar chart of smoking status

Smoking status
            Frequency   Percent   Valid Percent   Cumulative Percent
never           590       59.0        59.0               59.0
former          293       29.3        29.3               88.3
current         117       11.7        11.7              100.0
Total          1000      100.0       100.0

[Bar chart: Percent (0 to 60) by Smoking status (never, former, current)]
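To make the Percent and Cumulative Percent columns concrete, here is a minimal Python sketch that recomputes them from the frequencies in the example (590, 293 and 117). It is only an illustration of the arithmetic, not of how SPSS computes the table.

```python
# Frequencies taken from the smoking status example above
counts = {"never": 590, "former": 293, "current": 117}

total = sum(counts.values())
rows = []
running_count = 0
for label, n in counts.items():
    running_count += n
    percent = round(100 * n / total, 1)
    cumulative = round(100 * running_count / total, 1)
    rows.append((label, n, percent, cumulative))

print(f"{'':10}{'Frequency':>10}{'Percent':>10}{'Cumulative':>12}")
for label, n, percent, cumulative in rows:
    print(f"{label:10}{n:>10}{percent:>10}{cumulative:>12}")
print(f"{'Total':10}{total:>10}")
```

Valid Percent equals Percent here because there are no missing values in the example.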

Contingency Tables for Categorical Variables. To produce contingency tables for categorical
variables:
1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on the
arrow located next to the Row(s) box. Repeat the process until you have selected all the row
variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then
click on the arrow located next to the Column(s) box. Repeat the process until you have
selected all the column variables you want.
6. Choose Cells...
7. Choose the cell values (e.g., observed counts; row, column, and margin (total) percentages).
Note the option is selected when the little box is not empty.
8. Choose Continue
9. Choose OK

Example: Contingency table of smoking status by coronary heart disease (CHD).

Smoking status is the row variable and CHD is the column variable. Observed counts
and row percentages will be displayed.

Smoking status * Incident CHD Crosstabulation

                                              Incident CHD
Smoking status                                no        yes       Total
never      Count                              537       53        590
           % within Smoking status            91.0%     9.0%      100.0%
former     Count                              257       36        293
           % within Smoking status            87.7%     12.3%     100.0%
current    Count                              106       11        117
           % within Smoking status            90.6%     9.4%      100.0%
Total      Count                              900       100       1000
           % within Smoking status            90.0%     10.0%     100.0%
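The row percentages in the crosstabulation are each cell count divided by its row total. A minimal Python sketch using the counts from the example is given below; the function name row_percentages is hypothetical.

```python
# Observed counts from the smoking status by incident CHD example
table = {
    "never": {"no": 537, "yes": 53},
    "former": {"no": 257, "yes": 36},
    "current": {"no": 106, "yes": 11},
}


def row_percentages(counts):
    """Return the percentage of each cell within its row, rounded to one decimal."""
    result = {}
    for row_label, cells in counts.items():
        row_total = sum(cells.values())
        result[row_label] = {
            col: round(100 * n / row_total, 1) for col, n in cells.items()
        }
    return result


percentages = row_percentages(table)

# Add the column totals as a final "Total" row
totals = {col: sum(cells[col] for cells in table.values()) for col in ("no", "yes")}
percentages["Total"] = row_percentages({"Total": totals})["Total"]

for label, cells in percentages.items():
    print(label, cells)
```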

Descriptive Statistics (& Histograms) for Numerical Variables. To produce descriptive
statistics and histograms for numerical variables:
1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Frequencies...
4. Variable(s): To select the variables you want from the source list on the left, highlight a
variable by pointing and clicking the mouse and then click on the arrow located in the middle
of the window. Repeat the process until you have selected all the variables you want.
5. Choose Display frequency tables to turn off the option. Note that the option is turned off
when the little box is empty.
6. Choose Statistics
7. Choose summary measures (e.g., mean, median, standard deviation, minimum, maximum,
skewness or kurtosis).
8. Choose Continue
9. Choose Charts (Skip to step 12 if you do not want histograms.)
10. Choose Histogram(s)
11. Choose Continue
12. Choose OK
An alternate way to produce only the descriptive statistics is at step 3 to choose Descriptives...
instead of Frequencies..., then, select the variables you want. By default SPSS computes the
mean, standard deviation, minimum and maximum. Choose Options... to select other summary
measures.
Example: Descriptive summaries and histogram for the numerical variable age.
Age is the variable to summarize. You can
select more than one variable to analyze.
Remember to turn off the Display frequency
tables option.

Mean, standard deviation, minimum and maximum were selected under Statistics, and
histogram was selected under Charts.

Summaries for Age

Statistics
Age
N   Valid             1000
    Missing              0
Mean                 72.14
Std. Deviation       5.275
Minimum                 65
Maximum                 90

Histogram of Age
[Histogram: Frequency (0 to 120) by Age (60 to 95); Mean = 72.14, Std. Dev. = 5.275, N = 1,000]

Descriptive Statistics (& Boxplots) by Groups for Numerical Variables. To produce
descriptive statistics and boxplots by groups for numerical variables:
1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Explore...
4. Dependent List: To select the variables you want to summarize from the source list on the
left, highlight a variable by pointing and clicking the mouse and then click on the arrow
located next to the dependent list box. Repeat the process until you have selected all the
variables you want.
5. Factor List: To select the variables you want to use to define the groups from the source list
on the left, highlight a variable by pointing and clicking the mouse and then click on the
arrow located next to the factor list box.
6. Choose Plots... (If you do not want boxplots, choose Statistics for the Display option and
skip to Step 11.)
7. Choose Factor levels together from the Boxplot box.
8. Select Stem-and-leaf option from the Descriptive box to turn off the option.
9. Choose Continue
10. Choose Both for the Display option
11. Choose OK
Example: Total cholesterol by family history of heart attack (yes or no).
In this example total cholesterol is
the dependent variable. You can
select more than one variable.
Summaries will be computed for each
group defined by family history of
heart attack.
Both numerical summaries (statistics)
and plots are selected.
Under Statistics
Descriptives is usually
selected by default.
Under Plots select
Boxplot option and
unselect stem-and-leaf.

Descriptives

Total cholesterol, Family history of heart attack = no
                                              Statistic    Std. Error
Mean                                            221.93        1.417
95% Confidence Interval     Lower Bound         219.15
for Mean                    Upper Bound         224.72
5% Trimmed Mean                                 221.63
Median                                          219.76
Variance                                      1350.641
Std. Deviation                                  36.751
Minimum                                            111
Maximum                                            363
Range                                              252
Interquartile Range                                 49
Skewness                                          .184         .094
Kurtosis                                          .363         .188

Total cholesterol, Family history of heart attack = yes
                                              Statistic    Std. Error
Mean                                            220.53        2.150
95% Confidence Interval     Lower Bound         216.30
for Mean                    Upper Bound         224.76

The Explore command by default produces a lot of different summaries, so you need to
select what to report. All summaries are shown for all groups; the table has been
cropped in this example.

Boxplot of Total Cholesterol by Family History of Heart Attack
[Boxplot: Total cholesterol (100 to 400) by Family history of heart attack (no, yes); outlying observations are labeled with their case numbers]

Using the Split File Option for Summaries by Groups for Categorical and Numerical
Variables. The Split File option in SPSS is a convenient way to produce summaries, graphs, and
run statistical procedures by groups. To activate the option:
1. Choose Data on the menu bar of the Data Editor window
2. Choose Split File
3. Choose Compare groups or Organize output by groups. The two options display the output
differently. Try each option to see which works best for your needs.
4. Choose the variable that defines the groups.
5. Choose OK
Now, all the summaries, graphs, and statistical procedures you request will be done
(automatically) for each group. To turn off this option:
1. Choose Data on the menu bar of the Data Editor window
2. Choose Split File
3. Choose Analyze all cases, do not create groups
4. Choose OK

Example. Use the Split File option to run summaries by family history of heart attack (yes
or no).

Compare groups option will try to display the results for each group side by side when
feasible.

Organize output by groups option will display the results separately for each group,
starting with the group with the lowest numerical code value.

Using the Select Cases Option for Summaries for a subgroup of subjects/observations.
The Select Cases option in SPSS is a convenient way to produce summaries and run statistical
procedures for a subgroup of subjects or to temporarily exclude subjects from the analysis. To
activate this option:
1. Choose Data on the menu bar of the Data Editor window
2. Choose Select Cases
3. Choose If condition is satisfied
4. Choose If
5. Enter the expression that indicates the subjects/observations you want to select.
6. Choose Continue
7. Choose OK

Now, all the summaries, graphs, and statistical procedures you request will be done using only
the selected subjects/observations. To turn off this option:
1. Choose Data on the menu bar of the Data Editor window
2. Choose Select Cases
3. Choose All cases
4. Choose OK

Example: Select subjects not on lipid lowering medications (i.e., subjects with lipid = 0
indicating no medications).

Select the If condition is satisfied option and then If.

Caution! Usually you do not want to delete observations from your dataset, so do not select
the option to delete unselected cases.

Typical expressions will involve combinations of the following symbols:

Symbol    Definition
=         equal
~=        not equal
>=        greater than or equal
<=        less than or equal
>         greater than
<         less than
&         and
|         or
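To show how these symbols combine into a selection rule, here is a minimal Python sketch of the same filtering logic. It is not SPSS syntax; the records, the id values and the age variable are hypothetical, and only lipid comes from the example above.

```python
# Hypothetical records; only the lipid variable comes from the example above
subjects = [
    {"id": 1, "lipid": 0, "age": 67},
    {"id": 2, "lipid": 1, "age": 72},
    {"id": 3, "lipid": 0, "age": 81},
    {"id": 4, "lipid": 1, "age": 69},
    {"id": 5, "lipid": 0, "age": 74},
]

# SPSS expression: lipid = 0
no_lipid_meds = [s for s in subjects if s["lipid"] == 0]

# SPSS expression: lipid = 0 & age >= 70
no_meds_70_plus = [s for s in subjects if s["lipid"] == 0 and s["age"] >= 70]

# SPSS expression: lipid ~= 0 | age < 70
meds_or_under_70 = [s for s in subjects if s["lipid"] != 0 or s["age"] < 70]

print([s["id"] for s in no_lipid_meds])
print([s["id"] for s in no_meds_70_plus])
print([s["id"] for s in meds_or_under_70])
```

As with filtering in SPSS, the original list is left unchanged; the selections are new lists, so no observations are deleted.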


Graphing Your Data


You can produce very fancy figures and graphs in SPSS. Producing fancy figures and graphs is
beyond the scope of this handout. Instructions on producing figures and graphs can be found in
SPSS Help under Topics > Contents > Chart Galleries, Standard Charts, and Chart Editor, as
well as in the SPSS Tutorials under Creating and Editing Charts. The commands for making
charts are located under Graphs (and then Legacy Dialogs, if using Version 15) on the menu bar,
and the commands for making simple figures and graphs are relatively easy to use and some
instruction is given below. The Interactive option under Graphs is another way to produce charts
in SPSS interactively, as well as fancier versions of the basic charts (e.g., 3-dimensional bar
charts).
Bar Charts
The easiest way to produce simple bar charts is to use the Bar Chart option with the
Frequencies... command. See Frequency Tables (& Bar Charts) for Categorical Variables. You
can produce only one bar chart at a time using the Bar command.
1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Bar...
3. Choose Simple, Clustered, or Stacked
4. Choose what the data in the bar chart represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to the
Category axis.
7. Choose what the bars represent (e.g., number of cases or percentage of cases)
8. Choose OK

[Two bar charts: Percent (0.0% to 60.0%) by Smoking status (never, former, current); one simple and one clustered by Family history of heart attack (no, yes)]

Histograms
The easiest way to produce simple histograms is to use the Histogram option with the
Frequencies... command. See Descriptive Statistics (& Histograms) for Numerical Variables.
You can produce only one histogram at a time using the Histogram command.
1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar
2. Choose Histogram...
3. Select a variable from the variable list on the left and then click on the arrow in the
middle of the window.
4. Choose Display normal Curve if you want a normal curve superimposed on the histogram.
5. Choose OK

[Histogram: Frequency (0 to 120) by Body mass index (10 to 50); Mean = 26.2366, Std. Dev. = 4.8667, N = 1,000]

Boxplots
The easiest way to produce simple boxplots is to use the Boxplot option with the Explore...
command. See Descriptive Statistics (& Boxplots) By Groups for Numerical Variables.
You can produce only one boxplot at a time using the Boxplot command.
1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Boxplot...
3. Choose Simple or Clustered
4. Choose what the data in the boxplots represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to
the Variable box.
7. Select the variable from the variable list that defines the groups and then click on the
arrow next to Category Axis.
8. Choose OK

[Boxplot: Serum fasting glucose by ADA diabetes status (normal, impaired fasting glucose, diabetic); outlying observations are labeled with their case numbers]

Normal Probability Plots. To produce Normal probability plots:
1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Q-Q... to get a plot of the quantiles (Q-Q plot) or choose P-P... to get a plot of the
cumulative proportions (P-P plot)
3. Select the variables from the source list on the left and then click on the arrow located in the
middle of the window.
4. Choose Normal as the Test Distribution. The Normal distribution is the default Test
Distribution. Other Test Distributions can be selected by clicking on the down arrow and
clicking on the desired Test distribution.
5. Choose OK
SPSS will produce both a Normal probability plot and a detrended Normal probability plot for
each selected variable. Usually the Q-Q plot is the most useful for assessing if the distribution of
the variable is approximately Normal.
[Two plots: Normal Q-Q Plot of Serum fasting glucose and Normal Q-Q Plot of Body mass index; Expected Normal Value versus Observed Value]

Error Bar Plot. To produce an error bar plot of the mean of a numerical variable (or the means
for different groups of subjects):
1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Error Bar...
3. Choose Simple or Clustered
4. Choose what the data in the error bars represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to the
Variable box.
7. Select the variable from the variable list that defines the groups and then click on the arrow
next to Category Axis.
8. Select what the bars represent (e.g., confidence interval, standard deviation, standard error
of the mean)
9. Choose OK

Error Bar Plot
[Error bar plot: Mean +- 2 SD of Serum fasting glucose by ADA diabetes status (normal, impaired fasting glucose, diabetic)]

A bar chart of the mean with error bars can be made using the commands for making a bar chart.
[Bar chart: Mean Serum fasting glucose by ADA diabetes status (normal, impaired fasting glucose, diabetic); Error bars: +/- 2 SD]

Scatter Plot. To produce a scatter plot between two numerical variables:
1. Choose Graphs (& then Legacy Dialogs, if Version 15) on the menu bar.
2. Choose Scatter/Dot...
3. Choose Simple
4. Choose Define
5. Y Axis: Select the y variable you want from the source list on the left and then click on
the arrow next to the y axis box.
6. X Axis: Select the x variable you want from the source list on the left and then click on
the arrow next to the x axis box.
7. Choose Titles...
8. Enter a title for the plot (e.g., y vs. x).
9. Choose Continue
10. Choose OK

[Scatter plot titled "HDL cholesterol vs BMI": HDL cholesterol (0 to 140) versus Body mass index (10 to 50)]

Adding a linear regression line to a scatter plot. To add a linear regression (least-squares) line
to a scatter plot of two numerical variables:
1. While in the Viewer window double click on the scatter plot. The scatter plot should now
be displayed in a window titled Chart Editor.
2. Choose Elements.
3. Choose Fit Line at Total. (A line should be added to the plot, because the next 2 steps
are the default options.)
4. Choose Linear (in the Properties window)
5. Choose Apply (in the Properties window).
Additional options:
o Choose Mean under Confidence Intervals (in the Properties window) to add a confidence
interval for the linear regression line to the scatter plot or
o Choose Individual under Confidence Intervals to add a prediction interval for individual
observations to the scatter plot.
6. Click on the "X" in the upper right hand corner of the Chart Editor window or choose File,
and then Close to return to the Viewer window.

[Scatter plot titled "HDL cholesterol vs BMI" with fitted regression line: HDL cholesterol (0 to 140) versus Body mass index (10 to 50); R Sq Linear = 0.121]


Adding a Loess (scatter plot) smooth to a scatter plot. To add a Loess smooth to a scatter plot
of two numerical variables:
1. While in the Viewer window double click on the scatter plot. The scatter plot should now
be displayed in a window titled Chart Editor.
2. Choose Elements.
3. Choose Fit Line at Total.
4. Choose Loess (in the Properties window). Default options for % of points to fit (50%) and
kernel (Epanechnikov) are usually the most appropriate options.
5. Choose Apply (in the Properties window). If a line was added to the plot in Step 3, it will
be replaced by the loess smooth.
6. Click on the "X" in the upper right hand corner of the Chart Editor window or choose File,
and then Close to return to the Viewer window.

[Scatter plot titled "HDL cholesterol vs BMI" with loess smooth: HDL cholesterol (0 to 140) versus Body mass index (10 to 50)]
Stem-and-leaf Plot. To produce a stem-and-leaf plot:
1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Explore...
4. Dependent List: To select the variables you want from the source list on the left,
highlight a variable by pointing and clicking the mouse and then click on the arrow
located next to the dependent list box. Repeat the process until you have selected
all the variables you want.
5. Choose Plots...
6. Choose Stem-and-leaf from the Descriptive box. Note the option may already be
selected if the little box is not empty.
7. Choose None from the Boxplot box
8. Choose Continue
9. Choose Plots for the Display option
10. Choose OK

Severity of Illness Index Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        4 .  34
     7.00        4 .  6688899
    10.00        5 .  0001112344
     3.00        5 .  568
     1.00 Extremes    (>=62)

 Stem width:   10.00
 Each leaf:    1 case(s)


Hypothesis Tests & Confidence Intervals


One-Sample t Test
1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose One-Sample T Test...
4. Test Variable(s): Select the variable you want from the source list on the left, highlight
variables by pointing and clicking the mouse and then click on the arrow located in the
middle of the window.
5. Edit the Test Value. The Test Value is the value of the mean under the null hypothesis. The
default value is zero.
6. Choose OK

Confidence Interval for a Mean (from one sample of data)


1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose One-Sample T Test...
4. Test Variable(s): Select the variable you want from the source list on the left, highlight
variables by pointing and clicking the mouse and then click on the arrow located in the
middle of the window.
5. The Test Value should be 0, which is the default value.
6. By default a 95% confidence interval will be computed. Choose Options to change the
confidence level.
7. Choose OK
SIDS Example. There were 48 SIDS cases in King County, Washington, during the years
1974 and 1975. The birth weights (in grams) of these 48 cases were:
2466  3317  2013  2750  2722  3005  2013  2722
3941  3742  3515  2807  2495  2608  2551  2863
2807  3062  3260  2807  3459  2353  2977  2013
3118  3033  2892  3005  3374  4394  3118  3232
2098  2353  1616  3374  1984  3232  2637  2863
3175  3515  4423  3572  2495  3062  1503  2438

The mean (and standard deviation) of these measurements is 2891 (623) grams.

We want to know if the mean birth weight in the population of SIDS infants is different
from that of normal children, 3300 grams. We could construct a 95% confidence interval
to see if the interval contains the value of 3300 grams, or we could perform a one sample t
test to test if the mean in the SIDS population is equal to 3300 (versus not equal to 3300).
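The arithmetic behind both approaches can be checked by hand. The following is a minimal Python sketch using the 48 birth weights listed above; it is not how SPSS is run. The function name one_sample_t is hypothetical, and the critical value 2.0117 (two-sided 95%, 47 degrees of freedom) is an assumption copied from a t table, because the Python standard library has no t distribution.

```python
import math
import statistics

# Birth weights (grams) of the 48 SIDS cases listed above
birth_weights = [
    2466, 3317, 2013, 2750, 2722, 3005, 2013, 2722,
    3941, 3742, 3515, 2807, 2495, 2608, 2551, 2863,
    2807, 3062, 3260, 2807, 3459, 2353, 2977, 2013,
    3118, 3033, 2892, 3005, 3374, 4394, 3118, 3232,
    2098, 2353, 1616, 3374, 1984, 3232, 2637, 2863,
    3175, 3515, 4423, 3572, 2495, 3062, 1503, 2438,
]

# Assumption: two-sided 95% critical value of t with 47 df, taken from a t table
T_CRIT_95_DF47 = 2.0117


def one_sample_t(data, test_value):
    """Return (t statistic, degrees of freedom, mean difference, standard error)."""
    n = len(data)
    std_error = statistics.stdev(data) / math.sqrt(n)
    mean_diff = statistics.mean(data) - test_value
    return mean_diff / std_error, n - 1, mean_diff, std_error


t_stat, df, mean_diff, std_error = one_sample_t(birth_weights, 3300)

mean = statistics.mean(birth_weights)
ci_low = mean - T_CRIT_95_DF47 * std_error
ci_high = mean + T_CRIT_95_DF47 * std_error

print(f"mean = {mean:.3f}, standard error = {std_error:.3f}")
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
print(f"95% CI for the mean: {ci_low:.0f} to {ci_high:.0f} grams")
```

The interval depends only on the sample mean and standard error, which is why the Test Value does not matter when only the confidence interval for the mean is wanted.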


To construct a 95% confidence interval

When computing the interval for a mean, make sure the Test Value is 0.

One-Sample Statistics
                  N     Mean         Std. Deviation    Std. Error Mean
birth weight      48    2891.1250    623.39177         89.97885

Number of subjects, mean, standard deviation, and standard error of the mean.

One-Sample Test
Test Value = 0
                                                                       95% Confidence Interval
                                                                       of the Difference
                  t         df    Sig. (2-tailed)    Mean Difference   Lower        Upper
birth weight      32.131    47    .000               2891.12500        2710.1109    3072.1391

Ignore the t test results (t, df, sig.) because these results are for testing if the
mean birth weight is equal to 0 (versus not equal to zero).

95% confidence interval for the mean birth weight is 2710 to 3072 grams.


To perform a one sample t test to test if the mean in the SIDS population is equal
to 3300 versus not equal to 3300.

To run the one-sample t test to test if the mean birth weight is equal to 3300 you
need to change the Test Value from the default value of 0 to 3300.

One-Sample Statistics
                  N     Mean         Std. Deviation    Std. Error Mean
birth weight      48    2891.1250    623.39177         89.97885

One-Sample Test
Test Value = 3300
                                                                       95% Confidence Interval
                                                                       of the Difference
                  t         df    Sig. (2-tailed)    Mean Difference   Lower        Upper
birth weight      -4.544    47    .000               -408.87500        -589.8891    -227.8609

Sig. (2-tailed) = two tailed p-value = <.001
t = test statistic value = -4.544
df = degrees of freedom = 47

Ignore the results for 95% confidence interval of the difference, because it is the
confidence interval for the mean minus 3300.


Paired t Test
1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose Paired-Samples T Test...
4. Paired Variable(s): Select two paired variables you want from the source list on the left,
highlight both variables by pointing and clicking the mouse and then click on the arrow
located in the middle of the window. Repeat the process until you have selected all the
paired variables you want to test.
5. Choose OK

Confidence Interval for the Difference Between Means from Paired Sample
By default a 95% confidence interval for the difference between the means of the paired
samples will be computed when performing a paired t test. Choose Options to change the
confidence level.

Prozac Example. To assess the effect of Prozac on anxiety, 10 subjects were given one
week of treatment with Prozac and one week of treatment with a placebo. The order of
the treatments was randomized for each subject. An anxiety questionnaire was used to
measure a subject's anxiety on a scale of 0 to 30. Higher scores indicate more anxiety.
Subject    Placebo    Prozac    Difference
   1          22        19           3
   2          18        11           7
   3          17        14           3
   4          19        17           2
   5          22        23          -1
   6          12        11           1
   7          14        15          -1
   8          11        19          -8
   9          19        11           8
  10           7         8          -1

Mean difference, d = 1.3
Standard deviation, sd = 4.5
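A paired t test is a one-sample t test on the differences. The following minimal Python sketch reproduces the calculation for the Prozac data above; it is an illustration, not SPSS syntax. The function name paired_t is hypothetical, and the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is an assumption copied from a t table.

```python
import math
import statistics

# Anxiety scores for the 10 subjects in the Prozac example
placebo = [22, 18, 17, 19, 22, 12, 14, 11, 19, 7]
prozac = [19, 11, 14, 17, 23, 11, 15, 19, 11, 8]

# Assumption: two-sided 95% critical value of t with 9 df, taken from a t table
T_CRIT_95_DF9 = 2.262


def paired_t(first, second):
    """Return (mean difference, sd of differences, standard error, t, df).

    Differences are computed as first - second.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    std_error = sd_diff / math.sqrt(n)
    return mean_diff, sd_diff, std_error, mean_diff / std_error, n - 1


mean_diff, sd_diff, std_error, t_stat, df = paired_t(placebo, prozac)
ci_low = mean_diff - T_CRIT_95_DF9 * std_error
ci_high = mean_diff + T_CRIT_95_DF9 * std_error

print(f"mean difference = {mean_diff:.1f}, sd = {sd_diff:.1f}")
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
print(f"95% CI for the mean difference: {ci_low:.1f} to {ci_high:.1f}")
```

Swapping the order of the two arguments flips the sign of the mean difference, the t statistic and the interval, but not the p-value.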


Paired t test and confidence interval for the difference between paired means.

The order of the variables in calculating the difference is determined by the order of the
variables in the data set (and not the order in which you select the variables).

Paired Samples Statistics
                      Mean       N     Std. Deviation    Std. Error Mean
Pair 1    placebo     16.1000    10    4.95424           1.56667
          prozac      14.8000    10    4.68568           1.48174

Summaries for each sample of data (or variable).

Paired Samples Correlations
                               N     Correlation    Sig.
Pair 1    placebo & prozac     10    .556           .095

Correlation between the paired values - usually not useful.

Paired Samples Test
                                Paired Differences
                                                              95% Confidence Interval
                                          Std.     Std. Error    of the Difference                Sig.
                                Mean      Deviation   Mean       Lower       Upper      t      df   (2-tailed)
Pair 1    placebo - prozac      1.30000   4.54728     1.43798    -1.95293    4.55293    .904   9    .390

difference = placebo - prozac
mean difference = 1.3
standard deviation of the differences = 4.5
standard error of the differences = 1.4

95% confidence interval for the mean difference is -1.9 to 4.6

Paired t test
Sig. (2 tailed) = two-sided p-value = 0.39
t = test statistic value = .904
df = degrees of freedom = 9


Two-Sample t Test
1. Choose Analyze on the menu bar.
2. Choose Compare Means
3. Choose Independent-Samples T Test...
4. Test Variable(s): Select the test variable you want from the source list on the left and then
click on the arrow located next to the test variable box. Repeat the process until you have
selected all the variables you want.
5. Grouping Variable: Select the variable which defines the groups and then click on the arrow
located next to the grouping variable box.
6. Choose Define Groups...
7. Click on blank box next to Group 1, then enter the code value (numeric or
character/string) for group 1.
8. Click on blank box next to Group 2, then enter the code value (numeric or
character/string) for group 2.
9. Choose Continue
10. Choose OK

Confidence Interval for the Difference Between Means from Independent Samples
By default a 95% confidence interval for the difference between the means of two independent
samples will be computed when performing a two sample t test. Choose Options to change the
confidence level.
Model Cities Example. Two groups of people were studied - those who had been randomly
allocated to a Fee-For-Service medical insurance group and those who had been randomly
allocated to a Prepaid insurance group.
We would like to compare the two groups on the quality of health care they received in
each group, but first we would like to know how comparable the groups are on other
characteristics that might affect medical outcome. For example, we would like to know if
the mean age in the two groups is similar. Hopefully, the process of random allocation
minimizes this possibility, but there is always a chance that it didn't.

Group                      N       Mean    Standard deviation
Prepaid (GHC)              1167    24.0    15.3
Fee-for-service (KCM)      3207    26.4    17.1

We could compare the average age between the two groups using a two sample t test or a
confidence interval for the difference between the average ages of the two groups.
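Both versions of the two sample t test can be computed from summary statistics alone. The sketch below is a minimal Python illustration under that assumption; the function names pooled_t and welch_t are hypothetical, and the input values are the group sizes, means and standard deviations reported by SPSS for this example (23.9846 and 15.30787 for GHC, 26.3676 and 17.10260 for KCM).

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two sample t test assuming equal variances. Returns (t, df, std error)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df, std_error


def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two sample t test not assuming equal variances (Welch-Satterthwaite df)."""
    v1 = sd1**2 / n1
    v2 = sd2**2 / n2
    std_error = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (mean1 - mean2) / std_error, df, std_error


# Age by insurance group: GHC (prepaid) first, KCM (fee-for-service) second
ghc = (1167, 23.9846, 15.30787)
kcm = (3207, 26.3676, 17.10260)

t_equal, df_equal, se_equal = pooled_t(*ghc, *kcm)
t_unequal, df_unequal, se_unequal = welch_t(*ghc, *kcm)

print(f"equal variances:   t = {t_equal:.3f}, df = {df_equal}, SE = {se_equal:.5f}")
print(f"unequal variances: t = {t_unequal:.3f}, df = {df_unequal:.1f}, SE = {se_unequal:.5f}")
```

With groups this large the two versions give nearly the same answer; they differ more when group sizes and variances are both very unequal.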


Two sample t test and 95% confidence interval for the difference between means
(from independent samples).

After you select the Grouping Variable, SPSS will put in question marks to prompt
you to define the code values for the two groups. Select Define Groups to enter
the code values.

In this example the group codes are numeric, 0 (for GHC) and 1 (for KCM).

T-Test
Group Statistics
        prov    N       Mean       Std. Deviation    Std. Error Mean
age     GHC     1167    23.9846    15.30787          .44810
        KCM     3207    26.3676    17.10260          .30200

Summaries for each sample/group.

Independent Samples Test
                                        Levene's Test for
                                        Equality of Variances
                                        F         Sig.
age     Equal variances assumed         47.068    .000
        Equal variances not assumed

SPSS by default tests if the variances are equal using Levene's test. A small p-value
(sig.) indicates the variances may be different.
sig. = p-value = <.001
F = test statistic value = 47.1
Independent Samples Test
                                        t-test for Equality of Means
                                                                            Mean          Std. Error
                                        t         df          Sig. (2-tailed)  Difference    Difference
age     Equal variances assumed         -4.188    4372        .000             -2.38306      .56896
        Equal variances not assumed     -4.410    2293.698    .000             -2.38306      .54037

Two Sample t test. SPSS by default always performs both versions of the two
sample t test, assuming equal variances and assuming unequal variances.
Sig. (2 tailed) = two sided p-value = <.001 (equal var.), <.001 (unequal var.)
t = test statistic value = -4.2 (equal var.), -4.4 (unequal var.)
df = degrees of freedom = 4372 (equal var.), 2294 (unequal var.)
mean difference = difference between means = -2.4 (equal and unequal var.)
std. error difference = standard error of the difference between means = .6 (equal
var.), .5 (unequal var.)

Independent Samples Test
                                        95% Confidence Interval
                                        of the Difference
                                        Lower       Upper
age     Equal variances assumed         -3.49851    -1.26760
        Equal variances not assumed     -3.44273    -1.32338

95% confidence interval for the difference between means is
-3.5 to -1.3 (assuming equal variances)
-3.4 to -1.3 (assuming unequal variances)


Sign Test and Wilcoxon Signed-Rank Test


1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Related Samples...
4. Test Pair(s) List: Select two paired variables you want from the source list on the left hand
side, highlight both variables by pointing and clicking the mouse and then click on the arrow
located in the middle of the window. Repeat the process until you have selected all the
paired variables you want to test.
5. Choose Sign as the Test Type.
or
6. Choose Wilcoxon as the Test Type.
7. Choose OK

Aspirin Example. To compare 2 types of Aspirin, A and B, 1 hour urine samples were
collected from 10 people after each had taken either A or B. A week later the same
routine was followed after giving the other type to the same 10 people.
Person                   Type A    Type B    Difference
   1                       15        13          2
   2                       26        20          6
   3                       13        10          3
   4                       28        21          7
   5                       17        17          0
   6                       20        22         -2
   7                        7         5          2
   8                       36        30          6
   9                       12         7          5
  10                       18        11          7

Mean                       19.2      15.6        3.6 = d
Standard deviation         8.63      7.78        3.098 = sd

A Sign test or Wilcoxon Signed Rank test could be used to compare the two types of
Aspirin.
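The sign test uses only the signs of the paired differences, and its exact p-value comes from the Binomial distribution with success probability 0.5. Here is a minimal Python sketch for the Aspirin data; it is an illustration, not SPSS syntax, and the function name sign_test is hypothetical. Differences are taken as Type B minus Type A, and tied pairs are dropped.

```python
from math import comb

# Measurements for the 10 people in the Aspirin example
aspirin_a = [15, 26, 13, 28, 17, 20, 7, 36, 12, 18]
aspirin_b = [13, 20, 10, 21, 17, 22, 5, 30, 7, 11]


def sign_test(first, second):
    """Exact two-sided sign test on the differences second - first.

    Returns (negative count, positive count, tie count, p-value).
    """
    diffs = [b - a for a, b in zip(first, second)]
    n_neg = sum(1 for d in diffs if d < 0)
    n_pos = sum(1 for d in diffs if d > 0)
    n_ties = sum(1 for d in diffs if d == 0)
    n = n_neg + n_pos          # ties are excluded from the test
    k = min(n_neg, n_pos)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return n_neg, n_pos, n_ties, min(1.0, 2 * tail)


n_neg, n_pos, n_ties, p_value = sign_test(aspirin_a, aspirin_b)
print(f"negative = {n_neg}, positive = {n_pos}, ties = {n_ties}")
print(f"exact two-sided p-value = {p_value:.3f}")
```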


The order of the variables in calculating the difference is determined by the order of the
variables in the data set (and not the order in which you select the variables).

Select Wilcoxon or Sign (or both).

Under Options you can select summaries: Descriptive (n, mean, etc.) and Quartiles
(median, 25th and 75th percentile).

Descriptive Statistics
                                                                     Percentiles
             N     Mean       Std. Deviation    Minimum    Maximum    25th       50th (Median)    75th
aspirina     10    19.2000    8.62554           7.00       36.00      12.7500    17.5000          26.5000
aspirinb     10    15.6000    7.77746           5.00       30.00      9.2500     15.0000          21.2500

Sign Test
Frequencies
                                                   N
aspirinb - aspirina    Negative Differences(a)     8
                       Positive Differences(b)     1
                       Ties(c)                     1
                       Total                       10
a aspirinb < aspirina
b aspirinb > aspirina
c aspirinb = aspirina

Test Statistics(b)
                          aspirinb - aspirina
Exact Sig. (2-tailed)     .039(a)
a Binomial distribution used.
b Sign Test

Sign Test
Exact sig. (2-tailed) = exact, two-sided p-value = 0.039
The p-value is exact because it is computed using the Binomial distribution instead of
a Normal approximation.


Wilcoxon Signed Ranks Test
Ranks
                                       N      Mean Rank   Sum of Ranks
aspirinb - aspirina   Negative Ranks   8(a)   5.38        43.00
                      Positive Ranks   1(b)   2.00         2.00
                      Ties             1(c)
                      Total            10
a aspirinb < aspirina
b aspirinb > aspirina
c aspirinb = aspirina

Information used in the test statistic not usually reported; use the previous descriptives.

Test Statistics(b)
                         aspirinb - aspirina
Z                        -2.442(a)
Asymp. Sig. (2-tailed)   .015
a Based on positive ranks.
b Wilcoxon Signed Ranks Test

Wilcoxon Signed Rank Test
Asymp. Sig. (2-tailed) = two sided p-value = 0.015
Asymp. is an abbreviation for asymptotic, which
means the p-value is computed using a large sample
approximation based on the Normal distribution.
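To see where the rank sums, Z and the asymptotic p-value come from, the calculation can be
reproduced outside SPSS. This is a sketch in Python using the standard tie-corrected Normal
approximation for the signed rank statistic; it is assumed (not confirmed by the handout) that
SPSS uses the same formula, and the function name is hypothetical.

```python
from math import erfc, sqrt

type_a = [15, 26, 13, 28, 17, 20, 7, 36, 12, 18]
type_b = [13, 20, 10, 21, 17, 22, 5, 30, 7, 11]

def wilcoxon_signed_rank(x, y):
    """Rank sums, Z and two-sided asymptotic p-value for paired data."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # zero differences are dropped
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    rank_of, tie_term, i = {}, 0.0, 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # midrank of positions i+1..j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w_pos = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_neg = sum(rank_of[abs(d)] for d in diffs if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_pos - mean) / sqrt(var)  # based on positive ranks, as in the output footnote
    p = erfc(abs(z) / sqrt(2))      # two-sided Normal tail probability
    return w_neg, w_pos, z, p

w_neg, w_pos, z_stat, p_value = wilcoxon_signed_rank(type_a, type_b)
print(w_neg, w_pos, round(z_stat, 3), round(p_value, 3))
```

The differences are taken as aspirinb - aspirina to match the SPSS output, so the sum of negative
ranks is 43 and the sum of positive ranks is 2.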


Mann-Whitney U Test (or Wilcoxon Rank Sum Test)


1. Choose Analyze on the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Independent Samples...
4. Test Variable(s): Select the test variable you want from the source list on the left and then
click on the arrow located next to the test variable box. Repeat the process until you have
selected all the variables you want.
5. Grouping Variable: Select the variable which defines the grouping and then click on the
arrow located next to the grouping variable box. The grouping variable must be numeric for
the variable to appear on the left hand side.
6. Choose Define Groups...
7. Click on the blank box next to group 1, then enter the code value (it must be numeric) for
group 1.
8. Click on the blank box next to group 2, then enter the code value (it must be numeric) for
group 2.
9. Choose Continue to return to Two Independent Samples dialog box.
10. Choose Mann-Whitney U as the Test Type. Note that the option may already be selected if
the little box is not empty.
11. Choose OK
Legionnaires Example. During July and August, 1976, a large number of Legionnaires
attending a convention died of a mysterious and unknown cause. Chen et al. (1977) examined
the hypothesis of nickel contamination as a toxin. They examined the nickel levels in the
lungs of nine cases and nine controls. There was no attempt to match cases and controls.
The data are as follows (µg/100g dry weight):
Legionnaire cases   65 24 52 86 120 82 399 87 139
Controls            12 10 31 6 5 5 29 9 12
The Mann Whitney U test could be used to compare the two groups.
After you select the Grouping Variable,
SPSS will put in question marks to
prompt you to define the code values
for the two groups. Select Define
Groups to enter the code values.
Note: The codes must be numeric,
otherwise the grouping variable will not
appear on the left hand side.


In this example the group codes are 1 for legionnaires and 2 for controls.

Mann-Whitney Test
Ranks
         group   N    Mean Rank   Sum of Ranks
nickel   1       9    13.78       124.00
         2       9     5.22        47.00
         Total   18

Information used in the test statistic not usually reported. The descriptives under Options
are not useful; you can produce relevant descriptives (e.g. median and interquartile range for
each group) using the Explore command.

Test Statistics(b)
                                 nickel
Mann-Whitney U                   2.000
Wilcoxon W                       47.000
Z                                -3.403
Asymp. Sig. (2-tailed)           .001
Exact Sig. [2*(1-tailed Sig.)]   .000(a)
a Not corrected for ties.
b Grouping Variable: group

Mann Whitney test
Asymp. Sig. (2-tailed) = two-sided p-value = 0.001
This p-value is computed using a large sample approximation based on the Normal
distribution and it corrects for ties in the data, if present.
Exact Sig. [2*(1-tailed Sig.)] = two-sided p-value = <.001
This p-value is an exact p-value, but it does not correct for ties in the data, if present.
In this example, given the small sample sizes and few ties in the data, the exact p-value
would be appropriate to report.
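The rank sums, U, Z and the asymptotic p-value in the output can be reproduced from the raw
nickel data. The sketch below is Python rather than SPSS; it uses the usual tie-corrected Normal
approximation for U, which is assumed to match what SPSS does, and the helper names are
illustrative only.

```python
from collections import Counter
from math import erfc, sqrt

cases = [65, 24, 52, 86, 120, 82, 399, 87, 139]
controls = [12, 10, 31, 6, 5, 5, 29, 9, 12]

def midranks(values):
    """Map each distinct value to its average rank in the pooled sample."""
    order = sorted(values)
    ranks, i = {}, 0
    while i < len(order):
        j = i
        while j < len(order) and order[j] == order[i]:
            j += 1
        ranks[order[i]] = (i + 1 + j) / 2
        i = j
    return ranks

def mann_whitney(x, y):
    pooled = x + y
    ranks = midranks(pooled)
    n1, n2, n = len(x), len(y), len(pooled)
    r1 = sum(ranks[v] for v in x)
    r2 = sum(ranks[v] for v in y)
    u = min(r1 - n1 * (n1 + 1) / 2, r2 - n2 * (n2 + 1) / 2)
    # variance of U with the correction for tied values
    tie_term = sum(c ** 3 - c for c in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))  # two-sided Normal tail probability
    return r1, r2, u, z, p

rank_sum_cases, rank_sum_controls, u_stat, z_stat, p_asymp = mann_whitney(cases, controls)
print(rank_sum_cases, rank_sum_controls, u_stat, round(z_stat, 3))
```

The smaller rank sum (47, for the controls) is what SPSS labels Wilcoxon W, and U = 47 - 9(10)/2 = 2.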


One-way ANOVA (Analysis of Variance) (E.g., to compare two or more means
from two or more independent samples)
1. Choose Analyze on the menu bar
2. Choose Compare Means
3. Choose One-Way ANOVA...
4. Dependent: Select from the source list on the left the variable you want to use to
compare the groups and then click on the arrow next to the dependent variable box. You can run
multiple one-way ANOVAs by selecting more than one dependent variable.
5. Factor: Select the variable from the source list on the left which defines the groups.
6. Choose OK
To perform pairwise comparisons to determine which groups are different while controlling for
multiple testing use the Post Hoc... option. There are many methods to choose from (e.g.,
Bonferroni and R-E-G-W-Q).
Other useful options can be found under Options... For example, choose Descriptive to get
descriptive statistics for each group (e.g., mean, standard deviation, minimum value, and
maximum value). Choose Homogeneity-of-variance to perform the Levene Test to test if the
group variances are all equal versus not all equal. A small p-value for the Levene's Test may
indicate that the variances are not all equal.
CHD Example. We can use one-way ANOVA to compare HDL levels between subjects with
different hypertensive status (0=normotensive, 1=borderline, 2=definite)
Hypertensive Group   n      Mean   Standard Deviation
Normotensive         1568   55.8   15.5
Borderline            547   55.7   16.2
Definite             1310   53.5   15.2

You can select 1 or more variables to compare between groups.

The variable selected as the Factor defines the groups. The variable can be
numeric or character/string.


Oneway
ANOVA
HDL cholesterol
                 Sum of Squares   df     Mean Square   F       Sig.
Between Groups     4344.834       2      2172.417      9.045   .000
Within Groups    821904.577       3422    240.183
Total            826249.411       3424

One-way analysis of variance
Sig. = p-value = <.001
F = test statistic = 9.0; df = degrees of freedom
Sometimes the test statistic and its degrees of freedom are reported along with the p-value;
in this example, F=9.0 with degrees of freedom 2 and 3422. Sum of squares and mean square are
used to compute the test statistic; they are usually not reported.
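The relationship between the columns of the ANOVA table is simple arithmetic: each mean square is
a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. A
small Python sketch using the numbers from the table above (variable names are illustrative):

```python
# Values copied from the SPSS ANOVA table for HDL cholesterol
ss_between, df_between = 4344.834, 2
ss_within, df_within = 821904.577, 3422

ms_between = ss_between / df_between  # Mean Square, Between Groups
ms_within = ss_within / df_within     # Mean Square, Within Groups
f_stat = ms_between / ms_within       # F test statistic

print(round(ms_between, 3), round(ms_within, 3), round(f_stat, 3))
# 2172.417 240.183 9.045
```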

Descriptives

Under Options you can request Descriptives for each group to be computed. This information
can be used to describe the differences between the groups.

HDL cholesterol
                                                            95% Confidence Interval for Mean
               N      Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
normotensive   1568   55.82   15.500           .391         55.05         56.59         21        138
borderline      547   55.67   16.202           .693         54.30         57.03         24        149
definite       1310   53.47   15.192           .420         52.64         54.29         15        129
Total          3425   54.90   15.534           .265         54.38         55.42         15        149


Post Hoc Tests

Under Post Hoc you can request further comparisons be done between each of the
possible pairs of groups to determine which groups are different from each other. These
are multiple comparison procedures, which control for the number of tests/comparisons
being performed. There are many methods to choose from; below is an example of the
Bonferroni method and the Ryan-Einot-Gabriel-Welsch method.
Multiple Comparisons
Dependent Variable: HDL cholesterol
Bonferroni
(I) Hypertension   (J) Hypertension   Mean Difference                       95% Confidence Interval
status             status             (I-J)             Std. Error   Sig.    Lower Bound   Upper Bound
normotensive       borderline           .157            .770         1.000   -1.69          2.00
                   definite            2.356(*)         .580          .000     .97          3.74
borderline         normotensive        -.157            .770         1.000   -2.00          1.69
                   definite            2.198(*)         .789          .016     .31          4.09
definite           normotensive       -2.356(*)         .580          .000   -3.74          -.97
                   borderline         -2.198(*)         .789          .016   -4.09          -.31
* The mean difference is significant at the .05 level.

The Bonferroni method is a method that shows all pairwise comparisons/differences along
with a p-value (sig.) adjusted for the number of comparisons. In this example, subjects
with normal blood pressure and borderline hypertension have similar HDL cholesterol
levels, but subjects with definite hypertension have different HDL cholesterol levels than
both subjects with normal blood pressure and borderline hypertension.
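The Bonferroni adjustment itself is just the unadjusted p-value multiplied by the number of
comparisons (capped at 1). The Python sketch below illustrates this with the mean differences and
standard errors from the Multiple Comparisons table. As a simplifying assumption it uses the
Normal distribution for the unadjusted p-values; SPSS uses the t distribution, but with 3422
degrees of freedom the two are nearly identical.

```python
from math import erfc, sqrt

# (mean difference, standard error) copied from the Multiple Comparisons table
comparisons = {
    ("normotensive", "borderline"): (0.157, 0.770),
    ("normotensive", "definite"): (2.356, 0.580),
    ("borderline", "definite"): (2.198, 0.789),
}
n_tests = len(comparisons)

def bonferroni_p(diff, se, m):
    """Two-sided p-value (Normal approximation) multiplied by m, capped at 1."""
    raw = erfc(abs(diff / se) / sqrt(2))
    return min(1.0, raw * m)

adjusted = {pair: bonferroni_p(d, se, n_tests) for pair, (d, se) in comparisons.items()}
for pair, p in adjusted.items():
    print(pair, round(p, 3))
```

Under these assumptions the adjusted p-values round to 1.000, .000 and .016, in line with the
Sig. column above.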
Homogeneous Subsets
HDL cholesterol
Ryan-Einot-Gabriel-Welsch Range
                             Subset for alpha = .05
Hypertension status   N      1        2
definite              1310   53.47
borderline             547            55.67
normotensive          1568            55.82
Sig.                         1.000    .867
Means for groups in homogeneous subsets are displayed.

The Ryan-Einot-Gabriel-Welsch (R-E-G-W-Q) method places groups that are similar in the same
subset and groups that are different in different subsets. In this example, subjects with
normal blood pressure and borderline hypertension are in one subset and subjects with definite
hypertension are in a different subset. Hence, subjects with definite hypertension have
different HDL cholesterol levels than subjects with normal blood pressure and borderline
hypertension, but subjects with normal blood pressure and borderline hypertension have similar
HDL cholesterol levels.


Kruskal-Wallis Test
1. Choose Analyze on the menu bar.
2. Choose Nonparametric Tests
3. Choose K Independent Samples...
4. Test Variable(s): Select the test variable you want from the source list on the left and then
click on the arrow located next to the test variable box. Repeat the process until you have
selected all the variables you want to test.
5. Grouping Variable: Select the variable which defines the grouping and then click on the
arrow located next to the grouping variable box.
6. Choose Define Range...
7. Click on the blank box next to Minimum, then enter the smallest numeric code value for
the groups.
8. Click on the blank box next to Maximum, then enter the largest numeric code value for the
groups.
9. Choose Continue
10. Choose Kruskal-Wallis H as the Test Type. Note that the option may already be selected if
the little box is not empty.
11. Choose OK
CAUTION: The group variable must be numeric and you must correctly enter the smallest
numeric code value and the largest numeric code value. SPSS will allow you to select a
character/string variable as the grouping variable, as well as allow you to incorrectly enter the
numeric code values. The results displayed for the Kruskal Wallis test in these cases will be
incorrect, but no error or warning message will be displayed.
CHD Example. We can use the Kruskal-Wallis test to compare serum insulin levels between
subjects with different hypertensive status (0=normotensive, 1=borderline, 2=definite)
Hypertensive Group   n      Median   IQR*
Normotensive         1568   12       9, 15
Borderline            547   12       9, 17
Definite             1310   14       11, 20

*IQR, interquartile range = 25th percentile, 75th percentile


Kruskal Wallis test

You can select 1 or more variables to compare between groups.
The variable selected as the Grouping Variable defines the groups. THE VARIABLE SHOULD
BE NUMERIC.

In this example the smallest numeric code is 0 (for normal) and the largest
numeric code is 2 (for definite).

Kruskal-Wallis Test
Ranks
                Hypertension status   N      Mean Rank
Serum insulin   normotensive          1568   1526.31
                borderline             547   1685.28
                definite              1310   1948.03
                Total                 3425

Information used in the test statistic not usually reported. The descriptives under Options
are not useful; you can produce relevant descriptives (e.g. median and interquartile range for
each group) using the Explore command.

Test Statistics(a,b)
              Serum insulin
Chi-Square    130.816
df            2
Asymp. Sig.   .000
a Kruskal Wallis Test
b Grouping Variable: Hypertension status

Kruskal Wallis test
Asymp. Sig. = p-value = <.001
Asymp. is an abbreviation for asymptotic, which means the p-value is computed
using a large sample approximation based on the chi-square distribution.
Chi-Square = test statistic value = 130.8
Df = degrees of freedom = 2
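The Chi-Square statistic reported for the Kruskal-Wallis test can be approximated from the group
sizes and mean ranks in the Ranks table. The Python sketch below uses the textbook formula
without the correction for ties (the raw data are not available here), so it comes out slightly
below the SPSS value of 130.816. The closed-form p-value used is valid only for 2 degrees of
freedom.

```python
from math import exp

# (N, mean rank) for each group, copied from the Ranks table
groups = {
    "normotensive": (1568, 1526.31),
    "borderline": (547, 1685.28),
    "definite": (1310, 1948.03),
}

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H from group sizes and mean ranks (no tie correction)."""
    n_total = sum(n for n, _ in groups.values())
    center = (n_total + 1) / 2  # mean rank expected if the groups do not differ
    spread = sum(n * (mean_rank - center) ** 2 for n, mean_rank in groups.values())
    return 12 * spread / (n_total * (n_total + 1))

h_stat = kruskal_wallis_h(groups)
p_value = exp(-h_stat / 2)  # chi-square upper tail probability, df = 2 only
print(round(h_stat, 1))
```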


One-Sample Binomial Test


1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose Binomial...
4. Test Variable List: Select the test variable you want from the source list on the left and then
click on the arrow located next to the test variable box. Repeat the process until you have
selected all the variables you want.
5. Test Proportion: Click on the box next to Test Proportion and enter/edit the proportion value
specified by your null hypothesis.
6. Choose OK
Example. In the TRAP study, 125 patients of the 527 patients who were negative for
lymphocytotoxic antibodies at baseline became antibody positive. The expected rate for
being antibody positive is 30%. We could use the one-sample binomial test to test if the
rate is different in the TRAP study population.

Positive is a variable coded 1 if positive and 0 if negative.

Make sure to edit the test proportion value, in this case to .30 (30%). The default is .50.

NPar Tests
Binomial Test
                     Category   N     Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
positive   Group 1   yes        125   .24              .3           .001(a,b)
           Group 2   no         402   .76
           Total                527   1.0
a Alternative hypothesis states that the proportion of cases in the first group < .3.
b Based on Z Approximation.

One-sample binomial test, two-sided p-value given by 2 x .001 = .002
(Note: SPSS reports the one-sided p-value).
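As a cross-check on the one-sided p-value, the lower tail probability can be computed both
exactly from the Binomial distribution and with a Normal (Z) approximation. The Python sketch
below assumes a continuity correction of 0.5 in the Z approximation; the handout does not say
whether SPSS applies one, so treat that detail as an assumption.

```python
from math import comb, erfc, sqrt

n, k, p0 = 527, 125, 0.30  # patients, antibody positive, null proportion

# Exact lower tail: P(X <= 125) for X ~ Binomial(527, 0.30)
exact_lower = sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k + 1))

# Normal approximation with a continuity correction (assumed)
z = (k + 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
approx_lower = 0.5 * erfc(-z / sqrt(2))  # standard Normal CDF evaluated at z

print(round(approx_lower, 3), round(2 * approx_lower, 3))
```

Doubling the one-sided value gives the two-sided p-value, as described in the note above.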


McNemar's Test
1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on
the arrow located next to the Row(s) box. Repeat the process until you have selected all the
row variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then
click on the arrow located next to the Column(s) box. Repeat the process until you have
selected all the column variables you want.
6. Choose Cells...
7. For cell values choose total under percentages.
8. Choose Continue
9. Choose Statistics...
10. Choose McNemar
11. Choose Continue
12. Choose OK
There is also another way to run McNemar's test (but the test pair variables must be numeric and
an asymptotic (Asymp.) p-value, based on a large sample approximation using the Normal
distribution, is reported instead of a p-value based on exact methods).
1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Related Samples...
4. Test Pair(s) List: Select two paired variables you want from the source list on the left,
highlight both variables by pointing and clicking the mouse and then click on the arrow
located in the middle of the window. Repeat the process until you have selected all the
paired variables you want.
5. Choose McNemar as the Test Type.
6. Choose Wilcoxon to turn off the option. Note that the option is turned off when the little box
is empty.
7. Choose OK
Example. Suppose we want to compare two different treatments for a rare form of
cancer. Since relatively few cases of this disease are seen, we want the two treatment
groups to be as comparable as possible. To accomplish this goal, we set up a matched study
such that a random member of each matched pair gets treatment A (chemotherapy),
whereas the other member gets treatment B (surgery). The patients are assigned to pairs
(621 pairs) matched on age (within 5 years), sex, and clinical condition. The patients are
followed for 5 years, with survival as the outcome variable.
The 5-year survival rate for treatment A is 17.1% (106/621) and for treatment B is 15.3%
(95/621). We could use McNemar's test to compare the survival rates of the two
treatments.


McNemar's test
It doesn't matter for McNemar's test which variable is selected for the
Row(s): or Column(s):. You can run more than one test at a time.

Under Statistics select McNemar.
Under Cells, in this example, select Total percentages.

Crosstabs
TreatmentA * TreatmentB Crosstabulation
                                     TreatmentB
                                     died     survived   Total
TreatmentA   died       Count        510      5          515
                        % of Total   82.1%    .8%        82.9%
             survived   Count        16       90         106
                        % of Total   2.6%     14.5%      17.1%
Total                   Count        526      95         621
                        % of Total   84.7%    15.3%      100.0%

Survival rate for Treatment A is 17.1%
Survival rate for Treatment B is 15.3%

Chi-Square Tests
                   Value   Exact Sig. (2-sided)
McNemar Test               .027(a)
N of Valid Cases   621
a Binomial distribution used.

McNemar's test
Exact Sig. (2-sided) = exact two-sided p-value = 0.027
The p-value is exact because it is computed using the Binomial distribution instead of using
a large sample approximation based on the Normal distribution.
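The exact McNemar p-value depends only on the discordant pairs, i.e. the pairs where one patient
survived and the other died (5 and 16 in the crosstabulation). A minimal Python sketch of the
Binomial calculation follows; the function name is made up for illustration.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant cell counts."""
    n = b + c
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely to fall in either cell
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# A died / B survived = 5 pairs, A survived / B died = 16 pairs
p_mcnemar = mcnemar_exact(5, 16)
print(round(p_mcnemar, 3))  # 0.027
```

The 510 + 90 concordant pairs do not enter the calculation.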


Chi-square Test, Fisher's Exact test and Trend test for Contingency Tables
If the Chi-square test is requested for a 2 x 2 table, SPSS will also compute the Fisher's Exact
test. If the Chi-square test is requested for a table larger than 2 x 2, SPSS will also compute the
Mantel-Haenszel test for linear or linear by linear association between the row and column
variables.
1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on
the arrow located next to the Row(s) box. Repeat the process until you have selected all the
row variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then
click on the arrow located next to the Column(s) box. Repeat the process until you have
selected all the column variables you want.
6. Choose Cells...
7. Choose the cell values (e.g., observed and expected counts; row, column, and margin (total)
percentages). Note the option is selected when the little box is not empty.
8. Choose Continue
9. Choose Statistics...
10. Choose Chi-square
11. Choose Continue
12. Choose OK
Asthma Example. An investigator studied the relationship of parental smoking habits and
the presence of asthma in the oldest child. Type A families are defined as those in which
both parents smoke and Type B families are those in which neither parent smokes. Of 100
type A families, 15 eldest children have asthma, and of 200 type B families, 6 children
have asthma. We could use a chi-square test or Fisher's exact test to test if the
proportion of first born children with asthma is different in these two types of families.

It doesn't matter for the chi-square, Fisher's Exact or trend test which variable
is selected for the Row(s): or Column(s):.
You can run more than one test at a time.


Under Statistics select Chi-square.
Under Cells, in this example, select Row percentages.

Crosstabs
familytype * asthma Crosstabulation
                                        asthma
                                        No       Yes      Total
familytype   A   Count                  85       15       100
                 % within familytype    85.0%    15.0%    100.0%
             B   Count                  194      6        200
                 % within familytype    97.0%    3.0%     100.0%
Total            Count                  279      21       300
                 % within familytype    93.0%    7.0%     100.0%

15% of first born in family type A have asthma
3% of first born in family type B have asthma

Chi-Square Tests
                                            Asymp. Sig.   Exact Sig.   Exact Sig.
                           Value       df   (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         14.747(b)   1    .000
Continuity Correction(a)   12.961      1    .000
Likelihood Ratio           13.745      1    .000
Fisher's Exact Test                                       .000         .000
N of Valid Cases           300
a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.00.

Fisher's Exact test
Exact Sig. (2-sided) = exact two-sided p-value = <.001

Chi-square test
Pearson Chi-square (without continuity correction), p-value = <.001
Pearson Chi-square with continuity correction, p-value = <.001
Asymp. Sig. (2-sided) = two-sided p-value. Asymp. is an abbreviation for asymptotic, which
means the p-value is computed using a large sample approximation based on the Normal
distribution. Check that all cells have expected cell counts 5 or greater.
Value = test statistic value
df = degrees of freedom
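For a 2 x 2 table the Pearson chi-square statistic, with or without the continuity correction,
has a simple closed form. The Python sketch below reproduces the two values in the Chi-Square
Tests table from the four cell counts; the function name is illustrative, and the p-value formula
applies only to 1 degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d, continuity=False):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]] and its p-value (df = 1)."""
    n = a + b + c + d
    delta = abs(a * d - b * c)
    if continuity:
        delta = max(0.0, delta - n / 2)  # Yates continuity correction
    stat = n * delta ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(stat / 2))  # chi-square upper tail probability for df = 1
    return stat, p

# Asthma example: type A = 85 no / 15 yes, type B = 194 no / 6 yes
pearson, p_pearson = chi_square_2x2(85, 15, 194, 6)
corrected, p_corrected = chi_square_2x2(85, 15, 194, 6, continuity=True)
print(round(pearson, 3), round(corrected, 3))
```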

Trend Test Example. A clinical trial of a drug therapy to control pain was
performed. The investigators wanted to investigate whether adverse responses to
the drug increased with larger drug doses. Subjects received either a placebo or
one of four drug doses. In this example dose is an ordinal variable, and it is
reasonable to expect that as the dose increases the rate of adverse events will
increase.
Dose      Adverse event % (n)   n
Placebo   18.8% (6)             32
500 mg    21.9% (7)             32
1000 mg   28.1% (9)             32
2000 mg   31.3% (10)            32
4000 mg   50.0% (16)            32

There are several different methods for performing a trend test with ordinal
variables. One test, which is available in SPSS is the Mantel-Haenszel chi-square,
also called the Mantel-Haenszel test for linear association or linear by linear
association chi-square test.
                                Adverse events
                                No       Yes      Total
dose    placebo   Count         26       6        32
                  % within dose 81.3%    18.8%    100.0%
        500       Count         25       7        32
                  % within dose 78.1%    21.9%    100.0%
        1000      Count         23       9        32
                  % within dose 71.9%    28.1%    100.0%
        2000      Count         22       10       32
                  % within dose 68.8%    31.3%    100.0%
        4000      Count         16       16       32
                  % within dose 50.0%    50.0%    100.0%
Total             Count         112      48       160
                  % within dose 70.0%    30.0%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.107(a)   4    .058
Likelihood Ratio               8.836      4    .065
Linear-by-Linear Association   8.876      1    .003
N of Valid Cases               160
a 0 cells (.0%) have expected count less than 5. The minimum expected count is 9.60.

In this example, there is a significant trend (p-value = 0.003, chi-square trend test),
and we would conclude that the rate of adverse responses increases with drug dose.
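The linear-by-linear association statistic equals (N - 1) times the squared Pearson correlation
between the row scores and the column scores. The Python sketch below assumes the row scores are
the dose values in mg, with placebo scored as 0, and adverse event scored 0/1; this choice of
scores is an assumption, although it reproduces the value shown in the output.

```python
from math import erfc, sqrt

doses = [0, 500, 1000, 2000, 4000]  # assumed scores; placebo taken as 0 mg
events = [6, 7, 9, 10, 16]          # subjects with an adverse event at each dose
totals = [32, 32, 32, 32, 32]       # subjects at each dose

def linear_by_linear(scores, events, totals):
    """(N - 1) * r^2 for a dose score versus a 0/1 outcome, with p-value for df = 1."""
    n = sum(totals)
    mean_x = sum(s * t for s, t in zip(scores, totals)) / n
    mean_y = sum(events) / n
    sxx = sum(t * (s - mean_x) ** 2 for s, t in zip(scores, totals))
    syy = n * mean_y * (1 - mean_y)
    sxy = sum(e * (s - mean_x) for s, e in zip(scores, events))
    stat = (n - 1) * sxy ** 2 / (sxx * syy)
    return stat, erfc(sqrt(stat / 2))

trend_stat, trend_p = linear_by_linear(doses, events, totals)
print(round(trend_stat, 3), round(trend_p, 3))
```

Different scores (for example 1 to 5 for the five dose groups) would give a different statistic,
so the scoring should be stated when the test is reported.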

Using Standardized Residuals in R x C tables. When the contingency table has
more than 2 rows and 2 columns it can be hard to determine the association or the
largest differences. Standardized residuals are often helpful in describing the
association, if the chi-square test indicates there is a statistically significant
association. The (adjusted) standardized residual re-expresses the difference
between the observed cell count and expected cell count in terms of standard
deviation units below or above the value 0 (the expected difference if there is no
association), and the standardized residuals approximately follow a standard
Normal distribution. Hence, values less than -2 or greater than 2 indicate large
differences and values less than -3 or greater than 3 indicate very large
differences.
Under Cells, select Adjusted standardized for Residuals

Education vs Stage of Disease at Diagnosis Example. The chi-square indicated a
significant association between education level and stage of disease at diagnosis
(Chi-square test, p-value = 0.016).

                                          Stage of Disease
Education                                 I        II       III
≤12 years          Count                  20       24       35
                   % within education     25.3%    30.4%    44.3%
                   Adjusted Residual      -2.6     -.5      3.3
College            Count                  37       32       23
                   % within education     40.2%    34.8%    25.0%
                   Adjusted Residual      .8       .6       -1.4
College graduate   Count                  40       29       21
                   % within education     44.4%    32.2%    23.3%
                   Adjusted Residual      1.8      -.1      -1.8

The adjusted standardized residuals indicate the biggest differences between the observed
and expected cell counts (i.e., the most unusual differences under the assumption of no
association between education and stage of disease) are for subjects with ≤12 years of
education, where there are fewer subjects with Stage I and more subjects with Stage III or
IV than expected if there was no association between education and stage of
disease. Also, to a lesser extent, among the subjects with a college graduate
degree there are more subjects with Stage I and fewer subjects with Stage III or
IV than expected if there was no association between education and stage of
disease.
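The adjusted standardized residual for a cell is (observed - expected) divided by the square
root of expected x (1 - row proportion) x (1 - column proportion). A short Python sketch applying
this standard formula to the education by stage table follows; the function name is illustrative.

```python
from math import sqrt

# Rows: 12 years or less, college, college graduate; columns: Stage I, II, III or IV
table = [
    [20, 24, 35],
    [37, 32, 23],
    [40, 29, 21],
]

def adjusted_residuals(table):
    """Adjusted standardized residual for every cell of an R x C table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            out_row.append((observed - expected) / sqrt(variance))
        result.append(out_row)
    return result

residuals = adjusted_residuals(table)
for row in residuals:
    print([round(r, 1) for r in row])
```

The first row gives roughly -2.6, -.5 and 3.3, the values SPSS displays for the lowest
education group.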


One sample binomial test, McNemar's test, Fisher's Exact test and Chi-square
test for 2 x 2 and R x C Contingency Tables Using Summary Data
There is an easy way in SPSS to perform a one sample binomial test, a McNemar's test, a
Fisher's Exact test or a Chi-square test for a 2 x 2 or R x C table when you only have summary
data (i.e., the number of observations in each cell).
One sample binomial test. Suppose you observe 15 cases of myocardial infarction (MI) in 5000
men over a 1 year period and you want to test if the rate of MI is equal to a previously reported
incidence rate of 5 per 1000 (or 0.005).
1. In a new (empty) SPSS Data Editor window enter the following 2
rows of data:
MI   Observed
0    4985
1    15

The values of 0 and 1 used to indicate MI (no/yes) are arbitrary. The variable names are also
arbitrary (e.g., you can leave them as var0001 and var0002).
2. Next, you want to weight cases by Observed:
Choose Data
Choose Weight Cases...
Choose Weight cases by
Choose Observed and then the arrow button so the variable appears in the Frequency variable
box.
Choose OK
3. Now, run the one sample binomial test:
Choose Analyze
Choose Nonparametric Tests
Choose Binomial...
Choose MI so that it appears in the Test Variable List
Change (edit) Test Proportion to .005.
Choose OK

McNemar's test. Suppose you have the following summary table of presence and absence of
DKA before and after therapy for paired data,

Before
therapy

No DKA
DKA

After therapy
No DKA
DKA
128
7
19
7

1. In a new (empty) SPSS Data Editor window enter the following 4
rows of data:
Before   After   Observed
1        1       128
1        0       19
0        1       7
0        0       7
The values of 0 and 1 used to indicate DKA and no DKA are arbitrary. The variable names are
also arbitrary (e.g., you can leave them as var0001, var0002, and var0003).
2. Next, you want to weight cases by Observed:
Choose Data
Choose Weight Cases...
Choose Weight cases by
Choose Observed and then the arrow button so the variable appears in the Frequency variable
box.
Choose OK
3. Now, run McNemar's test:
Choose Analyze
Choose Nonparametric Tests
Choose 2 Related Samples...
Choose Before and After so that they appear in the Test Pair(s) List.
Choose McNemar as the Test Type
Choose Wilcoxon to turn off the option
Choose OK

Chi-square test and Fisher's Exact test for a 2 x 2 table. Suppose you have the following
summary table for oral contraceptive (OC) use by presence or absence of cancer (case or
control),
                 OC Use
                 No     Yes
Cases (cancer)   111    6
Controls         387    8
1. In a new (empty) SPSS Data Editor window enter the following 4
rows of data:
Case   OCuse   Observed
1      0       111
1      1       6
0      0       387
0      1       8
The values of 0 and 1 used to indicate case/control and OC use (no/yes)
are arbitrary. The variable names are also arbitrary (e.g., you can
leave them as var0001, var0002, and var0003).
2. Next, you want to weight cases by Observed:
Choose Data
Choose Weight Cases...
Choose Weight cases by
Choose Observed and then the arrow button so the variable appears in the Frequency variable
box.
Choose OK
3. Now, run the Chi-square (& Fisher's Exact) test
Choose Analyze
Choose Descriptive Statistics
Choose Crosstabs...
Choose Case and OCuse as the row and column variables
Choose Statistics...
Choose Chi-square
Choose Continue
Choose OK

The commands are similar for running the Chi-square test for tables larger than 2x 2. Suppose
you have the following summary table for education level by stage of disease at diagnosis

                      Stage of Disease
Education level       I     II    III or IV
High school or less   20    24    35
College               37    32    23
College graduate      40    29    21

1. In a new (empty) SPSS Data Editor window enter the following 9
rows of data:
Educ Stage Observed
1 1 20
1 2 24
1 3 35
2 1 37
2 2 32
2 3 23
3 1 40
3 2 29
3 3 21
The values used to indicate education level and stage are arbitrary, and the variable names are
also arbitrary.
Follow steps 2. and 3. on the previous page (except use variables Educ and Stage, instead of
Case and OCuse).


Confidence Interval for a Proportion


Constructing a confidence interval for a proportion or rate is rather awkward in SPSS, but you
can do it with the raw data or with summary data (as long as the sample size is large enough to
use the Normal approximation methods for binomial data).
To construct a confidence interval using the raw data you need 1) a binary indicator variable
equal to 1 if the variable is present for a subject and equal to 0 if the variable is absent for a
subject, and 2) a variable that is equal to 1 for all subjects. For example, suppose you want to
construct a confidence interval for the proportion of males in your data set. First you need a
binary indicator variable for males, e.g. you could have a variable named Gender which is equal
to 1 if the subject is a male and equal to 0 if the subject is a female. Second you need to create a
variable that is equal to 1 for all subjects (e.g., use the Compute statement and create a variable
Allones = 1). Now,
1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Ratio...
4. Numerator: Select the binary indicator variable from the source list on the left and then
click on the arrow located in the middle of the window (e.g. select Gender)
5. Denominator: Select the variable equal to 1 for all subjects from the source list on the left
and then click on the arrow located in the middle of the window (e.g. select Allones)
6. Choose Statistics...
7. Choose Mean under Central Tendency
8. Choose Confidence intervals (default is a 95% confidence interval)
9. Choose Continue
10. Choose OK
To illustrate how you would construct a confidence interval with summary data, suppose in a
data set of 3425 subjects, 1341 are males and 2084 are females:
1. In a new (empty) SPSS Data Editor window enter the following 2
rows of data:
Gender   Observed   Allones
0        2084       1
1        1341       1
2. Next, you want to weight cases by Observed:
Choose Data
Choose Weight Cases...
Choose Weight cases by
Choose Observed and then the arrow button so the variable appears in the Frequency variable
box.
Choose OK

3. Now,
Choose Analyze on the menu bar
Choose Descriptive Statistics
Choose Ratio...
Numerator: Select Gender
Denominator: Select Allones
Choose Statistics...
Choose both Mean and Confidence intervals under Central Tendency
Choose Continue
Choose OK
Example of the SPSS output using the previous summary data.

Ratio Statistics
Ratio Statistics for Gender / Allones
Mean                                             .392
95% Confidence Interval for Mean   Lower Bound   .375
                                   Upper Bound   .408
Price Related Differential                       1.000
Coefficient of Dispersion
Coefficient of Variation           Median Centered   .
The confidence intervals are constructed by assuming a Normal distribution
for the ratios.

The observed proportion was .392 or 39.2%.
A 95% confidence interval is 37.5% to 40.8%.
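The interval in the output can be checked with the usual large sample (Normal approximation)
formula for a proportion, p plus or minus 1.96 standard errors. The Python sketch below uses that
formula with the summary counts; SPSS may compute the standard error slightly differently, but
with 3425 subjects any difference is negligible.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Observed proportion with a Normal approximation confidence interval."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p, p - z * se, p + z * se

# 1341 males among 3425 subjects
p_hat, lower, upper = proportion_ci(1341, 3425)
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```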


Correlation & Regression


Pearson and Spearman Rank Correlation Coefficient
1. Choose Analyze on the menu bar
2. Choose Correlate
3. Choose Bivariate...
4. Variable(s): Select the variables from the source list on the left and then click on the arrow
located in the middle of the window.
5. Choose Pearson or/and Spearman as the Correlation Coefficients. Note that the option is
selected if the box has a check mark in it.
6. Choose Two-tailed as the Test of Significance. SPSS will perform the test testing if the
correlation is equal to zero versus it is not equal to zero.
7. Choose OK
Note that you can use the Crosstabs command to calculate confidence intervals for the
correlation.
Example. Pain-related beliefs, catastrophizing, and coping have been shown to be
associated with measures of physical and psychosocial functioning among patients with
chronic musculoskeletal and rheumatologic pain. However, little is known about the relative
importance of these process variables in the functioning of patients with
temporomandibular disorders (TMD).
Correlation coefficients could be calculated to examine the association between
catastrophizing, depression (Beck Depression Inventory), pain-related activity
interference and jaw opening (maximum assisted opening).
(Reference: JA Turner, SF Dworkin, L Mancl, KH Huggins, EL Truelove. The roles of
beliefs, catastrophizing, and coping in the functioning of patients with temporomandibular
disorders. Pain, 92, 41-51, 2001.)
Typically, you would only report
either the Pearson or
Spearman (rank) correlation
coefficients, but you might
calculate both to see if you get
different results or
conclusions.

The correlations are shown on the next page. Note that SPSS will display the correlation between
variable 1 and variable 2 and between variable 2 and variable 1, which are equivalent, and similarly
the correlations between all possible pairs of variables. So, all results displayed below the diagonal
of the matrix of results are redundant.

Correlations
1st entry = Pearson correlation coefficient
2nd entry = Sig. (2-tailed) = p-value
3rd entry = N = the number observations or subjects with non-missing data for both variables
Correlations
                                         Catastroph-   Beck inventory                  Maximum assisted
                                         izing         score            Interference   opening
Catastrophizing    Pearson Correlation   1             .602(**)         .451(**)       -.029
                   Sig. (2-tailed)                     .000             .000           .758
                   N                     118           118              118            116
Beck inventory     Pearson Correlation   .602(**)      1                .445(**)       -.079
score              Sig. (2-tailed)       .000                           .000           .397
                   N                     118           118              118            116
Interference       Pearson Correlation   .451(**)      .445(**)         1              -.068
                   Sig. (2-tailed)       .000          .000                            .468
                   N                     118           118              118            116
Maximum assisted   Pearson Correlation   -.029         -.079            -.068          1
opening            Sig. (2-tailed)       .758          .397             .468
                   N                     116           116              116            116
** Correlation is significant at the 0.01 level (2-tailed).

Correlation between Catastrophizing and Interference = .45
P-value = <.001
N = 118 subjects

Nonparametric Correlations
1st entry = Spearman rank correlation coefficient
2nd entry = Sig. (2-tailed) = p-value
3rd entry = N = the number of observations or subjects with non-missing data for both variables

Correlations (Spearman's rho)
                                                             Beck                         Maximum
                                                             inventory                    assisted
                                           Catastrophizing   score        Interference    opening
Catastrophizing   Correlation Coefficient  1.000             .625(**)     .451(**)        -.013
                  Sig. (2-tailed)                            .000         .000            .892
                  N                        118               118          118             116
Beck inventory    Correlation Coefficient  .625(**)          1.000        .455(**)        -.110
score             Sig. (2-tailed)          .000                           .000            .241
                  N                        118               118          118             116
Interference      Correlation Coefficient  .451(**)          .455(**)     1.000           -.046
                  Sig. (2-tailed)          .000              .000                         .621
                  N                        118               118          118             116
Maximum assisted  Correlation Coefficient  -.013             -.110        -.046           1.000
opening           Sig. (2-tailed)          .892              .241         .621
                  N                        116               116          116             116
** Correlation is significant at the 0.01 level (2-tailed).

Rank correlation between Catastrophizing and Interference = .45, P-value = <.001, N = 118 subjects

Confidence Interval for a Correlation Coefficient
Typically the Crosstabs command is used to produce contingency tables for categorical variables.
One of the options under Statistics computes the correlation coefficient, which you might
want to calculate for ordinal variables. However, you can also use this option for
quantitative variables.
The Crosstabs command is found by selecting Analyze and then Descriptive Statistics.
In this example the correlation between the quantitative variables catastrophizing and
interference will be calculated.
Select Statistics and then select Correlations.
SPSS will produce a contingency table of the crosstabulation of the two variables, which you can ignore.
SPSS will display the correlation coefficient and the standard error estimate for the correlation
coefficient, which can be used to calculate confidence intervals.

Symmetric Measures
                                                       Asymp. Std.
                                               Value   Error(a)      Approx. T(b)   Approx. Sig.
Interval by Interval   Pearson's R             .451    .068          5.445          .000(c)
Ordinal by Ordinal     Spearman Correlation    .451    .076          5.449          .000(c)
N of Valid Cases                               118
a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.
c Based on normal approximation.
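As a rough check on the Approx. T column, the usual t statistic for testing whether a correlation is zero can be computed by hand. The sketch below is Python rather than SPSS syntax, and correlation_t is a hypothetical helper name. It assumes the textbook formula t = r x sqrt(n - 2) / sqrt(1 - r squared) with n - 2 degrees of freedom, which appears to be consistent with the value SPSS displays for Pearson's R.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic (n - 2 degrees of freedom) for testing correlation = 0."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Pearson's R and N of Valid Cases as displayed in the Symmetric Measures table.
t_pearson = correlation_t(0.451, 118)
print(round(t_pearson, 2))
```

The result is close to the displayed 5.445. A small difference in the last digit is expected because the displayed correlation is rounded to three decimals.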

An approximate 95% confidence interval for the correlation coefficient is given by
Correlation coefficient ± 1.96 x Asymp. Std. Error
In this example, the 95% confidence interval for the Pearson correlation coefficient is given
by .451 ± 1.96 x .068 or .31, .58
The 95% confidence interval for the Spearman rank correlation coefficient is given by .451 ±
1.96 x .076 or .30, .60
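The interval calculation above can be sketched in a few lines of Python (not SPSS syntax). The function name approx_ci is hypothetical, and the inputs are the rounded values displayed in the Symmetric Measures table, so the last digit of a bound may differ slightly from the figures quoted above.

```python
def approx_ci(estimate, std_error, z=1.96):
    """Approximate confidence interval: estimate +/- z * standard error."""
    margin = z * std_error
    return estimate - margin, estimate + margin

pearson_ci = approx_ci(0.451, 0.068)    # Pearson's R and its Asymp. Std. Error
spearman_ci = approx_ci(0.451, 0.076)   # Spearman correlation and its Asymp. Std. Error

print([round(bound, 3) for bound in pearson_ci])
print([round(bound, 3) for bound in spearman_ci])
```

Using z = 1.96 gives a 95% interval. A different multiplier (for example 2.576 for 99%) can be passed as the third argument.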

Linear Regression
1. Choose Analyze on the menu bar
2. Choose Regression
3. Choose Linear...
4. Dependent: Select the dependent variable from the source list on the left and then click on
the arrow next to the dependent variable box.
5. Independent(s): Select the independent variable and then click on the arrow next to the
independent variable(s) box. Repeat the process until you have selected all the independent
variables you want.
6. Choose Statistics...
7. Choose Estimates. SPSS will print the regression coefficient estimate, standard error, t
statistic and p-value for each independent variable (as well as the intercept/constant). By
default the option should be selected (i.e., the box has a check mark in it).
8. Choose Model fit. SPSS will print the multiple R, R squared, Adjusted R-squared, standard
error of the regression line, and the ANOVA table. By default the option should be selected.
9. Choose Continue
10. Choose Enter as the Method. Enter is the default method for independent variable entry.
Other methods of variable entry can be selected by clicking on the down arrow and clicking
on the desired method of entry.
11. Choose OK
Additional options are available under Statistics..., Plots..., Save..., Method, and Options... For
example:
Statistics...
Estimates. Default option, which prints the usual linear regression results.
Model fit. Default option, which prints the usual linear regression results.
Confidence intervals (for the regression coefficient estimates)
Covariance matrix (and correlation matrix for the regression coefficient estimates).
R squared change. If independent variables are entered in Blocks (using the Block option;
see below), this option computes the change in the R squared between models with different
blocks of independent variables. It is also useful for computing a partial F test for a
categorical variable with more than two categories by entering the indicator variables for the
categorical variable in the second block (Block 2 of 2) and all other independent variables in
the first block (Block 1 of 2) and using the R squared change option.
Part and Partial Correlations. This option computes the Pearson correlation coefficient
between the dependent variable and each independent variable (Zero-order correlation) and
the correlation coefficient between the dependent variable and an independent variable after
controlling for all the other independent variables in the regression model (Partial correlation).
Squaring the partial correlation gives you the partial R-squared for an independent variable.
This option also computes a Part correlation, which is the correlation between the dependent
variable and an independent variable after (only) the independent variable has been adjusted for
all the other independent variables in the regression model. The square of the Part correlation is
equal to the change in the R-squared when an independent variable is added to the regression model
with all the other independent variables.

(Multi-)Collinearity diagnostics. This option computes various statistics for detecting
collinearity between the independent variables. For example, Tolerance is the proportion of a
variable's variance not accounted for by other independent variables in the equation. A
variable with a very low tolerance contributes little information to a model, and can cause
computational problems. Another statistic is the VIF (variance inflation factor). Large values
are an indicator of multicollinearity between independent variables.
Plots... which are useful for doing regression diagnostics:
Histogram or Normal Probability Plot (P-P plot) (of the standardized residuals).
Produce all partial (residual) plots
Other scatter plots
Save... which produces variables that are useful for doing regression diagnostics:
Predicted Values (unstandardized, standardized, adjusted)
Residuals (unstandardized, standardized, studentized, deleted)
Distances (Mahalanobis, Cook's, Leverage)
Influence Statistics (dfBeta, dfFit)
Note that SPSS creates a new variable for each selected Save... option and adds the new
variables to the data file. The variable names are defined in the Variable View of the Data Editor.
Once you are done using these variables you may want to delete them from the data file or save
them (by re-saving the data file).
Method. Click on the down arrow to the right of Method to display the methods available for
independent variable entry (enter, stepwise, remove, backward, forward). Enter is the default
option. With the other options, SPSS enters independent variables into the model using various
stepwise methods.
Options...
You can modify the entry and removal criteria used by stepwise, remove, backward, and
forward independent variable entry methods.
You can define how observations with missing data are handled.
Previous, Block # of #, Next
You can use these options to enter independent variables in blocks into the regression model.
You can select different methods of variable entry for each block. This option is also useful
for computing partial F tests with the R squared change option.

Example. Simple linear regression of forced expiratory volume (volume, 1 second) on
height (cm).
The dependent variable in this example is forced expiratory volume (fev1).
There is only 1 independent variable in this example, height.
Additional options can be found under Statistics, Plots, Save, & Options.

Here are the Statistics options:
Usually you want the default options Estimates and Model fit selected.
In this example, the (95%) confidence interval for the regression coefficients is also
selected.

Here are the Plots options:
By default no options are selected.
In this example, the normal probability plot of the residuals is requested.

Regression

Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       height(a)           .                   Enter
a All requested variables entered.
b Dependent Variable: fev1

Information on the independent variables and dependent variable in the regression
model, and the method of entering the independent variables into the regression
model.

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .562(a)   .315       .314                .55337
a Predictors: (Constant), height
b Dependent Variable: fev1

R-Square = proportion of the total variation in the dependent variable
explained by the independent variable(s) = .315 or 31.5%
R is the square root of R Square
Adjusted R Square adjusts the R square for the number of variables in the model
Std. error of the estimate = standard deviation of the error or residuals.
Not usually reported, but used in estimating the standard error of the
regression coefficients.

ANOVA(b)
Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   112.380          1     112.380       366.997   .000(a)
    Residual     244.054          797   .306
    Total        356.434          798
a Predictors: (Constant), height
b Dependent Variable: fev1

ANOVA = analysis of variance table. Not needed when there is only 1 independent variable in
the model. The F test is equivalent to the t test for testing if the slope is equal to zero in the
output that follows (F = t²).
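The quantities in the Model Summary can be recovered from the sums of squares and degrees of freedom in the ANOVA table. The following is a minimal Python sketch (not SPSS syntax) using the displayed values. The variable names are illustrative, and the adjusted R square line assumes the standard formula 1 - (1 - R²)(n - 1)/(residual df).

```python
from math import sqrt

ss_regression = 112.380   # Regression sum of squares from the ANOVA table
ss_residual = 244.054     # Residual sum of squares
df_regression = 1         # number of independent variables
df_residual = 797         # n - number of independent variables - 1

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
mean_square_error = ss_residual / df_residual
std_error_of_estimate = sqrt(mean_square_error)
f_statistic = (ss_regression / df_regression) / mean_square_error

print(round(r_square, 3), round(adjusted_r_square, 3))
print(round(std_error_of_estimate, 5), round(f_statistic, 2))
```

The results agree with the SPSS output to the precision displayed, apart from rounding in the last digit of F.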

Coefficients(a)
                 Unstandardized         Standardized
                 Coefficients           Coefficients                       95% Confidence Interval for B
Model            B         Std. Error   Beta           t          Sig.     Lower Bound   Upper Bound
1   (Constant)   -4.330    .335                        -12.943    .000     -4.987        -3.673
    height       .039      .002         .562           19.157     .000     .035          .043
a Dependent Variable: fev1

Unstandardized coefficients B = regression coefficient
In this example B = .039 is the slope and B = -4.330 the intercept
Std. Error = standard error of the regression coefficient.
Standardized coefficients Beta = standardized regression coefficient
t = t statistic for testing if the regression coefficient is equal to zero (versus not equal to
zero)
Sig. = p value for testing if the regression coefficient is equal to zero (versus not equal
to zero).
95% confidence interval for B = 95% confidence interval for the regression coefficient
In this example, you would report the slope (.039), standard error of the slope (.002)
and the p-value (<.001), or the slope (.039) and 95% confidence interval (.035 to .043).
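To show how the intercept and slope are used, here is a small Python sketch (not SPSS syntax) of the fitted line. The function name predict_fev1 and the heights of 170 and 180 cm are chosen only for illustration, and the coefficients are the rounded values displayed in the Coefficients table.

```python
intercept = -4.330   # B for (Constant)
slope = 0.039        # B for height (change in fev1 per 1 cm of height)

def predict_fev1(height_cm):
    """Predicted fev1 from the fitted simple linear regression."""
    return intercept + slope * height_cm

predicted = predict_fev1(170)                        # arbitrary illustrative height
difference = predict_fev1(180) - predict_fev1(170)   # effect of 10 cm more height

print(round(predicted, 2), round(difference, 2))
```

The second value is simply 10 times the slope, which is how the slope is usually interpreted: the expected change in the dependent variable per unit change in the independent variable.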

Charts
[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: fev1.
Horizontal axis: Observed Cum Prob. Vertical axis: Expected Cum Prob.]
Normal probability plot of the residuals. The points fall along a straight line,
indicating the residuals have, at least approximately, a Normal distribution.

Linear Regression Example with three independent variables
The dependent variable is forced expiratory volume (fev1).
The independent variables are height, age and gender.
The Enter method means all 3 independent variables will be
included in the regression model.

Statistics options
By default, Estimates and Model fit are selected.
In this example, part and partial correlations
and collinearity diagnostics are also selected.

Plots options
Normal probability plot (of the standardized residuals) and partial
(residual) plots are selected.

Regression

Variables Entered/Removed(b)
Model   Variables Entered         Variables Removed   Method
1       gender, age, height(a)    .                   Enter
a All requested variables entered.
b Dependent Variable: fev1

Information on the independent variables, method of variable entry, and dependent
variable.

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .601(a)   .361       .358                .53531
a Predictors: (Constant), gender, age, height
b Dependent Variable: fev1

R-square is .361 or 36.1% (adjusted R-square is 35.8%).
About 36% of the variation in the dependent variable can be
explained by the 3 independent variables.

ANOVA(b)
Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   128.623          3     42.874        149.621   .000(a)
    Residual     227.811          795   .287
    Total        356.434          798
a Predictors: (Constant), gender, age, height
b Dependent Variable: fev1

The overall F test indicates that 1 or more of the independent
variables is significant (P < .001). Degrees of freedom of
the F test are 3 and 795.

Coefficients(a)
               Unstandardized        Standardized                                                       Collinearity
               Coefficients          Coefficients                      Correlations                     Statistics
Model          B        Std. Error   Beta           t         Sig.    Zero-order   Partial   Part      Tolerance   VIF
1 (Constant)   -.780    .593                        -1.315    .189
  height       .028     .003         .399           9.143     .000    .562         .308      .259      .423        2.364
  age          -.025    .004         -.200          -6.857    .000    -.206        -.236     -.194     .944        1.059
  gender       .273     .059         .201           4.591     .000    .478         .161      .130      .420        2.379
a Dependent Variable: fev1

Height, age, and gender are all statistically significant (P < .001), i.e., the regression
coefficients are different from zero.
The partial correlations (and partial R-squares, .308² = .095, (-.236)² = .056, and .161² = .026)
indicate the correlation with the dependent variable adjusted for the other variables in
the regression model.
A low tolerance value (say, <.20) or a high variance inflation factor (VIF) (say, > 5 or 10)
may indicate a multicollinearity problem.
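The partial R-squares and the link between tolerance and VIF can be verified with a short calculation. This Python sketch (not SPSS syntax) uses the rounded values displayed in the Coefficients table and assumes the standard relationship VIF = 1 / Tolerance. The dictionary layout is only an illustrative choice.

```python
coefficients = {
    # name: (partial correlation, tolerance) as displayed in the table
    "height": (0.308, 0.423),
    "age": (-0.236, 0.944),
    "gender": (0.161, 0.420),
}

summary = {}
for name, (partial, tolerance) in coefficients.items():
    summary[name] = {
        "partial_r_square": partial ** 2,   # square of the partial correlation
        "vif": 1 / tolerance,               # variance inflation factor
    }

for name, values in summary.items():
    print(name, round(values["partial_r_square"], 3), round(values["vif"], 3))
```

The VIF for gender comes out slightly different from the displayed 2.379 because the displayed tolerance (.420) is itself rounded.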

[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: fev1.
Horizontal axis: Observed Cum Prob. Vertical axis: Expected Cum Prob.]
Normal probability plot of the residuals. The points fall approximately along a straight line,
indicating the residuals have (approximately) a Normal distribution.

[Figure: Partial Regression Plot. Dependent Variable: fev1. Horizontal axis: height. Vertical axis: fev1.]
[Figure: Partial Regression Plot. Dependent Variable: fev1. Horizontal axis: age. Vertical axis: fev1.]
Partial regression plots for height and age with lowess smooths.
The plot for height is assessing the relationship between height and fev1 after adjusting for age
and gender (e.g., is the relationship linear).
Similarly, the plot for age is assessing the relationship between age and fev1 adjusting
for height and gender.

Note that SPSS will also produce a partial residual plot for gender. In general, the partial
residuals plots for categorical/nominal variables are not very useful. Boxplots of the
residuals for each category of a categorical/nominal variable are useful for regression
diagnostics. To produce the boxplots you could use the Save options to save the
residuals from a regression and then the Boxplot commands to plot the residuals.

Linear Regression via ANOVA Commands
It is possible to use the analysis of variance commands of SPSS to perform a linear regression
analysis, because the methods are mathematically equivalent. Performing a linear regression
analysis via analysis of variance in SPSS is more complicated than using the linear regression
commands. However, the advantage of using the analysis of variance commands to perform a
linear regression is that you do not have to create indicator variables for categorical variables or
create interaction terms. To perform a linear regression via analysis of variance commands:
1. Choose Analyze on the menu bar
2. Choose General Linear Model
3. Choose Univariate...
4. Dependent: Select the dependent variable from the source list on the left and then click on
the arrow next to the dependent variable box.
5. Fixed Factor(s): Select the independent variables that are categorical/qualitative and then
click on the arrow next to the fixed factor(s) box. Repeat the process until you have selected
all the categorical variables you want.
6. Covariate(s): Select the independent variables that are continuous/quantitative and then click
on the arrow next to the covariate(s) box. Repeat the process until you have selected all the
continuous variables you want.
7. Choose Model...
8. Choose Custom
9. Factors & Covariates: Select/highlight all the variables, then under Build Terms select Main
Effects. You may need to click on the down arrow to display the Main Effects option. After
you have selected Main Effects, select the arrow under the Build Terms. All the variables
should now appear in the Model box on the right hand side.
10. Choose Continue
11. Choose Options...
12. Choose Parameter Estimates under Display
13. Choose Continue
14. Choose OK
For categorical variables the last category (i.e., the category with the largest numeric coding
value) will be the referent group/category. SPSS will compute the F test for each continuous
independent variable and for each categorical independent variable. By selecting to have the
parameter estimates displayed, SPSS will also compute the regression coefficient estimates,
standard errors, t (statistic) values, p-values, and 95% confidence intervals that you get from the
linear regression commands.
To include interaction terms in the regression model, in Step 9 highlight the two variables for
which you want to create a (two-way) interaction term. Under Build Terms select Interaction, and
then select the arrow under the Build Terms. A two-way interaction between two variables (variable 1 *
variable 2) should now appear in the Model box on the right hand side.
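For readers who want to see what SPSS builds behind the scenes, the sketch below shows one way the indicator variables and interaction terms for a single subject could be constructed by hand. It is Python rather than SPSS syntax. The function names indicator_row and design_row, the category codes and the covariate value are all hypothetical, and the last (largest) code is taken as the reference category as described above.

```python
def indicator_row(category, categories, reference):
    """0/1 indicators for every category except the reference category."""
    return [1 if category == c else 0 for c in categories if c != reference]

def design_row(category, covariate, categories):
    """Indicators, the covariate, then indicator-by-covariate interaction terms."""
    reference = max(categories)   # last (largest code) category is the reference
    indicators = indicator_row(category, categories, reference)
    interactions = [d * covariate for d in indicators]
    return indicators + [covariate] + interactions

codes = [1, 2, 3]   # hypothetical numeric codes of a 3-category factor

row_first = design_row(1, 170.0, codes)       # subject in category 1
row_reference = design_row(3, 170.0, codes)   # subject in the reference category
print(row_first)
print(row_reference)
```

A subject in the reference category has zeros for every indicator and interaction term, which is why SPSS reports the parameters for that category as redundant and sets them to zero.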

Example. Linear regression of forced expiratory volume on height (continuous variable)
and diabetes status (categorical variable; normal, impaired fasting glucose, diabetic).
Forced expiratory volume (fev1) is the dependent variable.
Diabetes is a categorical variable with 3 categories.
Height is a continuous variable.

Under Model, select Custom, then select each of the variables separately until
they all appear under Model; or select Main Effects under Build Term(s), select all
Factors & Covariates, and then select the arrow under Build Term(s).

Under Options, select Parameter estimates to have the usual linear regression
results displayed in the output.

Univariate Analysis of Variance
Between-Subjects Factors

Tests of Between-Subjects Effects
Dependent Variable: fev1
                   Type III Sum
Source             of Squares     df    Mean Square   F         Sig.
Corrected Model    114.617(a)     3     38.206        125.606   .000
Intercept          51.195         1     51.195        168.308   .000
diabetes           2.237          2     1.118         3.677     .026
height             111.378        1     111.378       366.168   .000
Error              241.817        795   .304
Total              3773.779       799
Corrected Total    356.434        798
a R Squared = .322 (Adjusted R Squared = .319)

The overall test for the significance of diabetes is displayed (p-value = 0.026).

Parameter Estimates
Dependent Variable: fev1
                                                           95% Confidence Interval
Parameter          B        Std. Error   t         Sig.    Lower Bound   Upper Bound
Intercept          -4.392   .337         -13.025   .000    -5.054        -3.730
[diabetes=1.00]    .126     .049         2.549     .011    .029          .223
[diabetes=2.00]    .046     .056         .830      .407    -.063         .156
[diabetes=3.00]    0(a)     .            .         .       .             .
height             .039     .002         19.136    .000    .035          .043
a This parameter is set to zero because it is redundant.

This table displays the usual linear regression results. In this
example diabetes = 3 (diabetic) is the reference group.
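The F test for diabetes can also be viewed as a partial F test that compares the model containing height and diabetes with the earlier model containing height alone. The Python sketch below (not SPSS syntax) does the arithmetic with the displayed sums of squares. It assumes both models were fitted to the same subjects, which the identical corrected total sum of squares (356.434 on 798 degrees of freedom) suggests. The variable names are illustrative.

```python
ss_model_height_only = 112.380       # Regression SS, model with height alone
ss_model_height_diabetes = 114.617   # Corrected Model SS, height + diabetes
ss_error_full = 241.817              # Error SS of the larger model
df_error_full = 795                  # Error df of the larger model
df_added = 2                         # indicator variables added for diabetes

ss_added = ss_model_height_diabetes - ss_model_height_only
partial_f = (ss_added / df_added) / (ss_error_full / df_error_full)

print(round(ss_added, 3), round(partial_f, 3))
```

The added sum of squares matches the Type III sum of squares shown for diabetes, and the partial F reproduces the F statistic in the Tests of Between-Subjects Effects table.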

Example. Adding an interaction between diabetes status and height in the regression
model

To add an interaction between two variables, select the Build Term(s) to
show Interaction, select two variables under Factors & Covariates and then
select the arrow under Build Term(s).

Univariate Analysis of Variance
Tests of Between-Subjects Effects
Dependent Variable: fev1
                     Type III Sum
Source               of Squares     df    Mean Square   F         Sig.
Corrected Model      114.946(a)     5     22.989        75.492    .000
Intercept            42.741         1     42.741        140.354   .000
diabetes             .272           2     .136          .447      .639
height               94.349         1     94.349        309.823   .000
diabetes * height    .328           2     .164          .539      .583
Error                241.488        793   .305
Total                3773.779       799
Corrected Total      356.434        798
a R Squared = .322 (Adjusted R Squared = .318)

This table displays the significance of the diabetes
status by height interaction (p-value = 0.58).

Parameter Estimates
Dependent Variable: fev1
Parameter                   B        Std. Error   t        Sig.
Intercept                   -4.373   .673         -6.498   .000
[diabetes=1.00]             -.168    .818         -.206    .837
[diabetes=2.00]             .614     .963         .637     .524
[diabetes=3.00]             0(a)     .            .        .
height                      .039     .004         9.506    .000
[diabetes=1.00] * height    .002     .005         .361     .719
[diabetes=2.00] * height    -.003    .006         -.593    .553
[diabetes=3.00] * height    0(a)     .            .        .
a This parameter is set to zero because it is redundant.

This table displays the usual linear regression results, which include the
results for diabetes status, height and the interaction between diabetes status
and height.

Logistic Regression
1. Choose Analyze on the menu bar
2. Choose Regression
3. Choose Binary Logistic...
4. Dependent: Select the dependent variable from the source list on the left and then click on
the arrow next to the dependent variable box.
5. Covariate(s): Select the independent variable and then click on the arrow next to the
Covariate(s) box. Repeat the process until you have selected all the independent variables
you want.
6. Choose Enter as the Method. Enter is the default method for independent variable entry.
Other methods of variable entry can be selected by clicking on the down arrow and clicking
on the desired method of entry.
7. Choose OK
Additional options are available under >a*>b, Categorical..., Save..., Method, or Options... .
For example:
>a*>b (for adding two-way interactions) You can add an interaction between two independent
variables to the regression model by selecting two variables from the source list on the left (hold
down the Ctrl key while selecting the two variables) and then clicking on >a*>b (after you
highlight two variables from the source list on the left the >a*>b should be available to select).
Categorical... You can use the categorical option to have SPSS create indicator or dummy
variables for categorical variables.
1. Choose Categorical
2. Categorical Covariates: Select a covariate that is categorical and then click on the arrow next
to the Covariates box.
3. Choose Indicator as the Contrast: Indicator is the default method for creating indicator
variables. Other methods can be selected by clicking on the down arrow and clicking on the
desired method.
4. Choose the reference category as the last category (i.e., the category with the largest numeric
coding value) or the first category (i.e., the category with the smallest numeric coding value).
5. Choose Change.
6. Repeat steps 2 through 5 until you have defined all categorical variables.
7. Choose Continue.
Save...
Predicted Values (Probabilities and Group Membership). This option creates new variables
that are the predicted probabilities and the predicted group membership. The predicted group
membership (0 or 1) is based on whether the predicted probability is less than (group
membership=0) or greater than or equal to (group membership=1) the classification cutoff. By
default the classification cutoff value is 0.5. You can change the cutoff value using Options...
Residuals (Unstandardized, Logit, Studentized, Standardized, Deviance)
Influence (Cook's, leverage, dfBeta)

Note that SPSS creates a new variable for each selected Save... option and adds the new
variables to the data file. The variable names are defined in the Viewer window. Once you are
done using these variables you may want to delete them from the data file or save them (by re-saving the data file).
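The rule that turns a saved predicted probability into a predicted group membership, as described under Save..., can be sketched as follows. This is Python rather than SPSS syntax. The function name predicted_group and the probabilities are invented for illustration.

```python
def predicted_group(probability, cutoff=0.5):
    """Predicted group membership: 1 if probability >= cutoff, otherwise 0."""
    return 1 if probability >= cutoff else 0

probabilities = [0.12, 0.50, 0.83]   # hypothetical predicted probabilities

groups_default = [predicted_group(p) for p in probabilities]
groups_strict = [predicted_group(p, cutoff=0.9) for p in probabilities]
print(groups_default, groups_strict)
```

Raising the cutoff makes it harder to be classified into group 1, which is the effect of changing the classification cutoff under Options...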
Method. Click on the down arrow to the right of Method to display the methods available for
independent variable entry (enter, forward:conditional, forward:LR, forward:Wald,
backward:conditional, backward:LR, backward:Wald).
Options...
Confidence interval for odds ratio (CI for exp(B))
Hosmer-Lemeshow goodness-of-fit
You can modify the entry and removal criteria used by the backward and forward variable
entry methods.
Previous, Block # of #, Next You can use these options to enter independent variables in blocks
into the regression model. You can select different methods of variable entry for each block.
Example. Logistic regression will be used to determine the relationship between any use
of health services (coded 0 = no use, 1 = any use) and age, health index, gender and race.
Subjects in the study (Model Cities Data Set) were followed for a varying amount of time,
so the number of months followed (expos) will also be included as an independent variable
in the logistic regression model.
The dependent variable, anyuse, is binary.
There are 5 independent variables. Female and Race are
categorical/nominal variables.

You can use the Categorical option to define which variables are categorical and
SPSS will create the indicator variables.
By default the category with the largest numerical value (last) will be the
reference group. Here, the category with the smallest numerical value was selected
as the reference group.
Under Options you can select to have the 95% confidence intervals for the odds
ratios displayed in the output.
Also, you can run the Hosmer-Lemeshow goodness-of-fit test.

Logistic Regression
Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases      Included in Analysis     3199   73.1
                    Missing Cases            1175   26.9
                    Total                    4374   100.0
Unselected Cases                             0      .0
Total                                        4374   100.0
a If weight is in effect, see classification table for the total number of cases.

Information on the number of observations used in the
logistic regression. Subjects with missing data are excluded.

Dependent Variable Encoding
Original Value   Internal Value
.00              0
1.00             1

SPSS will always recode the dependent variable to a 0 or 1
binary variable (internal value), and will estimate the odds
ratio for the event coded as 1 (vs the event coded as 0).
If your dependent variable is not coded 0 or 1, check this
table to determine the interpretation of the odds ratios.

Categorical Variables Codings
                                Parameter coding
                    Frequency   (1)      (2)
race      white     497         .000     .000
          other     455         1.000    .000
          black     2247        .000     1.000
female    male      1450        .000
          female    1749        1.000

This table gives the definition of the indicator variables. E.g.,
race(1) = other
race(2) = black
(race = white is the reference group)
female(1) = female
(male is the reference group)

Caution! Make sure you understand the interpretation of the indicator variables that
SPSS creates. It is very easy to get confused. For example, in this example the variable
race is coded 1=white, 2=other, 3=black. A common mistake would be to interpret race(1) =
white and race(2) = other.
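One way to avoid this mistake is to write down explicitly which category each indicator stands for. The following Python sketch (not SPSS syntax) mimics the naming pattern seen in the table above. The function name indicator_labels is hypothetical, and the numbering rule (indicators are numbered in order of the category codes, skipping the reference category) is an assumption based on that table.

```python
def indicator_labels(variable, value_labels, reference="first"):
    """Map names such as race(1), race(2) to the category each one indicates."""
    codes = sorted(value_labels)
    reference_code = codes[0] if reference == "first" else codes[-1]
    kept = [code for code in codes if code != reference_code]
    return {f"{variable}({i})": value_labels[code]
            for i, code in enumerate(kept, start=1)}

race_labels = {1: "white", 2: "other", 3: "black"}

first_as_reference = indicator_labels("race", race_labels, reference="first")
last_as_reference = indicator_labels("race", race_labels, reference="last")
print(first_as_reference)
print(last_as_reference)
```

With the first category as the reference (as in this example), race(1) means other and race(2) means black. With the last category as the reference, the same names would instead mean white and other, which is exactly the confusion the caution above warns about.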

Block 0: Beginning Block
Ignore all the output under Block 0. The output
displays information for the logistic regression
model with no independent variables in the model.

Block 1: Method = Enter
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    301.534      6    .000
         Block   301.534      6    .000
         Model   301.534      6    .000

Unless you are using stepwise methods to enter variables or
entering variables in different blocks you can ignore this output.

Model Summary
        -2 Log         Cox & Snell   Nagelkerke
Step    likelihood     R Square      R Square
1       2609.415(a)    .090          .151
a Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

R-square measures for logistic regression are usually not very useful.

Classification Table(a)
                                   Predicted
                                   anyuse              Percent
Observed                           .00       1.00      correct
Step 1   anyuse           .00      0         542       .0
                          1.00     0         2657      100.0
         Overall percentage                            83.1
a The cut value is .500

Ignore this table also. It is describing how well the logistic
regression predicts any use if a predicted probability > 0.5 is
used to indicate any use. All subjects are predicted to have
use.
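Although this table can usually be ignored, it may help to see where its percentages come from. Here is a minimal Python sketch (not SPSS syntax) using the counts displayed in the Classification Table. The variable names are illustrative.

```python
# Rows are observed anyuse (0, then 1); columns are predicted anyuse (0, then 1).
table = [
    [0, 542],    # observed anyuse = 0
    [0, 2657],   # observed anyuse = 1
]

correct = table[0][0] + table[1][1]        # diagonal cells are correct predictions
total = sum(sum(row) for row in table)
overall_percent_correct = 100 * correct / total
percent_correct_by_row = [100 * table[i][i] / sum(table[i]) for i in range(2)]

print(round(overall_percent_correct, 1), percent_correct_by_row)
```

Because every subject is predicted to have use, the overall percentage correct is just the percentage of subjects who actually had any use.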

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      8.368        8    .398

Contingency Table for Hosmer and Lemeshow Test
               anyuse = .00           anyuse = 1.00
               Observed   Expected    Observed   Expected    Total
Step 1   1     124        123.653     197        197.347     321
         2     101        97.310      218        221.690     319
         3     79         81.589      241        238.411     320
         4     73         67.769      248        253.231     321
         5     57         54.600      263        265.400     320
         6     33         41.820      287        278.180     320
         7     32         29.724      288        290.276     320
         8     16         21.258      304        298.742     320
         9     13         15.538      307        304.462     320
         10    14         8.740       304        309.260     318

The Hosmer-Lemeshow goodness-of-fit statistic is formed by grouping the data
into g groups (usually g=10) based on the percentiles of the estimated
probabilities and calculating the Pearson chi-square statistic from the 2 x g
table of observed and estimated expected frequencies. A small p-value
indicates a lack of fit. Large differences between the observed
and expected values can be used to help identify where there is lack-of-fit when present.
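The calculation described above can be reproduced from the contingency table. The Python sketch below (not SPSS syntax) sums (observed - expected)² / expected over all cells of the 2 x g table, using the displayed counts. The degrees of freedom line assumes the usual convention of g - 2, and the variable names are illustrative.

```python
# Each tuple: observed and expected counts for anyuse = 0, then for anyuse = 1.
groups = [
    (124, 123.653, 197, 197.347),
    (101, 97.310, 218, 221.690),
    (79, 81.589, 241, 238.411),
    (73, 67.769, 248, 253.231),
    (57, 54.600, 263, 265.400),
    (33, 41.820, 287, 278.180),
    (32, 29.724, 288, 290.276),
    (16, 21.258, 304, 298.742),
    (13, 15.538, 307, 304.462),
    (14, 8.740, 304, 309.260),
]

# Pearson chi-square statistic summed over both columns of every group.
hosmer_lemeshow = sum(
    (obs0 - exp0) ** 2 / exp0 + (obs1 - exp1) ** 2 / exp1
    for obs0, exp0, obs1, exp1 in groups
)
degrees_of_freedom = len(groups) - 2   # assumed convention: g - 2

print(round(hosmer_lemeshow, 3), degrees_of_freedom)
```

The largest contributions come from groups 6, 8 and 10, which is where the observed and expected counts differ most.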
The last table of the output usually has the results we are most interested in. It lists the
odds ratios, p-values and 95% confidence intervals for the odds ratios.

Variables in the Equation
                                                                        95.0% C.I. for EXP(B)
                       B        S.E.    Wald      df   Sig.   Exp(B)    Lower     Upper
Step 1(a)  expos       .077     .006    167.398   1    .000   1.080     1.068     1.093
           age         .009     .003    8.118     1    .004   1.009     1.003     1.016
           female(1)   .501     .099    25.363    1    .000   1.650     1.358     2.005
           race                         12.715    2    .002
           race(1)     -.424    .190    4.964     1    .026   .655      .451      .950
           race(2)     -.530    .149    12.689    1    .000   .588      .440      .788
           health      .048     .010    23.603    1    .000   1.049     1.029     1.070
           Constant    -.337    .196    2.958     1    .085   .714
a Variable(s) entered on step 1: expos, age, female, race, health.

Exp(B) = Odds Ratio
95.0% C.I. for EXP(B) = 95% confidence interval for the odds ratio
Sig. = P-value for the individual odds ratio or the overall significance of a
categorical/nominal variable if there is no Exp(B) listed.
B = the logistic regression coefficient, the log odds ratio
S.E. = the standard error of the logistic regression coefficient
Wald = the Wald test statistic for testing if B=0 (or equivalently odds ratio = 1)
or if all Bs = 0 for a categorical variable with >2 indicator variables.
d.f. = degrees of freedom of the test statistic.
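The relationships among B, S.E., Wald, Exp(B) and the confidence interval can be illustrated with the row for female(1). This Python sketch (not SPSS syntax) assumes the usual formulas: odds ratio = exp(B), 95% interval = exp(B ± 1.96 x S.E.), and Wald = (B / S.E.)². The function name odds_ratio_summary is hypothetical.

```python
from math import exp

def odds_ratio_summary(b, se, z=1.96):
    """Odds ratio, approximate confidence limits and Wald statistic from B and S.E."""
    return {
        "odds_ratio": exp(b),
        "lower": exp(b - z * se),
        "upper": exp(b + z * se),
        "wald": (b / se) ** 2,
    }

# B and S.E. for female(1) as displayed in Variables in the Equation.
female = odds_ratio_summary(0.501, 0.099)
print({name: round(value, 3) for name, value in female.items()})
```

The odds ratio and interval agree closely with the SPSS output. The Wald statistic computed this way differs a little from the displayed 25.363 because B and S.E. are shown rounded to three decimals.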

It is often helpful to write on your output the definition of the indicator variables, so you
don't get confused about the interpretation of the results. It is also helpful to change Exp(B)
to odds ratio, and Sig. to P-value.

                               Odds     95.0% C.I. for odds ratio
                               Ratio    Lower      Upper      P-value
Step 1(a)  expos               1.080    1.068      1.093      .000
           age                 1.009    1.003      1.016      .004
           female (vs male)    1.650    1.358      2.005      .000
           race                                               .002
             other vs white    .655     .451       .950       .026
             black vs white    .588     .440       .788       .000
           health              1.049    1.029      1.070      .000