
DEBRE TABOR UNIVERSITY

FACULTY OF NATURAL AND COMPUTATIONAL SCIENCES


DEPARTMENT OF STATISTICS

STATISTICAL COMPUTING I

MANUAL FOR SPSS AND MINITAB SOFTWARE

By:

Emebet Yismaw (M.Sc., Biostatistics), Staff member of Statistics Department

Minilik Derseh (B.Sc., Statistics), Staff member of Statistics Department

REVIEWER:
Shewayiref Geremew (M.Sc., Biostatistics), Staff member of Statistics Department

May, 2018

Debre Tabor, Ethiopia

Preface

SPSS, standing for Statistical Package for the Social Sciences, is a powerful, user-
friendly software package for the manipulation and statistical analysis of data. The
package is particularly useful for students and researchers in statistics, psychology,
sociology, psychiatry, and other behavioral sciences, containing as it does an extensive
range of both univariate and multivariate procedures much used in these disciplines.

MINITAB is a powerful, easy-to-use statistical software package that provides a wide
range of basic and advanced data analysis capabilities. MINITAB's straightforward
command structure makes it accessible to users with a great variety of backgrounds and
experience. MINITAB runs on PC and Macintosh computers, and on most of the leading
workstations, minicomputers, and mainframe computers. While MINITAB differs
across releases and computer platforms, the core of MINITAB -- the worksheet and
commands -- is the same. Thus, if you know how to use one release of MINITAB on one
platform, you can easily switch to another.

The manual is not intended in any way to be an introduction to statistics and, indeed, we
assume that most readers will have attended at least one statistics course and will be relatively
familiar with concepts such as linear regression, correlation, significance tests, and simple
analysis of variance. Our hope is that researchers and students with such a background will
find this manual a relatively self-contained means of using SPSS and Minitab to analyze
their data correctly.

Each chapter ends with a number of exercises, some relating to the data sets introduced
in the chapter and others introducing further data sets. Working through these exercises
will develop SPSS, Minitab, and statistical skills.

Table of Contents
1. Introduction to SPSS ............................................................................................................. 1
1.1. What is a statistical package? ........................................................................................... 1
1.2. Starting SPSS ................................................................................................................... 1
1.3. Overview of SPSS for windows ....................................................................................... 2
1.4. The menus and their use ................................................................................................... 6
1.5. Entering and saving data in SPSS .................................................................................... 7
1.6. Data Importing From Microsoft Excel and ASCII files ................................................. 18
2. Modifying and Organizing Data ........................................................................................ 21
2.1. Retrieving data ............................................................................................................... 21
2.2. Inserting cases and variables .......................................................................................... 21
2.3. Deleting cases or variables ............................................................................................. 22
2.4. Transforming Variables with the Compute Command .................................................. 22
2.5. Transforming Variables with the Recode Command ..................................................... 23
2.5.1. Banding Values ............................................................................ 26
2.6. Keeping and dropping of cases ...................................................................................... 32
2.7. Collapsing and transposing Data .................................................................................... 33
2.8. Listing Cases .................................................................................................................. 39
3. Descriptive Statistics Using SPSS....................................................................................... 41
3.1. Summarizing Data .......................................................................................................... 41
3.1.1. Producing Frequency distribution ............................................................... 41
3.1.2. Descriptive Statistics ................................................................................... 42
3.1.3. Cross Tabulation ......................................................................................... 42
3.1.4. Diagrams and graphs................................................................................... 45
4. Customizing SPSS Outputs and Reporting ....................................................................... 50
4.1. Customizing SPSS outputs ............................................................................................. 50
4.1.1. Modifying Tables ........................................................................................ 50
4.1.2. Exporting Tables in SPSS ........................................................................... 53
4.1.3. Modifying scatter plot .................................................................. 54
4.1.4. Modifying and Exporting Graphs ............................................................... 55
5. Introduction to Minitab ...................................................................................................... 63

5.1. How to start and exit Minitab ......................................................................................... 63
5.2. Minitab windows: worksheet, session and project ......................................................... 63
5.2.1. Worksheet Window .................................................................................... 63
5.2.2. Session Window.......................................................................................... 64
5.2.3. Minitab Project............................................................................................ 65
5.2.4. Moving between windows .......................................................................... 65
5.2.5. Understanding the interface ........................................................................ 66
5.3. The menu and their use .................................................................................................. 67
5.4. Type data ........................................................................................................................ 67
5.5. Entering and saving data ................................................................................................ 68
5.5.1. Entering the Data ........................................................................................ 68
5.5.2. Saving Minitab data .................................................................................... 69
5.6. Importing and Exporting data ........................................................................................ 70
5.6.1. Importing Data from Excel ......................................................................... 70
5.6.2. Opening a text file....................................................................................... 71
5.6.3. Export data .................................................................................................. 74
6. Descriptive Statistics Using Minitab .................................................................................. 76
7. Statistical Analysis Using Minitab and SPSS .................................................................. 105
7.1. Inferential statistics Using Minitab .............................................................................. 105
7.2. Inferential Statistics using SPSS .................................................................................. 114
7.3. Regression and Correlation .......................................................................................... 125
7.3.1. Correlation Analysis in SPSS ..................................................................... 125
7.3.2. Linear Regression ..................................................................................... 127
7.3.3. Regression Diagnostics using SPSS ......................................................... 131
7.3.3.1. Unusual and Influential data ................................................................. 132
7.3.3.3. Collinearity ............................................................................................ 165
7.3.3.4. Tests on Nonlinearity ............................................................................ 169
7.3.3.5. Model Specification .............................................................................. 175
7.3.3.6. Issues of Independence.......................................................................... 179
7.3.3.7. Summary ............................................................................................... 180
REFERENCES ...................................................................................................................... 182

1. Introduction to SPSS

1.1. What is a statistical package?


 It is a computer program or set of programs that provides many different statistical
procedures within a unified framework.
 The advantages of such packages are:
 Much easier to use
 Possible to run complex analyses without getting bogged down in the details of
computations
 The principal disadvantage of such packages is that they sometimes make doing
statistics too easy
 It is possible to apply complex procedures inappropriately
 It is possible to properly apply a procedure and then misinterpret the results
The “Statistical Package for the Social Sciences” (SPSS) is a package of programs for
manipulating, analyzing, and presenting data; the package is widely used in the social and
behavioral sciences. SPSS enables you to perform intense numerical calculations in a
fraction of the time. SPSS is frequently used in both academic and business environments. Much
of SPSS’s popularity within academia and various industries can be attributed to its capacity
for managing data sets, a functionality that represents the bulk of the work done by
professional statisticians. In addition, SPSS allows you to create, with great ease, beautiful
graphics and tabular outputs. However, despite these significant conveniences, it is important
to remember that no statistical software will relieve you of the need to think critically about
the results any software package produces.
There are three basic tasks associated with data analysis:
1. Type data into a computer, and organize and format the data so that both you and SPSS can
identify it easily.
2. Tell SPSS what type of analysis you wish to conduct.
3. Be able to interpret what the SPSS output means.

1.2. Starting SPSS


To start the SPSS application, follow the following steps:
From the Windows Start menu choose:

All Programs
SPSS for Windows
A small window will appear. This window has several choices with the following questions
and options.
What would you like to do?
• Run tutorial
• Type in Data
• Run an existing query
• Create new query using an existing database
• Open an existing data source

If you choose Type in Data, you will get the Data Editor window.

1.3. Overview of SPSS for windows


SPSS for Windows consists of five different windows, each of which is associated with a
particular SPSS file type. This document discusses the two windows most frequently used in
analyzing data in SPSS, the Data Editor and the Output Viewer windows.
DATA EDITOR
 is the window that is open at start-up and is used to enter and store data in a
spreadsheet format.
 it consists of two views, Data View and Variable View; each can be accessed by
clicking the corresponding tab at the bottom of the screen.

Data View tab Variable View tab

Data View window
Click on the Data View tab at the bottom of the screen to open the “Data View” window. The
window is simply a grid with rows and columns which displays the content of a data file.

 Each row represents a case (one individual’s data)


 Each column represents a variable whose name should appear at the top of the
column.
 The intersection between a row and a column is known as a cell. Each cell contains the
score of a particular case on one particular variable.
Note: It is good practice to define all variables first before entering data.

Variable View window


Click on the Variable View tab at the bottom of the screen to open the “Variable View” window.
The Variable View window is also a simple grid of rows and columns. This is where you
define the structures of all your variables.
There are ten fixed columns in the Variable View; these are:
 Name: is what you want the variable to be called. SPSS has rules for variable names,
such as: variable names are limited to eight characters; variable names should always
begin with a letter and should never include a full stop or space.
 Type: is the kind of information SPSS should expect for the variable. Variables come
in different types, including Numeric, String, Currency, Date, etc., but the ones that
you will probably use the most are Numeric and String (text).
 Width: The maximum number of characters to be entered for the variable.
 Decimals: This is where you specify how many decimal places you would
like SPSS to store for a variable.
 Values: Provides a method for mapping your variable values to string labels. It
is mainly used for categorical variables. For example, if you have a variable called
“Gender”, there are two acceptable values for that variable: Female or Male. You can
assign a code for each category, f for Female and m for Male, or 1 for Female and 2 for
Male.
 Missing value: It is important to define missing values; this will help you in your data
analysis. For example, you may want to distinguish data missing because a respondent

refused to answer from data missing because the question did not apply to the
respondent.
 Columns: Use this to adjust the width of the Data Editor columns, note that if the
actual width of a value is wider than the column, asterisks are displayed in the Data
View.
 Align: To change the alignment of the value in the column (left, right or centre)
 Measure: You can specify the level of measurement of the variable as scale, ordinal
or nominal.
The SPSS variable naming convention requires the following:
 Variable names should be eight characters or fewer.
 Variable names should not begin with any special character such as a
numeral, comma, or inequality symbol.
 The latest versions of SPSS can accept variable names with length greater
than 8 characters.
 Variable names in a given file must be unique; duplication is not allowed.
 Do not end variable names with an underscore or a period.
 Names are not case sensitive.
 Variable names cannot contain spaces.
To write your variable name:

 Open the Variable View window, or


 Use Ctrl+T
 Then write the name of the variable under the “Name” column

Note: To find the rules for naming variables press: Help, Topics, Index. Then enter the
phrase “variable names: rules” and press the Display button.
The Output Viewer
 The Output Viewer opens automatically when you execute an analysis or create a
graph using dialog box or command syntax to execute a procedure.
 All statistical results, tables, and charts are displayed in the Viewer. You can edit the
output and save it for later use. A Viewer window opens automatically the first time
you run a procedure that generates output.

The Output Viewer is divided into two panes. The right-hand pane contains statistical tables,
charts, and text output. The left-hand pane contains a tree structure similar to those used in
Windows Explorer, which provides an outline view of the contents.

Pivot Table Editor

 Output that is displayed in pivot tables can be modified in many ways with the Pivot
Table Editor. You can edit text, swap data in rows and columns, add color, create
multidimensional tables, and selectively hide and show results.

Syntax Editor

 A text editor where you compose SPSS commands and submit them to the SPSS
processor. All output from these commands appears in the Output Viewer.
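
For example, typing the following two lines into a Syntax Editor window and running them
produces the same output as choosing Analyze, Descriptive Statistics, Descriptives from the
menus (a minimal sketch, using the age and income variables defined later in Section 1.5):

* Descriptive statistics for two numeric variables.
DESCRIPTIVES VARIABLES=age income.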

Chart Editor

 You can modify high-resolution charts and plots in chart windows. You can change the
colors, select different type fonts or sizes, switch the horizontal and vertical axes, rotate
3-D scatter plots, and even change the chart type.

1.4. The menus and their use
Each window in SPSS has its own menus. The common menus are:
 File (new, open, save, save as, etc,…)
 Edit (undo, redo, cut, copy, insert cases/variables, etc,…)
 View (value labels, etc,…)
 Analyses (descriptive statistics, tables, compare means, correlate, regression, etc)
 Graphs (bar, pie, scatter plot, histogram, etc)
 Window (split, minimize the window, etc)
 Help (topics, tutorial, etc)
Data Editor Menus
 The menu bar provides easy access to most SPSS features. It consists of ten drop-down
menus.

Data Editor Toolbar


 Clicking once on any of these buttons allows you to perform an action, such as opening
a data file or selecting a chart for editing.
1.5. Entering and saving data in SPSS
Entering the Data
In Data View, you enter your data just as you would in a spreadsheet program. You can move
from cell to cell with the arrow keys on your keyboard or by clicking on the cell with the
mouse.
 You may also use the Up, Down, Left, and Right arrow keys to enter values and move
to another cell for data input.
 To edit existing data points (i.e., change a specific data value), click in the cell, type in
the new value, and press the Tab, Enter, Up, Down, Right, or Left arrow key.
 Once one case (row) is complete, begin entering another case at the beginning of the
next row.
 You can delete a row of data by clicking on the row number at the far left and pressing
the Delete key on your keyboard.
 In a similar fashion, you delete a variable (column) by clicking on the variable name
so that the entire column is highlighted and pressing the Delete key.
In the steps that follow, we will see how to type in data by defining different variable types.
Click the Variable View tab at the bottom of the Data Editor window. Define the variables
that are going to be used. In our case, let us consider three variables, namely age, marital
status, and income.
 In the first row of the first column, type age.
 In the second row, type marital.
 In the third row, type income.

New variables are automatically given a numeric data type. If you don't enter variable names,
unique names are automatically created. However, these names are not descriptive and are not
recommended for large data files.

Click the Data View tab to continue entering the data.


The names that you entered in Variable View are now the headings for the first three columns
in Data View.
Begin entering data in the first row, starting at the first column.
In the age column, type 55.
In the marital column, type 1.
In the income column, type 72000.
Move the cursor to the first column of the second row to add the next subject's data.
In the age column, type 53.
In the marital column, type 0.
In the income column, type 153000.
Currently, the age and marital columns display decimal points, even though their values are
intended to be integers. To hide the decimal points in these variables:
Click the Variable View tab at the bottom of the Data Editor window.
Select the Decimals column in the age row and type 0 to hide the decimal.
Select the Decimals column in the marital row and type 0 to hide the decimal.
Non-numeric data, such as strings of text, can also be entered into the Data Editor.
Click the Variable View tab at the bottom of the Data Editor window.
In the first cell of the first empty row, type sex for the variable name.
Click the Type cell.
Click the button in the Type cell to open the Variable Type dialog box.
Select String to specify the variable type.
Click OK to save your changes and return to the Data Editor.
In addition to defining data types, you can also define descriptive variable and value labels for
variable names and data values. These descriptive labels are used in statistical reports and
charts. Labels can be up to 256 characters long. These labels are used in your output to
identify the different variables.
Click the Variable View tab at the bottom of the Data Editor window.
In the Label column of the age row, type Respondent's Age.

In the Label column of the marital row, type Marital Status.
In the Label column of the income row, type Household Income.
In the Label column of the sex row, type Gender.
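
The same labels can be assigned in a single syntax command; a sketch, assuming the
variable names defined above:

VARIABLE LABELS
  age "Respondent's Age"
  marital "Marital Status"
  income "Household Income"
  sex "Gender".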
The Type column displays the current data type for each variable. The most common are
numeric and string, but many other formats are supported.
In the current data file, the income variable is defined as a numeric type.
Click the Type cell for the income row, and then click the button to open the Variable Type
dialog box.
Select Dollar in the Variable Type dialog box. The formatting options for the currently
selected data type are displayed. Select the format of this currency. For this example, select
$###,###,###.
Click OK to save your changes.
Value labels provide a method for mapping your variable values to a string label. In the case
of this example, there are two acceptable values for the marital variable. A value of “0”
means that the subject is single and a value of “1” means that he or she is married.
Click the values cell for the marital row, and then click the button to open the Value
Labels dialog box.
The value is the actual numeric value.
The value label is the string label applied to the specified numeric value.
Type “0” in the value field.
Type “Single” in the Value Label field.
Click Add to add this label to the list.
Repeat the process, this time typing 1 in the value field and Married in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
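
The equivalent syntax for these value labels is a one-line sketch:

VALUE LABELS marital 0 'Single' 1 'Married'.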
These labels can also be displayed in Data View, which can help to make your data more
readable.
Click the Data View tab at the bottom of the Data Editor window.
From the menus choose:

View
Value Labels
The labels are now displayed in a list when you enter values in the Data Editor. This has the
benefit of suggesting a valid response and providing a more descriptive answer.
Adding Value Labels for String Variables
String variables may require value labels as well. For example, your data may use single
letters, M or F, to identify the sex of the subject.
Value labels can be used to specify that M stands for Male and F stands for Female.
Click the Variable View tab at the bottom of the Data Editor window.
Click the Values cell in the sex row, and then click the button to open the Value Labels dialog
box.
Type F in the value field, and then type Female in the Value Label field.
Click Add to add this label to your data file.
Repeat the process, this time typing M in the Value field and Male in the Value Label field.
Click Add, and then click OK to save your changes and return to the Data Editor.
Because string values are case sensitive, you should make sure that you are consistent. A
lowercase m is not the same as an uppercase M.
In a previous example, we chose to have value labels displayed rather than the actual data by
selecting Value Labels from the View menu. You can use these values for data entry.
Click the Data View tab at the bottom of the Data Editor window. In the first row, select the
cell for sex and select Male from the drop-down list.
In the second row, select the cell for sex and select Female from the drop-down list. Only
defined values are listed, which helps to ensure that the data entered are in a format that you
expect.
Handling Missing Data
Missing or invalid data are generally too common to ignore. Survey respondents may refuse
to answer certain questions, may not know the answer, or may answer in an unexpected
format.
If you don't take steps to filter or identify these data, your analysis may not provide accurate
results.
For numeric data, empty data fields or fields containing invalid entries are handled by
converting the fields to system missing, which is identifiable by a single period.

The reason a value is missing may be important to your analysis. For example, you may find it
useful to distinguish between those who refused to answer a question and those who didn't
answer a question because it was not applicable.
Click the Variable View tab at the bottom of the Data Editor window. Click the Missing cell
in the age row, and then click the button to open the Missing Values dialog box. In this
dialog box, you can specify up to three distinct missing values, or a range of values plus one
additional discrete value.
Select Discrete missing values. Type 999 in the first text box and leave the other two
empty.
Click OK to save your changes and return to the Data Editor. Now that the missing data value
has been added, a label can be applied to that value. Click the Values cell in the age row, and
then click the button to open the Value Labels dialog box.
Type 999 in the Value field. Type No Response in the Value Label field. Click Add to add
this label to your data file. Click OK to save your changes and return to the Data Editor.
Missing values for string variables are handled similarly to those for numeric values.
Unlike numeric values, empty fields in string variables are not designated as system missing.
Rather, they are interpreted as an empty string. Click the Variable View tab at the bottom of
the Data Editor window.
Click the Missing cell in the sex row, and then click the button to open the Missing
Values dialog box. Select Discrete missing values. Type NR in the first text box.
Missing values for string variables are case sensitive, so a value of “nr” is not treated as a
missing value.
Click OK to save your changes and return to the Data Editor. Now you can add a label for the
missing value. Click the Values cell in the sex row, and then click the button to open the
Value Labels dialog box. Type NR in the Value field. Type “No Response” in the Value
Label field. Click Add to add this label to your project. Click OK to save your changes and
return to the Data Editor.
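
Both missing-value definitions above, together with their labels, can also be written as
syntax; a sketch:

MISSING VALUES age (999).
ADD VALUE LABELS age 999 'No Response'.
MISSING VALUES sex ('NR').
ADD VALUE LABELS sex 'NR' 'No Response'.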
Once you've defined variable attributes for a variable, you can copy these attributes and
apply them to other variables.
In Variable View, type agewed in the first cell of the first empty row. In the Label column,
type Age Married. Click the Values cell in the age row.
From the menus choose:

Edit
Copy
Click the Values cell in the agewed row
From the menus choose:
Edit
Paste
The defined values from the age variable are now applied to the agewed variable. To apply the
attribute to multiple variables, simply select multiple target cells (click and drag down the
column).
When you paste the attribute, it is applied to all of the selected cells. New variables are
automatically created if you paste the values into empty rows.
You can also copy all of the attributes from one variable to another. Click the row number in
the marital row.
From the menus choose:
Edit
Copy
Click the row number of the first empty row.
From the menus choose:
Edit
Paste
All of the attributes of the marital variable are applied to the new variable.
For categorical (nominal, ordinal) data, Define Variable Properties can help you define
value labels and other variable properties. Define Variable Properties:
 Scans the actual data values and lists all unique data values for each selected variable.
 Identifies unlabeled values and provides an "auto-label" feature.
 Provides the ability to copy defined value labels from another variable to the selected
variable or from the selected variable to multiple additional variables.
This example uses the data file demo.sav. This data file already has defined value labels; so
before we start, let's enter a value for which there is no defined value label:
In Data View of the Data Editor, click the first data cell for the variable ownpc (you may have
to scroll to the right) and enter the value 99.
From the menus choose:

Data
Define Variable Properties...
In the initial Define Variable Properties dialog box, you select the nominal or ordinal
variables for which you want to define value labels and/or other properties.
Since Define Variable Properties relies on actual values in the data file to help you make good
choices, it needs to read the data file first. This can take some time if your data file contains a
very large number of cases, so this dialog box also allows you to limit the number of cases to
read, or scan.
Limiting the number of cases is not necessary for our sample data file. Even though it contains
over 6,000 cases, it doesn't take very long to scan that many cases.
Drag and drop Owns computer [ownpc] through Owns VCR [ownvcr] into the Variables to
Scan list.
You might notice that the measurement level icons for all of the selected variables indicate
that they are scale variables, not categorical variables. By default, all numeric variables are
assigned the scale measurement level, even if the numeric values are actually just codes that
represent categories.
All of the selected variables in this example are really categorical variables that use the
numeric values 0 and 1 to stand for No and Yes, respectively--and one of the variable
properties that we'll change with Define Variable Properties is the measurement level.
Click Continue.
In the Scanned Variable List, select ownpc. The current level of measurement for the selected
variable is scale. You can change the measurement level by selecting one from the drop-down
list or you can let Define Variable Properties suggest a measurement level.
Click Suggest.
Since the variable doesn't have very many different values and all of the scanned cases contain
integer values, the proper measurement level is probably ordinal or nominal.
Select Ordinal and then click Continue.
The measurement level for the selected variable is now ordinal.
The Value Labels grid displays all of the unique data values for the selected variable, any
defined value labels for these values, and the number of times (count) each value occurs in the
scanned cases.

The value that we entered, 99, is displayed in the grid. The count is only 1 because we
changed the value for only one case, and the Label column is empty because we haven't
defined a value label for 99 yet.
An X in the first column of the Scanned Variable List also indicates that the selected variable
has at least one observed value without a defined value label.
In the Label column for the value of 99, enter No answer.
Then click (check) the box in the Missing column. This identifies the value 99 as user
missing. Data values specified as user missing are flagged for special treatment and are
excluded from most calculations.
Before we complete the job of modifying the variable properties for ownpc, let's apply the
same measurement level, value labels, and missing values definitions to the other variables in
the list. In the Copy Properties group, click To Other Variables.
In the Apply Labels and Level to dialog box, select all of the variables in the list, and then
click Copy. If you select any other variable in the list in the Define Variable Properties main
dialog box now, you'll see that they are all now ordinal variables, with a value of 99 defined
as user missing and a value label of No answer. Click OK to save all of the variable properties
that you have defined. By doing so, we copied the property of the ownpc variable to the other
five selected variables.
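
Behind the scenes, Define Variable Properties simply pastes ordinary syntax; the result is
roughly the following sketch (the TO keyword assumes ownpc through ownvcr are adjacent
in the demo.sav file):

VARIABLE LEVEL ownpc TO ownvcr (ORDINAL).
ADD VALUE LABELS ownpc TO ownvcr 99 'No answer'.
MISSING VALUES ownpc TO ownvcr (99).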
Exercise-1: The following small data set consists of four variables, namely Agecat, Gender,
Accid, and Pop.
Where: Agecat is a categorical variable created for age:
1 = ‘Under 21’, 2 = ‘21-25’, and 3 = ‘26-30’
Gender: 0 = ‘Male’ and 1 = ‘Female’
Accid and Pop are numeric.
After defining these variables in a Data Editor window, enter the following data for the
variables Agecat, Gender, Accid, and Pop, respectively. Your data should appear as given
below. Save the data set as trial1.sav.
1 1 57997 198522
2 1 57113 203200
3 1 54123 200744
1 0 63936 187791
2 0 64835 195714
3 0 66804 208239
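As an alternative to typing the values into the Data Editor, the same data set can be created
and saved with syntax; a sketch:

DATA LIST FREE / agecat gender accid pop.
BEGIN DATA
1 1 57997 198522
2 1 57113 203200
3 1 54123 200744
1 0 63936 187791
2 0 64835 195714
3 0 66804 208239
END DATA.
SAVE OUTFILE='trial1.sav'.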

Exercise-2: Create a data set called Trial2.sav from the following data. The data set has the
following variables:
I. Subject: numeric, width = 2, right aligned, and columns = 8
II. Anxiety: numeric, width = 2, right aligned, and columns = 8
III. Tension: numeric, width = 2, right aligned, and columns = 8
IV. Score: numeric, width = 2, right aligned, and columns = 8
V. Trial: numeric, width = 2, right aligned, and columns = 8
In addition, there are no value labels for any of the above variables. After completing the
definition of the above variables, type the following data into your Data Editor window so
that your data appears as given below.
1 1 1 18 1
1 1 1 14 2
1 1 1 12 3
1 1 1 6 4
2 1 1 19 1
2 1 1 12 2
2 1 1 8 3
2 1 1 4 4
3 1 1 14 1
3 1 1 10 2
3 1 1 6 3
3 1 1 2 4
4 1 2 16 1
4 1 2 12 2
4 1 2 10 3
4 1 2 4 4
5 1 2 12 1
5 1 2 8 2
5 1 2 6 3
5 1 2 2 4
6 1 2 18 1
6 1 2 10 2

6 1 2 5 3
6 1 2 1 4
7 2 1 16 1
7 2 1 10 2
7 2 1 8 3
7 2 1 4 4
8 2 1 18 1
8 2 1 8 2
8 2 1 4 3
8 2 1 1 4
9 2 1 16 1
9 2 1 12 2
9 2 1 6 3
9 2 1 2 4
10 2 2 19 1
10 2 2 16 2
10 2 2 10 3
10 2 2 8 4
11 2 2 16 1
11 2 2 14 2
11 2 2 10 3
11 2 2 9 4
12 2 2 16 1
12 2 2 12 2
12 2 2 8 3
Exercise-3: Given below is an example of a questionnaire. Suppose you have information
from several such questionnaires. Prepare a data entry format that will help you enter your
data into SPSS.
Examples of questionnaire Design
Name ____________________________________________________________
Age ______________ Sex ________________________________
City __________________________________________________________________

Marital Status □ Married □ Single
Family Type □ Joint □ Nuclear
Family Members □ Adults □ Children
Family Income □ Less than 10,000 □ 10,000 to 15,000
□ 15,000-20,000 □ More than 20,000
Date:______________
Place:______________
1. What kind of food do you normally eat at home?
□ North Indian □ South Indian □ Chinese □ Continental
2. How frequently do you eat out?
In a week □ once □ Twice □ Thrice □ More than thrice
3. You usually go out with:
□ Family □ Friends □ Colleagues □ Others _______________
4. Is there any specific day when you go out?
□ Weekdays □ Weekends □ Holidays □ Special occasions
□ No specific days
5. You generally go out for
□ Lunch □ Snacks □ Dinner □ Party/Picnics
6. Where do you usually go?
□ Restaurant □ Chinese Joint □ Fast food joint □ Others __________
7. Who decides on the place to go?
□ Husband □ Wife □ Children □ Others ______________
8. How much do you spend on eating out (one time)?
□ Below 200 □ 200-500 □ 500-800 □ More than 800
9. What do you normally order?
□ Pizza □ Burgers □ Curries and Breads □ Pasta
10. The price paid by you for the above is
10.1 Pizza: □ Very high □ A little bit high □ Just right
10.2 Burgers: □ Very high □ A little bit high □ Just right
10.3 Curries and Breads: □ Very high □ A little bit high □ Just right
10.4 Soups: □ Very high □ A little bit high □ Just right
10.5 Pasta: □ Very high □ A little bit high □ Just right

1.6. Data Importing From Microsoft Excel and ASCII files
Data can be directly entered in SPSS (as seen above), or a file containing data can be opened
in the Data Editor. From the menu in the Data Editor window, choose the following menu
options.
File
Open...
 If the file you want to open is not an SPSS data file, you can often use the Open menu
item to import that file directly into the Data Editor.
 If a data file is not in a format that SPSS recognizes, then try using the software
package in which the file was originally created to translate it into a format that can
be imported into SPSS.
Importing Data from Excel Files
Data can be imported into SPSS from Microsoft Excel with relative ease. If you are working
with a spreadsheet in another software package, you may want to save your data as an Excel
file, then import it into SPSS.
To open an Excel file, select the following menu options from the menu in the Data Editor
window in SPSS.
File
Open...
First, select the desired location on disk using the Look in option. Next, select Excel from the
Files of type drop-down menu. The file you saved should now appear in the main box in the
Open File dialog box. You can open it by double-clicking on it. You will see one more dialog
box which appears as follows.

This dialog box allows you to select a spreadsheet from within the Excel Workbook.
The drop-down menu in the example shown above offers two sheets from which to choose.
As SPSS only operates on one spreadsheet at a time, you can only select one sheet from this
menu.
This box also gives you the option of reading variable names from the Excel Workbook
directly into SPSS.
Click on the Read variable names box to read in the first row of your spreadsheet as the
variable names.
If the first row of your spreadsheet does indeed contain the names of your variables and you
want to import them into SPSS, these variables names should conform to SPSS variable
naming conventions (eight characters or fewer, not beginning with any special characters).
You should now see data in the Data Editor window. Check to make sure that all variables
and cases were read correctly. Next, save your dataset in SPSS format by choosing the Save
option in the File menu.
Example: Import an Excel data set called book1.xls into the SPSS Data Editor window from
the desktop.
The procedure is as follows:
File
Open... Data
After you select Data, you will see a window with the header “Open File”. On the same
window, select the desktop using the Look in option.
Then select Excel (*.xls) from the file type drop-down menu. Then another small window will
appear. In this window you may see that there is only one worksheet. Now if the first row of
the book1.xls data set has variable names, then you select the option “Read variable names
from the first row of the data”. Subsequently, SPSS will consider the elements of the first row
as variables. If the first row of book1.xls does not contain variable names, then leave the option
unselected, and SPSS will treat the elements of the first row as data values.
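The same import can be done in one step with the GET DATA command; a sketch (the file
path and sheet name here are illustrative assumptions):

GET DATA
  /TYPE=XLS
  /FILE='C:\Desktop\book1.xls'
  /SHEET=NAME 'Sheet1'
  /READNAMES=ON.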
Importing data from ASCII files
Data are often stored in an ASCII file format, alternatively known as a text or flat file format.
Typically, columns of data in an ASCII file are separated by a space, tab, comma, or some
other character. To import text files to SPSS we have two wizards to consider:

 Read Text Data: If you know that your data file is an ASCII file, then you can open
the data file by opening the Read Text Data Wizard from the File menu. The Text
Import Wizard will first prompt you to select a file to import. After you have selected a file,
you will go through a series (about six steps) of dialog boxes that will provide you with
several options for importing data.
Once we are through with importing the data, we need to check it for accuracy. It is also
necessary to save a copy of the dataset in SPSS format by selecting the Save or Save As
options from the File menu.
 Open Data: The second option to read an ASCII file into SPSS is by using the
File, Open, Data option.
File
Open... Data
After you select Data, you will see a dialogue box with the header “Open File”. On the same
window, select the desktop using the Look in option.
Then select Text (*.txt) from the file type drop-down menu. Select the file and click on the
Open button. A series of dialog boxes will follow.
Exercise: Suppose there is a text file named mychap1 on the desktop under the subdirectory
training. Import this file to SPSS. Also name the first variable as X and the second as Y.
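A syntax sketch for this exercise (the full path is an assumption based on the description;
DATA LIST FREE reads whitespace-separated values):

DATA LIST FILE='C:\training\mychap1.txt' FREE / X Y.
LIST.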

2. Modifying and Organizing Data

2.1. Retrieving data


We can retrieve a .sav data file from any directory on our personal computer, from a floppy
disk, or from any other removable disk.
To retrieve a data file from a floppy disk, we select from the application window menu bar
File
Open
Data
You will see the Open Data File dialogue box. Assuming the data file you want is on the
floppy disk and has been saved previously by SPSS, open the drives drop-down list and click
on the icon for drive A:. All the files on drive A ending with the .sav extension will be listed
in the files list. Click on the name of the file you want to retrieve, and it will appear in the file
name box. Click on the OK button on the right-hand side of the dialogue box. The file will
then be put into the Data Editor window, and its name will be the title of that window.
Assume the data file you want is on the hard disk and has been saved previously by SPSS
under the directory Program Files. Open the drives drop-down list and click on the icon for
Program Files. All the files under Program Files ending with the .sav extension will be listed in
the files list. Click on the name of the file you want to retrieve, and it will appear in the file
name box. Click on the OK button on the right-hand side of the dialogue box. The file will
then be put into the Data Editor window, and its name will be the title of that window.

2.2. Inserting cases and variables


You may want to add new variables or cases to an existing dataset. The Data Editor provides
menu options that allow you to do that. For example, you may want to add data about
participants' ages to an existing dataset.
To insert a new variable, click on the variable name to select the column in which the
variable is to be inserted.

To insert a case, select the row in which the case is to be added by clicking on the row's
number. Clicking on either the row's number or the column's name will result in that row or
column being highlighted. Next, use the insert options available in the Data menu in the
Data Editor:

Data
Insert Variable
Insert case
If a row has been selected, choose Insert Case from the Data menu; if a column has been
selected, choose Insert Variable. This will produce an empty row or column in the
highlighted area of the Data Editor. The existing cases and variables will be shifted down or
to the right, respectively.

2.3. Deleting cases or variables


You may want to delete cases or variables from a dataset. To do that, select a row or column
by highlighting it as described above. Next, use the Delete key to delete the highlighted area,
or use the Delete option in the Edit menu.

2.4. Transforming Variables with the Compute Command


In the Data Editor, you can use the COMPUTE or the RECODE command to create new
variables from existing variables.
The COMPUTE option allows you to arithmetically combine or alter variables and place the
resulting value under a new variable name. As an example, to calculate the area of shapes
based on their height and width, you compute a new variable "area" by multiplying "height"
and "width". See below.

The new variable created is area. This is specified under target variable. This target variable is
the product of the two existing variables height and width.

Another example may be a dataset that contains employees' salaries in terms of their
beginning and current salaries. Our interest is in the difference between starting salary and
present salary. A new variable could be computed by subtracting the starting salary from the
present salary. See the dialogue box below.
Transform
Compute...

In other situations, you may also want to transform an existing variable. For example, if data
were entered as months of experience and you wanted to analyze data in terms of years on the
job, then you could re-compute that variable to represent experience on the job in numbers of
years by dividing number of months on the job by 12.
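
Each of these examples is a single COMPUTE command in syntax; a sketch (salbegin and
salary are the beginning and current salary variables in the Employee data set, and months is
a hypothetical variable):

COMPUTE area = height * width.
COMPUTE saldiff = salary - salbegin.
COMPUTE years = months / 12.
EXECUTE.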

2.5. Transforming Variables with the Recode Command


The RECODE option allows you to create discrete categories from continuous variables. As
an example, you may want to change the height variable where values can range from 0 to
over 100 into a variable that only contains the categories tall, medium, and short. We have to
pass through the following steps.
 Select Transform/Recode/Into Different Variables.
 A list of variables in the active data set will appear. Select the variable you wish to
change by clicking once on the variable name and clicking the arrow button.
 Click the output box and enter a new variable name (8 characters maximum) and click
Change.
See the figure below. The variable to be recoded is the height.

NOTE: In dialog boxes that are used for mathematical or statistical operations, only those
variables that you defined as numeric will be displayed. String variables will not be displayed
in the variable lists.

Now height_b is the new variable that will be obtained after recoding. The variable label for
the new variable is “Height variable recoded”.
 Select OLD AND NEW VALUES. This box presents several recoding options. You
identify one value or a range of values from the old variable and indicate how these
values will be coded in the new variable.
 After identifying one value category or range, enter the value for the new variable in
the New Value box. In our example, the old values might be 0 through 10, and the
new value might be 1 (the value label for 1 would be "short", for 2 "medium", for 3
"tall").
 Click ADD and repeat the process until each value of the new variable is properly
defined.
(See the figure below: Recode: Old and New Values.)

Caution: You also have the option of recoding a variable into the same name. If you did this
in the height example, the working data file would change all height data to the three
categories (a value of 1 for "short", 2 for "medium", or 3 for "tall"). If you save this file with
the same name, you will lose all of the original height data. The best way to avoid this is to
always use the recode option that creates a different variable. Saving the data file then keeps
the original height data intact while adding the new categorized variable to the data set for
future use.
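The whole recode, including labels, can be written as syntax; a sketch (only 0 through 10 =
"short" is specified above, so the remaining cut points are illustrative assumptions):

RECODE height (0 THRU 10=1) (10 THRU 20=2) (20 THRU HIGHEST=3) INTO height_b.
VARIABLE LABELS height_b 'Height variable recoded'.
VALUE LABELS height_b 1 'short' 2 'medium' 3 'tall'.
EXECUTE.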
Using the IF statement in the Data Editor
 The IF statement is an option to use within the Compute or Recode command. You can
choose to recode values only if one of your variables satisfies a condition of your choice.
This condition, which is captured by means of the "IF" command, can be simple (such
as "if area = 15"). To create more sophisticated conditions, you can employ logical
transformations using AND, OR, and NOT. The procedure is as given below.
 In the Compute and Recode dialog boxes, click on the IF button.
 The Include If Case Satisfies Condition dialog box pops up (see the figure below).
 Select the variable of interest and click the arrow button.
 Use the key pad provided in the dialog box or type in the appropriate completion of
the IF statement.
 When the IF statement is complete, click CONTINUE.
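
In syntax, the same conditional logic is written with the IF command; a minimal sketch
(flag is a hypothetical target variable):

COMPUTE flag = 0.
IF (area = 15) flag = 1.
EXECUTE.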

2.5.1. Banding Values
Banding is taking two or more contiguous values and grouping them into the same category.
The data you start with may not always be organized in the most useful manner for your
analysis or reporting needs. For example, you may want to:
 Create a categorical variable from a scale variable.
 Combine several response categories into a single category.
 Create a new variable that is the computed difference between two existing variables.
 Calculate the length of time between two dates.
Once again we use the data file demo.sav.
Several categorical variables in the data file demo.sav are, in fact, derived from scale variables
in that data file. For example, the variable inccat is simply income grouped into four
categories.
This categorical variable uses the integer values 1–4 to represent the following income
categories: less than 25, 25–49, 50–74, and 75 or higher.
To create the categorical variable inccat:
From the menus in the Data Editor window choose:
Transform
Visual Bander...
In the initial Visual Bander dialog box, you select the scale and/or ordinal variables for which
you want to create new, banded variables. Banding is taking two or more contiguous values
and grouping them into the same category.
Since the Visual Bander relies on actual values in the data file to help you make good banding
choices, it needs to read the data file first. Since this can take some time if your data file
contains a large number of cases, this initial dialog box also allows you to limit the number of
cases to read ("scan").
This is not necessary for our sample data file. Even though it contains more than 6,000 cases,
it does not take long to scan that number of cases.
Drag and drop Household income in thousands [income] from the Variables list into the
Variables to Band list, and then click Continue.
In the main Visual Bander dialog box, select Household income [in thousands] in the Scanned
Variable List.

A histogram displays the distribution of the selected variable (which in this case is highly
skewed). Enter inccat2 for the new banded variable name and Income category (in thousands)
for the variable label.
Click Make Cutpoints.
Select Equal Width Intervals.
Enter 25 for the first cut-point location, 3 for the number of cut-points, and 25 for the width.
The number of banded categories is one greater than the number of cut-points. So, in this
example, the new banded variable will have four categories, with the first three categories
each containing ranges of 25 (thousand) and the last one containing all values above the
highest cut-point value of 75 (thousand).
Click Apply.
The values now displayed in the grid represent the defined cut-points, which are the upper
endpoints of each category. Vertical lines in the histogram also indicate the locations of the
cut-points. By default, these cut-point values are included in the corresponding categories. For
example, the first value of 25 would include all values less than or equal to 25. But in this
example, we want categories that correspond to less than 25, 25–49, 50–74, and 75 or higher.
In the Upper Endpoints group, select Excluded (<).
Then click Make Labels.
This automatically generates descriptive value labels for each category. Since the actual
values assigned to the new banded variable are simply sequential integers starting with 1, the
value labels can be very useful.
You can also manually enter or change cut-points and labels in the grid, change cut-point
locations by dragging and dropping the cut-point lines in the histogram, and delete cut-points
by dragging cut-point lines off of the histogram. Click OK to create the new, banded
variable.
The new variable is displayed in the Data Editor. Since the variable is added to the end of the
file, it is displayed in the far right column in Data View and in the last row in Variable View.
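Clicking OK makes the Visual Bander paste and run ordinary recode syntax; the result is
roughly equivalent to the following sketch (the .999 upper endpoints approximate the
Excluded (<) rule for this continuous variable):

RECODE income
  (LOWEST THRU 24.999=1) (25 THRU 49.999=2) (50 THRU 74.999=3) (75 THRU HIGHEST=4)
  INTO inccat2.
VARIABLE LABELS inccat2 'Income category (in thousands)'.
VALUE LABELS inccat2 1 'Less than 25' 2 '25 - 49' 3 '50 - 74' 4 '75 or higher'.
EXECUTE.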
Sorting Cases
Sorting cases allows you to organize rows of data in ascending or descending order on the
basis of one or more variables. For instance, consider once again the Employee data set.
Suppose we are interested in sorting the data based on the variable “Jobcat”, which refers to
the category of employment. The procedure for sorting is as follows:
Data
Sort Cases...
A small dialog box with the header Sort Cases will pop up. This dialogue box has a few
options. If you choose the ascending option in the dialogue box and click OK, your data will
be sorted by Jobcat. All of the cases coded as job category 1 appear first in the dataset,
followed by all of the cases that are coded 2 and 3, respectively.
The data could also be sorted by more than one variable. For example, within job category,
cases could be listed in order of their salary. Again we can choose
Data
Sort Cases...
In the small dialogue box, select the variable jobcat followed by salary. The dialogue
box comes into view as follows.

To choose whether the data are sorted in ascending or descending order, select the appropriate
button. Let us choose ascending so that the data are sorted in ascending order of magnitude
with respect to the values of the selected variables. The hierarchy of such a sorting is

determined by the order in which variables are entered in the Sort by box. Data are sorted by
the first variable entered, and then sorting takes place by the next variable within that first
variable. In our case, jobcat was the first variable entered, followed by salary, so the data
would first be sorted by job category, and then, within each of the job categories, data would
be sorted by salary.
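
The equivalent syntax is a single command, with (A) for ascending or (D) for descending
order:

SORT CASES BY jobcat (A) salary (A).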
Merging Files:
We can merge files in two different ways. The first option is “add variables” and the
second is “add cases”.
Add variables: The Add Variables option adds new variables on the basis of key variables
that are common to both files. In this case, we need to have two data files. Each case in the
one file corresponds to one case in the other file. In both files each case has an identifier, and
the identifiers match across cases. We want to match up records by identifiers. First, we must
sort the records in each file by the identifier. This can be done by clicking Data, Sort Cases,
selecting the identifier into the “Sort by” box, and clicking OK.
Example: Given below, we have a file containing dads and a file containing faminc. We
would like to merge the files together so that we have the dads observation on the same line
with the faminc observation, based on the key variable famid. The procedure to merge the
two files is as follows:
 First sort both data sets by famid.
 Retrieve the dads data set into the Data Editor window.
 Select Data, Merge Files, Add Variables, and select the file faminc.
dads
famid name inc
2 Art 22000
1 Bill 30000
3 Paul 25000
faminc
famid faminc96 faminc97 faminc98
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
After merging the dads and faminc, the data would look like the following.

famid name inc faminc96 faminc97 faminc98
1 Bill 30000 40000 40500 41000
2 Art 22000 45000 45400 45800
3 Paul 25000 75000 76000 77000
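In syntax, this one-to-one merge is done with MATCH FILES; a sketch, assuming the two
files are saved as dads.sav and faminc.sav and that faminc.sav is already sorted by famid:

GET FILE='dads.sav'.
SORT CASES BY famid.
MATCH FILES /FILE=* /FILE='faminc.sav' /BY famid.
EXECUTE.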
Add variables (one to many)
The next example considers a one to many merge where one observation in one file may have
multiple matching records in another file. Imagine that we had a file with dads like we saw in
the previous example, and we had a file with kids where a dad could have more than one kid.
It is clear why this is called a one to many merge since we are matching one dad observation
to one or more (many) kids observations. Remember that the dads file is the file with one
observation, and the kids file is the one with many observations. Below, we create the data
file for the dads and for the kids.
Dads data set
Famid Name Inc
2 Art 22000
1 Bill 30000
3 Paul 25000
Kids data set
Famid Kid’s name birth age wt sex
1 Beth 1 9 60 f
1 Bob 2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al 2 6 50 m
2 Ann 3 2 20 f
3 Pete 1 6 60 m
3 Pam 2 4 40 f
3 Phil 3 2 20 m
To merge the two data sets, we follow the steps indicated below.
 Sort the data set dads by famid and save that file as dads2.
 Sort the data set kids by famid and save that file as kids2.
 Retrieve the data set kids2 into the Data Editor window.
 Select Data, Merge Files, Add Variables.
 From the dialogue box, select the file dads2.
 Another dialogue box will appear. In this dialogue box, select the option “Match
cases on key variables in sorted files”.
 Select “External file is keyed table” and choose famid as the key variable.
 Click OK.
The Data Editor window will appear as given below.
FAMID KIDNAME BIRTH AGE WT SEX NAME INC
1.00 Beth 1.00 9.00 60.00 f Bill 30000.00
1.00 Bob 2.00 6.00 40.00 m Bill 30000.00
1.00 Barb 3.00 3.00 20.00 f Bill 30000.00
2.00 Andy 1.00 8.00 80.00 m Art 22000.00
2.00 Al 2.00 6.00 50.00 m Art 22000.00
2.00 Ann 3.00 2.00 20.00 f Art 22000.00
3.00 Pete 1.00 6.00 60.00 m Paul 25000.00
3.00 Pam 2.00 4.00 40.00 f Paul 25000.00
3.00 Phil 3.00 2.00 20.00 m Paul 25000.00
We can also retrieve the data set dads2 into the Data Editor window and perform steps 4 to 6 for the
file kids2. This time select “Working file is keyed table” and choose famid as the key variable.
The data editor window will appear as given below.
FAMID NAME INC KIDNAME BIRTH AGE WT SEX
1 Bill 30000 Beth 1 9 60 f
1 Bill 30000 Bob 2 6 40 m
1 Bill 30000 Barb 3 3 20 f
2 Art 22000 Andy 1 8 80 m
2 Art 22000 Al 2 6 50 m
2 Art 22000 Ann 3 2 20 f
3 Paul 25000 Pete 1 6 60 m
3 Paul 25000 Pam 2 4 40 f
3 Paul 25000 Phil 3 2 20 m
Here, the correct choice of keyed table is what gives us correct results.
The key difference between a one-to-one merge and a one-to-many merge is that you need to
correctly identify the keyed table. That is, we have to identify which file plays the role of
“one” (in one to many); that file should be chosen as the keyed table. In the above example the
keyed table is dads2, not kids2.
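In command syntax, the keyed table is named on the /TABLE subcommand of MATCH FILES. A sketch of the merge above, assuming kids2 is the active file and the dads data are saved as dads2.sav (file name assumed):

MATCH FILES /FILE=* /TABLE='dads2.sav' /BY famid.
EXECUTE.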
Merging files (add cases option)
The Add Cases option combines two files with different cases that have the same variables.
To merge files in this option we should follow the following procedures.
Data … Merge Files … Add Cases
All variables should be listed under the small window “New working data file”. Click
OK to complete the merge.
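The syntax equivalent is ADD FILES; a minimal sketch, assuming the file with the additional cases is saved as newcases.sav (a hypothetical name):

ADD FILES /FILE=* /FILE='newcases.sav'.
EXECUTE.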
2.6. Keeping and dropping of cases
Selecting Cases
You can analyze a specific subset of your data by selecting only the cases in which you
are interested. For example, you may want to do a particular analysis only on employees
who have been with the company for more than six years. This can be done by using
the Select Cases menu option, which will either temporarily or permanently remove the cases you
do not want from the dataset. The Select Cases option (or Alt+D+C) is available under the
Data menu item:
Data
Select Cases...
Selecting this menu item will produce the following dialog box. This box contains a list of the
variables in the active data file on the left and several options for selecting cases on the right.

The portion of the dialog box labeled “Unselected Cases Are” gives us the option of
temporarily or permanently removing data from the dataset.
 If the “Filtered” option is selected, the unselected cases will be excluded from subsequent
analyses until the “All Cases” option is reset.
 If the “Deleted” option is selected, the unselected cases will be removed from the
working dataset. If the dataset is subsequently saved, these cases will be permanently
deleted.
Selecting one of these options will produce a second dialog box that prompts us for the particular
specification in which we are interested. For example, if we choose the “If condition is
satisfied” option and click on the If button, a second dialog box will appear, as
shown below.

The above example selects all of the cases in the dataset that meet a specific criterion:
employees that have worked at the company for greater than six years (72 months) will be
selected. After this selection has been made, subsequent analyses will use only this subset of
the data. If you have chosen the Filter option in the previous dialog box, SPSS will indicate
the inactive cases in the Data Editor by placing a slash over the row number. To select the
entire dataset again, return to the Select Cases dialog box and select the All Cases option.
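This is essentially what SPSS pastes as syntax when you use the Filter option in the dialog. A sketch for the six-year example, assuming the months-of-service variable is named jobtime:

USE ALL.
COMPUTE filter_$ = (jobtime > 72).
FILTER BY filter_$.
EXECUTE.

FILTER OFF. restores the full dataset, just like choosing the All Cases option.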
2.7. Collapsing and transposing Data
Collapsing data across observations
At times we might have data files that need to be collapsed to be useful to us. For instance,
we might have student data but really want classroom data, or we might have weekly data
but be interested in monthly data, etc. Let us see how we can collapse data across kids to
make family-level data.
Aggregating Files
Aggregating files is one way of data manipulation procedure. The Aggregate procedure allows
you to condense a dataset by collapsing the data on the basis of one or more variables. For
example, to investigate the characteristics of people in the company on the basis of the
amount of their education, you could collapse all of the variables you want to analyze into
rows defined by the number of years of education. To access the dialog boxes for aggregating
data, follow the following steps:
 Select Data and then AGGREGATE
 We will observe a dialogue box. This dialogue box has several options. These are as
follows.
Break variable: The top box, labeled Break Variable(s), contains the variable within which
other variables are summarized. This is something like classification variable.
Aggregated Variables: contains the variables that will be collapsed.
Number of cases: This option allows us to save the number of cases that were collapsed at
each level of the break variable.
Save: This has three different options. I) Add the aggregated variables to working data file II)
Create new data file containing aggregated variables. III) Replace working data with
aggregated variables only. We may choose one of the above three options depending on our
interest.
Options for very large data sets: This has two options
 File is already sorted on break variable(s)
 Sort file before aggregating.
Example: Suppose we have a file containing information about the kids in three families.
There is one record per kid. Birth is the order of birth (i.e., 1 is first); age, wt and sex are the
child's age, weight and sex, respectively. This data is saved as the file kid3.sav in the directory
desktop:\training r. We will use this file for showing how to collapse data across
observations. Using the Aggregate command under the Data menu we can collapse
across all of the observations and make a single record with the average age of the kids. To do
so we need to create a break variable const=1 using the Compute command.
Famid Kidname birth Age Wt Sex
1 Bekele 1 9 60 f
1 Bogale 2 6 40 m
1 Barbie 3 3 20 f
2 Anteneh 1 8 80 m
2 Alemayehu 2 6 50 m
2 Abush 3 2 20 f
3 Chapie 1 6 60 m
3 Chuchu 2 4 40 f
3 Mamush 3 2 20 m
To collapse the above data, we follow the following steps:
 Select Data and then AGGREGATE. In the observed dialogue box, select const as
break variable.
 Choose “age” for summaries of variables
 Choose add aggregated variables to working data file
The “age_mean” variable will be added to our working data. This is the mean age of all 9
children.
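In syntax, the same aggregation could be written as follows (a sketch; MODE=ADDVARIABLES adds the result to the working file):

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=const
  /age_mean = MEAN(age).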
If we follow all of the above steps and change the last option to “Create new data file
containing aggregated variables only”, we will have the following output saved as aggr.sav.
CONST AVGAGE N_Break
1.00 5.11 9
If we use “famid” as the break variable, the Aggregate option will produce the average age of the kids in
each family. The following output will be obtained.
FAMID AGE1
1.00 6.00
2.00 5.33
3.00 4.00
We can request averages for more than one variable. For instance, if we want to aggregate
both age and weight by famid, we can follow these steps.
 Select Data and then AGGREGATE. In the observed dialogue box, select famid as the
break variable.
 Choose “age” and “wt” for summaries of variables
 Choose “Create new data file containing aggregated variables only”.
The following output will be produced. The variable N_Break is the count of the number of
kids in each family.
Famid Age_mean Wt_mean N_Break
1 6.00 40.00 3
2 5.33 50.00 3
3 4.00 40.00 3
We can also create a variable “girls” that counts the number of girls in the family, and a
variable “boys” that counts the number of boys in the family. You can also add a label after the new
variable name. If you save the output in SPSS, you can see the labels in the SPSS Data Editor after
clicking on the "Variable View" tab in the lower left corner of the editor window.
To have summary information which shows the number of boys and girls per family, we
follow the procedure below. We create two dummy variables, dumgirl for girls and
dumboy for boys. The sum of dumgirl is the number of girls in the family; the sum of
dumboy is the number of boys in the family.
I) We recode sex into dumgirl=1 if sex=f and dumgirl=0 if sex=m.
II) We recode sex into dumboy=1 if sex=m and dumboy=0 if sex=f.
III) We select Data … Aggregate option. At this step a dialogue box will appear. In this
dialogue box, we select Break-variable = famid,
IV) Select dumgirl and dumboy for aggregated variables.
V) Below the aggregated variables we have two options 1. Function 2. Name and label.
After selecting one of the variables to be aggregated choose the ‘function’ option. A new
dialogue box will pop-up.
VI) From this new dialogue box, we select the function ‘sum’ and click continue for both
variables.
VII) Again below the aggregated variables, select the option Name and Label. Change the
name “dumgirl_sum” to girls and “dumboy_sum” to boys. You can also label boys as “number of boys”
and girls as “number of girls”.
VIII) Now click on the “Number of cases” box and change the name N_BREAK to “NumKids”.
IX) Finally we have to choose the save option. If we choose the option “Create new data file
containing aggregated variables only”, SPSS will save the file in the directory of our
choice.
For instance, if we save our file in the directory desktop\training r, it will be saved as an
SPSS file. Our results look like the following output.
FamId Boys girls Numkid
1 1.00 2.00 3
2 2.00 1.00 3
3 2.00 1.00 3
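The whole procedure can be sketched in syntax as follows, assuming sex is a string variable coded f/m and taking famsum.sav as a hypothetical output file name:

RECODE sex ('f'=1) ('m'=0) INTO dumgirl.
RECODE sex ('m'=1) ('f'=0) INTO dumboy.
EXECUTE.
AGGREGATE
  /OUTFILE='famsum.sav'
  /BREAK=famid
  /girls = SUM(dumgirl)
  /boys = SUM(dumboy)
  /numkids = N.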
Restructure Data: We use the Restructure Data Wizard to restructure our data.
In the first dialog box, we select the type of restructuring that we want to do. Suppose we have
data that are arranged in groups of related columns, and our interest is to restructure these data
into groups of rows in the new data file. Then we choose the option “Restructure selected
variables into cases”.
Example: Consider a small data set consisting of three variables as given below.
V1 V2 V3
8 63 82
9 62 87
10 64 89
12 66 85
15 67 86
The objective is then to restructure the above data into groups of rows in the new data file. In
other words, we want to convert the above data into one variable that holds all the values of the
three variables and one factor variable that indicates the group. This procedure is known as
restructuring variables to cases. The procedure is as follows.
 From the data menu select restructure, the dialogue box which says “Welcome to the
restructure Data wizard” will appear.
 Choose the first option “Restructure selected variables into cases” and click next.
 Another dialogue box which says “Variables to Cases: Number of Variable Groups” will
appear. Choose the first option “One” and click Next.
 Give the target variable a name; call it “all_inone”.
 Select all three variables (V1, V2 and V3) into the “Variables to be Transposed” box and click Next.
 Another dialogue box which says “Variables to Cases: Create Index Variables” will appear.
Choose the first option “One” and click Next.
 In the new dialogue box that appears, change the variable name “Index” to group.
Click Finish and see your restructured data. The data may appear as shown below.
Id Group All_inone
1 1 8
1 2 63
1 3 82
2 1 9
2 2 62
2 3 87
3 1 10
3 2 64
3 3 89
4 1 12
4 2 66
4 3 85
5 1 15
5 2 67
5 3 86
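The steps above correspond to the VARSTOCASES command; a minimal sketch:

VARSTOCASES
  /MAKE all_inone FROM V1 V2 V3
  /INDEX = group.

The index variable group takes the values 1, 2 and 3 for V1, V2 and V3, respectively.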
The variable Id gives the row position of the data before the data was restructured. We
can also restructure the data from cases to variables.
For instance, consider the following small data set on the ages of nurses and doctors.
In the variable group, 1 stands for nurses and 2 stands for doctors.
Id Age Group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
1 60 2
2 36 2
3 29 2
4 56 2
5 32 2
6 54 2
The objective is to restructure the above age data into a data set having two separate variables
for Nurses and Doctors. To do so, we follow the following procedure.
 From the data menu we select restructure
 From the dialogue box we select “ Restructure selected cases to variables ”
 We select Id for Identifier variable
 We select group for Index variable and click next and respond to the dialogue box that
will appear.
 When you observe the dialogue box which says “Cases to Variables: Options”,
select “Group by Index” and click Next.
 Click finish.
Our data will be restructured as given below.
Id Age.1 Age.2
1 23 60
2 25 36
3 26 29
4 35 56
5 42 32
6 22 54

Therefore, we have separate variables for ages of nurses and doctors.
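The equivalent command is CASESTOVARS; a short sketch of the restructuring just performed:

CASESTOVARS
  /ID = id
  /INDEX = group
  /GROUPBY = INDEX.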
Transpose all data: We choose this when we want to transpose our data. All rows will
become columns and all columns will become rows in the new data. The procedure is as
follows:
 From the data menu we select restructure
 From the dialogue box we select “ Transpose all data ” and click finish
 Transpose dialogue box will appear. We have to select all variables to transpose. (Note
un-selected variables will be lost.) Click Ok.
 The transformed data that change rows to columns and columns to row will appear.
Example: Consider the following data set.
Id Age group
1 23 1
2 25 1
3 26 1
4 35 1
5 42 1
6 22 1
7 60 2
8 36 2
9 29 2
10 56 2
11 32 2
12 54 2
Applying the above procedure, the transposed form of this data is as given below.
Case_lbl V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Id 1 2 3 4 5 6 7 8 9 10 11 12
age 23 25 26 35 42 22 60 36 29 56 32 54
group 1 1 1 1 1 1 2 2 2 2 2 2
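In syntax, transposing is done with the FLIP command; a sketch for this example:

FLIP VARIABLES = id age group.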
2.8. Listing Cases
You may sometime want to print a list of your cases and the values of variables associated
with each case, or perhaps a list of only some of the cases and variables. For example, if you
have two variables that you want to examine visually, but this cannot be done because they
are at very different places in your dataset, you could generate a list of only these variables in
the Output Viewer. The procedure for doing this cannot be performed using dialog boxes and
is available only through command syntax. The syntax for generating a list of cases is shown
below. The variable names, shown in lower case, instruct SPSS which variables to list in the
output. Alternatively, you can type the keyword ALL in place of the variable names, which will
produce a listing of all of the variables in the file. The subcommand /CASES FROM 1 TO 10
is an instruction to SPSS to print only the first ten cases. If this instruction were omitted,
all cases would be listed in the output.
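For instance, using the variables gender and minority referred to below, the command could look like this:

LIST VARIABLES = gender minority
  /CASES = FROM 1 TO 10.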
To execute this command, first highlight the selection by pressing on your mouse button while
dragging the arrow across the command or commands that you want to execute. Next, click on
the icon with the black, right-facing arrow on it. Or, you can choose a selection from the Run
menu.
Executing the command will print the list of variables, gender and minority in the above
example, to the Output Viewer. The Output Viewer is the window in which all output will be
printed. The Output Viewer is shown below, containing the text that would be generated from
the above syntax.
3. DESCRIPTIVE STATISTICS USING SPSS
From the previous section, recall the following about the Output Viewer:
 The results from running a statistical procedure are displayed in the Viewer.
 The output produced can be statistical tables, charts or graphs, or text, depending on
the choices you make when you run the procedure.
 The viewer window is divided into two panes.
 The outline pane (left side): contains an outline of all of the information stored in the
Viewer.
 The contents pane (right hand side): contains statistical tables, charts, and text output.
The icons in the outline pane can take two forms:
 The open book icon indicates that the item is currently visible in the Viewer.
 The closed book icon indicates that the item is not currently visible in the Viewer.
3.1. Summarizing Data
A common first step in data analysis is to summarize information about the variables in your
dataset, such as the mean and variance, by:
 Producing frequency distributions
 Descriptive statistics
 Cross tabulations
3.1.1. Producing Frequency distribution
 It is more useful to investigate the numbers of cases that fall into various categories.
 Frequency tables are useful for summarizing categorical variables -- variables with a
limited number of distinct categories.
 From the menu bar choose:
Analyze
Descriptive Statistics
Frequencies...
In the Frequencies dialog box you can choose the Statistics, Charts and Format buttons to add
whatever output you want. The Charts button, for instance, offers different types of charts such as bar, pie
and histogram. For example, clicking on the Histograms option with its sub-option “With normal curve”
will produce a histogram with a bell-shaped normal curve overlaid.
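A frequencies request in syntax might look like this (jobcat taken as an example categorical variable; for a scale variable, /HISTOGRAM NORMAL draws the histogram with the normal curve):

FREQUENCIES VARIABLES = jobcat
  /BARCHART FREQ.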
3.1.2. Descriptive Statistics
 The Descriptive Statistics table provides summary statistics for continuous,
numeric variables.
 From the menu bar chose:
Analyze
Descriptive Statistics
Descriptive...
 The Descriptives option, available from the Analyze > Descriptive Statistics menu,
will produce a dialog box in which you choose the variables to summarize.
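A syntax sketch, using the employee-data salary variables as example names:

DESCRIPTIVES VARIABLES = salbegin salary
  /STATISTICS = MEAN STDDEV MIN MAX.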
3.1.3. Cross Tabulation
 It gives information about the relationship between two (or more) categorical variables.
 To do so, follow these steps:
Analyze
Descriptive Statistics
Crosstabs…
 After selecting Crosstabs from the menu, the dialog box shown below will appear on
your monitor
 The options available by selecting the Statistics and Cells buttons provide you with
several additional output features.
 Selecting the Cells button will produce a menu that allows you to add additional values
to your table.
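A two-way table in syntax might look like this (gender and jobcat as example variables; /CELLS adds row percentages to the counts):

CROSSTABS
  /TABLES = gender BY jobcat
  /CELLS = COUNT ROW.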
Three way tables (Adding a layer variable):
 It can be formed by adding a layer variable.
 You can add a layer variable to create a three-way table in which categories of the
row and column variables are further subdivided by categories of the layer
variable.
 This variable is sometimes referred to as the control variable because it may
reveal how the relationship between the row and column variables changes when
you "control" for the effects of the third variable.
 Procedure
 Open the Crosstabs dialog box again.
 Click Cells.
 Click Continue.
 Select one variable as the layer variable.
 Click OK to run the procedure
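In syntax, the layer (control) variable is simply a second BY; a sketch with minority as a hypothetical layer variable:

CROSSTABS
  /TABLES = gender BY jobcat BY minority
  /CELLS = COUNT.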
Displaying Tables:
 Much of the output in SPSS is displayed in a pivot table format.
 To create custom tables, from the menu bar select:
 Analyze
 Tables
 Custom Tables…
 Then simply drag and drop variables where we want them to appear in the table.
Step:- Analyze  Tables  Custom Tables  Drag Categorical Variables  Ok

Summary Statistics:

 Right-click on variable category on the canvas pane and select Summary Statistics
from the pop-up context menu.
 In the Summary Statistics dialog box, select Row N % in the Statistics list and click
the arrow button to add it to the Display list.
 Both the counts and row percentages will be displayed in the table.
 Click Apply to Selection to save these settings and return to the table builder.
 To insert totals and subtotals click categories and totals in the define section
 Then click ok.
 For scale variables we can display summary statistics (mean, median, …) in the
cells of the table.
Stacking Variables:
 Taking separate tables and pasting them together into the same display.
 To Stack Variables:
 In the variable list, select all of the variables you want to stack, then
drag and drop them together into the rows or columns of the canvas
pane. Or
 Drag and drop variables separately, dropping each variable either
above or below existing variables in the rows or to the right or left
of existing variables in the columns.
3.1.4. Diagrams and graphs
A. Bar Chart
 Bar Charts are a common way to graphically display the data that represent the
frequency of each level of a variable
Graphs
Bar...

This will produce the following dialog box:

 To get started with the bar graph, click on the icon representing the type of graph that
you want, then click on the Define button and complete the dialog box that appears.
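The equivalent GRAPH syntax for a simple count bar chart might be (jobcat assumed as the category variable):

GRAPH /BAR(SIMPLE) = COUNT BY jobcat.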
B. Pie Chart
 Used to present a categorical variable.
 From the menu bar choose:
Graphs
Pie chart
Summaries of groups of cases
Then select “slices represented by % of cases” and click OK.
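A pie chart in syntax, again with jobcat as an example variable:

GRAPH /PIE = COUNT BY jobcat.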

C. Histograms
 Making histograms is one of the best ways to check your data for normality.
 From the Graphs menu, select "Histogram."
 Put your variable in the "Variable" box.
Step:- Graphs  Legacy Dialogs  Histogram  Variable( ?)  Ok

D. Scatter Plots
 Scatter plots give you a tool for visualizing the relationship between two or more
variables

 Scatter plots are especially useful when you are examining the relationship between
continuous variables using statistical techniques such as correlation or regression.
 Scatter plots are also often used to evaluate the bivariate relationships in regression
analyses.
 Useful in the early stages of analysis, when exploring data and determining whether a linear
regression analysis is appropriate.
 May show outliers in your data.
Example: Performance and Self-confidence
To obtain a scatter plot in SPSS

Graphs
Scatter...

This will produce the following dialog box:
Simple Scatter Plot

 The Simple scatter plot graphs the relationship between two variables
 When you select the Simple option from the initial dialog box, you will get the
following dialog box:

 We can also have SPSS draw different colored markers for each group by entering a
group variable in the Set Markers by box.
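A syntax sketch of such a plot, using employee-data variable names as examples:

GRAPH /SCATTERPLOT(BIVAR) = educ WITH salary BY gender.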
The Matrix Scatter Plot:

 Used to visualize the relationship between two or more variables.


 A variable is plotted with every other variable.
 Every combination is plotted twice so that each variable appears on both the X and Y
axis.
 If you consider a matrix scatter plot with three variables, salary, salbegin, and jobtime,
you would receive a scatterplot matrix of all pairwise plots:
Overlay Scatter Plot


 It allows you to plot two scatter plots on top of each other
 The plots are distinguished by color on the overlaid plot.
 The procedure is as follows.
From the menu bar choose
Graph
Scatter/Dots…
Overlay option
 Consider the Employee.sav data set in SPSS; we can plot education by beginning
and current salaries by pairing the variables educ with salbegin and educ with
salary.
Exercise:
1. Let us consider a small data set given below.
x 400 675 475 350 425 600 550 325 675 450
y 1.8 3.8 2.8 1.7 2.8 3.1 2.6 1.9 3.2 2.3
After entering these data into SPSS, produce the scatter plot. What type of relationship do you
observe between x and y? Is an increase in x followed by an increase in y?

2. Produce a scatter plot for the following data and discuss the results.
x 400 675 475 350 425 600 550 325 675 450
y -1.8 -3.8 -2.8 -1.7 -2.8 -3.1 -2.6 -1.9 -3.2 -2.3
4. Customizing SPSS outputs and reporting
4.1. Customizing SPSS outputs

4.1.1. Modifying Tables
Much of the output in SPSS is displayed in a pivot table format. While these pivot tables are
professional quality in their appearance, you may wish to alter their appearance or export
them to another application. There are several ways in which you can modify tables. In this
section, we will discuss how you can alter text, modify a table's appearance, and export
information in tables to other applications.

To edit the text in any SPSS output table, you should first double-click that table. This will
outline the table with dashed lines, as shown in the figure below, indicating that it is ready to be edited.
Some of the most commonly used editing techniques are changing the width of rows and
columns, altering text, and moving text. Each of these topics is discussed below:

 Changing column width and altering text. To change column widths, move the
mouse arrow above the lines defining the columns until the arrow changes to a double-
headed arrow facing left and right. When you see this new arrow, press down on your
left mouse button, then drag the line until the column is the width you want, then
release your mouse button.
 Editing text. First double-click on the cell you wish to edit, then place your cursor on
that cell and modify or replace the existing text. For example, in the frequency table
shown below, the table was double-clicked to activate it, and then the pivot table's title
was double-clicked to activate the title. The original title, "Employment Category,"
was modified by adding the additional text, "as of August 1999."
 Using basic editing commands, such as cut, copy, delete, and paste. When you cut
and copy rows, columns, or a combination of rows and columns by using the Edit
menu's options, the cell structure is preserved and these values can easily be pasted
into a spread sheet or table in another application.
Aside from changing the text in a table, you may also wish to change the appearance of the
table itself. But first, it is best to have an understanding of the SPSS Table Look concept. A
Table Look is a file that contains all of the information about the formatting and appearance
of a table, including fonts, the width and height of rows and columns, colouring, etc. There are
several predefined Table Looks that can be viewed by first right-clicking on an active table,
then selecting the Table Looks menu item. Doing so will produce the following dialog box:
You can browse the available Table Looks by clicking on the file names in the Table Look
Files box, as shown above. This will show you a preview of the Table Look in the Sample
box.

While the Table Looks dialog box provides an easy way to change the look of your table, you
may wish to have more control of the look or create your own Table Look. To modify an
existing table, right-click on an active pivot table, then select the Table Properties menu item.
This will produce the following dialog box:

The above figure shows the Table Properties dialog box with the Cell Formats tab selected.
You can alternate between tabs (e.g., General, Footnotes, etc.) by clicking on the tab at the
upper left of the dialog box. While a complete description of the options available in the Table
Properties dialog box is beyond the scope of this document, there are a few key concepts that
are worth mentioning. Note the Area box at the upper right of the dialog box. This refers to
the portion of the box that is being modified by the options on the left side of the box. For
example, the color in the Background of the Data portion of the table was changed to black
and the color of the text was changed to white by first choosing Data from the Area box, then
selecting black from the Background drop-down menu and selecting white for the text by
clicking on the colour palette icon in the Text area on the left side of the dialog box.

The Printing tab also has some useful options. For example, the default option for three-
dimensional tables containing several layers is that only the visible layer will be printed. One
of the options under the Printing tab allows you to request that all layers be printed as
individual tables. Another useful Printing option is the Rescale wide/long tables to fit page,
which will shrink a table that is larger than a page so that it will fit on a single page.

Any modifications to a specific table can be saved as a Table Look. By saving a Table Look,
you will be saving all of the layout properties of that table and can thus apply that look to
other tables in the future. To save a Table Look, click on the General tab in the Table
Properties dialog box. There are three buttons on the bottom right of this box. Use the Save
Look button to save a Table Look. That button will produce a standard Save As dialog box
with which you can save the Table Look you created.
4.1.2. Exporting Tables in SPSS
In addition to modifying a table's appearance, you may also wish to export that table. There
are three primary ways to export tables in SPSS. To get a menu that contains the available
options for exporting tables, right-click on the table you wish to export. The three options for
exporting tables are: Copy, Copy object, and Export.

The Copy option copies the text and preserves the rows and columns of your table but does
not copy formatting, such as colours and borders. This is a good option if you want to modify
the table in another application. When you select this option, the table will be copied into your
system clipboard. Then, to paste the table, select the Paste command from the Edit menu in
the application to which you are importing the table. The Copy option is useful if you plan to
format your table in the new application; the disadvantage of this method is that only the text
and table formatting remains and you will therefore lose much of the formatting that you
observe in the Output Viewer.

The Copy object method will copy the table exactly as it appears in the SPSS Output Viewer.
When you select this option, the table will be copied into your clipboard and can be imported
into another application by selecting the Paste option from the Edit menu of that application.
When you paste the table using this option, it will appear exactly as it is in the Output Viewer.
The disadvantage of this method is that it can be more difficult to change the appearance of
the table once it has been imported.

The third method, Export, allows you to save the table as an HTML or an ASCII file. The
result is similar to the Copy command: you will have a table that retains the text and cell
layout of the table you exported, but it will retain little formatting. This method for exporting
tables to other applications is different from the above two methods in that it creates a file
containing the table rather than placing a copy in the system clipboard. When you select this
method, you will immediately be presented with a dialog box allowing you to choose the
format of the file you are saving and its location on disk. The primary advantage of this
method is that you can immediately create an HTML file that can be viewed in a Web
browser.
4.1.3. Modifying scatter plots
 double-click on the chart
Chart
Options...

 To get the following dialog box

 Some of the most useful options that will add information to your scatterplot are the
Fit Line options.
 The Fit Line option will allow you to plot a regression line over your scatter plot.
Click on the Fit Options button to get this dialog box:

4.1.4. Modifying and Exporting Graphs

The primary tool for modifying charts in SPSS is the Chart Editor. The Chart Editor will open
in a new window, displaying a chart from your Output Viewer. The Chart Editor has several
tools for changing the appearance of your charts or even the type of chart that you are using.
To open the Chart Editor, double-click on an existing chart and the Chart Editor window will
open automatically. The Chart Editor shown below contains a bar graph of employment
categories:
While there are many useful features in the Chart Editor, we will concentrate on three of
them: changing the type of chart, modifying text in the chart, and modifying the graphs.

You can change the type of chart that you are using to display your data using the Chart
Editor. For example, if you want to compare how your data would look when displayed as a
bar graph and as a pie chart, you can do this from the Gallery menu:

Gallery
Pie...

Selecting this option will change the above bar graph into the following pie chart:
Once you have selected your graphical look, you can start modifying the appearance of your
graph. One aspect of the chart that you may want to alter is the text, including the titles,
footnotes, and value labels. Many of these options are available from the Chart menu. For
example, the Title option could be selected from the Chart menu to alter the chart's title:

Chart
Title...

Selecting this menu item will produce the following dialog box:

The title "Employment Categories" was entered in the box above and the default justification
was changed from left to center in the Title Justification box. Clicking OK here would cause
this title to appear at the top center of the above pie chart. Other text in the chart, such as
footnotes, legends, and annotations, can be altered similarly. The labels for the individual
slices of the pies can also be modified, although it may not be obvious from the menu items.
To alter the labels for areas of the pie, choose the Options item from the Chart menu.

Chart
Options...

This will produce the following dialog box:
In addition to providing some general options for displaying the slices, the Labels section
enables you to alter the text labelling slices of the pie chart as well as format that text. You
can click the Edit Text button to change the text for the labels. Doing so will produce the
following dialog box:

To edit individual labels, first click on the current label, which will be displayed below the
Label box, then alter the text in the Label box. When you finish, click the Continue button to
return to the Pie Options dialog box. You can make changes to the format of your labels by
clicking the Format button here. If you do not want to change formatting, click on OK to
return to the Chart Editor.

In addition to altering the text in your chart, you may also want to change the appearance of
the graph with which you are working. Options for changing the appearance of graphs can be
accessed from the Format menu. Many options available from this menu are specific to a
particular type of graph. There are some general options that are worth discussing here. One
such option is the Fill Pattern option, which changes the pattern of the graph. It can be obtained by
selecting the Fill Pattern option from the Format menu:

Format
Fill Pattern...
This will produce the following dialog box:
First, click on the portion of the graph where you want to change the pattern, then select the
pattern you want by clicking on the pattern sample on the left side of the dialog box. Then,
click the Apply button to change the appearance of your graph.

One other formatting option that is generally useful is the ability to change the colors of your
graphs. To do that, select the Color option from the Format menu:

Format
Color...

To produce the following dialog box:

This will allow you to change the color of a portion of a graph and its border. First, select the
portion of the graph for which you would like to change its color, then select the Fill option if
you want to change the color of a portion of the graph and select the Border option if you
want to change the color of the border for a portion of the graph. Next, click on the color that
you want and click Apply. Repeat this process for each area or border in the graph that you
want to change.

Interactive Charts

Many of the standard graphs available through SPSS are also available as interactive charts.
Interactive charts offer more flexibility than standard SPSS graphics: you can add variables to
an existing chart, add features to the charts, and change the summary statistics used in the
chart. To obtain a list of the available interactive charts, select the Interactive option from the
Graphs menu:
Graphs
Interactive

Selecting one of the available options will produce a dialog for designing an interactive graph.
For example, if you selected the Boxplot option from the menu, you would get this dialog
box:

Dialog boxes for interactive charts have many of the same features as other SPSS dialog
boxes. For example, in the above dialog box, the variable type is represented by icons: scale
variables, such as the variable bdate, are represented by the icon that resembles a ruler, while
categorical variables, such as the variable educ, are represented by the icon that resembles a
set of blocks. Variables in the variable list on the left of the dialog box can be moved into the
boxes on the right side of the screen by dragging them with your mouse, in contrast to using
the arrow button used in other SPSS dialog boxes. Options in interactive graphs can be
accessed by clicking on the tabs. For example, clicking on the Boxes tab produces the
following dialog box:
Here, you have several choices about the look of your boxplot. The choice to display the
median line is selected here, but the options to indicate outliers and extremes are not selected.
The Titles and Options tabs offer several other choices for altering the look of your table as
well, although a thorough discussion of these is beyond the scope of this document. When you
have finished the specifications for a graph, click the OK button to produce the graph you
have specified in the Output Viewer.

Interactive graphs offer several choices for altering the look of the chart after you have a draft
in the Output Viewer. To get the menus for doing that, double-click on the interactive graph
that you want to alter. For example, double-clicking on the boxplot obtained through the
above dialog box will produce the following menus:

The icons immediately surrounding the graph provide you with several possibilities for
altering the look of your graph. The three leftmost items in the horizontal menu are worthy of
mention. The leftmost icon produces a dialog box that resembles the original Interactive
Graphs dialog box and contains many of the same options. For example, you could change the
variables that you are graphing using this dialog box. The next icon, the small bar graph, lets
you add additional graphical information. For example you could overlay a line that graphed
the means of the three groups in the above graph by choosing the Dot-Line option from the
menu, or you could add circles representing individual’s salaries within each group by
choosing the Cloud option. The third icon provides several options for changing the look of
your chart. Selecting that icon will produce the following dialog box:
Each icon in this dialog box can be double-clicked to produce a dialog box that contains the
properties of the component of the chart represented by that icon. For example, you could
obtain the properties of the boxes in the interactive graph above by double-clicking on the
icon labelled Box. Doing so would produce this dialog box:

Changing the properties in this or any other dialog box that controls the properties of any
portion of the chart will change the look of the graph in the Output Viewer. For example, you
could change the colors of the boxes and their outlines by selecting a different color.
5. INTRODUCTION TO MINITAB

Minitab is statistical analysis software that allows you to conduct data analyses easily. It
is one of the suggested software packages for the class. It is commonly used to enter, organize, present
and analyze data. It can be used for learning about statistics as
well as for undertaking statistical research. Its procedures have the advantage of being
accurate, reliable and generally faster than computing statistics and drawing graphs by hand.
This guide is intended to walk you through the basics of Minitab and help you get started
with it.

5.1. How to start and exit Minitab


Starting Minitab
Double click on the “Minitab” icon
In the computer lab: the Minitab icon is in the folder: “program~>Minitab”

5.2. Minitab windows: worksheet, session and project


As you get into Minitab, two windows open:
 The worksheet window
 The session window

5.2.1. Worksheet Window


Data in Minitab is stored in a worksheet of columns and rows. Typically, each column
contains the data for one variable, with one observation in each row. Columns are numbered C1, C2,
C3, …, while rows are numbered 1, 2, 3, …
A. Changing column name.
You can change column names by clicking the field below the column number, typing the
name and then hitting return. You can also change the name using commands.
B. Editing column data.
Column data can be edited. You can also copy and paste data into the columns; .txt and
Excel data can be pasted with no problem.

C. Loading data in Minitab

Minitab files are organized as “projects”. Each project will contain all the data you use and the
commands and analysis you perform on the data.
You can open a new, empty worksheet at any time. In this empty worksheet you can copy,
paste and type the data you need by simply working on the worksheet as you would on any
spreadsheet.
D. Opening an existing Worksheet (Minitab type file)
Within a project you can open one or more files that contain data. When you open a file, you
copy the contents of the file into the current Minitab project. Any changes you make to the
worksheet while in the project will not affect the original file.
To open a Minitab type file
1. Choose FILE -> OPEN WORKSHEET.
2. Look for the file you want to open. It should be a .MTW or .MPJ type file. Select the
file and click Open.
3. If you get a message box indicating that the content of the file will be added to
the current project, check “Do not display this message again”, and then click OK.

5.2.2. Session Window


Displays output and lets you type commands. In order to be able to type commands in the
Session window you need to enable this option. To do so, go to
EDITOR -> ENABLE COMMANDS.
The Session Window will now look like
Minitab has a large number of built-in routines that allow you to do most of the basic data
analysis. Commands can also be typed into the Session window, either to replicate the built-in
routines or to create a more tailored data analysis.
The “MTB >” prompt should be visible in the Session window.
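Once commands are enabled, data can be entered and printed directly at the prompt; a small sketch (column and name are arbitrary choices):

MTB > SET C1
DATA> 2 4 6 8 10
DATA> END
MTB > NAME C1 'score'
MTB > PRINT C1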
5.2.3. Minitab Project
Projects are made up of commands, graphs and worksheets. Every time you save a
Minitab project you will be saving its graphs, worksheets and commands. However, each one of
these elements can be saved individually for use in other documents or Minitab projects.
Likewise, you can print a project and its elements.
The Project Manager contains different folders, each with its own function:
a) Session folder: It manages the Session window.
b) History folder: It lists the commands you have used in your session.
c) Graph folder: It is for managing, arranging and naming your graphs.
d) ReportPad folder: It is used for creating, arranging and editing reports of your project.
e) Related documents folder: For quickly accessing project-related, non-
MINITAB files for easy reference.
f) Graph window: It is used to display graphs and charts; it becomes visible once you create a
graph or chart for your data.
5.2.4. Moving between windows
You may keep the worksheet and session windows occupying half a screen each or you can
maximize any one of them to a full screen. Then you can move between the different
windows:
Window
Then choose your desired window from the given list and click on it. The report pad is
accessible through the project manager.
Alternatively, each window is represented by an icon on the top bar. Clicking on the icon will
take you to the window right way. In particular, note the icons for worksheet, session, and
report pad.
5.2.5. Understanding the interface
After loading Minitab, you will either open an existing project or a new one. In either case, the
following window structure will appear.
The labeled parts of this window are: the Close button, Title bar, Menu bar, Standard toolbar, Session window, Column names, Row names, Worksheet, Project manager, and Status bar.
5.3. The menus and their use
There are four areas on the screen: the Menu bar, the Toolbar, the Session window and the
Worksheet window.
You can open menus and choose commands. Here you can find the built-in routines.

 File -use this menu to open and save worksheets and to import data.
 Edit -use this menu to cut and paste text and data across windows.
 Manip-use this menu to sort and recode your data.
 Calc-use this menu to create new columns.
 Stat -use this menu to analyses your data. This key menu performs many useful
statistical functions
 Graph -use this menu to graphically represent your data.
 Editor -use this menu to edit and format your data.
 Window -use this menu to change windows.
 Help - this opens a standard Microsoft Help window containing information on
how to use the many features of Minitab.
5.4. Types of data
5.4.1. The data in the spreadsheet

This section discusses the types of data you can work with in MINITAB and the
various forms those data types can take. In MINITAB you can work with three types of
data, stored in three forms: columns, constants, or matrices. The types are:
1. Numeric: It includes the digits 0, 1, …, 9 and *, where the symbol * is reserved for missing
values. A number can have a - or + sign, and it can be written in exponential
notation if it is a very large or very small number, e.g. 3.2E12, which equals
3.2×10^12. Numbers can be stored in columns, constants or matrices. MINITAB stores
and computes numbers in double precision, which means that numbers can have up
to 15 or 16 digits (depending on the number) without round-off error.
2. Text: It can be of two types, either a character or a string. A character is a single letter,
digit (0 to 9), space, or punctuation mark such as >, ?, <, !. Strings are a
series of characters; some examples of strings are country, name, occupation, etc. The
maximum number of characters that can be entered at a time is 80. Texts can be stored
in columns or constants but not in matrices.
3. Date/Time: You can write Date (Such as Jan-1-1997, 03/01/2011…) or Times (Such
as 24:23) or both (Such as 24/11/2002; 10:30AM)

5.4.2. CREATING A SET OF DATA IN MINITAB

i. Create the set to sample from:


Calc> Make Patterned Data > Simple Set of Numbers
Store it in C1, from first value (choose a number) to last value (choose a number) in
steps of 1
ii. Sample from (created) set:
Calc> Random Data > Sample from Columns
Number of rows: …. (Whatever number of data pieces you need)
From sample C1
Store data in C2 (do it again with C1 and see what happens).
If the data set that you need is larger than your original set that you are sampling from,
make sure to check the “with replacement” box).
You can experiment with different types of data (check how much you remember from
statistics).
iii. Rank and sort data:
To sort data: Data > Sort
To rank data: Data > Rank
Alternatively, you can always type a collected set of data into Minitab (make sure that the
data is in numeric format).
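The same steps can be typed as session commands; a sketch that builds the set 1 to 100, samples 20 values from it, then sorts and ranks the sample (column choices are arbitrary):

MTB > SET C1
DATA> 1:100
DATA> END
MTB > SAMPLE 20 C1 C2
MTB > SORT C2 C3
MTB > RANK C2 C4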
5.5. Entering and saving data

5.5.1. Entering the Data
There are two main ways to enter data into the Minitab worksheet:
1. Typing in the values (of give variable) one by one and clicking <enter> after
each entry.
2. Opening an existing Minitab worksheet or Minitab project.
FILE + OPEN WORKSHEET or FILE + OPEN PROJECT
Then select the file from appropriate drive/folder
Minitab files have a yellow icon with MTB written on them.
Note (i) Data sets from the textbook are available on the CD-ROM attached to the
book. They are organized by chapter. The files may need to be unzipped.
(ii) A new worksheet is created each time you open a data file.
To merge a worksheet into the current one, you need to use
FILE + MERGE WORKSHEET.
(iii) The “open file” icon defaults to Minitab project files only.
To open a worksheet only, you need to use FILE + OPEN WORKSHEET.
Entering Data into a Worksheet
There are various methods for entering data into a worksheet. The simplest approach is
to use the Data window to enter data directly into the worksheet by clicking your mouse
in a cell and then typing the corresponding data entry and hitting Enter. Remember that
you can make a Data window active by clicking anywhere in the window or by using
Windows in the menu bar.
If you type any character that is not a number, Minitab automatically identifies the
column containing that cell as a text variable and indicates that by appending T to the
column name, e.g., C5-T. You do not need to append the T when
referring to the column. Also, there is a data direction arrow in the upper left corner of
the data window that indicates the direction the cursor moves after you hit Enter.
Clicking on it alternates between row-wise and column wise data entry. Certainly, this
is an easy way to enter data when it is suitable.
Remember, columns are variables and rows are observations! Also, you can have
multiple data windows open and move data between them. Use the command to open a
new worksheet.
5.5.2. Saving Minitab data
Quite often, you will want to save the results of all your work in creating a work-sheet.
If you exit Minitab before you save your work, you will have to reenter everything. So
we recommend that you always save. To use the commands of this section make sure
that the Worksheet window of the worksheet in question is active.
Use File > Save Current Worksheet to save the worksheet with its current name, or the default
name if it doesn’t have one.
The Save in box at the top contains the name of the folder in which the worksheet will
be saved once you click on the Save button. Here the folder is called data, and you can
navigate to a new folder using the Up One Level button immediately to the right of this
box. The next button takes you to the Desktop and the third button allows you to create
a subfolder within the current folder. The box immediately below contains a list of all
files of type .mtw in the current folder.
You can select the type of file to display by clicking on the arrow in the Save as type
box, which we have done here, and click on the type of file you want to display that
appears in the drop-down list.
There are several possibilities, including saving the worksheet in other formats, such as
Excel. Currently, there is only one .mtw file in the folder data and it is called
marks.mtw. If you want to save the worksheet with a different name, type this name in
the File name box and click on the Save button. To retrieve a worksheet, use File > Open
Worksheet and fill in the dialog box appropriately. The various windows and buttons
in this dialog box work as described for the File > Save Current Worksheet As command,
with the exception that we now type the name of the file we want to open in the File name
box and click on the Open button.
5.6. Importing and Exporting data

5.6.1. Importing Data from Excel
To set up a connection between Minitab and Excel, we need to tell Minitab the file path
(directories, folders, etc.) to where that Excel file lives. The simplest import of an Excel
file is by using the File > Open Worksheet command in Minitab.

In the Open Worksheet dialog box, the first step is to click the “Files of Type” drop-
down list and choose “All.” This lets us see all file types in the folder. Navigate to your
Excel file and select it.
But before you click “Open,” take a look at the buttons that appear at the bottom of the
dialog box after you select the Excel file. Click “Preview” to view how Minitab is
recognizing the data in the worksheet. Then you can click “Options” to specify which
data in the worksheet you want to import.

Since Excel is a general, cell-based spreadsheet, your document may have data in any
row or column with formulas scattered in between. Minitab, as a statistical software
package, requires the data to be in column-wise format (which is why it's easy to
manipulate data with the Data menu in Minitab). Because of this difference, you want
to avoid bringing over any header or footer information from Excel. Just focus on
bringing over the raw dataset into Minitab. Use the Open Worksheet > Options box to
specify exactly which rows to import.
5.6.2. Opening a text file
1. Choose FILE-> OPEN WORKSHEET
2. In the field “File of Type” select TEXT(*.txt) from the drop down menu.
3. Click OPTIONS and the following dialog box will appear.
4. Go to the “SINGLE CHARACTER SEPARATOR” option. The data on the text file
is usually separated by spaces or tabs. Choose the appropriate option. If you are unsure
how the data is separated, another option is to use the number of data rows. Just
introduce the number of data rows in the “NUMBER OF DATA ROWS” box.
5. Click OK.
6. The results will appear in the worksheet window.
Note: This can sometimes be a little tricky, as you can get a file that does not have the
data in the format that you want. If this happens, close the worksheet where the data was
placed and try importing it again, changing some of the options in step 4. This is a trial-
and-error procedure, so don’t panic if you don’t get it right on the first attempt.
Copying data to Minitab
Copying data to Minitab works like copying data to any other type of spreadsheet (e.g.
Excel).
1. Copy the data you wish to use in Minitab.
2. Go to the position where you want to copy the data in the desired Minitab
worksheet. If you wish to paste a cell with a header or name, make sure that you stand
in the variable name cell (the cell below the column number C1, C2, etc.).
3. Go to EDIT -> PASTE CELLS to paste the data.
4. Sometimes when you copy data, Minitab reads it in the wrong format, e.g. as text
when it is numeric. To solve this problem, select the problematic column(s) and go to
DATA -> CHANGE DATA TYPE -> choose the desired format. The most
useful format is numeric.

The following dialog box appears. Choose the variables you want to modify and where
you want to store them. The storage variables can be the same variables as the ones you
are modifying. Then hit OK.
5.6.3. Export data

To export data, you can save the Minitab worksheet as a different file type. Choose File
> Save Current Worksheet As to save the following types of files in Minitab:

Save a Minitab worksheet as an Excel file, text file, or HTML file

1. Choose File > Save Current Worksheet As.


2. Browse to a Save in location and enter a File name.
3. In Save as type, choose the type of file you are saving; for example, Excel
(*.xls).
4. Click Save.

Import data into Microsoft Access

To import data into Access, first, save the Minitab worksheet as an Excel file. Then,
import the Excel file into Access.

1. Save the worksheet that contains the data as an Excel file.


1. In Minitab, choose File > Save Current Worksheet As.
2. Enter a File name.
3. From Save as type, choose Excel (*.xls).
4. Click Save.
2. Import the Excel file into Access. Consult the Access Help for details.

How Minitab exports date/time data

When you save your worksheet as a text file, Minitab saves date/time data in the same
format in which it is displayed in the worksheet. Thus, if dates are displayed in the
format mm/dd/yyyy, then only the date is saved and not the hidden components, such
as the time.

When you save your worksheet as a file type other than text, Minitab saves all the
date/time information. For example, if dates in a column are displayed in the format
mm/dd/yyyy and you save the worksheet as an Excel file, when you open that file in
Excel, your spreadsheet will include both the date and time information: mm/dd/yyyy
h:mm.ss.ss.

Saving worksheets as text files with unusual formatting

If you use Save Current Worksheet As to save the worksheet as a text file, you cannot
specify the columns to save. You also cannot save your data in a custom format, for
example, with line breaks after certain columns. If you want to have more control over
how text files are saved, use File > Other Files > Export Special Text.
6. Descriptive statistics using Minitab

Descriptive Statistics
Displays N, N*, Mean, SE Mean, StDev, Min, Q1, Median, Q3, and Max
Descriptive Statistics for one variable
Stat Basic Statistics Display Descriptive Statistics Double-click on
appropriate variable (For Dell Data, double-click on Rates of Return so that it is
displayed under Variables).
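The session-command equivalent is DESCRIBE; a sketch, assuming the Rates of Return data are stored in C1:

MTB > DESCRIBE C1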

As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Statistics button, this screen will appear:
The checked items will be displayed in the output. To check or uncheck an item, click
in the box to the left of the word.
If you click on the Graphs button, this screen will appear:

To display any of these graphs (in addition the descriptive statistics displayed in the
session window), click in the box. (For purposes of this example, I have not clicked on
any graphs since graphs will be explained in the next section.)
To display the data, click on OK. For the Dell example, this information is displayed in
the session window:
Descriptive Statistics: Rates of Return
Variable N N* Mean SE Mean StDev Minimum Q1 Median
Rates of 60 0 0.0907 0.0195 0.1511 -0.2175 -0.0304 0.0784
Return
Variable Q3 Maximum
Rates of Return 0.1931 0.4561
b) Descriptive statistics for one variable, grouped by a second variable
Stat Basic Statistics Display Descriptive Statistics Double-click on
appropriate variable Click in By variables (optional) box and then double-click
on appropriate variable OK. (For Auction Data, double-click on Auction Price so
that it is displayed under Variables. Then move the cursor into the By variables
(optional) box and double-click on No. of bidders so that it is displayed under By
variables (optional).)
For the Auction Data example, this information is displayed in the session window:

Note: If you see a * in the output, it indicates that the value could not be calculated.
In this example, the numerous * appear because N is not large enough in each group to
calculate all the descriptive statistics. (e.g., there is only one instance where the number
of bidders equals 5, and thus SE Mean, StDev, Q1, and Q3 could not be calculated with
only one data point.)
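The grouped summary has a session-command form as well. A sketch, assuming Auction Price and No. of bidders sit in C1 and C2 (illustrative column positions):

MTB > Describe C1;
SUBC> By C2.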

c) Store Descriptive Statistics
This feature adds the descriptive statistics to the data worksheet instead of displaying
the output in the session window:
Stat > Basic Statistics > Store Descriptive Statistics > Double-click on
appropriate variable (For Dell Data, double-click on Rates of Return so that it is
displayed under Variables).

As you can see from the screen above, you are again given the option to alter the output
by clicking on the buttons. If you click on the Statistics button, this screen will appear:

d) Column Statistics
You can calculate various statistics on columns. Column statistics are displayed in the
Session window, and are optionally stored in a constant.
Calc > Column Statistics > Click by the Statistic you want calculated (For Auction
Data, click by Standard deviation) > Double-click on appropriate column in Input
variable box (Double-click on No. of Bidders) > OK.

This output is displayed in the session window:
Standard Deviation of No. of Bidders
Standard deviation of No. of Bidders = 2.83963
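A session-command sketch of the same column statistic, assuming No. of Bidders is in C2 (an illustrative position; the optional constant K1 stores the result):

MTB > StDev C2 K1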
e) Row Statistics

You can compute one value for each row in a set of columns. The statistic is calculated
across the rows of the column(s) specified and the answers are stored in the
corresponding rows of a new column.

Calc > Row Statistics > Click by the Statistic you want calculated > Double-click
on appropriate variable(s) in Input variables box > Type the name of the new
column that will be created > OK.

Calculating row statistics does not make sense for the example data because it is not
meaningful in context, so a worked example is not given here. However, to see which
row statistics can be calculated, the screen shot is shown below, followed by a generic
session-command sketch.
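As a generic sketch (the input columns C1-C2 and output column C3 are illustrative), a row mean could be computed with:

MTB > RMean C1 C2 C3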

Graphs

a) Histogram

Using the Dell Data that is now inserted into Minitab, a histogram can be made by
going to Graph > Histogram. Then this screen will appear:

Click on appropriate graph and then click OK. (For this example, we will display the
simple histogram.) > Double-click on appropriate variable (For Dell Data, double-
click on Rates of Return so that it is displayed under Graph Variables) > OK.

This histogram will display:

Note: You are able to edit the graph at this point. On the graph below, the arrows
represent where you can double-click to make changes to the graph. You can do this
type of editing on most graphs.

Let’s say you wanted to edit the scale on the x-axis. By double-clicking on any of the x-
axis numbers (for this Dell example, you could double-click on -0.16), this screen
will then appear:

This screen shows the Scale tab. Another way to edit the scale is to click on the Binning
tab. By doing so, this screen will appear:

(The default sets the Interval Definition to Automatic. However, for this Dell example,
click by Midpoint/Cut point positions and replace the numbers given with the new
numbers shown above.)
If you click on the Show tab, this screen will appear:

If you click on the Attributes tab, this screen will appear:

If you click on the Labels tab, this screen will appear:

If you click on the Font tab, this screen will appear:

(The default is set to Tahoma Font, Size 10. For this example, choose Lucida
Handwriting Font, Size 12.)

If you click on the Alignment tab, this screen will appear:

If you click on OK, the new histogram will display:

As you can see, the binning, size, and font have been changed in this example. Since
we originally double-clicked on one of the x-axis numbers, we were able to make
changes regarding that aspect of the graph. Likewise, you can make changes to other
parts of the graph by double-clicking on the appropriate spot. The details for all the
other arrows are not explained here. Basically, you can change the way the text, bars,
and background are displayed.
Another way to alter graphs is to use the buttons. If we go back to our original
histogram example, after going to Graph > Histogram > OK > Double-clicking on
appropriate variable, we are back to this screen:

Here you are given the option to alter the output by clicking on the buttons. If you click
on the Scale button, this screen will appear:

This screen shows the Axes and Ticks tab. If you click on the Y-Scale Type tab, this
screen will appear:

(The default is set for Percent, but for this Dell example, click by Frequency.)
If you click on the Gridlines tab, this screen will appear:

If you click on the Reference Lines tab, this screen will appear:

(There are no reference lines by default, but for this example, type 6 to show a
reference line at y = 6.)
If you click on the Labels button, this screen will appear:

(The default is set for None, but click by Use y-value labels for this example.)
If you click on the Data View button, this screen will appear:

This screen shows the Data Display tab. If you click on the Distribution tab, this screen
will appear:

If you click on the Smoother tab, this screen will appear:

If you click on the Multiple Graphs button, this screen will appear:

This screen shows the Multiple Variables tab. If you click on the By Variables tab, this
screen will appear:

If you click on the Data Options button, this screen will appear:

This screen shows the Subset tab. If you click on the Group Options tab, this screen
will appear:

To display the graph, click on OK. The histogram will display:

b) Dotplot
Graph > Dotplot > Then this screen will appear:

Click on appropriate graph and then click OK. (For this example, we will display the
simple dotplot.) > Double-click on appropriate variable (For Dell Data, double-click
on Rates of Return so that it is displayed under Graph Variables) > OK.

c) Box-and-Whisker Plot (Boxplot)


Graph > Boxplot > Then this screen will appear:

Click on appropriate graph and then click OK. (For this example, we will display the
simple boxplot.) > Double-click on appropriate variable (For Dell Data, double-click
on Rates of Return so that it is displayed under Graph Variables) > OK.

This Box-and-Whisker Plot (Boxplot) will display:

d) Probability Plot
Graph > Probability Plot > This screen will appear:

Click on appropriate graph and then click OK. (For this example, we will display the
single probability plot.) > Double-click on appropriate variable (For Dell Data,
double-click on Rates of Return so that it is displayed under Graph Variables) > OK.

This Probability Plot will display:

e) Graphical Summary
Stat > Basic Statistics > Graphical Summary > Double-click on appropriate
variable (For Dell Data, double-click on Rates of Return so that it is displayed under
Variables) > OK.

Note: The By variables option is used to create multiple graphical summaries based on
a grouping variable, called a by variable. For an example using the Auction
Data, if you use Auction Price as the Variable and No. of Bidders as the By variable,

the output will display a graphical summary for every group of number of bidders. Here
is one of the graphical summaries that is displayed:

Thus, only the auction prices for when the number of bidders = 9 are shown.
f) Bar Chart
i) Bars representing counts of unique values
Choose this graphical format if you have one or more columns of categorical data and
you want to chart the frequency of each category.
Graph > Bar Chart > Choose Counts of unique values from the drop-down box and click
OK. (For this example, we will use the Student Data and show a simple Bar Chart.)

As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Chart Options button, this screen will
appear:

If we had chosen Decreasing Y instead of Default after clicking on the Bar
Chart Options button, the bars would be sorted in decreasing order:
ii) Bars representing a function of a variable
Choose this option if you have one or more columns of data and you want to chart a
function of the data. Quite a few of these functions are summary statistics.
Graph > Bar Chart > Choose A function of a variable from the drop-down box (Then, for
this example, we will click on Cluster under Multiple Y’s) and click OK.

Choose a function from the drop-down box. (Here we’ll choose Mean.)

Double-click on appropriate variable in Graph variables box and then double-click
on appropriate variable in the Categorical variables for grouping box. (For Student
Data, age was put under the graph variable, and gender and portfolio were put under the
categorical variables.) > OK.

The bar chart will display:

iii) Bars representing values from a table


Choose if you have one or more columns of summary data and you want to chart the
summary value for each category.
Graph > Bar Chart > Choose Values from a table from the drop-down box (Then, for this
example, we will click on Simple under One column of values) and click OK.

Double-click on appropriate variable in Graph variables box and then double-click
on appropriate variable in the Categorical variable box. (For Student Data, age was put
under the graph variable and gender was put under the categorical variable.) > OK.

Although summing the ages of males versus females is not very meaningful in context,
this example is included to showcase the use of this function.

g) Pie Chart
i) Chart raw data
Choose when each row in a column represents a single observation. Each slice in the
pie is proportional to the number of occurrences of a value in the column.
Graph > Pie Chart > Click on Chart raw data > Double-click on appropriate
variable in Categorical variables box (For Student Data, double-click on Portfolio).

As you can see from the screen above, you are given the option to alter the output by
clicking on the buttons. If you click on the Pie Options button, this screen will appear:

ii) Chart values from a table
Choose when the category names are in one column and summary data are in another
column.
Let’s look at how to use a pie chart if our data were organized differently. (Look at
Student (2) Data.)
Graph > Pie Chart > Click on Chart values from a table > Double-click on
appropriate variable in Categorical variable box and double-click on appropriate
variable in the Summary variables box. (For Student (2) Data, double-click Gender
for Categorical variable and Count for Summary variables.)

7. STATISTICAL ANALYSIS USING MINITAB AND SPSS
7.1. Inferential statistics Using Minitab
Ways to Analyze Data

Analysis in Minitab can be done in two ways: using the Built-In routines or using
command language in the Session window. These two can be used interchangeably.

Built-In routines
Most of the functions needed in basic and more advanced statistical analysis are found
as Minitab Built-In routines. These routines are accessed through the menu bar. To use
the menu commands, click on an item in the menu bar to open a menu, click on a menu
item to execute a command or open a submenu or dialog box.
Command Language
To be able to type commands in the Session window, you must obtain the “MTB>”
prompt (in menu-driven releases, choose Editor > Enable Commands while the Session
window is active). All commands are then entered after the “MTB>” prompt. All command
lines are free format; in other words, all text may be entered in upper- or lower-case
letters anywhere in the line.
NOTE: This guide focuses mainly on using the Built-In routines. All the explanations
and examples that follow will be done using Minitab’s Built-In routines. A brief
introduction to using Minitab commands is found in a later section.
INFERENTIAL STATISTICS
a. Confidence Intervals:
i. 1-Sample Z: Stat > Basic Statistics > 1-Sample Z > check the alpha level in Options.

ii. 1-Sample t: Stat > Basic Statistics > 1-Sample t > check the alpha level in Options.
b. Hypothesis Testing:
i. 1-Sample Z: Stat > Basic Statistics > 1-Sample Z > check the alpha level and alternative
hypothesis in Options.

ii. 1-Sample t: Stat > Basic Statistics > 1-Sample t > check the alpha level and alternative
hypothesis in Options.
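For reference, classic session-command sketches of these four procedures (the data column C1, the known σ of 10, and the hypothesized mean of 50 are all illustrative values):

MTB > ZInterval 95 10 C1
MTB > TInterval 95 C1
MTB > ZTest 50 10 C1
MTB > TTest 50 C1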

Point and interval estimation

Confidence Intervals about μ, σ Known


1. If you have raw data, enter them in column C1.

2. Select the Stat menu, highlight Basic Statistics, then click 1-Sample Z . . .

3. If you have raw data, enter C1 in the cell marked “Samples in columns:”. If you have
summarized data, select the Summarized data radio button and enter the summarized
values. In the cell marked standard deviation, enter the value of σ. Select Options and
enter a confidence level. Click OK twice.
Confidence Intervals about μ, σ Unknown
1. If you have raw data, enter them in column C1.

2. Select the Stat menu; highlight Basic Statistics, then highlight 1-Sample t . . .

3. If you have raw data, enter C1 in the cell marked “Samples in columns”. If you have
summarized data, select the “Summarized data” radio button and enter the summarized
data. Select Options . . . and enter a confidence level. Click OK twice.
Confidence Intervals about p
1. If you have raw data, enter the data in column C1.

2. Select the Stat menu, highlight Basic Statistics, and then highlight 1 Proportion . . .

3. Enter C1 in the cell marked “Samples in columns” if you have raw data. If you have
summary statistics, click “Summarized data” and enter the number of trials, n, and the
number of events (successes), x.

4. Click the Options . . . button. Enter a confidence level. Click “Use test based on a
normal distribution” (provided that the assumptions stated are satisfied). Click OK
twice.
Confidence Intervals about σ2
1. Enter raw data in column C1

2. Select the Stat menu, highlight Basic Statistics, and then highlight Graphical
Summary . . .
3. Enter C1 in the cell marked “Variables.”

4. Enter the confidence level desired. Click OK. The confidence interval for sigma is
reported in the output.
Testing of hypothesis about one population mean and proportion

Hypothesis Tests Regarding μ, σ2 Known


1. Enter raw data in column C1 if necessary.

2. Select the Stat menu, highlight Basic Statistics, and then highlight 1-Sample Z . . .
Enter the data location, the value of σ, and the hypothesized mean.
3. Click Options. In the cell marked “Alternative,” select the appropriate direction for
the alternative hypothesis. Click OK.
Hypothesis Tests Regarding a Population Proportion
1. If you have raw data, enter them in C1, using 0 for failure and 1 for success.

2. Select the Stat menu, highlight Basic Statistics, then highlight 1-Proportion.

3. If you have raw data, select the “Samples in columns” radio button and enter C1. If
you have summarized statistics, select “Summarized data.” Enter the number of trials
and the number of successes.

4. Click Options. Enter the value of the proportion stated in the null hypothesis. Enter
the direction of the alternative hypothesis. If the normal approximation is appropriate
(for example, if np0(1 − p0) ≥ 10), check the box marked “Use test and
interval based on normal distribution.” Click OK twice.
Hypothesis Tests Regarding a Population Standard Deviation
1. Enter the raw data into column C1 if necessary. Select the Stat menu, highlight
Basic Statistics, and then highlight 1 Variance.

2. Make sure the pull-down menu has “Enter standard deviation” in the window. If you
have raw data, enter C1 in the window marked “Samples in columns” and make sure
the radio button is selected. If you have summarized data, select the “Summarized data”
radio button and enter the sample size and sample standard deviation.

3. Click Options and select the direction of the alternative hypothesis. Click OK.

4. Check the “Perform hypothesis test” box and enter the value of the standard
deviation in the null hypothesis. Click OK.

Comparisons of two population means and proportions

Comparison of two population means

MINITAB will calculate the test value (statistics) and p-value for difference between
the means for two populations when the population standard deviation is unknown.

1. Enter the data into C1 and C2.


2. Select Stat > Basic Statistics > 2-Sample t.
3. Click the button for [Samples in different columns].
4. Click in the box for [First:]. Double-click C1 in the list.
5. Click in the box for [Second:], then double-click C2 in the list. Do not check the box
for [Assume equal variances]; Minitab will use the large-sample formula. The completed
dialog box is shown.
6. Click [Options].
a. Type in 90 for the [Confidence level] and 0 for the [Test difference].
b. Select [greater than] for the [Alternative]. This option affects the p-value. It must be
correct.
7. Click [OK] twice.
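One classic session-command sketch of the same test (C1 and C2 hold the two samples; 90 is the confidence level, and Alternative 1 requests the “greater than” alternative):

MTB > TwoSample 90 C1 C2;
SUBC> Alternative 1.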

The Chi-Square Distribution


If Z is distributed according to the N (0, 1) distribution, then Y = Z2 is distributed
according to the Chi-square (1) distribution. If X1 is distributed Chi-square (k1)
independent of X2 distributed Chi-square (K2), then Y = X1+X2 is distributed
according to the Chi-square (k1 + k2) distribution. There are Minitab commands that
assist in carrying out computations for the Chi-square (k) distribution. Note that k is
any positive value and is referred to as the degree of freedom. The values of the density
curve for the Chi-square (k) distribution can be obtained using the Calc > Probability
Distributions > Chi-Square command, with k as the Degrees of freedom in the dialog
box, or the session command pdf with the subcommand chisquare. For example, the
commands
MTB > pdf C1 C2;
SUBC> Chisquare 4.

calculate the value of the Chi-square (4) density curve at each value in C1 and store
these values in C2. This is useful for plotting the density curve. The Calc > Probability
Distributions > Chi-Square command, or the session commands cdf and invcdf, can also
be used to obtain values of the Chi-square (k) cumulative distribution function and
inverse distribution function, respectively. We use the Calc > Random Data > Chi-
Square command, or the session command random, to obtain random samples from
these distributions.
We will see applications of the chi-square distribution later in the book, but we mention
one here. In particular, if x1, . . ., xn is a sample from an N(μ, σ) distribution, then
(n − 1)s²/σ² = Σ (xi − x̄)²/σ² (sum over i = 1, …, n) is known to follow a
Chi-square(n − 1) distribution, and this fact is used as a basis for inference about σ
(confidence intervals and tests of significance). Because of the non-robustness of these
inferences to small deviations from normality, they are not recommended.
Correlations
While a scatter plot is a convenient graphical method for assessing whether or not there
is any relationship between two variables, we would also like to assess this numerically.
The correlation coefficient provides a numerical summary of the degree to which a linear
relationship exists between two quantitative variables, and it can be calculated using
the Stat > Basic Statistics > Correlation command, or with the session command
Correlate E1 . . . Em
where E1, . . ., Em are columns corresponding to numerical variables, and a correlation
coefficient is computed between each pair. This gives m(m − 1)/2 correlation
coefficients. The subcommand nopvalues is available if you want to suppress the
printing of p-values.
1. With the explanatory variable in C1 and the response variable in C2, select the Stat
menu and highlight Basic Statistics. Highlight Correlation.
2. Select the variables whose correlation you wish to determine and click OK.
 Choose CORRELATION and obtain the following dialog box. Choose the pair
of variables to be analyzed.

3. Results are displayed in the Session window, as presented below.

Doing Calculations with variables


Minitab allows you to do calculations with the variables that you are using. For
example, you can add and multiply variables. To do these calculations,
go to Calc > Calculator.

The following dialog box appears

This is basically a calculator that allows doing many calculations with the variables.
Basic functions are found in the number pad and more sophisticated ones are found in
the functions box to the right of the number pad.
To make sure that your result does not overwrite an existing variable, enter a new
variable name in the “Store result in variable” field at the top of the calculator.
a. Adding variables
1. To add variables name the variable where you want to store the results.
2. Select the first variable, press the “+” sign, and select the second variable (and so on
for more than two variables). You should obtain something similar to the window shown
below.

3. The result will then be shown in the worksheet window
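The Calculator has a session-command equivalent, Let. A minimal sketch (the column numbers are illustrative), storing the sum of C1 and C2 in C3:

MTB > Let C3 = C1 + C2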

Taking logarithms

Another useful function in statistical analysis is to take logs of variables.


1. Look for “NATURAL LOG” or “LOG BASE 10” (depending on the one you need) in
the function list. A shortcut to finding the functions is to choose “Logarithm”
from the function drop-down menu.
2. Inside the parentheses, replace “number” with the variable name. You do this by
highlighting “NUMBER” and then selecting the variable you want to transform.
3. Make sure that you have defined a variable where you want to store results by putting
the name in the “Store result in variable” box.
4. The result will appear in the worksheet window.
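A session-command sketch of the same transformation (C1 and C4 are illustrative columns; LOGE gives natural logs and LOGTEN gives base-10 logs):

MTB > Let C4 = Loge(C1)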

Logical functions
Some statistical analysis will need to separate by groups according to characteristics
that are contained in the data. Logical functions are particularly useful in these cases. A
simple example on how to use them is described below.
1. Choose the variable you want to do the logical test to. Here we are looking at the
“SEX” variable.
2. Choose the logical test you want to use. Here we want to see which observations
have the variable “SEX” equal 1. That is, which observations are males?

3. Make sure that you have indicated a variable in which to store your results, by
typing the name of your result variable in the “STORE RESULT IN VARIABLE”
box.
4. The result variable will be a binary variable (a variable of 1s and 0s), where 1
indicates the logical test is true and 0 indicates it is false. The result variable will
appear in the Worksheet window.
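A session-command sketch of this logical test, assuming SEX is stored in C2 and the indicator is stored in C5 (both column positions are illustrative):

MTB > Let C5 = (C2 = 1)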

Determining the Least-Squares Regression Line
Regression is another technique for assessing the strength of a linear relationship
existing between two variables, and it is closely related to correlation. For this, we use
the Stat > Regression command.
As noted in IPS, the regression analysis of two quantitative variables involves
computing the least-squares line y = a + bx, where one variable is taken to be the
response variable y and the other is taken to be the explanatory variable x.
It is very convenient to have a scatter plot of the points together with the least-squares
line. This can be accomplished using the Stat > Regression > Fitted Line Plot command.
1. With the explanatory variable in C1 and the response variable in C2, select the Stat
menu and highlight Regression. Highlight Regression . . . .
2. Select the explanatory (predictor) variable and response variable and click OK.
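The session-command form of this analysis is Regress. A sketch, assuming the response is in C2 and the single predictor in C1 (illustrative columns; the 1 is the number of predictors):

MTB > Regress C2 1 C1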
The Coefficient of Determination, R², is provided in the standard regression output.
Residual Plots
Follow the same steps as those used to obtain the regression output (Section 4.2).
Before selecting OK, click Graphs. In the cell that says “Residuals versus the
variables,” enter the name of the explanatory variable. Click OK.
Simulation
1. Set the seed by selecting the Calc menu and highlighting Set Base . . . Insert any
seed you wish into the cell and click OK.
2. Select the Calc menu, highlight Random Data, and then highlight Integer.

3. Select the Stat menu, highlight Tables, and then highlight Tally . . . Enter C1 into
the variables cell. Make sure that the Counts box is checked and click OK.

7.2. Inferential Statistics using SPSS

The chi-square Test

 The chi-squared (2) test statistics is widely used in the analysis of contingency
tables.

 The chi-square test examines the hypothesis that the row and column variables in
a cross tabulation are independent.
 It compares the actual observed frequency in each group with the expected
frequency (the latter is based on theory, experience or comparison groups).
 The chi-squared test (Pearson’s χ²) allows us to test for association between
categorical (nominal!) variables.
 The null hypothesis for this test is that there is no association between the variables.
Consequently, a significant p-value implies association.
After opening the Crosstabs dialog box as described in the preceding section, click the
Statistics button to get the following dialog box:

2x2 Contingency table

                       Variable A
                       A1      A2      Total
Variable B    B1       a       b       a+b
              B2       c       d       c+d
Total                  a+c     b+d     n

Test Statistic: χ²-test for a 2 x 2 contingency table

χ² = n(ad − bc)² / [(a + c)(b + d)(a + b)(c + d)]

Test Statistic: 2-test with d.f. = (r-1)x(c-1)

O  Eij  2

 2
 
i, j
ij

Eij

i th raw total  jth column total Ri  C j


Eij  
grand total n
Where:
Oij=observed frequency,
Eij=expected frequency of the cell at the juncture of i th raw & j th column
Assumptions of the 2 - test
 Data must be categorical
 The data should be a frequency data (counts for frequency, proportions
/difference of proportions for prevalence & incidence).
 The chi-squared test assumes adequate sample size -that the numbers in each
cell are ‘not too small’
 No expected frequency should be less than 1, and no more than 20% of the
expected frequencies should be less than 5.
 If some numbers are too small,
 row or column variables categories can sometimes be combined to make
the expected frequencies larger or use Yates correction,
 the Fisher’s exact test should be used instead.
 It assumes that measures are independent of each other i.e. the categories
created are mutually exclusive.
 The 2 - test assumes that there is/must exist theoretical basis for the
categorization of the variables.
Measures of Association
 The chi-square test tells us whether or not an association exists. It is useful,
however, to know how strong the association is.
 The relative risk and the odds ratio are appropriate measures of the strength
of the association.

Test for a relationship between two categorical variables

 Is there an association between being a member of a fraternity/sorority and ever
being diagnosed with depression?
 Hypotheses
 Ho: There is no association between being a member of a
fraternity/sorority and ever being diagnosed with depression.
 HA: There is an association between being a member of a
fraternity/sorority and ever being diagnosed with depression.
 Test: Chi-square test for independence
 Result: Fail to reject null

i n
O  Ei 2
= 
i 1
i
Ei

Ever-Depression * Frat or sorority? Crosstabulation

                                        Frat or sorority?
                                        yes        no         Total
Ever-Depression  yes  Count             681        7692       8373
                      Expected Count    715.6      7657.4     8373.0
                 no   Count             3744       39657      43401
                      Expected Count    3709.4     39691.6    43401.0
Total                 Count             4425       47349      51774
                      Expected Count    4425.0     47349.0    51774.0

Chi-Square Tests

                              Value     df   Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square            2.185(b)  1    .139
Continuity Correction(a)      2.122     1    .145
Likelihood Ratio              2.211     1    .137
Fisher's Exact Test                                                 .141                  .073
Linear-by-Linear Association  2.185     1    .139
N of Valid Cases              51774

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
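The output above can be reproduced with a syntax sketch (depress and frat are hypothetical variable names for the depression and fraternity/sorority variables):

CROSSTABS
  /TABLES=depress BY frat
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED.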

T-tests

 The t test is a useful technique for comparing the mean values of two sets of
numbers
 The comparison will provide you with a statistic for evaluating whether the
difference between two means is statistically significant
 T tests can be used either as an independent-samples t test or a paired-
samples t test
 There are three types of t tests; the options are all located under the Analyze
menu item
 Analyze
Compare Means
One-Sample T test...
Independent-Samples T test...
Paired-Samples T test...
One-Sample T-Test:

 used to compare a single sample with a population value
 tests whether the mean of a single variable differs from a specified constant
Example: College students report drinking an average of 5 drinks the last time they
“partied”/socialized

 Hypotheses
 Ho : µ = 5
 HA: µ ≠ 5
 Test: Two-tailed t-test
 Result: Reject
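The dialog steps correspond to the following syntax sketch (drinks is a hypothetical variable name for the number of drinks reported):

T-TEST
  /TESTVAL=5
  /VARIABLES=drinks.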

One-Sample Statistics

                 N      Mean  Std. Deviation  Std. Error Mean
How many drinks  53374  4.42  4.401           .019

One-Sample Test

Test Value = 5
                 t        df     Sig. (2-tailed)  Mean Difference  95% Confidence Interval of the Difference
                                                                   Lower    Upper
How many drinks  -30.352  53373  .000             -.578            -.62     -.54

The independent-sample t test:

 Used to compare two groups' scores on the same variable.
 In the Independent-Samples T Test dialog box, we have to identify the grouping
variable or cut point by clicking on the Define Groups button after dragging in the
grouping variable.
Independent sample T-test

Example: Men and women report significantly different numbers of sexual partners
over the past 12 months

 Hypotheses
 Ho: µ1 = µ2
 HA: µ1 ≠ µ2
 Test: Independent Samples t-test OR One-way ANOVA
 Result: Reject null
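A syntax sketch of this test (sex and partners are hypothetical variable names, with sex coded 1 and 2):

T-TEST GROUPS=sex(1 2)
  /VARIABLES=partners.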

Group Statistics

                  Sex     N      Mean  Std. Deviation  Std. Error Mean
Partners you had  female  32687  1.34  2.017           .011
                  male    18474  1.82  3.627           .027

Independent Samples Test

Levene's Test for Equality of Variances and t-test for Equality of Means:

Partners you had, equal variances assumed:     F = 867.978, Sig. = .000; t = -19.360, df = 51159, Sig. (2-tailed) = .000, Mean Difference = -.483, Std. Error Difference = .025, 95% CI of the Difference (-.532, -.434)
Partners you had, equal variances not assumed: t = -16.704, df = 25065.988, Sig. (2-tailed) = .000, Mean Difference = -.483, Std. Error Difference = .029, 95% CI of the Difference (-.540, -.426)

Group Statistics

                                      gender  N    Mean   Std. Deviation  Std. Error Mean
verbal fluency - animal naming score  female  855  15.24  5.711           .195
                                      male    580  15.95  5.493           .228

The Group Statistics table tells us the mean animal naming score among males and
females.

Independent Samples Test

Levene's Test for Equality of Variances and t-test for Equality of Means:

Verbal fluency - animal naming score, equal variances assumed:     F = .643, Sig. = .423; t = -2.336, df = 1433, Sig. (2-tailed) = .020, Mean Difference = -.71, Std. Error Difference = .303, 95% CI of the Difference (-1.300, -.113)
Verbal fluency - animal naming score, equal variances not assumed: t = -2.354, df = 1274.743, Sig. (2-tailed) = .019, Mean Difference = -.71, Std. Error Difference = .300, 95% CI of the Difference (-1.296, -.118)

Levene’s test for equality of variances tests the assumption of homogeneity of variance.
If it is not significant, we can say that equal variances are assumed and read the first
row. If it is significant, equal variances are not assumed, and the second row should be
used.

The t-test then tells us whether the mean difference in animal naming score between
males and females is statistically significant.

The paired-sample t test:

 It compares the means of two variables that represent the same group at different
times (e.g. before and after an event) or related groups (e.g., husbands and
wives).
 In the Paired-Samples T Test dialog box, we have to move two variables from the
left-side box to the Paired Variables box.
Note: By clicking on the Options button we can adjust settings such as the confidence level.

The Paired-Samples T Test procedure compares the means of two variables for a single
group. It computes the differences between values of the two variables for each case
and tests whether the average differs from 0.
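A syntax sketch (before and after are hypothetical names for the two paired variables):

T-TEST PAIRS=before WITH after (PAIRED).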

Analysis of Variance

One way Analysis of Variance:

 The One-Way ANOVA compares the means of two or more groups based on one
independent variable (or factor)
 It measures differences among group means.
In SPSS it can be performed as:

From the menus choose:

Analyze

Compare Means

One-Way ANOVA

 Move all dependent variables into the box labeled "Dependent List"
 Move the independent variable into the box labeled "Factor"
 Click on the button labeled "Options"
 Check off the boxes for Descriptive and Homogeneity of Variance
 Click on the box marked "Post Hoc" and choose the appropriate post hoc
comparison
 The groups should have approximately equal variance on the dependent
variable. You can check this by looking at Levene's Test
 If Levene's statistic is significant, we have evidence that the homogeneity
assumption has been violated.
 If it is a problem, you can re-run the analysis selecting the option for "Equal
Variances Not Assumed"
 Hypotheses:
 Null: There are no significant differences between the groups' mean
scores.
 Alternate: There is a significant difference between the groups'
mean scores.
Steps for one-way ANOVA

Step 1: Explore the data

Step 2: State the hypothesis


HO: μ1 = μ2 = … = μk

HA: μi ≠ μj for i ≠ j
Step 3: Computation of the test statistics

F = Variation between the populations / Variation within the populations = SB² / SW²

Steps can be summarized in to ANOVA-table as:

Source of variation       df   Sum of Squares  Mean Square      F-ratio
Between groups/Treatment  k-1  SSB             MST = SSB/(k-1)  MST/MSE
Within groups/Error       n-k  SSE             MSE = SSE/(n-k)
Total                     n-1  SST

The following assumptions should be fulfilled to do ANOVA:

 Data should be quantitative.

 The observations in each group come from a normal distribution.

 The population variances of each group are the same or homogeneous

Levene's Test

 It is used to test homogeneity of variance, with the hypothesis:

 Ho: each group has constant variance.

 If Levene's statistic is significant (p-value < 0.05), we have evidence that the
homogeneity assumption has been violated.

 If this happens, you should re-run the analysis selecting the option for
"Equal Variances Not Assumed",

 because that option calculates the statistic in a way that allows for unequal variances.

Multiple comparisons

 If the means are significantly different (reject Ho), we are interested in which pairs
of means differ. Consequently, we should use a method called multiple
comparisons.

For all test, the hypothesis will be:

Ho: The pair of treatment means is equal (μi = μj for i ≠ j)

H1: Not equal (μi ≠ μj for i ≠ j).

 Reject Ho if p-value < 0.05 or zero is not included in the confidence interval

 To do this in SPSS, click post Hoc button and select method based on equal
variance assumed or not (For this see Levene test of homogeneity of variance)

Test of the difference between two or more means

 Mean BAC reported differs across student residences

 Hypotheses
 Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6

 HA: µi ≠ µj for at least one pair i, j

 Test: One-way ANOVA

 Result: Reject null
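A syntax sketch of this analysis (bac and residence are hypothetical variable names for blood alcohol content and student residence):

ONEWAY bac BY residence
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /POSTHOC=TUKEY ALPHA(0.05).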

Descriptives

Blood Alcohol Content

                          N      Mean   Std. Deviation  Std. Error  95% CI Lower Bound  Upper Bound  Minimum  Maximum
residence hall            21285  .0741  .08215          .00056      .0730               .0752        .00      1.27
frat/sorority house       781    .1127  .09278          .00332      .1062               .1193        .00      .75
other university housing  3620   .0622  .07357          .00122      .0598               .0646        .00      1.41
off campus                18151  .0773  .08539          .00063      .0760               .0785        .00      2.47
with parents              4279   .0606  .08490          .00130      .0581               .0631        .00      1.17
other                     2266   .0579  .08296          .00174      .0545               .0613        .00      1.26
Total                     50382  .0731  .08357          .00037      .0724               .0738        .00      2.47

ANOVA

Blood Alcohol Content

                Sum of Squares  df     Mean Square  F       Sig.
Between Groups  3.188           5      .638         92.123  .000
Within Groups   348.695         50376  .007
Total           351.884         50381

7.3. Regression and Correlation

7.3.1. Correlation Analysis in SPSS


Bivariate Correlation
Bivariate correlations are measures of the linear relationship between two or more variables.

 They can be used to measure the degree of association between two variables.

 To obtain a correlation in SPSS, start at the Analyze menu

 Select the Correlate option from this menu

 By selecting this menu item, you will see that there are three options for correlating
variables:

 Bivariate,
 Partial, and
 Distances
 The bivariate correlation is for situations where you are interested only in the
relationship between two variables

 To obtain a bivariate correlation, choose the following menu option:

Analyze
Correlate
Bivariate...
 Drag the necessary variables to the Variables box

 The partial correlation measures an association between two variables with the
effects of one or more other variables factored out

 To obtain a partial correlation, select the following menu item:

Analyze
Correlate
Partial...

 Under the Partial Correlations dialog box, we have to move the necessary variables
to the Variables box and the Controlling for box.

Partial correlations can be especially useful in situations where it is not obvious
whether variables possess a unique relationship or whether several variables overlap
with each other.

 (xi  x )(yi  y)  xy  [ x  y]/n


r 
 (xi  x )  (yi  y)
2 2
[ x 2  ( x) 2 /n][ y 2  ( y)2 /n]

Test of a relationship between two continuous variables

Example: There is a relationship between the number of drinks students report
drinking the last time they drank and the number of sex partners they have had within
the last school year.

 Hypotheses
 Ho: ρ = 0

 HA: ρ ≠ 0

 Test: Pearson Product Moment Correlation

 Result: Reject null
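A syntax sketch (drinks and partners are hypothetical variable names):

CORRELATIONS
  /VARIABLES=drinks partners
  /PRINT=TWOTAIL NOSIG.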

Correlations

                                       How many drinks  Partners you had
How many drinks   Pearson Correlation  1                .238**
                  Sig. (2-tailed)                       .000
                  N                    53374            52576
Partners you had  Pearson Correlation  .238**           1
                  Sig. (2-tailed)      .000
                  N                    52576            52896

**. Correlation is significant at the 0.01 level (2-tailed).

Partial Correlation in SPSS

 The partial correlation measures the strength of association between two variables
while controlling for the effects of one or more other variables. For example: current and
beginning salary, controlling for the effect of previous experience.

 Partial correlations can be especially useful in situations where it is not obvious
whether several variables overlap with each other.

 To obtain a partial correlation, select the following menu item:

Analyze
Correlate
Partial...

 Under the Partial Correlations dialog box, we have to move the necessary variables
to the Variables box and the Controlling for box.

Example: Let us compare the strength of the relationship between current salary and
beginning salary, after controlling for the effect of previous experience.
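A syntax sketch of this comparison (salary, salbegin and prevexp are hypothetical variable names for current salary, beginning salary and previous experience):

PARTIAL CORR
  /VARIABLES=salary salbegin BY prevexp
  /SIGNIFICANCE=TWOTAIL.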

7.3.2. Linear Regression


A) Simple linear regression
Regression is a technique that can be used to investigate the effect of one or more
predictor variables on an outcome variable

 Fitting a simple linear regression model to the data allows us to explain or predict
the values of one variable (the dependent or outcome or response variable or y)
given the values of a second variable (called the independent or exposure or
explanatory variable or x).

 The basic idea of simple linear regression is to find the straight line which best
fits the data.

For example if we are interested in predicting under-five mortality rate from percentage
of children immunized against DPT we would treat immunization as independent
variable and mortality rate as dependent variable.

Equation of the fitted line

 The result of fitting a simple linear regression model is a regression
equation of the form:

y = a + bx,

where a is the intercept and b is the slope (the estimate of β).

 Interpretation (we are interested in slopes)

 A negative slope indicates an inverse relationship.

 We test the null hypothesis HO: β = 0 and look at the p-value.

B) Multiple linear regression


By extending simple linear regression we get

y = a + b1x1 + b2x2 + … + bnxn.

To conduct a regression analysis, select the following from the Analyze menu

Analyze
Regression
Linear...
This will produce the following dialog box:

R is the multiple correlation coefficient between all of the predictor variables and the
dependent variable.

R Square is used to describe the goodness-of-fit, or the amount of variance explained by a
given set of predictor variables.

 Move the dependent variable to the ‘Dependent’ box and the independent
variable(s) to the ‘Independent(s)’ box.

 After clicking ‘Statistics’, choose ‘Estimates’, ‘Model fit’, ‘Confidence
intervals’ and ‘R squared change’, and click ‘OK’.
 This will give you the between-group and within-group variation (the
regression and residual sums of squares), whose significance is measured using the F-test.

 It also gives you the regression coefficients (the intercept and the slope).

 (The ß = slope gives you the positive or negative relationship between the
predictor and the outcome variable.)

 It also gives you R², which is the explanatory or predictive power of the
model in predicting the outcome variable.

After clicking ‘Statistics’, check:

‘Estimates’,

‘Model fit’,

‘R squared change’, ‘Confidence intervals’
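These choices correspond to the following syntax sketch (marital and vfluency are hypothetical names for the marital status and verbal fluency variables):

REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA CHANGE
  /DEPENDENT vfluency
  /METHOD=ENTER marital.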

OUTPUT
Model Summary

                                                            Change Statistics
Model  R        R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2   Sig. F Change
1      .193(a)  .037      .037               5.496                       .037             52.271    1    1344  .000

a. Predictors: (Constant), marital status

The Model Summary shows the R², which tells us how much of the outcome variable the
predictor variables explain; in this example, it is 3.7%.

ANOVA(b)

Model         Sum of Squares  df    Mean Square  F       Sig.
1 Regression  1578.905        1     1578.905     52.271  .000(a)
  Residual    40597.181       1344  30.206
  Total       42176.086       1345

a. Predictors: (Constant), marital status
b. Dependent Variable: verbal fluency - animal naming score

The ANOVA table also tells us, using the F-test, whether the explanatory variable
predicts the outcome variable well.

Coefficients(a)

                  Unstandardized Coefficients  Standardized Coefficients
Model             B        Std. Error          Beta    t       Sig.  95% CI Lower Bound  Upper Bound
1 (Constant)      17.779   .344                        51.718  .000  17.105              18.454
  marital status  -.808    .112                -.193   -7.230  .000  -1.027              -.589

a. Dependent Variable: verbal fluency - animal naming score

 B is the coefficient that each independent variable contributes to the
dependent variable; it is the slope (ß), while the (Constant) is the intercept, the
predicted value when X = 0. It tells us to what extent (degree) each predictor affects the
outcome, if the effects of all other predictors are held constant.
The equation will be:

Verbal fluency score = ß0 + ß1 × Marital status + …

= 17.78 − 0.81 × Marital status + …

 The standardized coefficient may be useful; it gives a relative estimate in
standard deviation units.
 Student’s t-test estimates the significance; the coefficient is significant if the
upper and lower 95% CI bounds are both negative or both positive.

7.3.3. Regression Diagnostics using SPSS


Without verifying that your data have met the regression assumptions, your results may
be misleading. This sub topic will explore how you can use SPSS to test whether your
data meet the assumptions of linear regression. In particular, we will consider the
following assumptions.

 Linearity - the relationships between the predictors and the outcome variable
should be linear
 Normality - the errors should be normally distributed - technically normality is
necessary only for the t-tests to be valid, estimation of the coefficients only
requires that the errors be identically and independently distributed

 Homogeneity of variance (homoscedasticity) - the error variance should be
constant
 Independence - the errors associated with one observation are not correlated
with the errors of any other observation
 Model specification - the model should be properly specified (including all
relevant variables, and excluding irrelevant variables)

Additionally, there are issues that can arise during the analysis that, while strictly
speaking not assumptions of regression, are nonetheless of great concern to
regression analysts.

 Influence - individual observations that exert undue influence on the
coefficients
 Collinearity - predictors that are highly collinear, i.e. linearly related, can cause
problems in estimating the regression coefficients.

Many graphical methods and numerical tests have been developed over the years for
regression diagnostics and SPSS makes many of these methods easy to access and use.
In this chapter, we will explore these methods and show how to verify regression
assumptions and detect potential problems using SPSS.

7.3.3.1. Unusual and Influential data

A single observation that is substantially different from all other observations can make
a large difference in the results of your regression analysis. If a single observation (or
small group of observations) substantially changes your results, you would want to
know about this and investigate further. There are three ways that an observation can
be unusual.

Outliers: In linear regression, an outlier is an observation with a large residual. In other
words, it is an observation whose dependent-variable value is unusual given its values
on the predictor variables. An outlier may indicate a sample peculiarity or may indicate
a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is called a
point with high leverage. Leverage is a measure of how far an observation deviates
from the mean of that variable. These leverage points can have an unusually large effect
on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation
substantially changes the estimate of coefficients. Influence can be thought of as the
product of leverage and outlierness.

How can we identify these three types of observations? Let's look at an example dataset
called crime. This dataset appears in Statistical Methods for Social Sciences, Third
Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are
state id (sid), state name (state), violent crimes per 100,000 people (crime), murders
per 1,000,000 (murder), the percent of the population living in metropolitan areas
(pctmetro), the percent of the population that is white (pctwhite), percent of
population with a high school education or above (pcths), percent of population living
under poverty line (poverty), and percent of population that are single parents
(single). Below we read in the file and do some descriptive statistics on these
variables. You can click crime.sav to access this file, or see the Regression with SPSS
page to download all of the data files used in this book.

get file = "c:\spssreg\crime.sav".


descriptives
/var=crime murder pctmetro pctwhite pcths poverty single.

Descriptive Statistics

N Minimum Maximum Mean Std. Deviation

CRIME 51 82 2922 612.84 441.100

MURDER 51 1.60 78.50 8.7275 10.71758

PCTMETRO 51 24.00 100.00 67.3902 21.95713

PCTWHITE 51 31.80 98.50 84.1157 13.25839

PCTHS 51 64.30 86.60 76.2235 5.59209

POVERTY 51 8.00 26.40 14.2588 4.58424

SINGLE 51 8.40 22.10 11.3255 2.12149

Valid N (listwise) 51

Let's say that we want to predict crime by pctmetro, poverty, and single. That is to
say, we want to build a linear regression model between the response variable crime
and the independent variables pctmetro, poverty and single. We will first look at the
scatter plots of crime against each of the predictor variables before the regression
analysis so we will have some ideas about potential problems. We can create a scatter
plot matrix of these variables as shown below.

graph
/scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty
single .

The graphs of crime with other variables show some potential problems. In every plot,
we see a data point that is far away from the rest of the data points. Let's make
individual graphs of crime with pctmetro and poverty and single so we can get a
better view of these scatterplots. We will use BY state(name) to plot the state name
instead of a point.

GRAPH /SCATTERPLOT(BIVAR)=pctmetro WITH crime BY state(name) .

GRAPH /SCATTERPLOT(BIVAR)=poverty WITH crime BY state(name) .

GRAPH /SCATTERPLOT(BIVAR)=single WITH crime BY state(name) .

All the scatter plots suggest that the observation for state = "dc" is a point that requires
extra attention since it stands out away from all of the other points. We will keep it in
mind when we do our regression analysis.
Now let's try the regression command predicting crime from pctmetro poverty and
single. We will go step-by-step to identify all the potentially unusual or influential
points afterwards.
regression
/dependent crime
/method=enter pctmetro poverty single.

Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 SINGLE, PCTMETRO, POVERTY(a) . Enter

a All requested variables entered.

b Dependent Variable: CRIME

Model Summary(b)
Model R R Square Adjusted R Square Std. Error of the Estimate

1 .916(a) .840 .830 182.068

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 8170480.211 3 2723493.404 82.160 .000(a)

1 Residual 1557994.534 47 33148.820

Total 9728474.745 50

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

Coefficients(a)

              Unstandardized Coefficients  Standardized Coefficients
Model         B          Std. Error        Beta    t        Sig.
1 (Constant)  -1666.436  147.852                   -11.271  .000
  PCTMETRO    7.829      1.255             .390    6.240    .000
  POVERTY     17.680     6.941             .184    2.547    .014
  SINGLE      132.408    15.503            .637    8.541    .000

a Dependent Variable: CRIME

Let's examine the standardized residuals as a first means for identifying outliers. Below
we use the /residuals=histogram subcommand to request a histogram for the
standardized residuals. As you see, we get the standard output that we got above, as
well as a table with information about the smallest and largest residuals, and a
histogram of the standardized residuals. The histogram indicates a couple of extreme
residuals worthy of investigation.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram.

Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method

1 SINGLE, PCTMETRO, POVERTY(a) . Enter

a All requested variables entered.


b Dependent Variable: CRIME

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .916(a) .840 .830 182.068

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

ANOVA(b)

Model         Sum of Squares  df  Mean Square  F       Sig.
1 Regression  8170480.211     3   2723493.404  82.160  .000(a)
  Residual    1557994.534     47  33148.820
  Total       9728474.745     50

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

Coefficients(a)

              Unstandardized Coefficients  Standardized Coefficients
Model         B          Std. Error        Beta    t        Sig.
1 (Constant)  -1666.436  147.852                   -11.271  .000
  PCTMETRO    7.829      1.255             .390    6.240    .000
  POVERTY     17.680     6.941             .184    2.547    .014
  SINGLE      132.408    15.503            .637    8.541    .000

a Dependent Variable: CRIME

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value -30.51 2509.43 612.84 404.240 51

Residual -523.01 426.11 .00 176.522 51

Std. Predicted Value -1.592 4.692 .000 1.000 51

Std. Residual -2.873 2.340 .000 .970 51

a Dependent Variable: CRIME

Let's now request the same kind of information, except for the studentized deleted
residual. The studentized deleted residual is the residual that would be obtained if the
regression was re-run omitting that observation from the analysis. This is useful
because some points are so influential that when they are included in the analysis they
can pull the regression line close to that observation making it appear as though it is not
an outlier -- however when the observation is deleted it then becomes more obvious
how outlying it is. To save space, below we show just the output related to the residual
analysis.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid).
Residuals Statistics(a)

                                   Minimum  Maximum  Mean    Std. Deviation  N
Predicted Value                    -30.51   2509.43  612.84  404.240         51
Std. Predicted Value               -1.592   4.692    .000    1.000           51
Standard Error of Predicted Value  25.788   133.343  47.561  18.563          51
Adjusted Predicted Value           -39.26   2032.11  605.66  369.075         51
Residual                           -523.01  426.11   .00     176.522         51
Std. Residual                      -2.873   2.340    .000    .970            51
Stud. Residual                     -3.194   3.328    .015    1.072           51
Deleted Residual                   -646.50  889.89   7.18    223.668         51
Stud. Deleted Residual             -3.571   3.766    .018    1.133           51
Mahal. Distance                    .023     25.839   2.941   4.014           51
Cook's Distance                    .000     3.203    .089    .454            51
Centered Leverage Value            .000     .517     .059    .080            51

a Dependent Variable: CRIME

The histogram shows some possible outliers. We can use the outliers (sdresid) and
id(state) options to request the 10 most extreme values for the studentized deleted
residual to be displayed labeled by the state from which the observation
originated. Below we show the output generated by this option, omitting all of the rest
of the output to save space. You can see that "dc" has the largest value (3.766)
followed by "ms" (-3.571) and "fl" (2.620).
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid).

Outlier Statistics(a)

                        Case Number  STATE  Statistic
Stud. Deleted Residual  1    51      dc     3.766
                        2    25      ms     -3.571
                        3    9       fl     2.620
                        4    18      la     -1.839
                        5    39      ri     -1.686
                        6    12      ia     1.590
                        7    47      wa     -1.304
                        8    13      id     1.293
                        9    14      il     1.152
                        10   35      oh     -1.148

a Dependent Variable: CRIME

We can use the /casewise subcommand below to request a display of all observations
where the sdresid exceeds 2. To save space, we show just the new output generated by
the /casewise subcommand. This shows us that Florida, Mississippi and Washington
DC have sdresid values exceeding 2.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid)
/casewise=plot(sdresid) outliers(2) .

Casewise Diagnostics(a)

Case Number  STATE  Stud. Deleted Residual  CRIME  Predicted Value  Residual
9            fl     2.620                   1206   779.89           426.11
25           ms     -3.571                  434    957.01           -523.01
51           dc     3.766                   2922   2509.43          412.57

a Dependent Variable: CRIME

Now let's look at the leverage values to identify observations that will have potential
great influence on regression coefficient estimates. We can include lever with the
histogram ( ) and the outliers ( ) options to get more information about observations
with high leverage. We show just the new output generated by these additional
subcommands below. Generally, a point with leverage greater than (2k+2)/n should be
carefully examined. Here k is the number of predictors and n is the number of
observations, so a value exceeding (2*3+2)/51 = .1568 would be worthy of further
investigation. As you see, there are 4 observations that have leverage values higher
than .1568.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid lever)
/casewise=plot(sdresid) outliers(2).

Outlier Statistics(a)

                         Case Number  STATE  Statistic
Stud. Deleted Residual   1    51      dc     3.766
                         2    25      ms     -3.571
                         3    9       fl     2.620
                         4    18      la     -1.839
                         5    39      ri     -1.686
                         6    12      ia     1.590
                         7    47      wa     -1.304
                         8    13      id     1.293
                         9    14      il     1.152
                         10   35      oh     -1.148
Centered Leverage Value  1    51      dc     .517
                         2    1       ak     .241
                         3    25      ms     .171
                         4    49      wv     .161
                         5    18      la     .146
                         6    46      vt     .117
                         7    9       fl     .083
                         8    26      mt     .080
                         9    31      nj     .075
                         10   17      ky     .072

a Dependent Variable: CRIME

As we have seen, DC is an observation that both has a large residual and large
leverage. Such points are potentially the most influential. We can make a plot that
shows the leverage by the residual and look for observations that are high in leverage
and have a high residual. We can do this using the /scatterplot subcommand as shown
below. This is a quick way of checking potential influential observations and outliers at
the same time. Both types of points are of great concern for us. As we see, "dc" is both
a high residual and high leverage point, and "ms" has an extremely negative residual
but does not have such a high leverage.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever)
/casewise=plot(sdresid) outliers(2)
/scatterplot(*lever, *sdresid).

Now let's move on to overall measures of influence, specifically let's look at Cook's D,
which combines information on the residual and leverage. The lowest value that Cook's
D can assume is zero, and the higher the Cook's D is, the more influential the point is.
The conventional cut-off point is 4/n, or in this case 4/51 or .078. Below we add the
cook keyword to the outliers( ) option and also on the /casewise subcommand and
below we see that for the 3 outliers flagged in the "Casewise Diagnostics" table, the
value of Cook's D exceeds this cutoff. And, in the "Outlier Statistics" table, we see that
"dc", "ms", "fl" and "la" are the 4 states that exceed this cutoff, all others falling below
this threshold.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid).

Casewise Diagnostics(a)

Case Number  STATE  Stud. Deleted Residual  CRIME  Cook's Distance  DFFIT
9            fl     2.620                   1206   .174             48.507
25           ms     -3.571                  434    .602             -123.490
51           dc     3.766                   2922   3.203            477.319

a Dependent Variable: CRIME

Outlier Statistics(a)

Case Number STATE Statistic Sig. F
Stud. Deleted Residual
1 51 dc 3.766
2 25 ms -3.571
3 9 fl 2.620
4 18 la -1.839
5 39 ri -1.686
6 12 ia 1.590
7 47 wa -1.304
8 13 id 1.293
9 14 il 1.152
10 35 oh -1.148
Cook's Distance
1 51 dc 3.203 .021
2 25 ms .602 .663
3 9 fl .174 .951
4 18 la .159 .958
5 39 ri .041 .997
6 12 ia .041 .997
7 13 id .037 .997
8 20 md .020 .999
9 6 co .018 .999
10 49 wv .016 .999
Centered Leverage Value
1 51 dc .517
2 1 ak .241
3 25 ms .171
4 49 wv .161
5 18 la .146
6 46 vt .117
7 9 fl .083
8 26 mt .080
9 31 nj .075
10 17 ky .072
a Dependent Variable: CRIME
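
As with leverage, you can save Cook's D into the data file and flag the cases that
exceed 4/n yourself. A minimal sketch, with cd_1 and highcook as our own illustrative
variable names:

regression
/dependent crime
/method=enter pctmetro poverty single
/save cook(cd_1).
* flag cases whose Cook's D exceeds 4/n = 4/51.
compute highcook = (cd_1 > 4/51).
execute.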
Cook's D can be thought of as a general measure of influence. You can also consider
more specific measures of influence that assess how each coefficient is changed by
including the observation. Imagine that you compute the regression coefficients with a
particular case excluded, then recompute the model with the case included, and observe
the change in the regression coefficients due to including that case in the model. This
measure is called DFBETA, and a DFBETA value can be computed for each
observation for each predictor. As shown below, we use the /save sdbeta(sdfb)
subcommand to save the standardized DFBETA values (DFBETAs expressed in
standard-error units) for each coefficient. This saves 4 variables into the current data
file, sdfb1 through sdfb4, one for each coefficient in the model (the intercept and the
predictors pctmetro, poverty and single). We could replace sdfb with anything we
like, and the variables created would start with the prefix that we provide.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/save sdbeta(sdfb).

The /save sdbeta(sdfb) subcommand does not produce any new output, but we can see
the variables it created for the first 10 cases using the list command below. For
example, by including the case for "ak" in the regression analysis (as compared to
excluding this case), the coefficient for pctmetro changes by -.106 standard errors;
that is, it decreases by .106 standard errors. Likewise, including "ak" makes the
coefficient for poverty .131 standard errors smaller and the coefficient for single
.145 standard errors larger (as compared to a model excluding "ak"). Since the
inclusion of an observation can either increase or decrease a regression coefficient,
DFBETAs can be either positive or negative. A DFBETA value in excess of 2/sqrt(n)
merits further investigation. In this example, we would be concerned about absolute
values in excess of 2/sqrt(51) = .28.

list
/variables state sdfb1 sdfb2 sdfb3
/cases from 1 to 10.
STATE SDFB1 SDFB2 SDFB3
ak -.10618 -.13134 .14518
al .01243 .05529 -.02751
ar -.06875 .17535 -.10526
az -.09476 -.03088 .00124
ca .01264 .00880 -.00364
co -.03705 .19393 -.13846
ct -.12016 .07446 .03017
de .00558 -.01143 .00519
fl .64175 .59593 -.56060
ga .03171 .06426 -.09120
Number of cases read: 10 Number of cases listed: 10
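
To apply the 2/sqrt(n) rule of thumb without scanning the listing by eye, you can flag
large values directly. A sketch, assuming the sdfb variables saved above and using
dfbflag as our own variable name:

* flag cases where any standardized DFBETA exceeds 2/sqrt(51) in absolute value.
compute dfbflag = (abs(sdfb1) > 2/sqrt(51) or abs(sdfb2) > 2/sqrt(51)
 or abs(sdfb3) > 2/sqrt(51)).
execute.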

We can plot the DFBETA values for all three coefficients against the state id in one
graph, shown below, to help us see potentially troublesome observations. We changed
the variable labels for sdfb1, sdfb2 and sdfb3 so they would be shorter and more
clearly labeled in the graph. We can see that the DFBETA for single for "dc" is about
3, indicating that by including "dc" in the regression model, the coefficient for single is
3 standard errors larger than it would have been if "dc" had been omitted. This is yet
another bit of evidence that the observation for "dc" is very problematic.

VARIABLE LABELS sdfb1 "Sdfbeta pctmetro"
/sdfb2 "Sdfbeta poverty"
/sdfb3 "Sdfbeta single".
GRAPH
/SCATTERPLOT(OVERLAY)=sid sid sid WITH sdfb1 sdfb2 sdfb3 (PAIR) BY
state(name)
/MISSING=LISTWISE.

The following table summarizes the general rules of thumb we use for the measures we
have discussed for identifying observations worthy of further investigation (where k is
the number of predictors and n is the number of observations).

Measure                            Value
leverage                           > (2k+2)/n
abs(studentized deleted residual)  > 2
Cook's D                           > 4/n
abs(DFBETA)                        > 2/sqrt(n)

We have shown a few examples of the variables that can be referred to on the /residuals,
/casewise, /scatterplot and /save subcommands. Here is a list of all of the
variables that can be used on these subcommands; note, however, that not all variables
can be used on every subcommand.

PRED      Unstandardized predicted values.
RESID     Unstandardized residuals.
DRESID    Deleted residuals.
ADJPRED   Adjusted predicted values.
ZPRED     Standardized predicted values.
ZRESID    Standardized residuals.
SRESID    Studentized residuals.
SDRESID   Studentized deleted residuals.
SEPRED    Standard errors of the predicted values.
MAHAL     Mahalanobis distances.
COOK      Cook's distances.
LEVER     Centered leverage values.
DFBETA    Change in the regression coefficient that results from the deletion of the
          ith case. A DFBETA value is computed for each case for each regression
          coefficient generated by a model.
SDBETA    Standardized DFBETA. An SDBETA value is computed for each case for
          each regression coefficient generated by a model.
DFFIT     Change in the predicted value when the ith case is deleted.
SDFIT     Standardized DFFIT.
COVRATIO  Ratio of the determinant of the covariance matrix with the ith case deleted
          to the determinant of the covariance matrix with all cases included.
MCIN      Lower and upper bounds for the prediction interval of the mean predicted
          response. A lower bound LMCIN and an upper bound UMCIN are
          generated. The default confidence interval is 95%. The confidence interval
          can be reset with the CIN subcommand. (See Dillon & Goldstein.)
ICIN      Lower and upper bounds for the prediction interval for a single
          observation. A lower bound LICIN and an upper bound UICIN are
          generated. The default confidence interval is 95%. The confidence interval
          can be reset with the CIN subcommand. (See Dillon & Goldstein.)

In addition to the numerical measures we have shown above, there are also several
graphs that can be used to search for unusual and influential observations. The partial-
regression plot is very useful in identifying influential points. Below we add the
/partialplot subcommand to produce partial-regression plots for all of the predictors.
For example, in the 3rd plot below you can see the partial-regression plot showing
crime by single after both crime and single have been adjusted for all other
predictors in the model. The line plotted has the same slope as the coefficient for
single. This plot shows how the observation for DC influences the coefficient: you
can see how the regression line is tugged upwards trying to fit through the extreme
value of DC. Alaska and West Virginia may also exert substantial leverage on the
coefficient for single. These plots are useful for seeing how a single point may
be influencing the regression line, while taking the other variables in the model into
account.
Note that the regression line is not automatically produced in the graph. We double-
clicked on the graph, then chose "Chart", then "Options", and then "Fit Line Total" to
add a regression line to each of the graphs below.
regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/partialplot.

DC has appeared as an outlier as well as an influential point in every analysis. Since
DC is really not a state, we can use this to justify omitting it from the analysis, saying
that we really wish to analyze just states. First, let's repeat our analysis including DC
below.
regression
/dependent crime
/method=enter pctmetro poverty single.

<some output omitted to save space>

Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is standardized.)
Model 1     B          Std. Error  Beta  t        Sig.
(Constant)  -1666.436  147.852           -11.271  .000
PCTMETRO    7.829      1.255       .390  6.240    .000
POVERTY     17.680     6.941       .184  2.547    .014
SINGLE      132.408    15.503      .637  8.541    .000
a Dependent Variable: CRIME

Now, let's run the analysis omitting DC, using the filter command to exclude "dc" from
the analysis. As we expect, deleting DC made a large change in the coefficient for
single, which dropped from 132.4 to 89.4. After having deleted DC, we would repeat
the process we have illustrated in this section to search for any other outlying and
influential observations.

compute filtvar = (state NE "dc").
filter by filtvar.
regression
/dependent crime
/method=enter pctmetro poverty single.

<some output omitted to save space>

Coefficients(a)
Model 1     B          Std. Error  Beta  t       Sig.
(Constant)  -1197.538  180.487           -6.635  .000
PCTMETRO    7.712      1.109       .565  6.953   .000
POVERTY     18.283     6.136       .265  2.980   .005
SINGLE      89.401     17.836      .446  5.012   .000
a Dependent Variable: CRIME
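
One caution: the filter remains in effect for all later procedures. When you are done
with the filtered analyses, restore the full data set:

* turn the filter off so subsequent analyses use all cases again.
filter off.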

Summary
In this section, we explored a number of methods of identifying outliers and influential
points. In a typical analysis, you would probably use only some of these
methods. Generally speaking, there are two types of methods for assessing outliers:
statistics such as residuals, leverage, and Cook's D, which assess the overall impact of
an observation on the regression results, and statistics such as DFBETA that assess the
specific impact of an observation on the regression coefficients. In our example, we
found that DC was a point of major concern. We performed a regression with it and
without it, and the regression equations were very different. We can justify removing it
from our analysis by reasoning that our model is meant to predict the crime rate for
states, not for metropolitan areas.

7.3.3.2. Tests for Normality of Residuals

One of the assumptions of linear regression analysis is that the residuals are normally
distributed. This assumption needs to hold for the p-values of the t-tests to be valid.
Let's use the elemapi2 data file we saw in Chapter 1 for these analyses. Let's
predict academic performance (api00) from the percent of students receiving free meals
(meals), the percent of English language learners (ell), and the percent of teachers with
emergency credentials (emer). We then use the /save subcommand to generate the
residuals.

get file="c:\spssreg\elemapi2.sav".
regression
/dependent api00
/method=enter meals ell emer
/save resid(apires).
Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method
1 EMER, ELL, MEALS(a) . Enter
a All requested variables entered.
b Dependent Variable: API00

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .914(a)  .836      .835               57.820

a Predictors: (Constant), EMER, ELL, MEALS
b Dependent Variable: API00

ANOVA(b)
Model 1     Sum of Squares  df   Mean Square  F        Sig.
Regression  6749782.747     3    2249927.582  672.995  .000(a)
Residual    1323889.251     396  3343.155
Total       8073671.997     399
a Predictors: (Constant), EMER, ELL, MEALS
b Dependent Variable: API00

Coefficients(a)

Model 1     B        Std. Error  Beta   t        Sig.
(Constant)  886.703  6.260              141.651  .000
MEALS       -3.159   .150        -.709  -21.098  .000
ELL         -.910    .185        -.159  -4.928   .000
EMER        -1.573   .293        -.130  -5.368   .000

a Dependent Variable: API00

Casewise Diagnostics(a)

Case Number Std. Residual API00

93 3.087 604

226 -3.208 386

a Dependent Variable: API00

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 425.52 884.88 647.62 130.064 400

Residual -185.47 178.48 .00 57.602 400

Std. Predicted Value -1.708 1.824 .000 1.000 400

Std. Residual -3.208 3.087 .000 .996 400

a Dependent Variable: API00

We now use the examine command to look at the normality of these residuals. All of
the results from the examine command suggest that the residuals are normally
distributed -- the skewness and kurtosis are near 0, the "tests of normality" are not
significant, the histogram looks normal, and the Q-Q plot looks normal. Based on these
results, the residuals from this regression appear to conform to the assumption of being
normally distributed.
examine
variables=apires
/plot boxplot stemleaf histogram npplot.
Case Processing Summary

        Cases
        Valid        Missing     Total
        N    Percent N  Percent  N    Percent
APIRES  400  100.0   0  .0       400  100.0

Descriptives

                                            Statistic    Std. Error
APIRES  Mean                                .0000000     2.88011205
        95% Confidence Interval for Mean
          Lower Bound                       -5.6620909
          Upper Bound                       5.6620909
        5% Trimmed Mean                     -.7827765
        Median                              -3.6572906
        Variance                            3318.018
        Std. Deviation                      57.60224104
        Minimum                             -185.47331
        Maximum                             178.48224
        Range                               363.95555
        Interquartile Range                 76.5523053
        Skewness                            .171         .122
        Kurtosis                            .135         .243

Tests of Normality
Kolmogorov-Smirnov(a) Shapiro-Wilk
Statistic df Sig. Statistic df Sig.
APIRES .033 400 .200(*) .996 400 .510
* This is a lower bound of the true significance.
a Lilliefors Significance Correction

Unstandardized Residual Stem-and-Leaf Plot
Frequency Stem & Leaf
1.00 Extremes (=<-185)
2.00 -1 . 4
3.00 -1 . 2&
7.00 -1 . 000
15.00 -0 . 8888899
35.00 -0 . 66666666667777777
37.00 -0 . 444444444555555555
49.00 -0 . 222222222222223333333333
61.00 -0 . 000000000000000011111111111111
48.00 0 . 000000111111111111111111
49.00 0 . 222222222222233333333333
28.00 0 . 4444445555555
31.00 0 . 666666666677777
16.00 0 . 88888899
9.00 1 . 0011
3.00 1 . 2&
1.00 1 .&
5.00 Extremes (>=152)
Stem width: 100.0000
Each leaf: 2 case(s); & denotes fractional leaves.
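
A histogram with a superimposed normal curve gives one more visual check on the
same residuals; a minimal sketch using the apires variable saved above:

* histogram of the saved residuals with a normal curve overlaid.
frequencies variables=apires
/format=notable
/histogram=normal.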

Heteroscedasticity

Another assumption of ordinary least squares regression is that the variance of the
residuals is homogeneous across levels of the predicted values, also known as
homoscedasticity. If the model is well-fitted, there should be no pattern in the residuals
plotted against the fitted values. If the variance of the residuals is non-constant, the
residual variance is said to be "heteroscedastic." Below we illustrate graphical methods
for detecting heteroscedasticity. A commonly used graphical method is the residual-
versus-fitted plot, which shows the residuals against the fitted (predicted) values. Below
we use the /scatterplot subcommand to plot *zresid (standardized residuals) by *pred
(the predicted values). We see that the pattern of the data points gets a little
narrower towards the right end, an indication of mild heteroscedasticity.

regression
/dependent api00
/method=enter meals ell emer
/scatterplot(*zresid *pred).
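
Besides the visual check, you can run a rough Breusch-Pagan-style auxiliary regression
by hand: regress the squared residuals on the predictors and see whether they explain
any of the residual variation. This is only a sketch, not a built-in SPSS test; the
variable names res1 and res1sq are our own.

regression
/dependent api00
/method=enter meals ell emer
/save resid(res1).
* square the residuals and regress them on the predictors; a significant
* overall F suggests the residual variance depends on the predictors.
compute res1sq = res1**2.
regression
/dependent res1sq
/method=enter meals ell emer.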

Let's run a model where we include just enroll as a predictor and show the residual vs.
predicted plot. As you can see, this plot shows serious heteroscedasticity: the
variability of the residuals when the predicted value is around 700 is much larger than
when the predicted value is around 500 or 600.

regression
/dependent api00
/method=enter enroll
/scatterplot(*zresid *pred).

As we saw in Chapter 1, the variable enroll was skewed considerably to the right, and
we found that by taking a log transformation, the transformed variable was more
normally distributed. Below we transform enroll, run the regression and show the
residual versus fitted plot. The distribution of the residuals is much
improved. Certainly, this is not a perfect distribution of residuals, but it is much better
than the distribution with the untransformed variable.
compute lenroll = ln(enroll).
regression
/dependent api00
/method=enter lenroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method
1 LENROLL(a) . Enter
a. All requested variables entered.
b. Dependent Variable: API00

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .275(a)  .075      .073               136.946
a Predictors: (Constant), LENROLL
b Dependent Variable: API00

ANOVA(b)
Model 1     Sum of Squares  df   Mean Square  F       Sig.
Regression  609460.408      1    609460.408   32.497  .000(a)
Residual    7464211.589     398  18754.300
Total       8073671.997     399
a Predictors: (Constant), LENROLL
b Dependent Variable: API00

Coefficients(a)
Model 1     B         Std. Error  Beta   t       Sig.
(Constant)  1170.429  91.966             12.727  .000
LENROLL     -86.000   15.086      -.275  -5.701  .000
a Dependent Variable: API00

Residuals Statistics(a)

                      Minimum  Maximum  Mean    Std. Deviation  N
Predicted Value       537.57   751.82   647.62  39.083          400
Residual              -288.65  295.47   .00     136.775         400
Std. Predicted Value  -2.816   2.666    .000    1.000           400
Std. Residual         -2.108   2.158    .000    .999            400

a Dependent Variable: API00

Finally, let's revisit the model we used at the start of this section, predicting api00 from
meals, ell and emer. Using this model, the distribution of the residuals looked very
nice and even across the fitted values. What if we add enroll to this model? Will this
automatically ruin the distribution of the residuals? Let's add it and see.
regression
/dependent api00
/method=enter meals ell emer enroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)
Model  Variables Entered            Variables Removed  Method
1      ENROLL, MEALS, EMER, ELL(a)  .                  Enter
a All requested variables entered.
b Dependent Variable: API00

Model Summary(b)
Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .915(a)  .838      .836               57.552
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00

ANOVA(b)
Model 1     Sum of Squares  df   Mean Square  F        Sig.
Regression  6765344.050     4    1691336.012  510.635  .000(a)
Residual    1308327.948     395  3312.223
Total       8073671.997     399
a Predictors: (Constant), ENROLL, MEALS, EMER, ELL
b Dependent Variable: API00

Coefficients(a)
Model 1     B           Std. Error  Beta   t        Sig.
(Constant)  899.147     8.472              106.128  .000
MEALS       -3.222      .152        -.723  -21.223  .000
ELL         -.768       .195        -.134  -3.934   .000
EMER        -1.418      .300        -.117  -4.721   .000
ENROLL      -3.126E-02  .014        -.050  -2.168   .031
a Dependent Variable: API00

Casewise Diagnostics(a)

Case Number Std. Residual API00

93 3.004 604

226 -3.311 386

a Dependent Variable: API00

Residuals Statistics(a)

                      Minimum  Maximum  Mean    Std. Deviation  N
Predicted Value       430.82   888.08   647.62  130.214         400
Residual              -190.56  172.86   .00     57.263          400
Std. Predicted Value  -1.665   1.847    .000    1.000           400
Std. Residual         -3.311   3.004    .000    .995            400

a Dependent Variable: API00

As you can see, the distribution of the residuals looks fine, even after we added the
variable enroll. When we had just the variable enroll in the model, we did a log
transformation to improve the distribution of the residuals, but when enroll was part of
a model with other variables, the residuals looked good so no transformation was
needed. This illustrates how the distribution of the residuals, not the distribution of the
predictor, was the guiding factor in determining whether a transformation was needed.

7.3.3.3. Collinearity

When there is a perfect linear relationship among the predictors, the estimates for a
regression model cannot be uniquely computed. The term collinearity implies that two
variables are near perfect linear combinations of one another. When more than two
variables are involved it is often called multicollinearity, although the two terms are
often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression
model estimates of the coefficients become unstable and the standard errors for the
coefficients can get wildly inflated. In this section, we will explore some SPSS
commands that help to detect multicollinearity.
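
To see the extreme case, you can manufacture a perfectly collinear predictor and watch
SPSS refuse to enter it (variables whose tolerance falls below the minimum criterion are
left out of the equation). A contrived sketch using the elemapi2 variables; meals2 is our
own name:

* meals2 is an exact linear function of meals, so its tolerance is 0 and
* SPSS will not enter it into the equation.
compute meals2 = 2*meals.
regression
/dependent api00
/method=enter meals meals2.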

We can use the /statistics=defaults tol subcommand to request the display of "tolerance"
and "VIF" values for each predictor as a check for multicollinearity. The "tolerance" is
an indication of the percent of variance in the predictor that cannot be accounted for by
the other predictors; hence very small values indicate that a predictor is redundant, and
values less than .10 may merit further investigation. The VIF, which stands for
variance inflation factor, is 1/tolerance, and as a rule of thumb a variable whose VIF
value is greater than 10 may merit further investigation; for example, a tolerance of
.367 gives a VIF of 1/.367 = 2.725. Let's first look at the regression we ran in the last
section, the model predicting api00 from meals, ell and emer, using the
/statistics=defaults tol subcommand. As you can see, the "tolerance" and "VIF" values
are all quite acceptable.

regression
/statistics=defaults tol
/dependent api00
/method=enter meals ell emer .
<some output deleted to save space>

Coefficients(a)

Model 1     B        Std. Error  Beta   t        Sig.   Tolerance  VIF
(Constant)  886.703  6.260              141.651  .000
MEALS       -3.159   .150        -.709  -21.098  .000   .367       2.725
ELL         -.910    .185        -.159  -4.928   .000   .398       2.511
EMER        -1.573   .293        -.130  -5.368   .000   .707       1.415

a Dependent Variable: API00
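
To see where a tolerance value comes from, you can compute it by hand: regress the
predictor on the other predictors and subtract the resulting R-square from 1. For meals,
the R-square from the sketch below should be about .633, since 1 - .633 = .367, the
tolerance reported above.

* tolerance for meals by hand: 1 minus the R-square from regressing
* meals on the remaining predictors.
regression
/dependent meals
/method=enter ell emer.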

Now let's consider another example where the "tolerance" and "VIF" values are more
worrisome. In the regression analysis below, we use acs_k3, avg_ed, grad_sch,
col_grad and some_col as predictors of api00. As you see, the "tolerance" values for
avg_ed, grad_sch and col_grad are below .10; the value for avg_ed is about .02,
indicating that only about 2% of the variance in avg_ed is not predictable given the
other predictors in the model. All of these variables measure the education of the
parents, and the very low "tolerance" values indicate that these variables contain
redundant information. For example, after you know grad_sch and col_grad, you
probably can predict avg_ed very well. In this example, multicollinearity arises
because we have put in too many variables that measure the same thing, parent
education.

We also include the collin option, which produces the "Collinearity Diagnostics" table
below. The very low eigenvalues for the last two dimensions (with 5 predictors plus the
constant, there are 6 dimensions) are another indication of problems with
multicollinearity. Likewise, the very high "Condition Index" values for those
dimensions similarly indicate problems with multicollinearity among these predictors.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 avg_ed grad_sch col_grad some_col.
<some output deleted to save space>

Coefficients(a)

Model 1     B        Std. Error  Beta   t       Sig.   Tolerance  VIF
(Constant)  -82.609  81.846             -1.009  .313
ACS_K3      11.457   3.275       .107   3.498   .001   .972       1.029
AVG_ED      227.264  37.220      1.220  6.106   .000   .023       43.570
GRAD_SCH    -2.091   1.352       -.180  -1.546  .123   .067       14.865
COL_GRAD    -2.968   1.018       -.339  -2.916  .004   .068       14.779
SOME_COL    -.760    .811        -.057  -.938   .349   .246       4.065

a Dependent Variable: API00

Collinearity Diagnostics(a)

Model 1                                 Variance Proportions
Dimension  Eigenvalue  Condition Index  (Constant)  ACS_K3  AVG_ED  GRAD_SCH  COL_GRAD  SOME_COL
1          5.013       1.000            .00         .00     .00     .00       .00       .00
2          .589        2.918            .00         .00     .00     .05       .00       .01
3          .253        4.455            .00         .00     .00     .03       .07       .02
4          .142        5.940            .00         .01     .00     .00       .00       .23
5          .0028       42.036           .22         .86     .14     .10       .15       .09
6          .0012       65.887           .77         .13     .86     .81       .77       .66
a Dependent Variable: API00
Let's omit one of the parent education variables, avg_ed. Note that the VIF values in
the analysis below appear much better. Also, note how the standard errors are reduced
for the parent education variables, grad_sch and col_grad. This is because the high
degree of collinearity caused the standard errors to be inflated. With the
multicollinearity eliminated, the coefficient for grad_sch, which had been non-
significant, is now significant.
regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 grad_sch col_grad some_col.
<some output omitted to save space>
Coefficients(a)
Model 1     B        Std. Error  Beta  t       Sig.   Tolerance  VIF
(Constant)  283.745  70.325            4.035   .000
ACS_K3      11.713   3.665       .113  3.196   .002   .977       1.024
GRAD_SCH    5.635    .458        .482  12.298  .000   .792       1.262
COL_GRAD    2.480    .340        .288  7.303   .000   .783       1.278
SOME_COL    2.158    .444        .173  4.862   .000   .967       1.034
a Dependent Variable: API00

Collinearity Diagnostics(a)

Model 1                                 Variance Proportions
Dimension  Eigenvalue  Condition Index  (Constant)  ACS_K3  GRAD_SCH  COL_GRAD  SOME_COL
1          3.970       1.000            .00         .00     .02       .02       .01
2          .599        2.575            .00         .00     .60       .03       .04
3          .255        3.945            .00         .00     .37       .94       .03
4          .174        4.778            .00         .00     .00       .00       .92
5          .0025       39.925           .99         .99     .01       .01       .00
a Dependent Variable: API00
7.3.3.4. Tests for Nonlinearity
When we do linear regression, we assume that the relationship between the response
variable and the predictors is linear. If this assumption is violated, the linear regression
will try to fit a straight line to data that do not follow a straight line. Checking the
linearity assumption in the case of simple regression is straightforward, since we only
have one predictor: all we have to do is make a scatter plot of the response variable
against the predictor to see whether nonlinearity is present, such as a curved band or a
big wave-shaped curve. For example, let us use a data file called nations.sav that has
data about a number of nations around the world. Let's look at the relationship between
GNP per capita (gnpcap) and the birth rate (birth). If we look at the scatter plot
between gnpcap and birth below, we can see that the relationship between these two
variables is quite non-linear. We added a regression line to the chart by double clicking
on it and choosing "Chart", then "Options", and then "Fit Line Total", and you can see
how poorly the line fits these data. Also, if we look at the residuals by predicted values,
we see that the residuals are not homoscedastic, due to the non-linearity in the
relationship between gnpcap and birth.

get file = "c:\spssreg\nations.sav".
regression
/dependent birth
/method=enter gnpcap
/scatterplot(*zresid *pred)
/scat(birth gnpcap).

Variables Entered/Removed(b)
Model  Variables Entered  Variables Removed  Method
1      GNPCAP(a)          .                  Enter
a All requested variables entered.
b Dependent Variable: BIRTH

Model Summary(b)
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .626(a) .392 .387 10.679
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH

ANOVA(b)
Model 1     Sum of Squares  df   Mean Square  F       Sig.
Regression  7873.995        1    7873.995     69.047  .000(a)
Residual    12202.152       107  114.039
Total       20076.147       108
a Predictors: (Constant), GNPCAP
b Dependent Variable: BIRTH

Coefficients(a)

Model 1     B           Std. Error  Beta   t       Sig.
(Constant)  38.924      1.261              30.856  .000
GNPCAP      -1.921E-03  .000        -.626  -8.309  .000

a Dependent Variable: BIRTH

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 1.90 38.71 32.79 8.539 109


Residual -23.18 28.10 .00 10.629 109
Std. Predicted Value -3.618 .694 .000 1.000 109
Std. Residual -2.170 2.632 .000 .995 109
a Dependent Variable: BIRTH

We modified the above scatter plot, changing the fit line from linear regression to
"lowess" by choosing "Chart", then "Options", then "Fit Options", and choosing
"Lowess" with the default smoothing parameters. As you can see, the "lowess"
smoothed curve fits substantially better than the linear regression, further suggesting
that the relationship between gnpcap and birth is not linear.

We can see that the gnpcap scores are quite skewed, with most values near 0 and a
handful of values of 10,000 and higher. This suggests that some transformation of the
variable may be necessary. One commonly used transformation is a log transformation,
so let's try that. As you see, the scatter plot between lgnpcap and birth looks much
better, with the regression line going through the heart of the data. Also, the plot of the
residuals by predicted values looks much more reasonable.
compute lgnpcap = ln(gnpcap).
regression
/dependent birth
/method=enter lgnpcap
/scatterplot(*zresid *pred) /scat(birth lgnpcap)
/save resid(bres2).
Variables Entered/Removed(b)
Model Variables Entered Variables Removed Method
1 LGNPCAP(a) . Enter
a All requested variables entered.
b Dependent Variable: BIRTH

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .756(a) .571 .567 8.969

a Predictors: (Constant), LGNPCAP

b Dependent Variable: BIRTH

ANOVA(b)

Model 1     Sum of Squares  df   Mean Square  F        Sig.
Regression  11469.248       1    11469.248    142.584  .000(a)
Residual    8606.899        107  80.438
Total       20076.147       108

a Predictors: (Constant), LGNPCAP
b Dependent Variable: BIRTH

Coefficients(a)

Model 1     B       Std. Error  Beta   t        Sig.
(Constant)  84.277  4.397              19.168   .000
LGNPCAP     -7.238  .606        -.756  -11.941  .000

a Dependent Variable: BIRTH

Residuals Statistics(a)
                      Minimum  Maximum  Mean   Std. Deviation  N
Predicted Value       12.86    50.25    32.79  10.305          109
Residual              -24.75   24.98    .00    8.927           109
Std. Predicted Value  -1.934   1.695    .000   1.000           109
Std. Residual         -2.760   2.786    .000   .995            109
a Dependent Variable: BIRTH

This section has shown how you can use scatter plots to diagnose problems of non-
linearity, both by looking at the scatter plot of the predictor and outcome variable and
by examining the residuals by predicted values. These examples have focused on
simple regression; similar techniques are useful in multiple regression, although there it
is more useful to examine partial-regression plots instead of the simple scatter plots
between the predictor variables and the outcome variable, as sketched below.
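
For instance, returning to the elemapi2 model from earlier in this chapter, the
/partialplot subcommand would produce one adjusted scatter plot per predictor:

regression
/dependent api00
/method=enter meals ell emer
/partialplot.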

7.3.3.5. Model Specification

A model specification error can occur when one or more relevant variables are omitted
from the model or one or more irrelevant variables are included in the model. If
relevant variables are omitted from the model, the common variance they share with
included variables may be wrongly attributed to those variables, and the error term can
be inflated. On the other hand, if irrelevant variables are included in the model, the
common variance they share with included variables may be wrongly attributed to
them. Model specification errors can substantially affect the estimate of regression
coefficients.

Consider the model below. This regression suggests that as class size increases,
academic performance increases, with p = 0.053. Before we publish results saying that
increased class size is associated with higher academic performance, let's check the
model specification.

regression
/dependent api00
/method=enter acs_k3 full
/save pred(apipred).

<some output deleted to save space>

Coefficients(a)

Model 1     B       Std. Error  Beta  t       Sig.
(Constant)  32.213  84.075            .383    .702
ACS_K3      8.356   4.303       .080  1.942   .053
FULL        5.390   .396        .564  13.598  .000

a Dependent Variable: API00

SPSS does not have any tools that directly support the detection of specification errors;
however, you can check for omitted variables by using the procedure below. As you
noticed above, when we ran the regression we saved the predicted value, calling it
apipred. If we use the predicted value and the predicted value squared as predictors of
the dependent variable, apipred should be significant, since it is the predicted value, but
apipred squared shouldn't be a significant predictor, because if our model is specified
correctly the squared predictions should not have much explanatory power above
and beyond the predicted value. That is, we wouldn't expect apipred squared to be a
significant predictor if our model is specified correctly. Below we compute apipred2 as
the squared value of apipred and then include apipred and apipred2 as predictors in
our regression model, and we hope to find that apipred2 is not significant.

compute apipred2 = apipred**2.
regression
/dependent api00
/method=enter apipred apipred2.
<some output omitted to save space>

Coefficients(a)
Model 1     B          Std. Error  Beta    t       Sig.
(Constant)  858.873    283.460             3.030   .003
APIPRED     -1.869     .937        -1.088  -1.994  .047
APIPRED2    2.344E-03  .001        1.674   3.070   .002
a Dependent Variable: API00
The above results show that apipred2 is significant, suggesting that we may have
omitted important variables from our regression. We should therefore consider whether
to add any other variables to our model. Let's try adding the variable meals to the
above model. We see that meals is a significant predictor, and we save the predicted
value, calling it preda, for use in the next analysis to test whether we have any
additional important omitted variables.
regression
/dependent api00
/method=enter acs_k3 full meals
/save pred(preda).

<some output omitted to save space>

Coefficients(a)
Model 1     B        Std. Error  Beta   t        Sig.
(Constant)  771.658  48.861              15.793   .000
ACS_K3      -.717    2.239       -.007  -.320    .749
FULL        1.327    .239        .139   5.556    .000
MEALS       -3.686   .112        -.828  -32.978  .000
a Dependent Variable: API00

We now create preda2, which is the square of preda, and include both of these as
predictors in our model.
compute preda2 = preda**2.
regression
/dependent api00
/method=enter preda preda2.
<some output omitted to save space>

Coefficients(a)
Model 1     B           Std. Error  Beta   t       Sig.
(Constant)  -136.510    95.059             -1.436  .152
PREDA       1.424       .293        1.293  4.869   .000
PREDA2      -3.172E-04  .000        -.386  -1.455  .146
a Dependent Variable: API00
We now see that preda2 is not significant, so this test does not suggest there are any
other important omitted variables. Note that after including meals and full, the
coefficient for class size is no longer significant. While acs_k3 has a positive
relationship with api00 when only full is included in the model, when we also
include (and hence control for) meals, acs_k3 is no longer significantly related to
api00, and its relationship with api00 is no longer positive.

7.3.3.6. Issues of Independence


The statement of this assumption is that the errors associated with one observation are
not correlated with the errors of any other observation. Violation of this assumption can
occur in a variety of situations. Consider the case of collecting data from students in
eight different elementary schools. It is likely that the students within each school will
tend to be more like one another than students from different schools; that is, their
errors are not independent.

Another way in which the assumption of independence can be broken is when data are
collected on the same variables over time. Let's say that we collect truancy data every
semester for 12 years. In this situation it is likely that the errors for observations in
adjacent semesters will be more highly correlated than for observations more separated
in time; this is known as autocorrelation. When you have data that can be considered
time-series, you can use the Durbin-Watson statistic to test for correlated residuals.

We don't have any time-series data, so we will use the elemapi2 dataset and pretend
that snum indicates the time at which the data were collected. We will sort the data on
snum to order the data according to our fake time variable, and then we can run the
regression analysis with the durbin option to request the Durbin-Watson test. The
Durbin-Watson statistic ranges from 0 to 4, with a midpoint of 2 indicating no
autocorrelation. The observed value in our example (1.351) is less than 2; since our
data are not truly time-series, we would not read much into this.
sort cases by snum .
regression
/dependent api00
/method=enter enroll
/residuals = durbin .
Model Summary(b)

Model  R     R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .318  .101      .099               135.026                     1.351

a Predictors: (Constant), ENROLL
b Dependent Variable: API00

7.3.3.7. Summary
This chapter has covered a variety of topics in assessing the assumptions of regression
using SPSS, and the consequences of violating these assumptions. As we have seen, it
is not sufficient simply to run a regression analysis; it is important to verify that the
assumptions have been met. If this verification stage is omitted and your data do not
meet the assumptions of linear regression, your results could be misleading and your
interpretation of them could be in doubt. Without thoroughly checking your data for
problems, it is possible that another researcher could analyze your data, uncover such
problems, and present an improved analysis that contradicts your results and
undermines your conclusions.

7.3.4. Regression Diagnostics in Minitab

Partial Regression Plots

Minitab does not explicitly produce partial regression plots. Fortunately, they can be
created easily (if tediously, for large models); a sketch in session commands follows the
list:

(1) regress Y on all Xs except Xk, saving the residuals
(2) regress Xk on all the other Xs, saving the residuals
(3) plot the residuals from (1) against those from (2)
(4) repeat (1)–(3) for each X.

Nothing is needed from these regressions but the residuals, so you can use the
Results… window to tell Minitab not to print any output.
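
A sketch in Minitab session commands (the column assignments are our own: suppose
Y is in C1, X1–X3 are in C2–C4, and we want the partial regression plot for X3):

# regress Y on all Xs except X3, storing residuals in C10
REGRESS C1 2 C2 C3;
  RESIDUALS C10.
# regress X3 on the other Xs, storing residuals in C11
REGRESS C4 2 C2 C3;
  RESIDUALS C11.
# plot the two sets of residuals against each other
PLOT C10*C11.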
Residuals
Minitab produces three forms of residuals (all obtained on the Storage… window):

Minitab name            Text name                      Definition
Residuals               Residuals                      the raw residual (observed minus fitted value)
Standardized residuals  Studentized residuals          the residual divided by its standard error
Deleted t residuals     Studentized deleted residuals  the deleted residual divided by its standard
                                                       error, where the residual for the ith case is
                                                       computed from a fit with that case deleted

In the Graphs… window in the regression procedure, these three kinds of residuals are
called Regular, Standardized, and Deleted, respectively. The standardized residuals are
what Minitab uses to flag unusually large residuals (any observations with standardized
residual greater than 2 in absolute value).

Leverage and Influence

The Storage… window of the regression procedure provides three measures of
leverage and influence:

• Hi (leverages) [h_ii in the text's notation]
• Cook's distance [D_i]
• DFITS [(DFFITS)_i]

DFBETAS
Minitab does not explicitly produce the DFBETAS statistics of influence on particular
coefficients. DFBETAS can be calculated for a particular suspect observation i (perhaps
flagged by the preceding measures) and coefficient k as follows:

(1) from the regression on the full data set, obtain the coefficient b_k and the matrix
(X'X)^-1 (this is "X'X inverse" in the Storage… window)
(2) from the regression without observation i, obtain the coefficient b_k(i) and the
residual standard deviation s(i)
(3) find c_kk, the kth diagonal element of (X'X)^-1
(4) calculate DFBETAS_k(i) = (b_k - b_k(i)) / (s(i) * sqrt(c_kk)).

Multicollinearity Diagnostics — Variance Inflation Factor


Variance Inflation Factors can be requested on the Options… window of the regression
procedure (check box for Variance inflation factors, under Display).

REFERENCES

• Minitab Inc. (2007). Meet Minitab 15.
• Landau, S. and Everitt, B. S. (2004). A Handbook of Statistical Analyses Using
SPSS. Chapman & Hall/CRC.
• Bandi, F. (2009). Introduction to Minitab.
• SPSS Inc. SPSS Base User's Guide.
• Freund, J. E. and Walpole, R. E. (1980). Mathematical Statistics.
• Mann, P. S. (2006). Introductory Statistics.
• Krzanowski, W. J. (1998). An Introduction to Statistical Modelling.
• Puri, B. K. (1996). Statistics in Practice: An Illustrated Guide to SPSS. Oxford
University Press.
• Gupta, C. B. and Gupta, V. (2004). An Introduction to Statistical Methods. Vikas
Publishing House, India.
