DATA ANALYSIS GUIDE-SPSS

When conducting any statistical analysis, you need to get familiar with your data and
perform an examination of it in order to lessen the odds of having biased results that can
make all of your hard work essentially meaningless or substantially weak.
Getting to Know SPSS
When you run a procedure in SPSS, such as frequencies, you need to select the variables
in the dialog box. On the left side of the dialog box, you see the list of variables in your
data file. You click a variable name to select it, and then click the right-arrow button to
move the variable into the Variable(s) list.
TIP 1
You can change the appearance of the variables so that they appear as variable names
rather than variable labels (see above), which is the default option. You can also make
the variables appear alphabetical. I recommend switching to the variable names option
and having them listed alphabetically so that you can more easily find the variables of
interest to you. By selecting this method, you can type the first letter of the name of the
variable that you want in the variable display section of the dialog box and SPSS will
jump to the first variable that starts with that letter, and every subsequent variable that
starts with that letter as well.
Directions: Pull down the Edit menu, select Options, select the General tab,
and under Variable Lists select Display names and Alphabetical.
TIP 2 (What is this variable?)
In case you forget the label that you gave to variables when you go to the dialog box,
such as the frequency dialog box above, highlight the variable that you are interested in
and click the right mouse button. This will provide a pop-up window that offers the
Variable Information section. In other words, if you were presented with the frequency
table above, you might highlight "size of the company (size)," click on the right mouse
button and select variable information. This action provides you with the name of the
variable, its label, measurement setting (e.g. ordinal), and value labels for the variable
(i.e. categories).
TIP 3 (What is this statistic?)
If you are unsure of what a particular statistic is used for, then highlight the particular
item, right-click on the selected statistic (e.g. mean) and you will receive a brief
description of what the statistics available in the dialog box provide. If the statistic is
one that seems useful to you, then you select it by placing a check in the available box
next to the statistic. Then click OK and you will return to the main Frequencies box.
TIP 4 (What is in this output?)
To obtain help on the output screen (i.e. the SPSS Viewer), you need to double-click a pivot
table in order to activate it so that you can make modifications. When activated, it will
appear to have "railroad track" lines surrounding it. See Table a below.
Favor or Oppose Death Penalty for Murder

                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor              1074      71.6            77.4                 77.4
          Oppose              314      20.9            22.6                100.0
          Total              1388      92.5           100.0
Missing   DK                  106       7.1
          NA                    6        .4
          Total               112       7.5
Total                        1500     100.0

Once you have activated the pivot table, you should right-click a row or
column header, such as the column labeled valid percent, for a pop-up menu.
Choose What's This? This will bring up a pop-up window explaining what the
particular column or row is addressing. If you forget to activate the pivot table and
simply right-click on a column or row, you will get the following message: "Displays
output. Click once to select an object (for example, so that you can copy it to the
clipboard). Double-click to activate an object for editing. If the object is a pivot table,
you can obtain detailed help on items within the table by right-clicking on row and
column labels after the table is activated."
If you want more than a pop-up delivers, choose Results Coach from the list instead
of What's This? Essentially, this will take you through a subsection of the SPSS
tutorial.
Dressing Up Your Output
Changing Text
Activate the pivot table in the SPSS Viewer (see Output screen) and then double-
click on the text that you wish to change. Enter the new text and then follow the
same procedure as needed. If you wish to get rid of the title, then you can select the
title and then hit the delete button on your keyboard and you will obtain a table like
the one below.
Changing the Number of Decimal Places
Select the cell entries with too many (or too few) decimal places. From the Format
menu, choose Cell Properties, select the number of decimal places that you want and
click OK.
Showing/Hiding Cells
Activate the table and then select the row or column you wish by using Ctrl-Alt-
click in the column heading or row label. From the View menu, choose Hide. To
resurrect the hidden cells at a later time, activate the table and then from the View menu,
choose Show All. If you're sure that you never want to resurrect the information,
then you can simply delete the cells and they will be permanently removed. See Table a
above, which shows a table with unhidden columns. See Table b for an example of a
table with a hidden valid percent column.
Table b
Favor or Oppose Death Penalty for Murder

                        Frequency   Percent   Cumulative Percent
Valid     Favor              1074      71.6                 77.4
          Oppose              314      20.9                100.0
          Total              1388      92.5
Missing   DK                  106       7.1
          NA                    6        .4
          Total               112       7.5
Total                        1500     100.0

Rearranging the rows, columns, and layers
Activate the pivot table, and from the Pivot menu, choose Pivoting Trays. A
schematic representation of a pivot table appears with 3 areas (trays) labeled layer, row,
and column. Colored icons in these trays represent the contents of the table, one for each
variable and one for statistics. Place your mouse pointer over one of them to see what
it represents. If you wish to change the structure of the table, you can drag
an icon and the table will rearrange itself. See Table b above for a pre-modification
version of the table and Table c below for a post-modification version.
Table c = post-modification version of Table b
Favor or Oppose Death Penalty for Murder

Valid     Favor     Frequency              1074
                    Percent                71.6
                    Cumulative Percent     77.4
          Oppose    Frequency               314
                    Percent                20.9
                    Cumulative Percent    100.0
          Total     Frequency              1388
                    Percent                92.5
Missing   DK        Frequency               106
                    Percent                 7.1
          NA        Frequency                 6
                    Percent                  .4
          Total     Frequency               112
                    Percent                 7.5
Total               Frequency              1500
                    Percent               100.0

Editing Your Charts
Double-click the chart in the Viewer to open it in a new Chart Editor window. Note: To
access some chart editing capabilities, such as identifying points on a scatterplot or
changing the width of bars in a histogram, you must click an element of the chart to select
it. For example, you must click any point in a scatterplot or any bar in a histogram
or bar chart. You can change labels (double-click any text and substitute your
own), create reference lines, and change colors, line types and sizes. When you close
the Chart Editor window, the original chart in the Viewer updates to show any changes
that you made.
Using Syntax
You should ALWAYS use syntax when running statistical analyses. There are 2 ways
that you can do this. You can select the Paste button when you run a statistical analysis
using one of the dialog boxes, such as that for frequencies. However, when you use
the paste function, you have to remember to go to the newly created syntax window
(or one that you created in a previous session) and highlight the commands if you
wish the analysis to actually run.
The other method is to open a new syntax file so that you can type in any
commentary and syntax or copy and paste from an already existing syntax file. The
syntax below is what you would receive if you did a paste command in SPSS after using
a dialog box, such as that for frequencies. You would also receive this command if
you did a copy and paste of prior commands in an already existing syntax file or a newly
created one.
FREQUENCIES
 VARIABLES=cappun
 /PIECHART PERCENT
 /ORDER= ANALYSIS .
Whichever method you use to create syntax, you MUST always type in commentary that
explains what the command does. This ensures that you have a way of checking back to
see the methodology that you used and the steps that were taken when you conducted
your analysis. This is useful in case something goes wrong and you need to make
corrections, and it provides you and others with a guide for how the analyses occurred
in case replications need to be done. Commentary should be written in the following way
when dealing with commands:
*frequencies of attitudes toward capital punishment and gun laws.*
Notice that there are asterisks at either end and that a period (.) is just before the closing
asterisk. This tells the computer that this is not command text, so that while the computer
may highlight it during a run of the analysis, it will not view it as command text. If you
were going to combine the commentary and the command syntax in a syntax file, it
would appear as you see it below.
*frequencies of attitudes toward capital punishment and gun laws.*
FREQUENCIES
 VARIABLES=cappun
 /PIECHART PERCENT
 /ORDER= ANALYSIS .
In addition, you MUST keep a Log of the analyses that you run, which will appear in the
output (SPSS Viewer) file. To do this, you need to go to Edit, then Options, and select the
Viewer tab. Under that tab, be sure that initial output state has "log" listed in the pull-
down list and that "display commands in the log" is checked. This ensures that the
program enters the text of any analysis that you do right before it
displays the results of the analysis, which is another way to let yourself and others know
what type of analysis you did and to evaluate whether it is the appropriate analysis and
whether it has been done properly in that case. See the information just below this text
for a sample.
FREQUENCIES
VARIABLES=cappun
/ORDER= ANALYSIS .
Frequencies

Statistics
Favor or Oppose Death Penalty for Murder
N    Valid      1388
     Missing     112

Favor or Oppose Death Penalty for Murder

                        Frequency   Percent   Cumulative Percent
Valid     Favor              1074      71.6                 77.4
          Oppose              314      20.9                100.0
          Total              1388      92.5
Missing   DK                  106       7.1
          NA                    6        .4
          Total               112       7.5
Total                        1500     100.0

Introducing Data
Typing your own data
If your data aren't already in a computer-readable SPSS format, you can enter the
information directly into the SPSS Data Editor. From the menus, choose File, then New,
then Data, which opens the Data Editor in Data View. If you type a number into the
first cell, SPSS will label that column with the variable name VAR00001. To create
your own variable names, click the Variable View tab.
Assigning variable names and properties
In the Name column, enter a unique name for each variable in the order in which
you want to enter the variables. The name must start with a letter, but the remaining
part of the name can be letters or digits. A name can't end with a period, contain
blanks or special characters, or be longer than 64 characters.
Assigning Descriptive Labels
Variable Labels: Assign descriptive text to a variable by clicking the cell and
then entering the label. For instance, for the variable "cappun" the label says
"favor or oppose death penalty for murder."
Value Labels: To label individual values, click the button in the Values column.
This opens its dialog box. For cappun, the values are coded 1 = favor, 2 = oppose.
The sequence of operations is to: enter the value, enter its label, click Add, and
repeat this process for each value.
o Note: Labels for individual values are useful only for variables with a
limited number of categories whose codes aren't self-explanatory. You
don't want to attach value labels to individual ages; however, you should
label the missing value codes for all variables if you use more than one
code.
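The same labels can also be assigned from a syntax window, which keeps a record of what you did. The lines below are a minimal sketch using the cappun variable and the 1 = favor, 2 = oppose coding described above; the comments use the standard form of an asterisk at the start and a period at the end.

```spss
* Attach a descriptive label to the variable cappun.
VARIABLE LABELS cappun 'Favor or oppose death penalty for murder'.

* Attach a label to each coded value of cappun.
VALUE LABELS cappun
  1 'Favor'
  2 'Oppose'.
```

Highlight the commands and run them, then check the Variable View to confirm that the labels appear as you expect.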
Assigning Missing Values
To indicate which codes were used for each variable when information is not available,
click in the Missing column, and assign missing values. Cases with these codes will be
treated differently during statistical analysis. If you don't assign codes for missing
values, even nonsensical values are accepted. A value of -1 for age would be considered
a real age. The missing-value codes that you assign to a variable are called user-
missing values. System-missing values are assigned by SPSS to any blank numeric cell
in the Data Editor or to any calculated value that is not defined. A system-missing value
is indicated with a period (.).
Note: You can't assign missing values to a string variable that is more than 8
characters in width. For string variables, uppercase and lowercase letters are
treated as distinct characters. This means that if you use the code NA (not
available) as a missing value code, entries coded as na will not be treated as
missing. Also, if a string variable is 3 characters wide and the missing value code
is only 2 characters wide, the placement of the two characters in the field of 3
affects what's considered missing. Blanks at the end of the field (trailing blanks)
are ignored in missing-value specifications.
Warning: DON'T use a blank space as a missing value. Use a specific number
or character to signify that I looked for this value and I don't know what it is.
DON'T use missing-value codes that are between the smallest and largest valid
values, even if these particular codes don't occur in the data.
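Missing-value codes can be declared in syntax as well. The sketch below assumes that DK (don't know) was entered as 8 and NA (no answer) as 9 for cappun; this guide does not state the actual codes, so substitute whatever codes your own data file uses.

```spss
* Label the missing-value codes (8 and 9 are assumed codes, not taken from the guide).
ADD VALUE LABELS cappun
  8 'DK'
  9 'NA'.

* Declare both codes as user-missing so they are left out of valid percentages.
MISSING VALUES cappun (8, 9).
```

Both assumed codes lie outside the valid range of 1 to 2, which follows the warning above about not placing missing codes between valid values.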
Assigning Levels of Measurement
Click in a cell in the Measure column to assign a level of measurement to each variable.
You have 3 choices: nominal, ordinal, and scale.
Warning 1: If you don't specify the scale, SPSS attempts to divine it based on
characteristics of the data, but its judgment in this matter is fallible. For example,
string variables are always designated as nominal. In some procedures, SPSS
uses different icons for the 3 types of variables. The scale on which a variable is
measured doesn't necessarily dictate the appropriate statistical analysis for a
variable. For example, an ID number assigned to subjects in an experiment is
usually classified as a nominal variable. If the numbers are assigned sequentially,
however, they can be plotted on a scale to see if subject responses change with
time. Velleman and Wilkinson (1993) discuss the problems associated with
stereotyping variables.
Warning 2: Although SPSS assigns a level of measurement to each variable, this
information is seldom used to guide you. SPSS will let you calculate means for
nominal variables as long as they have numeric values. Certain statistical
procedures don't allow string variables in particular fields in the dialog boxes.
For example, you can't calculate the mean of a string variable.
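If you would rather set the level of measurement in syntax than click through the Measure column, something like the following should work. It assumes a data file that contains cappun, sex, and a numeric age variable; age is used here only as an illustration.

```spss
* Declare cappun and sex as nominal and age as scale.
VARIABLE LEVEL cappun sex (NOMINAL)
  /age (SCALE).
```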
Saving the Data File
You MUST always save your data periodically so that you don't have to start from
scratch if anything goes wrong. You can also include text information in an SPSS data
file by choosing Utilities and Data File Comments, which will appear in the syntax
screen. Anyone using the file can read the text associated with it. You can also elect to
have the comments displayed in the output. This is similar to what you would do with
your own inclusion of comments explaining what steps you are taking in your data analysis.
I recommend the other way because you will already be in the syntax rather than having
to switch back and forth, but this is a possible option.
* Data File Comments.
PRESERVE.
SET PRINT OFF.
ADD DOCUMENT
 'test of data file comments'.
RESTORE.
Selecting Cases for Analyses
If you wish to perform analyses on a subset of your cases, this command is
invaluable. For instance, consider that you want to examine gender differences in
support for or opposition to capital punishment. Choose Select Cases from the Data
menu and all analyses will be restricted to the cases that meet the criteria you
specified. After choosing Select Cases, choose "If condition is satisfied" and also
click on the "If" button. This will take you to a dialog box that allows you to complete
the command syntax necessary to carry out the procedure. In the example that I gave
you, males are coded 1 and females are coded 2. I am
interested in calculating results separately for both groups. Therefore, I click on sex
under the variable list and use the arrow to put it in the box allocated for formulas.
Once this variable has been transferred, I click on the = sign on the calculator
provided and then on "1" so that I inform the computer that I am only interested
in selecting cases for males. Then I hit Continue and go back to the original Select
Cases dialog box where I can choose "unselected cases are filtered" or "unselected
cases are deleted." If you wish to keep both males and females in the dataset, but you
want to conduct separate analyses for each group, you want to choose "filtered." If
you wish to get rid of those cases that don't meet the criterion, i.e. you want to delete
the females from the data set permanently, you want to choose "deleted." If you
look at the Data Editor when Select Cases is in effect, you'll see lines through the
cases that did not meet the selection criteria (only for filtering of cases). They won't
be included in any statistical analysis or graphical procedures.
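If you paste the "filtered" option instead of running it, the syntax should look roughly like the sketch below. It uses the coding described above (males = 1); the name filter_$ is the one SPSS normally generates for its filter variable, but treat the exact pasted text as something to verify in your own version.

```spss
* Keep every case in the file but analyze males only (sex = 1).
USE ALL.
COMPUTE filter_$ = (sex = 1).
FILTER BY filter_$.
EXECUTE.

* Turn the filter off so that later analyses use all cases again.
FILTER OFF.
USE ALL.
EXECUTE.
```

The "deleted" option corresponds to a SELECT IF command instead of FILTER. Because that removes the unselected cases from the working file, save a copy of the full data file first.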
Repeating the Analysis for Different Groups of Cases
If you want to perform the same analysis for several groups of cases, choose Split File
from the Data menu. A separate analysis is done for each combination of values of the
variables specified in the Split File dialog box.
SORT CASES BY sex .
SPLIT FILE
 LAYERED BY sex .
Frequencies

Statistics
Favor or Oppose Death Penalty for Murder
Male      N    Valid       607
               Missing      34
Female    N    Valid       781
               Missing      78

Favor or Oppose Death Penalty for Murder

Respondent's Sex               Frequency   Percent   Valid Percent   Cumulative Percent
Male     Valid     Favor             502      78.3            82.7                 82.7
                   Oppose            105      16.4            17.3                100.0
                   Total             607      94.7           100.0
         Missing   DK                 34       5.3
         Total                       641     100.0
Female   Valid     Favor             572      66.6            73.2                 73.2
                   Oppose            209      24.3            26.8                100.0
                   Total             781      90.9           100.0
         Missing   DK                 72       8.4
                   NA                  6        .7
                   Total              78       9.1
         Total                       859     100.0

You can also select how you want the output displayed: all output for each subgroup
together or the same output for each subgroup together.
SORT CASES BY sex .
SPLIT FILE
 SEPARATE BY sex .
Frequencies

Respondent's Sex = Male

Statistics(a)
Favor or Oppose Death Penalty for Murder
N    Valid       607
     Missing      34
a. Respondent's Sex = Male

Favor or Oppose Death Penalty for Murder(a)

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor             502      78.3            82.7                 82.7
          Oppose            105      16.4            17.3                100.0
          Total             607      94.7           100.0
Missing   DK                 34       5.3
Total                       641     100.0
a. Respondent's Sex = Male

Respondent's Sex = Female

Statistics(a)
Favor or Oppose Death Penalty for Murder
N    Valid       781
     Missing      78
a. Respondent's Sex = Female

Favor or Oppose Death Penalty for Murder(a)

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Favor             572      66.6            73.2                 73.2
          Oppose            209      24.3            26.8                100.0
          Total             781      90.9           100.0
Missing   DK                 72       8.4
          NA                  6        .7
          Total              78       9.1
Total                       859     100.0
a. Respondent's Sex = Female

Preparing Your Data
Checking Variable Definitions
Using the Utilities Menu
Choose Utilities and then Variables to get data-definition information for each variable
in your data file. Make sure that all of your missing-value codes are correctly identified.
TIP 5
If you click Go To, you find yourself in the column of the Data Editor for the
selected variable if the Data Editor is in Data View. To edit the variable information
from the Data Editor in Data View, double-click the variable name at the top of that
column. This takes you to the Variable View for that variable.
TIP 6
To get a listing of the information for all of the variables without having to select the
variables individually, choose File, then Display Data File Information, then Working
File. This lists variable information for the whole data file. The disadvantage is that
you can't quickly go back to the Data Editor to fix mistakes. An advantage is that
codes that are defined as missing are identified, so it's easier to check the labels.
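The same listing can be requested from a syntax window, which also leaves a record in your log. This is a minimal sketch; the second command assumes that the file contains variables named cappun and sex.

```spss
* List the names, labels, value labels, and missing-value codes for every variable.
DISPLAY DICTIONARY.

* Restrict the listing to the variables you are about to analyze.
DISPLAY DICTIONARY
  /VARIABLES=cappun sex.
```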
Checking Your Case Count
Eliminating Duplicate Cases
If you have entered your own data, it is possible that you will enter the same case twice
or even more. To oust any duplicates, choose Data, then Identify Duplicate Cases. If
you entered a supposedly unique ID variable for each case, move the name of that
ID variable into the Define Matching Cases By list. If it takes more than one
variable to guarantee uniqueness [for example, college and student ID], move all of
these variables into the list. When you click OK, SPSS checks the file for cases that
have duplicate values of the ID variables.
TIP 7
DON'T automatically discard cases with the same ID number unless all of the other
values also match. It's possible that the problem is merely that a wrong ID number was
entered.
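For a record of the duplicate check in your syntax file, the sketch below flags every case that shares its ID with another case. It assumes the ID variable is named id; the flag variable names are made up for illustration. It is a simplified version of the idea behind the Identify Duplicate Cases dialog, not the exact syntax that the dialog pastes.

```spss
* Sort by the ID variable so that cases with the same ID sit next to each other.
SORT CASES BY id (A).

* Mark the first and last case within each group of identical IDs.
MATCH FILES
  /FILE=*
  /BY id
  /FIRST=first_case
  /LAST=last_case.

* A case with a unique ID is both first and last in its group, so anything else is a duplicate.
COMPUTE dup_flag = (first_case = 0 OR last_case = 0).
EXECUTE.

* Count how many cases share an ID with at least one other case.
FREQUENCIES
  VARIABLES=dup_flag.
```

As TIP 7 says, inspect the flagged cases before deleting anything, because a matching ID may simply be a typing error.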
Adding Missing Cases
Run any procedure and look at the count of the total cases processed. That's always the
first piece of output. Table d shows the summary from the Crosstabs procedure for sex
by cappun.
Table d
Crosstabs

Case Processing Summary
                                                        Cases
                                       Valid            Missing            Total
                                     N    Percent     N    Percent      N    Percent
Respondent's Sex * Favor or
Oppose Death Penalty for Murder   1388      92.5%   112       7.5%   1500     100.0%

You see that the data file has 1500 cases, but only 1388 have valid (nonmissing) values
for the sex and cappun variables. If the count isn't what you think it should be and if you
assigned sequential numbers to cases, you can look for missing ID numbers.
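A summary like the one in Table d is printed at the top of any Crosstabs run. A minimal sketch for the sex by cappun table:

```spss
* Cross-tabulate sex by cappun and check the Case Processing Summary at the top of the output.
CROSSTABS
  /TABLES=sex BY cappun
  /CELLS=COUNT.
```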
Checking Your Case Count
Warning: Data checking is not an excuse to get rid of data values that you don't like.
You are looking for values that are obviously in error and need to be corrected or
replaced with missing values. This is not the time to deal with unusual but correct data
points. You'll deal with those during the actual data analysis phase.
Making Frequency Tables
Use the frequency procedure to count the number of times each value of a variable occurs
in your data. For example, how many people in the GSS survey support capital
punishment? You can also graph this information using pie charts, bar charts, or
histograms.
TIP 8
You can acquire the information that you need for the descriptive statistics (e.g. mean,
minimum, maximum) through the frequency dialog box by selecting the Statistics button and
checking those statistics of interest to you.
To obtain your frequencies and descriptives, follow the instructions below.
Go to Analyze, scroll down to Descriptive Statistics and select Frequencies.
Click on the variables of interest and move them into the box for analysis
using the arrow shown. To obtain the descriptives, click on the Statistics button
in the frequency dialog box and select the mean, minimum, maximum, and
standard deviation boxes. Then click on Charts and decide whether you wish
to run a pie chart, bar chart, or histogram w/ normal curve. You
can only run one type of graph/chart at a time.
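Pasting those choices from the Frequencies dialog box should give syntax along the lines of the sketch below. The variable name tvhours (hours per day watching TV) is an assumption made for illustration; substitute your own scale variable.

```spss
* Frequency table, descriptive statistics, and a histogram with a normal curve.
FREQUENCIES
  VARIABLES=tvhours
  /STATISTICS=MEAN MINIMUM MAXIMUM STDDEV
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.
```

For a categorical variable such as cappun, you would swap the /HISTOGRAM line for /PIECHART PERCENT, as in the earlier example.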
When conducting frequency analyses, you want to consider the following questions when
reviewing the results presented in your output.
Are the codes that you used for missing values labeled as missing values in
the frequency table? If the codes are not labeled, go back to the Data Editor and
specify them as missing values.
Do the value labels correctly match the codes? For example, if you see that
50% of your customers are very dissatisfied with your product, make sure that
you haven't made a mistake in assigning the labels.
Are all of the values in the table possible? For example, if you asked the
number of times a person has been married and you see values of -2, you know
that's an error. Go back to the source and see if you can figure out what the
correct values are. If you can't, replace them with codes for missing values.
Are there values that are possible, but highly unlikely? For example, if you see
a subject who claims to own 11 toasters, you want to check whether the value is
correct. If the value is incorrect, you'll have to take that into account when
analyzing the data.
Are there unexpectedly large or small counts for any of the values? If you're
studying the relationship of highest educational degree to subscription to Web
services offered by your company and you see that no one in your sample has a
college degree, suspect problems.
TIP 9
To search for a particular data value for a variable, go to Data View, highlight the
column of the variable that you are interested in, choose Edit, then Find, and then
type in the value that you are interested in finding.
Looking At the Distribution of Values
For a scale variable with too many values for a frequency table (e.g. income in
dollars), you need different tools for checking the data values because counting
how often different values occur isn't useful anymore.
Are the smallest and largest values sensible?
You don't want to look solely at the single largest and single smallest values;
instead, you want to look at a certain % or # of cases with the largest and smallest
values. There are several ways to do this. The simplest, but most limited way, is
to choose:
o Analyze, then Descriptive Statistics, then either Descriptives or
Explore. Click Statistics in the Explore dialog box and select Outliers in
the Explore Statistics dialog box. You will receive a list of cases with the
5 smallest and the 5 largest values for a particular variable. Values that
are defined as missing aren't included, so if you see missing values in the
list, there's something wrong. Check the other values if they appear to be
unusual. (see Table e)
Table e (using the Explore command)
Extreme Values

                                          Case Number   Value
Hours Per Day       Highest   1                  1402      24
Watching TV                   2                   466      22
                              3                   300      20
                              4                  1360      20
                              5                   115      16(a)
                    Lowest    1                  1500       0
                              2                  1400       0
                              3                  1373       0
                              4                  1372       0
                              5                  1356       0(b)
a. Only a partial list of cases with the value 16 are shown in the table of upper extremes.
b. Only a partial list of cases with the value 0 are shown in the table of lower extremes.

You can see that one of the respondents claims to watch television 24 hours a day. You
know that's not correct. It's possible that he or she understood the question to mean how
many hours is the TV set on. When analyzing the TV variable, you'll have to decide
what to do with people who have reported impossible values. In Table e, you see that
there are only 4 cases with values of 16 hours or greater and then there is a gap until 12
hours. You might want to set values greater than 12 hours to 12 hours when analyzing
the data. This is similar to what many people do when dealing with a variable for "age."
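Both steps can be written as syntax. The sketch below assumes the variable is named tvhours and writes the capped values to a new variable, tvhours_cap (a made-up name), so that the original values are kept.

```spss
* List the 5 highest and 5 lowest values of tvhours with their case numbers.
EXAMINE VARIABLES=tvhours
  /PLOT=NONE
  /STATISTICS=EXTREME(5).

* Cap values above 12 hours at 12 in a new variable.
* Missing values are handled first so that high missing-value codes are not turned into 12.
RECODE tvhours
  (MISSING = SYSMIS)
  (12 THRU HI = 12)
  (ELSE = COPY)
  INTO tvhours_cap.
EXECUTE.
```

RECODE applies the first specification that a value matches, which is why the MISSING specification comes before the 12 THRU HI range.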
Is there anything strange about the distribution of values?
The next task is to examine the distribution of the values using histograms or
stem-and-leaf plots. Make a stem-and-leaf plot (for small data sets) or a
histogram of the data using either Graphs > Histogram or the Explore Plots dialog
box. You want to look for unusual patterns in your data. For example, look at the
histogram of ages in Table f. Ask yourself where all of the 30-year-olds have
gone? Why are there no people above the age of 90? Were there really no people
younger than 18 in the survey?
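In syntax, the two kinds of plot might be requested as sketched below, assuming the age variable is simply named age.

```spss
* Histogram of age with a normal curve superimposed.
GRAPH
  /HISTOGRAM(NORMAL)=age.

* Stem-and-leaf plot of age (best for small data sets), without the descriptive tables.
EXAMINE VARIABLES=age
  /PLOT=STEMLEAF
  /STATISTICS=NONE.
```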
Looking At the Distribution of Values
Are there logical impossibilities?
For example, if you have a data file of hospital admissions, you can make a
frequency table to count the reason for admission and the number of male and
female admissions. Looking at these tables, you may not notice anything strange.
However, if you look at these 2 variables together in a Crosstabs table, you may
uncover unusual events. For instance, you may find males giving birth to babies,
and women undergoing prostate surgery.
Sometimes, pairs of variables have values that must be ordered in a particular
way. For example, if you ask a woman her current age, her age at first marriage,
and the duration of her first marriage, you know that the current age must be
greater than or equal to the age at first marriage. You also know that the age at
first marriage plus the duration of first marriage cannot exceed the current age.
Start by looking at the simplest relationship: Is the age at first marriage less than
the current age? You can plot the two variables on a scatterplot and look for cases
that have unacceptable values. You know that all of the points must fall on or
above the identity line.
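A scatterplot for this check could be requested as in the sketch below. The variable names age (current age) and agewed (age at first marriage) are assumptions. Current age goes on the vertical axis so that valid cases fall on or above the identity line.

```spss
* Plot age at first marriage (horizontal axis) against current age (vertical axis).
GRAPH
  /SCATTERPLOT(BIVAR)=agewed WITH age.
```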
TIP 10
For large data files, the drawback to this approach is that it's tedious and prone to error.
A better way is to create a new variable that is the difference between the current age and
the age at first marriage. Then use Data, Select Cases to select cases with negative
values and Analyze, then Reports, then Case Summaries to list the pertinent
information. Once you've remedied the age problem, you can create a new variable
that is the sum of the age at first marriage and the duration of first marriage. You
can then find the difference between this sum and the current age. Reset the Select
Cases criteria and use Case Summaries to list cases with offending values.
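The first half of this tip can be sketched in syntax as follows. The names id, age, agewed, and agediff are hypothetical. The sketch also uses the LIST command in place of Case Summaries, because it is shorter to type; the information shown is the same.

```spss
* Difference between current age and age at first marriage.
COMPUTE agediff = age - agewed.

* Select the impossible cases for the next procedure only, then list them.
TEMPORARY.
SELECT IF (agediff < 0).
LIST VARIABLES=id age agewed agediff.
```

Because of TEMPORARY, the selection applies only to the LIST command, so there is nothing to reset afterward.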
Is there consistency?
For a survey, you often have questions that are conditional. For example, first
you ask "Do you have a car?" and then, if the answer is Yes, you ask insightful
questions about the car. You can make Crosstabs tables of the responses to the
main question with those to the subquestions. You have to decide how to deal
with any inconsistencies: do you impute answers to the main question, or do you
discard answers to subquestions? It's your call.
Is there agreement?
This refers to whether you have pairs of variables that convey similar information
in different ways. For example, you may have recorded both years of education
and highest degree earned. Or, you may have created a new variable that groups
age into 3 categories, such as less than 25, 25 to 50, and older than 50. Compare
the values of the 2 variables using Crosstabs. The table may be large, but it's easy
to check the correspondence between the 2 variables. You can also identify
problems by plotting the values of the 2 variables.
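As an illustration of the age example, the sketch below builds the grouped variable and then cross-tabulates it against the original. The names age and agegrp are assumptions.

```spss
* Group age into 3 categories in a new variable.
* The 25 to 50 range is listed before the open-ended ranges so that exactly 25 and exactly 50 fall in group 2.
RECODE age
  (MISSING = SYSMIS)
  (25 THRU 50 = 2)
  (LO THRU 25 = 1)
  (50 THRU HI = 3)
  INTO agegrp.
VALUE LABELS agegrp
  1 'Less than 25'
  2 '25 to 50'
  3 'Older than 50'.

* Each age should appear in exactly one column of the table.
CROSSTABS
  /TABLES=age BY agegrp
  /CELLS=COUNT.
```

If any row of the table has counts in more than one column, or an age sits in the wrong group, the grouping was specified incorrectly.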
Are there unusual combinations of values?
Identify any outliers so that you can make sure the values of these variables are
correct and make any necessary adjustments. What counts as an outlier depends
on the variables that are being considered together.
TIP 11
You can identify points in a scatterplot by specifying a variable in the Label Cases By
text box in the Scatterplot dialog box. Double-click the plot to activate it in the Chart
Editor. From the Elements menu, choose Data Label Mode or click on the Data
Label Mode icon on the toolbar. This changes your cursor to a black box. Click the
cursor over the point that you want identified by the value of the labeling variable.
To go to that case in the Data Editor, right-click on the point, and then left-click.
Make sure that the Data Editor is in Data View. To turn Data Label Mode off, click
on the Data Label Mode icon on the toolbar.
Transforming Your Data
Before you transform your data, be sure that you know the values of the variables that
you are interested in so that you know how to code the information. See earlier
instructions about how to use the Utilities menu to obtain information on the variables
either individually or for the entire working data file.
Computing a New Variable
If you want to perform the same calculation for all of the cases in your data file, the
transformation is called unconditional. If you want to perform different computations
based on the values of 1 or more variables, the transformation is conditional. For
example, if you compute an index differently for men and women, the transformation is
conditional. Both types of transformations can be performed in the Compute Variable
dialog box.
One Size Fits All: Unconditional Transformation
Choose Compute from the Transform menu to open the Compute Variable dialog box.
At the top left, assign a new name to the variable that you will be computing. To do
so, click in the Target Variable box and type in the desired name. You must follow the
same rules for assigning variable names as you did when naming variables in the Data
Editor. Also, don't forget to enter the information in the Type and Label tab in the dialog
box.
Warning: You MUST use a new variable name rather than one already in use. If you
reuse the same name and make a mistake specifying the transformation, you'll replace the
values of the original variable with values that you don't want. If you don't catch the
mistake right away, and you save the data file, the original values of the variable are lost.
SPSS will ask you for permission to proceed if you try to use an existing variable name.
To specify the formula for the calculations that you want to perform, either type directly
in the Numeric Expression text box or use the calculator pad. Each time you want to
refer to an existing variable, click it in the variable list and then click the arrow button.
The variable name will appear in the formula at the blinking insertion point. Once you
click OK, the variable is added to your data file as the last variable. However, remember
that you want to click the Paste button and then run the syntax command from the syntax
window so that you know what commands you specified. You also want to remember to
use commentary information above the pasted syntax in order to tell yourself and the
reviewer, in this case me, what you did to conduct your analysis.
TIP 12
Right-click your mouse on any button (except the Is) on the calculator pad or any
function for an explanation of what it means.
Using a Built-in Function
The function groups are located in the dialog box and can be used to perform your
calculations, if necessary. There are 7 main groups of functions: arithmetic, statistical,
string, date and time, distribution, random-variable, and missing-values. If you wish to
use a function, click it when the blinking insertion point is placed where you want to insert
the function into your formula, and then click the up arrow button. The function
will appear in your formula, but it will have question marks for the arguments. The
arguments of a function are the numbers or strings that it operates on. In the expression
SQRT(25), 25 is the sole argument of this function. Enter a value for the argument, or
double-click a variable to move it into the argument list. If there are more question-mark
arguments, select them in turn and enter a value, move a variable, or
otherwise supply whatever suits the needs of the function.
TIP 12
For detailed information about any function and its arguments, from the Help menu,
choose Topics, click the Index tab, and type the word functions. You can then select the
type of function that you want.
If and Then: Conditional Transformation
If you want to use different formulas, depending on the values of one or more existing
variables, you have to enter the formula and then click the button labeled If at the bottom
of the Compute Variable dialog box. This will take you to a secondary dialog box in
which you choose "Include if case satisfies condition" and enter your conditional
expression. For example, if you wish to compute a new variable, you would specify how
the new target variable is coded in reference to the "if, then" expression.
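The difference between the two kinds of transformation can be sketched outside SPSS. The short Python example below is only an illustration under stated assumptions: the cases, the field names (sex, q1, q2) and both formulas are hypothetical and are not taken from any course data file. It applies one formula to every case (unconditional) and a sex-dependent formula (conditional), and stores the results under new names so the original values are never overwritten.

```python
# Hypothetical cases; the field names and formulas are illustrative assumptions.
cases = [
    {"sex": "male", "q1": 3, "q2": 4},
    {"sex": "female", "q1": 2, "q2": 5},
    {"sex": "male", "q1": 1, "q2": 1},
]


def unconditional_total(case):
    """Same formula for every case."""
    return case["q1"] + case["q2"]


def conditional_index(case):
    """Formula depends on the value of another variable (here, sex)."""
    if case["sex"] == "male":
        return case["q1"] + 2 * case["q2"]
    return 2 * case["q1"] + case["q2"]


for case in cases:
    # Results go into NEW variables, mirroring the warning about target names.
    case["total"] = unconditional_total(case)
    case["index"] = conditional_index(case)

for case in cases:
    print(case)
```

In SPSS itself the same logic is entered through the Compute Variable dialog box and its If button; the sketch only mirrors the idea.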
Changing the Coding Scheme
Recode into Same Variables
If you wish to change the coding of a variable but not create a totally different variable,
you would select Transform, Recode, Into Same Variables, and click on the variable or
variables of interest and move them into the variable box by clicking the arrow.
Depending on how you wish to recode the values within a variable, you could select Old
and New Values and on the left side of the dialog box, choose the numbers that you
wish to change and on the right side of the dialog box, choose what you want them
to become and click Add. When done, select Continue to go back to the previous
dialog box and paste the command syntax so that you can run it. Again, don't forget to
type in commentary of what the command is doing. In other cases, you might
choose the "If" tab to compute the conditions under which a recode will take place.
TIP 13
If you wish to recode a group of variables using the same coding scheme, such as recode
a 2 into a 1 for a set of variables even if the numbers stand for different value labels, you
can enter several variables into the dialog box at once.
Recode into Different Variables
If you want to recode an existing variable into a new one, every original value
has to be transformed into a value of the new variable. Click Transform, Recode, Into
Different Variables and you will get a dialog box. In this dialog box, select the name
of the variable that will be recoded. Then in the Output Variable Name text box,
enter a name for the new variable. Click the Change button and the new name
appears after the arrow in the central list. Once this is done, click "Old and New
Values" and enter the recode criteria that will comprise the command syntax. SPSS
carries out the recode specifications in the order they are listed in the old to new list.
TIP 14
Always specify all of the values even if you're leaving them unchanged. Select All Other
Values and then Copy Old Values. Remember to click the Add button after entering
each specification to move it into the old to new list; otherwise, it is ignored.
Checking the Recode
The easiest method is to make a crosstabs table of the original variable with the new
variable containing recoded values.
Warning: After you've created a new variable with recode, go to the Variable View in the
Data Editor and set the missing values for each newly created variable.
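The recode-then-check routine described above can be sketched in a few lines of Python. This is a minimal illustration, not SPSS syntax: the eight data values and the recode map (collapsing a 5-point scale to 3 categories) are hypothetical. Every original value is listed in the map, including any that would stay unchanged, and the crosstab of old by new values should show each original value under exactly one new value.

```python
from collections import Counter

# Hypothetical 5-point codes (1 = like very much ... 5 = dislike very much).
original = [1, 2, 2, 3, 4, 5, 5, 5]

# Hypothetical recode map; every original value is specified explicitly.
recode_map = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# Recode into a DIFFERENT variable; the original list is left untouched.
recoded = [recode_map[value] for value in original]

# Crosstab of original by recoded values, used to check the recode.
crosstab = Counter(zip(original, recoded))
for (old, new), count in sorted(crosstab.items()):
    print(f"old={old} new={new} count={count}")
```

If an original value ever appeared under two different new values, or under an unexpected one, the recode specification would need another look.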
Describing Your Data
Examining Tables and Chart Counts
Frequency Tables
Rap Music
                              Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Like Very Much             41       2.7             2.9                  2.9
          Like It                   145       9.7            10.1                 13.0
          Mixed Feelings            266      17.7            18.6                 31.6
          Dislike It                401      26.7            28.0                 59.6
          Dislike Very Much         578      38.5            40.4                100.0
          Total                    1431      95.4           100.0
Missing   DK Much About It           58       3.9
          NA                         11        .7
          Total                      69       4.6
Total                              1500     100.0
Imagine that you were interested in analyzing respondents' views regarding rap music.
You would run a frequency table like the one above to find a count of the level of like or
dislike of rap music reported by respondents. Each row of the table corresponds to one of
the recorded answers. Be sure to check that the counts presented appear to be
correct, including those for the missing data listing.
The 3rd through 5th columns contain percentages. The 3rd column, labeled simply percent,
is the % of all cases in the data file with that value. 9.7% of respondents reported that
they like rap music. However, the 4th column, labeled valid percent, indicates that 10.1%
of respondents like rap music. Why the difference? The 4th column bases the % only on
people who actually responded to the question.
Warning: A large difference between the % and valid % columns can signal big
problems for your study. If the missing values result from people not being asked the
question because that's the design of the study, you don't have to worry. If people
weren't asked because the interviewer decided not to ask them or if they refused to
answer, that's a different matter.
The 5th column, labeled cumulative percent, is the sum of the valid % for that row and all
of the rows before it. It's useful only if the variable is measured at least on an ordinal
scale. For example, the cumulative % for "like" tells you that 13% of respondents either
reported that they like rap music or that they like it very much. The valid data value that
occurs most frequently is called the mode. For these data, "dislike very much" is the
modal category since 578 of the respondents reported that they disliked rap music very
much. The mode is not a particularly good summary measure, and if you report it, you
should always indicate the percentage of cases with that value. For variables measured
on a nominal scale, the mode is the only summary statistic that makes sense, but that isn't
the case for this variable because there is a natural order to the responses (i.e. ordinal
variable).
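The three percentage columns can be reproduced by hand from the counts. The Python sketch below uses the counts reported in the Rap Music frequency table (1431 valid answers, 58 + 11 missing); the variable names are my own, and rounding to one decimal is an assumption about how the table was displayed.

```python
# Counts taken from the Rap Music frequency table above.
counts = {
    "Like Very Much": 41,
    "Like It": 145,
    "Mixed Feelings": 266,
    "Dislike It": 401,
    "Dislike Very Much": 578,
}
missing = 58 + 11  # "DK Much About It" plus "NA"

valid_total = sum(counts.values())
grand_total = valid_total + missing

rows = []
cumulative = 0.0
for label, n in counts.items():
    percent = 100 * n / grand_total        # based on all cases
    valid_percent = 100 * n / valid_total  # based on valid answers only
    cumulative += valid_percent            # running sum of valid percent
    rows.append((label, n, round(percent, 1),
                 round(valid_percent, 1), round(cumulative, 1)))

# The mode is the valid category with the largest count.
mode = max(counts, key=counts.get)

for row in rows:
    print(row)
print("Mode:", mode)
```

Percent divides by all 1500 cases while valid percent divides by the 1431 valid answers, which is why the two columns differ whenever there are missing values.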
Frequency Tables as Charts
You can display the numbers in a frequency table in a pie chart or a bar chart, although
prominent statisticians advise that one should "never use a pie chart."
[Pie chart: Rap Music. Dislike Very Much 40.39%, Dislike It 28.02%, Mixed Feelings 18.59%,
Like It 10.13%, Like Very Much 2.87%]
Warning: If you create a pie chart by choosing Descriptive Statistics, then Frequencies, a
slice for missing values is always included. Use Graphs, then select Pie if you don't want
to include a slice for missing values. This was the way that I obtained the pie chart
above.
[Bar chart: Rap Music. Categories: Dislike Very Much, Dislike It, Mixed Feelings, Like It,
Like Very Much. Vertical axis: Percent, 0.0% to 50.0%]
Examining Tables and Chart Counts
Now you know how people as a group feel about rap music, but what about more
nuanced information about the kinds of people who hold these views? Are they male?
College educated? Racial and ethnic minorities? To find out this information, you need
to look at attitudes regarding rap music in conjunction with other variables. A
crosstabulation is a 2-way table of counts, here for attitudes toward rap music and
gender. Gender is the row variable since it defines the rows of the table, and attitudes
toward rap music is the column variable since it defines the columns. Each of the unique
combinations of the values of the 2 variables defines a cell of the table. The numbers in
the total row and column are called marginals because they are in the margins of the
table. They are frequency tables for the individual variables.
TIP 15
DON'T be alarmed if the marginals in the crosstabulation aren't identical to the
frequency tables for the individual variables. Only cases with valid values for both
variables are in the crosstabulation, so if you have cases with missing values for one
variable but not the other, they will be excluded from the crosstabulation. Respondents
who tell you their gender but not their attitudes about rap music are included in the
frequency table for gender but not in the crosstabulation of the 2 variables.
The table below shows a crosstabulation that contains information solely on the number
of cases that meet both criteria, but not a % distribution.
Respondent's Sex * Rap Music Crosstabulation
Count
                                        Rap Music
                    Like Very    Like It      Mixed Dislike It    Dislike      Total
Respondent's Sex         Much              Feelings             Very Much
Male                       17         62         97        181        258        615
Female                     24         83        169        220        320        816
Total                      41        145        266        401        578       1431
Percentages
The above information, i.e. the counts in the cells, are the basic elements of the table, but
they are usually not the best choice for reporting findings because they cannot be easily
compared if there are different totals in the rows and columns of the table. For example,
if you know that 17 males and 24 females like rap music very much, you can conclude
little about the relationship between the 2 variables unless you also know the total of men
and women in the sample.
For a crosstabulation, you can compute 3 different percentages:
Row %: the cell count divided by the number of cases in the row times 100
Column %: the cell count divided by the number of cases in the column times 100
Total %: the cell count divided by the total number of cases in the table times 100
The 3 % convey different information, so be sure to choose the correct one for your
problem. If one of the 2 variables in your table can be considered an independent
variable and the other a dependent variable, make sure the % sum up to 100 for each
category of the independent variable.
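The three definitions above can be checked directly against the counts in the sex by rap music crosstabulation. The following Python sketch is only an illustration: the counts come from the table in this guide, while the variable names and the rounding to one decimal are my own assumptions.

```python
# Cell counts from the Respondent's Sex * Rap Music crosstabulation.
# Column order: Like Very Much, Like It, Mixed Feelings, Dislike It, Dislike Very Much.
table = {
    "Male": [17, 62, 97, 181, 258],
    "Female": [24, 83, 169, 220, 320],
}

row_totals = {sex: sum(cells) for sex, cells in table.items()}
col_totals = [sum(column) for column in zip(*table.values())]
grand_total = sum(row_totals.values())

# Row %: cell count / row total * 100
row_pct = {sex: [round(100 * n / row_totals[sex], 1) for n in cells]
           for sex, cells in table.items()}

# Column %: cell count / column total * 100
col_pct = {sex: [round(100 * n / total, 1) for n, total in zip(cells, col_totals)]
           for sex, cells in table.items()}

# Total %: cell count / grand total * 100
total_pct = {sex: [round(100 * n / grand_total, 1) for n in cells]
             for sex, cells in table.items()}

print("Row %:   ", row_pct)
print("Column %:", col_pct)
print("Total %: ", total_pct)
```

Only the row % sum to 100 within each sex, which is the property you want when sex is treated as the independent variable.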
Respondent's Sex * Rap Music Crosstabulation
                                                       Rap Music
                                     Like Very    Like It      Mixed Dislike It    Dislike      Total
                                          Much              Feelings             Very Much
Male    Count                               17         62         97        181        258        615
        % within Respondent's Sex         2.8%      10.1%      15.8%      29.4%      42.0%     100.0%
        % within Rap Music               41.5%      42.8%      36.5%      45.1%      44.6%      43.0%
        % of Total                        1.2%       4.3%       6.8%      12.6%      18.0%      43.0%
Female  Count                               24         83        169        220        320        816
        % within Respondent's Sex         2.9%      10.2%      20.7%      27.0%      39.2%     100.0%
        % within Rap Music               58.5%      57.2%      63.5%      54.9%      55.4%      57.0%
        % of Total                        1.7%       5.8%      11.8%      15.4%      22.4%      57.0%
Total   Count                               41        145        266        401        578       1431
        % within Respondent's Sex         2.9%      10.1%      18.6%      28.0%      40.4%     100.0%
        % within Rap Music              100.0%     100.0%     100.0%     100.0%     100.0%     100.0%
        % of Total                        2.9%      10.1%      18.6%      28.0%      40.4%     100.0%
Since gender would fall under the realm of an independent variable, you want to calculate
the row % because they will tell you what % of women and men fall into each of the
attitudinal categories. This % isn't affected by unequal numbers of males and females in
your sample. From the row % displayed above, you find that 2.8% of males like rap
music very much as do 2.9% of females. So with regard to strong positive feelings about
rap music, you note that there are no visible differences. Note: No statistical differences
are examined yet. From the column % displayed above, you find that among those who
like rap music very much, 41.5% are men and 58.5% are women. This does not tell you
that females are significantly more likely to report liking rap music very much than
males. Instead, it tells you only that, of the people who like rap music very much, a larger
share are women than men. Note: The column % depend on the number of men
and women in the sample as well as how they feel about rap music. If men and women
have identical attitudes but there are twice as many men in the survey as women, the
column % for men will be twice as large as the column % for women. You can't draw
any conclusions based on only the column %.
TIP 16
If you use row %, compare the % within a column. If you use column %, compare the %
within a row.
Multiway Tables of Counts as Charts
You can plot the % in the table above by using a clustered bar chart like the one below.
For each attitudinal category regarding rap music, there are separate bars for men and
women since gender is the cluster variable. The values plotted are the % of all men and
the % of all women who gave each response. You can easily see that females are about as
likely as males to like rap music very much. Although the same information is
in the crosstabulation, it is easier to see in the bar chart.
[Clustered bar chart: Rap Music by Respondent's Sex. Vertical axis: Percent, 0.0% to 50.0%.
Female: Dislike Very Much 39.22%, Dislike It 26.96%, Mixed Feelings 20.71%, Like It 10.17%,
Like Very Much 2.94%.
Male: Dislike Very Much 41.95%, Dislike It 29.43%, Mixed Feelings 15.77%, Like It 10.08%,
Like Very Much 2.76%.]
TIP 17
Always select % in the clustered bar chart dialog boxes; otherwise, you'll have a difficult
time making comparisons within a cluster, since the height of the bars will depend on the
number of cases in each subgroup. For example, you won't be able to tell if the bar for
men who always read newspapers is higher because men are more likely to read a
newspaper daily or because there are more men in the sample.
Control Variables
You can examine the relationship between gender and attitudes toward rap music
separately for each category of another variable, such as education (i.e., the control
variable). See the crosstabulation below for how the information looks when a control
variable is entered into the crosstabulation dialog box.
Respondent's Sex * Rap Music * RS Highest Degree Crosstabulation
                                                                       Rap Music
RS Highest                                           Like Very    Like It      Mixed Dislike It    Dislike      Total
Degree          Sex     Statistic                         Much              Feelings             Very Much
Less than HS    Male    Count                                5         11         14         30         55        115
                        % within Respondent's Sex         4.3%       9.6%      12.2%      26.1%      47.8%     100.0%
                Female  Count                               10         18         19         35         59        141
                        % within Respondent's Sex         7.1%      12.8%      13.5%      24.8%      41.8%     100.0%
                Total   Count                               15         29         33         65        114        256
                        % within Respondent's Sex         5.9%      11.3%      12.9%      25.4%      44.5%     100.0%
High school     Male    Count                                9         36         50         87        110        292
                        % within Respondent's Sex         3.1%      12.3%      17.1%      29.8%      37.7%     100.0%
                Female  Count                               11         45         95        134        175        460
                        % within Respondent's Sex         2.4%       9.8%      20.7%      29.1%      38.0%     100.0%
                Total   Count                               20         81        145        221        285        752
                        % within Respondent's Sex         2.7%      10.8%      19.3%      29.4%      37.9%     100.0%
Junior college  Male    Count                                1          4          4         13         14         36
                        % within Respondent's Sex         2.8%      11.1%      11.1%      36.1%      38.9%     100.0%
                Female  Count                                1          3         13         15         18         50
                        % within Respondent's Sex         2.0%       6.0%      26.0%      30.0%      36.0%     100.0%
                Total   Count                                2          7         17         28         32         86
                        % within Respondent's Sex         2.3%       8.1%      19.8%      32.6%      37.2%     100.0%
Bachelor        Male    Count                                2          8         22         32         41        105
                        % within Respondent's Sex         1.9%       7.6%      21.0%      30.5%      39.0%     100.0%
                Female  Count                                2         11         30         27         52        122
                        % within Respondent's Sex         1.6%       9.0%      24.6%      22.1%      42.6%     100.0%
                Total   Count                                4         19         52         59         93        227
                        % within Respondent's Sex         1.8%       8.4%      22.9%      26.0%      41.0%     100.0%
Graduate        Male    Count                                           3          7         19         38         67
                        % within Respondent's Sex                    4.5%      10.4%      28.4%      56.7%     100.0%
                Female  Count                                           5         12          9         16         42
                        % within Respondent's Sex                   11.9%      28.6%      21.4%      38.1%     100.0%
                Total   Count                                           8         19         28         54        109
                        % within Respondent's Sex                    7.3%      17.4%      25.7%      49.5%     100.0%
You see that the largest difference in strong dislike of rap music between men and
women occurs among those with a graduate degree. 56.7% of males strongly dislike rap
compared to 38.1% of females. The % are almost equal for those with a high school
education.
As the number of variables in a crosstabulation increases, it becomes unwieldy to plot all
of the categories of a variable. Instead you can restrict your attention to a particular
response.
T-Tests
When using these statistical tests, you are testing the null hypothesis that 2 population
means are equal. The alternative hypothesis is that they are not equal. There are 3
different ways to go about this, depending on how the data were obtained.
Deciding Which T-test to Use
Neither the one-sample t test nor the paired-samples t test requires any assumption about
the population variances, but the 2-sample t test does.
TIP 18
When reporting the results of a t test, make sure to include the actual means, differences,
and standard errors. Don't give just a t value and the observed significance level.
One-sample T test
If you have a single sample of data and want to know whether it might be from a
population with a known mean, you have what's termed a one-sample design, which can
be analyzed with a one-sample t test.
Examples
You want to know whether CEOs have the same average score on a personality
inventory as the population on which it was normed. You administer the test to a
random sample of CEOs. The population value is assumed to be known in
advance. You don't estimate it from your data.
You're suspicious of the claim that the normal body temperature is 98.6 degrees.
You want to test the null hypothesis that the average body temperature for human
adults is the long assumed value of 98.6, against the alternative hypothesis that it
is not. The value 98.6 isn't estimated from the data; it is a known constant. You
take a single random sample of 1,000 adult men and women and obtain their
temperatures.
You think that 40 hours no longer defines the traditional work week. You want to
test the null hypothesis that the average work week is 40 hours, against the
alternative that it isn't. You ask a random sample of 500 full-time employees
how many hours they worked last week.
You want to know whether the average IQ score for children diagnosed with
schizophrenia differs from 100, the average for the population of all children.
You administer an IQ test to a random sample of 700 schizophrenic children.
Your null hypothesis is that the population value for the average IQ score for
schizophrenic children is 100, and the alternative hypothesis is that it isn't.
Data Arrangement
For the one-sample t test, you have one variable that contains the values for each
case. For example:
A manufacturer of high-performance automobiles produces disc brakes that must
measure 322 millimeters in diameter. Quality control randomly draws 16 discs
made by each of eight production machines and measures their diameters. This
example uses the file brakes.sav. Use One-Sample T Test to determine whether or
not the mean diameters of the brakes in each sample significantly differ from 322
millimeters. A nominal variable, Machine Number, identifies the production
machine used to make the disc brake. Because the data from each machine must
be tested as a separate sample, the file must first be split into groups by Machine
Number.
In the Split File dialog box, select "Compare groups." Select machine number from the
variable listing and move it into the box for "Groups based on." Since the file isn't
already sorted, be sure that you have selected "Sort the file by grouping variables."
Next, select Analyze, then Compare Means, and then One-Sample T Test.
Select the test variable, i.e. disc brake diameter (mm), type 322 as the test value, and
click Options.
In the Options dialog box for the one-sample T test, type 90 in the confidence interval %,
then be sure that you have missing values coded as "exclude cases analysis by analysis,"
then click Continue, then click Paste so that the syntax is entered in the syntax viewer, and
then select OK.
Note: A 95% confidence interval is generally used, but the examples below
reflect a 90% confidence interval.
The Descriptives table displays the sample size, mean, standard deviation, and standard
error for each of the eight samples. The sample means disperse around the 322mm
standard by what appears to be a small amount of variation.
The test statistic table shows the results of the one-sample T test.
The t column displays the observed t statistic for each sample, calculated as the ratio of the
mean difference divided by the standard error of the sample mean.
The df column displays degrees of freedom. In this case, this equals the number of cases in
each group minus 1.
The column labeled Sig. (2-tailed) displays a probability from the t distribution with 15
degrees of freedom. The value listed is the probability of obtaining an absolute value
greater than or equal to the observed t statistic, if the difference between the sample mean
and the test value is purely random.
The Mean Difference is obtained by subtracting the test value (322 in this example) from
each sample mean.
The 90% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 90% of all possible random samples of 16
disc brakes produced by this machine.
Since their confidence intervals lie entirely above 0.0, you can safely say that machines 2,
5 and 7 are producing discs that are significantly wider than 322mm on the average.
Similarly, because its confidence interval lies entirely below 0.0, machine 4 is producing
discs that are not wide enough.
The one-sample t test can be used whenever sample means must be compared to a known
test value. As with all t tests, the one-sample t test assumes that the data be reasonably
normally distributed, especially with respect to skewness. Extreme or outlying values
should be carefully checked; boxplots are very handy for this.
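The mean difference, t statistic and degrees of freedom described above follow directly from their definitions. The Python sketch below uses only the standard library; the eight diameters are hypothetical values invented for illustration (they are not from brakes.sav), and the function name is my own. It does not compute the Sig. (2-tailed) value, because the standard library has no t distribution.

```python
import math
import statistics


def one_sample_t(values, test_value):
    """Return mean difference, t statistic and df for a one-sample t test."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    mean_difference = mean - test_value
    return {
        "mean_difference": mean_difference,
        "t": mean_difference / standard_error,
        "df": n - 1,
    }


# Hypothetical disc diameters in millimeters (illustrative only).
diameters = [322.1, 321.9, 322.3, 322.0, 322.2, 321.8, 322.4, 322.1]
result = one_sample_t(diameters, 322)
print(result)
```

A positive t here means the sample mean lies above the test value; whether the difference is significant still has to be read from the SPSS output or a t table with the stated df.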
Paired-Samples T test
You use a paired-samples (also known as matched cases) T test if you want to test
whether 2 population means are equal, and you have 2 measurements from pairs of
people or objects that are similar in some important way. For example, you've observed
the same person before and after treatment or you have personality measures for each
CEO and their non-CEO sibling. Each "case" in this data file represents a pair of
observations.
Examples
You are interested in determining whether self-reported weights and actual
weights differ. You ask a random sample of 200 people how much they weigh
and then you weigh them on a scale. You want to compare the means of the 2
related sets of weights.
You want to test the null hypothesis that husbands and wives have the same
average years of education. You take a random sample of married couples and
compare their average years of education.
You want to compare 2 methods for teaching reading. You take a random sample
of 50 pairs of twins and assign each member of a pair to one of the 2 methods.
You compare average reading scores after completion of the program.
Data Arrangement
In a paired-samples design, both members of a pair must be on the same data
record. Different variable names are used to distinguish the 2 members of a pair.
For example:
A physician is evaluating a new diet for her patients with a family history of heart
disease. To test the effectiveness of this diet, 16 patients are placed on the diet for
6 months. Their weights and triglyceride levels are measured before and after the
study, and the physician wants to know if either set of measurements has changed.
This example uses the file dietstudy.sav. Use Paired-Samples T Test to determine
whether there is a statistically significant difference between the pre- and post-diet
weights and triglyceride levels of these patients.
Select Analyze, then Compare Means, then Paired-Samples T Test.
Select Triglyceride and Final Triglyceride as the first set of paired variables.
Select Weight and Final Weight as the second pair and click OK.
The Descriptives table displays the mean, sample size, standard deviation, and standard
error for both groups. The information is presented in pairs, with pair 1 first and
pair 2 second in the table.
Across all 16 subjects, triglyceride levels dropped between 14 and 15 points on average
after 6 months of the new diet.
The subjects clearly lost weight over the course of the study; on average, about 8 pounds.
The standard deviations for pre- and post-diet measurements reveal that subjects were
more variable with respect to weight than to triglyceride levels.
At -0.286, the correlation between the baseline and six-month triglyceride levels is not
statistically significant. Levels were lower overall, but the change was inconsistent across
subjects. Several lowered their levels, but several others either did not change or increased
their levels.
On the other hand, the Pearson correlation between the baseline and six-month weight
measurements is 0.996, almost a perfect correlation. Unlike the triglyceride levels, all
subjects lost weight and did so quite consistently.
The Mean column in the paired-samples t test table displays the average difference
between triglyceride and weight measurements before the diet and six months into the diet.
The Std. Deviation column displays the standard deviation of the average difference score.
The Std. Error Mean column provides an index of the variability one can expect in
repeated random samples of 16 patients similar to the ones in this study.
The 95% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 95% of all possible random samples of 16
patients similar to the ones participating in this study.
The t statistic is obtained by dividing the mean difference by its standard error.
The Sig. (2-tailed) column displays the probability of obtaining a t statistic whose absolute
value is equal to or greater than the obtained t statistic.
Since the significance value for change in weight is less than 0.05, you can conclude that
the average loss of 8.06 pounds per patient is not due to chance variation, and can be
attributed to the diet.
However, the significance value greater than 0.10 for change in triglyceride level shows
the diet did not significantly reduce their triglyceride levels.
Warning: When you click the first variable of a pair, it doesn't move to the list
box; instead, it moves to the lower left box labeled Current Selections. Only
when you click a second variable and move it into Current Selections can you
move the pair into the Paired Variables list.
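The paired-samples t statistic is simply a one-sample t test on the within-pair differences, tested against 0. The Python sketch below illustrates this with three hypothetical before and after weights (not the dietstudy.sav data); the function name and the before-minus-after direction of the difference are my own assumptions.

```python
import math
import statistics


def paired_t(before, after):
    """Return mean difference, t statistic and df for paired measurements."""
    if len(before) != len(after):
        raise ValueError("Both members of every pair must be present")
    # One difference per pair (before minus after).
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    # Standard error of the mean difference.
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_difference, mean_difference / standard_error, n - 1


# Hypothetical weights in pounds for three patients (illustrative only).
before = [200, 185, 170]
after = [192, 176, 163]
mean_difference, t_statistic, df = paired_t(before, after)
print(mean_difference, t_statistic, df)
```

Because only the differences enter the calculation, both measurements for a pair have to sit on the same data record, which is the arrangement described above.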
Two-Independent-Samples T test
If you have 2 independent groups of subjects, such as CEOs and non-CEOs, men and
women, or people who received a treatment and people who didn't, and you want to test
whether they come from populations with the same mean for the variable of interest, you
have a 2-independent-samples design. In an independent-samples design, there is no
relationship between people or objects in the 2 groups. The T test you use is called an
independent-samples T test.
Examples
You want to test the null hypothesis that, in the U.S. population, the average hours
spent watching TV per day is the same for males and females.
You want to compare 2 teaching methods. One group of students is taught by one
method, while the other group is taught by the other method. At the end of the
course, you want to test the null hypothesis that the population values for the
average scores are equal.
You want to test the null hypothesis that people who report their incomes in a
survey have the same average years of education as people who refuse.
Data Arrangement
If you have 2 independent groups of subjects, e.g., boys and girls, and want to
compare their scores, your data file must contain two variables for each child: one
that identifies whether a case is a boy or a girl, and one with the score. The same
variable name is used for the scores for all cases. To run the 2-independent-samples
T test, you have to tell SPSS which variable defines the groups. That's
the variable Gender, which is moved into the Grouping Variable box. Notice the
2 question marks after a variable name. They will disappear after you use the
Define Groups dialog box to tell SPSS which values of the variable should be
used to form the 2 groups.
TIP 18
Right-click the variable name in the Grouping Variable box and select variable
information from the pop-up menu. Now you can check the codes and value labels that
you've defined for that variable.
Warning: In the Define Groups dialog box, you must enter the actual values that you
entered into the Data Editor, not the value labels. If you used the codes of 1 for male and
2 for female and assigned them value labels of m and f, then you enter the values 1 and 2,
not the labels m and f, into the Define Groups dialog box.
An analyst at a department store wants to evaluate a recent credit card promotion. To this
end, 500 cardholders were randomly selected. Half received an ad promoting a reduced
interest rate on purchases made over the next three months, and half received a standard
seasonal ad.
Select Analyze, then Compare Means, then Independent-Samples T Test.
Select money spent during the promotional period as the test variable. Select type of mail
insert received as the grouping variable. Then click Define Groups.
Type 0 as the group 1 value and 1 as the group 2 value under Define Groups. By
default, the program should have "Use specified values" selected. Then click Continue and
OK.
The Descriptives table displays the sample size, mean, standard deviation, and standard
error for both groups. On average, customers who received the interest-rate promotion
charged about $70 more than the comparison group, and they vary a little more around
their average.
The procedure produces two tests of the difference between the two groups. One test
assumes that the variances of the two groups are equal. The Levene statistic tests this
assumption.
In this example, the significance value of the statistic is 0.276. Because this value is
greater than 0.10, you can assume that the groups have equal variances and ignore the
second test. Using the pivoting trays, you can change the default layout of the table so
that only the "equal variances" test is displayed.
Activate the pivot table. Then under Pivot, select Pivoting Trays.
Drag Assumptions from the row to the layer and close the pivoting trays window.
With the test table pivoted so that assumptions are in the layer, the Equal variances
assumed panel is displayed.
The df column displays degrees of freedom. For the independent-samples t test, this equals
the total number of cases in both samples minus 2.
The column labeled Sig. (2-tailed) displays a probability from the t distribution with 498
degrees of freedom. The value listed is the probability of obtaining an absolute value
greater than or equal to the observed t statistic, if the difference between the sample means
is purely random.
The Mean Difference is obtained by subtracting the sample mean for group 2 (the New
Promotion group) from the sample mean for group 1.
The 95% Confidence Interval of the Difference provides an estimate of the boundaries
between which the true mean difference lies in 95% of all possible random samples of 500
cardholders.
Since the significance value of the test is less than 0.05, you can safely conclude that the
average of 71.11 dollars more spent by cardholders receiving the reduced interest rate is
not due to chance alone. The store will now consider extending the offer to all credit
customers.
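The "equal variances assumed" version of the test pools the two sample variances before computing the standard error. The Python sketch below shows that calculation with the standard library only; the two small groups of amounts are hypothetical (they are not the store's data), the function name is my own, and the Levene test and the significance value are deliberately left out.

```python
import math
import statistics


def pooled_t(group1, group2):
    """Equal-variances t test: mean difference (group 1 - group 2), t, df."""
    n1, n2 = len(group1), len(group2)
    mean_difference = statistics.mean(group1) - statistics.mean(group2)
    # Pooled variance: df-weighted average of the two sample variances.
    pooled_variance = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return mean_difference, mean_difference / standard_error, n1 + n2 - 2


# Hypothetical dollars spent (illustrative only).
standard_ad = [1500, 1650, 1580, 1420]    # group 1
new_promotion = [1600, 1710, 1690, 1560]  # group 2
mean_difference, t_statistic, df = pooled_t(standard_ad, new_promotion)
print(mean_difference, t_statistic, df)
```

As in the SPSS output, the mean difference is group 1 minus group 2, so it comes out negative when the second group spends more, and df is the total number of cases minus 2.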
Churn propensity scores are applied to accounts at a cellular phone company. Ranging
from 0 to 100, an account scoring 50 or above may be looking to change providers. A
manager with 50 customers above the threshold randomly samples 200 below it, wanting
to compare them on average minutes used per month.
Select Analyze, then Compare Means, then Independent-Samples T Test.
Select average monthly minutes as the test variable and propensity to leave
as the grouping variable. Then select Define Groups.
Select Cut point and type 50 as the cut point value. Then click Continue and
OK.
The Descriptives table shows that customers with propensity scores of 50 or more are
using their cell phones about 78 minutes more per month on the average than customers
with scores below 50.
The significance value of the Levene statistic is greater than 0.10, so you can assume that
the groups have equal variances and ignore the second test. Using the pivoting trays,
change the default layout of the table so that only the "equal variances" test is displayed.
Play around with the pivot tray link if you wish.
The t statistic provides strong evidence of a difference in monthly minutes between
accounts more and less likely to change cellular providers.
Analyzing Truancy Data: The Example
To perform this analysis in order to test your skills using a T test, please see the SPSS file
on the course Blackboard page.
One-Sample T test
Consider whether the observed truancy rate before intervention (the % of school days
missed because of truancy) differs from an assumed nationwide truancy rate of 8%. You
have one sample of data (students enrolled in the TRP program, a truancy reduction
program) and you want to compare the results to a fixed, specified-in-advance population
value.
The null hypothesis is that the sample comes from a population with an average truancy
rate of 8%. (Another way of stating the null hypothesis is that the difference in the
population means between your population and the nation as a whole is 0.) The
alternative hypothesis is that your sample doesn't come from a population with a truancy
rate of 8%.
To obtain the table below, you would do one of the following: Go to Analyze, choose
descriptive statistics, then descriptives, select the variable to be examined, in this case
prepct, then go to options in the descriptives dialog box and select mean, minimum,
maximum, and standard deviation, then select continue and okay. You can also
choose frequencies under the descriptive statistics link, select the variable to be
examined, go to statistics and pick the same statistics as above, select continue, and
then okay.
Descriptive Statistics
                                               N     Minimum   Maximum   Mean      Std. Deviation
prepct percent truant days pre intervention    299   .00       72.08     14.2038   13.07160
Valid N (listwise)                             299
From the table above, you see that, for the 299 students in this sample, the average
truancy rate is 14.2%. You know that even if the sample is selected from a population in
which the true rate is 8%, you don't expect your sample to have an observed rate of
exactly 8%. Samples from the population vary. What you want to determine is whether
it's plausible for a sample of 299 students to have an observed truancy rate of 14.2% if
the population value is 8%.
TIP 19
Before you embark on actually computing a one-sample T test, make certain checks.
Look at the histogram of the truancy rates to make sure that all of the values make sense.
Are there percentages smaller than 0 or greater than 100? Are there values that are really
far from the rest? If so, make sure they're not the result of errors. If you have a small
number of cases, outliers can have a large effect on the mean and the standard deviation.
Checking the Assumptions
To use the one-sample T test, you have to make certain assumptions about the data:
The observations must be independent of each other. In this data file, students
came from 17 schools, so it's possible that students in the same school may be
more similar than students in different schools. If that's the case, the estimated
significance level may be smaller than it should be, since you don't have as much
information as the sample size indicates. (If you have 10 students from 10
different schools, that's more information than having 10 students from the same
school because it's plausible that students in the same school are more similar
than students from different schools.) Independence is one of the most important
assumptions that you have to make when analyzing data.
In the population, the distribution of the variable must be normal, or the sample
size must be large enough so that it doesn't matter. The assumption of normally
distributed data is required for many statistical tests. The importance of the
assumption differs, depending on the statistical test. In the case of a one-sample T
test, the following guidelines are suggested: If the number of cases is < 15, the
data should be approximately normally distributed; if the number of cases is
between 15 and 40, the data should not have outliers or be very skewed; for
samples of 40 or more, even markedly skewed distributions are acceptable.
Because you have close to 300 observations, there's little need to worry about the
assumption of normality.
TIP 20
If you have reason to believe that the assumptions required for the T test are violated in
an important way, you can analyze the data using a nonparametric test.
Testing the Hypothesis
Compute the difference between the observed sample mean and the hypothesized
population value. (14.2% - 8% = 6.2%)
Compute the standard error of the difference. This is a measure of how much you expect
sample means, based on the same number of cases from the same population, to vary.
The hypothetical population value is a constant and doesn't contribute to the variability
of the differences, so the standard error of the difference is just the standard error of the
mean. Based on the standard deviation in the table above, the standard error equals:
SE = std. deviation / SQRT of the sample size = 13.07 / SQRT of 299 = .756 (Note:
You should be able to obtain this value using the frequencies command and
selecting standard error mean under statistics. This is a way for you to double
check if you are unsure of your calculations. See the table below.)
Statistics
prepct percent truant days pre intervention
N                     Valid      299
                      Missing    0
Mean                             14.2038
Std. Error of Mean               .75595
Std. Deviation                   13.07160
You can calculate the t statistic by hand if you divide the observed difference by the
standard error of the difference.
t = (Observed Mean (prepct) - Predicted Mean) / Std. Error of the mean
= (14.204 - 8) / 0.756 = 8.21
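If you want to double check the hand calculation outside SPSS, the arithmetic can be sketched in a few lines of Python. This is only an illustrative cross-check, not SPSS syntax: the function name one_sample_t is a hypothetical helper, and the inputs are the summary values reported for prepct in the tables above.

```python
import math

def one_sample_t(mean, sd, n, test_value):
    """Return (standard error, t statistic, degrees of freedom) from summary statistics."""
    se = sd / math.sqrt(n)           # standard error of the mean
    t = (mean - test_value) / se     # observed difference divided by its standard error
    return se, t, n - 1              # df is one fewer than the number of cases

# Summary values for prepct as reported by SPSS; 8 is the assumed nationwide rate
se, t, df = one_sample_t(mean=14.2038, sd=13.07160, n=299, test_value=8)
print(round(se, 3), round(t, 2), df)
```

The results should agree with the Std. Error of Mean in the Statistics table and with the t value of about 8.2 computed by hand.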
You can also conduct a one-sample T test using SPSS by going to analyze, compare
means, one-sample T test, selecting the relevant variable (i.e. prepct) and entering it
into the test variable box and entering the number 8 in the test value box at the
bottom of the dialog box and running the analysis. You will get the following output
as shown below.
T-TEST
  /TESTVAL = 8
  /MISSING = ANALYSIS
  /VARIABLES = prepct
  /CRITERIA = CI(.95) .
T-Test
One-Sample Statistics
                                               N     Mean      Std. Deviation   Std. Error Mean
prepct percent truant days pre intervention    299   14.2038   13.07160         .75595

One-Sample Test
Test Value = 8
prepct percent truant days pre intervention
t                                              8.207
df                                             298
Sig. (2-tailed)                                .000
Mean Difference                                6.20378
95% Confidence Interval of the Difference      Lower 4.7161    Upper 7.6915
Use the T distribution to determine if the observed t statistic is unlikely if the null
hypothesis is true. To calculate the observed significance level for a T statistic, you have
to take into account both how large the actual T value is and how many degrees of
freedom it has. For a one-sample T test, the degrees of freedom (dof) is one fewer than
the number of cases. From the table above, you see that the observed significance level is
< .0001. Your observed results are very unlikely if the true rate is 8%, so you reject the
null hypothesis. Your sample probably comes from a population with a mean larger than
8%.
TIP 21
To obtain observed significance levels for an alternative hypothesis that specifies
direction, often known as a one-sided or one-tailed test, divide the observed two-tailed
significance level by two. Be very cautious about using one-sided tests.
Examining the Confidence Interval
If you look at the 95% Confidence Interval for the population difference, you see that it
ranges from 4.7% to 7.7%. You don't know whether the true population difference is in
this particular interval, but you know that 95% of the time, 95% confidence intervals
include the true population values. Note that the value of 0 is not included in the
confidence interval. If your observed significance level had been larger than 0.05, 0
would have been included in the 95% confidence interval.
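The interval that SPSS reports is the mean difference plus or minus a critical t value times the standard error. The Python sketch below illustrates that arithmetic; confidence_interval is a hypothetical helper, and the critical value of 1.968 is an assumption (the approximate two-tailed 5% point of the t distribution with 298 degrees of freedom), since SPSS does not print it.

```python
def confidence_interval(mean_diff, se, t_crit):
    """Return (lower, upper): the mean difference plus or minus t_crit standard errors."""
    half_width = t_crit * se
    return mean_diff - half_width, mean_diff + half_width

# ASSUMPTION: approximate two-tailed 5% critical value of t with 298 degrees of
# freedom; this number does not appear in the SPSS output.
T_CRIT_298 = 1.968

# Mean difference and standard error as reported in the One-Sample tables
lower, upper = confidence_interval(mean_diff=6.20378, se=0.75595, t_crit=T_CRIT_298)
print(round(lower, 2), round(upper, 2))
```

Because the lower bound is above 0, the interval leads to the same conclusion as the significance test.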
TIP 22
There is a close relationship between hypothesis testing and confidence intervals. You
can reject the null hypothesis that your sample comes from a population with any value
outside of the 95% confidence interval. The observed significance level for the
hypothesis test will be less than 0.05.
Paired-Samples T test
You've seen that your students have a higher truancy rate than the country as a whole.
Now the question is whether there is a statistically significant difference in the truancy
rates before and after the truancy reduction programs. For each student, you have 2
values for unexcused absences. One is for the year before the student enrolled in the
program; the other is for the year in which the student was enrolled in the program. Since
there are two measurements for each subject, a before and an after, you want to use a
paired-samples T test to test the null hypothesis that the average before and after rates are
equal in the population.
TIP 23
The reason for doing a paired-samples design is to make the 2 groups as comparable as
possible on characteristics other than the one being studied. By studying the same
students before and after intervention, you control for differences in gender,
socioeconomic status, family supervision, and so on. Unless you have pairs of
observations that are quite similar to each other, pairing has little effect and may, in fact,
hurt your chances of rejecting the null hypothesis when it is false.
Before running the paired-samples T test procedure, look at the histogram of the
differences shown. You should see that the shape of the distribution is symmetrical (i.e.
not too far from normal). Many of the cases cluster around 0, indicating that the
difference in the before and after scores is small for these students.
Checking the Assumptions
The same assumptions about the distributions of the data are required for this test as those
in the one-sample T test. The observations should be independent; if the sample size is
small, the distribution of differences should be approximately normal. Note that the
assumptions are about the differences, not the original observations. That's because a
paired-samples T test is nothing more than a one-sample T test on the differences. If you
calculate the differences between the pre- and post-values and use the one-sample T test
with a population value of 0, you'll get exactly the same statistic as using the
paired-samples T test.
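To see why the paired test reduces to a one-sample test, it can help to compute it from raw values. The sketch below is a minimal Python illustration under stated assumptions: paired_t is a hypothetical helper, and the five pre and post values are made-up numbers for demonstration only, not the truancy data.

```python
import math
import statistics

def paired_t(before, after):
    """Paired-samples t computed as a one-sample t test on the differences against 0."""
    diffs = [a - b for a, b in zip(after, before)]    # one difference per subject
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / se, len(diffs) - 1             # t statistic and df (pairs minus one)

# HYPOTHETICAL truancy percentages for five students, for illustration only
pre = [10, 12, 8, 15, 20]
post = [8, 11, 9, 12, 16]

t, df = paired_t(pre, post)
print(round(t, 2), df)
```

Only the differences enter the calculation, which is why the assumptions apply to the differences rather than to the original observations.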
+esting the 5ypothesis
1rom the table below, you see that the average truancy rate before intervention is ;:.2G
and the average truancy rate after intervention is ;;.:G. *hat7s a difference about 2.EG.
*o get the table below, you should go to descri$ti+es nd select the $re$ct nd $ost$ct
+ri!les nd enter the' into the +ri!le list" !e sure tht the right sttistics re
chec/ed o22 )e1g1 stndrd de+ition-" nd then hit o/&1
Paired Samples Statistics
                                                        Mean      N     Std. Deviation   Std. Error Mean
Pair 1  postpct percent truant days post intervention   11.4378   299   11.18297         .64673
        prepct percent truant days pre intervention     14.2038   299   13.07160         .75595
To see how often you would expect to see a difference of at least 2.8% when the null
hypothesis of no difference is true, look at the paired-samples T test table below.
To obtain the table below, do the following: go to analyze, then select compare means,
then select paired-samples T test and choose the 2 variables of interest of the pair to
be selected, i.e., prepct and postpct, then select OK.
Paired Samples Test
Pair 1: postpct percent truant days post intervention - prepct percent truant days pre intervention
Paired Differences
  Mean                                         -2.76602
  Std. Deviation                               12.69355
  Std. Error Mean                              .73409
  95% Confidence Interval of the Difference    Lower -4.21067   Upper -1.32137
t                                              -3.768
df                                             298
Sig. (2-tailed)                                .000
The T statistic, 3.8, is computed by dividing the average difference (2.77%) by the
standard error of the mean difference (0.73). The degrees of freedom is the number of
pairs minus one. The observed significance level is < .001, so you can reject the null
hypothesis that the pre-intervention and post-intervention truancy rates are equal in the
population. Intervention appears to have reduced the truancy rate.
Warning: The conclusions you can draw about the effectiveness of truancy reduction
programs from a study like this are limited. Even if you restrict your conclusions to the
schools from which these children are a sample, there are many problems. Since you are
looking at differences in truancy rates between adjacent years, you aren't controlling for
possible increases or decreases in truancy that occur as children grow older. For
example, if truancy increases with age, the effect of the truancy reduction program may
be larger than it appears. There is also potential bias in the determination of what is
considered an "excused" absence.
The 95% confidence interval for the population change is from 1.3% to 4.2%. It appears
that if the program has an effect, it is not a very large one. On average, assuming a
180-day school year, students in the truancy reduction program attended school five more
days after the program than before. The 95% confidence interval for the number of days
"saved" is from 2.3 days to 7.6 days.
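The conversion from percentages to days is a simple multiplication by the length of the school year. The following Python sketch shows it; pct_to_days is a hypothetical helper, and the 180-day year is the assumption stated in the text. With the unrounded bounds from the Paired Samples Test table the lower limit comes out nearer 2.4 days; the 2.3 quoted above follows from the rounded value of 1.3%.

```python
SCHOOL_DAYS = 180  # assumed length of the school year, as stated in the text

def pct_to_days(pct, school_days=SCHOOL_DAYS):
    """Convert a percentage of school days into a number of days."""
    return pct / 100 * school_days

# Size of the mean change and of its 95% confidence limits, in percentage points,
# taken from the Paired Samples Test table
mean_days = pct_to_days(2.76602)
low_days = pct_to_days(1.32137)
high_days = pct_to_days(4.21067)
print(round(mean_days, 1), round(low_days, 1), round(high_days, 1))
```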
A paired-samples design is effective only if you have pairs of similar cases. If your
pairing does not result in a positive correlation coefficient between the 2 measurements
of close to 0.5, you may lose power (your computer stays on, but your ability to reject the
null hypothesis when it is false fizzles) by analyzing the data as a paired-samples design.
From the correlation table below, you see that the correlation coefficient between the
pre- and post-intervention rates is close to 0.5 (.461), so pairing was probably effective.
Paired Samples Correlations
                                                         N     Correlation   Sig.
Pair 1  postpct percent truant days post intervention
        & prepct percent truant days pre intervention    299   .461          .000
Warning: Although well-intentioned, paired designs often run into trouble. If you give a
subject the same test before and after an intervention, the practice effect, instead of the
intervention, may be responsible for any observed change. You must also make sure that
there is no carryover effect; that is, the effect of one intervention must be completely
gone before you impose another.
Two-Independent Samples T test
You've seen that intervention seems to have had a small, although statistically significant,
effect. Several questions remain. Is the average truancy rate the same for boys and
girls prior to intervention? Is the average truancy rate the same for boys and girls after
intervention? Is the change in truancy rates before and after intervention the same for
boys and girls?
Group Statistics
                                                gender Gender   N     Mean      Std. Deviation   Std. Error Mean
prepct percent truant days pre intervention     f Female        152   13.0998   12.25336         .99388
                                                m Male          147   15.3453   13.81620         1.13954
postpct percent truant days post intervention   f Female        152   11.5130   11.43948         .92786
                                                m Male          147   11.3599   10.94995         .90314
diffpct Pre - Post                              f Female        152   1.5866    11.72183         .95077
                                                m Male          147   3.9850    13.55834         1.11827
The table above shows summary statistics for the 2 groups for all 3 variables. Boys had
somewhat larger average truancy scores prior to intervention than did girls. The average
scores after intervention were similar for the 2 groups. The difference between the
average pre- and post-intervention is larger for boys. You must determine whether these
observed differences are large enough for you to conclude that, in the population, boys
and girls differ in average truancy rates. You can use the 2 independent-samples T test to
test all 3 hypotheses.
Checking the Assumptions
You must assume that all observations are independent. If the sample sizes in the groups
are small, the data must come from populations that have normal distributions. If the
sum of the sample sizes in the 2 groups is greater than 40, you don't have to worry about
the assumption of normality. The 2-independent-samples T test also requires
assumptions about the variances in the 2 groups. If the 2 samples come from populations
with the same variance, you should use the "pooled" or equal-variance T test. If the
variances are markedly different, you should use the separate-variance T test. Both of
these are shown below.
Independent Samples Test
Columns: Levene's Test for Equality of Variances (F, Sig.), then t-test for Equality of Means
(t, df, Sig. (2-tailed), Mean Difference, Std. Error Difference, 95% Confidence Interval of
the Difference: Lower, Upper)

                                       F       Sig.   t        df        Sig.   Mean Diff.   Std. Error   Lower      Upper
prepct percent truant days pre intervention
  Equal variances assumed              5.248   .023   -1.488   297       .138   -2.24550     1.50904      -5.21527   .72426
  Equal variances not assumed                         -1.485   290.226   .139   -2.24550     1.51207      -5.22151   .73051
postpct percent truant days post intervention
  Equal variances assumed              .122    .727   .118     297       .906   .15309       1.29578      -2.39698   2.70317
  Equal variances not assumed                         .118     296.969   .906   .15309       1.29483      -2.39511   2.70130
diffpct Pre - Post
  Equal variances assumed              1.679   .196   -1.638   297       .102   -2.39839     1.46426      -5.28003   .48326
  Equal variances not assumed                         -1.634   287.906   .103   -2.39839     1.46782      -5.28740   .49063
You can test the null hypothesis that the population variances in the 2 groups are equal
using the Levene test, shown above. If the observed significance level is small (in the
column labeled Sig. under Levene's Test), you reject the null hypothesis that the
population variances are equal. For this example, you can reject the null hypothesis that
the pre-intervention truancy variances are equal in the 2 groups. For the other 2
variables, you can't reject the null hypothesis that the variances are equal.
Testing the Hypothesis
In the 2-independent-samples T test, the T statistic is computed the same as for the other
2 tests. It is the ratio of the difference between the 2 sample means divided by the
standard error of the difference. The standard error of the difference is computed
differently, depending on whether the 2 variances are assumed to be equal or not. That's
why you see 2 sets of T values in the table above. In this example, the 2 T values and
confidence intervals based on them are very similar. That will always be the case when
the sample size in the 2 groups is almost the same.
The degrees of freedom for the t statistic also depends on whether you assume that the 2
variances are equal. If the variances are assumed to be equal, the degrees of freedom is 2
fewer than the sum of the number of cases in the 2 groups. If you don't assume that the
variances are equal, the degrees of freedom is calculated from the actual variances and
the sample sizes in the groups. The result is usually not an integer.
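Both versions of the test can be reproduced from the group summary statistics. The Python sketch below is an illustration, not SPSS's own code: two_sample_t is a hypothetical helper, and the separate-variance degrees of freedom use the Welch-Satterthwaite approximation, which is an assumption about the formula behind the output (it does reproduce the non-integer df reported for prepct).

```python
import math

def two_sample_t(n1, mean1, sd1, n2, mean2, sd2):
    """Return {'pooled': (t, df), 'separate': (t, df)} from group summary statistics."""
    diff = mean1 - mean2
    v1, v2 = sd1 ** 2, sd2 ** 2

    # Equal variances assumed: pool the two variances; df = n1 + n2 - 2
    df_pooled = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df_pooled
    se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Equal variances not assumed: separate-variance standard error with the
    # Welch-Satterthwaite approximation for the degrees of freedom (ASSUMPTION)
    a, b = v1 / n1, v2 / n2
    se_sep = math.sqrt(a + b)
    df_sep = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    return {"pooled": (diff / se_pooled, df_pooled),
            "separate": (diff / se_sep, df_sep)}

# prepct summary values for girls (group 1) and boys (group 2) from Group Statistics
result = two_sample_t(152, 13.0998, 12.25336, 147, 15.3453, 13.81620)
for name, (t, df) in result.items():
    print(name, round(t, 3), round(df, 1))
```

With group sizes this similar, the two standard errors, and therefore the two T values, differ only slightly.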
From the column labeled Sig. (2-tailed), you can't reject any of the 3 hypotheses of
interest. The observed results are not incompatible with the null hypothesis that boys and
girls are equally truant before and after the program and that intervention affects
boys and girls similarly.
Warning: When you compare 2 independent groups, one of which has a factor of interest
and the other that doesn't, you must be very careful about drawing conclusions. For
example, if you compare people enrolled in a weight-loss program to people who aren't,
you cannot attribute observed differences to the program unless the people have been
randomly assigned to the two programs.
T-Tests
Crosstabulations
You classify cases based on values for 2 or more categorical variables (e.g. type of health
insurance coverage and satisfaction with health care). Each combination of values is
called a cell. To test whether the two variables that make up the rows and columns are
independent, you calculate how many cases you expect in each cell if the variables are
independent, and compare these expected values to those actually observed using the
chi-square statistic. If your observed results are unlikely if the null hypothesis of
independence is true, you reject the null hypothesis. You can measure how strongly the
row and column variables are related by computing measures of association. There are
many different measures, and they define association in different ways. In selecting a
measure of association, you should consider the scale on which the variables are
measured, the type of association you want to detect, and the ease of interpretation of the
measure. You can study the relationship between a dichotomous (2-category) risk factor
and a dichotomous outcome (e.g. family history of a disease and development of the
disease), controlling for other variables (e.g. gender) by computing special measures
based on the odds.
Chi-Square Test: Are Two Variables Independent?
If you think that 2 variables are related, the null hypothesis that you want to test is that
they are not related. Another way of stating the null hypothesis is that the 2 variables are
independent. Independence has a very precise meaning in this situation. It means that
the probability that a case falls into a particular cell of a table is the product of the
probability that a case falls into that row and the probability that a case falls into that
column.
Warning: The word independent as used here has nothing to do with dependent and
independent variables. It refers to the absence of a relationship between 2 variables.
As an example of testing whether 2 variables are independent, look at the table below, a
crosstabulation of highest educational attainment (degree) and perception of life's
excitement (life) based on the gssdata posted on Blackboard. From the row %, you see
that the % of people who find life exciting is not exactly the same in the 5 degree groups,
although it is fairly similar for the 1st 2 degree groups. Slightly less than half of those
with less than a high school education or with a high school education find life exciting.
However, you see that there are substantial differences for those whose education ranges
from some exposure to college to a post-graduate degree. For those respondents, almost
2/3 find that life is exciting.
degree Highest degree * life Is life exciting, routine or dull? Crosstabulation
                                                       life Is life exciting, routine or dull?
degree Highest degree                                  1 Exciting   2 Routine   3 Dull   Total
0 Lt high school   Count                               59           67          10       136
                   Expected Count                      70.8         60.2        5.0      136.0
                   % within degree Highest degree      43.4%        49.3%       7.4%     100.0%
1 High school      Count                               218          232         18       468
                   Expected Count                      243.7        207.1       17.2     468.0
                   % within degree Highest degree      46.6%        49.6%       3.8%     100.0%
2 Junior college   Count                               41           23          2        66
                   Expected Count                      34.4         29.2        2.4      66.0
                   % within degree Highest degree      62.1%        34.8%       3.0%     100.0%
3 Bachelor         Count                               94           46          3        143
                   Expected Count                      74.4         63.3        5.3      143.0
                   % within degree Highest degree      65.7%        32.2%       2.1%     100.0%
4 Graduate         Count                               55           29          0        84
                   Expected Count                      43.7         37.2        3.1      84.0
                   % within degree Highest degree      65.5%        34.5%       .0%      100.0%
Total              Count                               467          397         33       897
                   Expected Count                      467.0        397.0       33.0     897.0
                   % within degree Highest degree      52.1%        44.3%       3.7%     100.0%
Warning: The chi-square test requires that all observations be independent. This means
that each case can appear in only one cell of the table. For example, if you apply 2
different treatments to the same patients and classify them both times as improved or not
improved, you can't analyze the data with the chi-square test of independence.
Computing Expected Values
You use the chi-square test to determine if your observed results are unlikely if the 2
variables are independent in the population. 2 variables are independent if knowing the
value of one variable tells you nothing about the value of the other variable. The level of
education one attains and one's perception of life are independent if the probability of
any level of educational attainment/perception of life combination is the product of the
probability of that level of educational attainment times the probability of that perception
of life. For example, under the independence assumption, the probability of being a
college graduate and finding life exciting is:
P = Probability(bachelor degree) x Probability(life exciting)
P = 143/897 x 467/897 = .083
If the null hypothesis is true, you expect to find in your table 74 excited people with
bachelor's degrees (.083 x 897 cases). You see this expected value in the row labeled
Expected Count in the table above.
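The expected count for any cell follows the same pattern: multiply the two marginal probabilities, then multiply by the number of cases. A minimal Python sketch of that calculation is shown here; expected_count is a hypothetical helper, and the totals are read from the crosstabulation above.

```python
def expected_count(row_total, col_total, n):
    """Expected cell count under independence: n * P(row) * P(column)."""
    return n * (row_total / n) * (col_total / n)

# Bachelor row total, Exciting column total, and total N from the crosstabulation
e = expected_count(row_total=143, col_total=467, n=897)
print(round(e, 1))  # 74.4, the Expected Count shown for Bachelor / Exciting
```

The same function gives the expected count for every other cell once you substitute that cell's row and column totals.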
The chi-square test is based on comparing these 2 counts: the observed number of cases
in a cell and the expected number of cases in a cell if the 2 variables are independent.
The Pearson chi-square statistic is:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
where the sum is taken over all cells of the table.
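To make the formula concrete, the statistic can be computed directly from the observed counts. The following Python sketch is an illustration only; pearson_chi_square is a hypothetical helper, and the counts are the observed values from the degree by life crosstabulation above. It also returns the degrees of freedom, (rows - 1) x (columns - 1), which is 8 for this table of 5 rows and 3 columns, matching the df column of the SPSS Chi-Square Tests output.

```python
def pearson_chi_square(table):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Observed counts: rows are degree groups, columns are Exciting, Routine, Dull
degree_by_life = [
    [59, 67, 10],    # Lt high school
    [218, 232, 18],  # High school
    [41, 23, 2],     # Junior college
    [94, 46, 3],     # Bachelor
    [55, 29, 0],     # Graduate
]
chi2, df = pearson_chi_square(degree_by_life)
print(round(chi2, 2), df)
```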
TIP 24
By examining the differences between observed and expected values in the cells (the
residuals), you can see where the independence model fails. You can examine actual
residuals and residuals standardized by estimates of their variability to help you pinpoint
departures from independence by requesting them in the Cells dialog box of the
Analyze>Descriptive Statistics>Crosstabs procedure.
Determining the Observed Significance Level
From the calculated chi-square value, you can estimate how often in a sample you would
expect to see a chi-square value at least as large as the one you observed if the
independence hypothesis is true in the population. If the observed significance level is
small enough, you reject the null hypothesis that the 2 variables are independent. The
value of chi-square depends on the number of rows and columns in the table. The
degrees of freedom for the chi-square statistic is calculated by finding the product of one
fewer than the number of rows and one fewer than the number of columns. (The degrees
of freedom is the number of cells in a table that can be arbitrarily filled when the row and
column totals are fixed.) In this example, with 5 rows and 3 columns, the degrees of
freedom is 4 x 2 = 8.
From the table below, you see that the observed significance level for the Pearson
chi-square is 0.000, so you can reject the null hypothesis that level of educational
attainment and perception of life are independent.
Chi-Square Tests
                               Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square             34.750(a)   8    .000
Likelihood Ratio               37.030      8    .000
Linear-by-Linear Association   29.373      1    .000
N of Valid Cases               897
a. 2 cells (13.3%) have expected count less than 5. The minimum expected count is 2.43.
Warning: A conservative rule for use of the chi-square test requires that the expected
values in each cell be greater than 1 and that most cells have expected values greater than
5. After SPSS displays the pivot table with the statistics, it displays the number of cells
with expected values less than 5 and the minimum expected count. If more than 20% of
your cells have expected values less than 5, you should combine categories, if that makes
sense for your table, so that most expected values are greater than 5.
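The rule in the warning above can be checked mechanically from the observed counts. The Python sketch below is illustrative only; expected_count_check is a hypothetical helper, and the 5 and 20% cut-offs are simply the ones quoted in the warning. For the degree by life table it should reproduce the footnote that SPSS prints under the Chi-Square Tests table.

```python
def expected_count_check(table, threshold=5):
    """Return (cells below threshold, share of cells below threshold, minimum expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    small = [e for e in expected if e < threshold]
    return len(small), len(small) / len(expected), min(expected)

# Observed counts from the degree by life crosstabulation
degree_by_life = [
    [59, 67, 10],
    [218, 232, 18],
    [41, 23, 2],
    [94, 46, 3],
    [55, 29, 0],
]
n_small, share, minimum = expected_count_check(degree_by_life)
print(n_small, round(share * 100, 1), round(minimum, 2))
if share > 0.20:
    print("Consider combining categories")
```

Here the share of small expected counts is under 20% and the minimum is above 1, so the conservative rule is satisfied.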
Examining Additional Statistics
SPSS displays several statistics in addition to the Pearson chi-square when you ask for a
chi-square test as shown above.
The likelihood-ratio chi-square has a different mathematical basis than the
Pearson chi-square, but for large sample sizes, it is close in value to the Pearson
chi-square. It is seldom that these 2 statistics will lead you to different
conclusions.
The linear-by-linear association statistic is also known as the Mantel-Haenszel
chi-square. It is based on the Pearson correlation coefficient. It tests whether
there is a linear association between the 2 variables. You SHOULD NOT use
this statistic for nominal variables. For ordinal variables, the test is more likely
to detect a linear association between the variables than is the Pearson chi-square
test; it is more powerful.
A continuity-corrected chi-square (not shown here) is shown for tables with 2
rows and 2 columns. Some statisticians claim that this leads to a better estimate
of the observed significance level, but the claim is disputed.
Fisher's exact test (not shown here) is calculated if any expected value in a 2 by 2
table is < 5. You get exact probabilities of obtaining the observed table or one
more extreme if the 2 variables are independent and the marginals are fixed. That
is, the number of cases in the rows and columns of the table are determined in
advance by the researcher.
Warning: The Mantel-Haenszel test is calculated using the actual values of the row and
column variables, so if you coded 3 unevenly spaced dosages of a drug as 1, 2, and 3,
those values are used for the computations.
Are Proportions Equal?
A special case of the chi-square test for independence is the test that several proportions
are equal. For example, you want to test whether the % of people who report themselves
to be very happy has changed during the time that the GSS has been conducted. The
figure below is a crosstabulation of the % of people who say they were very happy for each
of the decades. This uses the aggregatedgss.sav file. About 34% of the people questioned
in the 1970s claimed that they were very happy, compared to 31% in this millennium.
happy GENERAL HAPPINESS * decade decade of survey Crosstabulation
                                                       decade decade of survey
happy GENERAL HAPPINESS                                1 1972-1979   2 1980-1989   3 1990-1999   4 2000-2002   Total
1 VERY HAPPY     Count                                 3637          4475          4053          1296          13461
                 Expected Count                        3403.4        4516.7        4211.5        1329.4        13461.0
                 % within decade decade of survey      34.3%         31.8%         30.9%         31.3%         32.1%
2 PRETTY HAPPY   Count                                 6977          9611          9081          2850          28519
                 Expected Count                        7210.6        9569.3        8922.5        2816.6        28519.0
                 % within decade decade of survey      65.7%         68.2%         69.1%         68.7%         67.9%
Total            Count                                 10614         14086         13134         4146          41980
                 Expected Count                        10614.0       14086.0       13134.0       4146.0        41980.0
                 % within decade decade of survey      100.0%        100.0%        100.0%        100.0%        100.0%
Calculating the Chi-Square Statistic
If the null hypothesis is true, you expect 32.1% of people to be very happy in each
decade, the overall very happy rate. You calculate the expected number in each decade
by multiplying the total number of people questioned in each decade by 32.1%. The
expected number of not very happy people is 67.9% multiplied by the number of people
in each decade. These values are shown in the table above. The chi-square statistic is
calculated in the usual fashion.
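The steps just described can be traced in a short Python sketch. It is an illustration under stated assumptions rather than SPSS output: the variable names are hypothetical, and the counts are the observed values from the happy by decade crosstabulation above.

```python
# Observed counts per decade: 1972-1979, 1980-1989, 1990-1999, 2000-2002
very_happy = [3637, 4475, 4053, 1296]
not_very_happy = [6977, 9611, 9081, 2850]

decade_totals = [v + p for v, p in zip(very_happy, not_very_happy)]
overall_rate = sum(very_happy) / sum(decade_totals)   # overall very happy rate

# Expected counts if every decade shared the overall very happy rate
exp_very = [total * overall_rate for total in decade_totals]
exp_not = [total * (1 - overall_rate) for total in decade_totals]

# Pearson chi-square summed over all 8 cells
chi2 = sum((o - e) ** 2 / e
           for o, e in zip(very_happy + not_very_happy, exp_very + exp_not))
df = (2 - 1) * (len(decade_totals) - 1)

print(round(overall_rate * 100, 1), [round(e, 1) for e in exp_very])
print(round(chi2, 2), df)
```

The expected counts should agree with the Expected Count rows in the crosstabulation.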
From the table below, you see that the observed significance level for the chi-square
statistic is < .001, leading you to reject the null hypothesis that in each decade people are
equally likely to describe themselves as very happy. Notice that the difference between
years isn't very large; the largest % is 34.3% for the 1970s, while the smallest is 30.9%
for the 1990s. The sample sizes in each group are very large, so even small differences
are statistically significant, although they may have little practical implication.
Chi-Square Tests
                               Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square             34.180(a)   3    .000
Likelihood Ratio               33.974      3    .000
Linear-by-Linear Association   25.746      1    .000
N of Valid Cases               41980
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 1329.43.
Introducing a Control Variable
To see whether both men and women experienced changes in happiness during this time
period, you can compute the chi-square statistic separately for men and for women, as
shown below:
Go to Analyze, then Descriptive Statistics, then Crosstabs, then put the
variable happy in the row box and decade in the column box, then the
variable sex into layer 1 of 1, then select under the cells tab in the crosstabs
dialog box, the boxes marked observed and expected counts and column %,
then select ok and go back and select the statistics box in order to request the
chi-square test.
Chi-Square Tests
sex RESPONDENTS SEX                         Value       df   Asymp. Sig. (2-sided)
1 Male     Pearson Chi-Square               3.677(a)    3    .298
           Likelihood Ratio                 3.668       3    .300
           Linear-by-Linear Association     .901        1    .343
           N of Valid Cases                 18442
2 Female   Pearson Chi-Square               42.987(b)   3    .000
           Likelihood Ratio                 42.712      3    .000
           Linear-by-Linear Association     35.904      1    .000
           N of Valid Cases                 23538
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 586.01.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 742.96.
You see that for men, you can't reject the null hypothesis that happiness has not changed
with time. You can reject the null hypothesis for women. From the line plot in the graph
below, you see that in the sample, happiness decreases with time for women, but not for
men. The directions for obtaining the graph are given first, followed by the graph itself.
Go to the Graphs menu, choose Line, then select the Multiple icon and
Summaries for groups of cases, and then click Define. Next move decade into
the Category Axis box and sex into the Define Lines by box in the dialog box
that appears. Select Other statistic, then move happy into the Variable list,
and then click Change Statistic. In the Statistic subdialog box, select % inside
and type 1 into both the Low and High text boxes. Click Continue, and then
click OK.
[Line chart: %in(1,1) GENERAL HAPPINESS (vertical axis, 30 to 36) by decade of survey
(1972-1979, 1980-1989, 1990-1999, 2000-2002), with separate lines for Male and Female
(RESPONDENTS SEX). Cases weighted by number of cases.]
Measuring Change: McNemar Test
The chi-square test can also be used to test hypotheses about change when the same
people or objects are observed at two different times. For example, the table below is a
crosstabulation of whether a person voted in 1996 and whether he or she voted in 2000.
(See gssdata.sav file)
vote96 DID R VOTE IN 1996 ELECTION * vote00 DID R VOTE IN 2000 ELECTION
Crosstabulation

Count
                                      vote00 DID R VOTE IN 2000 ELECTION
vote96 DID R VOTE IN 1996 ELECTION    1 VOTED    2 DID NOT VOTE    Total
1 VOTED                               1539       151               1690
2 DID NOT VOTE                        187        502               689
Total                                 1726       653               2379
An interesting question is whether people were more likely to vote in one of the years
than the other. The cases on the diagonal of the table don't provide any information
because those people behaved the same way in both elections. You have to look at the
off-diagonal cells, which correspond to people who voted in one election but not the other.
If the null hypothesis that the likelihood of voting did not change is true, a case should be
equally likely to fall into either of the 2 off-diagonal cells. The binomial distribution is used
to calculate the exact probability of observing a split between the 2 off-diagonal cells at
least as unequal as the one observed, if cases in the population are equally likely to fall into
either off-diagonal cell. This test is called the McNemar test.
Chi-Square Tests

                      Value    Exact Sig. (2-sided)
McNemar Test                   .057(a)
N of Valid Cases      2379

a. Binomial distribution used.
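To see how the binomial calculation behind this test works, here is a minimal Python
sketch that uses only the two off-diagonal counts from the crosstabulation (151 people who
voted in 1996 but not in 2000, and 187 who voted in 2000 but not in 1996). The function
name mcnemar_exact is hypothetical, and doubling the smaller tail is an assumption about
how the two-sided value is formed, so treat it as an illustration and not as SPSS's own code.

```python
import math

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two off-diagonal counts."""
    n = b + c          # people who changed between the two elections
    k = min(b, c)      # the smaller of the two off-diagonal cells
    # P(X <= k) for X ~ Binomial(n, 0.5), using exact integer arithmetic
    lower_tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1
    return min(1.0, 2.0 * lower_tail)

p = mcnemar_exact(151, 187)
print(f"Exact two-sided significance: {p:.3f}")
```

The 1539 and 502 cases on the diagonal never enter the calculation, which is why they
provide no information about change.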
McNemar's test can be calculated for a square table of any size to test whether the upper
half and the lower half of a square table are symmetric. This test is labeled McNemar Test
in the table above. For tables with more than 2 rows and columns, it is labeled the
McNemar-Bowker test. From the table above, you see that the exact significance level is
.057, so at the .05 level you can't reject the null hypothesis that people who voted in only
one of the 2 elections were equally likely to have voted in either one.
Warning: Since the same person is asked whether he or she voted in 1996 and whether
he or she voted in 2000, you can't make a table in which the rows are years and the
columns are whether he or she voted. Each case would appear twice in such a table.
How Strongly are 2 Variables Related?
If you reject the null hypothesis that 2 variables are independent, you may want to
describe the nature and strength of the relationship between the 2 variables. There are
many statistical indexes that you can use to quantify the strength of the relationship
between 2 variables in a cross-classification. No single measure adequately summarizes
all possible types of association. Measures vary in the way they define perfect and
intermediate association and in the way they are interpreted. Some measures are used
only when the categories of the variables can be ordered from lowest to highest on some
scale.
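As one illustration of such an index, the Python sketch below computes Cramer's V, a
chi-square-based measure that runs from 0 (no association) to 1. It uses the Pearson
chi-square (34.180) and N (41980) from the happiness-by-decade table earlier in this
section. The function name cramers_v is hypothetical, and the 2-by-4 table shape is an
assumption inferred from the 3 degrees of freedom and the 4 decades; it is only one of
many possible measures, not a recommended default.

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V: sqrt(chi-square / (N * (smaller table dimension - 1)))."""
    smaller_dim = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (smaller_dim - 1)))

# Assumed shape: 2 rows (very happy or not) by 4 columns (decades)
v = cramers_v(34.180, 41980, n_rows=2, n_cols=4)
print(f"Cramer's V = {v:.4f}")
```

A value this close to 0 fits the earlier remark that, with very large samples, a
statistically significant chi-square can go together with a weak relationship.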
Warning: Don't compute a large number of measures and then report the most
impressive as if it were the only one examined.
You can test the null hypothesis that a particular measure of association is 0 based on an
approximate T statistic shown in the output. If the observed significance level is small
enough, you reject the null hypothesis that the measure is 0.
TIP (A
Measures of association should be calculated with as detailed data as possible. Don't
combine categories with small numbers of cases, as was suggested above for the
chi-square test of independence.
FINAL NOTE: IF YOU WISH TO DO MEASURES OF ASSOCIATION TO
DETERMINE HOW STRONGLY 2 VARIABLES ARE RELATED, THEN PLEASE
SEE ME AND ASK FOR ASSISTANCE IF YOU HAVE ANY DIFFICULTIES ON
YOUR OWN.
