
Practical Statistical Training

Using SPSS Software

Trainer – Hailegebriel Yirdaw


(PhD Candidate at AAU and University of Gothenburg)
E-mail: hailaenani@gmail.com

Deserves your attention!

Trainer: Hailegebriel Yirdaw 2


Quantitative Data Analysis

Quantifying Data

• Before we can do any kind of analysis, we need to quantify our data
• “Quantification” is the process of converting data to a
numeric format
• Convert social science data into a “machine-readable”
form, a form that can be read & manipulated by computer
programs
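
A minimal sketch of quantification, assuming a hypothetical list of text answers (the categories and the resulting codes are invented for illustration and are not from the training data):

```python
# Hypothetical survey answers recorded as text (assumption for illustration).
responses = ["female", "male", "female", "male", "female"]

# Build a codebook that maps each category to a numeric code.
# Sorting the categories makes the codes reproducible.
codebook = {label: code for code, label in enumerate(sorted(set(responses)), start=1)}

# Replace each text answer with its numeric code.
coded = [codebook[answer] for answer in responses]

print(codebook)
print(coded)
```

The codebook plays the same role as value labels in SPSS: the data are stored as numbers, and the labels record what each number means.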



• You should choose a level of analysis that is appropriate for
your research question
• You should choose the type of statistical analysis
appropriate for the variables you have
• Nominal/Categorical, Ordinal, or Continuous



• Descriptive statistics are used to summarize the basic features of a data set through
• measures of central tendency (mean, mode, and
median)
• dispersion (range, quartiles, variance, and standard
deviation)
• distribution (skewness and kurtosis)



• Inferential statistics allow researchers to assess their
ability to draw conclusions that extend beyond the
immediate data, e.g.
• if a sample represents the population
• if there are differences between two or more groups
• if there are changes over time
• if there is a relationship between two or more
variables



• Selecting the right statistical test relies on
• knowing the nature of your variables
• their scale of measurement
• their distribution shape
• types of question you want to ask



• Univariate - the simplest form; describes a case in terms of
a single variable.
- Frequency distributions
- Measures of central tendency
• Bivariate - subgroup comparisons; describes a case in
terms of two variables simultaneously.
• Multivariate - analysis of more than two variables
simultaneously.



• A frequency distribution helps to impose order on the
data
• A frequency distribution is a systematic arrangement of
data values, with a count of how many times each value
occurred in a dataset
• Used as a first step in understanding your data!
• Through inspection of frequency distributions, you can
begin to assess how “clean” the data are
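
A frequency distribution can be sketched in a few lines. The family sizes below are hypothetical (not taken from the “Expenditure data”), and the value 99 is included on purpose to show how an out-of-range value stands out when checking how clean the data are:

```python
from collections import Counter

# Hypothetical family sizes; 99 looks like a data-entry error or a missing-value code.
family_size = [3, 5, 4, 3, 6, 4, 3, 99, 5, 4]

# Count how many times each value occurs.
freq = Counter(family_size)
n = len(family_size)

# Print value, count and percentage, ordered by value.
for value in sorted(freq):
    count = freq[value]
    print(f"{value:>5} {count:>5} {100 * count / n:>6.1f}%")
```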



To draw a frequency table:
Analyze → Descriptive Statistics → Frequencies → the
Frequencies dialog box will appear → move the variable
rating into the Variable(s) box → OK.
Note: other frequency options can also
be used.
◦ Statistics
◦ Charts (histogram with normal curve)
◦ Format
Activity 1. (10 minutes)

a) Based on the “Expenditure data”, use a frequency table to describe the family size
of the respondents.

b) Do a frequency analysis on the variable “minority”.



Cross Tabulations
Used to examine the relationship
between two categorical variables.
To perform a cross-tabulation:
Analyze → Descriptive Statistics → Crosstabs →
dialog box will appear → move the variables
you want into the Row(s) and Column(s) boxes
→ OK.



Note:
◦ The Cells subcommand has options to include
the percentages for the row, column and whole
table, as well as the expected and residual values
for each cell.
Cells → select the ones you want to display
→ Continue → OK
An association between two variables exists when
the actual (observed) counts differ markedly from
the counts expected by chance.
◦ For example, if gender and employment
classification were unrelated, then it is expected
that 38.3 women would be in the manager
classification as opposed to the observed number,
10.
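
The expected count for a cell is (row total × column total) / grand total. The sketch below uses illustrative counts, chosen only to be consistent with the 38.3 and 10 quoted above; they are an assumption and should not be read as the actual training file:

```python
# Illustrative gender-by-job-category counts (assumption, not the training data).
observed = {
    "Female": {"Clerical": 206, "Custodial": 0, "Manager": 10},
    "Male": {"Clerical": 157, "Custodial": 27, "Manager": 74},
}

# Row totals, column totals and grand total.
row_totals = {row: sum(cols.values()) for row, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for col, count in cols.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Expected count under no association: row total * column total / grand total.
expected = {
    row: {col: row_totals[row] * col_totals[col] / grand_total for col in col_totals}
    for row in observed
}

# Residual: observed minus expected.
residual = observed["Female"]["Manager"] - expected["Female"]["Manager"]

print(round(expected["Female"]["Manager"], 1))
print(round(residual, 1))
```

A large negative residual, as here, means far fewer cases were observed in that cell than would be expected if the two variables were unrelated.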
It is also possible to display various statistics
for the cross-tabulation, including the chi-square
statistic and its significance level.
Statistics → Chi-square → Continue → OK
The chi-square significance level shows whether
you can reject the null hypothesis that there is no
association between the two categorical variables.

Chi-squared Test


Null: There is NO association between
class and survival
Alternative: There IS an association
between class and survival



Chi-squared test statistic
The chi-squared test is used when we want to
see if two categorical variables are related.
The test statistic for the chi-squared test is the sum,
over all n cells of the table, of the squared differences
between each pair of observed (O) and expected (E)
values, each divided by the expected value:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
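
A minimal sketch of the statistic, assuming a small hypothetical 2x2 table; the function name `chi_squared` is invented for illustration. It returns the statistic and the degrees of freedom, (rows − 1) × (columns − 1), but not the p-value, which SPSS reports for you:

```python
def chi_squared(observed):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two variables were unrelated.
            e = row_totals[i] * col_totals[j] / total
            statistic += (o - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts (assumption for illustration).
table = [[10, 20], [20, 10]]
statistic, df = chi_squared(table)
print(round(statistic, 3), df)
```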



Activity 2. (10 minutes)

Does Ethnic Minority Status affect job type?
(use Employment data)



Low Cell Counts with the Chi-squared test

Check the number of cells with EXPECTED counts less than 5.
SPSS reports the % of cells with an expected count < 5.
If more than 20% of cells have an expected count < 5, the test
statistic does not approximate a chi-squared distribution very well.
If any expected cell count is < 1, the chi-squared distribution
cannot be used.
In either case, for a 2x2 table use Fisher's exact test
(SPSS reports this for 2x2 tables).
In larger tables (3x2 etc.), combine categories to make the cell
counts larger (provided this is meaningful).
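
The two checks above can be sketched as follows. The table is hypothetical and the function name `expected_count_check` is invented for illustration; the thresholds (20% of cells below 5, any cell below 1) are the ones stated in this slide:

```python
def expected_count_check(observed):
    """Return (% of cells with expected count < 5, smallest expected count)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    expected = [r * c / total for r in row_totals for c in col_totals]
    pct_below_5 = 100 * sum(e < 5 for e in expected) / len(expected)
    return pct_below_5, min(expected)


# Hypothetical 2x2 table with small counts (assumption for illustration).
table = [[3, 7], [2, 18]]
pct_below_5, smallest = expected_count_check(table)

if smallest < 1 or pct_below_5 > 20:
    advice = "use Fisher's exact test (2x2) or combine categories (larger tables)"
else:
    advice = "the chi-squared approximation is reasonable"

print(pct_below_5, round(smallest, 2), advice)
```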



Using SPSS
Test statistic = 127.859
p-value: p < 0.001

Note: Double-clicking on the output will display the p-value to
more decimal places.
Exercise
Use employee.sav
Is there an association between minority
classification and employment category?



Graphs
Graphs are effective visual tools because
they present information quickly and easily.
Often data are better understood when
presented by a graph than by a table
because the graph can easily reveal a trend
(rise or decline of a variable over time) and
is a simpler visual aid for comparison
purposes.



The Bar Graph (Chart)
Bar graphs usually present:
◦ categorical (qualitative) variables
◦ numeric (discrete) variables grouped in class
intervals
Bar graphs:
◦ can show either % or count
◦ are not very good for showing trends in more
than one category
Graphs → Legacy Dialogs → Bar → choose
the type of bar chart → Define → dialog box
will appear → specify what the bars represent
→ OK



The other two types of bar graphs are
used in situations where you want to
graph frequencies for more than one
variable.
◦ Clustered - graphing gender (Clustered by) by
job category (Category axis)
◦ Stacked - graphing gender (Category axis) by
job category (Stacked by)
Note: Other summary statistics can also be
used
◦ For instance, summarize the mean salary for
each employment category.
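
The heights of the bars in such a chart are simply group means. A minimal sketch, assuming hypothetical (job category, salary) records that are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (job category, salary) records (assumption for illustration).
records = [
    ("Clerical", 27000), ("Clerical", 29000),
    ("Custodial", 30000), ("Custodial", 32000),
    ("Manager", 60000), ("Manager", 70000),
]

# Collect the salaries belonging to each category.
salaries_by_category = defaultdict(list)
for category, salary in records:
    salaries_by_category[category].append(salary)

# One mean per category: these are the bar heights.
mean_salary = {category: mean(values) for category, values in salaries_by_category.items()}
print(mean_salary)
```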



Activity 2. (10 minutes)

a) Use “Expenditure data” to show the mean expenditure by gender category
using a bar chart.
b) Use “GDP data” to show the GDP
trend using a line graph.
c) Use “Expenditure data” to show the
mean expenditure by age category and
gender.



The Pie Chart
A pie chart is a chart that is used to summarize a set
of categorical data or to display the different values of
a given variable by means of percentage distribution.
◦ If all the categories sum to a meaningful total, then you can
use a pie chart
◦ Pie charts emphasise the differences in proportions
between categories
◦ OK for a single snapshot, but not very good for showing
trends: you would need a separate pie chart for each year

Graphs → Legacy Dialogs → Pie → Define → dialog box will
appear → specify Define Slices by → OK
Possible to restructure it



Activity 3. (5 minutes)

Based on “Expenditure data”, produce a pie chart to indicate the composition of
the education status of the respondents.



The Histogram
This is the most common form of
graphical presentation of a grouped
frequency distribution.
It is used to summarize variables whose
values are numerical and measured on an
interval scale.
It divides up the range of possible values
in a data set into classes or groups.
Graphs → Legacy Dialogs → Histogram → dialog
box will appear → specify the variable → OK
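
Grouping values into classes, which is what a histogram does before drawing the bars, can be sketched as below. The expenditure values and the class width of 10 are assumptions for illustration:

```python
from collections import Counter

# Hypothetical expenditure values and class width (assumptions for illustration).
expenditure = [12, 25, 37, 41, 18, 29, 33, 45, 22, 38]
class_width = 10

# Map each value to the lower limit of its class, then count values per class.
class_counts = Counter((value // class_width) * class_width for value in expenditure)

# Print a simple text histogram, one row per class.
for lower in sorted(class_counts):
    upper = lower + class_width - 1
    print(f"{lower}-{upper}: {'*' * class_counts[lower]}")
```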
Activity 4. (6 minutes)

Based on “Expenditure data”, present the distribution of expenditure per household
and comment on it.



Scatter Plots
Used to visually examine the
relationship between two variables.
Graphs → Legacy Dialogs → Scatter/Dot →
Simple → Define → select variables → OK
To add a linear fit line to the scatter plot:
Double-click on the plot → Elements → Fit Line
at Total → select the fit method → Close.
Note that there are four scatter plot
options.
◦ The Simple option graphs the relationship
between two variables.



◦ The Matrix option is for two or more
variables that you want graphed in every
combination. (salary, salbegin, and jobtime)
◦ The Overlay option allows you to plot two
scatterplots on top of each other. (educ x
salbegin and educ x salary)
◦ The 3-D scatterplot is used to plot three
variables in three dimensional space. (educ,
salary, and salbegin)
◦ The Simple Dot option is like a bar chart, but with
the counts represented by dots.
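
For a simple scatter plot, a linear fit line can be understood as an ordinary least-squares line. The sketch below assumes that interpretation and uses hypothetical income and expenditure values; the function name `least_squares_line` is invented for illustration:

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data (assumption for illustration).
income = [1, 2, 3, 4, 5]
expenditure = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(income, expenditure)
print(round(slope, 2), round(intercept, 2))
```

A positive slope indicates that the second variable tends to rise with the first; a negative slope indicates the opposite.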



Activity 5. (6 minutes)

Based on “Expenditure data”, use a scatter plot to show the kind of relationship
between expenditure and income.



Central Tendencies and Dispersion
Graphic descriptions of statistical data are
not sufficient for statistical inference.
The two most common numerical
measures are measures of central
tendency and measures of variability.



Measures of Central Tendency
Represent a data set by means of a single
number which is descriptive of the entire
set.
These include the mean, median and
mode.
• Mean - the arithmetic average



How can exam score data be summarised?

Exam marks for 60 students (marked out of 65):
mean = 30.3, sd = 14.46

• Median - the value of the middle
item of an ordered series.
Example - 60, 74, 80, 90, 95, 100: with an even number of
values, the median is the average of the two middle values,
(80 + 90) / 2 = 85.
• Mode - the value which occurs with the
greatest frequency.
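
These three measures can be checked with Python's standard `statistics` module. The first list is the example series above; the second list is a hypothetical series with a repeated value, added only so that the mode is well defined:

```python
from statistics import mean, median, mode

# The example series from this slide.
scores = [60, 74, 80, 90, 95, 100]

# Hypothetical series with a repeated value (assumption for illustration).
family_size = [2, 3, 3, 5, 7]

average_score = mean(scores)
middle_score = median(scores)      # average of 80 and 90 for six values
most_common_size = mode(family_size)

print(round(average_score, 1), middle_score, most_common_size)
```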



Measures of Variation (Dispersion)
Provide ways of measuring the extent to
which the data are dispersed or spread
out.
• Range - the difference between the largest and
smallest values in a series.
• Mean deviation - the average of the absolute
differences between the values and
some average of the series.
◦ Can take different forms, depending on the average used (mean, median, or mode)



◦ Coefficient of mean deviation - the mean
deviation divided by the average used to
compute it.
• Standard deviation - the square root of
the average of the squared deviations from the mean.
◦ Coefficient of standard deviation - the standard
deviation divided by the arithmetic
mean of the series.
◦ Coefficient of variation - the coefficient
of standard deviation multiplied by 100.
◦ Variance - the square of the standard deviation.
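
A minimal sketch of these measures on a small hypothetical series. It follows the definitions given here, which divide by n (population formulas, `pstdev` and `pvariance`); SPSS reports the sample versions, which divide by n − 1, so its values will differ slightly:

```python
from statistics import mean, pstdev, pvariance

# Hypothetical series (assumption for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(data)
value_range = max(data) - min(data)

# Mean deviation taken about the arithmetic mean.
mean_deviation = sum(abs(x - avg) for x in data) / len(data)
coef_mean_deviation = mean_deviation / avg

# Population standard deviation and the coefficients derived from it.
sd = pstdev(data)
coef_sd = sd / avg
coef_variation = coef_sd * 100
variance = pvariance(data)

print(value_range, mean_deviation, sd, coef_variation, variance)
```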
Exercise: Ticket cost comparison
Summary statistics for cost of ticket by survival

Cost of ticket        Died    Survived
Mean                  23.4    49.4
Median                10.5    26
Standard deviation    34.2    68.7
Interquartile range   18.2    46.6
Minimum               0       0
Maximum               263     512.33

a) Is there a big difference in average ticket price by group?
b) Which group has data which is more spread out?
c) Is the data skewed?
d) Is the mean or median a better summary measure?



Exercise: Ticket cost comparison Solution
a) Is there a big difference in average ticket price
by group?
Yes. The mean and median are much bigger for those who
survived.
b) Which group has data which is more spread out?
The standard deviation and interquartile range are
much bigger for those who survived, so their data are
more spread out.
c) Is the data skewed?
Yes. The medians are much smaller than the means
and the plots show the data is positively skewed.
d) Is the mean or median a better summary
measure?
The median, as the data is skewed.
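
One quick numerical check of answer (c) is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. This coefficient is not part of the exercise; it is brought in here only as an illustration, using the values from the summary table:

```python
# Values taken from the summary table for ticket cost.
summary = {
    "Died": {"mean": 23.4, "median": 10.5, "sd": 34.2},
    "Survived": {"mean": 49.4, "median": 26, "sd": 68.7},
}

# Pearson's second skewness coefficient: positive when the mean exceeds the median.
pearson_skew = {
    group: 3 * (stats["mean"] - stats["median"]) / stats["sd"]
    for group, stats in summary.items()
}

for group, value in pearson_skew.items():
    print(group, round(value, 2))
```

Both coefficients come out positive, which agrees with the conclusion that the data are positively skewed.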
Measures of Asymmetry
One of the main features of a distribution
is the extent to which it is symmetric.
• Skewness - a measure of the degree of
asymmetry, or departure from symmetry,
of a distribution.



• Kurtosis - a measure of the peakedness of a distribution.
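
Skewness and kurtosis can be sketched from the central moments of the data. The version below uses the simple moment formulas (m3 / m2^1.5 and m4 / m2^2 − 3); SPSS applies a small-sample correction, so its output will not match these values exactly. The data and the function name `skewness_kurtosis` are assumptions for illustration:

```python
def skewness_kurtosis(data):
    """Return (skewness, excess kurtosis) using simple moment formulas."""
    n = len(data)
    avg = sum(data) / n
    # Second, third and fourth central moments.
    m2 = sum((x - avg) ** 2 for x in data) / n
    m3 = sum((x - avg) ** 3 for x in data) / n
    m4 = sum((x - avg) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical series: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

sym_skew, sym_kurt = skewness_kurtosis(symmetric)
right_skew, right_kurt = skewness_kurtosis(right_skewed)
print(round(sym_skew, 2), round(sym_kurt, 2), round(right_skew, 2))
```

Skewness near 0 suggests a symmetric distribution, a positive value a long right tail, and a negative value a long left tail.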



Application

Several summary or descriptive statistics
are available under the Analyze → Descriptive
Statistics menu.
Analyze → Descriptive Statistics →
Descriptives → dialog box will appear → select
variable(s) → OK
Several other statistics can also be obtained:
Options → dialog box → select
summary statistics → Continue



Activity 6. (10 minutes)

Based on Expenditure data, generate the numerical descriptive analysis, including
kurtosis and skewness, for variables such
as fsize, exp and inc, and interpret your
results.



Assessing Normality
Charts can be used to informally assess whether data are
normally distributed or skewed.

The mean and median are very different for skewed data.
Choosing summary statistics

Which average and measure of spread?

Type of data                   Average   Measure of spread
Scale, normally distributed    Mean      Standard deviation
Scale, skewed                  Median    Interquartile range
Categorical, ordinal           Median    Interquartile range
Categorical, nominal           Mode      None
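
The decision rule in this table can be written as a small lookup. The function name `choose_summary` and its argument names are invented for illustration:

```python
def choose_summary(measurement, skewed=False):
    """Return (average, measure of spread) for a variable, following the table above."""
    if measurement == "nominal":
        return ("mode", None)
    if measurement == "ordinal":
        return ("median", "interquartile range")
    if measurement == "scale":
        if skewed:
            return ("median", "interquartile range")
        return ("mean", "standard deviation")
    raise ValueError(f"unknown level of measurement: {measurement}")


print(choose_summary("scale"))
print(choose_summary("scale", skewed=True))
print(choose_summary("nominal"))
```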
