
CB3021 11-1

CB3021 WEEK 11:

SPSS WORKSHOP
(CONTINUED)

Outline
• ANOVA
• Chi-Square Analysis
• Correlation Test
• Linear Regression Model
• Logistic Regression Model
Analysis of Variance (ANOVA)
• ANOVA is a method that tests whether the mean values of more than two groups are equal
• It can involve more than one factor (e.g. gender) and compare the mean values among the factor levels
• It can determine if there is an interaction effect between factors
• Reference:
• https://en.wikipedia.org/wiki/Analysis_of_variance

• Assumptions:
• Independence of observations
• Normality
• Equal variances: the variance of the data in each group is the same
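The test the slides run in SPSS can also be sketched in code. A minimal one-factor analogue using `scipy.stats.f_oneway`; the three groups and their scores are made up for illustration, not taken from the Deli dataset:

```python
# Illustrative one-way ANOVA: are the mean scores of three groups equal?
from scipy import stats

group_a = [6, 7, 5, 6, 7]   # e.g. short drive distance
group_b = [5, 4, 5, 6, 4]   # e.g. medium drive distance
group_c = [3, 4, 2, 3, 4]   # e.g. long drive distance

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# If p < 0.05, reject H0 that all group means are equal
if p_value < 0.05:
    print("At least one group mean differs")
```

This is the one-way case only; the two-way design with an interaction term, as used in the slides, additionally partitions variance by a second factor and by the factor combination.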
ANOVA
• Dataset: Deli (Week 10)
• Explore whether the mean scores for Recommendation Likelihood (X8) are the same for customers who have traveled different distances (X11)
• Explore whether female and male (X7) customers’ willingness to recommend the restaurant (X8) is the same
• Any interaction effect between gender and travel distance?
• Main effects for gender and travel distance
Two-way ANOVA
• X7, Gender
• X8, Recommendation likelihood
• X11, Drive Distance
Two-way ANOVA: SPSS
• Analyze => General Linear Model=>Univariate
Two-way ANOVA: SPSS
• Select Recommendation Likelihood as Dependent Variable;
Select Gender and Driven Distance as Fixed factors
Two-way ANOVA: SPSS output
Check the table of Between-Subjects Factors
Two-way ANOVA: SPSS output
• First row, Corrected Model
• Null Hypothesis: No variable has significant impact on the
output variable, Recommendation likelihood X8.
• P value <0.05, we reject H0
• conclude there exists at least one variable that has significant impact
on the output variable.
• Row for X7, p value is 0.848 > 0.05, fail to reject H0
• Gender has no significant impact on recommendation likelihood.
• Row for X11, p value is 0.000 < 0.05,
• Conclude that Driven Distance has significant impact on
recommendation likelihood.
2-way ANOVA: Try another example
• Dataset: Edu_Experience_Salary (Week 11)

• Explore whether mean salary is the same for participants with different education backgrounds (bachelor’s degree vs master’s degree)
• Explore whether mean salary is the same for participants with different work experience (1 year vs 5 years)
2-way ANOVA: SPSS
• Select Analyze => General Linear Model =>Univariate
• Put Salary into Dependent Variable, and Edu and
WorkExperience into Fixed Factors
2-way ANOVA: SPSS
• Select Homogeneity test in the Option box
2-way ANOVA: SPSS output
• P value is greater than 0.05
• Levene’s test does not reject the assumption of equal variances
• This assumption is needed for the ANOVA analysis
2-way ANOVA: SPSS output
• In “Corrected Model” row, H0: No variable has significant
impact on the output variable, Salary
• p value <0.05, reject the H0 and conclude that there exists at
least one variable that has significant impact on the output
variable.
2-way ANOVA: SPSS output
• Significant interaction effect
• Conclude that average salaries are NOT the same for different education levels
• Conclude that average salaries are NOT the same for people with different work experience
2-way ANOVA: Create mean plot
• Click Profile Plots, move Edu to the Horizontal Axis box
and WorkExperience to the Separate Lines box.
• Click Add
Chi-Square Analysis
• Association analysis: determine the relationship between two nominal scale variables
• Dataset: Deli (Week 10)
• To test whether the same proportion of male and female (X7) customers make up each of the response categories for Drive Distance (X11)
• X7, Gender
• X11, Drive Distance
Chi-Square Analysis: SPSS
• Select ANALYZE => DESCRIPTIVE STATISTICS =>
CROSSTABS
Chi-Square Analysis: SPSS
• Select X11 as Row and X7 as Column
Chi-Square Analysis: SPSS
• Click Statistics and then click Chi-square
• Click Cells and select Observed and Expected under the
box “Counts”
Chi-Square Analysis: SPSS output
• Check the table of Chi-Square Tests
• Test hypotheses:
• H0: No association between Gender and Drive Distance
• H1: There is a significant association between the two variables
• P value = 0.035 < 0.05, reject H0 and conclude that Gender and Drive Distance are associated
Actual count vs Expected Count
• Actual count, there are 9 female and 13 male respondents.
• Expected Count shows the number of the expected cases, if
the two variables, Gender and Driven Distance, are
independent of each other.
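The expected counts SPSS shows in the Crosstabs output can be reproduced with `scipy.stats.chi2_contingency`, which computes them under the independence assumption. A sketch on a hypothetical 3×2 Drive Distance by Gender table; the counts below are illustrative, not the Deli dataset’s actual values:

```python
# Chi-square test of independence on an illustrative contingency table
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[9, 13],    # short distance: female, male
                     [15, 10],   # medium distance
                     [16, 7]])   # long distance

chi2, p, dof, expected = chi2_contingency(observed)
print("expected counts:\n", np.round(expected, 1))
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
```

Each expected count is (row total × column total) / grand total; comparing it cell by cell with the observed count gives the over/under-representation reading used on this slide.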
Actual count vs Expected Count
• For short-distance and female, the actual count is 9 but the
expected count is 13.2, the actual count is less than the
expected count.
• However, for the male group, the actual count is more than
the expected count. We can interpret that when the drive
distance is short, males are more likely to visit the store
than females.
• The opposite occurs for long drive distances: when the drive distance is long, women are more likely to visit the store.
Chi-Square Analysis
Strength of the association
• Test the strength of the association between two nominal
variables
• Click Statistics and select Phi and Cramer’s V
Strength of the association
• Phi coefficient is for 2 by 2 table
• Cramer’s V is for table of any size
• p value=0.035 <0.05, Reject H0
• Conclude that there is significant association between the
two variables at the 0.05 level of significance.
• The value of Cramer’s V is 0.366, so the strength of the association is low-to-moderate
• < 0.3: low association
• 0.3–0.6: low-to-moderate association
• > 0.6: strong association
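Cramer’s V is derived directly from the chi-square statistic: V = sqrt(chi2 / (n · (k − 1))), where n is the sample size and k is the smaller of the number of rows and columns. A sketch of the computation on an illustrative table (not the Deli data):

```python
# Cramer's V: effect-size measure for a chi-square test of independence
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[9, 13], [15, 10], [16, 7]])  # illustrative counts
chi2, p, dof, expected = chi2_contingency(observed)

n = observed.sum()                 # total sample size
k = min(observed.shape)            # smaller dimension of the table
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramer's V = {cramers_v:.3f}")
```

For a 2×2 table, k − 1 = 1 and the formula reduces to the phi coefficient, which is why the slides note phi is the 2×2 special case.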
Chi-Square Analysis: Practice
• Open the dataset you have created for: “Financial (Practice
Week 10)”
• Requirement: Compare “Partner” and “MAO” to study whether these two variables are associated.
• Specify the hypothesis, p-value and conclusion.
Pearson correlation coefficient
• To examine the relationship between two interval scale
variables
• Example: Open the Dataset: Deli (Week 10)
• Examine the relationship between “Satisfaction Level” (X9) with the restaurant and “Likelihood to recommend the restaurant to friends” (X8)
Pearson correlation coefficient
• X8 Recommendation likelihood
• X9 Satisfaction level
• Select ANALYZE => CORRELATE => BIVARIATE
Pearson correlation coefficient
• Select X8 and X9 as Variables
Pearson correlation coefficient
• Pearson correlation coefficient is 0.601.
• +1 means a perfect positive correlation, and -1 is a
perfect negative correlation. 0 means there is no
linear correlation at all.
• 0.601 indicates a strong positive correlation: the higher the satisfaction level, the more likely participants are to recommend the restaurant to other people.
Pearson correlation coefficient
• We can also test whether the correlation coefficient equals zero.
• With a 2-tailed significance value – in this case p = .000.
• P value < 0.05, which means that the correlation is significant
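Both the coefficient and this significance test come from one call to `scipy.stats.pearsonr`. A sketch on made-up satisfaction and recommendation scores (hypothetical values, not the Deli data):

```python
# Pearson correlation between two interval-scale variables,
# with a two-tailed test of H0: the true correlation is zero
from scipy.stats import pearsonr

satisfaction = [3, 4, 5, 6, 6, 7, 8, 9]   # illustrative X9-style scores
recommend    = [2, 4, 4, 5, 7, 6, 8, 9]   # illustrative X8-style scores

r, p = pearsonr(satisfaction, recommend)
print(f"r = {r:.3f}, p = {p:.4f}")
```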
Correlation coefficient: Example
• Dataset: GDP and patents (Week 11)
• Provides the GDP, number of patent applications, and number of colleges for different districts
• Explore: the correlation between GDP and Patents

Pearson correlation coefficient
• Select ANALYZE => CORRELATE => BIVARIATE
• Move GDP and Patents to the Variables Box, select Pearson
• Click Option and select means and standard deviation
Pearson correlation coefficient
• A correlation of 0.704 indicates a strong positive correlation
• The higher the GDP, the more patent applications in the district
• P value is 0.000 < 0.05
• The correlation differs significantly from zero
Partial correlation coefficients
• The number of colleges in a district may affect the previous correlation
• We may want to control for the number of colleges and check the correlation again
• The Partial Correlations procedure computes partial correlation coefficients that describe the linear relationship between two variables while controlling for the effects of one or more additional variables.
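For a single control variable, the partial correlation has a closed form built from the three zero-order correlations: r_xy.z = (r_xy − r_xz · r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). A sketch; the three input correlations below are illustrative, not the dataset’s actual values:

```python
# First-order partial correlation from zero-order correlations
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# e.g. x = GDP, y = Patents, z = Colleges (hypothetical correlations)
r_partial = partial_corr(0.704, 0.5, 0.4)
print(f"partial r = {r_partial:.3f}")
```

If z is uncorrelated with both x and y, the partial correlation equals the zero-order one; otherwise it shows how much of the original association survives the control.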
Partial correlation coefficients
Analyze => Correlate => Partial
Move GDP and Patents to the Variable box, and College to
the Controlling for box.
Partial correlation coefficients
• Click option and select…
Partial correlation coefficients
• 1st part, the Pearson correlation coefficients for all your variables
– dependent variable, independent variable, and control variables
• 2nd part shows the Pearson correlation coefficient between the
dependent and independent variable, taking into account the
control variable(s)
Partial correlation coefficients
• P value < 0.05: there is a strong, positive partial correlation between the dependent variable, “GDP”, and “Patents”, whilst controlling for “College”
Bivariate (Linear) Regression Analysis
• Study the relationship between two interval scale variables
• Dataset: Deli (Week 10)
• Analyze the relationship between customer satisfaction
level (X9) and the employees’ competency (X3)
• Select ANALYZE => REGRESSION => LINEAR
Regression Analysis: SPSS
• Select X9 as Dependent Variable and X3 as Independent
Variable
• Choose Enter as the method
• Click “Statistics” and select “Estimates” under “Regression Coefficients”
Regression Analysis: SPSS output
• R square is the proportion of variability in the outcome
variable that was explained by the predictor variable
• 22.1% of the variance in Satisfaction level, X9, was explained by Employees Competency, X3
Regression Analysis: SPSS output
• Do the independent variables reliably
predict the dependent variable?
• ANOVA table tells us that the predictor
in our model can reliably predict the
output variable.
Regression Analysis: SPSS output
• There is an overall test assessing whether the group of independent variables reliably predicts the dependent variable, but it does not address the ability of any particular independent variable to predict the dependent variable.
• If p value < 0.05 (significance level), we conclude that at least one independent variable is related to the dependent variable
Regression Analysis: output
• Unstandardized Coefficients, Beta = 0.315
• For every one-unit increase in the predictor variable, the outcome variable will increase by the unstandardized beta coefficient value
• For X3, Competent Employees: the outcome variable, Satisfaction level (X9), will increase by 0.315 points per unit increase of X3
• Satisfaction level = 2.25 + 0.315 * Employees’ Competency
• P value of X3 is 0.01, showing that X3 has a statistically significant impact on the outcome variable, Satisfaction level.
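The slope, intercept, R-square, and p value of a bivariate regression can all be obtained with `scipy.stats.linregress`. A sketch on hypothetical competency/satisfaction scores (not the Deli data):

```python
# Simple linear regression: satisfaction predicted from competency
from scipy.stats import linregress

competency   = [2, 3, 4, 5, 6, 7, 8]   # illustrative X3-style scores
satisfaction = [3, 3, 4, 4, 5, 4, 5]   # illustrative X9-style scores

res = linregress(competency, satisfaction)
print(f"slope = {res.slope:.3f}, intercept = {res.intercept:.3f}")
print(f"R^2 = {res.rvalue**2:.3f}, p = {res.pvalue:.4f}")

# Predicted satisfaction = intercept + slope * competency
```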
Regression: Another example

• Dataset: Duck (Week 11)
• It records the weight of the ducklings and their weight after 50 days
• Explore: can we use the weight of the ducklings to predict their weight after 50 days?
Regression: Scatterplot
• Create the scatterplot for two variables.
• SPSS Graphs=> Legacy Dialogs=>Scatter/Dot
• Choose Simple Scatter
Regression: Scatterplot
• Put the dependent variable on the Y-Axis, and the
independent variable on the X-Axis

Is it reasonable to fit a line indicating the relationship?
Linear Regression: SPSS
• Analyze => Regression => Linear
• Put the Weight of ducks after 50 days into the dependent
variable box
• And Weight of ducklings into the independent variable box
Linear Regression Analysis: output
• 95.5% of the variance in the Weight of Duck after 50 days
was explained by the Weight of Ducklings
• p value =0.000 <0.05, we conclude that the independent
variables reliably predict the dependent variable
Linear Regression Analysis: output
• The p value of the predictor variable Weight of Duckling is < 0.05, showing that Weight of Duckling has a statistically significant impact on the outcome variable, Weight of Duck after 50 Days.
Linear Regression Analysis: output
• Unstandardized Coefficients
• The point at which the regression line crosses the Y-Axis
is 584.524.
• It is only meaningful if the data covers actual observations at X = 0.
Linear Regression Analysis: output
• Beta for Weight of Duckling is 21.664
• The slope of the line is equal to 21.664.
• For every one unit increase in the predictor variable, the
outcome variable will increase by the unstandardized beta
coefficient value.
• Here, for every one unit increase in the Weight of Duckling,
Weight of Duck after 50 Days, will increase by 21.664 points
Linear Regression Analysis: output
• The fitted regression line is
• Weight of Duck after 50 Days = 584.524 + 21.664 *
Weight of Ducklings
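Using the fitted line for prediction is direct substitution. A small sketch with the intercept and slope quoted on the slides:

```python
# Prediction from the fitted line on the slides:
# Weight after 50 days = 584.524 + 21.664 * duckling weight
def predict_weight_after_50_days(duckling_weight):
    return 584.524 + 21.664 * duckling_weight

print(predict_weight_after_50_days(100))  # prediction for a 100-unit duckling
```

As the previous slide notes, the intercept (the prediction at a duckling weight of 0) is only meaningful if the data actually covers that region.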
Multiple Regression
• Dataset: Deli (Week 10)
• To analyze if Usage level (X10) and Drive Distance (X11)
in the restaurant can predict the satisfaction level (X9)
• Select ANALYZE => REGRESSION => LINEAR
• Select X9 as Dependent Variable and X10 and X11 as Independent
Variable
Multiple Regression: SPSS
• Select Collinearity diagnostics and Durbin-Watson
Multiple Regression: SPSS
• R square is the proportion of variability in the
outcome variable that was explained by the
predictor variables
• 82.3% of the variance in the Satisfaction Level
X9 was explained by the two predictor variables
Regression: Durbin-Watson Statistic
• The Durbin-Watson statistic ranges between 0 and 4.
• A value of DW = 2 indicates that there is no autocorrelation.
• DW < 2 indicates a positive autocorrelation
• DW > 2 indicates a negative serial correlation
• DW = 1.852, within the range of 1.5–2.5, so we are good.
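The statistic itself is simple to compute from the regression residuals: the sum of squared successive differences divided by the sum of squared residuals. A sketch on made-up residuals:

```python
# Durbin-Watson statistic from a residual series
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squares; range 0-4."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Illustrative residuals: the alternating signs produce DW > 2,
# i.e. a hint of negative serial correlation
dw = durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3])
print(f"DW = {dw:.3f}")
```

Residuals that track each other (long runs of the same sign) pull DW below 2; residuals that flip sign at every step push it above 2.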
Multiple Regression: SPSS
• Do the independent variables predict the dependent variable?
• p value < 0.05, we conclude that at least one of the independent variables is associated with the dependent variable
Multiple Regression: SPSS
• The p values of the predictor variables Usage level and Drive Distance are < 0.05, showing that the two variables have a statistically significant impact on the outcome variable, Satisfaction level
Multiple Regression: SPSS
• Beta for Usage level is 0.761; Beta for Drive Distance is 0.727
• Satisfaction level will increase by 0.761 points per unit increase of Usage level, provided that Drive Distance is kept constant
• Satisfaction level will increase by 0.727 points per unit increase of Drive Distance while Usage level is held constant
Multiple Regression: SPSS
• The fitted regression line is
• Satisfaction level = 3.073 + 0.761 * Usage level + 0.727 * Drive Distance
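As in the bivariate case, prediction is substitution into the fitted equation, one coefficient per predictor, each interpreted with the other held constant. A sketch with the coefficients from the slides:

```python
# Prediction from the fitted multiple-regression equation on the slides
intercept, b_usage, b_distance = 3.073, 0.761, 0.727

def predict_satisfaction(usage, distance):
    """Satisfaction = intercept + b_usage*usage + b_distance*distance."""
    return intercept + b_usage * usage + b_distance * distance

print(predict_satisfaction(2, 1))  # usage level 2, drive distance 1
```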
Logistic Regression: SPSS
• A binomial logistic regression predicts the probability that
an observation falls into one of two categories of a
dichotomous dependent variable
• based on one or more independent variables that can be either
continuous or categorical.

• Dataset: Vote (Week 11)
• 2,440 participants’ age, education background, gender, and their decision of who they would vote for, Clinton or Trump

Ref: https://en.wikipedia.org/wiki/Logistic_regression
Logistic Regression: SPSS
• Analyze => Regression => Binary Logistic…
Logistic Regression: SPSS
• Put Vote into the dependent variable box, and Educ and
Gender into the independent variable box
Logistic Regression: SPSS
• Click Categorical, put Gender into the Categorical Covariates box, choose Indicator, and set
• Male = 0
• Female = 1
• We use female as the reference category
Logistic Regression: SPSS
• The Model Summary table tells how much variation in the dependent variable can be explained by the model
• This table contains two R-square values, both of which are measures of the explained variation
Logistic Regression: classification
• This gives the percentage of cases for which the dependent variable is correctly classified by the model
• The percentage is (824 + 567) / (824 + 567 + 414 + 563) = 58.7%
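The accuracy from the classification table is just the diagonal (correctly classified) counts over the total. Reproducing the slide’s arithmetic:

```python
# Classification accuracy from the slide's classification table
correct = 824 + 567                  # correctly classified cases (diagonal)
total = 824 + 567 + 414 + 563        # all classified cases
accuracy = correct / total
print(f"{accuracy:.1%}")             # -> 58.7%
```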
Logistic Regression: equation
• The "Variables in the Equation" table shows the
parameter estimate of each independent variable in the
model and its statistical significance.
Logistic Regression: Wald Test
• The Wald test is used to determine statistical significance for each of the independent variables. For the three factors, the p values are < 0.05, showing that they have a statistically significant impact on the output variable
Logistic Regression: SPSS
• The odds ratios are simply the exponentiated coefficients
from the logit model. For example, the coefficient for educ
was -.252. The odds ratio is exp(−.252)=.777.
Logistic Regression: odds ratio
• An odds ratio less than one means that an increase in X leads to a decrease in the odds of voting for Trump.
• An odds ratio greater than one means that an increase in X leads to an increase in the odds of voting for Trump.
Logistic Regression: odds ratio
• For Gender, the Beta value is 0.356, positive, and the Odds
ratio is 1.427, greater than 1.
• The odds of voting for Trump are higher for males
compared to females
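The odds-ratio arithmetic from the last two slides is a one-line exponentiation of each logit coefficient; the coefficient values below are the ones quoted in the slides:

```python
# Odds ratios are exponentiated logistic-regression coefficients
import math

b_educ, b_gender = -0.252, 0.356   # coefficients quoted on the slides

or_educ = math.exp(b_educ)         # ~0.777: below 1, so higher education lowers the odds
or_gender = math.exp(b_gender)     # ~1.427: above 1, males have higher odds (female = reference)
print(f"OR(educ) = {or_educ:.3f}, OR(gender) = {or_gender:.3f}")
```

Small differences from SPSS output (e.g. 1.427 vs 1.428) come from SPSS exponentiating the unrounded coefficient rather than the printed one.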
