Contents
Normal distribution
Student t-distribution
Student t-test
    One-sample t-test
    Independent-samples t-test
    Paired-samples t-test
Chi-square test
ANOVA
    Levene's test for equal variances
    Post-hoc test
Correlation coefficient
Regression analysis
Multiple regression
    Model selection
    Required number of cases
    Multicollinearity
    Summary multiple regression
Logistic regression analysis
    Binary logistic regression and a single independent variable
    Binary logistic regression and multiple independent variables
Multinomial logistic regression
Non-parametric methods
    Non-parametric test
        The Mann-Whitney U test
        The Kruskal-Wallis H test
        The Wilcoxon T-test
        The Spearman correlation
    Advantages and disadvantages of non-parametric methods
Normal distribution
A statement is made. We assume this statement is correct and there is no difference; we call this the null hypothesis. Then we state that the statement is not correct and there is a difference; this we call the alternative hypothesis.
This test is used to test whether there is a difference between groups, for example a difference in income between men and women.
Use the F-test (Levene's test) to test whether the unknown variances are equal; a sketch of both steps follows below.
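A minimal sketch of this two-step procedure in Python (scipy), with invented income data for two groups; the variable names and values are illustrative only:

```python
# Sketch: test whether mean income differs between two groups.
# Data values below are made up for illustration.
from scipy import stats

income_men   = [2100, 2400, 2250, 2800, 2600, 2300]
income_women = [2000, 2350, 2150, 2500, 2450, 2200]

# Levene's test: H0 = the two groups have equal variances.
lev_stat, lev_p = stats.levene(income_men, income_women)

# Choose the t-test variant based on the variance test:
# equal_var=True  -> classic independent-samples t-test
# equal_var=False -> Welch's t-test (unequal variances)
t_stat, t_p = stats.ttest_ind(income_men, income_women,
                              equal_var=(lev_p > 0.05))

print(f"Levene p = {lev_p:.3f}, t = {t_stat:.2f}, p = {t_p:.3f}")
```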
Examples of cases:
Example
The same group of students completed two tests: Test 1 and Test 2.
With a chi-square test you can examine the relationship between two nominal/ordinal variables.
Expected frequency: the frequency that would result if men and women chose the transport modes with equal frequency.
Test of independence: if there is no difference between men and women in transport-mode choice, then gender and transport-mode choice are independent of each other.
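A minimal sketch with scipy, using an invented gender × transport-mode contingency table; chi2_contingency also returns the expected frequencies under independence:

```python
# Sketch: chi-square test of independence for gender vs. transport mode.
# The 2x3 contingency table below is invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

#                    car  bike  train
observed = np.array([[40,  25,  35],   # men
                     [30,  45,  25]])  # women

chi2, p, dof, expected = chi2_contingency(observed)

# 'expected' holds the frequencies we would see if gender and
# transport-mode choice were independent (equal choice patterns).
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
print("expected frequencies:\n", expected.round(1))
```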
Summary:
With the ANOVA test you can test the difference in means between more than two groups.
It tests in a single run whether the observed differences in means are due to sampling error or not.
ANOVA compares the variance between the groups with the variance within the groups.
Post-hoc test
Compare all pairs of groups
Use a stricter alpha than in a single t-test
Different correction methods for alpha exist
Arbitrarily, choose 'Bonferroni' in SPSS
NOTE: Post-hoc tests are relevant only if the differences are significant in ANOVA
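A minimal sketch of ANOVA followed by Bonferroni-corrected pairwise t-tests, one way to mirror the SPSS option named above; the group data are invented:

```python
# Sketch: one-way ANOVA across three groups, then pairwise t-tests
# with a Bonferroni-corrected (stricter) alpha.
from itertools import combinations
from scipy import stats

groups = {
    "A": [5.1, 4.8, 5.5, 5.0, 4.9],
    "B": [6.2, 6.0, 5.8, 6.4, 6.1],
    "C": [5.0, 5.2, 4.7, 5.1, 4.9],
}

f_stat, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p:.4f}")

# Post-hoc only if ANOVA is significant; Bonferroni divides alpha
# by the number of pairwise comparisons.
if p < 0.05:
    pairs = list(combinations(groups, 2))
    alpha = 0.05 / len(pairs)           # stricter alpha
    for a, b in pairs:
        t, pt = stats.ttest_ind(groups[a], groups[b])
        print(f"{a} vs {b}: p = {pt:.4f} ({'sig' if pt < alpha else 'ns'})")
```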
Covariance: a measure of the strength of the linear relation between variables X and Y:
cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)
Disadvantage of covariance:
The size of the covariance of X and Y depends on the scales on which X and Y are measured
The correlation coefficient can indicate a positive linear relation, a negative linear relation, or no linear relation.
Interpretation:
Direction
o Positive: low values of X go more often together with low values of Y
o Negative: low values of X go more often together with high values of Y
Strength
o Values closer to +1 or −1 indicate a stronger linear relation between X and Y
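A short sketch showing why covariance is scale-dependent while the correlation coefficient is not; the data are invented:

```python
# Sketch: covariance depends on the measurement scale, while the
# correlation coefficient is the standardized (scale-free) version.
import numpy as np

x_km = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # distance in km
y    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_km = np.cov(x_km, y)[0, 1]
cov_m  = np.cov(x_km * 1000, y)[0, 1]        # same X, now in metres

r_km = np.corrcoef(x_km, y)[0, 1]            # always between -1 and +1
r_m  = np.corrcoef(x_km * 1000, y)[0, 1]

print(f"cov (km) = {cov_km:.2f}, cov (m) = {cov_m:.2f}")  # very different
print(f"r (km) = {r_km:.3f}, r (m) = {r_m:.3f}")          # identical
```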
Regression analysis
Can we predict the value of Y better if we know the value of X?
Linear relation between the dependent variable Y and the independent variable X in reality.
Check beforehand:
Linear relation
Constant variance
Normal distribution
Goals of regression model:
Explained variance:
That part of the variance in Y that is explained by the values of the independent variable X
Summary:
Variance
o Indicator for prediction error
R-square
o How much of the prediction error can be reduced by using the independent variable X
Correlation
o Strength and direction of relationship
o Standardized (covariance = unstandardized)
Regression coefficient
o To what extent Y changes if X changes by one unit
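A minimal sketch of simple linear regression with scipy, on invented data; the slope is the regression coefficient and r² the explained variance:

```python
# Sketch: simple linear regression of Y on X.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7]

res = stats.linregress(x, y)

print(f"intercept = {res.intercept:.2f}")
print(f"slope     = {res.slope:.2f}      # change in Y per unit change in X")
print(f"R-square  = {res.rvalue**2:.3f}  # share of variance in Y explained by X")
```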
Multiple regression
Finding the effect of a predictor on a variable while controlling for the effect of other predictors
Why standardization
o To compare variables that are measured on different scales
o For example: temperature in Fahrenheit and Celsius
Why is the partial regression coefficient (multiple regression) lower than the regression coefficient (simple regression)?
The multiple regression coefficient is different, because it is controlled for income
The partial coefficient is lower, because the correlation between the predictors is positive
In a simple regression one assumes that SIZE is not correlated to any other predictor, which
does not hold in this case; therefore multiple regression is better
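A sketch of this point with simulated data; the SIZE/INCOME/PRICE setup is a hypothetical reconstruction of the example the notes refer to. Because SIZE and INCOME are positively correlated, the simple coefficient of SIZE absorbs part of INCOME's effect, and the partial coefficient is lower:

```python
# Sketch: simple vs. partial regression coefficient with correlated predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
income = rng.normal(50, 10, n)
size = 20 + 0.8 * income + rng.normal(0, 5, n)       # SIZE correlates with INCOME
price = 10 + 2.0 * size + 1.5 * income + rng.normal(0, 10, n)

# Simple regression: SIZE soaks up part of INCOME's effect.
simple = sm.OLS(price, sm.add_constant(size)).fit()

# Multiple regression: SIZE's coefficient is controlled for INCOME.
X = sm.add_constant(np.column_stack([size, income]))
multiple = sm.OLS(price, X).fit()

print(f"simple coefficient of SIZE  : {simple.params[1]:.2f}")
print(f"partial coefficient of SIZE : {multiple.params[1]:.2f}  # lower")
```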
Parsimony
1. The predictor with the highest correlation with the dependent variable is the first to enter the model
2. The second predictor is the variable with the highest partial correlation with the dependent variable
3. Test whether the first predictor is still significant after adding the second predictor. If not, it is removed
4. Repeat from step 2 for all following predictors (a code sketch follows below)
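A code sketch of this stepwise procedure, assuming a hypothetical pandas DataFrame df with the dependent variable in column 'y' and the predictors in the remaining columns; it uses p-values as the entry and removal criterion, a simplification of the partial-correlation rule above:

```python
# Sketch of stepwise model selection; names ('stepwise', 'df', 'y') are
# hypothetical, not from the notes.
import statsmodels.api as sm

def stepwise(df, dep="y", alpha=0.05):
    candidates = [c for c in df.columns if c != dep]
    selected = []
    while candidates:
        # Steps 1/2: pick the candidate that is most significant when added.
        pvals = {}
        for c in candidates:
            X = sm.add_constant(df[selected + [c]])
            pvals[c] = sm.OLS(df[dep], X).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                       # no remaining predictor is significant
        selected.append(best)
        candidates.remove(best)
        # Step 3: drop earlier predictors that lost significance
        # (dropped predictors do not re-enter in this simple sketch).
        model = sm.OLS(df[dep], sm.add_constant(df[selected])).fit()
        selected = [v for v in selected if model.pvalues[v] < alpha]
    return selected
```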
Required number of cases
Rules of thumb for designing research
Rule of thumb 1
Rule of thumb 2
By using multiple regression we correct for correlations between predictors. However, this poses
problems if correlations are high.
Multicollinearity
Problems caused by high correlations between predictors
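The notes name the problem but no diagnostic; a common check (an addition here, not from the notes) is the variance inflation factor (VIF), where values above roughly 5-10 are usually taken as a warning sign:

```python
# Sketch: diagnosing multicollinearity with variance inflation factors.
# Simulated data: x2 is constructed to be highly correlated with x1.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                          # independent predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF {name}: {variance_inflation_factor(X, i):.1f}")
```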
Advantages of Logit
The Wald statistic is chi-square (χ²) distributed with one degree of freedom (df = 1)
Binary logistic regression and multiple independent variables
R2 measure
Percentage of correct predictions
R2 measures:
The R-square measures for logistic regression are comparable to the R-square measure for linear regression
They involve a transformation of the log-likelihood ratio on a zero-one scale
R2=1, means that the increase in goodness-of-fit is maximal (the model predicts cases
perfectly)
R2=0, means the model does not improve the prediction compared to a model without
coefficients
SPSS reports:
If the model predicts a probability > 0.5, then we predict the event will happen (Y = 1)
If the model predicts a probability < 0.5, then we predict the event will not happen (Y = 0); this cut-off is illustrated in the sketch after the list below
Logistic regression does not assume a linear relationship between the dependent and the
independents
It is also possible and permitted to add interaction and power terms as variables on the right-hand side of the logistic equation, as in linear regression
The dependent variable need not be normally distributed
The variance of the dependent variable does not need to be homogeneous across levels of
the independents
Normally distributed error terms are not assumed
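A minimal sketch of a binary logistic regression with statsmodels on simulated data, showing a pseudo R-square and the 0.5 cut-off for the percentage of correct predictions:

```python
# Sketch: binary logistic regression with an invented 0/1 outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))        # true event probability
y = rng.binomial(1, p_true)                        # observed 0/1 outcome

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

print(f"pseudo R-square (McFadden): {model.prsquared:.3f}")

# Classify: probability > 0.5 -> predict the event (Y=1), else Y=0.
pred = (model.predict(sm.add_constant(x)) > 0.5).astype(int)
print(f"percentage correct predictions: {(pred == y).mean():.1%}")
```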
Multinomial logistic regression
The dependent variable has more than two categories
In Binary logistic regression the dependent variable is dichotomous (which of two events
occurs?). In multinomial regression the dependent variable is categorical (which of multiple
events occurs?)
We study the relationship between a categorical dependent variable and one or more
interval/ratio or dichotomous independent variables
As in linear and binary logistic regression, dummy variables should be used for categorical
independent variables
SPSS will automatically dummy-code categorical variables, so they can be entered directly
In SPSS, categorical variables are included as 'factors' and interval/ratio variables as 'covariates'
Like binary logistic regression, multinomial logistic regression does not make any assumptions of normality, linearity, and homogeneity of variance for the independent variables
Because it does not impose these requirements, it is preferred over discriminant analysis when the data do not satisfy those assumptions
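A minimal sketch with statsmodels' MNLogit for a three-category outcome; the data are randomly generated, so only the mechanics of the fit are shown:

```python
# Sketch: multinomial logistic regression for a 3-category dependent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)                   # interval/ratio "covariate"
x2 = rng.integers(0, 2, size=n)           # dichotomous dummy ("factor")

# Three-category outcome (e.g. car / bike / train), generated at random
# here purely to illustrate the fit.
y = rng.integers(0, 3, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.MNLogit(y, X).fit(disp=0)

# One set of coefficients per non-reference category of Y.
print(model.params)
```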
Non-parametric methods
Levels of measurement:
Nominal
1≠2≠3
o Numbers just indicate the different categories
o Thus no ordering
o Colors, gender, means of transport
Transformation
o Numbers are fully interchangeable between categories
Example
o Means of transport: 1 = car, 2 = bike, 3 = train
o Is equivalent to: 2 = car, 3 = bike, 1 = train
Dichotomous
o Variable of nominal level with only 2 categories
o Gender: 1 = male, 2 = female
Ordinal level
1<2<3 or 1>2>3
o There is an order between categories
o No equal differences between consecutive categories
o Hierarchical levels, level of education, rank order
Transformation
o Any transformation, as long as order between categories stays the same
Example
o Educational level: 1 = secondary school, 2 = MSc, 3 = PhD
o Or: 2 = secondary school, 5 = MSc, 7 = PhD
Ratio level
2 = 2 × 1
o Order & equal intervals & absolute zero value
o Kilos, distance, age, temperature in Kelvin
Transformation
o Keep equal proportions
o Multiply by a constant; e.g. pounds ≈ 2 × kilos
Example equal proportions
o 20 kilometers is twice as much as 10 kilometers
o Weight 40 kilos is twice as heavy as 20 kilos
Non-parametric test
A non-parametric test will be used when:
Null hypothesis H0: there is no difference in the rank scores between the groups
Alternative hypothesis H1: there is a difference, or the rank score of group 1 is larger (smaller) than that of group 2
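The contents list the Mann-Whitney U test at this point; a minimal sketch with scipy, using invented data for two independent groups:

```python
# Sketch: Mann-Whitney U test, the rank-based (non-parametric)
# counterpart of the independent-samples t-test.
from scipy.stats import mannwhitneyu

group1 = [12, 15, 11, 19, 14, 17]
group2 = [22, 18, 25, 21, 20, 24]

# H0: no difference in rank scores between the groups.
u_stat, p = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```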
The Kruskal-Wallis H test
This is an alternative for ANOVA
Recall ANOVA
o Test the difference of means between groups when there are more than two groups
o The groups are independent
o The dependent variable is an interval/ratio variable
We use the Kruskal-Wallis H test
o Assume the same conditions
o But, the dependent variable is ordinal (rank scores)
Null hypothesis H0: there is no difference in means of rank scores between the groups
Alternative hypothesis H1: there is a difference in means of rank scores between the groups
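A minimal sketch of the Kruskal-Wallis H test with scipy, using invented data for three independent groups:

```python
# Sketch: Kruskal-Wallis H test, the rank-based alternative to ANOVA.
from scipy.stats import kruskal

group_a = [3, 5, 4, 6, 4]
group_b = [7, 8, 6, 9, 7]
group_c = [4, 5, 3, 5, 4]

# H0: no difference in means of rank scores between the groups.
h_stat, p = kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.2f}, p = {p:.4f}")
```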
Conclusion
Use a non-parametric test only when a parametric test cannot be used because assumptions
are not met
Use a non-parametric test when
o The original variables are ordinal/nominal
o The distributions of the original variables are not normal
o The sample size is too small for a parametric test, e.g. when 10 < N < 30