STATISTICS
[Diagram: a sample is selected from a population; measuring the sample yields data; probability supports inference from the sample back to the population.]
Research designs and methods:
• Case Studies
• Experimental
• Action Research
• Field Surveys
• Ethnography
• Secondary Data
• FGD/KII
Bhattacherjee, A. (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida.
Use of Regression to Analyze a Wide Variety of Relationships
• Include continuous and categorical variables; use polynomial terms to model curvature.
Sample Study: Do exercise habits and diet affect weight?
• IV: exercise habits and diet
• DV: weight
The Research Process
[Diagram: Initial Observation (Research Question) → Generate Theory → … → Conclusion]
Measurement Occasions: 1 or 2
Statement of the Problem: Does students' high school performance predict their college grade point average?
Primary Statistical Questions:
◦ What is the regression equation?
◦ How accurate are our guesses using the regression equation?
◦ Example of a Study That Would Use Simple Linear Regression
Colleges don't have enough room for every high school student who applies,
and admissions offices must use some information to try to guess who will
succeed in order to make their decisions. One popular predictor has always been
SAT scores. In the late 1960s, as the college population was changing, researchers
were interested in what the actual linear relationship was between scores on the
verbal section of the SAT, an interval-level variable that ranged from 200 to 800,
and college grade point average (GPA) for the first year, which ranged from 0.00
to 4.00. They collected data on both variables from a sample of about 4,000
students.
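A study like this fits a line GPA-hat = b0 + b1 * SAT by least squares. The sketch below shows the computation on a handful of made-up (SAT, GPA) pairs; the numbers are illustrative only, not the original study's data.

```python
# Least-squares simple linear regression: predicting first-year GPA
# from SAT verbal score. Data are invented for illustration.
def fit_simple_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope b1 = sum((x - mx)(y - my)) / sum((x - mx)^2)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx  # intercept: line passes through (mean x, mean y)
    return b0, b1

sat = [400, 450, 500, 550, 600, 650, 700]
gpa = [2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5]
b0, b1 = fit_simple_regression(sat, gpa)
print(f"GPA-hat = {b0:.3f} + {b1:.4f} * SAT")
```

Once the line is fitted, predicting a new student's GPA is just plugging their SAT score into the equation.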
Multiple Linear Regression
A multiple linear regression assumes an approximately linear model for a quantitative response Y on the basis of more than one predictor variable X.
Predictor Variable/IV: 2+
◦ Level of Measurement: Interval
◦ Number of Levels: Many
◦ Number of Groups: 1
Criterion Variable/Response/DV: 1
◦ Level of Measurement: Interval
◦ Number of Levels: Many
Measurement Occasions: 1
Sample Size: 50 + 8k to find the best-fit model; 104 + k to determine the significant predictors (rules of thumb, where k is the number of predictors)
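The sample-size guidelines above match Green's (1991) rules of thumb: 50 + 8k cases to test the overall model, and 104 + k cases to test individual predictors. A tiny helper makes the arithmetic explicit (the function name is ours, not from the source):

```python
# Rule-of-thumb minimum sample sizes for multiple linear regression
# (Green, 1991): 50 + 8k tests the overall model fit,
# 104 + k tests the significance of individual predictors.
def required_n(k):
    """k = number of predictors; returns (overall_model_n, individual_predictor_n)."""
    return 50 + 8 * k, 104 + k

print(required_n(4))  # sample sizes needed with 4 predictors
```

In practice the larger of the two numbers is used when both questions matter.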
Research Design: Quantitative Research, Correlational Research, Predictive Causation Research
Objectives:
◦ Prediction – Sometimes researchers want to predict a score in the future, such as administrators looking at students' high school performance to guess what their college grade point averages will be. We tend to say that we are predicting scores.
◦ Inference – Researchers are interested in exploring the relationship between two variables to understand them better.
Statement of the Problem: What best-fit model can be derived from the relationship between Teaching Competence and Academic Performance?
Primary Statistical Questions:
◦ What is the regression equation?
◦ What are the relative contributions of each predictor to the criterion variable?
Statistical Assumptions
◦ Linear relationship
◦ Multivariate normality
◦ No or little multicollinearity
◦ No auto-correlation
◦ Homoscedasticity
Linear relationship
◦ First, linear regression requires the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption is best tested with scatter plots; the following two examples depict cases where little or no linearity is present.
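Alongside the scatter plot, Pearson's r gives a quick numerical screen for linearity. A minimal stdlib-only sketch (invented data), with the caveat that r measures only linear association, so a curved relationship can still score high:

```python
# Pearson's correlation coefficient as a linearity screen.
# r near ±1 suggests a strong linear trend; always confirm
# with the scatter plot, since r can miss non-linear patterns.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly linear
print(pearson_r(x, [1, 4, 9, 16, 25]))  # quadratic: still high, but < 1
```

The second case is why the slide insists on plotting: r alone would not reveal the curvature.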
Multivariate normality
This assumption can best be checked with a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
If the significance value of the normality test is < 0.05, the distribution is not normal.
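The idea behind a Q-Q plot can be sketched with only the standard library: compare the sorted sample against the theoretical normal quantiles and look for a straight line. (A formal test such as `scipy.stats.kstest` would return the significance value discussed above; the data here are invented.)

```python
# Q-Q-style normality screen: pair each sorted observation with the
# normal quantile at plotting position (i + 0.5)/n. Points lying
# close to the identity line suggest approximate normality.
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    s = sorted(sample)
    n = len(s)
    nd = NormalDist(mean(s), stdev(s))
    theo = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theo, s

data = [4.8, 5.1, 4.9, 5.3, 5.0, 5.2, 4.7, 5.1, 5.0, 4.9]
theo, obs = qq_points(data)
for t, o in zip(theo, obs):
    print(f"theoretical {t:5.2f}  observed {o:5.2f}")
```

Plotting these pairs gives the Q-Q plot the slide refers to; large gaps between the columns flag departures from normality.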
This shows the multiple linear regression model summary and overall fit statistics. We find that the adjusted R²
of our model is .398 with the R² = .407. This means that the linear regression explains 40.7% of the variance in
the data. The Durbin-Watson d = 2.074, which is between the two critical values of 1.5 < d < 2.5. Therefore, we
can assume that there is no first order linear auto-correlation in our multiple linear regression data.
If we had forced all variables into the linear regression model (Method: Enter), we would have seen slightly higher R² and adjusted R² values (.458 and .424, respectively).
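Both fit statistics quoted above are simple to compute by hand. The sketch below shows the formulas; the sample size, predictor count, and residuals are made up for illustration, not taken from the study's data.

```python
# Adjusted R² penalizes R² for the number of predictors k:
#   adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
# Durbin-Watson d compares successive residual differences to the
# residuals' overall size; d ≈ 2 indicates no first-order autocorrelation.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def durbin_watson(resid):
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)

print(adjusted_r2(0.407, 70, 1))  # n=70, k=1 assumed for illustration
resid = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
print(durbin_watson(resid))
```

Libraries such as statsmodels report both statistics directly; the formulas are shown here only to make the model-summary table interpretable.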
◦ The next output table is the F-test. The linear regression’s F-test has the null hypothesis that
the model explains zero variance in the dependent variable (in other words R² = 0). The F-test
is highly significant, thus we can assume that the model explains a significant amount of the
variance in murder rate.
In our stepwise multiple linear regression analysis, we find a non-significant intercept but highly significant vehicle theft
coefficient, which we can interpret as: for every 1-unit increase in vehicle thefts per 100,000 inhabitants, we will see .014
additional murders per 100,000.
If we force all variables into the multiple linear regression, we find that only burglary and motor vehicle theft are
significant predictors. We can also see that motor vehicle theft has a higher impact than burglary by comparing the
standardized coefficients (beta = .507 versus beta = .333).
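The standardized coefficients being compared are just raw slopes rescaled by the variables' standard deviations, which is what puts predictors on a common footing. A sketch with invented numbers (the slope and data below are not from the crime dataset):

```python
# Standardized coefficient: beta = b * sd(x) / sd(y).
# Rescaling removes the units, so betas for different predictors
# can be compared directly, as done in the passage above.
from statistics import stdev

def beta_weight(b, x, y):
    return b * stdev(x) / stdev(y)

theft = [100, 250, 300, 450, 500]   # illustrative predictor values
murder = [2.1, 3.0, 3.4, 4.8, 5.2]  # illustrative outcome values
print(beta_weight(0.0075, theft, murder))
```

This is why a predictor with a tiny raw slope (like .014 murders per vehicle theft) can still have the largest beta: its scale is much wider than the outcome's.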
Table 18. Empirical Analysis on the Indicator's Influence of Spiritual Programs towards Spiritual Development
[Table columns: Variables; Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); t; Sig.]
Shown in Table 18 is the empirical analysis of the influence of the spiritual programs on spiritual development. Using multiple linear regression analysis, the model was a good fit (F = 13.369, p = 0.00). This means that the regression model results in significantly better prediction of spiritual development than the mean value. Further, around 27.1% of the variability in spiritual development can be explained by the spiritual programs.
The indicators: Beliefs about the Church, Beliefs about my Life, The Practice of Prayer and The Practice of Fellowship
significantly predict the spiritual development of the students in San Pedro College.
Regression Analysis Using Dummy Variables
ANOVA vs. Regression
Simple Logistic Regression
Simple logistic regression predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable, based on one independent variable that can be either continuous or categorical.
Predictor Variable: 1
◦ Level of Measurement: Nominal+
◦ Number of Levels: 2+
◦ Number of Groups: 1
Criterion Variable: 1
◦ Level of Measurement: Nominal
◦ Number of Levels: 2
Measurement Occasions: 1
Sample Size: n = 100 + 50i, where i is the number of predictors
Research Design: Quantitative Research, Correlational Research, Predictive Causation Research
This is the chi-square statistic and its significance level; it plays the same role as the F-test in multiple linear regression. The significance level is the probability of obtaining this chi-square statistic (65.588) if the independent variables, taken together, in fact have no effect on the dependent variable.
Cox & Snell R Square and Nagelkerke R Square – These are pseudo R-squares. Logistic regression does not have an equivalent to the
R-squared that is found in OLS regression; however, many people have tried to come up with one. There are a wide variety of pseudo-R-
square statistics (these are only two of them). Because this statistic does not mean what R-squared means in OLS regression (the
proportion of variance explained by the predictors), we suggest interpreting this statistic with great caution.
Predicted – These are the predicted values of the dependent variable based on the full logistic regression model. This
table shows how many cases are correctly predicted (132 cases are observed to be 0 and are correctly predicted to be
0; 27 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted
(15 cases are observed to be 0 but are predicted to be 1; 26 cases are observed to be 1 but are predicted to be 0).
Overall Percentage – This gives the overall percent of cases that are correctly predicted by the model (in this case,
the full model that we specified). As you can see, this percentage has increased from 73.5 for the null model to 79.5
for the full model.
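The Overall Percentage is simply correct predictions over total cases. Using the counts quoted in the passage above:

```python
# Classification-table accuracy from the counts in the text:
# 132 true 0s + 27 true 1s correct; 15 + 26 misclassified.
correct = 132 + 27
total = 132 + 27 + 15 + 26
print(round(100 * correct / total, 1))  # full model -> 79.5

# Null model: predict the majority class (the 147 observed 0s) for everyone.
print(round(100 * (132 + 15) / total, 1))  # -> 73.5
```

Reproducing both figures confirms how the 73.5% baseline and the 79.5% full-model accuracy in the text are derived.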
These are the values for the logistic regression equation for predicting the dependent variable from the independent
variables. They are in log-odds units. Similar to OLS regression, the prediction equation is
log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
where p is the probability of being in honours composition. Expressed in terms of the variables used in this example,
the logistic regression equation is
log(p/(1-p)) = -9.561 + 0.098*read + 0.066*science + 0.058*ses(1) - 1.013*ses(2)