Subject PSYCHOLOGY
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
3. Regression
3.1 History of Regression
3.2 Regression Model
3.3 Factors Affecting Correlation
3.4 Assumptions of Regression
4. Practical applications of Regression
4.1 Applications
4.2 Using SPSS for Regression
4.3 SPSS 20 for Regression Analysis
5. Summary
1. Learning Outcomes
After studying this module, you shall be able to:
2. Introduction
Suppose that, from a group of 40 applicants to an organization, one wants to estimate, knowing their past academic records, which applicants would work efficiently on a particular project. Since the job requires average academic performance, the best possible guess is that an average scorer would be the best possible choice. So, in that group of 40 applicants, one would take the mean percentage of academic performance, and those applicants who lie in this range can be predicted to perform well on the job.
One can understand that if any two variables are statistically related, i.e. if they share a fair degree of covariance or have a high positive or negative correlation, then a measure on one variable can be used to predict the possible score on the other variable with a fair amount of success.
A researcher may know, from data collected on a sample, that intelligence scores are highly positively correlated with academic achievement. Using this information and the concept of regression, if the researcher has a measure of only one of the variables, either intelligence or academic achievement, then a prediction about the other variable can be made with a fair amount of accuracy. The concept of regression is thus a step forward in studying the relationship among variables statistically and using that relationship to predict one variable from another.
3. Regression
3.1 History of Regression
The term regression was first used by Francis Galton with reference to the inheritance of stature. He found that children of tall parents tended to be shorter than their parents, while children of short parents tended to be taller than theirs. Thus, the heights of the offspring tended to move towards the mean height of the general population. Galton called this tendency to return towards the mean value the principle of regression, and the line describing the relationship between the heights of parents and offspring was called the regression line. The predictor variable here is the parent's height and the outcome variable is the child's height. The prediction of one variable from the other is the concept of regression.
PSYCHOLOGY Paper No.2: Quantitative Methods
Module No.24: Regression

Similar to correlation, regression is used to analyze the relationship between two continuous variables. Regression is better suited for studying functional dependencies between variables, that is, situations where X partially determines the level of Y. For instance, as age increases, blood pressure increases, but the length of the arm has no effect on the length of the leg. Regression is also better suited than correlation for studying samples in which the investigator fixes the distribution of X, the predictor variable.
For example, let the independent variable be the percentage of children receiving reduced-fee school lunches in a particular neighborhood, as a substitute for neighborhood socio-economic status, and let the dependent variable be the percentage of bicycle riders wearing helmets. The researcher finds a strong negative correlation of -0.85. These data are useful if the researcher wants to predict helmet-wearing behavior on the basis of the data obtained on socio-economic status. The scatter-plot of the data is depicted in the graph below. A straight line of best fit can be fitted to the data using the least-squares criterion (readers may refer to the module on correlation to read more about this). The line of best fit enables the statistician to develop the regression model.
A line of best fit is a straight line that runs through the data in such a manner that the sum of the squared deviations of the data points from the line is minimum. Let us understand this concept.
A line is identified by its slope, the change in Y per unit change in X, and its intercept, the value at which the line crosses the Y axis. Regression describes the relation between X and Y with just such a line:

ŷ = a + bx

where
ŷ = predicted value of Y
a = intercept
b = slope
Now identifying the best line for the data becomes a question. Had all the data points fallen on a single line, identifying the slope and intercept would have been easy. But since statistical data show random scatter, identifying a good line requires effort.
The random scatter around the line is measured by the distance of each point from the predicted line; these distances are referred to as residuals, as shown below.
One's aim now is to identify the line that minimises the sum of the squared residuals, which is called the least squares line. The slope of the least squares line, represented by b, is given as

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

and the intercept, represented by a, as

a = ȳ − b·x̄

where
x̄ = average value of X
ȳ = average value of Y
b = slope
Hence, in the above example, ȳ = 30.8833 and x̄ = 30.8333. Thus, a = 30.8833 − (−0.539)(30.8333) = 47.49, and the regression model becomes ŷ = 47.49 + (−0.54)x.
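As a sketch, the least-squares slope b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept a = ȳ − b·x̄ can be computed directly in Python. The data points below are hypothetical illustrative values, not the helmet data from the example.

```python
# Least-squares slope and intercept computed from their defining formulas.
# The (x, y) pairs are made-up values for illustration only.

def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: line passes through the point of means (x_bar, y_bar)
    a = y_bar - b * x_bar
    return a, b

xs = [10, 20, 30, 40, 50]
ys = [42, 38, 30, 26, 19]
a, b = least_squares(xs, ys)
print(a, b)  # intercept 48.4, slope -0.58
```

As with the worked example, the slope is negative because Y falls as X rises.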
The slope in the regression model is the average change in Y per unit X. Thus, the slope of −0.54 predicts 0.54 fewer helmet users per 100 bicycle riders for each additional percentage point of children receiving reduced-fee meals.
The regression model can be used to predict the value of Y at a given level of X. For example, a neighborhood in which half the children receive reduced-fee lunches (X = 50) has an expected helmet-use rate (per 100 riders) of 47.49 + (−0.54)(50) = 20.5.
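Taking the fitted coefficients from the example above (intercept 47.49, slope −0.54), the prediction at X = 50 can be reproduced with a one-line function:

```python
# Prediction from the fitted line y_hat = a + b*x, with coefficients
# taken from the worked example in the text.

A, B = 47.49, -0.54  # intercept and slope from the example

def predict(x):
    """Expected Y (helmet users per 100 riders) at a given level of X."""
    return A + B * x

print(predict(50))  # about 20.5 helmet users per 100 riders
```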
3.4 Assumptions of Regression
Number of cases: when doing regression, the cases-to-independent-variables ratio should ideally be 20:1, that is, 20 cases for every independent variable in the model. The lowest acceptable ratio is 5:1, that is, 5 cases for every independent variable in the model.
Accuracy of data: one should check the accuracy of data entry to ensure that all the values for each variable are valid.
Missing data: one should look for missing data. If a variable has many missing values, one should not include it in the analysis; if only a few cases have missing values, one may delete those cases; or, if the variable is important, one may place the mean value of that variable in the missing places.
Outliers: one should check the data for outliers, that is, extreme values on a particular item lying at least 3 standard deviations above or below the mean. One may delete these cases if they are not part of the same population, or retain them but reduce how extreme they are by recoding the value.
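The 3-standard-deviation screen described above can be sketched as follows; the sample scores are hypothetical.

```python
# Flag values at least 3 standard deviations above or below the mean.
# Sample scores below are hypothetical, for illustration only.
from statistics import mean, pstdev

def outliers(scores, cutoff=3.0):
    m = mean(scores)
    sd = pstdev(scores)  # population SD; use stdev() for a sample estimate
    return [x for x in scores if abs(x - m) >= cutoff * sd]

scores = [48, 50, 52, 49, 51] * 4 + [120]  # 120 is an extreme value
print(outliers(scores))  # [120]
```

Note that a single extreme value inflates the standard deviation itself, so in very small samples this rule can fail to flag a genuine outlier.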
Normality: one should check that the data are normally distributed by constructing a histogram or a normal probability plot.
Linearity: one assumes linearity, that is, a single straight-line relationship between the independent and dependent variables, as regression analysis tests only for linear relationships. Any non-linear relationship is ignored.
Homoscedasticity: one assumes homoscedasticity, that is, the residuals are approximately equal in spread across all predicted dependent-variable scores; equivalently, the variability of scores on the dependent variable is the same at all values of the independent variable.
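One informal way to inspect homoscedasticity, sketched below with hypothetical data and a hypothetical fitted line, is to compare the spread of the residuals for low versus high values of the predictor:

```python
# Informal homoscedasticity check: compute residuals from a fitted line
# and compare their spread in the low-x half versus the high-x half.
# Data and coefficients are hypothetical, for illustration only.
from statistics import pstdev

A, B = 48.4, -0.58  # hypothetical fitted intercept and slope
xs = [10, 20, 30, 40, 50, 60]
ys = [43, 37, 31, 24, 20, 13]

preds = [A + B * x for x in xs]
residuals = [y - p for y, p in zip(ys, preds)]

mid = len(residuals) // 2
low_spread = pstdev(residuals[:mid])   # spread at low x values
high_spread = pstdev(residuals[mid:])  # spread at high x values
print(low_spread, high_spread)  # roughly similar spreads suggest homoscedasticity
```

In practice one would plot residuals against predicted values and look for a funnel shape rather than rely on a two-group comparison.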
Focus of analysis: the purpose of carrying out multiple regression analysis in quantitative psychological research is to analyze the extent to which two or more independent variables relate to a dependent variable.
Variables involved: there may be two or more independent variables, which are continuously scaled. The dependent variable is also continuously scaled, i.e. measured on an interval or ratio scale.
Relationship of the participants' scores across the groups being compared: to be suitable for multiple regression analysis, the participants should have scores on all the variables; in other words, the scores are dependent upon each other.
5. Summary
The concept of regression is a step forward in studying the relationship among variables statistically and using that relationship to predict one variable from another.