Topics
• Linear regression and logistic regression
• Factor analysis, cluster analysis and MANOVA
• Discriminant analysis, path analysis, canonical correlation and multidimensional scaling
Linear Regression
What is regression?
Two variables are said to be related if, when one variable changes by a certain amount, the other variable changes on average by a certain amount.
Regression is a method used to quantify the linear relationship between two continuous variables by means of a mathematical equation.
The equation can be used to predict the average value of one variable for a fixed value of the other variable.
What is regression?
Regression equation: Y = b0 + b1X
Y – dependent variable
b0 – intercept
b1 – slope
X – independent variable
Intercept and slope are called “regression coefficients”
Assumptions of simple linear regression
1. The dependent variable should be normally distributed
(continuous)
2. The observations must be independent
• Note: no distributional assumptions are made about the independent variable
Interpretation
1. The intercept gives the average value of the dependent variable when the value of the independent variable is zero
2. The slope gives the average change in the dependent variable for a one-unit increase in the independent variable
Example
• Predict the birth weight (kg) using mother's weight (kg)
• [Data table: mother's weight (kg) for 10 births — 50, 39, 60, 58, 70, 55, 45, 66, 47, 61]
• Fitted coefficients: b0 = -0.43, b1 = 0.052
• Fitted equation: Y = b0 + b1X = -0.43 + 0.052X (a code sketch of this fit follows below)
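The coefficients above can be reproduced with any least-squares routine. Below is a minimal Python sketch using NumPy; the mother's weights are the ten values from the slide, while the birth-weight values are hypothetical placeholders (the birth-weight column is not shown on the slide), so the fitted numbers will only roughly resemble b0 = -0.43 and b1 = 0.052.

```python
# Minimal sketch of fitting the simple linear regression Y = b0 + b1*X with NumPy.
import numpy as np

mothers_weight = np.array([50, 39, 60, 58, 70, 55, 45, 66, 47, 61], dtype=float)  # X (kg), from the slide
birth_weight = np.array([2.3, 1.7, 2.6, 2.5, 3.2, 2.4, 1.9, 3.0, 2.1, 2.7])       # Y (kg), hypothetical placeholders

# np.polyfit with degree 1 returns [slope, intercept] for the least-squares line.
b1, b0 = np.polyfit(mothers_weight, birth_weight, 1)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")

# Predict the average birth weight for a mother weighing 60 kg.
print("predicted birth weight at X = 60 kg:", b0 + b1 * 60)
```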
Multiple linear regression
Regression equation: Y = b0 + b1X1+ b2X2 …+ bpXp
Y – dependent variable
b0 – intercept
b1, b2,… bp – slopes associated with each of p independent variables
X1, X2,… Xp – p independent variables (i.e. multiple independent
variables)
Example and interpretation
Y = b0 + 0.06X1 - 0.07X2
where X1 is mother's weight and X2 is mother's cholesterol level
• For every one unit (1 kg) increase in mother's weight, on average the birth weight will increase by 0.06 kg, after adjusting for the effect of mother's cholesterol level
• For every one unit increase in mother's cholesterol level, on average the birth weight will decrease by 0.07 kg, after adjusting for the effect of mother's weight
(a code sketch of this model follows below)
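A minimal sketch of how such a multiple regression can be fitted by least squares in Python with NumPy. All data values below (mother's weight, cholesterol level and birth weight) are hypothetical placeholders, so the coefficients will not match 0.06 and -0.07 exactly.

```python
# Sketch of a multiple linear regression Y = b0 + b1*X1 + b2*X2 via least squares.
import numpy as np

x1 = np.array([50, 39, 60, 58, 70, 55, 45, 66, 47, 61], dtype=float)   # mother's weight (kg)
x2 = np.array([190, 230, 180, 210, 170, 200, 240, 175, 220, 185.0])    # cholesterol level (hypothetical)
y  = np.array([2.3, 1.7, 2.6, 2.5, 3.2, 2.4, 1.9, 3.0, 2.1, 2.7])      # birth weight (kg, hypothetical)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(f"b0 = {b0:.3f}, b1 = {b1:.3f} (per kg mother's weight), b2 = {b2:.3f} (per unit cholesterol)")
```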
Binary Logistic Regression
Binary Logistic Regression model
• Logistic regression models a categorical dependent variable using categorical and/or continuous independent variable(s)
• Binary logistic regression is used to predict a dependent variable that
is binary (dichotomous), from one or more categorical or continuous
independent variables
• Simple binary logistic regression – one independent variable
• Multiple binary logistic regression – multiple independent variables
Simple binary logistic regression model
• The simple linear regression model is
  Y = b0 + b1X
• The simple binary logistic regression model is
  ln[p / (1 - p)] = b0 + b1X
where,
• p = probability of occurrence of the event, given the value of the independent variable
• exp(b1) = odds ratio for Y with respect to X
Simple binary logistic regression model - example
• A simple logistic regression model to find the effect of obesity on the occurrence of MI:
  ln[p / (1 - p)] = b0 + b1(obesity)
where,
• p = probability of occurrence of MI, given the status of obesity
• exp(b1) = odds ratio for MI with respect to obesity status
(a code sketch of this model follows below)
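A minimal sketch of this model in Python using statsmodels (assumed to be installed). The obesity and MI indicators below are hypothetical, for illustration only; exp(b1) is the odds ratio for MI with respect to obesity.

```python
# Sketch of the simple binary logistic regression ln[p/(1-p)] = b0 + b1*X.
import numpy as np
import statsmodels.api as sm

obese = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0])   # X: 1 = obese, 0 = not obese (hypothetical)
mi    = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0])   # Y: 1 = MI occurred (hypothetical)

X = sm.add_constant(obese)          # adds the intercept column for b0
fit = sm.Logit(mi, X).fit(disp=0)   # maximum-likelihood fit of the logistic model

b0, b1 = fit.params
print(f"b1 = {b1:.3f}, odds ratio exp(b1) = {np.exp(b1):.2f}")

# Predicted probability of MI for an obese person: p = 1 / (1 + exp(-(b0 + b1))).
print("P(MI | obese) =", fit.predict([[1, 1]])[0])
```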
Multiple binary logistic regression model
• The multiple binary logistic regression model is
  ln[p / (1 - p)] = b0 + b1X1 + b2X2 + … + bpXp
where,
• p = probability of occurrence of the event, given the independent variables
• exp(bi) = odds ratio for Y with respect to Xi, adjusted for the other independent variables
Multiple binary logistic regression model - example
• A multiple logistic regression model to find the effect of obesity, gender and diabetes status on the occurrence of MI:
  ln[p / (1 - p)] = b0 + b1(obesity) + b2(gender) + b3(diabetes)
where,
• p = probability of occurrence of MI, given the status of obesity, gender and diabetes
• exp(b1) = odds ratio for MI with respect to obesity status, adjusted for the effect of gender and diabetes status
• exp(b2) = odds ratio for MI with respect to gender, adjusted for the effect of obesity status and diabetes status
• exp(b3) = odds ratio for MI with respect to diabetes status, adjusted for the effect of gender and obesity status
(a code sketch of this model follows below)
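A sketch of the multiple model using the statsmodels formula interface (assumed available). The data frame below is simulated purely for illustration, with a hypothetical 0/1 coding of each predictor; np.exp of the fitted coefficients gives the odds ratios for obesity, gender and diabetes, each adjusted for the other two.

```python
# Sketch of a multiple binary logistic regression with adjusted odds ratios.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "obese":    rng.integers(0, 2, n),   # 1 = obese (hypothetical)
    "male":     rng.integers(0, 2, n),   # 1 = male (hypothetical gender coding)
    "diabetes": rng.integers(0, 2, n),   # 1 = diabetic (hypothetical)
})
# Hypothetical outcome generated so that all three factors raise the risk of MI.
logit_p = -2.0 + 1.2 * df["obese"] + 0.5 * df["male"] + 0.9 * df["diabetes"]
df["mi"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("mi ~ obese + male + diabetes", data=df).fit(disp=0)
print(np.exp(fit.params))   # adjusted odds ratios for obesity, gender and diabetes
```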
Odds ratio interpretation
• The odds ratio for myocardial infarction with respect to obesity status was 4.34
• The odds of MI are 4.34 times higher for obese persons compared to non-obese persons
• An odds ratio ranges from 0 to infinity
• An odds ratio greater than 1 implies that the independent variable is a risk factor for the dependent variable
• An odds ratio less than 1 implies that the independent variable is a protective factor for the dependent variable
• An odds ratio equal to 1 implies that the independent variable and the dependent variable have no relationship (are not associated)
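For reference, an odds ratio such as 4.34 comes directly from a 2 x 2 table of exposure by outcome. The counts below are hypothetical, chosen only to show the arithmetic.

```python
# How an odds ratio is computed from a 2x2 table (hypothetical counts).
#                MI   no MI
#   obese         a      b
#   not obese     c      d
a, b = 52, 48      # obese:     52 with MI, 48 without
c, d = 20, 80      # not obese: 20 with MI, 80 without

odds_obese     = a / b                  # odds of MI among the obese
odds_not_obese = c / d                  # odds of MI among the non-obese
odds_ratio = odds_obese / odds_not_obese
print(odds_ratio)                       # (52/48) / (20/80) ≈ 4.33
```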
Use of binary logistic regression
• To identify the risk factors (predictors) for an outcome of
interest
• To predict the probability of outcome of interest.
Factor analysis
Introduction
• Consider a situation in which we want to assess the climate of an organization.
• The organizational climate depends upon many variables and can be
measured through questions such as
• The job requires me to use a number of complex or high-level skills
• I have new and interesting things to do in my work
• My work challenges me
• The job is quite simple and repetitive
• Conflicts are resolved to the satisfaction of those concerned
• My supervisor shows complete trust
• I feel free to discuss problems or negative feelings
• Creativity is actively encouraged in this organization
• Co-workers in my work unit are like a family
Introduction
• By using factor analysis, these variables can be grouped into a smaller number of factors based on the similarity of the variables, where each factor measures some latent characteristic of the organizational climate.
• Thus, the climate of an organization can be studied through a handful of factors such as job challenge, communication, trust, innovation, job satisfaction, and employee welfare rather than through a large number of individual variables.
What is a Factor Analysis?
• Factor analysis is a statistical method that is used to investigate whether
there are underlying latent variables, or factors, that can explain the
patterned correlations within a set of observed variables.
• In this case, the observed variables would be the questions asked.
• Latent variables are underlying constructs that are not directly observable
and cannot be measured by one single thing.
• For example, you cannot directly measure the quality of someone's
marriage. Instead, you can use a combination of observable variables to
measure marriage quality, including the amount of time the couple spends
together, the environment, marital conflict, marital attitudes, etc.
What is a Factor Analysis?
• A factor analysis is a statistical procedure that is used in order to find
underlying groups of related factors in a set of observable variables.
Goals and types
• The primary goals of factor analysis are as follows:
1. Determine how many factors underlie a set of observable variables
2. Provide a method of explaining variance among observable variables by using fewer,
newly created factors
3. Reduce data by allowing the user to extract a small set of factors (which usually are
not related to each other) from a larger set of observable variables (which are usually
correlated with each other). This allows for summarization of a large number of
variables into a smaller number of factors
4. Define the meaning or content of the factors
• There are two types of factor analyses: exploratory factor analysis (or EFA) and
confirmatory factor analysis (or CFA).
Exploratory factor model
• The purpose of this model is to discover the number of factors; it does not specify which items load on which factors. In this model, all the sets of relationships are considered. From the observed variables, as many factors as possible are obtained, and then the possibility of hidden factors is explored. Here each of the observed variables X depends on all of the factors F. (A code sketch follows below.)
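A minimal exploratory-factor-analysis sketch using scikit-learn's FactorAnalysis (one of several possible tools; dedicated packages with rotation options also exist). The response matrix is random placeholder data standing in for respondents' answers to the nine climate items, so the loadings themselves are not meaningful here; the point is only the mechanics of extracting factors and inspecting loadings.

```python
# EFA-style sketch: extract 2 latent factors from 9 observed items.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents, n_items = 300, 9
responses = rng.normal(size=(n_respondents, n_items))   # hypothetical item scores

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(responses)      # factor scores for each respondent
loadings = fa.components_.T               # one row of loadings per observed item

print(scores.shape)                       # (300, 2): each respondent scored on 2 factors
print(loadings.round(2))                  # large loadings suggest which items group together
```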
Confirmatory factor model
• In this case, we already know which observed variables are caused by which factor.
• The objective is to confirm whether what we have hypothesized is correct or not.
Assumptions
• The following assumptions are made while using the factor analysis:
1. Data used in the factor analysis is based either on interval or on ratio scale.
2. Variables have multivariate normal distribution.
3. The variables which have been selected in the study are relevant to the concept
being assessed.
4. An adequate sample size has been taken for the factor analysis. Usually, a minimum of 10 observations per variable is required to run a factor analysis.
5. Outliers are not present in the data.
6. Some degree of collinearity exists among the variables but there should not be
an extreme degree or singularity among the variables.
7. Linear relation exists among variables.
When to use FA
• When you want to derive a small number of factors from a particular set of inter-related quantitative variables.
• To derive constructs such as intelligence, creativity, happiness, etc. from measurements of other, directly observed variables.
Uses of factor analysis
• Quantifies the constructs (factors) with the help of manifest variables.
• Helps in dimension reduction of data
• Useful to name the reduced dimensions
• It provides the hidden dimensions of group characteristics which cannot be
directly observed.
Limitations
• The analysis provides good results only if all the relevant variables
which measure the group characteristics are included in the study.
• In a situation where the majority of variables are highly related, the factor analysis may club them into one factor. This will not allow other factors to be identified in the model that might capture more useful relationships.
• Using factor analysis to construct psychological tests requires good domain knowledge for identifying and naming factors, because many times multiple variables can be highly related without any obvious reason.
Reference
• Application of Factor Analysis in Psychological Data -
Statistics and Research Methods in Psychology with Excel
• The influence of organizational climate on sustainable relationships
between organization and employees. The KION case study.
https://www.scienpress.com/Upload/AMAE/Vol%202_4_8.pdf
Cluster analysis
Introduction
• In factor analysis, we take several variables, examine how much variance these variables share and how much is unique, and then 'cluster' together variables that share common variance. In short, we cluster together variables that look as though they explain the same variance.
• Cluster analysis is a similar technique except that rather than trying to
group together variables, we are interested in grouping cases.
• Cluster analysis is a multivariate method used to classify or group objects
into relative groups called clusters.
• The goal is to find an optimal grouping for which the observations or
objects within each cluster are similar, but the clusters are dissimilar to
each other.
Introduction
• In cluster analysis, prior information about the group or cluster
membership for any of the objects is not known.
• So, in a sense it’s the opposite of factor analysis: instead of forming groups
of variables based on several people’s responses to those variables, we
instead group people based on their responses to several variables.
Application
• Population of interest: older adults with chronic pain
• Data on pain intensity, number of pain sites, anxiety, depression, and pain catastrophizing
were used as grouping variables.
• Results: Four major clusters were identified: Subgroup 1 (n = 325; 15%) – moderate pain and
high psychological symptoms; Subgroup 2 (n = 516; 22%) – high pain and moderate
psychological symptoms; Subgroup 3 (n = 686; 30%) – low pain and moderate psychological
symptoms; and Subgroup 4 (n = 767; 33%) – low pain and low psychological symptoms.
• Significant differences were found between the four clusters with regard to age, sex,
educational level, family status, quality of life, general health, insomnia, and health care costs.
• The findings indicate that subgroup-specific treatment will improve pain management and
reduce health care costs.
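A sketch of how such a clustering could be run in Python with k-means (the cited study used its own clustering procedure; k-means is just one common choice). The five grouping variables are simulated placeholders for pain intensity, number of pain sites, anxiety, depression and pain catastrophizing.

```python
# Cluster-analysis sketch: group cases on standardized grouping variables.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical grouping variables (one column per variable, one row per person).
data = rng.normal(size=(n, 5))

X = StandardScaler().fit_transform(data)              # put all variables on the same scale
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels = km.labels_                                    # cluster membership for each case
print(np.bincount(labels))                             # size of each of the four clusters
```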
Reference
• Distinctive subgroups derived by cluster analysis based on pain and
psychological symptoms in Swedish older adults with chronic pain – a
population study (PainS65+) Larsson et al.
https://bmcgeriatr.biomedcentral.com/articles/10.1186/s12877-017-0591-4
Multivariate analysis of
variance (MANOVA)
MANOVA
• ANOVA: ANOVA statistically tests the differences between three or
more group means. For example, if you have three different teaching
methods and you want to evaluate the average scores for these
groups, you can use ANOVA. However, ANOVA does have a drawback.
It can assess only one quantitative variable at a time. This limitation
can be an enormous problem in certain circumstances because it can
prevent you from detecting effects that actually exist.
• MANOVA: extends the capabilities of analysis of variance (ANOVA) by
assessing multiple quantitative variables simultaneously.
Example
• Suppose we are studying three different teaching methods for a
course. We also have student satisfaction scores and test scores.
These variables are the quantitative variables. We want to determine
whether the mean scores for satisfaction and tests differ between the
three teaching methods.
ANOVA versus MANOVA
• The graphs below display the scores by teaching method. One chart
shows the test scores and the other shows the satisfaction scores.
These plots represent how one-way ANOVA tests the data—one
quantitative variable at a time.
ANOVA versus MANOVA
• Both of these graphs appear to show that there is no association
between teaching method and either test scores or satisfaction
scores. The groups seem to be approximately equal.
• It is concluded that the teaching method is not related to either
satisfaction or test scores.
• But this way of interpreting the data has a problem.
ANOVA versus MANOVA
• Plotting the test and satisfaction scores on a scatterplot, with teaching method as the grouping variable, gives the graph below.
• This multivariate approach represents how MANOVA tests the data. These are the same data, but sometimes how you look at them makes all the difference.
ANOVA versus MANOVA
• The graph displays a positive correlation between
test scores and satisfaction.
• As student satisfaction increases, test scores tend
to increase as well.
• Moreover, for any given satisfaction score,
teaching method 3 tends to have higher test
scores than methods 1 and 2.
• In other words, students who are equally satisfied
with the course tend to have higher scores with
method 3.
• MANOVA can test this pattern statistically to help
ensure that it’s not present by chance.
When does MANOVA provide
benefits?
• Use multivariate ANOVA when the measured quantitative variables are
correlated.
• Greater statistical power: When the quantitative variables are correlated,
MANOVA can identify effects that are smaller than those that regular
ANOVA can find.
• Limits the joint error rate: When you perform a series of ANOVA tests
because you have multiple quantitative variables, the joint probability of
rejecting a true null hypothesis increases with each additional test. Instead,
if you perform one MANOVA test, the error rate equals the significance
level.
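A minimal sketch of a one-way MANOVA for the teaching-method example using statsmodels (assumed available). The test scores and satisfaction ratings are simulated placeholders; mv_test() reports the usual multivariate statistics (Wilks' lambda, Pillai's trace, etc.) for the effect of method on both outcomes jointly.

```python
# One-way MANOVA sketch: two quantitative outcomes, one grouping variable.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n_per_group = 30
df = pd.DataFrame({
    "method":       np.repeat(["m1", "m2", "m3"], n_per_group),   # teaching method
    "test_score":   rng.normal(70, 10, 3 * n_per_group),          # hypothetical scores
    "satisfaction": rng.normal(3.5, 0.8, 3 * n_per_group),        # hypothetical ratings
})

# Both quantitative variables appear on the left-hand side of the formula.
fit = MANOVA.from_formula("test_score + satisfaction ~ method", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, etc. for the method effect
```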
Reference
• http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.667.5616&rep=rep1&type=pdf
Discriminant analysis
Discriminant analysis
• Discriminant analysis is used as a tool for classification, dimension
reduction, and data visualization.
• Discriminant function analysis is used to determine which variables
discriminate between two or more naturally occurring groups.
• For example, an educational researcher may want to investigate which
variables discriminate between high school graduates who decide (1) to go
to college, (2) to attend a trade or professional school, or (3) to seek no
further training or education. For that purpose the researcher could collect
data on numerous variables prior to students' graduation. After graduation,
most students will naturally fall into one of the three categories.
Discriminant Analysis could then be used to determine which variable(s) are
the best predictors of students' subsequent educational choice.
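A minimal sketch of a linear discriminant analysis for a three-group problem like the educational-choice example, using scikit-learn. The four predictors and the group labels are random placeholders; the sketch only shows the two roles of discriminant analysis mentioned above: classification and reduction to (at most) two discriminant dimensions.

```python
# Linear discriminant analysis sketch: classify cases into 3 groups and
# project the data onto the discriminant axes for visualization.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))      # hypothetical pre-graduation predictors (e.g. grades, test scores)
y = rng.integers(0, 3, n)        # observed group: 0 = college, 1 = trade school, 2 = no further study (hypothetical coding)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

print(lda.predict(X[:5]))        # predicted group for the first five cases
Z = lda.transform(X)             # dimension reduction: 2 discriminant axes
print(Z.shape)                   # (300, 2) -- suitable for a 2-D plot of the groups
```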
What Is Dimensionality
Reduction?
• Multi-dimensional data is data that has multiple features which have a
correlation with one another. Dimensionality reduction simply means plotting
multi-dimensional data in just 2 or 3 dimensions.
• An alternative to dimensionality reduction is plotting the data using scatter
plots, boxplots, histograms, and so on. We can then use these graphs to identify
the pattern in the raw data.
• However, with charts, it is difficult for a layperson to make sense of the data that
has been presented. Moreover, if there are many features in the data,
thousands of charts will need to be analyzed to identify patterns.
• Dimensionality reduction algorithms solve this problem by plotting the data in 2
or 3 dimensions. This allows us to present the data explicitly, in a way that can
be understood by a layperson.
Assumptions of discriminant
analysis
• Multivariate normality: variables are normal for each level of the
grouping variable
• Homogeneity of variance/covariance (homoscedasticity): It is
assumed that the variance/covariance matrices of variables are
homogeneous across groups
• Multicollinearity: Predictive power can decrease with an increased
correlation between variables
• Independence: Participants are assumed to be randomly sampled,
and a participant's score on one variable is assumed to be
independent of scores on that variable for all other participants
Reference
• Use of Discriminant Analysis in Counseling Psychology Research
https://www2.clarku.edu/faculty/pbergmann/biostats/Betz%201987.pdf
Path analysis
Path analysis
• Path analysis is a statistical technique that is used to examine and test
purported causal relationships among a set of variables.
• A causal relationship is directional in character, and occurs when one
variable (e.g., amount of exercise) causes changes in another variable
(e.g., physical fitness).
• The researcher specifies these relationships according to a theoretical
model that is of interest to the researcher.
• The resulting path model and the results of the path analysis are
usually then presented together in the form of a path diagram.
Path analysis
• Typically path analysis involves the construction of a path diagram in which the
relationships between all variables and the causal direction between them are
specifically laid out.
• When conducting a path analysis, one might first construct an input path diagram,
which illustrates the hypothesized relationships. In a path diagram, researchers use
arrows to show how different variables relate to each other. An arrow pointing
from, say, Variable A to Variable B, shows that Variable A is hypothesized to
influence Variable B.
• After the statistical analysis has been completed, a researcher would then construct
an output path diagram, which illustrates the relationships as they actually exist,
according to the analysis conducted. If the researcher’s hypothesis is correct, the
input path diagram and output path diagram will show the same relationships
between variables.
Example
• Say you hypothesize that age has a direct effect on job satisfaction, and hypothesize
that it has a positive effect, such that the older one is, the more satisfied one will be
with their job.
• Certainly there are other independent variables that also influence our dependent variable, job satisfaction: for example, autonomy and income, among others.
• Using path analysis, a researcher can create a diagram that charts the relationships
between the variables. The diagram would show a link between age and autonomy,
and between age and income. Then, the diagram should also show the relationships
between these two sets of variables and the dependent variable: job satisfaction.
• After using a statistical program to evaluate these relationships, one can then
redraw the diagram to indicate the magnitude and significance of the relationships.
• For example, the researcher might find that both autonomy and income are related
to job satisfaction, that one of these two variables has a much stronger link to job
satisfaction than the other, or that neither variable has a significant link to job
satisfaction.
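In practice path models are usually fitted with dedicated structural-equation-modelling software, but for a simple recursive model like this one the path coefficients can be approximated by ordinary regressions on standardized variables. The sketch below uses statsmodels with simulated data; all variable values and coefficients are hypothetical.

```python
# Path-analysis sketch for the example: age -> autonomy, age -> income, and
# age + autonomy + income -> job satisfaction. Standardized regression
# coefficients act as the path coefficients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
age = rng.normal(40, 10, n)
autonomy = 0.3 * age + rng.normal(0, 5, n)        # hypothetical data-generating paths
income = 0.5 * age + rng.normal(0, 8, n)
satisfaction = 0.02 * age + 0.05 * autonomy + 0.03 * income + rng.normal(0, 1, n)

df = pd.DataFrame({"age": age, "autonomy": autonomy,
                   "income": income, "satisfaction": satisfaction})
z = (df - df.mean()) / df.std()   # standardize so coefficients are path coefficients

paths = {
    "age -> autonomy": smf.ols("autonomy ~ age", data=z).fit().params["age"],
    "age -> income":   smf.ols("income ~ age", data=z).fit().params["age"],
    "direct paths to satisfaction":
        smf.ols("satisfaction ~ age + autonomy + income", data=z).fit().params.iloc[1:],
}
print(paths)
```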
Reference
• Personality, Classroom Behavior, and Student Ratings of College Teaching Effectiveness: A Path Analysis
https://www.semanticscholar.org/paper/Personality-%2C-Classroom-Behavior-%2C-and-Student-of-%3A-Erdle-Murray/1e816dc96b2463e610bfa9426f0510ecdbdf46a7?p2df
Canonical correlation
Canonical correlation
• Canonical correlation analysis is a method to study linear relations between
two sets of variables, all measured on the same individual.
• Consider, as an example, variables related to exercise and health. On one
hand, you have variables associated with exercise, observations such as the
climbing rate on a stair stepper, how fast you can run a certain distance, the
amount of weight lifted on bench press, the number of push-ups per minute,
etc.
• On the other hand, you have variables that attempt to measure overall health,
such as blood pressure, cholesterol levels, glucose levels, body mass index,
etc.
• Two types of variables are measured and the relationships between the
exercise variables and the health variables are of interest.
Canonical correlation
• One approach to studying relationships between the two sets of
variables is to use canonical correlation analysis which describes the
relationship between the first set of variables and the second set of
variables.
• We do not necessarily think of one set of variables as independent and
the other as dependent, though that may potentially be another
approach.
• Discriminant analysis, MANOVA, and multiple regression are all special
cases of canonical correlation. It provides the most general multivariate
framework. Because of this generality, it is probably the least used of
the multivariate procedures
Canonical correlation
• Canonical correlation analysis can be used to address a wide range of objectives:
(a) to determine whether two sets of variables are independent of each other or,
on the other hand, how they are related
(b) to explain the nature of the relationship between two sets of variables by
assessing how each variable contributes to the extracted canonical functions.
• The essence of canonical correlation analysis is to form pairs of linear combinations of the predictor and criterion variables so as to maximize the correlation between each pair.
• Separate sets of coefficient weights are applied to the predictor and criterion variables to form the linear combinations.
• The canonical correlation itself is the correlation between the linear
combinations of predictors and criteria.
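A minimal sketch of canonical correlation analysis with scikit-learn's CCA. The exercise and health measurements below are random placeholders; with real data, the correlation between each pair of canonical variates is the canonical correlation described above.

```python
# Canonical correlation sketch between an "exercise" set and a "health" set of variables.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
exercise = rng.normal(size=(n, 4))   # e.g. climb rate, run time, bench press, push-ups (hypothetical)
health = rng.normal(size=(n, 4))     # e.g. blood pressure, cholesterol, glucose, BMI (hypothetical)

cca = CCA(n_components=2).fit(exercise, health)
U, V = cca.transform(exercise, health)         # pairs of canonical variates (linear combinations)

# Canonical correlations: correlation between each pair of linear combinations.
for k in range(2):
    r = np.corrcoef(U[:, k], V[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```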
Reference
• The Use of Canonical Correlation Analysis to Assess the Relationship
Between Executive Functioning and Verbal Memory in Older Adults
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5119795/