
Epidemiological Applications in Health Services Research

Introduction to Multivariate Analysis

Dr. Ibrahim Awad Ibrahim.


Areas to be addressed today

Introduction to variables and data

Simple linear regression

Correlation

Population covariance

Multiple regression

Canonical correlation

Discriminant analysis

Logistic regression

Survival analysis

Principal component analysis

Factor analysis

Cluster analysis
Types of variables (Stevens' classification, 1951)

Nominal
 distinct categories: race, religion, county, sex

Ordinal
 rankings: education, health status, smoking level

Interval
 equal differences between levels: time, temperature, blood glucose levels

Ratio
 interval with a natural zero: bone density, weight, height
Variables used in data analysis

Dependent: result, outcome
 developing CHD


Independent: explanatory
 Age, sex, diet, exercise


Latent constructs
 SES, satisfaction, health status


Measurable indicators
 education, employment, revisit, miles walked
Variables in data example

Name        # of characters  Position  Description
STFIPS      1                2         FIPS code (state)
STCENSUS    1                3
LEVEL       1                4
STABBREV    1                5
AREANAME    7                6         Name of US/state/county
POPULATION  7                13        1992 ABS ITEM002
xyz                          20
Data

Data screening and transformation

Normality

Independence

Correlation (or lack of independence)
Variable types and measures of central tendency

Nominal: mode

Ordinal: median

Interval: Mean

Ratio: Geometric mean and harmonic
mean
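A minimal sketch of these measures using Python's standard `statistics` module; the data values below are invented for illustration:

```python
import statistics

# Nominal (e.g., race coded as labels): only the mode is meaningful
races = ["white", "black", "white", "asian", "white"]
print(statistics.mode(races))              # most frequent category

# Ordinal (e.g., self-rated health on a 1-5 scale): median
health = [1, 2, 2, 3, 5]
print(statistics.median(health))

# Interval (e.g., blood glucose levels): arithmetic mean
glucose = [90.0, 105.0, 98.0, 110.0]
print(statistics.mean(glucose))

# Ratio (natural zero, e.g., weight in kg): geometric and harmonic means are valid
weights = [60.0, 75.0, 82.0, 90.0]
print(statistics.geometric_mean(weights))
print(statistics.harmonic_mean(weights))
```

For positive data the means always order as harmonic <= geometric <= arithmetic, which the last three lines illustrate.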
Simple linear regression

Y = A + BX

[Figure: scatterplot of Y against X with the fitted regression line.]
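The intercept A and slope B are estimated by ordinary least squares. A minimal sketch in plain Python; the age/blood-pressure pairing is an invented example:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for Y = A + BX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope B
    a = mean_y - b * mean_x       # intercept A
    return a, b

# Hypothetical data: age (X) and systolic blood pressure (Y)
ages = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 136, 144]
a, b = fit_line(ages, sbp)        # a ≈ 98.6, b ≈ 0.64
```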
Correlation

Mean: μ = Σx / N

Variance: σ² = (SD)² = Σ(x − μ)² / N

Population covariance: σxy = Σ(X − μx)(Y − μy) / N

Product moment coefficient: ρ = σxy / (σx σy)

It lies between -1 and 1
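The covariance and product moment (Pearson) coefficient above can be computed directly; a minimal sketch, with invented physical and mental health scores:

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation: covariance / (sd_x * sd_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Hypothetical physical and mental health scores
phys = [1, 2, 3, 4, 5]
ment = [2, 1, 4, 3, 5]
r = pearson_r(phys, ment)   # always lies between -1 and 1
```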
Example physical and mental health
indicators
Correlations

PHYSICAL MENTAL
PHYSICAL Pearson Correlation 1.000 .230**
Sig. (2-tailed) . .000
N 109888 109888
MENTAL Pearson Correlation .230** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Negative correlation

Correlations

WEIGHT AGEDIAB
WEIGHT Pearson Correlation 1.000 -.029**
Sig. (2-tailed) . .000
N 109888 109888
AGEDIAB Pearson Correlation -.029** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Population covariance

[Figure: four scatterplots illustrating correlations of ρ = 0.00, 0.33, 0.60, and 0.88.]
Multiple regression and correlation
Simple linear: Y = α + βX
Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

Example: ejection fraction (EF) as the outcome, with exercise and body fat as explanatory variables.
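A minimal sketch of multiple regression with two predictors, solving the normal equations directly; the EF, exercise, and body-fat numbers are invented:

```python
def fit_two_predictors(x1, x2, y):
    """OLS for Y = a + b1*X1 + b2*X2 via the normal equations (two predictors)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2          # Cramer's rule on the 2x2 system
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical: ejection fraction from exercise (hrs/wk) and body fat (%)
exercise = [0, 2, 4, 6, 8]
body_fat = [35, 30, 28, 22, 20]
ef = [45, 50, 55, 60, 66]
a, b1, b2 = fit_two_predictors(exercise, body_fat, ef)
```

With more than two predictors the same normal equations are solved by matrix inversion; statistical packages do this internally.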
Issues with regression

Missing values
 random
 pattern
 mean substitution and maximum likelihood (ML)

Dummy variables
 assume equal intervals!

Multicollinearity
 independent variables are highly correlated

Garbage can method
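One common screen for multicollinearity (not detailed on the slide) is the variance inflation factor; with exactly two predictors it reduces to VIF = 1 / (1 − r²), where r is their correlation. A minimal sketch with invented, nearly collinear predictors:

```python
import math

def vif_two(x1, x2):
    """Variance inflation factor for two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(v1 * v2)
    return 1.0 / (1.0 - r ** 2)

# Nearly collinear predictors inflate the VIF far above the common cutoff of 10
age = [30, 40, 50, 60, 70]
age_ish = [31, 41, 49, 61, 69]   # carries almost the same information as age
print(vif_two(age, age_ish))
```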
Canonical correlation

An extension of multiple regression

Multiple Y variables and multiple X
variables

Finding several linear combinations of the
X var and the same number of linear
combinations of the Y var.

These combinations are called canonical
variables and the correlations between the
corresponding pairs of canonical variables
are called CANONICAL CORRELATIONS
Correlation matrix
Correlations

         WTFORHTX GENHLTH PHYSHLTH MENTHLTH POORHLTH HLTHPLAN BPTAKE TOLDHI
WTFORHTX Pearson Correlation 1.000 .072** -.008** .016** -.005 .023** .011** .000
Sig. (2-tailed) . .000 .006 .000 .208 .000 .000 .903
N 109888 109888 109888 109888 54351 109888 108445 77436
GENHLTH Pearson Correlation .072** 1.000 -.228** -.061** -.147** .035** -.084** -.091**
Sig. (2-tailed) .000 . .000 .000 .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
PHYSHLTH Pearson Correlation -.008** -.228** 1.000 .223** .295** -.011** .083** .030**
Sig. (2-tailed) .006 .000 . .000 .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
MENTHLTH Pearson Correlation .016** -.061** .223** 1.000 -.120** -.038** .019** .014**
Sig. (2-tailed) .000 .000 .000 . .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
POORHLTH Pearson Correlation -.005 -.147** .295** -.120** 1.000 -.001 .055** .014**
Sig. (2-tailed) .208 .000 .000 .000 . .816 .000 .005
N 54351 54351 54351 54351 54351 54351 53754 38018
HLTHPLAN Pearson Correlation .023** .035** -.011** -.038** -.001 1.000 .152** .022**
Sig. (2-tailed) .000 .000 .000 .000 .816 . .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
BPTAKE Pearson Correlation .011** -.084** .083** .019** .055** .152** 1.000 .039**
Sig. (2-tailed) .000 .000 .000 .000 .000 .000 . .000
N 108445 108445 108445 108445 53754 108445 108445 77436
TOLDHI Pearson Correlation .000 -.091** .030** .014** .014** .022** .039** 1.000
Sig. (2-tailed) .903 .000 .000 .000 .005 .000 .000 .
N 77436 77436 77436 77436 38018 77436 77436 77436
**. Correlation is significant at the 0.01 level (2-tailed).
Discriminant analysis

A method used to classify an individual
in one of two or more groups based on a
set of measurements

Examples:
 at risk for
 heart disease
 cancer
 diabetes, etc.

It can be used for prediction and
description
Discriminant analysis

[Figure: two overlapping groups, A and B; points a and b fall on the wrong side of the dividing line.]

a and b are wrongly classified.

A discriminant function describes the probability of being classified in the right group.
Logistic regression

An alternative to discriminant analysis to
classify an individual in one of two
populations based on a set of criteria.

It is appropriate for any combination of
discrete or continuous variables

It uses the maximum likelihood
estimation to classify individuals based
on the independent variable list.
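A minimal sketch of maximum-likelihood logistic regression by gradient ascent, with one predictor; the exercise/CHD data are invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Maximum-likelihood fit of P(Y=1) = sigmoid(a + b*x) by gradient ascent."""
    a = b = 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)   # gradient of the log-likelihood
            ga += err
            gb += err * x
        a += lr * ga / len(xs)
        b += lr * gb / len(xs)
    return a, b

# Hypothetical: hours of exercise per week vs. presence of CHD (1 = yes)
hours = [0, 1, 2, 3, 4, 5, 6, 7]
chd = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = fit_logistic(hours, chd)

def classify(x):
    """Assign an individual to the CHD group if the fitted probability >= 0.5."""
    return 1 if sigmoid(a + b * x) >= 0.5 else 0
```

Because the outcome is modeled through probabilities rather than group means, the predictors can be any mix of discrete and continuous variables, as the slide notes.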
Survival analysis (event history
analysis)

Analyze the length of time it takes a
specific event to occur.

Time for death, organ failure, retirement,
etc.

Length of time is modeled as a function of the explanatory variables (covariates).
Survival data example

[Figure: follow-up lines for subjects from 1980 to 1990; three end in death, one is lost to follow-up, and one is still surviving (censored) at the end of the study.]
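The survival curve implied by data like these can be estimated with the Kaplan-Meier product-limit method (not named on the slide); the follow-up times below are invented, with censoring covering both the "lost" and "surviving" subjects:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate; events[i] = 1 for death, 0 for censored."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []                      # (time, S(t)) at each observed death time
    for i in order:
        if events[i] == 1:
            survival *= (at_risk - 1) / at_risk
            curve.append((times[i], survival))
        at_risk -= 1                # censored subjects leave the risk set too
    return curve

# Hypothetical follow-up times in years from 1980 enrollment
times = [2, 4, 6, 7, 10]
events = [1, 1, 1, 0, 0]   # last two: lost to follow-up / still alive (censored)
curve = kaplan_meier(times, events)   # [(2, 0.8), (4, 0.6), (6, 0.4)] approx.
```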
Log-linear regression

A regression model in which the
dependent variable is the log of survival
time (t) and the independent variables
are the explanatory variables.

Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

Log-linear: log(t) = α + β1X1 + β2X2 + β3X3 + . . . + βpXp + e


Cox proportional hazards model

Another method to model the relationship between
survival time and a set of explanatory variables.

The proportion of the population who die up to time (t) is the shaded area.

[Figure: survival distribution from 1980 to 1990 with the area up to time t shaded.]


Cox proportional hazards model

The hazard function h(t) at time (t) is assumed proportional between groups 1 and 2, so that the ratio h1(t)/h2(t) is constant over time.
Principal component analysis

Aimed at simplifying the description of a set
of interrelated variables.

All variables are treated equally.

You end up with uncorrelated new variables
called principal components.

Each one is a linear combination of the
original variables.

The measure of the information conveyed
by each is the variance.

The PCs are arranged in descending order of the variance explained.
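For two variables the principal components can be worked out by hand from the 2×2 covariance matrix, whose eigenvalues are the variances of the two PCs. A minimal sketch with invented height/weight data:

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix = variances of the two PCs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]]
    mean_v = (sxx + syy) / 2
    delta = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mean_v + delta, mean_v - delta   # PC1 variance >= PC2 variance

# Correlated measurements (e.g., height in cm and weight in kg, invented values)
height = [150, 160, 170, 180, 190]
weight = [55, 60, 68, 75, 83]
lam1, lam2 = pca_2d(height, weight)
share1 = lam1 / (lam1 + lam2)   # proportion of total variance explained by PC1
```

Note that the total variance is preserved: the two eigenvalues sum to the sum of the original variances, which is why the "% of variance explained" columns in package output add to 100%.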
Principal component analysis

A general rule is to select PCs each explaining at least 5% of the variance, but you can use a higher cutoff for parsimony.

Theory should guide this selection of
cutoff point.

Sometimes it is used to alleviate
multicollinearity.
Factor analysis

The objective is to understand the
underlying structure explaining the
relationship among the original variables.

We use the factor loading of each of the
variables on the factors generated to
determine the usability of a certain
variable.

It is guided again by theory as to what are
the structures depicted by the common
factors encompassing the selected
variables.
Factor analysis
Total Variance Explained

                 Initial Eigenvalues                 Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %    Total   % of Variance   Cumulative %
1 1.699 16.986 16.986 1.699 16.986 16.986
2 1.663 16.629 33.614 1.663 16.629 33.614
3 1.108 11.083 44.697 1.108 11.083 44.697
4 1.035 10.351 55.048 1.035 10.351 55.048
5 .908 9.077 64.125
6 .881 8.808 72.933
7 .834 8.338 81.271
8 .788 7.879 89.150
9 .571 5.714 94.865
10 .514 5.135 100.000
Extraction Method: Principal Component Analysis.
Factor analysis
Component Matrix

Component
1 2 3 4
GENHLTH .450 .207 -.150 -.552
PHYSHLTH -.770 .254 -3.31E-03 -.208
MENTHLTH .652 -.232 -6.74E-02 .353
POORHLTH -.612 6.329E-02 -1.03E-02 .110
BPTAKE -.128 .352 -.465 .474
BLOODCHO 6.411E-02 .335 -.563 .158
SEATBELT .166 .697 .242 .222
SFTYLT16 .137 .676 .447 .188
BIKEHLMT .156 .414 .210 -.299
SMOKENOW -.112 -.382 .495 .356
Extraction Method: Principal Component Analysis.
a. 4 components extracted.
Cluster analysis

A classification method for individuals into
previously unknown groups

It proceeds from the most general to the most
specific:

Kingdom: Animalia
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Patient clustering

Major: patients
Type: medical
Subtype: neurological
Class: genetic
Order: late onset
Disease: Guillain-Barré syndrome

Hierarchical clustering: divisive or agglomerative
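The agglomerative approach starts with every individual in its own cluster and repeatedly merges the closest pair. A minimal single-linkage sketch on one-dimensional data; the lab values are invented:

```python
def agglomerative(points, k):
    """Single-linkage agglomerative clustering down to k clusters (1-D sketch)."""
    clusters = [[p] for p in points]          # start with every point alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical lab values forming two natural groups
values = [1.0, 1.2, 1.1, 8.0, 8.3, 7.9]
groups = agglomerative(values, 2)
```

Divisive clustering runs the other way, from one all-inclusive cluster down to singletons; the sequence of merges (or splits) is what the dendrogram displays.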
Conclusions
Presentation Schedule

4 each on 4/22 and 4/27

5 on 4/29

Each presentation should be a maximum of 10 minutes, with 5 minutes for discussion.

E-mail me your requirements of software
and hardware for your presentation.

Final projects due 5/7/99 by 5:00 pm in
my office.
Presentation Schedule 1

Date Time Who


4/22 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
Presentation Schedule 2

Date Time Who


4/27 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
2:01 - 2:15
Presentation Schedule 3

Date Time Who


4/29 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
