
Epidemiological Applications in Health Services Research

Introduction to Multivariate Analysis

Dr. Ibrahim Awad Ibrahim.


Areas to be addressed today

Introduction to variables and data

Simple linear regression

Correlation

Population covariance

Multiple regression

Canonical correlation

Discriminant analysis

Logistic regression

Survival analysis

Principal component analysis

Factor analysis

Cluster analysis
Types of variables (Stevens' classification, 1951)

Nominal
 distinct categories: race, religion, county, sex

Ordinal
 rankings: education, health status, smoking level

Interval
 equal differences between levels: time, temperature, blood glucose levels

Ratio
 interval with a natural zero: bone density, weight, height
Variables used in data analysis

Dependent: result, outcome
 developing CHD


Independent: explanatory
 Age, sex, diet, exercise


Latent constructs
 SES, satisfaction, health status


Measurable indicators
 education, employment, revisit, miles walked
Variables in data example

Name        # of characters  Position  Description
STFIPS      1                2         FIPS code (state)
STCENSUS    1                3
LEVEL       1                4
STABBREV    1                5
AREANAME    7                6         Name of US/state/county
POPULATION  7                13        1992 ABS ITEM002
xyz                          20
Data

Data screening and transformation

Normality

Independence

Correlation (or lack of independence)
Variable types and measures of central tendency

Nominal: mode

Ordinal: median

Interval: Mean

Ratio: Geometric mean and harmonic
mean
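A minimal sketch of these measures using Python's standard `statistics` module; the data values below are invented for illustration:

```python
import statistics

# Nominal (e.g., race coded as labels): only the mode is meaningful
races = ["white", "black", "white", "asian", "white"]
print(statistics.mode(races))              # most frequent category

# Ordinal (e.g., self-rated health on a 1-5 scale): median
health = [1, 2, 2, 3, 5]
print(statistics.median(health))

# Interval (e.g., blood glucose levels): arithmetic mean
glucose = [90.0, 105.0, 98.0, 110.0]
print(statistics.mean(glucose))

# Ratio (natural zero, e.g., weight in kg): geometric and harmonic means are valid
weights = [60.0, 75.0, 82.0, 90.0]
print(statistics.geometric_mean(weights))
print(statistics.harmonic_mean(weights))
```

For positive data the means always order as harmonic <= geometric <= arithmetic, which the last three lines illustrate.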
Simple linear regression

Y = A + BX

[Figure: scatterplot of Y against X with the fitted regression line.]
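The intercept A and slope B are estimated by ordinary least squares. A minimal sketch in plain Python; the age/blood-pressure pairing is an invented example:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for Y = A + BX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope B
    a = mean_y - b * mean_x       # intercept A
    return a, b

# Hypothetical data: age (X) and systolic blood pressure (Y)
ages = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 136, 144]
a, b = fit_line(ages, sbp)        # a ≈ 98.6, b ≈ 0.64
```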
Correlation

Mean: μ = Σx / N

Variance: σ² = (SD)² = Σ(x − μ)² / N

Population covariance: σxy = Σ(X − μx)(Y − μy) / N

Product moment coefficient: ρ = σxy / (σx σy)

It lies between -1 and 1
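The covariance and product moment (Pearson) coefficient above can be computed directly; a minimal sketch, with invented physical and mental health scores:

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation: covariance / (sd_x * sd_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Hypothetical physical and mental health scores
phys = [1, 2, 3, 4, 5]
ment = [2, 1, 4, 3, 5]
r = pearson_r(phys, ment)   # always lies between -1 and 1
```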
Example physical and mental health
indicators
Correlations

PHYSICAL MENTAL
PHYSICAL Pearson Correlation 1.000 .230**
Sig. (2-tailed) . .000
N 109888 109888
MENTAL Pearson Correlation .230** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Negative correlation

Correlations

WEIGHT AGEDIAB
WEIGHT Pearson Correlation 1.000 -.029**
Sig. (2-tailed) . .000
N 109888 109888
AGEDIAB Pearson Correlation -.029** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Population covariance

[Figure: four scatterplots illustrating correlations of ρ = 0.00, 0.33, 0.60, and 0.88.]
Multiple regression and correlation
Simple linear: Y = α + βX
Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

Example: ejection fraction (EF) as the outcome, with exercise and body fat as explanatory variables.
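A minimal sketch of multiple regression with two predictors, solving the normal equations directly; the EF, exercise, and body-fat numbers are invented:

```python
def fit_two_predictors(x1, x2, y):
    """OLS for Y = a + b1*X1 + b2*X2 via the normal equations (two predictors)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2          # Cramer's rule on the 2x2 system
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical: ejection fraction from exercise (hrs/wk) and body fat (%)
exercise = [0, 2, 4, 6, 8]
body_fat = [35, 30, 28, 22, 20]
ef = [45, 50, 55, 60, 66]
a, b1, b2 = fit_two_predictors(exercise, body_fat, ef)
```

With more than two predictors the same normal equations are solved by matrix inversion; statistical packages do this internally.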
Issues with regression

Missing values
 random
 pattern
 mean substitution and maximum likelihood (ML)

Dummy variables
 assume equal intervals!

Multicollinearity
 independent variables are highly correlated

Garbage can method
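One common screen for multicollinearity (not detailed on the slide) is the variance inflation factor; with exactly two predictors it reduces to VIF = 1 / (1 − r²), where r is their correlation. A minimal sketch with invented, nearly collinear predictors:

```python
import math

def vif_two(x1, x2):
    """Variance inflation factor for two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(v1 * v2)
    return 1.0 / (1.0 - r ** 2)

# Nearly collinear predictors inflate the VIF far above the common cutoff of 10
age = [30, 40, 50, 60, 70]
age_ish = [31, 41, 49, 61, 69]   # carries almost the same information as age
print(vif_two(age, age_ish))
```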
Canonical correlation

An extension of multiple regression

Multiple Y variables and multiple X
variables

Finding several linear combinations of the
X var and the same number of linear
combinations of the Y var.

These combinations are called canonical
variables and the correlations between the
corresponding pairs of canonical variables
are called CANONICAL CORRELATIONS
Correlation matrix
Correlations

         WTFORHTX GENHLTH PHYSHLTH MENTHLTH POORHLTH HLTHPLAN BPTAKE TOLDHI
WTFORHTX Pearson Correlation 1.000 .072** -.008** .016** -.005 .023** .011** .000
Sig. (2-tailed) . .000 .006 .000 .208 .000 .000 .903
N 109888 109888 109888 109888 54351 109888 108445 77436
GENHLTH Pearson Correlation .072** 1.000 -.228** -.061** -.147** .035** -.084** -.091**
Sig. (2-tailed) .000 . .000 .000 .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
PHYSHLTH Pearson Correlation -.008** -.228** 1.000 .223** .295** -.011** .083** .030**
Sig. (2-tailed) .006 .000 . .000 .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
MENTHLTH Pearson Correlation .016** -.061** .223** 1.000 -.120** -.038** .019** .014**
Sig. (2-tailed) .000 .000 .000 . .000 .000 .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
POORHLTH Pearson Correlation -.005 -.147** .295** -.120** 1.000 -.001 .055** .014**
Sig. (2-tailed) .208 .000 .000 .000 . .816 .000 .005
N 54351 54351 54351 54351 54351 54351 53754 38018
HLTHPLAN Pearson Correlation .023** .035** -.011** -.038** -.001 1.000 .152** .022**
Sig. (2-tailed) .000 .000 .000 .000 .816 . .000 .000
N 109888 109888 109888 109888 54351 109888 108445 77436
BPTAKE Pearson Correlation .011** -.084** .083** .019** .055** .152** 1.000 .039**
Sig. (2-tailed) .000 .000 .000 .000 .000 .000 . .000
N 108445 108445 108445 108445 53754 108445 108445 77436
TOLDHI Pearson Correlation .000 -.091** .030** .014** .014** .022** .039** 1.000
Sig. (2-tailed) .903 .000 .000 .000 .005 .000 .000 .
N 77436 77436 77436 77436 38018 77436 77436 77436
**. Correlation is significant at the 0.01 level (2-tailed).
Discriminant analysis

A method used to classify an individual
in one of two or more groups based on a
set of measurements

Examples:
 at risk for
 heart disease
 cancer
 diabetes, etc.

It can be used for prediction and
description
Discriminant analysis

[Figure: two overlapping groups, A and B; points a and b fall on the wrong side of the dividing line.]

a and b are wrongly classified.

A discriminant function describes the probability of being classified in the right group.
Logistic regression

An alternative to discriminant analysis to
classify an individual in one of two
populations based on a set of criteria.

It is appropriate for any combination of
discrete or continuous variables

It uses the maximum likelihood
estimation to classify individuals based
on the independent variable list.
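A minimal sketch of maximum-likelihood logistic regression by gradient ascent, with one predictor; the exercise/CHD data are invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Maximum-likelihood fit of P(Y=1) = sigmoid(a + b*x) by gradient ascent."""
    a = b = 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)   # gradient of the log-likelihood
            ga += err
            gb += err * x
        a += lr * ga / len(xs)
        b += lr * gb / len(xs)
    return a, b

# Hypothetical: hours of exercise per week vs. presence of CHD (1 = yes)
hours = [0, 1, 2, 3, 4, 5, 6, 7]
chd = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = fit_logistic(hours, chd)

def classify(x):
    """Assign an individual to the CHD group if the fitted probability >= 0.5."""
    return 1 if sigmoid(a + b * x) >= 0.5 else 0
```

Because the outcome is modeled through probabilities rather than group means, the predictors can be any mix of discrete and continuous variables, as the slide notes.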
Survival analysis (event history
analysis)

Analyze the length of time it takes a
specific event to occur.

Time for death, organ failure, retirement,
etc.

Length of time is modeled as a function of the explanatory variables (covariates).
Survival data example

[Figure: follow-up lines for subjects from 1980 to 1990; three end in death, one is lost to follow-up, and one is still surviving (censored) at the end of the study.]
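The survival curve implied by data like these can be estimated with the Kaplan-Meier product-limit method (not named on the slide); the follow-up times below are invented, with censoring covering both the "lost" and "surviving" subjects:

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate; events[i] = 1 for death, 0 for censored."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []                      # (time, S(t)) at each observed death time
    for i in order:
        if events[i] == 1:
            survival *= (at_risk - 1) / at_risk
            curve.append((times[i], survival))
        at_risk -= 1                # censored subjects leave the risk set too
    return curve

# Hypothetical follow-up times in years from 1980 enrollment
times = [2, 4, 6, 7, 10]
events = [1, 1, 1, 0, 0]   # last two: lost to follow-up / still alive (censored)
curve = kaplan_meier(times, events)   # [(2, 0.8), (4, 0.6), (6, 0.4)] approx.
```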
Log-linear regression

A regression model in which the
dependent variable is the log of survival
time (t) and the independent variables
are the explanatory variables.

Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

Log-linear: log(t) = α + β1X1 + β2X2 + β3X3 + . . . + βpXp + e


Cox proportional hazards model

Another method to model the relationship between
survival time and a set of explanatory variables.

The proportion of the population who die up to time (t) is the shaded area.

[Figure: survival distribution from 1980 to 1990 with the area up to time t shaded.]


Cox proportional hazards model

The hazard function h(t) at time (t) is assumed proportional between groups 1 and 2, so that the ratio h1(t)/h2(t) is constant over time.
Principal component analysis

Aimed at simplifying the description of a set
of interrelated variables.

All variables are treated equally.

You end up with uncorrelated new variables
called principal components.

Each one is a linear combination of the
original variables.

The measure of the information conveyed
by each is the variance.

The PCs are arranged in descending order of the variance explained.
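For two variables the principal components can be worked out by hand from the 2×2 covariance matrix, whose eigenvalues are the variances of the two PCs. A minimal sketch with invented height/weight data:

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix = variances of the two PCs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]]
    mean_v = (sxx + syy) / 2
    delta = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mean_v + delta, mean_v - delta   # PC1 variance >= PC2 variance

# Correlated measurements (e.g., height in cm and weight in kg, invented values)
height = [150, 160, 170, 180, 190]
weight = [55, 60, 68, 75, 83]
lam1, lam2 = pca_2d(height, weight)
share1 = lam1 / (lam1 + lam2)   # proportion of total variance explained by PC1
```

Note that the total variance is preserved: the two eigenvalues sum to the sum of the original variances, which is why the "% of variance explained" columns in package output add to 100%.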
Principal component analysis

A general rule is to select PCs each explaining at least 5% of the variance, but you can use a higher cutoff for parsimony.

Theory should guide this selection of
cutoff point.

Sometimes it is used to alleviate
multicollinearity.
Factor analysis

The objective is to understand the
underlying structure explaining the
relationship among the original variables.

We use the factor loading of each of the
variables on the factors generated to
determine the usability of a certain
variable.

It is guided again by theory as to what are
the structures depicted by the common
factors encompassing the selected
variables.
Factor analysis
Total Variance Explained

                 Initial Eigenvalues                 Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %    Total   % of Variance   Cumulative %
1 1.699 16.986 16.986 1.699 16.986 16.986
2 1.663 16.629 33.614 1.663 16.629 33.614
3 1.108 11.083 44.697 1.108 11.083 44.697
4 1.035 10.351 55.048 1.035 10.351 55.048
5 .908 9.077 64.125
6 .881 8.808 72.933
7 .834 8.338 81.271
8 .788 7.879 89.150
9 .571 5.714 94.865
10 .514 5.135 100.000
Extraction Method: Principal Component Analysis.
Factor analysis
Component Matrix

Component
1 2 3 4
GENHLTH .450 .207 -.150 -.552
PHYSHLTH -.770 .254 -3.31E-03 -.208
MENTHLTH .652 -.232 -6.74E-02 .353
POORHLTH -.612 6.329E-02 -1.03E-02 .110
BPTAKE -.128 .352 -.465 .474
BLOODCHO 6.411E-02 .335 -.563 .158
SEATBELT .166 .697 .242 .222
SFTYLT16 .137 .676 .447 .188
BIKEHLMT .156 .414 .210 -.299
SMOKENOW -.112 -.382 .495 .356
Extraction Method: Principal Component Analysis.
a. 4 components extracted.
Cluster analysis

A classification method for individuals into
previously unknown groups

It proceeds from the most general to the most
specific:

Kingdom: Animalia
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Patient clustering

Major: patients
Type: medical
Subtype: neurological
Class: genetic
Order: late onset
Disease: Guillain-Barré syndrome

Hierarchical clustering: divisive or agglomerative
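The agglomerative approach starts with every individual in its own cluster and repeatedly merges the closest pair. A minimal single-linkage sketch on one-dimensional data; the lab values are invented:

```python
def agglomerative(points, k):
    """Single-linkage agglomerative clustering down to k clusters (1-D sketch)."""
    clusters = [[p] for p in points]          # start with every point alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical lab values forming two natural groups
values = [1.0, 1.2, 1.1, 8.0, 8.3, 7.9]
groups = agglomerative(values, 2)
```

Divisive clustering runs the other way, from one all-inclusive cluster down to singletons; the sequence of merges (or splits) is what the dendrogram displays.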
Conclusions
Presentation Schedule

4 each on 4/22 and 4/27

5 on 4/29

Each presentation should be a maximum of 10 minutes, with 5 minutes for discussion.

E-mail me your requirements of software
and hardware for your presentation.

Final projects due 5/7/99 by 5:00 pm in
my office.
Presentation Schedule 1

Date Time Who


4/22 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
Presentation Schedule 2

Date Time Who


4/27 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
2:01 - 2:15
Presentation Schedule 3

Date Time Who


4/29 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
