
Multivariate Data Analysis
SETIA PRAMANA
Course Outline
Introduction
Overview of Multivariate data analysis
The applications
Matrix Algebra And Random Vectors
Sample Geometry
Multivariate Normal Distribution
Inference About A Mean Vector
Comparison of Several Mean Vectors

Setia Pramana SURVIVAL DATA ANALYSIS 2


Course Outline
Principal Component Analysis
Factor Analysis
Cluster Analysis
Discriminant Analysis
Canonical Correlations


Course Workload
40% Theory, 60% practice
Group Project (4 students)
Group Presentation in ENGLISH every week
The main software is R; other software is allowed
R code will be provided
Slides are available at: http://www.slideshare.net/hafidztio/


Reference Books


Intermezzo
http://www.youtube.com/watch?v=zRsMEl6PHhM&list=PLFE776F2C513A744E
http://tylervigen.com/
Data Types
Type of Analysis
What is Multivariate?
Univariate Analysis?
Some describe it as: any statistical technique used to analyze data
that arises from more than one variable
Multivariable vs. Multivariate Analysis
http://www.youtube.com/watch?v=KhA_PCMPZZo
Example of MV Data
Other Examples?
What is Multivariate Data Analysis?
The statistical analysis of the data collected on more than one
(response) variable.
We want to analyze them simultaneously
The variables may be correlated with each other
The dependence is taken into account
More complex than univariate analysis
In the real world, most data are multivariate data
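The point that the variables may be correlated can be illustrated with a small sketch. It computes the sample Pearson correlation between two variables measured on the same subjects. The function name `pearson_correlation` and the height and weight values are hypothetical, chosen only for illustration.

```python
import math


def pearson_correlation(x, y):
    """Sample Pearson correlation between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements on five subjects (made-up numbers)
height_cm = [150.0, 160.0, 165.0, 172.0, 180.0]
weight_kg = [50.0, 56.0, 61.0, 68.0, 77.0]

r = pearson_correlation(height_cm, weight_kg)
print(f"sample correlation: {r:.3f}")
```

A correlation close to 1 or -1 means the two variables carry overlapping information, which is why analyzing them separately can be misleading.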
Basic Statistical Analysis for Data Mining
Types of MVA
Exploratory Data Analysis (EDA): Sometimes called data mining, this area is useful for gaining
deeper insights into large, complex data sets.
Regression analysis: Develops models to predict new and future events; useful for predictive
analytics applications.
Classification: Identifies new or existing classes; useful in research,
development, market analysis, etc.
MVD objectives
1. Data reduction or structural simplification. To simplify the data without
losing any valuable information and to make interpretation easier.
2. Sorting and grouping. Similar objects or variables are grouped
based on their characteristics. Rules are defined for classifying objects
into well-defined groups.
3. Investigation of the dependence among variables. The nature of
the relationships among variables is of interest. Are all the
variables mutually independent, or do some depend on others?
4. Prediction. Relationships between variables must be
determined for the purpose of predicting the values of one or
more variables on the basis of observations on the other
variables.
5. Hypothesis construction and testing. Specific statistical
hypotheses are formulated and tested.
Examples of Multivariate Data
http://www.youtube.com/watch?v=eEpxN0htRKI
Software
1. SAS
2. R
3. SPSS
4. Herodes
5. etc.
Applications
Petrochemical and refining operations, including early fault detection and
gasoline blending and optimisation
Food and beverage applications, particularly for consumer segmentation and
new product development
Agricultural analysis, including real-time analysis of protein and moisture in
wheat, barley and other crops
Business Intelligence and marketing for predicting changes in dynamic markets
or better product placement
Oil and gas and mining, including analysis of machinery performance and
locating new sources of commodities
Applications
Data reduction or simplification
Using data on several variables related to cancer patient responses to
radiotherapy, a simple measure of patient response to radiotherapy was
constructed.
Multispectral image data collected by a high-altitude scanner were reduced to a
form that could be viewed as images (pictures) of a shoreline in two dimensions.
Data on several variables relating to yield and protein content were used to
create an index to select parents of subsequent generations of improved bean
plants.
Applications
Sorting and grouping
Data on several variables related to computer use were employed to create
clusters of categories of computer jobs that allow a better determination of
existing (or planned) computer utilization.
Measurements of several physiological variables were used to develop a
screening procedure that discriminates alcoholics from nonalcoholics.
Data related to responses to visual stimuli were used to develop a rule for
separating people suffering from a multiple-sclerosis-caused visual pathology
from those not suffering from the disease.
Applications
Investigation of the dependence among variables
Data on several variables were used to identify factors that were responsible
for client success in hiring external consultants.
Measurements of variables related to innovation, on the one hand, and variables related to the
business environment and business organization, on the other hand, were used
to discover why some firms are product innovators and some firms are not.
Measurements of pulp fiber characteristics and subsequent measurements of
characteristics of the paper made from them are used to examine the relations
between pulp fiber properties and the resulting paper properties. The goal is to
determine those fibers that lead to higher quality paper.
Applications
Prediction
The associations among test scores, several high school performance variables,
and several college performance variables were used to develop predictors of success in
college.
Data on several variables related to the size distribution of sediments were used to
develop rules for predicting different depositional environments.
Measurements on several accounting and financial variables were used to develop a
method for identifying potentially insolvent property-liability insurers.
cDNA microarray experiments (gene expression data) are increasingly used to study
the molecular variations among cancer tumors. A reliable classification of tumors is
essential for successful diagnosis and treatment of cancer.
Applications
Hypothesis testing
Several pollution-related variables were measured to determine whether
levels for a large metropolitan area were roughly constant throughout the week,
or whether there was a noticeable difference between weekdays and weekends.
Experimental data on several variables were used to see whether the nature of
the instructions makes any difference in perceived risks, as quantified by test
scores.
Data on many variables were used to investigate the differences in structure of
American occupations to determine the support for one of two competing
sociological theories.
Other Applications?
In groups, discuss multivariate data on:
1. Biomedical
2. Economic
3. Government Policy
4. Health
5. Social
6. Demography
7. Business
8. Telecommunication
9. Education
10. Psychology
Data Structure
Descriptive Statistics
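As a minimal sketch of the descriptive statistics usually computed for multivariate data, the following assumes the data are stored as an n x p matrix (n observations in rows, p variables in columns) and computes the sample mean vector and the sample covariance matrix. The function names, the divisor n - 1, and the small data set are assumptions made for illustration.

```python
def mean_vector(data):
    """Mean of each column of an n x p data matrix (list of rows)."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]


def covariance_matrix(data):
    """Sample covariance matrix (p x p), using the divisor n - 1 (an assumption)."""
    n = len(data)
    p = len(data[0])
    xbar = mean_vector(data)
    # Subtract the column means from every observation
    centered = [[row[j] - xbar[j] for j in range(p)] for row in data]
    return [
        [sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]


# Hypothetical data: 3 observations on 2 variables (made-up numbers)
data = [
    [2.0, 1.0],
    [4.0, 3.0],
    [6.0, 2.0],
]

xbar = mean_vector(data)
S = covariance_matrix(data)
print("mean vector:", xbar)
print("covariance matrix:", S)
```

The diagonal of the covariance matrix holds the variances of the individual variables; the off-diagonal entries hold the covariances between pairs of variables.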
Visualization: Two-Dim Scatter Plots
Visualization: Growth Curves
Visualization: Stars
Visualization: Chernoff Faces
Chernoff Faces
Visualizations
Other Visualizations
Distance
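A minimal sketch of two distance measures between observations, assuming the topic covers the ordinary Euclidean distance and a variance-scaled distance in which each squared difference is divided by the variance of that variable. The function names, the points, and the variances are hypothetical, and the scaled version shown here ignores correlations between the variables.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def scaled_distance(x, y, variances):
    """Distance with each squared difference divided by that variable's variance.

    Assumes the variables are uncorrelated (a simplification).
    """
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))
    )


# Hypothetical points and variances (made-up numbers)
point_p = [0.0, 0.0]
point_q = [3.0, 4.0]
variances = [9.0, 4.0]

d_euclid = euclidean_distance(point_p, point_q)
d_scaled = scaled_distance(point_p, point_q, variances)
print("Euclidean distance:", d_euclid)
print("variance-scaled distance:", d_scaled)
```

Scaling by the variances means that a difference in a highly variable coordinate counts for less than the same difference in a coordinate that varies little.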
Next Week: Matrix Algebra
