
Data Projections & Visualization

Student Eng.:
Maria-Alexandra MATEI
Introduction - Dimensionality Reduction

• Reduce complexity
• Visual
• Computational

• Identify the intrinsic dimensionality of data

• Identify the most relevant aspects of data given a task


The Curse of Dimensionality

[Figure: lower dimension vs. higher dimension]
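
One common way to illustrate the curse of dimensionality (not necessarily what the slide's figure shows) is distance concentration: as the number of dimensions grows, the nearest and farthest points from a query end up at almost the same distance. The sketch below assumes uniformly random points in the unit cube; the function name `relative_contrast` and the sample sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform samples in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast shrinks as the dimension grows, so "nearest" becomes less meaningful.
for d in (2, 10, 100, 1000):
    print(d, relative_contrast(500, d))
```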
Data Projections

• Not all projections are equal

[Figure: two example projections, a) and b)]
Data Projections

• Desired properties
• Reduced, compressed representation
• Preserved useful/intrinsic properties of the data
• Amplify patterns of interest (e.g. outliers)
• Simple, interpretable

• Trade-off between simplicity and preservation of structure


Distance Function

• Helps us organize the data

• Helps us discriminate patterns


Distance Functions
• Manhattan distance (1-norm, taxicab distance)

• Euclidean distance (2-norm)
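
A minimal sketch of the two distances, assuming points are given as equal-length numeric sequences. The function names are illustrative.

```python
import numpy as np

def manhattan(x, y):
    """1-norm distance: sum of absolute coordinate differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(np.abs(diff)))

def euclidean(x, y):
    """2-norm distance: square root of the sum of squared differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

a, b = [0.0, 0.0], [3.0, 4.0]
print(manhattan(a, b))  # 7.0 (3 + 4)
print(euclidean(a, b))  # 5.0 (3-4-5 right triangle)
```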


Distance Functions

L-p Distance

• As p grows, the largest coordinate difference tends to dominate the global distance
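
The effect of p can be checked numerically. The sketch below assumes the usual L-p definition, (sum of |x_i - y_i|^p)^(1/p). The example vectors are made up for illustration.

```python
import numpy as np

def lp_distance(x, y, p):
    """L-p distance between two points for a finite p >= 1."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

x = [0.0, 0.0, 0.0]
y = [1.0, 2.0, 5.0]   # the largest coordinate difference is 5

# The distance shrinks toward the largest coordinate difference as p grows.
for p in (1, 2, 4, 16):
    print(p, lp_distance(x, y, p))
```

With these vectors the L-1 distance is 8, and by p = 16 the result is already very close to 5, the largest single coordinate difference.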
Data Projections

• Projective methods: preserve a property of data


• Principal Component Analysis (PCA)
• Many others: ICA, Factor Analysis, etc.

• Manifold Learning
• Multidimensional Scaling (MDS)
• LLE, Isomap
Principal Component Analysis
Goal: Find a linear projection that captures most of the variance

[Figure: "Student Grades", two scatter plots of Stat vs. Phys grades (both axes 0 to 100), marking the 1st and 2nd principal components]
Principal Component Analysis

PCA pseudocode:

Center the data by subtracting the mean
Calculate the covariance matrix of the centered data
Calculate the eigenvectors (principal components) of the covariance matrix
Select the top few (2-3) eigenvectors, i.e. those with the highest eigenvalues
Project the data using these eigenvectors as axes
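
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a production implementation: the function name `pca` and the small `grades` array (Phys, Stat columns) are made up for the example and are not the data shown on the slides.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # center each feature
    cov = np.cov(X_centered, rowvar=False)     # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]     # top eigenvectors as columns
    return X_centered @ components, components, eigvals

# Hypothetical (Phys, Stat) grades, invented for illustration only.
grades = np.array([[55, 60], [70, 72], [80, 85], [45, 50], [90, 88], [65, 70]], dtype=float)
projected, components, eigvals = pca(grades, n_components=1)
print(projected.shape)            # one coordinate per student
print(eigvals / eigvals.sum())    # share of variance per component
```

The ratio `eigvals / eigvals.sum()` is the quantity a scree plot displays: it shows how much variance each component captures, which helps decide how many components to keep.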
PCA on Iris Dataset

[Figure: scree plot and biplot]
Multidimensional Scaling

Goal: Find a lower-dimensional embedding of the data that preserves pairwise distances

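Formally: given the input distances between every pair of points, find low-dimensional points whose pairwise distances match them as closely as possible.

• Input distance values: the pairwise distances in the original data

• Output distance values: the pairwise distances in the embedding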
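
One concrete variant is classical (Torgerson) MDS, which recovers coordinates from a distance matrix through an eigendecomposition. The sketch below uses that variant as an assumption: the slides do not say which MDS algorithm they have in mind, and the function names and the four example points are illustrative.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS: embed points given a matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale

def pairwise_distances(Y):
    """Euclidean distance between every pair of rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Input distances computed from four made-up 2-D points.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D_in = pairwise_distances(points)
Y = classical_mds(D_in, n_components=2)
D_out = pairwise_distances(Y)                  # output distances in the embedding
print(np.allclose(D_in, D_out))
```

Plotting the entries of `D_in` against the matching entries of `D_out` gives a Shepard diagram: the closer the points lie to the diagonal, the better the embedding preserves the input distances.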


MDS Projection of US Capitals
Goodness of MDS Solution

[Figure: Shepard diagram, plotting data distances against MDS distances]
Conclusions
More features are not necessarily better

Understand the assumptions of different modeling choices

When choosing distance functions and projection methods:
• Consider the characteristics of the data
• Consider the learning objective

Explore multiple choices simultaneously to gain better insight
