
Metabolomic Data Analysis

Advanced Strategies for Metabolomic Data Analysis

Dmitry Grapov, PhD

Analysis at the Metabolomic Scale

Multivariate Analysis
[Figure: data matrix with samples as rows and variables as columns]

Multivariate Analysis
Simultaneous analysis of many variables
Visualization, clustering, projection, modeling, networks

Clustering
Identify patterns, group structure, and relationships
Evaluate/refine hypotheses
Reduce complexity
Artist: Chuck Close

Cluster Analysis
Use the concept of similarity/dissimilarity to group a collection of samples or variables
Approaches:
hierarchical (HCA)
non-hierarchical (k-NN, k-means)
distribution (mixture models)
density (DBSCAN)
self-organizing maps (SOM)
[Figure panels: linkage, k-means, distribution, density]

Hierarchical Cluster Analysis


Similarity/dissimilarity defines nearness or distance
Distance measures: Euclidean, Manhattan, Mahalanobis, non-Euclidean
[Figure: distance illustrations; axes X and Y]
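The distance measures named above can be compared on a small example. This is a minimal sketch, assuming NumPy is available; the data values and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def euclidean(x, y):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def mahalanobis(x, y, cov):
    """Distance rescaled by the inverse covariance of the data."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical measurements: rows = samples, columns = two variables.
data = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]])
a, b = data[0], data[3]
cov = np.cov(data, rowvar=False)

d_euclid = euclidean(a, b)
d_manhat = manhattan(a, b)
d_mahal = mahalanobis(a, b, cov)
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is a quick way to sanity-check the implementation.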

Hierarchical Cluster Analysis


Agglomerative/linkage algorithm defines how points are grouped
Linkage methods: single, complete, centroid, average
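How the linkage rule changes the distance between two clusters can be sketched as a naive agglomerative procedure. This is an illustrative sketch only, assuming NumPy and Euclidean distances; the function names and example points are hypothetical, centroid linkage is left out for brevity, and a dedicated clustering library would normally be used on real data.

```python
import numpy as np

def cluster_distance(dist, a, b, linkage):
    """Distance between clusters a and b (lists of sample indices)."""
    pairwise = dist[np.ix_(a, b)]
    if linkage == "single":      # nearest pair of members
        return pairwise.min()
    if linkage == "complete":    # farthest pair of members
        return pairwise.max()
    if linkage == "average":     # mean over all member pairs
        return pairwise.mean()
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # Euclidean distance matrix
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(dist, clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical points forming two well-separated groups.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
groups = {
    method: agglomerate(points, 2, linkage=method)
    for method in ("single", "complete", "average")
}
```

On well-separated groups like these, all three linkage rules recover the same two clusters; the rules tend to diverge on elongated or overlapping groups.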

Hierarchical Cluster Analysis (cont.)

[Figure; axis label: Similarity]

Hierarchical Cluster Analysis (cont.)


How does my metadata match my data structure?

Overview

Confirmation

Multidimensional Scaling

PLoS ONE 7(11): e48852. doi:10.1371/journal.pone.0048852

Projection of Data

The algorithm defines the position of the light source


Principal Components Analysis (PCA): unsupervised, maximizes variance (X)
Partial Least Squares Projection to Latent Structures (PLS): supervised, maximizes covariance (Y ~ X)

PCA: Goals
Principal Components (PCs): unsupervised projections of the data that maximize the variance explained

Results
1. eigenvalues = variance explained

2. scores = new coordinates for samples (rows)


3. loadings = weights of the original variables (columns) in the linear combination that defines each PC
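The three results listed above can be obtained from a singular value decomposition of the centered data. This is a minimal sketch, assuming NumPy and column centering before decomposition; the random data and the function name `pca` are hypothetical.

```python
import numpy as np

def pca(X):
    """PCA via SVD of the centered data; returns (eigenvalues, scores, loadings)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)   # variance captured by each PC
    scores = U * s                            # new coordinates for samples (rows)
    loadings = Vt.T                           # variable weights, one column per PC
    return eigenvalues, scores, loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # hypothetical 20 samples x 5 variables
eigenvalues, scores, loadings = pca(X)
explained = eigenvalues / eigenvalues.sum()   # fraction of variance per PC
```

The eigenvalues come out sorted from largest to smallest, so `explained` can be read directly as a scree plot.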

James X. Li, 2009, VisuMap Tech.

Interpreting PCA Results


Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings

PCA Example
[Figure labels: glucose, 219021]
*no scaling or centering

How are scores and loadings related?
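One way to state the relationship, as a sketch in standard PCA notation (the symbols are assumptions, not taken from the slides): let X be the data matrix after whatever centering or scaling was applied, T the scores, and P the loadings, with n samples.

```latex
T = X P, \qquad X = T P^{\top}, \qquad P^{\top} P = I,
\qquad \lambda_k = \frac{t_k^{\top} t_k}{n - 1}
```

Each score is a loading-weighted sum of a sample's variables, and when X is column-centered, the eigenvalue of component k is the variance of its score vector t_k. Truncating T and P to the first few components gives a low-rank approximation of X.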

Centering and Scaling

van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7: 142.

Data scaling is very important!


[Figure labels: glucose (clinical), glucose (GC/TOF), 219021]
*autoscaling (centered and scaled to unit variance)
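The effect of autoscaling on PCA can be sketched with two variables that carry the same signal on very different numeric scales. This is an illustrative example, assuming NumPy; the simulated data and function names are hypothetical and are not the data shown in the figure.

```python
import numpy as np

def autoscale(X):
    """Center each column to mean 0 and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_loading(X):
    """Loadings of the first principal component of column-centered X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(1)
# Hypothetical data: one underlying signal recorded in small units
# and again in large units (for example, raw detector counts).
signal = rng.normal(size=50)
small = signal + 0.1 * rng.normal(size=50)
large = 1e5 * (signal + 0.1 * rng.normal(size=50))
X = np.column_stack([small, large])

raw = np.abs(first_loading(X))                 # dominated by the large-scale column
scaled = np.abs(first_loading(autoscale(X)))   # both columns weigh equally
```

Without scaling, the first component is almost entirely the large-magnitude variable; after autoscaling, both variables contribute equally.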

Use PLS to test a hypothesis


time = 0 vs. 120 min.

Loadings on the first latent variable (x-axis) can be used to interpret the multivariate changes in metabolites which are correlated with time
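The first latent variable of a single-response PLS model can be sketched directly from its definition as the direction in X with maximal covariance with the response. This is a minimal sketch, assuming NumPy; the simulated metabolites, the variable `minutes`, and the function name are hypothetical, and a full PLS implementation would extract further components and add validation.

```python
import numpy as np

def pls_first_component(X, y):
    """First PLS latent variable for a single response y.

    The weight vector w is the unit vector maximizing cov(Xc @ w, yc).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)    # X weights
    t = Xc @ w                   # sample scores on the latent variable
    p = Xc.T @ t / (t @ t)       # X loadings
    return w, t, p

rng = np.random.default_rng(2)
minutes = np.repeat([0.0, 120.0], 10)   # hypothetical sampling times
X = np.column_stack([
    0.01 * minutes + rng.normal(scale=0.2, size=20),    # increases with time
    -0.01 * minutes + rng.normal(scale=0.2, size=20),   # decreases with time
    rng.normal(scale=0.2, size=20),                     # unrelated to time
])
w, t, p = pls_first_component(X, minutes)
```

The signs of the weights and loadings indicate which metabolites rise or fall with time, and their magnitudes indicate how strongly each contributes.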

Modeling multifactorial relationships


~two-way ANOVA
dynamic changes among groups

goodness of the model is all about the perspective

Determine in-sample (Q2) and out-of-sample (RMSEP) error and compare to a random model:
permutation tests
training/testing
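The validation steps above can be sketched as follows. This is an illustrative example, assuming NumPy; an ordinary least-squares model stands in for PLS to keep the code short, Q2 is computed by leave-one-out cross-validation, and the simulated data, split sizes, and function names are all hypothetical.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Least-squares linear model (a stand-in for PLS in this sketch)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        pred = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction on held-out samples."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Share of label-permuted models with Q2 at least as high as the real one."""
    rng = np.random.default_rng(seed)
    observed = q2_loo(X, y)
    null = [q2_loo(X, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(q >= observed for q in null)) / (1 + n_perm)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.3, size=40)

train, test = np.arange(30), np.arange(30, 40)   # training/testing split
q2 = q2_loo(X[train], y[train])
error = rmsep(y[test], fit_predict(X[train], y[train], X[test]))
p_value = permutation_p_value(X[train], y[train], n_perm=100)
```

A model is only convincing if its Q2 clearly exceeds the Q2 values obtained after shuffling the response, and if its error on the held-out samples is in line with the cross-validated estimate.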

Biological Interpretation
Projection or mapping of analysis results into a biological context.
Visualization, enrichment, networks
Network types: biochemical, structural, spectral, empirical

Ingredients for Network Analysis


1. Determine connections
biochemical (substrate/product)
chemical similarity
spectral similarity
empirical dependency (correlation)

2. Determine vertex properties
magnitude
importance
direction
relationships

Making Connections Based on Biochemistry


Organism-specific biochemical relationships and information
Multiple-organism DBs: KEGG, BioCyc, Reactome
Human: HMDB, SMPDB

Biochemical Networks

Making Connections Based on Structural Similarity


Use structure to generate a molecular fingerprint
Calculate similarities between metabolites based on fingerprints
PubChem service for similarity calculations:
http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
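The fingerprint comparison step can be sketched with the Tanimoto coefficient, a common choice for binary fingerprints. This is a minimal sketch in plain Python; the fingerprints, metabolite names, and the 0.5 cutoff are hypothetical, and the example does not call the PubChem service.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical fingerprints: indices of the substructure bits that are set.
fingerprints = {
    "metabolite_A": {1, 4, 7, 9},
    "metabolite_B": {1, 4, 7, 12},
    "metabolite_C": {2, 30},
}
names = sorted(fingerprints)
similarity = {
    (m, n): tanimoto(fingerprints[m], fingerprints[n])
    for m in names for n in names
}
# Keep only pairs at or above a hypothetical cutoff as network edges.
edges = [(m, n) for (m, n), s in similarity.items() if m < n and s >= 0.5]
```

Here metabolites A and B share three of five distinct bits (similarity 0.6) and are connected, while C shares none and stays unconnected.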

BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99

online tools
http://uranus.fiehnlab.ucdavis.edu:8080/MetaMapp/homePage

Structural Similarity Network

Making Connections Based on Spectral Similarity


Connect molecules based on EI or MS/MS spectral similarity
Useful for linking annotated analytes (known) to unknowns
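Spectral similarity is often scored with a cosine measure between peak lists. The sketch below assumes unit-mass spectra stored as dictionaries of m/z to intensity; the spectra and function name are hypothetical, and published methods typically use refined variants of this score rather than the plain cosine shown here.

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine of the angle between two spectra given as {m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical unit-mass spectra (integer m/z -> intensity).
known = {73: 100.0, 147: 40.0, 205: 25.0}
unknown = {73: 90.0, 147: 45.0, 205: 20.0, 319: 5.0}
unrelated = {57: 100.0, 71: 60.0}

sim_close = cosine_similarity(known, unknown)   # near 1: candidate edge
sim_far = cosine_similarity(known, unrelated)   # 0: no shared peaks
```

An unknown whose spectrum scores close to 1 against an annotated analyte becomes a neighbor of that analyte in the network.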

Watrous J et al. PNAS 2012;109:E1743-E1752

Spectral Similarity Network

Watrous J et al. PNAS 2012;109:E1743-E1752

Making Connections Based on Empirical Relationships


Connect molecules based on strength of correlation or partial correlation
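The difference between correlation and partial correlation edges can be sketched on a simulated chain of three variables. This is an illustrative example, assuming NumPy; the data, the 0.4 cutoff, and the function names are hypothetical, and partial correlations are taken from the inverse of the correlation matrix, which requires more samples than variables.

```python
import numpy as np

def partial_correlation(X):
    """Partial correlations from the inverse of the correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def edges_from(matrix, names, cutoff):
    """List variable pairs whose absolute association meets the cutoff."""
    n = len(names)
    return [(names[i], names[j])
            for i in range(n) for j in range(i + 1, n)
            if abs(matrix[i, j]) >= cutoff]

rng = np.random.default_rng(4)
# Hypothetical chain A -> B -> C: A and C are related only through B.
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)
c = b + 0.5 * rng.normal(size=500)
X = np.column_stack([a, b, c])
names = ["A", "B", "C"]

cor_edges = edges_from(np.corrcoef(X, rowvar=False), names, cutoff=0.4)
pcor_edges = edges_from(partial_correlation(X), names, cutoff=0.4)
```

Plain correlation connects all three variables, while partial correlation removes the indirect A to C edge and keeps only the direct links.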

Treatment Effects Network


Metabolites:
shape = increase/decrease
size = importance (loading)
color = correlation

Connections:
red = biochemical relationships
violet = structural similarity

Summary
Multivariate analysis is useful for:
Visualization
Exploration and overview
Complexity reduction
Identification of multidimensional relationships and trends
Mapping to networks
Generating holistic summaries of findings

Resources
Mapping tools (review)
Brief Bioinform (2012) doi: 10.1093/bib/bbs055

Tutorials and Examples


http://imdevsoftware.wordpress.com/category/uncategorized/
https://github.com/dgrapov/TeachingDemos

Chemical Translation Services
CTS: http://cts.fiehnlab.ucdavis.edu/
R-interface: https://github.com/dgrapov/CTSgetR
CIR: http://cactus.nci.nih.gov/chemical/structure
R-interface: https://github.com/dgrapov/CIRgetR