Micro Array Data Analysis 06

Microarray Data Analysis
The Bioinformatics side of the bench
The anatomy of your data files from Affymetrix array analysis
.DAT = image file (~10^7 pixels)
.CEL = measured cell intensities
.CDF = cell description file (identifies probe sets and probe pairs)
.CHP = calculated probe set data
.RPT = report generated from .CHP
Quality Control (QC) of the chip: visual inspection

Look at the .DAT file or the .CHP file image
Scratches? Spots?
Corners and outside border: checkerboard appearance (B2 oligo)
Positive hybridization control
Used by software to place grid over image
Array name is written out in oligos!
Chip defects
Internal controls
B. subtilis genes (added poly-A tails)
Assessment of quality of sample preparation
Also used as hybridization controls
Hybridization controls (bioB, bioC, bioD, cre)
E. coli and P1 bacteriophage biotin-labeled cRNAs
Spiked into the hybridization cocktail
Assess hybridization efficiency
Actin and GAPDH assess RNA sample/assay quality
Compare signal values from the 3' end to signal values from the 5' end; the ratio generally should not exceed 3
Percent genes present (%P)
Replicate samples should have similar %P values
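The 3'/5' ratio check for the Actin and GAPDH controls can be sketched in a few lines. This is only an illustration: the signal values are made up, and the cutoff of 3 is the rule of thumb quoted on the slide, not a hard limit.

```python
def three_to_five_ratio(signal_3prime, signal_5prime):
    """Return the 3'/5' signal ratio for one control gene."""
    if signal_5prime <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3prime / signal_5prime

# Hypothetical probe set signals, for illustration only
controls = {
    "GAPDH": {"3prime": 5200.0, "5prime": 4100.0},
    "Actin": {"3prime": 6100.0, "5prime": 1500.0},
}
MAX_RATIO = 3.0  # rule of thumb from the slide

qc_flags = {}
for gene, sig in controls.items():
    ratio = three_to_five_ratio(sig["3prime"], sig["5prime"])
    qc_flags[gene] = ratio > MAX_RATIO  # True means "worth a closer look"
    status = "check sample quality" if qc_flags[gene] else "ok"
    print(f"{gene}: 3'/5' ratio = {ratio:.2f} ({status})")
```

A flagged ratio points toward degraded RNA or inefficient labeling, so it is a prompt to inspect the sample rather than an automatic rejection.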
Microarray Data Process/Outline

1. Experimental Design
2. Image Analysis: scan to intensity measures (raw data)
3. Normalization: clean data
4. More low-level analysis: fold change, ANOVA, data filtering
5. Data mining: how to interpret > 6000 measures
   Databases
   Software
   Techniques: clustering, pattern recognition, etc.
   Comparing to prior studies, across platforms?
6. Validation
Experimental Design
A good microarray design has 4 elements:
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of biological materials should minimize systematic bias
3. Simple and statistically sound arrangement that minimizes cost and gains maximal information
4. Compliance with MIAME (Minimum Information About a Microarray Experiment)
The goal of statistics is to find signals in a sea of noise.
The goal of experimental design is to reduce the noise so signals can be found with as small a sample size as possible.
Observational Study vs. Designed Experiment

Observational study: the investigator is a passive observer who measures variables of interest but does not attempt to influence the responses.
Designed experiment: the investigator intervenes in the natural course of events.
What type is our DMSO experiment?
Experimental Replicates
Why?
In any experimental system there is a certain amount of noise, so even 2 identical processes yield slightly different results.
Sources?
To understand how much variation there is, it is necessary to repeat an experiment a number of independent times.
Replicates allow us to use statistical tests to ascertain whether the differences we see are real.
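To make the role of replicates concrete, here is a minimal sketch of a Welch t statistic computed from replicate measurements of one gene. The log2 signal values are hypothetical, and Welch's test is one reasonable choice rather than the test prescribed by the course. Turning the statistic into a p-value needs the t distribution, which a statistics package would supply.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic (group_b minus group_a) for two sets of replicates."""
    # Standard error built from each group's own sample variance
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se

# Hypothetical log2 signals for one gene, three biological replicates each
control = [8.0, 8.1, 7.9]
treated = [9.1, 9.0, 9.2]

t_stat = welch_t(control, treated)
print(f"difference in means: {mean(treated) - mean(control):.2f}")
print(f"Welch t statistic:   {t_stat:.2f}")
```

With a single array per condition the variance terms cannot be estimated at all, which is why no test is possible without replicates.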
Technical vs. Biological Replicates
As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise.
Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible.
Low level data analysis / pre-processing

Varying biological or cellular composition among sample types
Differences in sample preparation, labeling or hybridization
Non-specific cross-hybridization of target to probes
These lead to systematic differences between individual arrays.
Pre-processing: raw data quality control, scaling, normalization and filtering.
Image Analysis - Raw Data
From probe level signals to gene abundance estimates

The job of the expression summary algorithm is to take a set of Perfect Match (PM) and MisMatch (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset.
To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array.
It is the .CEL files that are analyzed by the expression calling algorithm.
MAS 5.0 output files

For each transcript (gene) on the chip:
signal intensity
a present or absent call (presence call)
p-value (significance value) for making that call
Each gene associated with GenBank accession number (NCBI database)
How are transcripts determined to be present or absent?
Probe pair (PM vs. MM) intensities

generate a detection p-value
assign Present, Absent, or Marginal call for transcript
Every probe pair in a probe SET has a potential vote for presence call
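The "voting" idea can be sketched as follows. Each probe pair contributes a discrimination score R = (PM - MM) / (PM + MM), and a one-sided Wilcoxon signed-rank test asks whether the scores sit above a small threshold tau. The values tau = 0.015, alpha1 = 0.04 and alpha2 = 0.06 are commonly cited defaults and are assumptions here, as are the intensities. This is a simplified illustration, not the MAS 5.0 implementation.

```python
from itertools import product

def signed_rank_p(values):
    """Exact one-sided Wilcoxon signed-rank p-value (alternative: median > 0)."""
    vals = [v for v in values if v != 0]  # zeros carry no sign information
    n = len(vals)
    order = sorted(range(n), key=lambda i: abs(vals[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute values
        j = i
        while j + 1 < n and abs(vals[order[j + 1]]) == abs(vals[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, vals) if v > 0)
    # Under the null hypothesis every sign pattern is equally likely
    hits = sum(
        1 for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return hits / 2 ** n

def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (call, p_value) with call in 'P', 'M', 'A'."""
    scores = [(p - m) / (p + m) for p, m in zip(pm, mm)]
    p_value = signed_rank_p([r - tau for r in scores])
    if p_value < alpha1:
        return "P", p_value
    if p_value < alpha2:
        return "M", p_value
    return "A", p_value

# Hypothetical intensities for one probe set with 8 probe pairs
pm = [1200.0, 950.0, 800.0, 1800.0, 1500.0, 700.0, 1100.0, 1300.0]
mm = [300.0, 200.0, 250.0, 600.0, 400.0, 300.0, 350.0, 500.0]

call, p = detection_call(pm, mm)
print(f"detection call: {call}, p-value: {p:.4f}")
```

The exact enumeration is only practical for the small number of probe pairs in a probe set, which is the case it is meant for.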
PM and MM Probes
The purpose of each MM probe is to provide a direct measure of background and stray-signal (perhaps due to cross-hybridization) for its perfect-match partner. In most situations the signal from each probe-pair is simply the difference PM - MM. For some probe-pairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
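A small sketch of the PM - MM difference shows where the "impossible" case arises. The intensities are hypothetical. MAS 5.0 handles such pairs by substituting an adjusted mismatch value, and that calculation is not reproduced here.

```python
# Hypothetical intensities for four probe pairs of one probe set
pm = [1200.0, 950.0, 400.0, 1800.0]
mm = [300.0, 1000.0, 150.0, 600.0]

# Naive background-corrected signal per probe pair
diffs = [p - m for p, m in zip(pm, mm)]

# Probe pairs where the mismatch probe is brighter than the perfect match
problem_pairs = [i for i, d in enumerate(diffs) if d < 0]

for i, d in enumerate(diffs):
    note = "  <- MM exceeds PM" if i in problem_pairs else ""
    print(f"pair {i}: PM - MM = {d:8.1f}{note}")
```

A negative difference cannot be log-transformed, which is one practical reason the summary algorithms treat these pairs specially.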
Thank goodness for software!!!

MAS 5.0 does these calculations for you
.CHP file
Basic analysis in MAS 5.0, but it won't handle replicates
Import MAS 5.0 (.CHP) data into other software: Genesifter, GCOS, SpotFire, and many others
Signal Intensity
Following these calculations, the MAS 5.0 algorithm now has a measure of the signal for each probe in a probeset.
Other algorithms (e.g. RMA, GCRMA, dCHIP, PLIER and others) have been developed by academic teams to improve the precision and accuracy of this calculation.
In our experiment we will use RMA and GCRMA.
How do we want to analyze this data?

Pairwise analysis is most appropriate
Control vs. DMSO
List of genes that are upregulated or downregulated
Determine fold up or down cutoffs
What is significant?
1.5 fold up/down? 2 fold up/down? 10 fold up/down?
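The effect of the cutoff choice can be seen in a short sketch. Fold changes are compared on a log2 scale so that up and down changes are symmetric. The gene names and signal values are hypothetical, and no cutoff shown here is endorsed as the right one.

```python
import math

def classify(control, treated, fold_cutoff=2.0):
    """Label a gene 'up', 'down' or 'unchanged' for a given fold cutoff."""
    log_fc = math.log2(treated / control)
    threshold = math.log2(fold_cutoff)
    if log_fc >= threshold:
        return "up"
    if log_fc <= -threshold:
        return "down"
    return "unchanged"

# Hypothetical (control, DMSO) signal values
genes = {
    "geneA": (100.0, 400.0),  # 4-fold up
    "geneB": (800.0, 200.0),  # 4-fold down
    "geneC": (500.0, 800.0),  # 1.6-fold up
}

for cutoff in (1.5, 2.0, 10.0):
    calls = {name: classify(c, t, cutoff) for name, (c, t) in genes.items()}
    print(f"cutoff {cutoff:>4}: {calls}")
```

Note how geneC moves in and out of the gene list depending on the cutoff, which is why fold change alone is usually paired with a statistical test.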
Normalization - clean data

Normalizing data allows comparisons ACROSS different chips
Intensity of fluorescent markers might be different from one batch to the other.
Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION.
Why Normalize Data?

The experimental goal is to identify biological variation (expression changes between samples).
Technical variation can hide the real data.
Unavoidable systematic bias should be recognized and corrected.
Normalization is necessary to effectively make comparisons between chips, and sometimes within a single chip.
There are different methods of normalization; the assumptions about where variation exists will determine the normalization technique used.
Always look at data before and after normalization

Spike in controls can help show which method may be best
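One of the simplest normalization methods, global scaling, can be sketched as below: every chip is multiplied by a factor so that its mean intensity equals a common target. The target of 500 and the intensities are assumptions chosen for illustration, and this is only one of the several methods mentioned above. It assumes that overall brightness differences between chips are technical rather than biological.

```python
from statistics import mean

def scale_to_target(intensities, target=500.0):
    """Scale one chip so that its mean intensity equals the target."""
    factor = target / mean(intensities)
    return [x * factor for x in intensities]

# Hypothetical chips: chip_b was scanned twice as bright as chip_a
chip_a = [100.0, 200.0, 300.0, 400.0]
chip_b = [200.0, 400.0, 600.0, 800.0]

scaled_a = scale_to_target(chip_a)
scaled_b = scale_to_target(chip_b)

print("before:", mean(chip_a), mean(chip_b))
print("after: ", mean(scaled_a), mean(scaled_b))
```

Printing the means before and after is the numeric version of "always look at data before and after normalization"; in practice box plots or scatter plots do the same job visually.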
Caveat
There is NO standard way to analyze microarray data.
Still figuring out how to get the best answers from microarray experiments.
Best to combine knowledge of biology, statistics, and computers to get answers.
Venn Diagrams
[Figure: Venn diagrams comparing gene lists from MAS 5.0, RMA and GCRMA]
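The overlaps a Venn diagram displays are plain set operations on the gene lists produced by each algorithm. The sketch below uses made-up gene labels, so the counts mean nothing beyond showing the mechanics.

```python
# Hypothetical lists of genes called "changed" by each algorithm
mas5 = {"g1", "g2", "g3", "g4"}
rma = {"g2", "g3", "g5"}
gcrma = {"g3", "g4", "g5", "g6"}

all_three = mas5 & rma & gcrma            # center of the Venn diagram
mas5_only = mas5 - rma - gcrma            # found by MAS 5.0 alone
rma_and_gcrma_only = (rma & gcrma) - mas5  # shared by the two RMA variants only

print("all three algorithms:", sorted(all_three))
print("MAS 5.0 only:        ", sorted(mas5_only))
print("RMA and GCRMA only:  ", sorted(rma_and_gcrma_only))
```

Genes in the three-way intersection are usually the safest candidates to carry forward to validation.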
Data processing is completed. Now what?
Fold change, ANOVA, Data filtering
Where are we now?

Ran analysis, output is a GENE LIST
List indicates what genes are up or down regulated
p values for t-test
Graphs of signal levels
Absolute numbers not as important here as the trends you see
Now what????
What is the first set of genes on our chips that will be filtered out?
Follow the links

Click on a gene
Find links to other databases
Follow links to discover what the protein does
Now the fun part begins.
Back to Biology
Do the changes you see in gene expression make sense BIOLOGICALLY?
If they don't make sense, can you hypothesize as to why those genes might be changing?
Leads to many, many more experiments
The Gene Ontologies

A Common Language for Annotation of Genes from Yeast, Flies and Mice and Plants and Worms
and Humans
and anything else!
Gene Ontology Objectives

GO represents concepts used to classify specific parts of our biological knowledge:
Biological Process
Molecular Function
Cellular Component
GO develops a common language applicable to any organism.
GO terms can be used to annotate gene products from any species, allowing comparison of information across species.
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.) -Wired Magazine, May 1996
The 3 Gene Ontologies

Molecular Function = elemental activity/task
the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective

broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex

subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
Example: Gene Product = hammer

Function (what)              Process (why)
Drive nail (into wood)       Carpentry
Drive stake (into soil)      Gardening
Smash roach                  Pest Control
Clown's juggling object      Entertainment
Biological Examples
Biological Process Molecular Function Cellular Component
Validation
Not enough to just do microarrays
Usually validate microarray results via some other technique:
RT-PCR
TaqMan
Northern analysis
Protein level analysis
No technique is perfect
Yeast Genome and Data Mining
Dynamic Nature of Yeast Genome

eORF = essential
kORF = known
hORF = homology identified
shORF = short
tORF = transposon identified
qORF = questionable
dORF = disabled
First published sequence claimed 6274 genes, a number that has been revised many times. Why?
6603 4373 1410 820
The Affy detection oligonucleotide sequences are frozen at the time of synthesis. How does this impact downstream data analysis?
Terms, Definitions, IDs

term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
definition_reference: PMID:9561267
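The four fields of this GO entry map naturally onto a small record type. The sketch below stores the entry from the slide and checks that the identifier has the usual "GO:" plus seven digits form. The class name GOTerm and the validation step are illustrative choices, not part of any GO software.

```python
import re
from dataclasses import dataclass

# GO identifiers are written as "GO:" followed by seven digits
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for one GO term entry."""
    term: str
    goid: str
    definition: str
    definition_reference: str

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.goid):
            raise ValueError(f"not a valid GO identifier: {self.goid!r}")

mating_cascade = GOTerm(
    term="MAPKKK cascade (mating sensu Saccharomyces)",
    goid="GO:0007244",
    definition=("MAPKKK cascade involved in transduction of mating "
                "pheromone signal, as described in Saccharomyces"),
    definition_reference="PMID:9561267",
)

print(mating_cascade.goid, "->", mating_cascade.term)
```

Because the identifier, not the term name, is the stable key, gene lists are best annotated and compared by goid.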
SGD
SGD public microarray data sets available for public query
Homework
1. Go to http://www.yeastgenome.org/ and find 3 candidate genes of known function and one of undefined function that you might predict to be altered by DMSO treatment.
2. What GO biological processes and molecular mechanisms are associated with your candidate genes?
3. Where, subcellularly, does the protein reside in the cell?
4. What other proteins are known or inferred to interact with yours? How was this interaction determined? Is this a genetic or physical interaction?
5. Find the expression of at least one of your known genes in another publicly deposited microarray data set.
   1. Name of data set and how you found it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3 reasons why the observed transcript level difference may not be confirmed through a second technology like RT-qPCR?
Suggested Reading
