David S. Guttman
Department of Cell & Systems Biology
Centre for the Analysis of Genome Evolution & Function
University of Toronto
Integrating
Next-Generation Sequencing
and Bioinformatics
into Your Research
Lessons Learned
Integrating NGS into Your Research
Bioinformatics in the wet lab
Experimental design
Data analysis
Data presentation
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Specialized vs cross-trained
– Finding & keeping people
Specialized “Pet” Bioinformatician
Pros
• High expertise
• Focused training
Cons
• Poor integration with group
• Isolated from field
Cross-Trained Bioinformatician
Pros
• Highly integrated with group
• Well-rounded training
Cons
• Time is finite
• Jack of all trades, master of none
Finding & keeping people
• Focus
• Training
• Credit
[Cartoon captions: “I like python more than pythons”; “HELLO… is there anyone out there???”; “Sigh… Stuck in the middle again.” (Lorem Ipsum 12(3):45-67)]
Experimental Design
Collecting the right data
– Sample structure
– Sample processing
– Sequence quantity & quality
– Sequence storage & security
[Caption, truncated: “… need?”]
Sample structure
• Power analysis
• Sources of variation
• Metadata
Power analysis
• Sample size
• Statistical significance level (type I error)
• Power (type II error)
• Effect size
“How many independent tests?”
“Bonferroni, who??”
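The four quantities on the power-analysis slide are linked: fixing any three determines the fourth. A minimal sketch of that relationship, assuming a two-sided, two-sample comparison and the normal approximation (an illustrative choice, not a method stated on the slides). The function name and defaults are hypothetical, and the Bonferroni step simply divides the significance level by the number of independent tests.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80, n_tests=1):
    """Approximate samples needed per group (normal approximation).

    effect_size: standardized difference between group means (Cohen's d).
    alpha:       overall type I error rate.
    power:       1 - type II error rate.
    n_tests:     number of independent tests (Bonferroni-corrected).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    alpha_per_test = alpha / n_tests              # Bonferroni correction
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_per_test / 2)   # two-sided critical value
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    print(samples_per_group(0.5))                # one test
    print(samples_per_group(0.5, n_tests=1000))  # many tests need more samples
```

The approximation slightly underestimates what a t-test-based calculation gives for small groups, so it is best read as a lower bound to discuss with a statistician.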
Sources of variation
• Genotype
• Development
• Population structure
• Environment
• Time
Metadata
[Figure label: Statistician]
– Sample processing
• Collection
• Storage
• Extraction
“Do I need to ship my samples?”
“How quickly can I get my samples into a freezer?”
“Do I need to revive any organisms in the sample?”
[Figure label: genome]
– Sequence quantity & quality
• Coverage
• Batch controls
“How much of my data is actually useful?”
“Will all my samples be processed or sequenced at the same time?”
[Figure label: Genomics Core]
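The coverage question can be put in numbers with the usual back-of-the-envelope estimate: coverage = reads × read length / genome size. A minimal sketch, in which the function name and the `usable_fraction` parameter (the share of reads that survive filtering and map to the target) are illustrative assumptions rather than anything defined on the slides:

```python
def expected_coverage(n_reads, read_length, genome_size, usable_fraction=1.0):
    """Estimate mean depth of coverage.

    n_reads:         number of reads sequenced for the sample.
    read_length:     read length in bases.
    genome_size:     target genome size in bases.
    usable_fraction: assumed share of reads that are actually useful (0-1).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length * usable_fraction / genome_size


if __name__ == "__main__":
    # Hypothetical numbers: 1 M reads of 150 bp on a 5 Mb genome.
    print(expected_coverage(1_000_000, 150, 5_000_000))
    # Same run if only 80% of the reads are usable.
    print(expected_coverage(1_000_000, 150, 5_000_000, usable_fraction=0.8))
```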
– Sequence storage & security
• Storage
Experimental Design
Selecting the right NGS platform
– Data structure
– Throughput
– Cost
– Accuracy
Considerations: read length, yield, multiplexing
“How fast can I get my data?”
“How many samples do I have?”
“How many samples can I multiplex on a sequencer run?”
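The multiplexing question is mostly arithmetic once the run yield and the per-sample target are known. A rough sketch under stated assumptions: reads are split evenly across samples, and `usable_fraction` is a hypothetical allowance for reads lost to filtering or uneven barcoding. Real yields should come from the platform specification or the sequencing provider.

```python
import math


def max_samples_per_run(run_yield_bases, genome_size, target_coverage,
                        usable_fraction=1.0):
    """Upper bound on samples that fit in one run at the target coverage."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    # Raw bases that must be sequenced per sample to end up with the target.
    bases_per_sample = genome_size * target_coverage / usable_fraction
    return math.floor(run_yield_bases / bases_per_sample)


if __name__ == "__main__":
    # Hypothetical run: 15 Gb yield, 5 Mb genomes, 50x target coverage.
    print(max_samples_per_run(15_000_000_000, 5_000_000, 50))
```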
– Cost
• Sample collection
• Sample prep
• Sequencing
• Data processing & storage
• Platform purchase
• Maintenance & service contracts
• Support equipment
• Support personnel
“Sigh… Time to write another grant”
– Accuracy
• Error frequency
[Figure labels: cost, yield; Genomics Core]
Experimental Design
Selecting the right NGS service provider
– Level of service
Sequencing
Data analysis
Data presentation
Data Analysis
Bioinformatics considerations
• Analysis tools
• Coding
• Verification & validation
• Public access
• Coding
– Commenting
– Git version control
“How can I ensure that my analysis is reproducible?”
• Public access
– Metadata
– Code
“Should we make our full analysis pipeline available?”
Data Presentation – a Visual Narrative
Data Visualization
• Four pillars
• Tufte principles
• Agile Development
“How can I tell a narrative through figures and tables?”
Four pillars
• Accuracy
• Precision
• Clarity
• Efficiency
[Figure: precision vs. accuracy. Caption: “NO… Make them go away!!”]
Data Visualization
• Overview first, zoom & filter, details on demand
• Pre-attentive processing
• Less can be more
• Keep it proportional
• Color perception
• Red-green color-blind view
• Data / ink ratio
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
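The four lines above form one FASTQ record: header, sequence, separator, and one quality character per base. A minimal sketch of how the quality line decodes, assuming the common Phred+33 (Sanger) encoding; older Illumina pipelines used a different offset, so the offset is left as a parameter. The function names are illustrative.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# The record shown on the slide.
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
header, sequence, _separator, quality = record
scores = phred_scores(quality)

if __name__ == "__main__":
    print(header[1:], len(sequence), "bases")
    print("lowest score:", min(scores), "highest score:", max(scores))
    print("error probability at Q20:", error_probability(20))
```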
Should I…
Most common approach: combining existing tools with ad-hoc ones. A large part
of a bioinformatician’s work consists of gluing together different tools and
ensuring that they interoperate correctly.
In practice, this often means writing ad-hoc scripts to convert data between the
different formats that the required tools expect.
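A typical piece of such glue is a format converter. The sketch below turns FASTQ into FASTA by dropping the quality lines. It assumes strict four-line FASTQ records (no wrapped sequences), and the function name is illustrative.

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, _quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield ">" + header[1:]
        yield sequence


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    for line in fastq_to_fasta(example):
        print(line)
```

Because the function takes any iterable of lines, it works equally on an open file handle or on standard input, which keeps it easy to chain with other tools.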
Choices
Work style: Interactive vs. “Batch”
Interactive
Pros:
• Allows for exploratory analysis
• Easy generation of graphical reports (R, Jupyter)
Cons:
• Less reproducible
• Constrained by hardware limitations
“Batch”
Pros:
• Minimal manual work required
• Reproducibility built-in
• Suitable for HPC environments
• Easily scalable
Cons:
• Requires upfront development effort
• Analysis path is “frozen”: can’t deal with unexpected findings
Choices
Work style:
Interactive “Batch”
Common approach: first part of analysis performed in batch mode using a standard
process. Results from this phase should be made available in a format appropriate for
downstream interactive analysis.
This choice is influenced by the available personnel – pet bioinformatician vs. cross-trained.
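One way to picture the hand-off from batch to interactive work: the batch phase writes a small, tidy per-sample table, and the interactive session (a notebook, for example) loads it without rerunning anything. This is only a sketch; the column names and values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

FIELDS = ["sample", "n_reads", "mean_quality"]  # hypothetical columns


def write_summary(rows, handle):
    """Batch side: write one row per sample as tab-separated text."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_summary(handle):
    """Interactive side: load the table back as a list of dicts (all strings)."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up numbers, for illustration only.
rows = [
    {"sample": "S1", "n_reads": 1000, "mean_quality": 34.2},
    {"sample": "S2", "n_reads": 850, "mean_quality": 31.7},
]
buffer = io.StringIO()
write_summary(rows, buffer)
buffer.seek(0)
loaded = read_summary(buffer)

if __name__ == "__main__":
    for row in loaded:
        print(row["sample"], int(row["n_reads"]), float(row["mean_quality"]))
```

A plain tab-separated table is a deliberately boring choice: it loads the same way in R, Python, or a spreadsheet.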
Choices
Hardware: Local vs. Cloud-based
Local
Pros:
• Total control over hardware / software
• Lower recurring costs
Cons:
• Large initial investment
• Obsolescence
• Requires admin / security expertise
Cloud-based
Pros:
• Does not require dedicated admin / security personnel
• Higher reliability, better hardware
Cons:
• Requires moving large amounts of data
• Recurring costs proportional to volume of data analysis
Putting the pieces together
• Analysis pipelines: a sequence of analysis tools designed to perform
an analysis task end-to-end, in an unsupervised way.
• DIY approach: UNIX makes it very easy to combine small tools into
larger ones, building up your custom toolkit.
"""
cat $x | rev
"""
}
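The fragment above pipes a file through `cat` and `rev`, which is the small-tools idea in miniature. The same composition can be sketched in Python with generators, where each stage consumes and produces a stream of lines. The stage names mirror the shell commands for readability; `pipe` is a hypothetical helper, not part of any workflow framework.

```python
from functools import reduce


def cat(lines):
    """Pass lines through, stripping trailing newlines."""
    for line in lines:
        yield line.rstrip("\n")


def rev(lines):
    """Reverse the characters of each line, like the UNIX `rev` tool."""
    for line in lines:
        yield line[::-1]


def pipe(source, *stages):
    """Chain stages left to right, like `source | stage1 | stage2`."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


if __name__ == "__main__":
    for line in pipe(["abc\n", "hello\n"], cat, rev):
        print(line)
```

Because every stage is lazy, lines flow through one at a time, so the pattern scales to files larger than memory in the same way a shell pipeline does.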
Containers
Installing, configuring and updating software tools is a significant
challenge. Tools change over time, hindering reproducibility.