
Webinar Series

Biology meets programming: Bioinformatics 101 for NGS researchers


13 May 2020
Participating experts
David Guttman, Ph.D.
University of Toronto
Toronto, Canada
Alberto Riva, Ph.D.
University of Florida
Gainesville, FL

Brought to you by the Science/AAAS Custom Publishing Office




Biology Meets Programming:
Bioinformatics 101 for NGS Researchers

David S. Guttman
Department of Cell & Systems Biology
Centre for the Analysis of Genome Evolution & Function
University of Toronto
Integrating Next-Generation Sequencing and Bioinformatics into Your Research

Lessons Learned
Integrating NGS into Your Research
Bioinformatics in the wet lab

Experimental design

Data analysis

Data presentation
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Specialized vs cross-trained
Specialized “Pet” Bioinformatician
Pros
• High expertise
• Focused training
Cons
• Poor integration with group
• Isolated from field
Cross-Trained Bioinformatician
Pros
• Highly integrated with group
• Well-rounded training
Cons
• Time is finite
• Jack of all trades, master of none
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Finding & keeping people
• Focus
• Training
• Credit
[Cartoon captions: “I like python more than pythons”; “I like pythons more than python”; “HELLO… is there anyone out there???”; “Sigh… Stuck in the middle again.”]
[Figure: mock journal article (“Lorem Ipsum 12(3):45-67”) with placeholder text and the author list “Wendy Wetbench, Bob Bioinformatics, G. Rand Poobah”]
Experimental Design
Collecting the right data

Selecting the right NGS platform

Selecting the right NGS service provider


Experimental Design
Collecting the right data
– Sample structure
– Sample processing
– Sequence quantity & quality
– Sequence storage & security
Experimental Design
Collecting the right data
– Sample structure
• Power analysis: How many samples do I need?
  Sample size, effect size, statistical power (type II error), significance level (type I error)
  How many independent tests? (Bonferroni, who??)
• Biological replicates
• Controls
• Sources of variation: genotype, development, population structure, environment, time
• Metadata: What metadata will I need and how should I record it?
  Metadata = data about your data
Resource: Statistician
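
The relationship between sample size, effect size, power, and significance level listed above can be sketched numerically. The following is a minimal illustration, not a substitute for a statistician: it assumes a two-sided, two-sample comparison of means, uses the normal approximation, and applies a Bonferroni correction by dividing the significance level by the number of independent tests. The function name `samples_per_group` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2, where d
    is the standardized effect size (difference in means / standard deviation).
    alpha is divided by n_tests as a Bonferroni correction.
    """
    corrected_alpha = alpha / n_tests  # Bonferroni correction
    z_alpha = NormalDist().inv_cdf(1 - corrected_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# A single test versus 1000 independent tests at the same effect size.
print(samples_per_group(1.0))
print(samples_per_group(1.0, n_tests=1000))
```

Smaller effect sizes and more independent tests both increase the number of samples required, which is why this calculation belongs before data collection.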
Experimental Design
Collecting the right data
– Sample processing
• Collection
• Storage
• Extraction
Questions to ask:
• Do I need to ship my samples?
• How quickly can I get my samples into a freezer?
• Do I need to revive any organisms in the sample?
• Are my samples being sequenced on a long read or short read sequencer?
Experimental Design
Collecting the right data
– Sequence quantity & quality
• Coverage
• Contamination
• Batch controls
Questions to ask:
• How much sequence data do I need?
• How big is the genome?
• Do I need a reference genome? Is there a reference genome??
• How much of my data is actually useful?
• How much of my data will be contaminating DNA?
• Will all my samples be processed or sequenced at the same time?
[Figure: reads aligned to a genome, labeled “Depth Coverage 5x” and “Genome Coverage 15%”]
Resource: Genomics Core
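
The question "How much sequence data do I need?" can be approached with simple arithmetic. The sketch below assumes the usual definition of average depth (total sequenced bases divided by genome size) and, for breadth, the idealized Lander-Waterman model of uniformly random reads; real libraries are less uniform. All function names and the `usable_fraction` parameter are hypothetical.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Expected average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size, usable_fraction=1.0):
    """Reads required to reach target_depth when only usable_fraction of the
    reads are useful (e.g. after discarding contaminating or low-quality reads)."""
    return math.ceil(target_depth * genome_size / (read_length * usable_fraction))


def expected_breadth(depth):
    """Fraction of the genome expected to be covered by at least one read,
    under the idealized Lander-Waterman model (uniformly random reads)."""
    return 1 - math.exp(-depth)


# Hypothetical example: 5 Mb genome, 150 bp reads, 30x target, half the reads usable.
print(reads_needed(30, 150, 5_000_000, usable_fraction=0.5))
```

Contamination enters through `usable_fraction`: if half of the reads come from contaminating DNA, twice as many reads are needed for the same depth.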
Experimental Design
Collecting the right data
– Sequence storage & security
• Storage: It’s a lot of data!
• Identifiers
[Figure: sequence records labeled with names and dates, flagged “!!!”]
BobJones19.05.02 AGTAGAGCGAGCGAGTAC
JaneDoe20.02.28 TAGCGGAGTAGACGAGAG
JohnSmith20.03.17 GACGAGGCAGCCAGATAG
Experimental Design
Selecting the right NGS platform
– Data structure
– Throughput
– Cost
– Accuracy

Platform                 Lab or Core   Reads   Throughput   Run Cost   Error Rate
Oxford Nanopore MinION   Lab           Long    Low          Med        High
Illumina MiSeq           Both          Short   Low          High       Low
Illumina NovaSeq         Core          Short   High         Low        Low
MGI DNBSEQ-G400          Core          Short   High         Low        Low
PacBio RSII              Core          Long    Low          High       High
Experimental Design
Selecting the right NGS platform
– Data structure
• Resequencing, de novo sequencing, amplicon sequencing, RNA-seq, ChIP-seq, etc.
• Read length
• Read structure: short read, single-end; short read, paired-end; long read

Read Structure           Assembly   Cost       Throughput   Error Rate
Short read, single-end   Bad        Very Low   Very High    Low
Short read, paired-end   Fair       Low        High         Low
Long read                Great      High       Low          High
Experimental Design
Selecting the right NGS platform
– Throughput
• Yield
• Multiplexing
Questions to ask:
• How fast can I get my data?
• How many samples do I have?
• How many samples can I multiplex on a sequencer run?
– Cost
• Sample collection
• Sample prep
• Sequencing
• Data processing & storage
• Platform purchase
• Maintenance & service contracts
• Support equipment
• Support personnel
Sigh… Time to write another grant.
Experimental Design
Selecting the right NGS platform
– Accuracy
• Error frequency
• Error profile
Questions to ask:
• Is my genome AT-rich?
• Are there a lot of homopolymer tracts in my genome?
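
Both questions above can be checked against a reference or draft sequence before choosing a platform. This is a minimal sketch with hypothetical function names; it assumes the sequence is available as a plain string of bases and ignores ambiguous bases such as N when computing AT content.

```python
from itertools import groupby


def at_content(seq):
    """Fraction of A/T bases among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return at / (at + gc) if (at + gc) else 0.0


def longest_homopolymer(seq):
    """Return (base, length) of the longest run of a single repeated base."""
    best = ("", 0)
    for base, run in groupby(seq.upper()):
        length = sum(1 for _ in run)
        if length > best[1]:
            best = (base, length)
    return best


print(at_content("AATTGC"))
print(longest_homopolymer("ACGTTTTTGAA"))  # ('T', 5)
```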
Experimental Design
Selecting the right NGS platform
[Figure: accuracy, flexibility, cost, yield]
Resource: Genomics Core
Experimental Design
Selecting the right NGS service provider
– Level of service: Experimental Design → Sample Prep → Sequencing → Data / Statistical Analysis → Manuscript Prep
Do all of this before collecting the first piece of data!
Data Analysis
Bioinformatics considerations
• Analysis tools
– Commercial packages (e.g. CLC Genomics)
– Public packages & platforms (e.g. Galaxy, QIIME)
– Published tools, pipelines, & libraries
– House-made scripts & pipelines
• Coding: How can I ensure that my analysis is reproducible?
– Coding environment (e.g. Jupyter Notebooks, R-markdown)
– Commenting
– Git version control
• Verification & validation
– Gold standard analyses (i.e. Test Oracle)
– Metamorphic testing
– Verification: Does the software work (i.e. have no bugs)?
– Validation: Does the software do what you want it to do?
• Public access: Should we make our full analysis pipeline available?
– Data
– Metadata
– Code
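
To make the two verification ideas above concrete, here is a small sketch. A gold-standard (test oracle) check compares output against a known correct answer; a metamorphic test instead checks relations that must hold for any input, which is useful when the correct output is not known in advance. The function under test, `reverse_complement`, and the chosen relations are illustrative assumptions, not part of any specific pipeline.

```python
import random


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (the function under test)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def check_metamorphic_relations(n_trials=100, seed=0):
    """Check relations that must hold for any input, without knowing the
    correct output for each individual random sequence."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(1, 50)))
        rc = reverse_complement(seq)
        assert reverse_complement(rc) == seq        # applying twice is the identity
        assert len(rc) == len(seq)                  # length is preserved
        assert gc_fraction(rc) == gc_fraction(seq)  # GC content is preserved
    return True


# Gold standard ("test oracle") check: a case with a known correct answer.
assert reverse_complement("AACG") == "CGTT"
print(check_metamorphic_relations())
```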
Data Presentation – a Visual Narrative
Data Visualization
How can I tell a narrative through figures and tables?
• Four pillars
– Accuracy
– Precision
– Clarity
– Efficiency
• Tufte principles
– Overview first, zoom & filter, details on demand
– Pre-attentive processing
– Less can be more
– Keep it proportional
– Data / ink ratio
– Color perception
– Figures vs. tables
• Agile Development
[Figure: accuracy vs. precision, with cartoon caption “NO… Make them go away!!”]
[Figure: panels A, B, C (Gene 1, Gene 2, Gene 3); Civelek & Lusis 2014 Nat.Rev.Genet. 15: 34]
[Figure: color scale from -1.5 to 1.5, shown in normal view and in red-green color blind view]
Integrating NGS into a Wet Lab
The Death of Silo Science
• Big data requires a collaborative mindset
– Experimental design
– Data generation
– Data / statistical analysis
• Collaboration from project inception
– Greatly increases likelihood of success
– Enhances training opportunities
[Figure: Bioinformatics, Statistics, Genomics]
Thanks!


Biology Meets Programming:
Bioinformatics 101 for NGS Researchers
Alberto Riva
ICBR Bioinformatics Core
University of Florida
ariva@ufl.edu
Who am I & how did I get here?
• Background in Computer Science, Bioengineering, Medical
Informatics.

• Started working in Bioinformatics in 2001, while at Children’s
Hospital Boston – Harvard Medical School.

• Joined UF in 2006, and ICBR in 2014.

• As of January 2019, Scientific Director of the Bioinformatics Core.


ICBR
The Interdisciplinary Center for Biotechnology Research (ICBR) is a research
support organization that gathers multiple cores under the same roof:
• Bioinformatics
• Cytometry
• Gene Expression and Genotyping
• NextGen Sequencing
• Proteomics
• Electron Microscopy
• Monoclonal Antibodies
NGS raw data
The starting point for most NGS projects is a collection of fastq files,
containing millions of short reads with associated quality information.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

The second line is the read; the fourth line contains quality information
encoded as characters (one for each base in the read). The quality score Q
encodes the probability p that the base is miscalled: Q = -10 log10(p).
E.g. a quality score of 40, the best typically produced by Illumina
platforms, corresponds to p = 0.0001.
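
As a small illustration of the quality encoding, the sketch below decodes the quality line of the record shown above. It assumes the Sanger / Illumina 1.8+ convention, in which the score is the character's ASCII code minus 33; older Illumina files used a different offset. The function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a fastq quality string into integer Phred scores.

    Assumes the Sanger / Illumina 1.8+ encoding (ASCII code minus 33).
    """
    return [ord(char) - offset for char in quality_line]


def error_probability(q):
    """Probability that a base with Phred score q is miscalled: 10^(-q/10)."""
    return 10 ** (-q / 10)


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)
print(scores[:4])  # [0, 6, 6, 9]
print(error_probability(40))
```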
NGS data
Other de-facto standard data types:

• BAM – describes the result of mapping reads to a genome.

• VCF – describes variants, i.e. differences between sample(s) and
reference.

• Tab-delimited – lingua franca for tabular data.

Use Excel only for presenting final results (with caution)!


Outline of NGS analysis project
Most NGS analysis projects include the following steps:

• Quality control, data cleanup;
• Alignment / mapping to reference genome;
• Quantification: measuring relevant variables (e.g. gene expression);
• Differential analysis: comparing biological conditions;
• Downstream analysis: interpretation in context of research question;
• Presentation: packaging results, reports, plots.

These steps run from standard and easily automated (quality control) to ad-hoc and hands-on (presentation).
Choices

Should I…

Write my own tools?
Pros: Greater flexibility
      Tool can be tailored to problem
      Opportunities for publishing / visibility
Cons: Requires large time investment
      Requires technical expertise
      Requires validation / publication

Use existing tools?
Pros: Minimal development effort required
      Generally accepted method
Cons: May not be exactly what you need
      Dependent on other people’s work

Most common approach: combining existing tools with ad-hoc ones. A large part
of a bioinformatician’s work consists in gluing together different tools, ensuring
they interoperate correctly.

In practice, this often means converting data between the different formats used
by the required tools, using ad-hoc scripts.
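
A typical piece of such "glue" is a small format converter. The sketch below converts fastq records to FASTA, as one hypothetical example of an ad-hoc script; it assumes each fastq record occupies exactly four lines and does no validation beyond checking the header character.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line fastq records to 2-line FASTA records.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    Returns the number of records written.
    """
    n_records = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, not kept in FASTA
        if not header.startswith("@"):
            raise ValueError(f"Malformed fastq header: {header!r}")
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        n_records += 1
    return n_records


# Usage with in-memory text; real files would be opened with open() or gzip.open().
fastq = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue())
```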
Choices

Work style:

Interactive
Pros: Allows for exploratory analysis
      Easy generation of graphical reports (R, Jupyter)
Cons: Less reproducible
      Constrained by hardware limitations

“Batch”
Pros: Minimal manual work required
      Reproducibility built-in
      Suitable for HPC environments
      Easily scalable
Cons: Requires upfront development effort
      Analysis path is “frozen” – can’t deal with unexpected findings

Common approach: first part of analysis performed in batch mode using a standard
process. Results from this phase should be made available in a format appropriate for
downstream interactive analysis.

This choice is influenced by the available personnel – pet bioinformatician vs. cross-trained.
Choices

Hardware

Local
Pros: Total control over hardware / software
      Lower recurring costs
Cons: Large initial investment
      Obsolescence
      Requires admin / security expertise

Cloud-based
Pros: Does not require dedicated admin / security personnel
      Higher reliability, better hardware
Cons: Requires moving large amounts of data
      Recurring costs proportional to volume of data analysis
Putting the pieces together
• Analysis pipelines: sequences of analysis tools designed to perform an
analysis task end-to-end, in an unsupervised way.

• Large-scale NGS analysis is almost always performed in UNIX (Linux)
environments.

• HPC environments allow running many (hundreds to thousands of)
jobs in parallel – large analysis tasks can be handled easily.

• Solutions range from hand-crafted scripts to very complex pipeline
managers.
• UNIX was designed by programmers for programmers! A “basic” Linux
environment already provides all you need.

• DIY approach: UNIX makes it very easy to combine small tools into
larger ones, building up your custom toolkit.

• For example, to count the number of reads in a fastq file:


#!/bin/bash
# Count the reads in one or more gzipped fastq files (4 lines per read).
N=$(zcat "$@" | grep -c ^)
echo $((N/4))
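
For comparison, the same count can be written in Python. This is a sketch under the same assumption as the shell version (exactly four lines per read); the function name `count_reads` and the temporary example file are hypothetical.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count reads in a fastq file (plain or gzip-compressed), 4 lines per read."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Demonstration on a small temporary file containing two made-up reads.
with tempfile.TemporaryDirectory() as tmp:
    example = os.path.join(tmp, "example.fastq.gz")
    with gzip.open(example, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    print(count_reads(example))  # 2
```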
Old School Cool
Bash scripting: primitive, quirky “glue” language – but can be very
effective in automating repetitive tasks.

Makefile: complete rule system to describe how to generate certain
files from their sources.

Very complex analysis pipelines can be built using Bash scripts
combined with makefiles. NOT user-friendly – but works out-of-the-box,
with no dependencies, external tools, etc.
“Traditional” programming
Most modern programming languages provide support for
bioinformatics. Examples:

• perl – specialized language for text processing. Provides bioperl.


• R – specialized data analysis language and environment. Probably the
most widely used in bioinformatics. Bioconductor: huge repository of
analysis packages. Powerful visualization. Pretty much unavoidable.
• python – general purpose language, simple and (relatively) fast.
Provides biopython (similar to bioperl), numpy / scipy (similar to R
and matlab).
Pipeline managers
Pipeline managers allow you to describe what the pipeline should do –
and the manager makes it happen. Able to handle:

• Parallelization (run same task on all input files);


• Dependencies (step B requires step A to be performed first);
• Conveying data from one step to the next;
• Submitting jobs in cluster environments;
• Errors / checkpointing (if pipeline stops, restart from last good result);
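
Two of the responsibilities listed above, dependencies and checkpointing, can be illustrated in a few lines. This is a toy sketch, not how any particular pipeline manager is implemented: it orders steps with the standard-library `graphlib.TopologicalSorter` (Python 3.9+) and skips steps recorded as completed. The step names and the `run_pipeline` function are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(steps, dependencies, completed=None):
    """Run steps in dependency order, skipping those already completed.

    steps:        dict mapping step name -> zero-argument callable
    dependencies: dict mapping step name -> set of steps that must run first
    completed:    set of step names finished in a previous run (checkpoint)
    Returns the list of steps executed in this run.
    """
    completed = set() if completed is None else completed
    executed = []
    graph = {name: dependencies.get(name, set()) for name in steps}
    for name in TopologicalSorter(graph).static_order():
        if name in completed:
            continue  # restart from the last good result
        steps[name]()  # an exception here stops the pipeline
        completed.add(name)
        executed.append(name)
    return executed


# Hypothetical three-step pipeline: qc -> align -> quantify.
log = []
steps = {
    "quantify": lambda: log.append("quantify"),
    "align": lambda: log.append("align"),
    "qc": lambda: log.append("qc"),
}
dependencies = {"align": {"qc"}, "quantify": {"align"}}
print(run_pipeline(steps, dependencies))
```

Real pipeline managers add what this sketch leaves out: parallel execution, passing files between steps, and job submission to a cluster.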
Nextflow
Nextflow provides a language to describe analysis steps in terms of
inputs, outputs, process. Automatically scales from single CPU to multi-
core to cluster environment.
/* Simple tool to reverse sequences */
process reverse {
    input:
    path x from records

    output:
    stdout into result

    """
    cat $x | rev
    """
}
Containers
Installing, configuring and updating software tools is a significant
challenge. Tools change over time, hindering reproducibility.

Containers: self-contained virtual environments that store a collection
of software tools designed to work together.

Containers are immutable, can be easily distributed, and run anywhere.
Pipelines may make use of tools in containerized form; whole pipelines
may be distributed as containers.

Most commonly used: docker, singularity.


Take-home messages
• Programming is an essential skill in bioinformatics – how much you
use it depends on you!

• Know your tools: even a basic Linux environment contains a wealth of
very powerful tools ready for use.

• Carefully evaluate pros and cons of different approaches. DIY or
off-the-shelf?

• Be pragmatic! Bioinformatics == problem solving.

