Biology Meets Programming 101

Webinar Series
Biology meets programming: Bioinformatics 101 for NGS researchers

13 May 2020
Participating experts
David Guttman, Ph.D.
University of Toronto
Toronto, Canada
Alberto Riva, Ph.D.
University of Florida
Gainesville, FL
Brought to you by the Science/AAAS Custom Publishing Office


Biology Meets Programming:
Bioinformatics 101 for NGS Researchers
David S. Guttman
Department of Cell & Systems Biology
Centre for the Analysis of Genome Evolution & Function
Integrating Next-Generation Sequencing and Bioinformatics into Your Research: Lessons Learned
Integrating NGS into Your Research
Bioinformatics in the wet lab
Experimental design
Data analysis
Data presentation
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Specialized vs cross-trained
– Finding & keeping people
Specialized “pet” bioinformatician
Pros
• High expertise
• Focused training
Cons
• Poor integration with group
• Isolated from field

Cross-trained bioinformatician
Pros
• Highly integrated with group
• Well-rounded training
Cons
• Time is finite
• Jack of all trades, master of none

Finding & keeping people
• Focus
• Training
• Credit
[Cartoon captions: “I like python more than pythons” / “I like pythons more than python”; “HELLO… is there anyone out there???”; “Sigh… Stuck in the middle again.”]
[Figure: mock article “Lorem Ipsum 12(3):45-67” by Wendy Wetbench, Bob Bioinformatics, G. Rand Poobah, with placeholder body text.]
Experimental Design
Collecting the right data
Selecting the right NGS platform
Selecting the right NGS service provider

Experimental Design
Collecting the right data
– Sample structure
– Sample processing
– Sequence quantity & quality
– Sequence storage & security

Experimental Design
Sample structure
• Sample size
• Biological replicates
• Controls
• Power analysis
• Sources of variation
• Metadata
How many samples do I need?

Power analysis
• Sample size
• Power (type II error)
• Statistical significance level (type I error)
• Effect size
How many independent tests? (“Bonferroni, who??”)

Sources of variation (variance)
• Genotype
• Development
• Population structure
• Environment
• Time

Metadata
What metadata will I need and how should I record it?
Metadata = data about your data

[Slide label: Statistician]
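To make the link between sample size, power, significance level, and effect size concrete, here is a minimal sketch. It is not from the talk: it assumes a two-group comparison of means, a two-sided test, and the normal approximation, and the function names `bonferroni_alpha` and `samples_per_group` are hypothetical. A statistician may well recommend a different design.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples per group for a two-sided, two-group comparison.

    Assumption: normal approximation, with effect_size given as a
    standardized difference between group means (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # controls type I error
    z_power = z.inv_cdf(power)          # controls type II error
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # One test at alpha = 0.05 versus 20,000 tests (e.g. one per gene).
    print(samples_per_group(0.5))
    print(samples_per_group(0.5, alpha=bonferroni_alpha(0.05, 20_000)))
```

The second call shows why the number of independent tests matters: a stricter per-test threshold raises the sample size needed to detect the same effect.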
Experimental Design
Sample processing
• Collection
• Storage
• Extraction
Do I need to ship my samples?
How quickly can I get my samples into a freezer?
Do I need to revive any organisms in the sample?
Are my samples being sequenced on a long read or short read sequencer?
Experimental Design
Sequence quantity & quality
• Coverage
• Contamination
• Batch controls

Coverage
How much sequence data do I need?
How big is the genome?
Do I need a reference genome? Is there a reference genome??
[Figure: depth vs. coverage. Labels: reads, genome, “5x”, “Genome Coverage”, “-15%”.]

Contamination
How much of my data is actually useful?
How much of my data will be contaminating DNA?

Batch controls
Will all my samples be processed or sequenced at the same time?
[Slide label: Genomics Core]
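The questions above (how much data, how big is the genome, how much of the data is useful) can be tied together with simple arithmetic. The sketch below is an illustration rather than anything shown in the talk. It assumes reads land uniformly at random on the genome (the idealized Lander-Waterman model), and the function names and the `useful_fraction` parameter are hypothetical.

```python
from math import ceil, exp


def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base (e.g. 30.0 for '30x')."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int,
                 useful_fraction: float = 1.0) -> int:
    """Reads required to reach target_depth.

    useful_fraction is the share of reads expected to be usable, i.e. not
    contamination or low quality (an assumption supplied by the user).
    """
    return ceil(target_depth * genome_size / (read_length * useful_fraction))


def expected_breadth(depth: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    under the uniform-random-read assumption."""
    return 1 - exp(-depth)


if __name__ == "__main__":
    genome = 5_000_000  # hypothetical 5 Mb bacterial genome
    print(mean_depth(1_000_000, 150, genome))
    print(reads_needed(30, 150, genome, useful_fraction=0.5))
    print(round(expected_breadth(5), 4))
```

Real libraries deviate from the uniform model (GC bias, repeats), so these numbers are best read as lower bounds when planning a run.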
Experimental Design
Sequence storage & security
• Storage
• Identifiers

Storage
It’s a lot of data!

Identifiers
[Figure: sequences labelled with identifiers built from personal names and dates, flagged “!!!”]
BobJones19.05.02     AGTAGAGCGAGCGAGTAC
JaneDoe20.02.28      TAGCGGAGTAGACGAGAG
JohnSmith20.03.17    GACGAGGCAGCCAGATAG
Experimental Design
Selecting the right NGS platform
– Data structure
– Throughput
– Cost
– Accuracy

Platform                  Lab or Core   Reads   Throughput   Run Cost   Error Rate
Oxford Nanopore MinION    Lab           Long    Low          Med        High
Illumina MiSeq            Both          Short   Low          High       Low
Illumina NovaSeq          Core          Short   High         Low        Low
MGI DNBSEQ-G400           Core          Short   High         Low        Low
PacBio RSII               Core          Long    Low          High       High

Data structure
• Resequencing, de novo sequencing, amplicon sequencing, RNA-seq, ChIP-seq, etc.
• Read length
• Read structure: short read, single-end; short read, paired-end; long read

Read Structure            Assembly   Cost       Throughput   Error Rate
Short read, single-end    Bad        Very Low   Very High    Low
Short read, paired-end    Fair       Low        High         Low
Long read                 Great      High       Low          High

Throughput
• Yield
• Multiplexing
How fast can I get my data?
How many samples do I have?
How many samples can I multiplex on a sequencer run?
Experimental Design
Cost
• Sample collection
• Sample prep
• Sequencing
• Data processing & storage
• Platform purchase
• Maintenance & service contracts
• Support equipment
• Support personnel
Sigh … Time to write another grant

Accuracy
• Error frequency
• Error profile
Is my genome AT-rich?
Are there a lot of homopolymer tracts in my genome?

[Figure: accuracy, flexibility, cost, yield. Slide label: Genomics Core]

Experimental Design
Selecting the right NGS service provider
– Level of service
• Experimental Design
• Sample Prep
• Sequencing
• Data / Statistical Analysis
• Manuscript Prep

Do all of this before collecting the first piece of data!
Data Analysis
Bioinformatics considerations
• Analysis tools
• Coding
• Verification & validation
• Public access
Data Analysis
Analysis tools
• Commercial packages (e.g. CLC Genomics)
• Public packages & platforms (e.g. Galaxy, QIIME)
• Published tools, pipelines, & libraries
• House-made scripts & pipelines

Coding
• Coding environment (e.g. Jupyter Notebooks, R-markdown)
• Commenting
• Git version control
How can I ensure that my analysis is reproducible?

Verification & validation
• Gold standard analyses (i.e. Test Oracle)
• Metamorphic testing
Verification: Does the software work (i.e. have no bugs)?
Validation: Does the software do what you want it to do?
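As an illustration of the two testing ideas named above, the sketch below checks a small sequence utility in both ways. The functions `reverse_complement` and `gc_content` and the chosen relations are hypothetical examples, not tools from the talk. A test oracle compares the output with a known correct answer; a metamorphic test checks a relation that must hold between outputs even when the correct answer is unknown.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def check_oracle() -> bool:
    # Gold standard: a case small enough to verify by hand.
    return reverse_complement("AACG") == "CGTT"


def check_metamorphic_relations(n_cases: int = 100, seed: int = 0) -> bool:
    """Check relations between outputs on random inputs, no known answers needed."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(1, 50)))
        # Applying the reverse complement twice must return the input.
        assert reverse_complement(reverse_complement(seq)) == seq
        # GC content is unchanged on the opposite strand.
        assert gc_content(reverse_complement(seq)) == gc_content(seq)
        # Duplicating a sequence must not change its GC content.
        assert gc_content(seq + seq) == gc_content(seq)
    return True


if __name__ == "__main__":
    print(check_oracle(), check_metamorphic_relations())
```

Metamorphic relations are useful for pipelines where no gold-standard result exists, for example checking that shuffling the read order does not change a variant call set.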
Data Analysis
Public access
• Data
• Metadata
• Code
Should we make our full analysis pipeline available?
Data Presentation – a Visual Narrative
Data Visualization
• Four pillars
• Tufte principles
• Agile Development
How can I tell a narrative through figures and tables?

Four pillars
• Accuracy
• Precision
• Clarity
• Efficiency
[Figure: accuracy vs. precision. Caption: “NO… Make them go away!!”]

Data Visualization
• Overview first, zoom & filter, details on demand
• Pre-attentive processing
• Less can be more
• Keep it proportional
• Data / ink ratio
• Color perception
• Figures vs. tables
[Figure: panels A, B, C with Gene 1, Gene 2, Gene 3. Source: Civelek & Lusis 2014 Nat. Rev. Genet. 15: 34]
[Figure: a color scale from -1.5 to 1.5 shown in normal view and in red-green color blind view]
Integrating NGS into a Wet Lab
The Death of Silo Science
• Big data requires a collaborative mindset
– Experimental design
– Data generation
– Data / statistical analysis
• Collaboration from project inception
– Greatly increases likelihood of success
– Enhances training opportunities
[Figure: Bioinformatics, Statistics, Genomics]
Thanks!
Biology Meets Programming:
Bioinformatics 101 for NGS Researchers
Alberto Riva
ICBR Bioinformatics Core
ariva@ufl.edu
Who am I & how did I get here?
• Background in Computer Science, Bioengineering, Medical Informatics.
• Started working in Bioinformatics in 2001, while at Children’s Hospital Boston – Harvard Medical School.
• Joined UF in 2006, and ICBR in 2014.
• As of January 2019, Scientific Director of the Bioinformatics Core.

ICBR
The Interdisciplinary Center for Biotechnology Research (ICBR) is a research
support organization that gathers multiple cores under the same roof:
• Bioinformatics
• Cytometry
• Gene Expression and Genotyping
• NextGen Sequencing
• Proteomics
• Electron Microscopy
• Monoclonal Antibodies
NGS raw data
The starting point for most NGS projects is a collection of fastq files,
containing millions of short reads with associated quality information.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The second line is the read; the fourth line encodes quality information as
characters (one for each base in the read). The quality score Q is a log-scaled
encoding of the probability p that the base is miscalled: Q = -10 log10(p). E.g. the
best quality produced by Illumina platforms is 40, corresponding to p = 0.0001.
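A minimal sketch of how the quality line can be decoded, for illustration only. It assumes the common Phred+33 encoding, in which each character's ASCII code minus 33 gives the quality score; older Illumina data used other offsets. The function names are hypothetical.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a fastq quality string into per-base Phred scores.

    Assumption: Phred+33 encoding (offset=33).
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q: float) -> float:
    """Probability that a base with Phred score q is miscalled: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Quality line from the example record above.
    quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
    scores = phred_scores(quality)
    print(min(scores), max(scores))
    print(error_probability(40))
```

In practice the same conversion is available in established libraries, so a hand-written decoder is mainly useful for understanding the format.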
NGS data
Other de-facto standard data types:
• BAM – describes the result of mapping reads to a genome.
• VCF – describes variants, i.e. differences between sample(s) and reference.
• Tab-delimited – lingua franca for tabular data.
Use Excel only for presenting final results (with caution)!

Outline of NGS analysis project
Most NGS analysis projects include the following steps:
• Quality control, data cleanup;

• Alignment / mapping to reference genome;
• Quantification: measuring relevant variables (e.g. gene expression);
• Differential analysis: comparing biological conditions;
• Downstream analysis: interpretation in context of research question;
• Presentation: packaging results, reports, plots.
These steps range from standard and easily automated (quality control, at the top
of the list) to ad-hoc and hands-on (presentation, at the bottom).
Choices
Should I write my own tools or use existing tools?

Write my own tools
Pros:
• Greater flexibility
• Tool can be tailored to problem
• Opportunities for publishing / visibility
Cons:
• Requires large time investment
• Requires technical expertise
• Requires validation / publication

Use existing tools
Pros:
• Minimal development effort required
• Generally accepted method
Cons:
• May not be exactly what you need
• Dependent on other people’s work

Most common approach: combining existing tools with ad-hoc ones. A large part
of a bioinformatician’s work consists of gluing together different tools, ensuring
they interoperate correctly.
In practice, this often means converting data between the different formats used
by the required tools, using ad-hoc scripts.
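A small example of the kind of ad-hoc “glue” script described here: converting fastq records to FASTA for a tool that does not need quality values. This is a sketch under stated assumptions, not a tool from the talk. It assumes strict four-line fastq records (no wrapped sequences), and the function name `fastq_to_fasta` is hypothetical.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle) -> int:
    """Copy reads from a fastq stream to a FASTA stream; return the read count.

    Assumption: every record is exactly four lines
    (header, sequence, '+' separator, quality).
    """
    n_reads = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"unexpected fastq header: {header!r}")
        sequence = fastq_handle.readline()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, dropped in FASTA
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        n_reads += 1
    return n_reads


if __name__ == "__main__":
    source = io.StringIO("@read1\nGATTTGGGG\n+\n!''*((((*\n")
    target = io.StringIO()
    print(fastq_to_fasta(source, target))
    print(target.getvalue(), end="")
```

Because the function works on any file-like object, it can be used with `open(...)` or `gzip.open(..., "rt")` as well as with the in-memory streams shown here.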
Choices
Work style: interactive or “batch”?

Interactive
Pros:
• Allows for exploratory analysis
• Easy generation of graphical reports (R, Jupyter)
Cons:
• Less reproducible
• Constrained by hardware limitations

“Batch”
Pros:
• Minimal manual work required
• Reproducibility built-in
• Suitable for HPC environments
• Easily scalable
Cons:
• Requires upfront development effort
• Analysis path is “frozen” – can’t deal with unexpected findings

Common approach: first part of analysis performed in batch mode using a standard
process. Results from this phase should be made available in a format appropriate for
downstream interactive analysis.
This choice is influenced by the available personnel – pet bioinformatician vs. cross-trained.
Choices
Hardware: local or cloud-based?

Local
Pros:
• Total control over hardware / software
• Lower recurring costs
Cons:
• Large initial investment
• Obsolescence
• Requires admin / security expertise

Cloud-based
Pros:
• Does not require dedicated admin / security personnel
• Higher reliability, better hardware
Cons:
• Requires moving large amounts of data
• Recurring costs proportional to volume of data analysis
Putting the pieces together
• Analysis pipelines: sequences of analysis tools designed to perform an
analysis task end-to-end, in an unsupervised way.
• Large-scale NGS analysis is almost always performed in UNIX (Linux)
environments.
• HPC environments allow running many (hundreds to thousands) jobs in
parallel – large analysis tasks can be handled easily.
• Solutions range from hand-crafted scripts to very complex pipeline
managers.
• UNIX was designed by programmers for programmers! A “basic” Linux
environment already provides all you need.
• DIY approach: UNIX makes it very easy to combine small tools into
larger ones, building up your custom toolkit.
• For example, to count the number of reads in a (gzip-compressed) fastq file:

#!/bin/bash
# A fastq record is four lines, so reads = total lines / 4.
N=$(zcat "$@" | grep -c ^)
echo $((N/4))
Old School Cool
Bash scripting: primitive, quirky “glue” language – but can be very
effective in automating repetitive tasks.
Makefile: complete rule system to describe how to generate certain
files from their sources.
Very complex analysis pipelines can be built using Bash scripts
combined with makefiles. NOT user-friendly – but works out-of-the-box,
with no dependencies, external tools, etc.
“Traditional” programming
Most modern programming languages provide support for
bioinformatics. Examples:
• perl – specialized language for text processing. Provides bioperl.

• R – specialized data analysis language and environment. Probably the
most widely used in bioinformatics. Bioconductor: huge repository of
analysis packages. Powerful visualization. Pretty much unavoidable.
• python – general purpose language, simple and (relatively) fast.
Provides biopython (similar to bioperl), numpy / scipy (similar to R
and matlab).
Pipeline managers
Pipeline managers allow you to describe what the pipeline should do –
and the manager makes it happen. Able to handle:
• Parallelization (run same task on all input files);
• Dependencies (step B requires step A to be performed first);
• Conveying data from one step to the next;
• Submitting jobs in cluster environments;
• Errors / checkpointing (if pipeline stops, restart from last good result).
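To illustrate two of the features listed above, dependencies and checkpointing, here is a toy sketch. It is not how Nextflow or any real pipeline manager is implemented: it runs steps one at a time, does no parallelization or cluster submission, and the names `run_pipeline`, `steps`, `dependencies`, and `done` are hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies, done=None):
    """Run steps in dependency order, skipping those already completed.

    steps:        dict mapping step name -> callable taking no arguments
    dependencies: dict mapping step name -> set of steps that must run first
    done:         set of completed step names (the "checkpoint"); it is
                  updated in place so a caller can reuse it after a failure
    Returns the list of step names executed in this call.
    """
    done = set() if done is None else done
    executed = []
    # static_order() yields every step after all of its prerequisites.
    for name in TopologicalSorter(dependencies).static_order():
        if name in done:
            continue  # restart from the last good result
        steps[name]()  # an exception here leaves `done` intact
        done.add(name)
        executed.append(name)
    return executed


if __name__ == "__main__":
    log = []
    steps = {
        "qc": lambda: log.append("quality control"),
        "align": lambda: log.append("alignment"),
        "quantify": lambda: log.append("quantification"),
    }
    dependencies = {"align": {"qc"}, "quantify": {"align"}}
    print(run_pipeline(steps, dependencies))
    # Simulate a restart after "qc" had already finished.
    print(run_pipeline(steps, dependencies, done={"qc"}))
```

Real managers add what this sketch leaves out: tracking input and output files, running independent steps in parallel, and submitting jobs to a scheduler.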
Nextflow
Nextflow provides a language to describe analysis steps in terms of
inputs, outputs, process. Automatically scales from single CPU to multi-
core to cluster environment.
/* Simple tool to reverse sequences */
process reverse {
    input:
    path x from records

    output:
    stdout into result

    """
    cat $x | rev
    """
}
Containers
Installing, configuring and updating software tools is a significant
challenge. Tools change over time, hindering reproducibility.
Containers: self-contained virtual environments that store a collection
of software tools designed to work together.
Containers are immutable, can be easily distributed, and run anywhere.
Pipelines may make use of tools in containerized form; whole pipelines
may be distributed as containers.
Most commonly used: Docker, Singularity.

Take-home messages
• Programming is an essential skill in bioinformatics – how much you
use it depends on you!
• Know your tools: even a basic Linux environment contains a wealth of
very powerful tools ready for use.
• Carefully evaluate pros and cons of different approaches. DIY or
off-the-shelf?
• Be pragmatic! Bioinformatics == problem solving.

Look out for more webinars in the series at: webinar.sciencemag.org
To provide feedback on this webinar, please email your comments to webinar@aaas.org
For related information on this webinar topic, go to: https://go.roche.com/libraryprep
