
Bioinformatics—the meaning

Bioinformatics is an interdisciplinary field that uses and develops computational tools for
answering biological questions. It combines biology, computer
science, mathematics, and statistics to analyze and interpret biological data.
Bioinformatics has become an important part of many areas of biology. For example, in the
field of genetics, bioinformatics aids in sequencing and annotating genomes and their
observed mutations. More specifically, bioinformatics tools and software aid in comparing,
analyzing, and interpreting genomic data and, more generally, in understanding the
evolutionary aspects of molecular biology. Bioinformatics also plays a role in the analysis of
gene and protein expression and regulation. At a more integrative level, it helps analyze and
catalogue the biological pathways and networks that are an important part of systems
biology. In structural biology, it aids in the simulation and modelling of DNA, proteins, and
biomolecular interactions.

Important sub-disciplines within bioinformatics include: 1) the development and
implementation of computer tools and software that enable the interpretation of various types
of information; and 2) the development of new algorithms (step-by-step computational
procedures) and statistical
measures that assess relationships among members of large data sets. For example, there
are methods to locate a gene within a sequence, to predict protein structure and/or
function, and to cluster protein sequences into families of related sequences. Major
areas of work in the bioinformatics field include sequence alignment, gene finding, genome
assembly, drug discovery, protein structure prediction, prediction of gene expression,
protein–protein
interactions, genome-wide association studies, mapping and analyzing DNA and protein
sequences, and creating 3-D models of protein structures.

How has bioinformatics changed?


In its early days––perhaps until the early 2000s––bioinformatics was synonymous with
"sequence analysis," a then-common method in biology studies. Scientists typically obtained
just a few DNA sequences then analyzed them for various properties. In the mid 2000s, the
so-called next-generation, high-throughput sequencing instruments (such as the Illumina
HiSeq) made it possible to measure the full genomic content of a cell in a single
experimental run. With that, the quantity of data shot up immensely as scientists were able
to capture a snapshot of everything that is DNA-related. This has transformed
bioinformatics into an entirely new field of data science that builds on the "classical
bioinformatics", but has become focused on processing, investigating, and summarizing
massive data sets of uncommon complexity.

Is creativity required in bioinformatics?


Bioinformatics requires a dynamic, creative approach. Protocols should be viewed as
guidelines, not rules that guarantee success. Following protocols to the letter is usually
quite counterproductive. At best, doing so leads to sub-optimal outcomes; at worst, it can
produce misinformation and spell the end of a research project. Living organisms operate in
immense complexity. Bioinformaticians need to recognize this complexity, respond
dynamically to variations, and understand when methods and protocols are not suited to a
data set. The myriad complexities and challenges of venturing at the frontiers of scientific
knowledge always require creativity, sensitivity, and imagination––bioinformatics is no
exception.

Essential biology concepts for bioinformaticians

Biology is a domain of the Life Sciences, which include, but are not limited to, organic
chemistry, ecology, botany, zoology, and physiology. As bioinformatics and other
innovative methodologies advance, we expect the Life Sciences to mature and develop rich,
accurate vocabularies and models to understand and describe living organisms. Below are
the biological concepts that we deem important for this field:

What is DNA?
DNA stands for deoxyribonucleic acid. It carries the genetic instructions required for the
development, functioning and reproduction of all known living organisms. In eukaryotic
organisms (like animals, plants, and fungi), DNA occurs in the nucleus of each cell. In
prokaryotic organisms (single-celled organisms like bacteria and archaea), DNA occurs
in the cell's cytoplasm.

What is DNA made of?


DNA is made up of two strands of smaller molecules coiled around each other in a double-
helix structure. If you uncoiled DNA, you could imagine it looking somewhat like a ladder.
Ladders have two important parts: the poles on either side, and the rungs that you climb up.
The "poles" of DNA are made of alternating molecules of deoxyribose (a sugar) and
phosphate. While they provide the structure, it's the "rungs" of the DNA that are most
important for bioinformatics. To understand the "rungs" of the DNA, imagine a ladder split
in half down the middle so that each half is a pole with a bunch of half-rungs sticking out. In
DNA, these "half-rungs" are molecules called nucleotides. Genetic information is encoded
into DNA by the order, or sequence, in which these nucleotides occur.

What are nucleotides?


Nucleotides are the building blocks of nucleic acids (DNA and RNA–we'll get to that one later
on). In DNA, there are four types of nucleotide: Adenine, Cytosine, Guanine, and Thymine.
Because the order in which they occur encodes the information biologists try to understand,
we refer to them by their first letters: A, C, G, and T, respectively.
A Adenine
G Guanine
C Cytosine
T Thymine

Back to the ladder analogy. "Nucleotide" is the type of molecule that makes up each half-
rung of our ladder. Because they function as the units that encode genetic information, each
letter is also called a base. As an example of how we represent a sequence of DNA bases
using these letters: if we took the DNA of Bacillus anthracis, the bacterium that causes anthrax,
unfolded the double helix into a ladder, and then split the ladder in two, the top
half (the forward strand; we'll get to that in a bit) would be written like this:
ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTT

What are basepairs?


When you put both sides of the ladder back together, each rung is made by the bond
between two bases. When they are bonded, we call them a basepair.
When it comes to basepairs, it's important to remember that Adenine only ever bonds with
Thymine, and Guanine with Cytosine. In the ladder analogy, this means that each rung will
be denoted as "A-T," "T-A," "G-C," or "C-G."

What are DNA strands?


Remember when we split the DNA "ladder" in half down the middle so that each side was a
pole with half-rungs sticking out? These two halves are called strands. In order to distinguish
the two strands, scientists label one the forward strand and the second, the reverse strand.
Above, we gave the example of the forward strand in the Bacillus anthracis bacteria. Here's
what it looks like paired with its reverse strand:

The lines that connect the bases on either side denote a basepair relationship.
"Forward" and "reverse" are just labels. The choice of labeling is arbitrary and does not
depend on any inherent property of the DNA. The forward strand is not "special". Scientists
decide which to call "forward" and which to call "reverse" when they first analyze the DNA
of an organism. Even though the decision is arbitrary, it's important to maintain consistency
with that decision for the sake of clear communication.
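
The pairing picture described above can be reproduced with a short sketch. It assumes only the pairing rules stated earlier (A with T, G with C); the function name `paired_view` and the use of `|` characters for the bond lines are our own illustrative choices. The second strand is written position by position under the first, so each column is one basepair.

```python
# Pairing rules from the text: A bonds with T, G bonds with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def paired_view(forward: str) -> str:
    """Return a three-line picture: forward strand, bond lines, partner strand."""
    # Look up the partner base for every position of the forward strand.
    partner = "".join(PAIRS[base] for base in forward)
    # One vertical bar per basepair.
    bonds = "|" * len(forward)
    return "\n".join([forward, bonds, partner])


# The Bacillus anthracis fragment quoted earlier in the text.
forward = "ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTT"
print(paired_view(forward))
```

Each column of the printed output shows one "rung" of the ladder.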

The forward and reverse strands may also be denoted with different terms. For example, in
datasets you may find them labeled as + and -. They might also be called top and bottom
strands. In our opinion, these variations are needlessly confusing. Please avoid referring to
strands with any terms other than forward and reverse.

Is there a directionality of DNA?


Yes, DNA has a directionality, given by the chemical polarity of each strand (its 5' and 3'
ends). This direction runs
in opposite ways for each strand. Typically we indicate this polarity with arrows:

Most biological mechanisms (but not all) take place on a single strand of the DNA and in the
direction of the arrow. Hence the sequences of the DNA above will be "seen" by the biochemical
machinery as either:

This latter sequence is called the reverse complement of the first and is formed by reversing
the letters, then interchanging A and T and interchanging C and G.
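
The three operations involved (reversing, complementing, and doing both) can be sketched in a few lines. This is a minimal illustration that assumes the input contains only the uppercase letters A, C, G and T; the function names and the example sequence `GATTC` are our own.

```python
# Translation table that swaps A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(seq: str) -> str:
    """Read the letters in the opposite order."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Swap every base for its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Reverse the letters, then swap every base for its partner."""
    return complement(reverse(seq))


seq = "GATTC"
print(reverse(seq))             # CTTAG
print(complement(seq))          # CTAAG
print(reverse_complement(seq))  # GAATC
```

Applying `reverse_complement` twice returns the original sequence, which is a handy sanity check.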
Hence a DNA sequence AAACT may need to be considered:
• in reverse: TCAAA
• as a complement: TTTGA
• as a reverse-complement: AGTTT

What is a sense/antisense?
When a process occurs in the expected direction, its directionality may be called sense;
if it goes against the expected direction, its directionality may be called anti-sense. It is
very important not to conflate the concepts of forward/reverse with sense/anti-sense, as
these are completely unrelated. Sense/anti-sense is relative to a sequence's direction;
the sequence in turn may come from a forward or reverse strand.

What is DNA sequencing?


DNA sequencing is the "catch-all" term that describes the processes for identifying
the composition of a DNA macromolecule. The results of a DNA sequencing process are data
files stored in an unprocessed format, typically FASTA, FASTQ, or unaligned BAM files.
Most published papers also deposit their data in repositories from which it can be downloaded for
reanalysis.
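
To give a feel for the simplest of these formats, here is a minimal sketch of a FASTA reader. It assumes the common layout in which each record starts with a header line beginning with `>` and is followed by one or more lines of sequence. The function name `read_fasta` and the two example records are hypothetical, and real projects would normally rely on an established library rather than a hand-written parser.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical in-memory file with two records.
example = io.StringIO(">seq1 hypothetical record\nATATTTTTTC\nTTGTTTT\n>seq2\nACGT\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```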

What gets sequenced?


It is essential to note that instruments do not directly sequence the DNA in its original form.
Sequencing requires a laboratory process that transforms the original DNA into a so-called
"sequencing library", an artificial construct based on the original DNA. The process of
creating this sequencing library introduces a wide variety of limitations and artificial
properties into the results. In addition, the method of creating the sequencing library will
also limit the information that can be learned about the original DNA molecule.
Most (perhaps all) life scientists use the term "sequencing" overly generously and are often
under the assumption that it produces more precise information than what it can actually
deliver.

What is a genome?
A genome is all of an organism's DNA sequence. Each cell typically contains a copy of the
entire genome. More accurately, each cell has one or more nearly identical copies of the
genome. Replication is the process of duplicating the genome when a cell divides.
Due to complementarity, in double-stranded DNA the number of A nucleotides equals the
number of T nucleotides, and the number of C nucleotides equals the number of G
nucleotides. The relative ratio of AT to CG pairs, however, may be very different: some genomes
contain more AT pairs while others contain more CG pairs.
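
Both statements can be checked with a small sketch. The quantity usually reported is the "GC content", the fraction of bases that are G or C. The function names below are our own, and the code assumes sequences written with the letters A, C, G and T.

```python
from collections import Counter

# Translation table that swaps A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Fraction of A/C/G/T bases in seq that are G or C."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / total


def duplex_counts(forward: str) -> Counter:
    """Count bases over both strands of a double-stranded molecule."""
    return Counter(forward + forward.translate(COMPLEMENT))


counts = duplex_counts("AAACT")
print(counts["A"] == counts["T"], counts["C"] == counts["G"])  # True True
print(gc_content("AAACT"))  # 0.2
```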

What is the purpose of the genome?


The genome contains the information that makes the functioning of an organism possible. In
cellular organisms, for example, it has regions that contain the instructions for making
proteins. These are typically called "coding regions". The genome may have other regions
that are used to produce other types of molecules, and it has regions that regulate the rates
at which other processes take place.

All genomes are subject to evolutionary principles; hence some (or even substantial) regions
of a healthy genome may be non-functional and may no longer serve any purpose. Some
percentage of a genome may consist of copies of various kinds of interspersed repeat
sequences. At some point these regions were labelled "junk DNA", a term that later
became a lightning rod of controversy.

How big are genomes?


Functional genomes range from as short as 300bp to as long as the 150 billion basepairs of
Paris japonica, a rare perennial plant from Japan with showy, white, star-like flowers. It is
common to refer to
genome sizes in terms of kilobases (thousands), megabases (millions) and gigabases
(billions). Here are some other genome sizes:
• Ebola virus genome: 18 thousand basepairs (18Kb)
• E. coli bacterium genome: 4 million basepairs (4Mb)
• Baker's yeast fungus genome: 12 million basepairs (12Mb)
• Fruit fly genome: 120 million basepairs (120Mb)
• Human genome: 3 billion basepairs (3Gb)
• Some salamander species: 120 billion basepairs (120Gb)

What is RNA?
Whereas DNA is the blueprint, RNA is a copy of a smaller interval of this blueprint, transcribed
into a
molecule similar to DNA except that the base T (Thymine) is replaced by U (Uracil) and that it
contains other chemical modifications that change its properties relative to DNA. RNA is a
polymeric, single-stranded molecule that usually performs some type of function and is
believed to exist transiently. Unlike DNA, there are many classes of RNA: mRNA, tRNA, rRNA,
and many others. DNA is continuously present in the cell, whereas RNA degrades
quickly (within minutes).

How does the genome function?


The genome has numerous functions. Here we attempt to describe one, perhaps
the most studied: the transcription of primary mRNA in eukaryotic cells.

The cell begins by transcribing a "gene" (see below) into an RNA molecule. Then pieces of the
RNA are cut out and discarded, in a process called splicing. Each discarded piece is called an
intron. Each piece between consecutive introns is called an exon, and the RNA molecule
with the introns removed is known as messenger RNA, or mRNA. Early estimates suggested
that perhaps 35% of human
genes are alternatively spliced, meaning that under different circumstances, different
combinations of exons are selected; later high-throughput studies put the figure above 90%
for genes with multiple exons.
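
Splicing as described above can be sketched as simple string surgery. The toy pre-mRNA below is entirely made up: exons are written in uppercase and introns in lowercase purely for readability, and the intron coordinates are assumptions of this example (0-based, end-exclusive), not real annotation.

```python
from typing import List, Optional, Sequence, Tuple


def exons_of(pre_mrna: str, introns: Sequence[Tuple[int, int]]) -> List[str]:
    """Return the pieces left between introns given as (start, end) intervals."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # sequence before this intron
        position = end                          # jump past the intron
    exons.append(pre_mrna[position:])           # sequence after the last intron
    return [exon for exon in exons if exon]


def join_exons(exons: Sequence[str], keep: Optional[Sequence[int]] = None) -> str:
    """Concatenate all exons, or only those whose indices are listed in keep."""
    if keep is None:
        return "".join(exons)
    return "".join(exons[i] for i in keep)


# Hypothetical transcript: exon, intron, exon, intron, exon.
pre_mrna = "AUGGCC" + "guaagag" + "GAACCU" + "guccag" + "UUGUAA"
introns = [(6, 13), (19, 25)]

exons = exons_of(pre_mrna, introns)
print(join_exons(exons))          # AUGGCCGAACCUUUGUAA
print(join_exons(exons, [0, 2]))  # AUGGCCUUGUAA
```

The second call illustrates alternative splicing in its simplest form: the middle exon is skipped, giving a different mRNA from the same gene.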

What is a protein?
A protein is a three-dimensional macromolecule built from a series of so-called amino acid
molecules. There are 20 kinds of amino acids that can form a
protein; each is labelled with a letter in the 'alphabet' of protein sequences.
Proteins can be described by their sequence, though in our current
state of understanding the sequence alone is typically insufficient to fully determine their 3D
structure or function. Whereas DNA and mRNA typically carry information, proteins
are the actual physical building blocks of life. Every living organism is built out of proteins and
functions via the interaction of proteins that are being continuously produced. A short series
(fewer than 40) of amino acids without a well-defined 3D structure is called a polypeptide
(peptide).

How are proteins made?


The process of reading DNA and creating mRNA from it is called transcription. Then, in
eukaryotes, the mRNA is transported out of the nucleus (to the cell's cytoplasm), where it is
converted to a protein sequence in a process called translation. Multiple proteins (even
hundreds) may be translated from a single mRNA molecule.
• transcription: DNA --> mRNA
• translation: mRNA --> Protein

To perform the translation, the mRNA is partitioned into units of three consecutive letters,
each called a codon. A codon is then translated via a translation table into an amino acid. A
protein is a sequence of amino acids:

There is a genetic code for translating a codon into an amino acid. For example, the codon
TCA (or UCA if we describe it as an RNA sequence) codes for S, the amino acid Serine. The
translation process begins with the so-called start codon ATG, which corresponds to M (the
amino acid Methionine). Hence all protein sequences start with M.
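
The codon-by-codon lookup can be sketched as follows. The table below is the standard genetic code written with DNA letters, built from a compact 64-letter string in which `*` marks a stop codon. The function names are our own, and the sketch assumes the input already starts at the start codon and is read in a single frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (the third position varies fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Write a DNA sequence with RNA letters (T becomes U)."""
    return dna.replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("TCA"))       # UCA
print(CODON_TABLE["TCA"])      # S
print(translate("ATGTCATGA"))  # MS
```

In the last call, ATG gives M, TCA gives S, and TGA is a stop codon that ends the protein.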

What is a gene?
The "official" definition for the term gene in the Sequence Ontology is a region (or regions)
that includes all of the sequence elements necessary to encode a functional transcript. A
gene may include regulatory regions, transcribed regions and/or other functional sequence
regions.

Are there other types of genomic features?


Genomic regions may have a wide variety of functions. Here are a few often-studied types:

Untranslated regions
The region of the mRNA before the start codon (or the corresponding genomic region) is
called the 5' UTR (five prime UTR) or untranslated region; the portion from the stop codon to
the start of the poly-A tail is the 3' UTR (three prime UTR).

Promoter regions
The genomic region just before the 5' UTR may contain patterns of nucleotides, called the
promoter, that are used to position the molecular machinery that performs the transcription
to RNA. Other patterns in the DNA tell the cell when (how frequently and in what tissues) to
transcribe the gene; that is, they regulate transcription. A pattern that increases the
frequency of transcription operations is an enhancer, while one that decreases the
frequency is a silencer.

CpG islands
CpG islands are regions of DNA with an unusually high frequency of CpG sites, that is,
positions where a C (cytosine) nucleotide is followed by a G (guanine)
nucleotide in the linear sequence of bases along its 5' -> 3' direction. Cytosines in CpG
dinucleotides can be methylated. In turn, methylated cytosines within a gene may change its
expression, a mechanism studied in epigenetics, a larger field of science concerned with
gene regulation.
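
A rough feel for how CpG-rich a region is can be obtained by counting. The sketch below reports the number of CpG dinucleotides, the GC fraction, and the observed/expected CpG ratio, a measure commonly used when searching for CpG islands. The ratio formula and the function name `cpg_stats` are not from the text above; treat them as assumptions of this example, and note that published island criteria also impose a minimum region length.

```python
from typing import Dict


def cpg_stats(seq: str) -> Dict[str, float]:
    """Count CpG dinucleotides and derive two simple summary ratios."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # a C immediately followed by a G, read 5' -> 3'
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Observed CpG count relative to what the C and G frequencies alone predict.
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return {"cpg": cpg_count, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


print(cpg_stats("CGCGCG"))
print(cpg_stats("ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTT"))
```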

What is homology?
Two regions of DNA that are descended from the same sequence (through processes of
duplication of genomic regions and/or separation of two species) are homologous, or
homologs of one another.

How is bioinformatics practiced?

Bioinformatics requires a broad skillset. The diverse tasks undertaken by bioinformaticians
can be roughly divided into three tiers:

1) Data management:
Data management requires accessing, combining, converting, manipulating, storing, and
annotating data. It requires routine data quality checks, summarizing large amounts of
information, and automating existing methods.

2) Primary data analysis:


Analysis of the data requires running alignments, variant callers, and RNA-seq
quantification, as well as finding lists of genes. Strategically, the bioinformatician must
anticipate potential pitfalls of planned analyses, find alternatives, and adapt and customize
the methods to the data.

3) Data interpretation:
Data management and analysis are meaningless without accurate and insightful
interpretation. Bioinformaticians discover or support biological hypotheses via the results of
their primary analysis, and so they must be able to interpret their findings in the context of
ongoing scientific discourse.

What is the recommended computer for bioinformatics?


In our experience, the most productive setup for a bioinformatician is to use a macOS-based
computer to develop and test the methods and then a high-performance Linux
workstation (or cluster) to execute these pipelines on the data.

How much computing power do we need?


Bioinformatics methodologies are improving at a rapid pace. A regular workstation is often
all you need to analyze data from a typical, high-volume sequencing run (hundreds of
millions of measurements). For example, you could handle most RNA-Seq data analysis
needed in a given day by using only the hisat2 aligner on a standard iMac computer (32GB
RAM and 8 cores used in parallel). Larger analyses like genome assembly, however, typically
require more memory than is available on a standard computer. As a rule,
low-quality data (contamination, incorrect sample prep, etc.) takes substantially longer to analyze
than high-quality data.

Does bioinformatics need massive computing power?


No!
Mastering bioinformatics requires nothing more than a standard, Unix-ready laptop. Of
course, the computational needs for applying bioinformatics methods to certain
problems will depend on the amount and scale of data being analyzed. Remember this: you
don't need an expensive or powerful system to become one of the best
bioinformaticians on the planet!

What about the cloud?


Cloud computing is becoming an increasingly common platform for bioinformatics,
especially for efforts that allow bioinformaticians to "bring the tools to the data" - applying
your own algorithms to huge sequence databases. Cloud services such as Amazon Web
Services (AWS) also enable anyone to host web applications and services on powerful,
accessible machines while only paying for the compute time incurred. Running a cloud
service involves learning about object stores, virtual private clouds, file systems, and
security.

Do I need to know Unix to do bioinformatics?


Bioinformatics has primarily been developed via freely available tools written on the Unix
platform. The vast majority of new advances are published with software written for Unix-
based operating systems. It's unlikely you'll be in a position for long-term career
advancements as a bioinformatician without a basic understanding of the command line.
The good news is that using Unix is really not that complicated.

Do I need to learn a programming language?


Yes. An introductory level of programming ability (in any programming language) will be
necessary. Fortunately, programming languages function by similar underlying logic, even
when they seem quite different on the surface. While symbols, syntax, and procedures may
differ among programming languages, experience with any one
language helps you understand the thought processes required for computational
analysis. Within this course we dedicate an entire module with several sections that help
you acquire the skills that you need.

Are there alternatives to using Unix?


Alternatives to Unix are limited, but do exist. They fall into two categories:

A. Software that provides a web interface to the command line tools:


These software systems run the same command line tools that one could install into a Unix-
based system, but the interfaces to these tools are graphical and provide better information
"discoverability."
1. Galaxy (open source)
2. GenePattern (open source)
3. BaseSpace (commercial)

B. Systems that offer custom implementations of bioinformatics methods:


These can be standalone programs that run on your local system; they are compatible with
web-based applications.
1. CLC Genomics Workbench
2. Golden Helix
3. DNA Star

Most common databases in bioinformatics


Databases are essential for bioinformatics research and applications. Many databases exist,
covering various information types: for example, DNA and protein sequences, molecular
structures, phenotypes and biodiversity. Databases may contain empirical data (obtained
directly from experiments), predicted data (obtained from analysis), or, most commonly,
both. They may be specific to a particular organism, pathway or molecule of interest.
Alternatively, they can incorporate data compiled from multiple other databases. These
databases vary in their format, access mechanism, and whether they are public or not.
Some of the most commonly used databases are listed below:
• Used in biological sequence analysis: GenBank, BLAST, UniProt.
• Used in structure analysis: Protein Data Bank (PDB).
• Used in finding protein families and motifs: InterPro, Pfam.
• Used for next-generation sequencing: Sequence Read Archive (SRA).
• Used in network analysis: metabolic pathway databases (KEGG, BioCyc), interaction
analysis databases, functional networks.

Most common software and tools in bioinformatics


One of the common tasks for bioinformaticians is to develop software or tools to automate
their workflow. Such programs range from simple command-line tools to more
complex graphical programs and standalone web services available from various
bioinformatics companies or public institutions. Over the years, there has been continuous
development of open-source software tools, and nowadays many are
available online. The combination of a continued need for new algorithms for the analysis of
emerging types of biological readouts, the potential for innovative in silico experiments, and
freely available open code bases has helped to create opportunities for all research groups
to contribute to both bioinformatics and the range of open-source software available,
regardless of their funding arrangements. The most common open-source software
packages include Bioconductor, BioPerl, Biopython, BioJava, BioRuby, EMBOSS, and
GenoCAD.
