Contig Extended Definition

Quick-Reference Definitions:
• Sequence (noun): the specific order of individual units in a segment of DNA, RNA, or
  protein.
   o Ex: The sequence of this DNA fragment is “A,T,G,G,C”.
• Sequence (verb): the act of determining the order of the individual units of a
  macromolecule.
   o Ex: The DNA segment was sequenced.
• Genome (noun): the complete set of DNA, including all genes, contained in an organism.
• Transcriptome (noun): the collection of all genes that are expressed (turned on) at a
  given moment in time.

Definition

Short for “contiguous”, a contig is a construct made of many small overlapping DNA
fragments and is used to reconstruct a DNA sequence. DNA is made of four nucleotide
bases: adenine (A), thymine (T), guanine (G), and cytosine (C). Genes are functional units
of DNA, each having a unique arrangement of these bases. In order to study the function of
a gene, scientists must determine its specific letter code (Figure 1). This process is
called “DNA sequencing”. When large quantities of DNA are sequenced, the DNA must first be
fragmented into smaller, more manageable segments. After the segments are sequenced, they
are put back together by constructing contigs, in which the many fragments of DNA are
pieced back together (like a puzzle) by overlapping each component based on shared
sequences. In this way, a gene can be reconstructed one fragment at a time, eventually
creating one contiguous sequence (Figure 2).

[Figure 1 image: the example sequence is shown as A, G, T, G, A, C]

Figure 1. A breakdown of the structure of DNA, showing a molecule of DNA in its natural
helical form (left), a linearized fragment of DNA showing the complementary nucleotide
bases as colored vertical lines (middle), and the specific sequence of this fragment
showing the bases as their letter code (right).


Figure 2. Two representations of the process through which a contig is created. (A) An
example sentence is broken up into fragments, showing how the fragments can be realigned
based on the letters they share (Taylor, 2018). (B) A visualization in a format more
typical of the biological sciences, in which bars represent pieces of DNA and vertical
lines connecting the bars represent the individual overlapping bases. In this example,
components 1 through 3 are combined to make one contig (Genome Reference Consortium).
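
The overlapping idea in Figure 2 can also be sketched in a few lines of Python. This is only a toy illustration, not how any particular assembly program works: the three fragments are made up for this example, the function names `overlap_length` and `build_contig` are hypothetical, and the sketch assumes error-free fragments that all come from the same strand. It repeatedly joins the two fragments that share the longest overlap until one contiguous sequence remains.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest end of `left` that matches the start of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least `min_overlap` letters


def build_contig(fragments: list[str], min_overlap: int = 3) -> str:
    """Greedily merge the best-overlapping pair of fragments until one contig is left."""
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            # The fragments fall into separate groups, i.e. more than one contig.
            raise ValueError("fragments do not all overlap")
        # Join the pair, writing the shared letters only once.
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces[0]


# Made-up fragments: each one shares four letters with the next.
fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contig = build_contig(fragments)
print(contig, len(contig))  # prints: ATGGCGTACCTGA 13
```

Here the first two fragments share “GCGT” and the last two share “TACC”, so the three fragments collapse into one 13-letter contig. Real sequencing data contains errors, repeated regions, and fragments from both DNA strands, so actual assembly software is far more involved than this sketch.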


The term “contig” was first used in a publication by Rodger Staden in 1980. His team
devised a system for a more organized method of storing, managing, and manipulating
sequencing data (Staden, 1980). The word was created in order to more easily refer to the
product of overlapping many fragments into one consensus sequence that is used in later
steps of reconstructing a genome. The original definition is as follows:

“A contig is a set of [readings] that are related to one another by overlap of their
sequences. All [readings] belong to one and only one contig, and each contig contains at
least one [reading]. The [readings] in a contig can be summed to form a contiguous
consensus sequence and the length of this sequence is the length of the contig” (Staden,
1980).

Though the wording may change, the definition of this term has not significantly changed
since its first use (King, Mulligan, and Stansfield, 2014; Genome Reference Consortium).
The term is sometimes used interchangeably with “scaffold” and “sequence” (Genome
Reference Consortium).

Methods and Uses

In bottom-up sequencing methods like shotgun sequencing, the DNA is broken up into many
fragments that are sequenced individually, then pieced back together into contigs. In the
same way in which each contig was constructed, the contigs can be linked together by their
overlapping sequences to form an even larger segment, eventually creating one continuous
segment of DNA that represents the original sample. In this way, an entire genome can be
reconstructed. In the 1990s, this was the method used to sequence the human genome.
Scientists from all over the world were brought together, making unparalleled advances in
sequencing technology (International Human Genome Sequencing Consortium). With this
reference available, great leaps have been made in medicine and in our understanding of
gene interactions.

Currently, the most popular sequencing method is “next generation sequencing” (NGS). With
NGS, large quantities of DNA can be sequenced much faster and at a much lower price. RNA
sequencing is a type of NGS that uses
contigs to build transcriptomes. RNA is collected and sequenced from an organism at a
given time point, producing many reads. These reads are then assembled into a collection
of contigs. If a reference genome is available, the contigs can be mapped back to it to
determine the identity of each gene. If no reference is available, there are programs that
create a transcriptome from scratch by linking the contigs together by a series of
scaffolds. Gene identities can then be inferred based on similarities to known sequences
(Manfred et al., 2011). Eventually all reads are incorporated into one complete collection
of all the genes that are turned on at that time point.

Accessibility

Most studies that generate computer data make their data publicly available through a
number of online sources. The International Nucleotide Sequence Database Collaboration is
a collaborative project between the U.S. National Center for Biotechnology Information
(NCBI) GenBank database, the DNA Data Bank of Japan (DDBJ), and the European Nucleotide
Archive (ENA). The project aims to curate a comprehensive collection of genomic data from
research on any organism. GenBank accepts data submissions from researchers, including a
type called “Whole Genome Shotgun Submissions”, which can consist of genomes constructed
from both annotated and unannotated contigs. Each submission is given a four-letter
identifier followed by a submission version number. Each individual contig is given an
additional code that changes as the data is updated. For example, a contig in a first-time
submission could be given the code ABCD0100005, while a contig from an updated version of
this genome collection would be given a code like ABCD0200045. This type of data sharing
and accessibility helps move scientific progress forward. It saves researchers time and
resources by allowing them to access already gathered data, rather than generate it on
their own.

A Look into the Future

Sequencing technology continues to become faster and more extensive than ever. Burgeoning
technologies such as Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT)
sequencing and Oxford Nanopore MinION have made it possible to sequence even larger
fragments of DNA. Larger fragments lead to fewer contigs being constructed and, thus, less
time spent sequencing and reconstructing a genome (Roberts, Carneiro, and Schatz, 2017;
Loman, Quick, and Simpson, 2015). With advances such as these, it may someday be possible
to sequence a genome without having to break it into pieces, making contigs obsolete.
However, until that day, constructing contigs will remain an integral step in sequencing
technologies.

Definition strategies employed:

In this extended definition, I used three graphics: a mini-glossary for terms that the
general audience may not fully understand, and two diagrams. For other words that may need
defining, I used a combination of parenthetical and sentence definitions when appropriate.
These help the reader understand the concepts described. I also used everyday language and
examples that most readers will have heard of (like the Human Genome Project). I used
headings to break the definition up into separate topics, as well as bolding and italics
to emphasize different features. To some degree, I used etymology, in that I wrote about
when the term was first used.


