
An Introduction to Genome Graphs using GenGraph and Python

Jon Ambler¹*, Shandukani Mulaudzi¹, Nicola Mulder¹


¹Computational Biology Division, Dept. of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town
* jon.ambler@uct.ac.za

Introduction

Genome graphs are increasingly used in research, replacing the limited and biased single reference sequence. But adoption is slow, possibly due in part to a difficult conceptual model and a lack of tools for using genome graphs in everyday analysis. GenGraph gives users the power to create and work with genome graphs in an intuitive manner, allowing for easier tool development.

What is GenGraph?

GenGraph is a set of Python tools for creating and working with genome graphs. It uses existing file formats, presenting genome graphs in a way that is intuitive even to novice programmers.

A Python library - Python is simple to learn, widely used, and comes with a large number of packages for the biological sciences. GenGraph aims to add to this by bringing genome graphs to Python.

A toolkit - GenGraph includes a toolkit for creating genome graphs using different approaches, as well as for carrying out tasks such as extracting subsequences and comparing regions between different isolates.

A structure - GenGraph creates a directed sequence graph, where the individual genomes are encoded as walks within the graph along a labeled path. Nodes are labeled with the relative start and stop positions for each isolate, allowing for the use of the existing coordinate systems used in annotation files.

Features
Simple setup

pip install GenGraph

Download the GenGraph toolkit from GitHub

git clone https://github.com/jambler24/GenGraph.git

Modular design - Any alignment tool can be incorporated into the GenGraph toolkit for creating genome graphs. This allows the use of alignment tools that perform better for certain genomes, as well as new tools that will be developed in future.

Compatible - GenGraph can export graphs in a number of formats, including GraphML, XML, JSON, and YAML. These in turn can be imported into R and Cytoscape, or visualised using D3.js or Plotly.

Simple to interpret - The structures in the graph reflect biological relationships between sequences in an interpretable manner, unlike De Bruijn graphs, which are abstract. (Figure 1)

Compression - By representing shared sequences in single nodes, multiple genomes can be represented in a single file that is a fraction of the size of the individual fasta files. (Figure 2)

Figure 1: Interpretation and compatibility. A) The graphic was created by extracting the subgraph that represents a gene of interest, importing it into Cytoscape, and scaling the length of the nodes by the length of the sequences within them. The nodes Aln_74_31 & Aln_74_32 represent a SNP (C / T) near the start of the gene. B) This table shows the node attributes associated with the subgraph, including isolates and the sequences.

Figure 2: The compression achieved by converting a number of fasta files to a GenGraph genome graph is related to the similarity of the sequences. At 1 mutation per kb, the graph file size increases only slightly as more sequences are included.
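A rough back-of-envelope model can illustrate why the graph grows slowly when the sequences are similar. The sketch below is not GenGraph code and does not reproduce the measurements in Figure 2. It assumes that every isolate after the first differs only by SNPs at distinct positions, that each SNP adds one single-base alternative node, and that node labels and other metadata take no space. The function name stored_bases and the 1 Mb genome length are illustrative choices.

```python
def stored_bases(genome_length, n_isolates, snps_per_kb=1.0):
    """Return (bases in separate FASTA records, bases in a shared-node graph).

    Simplified model (an assumption, not GenGraph's file format):
    shared sequence is stored once, and each isolate after the first
    adds one 1-base alternative node per SNP.
    """
    fasta = genome_length * n_isolates
    snps_per_isolate = round(genome_length / 1000 * snps_per_kb)
    graph = genome_length + (n_isolates - 1) * snps_per_isolate
    return fasta, graph


# Compare the two totals as more isolates are added.
for n in (1, 2, 5, 10):
    fasta, graph = stored_bases(1_000_000, n)
    print(n, fasta, graph, round(graph / fasta, 3))
```

Under these assumptions the FASTA total grows linearly with the number of isolates, while the graph total grows only by the number of new variant bases, so the ratio keeps falling. Real graph files also store node names, isolate lists, and coordinates, so actual savings will be smaller.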

Graph Creation

Using the GenGraph toolkit, a set of genomes in fasta format can be converted into a GenGraph graph. This process uses existing alignment tools and is able to represent both large structural and nucleotide level variation.

A. Identification of collinear blocks. During this stage, the structural similarities between the sequences are resolved.

B-C. Realign each of those blocks. The sequences in these blocks then undergo multiple sequence alignment, resulting in a nucleotide level alignment.

D. Conversion to a genome graph. GenGraph converts the alignments to a genome graph, resolving inverted regions and incorporating coordinates and isolate information.

Using GenGraph in Python

GenGraph includes functions and methods that make it simple to write code that answers common biological questions. As an example, this code shows how to extract the longest node that contains all the isolates. This could represent a conserved region containing core genes.

    >>> # Import graph
    >>> gg_object = import_gg_graph(< path to GG xml file >)
    >>> # Check how many isolates are included in the graph
    >>> isolate_count = len(gg_object.ids())
    >>> # Set some starting values
    >>> longest_node = 'none'
    >>> longest_node_length = 0
    >>> for a_node in gg_object.nodes():
    ...     # Go through all nodes, looking for nodes that contain all the isolates
    ...     if len(gg_object.node[a_node]['ids'].split(',')) == isolate_count:
    ...         # Check if they are longer than the current longest node
    ...         if len(gg_object.node[a_node]['sequence']) > longest_node_length:
    ...             longest_node = a_node
    ...             longest_node_length = len(gg_object.node[a_node]['sequence'])
    ...
    >>> # Return the longest node
    >>> print(longest_node)
    Aln_21_42
    >>> print(longest_node_length)
    11302
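To make the node attributes used in this example more concrete, the sketch below builds a tiny stand-in graph from plain Python dictionaries rather than a real GenGraph object. The 'ids' and 'sequence' attributes follow the example above. The node names, the 'coords' layout, and the helpers isolate_sequence and variant_nodes are hypothetical and exist only for illustration. It shows how a SNP can appear as two alternative single-base nodes, and how an isolate's sequence might be recovered by visiting its nodes in coordinate order.

```python
# Toy stand-in for a genome graph: plain dicts, not a GenGraph object.
# 'coords' maps each isolate to a (start, stop) pair; this layout is an
# assumption made for the sketch, not GenGraph's attribute naming.
toy_nodes = {
    "Aln_1_1": {"ids": "iso1,iso2", "sequence": "ATGGC",
                "coords": {"iso1": (1, 5), "iso2": (1, 5)}},
    "Aln_1_2": {"ids": "iso1", "sequence": "C",
                "coords": {"iso1": (6, 6)}},
    "Aln_1_3": {"ids": "iso2", "sequence": "T",
                "coords": {"iso2": (6, 6)}},
    "Aln_1_4": {"ids": "iso1,iso2", "sequence": "GATTACA",
                "coords": {"iso1": (7, 13), "iso2": (7, 13)}},
}


def isolate_sequence(nodes, isolate):
    """Rebuild one isolate's sequence by visiting its nodes in coordinate order."""
    own = [n for n in nodes.values() if isolate in n["ids"].split(",")]
    own.sort(key=lambda n: n["coords"][isolate][0])
    return "".join(n["sequence"] for n in own)


def variant_nodes(nodes, all_isolates):
    """Return the sorted names of nodes that lack at least one isolate."""
    expected = set(all_isolates)
    return sorted(name for name, n in nodes.items()
                  if set(n["ids"].split(",")) != expected)


print(isolate_sequence(toy_nodes, "iso1"))
print(isolate_sequence(toy_nodes, "iso2"))
print(variant_nodes(toy_nodes, ["iso1", "iso2"]))
```

The two isolates share the first and last nodes and differ only at the single-base nodes in between, which is the same C / T pattern described for Figure 1. Inverted regions are not handled in this sketch.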

Conclusions

GenGraph is an open source Python library and toolkit that uses existing file formats, presenting genome graphs in a way that is intuitive even to novice programmers. It comes with a number of functions that make common tasks, like extracting sequences from a region or identifying homologues, simple. This makes the transition away from single linear sequences far less daunting, encouraging adoption.
