Professional Documents
Culture Documents
sequence. But adoption is slow, possibly due in part to a difficult A Python library - Python is simple to learn, widely used, and comes with a large number of
conceptual model and lack of tools for using genome graphs in packages for the biological sciences. GenGraph aims to add to this by bringing genome graphs to
more everyday analysis. GenGraph provides users with the Python.
A structure - GenGraph creates a directed sequence graph, where the individual genomes are
encoded as walks within the graph along a labeled path. Nodes are labeled with the relative start
Features and stop positions for each isolate allowing for the use of the existing coordinate systems used in
annotation files.
Simple setup
Figure 1: Interpretation and compatibility. A) The above graphic was created by extracting the
Compatible - GenGraph can export graphs in a number of subgraph that represents a gene of interest, importing it into Cytoscape, and scaling the length of
formats, including GraphML, XML, JSON, and YAML. These in the nodes by the length of the sequences within them. The nodes Aln_74_31 & Aln_74_32
turn can be imported into R and Cytoscape, or visualised using represent a SNP (C / T) near the start of the gene. B) This table shows the node attributes
D3.js or Plotly.
associated with the subgraph, including isolates and the sequences.
Simple to interpret - The structures in the graph reflect biological
relationships between sequences in an interpretable manner, Figure 2: The compression achieved by converting a
unlike De Bruijn graphs which are abstract. (Figure 1)
number of fasta files to a GenGraph genome graph is
Compression - By representing shared sequences in single related to the similarity of the sequences. At 1
nodes, multiple genomes can be represented in a single file a mutation per kb, the graph file size increases only
fraction of the size of the individual fasta files. (Figure 2) slightly as more sequences are included.
Graph Creation
Using the GenGraph toolkit, a set of genomes in fasta
format can be converted into a GenGraph graph. This
process uses existing alignment tools and is able to represent both large
structural and nucleotide level variation. Using GenGraph in Python
GenGraph includes functions
A. Identification of collinear blocks.
and methods that make it
Durning this stage, the simple to write code that answers common biological questions. As an
structural similarities between example, this code shows how to extract the longest node that contains
the sequences are resolved.
all the isolates. This could represent a conserved region containing core
B-C. Realign each of those genes.
blocks. # Import graph
>>> gg_object = import_gg_graph(< path to GG xml file >)
# Check how many isolates are included in the graph
The sequences in these blocks >>> isolate_count = len(gg_object.ids())
then undergo multiple sequence # Set some starting values
>>> longest_node = ’none’
alignment, resulting in a >>> longest_node_length = 0
>>> for a_node in gg_object.nodes():
nucleotide level alignment.
# Go through all nodes, looking for nodes that contain all the isolates
if len(gg_object.node[a_node][’ids’].split(’,’)) == isolate_count:
D. Conversion to a genome # Check if they are longer than the current longest node
if len(gg_object.node[a_node][’sequence’]) > longest_node_length:
graph. GenGraph converts the longest_node = a_node
alignments to a genome graph, longest_node_length = len(gg_object.node[a_node][’sequence’])
# Return the longest node
resolving inverted regions, >>> print(longest_node)
Aln_21_42
incorporating coordinates and >>> print(longest_node_length)
isolate information. 11302