How To Build A Useful Reference Sequence

An important first step in analyzing biological sequence data is often an alignment
to a reference sequence. The most obvious reason for this is so that two or more
sequences can be compared side by side. Having a variety of markers annotated on
the reference and represented graphically can be key to understanding the
differences and similarities between the aligned sequences. It can also aid in
navigating through large amounts of sequence data.

In this tutorial, I will show how one can create such a reference sequence using the
CLC Genomics Workbench and the UCSC Genome Browser.

We will be using a free plug-in for the CLC Workbenches that makes it possible to
batch annotate specific regions on a sequence with various types of features,
starting with a raw sequence and using a GTF or GFF file. GFF, or General Feature
Format, is similar to the GTF (Gene Transfer Format) file that we will download from
the UCSC Table Browser. We will first need to install the plug-in into Genomics
Workbench. You can download and install the plug-in in one easy step using the
Plug-in manager button on the CLC Workbench Toolbar.

In the dialog above, click the Download plug-ins button and select the Annotate with
GFF file plug-in. Click the download and install button. Please note that you need an
administrator login to do this.

Next we will visit the UCSC Genome Browser:

http://genome.ucsc.edu/

Under the Download link on the left, you can download sequence and annotation
information for a variety of model organisms.

In this exercise we will download mouse chromosome 2 as a FASTA file and
decorate it with annotations for known genes from RefSeq, as well as known Single
Nucleotide Polymorphisms (SNPs) from dbSNP.

Start with the backbone sequence of Chromosome 2 here under Data set by
chromosome:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr2.fa.gz

Once the download completes, unzip the file and import it into a new folder in the
CLC Workbench.
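
If you prefer to script the unzip step, the following is a minimal sketch in Python
using only the standard library. The function names gunzip and fasta_summary are
my own, not part of the Workbench or of UCSC, and the sketch assumes the download
is a gzip-compressed FASTA file containing a single record.

```python
import gzip
import shutil


def gunzip(src_path, dest_path):
    """Decompress a .gz file to dest_path, streaming so large files stay out of memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


def fasta_summary(path):
    """Return (header, sequence_length) for a single-record FASTA file."""
    header = None
    length = 0
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    raise ValueError("expected a single-record FASTA file")
                header = line[1:]  # drop the leading '>'
            elif line:
                length += len(line)  # count bases, including N and lowercase
    return header, length
```

Calling gunzip("chr2.fa.gz", "chr2.fa") and then fasta_summary("chr2.fa") gives a
quick check that the file decompressed cleanly and holds one sequence before you
import it into the Workbench.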

At this point the sequence is completely unannotated. You may wish to add
comments to this sequence under the Element info view in the Workbench.
Useful comments here include the genome build the sequence comes from (mm9 in
this case), important features on the sequence, and a short description.

Next we will retrieve the annotation tracks for RefSeq Genes and SNPs from the
UCSC Table Browser and save them to a GTF file:

Set the drop down choices in the Browser as shown above, and click the Get
Output button to download the file. Once you have completed the download and
inflated the file, you can inspect its contents in any simple text editor as shown
below.
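
As an alternative to reading the file by eye, here is a small Python sketch that
splits each GTF line into its nine standard tab-separated columns and counts how
many records there are of each feature type. The helper names are hypothetical,
and the sketch assumes a well-formed GTF file with no track header lines other
than comments starting with #.

```python
from collections import Counter

# The nine tab-separated columns defined by the GTF/GFF formats.
GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes")


def parse_gtf_line(line):
    """Split one GTF line into a dict keyed by the nine standard columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    return record


def count_feature_types(lines):
    """Count records per feature type, skipping blank lines and '#' comments."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[parse_gtf_line(line)["feature"]] += 1
    return counts
```

Passing an open file handle to count_feature_types gives a quick overview of which
feature types the file will add to the sequence.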

This file can now be used to batch annotate the chr2 sequence previously saved in
the Workbench.

After installing the Annotate with GFF plug-in, you should now see it in the
Toolbox as shown:

Double-click to launch the tool wizard, and you will be prompted for the template
sequence and the GFF file. Save the result, then open the chr2 file and inspect the
annotations. You will need to zoom out and scroll to a feature-dense region of the
chromosome. Note the useful sections of the right sidebar settings, which are
expanded in the following screen:

The Annotation Layout and Annotation Types settings offer a large degree of
flexibility with regard to the display of features and labeling. Adjusting these
settings for optimal display at differing zoom levels and feature densities will be
important when you go to generate figures for publication and presentation. Also
note the redundant nature of some feature types. You can choose which to hide or
show.

Using a split view of the linear sequence, together with the annotation table view,
makes it easy to inspect and edit annotations. Command-click the small icon for the
Annotation Table to do this:

The Annotation table lets you easily navigate a huge sequence file or contig by
making selections on the table that are linked to the graphic display.

Filtering, column sorting, and batch renaming with the Advanced Rename
function are all useful for creating visual reference sequences.

Next, we will add known SNPs to the protein-coding regions of this chromosome.

We return to the UCSC Genome Browser and the Table Browser interface.
This time we will create a GTF file containing SNP features from the group
Variations and Repeats and the SNP128 feature track.

Much of the data in this track lies outside protein-coding regions and is of much
less interest in many targeted re-sequencing strategies. As a practical matter,
annotating just the known SNPs in coding regions, rather than the entire
chromosome or genome, also keeps the file size from exploding. For these
reasons, we will use the Table Browser interface to intersect the SNP feature track
with the RefSeq Genes track.
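
To make the idea of an intersection concrete, here is a rough Python sketch that
keeps only those SNP positions falling inside at least one gene interval. This is
an illustration of the concept, not the Table Browser's own implementation. It
assumes SNPs are single positions and intervals are 1-based and inclusive, and
the function names are hypothetical.

```python
import bisect


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def positions_in_intervals(positions, intervals):
    """Return the positions that fall inside at least one interval."""
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    kept = []
    for pos in positions:
        # Find the last merged interval that starts at or before pos.
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and merged[i][0] <= pos <= merged[i][1]:
            kept.append(pos)
    return kept
```

With made-up intervals (100, 200), (150, 250) and (400, 500), the positions 50,
100, 225, 300, 500 and 501 reduce to 100, 225 and 500. Everything outside the
intervals is dropped, which is why the intersected file is so much smaller.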

Click the Create button next to Intersection.

Create an intersection of SNP128 data with RefSeq Genes as shown above, click the
submit button, then get the output GTF file, and annotate chr2 as was done before
with the RefSeq track.

Unzip and open the new GTF file in a text editor. In this file the SNPs, as you will
see below, are still labeled with the exon feature type.

We will correct this with a simple search-and-replace step, replacing exon with
Variation-SNP.
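
If the file is too large for a text editor, the same replacement can be scripted.
The sketch below changes only the feature-type column (the third tab-separated
field), so any identifier that happens to contain the word exon is left alone. The
function name and the default values are my own assumptions. In particular, the
default assumes the feature type is written in lowercase as exon, so adjust it if
your file differs.

```python
def relabel_feature_type(lines, old="exon", new="Variation-SNP"):
    """Yield GTF lines with the feature column (third field) changed from old to new."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Leave comment lines and malformed lines untouched.
        if len(fields) >= 3 and not line.startswith("#") and fields[2] == old:
            fields[2] = new
        yield "\t".join(fields) + "\n"
```

To use it, open the downloaded GTF file for reading and a new file for writing,
then pass the result of relabel_feature_type(input_handle) to the output handle's
writelines method. Writing to a new file keeps the original download intact in
case the replacement needs to be repeated.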

Save the file, and run the Annotate with GFF tool in the Workbench.

An inspection of the chr2 sequence in the Workbench will now show a great many
SNPs, but only in those areas overlapping the pre-existing CDS and Exon
features from the RefSeq track.

Since the naming of the features is a bit redundant with the feature type, we will use
the Advanced Rename feature to rename all the features with the /gene_id qualifier.
There are thousands of features in this table and the Advanced Rename function has
a limit of 10,000, so we will need to repeat the following process several times,
using a table filter to pick out the names of the features we want to replace. For
example, filter on the name CDS, select about 10,000 rows, right-click on the
selection, choose Advanced Rename from the pop-up menu, and then choose the
gene_id qualifier.

In the following screenshot we are looking at an alignment against the reference
sequence just created, with newly found SNPs annotated on the consensus. Note
the table in the lower panel. This is the SNP Detection table showing over 5000
SNPs, filtered down to 3 non-synonymous SNPs that are also noted in dbSNP.
