
SNPs: detection and genotyping - Practical

Phenotype Association Tools in Galaxy

This computer practical is a modified version of Giardine et al. (2012) Phenotype Association Tools in Galaxy. Curr Protoc Bioinformatics. 2012; CHAPTER: Unit15.2.
doi:10.1002/0471250953.bi1502s39. NIH Public Access.

GOAL: Identify potentially relevant SNPs from a genomic sequencing project using Galaxy
This practical focuses on some of the tools available on the public Galaxy server
that are useful for exploring possible associations between human genetic
variants and phenotypes. This example illustrates several methods for examining
a single full-coverage genome to look for single-nucleotide polymorphisms (SNPs)
that are either 'known to be associated with disease', or 'suspected to have an
impact'. It makes use of a) public genomic data, b) tools designed specifically for
working with variants, and also c) some general tools for text manipulation and
operations on genomic coordinates.
For this example we will use an artificial dataset consisting of the SNP calls from
the "Complete Genomics" genome GS12880 [Drmanac et al., 2010]* with a few
known disease variants added. This will provide a convenient background to
search for disease SNPs, but remember, it may not necessarily represent a
realistic collection of disease SNPs for a single individual. We chose an
assortment of six SNPs from the PhenCode database [Giardine et al., 2007],
representing different genes and different parts of the gene (Table 1 of paper).
There are two coding SNPs (heterozygous) and four non-coding (one heterozygous
and three homozygous). The four non-coding SNPs are located in a promoter
region (1), a UTR (1), and in introns (2).
For a tutorial on Galaxy: https://galaxyproject.org/admin/get-galaxy/
Access galaxy: https://usegalaxy.org/ or in Europe https://usegalaxy.eu/

* Drmanac et al. (2010) Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA
Nanoarrays. Science. 327:78–81.
Our starting dataset is in "masterVar" file-format, used by the company
"Complete Genomics". Access the Galaxy portal (usegalaxy.org or usegalaxy.eu)
and go to:

Tools> Load Data ->

And then -> Paste/Fetch Data (at bottom line)

Now, upload input data file test.masterVar.gz from (paste this url):
http://www.bx.psu.edu/miller_lab/docs/galaxy_phen_assoc/tutorial/test.masterVar.gz

For that: ❶ paste that address in the window; ❷ set the name to
test.masterVar; ❸ set type=tabular.gz; ❹ set species=Human Feb 2009
(GRCh37/hg19) (hg19); then click "Start".

Once finished, click Close and check the History panel. It takes 4-5 min
until the dataset turns green.

Alternatively, download test.masterVar.gz to your computer and upload it to Galaxy this way:
Tools-> Get Data-> Upload File from your computer-> Choose Local Files (keep the rest the same as above).

Now, because that format (masterVar) doesn’t work well with many of the
Galaxy tools, we will convert it to the "pgSnp" format (Personal Genome
SNP format), which is a specialized BED format for SNPs. For that, within
Galaxy, go to:

Tools> GENOMICS ANALYSIS> Phenotype Associations > MasterVar to pgSNP convert

(If you don't find "Phenotype Associations", simply type "Phenotype" in the Tools search
box.) The file should already be selected in the 'Complete Genomics MasterVar dataset' field.
Convert indels: No. Execute.

Check the History panel; the conversion should take ~5 min. Rename the new file to test.pgSnp
(click on the “pencil” icon by the file name in the History section, type the new name in the
Name field, and Save).
If this does not work properly, download test.pgSnp (128MB) to your computer
from https://www.ehu.eus/zaindegi/ and then upload it to Galaxy as a "local
file" (Zaindegi: passwd:MASTER2022; LABEL:GALAXY2022).

Once uploaded, after clicking on the "eye" (in History section) you will be
able to browse the contents:

Info on Personal Genome SNP format: https://genome.ucsc.edu/FAQ/FAQformat.html#format10
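To make the pgSnp layout concrete, here is a minimal Python sketch that reads one record. It assumes the seven tab-separated columns described in the UCSC format page (chrom, chromStart, chromEnd, name, alleleCount, alleleFreq, alleleScores); the example line, the `PgSnp` type and the helper names are invented for illustration and are not part of Galaxy.

```python
from typing import NamedTuple


class PgSnp(NamedTuple):
    chrom: str
    start: int     # 0-based start, as in BED
    end: int       # end position (exclusive)
    alleles: list  # observed alleles, taken from the "name" column
    freqs: list    # one frequency/count value per allele
    scores: list   # one quality score per allele


def parse_pgsnp(line: str) -> PgSnp:
    """Parse one tab-separated pgSnp record (first seven columns)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name, allele_count, freqs, scores = fields[:7]
    alleles = name.split("/")  # alleles are separated by "/"
    if len(alleles) != int(allele_count):
        raise ValueError("alleleCount does not match the number of alleles")
    return PgSnp(chrom, int(start), int(end), alleles,
                 [float(x) for x in freqs.split(",")],
                 [float(x) for x in scores.split(",")])


def is_heterozygous(snp: PgSnp) -> bool:
    """More than one allele at the position means a heterozygous call."""
    return len(snp.alleles) > 1


# Invented example record, not taken from test.pgSnp
example = "chr1\t10468\t10469\tC/G\t2\t12,8\t30,25"
snp = parse_pgsnp(example)
print(snp.chrom, snp.start, snp.alleles, is_heterozygous(snp))
```

A homozygous call would have a single allele in the fourth column and an alleleCount of 1.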
To look for dominantly inherited disease SNPs in the data set, it is helpful
first to:

A.- Remove SNPs that are present in healthy individuals.

Variability found in healthy individuals is not likely to be associated with this
type of disease. Removing it reduces the complexity of our data and
significantly simplifies our task. To do this we will pull in some files from the
Galaxy Data Libraries:

Go to Galaxy->Top bar > Shared Data > Data Libraries

Look for Putative SNP Phenotypes (it may be on page 2) and click on it.

Then in the new window, check the box adjacent to hg19, and click on hg19

Now we will import some datasets. To import datasets into your history, you
need to check the boxes corresponding to the ones you want. In these
shared libraries there are currently two sets of full-coverage SNPs from
healthy individuals. One (pgsCombined24.bed) has 24 public genomes
from a variety of sources, populations, and sequencing technologies. The
other is a group of 69 from Complete Genomics (cg69.bed). In most cases
you will want to use both of these for filtering; however since our example
input dataset is based on one of the CG genomes, we’ll skip that set here.
Check only pgsCombined24.bed (will show up in 2nd page).

Then, in the Search top bar (below top Galaxy bar), click Export To History >
as Datasets > Import. Once done, click on the little house icon in the Galaxy
Top bar to go back to initial screen.

You should see the new file in the History panel. Make sure that the test.pgSnp is
a file of type “bed” and that its reference build is Human Feb 2009
(GRCh37/hg19) (hg19) (the same as pgsCombined24.bed). To check this, click on
the “pencil” icon corresponding to test.pgSnp and check Attributes and Datatypes.
Now we are ready to remove the “benign” SNPs from our input, i.e., we
want to remove from our test.pgSnp file those SNPs that were found in the
24 genomes of healthy people:

Tools> Common Genomics Tools > Operate on Genomic Intervals > Subtract

(Subtract pgsCombined24.bed from test.pgSnp; return intervals with no overlap.) Click Execute.

Wait until the History panel area for this step becomes green

The tools in the Operate on Genomic Intervals section are examples of tools
that only work with “interval” formats, which include pgSnp. The Subtract step
takes a few minutes to complete. Then, open the results in the history panel by
clicking on the dataset name. We see that the number of SNPs is greatly reduced,
from around 3.4 million to 96 thousand (click on the file name to see the number
of records, i.e. SNPs).

Rename (click on “pencil” icon) the new file to filtered_SNPs (Save). Click on
“Eye” icon to view the content of the file.
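For readers who want to see the logic behind this step, the following Python sketch illustrates what "subtract, returning intervals with no overlap" means. It is not Galaxy's implementation: it assumes each record starts with (chrom, start, end) in BED-style half-open coordinates, compares positions only (alleles are ignored), and uses invented toy records.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end, min_overlap=1):
    """True if two half-open intervals share at least `min_overlap` bases."""
    return min(a_end, b_end) - max(a_start, b_start) >= min_overlap


def subtract(first, second, min_overlap=1):
    """Return the records of `first` that overlap nothing in `second`."""
    # Group the second dataset by chromosome so we only compare like with like
    by_chrom = defaultdict(list)
    for rec in second:
        chrom, start, end = rec[:3]
        by_chrom[chrom].append((start, end))

    kept = []
    for rec in first:
        chrom, start, end = rec[:3]
        hit = any(overlaps(start, end, s, e, min_overlap)
                  for s, e in by_chrom[chrom])
        if not hit:
            kept.append(rec)
    return kept


# Toy data: our SNPs, and SNPs seen in healthy genomes
our_snps = [("chr1", 100, 101, "A/G"),
            ("chr1", 200, 201, "C"),
            ("chr2", 100, 101, "T")]
healthy = [("chr1", 100, 101, "A")]

for rec in subtract(our_snps, healthy):
    print(rec)
```

This simple version scans every healthy interval on the same chromosome for each SNP, which is fine for a toy example but too slow for millions of records; real tools use sorted or indexed intervals.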
B.- Selecting known damaging coding SNPs (predicted),
then finding their genes and associated pathways

Once the common, neutral SNPs have been removed, we want to identify
those mutations that are likely damaging. For that we will download a list of
SNPs with their predicted effect on health:

Go to Shared Data > Data Libraries > Putative SNP Phenotypes (pg 2) > hg19 (as
before)

This time look for the dataset “polyphen_dbsnp132.txt” (pg 2) and check
the box (to learn more about this dataset, click on its name).

(Search top bar) Click on Export To History > As Datasets > Import.

Back to working screen by clicking on Analyze Data (Galaxy Top Bar) (or by
clicking on the House icon, as before). Once there, click on the eye symbol to
see the data contained in polyphen_dbsnp132.txt
Now, to get the predictions associated with our SNPs we will do a join
between the two datasets (filtered_SNPs and polyphen_dbsnp132). We
don’t have a shared identifier to join on, so instead we will join together
rows of the two datasets whenever their positions on the genome overlap.
The resulting file will have all of the information from both datasets, only for
those positions in the (query) genome for which there is information/data in
both datasets. For that, do:

Tools > Common Genomics Tools > Operate on genomic intervals > Join

In the center panel, select the filtered_SNPs file as the first dataset and the
PolyPhen-2 predictions file (polyphen_dbsnp132.txt) as the second. We are
only interested in SNPs that appear in both sets, so leave the default
settings; i.e. do an Inner Join (with a min. overlap of 1bp)

click Execute
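The idea of an inner join on genomic position can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are tuples starting with (chrom, start, end), coordinates are half-open, and the identifiers and predictions in the toy data are invented placeholders rather than values from polyphen_dbsnp132.txt.

```python
def join_on_overlap(first, second, min_overlap=1):
    """Inner join: pair every record of `first` with every record of
    `second` that lies on the same chromosome and shares at least
    `min_overlap` bases. Fields of `first` come first in each output row."""
    joined = []
    for a in first:
        for b in second:
            same_chrom = a[0] == b[0]
            shared = min(a[2], b[2]) - max(a[1], b[1])
            if same_chrom and shared >= min_overlap:
                joined.append(a + b)
    return joined


# Toy data with placeholder identifiers
snps = [("chr1", 100, 101, "A/G"),
        ("chr1", 500, 501, "C/T")]
predictions = [("chr1", 100, 101, "UNIPROT_A", "probably damaging"),
               ("chr2", 100, 101, "UNIPROT_B", "benign")]

for row in join_on_overlap(snps, predictions):
    print(row)
```

SNPs without a matching prediction (here the one at chr1:500) are dropped, which is what makes this an inner join. The nested loop is quadratic and only meant for small examples.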

Rename output file as Predictions (Save). Click on the "Eye" icon. There are
249 predictions (click on the file’s name to check this; also a few rows of
data are shown in the section below the file name, in the right panel). Use
the scrollbar to scroll over to the predictions (scroll a bit to the right).
Some SNPs are classified as benign. However, the ones we are interested
in are those classified as “damaging”. To select only these, first we need to
ensure that the file format of Predictions is "tabular": click on the pencil
icon of the file, go to Type of Data, select Tabular, and Save.

Then, go to Tools > General Text Tools > Filter and Sort > Select (lines that match an expression)

This tool selects lines from a dataset that match (or don’t match) a given
pattern. To run it, choose the dataset with the join results (Predictions file)
and type in “damaging” for the pattern. The search is case-sensitive, so type it
exactly as it appears in the dataset,

then click Execute

Rename output file to Damaging (Save). Predictions are no longer in column
11 because we requested the columns from the SNP dataset to be listed
first in the join results.
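As a rough picture of what the Select step does, the sketch below keeps the lines that match a case-sensitive pattern. It assumes the pattern is treated as a regular expression; the function name and the toy lines are made up for this example.

```python
import re


def select_lines(lines, pattern, keep_matching=True):
    """Keep lines that match `pattern` (or that do not, if
    keep_matching is False). Matching is case-sensitive."""
    regex = re.compile(pattern)
    return [ln for ln in lines if bool(regex.search(ln)) == keep_matching]


# Toy rows: only the last column matters here
rows = ["chr1\t100\t101\tprobably damaging",
        "chr1\t200\t201\tbenign",
        "chr2\t300\t301\tpossibly damaging",
        "chr3\t400\t401\tDamaging"]

for ln in select_lines(rows, "damaging"):
    print(ln)
```

Note that the row ending in "Damaging" is not selected, because the capital D does not match the lowercase pattern.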
We find 104 SNPs that are predicted to be damaging (or probably
damaging) by PolyPhen-2, including one of our two coding disease SNPs.
For small datasets like this, you can display the entire contents by clicking
on the eye icon in the history panel (scroll a bit to the right to see the
column of health effect). The columns do not line up in this view because
the data is tab-separated and the values in each column are not all of the
same length. You can see that we now have UniProt IDs for
many of the genes associated with these variants.

We can now look for pathways associated with the genes by using the CTD
tool.

The CTD tool requires HUGO/NCBI identifiers rather than UniProt IDs for its
input, so first we will download a file from the UCSC Table Browser to map
between the identifiers.
Go to UCSC Table Browser https://genome-euro.ucsc.edu/cgi-bin/hgTables

First make sure the genome and assembly are correct; we want Human
build hg19 in order to match our history datasets. We will be using Clade:
Mammal; Genome: Human; Assembly: Feb. 2009 (GRCh37/hg19); Group:
Genes and Gene Predictions; Track: UCSC Genes; Table: knownGene,
because it has the most additional information, including connections
between various identifiers. We want them for the full genome, so set Region:
genome. We also want the output format to be “selected fields from
primary and related tables”, and to have the output sent to Galaxy
(orange arrow); Output file: UCSC; Separator: tsv; File type returned: plain
text. Check “Send output to Galaxy” and then click the “Get output” button.

Now we select the fields from the Main Table that we want in the output
(name, chrom, txStart, proteinID). The main table does not have
all that we need, but the cross-reference table below (hg19.kgXref
fields) has both the UniProt ID (spDisplayID) and the official HUGO gene
symbol (geneSymbol).

After marking these boxes, click the “Done with Selections” button (just
below top table). Then click on “Send query to Galaxy”. The Table Browser
interface will then close, and the requested data will be sent to your Galaxy
history.

Rename the output file to UCSC (and Save)

Back at Galaxy, we are ready to run "join" to get the gene symbols
associated with the damaging SNPs. This time we don’t have genomic
positions in both sets, but they do have a field in common: the UniProt ID.
We will do the join by matching up the values in those columns.

Tools > General text Tools > Join, Subtract, and Group > Join Two Datasets

The parameters for this tool are the two datasets (Damaging and UCSC, both
of which must be in tabular format), the columns to match up, and what to do
with unmatched rows. By opening the two datasets in the history you can find
the column numbers containing the UniProt IDs (check which column number
holds the protein ID in UCSC, and which one holds it in Damaging).
Make the selections, putting the SNP dataset (Damaging) first so its fields
will be first in the result rows. We are not interested in any of the rows that
don’t join so leave the rest of the options set to “No”.

Then click Execute

Rename to ForRELATIONSHIPS (Save)
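The join on a shared identifier can be pictured with the following Python sketch. It is an illustration, not the Galaxy tool: rows are lists of strings, column numbers are 1-based as in Galaxy, unmatched rows are dropped, and the identifiers and gene symbols in the toy data are placeholders.

```python
from collections import defaultdict


def join_on_column(first, second, col1, col2):
    """Inner join of two tables on the value in column `col1` of `first`
    and column `col2` of `second` (1-based column numbers).
    Fields of `first` come first in each output row."""
    # Index the second table by its key column
    index = defaultdict(list)
    for row in second:
        index[row[col2 - 1]].append(row)

    joined = []
    for row in first:
        for match in index.get(row[col1 - 1], []):
            joined.append(row + match)
    return joined


# Toy data with placeholder IDs: (position, UniProt ID, prediction)
damaging = [["chr1:100", "UNIPROT_A", "probably damaging"],
            ["chr7:900", "UNIPROT_C", "possibly damaging"]]
# (UCSC name, UniProt ID, gene symbol)
ucsc = [["uc001", "UNIPROT_A", "GENE1"],
        ["uc002", "UNIPROT_B", "GENE2"]]

for row in join_on_column(damaging, ucsc, col1=2, col2=2):
    print("\t".join(row))
```

Because several UCSC transcripts can share one UniProt ID, a single SNP row may appear more than once in the output; this is expected for a join and is one reason to remove duplicate gene symbols later.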

Now that we have a list of gene symbols for the putatively damaging SNPs,
we can look for relationships between these genes. There are several tools
useful for this, including the CTD tool to extract pathway information from
the Comparative Toxicogenomics Database (CTD) [Davis et al., 2009; see also
https://academic.oup.com/nar/article/49/D1/D1138/5929242]. Other options are g:Profiler
[Reimand et al., 2007] and DAVID [Huang et al., 2009]. (Not all of them
always work properly from Galaxy.)
WARNING!!! If CTD is not available from usegalaxy.org, try usegalaxy.eu instead. For
that, save the ForRELATIONSHIPS file to your computer: in the History, click on the file
name (ForRELATIONSHIPS), then click on the “diskette” icon. Open usegalaxy.eu and import
the ForRELATIONSHIPS file. If this does not work either, just go to the CTD page and
upload your gene list from your computer.

Tools> Genomics Analysis > Phenotype Associations > CTD

If you switch to usegalaxy.eu, make sure that the text file is tab-separated
(open it in Excel and save it again as tab-delimited text, "Texto delimitado
por tabulaciones" in a Spanish-language Excel). Then upload this file to the
usegalaxy.eu History and run the same tool:

Tools> Genomics Analysis > Phenotype Associations > CTD

Then click Execute. (Warning: the identifier column number may differ from the
one set in the figure above. Check in your file which column contains the gene names.)
Alternatively, go to http://ctdbase.org/ and then Analyze > Batch Query.

Enter your data there:

Select input type Genes, Upload File…, select identifier column 30 (check
that this is correct); choose data to download: Pathways (you can also play
around to explore other options).
Click Download.

Alternatively, use the CTDquerier package from Bioconductor.

To run DAVID:

Tools> Phenotype Associations > DAVID

Warning: there may be a limit of 400 entries; check. If your number of loci is
higher than that, try getting a list of non-redundant loci by using Text
Manipulation-> Cut and then Sort (alphabetical, unique: yes), save the file
locally, and submit it to the DAVID web page (https://david.ncifcrf.gov/) or to
an alternative (http://www.webgestalt.org/ or https://maayanlab.cloud/Enrichr/).

Then click Execute and click on the link. Warning: in the new version of Galaxy
the web address used for DAVID might be incorrect. If so, click on the link
and, in the browser address bar, replace http://david.abcc.ncifcrf.gov/ with
http://david.ncifcrf.gov/
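The "cut, then sort unique" preparation of a non-redundant gene list can also be done outside Galaxy. The Python sketch below is one way to do it, under these assumptions: rows are lists of strings, the column number is 1-based, the limit of 400 is taken from the warning above and should be checked against the tool, and the gene symbols are placeholders.

```python
MAX_ENTRIES = 400  # assumed limit, taken from the warning above


def unique_symbols(rows, column):
    """Cut one column (1-based) and return its distinct, non-empty
    values in alphabetical order."""
    symbols = {row[column - 1].strip() for row in rows if len(row) >= column}
    symbols.discard("")  # ignore rows without a gene symbol
    return sorted(symbols)


# Toy rows with placeholder gene symbols in column 3
rows = [["chr1:100", "UNIPROT_A", "GENE2"],
        ["chr1:150", "UNIPROT_A", "GENE2"],
        ["chr5:700", "UNIPROT_D", "GENE1"],
        ["chr9:300", "UNIPROT_E", ""]]

genes = unique_symbols(rows, column=3)
print(len(genes), "distinct genes:", genes)
if len(genes) > MAX_ENTRIES:
    print("List is longer than the assumed limit; submit it on the web page instead.")
```

In practice the rows would come from the downloaded ForRELATIONSHIPS file, split on tabs, and the resulting list would be written out one symbol per line.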
C.- Other annotation tools (in addition to PolyPhen)

A popular resource is VEP: https://www.ensembl.org/Tools/VEP

for hg19 use http://grch37.ensembl.org/Homo_sapiens/Tools/VEP

For an alternative, use wANNOVAR (http://wannovar.wglab.org/index.php),
which accepts the masterVar format.

Download test.masterVar.gz from Galaxy to your computer, uncompress it, and
upload it to wANNOVAR. Or open it in Galaxy, select and copy everything, and
paste it into wANNOVAR. The run will be queued.

Within Galaxy, it is possible to call ANNOVAR using a VCF file as input: Tools>
Variant Calling> Annovar Annotate vcf (masterVar can be converted to VCF
locally using specific software available online, which is not provided here).
D.- Functional "predictions" of coding SNPs with
MutationAssessor

PolyPhen-2 is one way to annotate functional coding changes with Galaxy.
However, the PolyPhen-2 library dataset is pre-computed and only works for
known SNPs in the dbSNP database. Other tools can directly
make predictions for all possible nucleotide substitutions in the human
exome. SIFT was a resource of this kind in Galaxy, but the links to SIFT
(PROVEAN) are now deprecated. However, we can resort to MutationAssessor:

Go to http://mutationassessor.org/r3/
First read about the input format, then try to get it running with your file.
A.- Find regulatory regions that have already been
predicted computationally

Coding sequence is functionally important, but other regions of the
genome, like regulatory sequences, are also important for gene function.
To identify regulatory SNPs in our panel of SNPs, first we will import
another dataset from the Putative SNP Phenotypes library:
“PRPs.liftedhg19.bed”, which contains computationally predicted regulatory
regions (PRPs).

GALAXY TOP BAR > Shared Data > Data Libraries > Putative SNP
Phenotypes (page 2) > hg19 > PRPs.liftedhg19.bed (page 2; type is interval) >
check box > To History

Then, after importing the dataset, click on Analyze Data to get back to
your history and tools page.

This file contains the intersection of the PreMods [Ferretti et al. 2007]
and the regions with high Regulatory Potential scores [Taylor et al. 2006].
Both of these sets were computed on earlier genome assemblies and
then lifted (i.e., remapped) to the current assembly (hg19).
Intersecting with the PRP

Exclusion of the SNPs from the 24 healthy individuals is useful for finding
both coding and non-coding disease variants, so we will again start with
the dataset we prepared before (filtered_SNPs).
Tools > Common Genomics Tools > Operate on Genomic Intervals > Intersect

Return Overlapping Intervals of filtered_SNPs that intersect
PRPs.liftedhg19.bed for at least 1bp > Execute.

An intersection between our input dataset (filtered_SNPs) and the PRPs will
find the SNPs within the predicted regulatory regions. The output will have
the columns from the first dataset, so be sure to specify the SNPs as the
first dataset and the PRPs as the second one.

SNPs in PRPs

There are 837 SNPs in the predicted regulatory regions. Rename the
output file as regSNPs.
B.- DNase hypersensitive sites (HSSs) from ENCODE data

A DNase I hypersensitive site is a region with a high rate of DNase I cleavage.
DHSs are considered hallmarks of regulatory DNA, based on their location at
transcription start sites, and their overlap with ChIP peaks and known
regulatory elements such as enhancers and silencers. To obtain data
(DNase hypersensitive sites, HSSs) from the ENCODE project, go to:

Tools > Get Data > UCSC Main Table Browser

Within the UCSC Table Browser, select:

clade: Mammal; genome: Human; assembly: Feb. 2009 (GRCh37/hg19);
group: Regulation; track: DNase Clusters; table:
“wgEncodeRegDnaseClusteredV3”; Region: genome; Output Format: BED;
check that Send output is Galaxy; Output File: BLANK; File type returned:
plain text; click the “get output” button.

For BED output, an intermediate page then appears with options for how
to return the intervals. Leave default options. Click Send query to Galaxy.
Rename output UCSC_ENCODE

Now, intersect these HSSs with our data (filtered_SNPs).

Rather than continuing from the PRP results, we will go back and start
yet again with our input dataset filtered_SNPs (if no results are obtained,
use test.pgSnp instead). Once again we are using:

Tools > Common Genomics Tools > Operate on Genomic Intervals > Intersect

to find the SNPs in our data within the DNase hypersensitive sites
(UCSC_ENCODE).

Return Overlapping Intervals of filtered_SNPs that intersect UCSC_ENCODE
for at least 1bp. Execute.

Note that there are a lot more regions in this set than there were in
the PRPs (rename the output to DNAse_SNPs).
C.- PhyloP

Next we will look at conservation between species (which is indicative
of functional relevance) using the PhyloP tool, which looks up a
pre-computed phyloP score [Siepel et al., 2006] for each SNP position.
Normally, this would be done with

Tools> Genomics Analysis > Phenotype Association > phyloP. Select the
same input dataset (filtered_SNPs) that we’ve been using, and then click
Execute.

If it doesn't seem to work properly, a workaround is to reduce
filtered_SNPs to just the first four columns:

Tools> General Text Tools > Text Manipulation > Advanced Cut
Execute; change the name to filtered_SNPs_cut (type must be interval)

And then, Tools > Get Data > UCSC Main

Once within the UCSC Table Browser, select:

clade: Mammal; genome: Human; assembly: Feb. 2009 (GRCh37/hg19);
group: Comparative Genomics; track: Conservation; table: “100 Vert. Cons
(phyloP100wayAll)”; Output Format: data points; check that Send output
is Galaxy; Output File: BLANK; File type returned: plain text.

When selecting Region: define regions, a new window opens up.
Open the file filtered_SNPs_cut, select the first 1000 lines, and copy them.
Back in UCSC, paste them in the Paste regions window and Submit
(see figure below).

This will send you back to UCSC main window

Click the “get output” button, and then, Click on "Send Query to Galaxy".
Rename resulting file to "UCSC_phyloP".
Distribution of phyloP scores
In the history panel we can see what column the scores are in, and a few of
the values. It is helpful to know the distribution of those scores. For this we
go to
Tools> Statistics and Visualization > Graph/Display Data > Histogram of
numeric col.

The PhyloP tool appends a column to its input dataset, and as expected,
we will see that the last column is new and contains numbers. Select the
dataset with the phyloP scores and indicate the column that holds the
score. Entering a large number of bars gives a finer resolution in the
output. Adding a descriptive title and label are not required, but are helpful
when coming back to this history later.

Click Execute to generate the graph. Rename the output file to Histogram.

Download the file (click on the file name, then on the diskette icon) and open
the PDF of the histogram with a PDF reader.
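To see what the histogram summarizes, the counts per bar can be reproduced with a short Python sketch. It assumes equal-width bins spanning the minimum to the maximum score, which may differ from the binning the Galaxy tool uses; the scores below are invented, not real phyloP values.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins between min and max.
    Returns (edges, counts); the maximum value falls in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Invented scores for illustration
scores = [-1.0, 0.0, 0.25, 0.5, 0.75, 1.0]
edges, counts = histogram(scores, bins=4)

# Simple text rendering: one row per bar
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * n}")
```

Increasing `bins` gives narrower bars, which is the finer resolution mentioned above.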
Any positive phyloP score indicates conservation.
But there are many SNPs with scores between 0 and 1. To select the most
highly conserved regions, a cutoff of 0.5 seems reasonable.

Filtering the SNPs based on phyloP score

Go to Tools> General Text Tools > Filter and Sort > Filter. Select the
dataset with the phyloP scores, and enter the condition “c2>=.5”, which
means we want the rows where the score (in column 2) is greater than or
equal to 0.5.
Rename output file to phyloP_filtered.
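The condition “c2>=.5” can be read as the following Python sketch. It is only an illustration of the rule: rows are lists of strings, the score is assumed to be in column 2 as stated above, rows whose score is not a number (for example "NA") are skipped and counted, and the toy values are invented.

```python
def filter_by_score(rows, column=2, cutoff=0.5):
    """Keep rows whose value in `column` (1-based) is >= cutoff.
    Rows with a missing or non-numeric score are skipped and counted."""
    kept, skipped = [], 0
    for row in rows:
        try:
            score = float(row[column - 1])
        except (ValueError, IndexError):
            skipped += 1  # e.g. "NA": no score available
            continue
        if score >= cutoff:
            kept.append(row)
    return kept, skipped


# Toy rows: (position, score)
rows = [["101", "0.75"],
        ["102", "NA"],
        ["103", "-1.2"],
        ["104", "0.5"]]

kept, skipped = filter_by_score(rows)
print(kept)
print("skipped lines:", skipped)
```

A score exactly equal to the cutoff is kept, matching the "greater than or equal to" condition.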

Highly conserved SNPs


This approach has the highest number of results yet, and finds two of our
disease SNPs, including one that hasn’t been found up to this point.
Obsolete: The message about skipped lines (second blue arrow) is not a
problem; “NA” in the PhyloP tool’s output means that the score was not
available, and we want to skip those SNPs anyway.
