Phenotype Association Tools in Galaxy
* Drmanac et al. (2010) Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA
Nanoarrays. Science 327:78-81.
Our starting dataset is in the "masterVar" file format, used by the company
"Complete Genomics". Access the Galaxy portal (usegalaxy.org or usegalaxy.eu)
and go to:
Now upload the input data file test.masterVar.gz by pasting this URL:
http://www.bx.psu.edu/miller_lab/docs/galaxy_phen_assoc/tutorial/test.masterVar.gz
To do so: ❶ paste that address in the window; ❷ set the name to
test.masterVar; ❸ set type=tabular.gz; ❹ set species=Human Feb 2009
(GRCh37/hg19) (hg19); then click "Start".
Once finished, click Close and check the History panel. It takes 4-5 minutes
until the dataset turns green.
Alternatively, download test.masterVar.gz to your computer and upload it to
Galaxy this way:
Tools > Get Data > Upload File from your computer > Choose Local Files (keep the rest the same as above).
Now, because that format (masterVar) doesn’t work well with many of the
Galaxy tools, we will convert it to the "pgSnp" format (Personal Genome
SNP format), which is a specialized BED format for SNPs. For that, within
Galaxy, go to:
Check the History panel. This should take about 5 minutes. Rename the new file to test.pgSnp (click on the “pencil” icon
by the file name in the History section, type the new name in the Name box, and click Save).
If this does not work properly, you can instead download test.pgSnp
(128MB) to your computer from https://www.ehu.eus/zaindegi/ and then upload
it to Galaxy as a "local file" (Zaindegi: passwd: MASTER2022; LABEL: GALAXY2022).
Once uploaded, after clicking on the "eye" (in History section) you will be
able to browse the contents:
Go to Shared Data > Data Libraries, look for Putative SNP Phenotypes (it may
be on page 2) and click on it.
Then, in the new window, check the box adjacent to hg19, and click on hg19.
Now we will import some datasets. To import datasets into your history, you
need to check the boxes corresponding to the ones you want. In these
shared libraries there are currently two sets of full-coverage SNPs from
healthy individuals. One (pgsCombined24.bed) has 24 public genomes
from a variety of sources, populations, and sequencing technologies. The
other is a group of 69 from Complete Genomics (cg69.bed). In most cases
you will want to use both of these for filtering; however since our example
input dataset is based on one of the CG genomes, we’ll skip that set here.
Check only pgsCombined24.bed (it shows up on the 2nd page).
Then, in the Search top bar (below top Galaxy bar), click Export To History >
as Datasets > Import. Once done, click on the little house icon in the Galaxy
Top bar to go back to initial screen.
You should see the new file in the History panel. Make sure that the test.pgSnp is
a file of type “bed” and that its reference build is Human Feb 2009
(GRCh37/hg19) (hg19) (the same as pgsCombined24.bed). To check this, click on
the “pencil” icon corresponding to test.pgSnp and check Attributes and Datatypes.
Now we are ready to remove the “benign” SNPs from our input, i.e., we
want to remove from our test.pgSnp file those SNPs that were found in the
24 genomes of healthy people:
Wait until the History panel entry for this step turns green.
The tools in the Operate on Genomic Intervals section are examples of tools
that only work with “interval” formats, which include pgSnp. The step takes a
few minutes to complete. Then open the results in the History panel by clicking
on the dataset name. We see that the number of SNPs is greatly reduced,
from around 3.4 million to 96 thousand (click on the file name to see the
number of records, i.e., SNPs).
Rename the new file to filtered_SNPs (click on the “pencil” icon, then Save). Click on the
“Eye” icon to view the content of the file.
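The filtering step above can be pictured with a minimal Python sketch. It is not Galaxy's implementation: it assumes each SNP is a (chromosome, start, end) tuple with half-open coordinates, compares positions only (alleles are ignored), and uses made-up coordinates. The names `overlaps` and `subtract_intervals` are hypothetical.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def subtract_intervals(query, reference):
    """Keep the query rows that overlap no reference row on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in reference:
        by_chrom[chrom].append((start, end))
    kept = []
    for chrom, start, end in query:
        if not any(overlaps(start, end, s, e) for s, e in by_chrom[chrom]):
            kept.append((chrom, start, end))
    return kept


# Made-up coordinates, for illustration only.
sample_snps = [("chr1", 100, 101), ("chr1", 250, 251), ("chr2", 40, 41)]
healthy_snps = [("chr1", 100, 101), ("chr2", 90, 91)]
filtered = subtract_intervals(sample_snps, healthy_snps)
print(filtered)
```

The SNP at chr1:100 is dropped because the "healthy" set contains the same position; the other two survive, which mirrors how the real dataset shrinks after filtering.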
B.- Selecting known damaging coding SNPs (predicted),
then finding their genes and associated pathways
Once the common, neutral SNPs have been removed, we want to identify
those mutations that are likely damaging. For that we will download a list of
SNPs with their predicted effect on health:
Go to Shared Data > Data Libraries > Putative SNP Phenotypes (pg 2) > hg19 (as
before)
(Search top bar) Click on Export To History > As Datasets > Import.
Back to working screen by clicking on Analyze Data (Galaxy Top Bar) (or by
clicking on the House icon, as before). Once there, click on the eye symbol to
see the data contained in polyphen_dbsnp132.txt
Now, to get the predictions associated with our SNPs we will do a join
between the two datasets (filtered_SNPs and polyphen_dbsnp132). We
don’t have a shared identifier to join on, so instead we will join together
rows of the two datasets whenever their positions on the genome overlap.
The resulting file will have all of the information from both datasets, only for
those positions in the (query) genome for which there is information/data in
both datasets. For that, do:
Tools > Common Genomics Tools > Operate on genomic intervals > Join
In the center panel, select the filtered_SNPs file as the first dataset and the
PolyPhen-2 predictions file (polyphen_dbsnp132.txt) as the second. We are
only interested in SNPs that appear in both sets, so leave the default
settings; i.e., do an Inner Join (with a minimum overlap of 1 bp).
Click Execute.
Rename output file as Predictions (Save). Click on the "Eye" icon. There are
249 predictions (click on the file’s name to check this; also a few rows of
data are shown in the section below the file name, in the right panel). Use
the scrollbar to scroll over to the predictions (scroll a bit to the right).
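As a rough illustration of what a positional inner join does, here is a minimal Python sketch. It is an assumption-laden stand-in for the Galaxy Join tool, not its code: rows are tuples that start with (chromosome, start, end), coordinates are half-open, and the data values (including the protein IDs and prediction labels) are invented for the example.

```python
def join_on_overlap(first, second, min_overlap=1):
    """Inner join: pair rows whose intervals share at least min_overlap bases."""
    joined = []
    for chrom_a, start_a, end_a, *extra_a in first:
        for chrom_b, start_b, end_b, *extra_b in second:
            if chrom_a != chrom_b:
                continue
            # Number of bases the two intervals have in common (<= 0 if none).
            shared = min(end_a, end_b) - max(start_a, start_b)
            if shared >= min_overlap:
                joined.append((chrom_a, start_a, end_a, *extra_a,
                               chrom_b, start_b, end_b, *extra_b))
    return joined


# Invented rows: SNPs carry an allele field, predictions carry an ID and a label.
snps = [("chr1", 100, 101, "A/G"), ("chr1", 500, 501, "C/T")]
predictions = [("chr1", 100, 101, "P12345", "probably damaging"),
               ("chr3", 100, 101, "Q99999", "benign")]
rows = join_on_overlap(snps, predictions)
print(rows)
```

Only the SNP at chr1:100 has a prediction at the same position, so the result has a single row holding the fields of both inputs. Rows without a partner are dropped, which is what "inner" means here.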
Some SNPs are classified as benign. However, the ones we are interested
in are those classified as “damaging”. To select only these, first we need to
ensure that the file format of Predictions is "tabular": click on the pencil
icon of the file, go to Type of Data, select Tabular, and Save.
Then go to Tools > General Text Tools > Filter and Sort > Select (lines that match an expression).
This tool selects lines from a dataset that match (or don’t match) a given
pattern. To run it, choose the dataset with the join results (the Predictions
file) and type in “damaging” for the pattern. The search is case-sensitive, so
type it exactly as it appears in the dataset. Rename the output file to
Damaging; it is referred to by that name in the steps below.
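The pattern selection can be sketched in a few lines of Python. This is only an illustration of case-sensitive matching, assuming each dataset row is one tab-separated string; the function name `select_lines` and the example rows are hypothetical.

```python
import re


def select_lines(lines, pattern, keep_matching=True):
    """Return the lines that match (or, if keep_matching is False, do not match) pattern."""
    regex = re.compile(pattern)  # Python regexes are case-sensitive by default
    return [line for line in lines if bool(regex.search(line)) == keep_matching]


# Invented rows for illustration.
rows = [
    "chr1\t100\t101\tP12345\tprobably damaging",
    "chr1\t200\t201\tP67890\tbenign",
    "chr2\t300\t301\tQ11111\tDamaging",
]
damaging = select_lines(rows, "damaging")
print(damaging)
```

Note that the third row is not selected: "Damaging" with a capital D does not match the lowercase pattern, which is why the pattern must be typed exactly as it appears in the dataset.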
We can now look for pathways associated with the genes by using the CTD
tool.
The CTD tool requires HUGO/NCBI identifiers rather than UniProt IDs for its
input, so first we will download a file from the UCSC Table Browser to map
between the identifiers. The columns do not line up in this view because
the data is tab-separated and the values in each column are not all of the
same length.
Go to the UCSC Table Browser: https://genome-euro.ucsc.edu/cgi-bin/hgTables
First make sure the genome and assembly are correct; we want Human
build hg19 in order to match our history datasets. Use these settings:
Clade: Mammal; Genome: Human; Assembly: Feb. 2009 (GRCh37/hg19); Group:
Genes and Gene Predictions; Track: UCSC Genes; Table: knownGene; Region:
genome.
We use knownGene because it has the most additional information, including
connections between various identifiers, and we want them for the full
genome. Set the output format to “selected fields from primary and related
tables”, and send the output to Galaxy (orange arrow); Output file: UCSC;
Separator: tsv; File type returned: Plain text. Check “Send output to
Galaxy” and then click the “Get output” button.
Now we select the fields from the main table that we want in the output
(name, chrom, txStart, proteinID). The main table does not have everything
we need, however; the cross-reference table below it (hg19.kgXref
fields) has both the UniProt ID (spDisplayID) and the official HUGO gene
symbol (geneSymbol), so select those two fields as well.
After marking these boxes, click the “Done with Selections” button (just
below top table). Then click on “Send query to Galaxy”. The Table Browser
interface will then close, and the requested data will be sent to your Galaxy
history.
Back at Galaxy, we are ready to run "join" to get the gene symbols
associated with the damaging SNPs. This time we don’t have genomic
positions in both sets, but they do have a field in common: the UniProt ID.
We will do the join by matching up the values in those columns.
Tools > General Text Tools > Join, Subtract, and Group > Join Two Datasets
The parameters for this tool are the two datasets (Damaging and UCSC, which
must both be in tabular format), the columns to match up, and what to do with
unmatched rows. By opening the two datasets in the history you can find
the column numbers containing the UniProt IDs (check which column number
holds the protein ID in UCSC, and which one holds it in Damaging).
Make the selections, putting the SNP dataset (Damaging) first so its fields
will be first in the result rows. We are not interested in any of the rows that
don’t join so leave the rest of the options set to “No”.
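A join on a shared identifier column can be sketched as follows. This is a minimal illustration, not the Galaxy tool itself: rows are lists of strings, column numbers are 1-based as in Galaxy, unmatched rows are dropped, and every value shown (transcript names, protein IDs, gene symbols) is invented.

```python
from collections import defaultdict


def join_on_column(first, second, col_first, col_second):
    """Inner join of two tables on 1-based column numbers; first table's fields come first."""
    index = defaultdict(list)
    for row in second:
        index[row[col_second - 1]].append(row)
    joined = []
    for row in first:
        for match in index.get(row[col_first - 1], []):
            joined.append(row + match)
    return joined


# Invented rows: the UniProt ID is column 4 in "damaging" and column 5 in "ucsc".
damaging = [["chr1", "100", "101", "P12345", "probably damaging"],
            ["chr2", "300", "301", "Q00000", "possibly damaging"]]
ucsc = [["uc001aaa.1", "chr1", "90", "P12345", "P12345", "GENE1"],
        ["uc002bbb.1", "chr5", "10", "P54321", "P54321", "GENE2"]]
result = join_on_column(damaging, ucsc, col_first=4, col_second=5)
print(result)
```

Putting the SNP table first keeps its fields at the start of each output row, and the gene symbol ends up in the last column. In a real file the column numbers will differ, so check them in the history before running the join.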
Now that we have a list of gene symbols for the putatively damaging SNPs,
we can look for relationships between these genes. There are several tools
useful for this, including the CTD tool to extract pathway information from
the Comparative Toxicogenomics Database CTD [Davis et al., 2009; see also
Make sure that the text file is tab-separated (open it in Excel and save it
again as "Text (Tab delimited)"; in Spanish-language Excel this option is
"Texto delimitado por tabulaciones"). Open usegalaxy.eu, upload this
file to History and:
Then click Execute. (Warning: the identifiers column number may differ from
the one set in the figure above. Check in your file which column contains the
gene names.)
Alternatively, go to: http://ctdbase.org/ Then Analyze > Batch Query
Select input type Genes, Upload File…, select identifier column 30 (check
this is correct); choose data to download: Pathways (you can also play
around to explore other options)
Click Download
To run DAVID:
Within Galaxy, it is possible to run ANNOVAR with a VCF file as input: Tools >
Variant Calling > Annovar Annotate vcf. (A masterVar file can be converted to
VCF locally using specific software available online, which is not provided here.)
D.- Functional "predictions" of coding SNPs with
MutationAssessor
Go to http://mutationassessor.org/r3/
First read about the input format, then try to get it running with your file.
A.- Find regulatory regions that have already been
predicted computationally
GALAXY TOP BAR > Shared Data > Data Libraries > Putative SNP
Phenotypes (page 2) > hg19 > PRPs.liftedhg19.bed (page 2; type is interval) >
check box > To History
Then, after importing the dataset, click on Analyze Data to get back to
your history and tools page.
This file contains the intersection of the PreMods [Ferretti et al. 2007]
and the regions with high Regulatory Potential scores [Taylor et al. 2006].
Both of these sets were computed on earlier genome assemblies and
then lifted (i.e., remapped) to the current assembly (hg19).
Intersecting with the PRP
Exclusion of the SNPs from the 24 healthy individuals is useful for finding
both coding and non-coding disease variants, so we will again start with
the dataset we prepared before (filtered_SNPs).
Tools > Common Genomics Tools > Operate on Genomic Intervals > Intersect
For BED output, an intermediate page then appears with options for how
to return the intervals. Leave default options. Click Send query to Galaxy.
Rename the output to UCSC_ENCODE.
Tools > Common Genomics Tools > Operate on Genomic Intervals > Intersect
Note that there are many more regions in this set than there were in
the PRPs (rename the output to DNAse_SNPs).
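The Intersect steps in this part keep only the filtered SNPs that fall inside a set of regions. A minimal Python sketch of that idea follows; it assumes (chromosome, start, end) tuples with half-open coordinates and that whole SNP rows are returned when they overlap any region. The function name and the coordinates are made up, and the options chosen in the actual tool may differ.

```python
def intersect(first, second, min_overlap=1):
    """Keep rows of `first` that overlap any row of `second` by >= min_overlap bases."""
    kept = []
    for chrom, start, end in first:
        for r_chrom, r_start, r_end in second:
            shared = min(end, r_end) - max(start, r_start)
            if chrom == r_chrom and shared >= min_overlap:
                kept.append((chrom, start, end))
                break  # report each SNP once, even if several regions cover it
    return kept


# Made-up SNPs and regions, for illustration only.
snps = [("chr1", 100, 101), ("chr1", 900, 901), ("chr2", 40, 41)]
regions = [("chr1", 50, 150), ("chr1", 90, 120), ("chr2", 500, 600)]
regulatory_snps = intersect(snps, regions)
print(regulatory_snps)
```

Two regions cover the SNP at chr1:100, but it is reported only once; the other two SNPs lie outside every region and are dropped.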
C.- PhyloP
Tools > Genomics Analysis > Phenotype Association > phyloP. Select the
same input dataset (filtered_SNPs) that we’ve been using, and then click
Execute.
Tools > General Text Tools > Text Manipulation > Advanced Cut
Execute; change the name to filtered_SNPs_cut (type must be interval).
Click the “Get output” button, and then click on "Send query to Galaxy".
Rename the resulting file to "UCSC_phyloP".
Distribution of phyloP scores
In the history panel we can see what column the scores are in, and a few of
the values. It is helpful to know the distribution of those scores. For this we
go to
Tools> Statistics and Visualization > Graph/Display Data > Histogram of
numeric col.
The PhyloP tool appends a column to its input dataset, and as expected,
we will see that the last column is new and contains numbers. Select the
dataset with the phyloP scores and indicate the column that holds the
score. Entering a large number of bars gives a finer resolution in the
output. Adding a descriptive title and label is not required, but is helpful
when coming back to this history later.
Download the file (click on the file name, then the diskette icon) and open
the PDF of the histogram with a PDF reader.
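To see how the number of bars affects the picture, the binning behind a histogram can be sketched in plain Python. This is a simplified stand-in for the Galaxy Histogram tool (which may choose its breaks differently): it assumes equal-width bins between the smallest and largest score, and the scores are invented.

```python
def histogram_counts(values, bars):
    """Count values in `bars` equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bars or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bars
    for value in values:
        # The largest value would land one past the end, so clamp it into the last bin.
        index = min(int((value - low) / width), bars - 1)
        counts[index] += 1
    return low, width, counts


# Invented phyloP-like scores, for illustration only.
scores = [-1.0, -0.5, 0.0, 0.2, 0.4, 0.6, 1.5, 3.0]
low, width, counts = histogram_counts(scores, bars=4)
for i, count in enumerate(counts):
    print(f"{low + i * width:6.2f} to {low + (i + 1) * width:6.2f} | {'#' * count}")
```

With only four bars, all scores between 0 and 1 fall into one bin; asking for more bars splits that bin and gives the finer resolution mentioned above.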
Any positive phyloP score indicates conservation.
But there are many SNPs with scores between 0 and 1. To select the most
highly conserved regions, a cutoff of 0.5 seems reasonable.
Go to Tools > General Text Tools > Filter and Sort > Filter. Select the
dataset with the phyloP scores, and enter the condition “c2>=.5”, which
means we want the rows where the score (in column 2) is greater than or
equal to 0.5. If the score is in a different column of your file, adjust the
column number in the condition accordingly.
Rename output file to phyloP_filtered.
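The condition “c2>=.5” amounts to a numeric comparison on one column. A minimal Python sketch, assuming rows are lists of strings, a 1-based column number as in Galaxy, and invented identifiers and scores:

```python
def filter_rows(rows, column, cutoff):
    """Keep rows whose value in the 1-based `column` is a number >= cutoff."""
    kept = []
    for row in rows:
        try:
            score = float(row[column - 1])
        except (ValueError, IndexError):
            continue  # skip header lines or rows without a numeric score
        if score >= cutoff:
            kept.append(row)
    return kept


# Invented rows: an identifier in column 1 and a phyloP-like score in column 2.
rows = [["chr1:100", "0.73"], ["chr1:250", "0.12"],
        ["chr2:40", "0.50"], ["chr2:90", "-1.40"]]
conserved = filter_rows(rows, column=2, cutoff=0.5)
print(conserved)
```

The row with a score of exactly 0.50 is kept because the comparison is "greater than or equal to".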