Report
2017/5/2
Results
1 Data Production
2 Summary Statistics of Alignment
3 Data Quality Control
4 SNP Results
5 InDel Results
6 CNV Results
7 SV Results
Methods
1 Whole genome sequencing
2 Bioinformatics analysis overview
3 Data cleanup
4 Mapping and marking duplicates
5 Local realignment around InDels
6 Base Quality Score Recalibration (BQSR)
7 SNP and InDel calling
8 Variant filtering
9 Copy number variant calling
10 Structural variant calling
11 Variant annotation and prediction
12 Web Resources
Help
1 Guide to visualization
2 Guide to selecting variants for validation
3 Guide to finding candidate variants
4 Format of annotation files
5 Decompress the file
FAQs
References
1 Data Production
To discover genetic variations in this project, we performed whole genome sequencing of 2 DNA
samples, producing an average of 94,725.07 Mb of raw bases per sample. After removing low-quality
reads, we obtained an average of 943,318,302 clean reads (94,331.83 Mb) per sample. The clean
reads of each sample had high Q20 and Q30 values, indicating high sequencing quality. The average
GC content was 41.47%. Data production for all samples is summarized in Table 1, and the base
quality scores of the clean reads of each sample are plotted in Figure 1.
NA12878-WGSPE100-1  1,002,366,106  100,236.61  998,868,316  99,886.83  99.65  95.18  84.99  41.69
NA12878-WGSPE100-2  892,135,200  89,213.52  887,768,288  88,776.83  99.51  94.71  83.94  41.25
[Figure 1. Base quality scores of clean reads. Panels: NA12878-WGSPE100-1, NA12878-WGSPE100-2.]
The X-axis is the position along the reads and the Y-axis is the quality value. Each dot represents
the quality score at the corresponding position along the reads.
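As a rough illustration of what the Q20 and Q30 values measure, the sketch below computes the fraction of bases whose Phred quality score is at least a given threshold. It assumes FASTQ-style quality strings with a Phred+33 offset, which is an assumption about the input encoding and not something stated in this report; the function name and the toy input are hypothetical.

```python
def q_fraction(quality_strings, threshold, offset=33):
    """Return the fraction of bases with Phred score >= threshold.

    Assumes each quality string encodes one score per character as
    chr(score + offset); offset=33 is an assumption, not taken from the report.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Toy input: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10 under Phred+33.
toy_quals = ["IIII5+", "I5"]
q20 = q_fraction(toy_quals, 20)
q30 = q_fraction(toy_quals, 30)
```

Q20 is therefore always at least as large as Q30 for the same set of reads.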
NA12878-WGSPE100-1  998,868,316  99,886.83  99.47  94.35  1.77  0.52  32.94  99.10  98.62
NA12878-WGSPE100-2  887,768,288  88,776.83  99.41  94.83  1.21  0.60  29.35  99.11  98.55
Average  943,318,302  94,331.83  99.44  94.59  1.49  0.56  31.14  99.10  98.59
[Figure: sequencing depth distribution. Panels: NA12878-WGSPE100-1, NA12878-WGSPE100-2.]
The X-axis denotes sequencing depth, and the Y-axis indicates the percentage of the whole genome
(excluding gap regions) under a given sequencing depth.
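A depth curve of this kind can be derived from per-position depths. The sketch below is one minimal way to do it, assuming the depths are already available as a list of integers for non-gap positions; the function name and the toy depths are hypothetical and do not reproduce the report's own tooling.

```python
from collections import Counter


def cumulative_depth_fraction(depths, max_depth):
    """For each d in 1..max_depth, return the fraction of positions with depth >= d."""
    n = len(depths)
    if n == 0:
        return {}
    counts = Counter(depths)
    # Start with positions deeper than max_depth, then accumulate downwards.
    covered = sum(c for depth, c in counts.items() if depth > max_depth)
    result = {}
    for d in range(max_depth, 0, -1):
        covered += counts.get(d, 0)
        result[d] = covered / n
    return result


# Toy per-position depths (hypothetical values).
toy_depths = [0, 3, 5, 5, 10, 30, 31, 32]
curve = cumulative_depth_fraction(toy_depths, 5)
```

In this toy case, `curve[1]` is the fraction of positions covered by at least one read.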
[Figure: insert size distribution of paired reads. Panels: NA12878-WGSPE100-1, NA12878-WGSPE100-2.]
The X-axis denotes the insert size of paired reads, and the Y-axis shows the fraction of paired
reads with a given insert size.
NA12878-WGSPE100-1  Y(98.13)  Y(92.23)  Y(91.47)  Y(78.51)  Y(41.69)  Y(99.47)  Y(1.77)  Y(0.52)  Y(32.94)  Y(99.10)
NA12878-WGSPE100-2  Y(97.92)  Y(91.50)  Y(90.74)  Y(77.13)  Y(41.25)  Y(99.41)  Y(1.21)  Y(0.60)  Y(29.35)  Y(99.11)
4 SNP Results
Overall, we identified 3,307,251 SNPs in all individuals. Of these variants, 99.60% were
represented in dbSNP and 98.07% were annotated in the 1000 Genomes Project database. The
number of novel SNPs was 9,079. The ratio of transitions to transversions (Ti/Tv) was 2.06. Of all
SNPs, 10,445 were synonymous, 9,481 were missense, 27 were stoploss, 69 were stopgain, 13
were startloss and 140 were in splice sites. The Ti/Tv of coding SNPs was 3.09 (Table 5). The
summary statistics of SNPs are shown in Table 4.
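For readers unfamiliar with the Ti/Tv ratio quoted above, the following sketch shows how it is defined: transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) substitutions, and all other single-base substitutions are transversions. The function name and the toy SNP list are hypothetical; this is an illustration of the definition, not the pipeline's own code.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) base pairs."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Toy input: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
toy_ratio = ti_tv_ratio(toy_snps)
```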
NA12878-WGSPE100-1  3,309,431  99.65  97.86  7,725  1,409,940  1,899,491  1,315,432  4,277  21,665
NA12878-WGSPE100-2  3,268,233  99.70  98.18  6,154  1,376,262  1,891,971  1,303,356  4,179  21,579
5 InDel Results
A total of 849,480 InDels were called across all samples. Of these variants, 76.16% were
represented in dbSNP and 53.79% were annotated in the 1000 Genomes Project database. The
number of novel InDels was 186,149. Of all InDels, 250 were frameshift, 5 were stoploss, 6
were startloss and 84 were in splice sites (Table 7). The summary statistics of InDels are shown in
Table 6. The length distribution of the InDels in the coding sequence (CDS) region is plotted in
Figure 5.
NA12878-WGSPE100-1  816,505  77.77  55.22  165,699  302,467  514,038  346,462  697  6,200  13,466
NA12878-WGSPE100-2  837,072  77.49  54.69  172,094  302,855  534,217  354,674  678  6,306  13,702
NA12878-WGSPE100-1  257  163  155  6  7  87
NA12878-WGSPE100-2  254  165  155  7  6  87
[Figure 5. Length distribution of InDels in the CDS region. Panels: NA12878-WGSPE100-1, NA12878-WGSPE100-2.]
The X-axis denotes the length of insertions/deletions, and the Y-axis indicates the number of
insertions/deletions.
6 CNV Results
On average, we identified 3,436 copy number variations (CNVs) per individual, with a total
amplification length of 11,907,050 bp and a total deletion length of 61,194,250 bp. Of these CNVs,
481 overlapped coding exonic regions, 0 overlapped 5' untranslated regions, 1 overlapped 3'
untranslated regions and 951 overlapped introns. A further 52 overlapped the 2-kb region upstream
of a transcription start site and 60 overlapped the 2-kb region downstream of a transcription end
site. The summary statistics of CNVs are shown in Table 8.
NA12878-WGSPE100-1  4,013  444  4  0  1,174  0  1  55  66  2,269  11,376,100
NA12878-WGSPE100-2  2,859  519  3  0  728  0  1  49  55  1,504  12,438,000
7 SV Results
The structural variants (SVs) were classified into the following five subtypes based on location
NA12878-WGSPE100-1  4,471  0  2,731  299  1,255  186  115  17  0  2,716  2  22
NA12878-WGSPE100-2  3,986  61  2,411  293  1,034  187  109  15  0  2,552  4  18
Methods
The bioinformatics analysis began with the sequencing data (raw data from the BGISEQ
machine). First, clean data were produced by filtering the raw data. The clean data of each
sample were then mapped to the human reference genome (GRCh37/HG19) using the
Burrows-Wheeler Aligner (BWA)[1][2]. To ensure accurate variant calling, we followed the
recommended Best Practices for variant analysis with the Genome Analysis Toolkit (GATK,
https://www.broadinstitute.org/gatk/guide/best-practices). Local realignment around InDels and
In addition, a strict quality control (QC) system was built into the whole analysis pipeline to
guarantee qualified sequencing data.
Genomic variations, including SNPs and InDels, were detected with HaplotypeCaller of GATK
(v3.3.0). The variant quality score recalibration (VQSR) method, which uses machine learning to
identify annotation profiles of variants that are likely to be real, was then applied to obtain
high-confidence variant calls. Copy number variants (CNVs) were called using the read-depth
algorithm of CNVnator[8] v0.2.7. Structural variations (SVs) were detected using
Breakdancer[9][10] or CREST[11]. The SnpEff tool
(http://snpeff.sourceforge.net/SnpEff_manual.html) was then applied to perform a series of
annotations for the variants.
The final variants and annotation results were used in the downstream advanced analysis.
3 Data cleanup
To reduce noise in the sequencing data, the data were first filtered by: (1) removing reads
containing sequencing adapter; (2) removing reads in which more than 50% of bases are low quality
(base quality less than or equal to 5); (3) removing reads in which more than 10% of bases are
unknown ('N' bases). Statistical analysis of data and downstream bioinformatics analysis
were performed on this filtered, high-quality data, referred to as the "clean data".
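The three filtering rules above can be expressed as a single predicate. The sketch below is a simplified illustration only: it assumes base qualities are already decoded to integers, it detects the adapter by exact substring match (real adapter detection usually tolerates mismatches), and the adapter sequence used in the example is hypothetical.

```python
def passes_filter(seq, quals, adapter, min_qual=5, max_low_frac=0.5, max_n_frac=0.1):
    """Return True if a read survives the three cleanup rules.

    seq     -- read bases as a string
    quals   -- per-base quality scores as integers (same length as seq)
    adapter -- adapter sequence; exact matching is a simplifying assumption
    """
    n = len(seq)
    if n == 0:
        return False
    # Rule 1: discard reads containing the sequencing adapter.
    if adapter and adapter in seq:
        return False
    # Rule 2: discard reads where more than 50% of bases have quality <= 5.
    low = sum(1 for q in quals if q <= min_qual)
    if low / n > max_low_frac:
        return False
    # Rule 3: discard reads where more than 10% of bases are 'N'.
    if seq.upper().count("N") / n > max_n_frac:
        return False
    return True


HYPOTHETICAL_ADAPTER = "GGGG"  # placeholder, not a real adapter sequence
good_read = passes_filter("ACGTACGTAC", [30] * 10, HYPOTHETICAL_ADAPTER)
```

Note that the thresholds are strict inequalities, so a read with exactly 50% low-quality bases or exactly 10% N bases is kept.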
Picard tools (v1.118)[7] was used to sort the SAM files by coordinate and convert them to
BAM files.
The same DNA molecule can be sequenced several times during the sequencing process.
The resulting duplicate reads are not informative and should not be counted as additional evidence
for or against a putative variant. We used Picard tools (v1.118)[7] to mark the duplicate reads,
which were ignored in downstream analysis.
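The idea behind duplicate marking can be sketched as follows. This is a deliberately simplified illustration and not Picard's implementation: it assumes single-end reads described by plain dictionaries, groups them by chromosome, start position and strand, and keeps the read with the highest summed base quality in each group. All field names are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Each read is a dict with keys: name, chrom, pos, strand, quals.
    Reads sharing (chrom, pos, strand) are treated as copies of one molecule;
    the copy with the highest summed base quality is kept as the representative.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: sum(r["quals"]))
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


toy_reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [30, 30]},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [40, 40]},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "quals": [10]},
    {"name": "r4", "chrom": "chr2", "pos": 100, "strand": "+", "quals": [20]},
]
toy_duplicates = mark_duplicates(toy_reads)
```

Marked reads stay in the file; they are only flagged so that downstream tools can skip them.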
8 Variant filtering
The raw variant set contains both SNPs and InDels, and it is extremely important to filter it so
that downstream analyses proceed with the highest-quality call set possible. We used the GATK
Variant Quality Score Recalibration (VQSR), which applies a machine learning algorithm, to filter
the raw variant callset. VQSR used high-quality known variant sets as training and truth resources
and built a predictive model to filter spurious variants. The SNPs and InDels marked PASS in the
output VCF file formed the high-confidence variant set.
For the SNP recalibration strategy, we used the following datasets and features to train the model.
(a) Training sets: HapMap V3.3, Omni2.5M genotyping array data and high-confidence SNP sites
produced by the 1000 Genomes Project. (b) Features: Coverage (DP), Quality/depth (QD), Fisher
test on strand bias (FS), Odds ratio for strand bias (SOR), Mapping quality rank sum test
(MQRankSum), Read position rank sum test (ReadPosRankSum), RMS mapping quality (MQ).
The recalibration commands and parameters for SNPs were the following.
For the InDel recalibration strategy, we used the following datasets and features to train the
model. (a) Training sets: Mills 1000G gold standard InDel set. (b) Features: Coverage (DP),
Quality/depth (QD), Fisher test on strand bias (FS), Odds ratio for strand bias (SOR), Mapping quality
rank sum test (MQRankSum), Read position rank sum test (ReadPosRankSum).
The recalibration commands and parameters for InDels were the following.
Breakdancer[9][10] with default settings was used to detect structural variations (SVs). This
method implements a paired-end discordance mapping algorithm based on the separation
distance and alignment orientation between paired reads.
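To make the notion of a discordant pair concrete, the sketch below classifies a read pair from its separation distance and mate orientation. It is a generic illustration of the principle and does not reproduce Breakdancer's code, flags or thresholds. It assumes a library in which concordant pairs map forward-then-reverse on the same chromosome; the function name, category labels and insert-size bounds are all hypothetical.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, lower, upper):
    """Classify a read pair by separation distance and orientation.

    lower/upper are the expected bounds on mate separation for concordant
    pairs (hypothetical values supplied by the caller).
    """
    if chrom1 != chrom2:
        return "inter-chromosomal"
    # Order the mates so that the leftmost one comes first.
    if pos1 > pos2:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == strand2:
        return "inversion-like"
    if strand1 == "-" and strand2 == "+":
        return "everted"
    separation = pos2 - pos1
    if separation > upper:
        return "deletion-like"
    if separation < lower:
        return "insertion-like"
    return "concordant"


example = classify_pair("chr1", 100, "+", "chr1", 5000, "-", lower=300, upper=600)
```

An SV caller would then look for clusters of pairs of the same class supporting the same breakpoint, since a single discordant pair is weak evidence.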
CREST[11] was used to identify SVs with standard settings. It maps the breakpoints of SVs
by using the information in soft-clipped reads and applying an assembly-mapping-searching-
assembly-alignment procedure based on CAP3[12] and BLAT[13].
(a) gene-based annotation: identify whether variants cause protein coding changes and the
amino acids that are affected.
(b) filter-based annotation: identify variants that are reported in dbSNP v141, or identify the
subset of variants with MAF < 1% in the 1000 Genomes Project, or identify the subset of coding
non-synonymous SNPs with SIFT score < 0.05, or find intergenic variants with GERP++ score > 2,
or apply many other annotations to specific mutations.
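The filter-based criteria in (b) can be written as small predicates over annotated variants. The sketch below assumes each variant is a dictionary with hypothetical keys (`maf_1000g`, `effect`, `sift`, `gerp`); these names are illustrative and are not the column names of the delivered annotation files. Treating a variant absent from the 1000 Genomes Project as rare is also an assumption.

```python
def is_rare(variant, max_maf=0.01):
    """MAF < 1% in the 1000 Genomes Project; a missing MAF is treated as rare (assumption)."""
    maf = variant.get("maf_1000g")
    return maf is None or maf < max_maf


def is_sift_damaging(variant, max_sift=0.05):
    """Coding non-synonymous (missense) SNP with SIFT score < 0.05."""
    sift = variant.get("sift")
    return variant.get("effect") == "missense" and sift is not None and sift < max_sift


def is_conserved_intergenic(variant, min_gerp=2.0):
    """Intergenic variant with GERP++ score > 2."""
    gerp = variant.get("gerp")
    return variant.get("effect") == "intergenic" and gerp is not None and gerp > min_gerp


toy_variants = [
    {"id": "v1", "effect": "missense", "maf_1000g": 0.002, "sift": 0.01},
    {"id": "v2", "effect": "missense", "maf_1000g": 0.2, "sift": 0.01},
    {"id": "v3", "effect": "intergenic", "maf_1000g": None, "gerp": 3.1},
    {"id": "v4", "effect": "missense", "maf_1000g": 0.001, "sift": 0.4},
]
rare_damaging = [v["id"] for v in toy_variants if is_rare(v) and is_sift_damaging(v)]
conserved = [v["id"] for v in toy_variants if is_conserved_intergenic(v)]
```

Keeping the criteria as separate predicates lets them be applied individually, as the list above describes, or combined when narrowing down candidates.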
12 Web Resources
The URLs for data presented herein and data format details are as follows:
RefGene database,
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
dbSNP, http://www.ncbi.nlm.nih.gov/snp
Help
1 Guide to visualization
The Integrative Genomics Viewer (IGV)[3][4] is a high-performance visualization tool for
interactive exploration of many different types of large genomic datasets. IGV is freely available for
download from http://www.broadinstitute.org/igv. IGV includes a large number of specialized
features for exploring next-generation sequencing read alignments, including features for
sequencing coverage and variant visualization. IGV supports the SAM/BAM read alignment file
formats and the VCF format for viewing variants. The following figure illustrates the IGV
application window.
(3) variants located in homologous sequence regions. The UCSC BLAT tool can be used to find
areas of probable homology. From the reference genome, we can obtain a query sequence by
extending 100 bp on both sides of the variant and submit the sequence to
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start.
(4) heterozygous variants with allelic imbalance in the sequencing data, namely those for which
the fraction of reads supporting the alternate allele is less than 0.25 or more than 0.75.
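The allele-balance criterion in item (4) is simple arithmetic on read counts. A minimal sketch, assuming the reference-supporting and alternate-supporting read counts have already been extracted for a heterozygous call (the function name and arguments are hypothetical):

```python
def has_allelic_imbalance(ref_reads, alt_reads, low=0.25, high=0.75):
    """Return True if the alternate-allele read fraction is < 0.25 or > 0.75.

    Returns None when no reads cover the site, since the fraction is undefined.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return None
    alt_fraction = alt_reads / total
    return alt_fraction < low or alt_fraction > high


balanced = has_allelic_imbalance(10, 10)
skewed = has_allelic_imbalance(31, 9)
```

For example, 9 alternate reads out of 40 gives a fraction of 0.225, which falls below the 0.25 bound and is flagged.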
Mac users: run tar -zxvf *.tar.gz in a shell; alternatively, 'StuffIt Expander' is recommended.
FAQs
How can BAM files be viewed in Microsoft Windows?
Create an index of the BAM file (named *.bai) using Picard tools, then open the BAM file with IGV.
What is the length range of small InDels in exome and whole genome re-sequencing?
For small InDels, the range is from 1 to 50 bp.
Base quality scores are not perfectly accurate. Is this taken into consideration in variant calling?
Yes. A Base Quality Score Recalibration step in the GATK package was used to correct the raw base quality scores before calling variants.
Compared with whole genome re-sequencing, exome sequencing covers only the exon regions of DNA and can be
simpler, more economical and more efficient. Why select whole genome re-sequencing, and what is its value?
Large structural variations and mutations in non-exonic regions can be called by whole genome sequencing, giving a
more comprehensive understanding of the genome.
What is the purpose of checking the genotype barcode of 21 SNP sites in the data quality control step?
We generate a genotype barcode of 21 SNP sites by Sequenom for every sample and check it against the calls from the
sequencing data, in order to track the identity of samples during the sequencing process.
References
[1] Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics, 25: 1754-1760.
[2] Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform.
Bioinformatics, 26: 589-595.
[3] Robinson, J.T. et al. (2011). Integrative Genomics Viewer. Nature Biotechnology, 29: 24-26.
[4] Thorvaldsdottir, H., Robinson, J.T. and Mesirov, J.P. (2013). Integrative Genomics Viewer (IGV):
high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14: 178-192.
[5] DePristo, M.A. et al. (2011). A framework for variation discovery and genotyping using next generation DNA
sequencing data. Nature Genetics, 43: 491-498.
[6] McKenna, A. et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next
generation DNA sequencing data. Genome Research, 20: 1297-1303.
[7] Picard Tools (http://broadinstitute.github.io/picard/).