
Chapter 24

A Method for Discovery of Genome-Wide SNP Between Any Two Genotypes from Whole-Genome Re-sequencing Data
S. Gopala Krishnan, Daniel L.E. Waters, and Robert J. Henry

Abstract
Advances in sequencing technologies have aided the discovery of millions of genome-wide DNA
polymorphisms such as single-nucleotide polymorphisms (SNPs) and insertion–deletions (InDels)
which are an invaluable resource for marker-assisted breeding. Presently available bioinformatics tools
assist the discovery of polymorphisms between target genotypes and the reference genome for a range
of species. The discovery of polymorphisms between two genotypes within a breeding program is
complicated by several factors such as bias in the number of reads from each genotype and residual
heterozygosity within each genotype. In this chapter, we describe a novel approach where
polymorphisms between a pair of genotypes are discovered from whole-genome re-sequencing data.

Key words Next-generation sequencing, Genotypes, Whole-genome re-sequencing, Genome-wide polymorphisms, Pairwise SNPs

1 Introduction

Advances in sequencing technologies have aided the discovery of millions of genome-wide DNA polymorphisms such as single-nucleotide polymorphisms (SNPs) and insertion–deletions (InDels), which are an invaluable resource for marker-assisted breeding [1]. SNPs and InDels are becoming the preferred markers in molecular breeding due to multiple advantages such as high frequency, stability, high throughput capability, and cost-effectiveness over other DNA markers [2]. Next-generation sequencing (NGS) technologies make possible the discovery of a massive number of DNA polymorphisms by comparing the whole-genome sequences of individuals with high-quality reference genome sequences [3]. SNPs have been employed in breeding programs for marker-assisted and genomic selection, association and QTL mapping, positional cloning, haplotype and pedigree analysis, seed purity analysis, and variety identification [4].

Robert J. Henry and Agnelo Furtado (eds.), Cereal Genomics: Methods and Protocols, Methods in Molecular Biology,
vol. 1099, DOI 10.1007/978-1-62703-715-0_24, © Springer Science+Business Media New York 2014


A number of different NGS platforms can be used for whole-genome re-sequencing [5, 6], and the detailed procedure for generating sequence reads for the discovery of polymorphisms using the Illumina Genome Analyzer has been described [7]. The bioinformatics tools presently available help the discovery of genome-wide polymorphisms by comparing the whole-genome sequence of individual genotypes with high-quality reference genome sequences [3]. A schematic overview of the steps of SNP discovery by comparison of a high-quality reference genome sequence with whole-genome re-sequencing data is presented in Fig. 1.
The discovery of polymorphisms between any two genotypes is important in practical plant breeding. One way of obtaining SNPs between genotypes whose whole genomes have been re-sequenced is to first map the reads of one of the genotypes to a high-quality reference genome (such as the IRGSP Pseudomolecule 5.0 of rice cultivar Nipponbare). The consensus sequence of the first genotype can then be retrieved from this mapping and used as the reference for mapping the reads from the whole-genome sequence data of the second genotype. However, this is not ideal, for two reasons: (1) the consensus sequence of the first genotype may have large gaps and is therefore a poor reference sequence for mapping the reads from the second genotype, and (2) the annotations available in the high-quality reference genome are lost when the consensus sequence of the first genotype is retrieved from the initial mapping assembly.
Alternatively, mapping the combined whole-genome re-sequencing data of both genotypes to the high-quality reference genome may be an option for SNP discovery. However, polymorphism discovery can be complicated by several factors: the SNPs detected may be polymorphic between the reference genome and the genotypes in question but not polymorphic between the two genotypes; there may be residual heterozygosity in one or both of the genotypes; and there may be bias in the number of reads for each genotype at any one locus. Ideally, 50% of the reads at any one reference position would belong to one genotype and the remaining 50% to the second genotype. In practice, the number of reads that map to a particular reference position differs between genotypes, and this bias creates a problem for SNP discovery when a 1:1 allelic ratio is expected.
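The effect of unequal read depth on the allele ratio in a combined assembly can be illustrated with a small worked example. The sketch below is illustrative only: the function name and the read counts are hypothetical and are not taken from the chapter's data. It assumes a site at which the two genotypes are homozygous for different alleles, so that every read from a genotype carries that genotype's allele.

```python
def combined_allele_fraction(reads_g1, reads_g2):
    """Fraction of reads in the combined assembly that carry Genotype 1's allele.

    Assumes each genotype is homozygous for a different allele at this site,
    so the allele fraction simply mirrors each genotype's share of the reads.
    """
    return reads_g1 / (reads_g1 + reads_g2)


# Equal depth from both genotypes gives the expected 1:1 ratio.
balanced = combined_allele_fraction(10, 10)

# Unequal depth (hypothetical counts) skews the ratio although the site
# is a genuine difference between the two genotypes.
skewed = combined_allele_fraction(3, 17)

print(balanced)  # 0.5
print(skewed)    # 0.15
```

A caller that accepted only sites close to a 1:1 ratio in the combined assembly would reject the second site, which is one reason the individual assemblies are examined as well.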
Here, we present a novel, robust approach which allows polymorphisms between a pair of genotypes to be discovered from the whole-genome data. The method involves first separately mapping the whole-genome re-sequencing data of each genotype to the reference genome and detecting SNPs for the individual genotypes. Then the whole-genome data from both genotypes are mapped in combination to the reference genome, and the set of SNPs in both genotypes is identified in comparison to the reference genome. Finally, the SNPs discovered from the individual assemblies and from the combined assembly are used to identify pairwise SNPs between the genotypes by eliminating the SNPs which are common to both genotypes in comparison to the reference genome and the heterozygous SNPs which are inherent to each genotype.

Fig. 1 A schematic overview of the steps involved in SNP discovery from whole-genome re-sequencing data in comparison with a high-quality reference genome sequence

2 Materials

2.1 Whole-Genome Re-sequence Data

1. High-quality trimmed reads from the whole-genome re-sequencing of the genotypes.

2.2 Software

1. Software such as CLC Genomics Workbench for the assembly of reads and detection of SNPs (see Note 1).
2. A spreadsheet program such as Microsoft Excel with the option for filtering and eliminating duplicate values.

3 Methods

3.1 Assembly of Reads and Detection of SNPs

1. Use CLC Genomics Workbench to assemble the trimmed, high-quality reads from each of the genotypes (e.g., Genotype 1 and Genotype 2) individually to the reference genome (Fig. 1, steps 1–3).
2. Use the SNP detection tool to discover SNPs in the assembled contigs in comparison to the high-quality reference genome (Fig. 1, steps 4–5).
3. Use CLC Genomics Workbench to assemble the trimmed reads from both genotypes in combination to the reference genome (Fig. 1, steps 1–3).
4. Use the SNP detection tool to discover SNPs in the combined assembly in comparison to the high-quality reference genome (Fig. 1, steps 4–5) (see Note 2).
5. The three assemblies (one combined assembly of Genotypes 1 and 2 and the two individual assemblies of Genotype 1 and Genotype 2) give rise to three possible situations (Fig. 2).
Situation 1: In this case (Fig. 2a), even though an SNP (C compared to T at reference position no. 7,936,303) is discovered in the combined assembly and in the individual assemblies, the allele (C) is the same in both Genotype 1 and Genotype 2, and so there is no polymorphism between the genotypes.

Situation 2: In this case (Fig. 2b), two alleles (G/A, G in the reference sequence at position no. 7,941,982) are discovered in the combined assembly. In the individual assemblies, both Genotype 1 (4G/4A) and Genotype 2 (8G/4A) are heterozygous at this position. Therefore, even though there appear to be two alleles in the combined assembly, the polymorphism is due to residual heterozygosity at this position, which is shared by both Genotype 1 and Genotype 2.

Situation 3: In this case (Fig. 2c), two alleles (C/T, C in the reference sequence at position no. 7,936,320) are discovered in the combined assembly, where one allele (T) is from Genotype 1 and the other allele (C) is from Genotype 2. Therefore, we can conclude that this is an SNP between Genotype 1 and Genotype 2.

Fig. 2 Three assemblies (one combined assembly of Genotypes 1 and 2 and the two individual assemblies of Genotype 1 and Genotype 2) can give rise to three possible situations: (a) a C/T SNP where the C allele is the same in both Genotype 1 and Genotype 2; (b) a G/A SNP where both Genotype 1 and Genotype 2 are heterozygous at this position; (c) a C/T SNP where one allele (T) is from Genotype 1 and the other allele (C) is from Genotype 2

The ideal situation is where two alleles are discovered in the combined assembly and each genotype carries a single, different allele (Situation 3). However, considering that millions of SNPs are discovered from whole-genome re-sequencing data, it is impractical to make such a comparison for each and every SNP identified in the assemblies generated. Therefore, a simple strategy is needed to identify SNPs between two genotypes (Genotype 1 and Genotype 2, as presented in this case).

3.2 Discovery of SNPs Between Two Genotypes

1. In the first step, flag the heterozygous loci in each of the genotypes for elimination. This is achieved in a spreadsheet by filtering the SNPs from non-repetitive regions discovered from the separate individual assemblies of Genotype 1 and Genotype 2 with the following parameters: coverage > 4, count of variant 1 > 0, and count of variant 2 > 0. The SNPs obtained after this step will be SNPs which are unique to each individual assembly (see Note 3).
2. In the next step, identify SNPs in the combined assembly. This is done by filtering the SNPs from non-repetitive regions discovered from the combined assembly of Genotype 1 and Genotype 2 with the following parameters: coverage > 9, count of variant 1 > 4, and count of variant 2 > 4 (see Note 4).
3. In the final step, eliminate spurious SNPs from the pool of putative SNPs identified between Genotype 1 and Genotype 2 in the combined assembly. This is achieved in two substeps. First, compare the SNPs from the individual assembly of Genotype 1 with the SNPs identified in the combined assembly, one chromosome at a time, and remove the SNPs with the same reference position in both assemblies by using the remove duplicates option (see Note 5). Second, compare the SNPs from the individual assembly of Genotype 2 with the SNPs identified in the combined assembly, one chromosome at a time, and again remove the SNPs with the same reference position in both assemblies by using the remove duplicates option.
4. Retain the set of SNPs that remains from the combined assembly after the elimination of duplicates in the above steps. These constitute the true pairwise SNPs between Genotype 1 and Genotype 2 (see Note 6).
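The spreadsheet procedure above can also be sketched in a few lines of code. This is an illustration under stated assumptions, not the authors' implementation: the record layout (`chrom`, `pos`, `coverage`, `count1`, `count2`), the function names, and the chromosome label `chr1` are hypothetical, and a set difference stands in for the spreadsheet "remove duplicates" step. SNPs are assumed to have been restricted to non-repetitive regions already. The thresholds are those given in steps 1 and 2; the read counts in the example are invented except for the 4G/4A and 8G/4A site described in Subheading 3.1.

```python
def heterozygous_loci(snps, min_coverage=4):
    """Step 1: loci where both variants are observed within one genotype."""
    return {
        (s["chrom"], s["pos"])
        for s in snps
        if s["coverage"] > min_coverage and s["count1"] > 0 and s["count2"] > 0
    }


def combined_candidates(snps, min_coverage=9, min_count=4):
    """Step 2: loci in the combined assembly with two well-supported variants."""
    return {
        (s["chrom"], s["pos"])
        for s in snps
        if s["coverage"] > min_coverage
        and s["count1"] > min_count
        and s["count2"] > min_count
    }


def pairwise_snps(genotype1, genotype2, combined):
    """Steps 3-4: drop candidates that are heterozygous in either genotype."""
    return (
        combined_candidates(combined)
        - heterozygous_loci(genotype1)
        - heterozygous_loci(genotype2)
    )


def row(pos, coverage, count1, count2, chrom="chr1"):
    """Build one hypothetical SNP record; 'chr1' is an assumed label."""
    return {"chrom": chrom, "pos": pos, "coverage": coverage,
            "count1": count1, "count2": count2}


g1 = [row(7936303, 6, 6, 0), row(7941982, 8, 4, 4), row(7936320, 6, 6, 0)]
g2 = [row(7936303, 6, 6, 0), row(7941982, 12, 8, 4)]
comb = [row(7936303, 12, 12, 0), row(7941982, 20, 12, 8), row(7936320, 12, 6, 6)]

result = pairwise_snps(g1, g2, comb)
print(sorted(result))  # [('chr1', 7936320)]
```

Keying every locus by chromosome and position together removes the need to process one chromosome at a time, which the spreadsheet workflow requires.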

4 Notes

1. CLC Genomics Workbench is used only as an example here; it is one of many software packages available for assembly and detection of SNP variants from whole-genome re-sequencing data. Since the approach described here for discovering pairwise SNPs starts after SNP discovery, other software can be used for SNP discovery.
2. SNPs from repetitive regions can be eliminated by selecting the “No repeats” option in the overlapping annotation tab in CLC Genomics Workbench.
3. Eliminating heterozygous loci in the individual assemblies may filter out some real SNPs. However, a very large number of SNPs between the genotypes will still remain.
4. It is essential to partition the SNPs of the individual genotypes by chromosome. This avoids complications arising from the same reference position occurring on different chromosomes. For example, Position 456,239 on Chromosome 1 could be confused with Position 456,239 on any of the other 11 chromosomes of rice. Partitioning also avoids complications at the later step of eliminating the spurious SNPs from the pool of putative SNPs identified in the combined assembly.
5. While removing duplicates, it is important to place the SNPs from the combined assembly below the SNPs from the individual assembly (especially in MS Excel), because it is the duplicate in the pool of putative SNPs from the combined assembly that needs to be removed to identify true SNPs between Genotype 1 and Genotype 2.
6. A confirmatory check of the robustness of the pairwise SNPs can be performed by viewing the corresponding position of the chromosome as shown in Fig. 2.
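Notes 4 and 5 can be illustrated together. The sketch below is hypothetical throughout (function name, row layout, and the `chr1`/`chr2` labels are assumptions); only Position 456,239 on Chromosome 1 comes from Note 4. It mimics a de-duplication that keeps the first occurrence of each key, which is why the individual-assembly rows must come first, and it shows how matching on position alone wrongly discards a combined-assembly SNP on another chromosome.

```python
def drop_later_duplicates(rows, key):
    """Keep the first row seen for each key and drop every later duplicate."""
    seen = set()
    kept = []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept


# Individual-assembly rows are listed first, combined-assembly rows below them.
individual = [{"source": "individual", "chrom": "chr1", "pos": 456239}]
combined = [
    {"source": "combined", "chrom": "chr1", "pos": 456239},
    {"source": "combined", "chrom": "chr2", "pos": 456239},
]
stacked = individual + combined


def surviving_combined(rows):
    """Return (chrom, pos) of the combined-assembly rows that were not dropped."""
    return [(r["chrom"], r["pos"]) for r in rows if r["source"] == "combined"]


# Matching on position alone also removes the unrelated chr2 SNP.
survivors_pos_only = surviving_combined(
    drop_later_duplicates(stacked, key=lambda r: r["pos"])
)

# Matching on chromosome and position removes only the true duplicate.
survivors_chrom_pos = surviving_combined(
    drop_later_duplicates(stacked, key=lambda r: (r["chrom"], r["pos"]))
)

print(survivors_pos_only)   # []
print(survivors_chrom_pos)  # [('chr2', 456239)]
```

Working one chromosome at a time in a spreadsheet achieves the same effect as the composite key used here.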

Acknowledgements

G. K. S. acknowledges the Department of Science and Technology, Government of India, for the financial support under the BOYSCAST Fellowship.

References
1. Gopala Krishnan S, Waters DLE, Henry RJ (2012) Genome-wide variations between elite lines of indica rice discovered through whole genome re-sequencing. In: Rangasamy SRS et al (ed) 100 years of rice science and looking beyond. Proceedings of the International symposium held at Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India, 9–12 January 2012, pp 118–119
2. Henry RJ, Edwards K (2009) New tools for single nucleotide polymorphism (SNP) discovery and analysis accelerating plant biotechnology. Plant Biotechnol J 7:311
3. Gopala Krishnan S, Waters DLE, Katiyar SK, Sadananda AR, Satyadev V, Henry RJ (2012) Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing. Plant Biotechnol J 10:623–634
4. McCouch SR, Zhao K, Wright M, Tung CW, Ebana K, Thomson M et al (2010) Development of genome-wide SNP assays for rice. Breed Sci 60:524–535
5. Edwards M, Henry R (2011) DNA sequencing methods contributing to new directions in cereal research. J Cereal Sci 54:395–400
6. Henry RJ, Edwards M, Waters DLE, Gopala Krishnan S, Bundock P, Sexton TR et al (2012) Molecular markers for plants derived from large scale sequencing. J Biosci 37:829–841
7. Lakdawalla A, Schroth GP (2010) Mutation discovery with the Illumina genome analyzer. In: Meksem K, Kahl G (eds) The handbook of plant mutation screening: mining of natural and induced alleles. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp 103–120
