ASSOCIATION MAPPING IN PLANTS




Approaches to dissect complex traits
• Linkage analysis (QTL mapping)
• Association mapping (linkage disequilibrium based)
• Candidate gene studies through genomic approaches
Disadvantages of linkage mapping
• The amount of genetic variation between any two parents is limited.
• The genetic background in which QTL mapping is done is not always representative of the crop's genetic background.
• Only a few generations of effective recombination take place, leaving long segments in LD and therefore reducing resolution; i.e. homozygosity is reached quickly, making dissection of a genomic region difficult (fine mapping is not possible).
Linkage mapping - counts recombination between markers and the trait of interest (linkage) in a biparental population.

Association mapping - measures correlation between marker alleles and trait alleles in a population (linkage disequilibrium).
“Association analysis, also known as LD mapping or
association mapping, is a population-based survey
used to identify trait-marker relationships based on
linkage disequilibrium” (Flint-Garcia et al. 2003)
How does one proceed?
Haplotype - a set of closely linked genetic markers on
a chromosome that tend to be inherited together
Linkage vs Association

Linkage
• Family-based
• Few markers for genome coverage (300-400 SSRs)
• Good for initial detection; poor for fine-mapping
• Powerful for rare variants

Association (LD mapping)
• Families or unrelated individuals
• Many markers for genome coverage ($10^5$ to $10^6$ SNPs)
• Poor for initial detection; good for fine-mapping
• Powerful for common variants; rare variants generally impossible
Complementary – idea is to understand
molecular genetic basis of phenotypic variation
Gene and genotype frequencies
Locus 1: allele A freq = pA
allele a freq = pa
pA + pa = 1
pAA + pAa + paa = 1

Locus 2: allele B freq = pB
allele b freq = pb
pB + pb = 1
pBB + pBb + pbb = 1

What is pAB (gamete) when there is linkage disequilibrium?
And when there is no linkage?
HWE is restored after a single generation of random mating in a population
only when the individual loci are considered singly, one at a time.
Linkage Disequilibrium
• $P_{AB} \neq P_A P_B$
• $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
• $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
• $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$

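Haplotype frequencies (rows: SNP 1, alleles A/a; columns: SNP 2, alleles B/b)
 | B | b | Total
A | $P_{AB}$ | $P_{Ab}$ | $P_A$
a | $P_{aB}$ | $P_{ab}$ | $P_a$
Total | $P_B$ | $P_b$ | 1.0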
Linkage Disequilibrium (LD) is a
measure of the non-random association
of alleles at two different loci.
What we measure as LD is in fact ‘gametic phase disequilibrium’, and it
will persist only when it is due to linkage. (Corollary?)
D = ru - st, where r, s, t and u are the frequencies of the AB, Ab, aB and ab gametes.
Factors affecting LD

Factors Increasing LD
• Mutation
• Self pollination
• Genetic isolation
• Population admixture
• Small founder population / genetic drift
• Selection
• Epistasis

Factors Decreasing LD
• High recombination
• Recurrent mutation
• Outcrossing
• Gene conversion

Linkage Disequilibrium – Example
How does LD arise?

Before mutation (SNP 1: PA=1/2, PC=1/2; SNP 2: PG=1)
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --

After mutation (SNP 1: PA=1/3, PC=2/3; SNP 2: PG=2/3, PC=1/3)
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

• There are only three haplotypes: AG, CG, and CC.
• There is no AC haplotype, i.e., $P_{AC} = 0$.
• However, $P_A P_C = 1/9$, thus $P_A P_C \neq P_{AC}$.
• These two SNPs are in linkage disequilibrium.
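The arithmetic of this example can be checked with a small sketch. It assumes each haplotype drawn on the slide is equally frequent in the population; the function name `haplotype_ld` is made up for illustration.

```python
from collections import Counter
from fractions import Fraction

def haplotype_ld(haplotypes, allele1, allele2):
    """Return (observed haplotype frequency, product of allele frequencies)
    for allele1 at SNP 1 and allele2 at SNP 2."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_hap = Fraction(counts[(allele1, allele2)], n)
    p1 = Fraction(sum(c for (a, _), c in counts.items() if a == allele1), n)
    p2 = Fraction(sum(c for (_, b), c in counts.items() if b == allele2), n)
    return p_hap, p1 * p2

# Haplotypes as (SNP 1 allele, SNP 2 allele); equal frequencies are assumed.
before_mutation = [("A", "G"), ("C", "G")]
after_mutation = [("A", "G"), ("C", "G"), ("C", "C")]

observed, expected = haplotype_ld(after_mutation, "A", "C")
print(observed, expected)  # prints: 0 1/9
```

The observed AC frequency (0) differs from the product of the allele frequencies (1/9), which is exactly the disequilibrium described above.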
Linkage Equilibrium – Example
How does LD disappear?

Before recombination
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

After recombination (SNP 1: PA=1/2, PC=1/2; SNP 2: PG=1/2, PC=1/2)
-- A -- -- -- C -- -- --
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --

• After recombination,
• $P_{AG} = P_A P_G = 1/4$,
• $P_{CG} = P_C P_G = 1/4$,
• $P_{CC} = P_C P_C = 1/4$, and
• $P_{AC} = P_A P_C = 1/4$.
• Thus, these two SNPs are in linkage equilibrium.
Measure of LD: D Coefficient
The non-randomness of two loci is measured by a deviation ‘D’ as follows:
• $D = P_{AB} P_{ab} - P_{Ab} P_{aB}$
• $P_{AB} = P_A P_B + D$
• $P_{Ab} = P_A (1 - P_B) - D$
• $P_{aB} = (1 - P_A) P_B - D$
• $P_{ab} = (1 - P_A)(1 - P_B) + D$
• D = 0 when the two loci are in linkage equilibrium.
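As a quick numerical check of these relations, the sketch below builds the four haplotype frequencies from chosen allele frequencies and a chosen D, then recovers D from them. The function names and the example values (0.6, 0.3, 0.05) are illustrative assumptions, not taken from the slides.

```python
def haplotype_freqs(p_A, p_B, d):
    """Haplotype frequencies (P_AB, P_Ab, P_aB, P_ab) implied by the
    allele frequencies and the disequilibrium coefficient D."""
    return (
        p_A * p_B + d,
        p_A * (1 - p_B) - d,
        (1 - p_A) * p_B - d,
        (1 - p_A) * (1 - p_B) + d,
    )

def d_coefficient(p_AB, p_Ab, p_aB, p_ab):
    """D = P_AB * P_ab - P_Ab * P_aB."""
    return p_AB * p_ab - p_Ab * p_aB

# Illustrative values only.
freqs = haplotype_freqs(p_A=0.6, p_B=0.3, d=0.05)
print(freqs)                  # approximately (0.23, 0.37, 0.07, 0.33)
print(d_coefficient(*freqs))  # approximately 0.05
```

Setting `d=0` gives haplotype frequencies equal to the products of the allele frequencies, i.e. linkage equilibrium.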
Standardization of D Coefficient
• D coefficient is normalized (Lewontin, 1964) since range of D is always
determined by the allele frequency.
$D' = D / D_{max}$, where $D_{max}$ stands for the absolute maximal possible value of D.

$P_{AB} = P_A P_B + D \geq 0 \Rightarrow D \geq -P_A P_B$
$P_{ab} = P_a P_b + D \geq 0 \Rightarrow D \geq -P_a P_b$
$P_{Ab} = P_A P_b - D \geq 0 \Rightarrow D \leq P_A P_b$
$P_{aB} = P_a P_B - D \geq 0 \Rightarrow D \leq P_a P_B$

$$D' = \begin{cases} \dfrac{D}{\min(P_A P_B,\ P_a P_b)}, & \text{if } D < 0; \\ \dfrac{D}{\min(P_A P_b,\ P_a P_B)}, & \text{if } D > 0. \end{cases}$$

[Figure: number line showing the range of D, marked at $-P_A P_B$, 0 and $P_a P_B$.]
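A minimal sketch of the standardization, assuming the usual two-case definition of $D_{max}$: min($P_A P_b$, $P_a P_B$) when D > 0 and min($P_A P_B$, $P_a P_b$) when D < 0. The function name `d_prime` is hypothetical, and both loci are assumed to be polymorphic.

```python
def d_prime(p_AB, p_A, p_B):
    """Lewontin's D' from one haplotype frequency and two allele frequencies."""
    p_a, p_b = 1 - p_A, 1 - p_B
    d = p_AB - p_A * p_B
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    return d / d_max

# Mutation example from the slides: A at SNP 1, C at SNP 2, no AC haplotype.
print(d_prime(0.0, 1 / 3, 1 / 3))   # prints: -1.0
# Recombination example: all four haplotypes at frequency 1/4.
print(d_prime(0.25, 0.5, 0.5))      # prints: 0.0
```

With this sign convention a negative D gives a negative D', matching the range of -1 to +1 used on the interpretation slide.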
Interpretation of D’
• D’ is constrained between -1 and +1.
• D’ = 1 (perfect positive LD between SNP alleles)
• D’ = 0 (linkage equilibrium between SNP alleles)
• D’ = -1 (perfect negative LD between SNP alleles)
• D’ = 0.87 (strong positive LD between SNP alleles)
• D’ = 0.12 (weak positive LD between SNP alleles)
Measure of LD: $r^2$ (Hill and Robertson, 1968)
$r^2 = \dfrac{(P_{AB} - p_A p_B)^2}{p_A \, p_a \, p_B \, p_b}$
• $0 \leq r^2 \leq 1$.
• Most relevant LD measurement.
• $r^2$ values of 0.1 to 0.2 are taken as the threshold for LD decay.
• $r^2$ is the square of the correlation coefficient when alleles are binary coded.
• If $r^2$ is the LD value of one SNP with another and $h^2_q$ is the total trait variation, then $r^2 \times h^2_q$ is the trait variation that can be explained by these SNPs.
• $r^2 = \chi^2 / K$, where K is the number of chromosomes.
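The formula can be tried on the mutation example from the earlier slide (A at SNP 1, C at SNP 2, no AC haplotype). This is a sketch with a hypothetical function name; K = 3 is assumed here only because that example shows three chromosomes.

```python
def r_squared(p_AB, p_A, p_B):
    """r^2 = (P_AB - p_A p_B)^2 / (p_A p_a p_B p_b)."""
    d = p_AB - p_A * p_B
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))

r2 = r_squared(0.0, 1 / 3, 1 / 3)
k = 3               # number of chromosomes sampled (assumption for this example)
chi2 = r2 * k       # rearranged from r^2 = chi^2 / K

print(round(r2, 4), round(chi2, 4))  # prints: 0.25 0.75
```

For the same pair of SNPs the magnitude of D' is 1 while $r^2$ is only 0.25, which illustrates that the two measures answer different questions.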
Decay of LD over Time
• Chromosomal recombination decreases LD over successive generations, so that
equilibrium is eventually attained.
Allelic Association
[Figure: marker genotypes of control and case individuals (3/6, 2/4, 3/2, 6/2, 3/5, 2/6, 3/6, 5/6, 4/6, 2/6, 6/6, 6/6, 3/4, 5/2). Allele 6 is ‘associated’ with trait of interest.]
Allelic Association: Three Common Forms
• Direct Association
  • Mutant or ‘susceptible’ polymorphism
  • Allele of interest is itself involved in phenotype

• Indirect Association
  • Allele itself is not involved, but it is correlated with a nearby
  variant that changes phenotype

• Spurious association
  • Apparent association not related to genetic causes
  (most common outcome…)

Linkage Disequilibrium: correlation between (any) markers in a population
Allelic Association: correlation between marker allele and trait
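Indirect and Direct Allelic Association
[Figure: two panels, each showing a trait locus D on a chromosome segment.]
Direct Association: measure trait relevance (*) directly, ignoring correlated markers nearby.
Indirect Association: assess trait effects on QTL via correlated markers ($M_i$: $M_1, M_2, \dots, M_n$).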
Indirect Association & LD
How far apart can markers be to detect association?
$D_t = (1 - u)^t D_0$
[Figure: expected decay of linkage disequilibrium. D (0 to 1) plotted against recombination fraction (0 to 0.1), with curves for 10, 20, 50 and 250 generations.]
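The decay formula can be evaluated directly for the generation counts shown in the figure. The sketch assumes that u is the recombination fraction between the two loci and t the number of generations, as the figure axes suggest; the starting value $D_0$ = 1 and the function name are illustrative.

```python
def ld_after(d0, u, t):
    """Expected disequilibrium after t generations: D_t = (1 - u)^t * D_0."""
    return (1 - u) ** t * d0

generations = [10, 20, 50, 250]
for u in (0.01, 0.05, 0.10):
    # One row per recombination fraction, one column per generation count.
    row = [round(ld_after(1.0, u, t), 3) for t in generations]
    print(u, row)
```

Tightly linked markers (small u) keep most of their LD for many generations, whereas loosely linked markers lose it quickly, which is why marker spacing must match the extent of LD in the population.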
Association Mapping for crop
improvement (AM versus QTL Mapping)

Since Association Mapping can be conducted directly on the
breeding material:

1. Direct inference from research to breeding is possible.
2. Phenotypic variation is observed for most traits of interest.
3. Marker polymorphism is higher than in biparental
populations.
4. No need of pedigrees or structured mapping populations.
5. Routine evaluations provide phenotypic data.
6. Higher resolution is possible because recombination has accumulated over a
very large number of generations, studied through the extent of
haplotype sharing.
7. Association Mapping provides other useful information
about:
Organization of genetic variation and
Polymorphism across the genome
Types of Populations


1. Germplasm Bank Collection

A collection of genetic resources including landraces, exotic material
and wild relatives.

2. Synthetic Populations

Outcrossing populations synthesized from inbred lines (segregating
generations). May be used for recurrent selection.

3. Elite Lines

Inbred lines (and checks) manipulated with the objective of releasing
new varieties in the short term.
Characteristics Related to Association Mapping
S. No | Aspects of AM | GP Bank | Synthetic population | Elite GP
1 | Sample | Core collection | Segregating progenies | Elite lines
2 | Sample turnover | Static | Ephemeral | Gradually substituting
3 | Source of P data | Screenings | Progeny tests | Yield trials
4 | Types of traits | High $h^2$ and domestication traits | Depends on the evaluation scheme | Low $h^2$ traits
5 | Type of marker | SNP | SNP/SSR | SSR
6 | LD | Low | Intermediate and fast decaying | High
7 | Population structure | Medium | Low | High
8 | Allele diversity among samples | High | Intermediate | Low
9 | Allele diversity within samples | Variable | 1 or 2 alleles | 1 allele
10 | Power | Low | Intermediate and decreasing | High; could allow genomic scan
11 | Resolution | High; could allow fine mapping | Intermediate and increasing | Low
12 | Use of informative markers | Transfer of new alleles by marker assisted BC | Incorporation in selection index | MAS in progenies (after validation)
Summary (of characteristics)
• Germplasm bank core-collections - for allele-mining of candidate genes and fine-mapped QTLs
• Elite lines - to detect genomic regions associated with traits of interest
• Synthetic populations might represent a balance between power and precision, and have the major advantage of being unstructured
Pearson Chi square test
Yates correction
Fisher’s Exact test
Structured Association
• To tackle highly structured populations
• Looks for closely related clusters and develops Q by use of a set of random unlinked markers
• Corrects the false associations
• STRUCTURE (Pritchard et al. 2000) estimates population structure and shared coancestry coefficients for all markers
• Not good enough when some degree of relationship (kinship) is also present.
Mixed Linear Model (MLM)
• Accounts for multiple levels of relatedness (Yu et al., 2006)
• Uses Q matrix (from STRUCTURE) to account for population subdivision
• Uses K matrix to account for relatedness within populations, using SPAGeDi software
• Superior to other methods (Structured Association, Genomic Control and Quantitative Transmission Disequilibrium Test) in Type I error control and statistical power
• Implemented in the software TASSEL
• Replacement of Q matrix with P matrix makes it more robust (Price et al., 2006 and Zhou et al., 2007)
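For orientation, the unified mixed model is commonly written in the following form, following Yu et al. (2006). The symbols S, α, v, u, $V_g$ and $V_R$ do not appear on the slides and are given here as the usual notation, so treat this as a sketch of the model rather than a full specification.

```latex
\[
  y = X\beta + S\alpha + Qv + Zu + e,
  \qquad
  \operatorname{Var}(u) = 2K V_g,
  \qquad
  \operatorname{Var}(e) = R V_R
\]
% y      : vector of phenotypic observations
% \beta  : fixed effects other than marker or population effects
% \alpha : marker (SNP) effects, with incidence matrix S
% v      : population effects, related to y through the Q matrix
% u      : polygenic background effects, with covariance set by the kinship matrix K
% e      : residual effects
```

The Q term absorbs differences between sub-populations and the K term absorbs finer relatedness within them, which is why the model handles several levels of relatedness at once.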
Population Stratification
A population under study may have sub-populations, which may lead to
• Spurious association.
• Loss of power to detect real association.
EIGENSTRAT (Price et al. 2006 Nat. Genet.) uses principal components to
extract information on stratification and adjust for the stratification in
association analysis.

Mixed Population = Sub-population 1 + Sub-population 2
Group | Mixed: A | Mixed: a | Sub-pop 1: A | Sub-pop 1: a | Sub-pop 2: A | Sub-pop 2: a
Case | 70 | 80 | 10 | 40 | 60 | 40
Control | 50 | 100 | 20 | 80 | 30 | 20
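The counts above show how pooling creates a spurious signal. The sketch below computes an uncorrected Pearson chi-square for each 2x2 table; the function name is hypothetical, and 3.84 is the usual 5% critical value for one degree of freedom.

```python
def pearson_chi2(a, b, c, d):
    """Pearson chi-square (no Yates correction) for the 2x2 table
    [[a, b], [c, d]]; assumes no row or column total is zero."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: case, control. Columns: allele A, allele a.
mixed = pearson_chi2(70, 80, 50, 100)
sub1 = pearson_chi2(10, 40, 20, 80)
sub2 = pearson_chi2(60, 40, 30, 20)

print(round(mixed, 2), sub1, sub2)  # prints: 5.56 0.0 0.0
```

The pooled table exceeds 3.84 even though neither sub-population shows any association: allele A is simply more frequent in sub-population 2, which also contributes most of the cases.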
Nested Association Mapping
• A method for joint QTL linkage and association mapping in a set of RIL populations derived from matings to a common parent
• Dense genotyping of SNPs is performed only in the parents
• Only common parent-specific SNPs are genotyped in the RILs and are used to identify the parental origin of chromosomal segments, allowing projection of sequence information from parents to RILs
• LD information from ancient recombination is thus captured, allowing for high resolution mapping with far less genotyping effort
Power of Association Mapping
Decisive Factors
• Extent and evolution of LD in the population (mode of pollination)
• Complexity and mode of gene action for the trait of interest
• Sample size
• Preselecting a priori known QTLs or candidate genes
• Samples with longer LD blocks
• Availability of pedigree and genomic information and resources
• Quality of phenotypic data
• Association mapping panel constructed
• Efficiency of targeted gene sequencing (genotypic data)
LD and AM studies in various plant species
S. No. | Crop | Population | LD Extent | Traits
1 | Rice | Diverse land races and accessions | 5-500 kb; 20-30 cM; 50-225 cM | Glutinous phenotype, starch quality, yield and its components
2 | Wheat | Diverse cultivars | <1-10 cM; 10-20 cM | Kernel size and milling quality; HMW gluten and blotch resistance
3 | Barley | Diverse cultivars | 10-50 cM; 98-500 kb; 300 bp | Leaf rust, yellow dwarf virus and mildew resistance, rachilla length, lodicule size
4 | Maize | Land races, diverse and elite inbred lines | 3-500 kb; 4-41 cM | Flowering time, kernel composition, oleic acid content, forage quality, sweet taste
5 | Sugarcane | Diverse clones | 10 cM | Disease resistance
Common Statistical Software Packages for Association Mapping
S. No. | Software Package | Focus | Remarks
1 | TASSEL | Association analysis | Free; LD statistics, sequence analysis, AM by GLM and MLM
2 | STRUCTURE | Population structure | Free; widely used for population structure (PS) analysis
3 | SPAGeDi | Relative kinship | Free; genetic relationship analysis
4 | EIGENSTRAT | PCA, association analysis | Free; PCA was proposed as an alternative for population structure analysis
5 | MTDFREML | Mixed model | Free; can be used for plant data
6 | ASREML | Mixed model | Commercial; can be used for plant data
Two major types: genome-wide screen and candidate gene

Genome-wide screen
• Hypothesis-free
• High-cost: large genotyping requirements
• Multiple-testing issues
• Possibly many false positives, fewer misses

Candidate gene
• Hypothesis-driven
• Low-cost: small genotyping requirements
• Multiple-testing less important
• Possibly many misses, fewer false positives
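To give a sense of scale for the multiple-testing issue, the sketch below applies a Bonferroni correction, one common (and conservative) choice that the slides do not name. The family-wise error rate of 0.05 and the marker counts are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise
    error rate at alpha across n_tests independent tests."""
    return alpha / n_tests

# Candidate gene study (assumed 20 SNPs) versus genome-wide screens.
for n_markers in (20, 10**5, 10**6):
    print(n_markers, bonferroni_threshold(0.05, n_markers))
```

A genome-wide screen with a million SNPs needs a per-marker p-value near 5e-8, while a candidate-gene study with a few dozen SNPs can use a far less stringent cut-off.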