
Articles

Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice
Kenji Yano1, Eiji Yamamoto2, Koichiro Aya1, Hideyuki Takeuchi1, Pei-ching Lo1, Li Hu1, Masanori Yamasaki3,
Shinya Yoshida4, Hidemi Kitano1, Ko Hirano1 & Makoto Matsuoka1
© 2016 Nature America, Inc. All rights reserved.

A genome-wide association study (GWAS) can be a powerful tool for the identification of genes associated with agronomic traits
in crop species, but it is often hindered by population structure and the large extent of linkage disequilibrium. In this study, we
identified agronomically important genes in rice using GWAS based on whole-genome sequencing, followed by the screening of
candidate genes based on the estimated effect of nucleotide polymorphisms. Using this approach, we identified four new genes
associated with agronomic traits. Some genes were undetectable by standard SNP analysis, but we detected them using gene-
based association analysis. This study provides fundamental insights relevant to the rapid identification of genes associated with
agronomic traits using GWAS and will accelerate future efforts aimed at crop improvement.

Crop improvement through genetics and breeding is essential for increasing yields and feeding a growing world population1. The identification and characterization of genes associated with agronomically important traits is indispensable for both understanding the genetic basis of phenotypic variation and efficient crop improvement. Identification of genes underlying agronomic quantitative trait loci (QTLs) is usually performed using biparental mapping populations, such as F2 and recombinant inbred lines2,3. However, these strategies have limitations such as low mapping resolution and limited genetic diversity between the mapping population parents. For example, only two allelic variations are analyzed (one per parent) in a biparental population, which means that various alleles occurring in other plants are missed. A GWAS that analyzes associations between nucleotide polymorphisms and phenotypic variance using a diverse population set is a powerful tool for the identification of genes associated with agronomic traits, because this strategy can be used to detect many natural allelic variations simultaneously in a single study4–6. Recent advances in high-throughput sequencing technologies have enabled rapid and accurate resequencing of a large number of genomes7–15. This is expected to revolutionize GWAS because the approach will not only prevent ascertainment bias in studies with limited nucleotide polymorphisms but also facilitate more direct searches for the causal variant(s) underlying phenotypic diversity16,17.

GWAS of numerous crop species have detected many QTLs associated with agronomic traits9,12,18–21. However, it is often difficult to identify unknown genes associated with the QTLs owing to two major reasons4–6. The first is that diversity panels of crop species often represent strong population structure that generates spurious associations between the phenotype and unlinked markers. Although several statistically robust models have been developed, it must be noted that false positives arising from population structure in crops may not be completely controlled4–6. To address this problem, a reconstruction of population has been proposed, which is represented by a ‘nested association mapping population’22–25 and a ‘multiparent advanced generation intercrossing population’26–28. The second reason for the difficulty in identifying unknown genes is the large extent of linkage disequilibrium (LD). The extent of LD in the human genome is generally smaller than gene size29, whereas LD in plants often ranges over several hundred kilobases, especially in self-pollinating crops such as rice and soybean30. This results in the inclusion of many candidate genes in a single LD block exhibiting a significant signal, thus entailing the need for additional experiments to conclusively identify the causal gene(s).

In this study, we attempted to identify genes associated with agronomic traits using GWAS with minimum additional experiments (i.e., only transgenic complementation tests). We focused on rice (Oryza sativa L.), which feeds more than three billion people in the world31. Furthermore, rice is a good candidate because of its extremely strong population structure and the large extent of LD owing to self-pollination19. To perform GWAS efficiently, we carefully selected a population with low population structure but large phenotypic diversity. By performing whole-genome sequencing, we refined the number of candidate genes based on the estimated functional importance of each nucleotide polymorphism. We demonstrated that with careful

1Bioscience and Biotechnology Center, Nagoya University, Nagoya, Japan. 2NARO Institute of Vegetable and Tea Science, Tsu, Japan. 3Food Resources Education and
Research Center, Graduate School of Agricultural Science, Kobe University, Kasai, Hyogo, Japan. 4Hyogo Prefectural Research Center for Agriculture, Forestry and
Fisheries, Kasai, Hyogo, Japan. Correspondence should be addressed to M.M. (makoto@agr.nagoya-u.ac.jp) and E.Y. (yame@affrc.go.jp).

Received 2 February; accepted 26 May; published online 20 June 2016; doi:10.1038/ng.3596

Nature Genetics  VOLUME 48 | NUMBER 8 | AUGUST 2016 927



experimental design, GWAS could be a powerful tool for the rapid identification of genes associated with agronomic traits.

RESULTS
Characterization of a population used in this study
To perform GWAS efficiently, it is important to select a population that is not genetically highly structured and interrelated yet exhibits high phenotypic diversity. We focused on 176 japonica rice varieties developed in breeding programs conducted in Japan (Supplementary Table 1 and Supplementary Fig. 1) for the following two reasons. First, according to the pedigree information on these varieties (the rice cultivar database; see URLs), they are genetically interrelated to each other, and we expected them not to represent a highly structured population (see below). Second, although we selected the varieties from restricted pedigrees, the phenotypic diversity of agronomic traits was almost comparable to that of the global collection of diverse rice reported previously19 (Table 1, Fig. 1a–d and Supplementary Table 2). This suggested that the varieties contain various QTLs associated with agronomic traits.

Table 1  Comparison of coefficient of variation for phenotypic values between the 176 Japanese varieties used in this study and the rice diversity panel reported in ref. 19

                                     Coefficient of variation
                        n     Year   Days to heading   Plant height   Panicle length   Leaf blade width
Varieties tested        176   2013   11.93             19.42          10.98            15.66
                              2014    9.27             15.28          11.43            12.58
Rice diversity panel    413          11.86             18.09          14.51            20.16

To obtain nucleotide polymorphism information, we performed whole-genome sequencing of the 176 varieties, and obtained a total of 383.8 Gb of sequence, with an average depth of 5.8× and coverage of 91.2% of the reference genome32 (Supplementary Table 3). After removing nucleotide polymorphisms with missing rates ≥ 0.25 and minor allele frequency < 0.05, we generated a final set of 426,337 SNPs and 67,544 insertion-deletions (indels). Among the nucleotide polymorphisms, 43,323 SNPs induced nonsynonymous substitutions, whereas 1,678 and 1,656 indels induced frameshift and non-frameshift mutations, respectively (Supplementary Table 4).

Using these nucleotide polymorphisms, we performed principal component analysis (PCA) to quantify the population structure of these 176 varieties (Fig. 1e). The score plot of principal components showed continuous distribution without any distinct clusters, indicating that the varieties we selected did not represent a highly structured population. The extent of LD is another important factor determining the efficiency of GWAS4–6. The decay of LD with physical distance between SNPs occurred at 445 kb (r² = 0.2), which is comparable to that of a previous study that used a more genetically diverse population19 (Supplementary Fig. 2). This indicated that the use of these varieties had no substantial disadvantage as compared to that of the other sets of japonica germplasm, with respect to LD. Nevertheless, the extent of LD was still problematic because it resulted in the inclusion of tens to hundreds of candidate genes within an LD block, a problem that is addressed below.

Validation of rapid gene identification using GWAS
Using a linear mixed model with correction of kinship bias (Supplementary Fig. 3), we performed GWAS for heading date (i.e., days to heading) to assess the potential of our GWAS design for causal gene identification. Heading date is a suitable trait for this purpose because many genes with natural allelic variations have been identified in previous studies33. Using the linear mixed model, we consistently detected 26 loci exceeding a significance threshold (−log10 P ≥ 4.77) in both years of experiment performed in the present study (Supplementary Table 5). We focused on five of these loci, including the top three peaks located on chromosomes 1, 6 and 11, and two peaks with positions correlating with known heading date (Hd) genes Hd6 and Hd2 (Fig. 2a).

At the terminal end of the long arm of chromosome 3, there was a peak that mapped close to Hd6 (ref. 34) (Fig. 2a). We estimated a candidate region from 30.50 Mb to 31.76 Mb (1,266 kb) by using pairwise LD correlations (r² ≥ 0.6) (Supplementary Fig. 4a). We classified all the polymorphisms in the candidate region into five groups. Group I included polymorphisms that were significantly associated with trait variation in the GWAS (−log10 P ≥ 4.77), and predicted to induce amino acid exchange or to change splicing junctions (GT or AG at the start or end of intron, respectively). Group II included polymorphisms that were significantly associated with trait variation, as in group I, and were located in the 5′ flanking sequences of genes (≤2 kb from the first ATG, for example, promoter region). Group III included polymorphisms that were significantly associated with trait variation, as in group I, and located within a gene but did not meet the criteria for group I or II (for example, located in a coding region but not predicted to change an amino acid, an intron or a 3′ noncoding sequence). Group IV included polymorphisms that were significantly associated with trait variation, as in group I, and located outside coding regions. Group V included polymorphisms not significantly associated with trait variation. Although the candidate region on chromosome 3 contained 297 polymorphisms, we assigned
only one polymorphism to group I (Supplementary Fig. 5), whereas we assigned six polymorphisms to group II and three to group III.

Figure 1  Phenotypic diversity and genetic structure of the Japanese rice varieties. (a–d) Histograms of zero mean normalized phenotypic values of days to heading (a), plant height (b), panicle length (c) and leaf blade width (d). Yellow and gray bars represent the 176 Japanese rice varieties used in this study (data from phenotyping performed in 2013) and the diversity panel reported in ref. 19, respectively. (e) PCA for the 176 Japanese rice varieties based on whole-genome sequence data. PC1 and PC2 indicate score of principal components 1 and 2, respectively. Values in parentheses indicate percentage of variance in the data explained by each principal component.
[Figure 1 panels. a–d: number of plants versus normalized time to heading (d), plant height (cm), panicle length (cm) and leaf blade width (cm). e: PC1 (8.5%) versus PC2 (6.4%).]

Figure 2  GWAS for days to heading and identification of the causal gene for the peak on chromosome 1. (a) Manhattan plot for days to heading. Dashed line represents the significance threshold (−log10 P = 4.77). Arrowheads indicate the position of strong peaks that did not localize with the known Hd genes investigated in this study. (b) Local Manhattan plot (top) and LD heatmap (bottom) surrounding the peak on chromosome 1. Arrow indicates the position of nucleotide variation in LOC_Os01g62780. Dashed lines indicate the candidate region for the peak. (c) Exon-intron structure of LOC_Os01g62780 and DNA polymorphism in that gene. (d) Boxplots for days to heading based on the haplotypes (Hap.) for LOC_Os01g62780 in 2013 (left) and 2014 (right). Box edges represent the 0.25 quantile and 0.75 quantile with the median values shown by bold lines. Whiskers extend to data no more than 1.5 times the interquartile range, and remaining data are indicated by dots. Differences between the haplotypes were analyzed by Welch’s t-test. (e) Image of transgenic plants transformed with empty vector (VEC), haplotype A (Hap. A) and haplotype B (Hap. B). Red arrows indicate panicle exsertion. Scale bar, 15 cm. (f) Days to heading of the transgenic plants. Error bars, s.d. (n = 20). **P < 0.01; n.s., not significant (Welch’s t-test).
[Figure 2 panels. c: LOC_Os01g62780, chromosome 1 (+ strand), position +328; Hap. A, GTT (V); Hap. B, ATT (I). d: P = 4.20 × 10^−17 (2013) and P = 9.39 × 10^−18 (2014); Hap. A, n = 109; Hap. B, n = 67.]
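The five-group classification of polymorphisms used for candidate regions in this study can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the `Polymorphism` record and its field names are hypothetical, the annotation flags are assumed to come from an upstream variant-effect step, and only the −log10 P threshold of 4.77 is taken from the text.

```python
from dataclasses import dataclass

# Threshold for days to heading quoted in the text; the data model below is
# a hypothetical one introduced only for this illustration.
SIGNIFICANCE_THRESHOLD = 4.77  # -log10(P)


@dataclass
class Polymorphism:
    neg_log10_p: float         # association strength from the GWAS, -log10(P)
    changes_amino_acid: bool   # predicted to alter the encoded protein
    changes_splice_site: bool  # alters GT/AG at an intron boundary
    in_promoter: bool          # within 2 kb upstream of the first ATG
    in_gene: bool              # inside a gene (coding, intron or 3' noncoding)


def classify(p: Polymorphism, threshold: float = SIGNIFICANCE_THRESHOLD) -> str:
    """Return the group label (I to V) for one polymorphism."""
    if p.neg_log10_p < threshold:
        return "V"    # not significantly associated with the trait
    if p.changes_amino_acid or p.changes_splice_site:
        return "I"    # significant and predicted to affect the protein
    if p.in_promoter:
        return "II"   # significant, 5' flanking sequence
    if p.in_gene:
        return "III"  # significant, in a gene, but not group I or II
    return "IV"       # significant, outside genes


if __name__ == "__main__":
    example = Polymorphism(6.2, True, False, False, True)
    print(classify(example))
```

Checking significance first keeps the remaining groups mutually exclusive, so each polymorphism receives exactly one label and group I candidates can be counted per gene.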
The polymorphism in group I induced a premature stop codon in the coding sequence of LOC_Os03g55389, which is identical to Hd6 (ref. 34) (Supplementary Fig. 4b). The varieties carrying haplotype A (we refer to the haplotype corresponding to the reference genome as the ‘A’ type and the other as ‘B’) showed an earlier heading date than varieties carrying haplotype B (Supplementary Fig. 4c), which is consistent with the previous study on Hd6 (ref. 34).

Similarly, we analyzed a peak on chromosome 7, which mapped close to Hd2 (ref. 35) (Supplementary Fig. 6a). The candidate region was predicted to map from 29.35 to 29.70 Mb (346 kb), and contained 124 polymorphisms (Supplementary Fig. 7). Among these, there was only one polymorphism assigned to group I, which was located on LOC_Os07g49460 and was identical to Hd2 (ref. 35). The varieties contained three haplotypes (Supplementary Fig. 6b). Varieties carrying haplotype C, which is identical to the previously identified natural allelic variation of Hd2 (ref. 35), showed an earlier heading date than those carrying haplotype A or B (Supplementary Fig. 6c). This indicated that Hd2 is a gene responsible for this peak. The identification of Hd6 and Hd2 indicated that GWAS using whole-genome sequencing could be used to efficiently identify genes associated with agronomic traits. We discuss the second highest peak near Hd1 on chromosome 6 below.

Identification of new genes
We applied the same strategy to loci associated with heading date that have not been reported previously (i.e., those on chromosomes 1 and 11) (Fig. 2a). With respect to the locus on chromosome 1, the candidate region was predicted to map from 36.30 to 36.65 Mb (346 kb) (Fig. 2b), and included 91 polymorphisms (Supplementary Fig. 8). There were eight group I polymorphisms, which were located within seven genes, and all of these were annotated as transposon-related genes except for LOC_Os01g62780, a homolog of Arabidopsis thaliana HEN1 suppressor 1 (ref. 36) (AT2G39740; HESO1) (Fig. 2c). Arabidopsis HESO1 functions as a suppressor of the hen1 mutation and exhibits pleiotropic phenotypes including delayed flowering37, whereas the function of its rice homolog was unknown. The polymorphism on LOC_Os01g62780 induced an amino acid exchange in the region conserved among homologs in other plant species (Supplementary Fig. 9). The residue of haplotype A (valine) is identical to that in Arabidopsis HESO1, whereas the residue of haplotype B (isoleucine) is identical to all other homologs compared in this analysis. O. rufipogon (ancestral wild species of cultivated rice) and indica (another major subspecies in cultivated rice) contained isoleucine at this position, suggesting that haplotype B is the original haplotype in Oryza species (Supplementary Fig. 9a). Varieties carrying haplotype B showed a later heading date than those carrying haplotype A (Fig. 2d). To confirm the effect of this gene on heading date, we introduced the entire genome sequence of haplotype A and B into Nipponbare carrying haplotype A. The plants transformed with haplotype A showed no phenotypic change, whereas plants transformed with haplotype B showed later heading date when compared to the vector control and haplotype A plants (Fig. 2e,f), which is consistent with the results in Figure 2d. These observations clearly demonstrated that LOC_Os01g62780 is the causal gene for the peak signal of days to heading on chromosome 1.

We then focused on the third highest peak on chromosome 11 (Fig. 2a). That locus exhibited pleiotropic associations with plant height and panicle length (Fig. 3a,b). We estimated the candidate region to be 4.33–4.79 Mb (463 kb) (Fig. 3c), and it contained 4,125 polymorphisms (Supplementary Fig. 10). We assigned 107 polymorphisms to group I, and these mapped to 24 genes. Most of these genes (18 of 24) were annotated as either transposon-related, members of the DnaK family or expressed protein, whereas the remaining six were annotated as enzymes (five) or transcription factors (one). We focused on LOC_Os11g08410, which is annotated as a GATA zinc finger-type transcription factor. There were three haplotypes for LOC_Os11g08410 (Fig. 3d). Haplotypes B and C contained one and ten polymorphisms, respectively (Fig. 3d), whereas varieties carrying haplotype C showed a later heading date, greater plant height and larger panicle length than plants carrying haplotype A or haplotype B

(Fig. 3e). Although the amino acid differences in haplotypes A and C were distributed throughout the gene, it was difficult to predict the effect of these differences on function because they were located in nonconserved regions (Supplementary Fig. 11). To examine whether this gene is causal for the peak on chromosome 11, we introduced the genomic sequences of haplotypes A or C into Nipponbare carrying haplotype A. The plants transformed with haplotype C exhibited delayed heading, increased plant height and larger panicle length (Fig. 3f–h and Supplementary Fig. 12), which corresponds to the phenotypic variations presented in Figure 3e. These results demonstrated that LOC_Os11g08410 is a causal gene for variations in these three agronomic traits and demonstrated the utility of GWAS for identifying QTLs and genes with pleiotropic effects.

Figure 3  GWAS for plant height and panicle length, and identification of the causal gene for the peak on chromosome 11. (a,b) Manhattan plots for plant height (a) and panicle length (b). Arrowheads indicate the position of strong peaks investigated in this study. Dashed lines represent significance thresholds (−log10 P = 3.67 in a and −log10 P = 5.30 in b). (c) Local Manhattan plot (top) and LD heatmap (bottom) surrounding the peak on chromosome 11. Arrow indicates position of nucleotide variations in LOC_Os11g08410. Dashed lines indicate the candidate region for the peak. (d) Exon structure of LOC_Os11g08410 and DNA polymorphisms in that gene. Red- and gray-shaded regions indicate nucleotide variation significantly (−log10 P ≥ 4.77, 3.67 and 5.30 for days to heading, plant height and panicle length, respectively) and not significantly associated with phenotypic variation, respectively. in, insertion; del, deletion. (e) Days to heading, plant height and panicle length for indicated haplotypes for LOC_Os11g08410. Data are presented as in Figure 2d. Differences between the haplotypes were statistically analyzed based on Tukey’s test (*P < 0.05; n.s., not significant). (f–h) Days to heading (f), plant height (g) and panicle length (h) for transgenic plants transformed with empty vector (VEC), haplotype A (Hap. A) and haplotype C (Hap. C). Error bars, s.d. (n = 20). *P < 0.05, **P < 0.01; n.s., not significant (Welch’s t-test).
[Figure 3 panels. d: LOC_Os11g08410, chromosome 11 (− strand); Hap. B differs from Hap. A by one substitution, GTC (V) to ATC (I); Hap. C carries multiple substitutions together with an 18-bp deletion, an 18-bp insertion and a 6-bp deletion. e: data from 2013; Hap. A, n = 27; Hap. B, n = 106; Hap. C, n = 43.]

The successful identification of genes associated with heading date encouraged us to investigate genes associated with other agronomic traits. On chromosome 4, we found a strong peak for panicle number per plant (Fig. 4a), which overlapped with peaks for spikelet number per panicle and leaf blade width (Fig. 4b,c), suggesting that a single gene might have pleiotropic effects on these agronomically important traits. We predicted the candidate region to be from 31.05 Mb to 31.41 Mb (358 kb), and it included 339 polymorphisms (Fig. 4d and Supplementary Fig. 13). We assigned four polymorphisms to group I, and they mapped to four genes. Among these, LOC_Os04g52479, which encodes NARROW LEAF 1 (NAL1) (Fig. 4e and Supplementary Fig. 14), has been previously reported to control panicle size and flag leaf width, but its impact on panicle number is less understood38,39. There were two haplotypes with distinct phenotypes: varieties containing haplotype B produced more panicles per plant, smaller spikelet numbers and narrower leaf blades than varieties containing haplotype A (Fig. 4f). We introduced the cDNA sequence of NAL1 from haplotype A and B under the control of a constitutive promoter into Nipponbare carrying haplotype A. The plants overexpressing NAL1 from haplotype A showed decreased panicle numbers per plant and increased leaf blade width relative to plants carrying the control vector, whereas plants transformed with haplotype B showed no clear difference compared to controls (Fig. 4g–i and Supplementary Fig. 15). Although NAL1 is known to be associated with several agronomic traits, to our knowledge this is the first report to document the effect of this gene on panicle number.

Misleading association owing to allelic heterogeneity
Here we return to the second highest peak for days to heading on chromosome 6 (Fig. 2a). The peak was located close to Hd1 (ref. 40), which contributes to genetic diversity in heading date in the rice breeding program41,42. Thus, we hypothesized that the causal gene of the significant peak was Hd1. Using the approach described above, we predicted the candidate region to be 7.81–8.38 Mb (564 kb) (Fig. 5a). The candidate region did not include Hd1 (LOC_Os06g16370), but Hd1 was localized in the next LD block (9.22–9.61 Mb), which showed signals with lower −log10 P values than those in the primary candidate region (Fig. 5a). We regarded this discrepancy as an important issue for the efficacy of GWAS-based causal gene identification, which has been noted previously18,43,44; consequently, we investigated its precise mechanism.

First, we surveyed why the polymorphisms in Hd1 were not statistically significant. In the population we used, there were eleven haplotypes for Hd1 (haplotypes A–K), which included the null and intermediate alleles previously reported41 (Supplementary Figs. 16 and 17). In a standard GWAS, phenotypic distribution is independently compared in each nucleotide polymorphism site. Therefore, if a gene includes allelic heterogeneity such as the case with Hd1, it

is difficult to achieve statistical significance because an allele-specific polymorphism is compared with all other alleles, including null, intermediate and fully functional alleles (Supplementary Fig. 16). Next, we surveyed why we observed a strong peak in the LD block lacking Hd1. To illustrate the structure of this genome region schematically, we constructed a graphical genotype in which major and minor polymorphisms in each site were represented in blue and orange, respectively (Online Methods) (Fig. 5b). We classified the varieties into three groups based on Hd1 function, namely, haplotypes B and D (functional alleles), haplotype A (intermediate) and haplotypes E and F (null) (Fig. 5c), whereas in this analysis we ignored other varieties containing minor haplotypes. The varieties in each group showed a similar genome structure (Fig. 5b). This comparison clearly indicated that the LD block including the highest peak corresponds well to Hd1 function, i.e., functional (haplotypes A, B and D) and nonfunctional (haplotypes E and F), whereas the LD block including Hd1 showed no clear difference (Fig. 5b). Finally, we confirmed the phenotypic distribution based on polymorphisms in the highest peak (C or A at position 8267669 of chromosome 6) and found that the average phenotypic value was clearly different between polymorphism groups (Supplementary Fig. 17b). This represents a theoretically anticipated situation that is designated synthetic association or indirect association45,46. When whole-genome sequence data are available, another strategy for GWAS is to use a gene-based association analysis that uses allelic differences as markers to analyze the association between phenotypic variation and alleles47.

Figure 4  GWAS for panicle number per plant, spikelet number per panicle and leaf blade width, and identification of the causal gene for the peak on chromosome 4. (a–c) Manhattan plots for panicle number per plant (a), spikelet number per panicle (b) and leaf blade width (c). Arrowheads indicate the position of strong peaks investigated in this study. Dashed lines represent significance threshold (−log10 P = 5.39 in a, 4.60 in b and 5.50 in c). (d) Local Manhattan plot (top) and LD heatmap (bottom) surrounding the peak on chromosome 4. Arrow indicates the position of nucleotide variation in LOC_Os04g52479. Dashed lines indicate the candidate region for the peak. (e) Exon-intron structure of LOC_Os04g52479 and DNA polymorphism in this gene. (f) Panicle number per plant, spikelet number per panicle and leaf blade width for the indicated haplotypes of LOC_Os04g52479. Data are presented as in Figure 2d. Differences between the haplotypes were statistically analyzed based on Welch’s t-test. (g–i) Expression of LOC_Os04g52479 (NAL1) in the transgenic plants (g). Panicle number per plant (h) and leaf blade width (i) for transgenic plants transformed with empty vector (VEC), overexpression of haplotype A (UBQ::Hap.A) and haplotype B (UBQ::Hap.B). UBQ indicates maize ubiquitin promoter that was used for the overexpression of LOC_Os04g52479. Error bars, s.d. (n = 12). **P < 0.01; n.s., not significant (Welch’s t-test).
[Figure 4 panels. e: LOC_Os04g52479 (NAL1), chromosome 4 (+ strand), position +697; Hap. A, CAT (H); Hap. B, CGT (R). f: data from 2013; Hap. A, n = 136; Hap. B, n = 40.]

We assumed that this strategy would be effective for dealing with the above-mentioned problem. Although no gene exceeded a stringent significance threshold in the gene-based association analysis of heading date (Fig. 5d,e and Supplementary Fig. 18), it has been reported that top-ranking genes often include plausible candidates even if they are under the significance threshold48,49. On chromosome 6, some genes in the 9.22–9.61 Mb region, which included Hd1 itself, formed a higher peak than genes in the 7.81–8.38 Mb region (Fig. 5e). Thus, we changed the candidate region and genes to the more plausible LD block. In addition, Hd1 was included among the top-ranking genes, as were the other heading date genes identified in the present study (Fig. 5f and Supplementary Table 6). These results indicate that gene-based association analysis might be effective to deal with misleading associations in single-polymorphism-based association analysis.

Owing to the result with Hd1, we resurveyed the phenotypes by using gene-based association analysis and found that the strategy is also efficient for identifying causal genes (Supplementary Fig. 19 and Supplementary Tables 6–12). Here we especially focused on awn length. In the analysis of awn length by single-polymorphism-based GWAS, we detected a strong peak (−log10 P = 24.67) on chromosome 8 (Fig. 6a). However, we could not find good candidate genes in the candidate region (23.62–23.82 Mb) (Fig. 6b and Supplementary Table 13). According to the pedigree information, the loss of awns occurred at least twice in the varieties used here, suggesting that the causal gene of the phenotype is multiallelic, as was the case for Hd1. Therefore, we applied gene-based association analysis to this trait, and found a strong peak close to but outside the 23.62–23.82 Mb region (Fig. 6c,d and Supplementary Fig. 20). Among the genes forming the peak, LOC_Os08g37890, which showed the sixth highest −log10 P value (Fig. 6e and Supplementary Table 12), encodes a member of epidermal patterning factor-like protein (EPFL) (Fig. 6f). In Arabidopsis, one member of EPFL is involved in cell division in the process of stomata formation50, although the roles of this gene family

Figure 5  Analyses of the peak for days to heading on chromosome 6. (a) Local Manhattan plot (top) and LD heatmap (bottom) surrounding the peak on chromosome 6. Arrow indicates the position of nucleotide variations in Hd1. (b) Schematic representation of the genome structure of the region in a. Major and minor alleles on each polymorphic site are represented in blue and orange, respectively (Online Methods). The varieties were divided based on Hd1 function as follows: haplotype (Hap.) A (intermediate), Hap. B and D (functional), and Hap. E and F (null). Arrowhead indicates position of the nucleotide polymorphisms in Hd1, which are not reflected in the graphical genotype owing to the effect of surrounding nucleotide polymorphisms (Online Methods). (c) Exon-intron structure of Hd1 and DNA polymorphisms in Hd1. (d) Manhattan plot of gene-based association analysis for days to heading. Dashed line represents a 0.2 false discovery rate (−log10 P = 3.02). (e) Local Manhattan plot of gene-based association analysis surrounding the peak on chromosome 6. Arrow indicates the position of Hd1. (f) Plot of −log10 P values of each marker. The markers were arranged in the descending order of −log10 P values. Arrows indicate positions of genes identified in this study.
[Figure 5 panels. c: LOC_Os06g16370 (Hd1), chromosome 6 (+ strand), positions +328 and +744; Hap. A, 36-bp deletion at +328; Hap. B and D, no deletion; Hap. E and F, 43-bp deletion at +744. f: −log10 P and rank among markers: Hd2, 4.84 (4th); LOC_Os11g08410, 4.67 (7th); Hd6, 4.37 (12th); LOC_Os01g62780, 3.85 (20th); Hd1, 3.78 (22nd).]

in rice are largely unknown. The varieties in this study had three haplotypes, A, B and C (Fig. 6f). The alignment of LOC_Os08g37890 and its homologs indicated that haplotype C might encode the entire region of EPFL, whereas haplotype A (the Nipponbare haplotype) produces a truncated protein owing to a 4-bp deletion in its coding region. Furthermore, haplotype B contained a 6-bp deletion in the region encoding two highly conserved amino acids (Supplementary Fig. 21a–c). Consequently, haplotype C could be a functional allele, whereas haplotypes A and B were predicted to be loss-of-function alleles. This prediction corresponded well to the phenotypic variation in these varieties (Fig. 6g); i.e., varieties carrying haplotype A or haplotype B did not form awns, whereas the varieties with haplotype C developed awns. When we examined the genome structure of this region, the pattern in the LD block corresponded to the predicted function of LOC_Os08g37890 (Fig. 6h). These observations suggested that the strong peaks were spurious signals, as in the case of Hd1. To confirm that LOC_Os08g37890 is a causal gene for the awnless phenotype, we introduced the entire genomic sequence of LOC_Os08g37890 from haplotype B or haplotype C into Nipponbare (haplotype A). As expected, the plants transformed with haplotype C (functional allele) formed long awns, whereas plants carrying haplotype B or the vector control had the awnless phenotype (Fig. 6i and Supplementary Fig. 21d–f). These results indicated that LOC_Os08g37890 is a causal gene for the awnless phenotype and demonstrated that gene-based association analysis is an efficient method to deal with spurious signals caused by allelic heterogeneity.

DISCUSSION
To assess the potential of GWAS for the rapid identification of new genes associated with agronomic traits, we designed an experimental process that includes the selection of a population without strong structure and whole-genome sequencing. The process includes: (i) the detection of significant signals through association analysis with individual nucleotide polymorphisms; (ii) the definition of candidate regions with significant signals based on LD; (iii) the extraction of causal genes based on the function of polymorphisms and on annotation information, including that of Arabidopsis homologs, as in the case of LOC_Os01g62780; and (iv) transformation of a gain-of-function haplotype into plants containing a loss-of-function haplotype. As expected, we identified scores of candidate genes in step ii. However, at step iii, one to several candidates were identified in most of the cases. As a consequence, although a large LD results in inclusion of many polymorphisms in a candidate region, our results indicated that functionally meaningful polymorphisms were less numerous and could be evaluated by genetic transformation.
In the analysis of Hd1, we identified a spurious association caused by allelic heterogeneity and complex genome structure (Fig. 5). Even with elaborate statistical methods, this is considered a serious problem for GWAS46,51,52. In this study, we demonstrated that gene-based association analysis, which was facilitated by whole-genome sequencing, is an effective approach for dealing with spurious associations (Figs. 5 and 6). In crop domestication and breeding, it is not uncommon for alleles of the same gene to be selected independently, which results in allelic heterogeneity of agronomically important genes41,53,54. This suggests that gene-based association analysis is important for GWAS of agronomic traits. One caveat deserves mention. When the number of functionally different alleles in a gene is two, the functional difference can be explained by a single SNP. Under such a condition, the gene-based association approach results in a decrease in statistical power. We propose the use of both single


[Figure 6 panels not reproduced. Labels recoverable from the panels: (a, c) x axis, chromosome 1–12; y axis, −log10 (P). (b, d, h) x axis, Chr. 8 (Mb), 23.5–24.0; LD scale, r2 from 0 to 1.0. (e) x axis, number of markers (thousands); LOC_Os08g37890, −log10 (P) = 15.73 (6th). (f) LOC_Os08g37890, Chr. 8, + strand; polymorphic sites +236, +238 and +297: Hap. A, AGC (S), no deletion, no insertion; Hap. B, ACC (T), 6-bp deletion, 4-bp insertion; Hap. C, AGC (S), no deletion, 4-bp insertion. (g) y axis, awn length (cm), for 2013 and 2014; Hap. A (n = 140), Hap. B (n = 21), Hap. C (n = 15). (i) VEC, Hap. B, Hap. C.]

Figure 6  GWAS for awn length and identification of the causal gene for the peak on chromosome 8. (a) Manhattan plot of single-polymorphism-based association analysis. Dashed line represents a significance threshold (−log10 P = 4.21). (b) Local Manhattan plot of single-polymorphism-based association (top) and LD heatmap (bottom) surrounding the peak on chromosome 8. Arrow indicates the position of nucleotide variations in LOC_Os08g37890. Dashed lines indicate the candidate region for the peak. (c) Manhattan plot of gene-based association analysis. Dashed line represents a 0.2 false discovery rate (−log10 P = 2.28). Arrowheads in a and c indicate the position of strong peaks investigated in this study. (d) Local Manhattan plot of gene-based association analysis surrounding the peak on chromosome 8. The arrow indicates the position of LOC_Os08g37890. (e) −log10 P values of each marker, arranged in descending order of −log10 P values. Arrow indicates the position of LOC_Os08g37890. (f) Exon-intron structure of LOC_Os08g37890 and DNA polymorphisms in that gene. ins, insertion; del, deletion. (g) Awn length based on the haplotypes for LOC_Os08g37890 in 2013 (left) and 2014 (right). Data are presented as shown in Figure 2d. Differences between the haplotypes were statistically analyzed based on Tukey’s test (*P < 0.05; n.s., not significant). (h) Schematic representation of the genome structure of the indicated region of chromosome 8. Major and minor alleles on each polymorphic site are represented in blue and orange, respectively. (i) Awn lengths for transgenic plants transformed with empty vector (VEC), haplotype B (Hap. B) and haplotype C (Hap. C). Scale bar, 15 mm.

polymorphisms and gene-based analysis combined with functional annotation of each nucleotide polymorphism.
All the genes analyzed in the present study included polymorphisms that induced amino acid substitutions. It is reasonable to expect that such polymorphisms have a higher probability of being functionally important. However, polymorphisms in regulatory regions often contribute to phenotypic diversity in agronomic QTLs, such as qSH1, GS5 and GL7 (refs. 55–57). To deal with such cases, the combined use of GWAS with expression profiling data could be effective, as recently reported58. We are currently assessing that approach.

URLs. The rice cultivar database, http://ineweb.narcc.affrc.go.jp/; Rice HapMap Project, http://www.ncgr.ac.cn/RiceHap2/; PHYLIP, http://evolution.genetics.washington.edu/; PLINK, http://pngu.mgh.harvard.edu/~purcell/plink/; ClustalW, http://clustalw.ddbj.nig.ac.jp/; Ensembl Plants, http://plants.ensembl.org/Oryza_rufipogon/; DDBJ Sequence Read Archive, http://trace.ddbj.nig.ac.jp/dra/.

Methods
Methods and any associated references are available in the online version of the paper.

Accession codes. The sequence data have been deposited in the DNA Data Bank of Japan Sequence Read Archive (DRA): DRA004358.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

Acknowledgments
This work was supported by the Japan Society for the Promotion of Science through a Grant in Aid for Scientific Research (A) (26252001), Council for Science, Technology and Innovation (CSTI), Cross-Ministerial Strategic Innovation Promotion Program (SIP), “Technologies for Creating Next-Generation Agriculture, Forestry and Fisheries” (funding agency: Bio-oriented Technology Research Advancement Institution, NARO), and by Grant in Aid for JSPS Fellows Grant Number 16J08722.

AUTHOR CONTRIBUTIONS
K.Y., K.A., H.T., P.-C.L. and L.H. performed the field experiments and analyzed the results. K.Y., K.A. and H.T. performed the genotyping and the genome data analyses. M.Y. and S.Y. prepared the population material. K.Y. produced the constructs and generated and analyzed the transformants. K.Y., E.Y., K.A., H.K., K.H. and M.M. designed the research and wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.


Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Godfray, H.C.J. et al. Food security: the challenge of feeding 9 billion people. Science 327, 812–818 (2010).
2. Miura, K., Ashikari, M. & Matsuoka, M. The role of QTLs in the breeding of high-yielding rice. Trends Plant Sci. 16, 319–326 (2011).
3. Huang, X. & Han, B. Natural variations and genome-wide association studies in crop plants. Annu. Rev. Plant Biol. 65, 531–551 (2014).
4. Myles, S. et al. Association mapping: critical considerations shift from genotyping to experimental design. Plant Cell 21, 2194–2202 (2009).
5. Hamblin, M.T., Buckler, E.S. & Jannink, J.-L. Population genetics of genomics-based crop improvement methods. Trends Genet. 27, 98–106 (2011).
6. Lipka, A.E. et al. From association to prediction: statistical methods for the dissection and selection of complex traits in plants. Curr. Opin. Plant Biol. 24, 110–118 (2015).
7. Huang, X. et al. A map of rice genome variation reveals the origin of cultivated rice. Nature 490, 497–501 (2012).
8. Huang, X., Lu, T. & Han, B. Resequencing rice genomes: an emerging new era of rice genomics. Trends Genet. 29, 225–232 (2013).
9. Jia, G. et al. A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nat. Genet. 45, 957–961 (2013).
10. Mace, E.S. et al. Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum. Nat. Commun. 4, 2320 (2013).
11. Aflitos, S. et al. Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. Plant J. 80, 136–148 (2014).
12. Lin, T. et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 46, 1220–1226 (2014).
13. Hazzouri, K.M. et al. Whole genome re-sequencing of date palms yields insights into diversification of a fruit tree crop. Nat. Commun. 6, 8824 (2015).
14. Huang, X. et al. Genomic analysis of hybrid rice varieties reveals numerous superior alleles that contribute to heterosis. Nat. Commun. 6, 6258 (2015).
15. Zhou, Z. et al. Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat. Biotechnol. 33, 408–414 (2015).
16. Koboldt, D.C., Steinberg, K.M., Larson, D.E., Wilson, R.K. & Mardis, E.R. The next-generation sequencing revolution and its impact on genomics. Cell 155, 27–38 (2013).
17. Ott, J., Wang, J. & Leal, S.M. Genetic linkage analysis in the age of whole-genome sequencing. Nat. Rev. Genet. 16, 275–284 (2015).
18. Huang, X. et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 42, 961–967 (2010).
19. Zhao, K. et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat. Commun. 2, 467 (2011).
20. Huang, X. et al. Genome-wide association study of flowering time and grain yield traits in a worldwide collection of rice germplasm. Nat. Genet. 44, 32–39 (2012).
21. Li, H. et al. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat. Genet. 45, 43–50 (2013).
22. Yu, J., Holland, J.B., McMullen, M.D. & Buckler, E.S. Genetic design and statistical power of nested association mapping in maize. Genetics 178, 539–551 (2008).
23. McMullen, M.D. et al. Genetic properties of the maize nested association mapping population. Science 325, 737–740 (2009).
24. Kump, K.L. et al. Genome-wide association study of quantitative resistance to southern leaf blight in the maize nested association mapping population. Nat. Genet. 43, 163–168 (2011).
25. Tian, F. et al. Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat. Genet. 43, 159–162 (2011).
26. Cavanagh, C., Morell, M., Mackay, I. & Powell, W. From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11, 215–221 (2008).
27. Holland, J.B. MAGIC maize: a new resource for plant genetics. Genome Biol. 16, 163 (2015).
28. Dell’Acqua, M. et al. Genetic properties of the MAGIC maize population: a new platform for high definition QTL mapping in Zea mays. Genome Biol. 16, 167 (2015).
29. Reich, D.E. et al. Linkage disequilibrium in the human genome. Nature 411, 199–204 (2001).
30. Gupta, P.K., Rustgi, S. & Kulwal, P.L. Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant Mol. Biol. 57, 461–485 (2005).
31. Woolston, C. Rice. Nature 514, S49 (2014).
32. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
33. Matsubara, K., Hori, K., Ogiso-Tanaka, E. & Yano, M. Cloning of quantitative trait genes from rice reveals conservation and divergence of photoperiod flowering pathways in Arabidopsis and rice. Front. Plant Sci. 5, 193 (2014).
34. Takahashi, Y., Shomura, A., Sasaki, T. & Yano, M. Hd6, a rice quantitative trait locus involved in photoperiod sensitivity, encodes the alpha subunit of protein kinase CK2. Proc. Natl. Acad. Sci. USA 98, 7922–7927 (2001).
35. Koo, B.H. et al. Natural variation in OsPRR37 regulates heading date and contributes to rice cultivation at a wide range of latitudes. Mol. Plant 6, 1877–1888 (2013).
36. Ren, G., Chen, X. & Yu, B. Uridylation of miRNAs by HEN1 SUPPRESSOR1 in Arabidopsis. Curr. Biol. 22, 695–700 (2012).
37. Chen, X., Liu, J., Cheng, Y. & Jia, D. HEN1 functions pleiotropically in Arabidopsis development and acts in C function in the flower. Development 129, 1085–1094 (2002).
38. Fujita, D. et al. NAL1 allele from a rice landrace greatly increases yield in modern indica cultivars. Proc. Natl. Acad. Sci. USA 110, 20431–20436 (2013).
39. Takai, T. et al. A natural variant of NAL1, selected in high-yield rice breeding programs, pleiotropically increases photosynthesis rate. Sci. Rep. 3, 2149 (2013).
40. Yano, M. et al. Hd1, a major photoperiod sensitivity quantitative trait locus in rice, is closely related to the Arabidopsis flowering time gene CONSTANS. Plant Cell 12, 2473–2484 (2000).
41. Fujino, K. et al. Multiple introgression events surrounding the Hd1 flowering-time gene in cultivated rice, Oryza sativa L. Mol. Genet. Genomics 284, 137–146 (2010).
42. Takahashi, Y. & Shimamoto, K. Heading date 1 (Hd1), an ortholog of Arabidopsis CONSTANS, is a possible target of human selection during domestication to diversify flowering times of cultivated rice. Genes Genet. Syst. 86, 175–182 (2011).
43. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
44. Baxter, I. et al. A coastal cline in sodium accumulation in Arabidopsis thaliana is driven by natural variation of the sodium transporter AtHKT1;1. PLoS Genet. 6, e1001193 (2010).
45. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H. & Goldstein, D.B. Rare variants create synthetic genome-wide associations. PLoS Biol. 8, e1000294 (2010).
46. Platt, A., Vilhjálmsson, B.J. & Nordborg, M. Conditions under which genome-wide association studies will be positively misleading. Genetics 186, 1045–1052 (2010).
47. Jorgenson, E. & Witte, J.S. A gene-centric approach to genome-wide association studies. Nat. Rev. Genet. 7, 885–891 (2006).
48. Ivanov, D.K. et al. Longevity GWAS using the Drosophila genetic reference panel. J. Gerontol. A Biol. Sci. Med. Sci. 70, 1470–1478 (2015).
49. Ferrari, R. et al. A genome-wide screening and SNPs-to-genes approach to identify novel genetic risk factors associated with frontotemporal dementia. Neurobiol. Aging 36, 2904.e13–2904.e26 (2015).
50. Abrash, E.B., Davies, K.A. & Bergmann, D.C. Generation of signaling specificity in Arabidopsis by spatially restricted buffering of ligand-receptor interactions. Plant Cell 23, 2864–2879 (2011).
51. Segura, V. et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 44, 825–830 (2012).
52. Vilhjálmsson, B.J. & Nordborg, M. The nature of confounding in genome-wide association studies. Nat. Rev. Genet. 14, 1–2 (2013).
53. Sasaki, A. et al. Green revolution: a mutant gibberellin-synthesis gene in rice. Nature 416, 701–702 (2002).
54. Asano, K. et al. Artificial selection for a green revolution gene during japonica rice domestication. Proc. Natl. Acad. Sci. USA 108, 11034–11039 (2011).
55. Konishi, S. et al. An SNP caused loss of seed shattering during rice domestication. Science 312, 1392–1396 (2006).
56. Li, Y. et al. Natural variation in GS5 plays an important role in regulating grain size and yield in rice. Nat. Genet. 43, 1266–1269 (2011).
57. Wang, Y. et al. Copy number variation at the GL7 locus contributes to grain size diversity in rice. Nat. Genet. 47, 944–948 (2015).
58. Si, L. et al. OsSPL13 controls grain size in cultivated rice. Nat. Genet. 48, 447–456 (2016).



ONLINE METHODS
Plant material and genotyping. We used a set of 176 japonica rice varieties cultivated in Japan, which were collected from various places in Japan59,60, and maintained in the Togo Field, Field Science Center, Nagoya University (Supplementary Table 1). Total DNA was extracted from leaves of each variety using the DNeasy Plant Mini Kit (Qiagen). The DNA was physically sheared into ~500 bp fragments using Covaris S2 (Covaris). The fragmented DNA was used for DNA library construction with the NEBNext DNA Library Prep Reagent Set for Illumina (BioLabs). The DNA library was sequenced using an Illumina HiSeq 2000 (Illumina Co, Ltd.), and a total of 3.8 billion paired-end 100-bp reads were obtained. All reads were mapped against Os-Nipponbare-Reference-IRGSP-1.0 pseudomolecules using bwa-mem with the -M option of BWA software61. The mapped reads were realigned using RealignerTargetCreator and IndelRealigner of GATK software62. To label SNPs and indels, UnifiedGenotyper of GATK was used with the -glm BOTH option. After removing nucleotide variations with missing rates ≥ 0.25 and minor allele frequency < 0.05, we identified 426,337 SNPs and 67,544 indels. All nucleotide polymorphisms were categorized based on their location in the reference genome (groups I−V). We used the generic feature format file version 3 (gff3) from the MSU rice genome annotation project63 for information on gene position and coding sequence. With respect to categorizing polymorphisms, we regarded the promoter region as the 2 kb area upstream of the translation initiation site of each gene.

Population genetic analyses. The population structure of the 176 varieties was estimated using PCA performed with the software EIGENSTRAT64. To analyze the genetic relationship between the 176 varieties and other japonica rice varieties, we used the genotype data set obtained from temperate and tropical japonica varieties as described20 (Rice HapMap Project; see URLs). Genetic relationships were estimated using neighbor-joining trees constructed using the software PHYLIP version 3.695 (see URLs). The LD between SNPs and indels in the 176 varieties was evaluated using squared Pearson’s correlation coefficient (r2) as calculated with the --r2 command in the software PLINK version 1.9 (see URLs). The LD heatmaps surrounding peaks in the GWAS were constructed using the R package “LDheatmap”65. We estimated the candidate regions using an r2 > 0.6 (refs. 66–68).

Phenotyping. Phenotyping of agronomic traits was performed in 2013 and 2014 at the paddy field located at the Togo Field, Field Science Center, Nagoya University. Heading date was recorded as the appearance date of the first panicle and days to heading as the number of days from sowing to heading date. Plant height is the mature length of the main culm from the soil surface to the tip of the panicle. Panicle length of the main culm was used to represent the panicle length of a plant. Panicle number per plant was measured 20 days after the initiation of flowering. Leaf blade width was measured at the widest point of flag leaves sampled from the main stem. The awn length of apical spikelets on each primary branch was used to represent the awn length of the whole panicle. Data for each trait were evaluated from three randomly chosen plants per variety.

GWAS. For GWAS, we used a linear mixed model (LMM). The LMM assumes the following model69:

$$ y = X\beta + Zu + \varepsilon \qquad (1) $$

where y is a vector of phenotypes; X is a matrix of fixed effects including the nucleotide polymorphism (for single-polymorphism-based association analysis) or gene haplotype (for gene-based association analysis, see below), the grand mean and principal component 1 calculated in the genetic relationship analysis; β is a vector of effects; and Z is an incidence matrix relating y to u. The variable u models the genetic background of each line as a random effect with $u \sim N(0, K\sigma_G^2)$, where K is a kinship matrix calculated from the nucleotide polymorphisms, and $\sigma_G^2$ is the genetic variance. ε is a vector of residual effects such that $\varepsilon \sim N(0, I\sigma_e^2)$, where I is an identity matrix and $\sigma_e^2$ is the residual variance. For gene-based association analysis, we performed gene haplotype identification based on polymorphisms localized on the coding regions by an R script written in house. Then the difference in haplotype was represented by dummy variables. We modified the R package “rrBLUP” version 4.3 (ref. 70) to enable the use of a matrix of dummy variables as fixed effects included in X of equation (1). The genome-wide significance threshold was determined using permutation-based false-discovery-rate-adjusted P values71. The permutation tests were repeated 1,000 times.

Graphical genotype visualization. To display genome structure in the tested populations schematically, we generated the graphical genotype. Because the genotype data set was too large for presentation using the graphical genotype, we used the following method to compress the data. In regions of interest, polymorphisms were converted to numeric genotypes represented by 1 and −1, which corresponded to major and minor polymorphisms, respectively. Polymorphisms were then divided into bins (bin size = 10 kb) based on their location in the genome. In each bin, the tested population was forcibly clustered into two groups using the k-means clustering algorithm conducted using the R function “kmeans”, and the information on the cluster was used for generating the graphical genotype. In each bin, major and minor clusters were represented in blue and orange in the graphical genotype.

Phylogenetic analysis. Amino acid alignment of sequences was conducted using ClustalW (see URLs) from the DNA Data Bank of Japan (DDBJ) with default parameter settings and then manually adjusted to optimize alignments. Phylogenetic trees were constructed using the neighbor-joining algorithm in MEGA version 6 (ref. 72) with 1,000 replicates using the following parameters: gaps/missing data, complete deletion; Jones-Taylor-Thornton (JTT) model; pattern among lineages, same; and rates among sites, uniform.

Transgenic analysis. Full-length genomic DNAs encompassing the entire sequence of LOC_Os01g62780 were amplified from Nipponbare (Hap. A) and Line92 (Hap. B) using PCR. Similarly, LOC_Os11g08410 full-length genomic DNAs were amplified from Nipponbare (Hap. A) and Line9 (Hap. C). For LOC_Os08g37890, full-length genomic DNAs were amplified from Line7 (Hap. B) and Line154 (Hap. C). These PCR products were cloned into pENTR/D-TOPO (Invitrogen). DNA fragments were then subcloned into the Gateway binary vector (pGWB) using Gateway LR Clonase Enzyme mix (Invitrogen). For constitutive expression of LOC_Os04g52479, the coding sequences of LOC_Os04g52479 were amplified from cDNA of Nipponbare (Hap. A) and Line130 (Hap. B), and PCR products were cloned into pENTR/D-TOPO. The LOC_Os04g52479 cDNA fragments were subcloned using the LR reaction into the Gateway binary vector pUBQ-pGWB, which contains the maize ubiquitin promoter upstream of the cloning site. The primer sets used for PCR are listed in Supplementary Table 14. The DNA constructs and empty vectors that served as controls were introduced into Nipponbare using Agrobacterium tumefaciens (EHA105)-mediated transformation, according to ref. 73. More than 12 independent T0 plants were generated and grown to maturity in pots under greenhouse conditions.

59. Hashimoto, Z. et al. Genetic diversity and phylogeny of Japanese sake-brewing rice as revealed by AFLP and nuclear and chloroplast SSR markers. Theor. Appl. Genet. 109, 1586–1596 (2004).
60. Ebana, K., Kojima, Y., Fukuoka, S., Nagamine, T. & Kawase, M. Development of mini core collection of Japanese rice landrace. Breed. Sci. 58, 281–291 (2008).
61. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
62. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
63. Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, D883–D887 (2007).
64. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
65. Shin, J.-H., Blay, S., McNeney, B. & Graham, J. LDheatmap: an R function for graphical display of pairwise linkage disequilibria between single nucleotide polymorphisms. J. Stat. Softw. 16, Code Snippet 3 (2006).
66. Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).
67. Ma, X. et al. No association between ovarian cancer susceptibility variants and breast cancer risk among Chinese women. Cancer Epidemiol. Biomarkers Prev. 22, 467–469 (2013).

68. Mayerle, J. et al. Identification of genetic loci associated with Helicobacter pylori serologic status. J. Am. Med. Assoc. 309, 1912–1920 (2013).
69. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
70. Endelman, J.B. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome J. 4, 250–255 (2011).
71. Dudbridge, F. & Gusnanto, A. Estimation of significance thresholds for genomewide association scans. Genet. Epidemiol. 32, 227–234 (2008).
72. Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30, 2725–2729 (2013).
73. Ozawa, K. A high-efficiency Agrobacterium-mediated transformation system of rice (Oryza sativa L.). Methods Mol. Biol. 847, 51–57 (2012).

doi:10.1038/ng.3596
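The variant filtering and the LD-based definition of candidate regions described in the Online Methods (missing rate ≥ 0.25 or minor allele frequency < 0.05 removed; candidate regions estimated with r2 > 0.6) can be illustrated with a minimal sketch. This is not the authors' pipeline, which relied on GATK and PLINK. The 0/1/None genotype coding for inbred lines, all function names, and the rule of taking the outermost positions whose r2 with the lead polymorphism exceeds the threshold are assumptions made here for illustration.

```python
def missing_rate(calls):
    """Fraction of varieties with no genotype call (None) at one site."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF at a biallelic site; calls are 0 (reference) or 1 (alternative)."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / len(called)
    return min(alt_freq, 1.0 - alt_freq)


def passes_filter(calls, max_missing=0.25, min_maf=0.05):
    """Keep a site unless its missing rate is >= 0.25 or its MAF is < 0.05."""
    return (missing_rate(calls) < max_missing
            and minor_allele_frequency(calls) >= min_maf)


def r_squared(site_a, site_b):
    """Squared Pearson correlation over varieties called at both sites."""
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return 0.0  # a monomorphic site carries no LD information
    return s_ab * s_ab / (s_aa * s_bb)


def candidate_region(positions, sites, lead_index, threshold=0.6):
    """Assumed rule: span of all sites with r2 > threshold to the lead site."""
    lead = sites[lead_index]
    linked = [positions[lead_index]]  # the lead site is always included
    for pos, calls in zip(positions, sites):
        if r_squared(lead, calls) > threshold:
            linked.append(pos)
    return min(linked), max(linked)


if __name__ == "__main__":
    positions = [100, 200, 300]
    sites = [
        [0, 0, 1, 1, 0, 1],  # lead polymorphism
        [0, 0, 1, 1, 0, 1],  # in complete LD with the lead
        [0, 1, 0, 1, 0, 1],  # weakly correlated with the lead
    ]
    print(candidate_region(positions, sites, lead_index=0))
```

In this toy input only the second site exceeds the threshold, so the region spans the first two positions. Real data would first be restricted to sites for which `passes_filter` returns true.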
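The gene-based analysis described in the GWAS section of the Online Methods identifies a haplotype for each gene from its coding-region polymorphisms and represents haplotype differences by dummy variables. The sketch below shows one way such a coding could look. It is not the authors' in-house R script: the 0/1 encoding of each site, the function names, and the choice of the first haplotype seen as the reference level are assumptions. The three example haplotypes follow the polymorphisms listed for LOC_Os08g37890 in Figure 6f (sites +236, +238 and +297).

```python
def call_haplotypes(coding_genotypes):
    """Assign a haplotype index to each variety.

    Varieties with an identical tuple of coding-site genotypes share an
    index; indices are given in order of first appearance.
    """
    index_of = {}
    assignment = []
    for genotype in coding_genotypes:
        key = tuple(genotype)
        if key not in index_of:
            index_of[key] = len(index_of)
        assignment.append(index_of[key])
    return assignment, index_of


def dummy_matrix(assignment):
    """Dummy-code haplotypes, dropping haplotype 0 as the reference level.

    Returns one row per variety and one 0/1 column per non-reference
    haplotype, which could serve as fixed-effect columns in a design matrix.
    """
    n_haplotypes = max(assignment) + 1
    return [[1 if hap == col else 0 for col in range(1, n_haplotypes)]
            for hap in assignment]


if __name__ == "__main__":
    # Sites: (+236 SNP, +238 6-bp deletion, +297 4-bp insertion); 1 = variant.
    hap_a = (0, 0, 0)
    hap_b = (1, 1, 1)
    hap_c = (0, 0, 1)
    varieties = [hap_a, hap_b, hap_c, hap_a]
    assignment, index_of = call_haplotypes(varieties)
    print(assignment)
    print(dummy_matrix(assignment))
```

Because each site here is biallelic, no single site can separate all three haplotypes, whereas the two dummy columns can. This is the situation in which the text argues that gene-based markers help with multiallelic genes.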

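The compression used for the graphical genotype (polymorphisms coded as 1 and −1, grouped into 10-kb bins, and varieties split into two clusters per bin) can be sketched as follows. The Online Methods used the R function “kmeans”; the plain Lloyd iteration with a deterministic start shown here is a stand-in chosen so that the example is reproducible, and it may assign borderline varieties differently from R. The function names and data layout (one row per site, one column per variety) are likewise assumptions.

```python
def bin_sites(positions, bin_size=10_000):
    """Group site indices by genomic bin (bin index = position // bin_size)."""
    bins = {}
    for index, position in enumerate(positions):
        bins.setdefault(position // bin_size, []).append(index)
    return bins


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(vectors, max_iter=50):
    """Lloyd's k-means with k = 2 and a deterministic start.

    The first vector and the vector farthest from it seed the two centers.
    """
    center0 = list(vectors[0])
    center1 = list(max(vectors, key=lambda v: squared_distance(v, center0)))
    centers = [center0, center1]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if squared_distance(v, centers[0]) <= squared_distance(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def major_minor(labels):
    """Name the larger cluster 'major' (blue) and the smaller 'minor' (orange)."""
    major = 0 if labels.count(0) >= labels.count(1) else 1
    return ["major" if lab == major else "minor" for lab in labels]


def graphical_genotype(positions, genotypes, bin_size=10_000):
    """Return {bin index: per-variety 'major'/'minor' labels}."""
    n_varieties = len(genotypes[0])
    result = {}
    for bin_index, site_indices in sorted(bin_sites(positions, bin_size).items()):
        vectors = [[genotypes[s][v] for s in site_indices]
                   for v in range(n_varieties)]
        result[bin_index] = major_minor(two_means(vectors))
    return result


if __name__ == "__main__":
    positions = [1_000, 5_000, 12_000]
    genotypes = [            # rows: sites; columns: four varieties
        [1, 1, 1, -1],
        [1, 1, 1, -1],
        [1, -1, -1, -1],
    ]
    for bin_index, labels in graphical_genotype(positions, genotypes).items():
        print(bin_index, labels)
```

Each bin then contributes one colored cell per variety, which is how a region with hundreds of polymorphisms can be drawn at a readable resolution.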