Materials and Methods

VS-1 genome assembly. The V. sylvestris plant VS-1 of Tunisian origin (DVIT2426) was obtained
from the grape germplasm and breeding block of the Shanghai Jiaotong University in Shanghai.
Fresh young leaves were collected for the extraction of total genomic DNA using the CTAB Plant
DNA Extraction Kit (Genenode Biotech Co, Beijing). We obtained 49.5 Gb (~100×) PacBio single-
molecule real-time (SMRT) reads and 26.7 Gb (~54×) circular consensus sequencing (CCS) reads
on the PacBio RS II platform from BGI-Wuhan (Wuhan, China) and Berry Genomics (Beijing,
China), respectively. We also obtained a total of 170.67 Gb (~350×) Illumina paired-end
sequencing data and 62.44Gb Hi-C sequencing data from Novogene (Beijing, China) (table S1).

NextDenovo (v.2.0.beta.1; accessed Dec. 27th, 2019) was used to generate the initial PacBio
subread assembly. The NextDenovo assembly workflow comprises two major steps: 1) NextCorrect:
self-correction of PacBio subreads was conducted with the parameter setting ‘seed_cutoff = 19703,
minimap2_options_raw = -x ava-pb -t 16, sort_options = -m 50g -t 16 -k 50, and correction_options
= -p 32’; and 2) NextGraph: 100 rounds of assembly were conducted with random parameter sets,
and the assembly with the longest contig N50 (2.40 Mb) was selected as the primary assembly for
further curation and polishing.
The total length of the primary assembly was 713.99 Mb, which was significantly larger than the
expected genome size (~500 Mb). This indicates the presence of redundant sequences in the
primary assembly, which is confirmed by the large proportion of BUSCO duplicated genes

We undertook a redundancy filtering step for the primary assembly with a pipeline provided
by Purge Haplotigs (42). Briefly, the pipeline first identifies putative heterozygous contigs through
read-depth analysis. Contigs with a high proportion of bases within the 0.5× read-depth peak were
assigned as putative heterozygous contigs. These putative heterozygous contigs were then subjected
to sequence alignment to identify their allelic companion contigs. The identified haplotigs were then
removed from the assembly iteratively. According to the read-depth analysis, we selected the
cutoff numbers 10, 68 and 140 for the low, midpoint and high read-depths, respectively. The cutoff
for identifying a contig as a haplotig was set to 60%. This step generated a 468.48 Mb filtered
genome assembly with a contig N50 of 5.24 Mb and 2.4% duplicated genes in the BUSCO

Illumina short reads were then used to correct residual errors in the filtered genome assembly.
Illumina reads that contained 10% Ns, were of low quality, or were derived from PCR artifacts were filtered and trimmed
using the program filter_data_parallel (version 1.5) with parameters ‘-y -w 10 -B 40 -a 3 -b 2 -c 3
-d 2 -q 33’. The resultant clean reads were then mapped to contigs using Burrows-Wheeler Aligner
(BWA) mem (version 0.7.17-r1188) with default parameters. Residual errors in the contigs were
corrected with mapped NGS reads using Pilon (v.1.21) (43) with parameters ‘--fix snps, indels --

In order to elongate the polished contigs, we assembled the CCS reads using Canu (v.2.0)
(44) with parameters ‘genomeSize=500m, batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50 -M
250, correctedErrorRate=0.050, -pacbio-hifi ccs.fasta.gz’. Then, the CCS assembly and the
polished contig assembly were aligned to each other using nucmer (from MUMmer v.4.0.0 beta2)
(45) with parameters ‘--mum -D 5’, which was followed by delta-filter with the parameters ‘-i 89
-l 1000’ and show-coords. The polished contigs were then elongated with a home-made Perl script,
filtered again with Purge Haplotigs, and polished again with CCS reads using NextPolish (46) with
default settings. This process yielded an assembly of 477.80 Mb with a contig N50 size of 13.82
Mb (table S2).
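
Contig N50 is the statistic used above to compare assemblies. As a minimal illustration of how it is defined (not the tool used in this study), the following Python sketch computes N50 from a list of contig lengths. The function name and the toy lengths are hypothetical.

```python
def contig_n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical values, not from the VS-1 assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = contig_n50(toy_lengths)
```

In practice the lengths would be read from the assembly FASTA index, and the result is usually reported in Mb.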

The elongated contigs were then anchored into chromosome scale using a Hi-C proximity-
based assembly approach. In total, 62.44 Gb data were used as input for Juicer (v.1.5.6) (47) and
3D-DNA (v. 180922) (48). Illumina Hi-C reads were first aligned to the contigs using BWA-MEM
(v. 0.7.17-r1188). Contigs were ordered and oriented by the 3D-DNA pipeline with parameter ‘--
editor-repeat-coverage 3’. The resultant Hi-C contact matrix was visualized using Juicebox.
Misassemblies and misjoins were manually corrected based on neighboring interactions. Using the
“finalize” section from 3D-DNA, the manually validated assembly was used to build
pseudomolecules, which were then ordered by size. Consequently, 19 high-confidence clusters
representing the haploid chromosomes of V. sylvestris were identified, covering 95.04% of the
whole assembly (fig. S1 and table S2).

We applied four methods to evaluate the quality of the VS-1 genome assembly. Firstly, we
performed Benchmarking Universal Single-Copy Orthologs (BUSCO, v2.0) (49) analysis to assess
the completeness of the VS-1 genome assembly with the genome mode, the embryophyte_odb10
lineage and the Arabidopsis species options. About 95% of the 1,375 BUSCO genes are complete
in our VS-1 assembly (table S3). This number is comparable to those of the PN40024 (12X. V2; Ensembl
release 46) Pinot Noir reference genome (96.6%) (50, 51), Chardonnay genome (94.6%) (52), and
Vitis riparia genome (95.7%) (53), which were obtained from the same pipeline. The proportion
of duplicated BUSCO genes in the VS-1 assembly is 1.2%, which is comparable to that of the
PN40024 Pinot Noir reference genome (1.4%) and V. riparia genome (2.8%), but much smaller
than that of the Chardonnay genome (11.0%). Secondly, we evaluated the assembly continuity
with the LTR Assembly Index (LAI) using LTR_retriever (v.2.8.2) (54). Our VS-1 assembly has a
LAI value of 18.09 (table S3), which is higher than those of the PN40024 Pinot Noir reference
genome (10.63), Chardonnay genome (15.77), and Vitis riparia genome (12.23). This result shows
that the VS-1 genome assembly has the highest continuity among them. Thirdly, we downloaded
39 transcriptomic datasets (26.48 Gb) for V. sylvestris from the NCBI Sequence Read Archive under the
BioProjects PRJNA279229 and PRJNA244752 (table S4). Raw data files in SRA format were
converted to FASTQ format using the SRA Toolkit (version 2.9.6)
and then trimmed by Trimmomatic (version 0.36) (55) with default parameters. Trimmed RNA-
seq data were aligned to the VS-1 assembly by HISAT2 (v.2.1.0) (56). The result shows that the
average mapping rate is 93.21%±0.64% across all libraries, demonstrating that the VS-1 assembly
is of high quality. Fourthly, we aligned the VS-1 assembly to the PN40024 reference assembly by
nucmer (from MUMmer, v.4.0.0 beta2) (45) with parameter ‘-c 100’. The results show that, except
for a few small inversions, there is high collinearity between VS-1 and PN40024 (fig. S1). We also
found that the percentage of anchored chromosome length is much higher in VS-1 (95.04%) than
that in PN40024 (87.64%; fig. S1). This is consistent with the observation that many unassigned contigs in
the PN40024 assemblies correspond to chromosome 7 of the VS-1 genome.

Repeat annotation: Transposable elements were identified in the VS-1 genome using a
combination of homology and de novo-based approaches (57). Tandem repeats in the genome
assembly were identified using TRF (v.4.07b) (58). Well-characterized TEs were identified and
masked by searching against the VS-1 genome assembly using RepeatMasker and ProteinRepeatMask with the
Viridiplantae section of the database (release CONS-Dfam_3.0-rb20181026) as the query library.
To identify TEs that were absent in the library, LTRharvest (v.4.9.1) (59) and
LTR_FINDER_parallel (release 09/27/2019) (60) were used to de novo detect LTR
retrotransposons. LTR_retriever (v.2.8) (61) was then used for the accurate identification of LTR-
RTs from those two outputs, and the generation of a non-redundant LTR-RT library. Additionally,
RepeatModeler (v.2.0) was applied to construct another de novo
repeat library. RepeatMasker was run against the masked genome assembly again, with the merged
de novo repeat library as the query library. The result shows that about 57.12% of the VS-1 genome
assembly is repetitive sequence (table S6). Transposable elements make up the majority
of the repetitive sequence, accounting for 55.98% of the total assembly length.

Prediction of non-coding RNA genes: To predict non-coding RNAs, rRNA genes for plants
were mapped to the VS-1 genome assembly using BLAST (v.2.2.26) (62) with parameters ‘-p
blastn -e 1e-5 -v 10000 -b 10000’. tRNAScan-SE (v.1.3.1) (63) was used to search for tRNA genes
with default parameters. For the identification of miRNA and snRNA genes, Infernal (v.0.81) (64)
was used to search the VS-1 assembly based on covariance models deposited in the Rfam database
(release 9.1) (65) (table S7). In total, we predicted 327 miRNA, 570 tRNA, 183 rRNA, and 586
snRNA genes.

Protein-coding gene annotation and filtering: To annotate the protein-coding genes in VS-1,
we used a combination of three strategies: de novo, homology-based, and RNA-seq–based
prediction. For ab initio gene prediction, Augustus (v.2.5.5) (66) was applied with a
configuration file trained by BUSCO (v.2.0) (49). Three additional ab initio gene
prediction programs were used: Genescan (version 2015-10-31) (67), GlimmerHMM (v.3.0.2) (68),
and SNAP (version 2006-07-28) (69). For homology-based annotation, protein sequences of
Arabidopsis thaliana (Ensembl release 46), V. riparia (NCBI assembly GCA_004353265.1), V.
vinifera cv. Chardonnay (52), and three versions of annotation for PN40024 12x.v2 (NCBI
assembly GCA_000003745.2, Ensembl release 46, and VCost.v3 (51)) were downloaded.
Homologous sequences were aligned against the VS-1 genome assembly using TBLASTN
(v.2.2.26) (70) with parameters ‘-e 1e-5, -F F’. Genewise (v.2.2.0) (71) was used to predict gene
models based on the aligned sequences. For transcript-based annotation, quality-trimmed RNA-
seq reads were mapped to the unmasked VS-1 genome using HISAT2 (v.2.1.0) (56) with default
parameters, and StringTie (v.1.3.3b) (72) with parameter ‘-merge’ was used to combine the output
libraries to a representative set of non-redundant transcripts. Based on the abovementioned three
annotation results, a weighted and non-redundant gene set was generated by merging all of the
gene models with EvidenceModeler (v.r2012-06-25) (73). Finally, the trimmed RNA-seq data
were assembled into unigenes using Trinity (version 2.9.1) (74) with default parameters. The result
was then fed into PASA pipeline (v.2.4.1) (73) together with the EVM result for gene structure
refinement and alternative spliced isoform annotation.

To obtain reliable protein-coding gene models, we also filtered the gene set according to the
following five criteria: 1) remove a gene if more than half of its gene region was annotated as
repeat; 2) remove genes without a start or stop codon; 3) remove genes with any in-frame stop
codons; 4) remove a gene if its CDS length was shorter than 300 bp; 5) remove a gene if its CDS
length was not a multiple of three. In the end, the final reference gene set contains 34,527 protein-
coding genes with a mean transcript size of 5,275 bp, a mean coding sequence size of 1,164 bp,
and a mean number of exons per gene of 4.91 (table S8). Completeness of the annotated gene set
was evaluated by BUSCO (version 2.0) (49) with the plant-specific dataset embryophyte_odb10
(table S8). The result shows that about 95.0% of BUSCO genes are complete and the proportion
of duplicated BUSCO genes is 2.1%. These statistics are similar to those of PN40024 (96.8% and
1.5%, respectively), and better than those of Chardonnay (87.6% and 45.0%, respectively) and V.
riparia (93.0% and 31.5%, respectively) genomes.
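
The five filtering criteria above can be expressed as a single predicate per gene model. The Python sketch below is illustrative only and is not the script used in this study. The function name, the input format (a CDS string plus a precomputed repeat fraction), and the assumption that a valid start codon is ATG are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_gene_filters(cds, repeat_fraction, min_cds_len=300):
    """Apply the five filtering criteria to one gene model.

    cds             -- coding sequence as a DNA string (hypothetical input format)
    repeat_fraction -- fraction of the gene region annotated as repeat (0..1)
    """
    cds = cds.upper()
    if repeat_fraction > 0.5:        # 1) more than half of the gene region is repeat
        return False
    if len(cds) < min_cds_len:       # 4) CDS shorter than 300 bp
        return False
    if len(cds) % 3 != 0:            # 5) CDS length not a multiple of three
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # 2) missing start or stop codon (ATG as the only start codon is an assumption)
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # 3) any in-frame stop codon before the terminal codon
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    return True
```

A gene model is retained only if the function returns True; the order of the checks does not affect the outcome.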

Functional annotation: The predicted genes were further aligned to the SwissProt (75),
TrEMBL, and KEGG (76) databases by BLASTP (v.2.2.26) (77) with an E value of 1e−5, and the
most significant hits were retained. InterProScan (v.5.17-56.0) (78) was used to detect protein
motifs and domains in predicted genes against multiple databases, including Coils (v.2.2.1) (79),
Gene3D (version 3.5.0) (80), Hamap (201511.02) (81), Panther (version 10.0) (82), Pfam (v.28.0)
(83), PIRSF (v.3.01) (84), PRINTS (v.42.0) (85), ProDom (2006.1) (86), ProSite (v.20.119) (87),
SMART (v.6.2) (88), SUPERFAMILY (v.1.75) (89), and TIGRFAM (v.15.0) (90). In summary,
we were able to assign functional annotation to 33,839 protein-coding genes in the VS-1 assembly,
accounting for 94.15% of total predicted protein-coding genes (table S9).

Sample collection and processing. A total of 23 institutions from 16 nations
contributed to the global grapevine cohort (18, 91–96), which comprised 2,269 V. vinifera and
1,035 V. sylvestris accessions. The V. vinifera accessions were collected from institutional
germplasms and private collections. The selection was designed to preferentially include old,
autochthonous, and economically important varieties to maximize the spectrum of genetic
diversity. The V. sylvestris accessions were collected from all major refugia in the world, which
span a large geographical area from the Levant and Transcaucasia in the east to the Iberian Peninsula
in the west (97). Total genomic DNA was either obtained from dried grapevine leaf tissues using
the CTAB Plant DNA Extraction Kit (Genenode Biotech Co, Beijing) in a wet lab at the Yunnan
Agricultural University, or directly sent from collaborators. For the latter, genomic DNA was
cleaned once by sodium acetate precipitation and reconstituted in nuclease-free water (Ambion,
Texas, USA). Sequencing libraries with an insert size of 350–550 bp were prepared with
NEBNext® Ultra™ DNA Library Prep Kit (Illumina, USA) according to the manufacturer’s
directions. Paired-end sequencing was performed on an Illumina NovaSeq 6000 platform by both
Novogene (Beijing, China) and Berry Genomics (Beijing, China). The target sequencing depth
was 20× for each accession. After excluding unusable sequencing libraries, we curated raw
genome data for 3,270 samples (2,256 V. vinifera and 1,014 V. sylvestris; success rate 99.4%),
totaling 33.96 Tb. In addition, we included 271 V. vinifera accessions and 73 V. sylvestris
accessions from previous publications in the following steps (7, 8, 17).

Variant calling, validation, and annotation. The raw sequencing reads were filtered with fastp
(v.0.20.0) (98). We removed reads in which more than 40% of the bases had a Phred quality lower than
20. The clean paired-end reads were then mapped to the VS-1 genome with BWA-MEM2
(v.2.0 prel) using default parameters. We used
Samtools (v.1.9) (99) and Picard (v.2.21.6-0) to sort the
aligned reads and remove duplicated reads. The sequencing depth, duplication rate, and percentage
of mapping of each accession were calculated with bamdst (v.1.0.9) (tables S12 and S13). We denoted
any value that was outside mean ± 3 S.D. of these parameters as an outlier, and excluded grapevine
samples with outlier parameters from variant calling. With this method, we retained 2,237 V. vinifera
and 949 V. sylvestris samples from our collaboration and 266 V. vinifera and 73 V. sylvestris samples
from previous publications, making a final grapevine cohort of 3,525 accessions. A single accession of
muscadine grape (ZZ-01) was included as outgroup for the downstream analyses (100).
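
The mean ± 3 S.D. outlier rule used for sample exclusion can be sketched as follows. This is a minimal Python illustration, assuming the sample standard deviation and one metric at a time; the function name and the input dictionary are hypothetical, and the same check would be repeated for sequencing depth, duplication rate, and mapping percentage.

```python
from statistics import mean, stdev


def outlier_samples(metric_by_sample, n_sd=3.0):
    """Return the sample names whose value lies outside mean +/- n_sd * S.D."""
    values = list(metric_by_sample.values())
    mu = mean(values)
    sd = stdev(values)  # sample S.D.; the choice of estimator is an assumption
    low, high = mu - n_sd * sd, mu + n_sd * sd
    return {name for name, value in metric_by_sample.items()
            if value < low or value > high}
```

A sample would then be dropped from variant calling if it appears in the outlier set of any of the three metrics.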

We used the chromosomes of the VS-1 genome (excluding unanchored sequences) as
references in the identification of variants (both SNPs and Indels). Variant detection was carried
out with GATK3 (v.3.8) according to the recommended
workflow (101). In brief, the variants of each accession were called using the GATK
HaplotypeCaller, and then a joint-genotyping analysis of the gVCFs was performed on all samples
(also separately for V. vinifera and V. sylvestris samples). In the filtering step, various parameters
used in the hard filtering of raw SNPs and Indels were determined according to the
recommendation of GATK (101). As a result, the SNP filter expression was set as “QD<2.0,
QUAL<30.0, SOR>3.0, FS>60.0, MQ<40.0, MQRankSum<-10.0, ReadPosRankSum<-8.0”. The
short Indel filter expression was set as “QD<2.0, QUAL<30.0, SOR>5.0, FS>100.0,
InbreedingCoeff<-0.8”. After the initial filtering step, the number of SNPs and short Indels became
56,462,680 (including 45,624,306 bi-allelic SNPs; Ti/Tv=2.24) and 11,069,435 (including
7,314,397 bi-allelic Indels), respectively. Further filtering yielded a basic set of 19,215,781 SNPs
(Ti/Tv=2.80) and 1,836,885 Indels that are bi-allelic and with less than 60% missing calls and
MAF>0.005. For many downstream analyses, the core set of 10,086,416 SNPs (Ti/Tv=2.87) and
827,214 Indels were acquired by setting the MAF cut-off at 0.05. The intergenic region of the
genome encompasses about 64.7% of SNPs and 70.0% of Indels. About 7.0% of SNPs are located
in the coding sequence, and the nonsynonymous to synonymous SNP ratio is 1.497. In comparison,
only 2.9% of Indels are found in the coding sequence. We also show that 423,625 SNPs are
predicted to be deleterious, and 151,721 Indels to cause frameshift mutations in the coding
sequence. We also calculated the ratios of transition to transversion (Ti/Tv) SNPs with the vcftools
(v.0.1.16) package (102). Notably, the ratios of Ti/Tv increased as the raw SNPs were filtered,
ranging from 2.12 to 2.87, which showcases the high quality of the SNP call sets. We also
performed variant calling separately on all V. vinifera and V. sylvestris samples (table S16). The
number of identified variants was not significantly different between the two. SNP density, Indel
density, and total genetic diversity across each chromosome were calculated in 100 kb sliding
windows using vcftools (v.0.1.16) (102).
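
The SNP hard-filter expression above flags a record when any one of its conditions is true. The Python sketch below illustrates that logic on a plain dictionary of annotation values; it is not a replacement for the GATK filtering step. The thresholds are copied from the text, while the function name, the input format, and the choice to skip annotations that are absent from a record are assumptions.

```python
# Thresholds copied from the SNP filter expression given in the text.
SNP_FAIL_RULES = {
    "QD": lambda v: v < 2.0,
    "QUAL": lambda v: v < 30.0,
    "SOR": lambda v: v > 3.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -10.0,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_snp_filters(annotations):
    """Return the names of the hard filters that a SNP record fails.

    Annotations missing from the record are skipped (an assumption).
    """
    return [name for name, fails in SNP_FAIL_RULES.items()
            if name in annotations and fails(annotations[name])]
```

A record passes when the returned list is empty. The Indel expression could be handled the same way with its own rule table.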

We validated our 3K grapevine SNP datasets with the 10,207 SNPs on a widely used 10K
grapevine SNP chip (103). Initial inspection found that the 10K grapevine SNP chip contains one
replicate of SNP1021_163, leaving the total number of unique alleles as 10,206. Since these SNPs
are based on the PN40024 V. vinifera reference genome, we found the corresponding SNP
locations in the VS-1 genome before the validation. By using a homemade Perl script, we extracted
a short 120 bp DNA sequence at the location of each SNP from the PN40024 genome so that there
is a 60 bp DNA tag on either side of the SNP. The sequences were compiled into a FASTA file. We
used MegaBLAST (104) in the BLAST (v.2.231+) suite to map the sequences onto
the VS-1 genome with the command “blastn -task megablast -use_index true -db -query snp.fa
-outfmt 6 -out”. This resulted in 9,797 unique mapping loci in the VS-1 genome and
9,384 unique loci in the chromosomes (table S17). Among these 9,384 unique loci, we were able
to recover 9,134 SNPs (97.34%) in our raw SNP dataset and 9,098 SNPs (96.95%) in our filtered
SNP dataset, respectively.
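
The tag extraction step was done with a Perl script that is not reproduced here. As a rough Python sketch of the same idea, the code below takes 60 bp on either side of a SNP and formats the tags as FASTA. The function names are hypothetical, and including the SNP base itself in the tag (giving 121 bp) is an assumption, since the text does not specify how the stated 120 bp is counted.

```python
def snp_tag(chrom_seq, snp_pos, flank=60):
    """Return the SNP base plus `flank` bases on either side.

    snp_pos is 1-based. Returns None if the tag would run off the sequence.
    """
    start = snp_pos - 1 - flank
    end = snp_pos + flank  # exclusive end in 0-based coordinates
    if start < 0 or end > len(chrom_seq):
        return None
    return chrom_seq[start:end]


def to_fasta(tags):
    """Format a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in tags.items())
```

The resulting FASTA text corresponds to the query file passed to blastn in the command above.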

We previously reported the SNP information for 49 Vitis species based on the PN40024 V.
vinifera reference genome (8). This dataset was also used for the validation process. We extracted
the SNPs for the V. vinifera and V. sylvestris accessions from the 472 Vitis SNP dataset, obtaining
21,149,067 (MAF>0.005) and 11,839,025 (MAF>0.05) SNPs, respectively (table S18). Using the
same method described above, we mapped the SNP tags onto the VS-1 genome with default
parameters. The result showed that there are 13,071,874 (MAF>0.005) and 7,352,118 (MAF>0.05)
unique mapping loci in the chromosomes. Among them, we were able to recover 11,798,615 SNPs
(MAF>0.005; 90.26%) and 6,783,742 SNPs (MAF>0.05; 92.27%) in our raw SNP dataset and
10,761,520 SNPs (MAF>0.005; 82.33%) and 6,398,063 SNPs (MAF>0.05; 87.02%) in our filtered
SNP dataset, respectively.

Our grapevine cohort contains 59 Chasselas clones, which provide a rare opportunity to
identify somatic SNPs and test if these somatic SNPs could be recovered in our SNP datasets. We
designated sample 229 as control, and utilized Mutect2 (105) with default parameters
to identify somatic mutations. Even though Mutect was developed to process genomic data from
tumor tissues, it has also been used to identify somatic mutations in oak trees (106). The
result was filtered with the command “FilterMutectCalls” and the following criteria were set: (1)
a minimum sequencing depth of 15× at the mutant loci for the control and testing libraries; (2) no
mutant allele in the control library; (3) each somatic mutant allele was supported by 6 or more
individual reads. We identified on average 109 (range 12-248) high-quality somatic SNPs for each
of the 58 Chasselas clones (table S19). We found that on average 93.5% ± 2.8% of the somatic
SNPs could be recovered from our raw SNP dataset, and 80.7% ± 5.0% of the somatic SNPs could
be recovered from our filtered SNP dataset.
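
The three post-filtering criteria for somatic SNPs can be written as a simple per-site check. The Python sketch below is illustrative; the function name and the assumption that read depths and alternative-allele read counts are available per library are hypothetical.

```python
def is_high_quality_somatic(control_depth, control_alt_reads,
                            test_depth, test_alt_reads,
                            min_depth=15, min_alt_reads=6):
    """Return True if a candidate site meets the three criteria in the text."""
    # (1) minimum depth at the locus in both the control and the test library
    deep_enough = control_depth >= min_depth and test_depth >= min_depth
    # (2) no mutant allele observed in the control library
    clean_control = control_alt_reads == 0
    # (3) mutant allele supported by 6 or more reads in the test library
    supported = test_alt_reads >= min_alt_reads
    return deep_enough and clean_control and supported
```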

We performed SNP and Indel annotation according to the VS-1 genome using the package
ANNOVAR (v.2015-12-14) (107), and predicted the effect of nonsynonymous SNPs on the
biological function of proteins with Provean (v.1.1.5) (108).

Genetic clonal accessions. To distinguish from the concept of ‘clone’ used in viticulture, we
define genetic clones as accessions sharing genetic profiles with each other. This includes cuttings,
synonyms, and mutants. The removal of genetic clones and homonyms is crucial for the proper
analyses of grapevine population structure and history. We utilized identity-by-state (IBS) sharing
pattern estimators (109–111) to infer relationships among accessions. This approach is superior to
the identity-by-descent (IBD) inference in our case in that: (1) it does not require prior knowledge
of ancestral pedigree or allele frequencies, and (2) it is robust to SNP ascertainment errors (109–
111). We removed SNPs with low read support (<7 reads) or with high linkage disequilibrium
(LD, r² ≥ 0.5) with other SNPs for the analyses. The estimators were calculated with SNPduo
(v.2.00a) (109). By using estimator values from known clonal accession pairs as reference, we set
the following three cut-off values: R1 ≥ 1.20, IBS2*ratio ≥ 0.99, and KING-robust kinship ≥ 0.3426.
We assumed a genetic clonal relationship if two of the above thresholds were met between
two accessions. We kept one accession for each distinctive genotype and marked all other clonal
accessions for exclusion from analyses.
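
The decision rule for genetic clones can be sketched as follows. The cut-off values are taken from the text; the function name is hypothetical, and reading "two of the above thresholds" as "at least two" is an assumption.

```python
def is_genetic_clone(r1, ibs2_ratio, king_kinship, min_met=2):
    """Return True if at least `min_met` of the three cut-offs are met."""
    met = [
        r1 >= 1.20,              # R1 estimator
        ibs2_ratio >= 0.99,      # IBS2*ratio estimator
        king_kinship >= 0.3426,  # KING-robust kinship estimator
    ]
    return sum(met) >= min_met
```

Each accession pair would be evaluated with its three estimator values, and pairs returning True grouped into the same genotype.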

Phylogenetic tree and network. The SNPs were processed using SNPhylo (Version 20180901)
(112) with default parameters. The resultant PHYLIP-format data were used to construct an ML
phylogenetic tree using RAxML-NG (v.0.9.0) (113) with 32 random search trees and 100 TBE
bootstraps (114). The best tree was chosen according to the maximum Final LogLikelihood value.
A muscadine grape was included as outgroup.

For reticulate phylogenetic network construction, the SNPs with >20% missing calls and MAF <
0.05 were removed, and then PLINK (v1.90b3.38) (115) was used to remove SNPs having high
LD (r² ≥ 0.1) within a continuous window of 50 SNPs (step size 1 SNP). After converting the
SNPs to a nexus format, a phylogenetic network was constructed using SplitsTree4 (v.4.18.3)

Principal Component Analysis and ADMIXTURE. We chose the core set of SNPs (MAF
greater than 0.05) for additional pruning. PLINK (v1.90b6.12) (115) was used to remove SNPs
having high LD (r² ≥ 0.5) within a continuous window of 50 SNPs (step size 5 SNPs), which yielded
2,669,247 SNPs for both analyses. We performed PCA with GCTA (v.1.26.0) (117) using the
default settings. The first three principal components were plotted and colored according to major
viticultural region, utilization, and genetic groups, respectively.

We also examined the genetic ancestry with ADMIXTURE (v.1.3.0) (118) and determined the
choice of K using a 5-fold cross-validation (CV) procedure (119). Even though the CV error
gradually decreased from K=2 to 12, we decided to take K=8 as the optimal value. This is based
on two observations: (1) From K=8 on, the CV error decreases at a slower rate and each additional
K only reduces the CV error by 0.0015 or less; (2) At K=8, the corresponding ancestries are
sufficient to categorize both V. sylvestris and V. vinifera into distinct groups (fig. S6), reflecting
the lowest model complexity. Finally, the grouping and sorting of individuals with similar
ancestral proportions at K=8 was achieved through hierarchical clustering, so that the final
ADMIXTURE graph is easier to read.
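
The first observation behind the choice of K, that beyond a certain K each additional ancestry lowers the CV error only marginally, can be checked programmatically. The Python sketch below encodes only that criterion; the final choice in this study also relied on the biological interpretability of the groups. The function name is hypothetical and no real CV errors are reproduced here.

```python
def first_k_with_small_gains(cv_error_by_k, max_gain=0.0015):
    """Return the smallest K from which every further increase in K
    lowers the CV error by at most `max_gain`."""
    ks = sorted(cv_error_by_k)
    for i, k in enumerate(ks[:-1]):
        # Reduction in CV error for each step K -> K+1 from this K onwards.
        gains = [cv_error_by_k[a] - cv_error_by_k[b]
                 for a, b in zip(ks[i:], ks[i + 1:])]
        if all(gain <= max_gain for gain in gains):
            return k
    return ks[-1]
```

The input would be a mapping from K to the cross-validation error reported by ADMIXTURE for that run.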

Archetypal analysis. We chose the core set of SNPs (MAF greater than 0.05) for additional
pruning. PLINK (v1.90b6.12) (115) was used to remove SNPs having high LD (r² ≥ 0.5) within a
continuous window of 50 SNPs (step size 5 SNPs), which yielded 2,669,247 SNPs for archetypal
analyses. Archetypal analysis was performed using archetypal-analysis (120) with parameters “--
tolerance 0.0001 --max_iter 400”.

Grapevine major group characterization. Linkage disequilibrium (pairwise r² values) was
calculated across all chromosomes using PopLDdecay (v.3.41) (121) with default parameters. The
average nucleotide diversity (π) within continuous 100 kb sliding windows, pairwise population
fixation index (FST), and individual heterozygosity were calculated with VCFtools (v.0.1.16)

Isolation-by-distance analysis. The pairwise population fixation indices (FST) among all
viticultural countries/regions (minimum three individuals required) were calculated with VCFtools
(v.0.1.16) (102). The centroid latitudes and longitudes of countries/regions were used to calculate
the haversine distances with the R function ‘distHaversine’. A scatterplot of FST against haversine
distance was used to fit a linear regression between the two variables. A Mantel test was used to
compare the similarity of the FST and haversine distance matrices.
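
For reference, the haversine distance between two centroids can be computed as in the Python sketch below. This is a stand-alone illustration of the formula rather than the R function used in this study; the function name is hypothetical and the sphere radius (the WGS84 equatorial radius, 6,378,137 m) is an assumption.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lon1, lat1, lon2, lat2, radius_m=6378137.0):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * radius_m * asin(min(1.0, sqrt(a)))
```

Applying the function to every pair of country or region centroids gives the geographic distance matrix that is compared with the FST matrix.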

Ecological niche modelling. We compiled 41 and 16 different geographical records from all
identified Syl-W and Syl-E accessions, respectively, for the analysis. The raster files of 19
bioclimatic variables at 2.5 minutes resolution for the Last Glacial Maximum (LGM, ca. 21 ka,
v1.2b) and early Holocene (EH, Greenlandian, 11.7-8.326 ka, v1.0) paleoclimate data were
obtained from PaleoClim (122). Since removing highly collinear variables has an insignificant
impact on maximum entropy model performance (123), we included all original variables in the
analysis. The R package ENMeval (v.0.3.1) (124) was used to test all combinations of defined
settings and perform cross validation for model evaluation. For the Syl-W ecotype, the settings of
LQ_1 and LQH_2.5 were chosen to measure variable importance for the LGM and EH, respectively,
whereas for the Syl-E ecotype, the settings of LQ_1.5 and LQ_4 were selected. Then the
projections for habitat suitability were generated in MaxEnt (v.3.4.4) (125) from the ENMeval
results with 10 subsample replicate runs and a random test percentage of 30.

Demographic history inference. First, we employed MSMC2 (126) to infer population size
and split time. The input files for MSMC2 were generated with MSMC Tools.
In brief, bi-allelic SNP sites with uniquely mapped reads
and 0.5 to 2-fold mean coverage depths were used in the analyses, and the remaining genomic
regions were masked with a script. Then all segregating sites within each group
were phased using SHAPEIT (v.2.r904) (127). Single population demographic inference was
performed on four individuals (eight haplotypes), whereas population split inference was
performed on two individuals (four haplotypes) for each group. Only grapevine accessions with
the highest proportion of major ancestries (top 50 or major ancestry > 70%) were randomly chosen
for the inference. Single population demographic inference was repeated ten times for each group.
Median population split times were deduced from the results of 100 random combinations for each
comparison. We used a mutation rate of 5.4×10⁻⁹ per site per generation and a generation time of
3 years for demographic history inference (8).
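
MSMC2 reports times and coalescence rates in mutation-scaled units, so the mutation rate and generation time above are needed to express results in years and effective population size. The Python sketch below shows the standard conversion under those two parameters; the function name is hypothetical, and the input values would come from the MSMC2 output table.

```python
def scale_msmc(scaled_time, coalescence_rate, mu=5.4e-9, generation_years=3.0):
    """Convert one MSMC2 output interval to years and effective population size.

    scaled_time      -- time boundary in units of the per-generation mutation rate
    coalescence_rate -- scaled coalescence rate (lambda) for that interval
    """
    years = scaled_time / mu * generation_years   # generations, then years
    ne = (1.0 / coalescence_rate) / (2.0 * mu)    # effective population size
    return years, ne
```

Because both quantities scale inversely with the mutation rate, a different assumed rate would shift the time axis and the population size axis by the same factor.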

Stairway Plot 2 (v.2.1) (128) was also used to estimate the demographic
history of V. sylvestris from the SNP frequency spectrum. We filtered out SNP sites in the coding
sequence region, masked genomic regions of repetitive elements, and applied a mask so that only
chromosomal regions to which short reads can be uniquely mapped were retained (128). For each
population, we only included accessions with the highest proportion of major ancestries (50 for
Syl-W1, 58 for Syl-W2, 51 for Syl-E1, and 34 for Syl-E2). We estimated the folded SFS using
easySFS. Population history was predicted by
ignoring singletons and 200 bootstraps were run to assess confidence intervals. We plotted the
change of estimated median effective population size through time and the associated 95%
confidence intervals (2.5% and 97.5% percentiles).

We used Momi2 (v.2.1.19) (129) to explore demographic models for various sets of four
populations. Five individuals with the highest proportion of major ancestries were included in each
population. We filtered out SNP sites in the coding sequence and genomic regions of repetitive
elements. The extracted folded site frequency spectrum (SFS) was split into 100 equal-sized blocks
for jackknifing and bootstrapping. One gene flow event and constant population size were assumed
for each four-population comparison. The split times of Syl-W/Syl-E and Syl-E1/CG1 were
based on the MSMC2 results, where the interquartile range (25% to 75%) was fed into Momi2.
We fitted 20 independent runs with random starting parameters and selected the demographic
model with the highest log-likelihood value of all runs. Then 100 bootstraps for the best model
were implemented by resampling blocks of the SFS to generate confidence intervals.

Selective sweep signals. We investigated the selection signals across the whole genome via a cross
comparison of the genetic differentiation (FST) and nucleotide diversity (π). A 50 kb sliding
window with 10 kb step approach was applied to quantify FST and π by using the VCFtools
software (v0.1.16) (102). Windows in the top 5% of both values were selected as candidate
selective sweep signals.
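
Selecting windows that fall in the top 5% of both statistics can be sketched as below. This Python example is illustrative only: the function names and the tuple layout are hypothetical, and the exact π-based statistic that is ranked (for example a diversity ratio between two groups) is an assumption left to the caller.

```python
def top_fraction_cutoff(values, fraction=0.05):
    """Return the smallest value that belongs to the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[n_top - 1]


def candidate_windows(windows, fraction=0.05):
    """windows: list of (window_id, fst, pi_stat) tuples.

    Return the IDs of windows in the top `fraction` for both statistics.
    """
    fst_cut = top_fraction_cutoff([w[1] for w in windows], fraction)
    pi_cut = top_fraction_cutoff([w[2] for w in windows], fraction)
    return [w[0] for w in windows if w[1] >= fst_cut and w[2] >= pi_cut]
```

Adjacent candidate windows would typically be merged before genes are extracted from them.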

Treemix. We estimated admixture graphs of grapevine groups using TreeMix (v.1.12), which
applies a ML method based on a Gaussian model of allele frequency change (130). For each group,
individuals with at least 75% major ancestries (also average Syl-W ancestry in each V. vinifera
group <3%) were used. SNPs were filtered for missing calls and monomorphism. The topology of
the ML trees changes depending on the number of migration edges (m) allowed in the model. The
optimal number of migration edges was determined from the range of one to ten using the R package
OptM (v.0.1.6) (131). The TreeMix program was run with “-bootstrap 1000 -k 500”. The Syl-E1
group was set as root. For each migration event, we constructed the tree with migration edges 10
times using random seeds. The best outcome was determined by the largest residual value.

f-statistics, Patterson’s D, and local introgression region. Individuals with at least 75% major
ancestries were used for each group. Outgroup f3 statistics were calculated using the R package
admixr (v.0.9.1) (132) for all possible combinations of grapevine groups with Vitis rotundifolia as
the outgroup. The Patterson’s D and f4 admixture ratio for all possible combinations of trios of the
grapevine groups were calculated using Dtrios in Dsuite (v. 0.4 r42) (133) with V. rotundifolia as
the outgroup. It is worth noting that Dsuite does not assume prior knowledge of the tree, and orders
the test trio in a way so that the BBAA pattern is more common than ABBA and BABA patterns.
Dsuite also orders the position of P1 and P2 so that the resultant D statistic is always positive.
SNPs were filtered for missing calls and monomorphism. To further locate the local introgressed
genomic regions, the df and fdM statistics were calculated along the whole genome using
Dinvestigate in Dsuite with a sliding window of 50 SNPs and a step of 5 SNPs. We defined the
putative introgressed regions as those among the top 1% of both values and visualized these regions
with R.
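For readers unfamiliar with the statistic, the following is a minimal sketch of Patterson's D computed from derived-allele frequencies, together with the P1/P2 reordering that keeps D non-negative. It illustrates the standard ABBA/BABA definition and is not the Dsuite implementation; the function names and toy frequencies are hypothetical.

```python
def patterson_d(sites):
    """Patterson's D from per-site derived-allele frequencies.

    `sites` is an iterable of (p1, p2, p3, p_out) tuples, one per SNP.
    ABBA and BABA are accumulated as expected pattern frequencies.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p_out)
        baba += p1 * (1 - p2) * p3 * (1 - p_out)
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


def orient_trio(p1_name, p2_name, d_value):
    """Swap P1 and P2 when D is negative so the reported D is non-negative."""
    if d_value < 0:
        return p2_name, p1_name, -d_value
    return p1_name, p2_name, d_value


# Toy data: two ABBA sites and one BABA site, outgroup fixed for the ancestral allele.
toy_sites = [(0.0, 1.0, 1.0, 0.0), (0.0, 1.0, 1.0, 0.0), (1.0, 0.0, 1.0, 0.0)]
d = patterson_d(toy_sites)
print(round(d, 4))
```

A positive D here indicates an excess of allele sharing between P2 and P3 relative to P1 and P3.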

Sex determination region haplotypes. Positions of SDR-related SNPs were obtained from a
previous publication (33) and the corresponding SNPs in the VS-1 genome were obtained. The
genotypes of SDR were processed to generate haplotypes and transformed to NEXUS format by
DnaSP (v5) (134). Geographical and group categorization information was associated with
haplotypes in the NEXUS file as a trait block. Popart (v.1.7) (135) was used to construct haplotype
networks using the median-joining method.

Genome-wide association study. We performed a genome-wide association (GWA) study on
muscat and non-muscat grapevines using the fastGWA-GLMM method (136) in GCTA (v.1.93.3beta)
(117). For the binary categorization, the muscat phenotype (n=134, table S1 and S14) was defined
as 1 and the non-muscat phenotype (n=158) as 0. The non-muscat grapevines were selected from CG1,
the earliest domesticates. SNPs with a missing call rate greater than 0.2 or a minor allele frequency less
than 0.01 were filtered out. We defined the whole-genome significance cut-off as -log10 (P) = 6.
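The SNP filters and significance cut-off above can be illustrated with a short sketch. It assumes that a SNP is removed when it fails either threshold and that a SNP is significant when -log10(P) reaches 6; the function names and the genotype encoding (alternate-allele counts, `None` for missing) are hypothetical and do not reflect the GCTA internals.

```python
import math


def passes_qc(genotypes, max_missing=0.2, min_maf=0.01):
    """Check a SNP against the missing-rate and minor-allele-frequency filters.

    `genotypes` holds alternate-allele counts (0, 1, or 2) per individual,
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n_missing = len(genotypes) - len(called)
    if not called or n_missing / len(genotypes) > max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf


def is_significant(p_value, threshold=6.0):
    """True when -log10(P) is at or above the genome-wide cut-off."""
    return -math.log10(p_value) >= threshold


print(passes_qc([0, 1, 2, None, 0]), is_significant(1e-7))
```

For the berry skin color analysis, the same sketch would apply with `max_missing=0.3`.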

We also performed a GWA analysis on berry skin color using the MLMA-LOCO model in
GCTA (v.1.93.3beta) (117) to control for the impact of population stratification. The phenotype for
all cultivated grapevines was obtained from VIVC and assigned categorical values as follows:
green-yellow=1; rose=2; red=3; red-black=4. SNPs with a missing call rate greater than 0.3 or a minor
allele frequency less than 0.01 were filtered out. We defined the whole-genome significance cut-off
as -log10 (P) = 6.

Supplementary Text
Parent-offspring relationships. We also collected known parent-offspring relationships from the
Vitis International Variety Catalogue (VIVC; and used their IBS sharing pattern
estimators to determine cut-off values for first-degree relationship candidates. A total of 10,181
accession pairs met all four estimator criteria: R0 ≤ 0.096, R1 ∈ [0.5, 1.20), KING-robust
kinship ∈ [0.210, 0.3426), and IBS2*ratio ∈ [0.912, 0.99). We then manually screened all candidates
to identify 194 close-cross relationships (e.g., backcross), 1,356 parent-offspring relationships, and
214 full sibling relationships (fig. S3 and table S23).
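The four-estimator screen can be written as a single predicate. This sketch assumes the cut-offs are an upper bound for R0 and half-open intervals for the other three estimators; the function and argument names are hypothetical, and the manual screening that follows is not reproduced.

```python
def is_first_degree_candidate(r0, r1, king_kinship, ibs2_ratio):
    """True when an accession pair meets all four estimator criteria."""
    return (
        r0 <= 0.096
        and 0.5 <= r1 < 1.20
        and 0.210 <= king_kinship < 0.3426
        and 0.912 <= ibs2_ratio < 0.99
    )


# Hypothetical pairs: the first meets every criterion, the second fails on R0.
print(is_first_degree_candidate(0.05, 0.80, 0.25, 0.95))
print(is_first_degree_candidate(0.20, 0.80, 0.25, 0.95))
```

Pairs that pass this screen are only candidates; assigning them to close-cross, parent-offspring, or full sibling classes required the manual step described above.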

Large viticultural regions. Major viticultural countries in the world are usually categorized into
larger regional groups for both clarity and convenience. We based our categorization on a previous
report (19) but made minor modifications (fig. S3). Namely, the changes are: (1) China, Japan and
South Korea form an independent Eastern Asia regional group; (2) Iran is grouped in the Caucasus,
because the individuals are closer to Armenian and Azerbaijani samples on the PCA plots; (3) Turkey
is not grouped with any other close-by regions. This is to better showcase the Turkish V. sylvestris
samples on the PCA plots, as they do not form a close cluster with either Caucasian or Balkan
individuals. We also agree with the previous report on listing Italy as its own group. The number
of accessions is above 50 for the majority of viticultural regions in the Eurasian continent. Maghreb
and Central Asian V. sylvestris accessions are not readily available to the field, since climate
change and social instability in the region have prevented field investigation in the past decade.
For V. vinifera, the number of table, wine, and dual-purpose grapevines accounts for 88% of the
core cultivated accessions.

Description of the V. sylvestris accessions. Four V. sylvestris groups with distinctive ancestries
are found. In the east, the Syl-E1 accessions (K2 red: 84.2%±6.6%) are limited to the banks of the
Jordan River and the Sea of Galilee in the northern Levant, whereas the Syl-E2 accessions (K6
navy blue: 72.7%±8.9%) are mainly found in the South Caucasus and on the southern shore of the Caspian
Sea. In the west, the Syl-W1 accessions (K1 sky blue: 94.6%±7.4%) are mainly located close to
the Danube River and the upper Rhine River, whereas the Syl-W2 accessions (K8 pink:
69.7%±10.8%) grow in the Iberian Peninsula and southwest France (fig. S8).

The admixed V. sylvestris accessions form several discrete clusters according to the
hierarchical clustering topology (fig. S8).

Syl-Admix1: The Syl-Admix1 accessions are composed predominantly of the K1
(57.5%±7.4%; sky blue) and K8 (29.6%±7.4%; pink) genetic ancestries. Their geographic
locations include Eastern France (districts Corsica, Nièvre, Bas-Rhin, and Alpes-Maritimes),
Switzerland, Italy (northern Italy, Sardinia), and Western Balkan (Croatia, Bosnia and
Herzegovina). This area is the middle zone in between Syl-W1 (K1) in the north (Germany,
Austria, and Hungary) and Syl-W2 (K8) in the south (Iberia).

Syl-Admix2: The Syl-Admix2 accessions are mainly from Northern Black Sea (Crimea)
and Eastern Balkan (Bulgaria). Besides K1 being the predominant ancestry component
(57.6%±9.3%; sky blue), these accessions contain a sizeable portion of K6 ancestry
(19.6%±6.6%). The third largest ancestry component is K8 (14.6%±4.2%; pink). The proportion
of K6 suggests a genetic influence from the Caucasian eastern ecotype Syl-E2.

Syl-Admix3: The Syl-Admix3 accessions are mainly from Eastern Turkey and Italy. These
accessions are characterized by highly admixed ancestries, which suggests intensive introgression
from V. vinifera into V. sylvestris.

Syl-Admix4: The Syl-Admix4 accessions come from the same region as Syl-Admix1. The
predominant ancestry components are also K1 (39.2%±6.1%; sky blue) and K8 (32.6%±7.9%;
pink). However, Syl-Admix4 contains higher proportions of K2 to K7, which suggests intensive
introgression from V. vinifera into V. sylvestris.

Syl-Admix5: The Syl-Admix5 accessions in the Iberian region are characterized by the K7
(22.5%±9.8%; yellow) and K8 (58.5%±13.5%; pink) ancestries. Since K7 is associated with
Iberian cultivated grapevines, Syl-Admix5 represents local introgressed hybrids in the Iberian

Syl-Admix6: The Syl-Admix6 accessions were collected from Turkey, Iran, and Armenia.
They are characterized predominantly by the K2 (25.4%±12.8%; red) and K6 (36.5%±14.6%; navy blue)
ancestry components and represent the admixture between the two eastern ecotype groups.
However, these accessions also contain sizeable proportions of other ancestries representing
cultivated grapes.

Description of the V. vinifera accessions. The ADMIXTURE analysis revealed that very few V.
vinifera accessions contain 100% of a specific genetic ancestry (fig. S8, table S25). This reflects
the intensive hybridization history among V. vinifera accessions. We plotted the cultivated
grapevines on a tri-plot according to the proportions of K2, K5, and the sum of all other K
components (fig. S8). The result shows that K2 and K5 are associated with table grapevines,
whereas all other K components are associated with wine grapevines.

CG1: The CG1 cultivars can trace their birthplaces to a large geographical area, which covers
East Asia, Central Asia, Western Asia, Caucasus, Northern Black Sea, and Northern Africa. The
genetic ancestry of the CG1 group is very similar to that of the Syl-E1 group, where the
characteristic K2 (red) component on average accounts for 73.9%±10.3% of the total ancestry.
‘Asswad Karech’ (Fre61), ‘Amud’ (IS39), and ‘Safsufa R’ (IS52) are Western Asia cultivars with
the highest K2 component. Given that K2 is associated with table grapevines, we reason that the
CG1 cultivars represent Western Asia table grapevines.

CG2: The CG2 cultivars bear resemblance to the Syl-E2 individuals in genetic ancestry,
where K6 (navy blue) is the predominant component (66.4%±17.1%). They are mainly located in
the Caucasus and Northern Black Sea region. The Georgian grapevine ‘Kisi’ (GE29) is the only
cultivar in this study having a pure K6 ancestry. Other grapevines with a high K6 ancestry
component include ‘Kurkena’ (GE25), ‘Ghvinis Tsiteli’ (GE13), and ‘Khikhvi’ (GE23). We
reason that the CG2 cultivars represent Caucasus wine grapevines.

CG3: A key feature of this group is the large number of muscat grapevines and their
descendants for table or dual-purpose usage. In particular, ‘Muscat Hamburg’ and ‘Königin der
Weingärten’ are the most popular parental varieties with a pure K5 (purple) ancestry. At the group
level, the K5 component accounts for 87.6%±11.2% of the total ancestry. The geographical
distribution of CG3 cultivars is quite diffuse, spanning from Eastern Asia to Western Europe.
Even though intercrossing among muscat grapevines is common, it should be noted that not all
descendants inherit the muscat aroma. That said, we reason that CG3 cultivars represent
muscat grapevines.

CG4: The CG4 cultivars are mainly distributed in the Balkans and characterized by the major
K4 (orange) ancestry component (69.9%±17.4%). ‘Crimposie’, ‘Furmint’, ‘Fekete Balafant’,
‘Plavaie’, and ‘Armas’ are the cultivars with a pure or close to pure K4 ancestry. We define the
CG4 cultivars as Balkan wine grapevines.

CG5: The CG5 group represents Iberian grapevines that contain a major K7 (yellow) ancestry
component (68.8%±12.8%). Cultivars with a pure K7 ancestry include ‘Cayetana Blanca’,
‘Heben’, ‘Boal Vencedor’, and ‘Zalema’. We define the CG5 cultivars as Iberian wine grapevines.

CG6: The CG6 group is mainly associated with a K3 (dark brown) ancestry component
(68.4%±12.2%) and a major distribution area in France and Germany. Cultivars with a pure K3
ancestry include ‘Gros Noir A Tacher’, ‘Savagnin Blanc’, ‘Pinot Noir’, and ‘Bequignol Blanc’.
We define the CG6 cultivars as Western European wine grapevines.

The hierarchical clustering result reveals three major groups of admixed V. vinifera
accessions (C-Admix1-3; fig. S8). The C-Admix1 group represents the diverse breeding outcome
between muscat grapevines (K5: purple, 48.0%±11.1%) and other groups (K2, K3, K4, and K7).
The C-Admix2 group represents the breeding descendants between Western Asia table grapevines
(K2: red, 39.7%±9.6%) and other groups (K4, K5, K6, and K7). The C-Admix3 group contains
accessions of assorted genetic ancestry combinations, including Balkan wine/Iberian wine
(CG4/CG5) grape crosses, Balkan wine/Western European wine (CG4/CG6) grape crosses,
Iberian wine/Western European wine (CG5/CG6) grape crosses, and accessions with more than
four genetic components. Of note, there are very few cultivars descending from a cross between
Caucasian wine grapevines and other groups.

The V. vinifera accessions with sizeable V. sylvestris ancestries. As six cultivated grapevine
groups and their corresponding genetic ancestries are defined above, we are able to identify
cultivars having a sizeable wild western ecotype ancestry at K=8 (table S25 and fig. S9).
Representative cultivars include ‘Riesling Blau’, ‘Manseng’ cultivars, and ‘Lambrusco’ cultivars,
all of which were shown in previous studies to be V. vinifera and V. sylvestris hybrids (11, 93,
137). The ancestry composition reveals details of the hybridization. On the one hand, the ratio of
V. vinifera to V. sylvestris ancestries is approximately 1:1, suggesting that some cultivars
were established from a single cross between cultivated and wild accessions. On the other hand,
the proportion of K1 and K8 (sky blue and pink) may inform the type of V. sylvestris used in the
cross and possibly the large region where the cross occurred. For instance, ‘Riesling Blau’ has a
higher proportion of K1 (40.3%). This suggests that the parental V. sylvestris belongs to Syl-W1
and that the proposed place of origin is where Syl-W1 could be found (e.g., Germany). In
comparison, ‘Petit Manseng’ has a higher proportion of K8 (48.1%), which indicates the cross with
Syl-W2 in Western Europe. The proportions of K1 and K8 in the Italian Lambrusco cultivars are
of similar size (about 25% each). This suggests a parental V. sylvestris of admixed nature.

Description of shared and unique domestication signatures. In the main text, we listed gene
examples that are associated with shared and unique domestication signatures (full list in Table
S28 and S29). The descriptions of these genes and their functions are detailed below. As gene
functional annotation depends on homology-based inference, it should be noted that these
grapevine genes require additional verification for their true functions in the plant.

Vvsyl02G000297 (NPF): The protein product of this gene is predicted to be an NRT1/PTR
protein 4.5, which belongs to the large NITRATE TRANSPORTER 1/PEPTIDE
TRANSPORTER family. These membrane proteins are known to transport a variety of molecules
(e.g., nitrate, peptides) in plants, thereby playing important roles in plant development and growth

Vvsyl08G001229 (FER4): This gene product is predicted to be the iron storage protein
ferritin-4. Like other ferritins, it not only helps maintain iron homeostasis but also
plays a role in the oxidative response in plants (139).

Vvsyl09G001081 to Vvsyl09G001083 (GA2OX): These genes, together with Vvsyl09G001084
and Vvsyl09G001086 (both GA3OX), form a gibberellin oxidase gene cluster on chromosome 9.
They are involved in the biosynthesis and degradation of the plant hormone gibberellin. It has
been shown that grapevine gibberellin oxidases play an important role in the regulation of
flowering and fruit-set (140).

Vvsyl17G000504 to Vvsyl17G000506 (PPR): The pentatricopeptide repeat-containing
proteins are mainly localized to plant organelles. They play important roles in regulating plant
physiology, development and biotic/abiotic responses (141, 142).

Vvsyl17G000525, Vvsyl17G000526, and Vvsyl17G000528 (RNF181): This E3 ubiquitin
ligase gene cluster is on chromosome 17. The proteins of this gene family are involved in the
ubiquitination-mediated protein degradation pathway, thereby regulating a myriad of plant
biological processes, including growth, reproduction, and biotic/abiotic response (143).

Vvsyl05G001656 to Vvsyl05G001658 (MecgoR): This gene cluster encodes proteins with the
predicted function of methylecgonone reductase, which is essential in the biosynthesis of
tropane alkaloids in plants (144).

Vvsyl07G001732 and Vvsyl07G001733 (TR2): This gene cluster encodes tropinone
reductase homologs in grapevine. This family of enzymes controls a branch point in the tropane
alkaloid biosynthetic pathway in plants (144).

Vvsyl17G000431 to Vvsyl17G000435 (SSL): This gene cluster encodes strictosidine
synthase-like proteins. They catalyze the biosynthesis of strictosidine, a precursor molecule to
more than 2,000 indole alkaloid compounds in plants (145).

Vvsyl05G001664 (SWEET17): This gene belongs to the bidirectional sugar transporter gene
(SWEET; sugars will eventually be exported transporter) family in plants. Alongside its function
in carbohydrate transport and homeostasis, the SWEET proteins have also been implicated in the
biotic/abiotic response in grapevine (146, 147).

Vvsyl06G000699 (PFKFB1): The 6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase is
a key enzyme in the glycolysis signaling pathway in all cells.

Vvsyl15G001098 and Vvsyl15G001099 (BEAT): The products of these two genes (acetyl-
CoA:benzylalcohol acetyltransferase) are able to catalyze the biosynthesis of benzylacetate. This
enzyme has been implicated in the production of floral scent in Clarkia breweri and Prunus mume
(148, 149).

Vvsyl03G001186 (UFGT): The function of this gene is predicted to be anthocyanidin 3-O-
glucosyltransferase. This important enzyme is involved in anthocyanin biosynthesis and
grapevine berry color improvement (150).

Vvsyl05G001489 and Vvsyl05G001490 (UPL6): These two genes are members of the E3
ubiquitin ligase gene family. The proteins of this gene family are involved in the ubiquitination-
mediated protein degradation pathway, thereby regulating a myriad of plant biological processes,
including growth, reproduction, and biotic/abiotic response (143).

Vvsyl17G000415 to Vvsyl17G000417 (WAK): This gene cluster encodes putative wall-
associated receptor kinase-like proteins. These transmembrane receptors bind to pectin in response to
pathogen attack or wounding and initiate defense mechanisms in plant cells (151).

The domestication time gap between genomic inference and archaeological data. We briefly
present the available archaeological data for grapevine and discuss the possible reasons that may
underlie the domestication time gap between genomic inference and archaeological data.

The domestication of fruit trees is an indispensable component of the human transition to sedentism
in the Neolithic. In West Asia, archaeological evidence dates the domestication of perennial fruit
crops to between 8,500 and 5,500 ya (152). So far, the great majority of grapevine remains in
archaeological excavations have been seeds. The categorization of these archaeological finds is based
upon the observation that V. sylvestris and V. vinifera differ in seed morphology (2). As shown in
table S35, the first appearance of domestic-type grapevine seeds in Western Asia was during
the Early Bronze Age, compared to the wild type of previous periods. In the Caucasus region,
grapevine pips were found in the Late Chalcolithic period site Areni-1 from Armenia, dating back
to ~8,000 BP. Therefore, the consensus in the archaeobotanical world states that the domestication
of perennial fruit trees (e.g., grapevines) lagged behind the domestication of
annual grains (152). One theory speculates that fruit tree cultivation is a labor-intensive act (153).
The long time investment means a delayed return for early humans, thereby suggesting that the first
settled agriculturalists were unlikely to domesticate fruit trees from the very start (153).

The genomic inference dates the domestication of grapevines to the Early Neolithic at ~11,000 ya,
similar to that of annual grain domestication. Even though the estimate is a great improvement
over previous reports (~15-400 Kya) (7–9), a 2,500 to 3,000-year time gap between genomic and
archaeological findings remains unresolved. There may be a few reasons why this gap exists. On
the one hand, the use of grapevine seed morphology only provides indirect evidence of
domestication. This is in contrast to the seeds of grains, where an increase in seed size or the loss
of shattering in archaeological samples provides more straightforward evidence of domestication.
In addition, binary categorization of the seed shape misses the information on the intermediate
state. Both factors could lead to an underestimate of the grapevine domestication time from the
archaeological remains. On the other hand, model-based genomic inference relies on the choice of
many parameters, which may introduce uncertainties. (1) Generation time: if grapevines had a
shorter generation time than the juvenile period of three years in the past (fig. S26), the
domestication time could be overestimated. (2) Ghost progenitor populations: though the inference
does not support the existence of such populations (fig. S27), the domestication time would be
revised down if future archaeological evidence of an extinct progenitor population were to
emerge. That said, paleogenomic data could help resolve the time gap in the future.
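The effect of the generation-time assumption can be illustrated with simple arithmetic. The sketch below assumes that the inference fixes the number of generations and that years are obtained by multiplying by the generation time; the alternative generation time of two years is a hypothetical value chosen only for illustration, and the function name is likewise hypothetical.

```python
def rescale_time(years, assumed_generation_time, alternative_generation_time):
    """Convert a time estimate to a different generation-time assumption.

    The number of generations is held fixed; only the years-per-generation
    conversion changes.
    """
    generations = years / assumed_generation_time
    return generations * alternative_generation_time


# ~11,000 years at 3 years per generation, re-expressed at a hypothetical 2 years.
rescaled = rescale_time(11000, 3, 2)
print(round(rescaled))
```

Under these assumptions a shorter generation time gives a more recent date, which is the direction of the overestimation discussed in point (1).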

Fig. S1. The genome assembly of a V. sylvestris accession ‘VS-1’. (A) Pseudo-chromosomes of
the VS-1 genome assembly. Numbers correspond to the chromosome numbers used in the V.
vinifera genome assembly PN40024 (12X.v2). (B) Syntenic relationship between the VS-1
genome assembly and PN40024 (12X.v2). (C) Comparison of the anchored chromosome lengths
in the VS-1 and PN40024 (12X.v2) genome assemblies.

[Fig. S2 image. Recoverable labels: Chr 1 to 19; SNP, Indel, total; V. vinifera, V. sylvestris; exonic, intronic, splicing, UTR, intergenic; Fourfold Degenerate, Deleterious; MAF interval 0.01, MAF interval 0.05; Size (bp).]
Fig. S2. Characterization of SNPs and small Indels from 3,648 V. sylvestris and V. vinifera
accessions. (A) Density plot of SNPs, small Indels (<40 bp), and nucleotide diversity (π) across
19 chromosomes of the VS-1 genome. (B) Tabulation of SNPs and small Indels according to the
different locations in the genome. (C) Frequency spectrum of SNPs according to the minor allele
frequency brackets and functional annotation. (D) Size frequency of small Indels in the genome.
(E) Frequency spectrum of small Indels according to the minor allele frequency brackets.

Fig. S3. Identification of core V. sylvestris and V. vinifera accessions in the total sample
cohort. (A) Schematic flowchart for the acquisition of 2,448 core V. sylvestris and V. vinifera
accessions from the total cohort. (B) Identification of clonal, close-cross (e.g., backcross),
parent-offspring, and full sibling relationships among 3,525 accessions according to identity-by-
state (IBS) sharing patterns. The majority of clonal relationships are among V. vinifera
individuals and shared by less than five accessions. PO, parent offspring; FS, full sibling; IBS,
identity-by-state. (C) Categorization of core accessions according to the major viticultural
regions. W. Asia, Western Asia; E. Asia, Eastern Asia; Rest. World, Rest of World; C. Asia,
Central Asia; Rus/Ukr, Russia/Ukraine; E. Euro, East Europe; C. Euro, Central Europe; W. Euro,
West Europe.

[Fig. S4 image. Axes: PC 1 (7.56%), PC 2 (1.71%), and PC 3 (1.26%). Legends: V. vinifera, V. sylvestris; viticultural regions (W. Asia, Caucasus, Balkan, Rus/Ukr, E. Euro, Iberia, Turkey, C. Asia, C. Euro, W. Euro, Italy, Maghreb, E. Asia, Rest. World); Table, Wine, Table/Wine, Raisin/other; major grapevine groups (e.g., Syl-E1, Syl-E2, Syl-W1, Syl-W2, CG5, CG6).]

Fig. S4. Principal component analyses of 2,448 core grapevine accessions. The projections
are colored according to major viticultural regions (A), grapevine utilization (B), and major
grapevine groups (C). The large square and circle in (A) represent the median positions.
Uncategorized and admixed accessions are greyed out.

[Fig. S5 image. Axes: PC 1 (8.16%) and PC 2 (1.57%). Legends: V. vinifera, V. sylvestris; viticultural regions; major grapevine groups (e.g., Syl-E1, Syl-E2, Syl-W1, Syl-W2, CG1 to CG4).]

Fig. S5. Principal component analyses of 2,448 core grapevine accessions. The principal
component analysis was performed on V. sylvestris accessions and V. vinifera accessions were
projected onto the graph. The projections are colored according to major viticultural regions (A)
and major grapevine groups (B). Uncategorized and admixed accessions are greyed out.


[Fig. S6 image. Labels: Clade I, mainly table use; Clade II, mainly wine use; 1st and 2nd clusters within each clade; TBE bootstrap ≥ 0.70; branch color, V. vinifera / V. sylvestris; scale bar, 0.05; viticultural region color code (Western Asia, Turkey, Balkan, Caucasus, C. Asia, Eastern Asia, Italy, Iberia, Rus/Ukr, Maghreb, E. Euro, C. Euro, Rest of World, W. Euro).]

Fig. S6. Maximum likelihood phylogenetic tree of 2,448 core grapevine accessions. (A)
Circular presentation of the maximum likelihood phylogenetic tree with 100 TBE bootstraps.
Two major clades are zoomed-in. Each clade contains two smaller clusters. V. sylvestris from
Western Asia is located in the clade with a majority of table grapes. V. sylvestris from Caucasus
and the rest of Europe is located in the clade with a majority of wine grapes. Stars show TBE
values greater than 0.70. Small dark circles and blue circles in the zoomed-in clades represent
collapsed accessions for clarity. (B) The proportion of table, wine, table/wine, and other types of
grapevines in each cluster. C. Asia, Central Asia; E. Euro, East Europe; C. Euro, Central Europe;
W. Euro, West Europe.


Fig. S7. Reticulate phylogenetic network of 2,448 core grapevine accessions. The accessions
are colored according to the major viticultural regions. A total of five major clusters could be
identified. Cluster 1 contains V. sylvestris from Western Asia and major table grapevines.
Cluster 2 contains V. sylvestris from the Caucasus and major Caucasian wine grapevines. Cluster
3 contains a majority of European wine grapevines. Cluster 4 contains mostly western wine
grapevines. Cluster 5 contains V. sylvestris from the rest of the west Eurasian continent.

[Fig. S8 image. Labels: ADMIXTURE bar plots for K=2 to K=8 across V. sylvestris (western ecotype, eastern ecotype, admixed) and V. vinifera (major groups, admixed), ordered Syl-W1, Syl-W2, Syl-E2, Syl-E1, CG1 to CG6; CV error against K value; ancestry components K1 to K8 by V. sylvestris and V. vinifera groups.]

Fig. S8. Categorization of core accessions according to ancestry. (A) ADMIXTURE
clustering of core accessions from K=2 to 8. (B) Cross-validation error plot for the unsupervised
ADMIXTURE analysis. (C) Hierarchical clustering of ancestral components at K=8 to order and
sort core accessions. Syl-W, V. sylvestris western ecotype; Syl-E, V. sylvestris eastern ecotype;
CG, cultivated grapevine.

[Fig. S9 image. Panel (A) titles: CG1 Western Asia Table; CG2 Caucasus Wine; CG3 Muscat; CG4 Balkan Wine; CG5 Iberian Wine; CG6 Western European Wine. y-axis: Ancestry %. Ancestry color legend: K1 (Syl-W1); K2 (Syl-E1; CG1); K3 (CG6); K4 (CG4); K5 (CG3); K6 (Syl-E2; CG2); K7 (CG5); K8 (Syl-W2). Panel (D) axes: K2 (Red), K5 (Purple), and Other Ks, each from 0% to 100%; point legend: Wine, Raisin/Other, Unknown.]
Fig. S9. V. vinifera accessions according to ancestry. (A) Representative cultivars from the six
V. vinifera groups (CG1-CG6) with pure or close to pure ancestries. (B) Representative admixed
V. vinifera cultivars with two major ancestry sources. (C) Representative admixed accessions
with a sizeable wild western ecotype component (sky blue Syl-W1 and pink Syl-W2). (D) Tri-
plot of V. vinifera cultivars according to the proportions of K2, K5, and the other Ks, showing K2
and K5 ancestries are associated with table grapevines and all other ancestries with wine
grapevines. Panels A, B, and C share the same ancestry color scheme. Syl-W, V. sylvestris
western ecotype; Syl-E, V. sylvestris eastern ecotype; CG, cultivated grapevine.

Fig. S10. Categorization of core accessions according to archetypal analysis. The graphs
show the projections of grapevine accessions with different numbers of archetypes (K=3 to
10). Eight archetypes can differentiate major grapevine ancestries obtained from the
ADMIXTURE analysis. Higher archetypes at K=9 and K=10 show overfitting and the mixture of
CG4 and CG5 accessions. Uncategorized and admixed accessions are greyed out.


[Fig. S11 image. Labels: Nucleotide Diversity (π); groups Syl-W1, Syl-W2, Syl-E2, Syl-E1, and CG1 to CG6; letters a to e above the distributions mark the mean-comparison classes.]

Fig. S11. Genetic diversity of major grapevine groups with distinct ancestry. (A) Pairwise
fixation index FST of major grapevine groups. Yellow color represents larger population
differentiation. Two red boxes show that CG1 is closer to Syl-E1 and CG2 is closer to Syl-E2.
(B) Nucleotide diversity (π, 100 kb window size) distribution of major grapevine groups. (C)
Individual heterozygosity distribution of major grapevine groups. Solid and dashed lines
represent median and interquartile range. White diamonds represent mean values. For mean
comparisons, P<0.05 for a<b<e<c<d from Brown-Forsythe and Welch ANOVA test with
Games-Howell post hoc multiple comparisons. Graph drawn according to the ancestry color
palette. Syl-W, V. sylvestris western ecotype; Syl-E, V. sylvestris eastern ecotype; CG, cultivated
grapevine.


[Fig. S12 image. Labels: LD decay curves against Distance (Kb) for Syl-W1, Syl-W2, Syl-E1, Syl-E2, and the CG groups, and for V. vinifera versus V. sylvestris; r2 at 1Kb against Nucleotide Diversity (π).]

Fig. S12. Linkage disequilibrium in the major grapevine groups. Linkage disequilibrium
(LD, r²) decay of V. sylvestris (A) and V. vinifera (B) major groups both show that grapes of
Western Asian (red lines) and Caucasian (teal lines) descent have the smallest LD extents at
around 400–500 bp. (C) LD decay of V. sylvestris is only slightly slower than that of V.
vinifera. (D) Inverse correlation of LD at 1 Kb and nucleotide diversity (π) from major grapevine
groups. Graph drawn according to the ancestry color palette. Syl-W, V. sylvestris western
ecotype; Syl-E, V. sylvestris eastern ecotype; CG, cultivated grapevine.

[Fig. S13 image. Panels (A) and (B): x-axis, Years (g=3, µ=5.4×10^-9); curves for Syl-W1, Syl-W2, Syl-E1, and Syl-E2, and for the pairs Syl-W1/Syl-W2, Syl-E1/Syl-E2, Syl-E1/Syl-W1, Syl-E1/Syl-W2, Syl-E2/Syl-W1, and Syl-E2/Syl-W2. Panel (C): Ne (×10^3) against time (×10^3 years ago) for Syl-E1, Syl-E2, Syl-W1, and Syl-W2.]

Fig. S13. Demographic history of V. sylvestris grapevines. (A) Representative demographic
histories of V. sylvestris populations from 10^7 to 10^3 years ago inferred with MSMC2. Each line
shows the estimate from eight haplotypes of four accessions. (B) Representative split lines among
V. sylvestris populations based on relative cross-coalescence rate (RCCR) analyses with
MSMC2. (C) Demographic histories of V. sylvestris populations inferred with Stairway Plot 2.
Red line: median of 200 inferences. Black line: 75% confidence interval. Grey line: 95%
confidence interval. Syl-W, V. sylvestris western ecotype; Syl-E, V. sylvestris eastern ecotype;
CG, cultivated grapevine.
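
The axes of the MSMC2 panels are given in years under g=3 and µ=5.4×10^-9. As a rough sketch of the conventional rescaling of MSMC2 output (scaled time to years, coalescence rate to effective population size, and the three coalescence rates to RCCR), the helpers below may be useful. The function names are hypothetical, and the sketch assumes the standard MSMC2 output columns (time boundaries scaled by the mutation rate, and the rates lambda_00, lambda_01, lambda_11); it is not the script used for the figure.

```python
MU = 5.4e-9   # mutation rate per site per generation, from the axis labels
GEN = 3.0     # generation time in years, from the axis labels


def scaled_time_to_years(t_scaled, mu=MU, gen=GEN):
    """Convert a mutation-scaled time boundary to years."""
    return t_scaled / mu * gen


def lambda_to_ne(lam, mu=MU):
    """Convert a scaled coalescence rate to an effective population size."""
    return (1.0 / lam) / (2.0 * mu)


def rccr(lambda_00, lambda_01, lambda_11):
    """Relative cross-coalescence rate from within- and cross-population rates."""
    return 2.0 * lambda_01 / (lambda_00 + lambda_11)


if __name__ == "__main__":
    # Illustrative numbers only; not values from the study.
    print(scaled_time_to_years(5.4e-5))
    print(lambda_to_ne(1000.0))
    print(rccr(2.0, 1.0, 2.0))
```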

Fig. S14. Ecological niche modeling of the suitable habitats for V. sylvestris ecotypes. The
times are at the Pleistocene Last Interglacial (~130 Kya), the Last Glacial Maximum (~21 Kya),
and early Holocene (~11.7-8.3 Kya). The color scale shows suitability score.

[Fig. S15 panels; recoverable labels and values follow. Tree tips in all models: Syl-E1, CG1, CG2, Syl-E2.]

(A) Models without gene flow (n=5 runs):
Model 1: Dual origin. AIC: 82834241.44 ± 17.11
Model 2: Single origin from Syl-E1. AIC: 84987179.61 ± 3.72
Model 3: Single origin from Syl-E2. AIC: 83196848.50 ± 9.52

(B) Models with one gene flow (n=5 runs):
Model 1: Dual origin, CG1 to CG2. AIC: 82691092.94 ± 24.77
Model 2: Dual origin, CG2 to CG1. AIC: 82141082.18 ± 151.98
Model 3: Dual origin, Syl-E1 to CG2. AIC: 82824177.83 ± 6.34
Model 4: Dual origin, Syl-E2 to CG1. AIC: 82152522.60 ± 17.81
Model 5: Single origin from Syl-E1, Syl-E2 to CG2. AIC: 83102162.81 ± 3.93
Model 6: Single origin from Syl-E1, Syl-E2 to CG1. AIC: 84968655.26 ± 0.41
Model 7: Single origin from Syl-E2, Syl-E1 to CG1. AIC: 82147876.48 ± 75.69
Model 8: Single origin from Syl-E2, Syl-E1 to CG2. AIC: 83196842.76 ± 1.58

(C) Outgroup f3 biplots: f3 (Syl-W1,X; Rotund) against f3 (Syl-W2,X; Rotund); f3 (Syl-E1,X; Rotund) against f3 (Syl-E2,X; Rotund); f3 (CG1,X; Rotund) against f3 (CG2,X; Rotund).

Fig. S15. Dual domestication of CG1 and CG2. (A, B) Phylogenetic model comparison
without gene flow (A) or with one gene-flow event (B) using Momi2 supports a dual origin of CG1
and CG2, which has the lowest AIC values. (C) Outgroup f3 statistics biplots measuring genetic
similarity between CGs, Syl-W, and Syl-E. Rotund, Muscadinia rotundifolia. Stars mark the f3
statistics for the Syl-W1/Syl-W2, Syl-E1/Syl-E2, and CG1/CG2 pairs, respectively.
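
The model comparison in panel B can be made explicit by ranking the reported mean AIC values and taking the difference from the best model (ΔAIC). The sketch below uses only the mean AIC values printed in panel B; the dictionary keys are shortened labels introduced here for illustration, and the run-to-run standard deviations are ignored.

```python
# Mean AIC values (n=5 runs) as printed in Fig. S15B.
AIC_ONE_GENE_FLOW = {
    "Model 1: dual origin, CG1 to CG2": 82691092.94,
    "Model 2: dual origin, CG2 to CG1": 82141082.18,
    "Model 3: dual origin, Syl-E1 to CG2": 82824177.83,
    "Model 4: dual origin, Syl-E2 to CG1": 82152522.60,
    "Model 5: single origin from Syl-E1, Syl-E2 to CG2": 83102162.81,
    "Model 6: single origin from Syl-E1, Syl-E2 to CG1": 84968655.26,
    "Model 7: single origin from Syl-E2, Syl-E1 to CG1": 82147876.48,
    "Model 8: single origin from Syl-E2, Syl-E1 to CG2": 83196842.76,
}


def rank_by_aic(aic_by_model):
    """Return (model, delta_AIC) pairs sorted from best (lowest AIC) to worst."""
    best = min(aic_by_model.values())
    deltas = {name: value - best for name, value in aic_by_model.items()}
    return sorted(deltas.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, delta in rank_by_aic(AIC_ONE_GENE_FLOW):
        print(f"{delta:14.2f}  {name}")
```

Lower AIC indicates a better trade-off between fit and parameter count, so the model listed first is the preferred one under this criterion.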

[Fig. S16 panels. Four panels, one each for Syl-W1, Syl-W2, Syl-E1, and Syl-E2: curves on a 0.0 to 1.0 scale against years (g=3, µ=5.4×10^-9), 10^3 to 10^6; curve labels CG1 and CG4 are recoverable.]

Fig. S16. Population split between V. sylvestris and V. vinifera. Representative split lines
between each V. sylvestris population and all V. vinifera groups based on relative cross-
coalescence rate (RCCR) analyses from MSMC2.

[Fig. S17 panels. (A) TreeMix trees including M. rotundifolia (M. r.), Syl-W1, Syl-W2, Syl-E1, Syl-E2, and CG groups, drawn against the drift parameter with a migration weight scale (0 to 0.5) and a 10 s.e. bar; residual matrices with scales of ±69.4 SE and ±10.1 SE; plots of mean L(m)±SD and % variance explained against m (migration edges, 0 to 10), with the optimal m marked. (B) TreeMix trees at m=0 and m=4 for Syl-E1, Syl-E2, Syl-W1, Syl-W2, and CG1 to CG6, drawn against the drift parameter with a migration weight scale (0 to 0.5) and a 10 s.e. bar; residual matrices with scales of ±66.4 SE and ±13.9 SE; plots of mean L(m)±SD and % variance explained against m (migration edges, 0 to 10), with the optimal m marked.]

Fig. S17. Introgression of Syl-W and the origination of European grapevines. (A) TreeMix
analysis with M. rotundifolia set as the outgroup, shown with zero (m=0) and five (m=5) migration
edges. The optimal number of migration edges is marked by the red circle. Residual matrices are
shown. Five migration edges increase the proportion of variance explained from 96.9% (m=0) to
99.9%. Overfitting of the tree due to outgroup selection is indicated by a dubious “migration” from
Syl-E1 to M. rotundifolia. (B) TreeMix analysis with Syl-E1 set as the outgroup to avoid
overfitting, shown with zero (m=0) and four (m=4) migration edges. The optimal number of
migration edges is marked by the red circle. Residual matrices are shown. Four migration edges
increase the proportion of variance explained from 90.2% (m=0) to 99.5%.

[Fig. S18 panels. Four panels showing chromosomes 1 to 19, each with a density scale from 0 to 0.6; panel labels CG3, CG4, and CG5 are recoverable.]

Fig. S18. Local introgression tracts of Syl-W in four V. vinifera grapevines. The color scheme
shows the relative density of identified introgression tracts. Each tract contains 50 SNPs.

[Fig. S19 network. Legend entries: C-Admix, Syl-Admix, CG1, CG2, CG3, CG4, Syl-E1, Syl-E2, Syl-W1, Syl-W2. The fv haplotype is labeled.]

Fig. S19. Median-joining network of f and fv sex determination region haplotypes. The fv
haplotype is shown by a square. The f haplotype is shown by a circle.

[Fig. S20 network. Legend entries: C-Admix, Syl-Admix, CG1 to CG6, Syl-E1, Syl-E2, Syl-W1, Syl-W2; size scale 1 and 10. The Mv haplogroup is labeled.]

Fig. S20. Median-joining network of M and Mv sex determination region haplotypes. The
Mv haplotype is shown by a square. The M haplotype is shown by a circle.

[Fig. S21 network. Legend entries: C-Admix, Syl-Admix, CG1 to CG6, Syl-E1, Syl-E2, Syl-W1, Syl-W2; size scale 1 and 10. The H3 haplotype is labeled.]

Fig. S21. Median-joining network of H1 and H3 sex determination region haplotypes. The
H3 haplotype is shown by a square. The H1 haplotype is shown by a circle.


[Fig. S22 network. Legend entries: C-Admix, Syl-Admix, CG1, CG2, CG3, CG4, CG6, Syl-E1, Syl-E2, Syl-W1, Syl-W2; size scale 10. Labels H4 and H5 are recoverable.]

Fig. S22. Median-joining network of H2, H3, and H5 sex determination region haplotypes.

[Fig. S23 legend. SDR genotype classes: f/f, H1/f, H1/H1, H1/H2, H2/f, Other.]




Fig. S23. Distribution of SDR genotypes in the six major grapevine groups.

Fig. S24. Grapevine group CG3 and muscat flavor. (A) Geographic distribution of CG3
grapevines. (B) Identification of SNPs associated with muscat flavor using FastGWA-GLMM.
The significance threshold is set at -log10(p)=6.0. (C) Zoomed-in genomic regions with
significant SNP signatures. Genes closest to the SNPs are colored in red. The non-synonymous
SNP Chr5:19419698 and the corresponding VvDXS gene are shown in blue.

[Fig. S25 panels. (A, B) Counts (0 to 800) and proportions (0 to 100%) of grapevines by berry skin color (Green-Yellow, Rose, Red, Red-Black); legend entries include CG1, CG4, Admix, Table, and Other. (C) Association signal across chromosomes 1 to 19, with a plot of observed against expected -Log10(P). (D) Chromosome 2 regions (Mb) around the SNPs 2:3521538 A/T, 2:5054627 G/T, 2:5116947 G/T, and 2:16051309 T/A; gene models shown: Vvsyl02G000229, Vvsyl02G000303, Vvsyl02G000310, Vvsyl02G000314, and Vvsyl02G001064; gene names shown: VvMybA3, VvMybA1, and VvMybA2. (E) Genotype proportions (0 to 100%) at the four SNPs by berry skin color category and for V. sylvestris.]

Fig. S25. Novel genes associated with berry skin color. (A, B) Categorization of cultivated
grapevine according to berry skin color (green-yellow, rose, red, and red-black). No population
stratification was observed for major groups or grapevine utilization. (C) Identification of SNPs
associated with berry skin color using MLMA-LOCO. The significance threshold is set at
-log10(p)=6.0. Genomic inflation factor λ=1.16. The top SNP signals are shown in the dashed
square. (D) Zoomed-in genomic regions with the top SNP signatures in chromosome 2. Pink
represents non-exonic SNPs. Dark red represents exonic SNPs. Relevant genes closest to the SNPs
are shown. Blue and yellow blocks represent exons and introns, respectively. Four representative
top non-synonymous SNPs are labeled. Alternative splicing transcripts exist for the
Vvsyl02G001064 gene. (E) The association of genotypes for representative SNPs (Ref/Alt) with
berry skin color. V. sylvestris has red berries. Ref: reference allele. Alt: alternative allele.
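
As a sketch of how the two summary quantities in panel C are commonly obtained, the code below computes a genomic inflation factor from a list of association p-values and flags SNPs passing the -log10(p)=6.0 threshold. It assumes two-sided p-values from 1-d.f. tests, back-converted to chi-square statistics; the function names are hypothetical, and this is not the procedure confirmed to have been used with MLMA-LOCO.

```python
import math
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 d.f.


def genomic_inflation(p_values):
    """Median observed chi-square (1 d.f.) divided by its expected median."""
    nd = NormalDist()
    # For a two-sided p-value, |z| = -inv_cdf(p / 2) and chi-square = z^2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


def passing_threshold(p_values, threshold=6.0):
    """Indices of SNPs with -log10(p) at or above the threshold."""
    return [i for i, p in enumerate(p_values) if -math.log10(p) >= threshold]


if __name__ == "__main__":
    # Evenly spaced p-values mimic a null distribution; lambda should be near 1.
    null_p = [(i + 0.5) / 1000 for i in range(1000)]
    print(round(genomic_inflation(null_p), 3))
    print(passing_threshold([1e-7, 1e-5, 0.5]))
```

A value of λ close to 1 indicates little inflation of test statistics; values above 1, such as the reported 1.16, indicate some excess of small p-values across the genome.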

Fig. S26. The impact of various generation times on the population split times inferred by
MSMC2. Estimated split times of the Syl-E1/CG1 and Syl-E2/CG2 population pairs, taken at a
relative cross-coalescence rate of 0.5 in MSMC2 analyses. Four haplotypes were used from each
population, with 100 runs for each comparison. The same comparisons were used across the five
generation times. Red bars, median value with 95% confidence interval.
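
To illustrate why the inferred split time depends on the assumed generation time, the sketch below locates the point where an RCCR curve first reaches 0.5 by linear interpolation and converts it to years. The function names, the toy RCCR curve, and the list of generation times are assumptions for illustration (the five generation times used in the figure are not listed in the caption); µ=5.4×10^-9 is taken from the axis labels of the MSMC2 figures.

```python
MU = 5.4e-9  # mutation rate per site per generation, from the MSMC2 axis labels


def split_time_scaled(times, rccr_values, level=0.5):
    """Mutation-scaled time at which RCCR first reaches `level`.

    `times` must be sorted from recent to ancient; the crossing is found by
    linear interpolation between adjacent time points.
    """
    for i in range(1, len(times)):
        lo, hi = rccr_values[i - 1], rccr_values[i]
        if lo < level <= hi:
            frac = (level - lo) / (hi - lo)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("RCCR never crosses the requested level")


def split_time_years(t_scaled, gen, mu=MU):
    """Convert a mutation-scaled split time to years for generation time `gen`."""
    return t_scaled / mu * gen


if __name__ == "__main__":
    # Toy curve and hypothetical generation times; not values from the study.
    times = [1e-5, 2e-5, 3e-5]
    curve = [0.1, 0.4, 0.8]
    t_split = split_time_scaled(times, curve)
    for gen in (1, 2, 3, 4, 5):
        print(gen, round(split_time_years(t_split, gen)))
```

Because the conversion is linear in the generation time, doubling the assumed generation time doubles the split time in years while leaving the mutation-scaled estimate unchanged.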

[Fig. S27 panels; recoverable labels and values follow. Time labels in both panels: ~11,000 Kya and ~8,000 Kya.]

(A) CG1 (n=5 runs):
Model 1: Syl-E1 as progenitor. Tree tips: Syl-E1, CG1, Syl-E2. AIC: 63399249.15 ± 19.74
Model 2: Syl-E1 related ghost population as progenitor. Tree tips: Syl-E1, Ghost1, CG1, Syl-E2. AIC: 63650398.33 ± 62.67

(B) CG2 (n=5 runs):
Model 1: Syl-E2 as progenitor. Tree tips: Syl-E1, CG2, Syl-E2. AIC: 61301067.74 ± 2.37
Model 2: Syl-E2 related ghost population as progenitor. Tree tips: Syl-E1, CG2, Ghost2, Syl-E2. AIC: 61583922.62 ± 80.41

Fig. S27. Momi2 inference of trees with and without extinct progenitor populations. The
results do not support the descent of CG1 (A) or CG2 (B) from an extinct progenitor
population.

