G3, 2023, 00(0),

https://doi.org/10.1093/g3journal/jkad138
Advance Access Publication Date: 19 June 2023
Genome Report

Downloaded from https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkad138/7202260 by guest on 28 August 2023

The generation of the first chromosome-level de novo genome assembly and the development and validation of a 50K SNP array for the St. John River aquaculture strain of North American Atlantic salmon
Guangtu Gao,1,* Geoffrey C. Waldbieser,2 Ramey C. Youngblood,3 Dongyan Zhao,4 Michael R. Pietrak,5 Melissa S. Allen,6 Jason A. Stannard,6 John T. Buchanan,6 Roseanna L. Long,1 Melissa Milligan,5 Gary Burr,5 Katherine Mejía-Guerra,4 Moira J. Sheehan,4 Brian E. Scheffler,7 Caird E. Rexroad III,8 Brian C. Peterson,5 Yniv Palti1,*

1 USDA-ARS National Center for Cool and Cold Water Aquaculture, 11861 Leetown Road, Kearneysville, WV 25430, USA
2 USDA-ARS Warmwater Aquaculture Research Unit, 141 Experimental Station Road, Stoneville, MS 38776, USA
3 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA
4 Breeding Insight, 119 CALS Surge Facility, Cornell University, 525 Tower Road, Ithaca, NY 14853, USA
5 USDA-ARS National Cold Water Marine Aquaculture Center, 25 Salmon Farm Road, Franklin, ME 04634, USA
6 Center for Aquaculture Technologies, 8395 Camino Santa Fe, San Diego, CA 92121, USA
7 USDA-ARS Genomics and Bioinformatics Research Unit, 141 Experimental Station Road, Stoneville, MS 38776, USA
8 USDA-ARS Office of National Programs, George Washington Carver Center Room 4-2106, 5601 Sunnyside Avenue, Beltsville, MD 20705, USA

*Corresponding author: Email: yniv.palti@usda.gov; *Corresponding author: Email: Guangtu.gao@usda.gov.

Abstract
Atlantic salmon (Salmo salar) in Northeastern US and Eastern Canada has high economic value for the sport fishing and aquaculture industries. Large differences exist between the genomes of Atlantic salmon of European origin and North American (N.A.) origin. Given the genetic and genomic differences between the 2 lineages, it is crucial to develop unique genomic resources for N.A. Atlantic salmon. Here, we describe the resources that we recently developed for genomic and genetic research in N.A. Atlantic salmon aquaculture. Firstly, a new single nucleotide polymorphism (SNP) database for N.A. Atlantic salmon consisting of 3.1 million putative SNPs was generated using data from whole-genome resequencing of 80 N.A. Atlantic salmon individuals. Secondly, a high-density 50K SNP array enriched for the genic regions of the genome and containing 3 sex determination and 61 putative continent of origin markers was developed and validated. Thirdly, a genetic map composed of 27 linkage groups with 36K SNP markers was generated from 2,512 individuals in 141 full-sib families. Finally, a chromosome-level de novo genome assembly from a male N.A. Atlantic salmon from the St. John River aquaculture strain was generated using PacBio long reads. Information from Hi-C proximity ligation sequences and Bionano optical mapping was used to concatenate the contigs into scaffolds. The assembly contains 1,755 scaffolds and only 1,253 gaps, with a total length of 2.83 Gb and N50 of 17.2 Mb. A BUSCO analysis detected 96.2% of the conserved Actinopterygii genes in the assembly, and the genetic linkage information was used to guide the formation of 27 chromosome sequences. Comparative analysis with the reference genome assembly of the European Atlantic salmon confirmed that the karyotype differences between the 2 lineages are caused by a fission in chromosome Ssa01 and 3 chromosome fusions including the p arm of chromosome Ssa01 with Ssa23, Ssa08 with Ssa29, and Ssa26 with Ssa28. The genomic resources we have generated for Atlantic salmon provide a crucial boost for genetic research and for management of farmed and wild populations in this highly valued species.

Keywords: genome map, SNP array, Atlantic salmon, North American, lineage, karyotype

Received: April 04, 2023. Accepted: May 31, 2023

Published by Oxford University Press on behalf of The Genetics Society of America 2023.
This work is written by (a) US Government employee(s) and is in the public domain in the US.

Introduction
Atlantic salmon (Salmo salar) in Northeastern US and Eastern Canada has high economic value for the sport fishing and aquaculture industries and is of significant ecological and social importance (Council 2004). Atlantic salmon of North American (N.A.) origin is also being farmed in Tasmania, Australia (Kijas et al. 2017), and Chile (Yáñez et al. 2016). The historic declines of wild Atlantic salmon populations in the state of Maine, USA, led to the listing of wild or natural Atlantic salmon from 8 rivers as endangered under the Federal Endangered Species Act (Council 2004). Evidence from microsatellite DNA markers has clearly demonstrated that Atlantic salmon of N.A. origin is genetically distinct from European Atlantic salmon (King et al. 2001; Council 2002; King et al. 2005). This led to a judicial ruling that strictly regulated farming of Atlantic salmon in Maine and Eastern Canada, banned the farming of European and genetically modified salmon strains in these areas, and required routine genetic certification of the N.A. aquaculture stocks (Council 2004).

Large differences exist between the genomes of the 2 lineages or subspecies. The haploid chromosome number of the N.A. and European Atlantic salmon is N = 27 and N = 29, respectively (Lubieniecki et al. 2010; Brenna-Hansen et al. 2012). Recent studies that used whole-genome scans with high-density single nucleotide polymorphism (SNP) assays have provided further evidence for the genomic differentiation between the N.A. and European lineages (Lehnert et al. 2020). At the time we started this project, there was only a reference genome for the European lineage (Lien et al. 2016). A reference genome is required for discovering DNA sequence variants among individuals and between populations and for using those variants to develop high-density genotyping assays. The Atlantic salmon genome is large and complex with a genome size estimate of 3.0 Gbp (Hardie and Hebert 2003). Although it is similar in size to that of most mammals, its architecture and organization are more complex. In addition to the ancestral 3R whole-genome duplication shared by all ray-finned fish, the salmonids have undergone a more recent (4R) whole-genome duplication event (Allendorf and Thorgaard 1984; Lien et al. 2016). As a result of the salmonid ancestral 4R whole-genome duplication, 9 pairs of chromosome arms in the Atlantic salmon genome have retained more than 90% sequence similarity to each other with patterns of tetrasomic inheritance. In total, 94.4% of the genome is composed of 98 pairs of collinear blocks known as homeologous chromosome regions (Lien et al. 2016). In addition, nearly 60% of the genome contains repetitive sequences (Lien et al. 2016). To that end, we have recently demonstrated that long-read DNA sequencing and associated assembly algorithms can help to overcome the technical difficulties associated with generating de novo assemblies for the complex and large genomes of salmonid species (Gao et al. 2022).

In addition to a reference genome, SNP arrays are also needed for genetic studies in N.A. Atlantic salmon. High-density SNP arrays are used for collecting large amounts of genome-wide genotype data. This information is useful for dissecting the genetic basis of quantitative traits in agriculture and for implementing models of genomic selection, which has revolutionized the field of selective breeding over the past decade (Meuwissen et al. 2016). High-density SNP arrays are publicly available for Atlantic salmon of European origin (Houston et al. 2014). However, less than half of the SNPs from the high-density SNP arrays designed for the European salmon are informative for N.A. salmon (John Buchanan, unpublished data) (Yáñez et al. 2016). The 2 N.A. populations that were sampled by Yáñez et al. to quantify polymorphism of markers from the array they developed were different from the aquaculture strains used by the USDA breeding program in Maine. In addition, Yáñez et al. reported that only ∼5% of the SNPs from the array they developed for Chilean aquaculture strains of European origin overlapped with SNP arrays that were developed in Europe at the time. Other researchers and commercial breeding companies have developed a 50K SNP chip for N.A. Atlantic salmon using information from the European salmon arrays, but these resources are currently not available to the public due to intellectual property constraints and competitive commercial interests (John Buchanan, unpublished data) (Kijas et al. 2018). Currently, resources such as high-density SNP arrays are lacking for N.A. Atlantic salmon and there is a need for generating a reference genome and an associated high-density genotyping assay.

Recently, we initiated efforts to generate genomic resources for the breeding program at the USDA-ARS National Cold Water Marine Aquaculture Center (NCWMAC) in Franklin, Maine. We used high-coverage whole-genome Illumina resequencing for SNP discovery in 80 fish from 3 distinctive aquaculture stocks of N.A. Atlantic salmon that are propagated in the NCWMAC (Gao et al. 2020). The Gaspe strain is thought to have originated from Atlantic salmon from the Grand Cascapedia river, Gaspé Bay, Quebec (Nash and Waknitz 2003), and propagated for aquaculture on the US west coast before being added as an aquaculture stock to the USDA breeding program in Maine. The Penobscot River strain is thought to have originated from a wild population of N.A. Atlantic salmon in Maine that was spawned at the Craig Brook National Fish Hatchery. The St. John River (SJR) aquaculture strain is widely used for Atlantic salmon farming in North America. Wild fish from tributaries above the Mactaquac Dam (New Brunswick province, Canada) were used to establish the SJR aquaculture strain (Liu et al. 2017). Overall, we discovered 6.6 million SNP markers, including more than 1.5 million markers with a minor allele frequency (MAF) ≥ 0.25. In addition, we identified 5,822 candidate markers that can potentially be used to distinguish between N.A. and European Atlantic salmon by comparing genotypes of the 80 N.A. Atlantic salmon with publicly available whole-genome sequence information from 31 Atlantic salmon representing a diversity of European populations. Although this SNP database provided a foundational resource for designing new SNP genotyping arrays specifically tailored for genomic and genetic analyses in N.A. Atlantic salmon aquaculture, we also found a high percentage of sequence mismatch from the alignment of N.A. salmon sequencing reads to the European salmon reference genome assembly. This indicated that a reference genome from N.A. salmon can improve the quality and genome coverage of SNP discovery in these fish.

To provide a better resource for SNP discovery and genotyping and for other comparative genomics and genetic analyses, we generated and report here the first de novo genome assembly of nearly complete chromosome sequences for the N.A. Atlantic salmon subspecies. The source DNA for the genome assembly was from a male fish from the St. John River aquaculture strain. Advances in long-read DNA sequencing technologies, together with the development of new assembly algorithms and pipelines, have recently enabled improved genome assemblies for complex salmonid genomes like the rainbow trout (Gao et al. 2021). The new genome assembly for Atlantic salmon was generated using PacBio long-read sequencing technology followed by scaffolding with Hi-C contact maps and Bionano optical mapping. The result was a high-quality, contiguous chromosome sequence. The contigs of the new assembly were used as reference to generate a new SNP discovery dataset for N.A. Atlantic salmon from which polymorphic SNP markers that are distributed on all chromosomes were selected for a high-density 50K SNP array. The SNP array also includes sequence probes from the sdY gene as markers for sex determination and a limited set of markers that were used in a proof of concept for preliminary assessment of their usefulness for continent of origin (COO) classification (Europe or North America). The new SNP array was validated by genotyping pedigreed fish from the USDA breeding program and generating a linkage map that facilitated anchoring and ordering of the sequence scaffolds in 27 chromosome sequences. The reference genome and high-density SNP array provide very important resources for more precise breeding in aquaculture, and for comparative genomics research on the ecology and evolution of Atlantic salmon.

Materials and methods

Testing with the King 7 microsatellites
The grandparents and parents of all the Atlantic salmon fish used for SNP discovery, genome sequencing, and SNP array genotyping in this study were tested by the US Fish and Wildlife Service using
the King 7 microsatellites (King et al. 2001; King et al. 2005) and were certified to have genotypes consistent with the N.A. COO.

SNP discovery
The same DNA sequence data reported in our previous SNP discovery effort (Gao et al. 2020) were used in a new SNP discovery analysis to generate the source dataset for the design of a 50K SNP array. The main difference was that here, we used the N.A. Atlantic salmon reference genome assembly from a fish from the SJR strain as the reference for the SNP discovery. All raw sequence data used in the SNP discovery can be found in the NCBI SRA repository (BioProject accession PRJNA559280). Cleaned reads were aligned to an earlier draft of the Canu contig assembly of the N.A. Atlantic salmon genome (final version in GenBank Accession GCA_021399835.1, reported here for the first time) using BWA-MEM with the default parameters (Li 2013). Duplicated alignments were marked using PicardTools v2.19.2 (http://broadinstitute.github.io/picard). Reads in regions of detected indels were realigned to remove alignment artifacts, and base quality scores were recalibrated based on the realignments to generate analysis-ready reads in BAM format using GATK v3.8.1 (Poplin et al. 2018). Raw variants, including SNPs and indels, were generated using the HaplotypeCaller and GenotypeGVCFs tools of GATK v3.8.1 (Poplin et al. 2018). SNPs were retained if they met the following filtering criteria: (1) located more than 5 bp from an indel; (2) QUAL > 30; (3) minimum and maximum read depths of 100 and 3,000, respectively; (4) for each sample, at least 1 read supporting the reference allele and 2 reads supporting the alternative allele; (5) no missing genotype per SNP position; and (6) MAF > 0.3.

Selection of SNPs for the 50K array
The annotation of the European Atlantic salmon reference genome was used as a proxy to locate genic regions in the N.A. salmon genome. Whole-genome alignments between sequence contigs from the N.A. and European Atlantic salmon genomes were generated using minimap2 (Li 2018). Alignments passing the following requirements were retained for downstream analyses: (1) alignments to the 29 chromosomes of the European salmon genome (Lien et al. 2016) and contigs of the N.A. salmon genome assembly that are longer than 5,000 bp; (2) minimum alignment length of 5,000 bp; (3) minimum mapping quality of 40; (4) an N.A. salmon contig has to have the “assigned” European salmon chromosome (using the sum of base pairs aligned to the European genome as a voting system and discarding duplicated or multiple hits with similar alignment results); and (5) only 1 N.A. salmon contig with the best alignment score was kept if multiple contigs aligned to the same region of the European genome. To select 100K SNPs for technical ranking by the SNP array manufacturer, ∼35 SNPs were selected in each 1-Mb window, with a minimum distance between adjacent SNPs of 1.5 kb and with priority given to SNPs in genic regions based on the alignments to the European genome annotation. The dataset of 100K SNPs included 8,654 from 1 of the European-based high-density SNP arrays (Houston et al. 2014) that were found to be of high quality and polymorphic in the N.A. stock of the Mowi ASA aquaculture company. The SNP information for those markers was provided courtesy of Serap Gonen and Matthew Baranski, and the original Affymetrix SNP IDs for those markers were retained in the new 50K SNP array.

The ∼100K SNPs were then evaluated by the array manufacturer (ThermoFisher, USA) for best compatibility with the SNP array platform, from which 49,804 were selected as the core SNPs of the 50K Axiom SNP array. The total number of markers on the array is 49,880, which also includes 72 SNPs that can potentially be used for COO classification, including 44 nuclear markers and 20 mitochondrial markers identified previously (Gao et al. 2020) and 8 nuclear markers that were previously found to be informative for COO classification by the Center for Aquaculture Technologies (unpublished data). Four additional sequence probes from the sdY gene were also added to the array for sex determination analysis. Those probes were previously included and tested in other Axiom SNP arrays for Atlantic salmon (Kijas et al. 2018). The presence of the sequences compatible with the sdY probes would indicate that the fish is a male and the absence that it is a female.

Genotyping and validation of the SNP array and putative COO markers
Fin tissues were collected, and genomic DNA was extracted from 2,512 pedigreed fish from the SJR strain of the USDA breeding program. Additional DNA samples (Supplementary Table 1) were used to evaluate the COO markers in the Axiom array. Genomic DNA samples were fluorescently labeled, hybridized to the SNP array, and scanned for initial genotype calls (Neogen, Lincoln, NE) according to the Axiom genotyping procedures described by Affymetrix. For final genotyping calls and quality control analyses, we used the Affymetrix Power Tools (APT) and SNPolisher software packages following the procedures and practices suggested by the Affymetrix Axiom array genotyping manual and in the same way we previously described for the rainbow trout 57K Axiom array (Palti et al. 2015).

For preliminary evaluation of putative markers that can be used to differentiate COO, we genotyped 60 samples from European origin populations and 14 from N.A. origin. These same samples were part of the panel that was originally used to test the King 7 microsatellites (King et al. 2001; King et al. 2005). Principal components analysis (PCA) was conducted with the program PLINK version 1.9 (Purcell et al. 2007) using genotypes obtained with 61 validated COO SNP markers. The input .ped file was generated from the genotype data with all fish treated as unrelated individuals, and the .map file was formed using all the COO markers in the analysis.

Linkage map
The fish used in the linkage analysis were from the St. John River stock of N.A. Atlantic salmon (NCWMAC-USDA). In total, we genotyped 2,512 fish from 141 full-sib families, including 230 parents and 2,282 offspring. The program Lep-MAP3 (Rastas 2017) was used for generating the linkage map with the same iterative step-wise approach used for the rainbow trout linkage map (Gonzalez-Pena et al. 2016). Briefly, the markers were separated into 27 linkage groups by running the SeparateChromosomes2 module at LOD score 110. More markers were added to the linkage groups by running the JoinSingles2All module with a more relaxed LOD threshold. Finally, markers were ordered in each linkage group by running the OrderMarkers2 module. Linkage groups were aligned with the karyotype chromosomes by comapping the flanking sequences of SNP markers from this linkage map and the map of Brenna-Hansen et al. (2012) to the 27 primary sequence scaffolds of the new N.A. Atlantic salmon genome assembly. Two programs were used to map the flanking sequences of the SNP markers to the genome assembly sequence scaffolds. The program NovoAlign (http://www.novocraft.com/products/novoalign/) with the parameter -t 60 was used for the SNP markers of this linkage map, and Blastn (Zhang et al. 2000) with the parameters of -evalue 1.0e-5, -perc_identity 80, and -qcov_hsp_perc 80 was used for the markers of Brenna-Hansen et al. (2012).
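
To give a sense of scale for a threshold such as the LOD score of 110 used above to separate linkage groups, the following is a minimal sketch of the textbook two-point LOD score for phase-known meioses. It is not Lep-MAP3's internal likelihood, which pools information across families and parents. The function names `two_point_lod` and `min_meioses_for_lod` are hypothetical and used only for illustration.

```python
import math


def two_point_lod(recombinants: int, meioses: int) -> float:
    """Textbook two-point LOD: log10 of L(theta_hat) / L(theta = 0.5).

    Assumes phase-known, fully informative meioses (a simplification,
    not the model used by Lep-MAP3).
    """
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need 0 <= recombinants <= meioses and meioses > 0")
    # Maximum-likelihood recombination fraction, capped at 0.5 (no linkage).
    theta = min(recombinants / meioses, 0.5)
    non_recombinants = meioses - recombinants
    log_l_theta = 0.0
    if recombinants:
        log_l_theta += recombinants * math.log10(theta)
    if non_recombinants:
        log_l_theta += non_recombinants * math.log10(1.0 - theta)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def min_meioses_for_lod(threshold: float) -> int:
    """Fewest recombination-free meioses whose LOD reaches the threshold."""
    # With zero recombinants the LOD equals meioses * log10(2).
    return math.ceil(threshold / math.log10(2))


if __name__ == "__main__":
    print(min_meioses_for_lod(110))
    print(round(two_point_lod(0, 366), 2))
```

Under this simplified model, a LOD of 110 cannot be reached with fewer than 366 fully informative, recombination-free meioses, which illustrates why such a stringent threshold is only practical with a large pedigree such as the 2,282 offspring genotyped here.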

Genome assembly: DNA extraction and sequencing
DNA from a single male (sample NCWMAC SJR YC14-15 Fam3) from the USDA St. John River stock of N.A. Atlantic salmon was extracted using the Qiagen DNeasy kit from a fresh blood sample collected in tri-potassium EDTA. The blood was stored on ice until DNA extraction within 24 hours to enhance the yield of high-molecular-weight DNA for long-read sequencing. The DNA sample was sequenced at the core facility of the University of Delaware (Newark, Delaware, USA) using 29 SMRT cells of the PacBio RS-II system in a continuous long-read (CLR) mode, which generated approximately 312 Gb of sequence data (104× coverage). The PacBio read length N50 was ∼33 kb and the average read length was ∼19 kb. For polishing the genome assembly, ∼2.1 billion paired-end Illumina short reads (2 × 150 bp) were generated by Admera Health (South Plainfield, NJ, USA) from the same DNA sample. The short-read sequences were processed with the program cutadapt to remove the Illumina sequencing adaptors, producing a total of ∼316 Gb (∼105× genome coverage). In general, high genome coverage above 30× is desired for polishing with Illumina reads.

Construction of the assembly contigs
The initial assembly of sequence contigs was generated from the PacBio long reads using the Canu (Version 1.8) assembler (Koren et al. 2017) with the options of correctedErrorRate = 0.085, corMhapSensitivity = normal, and ovlMerDistinct = 0.975. After the “correction” and “trimming” steps, which involve alignments of reads, error correction, trimming of segments of poor sequence quality, and removal of overlapping reads after those corrections, Canu retained 2.64 million high-quality reads with a total length of 103.5 Gb (∼34.5×) from the raw PacBio reads. The trimmed reads were then assembled into 14,845 contigs.

Next, the programs purge_haplotigs (Roach et al. 2018) and purge_dups (Guan et al. 2020) were used to correct the Canu assembly errors and identify regional duplications due to high heterogeneity in some of the genomic regions. First, we ran purge_haplotigs to identify artefactual contigs caused by very low or very high coverage detected by mapping of the original PacBio long reads to the Canu contigs. As a result, 1,394 Canu contigs were characterized as artefacts and thus removed from further analysis. Second, the retained sequences were processed with purge_dups to identify highly overlapping haplotigs. With this step, 2,263 and 11,977 sequences were identified as the primary contigs and haplotigs, respectively. This information was then used in construction of the genome assembly scaffolds.

Construction of the assembly scaffolds and chromosome sequences
To improve the contiguity of the assembly, we first used Bionano optical mapping with the Saphyr platform (Bionano Genomics, San Diego, CA, USA) to scaffold the assembled contigs. High-molecular-weight DNA was extracted in agarose gel plugs from fresh blood of the same Atlantic salmon fish by Amplicon Express (Pullman, Washington), and the extracted DNA molecules were labeled with the Direct Label and Stain (DLS) DNA Labeling kit utilizing the Direct Label Enzyme (DLE-1) (Bionano Genomics, San Diego, CA, USA). A total of 8.5 million DNA molecules were generated with a total length of 1.2 Tb (average length 141 kb). The raw mapping data were processed with Bionano Solve software (Solve3.3_10252018) (https://bionanogenomics.com/support-page/bionano-access-software/) to generate a de novo Bionano assembly guided by the contigs of the PacBio sequence assembly. About 1.9 million molecules with N50 of ∼305 kb (∼150× genome coverage) and average label density of 10 sites per 100 kb were selected from the raw data after the filtering step, and 1,881 Bionano genome maps were generated with a total length of ∼3.09 Gb and N50 of ∼5.1 Mb. The Bionano genome maps and the 14,240 primary contigs and haplotigs were then scaffolded using the Bionano hybrid scaffold pipeline. This resulted in the joining of 2,174 contigs (∼2.61 Gb) into 817 scaffolds (∼2.65 Gb) and breaking of 140 contigs that were considered by the Bionano pipeline to be chimeric joins. The Bionano software utilized 13-bp gaps to indicate the presence of small gaps of undetermined size or undetermined flanking sequence overlaps (Gao et al. 2021), and the sequences flanking these gaps were manually corrected. This resulted in the removal of 21 sequence overlaps, and the total length of the scaffolds was reduced to ∼2.63 Gb with N50 of ∼14.2 Mb. Finally, 1,195 primary contigs (∼0.2 Gb) that were not joined with other sequences and were longer than 3 kb were added to the Bionano scaffolds for further construction of the assembly. Altogether, the Bionano assembly was composed of 2,012 sequence scaffolds with a total length of ∼2.84 Gb and N50 of ∼11.94 Mb.

To improve the accuracy of the Bionano scaffolds, we first used gcpp (http://github.com/PacificBiosciences/GenomicConsensus) to polish the sequences. The PacBio raw reads were first mapped to the Bionano scaffolds using minimap2 (Version 2.15-r915) (Li 2018), and the scaffolds were polished using the “arrow” algorithm in gcpp based on the alignment results. To further polish the scaffolds, the Illumina short-read sequences were aligned to the scaffolds using BWA-MEM (version 0.7.17-r188) (Li 2013) with the default parameters, and the genomic variants were called using the freebayes (version 1.3.1) program (Garrison and Marth 2012) with the default parameters. Variants with a quality score > 1 and the alternative allele in the Illumina sequence reads were identified, and the scaffolds were corrected based on the selected variants using bcftools (Li et al. 2009). This polishing procedure was performed twice to increase the accuracy of the genome assembly.

The Hi-C proximity ligation sequence data were used to generate chromosome-scale scaffolds from the polished Bionano scaffold assembly. Fresh blood (100 µl) was collected in ethanol (750 µl) and immediately placed on dry ice and shipped. The Hi-C library was prepared with the Arima HiC Kit (April 2018) using restriction enzymes that cut at GATC and GANTC recognition sites and sequenced to produce 833 million Illumina paired-end reads (2 × 150 bp) (Arima Genomics, San Diego, CA, USA). The reads were aligned to the Bionano scaffolds, and the alignments were filtered following the Arima-HiC mapping pipeline (http://github.com/ArimaGenomics/mapping_pipeline). Briefly, the forward and reverse sequence reads of the Illumina paired-end data were mapped separately to the Bionano hybrid scaffolds using the BWA-MEM aligner (Li 2013). The chimeric 3′ side of the aligned reads that crossed the ligation junctions was removed, and the reads that were mapped from both ends with the mapping quality score higher than 10 were retained. PCR duplicates in the Hi-C sequences were identified and removed using the Picard Toolkit (http://broadinstitute.github.io/picard/). After these filtering steps, ∼300 million paired read alignments were retained, including ∼79 million interscaffold alignments. The paired-read alignment data were then processed with SALSA (Ghurye et al. 2019), which broke 251 chimeric scaffolds and merged the polished Bionano assembly scaffolds into 1,663 Hi-C scaffolds with a total length of ∼2.83 Gb and N50 of ∼37.9 Mb.
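
The contiguity and coverage figures quoted in the assembly methods above (N50 values, and coverage such as 312 Gb at 104×) follow from two simple calculations. The sketch below is illustrative only: the helper names `n50` and `fold_coverage` are hypothetical, the toy lengths are made up, and the coverage check assumes that the 3.0 Gbp genome size estimate cited in the Introduction is the denominator, which the text does not state explicitly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    toy_scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up lengths
    print(n50(toy_scaffolds))
    # Assumed 3.0 Gbp genome size; 312 Gb of raw PacBio data.
    print(fold_coverage(312e9, 3.0e9))
```

With the assumed 3.0 Gbp genome size, the same division reproduces the ∼34.5× reported for the 103.5 Gb of corrected and trimmed reads.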

Linkage information from the new genetic maps was used to generate chromosome sequences from the Hi-C assembly scaffolds. The program NovoAlign (http://www.novocraft.com/products/novoalign/) was used to map the flanking sequences of each SNP marker to the scaffolds, and the genetic linkage information was used to anchor the scaffolds to chromosomes and then to order, orient, and concatenate the scaffolds into 27 chromosomes. With the guidance of the linkage information, 56 Hi-C scaffolds were broken into 150 smaller scaffolds at 94 Hi-C join sites due to incorrect ordering or wrong orientation of the joined scaffolds, or due to the insertion of a smaller scaffold at the join site of the larger scaffold.

Alignment with the assembly of the European salmon genome
To align the chromosome sequences from the new assembly (GCA_021399835.1) to the chromosome sequences of the European Atlantic salmon genome assembly, repeat-masked chromosome sequences of assembly Ssal_v3.1 (GCA_905237065.2) were used as the reference, and Blastn was used as the aligner with the parameters -evalue 1.0e-100, -word_size 51. All alignments longer than 2,000 bp were selected and plotted with the plot function of the program R.

Results and discussion

New SNP database for N.A. Atlantic salmon
A new SNP database of ∼3.1 million SNPs mapped to contigs from an earlier draft of the new N.A. Atlantic salmon genome assembly (GCA_021399835.1) was generated using GATK v3.8.1 (Poplin et al. 2018) as described in the methods. Of those SNPs, ∼2.3M can be uniquely mapped to the chromosome sequences in the current assembly. The new database of ∼2.3M SNPs uniquely mapped to a single chromosome position is available for download including the chromosome position and alternate alleles for each SNP (Supplementary File 1). To generate this database, we mapped the SNPs in retrospect to the final version of the chromosome sequences. We used the Novoalign program (www.Novocraft.com) to align a 30-bp flanking sequence from each side of the SNP to find exact matches to the genome assembly chromosome sequences. Of the 0.8M SNPs that were not uniquely mapped, 0.2M were mapped to unplaced scaffolds, 0.4M were mapped to multiple locations in the chromosomes (MSVs), and 0.2M were duplicated (PSVs) as each allele mapped uniquely to a different position in the chromosome sequences.

Genotyping and validation with the 50K SNP array
The SNPs were validated with the pedigreed population from the SJR aquaculture strain that was used to produce the linkage map (N = 2,512). The marker categories were based on the Affymetrix Axiom array cluster properties. The total number of putative SNPs placed on the chip was 49,880. A total of 36,206 SNPs (73%) were categorized as of high quality and polymorphic

Table 1. Validation of the SNPs with DNA samples from N.A. Atlantic salmon (N = 2,512). The marker categories are based on the Affymetrix Axiom array cluster properties. The number of SNP markers assigned to each category is listed, followed by the percentage of the total number of markers in parentheses.

Category                        SNP markers
Total SNPs on the array         49,880 (100%)
Call rate < 97%                 5,421 (10.9%)
No minor homozygotes            797 (1.6%)
Off-target variant              25 (0.05%)
Other                           5,586 (11.2%)
Monomorphic high resolution     1,825 (3.7%)
Hemizygous (mitochondrial)      20 (0.04%)
Polymorphic high resolution     36,206 (72.6%)

et al. 2017; Vallejo and Liu et al. 2017). The high polymorphism of the validated SNP arrays as measured by average MAF and heterozygosity in the target aquaculture population is substantially better than what was reported for similar high-density SNP arrays in salmonids (Houston et al. 2014; Palti et al. 2015; Yáñez et al. 2016). Of the 36,206 polymorphic SNPs, 34,302 (95%) had MAF greater than 0.05 in the SJR aquaculture population that was sampled in this study, suggesting very efficient utilization of the genotype data from this array for genomic and genetic analyses in the target population. In comparison, in the array developed for Chilean aquaculture populations, the fraction of validated SNPs with MAF greater than 0.05 in 2 populations of N.A. origin was between 27 and 43% (Yáñez et al. 2016).

COO markers
Forty-one SNPs from the nuclear chromosomes and 20 mitochondrial SNPs were shown to be potentially useful for COO differentiation using 60 samples from European origin populations and 14 from N.A. origin. These same samples were part of the panel that was originally used to test the King 7 microsatellites (King et al. 2001; King et al. 2005). The 2,512 SJR aquaculture strain fish of N.A. origin that were used for general validation of the SNP array quality and for generating the linkage map were also used to validate the COO markers. The populations sampled are listed in Supplementary Table 1, and the markers validated for COO identification, including the allele associated with each COO and their genome positions, are listed in Supplementary Table 2. The nuclear COO markers were not randomly distributed in chromosomes. The flanking sequences of 39 of the 41 COO nuclear markers were mapped to 13 of the 27 chromosomes with 7 and 6 markers mapped to chromosomes 12 and 14, respectively (Supplementary Table 2). This nonrandom distribution may indicate association of genes on those 2 chromosomes with the genetic differentiation between the 2 lineages of Atlantic salmon. However, in contrast to our findings of enriched presence of COO-associated SNP markers on chromosomes 12 and 14,
and an additional 1,825 SNPs (4%) as of high quality but mono- Lehnert et al. (Lehnert et al. 2020) identified genomic regions
morphic, using the default quality filtering of the Affymetrix associated with high differentiation between the 2 lineages on
SNPolisher software. The number of SNPs passing the quality fil- chromosomes 06, 13, 16, and 19. Therefore, a much more compre-
tering steps is summarized in Table 1. Among the polymorphic hensive evaluation of candidate COO SNPs that are distributed on
high-resolution markers, the average and median MAF were 0.32 all chromosomes with a larger panel of samples representing mul-
and 0.34, respectively. The marker distribution by MAF is shown tiple wild and aquaculture populations from Europe and North
in Fig. 1. The average and median marker heterozygosity were America is needed towards developing a reliable set of COO
40 and 45%, respectively. A high rate of polymorphism is very im- SNPs for genotyping assay. However, this type of comprehensive
portant for genome scans to map QTL and for the utilization of evaluation of candidate COO SNPs is beyond the proof of concept
marker information in genomic selection (Vallejo and Leeds we conducted in this study.

Fig. 1. Polymorphism of SNPs from the Axiom 50K array. MAF distribution among the ∼36K of polymorphic high-resolution SNPs.
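
The MAF and heterozygosity summaries reported for the array can be illustrated with a minimal sketch. It assumes genotypes are coded as 0, 1, or 2 copies of one allele with None for a no-call; the function names `allele_stats` and `fraction_above` are hypothetical and are not part of the Axiom or SNPolisher software.

```python
def allele_stats(genotypes):
    """Return (MAF, observed heterozygosity) for one biallelic SNP.

    genotypes: list of 0, 1, or 2 (copies of one allele), or None for a
    no-call. This coding is an assumption made for the illustration.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq = sum(called) / (2 * len(called))      # frequency of the counted allele
    maf = min(freq, 1.0 - freq)                 # minor allele frequency
    het = sum(1 for g in called if g == 1) / len(called)
    return maf, het


def fraction_above(mafs, threshold=0.05):
    """Fraction of SNPs whose MAF is strictly greater than the threshold."""
    return sum(1 for m in mafs if m > threshold) / len(mafs)


# Toy genotypes for two SNPs (not data from the study).
print(allele_stats([0, 1, 2, 1]))
print(allele_stats([0, 0, 0, 1, None]))

# The percentage quoted in the text: 34,302 of 36,206 polymorphic SNPs.
print(round(34302 / 36206 * 100))
```

The last line reproduces the rounding of 34,302 out of 36,206 to the 95% quoted in the text.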

One of the N.A. origin fish from Gander—Jonathan’s Brook (sample NF2-015) had European mitochondrial genotypes and N.A. nuclear genotypes, indicating possible introgression of the European mitochondrial genome due to ancestral hybridization. Evidence for introgression of European genotypes into N.A. populations was previously reported for Atlantic salmon in Newfoundland (Bradbury et al. 2015). Also, European mtDNA alleles have been confirmed in several Newfoundland populations of Atlantic salmon (Ian Bradbury, personal communication).
The results of the principal component analysis we conducted using genotypes from the 61 validated COO markers are shown in Fig. 2. The reference samples were clearly separated into 2 clusters in concordance with their COO (Fig. 2a), except for the single sample from Gander—Jonathan’s Brook described above, because its nuclear markers had N.A. genotypes and its mitochondrial markers had European genotypes. As expected, the 2,512 samples from the aquaculture SJR strain also clustered with the reference samples from N.A. origin (Fig. 2b). Among the reference samples, the primary component that differentiated the European origin salmon from the N.A. origin accounted for 94% of the total variance, and after adding the 2,512 SJR fish to the analysis, the primary component still explained ∼83% of the variance.

Sex determination markers
Three markers derived from Atlantic salmon sdY sequences (Kijas et al. 2018) were placed on the array in duplicate (i.e. 2 Probset IDs for each SNP ID). The positions of the flanking sequences in the sdY gene sequence, Probset IDs, and the flanking sequences are listed in Supplementary Table 3. The alternative alleles for SNP ID Affx-158802265, Affx-158802258, and Affx-158802275 were C/T, C/T, and G/T, respectively. Homozygous T alleles correlated with the absence of the sdY sequence (female), and homozygous alternative alleles or heterozygous genotypes correlated with sdY presence (male). To evaluate the accuracy of the sex determination markers, we genotyped 135 female and 95 male parents of the families used for the linkage analysis and found 100% concordance of the sex phenotype with the sdY marker genotypes.

Linkage map
A total of 36,158 polymorphic high-resolution markers were mapped to 27 linkage groups with an average of 1,339 markers per group (Supplementary File 2). The number of markers per linkage group ranged from 712 to 2,224. The average genetic length per chromosome in cM was 63 and 91 with a minimum length of 36 and 60 and maximum length of 129 and 127 for the male and female maps, respectively. The distribution of markers in linkage groups labeled to match the nomenclature of the European Atlantic salmon karyotype is shown in Fig. 3. Similar to previous reports for Atlantic salmon and rainbow trout (Lien et al. 2016; Pearse et al. 2019), the pattern of genetic distance in the paternal map was composed of high levels of recombination in the telomeric regions and high interference in the centromeric region and throughout most of the chromosome. Conversely, the typical maternal pattern of recombination in acrocentric chromosomes demonstrated higher rates of recombination near the centromere and then a steady level of recombination throughout the rest of the chromosome that gradually decreased to a plateau in the telomere region. In the metacentric chromosome, the maternal pattern demonstrated strong recombination interference in the centromeric region and then a similar pattern to the acrocentric chromosome in the direction of either telomere whereas the paternal chromosome was demonstrated to have high levels of interference except near both telomeres (Fig. 4).

Genome assembly
The genome assembly presented here is based on DNA from a single male from the USDA St. John River stock of N.A. Atlantic salmon that was sequenced using PacBio long-read sequencing technology. The PacBio sequences were assembled into contigs with Canu, and those contigs were joined into scaffolds using Bionano optical mapping and Hi-C proximity ligation sequence data. Using genetic linkage information, the scaffolds from the Hi-C assembly were then anchored to and ordered on chromosomes to generate the chromosome sequences of the USDA_NASsal_1.1 assembly (GCA_021399835.1).
The Canu sequence assembly was composed of 14,845 contigs with a total length of ∼3.97 Gb, N50 of 1.45 Mb, and BUSCO score of 95.8% (using actinopterygii_odb9 as the lineage dataset). Because the estimated genome size of Atlantic salmon is ∼3.0 Gb (Lien et al. 2016), it was clear that heterozygosity in this individual genome caused duplication of haplotypes for some of the heterozygous genomic regions. After the purging steps removed duplicated haplotypes resulting from heterozygous sequences, the BUSCO score was improved slightly to 96.0% but the total length

Fig. 2. Principal components analysis with the 61 validated COO SNP markers. a) Reference samples of known origin [NA or Europe (EU)]. b) The same
reference samples with the samples from the aquaculture SJR strain fish that were used for the SNP array validation and for generating a genetic linkage
map. The dot in the middle at the bottom of both panels a and b represents one of the N.A. origin fish from Gander—Jonathan’s Brook (sample NF2-015),
which had European mitochondria genotypes and N.A. nuclear genotypes, indicating possible introgression of the European mitochondrial genome due
to ancestral hybridization.
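
As an illustration of how a primary component and its share of the total variance can be obtained from marker genotypes, the following is a minimal pure-Python sketch that uses power iteration. It is not the software used in this study; the 0/1/2 genotype coding, the function name `first_pc`, and the toy samples are assumptions made for the example.

```python
import math


def first_pc(genotypes, n_iter=100):
    """Return (PC1 score per sample, fraction of total variance on PC1).

    genotypes: list of samples, each a list of 0/1/2 allele counts per marker
    (an assumed coding; no missing values are handled in this sketch).
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    total = sum(x * x for row in centered for x in row)
    if total == 0:
        raise ValueError("genotypes have no variance")

    def project(vec):
        # Score of every sample along the direction vec.
        return [sum(x * w for x, w in zip(row, vec)) for row in centered]

    # Power iteration on X^T X, applied as v <- X^T (X v), then renormalized.
    vec = [1.0 / (j + 1) for j in range(m)]  # arbitrary deterministic start
    for _ in range(n_iter):
        scores = project(vec)
        new = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the data")
        vec = [x / norm for x in new]

    scores = project(vec)
    return scores, sum(s * s for s in scores) / total


# Toy reference panel (hypothetical genotypes at 4 markers, 3 fish per group).
europe = [[2, 2, 0, 0], [2, 2, 0, 1], [2, 1, 0, 0]]
north_america = [[0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2]]
scores, explained = first_pc(europe + north_america)
print([round(s, 2) for s in scores], round(explained, 2))
```

In this toy panel the two groups receive scores of opposite sign on the first component, which is the kind of separation summarized in panels a and b.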

of the assembly contigs at ∼3.89 Gb was still greater than the expected genome size. More detailed statistics for each step in the assembly pipeline are shown in Table 2. The Bionano hybrid scaffolding pipeline used the optical map to join neighboring or overlapping Canu contigs and to break chimeric sequence contigs that were misassembled by Canu. The final Bionano scaffolded assembly included 2,012 scaffolds of joined primary contigs and haplotigs and primary contigs that could not be joined into scaffolds. After this step, the total length of the assembly was finally reduced to within the expected genome size at ∼2.83 Gb and the N50 increased to 11.9 Mb (Table 2). The Hi-C ligation proximity mapping further reduced the number of scaffolds to 1,633 and increased the N50 to 37.9 Mb and the BUSCO score to 96.2%, with negligible impact on the total length of the assembly (Table 2).
Using information from the new linkage map to anchor the scaffolds to karyotype chromosomes broke some of the largest scaffolds due to their alignment with different linkage groups. Overall, 633 scaffolds were anchored to 27 linkage groups with 914 spanned gaps and 606 unspanned gaps to generate 27 chromosome sequences with a total length of ∼2.53 Gb, or 89% of the total genome assembly length. The distribution of chromosomes by physical length in base pairs from this assembly is shown in comparison with the chromosomes of the European Atlantic salmon reference genome assembly in Fig. 5. The total length of the 1,122 unplaced scaffolds that are not anchored to a chromosome is ∼300 Mb. The total number of scaffolds in the chromosome-level assembly for the N.A. Atlantic salmon is therefore 1,755, total length is ∼2.83 Gb, total ungapped sequence length is ∼2.79 Gb, and scaffold N50 and L50 are 16.2 Mb and 42, respectively.
A comparison of the assembly statistics between this assembly and the 2 annotated reference genome assemblies for the European Atlantic salmon lineage is shown in Supplementary Table 4. The contiguity of this assembly is much better than the previous assembly (GCA_000233375.4) (Lien et al. 2016) which had scaffold N50 and L50 of ∼0.6 Mb and 594, respectively. However, the contiguity of the new reference for the European Atlantic salmon lineage genome assembly (GCA_905237065.2) with scaffold N50 and L50 of 28 Mb and 33, respectively, is better than the assembly we present here. It also has a nearly perfect BUSCO score showing 99.3% of complete protein-coding genes detected compared to 96.2% in this genome assembly for the N.A. Atlantic salmon lineage.

SNP array positions on the chromosome sequences
The genome positions of the SNPs from the Axiom 50K array that were mapped to unique positions of the N.A. Atlantic salmon chromosome sequences with no indels and including 60 bp flanking sequences for each SNP are available in Supplementary File 3. The distribution of the 50K SNPs on the chromosomes of the genome assembly is shown in Supplementary Fig. 1. The average number of SNP markers per chromosome is 1,725 with a minimum of 920 and a maximum of 2,885. Overall, the total number

Fig. 3. Distribution of the linkage map markers in chromosomes. The chromosome axis is in the order of linkage groups from LG1 to LG27. Chromosomes
are labeled to match the nomenclature of the European Atlantic salmon karyotype.

of SNPs per chromosome is highly correlated with the physical size of the chromosomes (Pearson r = 0.96) in the new genome assembly we generated for N.A. Atlantic salmon. An additional 188 SNPs were mapped to unplaced scaffolds or contigs. The SNP marker density per 1 Mb in each of the 27 chromosomes is diagrammed in Fig. 6. The marker density range is from zero to 40 per 1 Mb. The marker density fluctuates within the chromosomes with lower density typically near the centromeres and telomeres. This detailed characterization of marker density on each chromosome is important for assessing the utility of the array in genome scans for QTL and for other genomic analyses. The overall distribution of the array SNPs on chromosomes is highly correlated with the chromosomes’ length, which is similar to what was reported for high-density arrays that were developed for the European lineage (Houston et al. 2014; Yáñez et al. 2016).

Comparison between European and N.A. genome assemblies
Similar to rainbow trout’s plasticity in chromosome number between populations which correlates with geographic distribution (Thorgaard 1983; Gao et al. 2021), the Atlantic salmon genome also exhibits tolerance to chromosomal rearrangements. While other karyotypes have been reported for N.A. Atlantic salmon from other populations with chromosome numbers varying from 2N = 54 to 2N = 58 (see discussion in Brenna-Hansen et al. 2012), here, we found unequivocal evidence that supports 2N = 54 in the SJR strain individual fish that we used for the genome assembly presented in this report. The results of our linkage analysis coupled with the de novo assembly of large sequence scaffolds confirmed the chromosomal rearrangements between N.A. and European Atlantic salmon proposed previously by Brenna-Hansen et al. (2012) based on their own linkage analysis and karyotyping.
The chromosome sequences from this assembly of the N.A. salmon genome (USDA_NASsal_1.1; GCA_021399835.1, N = 27) were aligned with the chromosome sequences of the reference genome from the European lineage (Ssal_v3.1; GCA_905237065.2, N = 29) to identify and confirm the chromosomal rearrangements. Graphic representations of the alignments for all 27 chromosomes can be found in Supplementary File 4. For 23 of the 27 chromosomes, we found simple collinear sequence alignments indicating highly similar order and direction of those chromosome sequences and the genes that they carry as illustrated in Supplementary Fig. 2. The rearrangements we found in the remaining 4 chromosomes can be explained by 4 major separate events. The first was fission in the centromere region of chromosome Ssa01 resulting in the separation of chromosome arms Ssa01q and Ssa01p. The second was the translocation and fusion of Ssa01p with Ssa23 resulting in Chr 23 of the N.A. lineage genome assembly.
The evidence from sequence alignments for those 2 chromosomal rearrangements is illustrated in Fig. 7. According to Brenna-Hansen et al. (2012), this is the most common rearrangement between salmon of European and N.A. origins as all the chromosome karyotypes that they have observed from N.A. salmon were homozygous for the translocation of Ssa01p to Chr 23. However, more recent studies have documented differences in the Ssa01/23 translocation in North America in specific geographic regions where secondary contact between European and N.A. Atlantic salmon has occurred (Lehnert et al. 2019; Watson et al. 2022). The third rearrangement event is the formation of Chr 08 in the N.A. genome from the fusion of the acrocentric chromosomes Ssa08 and Ssa29 (Fig. 8). The fourth is the fusion of Ssa26 and Ssa28 to form Chr 26 in the N.A. genome assembly (Fig. 9). Karyotype preparations and other evidence from molecular markers revealed that variation in the latter 2 chromosomal fusions (Ssa08/29 and Ssa26/28) can be found in a wider range of Atlantic salmon populations from the N.A. lineage (Brenna-Hansen et al. 2012; Lehnert et al. 2019; Wellband et al. 2019). Hence, it is likely that those 2 chromosomal fusion events are more recent than the translocation event of Ssa01p to Chr 23 in N.A. Atlantic salmon.
Smaller scale rearrangements in the form of inversions were recently reported in Atlantic salmon (Stenløkk et al. 2022). Based on our DNA sequence alignments between the genome assembly reported here and the European reference genome assembly, we detected 2 of those inversions and found them to be larger than what was reported by Stenløkk et al. (2022), with the size of 7 and 4 Mb for the inversions on Ssa09 (just below 100M) and Ssa18 (just below 80M), respectively (Supplementary File 4). We were able to confirm that those 2 inversions between the 2 genome assemblies are likely to be real as they are encompassed within a single contig in both assemblies, and thus not likely to be due to scaffolding errors. In addition, there is an optical appearance of a very small inversion on Ssa16 at ∼60 Mb as shown in

Fig. 4. Genetic distance is plotted against physical distance to illustrate the differences in the typical recombination patterns between males and females
in salmonids. In the acrocentric chromosome, a) the paternal chromosome displays almost no recombination except for the area proximal to the q-arm
telomere. Conversely, the maternal pattern displays a higher rate of recombination near the centromere, a continuous even level of recombination
throughout the rest of the chromosome, and gradual decrease to a plateau near the telomere. In the metacentric chromosome, b) the maternal
chromosome displays strong recombination interference in the centromeric region and then a similar pattern to the acrocentric chromosome in the
direction of either telomere, while the paternal pattern is consistent with high interference in recombination except for elevated levels of recombination
near both telomeres.
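
A plot of genetic distance against physical distance can be converted into local recombination rates by taking the slope between neighboring markers. The sketch below assumes each marker is given as a (physical position in bp, cumulative map position in cM) pair sorted by position; `interval_rates` is a hypothetical helper and the toy values are not taken from the linkage map.

```python
def interval_rates(markers):
    """Return (interval midpoint in bp, recombination rate in cM/Mb) pairs.

    markers: list of (position_bp, map_position_cM) tuples sorted by
    physical position along one chromosome (an assumed input layout).
    """
    rates = []
    for (bp0, cm0), (bp1, cm1) in zip(markers, markers[1:]):
        if bp1 <= bp0:
            continue  # markers at the same physical position carry no slope
        megabases = (bp1 - bp0) / 1_000_000
        rates.append(((bp0 + bp1) / 2, (cm1 - cm0) / megabases))
    return rates


# Toy chromosome: no recombination in the first Mb, 3 cM in the second Mb.
toy = [(0, 0.0), (1_000_000, 0.0), (2_000_000, 3.0)]
print(interval_rates(toy))
```

A flat stretch of the curve (a rate near zero) corresponds to the regions of strong interference described in the caption, and a steep stretch corresponds to the regions of elevated recombination.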

Supplementary File 4. However, this is not a true inversion, but rather an optical illusion caused by the proximity (∼150 kb) of 2 duplicated sequences on both genomes.

Contiguity of scaffolds in homeologous chromosome arms
The Atlantic salmon genome contains 9 pairs of chromosome arms with very high sequence similarity due to residual tetraploidy (Lien et al. 2016). It is nearly impossible to distinguish alleles from homeologous tetraploid regions, and all salmonid genomes share this problem (Allendorf et al. 2015). In past de novo genome assemblies of salmonid genomes that were based on short reads (Berthelot et al. 2014; Lien et al. 2016; Pearse et al. 2019), the homeologous or duplicated chromosome arms were more fragmented than the rest of the genome because of those nearly identical stretches of sequence that could not be resolved with short reads. Hence, the expectation would be that the average number of scaffolds in those hard-to-resolve chromosome arms will be higher than the rest of the genome. However, the average number of scaffolds in the 18 duplicated chromosome arms in this assembly was 15 (Supplementary Table 5) compared to the overall average of 18 for all chromosome arms in the assembly, suggesting similar contiguity with the rest of the genome. The total number of scaffolds in chromosome sequences in this assembly was 633, and the overall average was calculated based on the 50 chromosome arms identified by Lien et al. (2016). Therefore, it appears that the problem of assembling those homeologous genome regions can largely be overcome with reads that are long enough to cover

Table 2. Statistics of the N.A. Atlantic salmon genome assembly at each step of the assembly pipeline.

Feature                       Canu contigs   Purge duplicates   Bionano scaffolds   Hi-C scaffolds   Chr. sequences (a)
Number of scaffolds/contigs   14,845         14,240             2,012               1,633            1,755
Total length (Mb)             3,970          3,889              2,835               2,834            2,829
Maximum length (kb)           24,768         24,768             70,926              115,072          92,895
Minimum length (bp)           1,003          1,012              3,771               3,768            3,768
N50 (kb)                      1,452          1,534              11,936              37,905           16,202
BUSCO (b)                     95.8%          96.0%              95.8%               96.2%            96.2%

(a) Includes the unplaced scaffolds (N = 1,122). The total number of scaffolds mapped to chromosomes was 633. Some of the largest scaffolds were broken by the linkage map information (e.g. 2 parts were mapped to 2 different chromosomes).
(b) BUSCO: Benchmarking Universal Single-Copy Orthologs. Showing the percent of complete protein-coding genes detected (using actinopterygii_odb9 as the lineage dataset).
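
For readers less familiar with the contiguity statistics in the table, N50 is the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the total assembly length, and L50 is the number of sequences needed to get there. A minimal sketch follows; `n50_l50` is a hypothetical helper and the toy lengths are not values from this assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count  # N50 is this length, L50 is how many were used
    raise ValueError("empty length list")


# Toy assembly of five scaffolds totalling 30 units.
print(n50_l50([2, 4, 6, 8, 10]))
```

In the toy case the two longest scaffolds (10 and 8) already cover more than half of the total length of 30, so N50 is 8 and L50 is 2.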

Fig. 5. Distribution of chromosome lengths in base pairs in a) the assembly from N.A. Atlantic salmon (accession GCA_021399835.1) presented in this report vs. b) the reference genome assembly from the European lineage (Ssal_v3.1; GCA_905237065.2).

stretches of DNA sequence that contain enough sequence variants to distinguish between those nearly identical genome regions.

Gene annotation
Gene annotation for this genome assembly accession is not currently available from NCBI. Chromosome sequences or specific regions of interest from the genome assembly can be aligned by Blast searches with the RefSeq-annotated version of the European lineage genome assembly (GCF_905237065.1). For example, the T cell receptor beta 01 region which is found on Ssa01p in the European lineage reference genome has been shown to be translocated to Chr 23 (Ssa01p_Ssa23) in the assembly of the N.A. Atlantic salmon genome presented here (Grimholt et al. 2022). In addition, the alignments of USDA_NASsal_1.1 (GCA_

Fig. 6. Density of markers from the N.A. Atlantic salmon Axiom 50K SNP array per million base pairs throughout the genome assembly chromosomes.
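
A density track of this kind can be produced by counting markers in consecutive 1 Mb windows along each chromosome. The sketch below assumes 1-based SNP coordinates; `marker_density` is a hypothetical helper, and the positions and chromosome length are toy values rather than coordinates from Supplementary File 3.

```python
import math


def marker_density(positions, chrom_length, window=1_000_000):
    """Count markers in consecutive fixed-size windows along one chromosome.

    positions: 1-based SNP coordinates in bp (an assumed convention).
    Returns one count per window; the last window may be shorter than the rest.
    """
    counts = [0] * math.ceil(chrom_length / window)
    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[(pos - 1) // window] += 1
    return counts


# Toy chromosome of 3 Mb with five markers.
print(marker_density([10, 999_999, 1_000_000, 1_000_001, 2_500_000], 3_000_000))
```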

Fig. 7. Chromosome fusions and fissions between the European and N.A. Atlantic salmon lineages. Fission of the European chromosome Ssa01q and p
arms and fusion of Ssa01p with Ssa23 in the N.A. Atlantic salmon genome. Sequences were compared using data from the reference assembly for the
European lineage in NCBI (Ssal_v3.1; GCA_905237065.2).

021399835.1) to the annotated Ssal_v3.1 (GCF_905237065.1) genome assembly are now available in the NCBI remap menu at https://www.ncbi.nlm.nih.gov/genome/tools/remap. The USDA assembly can be used as the source assembly and the Ssal_v3.1 as the target assembly. After pasting or uploading the chromosome coordinates of interest from the source assembly, the user can choose the output format of interest. The output file in genome workbench format can be used to view the aligned chromosome region or regions of choice in a genome browser. In addition, a first pass gene annotation can be downloaded from Ensembl Rapid Release at https://rapid.ensembl.org/Salmo_salar_GCA_021399835.1/Info/Index.

Fig. 8. Fusion of the European lineage chromosomes Ssa08 and Ssa29. Sequences were compared using data from the reference assembly for the
European lineage in NCBI (Ssal_v3.1; GCA_905237065.2).

Fig. 9. Fusion of the European lineage chromosomes Ssa26 and Ssa28. Sequences were compared using data from the reference assembly for the
European lineage in NCBI (Ssal_v3.1; GCA_905237065.2).
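
The chromosome count implied by the 4 rearrangement events can be checked with a short bookkeeping sketch. It starts from the 29 European chromosomes and applies the fission and the 3 fusions described in the text. Chr 23, Chr 08, and Chr 26 are the names used in the text; keeping the European names for all other chromosomes is an assumption made only for this illustration.

```python
# European lineage karyotype: Ssa01 through Ssa29 (N = 29).
european = {f"Ssa{i:02d}" for i in range(1, 30)}


def apply_rearrangements(chromosomes):
    """Apply the fission of Ssa01 and the three fusions named in the text."""
    chroms = set(chromosomes)
    # Event 1: fission of Ssa01 at the centromere into its q and p arms.
    chroms.remove("Ssa01")
    chroms.update({"Ssa01q", "Ssa01p"})
    # Events 2 to 4: each fusion replaces two chromosomes with one.
    fusions = [
        ("Ssa01p", "Ssa23", "Chr 23"),
        ("Ssa08", "Ssa29", "Chr 08"),
        ("Ssa26", "Ssa28", "Chr 26"),
    ]
    for first, second, fused in fusions:
        chroms.remove(first)
        chroms.remove(second)
        chroms.add(fused)
    return chroms


north_american = apply_rearrangements(european)
print(len(european), len(north_american), 2 * len(north_american))
```

One fission adds a chromosome and three fusions each remove one, so the count goes from 29 to 27, which matches N = 27 and 2N = 54 for the SJR strain.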

Location of sdY in the genome assembly
The sequence of the sex determination gene (sdY) can be mapped by sequence alignment to the last scaffold on the telomeric end of chromosome 3 in the reverse strand between positions 104,938,219 and 104,933,632. Hence, like the current European salmon reference genome assembly (GCA_905237065.2), this genome assembly also provides a reference to the sdY lineage located on chromosomes 3 or 6, in contrast to the sdY lineage located on chromosome 2 (Kijas et al. 2018).

Conclusions
We report here on the de novo assembly of the first chromosome-level reference genome and the development and validation of the first publicly available high-density SNP array for the N.A. lineage of Atlantic salmon. These genomic resources are required for implementing genomic selection for aquaculture production and are crucial to current research on the genetics, biology, evolution, and ecology of this subspecies that holds high economic and social value in the Eastern United States, Canada, Chile, and Australia.

Data availability
The reference genome is available for downloading and browsing at NCBI (GenBank: GCA_021399835.1; BioProject: PRJNA759632). The PacBio (wgs) and the Illumina (wgs and Hi-C) raw sequence data used for the genome assembly are in NCBI-SRA (Accessions SRX17524033, SRX17524034, and SRX17524035). The new SNP discovery dataset, the linkage map, and the genome positions of the SNPs from the 50K Axiom array are available as supplemental files. Supplemental Material available at figshare: https://doi.org/10.25387/g3.22540894.

Acknowledgements
We thank Kristy Shewbridge for her technical assistance with the preparation of the DNA sample used for genome sequencing. We thank Dr. Serap Gonen and Dr. Matthew Baranski from Mowi for sharing SNP polymorphism information from the European SNP array and for contributing samples of farmed salmon from European origin. We thank Dr. Matthew Kent for sharing the flanking sequence data of the markers used in the linkage map published by Brenna-Hansen et al. (2012). Mention of trade names or commercial products in this publication is solely for the

purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.

Funding
This study was supported by the USDA Agricultural Research Service in-house project numbers 8030-31000-004/005, 8082-31000-012/013, and 6066-31000-016.

Conflicts of interest statement
The authors declare no conflict of interest.

Authors’ contributions
Guangtu Gao: writing—original draft, conceptualization, methodology, formal analysis, data curation, and writing—review & editing. Geoffrey C. Waldbieser: investigation, methodology, resources, and writing—review & editing. Ramey C. Youngblood: investigation, methodology, and formal analysis. Dongyan Zhao: writing—original draft, methodology, and formal analysis. Michael R. Pietrak: project administration, investigation, and validation. Melissa S. Allen: validation and data curation. Jason A. Stannard: project administration, methodology, and validation. John T. Buchanan: methodology and supervision. Roseanna L. Long: investigation and validation. Melissa Milligan: investigation and validation. Gary Burr: investigation and validation. Katherine Mejía-Guerra: formal analysis. Moira J. Sheehan: resources and supervision. Brian E. Scheffler: resources and supervision. Caird E. Rexroad III: conceptualization and resources. Brian C. Peterson: conceptualization, supervision, and resources. Yniv Palti: writing—original draft, conceptualization, methodology, supervision, project administration, resources, and writing—review & editing. All authors read and approved the manuscript.

Literature cited
Allendorf FW, Bassham S, Cresko WA, Limborg MT, Seeb LW, Seeb JE. Effects of crossovers between homeologs on inheritance and population genomics in polyploid-derived salmonid fishes. J Hered. 2015;106(3):217–227. doi:10.1093/jhered/esv015.
Allendorf FW, Thorgaard GH. Tetraploidy and the evolution of salmonid fishes. In: Turner BJ, editor. Evolutionary Genetics of Fishes. New York: Plenum Press; 1984. p. 1–46.
Berthelot C, Brunet F, Chalopin D, Juanchich A, Bernard M, Noël B, Bento P, Da Silva C, Labadie K, Alberti A, et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat Commun. 2014;5(1):3657. doi:10.1038/ncomms4657.
Bradbury IR, Hamilton LC, Dempson B, Robertson MJ, Bourret V, Bernatchez L, Verspoor E. Transatlantic secondary contact in Atlantic salmon, comparing microsatellites, a single nucleotide polymorphism array and restriction-site associated DNA sequencing for the resolution of complex spatial structure. Mol Ecol. 2015;24(20):5130–5144. doi:10.1111/mec.13395.
Brenna-Hansen S, Li J, Kent MP, Boulding EG, Dominik S, Davidson WS, Lien S. Chromosomal differences between European and North American Atlantic salmon discovered by linkage mapping and supported by fluorescence in situ hybridization analysis. BMC Genomics. 2012;13(1):432. doi:10.1186/1471-2164-13-432.
Council NR. Genetic Status of Atlantic Salmon in Maine: Interim Report from the Committee on Atlantic Salmon in Maine. Washington (DC): The National Academies Press; 2002.
Council NR. Atlantic Salmon in Maine. Washington (DC): The National Academies Press; 2004.
Gao G, Magadan S, Waldbieser GC, Youngblood RC, Wheeler PA, Scheffler BE, Thorgaard GH, Palti Y. A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout. G3 (Bethesda). 2021;11(4):jkab052. doi:10.1093/g3journal/jkab052.
Gao G, Pietrak MR, Burr GS, Rexroad CE, Peterson BC, Palti Y. A new single nucleotide polymorphism database for North American Atlantic salmon generated through whole genome resequencing. Front Genet. 2020;11:85. doi:10.3389/fgene.2020.00085.
Gao G, Waldbieser GC, Youngblood RC, Zhao D, Pietrak M, Allen MS, Stannard JA, Buchanan JT, Long RL, Milligan M, et al. The generation of the first chromosome-level de-novo genome assembly and the development and validation of a 50K SNP array for North American Atlantic salmon. bioRxiv: 2022.09.28.509896. https://doi.org/10.1101/2022.09.28.509896, 2022. preprint: not peer reviewed.
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907v2. https://doi.org/10.48550/arXiv.1207.3907, 2012. preprint: not peer reviewed.
Ghurye J, Rhie A, Walenz BP, Schmitt A, Selvaraj S, Pop M, Phillippy AM, Koren S. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput Biol. 2019;15(8):e1007273. doi:10.1371/journal.pcbi.1007273.
Gonzalez-Pena D, Gao G, Baranski M, Moen T, Cleveland BM, Kenney PB, Vallejo RL, Palti Y, Leeds TD. Genome-wide association study for identifying loci that affect fillet yield, carcass, and body weight traits in rainbow trout (Oncorhynchus mykiss). Front Genet. 2016;7(203):203. doi:10.3389/fgene.2016.00203.
Grimholt U, Sundaram AYM, Bøe CA, Dahle MK, Lukacs M. Tetraploid ancestry provided Atlantic salmon with two paralogue functional T cell receptor beta regions whereof one is completely novel. Front Immunol. 2022;13:930312. doi:10.3389/fimmu.2022.930312.
Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020;36(9):2896–2898. doi:10.1093/bioinformatics/btaa025.
Hardie DC, Hebert PDN. The nucleotypic effects of cellular DNA content in cartilaginous and ray-finned fishes. Genome. 2003;46(4):683–706. doi:10.1139/g03-040.
Houston RD, Taggart JB, Cézard T, Bekaert M, Lowe NR, Downing A, Talbot R, Bishop SC, Archibald AL, Bron JE, et al. Development and validation of a high density SNP genotyping array for Atlantic salmon (Salmo salar). BMC Genomics. 2014;15:90. doi:10.1186/1471-2164-15-90.
Kijas J, Elliot N, Kube P, Evans B, Botwright N, King H, Primmer CR, Verbyla K. Diversity and linkage disequilibrium in farmed Tasmanian Atlantic salmon. Anim Genet. 2017;48(2):237–241. doi:10.1111/age.12513.
Kijas J, McWilliam S, Naval Sanchez M, Kube P, King H, Evans B, Nome T, Lien S, Verbyla K. Evolution of sex determination loci in Atlantic salmon. Sci Rep. 2018;8(1):5664. doi:10.1038/s41598-018-23984-1.
King T, Eackles M, Letcher B. Microsatellite DNA markers for the study of Atlantic salmon (Salmo salar) kinship, population structure, and mixed-fishery analyses. Mol Ecol Notes. 2005;5(1):130–132. doi:10.1111/j.1471-8286.2005.00860.x.
King TL, Kalinowski ST, Schill WB, Spidle AP, Lubinski BA. Population structure of Atlantic salmon (Salmo salar L.): a range-wide
14 | G3, 2023, Vol. 00, No. 0

perspective from microsatellite DNA variation. Mol Ecol. 2001; Roazen D, et al. Scaling accurate genetic variant discovery to
10(4):807–821. doi:10.1046/j.1365-294X.2001.01231.x. tens of thousands of samples. bioRxiv:201178. https://doi.org/
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. 10.1101/201178. 2018. preprint: not peer reviewed.
Canu: scalable and accurate long-read assembly via adaptive Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender
k-mer weighting and repeat separation. Genome Res. 2017; D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al. PLINK: a toolset
27(5):722–736. doi:10.1101/gr.215087.116. for whole-genome association and population-based linkage
Lehnert SJ, Bentzen P, Kess T, Lien S, Horne JB, Clément M, Bradbury analysis. Am J Hum Genet. 2007;81(3):559–575. doi:10.1086/
IR. Chromosome polymorphisms track trans-Atlantic divergence 519795.
and secondary contact in Atlantic salmon. Mol Ecol. 2019;28(8): Rastas P. Lep-MAP3: robust linkage mapping even for low-coverage

Downloaded from https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkad138/7202260 by guest on 28 August 2023


2074–2087. doi:10.1111/mec.15065. whole genome sequencing data. Bioinformatics. 2017;33(23):
Lehnert SJ, Kess T, Bentzen P, Clément M, Bradbury IR. Divergent and 3726–3732. doi:10.1093/bioinformatics/btx494.
linked selection shape patterns of genomic differentiation be- Roach MJ, Schmidt SA, Borneman AR. Purge haplotigs: allelic contig
tween European and North American Atlantic salmon (Salmo sal- reassignment for third-gen diploid genome assemblies. BMC
ar). Mol Ecol. 2020;29(12):2160–2175. doi:10.1111/mec.15480. Bioinformatics. 2018;19(1):460. doi:10.1186/s12859-018-2485-7.
Li H. Aligning sequence reads, clone sequences and assembly contigs Stenløkk K, Saitou M, Rud-Johansen L, Nome T, Moser M, Árnyasi M,
with BWA-MEM. arXiv:1303.3997. https://doi.org/10.48550/arXiv. Kent M, Barson NJ, Lien S. The emergence of supergenes from in-
1303.3997, 2013. preprint: not peer reviewed. versions in Atlantic salmon. Philos Trans R Soc B Biol Sci . 2022;
Li H. Minimap2: pairwise alignment for nucleotide sequences. 377(1856):20210195. doi:10.1098/rstb.2021.0195.
Bioinformatics. 2018;34(18):3094–3100. doi:10.1093/bioinformatics/ Thorgaard GH. Chromosomal differences among rainbow trout po-
bty191. pulations. Copeia. 1983;3:650–662. doi:10.2307/1444329.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Vallejo RL, Leeds TD, Gao G, Parsons JE, Martin KE, Evenhuis JP,
Abecasis G, Durbin R. The sequence alignment/map format and Fragomeni BO, Wiens GD, Palti Y. Genomic selection models dou-
SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi:10.1093/ ble the accuracy of predicted breeding values for bacterial cold
bioinformatics/btp352. water disease resistance compared to a traditional pedigree-
Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, Hvidsten TR, based model in rainbow trout aquaculture. Genet Sel Evol.
Leong JS, Minkley DR, Zimin A, et al. The Atlantic salmon genome 2017a;49(1):17. doi:10.1186/s12711-017-0293-6.
provides insights into rediploidization. Nature. 2016;533(7602): Vallejo RL, Liu S, Gao G, Fragomeni BO, Hernandez AG, Leeds TD,
200–205. doi:10.1038/nature17164. Parsons JE, Martin KE, Evenhuis JP, Welch TJ, et al. Similar genetic
Liu L, Ang KP, Elliott JAK, Kent MP, Lien S, MacDonald D, Boulding EG. architecture with shared and unique quantitative trait loci for
A genome scan for selection signatures comparing farmed bacterial cold water disease resistance in two rainbow trout
Atlantic salmon with two wild populations: testing colocalization breeding populations. Front Genet. 2017b;8:156. doi:10.3389/
among outlier markers, candidate genes, and quantitative trait fgene.2017.00156.
loci for production traits. Evol Appl. 2017;10(3):276–296. doi:10. Watson KB, Lehnert SJ, Bentzen P, Kess T, Einfeldt A, Duffy S,
1111/eva.12450. Perriman B, Lien S, Kent M, Bradbury IR. Environmentally asso-
Lubieniecki KP, Jones SL, Davidson EA, Park J, Koop BF, Walker S, ciated chromosomal structural variation influences fine-scale
Davidson WS. Comparative genomic analysis of Atlantic salmon, population structure of Atlantic salmon (Salmo salar). Mol Ecol.
Salmo salar, from Europe and North America. BMC Genet. 2010; 2022;31(4):1057–1075. doi:10.1111/mec.16307.
11(1):105. doi:10.1186/1471-2156-11-105. Wellband K, Mérot C, Linnansaari T, Elliott JAK, Curry RA,
Meuwissen T, Hayes B, Goddard M. Genomic selection: a paradigm Bernatchez L. Chromosomal fusion and life history-associated
shift in animal breeding. Anim Front. 2016;6(1):6–14. doi:10. genomic variation contribute to within-river local adaptation of
2527/af.2016-0002. Atlantic salmon. Mol Ecol. 2019;28(6):1439–1459. doi:10.1111/
Nash CE, Waknitz FW. Interactions of Atlantic salmon in the Pacific mec.14965.
Northwest: I. Salmon enhancement and the net-pen farming indus- Yáñez JM, Naswa S, López ME, Bassini L, Correa K, Gilbey J,
try. Fish Res. 2003;62(3):237–254. doi:10.1016/S0165-7836(03)00063-8. Bernatchez L, Norris A, Neira R, Lhorente JP, et al. Genomewide
Palti Y, Gao G, Liu S, Kent MP, Lien S, Miller MR, Rexroad CE, Moen T. single nucleotide polymorphism discovery in Atlantic salmon
The development and characterization of a 57K single nucleotide (Salmo salar): validation in wild and farmed American and
polymorphism array for rainbow trout. Mol Ecol Res. 2015;15(3): European populations. Mol Ecol Resour. 2016;16(4):1002–1011.
662–672. doi:10.1111/1755-0998.12337. doi:10.1111/1755-0998.12503.
Pearse DE, Barson NJ, Nome T, Gao G, Campbell MA, Abadía-Cardoso Zhang Z, Schwartz S, Wagner L, Miller WA. A greedy algorithm for
A, Anderson EC, Rundio DE, Williams TH, Naish KA, et al. aligning DNA sequences. J Comput Biol. 2000;7(1–2):203–214.
Sex-dependent dominance maintains migration supergene in doi:10.1089/10665270050081478.
rainbow trout. Nat Ecol Evol. 2019;3(12):1731–1742. doi:10.1038/
s41559-019-1044-6.
Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van
der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Editor: J. Yáñez
