You are on page 1of 34

Subscriber access provided by Olson Library | Northern Michigan University

Article
Predicting the ecological quality status of marine environments
from eDNA metabarcoding data using supervised machine learning
Tristan Cordier, Philippe Esling, Franck Lejzerowicz, Joana Visco, Amine
Ouadahi, Catarina Martins, Tomas Cedhagen, and Jan Pawlowski
Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b01518 • Publication Date (Web): 30 Jun 2017
Downloaded from http://pubs.acs.org on July 1, 2017

Just Accepted

“Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted
online prior to technical editing, formatting for publication and author proofing. The American Chemical
Society provides “Just Accepted” as a free service to the research community to expedite the
dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts
appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been
fully peer reviewed, but should not be considered the official version of record. They are accessible to all
readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered
to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published
in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just
Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor
changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers
and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors
or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Environmental Science & Technology is published by the American Chemical Society.


1155 Sixteenth Street N.W., Washington, DC 20036
Published by American Chemical Society. Copyright © American Chemical Society.
However, no copyright claim is made to original U.S. Government works, or works
produced by employees of any Commonwealth realm Crown government in the course
of their duties.
Page 1 of 33 Environmental Science & Technology

1 Predicting the ecological quality status of marine environments from eDNA

2 metabarcoding data using supervised machine learning

3 Tristan Cordier*1, Philippe Esling2, Franck Lejzerowicz1, Joana Visco3, Amine Ouadahi1,

4 Catarina Martins4, Tomas Cedhagen5, Jan Pawlowski1,3

1
6 Department of Genetics and Evolution, University of Geneva, Boulevard d’Yvoy 4, CH 1205 Geneva,

7 Switzerland


2
8 IRCAM, UMR 9912, Université Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France


3
9 ID-Gene ecodiagnostics, Ltd, chemin des Aulx 14, 1228 Plan-les-Ouates, Switzerland

4
10 Marine Harvest ASA, Sandviksboder 77AB, Bergen, 5035 Bergen, Norway

5
11 Department of Bioscience, Section of Aquatic Biology, University of Aarhus, Building 1135, Ole Worms allé

12 1, DK-8000 Aarhus, Denmark

13

14 *Corresponding author: tristan.cordier@gmail.com

15

16

17

18

19

20

21

22

23

24

25

ACS Paragon Plus Environment


Environmental Science & Technology Page 2 of 33

26

27 Abstract

28

29 Monitoring biodiversity is essential to assess the impacts of increasing anthropogenic

30 activities in marine environments. Traditionally, marine biomonitoring involves the sorting

31 and morphological identification of benthic macro-invertebrates, which is time-consuming

32 and taxonomic-expertise demanding. High-throughput amplicon sequencing of environmental

33 DNA (eDNA metabarcoding) represents a promising alternative for benthic monitoring.

34 However, an important fraction of eDNA sequences remains unassigned or belong to taxa of

35 unknown ecology, which prevent their use for assessing the ecological quality status. Here,

36 we show that supervised machine learning (SML) can be used to build robust predictive

37 models for benthic monitoring, regardless of the taxonomic assignment of eDNA sequences.

38 We tested three SML approaches to assess the environmental impact of marine aquaculture

39 using benthic foraminifera eDNA, a group of unicellular eukaryotes known to be good

40 bioindicators, as features to infer macro-invertebrates based biotic indices. We found similar

41 ecological status as obtained from macro-invertebrates inventories. We argue that SML

42 approaches could overcome and even bypass the cost and time-demanding morpho-taxonomic

43 approaches in future biomonitoring.

44

45 Key-words: Biomonitoring, biotic indices, environmental DNA, supervised machine learning,

46 predictive models, foraminifera.

47

48

49

50

ACS Paragon Plus Environment


Page 3 of 33 Environmental Science & Technology

51

52 Introduction

53 Human activities are impacting the marine ecosystem functioning through climate change1,

54 environmental pollution2, or human industry driven eutrophication3, and these pressures are

55 likely to increase with the projected demographic expansion. Such impacts are traditionally

56 monitored by surveying biodiversity, usually focusing on benthic macrofauna4,5. Biotic

57 Indices (BI) have been formalized in order to combine taxonomy and alpha-diversity

58 measures (e.g. species richness, Shannon index) to reduce the data dimensions into single

59 continuous values that are used to ascribe samples to an environmental quality status. Such

60 indices include for instance the AZTI Marine Biotic Index6 (AMBI) or more specific

61 Indicator Species Index7 (ISI), and the Norwegian Sensitivity and Quality Indices8 (NSI and

62 NQI1). The formulas of these indices include taxon-specific ecological weights or categories

63 of tolerance to disturbance defined from empirical and experimental data6. Indeed, all indices

64 currently applied in benthic monitoring are based on the sorting and identification of macro-

65 invertebrate specimens, which is extremely time consuming and taxonomic-expertise

66 demanding. The need for faster and cost-effective ways to conduct biodiversity surveys is of

67 prime importance to allow an effective management of marine resources.

68

69 High-throughput amplicon sequencing and the molecular identification of multiple species in

70 environmental DNA (eDNA metabarcoding) offer a fast and cost-effective way to describe

71 biological communities9. It gives the opportunity to overcome the limitations of morpho-

72 taxonomy based biomonitoring10-12. Recently, several attempts have been made to use the

73 eDNA metabarcoding for biomonitoring freshwater13-18 and marine ecosystems19-23.

74 Comparisons of BI inferred from eDNA metabarcoding and morphological data showed that

75 similar values could be obtained in the case of freshwater diatoms16,15 and marine

ACS Paragon Plus Environment


Environmental Science & Technology Page 4 of 33

76 invertebrates24-25. However, all these studies were dependent of reference sequence databases

77 for taxonomic assignment, to retrieve the ecological weight associated with the assigned taxa

78 (but see 18). This strongly limited the number of sequences included in the analysis, because

79 a large proportion of sequences remained unassigned, or because they were assigned to taxa

80 of unknown ecology20,24,26.

81

82 To overcome these limitations, we propose to use supervised machine learning (SML) for the

83 development of predictive models to infer BI values from eDNA metabarcoding data without

84 relying on taxonomic assignments. In SML approach, the predictive models are based on the

85 knowledge extracted from a training dataset, which typically consist of a set of features and

86 associated labels (classification) or continuous values (regression). The aim of SML is to fit

87 the training data to some function that can be used to predict a label or a continuous value to

88 new input data27 (e.g. new samples). In the recent years, there has been a growing interest to

89 investigate the usefulness of SML for ecological28-30, genetic31, or microbiome analyses27,31-34.

90 However, up to our knowledge, only one study has successfully used SML to predict

91 pollution levels as well as the values of 26 geochemical features based on a training dataset

92 composed of bacterial 16S eDNA metabarcoding data35.

93

94 In this study, we investigated the possibility of using SML to build predictive models for the

95 inference of four biotic indices commonly used for the benthic monitoring of the fish farming

96 industry. We tested this approach using benthic foraminifera, a group of ubiquitous

97 unicellular eukaryotes that include sensitive bioindicators of pollution in marine

98 environments36-39. Foraminifera are responsive to organic enrichment associated with fish

99 farming activities, as shown by both morphological40-43 and eDNA investigations22,23,44. This

100 makes foraminifera particularly appropriate for SML approach, because of their small size,

ACS Paragon Plus Environment


Page 5 of 33 Environmental Science & Technology

101 making them readily captured in eDNA samples, and because of the lack of well-defined

102 indicator morphospecies.

103

104 Material and methods

105 Sampling

106 A total of 144 sediment samples were collected in June and October 2015 at 24 stations

107 distributed at the vicinity of five salmon farms in Norway (Table S1). Four farms were

108 sampled at a rate of five stations and the remaining farm at only four stations because of

109 pebbles at the fifth station. At each station, two van Veen grabs (1000 cm2 model 12.211, KC-

110 Denmark) were deployed. From each of the 48 grabs, three samples of c.a. 5 ml were

111 collected from the two first centimeters of the surface sediment using sterile spoons,

112 transferred into 6 ml of LifeGuard Soil Preservation Solution (MO BIO, Laboratories Inc.)

113 and frozen at -20°C until DNA extraction. The rest of the sediment in each sub-sampled grab

114 was sieved through 1mm mesh and fixed in formalin for morpho-taxonomic inventory of

115 macro-invertebrates. Macrofaunal data were used to calculate for each grab four BIs (AMBI,

116 ISI, NSI and NQI1) routinely used in benthic monitoring surveys in Norway.

117

118 DNA extraction, PCR amplification and sequencing

119 The 144 frozen sediments samples were thawed on ice, centrifuged at 2500 rpm for 5 minutes

120 and the overlying solution was discarded. The total eDNA content was extracted using the

121 PowerMax® Soil DNA Isolation Kit (MO BIO Laboratories Inc.) following manufacturer

122 instructions. The 37F hypervariable region of the SSU rRNA gene, commonly used as

123 foraminiferal DNA barcode45 was PCR amplified using the forward primer s14F1 (5' -

124 AAGGGCACCACAAGAACGC - 3') and the reverse primer s15 (5' -

125 CCACCTATCACAYAATCATG - 3'), generating amplicons of about 200-250 bp as

ACS Paragon Plus Environment


Environmental Science & Technology Page 6 of 33

126 described previously22. The primers were tagged with 8-nt sequences appended at their 5’

127 ends in order to multiplex samples prior to sequencing library preparation. The tag sequences

128 have been designed with a pairwise minimum edit distance of 2 and an edit distance of 8 to

129 the corresponding positions of the conserved foraminiferal SSU rDNA sequence. We assigned

130 the tag primer combinations to samples following a Latin Square Design (Table S2) in order

131 to reduce mistagging events46. The PCR products were quantified by high-resolution capillary

132 electrophoresis (QIAxcel System, Qiagen) and pooled in equimolar concentration. The pool

133 was purified using the High Pure PCR Product Purification Kit (Roche), quantified using a

134 fluorometric method (QuBit HS dsDNA kit, Invitrogen) and used for library preparation using

135 the TruSeq® DNA PCR-Free Library Prep Kit (Illumina). After quantification using the

136 KAPA Library Quantification Kits (KAPA BIOSYSTEMS), the library was sequenced on an

137 Illumina MiSeq System using MiSeq Reagent Kit v2 and a standard 14-tiles flow cell for

138 2*251 cycles. The raw dataset is publicly available at the Sequence Read Archive under

139 BioProject PRJNA376130.

140

141 Bioinformatics

142 The paired-end raw reads were quality filtered and assembled into full-length sequences with

143 a pipeline written in C language for the fast processing of Illumina metabarcoding data

144 (https://github.com/esling/illumina-pipeline). Filtering and assembly parameters are reported

145 in Table S3. The pipeline included the demultiplexing of each sequence into its sample of

146 origin by matching the combination of 8-nt tags present at the 5’-end of each sequence read.

147 The tagged primers were trimmed from the sequences as well as the foraminifera-specific

148 conserved region of c.a. 70 nucleotides based on the detection of its GACAG motif after 60

149 positions. Every sequence lacking this motif was discarded. The dataset was then filtered for

150 potential chimeras using UCHIME47 version 4.2.40 implemented in the

ACS Paragon Plus Environment


Page 7 of 33 Environmental Science & Technology

151 identify_chimera_seq.py function of the Qiime48 1.9.1 toolkit. We used the default parameters

152 of the function, but the --split_by_sampleid option was used in order to restrict the de novo

153 search by sample (i.e. by PCR). The filtered dataset was then clustered into Operational

154 Taxonomic Units (OTUs) using swarm49 2.1.8 with the default resolution (d=1) and the

155 fastidious option. This clustering also included homologous 37F sequences generated by

156 previous fish farm sequencing surveys and obtained using identical bioinformatics processing,

157 including sequences from Scotland22, from New-Zealand23, from Norway44, as well as

158 unpublished data. The present dataset was augmented with these sequences in order to

159 increase the likelihood of the fastidious option implemented in swarm to form OTUs

160 supported by the multiple occurrence of sequences specific to the fish farm environment. The

161 representative sequences, i.e. the most abundant Individual Sequence Unit (ISU) of each

162 OTU, were used as input of the assign_taxonomy.py function of Qiime with default parameter

163 for taxonomic assignment (uclust method), using a curated foraminifera database

164 (http://forambarcoding.unige.ch). This database contains 1175 foraminiferal SSU rDNA

165 sequences representing 125 morphospecies spread into 35 families, and including as well 106

166 environmental sequences for which morphospecies are not been yet identified but that are

167 commonly found in eDNA samples. OTU-representative sequences that could not be assigned

168 were compared by BLAST50 search against GenBank. The OTU-to-sample matrix was

169 generated from the result of the clustering including all fish farms sequences with

170 make_otu_table.py function of Qiime and the data corresponding to the samples analyzed in

171 the present study was extracted from this table into the statistical R environment51 for

172 downstream statistical analysis.

173

ACS Paragon Plus Environment


Environmental Science & Technology Page 8 of 33

174 Statistics

175 The relationship between diversity and distance from the cage as well as compositional

176 variation of the studied communities were investigated based on normalized versions of the

177 OTU-to-sample matrix. Because uneven sequencing depth across samples can introduce

178 biases in the statistical analysis, samples with less than 10,000 reads were discarded. To

179 investigate correlations between diversity and distance from the cage, 100 rarefied datasets

180 were generated by aiming at 10,000 reads per sample using the rrarefy function of the R

181 vegan v2.4-1 package52. Several alpha-diversity metrics, based on all OTUs (including

182 unassigned ones) and their abundance, were then computed for each rarefied dataset and

183 averaged values were used to fit non-linear polynomial models using the lm function in R.

184 These diversity metrics included the OTU richness, the Shannon diversity53, the Pielou

185 evenness54, the SN diversity used in the calculation of NQI18, the expected OTU richness

186 (ES100 and ES50 for respectively 100 and 50 reads) and the Chao richness estimator55. To

187 investigate compositional variation, the read counts of the OTU-to-sample matrix were

188 normalized according to the cumulative-sum scaling method (CSS) using the metagenomeSeq

189 v1.16.0 R package56 (see 57 and 58). The following beta-diversity analyses were performed

190 using functions of the R vegan package. The Bray-Curtis dissimilarity index56 was calculated

191 using vegdist and the resulting pairwise dissimilarity matrix served for non-metric multi-

192 dimensional scaling (NMDS) analysis using metaMDS with default settings. The values of the

193 BIs obtained from the morpho-taxonomic data and of distance to cages were fit to the NMDS

194 ordination using envfit, and ordisurf, respectively.

195

196 Biotic indices values were predicted from the foraminiferal eDNA metabarcoding data using

197 three supervised approaches. For reference, either the ecological weights associated with the

198 taxa of the morpho-taxonomic inventories (correlation screening approach) or the BI values

ACS Paragon Plus Environment


Page 9 of 33 Environmental Science & Technology

199 calculated from these inventories (supervised machine learning approaches) were considered.

200 The BI values were predicted independently for the samples of each farm (testing datasets)

201 based on the samples of the other four farms (training datasets) used for supervision in the

202 three approaches. For the correlation screening approach, each OTU was compared to each

203 morpho-taxon across the samples of the training set. The reads associated with each OTU in

204 the PCR replicates of each grab were summed in order to measure Spearman rank correlations

205 in terms of relative abundance across grabs. The ecological weight of the taxon that has the

206 highest rho correlation above 0.7 was associated to the OTU. Biotic indices were then

207 calculated for the testing dataset using OTUs that were assigned ecological weights.

208

209 For the supervised machine learning approaches, either diversity metrics (diversity learning)

210 or OTUs composition (composition learning) were used as features. Models were built for

211 each training dataset, and BI values were predicted for the samples composing each

212 corresponding testing dataset. For diversity learning, the predicted BI values were averaged

213 for each sample over the 100 rarefied datasets. For composition learning, the effect of keeping

214 rare OTUs was investigated by comparing the results obtained when the OTUs with less than

215 10 reads or less than 100 reads across the full dataset were discarded before CSS

216 normalization. Since only one reference BI value was available for each grab, the BI values

217 predicted using both learning methods were averaged for each grab and the standard

218 deviations were computed. For both supervised diversity and composition learning approach,

219 we compared two different algorithms to predict BI values. We compared the Random Forest

220 (RF) algorithm60 implemented in the ranger v.0.6.0 R package61 for multithreading and the

221 Self-Organizing Map (SOM) algorithm implemented in the Kohonen v2.0.19 R package62.

222 For the RF algorithm, we generated 300 trees and used the default “mtry” parameter for

223 regression task, which is 1/3 of the features randomly picked to split the tree at each node,

ACS Paragon Plus Environment


Environmental Science & Technology Page 10 of 33

10

224 which usually give the best results63. RF models were measuring the importance of features in

225 the prediction of BI values. For the SOM algorithm, the network was trained with 100

226 iterations with the default learning rate (linear decline from 0.05 to 0.01 over the 100

227 iterations). The tuning of the parameters (xdim, ydim, xweight, topo) were randomly searched

228 over 100 parameters combinations (100 hyperparameter search) with the train function of the

229 caret v6.0-73 R package64, passing the “random” search option and a 10-fold cross validation

230 repeated 10 times to the trainControl function. For each training dataset (four farms), the set

231 of parameters giving the lowest average Root Mean Squared Error (RMSE) on hold-out

232 samples during cross-validation were used to build the model that predict BI values for the

233 testing dataset (the remaining farm).

234

235 To compare the accuracy of the supervised approaches, the relationships between the

236 reference and predicted BI values were modeled using the lm function in R. These BI values

237 were then converted into discrete ecological quality status, after averaging per grab in the case

238 of the predicted values. Their agreement was tested using the kappa2 function of the irr v0.84

239 R package65, with squared weight because the ecological status values are ordered from “very

240 poor” to “very good”. Agreement between the two classifications was considered as “poor

241 agreement”, e.g. Kappa value ranging from 0.01 to 0.2 to “almost perfect agreement”, e.g.

242 Kappa value ranging from 0.8 to 166. For each BI, the best model was the one associated with

243 the highest Kappa value. Its accuracy was investigated at the scale of the farm by testing for

244 difference between predicted and reference values using the Mann-Whitney test with the

245 wilcox.test in R.

246

ACS Paragon Plus Environment


Page 11 of 33 Environmental Science & Technology

11

247 Results

248 Macro-invertebrate morpho-taxonomic inventories and biotic indices

249 The morpho-taxonomic inventories of 75,431 macro-invertebrate specimens comprised 432

250 morphospecies, of which 357 have been ascribed to an ecological category in at least one of

251 the BIs (Table S4). All BIs showed that the five farms were impacted, with highest values

252 (AMBI) and lowest values (ISI, NSI and NQI1) within 200 meters from the fish cages (Figure

253 S1). The impact decreased quickly with increasing distance from the cages. According to the

254 results of non-linear models, the effect of the distance from the cage on the values of every BI

255 was highly significant (Table S5, Figure S1).

256

257 Foraminifera eDNA metabarcoding data

258 PCR products were obtained for 123 out of 144 samples and sequencing yielded 11,257,700

259 paired-end reads. The quality filtering (i.e. base call quality and contig assembly) discarded

260 24.9 % of the reads (Table S3) and de novo chimera filtering discarded a further 449

261 sequences. The resulting 8,449,933 sequences were clustered along with 45,588,115

262 homologous sequences generated in previous studies and 69,716 OTUs were delineated. An

263 OTU-to-sample matrix was formed from the 9,235 OTUs that occurred in at least one of the

264 123 samples corresponding to the five studied fish farms. The sequencing depth of the

265 samples was uneven. It ranged from 42 to 283,431 reads, with an average of 68,698 reads per

266 sample. Seven samples represented by less than 10,000 reads were discarded. The final OTU-

267 to-sample matrix was constituted of 116 sediment samples as rows, covering the 48 sediment

268 grabs from which the macro-invertebrates were inventoried, 9,170 OTUs as columns, and was

269 filled with the 8,414,122 reads (Table S6). Two additional OTU-to-sample matrices were then

270 composed from the OTUs represented by more than 10 reads and by more than 100 reads, and

271 contained 3,648 and 1,579 columns, respectively.

ACS Paragon Plus Environment


Environmental Science & Technology Page 12 of 33

12

272

273 Non-linear models yielded significant correlations between alpha-diversity metrics and the

274 distance from the cage, except Pielou’s evenness (Figure S2). The NMDS analysis of the

275 beta-diversity matrix showed that the distance from the cage is strongly structuring

276 foraminifera communities, and that the reference BI values inferred from the macro-

277 invertebrate data were strongly correlated with the ordination (Figure S3).

278

279 Taxonomic composition

280 Among the 9,170 OTUs, 2,378 (representing 70 % of the reads) were assigned to a given

281 taxonomic rank based on the consensus of three maximum hits with a minimum of 90 %

282 similarity and coverage on reference sequences of our curated foraminifera SSU rDNA

283 database. Among these assigned OTUs, the majority belong to orders Rotaliida and

284 Textulariida and class “Monothalamea” (Table S7). Sixty-two OTUs (less than 0.1 % of the

285 reads) were assigned to one of the remaining orders (Miliolida, Globerinida, Spirillinida and

286 Robertinida) and 190 OTUs (4.3 % of the reads) matched uncultured foraminifera sequences

287 of GenBank with more than 90 % of similarity and coverage. The remaining 6,602 OTUs

288 (25.7 % of the reads) could not be assigned to any reference sequence in foraminiferal

289 database and matched no GenBank sequence. The five most abundant taxa include 3 rotaliids:

290 Bulimina marginata (37 OTUs, representing 10.2 % of the reads), Stainforthia fusiformis (21

291 OTUs, representing 7.7 % of the reads) and Cibicidoides lobatulus (39 OTUs, representing 7

292 % of the reads), a monothalamid: Bathysiphon argenteus (18 OTUs, representing 7.4 % of the

293 reads) and a textularid genus: Reophax sp. (105 OTUs, representing 5.7 % of the reads).

294

ACS Paragon Plus Environment


Page 13 of 33 Environmental Science & Technology

13

295 Predictions of BI values from eDNA metabarcoding data

296 The three supervised approaches yielded accurate BI values predictions (Table 1). Linear

297 models and Kappa tests showed significance for the correlation screening approach to predict

298 the BI values from ecological weights assignments. Yet, R2 and Kappa statistics were higher

299 for the diversity and composition learning approaches than for the correlation screening

300 approach, as measured for ISI, NSI and NQI1 (Table 1, Figure 1). For AMBI, the correlation

301 screening approach performed better than the diversity learning using SOM algorithm and the

302 composition learning using the RF algorithm.

303

304 The diversity learning approach using the RF algorithm yielded predictions that had the

305 highest Kappa for NSI and NQI1, with 33 and 37 correctly classified grabs and with 15 and

306 11 grabs classified within 1 category mismatch, respectively. The Kappa statistics were 0.907

307 for NSI and 0.88 for NQI1, indicating an almost perfect agreement between molecular and

308 morpho-taxonomic data. Rarefying the OTU-to-sample dataset had a small effect on the per-

309 sample variation of the inferred values for both NSI (Figure S4) and NQI1 (Figure S5). The

310 most important diversity metrics (as measured by the RF algorithm) to infer NSI and NQI1

311 were Chao and OTU richness, and to a lesser extent SN (Figure S6, S7).

312

313 The composition learning approach using the SOM algorithm yielded predictions that had the

314 highest Kappa for AMBI and ISI, with respectively 33 and 25 correctly classified grabs, 13

315 and 21 classified within 1 category mismatch, and 2 misclassified for both BIs. The Kappa

316 statistics were 0.711 for AMBI and 0.774 for ISI, indicating a substantial agreement between

317 molecular and morpho-taxonomic data. The RF algorithm measurements of features

318 importance showed that the OTUs that were most important for inferring AMBI and ISI were

319 not necessarily the most abundant in terms of reads counts (Figure S8 and S9). The most

ACS Paragon Plus Environment


Environmental Science & Technology Page 14 of 33

14

320 useful OTU to infer AMBI values was unassigned and represented 0.7% of the reads (55,063

321 reads) and that to infer ISI values was an unidentified Monothalamid representing 0.2% of the

322 reads (19,123 reads). Discarding rare OTUs represented by less than 10 or 100 reads did not

323 significantly change the accuracy of BIs predictions using composition learning (Table S8).

324

325 Predicted values obtained with the best predictive models for each of the four BIs were

326 neither over nor under estimated (Figure 1). Grabs were evenly distributed around the correct

327 classification (Figure 1B). The median and the mean of the differences between the predicted

328 and reference BI values were close to zero and with moderate variances, which means that our

329 predictive models were not biased (Figure 1C). Mann-Whitney tests of difference between the

330 predicted and reference BI values were not significant for all fish farm and BI combinations,

331 which means that predictive models were accurately predicting BI values at the scale of the

332 fish farm (Figure S10).

333

334 Discussion

335 Our study demonstrates the usefulness of supervised machine learning (SML) approaches to

336 infer biotic indices from eDNA metabarcoding data. To our knowledge, this is the first time

337 that the SML approaches have been applied to eukaryotes eDNA-based biomonitoring

338 surveys (but see 35 for a bacteria-based survey using SML). Until now, only those groups of

339 eukaryotes for which DNA barcodes are available in reference sequence database and for

340 which an autecological value is known could be taken into account in eDNA-based

341 biomonitoring studies24,25. Provided that an appropriate training dataset is available, SML

342 approaches offer a workaround to the dependency on reference databases and thus allow

343 extending the range of potential genetic bioindicators to other taxonomic groups, especially to

344 the small-sized inconspicuous taxa, which typically dominate the eDNA samples24. With the

ACS Paragon Plus Environment


Page 15 of 33 Environmental Science & Technology

15

345 notable exception of diatoms67, most of these small-sized taxa remain poorly described in

346 terms of autecology (but see 68), which make them useless for computing biotic indices.

347 Using SML approaches, there is no need for prior knowledge on the ecological signal

348 conveyed by OTUs, because these signals are inferred during the statistical modeling.

349

350 Our results showed that accurate predictive models for BIs inference could be obtained with

351 diversity metrics or composition data derived from eDNA metabarcoding data as features in

352 SML approaches. These models led to similar bioassessments as the ones obtained using

353 traditional morphology-based macrofaunal surveys. The possibility of using diversity metrics

354 or composition data for making BIs predictions on new samples may give a practical

355 flexibility for biomonitoring surveys. On the one hand, using the diversity metrics in SML

356 reduces the dimensions of the data and therefore the computation time, and allows the

357 prediction of BIs for samples coming from various geographical regions, where the

358 taxonomic composition may be different. On the other hand, using composition data to

359 predict BIs seems more appropriate if a significant proportion of OTUs are present in both the

360 training and the testing dataset. Composition data should capture species interactions69,70 and

361 give more importance to key taxa consistently responding to specific environmental

362 variation71,72, which could yield better results for atypical samples or statistical outliers.

363

364 While analyzing eDNA metabarcoding data, various biological and technical issues need to

365 be considered for the interpretation of alpha-diversity and beta-diversity. The presence of

366 extracellular DNA necessarily blurs the ecological signal conveyed in eDNA datasets as

367 living, dead and inactive cells cannot be readily distinguished73,74. In our study, samples were

368 collected from surface sediments, where it is reasonable to expect that most of the DNA come

369 from living or recently dead organisms. Intra-genomic rRNA sequence variation75 and rRNA

ACS Paragon Plus Environment


Environmental Science & Technology Page 16 of 33

16

370 gene copy-number variation76 brings qualitative and quantitative bias in metabarcoding data.

371 From a technical point of view, PCR, sequencing77 and mistagging errors46 could add further

372 noise in the data. However, this noise can be assumed to be somewhat constant across

373 samples, and therefore disentangled from the ecological signal by SML algorithms.

374

375 The occurrence of OTUs splitting, i.e. morphospecies represented by several OTUs, could

376 also be seen as a problem for the interpretation of eDNA data, because the biological meaning

377 of OTUs is unclear78-80 and because it increases the dimensionality of the data (see below).

378 However, there is also some advantages related to the high number of OTUs. For instance, it

379 has been shown that bacterial sequences diverging by a single nucleotide can display a

380 different abundance profile across environmentally distinct samples81, pointing out that sub-

381 species units can be ecologically informative. In our study, several foraminiferal

382 morphospecies were represented by multiple OTUs. The detailed analysis of these OTUs in

383 the case of the two dominant morphospecies: Bulimina marginata and Cibicidoides lobatulus,

384 showed that in the first case the read abundance profiles of the two most abundant OTUs were

385 highly similar across samples (spearman rho = 0,91, p < 0.001, table S9), while in the second

386 case the two most abundant OTUs displayed different profiles across samples (spearman rho

387 = 0,09, p = 0.32, table S9). This suggests that these OTUs could represent different

388 populations of the same species that exhibit different ecological preference or that they belong

389 to cryptic species complexes that our reference sequence database did not capture. The

390 temptation to merge OTUs that are assigned to the same morphospecies, in an effort of

391 matching the metabarcoding data with the observable morphospecies data, is questionable for

392 two reasons. First, it means grouping sequences based on our current morpho-taxonomic

393 knowledge, which is reflected in reference sequence databases, although morphospecies, from

394 which DNA barcodes are generated, are not necessarily well defined. Second, ecologically

ACS Paragon Plus Environment


Page 17 of 33 Environmental Science & Technology

17

395 distinct sub-species units could be grouped together, losing the ecological signal conveyed by

396 distinct OTUs into a single heterogeneous OTU. Because this signal is used for BI

397 calculation, we think that SML approaches applied to biomonitoring would be more accurate

398 in the case of OTUs splitting than in the case of OTUs merging. Furthermore, we think that

399 SML approaches would as well benefit from the automation of a workflow that is entirety

400 independent from reference taxonomic database, to be reproducible over time.

401

402 Building predictive models from eDNA composition data using SML approaches hold

403 challenges too. Metabarcoding data are usually characterized by a much higher number of

404 OTUs than samples8, and this can be further increased by the OTUs splitting, like in the

405 present study. In SML, this data property makes the predictive models prone to be affected by

406 “the curse of dimensionality”31. Indeed, the probability to observe, by chance, a perfect

407 correlation between the relative abundance of one OTU and the variation of a BI increases

408 with the number of OTUs. This could lead to the overfitting of the model, thus decreasing its

409 accuracy. Finding a trade-off between the OTU granularity (dimensionality) and the

410 separation of ecologically relevant units appears as a major challenge for SML approaches

411 applied to metabarcoding data. Although our dataset was highly dimensional, our predictive

412 models were likely not overfitted for two reasons. First, the overfitting is controlled through

413 the growing of a “forest” of regression trees, that are built on a random subset of the data in

414 the RF algorithm60,30 and we used a 10-fold cross-validation step for model selection using the

415 SOM algorithm, a common procedure to prevent such problem27,83. Second, we showed that

416 our models still accurately predicted BI values without 80 % of the OTUs removed based in

417 their low abundance, supporting that our models built on the full dataset do not tend to overfit

418 due to the presence of numerous rare OTUs.

419

ACS Paragon Plus Environment


Environmental Science & Technology Page 18 of 33

18

420 Future efforts should investigate whether accurate SML predictions can be realized in broad

421 routine applications. Our results were achieved with a relatively small training dataset and

422 increasing the number of sampled farms to cover a wider sampling area will likely further

423 refine our predictive models. Given the ability to multiplex hundreds of samples and to

424 perform analyses rapidly, cost-effective biomonitoring solutions could be proposed in matters

425 of days instead of months84. Hence, an eDNA-based early warning system adapted for the

426 pro-active management of marine industrial activities could be envisioned, as it has been

427 already proposed for others ecosystems and type of bioindicators85,86.

428

429 Acknowledgements

430 This study was supported by the Swiss Network for International Studies, by the Swiss

431 National Science Foundation (grants 316030_150817 and 31003A_159709) and by the

432 Norwegian Seafood Research Fund (FHF project number 901092). We thank Vegard Aambø

433 Langvatn and Dagfinn Breivik Skomsø from Fiske Liv and Jeremiah Peder Ness from Aqua-

434 kompetanse for help in collecting samples. We also thank Are Andreassen Moe and Arne

435 Kvalvik from Marine Harvest for the organization of the sampling campaign. We also thank

436 the four anonymous reviewers for their helpful comments.

437

438 Supporting Information

439 Table S1: Sampling site coordinates and distribution of collected samples

440 Tables S2: Tag to sample reference table following a Latin Square Design and tagged primers

441 sequences for successful PCR amplified samples

442 Table S3: Quality filtering and assembly parameters

443 Table S5: Results of non-linear models testing the effect of farming site, the cages and their

444 interaction on four biotic indexes values

ACS Paragon Plus Environment


Page 19 of 33 Environmental Science & Technology

19

445 Table S6: OTU-to-sample matrix with taxonomic assignments

446 Table S7: Taxonomic composition of foraminiferal communities

447 Table S8: Accuracy of BI values predictions from composition data, as a function of

448 abundance filtering on the OTU-to-sample matrix.

449 Table S9: Pair-wise spearman correlation matrix of OTUs represented by more than 1000

450 reads and assigned to the same species.

451

452 Figure S1: Non-linear relationship between BI reference values and the distance to the cages

453 Figure S2: Correlation between alpha-diversity metrics and the distance from the cages

454 Figure S3: NMDS analysis of the Bray-Curtis beta-diversity matrix calculated from the CSS

455 normalized foraminifera eDNA dataset.

456 Figure S4: Effect of rarefying the OTU-to-sample matrix on the variation of inferred NSI

457 values per sample.

458 Figure S5: Effect of rarefying the OTU-to-sample dataset on the variation of inferred NQI1

459 values per sample.

460 Figure S6: Diversity metrics importance in the prediction of the NSI index from the Random

461 Forest models

462 Figure S7: Diversity metrics importance in the prediction of the NQI1 index from the Random

463 Forest models

464 Figure S8: OTU importance in the prediction of the AMBI index from the Random Forest

465 model

466 Figure S9: OTU importance in the prediction of the ISI index from the Random Forest model

467 Figure S10: Boxplot and Mann-Whitney tests for difference between predicted BI values with

468 the best predictive models and reference BI values for each combination of farm and BI

469

ACS Paragon Plus Environment


Environmental Science & Technology Page 20 of 33

20

470 References

471

472 (1) Doney, S. C.; Ruckelshaus, M.; Emmett Duffy, J.; Barry, J. P.; Chan, F.; English, C. A.;

473 Galindo, H. M.; Grebmeier, J. M.; Hollowed, A. B.; Knowlton, N.; et al. Climate Change

474 Impacts on Marine Ecosystems. Ann. Rev. Mar. Sci. 2012, 4 (1), 11–37.

475 (2) Peterson, C. H.; Rice, S. D.; Short, J. W.; Esler, D.; Bodkin, J. L.; Ballachey, B. E.; Irons, D.

476 B. Long-Term Ecosystem Response to the Exxon Valdez Oil Spill. Science 2003, 302

477 (December), 2082–2086.

478 (3) Smith, V. H.; Tilman, G. D.; Nekola, J. C. Eutrophication: Impacts of excess nutrient

479 inputs on freshwater, marine, and terrestrial ecosystems. In Environmental Pollution;

480 1999; Vol. 100, pp 179–196.

481 (4) Borja, A.; Ranasinghe, A.; Weisberg, S. B. Assessing ecological integrity in marine

482 waters, using multiple indices and ecosystem components: Challenges for the future.

483 Mar. Pollut. Bull. 2009, 59 (1–3), 1–4.

484 (5) Tavakoly Sany, S. B.; Hashim, R.; Rezayi, M.; Salleh, A.; Safari, O. A review of strategies

485 to monitor water and sediment quality for a sustainability assessment of marine

486 environment. Environ. Sci. Pollut. Res. Int. 2014, 21 (813), doi:10.1007/s11356-013-

487 2217-5.

488 (6) Borja, A.; Franco, J.; Pérez, V. A Marine Biotic Index to Establish the Ecological Quality

489 of Soft-Bottom Benthos Within European Estuarine and Coastal Environments. Mar.

490 Pollut. Bull. 2000, 40 (12), 1100–1114.

491 (7) Rygg, B. Indicator species index for assessing benthic ecological quality in marine

492 waters of Norway; 2002.

493 (8) Rygg, B. Developing indices for quality status classification of marine soft-bottom

ACS Paragon Plus Environment


Page 21 of 33 Environmental Science & Technology

21

494 fauna in Norway.; 2006.

495 (9) Taberlet, P.; Coissac, E.; Pompanon, F.; Brochmann, C.; Willerslev, E. Towards next-

496 generation biodiversity assessment using DNA metabarcoding. Mol. Ecol. 2012, 21 (8),

497 2045–2050.

498 (10) Woodward, G.; Gray, C.; Baird, D. J. Biomonitoring for the 21st Century: new

499 perspectives in an age of globalisation and emerging environmental threats. Limnetica

500 2013, 32 (2), 159–174.

501 (11) Aylagas, E.; Borja, Á.; Rodriguez-Ezpeleta, N. Environmental status assessment using

502 DNA metabarcoding: Towards a genetics based marine biotic index (gAMBI). PLoS ONE

503 9(3) e90529 2014, doi:10.1371/journal.pone.0090529.

504 (12) Bohmann, K.; Evans, A.; Gilbert, M. T. P.; Carvalho, G. R.; Creer, S.; Knapp, M.; Yu, D.

505 W.; de Bruyn, M. Environmental DNA for wildlife biology and biodiversity monitoring.

506 Trends in Ecology and Evolution. 2014, pp 358–367.

507 (13) Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Humbert, J. F.; Bouchez, A. Next-

508 generation sequencing to inventory taxonomic diversity in eukaryotic communities: A

509 test for freshwater diatoms. Mol. Ecol. Resour. 2013, 13 (4), 607–619.

510 (14) Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Frigerio, J.-M.; Humbert, J.-F.;

511 Bouchez, A. A next-generation sequencing approach to river biomonitoring using

512 benthic diatoms. Freshw. Sci. 2014, 33 (1), 349–363.

513 (15) Visco, J. A.; Apothéloz-Perret-Gentil, L.; Cordonier, A.; Esling, P.; Pillet, L.; Pawlowski, J.

514 Environmental Monitoring: Inferring the Diatom Index from Next-Generation

515 Sequencing Data. Environ. Sci. Technol. 2015, 49 (13), 7597–7605.

516 (16) Zimmermann, J.; Abarca, N.; Enke, N.; Enk, N.; Skibbe, O.; Kusber, W. H.; Jahn, R.

517 Taxonomic reference libraries for environmental barcoding: a best practice example

ACS Paragon Plus Environment


Environmental Science & Technology Page 22 of 33

22

518 from diatom research. PLoS One 2014, 9 (9), e108793.

519 (17) Zimmermann, J.; Glöckner, G.; Jahn, R.; Enke, N.; Gemeinholzer, B. Metabarcoding vs.

520 morphological identification to assess diatom diversity in environmental studies. Mol.

521 Ecol. Resour. 2015, 15 (3), 526–542.

522 (18) Apothéloz-Perret-Gentil, L.; Cordonier, A.; Straub, F.; Iseli, J.; Esling, P.; Pawlowski, J.

523 Taxonomy-free molecular diatom index for high-throughput eDNA biomonitoring.

524 Mol. Ecol. Resour. 2017, in press. DOI: 10.1111/1755-0998.12668

525 (19) Chariton, A. A.; Court, L. N.; Hartley, D. M.; Colloff, M. J.; Hardy, C. M. Ecological

526 assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA.

527 Front. Ecol. Environ. 2010, 8 (5), 233–238.

528 (20) Chariton, A. A.; Stephenson, S.; Morgan, M. J.; Steven, A. D. L.; Colloff, M. J.; Court, L.

529 N.; Hardy, C. M. Metabarcoding of benthic eukaryote communities predicts the

530 ecological condition of estuaries. Environ. Pollut. 2015, 203, 165–174.

531 (21) Bik, H. M.; Halanych, K. M.; Sharma, J.; Thomas, W. K. Dramatic shifts in benthic

532 microbial eukaryote communities following the deepwater horizon oil spill. PLoS One

533 2012, 7 (6).

534 (22) Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cedhagen, T.; Wilding, T. A. Environmental

535 monitoring through protist next-generation sequencing metabarcoding: Assessing the

536 impact of fish farming on benthic foraminifera communities. Mol. Ecol. Resour. 2014,

537 14 (6), 1129–1140.

538 (23) Pochon, X.; Wood, S. A.; Keeley, N. B.; Lejzerowicz, F.; Esling, P.; Drew, J.; Pawlowski, J.

539 Accurate assessment of the impact of salmon farming on benthic sediment

540 enrichment using foraminiferal metabarcoding. Marine Pollution Bulletin. 2015.

541 (24) Lejzerowicz, F.; Esling, P.; Pillet, L. L.; Wilding, T. a.; Black, K. D.; Pawlowski, J. High-

ACS Paragon Plus Environment


Page 23 of 33 Environmental Science & Technology

23

542 throughput sequencing and morphology perform equally well for benthic monitoring

543 of marine ecosystems. Sci. Rep. 2015, 5 (April), 13932.

544 (25) Aylagas, E.; Borja, Á.; Irigoien, X.; Rodríguez-Ezpeleta, N. Benchmarking DNA

545 Metabarcoding for Biodiversity-Based Monitoring and Assessment. Front. Mar. Sci.

546 2016, 3 (November), 96.

547 (26) Lanzén, A.; Lekang, K.; Jonassen, I.; Thompson, E. M.; Troedsson, C. High-throughput

548 metabarcoding of eukaryotic diversity for environmental monitoring of offshore oil-

549 drilling activities. Mol. Ecol. 2016, 25 (17), 4392–4406.

550 (27) Knights, D.; Costello, E. K.; Knight, R. Supervised classification of human microbiota.

551 FEMS Microbiology Reviews. 2011, pp 343–359.

552 (28) Olden, J. D.; Lawler, J. J.; Poff, N. L. Machine learning methods without tears: a primer

553 for ecologists. Q. Rev. Biol. 2008, 83 (2), 171–193.

554 (29) Philibert, A.; Desprez-Loustau, M. L.; Fabre, B.; Frey, P.; Halkett, F.; Husson, C.; Lung-

555 Escarmant, B.; Marçais, B.; Robin, C.; Vacher, C.; et al. Predicting invasion success of

556 forest pathogenic fungi from species traits. J. Appl. Ecol. 2011, 48 (6), 1381–1390.

557 (30) Crisci, C.; Ghattas, B.; Perera, G. A review of supervised machine learning algorithms

558 and their applications to ecological data. Ecol. Modell. 2012, 240, 113–122.

559 (31) Libbrecht, M. W.; Noble, W. S. Machine learning applications in genetics and

560 genomics. Nat Rev Genet 2015, 16 (6), 321–332.

561 (32) Statnikov, A.; Henaff, M.; Narendra, V.; Konganti, K.; Li, Z.; Yang, L.; Pei, Z.; Blaser, M.

562 J.; Aliferis, C. F.; Alekseyenko, A. V. A comprehensive evaluation of multicategory

563 classification methods for microbiomic data. Microbiome 2013, 1 (1), 11.

564 (33) Beck, D.; Foster, J. A. Machine learning techniques accurately classify microbial

565 communities by bacterial vaginosis characteristics. PLoS One 2014, 9 (2).

ACS Paragon Plus Environment


Environmental Science & Technology Page 24 of 33

24

566 (34) Martínez-García, P. M.; López-Solanilla, E.; Ramos, C.; Rodríguez-Palenzuela, P.

567 Prediction of bacterial associations with plants using a supervised machine-learning

568 approach. Environmental Microbiology. 2016.

569 (35) Smith, M. B.; Rocha, A. M.; Smillie, C. S.; Olesen, S. W.; Paradis, C.; Wu, L.; Campbell, J.

570 H.; Fortney, J. L.; Mehlhorn, T. L.; Lowe, K. A.; et al. Natural bacterial communities

571 serve as quantitative geochemical biosensors. MBio 2015, 6 (3), 1–13.

572 (36) Nigam, R.; Saraswat, R.; Panchang, R. Application of foraminifers in ecotoxicology:

573 Retrospect, perspect and prospect. Environ. Int. 2006, 32 (2), 273–283.

574 (37) Frontalini, F.; Coccioni, R. Benthic foraminifera as bioindicators of pollution: A review

575 of Italian research over the last three decades. Revue de Micropaleontologie. 2011, pp

576 115–127.

577 (38) Schönfeld, J.; Alve, E.; Geslin, E.; Jorissen, F.; Korsun, S.; Spezzaferri, S.; Abramovich,

578 S.; Almogi-Labin, A.; du Chatelet, E. A.; Barras, C.; et al. The FOBIMO (FOraminiferal

579 BIo-MOnitoring) initiative-Towards a standardised protocol for soft-bottom benthic

580 foraminiferal monitoring studies. Mar. Micropaleontol. 2012, 94–95, 1–13.

581 (39) Alve, E.; Korsun, S.; Schönfeld, J.; Dijkstra, N.; Golikova, E.; Hess, S.; Husum, K.; Panieri,

582 G. Foram-AMBI: A sensitivity index based on benthic foraminiferal faunas from North-

583 East Atlantic and Arctic fjords, continental shelves and slopes. Mar. Micropaleontol.

584 2016, 122, 1–12.

585 (40) Scott, D. B.; Schafer, C. T.; Honig, C.; Younger, D. C. Temporal variations of benthic

586 foraminiferal assemblages under or near aquaculture operations; documentation of

587 impact history. J. Foraminifer. Res. 1995, 25 (3), 224–235.

588 (41) Angel, D. L. Impact of a Net Cage Fish Farm on the Distribution of Benthic

589 Foraminifera in the Northern Gulf of Eilat (Aqaba, Red Sea). J. Foraminifer. Res. 2000,

ACS Paragon Plus Environment


Page 25 of 33 Environmental Science & Technology

25

590 30 (1), 54–65.

591 (42) Vidović, J.; Cosovic, V.; Jurasic, M.; Petricioli, D. Impact of fish farming on foraminiferal

592 community, Drvenik Veliki Island, Adriatic Sea, Croatia. Mar. Pollut. Bull. 2009, 58 (9),

593 1297–1309.

594 (43) Vidović, J.; Dolenec, M.; Dolenec, T.; Karamarko, V.; Žvab Rožič, P. Benthic

595 foraminifera assemblages as elemental pollution bioindicator in marine sediments

596 around fish farm (Vrgada Island, Central Adriatic, Croatia). Mar. Pollut. Bull. 2014, 83

597 (1), 198–213.

598 (44) Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cordier, T.; Visco, J. A.; Martins, C. I. M.;

599 Kvalvik, A.; Staven, K.; Cedhagen, T. Benthic monitoring of salmon farms in Norway

600 using foraminiferal metabarcoding. Aquac. Environ. Interact. 2016, 8, 371–386.

601 (45) Pawlowski, J.; Lejzerowicz, F.; Esling, P. Next-generation environmental diversity

602 surveys of foraminifera: Preparing the future. Biological Bulletin. 2014, pp 93–106.

603 (46) Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for high-

604 throughput amplicon-sequencing. Nucleic Acids Res. 2015, 1–12.

605 (47) Edgar, R. C.; Haas, B. J.; Clemente, J. C.; Quince, C.; Knight, R. UCHIME improves

606 sensitivity and speed of chimera detection. Bioinformatics 2011, 27 (16), 2194–2200.

607 (48) Caporaso, J. G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F. D.; Costello, E.

608 K.; Fierer, N.; Pena, A. G.; Goodrich, J. K.; Gordon, J. I.; et al. QIIME allows analysis of

609 high-throughput community sequencing data. Nat. Methods 2010, 7 (5), 335–336.

610 (49) Mahé, F.; Rognes, T.; Quince, C.; De Vargas, C.; Dunthorn, M. Swarm v2: highly-

611 scalable and high-resolution amplicon clustering. PeerJ 2015, 3, e1420.

612 (50) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic Local Alignment

613 Search Tool. J. Mol. Biol. 1990, No. 215, 403–410.

ACS Paragon Plus Environment


Environmental Science & Technology Page 26 of 33

26

614 (51) R Core Team. R: A language and environment for statistical computing. R Found. Stat.

615 Comput. Vienna, Austria 2016, URL https://www.R-project.org/.

616 (52) Oksanen, J.; Blanchet, F. G.; Kindt, R.; Legendre, P.; Minchin, P. R.; O’Hara, R. B.;

617 Simpson, G. L.; Solymos, P.; Stevens, M. H. H.; Wagner, H. vegan: Community Ecology

618 Package. R package version 2.4-1. 2016.

619 (53) Shannon, C. E.; Weaver, W. The mathematical theory of information; 1949; Vol. 97.

620 (54) Pielou, E. C. The measurement of diversity in different types of biological collections.

621 J. Theor. Biol. 1966, 13, 131–144.

622 (55) Chao, A.; Shen, T. J. Nonparametric estimation of Shannon’s index of diversity when

623 there are unseen species in sample. Environ. Ecol. Stat. 2003, 10 (4), 429–443.

624 (56) Paulson, J. N.; Talukder, H.; Pop, M.; Bravo, H. C. Robust methods for differential

625 abundance analysis in marker gene surveys. Nat. Methods 2013, 10 (12), 1200–1202.

626 (57) McMurdie, P. J.; Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is

627 Inadmissible. PLoS Comput. Biol. 2014, 10 (4).

628 (58) Weiss, S. J.; Xu, Z.; Amir, A.; Peddada, S.; Bittinger, K.; Gonzalez, A.; Lozupone, C.;

629 Zaneveld, J. R.; Vazquez-Baeza, Y.; Birmingham, A.; et al. Effects of library size

630 variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ

631 Prepr. 2015, 3, e1408. DOI: 10.7287/peerj.preprints.1157v1

632 (59) Bray, J. R.; Curtis, J. T. An Ordination of the Upland Forest Communities of Southern

633 Wisconsin. Ecol. Monogr. 1957, 27 (4), 325–349.

634 (60) Breiman, L. Random Forests. Mach. Learn. 2001, 45 (1), 5–32.

635 (61) Wright, M. N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High

636 Dimensional Data in C++ and R. arXiv.org 2015, stat.ML.

637 (62) Wehrens, R.; Buydens, L. M. C. Self- and super-organizing maps in R: The kohonen

ACS Paragon Plus Environment


Page 27 of 33 Environmental Science & Technology

27

638 package. J. Stat. Softw. 2007, 21 (5), 1–19.

639 (63) Liaw, a; Wiener, M. Classification and Regression by randomForest. R news 2002, 2

640 (December), 18–22.

641 (64) Kuhn, M. caret: Classification and Regression Training. R package version 6.0-73.

642 Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan

643 Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael

644 Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and

645 Tyler Hunt. 2016.

646 (65) Gamer, M.; Lemon, J.; Fellows, I.; Singh, P. Various Coefficients of Interrater Reliability

647 and Agreement.

648 (66) Landis, J. R.; Koch, G. G. The measurement of observer agreement for categorical

649 data. Biometrics 1977, 33 (1), 159–174.

650 (67) Stevenson, R. J.; Pan, Y.; VanDam, H. Assessing environmental conditions in rivers and

651 streams with diatoms. Earth Sci. 2006, 57–85.

652 (68) Grego, M.; De Troch, M.; Forte, J.; Malej, A. Main meiofauna taxa as an indicator for

653 assessing the spatial and seasonal impact of fish farming. Mar. Pollut. Bull. 2009, 58

654 (8), 1178–1186.

655 (69) Lima-Mendez, G.; Faust, K.; Henry, N.; Decelle, J.; Colin, S.; Carcillo, F.; Chaffron, S.;

656 Ignacio-Espinosa, J. C.; Roux, S.; Vincent, F.; et al. Determinants of community

657 structure in the global plankton interactome. Science (80-. ). 2015, 348 (6237),

658 1262073_1-1262073_9.

659 (70) Vacher, C.; Tamaddoni-Nezhad, A.; Kamenova, S.; Peyrard, N.; Moalic, Y.; Sabbadin, R.;

660 Schwaller, L.; Chiquet, J.; Smith, M. A.; Vallance, J.; et al. Learning Ecological Networks

661 from Next-Generation Sequencing Data. In Advances in Ecological Research; 2016;

ACS Paragon Plus Environment


Environmental Science & Technology Page 28 of 33

28

662 Vol. 54, pp 1–39.

663 (71) Fortunato, C. S.; Eiler, A.; Herfort, L.; Needoba, J. A.; Peterson, T. D.; Crump, B. C.; CS,

664 F.; Eiler, A.; Herfort, L.; JA, N.; et al. Determining indicator taxa across spatial and

665 seasonal gradients in the Columbia River coastal margin. Isme J 2013, 7 (10), 1899–

666 1911.

667 (72) Dell’Anno, A.; Corinaldesi, C. Degradation and turnover of extracellular DNA in marine

668 sediments: Ecological and methodological considerations. Appl. Environ. Microbiol.

669 2004, 70 (7), 4384–4386.

670 (73) Corinaldesi, C.; Danovaro, R.; Dell’Anno, A. Simultaneous recovery of extracellular and

671 intracellular DNA suitable for molecular studies from marine sediments. Appl. Environ.

672 Microbiol. 2005, 71 (1), 46–50.

673 (74) Pillet, L.; Fontaine, D.; Pawlowski, J. Intra-genomic ribosomal RNA polymorphism and

674 morphological variation in elphidium macellum suggests inter-specific hybridization in

675 foraminifera. PLoS One 2012, 7(2) : e32, doi:10.1371/journal.pone.0032373.

676 (75) Weber, A. A. T.; Pawlowski, J. Can Abundance of Protists Be Inferred from Sequence

677 Data: A Case Study of Foraminifera. PLoS One 2013, 8 (2).

678 (76) Schirmer, M.; Ijaz, U. Z.; D’Amore, R.; Hall, N.; Sloan, W. T.; Quince, C. Insight into

679 biases and sequencing errors for amplicon sequencing with the Illumina MiSeq

680 platform. Nucleic Acids Res. 2015, 43 (6).

681 (77) Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for high-

682 throughput amplicon-sequencing. Nucleic Acids Res. 2015, 43 (5), 2513–2524.

683 (78) Schloss, P. D.; Westcott, S. L. Assessing and improving methods used in OTU-based

684 approaches for 16S rRNA gene sequence analysis. Appl. Environ. Microbiol. 2011, 77

685 (10), 3219–3226.

ACS Paragon Plus Environment


Page 29 of 33 Environmental Science & Technology

29

686 (79) Chen, W.; Zhang, C. K.; Cheng, Y.; Zhang, S.; Zhao, H. A Comparison of Methods for

687 Clustering 16S rRNA Sequences into OTUs. PLoS One 2013, 8(8) (PLoS ONE 8(8):

688 e70837.), doi:10.1371/journal.pone.0070837.

689 (80) He, Y.; Caporaso, J. G.; Jiang, X.-T.; Sheng, H.-F.; Huse, S. M.; Rideout, J. R.; Edgar, R. C.;

690 Kopylova, E.; Walters, W. A.; Knight, R.; et al. Stability of operational taxonomic units:

691 an important but neglected property for analyzing microbial diversity. Microbiome

692 2015, 3 (1), 20.

693 (81) Eren, A. M.; Maignien, L.; Sul, W. J.; Murphy, L. G.; Grim, S. L.; Morrison, H. G.; Sogin,

694 M. L. Oligotyping: Differentiating between closely related microbial taxa using 16S

695 rRNA gene data. Methods Ecol. Evol. 2013, 4 (12), 1111–1119.

696 (82) Tsilimigras, M. C. B.; Fodor, A. A. Compositional data analysis of the microbiome:

697 fundamentals, tools, and challenges. Ann. Epidemiol. 2016, 26 (5), 330–335.

698 (83) Domingos, P. A few useful things to know about machine learning. Commun. ACM

699 2012, 55 (10), 78.

700 (84) Quinn, R. A.; Navas-molina, J. A.; Hyde, E. R.; Song, J.; Vázquez-baeza, Y.; Humphrey,

701 G.; Gaffney, J.; Minich, J. J.; Melnik, A. V; Herschend, J.; et al. From sample to multi-

702 omics conclusions in under 48 hours. 2016, 1 (2).

703 (85) Wepener, V.; van Vuren, J. H. J.; Chatiza, F. P.; Mbizi, Z.; Slabbert, L.; Masola, B. Active

704 biomonitoring in freshwater environments: Early warning signals from biomarkers in

705 assessing biological effects of diffuse sources of pollutants. Phys. Chem. Earth 2005,

706 30 (11–16 SPEC. ISS.), 751–761.

707 (86) Maradona, A.; Marshall, G.; Mehrvar, M.; Pushchak, R.; Laursen, A. E.; McCarthy, L. H.;

708 Bostan, V.; Gilbride, K. A. Utilization of multiple organisms in a proposed early-

709 warning biomonitoring system for real-time detection of contaminants: Preliminary

ACS Paragon Plus Environment


Environmental Science & Technology Page 30 of 33

30

710 results and modeling. J. Hazard. Mater. 2012, 219–220, 95–102.

711

ACS Paragon Plus Environment


Page 31 of 33 Environmental Science & Technology

Tables and figures


Table 1: Accuracy of biotic indices predictions obtained from correlation screening, diversity
learning and composition learning, with the Random Forest and Self-Organizing Map algorithms.
Best predictive models are in bold (i.e. Highest Kappa statistics).
Biotic Index Approach Supervised learning algorithm R2 Kappa
AMBI Correlation screening - 0.568*** 0.624***
Diversity learning Random Forest 0.641*** 0.69***
Self-Organizing Map 0.492*** 0.618***
Composition learning Random Forest 0.662*** 0.555***
Self-Organizing Map 0.669*** 0.711***
ISI Correlation screening - 0.65*** 0.53***
Diversity learning Random Forest 0.505*** 0.626***
Self-Organizing Map 0.449*** 0.61***
Composition learning Random Forest 0.56*** 0.631***
Self-Organizing Map 0.615*** 0.774***
NSI Correlation screening - 0.508*** 0.607***
Diversity learning Random Forest 0.83*** 0.907***
Self-Organizing Map 0.83*** 0.88***
Composition learning Random Forest 0.827*** 0.832***
Self-Organizing Map 0.794*** 0.871***
NQI1 Correlation screening - 0.76*** 0.8***
Diversity learning Random Forest 0.834*** 0.88***
Self-Organizing Map 0.805*** 0.846***
Composition learning Random Forest 0.81*** 0.856***
Self-Organizing Map 0.803*** 0.873***
*:P < 0.05; **:P < 0.01; ***:P < 0.001

ACS Paragon Plus Environment


Environmental Science & Technology Page 32 of 33

Figure 1: A) Correlation between the BI values computed from the morpho-taxonomic


inventories (morphologic data) and the one predicted from supervised machine learning
from the foraminifera SSU 37F data (molecular data). The plots are obtained from the
predictive models giving the best accuracy (i.e. highest Kappa statistic, see Table 1). AMBI
and ISI predictions were done using the SOM algorithm on composition data. NSI and NQI1
predictions were done using the RF algorithm on diversity metrics. Each dot and error bars
represent the average and standard deviation of the predicted BI values for the sediment
sample of a grab. B) Barplots indicate the amount (number over bars) and percentage (y
axis) of correct classifications (0 on x axis) and misclassifications. C) Boxplots indicates the
median and quartile values (black numbers) and the mean value (red number) of the
difference between the predicted and the reference BI values.
A AMBI prediction
SOM algorithm on composition data
ISI prediction
SOM algorithm on composition data
6

100

Aukrasanden Aukrasanden
12

Beitveitnes Beitveitnes
80

n = 33
Bjørnsvik Bjørnsvik
5

Percentage

Percentage
60

Nedre Kvarv Nedre Kvarv


10

Storvika Storvika
40
20

n=7 n=6
4

n=1 n=1
0

8
Molecular

Molecular

2 1 0 1 2
3

Status mismatch

1.68
1.5
2

0.6
0.5

0.01 0.07
1

2
0.5

0.44
R =0.669*** R =0.615***
Kappa =0.711*** Kappa =0.774***
1.5

1.5
0

0 1 2 3 4 5 6 0 2 4 6 8 10 12

Morphology Morphology

NSI prediction NQI1 prediction


RF algorithm on diversity metrics RF algorithm on diversity metrics
0.8
0

0
0

Kappa =0.907*** Kappa =0.88***


0.0
0

0 0.0 0.8

ISI prediction NSI prediction NQI1 prediction


SOM algorithm on composition data
RF algorithm
AMBI B RF algorithm
on diversity
prediction metrics on diversity metrics
ISI prediction NSI prediction NQI1 prediction
100

100
80

80

80
% of samples

n = 33
Percentage

Percentage
60

60
0.8

n = 25
40

40

n = 11 n = 10
20

20

n=7 n=6
n=1 n=1 n=1 n=1
0

2 1 0 1 2 1 0 1 2 3 0 0

C 1.68
Status mismatch Status mismatch
predicted — known value
1.5

1.49
0.6
0.5

0.54
0
0

0.07 0.16
0.01
0

0.00

0.96
0.5

0.44
2

=0.669*** R =0.615*** 2.79


Kappa =0.711*** Kappa =0.774***
4
1.5

0.0

1.5

6 8 10 0.012 0.8

ACS Paragon Plus Environment


Biotic index
33Macro-invertebrates
PageEnvironmental
of 33 New eDNA samples
Science & Technology
Sieving

Foraminifera eDNA

eDNA ACS Paragon Plus Environment


extraction
Sampling Data generation Model training Making predictions

Data collection and model building Biotic indice predictions

You might also like