
Chapter 20

PlasmidFinder and In Silico pMLST: Identification and Typing of Plasmid Replicons in Whole-Genome Sequencing (WGS)
Alessandra Carattoli and Henrik Hasman

Abstract
PlasmidFinder and in silico plasmid multilocus sequence typing (pMLST) are two easy-to-use web tools
for detection and characterization of plasmid sequences in whole-genome sequencing (WGS) data from
Enterobacteriaceae. These tools have been adopted worldwide and facilitate plasmid detection and typing
based on draft genomes of multi-drug-resistant Enterobacteriaceae. The PlasmidFinder database currently
includes 133 unique plasmid replicon sequences. It was built starting with 126 sequences devised on fully
sequenced plasmids available at the NCBI nucleotide database in 2014 and has been continuously updated
to include novel replicons detected in more recently sequenced plasmids associated with the family
Enterobacteriaceae. PlasmidFinder is usable for replicon sequence analysis of raw as well as assembled
sequencing data. For pMLST analysis, a weekly updated database was generated from www.pubmlst.org
and integrated into a web tool called in silico pMLST.

Key words Replicons, Plasmid typing, Bacterial typing, Genomics, WGS

1 Introduction

Plasmids carry specific regions, called replicons, encoding functions
that are able to activate and control their replication independently
of the replication of the bacterial chromosome. Since 2005, a PCR-based
replicon typing (PBRT) scheme has been available, targeting
the replicons of the major plasmid families occurring in
Enterobacteriaceae [1]. In the decade following the introduction of
PBRT, thousands of enterobacterial strains have been typed for
replicon content. Replicons were originally used to attribute
plasmids to related plasmid families. Plasmids typed by PBRT follow
the same nomenclature as the families established by conjugation-based
incompatibility (Inc) testing [2]. Here, the strategy was to
use PCR targeting the replicon sequences that were, in most
cases, the molecular mechanism causing the incompatibility behavior
of plasmids. The Inc prefix continues to be used to name
plasmid families, even when the Inc phenotype has not been
formally confirmed by conjugation against the appropriate reference
plasmids.

Fernando de la Cruz (ed.), Horizontal Gene Transfer: Methods and Protocols, Methods in Molecular Biology, vol. 2075,
https://doi.org/10.1007/978-1-4939-9877-7_20, © Springer Science+Business Media, LLC, part of Springer Nature 2020

With the recent rapid increase in whole-genome and whole-plasmid
sequence data generated by high-throughput sequencing
platforms, there arose a need to translate the Inc typing and
PBRT-based classification schemes into a tool that can identify replicon
content in raw sequence data or contigs generated by
high-throughput sequencing of entire genomes. Replicon sequences
targeted by PBRT were used for building the first collection of
replicon sequences for the PlasmidFinder database [3]. The analysis
of nucleotide sequences available in GenBank determined the need
to add additional replicon sequences, up to the current 110
PlasmidFinder Enterobacteriaceae probes, to successfully recognize
(at >95% nucleotide identity and >96% coverage) almost all complete
sequences of large plasmids (>20 kb in size) available at the NCBI
database. When a replicon cannot be referred to a previously existing
Inc group nomenclature, the plasmid is assigned to a group of
homology using replication initiation protein genes as reference for
the new plasmid types.
Among plasmids that can be present in WGS data, a large majority
consists of small, ColE-like plasmids that could not be detected or
classified by Inc typing or PBRT [3, 4]. For these plasmids, multiple
phylogenetic analyses of the repA, RNAI, and oriT sequences
allowed the identification of 23 sequences that, using the >80%
nucleotide identity and 96% coverage criteria, were able to identify
and classify the small plasmids in WGS data into discrete plasmid groups.
In conclusion, a total of 132 replicon sequences, 109 and
23 recognizing large and small plasmids, respectively, are currently
included in the PlasmidFinder Enterobacteriaceae database
(https://cge.cbs.dtu.dk/services/PlasmidFinder/). By BLASTN,
the 132 PlasmidFinder sequences recognize almost 9000 large
and more than 11,000 small complete or partial plasmid sequences,
respectively, in the NCBI nucleotide database as available in
Dec. 2018.
Since plasmid families do not all occur at the same frequency, and
some families are prevalent, sequence-based typing schemes
were devised to identify plasmid types within the families. IncF,
IncI1, IncN, IncHI2, IncHI1, and IncA/C plasmids are currently
subtyped by plasmid multilocus sequence typing (pMLST;
http://pubmlst.org/plasmid/) [5–9]. For pMLST analysis, a weekly
updated database was generated from www.pubmlst.org and
integrated into a web tool called in silico pMLST. The PlasmidFinder
and pMLST web tools make it possible to screen WGS data
obtained from any kind of genome sequencer, without
particular bioinformatics skills, and to retrieve plasmid information
for use in clinical and epidemiological investigations.
For Gram-positive bacteria, a PlasmidFinder database has been
built on replication initiation protein genes of plasmids identified in
Enterococci, Streptococci, Staphylococci, Bacilli, Clostridia, and
Lactobacilli [10, 11] and includes 141 replicase sequences
recognizing more than 10,000 complete or partial plasmid sequences
(with a sequence identity between 73% and 100%) in the NCBI
nucleotide database (Dec. 2018). Since the Inc typing scheme for
Enterobacteriaceae did not include most of these plasmids, an
alternative nomenclature has been implemented by numbering
the different replication initiation protein genes and referring to
accession numbers of plasmids in GenBank used as prototypes for
specific replicons.

2 Materials

A personal computer or workstation with web access is needed.
Only data from a single isolate should be uploaded at the
http://cge.cbs.dtu.dk/services/PlasmidFinder/ website. Either raw
sequencing reads or assembled contigs can be uploaded (see Note 1).
The option “Assembled Genomes/Contigs” should be
selected if short sequencing reads have been assembled
into one continuous genome or into several contigs.
“Assembled Genomes/Contigs” is defined as one or several
contigs in one FASTA file (one entry per contig). For preassembled
partial or complete genomes, the input sequence must be in
one-letter nucleotide code in a single FASTA file
(https://en.wikipedia.org/wiki/FASTA_format). If large collections of
genomic data are to be analyzed, it is also possible to perform batch
upload to the Bacterial Analysis Pipeline at the Center for Genomic
Epidemiology (https://cge.cbs.dtu.dk/; [12]).
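
The input requirements above can be checked locally before a file is uploaded. The following is a minimal Python sketch, not part of the PlasmidFinder service: the function name check_fasta is hypothetical, and the allowed alphabet (A, C, G, T, and N, not case sensitive) is the one given in Note 1.

```python
import re

# Allowed one-letter nucleotide codes (see Note 1), case-insensitive.
ALLOWED = re.compile(r"[ACGTN]*", re.IGNORECASE)


def check_fasta(text):
    """Return a list of problems found in FASTA-formatted text.

    An empty list means the text has at least one header line and
    only allowed nucleotide characters in its sequence lines.
    """
    problems = []
    n_records = 0
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored in this sketch
        if line.startswith(">"):
            n_records += 1  # one header per contig
        elif n_records == 0:
            problems.append(f"line {number}: sequence before first header")
        elif ALLOWED.fullmatch(line) is None:
            problems.append(f"line {number}: characters outside A, C, G, T, N")
    if n_records == 0:
        problems.append("no FASTA header found")
    return problems


if __name__ == "__main__":
    example = ">contig_1\nACGTNN\nacgt\n>contig_2\nGGCC\n"
    print(check_fasta(example))
```

The sketch only reports problems; it does not attempt to repair the file, since silently rewriting sequence data could hide an upstream assembly or export error.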

3 Method

3.1 PlasmidFinder Single Analysis

The web tool PlasmidFinder (http://cge.cbs.dtu.dk/services/PlasmidFinder/)
utilizes the BLAST algorithm to look for DNA
homologies in both raw and assembled sequencing data from several
sequencing platforms. If assembled bacterial genomes or
plasmids are uploaded to the web service, they are immediately
converted into a BLAST database. If raw sequencing reads are
uploaded, KMA will be used for mapping [13]. KMA supports
the major sequencing platforms: Illumina, Ion Torrent, Roche
454, SOLiD, Oxford Nanopore, and Pacific Biosciences (PacBio).
If input consists of raw sequencing reads, the PlasmidFinder web
server will support FASTA and FASTQ files.
A database should be selected among Enterobacteriaceae
(a database containing 133 unique sequences) and Gram-positive
bacteria (a database containing 141 unique replication initiation
protein gene sequences). Furthermore, a percent identity (%ID)
threshold (the percentage of nucleotides that are identical between
the best-matching replicon sequence in the database and the
corresponding sequence in the assembled sequencing data)
between 100% and 50% can be selected. All genes with a %ID
equal to or greater than the selected threshold will be shown in the
output. By default, %ID = 95. This cutoff will allow the detection
of the larger Inc-related plasmids, while a cutoff of 80% is
required for the detection of the smaller ColE plasmids. Finally, a
threshold for minimum percent coverage can be selected between
100% and 20%. All plasmids with a percent coverage equal to or
greater than the selected threshold will be shown in the output.
By default the minimum percent coverage is set to 60%.
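
The effect of the two thresholds can be illustrated with a short Python sketch. This is not the server's implementation: the Hit class, the replicon labels, and the numbers are hypothetical, and coverage is assumed here to be the alignment length divided by the length of the database replicon. Only the default values (95% identity, 60% coverage) and the 80% identity cutoff for ColE plasmids come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One hypothetical alignment between a database replicon and the query."""
    replicon: str
    identical: int     # identical nucleotides in the alignment
    aligned: int       # alignment length
    template_len: int  # length of the replicon sequence in the database

    @property
    def identity(self):
        # Percent identity over the aligned region.
        return 100 * self.identical / self.aligned

    @property
    def coverage(self):
        # Assumption: percent of the database replicon covered by the alignment.
        return 100 * self.aligned / self.template_len


def filter_hits(hits, min_identity=95.0, min_coverage=60.0):
    """Keep hits at or above both thresholds (defaults as on the web page)."""
    return [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]


if __name__ == "__main__":
    hits = [
        Hit("replicon_A", identical=950, aligned=1000, template_len=1000),
        Hit("replicon_B", identical=170, aligned=200, template_len=200),
        Hit("replicon_C", identical=500, aligned=500, template_len=1000),
    ]
    print([h.replicon for h in filter_hits(hits)])
    # Lowering the identity cutoff to 80%, as suggested for ColE plasmids:
    print([h.replicon for h in filter_hits(hits, min_identity=80.0)])
```

In this example replicon_B (85% identity) is reported only when the identity threshold is lowered to 80%, and replicon_C (50% coverage) is excluded in both runs.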
To input the sequences, select a single FASTA file on your
local disk by using the applet. The green “Upload” button then
submits the job, and its status (either
“queued” or “running”) will be displayed and constantly updated
until it terminates and the server output appears in the browser
window. At any time during the wait, an e-mail address can be given;
the job will continue and a notification will be sent by e-mail when
it has terminated.

3.2 pMLST Single Analysis

To execute a sequence-type prediction using the pMLST web
server, the pMLST profile for the plasmid query should be selected.
pMLST distinguishes between six plasmid configurations:
IncA/C, IncF, IncHI1, IncHI2, IncI1, and
IncN. Raw data in FASTQ format or preassembled partial or
complete genomes in FASTA format can be uploaded (see Note 1).
To input the sequences, a single FASTA file on the local disk can be
uploaded by using the applet. For successful typing, a partial
genome must, as a minimum, contain all the loci necessary for
pMLST concatenated in one FASTA file.
The green “Upload” button starts the job. The status of the job
(either “queued” or “running”) will be displayed and constantly
updated until it terminates and the server output page appears in
the browser. There is also the option to enter an e-mail address to be
notified as soon as the results are ready.

3.3 PlasmidFinder and pMLST in the Bacterial Analysis Pipeline: Batch Upload

PlasmidFinder and pMLST are also included in the Bacterial Analysis
Pipeline: Batch Upload (https://cge.cbs.dtu.dk/services/cge/).
The CGE Bacterial Analysis Pipeline executes a workflow
of services with predefined parameters and stores the submitted data
and results in a database at the user’s disposal. This analysis can
only process preassembled isolates; therefore, contig files in FASTA
format should be uploaded. The pipeline was benchmarked using datasets
previously used to test the individual services.
Plasmid services included in the Bacterial Analysis Pipeline are
PlasmidFinder-1.2 and pMLST-1.4.

Fig. 1 PlasmidFinder output. Overview of the PlasmidFinder V2.0 output at the web page. The dark green color
indicates a perfect match for a given plasmid: the %Identity is 100 and the sequence in the genome covers
the entire length of the plasmid in the database. The light green color indicates a warning due to a non-perfect
match. The grey color indicates a warning due to a non-perfect match in which the query length is shorter than
the plasmid replicon length. The red color indicates that no plasmids with a match over the given threshold were found

The procedure consists of three steps. The first step is to
download the metadata template and fill it in (see Note 2). The second
step is to upload the template, and the third step is to upload the
FASTA files and submit the process.
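
Because the server accepts only a limited set of characters in file names (Note 2), it can be convenient to check names before a batch upload. The sketch below is an illustration only: the function name has_safe_name is hypothetical, and treating the file extension separately from the rest of the name is an assumption, since Note 2 lists only letters, digits, underscore, and hyphen.

```python
import re
from pathlib import PurePath

# Characters listed in Note 2: a-z, A-Z, 0-9, underscore, and hyphen.
SAFE_STEM = re.compile(r"[A-Za-z0-9_-]+")


def has_safe_name(filename):
    """Return True if the name (without its final extension) uses only
    the characters listed in Note 2."""
    stem = PurePath(filename).stem  # drops the final extension, e.g. ".fasta"
    return SAFE_STEM.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ["isolate_01.fasta", "my isolate.fasta", "sample.v2.fasta"]:
        print(name, has_safe_name(name))
```

Names containing spaces, extra dots, or accented characters are flagged by this check, so they can be renamed before the metadata template is filled in.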

3.4 PlasmidFinder Output

Once the PlasmidFinder server has finished running the submitted
job, it will display a graphical output similar to the example in
Fig. 1.
Output data include the name of the input file(s) uploaded by the
user and the selected threshold for minimum percent identity (%ID)
between the sequence in the genome of the input isolate and the
plasmid in the database. The output table has seven columns:
(1) replicon, if available expressed as an Inc group, against which
the input genome has been aligned; (2) percent identity in the
alignment between the best-matching plasmid in the database and
the corresponding sequence in the input genome (a perfect alignment
is 100%, but it must also cover the entire length of the plasmid
in the database); (3) query length of the best match in the genome
sequence compared to the length of the template (the matching
plasmid replicon in the database); (4) name of the contig or scaffold in
which the replicon is found; (5) starting position of the found
replicon in the contig; (6) notes on the plasmid; and (7) reference
GenBank accession number according to NCBI for the plasmid in
the database. The accession numbers of the plasmids that have been
used to build the PlasmidFinder database are very useful because
the reference plasmid can be used in a BLASTN analysis to detect
other contigs and scaffolds in the query sequence that presumably
belong to the same plasmid whose replicon has been identified by
PlasmidFinder (see Note 3). A FASTA file containing the
best-matching sequences from the query genome can be downloaded
via “Hit in genome sequences”.
The extended output shows the alignments. In the extended
output format, green color indicates matching nucleotides, red
color indicates mismatches, and gray indicates no query sequence
in part of the alignment. Downloadable files are text files containing
the result table and alignments.
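
When many results are handled, it can help to split probe names into their replicon name and reference accession and to group replicons that share a reference plasmid (see Note 3). The sketch below is a hedged illustration: the function names are hypothetical, and the parsing rule (replicon name before the first underscore, accession after the last underscore) is an assumption inferred from the probe names quoted in Note 3, not a documented format. Sharing a reference accession is only a hint that replicons may sit on one multireplicon plasmid.

```python
from collections import defaultdict


def parse_probe(name):
    """Split a probe name into (replicon, accession).

    Assumes the replicon name precedes the first underscore and the
    GenBank accession follows the last underscore.
    """
    replicon, _, rest = name.partition("_")
    _, sep, accession = rest.rpartition("_")
    if not replicon or not sep or not accession:
        raise ValueError(f"unexpected probe name: {name!r}")
    return replicon, accession


def group_by_reference(names):
    """Map each reference accession to the replicons that cite it."""
    groups = defaultdict(list)
    for name in names:
        replicon, accession = parse_probe(name)
        groups[accession].append(replicon)
    return dict(groups)


if __name__ == "__main__":
    probes = [
        "IncHI1A_1__AF250878",
        "IncHI1B(R27)_1_R27_AF250878",
        "IncFIA(HI1)_1_HI1_AF250878",
        "IncHI2_1__BX664015",
    ]
    print(group_by_reference(probes))
```

With the R27 probes from Note 3, the three replicons are grouped under AF250878, which matches the interpretation that they indicate a single IncHI1 plasmid rather than three plasmids.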

3.5 pMLST Output

The output shows the sequence type (ST) that has been associated
with the query and a table that has six columns containing detailed
results (Fig. 2): (a) allele name, (b) percentage of nucleotides that
are identical between the best-matching pMLST allele in the database
and the corresponding sequence in the plasmid, (c) length of
the alignment between the best-matching pMLST allele in the
database and the corresponding sequence in the plasmid,
(d) length of the best-matching pMLST allele in the database,
(e) number of gaps in the alignment, and (f) name of the
best-matching pMLST allele for each gene and allele identified. The
output also shows the name of the input file used in the analysis and
links for downloading the results as text files, as well as an
optional graphical presentation of the alignment of each of the
loci against the allele in the selected pMLST scheme that had the
best alignment score (see Note 4).

Fig. 2 In silico pMLST output. Overview of the in silico pMLST output at the web page. For a perfect matching
allele, the % identity will be 100, the allele length will equal the query length, and the number of gaps will be
0. Green color indicates a perfect match, while red color indicates an imperfect match or no match at all
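
The criteria for a perfect allele match (Fig. 2) and the condition for reporting an ST (Note 4) can be written down as a small Python sketch. It is an illustration under stated assumptions, not the server's code: the AlleleHit class, the function names, and the locus labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlleleHit:
    """One row of a hypothetical pMLST result table."""
    locus: str
    identity: float     # percent identity
    alignment_len: int  # length of the alignment (query length)
    allele_len: int     # length of the allele in the database
    gaps: int


def is_perfect(hit):
    """Perfect match: 100% identity, full allele length, and no gaps."""
    return (hit.identity == 100
            and hit.alignment_len == hit.allele_len
            and hit.gaps == 0)


def st_can_be_assigned(hits, required_loci):
    """An ST is expected only if every required locus has a perfect match."""
    perfect = {h.locus for h in hits if is_perfect(h)}
    return set(required_loci) <= perfect


if __name__ == "__main__":
    hits = [
        AlleleHit("locus_1", 100.0, 300, 300, 0),
        AlleleHit("locus_2", 99.3, 300, 300, 0),  # imperfect: a possible new allele
    ]
    print(st_can_be_assigned(hits, ["locus_1", "locus_2"]))
```

A locus with an imperfect or missing match leaves the ST undetermined, which corresponds to the case in Note 4 where the alleles are listed but no ST is given.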

4 Notes

1. Upload the raw sequencing file(s) of the isolate or partial/
complete genomes. For preassembled partial or complete genomes,
the input sequence must be in one-letter nucleotide code
in a single FASTA file. The allowed alphabet (not case sensitive)
is the following: A C G T and N (unknown). It does not matter
which type of short sequence reads was used to produce
the genome.
If the input consists of raw sequencing reads, then the
PlasmidFinder web server will support FASTA and FASTQ
files. Raw reads from the following technologies are supported:
• Illumina Solexa single end reads
• Illumina Solexa paired end reads
• Roche 454 single end reads
• Roche 454 paired end reads
• Ion Torrent single end reads
• SOLiD single end reads
• SOLiD paired end reads
• SOLiD mate paired reads
• Pacific Biosciences (PacBio)
• Oxford Nanopore
Depending on the technology used, sequencing may produce more
than one raw read file. Therefore, the web server allows for
multiple FASTA or FASTQ files to be uploaded, but keep in
mind that the data have to originate from the same single
isolate.
2. The metadata template for the Bacterial Analysis Pipeline.
Metadata is a term that covers additional data that support
the main data, which in this case are the WGS sequence data. The
information required includes, for example, the place and year of
isolation. The metadata template is in Excel format. It contains
27 attributes, where 11 of them are mandatory (header is in
bold), of which 7 are technical attributes used by the CGE
server. To avoid problems caused by file names, the server
accepts only a limited selection of ASCII characters:

a-z
A-Z
0-9
_
-

3. Interpretation of PlasmidFinder results for the IncF, IncH, and
IncI plasmid families
In general, every replicon detected by PlasmidFinder could
be interpreted as one plasmid in the isolate, but there are
important exceptions due to plasmids that contain more than
one replicon and are defined as multireplicon plasmids. The
most widespread are plasmids belonging to the IncF, IncHI1, and
IncHI2 families.
IncF multireplicon plasmids may show different combinations
of the FIA, FIB, and/or FII-FIC replicons. A PlasmidFinder
probe for the IncFIC(FII) combined replicon was
inserted into the database to make it clearer that some
IncFIC replicons also contain sequence data homologous to
IncFII replicons. FIA, FIB, and FII-FIC replicons may be
identified on different contigs that nevertheless belong to the
same plasmid. To assemble these plasmids, the contigs carrying
the different IncF replicons need to be linked by a
PCR-based gap closure approach. Alternatively, these complexities
can often be resolved by long sequence reads generated
with long-read technologies (PacBio Single Molecule, Real-Time
(SMRT) sequencing, Oxford Nanopore MinION, or similar).
The archetypal IncHI1 R27 plasmid (acc. no. AF250878)
carries the repHI1A and repHI1B replicons and also an FIA-like replicon.
These can be recognized by the IncHI1A_1__AF250878,
IncHI1B(R27)_1_R27_AF250878, and
IncFIA(HI1)_1_HI1_AF250878 PlasmidFinder probes, respectively.
Therefore, the detection of these three replicons in the WGS data of an
isolate can be interpreted as the presence of a single IncHI1
plasmid in the isolate and not three different plasmids.
Novel variants of the HI1 family have been recognized and
the archetypal plasmids are named pNDM-CIT [14] and
pNDM-MAR [15]. These can be recognized by the
IncHI1B(CIT)_1_pNDM-CIT_JX182975,
IncHI1A(CIT)_1_pNDM-CIT_JX182975,
IncHI1B_1_pNDM-MAR_JN420336, and
IncFIB(Mar)_1_pNDM-Mar_JN420336 PlasmidFinder
probes.
The IncHI2 family refers to the R478 archetypal plasmid
(acc. no. BX664015). R478 has two replicons, the repHI2
replicon that is unique to the IncHI2 group and the repHI2A
replicon, recognized by the IncHI2_1__BX664015 and
IncHI2A_1__BX664015 PlasmidFinder probes, respectively.
4. Interpretation of in silico pMLST output that does not report the
sequence type
In silico pMLST output gives allele numbers and the
corresponding sequence type (ST) when all the alleles are
detected and present in the pMLST database. New alleles
that may be present on the plasmid prevent the generation of
the ST. For plasmids lacking one of the alleles necessary to
determine the ST, in silico pMLST generates an output
listing the alleles present in the WGS data but not the ST.
For IncF plasmids carrying multiple replicons as described in
Note 3, the pMLST output will provide the alleles assigned to each
IncF replicon, but these do not generate a sequence type. The
mosaic structure of these plasmids prevents the application of a
classical multilocus sequence typing (MLST)-based approach.
IncF plasmids may or may not have a complete set of
FIA, FIB, and FII-FIC replicons. In silico pMLST results can
be used to assign a FAB (FII, FIA, FIB) formula from the allele
type and number identified for each replicon (like a serotype
formula). For example, the FAB formula [F1:A1:B1] can be
assigned to plasmid pRSB107 (AJ851089), which is positive
by pMLST for allele F1 for FII, A1 for FIA, and B1 for FIB
replicons, respectively. Analogously, the FAB formula [F2:A-:B-]
can be assigned to plasmid R100 (AP000342), positive for the
allele F2 only [6].
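
The construction of the FAB formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the in silico pMLST tool: the function name fab_formula and the dictionary input are hypothetical, and only the F, A, and B prefixes used in the two examples above are covered.

```python
def fab_formula(alleles):
    """Build a FAB formula from allele numbers keyed by replicon name.

    `alleles` maps "FII", "FIA", and "FIB" to allele numbers; a replicon
    that is absent from the mapping (or mapped to None) is written as "-".
    """
    parts = []
    for replicon, letter in (("FII", "F"), ("FIA", "A"), ("FIB", "B")):
        number = alleles.get(replicon)
        parts.append(f"{letter}{'-' if number is None else number}")
    return "[" + ":".join(parts) + "]"


if __name__ == "__main__":
    print(fab_formula({"FII": 1, "FIA": 1, "FIB": 1}))  # pRSB107 example
    print(fab_formula({"FII": 2}))                      # R100 example
```

The two calls reproduce the formulas given above for pRSB107 and R100, [F1:A1:B1] and [F2:A-:B-].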

References
1. Carattoli A, Bertini A, Villa L, Falbo V, Hopkins KL, Threlfall J (2005) Identification of plasmids by PCR-based replicon typing. J Microbiol Methods 63:219–228
2. Datta N, Hedges RW (1971) Compatibility groups among fi- R factors. Nature 234:222–223
3. Carattoli A, Zankari E, Garcia-Fernandez A, Voldby Larsen M, Lund O, Villa L, Aarestrup FM, Hasman H (2014) PlasmidFinder and pMLST: in silico detection and typing of plasmids. Antimicrob Agents Chemother 58(7):3895–3903
4. Orlek A, Phan H, Sheppard AE, Doumith M, Ellington M, Peto T, Crook D, Walker AS, Woodford N, Anjum MF, Stoesser N (2017) Ordering the mob: insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids. Plasmid 91:42–52
5. García-Fernández A, Chiaretto G, Bertini A, Villa L, Fortini D, Ricci A, Carattoli A (2008) Multilocus sequence typing of IncI1 plasmids carrying extended-spectrum beta-lactamases in Escherichia coli and Salmonella of human and animal origin. J Antimicrob Chemother 61:1229–1233
6. Villa L, García-Fernández A, Fortini D, Carattoli A (2010) Replicon sequence typing of IncF plasmids carrying virulence and resistance determinants. J Antimicrob Chemother 65:2518–2529
7. García-Fernández A, Carattoli A (2010) Plasmid double locus sequence typing for IncHI2 plasmids, a subtyping scheme for the characterization of IncHI2 plasmids carrying extended-spectrum beta-lactamase and quinolone resistance genes. J Antimicrob Chemother 65:1155–1161
8. García-Fernández A, Villa L, Moodley A, Hasman H, Miriagou V, Guardabassi L, Carattoli A (2011) Multilocus sequence typing of IncN plasmids. J Antimicrob Chemother 66:1987–1991
9. Phan MD, Kidgell C, Nair S, Holt KE, Turner AK, Hinds J, Butcher P, Cooke FJ (2009) Variation in Salmonella enterica serovar typhi IncHI1 plasmids during the global spread of resistant typhoid fever. Antimicrob Agents Chemother 53:716–727
10. Jensen LB, Garcia-Migura L, Valenzuela AJ, Løhr M, Hasman H, Aarestrup FM (2010) A classification system for plasmids from enterococci and other Gram-positive bacteria. J Microbiol Methods 80:25–43
11. C L, García-Migura L, Aspiroz C, Zarazaga M, Torres C, Aarestrup FM (2012) Expansion of a plasmid classification system for Gram-positive bacteria and determination of the diversity of plasmids in Staphylococcus aureus strains of human, animal, and food origins. Appl Environ Microbiol 78:5948–5955
12. Thomsen MC, Ahrenfeldt J, Cisneros JL, Jurtz V, Larsen MV, Hasman H, Aarestrup FM, Lund O (2016) A bacterial analysis platform: an integrated system for analysing bacterial whole genome sequencing data for clinical diagnostics and surveillance. PLoS One 11:e0157718
13. Clausen PTLC, Aarestrup FM, Lund O (2018) Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics 19:307
14. Dolejska M, Villa L, Poirel L et al (2013) Complete sequencing of an IncHI1 plasmid encoding the carbapenemase NDM-1, the ArmA 16S RNA methylase and a resistance-nodulation-cell division/multidrug efflux pump. J Antimicrob Chemother 68:34–39
15. Villa L, Poirel L, Nordmann P et al (2012) Complete sequencing of an IncH plasmid carrying the blaNDM-1, blaCTX-M-15 and qnrB1 genes. J Antimicrob Chemother 67:1645–1650
