You are on page 1of 17

CRSeek: a Python module for facilitating complicated CRISPR

design strategies
Will Dampier Corresp., 1, 2, 3 , Cheng-Han Chung 1, 2
, Neil T Sullivan 1, 2
, Andrew J Atkins 1, 2
, Michael R Nonnemacher
1, 2, 4
, Brian Wigdahl Corresp. 1, 2, 4

1
Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, PA, United States
2
Center for Molecular Virology and Translational Neuroscience, Institute for Molecular & Medicine and Infectious Disease, Drexel University College of
Medicine, Philadelphia, PA, United States
3
School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, PA, United States
4
Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, United States

Corresponding Authors: Will Dampier, Brian Wigdahl


Email address: wnd22@drexel.edu, bw45@drexel.edu

With the popularization of the CRISPR-Cas gene editing system there has been an
explosion of new techniques made possible by this versatile technology. However, the
computational field has lagged behind with a current lack of computational tools for
developing complicated CRISPR-Cas gene editing strategies. We present crseek, a Python
package that provides a consistent application programming interface (API) for multiple
cleavage prediction algorithms. Four popular cleavage prediction algorithms were
implemented and further adapted to work on draft-quality genomes. Furthermore, since
crseek mirrors the popular scikit-learn API, the package can be easily integrated as an
upstream processing module for facilitating further CRISPR-Cas machine learning research.
The package is fully integrated with the biopython package facilitating simple import,
export, and manipulation of sequences before and after gene editing. This manuscript
presents four common gene editing tasks that would be difficult with current tools but are
easily performed with the crseek package. We believe this package will help
bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a
useful addition to the field.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
1 CRSeek: a Python module for facilitating
2 complicated CRISPR design strategies
3 Will Dampier1,2,3 , Cheng-Han Chung1,2 , Neil T. Sullivan1,2 , Andrew
4 Atkins1,2 , Michael R. Nonnemacher1,2,4 , and Brian Wigdahl1,2,4
5
1 Department of Microbiology and Immunology, Drexel University College of Medicine,
6 Philadelphia, PA, USA
7
2 Center for Molecular Virology and Translational Neuroscience, Institute for Molecular

8 Medicine and Infectious Disease, Drexel University College of Medicine, Philadelphia,


9 PA, USA
10
3 School of Biomedical Engineering, Science, and Health Systems, Drexel University,

11 Philadelphia, Pennsylvania, USA


12
4 Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, USA

13 Corresponding author:
14 Will Dampier and Brian Wigdahl
15 Email address: wnd22@drexel.edu, bw45@drexel.edu

16 ABSTRACT

17 With the popularization of the CRISPR-Cas gene editing system there has been an explosion of new
18 techniques made possible by this versatile technology. However, the computational field has lagged
19 behind with a current lack of computational tools for developing complicated CRISPR-Cas gene editing
20 strategies. We present crseek, a Python package that provides a consistent application programming
21 interface (API) for multiple cleavage prediction algorithms. Four popular cleavage prediction algorithms
22 were implemented and further adapted to work on draft-quality genomes. Furthermore, since crseek
23 mirrors the popular scikit-learn API, the package can be easily integrated as an upstream processing
24 module for facilitating further CRISPR-Cas machine learning research. The package is fully integrated
25 with the biopython package facilitating simple import, export, and manipulation of sequences before
26 and after gene editing. This manuscript presents four common gene editing tasks that would be difficult
27 with current tools but are easily performed with the crseek package. We believe this package will help
28 bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a useful addition
29 to the field.

30 INTRODUCTION
31 Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are a family of related genes that
32 function as a defense system in prokaryotes against phages and plasmid DNA (Barrangou, 2015). By
33 incorporating a small portion of the invading DNA into the prokaryotic genome, it can defend against
34 subsequent attacks. In this system, CRISPR-associated (Cas) genes process these genomic repeats into a
35 guide RNA (gRNA). One particular region of the gRNA, termed the spacer, directs Cas endonucleases
36 to cleave the target DNA at a region complementary to the spacer on foreign DNA (Marraffini and
37 Sontheimer, 2010). Recently, aspects of this system have been re-engineered for researchers to easily
38 use this system as a highly specific endonuclease that can be directed to edit nearly any desired DNA
39 sequence across the genome (Jinek et al., 2012; Cong et al., 2013).
40 The CRISPR-Cas system has revolutionized the gene editing field (Doudna and Charpentier, 2014).
41 It has democratized the field by lowering the difficulty of editing a specific genomic locus (Genscript,
42 2016). Various applications of CRISPR-Cas systems have been developed with useful properties such
43 as an inducible expression system (Cao et al., 2016), tissue specificity (Ablain et al., 2015), as well as
44 advanced editing strategies (Kim et al., 2017). In practice, targeting a specific locus with any of these
45 systems simply involves finding a protospacer adjacent motif (PAM) next to a unique 20 bp protospacer
46 and performing basic molecular biology techniques (Genscript, 2016; Sternberg and Doudna, 2015).

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
47 Recent studies have elucidated the majority of the mechanism and binding site recognition of the
48 CRISPR-Cas system. Early target search strategies involved using ad-hoc rule-based methods that
49 excluded potential targets by defining either a total maximum of mismatches or distinguishing between
50 seed and tail regions (Fu et al., 2013; Hsu et al., 2013; Wu et al., 2014). Current targeting methods
51 employ a data-driven approach by using a dataset of experimentally determined targeting probabilities and
52 extracting sophisticated rules to construct scoring algorithms. Hsu et al. (2013) created a position-specific
53 binding matrix by testing 700 gRNAs engineered to have intended mismatches to their target. Doench
54 et al. (2016) followed this work by creating an even larger dataset of ≥80,000 gRNAs targeting many
55 positions across the same gene. Klein et al. (2018) further refined this method by developing a kinetic
56 model of CRISPR-Cas binding and cleavage using these datasets and showed that a small number of
57 parameters can explain these experimentally determined binding rules.
58 There are numerous currently employed tools for designing gRNAs reviewed by Ding et al. (2016);
59 Cui et al. (2018). All of these tools use the same basic strategy. First, the target DNA is searched for PAM
60 sequences and the adjacent protospacers are extracted. Next, the potential spacers are searched against a
61 reference database allowing for multiple mismatches and are scored for potential off-target sites using one
62 of the many scoring algorithms described above. Finally, the protospacers are ranked by their potential
63 off-target risk. These tools can be served either as a webserver and occasionally as a locally installable
64 toolset.
65 However, all of these tools lack many of the features needed from a bioinformatic developer perspec-
66 tive.
67 • Webtools often limit searching to model organisms or pre-built indices.
68 • Most tools cannot search for non-standard Cas9 variants such as Cpf1, SaCas9, or other species-
69 specific Cas9s.
70 • Currently distributed software lacks unit-testing, one-command install of the tool and dependencies,
71 continuous integration frameworks, or a documented API.
72 These features are needed in order for bioinformaticians to design larger scale experiments or develop
73 automation methods for more advanced gene editing strategies.
74 In this manuscript we present crseek (https://github.com/DamLabResources/crseek), a Python
75 library mirroring the popular scikit-learn application program interface (API) proposed in Buitinck et al.
76 (2013). It provides both high- and low-level methods for designing CRISPR gene editing strategies.
77 crseek provides an invaluable resource for computational biologists looking to employ CRISPR-Cas
78 based gene editing techniques.

79 METHODS
80 This manuscript presents a collection of tools for the biomedical software developer looking to design
81 advanced CRISPR-Cas gene-editing strategies. This toolset was built from a developer perspective to
82 aid biologists with programming experience who are looking to perform gRNA design for non-standard
83 CRISPR-Cas gene editing. The toolset, and its dependencies, are installable using the Ananconda
84 environment system (Anaconda, 2016).

85 Terminology
86 Throughout the research field there is a great deal of variability in the nomeclature of the various parts of
87 the CRISPR-Cas complex. Many of these nomeclatures do not conform to the Python PEP-8 standard
88 (Van Rossum et al., 2001). For consistency we refer to the complex parts in the following ways:
89 • The gRNA refers to the entire strand of RNA which complexes with Cas9 to form the CRISPR-Cas
90 complex. This molecule is not directly represented in crseek.
91 • The spacer refers to the 20 nucleotide region of the gRNA which is used for target matching by
92 the CRISPR-Cas complex. This requires an RNA alphabet.
93 • The pam refers to the, potentially degenerate, nucleotide recognition site of the particular CRISPR-
94 Cas system. The PAM resides adjacent to the spacer and is required for binding. This requires a
95 DNA alphabet.

2/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
96 • The target refers to the 20 nucleotide region of the DNA which is complementary to the spacer.
97 In order to accommodate multiple CRISPR-Cas systems the target includes the PAM recognition
98 site. This requires a DNA alphabet.

99 • The loci refers to a ≥20 nucleotide segment of the DNA which may contain potential targets.

100 The crseek tool does not make a distinction between on-target sites, those where binding is intended,
101 and off-target sites, those where binding is unintended; all (spacer, target) pairs are treated equally.
102 This framing is useful for expressing bleeding-edge uses of CRISPR-Cas that exploit the promiscuity of
103 SpCas9 to target heterogeneous populations such as HIV (Dampier et al., 2018) or modulate microbial
104 communities (Gomaa et al., 2014).

105 Implementation Strategy


106 We have intentionally mirrored the scikit-learn API. We subclass the BaseEstimator class and over-
107 load the fit, transform, predict, and predict proba methods. This allows us to seamlessly
108 use all of the tools of the scikit-searn package such as cross-validation, normalization, and other prediction
109 methods (Pedregosa et al., 2011). We have decomposed this methodology into three main tasks: Searching,
110 Preprocessing, and Estimating. We use the biopython library (Cock et al., 2009) to enforce appropriate
111 nucleotide alphabets as inputs and outputs of the various tools.

112 Searching
113 We define search as the act of locating potential target sites in loci. DNA segments can be any
114 files, or directories of files, readable by Bio.SeqIO (fasta, genbank, etc), lists of Bio.SeqRecord
115 objects, or np.arrays of characters. The crseek tool provides two strategies for this searching:
116 exhaustive or mismatch based. In an exhaustive search, all positions are considered as potential targets.
117 For mismatch searching, a wrapper was created around the popular cas-offinder library (Bae et al.,
118 2014). This library allows for rapid mismatch searching, even employing any graphical processors present
119 and properly installed. The utils.tile seqrecord and utils.cas offinder return arrays of
120 (spacer, target) pairs for downstream analysis.

121 Preprocessing
122 Once targets have been found on the genomic loci they must be encoded as (spacer, target) pairs for
123 downstream analysis. There are two main strategies for this encoding: mismatch versus one-hot encoding.
124 Rule-based mismatching and the Hsu et al. (2013) strategies both give the same weight independent of
125 the mismatch identity; as such this encoding is a binary vector of length 21 (20 for the spacer, 1 for
126 pam matching). One-hot encoding is used for strategies such as Doench et al. (2016) which give different
127 binding penalties based on the mismatch identity; for example an A:T mismatch may have a larger penalty
128 than an A:C mismatch and these penalties change with position.
129 These two strategies have been encapsulated into the preprocessing.MatchingTransformer
130 and preprocessing.OneHotEncoder classes. These classes take pairs of (spacer, target)
131 and return binary vectors for downstream processing. By implementing these as subclasses of the
132 sklearn.BaseEstimator, one can use these classes in sklearn.Pipeline instances. Down-
133 stream processing can either employ one of the pre-built estimator classes or used as preprocessing
134 for other machine learning algorithms.

135 Estimating
136 Multiple pre-built algorithms were collected for determining the activity of any given (spacer, target)
137 pair. These estimators input the binary vectors produced by the preprocessing modules. A flexible
138 MismatchEstimator was built that allows one to specify the seed-length, number of seed or non-seed
139 mismatches, and the PAM identity. These parameters can also be loaded from user-supplied yaml files to
140 allow for the exploration of non-standard CRISPR-Cas systems. The penalty scoring strategy described
141 by Hsu et al. (2013) was implemented as the MITEstimator and the Doench et al. (2016) method
142 as the CFDEstimator. The newly proposed kinetic model developed by Klein et al. (2018) has also
143 been implemented as the KineticEstimator. New estimators can be easily added by subclassing the
144 SequenceBase abstract base class and overloading the relevant methods.

3/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
145 Additional Features
146 In order to account for many of the idiosyncrasies of designing CRISPR-Cas tools, a few optional features
147 were added. Degenerate bases can be used across all tools. These bases are assumed to be each relevant
148 nucleotide in equal likelihoods and the penalty scores are scaled accordingly (e.g. an R is 50% likely to be
149 an A or G). SAM/BAM files can be used as input in all search tools. This allows for search and estimation
150 across variable populations. This is useful when targeting microbiomes, metagenomes, or highly mutable
151 viral genomes.

152 In vitro validation of Cas9 cleavage of mixed populations


153 In order to evaluate whether the crseek module can be used for novel analyses, an in vitro cutting assay
154 using mixed populations of sequences from HIV-1-infected individuals was adapted. Genomic DNA
155 was extracted from PMBCs derived from five patients in the Drexel Medicine CNS AIDS Research and
156 Eradication Study (CARES). The Long Terminal Repeat (LTR) region was amplified using a two-round
157 PCR strategy as previously described (Li et al., 2011). Bulk PCR products were cloned into the pGL3
158 basic expression vector as previously described (Li et al., 2015) obtaining one to six unique sequences per
159 patient. Additionally, sequences representing the consensus LTRs from CXCR4- (X4) and CCR5-utilizing
160 (R5) viruses constructed from previous analyses (Antell et al., 2016), were synthesized and cloned into the
161 pGL3 vector by VectorBuilder (Cyagen Biosciences). The LTR-A gRNA (ATCAGATATCCACTGACCTT
162 NGG) from Hu et al. (2014) was selected for analysis, purchased from IDT, and in vitro transcribed using
163 the EnGen sgRNA synthesis procedure (NEB) as described by the manufacturer protocol. The gRNA
164 was transcribed, purified, and concentrated using the RNA clean and concentrator procedure provided by
165 Zymo Research.
166 The in vitro digestion of patient LTRs was performed by following NEB in vitro digestion of DNA with
167 Cas9 nuclease protocol with no modifications. The multiple unique sequences cloned into pGL3 vectors
168 were added in equimolar ratios to mirror the biological context of HIV-1 infection. Cutting reactions
169 were performed for 1 hour at 37°C followed by 65°C heat-inactivation for 5 min. BamHI restriction
170 endonuclease digestion was performed to linearize any undigested plasmid and samples were cleaned
171 using the Qiagen PCR cleanup procedure to ensure proper running on the High Sensitivity Bioanalyzer
172 chip (Agilent) which was used to measure the sizes and molarities of the fragments.
173 As the sizes reported by the Bioanalyzer have a known degree of noise, a mean shift clustering
174 algorithm was used to identify the appropriate cutoffs to use when assigning a Bioanalyzer peak to a
175 particular molecular fragment. We use a width of 500 bp due to the estimate of 10% error in the length
176 estimation described in technical specifications provided by Agilent. The two fragment sizes resulting
177 from the Cas9 in vitro digestion were calculated in a similar way. The calculation of cleavage efficiency,
178 the mean of the molarity of two Cas9-digested fragments (designated as mol f 1 and mol f 2 in Equation
179 1) was used as the molarity for the fragmented pGL3. The molarity of the total pGL3 was calculated by
180 adding the molarity of fragmented pGL3 and the molarity of unfragmented pGL3 (designated as molun ).
181 The function of cleavage efficiency can be described as:

(mol f 1 +mol f 2 )/2


cleavage = molun +(mol f 1 +mol f 2 )/2 (1)

182 The processed data is shown in Table 1.

183 RESULTS
184 In order to showcase the many uses of the crseek module, this manuscript includes four simple and
185 realistic CRISPR-Cas design tasks. These tasks would not be reasonably possible with currently available
186 tools due to either the draft quality of the template genome or the complexity of the design.

187 Task 1
188 Creating positive controls for CRISPR library screens are critical to downstream validation. As an
189 illustrative example, imagine developing a gRNA that can knockout the expression of eGFP from a
190 plasmid (Addgene-54622, (Davidson, 2016)) transformed into the newly sequenced Clostridioides difficile
191 genome (Genbank: NC 009089.1, Monot et al. (2011)). A robust positive control will have a perfect
192 homology with the eGFP locus while having a low homology with any other position in the C. difficile
193 genome. The following script shows a basic implementation.

4/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
# Build the CFD estimator
estimator = estimators.CFDEstimator.build_pipeline()

# Load the relevant sequences


genome_path = 'data/Clostridioides_difficile_630.gb'
plasmid_path = 'data/addgene-plasmid-54622-sequence-158642.gbk'

with open(genome_path) as handle:


genome = list(SeqIO.parse(handle, 'genbank'))[0]

with open(plasmid_path) as handle:


plasmid = list(SeqIO.parse(handle, 'genbank'))[0]

egfp_feature = get_egfp(plasmid)
egfp_record = egfp_feature.extract(plasmid)

# Extract all possible NGG targets


possible_targets = utils.extract_possible_targets(egfp_record,
pams=('NGG',))

# Find all targets across the host genome


possible_binding = utils.cas_offinder(possible_targets,
5, locus = [genome])

# Score each hit across the genome


scores = estimator.predict_proba(possible_binding.values)
possible_binding['Score'] = scores

# Find the maximum score for each spacer


results = possible_binding.groupby('spacer')['Score'].agg('max')
results.sort_values(inplace=True)

194 Utilizing this simple script reveals that there are numerous potential spacers. The biopython
195 GenomeDiagram module can be used to generate a proposed plasmid map as shown in Fig. 1. This is
196 not an unexpected result as the C. difficile genome does not contain genes that share significant homology
197 with eGFP, as such this represents simple application of the crseek module. Aside from the draft-quality
198 of the genome, this design could be accomplished with many of the currently available tools.

199 Task 2
200 The crseek module is also capable of more complex spacer design tasks. The following code snippet
201 describes the generation of a large library of spacers that target individuals genes in the C. difficile
202 genome. The tight integration of crseek with the biopython package allows for simple reading and
203 writing of the numerous file types employed in the biological field. Furthermore, with the crseek
204 module it is simple to ensure that the spacers designed for a particular gene do not have homology
205 elsewhere in the genome.

library_grnas = []
gene_info = []

genome_key = genome.id + ' ' + genome.description


for feat in genome.features:

# Only target genes


if feat.type != 'CDS':
continue
# Get info about this gene

5/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
product = feat.qualifiers['product'][0]
tag = feat.qualifiers['locus_tag'][0]
gene_record = feat.extract(genome)

# Get potential spacers


possible_targets = utils.extract_possible_targets(gene_record)

# Score hits
possible_binding = utils.cas_offinder(possible_targets,
3, locus = [genome])
scores = estimator.predict_proba(possible_binding.values)
possible_binding['Score'] = scores

# Set hits within the gene to np.nan


slic = pd.IndexSlice[genome_key, :,
feat.location.start:feat.location.end]
possible_binding.loc[slic, 'Score'] = np.nan

# Aggregate and sort off-target scan


grouped_spacers = possible_binding.groupby('spacer')
ot_scores = grouped_spacers['Score'].agg('max')

# Spacers which only have intragenic hit


ot_scores.fillna(0, inplace=True)
ot_scores.sort_values(inplace=True)

# Save information
gene_info.append({'Product': product,
'Tag': tag,
'UsefulGuides': (ot_scores<=0.25).sum()})
top = ot_scores.head()
for protospacer, off_score in top.to_dict().items():
dna_target = protospacer.back_transcribe()
strand = '+'
if location == -1:
dna_target = reverse_complement(dna_target)
strand = '-'

location = genome.seq.find(dna_target)
library_grnas.append({'Product': product,
'Tag': tag,
'Protospacer': protospacer,
'Location': location,
'Strand': strand,
'Off Target Score': off_score})

gene_df = pd.DataFrame(gene_info)
library_df = pd.DataFrame(library_grnas)

206 This analysis shows that of the 3751 annotated genes in this draft C. difficile genome, 3506 (93.4%)
207 of them can be targeted by at least five spacers without off-target effects. Fig. 2 shows the selected
208 spacers along with the genes that are targeted. The design of the crseek library facilitates this
209 analysis with only three lines of CRISPR-specific code.

6/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
210 Task 3
211 The crseek module is flexible enough to facilitate the analysis of novel experimental designs. Popula-
212 tions of mixed HIV-1 LTR-containing plasmids were subjected to an in vitro cutting assay, as described
213 above, in an effort to explore the most appropriate way of predicting SpCas9 effectiveness against a mixed
214 population of DNA fragments. This mirrors the biological situation found in HIV-1-infected individu-
215 als who often contain a swarm of related, yet genetically distinct, integrated viral genomes (Dampier
216 et al., 2017). The script below describes how to compare the predictions of three popular methods with
217 experimentally determined cutting profiles.

# Load in the known data


cutting_data = pd.read_csv('data/IVCA_gRNAs_efficiency.csv')

# Text objects need to be converted to Bio.Seq


# objects with the correct Alphabet
ct = utils.smrt_seq_convert('Seq',
cutting_data['Target'].values,
alphabet=Alphabet.generic_dna)
cutting_data['target'] = list(ct)

sd = utils.smrt_seq_convert('Seq',
cutting_data['gRNA Sequence'].values,
alphabet=Alphabet.generic_dna)
cutting_data['spacer'] = [s.transcribe() for s in sd]

# Predict the score using each model across all sequences


ests = [('MIT', estimators.MITEstimator.build_pipeline()),
('CFD', estimators.CFDEstimator.build_pipeline()),
('Kinetic', estimators.KineticEstimator.build_pipeline())]

for name, est in ests:


# Models only require the gRNA and Target sequence columns
data = cutting_data[['spacer', 'target']].values
cutting_data[name] = est.predict_proba(data)

# Aggregate the predicted cutting data across each sample


predicted_cleavage = pd.pivot_table(cutting_data,
index = 'Patient',
columns = 'gRNA Name',
values = ['MIT', 'CFD',
'Kinetic'],
aggfunc = 'mean')

218 After aggregation, these results can be plotted against the known results as shown in Fig. 3. In these
219 studies, the CFD model has the best correlation between the predicted and observed results (r2 = 0.90)
220 when compared to an r2 = 0.81 for the MIT model and r2 = 0.83 for the Kinetic model.

221 Task 4
222 The extensible nature of the crseek module allows simple integration with the scikit-learn archi-
223 tecture. The crseek.preprocessing transformers can be used to easily extract features from
224 target, spacer pairs that can be used as inputs for scikit-learn prediction models. Addition-
225 ally, the crseek.estimators subclass the sklearn.BaseEstimator and implement .fit,
226 .predict, and .predict proba methods allowing them to be used interchangeably with native
227 scikit-learn models. The script below shows how crseek tools can be used to develop new machine
228 learning methods for identifying interaction rules as well as comparing the results to currently published
229 methodologies.

7/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
files = sorted(glob.glob('data/GUIDESeq/*/*.tsv'))

dfs = [pd.read_csv(f, sep='\t').reset_index() for f in files]


hit_data = pd.concat(dfs,
axis=0, ignore_index=True)

# Multiple data loading/cleaning steps have been omitted.


# See the Github repository files for the entire script.

from sklearn.ensemble import GradientBoostingClassifier


from sklearn.model_selection import cross_validate, StratifiedKFold

# Convert spacer-target pairs into a binary "matching" vector


transformer = preprocessing.MatchingTransformer()
X = transformer.transform(hit_data[['spacer', 'target']].values)
y = (hit_data['NormSum'] >= cutoff).values

# Evaluate the module using 3-fold cross-validation


grad_res = cross_validate(GradientBoostingClassifier(), X, y,
scoring = ['accuracy',
'precision',
'recall'],
cv = StratifiedKFold(random_state=0),
return_train_score=True)
grad_res = pd.DataFrame(grad_res)

# Evaluate MITEstimator using 3-fold cross-validation


mit_res = cross_validate(estimators.MITEstimator(), X, y,
scoring = ['accuracy',
'precision',
'recall'],
cv = StratifiedKFold(random_state=0),
return_train_score=True)
mit_res = pd.DataFrame(mit_res)

230 The prediction models that are part of the crseek module can be directly used in the scikit-learn
231 evaluation functions like cross validate. Additionally, the preprocessing tools can be used
232 to convert (spacer, target) pairs into a feature space that is directly usable by other scikit-learn
233 models. Fig. 4 shows the results of exploring simple, built-in, scikit-learn models utilizing data from a
234 GUIDE-Seq experiment by Tsai et al. (2015). While these results are preliminary, there is evidence that
235 more advanced machine learning methods such as gradient boosting and feed-forward neural networks
236 may have improved prediction capabilities.

237 Discussion
238 As the CRISPR-Cas gene-editing field expands, the number of potential strategies will increase as well.
239 It is not feasible to build single purpose-driven tools to account for all of these potential strategies. For
240 example, nickase-based strategies require finding targets with potential spacers on each strand at the same
241 position (Shen et al., 2014), while employing homologous recombination gene editing strategies requires
242 finding targets that are a specific distance away (Chen et al., 2013). While it may be easy to develop
243 single instances of these strategies by hand, it would be impracticable to do so across many, potentially
244 thousands, of genes. As such, a modular architecture was employed to allow bioinformaticians to use
245 individual parts of the library as needed to leverage the power of Python to design gene-editing strategies.
246 The crseek module can also be used to explore the use of CRISPR-Cas gene-editing in non-model
247 organisms. Even ”draft quality” genome sequences containing degenerate bases can be searched. This
248 further includes the use of differing CRISPR-Cas systems that can be explored easily. Researchers can
249 also integrate crseek into the scikit-learn ecosystem to develop more advanced estimators as more data

8/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
250 becomes available.
251 The crseek module can be used to evaluate the effectiveness of CRISPR-Cas gene in rapidly
252 evolving bleeding-edge use-cases of the CRISPR-Cas system. This will be an invaluable tool for fields
253 editing highly mutable genomes and meta-genomes of mixed populations. In the case of HIV-1, recent
254 research has shown that inter-patient variability presents a barrier to effective targeting (Dampier et al.,
255 2017; Roychoudhury et al., 2018). However, utilizing this tool it was possible to exploit the promiscuity
256 of SpCas9 to construct a set spacers capable of targeting the wide range of known mutations (Dampier
257 et al., 2018). Additionally, CRISPR gene-editing has been proposed as a strategy for shifting the human
258 microbiome by selectively targeting specific species (Bikard and Barrangou, 2017; Gomaa et al., 2014).
259 The crseek module can measure the effectiveness across each composed metagenome and accommodate
260 arbitrarily complicated targeting rules (eg. target genomes A-E, miss genomes F-J).
261 Lastly, the crseek module will be useful to researchers exploring the binding rules of non-model
262 Cas9 variants. Task 4 shows the ease of training a new estimator given a set of binding data. Using a Cas9
263 variant in a genome-wide binding experiment such as GUIDE-Seq (Tsai et al., 2015), CIRCLE-Seq (Tsai
264 et al., 2017), or SITE-Seq (Cameron et al., 2017), can provide invaluable training data. The crseek
265 module can be easily merged with the scikit-learn architecture or even more advanced deep learning
266 frameworks such as Keras (Chollet et al., 2015) or Tensorflow (Abadi et al., 2015).

267 Conclusion
268 By designing a consistent API that mirrors other common tools in the Python scientific stack, crseek
269 can be used to automate arbitrarily complicated gene editing strategies with minimal effort. Along with
270 the multitude of implemented features, crseek is an invaluable resource for computational biologists
271 exploring the intricacies of CRISPR-Cas gene editing.

272 ACKNOWLEDGMENTS
273 We would like to acknowledge the patients of the Drexel Medicine CARES cohort for providing material
274 for this study. Patients in the Drexel CNS AIDS Research and Eradication Study (CARES) Cohort were
275 recruited from the Partnership Comprehensive Care Practice of the Division of Infectious Disease and
276 HIV Medicine at Drexel University College of Medicine (Drexel Medicine), located in Philadelphia,
277 Pennsylvania. Patients in the Drexel Medicine CARES cohort were recruited under protocol 1201000748
278 (Brian Wigdahl, PI), which adheres to the ethical standards of the Helsinki Declaration (1964, amended
279 most recently in 2008). All patients provided written consent upon enrollment.

280 REFERENCES
281 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean,
282 J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R.,
283 Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster,
284 M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas,
285 F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow:
286 Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
287 Ablain, J., Durand, E. M., Yang, S., Zhou, Y., and Zon, L. I. (2015). A CRISPR/Cas9 vector system for
288 tissue-specific gene disruption in zebrafish. Developmental cell, 32(6):756–64.
289 Anaconda (2016). Anaconda software distribution.
290 Antell, G., Dampier, W., Aiamkitsumrit, B., Nonnemacher, M., Jacobson, J., Pirrone, V., Zhong, W.,
291 Kercher, K., Passic, S., Williams, J., Schwartz, G., Hershberg, U., Krebs, F., and Wigdahl, B. (2016).
292 Utilization of HIV-1 envelope V3 to identify X4- and R5-specific Tat and LTR sequence signatures.
293 Retrovirology, 13(1).
294 Bae, S., Park, J., and Kim, J.-S. (2014). Cas-OFFinder: a fast and versatile algorithm that searches
295 for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics (Oxford, England),
296 30(10):1473–5.
297 Barrangou, R. (2015). The roles of CRISPR–Cas systems in adaptive immunity and beyond. Current
298 Opinion in Immunology, 32:36–41.
299 Bikard, D. and Barrangou, R. (2017). Using CRISPR-Cas systems as antimicrobials. Current Opinion in
300 Microbiology, 37:155–160.

9/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
301 Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer,
302 P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. (2013).
303 API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD
304 Workshop: Languages for Data Mining and Machine Learning, pages 108–122.
305 Cameron, P., Cameron, P., Settle, A. H., Fuller, C. K., Thompson, M. S., Cigan, A. M., Young, J. K., and
306 May, A. P. (2017). SITE-Seq: A Genome-wide Method to Measure Cas9 Cleavage. Protocol Exchange.
307 Cao, J., Wu, L., Zhang, S.-M., Lu, M., Cheung, W. K. C., Cai, W., Gale, M., Xu, Q., and Yan, Q. (2016).
308 An easy and efficient inducible CRISPR/Cas9 platform with improved specificity for multiple gene
309 targeting. Nucleic acids research, 44(19):e149.
310 Chen, C., Fenk, L. A., and de Bono, M. (2013). Efficient genome editing in Caenorhabditis elegans by
311 CRISPR-targeted homologous recombination. Nucleic Acids Research, 41(20):e193–e193.
312 Chollet, F. et al. (2015). Keras.
313 Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck,
314 T., Kauff, F., Wilczynski, B., and de Hoon, M. J. L. (2009). Biopython: freely available Python
315 tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England),
316 25(11):1422–3.
317 Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., Hsu, P. D., Wu, X., Jiang, W., Marraffini,
318 L. A., and Zhang, F. (2013). Multiplex genome engineering using CRISPR/Cas systems. Science (New
319 York, N.Y.), 339(6121):819–23.
320 Cui, Y., Xu, J., Cheng, M., Liao, X., and Peng, S. (2018). Review of CRISPR/Cas9 sgRNA Design Tools.
321 Interdisciplinary Sciences: Computational Life Sciences, 10(2):455–465.
322 Dampier, W., Sullivan, N., Chung, C.-H., Mell, J., Nonnemacher, M., and Wigdahl, B. (2017). Designing
323 broad-spectrum anti-HIV-1 gRNAs to target patient-derived variants. Scientific Reports, 7(1).
324 Dampier, W., Sullivan, N. T., Mell, J., Pirrone, V., Ehrlich, G., Chung, C.-H., Allen, A. G., DeSimone,
325 M., Zhong, W., Kercher, K., Passic, S., Williams, J., Szep, Z., Khalili, K., Jacobson, J., Nonnemacher,
326 M. R., and Wigdahl, B. (2018). Broad spectrum and personalized gRNAs for CRISPR/Cas9 HIV-1
327 therapeutics. AIDS Research and Human Retroviruses, page AID.2017.0274.
328 Davidson, M. (2016). megfp-pbad.
329 Ding, Y., Li, H., Chen, L.-L., and Xie, K. (2016). Recent Advances in Genome Editing Using CRISPR/-
330 Cas9. Frontiers in Plant Science, 7:703.
331 Doench, J. G., Fusi, N., Sullender, M., Hegde, M., Vaimberg, E. W., Donovan, K. F., Smith, I., Tothova,
332 Z., Wilen, C., Orchard, R., Virgin, H. W., Listgarten, J., and Root, D. E. (2016). Optimized sgRNA
333 design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology,
334 34(2):184–191.
335 Doudna, J. A. and Charpentier, E. (2014). The new frontier of genome engineering with CRISPR-Cas9.
336 Fu, Y., Foden, J. A., Khayter, C., Maeder, M. L., Reyon, D., Joung, J. K., and Sander, J. D. (2013).
337 High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature
338 Biotechnology, 31(9):822–826.
339 Genscript (2016). CRISPR Handbook. GenScript, (October):1–32.
340 Gomaa, A. A., Klumpe, H. E., Luo, M. L., Selle, K., Barrangou, R., and Beisel, C. L. (2014). Pro-
341 grammable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. mBio,
342 5(1):00928–13.
343 Hsu, P. D., Scott, D. A., Weinstein, J. A., Ran, F. A., Konermann, S., Agarwala, V., Li, Y., Fine, E. J.,
344 Wu, X., Shalem, O., Cradick, T. J., Marraffini, L. A., Bao, G., and Zhang, F. (2013). DNA targeting
345 specificity of RNA-guided Cas9 nucleases. Nature Biotechnology, 31(9):827–832.
346 Hu, W., Kaminski, R., Yang, F., Zhang, Y., Cosentino, L., Li, F., Luo, B., Alvarez-Carbonell, D., Garcia-
347 Mesa, Y., Karn, J., Mo, X., and Khalili, K. (2014). RNA-directed gene editing specifically eradicates
348 latent and prevents new HIV-1 infection. Proceedings of the National Academy of Sciences of the
349 United States of America, 111(31):11461–6.
350 Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., and Charpentier, E. (2012). A pro-
351 grammable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science (New York,
352 N.Y.), 337(6096):816–21.
353 Kim, Y. B., Komor, A. C., Levy, J. M., Packer, M. S., Zhao, K. T., and Liu, D. R. (2017). Increasing the
354 genome-targeting scope and precision of base editing with engineered Cas9-cytidine deaminase fusions.
355 Nature biotechnology, 35(4):371–376.

10/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
356 Klein, M., Eslami-Mossallam, B., Arroyo, D. G., and Depken, M. (2018). Hybridization Kinetics Explains
357 CRISPR-Cas Off-Targeting Rules. Cell Reports, 22(6):1413–1423.
358 Li, L., Aiamkitsumrit, B., Pirrone, V., Nonnemacher, M. R., Wojno, A., Passic, S., Flaig, K., Kilareski, E.,
359 Blakey, B., Ku, J., Parikh, N., Shah, R., Martin-Garcia, J., Moldover, B., Servance, L., Downie, D.,
360 Lewis, S., Jacobson, J. M., Kolson, D., and Wigdahl, B. (2011). Development of co-selected single
361 nucleotide polymorphisms in the viral promoter precedes the onset of human immunodeficiency virus
362 type 1-associated neurocognitive impairment. Journal of NeuroVirology, 17(1):92–109.
363 Li, L., Parikh, N., Liu, Y., Kercher, K., Dampier, W., Flaig, K., Aiamkitsumrit, B., Passic, S., Frantz,
364 B., Blakey, B., Pirrone, V., Moldover, B., Jacobson, J., Feng, R., Nonnemacher, M., and Wigdahl,
365 B. (2015). Impact of Naturally Occurring Genetic Variation in the HIV-1 LTR TAR Region and Sp
366 Binding Sites on Tat-Mediated Transcription. Journal of Human Virology & Retrovirology, 2(4).
367 Marraffini, L. A. and Sontheimer, E. J. (2010). CRISPR interference: RNA-directed adaptive immunity in
368 bacteria and archaea. Nature Reviews Genetics, 11(3):181–190.
369 Monot, M., Boursaux-Eude, C., Thibonnier, M., Vallenet, D., Moszer, I., Medigue, C., Martin-Verstraete,
370 I., and Dupuy, B. (2011). Reannotation of the genome sequence of Clostridium difficile strain 630.
371 Journal of Medical Microbiology, 60(8):1193–1199.
372 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
373 P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and
374 Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning
375 Research, 12:2825–2830.
376 Roychoudhury, P., De Silva Feelixge, H., Reeves, D., Mayer, B. T., Stone, D., Schiffer, J. T., and Jerome,
377 K. R. (2018). Viral diversity is an obligate consideration in CRISPR/Cas9 designs for targeting the
378 HIV reservoir. BMC Biology, 16(1):75.
379 Shen, B., Zhang, W., Zhang, J., Zhou, J., Wang, J., Chen, L., Wang, L., Hodgkins, A., Iyer, V., Huang, X.,
380 and Skarnes, W. C. (2014). Efficient genome modification by CRISPR-Cas9 nickase with minimal
381 off-target effects. Nature Methods, 11(4):399–402.
382 Sternberg, S. H. and Doudna, J. A. (2015). Expanding the Biologist’s Toolkit with CRISPR-Cas9.
383 Tsai, S. Q., Nguyen, N. T., Malagon-Lopez, J., Topkar, V. V., Aryee, M. J., and Joung, J. K. (2017).
384 CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets.
385 Nature methods, 14(6):607–614.
386 Tsai, S. Q., Zheng, Z., Nguyen, N. T., Liebers, M., Topkar, V. V., Thapar, V., Wyvekens, N., Khayter,
387 C., Iafrate, A. J., Le, L. P., Aryee, M. J., and Joung, J. K. (2015). GUIDE-seq enables genome-wide
388 profiling of off-target cleavage by CRISPR-Cas nucleases. Nature biotechnology, 33(2):187–197.
389 Van Rossum, G., Warsaw, B., and N., C. (2001). Pep 8 – style guide for python code.
390 Wu, X., Scott, D. A., Kriz, A. J., Chiu, A. C., Hsu, P. D., Dadon, D. B., Cheng, A. W., Trevino, A. E.,
391 Konermann, S., Chen, S., Jaenisch, R., Zhang, F., and Sharp, P. A. (2014). Genome-wide binding of
392 the CRISPR endonuclease Cas9 in mammalian cells. Nature biotechnology, 32(7):670–6.

393 FIGURES & TABLES

11/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Patient Clone Target Cleavage Efficiency

R5 1 ATCAGATATCCACTGACCTT TGG 73%

X4 1 ACCAGATATCCACTGTGCTT TGG 6%

1 ACAAGATTTCCACTGACCTT TGG
A0019 2 ACAAGATTTCCACTGACCTT TGG 56%

1 ACCAGATTTCCCCTGACCTT TGG
A0020 2 ACCAGATTTCCCCTGACCTT TGG 25%

1 GTCAGATACCCACTGACCTT TGG
A0049 2 GTCAGATACCCACTGACCTT TGG 91%

3 GTCAGATATCCACTGACCTT TGG

A0050 1 ACCAGGTTTCCACTGACCTT TGG 58%

1 GTCAGATATCCACTGACCTT TGG
2 GTCAGATATCCACTGACCTT TGG
A0107 3 GTCAGATATCCACTGACCTT TGG 84%
4 GTCAGATATCCACTGACCTT TGG
5 ACTAGATGGCCACTGACCTT TGG
6 GTCAGATATCCACTGACCTT TGG

Table 1. Sequences used in the in vitro cutting assay. The Target column refers to the target region of
the cloned sequence with the highest match to the LTR-A gRNA used for cleavage
(ATCAGATATCCACTGACCTT NGG) in the LTR (loci). The pam region is bolded. The Cleavage
Efficiency refers to the percentage of the plasmids that were cleaved as measured by the BioAnalyzer.

12/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
ori

araC

Amp-R

T7 Tag

EGFP

Figure 1. A plasmid map of mEGFP-pBAD with the gene and promoter features labeled in blue and
light blue. Potential spacer-targeting sites are shown in green.

13/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Designable genes

Undesignable genes

Spacers

Figure 2. Approximately 93% of genes in the C. difficile genome can be targeted by at least 5 spacers
without predicted off-target effects. Gene regions are shown along the inner track with those in blue
representing those for which at least five spacers could be designed while those in red could not be
properly targeted. The outer track shows the location of spacers in green.

14/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
A MIT B CFD C Kinetic
1.0
Predicted Cleavage

0.8

0.6

0.4

0.2

r2=0.81 r2=0.90 r2=0.83


0
0 0.25 0.50 0.75 1.0 0 0.25 0.50 0.75 1.0 0 0.25 0.50 0.75 1.0
Observed Cleavage Observed Cleavage Observed Cleavage

Figure 3. Prediction models have different levels of agreement with in vitro data. Three different
prediction models were used to predict the cleavage percentage of a mixed population of targets. Each
point represents the percentage of cleavage observed on a population of distinct genetic variants
compared to the predicted percentage of cleavage. The line indicates the regression line, while the
shadows indicate the 95% confidence intervals of 1000 bootstrap replicates. A) The predictions were
performed using the algorithm proposed by Hsu et al. (2013). B) The predictions were performed using
the algorithm proposed by Doench et al. (2016) C) The predictions were performed using the algorithm
proposed by Klein et al. (2018).

15/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
0.25

0.20
Mean Absolute Error

0.15

0.10

0.05

0
MIT

CFD

Kinetic

Neural Network
Gradient Boosting

Nearest Neighbors

Figure 4. Advanced machine learning techniques may be useful in improving the agreement between
predicted and observed cleavage. Each bar indicates the mean absolute error between the predicted
cleavage rate and the observed cleavage rate measured using GUIDE-Seq by Tsai et al. (2015).
Scikit-learn models were fit using default parameters on the output of the
crseek.preprocessing.MisMatchEstimator.

16/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018

You might also like