
2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
978-1-6654-0126-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/BIBM52615.2021.9669538

Deep Learning for Assignment of Protein Secondary Structure Elements from Cα Coordinates

Kamal Al Nasr*, Department of Computer Science, Tennessee State University, Nashville, TN, USA, kalnasr@tnstate.edu
Ali Sekmen, Department of Computer Science, Tennessee State University, Nashville, TN, USA, asekmen@tnstate.edu
Bahadir Bilgin, Department of Mechanical Engineering, Middle East Technical University, Ankara, Turkey, bahadir.bilgin@metu.edu.tr
Christopher Jones, Department of Computer Science, Tennessee State University, Nashville, TN, USA, cjone141@tnstate.edu
Ahmet Bugra Koku, Department of Mechanical Engineering, Middle East Technical University, Ankara, Turkey, kbugra@metu.edu.tr
Abstract—This paper presents a Deep Neural Network (DNN) system that uses a large set of geometric and categorical features for classification of secondary structure elements (SSEs) in the protein's trace, which consists of the Cα atoms on the backbone. A systematic approach is implemented for the protein SSE classification problem. The approach consists of two network architecture search (NAS) algorithms for selecting (1) the network architecture and layer connectivity, and (2) the regularization parameters. Each algorithm uses a different search space, and they are used in succession to develop a DNN. The DNN system achieves over 93% classification accuracy on average for multiple test sets without any post-processing of the amino acid configurations.

Index Terms—protein modeling, secondary structure classification, Cα backbone, deep neural networks

Ali Sekmen's research is supported by DOD grant W911NF-20-100284. Kamal Al Nasr's research is supported by NIH Academic Research Enhancement Award (R15 AREA: 1R15GM126509 01). *Corresponding author.

I. INTRODUCTION

Proteins are molecules made of a sequence of amino acids (i.e., residues), which perform a crucial role in cells. Each protein folds into a specific three-dimensional (i.e., 3D) spatial structure and can interact with other molecules [1]–[3]. Determination of the 3D structure is a key step in interpreting a protein's biological function and has become a primary concern in biological research and intelligent drug design. Many experimental methods have conventionally been used to determine the structure of a protein, including X-ray crystallography [4]–[6], Nuclear Magnetic Resonance [7], [8], and Cryo-Electron Microscopy [4], [9], [10]. Protein folding is governed by chemical forces and forms local sub-conformations during the folding process. These sub-conformations are called secondary structure elements (SSEs), and they are stabilized by hydrogen bonds. The most common types of SSEs are helices and sheets. Helices are stabilized by hydrogen bonding between the N−H group and the C=O group of peptide bonds four residues apart. Sheets are composed of two or more segments (i.e., strands) stretched along the structure of the protein and are stabilized by hydrogen bonding between the N−H group of one strand and the C=O group of an adjacent strand. The amino acids that are not involved in one of these two main SSEs form what is called loops/coils. Therefore, identifying/assigning these SSEs is crucial in protein structure determination.

Secondary structure assignments have diverse applications in protein science and analysis. They are used in structure comparison and classification, and they are important for protein modeling and structure prediction. Secondary structures have also been employed in protein dynamics and protein-protein interaction studies. Many experimental methods have conventionally been used to assign SSEs. Optical spectroscopy is a fast technique and has been used to inspect hydrogen-bond dynamics at a picosecond time scale for a small β-turn peptide [11]. Circular dichroism (CD) and Raman spectroscopy, on the other hand, are used to characterize overall protein secondary structure dynamics in solution, since helix and sheet structures give strong characteristic spectra that are highly correlated with X-ray data [12]–[14].

Experimental methods of protein structure determination have many limitations, including crystallization, resolution, time, and cost. In contrast, protein sequencing (i.e., determining the sequence of amino acids making up the protein) is less expensive and faster [15]. Therefore, there is a huge gap between the number of determined structures and the number of determined sequences. With the computational power available nowadays, several computational techniques are used to model the structures of proteins (i.e., find the spatial arrangements of amino acids) to close this gap. These techniques include ab initio modeling [16]–[18], comparative modeling [19], [20], and de novo modeling [21]–[25]. These modeling techniques generate structural information for a protein. However, they may not necessarily find the structure of a protein at atomic resolution. Therefore, they

may model a portion of a protein, its backbone chain, or a trace of Cα coordinates.

When a protein structure is determined, it is made available to the research community through unified databases. One popular database is the Protein Data Bank (PDB) [1], [26]. Each protein molecule has an entry in the PDB with a unique alphanumeric ID. The entry contains information about the protein, including information about its SSEs: the group of amino acids forming each SSE and the type of each SSE. The information about SSEs in PDB entries comes from experimental methods. When an experimental assignment is not available, several computational methods can be used, provided the all-atom structure is available. The most popular computational method is Define Secondary Structure of Proteins (DSSP) [27], [28]. DSSP is a pattern-recognition process based primarily on hydrogen-bonding patterns and geometrical feature constraints extracted from protein structure coordinates, and it has become one of the standard methods used to assign secondary structure annotations to the amino acids of a protein when the complete structure is available. DSSP calculates the intra-backbone hydrogen bonds of the protein between nitrogen and carbonyl groups along the polypeptide chain using a Coulomb approximation of the hydrogen-bond energy function [29]. The original version of DSSP (now called DSSPold) was rewritten in 2012 to improve the accuracy of π-helix assignment [30]. Another widely used algorithm is STRIDE [31], which aims to provide secondary structure assignments that are more consistent with experimental assignments and uses additional features such as dihedral angle propensities. STRIDE has been optimized on data from the PDB, which makes it a knowledge-based approach [32]. Both methods, DSSP and STRIDE, were reported to be less accurate at assigning π-helices [33]. Many other computational tools were developed to assign SSE types to protein residues (i.e., amino acids) when a complete structure is available and to improve the identification of π-helices, such as SECSTR [34], P-CURVE [35], XTLSSTR [36], and VoTAP [37].

In many cases, proteins deposited in the PDB have missing atomic data; about 30−40% of proteins have one or more missing atoms [38]. When atomic information such as the coordinates of atoms involved in hydrogen bonds is missing, DSSP, STRIDE, and similar algorithms become inaccurate and cannot be used. In addition to experimentally determined structures, structures produced by computational modeling techniques may not be complete. Therefore, many tools have been developed to assign SSEs with missing data. These tools use distance and angle profiles of local fragments, such as the information from the Cα atoms/coordinates in the protein's backbone (i.e., the Cα trace). The first tool to use this approach was proposed in 1977 [39]; it uses a sliding window covering four consecutive residues to find the distances and dihedral angles of Cα coordinates. Numerous other tools that use Cα coordinates to create a profile of geometry-based features, such as dihedral angles and Cα distances, have been developed following this initial effort, including P-SEA [40], KAKSI [32], SACF [41], STICK [42], and PMML [43], as well as our earlier approach [44]. Recently, some approaches have used machine learning with the Cα trace to assign SSEs [38], [45].

A. Paper Contributions
• A set of novel geometric features for each Cα, derived using physical constraints, is integrated with additional categorical features.
• A Deep Neural Network (DNN) is designed and implemented.

In this paper, we develop a deep learning method to assign the locations of SSEs from a protein's Cα trace and its geometry-profile features. Information from other atoms, such as the rest of the backbone and the side chains, is not used. Our approach uses 39 geometric features to assign the types of SSEs. The rest of the paper is organized as follows: Section II describes the geometric features used in this research. Section III provides the methodology. Section IV presents the experimental results and evaluation of the proposed methods, followed by conclusions in Section V.

II. FEATURE GENERATION

In this research, 39 geometric and categorical features are utilized for each amino acid in the protein's trace. 26 of these features are grouped as amino acid distances, atom angles, vector angles, and torsion angles, as described in [45] and depicted in Figure 1. An additional 13 geometric features were developed (illustrated in Figure 2); they are also used by our deep learning system. The vector of these 13 features (Fα) is divided into three category groups: axis distances (Dα), Euclidean distances (Eα), and neighborhood features (Nα). We create such a vector for every individual Cα atom of the protein. A Dα feature is the distance between the projections of the two Cα coordinates of interest onto a virtual axis created from two neighboring Cα coordinates: after the virtual axis is created, the projections of the two Cα coordinates of interest are calculated, and the distance between them is found. An Eα feature is the Euclidean distance between two Cα atoms of interest. Finally, the Nα features are geometrical and descriptive features collected for a given Cα atom that describe the neighborhood of the residue based on a set of requirements.

The first category of features calculated from the Cα coordinates is the set of axis distances, Dα. For residue i, Dα(i) consists of two axis distances: the distance between residues i − 2 and i − 1 on the virtual axis (i.e., line segment) created by atoms i − 2 and i + 1, and the distance between residues i and i + 1 on the same virtual axis. Each distance in this set is measured between the projections of the two Cα atoms of interest onto the virtual axis.

The second category of features calculated from the Cα coordinates is the set of Euclidean distances, Eα. Eα(i) consists of the four Euclidean distances between residue i and residues i − 3, i − 2,

i + 2, and i + 3. The third category of features calculated from the Cα coordinates is the set of geometrical features that describe the surroundings of a Cα atom, Nα. Nα(i) consists of four Euclidean distances, two scalar values, and one angular value for residue i. To calculate this set of features, we initially find a set of candidate neighbors of residue i. A residue j is considered a candidate neighbor of residue i if all the following requirements are fulfilled: 1) it is at least three residues apart from residue i, i.e., |i − j| > 3; 2) the distance between i and j is less than 6.31 Å; and 3) there is another residue j′ in the candidate list that is adjacent to j, such that j = j′ − 1 or j = j′ + 1. After the initial candidate list of neighbor residues is created, we drop weak neighbors based on the following criterion: residue j is kept in the final list if the distance between residue j and the line segment formed between residues i − 1 and i + 1 is less than 5.81 Å and its projection falls inside that line segment.

After the final list of neighbors is created, two scalar values are calculated: the number of neighbors in the list, and the shortest Euclidean distance between residue i and the residues in the list. Further, we calculate four Euclidean distances between i's surroundings and j's surroundings, where j is the closest residue in the list to residue i: the distance between i − 1 and j − 1, the distance between i − 1 and j + 1, the distance between i + 1 and j − 1, and the distance between i + 1 and j + 1. Finally, Nα(i) contains one angular value, which is the angle between the vectors formed by i − 1, i + 1 and by j − 1, j + 1.

Fig. 2: The set of geometrical features, Fα, collected for each residue. (a) The two distance features in Dα(i), shown as green arrows between the projections of the Cα coordinates onto the virtual axis. (b) The four distance features in Eα(i). (c) The seven features calculated for Nα(i). The final list of neighbors is shown in green. Two blue Cα residues are candidates removed from the initial list for not satisfying the requirements. The red lines are the four distances calculated between i's surroundings and j's surroundings, where j is the closest residue in the list to residue i.
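As a concrete illustration, the Dα, Eα, and candidate-neighbor computations described above can be sketched in NumPy. This is our own sketch, not the authors' code: the function names are assumptions, requirement 3 is interpreted here as the adjacent residue j ± 1 also passing the first two tests, and the weak-neighbor filter (5.81 Å rule) is omitted.

```python
import numpy as np

def axis_distances(ca, i):
    """D_alpha(i): distances between projections of Calpha pairs onto the
    virtual axis through residues i-2 and i+1. `ca` is an (N, 3) array."""
    a, b = ca[i - 2], ca[i + 1]
    u = (b - a) / np.linalg.norm(b - a)    # unit vector along the virtual axis
    t = lambda p: float(np.dot(p - a, u))  # scalar position of a projection
    return (abs(t(ca[i - 1]) - t(ca[i - 2])),  # projections of i-2 and i-1
            abs(t(ca[i + 1]) - t(ca[i])))      # projections of i and i+1

def euclidean_distances(ca, i):
    """E_alpha(i): distances from residue i to residues i-3, i-2, i+2, i+3."""
    return [float(np.linalg.norm(ca[i] - ca[i + k])) for k in (-3, -2, 2, 3)]

def candidate_neighbors(ca, i, d_max=6.31):
    """Initial N_alpha(i) candidate list: residues j with |i-j| > 3 and
    dist(i, j) < 6.31 A whose neighbor j-1 or j+1 also passes those tests."""
    near = {j for j in range(len(ca))
            if abs(i - j) > 3 and np.linalg.norm(ca[i] - ca[j]) < d_max}
    return sorted(j for j in near if (j - 1 in near) or (j + 1 in near))
```

For an ideal straight Cα chain with 3.8 Å spacing, the two Dα(i) values would both equal the spacing, since every projection falls on the axis itself.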

Fig. 1: General geometric features for Cα(i).

III. METHOD

Given a problem, there is no universal rule of thumb for determining the best network architecture to learn its solution; in many cases, finding one is a matter of trial and error. In this study, a systematic approach is implemented for the protein SSE classification problem. The approach consists of two network architecture search (NAS) algorithms, each using a different search space. The two NAS algorithms are used in succession to develop a deep neural network.

A. Dataset

In this study, a dataset of 3946 proteins comprising 904081 amino acids and their 39 features is used. Some of the amino acids are removed because they lack the features explained in the feature generation step; after their removal, the remaining dataset consists of 868027 amino acids and their 39 features. Out of the 39 features utilized for SSE classification, one feature corresponds to the name of the amino acid. Given that there are 20 amino acids, this name field in the feature vector is replaced with a one-hot-encoded vector. As a result, for training purposes, a feature vector (input) of size 38+20 is obtained. Similarly, the SSE elements are divided into the three common structures in proteins: α-helices, β-sheets, and loops. Hence, the SSE labels are also one-hot encoded into a label vector of size 3. Then, four different test sets, each containing 20 unique proteins, are subtracted from the dataset of 3946 proteins. The remaining dataset is divided into training-validation sets of 90% − 10% amino acid sizes,

respectively. This results in a training set of 766186 amino acids (approx. 88% of all data), a validation set of 85132 amino acids (approx. 10% of all data), and a total test set of 16708 amino acids (approx. 2% of all data). The 37 angle- and distance-based features in all sets are standardized (µ = 0, σ = 1) using the mean and standard deviation obtained from the training set.
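The encoding and standardization steps above can be sketched as follows. The one-letter residue alphabet, label names, and function names are illustrative assumptions, since the paper does not show its preprocessing code:

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")         # the 20 standard amino acids
SSE_CLASSES = ["helix", "sheet", "loop"]  # the 3 one-hot label classes

def one_hot(symbol, alphabet):
    """Replace a categorical field (amino acid name or SSE label)
    with a one-hot vector, as described in Sec. III-A."""
    v = np.zeros(len(alphabet))
    v[alphabet.index(symbol)] = 1.0
    return v

def standardize(train_x, other_x):
    """Standardize numeric features to mu=0, sigma=1 using statistics
    computed from the training set only (never from validation/test)."""
    mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)
    return (train_x - mu) / sigma, (other_x - mu) / sigma
```

Using training-set statistics for every split is the standard way to avoid leaking validation or test information into the model.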
B. Network Architecture

To build a network architecture, two NAS algorithms are developed, each using a different search space. The first algorithm selects the network architecture size and the layer connectivity; the second selects the regularization parameters.

The search space of the first-stage NAS algorithm includes the batch size, the number of hidden layers, the sizes of the layers, and the layer types. For layer types, a random selection between a fully connected layer and a convolutional layer is used; however, after preliminary experiments, convolutional layers were excluded from the search space, and only fully connected layers are used. Parameters outside the NAS search space are fixed: all networks picked by the NAS algorithm use ReLU activations except for the last layer, which uses a softmax activation, and all networks use Adam as the optimization method and categorical cross-entropy as the loss function. Networks employ a learning rate scheduler that lowers the learning rate at two points of training (at 1/3 and 2/3 of the total number of epochs). The purpose of the first-stage NAS algorithm is to find, via random search, the network architecture that maximizes the evaluation measure applied to the training set predictions.

Fig. 3: Graph of resulting network architecture.
The search space of the second-stage NAS algorithm consists only of the ℓ2 regularization weights assigned to each layer of the network. Similarly, a learning rate scheduler lowers the learning rate at two points of training (at 1/3 and 2/3 of the total number of epochs). The purpose of the second-stage NAS algorithm is to find, via grid search, the network that maximizes the evaluation measure applied to the validation set predictions.

C. Evaluation Measures

Labeling is done by taking the highest-probability label from the prediction (output vector). Consequently, thresholds are not used in labeling, since an amino acid can only have one of the SSE types. For this reason, plain classification accuracy is used as the evaluation measure. Additionally, ROC curves are given individually for each label to compare the classification consistency across labels.

1) Accuracy: Classification accuracy is used as the main evaluation measure for network performance. Networks are compared and selected by their accuracy, defined as the ratio of correctly assigned labels.

2) ROC curve: The ROC curve is calculated by separating the output vector into individual labels and predictions. The area under each ROC curve is also given for comparison.

D. Training Method

Training starts by finding a network architecture using the first-stage NAS algorithm. Then, the weights of the best network chosen by the first-stage NAS are transferred to a network with the same architecture, but with ℓ2 regularization on all layers. Lastly, this network is used as the base for the second-stage NAS algorithm, which finds a regularization setting that maximizes validation accuracy.

In the first stage of training, learning rates of 10 × 10−2, 3 × 10−2, and 1 × 10−2 are used for 30 epochs each, for a total of 90 epochs. A network with a training accuracy of 96.3% and a validation accuracy of 91.9% is selected. The selected network's width increases over the first half of the layers and decreases over the second half; the neuron counts of the fully connected layers are 58, 228, 290, 399, 379, 225, 123, and 3. A diagram of this network is shown in Figure 3. This network is then used in the second stage of training with learning rates of 10 × 10−3, 3 × 10−3, and 1 × 10−3 for 50 epochs each, for a total of 150 epochs. The selected network has a validation accuracy of 93.2%.
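As a rough sketch (not the authors' code), the step scheduler and the reported layer widths look like this; `n_parameters` is our own helper for gauging model size under the stated widths:

```python
WIDTHS = [58, 228, 290, 399, 379, 225, 123, 3]  # input, hidden layers, output

def scheduled_lr(epoch, total_epochs, rates=(10e-2, 3e-2, 1e-2)):
    """Step scheduler from Sec. III-B/D: the learning rate drops at the
    1/3 and 2/3 points of training (first-stage rates shown as defaults)."""
    if epoch < total_epochs / 3:
        return rates[0]
    if epoch < 2 * total_epochs / 3:
        return rates[1]
    return rates[2]

def n_parameters(widths=WIDTHS):
    """Trainable weights plus biases of the fully connected stack."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))
```

With the reported widths, the stack has on the order of 4.6 × 10^5 trainable parameters, which is small by modern standards and consistent with training on roughly 766K residues.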

IV. RESULTS

A network that uses the 39 generated geometric and categorical features is used to label the protein SSEs. The network architecture selected by the NAS algorithms is trained using the 766186-amino-acid training set and the 85132-amino-acid validation set. This architecture is described in the previous section and shown in Figure 3. The network is then tested on four different test sets of 19 or 20 proteins each, to label the secondary structure elements of the amino acids in each protein's trace. The accuracies of this classifier network are given in Table I; the accuracy over all test sets combined is 93.1%. Detailed information about the proteins in each set and the accuracy for each protein is shown in Table II and Table III.

TABLE I: Test set accuracies for different test sets.
Test Set 1: 92.9% | Test Set 2: 92.3% | Test Set 3: 93.1% | Test Set 4: 94.1% | All Sets: 93.1%

In Figure 4, ROC curves and ROC scores (areas under the ROC curves) are given for each label. All three curves closely approach the point of a true positive rate of 1 and a false positive rate of 0, which indicates that the network is a good classifier. Furthermore, based on the scores of the curves, the labeling performance for helices and sheets is slightly better than that for loops; the generated features appear to explain helices and sheets with more certainty.

TABLE II: The performance of our model on the first two sets of proteins.

TABLE III: The performance of our model on the second two sets of proteins.

Fig. 4: Test Set ROC curves for different classes.
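The argmax labeling and accuracy measure of Sec. III-C reduce to a few lines; this is a sketch with assumed names, not the authors' code:

```python
import numpy as np

def assign_labels(probs):
    """Pick the highest-probability SSE class for each residue; no
    thresholds are needed because the classes are mutually exclusive."""
    return np.argmax(probs, axis=1)

def accuracy(probs, true_labels):
    """Ratio of correctly assigned labels (the paper's main measure)."""
    return float(np.mean(assign_labels(probs) == np.asarray(true_labels)))
```

The per-label ROC curves would then be computed by treating each class column of `probs` as a one-vs-rest score, e.g. with scikit-learn's `roc_curve`.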

V. CONCLUSIONS

Secondary structures of a protein are used in many structural bioinformatics applications, such as structure comparison and classification, protein modeling and structure prediction, structure visualization, protein dynamics, and protein-protein interaction. SSE assignments can be conducted experimentally or computationally. When experimental methods are unavailable, computational methods such as DSSP and STRIDE are used, provided complete structural information is available for the protein. When some of the data involved in the calculations is missing, a reduced representation of the protein structure can be used to assign the SSEs. The reduced representation uses minimal information about the protein's structure, such as the Cα trace (i.e., the Cα coordinates).

In this paper, we proposed a deep learning method to assign an SSE type to each of a protein's residues. The approach uses the

Cα trace of a protein to assign one of three types: helix, sheet, or loop. It used a set of 39 geometric features computed from the Cα coordinates of each residue. The model was trained on a dataset of approximately 766K protein residues and validated with a dataset of approximately 85K residues that were randomly chosen from cullpdb pc20 res1.8 R0.25 d200528 chains5510. After the training step, the deep learning model was used to assign an SSE to each residue. The final accuracy of the classifier was reported for a test set consisting of four groups of either 19 or 20 proteins each. Using the information from the PDB as a reference, the classifier model achieved an average accuracy of 93.1% over all test sets.

REFERENCES

[1] V. Khanna, S. Ranganathan, N. Petrovsky, Rational Structure-Based Drug Design, Academic Press, Oxford, 2019, pp. 585–600.
[2] A. Mir, M. Naghibzadeh, N. Saadati, Index: Incremental depth extension approach for protein–protein interaction networks alignment, Biosystems 162 (2017) 24–34.
[3] T. Osajima, M. Suzuki, S. Neya, T. Hoshino, Computational and statistical study on the molecular interaction between antigen and antibody, Journal of Molecular Graphics and Modelling 53 (2014) 128–139.
[4] M. J. Tarry, A. S. Haque, K. H. Bui, T. M. Schmeing, X-ray crystallography and electron microscopy of cross- and multi-module nonribosomal peptide synthetase proteins reveal a flexible architecture, Structure 25 (5) (2017) 783–793.e4.
[5] C. Tsai, G. F. X. Schertler, Membrane Protein Crystallization, John Wiley and Sons, Inc, 2020.
[6] L. Maveyraud, L. Mourey, Protein x-ray crystallography and drug discovery, Molecules 25 (5) (2020).
[7] E. Hatzakis, Nuclear magnetic resonance (nmr) spectroscopy in food science: A comprehensive review, Comprehensive Reviews in Food Science and Food Safety 18 (1) (2019) 189–220.
[8] W. Li, Y. Zhang, J. Skolnick, Application of sparse nmr restraints to large-scale protein structure prediction, Biophys J 87 (2) (2004) 1241–1248.
[9] R. Danev, H. Yanagisawa, M. Kikkawa, Cryo-electron microscopy methodology: Current aspects and future directions, Trends in Biochemical Sciences 44 (10) (2019) 837–848.
[10] D. Wrapp, N. Wang, K. S. Corbett, J. A. Goldsmith, C.-L. Hsieh, O. Abiona, B. S. Graham, J. S. McLellan, Cryo-em structure of the 2019-ncov spike in the prefusion conformation, Science 367 (6483) (2020) 1260.
[11] C. Kolano, J. Helbing, M. Kozinski, W. Sander, P. Hamm, Watching hydrogen-bond dynamics in a β-turn by transient two-dimensional infrared spectroscopy, Nature 444 (7118) (2006) 469–472.
[12] R. W. Janes, Bioinformatics analyses of circular dichroism protein reference databases, Bioinformatics 21 (23) (2005) 4230–4238.
[13] J. G. Lees, A. J. Miles, F. Wien, B. A. Wallace, A reference database for circular dichroism spectroscopy covering fold and secondary structure space, Bioinformatics 22 (16) (2006) 1955–1962.
[14] A. Micsonai, F. Wien, L. Kernya, Y.-H. Lee, Y. Goto, M. Réfrégiers, J. Kardos, Accurate secondary structure prediction and fold recognition for circular dichroism spectroscopy, Proceedings of the National Academy of Sciences 112 (24) (2015) E3095.
[15] B. Kuhlman, P. Bradley, Advances in protein structure prediction and design, Nature Reviews Molecular Cell Biology 20 (11) (2019) 681–697.
[16] B. Adhikari, D. Bhattacharya, R. Cao, J. Cheng, Confold: Residue-residue contact-guided ab initio protein folding, Proteins: Structure, Function, and Bioinformatics 83 (8) (2015) 1436–1449.
[17] S. Dhingra, R. Sowdhamini, F. Cadet, B. Offmann, A glance into the evolution of template-free protein structure prediction methodologies, Biochimie 175 (2020) 85–92.
[18] J. Lee, P. L. Freddolino, Y. Zhang, Ab Initio Protein Structure Prediction, Springer Netherlands, 2017, pp. 3–35.
[19] S. D. Lam, S. Das, I. Sillitoe, C. Orengo, An overview of comparative modelling and resources dedicated to large-scale modelling of genome sequences, Acta Crystallographica Section D 73 (8) (2017) 628–640.
[20] S. B. Pandit, Y. Zhang, J. Skolnick, Tasser-lite: an automated tool for protein comparative modeling, Biophys J 91 (11) (2006) 4180–90.
[21] (2012). doi:10.1145/2382936.2382999.
[22] K. Al Nasr, J. He, Constrained cyclic coordinate descent for cryo-em images at medium resolutions: beyond the protein loop closure problem, Robotica 34 (8) (2016) 1777–1790.
[23] K. Al Nasr, C. Jones, F. Yousef, R. Jebril, Pem-fitter: A coarse-grained method to validate protein candidate models, Journal of Computational Biology (2017).
[24] M. Chen, P. R. Baldwin, S. J. Ludtke, M. L. Baker, De novo modeling in cryo-em density maps with pathwalking, Journal of Structural Biology 196 (3) (2016) 289–298.
[25] G. Terashi, D. Kihara, De novo main-chain modeling for em maps using mainmast, Nature Communications 9 (1) (2018) 1618.
[26] J. L. Sussman, D. Lin, J. Jiang, N. O. Manning, J. Prilusky, O. Ritter, E. E. Abola, Protein data bank (pdb): database of three-dimensional structural information of biological macromolecules, Acta Crystallographica Section D, Biological Crystallography 54 (6-1) (1998) 1078–1084.
[27] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (12) (1983) 2577–2637.
[28] W. G. Touw, C. Baakman, J. Black, T. A. H. te Beek, E. Krieger, R. P. Joosten, G. Vriend, A series of pdb-related databanks for everyday needs, Nucleic Acids Research 43 (D1) (2014) D364–D368.
[29] C. A. Anderson, B. Rost, Secondary Structure Assignment, 2nd Edition, Wiley-Blackwell, 2009, pp. 459–484.
[30] J. Zacharias, E.-W. Knapp, Protein secondary structure classification revisited: Processing dssp information with pssc, Journal of Chemical Information and Modeling 54 (7) (2014) 2166–2179.
[31] D. Frishman, P. Argos, Knowledge-based protein secondary structure assignment, Proteins: Structure, Function, and Bioinformatics 23 (4) (1995) 566–579.
[32] J. Martin, G. Letellier, A. Marin, J.-F. Taly, A. G. de Brevern, J.-F. Gibrat, Protein secondary structure assignment revisited: a detailed analysis of different assignment methods, BMC Structural Biology 5 (1) (2005) 17.
[33] M. Fodje, S. Al-Karadaghi, Occurrence, conformational features and amino acid propensities for the alpha-helix, Protein Engineering, Design and Selection 15 (5) (2002) 353–358.
[34] S. J. Hamodrakas, A protein secondary structure prediction scheme for the ibm pc and compatibles, Computer Applications in the Biosciences: CABIOS 4 (4) (1988) 473–477.
[35] H. Sklenar, C. Etchebest, R. Lavery, Describing protein structure: A general algorithm yielding complete helicoidal parameters and a unique overall axis, Proteins: Structure, Function, and Bioinformatics 6 (1) (1989) 46–60.
[36] S. M. King, W. C. Johnson, Assigning secondary structure from protein coordinate data, Proteins: Structure, Function, and Bioinformatics 35 (3) (1999) 313–320.
[37] F. Dupuis, J.-F. Sadoc, J.-P. Mornon, Protein secondary structure assignment through Voronoï tessellation, Proteins: Structure, Function, and Bioinformatics 55 (3) (2004) 519–528.
[38] S. M. Law, A. T. Frank, C. L. Brooks III, Pcasso: A fast and efficient cα-based method for accurately assigning protein secondary structure elements, Journal of Computational Chemistry 35 (24) (2014) 1757–1761.
[39] M. Levitt, J. Greer, Automatic identification of secondary structure in globular proteins, Journal of Molecular Biology 114 (2) (1977) 181–239.
[40] G. Labesse, N. Colloc'h, J. Pothier, J.-P. Mornon, P-sea: a new efficient assignment of secondary structure from cα trace of proteins, Bioinformatics 13 (3) (1997) 291–295.
[41] C. Cao, G. Wang, A. Liu, S. Xu, L. Wang, S. Zou, A new secondary structure assignment algorithm using cα backbone fragments, International Journal of Molecular Sciences 17 (3) (2016) 333.
[42] W. R. Taylor, Defining linear segments in protein structure, Journal of Molecular Biology 310 (5) (2001) 1135–1150.
[43] A. S. Konagurthu, L. Allison, P. J. Stuckey, A. M. Lesk, Piecewise linear approximation of protein structures using the principle of minimum message length, Bioinformatics 27 (13) (2011) i43–i51.
[44] D. Si, S. Ji, K. Al Nasr, J. He, A machine learning approach for the identification of protein secondary structure elements from cryoem density maps, Biopolymers 97 (2012) 698–708.

[45] M. A. Sallal, W. Chen, K. A. Nasr, Machine learning approach to assign
protein secondary structure elements from cα trace, in: 2020 IEEE
International Conference on Bioinformatics and Biomedicine (BIBM),
2020, pp. 35–41.
