


A Study On Protein Sequence Clustering

P. Usha Madhuri and Dr.S.P.Rajagopalan

Abstract— Protein sequence clustering has been widely used as a part of the analysis of protein structure and function. In most cases
single link or graph-based clustering algorithms have been applied. Clustering of proteins using SEQOPTICS (sequence clustering with
OPTICS), which is based on OPTICS (Ordering Points To Identify the Clustering Structure),is an attractive approach due to its emphasis
on visualization of results and support for interactive work such as choosing parameters. Ordering Points To Identify the Clustering
Structure can be implemented for Visualization of the sequence clustering structure. The system was evaluated by comparison with other
existing methods.

Index Terms – Clustering techniques, OPTICS, Protein Sequence, SEQOPTICS

——————————  ——————————

P. Usha Madhuri, Research Scholar, Dr.M.G.R. University, Chennai.
Dr.S.P.Rajagopalan, Emeritus Professor, Department of Computer Applications, Dr.M.G.R University, Chennai.

Extracting useful information from biological sequences is a growing problem as biological sequence databases expand rapidly. Among biological sequences, protein sequences are an especially interesting category, since proteins are functionally essential to life and their alphabet is large (20 amino acids). There are several well-known protein databases: Pfam[1] is a collection of protein families and domains that contains multiple protein alignments of these families; the National Center for Biotechnology Information (NCBI)[2] protein sequence database is an integrated, text-based search and retrieval system that is very often used in biological research; Swiss-Prot[3] is a protein sequence database that strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases; the Protein Information Resource (PIR)[4] serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. These databases are often used as data sources for protein sequence clustering studies. In this paper we use two data sets from Pfam, since it is a semi-automatic protein family database that aims to be comprehensive as well as accurate. The Swiss-Prot and NCBI protein databases are also used because they contain most protein sequences and are very popular in biological research.

As more and more protein sequences become available, we can better study protein structure and function. One of the most important computational methods is sequence clustering [5, 6]. The goal of clustering protein sequences is to obtain a biologically meaningful partitioning. Clustering a large set of protein sequences offers several advantages: proteins are usually grouped into families based on sequence similarity clustering, which provides clues about the general features of a family and evolutionary evidence about its proteins; clustering also helps to infer the biological function of a new sequence from its similarity to sequences of known function; moreover, protein clustering can be used to facilitate protein 3-dimensional structure discovery, which is very important for understanding a protein's function.

Recently developed clustering methods have been successful in clustering a large number of sequences simultaneously. ProClust[7] uses a graph-based approach and considers multi-domain sequences. SYSTER[8] overcomes the problem of an asymmetric distance matrix by computing a local pairwise alignment after performing a BLAST[9] search. GeneRAGE[10] is a fast method for clustering large protein data sets. ProtoMap[11] applies some more elaborate considerations.

Among protein sequence clustering methods, the simplest and most widely used category is based on the hierarchical (single linkage) clustering algorithm[12]. It aggregates all the sequences linked by a level of similarity above a given threshold, so that within a cluster any sequence is linked to at least one other

© 2010 Journal of Computing Press, NY, USA, ISSN 2151-9617

sequence. This approach may yield fairly good results, but often a majority of sequences are grouped into one huge cluster as a result of a massive chain effect due to multi-domain proteins. The Blastclust program, part of the BLAST package from NCBI, is an example of single linkage protein sequence clustering. Another category, graph-based clustering algorithms, is also commonly employed because of its clustering quality. BAG[13] is a sequence clustering algorithm based on graph theory.

OPTICS (Ordering Points To Identify the Clustering Structure)[14] is a density-based clustering method and is popular because it orders the data into a density-based clustering structure corresponding to a broad range of parameter settings. For density-based methods, it is difficult to decide the input parameters that the algorithm is sensitive to. OPTICS is a good solution to density-based cluster ordering. Although it does not produce clusters explicitly, OPTICS generates an augmented ordering of the data set representing its density-based clustering structure, and this structure can be visualized. Since OPTICS does not limit cluster extraction to global parameters, it is possible to extract cluster information interactively as well as automatically.

For any protein sequence clustering method a suitable distance measure needs to be chosen. Some functionally related sequences share little or no discernible sequence similarity, and detection of these relationships is difficult. The general practice is to base protein sequence clustering on pairwise sequence similarity/dissimilarity computed by algorithms such as Smith-Waterman[15]. Other protein distance measures such as BLAST[9] and FASTA[16] are also very commonly used in existing methods.

How to evaluate the quality of clustering results is an important issue in clustering analysis. For two-dimensional data, we can plot the data and read the distribution to tell how good the clustering results are. But for high-dimensional data or sequence clustering, direct visualization is normally not feasible. In protein sequence clustering, a popular measure of clustering quality is based on how well the clusters identified by the clustering algorithm match the protein families defined in some database by biological experts[8]. Another method is to compare SEQOPTICS with some existing methods by using certain validation techniques [17].

2 METHODS
Four data sets are extracted from different protein repositories as shown in Table 1. Two of them are from Pfam, since it is a protein families database of alignments and Hidden Markov Models (HMMs). Pfam multiple alignments come in two forms. In the first, “seed” alignments are representative, non-redundant sets of sequences that are checked reasonably carefully in a manual alignment editor.

Figure 1. SEQOPTICS Overview (stages: Data Sets Extraction, SEQOPTICS Clustering, Result Analysis)

SEQOPTICS expands the use of OPTICS, a method that has not been used in protein sequence analysis. Figure 1 shows the overview of our method. First, data sets are extracted from data sources (mostly protein databases), then mixed and randomized. The three data sources are Pfam, Swiss-Prot and NCBI. Secondly, the pairwise distances between any two proteins are computed. Here we used a normalized Smith-Waterman score. Several other options, such as BLAST or FASTA, may be chosen for the distance measure. Then the OPTICS algorithm is applied to perform the clustering, and the clustering structure is presented graphically. Finally the clustering results are analyzed and compared to some other methods by Jaccard Coefficient.

2.1 DATA SETS
The alignments are HMM-generated automatic alignments of every homologous domain [1]. Two other data sets are from NCBI and Swiss-Prot separately. Each protein sequence is labelled by its original notation. This labelling defines the assumed “true” clusters.

For example, if a sequence is extracted with “IGA1” from NCBI, then it is labelled as “IGA1” and assumed to be in the “IGA1” cluster. The size of each data set ranges from 197 to 319 sequences, just for testing our method. For each data set, protein sequences from different families are mixed and randomized to exclude manual clustering.

TABLE 1
PROTEIN SEQUENCE DATA SETS
Note: The number in parentheses is the number of sequences in each family.

2.2 COMPUTING DISTANCE
Our approach, consonant with others, starts with a distance measure. When data sets are from different protein families, it is common practice to use a normalized pairwise local alignment score computed by the Smith-Waterman dynamic programming algorithm. There are several parameters in Smith-Waterman, for example the scoring matrix, the gap open penalty and the gap extension penalty. Different scoring matrices, including BLOSUM50 and PAM250, have been tried. BLOSUM50, which is also used in FASTA[16], is used in this paper. The gap open penalty is 12 and the gap extension penalty is 2. The final similarity score between two protein sequences is then normalized as follows:

$$S_N(a, b) = \frac{S(a, b)}{\min(S(a, a),\ S(b, b))}$$

where $S(a, b)$ is the Smith-Waterman local alignment score between two sequences a and b; $S(a, a)$ is the score of sequence a against itself; $S(b, b)$ is the score of sequence b against itself; and $S_N(a, b)$ is the normalized score. The distance between two protein sequences is defined as:

$$\mathrm{Distance}(a, b) = 1 - S_N(a, b)$$

With this normalization, every distance score is between 0 and 1. If other scoring methods are used instead of Smith-Waterman, the distance measure needs to be adjusted appropriately.

2.3 OPTICS CLUSTERING
Some preliminary remarks on OPTICS have been given in the introduction. Some definitions might help in understanding how OPTICS works: an object p is in the ε-neighborhood of q if the distance from p to q is less than ε; a core object has at least MinPts neighbors in its ε-neighborhood. The reachability distance of p is the smallest distance such that p is density-reachable from a core object o. A cluster is a set of density-connected objects which is maximal with respect to density-reachability.

A reachability plot is a bar chart that shows each object’s reachability distance in the order in which the objects were processed, which reveals the cluster structure of the data. The final clusters can be extracted by either an ε-cutoff or the steepness of the plot. For more detailed information about the OPTICS algorithm, please refer to the original paper[14].

SEQOPTICS is implemented with a sequence distance measure based on the Smith-Waterman algorithm. The core OPTICS part was tested with the data sets from the author. Two parameters need to be chosen, ε and MinPts. In this paper, since the distance between any two protein sequences is between 0 and 1, we can use a single ε for all data sets, for example ε = 0.99, which is slightly smaller than 1. The MinPts used here is 10, just for experimental purposes. For a whole protein database, ε can still take any value between 0.95 and 0.99, and MinPts should be set to the average number of sequences in a family.

There are two main advantages of applying OPTICS in protein sequence clustering analysis: 1) OPTICS can find local density regions; 2) OPTICS produces an augmented ordering of the database representing its density-based clustering structure, and this ordering can be visualized, for example in a reachability plot. The cluster ordering actually contains information about every cluster, i.e., OPTICS enables the extraction of not only “traditional” cluster information, but also the intrinsic clustering structure.

SEQOPTICS is used to cluster the data sets. These provide some clues about clustering structure. The final density-based cluster sets are defined from the ordering reachability distance. To judge the biological accuracy of the resulting cluster set, we need to compare it to a “true” cluster set. However, there is no generally accepted “true” cluster set.

All automatic protein clustering methods are based on “all against all” sequence comparison. Real clusters need to be verified by biological expertise. Since it is impossible to have a “real” clustering, we have to assume the original database clusters are the “real” clusters. This is what most automatic protein clustering methods do.
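As an illustration of the normalization in Section 2.2, the following Python sketch computes a local alignment score and the distance derived from it. It is not the SEQOPTICS implementation. For brevity it assumes a simple match/mismatch score (+5/-4, hypothetical values) in place of BLOSUM50, and it assumes that a gap of length k costs 12 + 2(k - 1). The function names are illustrative only.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap_open=12, gap_extend=2):
    """Best local alignment score of strings a and b with affine gaps.

    Assumption: match/mismatch scoring stands in for BLOSUM50, and a gap
    of length k is charged gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j); E/F: ending in a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def normalized_distance(a, b, score=smith_waterman):
    """Distance(a, b) = 1 - S(a, b) / min(S(a, a), S(b, b))."""
    denom = min(score(a, a), score(b, b))
    if denom <= 0:
        raise ValueError("self-alignment score must be positive")
    return 1.0 - score(a, b) / denom
```

Under these assumed scores, `normalized_distance("ACDEFG", "ACDXFG")` is about 0.3, identical sequences have distance 0, and sequences with no matching residue have distance 1.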

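The OPTICS definitions of Section 2.3 and the cutoff rule of Section 3.1 can be illustrated with a short Python sketch that works on a precomputed distance matrix. This is a simplified O(n^2) version written for illustration, not the SEQOPTICS code. The function names and the convention that an object counts as its own neighbor are assumptions, and the cutoff rule is a direct reading of the text rather than the extraction procedure of the original OPTICS paper.

```python
import heapq
import math


def optics_order(dist, eps, min_pts):
    """Return (order, reach): the processing order of the objects and the
    reachability distance of each object in that order."""
    n = len(dist)
    reach = [math.inf] * n
    done = [False] * n
    order = []

    def neighbors(p):
        # Following the text: q is a neighbor of p if dist < eps.
        # Assumption: p itself is included, since dist[p][p] == 0.
        return [q for q in range(n) if dist[p][q] < eps]

    for start in range(n):
        if done[start]:
            continue
        heap = [(math.inf, start)]  # a new start has undefined reachability
        while heap:
            _, p = heapq.heappop(heap)
            if done[p]:
                continue  # stale heap entry
            done[p] = True
            order.append(p)
            nbrs = neighbors(p)
            if len(nbrs) < min_pts:
                continue  # p is not a core object
            core = sorted(dist[p][q] for q in nbrs)[min_pts - 1]
            for q in nbrs:
                if not done[q]:
                    new_reach = max(core, dist[p][q])
                    if new_reach < reach[q]:
                        reach[q] = new_reach
                        heapq.heappush(heap, (new_reach, q))
    return order, [reach[p] for p in order]


def extract_clusters(order, reach, cutoff):
    """Label each object with a cluster id, or -1 for noise.

    An object above the cutoff starts a new cluster if the next object in
    the ordering is at or below the cutoff; otherwise it is noise.
    """
    labels = {}
    cluster = -1
    for k, p in enumerate(order):
        if reach[k] > cutoff:
            starts_valley = k + 1 < len(order) and reach[k + 1] <= cutoff
            if starts_valley:
                cluster += 1
                labels[p] = cluster
            else:
                labels[p] = -1
        else:
            labels[p] = cluster
    return labels
```

For seven hypothetical points on a line at 0.0, 0.01, 0.02, 0.5, 0.51, 0.52 and 0.9, with absolute difference as the distance, eps = 0.99, min_pts = 2 and a cutoff of 0.1, this yields two clusters of three points each and marks the last point as noise.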

3.1 VISUALIZATION OF THE CLUSTER STRUCTURE
A reachability distance plot was made for each data set. In the figure, the horizontal axis represents the ordering of each sequence, the vertical axis represents the reachability distance, and each valley stands for a cluster set. For data set 1, there are five valleys in Figure 2: the first two valleys are composed of sequences from cytochrom B562; the third valley consists of sequences from glucokinase; the fourth valley contains sequences from the GABAR family; the fifth valley consists of sequences from the bac globin family.

The final density-based clusters were extracted by using a cutoff value. For example, in Figure 2, the cutoff value is set to 0.860 (the line reachability distance = 0.860). Under this cutoff condition, each valley in Figure 2 between two sequences with reachability distance higher than the cutoff is a cluster. The sequence that starts a valley, whose reachability distance is higher than the cutoff, is also in the same cluster as the rest of the sequences in the valley.

Figure 2. Cluster Structure of Data Set 1: Pfam (Protein Family)

Any sequence with reachability distance higher than the cutoff is noise if it does not start a valley. Therefore, in Figure 2, there are four clusters given the cutoff value 0.860. Similarly, there are four clusters in Figure 3 given cutoff 0.745, and six clusters in Figure 4 given cutoff 0.860.

3.3 VALIDATION OF THE CLUSTER SET
A cluster set of n data points from an experiment can be represented by $m = \frac{n(n-1)}{2}$ values in a triangular matrix M, where for i < j, $M_{ij} = 1$ if and only if i and j are in the same cluster, and $M_{ij} = 0$ otherwise. If T is the matrix of the “true” clusters, two cluster sets (“true” and “experimental”) can be compared based on the following numbers:

• a is “true positive”, i.e., the number of sequence pairs clustered together in both sets, which can be defined as:
$$a = |\{(i, j) : M_{ij} = 1 \wedge T_{ij} = 1,\ i < j\}|$$
• b is “false negative”, i.e., the number of sequence pairs clustered together in the true cluster set, but not in the current cluster solution, defined as:
$$b = |\{(i, j) : M_{ij} = 0 \wedge T_{ij} = 1,\ i < j\}|$$
• c is “false positive”, i.e., the number of sequence pairs clustered together in the current solution, but not in the true cluster set, defined as:
$$c = |\{(i, j) : M_{ij} = 1 \wedge T_{ij} = 0,\ i < j\}|$$
• d is “true negative”, i.e., the number of sequence pairs not clustered together in either the current solution or the true cluster set, defined as:
$$d = |\{(i, j) : M_{ij} = 0 \wedge T_{ij} = 0,\ i < j\}|$$

There are many validation techniques in the literature[18]. Three measures based on the above counts are as follows:

Precision is defined as:
$$P = \frac{a}{a + c} \qquad (1)$$
Recall is defined as:
$$R = \frac{a}{a + b} \qquad (2)$$
Jaccard Coefficient is defined as:
$$S = \frac{a}{a + b + c} \qquad (3)$$

All three values range between 0 and 1. The better the clustering, the larger the values. In a perfect clustering, which is identical to the true cluster set, P = 1, R = 1, and S = 1. Most existing sequence clustering methods perform well in terms of Precision but not in Recall. This is shown in the later validation results in Table 2, along with additional comparative values.

The clusters were formed from the same data sets with two other clustering methods, blastclust [9] and BAG [13], using the default parameters of these methods. BAG is a graph-based clustering method, and most existing methods are based on graph clustering. Blastclust is chosen because it is from the NCBI BLAST package. This hierarchical sequence clustering method is very popular in biological research.

Using the Jaccard Coefficient, Precision and Recall, we developed the comparison values shown in Table 2. From Table 2, we see that SEQOPTICS produces good results relative to each original cluster set in terms of Jaccard Coefficient. Every SEQOPTICS Jaccard Coefficient is higher than 0.65, the highest being 0.85. It is also seen in the table that SEQOPTICS outperforms BAG and blastclust on all the chosen data sets on Jaccard Coefficient.

TABLE 2
COMPARISON OF CLUSTERING RESULTS OF BAG, BLASTCLUST AND SEQOPTICS METHODS

The performance of BAG exceeds that of blastclust for the same reason. However, BAG and blastclust tend to give more clusters than the “true” clusters, explaining why the Precision of those two methods on all data sets is 1. But neither of these two performs well in terms of Recall. Overall, SEQOPTICS performs better than BAG and blastclust and seems a promising method in terms of clustering quality coupled with its graphical representation of the clustering.

The time complexity of Smith-Waterman is $O(n^2 l^2)$, where n is the number of sequences and l is the average length of a sequence. The time complexity of OPTICS is $O(n^2)$. Therefore the total time complexity is $O(n^2 l^2)$. This is an expensive method if Smith-Waterman is the only choice of distance measure. Fortunately there are other options for the distance between two protein sequences, such as BLAST or FASTA, which will dramatically decrease the overall time complexity.

In this paper a programmed system, SEQOPTICS, for protein sequence clustering, as shown in Figure 1, is described. A core portion (phase) of the system is based on the OPTICS clustering and visualization method, which we believe is being used here for the first time in protein sequence clustering. Prior to this phase, it is necessary to compute the distance between (protein) sequences. A normalized Smith-Waterman score is used in this paper to compute the required distance. The final system phase, Results Analysis, demonstrates the adequacy of our approach for small-scale data and the usefulness of the cluster structure visualization.

According to Ankerst[14], one good feature of OPTICS is that it is unnecessary to limit oneself to a single set of global parameters. An augmented cluster ordering contains information equivalent to density-based clusterings corresponding to a broad range of parameter settings; as such, the cluster ordering is a versatile base for both automatic and interactive cluster analysis. A second good feature lies in the visualization of the data set distribution. Depending on data set size, one can either represent the cluster ordering graphically for a small data set, or employ an alternative technique appropriate for large data sets. In this paper, we demonstrated with SEQOPTICS that the visualization of cluster structure is meaningful.

SEQOPTICS has proved its value for small data sets (<1000 sequences) in this paper. If this method is applied to a large data set, such as a whole database, future improvements are necessary to make it more useful. Specifically: 1) use some other distance measure for protein sequences, e.g., BLAST or FASTA; 2) apply parallel computing tools, for example the Message Passing Interface (MPI), for large data sets; 3) implement visualization techniques for large data sets; 4) consider incremental cluster ordering algorithms, since protein databases are updated very frequently.

[1] Alex Bateman, Ewan Birney, Lorenzo Cerruti, Richard Durbin, The Pfam Protein Families Database. Nucl. Acids. Res., 30(1):276–280, 2002.
[2] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. GenBank: update. Nucl. Acids. Res., 32(90001):D23–26, 2004.
[3] Amos Bairoch and Rolf Apweiler. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids. Res., 28(1):45–48, 2000.
[4] Cathy H. Wu, Hongzhan Huang, Leslie Arminski, Jorge Castro-Alvear,. The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucl. Acids. Res.,
[5] Evgenia V Kriventseva, Margaret Biswas, and Rolf Apweiler. Clustering and analysis of protein families. Current Opinion in Structural Biology, 11(3):334–339, 2001.
[6] S. Mohseni-Zadeh, P. Brezellec, and J. L. Risler. Cluster-c, an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques. Computational Biology and Chemistry, 28:211–218, 2004.
[7] P. Pipenbacher, A. Schliep, S. Schneckener, A. Schonhuth, D. Schomburg, and R. Schrader. ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics, 18(90002):182S–191, 2002.


[8] Antje Krause. Large Scale Clustering of Protein Sequences. PhD thesis, Universität Bielefeld, 2002.
[9] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman.
Basic local alignment search tool. Journal of Molecular Biology,
215(3):403–410, 1990.
[10] Anton J. Enright and Christos A. Ouzounis. GeneRAGE: a robust
algorithm for sequence clustering and domain detection.
Bioinformatics, 16(5):451–457, 2000.
[11] Golan Yona, Nathan Linial, and Michal Linial. ProtoMap: automatic
classification of protein sequences and hierarchy of protein families.
Nucl. Acids. Res.,28(1):49–55, 2000.
[12] Sibson R. Slink: an optimally efficient algorithm for the single-link
cluster method, 1973.
[13] Sun Kim and Arvind Gopu. BAG: A Graph Theoretic Sequence
Clustering Algorithm.
[14] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg
Sander. Optics: Ordering points to identify the clustering structure.
International Conference on Management of Data, June 1-3, 1999,
Philadelphia, Pennsylvania, USA, pages 49–60. ACM Press, 1999.
[15] T.F. Smith and M.S. Waterman. Identification of common molecular
subsequences. J. Mol. Biol., 147:195–197, 1981.
[16] W R Pearson and D J Lipman. Improved tools for biological
sequence analysis. Proc. Natl. Acad. Sci. U.S.A., 85:2444–2448, 1988.
[17] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On
clustering validation techniques. Journal of Intelligent Information
Systems, 17(2-3):107–145, 2001.
[18] M. Halkidi, Y. Batistakis, and M Vazirgiannis. On clustering
validation techniques. Journal of Intelligent Information Systems,
17(2/3):107–145, 2001.
[19] B. Amit and B. Baldwin. Algorithms for scoring coreference chains,
[20] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and
Lynette Hirschman. A model-theoretic coreference scoring scheme.
In MUC6 ’95: pages 45–52, Morristown, NJ, USA, 1995.

P. Usha Madhuri is a Research Scholar at Dr.M.G.R. University. She received the Master of Computer Applications degree in 2003 from the University of Madras, Chennai, and is pursuing a Ph.D. under the supervision of Dr.S.P.Rajagopalan, Emeritus Professor, Dr.M.G.R University, Chennai. She currently works as an Assistant Professor at Velammal Engineering College, Chennai. Her paper titled "An Overview of Basic Clustering Algorithms" was published in the International Journal of Computer Science and System Analysis. She is associated with the ISTE professional association.

Dr.S.P. Rajagopalan is an Emeritus Professor in the Department of Computer Applications, Dr.M.G.R University, Chennai, and a Syndicate member of the University of Madras. He has published more than 150 papers in national and international journals and more than 50 books in the mathematics and computer science disciplines, and has received a Best Teacher award.
