Clustering Time Series Data Stream – A Literature Survey

Published by ijcsis
Published by: ijcsis on Jun 30, 2010
Copyright: Attribution Non-commercial

V. Kavitha, M. Punithavalli
Computer Science Department, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, Tamil Nadu, India.
†† Sri Ramakrishna College of Arts & Science for Women, Coimbatore, Tamil Nadu, India.
Kavithaanand11@gmail.com, mpunitha_srcw@yahoo.co.in
Abstract-
Interest in mining time series data has grown tremendously. In this survey, various implementations are studied and summarized to identify the problems in existing applications. Clustering time series is a problem that has applications in a wide range of fields and has recently attracted a large amount of research. Time series data are frequently large and may contain outliers. In addition, time series are a special type of data set in which elements have a temporal ordering. Clustering such data streams is therefore an important issue in the data mining process. Numerous techniques and clustering algorithms have been proposed to assist in clustering time series data streams. The clustering algorithms and their effectiveness in various applications are compared, with a view to developing a new method that solves the existing problems. This paper presents a survey of clustering algorithms available for time series datasets. Moreover, the distinctive features and limitations of previous research are discussed, and several possible topics for future study are identified. The areas that use time series clustering are also summarized.
Keywords-
Data Mining, Data Streams, Clustering, Time Series, Machine Learning, Unsupervised Learning, Feature Extraction and Feature Selection.
I. INTRODUCTION
Time series data management has become an interesting research topic for data miners. In particular, the clustering of time series has attracted the interest of researchers. Data mining is usually constrained by three limited resources: time, memory and sample size. Recently, time and memory have become the bottleneck for machine learning applications. Clustering is an unsupervised learning process for grouping a dataset into subgroups. A data stream is an ordered sequence of points $x_1, \ldots, x_n$. These data can be read or accessed only once or a small number of times. A time series is a sequence of real numbers, each number indicating a value at a time point. In recent real-world applications, data flows continuously from a data stream at high speed, producing more examples over time. Traditional algorithms cannot cope with the high-speed arrival of time series data. For this reason, new algorithms have been developed for real-time data processing.

Time series data are being generated at an unprecedented speed from almost every application domain, e.g., daily fluctuations of the stock market, fault diagnosis, dynamic scientific experiments, electrical power demand, position updates of moving objects in location-based services, various readings from sensor networks, and biological and medical experimental observations. Traditionally, clustering is treated as a batch procedure. Most clustering techniques fall into two major categories: partitional clustering and hierarchical clustering [1]. These are two key aspects for achieving effectiveness and efficiency when using time series data. A time series experiment requires multiple arrays, which makes it very expensive. Dimensionality reduction techniques can be divided into two groups: (i) feature extraction and (ii) feature selection. Feature extraction techniques extract a set of new features from the original attributes. Feature selection is a process that selects a subset of the original attributes. There have been numerous textbooks [5] and publications on clustering of scientific data for a variety of areas such as taxonomy, agriculture [2], remote sensing [3], and process control [4]. This paper presents a survey of clustering algorithms available for time series datasets. Moreover, the distinctive features and limitations of previous research are discussed, and several possible topics for future study are identified. The areas to which time series clustering has been applied are also summarized.

The remainder of the paper is organized as follows. Section 2 reviews the concept of time series and gives an overview of the algorithms of different techniques. Section 3 briefly discusses possible future extensions of the work. Section 4 concludes the paper with a short discussion.
II. RELATED WORK
Quite a number of clustering techniques have been proposed for time series data streams. This section of the paper discusses some of the earlier proposed methods for efficient clustering of time series datasets.

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, April 2010, p. 289, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

Ville Hautamäki et al. in [8] address the problem of clustering time series data in Euclidean space, using Random Swap (RS) and agglomerative hierarchical clustering followed by a k-means fine-tuning algorithm to compute locally optimal prototypes. The approach provides the best clustering accuracy and also improves on k-medoids. The stated drawback of this algorithm relates to quality.

Pedro Pereira Rodrigues et al. in [6] analyze an incremental system for clustering streaming time series. The Online Divisive Agglomerative Clustering (ODAC) system continuously maintains a tree-like hierarchy of clusters using a top-down strategy. ODAC measures cluster quality by the cluster's diameter, defined as the highest dissimilarity between objects of the same cluster. The strength of ODAC is that it does not need a predefined number of target clusters. It performs well at finding the correct number of clusters, as obtained from a set of k-means runs. A disadvantage of the system is that, when the tree structure expands, the variables must move from root to leaf, and variables may be split even when there is no statistical confidence in the assignment decision. The computation required for high-dimensional data may also be a drawback of the clustering procedure.

Xiang Lian et al. in [7] observe that an efficient and effective similarity search over stream data is essential for all types of time series applications. They predict the unknown values that have not yet arrived at the system and answer similarity queries based on the predicted data, using three approaches: polynomial, Discrete Fourier Transform (DFT) and probabilistic. These approaches can lead to good offline prediction accuracy but are not suitable for an online stream environment, because online processing requires low prediction and training costs.
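As an illustration of the polynomial idea named above, the following minimal sketch fits a low-degree polynomial to a window of recent values and extrapolates it. It is not the procedure from [7]; the function name `predict_polynomial`, the window contents and the choice of degree are assumptions made only for this example.

```python
import numpy as np

def predict_polynomial(recent_values, horizon=1, degree=2):
    """Fit a polynomial to the most recent values and extrapolate it.

    Hypothetical helper for illustration; not taken from the surveyed paper.
    """
    y = np.asarray(recent_values, dtype=float)
    t = np.arange(len(y))                      # time indices of the window
    coeffs = np.polyfit(t, y, degree)          # least-squares polynomial fit
    future_t = np.arange(len(y), len(y) + horizon)
    return np.polyval(coeffs, future_t)        # values at the next time steps

# A window that follows (t + 1)^2 exactly, so a degree-2 fit recovers it.
window = [1.0, 4.0, 9.0, 16.0, 25.0]
forecast = predict_polynomial(window, horizon=2, degree=2)
print(forecast)  # approximately [36. 49.]
```

The fit is recomputed for every new window here, which keeps the sketch simple but ignores the incremental updates an online setting would need.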
These approaches are straightforward ways of seeking general solutions, and they can predict values while explicitly providing a confidence for each prediction. The polynomial approach predicts future values based on the approximated curve of recent values. The Discrete Fourier Transform (DFT) approach forecasts future values using approximations in the frequency domain. The probabilistic approach can provide predicted values and can adapt to changes in the data. The group probabilistic approach utilizes the correlations among stream time series. The drawback of the probabilistic approach is that it needs more time to predict future values.

Sudipto Guha et al. in [13] described a streaming algorithm that effectively clusters large data streams. For the analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. The STREAM algorithm is based on divide and conquer and achieves a constant-factor approximation in small space. It builds on a facility location algorithm that might produce more than k centers. The advantage of the STREAM algorithm is its trade-off between cluster quality and running time. The algorithm was compared with the BIRCH algorithm, and BIRCH was found to do a reasonable quick-and-dirty job.

Ashish Singhal and Dale E. Seborg in [9] calculated the degree of similarity between multivariate time series datasets using two similarity factors with a batch fermentation algorithm. One similarity factor is based on principal component analysis and the angles between the principal component subspaces. The second similarity factor is based on the Mahalanobis distance between the datasets. Batch fermentation algorithms are used to compare the product quality data for different datasets. The advantage of these similarity factors with batch fermentation is that they are very effective in clustering multivariate time series datasets and perform better than existing methodologies.
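A similarity factor built on the angles between principal component subspaces can be sketched as follows. This uses one common formulation (the average squared cosine between the leading principal directions of two datasets) and is an assumption for illustration, not necessarily the exact factor used in [9]; the function names, the choice of `k` and the synthetic data are hypothetical.

```python
import numpy as np

def principal_directions(data, k):
    """First k principal directions (as columns) of a samples x variables array."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def pca_similarity(data_a, data_b, k=2):
    """Average squared cosine of the angles between two k-dimensional PCA subspaces."""
    la = principal_directions(data_a, k)
    lb = principal_directions(data_b, k)
    return float(np.trace(la.T @ lb @ lb.T @ la) / k)

# Synthetic multivariate series: a and b vary mostly along the first two
# variables, c varies mostly along the last two.
rng = np.random.default_rng(0)
scale = np.array([5.0, 3.0, 1.0, 0.5])
a = rng.normal(size=(200, 4)) * scale
b = rng.normal(size=(200, 4)) * scale
c = rng.normal(size=(200, 4)) * scale[::-1]

print(pca_similarity(a, b))  # close to 1: similar subspaces
print(pca_similarity(a, c))  # close to 0: nearly orthogonal subspaces
```

The factor lies between 0 and 1, so `1 - pca_similarity(...)` could serve as a dissimilarity for a standard clustering routine.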
The method provides the best clustering performance, the results are very close to each other, and the clustering performance is sensitive.

Hui Zhang et al. in [11] put forth an unsupervised feature extraction algorithm that uses the orthogonal wavelet transform to choose the dimensionality of the features automatically. The problem of determining the feature dimensionality is circumvented by selecting the appropriate scale of the wavelet transform. When the dimensionality is reduced, information may be lost. The feature extraction algorithm balances lower dimensionality against lower error through its choice of scale. The major advantage of this approach is that the feature dimensionality is chosen automatically. On average, for the data sets used, the clustering quality with the extracted features is better than that obtained with features corresponding to the preceding and following scales.

Bagnall et al. in [10] explained a technique for assessing the effects of transforming data into binary sequences that indicate whether each value is above or below the median; this process is known as clipping. For long time series, the clustering accuracy when using clipped data from the class of ARMA models is not significantly different from that achieved with unclipped data. Clipped data produce better clusters when the data contain outliers, and using clipped data requires less memory and fewer operations. Distance calculations can be much faster, and autocorrelations are faster to calculate with clipped data. Clustering with clipped data gives good results. However, when the data sets are massive, the execution speed of the clustering algorithm is reduced.

Ernst et al. in [12] described an algorithm for clustering short time series gene expression data. Most clustering algorithms are unable to distinguish between real and random patterns. They presented an algorithm specifically designed for clustering short time series expression data.
Their algorithm works by assigning genes to a predefined set of model profiles that capture the potential distinct patterns that can be expected from the experiment. They also discussed how to obtain such a set of profiles and how to determine the significance of each of these profiles. Significant profiles are retained for further analysis and can be combined to form clusters. They tested their method on both simulated and real biological data. Using immune response data, they showed that their algorithm can correctly detect the temporal profile of relevant functional categories. Using Gene Ontology analysis, the results showed that their algorithm outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data.

A new clustering method for time series data streams was proposed by Li et al. in [14]. Clustering streaming time series is a difficult problem. The majority of traditional algorithms are too inefficient for large amounts of data and the outliers in them. In their paper, they proposed a new clustering method that clusters bi-clipped (CBC) stream data. It consists of three phases: dimensionality reduction through piecewise aggregate approximation (PAA), a bi-clipping process that clips the real-valued series by bisecting the value field, and clustering. Through related experiments, they found that CBC obtains higher quality solutions in less time than both the M-clipped method, which clips the real-valued series at its mean, and unclipped methods. This difference is especially pronounced when the streaming time series contain outliers.

A clustering algorithm for time series data was put forth by Jian et al. in [15]. In Intelligent Traffic Systems, the analysis of time series of traffic flow is a significant research topic. Using clustering methods to investigate time series can not only find typical patterns of traffic flow but also group sections of highway by their flow characteristics.
In their paper, they proposed an Encoded-Bitmap-approach-based swap method to improve the classic hierarchical method. Their experimental results showed that the proposed method performs better on the change trend of time series than the classic algorithm.

Beringer et al. in [16] put forth a clustering algorithm for parallel data streams. In recent years, the management and processing of so-called data streams has become a subject of active research in several fields of computer science, such as distributed systems, database systems and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. In their paper, they considered the problem of clustering parallel streams of real-valued data, that is, continuously evolving time series. In other words, they are interested in grouping data streams whose evolution over time is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating no more than a constant time delay. For this purpose, they developed an efficient online version of the classical K-means clustering algorithm. Their method's efficiency is mainly due to a scalable online transformation of the original data, which allows fast computation of approximate distances between streams.

Characteristic-based clustering of time series data was described by Wang et al. in [17]. Their paper proposed a method for clustering time series based on their structural characteristics. Unlike other alternatives, their method does not cluster point values using a distance metric; rather, it clusters based on global features extracted from the time series.
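The idea of clustering on global features rather than on point values can be sketched as below. This is a simplified illustration under stated assumptions, not the feature set or implementation of [17]: the helper `global_features` is hypothetical and computes only four easily defined measures (linear trend, skewness, excess kurtosis and lag-1 serial correlation).

```python
import numpy as np

def global_features(series):
    """Map a series of any length to a fixed-length vector of global measures.

    Hypothetical helper; assumes the series is not constant.
    """
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]             # linear trend
    z = (x - x.mean()) / x.std()               # standardized values
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4) - 3.0           # excess kurtosis
    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]    # serial correlation at lag 1
    return np.array([slope, skewness, kurtosis, lag1])

# Two series of different lengths end up as rows of the same width.
ramp = np.arange(50.0)
wave = np.sin(np.linspace(0.0, 8.0 * np.pi, 80))
feature_matrix = np.vstack([global_features(ramp), global_features(wave)])
print(feature_matrix.shape)  # (2, 4)
```

Because every series becomes a vector of the same length, the rows of `feature_matrix` could be passed to any standard clustering routine regardless of the original series lengths.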
The feature measures are obtained from each individual series and can be fed into arbitrary clustering algorithms, including an unsupervised neural network algorithm (the self-organizing map) or a hierarchical clustering algorithm. Global measures describing the time series are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to missing or noisy data. They further provide a search mechanism to find the best selection from the feature set to use as the clustering inputs. Their technique has been tested using benchmark time series datasets previously reported for time series clustering and a set of time series datasets with known characteristics. The empirical results show that their approach is able to yield meaningful clusters. The resulting clusters are similar to those produced by other methods, but with some promising and interesting variations that can be intuitively explained with knowledge of the global characteristics of the time series.

Hirano et al. in [18] proposed an algorithm for clustering time series medical data. Their paper presents a cluster analysis method for multidimensional time series data on clinical laboratory examinations. Their method represents the time series of test results as trajectories in multidimensional space and compares their structural similarity using the multiscale comparison technique. It makes it possible to find the part-to-part correspondences between two trajectories, taking into account the relationships between different tests. The resultant dissimilarity can then be used with clustering algorithms to find groups of similar cases. The method was applied to the cluster analysis of Albumin-Platelet data in the chronic hepatitis dataset.
The experimental results demonstrated that it could form interesting groups of cases that correspond closely to the fibrotic stages.

Clustering of clipped time series data was proposed by Bagnall et al. in [19]. They showed that the simple procedure of clipping the time series reduces memory requirements and considerably speeds up clustering without decreasing clustering accuracy. They also demonstrated that clipping increases clustering accuracy when there are outliers in the data, thus serving as a means of outlier detection and a method of identifying model misspecification. They considered simulated data from polynomial, autoregressive moving average and hidden
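Clipping, as described earlier in this section, replaces each value by a bit that records whether it lies above or below the median. The sketch below illustrates why this can help with outliers. It is a minimal example, not the procedure of [10] or [19]; the helper names and the toy series are assumptions.

```python
import numpy as np

def clip_series(series):
    """Return 1 where a value is above the series median, else 0."""
    x = np.asarray(series, dtype=float)
    return (x > np.median(x)).astype(np.uint8)

def hamming_distance(bits_a, bits_b):
    """Number of positions at which two clipped series differ."""
    return int(np.count_nonzero(bits_a != bits_b))

clean = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
spiky = clean.copy()
spiky[3] = 500.0                               # inject a single outlier

print(clip_series(clean))                      # [0 0 0 1 1 1]
print(hamming_distance(clip_series(clean), clip_series(spiky)))  # 0
print(np.linalg.norm(clean - spiky))           # 495.0
```

The outlier dominates the Euclidean distance between the raw series but leaves the clipped representation unchanged, and each clipped value needs only one bit of storage.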