
The objective of this research is to design algorithms to solve each of the three problems of
cluster analysis, viz. cluster tendency assessment, clustering, and cluster validity, for large volumes
of high-dimensional data, including streaming data.
• Study the various visual clustering approaches for Big Data and identify gaps in the
Social Big Data domain.
• Demonstrate the efficiency of the proposed model by comparing it with existing
models on predefined parameters for significant social data clustering.
• Evaluate clustering algorithms to measure the effectiveness, efficiency and quality of data
stream clustering algorithms.
• Develop and deploy time-efficient distributed hybrid models of visual approaches for
faster significant social data clustering.
• Explore visualizing streaming social data for analytics using effective
dimensionality reduction, and propose an adaptive model for clustering data streams.
Please explain each objective in more detail, for my understanding, specifically the ones
in bold.
Study the various visual clustering approaches for Big Data and identify gaps in the Social Big
Data domain
Visual clustering approaches:
1. Clustering algorithms have emerged as a powerful alternative meta-learning tool to
accurately analyze the massive volume of data generated by modern applications.
2. Their main goal is to categorize data into clusters such that objects are grouped in the same
cluster when they are similar according to specific metrics.
3. There is a vast body of knowledge on clustering, and there have been attempts to analyze and
categorize clustering algorithms for a large number of applications.
4. However, one of the major issues in using clustering algorithms for big data that causes
confusion amongst practitioners is the lack of consensus in the definition of their properties
as well as a lack of formal categorization.
5. With the intention of alleviating these problems, this paper introduces concepts and
algorithms related to clustering, gives a concise survey of existing clustering algorithms, and
provides a comparison from both a theoretical and an empirical perspective.
6. From a theoretical perspective, we developed a categorizing framework based on the main
properties pointed out in previous studies.
7. Empirically, we conducted extensive experiments where we compared the most
representative algorithm from each of the categories using many real (big) data sets.
8. The effectiveness of the candidate clustering algorithms is measured through several
internal and external validity metrics, stability, runtime, and scalability tests. In addition,
we highlighted the set of clustering algorithms that are the best performing for big data.
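To make the idea of an internal validity metric in point 8 concrete, here is a minimal sketch of the mean silhouette coefficient in plain Python. The text does not say which internal metrics were used, so the choice of silhouette, the Euclidean distance, and the function name silhouette_score are assumptions made for illustration only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Illustrative sketch only: the metric and the name are assumptions,
    not taken from the text. Runs in O(n^2), so it suits small samples.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Values close to 1 suggest compact, well-separated clusters, while negative values suggest that many points sit closer to another cluster than to their own. No ground-truth labels are needed, which is what makes the metric internal.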
Develop and deploy time-efficient distributed hybrid models of visual approaches for faster
significant social data clustering
1. Because clustering algorithms involve several parameters, often operate in
high-dimensional spaces, and must cope with noisy, incomplete and sampled data, their
performance can vary substantially for different applications and types of data.
2. For such reasons, several different approaches to clustering have been proposed in
the literature.
3. In practice, it becomes a difficult endeavor, given a dataset or problem, to choose a
suitable clustering approach. Nevertheless, much can be learned by comparing
different clustering methods.
4. Several previous efforts for comparing clustering algorithms have been reported in
the literature. Here, we focus on generating a diversified and comprehensive set of
artificial, normally distributed data containing not only distinct numbers of classes,
features and objects and distinct separations between classes, but also a varied
structure of the involved groups (e.g. possessing predefined correlation
distributions between features).
5. The purpose of using artificial data is the possibility to obtain an unlimited number
of samples and to systematically change any of the properties of a dataset. Such
features allow the clustering algorithms to be comprehensively and strictly evaluated
in a vast number of circumstances, and grant the possibility of quantifying the
sensitivity of the performance with respect to small changes in the data. It should
be observed, nevertheless, that the performance results reported in this work are
therefore specific and limited to normally distributed data, and other results could
be expected for other types of data following other statistical behavior. Here we
associate performance with the similarity between the known labels of the objects
and those found by the algorithm.
6. Many measurements have been defined for quantifying such similarity; we
compare the Jaccard index, Adjusted Rand index, Fowlkes-Mallows index, and
Normalized mutual information. A modified version of the procedure developed by
was used to create 400 distinct datasets, which were used to quantify the
performance of the clustering algorithms. We describe the adopted procedure and
the respective parameters used for data generation. Related approaches include.
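The similarity between known labels and the labels found by an algorithm (points 5 and 6) can be sketched with pair counting, which underlies three of the four measures named above. The following is a minimal plain-Python sketch; the function names are hypothetical, and Normalized mutual information is left out because it is entropy-based rather than pair-based.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count object pairs by whether they share a cluster in each labeling."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in the predicted labeling
        elif same_true:
            fn += 1  # together only in the known labeling
        else:
            tn += 1  # apart in both labelings
    return tp, fp, fn, tn


def jaccard_index(true_labels, pred_labels):
    tp, fp, fn, _ = pair_counts(true_labels, pred_labels)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def fowlkes_mallows(true_labels, pred_labels):
    tp, fp, fn, _ = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def adjusted_rand_index(true_labels, pred_labels):
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    n_pairs = tp + fp + fn + tn
    if n_pairs == 0:
        return 1.0  # fewer than two objects: nothing to disagree on
    expected = (tp + fp) * (tp + fn) / n_pairs  # agreement expected by chance
    max_index = ((tp + fp) + (tp + fn)) / 2
    if max_index == expected:
        return 1.0
    return (tp - expected) / (max_index - expected)
```

All three measures depend only on which objects are grouped together, so renaming the clusters does not change the score. The pair loop is quadratic in the number of objects; for large datasets a contingency-table formulation would be the usual choice.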
Explore Visualizing Streaming social data for Analytics using effective Dimensionality
Reduction and proposing an adaptive model for Clustering Data Streams
1. As data gathering grows easier, and as researchers discover new ways to interpret data,
streaming-data algorithms have become essential in many fields.
2. Data stream computation precludes algorithms that require random access or large
memory.
3. We consider the problem of clustering data streams, which is important in the analysis of a
variety of sources of data streams, such as routing data, telephone records, web documents,
and clickstreams.
4. We provide a new clustering algorithm with theoretical guarantees on its performance.
5. We give empirical evidence of its superiority over the commonly used k-Means algorithm.
6. We then adapt our algorithm to be able to operate on data streams and experimentally
demonstrate its superior performance in this context.
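The constraint in point 2, no random access and little memory, can be illustrated with a one-pass clusterer. The sketch below is sequential (online) k-means, a simple and well-known baseline; it is not the algorithm with theoretical guarantees referred to above, whose details are not given in this text. The class name SequentialKMeans and the choice to seed the centers with the first k points are assumptions for illustration.

```python
import math


class SequentialKMeans:
    """One-pass k-means: each point is seen once and then discarded.

    Illustrative baseline only, not the algorithm described in the text.
    Memory use is O(k * d): k centers plus one counter per center.
    """

    def __init__(self, k):
        self.k = k
        self.centers = []  # running mean of each cluster
        self.counts = []   # number of points assigned to each cluster

    def update(self, point):
        """Assign one streamed point and return its cluster index."""
        point = [float(x) for x in point]
        # Assumption: seed the centers with the first k points of the stream.
        if len(self.centers) < self.k:
            self.centers.append(point)
            self.counts.append(1)
            return len(self.centers) - 1
        nearest = min(
            range(self.k), key=lambda c: math.dist(self.centers[c], point)
        )
        self.counts[nearest] += 1
        rate = 1.0 / self.counts[nearest]
        center = self.centers[nearest]
        # Move the center toward the point so that it stays the running mean.
        for d, x in enumerate(point):
            center[d] += rate * (x - center[d])
        return nearest
```

A stream is processed by calling update once per arriving point, for example in a loop over a socket or log reader. Because the result depends on arrival order and on the first k points, this sketch gives no quality guarantee, which is the kind of weakness that stream clustering algorithms with provable bounds are designed to address.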
