

ScienceDirect

Procedia Computer Science 107 (2017) 442 – 447

Based on Spark

Rui Liu (a), Xiaoge Li (a, *), Liping Du (a), Shuting Zhi (a), Mian Wei (b)

(a) School of Computing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
(b) Tulane University, New Orleans, LA 70118, USA

* Corresponding author: lixg@xupt.edu.cn; Tel.: 15055114114

Abstract

Clustering algorithms are widely used in data mining. They attempt to partition elements into several clusters such that elements in the same cluster are similar to each other, while elements belonging to different clusters are dissimilar. The recently published density peaks clustering algorithm overcomes a disadvantage of distance-based algorithms, which can only find clusters of nearly circular shape: it can discover clusters of arbitrary shape and is insensitive to noise. However, it needs to calculate the distances between all pairs of data points and therefore does not scale to big data. To reduce the computational cost of the algorithm, we propose an efficient distributed density peaks clustering algorithm based on Spark's GraphX. This paper demonstrates the effectiveness of the method on two different data sets. The experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system.

1. Introduction

Clustering analysis is an important technique in machine learning and data mining. Clustering analysis [1] divides elements into several clusters such that elements in the same cluster are similar to each other, while elements belonging to different clusters are dissimilar. At present there are many clustering algorithms, such as partition-based methods (e.g. k-medoids [2], k-means [3]), hierarchical methods (e.g. Agglomerative Nesting (AGNES) [4]), density-based methods (e.g. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [5]), grid-based methods (e.g. the Grid-Clustering algorithm for High-dimensional very Large spatial databases (GCHL) [6]) and probability-model-based methods. In 2014, a paper on the density peaks clustering algorithm was published in Science [7]. The core idea of the algorithm is that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities [7].

1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the scientific committee of the 7th International Congress of Information and Communication Technology.
doi:10.1016/j.procs.2017.03.138

In this paper, we present a parallel implementation of the density peaks clustering algorithm using GraphX on Spark. We study the effectiveness of the method and evaluate the running time under different numbers of nodes with a fixed amount of data, and under different amounts of data with a fixed number of nodes. Finally, we compare the running time of the Spark and MapReduce implementations.

The rest of this paper is organized as follows. In Section 2, we review the density peaks clustering algorithm and the Spark RDD model. In Section 3, we introduce our parallel density peaks clustering system based on Spark. Section 4 provides the details of our experiments and analyzes the results in depth. Finally, in the Conclusions we summarize our contribution and indicate directions for future research.

2. Related works

This section reviews the density peaks clustering algorithm and introduces the Spark RDD model.

The kernel of the density peaks clustering algorithm is the computation of two values for each point i: the local density ρ_i and the distance δ_i from points of higher density. For point i, the local density ρ_i is defined as:

ρ_i = Σ_j χ(d_ij − d_c)    (1)

where χ(x) = 0 if x ≥ 0 and χ(x) = 1 otherwise, d_ij is the distance between point i and point j, and d_c is a cutoff distance. Basically, ρ_i is equal to the number of points that are closer to point i than d_c. Remarkably, the algorithm is robust with respect to the choice of d_c for large data sets, and it is sensitive only to the relative magnitude of ρ_i across different points.

δ_i is the minimum distance between point i and any other point with higher density:

δ_i = min_{j: ρ_j > ρ_i} d_ij    (2)

For the point i with the highest density, we take δ_i = max_j (d_ij). δ_i is much larger than the typical nearest-neighbor distance only for points that are global or local maxima of the density. Therefore, cluster centers are recognized as points for which the value of δ_i is anomalously large.
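On a single machine, formulas (1) and (2) can be sketched directly in plain Python (a minimal sketch with hypothetical toy coordinates, not the distributed implementation described later):

```python
from math import hypot

def density_peaks_stats(points, d_c):
    """Compute rho_i (formula 1) and delta_i (formula 2) for 2-D points."""
    n = len(points)
    dist = [[hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
             for j in range(n)] for i in range(n)]
    # rho_i: number of points closer to point i than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # delta_i: minimum distance to any point of higher density;
    # the globally densest point instead takes its maximum distance
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Two star-shaped toy clusters plus one isolated point (hypothetical data)
pts = [(0, 0), (0, 0.9), (0.9, 0), (0, -0.9), (-0.9, 0),
       (10, 10), (10, 10.9), (10.9, 10), (10, 9.1), (50, 50)]
rho, delta = density_peaks_stats(pts, d_c=1.0)
```

The two cluster centers (0, 0) and (10, 10) come out with the highest ρ and an anomalously large δ, while the isolated point (50, 50) has ρ = 0 but a large δ, matching the characterization of centers and outliers above.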

Fig. 1 Point distribution. Fig. 2 Decision graph for the data in Fig. 1.

For each point i, ρ_i and δ_i can be plotted in a two-dimensional decision graph. For example, Fig. 1 shows 28 points embedded in a two-dimensional space; points 1 and 10 are the density maxima, i.e. the cluster centers. Fig. 2 shows ρ_i and δ_i for each point i in a decision graph. The values of δ_9 and δ_10 are very different, while the values of ρ_9 and ρ_10 are very similar: point 9 belongs to the cluster of point 1, whereas point 10 is the center of another cluster. Hence, the only points that are cluster centers are those with a high δ and a relatively high ρ. Points 26, 27, and 28 are isolated, because they have a relatively high δ but a low ρ.

Spark is a fast and general engine for large-scale data processing. All operations in Spark are based on resilient distributed datasets (RDDs), a fault-tolerant, parallel data structure. RDDs also offer a rich set of operations for working with data sets. In general, there are several common models for data processing: iterative algorithms, relational queries, MapReduce, and stream processing. For example, Hadoop MapReduce is based on the MapReduce model and Storm is based on the stream processing model. RDDs combine these four models, so Spark can be applied to a wide variety of big data processing tasks.

RDDs support two characteristics, persistence and partitioning, which users can control with the persist and partitionBy functions. The partitioning characteristic and the parallel computing capability of RDDs allow Spark to make better use of scalable hardware resources. Combining partitioning and persistence makes processing massive data more efficient.

RDDs have two types of operations: transformations and actions. No matter how many times a transformation is invoked, the RDD is not actually computed; computation is only triggered when an action is performed. In the internal implementation of RDDs, the underlying interface is based on iterators, which makes data access more efficient and avoids the memory consumption of materializing a large number of intermediate results.
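The lazy-transformation versus eager-action behavior can be illustrated with a plain-Python generator analogy (only an analogy for the evaluation model, not Spark code):

```python
calls = []

def doubled(data):
    # plays the role of a transformation: the body runs only on demand
    for x in data:
        calls.append(x)
        yield x * 2

pipeline = doubled(range(3))  # "transformation" declared: nothing executed yet
before = list(calls)          # still empty at this point
result = list(pipeline)       # "action": consuming the iterator triggers the work
```

Until the final `list(...)` call, no element has been touched, mirroring how a chain of RDD transformations does no work until an action runs.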

Fig. 3 outlines the architecture of our parallel density peaks clustering system.

Firstly, Spark is initialized; this includes setting the threshold for the local density ρ and the threshold for the distance δ from points of higher density. Secondly, vertex and edge data stored on HDFS are imported into a vertex RDD and an edge RDD separately, and the distance for each edge is computed. Thirdly, the vertex RDD and the edge RDD are combined to form a graph in GraphX [8], after which the cutoff distance d_c is calculated from the generated graph. Then, for each point i, the local density ρ_i and the distance δ_i from points of higher density are computed. Lastly, clustering is performed according to the local density ρ_i and the distance δ_i.

Building the graph consists of three steps. Firstly, vertex and edge data stored on HDFS or another file system are imported into a vertex RDD and an edge RDD separately, and the initial value of each edge is set to a constant. Secondly, the distance for each edge is computed with a distance measure, and the value of each edge is updated with this distance. Lastly, the vertex RDD and the edge RDD are combined to form a graph in GraphX. For example, suppose there is a vertex set {1, 2, 3, 4, 5} and an edge set {(1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)}. When the vertex set and the edge set are imported, the initial value of each edge is set to 1, as shown in Fig. 4. When the distance of each edge is computed, the value of each edge is updated with the distance, as shown in Fig. 5.
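These two edge-update steps can be traced in plain Python (the 2-D coordinates for the five vertices are hypothetical; the real system does this with vertex and edge RDDs):

```python
from math import hypot

# Hypothetical 2-D coordinates for the five example vertices
coords = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0),
          4: (4.0, 4.0), 5: (4.0, 5.0)}
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3),
         (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

# Step 1: import vertices and edges; every edge starts with the constant 1
edge_value = {e: 1.0 for e in edges}
# Step 2: recompute each edge value as the Euclidean distance of its endpoints
for u, v in edges:
    (x1, y1), (x2, y2) = coords[u], coords[v]
    edge_value[(u, v)] = hypot(x1 - x2, y1 - y2)
```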

To reduce the computation load, the cutoff distance d_c is calculated before the local density ρ_i. According to reference 1, the cutoff distance is selected at 98%~99%.
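One plausible reading of this 98%~99% selection (an assumption on our part, since the text gives no formula) is to sort all pairwise distances in descending order and take the value at that position, so that on average only a small fraction of the points fall within d_c of each point:

```python
def cutoff_distance(pair_dists, percent=98.5):
    """One possible reading of the 98%~99% rule: sort all pairwise distances
    in descending order and take the value at that position, leaving roughly
    1%~2% of the distances below d_c."""
    s = sorted(pair_dists, reverse=True)
    idx = min(len(s) - 1, int(len(s) * percent / 100))
    return s[idx]
```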

The local density ρ_i of each vertex is then calculated according to formula (1). Fig. 6 shows the local density ρ_i of each vertex in Fig. 5; for example, the local density ρ_1 of vertex 1 is 4.

Fig. 6 The local density. Fig. 7 The local density and the distance from points of higher density.

The distance δ_i from points of higher density is calculated for each vertex according to formula (2). The method of calculating δ_i with GraphX on Spark is as follows: firstly, for each edge, if the local density ρ_source of the source vertex is less than the local density ρ_target of the target vertex, a message is sent to the source vertex; otherwise a message is sent to the target vertex. Secondly, all messages received by each vertex are merged. Lastly, each vertex finds the minimum edge length among all its messages, and the δ_i of the vertex is set to this minimum edge length. Fig. 7 shows ρ_i and δ_i for each vertex in Fig. 6.
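The three message-passing steps can be simulated on one machine as follows (a sketch of the logic only; the ρ values and edge lengths are hypothetical inputs, and GraphX would do this with its message-aggregation primitives):

```python
def delta_by_messages(rho, edge_len):
    """Simulate the GraphX steps: each edge sends its length to whichever
    endpoint has the lower local density (ties go to the target, as in the
    text), and every vertex keeps the minimum of the lengths it received."""
    inbox = {v: [] for v in rho}
    incident = {v: [] for v in rho}
    for (u, v), d in edge_len.items():
        incident[u].append(d)
        incident[v].append(d)
        if rho[u] < rho[v]:
            inbox[u].append(d)   # message to the source vertex
        else:
            inbox[v].append(d)   # message to the target vertex
    # a vertex that received no message is the density maximum: it takes
    # the maximum length among its own edges (the special case of formula 2)
    return {v: (min(msgs) if msgs else max(incident[v]))
            for v, msgs in inbox.items()}

rho = {1: 3, 2: 2, 3: 1}                        # hypothetical local densities
edge_len = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.5}
delta = delta_by_messages(rho, edge_len)
```

Here vertex 3 receives the lengths 2.0 and 1.5 from its two higher-density neighbors and keeps the minimum, while vertex 1, the density maximum, falls back to its largest incident edge.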

3.5. Clustering

The clustering process is divided into three steps: selecting cluster centers, selecting isolated points, and classification. Firstly, the points whose ρ_i is greater than the threshold ρ and whose δ_i is greater than the threshold δ are selected as cluster centers. Secondly, the points whose ρ_i is less than the threshold ρ but whose δ_i is greater than the threshold δ are selected as isolated points. Lastly, all other points are assigned to the nearest cluster center.
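A single-machine sketch of the three steps (the threshold values, coordinates, ρ and δ values below are all hypothetical):

```python
from math import hypot

def classify(coords, rho, delta, rho_t, delta_t):
    """Centers: rho and delta both above threshold. Isolated: low rho but
    high delta. Every remaining point joins the nearest cluster center."""
    centers = [i for i in coords if rho[i] > rho_t and delta[i] > delta_t]
    isolated = [i for i in coords if rho[i] < rho_t and delta[i] > delta_t]
    def dist(a, b):
        return hypot(coords[a][0] - coords[b][0], coords[a][1] - coords[b][1])
    label = {c: c for c in centers}
    for i in coords:
        if i not in label and i not in isolated:
            label[i] = min(centers, key=lambda c: dist(i, c))
    return centers, isolated, label

coords = {'a': (0, 0), 'b': (0, 1), 'c': (10, 10), 'd': (10, 11), 'e': (50, 50)}
rho = {'a': 5, 'b': 3, 'c': 4, 'd': 2, 'e': 1}       # hypothetical densities
delta = {'a': 70, 'b': 1, 'c': 14, 'd': 1, 'e': 40}  # hypothetical distances
centers, isolated, label = classify(coords, rho, delta, rho_t=1.5, delta_t=5)
```

Points 'a' and 'c' are picked as centers (high ρ and high δ), 'e' is isolated (low ρ, high δ), and 'b' and 'd' are assigned to their nearest centers.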

4. Experiment

The Spark cluster includes one master and six slaves. Table 1 describes the hardware configuration of the cluster.

Table 1. Hardware configuration of the cluster.

Machine name     Role     Memory   CPU
Hadoop-server    Master   16 GB    2 cores
Hadoop1~6        Slave    32 GB    2 cores

To conduct the empirical evaluation of the parallel density peaks clustering algorithm, two separate data sets of different sizes are used. The first data set is provided by reference 7, and the second is news-domain text data downloaded from DataTang(1). The news data set contains 10 topics and 47,956 texts. We preprocess the text data using a Chinese word segmentation system [9].

Fig. 8 The running time of Spark and MapReduce. Fig. 9 Decision graph.

Firstly, on the first data set, used in reference 7, our experimental result is consistent with the result of reference 7, which validates our system. Fig. 8 compares the running time of Spark and MapReduce on the first data set: the running time of Spark is almost 1/10 that of MapReduce.

(1) http://www.datatang.com/data/43922.


Fig. 9 shows the results of our system on the second data set. When the local density threshold ρ is 3600 and the distance threshold δ is 11, the number of cluster centers is 10, which is consistent with the second data set containing 10 topics.

Fig. 10 shows the trend of running time for different numbers of nodes with a fixed amount of data: running time is longest with a single node and decreases as the number of nodes increases. Fig. 11 shows the trend of running time for different amounts of data with a fixed number of nodes: running time increases almost linearly. These trends show that our system has good expansibility and scalability.

Fig. 10 The trend of running time under different numbers of nodes. Fig. 11 The trend of running time under different amounts of data.

Conclusions

To reduce the high computational cost of the density peaks clustering algorithm, we propose an efficient distributed density peaks clustering algorithm using GraphX on Spark. This paper demonstrates the effectiveness of the method on two different data sets, and the experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system. Future work is to study a method for adaptive thresholds, instead of setting fixed thresholds ρ and δ during Spark initialization.

Acknowledgements

This work is supported by the Shaanxi science and technology innovation project foundation (2016PTJS3-02 and 2016PTJS3-05).

References

1. Xu R, Wunsch D II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
2. Kaufman L, Rousseeuw P. Clustering by means of medoids. In: Statistical Data Analysis Based on the L1 Norm and Related Methods. Amsterdam: North-Holland, 1987: 405-416.
3. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proc of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1967: 281-297.
4. Huang X, Liu X, Cao B, Tang M, Liu J. MSCA: mashup service clustering approach integrating K-Means and AGNES algorithms. Journal of Chinese Computer Systems, 2015, 36(11): 2492-2497.
5. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996: 226-231.
6. Wang M, Yuan S, Zhu Y, Wang D. Real-time clustering for massive data using Storm. Journal of Chinese Computer Applications, 2014, 34(11): 3078-3081.
7. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science, 2014, 344(6191): 1492-1496.
8. Jacobs S A, Dagnino A. Large-scale industrial alarm reduction and critical events mining using graph analytics on Spark. In: 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016.
9. Du L, Li X, Yu G, Liu C, Liu R. New word detection based on an improved PMI algorithm for enhancing Chinese segmentation system. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 35-40.
