
Distributed GraphLab

A Framework for Machine Learning and Data Mining in the Cloud [12]

Kashif Rabbani
ABSTRACT

To ensure efficient execution of many machine learning and data mining (MLDM) algorithms, the GraphLab abstraction was introduced in 2012. It filled a critical void in the high-level data-parallel frameworks of that era. GraphLab provides asynchronous, dynamic, graph-parallel computation with data consistency and a high degree of parallelism in a shared-memory setting. The new distributed GraphLab abstraction reduces network congestion to a satisfactory level and adds fault tolerance. The abstraction is evaluated on Amazon EC2 servers, where performance gains of one to two orders of magnitude over Hadoop-based implementations have been observed.

1 INTRODUCTION

This report summarizes the extended multi-core GraphLab abstraction introduced in [12], along with critiques, the current state of the art, and related work. Systems capable of executing MLDM algorithms efficiently and in parallel on large distributed clusters were badly needed to cope with the challenges of MLDM problems and techniques. MapReduce [6][8], Dryad [13] and Pregel [9] were the only existing high-level distributed systems at that time. However, these systems could not fulfill the needs of the MLDM community, such as fully enabling cloud services in a distributed setting.

Therefore, the MLDM community came up with the high-level GraphLab abstraction, which targets the dynamic, asynchronous and graph-parallel computation found in many MLDM algorithms in a shared-memory setting, while hiding the underlying complexities of distributed system design.

To implement an efficient distributed execution model while preserving strict consistency requirements, several techniques are incorporated in the GraphLab abstraction: data versioning (to reduce network congestion), pipelined distributed locking (to mitigate the effects of network latency) and the atom graph (to address the challenges of data locality). Fault tolerance is added using the classic Chandy-Lamport [7] snapshot algorithm.

2 MLDM ALGORITHM PROPERTIES

The GraphLab abstraction addresses the following key properties of efficient, parallel MLDM systems.

2.1 Graph Structured Computation

Computation sometimes requires modelling dependencies between data; doing so allows more signal to be extracted from noisy data. Graph-parallel abstractions such as Pregel [9] and GraphLab [12] naturally express computational dependencies by adopting a vertex-centric model, i.e. computation runs on each vertex as a kernel. In Pregel, vertices communicate via messages following the bulk synchronous model. In GraphLab, each vertex can read and write data on adjacent vertices and edges based on a sequential shared-memory abstraction. GraphLab therefore lets the user focus on the sequential computation rather than on the parallel movement of data (i.e. messaging).

2.2 Asynchronous Iterative Computation

Asynchronous systems are more beneficial for many MLDM algorithms; for example, linear systems converge faster when solved asynchronously [5]. Asynchronous systems update parameters using the most recent parameter values as input. The accelerated convergence of PageRank is shown in Fig. 1(a). Other data-parallel abstractions extended to the iterative setting did not support asynchronous computation. Therefore, the GraphLab abstraction was designed to express the asynchronous iterative behavior of advanced MLDM algorithms efficiently.

2.3 Dynamic Computation

Iterative computation converges asymmetrically in many MLDM algorithms, e.g. in parameter optimization and PageRank. Fig. 1(b) shows that in PageRank only 3% of the vertices required more than 10 updates; the rest converged after a single update. Dynamic computation enables prioritization of computation across vertex-kernels to accelerate the convergence of MLDM algorithms. Fig. 1(c) shows the accelerated convergence of a popular MLDM algorithm, Loopy Belief Propagation, using dynamic computation for web spam detection.

GraphLab and Pregel both support dynamic computation, but only GraphLab permits prioritization as well as a pull-based model that retrieves information from adjacent vertices. Note that in this GraphLab abstraction, some of the original GraphLab scheduling requirements are relaxed to enable efficient distributed FIFO and priority scheduling.
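To make Sections 2.2 and 2.3 concrete, the following minimal Python sketch (written for this report, not taken from [12]) runs PageRank with in-place asynchronous updates and a priority queue keyed by the last observed change, so that only vertices affected by a significant change are rescheduled. The graph encoding as a dict of out-neighbour lists is an assumption for illustration.

import heapq

def dynamic_pagerank(out_nbrs, d=0.85, tol=1e-6):
    """Asynchronous, dynamically scheduled PageRank on a small in-memory graph.
    out_nbrs maps every vertex to the list of its out-neighbours."""
    in_nbrs = {v: [] for v in out_nbrs}
    for u, nbrs in out_nbrs.items():
        for v in nbrs:
            in_nbrs[v].append(u)
    rank = {v: 1.0 for v in out_nbrs}
    heap = [(-1.0, v) for v in out_nbrs]           # max-priority queue (negated keys)
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        new = (1 - d) + d * sum(rank[w] / len(out_nbrs[w]) for w in in_nbrs[u])
        change = abs(new - rank[u])
        rank[u] = new                              # asynchronous: immediately visible
        if change > tol:                           # dynamic: reschedule only if needed
            for v in out_nbrs[u]:
                heapq.heappush(heap, (-change, v))
    return rank

ranks = dynamic_pagerank({1: [2], 2: [1, 3], 3: [1]})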
2.4 Serializability

It is desirable that every parallel execution have an equivalent serial execution, both for correctness and for faster convergence. GraphLab allows the user to choose the level of consistency required for correctness. The complexity introduced by concurrency is thus eliminated, allowing experts to focus on the algorithm model instead of complex configuration settings. Fig. 1(d) shows the unstable behaviour of the dynamic ALS algorithm under a non-serializable execution, compared with a serializable one, on the Netflix movie recommendation problem.

Figure 1: Convergence and update-count plots referenced in Section 2 (panels a-d).

3 DIST. GRAPHLAB ABSTRACTION

The main components of the GraphLab abstraction are the data graph, which represents the mutable program state; the update function, which represents the user computation and operates on the data graph; and the sync operation, which concurrently maintains global aggregates. To operate on the data graph, update functions transform data within small overlapping contexts called scopes.

Program state is stored as a directed graph called the data graph G = (V, E, D), which manages the user-defined data D. Arbitrary data can be associated with each vertex and edge of the graph. The data is mutable and independent of the edge directions, while the graph structure itself is static and cannot be modified during execution.

Update Function is a stateless method that can modify the data within the scope of a vertex and schedule the future execution of other vertices.

• Input: a vertex v and its scope Sv (the data stored in v and in all adjacent vertices and edges)
• Output: a new version of the data in the scope and a set of vertices T

Update : f(v, Sv) → (Sv, T)

After execution of the update function, the modified data Sv is written back to the data graph. GraphLab gives the user the freedom to read and write data of adjacent vertices and edges in a simple manner, removing the complexity of managing data movement as in message-passing or dataflow models [9][13]. An update function can express adaptive computation efficiently by controlling the scheduling of the returned set of vertices T. This expressiveness of dynamic computation is what differentiates GraphLab from Pregel: in Pregel, update functions are initiated by messages and can only access the data in the message, while GraphLab completely decouples the scheduling of future computation from the data movement and naturally expresses the pull model.
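As an illustration of the f(v, Sv) → (Sv, T) signature, here is a hedged Python sketch of a PageRank update function. The Scope class and its field names are assumptions made for this report, not the API of [12].

class Scope:
    """Stand-in for the scope S_v: the data of v and of its adjacent vertices/edges."""
    def __init__(self, rank, out_degree, in_neighbors, out_neighbors):
        self.rank = rank                    # vertex data, keyed by vertex id
        self.out_degree = out_degree        # out-degree of each in-neighbour
        self.in_neighbors = in_neighbors    # vertices whose data v may read
        self.out_neighbors = out_neighbors  # vertices v may schedule

def pagerank_update(v, scope, d=0.85, tol=1e-3):
    """Recompute v's rank from its in-neighbours, write it back into the scope,
    and return the set T of vertices to schedule next (dynamic computation)."""
    old = scope.rank[v]
    scope.rank[v] = (1 - d) + d * sum(scope.rank[u] / scope.out_degree[u]
                                      for u in scope.in_neighbors)
    return set(scope.out_neighbors) if abs(scope.rank[v] - old) > tol else set()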
Sync Operation & Global Values. Many MLDM algorithms need to maintain a global statistic describing the data stored in the data graph, e.g. statistical inference algorithms track convergence estimators. Therefore, the GraphLab abstraction defines global values that are written by the sync operation and read by update functions. The sync operation contains a finalization phase to support normalization, a common task in MLDM algorithms. To keep the estimates of the global values up to date, the sync operation needs to run continuously in the background, which makes serializability of the sync operation expensive. Therefore, multiple consistency levels are available for update functions, along with a choice of consistent or inconsistent sync operation.
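A minimal sketch of such a global aggregate, assuming the vertex data is a dictionary with a "rank" field (an assumption for illustration): the values are folded into an accumulator and a finalization step turns the sum into a normalized average.

def sync_average_rank(vertex_data):
    """Fold over every vertex's data to build a global value, then apply a
    finalization step (here, normalisation into a mean)."""
    total, count = 0.0, 0
    for data in vertex_data.values():        # fold phase, may run in the background
        total += data["rank"]
        count += 1
    return total / count if count else 0.0   # finalization phase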
3.1 GraphLab Execution Model

The GraphLab execution model is based on single-loop semantics. The input to the model is a data graph G = (V, E, D), an update function and an initial set of vertices T to be executed. Each step removes a vertex from T, updates it and adds the returned vertices to T for future computation. There is no specific order of execution of the vertices; the only constraint is that every scheduled vertex is eventually executed. Since GraphLab allows prioritization, users can also assign priorities to the vertices in T.
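The loop below sketches these semantics in Python. The make_scope helper is a stand-in for the runtime's scope construction and is an assumption of this sketch; a priority queue could replace the FIFO deque when vertex priorities are used.

from collections import deque

def graphlab_loop(update, make_scope, initial_vertices):
    """Single-loop semantics: pop a vertex from T, run the update function on its
    scope, and merge the returned vertices back into T until T is empty."""
    T = deque(initial_vertices)
    pending = set(T)
    while T:
        v = T.popleft()                      # any removal order is permitted
        pending.discard(v)
        for u in update(v, make_scope(v)):   # update returns the next vertices to run
            if u not in pending:             # avoid scheduling duplicates
                pending.add(u)
                T.append(u)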
3.2 Consistency Models

The execution model is translated automatically from a sequential one into a parallel one in which multiple processors run the same loop on the same graph. Simultaneously adding and executing vertices is not risk-free: overlapping computations should not run at the same time. Therefore, to retain the semantics of the original sequential execution, several consistency models are introduced that optimize parallel execution while maintaining serializability. GraphLab's runtime ensures serializability in the three ways described below.

Full Consistency: the scopes of concurrently executing update functions may not overlap, giving full read/write access to the entire scope but limited potential parallelism (concurrently updating vertices must be at least two vertices apart).

Edge Consistency: scopes may overlap slightly; the update function has full read/write access to its vertex and the adjacent edges but only read access to the adjacent vertices. This increases parallelism and is used by many MLDM algorithms, e.g. PageRank.

Vertex Consistency: the update function has write access only to its own vertex and read-only access to adjacent vertices and edges. Under this model, update functions can run simultaneously on all vertices.
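The following small helper, written for this report (the separate treatment of edge data is simplified away), summarizes which vertices one update may write and which it may merely read under each model.

def scope_locks(v, neighbors, model):
    """Return (write_set, read_set) of vertices for one update of v under the
    three consistency models; neighbors maps each vertex to its adjacent vertices."""
    if model == "full":
        return {v} | set(neighbors[v]), set()    # exclusive access to the whole scope
    if model == "edge":
        return {v}, set(neighbors[v])            # write the vertex (and its edges), read neighbours
    if model == "vertex":
        return {v}, set()                        # write only the vertex itself
    raise ValueError("unknown consistency model: " + model)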
4 DISTRIBUTED GRAPHLAB DESIGN

In this section, the shared-memory design of GraphLab is extended to the more challenging distributed setting, and the required techniques are discussed. An overview of the distributed GraphLab design is given in Figure 2.

Figure 2: System Overview

4.1 Distributed Data Graph

The key to an efficient distributed data graph is an appropriate balance between computation, communication and storage. Therefore, a two-phase partitioning process is developed to load-balance the graph across arbitrary cluster sizes. The first phase partitions the graph into k parts, called atoms, where k is greater than the number of machines. An atom is a file containing graph-generating commands, e.g. AddVertex(100,vdata), AddEdge(2 -> 4,edata). An atom also stores information about its ghosts, the set of vertices and edges adjacent to the partition boundary. The atom index file contains the meta-data (connectivity structure and file locations) of the k atoms. The second phase partitions this meta-graph over the physical machines.
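A toy sketch of the two phases, written for this report: hash placement of vertices into atoms, ghost bookkeeping, and a round-robin placement of atoms onto machines (the real system balances atoms using the atom index meta-graph).

def build_atoms(vertices, edges, k):
    """Phase 1: partition vertices into k atoms by hashing and record ghosts,
    i.e. boundary vertices owned by another atom."""
    owner = {v: hash(v) % k for v in vertices}       # hypothetical hash placement
    atoms = [{"vertices": set(), "edges": [], "ghosts": set()} for _ in range(k)]
    for v in vertices:
        atoms[owner[v]]["vertices"].add(v)
    for u, v in edges:
        atoms[owner[u]]["edges"].append((u, v))      # stored as AddEdge-style commands
        for a, b in ((u, v), (v, u)):
            if owner[a] != owner[b]:
                atoms[owner[a]]["ghosts"].add(b)     # replicated boundary vertex
    return atoms

def place_atoms(atoms, machines):
    """Phase 2: spread the k atoms over the physical machines (round-robin here)."""
    placement = {m: [] for m in machines}
    for i in range(len(atoms)):
        placement[machines[i % len(machines)]].append(i)
    return placement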
4.2 Distributed GraphLab Engines

The GraphLab engine is responsible for executing update functions and sync operations. It also maintains the scheduled set of vertices T and ensures serializability with respect to the chosen consistency model. As discussed in Section 3.2, the performance and expressiveness of the execution model depend on how vertices are removed from T. Two engines are introduced to evaluate this trade-off: the Chromatic Engine, which supports a partially asynchronous execution of the set of vertices T, and the Locking Engine, a fully asynchronous engine that supports vertex priorities.

4.2.1 Chromatic Engine. Vertex coloring is a classic technique for achieving a serializable parallel execution of a set of dependent tasks (the vertices of a graph). It requires a full communication barrier between two color steps. Given a vertex coloring of the data graph, the edge consistency model is satisfied by executing all vertices of one color before moving to the next color and running the sync operation between color steps. Changes made to ghost vertices and edges are communicated asynchronously. Vertex consistency is obtained by assigning the same color to all vertices, and full consistency by constructing a second-order vertex coloring (no vertex shares a color with any of its distance-two neighbors). Bipartite (two-color) graphs, which are common in MLDM optimization problems, are an example where this works well.
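A compact sketch of the idea, assuming an undirected adjacency dictionary and an update(v) callback that closes over the graph state (both assumptions of this report): a greedy coloring followed by color-by-color parallel sweeps with a barrier after each color.

from concurrent.futures import ThreadPoolExecutor

def greedy_coloring(neighbors):
    """Assign each vertex the smallest colour not used by an already-coloured neighbour."""
    color = {}
    for v in neighbors:
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def chromatic_sweep(neighbors, update):
    """Run all vertices of one colour in parallel, wait (barrier), then move on.
    A sync operation could be executed between colour steps."""
    color = greedy_coloring(neighbors)
    for c in sorted(set(color.values())):
        batch = [v for v in neighbors if color[v] == c]
        with ThreadPoolExecutor() as pool:    # leaving the block acts as the barrier
            list(pool.map(update, batch))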
4.2.2 Distributed Locking Engine. The chromatic engine does not allow sufficient scheduling flexibility. The distributed locking engine therefore extends mutual-exclusion techniques to overcome this limitation. It associates a reader-writer lock with each vertex, and each machine only updates its local vertices. Different consistency models can be implemented with different locking protocols. All lock and sync requests are pipelined to hide network latency: each machine maintains a pipeline of vertices for which locks have been requested but not yet granted, and a vertex is executed once its lock acquisition and data synchronization are complete. Experiments show that distributed locking provides strong, nearly linear scalability, and that the longer the pipeline, the lower the runtime.
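The sketch below, written for this report with plain mutexes standing in for the engine's reader-writer locks and with no pipelining, shows the core idea of locking an edge-consistent scope in a canonical (sorted) order so that concurrent updates cannot deadlock.

import threading

class LockingEngineSketch:
    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.locks = {v: threading.Lock() for v in neighbors}

    def run(self, v, update):
        """Lock v and its neighbours in canonical order, run the update on the
        edge-consistent scope, then release the locks in reverse order."""
        scope = sorted({v} | set(self.neighbors[v]))   # canonical order avoids deadlock
        for u in scope:
            self.locks[u].acquire()
        try:
            update(v)                                  # safe to touch v and its neighbours
        finally:
            for u in reversed(scope):
                self.locks[u].release()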
4.3 Fault Tolerance

Fault tolerance in the GraphLab abstraction is achieved through distributed checkpointing in two modes, synchronous and asynchronous. Synchronous checkpointing suspends computation to save all data modified since the last checkpoint, while asynchronous checkpointing is based on the Chandy-Lamport snapshot algorithm [7]: the snapshot step becomes an update function in the GraphLab abstraction. Asynchronous checkpointing is shown to outperform synchronous checkpointing.

4.4 System Design

In the distributed setting, each machine runs one instance of GraphLab. The GraphLab processes are symmetric and communicate via remote procedure calls. The first process acts as a master and computes the placement of atoms based on the atom index. Each process maintains a local scheduler for its vertices and a cache for accessing remote data. A distributed consensus algorithm decides when all schedulers have become empty.

5 APPLICATIONS

GraphLab was evaluated on three state-of-the-art MLDM applications: collaborative filtering for Netflix movie recommendation, Named Entity Recognition (NER) using the chromatic engine, and video co-segmentation (CoSeg) using the distributed locking engine. It was found that GraphLab's performance is comparable to tailored MPI implementations and that it outperforms Hadoop by 20-60x. Netflix, CoSeg and NER were also expressed more compactly in the GraphLab abstraction than in MapReduce or MPI. (Refer to [12] for details about the datasets and cluster specifications.)

In Netflix movie recommendation, the ALS [14] algorithm takes a sparse Users × Movies matrix as input, containing the movie ratings of every user, and computes a low-rank matrix factorization. The rank of the factor matrices is denoted d: the higher the rank, the higher the accuracy, but at higher computational cost. High speedup is achieved for varying values of d and the corresponding number of cycles per update as machines are added incrementally. A comparison with an MPI implementation and Hadoop, for a fixed d = 20 and 4 to 64 machines, ranks the runtimes (in seconds) as Hadoop > MPI > GraphLab; GraphLab runs 40-60 times faster than Hadoop.
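For concreteness, a single ALS vertex update can be written as a small regularized least-squares solve; the sketch below is an illustration for this report (the data layout and the regularization parameter lam are assumptions, not taken from [12] or [14]).

import numpy as np

def als_update_user(u, ratings, movie_factors, d=20, lam=0.05):
    """Recompute one user's d-dimensional latent factors from the factors of the
    movies that user has rated (ratings[u] is a {movie_id: rating} dict)."""
    A = lam * np.eye(d)
    b = np.zeros(d)
    for m, r in ratings[u].items():
        vm = movie_factors[m]            # current latent factors of movie m
        A += np.outer(vm, vm)
        b += r * vm
    return np.linalg.solve(A, b)         # new latent factors for user u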
Video co-segmentation automatically identifies and clusters spatio-temporal segments of a video using LBP [10], and makes use of the distributed locking engine and its pipelining. Experiments showed that the locking engine achieves high performance and scalability on the application's large graph of 10.5 million vertices, yielding a 10x speedup with 16x more machines. The locking engine provides nearly optimal weak scaling: increasing the size of the graph in proportion to the number of machines barely affects the runtime. Experiments that vary the pipeline length show that a longer pipeline improves performance and thus compensates for poor partitioning. It is concluded that the distributed GraphLab abstraction provides excellent performance on the CoSeg task thanks to dynamic prioritized scheduling, and that pipelining is an effective way to hide latency and mitigate poor partitioning.

Named Entity Recognition (NER) is a natural language processing and information extraction task. In the experiments, a large amount of web-crawled data is used to count the number of occurrences of noun phrases in each context. The NER problem constructs two sets of vertices, corresponding to noun phrases and to their contexts, which can be mapped to a bipartite data graph: if a noun phrase occurs in a context, an edge is drawn between the noun phrase and that context. Experiments showed that NER does not scale as well as CoSeg and Netflix; it achieved only a modest 3x improvement with 16x more machines. This poor scaling is due to the large vertex data size (816 bytes), the random-cut partitioning strategy and the dense connectivity structure, which together cause high communication overhead in each iteration. Network analysis showed that NER uses more network bandwidth than Netflix and CoSeg, saturating the network with each machine sending at over 100 MB/s.
6 STRENGTHS AND LIMITATIONS

In this section we summarize a few strengths and limitations of the distributed GraphLab abstraction. The extensive, application-oriented experiments are the main strength of this paper. The abstraction naturally exhibits asynchronous, dynamic priority scheduling. Fault tolerance based on snapshotting reduces a large amount of communication overhead, and the distributed locking engine's pipelines hide network latency to a large extent. On the other hand, random partitioning can perform badly for some applications, such as NER. There is also a self-edge issue in GraphLab: it is possible to change the system to allow self-edges using flags, but this requires significant changes to the GraphLab code [3].

7 CURRENT STATE-OF-THE-ART

GraphLab was started in 2009 as a research project by Prof. Carlos Guestrin at Carnegie Mellon University. It later turned into a start-up company named Dato, which was renamed Turi a few years later; on August 5, 2016, it was acquired by Apple for 200 million dollars.

Currently, Turi Create (https://turi.com) supports the development of custom machine learning models. It claims that it is not specifically designed for machine learning experts: common ML tasks such as recommenders, regression, clustering and classifiers can easily be accomplished with Turi Create. It has built-in support for data visualization and is easy to deploy across platforms.
8 RELATED WORK

In this section we discuss related distributed graph processing systems, including those introduced after GraphLab. Some comparisons of GraphLab with high-level parallel distributed systems [6][13][9] were already given in Section 2 while explaining the core properties of MLDM algorithms.

MapReduce [6] simplifies parallel processing with its map and reduce semantics and partitions data randomly across machines. MapReduce implementations are not suitable for iterative graph algorithms because of the high I/O cost of shuffling data at every iteration.

Pregel [9] differs from GraphLab in the expressiveness of dynamic computation. An update function in Pregel is initiated by messages and can only access the data associated with each message, while GraphLab completely decouples data movement from future scheduling. Pregel also does not support the asynchronous properties of MLDM algorithms.

Giraph [1], an open-source implementation of Pregel [9], is a vertex-centric BSP system. It uses an edge-cut approach to partition data randomly and, like GraphLab, keeps all data in memory. The Giraph API has a Compute function that updates the state of a vertex based on its own and its neighbours' data. Giraph is found to be very competitive with GraphLab when GraphLab runs a fixed number of iterations with random partitioning, whereas GraphLab outperforms Giraph on large clusters.

Dryad [13] constructs a dataflow graph by combining computational vertices with communication channels, and runs the dataflow by communicating through TCP pipes and shared-memory FIFOs. It is not based on vertex-centric computation.

Graph-structured databases such as Neo4j [2] mainly focus on efficient CRUD operations over graph-structured data, while GraphLab focuses on iterative graph-structured computation.

HaLoop [15], a modified MapReduce system, minimizes data shuffling and network utilization after the very first iteration. It makes the master node loop-aware to reduce network communication, and slave nodes contain a special module to cache (on disk) and index loop-invariant data used across iterations.

GraphX [11], an extension of the Spark abstraction designed for iterative graph operations, uses a vertex-cut partitioning strategy similar to GraphLab and lets the developer decide which data portions to cache. GraphX is not efficient for applications requiring a large number of iterations [3].

Vertica [4] is a relational graph processing system that is not competitive with existing distributed graph processing systems due to its high I/O and network cost overhead, which increases with cluster size.

9 CONCLUSION

We summarized the distributed GraphLab abstraction and other related distributed graph processing systems. To address the key properties exhibited by most MLDM algorithms, the shared-memory GraphLab abstraction is extended to the distributed setting by improving the execution model, relaxing some scheduling requirements, integrating a new distributed data graph, and introducing new fault-tolerance schemes and execution engines.

A two-stage partitioning scheme is introduced in the distributed data graph design to achieve efficient load balancing and distributed ingress for clusters of varying size. A partially asynchronous chromatic engine and a fully asynchronous distributed locking engine are designed.

The distributed GraphLab abstraction was implemented in C++ and evaluated on Netflix, Named Entity Recognition and video co-segmentation using real data. Extensive experiments found that the distributed GraphLab abstraction outperforms Hadoop by 20-60x and is competitive with tailored MPI implementations.
REFERENCES
[1] 2010. Giraph. http://giraph.apache.org
[2] 2011. Neo4j. https://neo4j.com
[3] 2018. Analysis of Distributed Graph Systems. https://arxiv.org/abs/1806.08082
[4] Alekh Jindal, Samuel Madden, Malu Castellanos, and Meichun Hsu. [n. d.]. Graph analytics using Vertica relational database. IEEE ([n. d.]).
[5] D. P. Bertsekas and J. N. Tsitsiklis. 1989. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall Inc. (1989).
[6] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, and A. Y. Ng. 2006. Map-Reduce for machine learning on multicore. NIPS (2006), 281–288.
[7] K. M. Chandy and L. Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. 3, 1 (1985), 63–75.
[8] J. Dean and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. OSDI (2004).
[9] G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: a system for large-scale graph processing. SIGMOD (2010), 135–146.
[10] J. Gonzalez, Y. Low, and C. Guestrin. 2009. Residual splash for optimally parallelizing belief propagation. AISTATS 5 (2009), 177–184.
[11] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph processing in a distributed dataflow framework. Proc. 11th USENIX Symposium (2014).
[12] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB (2012).
[13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. EuroSys (2007), 59–72.
[14] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. 2008. Large-scale parallel collaborative filtering for the Netflix prize. AAIM (2008), 337–348.
[15] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2012. The HaLoop approach to large-scale iterative data analysis. VLDB (2012), 169–190.

