Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions

Article · December 2012


1 author:

Daniel Pop
West University of Timisoara

All content following this page was uploaded by Daniel Pop on 19 May 2014.




Daniel Pop
Institute e-Austria Timişoara
Bd. Vasile Pârvan No. 4, 300223 Timişoara, România
E-mail: danielpop@info.uvt.ro

Abstract

Applying popular machine learning algorithms to large amounts of data has raised new challenges for ML practitioners. Traditional ML libraries do not support the processing of huge datasets well, so new approaches were needed. Parallelization using modern parallel computing frameworks, such as MapReduce, CUDA, or Dryad, gained in popularity and acceptance, resulting in new ML libraries developed on top of these frameworks. We will briefly introduce the most prominent industrial and academic outcomes, such as Apache Mahout™, GraphLab or Jubatus.

We will investigate how the cloud computing paradigm impacted the field of ML. The first direction is that of popular statistics tools and libraries (R system, Python) deployed in the cloud. A second line of products augments existing tools with plugins that allow users to create a Hadoop cluster in the cloud and run jobs on it. Next on the list are libraries of distributed implementations of ML algorithms, and on-premise deployments of complex systems for data analytics and data mining. The last approach on the radar of this survey is ML as Software-as-a-Service, with several BigData start-ups (and large companies as well) already opening their solutions to the market.

1 Introduction

Given the enormous growth of collected and available data in companies, industry and science, techniques for analyzing such data are becoming ever more important. Today, data to be analyzed are no longer restricted to sensor data and classical databases, but more and more include textual documents and webpages (text mining, Web mining), spatial data, multimedia data, relational data (molecules, social networks). Analytics tools allow end-users to harvest the meaningful patterns buried in large volumes of structured and unstructured data. Analyzing big datasets gives users the power to identify new revenue sources, develop loyal and profitable customer relationships, and run their overall organization more efficiently and cost effectively.

Research in knowledge discovery and machine learning combines classical questions of computer science (efficient algorithms, software systems, databases) with elements from artificial intelligence and statistics up to user oriented issues (visualization, interactive mining).

Although for more than two decades parallel database products, such as Teradata, Oracle or Netezza, have provided means to realize a parallel implementation of machine learning and data mining (ML-DM) algorithms, expressing ML-DM algorithms in SQL code is a complex task and the result is difficult to maintain. Furthermore, large-scale installations of these products are expensive and are not an affordable option in most cases. Another driver for the paradigm shift from the relational model to other alternatives is the new nature of data. Until about five years ago, most data was transactional in nature, consisting of numeric or string data that fit easily into rows and columns of relational databases. Since then, while structured data is following a near-linear growth, unstructured (e.g. audio and video) and semi-structured data (e.g. Web traffic data, social media content, sensor generated data etc.) exhibit an exponential growth (see figure 1). Most of the new data is either semi-structured in format, i.e. it consists of headers followed by text strings, or pure unstructured data (photo, video, audio). While the latter has limited textual content and is more difficult to parse and analyze, semi-structured data triggered a plethora of non-relational data store (NoSQL data store) solutions tailored to handle huge amounts of data. Consequently, the past 5 years have seen researchers moving to parallelization of ML-DM using these new platforms, such as NoSQL datastores,
distributed processing environments (MapReduce), or cloud computing.

Figure 1. Trends in data growth

At this point, it is worth reflecting on a nice metaphor by Ben Werther [18], co-founder of Platfora, for big data processing today:

    In ‘industrial revolution’ terms, we are in the pre-industrial era of artisanship that preceded mass production. It is the equivalent of needing to engage an expert blacksmith to forge the forks and spoons for our dinner table.

Machine Learning is inherently a time consuming task, thus plenty of effort has gone into speeding up execution. The cloud computing paradigm and cloud providers turned out to be valuable alternatives for speeding up machine learning platforms. Thus, popular statistics tools and environments – like R, Octave, Python – moved to the cloud as well. There are two main directions for integrating them with cloud providers: create a cluster in the cloud and bootstrap it with statistics tools, or augment statistics environments with plugins that allow users to create Hadoop clusters in the cloud and run jobs on them.

Environments like R, Octave, Maple and similar offer low-level infrastructure for data analysis that can be applied to large datasets once leveraged by cloud providers. Machine Learning is something that comes on top of this and facilitates the retrieval of useful knowledge out of huge data for customers with little or no statistical background, by automatically inferring ‘knowledge models’ out of data. To support this need, an explosion of start-ups (some of them still in stealth mode) offering machine learning services or big data analysis services to their customers has been seen in the past 5 years. These initiatives can be either PaaS/SaaS platforms or products that can be deployed on private environments.

Reviewing the literature and the market, we can conclude that ML-DM comes in many flavors. We classify these approaches in 5 distinct classes:

• Machine Learning environments from the cloud – create a computer cluster in the cloud and bootstrap it with statistics tools. ⇒ Section 3.

• Plugins for Machine Learning tools – augment statistics tools with plugins that allow users to create a Hadoop cluster in the cloud and run ML jobs on it. ⇒ Section 4.

• Distributed Machine Learning libraries – collections of parallelized implementations of ML algorithms for distributed environments (Hadoop, Dryad etc.). ⇒ Section 5.

• Complex Machine Learning systems – products that need to be installed on private data centers (or in the cloud) and offer high performance data mining and analysis. ⇒ Section 6.

• Software as a Service providers for Machine Learning – PaaS/SaaS solutions that allow clients to access ML algorithms via Web services. ⇒ Section 7.

The remainder of the paper is structured as follows: the next section presents similar, recent studies, followed by 5 sections, each of them devoted to a particular class identified above. The paper ends with conclusions and future plans.

2 Related studies

Since 1995, many implementations have been proposed for parallelizing ML-DM algorithms on shared or distributed systems. For a comprehensive study the reader is referred to a recent survey [17]. Our work focuses on frameworks, toolkits and libraries that allow large-scale, distributed implementations of state-of-the-art ML-DM algorithms. In this respect, we mention a recent book dealing with large-scale machine learning [1], which contains both presentations of general frameworks for highly scalable ML implementations, like DryadLINQ or IBM PMLT, and specific implementations of ML techniques on these platforms, like ensemble decision trees, SVM, k-Means etc. It contains contributions from both industry leaders (Google, HP, IBM, Microsoft) and academia (Berkeley, NYU, University of California etc.).

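
As a rough illustration of what parallelizing an ML-DM algorithm on a data-parallel framework involves, the sketch below writes one k-means iteration as a map step and a reduce step in plain Python. It does not use Hadoop, DryadLINQ or any library surveyed here; the function names (`map_step`, `reduce_step`, `kmeans_iteration`) and the toy data are assumptions made for this example only.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_step(partition, centroids):
    """Map: emit (cluster id, (point, 1)) for every point in one data partition."""
    return [(nearest_centroid(pt, centroids), (pt, 1)) for pt in partition]


def reduce_step(pairs, centroids):
    """Reduce: sum points and counts per cluster id, then average them."""
    sums = {}
    for cid, (pt, n) in pairs:
        total, count = sums.get(cid, ([0.0] * len(pt), 0))
        sums[cid] = ([t + x for t, x in zip(total, pt)], count + n)
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(v / sums[i][1] for v in sums[i][0]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans_iteration(partitions, centroids):
    """One iteration: map every partition independently, then reduce all pairs."""
    pairs = [pair for part in partitions for pair in map_step(part, centroids)]
    return reduce_step(pairs, centroids)


# Hypothetical toy data: two partitions, as if stored on two cluster nodes.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(partitions, centroids)
print(new_centroids)
```

Each call to `map_step` touches only its own partition, which is what lets a framework run the map phase on many machines at once; only the small per-cluster sums have to be brought together in the reduce phase.
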
Recent articles, such as those of S. Charrington [3], W. Eckerson [4] and D. Harris [5], review different large-scale ML solution providers that are trying to offer better tools and technologies, most of them based on Hadoop infrastructure, to move forward the novel industry of big data. They aim at improving user experience, product recommendations, or website optimization, applicable to finance, telecommunications, retail, advertising or media.

3 Machine Learning environments from the cloud

Providers in this category offer computer clusters using public cloud providers, such as Amazon EC2, Rackspace etc., pre-installed with statistics software, the preferred packages being the R system, Octave or Maple. These solutions offer scalable high-performance resources in the cloud to their customers, who are freed from the burden of installing and managing their own clusters.

Cloudnumbers.com¹ uses the Amazon EC2² provider to set up computer clusters preinstalled with software for scientific computing, such as the R system, Octave or Maple. Customers benefit from a web interface where they can create their own workspaces, configure and monitor the cluster, upload datasets or connect to public databases. On top of the default features from the cloud provider, Cloudnumbers offers high security standards by providing secure encryption for data transmission and storage. Overall, it is an HPC platform in the cloud that is easy to create and effortless to maintain.

CloudStat³ is a cloud integrated development environment built on the R system. It exposes its functionalities via 2 types of user interfaces: console – for users experienced in the R language, and applications – designed as a point-and-click, forms-based interface to R for users with no R programming skills. There is also a CloudStat AppStore where users can choose applications from a growing repository.

Opani⁴ offers services similar to Cloudnumbers.com, but additionally helps customers to size their cluster according to their needs: the size of the data and the time-frame for processing this data. They use Rackspace's⁵ infrastructure and support environments such as the R system, Node and Python, bundled with map/reduce, visualization, security and version control packages. Results of data analysis processes, named dashboards in Opani, can easily be visualized and shared from desktop or mobile devices.

Approaches in this class are powerful and flexible solutions, offering users the possibility to develop complex ML-DM applications that run on the cloud. Users are freed from the burden of provisioning their own distributed environments for scientific computing, while being able to use their favorite environments. On the other hand, users of these tools need extensive programming experience and strong knowledge of statistics. Perhaps due to this limited audience, the stable providers in this category are fewer than in other categories, some of them (such as CRdata.org) shutting down operations shortly after taking off.

4 Plugins for Machine Learning tools

In this class, statistics applications (e.g. R system, Python) are extended with plugins that allow users to create a Hadoop cluster in the cloud and run time consuming jobs over large datasets on it. Most of the interest went towards R, for which several extensions are available, compared to Python, for which less effort was invested until recently in supporting distributed processing. In this section we will mention several solutions for R and Python.

RHIPE⁶ is an R package that implements a map/reduce framework for R, offering access to a Hadoop installation from within the R environment. Using specific R functions, users are able to launch map/reduce jobs executed on the Hadoop cluster; results are then retrieved from HDFS.

Snow⁷ [16] and its variants (snowfall, snowFT) implement a framework that is able to express an important class of parallel computations and is easy to use within an interactive environment like R. It supports three types of clusters: socket-based, MPI, and PVM.

The Segue for R⁸ project makes it easier to run map/reduce jobs from within the R environment on elastic clusters at Amazon Elastic Map Reduce⁹.

¹ http://cloudnumbers.com
² http://aws.amazon.com/ec2/
³ http://cs.croakun.com
⁴ http://opani.com
⁵ http://rackspace.com
⁶ http://www.stat.purdue.edu/~sguha/rhipe/doc/html/index.html
⁷ http://cran.r-project.org/web/packages/available_packages_by_name.html
⁸ http://code.google.com/p/segue/
⁹ http://aws.amazon.com/elasticmapreduce/
¹⁰ https://store.continuum.io/cshop/anaconda
¹¹ http://continuum.io

Anaconda¹⁰ is a scalable data analytics and scientific computing environment in Python offered by Continuum Analytics¹¹. It is a collection of packages (NumbaPro – fast, multi-core and GPU-enabled computation, IOPro
– fast data access, and wiseRF Pine – multi-core implementation of the Random Forest) that enables large-scale data management, analysis, visualization and more. It can be installed as a full Python distribution or can be plugged into an existing installation.

Because of R's popularity among ML-DM practitioners (the R system has been the preferred tool for such tasks in the past 2 years [15, 10]), efforts have been made recently to parallelize lengthy processes on scalable distributed frameworks (Hadoop). This approach is largely preferred over ML in the cloud due to the possibility of re-using the existing infrastructure of research or industrial (private) data centers. To the best of our knowledge, there are no similar approaches for related mathematical tools, such as Mathematica, Maple or Matlab/Octave, except HadoopLink¹² for Mathematica. The audience of this class of solutions is also highly qualified in programming languages, mathematics, statistics and machine learning algorithms.

5 Distributed Machine Learning libraries

This category offers complex libraries operating on various distributed setups (Hadoop, Dryad, MPI). They allow users to use out-of-the-box algorithms, or implement their own, that are run in parallel mode over a cluster of computers. These solutions do not integrate, nor use, statistics/mathematics software; rather, they offer self-contained packages of optimised, state-of-the-art ML-DM methods and algorithms.

Apache Mahout™¹³ [12] is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform [20]. It started as a collection of independent, "Hadoop-free" components, e.g. "Taste" collaborative filtering. Its goal is to build scalable machine learning libraries, where scalable has a broader meaning:

• Scalable to reasonably large datasets. Mahout's core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop [20] using the map/reduce paradigm. However, it does not restrict contributions to Hadoop based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

• Scalable to support various business cases. Mahout is distributed under a commercially friendly Apache Software license.

• Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.

Currently Mahout supports mainly four use cases:

• Recommendation mining takes users' behavior and from that tries to find items users might like.

• Clustering takes e.g. text documents and groups them into groups of topically related documents.

• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

• Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

Integration with initiatives such as the graph processing platform Apache Giraph¹⁴ is actively under discussion. An active community is behind this project.

GraphLab¹⁵ [11] is a framework for ML-DM in the Cloud. While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, GraphLab is an abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance, in both shared-memory and distributed settings. It is written in C++ and is able to directly access data from the Hadoop Distributed File System (HDFS) [20]. The authors report out-performing similar approaches by orders of magnitude.

¹² https://github.com/shadanan/HadoopLink
¹³ http://mahout.apache.org
¹⁴ http://incubator.apache.org/giraph/
¹⁵ http://graphlab.org
¹⁶ http://research.microsoft.com/en-us/projects/DryadLINQ/
¹⁷ http://msdn.microsoft.com/netframework/future/linq/

DryadLINQ¹⁶ [19, 2] is a LINQ (Language INtegrated Query)¹⁷ subsystem developed at Microsoft Research on top of Dryad [9], a general purpose architecture for execution of data parallel applications. It supports DAG-based abstractions, inherited from Dryad, for implementing data processing algorithms.
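
To give a feel for what a DAG-based abstraction means in practice, the following minimal sketch runs a small graph of side-effect-free tasks in dependency order. It is plain Python using the standard-library `graphlib` module (Python 3.9+), not the Dryad or DryadLINQ API; `run_dag`, the task names and the sample data are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, inputs):
    """Run every task after all of its upstream dependencies have finished.

    tasks:  name -> function taking the upstream results as positional arguments
    deps:   name -> list of upstream names (other tasks or raw inputs)
    inputs: name -> value for nodes that are given rather than computed
    """
    results = dict(inputs)
    graph = {name: deps.get(name, []) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # raw input, nothing to compute
            continue
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](*upstream)
    return results


# Hypothetical three-task DAG: two independent branches joined by a final task.
tasks = {
    "squares": lambda data: [x * x for x in data],
    "evens": lambda data: [x for x in data if x % 2 == 0],
    "total": lambda squares, evens: sum(squares) + sum(evens),
}
deps = {"squares": ["data"], "evens": ["data"], "total": ["squares", "evens"]}
results = run_dag(tasks, deps, {"data": [1, 2, 3, 4]})
print(results["total"])
```

In this toy graph `squares` and `evens` do not depend on each other, so a real execution engine would be free to schedule them on different machines; the sketch runs them sequentially to stay short.
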
A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan, which is passed to the Dryad execution platform that ensures efficient and reliable execution of this plan. The authors demonstrate near-linear scaling of execution time with the number of computers used for a job. While the DAG-based abstraction permits rich computational dependencies, it does not naturally express the iterative, data parallel, task parallel and dynamic data driven algorithms that are prevalent in ML-DM.

Jubatus¹⁸ [8], started in April 2011, is an online/real-time machine learning platform implemented on a distributed architecture. Compared to Mahout™, it is a next-step platform that offers stream processing and online learning. In online ML, the model is continuously updated with each incoming data sample by fast, memory-light algorithms. It requires no data storage, nor sharing; only model mixing. It supports classification problems (Passive Aggressive (PA), Confidence Weighted Learning, AROW), PA-based regression, nearest neighbor (LSH, MinHash, Euclid LSH), recommendation, anomaly detection (LOF based on NN) and graph analysis (shortest path, PageRank). In order to efficiently support online learning, Jubatus applies updates to local models; each server then transmits its model difference, and the differences are merged and distributed back to all servers. The mixed model improves gradually thanks to all servers' work.

IBM Parallel Machine Learning Toolbox¹⁹ [13] (PMLT), a joint effort of the Machine Learning group at the IBM Haifa Lab and the Data Analytics department at the IBM Watson Lab, provides tools for execution of data mining and machine learning algorithms on multiple processor environments or on multiple threaded machines. The toolbox comprises two main components: an API for running the users' own machine learning algorithms, and several pre-programmed algorithms which serve both as examples and for comparison. The pre-programmed algorithms include a parallel version of the Support Vector Machine (SVM) classifier, linear regression, transform regression, nearest neighbors, k-means, fuzzy k-means, kernel k-means, PCA, and kernel PCA. One of the main advantages of the PML toolbox is the ability to run it on a variety of operating systems and platforms, from multi-core laptops to supercomputers such as BlueGene. This is because the toolbox incorporates a parallelization infrastructure that completely separates parallel communications, control, and data access from learning algorithm implementation. This approach enables learning algorithm designers to focus on algorithmic issues without having to concern themselves with low-level parallelization issues. It also enables learning algorithms to be deployed on multiple hardware architectures, running either serially or in parallel, without having to change any algorithmic code. The toolbox uses the popular MPI library as the basis for its operation, and is written in C++. Despite our efforts to get the latest news on this project, we found no activity since 2007, except for a chapter in [1] (2012). On the other hand, the toolkit is suited for parallel environments, not for distributed ones.

NIMBLE [6] is a sequel project to the Parallel Machine Learning Toolbox, also developed at IBM Research Labs. It exposes a multi-layered framework where developers may express their ML-DM algorithms as tasks. Tasks are then passed to the next layer, an architecture-independent layer composed of one queue of DAGs of tasks plus a pool of worker threads that unfold this queue. The next layer is an architecture-dependent layer that translates the generic entities from the upper layer into various runtimes. Currently, NIMBLE supports execution on the Hadoop platform [20] only. Other platforms, such as Dryad [9], are also good candidates, but not yet supported. Advantages of this framework include:

• higher level of abstraction, hiding low-level control and choreography details of most of the distributed and parallel programming paradigms (MR, MPI etc.), allowing programmers to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks

• portability: by providing a specific implementation for the architecture-dependent layer, the same code can be executed on various distributed runtimes

• efficiency and scalability: due to optimisation introduced by DAGs of tasks and co-scheduling, results presented in [6] for the Hadoop runtime show speedup improvement with increasing dataset size and dimensionality.

¹⁸ http://jubat.us/
¹⁹ https://www.research.ibm.com/haifa/projects/verification/ml_toolbox/index.html

SystemML [7], developed at IBM Research labs like NIMBLE and PMLT, proposes an R-like language (Declarative Machine Learning language) that includes linear algebra primitives and shows how it can be
optimized and compiled down to MapReduce. They report an extensive performance evaluation on three ML algorithms (Group Nonnegative Matrix Factorization, Linear Regression, PageRank) on varying data and Hadoop cluster sizes.

Table 5 presents a synthesis of the investigated platforms. One can notice that Java is the preferred environment, due to the large adoption and usage of Hadoop as distributed processing model. The good news is that the most active and lively solutions are the open-source ones. The target audience of this class of products is programmers, system developers and ML experts who need fast, scalable distributed solutions for ML-DM problems.

6 Complex Machine Learning systems

This section presents several solutions for business intelligence and data analytics that share a set of common features: (i) all are deployable on on-premise or in-the-cloud clusters, (ii) provide a rich set of graphical tools to analyse, explore and visualize large amounts of data, (iii) expose a rather limited set of ML-DM functions, usually limited to prediction models, and (iv) utilize Apache Hadoop [20] as processing engine and/or storage environment. They differ in how data is integrated and processed, in the supported data sources, and in the complexity of the system. Here are the best known ones:

Kitenga Analytics²⁰, recently purchased by Dell, is a native Hadoop application that offers visual ETL, Apache Solr™²¹-based search, natural language processing, Apache Mahout-based data mining, and advanced visualization capabilities. It is a big data environment for sophisticated analysts who want a robust toolbox of analytical tools, all from an easy-to-use interface that does not require understanding of complex programming or the Apache Hadoop stack itself.

Pentaho Business Analytics²² offers a complete solution for big data analytics, supporting all phases of an analytics process, from pre-processing to advanced data exploration and visualization. It offers (i) a complete visual design tool to accelerate data preparation and modeling, (ii) data integration from NoSQL and relational databases, (iii) distributed execution on the Hadoop platform [20], (iv) instant and interactive analysis (no code, no ETL (Extract, Transform, Load)) and (v) a business analytics platform: data discovery, exploration, visualization and predictive analytics. Main characteristics of the Pentaho solution include:

• MapReduce-based data processing

• Can be configured for different Hadoop distributions (such as Cloudera, Hadapt etc.)

• Data can be loaded into and processed in Hadoop HDFS, HBase²³, or Hive²⁴

• Supports Pig scripts

• Native support for most NoSQL databases, such as Apache Cassandra, DataStax, Apache HBase, MongoDB, 10gen etc.

• Enables performance-optimized data analysis, reporting and data integration for analytic databases (such as Teradata, MonetDB, Netezza etc.), through deep integration with native SQL dialects and a parallel bulk data loader

• Integration with HPCC (High Performance Computing Cluster) from LexisNexis Risk Solutions²⁵

• Import/export from/to PMML (Predictive Modeling Markup Language)

• Pentaho Instaview, a visual application to reduce the time needed to deploy data analytics solutions and to help novice users gain insights from their data, in three simple steps: select data source, automatically prepare data for analytics, and visualize and explore built models.

• Pentaho Mobile – an application for iPad that provides interactive business analytics for business users

Their ecosystem is composed of several powerful systems, each of them a complex project of its own:

Pentaho BI Platform/Server: the BI platform is a framework providing core services, such as authentication, logging, auditing and rules engines; it also has a solution engine that integrates all other systems (reporting, analysis, integration and data mining); BI Server is the best known implementation of the platform, which functions as a web based report management system, application integration server and lightweight workflow engine.

²⁰ http://www.quest.com/news-release/quest-software-expands-its-big-data-solution-with-new-hadoop-ce-102012-818658.aspx
²¹ http://lucene.apache.org/solr/
²² http://www.pentaho.org
²³ http://hbase.apache.org
²⁴ hive.apache.org
²⁵ http://hpccsystems.com

Pentaho Reporting: based on JFreeReport, it is a suite of open-source tools – Pentaho Report Designer, Pentaho Reporting Engine, Pentaho Reporting SDK and
the common reporting libraries shared with the entire Pentaho BI Platform – that allows users to create relational and analytical reports from a variety of sources, outputting results in various formats (HTML, PDF, Excel etc.);

Pentaho Data Integration (Kettle): delivers powerful ETL capabilities using a metadata-driven approach with an intuitive, graphical, drag and drop design environment;

Pentaho Analysis Service (Mondrian): an Online Analytical Processing (OLAP) server that supports data analysis in real time;

Pentaho Data Mining (Weka): a collection of machine learning algorithms for classification, regression, clustering and association rules.

Platfora²⁶ delivers in-memory business intelligence with no separate data warehouse or ETL required. Its visual interface built on HTML5 allows business users to analyse data. Results may be easily shared between users. It relies on a Hadoop cluster, which can be installed either on the customer's own premises or on cloud providers (Amazon EMR and S3). It is primarily focused on BI features, such as elaborate visualization types (charts, plots, maps) or slice-and-dice operations, but also offers a predictive analysis framework.

Skytree Server²⁷ is a general purpose machine learning and data analytics system that supports data coming from relational databases, Hadoop systems, or flat files and offers connectors to common statistical packages and ML libraries. The ML methods supported are: Support Vector Machine (SVM), Nearest Neighbor, K-Means, Principal Component Analysis (PCA), Linear Regression, 2-point correlation and Kernel Density Estimation (KDE). Skytree Server connects with analytics front-ends, such as Web services or statistical and ML libraries (R, Weka), for data visualization. Its deployment options include cloud providers or a dedicated cluster based on Linux machines. It also supports customers in estimating the size of the cluster they need by a simple formula (Analytics Requirements Index).

WibiData²⁸ is a complex solution based on an open source software stack from Apache, combining Hadoop, HBase and Avro with proprietary components. WibiData's machine learning libraries give users the tools to start building complex data processing pipelines immediately. WibiData also provides graphical tools to export data from its distributed data repository into any relational database [21]. In order to simplify data processing using Hadoop, WibiData introduces the concepts of producers (computation functions that update a row in a table) and gatherers (which close the gap between the WibiData table and the key-value pairs processed by the Hadoop MapReduce engine).

We are aware that we could not cover all the solution providers in the field of business intelligence and big data analytics. We tried to cover those that also offer ML components in their applications; many others focusing only on big data analytics, such as Alteryx, SiSense, SAS or SAP, are omitted from this survey. Solutions in this category mostly target business users who need to quickly and easily extract insights from their data, and are good candidates for users with less computer or statistics background.

7 Software as a Service providers for Machine Learning

This section focuses on platform-as-a-service or software-as-a-service providers for machine learning problems. They offer their services mainly via RESTful interfaces, and in some (rare) cases the solution may also be installed on-premise (Myrrix), in contrast to the solutions from the previous section, which are mainly systems deployable on private data centers. As a class of ML problems, predictive modeling is the favorite among these systems (BigML, Google Prediction API, Eigendog). We did not include in this study providers of SQL over Hadoop solutions (e.g. Cloudera Impala, Hadapt, Hive) because their main target is not ML-DM, but rather fast, elastic and scalable SQL processing of relational data using the distributed architecture of Hadoop.

BigML²⁹ is a SaaS approach to machine learning. Users can set up data sources, create, visualize and share prediction models (only decision trees are supported), and use models to generate predictions, all from a Web interface or programmatically using a REST API.

BitYota³⁰ is a young start-up (2012) and SaaS provider of a BigData warehousing solution. On top of data integration from different sources (relational, NoSQL, HDFS), it also allows customers to run statistics and summarization queries in SQL92, standard R statistics and custom functions written in JavaScript, Perl, or Python on a parallel analytics engine. Results are visualized by integrating with popular BI tools and dashboards.

²⁶ http://platfora.com
²⁷ http://skytree.net
²⁸ http://wibidata.com
²⁹ http://bigml.com
³⁰ http://bityota.com
³¹ http://precog.com

Precog³¹ has a more elaborate SaaS solution composed of the Precog database, the Quirrel language, and the ReportGrid and LabCoat tools. At the core of Precog, we
have an original (no Hadoop, no other NoSQL based), cused on business people, less knowledgeable on statis-
schemaless, columnar database designed for storing tics and machine learning.
and analyzing semi-structured, measured data, such as Myrrix 36 is a complete, real-time, scalable recom-
events (users clicking, engaging, and buying), sensor mender system built using Apache MahoutTM (see Sec-
data, activity stream data, facts, and other kinds of tion 5). It can be accessed as PaaS using a RESTful
data that do not need to be mutably updated. Precog’s interface. It is able to incrementally update the model
functionality is exposed by REST APIs, but client li- once new data is available. It is organized in 2 lay-
braries are available in JavaScript, Python, PHP, Ruby, ers – Serving (open source and free) and Computation
Java, or C#. LabCoat is a GUI tool for creation and (Hadoop based) – that can be deployed on-premise as
management of Quirrel queries. Quirrel is a a highly well, either both of them or only one.
expressive data analysis language that makes it easy
Prior Knowledge Veritable API 37 offers
to do in-database analytics, statistics, and machine
Python and Ruby interfaces; upload data on their
learning across any kind of measured data. Results
servers, and build prediction model using Markov
are available in JSON or CSV formats. ReportGrid is
Chain Monte Carlo samplers. They were operating
an HTML5 visualization engine that interactively, or
a cloud based infrastructure based on Amazon WS.
programmatically, build reports and charts.
SalesForce.com acquired Prior Knowledge at the end
Google Prediction API 32 is Google’s cloud- of 2012.
based machine learning tools that can help analyze Predictobot 38 by Prediction Appliance also aims
your data. It is closely connected to Google Cloud at doing machine learning modeling easier. The user
Storage33 where training data is stored and offers its will upload a spreadsheet of data, answer a few ques-
services using a RESTful interface, client libraries al- tions, and then download a spreadsheet with the pre-
lowing programmers to connect from Java, JavaScript, dictive model. It is going to bring predictive modeling
.NET, Ruby, Python etc. In the first step, the model to anyone with the skills to make a spreadsheet. The
need to be trained from data, supported models being business is still in stealth mode.
classification and regression for now. After the model
is built, one can query this model to obtain predic- 7.1 Text mining as SaaS
tions on new instances. Adding new data to a trained
model is called Streaming Training and it is also nicely Due to explosion of social media technologies, such
supported. Recently, PMML preprocessing feature has as blog platforms (WordPress.com, Blogger etc), mini-
been added, i.e. Prediction API .supports preprocess- blogging (Twitter), or social networks (Facebook,
ing your data against a PMML transform specified us- Google+), an increased interest is paid to text min-
ing PMML 4.0 syntax; does not support importing of ing and natural language processing (NLP) solutions
a complete PMML model that includes data. Created delivered as services to their customers. This is why
models can be shared as hosted models in the market- we devoted an entire subsection to group together
place. software/platform-as-a-service solutions for text min-
EigenDog 34 is a service for scalable predictive ing. Before reviewing available solutions, a short intro-
modeling, hosted on Amazon EC2 (for computation) duction to NLP and text mining is helpful.
and S3 (for data and models storage) platforms. It While NLP uses linguistically inspired techniques
builds decision tree model out of data in Weka’s ARFF (text is syntactically parsed using information from a
format. Models can be downloaded in binary format formal grammar and a lexicon, and the resulting in-
and integrated in user applications thanks to API, or formation is then interpreted semantically and used to
open-source library provided by vendor. extract information) to deeply analyse the document,
Metamarkets 35 claim as being Data Science-as- text mining is more recent and uses techniques devel-
a-Service providers, helping users to get insights out oped in the fields of information retrieval, statistics,
of their large datasets. They offer end-users the pos- and machine learning. Contrasting with NLP, text
sibility to perform fast, ad-hoc investigations on data, mining’s aim is not to understand what is ”said” in
to discover new and unique anomalies, to spot trends a text, rather to extract patterns across large number
in data streams, based on statistical models, in an in- of documents. Features of text mining include extrac-
tuitive, interactive and collborative way. They are fo- tion of concept/entity, text clustering, summarization,
32 https://developers.google.com/prediction/
or sentiment analysis.
33 https://developers.google.com/storage/ 36 http://myrrix.com
34 https://eigendog.com/#home 37 http://priorknowledge.com
35 http://metamarkets.com/ 38 http://predictobot.com

8
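As a rough illustration of the statistical flavour of text mining, as opposed to the deep linguistic parsing used in NLP, the following Python sketch scores the terms of each document by TF-IDF, a standard information-retrieval weighting. The toy corpus, the tokenizer, and the function names are assumptions made for this example only; they are not taken from any of the surveyed services.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores, k=2):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:k]


# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "the service was great and the staff was friendly",
    "the battery life of the phone is great",
    "the phone screen cracked after one week",
]

scores = tf_idf(corpus)
for doc_scores in scores:
    print(top_terms(doc_scores))
```

A term that occurs in every document (here "the") gets a score of zero, while terms concentrated in a single document rank highest. No grammar or lexicon is involved, which is the sense in which text mining extracts patterns across documents rather than interpreting a single text.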
The size and number of documents that need to be processed, together with real-time processing constraints, contribute to the development of novel, distributed toolkits able to answer the needs of demanding users. Website operators are willing to offer text mining features to their visitors with minimum investment and reduced maintenance costs. Thus, more and more providers are offering text mining services through RESTful web services, saving clients from costly infrastructures and deployments. Without aiming at providing an exhaustive survey of text mining P(S)aaS providers, we mention several of them hereafter:

AlchemyAPI 39 is a cloud-based text mining SaaS platform which claims to offer the most comprehensive set of NLP capabilities of any text mining platform, including: named entity extraction, sentiment analysis, concept tagging, author extraction, relations extraction, web page cleaning, language detection, keyword extraction, quotations extraction, intent mining, and topic categorization. AlchemyAPI uses deep linguistic parsing, statistical natural language processing, and machine learning to analyze content, extracting semantic meta-data: information about people, places, companies, topics, languages, and more. It provides RESTful API endpoints and SDKs in all major programming languages, and responses are encoded in various formats (XML, JSON, RDF). Organizations with specific data security needs or regulatory constraints are offered the possibility to install the solution in their own environment.

NathanApp™ 40 is AI-one's general purpose machine learning PaaS, also available for on-premise deployment as NathanNode™. Like Topic-Mapper, it is ideally suited to learn the meaning of any human language by learning the context of words, only faster and with greater deployment flexibility. NathanApp is a RESTful API using JavaScript and JSON.

TextProcessing 41 is also an NLP API; it supports stemming and lemmatization, sentiment analysis, tagging and chunk extraction, phrase extraction and named entity recognition. These services are offered open and free (for limited usage) via RESTful API endpoints. Client libraries exist in Java, Python, Ruby, PHP and Objective-C, responses are JSON encoded, and Python NLTK demos are offered to shorten the learning curve. For commercial purposes, clients are offered monthly subscriptions via Mashape.com.

Yahoo! Content Analysis Web Service 42 detects entities/concepts, categories, and relationships within unstructured content. It ranks the detected entities/concepts by their overall relevance, resolves them, if possible, into Wikipedia pages, and annotates tags with relevant meta-data. The service is available as a YQL table and the response is in XML format. It is freely available for non-commercial usage.

39 http://www.alchemyapi.com
40 http://ai-one.com
41 http://text-processing.com
42 http://developer.yahoo.com/search/content/V2/contentAnalysis.html

This section presented PaaS solutions addressing, to some extent, machine learning problems. A special sub-section was devoted to text mining due to its spread in the ML PaaS landscape. We notice big players, such as Yahoo! or Google, as well as many start-ups with millions of dollars in funding. They offer Web developers the possibility to easily integrate ML intelligence into their sites. Ease of use prevailed over the functionality offered by these services; therefore, there are only limited options for tweaking the algorithms behind the services. Thus, these are good candidates for users with basic ML needs, but they are not flexible enough for addressing more advanced problems.

8 Conclusions and future work

Our main findings are synthesized below:

(1) Existing programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), are the de facto choices for implementing ML-DM algorithms. More and more interest has been devoted to MR due to its ability to handle large datasets and its built-in resilience against failures.

(2) Machine learning in distributed environments comes in different approaches, offering viable and cost-effective alternatives to traditional ML and statistical applications, which are not focused on distributed environments [14].

(3) Existing solutions target either experienced, skilled computer scientists, mathematicians and statisticians, or novice users who are happy with no (or few) possibilities to tune the algorithms. End-user support and guidance is largely missing from existing distributed ML-DM solutions.

After reviewing over 30 different offers on the market, we think that there is still room for a scalable, easy to use and easy to deploy solution for ML-DM in the context of the cloud computing paradigm, targeting end-users with less programming or statistical experience who are nevertheless willing to run and tweak advanced scientific ML tasks, such as researchers and practitioners from fields like medicine, finance, telecommunications etc. In this respect, our future plans include prototyping such a distributed system relying on existing distributed ML-DM frameworks, but enhancing them with usability
and user friendliness features.

Acknowledgments

This work was supported by EC-FP7 project FP7-REGPOT-2011-1 284595 (HOST).

References

[1] R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012, summary at http://people.cs.umass.edu/~ronb/scaling_up_machine_learning.htm

[2] M. Budiu, D. Fetterly, M. Isard, F. McSherry, and Y. Yu – Large-Scale Machine Learning using DryadLINQ, in R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012

[3] S. Charrington – Three New Tools Bring Machine Learning Insights to the Masses, February 2012, Read Write Web, http://www.readwriteweb.com/hack/2012/02/three-new-tools-bring-machine.php

[4] W. Eckerson – New technologies for Big Data, http://www.b-eye-network.com/blogs/eckerson/archives/2012/11/new_technologie.php (2012)

[5] D. Harris – 5 low-profile startups that could change the face of big data, January 2012, http://gigaom.com/cloud/5-low-profile-startups-that-could-change-the-face-of-big-data/

[6] A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan – NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce, KDD '11

[7] A. Ghoting et al. – SystemML: Declarative machine learning on MapReduce. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 231-242, Washington, DC, USA, 2011

[8] S. Hido – Jubatus: Distributed Online Machine Learning Framework for Big Data, XLDB Asia, Beijing, 2012, http://www.slideshare.net/JubatusOfficial/distributed-online-machine-learning-framework-for-big-data

[9] M. Isard et al. – Dryad: distributed data-parallel programs from sequential building blocks. In SIGOPS Operating Systems Review, 2007

[10] KD Nuggets Survey 2012, http://www.kdnuggets.com/software/suites.html

[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein – Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, Proceedings of the VLDB Endowment, Vol. 5, No. 8, August 2012, Istanbul, Turkey

[12] S. Owen, R. Anil, T. Dunning, E. Friedman – Mahout in Action, Manning Publications, 2011, ISBN 978-1935182689

[13] E. Pednault, E. Yom-Tov, A. Ghoting – IBM Parallel Machine Learning Toolbox, in R. Bekkerman, M. Bilenko and J. Langford (editors) – Scaling up Machine Learning, Cambridge University Press, 2012

[14] D. Pop, G. Iuhasz – Survey of Machine Learning Tools and Libraries, Institute e-Austria Timişoara Technical Report, 2011

[15] Rexer Analytics Survey 2011, http://www.rexeranalytics.com/Data-Miner-Survey-Results-2011.html

[16] L. Tierney, A. J. Rossini, Na Li – Snow: A parallel computing framework for the R System, Int J Parallel Prog (2009) 37:78–90, DOI 10.1007/s10766-008-0077-2

[17] S. R. Upadhyaya – Parallel approaches to machine learning: A comprehensive survey, Journal of Parallel and Distributed Computing, Volume 73, Issue 3, March 2013, Pages 284–292

[18] B. Werther – Pre-industrial age of big data, June 2012, http://www.platfora.com/pre-industrial-age-of-big-data/

[19] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Kumar Gunda, J. Currey – DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, In OSDI, 2008

[20] Apache Hadoop Website, http://hadoop.apache.org (2012)

[21] WibiData How It Works, http://www.wibidata.com/product/how-it-works/ (2012)

Daniel Pop received his PhD degree in computer science from West University of Timişoara in 2006. He is currently a senior researcher at the Department of Computer Science, Faculty of Mathematics and Computer Science, West University of Timişoara. His research interests cover high performance computing and distributed computing technologies, machine learning, knowledge discovery and representation, and multi-agent systems. He also has broad experience in the IT industry (over 15 years), where he applied agile software development processes, such as SCRUM and Kanban.

Name       Platform      Licensing   Language  Activity
Mahout     Hadoop        Apache 2    Java      High
GraphLab   MPI / Hadoop  Apache 2    C++       High
DryadLINQ  Dryad         Commercial  .NET      Low
Jubatus    ZooKeeper     LGPL 2      C++       Medium
NIMBLE     Hadoop        ?           Java      Low
SystemML   Hadoop        ?           DML       Low

Table 1. Distributed Frameworks for ML-DM
