
Evaluative Summary on Databricks value propositions

The basic need to know what is going on with the business drives platform adoption and technology
buying decisions. Databricks is a managed platform for running Apache Spark that aims to provide a fast,
general-purpose GUI for large-scale data analysis. It is an implementation of Spark that reduces the complexity
of setup and operation by providing dashboards and scheduled jobs. The client does not have to learn
cluster management concepts or perform Spark cluster maintenance. It is a point-and-click interface for
data analysts and BI professionals, with options to automate data jobs and integrate with private AWS clusters.
Its core components are:

1. Workspaces (file storage),
2. Libraries (Python and Java libraries that extend functionality),
3. SQL tables (equivalent to ordinary SQL tables),
4. Clusters (managed Spark cluster instances),
5. Jobs (scheduled data workloads) and
6. Notebooks (similar to Jupyter notebooks, Apache Zeppelin and R Notebooks; they execute Scala,
Python and SQL code and show the results in the same document)

My first impression, from using the Databricks community version, was of a merger of visualization suites
like inCites / Exploratory.io and live-code tools like Jupyter / Zeppelin. It felt like an investigative
convenience tool that pulls in the functionality of Apache Spark and presents it in a web-based interface.
Since Spark became a top-level Apache project in 2014, it has improved tremendously in the specific areas
of data integration, ETL, machine learning and visualization. Data scientists can now use Python APIs to
run BI code, and visualization tools like QlikView / Tableau can connect directly to Spark SQL. The data
scientist responsible for driving insights is most likely already proficient in all of the aforementioned tools.
The question then arises: what value would Databricks add to the existing and rapidly evolving
infrastructure?

Databricks presents itself as a convenience tool that anyone can be trained on, offering easy cluster
management, ease of setup, collaboration, visualization, etc. Although the Databricks web-based interface saves
time on visualizations, it certainly restricts customization of machine learning frameworks, especially deep
learning. This forces power users to restrict queries and analysis to the bounds of the web-based
system. As a data scientist, I have used similar systems previously, and I would still prefer Python and R
over a web-based tool for the heavy lifting and flexibility. Certain areas will undergo massive change
with the use of transfer learning (a deep learning technique): real-time processing for outlier and fraud
detection, and recommendations based on user feedback. The web-based system shows no support for
transfer learning, which remains only a vision in the company profile.

It is important to note that Databricks was founded by the creators of Apache Spark, and this has played a
huge part in their success in seed funding rounds. Their popularity among venture capitalists should not
overshadow the point that companies would be better off adding another talented data scientist for the price
of the annual subscription.

Appendix: Quick review of the latest offerings in Spark

Spark is paving new ways to give data scientists easier access to big data. This is reflected in its latest
architecture and platform integrations. The most recent update is the introduction of the Dataset, a combination
of RDDs and DataFrames. Datasets let users work with typed objects as with an RDD and query as with a DataFrame.
Datasets are predicted to be the way forward for Spark data structures.
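
The idea of typing like an RDD while querying like a DataFrame can be illustrated with a plain-Python analogy. This is only a sketch and does not use Spark: `Sale` and `TypedTable` are hypothetical names invented here, everything runs on one machine, and the real typed Dataset API is offered in Scala and Java.

```python
from dataclasses import dataclass, fields
from typing import Callable, Generic, List, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Sale:
    """Hypothetical record type; its field names double as column names."""
    region: str
    amount: float


class TypedTable(Generic[T]):
    """Toy, single-machine stand-in for the Dataset idea (not a Spark API)."""

    def __init__(self, record_type: Type[T], rows: List[T]) -> None:
        self.record_type = record_type
        # The schema is derived from the record type, as column names.
        self.columns = [f.name for f in fields(record_type)]
        self.rows = list(rows)

    def filter(self, predicate: Callable[[T], bool]) -> "TypedTable[T]":
        # RDD-style: operate on whole typed objects with ordinary functions.
        return TypedTable(self.record_type, [r for r in self.rows if predicate(r)])

    def select(self, *names: str) -> List[Tuple]:
        # DataFrame-style: address data by column name, checked against the schema.
        for name in names:
            if name not in self.columns:
                raise KeyError(f"unknown column: {name}")
        return [tuple(getattr(r, n) for n in names) for r in self.rows]


sales = TypedTable(Sale, [Sale("east", 120.0), Sale("west", 80.0), Sale("east", 45.5)])
east = sales.filter(lambda s: s.region == "east")  # typed access to s.region
print(east.select("amount"))
```

The `filter` call works on typed objects, while `select` addresses the same data by column name; a Dataset is meant to offer both styles over one structure.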

The main Spark components are:

1. Spark Core Engine (includes task distribution, scheduling and I/O),
2. Spark SQL (now using DataFrames and Datasets),
3. Spark Streaming (micro-batches using the Lambda architecture, i.e. incremental aggregation),
4. MLlib (9x faster than Mahout),
5. SparkR (interface for connecting to a Spark cluster from RStudio; distributes data as DataFrames across
all nodes) and
6. GraphX (graph jobs that identify nodes and edges)
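
The micro-batch model listed for Spark Streaming can be sketched in plain Python. This is a simplified, single-machine illustration and not Spark code: `micro_batches` and `running_counts` are hypothetical helpers, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a stream into small batches (by count here, an assumption for simplicity)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_counts(stream: Iterable[str], batch_size: int) -> Iterator[Dict[str, int]]:
    """Fold each batch into running totals instead of recomputing from scratch."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(Counter(batch))  # incremental aggregation step
        yield dict(totals)             # snapshot of the totals after this batch


events = ["click", "view", "click", "buy", "view", "click"]
snapshots = list(running_counts(events, batch_size=2))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot reflects all events seen so far, but only the newest batch is processed at each step, which is the point of incremental aggregation.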

Recent versions of Spark can be run from Jupyter notebooks, Apache Zeppelin and RStudio. Spark's APIs
now natively support Java, Scala, Python and R, with interactive shells for the latter three. Earlier
Java-based, batch-oriented technologies like MapReduce and its abstractions such as Hive, Pig and Mahout
are being phased out because of their slow and tedious performance.

Legend
RDD (Resilient Distributed Dataset): a collection of elements of varying data types, partitioned across the cluster
DataFrame: a distributed collection built on RDDs in which the data is organized into named columns with a fixed schema

Author:
Saad Sadiq, PhD candidate
College of Engineering
University of Miami
Coral Gables, FL