WORKSHOP - Fathoming A Data Lake: A Proof of Concept

Keep pouring data into a crater and you still won't get a data lake. Of course, you need massively
scalable data storage to start with, such as the Hadoop Distributed File System (HDFS).

But the goal of architecting a Data Lake is to enable continuous delivery, integration and recycling of
data for a multitude of diverse analytical applications. It supports volume, variety and velocity that far
exceed the capabilities of a traditional data warehouse. More importantly, it offers a platform to
analyze and correlate data from various sources in a way that can lead to new insights, going beyond
the task of populating predefined data cubes.

In this workshop we will walk you through an illustrative set of steps to design and implement a data
lake, using data from financial markets.

Does your organization need to invest in a data lake? We will start off with a discussion on why you
should consider designing a Data Lake in addition to your existing enterprise data warehouse.

Then we will explain a few salient features and building blocks of a Data Lake architecture, through a
demo, including:

• Setting up a scalable HDFS Layer
• Data Ingestion in Batch and Streaming mode
• Data Curation, Discovery and Metadata
• Extensible Pluggable Data Analytics Infrastructure enabling
    o Statistical Analysis
    o Data Mining
    o Event Processing
    o Machine Learning, and more
  from structured, semi-structured and unstructured data
• Data Integration into downstream applications including dashboards and decision support
  systems
• Enabling multi-tenant access for various stakeholders and departments in an organization

The PoC is built upon HDFS, Pentaho Data Integration and Apache Spark, all of which are widely
deployed open-source platforms.
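
To make the batch ingestion and metadata building blocks a little more concrete, here is a minimal sketch in plain Python. It is not the workshop's PoC code: a local directory stands in for HDFS, and the names `ingest_batch`, `discover`, the `raw/<dataset>/` layout and the `catalog.jsonl` file are all hypothetical, chosen only for illustration.

```python
import csv
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_batch(source_file: Path, lake_root: Path, dataset: str) -> dict:
    """Copy one CSV batch into the raw zone and record a catalog entry."""
    data = source_file.read_bytes()
    checksum = hashlib.sha256(data).hexdigest()

    # Hypothetical raw-zone layout: <lake_root>/raw/<dataset>/<checksum prefix>.csv
    target_dir = lake_root / "raw" / dataset
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{checksum[:12]}.csv"
    target.write_bytes(data)

    # Minimal metadata for curation and discovery: schema and row count.
    with target.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)

    entry = {
        "dataset": dataset,
        "path": str(target),
        "columns": header,
        "rows": row_count,
        "sha256": checksum,
        "source": str(source_file),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

    # Append-only catalog: one JSON document per line.
    with (lake_root / "catalog.jsonl").open("a") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return entry


def discover(lake_root: Path, column: str) -> list:
    """Return the dataset name of every catalog entry whose schema has `column`."""
    catalog = lake_root / "catalog.jsonl"
    if not catalog.exists():
        return []
    found = []
    with catalog.open() as handle:
        for line in handle:
            entry = json.loads(line)
            if column in entry["columns"]:
                found.append(entry["dataset"])
    return found


if __name__ == "__main__":
    # Demo with made-up market data in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "ticks.csv"
        sample.write_text("symbol,price\nABC,10.5\nXYZ,20.0\n")
        ingest_batch(sample, Path(tmp) / "lake", "ticks")
        print(discover(Path(tmp) / "lake", "price"))
```

The point of the sketch is the pairing: every file that lands in the lake gets a catalog record at the same time, so it can be found later by its schema rather than by its path. In a real deployment the copy step would presumably be handled by an ingestion tool and the catalog by a metadata service.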

We intend to drive home the architectural elements through our Proof of Concept and not keep you
staring at a beautiful waterfront :-) Hopefully you can take away a set of documented steps to rebuild it
yourself in your organization, using our PoC as a reference point to start from.
