You can keep pouring data into a crater, but you won't get a data lake. Of course, you need massively scalable
data storage to start with, such as the Hadoop Distributed File System (HDFS).
But the goal of architecting a data lake is to enable continuous delivery, integration and recycling of
data for a multitude of diverse analytical applications. A data lake supports volume, variety and velocity that far
exceed the capabilities of a traditional data warehouse. More important, it offers a platform to
analyze and correlate data from various sources in a way that can lead to new insights, going beyond
the task of populating predefined data cubes.
In this workshop, we will walk you through an illustrative set of steps to design and implement a data
lake, using data from financial markets.
Does your organization need to invest in a data lake? We will start off with a discussion of why you
should consider designing a data lake in addition to your existing enterprise data warehouse.
Then we will explain a few salient features and building blocks of a data lake architecture through a
demo.
The PoC is built on HDFS, Pentaho Data Integration and Apache Spark, all of which are widely
deployed open-source platforms.
We intend to drive home the architectural elements through our Proof of Concept rather than keep you
staring at a beautiful waterfront :-) Hopefully you can take away a set of documented steps to rebuild it
yourself at your organization, using our PoC as a reference point to start from.