
Data Lake Implementation

on Traveloka
Andi N. Dirgantara
Lead Data Engineer
Speaker Profile

● I’m Andi Nugroho Dirgantara

● 5+ years as a software engineer
● 3+ years as a data engineer (big data)
● Lead Data Engineer, Traveloka
● Lead, FB DevC Malang
● Big Data and JavaScript lover
● Father of a 3+ year old son
● Gamer
○ Steam Account: hellowin_cavemen
○ Battle Tag: Hellowin#11826

How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads optimization
● Cross-selling
● A/B testing
● etc.


Overly simplified data architecture on Traveloka

Flow, from Product Side to Data Side:
Web, Android, etc. → Backend → Database → Big Data Platform ? → Data Processing

Data Processing
● Analytics
● Machine Learning
● etc.

How to accommodate:
● Data Scientists
● Data Analysts
● Business Intelligence Tools
Without disrupting production side?

It should be:
● Scalable
● Query-able
● Fault tolerant (reliable)

Solutions exist, but ...


We need a Data Lake
But what is it?
Data Lake by Definitions

● A data lake is a storage repository that holds a vast amount of raw data in its
native format until it is needed. -
● A data lake is a storage repository that holds a vast amount of raw data in its
native format, including structured, semi-structured, and unstructured data.
The data structure and requirements are not defined until the data is needed.
- Tamara Dull (SAS)
● It stores the data in its native (raw) format
● The schema is applied at query time
● Sometimes it is also just a “marketing label”: a shorthand for technology
that is compatible with Hadoop, much as “big data” is a shorthand for
distributed storage and query engines
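
The "schema applied at query time" point can be sketched in a few lines. This is only an illustration, not how any particular engine does it: the event fields, `RAW_EVENTS`, `BOOKING_SCHEMA`, and `read_with_schema` are all made-up names. Raw records are kept untouched, and each reader projects them onto the schema it needs.

```python
import json

# Raw events are stored exactly as they arrived (one JSON document per line).
# The fields below are hypothetical examples, not real Traveloka data.
RAW_EVENTS = [
    '{"user_id": "u1", "event": "search", "price": "120.5"}',
    '{"user_id": "u2", "event": "booking", "price": 300, "currency": "IDR"}',
    '{"user_id": "u3", "event": "search"}',
]

# The schema belongs to the query, not to the storage: field name -> caster.
BOOKING_SCHEMA = {"user_id": str, "event": str, "price": float}


def read_with_schema(raw_lines, schema):
    """Parse raw lines and project each record onto `schema` at read time.

    Fields missing from a record become None; fields not in the schema
    are ignored for this read but remain in the raw data.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if record.get(field) is not None else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, BOOKING_SCHEMA))
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with another schema (for example one that includes `currency`) without rewriting anything in storage.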

Data Lake implementation on Data Team Side

Data Processing
● Analytics
● Machine Learning
● etc.

Data Source (input)
● Stream Processing (Kafka, PubSub, etc.)
● DBs
● Data Warehouse
● etc.

Backend → Big Data Platform ?
● BigQuery
● Hive (S3)
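
For the "Hive (S3)" option, a common pattern is an external table: the files stay in S3 and Hive only holds the schema and location. The sketch below builds such a DDL statement. The table name, columns, partition column, bucket path, and the choice of Parquet are all assumptions for illustration, and `hive_external_table_ddl` is a hypothetical helper, not part of any library.

```python
def hive_external_table_ddl(table, columns, partition_columns, location):
    """Build a Hive CREATE EXTERNAL TABLE statement.

    `columns` and `partition_columns` are lists of (name, hive_type) pairs.
    The data files stay at `location`; dropping the table does not delete them.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"  # assumed file format; the slides do not name one
        f"LOCATION '{location}'"
    )


# Hypothetical table and bucket, for illustration only.
ddl = hive_external_table_ddl(
    table="events.booking",
    columns=[("user_id", "STRING"), ("price", "DOUBLE")],
    partition_columns=[("dt", "STRING")],
    location="s3://example-bucket/raw/booking/",
)
print(ddl)
```

The generated statement would be run through a Hive client; once the table is registered in the metastore, Presto can query the same data through its Hive connector.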


Hive + Presto
● Deployed on Amazon Web Services (AWS)
● Self-hosted and self-managed
● Hadoop family

BigQuery
● Deployed on Google Cloud Platform (GCP)
● Managed service
● GCP family

Hive + Presto Pros and Cons

Pros
● More flexible to manage (self-managed)
○ Able to define nodes, replication factor, cluster, etc.
○ Able to specify node specs.
● Good integration with the rest of the Hadoop ecosystem
○ Spark
○ Kafka
○ Impala
● More mature
● Open source

Cons
● Harder to maintain (also because it is self-managed)

BigQuery

Pros
● Easier to maintain (managed by GCP)
● Good integration with other GCP managed tools
○ Dataflow
○ PubSub
○ Cloud Storage
● Enterprise ready, support is

Cons
● Less mature compared to the Hadoop ecosystem
● API is still limited (no Scala API)
● Unable to store data on S3; data needs to be on Cloud Storage
● Closed source


● We still use AWS and GCP side by side
● Maintainability is one thing, but in industry, value is everything
● The Big Data stack is moving fast
● It’s the Data Engineer’s responsibility to keep migration agile
● There’s no “one size fits all” solution
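
One way to keep a migration between engines cheap is to have pipelines depend on a small interface instead of a specific client. The sketch below is an assumption about how that could look, not a description of Traveloka's code: `QueryEngine`, `InMemoryEngine`, `daily_bookings`, and the `bookings` table are hypothetical, and the in-memory class stands in for real Presto or BigQuery clients.

```python
from typing import Protocol


class QueryEngine(Protocol):
    """Anything that can run SQL and return rows as dictionaries."""

    def run(self, sql: str) -> list[dict]: ...


class InMemoryEngine:
    """Stand-in for a real engine client: returns canned rows, records queries."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.queries = []

    def run(self, sql):
        self.queries.append(sql)
        return list(self.rows)


def daily_bookings(engine: QueryEngine, day: str) -> int:
    """Pipeline step that only knows the interface, not the engine behind it."""
    rows = engine.run(f"SELECT COUNT(*) AS n FROM bookings WHERE dt = '{day}'")
    return rows[0]["n"]


# Two hypothetical backends answering the same pipeline step.
presto_like = InMemoryEngine("hive-presto", [{"n": 42}])
bigquery_like = InMemoryEngine("bigquery", [{"n": 42}])

for engine in (presto_like, bigquery_like):
    print(engine.name, daily_bookings(engine, "2017-01-01"))
```

In practice SQL dialects differ between engines, so a real adapter would also have to translate or template the queries; the sketch only shows where that seam would sit.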

References and Other Presentations

● How Big Data Platform Handle big Things

● How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive

Thank you for your time.
We are hiring...