
Data Lake Implementation at Traveloka
Andi N. Dirgantara
Lead Data Engineer
Speaker Profile

● I’m Andi Nugroho Dirgantara


● 5+ years as a software engineer
● 3+ years as a data engineer (big data)
● Lead Data Engineer, Traveloka
● Lead, FB DevC Malang
● Big Data and JavaScript lover
● Father of a 3+ year old son
● Gamer
○ Steam Account: hellowin_cavemen
○ Battle Tag: Hellowin#11826

How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads optimization
● Cross selling
● AB Test
● etc.

Problems

Overly simplified data architecture at Traveloka

Product Side
● Client
○ Web
○ Android
○ etc.
● Backend
● Database

Data Side
● Big Data Platform (?)
● Data Processing
○ Analytics
○ Machine Learning
○ etc.

How to accommodate:
● Data Scientists
● Data Analysts
● Business Intelligence Tools
without disrupting the production side?

It should be:
● Scalable
● Query-able
● Fault tolerant (reliable)

Solutions exist, but ...

source: mattturck.com/bigdata2017

We need a Data Lake
But what is it?
Data Lake by Definition

● A data lake is a storage repository that holds a vast amount of raw data in its
native format until it is needed. - http://searchaws.techtarget.com
● A data lake is a storage repository that holds a vast amount of raw data in its
native format, including structured, semi-structured, and unstructured data.
The data structure and requirements are not defined until the data is needed.
- Tamara Dull, (SAS), https://www.kdnuggets.com
● It stores data in its native (raw) format
● The schema is applied at query time
● Sometimes it is also just a “marketing label” that gives people a simple name for Hadoop-compatible technology, much like the term “big data” for distributed storage and query engines
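
To make "schema applied at query time" concrete, here is a minimal Python sketch. It is an illustration only, not Traveloka's implementation: the sample events, the field names, and the `read_with_schema` helper are all hypothetical. Raw JSON lines are stored untouched, and each query brings its own schema.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user_id": "42", "event": "search", "price": "120.5"}',
    '{"user_id": "7", "event": "booking", "price": "310", "coupon": "NEWUSER"}',
    '{"user_id": "42", "event": "booking"}',
]

# The schema belongs to the query, not to the storage layer (assumed fields).
BOOKING_SCHEMA = {"user_id": int, "event": str, "price": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and cast only the fields named in the schema."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, BOOKING_SCHEMA))
bookings = [r for r in rows if r["event"] == "booking"]
total = sum(r["price"] for r in bookings if r["price"] is not None)
print(total)
```

Fields the schema does not name (such as `coupon`) stay in the raw data and can be picked up later by a different schema without rewriting anything.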

Data Lake implementation on Data Team Side

Data Source (input)
● Stream Processing (Kafka, PubSub, etc.)
● DBs
● Data Warehouse
● etc.

Backend Big Data Platform (?)
● BigQuery
● Hive (S3) + Presto

Data Processing (output)
● Analytics
● Machine Learning
● etc.

Hive + Presto
● Deployed on Amazon Web Services (AWS)
● Self hosted and self managed
● Hadoop family

BigQuery
● Deployed on Google Cloud Platform (GCP)
● Managed service
● GCP family
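
On the Hive (S3) side, a table is typically just metadata pointing at files in a bucket. The Python sketch below renders such a `CREATE EXTERNAL TABLE` statement. The table name, columns, bucket, and the `hive_external_table_ddl` helper are hypothetical, and Parquet is an assumed file format; the deck does not say which format is used.

```python
def hive_external_table_ddl(table, columns, location, partition_columns=None):
    """Render a Hive CREATE EXTERNAL TABLE statement for Parquet files."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_columns:
        parts = ", ".join(f"`{name}` {hive_type}" for name, hive_type in partition_columns)
        ddl += f"\nPARTITIONED BY ({parts})"
    # EXTERNAL means Hive tracks only metadata; the files stay in the bucket.
    ddl += f"\nSTORED AS PARQUET\nLOCATION '{location}'"
    return ddl


# Hypothetical table and bucket names, for illustration only.
ddl = hive_external_table_ddl(
    table="lake.booking_events",
    columns=[("user_id", "BIGINT"), ("event", "STRING"), ("price", "DOUBLE")],
    location="s3://example-data-lake/booking_events/",
    partition_columns=[("dt", "STRING")],
)
print(ddl)
```

Once a table like this is registered in the Hive metastore, Presto can query it through its Hive connector, and dropping the external table removes only the metadata, not the data in S3.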

Hive + Presto Pros and Cons

Pros
● More flexible to manage (self managed)
○ Able to define nodes, replication factor, cluster, etc.
○ Able to specify node specs
● Good integration with the rest of the Hadoop ecosystem
○ Spark
○ Kafka
○ Impala
● More mature
● Open source

Cons
● Harder to maintain (also because it is self managed)

BigQuery Pros and Cons

Pros
● Easier to maintain (managed by GCP)
● Good integration with other GCP managed tools
○ Dataflow
○ PubSub
○ Cloud Storage
● Enterprise ready, with 24/7 support

Cons
● Less mature compared to the Hadoop ecosystem
● API is still limited (no Scala API)
● Unable to store data on S3; it needs to be on Cloud Storage
● Closed source

Conclusions

● We still use AWS and GCP side by side


● Maintainability is one thing, but in industry, value is everything
● Big Data stack is moving so fast
● It’s Data Engineer’s responsibility to make the migration agile
● There’s no “one size fits all” solution

References and Other Presentations

● How Big Data Platform Handle big Things
(https://speakerdeck.com/hellowin/how-big-data-platform-handle-big-things)
● How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive
(https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c)

Thank you for your time.
We are hiring...

visit https://www.traveloka.com/en/careers