You are on page 1of 30

Introduction to Data

Engineering
Disclaimer
● I am not the best, I simply love
what I do VERY much.
● You are more than welcome to
challenge me or anything I have
to say as I could be wrong.
In the Past (web, api, ops db, data
warehouse)

5
Then came Big Data...

6
Then came the cloud...

7
Then came the invoice ...
It keeps
growing….
Solution?

Data
Cloud Big Data
Engineering

10
Data Engineers
• Our definition of data engineering includes what some companies
might call Data Infrastructure or Data Architecture. The data
engineer gathers and collects the data, stores it, does batch
processing or real-time processing on it, and serves it via an API to a
data scientist who can easily query it.

• Data engineers deal with raw data that contains human, machine or
instrument errors. The data might not be validated and contain
suspect records; It will be unformatted and can contain codes that are
system-specific.

• Data engineers need to ensure that the architecture that is in place


supports the requirements of the data scientists and the stakeholders,
the business.
• Accordingly, the main task of engineers is to provide a reliable
infrastructure for data and facilitate data flow through pipeline.
Data Engineering VS Data Science
● Architecture ● Predictive analytics
● Data Platform scalability
○ Data
○ Faster ○ Recognition
○ Cheaper ○ User behaviour
○ Simpler ○ NLP
○ More secure ● Recognition
● Design ETL pipeline ○ Vision
○ Speech
● Network, Security &
○ Video
Regulation
Cloud VS DC ?
Cloud Data Center

● Agile innovation ● Change require time


● Scalable ● Design for peek
● Cheap to get started ● Costly to get started
● Easy to learn ● Harder to learn
● PaaS and managed ● DIY
services
Which one is faster?
Which one is cheaper? Which one is simpler?
Which one is more secure?
Scale Up VS Out

Scale Up Scale Out


● Small cluster ● Add more servers
● Usually active/passive ● Distributed : Each
● Increase resources node can handle a
per machine fraction of the task
● Pros ● Pros
○ Power Queries faster? ○ Parallelism
○ Joins ● Cons
● Cons cheaper?
○ Power Queries
○ Parallelism simpler? ○ Joins
Fixed cost VS PayAsYouGo

faster?
cheaper?
simpler?
Streaming VS batch Processing
The execution of a series of programs each on a set or "batch" of inputs, rather than a single input
which would instead be a custom job

Streaming Data is data that is generated continuously by thousands of data sources, which typically send in
the data records simultaneously, and in small sizes (order of Kilobytes)
Data Engineering landscape @ GCP

BigTable

DataFlow Cloud
SQL
Data Engineering landscape @ AWS

Spectrum DynamoDB

g
Data Engineering Landscape @ open source
Data Engineering Landscape @ Hadoop
DE challenges
● What is the company use case with data?
● Where should we build the data platform (cloud or DC)?
○ Which cloud? Which is one is cheaper?
● What technologies ?
○ Which new ones do we embrace why?
○ Which ones do we depreciate and why?
● Is the data structured? Semi structured? Unstructured?
● Is SQL good enough for the use case?
● How to build DE and DS cost effective development pipeline?
● How to communicate change in the company?
● How much time is spend on development (query time/ wait time)
● How much is going to cost me in the end of the month?
● How can we simplify the process of data development?
● Regulation?
Pop quiz, hotshot!

How much percent of the


monthly infrastructure
budget can saved by
applying DE
methodologies ?
Pop quiz, hotshot!

How much faster can your


query run by applying DE
methodologies ?
Pop quiz, hotshot!

How simple is it to use


your data platform ?
Summary… Data Engineering is all about:

Faster Cheaper Simpler


Thank You

30

You might also like