
THE DATA ENGINEERING
SKILLS & TOOLS GUIDE

ANDREAS KRETZ
Founder & CEO

LearnDataEngineering.com



CONTENT

1 Data Engineering Skills Guide
2 Data Science Platform Blueprint
3 Tools Guide
4 Typical Pipelines of Data Platforms
    Transactional Pipelines
    Analytical Pipelines
5 Example Platforms & Pipelines
    Transactional Pipelines
        Data Engineering on AWS
        Document Streaming with Kafka, Spark and MongoDB
    Analytical Pipelines
        AWS ETL Pipeline to Redshift
        Modern Data Warehouses
        Hadoop Warehousing with Hive
6 Data Engineering Academy

If you have any questions regarding this document either:


Write me an email to: hello@learndataengineering.com
Or join our Discord server for direct contact:
https://discord.gg/Wxy2mQA7Fy

page 01 LearnDataEngineering.com
SKILLS GUIDE
FOR DATA ENGINEERS

What skills do Data Engineers have? And what's the difference between a
Junior and Senior Data Engineer? In the matrix below I show you
examples for Junior Data Engineer, Data Engineer, Senior Data Engineer,
Data Architect and Machine Learning Engineer.

The matrix lists the roles as columns and the skill categories as rows.

One difference between a junior and a full professional role that is
worth mentioning is the level of self-sufficiency. A junior is told what
to do most of the time, while professional and senior roles usually
develop their goals & work packages themselves.

You'll see that Senior Data Engineer and Data Architect are roughly on
the same level. That means you need experience as a Data Engineer to do
Architecture.

Data Engineer and ML Engineer are also roughly on the same level. The
ML Engineer just has added ML knowledge and focus.

[Figure: skills matrix with roles as columns and skill categories as rows]

DATA PLATFORM BLUEPRINT

Before you start doing projects you need to understand the typical parts
of a data platform. This platform blueprint is very important as it helps
you understand how a platform works. I split the blueprint into the
following phases. To build a data pipeline you just need to combine the
phases.

Connect
The connection phase is the first phase, where the data comes in. This
could be a hosted API gateway to which clients send data. This data then
flows through your platform.

Very often you will also find data integration tools here that access
existing (external) systems such as data warehouses, relational SQL
databases or external APIs. They extract the data and feed it into your
platform and pipelines.

Buffer
In the middle of the blueprint you have the main processing and storage
functions of your platform.

If the source in the connection phase is constantly pushing data in, then
you need some kind of buffer to process it dynamically. That’s where
message queues come in. They absorb the load when data is coming in
faster than your processing can handle, which is the situation that
causes what is called backpressure.

Buffers are also very good if you have multiple processing jobs working
with the same data. You feed the data into the queue once and multiple
consumers can work with it in parallel.

Message queues help you to manage the flow of data within your
platform without having to work with files.
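
To make the buffer idea more concrete, here is a minimal sketch in plain Python. It is not Kafka or any real message queue: the class `InMemoryTopic`, the consumer names and the sensor messages are all hypothetical and only illustrate how one stream of messages can be written once and read by several consumers at their own pace.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy append-only message log. Each consumer keeps its own read offset."""

    def __init__(self):
        self._log = []                    # messages in arrival order
        self._offsets = defaultdict(int)  # consumer name -> next index to read

    def publish(self, message):
        """Producer side: append once, no matter how many consumers exist."""
        self._log.append(message)

    def poll(self, consumer, max_messages=2):
        """Consumer side: read the next batch at the consumer's own pace."""
        start = self._offsets[consumer]
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        """Number of buffered messages this consumer has not read yet."""
        return len(self._log) - self._offsets[consumer]


topic = InMemoryTopic()
for reading in range(5):  # a burst of five incoming messages
    topic.publish({"sensor": "s1", "value": reading})

slow_batch = topic.poll("slow_job", max_messages=2)  # reads only two messages
fast_batch = topic.poll("fast_job", max_messages=5)  # reads the whole burst
```

The slow consumer simply falls behind (its lag grows) while the fast consumer is unaffected, and the producer never has to wait for either of them.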

Process
The processing framework is where things actually happen. This is where
data is taken either from storage or from a message queue, processed and
stored again.

It's almost the most important part of the whole platform, because
without a processing framework, nothing happens. You need a way to
transform and analyze data, and that's what it's there for.

Within the processing framework, there are 2 different types of
processing: stream processing and batch processing. Batch processing is
where you find extract, transform and load (ETL) jobs. After processing,
data goes into a storage.
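
The difference between the two processing types can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the temperature readings and the function names are made up, and a real platform would use a processing framework instead of plain functions.

```python
def to_celsius(reading):
    """Transformation shared by both processing styles."""
    celsius = round((reading["fahrenheit"] - 32) * 5 / 9, 1)
    return {"sensor": reading["sensor"], "celsius": celsius}


def run_batch(readings):
    """Batch: take a complete, bounded data set and return the full result."""
    return [to_celsius(r) for r in readings]


def run_stream(source):
    """Stream: handle records one at a time as they arrive from the source."""
    for reading in source:
        yield to_celsius(reading)


readings = [
    {"sensor": "s1", "fahrenheit": 212.0},
    {"sensor": "s2", "fahrenheit": 32.0},
]
batch_result = run_batch(readings)
stream_result = list(run_stream(iter(readings)))
```

Both styles apply the same transformation. The batch job needs all the data up front, while the stream job can emit each result as soon as the corresponding record arrives.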

Store
The store phase is where you put the data in and store it. Here you find
all kinds of data stores: Relational Databases, NoSQL databases, Data
Warehouses, Data Lakes. You will find the different types again further
below, organized into OLTP and OLAP data stores.

The stored data is typically used again by the processing framework, by
a visualization tool or by an API that serves data to a client.

Visualize
After storing your data, you need to visualize it. That's where tools like
web interfaces, Business intelligence tools or monitoring dashboards
come in. These are the tools a user actually works with to explore and
visualize the data.
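
As a rough sketch of how the five phases combine into one pipeline, the following Python example chains one small function per phase. Everything in it is a stand-in chosen for illustration: JSON strings replace an API gateway, a `deque` replaces a message queue, a dict replaces a database and a text report replaces a dashboard.

```python
import json
from collections import deque


def connect(raw_payloads):
    """Connect: parse incoming JSON payloads (stand-in for an API gateway)."""
    return [json.loads(p) for p in raw_payloads]


def buffer(events):
    """Buffer: queue the events so processing can drain them at its own pace."""
    return deque(events)


def process(queue):
    """Process: drain the queue and sum the amount per customer."""
    totals = {}
    while queue:
        event = queue.popleft()
        totals[event["customer"]] = totals.get(event["customer"], 0) + event["amount"]
    return totals


def store(totals, database):
    """Store: persist the results (a dict stands in for a real data store)."""
    database.update(totals)
    return database


def visualize(database):
    """Visualize: render a plain-text report, highest total first."""
    rows = sorted(database.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(f"{customer}: {total}" for customer, total in rows)


payloads = [
    '{"customer": "alice", "amount": 30}',
    '{"customer": "bob", "amount": 20}',
    '{"customer": "alice", "amount": 15}',
]
database = store(process(buffer(connect(payloads))), {})
report = visualize(database)
```

In a real platform each function would be replaced by a tool from the matching phase, but the order in which the data flows stays the same.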

TOOLS GUIDE
A HELPFUL OVERVIEW

Now that you know the phases of a data platform, what are the tools that
you insert in every phase?

The tools guide table shows which tools are needed in each part of the
platform blueprint and on which kind of platform they are used.

I ordered the columns into cloud native and cloud agnostic (independent
from a specific cloud). Then I split the two into categories.

Cloud native I split into the main clouds: AWS, Azure and Google Cloud
Platform (GCP). Cloud agnostic I split into open source and vendors. This
actually wasn't that easy, because some vendors (like Databricks) are
also cloud native, but I hope you get the point.

As rows I used the phases of the blueprint I showed above and added
subcategories for the tools.

You are going to find some tools in multiple categories. That's completely
normal. Some have multiple functionalities.

With this, you also have a great overview of helpful tool combinations on
clouds, such as Lambda, Glue and ECS for the processing frameworks on
AWS. The guide also allows you to find open source or vendor
alternatives to these cloud native tools.

After this guide you find the two main types of pipelines people build.
Further below, under "example projects and pipelines", you can find
concrete examples of how to arrange tools into pipelines.

[Figure: tools guide table with the blueprint phases as rows and cloud native and cloud agnostic tools as columns]

TYPICAL PIPELINES
OF DATA PLATFORMS

You most likely want to start building and arranging the tools from above.
Before you do that you need to understand the difference between
Online Transaction Processing (OLTP) and Online Analytical Processing
(OLAP).

These are two different processes that many people mix up. They actually
represent two different parts of a data platform: OLTP part for business
processes & OLAP for analytics use cases.

Transactional Pipelines - The Transactional Part of a Platform

The OLTP part of a data platform is often used for implementing business
processes (see upper part in the image above). That’s why OLTP is more
event driven and often uses streaming pipelines to process the incoming
transactions (data) very quickly.

OLTP databases are ACID compliant and support CRUD (create, read, update,
delete) operations. They are the choice for efficiently storing
transactions like store purchases.

You open a transaction, insert or modify your data and, if everything is
correct, the transaction is confirmed (committed) and completed. If the
transaction fails, the store rolls back your transaction and it is as if
nothing had happened.
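
The commit and rollback behaviour can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not a production setup: the tables `stock` and `purchases`, the helper names and the sample data are assumptions made up for this example.

```python
import sqlite3


def create_store():
    """Create an in-memory shop database with three notebooks in stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stock (item TEXT PRIMARY KEY, "
        "quantity INTEGER NOT NULL CHECK (quantity >= 0))"
    )
    conn.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, item TEXT NOT NULL, quantity INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO stock (item, quantity) VALUES ('notebook', 3)")
    conn.commit()
    return conn


def record_purchase(conn, customer, item, quantity):
    """Record a purchase and reduce the stock as one atomic transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO purchases (customer, item, quantity) VALUES (?, ?, ?)",
                (customer, item, quantity),
            )
            conn.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE item = ?",
                (quantity, item),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative stock level, so the
        # purchase row inserted above has been rolled back as well.
        return False
```

Calling `record_purchase(conn, "alice", "notebook", 2)` on a fresh store succeeds, while a following order for five notebooks fails and leaves both tables exactly as they were before that order.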

User interfaces very often use them, for instance, to visualize your
purchase history.

Another example is a factory where you produce goods. The production
line queries the database for the details of what to produce next. After
an item has been produced, the transactional database records that this
item has been produced and some information about the production
process.

Analytical Pipelines - The Analytics Part of a Platform

The analytics part of the platform is usually the part where data comes
in in larger chunks, for example once a day or once an hour. ELT jobs
that run on a schedule are very often used to get the data into an OLAP
store.

The data in OLAP is often less structured than in OLTP, which means you
are more flexible. In OLAP stores, the data is usually simply copied in via
copy commands and not via transactions.

Analysts have all the data at their disposal and are able to look at it from
a completely different perspective - from a holistic perspective. They
gain insights into larger amounts of data.

OLAP stores help analysts to answer questions like: Identify the
customers with the highest value purchases in a month or a year. Another
question could be to find the best-selling products in the shop’s
inventory by month or quarter.
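
The two example questions above can be expressed as aggregate queries. The sketch below uses Python's built-in `sqlite3` as a stand-in for a real OLAP store. The table layout, the function names and the five sample purchases are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical purchases: (date, customer, product, quantity, unit_price)
PURCHASES = [
    ("2024-01-05", "alice", "notebook", 2, 4.0),
    ("2024-01-17", "bob", "pen", 10, 1.0),
    ("2024-01-20", "alice", "pen", 5, 1.0),
    ("2024-02-02", "bob", "notebook", 3, 4.0),
    ("2024-02-09", "carol", "pen", 1, 1.0),
]


def build_olap_store(rows):
    """Bulk-copy all rows into one table, as an analytics load typically does."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (purchase_date TEXT, customer TEXT, "
        "product TEXT, quantity INTEGER, unit_price REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def units_by_month(conn):
    """Units sold per product and month, best seller of each month first."""
    return conn.execute(
        "SELECT substr(purchase_date, 1, 7) AS month, product, "
        "SUM(quantity) AS units FROM purchases "
        "GROUP BY month, product ORDER BY month, units DESC"
    ).fetchall()


def top_customers(conn, month):
    """Customers of one month ordered by the total value of their purchases."""
    return conn.execute(
        "SELECT customer, SUM(quantity * unit_price) AS revenue "
        "FROM purchases WHERE substr(purchase_date, 1, 7) = ? "
        "GROUP BY customer ORDER BY revenue DESC",
        (month,),
    ).fetchall()
```

Both queries scan and aggregate many rows at once instead of reading or changing single records, which is the access pattern that separates OLAP from OLTP.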


EXAMPLE PROJECTS
& PIPELINES

Now that you have learned everything about the most important parts of
building a platform, the pipelines, OLTP and OLAP, as well as all the tools
you can work with, you can start with your very own data engineering
project.

Here is an overview of some example platform and pipeline use cases
which we see all the time in the real world and which I teach to my
students via hands-on project courses:

Transactional Pipelines

Data Engineering on AWS

The AWS project is perfect for everyone who wants to start with Cloud
platforms. Currently, AWS is the most used platform for data processing.
It is really great to use, especially for those people who are new in their
Data Engineering job or looking for one.

Working through AWS, you learn how to set up a complete end-to-end
project with a streaming and analytics pipeline. Based on the project, you
learn how to model data and which AWS tools are important, such as
Lambda, API Gateway, Glue, Redshift, Kinesis and DynamoDB.

Link to the project in our Data Engineering Academy:
https://learndataengineering.com/p/data-engineering-on-aws

Document Streaming with Kafka, Spark and MongoDB

This full end-to-end example project uses e-commerce data that contains
invoices for customers and items on these invoices.

Here, your goal is to ingest the invoices one by one, as they are created,
and visualize them in a user interface. The technologies you will use are
FastAPI, Apache Kafka, Apache Spark, MongoDB and Streamlit - tools you
can learn in the academy individually, which I recommend before getting
into this project.

Among other things, you learn about each step of how to build your
pipeline and which tools to use at which point and get to know how to
work with Apache Spark Structured Streaming in connection with
MongoDB.

Also, you set up your own data visualization and create an interactive
user interface with Streamlit to select and view invoices for customers
and the items on these invoices.
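
To give an idea of what "invoices and the items on these invoices" can look like as documents, here is a small Python sketch that groups flat invoice lines into one nested document per invoice. The field names (`invoice_no`, `customer_id`, `stock_code` and so on) and the sample rows are assumptions for illustration and are not taken from the course data set.

```python
def to_invoice_documents(rows):
    """Group flat invoice line rows into one nested document per invoice."""
    documents = {}
    for row in rows:
        # Create the invoice document on first sight, then reuse it.
        doc = documents.setdefault(row["invoice_no"], {
            "invoice_no": row["invoice_no"],
            "customer_id": row["customer_id"],
            "items": [],
        })
        doc["items"].append({
            "stock_code": row["stock_code"],
            "quantity": row["quantity"],
            "unit_price": row["unit_price"],
        })
    return list(documents.values())


# Hypothetical invoice lines, one row per item on an invoice
ROWS = [
    {"invoice_no": "A1", "customer_id": "C7", "stock_code": "pen",
     "quantity": 2, "unit_price": 1.5},
    {"invoice_no": "A1", "customer_id": "C7", "stock_code": "ink",
     "quantity": 1, "unit_price": 4.0},
    {"invoice_no": "B2", "customer_id": "C9", "stock_code": "pen",
     "quantity": 5, "unit_price": 1.5},
]
documents = to_invoice_documents(ROWS)
```

A nested shape like this fits a document store well, because a user interface can fetch one invoice together with all of its items in a single lookup.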

Sounds like the right project for you? Here you can find it in my academy:
https://learndataengineering.com/p/document-streaming

Analytical Pipelines

AWS ETL Pipeline to Redshift

This project is an ETL pipeline where we take files out of S3 and put the
contents into Redshift. For visualization we use Power BI. (This is part of
the Data Engineering on AWS project.)

As you are not using an API here, the data is stored as a file within S3, as
clients usually send in their data as files. This is very typical for an ETL
job, where the data is extracted from S3 and then transformed and
stored.

A typical tool to use here is AWS Glue. With this tool, you can write
Apache Spark jobs, which pull the data from S3 and then transform and
store it in the destination, for which we recommend Redshift.
Besides the Glue jobs, another part you should make use of is the data
catalog. It catalogs the CSV files within S3 as well as the data in Redshift.
Thus, it becomes very easy to automatically configure and generate a
Spark job in AWS Glue, which sends the data to Redshift.

As said before, the recommended storage for the analytical part is
Redshift, from which the data analysts can evaluate and visualize the
data. It is best to use a single staging table here in which the data is
stored.

The typical method here would be to use Redshift as your analytics
database, to which you ideally connect a BI tool such as Power BI.
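
The extract, transform and load steps of such a job can be sketched locally without any AWS account. The example below is not AWS Glue or Redshift code: a CSV string stands in for a file in S3, SQLite stands in for Redshift, and the file contents, the table name `staging_orders` and the function names are made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file that a client dropped into S3 (hypothetical content)
CSV_FILE = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,,7.50
"""


def extract(text):
    """Extract: read the raw CSV rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast the types and drop rows without a customer."""
    return [
        (int(r["order_id"]), r["customer"].strip(), float(r["amount"]))
        for r in rows
        if r["customer"].strip()
    ]


def load(conn, rows):
    """Load: copy the cleaned rows into a single staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for the analytics database
loaded = load(conn, transform(extract(CSV_FILE)))
```

In the real project a Glue job would play the role of `transform` and `load`, and a BI tool would then query the staging table.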

Modern Data Warehouses


Google Cloud

On the Google Cloud Platform (GCP) you can start by configuring Cloud
Storage, BigQuery and Data Studio. Put a file into the lake, create a
BigQuery table and a Data Studio report that you can share with
anyone you want.

AWS

Or go into AWS to set up a manual data lake integration through S3,
Athena and Quicksight. After that you can change the integration to using
the Glue Data catalog.

As a BONUS LESSON I share in my course how you can do what you did
with AWS Athena through Redshift Spectrum.

Here you can find the complete course in my Data Engineering Academy:
https://learndataengineering.com/p/modern-data-warehouses

Hadoop Data Warehousing with Hive

Hadoop is a Java-based and open source framework that handles all sorts
of storage and processing for Big Data across clusters of computers using
simple programming models.

Working through this project you will learn and master the Hadoop
architecture and its components like HDFS, YARN, MapReduce, Hive and
Sqoop. Understand the detailed concepts of the Hadoop Eco-system
along with hands-on labs as well as the most important Hadoop
commands. Also, learn how to implement and use each component to
solve real business problems!

Learn how to store and query your data with Sqoop, Hive, and MySQL,
write Hive queries and Hadoop commands, manage Big Data on a cluster
with HDFS and MapReduce and manage your cluster with YARN and Hue.

If you think that’s the perfect project for you, get right into it:
https://learndataengineering.com/p/data-engineering-with-hadoop


DATA ENGINEERING
ACADEMY
The best place to learn Data Engineering!
Perfect for becoming a Data Engineer or adding Data Engineering to your
skillset

Huge step by step Data Engineering course

Trusted by over 900 students
12 months or unlimited access to all courses incl. future ones during your subscription
5 theory courses on the most important engineering techniques
10 hands-on courses on the most important tools for engineers
8 hands-on example data engineering projects on major platforms like AWS, Azure, open source platforms and Hadoop
Over 44 hours of video content
Prepared source codes
Private Discord community
Associate Data Engineer Certification and course certificates

Contact us for more information about the Academy or advertising
opportunities:
hello@learndataengineering.com
