Professional Documents
Culture Documents
ENGINEERING
SKILLS & TOOLS
GUIDE
ANDREAS KRETZ
Founder & CEO
CONTENT
page 01 LearnDataEngineering.com
SKILLS GUIDE
FOR DATA ENGINEERS
What skills do Data Engineers have? And what's the difference between a
Junior and Senior Data Engineer? In the matrix below I show you
examples for Junior Data Engineer, Data Engineer, Senior Data Engineer,
Data Architect and Machine Learning Engineer.
One difference between a junior and full professional role that is worth
mentioning is the level of self-sufficiency. A junior most of the time gets
told what to do while professional and senior roles usually develop their
goals & work-packages themselves.
You'll see that Senior Data Engineer and Data Architect are quite on the
same level. That means you need experience as a Data Engineer to do
Architecture.
Data Engineer and ML Engineer also are roughly on the same level. The
ML Engineer just has this added ML knowledge and focus
page 03 LearnDataEngineering.com
SKILLS GUIDE FOR DATA ENGINEERS
page 04 LearnDataEngineering.com
DATA
PLATFORM BLUEPRINT
Before you start doing projects you need to understand the typical parts
of a data platform. This platform blueprint is very important as it helps
you understand how a platform works. I split the blueprint into the
following phases. To build a data pipeline you jut need to combine the
phases.
Connect
The connection phase is the first phase where the data comes in. This
could be a hosted API gateway where clients send data to. This data then
flows through your platform.
You’ll also find here very often data integration tools that access existing
(external) systems such as data warehouses, relational SQL databases or
external APIs. They extract the data and feed it into your platform and
pipelines.
page 05 LearnDataEngineering.com
Buffer
In the middle of the blueprint you have the main processing and storage
functions of your platform.
Buffers are also very good if you have multiple processings working with
the same data. You feed the data into the queue once and multiple
consumers can work with it in parallel.
Message queues help you to manage the flow of data within your
platform without having to work with files.
Process
T h e p r o c e s s i n g f r a me wo r k i s wh e r e t h i n g s a c t u a l l y h a p p e n . T h i s i s wh e r e
d a t a i s t a k e n e i t h e r f r o m s t o r a g e o r f r o m a me s s a g e q u e u e , p r o c e s s e d a n d
stored again.
I t ' s a l mo s t t h e mo s t i mp o r t a n t p a r t o f t h e wh o l e p l a t f o r m, b e c a u s e
wi t h o u t a p r o c e s s i n g f r a me wo r k , n o t h i n g h a p p e n s . Y o u n e e d a wa y t o
t r a n s f o r m a n d a n a l y z e d a t a , a n d t h a t ' s wh a t i t ' s t h e r e f o r .
Wi t h i n t h e p r o c e s s i n g f r a me wo r k , t h e r e a r e 2 d i f f e r e n t t y p e s o f
processing: stream processing and batch processing. Batch processing is
wh e r e y o u f i n d e x t r a c t t r a n s f o r m a n d l o a d ( E T L ) j o b s . A f t e r p r o c e s s i n g
data goes into a storage.
page 06 LearnDataEngineering.com
Store
T h e s t o r e p h a s e i s wh e r e y o u p u t t h e d a t a i n a n d s t o r e i t . He r e y o u f i n d
a l l k i n d s o f d a t a s t o r e s : R e l a t i o n a l D a t a b a s e s , N o S QL d a t a b a s e s , D a t a
Wa r e h o u s e s , D a t a L a k e s . Y o u wi l l f i n d t h e d i f f e r e n t t y p e s a g a i n f u r t h e r
b e l o w o r g a n i z e d i n t o OL T P a n d OL A P d a t a s t o r e s .
Visual ize
A f t e r s t o r i n g y o u r d a t a , y o u n e e d t o v i s u a l i z e i t . T h a t ' s wh e r e t o o l s l i k e
web interfaces, Business intelligence tools or monitoring dashboards
c o me i n . S o me t h i n g wh e r e a c t u a l l y a u s e r i s wo r k i n g wi t h a n d v i s u a l i z e s
the data.
page 07 LearnDataEngineering.com
TOOLS GUIDE
A HELPFUL OVERVIEW
No t h a t y o u k n o w t h e p h a s e s o f a d a t a p l a t f o r m , wh a t a r e t h e t o o l s t h a t
you insert in every phase?
On t h e n e x t p a g e, y o u c a n s e e w h i c h t o o l s a r e n e e d e d i n e a c h p a r t o f t h e
p l a t f o r m b l u e p r i n t a n d f o r wh i c h k i n d o f p l a t f o r m t h e y c o m e t o u s e .
I o r d e r e d t h e c o l u mn s i n t o c l o u d n a t i v e a n d c l o u d a g n o s t i c ( i n d e p e n d e n t
f r o m a s p e c i f i c c l o u d ) . T h e n I s p l i t t h e t wo i n t o C a t e g o r i e s .
C l o u d n a t i v e I s p l i t i n t o t h e m a i n c l o u d s : A WS , A z u r e a n d G o o g l e C l o u d
Platform (GCP). Cloud Agnostic I split into open source and vendors. This
wasn't actually not that easy, because some vendors (like Databricks) are
also cloud native, but I hope you get the point.
A s r o ws I u s e d t h e p h a s e s o f t h e b l u e p r i n t I s h o we d a b o v e a n d a d d e d s u b
categories for the tools.
Y o u a r e g o i n g t o f i n d s o me t o o l s i n m u l t i p l e c a t e g o r i e s . T h a t ' s c o m p l e t e l y
n o r ma l . S o me h a v e mu l t i p l e f u n c t i o n a l i t i e s .
Wi t h t h i s , y o u a l s o h a v e a g r e a t o v e r v i e w o f h e l p f u l t o o l c o mb i n a t i o n s o n
c l o u d s , s u c h a s L a mb d a , G l u e a n d E C S f o r t h e p r o c e s s i n g f r a m e w o r k s o n
A WS . T h e g u i d e a l s o a l l o ws y o u t o f i n d o p e n s o u r c e o r v e n d o r
alternatives to these cloud native tools.
A f t e r t h i s g u i d e y o u f i n d t h e t wo m a i n t y p e s o f p i p e l i n e s p e o p l e b u i l d .
Y o u c a n f i n d c o n c r e t e e x a mp l e s f u r t h e r b e l o w u n d e r " e x a m p l e p r o j e c t s
and pipelines" how to arrange tools together to pipelines
page 08 LearnDataEngineering.com
TOOLS GUIDE
page 09 LearnDataEngineering.com
TYPICAL PIPELINES
OF DATA PLATFORMS
You most likely want to start building and arranging the tools from above.
Before you do that you need to understand the difference between
Online Transaction Processing (OLTP) and Online Analytical Processing
(OLAP).
These are two different processes that many people mix up. They actually
represent two different parts of a data platform: OLTP part for business
processes & OLAP for analytics use cases.
page 10 LearnDataEngineering.com
Transactional Pipelines - The Transactional Part of a
Platform
T h e OL T P p a r t o f a d a t a p l a t f o r m i s o f t e n u s e d f o r r e a l i z i n g b u s i n e s s
p r o c e s s e s ( s e e u p p e r p a r t i n t h e i m a g e a b o v e ) . T h a t ’ s wh y O L T P i s m o r e
event driven and often uses streaming pipelines to process the incoming
transactions (data) very quickly.
OL T P d a t a b a s e s a r e A C I D a n d C R U D c o m p l i a n t . T h e y a r e t h e c h o i c e f o r
efficiently storing transactions like store purchases.
User interfaces use them very often to, for instance, visualizing your
purchase history.
A n o t h e r e x a mp l e i s a f a c t o r y w h e r e y o u p r o d u c e g o o d s . T h e p r o d u c t i o n
l i n e q u e r i e s f r o m t h e d a t a b a s e t h e d e t a i l s o f wh a t t o p r o d u c e n e x t . A f t e r
an item has been produced, the transactional database records that this
item has been produced and some information about the production
process.
page 11 LearnDataEngineering.com
Analytical Pipelines - The Analytics Part of a Platform
T h e a n a l y t i c s p a r t o f t h e p l a t f o r m i s u s u a l l y t h e p a r t wh e r e d a t a i s c o m i n g
in in larger chunks. Like once a day or once an hour. ELT jobs that run on
a s c h e d u l e a r e v e r y o f t e n u s e d t o g e t t h e d a t a i n t o a n OL A P s t o r e .
T h e d a t a i n OL A P i s o f t e n l e s s s t r u c t u r e d t h a n i n OL T P , wh i c h m e a n s y o u
a r e mo r e f l e x i b l e . I n OL A P s t o r e s , t h e d a t a i s u s u a l l y s i m p l y c o p i e d i n v i a
c o p y c o mma n d s a n d n o t v i a t r a n s a c t i o n s .
Analysts have all the data at their disposal and are able to look at it from
a c o mp l e t e l y d i f f e r e n t p e r s p e c t i v e - f r o m a h o l i s t i c p e r s p e c t i v e . T h e y
g a i n i n s i g h t s i n t o l a r g e r a mo u n t s o f d a t a .
OL A P s t o r e s h e l p a n a l y s t s t o a n s we r q u e s t i o n s l i k e : I d e n t i f y t h e
c u s t o me r s wi t h t h e h i g h e s t v a l u e p u r c h a s e s i n a m o n t h o r a y e a r . A n o t h e r
question could be to find the best-selling products in the shop’s
inventory by month or quarter.
page 12 LearnDataEngineering.com
Page 01
EXAMPLE PROJECTS
& PIPELINES
No w t h a t y o u h a v e l e a r n e d e v e r y t h i n g a b o u t t h e m o s t i m p o r t a n t p a r t s o f
b u i l d i n g a p l a t f o r m, t h e p i p e l i n e s , OL T P a n d OL A P , a s we l l a s a l l t h e t o o l s
y o u c a n wo r k w i t h , y o u c a n s t a r t wi t h y o u r v e r y o wn d a t a e n g i n e e r i n g
project.
He r e i s a n o v e r v i e w o f s o me e x a m p l e p l a t f o r m a n d p i p e l i n e u s e c a s e s
w h i c h w e s e e a l l t h e t i me i n t h e r e a l wo r l d a n d wh i c h I t e a c h t o m y
students via hands-on project courses:
page 13 LearnDataEngineering.com
T h e A WS p r o j e c t i s p e r f e c t f o r e v e r y o n e wh o wa n t s t o s t a r t w i t h C l o u d
p l a t f o r ms . C u r r e n t l y , A WS i s t h e m o s t u s e d p l a t f o r m f o r d a t a p r o c e s s i n g .
I t i s r e a l l y g r e a t t o u s e , e s p e c i a l l y f o r t h o s e p e o p l e wh o a r e n e w i n t h e i r
Data Engineering job or looking for one.
Wo r k i n g t h r o u g h A WS , y o u l e a r n h o w t o s e t u p a c o m p l e t e e n d - t o - e n d
p r o j e c t wi t h a s t r e a mi n g a n d a n a l y t i c s p i p e l i n e . B a s e d o n t h e p r o j e c t , y o u
l e a r n h o w t o m o d e l d a t a a n d w h i c h A WS t o o l s a r e i m p o r t a n t , s u c h a s
L a mb d a , A P I G a t e wa y , G l u e , R e d s h i f t K i n e s i s a n d D y n a m o D B .
page 14 LearnDataEngineering.com
Document Streaming with Kafka, Spark and MongoDB
T h i s f u l l e n d - t o - e n d e x a mp l e p r o j e c t u s e s e - c o m m e r c e d a t a t h a t c o n t a i n s
i n v o i c e s f o r c u s t o me r s a n d i t e m s o n t h e s e i n v o i c e s .
He r e , y o u r g o a l i s t o i n g e s t t h e i n v o i c e s o n e b y o n e , a s t h e y a r e c r e a t e d ,
and visualize them in a user interface. The technologies you will use are
F a s t A P I , A p a c h e K a f k a , A p a c h e S p a r k , Mo n g o D B a n d S t r e a ml i t - t o o l s y o u
c a n l e a r n i n t h e a c a d e my i n d i v i d u a l l y , wh i c h I r e c o m m e n d b e f o r e g e t t i n g
into this project.
A mo n g o t h e r t h i n g s , y o u l e a r n a b o u t e a c h s t e p o f h o w t o b u i l d y o u r
p i p e l i n e a n d wh i c h t o o l s t o u s e a t wh i c h p o i n t a n d g e t t o k n o w h o w t o
w o r k wi t h A p a c h e S p a r k s t r u c t u r e d s t r e a m i n g i n c o n n e c t i o n w i t h
Mo n g o D B .
A l s o , y o u s e t u p y o u r o wn d a t a v i s u a l i z a t i o n a n d c r e a t e a n i n t e r a c t i v e
u s e r i n t e r f a c e w i t h S t r e a ml i t t o s e l e c t a n d v i e w i n v o i c e s f o r c u s t o m e r s
a n d t h e i t e ms o n t h e s e i n v o i c e s .
Sounds like the right project for you? Here you can find it in my academy:
https://learndataengineering.com/p/document-streaming
page 15 LearnDataEngineering.com
Anal ytical Pipel ines
T h i s p r o j e c t i s a E T L p i p e l i n e wh e r e we t a k e f i l e s o u t o f S 3 a n d p u t t h e
c o n t e n t s i n t o R e d s h i f t . F o r v i s u a l i z a t i o n we u s e P o we r B I . ( T h i s i s p a r t o f
t h e D a t a E n g i n e e r i n g o n A WS p r o j e c t )
As you are not using an API here, the data is stored as a file within S3 as
clients usually send in their data as files. This is very typical for an ETL
j o b , wh e r e t h e d a t a i s e x t r a c t e d f r o m S 3 a n d t h e n t r a n s f o r me d a n d
stored.
A t y p i c a l t o o l t o u s e h e r e i s A WS G l u e . Wi t h t h i s t o o l , y o u c a n w r i t e
A p a c h e S p a r k j o b s , wh i c h p u l l t h e d a t a f r o m S 3 a n d t h e n t r a n s f o r m a n d
s t o r e t h e m i n t o t h e d e s t i n a t i o n , f o r wh i c h we r e c o m m e n d R e d s h i f t .
Beside the Glue jobs, another part you should make use of is the data
c a t a l o g . I t c a t a l o g s t h e c s v f i l e s wi t h i n S 3 a s we l l a s t h e d a t a i n R e d s h i f t .
T h u s , i t b e c o me s v e r y e a s y t o a u t o m a t i c a l l y c o n f i g u r e a n d g e n e r a t e a
S p a r k j o b i n A WS G l u e , wh i c h s e n d s t h e d a t a t o R e d s h i f t .
page 16 LearnDataEngineering.com
A s s a i d b e f o r e , t h e r e c o mme n d e d s t o r a g e f o r t h e a n a l y t i c a l p a r t i s
Redshift, from which the data analysts can evaluate and visualize the
d a t a . B e s t t o u s e h e r e i s a s i n g l e s t a g i n g t a b l e i n wh i c h t h e d a t a i s s t o r e d .
T h e t y p i c a l me t h o d h e r e wo u l d b e t o u s e R e d s h i f t a s y o u r a n a l y t i c s
d a t a b a s e , t o wh i c h y o u i d e a l l y c o n n e c t a B I t o o l s u c h a s P o we r B I , f o r
e x a mp l e .
On G C P y o u c a n s t a r t b y c o n f i g u r i n g C l o u d S t o r a g e , B i g Qu e r y a n d D a t a
Studio on the Google Cloud Platform (GCP). Put a file into the lake,
c r e a t e a B i g Qu e r y t a b l e a n d a Qu i c k s i g h t r e p o r t t h a t y o u c a n s h a r e w i t h
a n y o n e y o u wa n t .
page 17 LearnDataEngineering.com
Modern Data Warehouses
A WS
Or g o i n t o A WS t o s e t u p a ma n u a l d a t a l a k e i n t e g r a t i o n t h r o u g h S 3 ,
A t h e n a a n d Qu i c k s i g h t . A f t e r t h a t y o u c a n c h a n g e t h e i n t e g r a t i o n t o u s i n g
the Glue Data catalog.
A s a B ONU S L E S S ON I s h a r e i n m y c o u r s e h o w y o u c a n d o w h a t y o u d i d
w i t h A WS A t h e n a t h r o u g h R e d s h i f t S p e c t r u m .
He r e y o u c a n f i n d t h e c o mp l e t e c o u r s e i n m y D a t a E n g i n e e r i n g A c a d e m y :
h t t p s : / / l e a r n d a t a e n g i n e e r i n g . c o m / p / m o d e r n - d a t a - wa r e h o u s e s
page 18 LearnDataEngineering.com
Hadoop Data Warehousing with Hive
Ha d o o p i s a J a v a - b a s e d a n d o p e n s o u r c e f r a m e wo r k t h a t h a n d l e s a l l s o r t s
o f s t o r a g e a n d p r o c e s s i n g f o r B i g D a t a a c r o s s c l u s t e r s o f c o mp u t e r s u s i n g
s i mp l e p r o g r a m mi n g mo d e l s .
Wo r k i n g t h r o u g h t h i s p r o j e c t y o u wi l l l e a r n a n d m a s t e r t h e H a d o o p
a r c h i t e c t u r e a n d i t s c o mp o n e n t s l i k e H D F S , Y A R N , Ma p R e d u c e , Hi v e a n d
Sqoop. Understand the detailed concepts of the Hadoop Eco-system
a l o n g w i t h h a n d s - o n l a b s a s we l l a s t h e m o s t i m p o r t a n t H a d o o p
c o mma n d s . A l s o , l e a r n h o w t o i m p l e m e n t a n d u s e e a c h c o m p o n e n t t o
s o l v e r e a l b u s i n e s s p r o b l e ms !
L e a r n h o w t o s t o r e a n d q u e r y y o u r d a t a wi t h S q o o p , H i v e , a n d My S Q L ,
w r i t e Hi v e q u e r i e s a n d Ha d o o p c o m m a n d s , m a n a g e B i g D a t a o n a c l u s t e r
w i t h HD F S a n d Ma p R e d u c e a n d m a n a g e y o u r c l u s t e r wi t h Y A R N a n d Hu e .
If you think that’s the perfect project for you, get right into it:
h t t p s : / / l e a r n d a t a e n g i n e e r i n g . c o m / p / d a t a - e n g i n e e r i n g - wi t h - h a d o o p
page 19 LearnDataEngineering.com
MARKETING PROPOSAL Page 01
DATA ENGINEERING
ACADEMY
The best place to learn Data Engineering!
Perfect for becoming a Data Engineer or add Data Engineering to your
skillset
page 20 LearnDataEngineering.com