
Data Engineer

Intro - What is Data Engineering?


Enter the data engineer. Before the data engineer steps in:
- Data is scattered across different sources
- Databases are not optimized for analyses
- Legacy code is causing corrupt data
Data engineer to the rescue!

Data engineers: making your life easier


- Gather data from different sources
- Optimize the database for analyses
- Remove corrupt data
The data scientist's life just got way easier!

Definition of job
An engineer that develops, constructs, tests, and maintains architectures such as databases and large-scale
processing systems.
- Processing large amounts of data
- Using clusters of machines

Data engineer vs Data scientist


Data Engineer                            | Data Scientist
-----------------------------------------|------------------------------------------
Develop scalable data architecture       | Mining data for patterns
Streamline data acquisition              | Statistical modelling
Set up processes to bring together data  | Predictive models using machine learning
Clean up corrupt data                    | Monitor business processes
Well versed in cloud technology          | Clean outliers in data

Intro - Tools of the data engineer


Databases
- Hold large amounts of data
- Some databases support applications
- Other databases are used for analyses (see the small example below)
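As a quick illustration, querying a database for analysis usually comes down to plain SQL. Below is a minimal sketch using Python's built-in sqlite3 module; the app.db file and the customers table are made up for illustration.

    # A minimal sketch: running an analytical query against a database.
    # The database file "app.db" and the "customers" table are hypothetical.
    import sqlite3

    # Connect to a (hypothetical) application database file
    connection = sqlite3.connect("app.db")

    # Analytical query: count customers per country
    cursor = connection.execute(
        "SELECT country, COUNT(*) AS n_customers "
        "FROM customers GROUP BY country"
    )
    for country, n_customers in cursor:
        print(country, n_customers)

    connection.close()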

Processing
- Clean data
- Aggregate data
- Join data
- The data engineer understands these abstractions (see the PySpark sketch below)
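To make the three steps concrete, here is a minimal PySpark sketch; the customers and orders data are invented for illustration, and in practice these DataFrames would come from real tables or files.

    # A minimal PySpark sketch of the three processing steps: clean, aggregate, join.
    # The example data (customers, orders) is made up for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("processing_sketch").getOrCreate()

    customers = spark.createDataFrame(
        [(1, "BE"), (2, "NL"), (3, None)], ["customer_id", "country"]
    )
    orders = spark.createDataFrame(
        [(1, 10.0), (1, 20.0), (2, 5.0)], ["customer_id", "amount"]
    )

    # Clean: drop rows with missing values
    customers_clean = customers.dropna()

    # Aggregate: total order amount per customer
    totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

    # Join: combine customer info with their totals
    result = customers_clean.join(totals, on="customer_id", how="left")
    result.show()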

Scheduling
- Plan jobs to run at specific intervals
- Resolve the dependency requirements of jobs (see the DAG sketch below)
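As a sketch of what scheduling looks like in practice, the hypothetical Airflow DAG below runs three placeholder tasks daily and declares their dependencies; exact import paths and parameter names vary between Airflow versions.

    # A minimal sketch of an Airflow DAG (Airflow 2.x style); task names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id="example_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # newer Airflow versions call this "schedule"
    )

    # Placeholder tasks; real pipelines would run extract/transform/load commands
    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    # Dependencies: extract runs before transform, transform before load
    extract >> transform >> load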

Existing tools: example


- Databases: MySQL, PostgreSQL, etc.
- Processing: Spark, Hive, etc.
- Scheduling: Apache Airflow, Oozie, etc., or simple Bash tooling like cron

Intro - A data pipeline


To sum everything up, you can think of the data engineering pipeline as the diagram below. The pipeline extracts all data through connections with several databases, transforms it using a cluster computing framework like Spark, and loads it into an analytical database. Everything is scheduled to run in a specific order through a scheduling framework like Airflow. A small side note: the sources can also be external APIs or other file formats. We'll see this in the exercises.
                        Scheduling (Apache Airflow)
  ----------------------------------------------------------------------->

  SQL (Accounting)   --\
  SQL (Online Store) ----->  Processing (Apache Spark)  ----->  SQL (Analytics)
  NoSQL (Catalog)    --/
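The same pipeline can be sketched in a few lines of Python. The snippet below is only illustrative: the connection strings, table names, and columns (order_id, customer_id, amount) are placeholders, and the transform step could just as well run on Spark.

    # A minimal extract-transform-load sketch of the pipeline in the diagram,
    # using pandas and SQLAlchemy. All connection strings and tables are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: pull data from the source databases
    accounting = create_engine("postgresql://user:pass@accounting-host/accounting")
    store = create_engine("postgresql://user:pass@store-host/online_store")

    invoices = pd.read_sql("SELECT * FROM invoices", accounting)
    orders = pd.read_sql("SELECT * FROM orders", store)

    # Transform: join the sources and aggregate revenue per customer
    revenue = (
        orders.merge(invoices, on="order_id")
              .groupby("customer_id", as_index=False)["amount"].sum()
    )

    # Load: write the result into the analytical database
    analytics = create_engine("postgresql://user:pass@analytics-host/analytics")
    revenue.to_sql("customer_revenue", analytics, if_exists="replace", index=False)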

Intro - Cloud Providers


Data processing in the cloud
Clusters of machines are required
Problem: self-hosting a data center
- You cover electricity and maintenance costs
- Peaks vs. quiet moments: hard to optimize capacity
Solution: use the cloud

Data storage in the cloud


Reliability is required
Problem: self-hosting a data center
- Disaster will strike at some point
- You need replicas in different geographical locations
Solution: use the cloud

The big three: AWS, Azure & Google


AWS: 32% market share in 2018
Azure: 17% market share in 2018
Google: 10% market share in 2018

Storage
Upload files, e.g. storing product images
Services
- AWS S3
- Azure Blob Storage
- Google Cloud Storage
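For example, uploading a product image to object storage with AWS's boto3 client might look like the sketch below; the bucket name and file paths are placeholders.

    # A minimal sketch: uploading a file to cloud object storage with boto3 (AWS S3).
    # The bucket name and file paths below are placeholders, not real resources.
    import boto3

    s3 = boto3.client("s3")

    # Upload a local product image to an S3 bucket
    s3.upload_file(
        Filename="product_images/shirt.png",   # local file (hypothetical)
        Bucket="my-product-images-bucket",     # bucket name (hypothetical)
        Key="images/shirt.png",                # object key inside the bucket
    )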

Computation
Perform calculations, e.g. hosting a web server
Services
- AWS EC2
- Azure Virtual Machines
- Google Compute Engine
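Starting a virtual machine goes through the same kind of API. The sketch below launches a single hypothetical EC2 instance with boto3; the region and AMI ID are placeholders.

    # A minimal sketch: launching one virtual machine on AWS EC2 with boto3.
    # The region and AMI ID below are placeholders, not real values.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder machine image
        InstanceType="t3.micro",          # small, inexpensive instance type
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])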

Databases
Hold structured information
Services
- AWS RDS
- Azure SQL Database
- Google Cloud SQL
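Connecting to such a managed database works just like connecting to any other SQL database. The sketch below uses SQLAlchemy against a hypothetical PostgreSQL instance on AWS RDS; the hostname, credentials, and orders table are placeholders.

    # A minimal sketch: querying a managed cloud database (PostgreSQL on AWS RDS)
    # with SQLAlchemy. Requires a PostgreSQL driver such as psycopg2.
    # The hostname, credentials, and "orders" table are placeholders.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql://user:password@mydb.abc123.eu-west-1.rds.amazonaws.com:5432/sales"
    )

    with engine.connect() as conn:
        result = conn.execute(text("SELECT COUNT(*) FROM orders"))
        print(result.scalar())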
