12 Must-Have Skills To Become A Data Engineer - by Anuj Syal - DataDrivenInvestor
Published in DataDrivenInvestor
Are you passionate about using data to create innovative products and solutions? If
so, a career as a data engineer may be the perfect fit for you. But what does it take to
https://medium.datadriveninvestor.com/12-must-have-skills-to-become-a-data-engineer-35b100dbee0a 1/9
5/1/23, 12:33 AM 12 Must-Have Skills to become a Data Engineer | by Anuj Syal | DataDrivenInvestor
be successful in this field? In this blog, we will explore the skills and requirements
necessary to become a data engineer and succeed in this exciting profession.
Fundamental Skills
SQL: Structured Query Language, often pronounced "sequel", is always at the top of
the list for beginners in the domain. The language was developed at IBM in the early
1970s and is the standard language for interacting with data in databases. Almost all
databases and warehouses use a dialect of SQL as their query language.
Popular relational databases include MySQL and PostgreSQL.
Moreover, other tools and warehouses have adopted SQL as an abstraction;
BigQuery, for example, allows you to build ML models using SQL.
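As a minimal sketch of what interacting with a database through SQL looks like, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The users table, its columns, and the sample rows are hypothetical and chosen only for illustration; MySQL and PostgreSQL accept essentially the same statements.

```python
import sqlite3

# Throwaway in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("Asha", "IN", 31), ("Ben", "US", 24), ("Chen", "SG", 45)],
)

# Filter and sort with plain SQL; the ? placeholder keeps the value out of the query text.
users_over_30 = conn.execute(
    "SELECT name FROM users WHERE age > ? ORDER BY name", (30,)
).fetchall()
conn.close()

print(users_over_30)
```

Each result row comes back as a tuple, one element per selected column.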
Git: Git is an important tool for version control, which is the practice of tracking
and managing changes to software code. Every change that you commit becomes part of
your code base's history and can be pushed to a repository on some remote
server/cloud.
But how does Git help you? Git lets you save all the changes that you make while
coding, which works wonderfully when collaborating with your team, without
losing your code. You simply create a new branch and send a pull request to
merge your code, and voilà! You’re ready to collaborate and
work on your code. Check out my video tutorial on Git if you want to get started
with this technology.
Linux Commands & Shell Scripting: As a practitioner in the world of data
engineering, you will mostly be dealing with a Linux VM or server. Whether
in a public cloud or on a private server, these machines usually run some
distribution of Linux, such as Ubuntu or Fedora.
Therefore, to work with such machines, you need some knowledge of the
commands used to navigate Linux servers. Basic commands such as cd, pwd,
cp, and mv are a good start, with much more to learn later. Shell scripting
is a great way to automate these Linux commands, so that you do not have to
run them manually.
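To illustrate the idea of automating file commands, here is a rough sketch written in Python rather than in a shell script, so that all examples in this post stay in one language. It mirrors what cp, mkdir, and mv do. The file and folder names are hypothetical, and everything happens inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

def archive_file(workdir):
    """Back up a file, then move the original into an archive folder."""
    source = workdir / "report.csv"  # hypothetical file name
    source.write_text("id,value\n1,42\n")
    shutil.copy(source, workdir / "report.csv.bak")  # like: cp report.csv report.csv.bak
    archive = workdir / "archive"
    archive.mkdir()  # like: mkdir archive
    shutil.move(str(source), str(archive / source.name))  # like: mv report.csv archive/
    # Relative paths of everything now under workdir, similar to `find .`
    return sorted(p.relative_to(workdir).as_posix() for p in workdir.rglob("*"))

with tempfile.TemporaryDirectory() as tmp:
    listing = archive_file(Path(tmp))

print(listing)
```

A real shell script would chain the same commands directly; the point is only that a fixed sequence of manual steps becomes one repeatable call.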
Data Structures & Algorithms: Next in line are data structures and algorithms.
Even though you will rarely need to implement data structures yourself, an
aspiring data engineer still needs an adequate understanding of them and the
related problem-solving skills (similar to software engineering). For
this purpose, easy to intermediate-level LeetCode problems will be enough for
initial practice.
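For a sense of the level involved, here is a sketch of a typical easy problem: find two numbers in a list that add up to a target. The function name and sample input are illustrative. Using a dictionary as a hash map lets it finish in a single pass instead of comparing every pair.

```python
def two_sum(numbers, target):
    """Return the indices of two numbers that add up to target, or None."""
    seen = {}  # value -> index where that value was seen
    for index, value in enumerate(numbers):
        complement = target - value
        if complement in seen:
            return seen[complement], index
        seen[value] = index
    return None

result = two_sum([2, 7, 11, 15], 9)
print(result)
```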
Concept of Networking
As a data engineer, you would be responsible for quite a lot of deployments to VMs
and servers. Therefore, it is important for someone dealing with VMs (Virtual
Machines), servers, and APIs (Application Programming Interfaces) to have an
understanding of basic networking concepts such as IP (Internet Protocol), DNS
(Domain Name System), VPN, TCP, HTTP, firewalls, etc.
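A couple of these concepts can be explored with Python's standard library alone. The sketch below checks whether an IP address falls inside an allowed network range, which is the kind of rule a firewall enforces, and splits a URL into protocol, host name, and port. The addresses, the allow-list, and the URL are made-up examples.

```python
import ipaddress
from urllib.parse import urlsplit

def is_allowed(client_ip, allowed_networks):
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in allowed_networks)

# Hypothetical allow-list containing one private subnet.
ALLOWED = ["10.0.0.0/24"]

inside = is_allowed("10.0.0.17", ALLOWED)
outside = is_allowed("192.168.1.5", ALLOWED)

# The host name is what DNS resolves to an IP address; the scheme names the protocol.
parts = urlsplit("https://example.com:8443/health")

print(inside, outside, parts.scheme, parts.hostname, parts.port)
```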
Databases
1. Fundamentals: A database is a space where data is stored. You will be
interacting with many of these databases as a data engineer. For this reason, you
need to understand the fundamental concepts of databases, such as tables,
rows, columns, keys, joins, merges, and schema.
2. SQL: SQL deserves a second mention when talking about databases, as it is
the language you use to interact with them.
guarantee data validity, despite errors, power failures, and any other such
mishaps.
6. OLTP Vs. OLAP: OLAP (Online Analytical Processing) and OLTP (Online
Transaction Processing) are two different types of data processing systems.
OLAP systems run complex queries to examine historical data that has been
collected from OLTP systems.
Wide Column Databases: examples are Apache Cassandra and Google Bigtable.
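To make the OLTP versus OLAP distinction concrete, here is a small sketch that uses Python's built-in sqlite3 module for both roles. That is a simplification: in practice the two workloads usually run on separate systems. The sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, amount INTEGER)")

def record_sale(day, product, amount):
    """OLTP style: a small write, committed as its own transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (day, product, amount))

record_sale("2023-04-29", "book", 12)
record_sale("2023-04-29", "pen", 3)
record_sale("2023-04-30", "book", 8)

# OLAP style: one read-heavy query that aggregates the collected history.
daily_totals = conn.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
).fetchall()
conn.close()

print(daily_totals)
```

The writes touch one row at a time, while the analytical query scans every row to produce a summary.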
Data Warehousing
Operational databases are not designed to store and analyze huge amounts of data,
which leads us to the data warehouse.
Data warehouses can store large volumes of current and historical data for
query and analysis. A data warehouse is simply a database designed with analytical
workloads in mind. These are powerful enough to perform complex aggregate
1. SQL: With the advent of powerful data warehouses that abstract away
complexity, proficiency in SQL is all that is required to unlock their full
potential.
3. OLAP Vs. OLTP: The primary distinction between the two is that one uses data to
gather important insights while the other supports transaction-oriented
applications.
AWS Redshift
Azure Synapse
Snowflake
ClickHouse
Hive
S3
GCS
Distributed Systems
When multiple machines work together as a cluster, they form Distributed Systems.
These systems are used when the data is huge and cannot be managed by a single
machine. They have separate sets of technologies due to their own complexities.
Some of the concepts you must know are:
1. Big Data
2. Hadoop
3. HDFS
4. MapReduce
Some of the technologies built for this purpose include cluster technologies
like Kubernetes, Databricks, and custom Hadoop clusters. Many of the tools
in this space, Hadoop included, are open source.
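The MapReduce idea listed above can be sketched on a single machine. The example below counts words in three phases: map, shuffle, and reduce. In a real cluster each phase would be spread across many machines by a framework such as Hadoop. The function names and sample documents are illustrative only.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data big cluster", "data pipeline"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts)
```

Because each document is mapped independently and each word is reduced independently, both phases can run in parallel, which is what makes the model suit clusters.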
Data Processing
This is where your coding skills come into use for transforming data, as
raw data is rarely usable as-is. As a data engineer, your job will mainly
revolve around transforming data so that it can be served in the right format. This
also includes cleaning and validating the data. pandas can be your first tool for
this process, as it is an easy-to-use Python package built around data frames.
SQL can also be used to transform big data, as most data warehouses support
this language. Spark is the most popular framework used for big data
transformation. Similarly, for stream processing, Spark Streaming is the preferred
choice.
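Here is a minimal sketch of the cleaning and validation described above, assuming pandas is installed. The column names and sample rows are hypothetical. It drops incomplete and duplicate rows, normalizes formats, and then checks one expectation about the result.

```python
import pandas as pd

# Hypothetical raw input: stray whitespace, mixed case, a duplicate, a missing id.
raw = pd.DataFrame(
    {
        "user_id": ["1", "2", "2", None],
        "email": [" A@Example.com", "b@example.com ", "b@example.com ", "c@example.com"],
    }
)

def clean(frame):
    """Drop incomplete and duplicate rows, then normalize column formats."""
    out = frame.dropna(subset=["user_id"]).drop_duplicates().copy()
    out["user_id"] = out["user_id"].astype(int)
    out["email"] = out["email"].str.strip().str.lower()
    return out.reset_index(drop=True)

cleaned = clean(raw)

# Validation: fail loudly if the cleaned data breaks an expectation.
if not cleaned["user_id"].is_unique:
    raise ValueError("duplicate user_id after cleaning")

print(cleaned)
```

The same steps translate to SQL or Spark once the data no longer fits in memory on one machine.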
Orchestration
Orchestration is used to schedule and coordinate jobs and to create pipelines and
workflows. The best tool for orchestration is Airflow, as it uses Python-based Directed
Acyclic Graphs (DAGs) to define the workflow of jobs. From the simplest of tasks to the
most complex ones, Airflow can handle them all. Some other orchestration tools
are Luigi, NiFi, and Jenkins.
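To show what a directed acyclic graph of jobs means, the sketch below uses Python's standard graphlib module (Python 3.9 or later) instead of Airflow itself, so it is not Airflow's API. The task names are hypothetical. Each task lists the tasks it depends on, and the sorter yields an order in which every task runs only after its upstream tasks.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"clean"},
    "load": {"validate"},
}

def run(dag, actions):
    """Run every task once, never before its upstream dependencies."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        actions[task]()
        executed.append(task)
    return executed

# Stand-in actions that only record that they ran.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in pipeline}
order = run(pipeline, actions)

print(order)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering; if the graph contained a cycle, the sorter would raise an error, since no valid order would exist.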
Backend Frameworks
As the name suggests, backend frameworks overlap with software engineering.
Backend frameworks come into use when you need to
serve some data set, model, or functionality to be used by some application. For this
task, you will need to create backend APIs using a framework.
Some of the dedicated Python-based frameworks are
Flask, Django, and FastAPI. Some of the cloud-based technologies are the GCP Vertex AI
API for model deployments and the AutoML APIs.
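As a rough illustration of serving a data set to an application, here is a tiny JSON endpoint written as a plain WSGI application using only the standard library. It is not Flask, Django, or FastAPI code, although Flask and Django applications are exposed through the same WSGI interface. The /prices route and the data are hypothetical.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical data set that the backend exposes.
PRICES = {"book": 12, "pen": 3}

def app(environ, start_response):
    """Minimal WSGI app: /prices returns the data set as JSON."""
    if environ["PATH_INFO"] == "/prices":
        status, payload = "200 OK", PRICES
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]

def call(path):
    """Invoke the app in-process, without opening a network socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a default GET request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)

print(call("/prices"))
```

To serve it over HTTP during development, wsgiref.simple_server.make_server("", 8000, app).serve_forever() would work; a framework mainly adds routing, validation, and tooling around this same request-in, response-out shape.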
Machine Learning
Machine learning algorithms (or models) are another great concept to learn
about. Machine learning is mainly used by data scientists to make
predictions by analyzing current and historical data. However, data engineers should
have a strong understanding of the basics of machine learning, as it
enables them to deploy models and to build pipelines that deliver more accurate data. This
directly helps data scientists make precise decisions. Hence, it is good to
understand the fundamentals and frameworks of ML. Some of the platforms used
for ML operations are Google AI Platform, Kubeflow, and SageMaker.
Integrated Platforms
Integrated platforms allow data scientists and data engineers to run their
workflows together in one place. AWS SageMaker, Databricks, and Hugging Face
are some examples of integrated platforms.
Conclusion
In the field of data engineering, there are a myriad of skills that one needs to learn,
and that requires gaining hands-on experience too. As an aspiring data engineer,
you get to choose from a wide variety of skills and tools to work with, and that’s the
thrill of it all!
For more information and a more detailed explanation, you can check out the video
below!