
Comprehensive Guide to Data Engineering

The document outlines the key skills and concepts needed for a career in data engineering, including programming languages like Python and Scala, Linux and shell scripting, data structures and algorithms, SQL, databases, data warehousing, big data frameworks like Hadoop, Spark, and HBase, NoSQL databases, messaging queues, dashboarding tools, and big data cloud services on AWS. It provides a comprehensive overview of topics covering programming, operating systems, databases, SQL, big data, analytics, and cloud platforms.

Uploaded by

Pvsraju

Data Engineering Technology: https://youtu.be/JXx6iN7MKw8

Programming languages

 Python, Scala, Java

Operating Systems

 Linux, Unix, Shell Scripting

Data Structures & Algorithms (intermediate level)

 Arrays
 Strings
 Linked List
 Stack
 Queue
 Tree (Basics)
 Graph (Basics)
 Dynamic Programming 
 Searching
 Sorting
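
As a rough illustration of the level implied by the list above, here is a minimal Python sketch covering sorting, searching, a stack, and a queue using only the standard library. The helper name `binary_search` and the sample values are made up for this example.

```python
from bisect import bisect_left
from collections import deque


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1


# Sorting: sorted() returns a new list in ascending order.
values = sorted([42, 7, 19, 3, 25])

# Searching: binary search only works on sorted input.
index_of_19 = binary_search(values, 19)
index_of_8 = binary_search(values, 8)

# Stack: last in, first out, using a plain list.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()

# Queue: first in, first out, using deque for O(1) removal from the left.
queue = deque()
queue.append("a")
queue.append("b")
front = queue.popleft()
```

A plain list works as a stack, but popping from the front of a list is O(n), which is why the queue uses `collections.deque`.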

Core Basics of DBMS

 DDL
 DCL
 DML
 Integrity Constraints
 Data Schema
 Basic Operations
 ACID Properties
 Transactions
 Concurrency Control
 Deadlock
 Indexing
 Hashing
 Normalization forms
 Views
 Stored Procedures
 ER Diagrams
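
Transactions, integrity constraints, and the atomicity part of ACID can be tried out with Python's built-in `sqlite3` module. The sketch below is one possible illustration, not a production pattern: the `accounts` table, the `transfer` helper, and the sample balances are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrity constraint: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, source, target, amount):
    """Move amount between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to `bob` has already been applied when the debit is rejected, so the rollback is what keeps the two balances consistent.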

SQL Scripting:
 Transactional Databases: MySQL, PostgreSQL
 All types of joins
 Nested Queries
 Group By
 Use of Case When Statements 
 Window Functions
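
The query patterns listed above can be practised without installing a database server. The sketch below uses Python's built-in `sqlite3` module rather than MySQL or PostgreSQL, and assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The `customers` and `orders` tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (1, 1, 120), (2, 1, 80), (3, 2, 40);
    """
)

# LEFT JOIN + GROUP BY + CASE WHEN: total per customer, keeping customers with no orders.
totals = conn.execute(
    """
    SELECT c.name,
           COALESCE(SUM(o.amount), 0) AS total,
           CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high' ELSE 'low' END AS tier
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
    """
).fetchall()

# Window function: running total of each customer's orders, ordered by order id.
running = conn.execute(
    """
    SELECT customer_id, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
    FROM orders
    ORDER BY customer_id, id
    """
).fetchall()

# Nested query: customers whose id appears in the orders table.
buyers = conn.execute(
    "SELECT name FROM customers WHERE id IN (SELECT customer_id FROM orders) ORDER BY id"
).fetchall()
```

Unlike `GROUP BY`, the window function keeps one output row per order while still aggregating within each customer's partition.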

Basic Terminologies in Big Data:

 What is Big Data?
 5 Vs of Big Data
 Distributed Computation
 Distributed Storage
 Vertical vs Horizontal Scaling
 Commodity Hardware
 Clusters

File formats

 CSV
 JSON
 AVRO
 Parquet
 ORC
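
One practical difference between the text formats above is how they handle types. The following sketch, using only Python's standard library, reads a small CSV string and converts it to JSON Lines; the sample data and field names are made up. Avro, Parquet, and ORC are binary formats that need third-party libraries, so they are not shown here.

```python
import csv
import io
import json

csv_text = "id,city,temp_c\n1,Pune,31.5\n2,Oslo,4.0\n"

# CSV carries no type information: every field is read back as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Cast explicitly, then write one JSON object per line (JSON Lines).
typed = [
    {"id": int(r["id"]), "city": r["city"], "temp_c": float(r["temp_c"])}
    for r in rows
]
json_lines = "\n".join(json.dumps(record) for record in typed)

# JSON keeps numbers and strings distinct on the round trip.
restored = [json.loads(line) for line in json_lines.splitlines()]
```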

Types of Data

1. Structured
2. Unstructured
3. Semi-structured

Data Exploration Libraries:

 Pandas 
 NumPy

Data Warehousing Concepts:

 OLAP vs OLTP
 Dimension Tables
 Fact Tables
 Star Schema
 Snowflake Schema
 Warehouse Design Questions
 Many more topics
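
To make the fact table, dimension table, and star schema terms above concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are hypothetical; a real warehouse would typically run on a dedicated engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes, one row per entity.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        units INTEGER,
        revenue INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 2, 2000), (2, 20240101, 1, 300), (1, 20240201, 1, 1000);
    """
)

# A typical OLAP-style query: join the fact table to its dimensions and aggregate.
revenue_by_category_month = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()
```

In a snowflake schema the `category` column would move out of `dim_product` into its own table, adding one more join to the same query.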

Big Data Frameworks:

 Apache Hadoop (understanding the architecture is most important)

i. HDFS
ii. MapReduce
iii. YARN

 Apache Hive

1. How to load data in different file formats
2. Internal Tables
3. External Tables
4. Querying table data stored in HDFS
5. Partitioning
6. Bucketing
7. Map-Side Join
8. Sorted-Merge Join
9. UDFs in Hive
10. SerDe in Hive

 Apache Spark (Most Important)

1. Spark Core
2. Spark SQL
3. Spark Streaming

 Apache Flink (Real-Time Data Processing / Stream Processing)
 Apache Sqoop
 Apache Flume
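
The MapReduce model listed under Apache Hadoop above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases only; it does not use the Hadoop API, and the function names and input lines are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "data pipelines"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the map calls run in parallel on the nodes that hold each block of input, and the shuffle moves data across the network so that every key ends up at exactly one reducer.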

Workflow Schedulers, Dependency Management:

 Apache Airflow
 Azkaban
 Apache NiFi

NoSQL Databases:

 HBase
 DataStax Cassandra (Recommended)
 Elasticsearch
 MongoDB

Messaging Queue Frameworks:

 Apache Kafka

Dashboarding Tools:

 Tableau
 Power BI
 Grafana
 Kibana (part of the ELK stack: Elasticsearch, Logstash, Kibana)

Big Data Services in Cloud (AWS):

 On-demand Machines
1. AWS EC2
 Access Management
1. AWS IAM
 For Storing and Accessing Credentials
1. AWS Secrets Manager
 Distributed File Storage
1. AWS S3 
 Database and Query Services
1. AWS RDS (transactional databases)
2. AWS Athena (serverless SQL queries over data in S3)
3. AWS Redshift (Data Warehousing)
 NoSQL Database Services
1. AWS DynamoDB
 Serverless 
1. AWS Lambda
 ETL Services
1. AWS Glue
 Scheduler
1. AWS CloudWatch Events
 Distributed Data Computation
1. AWS EMR
 Messaging Queue
1. AWS SNS
2. AWS SQS
 Real Time Data Processing
1. AWS Kinesis
