Data Engineering Technology: https://youtu.be/JXx6iN7MKw8
Programming Languages
Python, Scala, Java
Operating Systems
Linux, Unix, Shell Scripting
Data Structures & Algorithms (intermediate level)
Arrays
Strings
Linked List
Stack
Queue
Tree (Basics)
Graph (Basics)
Dynamic Programming
Searching
Sorting
Core Basics of DBMS
DDL
DCL
DML
Integrity Constraints
Data Schema
Basic Operations
ACID Properties
Transactions
Concurrency Control
Deadlock
Indexing
Hashing
Normalization forms
Views
Stored Procedures
ER Diagrams
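
To make the ACID and transaction items above concrete, here is a minimal sketch of atomicity. It uses Python's built-in sqlite3 module only so that it runs without a server; the `accounts` table, the names, and the `transfer` helper are hypothetical and chosen for illustration.

```python
import sqlite3

# In-memory database; table and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # integrity constraint
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first, so a failed debit leaves a partial change to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; both updates are undone.
        return False


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


overdraft_ok = transfer(conn, "alice", "bob", 500)  # fails, nothing changes
small_ok = transfer(conn, "alice", "bob", 30)       # succeeds
final_balances = balances(conn)
```

The failed transfer leaves both balances untouched even though the credit had already been applied inside the transaction, which is the "all or nothing" behaviour that atomicity refers to.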
SQL Scripting:
Transactional Databases: MySQL, PostgreSQL
All types of joins
Nested Queries
Group By
Use of Case When Statements
Window Functions
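
The query patterns listed above can be practised on a tiny data set. The sketch below assumes SQLite (via Python's sqlite3, version 3.25 or later for window functions) as a stand-in for MySQL or PostgreSQL; the `customers` and `orders` tables and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (1, 1, 120), (2, 1, 80), (3, 2, 40);
""")

# LEFT JOIN + GROUP BY + CASE WHEN: total spend per customer with a tier label.
# The LEFT JOIN keeps customers who have no orders (Chen).
summary = conn.execute("""
SELECT c.name,
       COALESCE(SUM(o.amount), 0) AS total,
       CASE WHEN COALESCE(SUM(o.amount), 0) >= 100 THEN 'high' ELSE 'low' END AS tier
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()

# Nested query: customers with at least one order above the average order amount.
above_avg = conn.execute("""
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders
             WHERE amount > (SELECT AVG(amount) FROM orders))
""").fetchall()

# Window functions: rank each customer's orders and attach the per-customer
# total without collapsing rows the way GROUP BY does.
ranked = conn.execute("""
SELECT customer_id, amount,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rn,
       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders
ORDER BY customer_id, rn
""").fetchall()
```

The same SQL should run largely unchanged on MySQL 8+ and PostgreSQL, though that is not verified here.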
Basic Terminologies in Big Data:
What is Big Data?
5 Vs of Big Data
Distributed Computation
Distributed Storage
Vertical vs Horizontal Scaling
Commodity Hardware
Clusters
File Formats
CSV
JSON
AVRO
Parquet
ORC
Types of Data
1. Structured
2. Unstructured
3. Semi-structured
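
As a small illustration of the file formats and data types above, the sketch below writes the same made-up records as JSON (semi-structured, nesting allowed) and as CSV (flat and structured), using only the Python standard library. The record fields and the "|" separator for flattening the list are assumptions for the example. Avro, Parquet, and ORC are binary formats that need third-party libraries and are not shown.

```python
import csv
import io
import json

# Hypothetical records; "tags" is a nested list.
records = [
    {"id": 1, "name": "Asha", "tags": ["new", "mobile"]},
    {"id": 2, "name": "Ben", "tags": []},
]

# JSON keeps types and nesting as they are.
json_text = json.dumps(records)
from_json = json.loads(json_text)

# CSV has no nesting, so the list is flattened into one "|"-separated column.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "tags"])
writer.writeheader()
for r in records:
    writer.writerow({"id": r["id"], "name": r["name"], "tags": "|".join(r["tags"])})
csv_text = buf.getvalue()

# Reading CSV back yields strings only; types must be restored by the reader.
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that the JSON round trip returns the original records, while every CSV value comes back as a string: the schema lives outside the file.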
Data Exploration Libraries:
Pandas
NumPy
Data Warehousing Concepts:
OLAP vs OLTP
Dimension Tables
Fact Tables
Star Schema
Snowflake Schema
Warehouse Designing Questions
Many more topics
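
A star schema can be sketched in a few lines: one fact table holding measures, surrounded by dimension tables holding descriptive attributes. The example below uses SQLite through Python's sqlite3 purely for convenience; the table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per entity.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: numeric measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue INTEGER
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 2000),
    (2, 20240101, 1, 300),
    (1, 20240201, 1, 1000);
""")

# Typical OLAP-style query: join the fact table to its dimensions and aggregate.
revenue_by_category_month = conn.execute("""
SELECT p.category, d.year, d.month, SUM(f.revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year, d.month
ORDER BY p.category, d.year, d.month
""").fetchall()
```

In a snowflake schema, `category` would move out of `dim_product` into its own normalized table, at the cost of one more join.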
Big Data Frameworks:
Apache Hadoop (understanding the architecture is most important)
i. HDFS
ii. MapReduce
iii. YARN
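
The MapReduce model named above can be sketched as a single-process word count in plain Python. This is not the Hadoop API; the function names are hypothetical and only mirror the map, shuffle, and reduce phases that Hadoop would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data pipelines"])
# counts == {"big": 2, "data": 2, "clusters": 1, "pipelines": 1}
```

In a real cluster, each mapper would process a block of a file stored in HDFS, and the shuffle would move grouped keys over the network to the reducers.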
Apache Hive
1. How to load data in different file formats
2. Internal Tables
3. External Tables
4. Querying table data stored in HDFS
5. Partitioning
6. Bucketing
7. Map-Side Join
8. Sort-Merge-Bucket (SMB) Join
9. UDFs in Hive
10. SerDe in Hive
Apache Spark (Most Important)
1. Spark Core
2. Spark SQL
3. Spark Streaming
Apache Flink (Real Time Data Processing / Stream Processing)
Apache Sqoop
Apache Flume
10. Workflow Schedulers, Dependency Management:
Apache Airflow
Azkaban
Apache NiFi
11. NoSQL Databases:
HBase
DataStax Cassandra (Recommended)
Elasticsearch
MongoDB
12. Messaging Queue Frameworks:
a. Apache Kafka
13. Dashboarding Tools:
Tableau
Power BI
Grafana
Kibana (part of the ELK stack: Elasticsearch, Logstash, Kibana)
14. Big Data Services in Cloud (AWS):
On demand Machines
1. AWS EC2
Access Management
1. AWS IAM
For Storing and Accessing Credentials
1. AWS Secrets Manager
Distributed File Storage
1. AWS S3
Database and Query Services
1. AWS RDS (Transactional Databases)
2. AWS Athena (Serverless SQL Queries on S3 Data)
3. AWS Redshift (Data Warehousing)
NoSQL Database Services
1. AWS DynamoDB
Serverless
1. AWS Lambda
ETL Services
1. AWS Glue
Scheduler
1. AWS CloudWatch
Distributed Data Computation
1. AWS EMR
Messaging Queue
1. AWS SNS
2. AWS SQS
Real Time Data Processing
1. AWS Kinesis