Professional Documents
Culture Documents
Course Curriculum - PGP - Big Data - NITR v1.1 PDF
Course Curriculum - PGP - Big Data - NITR v1.1 PDF
About NITR
The National Institute of Technology, Rourkela is one of India’s premier national level institutions
for technical education in the country and is ranked 2nd among NITs by MHRD. NITR has also been
recognized as an Institute of National Importance by the Government of India. NITR has partnered
with Edureka to design this outline Post-Graduate Program in Big Data Engineering in order to
develop the next cadre of highly skilled professionals and experts in the field of Big Data.
About Edureka
Edureka is an e-learning education platform focused on delivering high-quality learning to
Technology professionals. We have the highest course completion rate in the industry, and we
strive to create an online ecosystem for our global learners to equip themselves with industry-
relevant skills in today’s cutting-edge technologies.
We have an easy and affordable learning solution that is accessible to millions of learners. With
our students spread across countries like the US, India, UK, Canada, Singapore, Australia, Middle
East, Brazil and many others, we have built a community of over 1 million learners across the
globe.
*Depending on industry requirements, Edureka may make changes to the course curriculum
Course Curriculum
Topics:
What is Linux?
Linux Distribution
Ubuntu Installation
Shell Scripting
Some basic and advanced Linux commands
Package Management
ssh and Remote log-in
Topics:
Introduction to Java Exceptions and its types
Java Installation Exception Handling in Java
Modifiers and Variables in Java Throw and throws keyword in Java
Java Data Types File I/O Operation in Java
Data Types Conversion in Java Wrapper Classes in Java
Java Operators Java Collection Frameworks
Control Statements in Java List and its Classification in Java
Methods in Java Queue in Java
Arrays and their Types Sets and their Classification in Java
Strings and their Operations Introduction to XML
Classes and Objects DOM Parser in Java
Constructors and their types SAX Parser
Static and This Keyword in Java StAX Parser In Java
Oops Concept in Java Introduction To XPath
Interfaces in Java Introduction To JAXB
Packages in Java JSON And Its Significance
Course Curriculum
Topics:
Global Data Explosion
Structure of Data
What is Big Data?
The 5 V’s of Big Data
Big Data Analytics
Big Data Engineer
Introduction to Data Structures
Algorithms and Asymptotic Notation
Linear Data Structures - Linked List
Topics:
Non-Linear Data Structures
Tree
o General Trees
o Binary Trees
o Binary Search Tree
o Basic operations on Trees
o Balancing of Binary search tree
AVL Trees
Graphs
o Types of Graphs
o Applications of Graphs
o Graphs Traversal Algorithms
Hashing
Java Collection Framework
Set
o HashSet
o LinkedHashSet
o TreeSet
Map
o HashMap
o LinkedHashMap
o TreeMap
Hands-on:
Implement Binary Search Tree and perform various operations such as insertion, deletion, and
searching of a node
Implement AVL tree and perform various rotations required to get a balanced BST
Implement various classes in the Set interface such as HashSet, LinkedHashSet and TreeSet
Implement various classes in Map interface such as HashMap, LinkedHashMap, and TreeMap
Hands-on:
Implement Binary search algorithm to search an element
Implement Quicksort to sort an array
Implement Mergesort to sort an array
Implement Binary search algorithm with Quicksort
Course Curriculum
Why Distributed and Parallel Computing Applications of Map & Reduce Construct
Properties of Distributed System Map & Reduce Construct use-case
Google File System Performance Evaluation of Map Construct
Remote Procedure Call & Reduce Construct
Java RMI Iterative Map Reduce
Logical Clocks Thread vs Process
Map Construct and its example Multithreading in Java
Hands-on:
Implement Multithreading in Java
Topics:
Topics:
HDFS
Hadoop 2.x core components
YARN
File Blocks
Rack Awareness
Hadoop read and write mechanism
Hadoop 2.x cluster architecture
Hadoop terminal commands
Topics:
Hadoop 2.x cluster architecture-Federation
Hadoop 2.x cluster architecture-High availability and resource management
Hadoop 2.x configuration files
Hadoop Web UI parts
Listing Files Using HDFS APIs
Access HDFS using JAVA API (Read/Write Operations)
Hands-on:
Implement various HDFS operations using Java APIs
Learning Objectives:
At the end of this module, you should be able to:
Explain MapReduce Computational Model
Understand the different stages of the MapReduce computational model such as Mapper,
Reducer, Combiner
Know about YARN MR Application Execution Flow
Know about different Input and Output Formats available in the MapReduce Framework
Explain the Serialization concept in MapReduce
Implement MapReduce Writable Interface
Implement a MapReduce Program and run it
Hands-on:
Implement MapReduce programs
Implement Combiner in a MapReduce program
Hands-on:
Analysing the datasets using Map-side, Reduce-Side join, Distributed Cache
Topics:
Hands-on:
Implement classes and objects in Scala Implement the Higher-order and Anonymous
Implement Traits in Scala functions in Scala
Implement Scala Programs
Hands-on:
Topics:
Hands-on:
Course Curriculum
Course Curriculum
Topics:
Relational Databases Entity-Relationship model
Database Design UML (Sequence Diagram)
Database Storage and Querying Features of good relational design
Database Schema Atomic Domains and Normalization
Relational Query Language Files Concept
Relational Operations Indexes Concept
Overview of the design process Measuring of Query Cost
Hands-on:
Designing Entity-Relationship model
Designing UML (Sequence Diagram)
Designing Database Schema
This module will help you to work on advanced MySQL concepts such as join operation, nested query,
creating views and procedures. You will also learn about how to load and write data from/into comma-
delimited file.
Hands-on:
Implement nested queries, views, triggers in SQL
Loading data from comma-delimited files
Writing data into comma-delimited files
Module 3: SQL Like Query Processing Engine for Big Data - Hive
Learning Objectives:
This module will help you to understand Hive concepts, Hive Data types, partition and bucketing in Hive,
internal and external tables, loading and querying data in Hive, running hive scripts.
Topics:
Introduction to Apache Hive Hive Partition
Hive Architecture and Components Hive Bucketing
Hive Metastore Hive Tables (Managed Tables and External Tables)
Comparison with Traditional Database Importing Data
Hive Data Types and Data Models Querying Data & Managing Outputs
Hands-on:
Importing data in Hive
Querying Data and managing outputs
Topics:
Hive Script Hive Indexes and views
Retail use case in Hive Hive Query Optimizers
Hive QL: Joining Tables, Dynamic Partitioning Hive Thrift Server
Custom MapReduce Scripts Hive UDF
Topics:
Need for Spark SQL Data Frames & Datasets
What is Spark SQL? Interoperating with RDDs
Spark SQL Architecture JSON and Parquet File Formats
SQL Context in Spark SQL Loading Data through Different Sources
User-defined Functions
Hands-on:
Load data through different sources
Creating Data Frames and Datasets
Implement SQL queries on DataFrames and Datasets
Topics:
Learning Objectives:
This model will help you to learn about advance Apache HBase concepts. You will find demos on HBase
Bulk Loading & HBase Filters. You will also learn what Zookeeper is all about, how it helps in monitoring a
cluster & why HBase uses Zookeeper.
Topics:
HBase v/s RDBMS HBase Client API
HBase Components HBase Data Loading Techniques
HBase Architecture Apache Zookeeper Introduction
HBase Run Modes ZooKeeper Data Model
HBase Configuration Zookeeper Service
HBase Cluster Deployment HBase Bulk Loading
HBase Data Model Getting and Inserting Data
HBase Shell HBase Filters
Hands-on:
Working on HBase Data loading, bulk loading and HBase Filters
Learning Objectives:
This model will help you to learn about Database Model and similarities between RDBMS and Cassandra
Data Model. You will also understand the key Database Elements of Cassandra and learn about the concept
of Primary Key.
Topics:
NoSQL databases Understand the analogy between RDBMS and
Common characteristics of NoSQL databases Cassandra Data Model
CAP theorem Understand following Database Elements:
How Cassandra solves the Limitations? Cluster, Keyspace, Column Family/Table, Column
History of Cassandra Column Family Options
Features of Cassandra Columns
Introduction to Database Model Wide Rows, Skinny Rows
Types of Data Models Static and dynamic tables
Hands-on:
Create Keyspace in Cassandra
Creating Tables
Topics:
Cassandra Architecture Replication Factor, Replication Strategy
Different Layers of Cassandra Architecture Different Data Types Used in Cassandra
Gossip Protocol Collection Types
Partitioning and Snitches What are CRUD Operations
Vnodes and How Read and Write Path works Insert, Select, Update and Delete of various
Understand Compaction elements
Anti-Entropy and Tombstone Various Functions Used in Cassandra
Repairs in Cassandra Importance of Roles and Indexing
Hinted Handoff Tombstones in Cassandra
Hands-on:
Check Created Keyspace in Create A Table Using Collection & UDT Column
System_Schema.Keyspaces Create Secondary Index On a Table
Update Replication Factor of Previously Created Insert Data into Table
Keyspace Insert Data into Table with UUID & TIMEUUID
Drop Previously Created Keyspace Columns
Create A Table Using cqlsh Insert Data Using COPY Command
Create A Table Using UUID & TIMEUUID Deleting Data from Table
Course Curriculum
Topics:
Why Data Mining? Measuring Data Similarity and Dissimilarity
What Is Data Mining? Data Pre-processing: An Overview
What Kinds of Data Can Be Mined? Data Cleaning
What Kinds of Patterns Can Be Mined? Data Integration
Which Technologies Are Used? Data Reduction
Which Kinds of Applications Are Targeted? Data Transformation and Data Discretization
Major Issues in Data Mining Data Warehouse: Basic Concepts
Data Objects and Attribute Types Data Warehouse Modelling: Data Cube and OLAP
Basic Statistical Descriptions of Data Data Warehouse Design and Usage
Data Visualization Data Warehouse Implementation
Hands-on:
Designing the complete ETL Pipeline
Designing a data warehouse
Creating a data warehouse
Topics:
Why Sqoop? Free-form Query Import
What is Hadoop? Generated Code
Sqoop Architecture Importing Data in Hive
How Sqoop Import and Export works? Importing Large Objects
Sqoop Commands Importing Data into HBase
Importing Data Export
Incremental Import
Hands-on:
Importing data from Relational Database to HDFS using Sqoop
Exporting data from HDFS to Relational Database using Sqoop
Importing data incrementally using Sqoop
Importing data in Hive from Relational Database using Sqoop
Importing data from Relational Database to HBase using Sqoop
Topics:
Data Manipulation Data optimization
Data Aggregation and Sampling Compression
Basic aggregation Storage & Job optimization
Enhanced aggregation Parallel execution
Grouping sets Join optimization
Rollup and Cube Common join
Aggregation condition Map join
Window functions Bucket map join
Sampling Sort merge bucket (SMB) join
Performance utilities Sort merge bucket map (SMBM) join
Design optimization Skew join
Use skewed/temporary tables Optimizer
Hands-on:
Performing ETL using Spark
Perform partitioning in Spark
Perform in-memory caching in Spark
Integrating Hive & Spark
Topics:
What is Oozie? Time and Data Trigger Coordinators
Scheduling with Oozie Monitoring an Oozie Coordinator Job
Running Oozie Applications Oozie Bundles Parameters
Oozie Workflows Variables and Functions
Monitoring Oozie Workflow Job Application Deployment Model,
Oozie Coordinators Oozie Architecture
Oozie Application Lifecycle
Hands-on:
Scheduling Job with Oozie
Creating Oozie workflow
Working with the Oozie Co-Ordinator
Course Curriculum
Topics:
What Is Stream Processing?
Stream Processing Use Cases
Advantages of Stream Processing
Challenges of Stream Processing
Stream Processing Design Points
Record-at-a-Time Versus Declarative APIs
Event Time Versus Processing Time
Continuous Versus Micro-Batch Execution
Topics:
Spark Streaming APIs - The DStream API Dropping Duplicates in a Stream
Event Time Arbitrary Stateful Processing
Stateful Processing Time-Outs
Hands-on:
Implement Flume and Spark Streaming Integration
Implement Sentiment Analysis using Spark Streaming
Topics:
Structured Streaming Basics Aggregations
Transformations and Actions Joins
Input Sources Input and Output
Sinks Where Data Is Read and Written (Sources and Sinks)
Output Modes Reading from the Kafka Source
Triggers Writing to the Kafka Sink
Event-Time Processing How Data Is Output (Output Modes)
Structured Streaming in Action When Data Is Output (Triggers)
Transformations on Streams Streaming Dataset API
Selections and Filtering
Hands-on:
Process streaming data where Kafka is a data source
Streaming Dataset API
Module 4: Low Latency and Real Time Data Processing with Kafka – I
Learning Objectives:
This module will help you to understand where Kafka fits in the Big Data space. You will learn about Kafka
components, Kafka cluster architecture and how to send and consume messages. At the end, you will learn
about Kafka producers which send records to topics.
Topics:
Publish/Subscribe Messaging Kafka Producers: Writing Messages to Kafka
What is Kafka? Sending a Message to Kafka
Topics and Partitions Configuring Producers
Producers and Consumers Serializers
Brokers and Clusters Partitions
Module 5: Low Latency and Real Time Data Processing with Kafka – II
Learning Objectives:
This module will help you to learn about Kafka consumers and various Kafka APIs such as Kafka Connect &
Kafka Stream API.
Topics:
Kafka Consumers: Reading Data from Kafka Building Data Pipelines
Consumers and Consumer Groups Data Formats
The Poll Loop Transformations
Commits and Offsets Kafka Connect
Deserializers Kafka Stream API
Standalone Consumer
Hands-on:
Running Kafka Connect and Kafka Stream API
Module 6: Building Streaming Data Pipeline Using Kafka, Spark & NoSQL
Learning Objectives:
This module will help you to create an end-to-end data pipeline using Kafka, Spark & NoSQL.
Topics:
End-to-end data pipeline using Kafka, Spark & NoSQL
Hands-on:
Building an end-to-end data pipeline using Kafka, Spark and NoSQL
Course Curriculum
Topics:
Data Governance for Big Data Metadata Management
Big Data Oversight: Five Key Concepts Consistency of Metadata and Reference Data for
Big Data Governance Framework Entity Extraction
Identifying the critical dimensions of data Data Enrichment and Enhancement
quality Data Classification
Cloudera Navigator Data Lineage and Impact Analysis
Topics:
Auditing and Access Control
Policy Enforcement and Data Lifecycle Automation
Topics:
Hadoop Security — An Overview Kerberos Trusts
Authentication, Authorization, and Accounting A Special Principal
Hadoop Authentication with Kerberos Kerberos Adding Kerberos Authorization to your Cluster
and How It Works Setting Up Kerberos for Hadoop
The Kerberos Authentication Process Securing a Hadoop Cluster with Kerberos
Topics:
How Kerberos Authenticates Users and Services Auditing HDFS Operations
Managing a Kerberized Hadoop Cluster Auditing YARN Operations
Hadoop Authorization Securing Hadoop Data
HDFS Permissions HDFS Transparent Encryption
Service Level Authorization Encrypting Data in Transition
Auditing Hadoop
Course Curriculum
Topics:
Introduction to Big Data Analytics Visualizing data
Big Data Analytics Project Life Cycle Data Science project life cycle
Identifying the problem and outcomes Hypothesis and Modeling
Data collection Measuring the effectiveness
Pre-processing data and ETL Making improvements
Performing analytics Communicating the results
Hands-on:
Pre-processing data and performing ETL
Hands-on:
Visualizing data and creating dashboards
Topics:
Machine Learning Overview
Machine Learning Terminologies
Machine Learning Types - Supervised, Unsupervised, Reinforcement Learning
Machine Learning Process
Spark Machine Learning Library
Machine Learning Pipelines
Transformers and Estimators
Topics:
Working with Vectors
Algorithms
Feature Extraction
Statistics
Classification and Regression
Clustering
Collaborative Filtering and Recommendation
Hands-on:
Analyzing data using various ML algorithms such as clustering, classification, regression and decision
tree