
Course Curriculum: Your 10 Module Learning Plan

Module 1
HADOOP ARCHITECTURE
Learning Objectives - In this module, you will understand what Big Data is, the
limitations of existing solutions to the Big Data problem, how Hadoop solves the Big
Data problem, the common Hadoop ecosystem components, Hadoop Architecture, HDFS
and the MapReduce framework, and the anatomy of a file write and read.

Topics:
 What is Big Data
 Hadoop Architecture
 Hadoop ecosystem components
 Hadoop Storage: HDFS
 Hadoop Processing: MapReduce Framework
 Hadoop Server Roles: NameNode, Secondary NameNode, and DataNode
 Anatomy of File Write and Read.
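The "Anatomy of File Write" topic comes down to HDFS splitting a file into fixed-size blocks and replicating each block across DataNodes. As a minimal sketch of just the block-splitting arithmetic (the default block size is 64 MB in Hadoop 1.x and 128 MB in 2.x via dfs.blocksize; the function name here is our own, not a Hadoop API):

```python
# Sketch: how HDFS conceptually splits a file into fixed-size blocks.
# 128 MB matches the Hadoop 2.x default (dfs.blocksize); the function is illustrative.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024

def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks; the last one holds only the final 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Each of these blocks would then be written through the replication pipeline to (by default) three DataNodes, which is what the file-write anatomy in this module walks through.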
Module 2

HADOOP CLUSTER CONFIGURATION AND DATA LOADING


Learning Objectives - In this module, you will learn the Hadoop Cluster architecture
and setup, the important configuration files in a Hadoop cluster, and data loading
techniques.

Topics:
 Hadoop Cluster Architecture
 Hadoop Cluster Configuration files
 Hadoop Cluster Modes
 Multi-Node Hadoop Cluster
 A Typical Production Hadoop Cluster
 MapReduce Job execution
 Common Hadoop Shell commands
 Data Loading Techniques:
o FLUME
o SQOOP
o Hadoop Copy Commands
 Hadoop Project: Data Loading
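The "Hadoop Cluster Configuration files" topic centers on a handful of XML files. As a hedged illustration of their shape (the hostname is a placeholder; fs.defaultFS is the standard Hadoop 2.x property name), a minimal core-site.xml might look like this:

```xml
<!-- core-site.xml: tells clients and daemons where the NameNode lives.
     namenode.example.com is a placeholder hostname. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

The other files covered in this module (hdfs-site.xml, mapred-site.xml) follow the same property/name/value structure.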
Module 3
HADOOP MAPREDUCE FRAMEWORK
Learning Objectives - In this module, you will understand Hadoop MapReduce
framework and how MapReduce works on data stored in HDFS. You will also learn the
different types of Input and Output formats in MapReduce framework and their usage.

Topics:
 Hadoop Data Types
 Hadoop MapReduce paradigm
 Map and Reduce tasks
 MapReduce Execution Framework
 Partitioners and Combiners
 Input Formats (Input Splits and Records, Text Input, Binary Input, Multiple Inputs)
 Output Formats (Text Output, Binary Output, Multiple Output)
 Hadoop Project: MapReduce Programming
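To make the Map and Reduce phases above concrete, here is a single-process Python sketch of the classic word-count pattern. This is a conceptual illustration only (no Hadoop involved; function names are our own), showing what the framework does between the map and reduce tasks:

```python
from collections import defaultdict

def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in an input record."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: sum the counts emitted for one word."""
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 3, "data": 1, "hadoop": 2}
```

In a real job, the input formats listed above decide how records reach map_phase, and the output formats decide how the reduced pairs are written back to HDFS.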
Module 4

ADVANCED MAPREDUCE
Learning Objectives - In this module, you will learn advanced MapReduce concepts
such as Counters, Schedulers, Custom Writables, Compression, Serialization, Tuning,
and Error Handling, and how to deal with complex MapReduce programs.

Topics:
 Counters
 Custom Writables
 Unit Testing: JUnit and MRUnit testing framework
 Error Handling and Tuning
 Advanced MapReduce
 Hadoop Project: Advanced MapReduce programming and error handling.
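Counters, the first topic above, let a job report job-wide statistics (for example, malformed input records) without failing tasks. A sketch of the idea in plain Python, under the assumption that Hadoop's per-task counters are summed by the framework at job end (here we do that summing by hand):

```python
from collections import Counter

def parse_record(record, counters):
    """A map-side parse that increments counters instead of aborting the job."""
    fields = record.split(",")
    if len(fields) != 2:
        counters["MALFORMED_RECORDS"] += 1
        return None
    counters["GOOD_RECORDS"] += 1
    return fields

# Each map task keeps its own counters; the framework aggregates them at the end.
task_counters = [Counter(), Counter()]
inputs = [["a,1", "bad"], ["b,2", "c,3"]]
for task, records in zip(task_counters, inputs):
    for record in records:
        parse_record(record, task)
job_totals = sum(task_counters, Counter())
# job_totals counts 3 good records and 1 malformed record across both tasks.
```

The error-handling topic in this module builds on the same pattern: count and skip bad records rather than letting one corrupt line kill a long-running job.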
Module 5

PIG AND PIG LATIN


Learning Objectives - In this module, you will learn what Pig is, the types of use cases
in which Pig can be used, how Pig is tightly coupled with MapReduce, and Pig Latin scripting.

Topics:
 Installing and Running Pig
 Grunt
 Pig's Data Model
 Pig Latin
 Developing & Testing Pig Latin Scripts
 Writing Evaluation and Filter Functions
 Load & Store Functions
 Hadoop Project: Pig Scripting
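One way to write the evaluation functions listed above is in Python: Pig can run Python UDFs under Jython once the file is registered from a Pig script. As a hedged sketch, the body of such a UDF is plain Python; the @outputSchema decorator mentioned in the comment is supplied by Pig's Jython support and is omitted here so the function stands alone:

```python
# In a Pig script this file would typically be registered with:
#   REGISTER 'udfs.py' USING jython AS myudfs;
# and the function decorated with @outputSchema('len:int') so Pig knows its
# return type. The decorator comes from Pig's Jython runtime, not from Python.

def str_len(value):
    """A trivial evaluation function: length of a chararray field (None-safe)."""
    if value is None:
        return 0
    return len(value)
```

Inside Pig Latin the function would then be called like any built-in, e.g. in a FOREACH ... GENERATE expression.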
Module 6

HIVE AND HIVEQL


Learning Objectives - This module will help you understand Apache Hive installation,
loading and querying data in Hive, and so on.

Topics:
 Hive Architecture and Installation
 Comparison with Traditional Database
 HiveQL: Data Types, Operators and Functions
 Hive Tables (Managed Tables and External Tables, Partitions and Buckets, Storage
Formats, Importing Data, Altering Tables, Dropping Tables)
 Querying Data (Sorting and Aggregating, MapReduce Scripts, Joins & Subqueries,
Views, Map- and Reduce-side Joins to optimize queries).
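The last topic above, map- versus reduce-side joins, is easy to picture: a map-side (broadcast) join loads the small table into memory and streams the large one past it, avoiding the shuffle that a reduce-side join needs. A sketch of that logic in plain Python, with invented sample data:

```python
def map_side_join(small_table, big_table):
    """Join by loading the small table into a dict (as Hive's map join does)
    and streaming rows of the big table past it -- no shuffle phase needed."""
    lookup = {key: value for key, value in small_table}
    return [(key, big_value, lookup[key])
            for key, big_value in big_table if key in lookup]

# Invented sample data: a small departments table joined onto a large employees table.
departments = [(1, "Sales"), (2, "Engineering")]
employees = [(1, "Ada"), (2, "Grace"), (3, "Linus")]
joined = map_side_join(departments, employees)
# joined == [(1, "Ada", "Sales"), (2, "Grace", "Engineering")]
```

Hive makes the same choice for you when one side of a join is small enough to broadcast to every mapper, which is why it appears here as a query optimization.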
Module 7

ADVANCED HIVE, NoSQL DATABASES AND HBASE


Learning Objectives - In this module, you will understand advanced Hive concepts
such as UDFs. You will also acquire in-depth knowledge of what HBase is, how to load
data into HBase, and how to query data from HBase using a client.

Topics:
 Hive: Data manipulation with Hive
 User Defined Functions
 Appending Data into existing Hive Table
 Custom Map/Reduce in Hive
 Hadoop Project: Hive Scripting
 HBase: Introduction to HBase
 Client APIs and their features
 Available Clients
 HBase Architecture
 MapReduce Integration.
Module 8

ADVANCED HBASE AND ZOOKEEPER


Learning Objectives - This module covers advanced HBase concepts. You will also
learn what ZooKeeper is all about, how it helps in monitoring a cluster, why HBase uses
ZooKeeper, and how to build applications with ZooKeeper.

Topics:
 HBase: Advanced Usage
 Schema Design
 Advanced Indexing
 Coprocessors
 Hadoop Project: HBase tables
 The ZooKeeper Service: Data Model
 Operations
 Implementation
 Consistency
 Sessions
 States.
Module 9

HADOOP 2.0, MRv2 AND YARN


Learning Objectives - In this module, you will understand the newly added features in
Hadoop 2.0, namely YARN, MRv2, NameNode High Availability, HDFS Federation, and
support for Windows.

Topics:
 Schedulers: Fair and Capacity
 Hadoop 2.0 New Features: NameNode High Availability
 HDFS Federation
 MRv2
 YARN
 Running MRv1 in YARN
 Upgrading existing MRv1 code to MRv2
 Programming in YARN framework.
Module 10

HADOOP PROJECT ENVIRONMENT AND APACHE OOZIE


Learning Objectives - In this module, you will understand how multiple Hadoop
ecosystem components work together in a Hadoop implementation to solve Big Data
problems. We will discuss multiple data sets and specifications of the project. This
module will also cover Apache Oozie Workflow Scheduler for Hadoop Jobs.

Some of the data sets on which you may work as a part of the project work:

 Twitter Data Analysis: Download Twitter data, load it into HBase, and use Pig,
Hive, and MapReduce to gauge the popularity of selected hashtags

 Stack Exchange Ranking and Percentile dataset: A dataset from Stack Overflow
containing ranking and percentile details of users

 Loan Dataset: It covers users who have taken loans, along with their EMI details,
time period, etc.

 Datasets by Government: e.g., Worker Population Ratio (per 1000) for persons of
age 15-59 years according to the current weekly status approach, for each state/UT

 Machine Learning Dataset, such as the Badges dataset: a dataset for a system that
encodes names, e.g., a +/- label followed by a person's name

 NYSE Data Set: New York Stock Exchange data

 Weather Dataset: It contains details of weather over a period of time, from which
you can find the hottest, coldest, or average temperature

In addition, you can choose your own dataset and create a project around that as well.
Why Learn Hadoop?

Big Data! A Worldwide Problem?


According to Wikipedia, “Big data is a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools or traditional
data processing applications.” In simpler terms, Big Data is a term for the large
volumes of data that every IT company stores, and it is becoming almost impossible
for them to process, share, store, and search that data. What is the biggest anxiety of
companies across the world? It is how to manage Big Data!

It is becoming almost impossible for large companies to store, retrieve, and process
their ever-increasing data. If any company gets a handle on this, nothing can stop it
from becoming the next big success. The problem lies in using traditional systems to
store enormous data: though these systems were a success a few years ago, with the
increasing amount and complexity of data they are fast becoming obsolete. The good
news is Hadoop, which is nothing less than a panacea for all those companies working
with Big Data in a variety of applications, and which has become a de facto standard
for storing, handling, evaluating, and retrieving hundreds of terabytes, and even
petabytes, of data.

Apache Hadoop! A Solution for Big Data!

Hadoop is an open-source software framework that supports data-intensive distributed
applications. Hadoop is licensed under the Apache v2 license and is therefore generally
known as Apache Hadoop. Hadoop was developed based on a paper originally written
by Google about its MapReduce system, and it applies concepts of functional
programming. Hadoop is written in the Java programming language and is a top-level
Apache project built and used by a global community of contributors. Hadoop was
developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming
yellow elephant: it is named after Doug's son's toy elephant!
Some of the top companies using Hadoop:

The importance of Hadoop is evident from the fact that many global MNCs, such as
Yahoo! and Facebook, use Hadoop and consider it an integral part of their operations.
On February 19, 2008, Yahoo! Inc. launched what was then the world's largest Hadoop
production application. The Yahoo! Search Webmap is a Hadoop application that runs
on a Linux cluster with more than 10,000 cores and produces data that is used in every
Yahoo! Web search query.

Facebook, a $5.1 billion company, had over 1 billion active users in 2012, according to
Wikipedia. Storing and managing data of such magnitude would be a problem even for
a company like Facebook, but thanks to Apache Hadoop it is not: Facebook uses
Hadoop to keep track of every profile it hosts, as well as all the related data such as
pictures, posts, and comments.

Opportunities for Hadoopers!

Opportunities for Hadoopers are endless: Hadoop architect, developer, tester, and so
on. If cracking and managing Big Data is your passion, then think no more: join
Edureka's Hadoop online course and carve a niche for yourself!

Happy Hadooping!
