
DATA LAKE ANALYTICS PROGRAM

BIG DATA AND HADOOP TRAINING


15 April 2018 10 AM IST
22 April 2018 10 AM IST

Program Highlights :

The Data Lake Analytics Program is developed by experienced, proven professionals from the core big data industry. The program offers three major streams; candidates choose one certification track, and our experienced big data professionals tailor the training to the needs of the chosen track and help candidates become part of the big data industry. The program includes extensive foundation training on Red Hat Linux, Apache Hadoop and Apache Spark, while the advanced training includes hands-on work with the Hortonworks Distribution of Apache Hadoop (Hortonworks Data Platform), a data science Spark lab with the Zeppelin notebook, and the Scala IDE. Our experienced trainers help you perform the essential tasks in the exam objectives of your chosen certification track and earn these recognition badges:

HDP Certified Developer (HDPCD)
HDP Certified Spark Developer (HDPCD: Spark)
HDP Certified Administrator (HDPCA)

Tools

Open Source Projects



Who Should Take the Data Lake Analytics Program, and Why

Data Lake Analytics Program training helps engineering and IT graduates increase their employability through niche technologies that are in demand in the big data industry. The program is also helpful for experienced IT professionals, such as Linux administrators, BI developers and data analysts, who want to upgrade their skills with next-generation technologies and take up new roles such as Hadoop Administrator, Big Data Hadoop Architect, Data Engineer and Data Scientist.

Nasscom, the IT industry’s trade association, decided to clear the air and set the record straight, saying:

“The big challenge for IT companies, however, will be to re-engineer its 3.9 million-strong human resource base to meet the demands of a fast-transforming marketplace. Not only is technology changing rapidly, with automation and big data making deep inroads, the demands of industry’s global clientele have also evolved.”

“India’s $150 billion IT industry has a new mantra: Re-skill or perish”


Training Modes :

Data Lake Analytics Program training is delivered in two modes: candidates can opt for either physical classroom training or virtual classroom training delivered through audio conferencing and desktop sharing.

The training kicks off on 15 April 2018 and 22 April 2018 with an introductory session, after which the training calendar will be distributed to enrolled candidates.
Additional Services :

Industry-based projects and case studies; job assistance (referrals); resume building; professional grooming and mock interviews for freshers.

Training Duration : 2-3 months (weekends: Saturday and Sunday)

Training Location :

Shop 136, Boulevard Mall, Mumbai Agra Road, Thane West, Thane, Maharashtra

You can email us with any queries about the training program and fees at datalakeacademy@gmail.com.

Contact : Pravin Bhavar - +91 8452019117, Sunil Panwar - +91 9967407549

Amit Kadam - +91 9962982168, Ankit Yadav - +91 9833564610

Soumya Sahu - +91 9029101137


Key Terminologies

* A Data Lake is a data management platform comprising one or more Hadoop clusters, used principally to process and store non-relational data such as log files, Internet clickstream records, sensor data, JSON objects, images and social media posts.

* Hortonworks is the leading commercial vendor of Apache Hadoop; its popular, enterprise-ready big data platform is widely used in big data projects across the industry.

* Cloudera was founded in 2008 by some of the brightest minds at Silicon Valley’s leading companies. Doug Cutting, co-creator of Hadoop, joined the company in 2009 as Chief Architect and remains in that role.

* Google Cloud Platform, offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube.

* Red Hat® Enterprise Linux® gives you the tools you need to modernize your infrastructure, boost efficiency through standardization and virtualisation, and ultimately prepare your datacenter for an open, hybrid cloud IT architecture.

* Scala IDE provides advanced editing and debugging support for the development of pure Scala and mixed Scala-Java applications.

* Apache Spark™ is a fast and general engine for large-scale data processing.

* Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

* Ansible delivers simple IT automation that ends repetitive tasks and frees up DevOps teams for more strategic work.

* RStudio makes R easier to use. It includes a code editor and debugging and visualization tools.

* R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing.


Training Content - Hadoop Administrator

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Introduction to Linux, Big Data, Apache Hadoop

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Big Data, Apache Hadoop
2) Introduction to Linux
3) Linux Boot Process and Architecture
4) Linux Commands, Shell Scripts , Cron Utility
5) RHEL Linux OS Best Practices for Hadoop
6) Virtualization (VMware, VirtualBox)
7) Quiz & Q&A

LAB :

1) Installation of Linux OS
2) Practising Linux OS Commands
3) Configuring RHEL Linux OS Best Practices for Hadoop
4) Read and Execute Shell Script and Schedule through Cron Utility

Day 2: Introduction to HDFS Architecture and Operations

OBJECTIVES \ LECTURES :
1) Design of HDFS and Core Concepts:
   Blocks
   NameNodes and DataNodes
   HDFS Federation
   HDFS High Availability
2) Manage HDFS using Command-line Tools
3) Discussing Hadoop Cluster Installation Options
4) Understanding Hadoop Configuration Files
5) Quiz & Q&A

LAB :

1) Installation of a Single-Node Apache Hadoop Cluster
2) Walkthrough of Hadoop Configuration Files
3) Perform HDFS Operations, FSCK Utility (see the API sketch below)
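
These lab tasks are driven by the hdfs command line (hdfs dfs, hdfs fsck). For candidates who want to see the programmatic side as well, here is a minimal sketch of the same basic operations through the Hadoop FileSystem API in Scala; it assumes hadoop-client on the classpath and the cluster's core-site.xml/hdfs-site.xml reachable, and the paths are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsOpsDemo {
      def main(args: Array[String]): Unit = {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        val dir = new Path("/user/student/demo")   // hypothetical HDFS directory
        if (!fs.exists(dir)) fs.mkdirs(dir)

        // Equivalent of: hdfs dfs -put /tmp/sample.txt /user/student/demo
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path(dir, "sample.txt"))

        // Equivalent of: hdfs dfs -ls /user/student/demo
        fs.listStatus(dir).foreach(s => println(s"${s.getPath}  ${s.getLen} bytes"))

        fs.close()
      }
    }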

Home Assignment for Week 1


AGENDA SUMMARY

Week 2 :

Day 3 : Introduction to Apache YARN

OBJECTIVES \ LECTURES :
1) MR1 vs YARN ( MR2)
2) YARN Architecture- Anatomy of a YARN Application Run
3) Scheduling In YARN and Scheduler Options
4) Capacity Scheduler Configuration
5) Fair Scheduler Configuration
6) Preemption , Delay Scheduling
7) Dominant Resource Fairness ( DRF) Configuration
8) Quiz & Q&A

LAB :

1) Review of Home Assignment for Week 1
2) Configuring the Capacity Scheduler with DRF
3) Running Sample MapReduce Applications (Hadoop example JARs) and Observing the MR Output Results
4) Performing YARN Operations through the YARN CLI (see the sketch below)
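
As a programmatic counterpart to the YARN CLI task above, the sketch below lists applications through the YarnClient API in Scala, roughly what `yarn application -list` prints; it assumes hadoop-yarn-client on the classpath and a ResourceManager reachable via yarn-site.xml.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.client.api.YarnClient
    import scala.collection.JavaConverters._

    object YarnAppList {
      def main(args: Array[String]): Unit = {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager
        val yarn = YarnClient.createYarnClient()
        yarn.init(new Configuration())
        yarn.start()

        // Rough equivalent of: yarn application -list
        yarn.getApplications.asScala.foreach { app =>
          println(s"${app.getApplicationId}  ${app.getName}  queue=${app.getQueue}  state=${app.getYarnApplicationState}")
        }

        yarn.stop()
      }
    }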

Day 4: Introduction to Hortonworks Data Platform (HDP) - the Hortonworks Distribution of Apache Hadoop

OBJECTIVES \ LECTURES :
1) Introduction to HDP and Architecture
2) Understanding Typical Production Cluster Specification - NameNodes, DataNodes, EdgeNodes,
Management or Utility Nodes - Hardware Requirements
3) Understanding Network Architecture / Topology for Typical Production Hadoop Cluster
4) Understanding Role of ZooKeeper and Journal Nodes

LAB :

Installation of a Multi-Node Cluster through Ambari (HDP 2.5) - Part I
(Prerequisites: OS-level configuration, database requirements, local repo setup)

1) Configuring RHEL Linux OS Best Practices Recommended by Hortonworks
2) Installation of a JDK Supported by HDP
3) Installation and Configuration of an External MySQL Database
4) Preparing the Environment for HDP Installation (Password-less SSH, NTP, SELinux, etc.)
5) Setting Up a Local Repository with No Internet Access
6) Cluster Planning Sheet for Cluster Deployment

Home Assignment for Week 2



AGENDA SUMMARY

Week 3 :

Day 5: Multi-Node Cluster Installation using Ambari

OBJECTIVES \ LECTURES :
1) Recap of Key Concepts of HDFS and YARN
2) Quiz & Q&A
3) Discussing Best Practices for HDP Cluster Deployment
4) Case Study of a Typical Production Cluster Big Data Architecture
5) Understanding Hive and Spark Architecture
6) Manage HDFS using Ambari Web, NameNode and DataNode UIs
7) Summarize the Purpose and Benefits of Rack Awareness

LAB :

1) Review of Home Assignment for Week 2
2) Installation of a Multi-Node Cluster through Ambari (HDP 2.5) - Part II - Core Services
3) Managing HDFS Storage in the HDP Multi-Node Cluster
4) Managing HDFS Quotas
5) Configuring Rack Awareness
6) Managing HDFS Snapshots
7) Configuring HDFS Storage Policies
8) Configuring HDFS Centralized Cache
9) Using HDFS Access Control Lists
10) DistCp Usage
11) Perform HDFS Operations - fsck, dfsadmin, etc. (see the sketch below)
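
The quota and snapshot labs above are normally driven by hdfs dfsadmin / hdfs dfs commands. For reference, a minimal Scala sketch of the same operations through the DistributedFileSystem API; the paths and limits are hypothetical, and the code must run against a real HDFS endpoint.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hdfs.DistributedFileSystem

    object HdfsAdminDemo {
      def main(args: Array[String]): Unit = {
        // The cast only succeeds when fs.defaultFS points at a real HDFS cluster
        val dfs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]

        val dir = new Path("/user/student/quota-demo")   // hypothetical directory
        dfs.mkdirs(dir)

        // Equivalent of: hdfs dfsadmin -setQuota 1000 ... and -setSpaceQuota 10g ...
        dfs.setQuota(dir, 1000, 10L * 1024 * 1024 * 1024)

        // Equivalent of: hdfs dfsadmin -allowSnapshot ... then hdfs dfs -createSnapshot ...
        dfs.allowSnapshot(dir)
        val snapshot = dfs.createSnapshot(dir, "s1")
        println(s"snapshot created at $snapshot")
      }
    }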

Day 6:

OBJECTIVES \ LECTURES : YARN

Home Assignment for Week 3



AGENDA SUMMARY

Week 4:

Day 7 : HIGH AVAILABILITY WITH HDP, DEPLOYING HDP WITH BLUEPRINTS, AND THE HDP
UPGRADE PROCESS

OBJECTIVES \ LECTURES :
Recap of Week 3 - Quiz & Q&A
Summarize the Purpose of NameNode HA
Configure NameNode HA Using Ambari
Describe the Features and Benefits of the Apache Ambari Dashboard
Ambari Views and Blueprints

LAB :
Configuring NameNode HA
Configuring Resource Manager HA
Configuring Ambari Alerts

Day 8:

OBJECTIVES \ LECTURES :
Recall the Types and Methods of Upgrades Available in HDP
Describe the Upgrade Process, Restrictions and Pre-upgrade Checklist
Perform an Upgrade Using the Apache Ambari Web UI

LAB :

Performing an HDP Upgrade – Express

Home Assignment for Week 4

Week 5 - HDP Security

Day 9 - Ranger

OBJECTIVES \ LECTURES :
Authentication and Authorization
Ambari User Management
Ranger Architecture
Atlas Architecture
Hue Architecture
Case Study of Typical User Management in an Enterprise Cluster through AD (Kerberos)
SmartSense Usage

LAB :
Ranger Installation
Creating Ranger Policies (e.g., for HDFS and Hive)

Day 10

HDPCA Certification Tasks - Part 1


Home Assignment for Week 5
AGENDA SUMMARY

Week 6:

Day 11 :
Recap of Week 5
HDPCA Certification Tasks - Part 2

Day 12:

HDPCA Practice Test / Mock Exam on Google Cloud Platform

Registration for Google Cloud Platform
Register for the HDPCA Exam on the Hortonworks Portal

Project Assignment

Week 7 - Miscellaneous Topics

Day 13

Project Discussion and Mock Interviews

Q&A



Hadoop Developer Certification Course

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Data Ingestion

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Hive
2) Hive Architecture
3) Sqoop and Flume Architecture

LAB :
Import data from a table in a relational database into HDFS
Import the results of a query from a relational database into HDFS
Import a table from a relational database into a new or existing Hive table
Insert or update data from HDFS into a table in a relational database
Given a Flume configuration file, start a Flume agent
Given a configured sink and source, configure a Flume memory channel with a specified capacity
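
The import/export tasks above are performed with Sqoop, and the agent and channel tasks with Flume, following the HDPCD exam objectives. As a point of comparison only (not the lab's method), here is a minimal Spark sketch in Scala that pulls a relational table over JDBC and lands it in Hive; the connection string, credentials and table names are hypothetical, and the matching JDBC driver must be on the classpath.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object JdbcImportSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-import-sketch")
          .enableHiveSupport()                      // requires hive-site.xml on the classpath
          .getOrCreate()

        val props = new Properties()
        props.setProperty("user", "retail_user")    // hypothetical credentials
        props.setProperty("password", "secret")

        // Read one table over JDBC, then persist it as a Hive table
        val orders = spark.read.jdbc("jdbc:mysql://dbhost:3306/retail", "orders", props)
        orders.write.mode("overwrite").saveAsTable("orders_raw")

        spark.stop()
      }
    }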

Day 2: Data Transformation

OBJECTIVES \ LECTURES :
1) Introduction to Pig
2) Transformations Available in Pig

LAB :
Write and execute a Pig script
Load data into a Pig relation without a schema
Load data into a Pig relation with a schema
Load data from a Hive table into a Pig relation
Use Pig to transform data into a specified format
Transform data to match a given Hive schema
Group the data of one or more Pig relations
Use Pig to remove records with null values from a relation
Store the data from a Pig relation into a folder in HDFS
Store the data from a Pig relation into a Hive table
Sort the output of a Pig relation
Remove the duplicate tuples of a Pig relation
Specify the number of reduce tasks for a Pig MapReduce job
Join two datasets using Pig
Perform a replicated join using Pig
Run a Pig job using Tez
Within a Pig script, register a JAR file of User Defined Functions
Within a Pig script, define an alias for a User Defined Function
Within a Pig script, invoke a User Defined Function
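
All of the tasks above are written in Pig Latin during the lab. As a bridge to the Spark course later in the program, the sketch below shows the same load-with-schema, null-filtering, de-duplication, ordering and store pattern expressed in Spark (Scala); the input and output paths and the column name are hypothetical.

    import org.apache.spark.sql.SparkSession

    object PigStyleEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pig-style-etl").getOrCreate()
        import spark.implicits._

        // LOAD with a schema taken from the header row (cf. Pig's LOAD ... AS (...))
        val users = spark.read.option("header", "true").csv("hdfs:///data/users.csv")

        users.na.drop()            // FILTER out records containing null values
          .dropDuplicates()        // DISTINCT
          .orderBy($"name")        // ORDER BY (assumes a 'name' column exists)
          .write.mode("overwrite")
          .csv("hdfs:///data/users_clean")  // STORE into an HDFS folder

        spark.stop()
      }
    }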

Home Assignment for Data Ingestion and Transformation using Sqoop, Flume and Pig

AGENDA SUMMARY
Week 2 : Hive SQL

Day 3 :

Objectives : HiveQL in Depth

DDL (create/drop/alter/truncate/show/describe), Statistics (analyze), Indexes, Archiving
DML (load/insert/update/delete/merge, import/export, explain plan)
File Formats and Compression: RCFile, Avro, ORC, Parquet; LZO
Hive Configuration Properties
Hive Clients (JDBC, ODBC, Thrift)
HiveServer2: Overview, HiveServer2 Client and Beeline, Hive Metrics

Data Analysis LAB

Write and execute a Hive query
Define a Hive-managed table
Define a Hive external table
Define a partitioned Hive table
Define a bucketed Hive table
Define a Hive table from a select query
Define a Hive table that uses the ORCFile format
Create a new ORCFile table from the data in an existing non-ORCFile Hive table
Specify the storage format of a Hive table
Specify the delimiter of a Hive table
Load data into a Hive table from a local directory
Load data into a Hive table from an HDFS directory
Load data into a Hive table as the result of a query
Load a compressed data file into a Hive table
Update a row in a Hive table
Delete a row from a Hive table
Insert a new row into a Hive table
Join two Hive tables
Run a Hive query using Tez
Run a Hive query using vectorization
Output the execution plan for a Hive query
Use a subquery within a Hive query
Output data from a Hive query that is totally ordered across multiple reducers
Set a Hadoop or Hive configuration property from within a Hive query
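
The data analysis lab above is carried out in HiveQL through Beeline or the Hive CLI. Because the program pairs Hive with Spark, here is a minimal Scala sketch that issues the same kind of statements through Spark SQL with Hive support; the table names are hypothetical and a Hive metastore must be reachable.

    import org.apache.spark.sql.SparkSession

    object HiveOrcSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-orc-sketch")
          .enableHiveSupport()   // connects to the Hive metastore via hive-site.xml
          .getOrCreate()

        // Create an ORC-backed table from an existing non-ORC table (hypothetical names)
        spark.sql("CREATE TABLE IF NOT EXISTS sales_orc STORED AS ORC AS SELECT * FROM sales_text")

        // Join-style aggregation and display, as the lab tasks do in HiveQL
        spark.sql("SELECT region, SUM(amount) AS total FROM sales_orc GROUP BY region").show()

        spark.stop()
      }
    }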

Day 4 : Developer Certification Tasks Preparation

Home Assignment for Hive - Data Analysis Task

© 2018 Data Lake Academy

Reference: https://cwiki.apache.org/confluence/display/Hive/Home
Spark Certification Course

Prerequisites : Hands-on experience with Hadoop developer tasks; should have attended the Hadoop Developer Course.

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Introduction to Spark

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Spark
2) Benefits of Spark over MapReduce
3) Spark Architecture
4) Spark IDE Overview (Scala IDE, IntelliJ, Maven)

LAB :

Write a Spark Core application in Python or Scala
Initialize a Spark application
Run a Spark job on YARN
Create an RDD
Create an RDD from a file or directory in HDFS
Persist an RDD in memory or on disk
Perform Spark transformations on an RDD
Perform Spark actions on an RDD
Create and use broadcast variables and accumulators
Configure Spark properties
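
A minimal word-count-style sketch in Scala covering several of the tasks above: create an RDD from an HDFS file, run transformations and an action, use a broadcast variable and an accumulator, and persist an RDD. The input/output paths are hypothetical, and it assumes Spark 2.x submitted via spark-submit --master yarn.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

        val lines = sc.textFile("hdfs:///data/input.txt")    // RDD from an HDFS file
        val stopWords = sc.broadcast(Set("a", "an", "the"))  // broadcast variable shared by executors
        val emptyTokens = sc.longAccumulator("emptyTokens")  // accumulator (Spark 2.x API)

        val counts = lines
          .flatMap(_.split("\\s+"))
          .filter { w =>
            if (w.isEmpty) { emptyTokens.add(1); false }     // count and drop empty tokens
            else !stopWords.value(w)                         // drop stop words via the broadcast set
          }
          .map((_, 1))
          .reduceByKey(_ + _)
          .persist()                                         // keep the RDD in memory for reuse

        counts.saveAsTextFile("hdfs:///data/wordcounts")     // action triggers the job
        println(s"empty tokens skipped: ${emptyTokens.value}")
        sc.stop()
      }
    }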

Day 2: Spark SQL & Spark Streaming

OBJECTIVES \ LECTURES :
1) Spark SQL architecture
2) Spark Streaming Architecture
3) Data Visualisation using Zeppelin

LAB :

Create Spark DataFrames from an existing RDD
Perform operations on a DataFrame
Write a Spark SQL application
Use Hive with ORC from Spark SQL
Write a Spark SQL application that reads and writes data from Hive tables
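
A minimal Scala sketch for the first and last tasks above: build a DataFrame from an existing RDD through a case class, run a simple operation, and save the result as a Hive table. The table name is hypothetical, and Hive support requires a reachable metastore.

    import org.apache.spark.sql.SparkSession

    case class Reading(sensor: String, value: Double)

    object DataFrameFromRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("df-from-rdd")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // An existing RDD of case-class records becomes a DataFrame with toDF()
        val rdd = spark.sparkContext.parallelize(Seq(Reading("s1", 21.5), Reading("s2", 19.0)))
        val df = rdd.toDF()

        df.filter($"value" > 20.0).show()   // a simple DataFrame operation

        // Write the DataFrame to Hive as an ORC table (hypothetical name)
        df.write.mode("overwrite").format("orc").saveAsTable("readings_orc")
        spark.stop()
      }
    }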


Home Assignment for Spark Core & SQL



Data Lake Analytics Program Enablers

[Diagram: Traditional Architecture vs. Modern Architecture]

Data Lake Analytics Program - Hadoop Administration: Hortonworks Data Platform
Data Lake Analytics Program - Hadoop Developer + Spark Course

Data Lake Analytics Program - Course Offerings

Candidates can opt for any of the course offerings below. Our recommendation is the Big Data Architect / SME program, which delivers the most value, builds a strong career path with leading big data organisations and startups, and is a real investment in your career.

HDPCA - This is an entry-level program that helps you earn the Hortonworks administration certification badge.

HDPCA + Ansible - This program caters to the industry’s need for automation alongside Hadoop; it carries more value in the market and therefore better incentives.

HDP Developer Spark – This program covers Apache Hadoop and HDP basics along with core Spark developer tools, using the Scala IDE and Zeppelin, plus RStudio for visualisation and statistical reporting.

HDPCA + Cloudera Manager – This is the complete Hadoop administrator program, covering both HDP and CDH cluster operations. It is likely to raise your chances of being selected as a Hadoop administrator, since most leading companies prefer Cloudera for its enterprise features and overall stability.

Big Data Architect / SME Course – This unique program meets the needs of the Big Data Architect or SME role. After completing this course, you will be able to help your clients build end-to-end solutions on the Hadoop platform, with a data ingestion framework for data at rest and in motion; it also covers the latest Spark developer tools for building real-time use cases and stunning visualisations. This course is recommended for experienced hires who want to move into a big data architect role, and for new graduates who want the big picture of building end-to-end solutions, which can open up big opportunities at start-up firms.

© 2018 Data Lake Academy
