
DATA LAKE ANALYTICS PROGRAM

BIG DATA AND HADOOP TRAINING


15 April 2018 10 AM IST
22 April 2018 10 AM IST

Program Highlights :

The Data Lake Analytics Program is developed by experienced, proven professionals from the core big data industry. The program offers three major streams; candidates choose one certification track, and our experienced big data professionals tailor the training to the needs of the chosen track and help candidates become part of the big data industry. The program includes extensive foundation training on Red Hat Linux, Apache Hadoop and Apache Spark, while the advanced training includes hands-on work with the Hortonworks Distribution of Apache Hadoop (Hortonworks Data Platform), a data science Spark lab with the Zeppelin notebook, and the Scala IDE. Our experienced trainers help you perform the essential tasks in the exam objectives of your chosen certification track and earn these recognition badges:

HDP Certified Developer (HDPCD)
HDP Certified Spark Developer (HDPCD: Spark)
HDP Certified Administrator (HDPCA)

Tools

Open Source Projects



Who Should Take the Data Lake Analytics Program, and Why

Data Lake Analytics Program training helps engineering and IT graduates increase their employability through niche technologies that are in demand in the big data industry. The program is also helpful for experienced IT professionals, such as Linux administrators, BI developers and data analysts, who want to upgrade their skills with next-generation technologies and take up new roles such as Hadoop Administrator, Big Data Hadoop Architect, Data Engineer and Data Scientist.

Nasscom, the IT industry’s trade association, decided to clear the air and set the record straight, saying:

“The big challenge for IT companies, however, will be to re-engineer its 3.9 million-strong human resource base to meet the demands of a fast-transforming marketplace. Not only is technology changing rapidly, with automation and big data making deep inroads, the demands of industry’s global clientele have also evolved.”

“India’s $150 billion IT industry has a new mantra: Re-skill or perish”


Training Modes :

Data Lake Analytics Program training is delivered in two modes: candidates can opt for either physical classroom training or virtual classroom training delivered through audio conferencing and desktop sharing.

The training kicks off on 15 April 2018 and 22 April 2018 with an introductory session, after which the training calendar will be distributed to enrolled candidates.
Additional Services :

Industry-based projects and case studies; job assistance (referrals); resume building; professional grooming and mock interviews for freshers.

Training Duration : 2-3 months (weekends: Saturday and Sunday)

Training Location :

Shop 136, Boulevard Mall, Mumbai Agra Road, Thane West, Thane, Maharashtra

You can email us with any queries about the training program and fees at datalakeacademy@gmail.com.

Contact : Pravin Bhavar - +91 8452019117, Sunil Panwar - +91 9967407549

Amit Kadam - +91 9962982168, Ankit Yadav - +91 9833564610

Soumya Sahu - +91 9029101137


Key Terminologies

* A Data Lake is a data management platform comprising one or more Hadoop clusters, used principally to process and store non-relational data such as log files, Internet clickstream records, sensor data, JSON objects, images and social media posts.

* Hortonworks is the leading commercial vendor of Apache Hadoop; its popular, enterprise-ready big data platform is widely used in big data projects across the industry.

* Cloudera was founded in 2008 by some of the brightest minds at Silicon Valley’s leading companies. Doug Cutting, co-creator of Hadoop, joined the company in 2009 as Chief Architect and remains in that role.

* Google Cloud Platform, offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube.

* Red Hat® Enterprise Linux® gives you the tools you need to modernize your infrastructure, boost efficiency through standardization and virtualisation, and ultimately prepare your datacenter for an open, hybrid cloud IT architecture.

* Scala IDE provides advanced editing and debugging support for the development of pure Scala and mixed Scala-Java applications.

* Apache Spark™ is a fast and general engine for large-scale data processing.

* Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

* Ansible delivers simple IT automation that ends repetitive tasks and frees up DevOps teams for more strategic work.

* RStudio makes R easier to use. It includes a code editor and debugging and visualization tools.

* R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing.


Training Content - Hadoop Administrator

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Introduction to Linux, Big Data, Apache Hadoop

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Big Data, Apache Hadoop
2) Introduction to Linux
3) Linux Boot Process and Architecture
4) Linux Commands, Shell Scripts , Cron Utility
5) RHEL Linux OS Best Practices for Hadoop
6) Virtualization (VMware, VirtualBox)
7) Quiz & Q&A

LAB :

1) Installation of Linux OS
2) Practising Linux OS Commands
3) Configuring RHEL Linux OS Best Practices for Hadoop
4) Read and Execute Shell Script and Schedule through Cron Utility

Day 2: Introduction to HDFS Architecture and Operations

OBJECTIVES \ LECTURES :
1) Design of HDFS and Core Concepts:
   Blocks
   NameNodes and DataNodes
   HDFS Federation
   HDFS High Availability
2) Manage HDFS using Command-line Tools
3) Discussing Hadoop Cluster Installation Options
4) Understanding Hadoop Configuration Files
5) Quiz & Q&A

LAB :

1) Installation of a Single-Node Apache Hadoop Cluster
2) Walkthrough of Hadoop Configuration Files
3) Perform HDFS Operations, FSCK Utility (see the API sketch below)
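
These lab tasks are driven by the hdfs command line (hdfs dfs, hdfs fsck). For candidates who want to see the programmatic side as well, here is a minimal sketch of the same basic operations through the Hadoop FileSystem API in Scala; it assumes hadoop-client on the classpath and the cluster's core-site.xml/hdfs-site.xml reachable, and the paths are hypothetical.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsOpsDemo {
      def main(args: Array[String]): Unit = {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        val dir = new Path("/user/student/demo")   // hypothetical HDFS directory
        if (!fs.exists(dir)) fs.mkdirs(dir)

        // Equivalent of: hdfs dfs -put /tmp/sample.txt /user/student/demo
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path(dir, "sample.txt"))

        // Equivalent of: hdfs dfs -ls /user/student/demo
        fs.listStatus(dir).foreach(s => println(s"${s.getPath}  ${s.getLen} bytes"))

        fs.close()
      }
    }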

Home Assignment for Week 1


AGENDA SUMMARY

Week 2 :

Day 3 : Introduction to Apache YARN

OBJECTIVES \ LECTURES :
1) MR1 vs YARN ( MR2)
2) YARN Architecture- Anatomy of a YARN Application Run
3) Scheduling In YARN and Scheduler Options
4) Capacity Scheduler Configuration
5) Fair Scheduler Configuration
6) Preemption , Delay Scheduling
7) Dominant Resource Fairness ( DRF) Configuration
8) Quiz & Q&A

LAB :

1) Review of Home Assignment for Week 1
2) Configuring the Capacity Scheduler with DRF
3) Running Sample MapReduce Applications (Hadoop example JARs) and Observing the MR Output Results
4) Performing YARN Operations through the YARN CLI (see the sketch below)
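
As a programmatic counterpart to the YARN CLI task above, the sketch below lists applications through the YarnClient API in Scala, roughly what `yarn application -list` prints; it assumes hadoop-yarn-client on the classpath and a ResourceManager reachable via yarn-site.xml.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.client.api.YarnClient
    import scala.collection.JavaConverters._

    object YarnAppList {
      def main(args: Array[String]): Unit = {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager
        val yarn = YarnClient.createYarnClient()
        yarn.init(new Configuration())
        yarn.start()

        // Rough equivalent of: yarn application -list
        yarn.getApplications.asScala.foreach { app =>
          println(s"${app.getApplicationId}  ${app.getName}  queue=${app.getQueue}  state=${app.getYarnApplicationState}")
        }

        yarn.stop()
      }
    }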

Day 4: Introduction to Hortonworks Data Platform (HDP) - the Hortonworks Distribution of Apache Hadoop

OBJECTIVES \ LECTURES :
1) Introduction to HDP and Architecture
2) Understanding Typical Production Cluster Specification - NameNodes, DataNodes, EdgeNodes,
Management or Utility Nodes - Hardware Requirements
3) Understanding Network Architecture / Topology for Typical Production Hadoop Cluster
4) Understanding Role of ZooKeeper and Journal Nodes

LAB :

Installation of a Multi-Node Cluster through Ambari (HDP 2.5) - Part I
(Prerequisites: OS-level configuration, database requirements, local repo setup)

1) Configuring RHEL Linux OS Best Practices Recommended by Hortonworks
2) Installation of a JDK Supported by HDP
3) Installation and Configuration of an External MySQL Database
4) Preparing the Environment for HDP Installation (Password-less SSH, NTP, SELinux, etc.)
5) Setting Up a Local Repository with No Internet Access
6) Cluster Planning Sheet for Cluster Deployment

Home Assignment for Week 2



AGENDA SUMMARY

Week 3 :

Day 5: Multi-Node Cluster Installation using Ambari

OBJECTIVES \ LECTURES :
1) Recap of Key Concepts of HDFS and YARN
2) Quiz & Q&A
3) Discussing Best Practices for HDP Cluster Deployment
4) Case Study of a Typical Production Cluster Big Data Architecture
5) Understanding Hive and Spark Architecture
6) Manage HDFS using Ambari Web, NameNode and DataNode UIs
7) Summarize the Purpose and Benefits of Rack Awareness

LAB :

1) Review of Home Assignment for Week 2
2) Installation of a Multi-Node Cluster through Ambari (HDP 2.5) - Part II - Core Services
3) Managing HDFS Storage in the HDP Multi-Node Cluster
4) Managing HDFS Quotas
5) Configuring Rack Awareness
6) Managing HDFS Snapshots
7) Configuring HDFS Storage Policies
8) Configuring HDFS Centralized Cache
9) Using HDFS Access Control Lists
10) DistCp Usage
11) Perform HDFS Operations - fsck, dfsadmin, etc. (see the sketch below)
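
The quota and snapshot labs above are normally driven by hdfs dfsadmin / hdfs dfs commands. For reference, a minimal Scala sketch of the same operations through the DistributedFileSystem API; the paths and limits are hypothetical, and the code must run against a real HDFS endpoint.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hdfs.DistributedFileSystem

    object HdfsAdminDemo {
      def main(args: Array[String]): Unit = {
        // The cast only succeeds when fs.defaultFS points at a real HDFS cluster
        val dfs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]

        val dir = new Path("/user/student/quota-demo")   // hypothetical directory
        dfs.mkdirs(dir)

        // Equivalent of: hdfs dfsadmin -setQuota 1000 ... and -setSpaceQuota 10g ...
        dfs.setQuota(dir, 1000, 10L * 1024 * 1024 * 1024)

        // Equivalent of: hdfs dfsadmin -allowSnapshot ... then hdfs dfs -createSnapshot ...
        dfs.allowSnapshot(dir)
        val snapshot = dfs.createSnapshot(dir, "s1")
        println(s"snapshot created at $snapshot")
      }
    }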

Day 6:

OBJECTIVES \ LECTURES : YARN

Home Assignment for Week 3



AGENDA SUMMARY

Week 4:

Day 7 : HIGH AVAILABILITY WITH HDP, DEPLOYING HDP WITH BLUEPRINTS, AND THE HDP
UPGRADE PROCESS

OBJECTIVES \ LECTURES :
Recap of Week 3 - Quiz & Q&A
Summarize the Purpose of NameNode HA
Configure NameNode HA Using Ambari
Describe the Features and Benefits of the Apache Ambari Dashboard
Ambari Views and Blueprints

LAB :
Configuring NameNode HA
Configuring Resource Manager HA
Configuring Ambari Alerts

Day 8:

OBJECTIVES \ LECTURES :
Recall the Types and Methods of Upgrades Available in HDP
Describe the Upgrade Process, Restrictions and Pre-upgrade Checklist
Perform an Upgrade Using the Apache Ambari Web UI

LAB :

Performing an HDP Upgrade – Express

Home Assignment for Week 4

Week 5 - HDP Security

Day 9 - Ranger

OBJECTIVES \ LECTURES :
Authentication and Authorization
Ambari User Management
Ranger Architecture
Atlas Architecture
Hue Architecture
Case Study of Typical User Management in an Enterprise Cluster through AD (Kerberos)
SmartSense Usage

LAB :
Ranger Installation
Creating Ranger Policies (e.g., for HDFS and Hive)

Day 10

HDPCA Certification Tasks - Part 1


Home Assignment for Week 5
AGENDA SUMMARY

Week 6:

Day 11 :
Recap of Week 5
HDPCA Certification Tasks - Part 2

Day 12:

HDPCA Practice Test / Mock Exam on Google Cloud Platform

Registration for Google Cloud Platform
Register for the HDPCA Exam on the Hortonworks Portal

Project Assignment

Week 7 - Miscellaneous Topics

Day 13

Project Discussion and Mock Interviews

Q&A



Hadoop Developer Certification Course

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Data Ingestion

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Hive
2) Hive Architecture
3) Sqoop and Flume Architecture

LAB :
Import data from a table in a relational database into HDFS
Import the results of a query from a relational database into HDFS
Import a table from a relational database into a new or existing Hive table
Insert or update data from HDFS into a table in a relational database
Given a Flume configuration file, start a Flume agent
Given a configured sink and source, configure a Flume memory channel with a specified capacity
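
The import/export tasks above are performed with Sqoop, and the agent and channel tasks with Flume, following the HDPCD exam objectives. As a point of comparison only (not the lab's method), here is a minimal Spark sketch in Scala that pulls a relational table over JDBC and lands it in Hive; the connection string, credentials and table names are hypothetical, and the matching JDBC driver must be on the classpath.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object JdbcImportSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-import-sketch")
          .enableHiveSupport()                      // requires hive-site.xml on the classpath
          .getOrCreate()

        val props = new Properties()
        props.setProperty("user", "retail_user")    // hypothetical credentials
        props.setProperty("password", "secret")

        // Read one table over JDBC, then persist it as a Hive table
        val orders = spark.read.jdbc("jdbc:mysql://dbhost:3306/retail", "orders", props)
        orders.write.mode("overwrite").saveAsTable("orders_raw")

        spark.stop()
      }
    }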

Day 2: Data Transformation

OBJECTIVES \ LECTURES :
1) Introduction to Pig
2) Transformations Available in Pig

LAB :
Write and execute a Pig script
Load data into a Pig relation without a schema
Load data into a Pig relation with a schema
Load data from a Hive table into a Pig relation
Use Pig to transform data into a specified format
Transform data to match a given Hive schema
Group the data of one or more Pig relations
Use Pig to remove records with null values from a relation
Store the data from a Pig relation into a folder in HDFS
Store the data from a Pig relation into a Hive table
Sort the output of a Pig relation
Remove the duplicate tuples of a Pig relation
Specify the number of reduce tasks for a Pig MapReduce job
Join two datasets using Pig
Perform a replicated join using Pig
Run a Pig job using Tez
Within a Pig script, register a JAR file of User Defined Functions
Within a Pig script, define an alias for a User Defined Function
Within a Pig script, invoke a User Defined Function
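
All of the tasks above are written in Pig Latin during the lab. As a bridge to the Spark course later in the program, the sketch below shows the same load-with-schema, null-filtering, de-duplication, ordering and store pattern expressed in Spark (Scala); the input and output paths and the column name are hypothetical.

    import org.apache.spark.sql.SparkSession

    object PigStyleEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pig-style-etl").getOrCreate()
        import spark.implicits._

        // LOAD with a schema taken from the header row (cf. Pig's LOAD ... AS (...))
        val users = spark.read.option("header", "true").csv("hdfs:///data/users.csv")

        users.na.drop()            // FILTER out records containing null values
          .dropDuplicates()        // DISTINCT
          .orderBy($"name")        // ORDER BY (assumes a 'name' column exists)
          .write.mode("overwrite")
          .csv("hdfs:///data/users_clean")  // STORE into an HDFS folder

        spark.stop()
      }
    }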

Home Assignment for Data Ingestion and Transformation using Sqoop, Flume and Pig

AGENDA SUMMARY
Week 2 : Hive SQL

Day 3 :

Objectives : HiveQL in Depth

DDL (create/drop/alter/truncate/show/describe), Statistics (analyze), Indexes, Archiving
DML (load/insert/update/delete/merge, import/export, explain plan)
File Formats and Compression: RCFile, Avro, ORC, Parquet; LZO
Hive Configuration Properties
Hive Clients (JDBC, ODBC, Thrift)
HiveServer2: Overview, HiveServer2 Client and Beeline, Hive Metrics

Data Analysis LAB

Write and execute a Hive query
Define a Hive-managed table
Define a Hive external table
Define a partitioned Hive table
Define a bucketed Hive table
Define a Hive table from a select query
Define a Hive table that uses the ORCFile format
Create a new ORCFile table from the data in an existing non-ORCFile Hive table
Specify the storage format of a Hive table
Specify the delimiter of a Hive table
Load data into a Hive table from a local directory
Load data into a Hive table from an HDFS directory
Load data into a Hive table as the result of a query
Load a compressed data file into a Hive table
Update a row in a Hive table
Delete a row from a Hive table
Insert a new row into a Hive table
Join two Hive tables
Run a Hive query using Tez
Run a Hive query using vectorization
Output the execution plan for a Hive query
Use a subquery within a Hive query
Output data from a Hive query that is totally ordered across multiple reducers
Set a Hadoop or Hive configuration property from within a Hive query
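
The data analysis lab above is carried out in HiveQL through Beeline or the Hive CLI. Because the program pairs Hive with Spark, here is a minimal Scala sketch that issues the same kind of statements through Spark SQL with Hive support; the table names are hypothetical and a Hive metastore must be reachable.

    import org.apache.spark.sql.SparkSession

    object HiveOrcSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-orc-sketch")
          .enableHiveSupport()   // connects to the Hive metastore via hive-site.xml
          .getOrCreate()

        // Create an ORC-backed table from an existing non-ORC table (hypothetical names)
        spark.sql("CREATE TABLE IF NOT EXISTS sales_orc STORED AS ORC AS SELECT * FROM sales_text")

        // Join-style aggregation and display, as the lab tasks do in HiveQL
        spark.sql("SELECT region, SUM(amount) AS total FROM sales_orc GROUP BY region").show()

        spark.stop()
      }
    }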

Day 4 : Developer Certification Tasks Preparation

Home Assignment for Hive - Data Analysis Task

© 2018 Data Lake Academy

Reference: https://cwiki.apache.org/confluence/display/Hive/Home
Spark Certification Course

Prerequisites : Hands-on experience with Hadoop developer tasks; should have attended the Hadoop Developer Course.

FORMAT
50% Lecture
50% Hands-on Labs

AGENDA SUMMARY
Week 1 : Introduction to Spark

Day 1 :

OBJECTIVES \ LECTURES :
1) Introduction to Spark
2) Benefits of Spark over MapReduce
3) Spark Architecture
4) Spark IDE Overview (Scala IDE, IntelliJ, Maven)

LAB :

Write a Spark Core application in Python or Scala
Initialize a Spark application
Run a Spark job on YARN
Create an RDD
Create an RDD from a file or directory in HDFS
Persist an RDD in memory or on disk
Perform Spark transformations on an RDD
Perform Spark actions on an RDD
Create and use broadcast variables and accumulators
Configure Spark properties
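
A minimal word-count-style sketch in Scala covering several of the tasks above: create an RDD from an HDFS file, run transformations and an action, use a broadcast variable and an accumulator, and persist an RDD. The input/output paths are hypothetical, and it assumes Spark 2.x submitted via spark-submit --master yarn.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

        val lines = sc.textFile("hdfs:///data/input.txt")    // RDD from an HDFS file
        val stopWords = sc.broadcast(Set("a", "an", "the"))  // broadcast variable shared by executors
        val emptyTokens = sc.longAccumulator("emptyTokens")  // accumulator (Spark 2.x API)

        val counts = lines
          .flatMap(_.split("\\s+"))
          .filter { w =>
            if (w.isEmpty) { emptyTokens.add(1); false }     // count and drop empty tokens
            else !stopWords.value(w)                         // drop stop words via the broadcast set
          }
          .map((_, 1))
          .reduceByKey(_ + _)
          .persist()                                         // keep the RDD in memory for reuse

        counts.saveAsTextFile("hdfs:///data/wordcounts")     // action triggers the job
        println(s"empty tokens skipped: ${emptyTokens.value}")
        sc.stop()
      }
    }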

Day 2: Spark SQL & Spark Streaming

OBJECTIVES \ LECTURES :
1) Spark SQL architecture
2) Spark Streaming Architecture
3) Data Visualisation using Zeppelin

LAB :

Create Spark DataFrames from an existing RDD
Perform operations on a DataFrame
Write a Spark SQL application
Use Hive with ORC from Spark SQL
Write a Spark SQL application that reads and writes data from Hive tables
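
A minimal Scala sketch for the first and last tasks above: build a DataFrame from an existing RDD through a case class, run a simple operation, and save the result as a Hive table. The table name is hypothetical, and Hive support requires a reachable metastore.

    import org.apache.spark.sql.SparkSession

    case class Reading(sensor: String, value: Double)

    object DataFrameFromRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("df-from-rdd")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // An existing RDD of case-class records becomes a DataFrame with toDF()
        val rdd = spark.sparkContext.parallelize(Seq(Reading("s1", 21.5), Reading("s2", 19.0)))
        val df = rdd.toDF()

        df.filter($"value" > 20.0).show()   // a simple DataFrame operation

        // Write the DataFrame to Hive as an ORC table (hypothetical name)
        df.write.mode("overwrite").format("orc").saveAsTable("readings_orc")
        spark.stop()
      }
    }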


Home Assignment for Spark Core & SQL



Data Lake Analytics Program Enablers

[Diagram: Traditional Architecture vs. Modern Architecture]

Data Lake Analytics Program - Hadoop Administration: Hortonworks Data Platform
Data Lake Analytics Program - Hadoop Developer + Spark Course

Data Lake Analytics Program - Course Offerings

Candidates can opt for any of the course offerings below. Our recommendation is the Big Data Architect / SME program, which delivers the most value, builds a strong career path with leading big data organisations and startups, and is a real investment in your career.

HDPCA - This is an entry-level program that helps you earn the Hortonworks administration certification badge.

HDPCA + Ansible - This program caters to the industry’s need for automation alongside Hadoop; it carries more value in the market and therefore better incentives.

HDP Developer Spark – This program covers Apache Hadoop and HDP basics along with core Spark developer tools, using the Scala IDE and Zeppelin, plus RStudio for visualisation and statistical reporting.

HDPCA + Cloudera Manager – This is the complete Hadoop administrator program, covering both HDP and CDH cluster operations. It is likely to raise your chances of being selected as a Hadoop administrator, since most leading companies prefer Cloudera for its enterprise features and overall stability.

Big Data Architect / SME Course – This unique program meets the needs of the Big Data Architect or SME role. After completing this course, you will be able to help your clients build end-to-end solutions on the Hadoop platform, with a data ingestion framework for data at rest and in motion; it also covers the latest Spark developer tools for building real-time use cases and stunning visualisations. This course is recommended for experienced hires who want to move into a big data architect role, and for new graduates who want the big picture of building end-to-end solutions, which can open up big opportunities at start-up firms.

© 2018 Data Lake Academy
