BIG DATA WITH HADOOP & SPARK

BIG DATA WITH HADOOP & SPARK
COURSE DESIGN
High-quality videos, slides, hands-on examples, quizzes, automated assessments, case studies,
and real-world projects.
COURSE
MATERIAL
Lifetime access to cutting-edge self-paced learning

content.
LA
90 Days of CloudxLab access for hands-on practice.
SUPPORT
Email support to answer your queries and we've also launched Discussions - a Q&A site for
Artificial Intelligence, Machine Learning, Deep Learning, Big Data & Data Science professionals.
CERTIFICATE
Earn certificate in Big Data with Hadoop and Apache Spark.
LIVE SESSIONS
60+ hours of live online instructor-led training. Classes will be conducted every Saturday &
Sunday between (8 PM - 11 PM Indian Standard Time) or (7:30 AM - 10:30 AM Pacific Time).
BIG DATA WITH HADOOP & SPARK - COURSE SYLLABUS
INTRODUCTIO
N
● What is Big Data?

● Why Now?
● Big Data Use Cases
● Various Solutions
● Overview of Hadoop Ecosystem
● Spark Ecosystem Walkthrough
● Quiz
FOUNDATION &
ENVIRONMENT
● Understanding the CloudxLab

● CloudxLab Hands-On
● Hadoop & Spark Hands-on
● Quiz and Assessment
● Basics of Linux - Quick Hands-On
● Understanding Regular Expressions
● Setting up VM (optional)
ZOOKEEPER
● Why Do we need it?

● Understanding Data Model
● Hands-On
● Quiz & Assessment
● How does election happen - Paxos Algorithm?
● Use cases
● When not to use
HDFS ● Why HDFS or Why not existing file
systems?
● Understanding the architecture
● Quiz
● Advance HDFS Concepts (HA, Federation)
● Quiz
● Hands-on with HDFS (Upload, Download, SetRep)
● Data Locality (Rack Awareness)
DATA FORMATS &
MANAGEMENT
● InputFormat and InputSplit
● JSON
● XML
● AVRO
● How to store many small files - SequenceFile?
● Parquet
● Protocol Buffers
● Comparing Compressions
● Understanding Row Oriented and Column Oriented Formats - RCFile?
YARN ● Computing - Why not existing
tools?
● MapReduce 1.0
● Resource Management: YARN Architecture
● Advance Concepts - Speculative Execution
● Quiz
MAPREDUCE BASICS
● Why MapReduce?
● Understanding MapReduce Framework
● Quiz
● Example 0 - Word Frequency Problem - Without MR
● Example 1 - Only Mapper - Image Resizing
● Example 2 - Word Frequency Problem
● Example 3 - Temperature Problem
● Example 4 - Multiple Reducer
● Example 5 - Java MapReduce Walkthrough
● Quiz
MAPREDUCE
ADVANCED
● Example 6 - Secondary Sorting (Word Recommendation)

● Example 7 - Partitioner
● Concept - Associative & Commutative
● Quiz
● Example 8 - Combiner
● Example 9 - Hadoop Streaming
● Example 10 - Adv. Problem Solving - Anagrams
● Example 11 - Adv. Problem Solving - Same DNA
● Example 12 - Adv. Problem Solving - Similar DNA
● Example 12 - Joins - Voting
● Limitations of MapReduce
● Quiz
ANALYZING DATA WITH PIG
● Why Pig?
● Basic Structure of Pig Latin
● Getting Started
● Example - NYSE Stock Exchange
● Concept - Lazy Evaluation
PROCESSING DATA WITH

HIVE
● Why Hive?
● Hive Architecture Overview
● Getting Started
● Loading Data in Hive (Tables)
● Example: Movielens Data Processing
● Advance Concepts: Views
● Connecting Tableau and HiveServer 2
● Connecting Microsoft Excel and HiveServer 2
● Project: Sentiment Analyses of Twitter Data
● Advanced - Partition Tables
● Understanding HCatalog & Impala
● Quiz
NOSQL AND HBASE
● Case Study: The days before NoSQL

● What is NoSQL?
● CAP Theorem
● HBase Architecture - Region Servers etc
● Hbase Data Model - Column Family Orientedness
● Getting Started - Create table, Adding Data
● Adv Example - Google Links Storage
● Concept - Bloom Filter
● Comparison of NOSQL Databases
● Quiz
IMPORTING DATA WITH SQOOP AND FLUME, OOZIE
● Sqoop Overview
● Import From MySQL to HDFS, Hive, HBase
● Exporting to MySQL from HDFS
● Concept - Unbounding Dataset Processing or Stream Processing
● Flume Overview: Agents - Source, Sink, Channel
● Example 1 - Data from Local network service into HDFS
● Example 2 - Extracting Twitter Data
● Quiz
● Example 3 - Creating workflow with Oozie
INTRODUCTION
● What is Big Data?

● Why Now?
● Big Data Use Cases
● Various Solutions
● Overview of Hadoop Ecosystem
● Spark Ecosystem Walkthrough
● Quiz
FOUNDATION & ENVIRONMENT
● Understanding the CloudxLab

● CloudxLab Hands-On
● Spark Hands-on
● Basics of Linux - Quick Hands-On
● Understanding Regular Expressions
● Setting up VM (optional)
RECAP OF HDFS AND YARN
SCALA BASICS
● Introduction to Scala?
● Accessing Scala using CloudxLab
● Getting Started: Interactive, Compilation, SBT
● Types, Variables & Values
● Functions
● Collections
● Classes
● Parameters
● More Features
SPARK BASICS
● What is Apache Spark?

● Why Spark?
● Using the Spark Shell on CloudxLab
● Example 1 - Performing Word Count
● Understanding Spark Cluster Modes on YARN
● RDDs (Resilient Distributed Datasets)
● General RDD Operations: Transformations & Actions
● RDD Lineage
● RDD Persistence Overview
● Distributed Persistence
● Learn operations on Key-Value Based RDD
● Solving various problems using RDD
WRITING AND DEPLOYING SPARK APPLICATIONS
● Creating the SparkContext

● Building a Spark Application (Scala, Java, Python)
● The Spark Application Web UI
● Configuring Spark Properties
● Running Spark on Cluster
● RDD Partitions
● Executing Parallel Operations
● Stages and Tasks
● Project: Churning the logs of NASA Kennedy Space Center WWW server
SPARK ADVANCED OPERATIONS
● Using Accumulators & Creating Custom Accumulators

● Using Broadcast variables
We will learn key performance considerations:
1.Level of Parallelism
2. Serialization Format
3.Memory Management
4.Hardware Provisioning
● Understanding Caching & Persistence
● We will Data Partitioning/Re-partitioning techniques.
● A project to consider the above optimization techniques.
● We will how to create custom partitioner.
RUNNING SPARK ON A CLUSTER
● Understand the Spark Runtime Architecture and various components such as Driver, Executor,
Cluster Manager etc.
● Learn what goes inside when we launch an spark application.
● We will learn the two modes of Spark: Local and Cluster.
● How to launch a program on YARN, AWS Cluster etc.
● How to setup spark in standalone mode.
● Understand and demonstrate on how to run drive in various modes.
● Learn how to package the dependencies of your code.
● Understand how to use the Spark-Submit and various command line options.
STREAM PROCESSING WITH SPARK AND KAFKA
● Common Spark Use Cases

● Example 1 - Data Cleaning (Movielens)
● Example 2 - Understanding Spark Streaming
● Understanding Kafka
● Example 3 - Spark Streaming from Kafka
● Iterative Algorithms in Spark
● Project: Real-time analytics of orders in an e-commerce company
DATAFRAMES AND SPARK SQL
● Spark SQL and the SQL Context

● Creating DataFrames
● Transforming and Querying DataFrames
● Saving DataFrames
● DataFrames and RDDs
● Comparing Spark SQL, Impala, and Hive-on-Spark
● Understanding and loading various Input formats: JSON, XML, AVRO, SequenceFile?, Parquet,
Protocol Buffers.
● Comparing Compressions
● Understanding Row Oriented and Column Oriented Formats - RCFile?
MACHINE LEARNING WITH SPARK
● GraphX: Graph Processing and Analysis

● Understanding Machine Learning
● MlLib Example: k-means
● SparkR Example
GRAPH PROCESSING WITH GRAPHX
Basics of Graph Processing: Covers the understanding of what does it mean by graph
processing in real life with examples. What are other frameworks providing graph computing?
GraphX Overview: What is GraphX? Understanding the functionalities and algorithms provided
by GraphX. And how does GraphX work. Along with comparison with other similar products.
OTHER Topics/Content
● Java Essential
● Linux Basics
● Spark On Cluster
● Adv Spark Programming
● Hands-on videos
PROJECTS
INCLUDED
● Hive - Sentiment Analysis

● Processing the NSE (National Stock Exchange) data with Hive for various insights
● Doing Analytics on the MovieLens data to generate the movie ratings
● Building Real-Time Analytics Dashboard
● Churning the logs of NASA Kennedy Space Center WWW server
● Generating movie recommendations using Spark MLLib
● Deriving the importance of various Handles at Twitter using Spark GraphX
The Big Data with Hadoop & Spark course is compatible with the following certifications:
● Cloudera Certified Professional (CCP): Data Engineer

● Cloudera Certified Associate (CCA): Spark & Hadoop Developer
● Hortonworks Certified Developer (HDPCD)
● Hortonworks Certified Developer (HDPCD): Spark
Click Here To Enroll

Now!!
Please feel free to email your queries to

reachus@cloudxlab.com
Regards, The
CloudxLab Team

BIG DATA WITH HADOOP & SPARK - COURSE SYLLABUS

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BIG DATA WITH HADOOP & SPARK - COURSE SYLLABUS

Uploaded by

Copyright:

Available Formats

Lifetime access to cutting-edge self-paced learning

90 Days of ​CloudxLab ​access for hands-on practice.

Earn certificate in Big Data with Hadoop and Apache Spark.

BIG DATA WITH HADOOP & SPARK - COURSE SYLLABUS

● What is Big Data?

● Understanding the CloudxLab

● Why Do we need it?

HDFS ​● Why HDFS or Why not existing file

YARN ​● Computing - Why not existing

● Example 6 - Secondary Sorting (Word Recommendation)

ANALYZING DATA WITH PIG

PROCESSING DATA WITH

NOSQL AND HBASE

● Case Study: The days before NoSQL

IMPORTING DATA WITH SQOOP AND FLUME, OOZIE

● What is Big Data?

FOUNDATION & ENVIRONMENT

● Understanding the CloudxLab

RECAP OF HDFS AND YARN

● What is Apache Spark?

WRITING AND DEPLOYING SPARK APPLICATIONS

● Creating the SparkContext

SPARK ADVANCED OPERATIONS

● Using Accumulators & Creating Custom Accumulators

RUNNING SPARK ON A CLUSTER

STREAM PROCESSING WITH SPARK AND KAFKA

● Common Spark Use Cases

DATAFRAMES AND SPARK SQL

● Spark SQL and the SQL Context

● GraphX: Graph Processing and Analysis

GRAPH PROCESSING WITH GRAPHX

● Hive - Sentiment Analysis

● Cloudera Certified Professional (CCP): Data Engineer

Click Here To Enroll

Please feel free to email your queries to

You might also like

90 Days of CloudxLab access for hands-on practice.

HDFS ● Why HDFS or Why not existing file

YARN ● Computing - Why not existing