Course Outline:
Hadoop and Spark for Big Data and Data Science

By Data Science Studio
Week-1: Introduction to Big Data and Data Science

• What is Big Data?
• What is Data Science?
Week-2: Introduction to Hadoop and the Hadoop Ecosystem

• Problems with Traditional Large-Scale Systems
• Hadoop
• Data Storage and Ingest
• Data Processing
• Data Analysis and Exploration
• Other Ecosystem Tools
• Homework Lab: Set Up Hadoop
Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)

• Distributed Processing on a Cluster
• Storage: HDFS Architecture
• Storage: Using HDFS
• Homework Lab: Access HDFS with Command Line and Hue
• Resource Management: YARN Architecture
• Resource Management: Working with YARN
• Homework Lab: Run a YARN Job
Week-4: Importing Relational Data with Apache Sqoop

• Sqoop Overview
• Basic Imports and Exports
• Limiting Results
• Improving Sqoop’s Performance
• Sqoop 2
• Homework Lab: Import Data from MySQL Using Sqoop
Week-5: Introduction to Impala and Hive

• Introduction to Impala and Hive
• Why Use Impala and Hive?
• Querying Data with Impala and Hive
• Comparing Impala and Hive to Traditional Databases
Week-6: Modeling and Managing Data with Impala and Hive

• Data Storage Overview
• Creating Databases and Tables
• Loading Data into Tables
• HCatalog
• Impala Metadata Caching
• Homework Lab: Create and Populate Tables in Impala or Hive
Week-7: Data Formats

• File Formats
• Avro Schemas
• Avro Schema Evolution
• Using Avro with Impala, Hive and Sqoop
• Using Parquet with Impala, Hive and Sqoop
• Compression
• Homework Lab: Select a Format for a Data File
Week-8: Data File Partitioning

• Partitioning Overview
• Partitioning in Impala and Hive
• Conclusion
• Homework Lab: Partition Data in Impala or Hive
Week-9: Capturing Data with Apache Flume

• What is Apache Flume?
• Basic Flume Architecture
• Flume Sources
• Flume Sinks
• Flume Channels
• Flume Configurations
• Homework Lab: Collect Web Server Logs with Flume
Week-10: Spark Basics

• What is Apache Spark?
• Using the Spark Shell
• RDDs (Resilient Distributed Datasets)
• Functional Programming in Spark
• Homework Lab:
    o View the Spark Documentation
    o Explore RDDs Using Spark Shell
    o Use RDDs to Transform a Dataset
Week-11: Working with RDDs in Spark

• Creating RDDs
• Other General RDD Operations
• Homework Lab: Process Data Files with Spark
Week-12: Aggregating Data with Pair RDDs

• Key-Value Pair RDDs
• MapReduce
• Other Pair RDD Operations
• Homework Lab: Use Pair RDDs to Join Two Datasets
Week-13: Writing and Deploying Spark Applications

• Spark Application vs. Spark Shell
• Creating the SparkContext
• Building a Spark Application (Scala and Java)
• Running a Spark Application
• The Spark Application Web UI
• Homework Lab: Write and Run a Spark Application
• Configure Spark Properties
• Logging
• Homework Lab: Configure a Spark Application
Week-14: Parallel Processing in Spark

• Review: Spark on a Cluster
• RDD Partitions
• Partitioning of File-Based RDDs
• HDFS and Data Locality
• Executing Parallel Operations
• Stages and Tasks
• Homework Lab: View Jobs and Stages in the Spark Application UI
Week-15: Spark RDD Persistence

• RDD Lineage
• RDD Persistence Overview
• Distributed Persistence
• Homework Lab: Persist an RDD
Week-16: Common Patterns in Spark Data Processing

• Common Spark Use Cases
• Iterative Algorithms in Spark
• Graph Processing and Analysis
• Machine Learning
• Example: k-means
• Homework Lab: Iterative Processing in Spark
Week-17: Spark SQL and DataFrames

• Spark SQL and the SQL Context
• Creating DataFrames
• Transforming and Querying DataFrames
• Saving DataFrames
• DataFrames and RDDs
• Comparing Spark SQL, Impala and Hive-on-Spark
• Homework Lab: Use Spark SQL for ETL
Week-18: Running Machine Learning Algorithms Using Spark MLlib

• Machine Learning with Spark
• Preparing Data for Machine Learning
• Building a Linear Regression Model
• Evaluating a Linear Regression Model
• Visualizing a Linear Regression Model
Week-19: BigDL Distributed Deep Learning on Apache Spark

• What is Deep Learning?
• What is BigDL?
• Why Use BigDL?
• Installing and Building BigDL
• BigDL Examples
Week-20: Working on Spark in the Cloud

• Spark Implementation in Databricks
• Spark Implementation in Cloudera
• Spark Implementation in Amazon Web Services
