Week-2: Introduction to Hadoop and the Hadoop Ecosystem
Problems with Traditional Large Scale Systems Hadoop Data Storage and Ingest Data Processing Data Analysis and Exploration Other Ecosystem Tools Homework Lab: Setup Hadoop
Week-3: Hadoop Architecture and Hadoop Distributed File System (HDFS)
Distributed Processing on a Cluster Storage: HDFS architecture Storage: Using HDFS Homework Lab: Access HDFS with Command Line and Hue Resource Management: YARN Architecture Resource Management: Working with YARN Homework Lab: Run a YARN Job
Week-4: Importing Relational Data with Apache Sqoop
Sqoop Overview Basic Imports and Exports Limiting Results Improving Sqoop’s performance Sqoop 2 Homework Lab: Import Data from MySQL Using Sqoop
Week-5: Introduction to Impala and Hive
Introduction to Impala and Hive Why use Impala and Hive Querying Data with Impala and Hive Comparing Impala and Hive to Traditional Databases
Week-6: Modeling and Managing Data with Impala and Hive
Data Storage Overview Creating Databases and Tables Loading Data into Tables HCatalog Impala Metadata Caching Homework Lab: Create and Populate Tables in Impala or Hive
Week-7: Data Formats
File Formats Avro Schemas Avro Schema Evolution Using Avro with Impala, Hive and Sqoop Using Parquet with Impala, Hive and Sqoop Compression Homework Lab: Select a Format for Data File
Week-8: Data File Partitioning
Partitioning Overview Partitioning in Impala and Hive Conclusion Homework Lab: Partition Data in Impala or Hive
Week-9: Capturing Data with Apache Flume
What is Apache Flume? Basic Flume Architecture Flume Sources Flume Sinks Flume Channels Flume Configurations Homework Lab: Collect Web Server Logs with Flume
Week-10: Spark Basics
What is Apache Spark? Using the Spark Shell RDDs (Resilient Distributed Datasets) Functional Programming in Spark Homework Lab: View the Spark Documentation; Explore RDDs Using Spark Shell; Use RDDs to Transform a Dataset
Week-11: Working with RDDs in Spark
Creating RDDs Other General RDD Operations Homework Lab: Process Data Files with Spark
Week-12: Aggregating Data with Pair RDDs
Key-value Pair RDDs MapReduce Other Pair RDD Operations Homework Lab: Use Pair RDDs to Join Two Datasets
Week-13: Writing and Deploying Spark Applications
Spark Application vs. Spark Shell Creating the SparkContext Building a Spark Application (Scala and Java) Running a Spark Application The Spark Application Web UI Homework Lab: Write and Run a Spark Application Configure Spark Properties Logging Homework Lab: Configure a Spark Application
Week-14: Parallel Processing in Spark
Review: Spark on a Cluster RDD Partitions Partitioning of File Based RDDs HDFS and Data Locality Executing Parallel Operations Stages and Tasks Homework Lab: View Jobs and Stages in the Spark Application UI
Common Spark Use Cases Iterative Algorithms in Spark Graph Processing and Analysis Machine Learning Example: k-means Homework Lab: Iterative Processing in Spark
Week-17: Spark SQL and DataFrames
Spark SQL and the SQL Context Creating DataFrames Transforming and Querying DataFrames Saving DataFrames DataFrames and RDDs Comparing Spark SQL, Impala and Hive-on-Spark Homework Lab: Use Spark SQL for ETL
Week-18: Running Machine Learning Algorithms Using Spark MLlib
Machine Learning with Spark Preparing Data for Machine Learning Building a Linear Regression Model Evaluating a Linear Regression Model Visualizing a Linear Regression Model
Week-19: BigDL Distributed Deep Learning on Apache Spark
What is Deep Learning What is BigDL Why use BigDL Installing and Building BigDL BigDL examples
Week-20: Working on Spark in the Cloud
Spark implementation in Databricks Spark implementation in Cloudera Spark implementation in Amazon Web Services