
Module 2

Introduction to Big Data Frameworks:


Hadoop, NoSQL

Big Data Analytics


BEITC801 Prof. Priyanka Bandagale
FAMT, Ratnagiri
Hadoop Ecosystem
• Pipeline: Data Ingestion -> Data Processing -> Data Analysis

• Data Ingestion
1. Sqoop
2. Flume
• Data Processing
1. MapReduce
2. Spark
• Data Analysis
1. Pig
2. Hive
3. Impala
Hadoop Ecosystem Components
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• Pig, Hive -> Data processing services using SQL-like queries
• HBase -> NoSQL Database
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain cluster
HBase
• HBase is an open source, non-relational, distributed database
modeled after Google's BigTable.
• It runs on top of Hadoop and HDFS, providing BigTable-like
capabilities for Hadoop.
• It is a non-relational, column-oriented NoSQL database.
Features of HBase
• A type of NoSQL database
• Strongly consistent reads and writes
• Automatic sharding
• Automatic RegionServer failover
• Hadoop/HDFS integration
• HBase supports massively parallel processing via MapReduce, with HBase acting as both source and sink.
• HBase provides an easy-to-use Java API for programmatic access (a short sketch follows this list).
• HBase also supports Thrift and REST for non-Java front ends.
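A minimal sketch of programmatic access through the Java API, assuming the HBase 1.x+ client classes; the table name ("users"), column family ("info"), and row key are illustrative assumptions, not values from these slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // A "users" table with column family "info" is assumed to exist
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column info:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("info:name = " + Bytes.toString(value));
        }
    }
}
```

This random single-row read/write is exactly the capability contrasted with HDFS on the next slide.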
Difference between HBase and HDFS
• HDFS is good for storing large files; HBase is built on top of HDFS and is good for hosting very large tables (billions of rows × millions of columns).
• HDFS is write-once (appending to files exists in some recent versions but is not commonly used); HBase supports read/write many times.
• HDFS has no random read/write; HBase supports random read/write.
• HDFS has no individual record lookup (you read all the data instead); HBase provides fast record lookup and update.
Sqoop
• Command-line interface for transferring data between relational databases and Hadoop.
• Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance.
• Imports are used to populate tables in Hadoop.
• Exports are used to push data from Hadoop into a relational database such as SQL Server.

[Diagram: Hadoop <-> Sqoop <-> RDBMS]


How Sqoop works
• The dataset being transferred is broken into small blocks.
• Each individual mapper is responsible for transferring one block of the dataset.
• The Sqoop command submitted by the end user is parsed by Sqoop, which launches a map-only Hadoop job to import or export the data; a reduce phase is needed only when aggregations are required.
• Sqoop just imports and exports the data; it does not perform any aggregations. (Example commands are sketched below.)
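A hedged sketch of typical Sqoop invocations; the JDBC connection string, database, table names, and HDFS paths are placeholders rather than values from these slides.

```sh
# Import a relational table into HDFS using 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/cloudera/orders \
  --num-mappers 4

# Export (possibly processed) HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders_summary \
  --export-dir /user/cloudera/orders_summary
```

Each invocation is turned into the map-only job described above, with --num-mappers controlling the degree of parallelism.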
Apache Spark
• Apache Spark is an open-source, distributed processing
system used for big data workloads.
• It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R.
• Spark is not a modified version of Hadoop and does not really depend on Hadoop, because it has its own cluster management.
Apache Spark Workloads
• The Spark framework includes the following components (a short Java example follows this list):
• Spark Core as the foundation for the platform
• Spark SQL for interactive queries
• Spark Streaming for real-time analytics
• Spark MLlib for machine learning
• Spark GraphX for graph processing
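A minimal word-count sketch against the Java development API mentioned above, assuming the Spark 2.x Java API; the local-mode master and input path are assumptions for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for demonstration; on a cluster the master is set by the launcher
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Input path is a placeholder; it could also be an HDFS path
            JavaRDD<String> lines = sc.textFile("input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1() + " : " + t._2()));
        }
    }
}
```

The same pipeline can be expressed in Scala, Python or R through the corresponding APIs.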
Pig
• Apache Pig is a high-level data-flow platform for creating and executing MapReduce programs on Hadoop.
• It consists of two components:
1. Pig Latin - the data processing language
2. Compiler - translates Pig Latin into MapReduce programs
Pig in Hadoop Ecosystem

[Stack: Pig (scripting) -> MapReduce (distributed programming interface) -> HDFS]
Applications of Apache Pig:

• Processing of web logs (a small Pig Latin sketch follows this list).
• Data processing for search platforms.
• Support for ad-hoc queries across large data sets.
• Quick prototyping of algorithms for processing large data sets.
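A small Pig Latin sketch in the spirit of the web-log application above; the input path, delimiter, and field layout are illustrative assumptions.

```pig
-- Load raw web logs (tab-delimited; the schema is assumed for illustration)
logs = LOAD '/data/weblogs' USING PigStorage('\t')
       AS (ip:chararray, timestamp:chararray, url:chararray, status:int);

-- Keep only successful requests
ok = FILTER logs BY status == 200;

-- Count hits per URL
by_url = GROUP ok BY url;
url_counts = FOREACH by_url GENERATE group AS url, COUNT(ok) AS hits;

-- Store the result back into HDFS
STORE url_counts INTO '/data/weblog_hits';
```

The compiler translates this script into one or more MapReduce jobs that run over HDFS, as shown in the stack above.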
DEMO
Hadoop Installation (CDH) for Windows
• Download and install VMware Player:
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0
• Make sure you have enabled virtualization in the BIOS.
• Download the “QuickStart VM with CDH” (download for VMware):
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-4-7-x.html
• Unzip “cloudera-quickstart-vm-4.7.0-0-vmware”.
• Open CDH using VMware Player:
  • Open VM Player.
  • Click “Open a Virtual Machine”.
  • Select the file “cloudera-quickstart-vm-4.7.0-0-vmware” in the extracted directory of “cloudera-quickstart-vm-4.7.0-0-vmware”. The virtual machine will be added to your VM Player.
  • Select this virtual machine and click “Play virtual machine”.
References
• http://training.cloudera.com/essentials.pdf
• http://en.wikipedia.org/wiki/Apache_Hadoop
• http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
• https://developer.yahoo.com/hadoop/tutorial/module1.html
• http://hadoop.apache.org/
• http://wiki.apache.org/hadoop/FrontPage
Questions?
