• Data Ingestion
1. Sqoop
2. Flume
• Data Processing
1. MapReduce
2. Spark
• Data Analysis
1. Pig
2. Hive
3. Impala
Hadoop Components
Hadoop Ecosystem
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Batch data processing using a programming model (map and reduce functions)
• Spark -> In-memory Data Processing
• Pig, Hive -> Data processing services using scripts and SQL-like queries (Pig Latin, HiveQL)
• HBase -> NoSQL Database
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• ZooKeeper -> Coordinating and managing the cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data ingestion services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain cluster
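
As a rough illustration of the MapReduce entry in the list above, the following single-process sketch imitates the map, shuffle and reduce phases with a word count. It is a toy model, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real job would run the phases in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence (hypothetical driver function)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["big data", "Big table"])
```

Here counts maps "big" to 2 and "data" and "table" to 1 each.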
HBase
• HBase is an open source, non-relational, distributed database
modeled after Google's BigTable.
• It runs on top of Hadoop and HDFS, providing BigTable-like
capabilities for Hadoop.
• It is a NoSQL database: non-relational and column-oriented.
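
The BigTable-style data model described above can be pictured as a sorted map from row key to column to timestamped values. The sketch below is a toy in-memory model of that idea only. It does not use the HBase API, and the class name ToyTable, its methods, and the sample row keys are hypothetical.

```python
import time
from collections import defaultdict

class ToyTable:
    """Toy in-memory model of a BigTable-style table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        """Store a new version of a cell; older versions are kept."""
        ts = time.time_ns() if ts is None else ts
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column):
        """Return the newest value of a cell, or None if it does not exist."""
        cells = self._rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start_row, stop_row):
        """Yield (row, newest columns) in sorted row-key order; stop_row is exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: v[0][1] for c, v in self._rows[row].items()}

table = ToyTable()
table.put("user#001", "info:name", "Ada", ts=1)
table.put("user#001", "info:name", "Ada L.", ts=2)   # newer version of the same cell
table.put("user#002", "info:city", "Pune", ts=1)     # rows may hold different columns
```

Rows are sparse (each row stores only the columns it actually has), and a read returns the newest version of a cell unless an older one is asked for.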
Features of HBase
• A type of NoSQL database
• Strongly consistent reads and writes
• Automatic sharding
• Automatic RegionServer failover
• Hadoop/HDFS Integration
• HBase supports massively parallel processing via MapReduce,
with HBase usable as both source and sink.
• HBase supports an easy-to-use Java API for programmatic
access.
• HBase also supports Thrift and REST for non-Java front-ends.
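
To make "automatic sharding" more concrete, the sketch below models range-based sharding by row key: each region owns a contiguous key range and is split in two once it grows past a threshold. This is a simplified illustration, not HBase's internal implementation. The class ToyRegions and the threshold max_rows_per_region are assumptions made for the example (HBase itself decides splits by region size on disk).

```python
import bisect

class ToyRegions:
    """Toy model of range-based sharding by row key (not HBase internals)."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region  # hypothetical split threshold
        self.start_keys = [""]               # the first region starts at the empty key
        self.regions = [[]]                  # sorted row keys held by each region

    def _region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self._region_index(row_key)
        rows = self.regions[i]
        if row_key not in rows:
            bisect.insort(rows, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        rows = self.regions[i]
        mid = len(rows) // 2
        # The upper half becomes a new region that starts at its first key.
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])

shards = ToyRegions(max_rows_per_region=4)
for n in range(10):
    shards.put(f"row{n:02d}")
```

After the ten inserts the table has split itself into four regions without any action from the client, which only ever supplies a row key.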
Difference between HBase and HDFS
HDFS | HBase
Good for storing large files | Built on top of HDFS. Good for hosting very large tables (billions of rows x millions of columns)
Write once. Appending to files is possible in some recent versions but not commonly used | Read/write many
No random read/write | Random read/write
No individual record lookup; all data is read instead | Fast record lookup (and update)
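
The difference in access pattern can be sketched in a few lines. The toy example below contrasts a sequential scan of a file (the HDFS-style pattern) with a lookup through a keyed structure (the HBase-style pattern). It runs in memory only, and the function names scan_lookup and build_index are made up for this illustration. HBase's actual storage format is more involved than a Python dict.

```python
import io

def scan_lookup(file_obj, wanted_key):
    """HDFS-style access: read the file sequentially until the key is found."""
    lines_read = 0
    for line in file_obj:
        lines_read += 1
        key, _, value = line.rstrip("\n").partition("\t")
        if key == wanted_key:
            return value, lines_read
    return None, lines_read

def build_index(file_obj):
    """HBase-style access: keep records keyed so one lookup reaches a record."""
    return dict(line.rstrip("\n").split("\t", 1) for line in file_obj)

# 1000 tab-separated records: "row0000<TAB>value0" ... "row0999<TAB>value999"
data = "".join(f"row{i:04d}\tvalue{i}\n" for i in range(1000))

value, lines_read = scan_lookup(io.StringIO(data), "row0999")
index = build_index(io.StringIO(data))
```

Finding the last record by scanning touches all 1000 lines, while index["row0999"] reaches the same value with a single keyed lookup.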
Sqoop
• Command-line interface for transferring data between
relational databases and Hadoop
• Sqoop uses the MapReduce framework to import and export
data, which provides parallelism as well as fault
tolerance.
• Imports are used to populate tables in Hadoop
• Exports are used to put data from Hadoop into a relational database
such as SQL Server
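
The parallel import can be pictured as follows: Sqoop finds the minimum and maximum of a split column (chosen with --split-by), divides that range among the map tasks (set with --num-mappers), and each task imports its own slice. The sketch below imitates only that range arithmetic. The helper names split_ranges and mapper_query, and the sample table and column, are hypothetical, and Sqoop's own splitter differs in detail.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer rows than mappers: this mapper gets nothing
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def mapper_query(table, column, lo, hi):
    """Build the bounded query one map task would run (illustrative SQL only)."""
    return f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"

ranges = split_ranges(1, 10, 4)
queries = [mapper_query("orders", "id", lo, hi) for lo, hi in ranges]
```

With ids 1 to 10 and four mappers, the ranges are (1, 3), (4, 6), (7, 8) and (9, 10). Because each slice is an independent task, a failed task can be retried without redoing the others.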
Pig (Scripting)
MapReduce (Distributed Programming Interface)
HDFS
Applications of Apache Pig: