Introduction to Big Data and Hadoop
S.Mohan Kumar
Other sources
• IoT
• Smart meter monitoring
• Fraud detection
• Life sciences research
• Advertising analysis
• Equipment monitoring
Trendwise Analytics
Who is using?
Early adopters:
• Facebook
• LinkedIn
• Yahoo: More than 100,000 CPUs in >40,000 computers running Hadoop.
Biggest cluster: 4,500 nodes (2*4 CPU boxes with 4*1TB disk & 16GB RAM)
• Inmobi: Running Apache Hadoop on around 700 nodes (16,800 cores, 5+ PB)
in 6 data centers for ETL, analytics, data science and machine learning
• Ebay: 532-node cluster (8 * 532 cores, 5.3PB)
Mainstream...
Other Examples:
UPS uses Big Data to track 16.3 million packages per day for 8.8 million
customers, handling an average of 39.5 million tracking requests from
customers per day. The company stores over 16 petabytes of data.
Source: http://www.csc.com/big_data/success_stories
Some more...
Closer home..
TECHNOLOGY
Technology options
Hadoop – most popular
NoSQL Databases – Cassandra, MongoDB
SAP Hana - Proprietary
Spark
…..
What is Hadoop?
Open source project started by Doug Cutting
A platform to manage Big Data
Helps in Distributed computing
Runs on Commodity Hardware
What is Hadoop?
Hadoop uses simple, robust techniques on inexpensive computer systems to
deliver very high data availability and to analyze enormous amounts of
information quickly. Hadoop offers enterprises a powerful new tool for
managing big data.
Hadoop Is:
• Open source, written in Java
• An Apache project
• Targeted at clusters of commodity PCs
• Cost-effective bulk computing
• A distributed file system
• Modeled on GFS
• A distributed processing framework
• Using the Map/Reduce metaphor
HDFS (general):
• A scalable, fault-tolerant, high-performance distributed file system
(storage)
• No RAID required
• Access from C, Java, FUSE, WebDAV, Thrift
• Single namespace for the entire cluster
• Managed by a single Namenode
• Hierarchical directories
• Optimized for streaming reads of large files
Hive
• A data warehouse infrastructure for Hadoop providing data summarization
and SQL-like querying (HiveQL)
Pig
• A high-level data-flow language and execution framework for parallel
computation
• Makes Map Reduce programs simple to write
• Abstracts you from specific details
• Lets you focus on data processing
• Data flow
• For data manipulation
Sqoop
• A tool designed to help users import data from existing relational
databases into their Hadoop clusters
• Automatic data import
• "SQL to Hadoop"
• Generates code for use in MapReduce applications
Zookeeper
• A high-performance coordination service for
distributed applications
• Zookeeper is a centralized service for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services
Avro
• A data serialization system that provides
dynamic integration with scripting languages
• Avro Data is smaller and faster
• Avro RPC
Chukwa
• A data collection system for managing large
distributed systems
• Built on HDFS and MapReduce
• Toolkit for displaying, monitoring and analyzing log files
HBase
• HBase is the Hadoop database. Think of it as a
distributed scalable Big Data store.
• HBase can be used when you require random,
real-time read/write access to your Big Data
• HBase is an open-source, distributed, versioned,
column-oriented store modeled after Google's
Bigtable: A Distributed Storage System
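The versioned, column-oriented model above can be made concrete with a tiny plain-Java sketch. This is not the HBase client API: the row and column names are invented for illustration, and real HBase adds column families, regions, and persistence on HDFS.

```java
// Plain-Java sketch of HBase's data model: a table maps
// rowKey -> column -> timestamp -> value, with timestamps kept
// newest-first so a read returns the latest version by default.
import java.util.Comparator;
import java.util.TreeMap;

public class MiniColumnStore {
    // rowKey -> column -> (timestamp, newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    public void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Read the newest version of a cell, or null if absent.
    public String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = table.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        MiniColumnStore store = new MiniColumnStore();
        store.put("row1", "cf:name", 1L, "v1");
        store.put("row1", "cf:name", 2L, "v2");
        System.out.println(store.get("row1", "cf:name")); // newest version: v2
    }
}
```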
Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
hardware
• It automatically handles data replication and node
failure
• Handles large volumes of unstructured data easily
• Last but not least – it's free! (open source)
Source: http://pivotalhd.docs.pivotal.io/docs/getting-started.html
Modes of Hadoop
• Distributed mode – cluster (for production)
• Pseudo-distributed mode – single node
• Local (standalone) mode – single Java process; not very common
Hadoop 2.x
• Introduction of YARN
• HDFS Federation
• Windows OS support
Hadoop distributions:
• Cloudera
• Hortonworks
• MapR
• Greenplum, A Division of EMC
• IBM InfoSphere BigInsights
Default web UI ports:
Namenode 50070
Jobtracker 50030
Tasktracker 50060
Job Tracker
Task Tracker
HDFS
Hardware Failure
Detection of faults and quick, automatic recovery from them is a core architectural
goal of HDFS.
HDFS is designed for batch processing rather than interactive use by users.
The emphasis is on high throughput of data access rather than low latency of data
access.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size. Thus, HDFS is tuned to support large files.
HDFS applications need a write-once-read-many access model for files. A file once
created, written, and closed need not be changed. This assumption simplifies data
coherency issues and enables high throughput data access.
HDFS has been designed to be easily portable from one platform to another.
Replication Factor
Heartbeat
Blockreport
Rack awareness
High Availability
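Rack awareness in particular drives HDFS's default replica placement: first replica on the writer's node, second on a node in a different rack, third on another node in that second rack. The sketch below is a simplified plain-Java illustration, not Hadoop code; node names like rack1/dn1 are invented, and real HDFS also weighs node load and free space.

```java
// Simplified sketch of HDFS's default rack-aware replica placement.
// Nodes are encoded as "rack/host" strings for this illustration.
import java.util.ArrayList;
import java.util.List;

public class RackAwareness {
    static String rackOf(String node) { return node.split("/")[0]; }

    public static List<String> placeReplicas(String writer, List<String> nodes) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // 1st replica: the writer's own node
        // 2nd replica: any node on a different rack than the writer
        String second = nodes.stream()
                .filter(n -> !rackOf(n).equals(rackOf(writer)))
                .findFirst().orElseThrow();
        replicas.add(second);
        // 3rd replica: a different node on the same rack as the 2nd
        String third = nodes.stream()
                .filter(n -> rackOf(n).equals(rackOf(second)) && !n.equals(second))
                .findFirst().orElseThrow();
        replicas.add(third);
        return replicas;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("rack1/dn1", "rack1/dn2", "rack2/dn3", "rack2/dn4");
        System.out.println(placeReplicas("rack1/dn1", nodes));
        // [rack1/dn1, rack2/dn3, rack2/dn4]
    }
}
```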
HDFS Components
Name Node
Data Node
HDFS Architecture
Namenode
– Hardware server
– Software Daemon
Namenode
In addition to keeping the metadata in RAM, the Namenode saves it to disk
from time to time in two files:
– fsimage
– edits
Name Node
This is the controller, or master, of the cluster, which manages all operations
requested by clients. The Name Node performs the following major operations:
• Allocation of blocks to files
• Monitoring Data Nodes for failures and new Data Node additions
• Replication management
• User request management – e.g. writing and reading files
• Transaction tracking and logging of transactions
Meta Data
This contains all information about the file system, like: file and directory
names, the mapping of files to blocks, block locations, and permissions.
Secondary Namenode
Datanode
It consists of :
– Software Daemon
Datanode
– Replication – default 3
– Configurable
Data Node
This is where data is stored in HDFS. The Data Node is not aware of which file a
block belongs to; it writes data to the local file system in the form of blocks.
The Data Node has the following major functionalities:
• Write/read blocks to/from local files
• Perform operations as directed by the Name Node
• Register/heartbeat itself with the Name Node and provide block reports to the Name Node
Replication – Factor 3
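The heartbeat relationship above can be sketched in plain Java. This is an illustration, not Hadoop code; the 10-minute timeout mirrors HDFS's default dead-node interval, after which the Name Node re-replicates the dead node's blocks.

```java
// Plain-Java sketch of how the Name Node could track Data Node
// heartbeats: each Data Node reports in periodically, and a node
// whose last heartbeat is older than the timeout is treated as dead.
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    // 10 minutes, mirroring HDFS's default dead-node interval.
    public static final long TIMEOUT_MS = 10L * 60 * 1000;

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Record a heartbeat from a Data Node at the given clock time.
    public void heartbeat(String dataNode, long nowMs) {
        lastHeartbeat.put(dataNode, nowMs);
    }

    // A node is dead if it never registered or its heartbeat is too old.
    public boolean isDead(String dataNode, long nowMs) {
        Long last = lastHeartbeat.get(dataNode);
        return last == null || nowMs - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        monitor.heartbeat("dn1", 0L);
        System.out.println(monitor.isDead("dn1", 60_000L));         // false: 1 min ago
        System.out.println(monitor.isDead("dn1", 11L * 60 * 1000)); // true: 11 min ago
    }
}
```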
Client
The client is whatever communicates with HDFS using a predefined API: it
could be the FS utility provided with Hadoop, the Java API provided by
Hadoop, or some other API that communicates with Hadoop in a
language-independent manner.
• The Client breaks File.txt into (3) blocks. For each block, the Client
consults the Name Node and receives a list of (3) Data Nodes that should
have a copy of this block.
• The Client then writes the block directly to the first Data Node.
• After the Data Node completes writing the block, it replicates the same
block to the other Data Nodes.
• The cycle repeats for the remaining blocks.
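The write path above can be sketched in plain Java. This is an illustration, not the Hadoop client API; the 128 MB block size matches a common HDFS default, but the round-robin node choice is a simplifying assumption (real HDFS placement is rack-aware).

```java
// Sketch: how a client might split a file into fixed-size blocks and
// ask a Name Node stand-in for 3 Data Nodes per block.
import java.util.ArrayList;
import java.util.List;

public class WritePathSketch {
    public static final long BLOCK_SIZE = 128L * 1024 * 1024; // common HDFS default
    public static final int REPLICATION = 3;                  // default replication factor

    // Number of blocks a file of the given size is split into.
    public static int blockCount(long fileSize) {
        return (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }

    // Stand-in for the Name Node: pick REPLICATION Data Nodes for one
    // block by rotating round-robin through the cluster's node list.
    public static List<String> chooseDataNodes(List<String> cluster, int blockIndex) {
        List<String> replicas = new ArrayList<>();
        for (int r = 0; r < REPLICATION; r++) {
            replicas.add(cluster.get((blockIndex + r) % cluster.size()));
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("dn1", "dn2", "dn3", "dn4");
        long fileSize = 300L * 1024 * 1024; // 300 MB -> 3 blocks
        for (int b = 0; b < blockCount(fileSize); b++) {
            System.out.println("block " + b + " -> " + chooseDataNodes(cluster, b));
        }
    }
}
```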
Copy single src, or multiple srcs from local file system to the destination
filesystem. Also reads input from stdin and writes to destination filesystem.
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
Reads the input from stdin.
chmod
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]
Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.
• The address and port where the web interface listens can be changed by
setting dfs.http.address in conf/hadoop-site.xml. It must be of the form
address:port. To accept requests on all addresses, use 0.0.0.0.
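As a concrete example of the setting described above, moving the NameNode web UI to port 8070 on all interfaces might look like this (the port 8070 is an arbitrary choice for illustration):

```xml
<!-- conf/hadoop-site.xml -->
<!-- Move the NameNode web UI to port 8070, listening on all addresses. -->
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:8070</value>
</property>
```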
• From this interface, you can browse HDFS itself with a basic file-browser
interface. Each DataNode exposes its file browser interface on port 50075.
You can override this by setting the dfs.datanode.http.address
configuration key to a setting other than 0.0.0.0:50075. Log files
generated by the Hadoop daemons can be accessed through this interface,
which is useful for distributed debugging and troubleshooting.
HDFS commands
2. Transferring Files
hadoop fs -get (transfer a file from HDFS to local)
hadoop fs -put (transfer a file from local to HDFS)
hadoop fs -cp (copy a file within HDFS)
hadoop fs -copyFromLocal (same as put)
hadoop fs -copyToLocal (same as get)
3. Other commands
hadoop fs -cat (display the contents of a file within HDFS)
hadoop fs -du <file_name> (display the file size)
hadoop fs -tail <file_name> (display the last kilobyte of the file)
hadoop fs -rm (remove a particular file in HDFS)
hadoop fs -mkdir <directory_name> (create a new directory in HDFS)
hadoop fs -dus <directory_name> (display the directory size)
hadoop fs -rmr (remove a directory in HDFS)
Hands on
MapReduce
MapReduce
Jobtracker (master)
Tasktracker (slave)
Jobtracker
Tasktracker
MapReduce Program
– Mapper
– Reducer
Driver
InputFormat:
– TextInputFormat ( Default )
– KeyValueTextInputFormat
– SequenceFileInputFormat
– SequenceFileAsTextInputFormat
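To see what the default TextInputFormat hands to a mapper, here is a plain-Java approximation: the key is the byte offset where each line starts and the value is the line itself. Real Hadoop uses LongWritable/Text types and handles input-split boundaries; this sketch ignores both.

```java
// Sketch of TextInputFormat's record semantics: one record per line,
// keyed by the byte offset of the line's start in the file.
import java.util.LinkedHashMap;
import java.util.Map;

public class TextInputSketch {
    public static Map<Long, String> records(String fileContents) {
        Map<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            recs.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' separator
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("Hello World\nHello Trendwise"));
        // {0=Hello World, 12=Hello Trendwise}
    }
}
```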
Mapper output:
Hello, 1
World, 1
Hello, 1
Trendwise, 1
Reducer output:
Hello, 2
World, 1
Trendwise, 1
part-r-00000
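The Mapper/Reducer flow above can be mimicked end-to-end in a few lines of plain Java, with no Hadoop required: map emits (word, 1) pairs, the shuffle groups them by key, and reduce sums each group, reproducing the counts shown.

```java
// Plain-Java sketch of the WordCount data flow: map -> shuffle -> reduce.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: one (word, 1) pair per word in the input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce: group pairs by key and sum the values.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("Hello World Hello Trendwise")));
        // {Hello=2, Trendwise=1, World=1}
    }
}
```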
Demo
Yarn Architecture