
Hadoop – An Introduction

Shankar Radhakrishnan
HCL Technologies
Agenda
State of the Data
What is Hadoop
Hadoop Ecosystem
References
State of the data
Data driven businesses
 Businesses have been collecting information all the time
Mine more == Collect more (and vice-versa)
Challenges
 Application Complexities
 Data growth
 Infrastructure
 Economics
Need of the day
Data driven business
Applications
 Searches, Message posts, Comments, Emails,
Blogs, Photos, Video Clips, Product Listings
 ERP, CRM, Databases, Internal Applications,
Customer/Consumer facing products
 Mobile
Context
 Web, Customers, Products, Business Systems,
Processes, Services
Support Systems
 CRM, SOA, Recommendation Systems/processes,
Data warehouses, Business Intelligence, BPM
Mine more
Drivers
 ROI
 Customer Retention
 Product Affinity
 Market Trends
 Research Analysis
 Customer/Consumer Analytics
Process
 Clustering
 Classification
 Build Relationships
 Regression
Types
 Structured
 Semi-structured
 Unstructured
Challenges
Complex Applications
 Data integration is valuable but complex to get right
Data Growth
 Growth is exponential
Infrastructure
 Availability
 Unscalable hardware
Economics
 Managing high data volume comes at a price
 Failures are very costly
Need of the day
System that can handle high volume data
System that can perform complex operations
Scalable
Robust
Highly Available
Fault Tolerant
Cheap
What is Hadoop
Top-level Apache project
Open source
Inspired by Google’s papers on Map/Reduce (MR) and the Google File System (GFS)
Originally developed to support the Apache Nutch search engine
Software framework written in Java
Designed
 For sophisticated analysis
 To deal with structured and unstructured complex data
Why Hadoop?
Runs on commodity hardware
Shared-nothing architecture
Scale hardware whenever you want
System compensates for hardware scaling and issues (if any)
Run large-scale, high volume data processes
Scales well with complex analysis jobs
Handles failures
Ideal to consolidate data from both new and legacy data
sources
Value to the business
Hadoop in an enterprise - Example
Hadoop Ecosystem
HDFS: Hadoop Distributed File System
Map/Reduce: Software framework for clustered, distributed data processing
ZooKeeper: Coordination service for distributed applications
Avro: Data serialization
Chukwa: Data collection system to monitor distributed systems
HBase: Data storage for large distributed tables
Hive: Data warehousing infrastructure
Pig: High-level data-flow language and execution framework
HDFS – Hadoop Distributed File System
Master/Slave Architecture
Runs on commodity hardware
Fault Tolerant
Handles large volumes of data
Provides High Throughput
Streaming data-access
Simple file coherency model
Portable to heterogeneous hardware and software
Robust
Handles disk failures, replication (& re-replication)
Performs cluster rebalancing, data integrity checks
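The re-replication behaviour listed above can be sketched in a few lines. This is a toy model, not HDFS code: the block-to-node map, the node names, the replication factor of 3 and the "first free node" placement rule are all assumptions made for illustration (a real cluster also weighs racks and load).

```python
# Hypothetical model: block id -> set of data nodes holding a replica.
REPLICATION_FACTOR = 3  # assumed target number of replicas per block


def re_replicate(block_map, live_nodes, failed_node, factor=REPLICATION_FACTOR):
    """Drop replicas on the failed node and copy under-replicated blocks elsewhere."""
    live = sorted(n for n in live_nodes if n != failed_node)
    for holders in block_map.values():
        holders.discard(failed_node)
        # Candidate targets are live nodes that do not already hold this block.
        candidates = [n for n in live if n not in holders]
        while len(holders) < factor and candidates:
            holders.add(candidates.pop(0))
    return block_map


blocks = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
re_replicate(blocks, nodes, failed_node="dn2")
```

After the call, no block lists dn2 and every block is back to three replicas.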
HDFS – Example
Name node
• Handles file system operations
• Maps blocks to data nodes

Data node
• Processes read/write requests
• Handles data blocks
• Replication
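The division of labour between the two roles can be illustrated with a small in-memory sketch. Everything here is hypothetical (the `NameNode` and `DataNode` classes, `put`/`get`, the 4-byte block size, round-robin placement); it only mirrors the idea that the name node keeps metadata while data nodes store and serve the blocks.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose, real HDFS blocks are far larger


class DataNode:
    """Stores block contents and serves read/write requests."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps metadata only: which blocks make up a file and where they live."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}  # path -> list of (block id, [DataNode, ...])

    def allocate(self, path, num_blocks):
        n = len(self.data_nodes)
        plan = []
        for i in range(num_blocks):
            # Assumed placement rule: round-robin over the data nodes.
            targets = [self.data_nodes[(i + r) % n] for r in range(self.replication)]
            plan.append((f"{path}#{i}", targets))
        self.block_map[path] = plan
        return plan

    def locate(self, path):
        return self.block_map[path]


def put(name_node, path, data):
    """Client write: ask the name node for a plan, then send blocks to data nodes."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    plan = name_node.allocate(path, len(chunks))
    for (block_id, targets), chunk in zip(plan, chunks):
        for node in targets:
            node.write(block_id, chunk)


def get(name_node, path):
    """Client read: look up block locations, then read each block from a replica."""
    return b"".join(targets[0].read(block_id)
                    for block_id, targets in name_node.locate(path))


data_nodes = [DataNode(f"dn{i}") for i in range(3)]
name_node = NameNode(data_nodes)
put(name_node, "/logs/a.txt", b"hello hadoop")
```

Reading the file back with `get(name_node, "/logs/a.txt")` reassembles the bytes from whichever data nodes hold the blocks.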
Hadoop Map/Reduce
A Map/Reduce job
 Splits the input data-set into independent chunks
 Chunks are processed by map tasks, in parallel
 Framework sorts the output of the maps
 Sorted output is processed by reduce tasks, in parallel
Input and output are typically stored in a file system
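The split, map, sort and reduce steps can be walked through with a word count in plain Python. This is a single-process sketch under stated assumptions, not the Hadoop API: `split`, `map_task`, `reduce_task` and `run_job` are hypothetical names, and a thread pool stands in for tasks running on cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def split(lines, chunk_size):
    """Split the input data-set into independent chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_task(chunk):
    """Emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_task(key, values):
    """Sum all counts emitted for one word."""
    return key, sum(values)


def run_job(lines, chunk_size=2, workers=4):
    chunks = split(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: chunks are processed in parallel.
        mapped = [pair for out in pool.map(map_task, chunks) for pair in out]
        # Sort phase: bring equal keys next to each other, then group them.
        mapped.sort(key=itemgetter(0))
        groups = [(key, [v for _, v in grp])
                  for key, grp in groupby(mapped, key=itemgetter(0))]
        # Reduce phase: one reduce call per key, again in parallel.
        reduced = list(pool.map(lambda kv: reduce_task(*kv), groups))
    return dict(reduced)


counts = run_job(["big data big", "data hadoop", "hadoop big"])
```

Here `counts` maps each word to its total, for example `counts["big"]` is 3.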
Framework takes care of
 Scheduling tasks
 Monitoring
 Re-executing failed tasks
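Re-execution of failed tasks can be pictured as a retry queue. The sketch below is an assumption-laden toy: `run_with_retries`, `make_flaky` and the `max_attempts` limit are hypothetical, and the simulated failures stand in for a node dying mid-task.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run every task; put a failed task back on the queue until it succeeds."""
    results, attempts = {}, {}
    pending = list(tasks.items())
    while pending:
        task_id, fn = pending.pop(0)
        attempts[task_id] = attempts.get(task_id, 0) + 1
        try:
            results[task_id] = fn()
        except Exception:
            if attempts[task_id] >= max_attempts:
                raise RuntimeError(f"task {task_id} failed {max_attempts} times")
            pending.append((task_id, fn))  # reschedule the failed task
    return results, attempts


def make_flaky(failures, value):
    """Build a task that fails `failures` times before returning `value`."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("simulated node failure")
        return value

    return task


tasks = {"map-0": make_flaky(0, "a"), "map-1": make_flaky(2, "b")}
results, attempts = run_with_retries(tasks)
```

In this run `map-0` succeeds on its first attempt and `map-1` on its third, so the job still completes with both results.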
Example : Mapper Function
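A minimal sketch of a word-count mapper, written in Python rather than against the Java API. It assumes a Hadoop Streaming-style contract, where the mapper receives lines of text and emits tab-separated key/value records; the function name `mapper` and the word-count task itself are illustrative choices.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"


records = list(mapper(["Hello Hadoop", "hello"]))
```

As a streaming script, the same function would be fed `sys.stdin` and each record printed to standard output.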
Example : Reduce Function
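A matching word-count reducer can be sketched the same way. It assumes its input is a sequence of 'word<TAB>count' records already sorted by word, which is what the framework's sort step provides; the name `reducer` and the record format are illustrative assumptions.

```python
from itertools import groupby


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# sorted() stands in for the framework's sort between map and reduce.
totals = list(reducer(sorted(["hello\t1", "hadoop\t1", "hello\t1"])))
```

Because `groupby` only merges adjacent keys, the reducer relies on the sorted order; unsorted input would produce several partial totals for the same word.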
Who runs Hadoop?
