Shankar Radhakrishnan
HCL Technologies
Agenda
State of the Data
What is Hadoop
Hadoop Ecosystem
References
State of the data
Data driven businesses
Businesses have always been collecting information
Mine more == Collect more (and vice-versa)
Challenges
Application Complexities
Data growth
Infrastructure
Economics
Need of the day
Data driven business
Applications
Searches, Message posts, Comments, Emails,
Blogs, Photos, Video Clips, Product Listings
ERP, CRM, Databases, Internal Applications,
Customer/Consumer facing products
Mobile
Context
Web, Customers, Products, Business Systems,
Processes, Services
Support Systems
CRM, SOA, Recommendation Systems/processes,
Data warehouses, Business Intelligence, BPM
Mine more
Drivers
ROI
Customer Retention
Product Affinity
Market Trends
Research Analysis
Customer/Consumer Analytics
Process
Clustering
Classification
Build Relationships
Regression
Types
Structured
Semi-structured
Unstructured
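To make one of the mining processes above concrete, here is a minimal k-means clustering sketch in plain Python. The toy one-dimensional data, the two-cluster setup, and the function name are illustrative assumptions, not part of the original slides.

```python
# Minimal k-means clustering sketch (toy example, illustrative only).
# Groups 1-D data points into clusters by repeatedly assigning each
# point to its nearest centroid, then recomputing each centroid as
# the mean of its assigned points.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Toy data: two obvious groups, one near 1-3 and one near 10-12.
points = [1.0, 1.5, 3.0, 10.0, 10.5, 12.0]
centroids, clusters = kmeans(points, centroids=[0.0, 11.0])
```

Classification and regression follow the same shape: a training step that builds a model from collected data, then a scoring step applied to new records.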
Challenges
Complex Applications
Data integration is a valuable but complex problem to solve
Data Growth
Growth is exponential
Infrastructure
Availability
Unscalable hardware
Economics
Managing high data volume comes at a price
Failures are very costly
Need of the day
System that can handle high volume data
System that can perform complex operations
Scalable
Robust
Highly Available
Fault Tolerant
Cheap
What is Hadoop?
Top-level Apache project
Open source
Inspired by Google’s white papers on
Map/Reduce (MR) and the Google File System (GFS)
Originally developed to support the Apache Nutch search engine
Software Framework - Java
Designed
For sophisticated analysis
To deal with structured and unstructured complex data
Why Hadoop?
Runs on commodity hardware
Shared-nothing architecture
Scale hardware whenever you want
System compensates for hardware scaling and issues (if any)
Run large-scale, high volume data processes
Scales well with complex analysis jobs
Handles failures
Ideal to consolidate data from both new and legacy data sources
Value to the business
Hadoop in an enterprise - Example
Hadoop Ecosystem
HDFS – Hadoop Distributed File System
Map/Reduce – Software framework for clustered, distributed data processing
ZooKeeper – Distributed coordination service
Avro – Data serialization
Chukwa – Data collection system for monitoring distributed systems
HBase – Data storage for large distributed tables
Hive – Data warehousing infrastructure
Pig – High-level query language
HDFS – Hadoop Distributed File System
Master/Slave Architecture
Runs on commodity hardware
Fault Tolerant
Handles large volumes of data
Provides High Throughput
Streaming data-access
Simple file coherency model
Portable to heterogeneous hardware and software
Robust
Handles disk failures, replication (& re-replication)
Performs cluster rebalancing, data integrity checks
HDFS – Example
Name node
• File system operations
• Maps data-nodes
Data node
• Process read/write
• Handles Data-blocks
• Replication
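The name-node/data-node split above can be sketched as a toy simulation in plain Python. This is not the real HDFS API; the class names, tiny block size, and replication factor of 2 are illustrative assumptions (real HDFS defaults are a 128 MB block size and 3 replicas).

```python
# Toy sketch of the HDFS master/slave split (illustrative only, not the HDFS API).
# The name node maps each file to its blocks and tracks which data nodes
# hold each block; data nodes store the actual block contents and replicas.

BLOCK_SIZE = 8      # bytes per block (real HDFS default: 128 MB)
REPLICATION = 2     # copies of each block (real HDFS default: 3)

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                 # block id -> bytes

class NameNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}              # file -> list of (block id, [node names])

    def write(self, file_name, data):
        self.block_map[file_name] = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{file_name}#{i // BLOCK_SIZE}"
            chunk = data[i:i + BLOCK_SIZE]
            # Round-robin placement of REPLICATION copies across data nodes.
            targets = [self.data_nodes[(i // BLOCK_SIZE + r) % len(self.data_nodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = chunk
            self.block_map[file_name].append((block_id, [n.name for n in targets]))

    def read(self, file_name):
        # Serve each block from the first replica that holds it.
        out = b""
        for block_id, names in self.block_map[file_name]:
            node = next(n for n in self.data_nodes if n.name in names)
            out += node.blocks[block_id]
        return out

nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.write("report.txt", b"hello hadoop world!")
```

Because every block exists on more than one data node, losing a single node leaves the file readable; re-replication of the lost copies is then the name node's job.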
Hadoop Map/Reduce
Tagged by a job
Splits the input data set into separate chunks
Processed by map tasks, in parallel
Sorts the output of the maps
Processed by reduce tasks, in parallel
Input and output are typically stored in a file system
Framework takes care of
Scheduling tasks
Monitoring
Re-executing failed tasks
Example : Mapper Function
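A sketch of what such a mapper might look like, written here as plain Python in the Hadoop Streaming style. Word count is the standard illustration; the function name and the sample input are assumptions, not taken from the slides.

```python
# Word-count mapper sketch (Hadoop Streaming style, illustrative only).
# For each word in an input line, emit a (word, 1) key/value pair.
# The framework later sorts and groups these pairs by key before
# handing them to the reduce tasks.

def mapper(line):
    pairs = []
    for word in line.strip().lower().split():
        pairs.append((word, 1))
    return pairs

pairs = mapper("Hello Hadoop hello world")
```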
Example : Reduce Function
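A matching reducer sketch, again in plain Python and illustrative only (the function name and sample values are assumptions). In a real job the framework calls it once per key, with all the values the mappers emitted for that key.

```python
# Word-count reducer sketch (illustrative only).
# Receives one key (a word) together with every value the mappers
# emitted for it, and folds them into a single (word, count) pair.

def reducer(word, counts):
    return (word, sum(counts))

# The framework groups the sorted map output by key before calling reduce:
result = reducer("hello", [1, 1, 1])
```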
Who runs Hadoop?