Let’s Hadoop


1. WHAT’S THE BIG DEAL WITH BIG DATA?

• Gartner predicts 800% data growth over the next 5 years.
• Big Data opens the door to a new approach to engaging customers and making decisions.
• Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.

2. BIG DATA: WHAT ARE THE CHALLENGES?

• How can we capture and deliver data to the right people in real time?
• How can we understand and use big data when it comes in a variety of forms?
• How can we store and analyze the data given its size and the computational capacity required?
• While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
• Hardware problems: disks fail, and data must be processed and combined from multiple disks.
• Traditional systems can’t scale; they are not reliable and are expensive.
• Example: we need to process a 100TB dataset.
  o On 1 node, scanning at 50MB/s takes about 23 days.
  o On a 1000-node cluster, scanning at 50MB/s takes about 33 minutes.
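The scan-time figures above follow from simple arithmetic; the snippet below is only a back-of-the-envelope check of those numbers (100 TB read sequentially at 50 MB/s per disk), not code from the slides.

```java
// Back-of-the-envelope check of the scan times quoted above.
public class ScanTime {
    public static void main(String[] args) {
        double datasetMb = 100e6;                                // 100 TB expressed in MB
        double mbPerSecond = 50.0;                               // sequential read speed of one disk
        double oneNodeSeconds = datasetMb / mbPerSecond;         // ~2,000,000 s on a single node
        double clusterSeconds = oneNodeSeconds / 1000.0;         // 1000 nodes scan in parallel
        System.out.printf("1 node:     %.1f days%n", oneNodeSeconds / 86400);  // ~23 days
        System.out.printf("1000 nodes: %.1f minutes%n", clusterSeconds / 60);  // ~33 minutes
    }
}
```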

3. WHAT TECHNOLOGIES SUPPORT BIG DATA?

Scale-out everything:
• Storage
• Compute

4. WHAT MAKES HADOOP DIFFERENT?

• Accessible: Hadoop runs on large clusters of commodity machines or on cloud services such as EC2.
• Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
• Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple: Hadoop allows users to quickly write efficient parallel code.
• Data locality: move the computation to the data.
• Replication: use replication across servers to deal with unreliable storage and servers.
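As a small illustration of the replication point, HDFS lets a client request a per-file replication factor through the standard FileSystem API. This is only a sketch; the path and the factor of 3 are assumptions, not values from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask HDFS to keep three copies of a file's blocks on different DataNodes.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/data/events.log"), (short) 3);  // hypothetical path and factor
        fs.close();
    }
}
```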

5. IS HADOOP A ONE-STOP SOLUTION?

Good for...
• Batch analysis of very large datasets
Not good for...
• Real-time processing
• Small datasets
• Algorithms that require large temporary space
• Problems that are CPU bound and have lots of cross-talk

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
• Framework written in Java
• Designed to solve problems that involve analyzing large data sets (petabytes)
• Programming model based on Google’s MapReduce
• Infrastructure based on Google’s distributed file system (GFS)
• Hadoop consists of two core components:
  o The Hadoop Distributed File System (HDFS): a distributed file system
  o MapReduce: distributed processing on compute clusters
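To make the HDFS half concrete, here is a minimal, hypothetical sketch of writing a file into HDFS and reading it back through the Java FileSystem API; the path and contents are made up.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write a small file into HDFS, then read it back.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");            // hypothetical path

        try (FSDataOutputStream out = fs.create(path, true)) {            // true = overwrite
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());         // prints: hello hadoop
        }
        fs.close();
    }
}
```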

NameNode
• The NameNode manages the file system namespace (metadata) and regulates access to files by clients.
• It executes file system namespace operations such as opening, closing, and renaming files and directories.
DataNode
• A DataNode manages the storage attached to the node on which it runs.
• It serves read and write requests and performs block operations (creation, deletion, and replication) upon request from the NameNode.
• There are many DataNodes, typically one DataNode per physical node.
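The division of labour can be seen from client code: the NameNode answers the metadata question of where a file’s blocks live, while the DataNodes hold the actual bytes. A minimal sketch, assuming a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which DataNodes hold each block of a file (metadata served by the NameNode).
public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/hello.txt"));   // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block);   // offset, length, and the hosts storing this block
        }
        fs.close();
    }
}
```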

• Large-scale data processing
  o We want to use 1000s of CPUs
  o But we don’t want the hassle of managing them
• The MapReduce architecture provides
  o Automatic parallelization and distribution
  o Fault tolerance
  o I/O scheduling
  o Monitoring and status updates
• MapReduce is a method for distributing a task across multiple nodes
• Each node processes the data stored on that node
• A MapReduce job consists of two phases:
  o Map
  o Reduce
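As a concrete illustration of the two phases, the classic word-count example is sketched below using the org.apache.hadoop.mapreduce API. The class and variable names are our own; this is a sketch following common WordCount conventions, not code from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in an input line, emit the pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts for each word and write the final (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));
    }
}
```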


• Two types of nodes control the job execution process:
  o a JobTracker
  o TaskTrackers
• The JobTracker coordinates all jobs run on the system by scheduling tasks to run on TaskTrackers.
• TaskTrackers run tasks and send progress reports to the JobTracker.
• The JobTracker runs on the NameNode machine; TaskTrackers run on the DataNodes.
• In the map phase, the mapper reads data in the form of key/value pairs.
• The reducer processes all output from the mappers, arrives at the final key/value pairs, and writes them to HDFS.
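To round out the word-count sketch above, a minimal driver that configures and submits the job is shown below; the input and output paths come from the command line, and every name here is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a job driver: set the map and reduce classes, point the job at
// HDFS input/output paths, and submit it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is typically packaged into a jar and launched with something like hadoop jar wordcount.jar WordCountDriver /input /output.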

