
C K Pithawalla College of Engineering & Technology, Surat

Hadoop
Introduction & Setup
Prepared by Nishant M Gandhi
Certified Network Manager by Nettech. Diploma in Cyber Security (pursuing).

What is Hadoop?

Hadoop is a framework for running applications on large clusters built of commodity hardware. (HADOOP WIKI)
• Open source, written in Java
• Google's MapReduce inspired Yahoo!'s Hadoop
• Now part of the Apache group

Hadoop Architecture on DELL C Series Server

Hadoop Software Stack
• Hadoop Common: The common utilities
• Hadoop Distributed File System (HDFS™): A distributed file system
• Hadoop MapReduce: Distributed processing on compute clusters

Other Hadoop-related projects:
• Avro™: A data serialization system
• Cassandra™: A scalable multi-master database
• Chukwa™: A data collection system for managing large distributed systems
• HBase™: A scalable, distributed database that supports structured data storage for large tables
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying
• Mahout™: A scalable machine learning and data mining library
• Pig™: A high-level data-flow language and execution framework for parallel computation
• ZooKeeper™: Coordination services

Who Uses Hadoop?
• The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query.
• On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application.

Who Uses Hadoop?
Here are the cluster statistics from the Datawarehouse Hadoop (HDFS) cluster at Facebook:
• 21 PB of storage in a single HDFS cluster
• 2000 machines
• 12 TB per machine (a few machines have 24 TB each)
• 1200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM per machine
• 15 map-reduce tasks per machine
That's a total of more than 21 PB of configured storage capacity! This is larger than the previously known Yahoo! cluster of 14 PB.
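Note that the 21 PB figure is configured (raw) capacity. HDFS stores every block multiple times for fault tolerance, so usable capacity is the raw capacity divided by the replication factor. A quick sketch, assuming the HDFS default replication factor of 3 (the slide does not state Facebook's actual setting):

```shell
# Rough usable-capacity estimate for an HDFS cluster.
# Assumes replication factor 3 (the HDFS default); Facebook's
# actual setting is not given on the slide.
raw_pb=21        # configured (raw) capacity in PB, from the slide
replication=3    # assumed replication factor
echo "$((raw_pb / replication)) PB usable"
```

With these assumptions, roughly a third of the raw capacity holds unique data; the rest is redundant copies that let the cluster survive disk and machine failures.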

Who Uses Hadoop?
• Creator of MapReduce
• Runs Hadoop for NSResearch cluster
• HDFS is inspired by GFS

Who Uses Hadoop?
Other Hadoop users: IBM, New York Times, Twitter, Veoh, Amazon, Apple, eBay, AOL, Hewlett-Packard, Joost.

Map Reduce
• Programming model developed at Google
• Sort/merge based distributed computing
• Initially, it was intended for their internal search/indexing application, but it is now used extensively by more organizations (e.g., Yahoo, Amazon.com, IBM, etc.)
• It is functional style programming (e.g., LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
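The sort/merge flavor of the model can be mimicked with ordinary Unix tools — a conceptual sketch only, not Hadoop itself: `tr` plays the mapper (emit one word per line), `sort` plays the shuffle (group identical keys together), and `uniq -c` plays the reducer (count each group):

```shell
# Word count in the MapReduce style using plain Unix tools.
# This is an analogy: a real Hadoop job runs the same map/shuffle/reduce
# phases in parallel across a cluster, not in a single local pipeline.
printf 'apple banana apple\ncherry banana apple\n' > input.txt
tr ' ' '\n' < input.txt |   # "map": emit one word per line
  sort |                    # "shuffle": group identical words together
  uniq -c > counts.txt      # "reduce": count each group
cat counts.txt
```

Because `sort` delivers all occurrences of a word adjacently, the "reducer" only ever needs to look at one key at a time — the same property that lets Hadoop distribute the reduce phase across machines.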

Map Reduce

Hadoop Distributed File System (HDFS)
• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and others at Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).

Goals of HDFS
• Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
• Optimized for Batch Processing
  – Data locations exposed so that computations can move to where the data resides
  – Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS

DFShell
The HDFS shell is invoked by: bin/hadoop dfs <args>
Available commands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz

Hadoop Single Node Setup
Step 1: Download hadoop from http://hadoop.apache.org/mapreduce/releases.html
Step 2: Untar the hadoop file: tar xvfz hadoop-0.20.2.tar.gz

Hadoop Single Node Setup
Step 3: Set the path to the Java compiler by editing the JAVA_HOME parameter in hadoop/conf/hadoop-env.sh

Hadoop Single Node Setup
Step 4: Create an RSA key to be used by hadoop when ssh'ing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
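The same flags can be tried safely against a throwaway directory first — a sketch for verifying the key-generation step without touching the real ~/.ssh (the temporary paths here are illustrative; it assumes OpenSSH's ssh-keygen is installed):

```shell
# Generate a passwordless RSA key pair into a throwaway directory
# to verify the flags before writing to the real ~/.ssh.
# -P "" sets an empty passphrase so hadoop can ssh without prompting;
# -f picks the output file; -q suppresses the banner output.
tmpdir=$(mktemp -d)
ssh-keygen -t rsa -P "" -f "$tmpdir/id_rsa" -q
cat "$tmpdir/id_rsa.pub" >> "$tmpdir/authorized_keys"
ls "$tmpdir"
```

Appending the public key to authorized_keys is what lets the hadoop start scripts open ssh sessions to localhost without a password prompt.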

Hadoop Single Node Setup
Step 5: Make the following changes to the configuration files under hadoop/conf

core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOPDATASTORE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

Hadoop Single Node Setup
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Hadoop Single Node Setup
Step 6: Format the hadoop file system. From the hadoop directory run the following:
bin/hadoop namenode -format

Using Hadoop
1) How to start Hadoop?
cd hadoop/bin
./start-all.sh
2) How to stop Hadoop?
cd hadoop/bin
./stop-all.sh
3) How to copy a file from local to HDFS?
cd hadoop
bin/hadoop dfs -put local_machine_path hdfs_path
4) How to list files in HDFS?
cd hadoop
bin/hadoop dfs -ls

Thank You….