
Cloudera Essentials for Apache Hadoop
What is Apache Hadoop?

▪ Hadoop is a software framework for storing, processing, and analyzing
“big data”
– Distributed
– Scalable
– Fault-tolerant
– Open source

© Copyright 2010-2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera. 01-2
Some Facts About Apache Hadoop

▪ Open source
– Overseen by the Apache Software Foundation
▪ Over 90 active committers to core Hadoop from over 20 companies
– Cloudera, Intel, LinkedIn, Facebook, Yahoo!, and more
▪ Hundreds more committers on other Hadoop-related projects and tools
– Known as the “Hadoop ecosystem”
▪ Many hundreds of other contributors writing features, fixing bugs, and so on

Traditional Large-Scale Computation

▪ Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing

▪ The early solution: bigger computers
– Faster processor, more memory
– But even this couldn’t keep up

Distributed Systems

▪ The better solution: more computers
– Distributed systems—use multiple machines for a single job

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t
budge a log, we didn’t try to grow a larger ox. We shouldn’t be trying for
bigger computers, but for more systems of computers.”
– Grace Hopper

The Origins of Hadoop

▪ Hadoop is based on work done at Google in the late 1990s/early 2000s


▪ Google’s problem:
– Indexing the entire web requires massive amounts of storage
– A new approach was required to process such large amounts of data
▪ Google’s solution:
– GFS, the Google File System
– Described in a paper released in 2003
– Distributed MapReduce
– Described in a paper released in 2004
▪ Doug Cutting and others read these papers and implemented a similar,
open-source solution
– This is what would become Hadoop

What is Hadoop?

▪ Hadoop is a distributed data storage and processing platform


– Stores massive amounts of data in a very resilient way
– Handles low-level distributed system details and enables your
developers to focus on the business problems
▪ Tools built around Hadoop (the “Hadoop ecosystem”) can be
configured/extended to handle many different tasks
– Extract, Transform, Load (ETL)
– Business Intelligence (BI) environments
– General data storage
– Predictive analytics
– Statistical analysis
– Machine learning
– …

Core Hadoop is a File System and a Processing Framework

▪ The Hadoop Distributed File System (HDFS)


– Any type of file can be stored in HDFS
– Data is split into chunks and replicated as it is written
– Provides resiliency and high availability
– Handled automatically by Hadoop
▪ YARN (Yet Another Resource Negotiator)
– Manages the processing resources of the Hadoop cluster
– Schedules jobs
– Runs processing frameworks
▪ MapReduce
– A distributed processing framework
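Production MapReduce jobs are written in Java; purely as an illustration, the shape of a MapReduce computation can be sketched in Python as the canonical word count. This is a single-process toy showing the map, shuffle, and reduce phases, not the real distributed API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big cluster"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'big': 2, 'data': 1, 'cluster': 1}
```

In the real framework, map tasks run in parallel on the nodes holding the data, and the shuffle moves intermediate pairs across the network to the reduce tasks.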

Hadoop is Scalable

▪ Adding nodes (machines) adds capacity proportionally
▪ Increasing load results in a graceful decline in performance
– Not failure of the system

[Figure: capacity grows in proportion to the number of nodes]
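The linear scaling claim can be put in terms of a toy model. The 10 TB per-node figure below is an illustrative assumption, not a Hadoop constant; the replication factor of three is the HDFS default:

```python
# Toy model of linear scalability: total capacity is proportional
# to the number of nodes.
TB_PER_NODE = 10   # illustrative assumption, not a Hadoop constant
REPLICATION = 3    # default HDFS replication factor

def raw_capacity_tb(num_nodes: int) -> int:
    """Raw storage across `num_nodes` data nodes."""
    return num_nodes * TB_PER_NODE

def usable_capacity_tb(num_nodes: int) -> float:
    """Capacity available for unique data after 3x replication."""
    return raw_capacity_tb(num_nodes) / REPLICATION

# Doubling the nodes doubles both figures: capacity scales linearly.
print(raw_capacity_tb(10), raw_capacity_tb(20))  # 100 200
print(usable_capacity_tb(30))                    # 100.0
```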

Hadoop is Fault Tolerant

▪ Node failure is inevitable


▪ What happens?
– System continues to function
– Master reassigns work to a different node
– Data replication means there is no loss of data
– Nodes which recover rejoin the cluster automatically
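A minimal sketch of why replication means no data loss. The block-to-node placements are illustrative, in the style of the storage example later in this course:

```python
# Each block lives on several nodes; losing one node loses no data as
# long as every block keeps at least one surviving replica.
block_locations = {
    "B1": {"A", "B", "D"},
    "B2": {"B", "D", "E"},
    "B3": {"A", "B", "C"},
}

def survive_failure(failed_node: str) -> dict:
    """Replica sets remaining after `failed_node` dies."""
    return {blk: nodes - {failed_node}
            for blk, nodes in block_locations.items()}

after = survive_failure("B")   # node B fails
assert all(after.values())     # every block still has a live replica
# The master would now re-replicate any under-replicated blocks:
under = sorted(blk for blk, nodes in after.items() if len(nodes) < 3)
print(under)  # ['B1', 'B2', 'B3']
```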

The Hadoop Ecosystem (1)

▪ Many tools have been developed around “Core Hadoop”
– Known as the Hadoop ecosystem
▪ Designed to make Hadoop easier to use or to extend its functionality
▪ All are open source
▪ The ecosystem is growing all the time

The Hadoop Ecosystem (2)

▪ Examples of Hadoop ecosystem projects (all included in CDH):
– Spark: In-memory execution framework
– HBase: NoSQL database built on HDFS
– Hive: SQL processing engine designed for batch workloads
– Impala: SQL query engine designed for BI workloads
– Parquet: Very efficient columnar data storage format
– Sqoop: Data movement to/from RDBMSs
– Kafka: Streaming data ingestion
– Solr: Search platform that enables users to find the data they need
– Hue: Web-based user interface for Hadoop
– Oozie: Workflow scheduler used to manage jobs

Hadoop is Constantly Evolving

▪ New features are frequently added to the Hadoop ecosystem
– Processing frameworks
– Support for new file types
– Performance enhancements
▪ Open source creates an environment which encourages evolution
▪ Examples of recent ecosystem projects
– Spark
– Parquet
– Kafka

The Hadoop Distributed File System (HDFS)

▪ HDFS is the storage layer for Hadoop


▪ A filesystem which can store any type of data
▪ Provides inexpensive and reliable storage for massive amounts of data
– Data is replicated across computers
▪ HDFS performs best with a “modest” number of large files
– Millions, rather than billions, of files
– Each file typically 100 MB or more
▪ Files in HDFS are “write once”
– Appends are permitted
– No random writes are allowed
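The mapping of one large file onto fixed-size blocks can be sketched as follows. The 128 MiB block size matches the HDFS default in recent releases; the 300 MiB file is an invented example:

```python
# Sketch: how a large file maps onto fixed-size HDFS blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the HDFS default block size

def num_blocks(file_size: int) -> int:
    """Blocks needed to store `file_size` bytes (ceiling division)."""
    return max(1, -(-file_size // BLOCK_SIZE))

# A 300 MiB file occupies two full blocks plus a 44 MiB tail.
print(num_blocks(300 * 1024 * 1024))  # 3
```

The last block of a file is usually smaller than the block size; unlike a disk block, it occupies only as much underlying storage as it actually contains.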

HDFS Basic Concepts

▪ HDFS is a filesystem written in Java


▪ Sits on top of a native filesystem
▪ Scalable
▪ Fault tolerant
▪ Supports efficient processing with MapReduce, Spark, and other
frameworks

[Diagram: HDFS running on top of a Hadoop cluster]

How Files are Stored (1)

▪ Data files are split into blocks and distributed to data nodes

[Diagram: a very large data file split into Block 1, Block 2, and Block 3]

How Files are Stored (2)

▪ Data files are split into blocks and distributed to data nodes

[Diagram: the file split into blocks, with Block 1 placed on a data node]

How Files are Stored (3)

▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: 3x replication)

[Diagram: Block 1 stored as three replicas on different data nodes]

How Files are Stored (4)

▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: 3x replication)

[Diagram: all three blocks stored as three replicas each, spread across the data nodes]

How Files are Stored (5)

▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: three-fold replication)
▪ NameNode stores metadata
[Diagram: blocks replicated across the data nodes; the NameNode holds the metadata, i.e. information about files and blocks]
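The NameNode’s bookkeeping can be sketched as two lookup tables, using the file and block placements from the storage example that follows. File contents never pass through the NameNode; it answers only metadata queries:

```python
# The NameNode maps each file to its block list, and each block to the
# data nodes holding its replicas.
file_to_blocks = {
    "/logs/150808.log": ["B1", "B2", "B3"],
    "/logs/150809.log": ["B4", "B5"],
}
block_to_nodes = {
    "B1": ["A", "B", "D"], "B2": ["B", "D", "E"], "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"], "B5": ["C", "E", "D"],
}

def locate(path: str) -> dict:
    """Block-to-replica map a client needs in order to read `path`."""
    return {b: block_to_nodes[b] for b in file_to_blocks[path]}

print(locate("/logs/150809.log"))
# {'B4': ['A', 'B', 'E'], 'B5': ['C', 'E', 'D']}
```

Armed with this answer, a client reads the blocks directly from the data nodes, so reads scale with the cluster rather than bottlenecking on the master.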

Getting Data In and Out of HDFS

▪ Hadoop
– Copies data between client (local) and HDFS (cluster)
– API or command line (put/get)

[Diagram: put and get operations between an HDFS client and the cluster]

▪ Ecosystem Projects
– Flume
– Collects data from network sources (such as websites, system logs)
– Sqoop
– Transfers data between HDFS and RDBMSs
▪ Business Intelligence Tools

RDBMS: Relational Database Management System

Example: Storing and Retrieving Files (1)

[Diagram: two local files, /logs/150808.log and /logs/150809.log, and an HDFS cluster of five data nodes (A through E)]
Example: Storing and Retrieving Files (2)

NameNode metadata:
/logs/150808.log: B1, B2, B3
/logs/150809.log: B4, B5
B1: A, B, D
B2: B, D, E
B3: A, B, C
B4: A, B, E
B5: C, E, D

[Diagram: blocks 1–5 distributed and replicated across data nodes A–E according to the metadata]
Example: Storing and Retrieving Files (3)

[Diagram: as on the previous slide, with a client asking the NameNode for /logs/150809.log; the NameNode replies with the block list B4, B5]

Example: Storing and Retrieving Files (4)

[Diagram: the client receives the block list B4, B5 from the NameNode and reads those blocks directly from the data nodes that hold them]
