Apache Hadoop
What is Apache Hadoop?
© Copyright 2010-2017 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Some Facts About Apache Hadoop
▪ Open source
– Overseen by the Apache Software Foundation
▪ Over 90 active committers to core Hadoop from over 20 companies
– Cloudera, Intel, LinkedIn, Facebook, Yahoo!, and more
▪ Hundreds more committers on other Hadoop-related projects and tools
– Known as the “Hadoop ecosystem”
▪ Many hundreds of other contributors writing features, fixing bugs, and so on
Traditional Large-Scale Computation
Distributed Systems
The Origins of Hadoop
What is Hadoop?
Core Hadoop is a File System and a Processing Framework
Hadoop is Scalable
[Figure: capacity grows linearly with the number of nodes]
Hadoop is Fault Tolerant
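Fault tolerance in HDFS comes largely from block replication, which the following slides walk through. The idea can be sketched in a few lines of Python. This is a conceptual toy, not HDFS code; the node names and round-robin placement are illustrative (the real NameNode uses rack-aware placement):

```python
# Toy model: each block is stored on 3 of 5 data nodes.
# If any single node fails, every block still has a surviving replica.
nodes = ["A", "B", "C", "D", "E"]

# Place 3 replicas of each of 5 blocks on distinct nodes (round-robin
# here; real HDFS placement is rack-aware).
blocks = {f"block{i}": [nodes[(i + k) % len(nodes)] for k in range(3)]
          for i in range(5)}

def survives(failed_node):
    """True if every block still has at least one live replica."""
    return all(any(n != failed_node for n in replicas)
               for replicas in blocks.values())

# Any single node can fail without losing data.
assert all(survives(n) for n in nodes)
```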
The Hadoop Ecosystem (1)
The Hadoop Ecosystem (2)
Hadoop is Constantly Evolving
The Hadoop Distributed File System (HDFS)
HDFS Basic Concepts
[Figure: HDFS running on a Hadoop cluster]
How Files are Stored (1)
▪ Data files are split into blocks and distributed to data nodes
[Figure: a very large data file split into Block 1, Block 2, and Block 3]
How Files are Stored (2)
▪ Data files are split into blocks and distributed to data nodes
[Figure: the very large data file's blocks being distributed; Block 1 is copied to a data node]
How Files are Stored (3)
▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: 3x replication)
[Figure: Block 1 of the very large data file replicated on three data nodes]
How Files are Stored (4)
▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: 3x replication)
[Figure: Blocks 1, 2, and 3 of the very large data file distributed across the data nodes, with three copies of each block]
How Files are Stored (5)
▪ Data files are split into blocks and distributed to data nodes
▪ Each block is replicated on multiple nodes (default: 3x replication)
▪ NameNode stores metadata
[Figure: the replicated blocks of the very large data file across the data nodes; the NameNode stores metadata — information about files and blocks]
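The storage model above can be sketched in a few lines of Python. This is a conceptual toy, not the HDFS implementation; the block size, node names, and round-robin placement are made up for illustration (real HDFS defaults to 128 MB blocks and uses rack-aware placement):

```python
BLOCK_SIZE = 4          # toy value; the HDFS default is 128 MB
REPLICATION = 3         # the default replication factor
data_nodes = ["node1", "node2", "node3", "node4", "node5"]

def store_file(name, data):
    """Split the data into blocks, place REPLICATION copies of each on
    distinct nodes, and return NameNode-style metadata: which blocks
    make up the file and where each block's replicas live."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    metadata = {}
    for b, chunk in enumerate(chunks):
        # round-robin placement onto three distinct nodes
        replicas = [data_nodes[(b + k) % len(data_nodes)]
                    for k in range(REPLICATION)]
        metadata[(name, b)] = {"nodes": replicas, "bytes": chunk}
    return metadata

meta = store_file("verylarge.dat", b"0123456789ab")  # 12 bytes -> 3 blocks
assert len(meta) == 3
assert all(len(set(m["nodes"])) == 3 for m in meta.values())
```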
Getting Data In and Out of HDFS
▪ Hadoop
– Copies data between the client (local) and HDFS (cluster) with put and get
– API or command line
▪ Ecosystem Projects
– Flume: collects data from network sources (such as websites and system logs)
– Sqoop: transfers data between HDFS and RDBMSs
▪ Business Intelligence Tools
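On the command line, the copy steps above are `hdfs dfs -put <local> <hdfs-path>` and `hdfs dfs -get <hdfs-path> <local>`. The put/get round trip itself can be sketched with an in-memory stand-in for the cluster — a toy model, not the HDFS client API:

```python
# In-memory stand-in for an HDFS cluster: path -> list of blocks.
cluster = {}
BLOCK_SIZE = 4  # toy block size; the HDFS default is 128 MB

def put(local_bytes, hdfs_path):
    """Split local data into blocks and hand them to the 'cluster'."""
    cluster[hdfs_path] = [local_bytes[i:i + BLOCK_SIZE]
                          for i in range(0, len(local_bytes), BLOCK_SIZE)]

def get(hdfs_path):
    """Reassemble a file from its blocks."""
    return b"".join(cluster[hdfs_path])

put(b"hello hdfs!", "/logs/150808.log")
assert get("/logs/150808.log") == b"hello hdfs!"  # the round trip is lossless
```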
Example: Storing and Retrieving Files (1)
[Figure: two local files, /logs/150808.log and /logs/150809.log, to be stored in an HDFS cluster of five nodes (A–E)]
Example: Storing and Retrieving Files (2)
[Figure: the two log files split into blocks 1–5; each block is stored, with replicas, across Nodes A–E]
Example: Storing and Retrieving Files (3)
Example: Storing and Retrieving Files (4)
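Retrieval, as in the example slides above, follows the same metadata-first pattern as storage: the client asks the NameNode where each block of a file lives, then reads the blocks directly from the data nodes. A toy sketch of that flow — the block IDs, node names, and log contents here are invented for illustration:

```python
# Toy NameNode metadata: file path -> ordered (block id, replica nodes).
namenode = {"/logs/150808.log": [("blk1", ["A", "B", "D"]),
                                 ("blk2", ["A", "C", "E"])]}
# Toy data-node storage: node -> {block id -> bytes}.
datanodes = {"A": {"blk1": b"2015-08-08 ", "blk2": b"GET /index"},
             "B": {"blk1": b"2015-08-08 "},
             "C": {"blk2": b"GET /index"},
             "D": {"blk1": b"2015-08-08 "},
             "E": {"blk2": b"GET /index"}}

def read_file(path):
    """Look up block locations on the NameNode, then fetch each block
    from the first data node holding a replica."""
    out = b""
    for block_id, locations in namenode[path]:
        out += datanodes[locations[0]][block_id]
    return out

assert read_file("/logs/150808.log") == b"2015-08-08 GET /index"
```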