
Introduction to Hadoop and Spark

Antonino Virgillito


Large-scale Computation
• Traditional solutions for computing over large quantities of data relied mainly on processing power
• Complex processing is performed on data moved into memory
• Scaling is possible only by adding power (more memory, a faster processor)
• This works for relatively small-to-medium amounts of data but cannot keep up with larger datasets
• How to cope with today’s indefinitely growing production of data?
• Terabytes per day
Distributed Computing
• Multiple machines connected to each other, cooperating on a common job
• «Cluster»
• Challenges
• Complexity of coordination – all processes and data have to be kept synchronized with respect to the global system state
• Failures
• Data distribution

Hadoop
• Open source platform for distributed processing
of large datasets
• Based on systems originally developed at Google (MapReduce and the Google File System)
• Functions:
• Distribution of data and processing across machines
• Management of the cluster
• Simplified programming model
• Easy to write distributed algorithms
Hadoop scalability
• Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
• Huge clusters can be built from (cheap) commodity hardware
• The cluster can easily scale up with little or no modification to the programs

Hadoop Concepts
• Applications are written in common high-level
languages
• Inter-node communication is kept to a minimum
• Data is distributed in advance
• Bring the computation close to the data
• Data is replicated for availability and reliability
• Scalability and fault-tolerance

Scalability and Fault-tolerance
• Scalability principle
• Capacity can be increased by adding nodes to the cluster
• Increasing load does not cause failures but in the worst
case only a graceful degradation of performance
• Fault-tolerance
• Node failures are considered inevitable and are coped with in the architecture of the platform
• System continues to function when failure of a node
occurs – tasks are re-scheduled
• Data replication guarantees no data is lost
• Dynamic reconfiguration of the cluster when nodes join
and leave
Benefits of Hadoop
• Previously impossible or impractical analysis
made possible
• Lower cost of hardware
• Less time
• Ask Bigger Questions

Hadoop Components

[Diagram: ecosystem tools – Hive, Pig, Sqoop, HBase, Flume, Mahout, Oozie – layered on top of the Core Components]

Hadoop Core Components
• HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading them over different machines
• MapReduce
• Simple programming model that enables parallel
execution of data processing programs
• Executes the work near the data it processes
• In a nutshell: HDFS places the data on the cluster
and MapReduce does the processing work

Structure of a Hadoop Cluster
• Hadoop Cluster:
• Group of machines working together to store and
process data
• Any number of “worker” nodes
• Run both HDFS and MapReduce components
• Two “Master” nodes
• Name Node: manages HDFS
• Job Tracker: manages MapReduce

Hadoop Principle
• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes
[Diagram: one big dataset (“I’m one big data set”) split into HDFS blocks spread across the Hadoop cluster]
The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase execution: Map and Reduce
• Map: data elements are classified into categories
• Reduce: an algorithm is applied to all the elements of the same category
[Diagram: elements sorted into three categories by Map; Reduce counts each category (x4, x5, x3)]
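To make the two phases concrete, here is a minimal sketch in plain Python (not the Hadoop API; all names are illustrative). It counts words: the map phase classifies every element under a category (the word itself), and the reduce phase applies an algorithm (a sum) to each category.

from collections import defaultdict

lines = ["big data on big clusters", "big jobs"]

# Map phase: classify each element into a category (word -> 1)
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group together the values of each category
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: apply an algorithm (here, a sum) to every category
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 1, 'on': 1, 'clusters': 1, 'jobs': 1}

In Hadoop the shuffle step is performed by the framework, so the programmer only writes the map and reduce functions.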

MapReduce Concepts
• Automatic parallelization and distribution
• Fault-tolerance
• A clean abstraction for programmers
• MapReduce programs are usually written in Java
• Can be written in any language using Hadoop
Streaming
• All of Hadoop is written in Java
• MapReduce abstracts all the ‘housekeeping’ away
from the developer
• Developer can simply concentrate on writing the Map and Reduce functions
MapReduce and Hadoop

• Hadoop MapReduce is logically placed on top of HDFS
[Diagram: the MapReduce layer running on top of HDFS within Hadoop]
MapReduce and Hadoop

• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
• Output is written on HDFS
[Diagram: four nodes, each running MR over its local HDFS blocks]
• Scalability principle: perform the computation where the data is
Hive
• Apache Hive is a high-level abstraction on top of MapReduce
– Uses an SQL-like language called HiveQL
– Generates MapReduce jobs that run on the Hadoop cluster
– Originally developed by Facebook for data warehousing
– Now an open-source Apache project

Overview
• HiveQL queries are transparently mapped into MapReduce jobs at runtime by the Hive execution engine
• The engine also performs optimizations
• Jobs are submitted to the Hadoop cluster

Hive Tables
• Hive works on the abstraction of a table, similar to a table in a relational database
• Main difference: a Hive table is simply a directory in
HDFS, containing one or more files
• By default files are in text format but different
formats can be specified
• The structure and location of the tables are stored in
a backing SQL database called the metastore
• Transparent for the user
• Can be any RDBMS, specified at configuration time
Hive Tables
• At query time, the metastore is consulted to
check if the query is consistent with the tables it
invokes
• The query itself operates on the actual data files
stored in HDFS

Hive Tables
• By default, tables are stored in a warehouse directory
on HDFS
• Default location:
/user/hive/warehouse/<db>/<table>
• Each subdirectory of the warehouse directory is
considered a database
• Each subdirectory of a database directory is a table
• All files in a table directory are considered part of the table when querying
• They must have the same structure (see the example layout below)
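As a hypothetical illustration of the convention above, a database mydb containing a table visits backed by two text files would look like this on HDFS (file names are illustrative; note that recent Hive versions actually add a .db suffix to database directories):

/user/hive/warehouse/mydb/visits/data-0001.txt
/user/hive/warehouse/mydb/visits/data-0002.txt

A query on visits reads both files, which is why they must share the same column structure.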
Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
• Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in a high-level language called Pig Latin
• Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations

Overview of Pig
• Pig Latin
• Language for the definition of data flows
• Grunt
• Interactive shell for typing and executing Pig Latin statements
• Interpreter and execution engine

RHadoop
• Collection of packages that allows integration of R with HDFS and MapReduce
• Hadoop provides the storage while R brings the processing
• Just a library
• Not a special runtime, not a different language, not a special-purpose language
• Incrementally port your code and use all packages
• Requires R installed and configured on all nodes in the cluster

RHadoop Packages
• rhdfs
• Interface for reading and writing files from/to an HDFS cluster
• rmr2
• Interface to MapReduce through R
• rhbase
• Interface to HBase

rhdfs

• As Hadoop MapReduce programs use HDFS for taking their input and writing their output, it is necessary to access HDFS from the R console
• The R programmer can easily perform read and write operations on distributed data files
• Basically, the rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS

rmr2
• rmr2 is an R interface that provides the Hadoop MapReduce facility inside the R environment
• The R programmer just needs to divide the application logic into the map and reduce phases and submit it with the rmr2 methods
• rmr2 then calls the Hadoop Streaming MapReduce API with several job parameters (input directory, output directory, mapper, reducer, and so on) to perform the R MapReduce job over the Hadoop cluster

mapreduce
• The mapreduce function takes as input a set of named
parameters
• input: input path or variable
• input.format: specification of input format
• output: output path or variable
• map: map function
• reduce: reduce function
• The map and reduce functions present the usual interface
• A call to keyval(k,v) inside the map and reduce functions is used to emit intermediate and output key-value pairs, respectively

WordCount in R
wordcount =
  function(input, output = NULL, pattern = " "){

    # Map: split each line into words and emit a (word, 1) pair per word
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}

    # Reduce: sum the counts collected for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}

    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = T)}

Reading delimited data
# Custom record reader: parses one line of tab-separated input into a key
# (the first field) and values (the remaining fields)
tsv.reader = function(con, nrecs){
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(
      sapply(delim, function(x) x[1]),
      sapply(delim, function(x) x[-1]))}}

# tsv.data and tsv.format are assumed defined elsewhere; in rmr2 the reader
# would typically be wrapped with make.input.format(format = tsv.reader, mode = "text")
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) keyval(v[1,], 1),
  reduce = function(k, vv) keyval(k, sum(vv)))

Reading named columns
# Variant of the reader that returns the value fields as a named data frame
tsv.reader = function(con, nrecs){
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(sapply(delim, function(x) x[1]),
      data.frame(
        location = sapply(delim, function(x) x[2]),
        name = sapply(delim, function(x) x[3]),
        value = sapply(delim, function(x) x[4])))}}

# Columns can then be referenced by name inside the map function
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) {
    filter = (v$name == "blarg")
    keyval(k[filter], log(as.numeric(v$value[filter])))},
  reduce = function(k, vv) keyval(k, mean(vv)))

Apache Spark
• A general-purpose framework for big data processing
• It interfaces with many distributed file systems, such as HDFS (Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others
• Up to 100 times faster than Hadoop MapReduce for in-memory computation
Multilanguage API
• You can write applications in various languages
• Java
• Python
• Scala
• R

• In the context of this course we will consider Python

Built-in Libraries

[Figure: Spark’s built-in libraries, e.g. Spark SQL, Spark Streaming, MLlib, GraphX]
RDD - Resilient Distributed Dataset

• The RDD is the core abstraction used by Spark to work on data
• An RDD is a collection of elements partitioned across the cluster nodes; Spark operates on them in parallel
• An RDD is typically created from a file on the Hadoop filesystem, or from an existing collection in the driver program (see the sketch below)
• RDDs can be made persistent in memory
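A minimal sketch in Python, assuming a SparkContext sc is available (e.g. obtained from pyspark; the file path is hypothetical):

# From a file on HDFS: each line becomes an element of the RDD
lines = sc.textFile("hdfs:///data/input.txt")

# From an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Keep the RDD in memory across operations
numbers.cache()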

Transformations

• A transformation creates a new RDD from an existing one
• For example, map is a transformation that takes all elements of the dataset, passes them to a function and returns another RDD with the results

resultRDD = originalRDD.map(myFunction)
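As a concrete sketch, reusing the hypothetical numbers RDD from the previous example; transformations are lazy, so nothing is computed until an action is called:

squares = numbers.map(lambda x: x * x)        # new RDD, not computed yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformations can be chained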

Actions

• An action returns a value to the driver program after running a computation on the RDD
• For example, reduce is an action: it aggregates all elements of the RDD using a function and returns the result to the driver program

result = rdd.reduce(function)
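Continuing the sketch above (names are illustrative):

total = numbers.reduce(lambda a, b: a + b)  # action: returns 15 to the driver
print(squares.collect())                    # collect is another action: [1, 4, 9, 16, 25]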

SparkSQL and DataFrames

• SparkSQL is the Spark module for structured data processing
• The DataFrame API is one of the ways to interact with SparkSQL

DataFrames
• A DataFrame is a collection of data organized into named columns
• Similar to tables in relational databases
• Can be created from various sources: structured data files, Hive tables, external databases, CSV files, etc. (see the sketch below)
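A minimal sketch, assuming a SparkSession named spark is available and a hypothetical CSV file people.csv:

df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.show()

The operations shown on the next slide can then be applied to df.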

Example operations on DataFrames
• To show the content of the DataFrame
• df.show()
• To print the Schema of the DataFrame
• df.printSchema()
• To select a column
• df.select('columnName').show()
• To filter by some parameter
• df.filter(df['columnName'] > N).show()

