
Introduction to Hadoop and Spark

Antonino Virgillito


Large-scale Computation
• Traditional solutions for computing over large quantities of data relied mainly on processing power
• Complex processing is performed on data moved into memory
• Scaling is possible only by adding power (more memory, a faster processor)
• This works for relatively small-to-medium amounts of data but cannot keep up with larger datasets
• How to cope with today’s indefinitely growing production of data?
• Terabytes per day
Distributed Computing
• Multiple machines connected to each other, cooperating on a common job
• «Cluster»
• Challenges
• Complexity of coordination – all processes and data have to be kept synchronized with respect to the global system state
• Failures
• Data distribution

Hadoop
• Open source platform for distributed processing
of large datasets
• Based on systems originally developed at Google (MapReduce and the Google File System)
• Functions:
• Distribution of data and processing across machines
• Management of the cluster
• Simplified programming model
• Easy to write distributed algorithms
Hadoop scalability
• Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
• Huge clusters can be built from (cheap) commodity hardware
• The cluster can easily scale up with little or no modification to the programs

Hadoop Concepts
• Applications are written in common high-level
languages
• Inter-node communication is kept to a minimum
• Data is distributed in advance
• Bring the computation close to the data
• Data is replicated for availability and reliability
• Scalability and fault-tolerance

Scalability and Fault-tolerance
• Scalability principle
• Capacity can be increased by adding nodes to the cluster
• Increasing load does not cause failures but in the worst
case only a graceful degradation of performance
• Fault-tolerance
• Node failures are considered inevitable and are coped with in the architecture of the platform
• System continues to function when failure of a node
occurs – tasks are re-scheduled
• Data replication guarantees no data is lost
• Dynamic reconfiguration of the cluster when nodes join
and leave
Benefits of Hadoop
• Previously impossible or impractical analysis
made possible
• Lower cost of hardware
• Less time
• Ask Bigger Questions

Hadoop Components

[Diagram: ecosystem tools – Hive, Pig, Sqoop, HBase, Flume, Mahout, Oozie – layered on top of the Core Components]

Hadoop Core Components
• HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading them over different machines
• MapReduce
• Simple programming model that enables parallel
execution of data processing programs
• Executes the work near the data it processes
• In a nutshell: HDFS places the data on the cluster
and MapReduce does the processing work

Structure of a Hadoop Cluster
• Hadoop Cluster:
• Group of machines working together to store and
process data
• Any number of “worker” nodes
• Run both HDFS and MapReduce components
• Two “Master” nodes
• Name Node: manages HDFS
• Job Tracker: manages MapReduce

Hadoop Principle
• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes
[Diagram: one big dataset (“I’m one big data set”) split into HDFS blocks spread across the Hadoop cluster]
The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase execution: Map and Reduce
• Map: data elements are classified into categories
• Reduce: an algorithm is applied to all the elements of the same category
[Diagram: elements sorted into three categories by Map; Reduce counts each category (x4, x5, x3)]
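To make the two phases concrete, here is a minimal sketch in plain Python (not the Hadoop API; all names are illustrative). It counts words: the map phase classifies every element under a category (the word itself), and the reduce phase applies an algorithm (a sum) to each category.

from collections import defaultdict

lines = ["big data on big clusters", "big jobs"]

# Map phase: classify each element into a category (word -> 1)
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group together the values of each category
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: apply an algorithm (here, a sum) to every category
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 1, 'on': 1, 'clusters': 1, 'jobs': 1}

In Hadoop the shuffle step is performed by the framework, so the programmer only writes the map and reduce functions.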

MapReduce Concepts
• Automatic parallelization and distribution
• Fault-tolerance
• A clean abstraction for programmers
• MapReduce programs are usually written in Java
• Can be written in any language using Hadoop
Streaming
• All of Hadoop is written in Java
• MapReduce abstracts all the ‘housekeeping’ away
from the developer
• Developer can simply concentrate on writing the Map and Reduce functions
MapReduce and Hadoop

• Hadoop MapReduce is logically placed on top of HDFS
[Diagram: the MapReduce layer running on top of HDFS within Hadoop]
MapReduce and Hadoop

• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
• Output is written on HDFS
[Diagram: four nodes, each running MR over its local HDFS blocks]
• Scalability principle: perform the computation where the data is
Hive
• Apache Hive is a high-level abstraction on top of MapReduce
– Uses an SQL-like language called HiveQL
– Generates MapReduce jobs that run on the Hadoop cluster
– Originally developed by Facebook for data warehousing
– Now an open-source Apache project

Overview
• HiveQL queries are transparently mapped into MapReduce jobs at runtime by the Hive execution engine
• The engine also performs optimizations
• Jobs are submitted to the Hadoop cluster

Hive Tables
• Hive works on the abstraction of a table, similar to a table in a relational database
• Main difference: a Hive table is simply a directory in
HDFS, containing one or more files
• By default files are in text format but different
formats can be specified
• The structure and location of the tables are stored in
a backing SQL database called the metastore
• Transparent for the user
• Can be any RDBMS, specified at configuration time
Hive Tables
• At query time, the metastore is consulted to
check if the query is consistent with the tables it
invokes
• The query itself operates on the actual data files
stored in HDFS

Hive Tables
• By default, tables are stored in a warehouse directory
on HDFS
• Default location:
/user/hive/warehouse/<db>/<table>
• Each subdirectory of the warehouse directory is
considered a database
• Each subdirectory of a database directory is a table
• All files in a table directory are considered part of the table when querying
• They must have the same structure (see the example layout below)
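As a hypothetical illustration of the convention above, a database mydb containing a table visits backed by two text files would look like this on HDFS (file names are illustrative; note that recent Hive versions actually add a .db suffix to database directories):

/user/hive/warehouse/mydb/visits/data-0001.txt
/user/hive/warehouse/mydb/visits/data-0002.txt

A query on visits reads both files, which is why they must share the same column structure.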
Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
• Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in a high-level language called Pig Latin
• Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations

Overview of Pig
• Pig Latin
• Language for the definition of data flows
• Grunt
• Interactive shell for typing and executing Pig Latin statements
• Interpreter and execution engine

RHadoop
• Collection of packages that allows integration of R with HDFS and MapReduce
• Hadoop provides the storage while R brings the processing
• Just a library
• Not a special runtime, not a different language, not a special-purpose language
• Incrementally port your code and use all packages
• Requires R installed and configured on all nodes in the cluster

RHadoop Packages
• rhdfs
• Interface for reading and writing files from/to an HDFS cluster
• rmr2
• Interface to MapReduce through R
• rhbase
• Interface to HBase

rhdfs

• As Hadoop MapReduce programs use HDFS for taking their input and writing their output, it is necessary to access HDFS from the R console
• The R programmer can easily perform read and write operations on distributed data files
• Basically, the rhdfs package calls the HDFS API in the backend to operate on data sources stored on HDFS

rmr2
• rmr2 is an R interface that provides the Hadoop MapReduce facility inside the R environment
• The R programmer just needs to divide the application logic into the map and reduce phases and submit it with the rmr2 methods
• rmr2 then calls the Hadoop Streaming MapReduce API with several job parameters (input directory, output directory, mapper, reducer, and so on) to perform the R MapReduce job over the Hadoop cluster

mapreduce
• The mapreduce function takes as input a set of named
parameters
• input: input path or variable
• input.format: specification of input format
• output: output path or variable
• map: map function
• reduce: reduce function
• The map and reduce functions present the usual interface
• A call to keyval(k,v) inside the map and reduce functions is used to emit intermediate and output key-value pairs, respectively

WordCount in R
wordcount =
  function(input, output = NULL, pattern = " "){

    # Map: split each line into words and emit a (word, 1) pair per word
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}

    # Reduce: sum the counts collected for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}

    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = T)}

Reading delimited data
# Custom record reader: parses one line of tab-separated input into a key
# (the first field) and values (the remaining fields)
tsv.reader = function(con, nrecs){
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(
      sapply(delim, function(x) x[1]),
      sapply(delim, function(x) x[-1]))}}

# tsv.data and tsv.format are assumed defined elsewhere; in rmr2 the reader
# would typically be wrapped with make.input.format(format = tsv.reader, mode = "text")
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) keyval(v[1,], 1),
  reduce = function(k, vv) keyval(k, sum(vv)))

Reading named columns
# Variant of the reader that returns the value fields as a named data frame
tsv.reader = function(con, nrecs){
  lines = readLines(con, 1)
  if(length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = "\t")
    keyval(sapply(delim, function(x) x[1]),
      data.frame(
        location = sapply(delim, function(x) x[2]),
        name = sapply(delim, function(x) x[3]),
        value = sapply(delim, function(x) x[4])))}}

# Columns can then be referenced by name inside the map function
freq.counts = mapreduce(
  input = tsv.data,
  input.format = tsv.format,
  map = function(k, v) {
    filter = (v$name == "blarg")
    keyval(k[filter], log(as.numeric(v$value[filter])))},
  reduce = function(k, vv) keyval(k, mean(vv)))

Apache Spark
• A general-purpose framework for big data processing
• It interfaces with many distributed file systems, such as HDFS (Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others
• Up to 100 times faster than Hadoop MapReduce for in-memory computation
Multilanguage API
• You can write applications in various languages
• Java
• Python
• Scala
• R

• In the context of this course we will consider Python

Built-in Libraries

[Figure: Spark’s built-in libraries, e.g. Spark SQL, Spark Streaming, MLlib, GraphX]
RDD - Resilient Distributed Dataset

• The RDD is the core abstraction used by Spark to work on data
• An RDD is a collection of elements partitioned across the cluster nodes; Spark operates on them in parallel
• An RDD is typically created from a file on the Hadoop filesystem, or from an existing collection in the driver program (see the sketch below)
• RDDs can be made persistent in memory
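A minimal sketch in Python, assuming a SparkContext sc is available (e.g. obtained from pyspark; the file path is hypothetical):

# From a file on HDFS: each line becomes an element of the RDD
lines = sc.textFile("hdfs:///data/input.txt")

# From an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Keep the RDD in memory across operations
numbers.cache()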

Transformations

• A transformation creates a new RDD from an existing one
• For example, map is a transformation that takes all elements of the dataset, passes them to a function and returns another RDD with the results

resultRDD = originalRDD.map(myFunction)
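As a concrete sketch, reusing the hypothetical numbers RDD from the previous example; transformations are lazy, so nothing is computed until an action is called:

squares = numbers.map(lambda x: x * x)        # new RDD, not computed yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformations can be chained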

Actions

• An action returns a value to the driver program after running a computation on the RDD
• For example, reduce is an action: it aggregates all elements of the RDD using a function and returns the result to the driver program

result = rdd.reduce(function)
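Continuing the sketch above (names are illustrative):

total = numbers.reduce(lambda a, b: a + b)  # action: returns 15 to the driver
print(squares.collect())                    # collect is another action: [1, 4, 9, 16, 25]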

SparkSQL and DataFrames

• SparkSQL is the Spark module for structured data processing
• The DataFrame API is one of the ways to interact with SparkSQL

DataFrames
• A DataFrame is a collection of data organized into named columns
• Similar to tables in relational databases
• Can be created from various sources: structured data files, Hive tables, external databases, CSV files, etc. (see the sketch below)
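A minimal sketch, assuming a SparkSession named spark is available and a hypothetical CSV file people.csv:

df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.show()

The operations shown on the next slide can then be applied to df.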

Example operations on DataFrames
• To show the content of the DataFrame
• df.show()
• To print the Schema of the DataFrame
• df.printSchema()
• To select a column
• df.select('columnName').show()
• To filter by some parameter
• df.filter(df['columnName'] > N).show()

