Antonino Virgillito
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat
Large-scale Computation
• Traditional solutions for computing large
quantities of data relied mainly on processor power
• Complex processing is performed on data moved
into memory
• Scaling is possible only by adding power (more
memory, a faster processor)
• This works for relatively small-to-medium amounts
of data but cannot keep up with larger datasets
• How to cope with today's indefinitely growing
production of data?
• Terabytes per day
Distributed Computing
• Multiple machines connected among each other
and cooperating for a common job
• «Cluster»
• Challenges
• Complexity of coordination – all processes and
data have to be kept synchronized with the
global system state
• Failures
• Data distribution
Hadoop
• Open source platform for distributed processing
of large datasets
• Based on a project developed at Google
• Functions:
• Distribution of data and processing across
machines
• Management of the cluster
Hadoop Concepts
• Applications are written in common high-level
languages
• Inter-node communication is limited to the
minimum
• Data is distributed in advance
• Bring the computation close to the data
• Data is replicated for availability and reliability
• Scalability and fault-tolerance
Scalability and Fault-tolerance
• Scalability principle
• Capacity can be increased by adding nodes to the cluster
• Increasing load does not cause failures but in the worst
case only a graceful degradation of performance
• Fault-tolerance
• Failures of nodes are considered inevitable and are
handled by the architecture of the platform
• System continues to function when failure of a node
occurs – tasks are re-scheduled
• Data replication guarantees no data is lost
• Dynamic reconfiguration of the cluster when nodes join
and leave
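The re-scheduling behavior described above can be sketched in plain Python (a toy simulation, not Hadoop code; node names and the failure probability are made up for illustration):

```python
import random

def run_with_rescheduling(tasks, nodes, fail_prob=0.3, seed=42):
    """Toy scheduler: if a node 'fails' while running a task,
    the task is simply re-scheduled on another node, so every
    task eventually completes despite individual node failures."""
    rng = random.Random(seed)
    results = {}
    for task in tasks:
        done = False
        while not done:
            node = rng.choice(nodes)
            if rng.random() < fail_prob:
                # simulated node failure: re-schedule the task
                continue
            results[task] = (node, task * task)  # pretend the task squares its input
            done = True
    return results

out = run_with_rescheduling([1, 2, 3], ["node-a", "node-b", "node-c"])
# all three tasks complete even though some attempts "failed"
```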
Benefits of Hadoop
• Previously impossible or impractical analysis
made possible
• Lower cost of hardware
• Less time
• Ask Bigger Questions
Hadoop Components
Core Components
Hadoop Core Components
• HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently
spreading them over different machines
• MapReduce
• Simple programming model that enables parallel
execution of data processing programs
• Executes the work on the data near the data
• In a nutshell: HDFS places the data on the cluster
and MapReduce does the processing work
Structure of a Hadoop Cluster
• Hadoop Cluster:
• Group of machines working together to store and
process data
• Any number of “worker” nodes
• Run both HDFS and MapReduce components
• Two “Master” nodes
• Name Node: manages HDFS
• Job Tracker: manages MapReduce
Hadoop Principle
• Hadoop is basically a middleware platform that
manages a cluster of machines
• The core component is a distributed file system
(HDFS)
• The cluster can grow indefinitely simply by adding
new nodes
[Figure: a dataset ("I'm one big data set") stored across the HDFS cluster]
The MapReduce Paradigm
• Parallel processing paradigm
• Programmer is unaware of parallelism
• Programs are structured into a two-phase
execution
• Map: data elements are classified into categories
• Reduce: an algorithm is applied to all the elements
of the same category
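The two-phase structure above can be illustrated with a word count in plain Python (a sketch of the paradigm, not Hadoop's API; the shuffle step is what the framework does between the phases):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: classify each data element into a category (word -> 1)
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all values of the same category (done by the framework)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: apply an algorithm to all elements of the same category
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big job"]
result = reduce_phase(shuffle(map_phase(lines)))
# result == {"big": 3, "data": 1, "cluster": 1, "job": 1}
```

In real MapReduce, each map and reduce call runs on a different node; the programmer writes only the two functions.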
MapReduce Concepts
• Automatic parallelization and distribution
• Fault-tolerance
• A clean abstraction for programmers
• MapReduce programs are usually written in Java
• Can be written in any language using Hadoop
Streaming
• All of Hadoop is written in Java
• MapReduce abstracts all the ‘housekeeping’ away
from the developer
• Developer can simply concentrate on writing the
Map and Reduce functions
MapReduce and Hadoop
[Figure: MapReduce runs on top of HDFS]
• Output is written on HDFS
• Scalability principle: perform the computation
where the data is
Hive
• Apache Hive is a high-level abstraction on
top of MapReduce
– Uses an SQL-like language called HiveQL
– Generates MapReduce jobs that run on the
Hadoop cluster
– Originally developed by Facebook for data
warehousing
– Now an open-source Apache project
Overview
• HiveQL queries are transparently mapped into
MapReduce jobs at runtime by the Hive execution
engine
• The execution engine also performs optimizations
• Jobs are submitted to the Hadoop cluster
Hive Tables
• Hive works on the abstraction of table, similar to a
table in a relational database
• Main difference: a Hive table is simply a directory in
HDFS, containing one or more files
• By default files are in text format but different
formats can be specified
• The structure and location of the tables are stored in
a backing SQL database called the metastore
• Transparent for the user
• Can be any RDBMS, specified at configuration time
Hive Tables
• At query time, the metastore is consulted to
check if the query is consistent with the tables it
invokes
• The query itself operates on the actual data files
stored in HDFS
Hive Tables
• By default, tables are stored in a warehouse directory
on HDFS
• Default location:
/user/hive/warehouse/<db>/<table>
• Each subdirectory of the warehouse directory is
considered a database
• Each subdirectory of a database directory is a table
• All files in a table directory are considered part of the
table when querying
• Must have the same structure
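The warehouse convention above can be made concrete with a small path-parsing sketch in Python (the database and table names below are hypothetical, chosen only for illustration):

```python
from pathlib import PurePosixPath

WAREHOUSE = PurePosixPath("/user/hive/warehouse")

def db_and_table(path):
    """Resolve a file path under the warehouse directory to its
    (database, table) pair, following the <db>/<table>/<file>
    convention described above."""
    rel = PurePosixPath(path).relative_to(WAREHOUSE)
    return rel.parts[0], rel.parts[1]

print(db_and_table("/user/hive/warehouse/sales/orders/part-00000"))
# -> ('sales', 'orders')
```

All files under the `orders` directory would be read as rows of the same table at query time.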
Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
• Yahoo! estimates that 50% of their Hadoop workload on
their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in
a high-level language called Pig Latin
• Interpreted language: scripts are translated into
MapReduce jobs
• Mainly targeted at joins and aggregations
Overview of Pig
• Pig Latin
• Language for the definition of data flows
• Grunt
• Interactive shell for typing and executing Pig Latin
statements
• Interpreter and execution engine
RHadoop
• Collection of packages that allows integration of R with
HDFS and MapReduce
• Hadoop provides the storage while R brings the
processing
• Just a library
• Not a special run-time, not a different language, not a
special-purpose language
• Incrementally port your code and use all packages
• Requires R installed and configured on all nodes in the
cluster
RHadoop Packages
• rhdfs
• Interface for reading and writing files from/to an
HDFS cluster
• rmr2
• Interface to MapReduce through R
• rhbase
• Interface to HBase
rhdfs
rmr2
• rmr2 is an R interface that provides Hadoop
MapReduce functionality inside the R environment
• The R programmer only needs to divide the
application logic into the map and reduce phases and
submit it with the rmr2 methods
• rmr2 then calls the Hadoop Streaming MapReduce
API with job parameters such as input directory, output
directory, mapper, and reducer to run the R MapReduce
job over the Hadoop cluster
mapreduce
• The mapreduce function takes as input a set of named
parameters
• input: input path or variable
• input.format: specification of input format
• output: output path or variable
• map: map function
• reduce: reduce function
• The map and reduce functions have the usual interface
• A call to keyval(k,v) inside the map and reduce
functions emits intermediate and output key-value
pairs, respectively
WordCount in R
wordcount =
  function(
    input,
    output = NULL,
    pattern = " "){

    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}

    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}

    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = T)}
Reading delimited data
tsv.reader = function(con, nrecs){
lines = readLines(con, 1)
if(length(lines) == 0)
NULL
else {
delim = strsplit(lines, split = "\t")
keyval(
sapply(delim,function(x) x[1]),
sapply(delim,function(x) x[-1]))}}
freq.counts = mapreduce(
input = tsv.data,
input.format = tsv.format,
map = function(k, v) keyval(v[1,], 1),
reduce = function(k, vv) keyval(k, sum(vv)))
Reading named columns
tsv.reader = function(con, nrecs){
lines = readLines(con, 1)
if(length(lines) == 0)
NULL
else {
delim = strsplit(lines, split = "\t")
keyval(sapply(delim, function(x) x[1]),
data.frame(
location = sapply(delim, function(x) x[2]),
name = sapply(delim, function(x) x[3]),
value = sapply(delim, function(x) x[4])))}}
freq.counts = mapreduce(
input = tsv.data,
input.format = tsv.format,
map = function(k, v) {
filter = (v$name == "blarg")
keyval(k[filter], log(as.numeric(v$value[filter])))},
reduce = function(k, vv) keyval(k, mean(vv)))
Apache Spark
• A general-purpose framework for big data
processing
Built-in Libraries
RDD - Resilient Distributed Dataset
Transformations
resultRDD = originalRDD.map(myFunction)
Actions
result = rdd.reduce(function)
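The difference between transformations and actions can be illustrated in plain Python (a sketch of the lazy-vs-eager idea, not Spark's API): a generator describes the computation like a transformed RDD, and nothing runs until an action-like call consumes it.

```python
from functools import reduce

original = range(5)                      # stand-in for originalRDD
transformed = (x * x for x in original)  # lazy, like originalRDD.map(myFunction)

# The action triggers execution and returns a value to the driver
result = reduce(lambda a, b: a + b, transformed)  # like rdd.reduce(function)
# result == 30  (0 + 1 + 4 + 9 + 16)
```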
SparkSQL and DataFrames
DataFrames
• A DataFrame is a distributed collection of data
organized into named columns
Example operations on DataFrames
• To show the content of the DataFrame
• df.show()
• To print the Schema of the DataFrame
• df.printSchema()
• To select a column
• df.select('columnName').show()
• To filter by some parameter
• df.filter(df['columnName'] > N).show()
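The select/filter/show operations above can be mimicked with a tiny pure-Python stand-in (a toy class for illustration only, not Spark; column and row values are made up):

```python
class ToyFrame:
    """Minimal DataFrame stand-in: rows are dicts sharing the same columns."""
    def __init__(self, rows):
        self.rows = rows

    def select(self, *cols):
        # Keep only the requested columns, like df.select(...)
        return ToyFrame([{c: r[c] for c in cols} for r in self.rows])

    def filter(self, predicate):
        # Keep only rows matching the predicate, like df.filter(...)
        return ToyFrame([r for r in self.rows if predicate(r)])

    def show(self):
        for row in self.rows:
            print(row)

df = ToyFrame([{"name": "a", "value": 1}, {"name": "b", "value": 5}])
df.select("name").show()
df.filter(lambda r: r["value"] > 3).show()
```

In Spark the same calls build a query plan that is optimized and executed across the cluster rather than row by row in the driver.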