Hadoop Ecosystem Masterclass Slides
Source: http://blog.linkedin.com/2016/01/12/the-25-skills-that-can-get-you-hired-in-2016/
Source: https://www.quora.com/What-is-the-average-salary-package-for-a-Hadoop-admin
Learn Big Data
Source: http://www.indeed.com/salary/q-Hadoop-Engineer-l-San-Francisco,-CA.html
• In the past, I have delivered the official Hortonworks and MapR Hadoop
trainings to S&P 500 companies in the UK, the US, and in Asia
• Using my expertise, I created this brand new online Hadoop course, fit for
anyone who wants to get a job in the Big Data field, or for anyone who just
wants to know more about Hadoop
Course Layout
• What is Big Data and Hadoop: What is Big Data, What is Hadoop, What is Data Science
• Introduction to Hadoop: Manual / Automated Ambari install, Introduction to HDFS, First MapReduce Program: WordCount, Ambari Blueprints
• The Hadoop Ecosystem: ETL with MR, Pig, and Spark; SQL with Hive; Kafka; Phoenix
• Security: Kerberos intro, LDAP intro, SPNEGO, HDFS Transparent Encryption
• Advanced Topics: Yarn Schedulers, Queues, Sizing; Hive Query Optimization; Spark Optimization
To summarize
• You get lifetime access to over 6 hours of video
Introduction
Course objective
• To understand concepts like Big Data, Data Analytics, and Data Science
Practice
• A lot of slides provide practical information on how to do things
• In demo lectures, I show you how to set up and manage your own cluster,
which you should also try out yourself
• I will keep my git repository up to date with new and extra information
(http://github.com/wardviaene/hadoop-ops-course)
• If you have problems trying out something yourself, you can use the
support channels to get help
• We also have a Facebook group called Learn Big Data: The Hadoop
Ecosystem Masterclass
• You can scan the following barcode or use the link in the document after
this introduction movie
• Data Science is getting even more important on bigger data sets, where traditional
analytics are pretty limited
• In practice, our Big Data Solution not only needs to store and process data, it also needs
the capabilities to iteratively run models on our data and provide us with answers to our
business questions
Data Scientist
• A Data Scientist combines domain knowledge, math & algorithms (Machine Learning), and data cleaning with a hacker mindset
What is Hadoop
What is Hadoop?
• Hadoop is software to reliably and scalably store and process Big Data
What is Hadoop?
• HDFS, the Hadoop Distributed File
System, stores the data within the
cluster
• Yarn (Yet Another Resource Negotiator) is the interface you submit your
applications to; it manages cluster
resources like CPU and memory for you
• Applications can be submitted using
any of the processing engines built
on top of Hadoop: MapReduce, Hive, Spark, and a lot more
• One of the simplest distributed
applications you can write is counting
the words of a text file stored in HDFS
using the MapReduce framework
What is Hadoop?
• Hadoop is a distributed system
• The data nodes / worker nodes (in green) can scale horizontally
• The master nodes, which are often virtualized, run the Namenode, the Yarn Resource Manager,
the Standby Namenode, the Standby Yarn Resource Manager, and other master services
What is Hadoop?
• Hadoop is very different from a single-machine system
• Developers will have to change the way they develop to be able to write
applications for Hadoop
• Hadoop will automatically recover from a node failure. This means that
code within the application possibly gets executed twice.
What is Hadoop?
• Let’s compare the cost per
gigabyte of a traditional
database and Hadoop
• Traditional databases (in green)
that cannot scale horizontally will
initially have a lower cost per GB
Hadoop Distributions
Distribution: Apache Hadoop | Hortonworks | Cloudera | MapR
Frontend: Apache Ambari | Apache Ambari | Cloudera Manager | MapR Control System
Source: hortonworks.com
Installation
HDP Installation
• How do you get started with Hadoop?
HDP Install
• There are different HDP installs:
• Manual install
• Unattended install
Our setup
• To install a real cluster you need a minimum of 3 nodes
• We will run the nodes on your laptop / desktop using Vagrant
• https://www.virtualbox.org/
• https://www.vagrantup.com/
• https://github.com/wardviaene/hadoop-ops-course/archive/master.zip
• Download these files and put them somewhere you can remember,
e.g. C:\hadoop on Windows or in your home directory on MacOSX/Linux
What is Vagrant
• Vagrant can create and configure virtual development environments
• Lightweight
• Reproducible
• Portable
• Disposable environments
Getting started
• Vagrant will come out of the box with support for VirtualBox
• To launch the 3 node cluster, go to the folder where you downloaded it and
type:
$ vagrant up node1 node2 node3
• To remove the machines and the data (when you don’t need it anymore)
$ vagrant destroy
Vagrantfile
• A project has a Vagrantfile with the configuration
• It describes the kind of machine and resources needed for the project to
run and what software to install
Installation
• Ambari-server on node1 will install and configure all the nodes (services such as the NodeManager on the worker nodes)
Demo
The Hadoop Sandbox
Demo
Installing Hadoop using Ambari
HDFS
• Hadoop uploads: the Hadoop Client runs hadoop fs -put and first contacts the NameNode, which keeps the file metadata
• The NameNode returns a list of IP addresses of DataNodes to write to
• The big file (bigfile.csv) is split into blocks of 128 MB; the last block only holds the remaining 100 MB
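A minimal example of the upload above, using the file name and 128 MB block size from the slide (the target path is illustrative):
$ hadoop fs -put bigfile.csv /user/vagrant/bigfile.csv
A 484 MB file would then be stored as 3 full blocks of 128 MB plus one final block of 100 MB (3 × 128 + 100 = 484), spread over the DataNodes.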
Datanode communications
• The DataNodes send a blockreport and a heartbeat to the NameNode, which keeps the file metadata
Demo
HDFS put demo
MapReduce WordCount
• The input text (e.g. "We the People of the United States, in Order to form a more perfect Union, …") is read block by block on the DataNodes
• Each Map task (Map 1, 2, 3) emits a key/value pair per word, e.g. (We, 1), (the, 1), (People, 1), (of, 1)
• The Shuffle & Sort phase groups the values per key, e.g. (We, (1,1)), (the, (1,1))
• The Reducers sum the values per key, e.g. (We, 2), (the, 2), (People, 1), (of, 1), and write the result to HDFS
Demo
MapReduce WordCount demo
Lines that span blocks
(Diagram: Map 1 on DataNode 1 and Map 2 on DataNode 2, showing a line of input that spans two blocks)
ResourceManager (RM)
• The client submits an application (e.g. yarn jar examples.jar) to the ResourceManager, which consists of a Scheduler and an ApplicationsManager
• An Application Master is started on one of the NodeManagers (NM 1-5); it runs the application's tasks (e.g. Map and Reduce tasks, or a Spark job) in containers on the NodeManagers
Demo
Yarn demo
Ambari
Ambari Management
• Ambari can be used for cluster management
• Ambari can upgrade the HDP stack when a new version comes out
• Ambari Views allow developers to see data in HDFS and submit queries
Ambari API
• Ambari has a REST API that can be used
• Some features, like Ambari blueprints are not implemented in the web
interface yet (as of early 2016)
Ambari Blueprints
• Ambari Blueprints allow you to restore a cluster configuration from a
JSON file
• To create a blueprint, you can issue an API call to a running cluster to get
a snapshot of its current configuration:
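The slide's API call was lost in extraction; a sketch of such a call (the Ambari host, credentials, and cluster name are examples):
$ curl -u admin:admin -H 'X-Requested-By: ambari' \
  'http://node1.example.com:8080/api/v1/clusters/mycluster?format=blueprint' > blueprint.json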
Ambari Blueprints
• To execute an unattended install, you can send a blueprint using the API
• This is done after installing ambari-server, but before the installation wizard itself
• You can take a blueprint from your cluster, or you can take one from my github
repository
Ambari Blueprints
• To create a new cluster, you need a Cluster Creation template, to map
services to different nodes
• This command will send the template and start creating the cluster:
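The commands themselves did not survive extraction; a sketch of registering a blueprint and then sending the cluster creation template via the Ambari REST API (names and files are examples):
$ curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d @blueprint.json http://node1.example.com:8080/api/v1/blueprints/myblueprint
$ curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d @cluster-template.json http://node1.example.com:8080/api/v1/clusters/mycluster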
Demo
Using Ambari & Ambari blueprint demo
The Hadoop Ecosystem
ETL Tools
• The Hadoop ecosystem has a lot of applications which
can be used for different purposes
• Pig: enables data analysis using a high-level programming
language called Pig Latin
• Spark: developers can write their code in Scala/Java/Python;
it is now the most popular Apache project
• Both run on Yarn, on top of HDFS
Pig Architecture
example.pig (Pig Latin)
http://pig.apache.org
csv = LOAD 'test.csv' USING PigStorage(',') AS (firstname:chararray, lastname:chararray);
csv_john = FILTER csv BY firstname == 'John';
csv_john_limit = LIMIT csv_john 3;
DUMP csv_john_limit;
• Pig parses, compiles, plans, and optimizes the script, and runs it on Yarn over HDFS
Demo
Pig demo
Spark
Apache Spark Architecture
• Application Submission: a Spark job is submitted to the Resource Manager (Yarn)
• An Application Master is started; Spark Executors run on the worker nodes and read data from HDFS, coordinated by the Spark Driver
Running spark
• Using spark-submit:
$ spark-submit --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 2g --executor-cores 1 --jars library.jar app.jar
$ spark-submit --master yarn --deploy-mode client --class package.mainClass scala-app.jar
$ spark-submit --master local[2] examples/src/main/python/pi.py 10
• Using spark-shell:
$ pyspark
$ spark-shell
$ sparkR
Since Spark 1.4 there is also an R API available (still experimental early 2016)
WordCount in Spark
• Spark provides a much simpler framework than MapReduce to do
processing
• New features are available in Scala and Java first, then Python
1. text_file = sc.textFile("hdfs:///user/vagrant/constitution.txt")
2. counts = text_file.flatMap(lambda line: line.split(" ")) \
3. .map(lambda word: (word, 1)) \
4. .reduceByKey(lambda a, b: a + b)
5. counts.saveAsTextFile("hdfs:///user/vagrant/wordcount_spark_output")
Source: http://spark.apache.org/examples.html
Demo
Apache Spark installation and demo
RDDs
• Spark's primary abstraction is called a
Resilient Distributed Dataset (RDD)
• RDDs can be made from files in HDFS
using sc.textFile() in Spark:
rdd = sc.textFile("hdfs:///user/edward/data")
• RDDs reside in memory by default,
spread over the memory of the worker nodes (Node 1, Node 2, …)
RDDs
• An RDD can be transformed into another
RDD
• Filtering data is another example of a
transformation
• To return the result, an action can be
applied
A Shuffle in Spark
• Transformations like reduceByKey can
have a big impact on the cluster
resources, much more than a map or
filter task, because data is shuffled across the network
Demo
Apache Spark Actions and Transformations
Some transformations
Transformation | Meaning
map | The map from the MapReduce paradigm. Transforms data (the T in ETL)
flatMap | Same as map, but can output zero or more output items (rows). For example one line as input can be output as words
reduceByKey | Aggregates by a given function, like Reduce in the MapReduce paradigm. Will shuffle the data across the network
collect | Returns all the elements from an RDD to the driver (client). Can only be done if data is small enough (i.e. after aggregations / filters)
saveAsTextFile | Saves the data to a text file. Can be local, but mostly HDFS
• MLLib's models run distributed, which can scale much better than models
written in, for instance, Python's Scikit-Learn library
Clustering Example
• Another use case is to use a
clustering algorithm to interpret
geolocation data
Hive
Hive
• Hive is a data warehouse for Hadoop
• The data is stored in HDFS, the queries are executed on the worker nodes
(= bring compute to where the data is)
Hive Architecture
• Clients connect with beeline (over JDBC, using the Thrift protocol) or the legacy hive CLI
• HiveServer2 parses, plans, optimizes, and executes the query as a MapReduce v2 or Tez job on Yarn, reading data from HDFS
• The Metastore keeps the table definitions in an RDBMS backend (e.g. MySQL)
• Port 10000 is the default JDBC Hive port; the protocol is Thrift over the binary protocol
• An HTTP protocol is also available and can be enabled in the Hive configuration; the default HTTP port is 10001
• A table in the default database is stored in HDFS under /apps/hive/warehouse/customers
Hive Databases
• A table can also be created in another database rather than ‘default’:
create database mydatabase;
use mydatabase;
CREATE TABLE customers
(id INT,
firstname STRING,
lastname STRING);
• That table is stored under /apps/hive/warehouse/mydatabase/customers
INSERT INTO customers (id, firstname, lastname) VALUES (1, 'John', 'Smith');
INSERT INTO customers (id, firstname, lastname) VALUES (2, 'Bob', 'Smith');
INSERT INTO customers (id, firstname, lastname) VALUES (3, 'Emma', 'Smith');
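The LOAD statement the next bullet refers to was lost in extraction; it presumably looked like this (a sketch using the path and table named below):
LOAD DATA INPATH '/tmp/data.csv' INTO TABLE customers;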
• This will move the data from the HDFS path /tmp/data.csv to
/apps/hive/warehouse/customers
• hadoop fs can be done from an edge node (a node with only client
libraries), a datanode (a node in the cluster), or a desktop/laptop that has
hadoop installed and configured (the xml configuration files need to be
present)
Hive Querying
Output all the data:
SELECT * FROM customers;
This query will not run a MapReduce job,
because no operation needs to run on the
data; the data can simply be shown on
screen.
Aggregation:
SELECT firstname, COUNT(*) FROM customers
GROUP BY firstname;
This query will run Map and Reduce jobs.
The aggregation (GROUP BY) runs in
the Reduce tasks.
Demo
Hive demo
Hive Partitioning
• In Hive, using HDFS, indexes will not speed up queries like in Relational
Databases
• Using HDFS, even if we would know where the data would be on disk, it
would still take a long time to retrieve the full blocks, because it's not
possible to do random reads on a file in HDFS
Hive Partitioning
CREATE TABLE employees (id INT, firstname STRING, lastname STRING)
PARTITIONED BY (department STRING);

INSERT INTO employees (id, firstname, lastname, department) VALUES (1, 'John', 'Smith', 'hr');
INSERT INTO employees (id, firstname, lastname, department) VALUES (1, 'Alice', 'Jones', 'operations');
INSERT INTO employees (id, firstname, lastname, department) VALUES (1, 'Joe', 'Johnson', 'finance');

/apps/hive/warehouse/employees/department=hr/
/apps/hive/warehouse/employees/department=operations/
/apps/hive/warehouse/employees/department=finance/

SELECT * from employees where department = 'hr'

• To partition a table, use the PARTITIONED BY clause in HiveQL
• After inserting some data, you will see that Hive automatically creates subdirectories for every department
• Anytime a query is executed on a specific department, a big portion of the data can be skipped, because Hive can go straight to the data in one of the subdirectories
Hive Bucketing
CREATE TABLE customers (id INT, firstname STRING, lastname STRING, orderID INT)
CLUSTERED BY (orderID) INTO 5 BUCKETS;
• Often the column you want to partition on doesn't have a one-to-one relation to the amount of partitions you want to have
Hive Bucketing
• Using a hashing algorithm on the orderID, the rows will be evenly divided over all 5 buckets (e.g. an INSERT with orderID 1 goes to one of the buckets)
Extending Hive
• You can extend Hive by writing User Defined Functions (UDFs)
• UDFs in other languages like Python are also available, but might be less
performant than ones written in Java
Extending Hive
• Data loading and writing is controlled by SerDes (Serialization and
Deserialization)
• There are libraries available for the most common data types
• Avro
• XML
• JSON
• You can also write your own Serialization and Deserialization in Java
Stinger Initiative
• The Stinger initiative wants to lower the execution time of Hive queries
Tez
Pig / Hive on MapReduce vs. Pig / Hive on Tez
(Diagram: on MapReduce, every Map/Reduce stage reads from and writes to HDFS between jobs; on Tez, the Map and Reduce tasks are chained together without the intermediate HDFS reads/writes)
• Enables compression
ORC
• ORC is the recommended format when running hive queries
CREATE TABLE employees (id INT, firstname STRING, lastname STRING) STORED AS ORC
Hive in Spark
• You can use Hive in Apache Spark:
Spark creates a sqlContext using HiveContext
sqlContext.sql("SELECT * from customers").take(1)
• Data can be ingested into Hadoop and
saved in Hive tables for easy access by
Data Analysts
Realtime Processing
Real-time processing
• Up until now we’ve been using batch processing:
• You typically run batch jobs in intervals, for example every hour or
every day
• It's the easiest way to get started with Big Data; often people start
first with batch and then, when there's a need to process the data in
smaller time intervals, they add stream processing capabilities
Real-time processing
• Stream processing requires new technologies
• HDFS cannot quickly read 1 event (e.g. 1 line or 1 record); it's only fast
when reading full blocks of data
Real-time processing
• Constantly running services within the Hadoop cluster can retrieve the
data from a queue system like Kafka
Real-time processing
• Once the data has been processed it needs to be stored (persisted)
• Again HDFS is not a good fit here: the file system hasn't been designed
to store a lot of small files. A solution could be to append all events to
one or a few files, but then reading one event would be difficult
• Those datastores enable fast writes to the datastore and fast reads
when you know the row key
• Processing: Storm or Spark Streaming, per event (Storm)
or per micro-batch of events (Spark)
• Persist data in a distributed, scalable big data store: HBase or Accumulo
• The data is retrieved from a queue system: Kafka
Kafka
• Kafka is a high throughput distributed messaging system
• It’s durable: messages are persisted to disk and replicated within the
cluster to avoid data loss. Each Kafka instance can handle TBs of data
without performance impact
Kafka Terminology
• Kafka maintains feeds of messages in
categories called topics; Producers write messages to these topics
Kafka Terminology
• Kafka uses Zookeeper to know which Kafka
brokers are alive and in sync (using
heartbeat functionality)
• Zookeeper is also used to keep track of which
messages a consumer has read already and
which ones not. This is called the offset
• Zookeeper is a service that runs on the cluster,
where configuration can be stored in a
distributed way and that can be used to keep
track of which nodes are still alive
• A typical Hadoop cluster has 3 or 5 Zookeeper
nodes, providing protection against node failure
(Diagram: Kafka brokers on node 1 and node 2, each hosting Topic 1 and Topic 2, coordinated by a Zookeeper ensemble)
Kafka Topics
• For each topic, Kafka maintains a
partitioned log (p0 and p1 in the picture)
• You can choose how many partitions
you want
• Each partition is an ordered, immutable
(unchangeable) sequence of messages
Kafka Topics
• Partitions allow you to scale beyond the
size that will fit on one server
Kafka Topics
• Replication can be activated on a per-topic
basis: each partition has a leader on one broker
and followers on other brokers
• It is the producer process that writes to the
partitions
• Kafka guarantees that messages sent by
a producer to a partition will be appended
in the order they are sent
• If message 0 is sent before message 1 to
partition 0, then message 1 will always
appear after message 0 in the log
Kafka Topics
• Consumers read from the leader partitions
• A Consumer sees messages in the order they
are stored in the log
• Only “committed” messages are given out to
the consumer
• A message is considered committed when all
the “in sync” replicas have the message
applied to their logs
Kafka Topics
• Consumer groups can be
created
• Each message in a topic is
sent to only one consumer in a
group
Kafka Messages
• Kafka guarantees at-least-once message delivery by default
• For at-most-once message delivery, the consumer also needs to commit its
offset before it processes the messages, and do this on the destination storage
system
• Most users keep the at-least-once message delivery by default, and make
sure that the processing of messages can be repeatable
Kafka Messages
• Kafka messages can be compressed
• The batch of messages will be written in compressed form and will only be
decompressed by the consumer
Log Compaction
• Log Compaction in Kafka is an alternative to just deleting messages older
than a certain time
• Kafka can maintain the most recent entry for every unique key
• In that case Kafka will have to do log compaction (remove older entries)
• Only when the unique key gets overwritten can the older data be
removed
Log Compaction
Log before compaction:
Offset:  1 | 2 | 3 | … | 10 | …
Key:     1 | 2 | 3 | … | 1  | …
Message: Bob Smith | Alice Jones | Tom Johnson | … | Joe Smith | …
Log after compaction:
Offset:  2 | 3 | … | 10 | …
Key:     2 | 3 | … | 1  | …
Message: Alice Jones | Tom Johnson | … | Joe Smith | …
• Log compaction guarantees that each key is unique in the tail of the log
• Any consumer that stays within the head of the log (~1 gig) will see all messages
• This is important for consumers: as long as they read within the last gigabyte of messages,
they will never miss a message - older messages might get deleted during compaction
• Website Activity Tracking: to track site activities like visits, searches. This
was the original use case for Kafka by LinkedIn
• Commit log: Kafka can serve as a commit log for a distributed system
Kafka in HDP
• Kafka can be installed using Ambari
• There are initiatives to enable Kafka on Yarn (KOYA), but that’s not yet
available for production usage
Usage
• Create topic mytopic with 4 partitions
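The command itself was lost in extraction; a sketch using the Kafka CLI shipped with HDP (the path and Zookeeper address are examples):
$ /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create \
  --zookeeper node1.example.com:2181 --replication-factor 1 --partitions 4 --topic mytopic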
Kafka demo
Storm
Storm
• Storm is a realtime distributed computation system
• It’s included within the Hortonworks Data Platform and can be installed
through Ambari
Storm Components
• On Storm you run topologies
Storm Components
• The Supervisor listens for work and
starts and stops worker processes as
needed
• Each worker process executes a subset
of a topology
• A running topology consists of many
worker processes across the Storm
cluster
• All coordination between Nimbus and
the Supervisors is done using Zookeeper
Storm Components
• Nimbus and the Supervisors keep their
state in Zookeeper
Usage
• To submit a new topology using storm, you need to package your code as
a jar (can be done by the developer or build systems like Jenkins)
• Once you have the jar, you can create a new Topology by using the
following command:
$ storm jar all-my-code.jar package.MyTopology arguments
• This allows you to submit your topologies not only in Java, but also in,
for example, Python
Storm Streams
• The core abstraction in Storm is a
stream
• A stream is a sequence of tuples
(a named list of values)
Storm Streams
• A bolt consumes any number of
input streams, does processing,
and can emit more streams
• Bolts can run functions, filter tuples,
do streaming aggregations, talk to
databases, and so on
• A network of spouts and bolts
makes up the topology
Topology Example
• A Producer (e.g. log data, Twitter data) writes to a Kafka topic (partitions p0 and p1)
• A Kafka Spout reads the input stream, processing Bolts transform the tuples, and an HBase Bolt writes the results to HBase Region Servers for persistent storage
Topology Example
• This is a WordCount Example Topology in Java:
1. TopologyBuilder builder = new TopologyBuilder();
2.
3. BrokerHosts hosts = new ZkHosts("node1.example.com:2181");
4. SpoutConfig spoutConfig = new SpoutConfig(hosts, "mytopic", "", UUID.randomUUID().toString());
5. spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
6. KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
7.
8. builder.setSpout("sentences", kafkaSpout, 5);
9. builder.setBolt("split", new SplitSentence(), 8)
10. .shuffleGrouping("sentences");
11. builder.setBolt("count", new WordCount(), 12)
12. .fieldsGrouping("split", new Fields("word"));
• The first bolt splits the lines coming from Kafka into words, the second bolt counts the words
WordCount visualization
Sentences Spout Split Bolt Count Bolt
Demo
Storm demo
Message delivery
• Storm supports all 3 message processing guarantees
• At most once
• A message is passed through the topology at most once, failed messages will
not pass through a second time
• At least once
• Exactly once
• A message will go through the topology exactly once, which can be achieved
using Trident
Message delivery
• Storm considers a message successfully delivered when a message went
through the whole topology
• To make sure that every tuple has been processed, tuples will go through
the topology at least once: one time when successful, or multiple times
when there was a failure during the processing
• To anchor a tuple, the developer needs to make sure that in the code of the Bolt
itself, the input tuple is emitted together with the output:
• In this example, a sentence is split into words. When emitting a word, the original tuple is
the first argument. The original tuple is anchored to the newly emitted word and can then
be tracked in the next bolt
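The code this slide refers to did not survive extraction; a sketch of such an anchoring bolt in Java (the field name and the collector field are assumptions), following the standard Storm pattern of emitting with the input tuple as anchor and acking it afterwards:
public void execute(Tuple input) {
    String sentence = input.getStringByField("sentence");
    for (String word : sentence.split(" ")) {
        // anchor the new tuple to the input tuple so failures can be tracked
        collector.emit(input, new Values(word));
    }
    collector.ack(input);
}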
• To guarantee the delivery, every bolt needs to implement the anchoring and
acking
• To make this easier for the developer, Storm has a BasicBolt class that can be
used to do this automatically
Impact of acking
• Guaranteeing delivery will have a big performance impact
• It will also increase bandwidth, because an ack will be sent for every tuple
Storm Scaling
• The developer configures the initial parallelism
Storm Scaling
• Scale out can be done in the same way as Hadoop does it
• Ambari can be used to add a new host and install the necessary services
• When adding new hosts, new resources will be available for the Storm cluster
• You can designate some of the nodes for Data Storage and batch processing,
and some of them for realtime processing
• You can also have a mixed environment: data storage, batch processing, and
realtime processing on the same nodes and let them share resources
Trident
Trident
• Trident is an alternative interface you can use in Storm
Trident
• Trident will stream small batches of tuples instead of handling them one by one
• The size of the batches depends on the throughput, but one batch can
be thousands or millions of messages
(Example: the text of the US Constitution, "We the People of the United States, in Order to form a more perfect Union, …", is split into Batch 1, Batch 2, and Batch 3 of sentences)
Trident Example
1. FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
2. new Values("the cow jumped over the moon"),
3. new Values("the man went to the store and bought some candy"),
4. new Values("four score and seven years ago"),
5. new Values("how many apples can you eat"));
6.
7. spout.setCycle(true);
8.
9. TridentTopology topology = new TridentTopology();
10.
11. TridentState wordCounts = topology.newStream("spout1", spout)
12. .each(new Fields("sentence"), new Split(), new Fields("word"))
13. .groupBy(new Fields("word"))
14. .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
15. .parallelismHint(6);
• If the goal is to have the lowest latency possible, don't use Trident,
because the batches and the “state” it creates mean higher latency to
get a tuple processed (Trident tracks the state of the stream)
• If you can live with the “at least once” processing Storm provides, you
don't need Trident
Spark Streaming
Spark Streaming
• Spark Streaming is an alternative to Storm
• Spark has gained a lot in popularity and is currently the most popular project
of the Apache foundation (early 2016)
• One of the great advantages in using Spark Streaming is code reuse: batch
processing code written in Spark can be reused in realtime processing,
using the same framework
• For very low latency processing, Storm is still better; otherwise Spark
Streaming is going to be your preferred choice
Spark Streaming
• Spark Streaming has the following advantages:
• Fault Tolerant: it has stateful exactly-once semantics out of the box (like
Trident)
Spark Streaming
• Spark Streaming integrates well with other technologies in our ecosystem:
Input sources (Kafka, Flume, Kinesis, Twitter, …) → Spark Engine → output stores (HDFS, HBase, Cassandra, Dashboards)
• There are basic sources, like file streams or a stream from a builtin library like
HDFS
• Advanced sources are streams that come from external non-Spark libraries,
like Kafka
• When using these advanced sources, extra libraries have to be linked and
shipped with the program (i.e. the appropriate classes/jars have to be
added when submitting the application)
Receivers
• Receivers can be reliable or unreliable:
• Sources like Kafka are reliable receivers, because transferred data can be
acknowledged by the receiver. Spark can ensure no data has been lost in
transfer due to any kind of failure
Demo
Spark demo
UpdateStateByKey
• Spark Streaming allows you to keep information (a state) of your stream
• For example, you might want to keep a counter, or you might want to keep track of the count
of all the words across your stream when doing a WordCount:
…
def updateFunction(new_values, last_sum):
return sum(new_values) + (last_sum or 0)
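To wire this function into the stream, you call updateStateByKey on the pairs DStream; a short sketch, where pairs is assumed to be a DStream of (word, 1) tuples:
running_counts = pairs.updateStateByKey(updateFunction)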
Checkpointing
• When storing state, you want to have resilient storage that can recover
from a failure
Checkpointing
• To enable checkpointing when keeping the state, you set the checkpoint
directory in the code:
…
ssc.checkpoint("checkpoint")
…
• Note that saving to a reliable storage can slow down processing of the
batches when checkpointing needs to happen
Demo
Spark state demo
Windowing
• Spark Streaming also supports windowed computations
Window length: 3 batches, sliding interval: 2 batches
(Diagram: a windowed DStream; Window 1 holds the sentences from second 0 to 6, Window 2 the sentences from second 4 to 10)
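A short sketch of such a windowed computation in PySpark Streaming, assuming sentences is a DStream with a 2-second batch interval (so a 6-second window sliding every 4 seconds matches the picture):
windowed = sentences.window(6, 4)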
• You might have trained a model using a dataset in batch, but when new
data comes in, you still want to have the algorithm learn and adapt the
model using this new data. This is called online learning
• New data that comes in the stream will be predicted using a model, but
the model will be changed using this new information
HBase
HBase
• HBase is a realtime, distributed, scalable big data store on top of Hadoop
• Random read/writes means you can fetch any record in the big data
store fast; you don't need to read the whole file like in HDFS
HBase
• HBase can host very large tables: billions of rows and millions of columns
• Consistency: the data is consistent, all nodes/clients see the same data
• Availability: the system is available, can give you a response with the data
• Partition Tolerance: the system can still operate when partitions become
unavailable
• HBase is a CP system, you will always have consistent reads and writes, but when a
partition (a node) becomes unavailable, you will not be able to read the data until
another node has recovered that part of the data.
HBase Table
• An HBase table is different from a
table in a Relational Database
• The first difference you will see is that
HBase uses Column Families to group
Columns
• The Columns themselves are called Column
Qualifiers in HBase
• Column Families (CF) physically
separate a set of columns
• every CF has a separate directory
in HDFS
Example table with a rowkey and two Column Families, Customer (Firstname, Lastname) and SalesInfo (Orders, TotalSales):
1 | John Smith | 5, 300
2 | Alice Jones |
3 | Bob Johnson | 1, 15
4 | Jim Brown |
5 | Joe Williams |
6 | Tom Miller | 3, 6580
HBase Table
• Every cell in HBase can have multiple
versions
• When updating a cell, you actually insert a
new value, with a new version number
(e.g. Orders v1: 5, v2: 6 and TotalSales v1: 300, v2: 310)
• You can configure to keep 1 version or
multiple versions
HBase Table
• The rowkey is a unique key for the
table, like the primary key in a
relational database
• The row key is sorted alphabetically
• The row key will be the only way to
get a row out of HBase quickly, so
row key design is very important
• Ideally, data related to each other
should be close to each other by
choosing a good row key
HBase Table
• Based on the rowkey, the table will be
divided over the different nodes in Hadoop
• An HBase table is divided into Regions
• The Regions are hosted on a RegionServer,
a service running on the worker/data nodes
HBase Architecture
Reading/Writing a value from/to HBase
1. The client retrieves the location of the hbase:meta table from Zookeeper
2. Zookeeper answers, e.g. hbase:meta is on 192.168.199.2
3. The client reads hbase:meta, which maps row key ranges to RegionServers
(e.g. customers,1,1 → 192.168.199.2:60020 and customers,3,2 → 192.168.199.3:60020)
4. We know that customers starting with 5 should go to Region 3 on
RegionServer 3, so that's where we'll send our write
• Changes are appended to the end of the WAL (Write Ahead Log, stored in HDFS),
because that is the quickest way on disk to do a write
HBase Architecture - Read
• When the record is found in the
HFile, that block on HDFS will be
copied into the Block Cache (on the RegionServer) and the
record will be sent to the Client
• A side effect is that data that is “in
proximity” to the record just read
will now be in the Block
Cache too
• GET statements are the fastest way to get data out of HBase
• Scans can retrieve multiple row(s) for a partial matching row key. Scans can
return all data from a table (also called a full table scan)
• A scan is slower than a GET, but the only choice when the row key is not known
• You should avoid full table scans (scanning the full table and returning all data)
HBase Architecture
• Data in the Memstore and HFiles
is stored as Key-Value pairs (per Column Family, e.g. CF: Customer)
HBase Architecture
• After a lot of writes, multiple HFiles are getting created
• Updates, like updating customer 1's firstname from John to Joe, are also recognized as
new writes
(Diagram: two HFiles, each a sorted list of rowkey / column qualifier / cell value / timestamp entries; the update to customer 1 appears in the newer HFile with a higher timestamp)
• Merging HFiles = Compaction
HBase Architecture
• HBase has 2 different compaction mechanisms
• Minor Compaction: smaller files are merged into 1 sorted HFile. This has
low impact on the RegionServers
• During Major Compaction, a lot of files are read and written again, which
leads to a lot of IO operations and network traffic
• The WAL is stored on HDFS, so it is replicated by
default (HDFS block replication)
HBase Architecture - Crash
• The WAL, which is just a log file with
all transactions appended to it, will
have to be sorted and written to
disk
• To achieve that, HBase will copy the
WAL into the respective Memstore (on another RegionServer, e.g. RegionServer 2 or 3)
• At the end, it flushes the Memstore
to disk, to an HFile
(Diagram: the Customer table is now split over two regions: rowkeys 1-3 (John Smith, Alice Jones, Bob Johnson) and rowkeys 4-6 (Jim Brown, Joe Williams, Tom Miller))
• Next, the directory structure in
HDFS is updated
• Clients will now go to the new region
to read/write data
HBase Architecture
• This can result in the data (HFiles) not being local to the new node (the HFiles stay in HDFS on RegionServer 1 while the region is now served by RegionServer 2)
HBase Architecture
(Diagram: the region with customers 4-6 is now served by RegionServer 2; reads may still hit the old HFile on RegionServer 1, new writes go to RegionServer 2)
• New data gets written locally on RegionServer 2, but when previous data needs to be
read, it's possible that data from the HFile on RegionServer 1 needs to be retrieved
• HFiles are replicated 3 times by default, so this wouldn't happen if the region moves to
another node that has a copy of the HFile
HBase Architecture
(Diagram: the same two RegionServers; new writes keep going to RegionServer 2)
• During the next Major Compaction, the old HFile is merged together with
the newer HFiles onto RegionServer 2
(Diagram: the Customer region holding the highest rowkeys (4: Jim Brown, 5: Joe Williams, 6: Tom Miller); new writes with keys 7, 8, 9 all arrive here)
• New writes will always go to the
same region server, because one
RegionServer will always be hosting
the region with the highest customer
number
HBase hotspotting
• To avoid this, you have to make sure that
your writes are evenly balanced over the
multiple RegionServers
• You can create for example 3 “buckets”
which have number 0, 1 and 2
Rowkey → Salted Rowkey:
1 → 0_1
2 → 1_2
7 → 0_7
8 → 1_8
9 → 2_9
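A minimal sketch of a salting function that is consistent with the table above (the modulo-based bucket choice is an assumption; a hash of the key works just as well):
def salt_rowkey(key, buckets=3):
    # prefix the incremental key with a bucket number so consecutive keys
    # land in different regions / RegionServers
    return "%d_%d" % ((key - 1) % buckets, key)

# salt_rowkey(7) -> '0_7', salt_rowkey(8) -> '1_8', salt_rowkey(9) -> '2_9'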
HBase hotspotting
• Every time you PUT a new record,
even though your real key is
incremental, your writes go to
different RegionServers
(Diagram: salted rowkeys 0_1, 0_4, 0_7 end up in one region, 1_2, 1_5, 1_8 in a second, and 2_3, 2_6, 2_9 in a third, each on a different RegionServer)
• Salting can be used in
combination with pre-splitting. If
you create 100 buckets, you
might already want to pre-split
your table in 100 regions, using
this prefix
• Row key design is not easy and should be well thought over
• Pre-splitting can give you a performance gain (less moving parts when
writing data)
Demo
HBase demo: installation
Demo
HBase demo: HBase shell
Demo
HBase demo: Spark to HBase
Phoenix
SQL on top of HBase
Phoenix
• Phoenix is a relational database layer on top of HBase, enabling SQL over
HBase
• When SQL Queries are received by Phoenix, they will be compiled into a
series of HBase queries
Phoenix Benefits
• The extra SQL layer still achieves good performance and in some cases performs even
better than writing all the HBase queries yourself
• The SQL statements are compiled into HBase scans and executed in parallel on the
HBase RegionServers
• Filters in your SQL “WHERE”-clause are converted into filters in HBase. The data is
filtered on the RegionServers instead of on the client
• Phoenix makes executing queries on HBase a lot easier - it can do JOINs, add INDEXes
• SQL is a language a lot of people in organizations already know. Phoenix enables those
people not to learn yet another language
• A lot of external applications can only communicate using SQL; Phoenix allows those
applications to work together with HBase
Phoenix example
• Phoenix example:
$ phoenix-sqlline node1.example.com
> create table test (mykey integer not null primary key, mycolumn varchar);
> upsert into test values (1,'Hello');
> upsert into test values (2,'World!');
> select * from test;
• Which outputs
+-------+------------+
| MYKEY | MYCOLUMN |
+-------+------------+
| 1 | Hello |
| 2 | World! |
+-------+------------+
• One of the HBase optimizations is to make sure that writes are load
balanced across the RegionServers
• During the HBase Hotspotting lecture, I explained that salting can turn our
incremental key into a composite key that will make sure your writes get
load balanced over the different RegionServers
CREATE TABLE TEST (ID VARCHAR NOT NULL PRIMARY KEY, NAME VARCHAR) SALT_BUCKETS=16
CREATE TABLE TEST (ID VARCHAR NOT NULL PRIMARY KEY, NAME VARCHAR) COMPRESSION='GZ'
• In Phoenix you can create an index on tables that have immutable data and
mutable data
• Immutable data means you can add rows, but you’re not changing them
• Mutable data means you might still be changing the data you inserted
• Indexes on immutable data are faster than mutable (because indexes are
easier to maintain if the data doesn’t change)
Copyright 2016 Edward Viaene
• To create an index on mutable data, you can use the following statement:
CREATE TABLE test (mykey varchar primary key, col1 varchar, col2 varchar);
create index idx on test (col2)
• You can use the explain statement to see if an index is included in your query:
EXPLAIN select * from test where col2 = ‘test’;
• Let's say we recreated our Customer table in Phoenix and we want to join
it to order data:
• The fastest way to JOIN 2 tables is the Hash JOIN, which will be used by
Phoenix if one of the tables is small enough to fit in memory
• Spark could connect using a JDBC connection, but the Spark plugin is
able to parallelize queries better, so that should be the preferred option
• To use the Spark Phoenix plugin, the developer has to include the
phoenix-<version>-client-spark.jar library
• After including this library, queries can be loaded into a Spark RDD or a
Spark DataFrame (DataFrames in Spark are a representation of a table)
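A short sketch of loading a Phoenix table into a Spark DataFrame with the phoenix-spark plugin (the table name and Zookeeper URL are examples):
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "CUSTOMERS") \
    .option("zkUrl", "node1.example.com:2181") \
    .load()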
Demo
Phoenix demo
Hadoop Security
Creating a secure cluster
Kerberos
• Kerberos, a network authentication protocol, is used to secure a Hadoop
cluster
• You can install MIT Kerberos on your cluster using Ambari, or use an
external product that provides Kerberos
Kerberos Architecture
• Rather than exchanging passwords, the client
will send its username to the Authentication
Server (AS) and ask for a TGT (1), a Ticket
Granting Ticket
• This TGT will allow the user to request tickets
later on, to log on to services
• The Authentication Server sends back 2
messages to the client, one of them containing the
TGT (2). The first message is encrypted with
the hashed password found in the User
Database
(The AS, the Ticket Granting Server, and the User Database together form the Key Distribution Center, KDC)
Kerberos Architecture
• When the client wants to log in to a
service (for instance the client wants to
use HDFS), it will need a Service Ticket
• The client will ask the Ticket Granting
Service (TGS) for a Service Ticket (3). The
client sends its TGT to the TGS to show
it has been authenticated
Kerberos Architecture
• The client now has enough
information to authenticate
itself to the Service, for
example HDFS
• The client will send the Service
Ticket to the Service (5)
• The Service will decrypt the
Service Ticket using its own
key and can validate the
client's true identity
Kerberos Backend
• You can use the built-in Kerberos database,
which is the default option if you let Ambari
install MIT Kerberos
Kerberos in Hadoop
• If you already have a Directory
Service, like Windows Active
Directory, you might want to link
your MIT Kerberos with AD, by
building a trust
• The Kerberos within Hadoop will
then accept users known in
Active Directory
• In this setup, you don't need to
recreate every user that would
need access to your cluster. You
can use the already existing users
from an existing Active Directory
Terminology
• Kerberos REALM: a realm is a logical network, similar to a domain in Active
Directory. It defines a group of systems within Kerberos that have the same
master KDC. A realm can be, for example, all departments within the North America
region. Smaller companies only have 1 REALM
• user@realm: joe@EXAMPLE.COM
• service@realm: hdfs@EXAMPLE.COM
• user/host@realm: john/node1.example.com@EXAMPLE.COM
Keytabs
• Every principal needs a password to log in to Kerberos
• All our Hadoop services like hdfs, Zookeeper, and Yarn also need service
principals, but cannot have a password because they're services, not users
• Kerberos provides a way to store the “password” in a file, called the keytab
• In this keytab a unique key is saved that can be used by the services to
authenticate themselves with Kerberos
• On a Linux system, when a service has read access to this keytab, it can
authenticate using its principal name (service name) and the keytab as a
password
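A sketch of how a service (or an admin) authenticates with a keytab; the principal and the keytab path are examples of what HDP typically uses:
$ kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM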
Keytabs
• In Hadoop we have many services
• Ambari can generate and maintain keytabs for us; this is an option we can
choose during the Ambari Kerberos setup
• When a new node is added, Ambari will make sure the necessary keytabs
are available on the new node
• How this creation of keytabs works will become clear in the next demo
Demo
Enabling Kerberos on Hadoop
Hadoop Security
SPNEGO
SPNEGO
• SPNEGO stands for Simple and Protected GSSAPI Negotiation Mechanism
• The supported technologies that can be negotiated between client and server
are mainly NTLM and Kerberos
• Using a browser, you can use your kerberos ticket to authenticate to a website
or API, rather than providing a login and password
SPNEGO
• First, the client needs to obtain a
Kerberos ticket from the KDC. This can be done
manually or can happen when the
client logged into his PC
• The browser will then use SPNEGO to
negotiate with the Hadoop Web
UIs to use Kerberos instead of a
login/password
SPNEGO
• SPNEGO is used to secure access to the Hadoop APIs and Web UIs
Demo
SPNEGO
Apache Ranger
Apache Ranger
• Apache Ranger enables authorization for our services in Hadoop
Ranger Architecture
• Ranger uses a UserSync service to
periodically pull users from a central
User Directory, like LDAP or Active
Directory
• An administrator can log in to the Ranger
Web UI to manage users and
permissions
• A Ranger Plugin can be enabled for the
Hadoop services (HDFS, Yarn, Hive,
HBase, Storm, Kafka) to enable
authorization using the defined
permissions in Ranger, called policies
Ranger Policies
Service | Behavior | Granularity
Hive | Hive will disallow any query, except if the user has been granted access in Ranger | Policy on query type and on database, table, column, and UDF
HBase | Same as Hive | Read / Write / Create / Admin on Table, Column Family, and Column
Yarn | Users will not be allowed to submit applications / administer a queue, unless defined in the configuration or in Ranger | Based on Queue name
Storm | User will have no permissions to a topology by default | Permissions on Storm Topologies
Kafka | User will have no permissions to a topic by default | Based on Kafka Topics
Knox | No access by default, permissions can be set in Ranger | Based on topology and Service Name
Demo
Ranger demo
HDFS Encryption
HDFS Encryption
• HDFS implements transparent end-to-end encryption
• Data will therefore be encrypted at-rest (on disk) and in-transit (over the
network)
• HDFS encryption can satisfy the needs of companies to encrypt data. This
can be regulatory driven, for instance to be able to save credit card data
on Hadoop
Ranger KMS
• To enable HDFS encryption on HDP, Ranger KMS needs to be installed
• Ranger KMS (Key Management System) has been written on top of the
Hadoop KMS
• The original KMS system can store keys in files (using a Java keystore)
Ranger KMS
• Ranger KMS has 3 main functions:
• Audit: log all actions performed by Ranger KMS and store them in
HDFS or even Solr (database with full text search capabilities)
Ranger KMS
• When your data is encrypted, the only way to access your data is by
having the key to decrypt the data
• Be careful not to lose this key, or your data will become useless
• It's not possible to decrypt your data without a key; even with the fastest
computers, it would still take years to brute-force the key
• Make sure you back up your Ranger KMS database and your keys, and
have them stored in a safe place
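A sketch of how an encryption zone is typically created once KMS is in place (the key and path names are examples, run as a user with the proper privileges):
$ hadoop key create mykey
$ hdfs dfs -mkdir /secure
$ hdfs crypto -createZone -keyName mykey -path /secure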
Demo
HDFS Encryption in Hadoop
Knox Gateway
Knox
• Knox is a REST API gateway to interact with Hadoop clusters
• It provides a single point of entrance for all REST interactions with Hadoop
• Protects the details of a Hadoop cluster and only exposes the Knox API
Knox Architecture
• All the Hadoop services (WebHDFS,
HCatalog, HBase, Oozie, Hive (JDBC),
Yarn RM, Storm) will be firewalled from the Client;
Knox will serve as the single
point of entrance
• The client will have to
authenticate when
connecting to Knox
• Knox will validate the user's
credentials using a User
Directory (LDAP, AD)
Advanced Topics
Yarn Schedulers
Yarn scheduler
• Yarn has different schedulers available
• The ones that come with HDP are the Fair Scheduler and the Capacity
Scheduler
• The Fair Scheduler gives users a fair share of the cluster, such that all
apps get an equal share of resources over time
Capacity Scheduler
• When using the capacity scheduler,
you can create hierarchical queues (defined in
yarn.scheduler.capacity.root.queues)
Capacity Scheduler
• Let's consider the following
configuration, which splits the root queue (100%) into two queues of 50% each:
yarn.scheduler.capacity.root.queues=adhoc,prod
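The matching per-queue capacity settings would look like this (a sketch; the queue names come from the slide):
yarn.scheduler.capacity.root.adhoc.capacity=50
yarn.scheduler.capacity.root.prod.capacity=50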
Capacity Scheduler
• You can also create child queues
under the children of the root queue:
Capacity Scheduler
• Applications submitted to adhoc or
prod can use only 50% of the total
cluster resources
• You may ask yourself: what
happens when you submit an
application to the ad-hoc queue
and the prod queue is empty?
Capacity Scheduler
• To avoid underutilization, you can
enable elasticity on queues,
allowing a queue (here: adhoc, 50-100%) to exceed its
capacity when other queues are
not in use:
yarn.scheduler.capacity.root.adhoc.maximum-capacity=100
Capacity Scheduler
• Now that the ad-hoc queue can use
more than 50% at any point in time (e.g. 70% vs. 30%),
there is a possibility that applications
in the production queue cannot use
the full 50%
• By default, applications in the
production queue will have to wait until
the application(s) using more than
50% in the ad-hoc queue are finished
Capacity Scheduler
• Alternatively, pre-emption can be enabled
• Pre-emption will notify the Application
Master of running applications in the ad-hoc
queue
• The Application Master can gracefully exit
containers to reduce its usage (back towards 50% / 50%)
• After a preconfigured amount of time, the
Resource Manager can still forcefully
remove the containers, if the Application
Master didn't remove them itself
Demo
Yarn Capacity Scheduler
Label based scheduling
• You can also schedule
applications on specific nodes,
using label based scheduling in
Yarn
Label based scheduling
• The first step is to label the nodes
using the yarn command:
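The commands themselves were lost in extraction; a sketch of labeling nodes with yarn rmadmin (node names are examples, labels taken from the configuration below):
$ yarn rmadmin -addToClusterNodeLabels "lowspec,highspec"
$ yarn rmadmin -replaceLabelsOnNode "node2.example.com=lowspec node3.example.com=highspec"
The queue-to-label mapping is then configured as shown below: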
yarn.scheduler.capacity.root.adhoc.accessible-node-labels=lowspec
yarn.scheduler.capacity.root.adhoc.accessible-node-labels.lowspec.capacity=100
yarn.scheduler.capacity.root.prod.accessible-node-labels=highspec
yarn.scheduler.capacity.root.prod.accessible-node-labels.highspec.capacity=100
Yarn Sizing
Yarn sizing
• Container sizing is one of the most difficult parts when setting up a multi-
tenant (multi-user) cluster
• Every user can set their own size within hard limits, so if you set high hard
limits, every user can choose whatever they like
• Best is to choose some good defaults and give the user some freedom
Yarn sizing
• A good rule of thumb is to take a look at the disks, cores and memory available in every node
• The maximum number of containers and the memory per container can be calculated with the formula shown below
• When having 8 disks, 16 cores and 256 GB memory per node, you could reserve 32 GB for the
OS, 32 GB for HBase and other services, and keep 192 GB for Yarn
• Using the formula you would use 14 containers per node, with 13.71 GB of RAM each
• In this example the decision was mainly based on the amount of disks in the server,
giving enough throughput to the disk per Yarn container
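The formula itself was lost in extraction; the usual Hortonworks sizing guideline, which reproduces the numbers above (min(2 × 16, 1.8 × 8, …) = 14 containers, 192 GB / 14 ≈ 13.71 GB), is:
# of containers = min(2 × CORES, 1.8 × DISKS, total available RAM / minimum container size)
RAM per container = max(minimum container size, total available RAM / # of containers)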
CGroups
• CGroups can limit the resource usage per container
• CGroups can also be used to limit the total Yarn CPU utilization of all
containers to a certain limit
• This can be useful when dedicating a part of the cluster for realtime
processing using technologies like Storm and HBase
• Use ORC as data storage; as explained earlier in this course, it will give
you a big performance boost over standard text files
• Stinger.Next
CBO
• Cost Based Optimization (CBO) enables Hive to generate efficient execution plans
by examining table statistics
• By following these 4 phases CBO will reduce the execution time of Hive queries
significantly
CBO
• Performance testing by Hortonworks shows an average speedup of 2.5
times, using TPC-DS benchmarked queries (a benchmark standard)
• CBO is available starting from HDP 2.2 and can be used by enabling CBO
in Ambari, setting the properties below, and having the developer run an analyze query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
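The analyze statement referred to above did not survive extraction; a sketch (the table name is an example):
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;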
Stinger.Next
• Every new Hive version has new improvements to speed up queries
• The Stinger project brought us features like Tez, Vectorization, and ORC to
lower execution time of queries
• Stinger.Next will also make sure that Hive meets the SQL:2011 Standard
JOIN Strategies
• There are different JOIN strategies that Hive will perform, depending on what the
data looks like
• The most performant join is the in-memory join (works only if one of the 2
tables fits in memory)
• The third type of join is a Sort-Merge-Bucket join (works with tables larger
than available memory, but the data needs to be in a certain format)
• In most cases Hive will be able to automatically determine what JOIN type to use
In-memory JOIN
• When doing an in-memory
join, the smaller table will be
kept in memory on every
node
(Diagram: on Node 1, table 1 (kept in memory) and block 1 of table 2 (a 128 MB block read from HDFS) are joined in Map task 1)
MapReduce JOIN
• When the dataset is too big, a
MapReduce join needs to
be performed
1. The Map tasks process the blocks of table 1 and table 2 and
output a K,V pair where the K is the JOIN key
2. Keys that are the same
are sent to the same reducer;
therefore, records that should
be joined end up at the same
reducer
3. The actual join happens at
the reducers, which then write
the end result to HDFS (1 file
per reducer)
MapReduce JOIN
• The big downside of the MapReduce join is that ALL records need to be sent
over the network during the shuffle & sort phase
• This means records that can never be joined will also be sent over the network,
increasing disk IO and network traffic
• A way to avoid most of the disk IO and network traffic is to filter the records that can not
be joined
• A Bloom filter can be used as an efficient filter mechanism to filter unwanted records
• The Bloom filter algorithm will filter efficiently, but will give you false positives (i.e. it
might not filter certain records). When doing JOINs, false positives are harmless,
because those records will not be joined anyway
Sort-Merge-Bucket JOIN
• When doing a Sort-Merge-Bucket
(SMB) JOIN, the data needs to be
bucketed and sorted on the join
key on both tables
• To hint to Hive that you want to use an
SMB JOIN, you can set the
variables below:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

CREATE TABLE customers (id INT, firstname STRING, lastname STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS;

CREATE TABLE orders (id INT, customer_id INT, date STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 16 BUCKETS;
Sort-Merge-Bucket JOIN
• A Map task can now JOIN the
customers; there is no Reduce
task necessary
• The map task expects the ID of
the customer table and the
customer_id of the orders table to
be sorted, and the same join keys
should be in the same bucket
• Node 1 will join the orders of
customers 1, 2, and 3. There will
never be orders of customers 1, 2,
and 3 in other buckets
(Diagram: Node 1 holds customers 1-3 and their orders, Node 2 holds customers 4-5 and their orders; each bucket is joined by its own Map task)
Spark Optimizations
JOINs in Spark
• Efficient JOINs in Spark can also be done in the same way as Hive: by
taking one part of the dataset in memory, rather than sending the data
over the network during a shuffle
• To take one RDD in memory on every executor, you can use broadcast:
1. data = [(1, "customer1"), (2, "customer2"), (3, "customer3"), (4, "customer4")]
2. orders = [(2, "order1"), (3, "order2")]
3.
4. # collect the small customers table as a map on the driver
5. customersMap = sc.parallelize(data).collectAsMap()
6. # broadcast it once to every executor
7. broadcastCustomers = sc.broadcast(customersMap)
8.
9. distOrders = sc.parallelize(orders)
10. # map-side join: look up the customer for every order, no shuffle needed
11. result = distOrders.map(lambda order: (order[0], order[1], broadcastCustomers.value[order[0]]))
Avoid GroupByKey
• In Spark, you should avoid using groupByKey (groups all the keys
together) if you can use reduceByKey (executes the logic while grouping)
• Consider the example where you would like to take a sum of the words in a
WordCount program. You could first groupByKey and then perform another
map transformation to take the sum, or you could combine the grouping
and the sum in one transformation using reduceByKey
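A minimal PySpark sketch of both approaches, assuming pairs is an RDD of (word, 1) tuples as in the WordCount example earlier:
# groupByKey: all (word, 1) pairs are shuffled before summing
counts_group = pairs.groupByKey().mapValues(sum)
# reduceByKey: partial sums are computed locally first, so less data is shuffled
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)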
GroupByKey
(Diagram: every (a, 1) and (b, 1) pair is shuffled across the network before being summed to (a, 8) and (b, 4))
ReduceByKey
(Diagram: the pairs are first combined locally per partition, e.g. (a, 3), (a, 4), (b, 3), (b, 1); only these partial sums are shuffled and then combined into (a, 8) and (b, 4), resulting in less network traffic)
High Availability
Namenode HA
• Up to now, we have only been using 1 Namenode
• To solve this problem, since HDP 2, the Namenode can be configured in
High Availability mode
• A second Namenode will be started that will take over when the active
Namenode isn't healthy anymore
Namenode HA Architecture
(Diagram: ZKFC processes take a lock in the Zookeeper (ZK) ensemble to determine which Namenode is Active and which is Standby; the Namenodes share their edits through JournalNodes (JN); clients talk to the active Namenode)
Demo
Enabling NN HA
Database HA
• Hadoop uses a lot of databases to store its information in
• All these databases need to be set up in high availability and need to have a proper
backup procedure (e.g. MySQL Replication for HA and mysqldump for backups)
• Often enterprises already have procedures for this, because MySQL / PostgreSQL /
Oracle is often already used a lot in companies
Thank you!
Course Layout
Concepts: What is Scalability, Designing Scalable Architecture, Distributed Databases, On-Premise vs Cloud, IaaS vs PaaS
Use-Cases and Technologies:
1. Deploying a distributed database using Vagrant and the DigitalOcean API: Vagrant, Cassandra, DigitalOcean API
2. Deploying a highly available, distributed application using Terraform and AWS EB: Terraform, Docker, AWS (VPC, Elastic Beanstalk, RDS, Load Balancer, ECR, S3)
3. Your own Private Cloud using Dokku and Deis: Dokku, Deis, Ansible, CoreOS, Terraform, Digital Ocean API, Kubernetes
Bonus Lecture
• These are the 2 other courses available:
• SCALABLE_BONUS_DEVOPS
• SCALABLE_BONUS_BIGDAT
• Check out the introduction movies for more information about the courses
Spark DataFrames
Dataframe Common Commands
Transformations (are executed lazily): select, selectExpr, distinct, groupBy, sum, orderBy, filter, limit
Actions: show, count, collect, take
Spark Dataframes
pandas:
>>> data.groupby('workclass').workclass.count()
Spark:
>>> dataSpark.groupBy('workclass').count().collect()
Spark Dataframes
>>> sparkF = pyspark.sql.functions.udf(f, pyspark.sql.types.IntegerType())
>>> dataSpark.select(
        sparkF(dataSpark.workclass, dataSpark.final_weight).alias('result')
    ).collect()
Spark Dataframes
• Estimator: a model
(Diagram: training data, as Dataframes, flows through Transformers into an Estimator)