TP1.1 MR Spark Dcejs

SDI
Metz ‐ Big Data computing models
Using Spark cluster of
CentraleSupelec DCE
(Data Center for Education
Stéphane Vialle & Gianluca Quercini

CentraleSupelec DCE
• DCE architecture and access
 Connection experiment
• HDFS principles & commands
 HDFS experiment
• Spark principles & commands
 Spark experiment
1 ‐ DCE architecture and access
CPU clusters
CentraleSupelec Metz Campus
Cluster «Sarah»
srun… 32 nodes
chome.metz.supelec.fr
ssh ssh Cluster «Kyle»
64 nodes
cslurm
Cluster «John»
heterogeneous
Research
CPU clusters
Spark Master & Cluster « Sarah »
HDFS Name Node
Sar Sar Sar
chome.metz.supelec.fr 01 02 16
hdfs dfs … 15 Data Nodes
ssh ssh spark‐submit…
ssh Kyle Kyle Kyle
01 02 68
cslurm
srun… Cluster « Kyle »
ssh…
srun…
CPU clusters
HDFS Name Node
Sar Sar Sar
chome.metz.supelec.fr 01 02 16
ssh Kyle Kyle Kyle
01 02 68
cslurm
ssh…
srun…
« dcejs »
CPU clusters
• https://tutos.metz.centralesupelec.fr
• Look at dcejs
• Dowload & install the version for your system
(Windows, Linux, Mac)
• Install a VNC client
(ex: TigerVNC on Windows)
CPU clusters
1. Launch dcejs
• Select CPU
• Write your cpusdi1_x
login
• Click on « Connect »
• During the first

connexion to the
DCEyou will need to
enter the passwd of
your DCE login
CPU clusters
2. Launch dcejs
• ……
• Enter a passphrase
(choose a basic one)
• Click on « Validate »
CPU clusters
3. Launch dcejs
• ……
• Enter a reservation
code (ask to the
teacher)
• Enter: 4
• Enter: 1
• Enter : normal
• Click on « Validate »
CPU clusters
4. Launch dcejs
• ……
• Click on
« Start VNC »
CPU clusters
5. Launch dcejs
• ……
• Click on
« Start VNC »
• Get the local port
number
Ex: 5901
• Launch your VNC
client with default
options
(ex: TigerVNC on
Windows)
CPU clusters
6. On windows:
• Launch your VNC client with all default options
(ex: TigerVNC on Windows)
• Enter the port
number returned
by dcejs
6. On Linux & Mac :
• It should be possible to just click on the port
number in the dcejs window.
CPU clusters
7. The desktop of the
remote DCE machine
appears
• You can launch a
terminal, and code or
xedit…
• When you deconnect:

NEVER shut down the
machine!
Use the disconnect
button
CentraleSupelec DCE
 HDFS experiment
2 ‐ HDFS principles and commands
Hadoop software architecture
Map‐Reduce Map‐Reduce application
algorithmic (data analysis)
Map‐Reduce
paradigm & mechanisms
Ressource and application
manager
File
DB import/
export HDFS:
Linux/ Hadoop
Distributed
Windows
file system
File System Standard disks & PC
HDFS principles
name node
hdfs dfs NameNode
active HDFS map/fat
‐put
‐ls
‐rm • Metada
‐get • DataNode choice
… • Commands
DataNode DataNode DataNode DataNode DataNode
data node data node data node data node data node

HDFS principles
name node
hdfs dfs NameNode
active HDFS map/fat
‐put
Linux FS • Metada
• DataNode choice
2 block files: 1 2 • Commands
DataNode DataNode DataNode DataNode DataNode

1 1 1 2 2 2
data node data node data node data node data node
• Files ares splitted into data blocks (64 or 128 Mbytes)

• Each block is replicated (default: x3) to ensure fault tolerance
• The NameNode chooses the Data Nodes storing blocks and replicas
HDFS commands on DCE
HDFS HDFS: NameNode service: sar01:9000
Name Sar Sar Sar /data Read only
Node 01 02 16 /cpusdi1
15 Data Nodes /cpusdi1_1/ R/W for cpusdi1_1
hdfs dfs …
…
Client Kyle Kyle Kyle /cpusdi1_19/ R/W for cpusdi1_19
machines 01 02 68
On a cluster node (client machine):
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data
hadoop fs ‐ls ‐h hdfs://sar01:9000/data (alternative syntax)
Found 2 items
drwxrwxr‐x ‐ root supergroup 0 2019‐10‐18 11:23
hdfs://sar01:9000/data/climate
‐rw‐r‐‐r‐‐ 3 root supergroup 593.5 K 2019‐10‐04 13:58
nb of replicas size hdfs://sar01:9000/data/sherlock.txt
hdfs dfs …
…
machines 01 02 68
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data/climate
Found 2 items
drwxr‐xr‐x ‐ root supergroup 0 2019‐10‐18 11:33
hdfs://sar01:9000/data/climate/csv.gz
‐rw‐r‐‐r‐‐ 3 root supergroup 0 2019‐10‐17 23:10
hdfs://sar01:9000/data/climate/ghcnd‐stations.txt
hdfs dfs …
…
machines 01 02 68
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data/climate/csv.gz
Found 257 items
‐rw‐r‐‐r‐‐ 3 root supergroup 3.3 K 2019‐10‐18 11:23
hdfs://sar01:9000/data/climate/csv.gz/1763.csv.gz
….
….
….
‐rw‐r‐‐r‐‐ 3 root supergroup 183.0 M 2019‐10‐18 11:33
‐rw‐r‐‐r‐‐ 3 root supergroup 141.6 M 2019‐10‐18 11:33
hdfs dfs …
…
machines 01 02 68
hdfs dfs ‐cat hdfs://sar01:9000/data/sherlock.txt | more
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
….
hdfs dfs …
…
machines 01 02 68
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1
empty Use YOUR
HDFS account
hdfs dfs …
…
machines 01 02 68
cp ~vialle/DCE‐Spark/RFC793‐TCP.txt .
hdfs dfs ‐put RFC793‐TCP.txt hdfs://sar01:9000/cpusdi1/cpusdi1_1/
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1/
Found 1 items
‐rw‐r‐‐r‐‐ 3 cpusdi1_1 cpusdi1 173.8 K 2019‐10‐21 01:51
hdfs://sar01:9000/cpusdi1/cpusdi1_1/RFC793‐TCP.txt
Use YOUR HDFS account
CentraleSupelec DCE
 HDFS experiment
3 ‐ Spark principles and commands
Spark deployment on top of HDFS
Spark Master as cluster manager: standalone mode
spark-submit --master spark://node:port … myApp
Spark Master Cluster worker node

 Cluster & Hadoop Data Node
Manager
Cluster worker node
& Hadoop Data Node
HDFS
Name Node Cluster worker node
& Hadoop Data Node
Cluster Spark Master Spark app. Driver

deployment mode:  Cluster • DAG builder
Manager • DAG scheduler‐optimizer
• Task scheduler
HDFS Spark Spark

Name Node executor executor
Spark Spark Spark
executor executor executor
Spark deployment on top of HDFS
Spark Master as cluster manager: standalone mode
spark-submit --master spark://node:port … myApp
Spark Master Cluster worker node

 Cluster & Hadoop Data Node
Manager
Cluster worker node
& Hadoop Data Node
HDFS
Name Node Cluster worker node
& Hadoop Data Node
Default configuration of Spark standalone mode:
• (only) 1GB/Spark Executor
• To process each job, the Spark Master:
− creates one Executor on each Worker node
− involves all cores of the node in its Executor
 flooding strategy: unlimited nb of nodes and CPU
cores per application execution
Spark‐HDFS configuration on DCE
HDFS Name Node
Sar Sar Sar
phome…. 01 02 16
ssh Kyle Kyle Kyle
01 02 68
ssh… slurm1
srun…
Default configuration on DCE:
• (only) 15 cores per Spark application
• 15 Data nodes
 One Spark application: 1Executor/Data Node & 1 core/Executor
• 16 (logical) cores per node
 16 Spark applications can run concurrently on the Spark cluster
Spark‐HDFS configuration on DCE
HDFS Name Node
Sar Sar Sar
phome…. 01 02 16
ssh Kyle Kyle Kyle
01 02 68
ssh… slurm1
srun…
Possible tunning on DCE:
• Specify the maximum amount of memory per Spark Executor
spark-submit --executor-memory XX …
• Specify the total amount of CPU cores used to process one

Spark application (through all its Spark executors)
spark-submit --total-executor-cores YY …
Spark commands on DCE
Spark Master HDFS NameNode service: sar01:9000
& HDFS Name Sar Sar Sar /data Read only
hdfs dfs …
spark‐submit… …
Sar
Kyle Sar
Kyle Kyle /cpusdi1_19/ R/W for cpusdi1_19
17
01 02
18 68
Client machines Spark Master service: sar01:7077
cp ~vialle/DCE‐Spark/template_wc.py ./wc.py
 Edit and update the Python code Use YOUR
HDFS account
spark‐submit ‐‐master spark://sar01:7077 wc.py
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1/sherlock.out
hdfs dfs ‐cat hdfs://sar01:9000/cpusdi1/cpusdi1_1/sherlock.out/*
Spark commands on DCE
Spark Master HDFS NameNode service: sar01:9000
& HDFS Name Sar Sar Sar /data Read only
hdfs dfs …
spark‐submit… …
Sar
Kyle Sar
Kyle Kyle /cpusdi1_19/ R/W for cpusdi1_19
17
01 02
18 68
Client machines Spark Master service: sar01:7077
Remarks (for this lab):
• Use the « template‐xx.py » file to develop your code xx.py
and execute your code with :
spark‐submit ‐‐master spark://sar01:7077 xx.py
• Write lambda with syntax: (lambda a, b : (a[0] + b[0], a[1] + b[1]))
(lambda (v1,n1), (v2,n2) : (v1+v2,n1+n2))
CentraleSupelec DCE
(Data Center for Education)
Questions ?

TP1.1 MR Spark Dcejs

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

TP1.1 MR Spark Dcejs

Uploaded by

Copyright:

Available Formats

SDI

Metz ‐ Big Data computing models

Stéphane Vialle & Gianluca Quercini

• During the first

• When you deconnect:

DataNode DataNode DataNode DataNode DataNode

data node data node data node data node data node

DataNode DataNode DataNode DataNode DataNode

• Files ares splitted into data blocks (64 or 128 Mbytes)

Spark Master Cluster worker node

Cluster Spark Master Spark app. Driver

HDFS Spark Spark

Spark Master Cluster worker node

• Specify the total amount of CPU cores used to process one

You might also like