TP1.1 MR Spark Dcejs
Using the Spark cluster of
CentraleSupelec DCE
(Data Center for Education)
• DCE architecture and access
Connection experiment
• HDFS principles & commands
HDFS experiment
• Spark principles & commands
Spark experiment
1 ‐ DCE architecture and access
CPU clusters
[Diagram: CPU clusters of the CentraleSupelec Metz campus. Users ssh to the gateway chome.metz.supelec.fr, then reserve nodes via cslurm (srun…) on three clusters: « Sarah » (32 nodes), « Kyle » (64 nodes) and « John » (heterogeneous, dedicated to research).]
1 ‐ DCE architecture and access
CPU clusters
[Diagram: Spark‐HDFS deployment. The Spark Master & HDFS Name Node run on cluster « Sarah » (nodes sar01–sar16, providing 15 Data Nodes). From the gateway chome.metz.supelec.fr, users run hdfs dfs … and spark‐submit… commands, or go through cslurm (srun…, ssh…) to the nodes of cluster « Kyle » (kyle01–kyle68).]
1 ‐ DCE architecture and access
CPU clusters
[Same diagram: the « dcejs » tool provides this access path to the clusters.]
1 ‐ DCE architecture and access
CPU clusters
• https://tutos.metz.centralesupelec.fr
• Look for dcejs
• Download & install the version for your system
(Windows, Linux, Mac)
• Install a VNC client
(e.g. TigerVNC on Windows)
1 ‐ DCE architecture and access
CPU clusters
1. Launch dcejs
• Select CPU
• Enter your cpusdi1_x login
• Click on « Connect »
CPU clusters
2. Launch dcejs
• ……
• Enter a passphrase
(choose a simple one)
• Click on « Validate »
1 ‐ DCE architecture and access
CPU clusters
3. Launch dcejs
• ……
• Enter a reservation code
(ask the teacher)
• Enter: 4
• Enter: 1
• Enter: normal
• Click on « Validate »
1 ‐ DCE architecture and access
CPU clusters
4. Launch dcejs
• ……
• Click on
« Start VNC »
1 ‐ DCE architecture and access
CPU clusters
5. Launch dcejs
• ……
• Click on « Start VNC »
• Get the local port number
(e.g. 5901)
• Launch your VNC client with default
options (e.g. TigerVNC on Windows)
1 ‐ DCE architecture and access
CPU clusters
6. On Windows:
• Launch your VNC client with all default options
(e.g. TigerVNC on Windows)
• Enter the port number returned by dcejs
6. On Linux & Mac:
• It should be possible to just click on the port
number in the dcejs window.
1 ‐ DCE architecture and access
CPU clusters
7. The desktop of the
remote DCE machine
appears
• You can open a terminal
and start coding (e.g. with
xedit…)
• DCE architecture and access
Connection experiment
• HDFS principles & commands
HDFS experiment
• Spark principles & commands
Spark experiment
2 ‐ HDFS principles and commands
Hadoop software architecture
[Diagram: the Hadoop software stack. Top to bottom: the Map‐Reduce application (data analysis) built on Map‐Reduce algorithmics; the Map‐Reduce paradigm & mechanisms; the resource and application manager; HDFS (Hadoop Distributed File System), fed by file and DB import/export from a Linux/Windows file system; and, at the bottom, standard disks & PCs.]
2 ‐ HDFS principles and commands
HDFS principles
[Diagram: a client issues hdfs dfs commands (‐put, ‐ls, ‐rm, ‐get, …) to the active NameNode, which holds the metadata, chooses the DataNodes, and drives the commands. A file ‐put from the Linux FS is split into block files (here 2 blocks) distributed over the DataNodes.]
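The block mechanism above can be illustrated with a short calculation in plain Python. The 128 MB block size used here is the usual Hadoop 2+ default, an assumption: the DCE cluster may be configured with a different value.

```python
import math

def hdfs_blocks(file_size_bytes, block_size=128 * 1024 * 1024):
    """Return the number of HDFS blocks a file of the given size occupies.
    128 MB is the usual Hadoop 2+ default; the actual cluster value may differ."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size)

# A 200 MB file spans 2 blocks: one full 128 MB block plus a 72 MB tail.
print(hdfs_blocks(200 * 1024 * 1024))  # → 2
```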
HDFS commands on DCE
[Diagram: HDFS on DCE. The NameNode service runs at sar01:9000; nodes sar01–sar16 of cluster « Sarah » host 15 Data Nodes. Client machines (kyle01–kyle68) issue hdfs dfs … commands. HDFS layout: /data is read‐only; each /cpusdi1/cpusdi1_x/ directory (cpusdi1_1 … cpusdi1_19) is read/write for the matching account.]
On a cluster node (client machine):
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data
hadoop fs ‐ls ‐h hdfs://sar01:9000/data (alternative syntax)
Found 2 items
drwxrwxr‐x ‐ root supergroup 0 2019‐10‐18 11:23
hdfs://sar01:9000/data/climate
‐rw‐r‐‐r‐‐ 3 root supergroup 593.5 K 2019‐10‐04 13:58
hdfs://sar01:9000/data/sherlock.txt
(3 = number of replicas; 593.5 K = size)
On a cluster node (client machine):
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data/climate
Found 2 items
drwxr‐xr‐x ‐ root supergroup 0 2019‐10‐18 11:33
hdfs://sar01:9000/data/climate/csv.gz
‐rw‐r‐‐r‐‐ 3 root supergroup 0 2019‐10‐17 23:10
hdfs://sar01:9000/data/climate/ghcnd‐stations.txt
On a cluster node (client machine):
hdfs dfs ‐ls ‐h hdfs://sar01:9000/data/climate/csv.gz
Found 257 items
‐rw‐r‐‐r‐‐ 3 root supergroup 3.3 K 2019‐10‐18 11:23
hdfs://sar01:9000/data/climate/csv.gz/1763.csv.gz
….
….
….
‐rw‐r‐‐r‐‐ 3 root supergroup 183.0 M 2019‐10‐18 11:33
hdfs://sar01:9000/data/climate/csv.gz/2018.csv.gz
‐rw‐r‐‐r‐‐ 3 root supergroup 141.6 M 2019‐10‐18 11:33
hdfs://sar01:9000/data/climate/csv.gz/2019.csv.gz
On a cluster node (client machine):
hdfs dfs ‐cat hdfs://sar01:9000/data/sherlock.txt | more
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
….
On a cluster node (client machine):
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1
(empty: the directory contains nothing yet)
Use YOUR HDFS account
On a cluster node (client machine):
cp ~vialle/DCE‐Spark/RFC793‐TCP.txt .
hdfs dfs ‐put RFC793‐TCP.txt hdfs://sar01:9000/cpusdi1/cpusdi1_1/
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1/
Found 1 items
‐rw‐r‐‐r‐‐ 3 cpusdi1_1 cpusdi1 173.8 K 2019‐10‐21 01:51
hdfs://sar01:9000/cpusdi1/cpusdi1_1/RFC793‐TCP.txt
Use YOUR HDFS account
Using the Spark cluster of
CentraleSupelec DCE
• DCE architecture and access
Connection experiment
• HDFS principles & commands
HDFS experiment
• Spark principles & commands
Spark experiment
3 ‐ Spark principles and commands
Spark deployment on top of HDFS
Spark Master as cluster manager: standalone mode
spark-submit --master spark://node:port … myApp
Default configuration of Spark standalone mode:
• (only) 1 GB per Spark Executor
• To process each job, the Spark Master:
− creates one Executor on each Worker node
− involves all cores of the node in its Executor
⇒ « flooding » strategy: unlimited number of nodes and CPU
cores per application execution
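The flooding strategy can be sketched numerically (a toy model, not Spark code; the node and core counts below are illustrative):

```python
def standalone_allocation(worker_nodes, cores_per_node):
    """Default Spark standalone scheduling (sketch): the Master creates one
    Executor per Worker node and involves every core of each node, so a
    single job floods the whole cluster."""
    executors = worker_nodes               # one Executor per Worker node
    cores = worker_nodes * cores_per_node  # all cores of every node
    return executors, cores

# Illustrative cluster: 15 workers with 16 cores each.
print(standalone_allocation(15, 16))  # → (15, 240)
```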
3 ‐ Spark principles and commands
Spark‐HDFS configuration on DCE
[Diagram: the Spark Master & HDFS Name Node run on cluster « Sarah » (sar01–sar16, 15 Data Nodes). From the gateway, users run hdfs dfs … and spark‐submit… commands, or go through slurm1 (srun…, ssh…) to the nodes of cluster « Kyle » (kyle01–kyle68).]
Default configuration on DCE:
• (only) 15 cores per Spark application
• 15 Data Nodes
⇒ one Spark application gets 1 Executor per Data Node & 1 core per Executor
• 16 (logical) cores per node
⇒ 16 Spark applications can run concurrently on the Spark cluster
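The concurrency figure follows from a one‐line division (a toy check of the arithmetic above):

```python
def max_concurrent_apps(cores_per_node, cores_per_executor):
    """Each Data Node hosts one Executor per running application; with a
    fixed number of cores per Executor, a node with cores_per_node logical
    cores can serve cores_per_node // cores_per_executor applications."""
    return cores_per_node // cores_per_executor

# DCE defaults: 16 logical cores per node, 1 core per Executor.
print(max_concurrent_apps(16, 1))  # → 16
```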
3 ‐ Spark principles and commands
Spark‐HDFS configuration on DCE
Possible tuning on DCE:
• Specify the maximum amount of memory per Spark Executor
spark-submit --executor-memory XX …
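For example, a submission capping the Executor memory could look as follows (the 2G value is purely illustrative; choose it to fit your job):

```shell
# Illustrative only: cap each Spark Executor at 2 GB for this submission
spark-submit --master spark://sar01:7077 --executor-memory 2G wc.py
```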
Spark commands on DCE
[Diagram: the HDFS NameNode service runs at sar01:9000 and the Spark Master service at sar01:7077, on cluster « Sarah » (sar01–sar16, 15 Data Nodes). Client machines (sar17–sar18, kyle01–kyle68) run hdfs dfs … and spark‐submit… HDFS layout: /data is read‐only; each /cpusdi1/cpusdi1_x/ directory is read/write for account cpusdi1_x.]
On a cluster node (client machine):
cp ~vialle/DCE‐Spark/template_wc.py ./wc.py
Edit and update the Python code (use YOUR HDFS account)
spark‐submit ‐‐master spark://sar01:7077 wc.py
hdfs dfs ‐ls ‐h hdfs://sar01:9000/cpusdi1/cpusdi1_1/sherlock.out
hdfs dfs ‐cat hdfs://sar01:9000/cpusdi1/cpusdi1_1/sherlock.out/*
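The template file itself is not reproduced here; as a rough sketch of the word‐count logic such a wc.py typically implements (an assumption: that the template follows the classic map/reduce word count), here is the same computation in plain Python:

```python
from collections import Counter

def word_count(lines):
    """Classic word count: split each line into words (map), then sum the
    occurrences of each word (reduce). Mirrors the usual Spark pattern
    flatMap -> map to (word, 1) -> reduceByKey(add)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

text = ["to be or not to be"]
print(word_count(text))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```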
Remarks (for this lab):
• Use the « template‐xx.py » file to develop your code xx.py,
and execute your code with:
spark‐submit ‐‐master spark://sar01:7077 xx.py
• Write lambdas with the syntax: (lambda a, b : (a[0] + b[0], a[1] + b[1]))
and not: (lambda (v1,n1), (v2,n2) : (v1+v2,n1+n2)), which is invalid in Python 3
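The reason for this remark: Python 3 removed tuple unpacking in function parameters (PEP 3113), so the second form is a syntax error. A minimal check of the first form:

```python
# Valid in Python 3: index into the tuple arguments.
pairwise_sum = lambda a, b: (a[0] + b[0], a[1] + b[1])
print(pairwise_sum((1, 10), (2, 20)))  # → (3, 30)

# Invalid in Python 3 (tuple-parameter unpacking was removed by PEP 3113):
#   lambda (v1, n1), (v2, n2): (v1 + v2, n1 + n2)
```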
Using the Spark cluster of
CentraleSupelec DCE
(Data Center for Education)
Questions?