Spark in Docker in Kubernetes: A Practical Approach for Scalable NLP
by Jürgen Schmidl, Towards Data Science
This article is part of a larger project. If you are also interested in scalable web scraping or building highly scalable dashboards, you will find corresponding links at the end of the article.
Table of Contents
1. Prerequisites for the reader
2. Introduction
2.1 Purpose of this Project
2.2 Introduction to scalable NLP
3. Architecture
4. Setup
6. Deploy to Kubernetes
6.1 Set up Kubernetes Cluster
6.2 Set up Redis as a Kubernetes service
6.3 Fill Redis queue with tasks
6.4 Deploy a Docker Container
6.5 Check results
1. Prerequisites for the reader
This article is aimed at readers who already have some experience with the Google Cloud Platform and the Linux shell. To help new readers get started, links to additional resources can be found within this article. If you haven’t worked with the Google Cloud Platform yet, you can use Google’s free trial program.
2. Introduction
2.1 Purpose of this Project
The goal of this article is to show how entities (e.g. Docker or Hadoop) can be extracted from articles (based on the structure of Towards Data Science) in a scalable way using NLP. We will also look at how other NLP methods, such as POS tagging, can be used.
2.2 Introduction to scalable NLP
Apache Spark
Spark is a great way to make data processing and machine learning scalable. It can be run locally or on a cluster and uses distributed datasets as well as processing pipelines. Further information about Spark can be found here:
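To make this concrete, here is a minimal, hypothetical PySpark sketch (not part of this project's code): the same transformation runs unchanged whether Spark is started locally or on a cluster.

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster, only the master URL changes
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# A distributed DataFrame; its partitions can be spread across workers
df = spark.createDataFrame([("Docker",), ("Hadoop",), ("Docker",)], ["word"])
print(df.groupBy("word").count().collect())

spark.stop()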
Spark-NLP
Spark-NLP is a library for Python and Scala that allows you to process written language with Spark. It will be presented in the following chapters. More information can be found here:
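To give a first impression, here is a minimal sketch of what working with Spark-NLP looks like; it assumes the pretrained “explain_document_dl” pipeline can be downloaded and is an illustration, not this project's code.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with the Spark-NLP jars loaded
spark = sparknlp.start()

# Download a pretrained pipeline (tokenizer, POS tagger, NER, ...)
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("Docker and Hadoop are popular tools.")
print(result["entities"])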
Redis
Redis is a key-value store that we will use to build a task queue.
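Such a task queue can be as simple as a Redis list. Here is a minimal sketch using the redis-py client; host, port and queue name are placeholders, not this project's configuration.

import redis

# Connect to the Redis instance (host and port are placeholders)
r = redis.Redis(host="localhost", port=6379)

# Producer: push a job onto the queue
r.lpush("explainer_queue", "12345/Some article text")

# Consumer: pop a job, blocking until one is available
_, job = r.brpop("explainer_queue")
print(job.decode("utf-8"))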
3. Architecture
To start with, this is what our architecture will look like:
As you can see, this approach is a batch architecture. The Python script processes text
stored in the Google Datastore and creates a job queue. This queue will be processed by
the Kubernetes pods and the results are written into BigQuery.
To keep the tutorial as short as possible, only the more computationally intensive
language processing part scales. The task queue creator is a simple Python script
(which could also run in Docker).
4. Setup
Start Google Cloud Shell
We will work with the Cloud Console from Google. To open it, you need to create a project and activate billing.
Then you should see the Cloud Shell button in the upper right corner.
After clicking it, the shell should open in the lower part of the window (if you run into trouble, use the Chrome browser). To work comfortably with the shell, I recommend starting the editor:
Then change into the project directory:
$cd TWD-01-2020/5_NLP
Please run:
$export Project="yourprojectID"
Setup Input-data:
(If you have completed the first part of the project, you can skip this.)
We use Google Cloud Datastore in Datastore mode to provide the source data. In order to prepare the Datastore, you have to put it into Datastore mode. To do this, simply search for Datastore in the Cloud Platform and click on “Select Datastore Mode”. (If needed, choose a location as well.)
$cd .. ; cd 4_Setup
(You may have to enter this command manually)
$python3 Create_Samples.py
If you see “Samples generated”, you have got 20 sample entries in your Cloud Datastore.
Setup Output-Tables
To store the processed data, we create several BigQuery tables using the following bash script:
$bash Create_BQ_Tables.sh
Once all tables have been created, we have all the required resources in place.
The “5_NLP” folder contains the following files:
Explainer.py
The main script. Here, Spark is started and the processing pipeline is created and filled with the pipeline model. The text processing also takes place here.
Model_Template.zip
An example model that extracts entities, i.e. proper names, from the texts.
sa.json
Your Google Cloud service account. If you run into 404 or 403 errors, please check the permissions granted to this service account in the IAM.
Dockerfile
This file contains the setup for the environment of the script. The exact setup is
explained below.
requirements.txt
This file lists the required Python libraries, which are installed during the creation of the Docker image.
Explainer.yaml
This file contains information on how Kubernetes should handle the Docker image.
First of all, you may need to specify the name of your service account file (if you haven’t stuck to sa.json):
def __init__(self):
    """Initialisation of this class"""
    print("Init")
    # Imports
    [....]

    # Define your Service-Account.json
    self.sa = 'sa.json'

    [....]
The script’s entry point uses Peter Hoffmann’s Redis class to query the Redis instance
regularly for new entries in the task queue. We haven’t set up the instance, so the script
will not work yet.
if __name__ == "__main__":

    # Create instance of class
    e = SparkNLP_Explainer()
    print("Worker has been started")
    while 1 == 1:  # Wait for work

        # Check if queue is empty
        if e.q.empty() is False:

            # Split string to get ID as well as text
            Job = str(e.q.get().decode("utf-8")).split("/")

            # Write string contents to variables
            ID = Job[0]
            print(ID)
            Text = Job[1]

            # Start Explainer
            e.Explain(Text, ID)
            print("Text has been successfully processed")
As soon as a task arrives in the task queue, the “Explain” function is called, where the
processing takes place.
[....]
else:
    print("No entities found")
As you can see, the actual logic is located in the model that is stored in self.Model. This model contains all the important steps for the NLP, such as tokenizing, lemmatizing or entity tagging, and is unpacked from a ZIP file by the function Load_Model(). To build a model yourself, please refer to this notebook:
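For orientation, here is a hypothetical sketch of how such a pipeline model can be assembled with Spark-NLP; the concrete stages and pretrained model names are assumptions, not the contents of Model_Template.zip.

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel

spark = sparknlp.start()

# Each stage annotates the output of the previous one
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")

pipeline = Pipeline(stages=[document, token, embeddings, ner])
model = pipeline.fit(spark.createDataFrame([[""]], ["text"]))
model.write().overwrite().save("my_model")  # the saved model can then be zipped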
A Dockerfile allows us to create a complete system using one file. The most important commands are listed below; a short example follows the list.
FROM: Sets the base image. A base image can be a native operating system, but other programs may already have been installed on it.
ENV: Spark needs some environment variables to work. With the ENV command, these are set for the Docker container.
COPY and WORKDIR: COPY copies the entire parent directory of the Dockerfile into the container, and WORKDIR sets this directory as the working directory.
RUN: Calls commands that are executed in the Docker container’s shell. Usually used to install applications.
CMD: A Dockerfile can only have one CMD; here, the actual Python script is called. The -u flag is important to get logs from the container.
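Putting these commands together, a minimal Dockerfile along these lines could look as follows; the base image, package choices and paths are assumptions, not the author's exact file.

FROM openjdk:8-jdk-slim

# Spark expects a Python interpreter; point it to python3
ENV PYSPARK_PYTHON=python3

COPY . /app
WORKDIR /app

# Install Python and the libraries listed in requirements.txt
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install -r requirements.txt

# -u keeps Python output unbuffered so the container logs appear immediately
CMD ["python3", "-u", "Explainer.py"]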
To build the Docker image, please change the directory to “5_NLP” and execute the following commands:
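A typical build command, with the image name taken from the deployment .yaml shown later, would be, for example:
$docker build -t gcr.io/$Project/nlp_explainer:latest .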
This command builds the Docker image from the Dockerfile in this directory. We
cannot start it yet because the Redis instance is not running, but we have successfully
created the image.
To run it later on a Kubernetes cluster, we have to push the image into the Container Registry. To do so, activate the API in the Google Cloud Platform by searching for “Container Registry”. Afterwards, run the following commands:
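For example, with the same registry path as above:
$gcloud auth configure-docker
$docker push gcr.io/$Project/nlp_explainer:latest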
You should now be able to see your image in the Container Registry:
If this worked so far, we can now move on to the Kubernetes cluster and get this project
to work.
6. Deploy to Kubernetes
6.1 Set up Kubernetes Cluster
This part is quite simple, since Google allows us to create a Kubernetes cluster by
command line. You can run this command to create a very small cluster.
$bash Create_Cluster.sh
The creation may take a few minutes. If you want to create a bigger cluster, check out Kubernetes Engine on the Google Cloud Platform. If you created the cluster using the web interface of Kubernetes Engine, you first need to connect your console to the cluster. You can get the command by clicking on “Connect”:
6.2 Set up Redis as a Kubernetes service
To do this, the Redis container must be created from a .yaml file located in the folder “6_Scheduler”. Run:
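For example (the file name is an assumption and may differ in the repository):
$kubectl apply -f Redis.yaml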
If you take a closer look at the .yaml files, you will see that you can specify all the settings needed there. The line “replicas:” is of particular importance, because its value defines the number of parallel instances and therefore the capacity for processing data (limited, of course, by the underlying machine).
We work on quite a small machine, so we shouldn’t create more than one replica.
If the creation was successful, you should see the following output:
And here you can see the service which provides connectivity to the other pods:
6.3 Fill Redis queue with tasks
Afterwards, the Python script can be called. It retrieves the data from the Cloud Datastore, preprocesses it and puts it into the Redis task queue.
You can start the script with:
$python3 Scheduler.py
def Process_Batch(self):
    """ Starts the Scheduler """

    # Set variable for loop
    loop = True

    # Start loop
    while loop is True:

        # Get batch of articles
        Batch = self.get_next_Batch(10)

        # Iterate through batch
        for Article in Batch:

            # Check if article has already been processed
            if Article.key.id_or_name not in self.all_processed_entities:

                # Combine and send article to Redis
                self.Send_Job(Article['Title'] + " " +
                              Article['Text'], str(Article.key.id_or_name))

                # Create masterdata for article
                self.create_Masterdata_for_Article(Article)
            else:
                print("Article already processed")
            print("Queue size: " + str(self.q.qsize()))

        # Check if cursor has reached EOF (end of file)
        if self.next_cursor is None:
            loop = False
            print("No more entries")
The Process_Batch method contains the actual logic. Here, the articles are read from the Cloud Datastore in batches and passed to the Send_Job method.
Since Redis does not cope well with special characters, these are removed to ensure smooth processing.
The created jobs are then stored in the Redis database with the .put command.
Note: Checking whether a regex replacement is needed at all is about ten times faster than performing the replacement. If special characters are already taken into account when filling the Datastore, the scheduler can work much faster.
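A minimal sketch of this check-before-replace idea (the pattern is an assumption; the article does not show the exact regex used):

import re

# Characters to strip before queueing (placeholder pattern)
SPECIAL = re.compile(r"[^A-Za-z0-9 ]")

def clean(text):
    # Cheap check first: only pay for re.sub when something actually matches
    if SPECIAL.search(text):
        return SPECIAL.sub("", text)
    return text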
The task queue is filled
$pkill kubectl -9
6.4 Deploy a Docker Container
For this purpose, a .yaml file containing all the relevant information for Kubernetes is used again.
Please note that you have to insert your own container from the registry (image:)!
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: explainer
  labels:
    name: explainer
spec:
  replicas:
  template:
    metadata:
      labels:
        name: explainer
    spec:
      containers:
      - name: tdw-explainer
        image: gcr.io/[your Project]/nlp_explainer:latest
To apply the .yaml file, you just need to execute the following command in the “5_NLP” folder:
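For example:
$kubectl apply -f Explainer.yaml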
If you use the small cluster, I recommend deploying just one replica! Otherwise you will run into problems caused by a lack of computing power.
After the creation of the container, you should see the pods with the following command (this can take up to 2 minutes):
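For example:
$kubectl get pods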
To inspect the logs of a pod, run, for example:
$kubectl logs explainer-544d123125-7nzbg
You can also see errors in the log if something went wrong.
If everything has been processed, the pod stays idle or will be evicted and recreated by Kubernetes (because it runs in a loop and never finishes). To stop it, delete the deployment, for example:
$kubectl delete deployments explainer
If you don’t want to use the Kubernetes cluster anymore, you should delete it either via the web interface or by using the following commands, otherwise Google will continue billing you. For example:
$gcloud config set project $Project
$gcloud container clusters delete "your-first-cluster-1" --zone "us-central1-a"
6.5 Check results
You should be able to see the following tables in the bottom left:
The tables “Article_masterdata” and “Article_tags” have been created by the Scheduler to serve the needs of the follow-up project. But we want to see the content of the “Entitiy_raw” table.
You should see the results of the entity recognition with the respective article ID.
If you are interested in seeing how to create a highly scalable dashboard based on this data,
we would be happy if you read the following tutorial: Build a highly scalable dashboard
that runs on Kubernetes by Arnold Lutsch
You could also replace the sample data with real articles using this tutorial: Build a
scalable webcrawler for towards data science with Selenium and python by Philipp
Postels