Spark in Docker in Kubernetes: A Practical Approach for Scalable NLP
by Jürgen Schmidl, Towards Data Science
This article is part of a larger project. If you are also interested in scalable web scraping or building highly scalable dashboards, you will find corresponding links at the end of the article.
Table of Contents
1. Prerequisites for the reader
2. Introduction
2.1 Purpose of this Project
2.2 Introduction to scalable NLP
3. Architecture
4. Setup
6. Deploy to Kubernetes
6.1 Set up Kubernetes Cluster
6.2 Set up Redis as a Kubernetes service
6.3 Fill Redis queue with tasks
6.4 Deploy a Docker Container
6.5 Check results
1. Prerequisites for the reader
This article is aimed at readers who already have some experience with the Google Cloud Platform and the Linux shell. To help new readers get started, links to additional resources can be found within this article. If you haven’t worked with the Google Cloud Platform yet, you can use Google’s free trial program.
2. Introduction
2.1 Purpose of this Project
The goal of this article is to show how entities (e.g. Docker or Hadoop) can be extracted from articles (based on the structure of Towards Data Science) in a scalable way using NLP. We will also look at how other NLP methods, such as POS tagging, can be used.
2.2 Introduction to scalable NLP
Apache Spark
Spark is a great way to make data processing and machine learning scalable. It can be run locally or on a cluster and uses distributed datasets as well as processing pipelines. Further information about Spark can be found here:
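To make this concrete, here is a minimal, hypothetical PySpark sketch (not part of this project's code): the same transformation runs unchanged whether Spark is started locally or on a cluster.

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster, only the master URL changes
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# A distributed DataFrame; its partitions can be spread across workers
df = spark.createDataFrame([("Docker",), ("Hadoop",), ("Docker",)], ["word"])
print(df.groupBy("word").count().collect())

spark.stop()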
Spark-NLP
Spark-NLP is a library for Python and Scala that allows you to process written language with Spark. It will be presented in the following chapters. More information can be found here:
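To give a first impression, here is a minimal sketch of what working with Spark-NLP looks like; it assumes the pretrained “explain_document_dl” pipeline can be downloaded and is an illustration, not this project's code.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with the Spark-NLP jars loaded
spark = sparknlp.start()

# Download a pretrained pipeline (tokenizer, POS tagger, NER, ...)
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("Docker and Hadoop are popular tools.")
print(result["entities"])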
Redis
Redis is a key-value store that we will use to build a task queue.
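Such a task queue can be as simple as a Redis list. Here is a minimal sketch using the redis-py client; host, port and queue name are placeholders, not this project's configuration.

import redis

# Connect to the Redis instance (host and port are placeholders)
r = redis.Redis(host="localhost", port=6379)

# Producer: push a job onto the queue
r.lpush("explainer_queue", "12345/Some article text")

# Consumer: pop a job, blocking until one is available
_, job = r.brpop("explainer_queue")
print(job.decode("utf-8"))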
3. Architecture
To start with, this is what our architecture will look like:
As you can see, this approach is a batch architecture. The Python script processes text
stored in the Google Datastore and creates a job queue. This queue will be processed by
the Kubernetes pods and the results are written into BigQuery.
To keep the tutorial as short as possible, only the more computationally intensive
language processing part scales. The task queue creator is a simple Python script
(which could also run in Docker).
4. Setup
Start Google Cloud Shell
We will work with the Cloud Console from Google. To open it, you need to create a project and activate billing.
Then you should see the Cloud Shell button in the upper right corner.
After clicking it, the shell should open in the lower part of the window (if you run into trouble, use the Chrome browser). To work comfortably with the shell, I recommend starting the editor:
Then change into the project directory:
$cd TWD-01-2020/5_NLP
Please run:
$export Project="yourprojectID"
Setup Input-data:
(If you have completed the first part of the project, you can skip this.)
We use Google Cloud Datastore in Datastore mode to provide the source data. In order to prepare the Datastore, you have to put it into Datastore mode. To do this, simply search for Datastore in the Cloud Platform and click on “Select Datastore Mode”. (If needed, choose a location as well.)
$cd .. ; cd 4_Setup
(You may have to enter this command manually)
$python3 Create_Samples.py
If you see “Samples generated”, you have got 20 sample entries in your Cloud Datastore.
Setup Output-Tables
To store the processed data, we create several BigQuery tables using the following bash script:
$bash Create_BQ_Tables.sh
Once all tables have been created, we have all the required resources in place.
The “5_NLP” folder contains the following files:
Explainer.py
The main script. Here, Spark is started and the processing pipeline is created and filled with the pipeline model. The text processing also takes place here.
Model_Template.zip
An example model that extracts entities, i.e. proper names, from the texts.
sa.json
Your Google Cloud service account. If you run into 404 or 403 errors, please check the permissions granted to this service account in the IAM.
Dockerfile
This file contains the setup for the environment of the script. The exact setup is
explained below.
requirements.txt
This file lists the required Python libraries, which are installed during the creation of the Docker image.
Explainer.yaml
This file contains information on how Kubernetes should handle the Docker image.
First of all, you may need to specify the name of your service account file (if you haven’t stuck to sa.json):
def __init__(self):
    """Initialisation of this class"""
    print("Init")
    # Imports
    [....]

    # Define your Service-Account.json
    self.sa = 'sa.json'

    [....]
The script’s entry point uses Peter Hoffmann’s Redis class to query the Redis instance
regularly for new entries in the task queue. We haven’t set up the instance, so the script
will not work yet.
if __name__ == "__main__":

    # Create instance of class
    e = SparkNLP_Explainer()
    print("Worker has been started")
    while 1 == 1:  # Wait for work

        # Check if queue is empty
        if e.q.empty() is False:

            # Split string to get ID as well as text
            Job = str(e.q.get().decode("utf-8")).split("/")

            # Write string contents to variables
            ID = Job[0]
            print(ID)
            Text = Job[1]

            # Start Explainer
            e.Explain(Text, ID)
            print("Text has been successfully processed")
As soon as a task arrives in the task queue, the “Explain” function is called, where the
processing takes place.
[....]
else:
    print("No entities found")
As you can see, the actual logic is located in the model that is stored in self.Model. This model contains all the important steps for the NLP, such as tokenizing, lemmatizing or entity tagging, and is unpacked from a ZIP file by the function Load_Model(). To build a model yourself, please refer to this notebook:
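For orientation, here is a hypothetical sketch of how such a pipeline model can be assembled with Spark-NLP; the concrete stages and pretrained model names are assumptions, not the contents of Model_Template.zip.

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel

spark = sparknlp.start()

# Each stage annotates the output of the previous one
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")

pipeline = Pipeline(stages=[document, token, embeddings, ner])
model = pipeline.fit(spark.createDataFrame([[""]], ["text"]))
model.write().overwrite().save("my_model")  # the saved model can then be zipped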
A Dockerfile allows us to create a complete system using one file. The most important commands are listed below; a short example follows the list.
FROM: Sets the base image. A base image can be a native operating system, but other programs may already have been installed on it.
ENV: Spark needs some environment variables to work. With the ENV command, these are set for the Docker container.
COPY and WORKDIR: COPY copies the entire parent directory of the Dockerfile into the container, and WORKDIR sets this directory as the working directory.
RUN: Calls commands that are executed in the Docker container’s shell. Usually used to install applications.
CMD: A Dockerfile can only have one CMD; here, the actual Python script is called. The -u flag is important to get logs from the container.
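Putting these commands together, a minimal Dockerfile along these lines could look as follows; the base image, package choices and paths are assumptions, not the author's exact file.

FROM openjdk:8-jdk-slim

# Spark expects a Python interpreter; point it to python3
ENV PYSPARK_PYTHON=python3

COPY . /app
WORKDIR /app

# Install Python and the libraries listed in requirements.txt
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install -r requirements.txt

# -u keeps Python output unbuffered so the container logs appear immediately
CMD ["python3", "-u", "Explainer.py"]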
To build the Docker image, please change the directory to “5_NLP” and execute the following commands:
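A typical build command, with the image name taken from the deployment .yaml shown later, would be, for example:
$docker build -t gcr.io/$Project/nlp_explainer:latest .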
This command builds the Docker image from the Dockerfile in this directory. We
cannot start it yet because the Redis instance is not running, but we have successfully
created the image.
To run it later on a Kubernetes cluster, we have to push the image into the Container Registry. To do so, activate the API in the Google Cloud Platform by searching for “Container Registry”. Afterwards, run the following commands:
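For example, with the same registry path as above:
$gcloud auth configure-docker
$docker push gcr.io/$Project/nlp_explainer:latest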
You should now be able to see your image in the Container Registry:
If this worked so far, we can now move on to the Kubernetes cluster and get this project
to work.
6. Deploy to Kubernetes
6.1 Set up Kubernetes Cluster
This part is quite simple, since Google allows us to create a Kubernetes cluster by
command line. You can run this command to create a very small cluster.
$bash Create_Cluster.sh
The creation may take a few minutes. If you want to create a bigger cluster, check out Kubernetes Engine on the Google Cloud Platform. If you created the cluster using the web interface of Kubernetes Engine, you first need to connect your console to the cluster. You can get the command by clicking on “Connect”:
6.2 Set up Redis as a Kubernetes service
To do this, the Redis container must be created from a .yaml file located in the folder “6_Scheduler”. Run:
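For example (the file name is an assumption and may differ in the repository):
$kubectl apply -f Redis.yaml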
If you take a closer look at the .yaml files, you will see that you can specify all the settings needed there. The line “replicas:” is of particular importance, because its value defines the number of parallel instances and therefore the capacity for processing data (limited, of course, by the underlying machine).
We work on quite a small machine, so we shouldn’t create more than one replica.
If the creation was successful, you should see the following output:
And here you can see the service which provides connectivity to the other pods:
6.3 Fill Redis queue with tasks
Afterwards, the Python script can be called. It retrieves the data from the Cloud Datastore, preprocesses it and puts it into the Redis task queue.
You can start the script with:
$python3 Scheduler.py
def Process_Batch(self):
    """ Starts the Scheduler """

    # Set variable for loop
    loop = True

    # Start loop
    while loop is True:

        # Get batch of articles
        Batch = self.get_next_Batch(10)

        # Iterate through batch
        for Article in Batch:

            # Check if article has already been processed
            if Article.key.id_or_name not in self.all_processed_entities:

                # Combine and send article to Redis
                self.Send_Job(Article['Title'] + " " +
                              Article['Text'], str(Article.key.id_or_name))

                # Create masterdata for article
                self.create_Masterdata_for_Article(Article)
            else:
                print("Article already processed")
            print("Queue size: " + str(self.q.qsize()))

        # Check if cursor has reached EOF (end of file)
        if self.next_cursor is None:
            loop = False
            print("No more entries")
The Process_Batch method contains the actual logic. Here, the articles are read from the Cloud Datastore in batches and passed to the Send_Job method.
Since Redis does not cope well with special characters, these are removed to ensure smooth processing.
The created jobs are then stored in the Redis database with the .put command.
Note: Checking whether a regex replacement is needed at all is about ten times faster than performing the replacement. If special characters are already taken into account when filling the Datastore, the scheduler can work much faster.
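A minimal sketch of this check-before-replace idea (the pattern is an assumption; the article does not show the exact regex used):

import re

# Characters to strip before queueing (placeholder pattern)
SPECIAL = re.compile(r"[^A-Za-z0-9 ]")

def clean(text):
    # Cheap check first: only pay for re.sub when something actually matches
    if SPECIAL.search(text):
        return SPECIAL.sub("", text)
    return text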
The task queue is filled
$pkill kubectl -9
6.4 Deploy a Docker Container
For this purpose, a .yaml file containing all the relevant information for Kubernetes is used again.
Please note that you have to insert your own container from the registry (image:)!
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: explainer
  labels:
    name: explainer
spec:
  replicas:
  template:
    metadata:
      labels:
        name: explainer
    spec:
      containers:
      - name: tdw-explainer
        image: gcr.io/[your Project]/nlp_explainer:latest
To apply the .yaml file, you just need to execute the following command in the “5_NLP” folder:
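For example:
$kubectl apply -f Explainer.yaml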
If you use the small cluster, I recommend deploying just one replica! Otherwise you will run into problems caused by a lack of computing power.
After the creation of the container, you should see the pods with the following command (this can take up to 2 minutes):
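For example:
$kubectl get pods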
To inspect the logs of a pod, run, for example:
$kubectl logs explainer-544d123125-7nzbg
You can also see errors in the log if something went wrong.
If everything has been processed, the pod stays idle or will be evicted and recreated by Kubernetes (because it runs in a loop and never finishes). To stop it, delete the deployment, for example:
$kubectl delete deployments explainer
If you don’t want to use the Kubernetes cluster anymore, you should delete it either via the web interface or by using the following commands, otherwise Google will continue billing you. For example:
$gcloud config set project $Project
$gcloud container clusters delete "your-first-cluster-1" --zone "us-central1-a"
6.5 Check results
You should be able to see the following tables in the bottom left:
The tables “Article_masterdata” and “Article_tags” have been created by the Scheduler to serve the needs of the follow-up project. But we want to see the content of the “Entitiy_raw” table.
You should see the results of the entity recognition with the respective article ID.
If you are interested in seeing how to create a highly scalable dashboard based on this data,
we would be happy if you read the following tutorial: Build a highly scalable dashboard
that runs on Kubernetes by Arnold Lutsch
You could also replace the sample data with real articles using this tutorial: Build a
scalable webcrawler for towards data science with Selenium and python by Philipp
Postels