1 Introduction
2 Prerequisites
3 Python virtual environments
3.1 Installing Python in Ubuntu 22.04
3.2 Installing Python in macOS
3.3 Creating a Python virtual environment
3.4 Switching between environments
3.5 Installing packages in a Python virtual environment
3.6 Deleting/reinstalling a Python virtual environment
4 What is Git?
5 Installing and configuring the Git client
6 Creating a project in Azure DevOps
7 Cloning the project
7.1 Accessing the Azure project page
7.2 Finding the Git repository URL and cloning the project
8 Building and testing the project
9 Installing and running JupyterLab
10 Installing PyCharm
10.1 Ubuntu 22.04
10.2 macOS
10.3 First time running PyCharm
11 Opening & configuring the project with PyCharm
12 Python packages & project structure
12.1 gitignore
12.2 Python package declaration or setup.py
12.3 Delivery folder and build script
12.4 Continuous integration files
13 The Docknet library
13.1 The Jupyter notebooks
13.2 The Docknet main class
13.2.1 Docstrings
13.2.2 Type hints
13.2.3 Getter and setter methods
13.3 Data generators and class inheritance
13.4 Activation functions and custom exceptions
13.5 Cost functions
13.6 Initializers and abstract classes
13.7 Docknet layers
13.7.1 Special class methods __getattr__ and __setattr__
13.7.2 Docknet input layer
13.7.3 Docknet dense layer
13.8 Optimizers
13.9 Utility functions and classes
14 Unit testing with pytest & PyCharm
14.1 How to run unit tests
14.2 Organizing unit tests
14.3 pytest fixtures
14.4 Parameterized pytests
14.5 A bigger pytest, and a word on mocking
14.6 A note on code refactoring
15 Working with Git branches
15.1 Git Workflow
15.2 Git Branches
15.3 Git merge conflicts
16 Object serialization and deserialization
16.1 JSON
16.2 pickle
17 Python package commands
18 Resource files
19 Configuration files
20 Web services in Python with Flask
21 Docker
21.1 Installing Docker
21.2 The Dockerfile
21.3 Building, running and deleting a Docker image
21.4 Docker tutorial
22 Continuous integration
22.1 Creating an Azure DevOps pipeline
22.2 Other CI uses and systems
23 Proposed challenges
23.1 Challenge 1: New data generator
23.2 Challenge 2: New activation function
23.3 Challenge 3: Cross-entropy for multi-class classification
23.4 Challenge 4: Xavier’s initializer
23.5 Challenge 5: Dropout layer
23.6 Challenge 6: Momentum and RMSProp optimizers
1 Introduction
This guide describes a way of collaboratively developing software in local machines based on
industrial standards. The methods and tools described here are focused on Python projects
that implement processing pipelines involving one or more machine learning models. We
provide instructions on how to install and use these tools in Linux-based systems, namely
Ubuntu and macOS. Due to differences between MS Windows and Linux-based systems,
developing code that runs on both kinds of OS requires extra effort and care.1 Usually the
code we develop is, at a final stage, to run as a service in a Linux-based Docker container,
hence portability of the code between Windows and Linux OSs is not a must. For this reason,
we advise developing on Linux-based OSs only (including macOS or Linux virtual machines
within Windows) to avoid potential problems.
Note we build together pieces of software that serve as the foundation of other pieces (e.g.,
a data pre-processing step that is needed for later training a model with a particular machine
learning algorithm). We need a mechanism that allows us to share the code amongst the
team, as well as to run the code independently of the person who developed it and the
machine where it was initially developed.
For the sake of this guide, we will use an example of Python project called Docknet, a pure
NumPy implementation of neural networks that can be used to learn the math and algorithms
behind neural nets. The code has been made open source and published in GitHub:
https://github.com/Accenture/Docknet
You may follow this training guide in teams of up to 5 persons. Each team will have to create
an Azure DevOps project using the trial version as described in section 6, since it is not allowed
to use The Dock’s Azure DevOps space for the sake of training.2 The Docknet source code is
to be uploaded to a Git repository in that project so that you can share the same repository
(explained as well in section 6). The first sections of this guide describe how to install and
configure the tools that you will need, so continue reading and following the steps before
jumping to section 6.
1 For instance, Linux operating systems use the slash as the file path separator, while Windows uses the
backslash. We will always have to use Python function os.path.join(folder1, folder2,…, file1)
to generate paths with the proper file separator independently of the OS, instead of hardcoding the file
separator as a string of the form ‘folder1/folder2/file1’. Another typical interoperability problem
arises when running unit tests that compare multiline strings, since Linux uses code ‘\n’ as end of line while
Windows uses ‘\r\n’. For the tests to pass in both systems one possible workaround is to systematically
remove all ‘\r’ characters before doing string comparisons, which adds boilerplate to the tests.
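The two pitfalls in this footnote can be sketched in a few lines of Python (the folder names and sample strings are illustrative):

```python
import os

# Build a path with the OS-specific separator instead of hardcoding '/'
path = os.path.join("folder1", "folder2", "file1.txt")
print(path)  # 'folder1/folder2/file1.txt' on Linux/macOS, backslashes on Windows

# Normalize Windows line endings before comparing multiline strings in tests
expected = "line 1\nline 2\n"
actual = "line 1\r\nline 2\r\n"  # e.g., content of a file written on Windows
assert actual.replace("\r", "") == expected
```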
2 Note the trial version of Azure DevOps does not allow for more than 5 members in a single project, hence the
limit.
2 Prerequisites
Basic knowledge of the Linux / macOS command line is recommended. A good free book can
be found here:
http://linuxcommand.org/tlcl.php
While it is not necessary to know the whole content of this book, basic knowledge of the Linux
shell (chapter 1), the file system (chapters 2 to 4), file permissions, the sudo command and
the vi editor is advised.
This guide provides specific instructions for Ubuntu 22.04 and macOS. In case you have a
Windows machine, you will first need to install a Ubuntu 22.04 virtual machine. You may
either use VirtualBox or WSL (Windows Subsystem for Linux). The former allows for having a full
Ubuntu Desktop machine inside Windows, while the latter provides a Linux terminal only. If
you need a full Ubuntu graphical environment, then VirtualBox would be preferred. For
instance, with VirtualBox it is possible to run both PyCharm and the code inside Ubuntu. With
WSL it is still possible to code with PyCharm, running PyCharm in Windows while making it
use a Python interpreter inside WSL. However this is only supported in the Professional
Edition of PyCharm, which is not free. A good tutorial on installing Ubuntu 22.04 with
VirtualBox can be found here:
https://linuxhint.com/install-ubuntu22-04-virtual-box/
Note you have to run the installer as admin (right click on the installation program then click
on “Run as admin”). Instructions for installing Ubuntu with WSL can be found here:
https://ubuntu.com/wsl
Furthermore, the documentation on running WSL interpreters from PyCharm can be found
here:
https://www.jetbrains.com/help/pycharm/using-wsl-as-a-remote-interpreter.html
Apart from having a Linux machine (Ubuntu or macOS), some basic knowledge of neural
networks with fully connected layers is advised in order to better understand the example
code we use in this training, though for understanding the tools and coding techniques it is
not necessary. If you have never done any training on deep learning, following the first 2
courses of Coursera’s Deep Learning Specialization would be more than enough:
https://www.coursera.org/specializations/deep-learning
The example code implements the following concepts, all explained in those 2 courses:
• Dense layers (also called fully connected layers)
• Forward propagation
• Cross entropy cost function
• Random network parameter initialization
• Gradient descent
• Backward propagation
• Partial derivatives of cross entropy, activation functions and linear functions of the
dense layer neurons (all used for implementing the backward propagation)
• Multi-batch training
• Adam parameter optimizer (a variant of gradient descent)
3 Python virtual environments
For each project in which we work, we create a dedicated Python virtual environment in order
to make sure the proper Python interpreter and packages are used, while avoiding potential
clashes between projects. When working on a particular project, we activate the
corresponding Python virtual environment, and when switching to a different project we
deactivate the current Python virtual environment and activate the corresponding one. We
may also open multiple terminals and activate a different Python virtual environment in each
one in order to run code of different projects simultaneously, each one running in their
corresponding Python virtual environments.
While it is possible to create Python virtual environments with Conda, Conda comes with a
set of preinstalled packages that may not be required for our projects or have a different
version number than the one we require. Note it is not possible to install multiple versions of
the same package in the same Python virtual environment. For these reasons we prefer to
use plain Python virtual environments, which include no preinstalled packages whatsoever,
then install the minimum set of packages with the needed version numbers for each project.
3Random number generators usually allow for an input parameter “seed”, an integer number that determines
what is the sequence of random numbers that will be generated. When we do not need to repeat the same
sequence, we use as seed the current system timestamp, which is an integer number that will never repeat.
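The seed behaviour described in this footnote can be checked with Python’s standard random module:

```python
import random

random.seed(42)                           # fixed seed: reproducible sequence
first = [random.random() for _ in range(3)]

random.seed(42)                           # same seed again
second = [random.random() for _ in range(3)]

assert first == second                    # identical sequences of "random" numbers
```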
3.1 Installing Python in Ubuntu 22.04
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get install python3.9 python3.9-dev python3.9-venv
The first command adds a Launchpad repository that includes Python 3.9, and the second one
installs it. Note the default Ubuntu 22.04 repository provides Python 3.10, but we will use
version 3.9 for compatibility reasons with Apple Silicon machines.
Whether we use Python 3.9 or any other further version depends on the requirements of the
project, but once a version has been chosen the same version should consistently be used
across all the project machines.
Package python3.9 contains Python 3.9, python3.9-dev is needed for installing some
Python packages that require compiling native code (e.g., NumPy), and python3.9-venv
is needed for creating Python virtual environments.
Note that it is good practice to keep your system up to date so that you install the latest
version of each Ubuntu package. To do so, you may first run the following commands:

sudo apt-get update
sudo apt-get upgrade
Additionally, make sure you have configured the system locales. During the installation of
Ubuntu Desktop, the locales are already configured. However, Ubuntu Server does not come
with the locales pre-configured. You will be using Ubuntu Server when running an Amazon
EC2 instance based on Ubuntu, or when running a Ubuntu-based Docker image. In these
scenarios you need to run the following commands:
• Install the US English UTF-8 locales (you may choose to install other locales, but be
consistent across the whole project):

sudo locale-gen en_US.UTF-8
• Set the following system-wide environment variables (these values are for US English
UTF-8, if you use other locales then replace the values by the corresponding ones):
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_ALL=en_US.UTF-8
In a Docker image you can define these variables with command ENV, e.g.:
ENV LANG=en_US.UTF-8
In an EC2 instance or any other Ubuntu Server machine these variables are defined in
file /etc/default/locale.
A typical problem derived from not setting the locales is when we open a text file without
specifying the encoding to use, for instance

open(file_path)

instead of

open(file_path, encoding='utf-8')

The former code will use the system’s default locale, which in case of not being set may
vary from machine to machine. If we try to read a text file as UTF-8 when it has been written
as ISO-8859-1 we may corrupt the data or get an exception.
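A minimal Python demonstration of such an encoding mismatch (the sample string is illustrative):

```python
# Bytes written as ISO-8859-1 cannot, in general, be decoded as UTF-8
data = "año".encode("iso-8859-1")     # b'a\xf1o'

try:
    data.decode("utf-8")              # wrong assumed encoding
except UnicodeDecodeError as error:
    print("decoding failed:", error)

print(data.decode("iso-8859-1"))      # the correct encoding recovers 'año'
```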
3.2 Installing Python in macOS
First, install Homebrew5 by running the installation command given on the Homebrew
homepage. Note this command contains no line breaks; before the URL there is a white space,
not a new line. Be careful when copying commands in this guide that span over multiple lines.
In order to be able to install specific versions of Python, we will install pyenv with Homebrew:

brew install pyenv
4 In fact, APT is Debian’s advanced package manager; Ubuntu is a Debian derived Linux distribution which,
among other things, has inherited APT.
5 Homebrew homepage: https://brew.sh/
6 MacPorts homepage: https://www.macports.org/
Additionally, we need to add the following lines:
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init --path)"
to our shell config file: $HOME/.bash_profile if your interpreter is bash, or
$HOME/.zprofile if your interpreter is zsh (modern macOSs use zsh by default).
pyenv allows us to download the source code of any available specific version of Python, then
compiles and installs it in our machine. We can install as many versions as we want, and switch
from one version to another as we work on different projects requiring different versions. Some
libraries are required in order to compile Python; install them with the following command:

brew install openssl readline sqlite3 xz zlib
You can list the available Python versions for download with the following command:

pyenv install --list
Usually, you’ll want the versions that are just numbers with dots (e.g., 3.9.14) instead of other
versions such as miniconda3-4.7.12; as mentioned before, plain Python versions (with version
numbers only) do not come with preinstalled packages, so we are free to create our Python
environment with exactly the packages and package versions that are required. Once we have
chosen a Python version (e.g., 3.9.14), we can install it as follows:

pyenv install 3.9.14

and activate it with:

pyenv global 3.9.14

You can check which versions are installed, and which one is currently active, with the
following commands:

pyenv versions
python --version
3.3 Creating a Python virtual environment
Assuming you are to work on a project called “docknet”, we are going to create a Python
virtual environment for that project in a folder $HOME/docknet_venv.7 Once you have
activated the Python version you want to use for the project, type the following command:

python -m venv $HOME/docknet_venv
A folder $HOME/docknet_venv should have been created and populated with other
folders and files. The most relevant to us are:
• Folder bin: this is where the commands of the Python virtual environment are
installed, most of them being simply symbolic links to the actual commands installed in
our machine, such as:
o python, python3 or python3.9: they all point to the Python command
we used for creating the Python virtual environment.
o pip, pip3 or pip3.9: they all point to the pip command that corresponds
to the chosen Python version.
o activate: this is the script that activates the environment; it is to be run
preceded by command source.
o Any other commands that our Python project may define are installed here.
Obviously, in a fresh Python virtual environment there are none.
• Folder lib: here is where Python packages8 are installed when running command
pip; if our project defines a Python package, it’s also installed here.
3.4 Switching between environments
To activate a Python virtual environment, run its activate script with the source command:

source path_to_bin_folder/activate
The activate script modifies the system environment in which the terminal is running,
such as the default path to the python and pip commands. We need to use command
source so that the changes in the environment are done to the terminal environment and
not just to the subprocess created when running the script. Otherwise, the changes done to
the environment would be deleted once the script execution is finished. Once we activate a
Python virtual environment, the default python and pip commands will be those of the
environment and not those selected by pyenv. We only need to set the python version with
pyenv for creating a new Python virtual environment, but then the Python version for each
environment becomes the default one when we activate them.
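As a sketch, the whole cycle looks as follows (the environment path is illustrative, and --without-pip is used here only to keep the sketch minimal; for a real project create the environment with pip included, as described above):

```shell
# Create a virtual environment (one per project)
python3 -m venv --without-pip /tmp/demo_venv

# Activate it: python now resolves to the environment's interpreter
source /tmp/demo_venv/bin/activate
command -v python

# Deactivate it: the previous environment is restored
deactivate
```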
7 Remember $HOME is a system variable whose value is the path to your home folder, e.g., if your username is
smith, your home folder in Ubuntu is at /home/smith, and in macOS at /Users/smith
8 Python packages may also be called libraries
$HOME/.zprofile if your interpreter is zsh (modern macOSs use zsh by default). Add at
the end of the corresponding file the following line:
You may also add the following alias to make it easier to go to the project folder:
To deactivate the currently active Python virtual environment, type:

deactivate

This command is not a script in the bin folder; it is a shell function that is created by the
activate script and deleted upon deactivation.
3.5 Installing packages in a Python virtual environment
In our Python projects it is advised to have a file requirements.txt that contains the list
of all the necessary packages, along with their versions, in order to facilitate the task of
installing all the required dependencies. Provided that we have such a file, we can simply type
the following command to install all of them:

pip install -r requirements.txt
The requirements file is a text file that contains a Python package name per line, e.g.:
numpy
pandas
scikit-learn
You can use the # symbol to add comments, and suffix ==version_number to specify a
version number of a package, for instance:
# dependencies of component X
numpy==1.24.2
pandas==1.5.3
scikit-learn==1.2.1
Note if version numbers are not specified, the latest versions available will be installed. While
we may want the latest versions available, it may happen that at some point a new version of
a package is released which is no longer compatible with our project, and upon reinstalling
the project (e.g., in the production environment) the code will fail. You can check which
packages and package versions are installed in the currently activated environment by
running command:
pip freeze
A trick you can use if you want to add the latest available version of a package to a project is
to first manually install the package with pip without specifying a version number, then use
pip freeze to check the version installed, then add the package with that version to the
requirements file.
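As a sketch, assuming the environment is activated (numpy is an illustrative package name):

```shell
# Inside the activated environment, list all installed packages with versions
python3 -m pip freeze

# Keep only the line for the package of interest (numpy is illustrative)
python3 -m pip freeze | grep -i '^numpy==' || echo "numpy is not installed"
# Copy the printed line (e.g., numpy==1.24.2) into requirements.txt
```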
3.6 Deleting/reinstalling a Python virtual environment
We may be exploring whether to use a new package in our project and end up breaking the
virtual environment by installing some package that is incompatible with our project. Note
that a package may depend on other packages, thus installing a package with pip may in
turn install many others. It can then be difficult and time consuming to figure out which
package is producing the conflict and what should be uninstalled in order to solve the
problem. Instead, we can simply delete the virtual environment and create it again.
4 What is Git?
The Wikipedia definition of Git is: “Git is a distributed version-control system for tracking
changes in source code during software development”. In practice, Git behaves as a file
repository, either local or online, that not only stores a set of files and folders but also keeps
track of every change made to them since the creation of the repository. This avoids
accidental deletions of code and allows us to rollback changes if necessary, so we can try
different solutions without being afraid of breaking the code or potentially losing important
files.
Usually, we first create a central Git repository in some server or cloud; at The Dock, Azure
DevOps is the current cloud solution to manage project Git repositories, along with other
project processes and metadata. A project’s Git repository or repositories are first created
with Azure DevOps in the cloud. When we create a new Git repository, it is initially empty.
We then use a Git client to make a clone of the remote Git repositories in our machine. This
clone is a mirror of the remote repository that allows us to work locally, without the need of
Internet connection. It not only contains the files downloaded from the central repository
(which initially are none), but metadata files that keep track of every possible change and
keep our local Git copy synchronized with the remote one (the one in the Azure cloud).
Each team member periodically synchronizes their local clone of the Git repository with the
remote one in order to share their contributions with the rest of the team, and to obtain
other’s contributions. Since different developers work concurrently, synchronizing local and
remote repositories implicitly requires integrating different pieces of work, which at times
may be in conflict. If we initially define the project to develop as a set of software
components, and the way these components will interact, each developer can focus on a
different component in order to avoid conflicts. Developers modify or create different files,
and Git automatically integrates the changes by simply keeping the latest version of each file.
However, when 2 or more developers modify the same parts of a file, Git does not know how
to merge both changes; should Git keep one version and drop the other, or vice-versa, or
should new code be written in order to take into account both changes? Git also implements
a conflict resolution mechanism that allows us to manually select one of these 3 options, and
to develop new code to solve the conflict, if needed.
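For reference, when Git cannot merge two versions of the same lines it leaves both in the file between conflict markers, which look like this (the file content and branch name are illustrative):

```text
<<<<<<< HEAD
learning_rate = 0.01   # our local change
=======
learning_rate = 0.05   # the change fetched from the remote repository
>>>>>>> feature/tune-learning-rate
```

Resolving the conflict means editing the file so that only the desired code remains, then deleting the three marker lines.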
5 Installing and configuring the Git client
In Ubuntu you can install the Git client with the following command:

sudo apt-get install git

In macOS you can type the following command, provided that you already installed
Homebrew (see section 3.2):

brew install git
Finally, you need to tell your Git client what username and email to use when connecting to
a repository so that you don’t get an authentication error:

git config --global user.name your_username
git config --global user.email your_email@accenture.com

Remember your Accenture username is the part of your email before the @ symbol.
When uploading changes to a Git repository from the command line, Git automatically opens
a text editor to type in a descriptive message of the changes. To choose the editor to use
(e.g., vi), type the following command:

git config --global core.editor vi
Finally, type the following command to avoid some errors when working with branches:
6 Creating a project in Azure DevOps
First of all, go to the following website:
https://azure.microsoft.com/en-us/services/devops/
Sign up using your Accenture e-mail.9 If requested, select “Ireland” as region. At this moment
you will have created a new organization in Azure whose name is your Accenture EID (the
part to the left of the @ symbol in your Accenture email). You will be presented with a web
page to create the first project of the organization. Choose a project name (e.g., “docknet”),
select “Private” and click on button “Create project”.
A Git repository with the same name will be created by default. Additional repositories can
be created within the same project if needed (e.g., a repo for the code and another for the
datasets), but for this training we will only use the default one. Now follow section 7.2 to
clone the project, which for the moment will be empty. Open a terminal and go to the
docknet folder where you cloned the project. If you list the files with command:
ls
you will see no files, though in fact there is a hidden folder called .git which contains the
Git metadata. To see it you need to use command:
ls -a
9In fact, you can use any email in order to use the trial version of Azure DevOps, but in order to avoid having
to reconfigure your Git email we will use our Accenture emails.
Download the file Docknet-master.zip from here:
https://github.com/Accenture/Docknet/archive/refs/heads/master.zip
and unzip it inside the docknet folder where you cloned the project. Note if the unzip
process created yet another folder (e.g., Docknet-master), you are to move all the files
and folders and place them directly inside folder docknet (delete the empty folder when
done). Now in the terminal, inside the cloned docknet folder, type the following commands:
git add .
git commit -m "First version"
git push
This will add the code to the Git repository. An explanation of what these commands do is
given in section 15.
Now the team members are to be added to the Azure project so that they can also access it.
On the Azure project web page, at the bottom left, click on “Project settings”. Now in the left
panel click on “Teams”. You will see there is a default “docknet Team” already created. Click
on it to manage the team. Now use button “Add” at the top right corner to invite the other
team members to the project. Use their Accenture emails to invite them.
The newly added members have by default a “Stakeholder” type account which does not let
them access the project repository yet. All their accounts must be changed to “Basic”. To
do so, go to your Azure organization page, either by removing “/docknet” from the URL in
your web browser or by clicking on your Accenture ID at the centre of the top bar of the page.
Now click on “Organization settings” at the bottom left corner of the page. Then click on
“Users” in the left panel. You will see the list of users that belong to your organization. For
each user whose access level is not “Basic”, move the mouse pointer to the corresponding
row in the user table to see a 3-dot icon appear to the right of the table. Click on that icon
then on option “Change access level”. Finally select option “Basic” in the drop-down menu
and click on button “Save”. Repeat this operation for each team member.
All the team members should now be able to access the project page and to clone the
repository, as described in the next section. To facilitate the task, share with them the URL of
the project page. You can go to the project page by going back to the organization page as
before, then clicking on the “docknet” project box.
7 Cloning the project
7.1 Accessing the Azure project page
Open the following website in your browser:

https://dev.azure.com
Log in if requested, then you should see in the left panel the list of organizations you belong
to. If you have already participated in a Dock’s Azure project, you should see “thedock”, the
Dock’s organization. The team member who configured the Azure project should have
created an organization whose name is their Accenture ID and should have added you to that
organization. That organization should appear as well in the left panel, otherwise just ask for
the project URL and access it directly. The next time you access the main Azure page you
should see the Accenture ID organization. By clicking on it, you should then see a “docknet”
box which corresponds to the project. Click on it to access the project page.
7.2 Finding the Git repository URL and cloning the project
Once in the Azure project page, you will see a column of buttons in the left panel, each one
corresponding to a different section of the project page. Click on the Repos icon to access the
repository section.10 Click on the “Clone” button at the top right corner of the page. A panel
will pop up from where the URL of the repository can be copied. Since SSH is no longer
permitted at The Dock, make sure you get the HTTPS version. Click on the button “Generate
Git Credentials” to generate a password that will be asked each time we access the repository.
Copy it somewhere so that you do not need to generate a new one each time.
Open a terminal and go to the folder where you store the projects (e.g., $HOME/src). Now
type the following command, replacing repository_URL with the URL copied before:

git clone repository_URL

The command will create a new folder docknet and download all the project files inside.
8 Building and testing the project
The project includes a build script

delivery/scripts/build.sh

that takes care of creating the Python virtual environment, installing all the required
dependencies, installing the project package in the environment, then running the unit tests. Run
the script and check that no test errors are reported. If so, you have a stable version of the
project you can continue developing.
In case for some reason your Python virtual environment gets corrupted, you can quickly
recreate the environment, install all the dependencies and verify it works again by rerunning
the build script.
This same build script will be used within a Docker container for the continuous integration
system to run the tests upon each update of the code sent to the remote Git repository. The
bash script knows whether it is being run by the continuous integration system or by
somebody else (a developer). When run by somebody else, it will install the additional
10If you can access the project web page but don’t see the repositories icon, ask the person who created the
Azure project page to change your account access level from “Stakeholder” to “Basic”, as explained in section
6
packages listed in file requirements-dev.txt, which are required for development
purposes only, namely:
9 Installing and running JupyterLab
Note JupyterLab is the latest Jupyter version since March 2020. The previous version was
called Jupyter Notebook, and IPython before that. The JupyterLab interface has more options and
panels than that of Jupyter Notebook, but if you install JupyterLab you will have the option to
use either interface.
In order to open the JupyterLab web interface, open a terminal, activate the Python virtual
environment of your project, go to the main folder of the project, and run the following
command:
jupyter lab
In case you want to use the former Jupyter Notebook interface, run this command instead:
jupyter notebook
Note it is important to run the Jupyter server in the root folder of the project, so that folder
will become the root in the Jupyter interface, and the metadata of your notebooks will be
written in and loaded from that folder.
Upon running the Jupyter server (either lab or notebook), a web page with the interface will
be automatically opened. Note that the terminal where you run the jupyter command must
stay open, or the server will stop. When running the command on the terminal, several
messages will be printed, one of them giving you the URL of the Jupyter web page:
http://localhost:8888
In case you close the Jupyter tab and do not remember the URL to open it again, you can refer
to the terminal messages.
11 Setuptools, the Python packaging system, is able to run the tests without having to install pytest; the
continuous integration system does not require to install pytest since it uses Setuptools to run the tests, but
for development purposes it is more convenient to have pytest installed.
12 Note that you need the commercial version of PyCharm to edit and run Jupyter notebooks within PyCharm;
Finally, for working with Jupyter notebooks directly on PyCharm you will also need to install
Jupyter as described in this section. The only difference is PyCharm will take care of starting
and stopping the Jupyter server, so you will not need to run it yourself in the command line,
and you will use the PyCharm interface instead of a web browser.
10 Installing PyCharm
PyCharm is a smart Python programming interface that is easy to start with and assists us
during the development process so we can rather focus on the problem instead of fighting
with the particularities of the programming language. It makes easy to navigate through big
projects, is integrated with Git, and understands and can run unit tests, among many other
features. There are 2 flavours of PyCharm: Professional and Community. The Community
edition is free and usually enough for our needs. The Professional edition adds some
additional features, such as being able to run Jupyter Notebooks. In this guide we will use the
Community edition. If for some reason you need the Professional edition, licenses can be
requested through Accenture’s software catalog:
https://support.accenture.com/support_portal?id=acn_sac&spa=1&page=details&category
=&sc_cat_id=356f867ddbf8ac987faf89584b9619e9
10.1 Ubuntu 22.04
In Ubuntu you can install the Community edition with the following command:

sudo snap install pycharm-community --classic

Once finished you should be able to find PyCharm from the Launcher icon. Add PyCharm to
the launch bar for quicker access (right click on the PyCharm icon, then click on “Add to
favourites”).
10.2 macOS
Download the DMG package of the community edition from the following web page:
https://www.jetbrains.com/pycharm/download/#section=mac
Make sure you select the proper DMG version for your computer (Intel or Apple Silicon) by
clicking on the DMG button. Then click on the Download button to download the DMG
package.
Once downloaded, double click on the DMG file. In the window that will pop up, drag and
drop the PyCharm icon on the Applications icon. You should then be able to find PyCharm in
your Applications folder. Drag and drop the PyCharm icon on the Dock bar at the bottom of
your desktop for easier access.
10.3 First time running PyCharm
The first time you run PyCharm, you will go through a short configuration process:
1. Accept the PyCharm license
2. Choose whether or not to send usage statistics
3. Choose not to import settings
4. Select UI theme (dark or light; I personally find dark causes less eye fatigue)
5. Install plugins:
a. IdeaVim is not recommended unless you are already used to the Vim editor,
since it completely modifies the PyCharm editor behaviour
b. Markdown is recommended to have a nicer Markdown file editor
c. Select R if you also work with R
d. Do not select AWS Toolkit unless you plan to develop AWS Serverless
applications
e. There are many other plugins that can be installed afterwards; just click on the “Start using PyCharm” button to finish the configuration process
11 Opening & configuring the project with PyCharm
Our project may contain not only Python source code but also data and configuration files, unit tests, documentation, bash scripts, etc. Due to the flexibility of Python, PyCharm has no way of knowing which part of the project tree contains the source code, so we have to tell it.
By default PyCharm assumes that source packages will be placed in the project root folder.
Usually a Python project will contain a single package (e.g. docknet) with subpackages
inside, so for simplicity we will simply place the main package folder inside the project root.
Another convention is to create a src folder and place the Python packages there, in which case we will have to inform PyCharm that the folder src is a source code root. This is done as follows:
In more advanced projects we may implement multiple Python packages, e.g. backend,
frontend and common code, and manage all of them as Git submodules of a single project.13
In that case we can open in PyCharm the folder containing all the submodules, so we can work on all of them as if they were a single project, but we will have to indicate where the root source code folder of each subproject is, as explained above.
13 Git submodules are an advanced Git feature that facilitates working with multi-package projects. They are not discussed in this guide, but more info can be found here: https://git-scm.com/book/en/v2/Git-Tools-Submodules
Additionally, we also have to tell PyCharm which Python virtual environment to use for the
project:
Once PyCharm knows where the source code is and which Python virtual environment to use, it will scan all the libraries installed in the environment, as well as the project source code, and build an index that lets us navigate through the code quickly; it will also highlight any errors found (e.g. imported packages that are not installed in the selected environment).
While Python includes by default a unit test library called unittest, in this guide we use
pytest, which contains additional features. We need to tell PyCharm we will be using this
library to run the unit tests:
12 Python packages & project structure
A typical software project, whether in Python or any other language, is composed of:
• Source code: implements the business logic (e.g., detecting mentions of drugs in
documents)
• Configuration files: parameters that modify the way in which the code will run,
without having to modify the source code (e.g., whether dropout will be used or not
to train a machine learning model)
• Test code: code to be used to automatically verify that the business logic is properly
implemented (e.g., check that a tokenizer splits a sequence of characters into the
expected sequence of tokens)
• Build scripts: scripts used to automate the tasks of building the project distributable,
installing it and running the tests. These build scripts may also include configuration
files that define different options of the build process.
Usually, the project distributable only includes the implementation of the business logic and
the configuration and resource files required to run it. All the other files (test code and build
scripts) are used during the development and testing phases and are not required to run the
business logic. By placing the source code in a folder and the test code in a different folder,
we prevent the test code from being included in the package distributable. We put all the
source code, configuration files and resource files necessary to run the code in the folder
corresponding to the main project package (e.g. docknet). All the test code, configuration
files and resource files used for running the tests are placed in folder test.
In the “docknet” project we have also added a folder exploration with some Jupyter
notebooks, which use the code inside the docknet folder. The Jupyter notebooks are not to be included in the project distributable either, since their code is not reusable. As the folder name suggests, they are meant for exploration only.
Files and folders other than docknet, test and exploration correspond to build scripts
and configuration files of the development process itself. We describe them in the subsections below.
12.1 gitignore
File .gitignore lists the files or folders that we may have inside the project file tree that
are not to be saved in the central repository. Typical examples of these files are:
• Temporary files such as Vim’s .swp backup files
• Metadata files created by our OS (e.g. .DS_Store) or the programming
environment (e.g., PyCharm’s .idea folder).
• Python files created when building the project distributable (folders build and
dist)
• Log files and other files that might be created when running the project code or tests
(e.g. .pytest_cache and __pycache__ folders), or Jupyter notebooks (folder
.ipynb_checkpoints).
None of these files should be uploaded to the central repository, since they are temporary and unique to each developer. Apart from taking up unneeded space in the central repository, other developers would get a copy of them when synchronizing with the remote repository. Moreover, conflicts may arise if 2 developers upload the same temporary or metadata files to the repository. Take PyCharm’s metadata as an example: among other things, PyCharm stores in the metadata the list of tabs you had open the last time you opened the project, so that you can resume your work exactly where you left off. If you don’t ignore these files, your Git client will think you have new code to upload to the remote repository every time you open or close a tab, bloating the repository with new and unnecessary versions of these metadata files. Moreover, if more than one developer uploads the PyCharm metadata files, Git will constantly report code conflicts, since different developers will have different tabs open.
For this reason, make sure that files that are not to be shared with other developers are
either:
• added to .gitignore, for the temporary and metadata files listed above,
• added to some notebook in the exploration folder, in the case of exploratory code that you don’t yet know whether it will finally be required, or that needs to be refactored before being integrated with the rest of the components, or
• stored outside of the project folder, such as data files you may use to make manual
tests.
Note as well that Git is meant to store source code, not binary files. Git is capable of efficiently
storing different versions of text files by storing sequences of changes instead of whole files
for each version. However, this doesn’t work well with binary files and Git will store the whole
files for every version, wasting storage space and network bandwidth. The problem worsens
with large binary files, such as machine learning models and multimedia files. Note as well that, since Git keeps track of every file version, deleting a binary file from the repository does not solve the problem: the file will still be stored in a previous version of the code, and every developer who clones the repository will have to download it. There can be exceptions, such as including a reasonably small binary model in the resources of our source code so that the project can work with a default model, though it is better to separate data from code (e.g., publish the models in an Amazon S3 bucket and access them from the code).
Each line in .gitignore specifies one filter of files or folders to ignore. Let root be the folder containing the .gitignore file; common filters are:
• *.swp: every file whose name ends with .swp, anywhere under folder root
• __pycache__/: every folder named __pycache__, anywhere under folder root
• /folder1/folder2/ or folder1/folder2/: ignore folder2 at the precise
path root/folder1/folder2
• /folder1/file1 or folder1/file1: ignore file1 at the precise path
root/folder1/file1
Note that the moment we specify a path, either starting with / or not, we ignore a specific
file or folder, not a file or folder anywhere under root. While it is possible to add
.gitignore files anywhere in the project folder, it’s usually better to have only one at the
root folder of the project.
Finally, it is possible to add comment lines using symbol # at the beginning of the line.
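Putting these rules together, a minimal .gitignore along the lines described above could look as follows (an illustrative sketch, not the project’s actual file):

```
# Temporary and metadata files
*.swp
.DS_Store
.idea/

# Build artifacts
build/
dist/

# Test and notebook caches
.pytest_cache/
__pycache__/
.ipynb_checkpoints/
```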
For a comprehensive reference on the .gitignore file syntax, visit the official
documentation:
https://git-scm.com/docs/gitignore
Depending on the kind of project (Python, Java, C++, etc.) one may copy a predefined
.gitignore file. The following GitHub repository contains examples of .gitignore files
for many different kinds of projects:
https://github.com/github/gitignore
12.2 Python package declaration or setup.py
Note that the distributable package will only contain the Python scripts in folder src, which are usually the only files we need to distribute to run the project.
The file setup.py basically builds a Python dictionary with a set of package parameters, then calls the function setup from the Setuptools Python package to perform the requested action (build the distributable, install, or test, among others).15
In case you are creating your own Python package, you may copy/paste this setup.py file
and update the following lines according to your project:
• PKGNAME='docknet': name of the package
• DESC='''A pure NumPy implementation of neural networks''': description of the package
• license='(c) Accenture': project license, leave (c) Accenture for private
Accenture license
• author='Javier Sastre': package maintainer
• author_email='j.sastre.martinez@accenture.com': email of the
package maintainer
• keywords=['Accenture', 'The Dock', 'deep learning',
'neural network', 'docknet']: descriptive project keywords for indexing
purposes in a package repository
• classifiers=['Programming Language :: Python :: 3 :: Only', …]: standard category labels for Python projects, also used for indexing purposes16
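Putting the pieces together, a minimal setup.py along these lines could look as follows (a sketch only: the field values are the examples given above, and the packaging options assume the src layout mentioned earlier; adapt everything to your project):

```python
from setuptools import setup, find_packages

PKGNAME = 'docknet'
DESC = '''A pure NumPy implementation of neural networks'''

setup(
    name=PKGNAME,
    description=DESC,
    license='(c) Accenture',
    author='Javier Sastre',
    author_email='j.sastre.martinez@accenture.com',
    keywords=['Accenture', 'The Dock', 'deep learning', 'neural network',
              'docknet'],
    classifiers=['Programming Language :: Python :: 3 :: Only'],
    # Assumes the src layout described earlier; adapt if your main
    # package lives in the project root instead
    packages=find_packages(where='src'),
    package_dir={'': 'src'},
)
```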
Apart from script setup.py, you’ll also need to copy and update the following configuration
files:
• CHANGES.txt: text file listing the features and bug fixes implemented in each
package version; if we are not publishing the package, we can ignore this file.
14 Alternatively, you may simply run the command pytest to run all the tests inside a project folder, provided
that the corresponding Python virtual environment is active
15 More actions are possible, though these are the most common ones. A tutorial on Setuptools can be found
at https://packaging.python.org/tutorials/packaging-projects/
16 The comprehensive list of classifiers can be found at https://pypi.org/classifiers/
• MANIFEST.in: contains the list of files outside the src folder that should also be
included in the project distributable (requirements.txt and VERSION.txt
files)
• requirements.txt: contains the list of packages this project depends on (as
explained in section 3.5).
• requirements-dev.txt: contains the list of additional packages required for
development purposes only. As explained in section 8, the build script installs them
whenever it’s not run by the continuous integration pipeline (it will install them when
it’s run by any developer).
• setup.py: the Python script used to package and install the project as well as to run
all the tests from the command line. This file is included by default in the package
distributable, since it is needed to install it.
• setup.cfg: additional configuration parameters of the setup.py script
• VERSION.txt: contains the version number of the package as plain text; if we are
not publishing the package, we can just leave version 0.0.1 inside
binary classification and a Docknet capable of properly classifying them, except for boundary
cases (regions where individuals of both classes “touch”).
By default, the notebooks can only use the Python packages that have been installed in the
active Python virtual environment where the Jupyter server is running.18 For the notebooks
to be able to use the code in the src folder, the notebooks start with the following
instructions:
import os
import sys
These instructions, together with a call to sys.path.append, programmatically add the src folder to the list of folders where the Python interpreter looks for imported packages.
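The exact instruction depends on where the notebook lives; the typical pattern looks like this (the relative path ../src is an assumption; adjust it to your layout):

```python
import os
import sys

# Add the project source folder to the module search path so that
# packages under src can be imported from the notebook
# (the relative path '../src' is an assumption)
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
```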
1. The cluster data generator generates 2 clusters of points that are linearly separable,
so a simple logistic regression should be enough.
2. The chessboard data generator generates a sort of 2x2 chessboard, with one diagonal belonging to one class and the other diagonal to the other class. The two classes are no longer linearly separable, but a hierarchical split would allow separating them, e.g., first split the space in half, then for each subspace do another split to obtain the final classification.
3. The island data generator generates a cluster of points surrounded by a ring (the sea). This case is not linearly separable, though a support vector machine with a Gaussian kernel, for instance, could properly classify it.
4. The swirl data generator generates 2 clusters of points distributed in 2 swirls of
different phases, whose separation is more challenging than the other cases.
function to use (sigmoid, tanh or relu in
src/docknet/function/activation_function.py).
• Add a last dense layer having a single neuron and “sigmoid” as activation function, for
binary classification
• Set the Docknet parameter initializer (e.g., an instance of the
RandomNormalInitializer in folder src/docknet/initializer)
• Set the docknet cost function (cross entropy in
src/docknet/function/cost_function.py)
• Finally, set the parameter optimizer (e.g., AdamOptimizer in
src/docknet/optimizer)
When invoking the Docknet’s train method, all the configured components are used to initialize the layer parameters, perform a number of training iterations (forward and backward propagations), and optimize the layer parameters to minimize the specified cost function as much as possible. In the train method we specify the training dataset and the corresponding labels, the batch size, and a stop condition (e.g., a maximum number of epochs). Other stop conditions can be used; see the definition of the train method for a comprehensive list.
Method train returns the sequence of average costs per epoch and per iteration, which gives us an idea of whether the network we have defined manages to find a proper parameter configuration, and whether we are performing an excessive number of epochs.
Then the notebook computes the predictions on the test set for the trained model, invoking
the Docknet’s method predict. Finally, we can see the scatterplots of the expected
classification (the training set), the points of the actual classification that have been properly
classified, and the points of the actual classification that have been misclassified (in the best
case, an empty scatterplot).
One can play with the different parameters of the Docknet, using different numbers of layers and neurons, activation functions, numbers of epochs and batch sizes, to see how they impact the final result and how fast each configuration manages to converge.
13.2.1 Docstrings
All classes and methods are documented in the source code using multiline docstrings, for
instance:
def train_batch(self, X: np.ndarray, Y: np.ndarray) -> float:
    """
    Train the network for a batch of data
    :param X: 2-dimensional array of input vectors, one vector per column
    :param Y: 2-dimensional array of expected values to predict, one single
        row with the same number of columns as X
    :return: aggregated cost for the entire batch (without averaging)
    """
These are short descriptions of the classes and methods that a user of the library can read to quickly understand how the class or method is supposed to be used, what the input parameters are (the :param fields) and what the return value is (the :return field), if any. The entire Docknet library contains docstrings that explain the purpose of each class and method.
In PyCharm, one can start typing the triple quotes that start the docstring and PyCharm will
autocomplete the docstring with the corresponding parameters and return value we have
declared in the method header. More info on docstrings can be found here:
https://www.python.org/dev/peps/pep-0257/
13.2.2 Type hints
Type hints declare the expected types of method parameters and return values. The return type is written as -> type_hint after the closing parenthesis (e.g., the train_batch method returns a float). They can also
be used in variables declared inside methods, such as the members of a class declared in the
__init__ method:
self.layers: List[AbstractLayer] = []
self._cost_function_name: Optional[str] = None
self._initializer: Optional[AbstractInitializer] = None
self._optimizer: Optional[AbstractOptimizer] = None
Since in Python a variable can change its type at any moment (the moment we assign it a value of a different type), without type hints it is not possible to infer the types of the parameters that a function will receive. Though type hints are not needed to write runnable Python code, they improve code readability and allow PyCharm to perform additional verifications on the code we are writing and to provide
assistance on how to use the different variables. For instance, if we start typing a new line
after the train_batch header with the code
X.
PyCharm will suggest all the available NumPy array functions, since it knows that X is
supposed to be a NumPy array. Change the type of X to str, now try to write the following
statement:
X = X + 1
PyCharm will highlight the number 1 to indicate a potential error. Move the cursor over
number 1 and you will see a message Expected type str, since you can add strings but
not a string and an integer. Remember to undo all these changes (e.g., with Command + Z in
macOS or Ctrl + Z in any other OS) to not break the code.
Though declaring type hints may seem like extra work, the effort is rewarded with less time wasted trying to understand how to use the methods, and with extra help from PyCharm.
It is possible to declare more complex type hints such as lists, dictionaries and tuples of
different kinds of objects, for instance
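For example (illustrative declarations; the variable names here are made up, not part of the Docknet code):

```python
from typing import Dict, List, Optional, Tuple

# A dictionary mapping strings to floats
costs: Dict[str, float] = {'epoch_0': 0.7, 'epoch_1': 0.4}

# A list of (name, dimension) tuples
layers: List[Tuple[str, int]] = [('input', 2), ('dense', 4)]

# A value that may be a string or None
cost_function_name: Optional[str] = None
```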
When you first write these type hints, PyCharm will complain that they are unresolved references, just as when you use any class without having imported it first (the offending code is underlined in red, and if we move the mouse pointer over it, we will see the error message). All these classes belong to the typing package. For instance, for the case of Tuple you will need to add the following import:
from typing import Tuple
However, you can ask PyCharm to automatically add the import for you. Any time PyCharm
highlights a piece of code in red, you can move the cursor to the offending code and press
command + enter in macOS, or ALT + enter in other OSs, to see a window with potential
solutions suggested by PyCharm. In case of a missing import, PyCharm will suggest adding the
import, among other possible solutions. Press enter to let PyCharm add the import. In case
PyCharm finds several packages that define the missing reference, it will present a list of
options. For the case of type hints, select the typing package with the arrow keys and press
enter.
13.2.3 Getter and setter methods
Note: the Docknet class contains some methods decorated with @property and @XXX.setter (where XXX is the property name).
Getters and setters are used to abstract the library user from the actual way in which the
variable is stored, giving the possibility of alternate implementations. For instance, see how
the cost function getter and setter are implemented: the setter expects a string with the cost
function name, and the getter returns the last function name set, but actually the Docknet
class requires a Python function as cost function. Moreover, it also requires another Python
function that implements the corresponding cost function derivative in order to compute the
backward propagation for training the network. To simplify the usage of the library and avoid
giving the wrong function derivative, the setter uses the function name to retrieve the corresponding pair of functions from a dictionary of cost functions defined in
src/docknet/function/cost_function.py, and the getter simply returns the
name of the cost function that has been set instead of the function itself. The same
mechanism is used for getting activation functions by name along with their corresponding
derivative functions, which will be later explained in section 13.4.
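The mechanism can be sketched as follows (a simplified illustration with a made-up square cost function; the real implementation lives in the Docknet class):

```python
# Made-up cost function and derivative, for illustration purposes only
def square(x):
    return x * x

def square_prime(x):
    return 2 * x

# Dictionary pairing each cost function name with (function, derivative)
cost_functions = {'square': (square, square_prime)}

class Net:
    def __init__(self):
        self._cost_function_name = None
        self._cost_function = None
        self._cost_function_prime = None

    @property
    def cost_function(self):
        # The getter returns the name that was set, not the function itself
        return self._cost_function_name

    @cost_function.setter
    def cost_function(self, name: str):
        # The setter looks up the (function, derivative) pair by name, so
        # the user cannot pair a function with the wrong derivative
        self._cost_function, self._cost_function_prime = cost_functions[name]
        self._cost_function_name = name

net = Net()
net.cost_function = 'square'  # calls the setter
```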
13.3 Data generators and class inheritance
A child class declares its parent class in parentheses in its class header, for instance:
class ChessboardDataGenerator(DataGenerator):
This means the child class gets for free a definition of all the functions of the parent class. The child class can optionally redefine the methods of the parent class, if necessary, or extend them by redefining them and adding a call to the parent class method to reuse it. For instance, the __init__ method of the DataGenerator class:
is extended in each child class. The DataGenerator class expects to receive 2 Python functions, each one producing 2D vectors of a different class, given a 2D array of random numbers between 0 and 1. This pair of functions is provided by each child of DataGenerator in order to generate different datasets.
Open the ChessboardDataGenerator, for instance, and you will see that it declares the
2 class functions, func0 and func1, and an __init__ method that is an extension of the
DataGenerator __init__ method. First of all, this method calls the parent class __init__
method with the following instruction:
super().__init__([self.func0, self.func1])
The function super() gives access to the parent class, so super().__init__ calls the __init__ method of the parent class, passing the 2 expected functions declared in the child class. The remaining code does some precomputations that are used by the func0 and func1 methods.
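The pattern can be sketched as follows (a simplified illustration; the mappings in func0 and func1 are made up, and the actual generators produce the datasets described earlier):

```python
import numpy as np

# Simplified sketch of the parent/child pattern described above
class DataGenerator:
    def __init__(self, class_functions):
        # One function per class; each maps a random 2D point in [0, 1)
        # to a point of that class
        self.class_functions = class_functions

    def generate(self, n):
        # Generate n points per class, labelled with the class index
        points = []
        for label, func in enumerate(self.class_functions):
            for r in np.random.rand(n, 2):
                points.append((func(r), label))
        return points

class ChessboardDataGenerator(DataGenerator):
    def func0(self, r):
        return r          # made-up mapping for class 0

    def func1(self, r):
        return r + 1.0    # made-up mapping for class 1

    def __init__(self):
        # Extend the parent __init__, passing the 2 class functions
        super().__init__([self.func0, self.func1])

generator = ChessboardDataGenerator()
points = generator.generate(3)   # 3 points per class
```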
Summarizing, class inheritance can be used 1) to factor out common code and 2) to ensure a
common behaviour of a set of classes so that they can be easily exchanged (we can use a
different data generator without having to modify the code of the Docknet that is to ingest
the data). Once the DataGenerator is defined, multiple developers can work on different data generators without worrying about how their generators will integrate with the Docknet class, as long as they derive from the DataGenerator class as expected.
More info on Python classes and class inheritance can be found here:
https://docs.python.org/3/tutorial/classes.html
The 4 implemented data generators are inspired by those used in the following web page, where one can play around with a neural network, applying different layers, neurons, regularization, etc.:
https://playground.tensorflow.org/
13.4 Activation functions and custom exceptions
Activation functions are used in the network’s dense layers to compute the output of each neuron, once the linear part of the neuron has been computed. In the literature we find several activation functions one can use; some specific ones are needed in the output layer, depending on whether we want to do binary classification (activation function sigmoid) or multi-class classification (activation function softmax). What all the activation functions have in common is that they are not linear. Otherwise, adding more layers to a network wouldn’t add more power, since the composition of any number of linear functions is equivalent to a single linear function.
For the moment the Docknet library implements the following activation functions: sigmoid, tanh (hyperbolic tangent) and relu (rectified linear unit), enough for doing binary classification. The functions are defined in the file src/docknet/function/activation_function.py. The mathematical definition of these functions is given in their corresponding docstrings.
In order to compute the network backward propagation, the derivatives of these functions are also required. These are defined in the same file as sigmoid_prime, tanh_prime and relu_prime. We will see later that, when specifying the activation function of a new dense layer, we just have to provide the name of the activation function instead of the pair activation function + derivative. This is the same mechanism as for the specification of the Docknet’s cost function and its derivative, previously mentioned in section 13.2.3. To make sure that the proper derivative is used for each activation function, we have defined a dictionary associating each activation function name with the corresponding pair of functions:
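The dictionary can be sketched as follows (the function bodies here are illustrative; the real definitions live in src/docknet/function/activation_function.py):

```python
import numpy as np

# Illustrative definitions of the activation functions and their derivatives
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # Expects a NumPy array
    return (z > 0).astype(float)

# Each name maps to the pair (activation function, derivative)
activation_functions = {
    'sigmoid': (sigmoid, sigmoid_prime),
    'tanh': (tanh, tanh_prime),
    'relu': (relu, relu_prime),
}
```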
Additionally, we have defined the following function to retrieve the pair of functions from the
dictionary, and generate a specific exception in case there is no implementation for the
requested activation function:
try:
    return activation_functions[activation_function_name]
except KeyError:
    raise UnknownActivationFunctionName(activation_function_name)
While it would be possible to directly access the dictionary to get the pair of functions, in case
we specified a non-existent activation function name we would get a KeyError exception,
which is thrown any time we try to get a value from a dictionary for a given key that does not
exist. However, that exception would not provide much information to the user of the library
on what went wrong, while by generating a specific exception we can provide a more
informative error message. Here we are the definition of the custom exception:
class UnknownActivationFunctionName(Exception):
def __init__(self, activation_function_name: str):
message = (
f'Unknown activation function name {activation_function_name}')
super().__init__(message)
We simply create a class that extends the Exception class. The name of the class could
already be enough to explain the error reason, such as
UnknownActivationFunctionName. We can include a customized error message with
the exception, as in the example, by redefining the Exception __init__ method to pass
the custom message to the parent class. In the example, we are passing the received
activation function name to give a better hint on what went wrong.
13.5 Cost functions
There are several cost functions one can use, though the most common one is cross entropy. For the moment, cross entropy is the only cost function we have implemented, though we have followed the same code structure as for the activation functions in order to enable further extension of the library.
13.6 Initializers and abstract classes
We also want the initializers to be easily exchangeable, so we can use any initializer implementation without having to modify the code of the Docknet class. Note how the Docknet initializer setter is implemented:
@initializer.setter
def initializer(self, initializer: AbstractInitializer):
"""
Sets the network parameter initializer; required for training only
:param initializer: an initializer object (e.g. an instance of RandomNormalInitializer)
"""
self._initializer = initializer
class AbstractInitializer(ABC):
@abstractmethod
def initialize(self, layers: List[AbstractLayer]):
"""
Initializes the parameters of the passed layers
:param layers: a list of layers
"""
pass
First of all, we declare the class as a child of Python’s class ABC, the parent of every abstract
base class. Then we declare all the abstract methods preceded by the annotation:
@abstractmethod
In these methods we only provide the method name and parameters. Though not mandatory, we use type hints in order to make clear what each input parameter is, and to
declare the return value, if any (in this case, the initialize method does not return anything). As the implementation we just add the instruction pass. Child classes will have to override the abstract methods, providing an actual implementation (not just pass) and omitting the annotation @abstractmethod, since their methods will no longer be abstract (otherwise we won’t be able to create instances of those classes). See for instance the
implementation of RandomNormalInitializer at
src/docknet/initializer/random_normal_initializer.py:
class RandomNormalInitializer(AbstractInitializer):
"""
Random normal initializer sets all network parameters randomly using a
normal distribution with a given mean and
standard deviation
"""
def __init__(self, mean: float = 0.0, stddev: float = 0.05):
"""
Initialize the random normal initializer, given a mean a standard
deviation
:param mean: the mean of the normal distribution
:param stddev: the standard deviation of the normal distribution
"""
self.mean = mean
self.stddev = stddev
Every layer class defines a getter params, which returns a dictionary with all the parameters of the layer. The random initializer iterates over each parameter in the dictionary and assigns it a random value drawn from a normal distribution with the given mean and standard deviation, defined at the moment of creating the random initializer instance.
13.7 Docknet layers
The Docknet layers are defined at src/docknet/layer. As for the initializers, an
AbstractLayer class has been defined to act as an interface between every possible layer
implementation and the Docknet class. The abstract layer defines the following abstract
methods, that every layer must implement:
• forward_propagate: how does the layer compute its output, given the output of
the previous layer; this method is used for making predictions (see method predict
of the Docknet class)
• cached_forward_propagate: same as forward_propagate, but caching in each layer some values computed during forward propagation that are later required to compute the backward propagation (the output of the previous layer and the output of the linear part of this layer)
• backward_propagate: how does the layer compute the gradients of each
parameter, based on the cached values and the gradient of the cost function w.r.t. the
output of this layer (previously computed by the next layer during backward
propagation)
• clean_cache: deletes every cache variable so that, once a model is trained, these
values are omitted when saving the model to a file
Apart from abstract methods, the abstract layer class provides actual implementations of
some methods common to all layer implementations, to factor out code, namely the
dimension getter (the number of outputs of the layer) and params getter and setter (the
dictionary of parameters of the layer). The params dictionary is retrieved and modified by the
parameter initializers and the parameter optimizers. The dimension is used when adding new layers, since the number of parameters of a new dense layer depends both on the number of neurons of the layer and on the number of outputs of the previous layer.
For instance, given a dense layer l1, its weight matrix can be accessed as l1.params['W'].
13.7.1 Special class methods __getattr__ and __setattr__
By default, Python objects contain a special variable __dict__, a dictionary holding all the member variables of the object, so writing:
l1._params
is equivalent to writing:
l1.__dict__['_params']
If defined, the method __getattr__ is called whenever the __dict__ dictionary does not have the requested key. Our __getattr__ implementation looks for that key in the params dictionary of the class, so we can write:
l1.W
instead of:
l1.params['W']
This later simplifies the code of the forward and backward propagation methods of the layer. The method __setattr__ is called when trying to set the value of a member of the class. In our implementation we first check if the variable name is a key of the params dictionary and, if so, we set the value of that parameter. Otherwise, we call the __setattr__ of the parent class in order to let the Python interpreter continue with the standard behaviour (setting the value of a class variable). This way we can also use the following notation:
l1.W = W
instead of:
l1.params['W'] = W
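A minimal sketch of this attribute-forwarding pattern (illustrative only, not the actual Docknet implementation):

```python
# Unknown attribute names are forwarded to the params dictionary
class Layer:
    def __init__(self):
        # Write directly to __dict__ to avoid triggering __setattr__
        self.__dict__['params'] = {'W': 0.0, 'b': 0.0}

    def __getattr__(self, name):
        # Called only when name is not found in __dict__
        params = self.__dict__.get('params', {})
        if name in params:
            return params[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Redirect parameter names to the params dictionary
        params = self.__dict__.get('params', {})
        if name in params:
            params[name] = value
        else:
            super().__setattr__(name, value)

l1 = Layer()
l1.W = 3.0           # equivalent to l1.params['W'] = 3.0
```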
13.7.2 Docknet input layer
The input layer still has to declare a dictionary of parameters even though it does not have any parameters, so it declares an empty dictionary. During backward propagation it simply returns an empty array of gradients, so the optimizer will not try to modify any layer parameter. Finally, clean_cache does nothing, since the input layer doesn’t cache anything.
13.7.3 Docknet dense layer
The forward propagation of a dense layer computes:
Z = W * A_previous + b
A = activation_function(Z)
The backward propagation is a little more complex; you can refer to the code to see exactly
how it is computed.
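As an illustrative sketch (not the actual Docknet implementation), the forward pass above can be written with NumPy as follows, where A_prev holds one example per column and relu stands in for the activation function:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.)

def dense_forward(W, b, A_prev, activation=relu):
    """Compute Z = W @ A_prev + b and A = activation(Z) for a dense layer.
    b is broadcast over the example columns."""
    Z = W @ A_prev + b
    A = activation(Z)
    return Z, A

W = np.array([[1., -1.],
              [0.5, 2.]])          # 2 neurons, each with 2 inputs
b = np.array([[0.],
              [1.]])               # one bias per neuron
A_prev = np.array([[1., 2.],
                   [3., 4.]])      # 2 inputs x 2 examples
Z, A = dense_forward(W, b, A_prev)
```

With this layout, a whole batch of examples is propagated with a single matrix product, which is why the Docknet arrays keep one example per column.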
13.8 Optimizers
As for the initializers (section 13.6) and the layers (13.7), the optimizers are implemented by
extending an abstract class, to ensure a common interface for all the optimizers (see script
src/docknet/optimizer/AbstractOptimizer.py). The optimizers must
implement 2 methods:
• reset: receives the list of layers of the network and performs any initialization
required before starting the training process. This method is called by the train
method of the Docknet before running any training iteration.
• optimize: receives the list of layers of the network and the corresponding list of
parameter gradients, one dictionary of gradients per layer. It then updates the values
of the parameters of each layer, based on the received gradients. This method is called
after each training iteration in the train_batch method of the Docknet.
p = p - learning_rate * gradient_of_p
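This update rule can be sketched as follows (an illustration assuming one params dictionary and one matching gradients dictionary per layer, not the actual GradientDescentOptimizer code):

```python
def gradient_descent_step(layers_params, layers_gradients, learning_rate=0.1):
    """Apply p = p - learning_rate * gradient_of_p to every parameter of
    every layer."""
    for params, gradients in zip(layers_params, layers_gradients):
        for name, gradient in gradients.items():
            params[name] = params[name] - learning_rate * gradient

params = [{'W': 1.0, 'b': 0.5}]      # one dict per layer (scalars for brevity)
grads = [{'W': 10.0, 'b': -5.0}]     # matching gradients
gradient_descent_step(params, grads, learning_rate=0.1)
print(params)  # → [{'W': 0.0, 'b': 1.0}]
```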
Unlike plain gradient descent, the Adam optimizer also takes into account the directions of
the previous parameter updates, distinguishing systematic directions (those that consistently
approach the minimum) from fluctuating directions (those that do not contribute to
approaching the minimum). The corrections applied to the layer parameters maximize the
systematic directions and dampen the fluctuating ones. To account for the directions of the
previous modifications, Adam maintains 2 variables v and s (see the Adam optimization code
for more info) which have to be reset to 0 values before starting the training.
14 Unit testing with pytest & PyCharm
In this guide we use the Python library pytest to write and run the tests. Python already
comes with a library called unittest that can be used for the same purpose; however,
pytest provides some interesting additional functionalities, such as fixtures and
parameterized tests, which we will see in the examples.
An automated test is just a function that makes use of some other function or method and
checks that, given an input, the function or method returns the expected output, for instance:
def test_addition():
    operand1 = 1
    operand2 = 2
    expected = 3
    actual = operand1 + operand2
    assert actual == expected
It is common terminology in all unit test frameworks (not just in pytest) to refer to the
returned value as “actual”, and to the correct value as “expected”. In Python we use the
reserved word assert to compare the actual and expected values, or in general to evaluate
any Boolean expression (e.g., correct behaviour might be to return some value above some
threshold, instead of returning some specific value). If the expression is false, an exception
is raised, and the test function is considered failed. Otherwise, the test function execution
ends without raising any exception, and thus is considered successful. If the code we run
raises any exception that is not captured (using a try/except block), the test is also
considered failed. Note that assert is a reserved Python word, so one does not need to
import any unit testing library such as unittest or pytest in order to use it.
We can also run all project tests from the command line by running the build script (see
section 8). Indeed, it is good practice to run the build script before uploading any changes to
the Git repository in order to make sure the code we share with the team is stable; otherwise,
we might be introducing some bug that could block the work of our colleagues. To prevent
this from happening, it is also common practice to implement a continuous integration
pipeline integrated with the Git repository (explained in section 22), which will run all the
tests upon each change committed to the repository in order to detect errors as soon as
possible.
In case we already have the Python virtual environment created, activated, and with all the
dependencies installed (e.g., by running pip install -r requirements.txt), we
can also run command:
pytest
in any project folder in order to run from the command line all tests inside that folder, without
having to recreate the entire virtual environment.
In pytest, any function whose name starts with test is considered to be a test. Since we put
all the test files in folder test (as explained in section 12), and to prevent pytest from
mistaking some function in folder src for a test, we add a file pytest.ini to the root folder
of the project with the following content:
[pytest]
testpaths = test/unit
Note that in order to be able to run pytest from the command line and let it find the
corresponding source code in folder src, we need to add code along these lines to file
test/unit/__init__.py (the exact path manipulation depends on the project layout):
import os
import sys

# Make the src folder importable when running pytest from the command line
sys.path.insert(0, os.path.abspath(
    os.path.join(os.path.dirname(__file__), '..', '..', 'src')))
It is also possible to run individual tests with command pytest in the command line;
however, running them with PyCharm eases navigating through the different test outputs and
towards the precise lines of code that produced the exceptions, since PyCharm interprets the
pytest output messages and adds the corresponding clickable links. Moreover, one can add
breakpoints to specific lines of code by clicking in PyCharm on the margin space to the right
of the line number; we will see a big red dot appear in the margin to mark the breakpoint,
which we can click on in order to remove the breakpoint. When debugging code, we must set
at least one breakpoint so that the debugger pauses the execution at that point; otherwise
all the test code will be run without stopping, having the same effect as running the test
instead of debugging it.
14.2 Organizing unit tests
For each Python script xxx.py in the src folder that we want to test - a script implementing
some component or set of related functions - we create a corresponding test file
test_xxx.py inside the unit folder. Note that in folder src we may define several folders
and sub-folders in order to better structure the scripts. While pytest does not make any
difference between different test packages - and indeed we cannot have 2 test scripts with
the same name even if they are in different folders - it can be useful to mimic inside the
unit folder the same folder structure as in the src folder, in order to make more obvious
which test file corresponds to which source code script. Moreover, having related tests in a
separate folder allows us to easily run all of them, either in PyCharm by right clicking on the
corresponding folder and then clicking on run/debug tests, or from the command line by going
to the corresponding folder and running command pytest. Regarding the data folder, it can
also be useful to replicate the src folder structure inside it.
When using pytest and organizing the tests in different folders, it is not really needed to add
an __init__.py file to each folder, since pytest ignores the packages and assumes all the
test files belong to one single anonymous package. However, there are 2 cases in which we
may require the __init__.py files, hence it can be better to systematically add them:
1. To be able to run pytest from the command line and let it find the source code files in
folder src, as explained in the previous section
2. To be able to import and reuse test code across different test files
Regarding the second point, we have an example of this in the Docknet library: script
test/unit/docknet/dummy_docknet.py contains hardcoded computations of a
specific neural network, with all the values that are expected during the forward and
backward propagation of a first training iteration. We then use these expected values to test
the different Docknet components, which are distributed across different test files. For
instance, test file test/unit/docknet/layer/test_dense_layer.py imports all
the expected values from dummy_docknet.py.
14.3 pytest fixtures
@pytest.fixture
def data_generator1():
    generator = SwirlDataGenerator(x_range, y_range)
    yield generator
A fixture is a function that accepts no arguments and returns some object or value that is to
be used by other test functions. Note the parameters x_range and y_range have been
hardcoded as global parameters of the test script so they can be reused in any test. We add
the annotation:
@pytest.fixture
so pytest recognizes the function as a fixture. The fixture returns the object or value using
the Python keyword yield instead of return, so once the test method has finished using
the object returned by the fixture, the execution flow goes back to the line right after yield
in order to free any resources taken by the object (e.g., closing a file, deleting a memory
buffer, etc.). In unit test frameworks, the process of creating the objects needed for running
a test is commonly known as setup, and the process of deleting and/or freeing the resources
as tear down.
def test_generate_sample(data_generator1):
    size = 2000
    X, Y = data_generator1.generate_balanced_shuffled_sample(size)
    axe = plt.subplot()
    plot_scatter(axe, X[0, :], X[1, :], Y[0, :], x_range, y_range, 'Swirl sample')
    assert X.shape == (2, size)
    assert Y.shape == (1, size)
We simply add the name of the fixture function as a parameter of the test function; we can
then use that parameter as if it were the object or value returned by the fixture. This unit test
tests the generate_sample method of the SwirlDataGenerator, which is analogous
to the generate_sample method of the ChessBoardDataGenerator described in
section 13.3. In this test we generate a random sample and check that the returned arrays
have the proper shape. Even if we are not checking the exact values being returned, the test
already forces the code of this data generator to run, allowing us to catch errors we may have
made. Note that in contrast with other popular programming languages like Java or C++,
Python is an interpreted language, which means that the code is not translated to machine
language (compiled) before running it, and hence many errors will not be caught until we run
the code.
14.4 Parameterized pytests
There are two approaches to testing multiple input cases: writing one test function per case,
or writing a single parameterized test that receives the inputs and expected outputs as
arguments. Each approach has its advantages:
1. The first approach is more verbose but allows us to run one specific test case by right
clicking on the corresponding test function and then clicking on run or debug.
Furthermore, when running all the tests it is easier to identify which particular test
case failed, provided that we give each test case function a descriptive name, since
we will then get the list of all test function names that failed (e.g.,
test_sigmoid_zero instead of just test_sigmoid for some test case).
2. The second approach is more convenient when we want to test a longer list of cases,
and the list of input parameters and expected values is short (e.g., the input value of
the sigmoid function and the expected output).
In order to define a parameterized test, we first create an array of tuples where each tuple
contains the input and expected values for one test case. In the example below, since our
implementation of the sigmoid function may accept either simple values or NumPy arrays,
we test both kinds of input for the border and normal cases. Our tuples are pairs where the
first value or array is the input, and the second value or array is the expected output:
sigmoid_test_cases = [
    (np.array([-100., 0., 100.]), np.array([0., 0.5, 1.])),
    (-100., 0.),
    (0., 0.5),
    (100., 1.),
    (np.array([0.]), np.array([0.5])),
]
@pytest.mark.parametrize(argument_names, argument_values)
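Putting the pieces together, a complete parameterized test might look as follows (a sketch: the sigmoid implementation is inlined here for illustration, whereas the Docknet one lives in the library source):

```python
import numpy as np
import pytest
from numpy.testing import assert_array_almost_equal

def sigmoid(z):
    # accepts both simple values and NumPy arrays
    return 1. / (1. + np.exp(-np.asarray(z, dtype=float)))

sigmoid_test_cases = [
    (np.array([-100., 0., 100.]), np.array([0., 0.5, 1.])),
    (-100., 0.),
    (0., 0.5),
    (100., 1.),
    (np.array([0.]), np.array([0.5])),
]

# pytest runs test_sigmoid once per (z, expected) tuple in the list
@pytest.mark.parametrize('z, expected', sigmoid_test_cases)
def test_sigmoid(z, expected):
    assert_array_almost_equal(sigmoid(z), expected)
```

The first string argument of parametrize names the test function's parameters, and the list supplies one tuple of values per test case.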
Note that in this particular test function we have used the NumPy function
assert_array_almost_equal instead of Python’s keyword assert. When
comparing float numbers, a test may pass or fail depending on where it is run, because
different machines may compute float numbers with slightly different precision; hence we
cannot use the strict equality comparator ==. NumPy provides this test function, which can
be used to compare either simple values or NumPy arrays for equality up to a given number
of decimals, which by default is set to 6.
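For instance, the following sketch shows why strict equality is fragile with floats and how assert_array_almost_equal tolerates the tiny representation error:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

a = 0.1 + 0.2                      # actually 0.30000000000000004
assert a != 0.3                    # strict equality fails on floats
assert_array_almost_equal(a, 0.3)  # passes: equal up to 6 decimals
assert_array_almost_equal(np.array([a, 1. / 3.]),
                          np.array([0.3, 0.3333333]))
```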
Note that when we run this test function in PyCharm, PyCharm will run the function as many
times as there are test cases, following the order given by the list of tuples. If we want to
debug one particular case, we either have to put that case first in the list, so the debugger
will start with it, or comment out all the previous test cases.
Here is another example of a parameterized test, for the cross-entropy function. This test
case is taken from file test/unit/function/cost_function.py. This particular
test was used to trace back an error produced when the neural net manages to output exactly
0 or 1 instead of values near 0 or 1, in which case the cross-entropy function is not defined
and a NaN was being returned, breaking the training process. Including these test cases
allowed us to check different ways to overcome this situation until the expected behaviour
was achieved, as well as serving as documentation of what the cross-entropy function should
return in these border cases.
cross_entropy_test_cases = [
    (np.array([[1., 0.]]), np.array([[1., 0.]]), 0.),
    (np.array([[1., 0.]]), np.array([[0., 1.]]), np.inf),
    (np.array([[1., 0.]]), np.array([[0.5, 0.]]), 0.6931471805599453),
    (np.array([[1., 0.]]), np.array([[1., 0.5]]), 0.6931471805599453),
    (np.array([[1., 0.]]), np.array([[0.5, 0.5]]), 1.3862943611198906),
]
14.5 A bigger pytest, and a word on mocking
Finally, here is a fixture creating a simple Docknet object, which is later used to test an
entire training iteration with the corresponding forward and backward propagation:
@pytest.fixture
def docknet1():
    docknet1 = Docknet()
    docknet1.add_input_layer(2)
    docknet1.add_dense_layer(3, 'relu')
    docknet1.add_dense_layer(1, 'sigmoid')
    docknet1.cost_function = 'cross_entropy'
    docknet1.initializer = DummyInitializer()
    docknet1.optimizer = GradientDescentOptimizer()
    yield docknet1

def test_train(docknet1):
    docknet1.train(X, Y, batch_size=2, max_number_of_epochs=1)
    expected_optimized_W1 = optimized_W1
    expected_optimized_b1 = optimized_b1
    expected_optimized_W2 = optimized_W2
    expected_optimized_b2 = optimized_b2
    actual_optimized_W1 = docknet1.layers[1].params['W']
    actual_optimized_b1 = docknet1.layers[1].params['b']
    actual_optimized_W2 = docknet1.layers[2].params['W']
    actual_optimized_b2 = docknet1.layers[2].params['b']
    assert_array_almost_equal(actual_optimized_W1, expected_optimized_W1)
    assert_array_almost_equal(actual_optimized_b1, expected_optimized_b1)
    assert_array_almost_equal(actual_optimized_W2, expected_optimized_W2)
    assert_array_almost_equal(actual_optimized_b2, expected_optimized_b2)
The train function is invoked, requesting to run for one epoch only on an input batch of
size 2. Afterwards, the test verifies that each layer’s parameters have been set to the expected
values. Strictly speaking, this is not a unit test, since we are testing not only the train
method but also all the other class methods (such as the forward and backward propagation
methods of the DenseLayer class) involved in the training process. A more advanced
testing technique consists in creating mocked versions of the objects, hardcoding the values
that their methods should return during the test, so that if the test fails, it is because of the
implementation of the train method and not because of some other method it uses.
Nevertheless, this test method can help us debug the entire training process to make sure it
is properly implemented, regardless of the other parts of the Docknet library.
More information on pytest and other of its functionalities can be found at:
https://docs.pytest.org/en/latest/
https://mock.readthedocs.io/en/latest/
14.6 A note on code refactoring
In order to support this way of coding, unit testing is a key factor. As the number of
components increases, it is difficult to assess what the impact of modifying some component
will be. Perhaps the new feature to implement requires some modification of the inputs or
outputs some function uses, and that will have an impact on every component using the
function: they will all have to be refactored in order to conform to the new interface. In turn,
adapting a component may result in modifications that have to be further propagated to
other components using it, and so forth. Having unit tests for each one of the components
gives us control over the components that need to be modified: once we make a modification,
we can run all the tests to see which fail, which in turn points us towards all the components
that will have to be adapted. Refactoring code then consists in updating the different
components and the corresponding tests in order to reflect the new expected behaviour,
re-running the tests in order to track how the changes propagate across the entire project,
and continuing the refactoring until we arrive at a new stable version of the code that we can
share with the team.
Finally, a code refactor may at some point turn out to be a bad idea, and we may want to go
back to the previous version of the code, dropping the whole sequence of changes we may
have made to many different files. For this reason, working with Git branches (explained in
the next section) is key: it allows us to test any code modification, no matter how risky it may
seem, since with one single Git command we can go back to the master version of the code,
which we should all try to keep stable.
15 Working with Git branches
15.1 Git Workflow
Below is a simple description of each of the three main sections of a Git project (source:
https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F):
Working Directory. This is a single checkout of one version of the project. These files are
pulled out of the compressed database in the Git directory and placed on disk for you to use
or modify.
Staging area or Index. This corresponds to a file generally contained in your Git directory, that
stores information about what will go into your next commit.
The Git directory is where Git stores the metadata and object database for your project. This
is what is copied when you clone a repository from another computer.
Then, the different states can be summarised in the following manner (source: https://git-
scm.com/book/en/v2/Getting-Started-What-is-Git%3F)
If a version of a file is in the Git directory, it’s considered to be committed. If it has been
modified and was added to the staging area, it is staged. And if it was changed since it was
checked out but has not been staged, it is modified.
A more detailed workflow than the one above, further illustrating the relation between the
workspace, the index, and the repository - and the more general idea of using Git to build a
workflow - is displayed below:
https://blog.osteele.com/2008/05/my-git-workflow/
15.2 Git Branches
(Source: https://www.atlassian.com/git/tutorials/using-branches)
What is a branch? Git branches are effectively a pointer to a snapshot of your changes.
When do we need to branch? When you want to add a new feature or fix a bug - no matter
how big or how small - you spawn a new branch to encapsulate your changes.
Don’t mess with the Master (https://thenewstack.io/dont-mess-with-the-master-working-
with-branches-in-git-and-github)
Merging
https://www.atlassian.com/git/tutorials/using-branches/git-merge
Example
15.3 Git merge conflicts
(Source: https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts)
During a merge, Git will try to figure out how to automatically integrate new changes.
However, there are cases where Git cannot automatically determine what is correct. Two
common causes of conflicts include: (a) current local branch has a modified file while the
branch being merged does not have that file – it has been deleted (b) the two branches have
one or more files with changes in the same lines that differ across the branches. Git will mark
the file as being conflicted and stop the merging process. It is then up to the developers to
resolve the conflict.
Types of merge conflict. A merge conflict can arise at two separate points: when starting the
merge and during the merge process:
Unmerged paths:
(use "git add <file>..." to mark resolution)
The output of git status indicates that there are unmerged paths due to a conflict. It also
indicates the files causing the conflict: ‘merge.txt’ in the example above.
The next step is to examine the conflicting file(s) and see what the discrepancies are. We can
do this by using the cat command. Git uses three different types of marker lines to display
the differences between the branches in the modified file. See an example below:
cat merge.txt
<<<<<<< HEAD
this is some content to mess with
content to append
=======
totally different content to merge later
>>>>>>> new_branch_to_merge_later
The ======= line is the ‘center’ of the conflict. All the content between the center
and the <<<<<<< HEAD line is the content that exists in the current branch master,
which the HEAD ref is pointing to. All content between the center and >>>>>>>
new_branch_to_merge_later is content that is present in our merging branch.
Once the file has been edited - in this case simply combining the text from both versions -
use git add merge.txt to stage the new merged content. To finalise the merge, create
a new commit by executing:
git commit -m 'merged and resolved the conflict in merge.txt'
• git merge --abort: executing git merge with the --abort option will exit from the
merge process and return the branch to the state before the merge began
• git reset: can be used during a merge conflict to reset conflicted files to a known
good state.
Advanced tips
• Merging vs Rebasing: https://www.atlassian.com/git/tutorials/merging-vs-rebasing
16 Object serialization and deserialization
16.1 JSON
if isinstance(pathname_or_file, str):
    with open(pathname_or_file, 'wt', encoding='UTF-8') as fp:
        json.dump(self, fp, **kwargs)
else:
    json.dump(self, pathname_or_file, **kwargs)
Note that under the hood the method simply calls json.dump; by default, json.dump
supports class attributes that are either simple data types (e.g., numbers, strings, etc.) or
Python dictionaries and lists. For other classes we need to implement our own JSON encoder:
class DocknetJSONEncoder(json.JSONEncoder):
    """
    JSON encoder needed for serializing a Docknet to JSON format; defines how
    to serialize special Docknet classes such as the Docknet itself, the
    layers and NumPy arrays
    """
    def default(self, obj: Any) -> Union[List[AbstractLayer],
                                         Dict[str, Union[int, Dict[str, np.ndarray], str]],
                                         object]:
        if isinstance(obj, Docknet):
            return obj.layers
        elif isinstance(obj, AbstractLayer):
            return obj.to_dict()
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super().default(obj)
The encoder simply overrides the default method of the parent JSONEncoder in order
to have a special behaviour whenever the object to serialize is a Docknet, any kind of layer
(a child of AbstractLayer), or a NumPy array. For each case we define how to convert
these objects into Python dictionaries or lists, and let json take care of serializing those. For
any other case we simply revert to the default serializer of the JSONEncoder.
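As a self-contained analogue (the real DocknetJSONEncoder also handles Docknet and layer objects), overriding default is enough to let json serialize NumPy arrays:

```python
import json

import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    """Minimal analogue of DocknetJSONEncoder: convert NumPy arrays to
    lists so json can serialize them (illustrative sketch only)."""
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

data = {'W': np.array([[1., 2.], [3., 4.]])}
print(json.dumps(data, cls=NumpyJSONEncoder))
# → {"W": [[1.0, 2.0], [3.0, 4.0]]}
```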
Here we test the serializer, invoking the method for a given Docknet and comparing the result
with the contents of an expected JSON file we have created in advance (file
test/data/docknet1.json):
def test_to_json(docknet1):
    # Set network parameters as for the dummy initializer in order to enforce
    # a specific expected output
    docknet1.initializer.initialize(docknet1.layers)
    expected_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_path, 'rt', encoding='UTF-8') as fp:
        expected = fp.read()
    actual_file = io.StringIO()
    docknet1.to_json(actual_file, True)
    actual = actual_file.getvalue()
    assert actual == expected
A little trick for not having to manually write the expected JSON file is to first use the unit test
to save the generated JSON as the expected one. Then we manually check the file in order to
ensure it is correct, and finally remove or comment out the code for saving the actual JSON
as the expected one. If at some point the serialization code is broken, the actual JSON will
differ from the expected one and the test will fail. For instance, imagine we change the
definition of the Docknet by adding some attribute of a new class for which we have not
defined a custom JSON serializer: the test will fail when trying to serialize this new Docknet,
reminding us that we also need to adapt the JSON serializer.
We use the method json.load in order to load the JSON file as a Python dictionary whose
values are either simple data types, other Python dictionaries, or lists. In the same way we
implemented a custom JSON encoder to transform the Docknet object into these data types,
we now need some custom code to re-instantiate the Docknet object from this Python
dictionary representation. We create an empty Docknet instance, then traverse the list of
layer descriptions and create the corresponding layers one by one. We check field ‘type’ to
know which kind of layer to instantiate, and extract the layer dimension and activation
function from fields ‘dimension’ and ‘activation_function’, respectively. Once a layer is added
to the Docknet, we check if the JSON file includes parameters for the layer (field ‘params’). If
that’s the case, we parse the parameters (convert the lists of values back to NumPy arrays)
and simply assign them to the layer params field.
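The reconstruction loop just described can be sketched as follows (a hypothetical helper using plain dictionaries, not the actual Docknet code, which instantiates the real layer classes):

```python
import numpy as np

def layers_from_json_dicts(layer_descriptions):
    """Sketch of the deserialization traversal: inspect the 'type',
    'dimension', 'activation_function' and 'params' fields of each layer
    description, converting parameter lists back to NumPy arrays."""
    layers = []
    for description in layer_descriptions:
        layer = {'type': description['type'],
                 'dimension': description['dimension']}
        if 'activation_function' in description:
            layer['activation_function'] = description['activation_function']
        if 'params' in description:
            # convert the JSON lists of values back to NumPy arrays
            layer['params'] = {name: np.array(values)
                               for name, values in description['params'].items()}
        layers.append(layer)
    return layers

descriptions = [
    {'type': 'input', 'dimension': 2},
    {'type': 'dense', 'dimension': 3, 'activation_function': 'relu',
     'params': {'W': [[1., 2.], [3., 4.], [5., 6.]],
                'b': [[0.], [0.], [0.]]}},
]
layers = layers_from_json_dicts(descriptions)
```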
In order to test the deserializer we simply deserialize an example Docknet JSON, serialize it
back and verify that we obtain again the same file:
def test_read_json_to_json():
    expected_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_path, 'rt', encoding='UTF-8') as fp:
        expected_json = fp.read()
    actual_docknet = net.read_json(expected_path)
    actual_file = io.StringIO()
    actual_docknet.to_json(actual_file, True)
    actual_json = actual_file.getvalue()
    assert actual_json == expected_json
16.2 pickle
Implementing pickle serializers and deserializers is much more straightforward than JSON’s,
since there is no need to implement custom serializers and deserializers. We simply call
pickle.dump in order to create the binary file:
else:
    pickle.dump(self, pathname_or_file)
Since the resulting file is binary, we have no way to manually check the contents of the file in
order to validate it, like we did with the JSON file. As a work-around, we create a Docknet from
a JSON file, save it to pickle, load it back from the pickle, save it to JSON again, and then
check that the resulting JSON is equal to the expected one:
def test_to_pickle_read_pickle_to_json(docknet1):
    # Set network parameters as for the dummy initializer in order to enforce
    # a specific expected output
    docknet1.initializer.initialize(docknet1.layers)
    pkl_path = os.path.join(temp_dir, 'docknet1.pkl')
    expected_json_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_json_path, 'rt', encoding='UTF-8') as fp:
        expected_json = fp.read()
    docknet1.to_pickle(pkl_path)
    docknet2 = read_pickle(pkl_path)
    actual_file = io.StringIO()
    docknet2.to_json(actual_file, True)
    actual_json = actual_file.getvalue()
    assert actual_json == expected_json
Provided that the JSON unit tests pass, if the pickle unit test fails then the problem should
be in the pickle serialization/deserialization code.
17 Python package commands
We can easily create command line entry points using Python’s argparse package. The
Docknet library includes commands for generating datasets, training, evaluating, making
predictions, and starting the web service.
This way we can use the library without the need for Jupyter notebooks, and potentially run
it with bigger datasets on a server by invoking the commands from a shell. Moreover, we can
create bash scripts that call sequences of commands (e.g., generate a dataset, train a
Docknet, then make predictions). As an example, here is the code used to create the
command line entry point for the data generators:
import argparse
import sys

import pandas as pd

def parse_args():
    """
    Parse command-line arguments
    :return: parsed arguments
    """
    parser = argparse.ArgumentParser(description='Generate dataset')
    parser.add_argument('--generator', '-g', action='store', required=True,
                        help=f'Data generator to use '
                             f'({",".join(data_generators.keys())})')
    parser.add_argument('--x0_min', action='store', default=-5.0, type=float,
                        help='Minimum value of x0')
    parser.add_argument('--x0_max', action='store', default=5.0, type=float,
                        help='Maximum value of x0')
    parser.add_argument('--x1_min', action='store', default=-5.0, type=float,
                        help='Minimum value of x1')
    parser.add_argument('--x1_max', action='store', default=5.0, type=float,
                        help='Maximum value of x1')
    parser.add_argument('--size', '-s', action='store', required=True,
                        type=int, help='Sample size')
    parser.add_argument('--output', '-o', action='store', default=None,
                        help='Output path (defaults to standard output)')
    args = parser.parse_args()
    if args.generator not in data_generators.keys():
        print(f'Unknown data generator {args.generator}; available generators '
              f'are: {",".join(data_generators.keys())}')
        sys.exit(1)
    if args.x0_min >= args.x0_max:
        print('Empty x0 range')
        sys.exit(1)
    if args.x1_min >= args.x1_max:
        print('Empty x1 range')
        sys.exit(1)
    return args

def main():
    args = parse_args()
    generator = make_data_generator(args.generator, (args.x0_min, args.x0_max),
                                    (args.x1_min, args.x1_max))
    X, Y = generator.generate_balanced_shuffled_sample(args.size)
    X_df = pd.DataFrame(X)
    Y_df = pd.DataFrame(Y)
    sample_df = pd.concat([X_df, Y_df], axis=0, ignore_index=True)
    if args.output:
        with open(args.output, 'wt', encoding='UTF-8') as fp:
            sample_df.to_csv(fp, header=False, index=False)
    else:
        sample_df.to_csv(sys.stdout, header=False, index_label=False)

if __name__ == '__main__':
    main()
We basically declare an ArgumentParser, then declare the potential parameters for this
parser with method add_argument. For each argument we can declare:
• a long argument name (e.g., generator)
• an abbreviated form of the name (e.g., g)
• the action to perform if the parameter is given (e.g., ‘store’ for storing the associated
value, ‘store_true’ for simply associating a Boolean True value to the parameter)
• whether the parameter is required (required=True) or not required but with a
default value (e.g., default=-5.0)
• the type of the parameter value, in case it is not just a string (e.g., type=float),
and
• a help message describing the parameter
Note that these are just some of the typical options one might use, not a comprehensive list
of all the options available in argparse. For comprehensive documentation, please refer to
the official documentation at: https://docs.python.org/3/library/argparse.html.
Also note that by default a parameter --help (or -h) is automatically created, which will
print on the screen all the available parameters and their corresponding help messages.
Once the parameters are declared, we simply call method parse_args to parse the
parameters used when calling the Python script. An object containing all the parameters
found along with their values will be returned. argparse will raise an exception if unknown
parameters are found, if required arguments are not provided, or if invalid parameter values
are given (e.g., the parameter is supposed to be an integer, but a different kind of value is
provided). Argparse can perform some additional checks depending on the restrictions
specified in the parameter declaration (refer to the official documentation for more
information). Additional and more convoluted checks are to be implemented by us (e.g.,
checking that the minimum value of x0 is less than its maximum value).
Finally, we need to declare in the setup.py script the system commands that will be
created upon installing the Docknet Python package:
entry_points={'console_scripts': [
    'docknet_generate_data = docknet.generate_data:main',
    'docknet_evaluate = docknet.evaluate:main',
    'docknet_predict = docknet.predict:main',
    'docknet_start = docknet.app:main',
    'docknet_train = docknet.train:main'
]},
Each entry maps a command name to a function; for instance, command
docknet_generate_data is mapped to the main function of the
docknet.generate_data script. As a convention, we add the prefix
“docknet_” to all our commands so we can easily get a list of all the Docknet commands
available in a terminal: we can start by typing “docknet” in the terminal and then press the
tab key twice, and the OS autocompletion function will list all the available commands that
start with “docknet”.
18 Resource files
By default, the package build process only includes the Python script files inside the src
folder, even if we add a resources folder inside it and place some resource files there (e.g.,
precomputed pickle models and configuration files). Files other than Python scripts must be
listed in setup.py so that they are also included in the package:
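The exact setup.py fragment is not reproduced here; a typical way to declare such files, assuming the resources folder sits inside the docknet package, is the package_data argument of setuptools’ setup call:

```python
setup(
    ...
    package_data={'docknet': ['resources/*']},
)
```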
Upon installing the Python package, the package file will simply be unzipped inside the lib
folder of the active Python virtual environment. For instance, after running the build script
you can see the Docknet resource files at:
$HOME/docknet_venv/lib/python3.9/site-packages/docknet/resources
Given a Python script in the src/docknet folder, to access a resource file such as the
chessboard.pkl file in the resources folder, one can simply build the path of the resource
file relative to the script that is running, then open it as a standard file with Python’s
open. Note there are two possible cases:
1. We are running the Docknet code that we have previously installed in the Python
virtual environment
2. We are running the code directly from the copy we downloaded from the project repo,
without installing the package (running from the source code).
Depending on the case, the resource file to open will be in a different folder (the lib folder
of the Python virtual environment, or the folder where we cloned the repository). Moreover,
different developers will have these files in different folders, since each one has a
different user folder. For this reason, we must use paths to the resource files that are
relative to the running script.
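The relative path construction can be sketched with the standard library as follows. This is an illustrative helper, not the actual Docknet code; the file name is also illustrative.

```python
import os

def resource_path(script_file, name):
    """Return the path of a resource file relative to the given script.

    Call it as resource_path(__file__, 'chessboard.pkl') from the script
    that needs the resource; the result is valid both when the package is
    installed in site-packages and when running from the source tree.
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, 'resources', name)

# The resource can then be opened as a standard file:
# with open(resource_path(__file__, 'chessboard.pkl'), 'rb') as f: ...
```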
19 Configuration files
Configuration files are typically used to store in one place all the application parameters (e.g.,
default hyperparameters used to train a model, parameters of each component of a
processing pipeline, the model to use for a given component, etc.). In the Docknet library we
simply store the configuration of the Docknet web service (explained in section 20). A typical
format for configuration files is YAML, which can be seen as a less verbose version of JSON:
app:
  debug: True
  host: 0.0.0.0
  port: 8080
To load the configuration file, one can compute the path to the file as for any resource
file, then use the PyYAML library to load it as a Python dictionary. The same configuration
dictionary is to be used across all the objects that form the application, so it would be a waste
of resources to load the configuration several times. The Docknet project includes a utilities
file that ensures the configuration is loaded one time only, upon the first instantiation of a
Config object:
config = Config()
Further instantiations of the Config class return the same Python dictionary object over and
over.
app.run(**config.app)
This particular example is used to start a web service with all the required configuration
parameters in the YAML app section. Note that for this to work the parameter names in the
config file section must match those of the Python method that is being invoked.
20 Web services in Python with Flask
We have already seen in section 17 how to add a command-line interface that wraps some
business logic implemented as a Python class or method. A web service is just another
way of packaging an application so that it can be accessed through a web interface (e.g., a
web browser). While one can also invoke commands on a remote machine through SSH, a
web interface is usually more convenient and user friendly. For the app to be accessible from
a web client, the app needs to implement a REST interface. These interfaces can be easily
implemented by using the Flask and Flask-RESTful libraries. The REST interface in the Docknet
library is defined in the app.py script:
import os

import numpy as np
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)
config = Config()

class PredictionServer(Resource):
    """
    REST API layer on top of a Docknet model that returns predicted values,
    given an input vector defined by 2 parameters x0 and x1. This class is to
    be inherited by another class that provides the path to the Docknet
    pickle model file to use for making predictions.
    """
    def __init__(self, pkl_pathname: str):
        """
        Create a prediction server
        :param pkl_pathname: path to the Docknet pickle model file to load
        """
        super().__init__()
        self.docknet = read_pickle(pkl_pathname)
class ChessboardPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(chessboard_model_pathname)

class ClusterPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(cluster_model_pathname)

class IslandPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(island_model_pathname)

class SwirlPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(swirl_model_pathname)

# Add the prediction servers for each one of the 4 models chessboard,
# cluster, island and swirl
api.add_resource(ChessboardPredictionServer, '/chessboard_prediction')
api.add_resource(ClusterPredictionServer, '/cluster_prediction')
api.add_resource(IslandPredictionServer, '/island_prediction')
api.add_resource(SwirlPredictionServer, '/swirl_prediction')

def main():
    # Start the service; the service stops when the process is killed or
    # the docker container running this service is shut down
    app.run(**config.app)

if __name__ == '__main__':
    main()
We have defined a generic PredictionServer, which loads the specified Docknet model
upon instantiation. The server implements the REST get interface in order to compute a
prediction for a given point specified in the request URL by means of parameters x0 and x1.
Internally, the server simply parses the parameters, verifies that they are correct, uses the
predict method in order to compute a result, generates a response in JSON format, and
returns it. In case of error, an error message is returned instead of a prediction result.
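The parse/validate/respond logic of the get method can be sketched as a framework-free helper. This is a hypothetical illustration, not the actual Docknet code; the real get method obtains the arguments from the Flask request object.

```python
def make_prediction_response(docknet, args):
    """Build the JSON-serializable response for a prediction request.

    :param docknet: a loaded Docknet model exposing a predict method
    :param args: dictionary of query string parameters
    """
    for name in ('x0', 'x1'):
        if name not in args:
            return {'success': False,
                    'message': 'Missing mandatory argument ' + name}
    try:
        x0, x1 = float(args['x0']), float(args['x1'])
    except ValueError:
        return {'success': False,
                'message': 'Arguments x0 and x1 must be numbers'}
    # Delegate the actual computation to the loaded model
    return {'success': True, 'message': docknet.predict(x0, x1)}
```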
In order to start the web server, we just need to activate the Python virtual environment
where the Docknet library is installed then invoke command docknet_start from the
terminal. Once the server is initialized, we can send prediction requests by entering any of
the following URLs in a web browser:
http://localhost:8080/chessboard_prediction?x0=2&x1=2
http://localhost:8080/cluster_prediction?x0=2&x1=2
http://localhost:8080/island_prediction?x0=2&x1=2
http://localhost:8080/swirl_prediction?x0=2&x1=2
Note these URLs are valid for our local machine, and that the specified port is 8080, the same
one given in the config file. Note as well that the URLs above set both parameters x0 and x1
to the value 2; one can modify these values at will. Finally, one can also use the curl
command instead of a web browser in order to get the response in the command line, for
instance:
curl "http://localhost:8080/chessboard_prediction?x0=2&x1=2"
Note that whether we use a web browser or curl we obtain a message as follows:
{
"message": 1,
"success": true
}
If a prediction could be computed, field success is true, and the message contains the
prediction. If an error happened, then field success is false and field message contains the
error message. For instance, the following URL…
http://localhost:8080/chessboard_prediction?x0=2
… results in message
{
    "message": "Missing mandatory argument x1",
    "success": false
}
To stop the web service, simply press Ctrl + C in the terminal where you started the service,
or close that terminal, or kill that process.
Finally, the web app can be installed on a server and consumed remotely, provided that the
server has port 8080 open and either an IP address or a domain name accessible from our
location, either within the same LAN or from the Internet. However, configuring and securing
a server is beyond the scope of this guide.
21 Docker
Docker images can be used to easily replicate a specific machine in which some software is
to be run, so we can deploy one or more instances of our application without having to
configure or install the required dependencies for each app instance we want to run. Docker
images are similar to virtual machines, though they are more efficient: while virtual machines
are full copies of machines, with the corresponding virtual hardware and virtual operating
system, Docker images reuse the Linux kernel of the physical machine where they are
running. This is also why Docker cannot run natively on non-Linux platforms such as
Windows or macOS.
On Windows and macOS we need to install Docker Desktop, which contains everything that
is needed in order to create and run Docker images in those machines. The official installation
instructions and download link can be found here:
• macOS: https://docs.docker.com/docker-for-mac/install/
• Windows: https://docs.docker.com/docker-for-windows/install/
Note that since February 2022 Docker Desktop requires a paid subscription for companies over
a certain size, so to use it in Accenture a WBS is required. A free alternative using some
packages available with Homebrew can be found here:
available with Homebrew can be found here:
https://dhwaneetbhatt.com/blog/run-docker-without-docker-desktop-on-macos
On Ubuntu we just need to install the Docker Engine, which is free to use. The official
instructions can be found here:
https://docs.docker.com/engine/install/ubuntu/
In order to define a Docker image, we create a file named Dockerfile where we list the
sequence of commands required to install the corresponding machine, using the
Dockerfile notation. For instance, the Docknet project includes the following
Dockerfile in the root folder:
FROM ubuntu:22.04
LABEL docknet.docker.version="1"
# System update
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get dist-upgrade -y
# Set locale
RUN apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
# Run the build script as the docker user to install the package and run the tests
USER docker
WORKDIR /home/docker/docknet
RUN delivery/scripts/build.sh
The Dockerfile command FROM indicates which Docker image to use as the base for this
Docker image. This is a mechanism similar to class inheritance, where a Docker image inherits
the result of running the parent Dockerfile commands, then adds additional commands
afterwards. We inherit here the base Ubuntu 22.04 Docker image in order to run the project
in this system:
FROM ubuntu:22.04
Note that Ubuntu 22.04 is not particularly lightweight, though we use it here for convenience,
since installing software in Ubuntu is quite straightforward. For production environments,
Alpine Linux (https://alpinelinux.org/) is typically used instead, since it is an extremely
lightweight distribution specifically developed to run inside Docker containers. However, it
usually takes more time and effort, since more software components need to be installed in
such images.
The Dockerfile command LABEL is used to add arbitrary label/value pairs to a Docker image.
We define here a label docknet.docker.version with the Dockerfile version number:
LABEL docknet.docker.version="1"
Defining this label right at the beginning of the Dockerfile has a specific purpose, due to the
Docker cache system. When we run a Dockerfile in order to build a Docker image, the result
of each Dockerfile instruction is saved in a cache. Trying to rebuild the Dockerfile without
modifying any line has no effect, since the Docker build system reuses the results stored in
the cache. When we modify one line of the Dockerfile, all results computed before the
modified line are retrieved from the cache; the modified line and the lines after it are re-run,
and the result of each is stored in the cache. If we want to re-run the entire Dockerfile
without modifying its commands, we can update the version value in the label, which has no
effect on the result apart from re-running all the commands. This is particularly useful when
we simply want to update the Docker image with the latest versions of the Ubuntu packages.
In the Dockerfile, we use the command RUN to run arbitrary commands in the selected OS,
and ENV to define environment variables. Note that building a Docker image is equivalent to
starting a brand-new machine with a fresh installation of the selected OS, then running some
commands to configure the machine and install the needed software. The blocks after the
label update the list of available Ubuntu packages, set the system locale, and install Python,
as explained in section 3.1:
# System update
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get dist-upgrade -y
# Set locale
RUN apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
The Dockerfile build process runs as the root user by default, so we have administrative
permissions to make any modifications needed. Since we do not require administrative
permissions to run our Python project tests, we add a standard user (without administrative
privileges) to test the project:
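The user-creation line itself is not reproduced here; a typical form, assuming the user is named docker as in the rest of the Dockerfile, would be:

```dockerfile
# Create a standard user without administrative privileges
RUN useradd -m -s /bin/bash docker
```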
This is not strictly needed, but it is good practice: do not run a command with administrative
privileges unless you really need it.
Next, we add the source code inside the Docker image using the Dockerfile command ADD:
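The ADD line itself is not reproduced here; given the destination folder used later in the Dockerfile, it presumably looks like:

```dockerfile
# Copy the whole project folder into the image
ADD . /home/docker/docknet
```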
Note the Docker image contains its own virtual file system, hence any additional file we want
to include has to be explicitly made available inside the container. The command ADD can be
used to copy a file or folder from the host machine to the Docker image. Since the
Dockerfile is in the root folder of the project, the entire project folder will be copied inside
the Docker image into folder /home/docker/docknet. It would also be possible to run the
command git clone in order to obtain the code directly from the repository, provided we
had previously run the commands to install and configure a Git client in the Dockerfile.
However, note that in this case the process of building the Docker image would require access
to the Git repository, which requires one of two options, neither of which is viable:
1. Entering a password, which we cannot do when building the Docker image, since it is
a fully automated process (we will not see the prompt asking us to enter the
password)
2. Using an SSH certificate; however, SSH access has been cut in The Dock for security
reasons.
Once the project code is copied in the image, we set user docker as the owner of the
corresponding folder:
# Make the Docker user the Docknet folder owner
RUN chown -R docker:docker /home/docker/docknet
This is needed since by default the owner is user root, which would prevent user docker
from building and testing the project.
Now that everything has been set up for building the project, we change the current active
user to user docker, using the Dockerfile command USER, and set the current active
directory to the project folder, using the command WORKDIR:
USER docker
WORKDIR /home/docker/docknet
Note inside a Dockerfile command cd has no effect and command WORKDIR is to be used
instead.
Finally, we run the build.sh script (see section 8) as we do in our local machine, so the
build and test process is the same in either case:
RUN delivery/scripts/build.sh
A last Dockerfile line uses the command CMD to indicate what will happen by default when
we run the Docker image:
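The CMD line itself is not reproduced here; based on the description that follows, it presumably activates the virtual environment and starts the web service, along the lines of (the virtual environment path is an assumption taken from earlier sections):

```dockerfile
CMD ["/bin/bash", "-c", "source $HOME/docknet_venv/bin/activate && docknet_start"]
```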
This has no effect during the build process, only when the Docker image is run without
specifying which command to execute once the image is started. In the example, we activate
the Python virtual environment created by the build.sh script inside the Docker image,
then run the docknet_start command in order to start the web service (see section 20).
This Dockerfile is enough for installing, testing and running Python projects in a container as
a web service, though Docker supports many other features. For a comprehensive Dockerfile
reference, visit:
https://docs.docker.com/engine/reference/builder/
It is worth mentioning a more sophisticated way to build smaller Docker images, called
multi-stage builds:
https://docs.docker.com/develop/develop-images/multistage-build/
Note that every time a command is run in a Dockerfile, the result is saved in the image.
However, it is not uncommon to have to install software that is needed for compiling or
testing our project, but not for running it. This software will unnecessarily take up space in
our Docker image. Multi-stage builds allow us to build different Docker images, one per stage,
where one stage image can inherit whatever is needed from the other stage images. This way
we can inherit, for instance, just the compilation result, and hence get rid of the compilation
tools or the Git client that could have been used to clone the project.
The Docknet project includes scripts for the usual Docker image operations:
• docker_build.sh
• docker_run.sh
• docker_rm.sh
All these scripts first load script docker_config.sh where we centralize the parameters
common to all the scripts.
Internally, the scripts just use a few Docker commands. To build a Docker image, we can go
to the folder containing the Dockerfile and run the following command:
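The command itself is not reproduced here; it is presumably the standard Docker build invocation:

```shell
docker build -t TAG .
```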
where TAG is a name we give the image so that we can refer to it easily later (e.g., docknet).
The image is built and stored in the Docker system on our own machine; we do not need to
manage any image file, since it is handled by our Docker installation. To list the available
images, we run the command:
docker images
To run the Docker image so that it starts the web server, we can run the following command
from anywhere:
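The command itself is not reproduced here; given the parameters described below, it is presumably of the form (the tag docknet is the one suggested above):

```shell
docker run -it -p 8080:8080 docknet
```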
The -p parameter is used to map port 8080 inside the container to port 8080 of the physical
machine; this is needed to be able to access the web service running inside the container. The
-it parameters indicate the container is to be run in interactive mode, which allows us to
stop the container by pressing Ctrl + C. Otherwise the container will run indefinitely, until we
either close the terminal or kill the corresponding process.
Finally, we can delete a Docker image, provided that it is not running, as follows:
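The command itself is not reproduced here; it is presumably the standard image removal command:

```shell
docker rmi -f IMAGE_ID
```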
Note the image_id is not the same as the tag; it is an alphanumeric code. The -f parameter
forces the deletion of the image, even if there is some child image derived from it.
In script docker_rm.sh we use the following command to locate all images defining a label
docknet.docker.version and delete them:
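The command itself is not reproduced here; a command with this effect, using the standard Docker label filter syntax, is:

```shell
docker rmi -f $(docker images -q --filter "label=docknet.docker.version")
```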
Finally, we recommend visiting:
https://docs.docker.com/get-started/
to get a better understanding of what Docker is, and following the Docker tutorial to get more
familiar with Docker's capabilities:
https://docs.docker.com/get-started/overview/
22 Continuous integration
In this section we explain how to quickly implement a continuous integration pipeline in Azure
DevOps based on the Docker container seen in the previous section. A continuous integration
pipeline is just an automated process that is triggered whenever we push some changes to a
Git repository, and which verifies that the new version of the code is stable by running again
all the tests. Even though we should run the tests ourselves on our machines before uploading
code, it may happen that while the tests pass on our machine, they fail somewhere else.
Typical reasons for this to happen are:
• We committed and pushed the code but forgot to add first some new data or source
code file; the tests run in our machine, but not in someone else’s machine since they
do not have all the required files.
• While implementing a new feature we needed a new library that we manually
installed in our Python virtual environment by running pip in a terminal, but we
forgot to add the library in the requirements.txt file. Therefore, when someone
else tries to run the tests, even after a clean project rebuild, the tests of the new
feature fail due to missing dependencies.
• We needed to install some native tool or library, or needed to change some system
configuration (e.g., add a system variable), but we did not update the build script
and/or the Dockerfile. Note this may not prevent the error from happening on the
development machines of the rest of the team, since they may need to manually
install or configure their machines, but at least they can use the Dockerfile as a
reference guide on how to install and configure their own machines for the project to
run. This is particularly important in the event of team members rolling off projects.
Verifying that the project is stable every time we have new changes to push to the repository
is time consuming. Note a full verification would consist of:
• Reinstalling and configuring a fresh machine
• Installing the project dependencies and the project itself
• Running all the unit tests
To avoid this repetitive work, we can use a continuous integration system that will do it for
us every time we create a new candidate version of the code (a new pull request to be
merged). Usually, continuous integration systems and software repositories are tightly
integrated, so that the continuous integration system checks the project each time a pull
request is created or updated, and the repository does not allow merging the pull request
until the continuous integration system validates it. In case of error, a notification can be
automatically sent for the corresponding developer to check and fix the error. This way we
reduce the probability of having an unstable version of the code, and hence of potentially
blocking someone else's work or, even worse, of not realizing the code is broken until we try
to run a demo in front of a client.
With Azure DevOps we can simply create a pipeline that builds the project’s Docker image.
Remember that the image we used in the Docknet project (see section 21) simply runs the
build script in order to install the project and run the tests. Hence if any test fails, the image
build will fail, and Azure DevOps will report the error.
Once finished, one can then pull the new code version to get the azure-pipelines.yml
file and, potentially, edit it and commit a new version. Each time the pipeline is run, an e-mail
is sent indicating whether the process succeeded or failed. The email contains a button “View
results” we can click on in order to open a web page with the process report. In this page we
can see the terminal output messages, which can be useful to quickly determine what went
wrong.
While Azure offers a specific pipeline for building Python packages and running tests, we have
used here the pipeline for building a Docker image, so that we control the environment in
which the project is tested. The Python build pipeline uses an Azure virtual machine, which
we can also tweak by modifying azure-pipelines.yml; however, that virtual machine
runs solely in Azure. The Docker image can be built in Azure, on our own computers, on
Amazon instances, and on many other systems. Hence, we can test the project inside a Docker
container on any computer and still obtain the same result, since with the Docker image we
control the environment in which the project is tested.
Apart from Azure DevOps, Jenkins is a popular open-source alternative for implementing
continuous integration pipelines. Jenkins is compatible with Linux, macOS and Windows.
More info on Jenkins can be found here:
https://www.jenkins.io/
Finally, for open-source projects (e.g., published in GitHub) one can use Travis CI for free.
Travis CI is another continuous integration service that provides virtual machines for running
processes in Linux, macOS and Windows machines. Note Docker containers can simulate
different versions of Linux distributions, but one cannot run macOS or Windows in a Docker
container. For testing software on Windows or macOS one can use the corresponding Travis
CI virtual images. For specific Linux distributions one can always select some Linux virtual
machine, then run a Docker image inside it, so we can still easily replicate the same result on
our own computer. More information on Travis CI can be found here:
https://travis-ci.org/
23 Proposed challenges
Here we present 5 challenges that you can solve as a team, each member of the team working
on a separate challenge at the same time. The idea is to practice the contents of this guide.
Can you think of another data generator that could put a Docknet to the test? Implement it
as another derived class of DataGenerator (folder src/docknet/data_generator).
For implementing the test, you may copy a test of a previously implemented data generator
(folder test/unit/docknet/data_generator) and modify it to use the new data
generator. By debugging the test, you can see the corresponding scatterplot without having
to use a Jupyter notebook, while being able to debug the code. Verify that the test works
before merging the Git branch with master.
Make a copy of one of the Jupyter notebooks in the exploration folder, modify it to use
the new data generator, and then modify the Docknet hyperparameters to try to properly
classify the new test set.
For implementing the test, you may copy the test for the ReLU activation function and the
one for its derivative (file
test/unit/docknet/function/activation_function.py) and update them
accordingly.
Implement a Jupyter notebook in the exploration folder that presents the different
activation functions and compares them (e.g., create several Docknets with the same
structure but different activation functions in the hidden layers, then train them and predict
with the same datasets).
Suggestion: make a copy of the cross-entropy function and its derivative (script
src/docknet/function/cost_function.py) and of the corresponding tests (script
test/unit/docknet/function/test_cost_function.py), and modify them
accordingly.
23.4 Challenge 4: Xavier’s initializer
In section 13.6 we presented a random normal initializer of the network parameters
(implemented in script
src/docknet/initializer/random_normal_initializer.py). Xavier (also
called Glorot) initialization is another popular initializer. Can you implement it? It is similar to
the random normal initializer in that it also draws values from a normal distribution, but the
mean is set to 0 and the standard deviation depends on the number of neurons of the layer.
Note you can get the number of neurons of an AbstractLayer with the getter method
dimension. Here is the official paper where Xavier's initialization is described:
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
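A numpy sketch of the Glorot/Xavier normal scheme is given below. The function name and the exact fan-in/fan-out convention are illustrative, not part of the Docknet API; check the paper and your layer dimensions before adapting it.

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=None):
    """Draw a weight matrix from N(0, sqrt(2 / (n_in + n_out))).

    n_in and n_out are the numbers of neurons of the previous and current
    layers; the seed makes the draw repeatable in tests.
    """
    rng = np.random.RandomState(seed)
    stddev = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=stddev, size=(n_out, n_in))
```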
Suggestion: make a copy of the random normal initializer and its test (script
test/unit/docknet/initializer/test_random_normal_initializer.py)
and update them accordingly.
Suggestion: copy the implementation of the input layer and of its unit test (file
test/unit/docknet/layer/test_input_layer.py) and update them accordingly.
Note that for a unit test you just need to test the dropout layer, not an entire network with
dropout layers (as we did for exhaustively testing the predict and train methods, hardcoding
a forward and backward propagation of a dummy network in file
test/unit/docknet/dummy_docknet.py). For testing forward propagation, it
suffices to make a fake input for the dropout layer and verify that the forward propagation
method zeroes the corresponding neurons (use np.random.seed to always mask the same
neurons, so the test can be repeated). For the backward propagation, you will have to make
another fake input, write into the layer cache a fake mask that was presumably applied during
forward propagation, and verify that the corresponding gradients are zeroed.
Make a Jupyter notebook in the exploration folder comparing a Docknet with and
without dropout layers after each hidden layer.
The bias corrections of v and s are no longer needed so you can delete them.
Implement a Jupyter notebook in the exploration folder in order to compare all 4 optimizers.
Which one converges the fastest? Which one is the slowest?