
Team software development guide

Javier Sastre, Jee-Hyub Kim, Laura Álvarez & Seán Gorman


Table of contents

1 Introduction
2 Prerequisites
3 Python virtual environments
3.1 Installing Python in Ubuntu 22.04
3.2 Installing Python in macOS
3.3 Creating a Python virtual environment
3.4 Switching between environments
3.5 Installing packages in a Python virtual environment
3.6 Deleting/reinstalling a Python virtual environment
4 What is Git?
5 Installing and configuring the Git client
6 Creating a project in Azure DevOps
7 Cloning the project
7.1 Accessing the Azure project page
7.2 Finding the Git repository URL and cloning the project
8 Building and testing the project
9 Installing and running JupyterLab
10 Installing PyCharm
10.1 Ubuntu 22.04
10.2 macOS
10.3 First time running PyCharm
11 Opening & configuring the project with PyCharm
12 Python packages & project structure
12.1 gitignore
12.2 Python package declaration or setup.py
12.3 Delivery folder and build script
12.4 Continuous integration files
13 The Docknet library
13.1 The Jupyter notebooks
13.2 The Docknet main class
13.2.1 Docstrings
13.2.2 Type hints
13.2.3 Getter and setter methods
13.3 Data generators and class inheritance
13.4 Activation functions and custom exceptions
13.5 Cost functions
13.6 Initializers and abstract classes
13.7 Docknet layers
13.7.1 Special class methods __getattr__ and __setattr__
13.7.2 Docknet input layer
13.7.3 Docknet dense layer
13.8 Optimizers
13.9 Utility functions and classes
14 Unit testing with pytest & PyCharm
14.1 How to run unit tests
14.2 Organizing unit tests
14.3 pytest fixtures
14.4 Parameterized pytests
14.5 A bigger pytest, and a word on mocking
14.6 A note on code refactoring
15 Working with Git branches
15.1 Git Workflow
15.2 Git Branches
15.3 Git merge conflicts
16 Object serialization and deserialization
16.1 JSON
16.2 pickle
17 Python package commands
18 Resource files
19 Configuration files
20 Web services in Python with Flask
21 Docker
21.1 Installing Docker
21.2 The Dockerfile
21.3 Building, running and deleting a Docker image
21.4 Docker tutorial
22 Continuous integration
22.1 Creating an Azure DevOps pipeline
22.2 Other CI uses and systems
23 Proposed challenges
23.1 Challenge 1: New data generator
23.2 Challenge 2: New activation function
23.3 Challenge 3: Cross-entropy for multi-class classification
23.4 Challenge 4: Xavier’s initializer
23.5 Challenge 5: Dropout layer
23.6 Challenge 6: Momentum and RMSProp optimizers

1 Introduction
This guide describes a way of collaboratively developing software on local machines based on
industry standards. The methods and tools described here are focused on Python projects
that implement processing pipelines involving one or more machine learning models. We
provide instructions on how to install and use these tools in Linux-based systems, namely
Ubuntu and macOS. Due to differences between MS Windows and Linux-based systems,
developing code that runs on both kinds of OS requires extra effort and care.1 Usually the
code we develop is, at a final stage, to run as a service in a Linux-based Docker container,
hence portability of the code between Windows and Linux OSs is not a must. For this reason,
we advise developing on Linux-based OSs only (including macOS or Linux virtual machines
within Windows) to avoid potential problems.

When collaboratively developing software, we have to bear in mind that…


• … we develop pieces of code that are part of a whole; there is no “my code” or “your
code”, we develop code that is to interoperate with other pieces of code that are not
necessarily developed by us, hence…
• … being able to run a piece of code we have just developed on our own machines is not
enough; anybody in the team should also be able to run and test it on theirs.

Note that we build together pieces of software that serve as the foundation of other pieces (e.g.,
a data pre-processing step that is needed for later training a model with a particular machine
learning algorithm). We need a mechanism that allows us to share the code amongst the
team, as well as to run the code independently of the person who developed it and the
machine where it was initially developed.

For the sake of this guide, we will use an example Python project called Docknet, a pure
NumPy implementation of neural networks that can be used to learn the math and algorithms
behind neural nets. The code has been made open source and published on GitHub:

https://github.com/Accenture/Docknet

You may follow this training guide in teams of up to 5 persons. Each team will have to create
an Azure DevOps project using the trial version as described in section 6, since The Dock’s
Azure DevOps space may not be used for training purposes.2 The Docknet source code is
to be uploaded to a Git repository in that project so that you can share the same repository
(explained as well in section 6). The first sections of this guide describe how to install and
configure the tools that you will need, so continue reading and following the steps before
jumping to section 6.

1 For instance, Linux operating systems use the slash as the file path separator, while Windows uses the
backslash. We will always have to use Python function os.path.join(folder1, folder2,…, file1)
to generate paths with the proper file separator independently of the OS, instead of hardcoding the file
separator as a string of the form ‘folder1/folder2/file1’. Another typical interoperability problem
arises when running unit tests that compare multiline strings, since Linux uses code ‘\n’ as end of line while
Windows uses ‘\r\n’. For the tests to pass in both systems one possible workaround is to systematically
remove all ‘\r’ characters before doing string comparisons, which adds boilerplate to the tests.
2 Note the trial version of Azure DevOps does not allow for more than 5 members in a single project, hence the limit.

2 Prerequisites
Basic knowledge of the Linux / macOS command line is recommended. A good free book can
be found here:

http://linuxcommand.org/tlcl.php

While you do not need to know the whole content of this book, basic knowledge of the Linux
shell (chapter 1), the file system (chapters 2 to 4), file permissions, the sudo command and the vi editor
is advised.

This guide provides specific instructions for Ubuntu 22.04 and macOS. In case you have a
Windows machine, you will first need to install a Ubuntu 22.04 virtual machine. You may
either use VirtualBox or WSL (Windows Subsystem for Linux). The former allows for having a full
Ubuntu Desktop machine inside Windows, while the latter provides a Linux terminal only. If
you need a full Ubuntu graphical environment, then VirtualBox would be preferred. For
instance, with VirtualBox it is possible to run both PyCharm and the code inside Ubuntu. With
WSL it is still possible to code with PyCharm, running PyCharm in Windows while making it
use a Python interpreter inside WSL. However, this is only supported in the Professional
Edition of PyCharm, which is not free. A good tutorial on installing Ubuntu 22.04 with
VirtualBox can be found here:

https://linuxhint.com/install-ubuntu22-04-virtual-box/

Note you have to run the installer as admin (right click on the installation program, then click
on “Run as admin”).

Instructions on how to install Ubuntu in WSL can be found here:

https://ubuntu.com/wsl

Furthermore, the documentation on running WSL interpreters from PyCharm can be found
here:

https://www.jetbrains.com/help/pycharm/using-wsl-as-a-remote-interpreter.html

Apart from having a Linux machine (Ubuntu or macOS), some basic knowledge of neural
networks with fully connected layers is advised in order to better understand the example
code we use in this training, though it is not necessary for understanding the tools and coding
techniques. If you have never done any training on deep learning, following the first 2 courses
of Coursera’s Deep Learning Specialization would be more than enough:

https://www.coursera.org/specializations/deep-learning

The example code implements the following concepts, all explained in those 2 courses:

• Activation functions (sigmoid, tanh and ReLU)
• Binary classification (no multiclass classification, no softmax)
• Dense layers (also called fully connected layers)
• Forward propagation
• Cross entropy cost function
• Random network parameter initialization
• Gradient descent
• Backward propagation
• Partial derivatives of cross entropy, activation functions and linear functions of the
dense layer neurons (all used for implementing the backward propagation)
• Multi-batch training
• Adam parameter optimizer (a variant of gradient descent)

3 Python virtual environments


In order to ensure that the same Python code produces the same results in different
machines, we need to make sure that the environment in which the code is run is the same.
A significant advantage of software-based experimental science over other experimental
sciences, such as molecular biology, is that we can have full control over each step of our
experiments so that we consistently reproduce the same results, even when random variables
are involved.3 Python virtual environments allow us to control which version of the Python
interpreter is going to be used in our experiments, as well as which version of each Python
package our code depends on. Moreover, recreating the same Python virtual environment in
different machines is straightforward, provided that we keep a file with the list of packages
needed along with their version numbers.

For each project in which we work, we create a dedicated Python virtual environment in order
to make sure the proper Python interpreter and packages are used, while avoiding potential
clashes between projects. When working on a particular project, we activate the
corresponding Python virtual environment, and when switching to a different project we
deactivate the current Python virtual environment and activate the corresponding one. We
may also open multiple terminals and activate a different Python virtual environment in each
one in order to run code from different projects simultaneously.

While it is possible to create Python virtual environments with Conda, Conda comes with a
set of preinstalled packages that may not be required for our projects or have a different
version number than the one we require. Note it is not possible to install multiple versions of
the same package in the same Python virtual environment. For these reasons we prefer to
use plain Python virtual environments, which include no preinstalled packages whatsoever,
then install the minimum set of packages with the needed version numbers for each project.

3.1 Installing Python in Ubuntu 22.04


Even if you don’t develop on a Ubuntu machine, knowing how to install Python on Ubuntu
can be useful for creating a Ubuntu-based Docker container for the project. In order to install
Python 3.9 in Ubuntu you need to run the following commands in a terminal:

3 Random number generators usually allow for an input parameter “seed”, an integer number that determines
the sequence of random numbers that will be generated. When we do not need to repeat the same sequence,
we use the current system timestamp as the seed, which is an integer number that will never repeat.

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get install python3.9 python3.9-dev python3.9-venv

The first command adds a Launchpad repository that includes Python 3.9, and the second one
installs it. Note the default Ubuntu 22.04 repository provides Python 3.10, but we will use
version 3.9 for compatibility reasons with Apple Silicon machines.

Whether we use Python 3.9 or any later version depends on the requirements of the
project, but once a version has been chosen the same version should consistently be used
across all the project machines.

Package python3.9 contains Python 3.9, python3.9-dev is needed for installing some
Python packages that require compiling native code (e.g., NumPy), and python3.9-venv
is needed for creating Python virtual environments.

Note that it is good practice to keep your system up to date so that you install the latest
version of each Ubuntu package. To do so, you may run first the following commands:

• Reload the package index:

sudo apt-get update

• Update every non-system outdated package:

sudo apt-get upgrade

• Update every system outdated package:

sudo apt-get dist-upgrade

Additionally, make sure you have configured the system locales. During the installation of
Ubuntu Desktop, the locales are already configured. However, Ubuntu Server does not come
with the locales pre-configured. You will be using Ubuntu Server when running an Amazon
EC2 instance based on Ubuntu, or when running a Ubuntu-based Docker image. In these
scenarios you need to run the following commands:

• Install the “locales” system package in order to manage the locales:

sudo apt-get install -y locales

• Install the US English UTF-8 locales (you may choose to install other locales, but be
consistent across the whole project):

sudo locale-gen en_US.UTF-8

• Set the following system-wide environment variables (these values are for US English
UTF-8, if you use other locales then replace the values by the corresponding ones):

LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_ALL=en_US.UTF-8

In a Docker image you can define these variables with command ENV, e.g.:

ENV LANG=en_US.UTF-8

In an EC2 instance or any other Ubuntu Server machine these variables are defined in
file /etc/default/locale.

A typical problem derived from not setting the locales arises when we open a text file without
specifying the encoding to use, for instance

with open(pathname, 'rt') as fp:

instead of

with open(pathname, 'rt', encoding='UTF-8') as fp:

The former code will use the system’s default locale which, if not set, may vary from machine
to machine. If we try to read a text file as UTF-8 when it has been written as ISO-8859-1,
we may corrupt the data or get an exception.

3.2 Installing Python in macOS


macOS does not have an official package installation tool similar to Ubuntu’s apt.4 However,
there have been several open-source initiatives to create an equivalent tool for macOS, the
most popular being Homebrew5 and MacPorts.6 Whenever possible, it is better not to mix
different package managers. In this guide we will use Homebrew only.

To install Homebrew, type the following command in a terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Note this command contains no line breaks; before the URL there is a white space, not a new
line. Be careful when copying commands in this guide that span over multiple lines.

In order to be able to install specific versions of Python, we will install pyenv with Homebrew:

brew install pyenv

4 In fact, APT is Debian’s advanced package manager; Ubuntu is a Debian derived Linux distribution which,
among other things, has inherited APT.
5 Homebrew homepage: https://brew.sh/
6 MacPorts homepage: https://www.macports.org/

Additionally, we need to add the following lines:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init --path)"

to our shell config file. Here is a table with the corresponding config files, depending on your
shell:

Shell   Config file
bash    ~/.bash_profile or ~/.bashrc
zsh     ~/.zshenv or ~/.zshrc

pyenv allows us to download the source code of any available Python version, then compile
and install it on our machine. We can install as many versions as we want, and switch from
one version to another as we work on different projects requiring different versions. Some
libraries are required in order to compile Python; install them with the following command:

brew install openssl readline sqlite3 xz zlib

You can list the available Python versions for download with the following command:

pyenv install --list

Usually, you’ll want the versions that are just numbers with dots (e.g., 3.9.14) instead of other
versions such as miniconda3-4.7.12; as mentioned before, plain Python versions (with version
numbers only) do not come with preinstalled packages, so we are free to create our Python
environment with exactly the packages and package versions that are required. Once we have
chosen a Python version (e.g., 3.9.14), we can install it as follows:

pyenv install 3.9.14

List all installed versions as follows:

pyenv versions

Set a given version as the active version as follows:

pyenv global 3.9.14

Check the current active python version as follows:

python --version

3.3 Creating a Python virtual environment
Assuming you are to work on a project called “docknet”, we are going to create a Python
virtual environment for that project in a folder $HOME/docknet_venv. 7 Once you have
activated the Python version you want to use for the project, type the following command:

python -m venv $HOME/docknet_venv

A folder $HOME/docknet_venv should have been created and populated with other
folders and files. The most relevant to us are:
• Folder bin: this is where the commands of the Python virtual environment are
installed, most of them being simply symbolic links to the actual commands installed in
our machine, such as:
o python, python3 or python3.9: they all point to the Python command
we used for creating the Python virtual environment.
o pip, pip3 or pip3.9: they all point to the pip command that corresponds
to the chosen Python version.
o activate: this is the script that activates the environment; it is to be run
preceded by command source.
o Any other commands that our Python project may define are installed here.
Obviously, in a fresh Python virtual environment there are none.
• Folder lib: here is where Python packages8 are installed when running command
pip; if our project defines a Python package, it’s also installed here.

3.4 Switching between environments


To activate a Python virtual environment, open a terminal and run its activate script preceded
by command source:

source path_to_bin_folder/activate

The activate script modifies the system environment in which the terminal is running,
such as the default path to the python and pip commands. We need to use command
source so that the changes in the environment are done to the terminal environment and
not just to the subprocess created when running the script. Otherwise, the changes done to
the environment would be deleted once the script execution is finished. Once we activate a
Python virtual environment, the default python and pip commands will be those of the
environment and not those selected by pyenv. We only need to set the Python version with
pyenv when creating a new Python virtual environment; afterwards, each environment’s
Python version becomes the default one whenever we activate it.

To facilitate the activation of a virtual environment we can define a command alias


docknet_activate. In Ubuntu you can edit the file $HOME/.bashrc, and in macOS the
file $HOME/.bash_profile, if your interpreter is bash, or either $HOME/.zshrc or

7 Remember $HOME is a system variable whose value is the path to your home folder, e.g., if your username is
smith, your home folder in Ubuntu is at /home/smith, and in macOS at /Users/smith
8 Python packages may also be called libraries

$HOME/.zprofile if your interpreter is zsh (modern macOSs use zsh by default). Add at
the end of the corresponding file the following line:

alias docknet_activate="source $HOME/docknet_venv/bin/activate"

You may also add the following alias to make it easier to go to the project folder:

alias cddocknet="cd path_to_project_docknet_folder"

In order to deactivate the virtual environment, we simply type:

deactivate

This command is not a script in the bin folder; it is a shell function that is created by the
activate script and deleted upon deactivation.

3.5 Installing packages in a Python virtual environment


Once the virtual environment is activated, we can simply type python or pip without having
to specify the path to the environment’s bin folder. We can then manually install packages
in the scope of the activated environment by running:

pip install package_name

In our Python projects it is advised to have a file requirements.txt that contains the list
of all the necessary packages, along with their versions, in order to facilitate the task of
installing all the required dependencies. Provided that we have such a file, we can simply type
the following command to install all of them:

pip install -r path_to_requirements_folder/requirements.txt

The requirements file is a text file that contains a Python package name per line, e.g.:

numpy
pandas
scikit-learn

You can use the # symbol to add comments, and suffix ==version_number to specify a
version number of a package, for instance:

# dependencies of component X
numpy==1.24.2
pandas==1.5.3
scikit-learn==1.2.1

Note if version numbers are not specified, the latest versions available will be installed. While
we may want the latest versions available, it may happen that at some point a new version of
a package is released which is no longer compatible with our project, and upon reinstalling
the project (e.g., in the production environment) the code will fail. You can check which

packages and package versions are installed in the currently activated environment by
running command:

pip freeze

A trick you can use if you want to add the latest available version of a package to a project is
to first manually install the package with pip without specifying a version number, then use
pip freeze to check the version installed, then add the package with that version to the
requirements file.

3.6 Deleting/reinstalling a Python virtual environment


In order to remove a Python virtual environment, we simply delete the folder where we
created it (e.g., folder $HOME/docknet_venv in our example). We can then create it again
and install all the required dependencies with 3 commands:

python -m venv $HOME/docknet_venv


docknet_activate
pip install -r path_to_requirements_folder/requirements.txt

We may be exploring whether to use a new package in our project and end up breaking the
virtual environment by installing some package that is incompatible with our project. Note
that a package may depend on other packages, thus installing a package with pip may in
return install many others. It can then be difficult and time consuming to figure out which
package is producing the conflict and what should be uninstalled in order to solve the
problem. Instead, we can simply delete the virtual environment and create it again.

4 What is Git?
The Wikipedia definition of Git is: “Git is a distributed version-control system for tracking
changes in source code during software development”. In practice, Git behaves as a file
repository, either local or online, that not only stores a set of files and folders but also keeps
track of every change made to them since the creation of the repository. This avoids
accidental deletions of code and allows us to roll back changes if necessary, so we can try
different solutions without being afraid of breaking the code or potentially losing important
files.

Usually, we first create a central Git repository in some server or cloud; at The Dock, Azure
DevOps is the current cloud solution to manage project Git repositories, along with other
project processes and metadata. A project’s Git repository or repositories are first created
with Azure DevOps in the cloud. When we create a new Git repository, it is initially empty.
We then use a Git client to make a clone of the remote Git repositories in our machine. This
clone is a mirror of the remote repository that allows us to work locally, without the need of
Internet connection. It not only contains the files downloaded from the central repository
(which initially are none), but metadata files that keep track of every possible change and
keep our local Git copy synchronized with the remote one (the one in the Azure cloud).

Each team member periodically synchronizes their local clone of the Git repository with the
remote one in order to share their contributions with the rest of the team, and to obtain

others’ contributions. Since different developers work concurrently, synchronizing local and
remote repositories implicitly requires integrating different pieces of work, which at times
may be in conflict. If we initially define the project to develop as a set of software
components, and the way these components will interact, each developer can focus on a
different component in order to avoid conflicts. Developers modify or create different files,
and Git automatically integrates the changes by simply keeping the latest version of each file.
However, when 2 or more developers modify the same parts of a file, Git does not know how
to merge both changes; should Git keep one version and drop the other, or vice-versa, or
should new code be written in order to take into account both changes? Git also implements
a conflict resolution mechanism that allows us to manually select one of these 3 options, and
to develop new code to solve the conflict, if needed.

5 Installing and configuring the Git client


In order to access a Git repository, we first need to install and configure a Git client in our
development machine. In Ubuntu simply type the following command:

sudo apt-get install git

In macOS you can type the following command, provided that you already installed
Homebrew (see section 3.2):

brew install git

Finally, you need to tell your Git client what username and email to use when recording
commits, so that your contributions are properly attributed and Git does not complain when you commit:

git config --global user.name your_accenture_username


git config --global user.email your_accenture_email

Remember your Accenture username is the part of your email before the @ symbol.

When uploading changes to a Git repository from the command line, Git automatically opens
a text editor to type in a descriptive message of the changes. To choose the editor to use,
type the following command:

git config --global core.editor editor_of_your_choice

Examples of potential editors are nano, vim, emacs, or even notepad++.exe in
Windows systems, provided that you have previously installed the editor of your choice.

Finally, type the following command to avoid some errors when working with branches:

git config --global push.default current

6 Creating a project in Azure DevOps


Each team is to create an Azure DevOps project using the trial version, to upload the Docknet
project code there and to give access to the team. This section describes how to do so.

First of all, go to the following website:

https://azure.microsoft.com/en-us/services/devops/

and click on the button “Start free”.

Sign up using your Accenture e-mail.9 If requested, select “Ireland” as region. At this moment
you will have created a new organization in Azure whose name is your Accenture EID (the
part to the left of the @ symbol in your Accenture email). You will be presented with a web
page to create the first project of the organization. Choose a project name (e.g., “docknet”),
select “Private” and click on button “Create project”.

A Git repository with the same name will be created by default. Additional repositories can
be created within the same project if needed (e.g., a repo for the code and another for the
datasets), but for this training we will only use the default one. Now follow section 7.2 to
clone the project, which for the moment will be empty. Open a terminal and go to the
docknet folder where you cloned the project. If you list the files with command:

ls

you will see no files, though in fact there is a hidden folder called .git which contains the
Git metadata. To see it you need to use command:

ls -a

9 In fact, you can use any email in order to use the trial version of Azure DevOps, but in order to avoid having
to reconfigure your Git email we will use our Accenture emails.

Download the file Docknet-master.zip from here:

https://github.com/Accenture/Docknet/archive/refs/heads/master.zip

and unzip it inside the docknet folder where you cloned the project. Note if the unzip
process created yet another folder (e.g., Docknet-master), you are to move all the files
and folders and place them directly inside folder docknet (delete the empty folder when
done). Now in the terminal, inside the cloned docknet folder, type the following commands:

git add .
git commit -m "First version"
git push

This will add the code to the Git repository. An explanation of what these commands do is
given in section 15.

Now the team members are to be added to the Azure project so that they can also access it.
On the Azure project web page, at the bottom left, click on “Project settings”. Now in the left
panel click on “Teams”. You will see there is a default “docknet Team” already created. Click
on it to manage the team. Now use button “Add” at the top right corner to invite the other
team members to the project. Use their Accenture emails to invite them.

The newly added members have by default a “Stakeholder” type account, which does not let
them access the project repository yet. All their accounts must be changed to “Basic”. To
do so, go to your Azure organization page, either by removing “/docknet” from the URL in
your web browser or by clicking on your Accenture ID at the centre of the top bar of the page.
Now click on “Organization settings” at the bottom left corner of the page. Then click on
“Users” in the left panel. You will see the list of users that belong to your organization. For
each user whose access level is not “Basic”, move the mouse pointer to the corresponding
row in the user table to see a 3-dot icon appear to the right of the table. Click on that icon
then on option “Change access level”. Finally select option “Basic” in the drop-down menu
and click on button “Save”. Repeat this operation for each team member.

All the team members should now be able to access the project page and to clone the
repository, as described in the next section. To facilitate the task, share with them the URL of
the project page. You can go to the project page by going back to the organization page as
before, then clicking on the “docknet” project box.

7 Cloning the project


7.1 Accessing the Azure project page
In the previous section we described how to create a free Azure organization, a project inside
the organization, and how to add members to the project. To facilitate finding the project
page, ask the person who set up the Azure project to send you the project URL. Another way
of finding it out by yourself is through the Azure page:

https://dev.azure.com

Log in if requested, then you should see in the left panel the list of organizations you belong
to. If you have already participated in a Dock’s Azure project, you should see “thedock”, the
Dock’s organization. The team member who configured the Azure project should have
created an organization whose name is their Accenture ID and should have added you to that
organization. That organization should appear as well in the left panel, otherwise just ask for
the project URL and access it directly. The next time you access the main Azure page you
should see the Accenture ID organization. By clicking on it, you should then see a “docknet”
box which corresponds to the project. Click on it to access the project page.

7.2 Finding the Git repository URL and cloning the project
Once in the Azure project page, you will see a column of buttons in the left panel, each one
corresponding to a different section of the project page. Click on the repository icon to access the
repository section.10 Click on the “Clone” button at the top right corner of the page. A panel
will pop up from where the URL of the repository can be copied. Since SSH is no longer
permitted at The Dock, make sure you get the HTTPS version. Click on the button “Generate
Git Credentials” to generate a password that will be requested each time you access the repository.
Copy it somewhere so that you do not need to generate a new one each time.

Open a terminal and go to the folder where you store the projects (e.g., $HOME/src). Now
type the following command:

git clone repository_URL

The command will create a new folder docknet and download all the project files inside.

8 Building and testing the project


The project you have cloned contains a bash script

delivery/scripts/build.sh

that takes care of creating the Python virtual environment, installing all the required
dependencies, installing the project package in the environment, and then running the unit tests. Run
the script and check that no test errors are reported. If so, you have a stable version of the
project that you can continue developing.

In case for some reason your Python virtual environment gets corrupted, you can quickly
recreate the environment, install all the dependencies and verify it works again by rerunning
the build script.

This same build script will be used within a Docker container for the continuous integration
system to run the tests upon each update of the code sent to the remote Git repository. The
bash script knows whether it is being run by the continuous integration system or by
somebody else (a developer). When run by somebody else, it will install the additional

10 If you can access the project web page but don’t see the repositories icon, ask the person who created the
Azure project page to change your account access level from “Stakeholder” to “Basic”, as explained in section 6.

packages listed in file requirements-dev.txt, which are required for development
purposes only, namely:

• pytest: required for running the unit tests11


• jupyterlab: required for working with Jupyter Notebooks, either with PyCharm12
or with Jupyter’s own web interface

9 Installing and running JupyterLab


As described in the previous section, the build script already takes care of installing
JupyterLab. If you need to install it manually, you can simply run the following command once
the Python virtual environment is created and activated (see section 3):

pip install jupyterlab

Note JupyterLab is the latest Jupyter version since March 2020. The previous version was
called Jupyter Notebook, and IPython before that. The JupyterLab interface has more options and
panels than that of Jupyter Notebook, but if you install JupyterLab you will have the option to
use either interface.

In order to open the JupyterLab web interface, open a terminal, activate the Python virtual
environment of your project, go to the main folder of the project, and run the following
command:

jupyter lab

In case you want to use the former Jupyter Notebook interface, run this command instead:

jupyter notebook

Note it is important to run the Jupyter server in the root folder of the project, so that folder
will become the root in the Jupyter interface, and the metadata of your notebooks will be
written in and loaded from that folder.

Upon running the Jupyter server (either lab or notebook), a web page with the interface will
be automatically opened. Note that the terminal where you run the jupyter command must
stay open, or the server will stop. When running the command on the terminal, several
messages will be printed, one of them giving you the URL of the Jupyter web page:

http://localhost:8888

In case you close the Jupyter tab and do not remember the URL to open it again, you can refer
to the terminal messages.

11 Setuptools, the Python packaging system, is able to run the tests without having to install pytest; the
continuous integration system does not require installing pytest since it uses Setuptools to run the tests, but
for development purposes it is more convenient to have pytest installed.
12 Note that you need the commercial version of PyCharm to edit and run Jupyter notebooks within PyCharm;

otherwise, you have to use the web interface provided by Jupyter.

Finally, for working with Jupyter notebooks directly on PyCharm you will also need to install
Jupyter as described in this section. The only difference is PyCharm will take care of starting
and stopping the Jupyter server, so you will not need to run it yourself in the command line,
and you will use the PyCharm interface instead of a web browser.

10 Installing PyCharm
PyCharm is a smart Python programming interface that is easy to start with and assists us
during the development process so we can focus on the problem rather than fighting
with the particularities of the programming language. It makes it easy to navigate through big
projects, is integrated with Git, and understands and can run unit tests, among many other
features. There are 2 flavours of PyCharm: Professional and Community. The Community
edition is free and usually enough for our needs. The Professional edition adds some
additional features, such as being able to run Jupyter Notebooks. In this guide we will use the
Community edition. If for some reason you need the Professional edition, licenses can be
requested through Accenture’s software catalog:

https://support.accenture.com/support_portal?id=acn_sac&spa=1&page=details&category
=&sc_cat_id=356f867ddbf8ac987faf89584b9619e9

10.1 Ubuntu 22.04


Open a terminal and type the following command:

sudo snap install pycharm-community --classic

Once finished you should be able to find PyCharm from the Launcher icon. Add PyCharm to
the launch bar for quicker access (right click on the PyCharm icon, then click on “Add to
favourites”).

10.2 macOS
Download the DMG package of the community edition from the following web page:

https://www.jetbrains.com/pycharm/download/#section=mac

Make sure you select the proper DMG version for your computer (Intel or Apple Silicon) by
clicking on the DMG button. Then click on the Download button to download the DMG
package.

Once downloaded, double click on the DMG file. In the window that will pop up, drag and
drop the PyCharm icon on the Applications icon. You should then be able to find PyCharm in
your Applications folder. Drag and drop the PyCharm icon on the Dock bar at the bottom of
your desktop for easier access.

10.3 First time running PyCharm


Click on the PyCharm icon to open it. The first time you run it you have to go through some
configuration steps, namely:

1. Accept the PyCharm license
2. Either choose or not to send usage statistics
3. Choose not to import settings
4. Select UI theme (dark or light; I personally find dark causes less eye fatigue)
5. Install plugins:
a. IdeaVim is not recommended unless you are already familiar with using PyCharm
with IdeaVim, since it completely modifies the PyCharm editor behaviour
b. Markdown is recommended to have a nicer Markdown file editor
c. Select R if you also work with R
d. Do not select AWS Toolkit unless you plan to develop AWS Serverless
applications
e. There are many other plugins that can be installed afterwards; just click on the
“Start using PyCharm” button to finish the configuration process

11 Opening & configuring the project with PyCharm


Click on the “Open” button then select the folder where the project was downloaded when
cloning it with Git (section 6). In the left panel we can see the file tree of the project. If we
click on a file, it will open in the editor at the middle panel.

Our project may not only contain Python source code but also data and configuration files,
unit tests, documentation, bash scripts, etc. Due to the flexibility of Python, PyCharm has no
way of knowing which part of the project tree contains the source code, so we have to tell it.
By default PyCharm assumes that source packages will be placed in the project root folder.
Usually a Python project will contain a single package (e.g. docknet) with subpackages
inside, so for simplicity we will simply place the main package folder inside the project root.
Another convention is to create a src folder and place the Python packages there, in
which case we will have to inform PyCharm that folder src is a source code root. This is done
as follows:

1. On the menu bar, click on “PyCharm->Preferences”.
2. Click on the triangle to the left of “Project: XXX” to expand further options.
3. Click on “Project Structure” under option “Project Interpreter”
4. Right click on each folder that is a source code root (e.g. src) then on “Sources”
5. Click on button “OK”

In more advanced projects we may implement multiple Python packages, e.g. backend,
frontend and common code, and manage all of them as Git submodules of a single project. 13
In that case we can open in PyCharm the folder containing all the submodules so we can work
on all of them as if they were a single project, but we will have to indicate where the source
code root folder of each subproject is, as explained above.

13 Git submodules is an advanced feature that facilitates working with multi-package projects. This is not
discussed in this guide, but more info can be found here: https://git-scm.com/book/en/v2/Git-Tools-Submodules

Additionally, we also have to tell PyCharm which Python virtual environment to use for the
project:

1. On the menu bar, click on “PyCharm->Preferences”.
2. Click on the triangle to the left of “Project: XXX” to expand further options.
3. Click on “Project Interpreter”
4. Click on the cog wheel button at the top right corner, then on “Add…”
5. Select “Existing environment”, then click on the “…” button to browse your file system
6. Browse to the “bin” folder of the Python virtual environment you created
7. Select the “python” command inside the “bin” folder and click on button “OK”
8. Click on button “OK” to close the “Add Python Interpreter” window

Once PyCharm knows where the source code is and which Python virtual environment
to use, it will scan all the libraries installed in the environment as well as the project source
code and build an index that will let us navigate through the code quickly, and will also highlight
any errors found (e.g. imported packages that are not installed in the selected environment).

While Python includes by default a unit test library called unittest, in this guide we use
pytest, which contains additional features. We need to tell PyCharm we will be using this
library to run the unit tests:

1. On the menu bar, click on “PyCharm->Preferences”.
2. Either type “unit test” in the search box or manually browse to “Tools->Python
Integrated Tools”
3. In the box “Default test runner”, select option “pytest”

12 Python packages & project structure


The example project that this guide is based on consists of a single Python package with a set
of Python modules and unit tests. Python packages facilitate the distribution and installation
of Python projects. When we execute command pip install numpy what we are doing
is downloading the NumPy Python package from PyPI, the official Python package repository,
and decompressing it in the lib folder of the currently activated Python virtual environment.
Be careful to always activate a Python virtual environment before installing a package,
otherwise it will get installed in the system’s Python.

A typical software project, either for Python or any other language, is composed of:
• Source code: implements the business logic (e.g., detecting mentions of drugs in
documents)
• Configuration files: parameters that modify the way in which the code will run,
without having to modify the source code (e.g., whether dropout will be used or not
to train a machine learning model)
• Test code: code to be used to automatically verify that the business logic is properly
implemented (e.g., check that a tokenizer splits a sequence of characters into the
expected sequence of tokens)

• Build scripts: scripts used to automate the tasks of building the project distributable,
installing it and running the tests. These build scripts may also include configuration
files that define different options of the build process.

Usually, the project distributable only includes the implementation of the business logic and
the configuration and resource files required to run it. All the other files (test code and build
scripts) are used during the development and testing phases and are not required to run the
business logic. By placing the source code in a folder and the test code in a different folder,
we prevent the test code from being included in the package distributable. We put all the
source code, configuration files and resource files necessary to run the code in the folder
corresponding to the main project package (e.g. docknet). All the test code, configuration
files and resource files used for running the tests are placed in folder test.

In the “docknet” project we have also added a folder exploration with some Jupyter
notebooks, which use the code inside the docknet folder. The Jupyter notebooks are not
to be included in the project distributable either, since their code is not reusable. As the
folder name suggests, they are meant for exploration only.

Files and folders other than docknet, test and exploration correspond to build scripts
and configuration files of the development process itself. We describe them in the
subsections below.

12.1 gitignore
File .gitignore lists the files or folders that we may have inside the project file tree that
are not to be saved in the central repository. Typical examples of these files are:
• Temporary files such as Vim’s .swp backup files
• Metadata files created by our OS (e.g. .DS_Store) or the programming
environment (e.g., PyCharm’s .idea folder).
• Python files created when building the project distributable (folders build and
dist)
• Log files and other files that might be created when running the project code or tests
(e.g. .pytest_cache and __pycache__ folders), or Jupyter notebooks (folder
.ipynb_checkpoints).

None of these files are to be uploaded to the central repository since they are temporary and
unique to each developer. Apart from taking up unneeded space in the central repository, other
developers will get a copy of them when synchronizing with the remote repository. Moreover,
conflicts may arise if 2 developers are uploading the same temporary or metadata files to the
repository. Take as example PyCharm’s metadata; among other things, PyCharm stores in the
metadata the list of tabs you had open last time you opened the project so that you can
resume your work exactly where you left off. If you don’t ignore these files, your Git client will
think you have new code to upload to the remote repository every time you open or close a
tab, bloating the repository with new and unnecessary versions of these metadata files.
Moreover, if more than one developer uploads the PyCharm metadata files, Git will
constantly report code conflicts since different developers will have different tabs opened.

For this reason, make sure that files that are not to be shared with other developers are
either:
• added to .gitignore, for the temporary and metadata files listed above,
• added to some notebook in the exploration folder, for the case of exploratory code
that you don’t know yet if it will be finally required or not, or that needs to be
refactored before being integrated with the rest of the components, or
• stored outside of the project folder, such as data files you may use to make manual
tests.

Note as well that Git is meant to store source code, not binary files. Git is capable of efficiently
storing different versions of text files by storing sequences of changes instead of whole files
for each version. However, this doesn’t work well with binary files and Git will store the whole
files for every version, wasting storage space and network bandwidth. The problem worsens
with large binary files, such as machine learning models and multimedia files. Note as well
that since Git keeps track of every file version, deleting a binary file from the repository does
not solve the problem: the file will still be stored in a previous version of the code, and every
developer who clones the repository will have to download it. There can be exceptions,
such as when including a reasonably small binary model in the resources of our source code
so that the project can work with a default model, though it is better to separate data from
code (e.g., publish the models in an Amazon S3 bucket and access them from the code).

Each line in .gitignore specifies one filter of files or folders to ignore. Let root be the
folder containing the .gitignore file; common filters are:

• *.swp: every file whose name ends with .swp, anywhere under folder root
• __pycache__/: every folder named __pycache__, anywhere under folder root
• /folder1/folder2/ or folder1/folder2/: ignore folder2 at the precise
path root/folder1/folder2
• /folder1/file1 or folder1/file1: ignore file1 at the precise path
root/folder1/file1

Note that the moment we specify a path, either starting with / or not, we ignore a specific
file or folder, not a file or folder anywhere under root. While it is possible to add
.gitignore files anywhere in the project folder, it’s usually better to have only one at the
root folder of the project.

Finally, it is possible to add comment lines using symbol # at the beginning of the line.

For a comprehensive reference on the .gitignore file syntax, visit the official
documentation:

https://git-scm.com/docs/gitignore

Depending on the kind of project (Python, Java, C++, etc.) one may copy a predefined
.gitignore file. The following GitHub repository contains examples of .gitignore files
for many different kinds of projects:

https://github.com/github/gitignore

12.2 Python package declaration or setup.py


In order to declare a Python package, we add a file setup.py at the root folder of the
project. Once we have an active virtual environment, we can run this file from the command
line as follows to perform several actions:
• python setup.py sdist: builds the distributable package in folder dist
• python setup.py install: builds the distributable package (if not done yet)
and installs it in the active virtual environment
• python setup.py test: runs all the project tests14

Note that the distributable package will only contain the Python scripts in folder src, which
are usually the only files we will need to distribute to run the project.

The file setup.py basically builds a Python dictionary with a set of package parameters,
then calls function setup from the Setuptools Python package to perform the requested
action (build distributable, install or test, among others). 15

In case you are creating your own Python package, you may copy/paste this setup.py file
and update the following lines according to your project:
• PKGNAME='docknet': name of the package
• DESC=''' A pure NumPy implementation of neural
networks''': description of the package
• license='(c) Accenture': project license, leave (c) Accenture for private
Accenture license
• author='Javier Sastre': package maintainer
• author_email='j.sastre.martinez@accenture.com': email of the
package maintainer
• keywords=['Accenture', 'The Dock', 'deep learning',
'neural network', 'docknet']: descriptive project keywords for indexing
purposes in a package repository
• classifiers=['Programming Language :: Python :: 3 :: Only',…]: standard category
labels for Python projects, also used for indexing purposes16
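
To make the above more concrete, here is a minimal sketch of what such a setup.py can look like. It is only an illustrative outline assuming a src layout and Setuptools’ find_packages; the actual file in the Docknet repository contains more parameters and may, for instance, read the version and dependencies from VERSION.txt and requirements.txt:

from setuptools import find_packages, setup

PKGNAME = 'docknet'
DESC = '''A pure NumPy implementation of neural networks'''

setup(
    name=PKGNAME,
    version='0.0.1',  # hardcoded here; the real script may read VERSION.txt instead
    description=DESC,
    license='(c) Accenture',
    author='Javier Sastre',
    author_email='j.sastre.martinez@accenture.com',
    keywords=['Accenture', 'The Dock', 'deep learning', 'neural network', 'docknet'],
    classifiers=['Programming Language :: Python :: 3 :: Only'],
    package_dir={'': 'src'},        # assumption: the package lives under the src folder
    packages=find_packages('src'),  # discover docknet and its subpackages automatically
)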

Apart from script setup.py, you’ll also need to copy and update the following configuration
files:
• CHANGES.txt: text file listing the features and bug fixes implemented in each
package version; if we are not publishing the package, we can ignore this file.

14 Alternatively, you may simply run the command pytest to run all the tests inside a project folder, provided
that the corresponding Python virtual environment is active
15 More actions are possible, though these are the most common ones. A tutorial on Setuptools can be found

at https://packaging.python.org/tutorials/packaging-projects/
16 The comprehensive list of classifiers can be found at https://pypi.org/classifiers/

• MANIFEST.in: contains the list of files outside the src folder that should also be
included in the project distributable (requirements.txt and VERSION.txt
files)
• requirements.txt: contains the list of packages this project depends on (as
explained in section 3.5).
• requirements-dev.txt: contains the list of additional packages required for
development purposes only. As explained in section 8, the build script installs them
whenever it’s not run by the continuous integration pipeline (it will install them when
it’s run by any developer).
• setup.py: the Python script used to package and install the project as well as to run
all the tests from the command line. This file is included by default in the package
distributable, since it is needed to install it.
• setup.cfg: additional configuration parameters of the setup.py script
• VERSION.txt: contains the version number of the package as plain text; if we are
not publishing the package, we can just leave version 0.0.1 inside

12.3 Delivery folder and build script


• delivery: folder containing scripts for building the project (namely the build.sh
mentioned in section 8), and potentially the Dockerfiles (explained in section 21.2), if
we need more than one (e.g., for testing the project in more than one OS). For the
Docknet project we have provided a single Ubuntu-based Dockerfile in the project
root folder.

12.4 Continuous integration files


• azure-pipelines.yml: configuration file created by the Azure DevOps web
interface when defining a continuous integration pipeline (explained in section 22.1).
• Dockerfile: description of the Docker container required to install the Python
package and run the tests (used in the continuous integration pipeline, explained in
section 22).

13 The Docknet library


The Docknet library is a pure NumPy implementation of a framework for creating, training
and doing predictions with neural networks. It is based on the first 2 courses of Coursera’s
Deep Learning Specialization by Andrew Ng.17 Part of that training consists of implementing
the fundamental functions that neural networks are based on. With these functions, and a
little bit of research to fill some gaps in the training, we have built a fully functional Python
library with which one can do experiments and compare different alternative implementations
of the building blocks that compose neural networks. We will use this library as an example
to illustrate the concepts of this training.

13.1 The Jupyter notebooks


Let’s start with a few Jupyter notebooks to see how to use the library. In the folder
exploration there are 4 Jupyter notebooks, each one showing an example dataset for

17 Specialization web page: https://www.coursera.org/specializations/deep-learning

binary classification and a Docknet capable of properly classifying them, except for boundary
cases (regions where individuals of both classes “touch”).

By default, the notebooks can only use the Python packages that have been installed in the
active Python virtual environment where the Jupyter server is running.18 For the notebooks
to be able to use the code in the src folder, the notebooks start with the following
instructions:

import os
import sys

# Add ../src to the list of available Python packages


module_path = os.path.abspath(os.path.join('..', 'src'))
if module_path not in sys.path:
    sys.path.insert(0, module_path)

These instructions programmatically add the src folder to the list of folders where the
Python interpreter will look for imported packages.

The notebooks first use the dataset generators in folder
src/docknet/data_generator to generate a balanced random sample of 2 classes,
for binary classification. A training set and a test set are generated. Each class element is a 2D
vector. We can later see in the notebooks the scatterplots of the training and test sets, to
have an idea of the complexity of the binary classification the neural network will have to deal
with. There are 4 data generators:

1. The cluster data generator generates 2 clusters of points that are linearly separable,
so a simple logistic regression should be enough.
2. The chessboard data generator generates some sort of 2x2 chess board, with one
diagonal belonging to one class and the other diagonal to the other class. The classes
are no longer linearly separable, but a hierarchical split would allow separating
them, e.g., first split the space in half, then for each subspace do another split
to do the final classification.
3. The island data generator generates a cluster of points that is surrounded by a ring
(the sea). This case is not linearly separable, though a support vector machine could
properly classify this case by using a gaussian kernel, for instance.
4. The swirl data generator generates 2 clusters of points distributed in 2 swirls of
different phases, whose separation is more challenging than the other cases.
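
As a minimal usage sketch (the constructor arguments and method name appear in the unit
tests of section 14.3; the import path is an assumption):

from docknet.data_generator.swirl_data_generator import SwirlDataGenerator

generator = SwirlDataGenerator((0., 9.), (-3., 0.))
# X has one 2D vector per column, Y one label (0 or 1) per column
X_train, Y_train = generator.generate_balanced_shuffled_sample(2000)
X_test, Y_test = generator.generate_balanced_shuffled_sample(1000)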

After the graphics, a Docknet is built as follows (a code sketch follows the list):

• Create an instance of the class Docknet
• Add an input layer (instance of class InputLayer at src/docknet/layer),
specifying the dimension of the input vectors
• Add any number of hidden dense layers (instances of class DenseLayer at
src/docknet/layer), specifying the number of neurons and the activation
function to use (sigmoid, tanh or relu in
src/docknet/function/activation_function.py)
• Add a last dense layer having a single neuron and sigmoid as activation function, for
binary classification
• Set the Docknet parameter initializer (e.g., an instance of
RandomNormalInitializer in folder src/docknet/initializer)
• Set the Docknet cost function (cross entropy in
src/docknet/function/cost_function.py)
• Finally, set the parameter optimizer (e.g., AdamOptimizer in
src/docknet/optimizer)

18 See section 9 to know how to start a Jupyter server
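
A minimal sketch of this construction (the method and attribute names appear in the unit test
of section 14.5; the import paths and the AdamOptimizer constructor arguments are
assumptions):

from docknet.net import Docknet
from docknet.initializer.random_normal_initializer import RandomNormalInitializer
from docknet.optimizer.adam_optimizer import AdamOptimizer  # assumed module path

docknet = Docknet()
docknet.add_input_layer(2)             # dimension of the input vectors
docknet.add_dense_layer(3, 'relu')     # hidden layer with 3 neurons
docknet.add_dense_layer(1, 'sigmoid')  # output layer for binary classification
docknet.initializer = RandomNormalInitializer()
docknet.cost_function = 'cross_entropy'
docknet.optimizer = AdamOptimizer()    # assuming default constructor arguments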

When the Docknet’s train method is invoked, all the configured components are used to
initialize the layer parameters, perform a number of training iterations (forward and
backward propagations) and optimize the layer parameters so as to minimize the specified
cost function as much as possible. In the train method we specify the training dataset and
the corresponding labels, the batch size and a stop condition (e.g., maximum number of
epochs). Other stop conditions can be used; see the definition of method train for a
comprehensive list.

Method train returns the sequence of average costs per epoch and per iteration, which gives
us an idea of whether the network we have defined manages to find a proper parameter
configuration, and whether we are performing an excessive number of epochs.

Then the notebook computes the predictions on the test set for the trained model, invoking
the Docknet’s method predict. We can then see the scatterplots of the expected
classification (the training set), the points of the actual classification that have been properly
classified, and the points of the actual classification that have been misclassified (in the best
case, an empty scatterplot).

Finally, accuracy metrics and the confusion table are shown.
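
Continuing the sketch above (the keyword arguments batch_size and
max_number_of_epochs appear in the unit test of section 14.5; the exact return value of
train and the signature of predict are assumptions):

# Train for at most 1000 epochs with mini-batches of 64 vectors
costs = docknet.train(X_train, Y_train, batch_size=64, max_number_of_epochs=1000)

# One prediction per column of X_test, to be compared against Y_test
Y_predicted = docknet.predict(X_test)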

One can play with the different parameters of the Docknet, using different numbers of layers
and neurons, different activation functions, numbers of epochs and batch sizes to see how they
impact the final result and how fast each configuration converges.

13.2 The Docknet main class


The Docknet main class is defined in file src/docknet/net.py. It basically contains a list
of layer objects and defines the methods to build and configure the network, train it and do
predictions, as seen in the Jupyter notebooks. Additionally, it defines methods to_pickle
and read_pickle for saving a trained model as a binary pickle file, and to_json and
read_json to use a JSON file format.

13.2.1 Docstrings
All classes and methods are documented in the source code using multiline docstrings, for
instance:

def train_batch(self, X: np.ndarray, Y: np.ndarray) -> float:
    """
    Train the network for a batch of data
    :param X: 2-dimensional array of input vectors, one vector per column
    :param Y: 2-dimensional array of expected values to predict, one single row with the same
        number of columns as X
    :return: aggregated cost for the entire batch (without averaging)
    """

These are short descriptions of the classes and methods that a user of the library can read to
quickly understand how the class or method is supposed to be used, what the input
parameters are (the :param entries) and what the return value is (the :return entry), if any.
The entire Docknet library contains docstrings that explain the purpose of each class and method.

In PyCharm, one can start typing the triple quotes that start the docstring and PyCharm will
autocomplete the docstring with the corresponding parameters and return value we have
declared in the method header. More info on docstrings can be found here:

https://www.python.org/dev/peps/pep-0257/

13.2.2 Type hints


You will have noticed in the train_batch method that each function parameter is
annotated with : np.ndarray. This indicates that the method expects to receive
NumPy arrays as parameters X and Y. It is also possible to declare the type of the
method return value with the notation:

-> type_hint:

after the closing parenthesis (e.g., the train_batch method returns a float). Type hints can
also be used on variables declared inside methods, such as the members of a class declared in
the __init__ method:

self.layers: List[AbstractLayer] = []
self._cost_function_name: Optional[str] = None
self._initializer: Optional[AbstractInitializer] = None
self._optimizer: Optional[AbstractOptimizer] = None

Since in Python a variable can change its type at any moment (the moment we assign to the
variable a value of a different type), without type hints it is not possible to infer the types of
the parameters that a function will receive. Though type hints are not needed to write Python
code that runs, they improve code readability and allow PyCharm to perform additional
verifications of the code we are writing and to provide assistance on how to use the different
variables. For instance, if we start typing a new line after the train_batch header with the
code

X.

PyCharm will suggest all the available NumPy array functions, since it knows that X is
supposed to be a NumPy array. Now change the type hint of X to str and try to write the
following statement:

X = X + 1

PyCharm will highlight the number 1 to indicate a potential error. Move the cursor over the
number 1 and you will see the message Expected type str, since you can concatenate two
strings but not a string and an integer. Remember to undo all these changes (e.g., with
Command + Z in macOS or Ctrl + Z in any other OS) so as not to break the code.

Though it may seem like more work to declare type hints, the effort is rewarded with less time
wasted trying to understand how to use the methods, plus extra help from PyCharm.
It is possible to declare more complex type hints such as lists, dictionaries and tuples of
different kinds of objects (a short sketch follows the list below), for instance:

• Tuple[int, np.ndarray] declares a tuple of an integer number and a NumPy
array,
• List[int] declares a list of integers,
• Dict[str, np.ndarray] a dictionary using strings as keys and NumPy arrays as
values,
• Optional[BinaryIO] declares a variable that can either be a binary file or None,
and
• Union[str, TextIO] declares a variable that can either be a string or a text file.
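
For instance, the following hypothetical helper (not part of the library) combines several of
these hints:

from typing import Dict, Optional, Tuple

import numpy as np


def summarize_batch(batch: Tuple[int, np.ndarray],
                    weights: Optional[np.ndarray] = None) -> Dict[str, float]:
    """
    Compute simple statistics of a (batch_id, values) tuple
    :param batch: a tuple of a batch identifier and a 1-dimensional array of values
    :param weights: optional weights for the average, or None for a plain average
    :return: a dictionary with the batch identifier and the (weighted) mean of the values
    """
    batch_id, values = batch
    if weights is None:
        weights = np.ones_like(values)
    return {'id': float(batch_id), 'mean': float(np.average(values, weights=weights))}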

At first, when you write these type hints, PyCharm will complain that they are unresolved
references, as when you use any class without having imported it first (the offending code is
underlined in red, and if we move the mouse pointer over it, we will see the error message).
All these classes belong to the typing package. For instance, for the case
of Tuple you will need to add the following import:

from typing import Tuple

However, you can ask PyCharm to automatically add the import for you. Any time PyCharm
highlights a piece of code in red, you can move the cursor to the offending code and press
command + enter in macOS, or ALT + enter in other OSs, to see a window with potential
solutions suggested by PyCharm. In case of a missing import, PyCharm will suggest adding the
import, among other possible solutions. Press enter to let PyCharm add the import. In case
PyCharm finds several packages that define the missing reference, it will present a list of
options. For the case of type hints, select the typing package with the arrow keys and press
enter.

13.2.3 Getter and setter methods
Note: The Docknet class contains some methods that are annotated with

@property

and

@XXX.setter,

such as methods initializer and cost_function. These annotations are used to
define getter and setter methods, which are equivalent to adding a variable to the class. The
getter defines how to read that “pseudo” class variable, and the setter how to write it.

Getters and setters are used to abstract the library user from the actual way in which the
variable is stored, giving the possibility of alternate implementations. For instance, see how
the cost function getter and setter are implemented: the setter expects a string with the cost
function name, and the getter returns the last function name set, but actually the Docknet
class requires a Python function as cost function. Moreover, it also requires another Python
function that implements the corresponding cost function derivative in order to compute the
backward propagation for training the network. To simplify the usage of the library and avoid
giving the wrong function derivative, the setter uses the function name to itself retrieve the
corresponding pair of functions from a dictionary of cost functions defined in
src/docknet/function/cost_function.py, and the getter simply returns the
name of the cost function that has been set instead of the function itself. The same
mechanism is used for getting activation functions by name along with their corresponding
derivative functions, which is explained later in section 13.4.
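
A simplified sketch of this pattern (not the library’s exact code; the cost_functions
placeholder stands in for the dictionary defined in cost_function.py):

from typing import Callable, Dict, Optional, Tuple

# Placeholder standing in for the dictionary in src/docknet/function/cost_function.py
cost_functions: Dict[str, Tuple[Callable, Callable]] = {
    'cross_entropy': (lambda Y_hat, Y: 0.0, lambda Y_hat, Y: 0.0),
}


class Network:
    def __init__(self):
        self._cost_function_name: Optional[str] = None
        self._cost_function: Optional[Callable] = None
        self._cost_function_prime: Optional[Callable] = None

    @property
    def cost_function(self) -> Optional[str]:
        # The getter returns the name that was set, not the function itself
        return self._cost_function_name

    @cost_function.setter
    def cost_function(self, name: str):
        # The setter resolves the name to the matching pair (function, derivative)
        self._cost_function, self._cost_function_prime = cost_functions[name]
        self._cost_function_name = name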

13.3 Data generators and class inheritance


As explained in section 13.1, we have implemented 4 data generators to have some binary
classification challenges to solve with neural networks, namely the cluster, the chessboard,
the island and the swirl data generators. They are all defined in folder
src/docknet/data_generator. Some of the behaviour of these data generators is
identical, so in order not to repeat the corresponding code 4 times we have defined a parent
class that factors out that code. This class is named DataGenerator. All the data
generators are children of this class; see for instance the definition of class
ChessboardDataGenerator:

class ChessboardDataGenerator(DataGenerator):

That means the child class gets for free a definition of all the functions of the parent class.
The child class can optionally redefine the methods of the parent class, if necessary, or extend
them by redefining them and adding a call to the parent class method to reuse it. For
instance, the __init__ method of the DataGenerator class:

def __init__(self, class_functions: Tuple[Callable, Callable]):

is extended in each child class. The DataGenerator class expects to receive 2 Python
functions, each one producing 2D vectors of a different class, given a 2D array of random
numbers between 0 and 1. This pair of functions is provided by each child of the
DataGenerator in order to generate different datasets.

The DataGenerator defines a function generate_class_sample that first generates
a sample of 2D random vectors between 0 and 1, then applies to each vector the function
of the selected class (class 0 or 1) to map these vectors to vectors of the corresponding child
data generator. Additionally, the DataGenerator defines a method
generate_balanced_shuffled_sample, the one used in the Jupyter notebooks, to
generate a sample containing vectors of both classes in the same amount, where the vectors
have already been shuffled so that the networks can be trained with batches of data. The
method returns an array X with the samples and an array Y with the labels, with the shapes
expected by the train and predict methods of the Docknet class. Hence, we can define
any child of DataGenerator that just provides the 2 mapping functions, without having to
worry about how to generate random vectors, shuffle them, or format the samples so that
they are compatible with the Docknet class.

Open the ChessboardDataGenerator, for instance, and you will see that it declares the
2 class functions, func0 and func1, and an __init__ method that is an extension of the
DataGenerator __init__ method. First of all, this method calls the parent class __init__
method with the following instruction:

super().__init__([self.func0, self.func1])

The function super() gives access to the parent class, so the .__init__
call invokes the __init__ method of the parent class, passing to the parent the
2 expected functions declared in the child class. The remaining code does some
precomputations that are used by the func0 and func1 methods.
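
As a hedged sketch, a hypothetical new generator could then be defined with just the two
mapping functions (the class itself, the module path and the column-wise vector layout are
assumptions):

import numpy as np

from docknet.data_generator.data_generator import DataGenerator  # assumed module path


class TwoBandsDataGenerator(DataGenerator):
    """Hypothetical generator: class 0 lives in the band x in [0, 1), class 1 in x in [2, 3)"""

    def __init__(self):
        # Pass the two class mapping functions to the parent class
        super().__init__((self.func0, self.func1))

    def func0(self, random_vectors: np.ndarray) -> np.ndarray:
        # Class 0: keep the random points as they are (one 2D vector per column assumed)
        return random_vectors

    def func1(self, random_vectors: np.ndarray) -> np.ndarray:
        # Class 1: shift the x coordinate of every point by 2
        return random_vectors + np.array([[2.0], [0.0]])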

Summarizing, class inheritance can be used 1) to factor out common code and 2) to ensure a
common behaviour of a set of classes so that they can be easily exchanged (we can use a
different data generator without having to modify the code of the Docknet that is to ingest
the data). Once the DataGenerator is defined, multiple developers can work on different
data generators without worrying about how they will integrate their generators with the
Docknet class, as long as they derive from the DataGenerator class as expected.

More info on Python classes and class inheritance can be found here:

https://docs.python.org/3/tutorial/classes.html

The 4 implemented data generators are inspired by those used in this web page, where one can
play around with a neural network applying different layers, neurons, regularization, etc.:

https://playground.tensorflow.org/

13.4 Activation functions and custom exceptions
Activation functions are used in the network’s dense layers to compute the output of each
neuron, once the linear part of the neuron has been computed. In the literature we find
several activation functions one can use; some specific ones are needed in the
output layer depending on whether we want to do binary classification (activation function
sigmoid) or multi-class classification (activation function softmax). What all the
activation functions have in common is that they are not linear functions; otherwise, adding
more layers to a network wouldn’t add more power to the network, since the composition of
any number of linear functions is equivalent to a single linear function.
For the moment the Docknet library implements the following activation functions:
sigmoid, tanh (hyperbolic tangent) and relu (rectified linear unit), enough for doing
binary classification. The functions are defined in file
src/docknet/function/activation_function.py. The mathematical definition of these
functions is given in their corresponding docstrings.

In order to be able to compute the network backward propagation, the derivatives of these
functions are also required. These are defined in the same file as sigmoid_prime,
tanh_prime and relu_prime. We will see later that when specifying the activation
function of a new dense layer, we just have to provide the name of the activation function
instead of the pair of functions activation + derivative. This is the same mechanism as for the
specification of the Docknet’s cost function and its derivative, previously mentioned in section
13.2.3. To make sure that the proper derivative is used for each activation function, we have
defined a dictionary associating each activation function name with the corresponding pair of
functions:

activation_functions: Dict[str, Tuple[Callable, Callable]] = {
    sigmoid.__name__: (sigmoid, sigmoid_prime),
    relu.__name__: (relu, relu_prime),
    tanh.__name__: (tanh, tanh_prime)
}

Additionally, we have defined the following function to retrieve the pair of functions from the
dictionary, and generate a specific exception in case there is no implementation for the
requested activation function:

def get_activation_function(activation_function_name: str):
    """
    Given the name of an activation function, retrieve the corresponding function and its derivative
    :param activation_function_name: the name of the activation function
    :return: the corresponding activation function and its derivative
    """
    try:
        return activation_functions[activation_function_name]
    except KeyError:
        raise UnknownActivationFunctionName(activation_function_name)

While it would be possible to directly access the dictionary to get the pair of functions, if we
specified a non-existent activation function name we would get a KeyError exception,
which is raised any time we try to get a value from a dictionary for a key that does not
exist. However, that exception would not give the user of the library much information
about what went wrong, while by raising a specific exception we can provide a more
informative error message. Here is the definition of the custom exception:

class UnknownActivationFunctionName(Exception):
    def __init__(self, activation_function_name: str):
        message = (
            f'Unknown activation function name {activation_function_name}')
        super().__init__(message)

We simply create a class that extends the Exception class. The name of the class could
already be enough to explain the error reason, such as
UnknownActivationFunctionName. We can include a customized error message with
the exception, as in the example, by redefining the Exception __init__ method to pass
the custom message to the parent class. In the example, we are passing the received
activation function name to give a better hint on what went wrong.

13.5 Cost functions


Cost functions are the functions the network will try to minimize while training, in order to
do better classifications for a given training set. The value returned by the cost function
reflects the amount of deviation from the expected classification (the labels of the training
set). The training process tries to modify the network parameters in order to reduce this
deviation.

There are several cost functions one can use, though the most common is cross entropy. For
the moment, cross entropy is the only cost function we have implemented, though we have
followed the same code structure as for the activation functions in order to enable further
extension of the library.
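
As a hedged sketch (not necessarily the library’s exact code), the aggregated binary cross
entropy consistent with the conventions above, written so that the border cases where the
network output is exactly 0 or 1 do not produce NaN values (see also the test cases in section
14.4), could look as follows:

import numpy as np


def cross_entropy(Y_hat: np.ndarray, Y: np.ndarray) -> float:
    """
    Aggregated (non-averaged) binary cross entropy
    :param Y_hat: 1 x m array of predicted probabilities
    :param Y: 1 x m array of expected labels (0 or 1)
    :return: the summed cross entropy over the batch (may be inf in border cases)
    """
    with np.errstate(divide='ignore'):
        # Select the relevant -log term per column so that 0 * log(0) never occurs
        positive = np.where(Y == 1, -np.log(Y_hat), 0.0)
        negative = np.where(Y == 0, -np.log(1 - Y_hat), 0.0)
    return float(np.sum(positive + negative))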

13.6 Initializers and abstract classes


The Docknet class uses initializers in order to set the layer parameter values to different
values before starting to train. We need the parameters to be initialized to different values
before starting to train so that each neuron can specialize in computing different outputs.
Otherwise, all the neurons of the same layer would be like clones, performing the same
computation and being modified in the same way during training, thus performing the same
computation over and over. In that case the network would not benefit from having multiple
neurons in a layer, since all would be repeating the same computation.

We also want the initializers to be easily exchangeable, so we can use any initializer
implementation without having to modify the code of the Docknet class. Note how the
Docknet initializer setter is implemented:

@initializer.setter
def initializer(self, initializer: AbstractInitializer):
    """
    Sets the network parameter initializer; required for training only
    :param initializer: an initializer object (e.g. an instance of RandomNormalInitializer)
    """
    self._initializer = initializer

The type hint of the initializer parameter is AbstractInitializer, which is an abstract
class. An abstract class is a class that declares one or more methods for which an
implementation is not provided, so you cannot create an instance of such a class since its
definition is incomplete. Child classes of abstract classes are meant to provide the final
implementation, which depends on the child class. Abstract classes are useful for forcing a set
of derived classes to implement a set of methods, so we can later ensure all of them have the
same interface. This way the Docknet class can require an initializer of type
AbstractInitializer, then receive any child of AbstractInitializer, since all
of them implement the AbstractInitializer methods.

Open file src/docknet/initializer/abstract_initializer.py to see how
AbstractInitializer is defined:

class AbstractInitializer(ABC):
    @abstractmethod
    def initialize(self, layers: List[AbstractLayer]):
        """
        Initializes the parameters of the passed layers
        :param layers: a list of layers
        """
        pass

First of all, we declare the class as a child of Python’s class ABC, the parent of every abstract
base class. Then we declare all the abstract methods preceded by the annotation:

@abstractmethod

In the abstract methods we only provide the method name and the parameters. Though not
mandatory, we use type hints in order to make clear what each input parameter is, and to
declare the return value, if any (in this case, the initialize method does not return
anything). As implementation we just add the instruction pass. Child classes will have to
extend the abstract methods providing an actual implementation (not just pass), and must
not add the annotation @abstractmethod, since their methods will no longer be abstract
(otherwise we won’t be able to create instances of those classes either). See for instance the
implementation of RandomNormalInitializer at
src/docknet/initializer/random_normal_initializer.py:

class RandomNormalInitializer(AbstractInitializer):
    """
    Random normal initializer sets all network parameters randomly using a
    normal distribution with a given mean and standard deviation
    """
    def __init__(self, mean: float = 0.0, stddev: float = 0.05):
        """
        Initialize the random normal initializer, given a mean and a standard
        deviation
        :param mean: the mean of the normal distribution
        :param stddev: the standard deviation of the normal distribution
        """
        self.mean = mean
        self.stddev = stddev

    def initialize(self, network_layers: List[AbstractLayer]):
        """
        Initializes the parameters of the passed layers
        :param network_layers: a list of layers
        """
        # For each layer
        for p in [layer.params for layer in network_layers]:
            # For each parameter
            for k in p.keys():
                # Randomly initialize the parameter
                p[k] = np.random.randn(*p[k].shape) * self.stddev + self.mean

The RandomNormalInitializer is a child of AbstractInitializer, and provides
an actual implementation of method initialize. The method receives a list of
AbstractLayer objects, AbstractLayer being the abstract class for all network
layers. This abstract class ensures every layer class defines a getter params, which returns a
dictionary with all the parameters of the layer. The random initializer iterates over each
parameter in the dictionary and fills it with random values from a normal distribution with a
given mean and standard deviation, defined at the moment of creating the random initializer
instance.
13.7 Docknet layers
The Docknet layers are defined at src/docknet/layer. As for the initializers, an
AbstractLayer class has been defined to act as an interface between every possible layer
implementation and the Docknet class. The abstract layer defines the following abstract
methods, which every layer must implement:

• forward_propagate: how the layer computes its output, given the output of
the previous layer; this method is used for making predictions (see method predict
of the Docknet class)
• cached_forward_propagate: same as forward_propagate, but caching
in each layer some values computed during forward propagation that are later
required to compute the backward propagation (the output of the previous layer and
the output of the linear part of this layer)
• backward_propagate: how the layer computes the gradients of each
parameter, based on the cached values and the gradient of the cost function w.r.t. the
output of this layer (previously computed by the next layer during backward
propagation)
• clean_cache: deletes every cached variable so that, once a model is trained, these
values are omitted when saving the model to a file

Both cached_forward_propagate and backward_propagate are the methods
used to train the network (see methods train_batch and train of the Docknet class);
clean_cache is then called as the last training step.

Apart from abstract methods, the abstract layer class provides actual implementations of
some methods common to all layer implementations, to factor out code, namely the
dimension getter (the number of outputs of the layer) and the params getter and setter (the
dictionary of parameters of the layer). The params dictionary is retrieved and modified by the
parameter initializers and the parameter optimizers. The dimension is used when adding new
layers, since the number of parameters of a new dense layer depends both on the number
of neurons of the layer and the number of outputs of the previous layer.

13.7.1 Special class methods __getattr__ and __setattr__


Finally, the abstract layer class also implements 2 Python special class methods,
__getattr__ and __setattr__. These methods alter the way in which class variables
are read and written, respectively. Note that the layer parameters are all stored in a
Python dictionary params inside the layer class, as defined by the abstract class. Let l1 be
a layer object; this implies that, to access param W of the layer (an array containing the
W params of all the neurons in the layer), one has to use the following notation:

l1.params['W']

By default, Python objects have a special variable __dict__, a dictionary containing all the
variables of the object, so writing:

l1._params

is equivalent to writing:

l1.__dict__['_params']

If defined, method __getattr__ is called in case the __dict__ dictionary does not have
the requested key. In that case, our __getattr__ implementation looks for that key in the
params dictionary of the class, so we can write:

l1.W

instead of the more cumbersome notation:

l1.params['W']

This later simplifies the code of the forward and backward propagation methods of the layer.

Method __setattr__ is called when trying to set the value of a member of the class. In
our implementation we first check if the variable name is a key of the params dictionary, and
if so, we set the value of that parameter. Otherwise, we call the __setattr__ of the
parent class in order to let the Python interpreter continue with the standard behaviour
(setting the value of a class variable). This way we can also use the following notation:

l1.W = W

instead of:

l1.params['W'] = W

to set the value of a layer parameter.
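
A minimal, self-contained sketch of this pattern (not the library’s exact implementation):

class Layer:
    def __init__(self):
        # Dictionary holding all layer parameters (here just placeholders)
        self.params = {'W': None, 'b': None}

    def __getattr__(self, name):
        # Called only when the normal attribute lookup fails: fall back to params
        params = self.__dict__.get('params', {})
        if name in params:
            return params[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # If the name is a layer parameter, store it in params;
        # otherwise keep the default attribute behaviour
        params = self.__dict__.get('params')
        if params is not None and name in params:
            params[name] = value
        else:
            super().__setattr__(name, value)


l1 = Layer()
l1.W = [[1.0, 2.0]]          # equivalent to l1.params['W'] = [[1.0, 2.0]]
assert l1.W == l1.params['W']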

13.7.2 Docknet input layer


By convention, layer 0 of a neural network is the actual input vector that the network is
ingesting. We have defined an input layer implementation which simply propagates the input
vector it is given. A dimension must be given to the input layer, which is later passed
to the next layer when that layer is created, in order to know how many parameters the next
layer requires (remember the number of parameters W is equal to the number of neurons of
the previous layer times the number of neurons of the current layer). The forward
propagation of the input layer at least verifies that the dimension of the input vector it is given
is the same as the dimension of the input layer; otherwise it throws a custom exception
with an informative message. The cached forward propagation has nothing to cache, so it
simply falls back to the non-cached forward propagation.

The layer still has to declare a dictionary of parameters even though it does not have any, so
it declares an empty dictionary. During backward propagation it simply returns
an empty array of gradients, so the optimizer will not try to modify any layer parameter.

Finally, clean_cache does nothing since the layer doesn’t cache anything.

13.7.3 Docknet dense layer


The dense layer implements the standard fully connected layer we can see in any introductory
material on neural networks. For clarity’s sake, the forward and backward propagation
functions have been decomposed into 2 sub-functions, one for the linear part of the layer:

Z = W * A_previous + b

and another for the activation:

A = activation_function(Z)

The backward propagation is a little bit more complex; you can refer to the code to see exactly
how it is computed.
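
In NumPy terms, the two forward sub-steps could be sketched as follows (the function names
are illustrative, not the library’s exact ones; A_prev holds the previous layer’s output, one
column per input vector):

import numpy as np


def linear_forward(W: np.ndarray, A_prev: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Z = W * A_previous + b, computed for all input vectors at once
    return W @ A_prev + b


def activation_forward(Z: np.ndarray, activation_function) -> np.ndarray:
    # A = activation_function(Z), applied element-wise
    return activation_function(Z)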

13.8 Optimizers
As for the initializers (section 13.6) and the layers (13.7), the optimizers are implemented by
extending an abstract class, to ensure a common interface for all the optimizers (see script
src/docknet/optimizer/AbstractOptimizer.py). The optimizers must
implement 2 methods:

• reset: receives the list of layers of the network and performs any initialization
required before starting the training process. This method is called by the train
method of the Docknet before running any training iteration.
• optimize: receives the list of layers of the network and the corresponding list of
parameter gradients, one dictionary of gradients per layer. It then updates the values
of the parameters of each layer, based on the received gradients. This method is called
after each training iteration in the train_batch method of the Docknet.

We have implemented 2 optimizers: GradientDescentOptimizer and
AdamOptimizer. Gradient descent is pretty simple: it requires no initialization, and as
optimization it simply applies the following formula to each layer parameter p:

p = p - learning_rate * gradient_of_p

Adam is a more sophisticated version of gradient descent. It also ends up subtracting an
amount from each parameter of each layer, but in a way that minimizes fluctuations of the
cost function when training with minibatches. Note that when training with minibatches, after
each iteration the optimizer updates the network parameters based on a fraction of the
dataset, so fluctuations of the cost function are to be expected. Adam takes into account which
optimization directions are systematic (e.g., consistently moving in a direction that approaches
a minimum) and which directions fluctuate (e.g., moving up and down alternately, directions
that do not contribute to approaching the minimum). The corrections applied to the layer
parameters amplify the systematic directions and dampen the fluctuating ones. To account for
the directions of the previous modifications, Adam maintains 2 variables v and s (see the Adam
optimization code for more info), which have to be reset to 0 before starting the training.
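
A hedged sketch of the simpler of the two optimizers (the class name, import path, constructor
argument and gradient dictionary keys are assumptions, not the library’s exact code):

from typing import Dict, List

import numpy as np

from docknet.optimizer.AbstractOptimizer import AbstractOptimizer  # assumed module path


class SimpleGradientDescentOptimizer(AbstractOptimizer):
    def __init__(self, learning_rate: float = 0.01):
        self.learning_rate = learning_rate

    def reset(self, layers):
        # Plain gradient descent keeps no state between iterations
        pass

    def optimize(self, layers, gradients: List[Dict[str, np.ndarray]]):
        # One dictionary of gradients per layer, assumed to use the same keys as the layer params
        for layer, layer_gradients in zip(layers, gradients):
            for name, gradient in layer_gradients.items():
                layer.params[name] = layer.params[name] - self.learning_rate * gradient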

13.9 Utility functions and classes


Common general methods and classes are placed in folder src/docknet/util, namely
some functions to convert random numbers to polar coordinates and polar coordinates to
cartesian coordinates (script geometry.py), and a Notifier class used to print
debugging messages with different colours:
• info messages are printed in green,
• warn messages are printed in yellow,
• error messages are printed in red, and
• debug messages are printed with the default terminal colour.

14 Unit testing with pytest & PyCharm


When we develop a relatively complex piece of software, a common practice is to implement
automated tests for each individual component so we can be sure a component has the
expected behaviour before building anything on top of it. Note that if we don’t test our
software until a late development stage, finding errors in the code will be much more difficult
and time consuming, since they can be anywhere in the code. By testing each component
separately, we avoid having to execute unrelated pieces of code until we arrive at the part
where we suspect we may have a bug. Testing each component separately helps us
to discard potential sources of errors and speeds up the debugging process. Moreover, each
unit test serves as documentation by example of how each component is supposed to be
used, since it consists of working source code that can simply be copy/pasted somewhere else
and modified to suit a specific purpose.

In this guide we use the Python library pytest in order to write and run the tests. By default,
Python already comes with a library called unittest that can be used for the same purpose;
however, the pytest library comes with some interesting additional functionalities, such as
fixtures and parameterized tests, which we will see in the examples.

An automated test is just a function that makes use of some other function or method and
checks that, given an input, the function or method returns the expected output, for instance:

def test_addition():
    operand1 = 1
    operand2 = 2
    expected = 3
    actual = operand1 + operand2
    assert actual == expected

It is common terminology in all unit test frameworks (not just in pytest) to refer to the
returned value as “actual”, and to the correct value as “expected”. In Python we use the
reserved word assert to compare the actual and expected values, or in general to evaluate
any Boolean expression (e.g., correct behaviour might be to return some value above some
threshold, instead of returning some specific value). If the expression is false, an exception
is raised, and the test function is considered failed. Otherwise, the test function execution
ends without raising any exception, and thus is considered successful. If the code we run
throws any exception that is not captured (using a try/except block), the test is also
considered failed. Note that assert is a reserved Python word; one does not need to import
any unit testing library such as unittest or pytest in order to use it.

14.1 How to run unit tests


To run a test in PyCharm with pytest, we first must configure the project to use pytest as the
test runner, as explained in section 11. Note we need to do this only once per project. Then
we can simply right click on the test function header and click on either run or debug test in
the window that pops up. We can also right click on a file containing unit tests and then click
on run, in order to run all the tests in the file. We can also right click on a folder containing
test files in order to run all tests in that folder. By default, PyCharm presents the list of all
tests that were run and failed, and one can click on any test to view the corresponding output
and error messages.

We can also run all project tests from the command line by running the build script (see
section 8). Indeed, it is good practice to run the build script before uploading any changes to
the Git repository in order to make sure the code we share with the team is stable; otherwise,
we might be introducing some bug that could block the work of our colleagues. To prevent
this from happening, it is also common practice to implement a continuous integration
pipeline integrated with the Git repository (explained in section 22), which will run all the
tests upon each change committed to the repository in order to detect errors as soon as
possible.

In case we already have the Python virtual environment created, activated, and with all the
dependencies installed (e.g., by running pip install -r requirements.txt), we
can also run command:

pytest

in any project folder in order to run from the command line all tests inside that folder, without
having to recreate the entire virtual environment.

In pytest, any function whose name starts with test is considered to be a test. Since we put
all the test files in folder test (as explained in section 12), and to prevent pytest from
mistaking some function in folder src for a test, we add a file pytest.ini to the root folder
of the project with the following content:

[pytest]
testpaths = test/unit

Note that in order to be able to run pytest from the command line and let it find the
corresponding source code in folder src, we need to add this code to file
test/unit/__init__.py:

import os
import sys

src_path = os.path.join(
    os.path.dirname(os.path.abspath(__file__)), '..', '..', 'src')
sys.path.insert(0, src_path)

It is also possible to run individual tests with the pytest command in the command line;
however, running them with PyCharm eases navigating through the different test outputs and
towards the precise lines of code that produced the exceptions, since PyCharm interprets the
pytest output messages and adds the corresponding clickable links. Moreover, one can add
breakpoints to specific lines of code by clicking in PyCharm on the margin space to the right
of the line number; we will see a red dot appear in the margin to mark the breakpoint,
which we can click on in order to remove the breakpoint. When debugging code, we must set
at least one breakpoint so that the debugger pauses the execution at that point; otherwise
all the test code will be run without stopping, having the same effect as running the test
instead of debugging it.

14.2 Organizing unit tests


As explained in section 12, we place all the test code in a separate test folder from the rest
of the source code so that we can later package the code without including the tests, which
are only needed for validating the code. A useful convention is to place 2 sub-folders inside it:
a data folder for storing any data file needed by the tests (e.g., a JSON file
containing some complex object to be returned by a function or method), and a unit folder
containing all the unit tests.

For each Python script xxx.py in the src folder that we want to test - a script implementing
some component or set of related functions - we create a corresponding test file
test_xxx.py inside the unit folder. Note that in folder src we may define several folders
and sub-folders in order to better structure the scripts. While pytest does not make any
difference between different test packages, and indeed we cannot have 2 test scripts with
the same name even if they are in different folders, it can be useful to mimic the same folder
structure inside the unit folder as in the src folder, in order to make more obvious which
test file corresponds to which source code script. Moreover, having related tests in a
separate folder allows us to easily run all of them, either in PyCharm by right clicking on the
corresponding folder and then clicking on run/debug tests, or from the command line by going
to the corresponding folder and running command pytest. Regarding the data folder, it can
also be useful to replicate the src folder structure inside it (see the sketch below).
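
For instance, the layout could look as follows (a sketch; the test file path is the one mentioned
later in this section, the remaining file names are illustrative):

src/docknet/layer/dense_layer.py              # component under test (illustrative name)
test/data/docknet/layer/...                   # data files, mirroring the src sub-folders
test/unit/docknet/layer/test_dense_layer.py   # tests for the dense layer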

When using pytest and organizing the tests in different folders, it is not really needed to add
an __init__.py file to each folder, since pytest ignores the packages and assumes all the
test files belong to one single anonymous package. However, there are 2 cases in which we
may require the __init__.py files, hence it can be better to systematically add them:

1. To be able to run pytest from the command line and let it find the source code files in
folder src, as explained in the previous section
2. To be able to import and reuse test code across different test files

Regarding the second point, we have an example of this in the Docknet library: script
test/unit/docknet/dummy_docknet.py contains hardcoded computations for a
specific neural network, with all the values that are expected during the forward and backward
propagations of a first training iteration. We then use these expected values to test the
different Docknet components, which are distributed across different test files. For instance,
test file test/unit/docknet/layer/test_dense_layer.py imports all the
expected values in the dummy_docknet.py as follows:

from test.unit.docknet.dummy_docknet import *

14.3 pytest fixtures


When we define a component as a Python class in a Python script, it is common to write a
corresponding test script containing the tests for all the methods of the class. In that situation,
each test requires an instance of the class in order to invoke the corresponding method to
obtain the actual value, and it may happen that the same instance could be used across the
different tests. In order to factor out the code for creating the instance, we can use a special
kind of pytest function called a fixture. The following is an example of fixture extracted from
file test/unit/data_generator/test_swirl_data_generator.py:

x_range = (0., 9.)
y_range = (-3., 0.)

@pytest.fixture
def data_generator1():
    generator = SwirlDataGenerator(x_range, y_range)
    yield generator

A fixture is a function that returns some object or value that is to be used by other test
functions. Note the parameters x_range and y_range have been hardcoded as global
variables of the test script so they can be reused in any test. We add the annotation:

@pytest.fixture

so pytest recognizes the function as a fixture. The fixture returns the object or value using the
Python keyword yield instead of return, so that once the test function has finished using
the object returned by the fixture, the execution flow goes back to the line right after yield
in order to free any resources taken by the object (e.g., closing a file, deleting a memory
buffer, etc.). In unit test frameworks, the process of creating the objects needed for running a
test is commonly known as setup, and the process of deleting and/or freeing the resources as
tear down.

Below we have a unit test function that uses the fixture:

def test_generate_sample(data_generator1):
    size = 2000
    X, Y = data_generator1.generate_balanced_shuffled_sample(size)
    axe = plt.subplot()
    plot_scatter(axe, X[0, :], X[1, :], Y[0, :], x_range, y_range, 'Swirl sample')
    assert X.shape == (2, size)
    assert Y.shape == (1, size)

We simply add the name of the fixture function as a parameter of the test function; then we
can use that parameter as if it were the object or value returned by the fixture. This unit test
exercises the generate_balanced_shuffled_sample method of the
SwirlDataGenerator, which is analogous to the same method of the
ChessboardDataGenerator described in section 13.3. In this test we generate a random
sample and check that the returned arrays have the proper shape. Even if we are not checking
the exact values that are being returned, the test already forces the code of this data generator
to run, allowing us to catch many errors we may have made. Note that in contrast with other
popular programming languages like Java or C++, Python is an interpreted language, which
means that the code is not translated to machine language (compiled) ahead of time, and
hence many errors will not be caught until the code is actually executed.

14.4 Parameterized pytests


There are cases in which we may want to test some method or function for a variety of inputs,
for instance to test that our implementation of the sigmoid function is returning the expected
values for the border cases (a big negative and a big positive number) and some case in
between such as 0. Instead of copy/pasting the same test function several times to just modify
the hardcoded inputs and expected outputs, we have 2 choices in order to factor out the test
code:
1. Create an exercise_sigmoid function that receives an input and the expected
output, computes the actual value and compares the actual with the expected, then
create several test functions test_sigmoid_negative,
test_sigmoid_positive and test_sigmoid_zero that call the exercise
function for evaluating a single case.
2. Implement a single parameterized pytest function which will be called by pytest for
each test case we want to evaluate.

Both approaches have their pros and cons:

1. The first approach is more verbose but allows us to run one specific test case by right
clicking on the corresponding test function and then clicking on run or debug.
Furthermore, when running all the tests it is easier to identify which particular test
case failed, provided that we give each test case function a descriptive name,
since we will then get the list of all test function names that failed (e.g.,
test_sigmoid_zero instead of just test_sigmoid for some test case). A sketch
of this approach is given after this list.
2. The second approach is more convenient when we want to test a longer list of cases,
and the list of input parameters and expected values is short (e.g., the input value of
the sigmoid function and the expected output)
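
A sketch of the first approach (the function names are the ones mentioned above; the
sigmoid import path is an assumption):

from numpy.testing import assert_array_almost_equal

from docknet.function.activation_function import sigmoid  # assumed import path


def exercise_sigmoid(x, expected):
    actual = sigmoid(x)
    assert_array_almost_equal(actual, expected)


def test_sigmoid_negative():
    exercise_sigmoid(-100., 0.)


def test_sigmoid_zero():
    exercise_sigmoid(0., 0.5)


def test_sigmoid_positive():
    exercise_sigmoid(100., 1.)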

In order to define a parameterized test, we first create an array of tuples where each tuple
contains the input and expected values for each test case. In the example below, since our
implementation of the sigmoid function may accept either simple values or NumPy arrays,
we test both cases for the border and normal cases, namely:

• sigmoid(-100) should return a value near to 0,
• sigmoid(0) should return 0.5,
• sigmoid(100) should return a value near to 1, and
• sigmoid([-100, 0, 100]) should return [0, 0.5, 1], approximately.

Our tuples are pairs where the first value or array is the input, and the second value or array
is the expected output:

sigmoid_test_cases = [
(np.array([-100., 0., 100]), np.array([0., 0.5, 1.])),
(-100., 0.),
(0., 0.5),
(100., 1.),
(np.array([0.]), np.array([0.5])),
]

For a test function to be parameterized, we precede it with the pytest annotation:

@pytest.mark.parametrize(argument_names, argument_values)

where argument_names is a string containing a comma-separated list of the argument
names the test function will receive (the same number of parameters as elements in the test
case tuples) and argument_values is the variable that contains the list of tuples.

@pytest.mark.parametrize('x, expected', sigmoid_test_cases)
def test_sigmoid(x: Union[float, np.ndarray], expected: Union[float, np.ndarray]):
    actual = sigmoid(x)
    assert_array_almost_equal(actual, expected, verbose=True)

Note that in this particular test function we have used the NumPy function
assert_array_almost_equal instead of Python’s keyword assert. When comparing
float numbers, the test may pass or fail depending on where we run it, because floating point
results can differ slightly between machines and library versions, hence we cannot use the
strict equality comparator ==. NumPy provides this test function, which can be used to
compare either simple values or NumPy arrays for equality up to a given number of decimals,
which by default is 6.

Note that when we run this test function in PyCharm, PyCharm will run the function as many
times as there are test cases, following the order given by the list of tuples. If we want to
debug one particular case, we either have to put that case first in the list, so the debugger will
start with that case, or comment out all the previous test cases.

Here is another example of a parameterized test, for the cross-entropy function. This test
is taken from file test/unit/function/cost_function.py. This particular test
was used to trace back an error produced when the neural net manages to output exactly 0 or
1 instead of values near 0 or 1, in which case the cross-entropy function is not defined and a
NaN was being returned, breaking the training process. Including these test cases allowed us
to check different ways to overcome this situation until the expected behaviour was achieved,
as well as serving as documentation of what the cross-entropy function should return in these
border cases.

cross_entropy_test_cases = [
(np.array([[1., 0.]]), np.array([[1., 0.]]), 0.),
(np.array([[1., 0.]]), np.array([[0., 1.]]), np.inf),
(np.array([[1., 0.]]), np.array([[0.5, 0.]]), 0.6931471805599453),
(np.array([[1., 0.]]), np.array([[1., 0.5]]), 0.6931471805599453),
(np.array([[1., 0.]]), np.array([[0.5, 0.5]]), 1.3862943611198906),
]

@pytest.mark.parametrize('Y, Y_circ, expected', cross_entropy_test_cases)
def test_cross_entropy(Y: np.ndarray, Y_circ: np.ndarray, expected: float):
    # Disable NumPy warnings when dividing by 0 in the border cases,
    # otherwise these tests produce warnings
    np.seterr(divide='ignore', over='ignore')
    actual = cross_entropy(Y_circ, Y)
    np.testing.assert_almost_equal(actual, expected)

14.5 A bigger pytest, and a word on mocking

Finally, here is a fixture creating a simple Docknet object, which is later used to test an
entire training iteration with the corresponding forward and backward propagations:

@pytest.fixture
def docknet1():
    docknet1 = Docknet()
    docknet1.add_input_layer(2)
    docknet1.add_dense_layer(3, 'relu')
    docknet1.add_dense_layer(1, 'sigmoid')
    docknet1.cost_function = 'cross_entropy'
    docknet1.initializer = DummyInitializer()
    docknet1.optimizer = GradientDescentOptimizer()
    yield docknet1

def test_train(docknet1):
    docknet1.train(X, Y, batch_size=2, max_number_of_epochs=1)
    expected_optimized_W1 = optimized_W1
    expected_optimized_b1 = optimized_b1
    expected_optimized_W2 = optimized_W2
    expected_optimized_b2 = optimized_b2
    actual_optimized_W1 = docknet1.layers[1].params['W']
    actual_optimized_b1 = docknet1.layers[1].params['b']
    actual_optimized_W2 = docknet1.layers[2].params['W']
    actual_optimized_b2 = docknet1.layers[2].params['b']
    assert_array_almost_equal(actual_optimized_W1, expected_optimized_W1)
    assert_array_almost_equal(actual_optimized_b1, expected_optimized_b1)
    assert_array_almost_equal(actual_optimized_W2, expected_optimized_W2)
    assert_array_almost_equal(actual_optimized_b2, expected_optimized_b2)

The train function is invoked, requesting to run for one epoch only on an input batch of
size 2. Afterwards, the test verifies that each layer’s parameters have been set to the expected
values. Strictly speaking, this is not a unit test, since we are not only testing the train
method but also all the other class methods (such as the forward and backward propagation
methods of the DenseLayer class) involved in the training process. A more advanced
technique used in testing code consists of creating mocked versions of the objects, hardcoding
the values that their methods should return during the test, so that if the test fails it is because
of the implementation of the train method and not because of some other method it uses.
Nevertheless, this test can help us debug the entire training process to make sure it is
properly implemented end to end.

More information on pytest and other of its functionalities can be found at:

https://docs.pytest.org/en/latest/

and on mocking in Python, here:

https://mock.readthedocs.io/en/latest/

14.6 A note on code refactoring


It is not unusual, especially when following agile methodologies, to have to refactor
code. We may start working on a minimum viable product with the intention of having
something to show to the stakeholders as soon as possible, and have to make compromises.
In agile development, we do not develop the final code in one step, but make the code evolve
through different stable versions, each one adding functionalities that can then be evaluated,
with the next features to implement decided based on the results of the previous version.
Having frequent development and evaluation iterations prevents us from spending too much
time on an approach that could later prove to be inadequate.

In order to support this way of coding, unit testing is a key factor. As the number of
components increases, it is difficult to assess what the impact of modifying some component
will be. Perhaps the new feature to implement requires some modification of the inputs or
outputs of some function, and that will have an impact on every component using the
function: they will all have to be refactored in order to conform to the new interface.
In turn, adapting a component may result in modifications that have to be further
propagated to other components using it, and so forth. Having unit tests for each one of
the components gives us control over the components that need to be modified: once we
make a modification, we can run all the tests to see which ones fail, which in turn points us
towards all the components that will have to be adapted. Refactoring code then consists of
updating the different components and the corresponding tests in order to reflect the new
expected behaviour, re-running the tests in order to track how the changes propagate across
the entire project, then continuing the refactoring until we arrive at a new stable version of
the code we can share with the team.

Finally, a code refactor may at some point turn out to be a bad idea, and we may want to go
back to the previous version of the code, dropping the whole sequence of changes we may
have made to many different files. For this reason, working with Git branches (explained in
the next section) is key, allowing us to test any code modification, no matter how risky it may
seem, since with a single Git command we can go back to the master version of the code,
which we should all try to keep stable.

15 Working with Git branches


15.1 Git Workflow
One of the main things to remember about Git as you are learning how to work with Git is the
three main states that files in Git can exist in: modified, staged and committed.

Below is a simple description of each of the corresponding areas (source:
https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F):

Working directory. This is a single checkout of one version of the project. These files are
pulled out of the compressed database in the Git directory and placed on disk for you to use
or modify.

Staging area or index. This corresponds to a file, generally contained in your Git directory, that
stores information about what will go into your next commit.

Git directory. This is where Git stores the metadata and object database for your project. It
is what is copied when you clone a repository from another computer.

https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F

A very basic Git workflow is as follows (an example command sequence is shown after the list):

1. Modify files in your working directory
2. Selectively stage just those changes that you want to be part of your next commit,
which adds only those changes to the index or staging area
3. Make a commit. This operation takes the files as they are in the staging area and
stores that snapshot permanently in your Git directory.
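
For instance (the file name and commit message are illustrative):

git status                        # see which files have been modified
git add src/docknet/net.py        # stage only the changes we want in the next commit
git commit -m "Add JSON serialization to the Docknet class"
git push                          # share the new commit with the team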

Then, the different states can be summarised in the following manner (source:
https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F)

If a version of a file is in the Git directory, it’s considered to be committed. If it has been
modified and was added to the staging area, it is staged. And if it was changed since it was
checked out but has not been staged, it is modified.

A more detailed workflow than the one above, which further illustrates the relation between
the workspace, the index and the repository, as well as the more general idea of using Git to
build a workflow, can be found here:

https://blog.osteele.com/2008/05/my-git-workflow/

As discussed above, a typical workflow involves implementing a conceptual change as a set of
small steps. It is common to add each step to the index or staging area but save the commit
until the full conceptual change has been implemented and we are back to working, tested
code.

Some common operations, in addition to the basics git add, git commit, git push
and git pull, are the following:
• git diff shows what has changed since the last checkpoint (the index)
• git diff HEAD shows what has changed since the last commit
• git checkout . reverts the working directory to the last checkpoint
• git checkout HEAD . reverts to the last commit

15.2 Git Branches

https://www.atlassian.com/git/tutorials/using-branches

What is a branch? Git branches are effectively a pointer to a snapshot of your changes.

When do we need to branch? When you want to add a new feature or fix a bug - no matter
how big or how small - you spawn a new branch to encapsulate your changes.

Don’t mess with the Master (https://thenewstack.io/dont-mess-with-the-master-working-with-branches-in-git-and-github)

Show me all the branches:

git branch

How to create a branch?

git branch <branch>

Merging

https://www.atlassian.com/git/tutorials/using-branches/git-merge

Example
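
As an illustrative command sequence (the branch name is hypothetical):

git checkout -b feature/new-data-generator   # create the branch and switch to it
# ...edit files, git add, git commit as usual...
git checkout master                          # go back to master
git merge feature/new-data-generator         # merge the feature branch into master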

15.3 Git merge conflicts
(Source: https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts)

During a merge, Git will try to figure out how to automatically integrate new changes.
However, there are cases where Git cannot automatically determine what is correct. Two
common causes of conflicts are: (a) the current local branch has a modified file while in the
branch being merged the file has been deleted; (b) the two branches have one or more files
with changes in the same lines that differ across the branches. Git will mark the file as being
conflicted and stop the merging process. It is then up to the developers to resolve the conflict.

Types of merge conflict. A merge conflict can arise at two separate points: when starting and
during the merge process:

(a) Git fails to start the merge

This situation presents itself when there are pending changes in either the working directory
or the staging area of the current project. The reason behind this failure is that the pending
changes could be overwritten by the commits that are being merged in.
A merge failure on start will output a message similar to the following:

error: Entry '<filename>' not uptodate. Cannot merge.

To resolve this issue, the local state will need to be stabilized using git stash, git
checkout, git commit, or git reset.

(b) Git fails during the merge


A failure during a merge indicates a conflict in the code between the current local branch and
the branch being merged. A mid-merge failure will output the following error message:
error: Entry '<filename>' would be overwritten by merge. Cannot
merge. (Changes in staging area)

How to identify merge conflicts


We can gain further information as to the conflict by running the command git status
git status
On branch master
You have unmerged paths.
(fix conflicts and run "git commit")
(use "git merge --abort" to abort the merge)

Unmerged paths:
(use "git add <file>..." to mark resolution)

both modified: merge.txt

The output of git status indicates that there are unmerged paths due to a conflict. It also
indicates the files causing the conflict: ‘merge.txt’ in the example above.

The next step is to examine the conflicting file(s) and see what the discrepancies are. We can do
this by using the cat command. Git uses three different types of marker lines to display the
differences between the branches in the modified file. See an example below:
cat merge.txt
<<<<<<< HEAD
this is some content to mess with
content to append
=======
totally different content to merge later
>>>>>>> new_branch_to_merge_later

The ======= line is the 'center' of the conflict. All the content between the center
and the <<<<<<< HEAD line is the content that exists in the current branch master,
which the HEAD ref is pointing to. All content between the center and >>>>>>>
new_branch_to_merge_later is content that is present in our merging branch.

How to resolve merge conflicts using the command line?


The most direct way to resolve a merge conflict is to edit the conflicted file. Open the
merge.txt file in a text editor. In the example above, let's simply remove all the conflict
dividers. The modified merge.txt file should then look like this:

this is some content to mess with
content to append
totally different content to merge later

Once the file has been edited – in this case simply combining the text from both files – use
git add merge.txt to stage the new merged content. To finalise the merge, create a
new commit by executing:
git commit -m "merged and resolved the conflict in merge.txt"

Git commands that can help resolve merge conflicts


General tools:
• git status: Helps to identify conflicted files
• git log --merge: Passing the --merge argument to the git log command will
produce a log with a list of commits that conflict between the merging branches
• git diff: diff helps find differences between states of a repository/files. This is
useful in predicting and preventing merge conflicts.
Tools for when git fails to start a merge:
• git checkout: checkout can be used for undoing changes to files, or for changing
branches
• git reset --mixed: reset can be used to undo changes to the working directory
and staging area
Tools for when Git conflicts arise during a merge:
• git merge --abort: executing git merge with the abort option will exit from the
merge process and return the branch to the state before the merge began
• git reset: can be used during a merge conflict to reset conflicted files to a known
good state.

Advanced tips
• Merging vs Rebasing: https://www.atlassian.com/git/tutorials/merging-vs-rebasing

16 Object serialization and deserialization


Object serialization is the process of transforming an object into a sequence of bytes or
characters in order to save its state into a file or to transfer the object through a network as
a data stream. Conversely, object deserialization stands for parsing the sequence of bytes or
characters to load the object back into memory. These processes require defining a
representation format for the object, a convention to follow when serializing it so that
upon deserialization it is possible to interpret each byte or character in the file or stream and
rebuild the same object. The Docknet class supports 2 different formats, JSON and pickle.
JSON is a standard text-based serialization format that is widely used and that follows a similar
structure to Python dictionaries (see file test/data/docknet1.json for an
example). Pickle is a binary format that is provided by Python for any Python object, so it's
quite straightforward to use. There is a trade-off between using JSON and using pickle: JSON
files are human readable while pickle files take less space.

16.1 JSON

To serialize a Docknet object to JSON, simply use method to_json:

def to_json(self, pathname_or_file: Union[str, TextIO], pretty_print: bool = False):
    """
    Save the current network parameters to a JSON file. Intended for debugging/testing purposes.
    For actually using the network for making predictions, use method to_pickle, which will save
    the parameters in a more efficient binary format
    :param pathname_or_file: either a path to a JSON file or a file-like object
    :param pretty_print: generate a well formatted JSON for manual review
    """
    kwargs = {'cls': DocknetJSONEncoder}
    if pretty_print:
        kwargs['indent'] = 4
        kwargs['sort_keys'] = True
    if isinstance(pathname_or_file, str):
        with open(pathname_or_file, 'wt', encoding='UTF-8') as fp:
            json.dump(self, fp, **kwargs)
    else:
        json.dump(self, pathname_or_file, **kwargs)

Note that under the hood the method simply calls json.dump; by default, json.dump only
supports objects that are simple data types (e.g., numbers, strings, etc.), Python
dictionaries or lists. For other classes we need to implement our own JSON encoder:

class DocknetJSONEncoder(json.JSONEncoder):
    """
    JSON encoder needed for serializing a Docknet to JSON format; defines how to serialize
    special Docknet classes such as the Docknet itself, the layers and NumPy arrays
    """
    def default(self, obj: Any) -> Union[List[AbstractLayer], Dict[str, Union[int, Dict[str, np.ndarray], str]], object]:
        if isinstance(obj, Docknet):
            return obj.layers
        elif isinstance(obj, AbstractLayer):
            return obj.to_dict()
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super().default(obj)

The encoder simply overrides the default method of the parent JSONEncoder in order to
have a special behaviour whenever the object to serialize is a Docknet, any kind of layer (a
child of AbstractLayer), or a NumPy array. For each case we define how to convert these
objects into Python dictionaries or lists, and let JSON take care of serializing those. For any
other case we simply revert to the default serializer of the JSONEncoder.

Here we test the serializer, invoking the method for a given Docknet and comparing the result
with an expected JSON file we have created in advance (file
test/data/docknet1.json):

def test_to_json(docknet1):
    # Set network parameters as for the dummy initializer in order to enforce a specific expected output
    docknet1.initializer.initialize(docknet1.layers)
    expected_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_path, 'rt', encoding='UTF-8') as fp:
        expected = fp.read()
    actual_file = io.StringIO()
    docknet1.to_json(actual_file, True)
    actual = actual_file.getvalue()
    assert actual == expected

A little trick for not having to manually write the expected JSON file is to first use the unit test
to save the generated JSON as the expected one. Then we manually check the file in order to
ensure it is correct, and finally remove or comment out the code for saving the actual JSON
as the expected one. If at some point the serialization code is broken, the actual JSON will
differ from the expected one and the test will fail. For instance, imagine we change the
definition of the Docknet by adding some attribute of a new class for which we have not
defined a custom JSON serializer: the test will fail when trying to serialize this new Docknet,
reminding us that we also need to adapt the JSON serializer.
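As an illustration, this bootstrapping trick could look like the following variation of the test above (same names as in the test), where the commented-out line is enabled only on the very first run and commented out again once the generated file has been manually reviewed:

def test_to_json(docknet1):
    docknet1.initializer.initialize(docknet1.layers)
    expected_path = os.path.join(data_dir, 'docknet1.json')
    # First run only: save the actual output as the expected file, review it by hand,
    # then comment this line out again so the test really checks against the reviewed file
    # docknet1.to_json(expected_path, True)
    with open(expected_path, 'rt', encoding='UTF-8') as fp:
        expected = fp.read()
    actual_file = io.StringIO()
    docknet1.to_json(actual_file, True)
    assert actual_file.getvalue() == expected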

In order to deserialize a Docknet we use the global method read_json:

def read_json(pathname: str) -> Docknet:
    """
    Create a new Docknet initialized with previously saved parameters in JSON format
    :param pathname: path and name of the JSON file
    :return: the initialized Docknet
    """
    with open(pathname, 'rt', encoding='UTF-8') as fp:
        layers_description = json.load(fp)
    docknet = Docknet()
    for desc in layers_description:
        if desc['type'] == 'input':
            docknet.add_input_layer(desc['dimension'])
        elif desc['type'] == 'dense':
            docknet.add_dense_layer(desc['dimension'], desc['activation_function'])
        if 'params' in desc:
            params = {k: np.array(v) for k, v in desc['params'].items()}
            docknet.layers[-1].params = params
    return docknet

We use the method json.load in order to load the JSON file as plain Python data structures
whose values will either be simple data types, other Python dictionaries or lists. In the same way
we implemented a custom JSON encoder to transform the Docknet object into these data
types, we now need some custom code to re-instantiate the Docknet object from this
representation. We create an empty Docknet instance, then traverse the list of
layer descriptions and create one by one the corresponding layers. We check field ‘type’ to
know which kind of layer to instantiate and extract the layer dimension and activation
function from fields ‘dimension’ and ‘activation_function’, respectively. Once a layer is added
to the Docknet, we check if the JSON file includes parameters for the layer (field ‘params’). If
that’s the case, we parse the parameters (convert the lists of values back to NumPy arrays)
and simply assign them to the layer params field.

In order to test the deserializer we simply deserialize an example Docknet JSON, serialize it
back and verify that we obtain again the same file:

def test_read_json_to_json():
    expected_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_path, 'rt', encoding='UTF-8') as fp:
        expected_json = fp.read()
    actual_docknet = net.read_json(expected_path)
    actual_file = io.StringIO()
    actual_docknet.to_json(actual_file, True)
    actual_json = actual_file.getvalue()
    assert actual_json == expected_json

16.2 pickle

Implementing Pickle serializers and deserializers is much straightforward than JSON’s since
there is no need to implement custom serializers and deserializers. We simply call
pickle.dump in order to create the binary file:

def to_pickle(self, pathname_or_file: Union[str, BinaryIO]):
    """
    Save the current network parameters to a pickle file; to be used after training so that the
    model can later be reused for making predictions without having to train the network again
    :param pathname_or_file: either a path to a pickle file or a file-like object
    """
    if isinstance(pathname_or_file, str):
        with open(pathname_or_file, 'wb') as fp:
            pickle.dump(self, fp)
    else:
        pickle.dump(self, pathname_or_file)

and call method pickle.load to load the object back:

def read_pickle(pathname: str) -> Docknet:
    """
    Create a new Docknet initialized with previously saved parameters in pickle format
    :param pathname: path and name of the pickle file
    :return: the initialized Docknet
    """
    with open(pathname, 'rb') as fp:
        docknet = pickle.load(fp)
    return docknet

Since the resulting file is binary, we have no way to manually check its contents in order to
validate it, like we did with the JSON file. As a work-around we create a Docknet from a JSON
file, save it to pickle, load it back from the pickle, save it to JSON again, and verify that the
resulting JSON is equal to the expected one:

def test_to_pickle_read_pickle_to_json(docknet1):
    # Set network parameters as for the dummy initializer in order to enforce a specific expected output
    docknet1.initializer.initialize(docknet1.layers)
    pkl_path = os.path.join(temp_dir, 'docknet1.pkl')
    expected_json_path = os.path.join(data_dir, 'docknet1.json')
    with open(expected_json_path, 'rt', encoding='UTF-8') as fp:
        expected_json = fp.read()
    docknet1.to_pickle(pkl_path)
    docknet2 = read_pickle(pkl_path)
    actual_file = io.StringIO()
    docknet2.to_json(actual_file, True)
    actual_json = actual_file.getvalue()
    assert actual_json == expected_json

Provided that the JSON unit tests pass, if the pickle unit test fails then the problem should
be in the pickle serialization/deserialization code.

17 Python package commands

We can easily create command line entry points using Python’s argparse package. The
Docknet library includes commands for:

• generating datasets with any of the dataset generators available
• training a Docknet, given a Docknet layer description in a JSON file
• evaluating a pre-trained Docknet with a test set
• computing predictions with a pre-trained Docknet and an unlabeled dataset, and
• starting a Docknet-based web service (the web service is explained in section 20)

This way we can use the library without the need of Jupyter notebooks, and potentially let it
run with bigger datasets on a server by invoking the commands from a shell. Moreover, we
can create bash scripts that could call sequences of commands (e.g., generate a dataset, train
a Docknet, then make predictions). As an example, here is the code used to create the
command line entry point for the data generators:

import argparse
import sys

import pandas as pd

from docknet.data_generator.data_generator_factory import (data_generators,
                                                            make_data_generator)


def parse_args():
    """
    Parse command-line arguments
    :return: parsed arguments
    """
    parser = argparse.ArgumentParser(description='Generate dataset')
    parser.add_argument('--generator', '-g', action='store', required=True,
                        help=f'Data generator to use '
                             f'({",".join(data_generators.keys())})')
    parser.add_argument('--x0_min', action='store', default=-5.0, type=float,
                        help='Minimum value of x0')
    parser.add_argument('--x0_max', action='store', default=5.0, type=float,
                        help='Maximum value of x0')
    parser.add_argument('--x1_min', action='store', default=-5.0, type=float,
                        help='Minimum value of x1')
    parser.add_argument('--x1_max', action='store', default=5.0, type=float,
                        help='Maximum value of x1')
    parser.add_argument('--size', '-s', action='store', required=True,
                        type=int, help='Sample size')
    parser.add_argument('--output', '-o', action='store', default=None,
                        help='Output path (defaults to standard output)')

    args = parser.parse_args()
    if args.generator not in data_generators.keys():
        print(f'Unknown data generator {args.generator}; available generators '
              f'are: {",".join(data_generators.keys())}')
        sys.exit(1)
    if args.x0_min >= args.x0_max:
        print('Empty x0 range')
        sys.exit(1)
    if args.x1_min >= args.x1_max:
        print('Empty x1 range')
        sys.exit(1)
    return args


def main():
    args = parse_args()
    generator = make_data_generator(args.generator, (args.x0_min, args.x0_max),
                                    (args.x1_min, args.x1_max))
    X, Y = generator.generate_balanced_shuffled_sample(args.size)
    X_df = pd.DataFrame(X)
    Y_df = pd.DataFrame(Y)
    sample_df = pd.concat([X_df, Y_df], axis=0, ignore_index=True)
    if args.output:
        with open(args.output, 'wt', encoding='UTF-8') as fp:
            sample_df.to_csv(fp, header=False, index=False)
    else:
        sample_df.to_csv(sys.stdout, header=False, index=False)


if __name__ == '__main__':
    main()

We basically declare an ArgumentParser, then declare the potential parameters for this
parser with method add_argument. For each argument we can declare:
• a long argument name (e.g., --generator)
• an abbreviated form of the name (e.g., -g)
• the action to perform if the parameter is given (e.g., 'store' for storing the associated
value, 'store_true' for simply associating a Boolean True value to the parameter)
• whether the parameter is required (required=True) or not required but has a
default value (e.g., default=-5.0)
• the type of the parameter value, in case it is not just a string (e.g., type=float),
and
• a help message describing the parameter

Note these are just some of the typical options that one might use, not a comprehensive
list of all options available in argparse. For comprehensive documentation please refer to
the official documentation at: https://docs.python.org/3/library/argparse.html.

Also note that by default a parameter --help (or -h) is automatically created, which will
print on the screen all the available parameters and their corresponding help messages.

Once the parameters are declared, we simply call method parse_args to parse the
arguments used when calling the Python script. An object containing all the parameters
found along with their values will be returned. argparse will report an error and exit if unknown
parameters are found, if required arguments are not provided, or if invalid parameter values
are given (e.g., the parameter is supposed to be an integer, but a different kind of value is
provided). argparse can perform some additional checks depending on the restrictions
specified in the parameter declaration (refer to the official documentation for more
information). Additional and more involved checks are to be implemented by us (e.g.,
checking that the minimum value of x0 is less than its maximum value).
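For example, argparse can enforce a closed set of values by itself through the choices option. The following standalone sketch shows how the manual generator check above could be delegated to argparse (the generator names listed here are illustrative; the real code could pass data_generators.keys() instead):

import argparse

parser = argparse.ArgumentParser(description='Generate dataset')
# choices makes argparse reject any value outside the list and print the valid options itself
parser.add_argument('--generator', '-g', required=True,
                    choices=['chessboard', 'cluster', 'island', 'swirl'],
                    help='Data generator to use')
args = parser.parse_args(['--generator', 'island'])
print(args.generator)  # prints: island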

Finally, we need to declare in the setup.py script the system commands that will be
created upon installing the Docknet Python package:

entry_points={'console_scripts': [
'docknet_generate_data = docknet.generate_data:main',
'docknet_evaluate = docknet.evaluate:main',
'docknet_predict = docknet.predict:main',
'docknet_start = docknet.app:main',
'docknet_train = docknet.train:main'
]},

For instance, a system command docknet_generate_data will be available in a
terminal whenever we activate the Python virtual environment where we have installed the
Docknet package. Invoking this command will have the same effect as running the main
function of the docknet.generate_data script. As a convention, we add the prefix
"docknet_" to all our commands so we can easily get a list of all the Docknet commands
available in a terminal: we can start by typing "docknet" in the terminal then press the tab
key twice, and the OS autocompletion function will list all the available commands that start
with "docknet".

18 Resource files
By default, the package build process only includes Python script files inside the src folder,
even if we add a resources folder inside src and place some resource files there (e.g.,
precomputed pickle models and configuration files). Files other than Python scripts must be
listed in setup.py so that they are also included in the package:

include_package_data=False,  # otherwise package_data is not used
package_data={
    PKGNAME: [
        'resources/config.yaml',
        'resources/chessboard.pkl',
        'resources/cluster.pkl',
        'resources/island.pkl',
        'resources/swirl.pkl'
    ]
},

Upon installing the Python package, the package file will simply be unzipped inside the lib
folder of the active Python virtual environment. For instance, after running the build script
you can see the Docknet resource files at:

$HOME/docknet_venv/lib/python3.9/site-packages/docknet/resources

Given a Python script in folder src/docknet, to access a resource file one can simply build
the path of the resource file relative to the script that is running, such as
src/docknet/resources/chessboard.pkl, and then open it as a standard file with
Python's open. The path can be created as follows:

resources_dir = os.path.join(os.path.dirname(__file__), 'resources')

chessboard_model_pathname = os.path.join(resources_dir, 'chessboard.pkl')

Note we may have 2 different situations here:

1. We are running the Docknet code that we have previously installed in the Python
virtual environment
2. We are running the code directly from the copy we downloaded from the project repo,
without installing the package (running from the source code).

Depending on the case, the resource file to open will be in a different folder (the lib folder
of the Python virtual environment or the folder where we cloned the repository). Moreover,
different developers will have these files in different folders, since each one will have a
different user folder. For this reason, we build the resource paths relative to the running script
rather than hardcoding absolute paths.
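For instance, a minimal sketch of loading one of the pickled models listed above could look as follows (this assumes the snippet lives in a module under src/docknet, next to the resources folder):

import os
import pickle

# Build the path relative to the running module so it resolves correctly both when running
# from the source tree and when running the package installed in a virtual environment
resources_dir = os.path.join(os.path.dirname(__file__), 'resources')
chessboard_model_pathname = os.path.join(resources_dir, 'chessboard.pkl')

with open(chessboard_model_pathname, 'rb') as fp:
    chessboard_model = pickle.load(fp)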

19 Configuration files
Configuration files are typically used to store in one place all the application parameters (e.g.,
default hyperparameters used to train a model, or parameters of each component of a
processing pipeline, model to use for a given component, etc.). In the Docknet library we
simply store the configuration of the Docknet web service (explained in section 20). The
typical format used for configuration files is YAML. This format can be seen as a less verbose
version of JSON:

app:
debug: True
host: 0.0.0.0
port: 8080

For loading the configuration file, one can compute the path to the file as for any resource
file, then use the PyYAML library to load it as a Python dictionary. The same configuration
dictionary is to be used across all the objects that form the application, so it would be a waste
of resources to load the configuration several times. The Docknet project includes a utilities
file that ensures the configuration is loaded only once, upon the first instantiation of a Config
object:

config = Config()

Further instantiations of the Config class will keep returning the same configuration
object.
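As a hedged sketch, such a load-once class could be written along the following lines, assuming PyYAML is available and config.yaml sits in a resources folder next to the module (the internals shown here are an assumption for illustration, not necessarily the actual Docknet code):

import os
import yaml


class Config:
    """Load the YAML configuration once and expose its top-level sections as attributes."""

    _config = None  # class-level cache shared by all instances

    def __init__(self):
        if Config._config is None:
            config_path = os.path.join(os.path.dirname(__file__), 'resources', 'config.yaml')
            with open(config_path, 'rt', encoding='UTF-8') as fp:
                Config._config = yaml.safe_load(fp)

    def __getattr__(self, name):
        # Only called for attributes not found on the instance: look them up in the YAML dictionary
        try:
            return Config._config[name]
        except KeyError:
            raise AttributeError(name)

With such a class, config = Config() loads the file only on the first call, and config.app returns the app section of the YAML file as a plain dictionary, which is exactly what app.run(**config.app) needs below.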

Configuration of different application classes or methods can be organized in different
sections of the YAML file. We can then create an instance of a class or call a method by
passing an entire YAML section at once, as follows:

app.run(**config.app)

This particular example is used to start a web service with all the required configuration
parameters in the YAML app section. Note that for this to work the parameter names in the
config file section must match those of the Python method that is being invoked.

20 Web services in Python with Flask
We've already seen in section 17 how to add a command-line interface that wraps some
business logic implemented as a Python class or method. A web service is just another
way of packaging an application so that it can be accessed through a web interface (e.g., a
web browser). While one can also invoke commands on a remote machine through SSH, a
web interface is usually more convenient and user friendly. For the app to be accessible from
a web client, the app needs to implement a REST interface. These interfaces can be easily
implemented by using the Flask and Flask-RESTful libraries. The REST interface in the Docknet
library is defined in the app.py script:

import os

import numpy as np

from flask import Flask, jsonify, request, Response
from flask_restful import Api, Resource

from docknet.net import read_pickle
from docknet.util.config import Config

app = Flask(__name__)
api = Api(app)

resources_dir = os.path.join(os.path.dirname(__file__), 'resources')

chessboard_model_pathname = os.path.join(resources_dir, 'chessboard.pkl')
cluster_model_pathname = os.path.join(resources_dir, 'cluster.pkl')
island_model_pathname = os.path.join(resources_dir, 'island.pkl')
swirl_model_pathname = os.path.join(resources_dir, 'swirl.pkl')

config = Config()


class PredictionServer(Resource):
    """
    REST API layer on top of a Docknet model that returns predicted values, given an input
    vector defined by 2 parameters x0 and x1. This class is to be inherited by another class
    that provides the path to the Docknet pickle model file to use for making predictions.
    """
    def __init__(self, pkl_pathname: str):
        """
        Create a prediction server
        :param pkl_pathname: path to the Docknet pickle model file to load
        """
        super().__init__()
        self.docknet = read_pickle(pkl_pathname)

    def get(self) -> Response:
        """
        Return the predicted value for the point (x0, x1) given as URL query parameters
        :return: a JSON response with fields success and message
        """
        x0 = request.args.get('x0', type=float)
        x1 = request.args.get('x1', type=float)
        if x0 is None:
            status_code = 400
            success = False
            message = 'Missing mandatory argument x0'
        elif x1 is None:
            status_code = 400
            success = False
            message = 'Missing mandatory argument x1'
        else:
            success = True
            status_code = 200
            X = np.array([[x0], [x1]])
            Y = np.round(self.docknet.predict(X))
            message = int(Y[0, 0])
        response = jsonify(success=success, message=message)
        response.status_code = status_code
        return response


class ChessboardPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(chessboard_model_pathname)


class ClusterPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(cluster_model_pathname)


class IslandPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(island_model_pathname)


class SwirlPredictionServer(PredictionServer):
    def __init__(self):
        super().__init__(swirl_model_pathname)


# Add the prediction servers for each one of the 4 models chessboard, cluster, island and swirl
api.add_resource(ChessboardPredictionServer, '/chessboard_prediction')
api.add_resource(ClusterPredictionServer, '/cluster_prediction')
api.add_resource(IslandPredictionServer, '/island_prediction')
api.add_resource(SwirlPredictionServer, '/swirl_prediction')


def main():
    # Start the service; the service stops when the process is killed or the Docker container
    # running this service is shut down
    app.run(**config.app)


if __name__ == '__main__':
    main()

We have defined a generic PredictionServer, which loads the specified Docknet model
upon instantiation. The server implements the REST get interface in order to compute a
prediction for a given point specified in the request URL by means of parameters x0 and x1.

Internally, the server simply parses the parameters, verifies that they are correct, uses the
predict method in order to compute a result, generates a response in JSON format, and
returns it. In case of error, an error message is returned, instead of a prediction result.

Servers ChessboardPredictionServer, ClusterPredictionServer,
IslandPredictionServer and SwirlPredictionServer simply extend the
generic PredictionServer by specifying a model pickle to use, all from the Docknet
package resources. Then we use method add_resource in order to register each
prediction server and map it to a unique path within the server's URL
(chessboard_prediction, cluster_prediction, island_prediction and
swirl_prediction, respectively).

In order to start the web server, we just need to activate the Python virtual environment
where the Docknet library is installed then invoke command docknet_start from the
terminal. Once the server is initialized, we can send prediction requests by entering any of
the following URLs in a web browser:

http://localhost:8080/chessboard_prediction?x0=2&x1=2
http://localhost:8080/cluster_prediction?x0=2&x1=2
http://localhost:8080/island_prediction?x0=2&x1=2
http://localhost:8080/swirl_prediction?x0=2&x1=2

Note these URLs are valid for our local machine, and that the specified port is 8080, the same
as the one given in the config file. Note as well that the URLs above specify parameters x0
and x1 both with value 2; one can modify these values at will. Finally, one can also use
the curl command instead of a web browser in order to get the response in the command line,
for instance:

curl "http://localhost:8080/chessboard_prediction?x0=2&x1=2"

Note that whether we use a web browser or curl we obtain a message as follows:

{
"message": 1,
"success": true
}

If a prediction could be computed, field success is true, and the message contains the
prediction. If an error happened, then field success is false and field message contains the
error message. For instance, the following URL…

http://localhost:8080/chessboard_prediction?x0=2

… results in message

{
"message": "Missing mandatory argument x1",
"success": false
}
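Equivalently, and purely as an illustration, one can query the running service from Python with the requests library (assuming it is installed; it is not necessarily a dependency of the Docknet package):

import requests

# Query the chessboard prediction endpoint for the point (2, 2)
response = requests.get('http://localhost:8080/chessboard_prediction',
                        params={'x0': 2, 'x1': 2})
body = response.json()
if body['success']:
    print(f'Predicted class: {body["message"]}')
else:
    print(f'Error: {body["message"]}')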

To stop the web service, simply press Ctrl + C in the terminal where you started the service,
or close that terminal, or kill that process.

Finally, the web app can be installed in a server and consumed remotely provided that the
server has port 8080 open and either an IP address or domain name accessible from our
location, either within the same LAN or from the Internet. However, configuring a server and
securing it is beyond the scope of this guide.

21 Docker
Docker images can be used to easily replicate a specific machine in which some software is to
be run, so we can deploy one or more instances of our application without having to configure
the machine or install the required dependencies for each app instance we want to run. Docker
images are somewhat similar to virtual machines, though they are more efficient: while virtual
machines are full copies of machines with the corresponding virtual hardware and virtual
operating system, Docker images reuse the Linux kernel of the physical machine where they
are running. Indeed, Docker cannot be run natively on non-Linux platforms such as Windows
or macOS.

21.1 Installing Docker

On Windows and macOS we need to install Docker Desktop, which contains everything that
is needed in order to create and run Docker images in those machines. The official installation
instructions and download link can be found here:

• macOS: https://docs.docker.com/docker-for-mac/install/
• Windows: https://docs.docker.com/docker-for-windows/install/

Note since February 2022 Docker Desktop requires a paid subscription for companies over a
certain size so to use it in Accenture a WBS is required. A free alternative using some packages
available with Homebrew can be found here:

https://dhwaneetbhatt.com/blog/run-docker-without-docker-desktop-on-macos

On Ubuntu we just need to install the Docker Engine, which is free to use. The official
instructions can be found here:

https://docs.docker.com/engine/install/ubuntu/

21.2 The Dockerfile

In order to define a Docker image, we create a file named Dockerfile where we list the
sequence of commands required to set up the corresponding machine, using the
Dockerfile notation. For instance, the Docknet project includes the following
Dockerfile in the root folder:

FROM ubuntu:22.04

LABEL docknet.docker.version="1"

# System update
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get dist-upgrade -y

# Set locale
RUN apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

# Python and common tools
# software-properties-common provides the add-apt-repository command used below
RUN apt-get install -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get install -y python3.9 python3.9-dev python3.9-venv

# Create Docker user
RUN useradd -ms /bin/bash docker

# Copy the Docknet repo into the Docker container
ADD . /home/docker/docknet
# Make the Docker user the docknet folder owner
RUN chown -R docker:docker /home/docker/docknet

# Run build script as the docker user to install the package and run the tests
USER docker
WORKDIR /home/docker/docknet
RUN delivery/scripts/build.sh

CMD . /home/docker/docknet_venv/bin/activate && docknet_start

The Dockerfile command FROM indicates which Docker image to use as the base for this
Docker image. This is a mechanism similar to class inheritance, where a Docker image inherits
the result of running the parent Dockerfile commands, then adds additional commands
afterwards. We inherit here the base Ubuntu 22.04 Docker image in order to run the project
in this system:

FROM ubuntu:22.04

Note that Ubuntu 22.04 is not particularly lightweight, though we use it here for convenience
since installing software in Ubuntu is quite straightforward. For production environments,
Alpine Linux (https://alpinelinux.org/) is typically used instead, since it is an extremely
lightweight distribution specifically developed to be run inside Docker containers. However,
building such images usually takes more time and effort since more software components
need to be installed.

The Dockerfile command LABEL is used to add arbitrary label/value pairs to a Docker image.
We define here a label docknet.docker.version with the Dockerfile version number:

LABEL docknet.docker.version="1"

Defining this label right at the beginning of the Dockerfile has a specific purpose, due to the
Docker cache system. When we run a Dockerfile in order to build a Docker image, the result
of each Dockerfile instruction is saved in a cache. Trying to rebuild the Dockerfile without
modifying any line has no effect, since the Docker build system reuses the results stored in
the cache. When we modify one line of the Dockerfile, all the results computed for the lines
before the modified one are retrieved from the cache, while the modified line and the lines
after it are re-run and their results stored in the cache. If we want to re-run the entire
Dockerfile without modifying its commands, we can update the version value in the label,
which has no effect on the result apart from re-running all the commands. This is particularly
useful when we simply want to update the Docker image with the latest versions of the
Ubuntu packages.

In the Dockerfile, we use the command RUN to run arbitrary commands in the selected OS,
and ENV in order to define environment variables. Note that building a Docker image is
equivalent to starting a brand new machine with a fresh installation of the selected OS,
then running some commands to configure the machine and install the needed software.
The first, second and third blocks after the label are used to update the list of available Ubuntu
packages, install the system locale, and install Python, as explained in section 3.1:

# System update
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get dist-upgrade -y

# Set locale
RUN apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

# Python and common tools
# software-properties-common provides the add-apt-repository command used below
RUN apt-get install -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get install -y python3.9 python3.9-dev python3.9-venv

The Dockerfile build process runs using the root user by default, so we have administrative
permissions to do any modifications needed. Since we do not require administrative
permissions to run our Python project tests, we add a standard user (without administrative
privileges) to test the project:

# Create Docker user
RUN useradd -ms /bin/bash docker

This is not strictly needed, but it is good practice: do not run a command with administrative
privileges unless you really need it.

Next, we add the source code inside the Docker image using the Dockerfile command ADD:

# Copy the Docknet repo into the Docker container
ADD . /home/docker/docknet

Note the Docker image contains its own virtual file system, hence any additional file we want
to include has to explicitly be available inside the container. The command ADD can be used
to copy some file or folder from the host machine to the Docker container. Since the
Dockerfile is in the root folder of the project, the entire project folder will be copied inside
the Docker image in folder /home/docker/docknet. It would also be possible to run
command git clone in order to obtain the code directly from the repository, provided we
had previously run the commands to install and configure a Git client in the Dockerfile.
However, note that in this case the process of building the Docker image would require access
to the Git repository, which is restricted to 2 options, neither of which is viable:

1. Entering a password, which we cannot do when building the Docker image since it is
a fully automated process (we will not see the prompt asking for the password)
2. Using an SSH key; however, SSH access has been cut in The Dock for security
reasons.

Once the project code is copied in the image, we set user docker as the owner of the
corresponding folder:

# Make the Docker user the Docknet folder owner
RUN chown -R docker:docker /home/docker/docknet

This is needed since by default the owner is user root, which would prevent user docker
from building and testing the project.

Now that everything has been set up for building the project, we change the current active
user to user docker, using the Dockerfile command USER, and set the current active
directory to the project folder, using the command WORKDIR:

USER docker
WORKDIR /home/docker/docknet

Note that inside a Dockerfile the command cd has no persistent effect across instructions;
the command WORKDIR is to be used instead.

Finally, we run the build.sh script (see section 8) as we do in our local machine, so the
build and test process is the same in either case:

RUN delivery/scripts/build.sh

A last Dockerfile line uses command CMD in order to indicate what will happen by default
when we run the Docker image:

CMD . /home/docker/docknet_venv/bin/activate && docknet_start

This has no effect during the build process, only when the Docker image is run without
specifying which command to execute once the image is started. In the example, we activate
the Python virtual environment created by the build.sh script inside the Docker image,
then run the docknet_start command in order to start the web service (see section 20).

This Dockerfile is enough for installing, testing and running Python projects in a container as
a web service, though Docker supports many other features. For a comprehensive Dockerfile
reference, visit:

https://docs.docker.com/engine/reference/builder/

It’s worth mentioning a more sophisticated way to build smaller Docker images called multi-
staging:

https://docs.docker.com/develop/develop-images/multistage-build/

Note every time a command is run in a Dockerfile, the result is saved in the image. However,
it is not uncommon to have to install software that is needed for compiling or testing our
project, but not for running it. This software will unnecessarily take up space in our Docker
image. Multi-staging allows us to build different Docker images, one per stage, where one
stage image can inherit whatever is needed from the other stage images. This way we can
inherit, for instance, just the compilation result and hence get rid of the compilation tools or
the Git client that could have been used to clone the project.

21.3 Building, running and deleting a Docker image


We have provided Bash scripts to facilitate building, running and deleting a Docknet Docker
image. These can be found in folder delivery/scripts, namely:

• docker_build.sh
• docker_run.sh
• docker_rm.sh

All these scripts first load script docker_config.sh where we centralize the parameters
common to all the scripts.

Internally, the scripts just use a few Docker commands in the command line. To build a Docker
image we can go to the folder containing the Dockerfile and run the following command:

docker build -t TAG .

where TAG is a name we give the image to later easily refer to it (e.g., docknet). The image is
built and stored in the Docker system in our own machine. We do not need to manage any
image file; it is handled by our Docker installation. To list the available images, we run the
command:

docker images

To run the Docker image so it starts the web server, we run anywhere the following
command:

docker run -p 8080:8080 -it TAG

The -p parameter is used to map port 8080 inside the container to port 8080 of the physical
machine. This is needed to be able to access the web service inside the container. The -it
parameter indicates the container is to be run in interactive mode. This allows us to stop the
container by pressing Ctrl + C; otherwise the container will run indefinitely until we either
close the terminal or kill the corresponding process.

Finally, we can delete a Docker image, provided that it is not running, as follows:

docker rmi -f image_id

Note the image_id is not the same as the tag; it is an alphanumeric code that can be obtained
with docker images. The -f parameter serves to force the deletion of the image, even in
case there is some child image of this one.

In script docker_rm.sh we use the following command to locate all images defining a label
docknet.docker.version and delete them:

docker rmi -f "$(docker images -f "label=$LABEL" -q)"

21.4 Docker tutorial


In this guide we have just listed a minimal set of features that are useful for our daily work.
However, Docker includes many other features, such as publishing our own images to ease
their reuse, or running swarms of Docker containers in order to take advantage of cloud
computing, automatically launching more instances of a Docker image or stopping them to
accommodate a variable demand. We recommend reading the official Docker overview:

https://docs.docker.com/get-started/overview/

to get a better understanding of what Docker is, and following the Docker getting-started
tutorial to get more familiar with Docker's capabilities:

https://docs.docker.com/get-started/

22 Continuous integration
In this section we explain how to quickly implement a continuous integration pipeline in Azure
DevOps based on the Docker container seen in the previous section. A continuous integration
pipeline is just an automated process that is triggered whenever we push some changes to a
Git repository, and which verifies that the new version of the code is stable by running again
all the tests. Even though we should run ourselves the tests in our machines before uploading
code, it may happen that while the tests pass in our machine, they fail somewhere else.
Typical reasons for this to happen are:

• We committed and pushed the code but forgot to add first some new data or source
code file; the tests run in our machine, but not in someone else’s machine since they
do not have all the required files.
• While implementing a new feature we needed a new library that we manually
installed in our Python virtual environment by running pip in a terminal, but we
forgot to add the library in the requirements.txt file. Therefore, when someone
else tries to run the tests, even after a clean project rebuild, the tests of the new
feature fail due to missing dependencies.
• We needed to install some native tool or library, or needed to change some system
configuration (e.g., add a system variable), but we didn't update the build script
and/or the Dockerfile. Note this may not prevent the error from happening in the
development machines of the rest of the team, since they may need to manually
install or configure their machines, but at least they can use the Dockerfile as a
reference guide on how to install and configure their own machines for the project to
run. This is particularly important in the event of team members rolling off projects.

Verifying that the project is stable every time we have new changes to push to the repository
is time consuming. Note a full verification would consist of:

• Reinstalling and configuring a fresh machine
• Installing the project dependencies and the project itself
• Running all the unit tests

In order to remove this repetitive work, we can use a continuous integration system that will
do this for us, every time we create a new candidate version of the code (a new pull request
to be merged). Usually, continuous integration systems and software repositories are tightly
integrated, so that the continuous integration system checks the project each time a new pull
request is created and each time the pull request is updated, and the repository does not
allow for merging the pull request until the continuous integration system validates it. In case
of error, a notification can be automatically sent for the corresponding developer to check
the error and fix it. This way we can reduce the probability of having an unstable version of
the code, and hence potentially blocking someone else's work or, even worse, not realizing the
code is broken until we try to run a demo in front of a client.

With Azure DevOps we can simply create a pipeline that builds the project’s Docker image.
Remember that the image we used in the Docknet project (see section 21) simply runs the
build script in order to install the project and run the tests. Hence if any test fails, the image
build will fail, and Azure DevOps will report the error.

22.1 Creating an Azure DevOps pipeline


To create a new pipeline in Azure DevOps, we go to the project web page and click on the
Pipelines icon, then click on the "Create Pipeline" button. A wizard starts, guiding us
through the different steps, namely:
1. Indicate where the source code is; since we have the project repo in Azure DevOps,
we select Azure Repos Git.
2. Select the Git repo from the list; note that one project may have more than one repo,
hence we have to specify the repo that will trigger this pipeline.
3. Select which kind of pipeline to create, in our case the pipeline for building Docker
images. We will be asked to select the Dockerfile to build, which is automatically
detected by Azure (simply accept the detected one).
4. A definition of the pipeline will be presented, allowing for any custom modifications.
Click on the button “Save and run”, then select “Commit directly to the master
branch” and click again on “Save and run”. This will add to the repository a new file
azure-pipelines.yml which contains the pipeline definition, and the pipeline
will be run for the first time.

Once finished, one can then pull the new code version to get the azure-pipelines.yml
file and, potentially, edit it and commit a new version. Each time the pipeline is run, an e-mail
is sent indicating whether the process succeeded or failed. The email contains a button “View
results” we can click on in order to open a web page with the process report. In this page we
can see the terminal output messages, which can be useful to quickly determine what went
wrong.

While Azure offers a specific pipeline for building Python packages and running tests, we have
used here the pipeline for building a Docker image so that we control the environment in
which the project is tested. The Python build pipeline uses an Azure virtual machine, which we
can also tweak by modifying azure-pipelines.yml; however, that virtual machine
solely runs in Azure. The Docker image can be built in Azure, on our computers, on Amazon
instances, and on many other systems. Hence, we can test the project inside a Docker
container on any computer and still obtain the same result, since with the Docker image we
control the environment in which the project is tested.

22.2 Other CI uses and systems


Continuous integration pipelines can be used for automating many other tasks. For instance,
one could be refining some image classifier and already have a dataset somewhere with which
to train the classifier. Upon each new version of the code, a continuous integration pipeline
could train a new model and publish it in an Amazon bucket, keeping all the different versions
of the model for each version of the code. Moreover, it could also generate a report on the
accuracy of each model and upload it to the bucket as well so we could trace the variations
in the accuracy with respect to the different versions of the code.

Apart from Azure DevOps, Jenkins is a popular open-source alternative for implementing
continuous integration pipelines. Jenkins is compatible with Linux, macOS and Windows.
More info on Jenkins can be found here:

https://www.jenkins.io/

Finally, for open-source projects (e.g., published in GitHub) one can use Travis CI for free.
Travis CI is another continuous integration service that provides virtual machines for running
processes in Linux, macOS and Windows machines. Note Docker containers can simulate
different versions of Linux distributions, but one cannot run macOS or Windows in a Docker
container. For testing software in Windows or macOS one can use the corresponding Travis
CI virtual images. For specific Linux distributions one can always select some Linux virtual
machine, then run inside a Docker image so we can still easily replicate the same result in our
own computer. More information on Travis CI can be found here:

https://travis-ci.org/

23 Proposed challenges
Here we present 6 challenges that you can solve as a team, with each team member working
on a separate challenge at the same time. The idea is to practice the contents of this guide,
namely:

• Creating a branch in which to develop the new feature
• Creating a test for the new feature that will be used both for debugging the new code
and for validating it before merging it
• Merging the new branch to master and deleting the branch

23.1 Challenge 1: New data generator


In section 13.3 we presented 4 different data generators that are later used in the Jupyter
notebooks as examples of problems that can be solved with a Docknet. Can you think of
another data generator that could put a Docknet to the test? Implement it as another derived
class of DataGenerator (folder src/docknet/data_generator).

For implementing the test, you may copy a test of a previously implemented data generator
(folder test/unit/docknet/data_generator) and modify it to use the new data
generator. By debugging the test, you can see the corresponding scatterplot without having
to use a Jupyter notebook, while being able to debug the code. Verify that the test works
before merging the Git branch with master.

Make a copy of one of the Jupyter notebooks in the exploration folder, modify it to use
the new data generator, and then tune the Docknet hyperparameters to try to properly
classify the new test set.

23.2 Challenge 2: New activation function


In section 13.4 we presented the activation functions, namely sigmoid, tanh and ReLU
(implemented in script src/docknet/function/activation_function.py).
Extend the Docknet library with an implementation of another activation function, for
instance the Leaky ReLU. The Leaky ReLU is almost the same as the ReLU function except that
it returns a small fraction of the input (e.g., 0.01 * x) for negative values of x, instead of 0.
The derivative of the Leaky ReLU is 1 for x equal to or greater than 0, and the corresponding
fraction (0.01 in the example) for negative values.
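As a starting point, a minimal vectorized sketch of the Leaky ReLU and its derivative could look as follows (the function names and the 0.01 slope are illustrative; the actual implementation should follow the conventions of activation_function.py):

import numpy as np


def leaky_relu(z: np.ndarray, slope: float = 0.01) -> np.ndarray:
    # z for non-negative inputs, slope * z for negative inputs
    return np.where(z >= 0, z, slope * z)


def leaky_relu_prime(z: np.ndarray, slope: float = 0.01) -> np.ndarray:
    # Derivative: 1 for z >= 0, slope otherwise
    return np.where(z >= 0, 1.0, slope)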

For implementing the test, you may copy the test for the ReLU activation function and the
one for its derivative (file
test/unit/docknet/function/activation_function.py) and update them
accordingly.

Implement a Jupyter notebook in the exploration folder that presents the different
activation functions and compares them (e.g., create several Docknets with the same
structure but different activation functions in the hidden layers, then train them and predict
with the same datasets).

23.3 Challenge 3: Cross-entropy for multi-class classification


In section 13.5 we presented the cross-entropy cost function implemented in the Docknet
library. For the moment the library only allows for binary classification. A first step for
supporting multi-class classification would be to implement a multi-class version of the cross-
entropy function so that it accepts labels of more than one class. Note that the Y vectors for
binary classification have a shape (1, m), where m is the number of input examples. For the
case of n-class classification, the Y vectors are of shape (n, m).
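As a hint, a minimal vectorized sketch of such a multi-class cross-entropy could look as follows (the function name and signature are illustrative, not the Docknet API):

import numpy as np


def categorical_cross_entropy(Y_hat: np.ndarray, Y: np.ndarray) -> float:
    # Y and Y_hat are both of shape (n, m): n classes, m examples.
    # Sum -y * log(y_hat) over classes and examples, averaged over the m examples.
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(Y_hat)) / m)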

Suggestion: make a copy of the cross-entropy function and its derivative (script
src/docknet/function/cost_function.py), and of the corresponding tests
(script test/unit/docknet/function/test_cost_function.py), and modify
them accordingly.

23.4 Challenge 4: Xavier’s initializer
In section 13.6 we presented a random normal initializer of the network parameters
(implemented in script
src/docknet/initializer/random_normal_initializer.py). Xavier (also
called Glorot) initialization is another popular initializer. Can you implement it? It is similar to
the random initializer in that it also draws values from a normal distribution, but the mean is
set to 0 and the standard deviation depends on the number of neurons of the layer. Note you
can get the number of neurons of an AbstractLayer with the getter method dimension.
Here is the official paper where Xavier's initialization is described:

http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

Suggestion: make a copy of the random normal initializer and its test (script
test/unit/docknet/initializer/test_random_normal_initializer.py)
and update them accordingly.

Implement a Jupyter notebook in the exploration folder to compare both initializers.
Which one works better?

23.5 Challenge 5: Dropout layer


In section 13.7 we presented the Docknet layers, namely input and dense (implemented in
folder src/docknet/layer). Implement a dropout layer that can be used for
regularization. The dropout layer randomly masks a percentage of the outputs of the previous
layer. This percentage is given as a hyperparameter of the layer (it is not a parameter to be
optimized by the optimizers, but just to be set once when initializing the layer). During
forward propagation, the layer selects a percentage of the inputs and rewrites them with
zeroes. Since the layer has no parameters, it has no parameter gradients to return during
backward propagation (it is to return an empty dictionary of gradients). However, it is to
return the gradient of the cost function w.r.t. the previous layer activation. For the neurons
that were not masked, it simply propagates backwards the gradient returned by the next
layer. For the neurons that it zeroed, it returns zero since the derivative of a constant is zero.
This means that the dropout layer will have to cache during forward propagation which
neurons it zeroed in order to know which gradients to propagate backwards, and which ones
to zero.
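As a hint, the masking logic described above can be sketched with the following standalone helper functions (for illustration only; the actual implementation must follow the AbstractLayer interface and the Docknet cache conventions):

import numpy as np


def dropout_forward(A_prev: np.ndarray, rate: float):
    # Randomly select a fraction `rate` of the previous layer outputs and zero them;
    # the mask must be cached so backward propagation zeroes the same positions
    mask = np.random.rand(*A_prev.shape) >= rate
    return A_prev * mask, mask


def dropout_backward(dA: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Propagate gradients only through the neurons that were not zeroed during forward propagation
    return dA * mask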

Suggestion: copy the implementation of the input layer and of its unit test (folder
test/unit/docknet/layer/test_input_layer.py) and update accordingly.
Note that for a unit test you just need to test the dropout layer, not an entire network with
dropout layers (as we did for exhaustively testing the predict and train methods, hardcoding
a forward and backward propagation of a dummy network in file
test/unit/docknet/dummy_docknet.py). For testing forward propagation it
suffices to make a fake input for the dropout layer and verify the forward propagation method
zeroes the corresponding neurons (use np.random.seed to always mask the same
neurons so the test can be repeated). For the backward propagation you will have to make
another fake input, write to the layer cache a fake mask that was presumably applied during
forward propagation, and verify that the corresponding gradients are zeroed.

Make a Jupyter notebook in the exploration folder comparing a Docknet with and
without dropout layers after each hidden layer.

23.6 Challenge 6: Momentum and RMSProp optimizers


In section 13.8 we presented the Gradient Descent and Adam optimizers (implemented in
folder src/docknet/optimizer). Adam is in fact a combination of two other optimizers:
Momentum and RMSProp. The first one applies the v’s for minimizing oscillations, and the
second applies the s’s for increasing the velocity towards the systematic direction.
Suggestion: copy/paste the Adam optimizer, remove the unneeded code and modify the
formula that updates the parameters as follows:
• Momentum: p = p - learning_rate * v
• RMSProp: p = p - learning_rate * gradient_of_p / sqrt(s)

The bias corrections of v and s are no longer needed so you can delete them.

Make 2 copies of the Adam test (script
test/unit/docknet/optimizer/test_adam_optimizer.py) and update them
for testing the Momentum and RMSProp optimizers.

Implement a Jupyter notebook in the exploration folder in order to compare all 4 optimizers.
Which one manages to converge faster? Which one is the slowest?

