
3 Python Tools Data Scientists Can Use for Production-Quality Code


Just because you’re a data scientist doesn’t mean you shouldn’t write good code

Genevieve Hayes
Sep 7 · 7 min read

My first experience with coding was using S-Plus (a forerunner to R) as an undergraduate statistics student. Our lecturer, a professor with decades of experience, taught us how to fit regression models by literally typing our code one line at a time into the S-Plus console.

If you wanted to be able to re-run your code at a future point in time, you could save it in a text file and then cut and paste it into the console.

It wasn’t until years later, in my third job after graduation, and after completing a PhD in Statistics, that I first discovered programming scripts, and what good code actually looks like.

Looking back, I’m astounded this situation ever occurred.


However, from talking to my data scientist friends, particularly
those who don’t come from a software developer background, it
appears my situation is not unique.

It is an unfortunate fact that many data scientists do not know how to write production-quality code.

Production-quality code is code that is:

- Readable;
- Free from errors;
- Robust to exceptions;
- Efficient;
- Well documented; and
- Reproducible.

Producing it is not rocket science.

Anyone smart enough to be able to understand neural networks and support vector machines (i.e. most data scientists) is certainly capable of learning good coding practices.

The problem is that most data scientists don’t even realize that
writing production-quality code is something they can and should
learn.

How to Write Production-Quality Code

In my article 12 Steps to Production-Quality Data Science Code, I outline, in detail, a simple process data scientists can follow to make their code production ready.

In summary, the steps are:


1. Determine what you’re trying to achieve;
2. Build a minimum viable product;
3. Reduce repetition using the DRY principle;
4. Create and run unit tests;
5. Deal with exceptions;
6. Maximize time and space efficiency;
7. Make variable and function names meaningful;
8. Check your code against a style guide;
9. Ensure reproducibility;
10. Add comments and documentation;
11. Ask for a code review; and
12. Deploy.

For many of these steps, there are no real shortcuts to be taken. The only way to build a minimum viable product, for example, is to roll up your sleeves and start coding. However, in a few cases, tools exist to automate tedious manual processes and make your life much easier.

In Python, this is the situation for steps 4, 8 and 10, thanks to the
unittest, flake8 and sphinx packages.
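
Of these, unittest ships with the Python standard library, while flake8 and sphinx can be installed from PyPI, for example with:

pip install flake8 sphinx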

Let’s look at each of these packages one by one.


Automate Your Error Checks with unittest
Unit tests are used to ensure that the functions that make up your
code are doing what they should be doing, under a range of
different circumstances.

If your code only contains a small number of relatively straightforward functions, then you probably only need a handful of unit tests, which you can feasibly run and check manually.

However, as your code increases in size and complexity, the number of unit tests you are going to need, in order to ensure broad coverage, is also going to increase, as will the risk of human error resulting from manual testing. This is where the unittest package comes in.

The unittest package is designed specifically for automating unit tests. To run your unit tests through the unittest package, you simply create a class, and write your unit tests as methods (i.e. functions) that sit within the class.

For example, consider a function for calculating the area of a circle, given its radius.
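
A minimal version of such a function (assuming it uses math.pi for the value of pi) might look like this:

import math


def circle_area(radius):
    """Calculate the area of a circle, given its radius."""
    return math.pi * radius ** 2
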
To check this function is working as expected, you could create
unit tests to ensure that the function produces the correct output
for two different radius values, say 2 and 0.

Using the unittest package, you can automate these checks as follows.
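
Here is a sketch of what that might look like (the class and test names match the description below, and the expected areas are based on the circle_area() function above):

import math
import unittest


class TestFunctions(unittest.TestCase):

    def test_circle_area1(self):
        # Radius 2 should give an area of 4*pi
        # (assumes circle_area() is defined earlier in the notebook or imported)
        self.assertAlmostEqual(circle_area(2), 4 * math.pi)

    def test_circle_area2(self):
        # Radius 0 should give an area of 0
        self.assertAlmostEqual(circle_area(0), 0)


# The final line runs the tests (this form also works inside a Jupyter notebook)
unittest.main(argv=[''], exit=False)
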
In this example, I have called my test class “TestFunctions”, but
you can call it anything, provided the class has unittest.TestCase as
its parent class.

Within this class, I have created two unit tests, one to test
circle_area() works for radius 2 (test_circle_area1) and one for
radius 0 (test_circle_area2). The names of these functions, again,
don’t matter, except that they must start with test_ and have the
parameter self.

The final line of the code runs the tests.

Assuming all tests pass, the output will look something like this, with a dot on the top line for each test that has passed:
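
..
----------------------------------------------------------------------
Ran 2 tests in 0.001s

OK

(The running time reported will, of course, vary.)
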
Alternatively, if one of your tests fails, then the top line of the output will include an “F” for each failed test and further output will be provided, giving details of the failures.

If you are writing your code using Python scripts (i.e. .py files),
ideally you should house your unit tests in a separate testing file, to
keep them apart from your main code. However, if you are using a
Jupyter notebook, you can just place the unit tests in the final cell
of the notebook.
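
If your tests do live in their own script, say test_circle.py (a hypothetical file name), you can run them all from the command prompt with:

python -m unittest test_circle.py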

Once you have created your unit tests and got them working, it is worthwhile re-running them whenever you make any (significant) changes to your code.

Check for PEP 8 Compliance with flake8

A coding style guide is a document that sets out all the coding conventions and best practices for a particular programming language. In Python, the go-to style guide is the PEP 8 — Style Guide for Python Code.

PEP 8 is a 27-page document, so ensuring your code is compliant with every single item can be a chore. Fortunately, there are tools to assist with this.

If you are writing your code as a Python script, the flake8 package
will check for PEP 8 compliance.

After installing this package, just navigate to the folder containing the code you want to check (filename.py) and run the following command at the command prompt:

flake8 filename.py

The output will tell you exactly where your code is non-compliant.

For example, when run over the Python script 20190818_Production_Examples.py, the output tells us that the script contains 6 instances of non-compliance. The first of these instances is in row 1, column 1, where the package ‘math’ is imported but unused.
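
That first warning looks something like this:

20190818_Production_Examples.py:1:1: F401 'math' imported but unused
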
In Jupyter Notebooks, several extensions exist to ensure PEP 8
compliance, including Jupyterlab-flake8 and jupyter-autopep8.

Create Professional-Looking Documentation with sphinx
Ever wondered how the creators of Python packages, such as
NumPy and scikit-learn, get their documentation to look so good?

The answer is sphinx, a Python package that can literally convert your docstrings into documentation in a variety of formats, including HTML, PDF and ePub.

It also integrates with GitHub and ReadtheDocs (a documentation hosting platform), such that your documentation is automatically rebuilt whenever any updates to your code are pushed to GitHub, ensuring your documentation always remains up to date.

I used sphinx when I wrote the Python package mlrose. For sphinx to work, the docstring at the top of each function has to be formatted in a very specific way, and it is these docstrings that sphinx converts into nicely formatted documentation.
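
As an illustration (this is not an extract from mlrose, just a NumPy-style docstring of the kind sphinx can render once its napoleon extension is enabled), here is how the circle_area() function from earlier might be documented:

import math


def circle_area(radius):
    """Calculate the area of a circle, given its radius.

    Parameters
    ----------
    radius : float
        Radius of the circle. Should be non-negative.

    Returns
    -------
    area : float
        Area of the circle.
    """
    return math.pi * radius ** 2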

Getting started with sphinx can be a little difficult, but an excellent tutorial on the topic can be found here.

Most data scientists don’t know how to produce production-quality code, but if you want to stand out from the crowd, then you shouldn’t be aiming to be like most data scientists, anyway.

By following a simple 12-step process and integrating a few straightforward tools into your workflow, it is possible to make dramatic improvements to the quality of the code you are producing.

You may never get your code to the point where no one is ever
going to complain about it, but at the very least, you won’t
embarrass yourself by trying to fit a regression model, one line at a
time, in the S-Plus console.

About the Author


I am a data scientist with over 15 years’ experience working in the
data industry. I have a PhD in Statistics from the Australian
National University and a Masters in Computer Science (Machine
Learning) from Georgia Tech.

Visit my website, Genevieve Hayes Data Science, where I share technical and career advice, targeted at current and aspiring data scientists. Alternatively, follow me on Twitter or connect with me on LinkedIn.
