
New to machine learning? Try to avoid these mistakes
The things I learned the hard way as a data scientist
Assaad MOAWAD
Jul 20 · 6 min read

Machine learning (ML) is one of the hottest fields in computer science.

Many people jump in with the false idea that it’s just about running 10 lines of Python code and expecting things to work by magic in any situation. This blog post is about all the things I learned the hard way. I hope it saves you some time by not falling for the same mistakes.
. . .

1. You shouldn’t believe it’s magic


Machine learning is like any scientific field: it has its own rules, logic, and limitations. Believing that it’s some sort of black magic doesn’t help you improve your ML skills. This belief works against the scientific curiosity needed to understand how each model or layer type works. Believing it’s magic is a lazy way to convince yourself that you don’t need to understand the mechanics running behind the scenes.

The only magical thing about ML is that there is no magic behind it. It’s based on pure logic, math, and of course some randomness and luck…
Machine learning is magical but NOT magic

2. Don’t start with real datasets


Starting with a real-world dataset, full of problems and noise, won’t help you in your quest for understanding. Instead, generate some ideal fake datasets. For example, start by creating a list of random x values and a list of y = 3x + 2, and test how a dense layer learns the weight 3 and the bias +2.
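A minimal sketch of that experiment in Keras (the sample count, optimizer, and epoch count here are arbitrary choices, not prescriptions):

```python
import numpy as np
import tensorflow as tf

# Ideal, noiseless fake dataset: y = 3x + 2
x = np.random.uniform(-1, 1, size=(1000, 1)).astype("float32")
y = 3 * x + 2

# A single dense unit has exactly one weight and one bias to recover
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=200, verbose=0)

w, b = model.layers[0].get_weights()
print(w, b)  # should converge close to 3 and 2
```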

This allows you to reverse-engineer the mechanics behind back-propagation, the different optimizers, and the different initialization methods, and to see whether the model converges fast enough. Reverse-engineering is a fun way to understand how things work without going into the complex mathematical details. (But if you can, why not!)

Whenever you try to solve a new type of problem, first think about whether there is a way to generate an ideal, noiseless, fake dataset, in order to check which types of layers or ML models solve the challenge most easily. If a model can’t solve this fake, easy challenge, there is no point in trying it on a harder, noisy dataset.

3. Don’t start with huge datasets


Throwing the full dataset into a model immediately and waiting hours for a first result is counterproductive. Instead, start with a small subset of the data and experiment with the different models first. Once you get an initial result, then you can go big-data.
Go step by step
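A rough sketch of this workflow, assuming a pandas DataFrame loaded from a hypothetical dataset.csv (the subset size is arbitrary):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical dataset file

# Prototype on a small, reproducible subset first
sample = df.sample(n=10_000, random_state=42)

# ... experiment with different models on `sample`,
# then re-train the best candidate on the full `df`.
```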

4. Visualize/Clean the data first


Understanding your dataset is a key ingredient of success, and cleaning data is one of the most important steps in ML. Finding problems in data collection, storage, and sampling rate is important. Visualizing the dataset can help identify many problems. Does the data contain lots of missing values? How should you replace them? Is the data sampled at the same rate? Do the features need normalization? Are the features independent? Do we need to run PCA first? Perhaps the most important principle in data science is the following:
Shit IN -> Shit OUT principle
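A quick first-pass inspection along those lines might look like this (pandas and matplotlib assumed; dataset.csv is a hypothetical file name):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")  # hypothetical dataset file

print(df.isna().sum())    # how many missing values per feature?
print(df.describe())      # ranges hint at whether normalization is needed
print(df.corr())          # strongly correlated features are PCA candidates
df.hist(figsize=(10, 8))  # quick look at each feature's distribution
plt.show()
```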

5. The number of features doesn’t matter, the number of dimensions does


A dataset with 100 features where all the features are linearly correlated can be reduced efficiently to only 1 feature with no loss of information. Understanding the difference between features and dimensions helps a lot in reducing the complexity of a dataset.
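A small demonstration of this point, using scikit-learn’s PCA on made-up data: 100 linearly correlated features collapse to a single dimension.

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 features that are all linear functions of one hidden variable
t = np.random.randn(500, 1)
X = t @ np.random.randn(1, 100)  # shape (500, 100), but intrinsically 1-D

pca = PCA(n_components=0.99)     # keep 99% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (500, 1): one dimension carries everything
```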

6. Be careful of data leakage or skewed data


Neural networks are famous for their ability to cheat and lazily learn things we don’t want them to learn. For instance, if there is a large imbalance between classes, let’s say 99.99% of mails are not spam and 0.01% are spam, then there is a high chance that the neural network will lazily learn to classify everything as non-spam. Another problem is data leakage:

Any feature whose value would not actually be available in practice at the
time you’d want to use the model to make a prediction, is a feature that can
introduce leakage to your model.

When the input data you are using to train a machine learning algorithm happens to have the information you are trying to predict. — Daniel Gutierrez, Ask a Data Scientist: Data Leakage
So make sure there is no way for the model to cheat by finding an easy but meaningless correlation between inputs and outputs.

Take care of leakage
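One common counter-measure to the skewed-classes problem (not the only one) is to weight the rare class more heavily in the loss. A sketch with Keras class weights on hypothetical made-up data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical, heavily imbalanced data: almost everything is class 0
x = np.random.randn(10_000, 20).astype("float32")
y = np.zeros(10_000, dtype="float32")
y[:10] = 1.0  # only a handful of "spam" examples

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Make mistakes on the rare class costly, so "always predict
# non-spam" stops being a cheap way to minimize the loss
model.fit(x, y, epochs=5, class_weight={0: 1.0, 1: 1000.0}, verbose=0)
```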

7. Iterate fast and fail faster


I was inspired to include this point after watching this video about the Marshmallow challenge:

Build a tower, build a team | Tom Wujec



The lesson: don’t assume that your initial idea for a model will work. You might be mistaken in your assumptions, and the faster you discover that, the faster you can try something else, without losing much time, missing deadlines, or paying hefty cloud-compute bills before discovering the truth.

Quickly getting a baseline model that works well enough is a key point; the remaining time can be spent improving the result and updating the record for the best model so far. The key is not to wait until the last moment to validate the result.

8. Be careful of over-fitting


It’s very tempting to be proud of a model with a loss of 0 in the training phase. However, it might not be the optimal model for the real world. This is the well-known problem of over-fitting.
Over-fitting illustrated
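A sketch of how to catch this: hold out a validation split and watch its loss, not the training loss. The toy data here is pure noise, so the deliberately oversized model can only "learn" by memorizing.

```python
import numpy as np
import tensorflow as tf

# Tiny toy dataset where the targets are pure noise
x = np.random.randn(200, 10).astype("float32")
y = np.random.randn(200, 1).astype("float32")

# Deliberately oversized model: enough capacity to memorize the data
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Training loss will look great; validation loss tells the truth
history = model.fit(x, y, epochs=200, validation_split=0.2, verbose=0)
print("train loss:", history.history["loss"][-1])
print("val loss:  ", history.history["val_loss"][-1])
```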

9. Don’t start with complex ML models


Don’t go directly for a bi-directional LSTM variational auto-encoder with an attention mechanism if your problem can already be solved with a single linear dense layer.

TensorFlow is powerful, but you might not be

It’s always better to start with simple models that you can understand and control first, then increase the complexity gradually to check whether the complexity brings any added value. Remember that, in the end, the more complex the model gets, the more data will be needed to truly validate that the model didn’t overfit.

A good rule of thumb is to keep track of the number of parameters in a model vs. the dataset size. If your model’s parameter count gets bigger than your dataset size, this should trigger an automatic warning!
Don’t go full complexity
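That warning is easy to automate. A sketch of the rule of thumb in Keras (the model and the dataset size below are made-up examples):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(10),
])

n_params = model.count_params()  # ~57k parameters for this example
n_samples = 5_000                # assumed dataset size
if n_params > n_samples:
    print(f"Warning: {n_params:,} parameters vs {n_samples:,} samples")
```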

10. Don’t hesitate to create a custom-made loss function


The objective of machine learning is to reduce a loss function. In many cases the default ones (RMSE, log loss) are enough. But many people are scared to define their own loss functions for specific problems.

Loss functions can be seen as the punishments or rewards you give the ML model. So if you want the ML model to converge to a specific behavior, you can create a loss function that rewards this behavior while punishing misbehavior, for your specific problem and specific dataset.
Machine learning is like a child — It needs guidance, through loss functions
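For instance, here is a hypothetical custom loss in Keras: a variant of MSE that punishes under-prediction twice as hard as over-prediction (useful when predicting too low is costlier than predicting too high). The weighting scheme is just an illustration, not from the article.

```python
import tensorflow as tf

def asymmetric_mse(y_true, y_pred):
    err = y_true - y_pred
    weight = 1.0 + tf.cast(err > 0, tf.float32)  # 2.0 when predicting too low
    return tf.reduce_mean(weight * tf.square(err))

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss=asymmetric_mse)  # plugs in like a built-in loss
```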

. . .

Closing points — Investments


The best investment for machine learning is in knowledge: the more you understand the abstract role of each layer, the mechanics behind each function, and the limitations, the easier it will be to create a model quickly.

The second-best investment is in dedicated hardware for machine learning. A $300 GPU can accelerate the training workload by up to 20x, meaning a one-day job can be reduced to about one hour, and the same data scientist can try 20x more models within the same time.

One last point:

Never, ever underestimate the importance of having fun.
. . .

If you enjoyed reading, follow us on: Facebook, Twitter, LinkedIn

Originally published at https://medium.com on July 19, 2019.

