Machine Learning Engineer Nanodegree: Capstone Proposal

Khalil Henchi
January 22nd, 2020
CNN Project: Dog Breed Classifier
A. Definition
1. Project Overview
Domain Background
Machine learning is one of the most prominent technology trends of the 21st century so far,
driven largely by the steadily increasing performance of computing hardware.
The use of machine learning in computer vision continues to fuel the curiosity of scientists
and engineers. Scientists have been trying to make machines extract meaningful information
from visual data for about 60 years. The breakthrough that brought computer vision back to
the surface as a hot topic came in 2012, when AlexNet won the ImageNet challenge.
One of the best-known computer vision tasks is image classification.

A breakthrough in building models for image classification came with the discovery that a
convolutional neural network (CNN) could be used to progressively extract higher- and
higher-level representations of the image content. Instead of preprocessing the data to derive
features like textures and shapes, a CNN takes the image’s raw pixel data as input and “learns”
how to extract these features, and ultimately infers what object they constitute.
Dog breed classification is a well-known challenge in the machine learning community. It is
also available on Kaggle [2].
Udacity offers this project in its list of possible capstone projects. I chose it because my
goal is to get a job as a computer vision engineer, so this project will be a valuable asset
on my CV.
Datasets and Inputs
The dataset for this project is provided by Udacity. It contains pictures of dogs and humans.
Each image is identified by a unique id.
There are 8351 dog images in total, split into three folders:
❏ Train : 6680 Images
❏ Test : 836 Images
❏ Valid : 835 Images
In each folder, images are grouped by dog breed. There are 133 dog breeds.
Human pictures are grouped by the person's name. There are 13,233 human pictures in total.
Analyzing the datasets shows that the pictures are taken from many different angles. Their
dimensions also differ, and some pictures contain more than one subject.
2. Problem Statement
The goal of this project is to detect whether a given photo contains a human, a dog, or
neither. If a dog is detected, we predict its breed. If a human is detected, we return the dog
breed that the person most resembles. If neither a human nor a dog is detected, we show an
error message.
Images are arbitrary: they have different sizes and are taken from different angles and at
different times of day.
To summarize, our goal is:

Given a picture of a dog, our model should determine the dog's breed among 133 breed classes.
If the input is a picture of a human, the code identifies the most resembling dog breed.
Otherwise, it displays an error.
We will use Convolutional Neural Networks (CNNs) to create our app.

We will use a pre-trained face detector from OpenCV to detect humans and a pre-trained VGG16
model to detect dogs. We will create our own CNN model to identify dog breeds using
transfer learning.
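The decision logic described above can be sketched as follows. This is only an illustration, not the notebook's actual code: the function name classify_image and the three callables (is_dog, is_human, predict_breed) are hypothetical placeholders standing in for the VGG16 dog detector, the OpenCV face detector and the breed classifier.

```python
from typing import Callable


def classify_image(
    image_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Return a message describing what was found in the image.

    The three callables are hypothetical stand-ins for the real detectors
    and the breed classifier; they are passed in so the logic stays testable.
    """
    if is_dog(image_path):
        return f"Dog detected. Predicted breed: {predict_breed(image_path)}"
    if is_human(image_path):
        return f"Human detected. Most resembling breed: {predict_breed(image_path)}"
    return "Error: neither a dog nor a human was detected."
```

Checking for a dog before checking for a human is a design choice of this sketch; the order matters only for pictures in which both detectors fire.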
3. Evaluation Metrics
Depending on the dataset, accuracy may not be a good metric for a classification problem. When
the classes are imbalanced, precision and recall can be better evaluation metrics. The F1 score
is another option, as it combines precision and recall.
The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and
recall. It is given by the following formula:
F1 = 2 × (precision × recall) / (precision + recall)
I checked the dog breed dataset and the classes (breeds) are relatively balanced, so a simple
accuracy score is considered representative in this project.
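As a minimal sketch of how these metrics relate, the snippet below computes accuracy and the per-class precision, recall and F1 score from two label lists. The function names and the toy labels are assumptions made for illustration; they do not come from the project notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with two breeds (hypothetical labels).
y_true = ["beagle", "beagle", "poodle", "poodle"]
y_pred = ["beagle", "poodle", "poodle", "poodle"]
print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="beagle"))
```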
B. Analysis
1. Data Exploration
The dataset for this project is provided by Udacity. It contains pictures of dogs and humans.
Each image is identified by a unique id.
There are 8351 dog images in total, split into three folders:
❏ Train : 6680 Images
❏ Test : 836 Images
❏ Valid : 835 Images
In each folder, images are grouped by dog breed. There are 133 dog breeds.
Human pictures are grouped by the person's name. There are 13,233 human pictures in total.
Analyzing the datasets shows that the pictures are taken from many different angles. Their
dimensions also differ, and some pictures contain more than one subject: more than one human,
more than one dog, or both a human and a dog.
2. Exploratory Visualization
The following image shows some samples from the datasets.
Data set samples
There are:
- significant variations in the shapes and sizes of the images
- significant variations in color intensity, as the images were taken from
different angles and at different times. The color images have 3 channels:
R (red), G (green), B (blue).
The training data provided for this competition is not distributed uniformly among the
different classes of dogs, as shown in the figure below. This might cause a problem: the
model may find it difficult to make a correct prediction for classes with fewer samples,
and may predict classes with a higher number of samples more often.
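One simple way to quantify how uneven the classes are is to count the samples per class and compare the largest class to the smallest. The sketch below assumes a plain list of breed labels; in practice the labels would probably be derived from the training folder names, which is an assumption about the dataset layout.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the largest-to-smallest class size ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical labels, for illustration only.
labels = ["beagle"] * 4 + ["poodle"] * 2 + ["boxer"] * 3
counts, ratio = class_distribution(labels)
print(counts.most_common())
print(f"largest / smallest class: {ratio:.1f}")
```

A ratio close to 1 indicates balanced classes; the larger the ratio, the more the model may favor the well-represented breeds.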
3. Algorithms and Techniques
- Deep learning:
Deep learning (also known as deep structured learning or hierarchical learning) is part of
a broader family of machine learning methods based on learning data representations, as
opposed to task-specific algorithms.
- Convolutional neural network:
A convolutional neural network (CNN or ConvNet) is a class of deep, feed-forward
artificial neural networks that has been applied successfully to analyzing visual imagery.
A CNN consists of an input and an output layer, as well as multiple hidden layers. The
hidden layers are convolutional, pooling, or fully connected. Given an input, a CNN learns
by itself which features it has to detect. We do not specify the initial values of the
features or the kinds of patterns to detect.
Various Layers:
● Convolutional - Also referred to as a Conv. layer, it forms the basis of the
CNN and performs the core computation of the network: it applies learned filters to the
input through the convolution operation.
● Pooling layers - Pooling layers reduce the spatial dimensions (width x height)
of the input volume for the next convolutional layer. They do not affect the depth
dimension of the volume.
● Fully connected layer - The fully connected or Dense layer is configured
exactly the way its name implies: it is fully connected to the output of the previous
layer. Fully connected layers are typically used in the last stages of the CNN to
connect to the output layer and produce the desired number of outputs.
● Dropout layer - Dropout is a regularization technique for reducing overfitting
in neural networks by preventing complex co-adaptations on training data. It is a very
efficient way of performing model averaging with neural networks. The term "dropout"
refers to dropping out units (both hidden and visible) in a neural network.
● Flatten - Flattens the output of the convolutional layers to feed it into the
Dense layers.
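To make the pooling and flatten steps concrete, here is a small pure-Python sketch of 2x2 max pooling with stride 2 on a single-channel feature map, followed by flattening. The function names and the toy feature map are illustrative assumptions; a real model would use a deep learning framework's layers instead.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def flatten(feature_map):
    """Turn a 2D feature map into a 1D list, row by row."""
    return [value for row in feature_map for value in row]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(feature_map)  # 4x4 input becomes 2x2
print(pooled)
print(flatten(pooled))
```

Pooling halves the width and the height while keeping the largest activation of each 2x2 window; flattening then produces the vector that a Dense layer expects.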
- Activation Functions:
In a CNN, the activation function of a node defines the output of that node given an
input or set of inputs.
Some activation functions are:
● The softmax function squashes the output of each unit to be between 0 and 1,
just like a sigmoid function. It also normalizes the outputs so that their total sum is
equal to 1.
● A ReLU (rectified linear unit) outputs 0 if the input is less than 0 and the
raw input otherwise; i.e., if the input is greater than 0, the output is equal to the
input.
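Both functions are short enough to write out directly. The following sketch uses only the standard library; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def relu(x):
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def softmax(scores):
    """Map raw scores to values in (0, 1) that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))
print(softmax([1.0, 2.0, 3.0]))
```

In a classifier such as the one in this project, softmax would be applied to the 133 output scores so that each value can be read as a breed probability.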
- Transfer Learning:
In transfer learning, we take the knowledge learned by one model and pass it to a new deep
learning model. We take a pre-trained neural network and adapt it to a new task with a
different dataset.
For this problem we use the “ResNet-101” neural network.
● ResNet-101 is a convolutional neural network that is trained on more than a million images
from the ImageNet database. The network is 101 layers deep and can classify images into 1000
object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network
has learned rich feature representations for a wide range of images. The network has an image
input size of 224-by-224.
4. Benchmark Model
For the benchmark model, we will use the algorithms outlined in the paper [3]. The paper
describes several network configurations with the following accuracies.
Method                                          Top-1 Accuracy (%)
Random Guessing                                 0.83
LeNet: (CRP-50 x 3)+(FC-1000)+(FC-120)          <2
LeNet: (CRP-500 x 5)+(FC-1000)+(FC-120)         <2
GoogLeNet: 3 layers, 16 filters per CONV        <2
GoogLeNet: 4 layers, 16 filters per CONV        3.4
GoogLeNet: 7 layers, filter 16x2+64x2+128x3     5
GoogLeNet: 6 layers, filter 16x2+64x2+128x2     8.9
LeNet: (CRP-500x6)+(FC-1000)x2+(FC-120)         9.5
Once the model architecture was defined, the model was trained on the training set with a
validation split of 20%, and the best weights were saved during training. After training,
predictions were made on the test set.
C. Methodology
1. Data Preprocessing
Based on our exploratory visualization, we can see that the samples do not all have the same
size. Most neural networks expect images of a fixed size, so we need to apply some
preprocessing to the dataset.
Let’s create the following transforms:
- Resize: to resize the image.
- RandomResizedCrop: to crop a random region of the image and resize it. This is data augmentation.
- RandomRotation: to rotate the image by a random angle. This is data augmentation.
- RandomHorizontalFlip: to flip the image horizontally at random. This is data augmentation.
- ToTensor: to convert the numpy images to torch images (we need to swap axes).
- Normalize (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]): to standardize each
color channel with the given mean and standard deviation, which keeps the data within a
consistent range and helps make training a lot faster.
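The arithmetic behind the Normalize step can be illustrated on a single pixel. This sketch assumes 8-bit RGB values that are first scaled to [0, 1] (as ToTensor does) and then standardized per channel with the mean and std listed above; the helper name normalize_pixel is hypothetical.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means from the transform list
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(s - m) / d for s, m, d in zip(scaled, MEAN, STD)]


# A black pixel and a white pixel mark the two ends of the output range.
print(normalize_pixel((0, 0, 0)))
print(normalize_pixel((255, 255, 255)))
```

After this step the red channel, for example, ranges from about -2.12 to about 2.25, so all three channels end up on a comparable scale centered near zero.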
2. Implementation
Our first step in this project was to create a CNN from scratch. Random chance is 1 out of 133,
so anything above 1% is better than random. However, according to the notes in the Jupyter
notebook, we should reach an accuracy greater than 10%.
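The random-guess baseline is simply 100 divided by the number of classes. As a quick check of the figures quoted in this report (the helper name is an illustrative assumption):

```python
def random_guess_accuracy(num_classes):
    """Expected top-1 accuracy (in %) when guessing uniformly at random."""
    return 100.0 / num_classes


print(f"{random_guess_accuracy(133):.2f}%")  # 133 breeds in this project
print(f"{random_guess_accuracy(120):.2f}%")  # 0.83% matches 1 in 120 classes
```

With 133 breeds the baseline is about 0.75%, while the 0.83% "Random Guessing" row of the benchmark table corresponds to 1 in 120, which suggests the benchmark paper works with 120 classes.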
For the basic CNN, I chose the following architecture:
CNN architecture
I used max pooling to reduce the dimensionality. Pooling makes the CNN run faster and also
helps reduce overfitting.
I increased the number of filters from 16 to 32 to 64, as is standard practice in CNNs.

For the first convolutional layer, I chose a kernel size of 7 to help learn larger spatial
filters and to help reduce the volume size.
For the stride, I used 2 for the first layer and 1 for the others.
I selected 15 epochs because with 10 epochs or fewer the accuracy stayed below 10%.
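The effect of the kernel size and stride on the volume size follows the standard formula output = floor((input + 2 × padding - kernel) / stride) + 1. The sketch below applies it to a 224-pixel input with the kernel size 7 and stride 2 mentioned above; the padding values and the 2x2 pooling step are assumptions, since the report does not state them.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# First layer: kernel 7, stride 2 (padding 3 is an assumption).
after_conv1 = conv_output_size(224, kernel_size=7, stride=2, padding=3)
# 2x2 max pooling with stride 2 (assumed) halves the size again.
after_pool1 = conv_output_size(after_conv1, kernel_size=2, stride=2)
print(after_conv1, after_pool1)
```

Under these assumptions the first layer alone reduces a 224 x 224 input to 112 x 112, and pooling brings it down to 56 x 56, which is why a stride of 2 in the first layer noticeably reduces the volume size.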
In the next section, we apply transfer learning to use an already-established architecture
and, hopefully, improve the results.
3. Refinement
The CNN created from scratch did not perform well: its accuracy was about 11%. This is
better than random, but there is a lot of room for improvement. First, the VGG16 model was
used. I then used a different pre-trained model, ResNet-101, given its high accuracy.
With this architecture, we modified the fully connected part of the network: a combination
of linear layers with Dropout regularization, followed by a fully connected dense layer as
the output layer.
After training and testing the model, we get significantly improved accuracy. With fewer
epochs, we reach 80% accuracy.
I added a Dropout layer to reduce overfitting. The final layer of the model predicts the
category (one of the 133 dog breeds).
D. Results
1. Model Evaluation and Validation
The “from scratch” dog breed classifier has an accuracy of 11%, whereas our architecture with
transfer learning has an accuracy of 80%.
In both cases, the accuracy is higher than the defined benchmark.
The model correctly determined whether the image showed a dog or a human every time, and it
also matched the dog breeds appropriately. (You can see these results in the Jupyter Notebook.)
2. Justification
The scratch-made CNN likely performed so poorly because it was given very little training data
compared to pre-trained architectures like ResNet.
Besides being a more complex architecture, ResNet was also trained on vastly more images than
my scratch-made architecture.
I think that with more data the scratch-made CNN could perform better.
E. References
1. https://en.wikipedia.org/wiki/Convolutional_neural_network
2. https://www.kaggle.com/c/dog-breed-identification/overview
3. "Using Convolutional Neural Networks to Classify Dog Breeds" (Hsu, 2012)
4. ImageNet. http://www.image-net.org
5. https://www.pyimagesearch.com/2018/12/31/keras-conv2d-and-convolutional-layers/