
Machine Learning Engineer

Nanodegree
Capstone Proposal
Khalil Henchi
January 22nd, 2020

CNN Project: Dog Breed Classifier

A. Definition

1. Project Overview

Domain Background

Machine learning is one of the most popular technology trends of the 21st century so far.
This is largely due to the increasing performance of computing hardware.

The use of machine learning in computer vision is a subject that continues to fuel the
curiosity of scientists and engineers. In fact, scientists have been trying to make machines extract
meaningful information from visual data for about 60 years now. The breakthrough that brought
computer vision back to the surface as a hot topic came in 2012, when AlexNet won the ImageNet challenge.

One of the most well-known computer vision tasks is image classification.


A breakthrough in building models for image classification came with the discovery that a
convolutional neural network (CNN) could be used to progressively extract higher- and
higher-level representations of the image content. Instead of preprocessing the data to derive
features like textures and shapes, a CNN takes the image’s raw pixel data as input and “learns”
how to extract these features, and ultimately infers what object they constitute.

The dog breed classification challenge is well known in the machine learning community. This
challenge is also available on Kaggle [2].
As Udacity offers this project in its list of possible capstone projects, I decided to work on it as
my capstone project: my goal is to get a job as a computer vision engineer, so this project
will be a valuable asset on my CV.

Datasets and Inputs

The dataset for this project is provided by Udacity. We have pictures of dogs and humans. Each
image is identified by a unique id.

We have 8351 dog images in total. Dog pictures are split into three folders:

❏ Train : 6680 Images
❏ Test : 836 Images
❏ Valid : 835 Images

In each folder, images are grouped by the dog’s breed. We have 133 dog breeds.

Human pictures are sorted by the name of each person. We have 13,233 human pictures in total.

By analyzing the datasets, we see that the pictures are taken from various angles.
Besides, their dimensions differ, and some pictures contain more than one object.

2. Problem Statement
For this project, our goal is to detect whether there is a human, a dog, or neither in a given
photo. If a dog is detected in the photo, we will look for its corresponding
breed. If a human is detected, we will look for the most resembling
dog breed. If neither a human nor a dog is detected, we will show an error message.
Images are random, with different sizes, taken from different angles and at different moments
during the day.

To summarize, our goal is:

Given a random picture of a dog, our model should be able to determine the dog breed from 133
breed classes. If the input is a human picture, the code will identify the most resembling dog breed.
Otherwise, it displays an error.

We will use Convolutional Neural Networks (CNNs) to create our app.

We will use a pre-trained face detector from OpenCV to detect humans. To detect dogs, we
will use a pre-trained VGG16 model. We will create our CNN model to identify dog breeds using
transfer learning.
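
The three cases of the problem statement can be sketched as a small dispatch function. This is only a minimal illustration: `describe_image` and the three callables passed to it are hypothetical names, standing in for the OpenCV face detector, the VGG16 dog detector and the breed classifier, none of which are implemented here.

```python
from typing import Callable


def describe_image(
    image_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Return a message for one image, following the three cases above."""
    if is_dog(image_path):
        return f"Dog detected. Predicted breed: {predict_breed(image_path)}"
    if is_human(image_path):
        return f"Human detected. Most resembling breed: {predict_breed(image_path)}"
    return "Error: neither a dog nor a human was detected."


# Usage with stand-in detectors (hypothetical; real ones would wrap the
# pre-trained models mentioned in the text).
message = describe_image(
    "sample.jpg",
    is_dog=lambda path: False,
    is_human=lambda path: True,
    predict_breed=lambda path: "Beagle",
)
print(message)
```

Checking for a dog before checking for a human is a design choice made for this sketch; the source does not say which detector runs first.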

3. Evaluation Metrics

Depending on the dataset, accuracy may not be a good metric for a classification problem, for
example when the classes are strongly imbalanced. In this
case precision and recall can be good evaluation metrics. The F1 score is a possible metric as it
combines precision and recall.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and
recall. It is given by the following formula:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

I checked the dog breed dataset and the classes (breeds) are relatively balanced, so a simple
accuracy score is considered representative in this project.
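
The metrics discussed in this section can be sketched in a few lines of plain Python. The labels below are made-up toy data, not results from the project, and the per-class (one-vs-rest) treatment of precision and recall is an assumption about how they would be applied to a multi-class problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical).
y_true = ["beagle", "beagle", "poodle", "pug", "pug", "pug"]
y_pred = ["beagle", "poodle", "poodle", "pug", "pug", "beagle"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="pug")
print(acc, precision, recall, f1)
```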

B. Analysis

1. Data Exploration

The dataset for this project is provided by Udacity. We have pictures of dogs and humans. Each
image is identified by a unique id.

We have 8351 dog images in total. Dog pictures are split into three folders:

❏ Train : 6680 Images
❏ Test : 836 Images
❏ Valid : 835 Images

In each folder, images are grouped by the dog’s breed. We have 133 dog breeds.

Human pictures are sorted by the name of each person. We have 13,233 human pictures in total.

By analyzing the datasets, we see that the pictures are taken from various angles.
Besides, their dimensions differ, and some pictures contain more than one object: more
than one human, more than one dog, or both a human and a dog.
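
As a quick sanity check on the figures above, the following sketch computes the share of each split and the average number of training images per breed. It uses only the counts stated in this section; the variable names are illustrative.

```python
# Image counts per split, as stated in the text.
splits = {"train": 6680, "test": 836, "valid": 835}
num_breeds = 133

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}
avg_train_per_breed = splits["train"] / num_breeds

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
print(f"average training images per breed: {avg_train_per_breed:.1f}")
```

The split is therefore roughly 80/10/10, with about 50 training images per breed on average.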

2. Exploratory Visualization

The following image shows some samples from the data sets.

Data set samples

There are:

- significant variations in the shapes and sizes of the images
- significant variations in the color intensities, as the images were taken from
different angles and at different moments. The color images have 3 channels:
R (Red), G (Green), B (Blue).

The training data provided for this competition is not distributed uniformly among the
different classes of dogs, as shown in the figure below. This might cause a problem, as the
model may find it difficult to make a correct prediction for the classes with fewer
samples and may more often predict that an image belongs to a class with a higher number of samples.
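
One way to quantify this imbalance is to count the samples per class and look at the ratio between the largest and smallest class. The sketch below also derives inverse-frequency class weights, a common mitigation that is shown here only as an illustration; the source does not say that class weighting was used in this project, and the labels are made-up toy data.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels (hypothetical): one frequent class and two rarer ones.
labels = ["beagle"] * 6 + ["poodle"] * 3 + ["pug"] * 1

counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
weights = class_weights(labels)

print(counts)
print(imbalance_ratio)
print(weights)
```

Rare classes receive larger weights, so that their errors count for more in a weighted loss.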

3. Algorithms and Techniques

- Deep learning:

Deep learning (also known as deep structured learning or hierarchical learning) is part of
a broader family of machine learning methods based on learning data representations, as
opposed to task-specific algorithms.

- Convolutional neural network:

A convolutional neural network (CNN or ConvNet) is a class of deep, feed-forward
artificial neural networks that has successfully been applied to analyzing visual imagery.
A CNN consists of an input and an output layer, as well as multiple hidden layers. The
hidden layers are either convolutional, pooling or fully connected. We give the CNN an input
and it learns by itself what features it has to detect. We do not specify the initial
values of the features or what kind of patterns it has to detect.

Various layers:

● Convolutional - Also referred to as the Conv. layer, it forms the basis of the
CNN and performs the core operations of training and consequently firing the neurons of
the network. It performs the convolution operation over the input.
● Pooling layers - Pooling layers reduce the spatial dimensions (Width x Height)
of the input volume for the next convolutional layer. They do not affect the depth
dimension of the volume.
● Fully connected layer - The fully connected or Dense layer is configured
exactly the way its name implies. It is fully connected with the output of the previous
layer. Fully connected layers are typically used in the last stages of the CNN to
connect to the output layer and construct the desired number of outputs.
● Dropout layer - Dropout is a regularization technique for reducing overfitting
in neural networks by preventing complex co-adaptations on training data. It is a very
efficient way of performing model averaging with neural networks. The term "dropout"
refers to dropping out units (both hidden and visible) in a neural network.
● Flatten - Flattens the output of the convolutional layers to feed into the
Dense layers.
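
To make the convolution, pooling and flatten steps from the list above concrete, here is a minimal pure-Python sketch on a tiny single-channel image. It is a teaching illustration under simplifying assumptions (no padding, stride 1, one filter, no learning); the function names and the example values are made up, and a real model would rely on a deep learning framework instead.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 sliding-window product, as in a conv layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in cols
        ]
        for i in rows
    ]


def flatten(feature_map):
    """Turn a 2D feature map into a flat list for the dense layers."""
    return [value for row in feature_map for value in row]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]  # responds to changes along the diagonal

features = conv2d(image, kernel)  # 5x5 input, 2x2 kernel -> 4x4 map
pooled = max_pool2d(features)     # 4x4 -> 2x2
vector = flatten(pooled)          # 2x2 -> 4 values
print(features, pooled, vector)
```

Pooling halves the width and height here but would leave the number of channels untouched, which matches the description of the pooling layer above.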

- Activation functions:

In a CNN, the activation function of a node defines the output of that node given an
input or set of inputs.
Some activation functions are:
● The softmax function squashes the output of each unit to be between 0 and 1,
just like a sigmoid function. It also divides each output such that the total sum of the
outputs is equal to 1.
● A ReLU (or rectified linear unit) has output 0 if the input is less than 0,
and returns the input unchanged otherwise, i.e., if the input is greater than 0, the output is equal to the
input.
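
Both activation functions described above are short enough to write out directly. The sketch below is a plain-Python illustration; subtracting the maximum score before exponentiating is a standard numerical-stability trick added here, not something stated in the text, and the input scores are arbitrary.

```python
import math


def relu(x):
    """Rectified linear unit: 0 for negative inputs, the input otherwise."""
    return max(0.0, x)


def softmax(scores):
    """Map raw scores to values in (0, 1) that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(relu(-3.0), relu(2.5))
```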

- Transfer learning:

In transfer learning, we take the learned understanding and pass it to a new deep
learning model. We take a pre-trained neural network and adapt it to a new
task with a different dataset.
For this problem we use the “ResNet-101” neural network.
● ResNet-101 is a convolutional neural network that is trained on more than a million images
from the ImageNet database. The network is 101 layers deep and can classify images into 1000
object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network
has learned rich feature representations for a wide range of images. The network has an image
input size of 224-by-224.
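
To give a feel for how small the adapted part of the network can be, the following sketch counts the parameters of a fully connected output layer. It assumes the final pooled feature vector of ResNet-101 has 2048 values, as in the commonly used implementations; that width is not stated in the source, and `linear_params` is an illustrative helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


FEATURE_WIDTH = 2048  # assumed width of the ResNet-101 feature vector

imagenet_head = linear_params(FEATURE_WIDTH, 1000)  # original 1000 categories
dog_breed_head = linear_params(FEATURE_WIDTH, 133)  # 133 dog breeds

print(imagenet_head)
print(dog_breed_head)
```

Under this assumption, replacing the 1000-way output layer with a single 133-way layer leaves only a few hundred thousand parameters to train if the rest of the network is kept frozen.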

4. Benchmark Model
For the benchmark model, we will use the algorithms outlined in the paper [3]. The paper
describes several architectures with the following accuracies.

Method                                          Top-1 Accuracy (%)

Random Guessing                                 0.83
LeNet: (CRP-50 x 3)+(FC-1000)+(FC-120)          <2
LeNet: (CRP-500 x 5)+(FC-1000)+(FC-120)         <2
GoogLeNet: 3 layers, 16 filters per CONV        <2
GoogLeNet: 4 layers, 16 filters per CONV        3.4
GoogLeNet: 7 layers, filter 16x2+64x2+128x3     5
GoogLeNet: 6 layers, filter 16x2+64x2+128x2     8.9
LeNet: (CRP-500x6)+(FC-1000)x2+(FC-120)         9.5

After defining the model architecture, it was trained on the training set with a validation
split of 20%, and the best weights were saved during the training process. After training,
predictions were made on the test set.

C. Methodology
1. Data Preprocessing

Based on our exploratory visualization, we can see that the samples are not of the same size.
Most neural networks expect images of a fixed size. Therefore, we will need to apply some
preprocessing to the data set.

Let’s create the following transforms:

- Resize: to resize the image to a fixed size.
- RandomResizedCrop: to crop a random region of the image and resize it. This is data augmentation.
- RandomRotation: to rotate the image by a random angle. This is data augmentation.
- RandomHorizontalFlip: to flip the image horizontally at random. This is data augmentation.
- ToTensor: to convert the PIL or numpy images to torch tensors (the axes are swapped to channels-first and the pixel values are scaled to [0, 1]).
- Normalize (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]): to standardize each
color channel with the given mean and standard deviation, which helps make training a lot faster.
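
What the scaling, normalization and flip steps do to the pixel values can be sketched without any library. The helpers below are simplified stand-ins written for illustration, not the torchvision transforms themselves; they operate on plain lists, and the flip is applied unconditionally rather than at random.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means from the Normalize step
STD = [0.229, 0.224, 0.225]   # per-channel standard deviations


def to_unit_range(rgb_8bit):
    """Scale 8-bit channel values (0..255) to the range [0, 1]."""
    return [value / 255 for value in rgb_8bit]


def normalize_pixel(rgb):
    """Standardize one RGB pixel: (value - mean) / std for each channel."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


def horizontal_flip(image):
    """Mirror an image given as a list of rows by reversing every row."""
    return [row[::-1] for row in image]


pixel = to_unit_range([255, 128, 0])
print(normalize_pixel(pixel))
print(horizontal_flip([[1, 2, 3], [4, 5, 6]]))
```

A pixel whose channels equal the channel means maps to zero, so after normalization the values are roughly centered around 0.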

2. Implementation

Our first step in this project was to create a CNN from scratch. Random chance is 1 out of 133, so
anything above 1% is better than random. But, according to the notes in the Jupyter notebook,
we should get an accuracy greater than 10%.
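
The random-chance baseline is simple arithmetic, sketched below. The 120-class case is included only because the 0.83% random-guessing figure in the benchmark table is consistent with 120 classes; that reading is an inference from the FC-120 output layers listed there, not something the source states.

```python
def chance_accuracy_percent(num_classes):
    """Expected top-1 accuracy (%) of uniform random guessing."""
    return 100.0 / num_classes


ours = chance_accuracy_percent(133)       # this project's 133 breeds
benchmark = chance_accuracy_percent(120)  # inferred class count of the benchmark

print(round(ours, 2))       # 0.75
print(round(benchmark, 2))  # 0.83
```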
For the basic CNN, I chose the following architecture:

CNN architecture

I utilized max pooling to reduce the dimensionality. Pooling makes the CNN run faster and can also
help reduce overfitting.

I increased the number of filters from 16 to 32 to 64, as is standard practice in CNNs.

For the first convolutional layer, I chose a kernel size of 7 to help learn larger spatial
filters and to help reduce the volume size.
For stride selection, I used 2 for the first layer and 1 for the others.
I selected 15 epochs because with 10 epochs or fewer the accuracy is lower than 10%.
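
The effect of these choices on the feature-map size can be traced with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is only an illustration: the 224-pixel input, the paddings, the kernel size of 3 for the later layers and the 2x2 pooling after every convolution are assumptions, since the architecture figure is not reproduced here. Only the first-layer kernel of 7, the strides and the 16/32/64 filter counts come from the text.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of non-overlapping pooling along one dimension."""
    return size // window


# (filters, kernel, stride, padding); kernel 3 and the paddings are assumed.
layers = [
    (16, 7, 2, 3),
    (32, 3, 1, 1),
    (64, 3, 1, 1),
]

size = 224  # assumed square input size
shapes = []
for filters, kernel, stride, padding in layers:
    size = pool_out(conv_out(size, kernel, stride, padding))
    shapes.append((filters, size, size))

flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)
print(flat_features)
```

Under these assumptions, the stride of 2 in the first layer, followed by pooling, already shrinks the 224 x 224 input to 56 x 56, which is what "reducing the volume size" refers to.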

In the next section, we’ll apply transfer learning to use an already-established architecture to
hopefully optimize results.

3. Refinement

The CNN created from scratch did not perform very well: its accuracy was about 11%. This is
better than random, but there’s a lot of room to improve. First, the VGG16 model was utilized.
Then, I utilized a different pre-trained model, ResNet-101, given its high accuracy.

With this architecture, we made some modifications: we added a fully connected block that
combines linear layers with Dropout regularization, and a fully connected dense layer as the
output layer.

After training the model and testing it, we get significantly improved accuracy. With
fewer epochs, we reach 80% accuracy.

I added a Dropout layer to reduce overfitting. The final layer of the model is used to predict the
category (one of the 133 dog breeds).
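
The behavior of a Dropout layer can be sketched in plain Python. This version implements "inverted" dropout, which rescales the surviving activations during training so that nothing needs to change at test time. The drop probability of 0.5 and the function signature are illustrative assumptions, not the values used in the project.

```python
import random


def dropout(values, p=0.5, training=True, rng=None):
    """Zero each value with probability p and scale survivors by 1 / (1 - p).

    Requires 0 <= p < 1. At evaluation time (training=False) the input is
    returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


activations = [1.0] * 10
dropped = dropout(activations, p=0.5, rng=random.Random(0))
unchanged = dropout(activations, training=False)
print(dropped)
print(unchanged)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is the co-adaptation effect that dropout is meant to prevent.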

D. Results
1. Model Evaluation and Validation
The “from scratch” dog breed classifier has an accuracy of 11%, whereas our architecture with
transfer learning has an accuracy of 80%.
In both cases, the accuracy is higher than the defined benchmark.

The model correctly determined whether the image showed a dog or a human every time, and it also
matched the dog breeds appropriately. (You can see these results in the Jupyter Notebook.)

2. Justification

The scratch-made CNN likely performed poorly because it was given very little training data
compared to pretrained architectures like ResNet.

In addition to the complexity of the architecture itself, the ResNet architecture was also trained
on vastly more images than I trained my scratch-made architecture on.
I think that with more data the scratch-made CNN could perform better.

E. References
1. https://en.wikipedia.org/wiki/Convolutional_neural_network
2. https://www.kaggle.com/c/dog-breed-identification/overview
3. "Using Convolutional Neural Networks to Classify Dog Breeds" (Hsu, 2012)
4. ImageNet. http://www.image-net.org
5. https://www.pyimagesearch.com/2018/12/31/keras-conv2d-and-convolutional-layers/
