

Implementing Capsule Neural Networks in Traffic Light Image Recognition

Advanced Scientific Research Paper

Submitted to the Center for Advanced Studies, Wheeler High School

by

Shenoy, Keshav

Department of Computer Science

Kennesaw State University

Marietta, GA

November 2018

Abstract

Traffic Light Image Recognition is the problem of determining the signal of traffic lights within

images taken by an autonomous vehicle. Currently, this is done by Convolutional Neural

Network (CNN) machine learning systems. However, CNNs have issues managing positional

information and don’t route data dynamically, so researchers have suggested using Capsule

Neural Networks (CapsNets) instead. CapsNets don’t utilize max pooling layers, the cause of

CNN information loss, and do dynamically route information through capsules, so they may

have significant benefits over CNNs. This study utilized the engineering design process,

investigating the ability of CapsNets to classify images of traffic lights by signal by adapting a

CapsNet for that purpose. Researchers modified a CapsNet to accept traffic light image input

data and altered its capsule layers slightly to allow for a 3-class classification problem. The

criterion for evaluation was the final validation accuracy of the CapsNet, with the threshold for hypothesis support set at 85% final accuracy. The CapsNet reached a validation

accuracy of 100% in less than 40 steps, strongly supporting the hypothesis. This reinforces the

need for a continuation of research, development, and optimization of CapsNets for autonomous

driving problems, especially in image classification, detection, and other perception problems.

Key Words: autonomous driving, machine learning, convolutional neural network, capsule neural

network, traffic light image recognition



Table of Contents

Chapter 1: Introduction
    Statement of the Problem
    Purpose of the Study
    Research Question
    Hypothesis Statement
    Significance of the Study
    Definition of Key Terms
    Summary

Chapter 2: Literature Review
    Current Approaches to Image Recognition
    Benefits of CapsNets for Image Recognition
    Summary

Chapter 3: Research Method
    Research Methods and Design
    Population
    Sample
    Materials/Instruments
    Operational Definition of Variables
    Data Collection, Processing, and Analysis
    Assumptions
    Limitations
    Delimitations
    Summary

Chapter 4: Findings
    Results
    Evaluation of Findings
    Summary

Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions
    Implications
    Real World Connections
    Recommendations
    Conclusions

References

LIST OF ABBREVIATIONS

CapsNet Capsule Neural Network

CNN Convolutional Neural Network

ML Machine Learning

SSD Single Shot Multibox Detector

SVM Support Vector Machine

TF TensorFlow

LIST OF FIGURES

Figure 1 CapsNet Validation Accuracy



Chapter 1: Introduction

Autonomous vehicles need accurate programming and perception tools to analyze the

world around them and make appropriate decisions quickly. One such perception problem in this

field is determining the signal of traffic lights around the autonomous vehicle. Traffic lights are

entirely color-based, so self-driving vehicles must use camera images to determine their signal,

not radar, sonar, or another technology (Fairfield & Urmson, 2011, p. 2). To determine the signal

of a traffic light from an image, researchers have turned to machine learning, specifically looking

at CNNs (Fairfield & Urmson, 2011, p. 2).

Statement of the Problem

According to Hinton (2014), CNNs have multiple flaws when it comes to evaluating

positional data (pose) and dynamic data routing (6:10), so CNN approaches to traffic light image

recognition may lack viability, and another type of artificial neural network, the CapsNet, may do

better in image recognition (7:00).

Purpose of the Study

This study will determine the potential of CapsNets to improve final accuracy in traffic

light image recognition when compared to CNNs by constructing a CapsNet to classify traffic

light images by signal.

Research Question

How do CapsNets perform when implemented in traffic light image recognition?

Hypothesis Statement

A trained CapsNet will recognize traffic lights from a traffic light image dataset with

more than 85% final validation/evaluation accuracy.

Independent variables: The neural network’s construction and build

Dependent variables: The final validation/evaluation accuracy of the neural network

Significance of the Study

By showing the performance of CapsNet technology within traffic light image

recognition in autonomous driving, this research can support or fail to support a shift in resources

towards further CapsNet research. The potential for a model more powerful and accurate than the CNN is significant, because CNNs currently stand at the forefront of object recognition (Hinton, 2014,

7:00). Improving upon the capabilities of CNNs with CapsNets could change how researchers

approach image recognition problems and push further forward the adoption of autonomous

motor vehicles globally as well as the incorporation of artificial intelligence and ML in

commonly used/household objects.

Definition of Key Terms

Artificial Intelligence: Nilsson (2010) defines it as follows: “Artificial intelligence is that activity devoted to making machines intelligent...” (as cited in Stone et al., 2016, p. 12).

Artificial Neural Network: Machine Learning using a collection of interconnected nodes,

“neurons,” loosely based on the organization of certain neurons in human brains (Rawat &

Wang, 2017, p. 2354).

Autonomous driving: An emerging technology in which artificial intelligence will control the

movement of transport vehicles instead of humans (Stone et al., 2016, p. 7).



Convolutional Neural Network: A type of feedforward artificial neural network built from convolutional and pooling layers (Rawat & Wang, 2017, p. 2354).

Capsule Neural Network: A type of artificial neural network that modifies convolutional neural

networks by segmenting groups of neurons into capsules for the better evaluation of positional

data (Hinton, 2014, 3:08).

Image Recognition (or Image Classification): “…the task of categorizing images into one of

several predefined classes…” (Rawat & Wang, 2017, p. 2352).

Convolutional Layers: “…serve as feature extractors, and thus they learn the feature

representations of their input images…” (Rawat & Wang, 2017, p. 2355).

Machine Learning: “…the design of learning algorithms, as well as scaling existing algorithms,

to work with extremely large data sets” (Stone et al., 2016, p. 9).

Pooling Layer: LeCun et al. (1989a), LeCun et al. (1989b), LeCun et al. (1998), and Ranzato,

Huang, Boureau, and LeCun (2007) claimed that pooling layers “…reduce the spatial resolution

of the feature maps and thus achieve spatial invariance to input distortions and translations” (as

cited in Rawat & Wang, 2017, p. 2356).

Pose: A specific type of positional data, including position, orientation, scale, deformation,

velocity, color, and more, which is recorded by CapsNets (Hinton, 2014, 3:23).

Summary

Traffic light image recognition is the problem of classifying images of traffic lights based

on the signal of the traffic light in the image. The CNN model currently leads in the field for

performing traffic light image recognition (Fairfield & Urmson, 2011, p. 1), but CNNs have issues

with retaining positional data and routing data through layers (Hinton, 2014, 6:10). As such, this

study seeks to investigate the performance of CapsNets when implemented in traffic light image

recognition. If the CapsNet has a final validation accuracy above 85%, it will support the idea

that CapsNets have the potential to replace CNNs within image recognition.

Chapter 2: Literature Review

Currently, the field of image object recognition within ML is increasing in importance for

a number of different applications. Specifically, Fairfield and Urmson (2011) discussed its

growing significance in the field of autonomous driving, where researchers use artificial neural

networks in combination with cameras to build perception systems (p. 1). They specifically cited

the issue of traffic light image recognition, which alternative measures like sonar or radar cannot

perform, because interpreting traffic lights requires knowledge of color (Fairfield & Urmson,

2011, p. 1). As such, a large amount of development has gone into designing the best learning

algorithms for traffic light image recognition problems.

Current Approaches to Image Recognition

So far, Huang et al. (2017) have found that the CNN is the most successful ML model for image recognition problems (p. 1). As evidence, Huang et al. (2017) referenced the usage of Faster R-CNNs, R-FCNs, and SSDs⎯systems based on a CNN framework⎯in multiple tech products, demonstrating the current dominance of the CNN (p. 1). Lim, Hong, Choi, and Byun (2017)

explained further, describing CNN architecture as one where a network feeds image data through

a series of deep (convolutional) and pooling layers to extract features for classification (p. 11).

They explained that CNN technology is state-of-the-art, needing only one network to accurately

classify many different classes of traffic signs (Lim et al., 2017, p. 10).
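
As a minimal illustration of the architecture Lim et al. (2017) describe (a generic sketch, not their actual network; the layer counts and filter sizes here are arbitrary), a Keras CNN stacks convolutional feature extractors and pooling layers ahead of a classification head:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# A minimal CNN: convolutional layers extract features,
# pooling layers downsample, and a dense softmax layer classifies.
model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(2),                 # the subsampling step CapsNets avoid
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(3, activation="softmax"),  # e.g., red/yellow/green
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```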

Despite this, significant issues with the CNN model exist when applied to object

detection/classification problems like traffic light image recognition. Liu et al. (2016) identified

balancing speed performance and accuracy as one important problem (p. 21). To alleviate some

of this, Liu et al. (2016) proposed the SSD (Single Shot MultiBox Detector) – a “deep network

based object detector that does not resample pixels or features for bounding box hypotheses and

is as accurate as approaches that do” (p. 22). This is still a CNN, but Liu et al. designed it

specifically to achieve strong accuracy without abandoning performance (Liu et al., 2016, p. 23).

By replacing bounding box proposals with a convolutional filter, Liu et al. (2016) constructed

a model that operates at higher frames per second than previous approaches like Faster R-CNN,

another CNN variant introduced to solve the same problem (p. 36). In a comprehensive review of

approaches to traffic light image recognition, Huang et al. (2017) contrasted with Liu et al.’s

opinion, suggesting that Faster R-CNN operates at a similar speed as SSD when Faster R-CNN

minimizes bounding box proposals (p. 34). This makes Faster R-CNN a more desirable CNN

architecture, because it maintains high accuracy on very small objects, while SSD accuracy

declines (Huang et al., 2017, p. 14). Meanwhile, Lim et al. (2017) took a different approach to

the optimization problem by combining a Support Vector Machine (SVM) model – an ML

system which does not utilize neural networks – with CNN technology to improve results (p. 2).

They utilized SVMs first to verify the image and a CNN afterwards to classify the image (Lim et

al., 2017, p. 2). Lim et al.’s (2017) combination worked out, forming a system able to classify

images in real time with 97.9% average accuracy and with improved accuracy specifically in

poor lighting (p. 19).

All three of the above approaches, Liu et al. (2016), Huang et al. (2017), and Lim et al.

(2017), attempted and succeeded in improving CNN accuracy by altering the workings and variants of the model itself, but other researchers have attempted to improve the CNN through external changes to structure or learning strategy. A strong example of this is the work of Fairfield and Urmson (2011), who showed the ability of mapped traffic lights to improve detection results within a model (p. 6). By mapping the location of traffic lights against the current location of the

vehicle, their network could predict when it should expect to detect traffic lights and when it

should expect not to, reducing false positives and false negatives (Fairfield & Urmson, 2011, p.

6). Ghahramani (2015) took a more technical approach, exploring the ability of probabilistic

frameworks – models which “make predictions about future data, and take decisions that are

rational given these predictions” (p. 452) – to increase accuracy. This approach adds another

level onto machine learning, encouraging systems to attempt to solve problems before they occur

and speed up training (Ghahramani, 2015, p. 458). Tyukin, Gorban, Sofeykov, and Romanenko (2018) did

something similar by considering the use of multiple ML models within a teacher-student model,

which would speed up the training of classification algorithms and improve the universality of

models in application to data (p. 1). They improved on previous work in the field by creating a

framework for the teacher-student model which requires less raw data and training (Tyukin et al.,

2018, p. 2). Though not implemented within the context of autonomous driving, the success of

the model within CNN image recognition suggests its potential for the field.

Together, these CNN models form the basis for researchers’ current approaches to

solving image recognition problems in autonomous driving. Researchers are working on

optimizing the balance between accuracy and performance by constructing new variants of CNN

with different traits and layer make-ups (Liu et al., 2016, p. 22; Huang et al., 2017, p. 14; Lim et

al., 2017, p. 2), sometimes even incorporating other machine learning systems, like an SVM, as a

supplement to the CNN framework (Lim et al., 2017, p. 2). Outside of the actual model,

researchers have attempted to improve image recognition technology by finding other areas to

supply supplementary data, like Fairfield and Urmson’s (2011) mapping of traffic lights to

provide context for classifications (p. 6) or Ghahramani’s (2015) application of probability data

to ML frameworks (p. 452). Finally, some researchers, like Tyukin et al. (2018), have looked not

at speeding up performance, but speeding up training or training on limited data (p. 2).

Benefits of CapsNets for Image Recognition

Hinton (2014) has challenged the CNN architecture itself, referencing CNNs' lack of

structure as a major flaw with their performance in handling positional data (1:47). As a way to

fix this, Hinton, Krizhevsky, and Wang (2011) proposed CapsNets, artificial neural networks

similar to CNNs but with layers loosely replaced with “capsules” (p. 45). According to Hinton

(2014), capsules would output likelihood of feature presence and pose information, the positional

information that many CNNs don’t take into account (3:09).

First, Hinton (2014) claims, capsules would improve massively on the current CNN

practice of max pooling, which reduces the available information in a subsampling procedure

(6:57). CapsNets eliminate pooling completely, instead using coincidence filtering to find

clusters of inputs at high dimensions, removing unwanted background inputs while keeping the

useful pose data (Hinton, 2014, 5:26). Secondly, Sabour, Frosst, and Hinton (2017) pointed out

the benefits of capsules for the dynamic routing of information, which would allow the

specialization of specific capsules for certain tasks (p. 2). This contrasts with max pooling, which

Sabour et al. (2017) stated will, “throw away information about the precise position of the entity

within the region” (p. 2), like the pose information; dynamic routing, by contrast, considers whole input vectors rather than only the most active one. These two effects, the removal of subsampling and the introduction

of dynamic routing, could lead to improvements in a number of fields, including: digit

segmentation and separation, like that performed by Hinton, Ghahramani, and Teh (2000, p. 1)

and Sabour et al. (2017); traffic sign image recognition, like that done by Lim et al. (2017); and

shape analysis, like that described by Hinton (2014, 15:15). In fact, Kumar, Arthika, and

Parameswaran (2018) have already implemented CapsNets in traffic sign image recognition with

strong results: 97.6% accuracy and 0.0311038 loss at the end of validation (p. 4546). Researchers

have not yet applied CapsNets to the primary issue of this research, traffic light image

recognition, but the strong results in the similar field of traffic sign image recognition provide a

strong basis for support.
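
To make the pooling information loss concrete, consider a small illustrative example (not drawn from the cited papers): 2x2 max pooling keeps only the largest activation in each window, so two feature maps with the same feature at different positions can pool to identical outputs.

```python
import numpy as np

def max_pool_2x2(x):
    """Naive 2x2 max pooling over a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0  # feature in the corner of a window
b = np.zeros((4, 4)); b[1, 1] = 1.0  # same feature, shifted one pixel
# Both maps pool to the same output: the precise position is discarded.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```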

From the literature, it becomes clear that there are numerous areas for potential

improvement within CapsNets that do not exist in CNNs. These include the elimination of

information loss from down-sampling suggested by Hinton et al. (2011, p. 50) and by Hinton

(2014, 6:55), as well as within dynamic routing between capsules to enable specialization

(Sabour et al., 2017, p. 2). Sabour et al. (2017) go so far as to state that “The fact that a simple

capsules system already gives unparalleled performance at segmenting overlapping digits is an

early indication that capsules are a direction worth exploring” (p. 9). This supports the

conclusion that CapsNets, if developed to the level of maturity that CNNs have enjoyed, could become

one of the leading approaches towards image classification. The success of Kumar et al. (2018) motivates further research into the CapsNet architecture, especially within the context

of autonomous driving applications.

Summary

Right now, researchers approach traffic light image recognition through the use of CNNs

with additional supplemental data (Fairfield & Urmson, 2011, p. 1). This approach has

succeeded in achieving a baseline level of efficiency, but it eventually reaches a plateau

where CNNs have to sacrifice accuracy for performance and vice versa (Liu et al., 2016, p. 21).

Researchers have managed to make some progress by optimizing the build of the CNNs (Liu et

al., 2016, p. 22; Huang et al., 2017, p. 14; Lim et al., 2017, p. 2), the supplemental data provided

(Fairfield & Urmson, 2011, p. 6; Ghahramani, 2015, p. 452) or the amount of training needed

(Tyukin et al., 2018, p. 1), but Hinton (2014) claims that the problem is with the CNN model

itself (1:47). Hinton et al. (2011) point out that a different structure called a CapsNet could

eschew the use of max pooling and preserve positional pose data by using capsules instead of

layers (p. 49). Sabour et al. (2017) also identify that a Capsule Neural Network could specialize

its routing of data through different capsules, improving accuracy and analysis of pose data (p.

2). These two potential benefits, as well as the success of Kumar et al. (2018) in a similar field,

support the need for further study into the use of CapsNets in traffic light image recognition.

Chapter 3: Research Method

Autonomous vehicles need accurate programming and perception tools to analyze the

world around them and make appropriate decisions quickly. One such perception problem in this

field is determining the signal of traffic lights around the autonomous vehicle. Traffic lights are

entirely color-based, so self-driving vehicles must use camera images to determine their signal,

not radar, sonar, or another technology (Fairfield & Urmson, 2011, p. 2). To determine the signal

of a traffic light from an image, researchers have turned to machine learning, specifically looking

at CNN-type artificial neural networks (Fairfield & Urmson, 2011, p. 2). However, according to

Hinton (2014), CNNs have multiple flaws when it comes to evaluating positional data (pose) and

dynamic routing, so other types of artificial neural network like CapsNets may be a better

solution (6:10). This study will determine the potential of CapsNets to improve final accuracy in

traffic light image recognition by constructing a CapsNet to classify traffic light images by

signal.

Research Methods and Design

First, the researcher examined CNNs and CapsNets implemented previously for similar

problems. By basing the foundational areas of the design on models previously shown to have success, the researcher established the model on a stable foundation from which to start the design

process. Specifically, the researcher focused on the CapsNet research done by Kumar et al.

(2018). Kumar et al.’s (2018) CapsNet research serves as an appropriate starting place because it

successfully classified traffic signs, a similar problem to that of traffic light image recognition.

By using the same base model, the researcher could accurately evaluate results in context. This

also informed the choice of hypothesis for the research. Since Kumar et al. (2018) constructed

their model in the context of traffic signs, a slightly lower accuracy⎯85%⎯makes sense for

evaluating the potential of the model in context.

Second, the researcher selected the data for training and validation. In this case, the

researcher took data from Udacity’s bag⎯as in rosbag files⎯of images from Carla

(Dosovitskiy, Ros, Codevilla, Lopez, & Koltun, 2017). These are images taken by the self-

driving car Carla (Dosovitskiy et al., 2017). The researcher chose this dataset because of its

relative simplicity, containing only three classes and with a range of different sizes for traffic

lights.

Third, the researcher built an algorithm to properly process input data into a format

understandable by the CapsNet. The researcher achieved this by developing a Python program

which read the images and the accompanying xml/yaml files with the ground-truth labels and

cropped the images into just the traffic lights. Then, the program scaled each image to exactly

32x32 pixel size so all the inputs would have the same dimensions. These two functions used the

OpenCV, glob, and PIL Python packages and were necessary to remove the image backgrounds,

which would hurt the ability of the network to properly classify images. The researcher chose the

32x32 pixel size to minimize the amount of RAM necessary to run and process all the images

within the network.
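
A minimal sketch of this cropping and scaling step follows (the function name, file layout, and bounding-box representation here are hypothetical; the paper does not reproduce the original code):

```python
import glob
import cv2  # OpenCV
from PIL import Image

def crop_and_scale(image_path, box, out_size=(32, 32)):
    """Crop a traffic light out of a frame and scale it to 32x32.

    `box` is an assumed (xmin, ymin, xmax, ymax) ground-truth bounding
    box read from the dataset's xml/yaml annotations.
    """
    frame = cv2.imread(image_path)              # BGR pixel array
    xmin, ymin, xmax, ymax = box
    light = frame[ymin:ymax, xmin:xmax]         # drop the background
    light = cv2.cvtColor(light, cv2.COLOR_BGR2RGB)
    return Image.fromarray(light).resize(out_size)  # uniform 32x32 input

# Hypothetical usage over a directory of frames:
# for path in glob.glob("frames/*.jpg"):
#     img = crop_and_scale(path, boxes[path])
```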

From here, the program fed the images into a separate researcher-developed Python program

which read each pixel of the images from the traditional file format into a 3-dimensional array of

pixel values. One array represented each image, with height, width, and RGB making up the 3

dimensions. The code also found the ground-truth for the signal of each image and appended

them into a 1-dimensional array of the same sample size as the 3-dimensional one. The program

split both arrays into two sets, training data and validation data, for a total of four input arrays into the network. Prior to entering the CapsNet, another custom program preprocessed the

data, enhancing brightness and contrast using PIL and dividing all the values by 255 to keep

values within a range of zero to one. However, before the data could train in the CapsNet, some

of the network’s properties had to change.
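
The array-building and preprocessing steps could look like the following sketch (array shapes follow the description above; the class encoding and enhancement factors are assumptions):

```python
import numpy as np
from PIL import Image, ImageEnhance

def load_arrays(images, labels):
    """Turn 32x32 RGB crops into a (N, 32, 32, 3) array plus (N,) labels."""
    x = np.stack([np.asarray(img, dtype=np.float32) for img in images])
    y = np.asarray(labels, dtype=np.int64)  # e.g., 0=red, 1=yellow, 2=green (assumed)
    return x, y

def preprocess(x, brightness=1.2, contrast=1.2):
    """Enhance brightness/contrast with PIL, then scale pixels to [0, 1]."""
    out = []
    for arr in x:
        img = Image.fromarray(arr.astype(np.uint8))
        img = ImageEnhance.Brightness(img).enhance(brightness)
        img = ImageEnhance.Contrast(img).enhance(contrast)
        out.append(np.asarray(img, dtype=np.float32))
    return np.stack(out) / 255.0            # keep values in [0, 1]

# Hypothetical end-to-end usage:
# x, y = load_arrays(images, labels)
# x = preprocess(x)
# x_train, x_val = x[:155], x[155:]   # 155 training / 163 validation images
# y_train, y_val = y[:155], y[155:]
```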

Kumar et al. (2018) constructed their CapsNet with three main layers: the input layer,

the primary capsule layer, and the traffic sign capsule layer (p. 4546). The changes to the input

were already described above. The primary capsule layer feeds input into two convolutional

layers⎯which utilize a rectified linear unit activation function and 0.7 dropout⎯and then into

the primary capsules, sixteen filters of 1600 capsules (p. 4546). The researcher did not alter the

primary capsule layer at all. Kumar et al.’s (2018) traffic sign capsule layer consisted of 43

capsules, one for each class in the traffic sign dataset (p. 4546). This was changed to 3 for this

research, because there are only three classes of traffic light in this dataset. Nor did the researcher alter the CapsNet’s reconstruction network, which provides feedback to the network after images pass through the CapsNet and reduces overfitting.
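
For reference, capsule layers apply the squash nonlinearity of Sabour et al. (2017) to their output vectors so that vector length can represent feature-presence probability while orientation encodes pose; a generic TensorFlow sketch (not the study's exact code):

```python
import tensorflow as tf

def squash(s, axis=-1, eps=1e-7):
    """Squash a capsule's output vector (Sabour et al., 2017).

    Shrinks short vectors toward zero length and long vectors toward
    unit length, preserving orientation.
    """
    squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / tf.sqrt(squared_norm + eps)
```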

Once the researcher completed these changes to the CapsNet, training could begin. The

data entered directly into the neural network architecture, training for 100 iterations with a batch

size of 50 for both neural networks. After each iteration of training, the network evaluated itself

on separate validation data. Both trainings and validations occurred within the Keras and

TensorFlow ML frameworks. At the end of the training and validation iterations, the network

plotted loss and accuracy curves with TensorBoard using the event files generated by the

CapsNet.
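
A minimal sketch of this training and logging setup (assuming a compiled Keras model named `model` and the arrays from the preprocessing sketch; all other settings are placeholders):

```python
import tensorflow as tf

# Log event files that TensorBoard later plots as loss/accuracy curves.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/capsnet")

history = model.fit(
    x_train, y_train,
    batch_size=50,                    # batch size used in the study
    epochs=100,                       # 100 training iterations
    validation_data=(x_val, y_val),   # evaluated after each iteration
    callbacks=[tensorboard_cb],
)
print(history.history["val_accuracy"][-1])  # final validation accuracy
```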

Population

The direct population of this study consists of traffic lights oriented vertically with red, yellow,

and green as the three signals they can output. This includes many of the traffic lights within the

US. However, because of the broad potential applications of the study’s CapsNet, there is no

specific reason that the results cannot be generalized to include horizontal traffic lights and those

with other signals, e.g., green left arrows and blinking red lights, suggesting a population of all

traffic lights.

Sample

The total sample of images was 318 pictures of traffic lights taken in relatively similar

settings with differing lighting, size, and signal orientation from the Udacity ROSbag (file

format) of images from Carla (Dosovitskiy et al., 2017). This is a smaller dataset for machine

learning applications, which suits a baseline study: if the trained algorithm avoids overfitting on the training dataset, that will support the ability of the CapsNet algorithm to be applied to larger amounts of data. Additionally, the fact that these images are from a self-driving car

helps boost the reliability of the results in their application to real-world autonomous driving

applications.

The researcher broke these 318 images further into 155 images for training and 163 for

testing and validation. The 155 training images contained 63 images of red traffic lights, 48

images of green traffic lights, and 44 images of yellow traffic lights, while the validation set had

67 images of red traffic lights, 88 images of green traffic lights, and 8 images of yellow traffic

lights. The differing proportions of red, green, and yellow lights within the training and

validation sets serves to mirror the differing signals self-driving vehicles will be exposed to and

ensures the network isn’t expecting a certain proportion of each signal. In particular, the low

number of yellow traffic lights in the testing set mimics the relatively low number of yellow

lights in real traffic and shouldn’t affect results significantly if the network operates well.

Materials/Instruments

As previously described, the researcher used data from the Udacity set of images from

Carla, a self-driving vehicle (Dosovitskiy et al., 2017). Before use in the network, the program

shuffled the order of the dataset, building internal validity by the randomization of all possible

assignments within the network construction process. This included the order of the data during

training and validation, the subsamples of data used for training, and the subsamples of data used

for validation. The researcher also controlled as many variables as possible, including the

number of steps allowed within training and validation, the dataset used, and the computer the

algorithm executed on.
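
That shuffle step might look like the following sketch, which permutes images and labels in unison so each image stays aligned with its ground-truth label (the seed handling is illustrative):

```python
import numpy as np

def shuffle_in_unison(x, y, seed=None):
    """Shuffle image and label arrays with the same permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

# Randomize the dataset order before splitting into training/validation:
# x, y = shuffle_in_unison(x, y)
```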

The researcher completed the entire project within Python, a popular programming

language for building machine learning applications. Additionally, the researcher utilized the TF

and Keras Python frameworks, which implement a large number of classes, functions, and

objects for ML. Both frameworks provide simple, pre-implemented methods for developing the

ML algorithm, measuring the change in the dependent variable (accuracy), and recording the

artificial neural network at every step. Since both of these frameworks have operated

successfully in the construction of other image recognition neural networks, their use here should

not cause error. The researcher utilized other Python packages, such as PIL, glob, and OpenCV,

mostly to avoid the need for reimplementation of basic methods and their use should have no

bearing on the study results.



Additionally, the researcher built the CapsNet by modifying and building on the

algorithm created by Kumar et al. (2018) for the classification of traffic signs in the ways

expressed in the previous section. Should the CapsNet operate at a similar level of success as

within the Kumar et al. (2018) research, it would promote the inter-rater reliability of the

CapsNet structure, because the model, run on two different datasets in two different contexts,

returned similar results. This would support the idea that the model is operating correctly and is

not strong for just one set of data samples. This would also improve the external validity of the

project by suggesting that the success of CapsNets in these two areas could extend to more image

perception problems in machine learning.

Operational Definition of Variables

Independent Variables: The model, build, and design of the ML system that the

researcher has trained and tested. This includes the structure of the parts of the Artificial Neural

Network, the type of the artificial neural network (CapsNet in this case), how the data is

preprocessed, and the hyperparameters (learning rate, optimizer) applied to the network.

Dependent Variable: Final accuracy of the CapsNet operating on validation/evaluation

data after training. According to Google Developers “…accuracy is the fraction of predictions

our model got right” (“Classification: Accuracy,” 2018).
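
In code, that definition reduces to the fraction of correct predictions, as in this illustrative computation (not the study's evaluation script):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions the model got right."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 0.75
```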

Data Collection, Processing, and Analysis

The data is numerical. The final accuracy is a single number taken at the end of training

from a table of validation/evaluation accuracy over iteration, while loss is a number measured

per step as a sum of residual squares. Final accuracy is the number being used to evaluate the

success of the product, while loss simply informs the researchers of how the model’s accuracy

changed over training and validation. The researcher analyzed the model solely based on the

final accuracy achieved after convergence. This is appropriate because accuracy is the only value

of the network being considered. The purpose is to determine the ability of the CapsNet to

classify traffic light signals from images, not its performance, so the only necessary value is the

final accuracy after convergence. The builder’s role in this case was simply to observe the

accuracy curve after a certain number of iterations and attempt to troubleshoot logic or syntax

issues with the code until the model begins to converge to an accuracy value greater than 80%. This indicates that the model is actually learning from the data. At this point the

builder’s role is to determine where the model stops learning⎯the accuracy and loss curves

plateau⎯and record the final accuracy and number of iterations needed.
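
One simple way to flag such a plateau programmatically (an illustrative heuristic, not part of the study's code, assuming the `history` object returned by Keras `model.fit`):

```python
def has_plateaued(values, window=5, tol=1e-3):
    """Return True once the last `window` accuracy readings vary < tol."""
    if len(values) < window:
        return False
    recent = values[-window:]
    return max(recent) - min(recent) < tol

val_acc = history.history["val_accuracy"]
if has_plateaued(val_acc):
    print("converged at", val_acc[-1], "after", len(val_acc), "iterations")
```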

Assumptions

The researcher assumed that the datasets accurately represented the population of traffic

lights that autonomous motor vehicles would encounter in practice. While traffic lights do not

vary too greatly, some alterations exist in structure and orientation based on locale.

The researcher assumed that the performance of the produced CapsNet after training

accurately models the performance of a theoretical CapsNet trained more extensively with a

larger amount of data. This is a fair assumption, because the research by Kumar et al. (2018) uses

significantly more data (p. 4547) and still succeeds with the same model.

The researcher assumed that the dataset developers annotated the datasets with the correct

bounding boxes and signal. This should prove true, because the dataset has been used previously

by other researchers.

The researcher assumed that the power of the Central Processing Unit and processors of

the computer used during training and testing will not affect accuracy results, as there is no

reason to believe they will.

The researcher assumed that the ML algorithm can operate at full potential within the TF

framework and Python language. There is no reason to believe the framework should limit the

research in any way.

Limitations

There are many different types of ML algorithms and neural networks. The research

stayed confined to the performance of CNNs and CapsNets due to their relevance and current use

within the field.

The research studied traffic light image recognition in the context of only three

classes⎯red, green, and yellow⎯and with a relatively small dataset of about 300 images total.

The research only investigated the performance of CapsNets within the TF/Keras

framework and did not attempt to reconstruct the design within any other ML framework. This

should not affect the external validity of the research at all because the network structure is the

same regardless of implementing framework.

The research stayed within the ML area of artificial intelligence and did not examine

other areas of artificial intelligence within autonomous driving. The research should not be

applied beyond machine learning, so this should not affect the universality of the research.

Delimitations

The research limited itself to a few levels of image quality and dimensions with the

understanding that practically applied autonomous driving applications will have similar or

greater levels of image quality.

While still attempting to identify relatively small traffic lights with artificial neural

networks, the research scaled all traffic lights to a 32x32px size to have a consistent data input

size. This varied the resolution of the images based on the original light size but did not vary the

final input size of the image.

The research limited itself to the study of accuracy, with the understanding that a high

final accuracy indicates the ability for optimization in terms of performance and speed on more

suitable processing equipment.

Summary

The researcher reconstructed and adapted the CapsNet system produced by Kumar et al.

(2018) to evaluate traffic light image data instead of traffic sign image data. This traffic light

data came from the Udacity Carla images (Dosovitskiy et al., 2017) and only included vertical red,

green, and yellow traffic lights, but could extend to a population of horizontal and non-traditional

traffic lights due to the previous example of Kumar et al.’s success with many classes. The

researcher built the model using TF and Keras, which are strong choices as machine learning frameworks due to their previous use by researchers in image recognition (Kumar et al., 2018, p.

4547; Huang et al., 2017, p. 2). The researcher assumed that the specifics of the study (which data, framework, and computer were used) should not affect the

results because of the focus on final accuracy and the reliability of these systems in previous

research. The researcher also limited the study to only the viability of CapsNets within traffic

light image recognition.



Chapter 4: Findings

The purpose of the research is to determine the viability of CapsNets to improve final

accuracy in traffic light image recognition. The researcher constructed a CapsNet and tested it on

traffic light image data to determine how strong the CapsNet architecture is at classifying. After

training and testing, the CapsNet performed very well, reaching high levels of accuracy and

supporting the hypothesis. This supports the potential of CapsNets to change the field of traffic

light image recognition.

Results

The CapsNet had a final validation accuracy of 100% after 40 steps. The criterion for evaluation was the final validation accuracy of the neural network, and the threshold set by the hypothesis was 85% final accuracy. The CapsNet converged to 100% final accuracy, well past that threshold. From this, the researcher concluded that the hypothesis that the CapsNet would classify traffic light images at greater than 85% final validation accuracy was supported.

Evaluation of Findings

The results supported the hypothesis very strongly, with the network converging to 100%

validation accuracy by the 40th step. This is significantly higher than the threshold set in

the hypothesis at 85%. It is also slightly higher than the accuracy of the research done by Kumar

et al. (2018) using the same model, which was 97.6% (p. 4547). This could be because Kumar et

al. (2018) had a data set of 12,630 testing images (p. 4547), which gave the network significantly more opportunities to make a wrong classification than in this research, which had

only 163 testing images. Additionally, the difference could be because Kumar et al.’s (2018)

research had 43 different classes of traffic sign to classify (p. 4546), while this research had only

3 classes of traffic light. Overall, the findings fall in line with that study and also the expectations

for the network, strongly supporting the potential for CapsNets to improve traffic light image

recognition applications.

[Figure 1 image omitted: line plot of validation accuracy (y-axis, 0 to 1.2) against iteration in steps (x-axis, 0 to 45).]

Figure 1. CapsNet Validation Accuracy. This figure displays the accuracy of the CapsNet in classifying the validation dataset.

This is reinforced by Figure 1, which shows the model’s validation accuracy curve. The

curve shows that, with further refinement, the amount of training needed to reach a high level of

accuracy could decrease even further, because after step 15 the model stays at 100% validation

accuracy aside from one deviation.

Summary

The researcher found that the CapsNet was able to converge to a final validation accuracy of

100% after about 40 iterations. This strongly supports the hypothesis, which set the criteria for

evaluation at 85% final validation accuracy, and also mirrors the results of the Kumar et al.

(2018) study, which used a very similar model to this research. Therefore, CapsNets can classify

traffic light images by signal and may constitute an improvement over current approaches to

traffic light image recognition.



Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions

The current approach to traffic light image recognition uses CNN technology (Huang et

al., 2017, p. 1), but this is an issue because CNNs do not retain positional pose information

(Hinton, 2014, 6:57) and do not dynamically route data through layers (Sabour et al., 2017, p. 2),

which might reduce their accuracy and performance significantly. This research tested CapsNets

for the problem of traffic light image recognition, because they do utilize pose information and

dynamically route information through specialized capsules (Sabour et al., 2017, p. 2). The

researcher did this by altering the CapsNet model constructed by Kumar et al. (2018) to accept

traffic light image information instead of traffic sign image information and changing the

construction of the capsules. The results very strongly supported the hypothesis, with the

network quickly converging to a 100% validation accuracy. The previous work done with

CapsNets by Kumar et al. (2018) supports the idea that the study’s limitations are not the cause

for this high accuracy, suggesting that further research should be done on using

CapsNets in image perception and autonomous driving.

Implications

The main research question was “How do CapsNets perform when implemented in traffic

light image recognition?” The research supports the idea that CapsNets perform very well when

implemented in traffic light image recognition, as the CapsNet surpassed the hypothesis

threshold of 85% final validation accuracy. The most impactful limitation to the research is the

small size of the training and validation datasets and the limitation to only three classes⎯red,

green, and yellow traffic lights⎯within the datasets. However, this limitation should not color

the interpretation of the research, because Kumar et al.’s (2018) research has already shown that

CapsNets can handle large amounts of data and large numbers of classes (p. 4547). The only

issue in question was whether CapsNets could accurately classify traffic light images, and the evidence from this research supported an affirmative answer. The results represent a

possible improvement to the efforts made by Huang et al. (2017) and Liu et al. (2016) to increase

the optimization of traffic light image recognition, where CapsNets could serve as more accurate

solutions to the classification parts of their recognition algorithms.

Real World Connections

Researchers should practically apply this study in the field of autonomous driving, where

more manufacturers could explore the potential of CapsNets to improve efficiency and

performance of their self-driving vehicles. The quick convergence of this study’s CapsNet to 100%

validation accuracy suggests that CapsNets may prove very strong in classifying images and as

tools in a variety of perception problems in the industry.

Additionally, given this study’s CapsNet’s high validation accuracy, manufacturers of

autonomous vehicles should research CapsNet performance further to see if they have a more

favorable performance-accuracy tradeoff than the prevailing CNN architecture does.

Recommendations

There are a number of other perception problems within autonomous driving that are

different enough from traffic light image classification to warrant further research with a

CapsNet architecture. One specific area is research into the field of traffic light image detection,

which is more computationally involved and may gain more from the CapsNet’s use of

positional data than an image classification problem. Since accuracy was high when it comes to

classification for both traffic lights and traffic signs (Kumar et al., 2018, p. 4547), the results

could be informative when CapsNets are used in image detection. Another area where further

research should occur is a repetition of this study but with out-of-model improvements to

CapsNet architecture. CNNs have led in image recognition for quite a while (Huang et al., 2017,

p. 2) and as such have had optimizations in many places outside of just the model construction,

including the use of probabilistic frameworks (Ghahramani, 2015, p. 452) and the incorporation

of supplementary information like traffic sign location (Fairfield & Urmson, 2011, p. 6). As

such, researchers should attempt to also provide these auxiliary improvements to CapsNets as

well and test to see potential further improvements in accuracy. Because the research findings

show that CapsNets can already reach 100% validation accuracy in traffic light image

recognition, more information could speed up the training or improve the accuracy-to-performance tradeoff.

Conclusions

This research found that CapsNet architecture can successfully classify traffic light

images by signal. Additionally, because of previous research done with the same CapsNet

architecture (Kumar et al., 2018), it seems unlikely that this high accuracy is due to a small

dataset or small number of classes. As a result, it seems likely that CapsNets could operate

successfully in a number of other ML problems throughout autonomous driving, especially other

non-classification image perception problems like traffic light detection and traffic sign

detection. The findings also support further autonomous driving research into optimization of

CapsNets and determining whether CapsNets have a smaller accuracy-to-performance tradeoff than

CNNs.

References

Classification: Accuracy. (2018, October 1). Retrieved from

https://developers.google.com/machine-learning/crash-course/classification/accuracy

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open

urban driving simulator. 1st Conference on Robot Learning. [Data Files] Retrieved from

Proceedings of Machine Learning Research database.

Fairfield, N., & Urmson, C. (2011). Traffic light mapping and detection. IEEE International

Conference on Robotics and Automation. doi:10.1109/icra.2011.5980164

Ghahramani, Z. (2015, May 28). Probabilistic machine learning and artificial

intelligence. Nature, 521(7553), 452-459. doi:10.1038/nature14541

Hinton, G. E., Ghahramani, Z., & Teh, Y. W. (2000). Learning to parse images. Advances in

Neural Information Processing Systems. Retrieved from the NIPS Proceedings database.

Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011). Transforming auto-encoders. Lecture

Notes in Computer Science Artificial Neural Networks and Machine Learning – ICANN

2011, 44-51. doi:10.1007/978-3-642-21735-7_6

Hinton, G. E. (2014, December 4) What's wrong with convolutional nets? [Video File].

Retrieved from https://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets

Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017).

Speed/accuracy trade-offs for modern convolutional object detectors. 2017 IEEE



Conference on Computer Vision and Pattern Recognition (CVPR).

doi:10.1109/cvpr.2017.351

Kumar, A. D., Arthika, R. K., & Parameswaran, L. (2018). Novel deep learning model for traffic

sign detection using capsule networks. International Journal of Pure and Applied

Mathematics, 118(20), 4543-4548. Retrieved from the Academic Publications database.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.

(1989a). Handwritten digit recognition with a back-propagation network. In D. S.

Touretzky (Ed.), Advances in neural information processing systems (pp. 396–404).

Cambridge, MA: MIT Press.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.

(1989b). Backpropagation applied to handwritten zip code recognition. Neural

Computation, 1(4), 541–551. doi:10.1162/neco.1989.1.4.541

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to

document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

doi:10.1109/5.726791

Lim, K., Hong, Y., Choi, Y., & Byun, H. (2017). Real-time traffic sign recognition based on a

general purpose GPU and deep-learning. PLoS ONE, 12(3).

doi:10.1371/journal.pone.0173317

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. C. (2016). SSD:

Single shot MultiBox detector. Computer Vision – ECCV 2016 Lecture Notes in

Computer Science, 21-37. doi:10.1007/978-3-319-46448-0_2



Nilsson, N. J. (2010). The quest for artificial intelligence: A history of ideas and achievements.

Cambridge: Cambridge University Press. Available from Google Books database.

Ranzato, M. A., Huang, F. J., Boureau, Y., & LeCun, Y. (2007). Unsupervised learning of

invariant feature hierarchies with applications to object recognition. In Proceedings IEEE

Conference on Computer Vision and Pattern Recognition, 1-8.

doi:10.1109/CVPR.2007.383157

Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A

comprehensive review. Neural Computation, 29(9), 2352-2449.

doi:10.1162/neco_a_00990

Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. 31st

Conference on Neural Information Processing Systems. Retrieved from the NIPS

Proceedings database.

Stone, P., Brooks, R., Brynjolfsson, E., Calo, R., Etzioni, O., Hager, G., … Teller, A. (2016,

September). Artificial intelligence and life in 2030. One Hundred Year Study on Artificial Intelligence, 1-52. Retrieved from http://ai100.stanford.edu/2016-report

Tyukin, I. Y., Gorban, A. N., Sofeykov, K. I., & Romanenko, I. (2018, August 13). Knowledge

transfer between artificial intelligence systems. Frontiers in Neurorobotics, 12.

doi:10.3389/fnbot.2018.00049
