by
Shenoy, Keshav
Marietta, GA
November 2018
CAPSNETS IN TRAFFIC LIGHT IMAGE RECOGNITION 2
Abstract
Traffic Light Image Recognition is the problem of determining the signal of traffic lights within
images, typically approached with Convolutional Neural
Network (CNN) machine learning systems. However, CNNs have issues managing positional
information and don’t route data dynamically, so researchers have suggested using Capsule
Neural Networks (CapsNets) instead. CapsNets don’t utilize max pooling layers, the cause of
CNN information loss, and do dynamically route information through capsules, so they may
have significant benefits over CNNs. This study utilized the engineering design process,
investigating the ability of CapsNets to classify images of traffic lights by signal by adapting a
CapsNet for that purpose. Researchers modified a CapsNet to accept traffic light image input
data and altered its capsule layers slightly to allow for a 3-class classification problem. The
criterion for evaluation was the final validation accuracy of the CapsNet, with the
threshold for hypothesis support set at 85% final accuracy. The CapsNet reached a validation
accuracy of 100% in less than 40 steps, strongly supporting the hypothesis. This reinforces the
need for a continuation of research, development, and optimization of CapsNets for autonomous
driving problems, especially in image classification, detection, and other perception problems.
Key Words: autonomous driving, machine learning, convolutional neural network, capsule neural
network
Table of Contents
Chapter 1: Introduction....................................................................................................................6
Statement of the Problem...........................................................................................................6
Purpose of the Study .................................................................................................................6
Research Question.....................................................................................................................6
Hypothesis Statement.................................................................................................................6
Significance of the Study...........................................................................................................7
Definition of Key Terms............................................................................................................7
Summary....................................................................................................................................8
Chapter 4: Findings........................................................................................................................25
Results......................................................................................................................................25
Evaluation of Findings.............................................................................................................25
Summary .................................................................................................................................26
References…..................................................................................................................................31
LIST OF ABBREVIATIONS
ML Machine Learning
TF TensorFlow
Chapter 1: Introduction
Autonomous vehicles need accurate programming and perception tools to analyze the
world around them and make appropriate decisions quickly. One such perception problem in this
field is determining the signal of traffic lights around the autonomous vehicle. Traffic lights are
entirely color-based, so self-driving vehicles must use camera images to determine their signal,
not radar, sonar, or another technology (Fairfield & Urmson, 2011, p. 2). To determine the signal
of a traffic light from an image, researchers have turned to machine learning, specifically looking
at CNN-type artificial neural networks (Fairfield & Urmson, 2011, p. 2).
According to Hinton (2014), CNNs have multiple flaws when it comes to evaluating
positional data (pose) and dynamic data routing (6:10), so CNN approaches to traffic light image
recognition may lack viability and another type of artificial neural network, the CapsNet, may do
better.
Purpose of the Study
This study will determine the potential of CapsNets to improve final accuracy in traffic
light image recognition when compared to CNNs by constructing a CapsNet to classify traffic
light images by signal.
Research Question
Hypothesis Statement
A trained CapsNet will recognize traffic lights from a traffic light image dataset with
a final validation accuracy of at least 85%.
Significance of the Study
By investigating a new architecture for traffic light image
recognition in autonomous driving, this research can support or fail to support a shift in resources
towards further CapsNet research. The potential for a more powerful and accurate successor to
the CNN is very significant, because CNNs are currently at the forefront of object recognition
(Hinton, 2014, 7:00). Improving upon the capabilities of CNNs with CapsNets could change how
researchers approach image recognition problems and push forward the adoption of autonomous
vehicles.
Definition of Key Terms
Artificial Intelligence: Nilsson (2010) defines it as "Artificial intelligence is that activity devoted
to making machines intelligent, and intelligence is that quality that enables an entity to function
appropriately and with foresight in its environment."
Artificial Neural Network: A machine learning system built from many interconnected
"neurons," loosely based on the organization of certain neurons in human brains (Rawat &
Wang, 2017).
Autonomous driving: An emerging technology in which artificial intelligence will control the
operation of a vehicle with little or no human input.
Convolutional Neural Network: A type of feedforward artificial neural network built from layers
of convolutional and pooling operations that extract features from input images.
Capsule Neural Network: A type of artificial neural network that modifies convolutional neural
networks by segmenting groups of neurons into capsules for the better evaluation of positional
(pose) data.
Image Recognition (or Image Classification): "…the task of categorizing images into one of
several predefined classes" (Rawat & Wang, 2017).
Convolutional Layers: "…serve as feature extractors, and thus they learn the feature
representations of their input images" (Rawat & Wang, 2017).
Machine Learning: “…the design of learning algorithms, as well as scaling existing algorithms,
to work with extremely large data sets” (Stone et al., 2016, p. 9).
Pooling Layer: LeCun et al. (1989a), LeCun et al. (1989b), LeCun et al. (1998), and Ranzato,
Huang, Boureau, and LeCun (2007) claimed that pooling layers “…reduce the spatial resolution
of the feature maps and thus achieve spatial invariance to input distortions and translations" (as
cited in Rawat & Wang, 2017).
Pose: A specific type of positional data, including position, orientation, scale, deformation,
velocity, color, and more, which is recorded by CapsNets (Hinton, 2014, 3:23).
Summary
Traffic light image recognition is the problem of classifying images of traffic lights based
on the signal of the traffic light in the image. The CNN model currently leads the field in
performing traffic light image recognition (Fairfield & Urmson, 2011, p. 1), but CNNs have issues
with retaining positional data and routing data through layers (Hinton, 2014, 6:10). As such, this
study seeks to investigate the performance of CapsNets when implemented in traffic light image
recognition. If the CapsNet has a final validation accuracy above 85%, it will support the idea
that CapsNets have the potential to replace CNNs within image recognition.
Chapter 2: Literature Review
Currently, the field of image object recognition within ML is increasing in importance for
a number of different applications. Specifically, Fairfield and Urmson (2011) discussed its
growing significance in the field of autonomous driving, where researchers use artificial neural
networks in combination with cameras to build perception systems (p. 1). They specifically cited
the issue of traffic light image recognition, which alternative measures like sonar or radar cannot
perform, because interpreting traffic lights requires knowledge of color (Fairfield & Urmson,
2011, p. 1). As such, a large amount of development has gone into designing the best learning
model for the task.
So far, Huang et al. (2017) have found that the CNN is the most successful ML model for
image recognition problems (p. 1). As proof, Huang et al. (2017) referenced the usage of Faster
R-CNNs, R-FCNs, and SSDs – systems based on a CNN framework – in multiple tech products
as evidence of the current dominance of the CNN (p. 1). Lim, Hong, Choi, and Byun (2017)
explained further, describing CNN architecture as one where a network feeds image data through
a series of deep (convolutional) and pooling layers to extract features for classification (p. 11).
They explained that CNN technology is state-of-the-art, needing only one network to accurately
classify many different classes of traffic signs (Lim et al., 2017, p. 10).
Despite this, significant issues with the CNN model exist when applied to object
detection/classification problems like traffic light image recognition. Liu et al. (2016) identified
balancing speed performance and accuracy as one important problem (p. 21). To alleviate some
of this, Liu et al. (2016) proposed the SSD (Single Shot MultiBox Detector) – a “deep network
based object detector that does not resample pixels or features for bounding box hypotheses
and is as accurate as approaches that do" (p. 22). This is still a CNN, but Liu et al. designed it
specifically to achieve strong accuracy without abandoning performance (Liu et al., 2016, p. 23).
By replacing bounding box proposals with a convolutional filter, Liu et al. (2016) constructed
a model that operates at higher frames per second than previous approaches with Faster R-CNN,
another CNN variant introduced to solve the same problem (p. 36). In a comprehensive review of
approaches to traffic light image recognition, Huang et al. (2017) contrasted with Liu et al.’s
opinion, suggesting that Faster R-CNN operates at a similar speed as SSD when Faster R-CNN
minimizes bounding box proposals (p. 34). This makes Faster R-CNN a more desirable CNN
architecture, because it maintains high accuracy on very small objects, while SSD accuracy
declines (Huang et al., 2017, p. 14). Meanwhile, Lim et al. (2017) took a different approach to
the problem, combining a support vector machine (SVM) – a classification system which does
not utilize neural networks – with CNN technology to improve results (p. 2).
They utilized SVMs first to verify the image and a CNN afterwards to classify the image (Lim et
al., 2017, p. 2). Lim et al.’s (2017) combination worked out, forming a system able to classify
images in real time with 97.9% average accuracy, with improved accuracy specifically in
more difficult cases (Lim et al., 2017, p. 2).
All three of the above approaches, Liu et al. (2016), Huang et al. (2017), and Lim et al.
(2017), attempted and succeeded in improving CNN accuracy by altering the workings or
variant of the model itself, but other researchers have attempted to improve the CNN through
external changes to structure or learning strategy. A strong example of this is Fairfield and
Urmson (2011), who showed the ability of mapped traffic lights to improve detection results
within a model (p. 6). By mapping the location of traffic lights against current location of the
vehicle, their network could predict when it should expect to detect traffic lights and when it
should expect not to, reducing false positives and false negatives (Fairfield & Urmson, 2011, p.
6). Ghahramani (2015) took a more technical approach, exploring the ability for probabilistic
frameworks – models which “make predictions about future data, and take decisions that are
rational given these predictions” (p. 452) – to increase accuracy. This approach adds another
level onto machine learning, encouraging systems to attempt to solve problems before they occur
and speed up training (Ghahramani, 2015, p. 458). Tyukin, Gorban, and Romanenko (2018) did
something similar by considering the use of multiple ML models within a teacher-student model,
which would speed up the training of classification algorithms and improve the universality of
models in application to data (p. 1). They improved on previous work in the field by creating a
framework for the teacher-student model which requires less raw data and training (Tyukin et al.,
2018, p. 2). Though not implemented within the context of autonomous driving, the success of
the model within CNN image recognition suggests its potential for the field.
Together, these CNN models form the basis for researchers’ current approaches to
optimizing the balance between accuracy and performance by constructing new variants of CNN
with different traits and layer make-ups (Liu et al., 2016, p. 22; Huang et al., 2017, p. 14; Lim et
al., 2017, p. 2), sometimes even incorporating other machine learning systems, like an SVM, as a
supplement to the CNN framework (Lim et al., 2017, p. 2). Outside of the actual model,
researchers have attempted to improve image recognition technology by finding other areas to
supply supplementary data, like Fairfield and Urmson’s (2011) mapping of traffic lights to
provide context for classifications (p. 6) or Ghahramani’s (2015) application of probability data
to ML frameworks (p. 452). Finally, some researchers, like Tyukin et al. (2018), have looked not
at speeding up performance, but speeding up training or training on limited data (p. 2).
Hinton (2014) has challenged the CNN architecture itself, referencing their lack of
structure as a major flaw in their handling of positional data (1:47). As a way to
fix this, Hinton, Krizhevksy, and Wang (2011) proposed CapsNets, artificial neural networks
similar to CNNs but with layers loosely replaced with “capsules” (p. 45). According to Hinton
(2014), capsules would output likelihood of feature presence and pose information, the
positional data that CNNs struggle to retain (3:23).
First, Hinton (2014) claims, capsules would improve massively on the current CNN
practice of max pooling, which reduces the available information in a subsampling procedure
(6:57). CapsNets eliminate pooling completely, instead using coincidence filtering to find
clusters of inputs at high dimensions, removing unwanted background inputs while keeping the
useful pose data (Hinton, 2014, 5:26). Secondly, Sabour, Frosst, and Hinton (2017) pointed out
the benefits of capsules for the dynamic routing of information, which would allow the
specialization of specific capsules for certain tasks (p. 2). This contrasts with max pooling, which
Sabour et al. (2017) stated will "throw away information about the precise position of the entity
within the region" (p. 2), like the pose information, because it keeps only the most active input
vector. These two effects, the removal of subsampling and the introduction of dynamic routing,
could improve performance in problems like image segmentation and separation, like that
performed by Hinton, Ghahramani, and Teh (2000, p. 1) and Sabour et al. (2017); traffic sign
image recognition, like that done by Lim et al. (2017); and
shape analysis, like that described by Hinton (2014, 15:15). In fact, Kumar, Arthika, and
Parameswaran (2018) have already implemented CapsNets in traffic sign image recognition with
strong results: 97.6% accuracy and 0.0311038 loss at the end of validation (p. 4546). Researchers
have not yet applied CapsNets to the primary issue of this research, traffic light image
recognition, but the strong results in the similar field of traffic sign image recognition provide a
promising foundation.
From the literature, it becomes clear that there are numerous areas for potential
improvement within CapsNets that do not exist in CNNs. These include the elimination of
information loss from down-sampling suggested by Hinton et al. (2011, p. 50) and by Hinton
(2014, 6:55), as well as within dynamic routing between capsules to enable specialization
(Sabour et al., 2017, p. 2). Sabour et al. (2017) go so far as to state that, "The fact that a simple
capsules system already gives unparalleled performance at segmenting overlapping digits is an
early indication that capsules are a direction worth exploring" (p. 9). This supports the
conclusion that CapsNets, if developed at the same level as CNNs have enjoyed, could become
one of the leading approaches towards image classification. The success of Kumar et al.
(2018) necessitates further research into the CapsNet architecture, especially within the context
of autonomous driving.
Summary
Right now, researchers approach traffic light image recognition through the use of CNNs
with additional supplemental data (Fairfield & Urmson, 2011, p. 1). This approach has
succeeded in achieving a baseline level of efficiency, but performance eventually plateaus,
forcing CNNs to sacrifice accuracy for speed and vice versa (Liu et al., 2016, p. 21).
Researchers have managed to make some progress by optimizing the build of the CNNs (Liu et
al., 2016, p. 22; Huang et al., 2017, p. 14; Lim et al., 2017, p. 2), the supplemental data provided
(Fairfield & Urmson, 2011, p. 6; Ghahramani, 2015, p. 452) or the amount of training needed
(Tyukin et al., 2018, p. 1), but Hinton (2014) claims that the problem is with the CNN model
itself (1:47). Hinton et al. (2011) point out that a different structure called a CapsNet could
eschew the use of max pooling and preserve positional pose data by using capsules instead of
layers (p. 49). Sabour et al. (2017) also identify that a Capsule Neural Network could specialize
its routing of data through different capsules, improving accuracy and analysis of pose data (p.
2). These two potential benefits, as well as the success of Kumar et al. (2018) in a similar field,
support the need for further study into the use of CapsNets in traffic light image recognition.
Chapter 3: Research Method
Autonomous vehicles need accurate programming and perception tools to analyze the
world around them and make appropriate decisions quickly. One such perception problem in this
field is determining the signal of traffic lights around the autonomous vehicle. Traffic lights are
entirely color-based, so self-driving vehicles must use camera images to determine their signal,
not radar, sonar, or another technology (Fairfield & Urmson, 2011, p. 2). To determine the signal
of a traffic light from an image, researchers have turned to machine learning, specifically looking
at CNN-type artificial neural networks (Fairfield & Urmson, 2011, p. 2). However, according to
Hinton (2014), CNNs have multiple flaws when it comes to evaluating positional data (pose) and
dynamic routing, so other types of artificial neural network like CapsNets may be a better
solution (6:10). This study will determine the potential of CapsNets to improve final accuracy in
traffic light image recognition by constructing a CapsNet to classify traffic lights images by
signal.
First, the researcher examined CNNs and CapsNets implemented previously for similar
problems. By basing the foundational areas of the design on models previously shown to have
success, the researcher established the model on a stable basis from which to start the design
process. Specifically, the researcher focused on the CapsNet research done by Kumar et al.
(2018). Kumar et al.’s (2018) CapsNet research serves as an appropriate starting place because it
successfully classified traffic signs, a similar problem to that of traffic light image recognition.
By using the same base model, the researcher could accurately evaluate results in context. This
also informed the choice of hypothesis for the research. Since Kumar et al. (2018) constructed
their model in the context of traffic signs, a slightly lower accuracy – 85% – makes sense for
this research.
Second, the researcher selected the data for training and validation. In this case, the
researcher took data from Udacity's bag – as in ROS bag files – of images from Carla
(Dosovitskiy, Ros, Codevilla, Lopez, & Koltun, 2017). These are images taken by the self-
driving car Carla (Dosovitskiy et al., 2017). The researcher chose this dataset because of its
relative simplicity, containing only three classes and with a range of different sizes for traffic
lights.
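Since the dataset contains only three signal classes, the ground-truth labels reduce to a small fixed mapping. The sketch below is illustrative only; the actual label strings and encoding used in the study are not specified here, so the names are assumptions.

```python
# Hypothetical label encoding for the three-class problem; the real dataset's
# label strings may differ.
CLASSES = {"red": 0, "yellow": 1, "green": 2}

def one_hot(signal):
    # Encode a ground-truth signal as a length-3 vector for a 3-class output layer.
    vec = [0.0] * len(CLASSES)
    vec[CLASSES[signal]] = 1.0
    return vec
```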
Third, the researcher built an algorithm to properly process input data into a format
understandable by the CapsNet. The researcher achieved this by developing a Python program
which read the images and the accompanying xml/yaml files with the ground-truth labels and
cropped the images into just the traffic lights. Then, the program scaled each image to exactly
32x32 pixel size so all the inputs would have the same dimensions. These two functions used the
OpenCV, glob, and PIL Python packages and were necessary to remove the image backgrounds,
which would hurt the ability of the network to properly classify images. The researcher chose the
32x32 pixel size to minimize the amount of RAM necessary to run and process all the images.
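As a rough illustration of the cropping and scaling steps just described, the stand-in functions below operate on a nested-list image rather than through the OpenCV/PIL calls the study actually used; all names are hypothetical.

```python
def crop(pixels, xmin, ymin, xmax, ymax):
    # Keep only the region inside the ground-truth bounding box.
    return [row[xmin:xmax] for row in pixels[ymin:ymax]]

def resize_nearest(pixels, out_w=32, out_h=32):
    # Nearest-neighbour scaling so every traffic light reaches a uniform 32x32 size.
    in_h, in_w = len(pixels), len(pixels[0])
    return [[pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

# A toy 4x4 "image" whose pixels record their own (row, col) position,
# cropped to its central 2x2 region and then scaled up to 32x32.
image = [[(y, x, 0) for x in range(4)] for y in range(4)]
light = resize_nearest(crop(image, 1, 1, 3, 3))
```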
From here, the program fed the images into a separate researcher-developed Python
program, which read each pixel of the images from the traditional file format into a 3-dimensional array of
pixel values. One array represented each image, with height, width, and RGB making up the 3
dimensions. The code also found the ground-truth for the signal of each image and appended
them into a 1-dimensional array of the same sample size as the 3-dimensional one. The program
split both of these two arrays into two sets, training data and validation data, for a total of 4 input
arrays into the network. Prior to entering the CapsNet, another custom program preprocessed the
data, enhancing brightness and contrast using PIL and dividing all the values by 255 to keep
values within a range of zero to one. However, before the data could train in the CapsNet, some
modifications to the network itself were necessary.
Kumar et al. (2018) constructed their CapsNet with three main layers: the input layer,
the primary capsule layer, and the traffic sign capsule layer (p. 4546). The changes to the input
were already described above. The primary capsule layer feeds input into two convolutional
layers – which utilize a rectified linear unit activation function and 0.7 dropout – and then into
the primary capsules, sixteen filters of 1600 capsules (p. 4546). The researcher did not alter the
primary capsule layer at all. Kumar et al.’s (2018) traffic sign capsule layer consisted of 43
capsules, one for each class in the traffic sign dataset (p. 4546). This was changed to 3 for this
research, because there are only three classes of traffic light in this dataset. The researcher did
not alter the CapsNet's reconstruction network, which provides feedback to the network after
each training step.
When the researcher completed these changes to the CapsNet, training could begin. The
data entered directly into the neural network architecture, training for 100 iterations with a batch
size of 50 for both neural networks. After each iteration of training, the network evaluated itself
on separate validation data. Both trainings and validations occurred within the Keras and
TensorFlow ML frameworks. At the end of the training and validation iterations, the network
plotted loss and accuracy curves with TensorBoard using the event files generated by the
CapsNet.
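A framework-free sketch of the data handling just described follows; the study used Keras and TensorFlow, so these plain-Python function names are assumptions rather than the study's actual code.

```python
def normalize(image):
    # Divide every RGB value by 255 so inputs lie in the zero-to-one range
    # (the PIL brightness/contrast enhancement step is omitted here).
    return [[[c / 255 for c in px] for px in row] for row in image]

def train_val_split(images, labels, n_train):
    # Produce the four input arrays fed to the network: training and
    # validation images with their matching label arrays.
    return images[:n_train], labels[:n_train], images[n_train:], labels[n_train:]

def batches(data, size=50):
    # Yield mini-batches matching the study's batch size of 50.
    for i in range(0, len(data), size):
        yield data[i:i + size]
```

Training would then loop for 100 iterations over such batches, evaluating on the validation arrays after each iteration; in Keras this per-iteration validation corresponds to supplying validation data alongside the training call.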
Population
The direct population of this study is traffic lights oriented vertically with red, yellow,
and green as the three signals they can output. This includes many of the traffic lights within the
US. However, because of the broad potential applications of the study’s CapsNet, there is no
specific reason that the results cannot be generalized to include horizontal traffic lights and those
with other signals, e.g., green left arrows and blinking red lights, suggesting a population of all
traffic lights.
Sample
The total sample of images was 318 pictures of traffic lights taken in relatively similar
settings with differing lighting, size, and signal orientation from the Udacity ROSbag (file
format) of images from Carla (Dosovitskiy et al., 2017). This is a smaller dataset for machine
learning applications, which is good for a baseline study: if the trained algorithm avoids
overfitting on the training dataset, this will support the ability of the CapsNet algorithm to be applied
to larger amounts of data. Additionally, the fact that these images are from a self-driving car
helps boost the reliability of the results in their application to real-world autonomous driving
applications.
The researcher broke these 318 images further into 155 images for training and 163 for
testing and validation. The 155 training images contained 63 images of red traffic lights, 48
images of green traffic lights, and 44 images of yellow traffic lights, while the validation set had
67 images of red traffic lights, 88 images of green traffic lights, and 8 images of yellow traffic
lights. The differing proportions of red, green, and yellow lights within the training and
validation sets serves to mirror the differing signals self-driving vehicles will be exposed to and
ensures the network isn't expecting a certain proportion of each signal. In particular, the low
number of yellow traffic lights in the testing set mimics the relatively low number of yellow
lights in real traffic and shouldn’t affect results significantly if the network operates well.
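The split described above can be checked arithmetically; the counts below are taken from the text, while the variable names are illustrative.

```python
# Per-class image counts reported for the training and validation splits.
train_counts = {"red": 63, "green": 48, "yellow": 44}
val_counts = {"red": 67, "green": 88, "yellow": 8}

def proportions(counts):
    # Fraction of each signal within a split, used to inspect class balance.
    total = sum(counts.values())
    return {signal: n / total for signal, n in counts.items()}
```

The totals recover the 155/163 split, and the validation proportions confirm how rare yellow lights are (about 5% of the validation set).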
Materials/Instruments
As previously described, the researcher used data from the Udacity set of images from
Carla, a self-driving vehicle (Dosovitskiy et al., 2017). Before use in the network, the program
shuffled the order of the dataset, building internal validity by the randomization of all possible
assignments within the network constructions process. This included the order of the data during
training and validation, the subsamples of data used for training, and the subsamples of data used
for validation. The researcher also controlled as many variables as possible, including the
number of steps allowed within training and validation, the dataset used, and the computer the
study ran on.
The researcher completed the entire project within Python, a popular programming
language for building machine learning applications. Additionally, the researcher utilized the TF
and Keras Python frameworks, which implement a large number of classes, functions, and
objects for ML. Both frameworks provide simple, pre-implemented methods for developing the
ML algorithm, measuring the change in the dependent variable (accuracy), and recording the
artificial neural network's state at every step. Since both of these frameworks have operated
successfully in the construction of other image recognition neural networks, their use here should
not cause error. The researcher utilized other Python packages, such as PIL, glob, and OpenCV,
mostly to avoid the need for reimplementation of basic methods, and their use should have no
effect on the results.
Additionally, the researcher built the CapsNet by modifying and building on the
algorithm created by Kumar et al. (2018) for the classification of traffic signs in the ways
expressed in the previous section. Should the CapsNet operate at a similar level of success as
within the Kumar et al. (2018) research, it would promote the inter-rater reliability of the
CapsNet structure, because the model, run on two different datasets in two different contexts,
returned similar results. This would support the idea that the model is operating correctly and is
not strong for just one set of data samples. This would also improve the external validity of the
project by suggesting that the success of CapsNets in these two areas could extend to more image
recognition problems.
Independent Variables: The model, build, and design of the ML system that the
researcher has trained and tested. This includes the structure of the parts of the Artificial Neural
Network, the type of the artificial neural network (CapsNet in this case), how the data is
preprocessed, and the hyperparameters (learning rate, optimizer) applied to the network.
Dependent Variables: The final accuracy of the model when evaluated on the validation
data after training. According to Google Developers, "…accuracy is the fraction of predictions
our model got right."
The data is numerical. The final accuracy is a single number taken at the end of training
from a table of validation/evaluation accuracy over iteration, while loss is a number measured
per step as a sum of residual squares. Final accuracy is the number being used to evaluate the
success of the product, while loss simply informs the researchers of how the model’s accuracy
changed over training and validation. The researcher analyzed the model solely based on the
final accuracy achieved after convergence. This is appropriate because accuracy is the only value
of the network being considered. The purpose is to determine the ability of the CapsNet to
classify traffic light signals from images, not its performance, so the only necessary value is the
final accuracy after convergence. The builder’s role in this case was simply to observe the
accuracy curve after a certain number of iterations and attempt to troubleshoot logic or syntax
issues with the code until the model begins to converge to an accuracy value greater than 80%
accuracy. This indicates that the model is actually learning from the data. At this point the
builder's role is to determine where the model stops learning – the point where the accuracy and
loss curves flatten – and record the final accuracy there.
Assumptions
The researcher assumed that the datasets accurately represented the population of traffic
lights that autonomous motor vehicles would encounter in practice. While traffic lights do not
vary too greatly, some alterations exist in structure and orientation based on locale.
The researcher assumed that the performance of the produced CapsNet after training
accurately models the performance of a theoretical CapsNet trained more extensively with a
larger amount of data. This is a fair assumption, because the research by Kumar et al. (2018) uses
significantly more data (p. 4547) and still succeeds with the same model.
The researcher assumed that the dataset developers annotated the datasets with the correct
bounding boxes and signal. This should prove true, because the dataset has been used previously
by other researchers.
The researcher assumed that the power of the Central Processing Unit and processors of
the computer used during training and testing will not affect accuracy results, as there is no
reason to believe processing power affects anything but training speed.
The researcher assumed that the ML algorithm can operate at full potential within the TF
framework and Python language. There is no reason to believe the framework should limit the
algorithm's performance.
Limitations
There are many different types of ML algorithms and neural networks. The research
stayed confined to the performance of CNNs and CapsNets due to their relevance and current
use in image recognition.
The research studied traffic light image recognition in the context of only three
classes – red, green, and yellow – and with a relatively small dataset of about 300 images total.
The research only investigated the performance of CapsNets within the TF/Keras
framework and did not attempt to reconstruct the design within any other ML framework. This
should not affect the external validity of the research because the network structure is the
same regardless of the framework used.
The research stayed within the ML area of artificial intelligence and did not examine
other areas of artificial intelligence within autonomous driving. The research should not be
applied beyond machine learning, so this should not affect the universality of the research.
Delimitations
The research limited itself to a few levels of image quality and dimensions with the
understanding that practical autonomous driving applications will have similar or better image
quality.
While still attempting to identify relatively small traffic lights with artificial neural
networks, the research scaled all traffic lights to a 32x32px size to have a consistent data input
size. This varied the resolution of the images based on the original light size but did not vary the
input dimensions.
The research limited itself to the study of accuracy, with the understanding that a high
final accuracy indicates room for later optimization of performance and speed in more
practical, real-time applications.
Summary
The researcher reconstructed and adapted the CapsNet system produced by Kumar et al.
(2018) to evaluate traffic light image data instead of traffic sign image data. This traffic light
data came from the Udacity Carla images (Dosovitskiy, 2017) and only included vertical red,
green, and yellow traffic lights, but could extend to a population of horizontal and non-traditional
traffic lights due to the previous example of Kumar et al.’s success with many classes. The
researcher built the model using TF and Keras, which are strong choices of machine learning
frameworks due to their previous use by researchers in image recognition (Kumar et al., 2018, p.
4547; Huang et al., 2017, p. 2). The researcher assumed that the specifics of the study (which
data, framework, and computer were used) should not affect the results because of the focus on
final accuracy and the reliability of these systems in previous
research. The researcher also limited the study to the viability of CapsNets within traffic light
image recognition.
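The capsule-layer adaptation rests on the formulation of Sabour et al. (2017), in which each class is represented by an output capsule vector and the predicted class is the capsule with the greatest length. The following is a minimal numerical sketch of that decision rule; the array shapes and function names are illustrative, not the study's exact code:

```python
import numpy as np

NUM_CLASSES = 3  # red, yellow, green

def squash(s: np.ndarray, axis: int = -1) -> np.ndarray:
    """Capsule squashing nonlinearity (Sabour et al., 2017):
    v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||), so 0 <= ||v|| < 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + 1e-9)

def predict_signal(class_capsules: np.ndarray) -> int:
    """class_capsules: (NUM_CLASSES, dim) output capsule vectors.
    A capsule's length encodes the probability that its class is present,
    so the prediction is the index of the longest vector."""
    lengths = np.linalg.norm(class_capsules, axis=-1)
    return int(np.argmax(lengths))
```

Reducing the final capsule layer from Kumar et al.'s 43 traffic sign classes to these 3 signal classes is the kind of alteration the adaptation describes.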
Chapter 4: Findings
The purpose of the research was to determine the viability of CapsNets to improve final
accuracy in traffic light image recognition. The researcher constructed a CapsNet and tested it on
traffic light image data to determine how strong the CapsNet architecture is at classification. After
training and testing, the CapsNet performed very well, reaching high levels of accuracy and
supporting the hypothesis. This supports the potential of CapsNets to change the field of traffic
light image recognition.
Results
The CapsNet reached a final validation accuracy of 100% after 40 steps. The criterion for
evaluation was the final validation accuracy of the neural network, with the hypothesis threshold
set at 85%. Because the CapsNet converged to 100%, well past that threshold, the researcher
concluded that the hypothesis that the CapsNet would classify traffic light images at greater than
85% final validation accuracy was supported.
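The evaluation criterion reduces to a simple check. The helper names below are illustrative, not the study's code:

```python
import numpy as np

THRESHOLD = 0.85  # hypothesis-support threshold on final validation accuracy

def validation_accuracy(predicted, actual) -> float:
    """Fraction of validation images whose predicted signal matches the label."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.mean(predicted == actual))

def hypothesis_supported(final_accuracy: float) -> bool:
    """The hypothesis is supported when final accuracy meets the 85% threshold."""
    return final_accuracy >= THRESHOLD
```

With the observed final accuracy of 1.0, `hypothesis_supported(1.0)` holds, matching the conclusion above.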
Evaluation of Findings
The results supported the hypothesis very strongly, with the network converging to 100%
validation accuracy by the 40th step. This is significantly higher than the 85% threshold set in
the hypothesis. It is also slightly higher than the 97.6% accuracy of the research done by Kumar
et al. (2018) using the same model (p. 4547). This could be because Kumar et al. (2018) had a
data set of 12,630 testing images (p. 4547), offering significantly more opportunities for the
network to make a wrong classification than in this research, which had only 163 testing
images. Additionally, the difference could be because Kumar et al.'s (2018)
research had 43 different classes of traffic sign to classify (p. 4546), while this research had only
3 classes of traffic light. Overall, the findings fall in line with that study and also the expectations
for the network, strongly supporting the potential for CapsNets to improve traffic light image
recognition applications.
Figure 1. CapsNet Validation Accuracy. This figure plots the validation accuracy of the CapsNet
(y-axis, 0 to 1.2) against training iteration in steps (x-axis, 0 to 45).
This is reinforced by Figure 1, which shows the model's validation accuracy curve. The
curve shows that, with further refinement, the amount of training needed to reach a high level of
accuracy could decrease even further, because after step 15 the model stays at 100% validation
accuracy.
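The convergence claim can be made precise with a small helper: the step after which the accuracy curve reaches a target and never drops below it again. The function below is a hypothetical illustration, not part of the study's code:

```python
def convergence_step(accuracies, target=1.0):
    """Return the first index i such that accuracies[j] >= target for all
    j >= i, or None if the curve never settles at the target."""
    step = None
    for i, acc in enumerate(accuracies):
        if acc >= target:
            if step is None:
                step = i  # candidate convergence point
        else:
            step = None   # curve dipped below target; reset the candidate
    return step
```

On a curve that first reaches 100% at step 15 and stays there, this returns 15, matching the behavior visible in Figure 1.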
Summary
The researcher found that the CapsNet was able to converge to a final validation accuracy of
100% after about 40 iterations. This strongly supports the hypothesis, which set the criterion for
evaluation at 85% final validation accuracy, and also mirrors the results of the Kumar et al.
(2018) study, which used a very similar model to this research. Therefore, CapsNets can classify
traffic light images by signal and may constitute an improvement over current approaches to
traffic light image recognition.
Chapter 5: Implications, Recommendations, and Conclusions
The current approach to traffic light image recognition uses CNN technology (Huang et
al., 2017, p. 1), but this is an issue because CNNs do not retain positional pose information
(Hinton, 2014, 6:57) and do not dynamically route data through layers (Sabour et al., 2017, p. 2),
which might reduce their accuracy and performance significantly. This research tests CapsNets
for the problem of traffic light image recognition, because they do utilize pose information and
dynamically route information through specialized capsules (Sabour et al., 2017, p. 2). The
researcher did this by altering the CapsNet model constructed by Kumar et al. (2018) to accept
traffic light image information instead of traffic sign image information and changing the
construction of the capsules. The results very strongly supported the hypothesis, with the
network quickly converging to a 100% validation accuracy. The previous work done with
CapsNets by Kumar et al. (2018) supports the idea that the study's limitations are not the cause
of this high accuracy and that further research should be done on applying CapsNets to
autonomous driving problems.
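The dynamic routing by agreement that distinguishes CapsNets from CNNs (Sabour et al., 2017) can be sketched in a few lines of numerical code. The shapes and names below are illustrative; the loop follows the published algorithm rather than this study's exact implementation:

```python
import numpy as np

def squash(s, axis=-1):
    """Capsule nonlinearity: shrinks vectors to length < 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + 1e-9)

def route(u_hat: np.ndarray, iterations: int = 3) -> np.ndarray:
    """Routing by agreement over prediction vectors u_hat with shape
    (num_input_caps, num_output_caps, dim); returns (num_output_caps, dim)."""
    b = np.zeros(u_hat.shape[:2])  # routing logits, start uniform
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per output capsule
        v = squash(s)                           # squashed output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)  # reward predictions that agree
    return v
```

Each iteration strengthens the coupling to output capsules whose current output agrees with an input capsule's prediction, which is exactly the dynamic routing that max pooling in CNNs lacks.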
Implications
The main research question was “How do CapsNets perform when implemented in traffic
light image recognition?” The research supports the idea that CapsNets perform very well when
implemented in traffic light image recognition, as the CapsNet surpassed the hypothesis
threshold of 85% final validation accuracy. The most impactful limitation of the research is the
small size of the training and validation datasets and the restriction to only three classes (red,
green, and yellow traffic lights) within the datasets. However, this limitation should not color
the interpretation of the research, because Kumar et al.’s (2018) research has already shown that
CapsNets can handle large amounts of data and large numbers of classes (p. 4547). The only
issue in question was whether CapsNets could accurately classify traffic light images, and the
evidence from this research supported an affirmative answer to that question. The results
represent a possible improvement over the efforts made by Huang et al. (2017) and Liu et al.
(2016) to optimize traffic light image recognition, where CapsNets could serve as more accurate
classifiers.
Researchers should apply this study practically in the field of autonomous driving, where
manufacturers could explore the potential of CapsNets to improve the efficiency and
performance of their self-driving vehicles. The CapsNet's quick convergence to 100%
validation accuracy suggests that CapsNets may prove very strong at classifying images, and
manufacturers of autonomous vehicles should research CapsNet performance further to see
whether CapsNets have a more favorable performance-accuracy tradeoff than the prevailing
CNN architecture does.
Recommendations
There are a number of other perception problems within autonomous driving that are
different enough from traffic light image classification to warrant further research with a
CapsNet architecture. One specific area is research into the field of traffic light image detection,
which is more computationally involved and may gain more from the CapsNet’s use of
positional data than an image classification problem does. Since accuracy was high in
classification for both traffic lights and traffic signs (Kumar et al., 2018, p. 4547), the results
could be informative when CapsNets are used in image detection. Another area where further
research should occur is a repetition of this study but with out-of-model improvements to
CapsNet architecture. CNNs have led in image recognition for quite a while (Huang et al., 2017,
p. 2) and as such have had optimizations in many places outside of just the model construction,
including the use of probabilistic frameworks (Ghahramani, 2015, p. 452) and the incorporation
of supplementary information like traffic sign location (Fairfield & Urmson, 2011, p. 6). As
such, researchers should provide these auxiliary improvements to CapsNets as well and test for
potential further gains in accuracy. Because the research findings show that CapsNets can
already reach 100% validation accuracy in traffic light image recognition, more information
could speed up training or improve the accuracy-performance tradeoff.
Conclusions
This research found that CapsNet architecture can successfully classify traffic light
images by signal. Additionally, because of previous research done with the same CapsNet
architecture (Kumar et al., 2018), it seems unlikely that this high accuracy is due to a small
dataset or small number of classes. As a result, it seems likely that CapsNets could perform well
on non-classification image perception problems like traffic light detection and traffic sign
detection. The findings also support further autonomous driving research into optimization of
CapsNets and determining whether CapsNets have a more favorable accuracy-performance
tradeoff than
CNNs.
References
Google. (n.d.). Classification: Accuracy. Machine Learning Crash Course. Retrieved from
https://developers.google.com/machine-learning/crash-course/classification/accuracy
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open
urban driving simulator. 1st Conference on Robot Learning. [Data Files] Retrieved from
Fairfield, N., & Urmson, C. (2011). Traffic light mapping and detection. IEEE International
Conference on Robotics and Automation (ICRA).
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature,
521(7553), 452-459. doi:10.1038/nature14541
Hinton, G. E., Ghahramani, Z., & Teh, Y. W. (2000). Learning to parse images. Advances in
Neural Information Processing Systems. Retrieved from the NIPS Proceedings database.
Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011). Transforming auto-encoders. Lecture
Notes in Computer Science Artificial Neural Networks and Machine Learning – ICANN
2011, 44-51. doi:10.1007/978-3-642-21735-7_6
Hinton, G. E. (2014, December 4). What's wrong with convolutional nets? [Video File].
convolutional-nets
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., . . . Murphy, K. (2017).
Speed/accuracy trade-offs for modern convolutional object detectors. 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR).
doi:10.1109/cvpr.2017.351
Kumar, A. D., Arthika, R. K., & Parameswaran, L. (2018). Novel deep learning model for traffic
sign detection using capsule networks. International Journal of Pure and Applied
Mathematics.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
doi:10.1109/5.726791
Lim, K., Hong, Y., Choi, Y., & Byun, H. (2017). Real-time traffic sign recognition based on a
general purpose GPU and deep-learning. PLOS ONE, 12(3).
doi:10.1371/journal.pone.0173317
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. C. (2016). SSD:
Single shot MultiBox detector. Computer Vision – ECCV 2016, Lecture Notes in
Computer Science. Springer.
Nilsson, N. J. (2010). The quest for artificial intelligence: A history of ideas and achievements.
Cambridge, England: Cambridge University Press.
Ranzato, M. A., Huang, F. J., Boureau, Y., & LeCun, Y. (2007). Unsupervised learning of
invariant feature hierarchies with applications to object recognition. 2007 IEEE
Conference on Computer Vision and Pattern Recognition.
doi:10.1109/CVPR.2007.383157
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A
comprehensive review. Neural Computation, 29(9), 2352-2449.
doi:10.1162/neco_a_00990
Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. 31st
Conference on Neural Information Processing Systems (NIPS 2017). Retrieved from the
NIPS Proceedings database.
Stone, P., Brooks, R., Brynjolfsson, E., Calo, R., Etzioni, O., Hager, G., … Teller, A. (2016,
September). Artificial intelligence and life in 2030. One Hundred Year Study on
Artificial Intelligence: Report of the 2015-2016 Study Panel. Stanford, CA: Stanford
University.
Tyukin, I. Y., Gorban, A. N., Sofeykov, K. I., & Romanenko, I. (2018, August 13). Knowledge
transfer between artificial intelligence systems. Frontiers in Neurorobotics, 12.
doi:10.3389/fnbot.2018.00049