Deep Learning for Consumer Devices and Services
Pushing the limits for machine learning, artificial intelligence, and computer vision.
By Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran
In the last few years, we have witnessed an exponential growth in research activity into the advanced training of convolutional neural networks (CNNs), a field that has become known as deep learning. This has been triggered by a combination of the availability of significantly larger data sets, thanks in part to a corresponding growth in big data, and the arrival of new graphics-processing-unit (GPU)-based hardware that enables these large data sets to be processed in reasonable timescales. Suddenly, a wide variety of long-standing problems in machine learning, artificial intelligence, and computer vision have seen significant improvements, often sufficient to break through long-standing performance barriers. Across multiple fields, these achievements have inspired the development of improved tools and methodologies, leading to even broader applicability of deep learning. The new generation of smart assistants, such as Alexa, Hello Google, and others, have their roots and learning algorithms tied to deep learning. In this article, we review the current state of deep learning, explain what it is, why it has managed to improve on the long-standing techniques of conventional neural networks, and, most importantly, how you can get started with adopting deep learning into your own research activities to solve both new and old problems and build better, smarter consumer devices and services.

Deep Learning and Consumer Electronics
There has, perhaps, never been a better time to take advantage of the power of deep learning in consumer products. In retrospect, we may consider 2016 the year that deep learning toolkits and techniques matured from tools that were mostly oriented toward researchers into easily used product-enabling technology that can be used to add intelligence to almost any consumer device, even by nonexperts. We expect to see an explosion in products that take advantage of these resources in the coming years, with early adopters differentiating themselves from competitors and further refinement of technology and deep learning methods. In this article, we hope to provide you with the tools and understanding to start using deep learning today or to understand what consumer devices that are built using this technology can do and how they work.

A Look at Neural Networks
Artificial neural networks (ANNs) are able to learn something about what they see and then can generalize that knowledge to examples (or samples) that they have never seen before [1]. This is a very powerful capability that humans often take for granted because our brains automatically do it so well. You are able to understand the concept of a rock after seeing and perhaps touching very few examples of rocks. From that point on, you can identify any rock, even those that are shaped differently or have

Digital Object Identifier 10.1109/MCE.2016.2640698
Date of publication: 15 March 2017
Figure 4. Visualization of the learned convolutional filters at different layers. (Copyright Movidius, used with permission.)

Figure 6. A pooling operation reduces the size of the feature space.
the right side using a 4-D kernel. This kernel is shown using different 3-D kernels with different colors.

Pooling Layer
Pooling is an operation that accepts a pool of data values as input and generates one value from them to be passed to the next layer. For example, this operation could be the mean or the maximum of the input values. Pooling serves two important purposes: one is reducing the size of the data space to reduce overfitting, and the other is translation invariance. A pooling layer performs the pooling operation on its inputs. Figure 6 shows how a pooling operation is applied to a one-channel input to reduce the dimensionality of the data space: a 3 × 3 pooling operation is applied to a 12 × 6 one-channel feature space. The most frequently used pooling operation is max-pooling, wherein the maximum value of the features in the pool is selected to be mapped to the next layer.

The importance of this layer is illustrated by the following example. In generic deep neural networks (described in the Generic Deep Neural Networks section), a dense, fully connected layer emerges from the convolutional layers. For example, consider a convolutional layer with ten channels and a 100 × 100 feature space. Placing a fully connected layer after it would result in computing 100,000 weights from this layer to each neuron in the dense layer, which requires significant memory and computation resources. Using a pooling layer helps reduce resource demands. Additionally, without a pooling layer, a network this big might suffer from overfitting, especially if there is not enough representative data in the training set. Pooling also helps provide translation invariance by helping each kernel cover more space.

Nonlinearities
Each neuron in a deep neural network applies a nonlinear activation function to calculate its output value. This leads to one of the most advantageous properties of deep neural networks: their ability to describe highly nonlinear systems. Being highly nonlinear makes the model suitable for real-life problems and yields solutions in pattern recognition that cannot be achieved through more classical methods. In the early days of neural networks, the tanh and sigmoid activation functions were the most popular. Realizing that most of the data tends to be concentrated around zero, newer techniques such as rectified linear units and exponential linear units [11] have become popular because they are nonlinear near zero. They also benefit from an infinite range.
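The max-pooling and rectified-linear-unit operations described above can be sketched in a few lines of NumPy. The helper names (`max_pool2d`, `relu`) are illustrative, not from the article; the example uses the same 3 × 3 pooling window and 12 × 6 one-channel feature space as Figure 6.

```python
import numpy as np

def max_pool2d(x, pool=3):
    """Non-overlapping max-pooling over a 2-D, one-channel feature map."""
    h, w = x.shape
    h_out, w_out = h // pool, w // pool
    # Reshape into (h_out, pool, w_out, pool) blocks and keep the max of each block.
    blocks = x[:h_out * pool, :w_out * pool].reshape(h_out, pool, w_out, pool)
    return blocks.max(axis=(1, 3))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(x, 0.0)

# A 12 x 6 one-channel feature space, as in Figure 6.
features = np.arange(72, dtype=float).reshape(12, 6)
pooled = max_pool2d(relu(features), pool=3)
print(pooled.shape)  # (4, 2): 3 x 3 pooling shrinks each dimension by a factor of 3
```

Each 3 × 3 pool of features is collapsed to a single value, so the 12 × 6 map becomes 4 × 2 — the dimensionality reduction that makes a following dense layer affordable.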
stride of a convolution operation is the number of pixels the kernel window slides before calculating the convolution at each location). In generic deep neural networks, the convolutional layers are usually followed by one or more dense, fully connected layers (Figure 7). The rectified linear unit is the most common activation function used in these networks. The last layer typically uses other nonlinearities, chosen according to the task.

Figure 7. A generic deep neural network, with convolutional (conv) and pooling layers followed by fully connected dense layers.
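The effect of the stride on a convolutional layer's output size can be sketched directly. Below is a minimal NumPy example (the `conv2d` helper is illustrative, not from the article) computing a valid 2-D convolution: the window slides `stride` pixels between evaluations, so the output has ⌊(H − kH)/stride⌋ + 1 rows, and likewise for columns.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid 2-D cross-correlation (the 'convolution' of CNNs) with a given stride."""
    h, w = x.shape
    kh, kw = kernel.shape
    # The kernel window slides `stride` pixels before each new computation.
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0             # a 3 x 3 averaging kernel
print(conv2d(x, k, stride=1).shape)   # (4, 4)
print(conv2d(x, k, stride=3).shape)   # (2, 2)
```

A larger stride thus shrinks the feature space in much the same way pooling does, which is why some architectures use strided convolutions in place of pooling layers.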
Figure 10. An autoencoder is the merged structure of the encoder and decoder in a model.
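The merged encoder–decoder structure can be illustrated with a deliberately tiny sketch: a linear autoencoder in NumPy (no convolutions or nonlinearities, for brevity; all names and sizes are illustrative, not from the article). The encoder compresses each 8-D sample to a 2-D code, the decoder reconstructs it, and both are trained together to minimize the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 8-D that actually lie on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 8))
data = latent @ mix

# Encoder (8 -> 2) and decoder (2 -> 8); merging them gives the autoencoder.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def forward(x):
    code = x @ W_enc        # encoder: compress to the bottleneck
    recon = code @ W_dec    # decoder: reconstruct the input
    return code, recon

lr = 0.01
losses = []
for _ in range(200):
    code, recon = forward(data)
    err = recon - data
    losses.append(float(np.mean(err ** 2)))
    # Gradient descent on the reconstruction error for both halves at once.
    W_dec -= lr * (code.T @ err) / len(data)
    W_enc -= lr * (data.T @ (err @ W_dec.T)) / len(data)

print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the whole pipeline is trained end to end, the encoder learns a compression that the decoder can actually invert, which is the essential idea behind the deeper convolutional autoencoders of Figure 10.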