
Figure 3. Inverted residuals
Figure 4. SSD network structure
B. Linear Bottlenecks and Inverted residuals
After a depthwise separable convolution, applying ReLU to
low-dimensional features easily causes information loss, whereas
applying ReLU in a high-dimensional space rarely does. Linear
bottlenecks complement the depthwise separable convolution:
because ReLU causes this loss, it is replaced by a linear
activation function. MobileNet V2 removes the activation function
after the pointwise (pw) convolution that follows the depthwise
(dw) convolution and reduces the dimension. So that the dw
convolution operates on more channels, a pw convolution is used
first to raise the dimension.
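The information-loss argument can be illustrated with a toy example. The sketch below is only an illustration, not the MobileNet V2 implementation; the vectors and the helper names `relu` and `linear` are hypothetical. Two different low-dimensional feature vectors collapse to the same output under ReLU, while a linear (identity) activation keeps them distinct.

```python
def relu(features):
    """Element-wise ReLU: negative components are set to zero."""
    return [max(0.0, x) for x in features]


def linear(features):
    """Linear (identity) activation: every component passes through unchanged."""
    return list(features)


# Two distinct 2-D feature vectors (hypothetical values chosen for illustration).
a = [-1.0, 2.0]
b = [-3.0, 2.0]

# After ReLU the negative first component is zeroed in both vectors,
# so the difference between a and b can no longer be recovered.
relu_collapses = relu(a) == relu(b)

# With a linear activation the two inputs remain distinguishable.
linear_collapses = linear(a) == linear(b)

print("ReLU maps a and b to the same output:", relu_collapses)
print("Linear maps a and b to the same output:", linear_collapses)
```

In a high-dimensional space the same feature is spread over many channels, so zeroing some of them is less likely to erase it, which is the intuition behind expanding before the dw convolution.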
The dw convolution therefore extracts features in a
high-dimensional space. In conventional convolutional layers or
fully connected layers, information can be lost as it is
transmitted through the network. ResNet solves this problem to
some extent, protecting the integrity of the information by
bypassing the input directly to the output. The MobileNet V2
model also adds shortcuts. The difference from ResNet is that
MobileNet V2 first raises the dimension of the input instead of
reducing it. Figure 2 shows the MobileNet structure and Figure 3
shows the inverted residual, in which the connection is a
shortcut [21].
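The channel flow of an inverted residual block can be sketched as simple bookkeeping. The example below is a minimal sketch under stated assumptions, not the authors' code: the function name `inverted_residual_summary`, the expansion factor of 6, and the 3x3 depthwise kernel are assumptions taken from common descriptions of MobileNet V2, and biases and batch normalization are ignored.

```python
def inverted_residual_summary(in_channels, out_channels, stride=1, expansion=6):
    """Describe one inverted residual block: pw expand -> dw 3x3 -> linear pw project.

    expansion=6 and the 3x3 kernel are assumptions, not values from this paper.
    """
    hidden = in_channels * expansion  # the pw convolution raises the dimension first

    # Weight counts only (no bias, no batch norm), one entry per stage.
    expand_params = in_channels * hidden     # 1x1 pw, followed by a non-linearity
    depthwise_params = 3 * 3 * hidden        # 3x3 dw, one filter per channel
    project_params = hidden * out_channels   # 1x1 pw, linear (no activation)

    # The shortcut adds the input to the output, so their shapes must match.
    use_shortcut = stride == 1 and in_channels == out_channels

    return {
        "channels": [in_channels, hidden, hidden, out_channels],
        "params": expand_params + depthwise_params + project_params,
        "use_shortcut": use_shortcut,
    }


block = inverted_residual_summary(in_channels=24, out_channels=24)
print(block)
```

The channel list shows the "inverted" shape: narrow at both ends and wide in the middle, the opposite of a ResNet bottleneck, which is wide at both ends and narrow in the middle.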
Figure 5. Nao robot
C. SSD
SSD (Single Shot MultiBox Detector) is a fast model that detects
objects with a single deep neural network [22]. SSD performs
multi-target detection by simultaneously predicting target
categories and bounding boxes. The SSD model is shown in
Figure 4.

The SSD model is based on a feedforward convolutional network.
Its backbone is VGG16 [23], followed by 6 extra feature layers.
Feature map size is reduced layer by layer, and 6 different
feature layers are used to detect targets at different scales:
low-level layers predict small targets and high-level layers
predict large targets. A large number of multi-scale discretized
boxes are generated on the different feature layers, and the
network predicts the offsets of the default boxes of different
scales and aspect ratios together with the associated
confidences. To address the excessive parameter size and the
execution efficiency of the trained model, the convolutional
layers in the original SSD are replaced by MobileNet's depthwise
separable layers, which improves operational efficiency and
real-time performance.

IV. EXPERIMENTS

Our model is implemented in the Keras framework with the
TensorFlow software library as the backend. The operating system
is Ubuntu 18.04.1. The hardware platform is an Intel® Core™
i5-8400 at 2.8 GHz (6-core CPU) with an NVIDIA GeForce GTX 1070
and 16 GB RAM. The Nao robot runs a built-in embedded Linux
operating system. Its hardware platform is an ATOM Z530 1.6 GHz
CPU with 1 GB RAM, 2 GB flash memory, and an 8 GB Micro SDHC
card.

As shown in Figure 5, Nao has two identical video cameras on its
forehead. They provide a resolution of up to 1280x960 at 29
frames per second.

A. Datasets and evaluation results
In this paper, we used the fer2013 and CK+ data sets to train and
test our model.

Fer2013 is labeled with seven expression types: happy, angry,
sad, surprise, fear, disgust, and neutral. The training set
consists of 28,709 examples. The public test set includes 3,589
examples, and the private test set includes another 3,589
examples.

CK+ is a facial expression dataset consisting of 593 image
sequences from 123 individuals. Of these, 327 image sequences
have an emotion label, which is one of seven expressions: happy,
anger, fear, sad, surprise, contempt, and disgust. There is no
contempt expression in Fer2013, so for compatibility with
Fer2013 the contempt expressions are discarded.
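Dropping the contempt class so that CK+ labels line up with Fer2013 could look like the sketch below. It is an illustration under assumptions, not the authors' preprocessing code: the sample list, the file names, and the helper `to_fer2013_labels` are hypothetical, and the integer indices simply follow the order in which the text lists the Fer2013 classes rather than any official dataset encoding.

```python
# Fer2013 classes in the order listed in the text (index order is an assumption).
FER2013_LABELS = ["happy", "angry", "sad", "surprise", "fear", "disgust", "neutral"]

# CK+ label -> Fer2013 label. The text spells one class differently
# ("anger" in CK+, "angry" in Fer2013), so it is renamed here.
CK_TO_FER2013 = {
    "happy": "happy",
    "anger": "angry",
    "fear": "fear",
    "sad": "sad",
    "surprise": "surprise",
    "disgust": "disgust",
    # "contempt" is intentionally absent: Fer2013 has no such class.
}


def to_fer2013_labels(samples):
    """Keep only CK+ samples whose label exists in Fer2013, as (path, class index)."""
    kept = []
    for path, label in samples:
        mapped = CK_TO_FER2013.get(label)
        if mapped is not None:  # contempt (or any unknown label) is skipped
            kept.append((path, FER2013_LABELS.index(mapped)))
    return kept


# Hypothetical labeled CK+ samples, for illustration only.
ck_samples = [
    ("seq_001.png", "happy"),
    ("seq_002.png", "contempt"),
    ("seq_003.png", "anger"),
]

converted = to_fer2013_labels(ck_samples)
print(converted)
```

After this step both data sets share one label space, so a single classifier head can be trained and evaluated on either of them.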
