
2016 3rd International Conference on Informative and Cybernetics for Computational Social Systems (ICCSS)

Scene image classification method based on Alex-Net model

Jing Sun, Xibiao Cai, Fuming Sun, Jianguo Zhang
School of Electronics and Information Engineering, Liaoning University of Technology, Jinzhou, Liaoning
694600868@qq.com, xbc1111@126.com, sunwenfriend@hotmail.com, 1192102470@qq.com

Abstract—Deep convolutional neural network (DCNN) is a powerful method for learning more discriminative image features, and it has been studied deeply and applied widely in the fields of computer vision and pattern recognition. In order to further explore the superior performance of DCNN and improve the accuracy of scene image classification, this paper presents a novel scene classification algorithm that fully learns the deep characteristics of images based on the classical Alex-Net model and the support vector machine. First, we use the Alex-Net model to learn scene image features, taking the last layer of the Alex-Net model, with 4096 neurons, as the image feature representation. Then, we train a Lib-SVM model for scene image classification and compare it with a classification method based on the regression model. Finally, we carry out experiments on two common datasets. The experimental results show that DCNN can extract image features effectively; meanwhile, the trained scene model also has stronger generalization performance and achieves state-of-the-art classification accuracy.

Keywords—deep convolutional neural network; scene classification; feature learning; training model

I. INTRODUCTION

With the increasing development of the Internet and digital photography equipment, images have become an integral part of real life, and how to quickly and effectively retrieve and identify image information has become a hot topic that urgently needs to be solved. Scene image classification is the key to solving image-based pattern recognition. As one of the important topics in machine learning and pattern recognition, scene image classification has captured increasing attention. The purpose of scene classification is to assign one or more semantic categories to one image or a set of images. That is to say, given a set of semantic categories, such as the semantic concepts mountain, coast, highway, and forest, the images are classified automatically, as shown in Fig. 1.

Traditionally, for the problem of scene image classification, people have mainly studied semantic modeling technologies [1-2], and the Bag of Visual Words (BOW) model [3-4] is one of the most widely used methods. In this method, semantic modeling is first carried out on the scene; second, the probability distribution of the semantic topics is learned; finally, the scene is classified according to this probability distribution. Bosch et al. [5] applied the PLSA method to achieve eight-class scene classification; Fu et al. [6] proposed a progressive approach to generate the BOW on the basis of the PLSA model, which achieved better results on small data sets; Bai et al. [7] used a way of combining top-down and bottom-up information for visual word weighting, which also achieved better results. At the same time, some scholars introduced the formal concept analysis (FCA) model into scene classification, which improved the classification performance for some concepts.

Fig. 1. Four scene images

However, Hinton et al. [8] proposed deep learning in 2006, and it has since become a popular research subject in the fields of computer vision and pattern recognition. The basic principle of deep learning is to use neural networks with multiple hidden layers to form more abstract high-level representations through the combination of low-level features, and thereby to find distributed feature representations of the data. Because of its excellent performance, deep learning has aroused widespread concern and in-depth research, and it has achieved great success in computer vision, speech recognition, natural language processing and other fields.

As one of the representative models of deep learning, the deep convolutional neural network model has the obvious advantage that, compared with other traditional supervised learning methods, local features no longer need to be designed manually; instead, local features with translation invariance are extracted automatically by stacked convolution and down-sampling operations. Hence, the heavy labor caused by designing local features, as well as the training time, can be greatly reduced. The other advantage of DCNN is that it employs a recently-developed regularization method called Dropout, which can prevent over-fitting.

In this paper, a scene image classification method based on the deep convolutional neural network is proposed. The specific contributions of this paper are as follows. First, the Alex-Net model is used to learn scene image features, with the last layer of 4096 neurons taken as the image feature representation. Meanwhile, the 12 scene models are constructed by using the Lib-SVM classifier. Then, the test images are classified by the trained scene models and the Lib-SVM classifier. Finally, experimental results on two real-world data sets demonstrate that the proposed classification model can achieve superior results.
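As a concrete illustration of this pipeline, the sketch below extracts the 4096-dimensional activations of the last fully-connected hidden layer from a pretrained Alex-Net and feeds them to an SVM. The paper uses Lib-SVM with the original Alex-Net; here torchvision's AlexNet and scikit-learn's SVC are assumed as stand-ins, and train_images and train_labels are placeholders for the prepared scene data.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from sklearn.svm import SVC

# Pretrained AlexNet as a stand-in for the paper's Alex-Net model.
alexnet = models.alexnet(pretrained=True).eval()
# Keep the classifier up to the second fully-connected layer (4096-d
# output), dropping the final 1000-way output layer.
feature_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def extract_features(pil_images):
    """Return one 4096-d feature vector per input PIL image."""
    batch = torch.stack([preprocess(im) for im in pil_images])
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(batch)).flatten(1)
        return feature_head(conv).numpy()

# train_images / train_labels are placeholders for the scene data set.
svm = SVC(kernel='linear')
svm.fit(extract_features(train_images), train_labels)
```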

II. CONVOLUTIONAL NEURAL NETWORK MODEL


CNN is a special neural network model proposed by LeCun for document image recognition, and it has made great breakthroughs in image classification and retrieval [9], target detection [10] and so on. A deep CNN reduces the dimensionality of an image by increasing the number of hidden layers (convolutional layers and sampling layers), and extracts sparse image features in a low-dimensional space. Because of weight sharing, a CNN has far fewer neurons and parameters, so it is easier to train.

The Alex-Net model is the most representative CNN model, and it has three obvious advantages: superior performance, fewer training parameters, and strong robustness. It is inspired by the principles of human vision, mimicking dual-channel visual transmission and learning image features with two channels, as depicted in Fig. 2. Alex-Net is a deep convolutional neural network model with multiple hidden layers; it includes an input layer, five convolutional layers, three pooling layers, three fully-connected layers, and an output layer with 1000 output class labels. In the whole model, feature learning is carried out in the convolutional layers with two channels that are trained separately, and they cross only in the third feature extraction layer. The first fully-connected layer cross-mixes the features of the two groups, the next fully-connected layer repeats this, and the last fully-connected layer combines the features of the two groups to obtain a 4096-dimensional feature vector.

Fig. 2. The framework of Alex-Net model

Due to its complex structure, many training parameters and large amount of training data, the Alex-Net model needs to adopt appropriate methods to make training faster and to prevent over-fitting. Alex-Net therefore builds the model directly with the Rectified Linear Unit (ReLU) nonlinearity, which makes initialization more consistent with the theory and allows the network to be trained directly from the starting point, enhancing training speed. Local response normalization (LRN) performs a kind of "neighbor suppression" operation, which can effectively improve the generalization performance of the model by normalizing over local input regions. As for the data, data augmentation is performed with label-preserving transformations: the original images are transformed to multiply the amount of data, and the intensities of the RGB channels in the training images are altered, so as to prevent over-fitting. To improve robustness in the fully-connected layers, a recently-developed regularization method called Dropout is employed, so that the learning of a hidden layer does not depend on particular features of the layer above.

III. THE SPECIFIC PROCESS OF DEEP LEARNING

We use the Alex-Net model to learn scene image features, which are obtained through a series of transformations during the training phase, such as convolution and max-pooling. The specific process of feature extraction is shown in Fig. 3.

Fig. 3. The flow chart of feature extraction based on CNN

A. Input layer

The size of the raw scene images is 256×256. In the first place, the input images are selected: we do this by extracting 224×224 patches from the four corners and the center of the 256×256 images. Then, horizontal reflections of these five 224×224 patches are taken, giving ten patches in all. Finally, each of the ten patches is input to the training network as an input image.
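A minimal sketch of this ten-patch scheme, assuming 256×256 PIL images; torchvision's TenCrop transform implements exactly the four-corner, center, and horizontal-flip extraction described above (the file name is a placeholder).

```python
import torchvision.transforms as transforms
from PIL import Image

# Four corner crops + center crop, each with its horizontal reflection.
ten_crop = transforms.TenCrop(224)

image = Image.open('scene.jpg').resize((256, 256))  # placeholder file name
patches = ten_crop(image)   # tuple of ten 224x224 PIL images
assert len(patches) == 10
```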

B. Convolutional layer (C1)

C1 is the first convolutional layer of the feature extraction, and it yields 96 feature maps of size 55×55, obtained with convolutional kernels of size 11×11. Because the size of the receptive field of a neuron is determined by the size of the convolutional kernel, the ideal kernel size is one that can extract effective, representative local features within its range. Therefore, the proper setting of the convolutional kernel is very important for extracting effective image features and improving the performance of the convolutional neural network. C1 filters the 224×224 input image with 96 convolutional kernels of size 11×11 at a stride of 4 pixels; that is, an 11×11 kernel is slid over the image with a step of 4. Ultimately, we obtain 96 feature maps of size 55×55.
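The 55×55 map size follows from standard convolution arithmetic; the small helper below makes the computation explicit. Note that with a 224×224 input, an 11×11 kernel at stride 4 needs a little zero-padding (2 pixels here, a common assumption for Alex-Net implementations) to arrive exactly at 55.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Standard convolution arithmetic: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1

# C1: 224x224 input, 11x11 kernels, stride 4, padding 2 (assumed) -> 55x55
print(conv_output_size(224, kernel=11, stride=4, padding=2))  # 55
```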
C. Max-pooling layer (P1)

P1 is the first max-pooling layer, and it has 96 feature maps of size 27×27. The pooling process selects the maximum value within each pooling region as the value of that region after pooling. In this layer, we choose max-pooling over a 3×3 region in order to control the speed of dimensionality reduction: because the dimension declines exponentially, a faster decline means coarser image features, and many image details are subsequently lost. Since the pooling regions are of size 3×3 while the stride is 2, we obtain overlapping pooling. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared with the non-overlapping scheme that produces output of equivalent dimensions.
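Under the same arithmetic, a 3×3 window at stride 2 maps the 96 feature maps of 55×55 to 27×27, and because the stride is smaller than the window, neighboring pooling regions overlap. A short sketch with PyTorch's MaxPool2d (the input tensor is a placeholder for real C1 activations):

```python
import torch
import torch.nn as nn

# Overlapping max-pooling: window 3x3, stride 2 (stride < window size).
pool = nn.MaxPool2d(kernel_size=3, stride=2)

c1_maps = torch.randn(1, 96, 55, 55)   # placeholder C1 activations
p1_maps = pool(c1_maps)
print(p1_maps.shape)                    # torch.Size([1, 96, 27, 27])
```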
D. Convolutional layer (C2)

The second feature extraction layer, C2, has many similarities with C1, but there are also certain differences. C2 takes as input the (pooled) output of the first convolutional layer and filters it with 256 convolutional kernels of size 5×5 at a stride of 1 pixel, obtaining 256 feature maps of size 27×27. In the C1 layer, the receptive field of each neuron corresponds to a region of size 33×33 in the raw image. The C2 layer then convolves with a 5×5 kernel, so the receptive fields of its neurons are further enlarged, corresponding to a region of size 165×165 in the original image. Each feature map in C2 is not obtained by convolving a single feature map of the P1 layer directly, but by combining several or all of the feature maps in P1 as the input for another convolution. One reason for this is that the sparsely connected mechanism keeps the number of connections within a reasonable range; the other is that the asymmetry of the network enables different combinations to extract various features.
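A sketch of C2 under these dimensions: keeping 27×27 maps with 5×5 kernels at stride 1 requires zero-padding of 2 pixels, which is the usual Alex-Net assumption; setting groups=2 is one way to mimic the sparse two-channel connectivity described above, in which each kernel sees only a subset of the P1 maps.

```python
import torch
import torch.nn as nn

# C2: 96 input maps -> 256 output maps, 5x5 kernels, stride 1, padding 2.
# groups=2 reproduces the sparse two-channel connectivity (assumed).
c2 = nn.Conv2d(in_channels=96, out_channels=256,
               kernel_size=5, stride=1, padding=2, groups=2)

p1_maps = torch.randn(1, 96, 27, 27)    # placeholder P1 output
print(c2(p1_maps).shape)                # torch.Size([1, 256, 27, 27])
```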
E. The remaining convolutional layers and pooling layers

These layers have the same working principle as the first two, but the size and number of the feature maps change. The size of the C5 layer after convolution is still 13×13. In this process, max-pooling layers with kernels of size 3×3 follow the second (C2) and fifth (C5) convolutional layers. As the depth of the convolution increases, the extracted features become more abstract, with more discriminative and expressive power. The experimental results demonstrate that the classification accuracy is only about 40% when only the first two layers of the convolutional neural network are used, which proves that depth has a great effect on the performance of a convolutional neural network, and that a lack of depth reduces its feature extraction ability.

F. Fully-connected layer (Fc)

In the whole learning process, the remaining three layers are the fully-connected layers, which cross-mix the learned features of the two channels to obtain a 4096-dimensional feature vector. We use the Dropout technique in the first two fully-connected layers, setting their inputs to zero with probability 1/2, so that the zeroed units contribute neither to the forward pass nor to back-propagation. A hidden layer trained in this way cannot rely on the presence of particular features of the previous layer, which gives the learned features stronger robustness, selects more adaptive parameters, and significantly improves the generalization capability of the system.
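A minimal sketch of this fully-connected head with Dropout on the first two layers; the layer sizes follow the standard Alex-Net, and the 9216-dimensional input (256 maps of 6×6 after the last pooling) is an assumption about the flattened convolutional output.

```python
import torch.nn as nn

# Fully-connected head: Dropout (p = 1/2) zeroes the inputs of the first
# two fc layers during training, so no hidden unit can rely on specific
# features of the previous layer.
fc_head = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(9216, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),  # 4096-d feature layer
    nn.Linear(4096, 1000),                          # 1000-way output layer
)
```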
IV. EXPERIMENTS

To evaluate the approach, we carried out experiments on two common test databases, ImageNet2012 and NUS-WIDE. Each of the two databases contains a variety of categories of scene images, and the scene categories and data distribution are described in Fig. 4. There are 53694 scene pictures in the training set, while the test sets of ImageNet2012 and NUS-WIDE are composed of 3678 and 3593 pictures, respectively.

Fig. 4. Category distribution of the databases

In the experiment, we use the Lib-SVM and SoftMax classifiers to classify the test images on the two data sets. In the first place, we extract the 4096 neurons of the last layer of the Alex-Net model as the scene image features; then, the two classifiers are trained using these image features; finally, the trained classifiers are used to classify the test images, and the classification accuracy (AC) is shown in Table I (0: building, 1: coast, 2: insidecity, 3: forest, 4: highway, 5: kitchen, 6: livingroom, 7: mountain, 8: sea, 9: sky, 10: snow, 11: street).
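To make the comparison concrete, the sketch below trains both classifiers on the extracted 4096-dimensional features and reports per-class accuracy. scikit-learn's LinearSVC and LogisticRegression are assumed as stand-ins for Lib-SVM and the SoftMax classifier, and train_feats, train_labels, test_feats, test_labels are placeholders for the extracted features and labels.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Stand-ins for the Lib-SVM and SoftMax classifiers used in the paper.
classifiers = {'Lib-SVM': LinearSVC(),
               'SoftMax': LogisticRegression(max_iter=1000)}

for name, clf in classifiers.items():
    clf.fit(train_feats, train_labels)           # placeholder arrays
    pred = clf.predict(test_feats)
    for c in np.unique(test_labels):             # per-class accuracy (AC)
        mask = test_labels == c
        print(name, c, (pred[mask] == test_labels[mask]).mean())
```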

TABLE I. ACCURACY OF DIFFERENT CLASSIFIERS ON TWO DATABASES

Class   AC/(%) on ImageNet2012    AC/(%) on NUS-WIDE
        Lib-SVM    SoftMax        Lib-SVM    SoftMax
0       91.01      90.73          88.25      87.75
1       83.89      83.61          80.29      79.41
2       81.17      80.52          78.67      78.33
3       80.18      79.87          74.15      73.37
4       91.16      91.54          86.42      86.78
5       87.14      87.14          82.44      81.20
6       92.04      92.04          86.00      85.67
7       85.03      85.03          81.48      80.42
8       95.02      94.35          91.04      89.92
9       93.00      92.33          89.32      88.46
10      90.00      90.00          87.45      85.47
11      90.07      89.38          89.21      86.92
Avg.    88.31      88.03          84.56      83.72
From Table I, we can see that the average classification accuracies of the two classifiers on the two different data sets are both relatively high and almost the same. At the same time, the two classifiers achieve similar accuracy on each individual class, which illustrates that the features extracted by the deep convolutional neural network have stronger discriminability and robustness. In this paper, a scene image classification method based on the Alex-Net model is proposed: scene features are extracted on the basis of the Alex-Net model and the Lib-SVM classifier is used to train the model, so the classification accuracy of scene images is significantly improved.

Table II shows the accuracy of three methods, HOG, KNN, and our proposed algorithm, on ImageNet2012 and NUS-WIDE, respectively.
TABLE II. ACCURACY OF DIFFERENT METHODS ON TWO DATABASES

Class   AC/(%) on ImageNet2012         AC/(%) on NUS-WIDE
        HOG      KNN      DCNN         HOG      KNN      DCNN
0       73.26    80.13    91.01        71.02    77.32    88.20
1       70.21    71.39    83.89        67.76    69.34    80.28
2       73.26    78.34    81.17        70.05    76.02    75.58
3       70.03    69.96    80.18        67.81    69.06    74.13
4       80.00    84.06    91.16        77.83    82.89    86.54
5       81.32    82.65    87.14        78.36    79.86    82.43
6       85.34    86.65    92.04        81.16    83.10    86.08
7       70.22    70.34    85.03        67.22    69.03    81.48
8       75.01    82.45    95.02        73.06    79.67    91.03
9       76.65    82.87    93.00        73.78    80.00    89.36
10      78.08    83.15    90.00        75.54    81.18    87.45
11      76.34    78.23    90.07        73.89    76.08    89.01
Avg.    75.81    79.18    88.31        73.39    76.96    84.55
From Table II, we can see that the average accuracy of the DCNN algorithm is 12.50% higher than HOG and 9.31% higher than KNN on the ImageNet2012 dataset. On the NUS-WIDE database, the average accuracy of the DCNN algorithm is 11.16% higher than the HOG algorithm and 7.95% higher than the KNN algorithm. To sum up, the deep convolutional neural network can extract the deep characteristics of scene images more effectively, which makes the extracted features more discriminative and greatly improves the performance of scene classification.

V. CONCLUSIONS

Deep learning has become a hot area of machine learning research, and it has made significant breakthroughs in image recognition, speech analysis, target detection and other fields. In this paper, we extract the last-layer features of the deep convolutional neural network for scene classification by using the Alex-Net model in the deep learning framework, which allows the classification accuracy to be further improved, so a scene image classification model with stronger generalization performance and higher efficiency is constructed. In future work, we will introduce the relationships between different scenes in the training set to reduce redundant information, so as to further improve the performance of scene classification.

Acknowledgment

This work is partially supported by the National Natural Science Funds of China (No. 61572244, No. 61472059) and the Program for Liaoning Excellent Talents in University (LR2015030).

References

[1] O. Russakovsky, Y. Lin, K. Yu, and F. F. Li, "Object-centric spatial pooling for image classification," Computer Vision-ECCV 2012, pp. 1-15, 2012.
[2] A. Angelova and S. H. Zhu, "Efficient object detection and segmentation for fine-grained recognition," IEEE Conference on Computer Vision and Pattern Recognition, Portland, pp. 811-818, 2013.
[3] B. Fernando, E. Fromont, and D. Muselet, "Supervised learning of Gaussian mixture models for visual vocabulary generation," Pattern Recognition, vol. 2, pp. 897-907, 2012.
[4] T. Li, T. Mei, I. S. Kweon, and X. S. Hua, "Contextual bag-of-words for visual categorization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, pp. 381-392, 2011.
[5] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification via pLSA," Computer Vision-ECCV 2006, Berlin Heidelberg: Springer, pp. 517-530, 2006.
[6] P. K. Elango and K. Jayaraman, "Clustering images using the latent Dirichlet allocation model," University of Wisconsin, 2005.
[7] Z. Fu, H. Lu, and W. Li, "Incremental visual objects clustering with the growing vocabulary tree," Multimedia Tools and Applications, vol. 3, pp. 535-552, 2012.
[8] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[10] R. Girshick, J. Donahue, and T. Darrell, "Rich feature hierarchies for accurate object detection and semantic segmentation," IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014.

[11] J. Deng, W. Dong, L. J. Li, K. Li, and F. F. Li, "ImageNet: a large-scale hierarchical image database," IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[12] F. F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 524-531, 2005.
[13] J. R. Smith and C. S. Li, "Image classification and querying using composite region templates," Journal of Computer Vision and Pattern Recognition, vol. 2, pp. 165-174, 1999.
[14] S. L. Zhang, J. F. Zhang, and L. H. Hu, "Semantic annotation model of image scene based on formal concept analysis," Computer Application, vol. 4, pp. 1093-1096, 2015.
[15] Y. Jiang, "Research on scene image content representation and classification," Doctoral thesis, National University of Defense Technology, 2010.
