
Embedded System Design for Visual Scene Classification
Sumair Aziz*, Zeshan Kareem, Muhammad Umar Khan, Muhammad Atif Imtiaz
Department of Electronic Engineering,
University of Engineering and Technology Taxila, Pakistan
Email: *sumair.aziz@uettaxila.edu.pk, zeshanali6454@gmail.com, umar.khan@uettaxila.edu.pk, atif.imtiaz@uettaxila.edu.pk

Abstract— The computer vision and robotics community is experiencing growing interest in visual scene classification due to the availability of low-cost and compact visual sensing devices. This paper presents a framework aimed at embedded system design for visual scene classification. In the proposed framework we use data fusion of local and global descriptors as feature vectors for scene classification. We construct the feature vector by integrating Local Quinary Patterns (LQP), Bag of Visual Words (BoW) and Histogram of Oriented Gradients (HOG). For classification, a multiclass Support Vector Machine (SVM) is used. Experiments are performed on the publicly available MIT indoor scene classification database. Comparison of our approach with other methods shows that it is efficient in terms of overall accuracy.

Keywords—Local quinary patterns, Bag of visual words, histogram of gradients, SVM, scene classification

I. INTRODUCTION

Scene understanding is one of the most important problems in computer vision. Although remarkable progress has been achieved in the past decades, general-purpose scene understanding is still considered very challenging [1]. One of the incredible skills of the human visual system is how accurately it can identify and understand the complex visual world. The quick extraction of semantic understanding of a scene requires only a few hundred milliseconds to be consolidated into memory [2]. The purpose of scene recognition is to discover the overall scene category by putting stress on understanding its global characteristics [3, 4]. Visual scene classification and understanding is to examine a scene by considering the geometric and semantic context of its contents and to find the core relationships between them. In scene recognition there is a need for a model that classifies the scene, identifies and parses each component of the objects, and explains the image with a list of labels [5]. Scene recognition can be very useful in progressive driver support systems, and generally in robotics. Scene understanding can be applied to a multi-sensor system containing a stereo camera and a LIDAR, which are to be calibrated. It also involves road detection, pedestrian detection, traffic sign recognition and so on [6].

There are many challenges that affect computer vision-based algorithms designed for scene classification. Variation in the lighting conditions under which an image is captured affects the image quality and the thresholding parameters used for image segmentation. Foggy and hazy conditions also degrade image quality. The viewpoint from which an image is captured by the camera is the most important aspect for recognizing the scene accurately, and it remains an open problem for researchers.

II. LITERATURE REVIEW

A solution for urban scene classification based on a fusion framework that works on parts of over-segmented images using Dempster-Shafer theory is presented in [6]. Sensors including a camera and a LIDAR are used in this approach. The KITTI [7] dataset is used for results validation. An extended multi-structure local binary pattern based feature extraction approach is proposed for high-resolution image scene classification in [8]. Experimental results show that the method works efficiently in capturing spatial patterns and local contrast, consistently outperforming more than a few state-of-the-art classification algorithms. In [9] the authors provided an overview of several computer vision-based technologies established in recent years to help the visually impaired recognize general objects in an indoor environment. The visual data that such tag-based systems extract, store and compare for each object is much less than that of non-tag-based systems, which deal with far more detailed information, such as color, size and shape, for each object.

The combination of audio and video features, as well as the combination of different types of classifiers, is used to attain frame-by-frame classification of personal video recordings into semantically meaningful categories such as information about the environment (indoor, outdoor, etc.), the presence or absence of people, and their activities, like sports or partying [10]. A system is designed that learns which classifiers and parameters are suitable for this task.



Figure 1: Proposed System Block Diagram

Two classifiers are used: a GMM classifier for the audio features and an SVM classifier for the image features. The classifier combination scheme is implemented using meta-classification, which obtains better results than the other approaches evaluated, yielding a promising F-measure higher than 57% on average over all categories and higher than 73% over the diverse categories.

In [11] the scene classification problem is addressed by using histogram features extracted from saliency maps to predict the existence of interesting objects in images and to quickly prune uninteresting images. For the validation of this approach, a database is constructed that consists of 1000 background and object images captured in the environment in which the robot is working. Results are evaluated in terms of the overall performance of the proposed approach for the different saliency maps using precision, recall and F1-measure. Recently, Convolutional Neural Networks (CNN) [12] and transfer learning have also been employed for scene recognition [10]. The authors performed experiments on the MIT indoor scene recognition dataset and compared the properties of indoor and outdoor scene recognition. In [13] scene classification is performed using convolutional neural networks [12], exploring technical approaches including a baseline model, ResNet [14], dropout and L2 regularization, batch normalization, loss and optimizer choices, and saliency maps. Experiments were performed on the Places2 dataset [15]. In [16] a novel method is presented for an assisted living environment in which humans are tracked by video cameras and arrays of microphones for audio recording. Results are reported for audio monitoring and video tracking individually.

In our work we use feature fusion of multiple descriptors that best describe the image for scene classification. The Local Quinary Patterns (LQP) [17] feature descriptor was originally proposed for content-based image retrieval problems. LQP effectively captures the local neighborhood information present in texture images. The bag-of-visual-words (BoVW) technique is mainly inspired by the problem of text document analysis [18]. BoVW computes the feature vector by capturing the types of important interest points in an image. The Histogram of Oriented Gradients (HOG) [19, 20] operator computes the gradients present in image patches. We use a feature fusion approach for scene image classification by combining strong features from these descriptors. Multiple categories of indoor scenes are classified using a multiclass Support Vector Machine (SVM) classifier. The rest of this paper is organized as follows. Section III describes the proposed visual classification system. Experimental setup and results are discussed in Section IV. Finally, conclusions of our work are presented in Section V.

III. PROPOSED VISUAL CLASSIFICATION SYSTEM

The general architecture of the visual classification system is shown in Fig. 1. In the first step, a color image with Red, Green and Blue channels is captured from the camera and passed to the embedded processor (Raspberry Pi) [21] for further processing. Local Quinary Patterns (LQP), Bag of Visual Words (BoVW) and Histogram of Oriented Gradients (HOG) features are then extracted from the image. Finally, scene classification is performed using a multiclass Support Vector Machine (SVM) classifier.

A. Feature Extraction using Local Quinary Patterns

Let the image captured from the camera be I(i, j), having red, green and blue channels. LQP features are computed independently for each channel. For a given image I, LQP features are extracted from each window of 3×3 size. Finally, histogram bins are created for all computed LQP features in order to reduce the data dimensionality and increase computation speed. The local quinary values are obtained using Eq. (1):

x(I_c, L_1, L_2) = { +2, if I_c ≥ L_2
                     +1, if L_1 ≤ I_c < L_2
                      0, if −L_1 ≤ I_c < L_1        (1)
                     −1, if −L_2 < I_c ≤ −L_1
                     −2, if I_c ≤ −L_2

Figure 2: Local Quinary Pattern Feature Descriptor

where L_1 and L_2 are user-defined upper and lower threshold parameters and I_c is the center pixel of the 3×3 window, as shown in Fig. 2. Finally, the whole image is represented by the histogram of all features computed from each window.
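To make the computation concrete, the following Python sketch extracts an LQP histogram from one channel. It is a minimal illustration, not the paper's implementation: it assumes the quinary function of Eq. (1) is applied to the difference between each of the eight neighbors and the center pixel, packs the eight quinary digits into one base-5 code per pixel, and bins the codes; the thresholds L1 = 2, L2 = 5 and the 32-bin histogram are placeholder choices.

```python
import numpy as np

def quinary(diff, L1, L2):
    """Quantize neighbor-center differences into {-2, ..., +2} per Eq. (1)."""
    q = np.zeros_like(diff, dtype=np.int8)
    q[diff >= L2] = 2
    q[(diff >= L1) & (diff < L2)] = 1
    q[(diff > -L2) & (diff <= -L1)] = -1
    q[diff <= -L2] = -2
    return q

def lqp_histogram(channel, L1=2, L2=5, bins=32):
    """Minimal LQP sketch: quinary-code the 8 neighbors of every interior
    pixel, pack the digits into one base-5 code per pixel, then histogram."""
    c = channel.astype(np.int16)
    h, w = c.shape
    center = c[1:h-1, 1:w-1]
    # offsets of the 8 neighbors in a 3x3 window, clockwise from top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(center.shape, dtype=np.int32)
    for dy, dx in offsets:
        neighbor = c[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code = code * 5 + (quinary(neighbor - center, L1, L2) + 2)
    hist, _ = np.histogram(code, bins=bins, range=(0, 5 ** 8))
    return hist / hist.sum()  # normalized feature histogram
```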
B. Bag of Visual Words (BoVW)

The bag-of-words (BoW) methodology was originally proposed for the problem of text document analysis and was later adopted for image processing and computer vision applications [18]. A visual analogue of a word is used in the BoW model for image analysis, based on clustering low-level visual features of local regions, such as color or texture; a feature vector of the desired length (in words) is finally constructed as a result.
The BoVW feature extraction scheme is shown graphically in Fig. 3 and involves the following processing steps:
(i) Detect regions or points of interest
(ii) Compute local feature descriptors over those regions
(iii) Quantize the feature descriptors into words to form the visual vocabulary, and
(iv) Count the occurrences of each vocabulary word in the image to create the BoW feature
Given a training dataset of k images represented by F = {f_1, f_2, f_3, ..., f_k}, where f_j denotes the extracted SURF features of image j, the k-means clustering algorithm groups F into a fixed number of visual categories or words V = {v_1, v_2, v_3, ..., v_c}, where c is the number of clusters. The data is summarized as a c × k occurrence table of counts C_ij = n(v_i, f_j), where n(v_i, f_j) denotes how frequently the word v_i appears in image f_j.

Figure 3: Feature Extraction through Bag of Visual Words
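A hedged sketch of this pipeline is given below using scikit-learn's KMeans. ORB descriptors stand in for the SURF features used in the paper, since SURF is patent-encumbered and unavailable in stock OpenCV builds; the vocabulary size of 100 words is an illustrative choice.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_images, n_words=100):
    """Steps (i)-(iii): detect keypoints, describe them, and cluster the
    descriptors into a visual vocabulary with k-means."""
    orb = cv2.ORB_create()
    descriptors = []
    for img in train_images:  # grayscale uint8 images
        _, desc = orb.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc.astype(np.float32))
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptors))

def bovw_histogram(img, kmeans):
    """Step (iv): assign each descriptor to its nearest visual word and
    return the normalized word-occurrence histogram."""
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(img, None)
    hist = np.zeros(kmeans.n_clusters)
    if desc is not None:
        for word in kmeans.predict(desc.astype(np.float32)):
            hist[word] += 1
    return hist / max(hist.sum(), 1)
```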
C. Histogram of Oriented Gradients (HOG)

The HOG feature descriptor is very popular in the computer vision domain and is mainly used for human detection [19, 20]. It has also proved effective for object detection [22]. To extract HOG features from an input scene image, the image is first divided into overlapping blocks. Afterwards, the orientations of the gradients are computed for each block, followed by quantization of the features into histogram bins, each bin covering its own orientation range. Finally, the features from all image blocks are concatenated to form a feature vector that represents the whole scene image.
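OpenCV ships a HOGDescriptor implementing exactly this block-and-bin scheme; a minimal sketch follows, using the library defaults (64×128 detection window, 16×16 blocks with an 8-pixel stride, 8×8 cells, 9 orientation bins). These are OpenCV's defaults, not necessarily the configuration used in the paper.

```python
import cv2

def hog_features(gray):
    """Extract a HOG vector with OpenCV's default parameters
    (64x128 window, 9 orientation bins -> 3780 dimensions)."""
    hog = cv2.HOGDescriptor()
    resized = cv2.resize(gray, (64, 128))  # match the default window size
    return hog.compute(resized).ravel()
```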
D. Classification - Support Vector Machines (SVM)

Scene image classification is performed using a multiclass SVM. An SVM finds a hyperplane that separates D-dimensional data into its two categories [23]. It is a discriminative model for classification that principally depends on two basic assumptions. First, complex classification problems can be solved through simple linear discriminative functions by transforming the data into a high-dimensional space. Second, the training patterns used in SVMs are only those close to the decision surface, assuming they provide the most relevant information for classification. Support vector machines were proposed as binary classifiers; in real scenarios, however, data must be classified into multiple classes. This is done using a multiclass SVM, built with either the One-Against-One (OAO) or the One-Against-All (OAA) approach [24]. For image scene classification, the features extracted in the previous stage are used to train a multiclass SVM (OAO) classifier.
E. Proposed Algorithm

Input: scene image
Output: scene class label
1. Acquire an image from the attached camera
2. Compute LQP, BoW and HOG features and construct their histograms
3. Concatenate all three feature histograms to form a single feature vector
4. Perform multiclass SVM classification to classify the image (a code sketch of these steps is given below)
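The sketch below strings the previous pieces together and is illustrative only. It reuses the hypothetical lqp_histogram, bovw_histogram and hog_features helpers sketched earlier, assumes a trained SVC clf and k-means vocabulary kmeans, and, for brevity, runs LQP on the grayscale image rather than per color channel as the paper describes.

```python
import cv2
import numpy as np

def classify_scene(clf, kmeans, frame):
    """Steps 2-4: extract the three feature histograms, concatenate them
    into a single vector, and classify it with the trained multiclass SVM."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    features = np.concatenate([
        lqp_histogram(gray),           # local texture (LQP)
        bovw_histogram(gray, kmeans),  # global interest points (BoVW)
        hog_features(gray),            # gradient/edge structure (HOG)
    ])
    return clf.predict(features.reshape(1, -1))[0]

# Step 1: grab a frame from the camera attached to the embedded board.
camera = cv2.VideoCapture(0)
ok, frame = camera.read()
if ok:
    print(classify_scene(clf, kmeans, frame))
```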
IV. EXPERIMENTS AND RESULTS

We evaluate our proposed method for scene classification on the MIT Indoor 67 dataset [25]. It consists of a total of 67 indoor scene classes, out of which 15 are selected for evaluation purposes. The images in the dataset are best characterized by the objects they contain and their global layout. The MIT indoor dataset is a challenging dataset that illustrates the significance of using a semantically meaningful representation for complex scenes cluttered with objects. Figure 4 shows selected images from the 15 categories used for evaluation.
We tested our algorithm using MATLAB 2016 on an i3 workstation with 8 GB of RAM. We also performed experimentation on a Raspberry Pi 3 embedded computer [21] using OpenCV [26]. We used 5-fold cross validation for experimental evaluation, and results are discussed in terms of accuracy, precision and recall.

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100        (2)

Precision = (TP / (TP + FP)) × 100        (3)

Recall = (TP / (TP + FN)) × 100        (4)
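For reference, all three measures can be computed per class from a confusion matrix; a small sketch:

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy, precision and recall (Eqs. 2-4) per class, where
    cm[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    return accuracy, precision, recall
```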

TABLE I. RESULTS OF INDIVIDUAL CLASSES IN TERMS OF RECALL AND PRECISION

Classes          True Positive Rate (Recall)   Positive Predictive Value (Precision)
Bedroom          0.93                          0.72
Bathroom         0.77                          0.98
Child room       0.67                          0.90
Closet           0.81                          0.92
Clothing store   0.79                          0.87
Computer room    0.74                          0.86
Dining room      0.80                          0.99
Game room        0.71                          0.97
Garage           0.97                          1.00
Kitchen          0.76                          0.92
Living room      0.67                          0.99
Lobby            0.78                          0.64

Figure 4: Scene Images from MIT Indoor Database

The results in terms of accuracy of applying our proposed integrated framework, which combines the LQP, HOG and BoW feature sets, are presented in Fig. 5. Our approach achieved an overall accuracy of 81.6%. The difference between the texture of a living room image and a bedroom image is small, so the two are very difficult to distinguish; this results in prediction inaccuracy, i.e. 18% of living room scene images are predicted as bedroom. We also compared our results with widely used local feature descriptors: Local Binary Patterns [27], Local Ternary Patterns [28] and Local Tri-directional Patterns [29]. Comparison results are shown in Fig. 6. LQP features are very effective in terms of computational speed and discriminative performance, which makes them a suitable choice for embedded systems, while HOG is successful in capturing gradient information about the edges present in the image. The BoVW technique is used to capture the global information available in the image; it computes interest points over the whole image and clusters them into groups. Feature fusion of global and local feature descriptors gives better results compared to earlier work. Results for the individual classes in terms of recall and precision are presented in Table I.

V. CONCLUSIONS

In this paper, we presented an integrated feature fusion approach for indoor scene classification. Our approach fuses useful features from local and global descriptors, i.e. LQP, BoVW and HOG features are fused to form a strong representation of the scene image. LQP and HOG are robust to noise and extract local neighborhood information from image patches, while BoVW represents the global interest points present in an image. A multiclass SVM is applied for classification on the MIT indoor scene dataset. In future, we will implement a Field Programmable Gate Array (FPGA) based hardware accelerator for the LQP and HOG feature descriptors to enhance the overall system performance in terms of throughput, which is a critical requirement of high-performance, real-time embedded systems.

Figure 5: Confusion matrix showing accuracy of individual classes


Figure 6: Comparison of proposed framework with other methods

REFERENCES

[1] S. Song, S. P. Lichtenberg, and J. Xiao, "Sun rgb-d: A rgb-d scene understanding benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 567-576.
[2] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, "What do we perceive in a glance of a real-world scene?," Journal of Vision, vol. 7, pp. 10-10, 2007.
[3] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, pp. 145-175, 2001.
[4] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014, pp. 487-495.
[5] L.-J. Li, R. Socher, and L. Fei-Fei, "Towards total scene understanding: Classification, annotation and segmentation in an automatic framework," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2036-2043.
[6] P. Xu, F. Davoine, J.-B. Bordes, H. Zhao, and T. Denoeux, "Information fusion on oversegmented images: An application for urban scene understanding," in Thirteenth IAPR International Conference on Machine Vision Applications, 2013, pp. 189-193.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, pp. 1231-1237, 2013.
[8] X. Bian, C. Chen, Q. Du, and Y. Sheng, "Extended multi-structure local binary pattern for high-resolution image scene classification," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016, pp. 5134-5137.
[9] R. Jafri, S. A. Ali, H. R. Arabnia, and S. Fatima, "Computer vision-based object recognition for the visually impaired in an indoors environment: a survey," The Visual Computer, vol. 30, pp. 1197-1222, 2014.
[10] A. M. Barbancho, L. J. Tardón, J. López-Carrasco, J. Eggink, and I. Barbancho, "Automatic classification of personal video recordings based on audiovisual features," Knowledge-Based Systems, vol. 89, pp. 218-227, 2015.
[11] C. Scharfenberger, S. L. Waslander, J. S. Zelek, and D. A. Clausi, "Existence detection of objects in images for robot vision using saliency histogram features," in International Conference on Computer and Robot Vision (CRV), 2013, pp. 75-82.
[12] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11-26, 2017.
[13] J. King, V. Kishore, and F. Ranalli, "Scene classification with Convolutional Neural Networks."
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[15] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[16] A. Karpov, L. Akarun, H. Yalçın, A. Ronzhin, B. E. Demiröz, A. Çoban, et al., "Audio-visual signal processing in a multimodal assisted living environment," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[17] S. K. Vipparthi and S. K. Nagar, "Color directional local quinary patterns for content based indexing and retrieval," Human-centric Computing and Information Sciences, vol. 4, p. 6, 2014.
[18] C.-F. Tsai, "Bag-of-words representation in image annotation: A review," ISRN Artificial Intelligence, vol. 2012, 2012.
[19] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893.
[20] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, "Fast human detection using a cascade of histograms of oriented gradients," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 1491-1498.
[21] V. Menezes, V. Patchava, and M. S. D. Gupta, "Surveillance and monitoring system using Raspberry Pi and SimpleCV," in International Conference on Green Computing and Internet of Things (ICGCIoT), 2015, pp. 1276-1278.
[22] W. Zhang, G. Zelinsky, and D. Samaras, "Real-time accurate object detection using multiple resolutions," in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1-8.
[23] S. Amarappa and S. Sathyanarayana, "Data classification using Support Vector Machine (SVM), a simplified approach," International Journal of Electronics and Computer Science Engineering, 2014.
[24] J. Milgram, M. Cheriet, and R. Sabourin, "'One against one' or 'one against all': Which one is better for handwriting recognition with SVMs?," in Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.
[25] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413-420.
[26] K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov, "Real-time computer vision with OpenCV," Communications of the ACM, vol. 55, pp. 61-69, 2012.
[27] M. Pietikäinen, A. Hadid, G. Zhao, and T. Ahonen, Computer Vision Using Local Binary Patterns, vol. 40, Springer Science & Business Media, 2011.
[28] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," in International Workshop on Analysis and Modeling of Faces and Gestures, 2007, pp. 168-182.
[29] M. Verma and B. Raman, "Local tri-directional patterns: A new texture feature descriptor for image retrieval," Digital Signal Processing, vol. 51, pp. 62-72, 2016.
