
OBJECT RECOGNITION UNDER MULTIFARIOUS CONDITIONS: A RELIABILITY ANALYSIS AND A FEATURE SIMILARITY-BASED PERFORMANCE ESTIMATION

Dogancan Temel*, Jinsol Lee*, and Ghassan AlRegib

Center for Signal and Information Processing,


School of Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA, 30332-0250
{cantemel, jinsol.lee, alregib}@gatech.edu

*Equal contribution.
Dataset: https://ghassanalregib.com/cure-or/

ABSTRACT

In this paper, we investigate the reliability of online recognition platforms, Amazon Rekognition and Microsoft Azure, with respect to changes in background, acquisition device, and object orientation. We focus on platforms that are commonly used by the public to better understand their real-world performances. To assess the variation in recognition performance, we perform a controlled experiment by changing the acquisition conditions one at a time. We use three smartphones, one DSLR, and one webcam to capture side views and overhead views of objects in living room, office, and photo studio setups. Moreover, we introduce a framework to estimate the recognition performance with respect to backgrounds and orientations. In this framework, we utilize both handcrafted features based on color, texture, and shape characteristics and data-driven features obtained from deep neural networks. Experimental results show that deep learning-based image representations can estimate the recognition performance variation with a Spearman's rank-order correlation of 0.94 under multifarious acquisition conditions.

Index Terms— object dataset, controlled experiment with recognition platforms, performance estimation, deep learning, feature similarity

1. INTRODUCTION

In recent years, the performance of visual recognition and detection algorithms has considerably advanced with the progression of data-driven approaches and computational capabilities [1, 2]. These advancements enabled state-of-the-art methods to achieve human-level performance in specific recognition tasks [3, 4]. Despite these significant achievements, it remains a challenge to utilize such technologies in real-world environments that diverge from training conditions. To identify the factors that can affect recognition performance, we need to perform controlled experiments as in [5–9]. Even though these studies shed light on the vulnerability of existing recognition approaches, the investigated conditions are either limited or unrealistic. Recently, we introduced the CURE-OR dataset and analyzed the recognition performance with respect to simulated challenging conditions [10]. Hendrycks and Dietterich [11] also studied the effect of similar conditions by postprocessing the images in ImageNet [1]. In [12–15], performance variation under simulated challenging conditions was analyzed for traffic sign recognition and detection. The aforementioned studies overlooked the acquisition conditions and investigated the effect of simulated conditions. In contrast to the literature [5–9, 11–13, 15] and our previous work [10], the main focus of this study is to analyze the effect of real-world acquisition conditions including device type, orientation, and background. In Fig. 1, we show sample images obtained under different acquisition conditions.

[Fig. 1: Object backgrounds and orientations in CURE-OR. Backgrounds: (a) White, (b) 2D Living Room, (c) 2D Kitchen, (d) 3D Living Room, (e) 3D Office. Orientations: (f) 0° (Front), (g) 90°, (h) 180°, (i) 270°, (j) Overhead.]

If we consider ideal acquisition conditions as reference conditions that lead to the highest recognition rate, any variation would decrease the recognition performance and affect visual representations. Based on this assumption, we hypothesize that recognition performance variations can be estimated by variations in visual representations.



Overall, the contributions of this manuscript are five-fold. First, we investigate the effect of background on object recognition by performing controlled experiments with different backgrounds. Second, we analyze the effect of acquisition devices by comparing the recognition accuracy of images captured with different devices. Third, we analyze the recognition performance with respect to different orientation configurations. Fourth, we introduce a framework to estimate the recognition performance variation under varying backgrounds and orientations. Fifth, we benchmark the performance of handcrafted and data-driven features obtained from deep neural networks in the proposed framework. The outline of this paper is as follows. In Section 2, we analyze the objective recognition performance with respect to acquisition conditions. In Section 3, we describe the recognition performance estimation framework and benchmark hand-crafted and data-driven methods. Finally, we conclude our work in Section 4.

2. RECOGNITION UNDER MULTIFARIOUS CONDITIONS

Based on scalability, user-friendliness, computation time, service fees, and access to labels and confidence scores, we assessed off-the-shelf platforms and decided to utilize the Microsoft Azure Computer Vision (MSFT) and Amazon Rekognition (AMZN) platforms. As a test set, we use the recently introduced CURE-OR dataset, which includes one million images of 100 objects captured with different devices under various object orientations, backgrounds, and simulated challenging conditions. Objects are classified into 6 categories: toys, personal belongings, office supplies, household items, sport/entertainment items, and health/personal care items as described in [10]. We identified 4 objects per category for each platform for testing, but because Azure only identified 3 objects correctly in one category, we excluded the object with the lowest number of correctly identified images from Amazon for a fair comparison. Therefore, we used 23 objects to assess the robustness of the recognition platforms. Original (challenge-free) images in each category were processed to simulate realistic challenging conditions including underexposure, overexposure, blur, contrast, dirty lens, salt and pepper noise, and resizing as illustrated in [10]. We calculated the top-5 accuracy for each challenge category to quantify recognition performance. Specifically, we calculated the ratio of correct classifications for each object in which the ground-truth label was among the highest five predictions.
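As an illustration of this evaluation protocol, the following minimal sketch queries Amazon Rekognition for the five highest-confidence labels of an image and computes the top-5 accuracy over a set of images. The boto3 detect_labels call is one way to obtain such labels; the region, the helper names, and the case-insensitive string match between ground-truth and predicted labels are illustrative assumptions rather than the exact pipeline used in this paper.

    import boto3

    # Illustrative client setup; region and credentials are assumptions.
    rekognition = boto3.client("rekognition", region_name="us-east-1")

    def top5_labels(image_path):
        """Return the five highest-confidence label names for one image."""
        with open(image_path, "rb") as f:
            response = rekognition.detect_labels(Image={"Bytes": f.read()},
                                                 MaxLabels=5)
        return [label["Name"].lower() for label in response["Labels"]]

    def top5_accuracy(image_paths, ground_truth_labels):
        """Ratio of images whose ground-truth label is among the top-5 predictions."""
        hits = sum(truth.lower() in top5_labels(path)
                   for path, truth in zip(image_paths, ground_truth_labels))
        return hits / len(image_paths)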
We report the recognition performance of the online platforms with respect to varying acquisition conditions in Fig. 2. Each line represents a challenge type, except the purple line that shows the average of all challenge types. In terms of object backgrounds, the white background leads to the highest recognition accuracy in both platforms as shown in Fig. 2(a-b), followed by the 2D textured backgrounds of kitchen and living room, and then by the 3D backgrounds of office and living room. Objects are recognized more accurately in front of the white backdrop because there is no texture or color variation in the background that can resemble other objects. The most challenging scenarios correspond to the real-world office and living room because of the complex background structure. Recognition accuracy in front of the 2D backdrops is higher than in the real-world setups because foreground objects are more distinct when the background is out of focus.

[Fig. 2: Recognition accuracy versus acquisition conditions. Panels: (a) AMZN: Backgrounds, (b) MSFT: Backgrounds, (c) AMZN: Orientations, (d) MSFT: Orientations, (e) AMZN: Devices, (f) MSFT: Devices. Each curve corresponds to a challenge type (Resize, Underexposure, Overexposure, Blur, Contrast, DirtyLens1, DirtyLens2, Salt&Pepper) or their Average; the vertical axis is top-5 accuracy (%).]

In terms of orientations, the front view (0 deg) leads to the highest recognition accuracy as shown in Fig. 2(c-d), which is expected because the objects in CURE-OR face forward with their most characteristic features. In contrast, these characteristic features are highly self-occluded in the overhead view, which leads to the lowest recognition performance. In the case of left, right, and back views, characteristic features are not as clear as in the front view, but self-occlusion is not as significant as in the overhead view. Therefore, these orientations lead to medium recognition performances compared to the front and overhead views. Recognition performances with respect to acquisition devices are reported in Fig. 2(e-f), which shows that performance variation based on device type is less significant than for backgrounds and orientations. However, there is still a performance difference between images obtained from different devices. Overall, the Nikon D80 and Logitech C920 lead to the highest recognition performance in both platforms, which highlights the importance of image quality for recognition applications.

Table 1: Recognition accuracy estimation performance of image feature distances in terms of Spearman correlation.

Feature distance metrics (columns): l1, l2, l2^2, SAD, SSAD, Canberra, Chebyshev, Minkowski, Bray-Curtis, Cosine.

Amazon Rekognition (AMZN) - Background
  Hand-crafted
    Color   0.14  0.30  0.29  0.10  0.29  0.88  0.01  0.30  0.14  0.20
    Daisy   0.31  0.27  0.26  0.07  0.26  0.31  0.40  0.27  0.31  0.27
    Edge    0.18  0.08  0.12  0.19  0.07  0.66  0.04  0.08  0.45  0.17
    Gabor   0.77  0.76  0.76  0.35  0.76  0.58  0.71  0.76  0.77  0.71
    HOG     0.13  0.17  0.16  0.08  0.16  0.01  0.12  0.17  0.13  0.13
  Data-driven
    VGG11   0.85  0.85  0.85  0.10  0.85  0.93  0.84  0.85  0.85  0.85
    VGG13   0.85  0.85  0.83  0.01  0.83  0.92  0.69  0.85  0.85  0.86
    VGG16   0.88  0.84  0.84  0.08  0.84  0.94  0.79  0.84  0.88  0.85

Amazon Rekognition (AMZN) - Orientation
  Hand-crafted
    Color   0.28  0.41  0.41  0.54  0.41  0.04  0.48  0.41  0.28  0.16
    Daisy   0.45  0.28  0.17  0.03  0.17  0.45  0.08  0.28  0.45  0.21
    Edge    0.71  0.66  0.69  0.19  0.63  0.67  0.65  0.66  0.65  0.45
    Gabor   0.05  0.06  0.09  0.39  0.09  0.24  0.02  0.06  0.05  0.06
    HOG     0.19  0.16  0.19  0.51  0.19  0.30  0.09  0.16  0.19  0.15
  Data-driven
    VGG11   0.86  0.92  0.91  0.34  0.91  0.69  0.94  0.92  0.86  0.89
    VGG13   0.91  0.90  0.84  0.01  0.84  0.65  0.78  0.90  0.91  0.88
    VGG16   0.88  0.92  0.84  0.48  0.84  0.72  0.87  0.92  0.88  0.87

Microsoft Azure (MSFT) - Background
  Hand-crafted
    Color   0.12  0.13  0.13  0.14  0.13  0.91  0.02  0.13  0.12  0.21
    Daisy   0.14  0.18  0.17  0.01  0.17  0.14  0.34  0.18  0.14  0.18
    Edge    0.20  0.10  0.11  0.27  0.08  0.55  0.08  0.10  0.39  0.14
    Gabor   0.85  0.84  0.84  0.29  0.84  0.59  0.80  0.84  0.85  0.82
    HOG     0.30  0.32  0.31  0.17  0.31  0.11  0.18  0.32  0.30  0.10
  Data-driven
    VGG11   0.94  0.94  0.94  0.13  0.94  0.83  0.90  0.94  0.94  0.93
    VGG13   0.93  0.92  0.91  0.03  0.91  0.86  0.62  0.92  0.93  0.93
    VGG16   0.91  0.93  0.93  0.15  0.93  0.87  0.89  0.93  0.91  0.93

Microsoft Azure (MSFT) - Orientation
  Hand-crafted
    Color   0.28  0.45  0.47  0.02  0.47  0.04  0.46  0.45  0.28  0.27
    Daisy   0.48  0.43  0.34  0.24  0.34  0.48  0.32  0.43  0.48  0.38
    Edge    0.54  0.50  0.51  0.15  0.53  0.45  0.47  0.50  0.35  0.15
    Gabor   0.25  0.21  0.18  0.24  0.18  0.10  0.23  0.21  0.25  0.37
    HOG     0.11  0.06  0.11  0.36  0.11  0.22  0.13  0.06  0.11  0.38
  Data-driven
    VGG11   0.38  0.46  0.50  0.03  0.50  0.34  0.42  0.46  0.38  0.43
    VGG13   0.52  0.48  0.47  0.15  0.47  0.44  0.43  0.48  0.52  0.51
    VGG16   0.43  0.46  0.48  0.71  0.48  0.46  0.53  0.46  0.43  0.44

[Fig. 3: Scatter plots of top hand-crafted and data-driven recognition accuracy estimation methods. Panels: (a) AMZN Background - VGG16 Canberra, (b) AMZN Background - Color Canberra, (c) AMZN Orientation - VGG11 Chebyshev, (d) AMZN Orientation - Edge l1, (e) MSFT Background - VGG11 Minkowski, (f) MSFT Background - Color Canberra, (g) MSFT Orientation - VGG16 SAD, (h) MSFT Orientation - Edge l1. The horizontal axis is log(1/distance), the vertical axis is top-5 accuracy (%), and markers denote background (White, 2D1, 2D2, 3D1, 3D2) or orientation (Front, Side, Top) configurations.]

3. RECOGNITION PERFORMANCE ESTIMATION UNDER MULTIFARIOUS CONDITIONS

Based on the experiments reported in Section 2, the reference configuration that leads to the highest recognition performance is the front view, white background, and Nikon DSLR. We conducted two experiments to estimate the recognition performance with respect to changes in background and orientation. We utilized the 10 objects common to both platforms for a direct comparison. In the background experiment, we grouped images captured with a particular device (5) in front of a specific background (5), which leads to 25 image groups with front and side views of the objects. In the orientation experiment, we grouped images captured with a particular device (5) from an orientation (3) among front, top, and side views, which leads to 15 image groups with images of the objects in front of white, living room, and kitchen backdrops. For each image group, we obtained an average recognition performance per recognition platform and an average feature distance between the images in the group and their reference image. Finally, we analyzed the relationship between recognition accuracy and feature distance with correlations and scatter plots. We extracted commonly used handcrafted and data-driven features as follows (a minimal extraction sketch is given after the list):

• Color: Histograms of color channels in RGB.
• Daisy: Local image descriptor based on convolutions of gradients in specific directions with Gaussian filters [16].
• Edge: Histogram of vertical, horizontal, diagonal, and non-directional edges.
• Gabor: Frequency and orientation information of images extracted through Gabor filters.
• HOG: Histogram of oriented gradients over local regions.
• VGG: Features obtained from convolutional neural networks that are based on stacked 3 × 3 convolutional layers [17]. The VGG index indicates the number of weighted layers, of which the last three are fully connected layers.
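The sketch below illustrates how such features can be extracted with commonly used libraries (NumPy, scikit-image, PyTorch, and torchvision). The histogram bin count, the Gabor frequencies, the HOG cell size, and the choice of the VGG-16 convolutional output as the data-driven feature are illustrative assumptions rather than the paper's exact settings; the edge histogram is omitted for brevity.

    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from skimage.color import rgb2gray
    from skimage.feature import daisy, hog
    from skimage.filters import gabor

    def color_histogram(image_rgb, bins=32):
        # Concatenated per-channel histograms of an RGB image with values in [0, 255].
        return np.concatenate([
            np.histogram(image_rgb[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)
        ]).astype(float)

    def gabor_features(image_rgb, frequencies=(0.1, 0.2, 0.4)):
        # Mean magnitude of Gabor filter responses at a few assumed frequencies.
        gray = rgb2gray(image_rgb)
        feats = []
        for f in frequencies:
            real, imag = gabor(gray, frequency=f)
            feats.append(np.mean(np.hypot(real, imag)))
        return np.array(feats)

    def hog_features(image_rgb):
        return hog(rgb2gray(image_rgb), pixels_per_cell=(16, 16))

    def daisy_features(image_rgb):
        return daisy(rgb2gray(image_rgb)).ravel()

    # Data-driven features from the convolutional part of a pretrained VGG-16
    # (newer torchvision versions use the weights= argument instead of pretrained=).
    _vgg = models.vgg16(pretrained=True).features.eval()
    _preprocess = T.Compose([
        T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def vgg_features(image_rgb):
        with torch.no_grad():
            x = _preprocess(image_rgb.astype(np.uint8)).unsqueeze(0)
            return _vgg(x).flatten().numpy()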
We calculated the distance between features with the l1 norm, l2 norm, squared l2 norm (l2^2), sum of absolute differences (SAD), sum of squared absolute differences (SSAD), weighted l1 norm (Canberra), l∞ norm (Chebyshev), Minkowski distance, Bray-Curtis dissimilarity, and Cosine distance. We report the recognition accuracy estimation performance in Table 1 in terms of the Spearman correlation between top-5 recognition accuracy scores and feature distances. The top-performing data-driven and hand-crafted methods for each recognition platform and experiment correspond to the feature-metric combinations shown in the scatter plots of Fig. 3.
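A minimal sketch of this correlation analysis, assuming SciPy is available, is given below. The group data structure, the Minkowski order, and reporting the correlation magnitude are assumptions made to mirror the description above rather than the exact implementation.

    import numpy as np
    from scipy.spatial import distance
    from scipy.stats import spearmanr

    # Distance metrics corresponding to the columns of Table 1; SAD and SSAD
    # reduce to unnormalized l1 and squared-l2 sums and are omitted here.
    METRICS = {
        "l1": distance.cityblock,
        "l2": distance.euclidean,
        "l2_squared": distance.sqeuclidean,
        "canberra": distance.canberra,
        "chebyshev": distance.chebyshev,
        "minkowski": lambda u, v: distance.minkowski(u, v, p=3),  # assumed order
        "bray_curtis": distance.braycurtis,
        "cosine": distance.cosine,
    }

    def estimation_correlation(groups, metric="canberra"):
        """groups: list of dicts with 'features' (feature vectors of the group's
        images), 'reference' (feature vector of the reference image), and
        'accuracy' (average top-5 accuracy of the group, in %)."""
        dist_fn = METRICS[metric]
        avg_distances, accuracies = [], []
        for g in groups:
            avg_distances.append(np.mean([dist_fn(f, g["reference"])
                                          for f in g["features"]]))
            accuracies.append(g["accuracy"])
        # Accuracy and feature distance are expected to be anti-correlated,
        # so the magnitude of Spearman's rank-order correlation is reported.
        rho, _ = spearmanr(avg_distances, accuracies)
        return abs(rho)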
In the background experiment, color characteristics of different backgrounds are distinct from each other as observed in Fig. 1. In terms of low-level characteristic features including Daisy, Edge, and HOG, edges in the backgrounds can distinguish highly textured backgrounds from less textured backgrounds. However, edges would not be sufficient to distinguish lowly textured backgrounds from each other. Moreover, edges of the foreground objects can dominate the feature representations and mask the effect of changes in the backgrounds. To distinguish differences in backgrounds overlooked by edge characteristics, frequency and orientation characteristics can be considered with Gabor features. Data-driven methods including VGG utilize all three channels of images for feature extraction, which can give them an inherent advantage over the methods that solely utilize color or structure information. Overall, the data-driven VGG method leads to the highest performance in the background experiment for both recognition platforms. In terms of hand-crafted features, color leads to the highest performance followed by Gabor, whereas edge-based methods result in inferior performance.

Distinguishing changes in orientation is more challenging compared to backgrounds because the region of interest is limited to a smaller area. Therefore, overall recognition accuracy estimation performances are lower for orientations compared to backgrounds as reported in Table 1. Similar to the background experiment, VGG architectures lead to the highest performance estimation in the orientation experiment. However, hand-crafted methods are dominated by edge features instead of Gabor representations. We show the scatter plots of top-performing data-driven and hand-crafted methods in Fig. 3, in which the x-axis corresponds to the average distance between image features and the y-axis corresponds to top-5 accuracy. Image groups corresponding to different configurations are more distinctly clustered in terms of background as observed in Fig. 3(a-b, e-f). In terms of orientation, VGG leads to a clear distinction of configurations for Amazon Rekognition as observed in Fig. 3(c), whereas image groups are overlapping in the other experiments as shown in Fig. 3(d, g-h). Clustering configurations is more challenging in the orientation experiment because it is not possible to easily separate orientation configurations based on their recognition accuracy.

4. CONCLUSION

In this paper, we analyzed the robustness of recognition platforms and reported that object background can affect recognition performance as much as orientation, whereas acquisition devices have a minor influence on recognition. We also introduced a framework to estimate recognition performance variation and showed that color-based features capture background variations, edge-based features capture orientation variations, and data-driven features capture both background and orientation variations in a controlled setting. Overall, recognition performance can significantly change depending on the acquisition conditions, which highlights the need for more robust platforms that we can rely on. Estimating recognition performance with feature similarity-based metrics can be helpful to test the robustness of algorithms before deployment. However, the applicability of such estimation frameworks can drastically increase if we design no-reference approaches that can provide a recognition performance estimate without a reference image, similar to the no-reference algorithms in the image quality assessment field.

5. REFERENCES

[1] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 248–255.

[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision (ECCV), D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., Cham, 2014, pp. 740–755, Springer International Publishing.

[3] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 2015, pp. 1026–1034, IEEE Computer Society.

[4] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, "Deep Image: Scaling up image recognition," arXiv:1501.02876, 2015.

[5] S. Dodge and L. Karam, "Understanding how image quality affects deep neural networks," in International Conference on Quality of Multimedia Experience (QoMEX), June 2016, pp. 1–6.

[6] Y. Zhou, S. Song, and N. Cheung, "On classification of distorted images with deep convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 1213–1217.

[7] H. Hosseini, B. Xiao, and R. Poovendran, "Google's Cloud Vision API is not robust to noise," in 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec 2017, pp. 101–105.

[8] J. Lu, H. Sibai, E. Fabry, and D. Forsyth, "No need to worry about adversarial examples in object detection in autonomous vehicles," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2017.

[9] N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, S. Li, L. Chen, M. E. Kounavis, and D. H. Chau, "SHIELD: Fast, practical defense and vaccination for deep learning using JPEG compression," in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), New York, NY, USA, 2018, pp. 196–204, ACM.

[10] D. Temel, J. Lee, and G. AlRegib, "CURE-OR: Challenging Unreal and Real Environments for Object Recognition," in IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.

[11] D. Hendrycks and T. G. Dietterich, "Benchmarking neural network robustness to common corruptions and surface variations," in International Conference on Learning Representations (ICLR), 2019.

[12] D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib, "CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition," in Neural Information Processing Systems (NeurIPS), Machine Learning for Intelligent Transportation Systems Workshop, 2017.

[13] D. Temel and G. AlRegib, "Traffic signs in the wild: Highlights from the IEEE Video and Image Processing Cup 2017 student competition [SP Competitions]," IEEE Signal Processing Magazine, vol. 35, no. 2, pp. 154–161, March 2018.

[14] M. Prabhushankar, G. Kwon, D. Temel, and G. AlRegib, "Semantically interpretable and controllable filter sets," in IEEE International Conference on Image Processing (ICIP), Oct 2018, pp. 1053–1057.

[15] D. Temel, T. Alshawi, M.-H. Chen, and G. AlRegib, "Challenging environments for traffic sign detection: Reliability assessment under inclement conditions," arXiv:1902.06857, 2019.

[16] E. Tola, V. Lepetit, and P. Fua, "Daisy: An efficient dense descriptor applied to wide-baseline stereo," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 5, pp. 815–830, May 2010.

[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.

