
Open World Plant Image Identification Based on Convolutional Neural Network


Siang Thye Hang*, Masaki Aono†
Toyohashi University of Technology, Japan
E-mail: hang@kde.cs.tut.ac.jp*, aono@tut.jp†

Abstract— In this paper, we propose several enhancements to the well-known VGG 16-layers Convolutional Neural Network (CNN) model towards open world image classification, taking plant identification as an example. We first propose to replace the last pooling layer of the VGG 16-layers model with a Spatial Pyramid Pooling layer, enabling the model to accept arbitrarily sized input images. Second, for the activation function, we replace the Rectified Linear Unit (ReLU) with Parametric ReLU in order to increase the adaptability of parameter learning. In addition, we introduce the Unseen Category Query Identification algorithm to identify and omit images of unseen categories, thus preventing false classification into predefined categories. Such an algorithm is essential in real life, since there is no guarantee that a given image falls into a predefined category. We use the dataset from the LifeCLEF 2016 plant identification task. We compare our results with those of other participants and demonstrate that our enhanced model with the proposed algorithm exhibits outstanding performance.

I. INTRODUCTION

Nowadays, the conservation of biodiversity is becoming an important duty for us. To achieve this, accurate knowledge is essential. However, even for professionals, identifying a species can be a very difficult task. Convolutional Neural Networks (CNNs) deliver the best performance in various image retrieval tasks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1] [2] [3] [4]. CNNs learn filters through multiple convolution layers, enabling a higher level of feature abstraction than hand-crafted features such as the Scale Invariant Feature Transform (SIFT).

In recent years, CNNs have gained popularity in various image retrieval tasks including the LifeCLEF Plant Identification Task (PlantCLEF) [5] [6]. Current state-of-the-art CNN models classify an image into a fixed number of (e.g. 1000) classes with high accuracy. However, they share an inherent limitation: the inability to handle queries of unseen categories, which are never observed during training. The ability to identify such queries is important for an image recognition system to work in the real world rather than in a controlled laboratory environment. Therefore, we propose the Unseen Category Query Identification (UCQI) algorithm to identify queries of unseen categories in post processing.

In the next section, we detail our enhancements of the VGG 16-layers model, and we present the proposed algorithm in Section III. Section IV shows evaluation results and Section V concludes this paper.

II. MODEL ENHANCEMENT

The following sections describe enhancements applied to the VGG 16-layers model.

A. Spatial Pyramid Pooling

The VGG 16-layers model [3] is one of the top performers in ILSVRC 2014, consisting of a total of 16 convolutional and fully connected layers. The original VGG 16-layers model requires an input image of spatial size 224 × 224. Under such a spatial size restriction, arbitrarily sized input images have to be either cropped or warped (Fig. 1), leading to information loss.

To circumvent this restriction, as shown in Fig. 2, we have replaced the last pooling layer (pool5) with a Spatial Pyramid Pooling (SPP) [7] layer. The last convolution layer (conv5_3) produces a feature map of 512 channels. With SPP (of 3 pyramid levels), the feature map is spatially divided into 1 × 1, 2 × 2, and 4 × 4 grids, a total of 21 regions. Each region is then average pooled, producing a vector of fixed size 21 × 512 = 10752, as illustrated in Fig. 3. Such conversion of the feature map into a fixed size vector allows the model to accept arbitrarily sized input images.

B. Parametric Rectified Linear Unit

We also replace all of the Rectified Linear Unit (ReLU) activations with a learnable version known as Parametric ReLU [8], which consistently outperforms ReLU in the empirical experiments by Xu et al. [9].

The activation function of ReLU is shown in (1). With the use of ReLU, negative input is entirely saturated to zero. In Parametric ReLU, the negative part is multiplied by a parameter 𝛼, as shown in (2). The parameter 𝛼 is learned through back propagation during training of the CNN model.

ReLU(𝑥) = max(0, 𝑥) (1)

PReLU(𝑥) = { 𝑥, if 𝑥 > 0; 𝛼𝑥, otherwise } (2)

III. POST PROCESSING

The following sections discuss the One-Versus-All method's limitation in handling queries of unseen categories in multiclass classification. A new algorithm is proposed to cope with such queries.
Fig. 1. The VGG 16-layers model: an input image of size 224 × 224 (cropped or warped) passes through convolution layers with ReLU activations and pooling layers, the last pooling layer (pool5), hidden layers with ReLU activations, and the classification layer (fc8), producing class probabilities.

Fig. 2. Modified model with Unseen Category Query Identification: an arbitrarily sized input image passes through convolution layers with PReLU activations and pooling layers, a Spatial Pyramid Pooling layer, hidden layers with PReLU activations, and the classification layer (fc8), followed by post processing with Unseen Category Query Identification, which may omit the image from classification.
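The two enhancements above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: the fixed PReLU slope 0.25 is an assumption (in the actual model 𝛼 is learned by back propagation), and the feature map is a random stand-in for the conv5_3 output.

```python
import numpy as np

def prelu(x, alpha=0.25):
    # Parametric ReLU (eq. (2)): identity for positive inputs,
    # slope alpha (learned in the real model) for negative inputs.
    return np.where(x > 0, x, alpha * x)

def spp_average_pool(feature_map, levels=(1, 2, 4)):
    # Spatial Pyramid Pooling with average pooling (Section II A).
    # feature_map: (channels, height, width), e.g. 512 channels of
    # arbitrary spatial size produced by conv5_3.
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Split the spatial extent into an n x n grid of regions.
        h_edges = np.linspace(0, h, n + 1, dtype=int)
        w_edges = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, h_edges[i]:h_edges[i + 1],
                                        w_edges[j]:w_edges[j + 1]]
                pooled.append(region.mean(axis=(1, 2)))  # average pool
    # 1 + 4 + 16 = 21 regions, each a 512-dim vector -> 21 * 512 = 10752.
    return np.concatenate(pooled)

fmap = np.random.randn(512, 13, 17)  # arbitrary spatial size
vec = spp_average_pool(fmap)
print(vec.shape)  # (10752,)
```

Whatever the input height and width, the concatenated vector always has 21 × 512 elements, which is what frees the model from the 224 × 224 input restriction.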

A. Multiclass Classification with the One-Versus-All Method

This section discusses the limitation of the One-Versus-All method using binary classifiers in multiclass classification.

Binary Classification. Assume a binary classifier 𝐹 of class 𝐶 which maps an input 𝑥 into a score 𝐹(𝑥). 𝐹 can be optimized such that 𝐹(𝑥) = 1 when 𝑥 ∈ 𝐶, and 𝐹(𝑥) = −1 when 𝑥 ∉ 𝐶.

Multiclass Classification. In the case of a multiclass classifier, the One-Versus-All method is commonly used [10] to classify 𝑁 classes 𝐶 = [𝐶1, …, 𝐶𝑁] using 𝑁 binary classifiers 𝐹 = [𝐹1, …, 𝐹𝑁]. As shown in (3), 𝑥 ∈ 𝐶𝑖 when 𝐹𝑖(𝑥) is the highest among all of the binary classifiers.

𝑥 ∈ 𝐶𝑖 where 𝑖 = arg max𝑘∈{1,…,𝑁} 𝐹𝑘(𝑥) (3)

With (3), it is guaranteed that a class will be assigned to 𝑥, regardless of how high or low the scores produced by each classifier might be. Fig. 4's first example (from the left) is ideal, as there is only one strong positive score; thus we can confidently predict that input 𝛼 (Fig. 4) belongs to the first class. Meanwhile, the second and third examples have either weak or no positive scores. Based on (3), input 𝛽 (Fig. 4) is classified into the third class, while input 𝛾 is classified into the first class, even though all of the scores are rather low or even negative. Based on this scenario, we realize that there are cases where an input should be identified and omitted by all of the binary classifiers. The following algorithm identifies such inputs.

B. Unseen Category Query Identification Algorithm

Before explaining our proposed algorithm, it should be noted that to train the enhanced model, we select a subset of the labeled images for validation, leaving the remainder for training. This will be detailed in Section IV A. The validation images are not used to train the enhanced model but to evaluate the model during its training process. The first subsection defines the threshold vector based on the training images, while the second subsection elaborates the threshold vector computation with validation images.

Unseen Category Query Identification Based on Training Images. The classifier layer (fc8) of the enhanced model outputs a vector of 𝑁 scores, one per class, before softmax normalization. We extract the class scores of the training images with the enhanced model. Assuming we have 𝑀 training images, the extracted class scores can be represented as a matrix 𝑺𝑀×𝑁, where each row represents a training image. Rows of incorrectly predicted images are removed from the matrix. With the remaining correctly predicted class scores 𝑺𝑀′×𝑁 (𝑀′ ≤ 𝑀), we define the threshold vector 𝒕 = [𝑡1, …, 𝑡𝑁] with class-wise minima, as shown in (4).

𝑡𝑖 = min𝑘∈{1,…,𝑀′} 𝑺𝑘,𝑖 (𝑖 = 1, …, 𝑁) (4)

We infer that if an image belongs to a known category, at least one of its 𝑁 class scores should be higher than the corresponding threshold. Thus, during evaluation, any image with scores lower than 𝒕 for all of the classes will be identified as an image of an unseen category and thus omitted from being classified.

Fig. 3. An example of Spatial Pyramid Pooling with 21 regions: an arbitrarily sized feature map is split into 1 × 1, 2 × 2, and 4 × 4 grids, then average pooled and concatenated into a feature vector.

Fig. 4. An example of class scores (𝑁 = 4) based on three inputs 𝛼, 𝛽, 𝛾.
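A minimal sketch of the threshold construction in (4) together with the omission rule, again in NumPy. The score matrix and label array here are synthetic stand-ins for fc8 outputs and ground-truth labels, not real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: fc8 scores for M training images over N classes,
# together with ground-truth class labels.
M, N = 100, 4
scores = rng.normal(size=(M, N))
labels = rng.integers(0, N, size=M)

# Keep only correctly predicted rows (argmax matches the label),
# giving S_{M' x N} with M' <= M, as in Section III B.
correct = scores.argmax(axis=1) == labels
S = scores[correct]

# Eq. (4): class-wise minima over the correctly predicted scores.
t = S.min(axis=0)

def is_unseen(query_scores, threshold):
    # A query is flagged as unseen category when every class score
    # falls below its corresponding threshold.
    return bool(np.all(query_scores < threshold))

# A query below every threshold is omitted from classification;
# a query beating at least one threshold is classified by eq. (3).
assert is_unseen(t - 1.0, t)
assert not is_unseen(t + 1.0, t)
```

By construction, every correctly classified training image beats the threshold of at least its own class, so the rule only ever omits queries that look unlike all training classes.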
Taking Validation Images into Account. The threshold vector 𝒕 obtained from (4) is based solely on the training images. Due to factors such as overfitting, there is a possibility that lower thresholds can be acquired from the validation images. Reusing (4), we compute a threshold vector based on the validation images, which we denote as 𝒗 = [𝑣1, …, 𝑣𝑁]. Then, we compute the ratio vector 𝒓 = [𝑟1, …, 𝑟𝑁] as the element-wise ratio of 𝒗 over 𝒕, as shown in (5).

𝑟𝑖 = 𝑣𝑖 / 𝑡𝑖 (𝑖 = 1, …, 𝑁) (5)

Although we expect 𝑣𝑖 < 𝑡𝑖, i.e. 𝑟𝑖 < 1, for all of the classes 𝑖 = 1, …, 𝑁, as the number of validation images is much less than the number of training images, it is more likely that 𝑟𝑖 > 1. We also expect that the elements in 𝒓 are positive and finite. Thus, we compute the average 𝑧̅ of the elements in the ratio vector 𝒓 matching the condition 0 < 𝑟𝑖 < 1, as shown in (7) and (8). We then multiply the threshold vector 𝒕 by 𝑧̅, as shown in (9). Note that in (8), ‖𝒛‖0 denotes the L0-norm, i.e. the number of non-zero elements of the vector 𝒛.

𝒛 = [𝑧1, …, 𝑧𝑁] (6)

𝑧𝑖 = { 𝑟𝑖, if 0 < 𝑟𝑖 < 1; 0, otherwise } (7)

𝑧̅ = (1/𝑛) ∑𝑁𝑖=1 𝑧𝑖, 𝑛 = ‖𝒛‖0 (8)

𝒕′ = 𝑧̅𝒕 (9)

If validation images are to be considered, 𝒕′ will be used instead of 𝒕. Similarly, any evaluation image with scores lower than 𝒕′ for all of the classes will be omitted from being classified into a known category.

C. Multiple Image Based Identification

We use the PlantCLEF 2016 dataset [6] in our experiments. In the dataset, we observe that images of the same plant (i.e. observation) are grouped with a unique observation identifier (ObservationId), as shown in Fig. 5. Within the dataset, there are observations with more than 30 images, while there are also observations with only one image.

In the real world, multiple photos may be captured of the same plant, and some of them may be badly taken. Instead of identifying based on one single image, taking the average across multiple images should improve identification performance. After identifying and omitting images of unseen categories with the threshold vector 𝒕′, we compute the class-wise average of the scores of the remaining images sharing the same ObservationId.

Fig. 5. Images based on the same plant (with ObservationId 41797)

IV. EVALUATION

The following sections detail our experiments and results.

A. Evaluation Experiments

Dataset. The PlantCLEF 2016 dataset contains images of 1000 classes, i.e. plant species. It consists of 113204 labelled images and 8000 unlabeled images for evaluation purposes. The evaluation images contain images of unseen categories, i.e. invasive species or non-plant images. From the training images, we randomly select 2048 images for validation purposes, leaving 111156 images to train the enhanced model.

Data Augmentation. We augment the training images as follows: while preserving the aspect ratio, images are resized such that the shorter side becomes 224 and 336. Random cropping to 224 × 224 and random flipping along the y-axis are also applied to the resized images. Evaluation images are resized such that the shorter side becomes 224, without cropping or flipping. Cropping is not required for evaluation images, as the enhanced model already accepts arbitrarily sized input images, as mentioned in Section II A.

Training of Enhanced Model. To train such a deep model, we use Xavier's method [11] to initialize the weights of the convolution and fully connected layers. We train the model using Stochastic Gradient Descent with momentum 0.9, learning rates from 0.01 through 0.0001, and batch size 50. The learning rate is multiplied by 0.1 when the validation accuracy stops improving. As a result, we train the enhanced model with learning rate 0.01 for 30 epochs, 0.001 for 15 epochs and finally 0.0001 for 8 epochs.

Fig. 6. 10 of the 195 images omitted with threshold vector 𝒕

Fig. 7. 10 of the 69 images omitted with threshold vector 𝒕′. Note that these images are a subset of the 195 images shown in Fig. 6.
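The validation-based rescaling of the threshold vector in Section III B, eqs. (5)–(9), reduces to a few array operations. The vectors t and v below are small illustrative placeholders for the class-wise minima extracted from training and validation scores, not values from the actual experiments.

```python
import numpy as np

# Illustrative class-wise minimum scores (eq. (4)) from training (t)
# and validation (v) images; real values come from fc8 outputs.
t = np.array([2.0, 1.5, 3.0, 2.5])
v = np.array([1.0, 1.8, 1.5, 2.5])

# Eq. (5): element-wise ratio of validation over training thresholds.
r = v / t  # -> [0.5, 1.2, 0.5, 1.0]

# Eq. (7): keep only ratios strictly between 0 and 1.
z = np.where((r > 0) & (r < 1), r, 0.0)

# Eq. (8): average over the non-zero entries, n = ||z||_0.
n = np.count_nonzero(z)
z_bar = z.sum() / n  # (0.5 + 0.5) / 2 = 0.5

# Eq. (9): scale the training-based thresholds down.
t_prime = z_bar * t  # -> [1.0, 0.75, 1.5, 1.25]
```

Since 𝑧̅ < 1 whenever at least one validation ratio falls below 1, 𝒕′ is a uniformly scaled-down variant of 𝒕, which is why it omits fewer evaluation images than 𝒕 does.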
Unseen Category Query Identification. As explained in Section III B, the threshold vector 𝒕, which considers only the training images, is computed with (4). Applying 𝒕 to the evaluation images, 195 out of 8000 images are identified and omitted. Examples of the omitted images are shown in Fig. 6.

Considering both training and validation images, the threshold vector 𝒕′ is computed with (9) after acquiring 𝑧̅ ≈ 0.91 with (8). Applying 𝒕′ to the same evaluation images, as 𝒕′ is a scaled-down variant of 𝒕, only 69 images are identified and omitted. Examples of the omitted images are shown in Fig. 7.

Evaluation Metric. Given the precision of the 𝑖-th evaluation image as P(𝑖), the average precision of the 𝑘-th class AP(𝑘) (containing 𝑀𝑘 images) is defined in (10). The Mean Average Precision (MAP) over all of the classes (𝑁 = 1000), shown in (11), is used to evaluate classification performance.

AP(𝑘) = (1/𝑀𝑘) ∑𝑀𝑘𝑖=1 P(𝑖) (10)

MAP = (1/𝑁) ∑𝑁𝑘=1 AP(𝑘) (11)

Thus, any image of an unseen category associated with a high score is likely to degrade the MAP. In addition, MAP without images of unseen categories is also evaluated.

B. Evaluation Results

Our first method uses the threshold vector 𝒕 to identify and omit images of unseen categories, while our second method uses the threshold vector 𝒕′. We compare our results with the top performing teams (runs) of PlantCLEF 2016. Evaluation results are summarized in Table I.

TABLE I
EVALUATION RESULTS

Method                  | MAP with images of unseen category | MAP without images of unseen category
Our method (with 𝒕)     | 0.736                              | 0.820
Our method (with 𝒕′)    | 0.742                              | 0.827
SabanciUGebzeTU Run 1   | 0.738                              | 0.806
CMP Run 1               | 0.710                              | 0.790
LIIR KUL Run 3          | 0.703                              | 0.761
UM Run 4                | 0.669                              | 0.598
QUT Run 3               | 0.629                              | 0.610

Our method with threshold vector 𝒕′ shows the highest MAP. Usage of the threshold vector 𝒕′, which considers both training and validation images, yields a slight improvement over 𝒕.

V. CONCLUSION

In this paper, we have presented our approach towards open world plant image identification, focusing on model enhancements and an unseen category image identification algorithm. We still leave some room for future improvements, which are itemized as follows:

 In our approach, training images are resized to two scales. It may be worthwhile to consider applying random scaling instead of a constant number of (two) scales. Other than scaling, random rotation could be applied as well.

 The scores of each class extracted from training and validation images (as mentioned in Section III B) may be skewed or of high variance. In such a case, taking minima as the threshold may not be feasible. We might consider using the mean or median to obtain a more stable threshold.

 The PlantCLEF 2016 dataset includes rich metadata such as GPS coordinates, location name, time taken, organ type and other taxonomic information (Genus and Family). Utilization of this metadata may further improve identification performance.

ACKNOWLEDGMENT

We would like to thank MEXT KAKENHI, Grant-in-Aid for Challenging Exploratory Research, Grant Number 15K12027 for partial support of our work.

REFERENCES

[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, 2012.
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. 1409.4842, 2014.
[3] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. 1409.1556, 2014.
[4] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," CoRR, vol. 1512.03385, 2015.
[5] H. Goëau, P. Bonnet and A. Joly, "LifeCLEF Plant Identification Task," in CLEF working notes 2015, 2015.
[6] H. Goëau, P. Bonnet and A. Joly, "Plant Identification in an Open-World (LifeCLEF 2016)," in CLEF working notes 2016, 2016.
[7] K. He, X. Zhang, S. Ren and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," CoRR, vol. 1406.4729, 2014.
[8] K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," CoRR, vol. 1502.01852, 2015.
[9] B. Xu, N. Wang, T. Chen and M. Li, "Empirical Evaluation of Rectified Activations in Convolutional Network," CoRR, vol. 1505.00853, 2015.
[10] V. N. Vapnik, Statistical Learning Theory, Wiley, 1998.
[11] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of AISTATS 2010, 2010.
