
Deep learning for fine-grained image analysis: A survey

Xiu-Shen Wei¹, Jianxin Wu², Quan Cui¹,³

¹ Megvii Research Nanjing, Megvii Technology, Nanjing, China
² National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
³ Graduate School of IPS, Waseda University, Fukuoka, Japan

arXiv:1907.03069v1 [cs.CV] 6 Jul 2019

Abstract

Computer vision (CV) is the process of using machines to understand and analyze imagery, and is an integral branch of artificial intelligence. Among the various research areas of CV, fine-grained image analysis (FGIA) is a longstanding and fundamental problem that has become ubiquitous in diverse real-world applications. FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class variations and the large intra-class variations caused by the fine-grained nature make it a challenging problem. With the boom of deep learning, recent years have witnessed remarkable progress in FGIA using deep learning techniques. In this paper, we aim to give a survey of recent advances in deep learning based FGIA techniques in a systematic way. Specifically, we organize the existing studies of FGIA techniques into three major categories: fine-grained image recognition, fine-grained image retrieval and fine-grained image generation. In addition, we also cover other important issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. Finally, we conclude this survey by highlighting several directions and open problems that need to be further explored by the community in the future.

Figure 1: Fine-grained image analysis vs. generic image analysis (taking the recognition task for an example). [The figure contrasts generic recognition of meta-categories (birds, dogs, oranges) with fine-grained recognition of dog breeds (Samoyed, Husky, Alaska).]

1 Introduction

Computer vision (CV) is an interdisciplinary scientific field of artificial intelligence (AI) which deals with how computers can be made to gain high-level understanding from digital images or videos. The tasks of computer vision include methods for acquiring, processing, analyzing and understanding digital images, as well as the process of extracting numerical or symbolic information, e.g., in the form of decisions or predictions, from high-dimensional raw image data in the real world.

As an interesting, fundamental and challenging problem in computer vision, fine-grained image analysis (FGIA) has been an active area of research for several decades. The goal of FGIA is to retrieve, recognize and generate images belonging to multiple subordinate categories of a super-category (aka meta-category), e.g., different species of animals/plants, different models of cars, different kinds of retail products, etc. (cf. Fig. 1). In the real world, FGIA enjoys a wide range of applications in both industry and research, such as automatic biodiversity monitoring, climate change evaluation, intelligent retail, intelligent transportation, and many more. In particular, a number of influential academic competitions on FGIA are frequently held on Kaggle.¹ Several representative competitions, to name a few, are the Nature Conservancy Fisheries Monitoring (for fish species categorization) and Humpback Whale Identification (for whale identity categorization). Each competition attracted more than 300 teams worldwide, and some even exceeded 2,000 teams.

On the other hand, deep learning techniques [LeCun et al., 2015] have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of FGIA. By rough yearly statistics, on average around ten conference papers on deep learning based FGIA techniques are published at each of AI's and CV's premier conferences, such as IJCAI, AAAI, CVPR, ICCV, ECCV, etc. This shows that FGIA with deep learning is of notable research interest. Given this period of rapid evolution, the aim of this paper is to provide a comprehensive survey of the recent achievements in the FGIA field brought by deep learning techniques.

In the literature, there is one existing survey related to fine-grained tasks, i.e., [Zhao et al., 2017], which simply included several fine-grained recognition approaches for comparison. Our work differs from it in that ours is more comprehensive. Specifically, beyond fine-grained recognition, we also analyze and discuss the other two central fine-grained analysis tasks, i.e., fine-grained image retrieval and fine-grained image generation, which cannot be overlooked as they are two integral aspects of FGIA. Additionally, at another important AI conference in the Pacific Rim nations, PRICAI, Wei and Wu organized a tutorial² on the fine-grained image analysis topic. We refer interested readers to the tutorial, which provides some additional detailed information.

In this paper, our survey takes a unique deep learning based perspective to review the recent advances of FGIA in a systematic and comprehensive manner. The main contributions of this survey are three-fold:

• We give a comprehensive review of FGIA techniques based on deep learning, including problem backgrounds, benchmark datasets, a family of FGIA methods with deep learning, domain-specific FGIA applications, etc.
• We provide a systematic overview of recent advances of deep learning based FGIA techniques in a hierarchical and structural manner, cf. Fig. 2.
• We discuss the challenges and open issues, and identify new trends and future directions, to provide a potential road map for fine-grained researchers and other interested readers in the broad AI community.

The rest of the survey is organized as follows. Section 2 introduces the background of this paper, i.e., the FGIA problem and its main challenges. In Section 3, we review multiple commonly used fine-grained benchmark datasets. Section 4 analyzes the three main paradigms of fine-grained image recognition. Section 5 presents recent progress in fine-grained image retrieval. Section 6 discusses fine-grained image generation from a generative perspective. Furthermore, in Section 7, we introduce other real-world domain-specific applications related to FGIA. Finally, we conclude this paper and discuss future directions and open issues in Section 8.

Figure 2: Main aspects of our hierarchical and structural organization of fine-grained image analysis (FGIA) in this survey paper. [The figure organizes FGIA into: FG image recognition (by localization-classification subnetworks; by end-to-end feature encoding; with external information, i.e., web data, multi-modality data, or humans in the loop); FG image retrieval (unsupervised with pre-trained models; supervised with metric learning); FG image generation (generating from FG image distributions; generating from text descriptions); and future directions of FGIA (automatic fine-grained models; fine-grained few-shot learning; fine-grained hashing; FGIA within more realistic settings).]

Figure 3: Key challenge of fine-grained image analysis, i.e., small inter-class variations and large intra-class variations. We here present four Tern species (Arctic Tern, Common Tern, Forster's Tern and Caspian Tern), one species per row.

2 Background: problem and main challenges

In this section, we summarize the related background of this paper, including the problem and its key challenges.

Fine-grained image analysis (FGIA) focuses on objects belonging to multiple sub-categories of the same meta-category (e.g., birds, dogs or cars), and generally involves central tasks like fine-grained image recognition, fine-grained image retrieval, fine-grained image generation, etc. What distinguishes FGIA from generic image analysis is the following: in generic image analysis, the target objects belong to coarse-grained meta-categories (e.g., birds, oranges and dogs) and are thus visually quite different, whereas in FGIA, since objects come from sub-categories of one meta-category, the fine-grained nature causes them to be visually quite similar. We take image recognition for illustration. As shown in Fig. 1, in fine-grained recognition the task is to identify multiple similar breeds of dogs, e.g., Husky, Samoyed and Alaska. For accurate recognition, it is desirable to distinguish them by capturing slight and subtle differences (e.g., ears, noses, tails), which also meets the demands of other FGIA tasks (e.g., retrieval and generation).

¹ Kaggle is an online community of data scientists and machine learners: https://www.kaggle.com/.
² http://www.weixiushen.com/tutorial/PRICAI18/FGIA.html
Figure 4: Example fine-grained images belonging to different species of flowers/vegetables, different models of cars/aircraft and different kinds of retail products. Accurate identification of these fine-grained objects depends on the discriminative but subtle object parts or image regions. (Best viewed in color and zoomed in.)

Table 1: Summary of popular fine-grained image datasets. Note that "BBox" indicates whether this dataset provides object bounding box supervisions. "Part anno." means providing the key part localizations. "HRCHY" corresponds to hierarchical labels. "ATR" represents the attribute labels (e.g., wing color, male, female, etc.). "Texts" indicates whether fine-grained text descriptions of images are supplied.

| Dataset name                                  | Meta-class       | # images | # categories | BBox | Part anno. | HRCHY | ATR | Texts |
|-----------------------------------------------|------------------|----------|--------------|------|------------|-------|-----|-------|
| Oxford Flower [Nilsback and Zisserman, 2008]  | Flowers          | 8,189    | 102          |      |            |       |     | ✓     |
| CUB200-2011 [Wah et al., 2011]                | Birds            | 11,788   | 200          | ✓    | ✓          |       | ✓   | ✓     |
| Stanford Dog [Khosla et al., 2011]            | Dogs             | 20,580   | 120          | ✓    |            |       |     |       |
| Stanford Car [Krause et al., 2013]            | Cars             | 16,185   | 196          | ✓    |            |       |     |       |
| FGVC Aircraft [Maji et al., 2013]             | Aircraft         | 10,000   | 100          | ✓    |            | ✓     |     |       |
| Birdsnap [Berg et al., 2014]                  | Birds            | 49,829   | 500          | ✓    | ✓          |       | ✓   |       |
| Fru92 [Hou et al., 2017]                      | Fruits           | 69,614   | 92           |      |            | ✓     |     |       |
| Veg200 [Hou et al., 2017]                     | Vegetables       | 91,117   | 200          |      |            | ✓     |     |       |
| iNat2017 [Horn et al., 2017]                  | Plants & Animals | 859,000  | 5,089        | ✓    |            | ✓     |     |       |
| RPC [Wei et al., 2019a]                       | Retail products  | 83,739   | 200          | ✓    |            | ✓     |     |       |
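For readers assembling experimental suites, the dataset summary above is straightforward to encode and query programmatically. The sketch below is illustrative only: the statistics are transcribed from Table 1, while the field layout, supervision tags and helper name are our own conventions, not part of any dataset's API.

```python
# Each tuple: (name, meta_class, num_images, num_categories, extra supervisions).
# Supervision tags mirror the table columns: bbox, part, hrchy, atr, texts.
DATASETS = [
    ("Oxford Flower", "Flowers", 8_189, 102, {"texts"}),
    ("CUB200-2011", "Birds", 11_788, 200, {"bbox", "part", "atr", "texts"}),
    ("Stanford Dog", "Dogs", 20_580, 120, {"bbox"}),
    ("Stanford Car", "Cars", 16_185, 196, {"bbox"}),
    ("FGVC Aircraft", "Aircraft", 10_000, 100, {"bbox", "hrchy"}),
    ("Birdsnap", "Birds", 49_829, 500, {"bbox", "part", "atr"}),
    ("Fru92", "Fruits", 69_614, 92, {"hrchy"}),
    ("Veg200", "Vegetables", 91_117, 200, {"hrchy"}),
    ("iNat2017", "Plants & Animals", 859_000, 5_089, {"bbox", "hrchy"}),
    ("RPC", "Retail products", 83_739, 200, {"bbox", "hrchy"}),
]

def with_supervision(kind):
    """Return names of datasets that offer a given extra supervision."""
    return [name for name, _, _, _, sup in DATASETS if kind in sup]

print(with_supervision("part"))  # → ['CUB200-2011', 'Birdsnap']
```

Such a filter makes it easy, for example, to pick out the datasets usable by part-annotation-based methods (Section 4.1) versus those offering hierarchical labels.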

Furthermore, the fine-grained nature also brings small inter-class variations caused by highly similar sub-categories, together with large intra-class variations in poses, scales and rotations, as presented in Fig. 3. This is the opposite of generic image analysis (i.e., small intra-class variations and large inter-class variations), and is what makes fine-grained image analysis a challenging problem.

Figure 5: An example image with its supervisions associated with CUB200-2011. As shown, the multiple types of supervisions include: image labels (here, "American Goldfinch"), part annotations (aka key point localizations), object bounding boxes (i.e., the green one), attribute labels (i.e., "ATR", e.g., Has_Bill_Shape: Cone; Has_Wing_Color: Black and white; Has_Back_Color: Yellow), and text descriptions in natural language (e.g., "This very distinct looking bird has a black crown, a yellow belly, and white and black wings."). (Best viewed in color.)

3 Benchmark datasets

In the past decade, the vision community has released many benchmark fine-grained datasets covering diverse domains such as birds [Wah et al., 2011; Berg et al., 2014], dogs [Khosla et al., 2011], cars [Krause et al., 2013], airplanes [Maji et al., 2013], flowers [Nilsback and Zisserman, 2008], vegetables [Hou et al., 2017], fruits [Hou et al., 2017], retail products [Wei et al., 2019a], etc. (cf. Fig. 4). In Table 1, we list a number of image datasets commonly used by the fine-grained community, and specifically indicate their meta-category, the number of fine-grained images, the number of fine-grained categories, and the extra kinds of available supervisions, i.e., bounding boxes, part annotations, hierarchical labels, attribute labels and text descriptions, cf. Fig. 5.

These datasets have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing the performance of competing approaches, but also for pushing this field towards increasingly complex, practical and challenging problems.

Specifically, among them, CUB200-2011 is one of the most popular fine-grained datasets. Almost all FGIA approaches choose it for comparison with the state-of-the-art. Moreover, constant contributions are made upon CUB200-2011 for further research, e.g., collecting text descriptions of the fine-grained images for multi-modality analysis, cf. [Reed et al., 2016; He and Peng, 2017a].

Additionally, in recent years, increasingly challenging and practical fine-grained datasets have been proposed, e.g., iNat2017 for natural species of plants and animals [Horn et al., 2017] and RPC for daily retail products [Wei et al., 2019a]. Novel features deriving from these datasets include, to name a few, large scale, hierarchical structure, domain gap and long-tailed distribution, which reflect the practical requirements of the real world and could spur studies of FGIA in more realistic settings.
4 Fine-grained image recognition

Fine-grained image recognition has been the most active research area of FGIA in the past decade. In this section, we review the milestones of fine-grained recognition frameworks since deep learning entered the field. Broadly, these fine-grained recognition approaches can be organized into three main paradigms, i.e., fine-grained recognition (1) with localization-classification subnetworks, (2) with end-to-end feature encoding, and (3) with external information. Among them, the first and second paradigms restrict themselves to utilizing only the supervisions associated with fine-grained images, such as image labels, bounding boxes, part annotations, etc. Since automatic recognition systems cannot yet achieve excellent performance due to the fine-grained challenges, researchers have gradually attempted to involve external but cheap information (e.g., web data, text descriptions) in fine-grained recognition to further improve accuracy, which corresponds to the third paradigm. The popularly used evaluation metric in fine-grained recognition is the averaged classification accuracy across all the subordinate categories of the datasets.

4.1 By localization-classification subnetworks

To mitigate the challenge of intra-class variations, researchers in the fine-grained community pay attention to capturing the discriminative semantic parts of fine-grained objects and then constructing a mid-level representation corresponding to these parts for the final classification. Specifically, a localization subnetwork is designed for locating these key parts; a classification subnetwork then follows and is employed for recognition. The framework of these two collaborative subnetworks forms the first paradigm, i.e., fine-grained recognition with localization-classification subnetworks.

Thanks to the localization information, e.g., part-level bounding boxes or segmentation masks, such models can obtain more discriminative mid-level (part-level) representations w.r.t. the fine-grained parts. This further enhances the learning capability of the classification subnetwork, which can significantly boost the final recognition accuracy.

Earlier works belonging to this paradigm depend on additional dense part annotations (aka key point localizations) to locate semantic key parts (e.g., head, torso) of objects. Some of them learn part-based detectors [Zhang et al., 2014; Lin et al., 2015a], and some leverage segmentation methods for localizing parts [Wei et al., 2018a]. These methods then concatenate multiple part-level features into a whole image representation and feed it into the following classification subnetwork for final recognition. Thus, these approaches are also termed part-based recognition methods.

However, obtaining such dense part annotations is labor-intensive, which limits both the scalability and practicality of real-world fine-grained applications. Recently, a trend has emerged in which techniques under this paradigm require only image labels [Jaderberg et al., 2015; Fu et al., 2017; Zheng et al., 2017; Sun et al., 2018] for accurate part localization. Their common motivation is to first find the corresponding parts and then compare their appearance. Concretely, it is desirable to capture semantic parts (e.g., head and torso) shared across fine-grained categories, and at the same time to discover the subtle differences between these part representations. Advanced techniques, like attention mechanisms [Yang et al., 2018] and multi-stage strategies [He and Peng, 2017b], complicate the joint training of the integrated localization-classification subnetworks.

4.2 By end-to-end feature encoding

Different from the first paradigm, the second paradigm, i.e., end-to-end feature encoding, leans towards directly learning a more discriminative feature representation by developing powerful deep models for fine-grained recognition. The most representative method among them is Bilinear CNNs [Lin et al., 2015b], which represents an image as a pooled outer product of features derived from two deep CNNs, and thus encodes higher-order statistics of convolutional activations to enhance the mid-level learning capability. Thanks to its high model capacity, Bilinear CNNs achieve remarkable fine-grained recognition performance. However, the extremely high dimensionality of bilinear features still makes them impractical for realistic applications, especially large-scale ones.

Aiming at this problem, more recent attempts, e.g., [Gao et al., 2016; Kong and Fowlkes, 2017; Cui et al., 2017], try to aggregate low-dimensional embeddings by applying tensor sketching [Pham and Pagh, 2013; Charikar et al., 2002], which can approximate the bilinear features while maintaining comparable or even higher recognition accuracy. Other works, e.g., [Dubey et al., 2018], focus on designing loss functions tailored to the fine-grained setting, which can drive the whole deep model to learn discriminative fine-grained representations.

4.3 With external information

As aforementioned, beyond the conventional recognition paradigms, another paradigm is to leverage external information, e.g., web data, multi-modality data or human-computer interactions, to further assist fine-grained recognition.

With web data

To identify the minor distinctions among various fine-grained categories, sufficient well-labeled training images are in high demand. However, accurate human annotations for fine-grained categories are not easy to acquire, due to the difficulty of annotation (which often requires domain experts) and the myriads of fine-grained categories (i.e., more than thousands of subordinate categories in a meta-category).

Therefore, some fine-grained recognition methods seek to utilize free but noisy web data to boost recognition performance. The majority of existing works in this line can be roughly grouped into two directions. The first is to crawl noisily labeled web data for the test categories as training data, which is regarded as webly supervised learning [Zhuang et al., 2017; Sun et al., 2019]. The main efforts of these approaches concentrate on: (1) overcoming the dataset gap between easily acquired web images and the well-labeled data from standard datasets; and (2) reducing the negative effects caused by the noisy data. To deal with these problems, deep learning techniques such as adversarial learning [Goodfellow et al., 2014] and attention mechanisms [Zhuang et al., 2017] are frequently utilized. The other direction of using web data is to transfer knowledge from auxiliary categories with well-labeled training data to the test categories, which usually employs zero-shot learning [Niu et al., 2018] or meta-learning [Zhang et al., 2018] to achieve that goal.
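Stepping back to the end-to-end encoding paradigm of Section 4.2: the core bilinear pooling operation is compact enough to sketch. The following is an illustrative NumPy sketch under assumed feature shapes, not the reference implementation of [Lin et al., 2015b]; in practice the two inputs would be convolutional feature maps from two CNN streams, and the signed square-root plus L2 normalization is the commonly used post-processing for bilinear features.

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Pool the outer product of two sets of local descriptors.

    fa: (H*W, Ca) and fb: (H*W, Cb) local CNN descriptors from two
    streams over the same spatial grid. Returns a (Ca*Cb,) image-level
    descriptor that captures second-order statistics of activations.
    """
    # Sum of outer products over all spatial locations, averaged.
    b = (fa.T @ fb).flatten() / fa.shape[0]        # (Ca*Cb,)
    # Signed square-root and L2 normalization (standard post-processing).
    b = np.sign(b) * np.sqrt(np.abs(b))
    return b / (np.linalg.norm(b) + 1e-12)

# Toy example: 7x7 = 49 spatial locations, 8- and 16-dim streams.
fa = np.random.randn(49, 8)
fb = np.random.randn(49, 16)
desc = bilinear_pool(fa, fb)
print(desc.shape)  # → (128,)
```

The toy dimensions already hint at the paradigm's dimensionality problem discussed above: with two 512-dim streams the descriptor grows to 512 × 512 ≈ 262k dimensions, which is what motivates the tensor-sketch approximations of [Gao et al., 2016] and related work.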
Figure 6: An example knowledge graph for modeling the category-attribute correlations on CUB200-2011. [The graph links warbler species (Hooded Warbler, Pine Warbler, Tennessee Warbler and Kentucky Warbler) to attribute nodes such as crown, throat, eye, belly and wing colors.]

With multi-modality data

Multi-modal analysis has attracted a lot of attention with the rapid growth of multimedia data (e.g., images, text, knowledge bases, etc.). In fine-grained recognition, multi-modality data is used to establish joint representations/embeddings that incorporate multi-modality information, which can boost fine-grained recognition accuracy. In particular, frequently utilized multi-modality data includes text descriptions (e.g., sentences and phrases in natural language) and graph-structured knowledge bases. Compared with strong supervisions of fine-grained images, e.g., part annotations, text descriptions are weak supervisions. Moreover, text descriptions can be provided relatively accurately by ordinary humans, rather than requiring experts in a specific domain. In addition, high-level knowledge graphs are an existing resource containing rich professional knowledge, such as DBpedia [Lehmann et al., 2015]. In practice, both text descriptions and knowledge bases are effective as extra guidance for better fine-grained image representation learning.

Specifically, [Reed et al., 2016] collects text descriptions and introduces a structured joint embedding for zero-shot fine-grained image recognition by combining texts and images. Later, [He and Peng, 2017a] combines the vision and language streams in a jointly trained end-to-end fashion to preserve the intra-modality and inter-modality information for generating complementary fine-grained representations. For fine-grained recognition with knowledge bases, some works, e.g., [Chen et al., 2018; Xu et al., 2018a], introduce knowledge-base information (usually associated with attribute labels, cf. Fig. 6) to implicitly enrich the embedding space (and also to reason about the discriminative attributes of fine-grained objects).

With humans in the loop

Fine-grained recognition with humans in the loop is usually an iterative system composed of a machine and a human user, which combines both human and machine efforts and intelligence. It also requires the system to work in as labor-economical a way as possible. Generally, in each round such a recognition system seeks to understand how humans perform recognition, e.g., by asking untrained humans to label the image class and pick out hard examples [Cui et al., 2016], or by identifying key part localizations and selecting discriminative features [Deng et al., 2016] for fine-grained recognition.

5 Fine-grained image retrieval

Beyond image recognition, fine-grained retrieval is another crucial aspect of FGIA and has emerged as a hot topic. Its evaluation metric is the common mean average precision (mAP).

In fine-grained image retrieval, given database images of a certain meta-category (e.g., birds or cars) and a query, the system should return images of the same variety as the query, without resorting to any other supervision signals, cf. Fig. 7. Generic image retrieval focuses on retrieving near-duplicate images based on similarities in their content (e.g., textures, colors and shapes), whereas fine-grained retrieval focuses on retrieving images of the same type (e.g., the same subordinate species for animals, or the same model for cars). Meanwhile, objects in fine-grained images have only subtle differences, and vary in poses, scales and rotations.

Figure 7: An illustration of fine-grained retrieval. Given a query image (aka probe) of "Dodge Charger Sedan 2012", fine-grained retrieval is required to return images of the same car model from a car database (aka gallery). In this figure, among the top-4 returned images, the one marked with a red rectangle is a wrong result, since its model is "Dodge Caliber Wagon 2012".

In the literature, [Wei et al., 2017] is the first attempt at fine-grained image retrieval using deep learning. It employs pre-trained CNN models to select meaningful deep descriptors by localizing the main object in fine-grained images in an unsupervised fashion, and further reveals that selecting only useful deep descriptors, with background or noise removed, can significantly benefit retrieval tasks. Recently, to break through the limitation of unsupervised fine-grained retrieval with pre-trained models, some works [Zheng et al., 2018; Zheng et al., 2019] aim to discover novel loss functions under the supervised metric learning paradigm. Meanwhile, they still design additional specific sub-modules tailored for fine-grained objects, e.g., the weakly supervised localization module proposed in [Zheng et al., 2018], which is inspired by [Wei et al., 2017].

6 Fine-grained image generation

Apart from supervised learning tasks, image generation is a representative topic of unsupervised learning. It deploys deep generative models, e.g., GANs [Goodfellow et al., 2014], to learn to synthesize realistic images which look visually authentic. With the quality of generated images becoming
higher, more challenging goals are expected, i.e., fine-grained image generation. As the term suggests, fine-grained generation synthesizes images in fine-grained categories, such as faces of a specific person or objects of a subordinate category.

The first work in this line was CVAE-GAN, proposed in [Bao et al., 2017], which combines a variational auto-encoder with a generative adversarial network under a conditional generative process to tackle this problem. Specifically, CVAE-GAN models an image as a composition of label and latent attributes in a probabilistic model. Then, by varying the fine-grained category fed into the resulting generative model, it can generate images of a specific category. More recently, generating images from text descriptions [Xu et al., 2018b] has become popular in light of its diverse and practical applications, e.g., art generation and computer-aided design. By employing an attention-equipped generative network, the model can synthesize fine-grained details of subtle regions by focusing on the relevant words of the text descriptions.

7 Domain specific applications related to fine-grained image analysis

In the real world, deep learning based fine-grained image analysis techniques have also been adopted in diverse domain-specific applications and show great performance, such as clothes/shoes retrieval [Song et al., 2017] in recommendation systems, fashion image recognition [Liu et al., 2016] in e-commerce platforms, and product recognition [Wei et al., 2019a] in intelligent retail. These applications are highly related to both the fine-grained retrieval and recognition aspects of FGIA.

Additionally, if we move down the spectrum of granularity, in the extreme, face identification can be viewed as an instance of fine-grained recognition, where the granularity is at the identity level. Moreover, person/vehicle re-identification is another fine-grained related task, which aims at determining whether two images are taken of the same specific person/vehicle. Apparently, re-identification tasks are also at identity granularity.

In practice, these works solve the corresponding domain-specific tasks by following the motivations of FGIA, which include capturing the discriminative parts of objects (faces, persons and vehicles) [Suh et al., 2018], discovering coarse-to-fine structural information [Wei et al., 2018b], developing attribute-based models [Liu et al., 2016], and so on.

8 Concluding remarks and future directions

Fine-grained image analysis (FGIA) based on deep learning has made great progress in recent years. In this paper, we give an extensive survey of recent advances in FGIA with deep learning. We mainly introduced the FGIA problem and its challenges, discussed the significant improvements of fine-grained image recognition/retrieval/generation, and also presented some domain-specific applications related to FGIA. Despite the great success, there are still many unsolved problems. Thus, in this section, we point out these problems explicitly and introduce some research trends for the future. We hope that this survey not only provides a better understanding of FGIA but also facilitates future research activities and application developments in this field.

Automatic fine-grained models. Nowadays, automated machine learning (AutoML) [Feurer et al., 2015] and neural architecture search (NAS) [Elsken et al., 2018] are attracting fervent attention in the artificial intelligence community, especially in computer vision. AutoML targets automating the end-to-end process of applying machine learning to real-world tasks, while NAS, the process of automating neural network architecture design, is a logical next step in AutoML. Recent AutoML and NAS methods can be comparable to, or even outperform, hand-designed architectures in various computer vision applications. Thus, it is promising that automatic fine-grained models developed by AutoML or NAS techniques could find better and more tailor-made deep models, while in turn advancing the studies of AutoML and NAS.

Fine-grained few-shot learning. Humans are capable of learning a new fine-grained concept with very little supervision, e.g., a few exemplary images of a species of bird, yet our best deep learning fine-grained systems need hundreds or thousands of labeled examples. Even worse, the supervision of fine-grained images is both time-consuming and expensive, since fine-grained objects must usually be labeled accurately by domain experts. Thus, it is desirable to develop fine-grained few-shot learning (FGFS) [Wei et al., 2019b]. The task of FGFS requires learning systems to build classifiers for novel fine-grained categories from few examples (only one, or fewer than five) in a meta-learning fashion. Robust FGFS methods could greatly strengthen the usability and scalability of fine-grained recognition.

Fine-grained hashing. As attention to FGIA grows, more large-scale and well-constructed fine-grained datasets have been released, e.g., [Berg et al., 2014; Horn et al., 2017; Wei et al., 2019a]. In real applications like fine-grained image retrieval, it is natural to raise the problem that the cost of finding the exact nearest neighbor is prohibitively high when the reference database is very large. Hashing [Wang et al., 2018; Li et al., 2016], as one of the most popular and effective techniques for approximate nearest neighbor search, has the potential to deal with large-scale fine-grained data. Therefore, fine-grained hashing is a promising direction worth further exploration.

Fine-grained analysis within more realistic settings. In the past decade, fine-grained image analysis techniques have been developed and have achieved good performance under traditional settings, e.g., the empirical protocols of [Wah et al., 2011; Khosla et al., 2011; Krause et al., 2013]. However, these settings cannot satisfy the daily requirements of various real-world applications nowadays, e.g., recognizing retail products on storage racks with models trained on images collected in controlled environments [Wei et al., 2019a], or recognizing/detecting natural species in the wild [Horn et al., 2017]. In consequence, novel fine-grained image analysis topics, to name a few (fine-grained analysis with domain adaptation, with knowledge transfer, with long-tailed distributions, and fine-grained analysis running on resource-constrained embedded devices), deserve substantial research efforts towards more advanced and practical FGIA.
References

[Bao et al., 2017] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In ICCV, pages 2745–2754, 2017.

[Berg et al., 2014] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2019–2026, 2014.

[Charikar et al., 2002] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, pages 693–703, 2002.

[Chen et al., 2018] T. Chen, L. Lin, R. Chen, Y. Wu, and X. Luo. Knowledge-embedded representation learning for fine-grained image recognition. In IJCAI, pages 627–634, 2018.

[Cui et al., 2016] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, pages 1153–1162, 2016.

[Cui et al., 2017] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, pages 2921–2930, 2017.

[Deng et al., 2016] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the wisdom of the crowd for fine-grained recognition. TPAMI, 38(4):666–676, 2016.

[Dubey et al., 2018] A. Dubey, O. Gupta, R. Raskar, and N. Naik. Maximum entropy fine-grained classification. In NeurIPS, pages 637–647, 2018.

[Elsken et al., 2018] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.

[Feurer et al., 2015] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970, 2015.

[Fu et al., 2017] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.

[Gao et al., 2016] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, pages 317–326, 2016.

[Goodfellow et al., 2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[He and Peng, 2017a] X. He and Y. Peng. Fine-grained image classification via combining vision and language. In CVPR, pages 5994–6002, 2017.

[He and Peng, 2017b] X. He and Y. Peng. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification. In AAAI, pages 4075–4081, 2017.

[Horn et al., 2017] G. Van Horn, O. M. Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2017.

[Hou et al., 2017] S. Hou, Y. Feng, and Z. Wang. VegFru: A domain-specific dataset for fine-grained visual categorization. In ICCV, pages 541–549, 2017.

[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.

[Khosla et al., 2011] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, pages 806–813, 2011.

[Kong and Fowlkes, 2017] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, pages 365–374, 2017.

[Krause et al., 2013] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop on 3D Representation and Recognition, 2013.

[LeCun et al., 2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.

[Lehmann et al., 2015] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, pages 167–195, 2015.

[Li et al., 2016] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.

[Lin et al., 2015a] D. Lin, X. Shen, C. Lu, and J. Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In CVPR, pages 1666–1674, 2015.

[Lin et al., 2015b] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.

[Liu et al., 2016] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.

[Maji et al., 2013] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[Nilsback and Zisserman, 2008] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conf. on Comput. Vision, Graph. and Image Process., pages 722–729, 2008.

[Niu et al., 2018] L. Niu, A. Veeraraghavan, and A. Sabharwal. Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In CVPR, pages 7171–7180, 2018.

[Pham and Pagh, 2013] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, pages 239–247, 2013.

[Reed et al., 2016] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.

[Song et al., 2017] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5551–5560, 2017.

[Suh et al., 2018] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Part-aligned bilinear representations for person re-identification. In ECCV, pages 402–419, 2018.

[Sun et al., 2018] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pages 834–850, 2018.

[Sun et al., 2019] X. Sun, L. Chen, and J. Yang. Learning from web data using adversarial discriminative neural networks for fine-grained classification. In AAAI, 2019.

[Wah et al., 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Tech. Report CNS-TR-2011-001, 2011.

[Wang et al., 2018] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI, 40(4):769–790, 2018.

[Wei et al., 2017] X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. TIP, 26(6):2868–2881, 2017.

[Wei et al., 2018a] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.

[Wei et al., 2018b] X.-S. Wei, C.-L. Zhang, L. Liu, C. Shen, and J. Wu. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In ACCV, 2018.

[Wei et al., 2019a] X.-S. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu. RPC: A large-scale retail product checkout dataset. arXiv preprint arXiv:1901.07249, 2019.

[Wei et al., 2019b] X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. TIP, in press, 2019.

[Xu et al., 2018a] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, and H. Gao. Fine-grained image classification by visual-semantic embedding. In IJCAI, pages 1043–1049, 2018.

[Xu et al., 2018b] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324, 2018.

[Yang et al., 2018] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang. Learning to navigate for fine-grained classification. In ECCV, pages 438–454, 2018.

[Zhang et al., 2014] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.

[Zhang et al., 2018] Y. Zhang, H. Tang, and K. Jia. Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In ECCV, pages 233–248, 2018.

[Zhao et al., 2017] B. Zhao, J. Feng, X. Wu, and S. Yan. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing, 14(2):119–135, 2017.

[Zheng et al., 2017] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.

[Zheng et al., 2018] X. Zheng, R. Ji, X. Sun, Y. Wu, F. Huang, and Y. Yang. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pages 1226–1233, 2018.

[Zheng et al., 2019] X. Zheng, R. Ji, X. Sun, B. Zhang, Y. Wu, and F. Huang. Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In AAAI, 2019.

[Zhuang et al., 2017] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid. Attend in groups: A weakly-supervised deep learning framework for learning from web data. In CVPR, pages 1878–1887, 2017.