
The Herbarium Challenge 2019 Dataset

Kiat Chuan Tan∗ (Google Research, kiatchuan@google.com)
Yulong Liu∗ (Google Research, liuyl@google.com)
Barbara Ambrose (New York Botanical Garden, bambrose@nybg.org)
Melissa Tulig (New York Botanical Garden, mtulig@nybg.org)
Serge Belongie (Google Research & Cornell Tech, sjb344@cornell.edu)

∗ Equal contribution.

arXiv:1906.05372v2 [cs.CV] 15 Jun 2019

Abstract

Herbarium sheets are invaluable for botanical research, and experts spend considerable time and effort labeling and identifying the specimens on them. In view of recent advances in computer vision and deep learning, developing an automated approach to help experts identify specimens could significantly accelerate research in this area. Whereas most existing botanical datasets comprise photos of specimens in the wild, herbarium sheets exhibit dried specimens, which poses new challenges. We present a challenge dataset of herbarium sheet images labeled by experts, with the intent of facilitating the development of automated identification techniques for this challenging scenario.

Figure 1. Number of imaged sheets per species, sorted from most common to least common.

1. Introduction

We are currently in the midst of the sixth great extinction, the Anthropocene extinction, which is largely being caused by humans. A recent summary of a forthcoming United Nations report indicates that 1,000,000 species could go extinct within decades [1]. As large ecosystems are being destroyed, it is likely that many of these extinctions are undocumented, as many species still await discovery. It cannot be overstated that plants are fundamental to life on earth; they convert energy into food for all organisms, produce oxygen needed by most organisms, provide shelter, medicines, and fuel, hold soil in place, and help regulate the water cycle. The loss of plant species increases the fragility of ecosystems, which intensifies the devastating effects of fires and floods. Plant species loss may also be removing potentially valuable resources for human health and well-being before we have the opportunity to study and cultivate them.

There are currently 400,000 known plant species and an estimated 80,000 plant species still to be discovered. Without new tools for species discovery, it is likely that we will lose these species to the current extinction event before we know their names, their specific benefits, and the role they play in maintaining the delicate balance of the ecosystems in which they occur. The major challenge to understanding plant diversity is speeding up the process of discovery. In flowering plants, it takes an average of 35 years from plant collection to species description, and fewer than 16% of new species are described within 5 years [2]. It has also been suggested that herbaria are a major frontier for species discovery, with more than 50% of unknown species already in herbarium collections [2].

The use of Artificial Intelligence for automatic species identification is a new and rapidly developing field and holds promise for identifying the still-to-be-named species in herbaria [4, 5, 11, 15, 17, 18, 19, 20, 22]. Several studies of automated plant species identification have focused on images of leaves alone [11, 20], while others have performed automatic species identification on herbarium specimens using models trained with leaf images [4, 19]. Several studies have been performed using herbarium images alone [4, 5, 13, 15, 17, 22]; however, new models are still needed to automatically identify species using herbarium specimens alone and under conditions that more accurately reflect the distribution of specimen images available for species present in herbaria.

There are approximately 3,000 herbaria world-wide with an estimated 380 million specimens. The New York Botanical Garden (NYBG) has more than 7.8 million herbarium specimens, and over 3 million of these have been imaged. Each specimen is accompanied by an approximately standard set of metadata, including the name of the plant, a description of the location where it was collected, the name of the collector, the collector's accession number, and the date of collection.

Figure 2. Different specimens of the same species, varying by specimen age.

2. The Herbarium Dataset¹

¹ Available at https://github.com/visipedia/herbarium_comp

The Herbarium dataset contains 46,469 digitally imaged herbarium sheets representing 683 species from the flowering plant family Melastomataceae (melastome). This family is also known as the princess flower family, and many species have extraordinary flowers. The Melastomataceae is a large family with more than 160 genera and more than 5,500 named species [13]. Each herbarium specimen was carefully identified and labeled with a unique species by human experts from the New York Botanical Garden. We have maintained the original high-resolution images as digitally imaged by the New York Botanical Garden.

Figure 3. Visually similar specimens of 3 different species.

Some melastome species are represented by more than a hundred specimens while others are represented by fewer than 20 specimens, which more accurately reflects the real-world distribution of specimens per species found in herbaria. See figure 1 for the distribution of counts per species.

We calculated the overlap between the species contained in the Herbarium challenge dataset and the plant species in the iNaturalist 2018 challenge dataset [9]. The overlap comprises only 2 of the 683 species in the Herbarium dataset.

2.1. Dataset Challenges

The Herbarium dataset presents multiple challenges for species identification. First, the dataset has a large class imbalance. Second, expert botanists often rely on characters that occupy a small portion of the herbarium sheet in order to identify the species, such as leaf veins, leaf margins, and flower shape [6, 18]. Most existing state-of-the-art image classification methods use a low-resolution image as input [16]. This lower resolution may be insufficient to capture these smaller characters, and existing networks may find it challenging to learn to focus on these small differences. Third, individuals of the same species can vary widely in their morphology, i.e. intraclass variation can be large. For example, a specimen's color composition can change with age (figure 2). Finally, the interclass variation can be small, i.e. different species can be visually very similar (figure 3).

2.2. Dataset Split

We split the dataset into 75% training, 5% validation and 20% test. The split was performed randomly on a per-species level, to ensure that the training, validation and test datasets all have similar species distributions. In total, there are 34,225 training images, 2,679 validation images and 9,565 test images.
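
For reference, such a per-species split can be reproduced with a short sampling routine. The sketch below is illustrative only: it assumes the metadata is available as (image id, species id) pairs and uses an arbitrary seed, and it is not the script that produced the official split.

    import random
    from collections import defaultdict

    def split_per_species(records, train_frac=0.75, val_frac=0.05, seed=0):
        # records: iterable of (image_id, species_id) pairs (an assumed format).
        # Returns train/validation/test lists, with the split done per species.
        rng = random.Random(seed)
        by_species = defaultdict(list)
        for image_id, species_id in records:
            by_species[species_id].append(image_id)

        train, val, test = [], [], []
        for species_id, image_ids in by_species.items():
            rng.shuffle(image_ids)
            n = len(image_ids)
            n_train = int(round(train_frac * n))
            n_val = int(round(val_frac * n))
            train += [(i, species_id) for i in image_ids[:n_train]]
            val += [(i, species_id) for i in image_ids[n_train:n_train + n_val]]
            test += [(i, species_id) for i in image_ids[n_train + n_val:]]
        return train, val, test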

2.3. Dataset Preprocessing

We preprocessed the dataset provided by the New York Botanical Garden before publishing it for the FGVC competition. These steps included (1) blurring text and barcodes in the images and (2) providing a down-sampled version of the dataset for convenience.

2.3.1 Image Blurring

The original herbarium images often have handwritten or printed labels attached to them, including the date, the place where the specimen was found, a description of the specimen, and the specimen name. Some sheets also have attached barcodes and identifiers which can be used to trace back to the exact specimen on the New York Botanical Garden's website. In order to prevent models from classifying images based on barcode and text information, we artificially blurred this information (figure 4).

Figure 4. Comparison between original and blurred images. The left column shows original images and the right column shows the blurred versions.

The PhotoOCR system [3] was used for text detection. After extracting bounding boxes of text and barcodes, a heavy Gaussian blend algorithm was used to blur the detected regions. The algorithm first applies a mean blur to the bounding box, then blurs the region using a single Gaussian blur with added noise, and finally uses a smooth alpha map to blend the result into the original image.
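
As an illustration only, the sketch below applies the same three steps to a single detected bounding box using OpenCV; the kernel sizes, noise level, and feathering width are our own guesses, not the parameters used to produce the dataset.

    import cv2
    import numpy as np

    def blur_box(image, box, ksize=51, sigma=20.0, noise_std=8.0, feather=31):
        # image: H x W x 3 uint8 array; box: (x0, y0, x1, y1) from the detector.
        x0, y0, x1, y1 = box
        region = image[y0:y1, x0:x1].astype(np.float32)

        # 1) mean blur over the bounding box
        region = cv2.blur(region, (ksize, ksize))

        # 2) a single Gaussian blur with added noise
        region = cv2.GaussianBlur(region, (ksize, ksize), sigma)
        region = region + np.random.normal(0.0, noise_std, region.shape).astype(np.float32)

        # 3) smooth alpha map (feathered edges) to blend back into the original;
        #    assumes the box is larger than the feathering margin
        alpha = np.zeros(region.shape[:2], dtype=np.float32)
        alpha[feather:-feather, feather:-feather] = 1.0
        alpha = cv2.GaussianBlur(alpha, (feather, feather), 0)[..., None]

        out = image.astype(np.float32)
        out[y0:y1, x0:x1] = alpha * region + (1.0 - alpha) * out[y0:y1, x0:x1]
        return np.clip(out, 0, 255).astype(np.uint8)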

2.3.2 Image Resizing

The images collected by the New York Botanical Garden have high resolution, with the majority of images having dimensions of 6,000 by 4,000 pixels. While high-resolution images are useful for botanical research, they may not be convenient for competition participants to download, process, and train models on.

For convenience, we provided a resized dataset, where each image has been resized (preserving its aspect ratio) to have a maximum of 1,024 pixels in the larger dimension. This resizing drastically reduced the overall dataset size from 52 GB to 2.3 GB.
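
For reference, an equivalent resize can be done in a few lines with Pillow; the resampling filter and output quality below are our own choices, not necessarily those used to produce the released images.

    from PIL import Image

    def resize_max_side(path_in, path_out, max_side=1024):
        # Downsize so the larger dimension is at most max_side pixels,
        # preserving the aspect ratio (thumbnail never upscales).
        with Image.open(path_in) as im:
            im.thumbnail((max_side, max_side), Image.LANCZOS)
            im.save(path_out, quality=90)  # quality applies to JPEG output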

3. The Herbarium Challenge 2019

The Herbarium Challenge 2019 was conducted through Kaggle as part of FGVC6 at CVPR 2019, with 22 participating teams and 254 submissions. The final leaderboard from the held-out private test data can be seen in figure 5. The winning method, by Megvii Research Nanjing, achieved a classification accuracy of 89.8%.

Figure 5. Final challenge leaderboard.

Their method involved an ensemble of 5 different models, using the SeResNext-50, SeResNext-101 [21] and ResNet-152 [10] architectures. They also used a combination of different losses: cross-entropy, focal loss, and class-balanced focal loss [7]. Their models were pretrained on the ImageNet [14] and iNaturalist Challenge 2018 datasets, with input image dimensions of 448x448 and standard data augmentation. Additional techniques used include deformable convolutions [8], iSQRT [12] and random erasing [23].
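
For readers unfamiliar with the class-balanced focal loss [7], it reweights the focal loss by the inverse effective number of samples per class, which directly targets the class imbalance discussed in section 2.1. The sketch below is a simplified softmax-based version with common default hyperparameters; it is not the winning team's implementation.

    import numpy as np

    def class_balanced_focal_loss(probs, labels, samples_per_class,
                                  beta=0.999, gamma=2.0):
        # probs: (N, C) softmax outputs; labels: (N,) integer class labels;
        # samples_per_class: (C,) training-set counts per class.
        # beta and gamma are common defaults, not the winners' settings.
        counts = np.asarray(samples_per_class, dtype=np.float64)

        # Class-balancing weight from [7]: (1 - beta) / (1 - beta^n_c),
        # normalized so the weights sum to the number of classes.
        weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
        weights = weights / weights.sum() * len(counts)

        p_t = probs[np.arange(len(labels)), labels]             # true-class probability
        focal = -((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)   # focal term
        return float(np.mean(weights[labels] * focal))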

4. Conclusions and Future Work

We have developed the Herbarium Challenge dataset to facilitate the development of automatic species identification models for the flowering plant family Melastomataceae. This offers the immediate benefit of accelerating the identification of unlabeled specimens in this family, potentially resulting in the identification of new species.

In future work, we have multiple directions in mind: extending the dataset to incorporate other plant families, adding annotations beyond species, and facilitating new species identification using a test set with previously unseen classes.

Acknowledgements. We thank NYBG for support and our colleagues at NYBG, particularly Lawrence Kelly, Damon Little, and Barbara Thiers, as well as Nicole Tiernan and Kim Watson for digitization. A special thanks to our NYBG colleague Fabián A. Michelangeli and his collaborators Frank Almeda, Eldis Becquer, Renato Goldenberg, and Walter Judd, who explore, describe, and determine melastome species. The digitization of the melastomes in this dataset was made possible by funding from NSF grants including DEB-1343612 and DEB-0818399 to F.A.M. and DBI-0749751, DBI-9987500, DBI-0543335, and DBI-1053290 to B.T., as well as funding from the Google, Mellon, and Sloan Foundations to NYBG.

We would also like to thank Srikanth Belwadi, R.V. Guha and Kiran Panesar from dataCommons.org for help with preparing the dataset; Christine Kaeser-Chen and Hartwig Adam from Google Research; Maggie Demkin from Kaggle; the Herbarium Challenge 2019 competitors; and the FGVC2019 workshop organizers.

References

[1] Summary of United Nations report from the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). May 2019.
[2] Daniel P. Bebber, Mark A. Carine, John R. I. Wood, Alexandra H. Wortley, David J. Harris, Ghillean T. Prance, Gerrit Davidse, Jay Paige, Terry D. Pennington, Norman K. B. Robson, and Robert W. Scotland. Herbaria are a major frontier for species discovery. Proceedings of the National Academy of Sciences, 107(51):22169–22171, 2010.
[3] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
[4] Jose Carranza-Rojas, Herve Goeau, Pierre Bonnet, Erick Mata-Montero, and Alexis Joly. Going deeper in the automated identification of Herbarium specimens. BMC Evolutionary Biology, 17(1):181, Aug 2017.
[5] J. Y. Clark, D. P. A. Corney, and H. L. Tang. Automated plant identification using artificial neural networks. In 2012 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 343–348, May 2012.
[6] James S. Cope, David Corney, Jonathan Y. Clark, Paolo Remagnino, and Paul Wilkin. Plant species identification using digital morphometrics: A review. Expert Systems with Applications, 39(8):7562–7573, 2012.
[7] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-balanced loss based on effective number of samples. CoRR, abs/1901.05555, 2019.
[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
[9] Grant Van Horn, Oisin Mac Aodha, Yang Song, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The iNaturalist Challenge 2017 Dataset. CoRR, abs/1707.06642, 2017.
[10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
[11] Soon Jye Kho, Sugumaran Manickam, Sorayya Malek, Mogeeb Mosleh, and Sarinder Kaur Dhillon. Automated plant identification using artificial neural network and support vector machine. Frontiers in Life Science, 10(1):98–107, 2017.
[12] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. CoRR, abs/1712.01034, 2017.
[13] David J. Mabberley. Mabberley's Plant-book: A Portable Dictionary of Plants, their Classification and Uses. Cambridge University Press, 4th edition, 2017.
[14] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
[15] Eric Schuettpelz, Paul B. Frandsen, Rebecca B. Dikow, Abel Brown, Sylvia Orli, Melinda Peters, Adam Metallo, Vicki Funk, and Laurence J. Dorr. Applications of deep convolutional neural networks to digitized natural history collections. Biodiversity Data Journal, 5:e21139, Nov 2017.
[16] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[17] Jakob Unger, Dorit Merhof, and Susanne Renner. Computer vision applied to herbarium specimens of German trees: Testing the future utility of the millions of herbarium specimen images for automated identification. BMC Evolutionary Biology, 16, Dec 2016.
[18] Jana Wäldchen and Patrick Mäder. Plant species identification using computer vision techniques: A systematic literature review. Archives of Computational Methods in Engineering, 25(2):507–543, Apr 2018.
[19] D. Wijesingha and Faiz Marikar. Automatic detection system for the identification of plants using herbarium specimen images. Tropical Agricultural Research, 23, Sep 2012.
[20] Peter Wilf, Shengping Zhang, Sharat Chikkerur, Stefan A. Little, Scott L. Wing, and Thomas Serre. Computer vision cracks the leaf code. Proceedings of the National Academy of Sciences, 113(12):3305–3310, 2016.
[21] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.
[22] Sohaib Younis, Claus Weiland, Robert Hoehndorf, Stefan Dressler, Thomas Hickler, Bernhard Seeger, and Marco Schmidt. Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural networks. Botany Letters, Mar 2018.
[23] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. CoRR, abs/1708.04896, 2017.
