
MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients


Yixin Zhang (zhangyx@dl.soc.i.kyoto-u.ac.jp), Kyoto University, Japan
Yoko Yamakata (yamakata@mi.u-tokyo.ac.jp), The University of Tokyo, Japan
Keishi Tajima (tajima@i.kyoto-u.ac.jp), Kyoto University, Japan

ABSTRACT
In this paper, we introduce a new recipe dataset, MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, whereas conventional recipe datasets only contain final dish images and/or images for only some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. General object recognition methods, which assume the constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiment with our dataset shows that our method achieves higher accuracy than a model trained using all the data without considering the stages. Our dataset is available at our GitHub repository.

Figure 1: Ingredients change their appearance in the beginning, intermediate, and finishing stages of the cooking. In addition, the variance in their appearance becomes larger as the process proceeds to the later stages. Our purpose is to recognize such appearance-changing ingredients in images associated with the cooking steps by taking into account the relative position of the step in the cooking procedure.
CCS CONCEPTS
• Computing methodologies → Image representations; Object recognition.

KEYWORDS
datasets, recipe data, cooking, multimedia, food recognition

ACM Reference Format:
Yixin Zhang, Yoko Yamakata, and Keishi Tajima. 2021. MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients. In ACM Multimedia Asia (MMAsia ’21), December 1–3, 2021, Gold Coast, Australia. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3469877.3490596

1 INTRODUCTION
User-submitted recipe sites, such as Allrecipes (https://www.allrecipes.com/) in North America and the UK, Cookpad (http://cookpad.com/) in Japan, and Haodou (http://www.haodou.com/recipe/) in China, have recently become popular. Much research has been conducted using multimedia recipe data collected from them, such as food category recognition from dish images [12, 14, 22, 28], semantic structure analysis of recipe texts [21, 32], recipe text retrieval from food images [4], and recipe text generation from food images [24, 25, 27, 30].

Many recipe datasets include both text and images. Most of them only include final dish images and explain the cooking procedure only in text, but some sites, e.g., Haodou and Cookpad, also provide photos of the scenes of the cooking steps. Those images have been used in research on multimedia recipe processing, such as [4, 35].

A problem in such multimedia recipe descriptions is that some description is sometimes omitted in the text when it is shown in the images. In particular, the name of the ingredient processed in a step is often omitted when it is shown in the associated image. In such cases, users accessing only the text, e.g., through a smart speaker, cannot know what ingredients to process, so we need to complement such omitted ingredient names by recognizing the associated image.

The technical difficulty in that task is that ingredients change their appearance while they are processed during the cooking, and the same ingredient looks different at the beginning and the end of the cooking. This is a new challenge that has not been addressed in existing studies on recipe image processing.

Their changes are, however, not completely random but somewhat regular. Figure 1 shows example images of potatoes in various cooking stages. In the beginning stage, potatoes look similar: brown and round. In the intermediate stage, they become pale yellow and are cut into various shapes. In the finishing stage, their appearances become even more diverse as they are mixed with other ingredients.

This example shows two characteristics of ingredients: (1) their appearance changes depending on the cooking stage, and (2) their appearance becomes more diverse in the later stages.

In this paper, we propose a method for recognizing such ingredients in images associated with cooking steps. Following the observation above, our method is stage-aware, i.e., it pays attention to the relative position of the step the image is associated with. As ingredients change their appearance depending on the cooking stage, we first propose a method that separates the training data into subsets corresponding to the beginning, intermediate, and finishing stages, and trains a separate model for each stage. Given a target image, we use the model corresponding to its stage. This method works well for images in the beginning stage because the same kind of ingredients have a similar appearance in the beginning stage. For the later stages, however, this method does not work well because the appearance of ingredients becomes more diverse.

To improve the performance for the later stages, we propose another method that introduces curriculum learning. It first trains a model with the data from the beginning stage, and gradually adds more diverse data from the later stages. Our method switches between these two methods depending on the stage of the target image data.

We also constructed a recipe dataset where every cooking step is associated with an image. None of the existing datasets has images for every cooking step. Our dataset is available at our GitHub repository (https://github.com/bimissing/recipeData; for copyright reasons, the text and image data are provided as a list of URLs and a download script) and contains 26,725 recipes with 239,973 steps.

We then focused on the 20 most frequent ingredients in our dataset, and constructed a sub-dataset consisting of steps including at least one of these 20 ingredients. This sub-dataset consists of 30,023 steps. To each image in this sub-dataset, we manually assigned labels for these 20 ingredients. We divide it into the beginning, intermediate, and finishing stages so that each of these 20 ingredients appears almost evenly in the three stages.

The contributions of this paper are summarized as follows:
• We constructed a recipe dataset where every cooking step has both text and an image. This dataset can be used for various studies on recipes, e.g., cooking workflow extraction and cooking procedural text generation from recipe images. We also constructed a sub-dataset consisting of steps including the 20 most frequent ingredients, with their labels.
• We proposed a method for recognizing ingredients in images associated with cooking steps in recipe data. Because ingredients change their appearance during the cooking, our method is stage-aware. Our method outperforms several baseline methods in our experiment with our dataset.

2 RELATED WORK
In this research, we construct a multi-modal cooking step description dataset, and develop an ingredient recognition method for such data. These topics are related to the following research lines.

2.1 Recipe Datasets
Many recipe datasets have been published in recent years. Some datasets mainly focus on final dish images rather than recipe instructions. Recipe 1M+ [19], a famous and representative recipe dataset, contains text cooking instructions, but it also only focuses on final dish images. In Cookpad recipes [10], some steps have images, but many steps are explained only in text. On the other hand, on the recipe site Haodou, from which we collected our data, every cooking step has an image. It allows us to continuously observe the changes in the appearance of ingredients.

MM-ReS [26] is also a multimedia recipe dataset, but there are three differences: (1) In the MM-ReS dataset, images are not associated with every instruction step. (2) Since Haodou, from which we collected our data, has far more recipes than Instructables, from which the MM-ReS recipes were collected, our dataset is extendable to a larger size. (3) The MM-ReS dataset is not publicly available up to now.

None of these existing datasets fully provides multimedia information on the instructional aspect of the recipe, while it is one of the most important aspects of recipe data. We hence construct a brand new recipe dataset, MIRecipe, that provides that information.

2.2 Recipe Image Recognition
Recently, many studies have made great progress in the food recognition field. Some of them focus on multi-modal recipe retrieval [3, 29, 31]. The main aim of these studies is to retrieve recipe titles or contents from the images of final dishes. Although some solutions in these studies can be adapted to ingredient recognition, our work focuses on the sequential instructional images rather than images of the final dishes. Other studies mainly focus on food categorization based on the final dish images without explicit analysis of ingredient composition [5, 12, 15, 17, 20, 28, 35].

Our method intends to recognize ingredients in images describing cooking steps. Our method is useful for the automatic understanding of cooking procedures and for the recognition of ingredients used in the recipe. Ingredient estimation only from a final dish image is a task far harder than food categorization [4].

2.3 Recipe Text Processing
Recipe text has some differences from general text, and we cannot simply apply existing text processing methods to recipe data. There has been research on the detection of named entities, e.g., ingredients and tools, in recipe text [23]. However, their method cannot detect ingredients omitted in the text.

2.4 Image-Text Pair Processing
There has also been research on the analysis of words or tags attached to images, such as tag importance estimation for a tag set attached to an image [16, 18] and inferring the semantic relationships between tags [13]. They focus on data from image posting sites like Flickr, where tags are used to complement the image, so tags usually describe important items in the images. On the other hand, in image-text pairs describing cooking steps, ingredients processed in the steps are the most important items, but they are sometimes omitted in the text when they are shown in the images. Image-text pairs describing cooking steps thus have a very different characteristic.

3 DATASET
In this section, we explain the details of our recipe dataset MIRecipe. It contains 26,725 cooking recipes with 239,973 instruction steps (and the same number of images). The resolution of each image is around 700×500.

MIRecipe is superior to the existing representative datasets for the following reasons:
• MIRecipe is a dataset containing images for every cooking step, while other recipe datasets provide images only for a part of the steps or only provide the finished dish image.
• Recipe 1M+, ETH Food-101 [2], and some other datasets contain a large number of images, but Haodou, from which we collect data, has a far larger number of recipes, so our dataset can be extended to a larger scale.
The detailed data statistics are available in our GitHub repository (https://github.com/bimissing/recipeData).

3.1 Ingredient Classes
To identify the ingredient classes appearing in the instructional text of Chinese recipes on Haodou, we first constructed a Chinese recipe corpus of 50 recipes and manually annotated recipe named entities (r-NEs) in accordance with the guidelines previously defined for Japanese and English [36]. We then trained an r-NE recognizer using that annotated corpus and BERT-NE (https://github.com/kyzhouhzau/BERT-NER), a state-of-the-art named entity recognizer based on the BERT neural network architecture [8]. By applying the obtained r-NE recognizer to our entire dataset, we obtained 5,685 distinct ingredient classes, of which 24 classes appear more than 1,000 times, 298 classes appear more than 100 but less than 1,000 times, and 5,355 classes, accounting for 94.20% of the 5,685 classes, appear only 50 times or less.

In this research, we focus on 20 ingredient classes selected from the 24 classes that appear more than 1,000 times. The other four classes, "water," "flour," "salt," and "milk," were removed because it is hard to recognize them in images. The 20 classes are as follows: potato, ginger, onion, pork, shrimp, chicken, corn, carrot, eggplant, shallot, tofu, spinach, sauce, chili, bread, dough, fish, egg, cucumber, and soybean.

The total number of images in these 20 classes is 35,401, but some images appear in multiple classes, and we have 30,023 images in total after removing duplicates. They are used in the experiment in this research because we only focus on these 20 classes. Twenty ingredients may not sound like a lot, but in reality, ingredients can change in many ways during cooking, and we need data that is dense enough to predict those changes. To recognize the other, less frequent classes (fewer than 1,000 occurrences), we would need a larger training dataset.
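A minimal sketch of this extraction step, assuming a token-classification (NER) model fine-tuned on the r-NE corpus and served through the Hugging Face pipeline API; the model path and the "Food" entity label below are placeholders, not the artifacts used in the paper:

```python
# Hedged sketch: tag ingredient entities in step text with a fine-tuned
# token-classification model and count how often each ingredient appears.
from collections import Counter
from transformers import pipeline

ner = pipeline("token-classification",
               model="path/to/rne-bert",       # hypothetical fine-tuned r-NE model
               aggregation_strategy="simple")  # merge word pieces into entity spans

def ingredient_counts(step_texts):
    """Count ingredient mentions over all instruction steps."""
    counts = Counter()
    for text in step_texts:
        for ent in ner(text):
            if ent["entity_group"] == "Food":  # assumed label name for ingredient r-NEs
                counts[ent["word"]] += 1
    return counts

# e.g., ingredient_counts(all_step_texts).most_common(24) for the frequent classes
```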
3.2 Division into Subsets for Three Stages
Food appearances change during the cooking procedure. For example, a potato in the beginning stage is usually in its initial round shape, but it may be cut into slices or blocks in the intermediate stage, and in the finishing stage, it may be plated with other ingredients. This example represents a typical scenario of the processing of an ingredient in a recipe: it is first in an original raw state, then pre-processed alone, and finally cooked together with other ingredients. Based on this observation, we divide cooking steps into three subsets: beginning, intermediate, and finishing stages.

We divide cooking steps into three stages based on their relative positions in the entire procedure, as shown in Fig. 2. We calculate the relative position of a step by dividing its position by the number of steps in the recipe. For example, the relative position of Step 4 is 0.267 if there are 15 steps in the recipe. We decide the threshold values for the division into three subsets so that the frequency of each of the 20 ingredients in each subset is roughly balanced. The threshold values we used and the number of images in each subset are shown in Table 1. We use this sub-dataset in our experiment.

Figure 2: Division of images into three subsets corresponding to beginning, intermediate, and finishing stages based on their relative position in the entire procedure.

Table 1: Division into Subsets for Three Stages

                      Threshold    # of Images   Ratio
Beginning Subset      (0, 0.3)     11,469        0.382
Intermediate Subset   [0.3, 0.6]   11,018        0.367
Finishing Subset      (0.6, 1]     7,536         0.251
Total                 (0, 1]       30,023        1

We also divide these subsets into training sets and test sets. As one image may be labeled with multiple ingredient labels, we manually check images with multiple labels and decide which class they should be assigned to, in order to avoid assigning an image to both the training set of class A and the test set of class B.
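A minimal sketch of this stage assignment, assuming 1-based step positions and the thresholds listed in Table 1:

```python
def stage_of(step_position, num_steps):
    """Assign a step to a stage by its relative position in the recipe."""
    rel = step_position / num_steps       # e.g., Step 4 of 15 -> 0.267
    if rel < 0.3:
        return "beginning"                # relative position in (0, 0.3)
    elif rel <= 0.6:
        return "intermediate"             # relative position in [0.3, 0.6]
    else:
        return "finishing"                # relative position in (0.6, 1]

assert stage_of(4, 15) == "beginning"     # 0.267 falls in the beginning subset
```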
3.3 Analysis of Subset Characteristics
To analyze the characteristics of the three subsets, we visualize the distribution of the image features in the three subsets of the same class. We first vectorize images by using VGG16 [33] trained on ImageNet [7] data. We extract the output of the fully connected layers in VGG16, which is a 4096-dimensional vector, and apply the t-SNE method to convert them into two-dimensional vectors.

Figure 3 shows the distribution of the produced two-dimensional vectors for several ingredient classes. We can see that the images in the beginning, intermediate, and finishing stages of the same class have different distributions. This confirms our assumption that the appearance of the same ingredient is different in different cooking stages. It also shows that our simple method of dividing data into three subsets in accordance with their relative position works to some extent. Based on this result, we expect that training a separate model for each stage improves the recognition accuracy.

Figure 3: Distribution of image features in the three subsets of several ingredient classes (ginger, onion, tofu, and bread). Images in a beginning subset tend to be more clustered, while images in a finishing subset are more scattered. It is especially evident for ginger and onion, while the finishing subsets are also relatively clustered for tofu and bread.

We can also find that the images in the beginning stage tend to be more clustered and show stronger similarity, while the images in the finishing stage are more scattered. It is especially evident for ginger and onion. This confirms our second assumption that the appearance of the same ingredient becomes more diverse in the later stages, which is the reason why we introduce curriculum learning. On the other hand, for tofu and bread, images in the finishing stages are also relatively clustered. This result can be partly because of
the inaccuracy of division into subsets, but it also coincides with our observation that tofu and bread are relatively easy to recognize even in the final dish images compared with ginger and onion.
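A minimal sketch of this visualization, assuming standard ImageNet preprocessing and a recent torchvision; the 4096-dimensional vector is taken here as the activation just before VGG16's final classification layer, which is one plausible reading of the description above:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.manifold import TSNE

# ImageNet-pretrained VGG16 with the last classification layer removed,
# so the network outputs the 4096-dimensional fully connected feature.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_2d(pil_images):
    """Map a list of PIL images to 2-D points for plotting (Figure 3 style)."""
    feats = torch.stack([vgg(preprocess(im).unsqueeze(0)).squeeze(0) for im in pil_images])
    return TSNE(n_components=2).fit_transform(feats.numpy())
```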
4 PROPOSED METHOD
Our method is composed of two techniques: (1) stage-wise model learning, which trains separate models for the three cooking stages, and (2) stage-aware curriculum learning, which starts with training data from the beginning stage and proceeds to the later stages. Our method selects one of these two techniques depending on the stage of the target data: for target data from the beginning stage, it chooses stage-wise model learning, and for target data from the later stages, it chooses stage-aware curriculum learning. The overview of our method is shown in Fig. 4. We explain the details of these two component methods in the following.

4.1 Stage-Wise Model Learning
Even in recipe image recognition, the general object recognition methods, which assume the constant appearance of objects, perform well for entities like cooking tools whose shapes do not change during the cooking process. However, they do not perform well for ingredients whose appearance changes during the cooking.

To deal with such changes, we train three separate models by using the training data from the beginning, intermediate, and finishing stages, respectively. Given a target data, we select the best model for the stage of the target data. We expect that it is simply the model trained by the data from the same stage as the target.

However, we need not restrict ourselves to this simple approach. As shown in the previous section, images from the later stages are not clustered well compared with the beginning stage, and they can even be noisy. Therefore, the model trained by the cleaner training data from the beginning stage might be better than the model trained by the data from the later stages, even for target data from the later stages. For this reason, in the experiment shown later, we measure the accuracy for every combination of the three training datasets and the three test datasets, and determine which model is the best for target images from each stage.

More formally, let $S_1 = \{(img_{i1}, label_{i1}) \mid 1 \leq i \leq N_1\}$ be the set of the pairs of images and corresponding food labels in the training dataset from the beginning stage, where $N_1$ is the total number of labels in it. Similarly, we define $S_2$ and $S_3$ for the intermediate and finishing subsets. Let $h_{S_i} : img \rightarrow [K]$ denote the classifier trained on $S_i$ that classifies a given image $img$ into one of the $K$ ingredient classes. When given a target image associated with a step in stage $j$ ($j = 1, 2, 3$), we need a model selection function $m : j \rightarrow i$, which chooses one of $h_{S_i}$ ($i = 1, 2, 3$) based on $j$.

The best choice may depend not only on $j$ but also on the ingredient class; however, we do not know the class of the target image when we select $i$, so we cannot use that information for selecting $i$. On the other hand, basic image features are available when we select $i$, but they are unlikely to be useful for selecting $i$. The relative position of the target step is the most promising feature that is available when we select $i$, so we here focus on the model selection function $m$ that only takes $j$ as the input.
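A minimal sketch of stage-wise model learning and the selection function $m$, assuming a generic PyTorch classifier factory and one data loader per stage subset; the SENet154 backbone and data pipeline from Section 5.1 are not reproduced here:

```python
import torch
import torch.nn as nn

STAGES = ("beginning", "intermediate", "finishing")

def train_stage_models(make_model, loaders, epochs=40, lr=2e-4):
    """Train one classifier h_{S_i} per stage subset S_i."""
    models = {}
    for stage in STAGES:
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in loaders[stage]:
                opt.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                opt.step()
        models[stage] = model
    return models

def select_model(models, stage_j):
    # Model selection function m; the experiments in Section 5.2 support the
    # simplest choice m(j) = j, i.e., use the model trained on the target's stage.
    return models[stage_j]
```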
4.2 Stage-Aware Curriculum Learning
Our second method is based on the idea of curriculum learning. Humans and animals learn much better if examples are presented not randomly but in a meaningful order that starts with the most basic concepts and progressively proceeds to more complex ones.

One of the design issues in curriculum learning is the choice of a scoring function $f$ (the function that measures the difficulty of data), since it depends on prior knowledge about the data. Here we approximate the difficulty of a training sample $(img_{ij}, label_{ij})$ by its stage $i$. An ingredient changes its shape and color from its original state in the beginning stage to a more processed and mixed shape and color in the intermediate and finishing stages, so we expect that this order is an appropriate order for curriculum learning.

Another issue in curriculum learning is a pacing function $p$ [9] that determines the sequence of data used for the training in each step of the curriculum learning. There have been two basic approaches. The original curriculum learning method [1] proposes to start with a sub-dataset consisting of simpler data and then switch to a harder target dataset. On the other hand, the idea of the Baby Steps curriculum [34] is that simpler data should not be discarded suddenly during the training; instead, the complexity of the training data should be gradually increased [6]. Here, we adopt the latter.

Figure 4: Overview of the Proposed Method. The image feature vectors gradually become less clustered as the cooking process
proceeds. For target images in the beginning stage, we use a model trained by the training data from the beginning stage. For
target images in the intermediate and finishing stages, we use a model trained by the curriculum learning.

We designed a subset-level balance scheme for our curriculum learning. We first set the batch size in our training to 256 and the epoch number to 40. The 256 training samples in each mini-batch are randomly selected from the three subsets in the following scheme: Learning Phase 1: (256, 0, 0), Learning Phase 2: (128, 128, 0), Learning Phase 3: (128, 64, 64), and Learning Phase 4: (64, 64, 128).

We also need to decide "switch epochs", i.e., when we switch from one phase to the next. Here we set every 10 epochs as a switch epoch. We chose 10 epochs because a model almost converges in 10 epochs in our experiment. The curriculum first trains the model on the training set created in accordance with the scheme for Phase 1 until it reaches a switch epoch. It then proceeds to Phase 2 until it reaches the next switch epoch, and so on. During the training, each phase takes over the weights before the FC layers from the previous phase. This is also illustrated in Fig. 4.
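A minimal sketch of this subset-level balance scheme, assuming the three stage subsets are in-memory lists of (image, label) pairs; weight hand-over between phases and other training details are omitted:

```python
import random

PHASE_COUNTS = {                 # samples per (beginning, intermediate, finishing) subset
    1: (256, 0, 0),              # in one 256-sample mini-batch
    2: (128, 128, 0),
    3: (128, 64, 64),
    4: (64, 64, 128),
}
SWITCH_EVERY = 10                # epochs per phase; 4 phases x 10 epochs = 40 epochs

def phase_of(epoch):
    return min(epoch // SWITCH_EVERY + 1, 4)

def sample_batch(subsets, epoch):
    """Compose one curriculum mini-batch for the given (0-based) epoch."""
    counts = PHASE_COUNTS[phase_of(epoch)]
    batch = []
    for subset, k in zip(subsets, counts):  # subsets = (beginning, intermediate, finishing)
        batch += random.sample(subset, k)
    random.shuffle(batch)
    return batch
```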
4.3 Stage-Aware Model Selection
To achieve the best performance, our method switches between the two techniques explained above depending on the stage of the target data. In other words, we select the method with the higher expected accuracy for the stage of the target data. As explained in Section 1, we use the stage-wise model learning for target images from the beginning stage, and use the stage-aware curriculum learning for target images from the later stages, as illustrated in Fig. 4.

5 EXPERIMENTS
We conducted experiments to evaluate the performance of the two proposed techniques, and also of our main method that switches between these two depending on the stage of the target data.

5.1 Experiment Settings
In our experiments, our methods are implemented on the PyTorch platform. We employ the SENet154 [11] architecture as our basic architecture. We set the batch size to 256 and the epoch number to 40. The network is optimized with Adaptive Moment Estimation (Adam), and the learning rate is initially set to 0.0002. As the baseline method, we also compare a simple method that trains SENet154 using all the training data while ignoring the stages. The data size used in each experiment is the same, and the ratio of training data to test data is 7:3 for all the experiments.
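A small sketch of these settings, assuming `dataset` is the labeled sub-dataset and that an SENet154-style classifier is created elsewhere; only the split, batch size, and optimizer are shown:

```python
import torch
from torch.utils.data import DataLoader, random_split

def make_loaders(dataset, batch_size=256, train_ratio=0.7, seed=0):
    """7:3 train/test split with the batch size used in the experiments."""
    n_train = int(len(dataset) * train_ratio)
    g = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train], generator=g)
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(test_set, batch_size=batch_size))

def make_optimizer(model, lr=2e-4):
    return torch.optim.Adam(model.parameters(), lr=lr)  # Adam with initial lr 0.0002
```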
5.2 Stage-Wise Model Learning
For our first method, stage-wise model learning, we test all combinations of the three subsets of the training data and the three subsets of the test data. We also measure the accuracy for the whole training set and for the whole test set. The accuracy for each combination is shown in Table 2. The last row and the last column correspond to the whole training set and the whole test set, respectively. The cell at the bottom right corner corresponds to the baseline method, which uses the whole training data for all test data.

The results show that we obtain the best recognition performance when the training data and the test data are from the same subset. Therefore, we can adopt the simplest model selection function m(j) = j. The average accuracy of the stage-wise model learning with such m over the entire test set is the average of the three diagonal cells in Table 2. It is 59.19%, which is higher than the 49.63% of the baseline method.

For target data from the beginning stage, our final method uses the model trained by the data from the beginning stage, as explained before. It corresponds to the first row of Table 2. We call it Model 1.

5.3 Stage-Aware Curriculum Learning
Table 3 shows the results of the curriculum learning with several curriculum patterns, including the one introduced in Section 4.2. In Pattern 1, we only use training data from the beginning stage. Therefore, it is basically the same model as the first row in Table 2. In Pattern 2, we start with a training dataset that only includes data from the beginning stage, proceed to a training dataset that only includes data from the intermediate stage, and stop there without proceeding to a training dataset including data from the finishing stage. Pattern 3 is the strategy we explained in Section 4.2, where training samples are selected from all three subsets following the scheme shown in Section 4.2.
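The schedules below restate the three patterns in the per-batch (beginning, intermediate, finishing) notation of Section 4.2. Pattern 3 is the scheme described there; the exact batch compositions of Patterns 1 and 2 are not spelled out in the text, so the first two schedules are only one plausible reading:

```python
# Illustrative phase schedules, one tuple per 10-epoch phase.
PATTERNS = {
    1: [(256, 0, 0)] * 4,                                           # beginning data only
    2: [(256, 0, 0), (256, 0, 0), (0, 256, 0), (0, 256, 0)],        # beginning, then intermediate only
    3: [(256, 0, 0), (128, 128, 0), (128, 64, 64), (64, 64, 128)],  # full scheme from Section 4.2
}
```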

Table 2: Accuracy of Stage-wise Model Learning

Training \ Test Subset   Beginning   Intermediate   Finishing   All
Beginning                64.16%      54.72%         40.63%      49.91%
Intermediate             55.63%      60.59%         49.51%      47.33%
Finishing                42.79%      50.66%         52.83%      47.01%
All                      50.17%      51.74%         46.28%      49.63%

Table 3: Accuracy of Stage-aware Curriculum Learning

Training Pattern \ Test Subset   Beginning   Intermediate   Finishing
Pattern 1                        64.16%      54.72%         40.63%
Pattern 2                        61.79%      58.84%         55.31%
Pattern 3                        60.10%      62.61%         58.34%

Table 4: Comparison of the Proposed and Baseline Methods

Plan                                           Top-1 acc.
Baseline (SENet154)                            49.63%
Stage-Wise Model Learning (m(j) = j)
  Beginning Subset                             64.16%
  Intermediate Subset                          60.59%
  Finishing Subset                             52.83%
  Average                                      59.19%
Curriculum Learning Pattern 3 (Model 2)
  Beginning Subset                             60.10%
  Intermediate Subset                          62.61%
  Finishing Subset                             58.34%
  Average                                      60.35%

Table 5: Comparison of Our Methods Based on SENet154 with Other Standard Methods

Plan                     Top-1 acc.   Top-3 acc.   Top-5 acc.
Ours      Stage-Wise     59.19%       81.21%       89.47%
          Model 2        60.35%       83.76%       90.91%
Baseline  SENet154       49.63%       76.51%       86.93%
          ResNet50       46.41%       74.13%       84.35%
          VGG16          43.67%       72.39%       85.01%
          AlexNet        31.77%       64.06%       77.42%

Table 6: Final Accuracy of Our Method

Subset                 Model Selection   Accuracy
Beginning Subset       Model 1           64.16%
Intermediate Subset    Model 2           62.61%
Finishing Subset       Model 2           58.34%
Average                                  61.70%

We can see that the model trained by the curriculum Pattern 3 achieves the best performance for the target data from the intermediate and finishing stages. We call this model Model 2. It achieves the accuracy shown in the last row of Table 3 for the three test sets. The average accuracy over all the test datasets is 60.35%, which is also higher than the 49.63% of the baseline method.

We compare the accuracy of the baseline method, the stage-wise model learning with m(j) = j, and Model 2 for each subset of the test dataset in Table 4. We can see that the average accuracy of both of the two proposed methods is higher than the accuracy of the baseline method. Besides, stage-wise model learning for the beginning stage, i.e., Model 1, achieves higher accuracy for the beginning subset than Model 2, while Model 2 achieves higher accuracy for the intermediate and finishing subsets than Model 1.

5.4 Comparisons with Other Standard Methods
We also compare our methods with other standard object recognition methods, AlexNet, ResNet50, and VGG16, on our dataset. The result is shown in Table 5. In this table, we also show the Top-3 and Top-5 accuracy. We can find that both of our proposed methods improve the performance significantly.

5.5 Stage-Aware Model Selection
Both of the proposed methods (stage-wise model learning and stage-aware curriculum learning) achieve higher accuracy than the baseline method. According to Table 4, we can find that for images in the beginning subset, the stage-wise learning method (Model 1) achieves a better result than the stage-aware curriculum learning method (Model 2). On the other hand, for images in the intermediate and finishing subsets, the stage-aware curriculum learning method (Model 2) achieves higher accuracy. Therefore, we should apply Model 1 when given a target data from the beginning subset, and Model 2 when given a target data from the intermediate or finishing subset.

By switching these two methods, we can achieve an average accuracy of 61.70% over the whole test dataset, as shown in Table 6.
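As a quick consistency check, the reported averages match the unweighted means of the three per-subset accuracies (the text for Table 2 describes 59.19% as the average of the diagonal cells, and the same reading reproduces the other averages):

```python
stage_wise = (64.16 + 60.59 + 52.83) / 3  # diagonal of Table 2  -> 59.19
model_2    = (60.10 + 62.61 + 58.34) / 3  # last row of Table 3  -> 60.35
switching  = (64.16 + 62.61 + 58.34) / 3  # Table 6 combination  -> 61.70
print(round(stage_wise, 2), round(model_2, 2), round(switching, 2))  # 59.19 60.35 61.7
```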
6 CONCLUSION
In this paper, we first constructed a Chinese recipe dataset that contains both text and image data for every cooking step. Our dataset will be expanded and updated in the future, and an English-translated version will be made available.

We also proposed an image recognition method for ingredients whose appearance changes with the progress of the cooking procedure. Our stage-wise model learning method improves the recognition accuracy for the easily recognizable images in the beginning stages. The curriculum learning method is introduced to deal with the problem that the appearance of the same ingredient becomes more diverse in the intermediate and finishing stages. We can achieve high average accuracy by switching these two methods depending on the stage of the target data.

We focus on recipe data in this paper, but many other types of objects also constantly change their appearance. Application of our method to such data is an interesting topic for future work. For example, we can apply our method to pictures of the growth process of animals and plants. In this paper, we focused on single-label recognition. However, multiple ingredients often appear in one image. Experiments with multi-label data are also an important remaining issue for future work.

ACKNOWLEDGMENTS
This work was supported by JST CREST Grant Number JPMJCR16E3 and JSPS KAKENHI Grant Numbers 18K11425 and 20H04210, Japan.

REFERENCES
[1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning. 41–48.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – Mining discriminative components with random forests. In European conference on computer vision. Springer, 446–461.
[3] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. 2018. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 35–44.
[4] Jingjing Chen and Chong-Wah Ngo. 2016. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 24th ACM international conference on Multimedia. 32–41.
[5] Xin Chen, Yu Zhu, Hua Zhou, Liang Diao, and Dongyan Wang. 2017. ChineseFoodNet: A large-scale image dataset for Chinese food recognition. arXiv preprint arXiv:1705.02743 (2017).
[6] Volkan Cirik, Eduard Hovy, and Louis-Philippe Morency. 2016. Visualizing and understanding curriculum learning for long short-term memory networks. arXiv preprint arXiv:1611.06204 (2016).
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Guy Hacohen and Daphna Weinshall. 2019. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning. PMLR, 2535–2544.
[10] Jun Harashima, Yuichiro Someya, and Yohei Kikuta. 2017. Cookpad image dataset: An image collection as infrastructure for food research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1229–1232.
[11] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
[12] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. 2014. Food detection and recognition using convolutional neural network. In Proceedings of the 22nd ACM international conference on Multimedia. 1085–1088.
[13] Marie Katsurai, Takahiro Ogawa, and Miki Haseyama. 2014. A cross-modal
approach for extracting semantic relationships between concepts using tagged
images. IEEE Transactions on Multimedia 16, 4 (2014), 1059–1074.
[14] Yoshiyuki Kawano and Keiji Yanai. 2015. Foodcam: A real-time food recognition
system on a smartphone. Multimedia Tools and Applications 74, 14 (2015), 5263–
5287.
[15] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. 2018. Cleannet: Trans-
fer learning for scalable image classifier training with label noise. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 5447–5456.
[16] Shangwen Li, Sanjay Purushotham, Chen Chen, Yuzhuo Ren, and C-C Jay Kuo.
2017. Measuring and predicting tag importance for image retrieval. IEEE trans-
actions on pattern analysis and machine intelligence 39, 12 (2017), 2423–2436.
[17] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, and Yunsheng
Ma. 2016. Deepfood: Deep learning-based food image recognition for computer-
aided dietary assessment. In International Conference on Smart Homes and Health
Telematics. Springer, 37–48.
[18] Taka Maenishi and Keishi Tajima. 2019. Identifying Tags Describing Image
Contents. In Proc. of ACM Hypertext. 297–298.
[19] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf
Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for
Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE
Trans. Pattern Anal. Mach. Intell. (2019).
[20] Simon Mezgec and Barbara Koroušić Seljak. 2017. NutriNet: a deep learning food
and drink image recognition system for dietary assessment. Nutrients 9, 7 (2017),
657.
[21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
(2013).
[22] Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin
Wei, and Shuqiang Jiang. 2020. ISIA Food-500: A Dataset for Large-Scale Food
Recognition via Stacked Global-Local Attention Network. In Proceedings of the
28th ACM International Conference on Multimedia. 393–401.
[23] Shinsuke Mori, Hirokuni Maeta, Yoko Yamakata, and Tetsuro Sasada. 2014. Flow
Graph Corpus from Recipe Texts.. In LREC. 2370–2377.
[24] Taichi Nishimura, Atsushi Hashimoto, and Shinsuke Mori. 2019. Procedural
text generation from a photo sequence. In Proceedings of the 12th International
Conference on Natural Language Generation. 409–414.
[25] Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Yoko Yamakata, and Shinsuke Mori. 2020. Structure-Aware Procedural Text Generation From an Image Sequence. IEEE Access 9 (2020), 2125–2141.
[26] Liang-Ming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yugang Jiang, and Tat-Seng Chua. 2020. Multi-modal cooking workflow construction for food recipes. In Proceedings of the 28th ACM International Conference on Multimedia. 1132–1141.
[27] Siyuan Pan, Ling Dai, Xuhong Hou, Huating Li, and Bin Sheng. 2020. ChefGAN: Food image generation from recipes. In Proceedings of the 28th ACM International Conference on Multimedia. 4244–4252.
[28] Paritosh Pandey, Akella Deepthi, Bappaditya Mandal, and Niladri B Puhan. 2017. FoodNet: Recognizing foods using ensemble of deep networks. IEEE Signal Processing Letters 24, 12 (2017), 1758–1762.
[29] Hai X Pham, Ricardo Guerrero, Jiatong Li, and Vladimir Pavlovic. 2021. CHEF: Cross-modal hierarchical embeddings for food domain retrieval. arXiv preprint arXiv:2102.02547 (2021).
[30] Amaia Salvador, Michal Drozdzal, Xavier Giro-i Nieto, and Adriana Romero. 2019. Inverse cooking: Recipe generation from food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10453–10462.
[31] Amaia Salvador, Erhan Gundogdu, Loris Bazzani, and Michael Donoser. 2021. Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15475–15484.
[32] Tetsuro Sasada, Shinsuke Mori, Tatsuya Kawahara, and Yoko Yamakata. 2015. Named entity recognizer trainable from partially annotated data. In Conference of the Pacific Association for Computational Linguistics. Springer, 148–160.
[33] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[34] Valentin I Spitkovsky, Hiyan Alshawi, and Dan Jurafsky. 2010. From baby steps to leapfrog: How “less is more” in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 751–759.
[35] Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. 2015. Recipe recognition with large multimodal food dataset. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 1–6.
[36] Yoko Yamakata, John Carroll, and Shinsuke Mori. 2017. A comparison of cooking recipe named entities between Japanese and English. In Proceedings of the 9th Workshop on Multimedia for Cooking and Eating Activities in conjunction with The 2017 International Joint Conference on Artificial Intelligence. 7–12.
