ABSTRACT
In this paper, we introduce a new recipe dataset MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, while conventional recipe datasets only contain final dish images and/or images for only some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. General object recognition methods, which assume the constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiment with our dataset shows that our method achieves higher accuracy than a model trained using all the data without considering the stages. Our dataset is available at our GitHub repository.

Figure 1: Ingredients change their appearance in the beginning, intermediate, and finishing stages of the cooking. In addition, the variance in their appearance becomes larger as the process proceeds to the later stages. Our purpose is to recognize such appearance-changing ingredients in images associated with the cooking steps by taking into account the relative position of the step in the cooking procedure.
CCS CONCEPTS
• Computing methodologies → Image representations; Object recognition.

KEYWORDS
datasets, recipe data, cooking, multimedia, food recognition

ACM Reference Format:
Yixin Zhang, Yoko Yamakata, and Keishi Tajima. 2021. MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients. In ACM Multimedia Asia (MMAsia ’21), December 1–3, 2021, Gold Coast, Australia. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3469877.3490596

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
MMAsia ’21, December 1–3, 2021, Gold Coast, Australia
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8607-4/21/12.
https://doi.org/10.1145/3469877.3490596

1 INTRODUCTION
User-submitted recipe sites, such as Allrecipes (https://www.allrecipes.com/) in North America and the UK, Cookpad (http://cookpad.com/) in Japan, and Haodou (http://www.haodou.com/recipe/) in China, have recently become popular. Much research has been conducted using multimedia recipe data collected from them, such as food category recognition from dish images [12, 14, 22, 28], semantic structure analysis of recipe texts [21, 32], recipe text retrieval from food images [4], and recipe text generation from food images [24, 25, 27, 30].

Many recipe datasets include both text and images. Most of them only include final dish images and explain the cooking procedure only in text, but some sites, e.g., Haodou and Cookpad, also provide photos of the scenes of the cooking steps. Those images have been used in research on multimedia recipe processing, such as [4, 35].

A problem in such multimedia recipe descriptions is that some description is sometimes omitted from the text when it is shown in the images. In particular, the name of the ingredient processed in a step is often omitted when it is shown in the associated image. In such cases, users who only access the text, e.g., through a smart speaker, cannot know what ingredients to process, so we need to complement such omitted ingredient names by recognizing the associated image.

The technical difficulty in that task is that ingredients change their appearance while they are processed during the cooking, and the same ingredient looks different at the beginning and the end of the cooking. This is a new challenge that has not been addressed in existing studies on recipe image processing.

Their changes are, however, not completely random but somewhat regular. Figure 1 shows example images of potatoes in various cooking stages. In the beginning stage, potatoes look similar, brown and round. In the intermediate stage, they become pale yellow and are cut into various shapes. In the finishing stage, their appearances become even more diverse as they are mixed with other ingredients.
This example shows two characteristics of ingredients: (1) their appearance changes depending on the cooking stage, and (2) their appearance becomes more diverse in the later stages.

In this paper, we propose a method for recognizing such ingredients in images associated with cooking steps. Following the observation above, our method is stage-aware, i.e., it pays attention to the relative position of the step the image is associated with. As ingredients change their appearance depending on the cooking stage, we first propose a method that separates the training data into subsets corresponding to the beginning, intermediate, and finishing stages, and trains a separate model for each stage. Given a target image, we use the model corresponding to its stage. This method works well for images in the beginning stage because the same kind of ingredients have a similar appearance in the beginning stage. For the later stages, however, this method does not work well because the appearance of ingredients becomes more diverse.

To improve the performance for the later stages, we propose another method that introduces curriculum learning. It first trains a model with the data from the beginning stage, and gradually adds more diverse data from the later stages. Our method switches between these two techniques depending on the stage of the target image data.

We also constructed a recipe dataset where every cooking step is associated with an image. None of the existing datasets has images for every cooking step. Our dataset is available at our GitHub repository (https://github.com/bimissing/recipeData; for copyright reasons, the text and image data are provided as a list of URLs and a download script) and contains 26,725 recipes with 239,973 steps.

We then focused on the 20 most frequent ingredients in our dataset, and constructed a sub-dataset consisting of steps including at least one of these 20 ingredients. This sub-dataset consists of 30,023 steps. To each image in this sub-dataset, we manually assigned labels for these 20 ingredients. We divide it into the beginning, intermediate, and finishing stages so that each of these 20 ingredients appears almost evenly in the three stages.

The contributions of this paper are summarized as follows:
• We constructed a recipe dataset where every cooking step has both text and an image. This dataset can be used for various studies on recipes, e.g., cooking workflow extraction and cooking procedural text generation from recipe images. We also constructed a sub-dataset consisting of steps including the 20 most frequent ingredients, with their labels.
• We proposed a method for recognizing ingredients in images associated with cooking steps in recipe data. Because ingredients change their appearance during the cooking, our method is stage-aware. Our method outperforms several baseline methods in our experiment with our dataset.

2 RELATED WORK
In this research, we construct a multi-modal cooking step description dataset, and develop an ingredient recognition method for such data. These topics are related to the following research lines.

2.1 Recipe Datasets
Many recipe datasets have been published in recent years. Some datasets mainly focus on final dish images rather than recipe instructions. Recipe 1M+ [19], a famous and representative recipe dataset, contains text cooking instructions, but it also only focuses on final dish images. In Cookpad recipes [10], some steps have images, but many steps are explained only in text. On the other hand, on the recipe site Haodou, from which we collected our data, every cooking step has an image. This allows us to continuously observe the changes in the appearance of ingredients.

MM-ReS [26] is also a multimedia recipe dataset, but there are three differences: (1) In the MM-ReS dataset, images are not associated with every instruction step. (2) Since Haodou, from which we collected our data, has far more recipes than Instructables, from which the MM-ReS recipes were collected, our dataset is extendable to a larger size. (3) The MM-ReS dataset has not been made publicly available so far.

None of these existing datasets fully provides multimedia information on the instructional aspect of the recipe, while it is one of the most important aspects of recipe data. We hence construct a brand-new recipe dataset MIRecipe that provides that information.

2.2 Recipe Image Recognition
Recently, many studies have made great progress in the food recognition field. Some of them focus on multi-modal recipe retrieval [3, 29, 31]. The main aim of these studies is to retrieve recipe titles or contents from images of final dishes. Although some solutions in these studies can be adapted to ingredient recognition, our work focuses on the sequential instructional images rather than images of the final dishes. Other studies mainly focus on food categorization based on final dish images without explicit analysis of ingredient composition [5, 12, 15, 17, 20, 28, 35].

Our method intends to recognize ingredients in images describing cooking steps. It is useful for the automatic understanding of cooking procedures and for the recognition of ingredients used in a recipe. Ingredient estimation only from a final dish image is a task far harder than food categorization [4].

2.3 Recipe Text Processing
Recipe text has some differences from general text, and we cannot simply apply existing text processing methods to recipe data. There has been research on the detection of named entities, e.g., ingredients and tools, in recipe text [23]. However, their method cannot detect ingredients omitted from the text.

2.4 Image-Text Pair Processing
There has also been research on the analysis of words or tags attached to images, such as tag importance estimation for a tag set attached to an image [16, 18] and inferring the semantic relationship between the tags [13]. These studies focus on data from image posting sites like Flickr, where tags are used to complement images, so the tags usually describe important items in the images. On the other hand, in image-text pairs describing cooking steps, the ingredients processed in the steps are the most important items, but they are sometimes omitted from the text when they are shown in the images. Image-text pairs describing cooking steps thus have a very different characteristic.

3 DATASET
In this section, we explain the details of our recipe dataset MIRecipe. It contains 26,725 cooking recipes with 239,973 instruction steps (and the same number of images). The resolution of each image is
Figure 3: Distribution of image features in three subsets of several ingredient classes. Images in a beginning subset tend to be more clustered, while images in a finishing subset are more scattered. It is especially evident for ginger and onion, while the finishing subsets are also relatively clustered for tofu and bread.

the inaccuracy of division into subsets, but it also coincides with our observation that tofu and bread are relatively easy to recognize even in the final dish images compared with ginger and onion.

4 PROPOSED METHOD
Our method is composed of two techniques: (1) stage-wise model learning, which trains separate models for the three cooking stages, and (2) stage-aware curriculum learning, which starts with training data from the beginning stage and proceeds to the later stages. Our method selects one of these two techniques depending on the stage of the target data: for target data from the beginning stage, it chooses stage-wise model learning, and for target data from the later stages, it chooses stage-aware curriculum learning. An overview of our method is shown in Fig. 4. We explain the details of these two component techniques in the following.

4.1 Stage-Wise Model Learning
Even in recipe image recognition, general object recognition methods, which assume the constant appearance of objects, perform well for entities like cooking tools whose shapes do not change during the cooking process. However, they do not perform well for ingredients whose appearance changes during the cooking.

To deal with such changes, we train three separate models by using the training data from the beginning, intermediate, and finishing stages, respectively. Given a target data, we select the best model for the stage of the target data. We expect that it is simply the model trained by the data from the same stage as the target.

However, we need not restrict ourselves to this simple approach. As shown in the previous section, images from the later stages are not clustered well compared with the beginning stage, and they can even be noisy. Therefore, the model trained by the cleaner training data from the beginning stage might be better than the models trained by the data from the later stages, even for target data from the later stages. For this reason, in the experiment shown later, we measure the accuracy for every combination of the three training datasets and the three test datasets, and determine which model is the best for target images from each stage.

More formally, let S1 = {(img_i1, label_i1) | 1 ≤ i ≤ N1} be the set of pairs of images and corresponding food labels in the training dataset from the beginning stage. N1 is the total number of labels in it. Similarly, we define S2 and S3 for the intermediate and finishing subsets. Let h_Si : img → [K] denote the classifier trained on Si that classifies a given image img into one of the K ingredient classes. When given a target image associated with a step in a stage j (j = 1, 2, 3), we need a model selection function m : j → i, which chooses one of h_Si (i = 1, 2, 3) based on j.

The best choice may depend not only on j but also on the ingredient class, but we do not know the class of the target image when we select i, so we cannot use that information for selecting i. On the other hand, basic image features are available when we select i, but they are unlikely to be useful for selecting i. The relative position of the target step is the most promising feature that is available when we select i, so we here focus on a model selection function m that only takes j as the input.

4.2 Stage-Aware Curriculum Learning
Our second technique is based on the idea of curriculum learning. Humans and animals learn much better if examples are presented not randomly but in a meaningful order that starts with the most basic concepts and progressively proceeds to more complex ones.

One of the design issues in curriculum learning is to choose a scoring function f (the function that measures the difficulty of data), since it depends on prior knowledge about the data. Here we approximate the difficulty of a training example (img_ij, label_ij) by its stage j. An ingredient changes its shape and color from its original state in the beginning stage to a more processed and mixed shape and color in the intermediate and finishing stages, so we expect that this order is an appropriate order for curriculum learning.

Another issue in curriculum learning is the pacing function p [9] that determines the sequence of data used for the training in each step of the curriculum learning. There have been two basic approaches. The original curriculum learning method [1] proposes to start with a sub-dataset consisting of simpler data, and then switch to a harder target dataset. On the other hand, the idea of the Baby Steps curriculum [34] is that simpler data should not be discarded suddenly during the training, but instead, the complexity of the training data should be gradually increased [6]. Here, we adopt the latter.

We designed a subset-level balance scheme for our curriculum learning. We first set the batch size in our training to 256 and the epoch number to 40. The 256 training samples in each mini-batch are
Figure 4: Overview of the Proposed Method. The image feature vectors gradually become less clustered as the cooking process
proceeds. For target images in the beginning stage, we use a model trained by the training data from the beginning stage. For
target images in the intermediate and finishing stages, we use a model trained by the curriculum learning.
randomly selected from the three subsets in the following scheme: Learning Phase 1: (256, 0, 0), Learning Phase 2: (128, 128, 0), Learning Phase 3: (128, 64, 64), and Learning Phase 4: (64, 64, 128).

We also need to decide "switch epochs", i.e., when we switch from a phase to the next phase. Here we set every 10 epochs as a switch epoch. We chose 10 epochs because a model almost converges in 10 epochs in our experiment. The curriculum first trains the model on the training set created in accordance with the scheme for Phase 1 until it reaches a switch epoch. It then proceeds to Phase 2 until it reaches the next switch epoch, and so on. In the training period, each phase takes over the weights before the FC layers from the previous phase. This is also illustrated in Fig. 4.

4.3 Stage-Aware Model Selection
To achieve the best performance, our method switches between the two techniques explained above depending on the stage of the target data. In other words, we select the technique with the higher expected accuracy for the stage of the target data. As explained in Section 1, we use stage-wise model learning for target images from the beginning stage, and stage-aware curriculum learning for target images from the later stages, as illustrated in Fig. 4.

5 EXPERIMENTS
We conducted experiments to evaluate the performance of the two proposed techniques, and also of our main method that switches between them depending on the stage of the target data.

5.1 Experiment Settings
In our experiments, our methods are implemented on the PyTorch platform. We employ the SENet154 [11] architecture as our basic architecture. We set the batch size to 256 and the epoch number to 40. The network is optimized with Adaptive Moment Estimation (Adam), and the learning rate is initially set to 0.0002. As the baseline method, we also compare a simple method that trains SENet154 using all the training data, ignoring the stages. The data size used in each experiment is the same, and the ratio of training data to test data is 7:3 for all the experiments.

5.2 Stage-Wise Model Learning
For our first method, stage-wise model learning, we test all combinations of the three subsets of the training data and the three subsets of the test data. We also measure the accuracy for the whole training set and for the whole test set. The accuracy for each combination is shown in Table 2. The last row and the last column correspond to the whole training set and the whole test set, respectively. The cell at the bottom right corner corresponds to the baseline method, which uses the whole training data for all test data.

The results show that we obtain the best recognition performance when the training data and the test data are from the same subset. Therefore, we can adopt the simplest model selection function m(i) = i. The average accuracy of stage-wise model learning with such m over the entire test set is the average of the three diagonal cells (bold ones) in Table 2. It is 59.19%, which is higher than the 49.63% of the baseline method.

For target data from the beginning stage, our final method uses the model trained by the data from the beginning stage, as explained before. It corresponds to the first row of Table 2. We call it Model 1.

5.3 Stage-Aware Curriculum Learning
Table 3 shows the results of the curriculum learning with several curriculum patterns, including the one introduced in Section 4.2. In Pattern 1, we only use training data from the beginning stage. Therefore, it is basically the same model as the first row in Table 2. In Pattern 2, we start with a training dataset that only includes data from the beginning stage, proceed to a training dataset that only includes data from the intermediate stage, and stop there without proceeding to a training dataset including data from the finishing stage. Pattern 3 is the strategy we explained in Section 4.2, where training samples are selected from all the three subsets following the scheme shown in Section 4.2.
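The mini-batch scheme and switch epochs of Section 4.2, together with the stage-aware model selection of Section 4.3, can be condensed into a short sketch. This is illustrative plain Python, not the authors' released code: the function names are my own, and I assume the phase tuples list the sample counts for the beginning, intermediate, and finishing subsets, in that order.

```python
# Mini-batch composition per curriculum phase: how many of the 256
# samples are drawn from the (beginning, intermediate, finishing)
# subsets. Phases switch every 10 epochs over a 40-epoch run.
PHASES = [(256, 0, 0), (128, 128, 0), (128, 64, 64), (64, 64, 128)]
SWITCH_EVERY = 10  # a model almost converges within 10 epochs

def batch_composition(epoch):
    """Return the per-subset sample counts for the mini-batches of
    `epoch` (0-indexed), following the Baby-Steps-style pacing."""
    phase = min(epoch // SWITCH_EVERY, len(PHASES) - 1)
    return PHASES[phase]

def select_model(stage_j, model1, model2):
    """Stage-aware model selection (Section 4.3): Model 1 (stage-wise,
    beginning-stage data) for beginning-stage targets, Model 2
    (curriculum Pattern 3) for intermediate/finishing-stage targets."""
    return model1 if stage_j == 1 else model2
```

For example, epoch 25 falls in Learning Phase 3, so `batch_composition(25)` returns `(128, 64, 64)`; every phase composition sums to the batch size of 256.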
Table 2: Accuracy of Stage-wise Model Learning

Training \ Test Subset | Beginning | Intermediate | Finishing | All
Beginning              |  64.16%   |   54.72%     |  40.63%   | 49.91%
Intermediate           |  55.63%   |   60.59%     |  49.51%   | 47.33%
Finishing              |  42.79%   |   50.66%     |  52.83%   | 47.01%
All                    |  50.17%   |   51.74%     |  46.28%   | 49.63%

Table 3: Accuracy of Stage-aware Curriculum Learning

Training Pattern \ Test Subset | Beginning | Intermediate | Finishing
Pattern 1                      |  64.16%   |   54.72%     |  40.63%
Pattern 2                      |  61.79%   |   58.84%     |  55.31%
Pattern 3                      |  60.10%   |   62.61%     |  58.34%

Table 5: Comparison of Our Methods Based on SENet154 with Other Standard Methods

Plan                 | Top-1 acc. | Top-3 acc. | Top-5 acc.
Ours: Stage-Wise     |   59.19%   |   81.21%   |   89.47%
Ours: Model 2        |   60.35%   |   83.76%   |   90.91%
Baseline: SENet154   |   49.63%   |   76.51%   |   86.93%
Baseline: ResNet50   |   46.41%   |   74.13%   |   84.35%
Baseline: VGG16      |   43.67%   |   72.39%   |   85.01%
Baseline: AlexNet    |   31.77%   |   64.06%   |   77.42%

Table 6: Final Accuracy of Our Method

Subset              | Model Selection | Accuracy
Beginning Subset    |     Model 1     |  64.16%
Intermediate Subset |     Model 2     |  62.61%
Finishing Subset    |     Model 2     |  58.34%
Average             |                 |  61.70%
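As a quick sanity check on the tables above (my own arithmetic, not code from the paper), the reported averages are plain unweighted means of the three per-subset Top-1 accuracies:

```python
# Reproduce the averages quoted in the text from the per-subset
# Top-1 accuracies (values in %, taken from Tables 2, 3, and 6).
def mean_pct(xs):
    return round(sum(xs) / len(xs), 2)

stage_wise = [64.16, 60.59, 52.83]  # diagonal of Table 2 (m(i) = i)
pattern3   = [60.10, 62.61, 58.34]  # Pattern 3 row of Table 3 (Model 2)
switched   = [64.16, 62.61, 58.34]  # Model 1 for beginning, Model 2 otherwise

print(mean_pct(stage_wise))  # 59.19 (vs. 49.63 for the baseline)
print(mean_pct(pattern3))    # 60.35
print(mean_pct(switched))    # 61.7  (reported as 61.70% in Table 6)
```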
Table 4: Comparison of the Proposed and Baseline Methods

Plan                                    | Subset              | Top-1 acc.
Baseline (SENet154)                     |                     |  49.63%
Stage-Wise Model Learning (m(i) = i)    | Beginning Subset    |  64.16%
                                        | Intermediate Subset |  60.59%
                                        | Finishing Subset    |  52.83%
                                        | Average             |  59.19%
Curriculum Learning Pattern 3 (Model 2) | Beginning Subset    |  60.10%
                                        | Intermediate Subset |  62.61%
                                        | Finishing Subset    |  58.34%
                                        | Average             |  60.35%

We can see that the model trained by the curriculum Pattern 3 achieves the best performance for the target data from the intermediate and finishing stages. We call this model Model 2. It achieves the accuracy shown in the last row of Table 3 for the three test sets. The average accuracy over all the test datasets is 60.35%, which is also higher than the 49.63% of the baseline method.

We compare the accuracy of the baseline method, stage-wise model learning with m(i) = i, and Model 2 for each subset of the test dataset in Table 4. We can see that the average accuracy of both of the two proposed methods is higher than the accuracy of the baseline method. Besides, stage-wise model learning for the beginning stage, i.e., Model 1, achieves higher accuracy for the beginning subset than Model 2, while Model 2 achieves higher accuracy for the intermediate and finishing subsets than Model 1.

5.4 Comparisons with Other Standard Methods
We also compare our methods with other standard object recognition methods, AlexNet, ResNet50, and VGG16, on our dataset. The result is shown in Table 5. In this table, we also show the Top-3 and Top-5 accuracy. We can find that both of our proposed methods improve the performance significantly.

5.5 Stage-Aware Model Selection
Both of the proposed methods (stage-wise model learning and stage-aware curriculum learning) achieve higher accuracy than the baseline method. According to Table 4, we can find that for images in the beginning subset, the stage-wise model learning method (Model 1) achieves a better result than the stage-aware curriculum learning method (Model 2). On the other hand, for images in the intermediate and finishing subsets, the stage-aware curriculum learning method (Model 2) achieves a higher accuracy. Therefore, we should apply Model 1 when we are given target data from the beginning subset, and Model 2 when we are given target data from the intermediate or finishing subset.

By switching between these two methods, we can achieve an average accuracy of 61.70% over the whole test dataset, as shown in Table 6.

6 CONCLUSION
In this paper, we first constructed a Chinese recipe dataset which contains both text and image data for every cooking step. Our dataset will be expanded and updated in the future, and an English translated version will be made available.

We also proposed an image recognition method for ingredients whose appearance changes with the progress of the cooking procedure. Our stage-wise model learning method improves the recognition accuracy for the easily recognizable images in the beginning stages. The curriculum learning method is introduced to deal with the problem that the appearance of the same ingredient becomes more diverse in the intermediate and finishing stages. We can achieve high average accuracy by switching between these two methods depending on the stage of the target data.

We focus on recipe data in this paper, but many other types of objects also constantly change their appearance. Application of our method to such data is an interesting topic for future work. For example, we can apply our method to pictures of the growth process of animals and plants. In this paper, we focused on single-label recognition. However, multiple ingredients often appear in one image. Experiments with multi-label data are also an important remaining issue for future work.

ACKNOWLEDGMENTS
This work was supported by JST CREST Grant Number JPMJCR16E3 and JSPS KAKENHI Grant Numbers 18K11425 and 20H04210, Japan.