Yangyi Chen1, Xingyao Wang1, Manling Li2, Derek Hoiem1, Heng Ji1
1University of Illinois Urbana-Champaign  2Northwestern University
{yangyic3, xingyao6, manling2, dhoiem, hengji}@illinois.edu
arXiv:2311.13258v1 [cs.CV] 22 Nov 2023

Abstract

State-of-the-art vision-language models (VLMs) still have limited performance in
2022). Large-scale multimodal data is required and emphasized to better align the two modalities (Li et al., 2021a; Radford et al., 2021; Jia et al., 2021). The current stage of VLMs moves towards a more unified pre-training goal to perform various tasks without task-specific adaptations (Wang et al., 2021, 2022a; Li et al., 2023; Chen et al., 2023b) by leveraging the capacity of large language models (Yuan et al., 2023a; Touvron et al., 2023). Some research leverages programming languages (Gupta and Kembhavi, 2023; Surís et al., 2023; Yuan et al., 2023b), but lacks perception ability and relies on pre-trained object detectors. Although existing VLMs achieve great success on benchmark evaluations, in-depth diagnostic investigations have exposed their shortcomings in fundamental visual concept identification (Hendricks and Nematzadeh, 2021; Zhao et al., 2022; Yüksekgönül et al., 2022), raising doubts about whether current VLMs overemphasize objects to represent images while ignoring other visual structures, akin to initial-stage VLMs, making them inadequate for structural knowledge extraction. Thus, in this work, we design a training framework to improve VLMs' visual structural knowledge extraction.

2.2 Curriculum Learning

Curriculum learning has proven to be a useful technique for improving the learning process by gradually exposing the learner to more challenging examples (Bengio et al., 2009; Karras et al., 2018; Wang et al., 2020; Su et al., 2021; Platanios et al., 2019; Yu et al., 2022). However, it has been studied mainly in a more limited form: ordering and presenting examples to train a single task (Bengio et al., 2009; Spitkovsky et al., 2010a; Kumar et al., 2010; Jiang et al., 2014b; Graves et al., 2017). In this work, we propose the first task-level, multi-faceted, multimodal approach to curriculum learning for visual structural knowledge learning.

3 ViStruct

We propose ViStruct, which learns a vision-language model that takes images as input and generates a set of structural knowledge, such as the concepts, relations, and events involved in the images (see Figure 2). We first define a code-vision representation (Sec. 3.1), which serves as the output format to decode visual structural knowledge as code blocks. We then propose a curriculum pyramid to guide the training objectives from low-level towards high-level semantics (Sec. 3.2). To train the model, we construct a comprehensive multi-granularity dataset, ViStruct Suite (Sec. 3.3), and optimize the model only on information that describes relevant visual structural information (Sec. 3.4).

3.1 Code-Vision Representation

Visual structural information depicts both visual knowledge elements (such as visual concepts, locations, and attributes) and complex structural connections between them (such as relations and visual events with associated arguments). To encode visual structural knowledge, the first question is how to define an appropriate structured representation. Such a representation is required to not only have robust capabilities for illustrating structures and their intricate interconnections, but also be unified and consistent across multiple levels of semantic granularity. It is also expected to be adaptable and extendable to accommodate future semantic granularities. To address these requirements, we leverage the inherent capabilities of programming languages (PL). We adopt Python as the PL in this paper due to its straightforwardness and popularity, allowing us to effectively harness its structured representation power and well-resourced data for code-vision pretraining. Similar to PL, we
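To make the mapping from visual structures to Python class definitions concrete, here is a minimal sketch. The `Riding` class, its docstring, and its argument names are hypothetical illustrations, not the paper's actual schema: a visual structure maps to a class name, the human-written definition to the docstring, and event arguments to the arguments of class instantiation.

```python
class Riding:
    """An agent rides on a vehicle or animal, typically to move from
    one place to another."""  # human-written definition -> docstring

    def __init__(self, agent: str, vehicle: str, place: str):
        # event arguments -> arguments for class instantiation
        self.agent = agent
        self.vehicle = vehicle
        self.place = place

# One image's structural knowledge, decoded as a code block:
event = Riding(agent="man", vehicle="bicycle", place="street")
```

Under this encoding, generating structural knowledge for an image amounts to generating such a code block, which a Python parser can check for well-formedness.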
Table 1: The alignment of visual structures, visual semantic roles, annotated definitions, and programming language (PL) class definitions. The concrete examples are shown in Figure 3.

Visual Structural Information & Definition | PL class definition
Visual structure | Class name
Event arguments/subject (object) in relation | Arguments for class instantiation
Human-written definition | Docstring

Algorithm 1 The training algorithm described in Sec. 3.4. We omit the warm-up process for brevity.
1: Pθ ▷ a pre-trained vision-language model
2: Dstage = [DCR, DOG, DOA, DOR, DE] ▷ Stage-specific datasets defined in Sec. 3.2
3: DRB = {} ▷ Replay buffer in Sec. 3.2
4: for Ds in Dstage do
5:   Dtrain = Ds ∪ DRB
6:   Finetune Pθ using Dtrain
7:   DRB = DRB ∪ random_subset_of(Ds)
8: end for
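The curriculum loop of Algorithm 1 can be sketched in Python as follows. This is a schematic under stated assumptions: `finetune` is a caller-supplied stand-in for the actual optimization over a stage, and `buffer_ratio` is an illustrative replay-sampling parameter, not a value from the paper.

```python
import random

def train_with_replay(model, stage_datasets, finetune, buffer_ratio=0.1, seed=0):
    """Schematic of Algorithm 1: sequentially fine-tune on each curriculum
    stage, mixing in a replay buffer of samples from earlier stages."""
    rng = random.Random(seed)
    replay_buffer = []                           # D_RB, empty at the start
    for stage_data in stage_datasets:            # for D_s in D_stage
        train_set = stage_data + replay_buffer   # D_train = D_s ∪ D_RB
        model = finetune(model, train_set)       # fine-tune P_theta on D_train
        # retain a random subset of the finished stage for later replay
        k = max(1, int(buffer_ratio * len(stage_data)))
        replay_buffer = replay_buffer + rng.sample(stage_data, k)
    return model, replay_buffer
```

The replay buffer grows as stages complete, so each later stage revisits a sample of earlier, lower-level supervision alongside its own data.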
Table 2: The correspondence between visual structural information and Python code representations.
Table 3: The results (%) of the visual relation detection. * denotes results from Yao et al. (2021).
Table 4: The results (%) of the scene graph classification. * denotes results from Yao et al. (2021).
Table 5: The results (%) of the situation recognition task. * denotes results from Cho et al. (2022).
This paper utilizes the programming language to represent visual structures, with a focus on knowledge representation. However, due to the constrained capability of the base model, direct code generation is not feasible, resulting in reduced performance in downstream tasks. One potential solution is to integrate more advanced vision-language models with more capacity, such as BLIP-2. By doing so, the trained models can effectively generate code representations of visual structures for diverse images in downstream tasks.

Acknowledgement

We thank the anonymous reviewers for their suggestions and comments. This research is based upon work supported by U.S. DARPA ECOLE Program No. #HR00112390060 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2425–2433. IEEE Computer Society.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. FrameNet: A knowledge base for natural language processing. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90. Association for Computational Linguistics.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

María Alejandra Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox. 2022. Open-vocabulary attribute detection. CoRR, abs/2211.12914.

Keyan Chen, Xiaolong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, and Weidi Xie. 2023a. OvarNet: Towards open-vocabulary object attribute recognition. CoRR, abs/2301.09506.

Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2023b. DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081.

Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2023c. Measuring and improving chain-of-thought reasoning in vision-language models. arXiv preprint arXiv:2309.04461.

Junhyeong Cho, Youngseok Yoon, and Suha Kwak. 2022. Collaborative transformers for grounded situation recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 19627–19636. IEEE.

Thilini Cooray, Ngai-Man Cheung, and Wei Lu. 2020. Attention-based context aware reasoning for situation recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 4735–4744. Computer Vision Foundation / IEEE.

Paria Davoodi, Mehdi Ezoji, and Naser Sadeghnejad. 2023. Classification of natural images inspired by the human visual system. Neurocomputing, 518:60–69.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

Shachi Deshpande, Kaiwen Wang, Dhruv Sreenivas, Zheng Li, and Volodymyr Kuleshov. 2022. Deep multi-modal structural equations for causal effect estimation with unstructured proxies. Advances in Neural Information Processing Systems, 35:10931–10944.
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng. 2022. An empirical study of training end-to-end vision-and-language transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18145–18155. IEEE.

Yi Fung, Christopher Thomas, Revanth Gangi Reddy, Sandeep Polisetty, Heng Ji, Shih-Fu Chang, Kathleen McKeown, Mohit Bansal, and Avi Sil. 2021. InfoSurgeon: Cross-media fine-grained information consistency checking for fake news detection. In Proc. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. 2022. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends in Computer Graphics and Vision, 14(3-4):163–352.

Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. 2017. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.

Tanmay Gupta and Aniruddha Kembhavi. 2023. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962.

Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3635–3644, Online. Association for Computational Linguistics.

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. 2021. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 12976–12985. Computer Vision Foundation / IEEE.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. CoRR, abs/2004.00849.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.

Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G. Hauptmann. 2014a. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03-07, 2014, pages 547–556. ACM.

Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G. Hauptmann. 2014b. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 547–556, Orlando, Florida, USA. ACM.

Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, and Aniruddha Kembhavi. 2022. Webly supervised concept expansion for general purpose vision models. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVI, volume 13696 of Lecture Notes in Computer Science, pages 662–681. Springer.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th International Conference on Learning Representations.

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. 2022. The Stack: 3 TB of permissively licensed source code. arXiv preprint arXiv:2211.15533.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.

Ranjay Krishna, Vincent S. Chen, Paroma Varma, Michael S. Bernstein, Christopher Ré, and Li Fei-Fei. 2019. Scene graph prediction with limited labels. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 2580–2590. IEEE.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. CoRR, abs/1602.07332.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

M. Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 11336–11344. AAAI Press.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. CoRR, abs/2301.12597.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022a. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR.

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021a. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019a. VisualBERT: A simple and performant baseline for vision and language. CoRR, abs/1908.03557.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang. 2021b. Unsupervised vision-and-language pre-training without parallel images and captions. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 5339–5350. Association for Computational Linguistics.

Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022b. CLIP-Event: Connecting text and images with event structures. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR 2022).

Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022c. CLIP-Event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16420–16429.

Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020b. Cross-media structured common space for multimedia event extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 2557–2568. Association for Computational Linguistics.

Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, and Sanja Fidler. 2017. Situation recognition with graph neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4183–4192. IEEE Computer Society.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020c. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 121–137. Springer.

A. S. Lillard, M. J. Heise, E. M. Richey, X. Tong, A. Hart, and P. M. Bray. 2017. Montessori preschool elevates and equalizes child outcomes: A longitudinal study. Frontiers in Psychology.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13–23.

Grace Luo, Trevor Darrell, and Anna Rohrbach. 2021. NewsCLIPpings: Automatic generation of out-of-context multimodal media. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6801–6817, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. 2022. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 1384–1403. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pages 1143–1151.

Charulata Patil and Aditya Abhyankar. 2023. Generating comprehensive scene graphs with integrated multiple attribute detection. Machine Vision and Applications, 34(1):11.

Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Cohen, Quan Tran, and Abhinav Shrivastava. 2022. Improving closed and open-vocabulary attribute prediction using transformers. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXV, volume 13685 of Lecture Notes in Computer Science, pages 201–219. Springer.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. 2019. Competence-based curriculum learning for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1162–1172, Minneapolis, Minnesota. Association for Computational Linguistics.

Sarah M. Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. 2020. Grounded situation recognition. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV, volume 12349 of Lecture Notes in Computer Science, pages 314–332. Springer.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. 2021. ImageNet-21K pretraining for the masses.

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, and Aniruddha Kembhavi. 2021. Visual semantic role labeling for video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5589–5600.

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. 2019. Objects365: A large-scale, high-quality dataset for object detection. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 8429–8438. IEEE.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526–1565.

Valentin I. Spitkovsky, Hiyan Alshawi, and Dan Jurafsky. 2010a. Baby steps: How "less is more" in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010b. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 751–759. The Association for Computational Linguistics.

Yixuan Su, Deng Cai, Qingyu Zhou, Zibo Lin, Simon Baker, Yunbo Cao, Shuming Shi, Nigel Collier, and Yan Wang. 2021. Dialogue response selection with hierarchical curriculum learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1740–1751, Online. Association for Computational Linguistics.

Mohammed Suhail and Leonid Sigal. 2019. Mixture-kernel graph attention network for situation recognition. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 10362–10371. IEEE.

Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. ViperGPT: Visual inference via Python execution for reasoning. arXiv preprint arXiv:2303.08128.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2021. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.

Haoyang Wen, Ying Lin, Tuan M. Lai, Xiaoman Pan, Sha Li, Xudong Lin, Ben Zhou, Manling Li, Haoyu Wang, Hongming Zhang, Xiaodong Yu, Alexander Dong, Zhenhailong Wang, Yi R. Fung, Piyush Mishra, Qing Lyu, Dídac Surís, Brian Chen, Susan W. Brown, Martha Palmer, Chris Callison-Burch, Carl Vondrick, Jiawei Han, Dan Roth, Shih-Fu Chang, and Heng Ji. 2021. RESIN: A dockerized schema-guided cross-document cross-lingual cross-media information extraction and event tracking system. In Proc. The 2021 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT 2021) Demo Track.
Table 7: The datasets included in ViStruct Suite. #B_Sample denotes the number of samples in the replay buffer.