Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning
Yasaman Etesam and Ozge Nilay Yalcin and Chuxuan Zhang and Angelica Lim
Simon Fraser University, BC, Canada
yetesam@sfu.ca, oyalcin@sfu.ca, cza152@sfu.ca, angelica@sfu.ca
https://rosielab.github.io/Fast-and-Slow/
picked the top 1, 3, or 5 physical signals with the highest probabilities to generate captions and record the mAP. As the second approach, we picked all the signals whose probability was greater than Mean + std, Mean + 2 × std, Mean + 5 × std, Mean + 7 × std, or Mean + 9 × std as valid physical signals and used them to generate the captions. Note that the mean in this case equals 100 divided by the number of physical signals. As Table 3 shows, the best results are obtained when picking the physical signals above Mean + 9 × std.

3.2 Text to Emotion Inference

Using the caption, we then obtain a set of emotions experienced by the individual in the bounding box using a large language model. To do this, we provide the caption, along with a prompt, to GPT-3.5 (text-davinci-003) with the temperature set to zero.
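The second selection approach above can be sketched as follows; this is a minimal illustration where the signal names and probabilities are hypothetical and `k` is the multiplier on the standard deviation:

```python
import statistics

def select_signals(probs, k):
    """Keep signals whose probability exceeds mean + k * std.

    `probs` maps each physical signal to a probability in percent.
    The mean equals 100 / number of signals, since the percentages sum to 100.
    """
    mean = 100.0 / len(probs)
    std = statistics.pstdev(probs.values())
    threshold = mean + k * std
    return [s for s, p in probs.items() if p > threshold]

# Hypothetical probabilities over four signals (sum to 100).
probs = {"clenched fists": 55.0, "raised arms": 30.0,
         "pursed lips": 10.0, "narrowed eyes": 5.0}
print(select_signals(probs, 1))  # -> ['clenched fists']
```

Larger `k` keeps only signals far above the uniform mean, which is why the average number of retained signals in Table 3 shrinks as the threshold grows.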
Table 2: Failure examples of narrative captions.
1: This person is an adult who is acting in a play at a therapist office. This person is not engaging in conversation. This person is responding with anger by punching.
2: This person is a baby who is punching a person (boxing) at a sporting event stands. This person is chewing on a knuckle. This person is propping the head up with fist.
3: This person is an elderly person who is doing a tennis swing at a rec center. This person has leg muscles tightening. This person is using the body to nudge, push, or block.
Table 3: Results for different approaches to picking physical signals on the 1000 randomly picked people from the validation set. We also report the average number of physical signals and the average number of words in the captions for each approach.

approach          mAP    avg #signals  avg caption len
top1              25.22  1.0           22.016
top3              25.33  3.0           37.24
top5              25.35  5.0           52.506
mean + std        25.43  44.512        330.15
mean + 3 × std    25.53  12.127        104.702
mean + 5 × std    25.32  5.524         56.184
mean + 7 × std    25.47  3.058         37.816
mean + 9 × std    25.54  1.899         28.986

Six labels were requested from GPT-3.5 since the average number of ground truth labels in the validation set was 6.

For generalization and reproducibility purposes, we also tested our narrative captions with three open-source Large Language Models: Baichuan-chat-13B (https://github.com/baichuan-inc/Baichuan-13B), Falcon-Instruct-7B (Almazrouei et al., 2023), and MPT-Chat-7B (Team et al., 2023). However, due to their significantly lower number of parameters compared to GPT-3.5, they did not perform as well. These models achieved mAP scores of 17.12, 17.10, and 17.11, respectively, considerably lower than GPT-3.5's score of 18.58.
The prompt describes the emotions precisely as defined in Kosti et al. (2019) and asks for the six best emotion labels understood from the narrative caption, denoted by <caption>:

"<caption> From suffering (which means psychological or emotional suffering; distressed; anguished), pain (which means physical pain), aversion (which means feeling disgust, dislike, repulsion; feeling hate), disapproval (which means feeling that something is wrong or reprehensible; contempt; hostile), [...] sadness (which means feeling unhappy, sorrow, disappointed, or discouraged), and sympathy (which means state of sharing others' emotions, goals or troubles; supportive; compassionate), pick a set of six most likely labels that this person is feeling at the same time."

4 Experiments

We use the EMOTIC dataset (Kosti et al., 2019), which covers 26 different labels as shown in Table 4. The related emotion recognition task is to provide a list of emotion labels that matches those chosen by annotators. The training set (70%) was annotated by 1 annotator, while the validation (10%) and test (20%) sets were annotated by 5 and 3 annotators, respectively.

4.1 Evaluation Metrics and Baselines

Following previous work in context-based emotion recognition, we use the standard mean Average Precision (mAP) metric. We compare the following methods.
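The mAP metric aggregates a per-label average precision over the 26 emotion labels. A minimal sketch of how it could be computed (illustrative, not the paper's evaluation code):

```python
def average_precision(scores, labels):
    """AP for one emotion label: rank samples by predicted score
    (descending) and average precision at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, ap = 0, 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            ap += hits / i
    return ap / max(hits, 1)

def mean_ap(per_label):
    """mAP: mean of per-label APs, e.g. over the 26 EMOTIC labels.

    `per_label` is a list of (scores, binary_labels) pairs, one per label.
    """
    return sum(average_precision(s, y) for s, y in per_label) / len(per_label)
```

For example, `average_precision([0.9, 0.8, 0.1], [1, 0, 1])` ranks the positive samples first and third, giving (1/1 + 2/3) / 2.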
Table 4: Performance of different models on EMOTIC. Action, NarraCap, and ExpNet captioning use GPT-3 for
inference.
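The caption-to-emotion prompting step used for this inference (Section 3.2) can be sketched as follows. This is an illustrative reconstruction: the definitions dict is abbreviated, the helper names are ours, and the actual call to GPT-3.5 (text-davinci-003, temperature 0) is left as a comment.

```python
# Abbreviated label definitions in the style of the prompt; the full
# prompt enumerates all 26 EMOTIC labels with their definitions.
EMOTION_DEFS = {
    "suffering": "psychological or emotional suffering; distressed; anguished",
    "pain": "physical pain",
    "aversion": "feeling disgust, dislike, repulsion; feeling hate",
    # ... remaining EMOTIC labels ...
    "sympathy": "state of sharing others' emotions, goals or troubles; supportive; compassionate",
}

def build_prompt(caption):
    """Prepend the narrative caption to the label definitions and the request."""
    defs = ", ".join(f"{label} (which means {desc})"
                     for label, desc in EMOTION_DEFS.items())
    return (f"{caption} From {defs}, pick a set of six most likely labels "
            "that this person is feeling at the same time.")

def parse_labels(response, vocab):
    """Extract known emotion labels from the model's free-text answer."""
    return [label for label in vocab if label in response.lower()]

prompt = build_prompt("This person is an adult who is skiing at a ski resort.")
# The prompt would then be sent to GPT-3.5 (text-davinci-003) with
# temperature 0 via the OpenAI completions API; the call is omitted here.
```

Fixing the temperature at zero makes the completion (near-)deterministic, which matters for reproducing the reported mAP numbers.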
EMOTIC Along with the EMOTIC dataset, Kosti et al. (2019) introduced a two-branched network baseline. The first branch is in charge of extracting body-related features and the second branch is in charge of extracting scene-context features. A fusion network then combines these features and estimates the output.

Emoticon Motivated by Frege's principle (Resnik, 1967), Mittal et al. (2020) proposed an approach combining three different interpretations of context. They used pose and face features (context1), background information (context2), and interactions/social dynamics (context3), modeling the social interactions in the images with a depth map. They then concatenate these different features and pass them to a fusion model to generate outputs.

LLaVA Large Language and Vision Assistant (LLaVA) (Liu et al., 2023) is a multi-purpose multimodal model designed by combining CLIP's visual encoder (Radford et al., 2021) and LLaMA's language decoder (Touvron et al., 2023). The model is fine-tuned end-to-end on language-image instruction-following data generated using GPT-4 (OpenAI, 2023).

Random We consider selecting either 6 emotions randomly from all possible labels (Rand) or selecting 6 labels randomly where the weights are determined by the number of times each emotion is repeated in the validation set (Rand(W)).

Majority This baseline selects the top 6 most common emotions in the validation set as the predicted labels for all test images (Maj).

CLIP-only We evaluate the capabilities of the vision-language model CLIP to predict the emotion labels. In this study we pass the image with the emotion labels in the format "The person in the red bounding box is feeling {emotion label}" to produce the probabilities that CLIP gives to each of these sentences. We then pick the 6 labels with the highest probabilities.

Action-only We consider only the what portion of the caption and disregard physical signals, environment, and gender. We pass the set of actions to CLIP and generate a single action caption per image in the form "This person is [activity]". We ...
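The CLIP-only selection reduces to a softmax over the per-sentence similarity scores followed by a top-6 pick. A sketch with hypothetical logits (the CLIP forward pass itself is omitted):

```python
import math

def top6_labels(logits):
    """Given CLIP similarity logits for the sentences
    'The person in the red bounding box is feeling {label}',
    softmax them and return the 6 most probable labels."""
    z = max(logits.values())                      # stabilize the softmax
    exp = {l: math.exp(v - z) for l, v in logits.items()}
    total = sum(exp.values())
    probs = {l: e / total for l, e in exp.items()}
    return sorted(probs, key=probs.get, reverse=True)[:6]

# Hypothetical logits over a subset of the 26 EMOTIC labels.
logits = {"happiness": 3.0, "excitement": 2.5, "peace": 2.0,
          "pleasure": 1.5, "engagement": 1.0, "anticipation": 0.5,
          "fear": 0.0, "surprise": -0.5}
print(top6_labels(logits))
```

Since softmax is monotone in the logits, the top-6 pick here equals a top-6 over the raw similarity scores; the probabilities matter only if calibrated confidences are needed.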
Figure 2: Results including standard error depict significantly higher mAP for the "Fast and Slow" methods (LLaVA, ExpNet + GPT3.5, NarraCaps + GPT3.5) compared to "Fast" only (CLIP).

...perceives the body pose with raised arms and predicts the emotions of 'surprise' and 'fear'. On the other hand, the 'Fast and Slow' frameworks, given the context of skiing, may reason that this body pose is not related to fear or surprise but is more indicative of 'happiness' and 'excitement'. Similarly, in Table 1.5, CLIP predicts embarrassment for the woman at the beach. After investigation, it appears ...
The research team conducted all the coding and emotion annotation for the images in this study. The images were sourced from the EMOTIC Dataset, which consists of images obtained from the internet and public datasets like MSCOCO and ADE20K. To access the EMOTIC dataset, permission must be obtained from the owners. The research team has no affiliation with the individuals depicted in the images. The images show individuals experiencing a range of emotions, such as grief at a funeral, participation in protests, or being in war-related scenarios. However, the dataset does not contain images that surpass the explicitness or intensity typically encountered in news media.

It is crucial to note that there might be biases affecting this work, such as cultural biases from annotators and biases caused by the languages that GPT-3.5 and CLIP are trained on. Furthermore, the source Emotion Thesaurus book is written by North American authors, and similar writing guides in other languages may lead to differing results.

The implementation of this work utilized pre-trained models like GPT-3.5, CLIP, and LLaVA. Although no additional training was conducted in this study, it is crucial to acknowledge the significant carbon cost associated with training such models, which should not be underestimated.

As a final note, it is worth mentioning that the authors strongly oppose the utilization of any software ...

References

Manuele Barraco, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, and Rita Cucchiara. 2022. The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4662–4670.

Lisa Feldman Barrett. 2017. How Emotions Are Made: The Secret Life of the Brain. Pan Macmillan.

Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1):1–68.

Lisa Feldman Barrett, Batja Mesquita, and Maria Gendron. 2011. Context in emotion perception. Current Directions in Psychological Science, 20(5):286–290.

Sigal G Barsade and Donald E Gibson. 1998. Group emotion: A view from top and bottom.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439.

Robert James R Blair. 2005. Responding to the emotions of others: Dissociating forms of empathy through the study of typical and psychiatric populations. Consciousness and Cognition, 14(4):698–718.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. 2022. VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18030–18040.

Yejin Choi. 2022. The curious case of commonsense intelligence. Daedalus, 151(2):139–155.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, and Rita Cucchiara. 2021. Universal captioner: Inducing content-style separation in vision-and-language model training. arXiv preprint arXiv:2111.12727.

Willams de Lima Costa, Estefania Talavera, Lucas Silva Figueiredo, and Veronica Teichrieb. 2023. High-level context representation for emotion recognition in images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 326–334.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Abhinav Dhall, Jyoti Joshi, Ibrahim Radwan, and Roland Goecke. 2013. Finding happiest moments in a social context. In Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part II 11, pages 613–626. Springer.

Paul Ekman and Wallace V Friesen. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124.

Paul Ekman and Wallace V Friesen. 1976. Measuring facial movement. Environmental Psychology and Nonverbal Behavior, 1:56–75.

Jonathan St BT Evans. 2008. Dual-processing accounts of reasoning, judgment, and social cognition. Annual Review of Psychology, 59:255–278.

Maria Gendron, Kristen A Lindquist, Lawrence Barsalou, and Lisa Feldman Barrett. 2012. Emotion words shape emotion percepts. Emotion, 12(2):314.

Jia Cheng Hu, Roberto Cavicchioli, and Alessandro Capotondi. 2022. ExpansionNet v2: Block static expansion in fast end to end training for image captioning. arXiv preprint arXiv:2208.06551.

Yibo Huang, Hongqian Wen, Linbo Qing, Rulong Jin, and Leiming Xiao. 2021. Emotion recognition based on body and context fusion in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3609–3617.

Daniel Kahneman, Shane Frederick, et al. 2002. Representativeness revisited: Attribute substitution in intuitive judgment. Heuristics and Biases: The Psychology of Intuitive Judgment, 49(49-81):74.

Suzanne Keen. 2015. Narrative emotions. Narrative Form: Revised and Expanded Second Edition, pages 152–161.

Byoung Chul Ko. 2018. A brief review of facial emotion recognition based on visual information. Sensors, 18(2):401.

Ronak Kosti, Jose M Alvarez, Adria Recasens, and Agata Lapedriza. 2019. Context based emotion recognition using EMOTIC dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(11):2755–2766.

Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE.

Nhat Le, Khanh Nguyen, Anh Nguyen, and Bac Le. 2022. Global-local attention for emotion recognition. Neural Computing and Applications, 34(24):21625–21639.

Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn. 2019. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10143–10152.

Xiang Lorraine Li, Adhiguna Kuncoro, Jordan Hoffmann, Cyprien de Masson d'Autume, Phil Blunsom, and Aida Nematzadeh. 2022. A systematic investigation of commonsense knowledge in large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11838–11855.

Matthew D Lieberman. 2007. Social cognitive neuroscience: a review of core processes. Annual Review of Psychology, 58:259–289.

Matthew D Lieberman, Naomi I Eisenberger, Molly J Crockett, Sabrina M Tom, Jennifer H Pfeifer, and Baldwin M Way. 2007. Putting feelings into words. Psychological Science, 18(5):421–428.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.

Kristen A Lindquist, Lisa Feldman Barrett, Eliza Bliss-Moreau, and James A Russell. 2006. Language and the perception of emotion. Emotion, 6(1):125.

Kristen A Lindquist and Maria Gendron. 2013. What's in a word? Language constructs emotion perception. Emotion Review, 5(1):66–71.

Kristen A Lindquist, Jennifer K MacCormack, and Holly Shablack. 2015. The role of language in emotion: Predictions from psychological constructionism. Frontiers in Psychology, 6:444.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485.

Rui Mao, Qian Liu, Kai He, Wei Li, and Erik Cambria. 2022. The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection. IEEE Transactions on Affective Computing.

Alexander Mathews, Lexing Xie, and Xuming He. 2016. SentiCap: Generating image descriptions with sentiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Alexander Mathews, Lexing Xie, and Xuming He. 2018. SemStyle: Learning to generate stylised image captions using unaligned text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Wafa Mellouk and Wahida Handouzi. 2020. Facial emotion recognition using deep learning: review and insights. Procedia Computer Science, 175:689–694.

Trisha Mittal, Pooja Guhan, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. 2020. EmotiCon: Context-aware multimodal emotion recognition using Frege's principle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14234–14243.

Akshita Mittel and Shashank Tripathi. 2023. PERI: Part aware emotion recognition in the wild. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VI, pages 76–92. Springer.

Aleksandar S Mojasevic. 2016. Daniel Kahneman: Thinking, fast and slow.

Pansy Nandwani and Rupali Verma. 2021. A review on sentiment analysis and emotion detection from text. Social Network Analysis and Mining, 11(1):81.

Desmond C Ong, Jamil Zaki, and Noah D Goodman. 2019. Computational models of emotion inference in theory of mind: A review and roadmap. Topics in Cognitive Science, 11(2):338–357.

OpenAI. 2023. GPT-4 technical report.

Maja Pantic and Leon JM Rothkrantz. 2000. Expert system for automatic analysis of facial expressions. Image and Vision Computing, 18(11):881–905.

Rosalind W Picard. 2000. Affective Computing. MIT Press.

Becca Puglisi and Angela Ackerman. 2016a. The Rural Setting Thesaurus: A Writer's Guide to Personal and Natural Places, volume 4. JADD Publishing.

Becca Puglisi and Angela Ackerman. 2016b. The Urban Setting Thesaurus: A Writer's Guide to City Spaces, volume 5. JADD Publishing.

Becca Puglisi and Angela Ackerman. 2019. The Emotion Thesaurus: A Writer's Guide to Character Expression, volume 1. JADD Publishing.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Renate LEP Reniers, Rhiannon Corcoran, Richard Drake, Nick M Shryane, and Birgit A Völlm. 2011. The QCAE: A questionnaire of cognitive and affective empathy. Journal of Personality Assessment, 93(1):84–95.

Michael David Resnik. 1967. The context principle in Frege's philosophy. Philosophy and Phenomenological Research, 27(3):356–365.

Maarten Sap, Ronan LeBras, Daniel Fried, and Yejin Choi. 2022. Neural theory-of-mind? On the limits of social intelligence in large LMs. arXiv preprint arXiv:2210.13312.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.

Konrad Schindler, Luc Van Gool, and Beatrice De Gelder. 2008. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 21(9):1238–1246.

Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. 2020. A short note on the Kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864.

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. 2022. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):539–559.

Akihiro Tanaka, Ai Koizumi, Hisato Imai, Saori Hiramatsu, Eriko Hiramoto, and Beatrice De Gelder. 2010. I feel your voice: Cultural differences in the multisensory perception of emotion. Psychological Science, 21(9):1259–1262.

MosaicML NLP Team et al. 2023. Introducing MPT-30B: Raising the bar for open-source foundation models.

Selvarajah Thuseethan, Sutharshan Rajasegarar, and John Yearwood. 2021. Boosting emotion recognition in context using non-target subject information. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.

Sicheng Zhao, Hongxun Yao, Yue Gao, Rongrong Ji, and Guiguang Ding. 2016. Continuous probability distribution prediction of image emotions via multi-task shared sparse regression. IEEE Transactions on Multimedia, 19(3):632–645.

Sicheng Zhao, Xingxu Yao, Jufeng Yang, Guoli Jia, Guiguang Ding, Tat-Seng Chua, Bjoern W Schuller, and Kurt Keutzer. 2021. Affective image content analysis: Two decades review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6729–6751.

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004.

A Appendix
Figure 3: Annotation interface for human annotators. The annotator follows this procedure: 1. indicate the bounding box for the subject they are annotating; 2. select the perceived age and social identity of the subject; 3. go over the lists of physical signals for eyes, mouth, face, hands, torso, and feet, and check the ones they see fit; 4. choose the correct bounding box of a subject who has an interaction with the subject of interest; 5. select the perceived sex, and describe the inferred relation and interaction between the two subjects; 6. describe the subject's physical surroundings. The generated captions are displayed for the annotator to review easily.
Table 8: Example listing of physical signals from (Puglisi and Ackerman, 2019). The full list is not included due to copyright.