Are you thirsty for social chitchat data?
We give you SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Hyunwoo Kim · Published in AI2 Blog · 6 min read

Although social conversations occur every day and everywhere around you,
they are often not recorded as data. And when they are (e.g., text messages),
research use is rightly restricted due to privacy and legal concerns. As a
result, collecting high-quality, everyday social conversations on a large scale
has long been recognized as a difficult task. It's like searching for
drinking water in the sea — it’s there, but not in a usable form, leaving many
with a thirst for large-scale, quality social chitchat data.
Image credit: Bing Image Creator

In this blog post, we introduce SODA, the first million-scale high-quality social
chitchat dataset that will quench this thirst. What’s even better? Our recent
paper, “SODA: Million-scale Dialogue Distillation with Social Commonsense
Contextualization”, which has been accepted to EMNLP as an oral
presentation, shows how anyone can obtain substantially larger and more
diverse social chitchat data with better quality.

Too good to be true. Large, diverse, and high-quality?

Yes, we can achieve all three by leveraging the power of large language
models (LLMs) and symbolic commonsense knowledge graphs. More
concretely, we use OpenAI's InstructGPT and Atomic10x to distill social
conversations. However, you can also use open-source LLMs, such as
Llama-2.

Isn’t it obvious that you can generate dialogues with LLMs?

Yes, it is very obvious. However, if you're trying to generate a vast amount of
coherent dialogues spanning an exceptionally broad range of everyday
scenarios, the problem becomes challenging. Repeatedly tasking LLMs with
the prompt, “Generate 10 coherent dialogues with diverse topics, but don’t let
the dialogue topics overlap,” won’t cover all the various everyday scenarios.
This is where the symbolic commonsense knowledge graph comes in to save
the day.

What’s a symbolic commonsense knowledge graph?

A symbolic commonsense knowledge graph is a way to organize general
human knowledge about everyday life in a format that both computers and
people can understand. It’s made up of “nodes” which are like points
representing different ideas or things, and “edges” which are the lines that
connect these points to show how they’re related. The connections make
something called a “triple” — for instance, you could have one point for
“PersonX moves a step closer to the goal” and another for “take the first
step,” and they would be connected by a relationship called “xNeed.” This
makes a triple: (PersonX moves a step closer to the goal, xNeed, take the first
step). By linking many of these triples together, the graph creates a big web
of common knowledge that’s easy to navigate for AI.
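To make the structure concrete, here is a minimal Python sketch of how such a triple could be represented. The class and field names are hypothetical, chosen only for illustration; they are not taken from the paper or its codebase.

```python
from dataclasses import dataclass


@dataclass
class CommonsenseTriple:
    """One (head, relation, tail) edge from a commonsense knowledge graph."""
    head: str      # an event node, e.g. "PersonX moves a step closer to the goal"
    relation: str  # the edge type, e.g. "xNeed"
    tail: str      # the connected node, e.g. "take the first step"


# The example triple from the text:
triple = CommonsenseTriple(
    head="PersonX moves a step closer to the goal",
    relation="xNeed",
    tail="take the first step",
)
print((triple.head, triple.relation, triple.tail))
```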

So how do you go from knowledge graphs to social dialogues?


We obtain dialogues by contextualizing (i.e., adding more context
information) the commonsense triples in a step-by-step manner. These
triples are the distilled essence of our social experiences, abstracted into
narratives and ultimately crystallized into concise pieces of knowledge. By
leveraging LLMs, we reverse this abstraction process by taking the
commonsense knowledge triples and expanding them into short narratives
and conversations that might have originally contained that knowledge.

An illustration of our distillation framework CO₃ for obtaining rich social dialogues from a symbolic
commonsense knowledge graph.

(1) First, we convert the symbolic knowledge triple into sentence form in a
rule-based manner. For example, the commonsense knowledge in the above
figure is converted to “Madeleine took the first step. Madeleine moves a step
closer to the goal.” (2) Next, we use the LLM to generate a short narrative
based on the sentence-form commonsense knowledge. We also use the LLM
to infer the likely conversation participants (e.g., Madeleine and her coach). (3)
Finally, with the conversation participants and narrative as input, we prompt
the LLM to generate a full, multi-turn conversation.
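For illustration, the three steps could be wired together roughly as follows. This is only a sketch: `generate()` stands in for whichever instruction-following LLM you use (InstructGPT, Llama-2, etc.), and the actual prompts and rule-based templates used in the paper differ.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to an instruction-following LLM; not a real API."""
    raise NotImplementedError


def triple_to_sentences(head: str, relation: str, tail: str, name: str = "Madeleine") -> str:
    # (1) Toy rule-based conversion of a triple into sentence form.
    #     A real converter has one template per relation type; this one only
    #     covers xNeed, reproducing the example from the figure.
    assert relation == "xNeed"
    past_tail = tail.replace("take", "took", 1)   # "took the first step"
    event = head.replace("PersonX", name)         # "Madeleine moves a step closer to the goal"
    return f"{name} {past_tail}. {event}."


def contextualize(head: str, relation: str, tail: str) -> str:
    sentences = triple_to_sentences(head, relation, tail)
    # (2) Expand the sentence-form knowledge into a short narrative
    #     and infer the likely conversation participants.
    narrative = generate(f"Rewrite this as a two- or three-sentence story:\n{sentences}")
    speakers = generate(f"{narrative}\nWho are the two people most likely talking about this?")
    # (3) Prompt the LLM for a full, multi-turn conversation grounded in the narrative.
    return generate(f"{narrative}\nWrite a long conversation between {speakers} about the situation above.")
```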

The final dataset of SODA comprises 1.5 million conversations with more than
11 million utterances, making it the largest publicly available social chitchat
dataset.

How good is the quality of SODA?

To assess the relative quality of the corpus, we conducted head-to-head
human evaluations comparing SODA with two widely used open-domain
dialogue datasets: DailyDialog and BlendedSkillTalk. We randomly sampled 300
dialogues from each dataset and evaluated them according to six criteria: (1)
natural flow, (2) context dependence, (3) topic consistency, (4) speaker
consistency, (5) specificity, and (6) overall. For each criterion, judges were
asked to select the better dialogue of the two.
Despite being fully machine-generated, SODA is judged by human raters to be
better in quality than both DailyDialog and BlendedSkillTalk across all axes
by a large margin, with the single exception of context dependence when
compared with BlendedSkillTalk. In particular, evaluators rated the flow of
SODA as significantly more natural than that of the other datasets, which were
collected through crowdsourcing.
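For concreteness, here is a minimal sketch of how such pairwise judgments can be aggregated into per-criterion win rates. The data and variable names are illustrative only; this is not the paper's analysis code.

```python
from collections import Counter

# Each judgment: (criterion, winner), one entry per judge, per dialogue pair, per criterion.
judgments = [
    ("natural flow", "SODA"),
    ("natural flow", "DailyDialog"),
    ("specificity", "SODA"),
    # ...
]

wins = Counter(judgments)
totals = Counter(criterion for criterion, _ in judgments)

for criterion in totals:
    rate = wins[(criterion, "SODA")] / totals[criterion]
    print(f"{criterion}: SODA preferred in {rate:.0%} of comparisons")
```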

Any other characteristics of SODA?

SODA also contains rich emotion-related information. Since the commonsense
knowledge in Atomic10x includes people's emotional reactions to events
(i.e., the xReact triples), conversations with rich emotional content are also
included in SODA. In total, SODA includes 385K conversations generated
from 1.7K unique emotion descriptions in the xReact triples. As a result, it
contains significantly more descriptive emotion labels (i.e., the Tail node) than
other datasets, which have a fixed number of emotion classes. Furthermore, because
we construct conversations in a bottom-up fashion from those emotional
reactions, we know which speaker in the conversation is experiencing the
emotion (i.e., PersonX) and what caused it (i.e., the Head node).
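Because each conversation keeps a reference to its source triple, the emotion-grounded subset can be pulled out directly. Below is a rough sketch using the Hugging Face datasets library; the column names ("relation", "tail", "speakers") are assumptions based on the description above and may differ slightly in the released dataset.

```python
from datasets import load_dataset  # pip install datasets

soda = load_dataset("allenai/soda", split="train")

# Keep only conversations distilled from emotional-reaction (xReact) triples.
# NOTE: the column names used here are assumptions; check the dataset card.
emotional = soda.filter(lambda ex: ex["relation"] == "xReact")

example = emotional[0]
print(example["tail"])      # free-form emotion description (the Tail node)
print(example["speakers"])  # conversation participants; PersonX is the one feeling the emotion
```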

How strong would the model be if trained on SODA?


We compared COSMO, our 3B-parameter model trained on SODA, with four other
conversational agents (i.e., BlenderBot, GODEL, Koala, Vicuna) on
DailyDialog, which is an out-of-domain dataset for all models. We performed
head-to-head comparisons between two responses, each from a different
model. We randomly sampled 100 test examples and asked three
human judges on Amazon Mechanical Turk to select the better response
of the two in terms of four distinct criteria: (1) naturalness, (2)
consistency, (3) specificity, and (4) overall.

Although COSMO is trained on a significantly smaller amount of data (1.5M
dialogues vs. 1.5B Reddit comments or 551M Reddit threads) and is itself significantly
smaller (3B vs. 7B parameters), it outperforms all other existing models by a significant
margin across all aspects. The most surprising part is that human judges
prefer COSMO’s responses even over the original ground truth responses in
the dataset. This suggests that dialogue models trained on SODA can lead to
high generalizability and naturalness, even for unseen conversations.

Conclusion

SODA is not only orders of magnitude larger than existing popular dialogue
datasets; it is also perceived to be significantly better than them across
multiple aspects (e.g., naturalness, specificity, consistency). Furthermore, our
distillation framework offers a cost- and time-efficient method to collect rich
social chitchat data. With SODA, we hope to alleviate the data scarcity issue
of social chitchat.

Please check out our resources if you’re interested in more details:

Paper on Semantic Scholar

Code for making your own SODA: https://github.com/skywalker023/sodaverse

The SODA dataset and COSMO are publicly available under the permissive
CC-BY-4.0 license on the Hugging Face Hub:

SODA: https://huggingface.co/datasets/allenai/soda

COSMO: https://huggingface.co/allenai/cosmo-xl
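For instance, loading COSMO with Hugging Face transformers could look roughly like this. This is a sketch assuming a standard seq2seq interface; the exact way COSMO expects the situation narrative and dialogue history to be concatenated is documented on the model card, and the simple concatenation below is only an assumption.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/cosmo-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/cosmo-xl")

# Situation narrative plus dialogue history; the plain concatenation here is an
# assumption, so check the model card for the exact expected input format.
context = (
    "Madeleine took the first step toward her goal. "
    "Madeleine: I finally signed up for the marathon!"
)

inputs = tokenizer(context, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```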

Check out our current openings, follow @allen_ai on Twitter, and subscribe to
the AI2 Newsletter to stay current on news and research coming out of AI2.