Although social conversations occur every day and everywhere around you,
they are often not recorded as data. And when they are (e.g., text messages),
research use is rightly restricted due to privacy and legal concerns. As a
result, collecting high-quality, everyday social conversations on a large scale
has long been recognized as a difficult task. It's like searching for drinking water in the sea: it's there, but not in a usable form, leaving many with a thirst for large-scale, high-quality social chitchat data.
Image credit: Bing Image Creator
In this blog post, we introduce SODA, the first million-scale, high-quality social chitchat dataset, and the one that will quench this thirst. What's even better? Our recent paper, “SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization”, accepted to EMNLP as an oral presentation, shows how anyone can obtain social chitchat data that is substantially larger, more diverse, and of higher quality than existing datasets.
Yes, we can achieve all three by leveraging the power of large language
models (LLMs) and symbolic commonsense knowledge graphs. More
concretely, we use OpenAI's InstructGPT and the ATOMIC10x knowledge graph to distill social conversations. You can also swap in an open-source LLM instead, such as Llama-2.
An illustration of our distillation framework CO3 for obtaining rich social dialogues from a symbolic commonsense knowledge graph.
Our framework works in three steps:

1. First, we convert a symbolic knowledge triple into sentence form in a rule-based manner. For example, the commonsense knowledge in the figure above is converted to “Madeleine took the first step. Madeleine moves a step closer to the goal.”
2. Next, we prompt the LLM to generate a short narrative based on the sentence-form commonsense knowledge, and also to infer the likely conversation participants (e.g., Madeleine and her coach).
3. Finally, with the conversation participants and the narrative as input, we prompt the LLM to generate a full, multi-turn conversation.
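To make the three steps concrete, here is a minimal Python sketch of the flow, assuming the openai client library and an API key. The prompts, the llm and triple_to_sentences helpers, and the example xEffect triple are illustrative assumptions, not the exact ones used in the paper.

```python
# A minimal sketch of the three-step distillation flow. The prompts, helper
# names, and the example xEffect triple are illustrative assumptions, not
# the exact ones used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    """One LLM call; the paper used InstructGPT, but any capable model works."""
    out = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    return out.choices[0].message.content.strip()

# Step 1: rule-based conversion of a commonsense triple into sentence form.
def triple_to_sentences(head: str, relation: str, tail: str) -> str:
    if relation == "xEffect":  # only one relation template shown for brevity
        return f"{head}. {tail}."
    raise ValueError(f"no template for relation {relation!r}")

sentences = triple_to_sentences(
    "Madeleine took the first step",
    "xEffect",
    "Madeleine moves a step closer to the goal",
)

# Step 2: expand the sentences into a short narrative and infer the speakers.
narrative = llm(f"Rewrite this as a short two-sentence story:\n{sentences}")
speakers = llm(
    f"Story: {narrative}\n"
    "Name the two people most likely to have a conversation about this story."
)

# Step 3: generate the full multi-turn conversation grounded in the narrative.
dialogue = llm(
    f"Story: {narrative}\n"
    f"Write a long, natural conversation between {speakers} about the story, "
    "with one utterance per line, prefixed by the speaker's name."
)
print(dialogue)
```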
The final SODA dataset comprises 1.5 million conversations with more than
11 million utterances, making it the largest publicly available social chitchat
dataset.
Conclusion
SODA is not only orders of magnitude larger than existing popular dialogue
datasets; it is also perceived to be significantly better than them across
multiple aspects (e.g., naturalness, specificity, consistency). Furthermore, our
distillation framework offers a cost- and time-efficient way to collect rich social chitchat data. With SODA, we hope to alleviate the scarcity of social chitchat data.
The SODA dataset and COSMO, our conversation model trained on SODA, are publicly available under the permissive CC BY 4.0 license on the HuggingFace hub:
SODA: https://huggingface.co/datasets/allenai/soda
COSMO: https://huggingface.co/allenai/cosmo-xl
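Both load with the standard HuggingFace APIs. Here is a quick sketch; the field names follow the SODA dataset card, and you should check the COSMO model card for its exact expected input format.

```python
# Load the SODA dataset and the COSMO model from the HuggingFace hub.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

soda = load_dataset("allenai/soda")
example = soda["train"][0]
print(example["narrative"])  # the short story grounding the conversation
for speaker, utterance in zip(example["speakers"], example["dialogue"]):
    print(f"{speaker}: {utterance}")

# COSMO is a T5-based seq2seq model; see the model card for how to format
# the situation narrative, instruction, and dialogue history as input.
tokenizer = AutoTokenizer.from_pretrained("allenai/cosmo-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/cosmo-xl")
inputs = tokenizer("Hi! How was your day?", return_tensors="pt")
reply_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```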
Check out our current openings, follow @allen_ai on Twitter, and subscribe to
the AI2 Newsletter to stay current on news and research coming out of AI2.