
Overview:

Databricks has released Dolly 2.0, an open-source, instruction-following LLM. It has been fine-tuned on a transparent, freely available dataset that is also licensed for commercial use.

What is Dolly 2.0?

● Dolly 2.0 is an open-source, instruction-following LLM.

● It is trained on a human-generated instruction dataset licensed for both research and commercial use.

● The model is based on the EleutherAI Pythia model family and has 12 billion parameters.

● Databricks fine-tuned EleutherAI's pythia-12b on their instruction dataset to produce Dolly 2.0, which they claim performs better than the original Dolly, trained on the synthetic Alpaca dataset.

● The model requires significant hardware to run due to its size (see the loading sketch after this list).
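The weights are published on the Hugging Face Hub as databricks/dolly-v2-12b. Below is a minimal sketch of loading the model for generation with the transformers library; the memory figure in the comments is a rough assumption, and a smaller Pythia-based variant or quantization may be needed on modest hardware.

```python
# Minimal sketch: loading Dolly 2.0 for text generation with Hugging Face
# transformers. A 12B-parameter model needs roughly 24 GB+ of GPU memory
# even in bfloat16 (an estimate), so plan hardware accordingly.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,  # halves memory versus float32
    trust_remote_code=True,      # the repo ships a custom instruct pipeline
    device_map="auto",           # spread layers across available GPUs
)

print(generate("Explain what instruction tuning is in two sentences."))
```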

About the Dataset

● The dataset, called databricks-dolly-15k, contains 15,000 high-quality, human-generated prompt-response pairs designed specifically for instruction tuning large language models (see the loading sketch after this list).

● The dataset contains natural and expressive training records that cover a wide range of behaviours, from brainstorming and content generation to information extraction and summarization.

● The dataset was generated by professionals, is high in quality, and contains long answers for most tasks.

● The dataset is released under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) licence, which means anyone can use, modify, or extend it for any purpose, including commercial applications.
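As a quick sketch, assuming the dataset is hosted on the Hugging Face Hub as databricks/databricks-dolly-15k with instruction, context, response, and category fields, it can be inspected like this:

```python
# Minimal sketch: inspecting databricks-dolly-15k with the Hugging Face
# datasets library. The repository id and field names are assumptions
# based on the public release.
from collections import Counter

from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Each record is a human-written prompt-response pair.
print(dolly[0]["instruction"])
print(dolly[0]["response"])

# Tally the behaviour categories (brainstorming, summarization, ...).
print(Counter(dolly["category"]))
```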

Commercial Use

● Most existing instruction-following models prohibit commercial use, so Databricks created this new dataset in order to produce an open-source model that can be used commercially.

● Databricks is open sourcing the entirety of Dolly 2.0, including the training code, dataset, and model weights, making it suitable for commercial use.

● Any organisation can create and customise powerful LLMs that can talk to people, without paying for API access or sharing data with third parties (see the prompt-formatting sketch after this list).
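Customising a model on this data means rendering each record into a training string. The template below is a hypothetical illustration; the exact prompt format Databricks used for fine-tuning may differ.

```python
# Hypothetical sketch: turning a dolly-15k-style record into a single
# training string for instruction fine-tuning. The "### Instruction:" /
# "### Response:" template is an illustrative assumption, not the
# confirmed Databricks format.
def format_record(record: dict) -> str:
    context = f"\nContext: {record['context']}" if record["context"] else ""
    return (
        "### Instruction:\n"
        f"{record['instruction']}{context}\n\n"
        "### Response:\n"
        f"{record['response']}"
    )

# Usage with a made-up record in the dataset's schema.
example = {
    "instruction": "Summarize the release of Dolly 2.0 in one sentence.",
    "context": "",
    "response": "Databricks released Dolly 2.0, an open-source "
                "instruction-following LLM licensed for commercial use.",
    "category": "summarization",
}
print(format_record(example))
```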
A Few Drawbacks

● As per the information available on the GitHub page for the dataset, Wikipedia was used as a reference while developing prompts and responses. This means that any biases present in Wikipedia could potentially be reflected in the final dataset.

● Additionally, some of the individuals involved in creating the dataset were not native English speakers, which could result in inconsistencies.

● Furthermore, the demographic composition of the team responsible for the dataset's creation could also contribute to biases specific to their backgrounds being present in the dataset.

Important Download Links

● Link to Dolly 2.0 Demo:
● Link to Dataset:
● Link to Alpaca-compatible dataset:
● Link to blog:

Conclusion (Encouraging Innovation)

Open-source datasets and models encourage commentary, research, and innovation that will help ensure everyone benefits from advances in artificial intelligence technology. Databricks hopes that Dolly and the open-source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models. Dolly 2.0 is not meant to be state of the art, but rather a good model for following instructions.
