Handling SlimOrca Dataset
When downloaded directly from Hugging Face, the dataset is a .jsonl file in
which each line is one conversation stored as a JSON object (a dict). Each dict
has a single feature, ‘conversations’, which contains an optional instruction
from the system, a prompt from the user (human), and an output from gpt.
Example:
{
    "conversations": [
        {
            "from": "system",
            "value": "You are an AI assistant. You will be given a
        },
        {
            "from": "human",
            "value": "Write an article based on this \"A man has b
            "weight": 0.0
        },
        {
            "from": "gpt",
            "value": "Title: Tragedy Strikes in Sydney: Victims Sta
            "weight": 1.0
        }
    ]
}
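A record of this shape can be inspected with the standard library alone. The following is a minimal sketch, assuming each line of the file is one JSON object as described above. The sample line is a shortened, made-up stand-in rather than an actual dataset row, and summarize_record is a hypothetical helper name.

```python
import json

# Shortened, made-up stand-in for one line of the .jsonl file (not real dataset text).
sample_line = json.dumps({
    "conversations": [
        {"from": "system", "value": "You are an AI assistant."},
        {"from": "human", "value": "Say hello.", "weight": 0.0},
        {"from": "gpt", "value": "Hello!", "weight": 1.0},
    ]
})


def summarize_record(line):
    """Return (role, weight) pairs for one JSONL line; weight is None if absent."""
    record = json.loads(line)
    return [(turn["from"], turn.get("weight")) for turn in record["conversations"]]


print(summarize_record(sample_line))
```

Using turn.get("weight") rather than turn["weight"] keeps the sketch from failing on turns that carry no weight field, as the system turn in the example above appears not to.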
If loaded with load_dataset() from the datasets library, the result is a
DatasetDict with the same structure (a ‘train’ split only, with the single
feature ‘conversations’).
⇒ To use this dataset, we need to parse it into a usable form (split each
conversation into its system, human, and gpt parts).
import json
from tqdm import tqdm

DATA_PATH = 'path/to/SlimOrca/oo-labeled_correct.gpt4.sharegpt.j
SAVE_PATH = 'path/to/SlimOrca/data.jsonl'

if __name__ == '__main__':
    f = open(DATA_PATH)
    progress_bar = tqdm(desc='Processing chunks')
    for line_id, line in enumerate(f):
        item = json.loads(line)
        convert_item_to_training_data(item, SAVE_PATH, line_id >
        progress_bar.update(1)
        progress_bar.set_description(f'Processing chunks (items=
    f.close()
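The script above calls convert_item_to_training_data without showing its definition. One possible shape for such a helper is sketched below. The signature, the meaning of the third argument (treated here as an append flag), and the output layout (one flat JSON object per line, keyed by role) are all assumptions, not the author's implementation.

```python
import json


def convert_item_to_training_data(item, save_path, append=True):
    """Hypothetical helper: flatten one record and write it as a JSON line.

    Assumption: `append` selects between adding to the file and overwriting it.
    """
    # Map each speaker to its text; "system" is optional, so it defaults to None.
    row = {"system": None, "human": None, "gpt": None}
    for turn in item["conversations"]:
        row[turn["from"]] = turn["value"]

    mode = "a" if append else "w"
    with open(save_path, mode, encoding="utf-8") as out:
        out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row
```

Under this assumption, a caller could pass append=False for the first record so that a stale output file is overwritten, and append=True for every record after it.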
In this case, I chose to process the Dataset by iterating through each
conversation, appending a dict form of it (role: text) to a list, and then
converting the list to a pd.DataFrame, which is easy to work with.
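import pandas as pd

def parse_dataset(dataset):
    # Build one {role: text} dict per conversation, then one DataFrame row per dict.
    dialogue = []
    for conv in dataset:
        # The end of this line is cut off in the source; iterating over
        # conv['conversations'] is an assumption based on the data layout above.
        dialogue.append({chat['from']: chat['value'] for chat in conv['conversations']})
    return pd.DataFrame(dialogue)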
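To illustrate the role-to-text mapping, the sketch below applies the same kind of dict comprehension to two made-up conversations using only the standard library. It assumes each conversation is a dict with a ‘conversations’ list, as in the example record. to_role_dict is a hypothetical name and the sample conversations are placeholders, not dataset rows.

```python
def to_role_dict(conv):
    """Map each speaker ('system', 'human', 'gpt') to its text for one conversation."""
    return {chat["from"]: chat["value"] for chat in conv["conversations"]}


# Made-up placeholder conversations; the second one has no system message.
samples = [
    {"conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Say hello."},
        {"from": "gpt", "value": "Hello!"},
    ]},
    {"conversations": [
        {"from": "human", "value": "Name a color."},
        {"from": "gpt", "value": "Blue."},
    ]},
]

rows = [to_role_dict(conv) for conv in samples]
for row in rows:
    print(sorted(row))  # the roles present in this conversation
```

Because the system message is optional, some dicts have no ‘system’ key. When a list of such dicts is passed to pd.DataFrame, pandas fills the missing entries with NaN, so the resulting ‘system’ column may need a fillna() or a null check before use. Note also that a comprehension keyed by role keeps only the last message per role, which is fine for single-turn records but would drop earlier turns in a multi-turn conversation.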