
Data Handling

The SlimOrca dataset format:

If you download it directly from Hugging Face, the dataset is a .jsonl file in which
each line is a JSON dict representing one conversation. Each dict has a single
feature, 'conversations', which contains an optional instruction from the system, a
prompt from the user (human), and an output from gpt.

Example:

{
  "conversations": [
    {
      "from": "system",
      "value": "You are an AI assistant. You will be given a ..."
    },
    {
      "from": "human",
      "value": "Write an article based on this \"A man has b...",
      "weight": 0.0
    },
    {
      "from": "gpt",
      "value": "Title: Tragedy Strikes in Sydney: Victims Sta...",
      "weight": 1.0
    }
  ]
}

If loading through the datasets library using load_dataset(), the result is of type
DatasetDict() with a similar form (a 'train' split only, with the single feature 'conversations').
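
As a minimal sketch (assuming the Hub repo id Open-Orca/SlimOrca), loading and
inspecting the dataset looks like this:

from datasets import load_dataset

# Download (or read from cache) SlimOrca from the Hugging Face Hub
dataset = load_dataset('Open-Orca/SlimOrca')

print(dataset)                                # DatasetDict with a single 'train' split
print(dataset['train'][0]['conversations'])   # same structure as the .jsonl example above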

⇒ In order to use this dataset, we need to parse it into a usable form
(split into system-human-gpt).

Case 1: Downloaded data directly

**Not my idea**
In this case we read in the downloaded .jsonl file, get each text and its
corresponding role, build an item from them, and save it to a new dataset file.

import json
from tqdm import tqdm

DATA_PATH = 'path/to/SlimOrca/oo-labeled_correct.gpt4.sharegpt.jsonl'
SAVE_PATH = 'path/to/SlimOrca/data.jsonl'

def convert_item_to_training_data(item, save_path, to_append):
    conv = item['conversations']
    # Each conversation is either [system, human, gpt] or [human, gpt]
    assert [s['from'] for s in conv] in [
        ['system', 'human', 'gpt'], ['human', 'gpt']
    ], f'Invalid conversation format: {conv}'
    conv = {turn['from']: turn['value'] for turn in conv}

    sys_content, user_content = conv.get('system', ''), conv.get('human', '')
    target = conv.get('gpt', '')
    record = {
        "instruction": sys_content,
        "rounds": [{"prompt": user_content, "response": target}]
    }

    # Overwrite on the first item, append for every item after it
    with open(save_path, 'a' if to_append else 'w') as f:
        f.write(json.dumps(record) + '\n')

if __name__ == '__main__':
    progress_bar = tqdm(desc='Processing chunks')
    with open(DATA_PATH) as f:
        for line_id, line in enumerate(f):
            item = json.loads(line)
            convert_item_to_training_data(item, SAVE_PATH, line_id > 0)
            progress_bar.update(1)
            progress_bar.set_description(f'Processing chunks (items={line_id + 1})')
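
After conversion, each line of data.jsonl holds one training item of the following
shape (values shortened):

{"instruction": "You are an AI assistant. ...", "rounds": [{"prompt": "Write an article based on this ...", "response": "Title: Tragedy Strikes in Sydney: ..."}]}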

Case 2: Import data from load_dataset()

**This feels much easier to digest for me**

In this case, I chose to process the Dataset() by iterating through each conversation,
appending a dict form of it (role: text) to a list, and then converting the list to a
pd.DataFrame, which is easy to use.

import pandas as pd

def parse_dataset(dataset):
    dialogue = []
    for conv in dataset:
        # Flatten each conversation into a single {role: text} dict
        dialogue.append({chat['from']: chat['value'] for chat in conv['conversations']})
    return pd.DataFrame(dialogue)
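
A quick usage sketch (again assuming the Open-Orca/SlimOrca repo id):

from datasets import load_dataset

train = load_dataset('Open-Orca/SlimOrca', split='train')
df = parse_dataset(train)
print(df.columns)  # typically ['system', 'human', 'gpt']; 'system' is NaN where that turn is absent
print(df.head())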

