
Data Handling

The SlimOrca dataset format:

If you download it directly from Hugging Face, the dataset is a .jsonl file in which
each line is a JSON dict representing one conversation. Each dict has a single
feature, 'conversations', which contains an optional instruction from the system, a
prompt from the user (human), and an output from gpt.

Example:

{
  "conversations": [
    {
      "from": "system",
      "value": "You are an AI assistant. You will be given a ..."
    },
    {
      "from": "human",
      "value": "Write an article based on this \"A man has b...",
      "weight": 0.0
    },
    {
      "from": "gpt",
      "value": "Title: Tragedy Strikes in Sydney: Victims Sta...",
      "weight": 1.0
    }
  ]
}

If loading through the datasets library using load_dataset(), the result is of type
DatasetDict() with a similar form (a 'train' split only, with the single feature 'conversations').
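
As a minimal sketch (assuming the Hub repo id Open-Orca/SlimOrca), loading and
inspecting the dataset looks like this:

from datasets import load_dataset

# Download (or read from cache) SlimOrca from the Hugging Face Hub
dataset = load_dataset('Open-Orca/SlimOrca')

print(dataset)                                # DatasetDict with a single 'train' split
print(dataset['train'][0]['conversations'])   # same structure as the .jsonl example above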

⇒ In order to use this dataset, we need to parse it into a usable form
(split into system-human-gpt).

Case 1: Downloaded data directly

**Not my idea**
In this case we read in the downloaded .jsonl file, get each text and its
corresponding role, build an item from them, and save it to a new dataset file.

import json
from tqdm import tqdm

DATA_PATH = 'path/to/SlimOrca/oo-labeled_correct.gpt4.sharegpt.jsonl'
SAVE_PATH = 'path/to/SlimOrca/data.jsonl'

def convert_item_to_training_data(item, save_path, to_append):
    conv = item['conversations']
    # Each conversation is either [system, human, gpt] or [human, gpt]
    assert [s['from'] for s in conv] in [
        ['system', 'human', 'gpt'], ['human', 'gpt']
    ], f'Invalid conversation format: {conv}'
    conv = {turn['from']: turn['value'] for turn in conv}

    sys_content, user_content = conv.get('system', ''), conv.get('human', '')
    target = conv.get('gpt', '')
    record = {
        "instruction": sys_content,
        "rounds": [{"prompt": user_content, "response": target}]
    }

    # Overwrite on the first item, append for every item after it
    with open(save_path, 'a' if to_append else 'w') as f:
        f.write(json.dumps(record) + '\n')

if __name__ == '__main__':
    progress_bar = tqdm(desc='Processing chunks')
    with open(DATA_PATH) as f:
        for line_id, line in enumerate(f):
            item = json.loads(line)
            convert_item_to_training_data(item, SAVE_PATH, line_id > 0)
            progress_bar.update(1)
            progress_bar.set_description(f'Processing chunks (items={line_id + 1})')
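
After conversion, each line of data.jsonl holds one training item of the following
shape (values shortened):

{"instruction": "You are an AI assistant. ...", "rounds": [{"prompt": "Write an article based on this ...", "response": "Title: Tragedy Strikes in Sydney: ..."}]}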

Case 2: Import data from load_dataset()

**This feels much easier to digest for me**

In this case, I chose to process the Dataset() by iterating through each conversation,
appending a dict form of it (role: text) to a list, and then converting the list to a
pd.DataFrame, which is easy to use.

import pandas as pd

def parse_dataset(dataset):
    dialogue = []
    for conv in dataset:
        # Flatten each conversation into a single {role: text} dict
        dialogue.append({chat['from']: chat['value'] for chat in conv['conversations']})
    return pd.DataFrame(dialogue)
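
A quick usage sketch (again assuming the Open-Orca/SlimOrca repo id):

from datasets import load_dataset

train = load_dataset('Open-Orca/SlimOrca', split='train')
df = parse_dataset(train)
print(df.columns)  # typically ['system', 'human', 'gpt']; 'system' is NaN where that turn is absent
print(df.head())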

