
MAKING A CHATGPT MODEL FOR USE BY FUTO AS AN INSTITUTION
ChatGPT, which stands for Chat Generative Pre-trained Transformer, is a chatbot based on a
large language model that enables its users to refine and steer a conversation towards a
desired length, format, style, level of detail and language.

ChatGPT is based on GPT models that were fine-tuned to target conversational usage. The
fine-tuning process leveraged supervised learning and reinforcement learning from human
feedback (RLHF). Both approaches employed human trainers to improve model performance. In the
case of supervised learning, the trainers played both sides: the user and the AI assistant. In
the reinforcement learning stage, human trainers first ranked responses that the model had
created in a previous conversation. These rankings were used to create "reward models" that
were then used to fine-tune the model further.
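
The reward model is usually trained with a pairwise ranking objective: for every pair of
responses to the same prompt, it should assign a higher score to the response the trainers
preferred. The snippet below is a minimal sketch of that objective in PyTorch; the function,
its tensor names and the toy scores are illustrative assumptions, not a description of
OpenAI's actual training code.

    import torch
    import torch.nn.functional as F

    def reward_ranking_loss(reward_chosen, reward_rejected):
        # Pairwise ranking loss for a reward model: push the score of the
        # preferred (chosen) response above the score of the rejected one.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Illustrative usage with made-up scores for a batch of four comparisons.
    chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
    rejected = torch.tensor([0.4, 0.5, -0.1, 1.1])
    print(reward_ranking_loss(chosen, rejected))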

Prompt engineering is the process of refining the prompts that a person inputs into a generative
artificial intelligence (AI) service to create text or images. It is a technique that AI engineers
use when refining large language models (LLMs) with specific or recommended prompts.
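
As a simple illustration, a vague request can be rewritten into a prompt that pins down the
role, audience, format and length. The example below uses the chat-style message format
popularised by ChatGPT; the wording and the FUTO-specific instructions are hypothetical.

    # A vague prompt: the model has to guess the audience, scope and format.
    vague_prompt = "Tell me about registration."

    # An engineered prompt: role, audience, format and length are all specified.
    engineered_prompt = [
        {"role": "system",
         "content": "You are a helpful assistant for students of FUTO. "
                    "Answer using only the university documents provided to you."},
        {"role": "user",
         "content": "Explain the course registration process for a first-year "
                    "student in three short numbered steps, and name the office "
                    "to contact if registration fails."},
    ]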

A large language model (LLM) is a language model notable for its ability to achieve general-
purpose language generation and understanding. LLMs acquire these abilities by learning
statistical relationships from text documents during a computationally intensive self-
supervised and semi-supervised training process.
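
In the self-supervised setting the training targets come from the text itself: the model learns
to predict each token from the tokens that precede it. The sketch below shows how such
input/target pairs can be formed; the toy sentence and the crude whitespace tokenization are
assumptions made purely for illustration.

    # Toy illustration of self-supervised next-token prediction targets.
    text = "FUTO students can ask the chatbot about course registration"
    tokens = text.split()  # crude whitespace tokenization, for illustration only

    # Each prefix of the sentence becomes an input, and the token that follows
    # it becomes the target the model is trained to predict.
    for i in range(1, len(tokens)):
        context = " ".join(tokens[:i])
        target = tokens[i]
        print(f"given '{context}' -> predict '{target}'")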

DATA PREPROCESSING FOR LLMs


1. Text Tokenization: Tokenization is the process of splitting text into individual units, typically
words or subwords. This step is crucial for the model to understand the structure of the text. In
languages like English, tokenization is relatively straightforward, as words are typically separated
by spaces (see the tokenization sketch after this list).

2. Handling Special Tokens: LLMs often use special tokens to indicate the beginning and
end of a text sequence, as well as to represent padding and out-of-vocabulary words.
These tokens help the model learn context and structure.
3. Subword Tokenization: For languages with complex morphology or limited resources,
subword tokenization can be advantageous. Subword tokenization breaks words into
smaller units, such as prefixes and suffixes, allowing the model to handle rare words and
morphological variations more effectively.
4. Removing Stopwords and Punctuation: Connector words, such as "for," "the," and "is,"
provide little semantic value and can be removed to reduce the dimensionality of the data.
Similarly, punctuation marks can often be omitted without affecting the overall meaning
of the text (a cleaning sketch follows this list).
5. Text Normalization: Text normalization involves converting text to a consistent format.
This can include converting text to lowercase, handling contractions, and converting
numbers to words. Normalization ensures that similar words are treated the same way by
the model.
6. Handling Spelling and Typographical Errors: Correcting spelling errors and
typographical errors is crucial for LLM performance. Misspelled words can confuse the
model and lead to inaccurate predictions. Spell-checking and correction mechanisms can
be applied to address this issue.
7. Dealing with Noisy Text: Real-world text data can often be noisy, containing errors,
abbreviations, and informal language. Preprocessing should aim to clean and standardize
the text while retaining its naturalness and authenticity.
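
The first three steps can be seen together in a subword tokenizer. The sketch below uses the
Hugging Face transformers library and the bert-base-uncased vocabulary purely as an example;
the library and model choice are assumptions, and any comparable tokenizer would do.

    from transformers import AutoTokenizer

    # Load a pretrained WordPiece (subword) tokenizer; the model name is only an example.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "Tokenization handles rare words like electromagnetism gracefully."

    # Steps 1 and 3: the text is split into tokens, and rare words are broken
    # into smaller '##'-prefixed pieces (the exact split depends on the vocabulary).
    print(tokenizer.tokenize(text))

    # Step 2: special tokens mark sequence boundaries, padding and unknown words.
    print(tokenizer.cls_token, tokenizer.sep_token,
          tokenizer.pad_token, tokenizer.unk_token)

    # Encoding adds the boundary tokens automatically and pads to a fixed length.
    encoded = tokenizer(text, padding="max_length", max_length=20)
    print(encoded["input_ids"])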
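
Steps 4 to 7 are largely string cleaning. The sketch below uses only the Python standard
library; the stopword list, the tiny reference vocabulary and the difflib-based spelling
correction are deliberately small illustrations rather than production choices.

    import difflib
    import string

    # Deliberately tiny word lists, for illustration only.
    STOPWORDS = {"for", "the", "is", "a", "an", "and", "of", "to"}
    VOCABULARY = ["registration", "lecture", "semester", "examination", "student"]

    def normalize(text):
        # Step 5: convert to a consistent format (lowercase, no punctuation).
        text = text.lower()
        return text.translate(str.maketrans("", "", string.punctuation))

    def remove_stopwords(words):
        # Step 4: drop connector words that carry little semantic value.
        return [w for w in words if w not in STOPWORDS]

    def correct_spelling(words):
        # Steps 6 and 7: snap misspelled or noisy words to the closest known word.
        corrected = []
        for w in words:
            match = difflib.get_close_matches(w, VOCABULARY, n=1, cutoff=0.8)
            corrected.append(match[0] if match else w)
        return corrected

    raw = "The registrtion for the semster is open!"
    words = normalize(raw).split()
    print(correct_spelling(remove_stopwords(words)))
    # -> ['registration', 'semester', 'open'] with the toy vocabulary above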

The quality of datasets and the precision of data preprocessing play a vital role in shaping the
capabilities and behavior of these models. A well-curated and diverse dataset, coupled with
proper preprocessing, paves the way for the development of LLMs that can generate coherent,
contextually accurate, and unbiased human-like text.

STEPS TO TAKE WHEN CREATING A GPT

1. Collection of data: This data includes the names of lecturers and their contact emails,
educational papers and any other material that a student will need.
2. Data preprocessing: This involves the steps described in the preprocessing section of
this paper.
3. Model importation: Many LLMs are free and open source; a suitable one can be chosen
from Kaggle or any other machine-learning community.
4. Training: The imported model is fine-tuned on the dataset that has already been
preprocessed (a minimal sketch follows this list).
5. Interfacing and releasing: The trained model is given a user-facing interface and
released for use.
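
As an illustration of steps 3 and 4, the sketch below fine-tunes an open-source GPT-2 model on
a local text file of preprocessed documents using the Hugging Face transformers and datasets
libraries. The model choice, the file name futo_corpus.txt and the training settings are
assumptions made only for this example.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    # Hypothetical file of preprocessed FUTO documents, one text example per line.
    dataset = load_dataset("text", data_files={"train": "futo_corpus.txt"})

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Causal (next-token) language modelling, so masked LM is turned off.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(output_dir="futo-gpt",
                             num_train_epochs=3,
                             per_device_train_batch_size=2)

    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized, data_collator=collator)
    trainer.train()
    trainer.save_model("futo-gpt")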

Although the above steps might seem simple, they are strenuous and mentally demanding.
